Authors: José Benedito De Souza Brito; Aletéia Patrícia Favacho De Araújo
Addresses: Department of Computer Science, Universidade de Brasília (UnB), POB: 4.466, ZIP: 90.910-900, Brasília, DF, Brazil; Department of Computer Science, Universidade de Brasília (UnB), POB: 4.466, ZIP: 90.910-900, Brasília, DF, Brazil
Abstract: This paper describes the HCEm model, designed to estimate, for a given timeframe, the size of a cluster running Hadoop in cloud environments. HCEm consists of a light optimisation layer for MapReduce jobs and a model to estimate the size of a Hadoop cluster. Additionally, this paper presents a comparative study of HCEm using similar applications and workloads in two production Hadoop clusters, Amazon Elastic MapReduce and a private cloud in a large financial company, in order to evaluate the performance of the model in real, processing-intensive environments. The estimates generated by the HCEm model and the processing performed are representative and consistent, which can help researchers and engineers understand the workload characteristics of Hadoop clusters in their production environments. The performance differences observed between the two real environments confirmed that increased sharing of physical host computing resources reduces the accuracy of the model.
Keywords: distributed computing; computational efficiency; Hadoop benchmarks; big data; data analysis; resource allocation; performance model; MapReduce; Hadoop performance evaluation; job estimation; Hadoop clusters; cloud computing.
International Journal of Big Data Intelligence, 2017 Vol.4 No.1, pp.47 - 60
Received: 08 Apr 2015
Accepted: 22 Oct 2015
Published online: 26 Dec 2016