Title: A novel entropy-based dynamic data placement strategy for data intensive applications in Hadoop clusters

Authors: K. Hemant Kumar Reddy; Vishal Pandey; Diptendu Sinha Roy

Addresses: Computer Science and Engineering, National Institute of Science and Technology, Berhampur, India; David Eccles School of Business, University of Utah, Utah, USA; National Institute of Technology, Bijni Complex, Laitumkhrah, Shillong 793003, Meghalaya, India

Abstract: In the last decade, efficient analysis of data-intensive applications has become an important research issue. Hadoop is the most widely used platform for data-intensive applications. However, the majority of data placement strategies attempt to place related data close to each other for faster access without considering newly generated datasets or different MapReduce jobs. This paper deals with improving MapReduce performance over multi-cluster datasets by means of a novel entropy-based data placement strategy (EDPS) in three phases. First, a k-means clustering strategy is employed to extract dependencies among different datasets and group them into data groups. Then these data groups are placed in different datacenters while considering heterogeneity. Finally, newly generated datasets are grouped with the most similar existing cluster based on their relative entropy. The experimental results show the efficacy of the proposed three-fold dynamic grouping and data placement policy, which significantly reduces execution time and improves Hadoop performance.
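The third phase described in the abstract, assigning a newly generated dataset to the existing group with the lowest relative entropy, can be illustrated with a minimal sketch. This is not the paper's EDPS implementation: the representation of each dataset as a vector of per-job access counts, the function names (`relative_entropy`, `normalize`, `closest_group`), and the sample numbers are all assumptions made for illustration.

```python
import math

def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in bits for discrete distributions.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps (an
    assumption of this sketch) to avoid division by zero.
    """
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def normalize(counts):
    """Turn raw access counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def closest_group(new_usage, group_usage):
    """Return the name of the group whose usage profile diverges least
    from the new dataset's profile (hypothetical helper)."""
    p = normalize(new_usage)
    return min(group_usage, key=lambda g: relative_entropy(p, normalize(group_usage[g])))

# Hypothetical per-job access counts for two existing data groups.
groups = {"group_a": [8, 1, 1], "group_b": [1, 1, 8]}

# A new dataset used mostly by the first job.
assigned = closest_group([7, 2, 1], groups)
print(assigned)
```

Relative entropy is asymmetric, so the sketch fixes the new dataset as the reference distribution; whether the paper uses this direction or a symmetrised variant is not stated in the abstract.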

Keywords: dynamic data placement strategy; Hadoop clusters; MapReduce; k-means clustering; entropy.

DOI: 10.1504/IJBDI.2019.097395

International Journal of Big Data Intelligence, 2019 Vol.6 No.1, pp.20 - 37

Received: 09 Jun 2017
Accepted: 17 Oct 2017

Published online: 21 Jan 2019
