Title: Data analysis on big data: improving the map and shuffle phases in Hadoop Map Reduce

Authors: J.V.N. Lakshmi

Addresses: SCSVMV University, Sri Chandrasekharendra Saraswathi Viswa Mahavidyalaya, Sri Jayendra Saraswathi Street, Enathur, Kanchipuram, Tamil Nadu, 631 561, India

Abstract: Data management has become a challenging issue for network-centric applications that need to process large volumes of data. Such systems require advanced tools to analyse these datasets. Map Reduce, an efficient parallel programming model, and Hadoop are widely used for large-scale data analysis. However, Map Reduce still suffers from performance problems. Map Reduce uses a shuffle phase, for which this work considers an individual shuffle service component with an efficient I/O policy. The map phase also requires an improvement in its performance, as its output acts as the input to the next phase. Because the result of the map phase determines overall efficiency, the map phase needs intermediate checkpoints that regularly monitor all the splits generated by the intermediate phases. This acts as a barrier to effective resource utilisation. This paper implements shuffle as a service component to decrease the overall execution time of jobs, monitors the map phase through skew handling, and increases resource utilisation in a cluster.

Keywords: Map Reduce; Hadoop; shuffle; big data; data analytics; Hadoop distributed file system; HDFS; rack awareness; stragglers; light weight processing; OLAP; OLTP.

DOI: 10.1504/IJDATS.2018.094130

International Journal of Data Analysis Techniques and Strategies, 2018 Vol.10 No.3, pp.305 - 316

Accepted: 18 Dec 2016
Published online: 02 Aug 2018
