Title: An improved parallel K-means algorithm based on MapReduce

Authors: Dongbo Zhang; Yanfang Shou; Jianmin Xu

Addresses: School of Civil Engineering and Transportation, South China University of Technology, Guangzhou, China; Department of Computer Science, Guangdong University of Science and Technology, Dongguan, China | Guangzhou Institute of Modern Industrial Technology, South China University of Technology, Guangzhou, China | School of Civil Engineering and Transportation, South China University of Technology, Guangzhou, China

Abstract: The K-means algorithm is one of the most popular clustering algorithms. However, it is sensitive to the initial partition and to circular datasets. To address these problems, this paper introduces CK-means, a clustering algorithm that combines the K-means algorithm with the Canopy algorithm and uses the MapReduce programming model of the Hadoop platform. The experimental results show that the CK-means algorithm has strong advantages for processing large datasets. The theoretical analysis shows that the CK-means algorithm and the traditional algorithm are of the same order of magnitude. The experimental results on artificial data show that the improved algorithm is better than the traditional algorithm in terms of acceleration ratio, accuracy and expansion rate. An experiment on real data is performed to obtain appropriate parameters.
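
The abstract does not give the details of CK-means, so the following is only a minimal single-machine sketch of the general idea it names: a Canopy pass chooses the initial centres, and K-means iterations are then written as a map step (assign each point to its nearest centre) and a reduce step (average the points of each centre). The function names, the thresholds `t1` and `t2`, and the stopping rule are assumptions made for illustration; they are not taken from the paper, and no Hadoop API is used.

```python
import math
from collections import defaultdict


def canopy_pass(points, t1, t2):
    """Generic Canopy pass (assumed form): t1 is the loose threshold, t2 the tight one."""
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates[0]
        # Every point within t1 of the center belongs to this canopy.
        members = [p for p in points if math.dist(p, center) <= t1]
        canopies.append((center, members))
        # Points within t2 of the center can no longer start a new canopy.
        candidates = [p for p in candidates if math.dist(p, center) > t2]
    return canopies


def map_assign(point, centers):
    """Map step: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point


def reduce_mean(group):
    """Reduce step: the new center is the coordinate-wise mean of the group."""
    n = len(group)
    return tuple(sum(coords) / n for coords in zip(*group))


def ck_means_sketch(points, t1, t2, max_iter=20, tol=1e-9):
    """Hypothetical driver: Canopy centers seed K-means; k is the number of canopies."""
    centers = [center for center, _ in canopy_pass(points, t1, t2)]
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:
            key, value = map_assign(p, centers)
            groups[key].append(value)
        # Keep the old center if no point was assigned to it.
        new_centers = [
            reduce_mean(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # assumed stopping rule: centers have stopped moving
            break
    return centers
```

In this sketch the number of clusters is not supplied by the caller; it falls out of the Canopy pass, so the choice of `t1` and `t2` determines both how many centres there are and where they start.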

Keywords: cloud computing; MapReduce model; K-means clusters; Canopy algorithm; big data.

DOI: 10.1504/IJES.2017.084700

International Journal of Embedded Systems, 2017 Vol.9 No.3, pp.275 - 282

Received: 05 Dec 2015
Accepted: 22 Jun 2016

Published online: 21 Jun 2017
