Int. J. of Big Data Intelligence   »   2014 Vol.1, No.1/2



Title: A new approach for accurate distributed cluster analysis for Big Data: competitive K-Means


Authors: Rui Máximo Esteves; Thomas Hacker; Chunming Rong


Department of Electrical and Computer Engineering, University of Stavanger, Norway
Computer and Information Technology, Purdue University, West Lafayette, Indiana
Department of Electrical and Computer Engineering, University of Stavanger, Norway


Abstract: The tremendous growth in data volumes has created a need for new tools and algorithms to quickly analyse large datasets. Cluster analysis techniques, such as K-Means can be distributed across several machines. The accuracy of K-Means depends on the selection of seed centroids during initialisation. K-Means++ improves on the K-Means seeder, but suffers from problems when it is applied to large datasets. In this paper, we describe a new algorithm and a MapReduce implementation we developed that addresses these problems. We compared the performance with three existing algorithms and found that our algorithm improves cluster analysis accuracy and decreases variance. Our results show that our new algorithm produced a speedup of 76 ± 9 times compared with the serial K-Means++ and is as fast as the streaming K-Means. Our work provides a method to select a good initial seeding in less time, facilitating fast accurate cluster analysis over large datasets.


Keywords: k-means clustering; k-means++; streaming k-means; MapReduce; distributed cluster analysis; big data; initial seeding.


DOI: 10.1504/IJBDI.2014.063844


Int. J. of Big Data Intelligence, 2014 Vol.1, No.1/2, pp.50 - 64


Submission date: 15 Nov 2013
Date of acceptance: 03 Apr 2014
Available online: 23 Jul 2014



Editors Full text accessPurchase this articleComment on this article