Authors: Rui Máximo Esteves; Thomas Hacker; Chunming Rong
Addresses: Department of Electrical and Computer Engineering, University of Stavanger, Norway ' Computer and Information Technology, Purdue University, West Lafayette, Indiana ' Department of Electrical and Computer Engineering, University of Stavanger, Norway
Abstract: The tremendous growth in data volumes has created a need for new tools and algorithms to quickly analyse large datasets. Cluster analysis techniques, such as K-Means can be distributed across several machines. The accuracy of K-Means depends on the selection of seed centroids during initialisation. K-Means++ improves on the K-Means seeder, but suffers from problems when it is applied to large datasets. In this paper, we describe a new algorithm and a MapReduce implementation we developed that addresses these problems. We compared the performance with three existing algorithms and found that our algorithm improves cluster analysis accuracy and decreases variance. Our results show that our new algorithm produced a speedup of 76 ± 9 times compared with the serial K-Means++ and is as fast as the streaming K-Means. Our work provides a method to select a good initial seeding in less time, facilitating fast accurate cluster analysis over large datasets.
Keywords: k-means clustering; k-means++; streaming k-means; MapReduce; distributed cluster analysis; big data; initial seeding.
International Journal of Big Data Intelligence, 2014 Vol.1 No.1/2, pp.50 - 64
Available online: 23 Jul 2014 *Full-text access for editors Access for subscribers Purchase this article Comment on this article