Title: Efficient clustering techniques on Hadoop and Spark

Authors: Sami Al Ghamdi; Giuseppe Di Fatta

Addresses: Department of Computer Science, University of Reading, Reading, UK; Department of Computer Science, University of Reading, Reading, UK

Abstract: Clustering is an essential data mining technique that divides observations into groups such that each group contains similar observations. K-means is one of the most popular clustering algorithms and has been in use for over 50 years. With the exponential growth of data, it has become necessary to further improve the efficiency and scalability of K-means to cope with large-scale datasets, known as big data. This paper presents K-means optimisations based on the triangle inequality on two well-known distributed computing platforms: Hadoop and Spark. K-means variants that use the triangle inequality usually require caching extra information from the previous iteration, which is challenging to achieve on Hadoop. Hence, this work introduces two methods for passing information from one iteration to the next on Hadoop in order to accelerate K-means. The experimental results show that the efficiency of K-means on Hadoop and Spark can be significantly improved by using triangle inequality optimisations.
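
Illustrative sketch (not taken from the paper): the abstract does not specify which triangle-inequality variants are used, so the following is only a minimal single-machine example of the general idea. It relies on the standard bound that if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c), so the distance from x to c' need not be computed. The function names `dist` and `assign_with_pruning` are hypothetical, and the sketch does not cache any bounds across iterations, nor does it involve Hadoop or Spark.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid.

    A centroid c_j is skipped when d(c_best, c_j) >= 2 * d(x, c_best),
    because the triangle inequality then guarantees d(x, c_j) >= d(x, c_best).
    Returns (labels, skipped), where skipped counts avoided distance computations.
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once for this assignment step.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)] for i in range(k)]
    labels = []
    skipped = 0
    for x in points:
        best = 0
        best_d = dist(x, centroids[0])
        for j in range(1, k):
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1  # c_j cannot be closer than the current best
                continue
            d = dist(x, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
labels, skipped = assign_with_pruning(points, centroids)
print(labels, skipped)
```

The pruned assignment returns the same labels as an exhaustive nearest-centroid search; only the number of distance computations changes.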

Keywords: K-means; Hadoop; Spark; MapReduce; efficient clustering; triangle inequality K-means.

DOI: 10.1504/IJBDI.2019.100898

International Journal of Big Data Intelligence, 2019 Vol.6 No.3/4, pp.269 - 290

Received: 08 Mar 2018
Accepted: 15 Aug 2018

Published online: 19 Jul 2019
