Title: Efficient clustering techniques on Hadoop and Spark
Authors: Sami Al Ghamdi; Giuseppe Di Fatta
Addresses: Department of Computer Science, University of Reading, Reading, UK; Department of Computer Science, University of Reading, Reading, UK
Abstract: Clustering is an essential data mining technique that divides observations into groups such that each group contains similar observations. K-means, one of the most popular clustering algorithms, has been in use for over 50 years. With the current exponential growth of data, it has become necessary to improve the efficiency and scalability of K-means even further to cope with large-scale datasets, known as big data. This paper presents K-means optimisations that use the triangle inequality on two well-known distributed computing platforms: Hadoop and Spark. K-means variants that use the triangle inequality usually require caching extra information from the previous iteration, which is challenging to achieve on Hadoop. Hence, this work introduces two methods for passing information from one iteration to the next on Hadoop to accelerate K-means. The experimental results show that the efficiency of K-means on Hadoop and Spark can be significantly improved by using triangle inequality optimisations.
Keywords: K-means; Hadoop; Spark; MapReduce; efficient clustering; triangle inequality K-means.
DOI: 10.1504/IJBDI.2019.100898
International Journal of Big Data Intelligence, 2019 Vol.6 No.3/4, pp.269 - 290
Received: 08 Mar 2018
Accepted: 15 Aug 2018
Published online: 19 Jul 2019