Efficient clustering techniques on Hadoop and Spark
by Sami Al Ghamdi; Giuseppe Di Fatta
International Journal of Big Data Intelligence (IJBDI), Vol. 6, No. 3/4, 2019

Abstract: Clustering is an essential data mining technique that divides observations into groups such that each group contains similar observations. K-means is one of the most popular clustering algorithms and has been in use for over 50 years. With the current exponential growth of data, it has become necessary to further improve the efficiency and scalability of K-means to cope with large-scale datasets, known as big data. This paper presents K-means optimisations based on the triangle inequality on two well-known distributed computing platforms: Hadoop and Spark. K-means variants that use the triangle inequality usually require caching extra information from the previous iteration, which is challenging to achieve on Hadoop. Hence, this work introduces two methods for passing information from one iteration to the next on Hadoop to accelerate K-means. The experimental work shows that the efficiency of K-means on Hadoop and Spark can be significantly improved by using triangle inequality optimisations.
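
To illustrate the general idea behind triangle inequality optimisations, the following single-machine sketch shows how the assignment step of K-means can skip distance computations. It relies on the standard bound that if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c), so centroid c' cannot be closer to x than c. This is an assumption-laden illustration only: the function name `assign_with_pruning` and the sample data are hypothetical, and the sketch reproduces neither the specific variants evaluated in the paper nor the Hadoop and Spark methods for passing information between iterations.

```python
import math


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid.

    Uses the triangle inequality to skip centroids that provably
    cannot be closer than the current best candidate.
    Returns (labels, number_of_skipped_distance_computations).
    """
    k = len(centroids)
    # Pairwise centroid-to-centroid distances, computed once per iteration.
    cc = [[math.dist(centroids[i], centroids[j]) for j in range(k)]
          for i in range(k)]

    labels = []
    skipped = 0
    for x in points:
        best = 0
        best_d = math.dist(x, centroids[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then
            # d(x, c_j) >= d(x, c_best), so c_j can be skipped.
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = math.dist(x, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


# Hypothetical sample data for illustration.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.2, 9.9)]
centroids = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]
labels, skipped = assign_with_pruning(points, centroids)
print(labels, skipped)
```

The pruned assignment yields the same labels as a brute-force nearest-centroid search; only the number of point-to-centroid distance computations changes. In a distributed setting, any bounds carried over from a previous iteration would have to be persisted between jobs, which is the difficulty on Hadoop that the abstract refers to.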

Online publication date: Fri, 19-Jul-2019
