Authors: S. Poomagal; P. Saranya; S. Karthik
Addresses: Department of Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore – 641004, Tamilnadu, India; SAP Labs India Private Limited, 138, EPIP Area, Whitefield Road, Bangalore – 560066, Karnataka, India; Skava Systems, An Infosys Company, Module 108, Tidel Park, Coimbatore – 641014, Tamilnadu, India
Abstract: In data mining, clustering is a method of grouping similar points together. This grouping can be done using partitioning or hierarchical clustering algorithms. K-means is a partitioning clustering algorithm that is simple and faster than other clustering algorithms. The major drawbacks of the K-means algorithm are the selection of the initial centroids and of the number of clusters (K). This paper aims to provide a solution for selecting the initial centroids: a new point is calculated at each iteration, and the point in the dataset that is closest to the calculated point is selected as the centroid. The performance of the proposed work is compared with existing methods using four datasets collected from the UCI repository. The results show that the proposed work increases accuracy by 88.74% for the Iris dataset, 28.18% for the Breast cancer dataset, 34.03% for the Seeds dataset and 18.18% for the PIMA I Diabetes dataset over the other methods.
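Illustrative sketch (not the authors' implementation): the abstract states only that a point is calculated at each iteration and that the dataset point closest to it becomes a centroid. It does not say how that point is calculated. The Python sketch below therefore assumes, purely for illustration, that the data are ordered by distance from the origin, split into K contiguous groups, and that the calculated point is the mean of each group. The function names `mean_point`, `nearest_data_point` and `select_initial_centroids` are hypothetical.

```python
import math


def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def nearest_data_point(candidates, target):
    """Return the candidate point with the smallest Euclidean distance to target."""
    return min(candidates, key=lambda p: math.dist(p, target))


def select_initial_centroids(data, k):
    """Pick up to k initial centroids that are actual points of the dataset.

    ASSUMPTION: the rule used to obtain the calculated point (mean of the
    i-th group after sorting by distance from the origin) is a placeholder;
    the abstract does not specify the rule used in the paper.
    """
    origin = (0.0,) * len(data[0])
    ordered = sorted(data, key=lambda p: math.dist(p, origin))
    size = math.ceil(len(ordered) / k)

    centroids = []
    for i in range(k):
        group = ordered[i * size:(i + 1) * size]
        if not group:
            break
        # Step 1: calculate a new point for this iteration (assumed rule).
        calculated = mean_point(group)
        # Step 2: snap it to the closest dataset point not already chosen.
        candidates = [p for p in data if p not in centroids]
        centroids.append(nearest_data_point(candidates, calculated))
    return centroids


if __name__ == "__main__":
    sample = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
              (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
    print(select_initial_centroids(sample, 2))
```

Because each centroid is snapped to an existing data point, the returned seeds can be passed directly as the starting centroids of a standard K-means run.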
Keywords: K-means clustering; centroid selection; distance; similarity; seed selection; initial centroids; data mining; clusters.
International Journal of Intelligent Systems Technologies and Applications, 2016, Vol. 15, No. 3, pp. 230-239
Received: 25 Jun 2015
Accepted: 12 Jan 2016
Published online: 02 Aug 2016