Authors: Cuong Duc Nguyen; Trong Hai Duong
Addresses: Faculty of Information Technology, Ton Duc Thang University, HoChiMinh City, Vietnam ' Institute of Science and Technology of Industry 4.0, Nguyen Tat Thanh University, HoChiMinh City, Vietnam
Abstract: K-means often converges to a local optimum. In improved versions of K-means, k-means++ is well-known for achieving a rather optimum solution with its cluster initialisation strategy and high computational efficiency. Incremental K-means is recognised for its converging to the empirically global optimum but having a high complexity due to its stepping of the number of clusters K. The paper introduces K-means** with a doubling strategy on K. Additional techniques, including only doubling big enough clusters, stepping K for the last few values and searching on other candidates for the last K, are used to help K-means** have a complexity of O(K logK), which is lower than the complexity of incremental K-means, and still converge to empirically global optimum. On a set of synthesis and real datasets, K-means** archive the minimum results in almost of test cases. K-means** is much faster than incremental K-means and comparable with the speed of k-means++.
Keywords: data clustering; K-means; k-means++; incremental K-means; IKM; data mining.
International Journal of Intelligent Information and Database Systems, 2018 Vol.11 No.1, pp.27 - 45
Received: 12 Oct 2016
Accepted: 25 May 2017
Published online: 26 Apr 2018 *