Title: CC-K-means: a candidate centres-based K-means algorithm for text data

Authors: Xuan Li; Yongquan Liang; Yuhao Cai

Addresses: College of Information Science and Technology, Shandong University of Science and Technology, Qingdao, 266590, China (all three authors)

Abstract: The K-means algorithm is widely applied to clustering problems on many kinds of data thanks to its simplicity and efficiency. However, the traditional K-means algorithm selects its initial centre points at random, which can cause slow convergence and unstable clustering results. To reduce the impact of high dimensionality in text clustering, the latent semantic index (LSI) model is first adopted to reduce the dimensions of the feature vectors, and a weighted adjusted cosine similarity is then used to measure the similarity between documents to obtain better clustering results. In the process of finding clustering centres, the high-density candidate centre points are partly updated on the basis of density to obtain the final clustering centres. Experimental results show that the proposed algorithm can accurately find representative and dispersed clustering centres, and that it achieves better clustering performance.
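
As a rough illustration of the density-based idea described in the abstract, the following minimal sketch picks initial centres from high-density candidates. It is not the authors' algorithm: it assumes the document vectors have already been reduced (for example by LSI), it uses plain cosine similarity rather than the paper's weighted adjusted cosine similarity, and the names `cosine_similarity`, `pick_initial_centres` and the parameter `sim_threshold` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity (assumption: the paper uses a weighted, adjusted variant)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_initial_centres(vectors, k, sim_threshold=0.8):
    """Return indices of up to k initial centres chosen from dense, mutually dissimilar points.

    sim_threshold is a hypothetical parameter: two documents count as
    neighbours when their similarity is at least this value.
    """
    n = len(vectors)
    sims = [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

    # Density of a point = number of other points that are its neighbours.
    density = [sum(1 for j in range(n) if j != i and sims[i][j] >= sim_threshold)
               for i in range(n)]

    # Visit candidates from densest to sparsest (ties keep input order).
    order = sorted(range(n), key=lambda i: -density[i])

    centres = []
    for i in order:
        # Accept a candidate only if it is not a neighbour of a chosen centre,
        # so the selected centres stay spread out.
        if all(sims[i][c] < sim_threshold for c in centres):
            centres.append(i)
        if len(centres) == k:
            break
    return centres


docs = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
initial = pick_initial_centres(docs, k=2)
```

The returned indices could then seed an ordinary K-means run in place of random initialisation. If fewer than `k` mutually dissimilar candidates exist, this sketch simply returns fewer centres.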

Keywords: text clustering; LSI model; latent semantic index; K-means clustering; initial clustering centres; candidate centres; text data.

DOI: 10.1504/IJCI.2016.077147

International Journal of Collaborative Intelligence, 2016 Vol.1 No.3, pp.189 - 204

Received: 31 Dec 2015
Accepted: 16 Feb 2016

Published online: 21 Jun 2016