Title: Automatic generation of initial value k to apply k-means method for text documents clustering

Authors: Namita Gupta, P.C. Saxena, J.P. Gupta

Addresses: Maharaja Agrasen Institute of Technology, PSP Area, Plot No. 1, Sector – 22, Rohini, Delhi – 110086, India. ' School of Computer and System Sciences, Jawaharlal Nehru University, New Mehrauli Road, New Delhi – 110067, India. ' Jaypee Institute of Information Technology, A-10, Sector-62, Noida, Uttar Pradesh – 201307, India

Abstract: Retrieving relevant text documents on a topic from a large document collection is a challenging task. Different clustering algorithms are developed to retrieve relevant documents of interest. Hierarchical clustering shows quadratic time complexity of O(n²) for n text documents. K-means algorithm has a time complexity of O(n) but it is sensitive to the initial randomly selected cluster centres, giving local optimum solution. Global k-means employs the k-means algorithm as a local search procedure to produce global optimum solution but shows polynomial time complexity of O(nk) to produce k clusters. In this paper, we propose an approach of clustering text documents that overcomes the drawback of k-means and global k-means and gives global optimal solution with time complexity of O(lk) to obtain k clusters from initial set of l starting clusters. Experimental evaluation on Reuters newsfeeds (Reuters-21578) shows clustering results (entropy, purity, F-measure) obtained by proposed method comparable with k-means and global k-means.

Keywords: document clustering; global k-means; information retrieval; data mining; text documents.

DOI: 10.1504/IJDMMM.2011.038810

International Journal of Data Mining, Modelling and Management, 2011 Vol.3 No.1, pp.18 - 41

Published online: 26 Feb 2015 *

Full-text access for editors Full-text access for subscribers Purchase this article Comment on this article