Title: Correlation-based concept-oriented bisecting k-means clustering and topic detection for scientific literature and news tracks
Authors: J. Jayabharathy; S. Kanmani
Addresses: Department of Computer Science and Engineering, Pondicherry Engineering College, Puducherry, India; Department of Information Technology, Pondicherry Engineering College, Puducherry, India
Abstract: Extracting relevant documents from a large document corpus is a challenging task. Clustering groups together documents that share similar topics. Incorporating semantic features improves the accuracy of document clustering methods. Topic detection deals with discovering meaningful and concise labels for the clusters. In this paper, we propose a clustering algorithm, the correlation-based concept-oriented bisecting k-means algorithm, which uses a semantic-based similarity measure. The algorithm builds on our existing modified semantic-based model, in which related terms are extracted as concepts for concept-based document clustering and topic discovery. The performance of the proposed work is compared with the existing term-based method and with our earlier work on a concept-based algorithm. Additional experiments evaluate the proposed algorithm when considering terms only, synonyms and hyponyms, and correlated terms, using F-measure and purity as evaluation metrics. The experimental results show that the proposed algorithm improves performance.
Keywords: document clustering; topic discovery; semantic similarity; testor theory; correlation; k-means clustering; scientific literature; news tracks; semantics; information retrieval; topic detection; concept-based clustering.
International Journal of Knowledge Engineering and Data Mining, 2015 Vol.3 No.2, pp.170 - 189
Accepted: 17 Nov 2014
Published online: 19 Aug 2015
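
The abstract names F-measure and purity as its evaluation metrics but does not state their formulas. The sketch below shows the definitions commonly used for clustering evaluation, on the assumption that the paper follows them: purity as the fraction of documents in the majority class of their cluster, and F-measure as the class-size-weighted best F1 score over clusters. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, Counter())[true_label] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def clustering_f_measure(labels_true, labels_pred):
    """Class-size-weighted F-measure: for each class, take the best F1 over all clusters."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue  # no shared documents, so F1 is zero for this pair
            precision = n_ij / n_clu
            recall = n_ij / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Toy example (hypothetical data): six documents, three true topics, two clusters.
toy_true = ["a", "a", "a", "b", "b", "c"]
toy_pred = [0, 0, 1, 1, 1, 1]
toy_purity = purity(toy_true, toy_pred)
toy_f = clustering_f_measure(toy_true, toy_pred)
```

Both scores lie in [0, 1], and a clustering that reproduces the true classes exactly scores 1 on each. Purity alone rewards many small clusters, which is one reason the two metrics are usually reported together.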