Title: Correlation-based concept-oriented bisecting k-means clustering and topic detection for scientific literature and news tracks

Authors: J. Jayabharathy; S. Kanmani

Addresses: Department of Computer Science and Engineering, Pondicherry Engineering College, Puducherry, India; Department of Information Technology, Pondicherry Engineering College, Puducherry, India

Abstract: Extracting relevant documents from a large document corpus is a challenging task. Clustering groups together documents that share similar topics, and incorporating semantic features can improve the accuracy of document clustering methods. Topic detection deals with discovering meaningful and concise labels for the clusters. In this paper, we propose a clustering algorithm, the correlation-based concept-oriented bisecting k-means algorithm, which uses a semantic-based similarity measure. The algorithm builds on our existing modified semantic-based model, in which related terms are extracted as concepts for concept-based document clustering and topic discovery. The performance of the proposed work is compared with the existing term-based method and with our earlier concept-based algorithm. Additional experiments evaluate the proposed algorithm when considering terms only, synonyms and hyponyms, and correlated terms, using F-measure and purity as evaluation metrics. Experimental results demonstrate the performance enhancement of the proposed algorithm.
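The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly used in the document clustering literature (purity as the share of documents in the majority class of their cluster; F-measure as the class-size-weighted best per-class F score). The abstract does not state which variants the paper uses, and the function names and toy labels below are hypothetical.

```python
from collections import Counter


def purity(labels, clusters):
    """Share of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for label, cluster in zip(labels, clusters):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(labels)


def f_measure(labels, clusters):
    """Class-size-weighted maximum F score over clusters (assumed common variant)."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))  # documents of class i in cluster j
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


# Hypothetical toy data: six documents, two true topics, two clusters.
true_topics = ["a", "a", "a", "b", "b", "b"]
assigned = [0, 0, 1, 1, 1, 1]
print(purity(true_topics, assigned), f_measure(true_topics, assigned))
```

Both scores lie in [0, 1], with higher values indicating clusters that agree more closely with the reference topics.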

Keywords: document clustering; topic discovery; semantic similarity; testor theory; correlation; k-means clustering; scientific literature; news tracks; semantics; information retrieval; topic detection; concept-based clustering.

DOI: 10.1504/IJKEDM.2015.071285

International Journal of Knowledge Engineering and Data Mining, 2015 Vol.3 No.2, pp.170 - 189

Accepted: 17 Nov 2014
Published online: 19 Aug 2015
