Title: Efficient text document clustering with new similarity measures

Authors: R. Lakshmi; S. Baskar

Addresses: Department of Computer Science and Engineering, K.L.N. College of Engineering, Sivagangai District, Tamilnadu, India; Department of Electrical and Electronics Engineering, Thiagarajar College of Engineering, Madurai, Tamilnadu, India

Abstract: This paper proposes two new similarity measures, the distance of term frequency-based similarity measure (DTFSM) and the presence of common terms-based similarity measure (PCTSM), which compute the similarity between two documents in order to improve the effectiveness of text document clustering. The proposed measures are evaluated on the Reuters-21578 and WebKB datasets, with documents clustered using the K-means and K-means++ algorithms. The results obtained with DTFSM and PCTSM are significantly better than those of other measures in terms of accuracy, entropy, recall and F-measure. The proposed measures not only improve the effectiveness of text document clustering but also reduce the complexity of similarity computation, as measured by the number of operations required during clustering.

Keywords: document clustering; similarity measures; accuracy; entropy; recall; F-measure.

DOI: 10.1504/IJBIDM.2021.111741

International Journal of Business Intelligence and Data Mining, 2021 Vol.18 No.1, pp.49-72

Received: 04 Jan 2018
Accepted: 21 Apr 2018

Published online: 14 Dec 2020
