
Title: Efficient text document clustering with new similarity measures

Authors: R. Lakshmi; S. Baskar

Addresses: Department of Computer Science and Engineering, K.L.N. College of Engineering, Sivagangai District, Tamilnadu, India; Department of Electrical and Electronics Engineering, Thiagarajar College of Engineering, Madurai, Tamilnadu, India

Abstract: This paper proposes two new similarity measures, the distance of term frequency-based similarity measure (DTFSM) and the presence of common terms-based similarity measure (PCTSM), which compute the similarity between two documents in order to improve the effectiveness of text document clustering. The proposed measures are evaluated on the Reuters-21578 and WebKB datasets, with documents clustered using the K-means and K-means++ algorithms. The results obtained with DTFSM and PCTSM are significantly better than those of other measures in terms of accuracy, entropy, recall and F-measure. The proposed measures not only improve the effectiveness of text document clustering but also reduce the cost of computing similarity, measured by the number of operations required during clustering.
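
To make the evaluation criteria in the abstract concrete, the following is a minimal sketch of how cluster entropy can be computed from ground-truth class labels and cluster assignments. It is not the authors' code: the function name `cluster_entropy_and_purity` is hypothetical, the formulas are the standard textbook definitions, and purity is used here only as an assumed stand-in for the accuracy score, whose exact definition the abstract does not give. The DTFSM and PCTSM formulas are likewise not stated in the abstract, so they are not reproduced.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy_and_purity(true_labels, cluster_ids):
    """Return (size-weighted entropy, purity) of a clustering.

    Assumption: standard definitions, not necessarily the paper's.
    Lower entropy and higher purity indicate better clusters.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("at least one document is required")

    # Count how many documents of each true class fall in each cluster.
    members = defaultdict(Counter)
    for label, cid in zip(true_labels, cluster_ids):
        members[cid][label] += 1

    entropy = 0.0
    majority_total = 0
    for counts in members.values():
        size = sum(counts.values())
        # Entropy (in bits) of the class distribution inside this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        entropy += (size / n) * h            # weight by cluster size
        majority_total += max(counts.values())  # documents in majority class

    return entropy, majority_total / n
```

A clustering that separates the classes perfectly yields an entropy of 0 and a purity of 1, while a cluster holding two classes in equal proportion contributes one bit of entropy.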

Keywords: document clustering; similarity measures; accuracy; entropy; recall; F-measure.

DOI: 10.1504/IJBIDM.2021.111741

International Journal of Business Intelligence and Data Mining, 2021 Vol.18 No.1, pp.49 - 72

Received: 04 Jan 2018
Accepted: 21 Apr 2018

Published online: 06 Nov 2020
