Title: Using term similarity measures for classifying short document data

Authors: Hirohisa Seki; Shuhei Toriyama

Addresses: Department of Computer Science and Engineering, Graduate School of Engineering, Nagoya Institute of Technology, Nagoya, Japan; Department of Computer Science and Engineering, Graduate School of Engineering, Nagoya Institute of Technology, Nagoya, Japan

Abstract: Term expansion (a.k.a. document expansion), proposed by Carpineto et al., is a method used for text classification. When handling short text data such as social media posts and blogs, we can apply term expansion to enrich the sparse information they contain. While prior work on term expansion uses a formal concept analysis (FCA)-based similarity measure defined between terms (or words), this paper studies the effectiveness of two kinds of measures for term expansion: weighted similarity measures studied in FCA, and correlation measures, such as cosine and all-conf, often employed in data mining. We also present some properties of the relationship between these term similarity/correlation measures and the notion of relevancy in classification. We show empirically that the cosine correlation measure outperforms the prior methods on our two short-document datasets. We also compare our approach with a latent Dirichlet allocation (LDA)-based term expansion approach by Rogers et al.
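
As an illustration of the two correlation measures named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the standard data-mining definitions over document supports: cosine(a, b) = |D(a) ∩ D(b)| / sqrt(|D(a)| · |D(b)|) and all-conf(a, b) = |D(a) ∩ D(b)| / max(|D(a)|, |D(b)|), where D(t) is the set of documents containing term t. The toy corpus, the function names, and the threshold-based expansion rule are all hypothetical and chosen only for illustration.

```python
from math import sqrt

def term_supports(docs):
    """Map each term to the set of indices of documents that contain it."""
    index = {}
    for i, doc in enumerate(docs):
        for term in set(doc.split()):
            index.setdefault(term, set()).add(i)
    return index

def cosine(index, a, b):
    """Cosine correlation between two terms, computed on document supports."""
    da, db = index.get(a, set()), index.get(b, set())
    if not da or not db:
        return 0.0
    return len(da & db) / sqrt(len(da) * len(db))

def all_conf(index, a, b):
    """All-confidence between two terms, computed on document supports."""
    da, db = index.get(a, set()), index.get(b, set())
    if not da or not db:
        return 0.0
    return len(da & db) / max(len(da), len(db))

def expand(doc, index, measure, threshold):
    """Assumed expansion rule: add every corpus term whose score with
    at least one term of the document reaches the threshold."""
    terms = set(doc.split())
    added = {
        t for t in index
        if t not in terms
        and any(measure(index, t, s) >= threshold for s in terms)
    }
    return terms | added

# Hypothetical toy corpus of short documents.
docs = ["cheap flight deal", "cheap flight ticket", "flight delay", "movie ticket"]
index = term_supports(docs)
expanded = expand("cheap", index, cosine, 0.8)
```

With this toy corpus, "cheap" occurs in two documents and "flight" in three, both of which also contain "cheap", so a short document consisting only of "cheap" would be expanded with "flight" under a cosine threshold of 0.8. The threshold value is arbitrary and would need tuning on real data.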

Keywords: term expansion; similarity measure; correlation; formal concepts; latent Dirichlet allocation; LDA; short document data; classification.

DOI: 10.1504/IJCISTUDIES.2021.115430

International Journal of Computational Intelligence Studies, 2021 Vol.10 No.2/3, pp.181 - 197

Received: 08 Jun 2020
Accepted: 13 Jul 2020

Published online: 02 Jun 2021
