Title: Using term similarity measures for classifying short document data
Authors: Hirohisa Seki; Shuhei Toriyama
Addresses: Department of Computer Science and Engineering, Graduate School of Engineering, Nagoya Institute of Technology, Nagoya, Japan; Department of Computer Science and Engineering, Graduate School of Engineering, Nagoya Institute of Technology, Nagoya, Japan
Abstract: Term expansion (a.k.a. document expansion), proposed by Carpineto et al., is a method used for text classification. When handling short text data such as social media posts and blogs, we can apply term expansion to enrich the sparse information they contain. While prior work on term expansion uses a formal concept analysis (FCA)-based similarity measure defined between terms (or words), this paper studies the effectiveness of two kinds of measures for term expansion: weighted similarity measures studied in FCA, and correlation measures, such as cosine and all-conf, often employed in data mining. We also present some properties of the relationship between these term similarity/correlation measures and the notion of relevancy in classification. We show empirically that the cosine correlation measure outperforms the prior methods on our two short-document datasets. We also compare our approach with a latent Dirichlet allocation (LDA)-based term expansion approach by Rogers et al.
Keywords: term expansion; similarity measure; correlation; formal concepts; latent Dirichlet allocation; LDA; short document data; classification.
DOI: 10.1504/IJCISTUDIES.2021.115430
International Journal of Computational Intelligence Studies, 2021 Vol.10 No.2/3, pp.181 - 197
Received: 08 Jun 2020
Accepted: 13 Jul 2020
Published online: 02 Jun 2021
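
Illustrative sketch (not taken from the paper): the abstract names cosine and all-conf as correlation measures used for term expansion. The Python below assumes the standard data-mining definitions over document frequencies, namely cosine(a, b) = sup(a, b) / sqrt(sup(a) * sup(b)) and all-confidence(a, b) = sup(a, b) / max(sup(a), sup(b)), where sup counts the documents containing the given terms. The function names, the toy corpus, the threshold, and the expansion rule are all hypothetical and may differ from the authors' formulation.

```python
from math import sqrt

def support(docs, *terms):
    """Count the documents that contain every term in `terms`."""
    return sum(1 for d in docs if all(t in d for t in terms))

def cosine(docs, a, b):
    """Cosine correlation between two terms, based on document counts."""
    sa, sb = support(docs, a), support(docs, b)
    if sa == 0 or sb == 0:
        return 0.0
    return support(docs, a, b) / sqrt(sa * sb)

def all_conf(docs, a, b):
    """All-confidence between two terms, based on document counts."""
    m = max(support(docs, a), support(docs, b))
    return support(docs, a, b) / m if m else 0.0

def expand(doc, docs, measure, threshold):
    """Hypothetical expansion rule: add every corpus term whose score
    against at least one term already in `doc` reaches `threshold`."""
    vocab = set().union(*docs)
    added = {
        t for t in vocab - doc
        if any(measure(docs, t, s) >= threshold for s in doc)
    }
    return doc | added

# Toy corpus: each short document is a set of terms (made-up data).
docs = [
    {"rain", "storm", "wind"},
    {"rain", "storm"},
    {"storm", "wind"},
    {"goal", "match"},
]

expanded = expand({"rain"}, docs, cosine, threshold=0.6)
```

With this toy corpus, "storm" co-occurs with "rain" often enough to pass the assumed 0.6 cosine threshold, while "wind" does not, so the expanded document gains only "storm". A classifier would then be trained on the expanded term sets instead of the original sparse ones.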