Title: Weighting schemes based on EM algorithm for LDA

Authors: Yaya Ju; Jianfeng Yan; Zhiqiang Liu; Lu Yang

Addresses: School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu, China; School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu, China; School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu, China; School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu, China

Abstract: Latent Dirichlet allocation (LDA) is a popular probabilistic topic modelling method that automatically finds latent topics in a corpus. LDA users often encounter two major problems. First, LDA treats every word equally, so common words tend to scatter across almost all topics, which harms topic interpretability and consistency and increases overlap between topics. Second, an appropriate way to distinguish among the low-dimensional topic features for better classification performance is lacking. To overcome these two shortcomings, we propose two novel weighting schemes based on expectation-maximisation (EM): a word-weighted scheme, which is realised by introducing a weight factor during the iterative process, and a topic-weighted scheme, which is realised by combining the Jensen-Shannon (JS) distance and the entropy of the generated low-dimensional topic features as a weight coefficient. Experimental results show that the word-weighted scheme finds better topics and thereby improves clustering performance, and that the topic-weighted scheme has a larger effect on text classification than traditional methods.
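
As an illustration of the two quantities named in the topic-weighted scheme, the following is a minimal Python sketch of the Jensen-Shannon distance and the Shannon entropy for discrete distributions. The function names and the example vectors are assumptions made for illustration and are not taken from the paper. The abstract does not state how the two quantities are combined into a weight coefficient, so the sketch stops at computing them.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); terms with p_i = 0 contribute 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their midpoint."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the divergence (clamped at 0)."""
    return math.sqrt(max(0.0, js_divergence(p, q)))


# Hypothetical topic-feature vectors over three topics (illustrative values only).
features_a = [0.7, 0.2, 0.1]
features_b = [0.1, 0.2, 0.7]

print("JS distance:", js_distance(features_a, features_b))
print("entropy of a:", entropy(features_a))
```

With natural logarithms, the JS divergence is bounded above by ln 2, and a distribution concentrated on fewer topics has lower entropy.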

Keywords: latent Dirichlet allocation; LDA; expectation-maximisation; word-weighted scheme; topic-weighted scheme.

DOI: 10.1504/IJHPSA.2018.094152

International Journal of High Performance Systems Architecture, 2018 Vol.8 No.1/2, pp.94 - 104

Received: 17 Nov 2017
Accepted: 16 Apr 2018

Published online: 01 Aug 2018
