Title: Parallel topic model and its application on document clustering

Authors: Lidong Wang; Yuhuai Wang; Shihua Cao; Yun Zhang; Kang An

Addresses: Qianjiang College, Hangzhou Normal University, P.O. Box 456, No. 16 XueLin Street, Xiasha Advanced Education Park, Hangzhou, China; Qianjiang College, Hangzhou Normal University, P.O. Box 456, No. 16 XueLin Street, Xiasha Advanced Education Park, Hangzhou, China; Qianjiang College, Hangzhou Normal University, P.O. Box 456, No. 16 XueLin Street, Xiasha Advanced Education Park, Hangzhou, China; Zhejiang University of Media and Communications, Room 415, No. 1 Lab Building, No. 998 Xueyuan Street, Xiasha Advanced Education Park, Hangzhou, China; Qianjiang College, Hangzhou Normal University, P.O. Box 456, No. 16 XueLin Street, Xiasha Advanced Education Park, Hangzhou, China

Abstract: This paper presents PLDACOL, our parallel implementation of the LDACOL model, for clustering large-scale document collections effectively. Since a phrase carries more semantic information than the sum of its individual words, we use the LDACOL topic model for phrase discovery and Gibbs sampling for parameter inference. PLDACOL overcomes the high computational cost of parameter inference by running on a distributed computing framework based on Hadoop. We show that PLDACOL can be applied to clustering large-scale document collections of different sizes and that it yields significant improvements in both effectiveness and efficiency compared with related traditional algorithms.

Keywords: document clustering; topic model; parallel computing; Hadoop; LDACOL model.

DOI: 10.1504/IJICT.2017.087459

International Journal of Information and Communication Technology, 2017 Vol.11 No.4, pp.552 - 563

Received: 08 Nov 2014
Accepted: 09 Jun 2015

Published online: 16 Oct 2017
