Title: Parallel topic model and its application on document clustering
Authors: Lidong Wang; Yuhuai Wang; Shihua Cao; Yun Zhang; Kang An
Addresses: Qianjiang College, Hangzhou Normal University, P.O. Box 456, No. 16 XueLin Street, Xiasha Advanced Education Park, Hangzhou, China; Qianjiang College, Hangzhou Normal University, P.O. Box 456, No. 16 XueLin Street, Xiasha Advanced Education Park, Hangzhou, China; Qianjiang College, Hangzhou Normal University, P.O. Box 456, No. 16 XueLin Street, Xiasha Advanced Education Park, Hangzhou, China; Zhejiang University of Media and Communications, Room 415, No. 1 Lab Building, No. 998 Xueyuan Street, Xiasha Advanced Education Park, Hangzhou, China; Qianjiang College, Hangzhou Normal University, P.O. Box 456, No. 16 XueLin Street, Xiasha Advanced Education Park, Hangzhou, China
Abstract: This paper presents PLDACOL, our parallel implementation of the LDACOL model, for effectively clustering large-scale document collections. Since a phrase carries more semantic information than the sum of its individual words, we use the LDACOL topic model for phrase discovery and Gibbs sampling for parameter inference. PLDACOL overcomes the high computational cost of parameter inference by using a distributed computing framework based on Hadoop. We show that PLDACOL can be applied to clustering large-scale document collections of different sizes and yields significant improvements in both effectiveness and efficiency compared with related traditional algorithms.
Keywords: document clustering; topic model; parallel computing; Hadoop; LDACOL model.
DOI: 10.1504/IJICT.2017.087459
International Journal of Information and Communication Technology, 2017 Vol.11 No.4, pp.552 - 563
Received: 08 Nov 2014
Accepted: 09 Jun 2015
Published online: 16 Oct 2017