Title: BioTopic: a topic-driven biological literature mining system

Authors: Xi Wang; Peiyan Zhu; Tao Liu; Ke Xu

Addresses: State Key Lab of Software Development Environment, Beihang University, Beijing 100191, China (all authors)

Abstract: Biology and biomedicine are flourishing disciplines, with massive amounts of biological data produced in experiments and a huge number of research papers published in journals. In such a big data context, unsupervised data mining methods such as topic models are used to extract topics from large-scale document collections. In this paper, we present a biological literature mining system based on topic modelling (BioTopic). Experiments show that the perplexity reduction percentage of our pre-processing method is 5% larger than that of a traditional pre-processing method. Our search precision reaches 86%, which is better than that of a unigram language model. Our method employs linguistic information from shallow parsing to better pre-process biological literature for topic models. BioTopic with fine-grained pre-processing and topic modelling works better than traditional literature mining systems.

Keywords: biological literature; biological topics; topic modelling; topic mining; big data; data mining; shallow parsing; fine-grained pre-processing; bioinformatics.

DOI: 10.1504/IJDMB.2016.075822

International Journal of Data Mining and Bioinformatics, 2016 Vol.14 No.4, pp.373 - 386

Received: 13 Nov 2015
Accepted: 18 Nov 2015

Published online: 06 Apr 2016
