Title: Unsupervised corpus distillation for represented indicator measurement on focus species detection

Authors: Chih-Hsuan Wei; Hung-Yu Kao

Addresses: Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan (ROC); Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan (ROC)

Abstract: In biomedical text mining, the species with which an entity is associated is the highest-dimension form of gene ambiguity, and focus species detection is one of the bottlenecks in gene normalisation. This study presents a method that is robust for all types of articles, particularly those without explicit species mentions. Since our method requires a training corpus, we developed an iterative distillation method to extend the corpus. The resulting unsupervised corpus is therefore helpful for the detection of focus species. In experiments, the proposed method achieved accuracies of 85.64% and 84.32% on datasets with and without species mentions, respectively.

Keywords: document classification; focus species identification; represented indicator measurement; gene ambiguity; biomedical text mining; iterative distillation; unsupervised corpus distillation; bioinformatics.

DOI: 10.1504/IJDMB.2013.056615

International Journal of Data Mining and Bioinformatics, 2013 Vol.8 No.4, pp.413 - 426

Received: 04 May 2011
Accepted: 04 May 2011

Published online: 20 Oct 2014
