Title: Word Sense Disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering

Authors: Bill Andreopoulos, Dimitra Alexopoulou, Michael Schroeder

Addresses: Biotechnological Centre, Technische Universität Dresden, Germany; Biotechnological Centre, Technische Universität Dresden, Germany; Biotechnological Centre, Technische Universität Dresden, Germany

Abstract: With more and more genomes being sequenced, a lot of effort is devoted to their annotation with terms from controlled vocabularies such as the GeneOntology. Manual annotation based on relevant literature is tedious, but automation of this process is difficult. One particularly challenging problem is word sense disambiguation. Terms such as 'development' can refer to developmental biology or to the word's more general sense. Here, we present two approaches to address this problem by using term co-occurrences and document clustering. To evaluate our method, we defined a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an F-measure of 77%. Additionally, applying document clustering improves precision to 82%. We applied the same approach to disambiguate 'nucleus', 'transport', and 'spindle', and we achieved consistent results. Thus, our method is a viable approach towards the automation of literature-based genome annotation.

Keywords: WSD; word sense disambiguation; biomedical ontologies; GeneOntology; text mining; genome annotation; Bayes; term co-occurrence; document clustering; GoPubMed; data mining; bioinformatics; genomes; development; developmental biology.

DOI: 10.1504/IJDMB.2008.020522

International Journal of Data Mining and Bioinformatics, 2008 Vol.2 No.3, pp.193 - 215

Published online: 29 Sep 2008
