Title: Combining sequence and itemset mining to discover named entities in biomedical texts: a new type of pattern

Authors: Marc Plantevit, Thierry Charnois, Jiri Klema, Christophe Rigotti, Bruno Cremilleux

Addresses: Universite de Caen Basse Normandie, CNRS, UMR6072, GREYC F-14032, France; Universite de Caen Basse Normandie, CNRS, UMR6072, GREYC F-14032, France; Faculty of Electrical Engineering, Czech Technical University, Technicka 2, Prague 6, 166 27, Czech Republic; Universite de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621, France; Universite de Caen Basse Normandie, CNRS, UMR6072, GREYC F-14032, France

Abstract: Biomedical named entity recognition (NER) is a challenging problem. In this paper, we show that mining techniques such as sequential pattern mining and sequential rule mining can help tackle this problem, but that they have limitations. We demonstrate and analyse these limitations and introduce a new kind of pattern, the LSR (left-sequence-right) pattern, that offers an excellent trade-off between the high precision of sequential rules and the high recall of sequential patterns. We first formalise the LSR pattern mining problem, then show how LSR patterns enable us to successfully tackle biomedical NER problems. We report experiments carried out on real datasets that underline the relevance of our proposal.

Keywords: LSR patterns; left-sequence-right patterns; sequential patterns; biomedical NER; named entity recognition; constraint-based pattern mining; biomedical texts; sequential rule mining; gene names; protein names; text mining; information extraction.

DOI: 10.1504/IJDMMM.2009.026073

International Journal of Data Mining, Modelling and Management, 2009, Vol. 1, No. 2, pp. 119-148

Published online: 26 May 2009
