Title: Combining sequence and itemset mining to discover named entities in biomedical texts: a new type of pattern

 

Authors: Marc Plantevit, Thierry Charnois, Jiri Klema, Christophe Rigotti, Bruno Cremilleux

 

Addresses:
Universite de Caen Basse Normandie, CNRS, UMR6072, GREYC F-14032, France.
Universite de Caen Basse Normandie, CNRS, UMR6072, GREYC F-14032, France.
Faculty of Electrical Engineering, Czech Technical University, Technicka 2, Prague 6, 166 27, Czech Republic.
Universite de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621, France.
Universite de Caen Basse Normandie, CNRS, UMR6072, GREYC F-14032, France.

 

Abstract: Biomedical named entity recognition (NER) is a challenging problem. In this paper, we show that mining techniques, such as sequential pattern mining and sequential rule mining, can be useful to tackle this problem but present some limitations. We demonstrate and analyse these limitations and introduce a new kind of pattern called LSR pattern that offers an excellent trade-off between the high precision of sequential rules and the high recall of sequential patterns. We formalise the LSR pattern mining problem first. Then we show how LSR patterns enable us to successfully tackle biomedical NER problems. We report experiments carried out on real datasets that underline the relevance of our proposition.

 

Keywords: LSR patterns; left-sequence-right patterns; sequential patterns; biomedical NER; named entity recognition; constraint-based pattern mining; biomedical texts; sequential rule mining; gene names; protein names; text mining; information extraction.

 

DOI: 10.1504/IJDMMM.2009.026073

 

Int. J. of Data Mining, Modelling and Management, 2009 Vol.1, No.2, pp.119 - 148

 

Available online: 26 May 2009

 

 
