Title: Classification techniques with minimal labelling effort and application to medical reports

Authors: Fathi H. Saad, G. Duncan Bell, Beatriz De la Iglesia

Addresses: School of Computing Sciences, University of East Anglia, Norwich NR4 7TJ, UK; Endoscopy Unit, Norwich and Norfolk University Hospital, Colney Lane, Norwich NR4 7UY, UK; School of Computing Sciences, University of East Anglia, Norwich NR4 7TJ, UK

Abstract: There are several approaches to classifying text documents. Here, we use Partially Supervised Classification (PSC) and argue that it is an effective and efficient approach for real-world problems. PSC uses a two-step strategy to reduce the labelling effort, and a number of methods have been proposed for each step. We evaluate various methods using real-world medical documents. The results show that using Expectation-Maximisation (EM) to build the classifier yields better results than using a Support Vector Machine (SVM). We also show experimentally that careful selection of a subset of features to represent the documents can improve performance.

Keywords: document classification; positive-class based learning; partially supervised classification; labelled data; unlabelled data; medical text mining; feature reduction; data mining; bioinformatics; text documents; medical reports.

DOI: 10.1504/IJDMB.2008.022638

International Journal of Data Mining and Bioinformatics, 2008, Vol. 2, No. 3, pp. 268-287

Published online: 22 Jan 2009
