Title: On the influence of training data quality on text document classification using machine learning methods

Authors: Jyri Saarikoski; Henry Joutsijoki; Kalervo Järvelin; Jorma Laurikkala; Martti Juhola

Addresses: School of Information Sciences, University of Tampere, Kanslerinrinne 1, FI-33014, Finland (all authors)

Abstract: The main aim of this paper was to study the influence of training data quality on the text document classification performance of machine learning methods. A graded relevance corpus of ten classes and 957 text documents was classified with Self-Organising Maps (SOMs), learning vector quantisation, k-nearest neighbours searching, naïve Bayes and support vector machines. The relevance level of a document (irrelevant, marginally relevant, fairly relevant or highly relevant) was used as a measure of the quality of the document as a training example, which is a new approach. The classifiers were evaluated with micro- and macro-averaged classification accuracies. The results suggest that training data of higher quality should be preferred, but even low-quality data can improve a classifier if there is plenty of it. In addition, further means to facilitate classification by the SOMs were explored. The novel set-of-SOMs approach performed clearly better than the original SOM and comparably to the supervised classification methods.

Keywords: data mining; document collections; graded relevance; relevance assessment; training data quality; text classification; machine learning; SOM; self-organising maps; text documents; learning vector quantisation; LVQ; k-nearest neighbour; kNN; naive Bayes; SVM; support vector machines.

DOI: 10.1504/IJKEDM.2015.071284

International Journal of Knowledge Engineering and Data Mining, 2015 Vol.3 No.2, pp.143 - 169

Received: 03 Mar 2014
Accepted: 18 Nov 2014

Published online: 19 Aug 2015
