Title: Dimensionality reduction in text classification using scatter method

Authors: Jyri Saarikoski; Jorma Laurikkala; Kalervo Järvelin; Markku Siermala; Martti Juhola

Addresses: School of Information Sciences, 33014 University of Tampere, Finland (all authors)

Abstract: Preprocessing of data is a vital part of any task involving machine learning. In the classification of text documents, the most important aspect of preprocessing is usually the dimensionality reduction of data vectors. This paper focuses on the use of a recent scatter method in the dimensionality reduction of text documents. The effectiveness of the method was tested with the classification of two datasets, the Reuters news collection and the Spanish CLEF 2003 news collection. The classification methods used were self-organising maps, the naïve Bayes method, k nearest neighbour searching and classification trees. For comparison, we also reduced the dimensionality of the data with the document frequency and mutual information approaches. The scatter method proved to be an effective dimensionality reduction method for text document data. It outperformed document frequency reduction and scored comparably with the mutual information method, except when only a very small set of features was selected, in which case mutual information was better, especially in the CLEF collection.
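
The two baseline approaches named in the abstract, document frequency and mutual information, can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce the scatter method. It assumes tokenised documents, binary term presence, and mutual information measured in bits between term presence and the class label; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def mutual_information(docs, labels):
    """Mutual information (bits) between term presence and class label."""
    n = len(docs)
    class_counts = Counter(labels)
    present = Counter()  # (term, label) -> documents of that label containing term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            present[(term, label)] += 1

    scores = {}
    for term, n_t in document_frequency(docs).items():
        mi = 0.0
        for label, n_c in class_counts.items():
            n_tc = present[(term, label)]
            # Two cells per class: term present, then term absent.
            for joint, marginal in ((n_tc, n_t), (n_c - n_tc, n - n_t)):
                if joint > 0:  # empty cells contribute nothing
                    mi += (joint / n) * math.log2(joint * n / (marginal * n_c))
        scores[term] = mi
    return scores


def top_k(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: two "econ" and two "sport" documents.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export"],
    ["match", "goal"],
    ["goal", "price"],
]
labels = ["econ", "econ", "sport", "sport"]

df_scores = document_frequency(docs)
mi_scores = mutual_information(docs, labels)
df_selected = top_k(df_scores, 3)
mi_selected = top_k(mi_scores, 2)
```

In this toy corpus, "price" is among the most frequent terms but occurs equally in both classes, so document frequency keeps it while mutual information scores it at zero. This shows one way a class-agnostic criterion and a class-aware criterion can disagree on the same data.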

Keywords: text documents; dimensionality reduction; classification; mutual information; self-organising maps; SOMs; naïve Bayes; k nearest neighbour; kNN; classification tree; scatter method; machine learning.

DOI: 10.1504/IJDMMM.2014.059978

International Journal of Data Mining, Modelling and Management, 2014, Vol. 6, No. 1, pp. 1-21

Available online: 23 Mar 2014
