Dimensionality reduction in text classification using scatter method
Online publication date: Wed, 02-Jul-2014
by Jyri Saarikoski; Jorma Laurikkala; Kalervo Järvelin; Markku Siermala; Martti Juhola
International Journal of Data Mining, Modelling and Management (IJDMMM), Vol. 6, No. 1, 2014
Abstract: Preprocessing of data is a vital part of any task involving machine learning. In the classification of text documents, the most important aspect of preprocessing is usually the dimensionality reduction of data vectors. This paper focuses on the use of a recent scatter method in the dimensionality reduction of text documents. The effectiveness of the method was tested with the classification of two datasets, the Reuters news collection and the Spanish CLEF 2003 news collection. The classification methods used were self-organising maps, the Naïve Bayes method, k-nearest-neighbour searching and classification trees. For comparison, we also conducted the dimensionality reduction of the data with document frequency and mutual information approaches. The scatter method proved to be an effective dimensionality reduction method for text document data. The suggested approach outperformed document frequency reduction and scored comparably to the mutual information method, except when only a very small set of features was selected, in which case mutual information was better, especially in the CLEF collection.
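The abstract does not describe the scatter method itself, so the following is only a minimal sketch of the two baseline approaches it names: feature selection by document frequency and by mutual information. The toy corpus, the function names (`document_frequency`, `mutual_information`, `select_top_k`) and the binary term-presence formulation of mutual information are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def mutual_information(docs, labels, term):
    """Mutual information (bits) between presence of `term` and the class label.

    Assumption: term occurrence is treated as a binary event per document.
    """
    n = len(docs)
    joint = Counter((term in tokens, label) for tokens, label in zip(docs, labels))
    present = Counter()
    classes = Counter()
    for (is_present, label), count in joint.items():
        present[is_present] += count
        classes[label] += count
    mi = 0.0
    for (is_present, label), count in joint.items():
        p_joint = count / n
        p_indep = (present[is_present] / n) * (classes[label] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical toy corpus: tokenised documents with class labels.
docs = [
    ["oil", "price", "rises"],
    ["oil", "exports", "fall"],
    ["wheat", "harvest", "rises"],
    ["wheat", "price", "falls"],
]
labels = ["crude", "crude", "grain", "grain"]

df_scores = document_frequency(docs)
mi_scores = {t: mutual_information(docs, labels, t) for t in sorted(df_scores)}

top_df = select_top_k(df_scores, 2)  # most widespread terms
top_mi = select_top_k(mi_scores, 2)  # terms most informative about the class
```

On this toy data the two criteria disagree: "price" appears in half of the documents but is spread evenly across both classes, so its mutual information with the label is zero, whereas "oil" and "wheat" each identify one class exactly.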