Title: Learning-based text classifiers using the Mahalanobis distance for correlated datasets
Authors: Noopur Srivastava; Shrisha Rao
Schneider Electric India Pvt. Ltd., Beary's Global Research Triangle, Bangalore 560037, India
International Institute of Information Technology – Bangalore, Bangalore 560100, India
Abstract: We present a novel approach to text categorisation that uses the Mahalanobis distance measure for classification. For correlated datasets, classification using the Euclidean distance is not very accurate. The Mahalanobis distance exploits the correlation in the data for the purpose of classification. To achieve this on large datasets, an unsupervised dimensionality reduction technique, principal component analysis (PCA), is applied prior to classification with the k-nearest neighbours (kNN) classifier. Because kNN does not work well for high-dimensional data, and computing correlations for huge and sparse data is inefficient, we use PCA to obtain a reduced dataset for the training phase. Experimental results on huge datasets show an improvement in classification accuracy and a significant reduction in error percentage with the proposed algorithm, in comparison with classifiers using the Euclidean distance.
Keywords: Mahalanobis distance; k-nearest neighbour; kNN; text classification; precision; recall; dimensionality reduction; principal component analysis; PCA; correlated datasets.
Int. J. of Big Data Intelligence, 2016, Vol. 3, No. 1, pp. 18-27
Submission date: 30 Oct 2014
Date of acceptance: 26 May 2015
Available online: 29 Dec 2015
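
The pipeline summarised in the abstract (PCA reduction followed by kNN under the Mahalanobis distance) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `pca_fit` and `mahalanobis_knn_predict`, the toy 3-dimensional data, the choice of a single covariance matrix pooled over all training points, and the values of `k` and `n_components` are all assumptions made for illustration. In a text-classification setting the rows would be document feature vectors (for example term weights) rather than hand-written points.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal axes (as columns)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

def mahalanobis_knn_predict(X_train, y_train, X_test, k=3, n_components=2):
    """kNN with squared Mahalanobis distance, computed in PCA-reduced space."""
    mean, axes = pca_fit(X_train, n_components)
    Z_train = (X_train - mean) @ axes
    Z_test = (X_test - mean) @ axes
    # Assumption: one covariance matrix pooled over all training points.
    cov = np.atleast_2d(np.cov(Z_train, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular matrix
    preds = []
    for z in Z_test:
        diff = Z_train - z
        # Squared Mahalanobis distance from z to every training point.
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        nearest = np.argsort(d2)[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])  # majority vote
    return np.array(preds)

# Hypothetical toy data: two small clusters in three dimensions.
X_train = np.array([
    [0, 0, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1],
    [10, 10, 10], [11, 11, 10], [10, 11, 11], [11, 10, 11],
], dtype=float)
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_test = np.array([[0.5, 0.5, 0.5], [10.5, 10.5, 10.5]])

preds = mahalanobis_knn_predict(X_train, y_train, X_test, k=3, n_components=2)
print(preds)
```

One consequence of this particular design choice: PCA scores computed from the training data are mutually uncorrelated, so a pooled covariance in the reduced space is diagonal and the distance amounts to a variance-scaled Euclidean distance. Estimating the covariance per class, or before the projection, would behave differently; the abstract does not say which variant the paper uses.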