Int. J. of Big Data Intelligence   »   2016 Vol.3, No.1






Title: Learning-based text classifiers using the Mahalanobis distance for correlated datasets


Authors: Noopur Srivastava; Shrisha Rao


Schneider Electric India Pvt. Ltd., Beary's Global Research Triangle, Bangalore 560037, India
International Institute of Information Technology – Bangalore, Bangalore 560100, India


Abstract: We present a novel approach to text categorisation that uses the Mahalanobis distance measure for classification. For correlated datasets, classification using the Euclidean distance is not very accurate, whereas the Mahalanobis distance exploits the correlation in the data for the purpose of classification. To achieve this on large datasets, an unsupervised dimensionality reduction technique, principal component analysis (PCA), is applied prior to classification with the k-nearest neighbours (kNN) classifier. Because kNN does not work well for high-dimensional data, and because computing correlations for huge and sparse data is inefficient, we use PCA to obtain a reduced dataset for the training phase. Experimental results on huge datasets show an improvement in classification accuracy and a significant reduction in error percentage with the proposed algorithm, in comparison with classifiers using the Euclidean distance.
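
The pipeline described in the abstract (PCA for dimensionality reduction, then kNN with the Mahalanobis distance) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the default k, the number of retained components, and the choice to estimate one covariance matrix from all reduced training data (inverted with a pseudo-inverse) are assumptions made for illustration. The distance used is the standard Mahalanobis form d(x, y) = sqrt((x - y)^T S^-1 (x - y)), where S is the covariance matrix; the square root is skipped below because it does not change the neighbour ranking.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the training mean and the top principal directions (hypothetical helper)."""
    mean = X.mean(axis=0)
    # SVD of the centred data; the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project X onto the retained principal directions."""
    return (X - mean) @ components.T


def mahalanobis_knn_predict(train_X, train_y, test_X, k=3):
    """Label each test row by majority vote among its k nearest training rows.

    Assumption: a single covariance matrix is estimated from all training
    rows; the pseudo-inverse guards against a near-singular estimate.
    """
    cov = np.atleast_2d(np.cov(train_X, rowvar=False))
    inv_cov = np.linalg.pinv(cov)
    predictions = []
    for x in test_X:
        diff = train_X - x
        # Squared Mahalanobis distance from x to every training row.
        d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        nearest = np.argsort(d2)[:k]
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)
```

A typical use would be to call pca_fit on the training term vectors only, apply pca_transform to both the training and test vectors with the same mean and components, and then pass the reduced arrays to mahalanobis_knn_predict. Note that PCA decorrelates the training data, so in this sketch the covariance in the reduced space is close to diagonal and the Mahalanobis distance amounts to a variance-scaled Euclidean distance; the abstract does not state how the paper estimates the covariance.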


Keywords: Mahalanobis distance; k-nearest neighbour; kNN; text classification; precision; recall; dimensionality reduction; principal component analysis; PCA; correlated datasets.


DOI: 10.1504/IJBDI.2016.073901


Int. J. of Big Data Intelligence, 2016 Vol.3, No.1, pp.18 - 27


Submission date: 30 Oct 2014
Date of acceptance: 26 May 2015
Available online: 29 Dec 2015


