Title: Learning-based text classifiers using the Mahalanobis distance for correlated datasets

Authors: Noopur Srivastava; Shrisha Rao

Addresses: Schneider Electric India Pvt. Ltd., Beary's Global Research Triangle, Bangalore 560037, India; International Institute of Information Technology – Bangalore, Bangalore 560100, India

Abstract: We present a novel approach to text categorisation that uses the Mahalanobis distance measure for classification. For correlated datasets, classification using the Euclidean distance is not very accurate, whereas the Mahalanobis distance exploits the correlation in the data. To achieve this on large datasets, an unsupervised dimensionality reduction technique, principal component analysis (PCA), is applied prior to classification with the k-nearest neighbours (kNN) classifier. Because kNN does not work well for high-dimensional data, and computing correlations for huge, sparse data is inefficient, we use PCA to obtain a reduced dataset for the training phase. Experimental results on huge datasets show that the proposed algorithm improves classification accuracy and significantly reduces the error percentage in comparison with classifiers using the Euclidean distance.
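
The pipeline outlined in the abstract (PCA reduction followed by kNN under the Mahalanobis distance) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names pca_fit, pca_transform and knn_predict are hypothetical, the covariance matrix is assumed to be pooled over all training samples (the abstract does not say how it is estimated), and random correlated vectors stand in for real document features.

```python
import numpy as np


def pca_fit(X, n_components):
    """Learn a PCA projection from the rows of X (samples x features)."""
    mean = X.mean(axis=0)
    # SVD of the centred data: rows of Vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def pca_transform(X, mean, components):
    """Project samples onto the retained principal directions."""
    return (X - mean) @ components.T


def knn_predict(X_train, y_train, X_test, k=3, metric="mahalanobis"):
    """Majority-vote kNN under the Mahalanobis or Euclidean distance."""
    if metric == "mahalanobis":
        # Assumption: one covariance matrix pooled over all training samples.
        cov = np.atleast_2d(np.cov(X_train, rowvar=False))
        inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular cov
    elif metric == "euclidean":
        inv_cov = np.eye(X_train.shape[1])
    else:
        raise ValueError(f"unknown metric: {metric}")

    predictions = []
    for x in X_test:
        diff = X_train - x
        # Squared distance diff^T @ inv_cov @ diff for every training row.
        d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        nearest = np.argsort(d2)[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Toy stand-in for document vectors: 6 correlated features, 2 classes.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(6, 6))
X = np.vstack([
    rng.normal(size=(40, 6)) @ mixing,
    rng.normal(size=(40, 6)) @ mixing + 3.0,
])
y = np.array([0] * 40 + [1] * 40)
train = np.arange(len(y)) % 2 == 0  # even rows train, odd rows test

# Fit PCA on the training rows only, then project both splits.
mean, components = pca_fit(X[train], n_components=3)
Z_train = pca_transform(X[train], mean, components)
Z_test = pca_transform(X[~train], mean, components)

for metric in ("mahalanobis", "euclidean"):
    pred = knn_predict(Z_train, y[train], Z_test, k=3, metric=metric)
    print(metric, "accuracy:", (pred == y[~train]).mean())
```

One consequence of this particular design is worth noting: PCA decorrelates the training data, so with a pooled covariance the Mahalanobis distance in the reduced space amounts to a Euclidean distance in which each principal component is rescaled by its variance. A per-class covariance estimate would behave differently; which variant the paper uses is not stated here.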

Keywords: Mahalanobis distance; k-nearest neighbour; kNN; text classification; precision; recall; dimensionality reduction; principal component analysis; PCA; correlated datasets.

DOI: 10.1504/IJBDI.2016.073901

International Journal of Big Data Intelligence, 2016 Vol.3 No.1, pp.18 - 27

Received: 31 Oct 2014
Accepted: 26 May 2015

Published online: 29 Dec 2015
