Int. J. of Big Data Intelligence   »   2016 Vol.3, No.1

 

 

Title: Learning-based text classifiers using the Mahalanobis distance for correlated datasets

 

Authors: Noopur Srivastava; Shrisha Rao

 

Addresses:
Schneider Electric India Pvt. Ltd., Beary's Global Research Triangle, Bangalore 560037, India
International Institute of Information Technology – Bangalore, Bangalore 560100, India

 

Abstract: We present a novel approach to text categorisation that uses the Mahalanobis distance measure for classification. For correlated datasets, classification using the Euclidean distance is not very accurate, whereas the Mahalanobis distance exploits the correlation in the data. To achieve this on large datasets, an unsupervised dimensionality reduction technique, principal component analysis (PCA), is applied prior to classification with the k-nearest neighbours (kNN) classifier. Because kNN does not work well for high-dimensional data, and computing correlations for huge, sparse data is inefficient, we use PCA to obtain a reduced dataset for the training phase. Experimental results on huge datasets show improved classification accuracy and a significant reduction in error percentage with the proposed algorithm, compared with classifiers using the Euclidean distance.
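
The pipeline described in the abstract (PCA for dimensionality reduction, followed by kNN with the Mahalanobis distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_pca`, `project` and `mahalanobis_knn_predict` are hypothetical, the data is assumed to be a small dense NumPy feature matrix (for example, term weights), and the covariance is assumed to be estimated from the reduced training set. The distance used is the standard squared Mahalanobis form, d^2(x, y) = (x - y)^T S^{-1} (x - y), where S is the covariance matrix.

```python
import numpy as np
from collections import Counter


def fit_pca(X, n_components):
    """Return the feature mean and the top principal directions of X."""
    mean = X.mean(axis=0)
    # SVD of the centred data; the rows of Vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def project(X, mean, components):
    """Map samples into the reduced PCA space."""
    return (X - mean) @ components.T


def mahalanobis_knn_predict(Z_train, y_train, Z_test, k=3):
    """Classify each row of Z_test by majority vote among its k nearest
    training samples, measured with the squared Mahalanobis distance."""
    y_train = np.asarray(y_train)
    # Covariance of the reduced training data (assumption: estimated here,
    # in the PCA space, rather than in the original feature space).
    cov = np.atleast_2d(np.cov(Z_train, rowvar=False))
    # The pseudo-inverse guards against a singular covariance matrix.
    inv_cov = np.linalg.pinv(cov)
    predictions = []
    for z in Z_test:
        diff = Z_train - z
        # Quadratic form diff_i^T inv_cov diff_i for every training sample i.
        d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        nearest = np.argsort(d2)[:k]
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)
```

In this sketch the PCA mean and directions are fitted on the training data only, and test documents are mapped with the same `project` call before classification. The number of retained components and the value of `k` are free parameters here; the abstract does not state the values used in the paper.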

 

Keywords: Mahalanobis distance; k-nearest neighbour; kNN; text classification; precision; recall; dimensionality reduction; principal component analysis; PCA; correlated datasets.

 

DOI: 10.1504/IJBDI.2016.073901

 

Int. J. of Big Data Intelligence, 2016 Vol.3, No.1, pp.18 - 27

 

Submission date: 30 Oct 2014
Date of acceptance: 26 May 2015
Available online: 29 Dec 2015

 

 
