Authors: Noopur Srivastava; Shrisha Rao
Addresses: Schneider Electric India Pvt. Ltd., Beary's Global Research Triangle, Bangalore 560037, India; International Institute of Information Technology – Bangalore, Bangalore 560100, India
Abstract: We present a novel approach to text categorisation with the aid of the Mahalanobis distance measure for classification. For correlated datasets, classification using the Euclidean distance is not very accurate. The use of the Mahalanobis distance exploits the correlation in data for the purpose of classification. To achieve this on large datasets, an unsupervised dimensionality reduction technique, principal component analysis (PCA), is used prior to classification using the k-nearest neighbours (kNN) classifier. As kNN does not work well for high-dimensional data, and moreover computing correlations for huge and sparse data is inefficient, we use PCA to obtain a reduced dataset for the training phase. Experimental results show improvement in classification accuracy and a significant reduction in error percentage by using the proposed algorithm on huge datasets, in comparison with classifiers using the Euclidean distance.
Keywords: Mahalanobis distance; k-nearest neighbour; kNN; text classification; precision; recall; dimensionality reduction; principal component analysis; PCA; correlated datasets.
International Journal of Big Data Intelligence, 2016 Vol.3 No.1, pp.18 - 27
Received: 31 Oct 2014
Accepted: 26 May 2015
Published online: 29 Dec 2015
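
The pipeline outlined in the abstract (PCA for dimensionality reduction, followed by kNN classification under the Mahalanobis distance) might be sketched roughly as follows. This is a minimal illustration using NumPy only, not the authors' implementation: the function names, the use of a single pooled covariance matrix estimated from the reduced training set, the majority-vote tie-breaking, and the toy numeric data standing in for document term vectors are all assumptions made for the example.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the training mean and the top principal directions (as columns)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T


def pca_transform(X, mean, components):
    """Project data onto the principal directions learned from the training set."""
    return (X - mean) @ components


def mahalanobis_knn_predict(Z_train, y_train, Z_test, k=3):
    """Classify each test point by majority vote among its k nearest
    training points under the squared Mahalanobis distance.

    Assumption: one covariance matrix is estimated from all reduced
    training points; the paper may estimate it differently.
    """
    cov = np.atleast_2d(np.cov(Z_train, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against singularity
    preds = []
    for z in Z_test:
        diff = Z_train - z
        # d2[i] = diff[i]^T @ cov_inv @ diff[i]
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        nearest = np.argsort(d2)[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])  # ties go to the smallest label
    return np.array(preds)


def make_toy_data(n_per_class, rng):
    """Hypothetical stand-in for document vectors: two well-separated classes."""
    X0 = rng.normal(0.0, 1.0, size=(n_per_class, 5))
    X1 = rng.normal(10.0, 1.0, size=(n_per_class, 5))
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


def run_demo(seed=0):
    """Fit PCA on training data, classify held-out points, return accuracy."""
    rng = np.random.default_rng(seed)
    X_train, y_train = make_toy_data(50, rng)
    X_test, y_test = make_toy_data(10, rng)
    mean, comps = pca_fit(X_train, n_components=2)
    Z_train = pca_transform(X_train, mean, comps)
    Z_test = pca_transform(X_test, mean, comps)
    y_pred = mahalanobis_knn_predict(Z_train, y_train, Z_test, k=3)
    return float((y_pred == y_test).mean())
```

Note that PCA is fitted on the training data only and the same projection is then applied to the test points. In the sketch, the covariance of the PCA-reduced training set is (numerically) diagonal, so the Mahalanobis distance there amounts to a Euclidean distance with each retained component scaled by its variance.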