Authors: Gillala Rekha; V. Krishna Reddy; Amit Kumar Tyagi
Addresses: Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Hyderabad, Telangana, India; Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Guntur, India; School of Computing Science and Engineering, Vellore Institute of Technology, Chennai Campus, Chennai, Tamil Nadu, India
Abstract: Imbalanced datasets typically make accurate prediction difficult, and most real-world data are imbalanced in nature. Traditional classifiers assume a well-balanced class distribution in the training data, but practical datasets often exhibit an imbalance that misleads a classifier and degrades its ability to learn. Data pre-processing approaches address this concern by using either random undersampling or oversampling techniques. In this paper, we introduce Earth mover's distance (EMD) as a similarity measure to find samples that are similar in nature and eliminate them from the dataset as redundant. Earth mover's distance has received a lot of attention in areas such as computer vision, image retrieval and machine learning. The Earth mover's distance-based undersampling approach provides a data-level solution that eliminates redundant instances among the majority samples without any loss of valuable information. The method is implemented with five conventional classifiers, namely C4.5 decision tree (DT), k-nearest neighbour (k-NN), multilayer perceptron (MLP), support vector machine (SVM) and naive Bayes (NB), and with one ensemble technique, AdaBoost. The proposed method yields superior performance on 21 datasets from the KEEL repository.
Keywords: class imbalance; classification; data pre-processing; sampling technique; Earth mover's distance; EMD.
International Journal of Intelligent Information and Database Systems, 2020 Vol.13 No.2/3/4, pp.376 - 392
Received: 25 Apr 2019
Accepted: 06 Jan 2020
Published online: 26 Aug 2020
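
Illustrative sketch: the abstract describes removing redundant majority-class samples by using Earth mover's distance as a similarity measure, but it does not give the procedure. The Python below is one possible reading, not the authors' algorithm. It rests on three assumptions that do not come from the paper: each instance's scaled feature vector is treated as a 1-D empirical distribution with uniform weights, instances are filtered greedily in input order, and a hypothetical `threshold` parameter decides when two instances count as redundant. The helper names `emd_1d` and `emd_undersample` are likewise made up for illustration.

```python
import numpy as np

def emd_1d(u, v):
    """EMD between two equal-length 1-D samples with uniform weights.

    For this special case the distance equals the mean absolute
    difference between the sorted values of the two samples.
    """
    return float(np.mean(np.abs(np.sort(u) - np.sort(v))))

def emd_undersample(X_major, threshold):
    """Greedy redundancy filter over majority-class rows (assumed scheme).

    A row is kept only if its EMD to every previously kept row
    exceeds `threshold`; otherwise it is treated as redundant.
    Returns the indices of the kept rows.
    """
    kept = []
    for i, row in enumerate(X_major):
        if all(emd_1d(row, X_major[j]) > threshold for j in kept):
            kept.append(i)
    return np.array(kept, dtype=int)

# Toy majority-class data, features assumed already scaled to [0, 1].
X_major = np.array([
    [0.10, 0.20, 0.30],
    [0.11, 0.21, 0.29],  # near-duplicate of row 0
    [0.80, 0.90, 0.70],
    [0.10, 0.20, 0.30],  # exact duplicate of row 0
])

kept = emd_undersample(X_major, threshold=0.05)
X_reduced = X_major[kept]  # rows 0 and 2 survive the filter
```

One limitation of this reading is that sorting discards feature order, so two instances whose values are permutations of each other have distance zero. The reduced majority set would then be combined with the untouched minority samples before training any of the classifiers named in the abstract.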