Title: An Earth mover's distance-based undersampling approach for handling class-imbalanced data

Authors: Gillala Rekha; V. Krishna Reddy; Amit Kumar Tyagi

Addresses: Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Hyderabad, Telangana, India; Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Guntur, India; School of Computing Science and Engineering, Vellore Institute of Technology, Chennai Campus, Chennai, Tamil Nadu, India

Abstract: Imbalanced datasets make accurate prediction difficult, and most real-world data are imbalanced in nature. Traditional classifiers assume a well-balanced class distribution in the training data, but practical datasets are often imbalanced, which misleads a classifier and degrades its ability to learn from them. Data pre-processing approaches address this concern by using either random undersampling or oversampling techniques. In this paper, we introduce Earth mover's distance (EMD) as a similarity measure to find samples that are similar in nature and eliminate them from the dataset as redundant. EMD has received considerable attention in areas such as computer vision, image retrieval and machine learning. The EMD-based undersampling approach provides a data-level solution that eliminates redundant instances among the majority samples without any loss of valuable information. The method is implemented with five conventional classifiers, namely C4.5 decision tree (DT), k-nearest neighbour (k-NN), multilayer perceptron (MLP), support vector machine (SVM) and naive Bayes (NB), and one ensemble technique, AdaBoost. The proposed method yields superior performance on 21 datasets from the KEEL repository.
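
The abstract does not spell out the procedure, so the following is only a minimal sketch of how an EMD-based redundancy filter for majority-class samples might look, not the authors' algorithm. It assumes each sample is a numeric feature vector of equal length, treats the feature values as a 1-D empirical distribution with uniform weights (so the comparison ignores feature order), and uses a greedy pass with a user-chosen cutoff. The names `emd_1d`, `undersample_majority` and `threshold` are hypothetical.

```python
from typing import List, Sequence


def emd_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Earth mover's distance between two equal-size 1-D samples.

    With uniform weights and equal sizes, the optimal transport plan pairs
    the sorted values, so the distance is the mean absolute difference of
    the sorted samples.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def undersample_majority(
    majority: List[Sequence[float]], threshold: float
) -> List[Sequence[float]]:
    """Greedy filter (an assumption, not the paper's procedure).

    A majority sample is kept only if its EMD to every sample kept so far
    exceeds `threshold`; otherwise it is treated as redundant and dropped.
    """
    kept: List[Sequence[float]] = []
    for row in majority:
        if all(emd_1d(row, k) > threshold for k in kept):
            kept.append(row)
    return kept


# Toy majority class: two pairs of near-duplicate samples.
majority = [
    [0.0, 1.0, 2.0],
    [0.0, 1.0, 2.1],
    [5.0, 6.0, 7.0],
    [5.1, 6.0, 7.0],
]
reduced = undersample_majority(majority, threshold=0.1)
print(len(reduced))  # 2
```

In this sketch the cutoff controls how aggressively the majority class is reduced: a larger `threshold` removes more samples. The reduced majority set would then be combined with the untouched minority samples before training a classifier.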

Keywords: class imbalance; classification; data pre-processing; sampling technique; Earth mover's distance; EMD.

DOI: 10.1504/IJIIDS.2020.109463

International Journal of Intelligent Information and Database Systems, 2020 Vol.13 No.2/3/4, pp.376 - 392

Received: 25 Apr 2019
Accepted: 06 Jan 2020

Published online: 09 Sep 2020
