Title: Synthetic sampling approach based on model-based clustering for imbalanced data

Authors: Shaukat Ali Shahee; Usha Ananthakumar

Addresses: Shailesh J. Mehta School of Management, Indian Institute of Technology Bombay, Mumbai, India; Shailesh J. Mehta School of Management, Indian Institute of Technology Bombay, Mumbai, India

Abstract: A dataset exhibits the class imbalance problem when one class has very few examples compared with the other class; this is also referred to as between-class imbalance. Apart from between-class imbalance, within-class imbalance, where classes are composed of different numbers of sub-clusters and these sub-clusters contain different numbers of examples, may also affect the performance of a classifier. In this paper, we propose a method that handles both between-class and within-class imbalance simultaneously and also takes into consideration various intrinsic characteristics of the data. The proposed method applies model-based clustering within each class to identify the sub-clusters present in the dataset and oversamples the examples in each sub-cluster in such a manner that between-class and within-class imbalance are eliminated simultaneously. We validate our approach using a neural network on ten publicly available datasets. The experimental results show the proposed method to be statistically significantly superior to the other methods compared.
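
The idea of removing both kinds of imbalance at once can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not state how many examples each sub-cluster receives or how synthetic examples are generated, so the sketch assumes (1) sub-cluster labels are already available, for instance from a Gaussian mixture fitted to each class separately, (2) every class is grown to the same total size, split equally among its sub-clusters, and (3) synthetic examples are made by linear interpolation between two random members of the same sub-cluster. The function name `balance_subclusters` and all parameter names are hypothetical.

```python
import math
import random
from collections import defaultdict
from functools import reduce


def balance_subclusters(points, class_labels, cluster_labels, seed=0):
    """Oversample each (class, sub-cluster) group to a common per-class share.

    Assumption (not from the paper): every class ends with the same total,
    divided equally among that class's sub-clusters.
    """
    rng = random.Random(seed)

    # Group the examples by (class, sub-cluster).
    groups = defaultdict(list)
    for point, cls, cluster in zip(points, class_labels, cluster_labels):
        groups[(cls, cluster)].append(tuple(point))

    # Count how many sub-clusters each class has.
    clusters_per_class = defaultdict(int)
    for cls, _cluster in groups:
        clusters_per_class[cls] += 1

    # Smallest class total that (a) never shrinks a sub-cluster and
    # (b) divides evenly by every class's sub-cluster count.
    step = reduce(lambda a, b: a * b // math.gcd(a, b),
                  clusters_per_class.values(), 1)
    needed = max(len(members) * clusters_per_class[cls]
                 for (cls, _cluster), members in groups.items())
    class_total = math.ceil(needed / step) * step

    balanced = {}
    for (cls, cluster), members in groups.items():
        target = class_total // clusters_per_class[cls]
        synthetic = []
        for _ in range(target - len(members)):
            # Interpolate between two random members of the same sub-cluster.
            a, b = rng.choice(members), rng.choice(members)
            t = rng.random()
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
        balanced[(cls, cluster)] = members + synthetic
    return balanced
```

For example, with a majority class of six examples in one sub-cluster and a minority class whose two sub-clusters hold three and one examples, the sketch leaves the first two groups unchanged and grows the singleton sub-cluster to three, so both classes end with six examples and the minority sub-clusters are equal in size.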

Keywords: classification; imbalanced dataset; oversampling; model-based clustering.

DOI: 10.1504/IJAISC.2018.097284

International Journal of Artificial Intelligence and Soft Computing, 2018 Vol.6 No.4, pp.348 - 364

Received: 24 Oct 2017
Accepted: 28 Oct 2018

Published online: 08 Jan 2019
