Title: Synthetic sampling approach based on model-based clustering for imbalanced data
Authors: Shaukat Ali Shahee; Usha Ananthakumar
Addresses: Shailesh J. Mehta School of Management, Indian Institute of Technology Bombay, Mumbai, India; Shailesh J. Mehta School of Management, Indian Institute of Technology Bombay, Mumbai, India
Abstract: A dataset exhibits the class imbalance problem when one class has very few examples compared to the other class; this is also referred to as between-class imbalance. Apart from between-class imbalance, within-class imbalance, where classes are composed of different numbers of sub-clusters and these sub-clusters contain different numbers of examples, may also affect the performance of the classifier. In this paper, we propose a method that handles both between-class and within-class imbalance simultaneously and also takes into consideration various intrinsic characteristics of the data. The proposed method uses model-based clustering with respect to each class to identify the sub-clusters present in the dataset and oversamples examples in each sub-cluster in such a manner that it eliminates between-class and within-class imbalance simultaneously. We validate our approach using a neural network on ten publicly available datasets. The experimental results show the proposed method to be statistically significantly superior to other methods.
Keywords: classification; imbalanced dataset; oversampling; model-based clustering.
DOI: 10.1504/IJAISC.2017.10018306
International Journal of Artificial Intelligence and Soft Computing, 2018 Vol.6 No.4, pp.348-364
Received: 24 Oct 2017
Accepted: 28 Oct 2018
Published online: 08 Jan 2019
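
The abstract says that examples in each sub-cluster are oversampled so that between-class and within-class imbalance disappear together. The sketch below illustrates one allocation rule that satisfies that goal; it is an assumption made for illustration, not necessarily the rule used in the paper. It presumes the sub-clusters have already been found by model-based clustering, and it does not generate the synthetic examples themselves. The function name `oversampling_plan`, the class labels, and the sub-cluster sizes are all hypothetical.

```python
def oversampling_plan(cluster_sizes):
    """Return how many synthetic examples to add to each sub-cluster.

    cluster_sizes: dict mapping a class label to the list of sizes of
    that class's sub-clusters (assumed to come from a prior clustering step).
    Assumed rule: every class is grown to the same total, and that total is
    split as evenly as possible across the class's sub-clusters.
    """
    # Common class total, chosen large enough that no sub-cluster must shrink.
    target_total = max(len(sizes) * max(sizes) for sizes in cluster_sizes.values())

    plan = {}
    for label, sizes in cluster_sizes.items():
        # Split the class total evenly; the first `extra` sub-clusters take one more.
        base, extra = divmod(target_total, len(sizes))
        targets = [base + (1 if i < extra else 0) for i in range(len(sizes))]
        plan[label] = [target - size for target, size in zip(targets, sizes)]
    return plan


# Hypothetical sub-cluster sizes, for illustration only.
sizes = {"majority": [500, 300, 200], "minority": [40, 10]}
plan = oversampling_plan(sizes)
print(plan)
```

Under this rule every class ends with the same number of examples, and the sub-clusters inside a class differ in size by at most one, so both kinds of imbalance are removed by oversampling alone.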