Title: Borderline over-sampling for imbalanced data classification

Authors: Hien M. Nguyen, Eric W. Cooper, Katsuari Kamei

Addresses: Graduate School of Science and Engineering, Ritsumeikan University, 1-1-1 Noji Higashi, Kusatsu, Shiga 525-8577, Japan; College of Information Science and Engineering, Ritsumeikan University, 1-1-1 Noji Higashi, Kusatsu, Shiga 525-8577, Japan; College of Information Science and Engineering, Ritsumeikan University, 1-1-1 Noji Higashi, Kusatsu, Shiga 525-8577, Japan

Abstract: Traditional classification algorithms usually predict the minority class of imbalanced data sets with poor accuracy. This paper proposes a new method for dealing with imbalanced data sets by over-sampling the borderline minority class instances. A Support Vector Machine (SVM) classifier is then trained to predict future instances. Unlike other over-sampling methods, the proposed method focuses only on the minority class instances residing along the decision boundary, because this region is the most crucial for establishing that boundary. Furthermore, artificial minority instances are generated so that regions of the minority class with fewer majority class instances are expanded by extrapolation, while elsewhere the current boundary of the minority class is consolidated by interpolation. Experimental results show that the proposed method achieves better performance than other over-sampling methods.
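
The extrapolation/interpolation rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and several details are assumptions that the abstract does not specify: the function name `borderline_oversample`, the neighbourhood size `k`, the 0.5 threshold on the fraction of majority neighbours, the uniform random step `rho`, and the idea that the borderline indices are supplied by the caller (for example, the minority-class support vectors of an SVM trained on the original data).

```python
import numpy as np

def borderline_oversample(X_min, X_maj, borderline_idx, n_new, k=5, seed=0):
    """Generate n_new synthetic minority instances around borderline ones.

    Hypothetical helper: borderline_idx holds row indices into X_min that
    the caller considers borderline (an assumption, e.g. SVM support vectors).
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    X_all = np.vstack([X_min, X_maj])            # minority rows come first
    is_maj = np.arange(len(X_all)) >= len(X_min)
    k_all = min(k, len(X_all) - 1)
    k_min = min(k, len(X_min) - 1)

    synthetic = []
    for _ in range(n_new):
        i = rng.choice(borderline_idx)
        x = X_min[i]

        # Fraction of majority instances among the nearest neighbours of x.
        d_all = np.linalg.norm(X_all - x, axis=1)
        d_all[i] = np.inf                        # exclude x itself
        nn_all = np.argsort(d_all)[:k_all]
        maj_fraction = is_maj[nn_all].mean()

        # Pick one of the nearest minority neighbours at random.
        d_min = np.linalg.norm(X_min - x, axis=1)
        d_min[i] = np.inf
        neighbour = X_min[rng.choice(np.argsort(d_min)[:k_min])]

        rho = rng.uniform(0.0, 1.0)              # assumed step size in [0, 1)
        if maj_fraction < 0.5:
            # Few majority instances nearby: extrapolate away from the
            # minority neighbour, expanding the minority region.
            new = x + rho * (x - neighbour)
        else:
            # Many majority instances nearby: interpolate towards the
            # minority neighbour, consolidating the current boundary.
            new = x + rho * (neighbour - x)
        synthetic.append(new)
    return np.array(synthetic)
```

In this sketch the synthetic instances would be appended to the minority class before training the final classifier; the choice of classifier and its parameters is left to the caller.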

Keywords: imbalanced data sets; over-sampling; support vector machines; SVM; data classification; borderline minority class.

DOI: 10.1504/IJKESDP.2011.039875

International Journal of Knowledge Engineering and Soft Data Paradigms, 2011, Vol. 3, No. 1, pp. 4-21

Published online: 22 Apr 2011
