Title: A study of data pre-processing techniques for imbalanced biomedical data classification

Authors: Shigang Liu; Jun Zhang; Yang Xiang; Wanlei Zhou; Dongxi Xiang

Addresses: School of Software and Electrical Engineering, Swinburne University of Technology, Hawthorn, VIC 3122, Australia; School of Software and Electrical Engineering, Swinburne University of Technology, Hawthorn, VIC 3122, Australia; School of Software and Electrical Engineering, Swinburne University of Technology, Hawthorn, VIC 3122, Australia; School of Information Technology, Deakin University, Burwood, VIC 3125, Australia; Department of Genetics, Harvard Medical School, Boston, MA 02115, USA

Abstract: Biomedical data are widely used to develop prediction models for tasks such as identifying specific tumours, drug discovery and human cancer detection. However, previous studies have usually focused on comparing classifiers and have overlooked the class imbalance problem in real-world biomedical datasets. This paper reviews and evaluates popular and recently developed resampling and feature selection (FS) methods for class imbalance learning, taking the data distribution into account. Experimental results show that: 1) resampling and FS techniques exhibit better performance with a support vector machine (SVM) classifier; 2) random undersampling and FS perform better than other data pre-processing methods on data with a t location-scale distribution when using SVM and K-nearest neighbours (KNN) classifiers, while random oversampling outperforms other methods on data with a negative binomial distribution when using Random Forest at lower imbalance ratios; 3) FS outperforms other data pre-processing methods in most cases, so FS with an SVM classifier is the best choice for learning from imbalanced biomedical data.
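
For readers unfamiliar with the two resampling baselines named in the abstract, the following is a minimal illustrative sketch of random undersampling and random oversampling. It is not the authors' implementation: the function name random_resample, its parameters and the toy data are assumptions made for illustration only, and it uses only the Python standard library.

```python
import random
from collections import Counter


def random_resample(X, y, mode="under", seed=0):
    """Balance classes by random resampling (hypothetical helper).

    mode="under": randomly drop samples until every class matches the
                  smallest class.
    mode="over":  randomly duplicate samples until every class matches the
                  largest class.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    sizes = [len(idx) for idx in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    chosen = []
    for idx in by_class.values():
        if len(idx) > target:
            # Undersampling: keep a random subset without replacement.
            chosen.extend(rng.sample(idx, target))
        else:
            # Oversampling: keep all samples, then add random duplicates.
            chosen.extend(idx)
            chosen.extend(rng.choices(idx, k=target - len(idx)))

    rng.shuffle(chosen)
    return [X[i] for i in chosen], [y[i] for i in chosen]


# Toy data (assumed for illustration): 10 majority and 2 minority samples.
X = [[i] for i in range(12)]
y = [0] * 10 + [1] * 2

_, y_under = random_resample(X, y, mode="under")
_, y_over = random_resample(X, y, mode="over")
print(Counter(y_under))  # class counts after undersampling
print(Counter(y_over))   # class counts after oversampling
```

In practice, resampling of this kind is normally applied to the training split only, so that the test split keeps the original class distribution.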

Keywords: class-imbalance; data distribution; classification; biomedical data; resampling; feature selection.

DOI: 10.1504/IJBRA.2020.109103

International Journal of Bioinformatics Research and Applications, 2020 Vol.16 No.3, pp.290 - 318

Received: 13 Jun 2017
Accepted: 03 Feb 2018

Published online: 14 Aug 2020
