Article: An empirical examination of classification algorithms and resampling strategies for dealing with imbalanced datasets: a comparative analysis Journal: International Journal of Data Analysis Techniques and Strategies (IJDATS) 2025 Vol.17 No.3 pp.238 - 253

Title: An empirical examination of classification algorithms and resampling strategies for dealing with imbalanced datasets: a comparative analysis

Authors: Himani S. Deshpande; Leena Ragha

Addresses: Department of Artificial Intelligence and Data Science, Thadomal Shahani Engineering College, Mumbai, 400050, Maharashtra, India; Department of Computer Science and Engineering, BLDEA's V.P. Dr. P. G. Halakatti College of Engineering and Technology, Vijayapur, 586103, Karnataka, India

Abstract: Imbalanced datasets can lead to biased models and inaccurate predictions, which makes class imbalance a crucial issue to address. This research comprehensively analyses the issues, approaches and evaluation parameters involved in building machine learning models on imbalanced datasets. The literature groups data imbalance handling methods into three broad categories: pre-processing methods, cost-sensitive learning, and ensemble methods. Experiments test popular classifiers in combination with three pre-processing methods (clustered SMOTE, random over sampling, and scaled values) on seven standard imbalanced datasets. The results show that the Random Forest classifier combined with the Random Over Sampling pre-processing method performed best on most of the datasets, with precision values between 0.68 and 1, AUC values between 0.83 and 1, and prediction accuracy between 76.1% and 99.8%. The study highlights that the choice of evaluation metric and pre-processing method can have a significant impact on classifier performance.
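The two ideas the abstract leans on, random over sampling and the sensitivity of results to the evaluation metric, can be illustrated with a minimal sketch. It is not the authors' experimental pipeline: the toy data, the 95/5 class split, and the helper names `random_oversample`, `accuracy` and `precision_recall` are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))  # sample with replacement
            y_out.append(label)
    return X_out, y_out


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Assumed toy data: 95 majority rows (label 0) and 5 minority rows (label 1).
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5

# Over sampling balances the class counts by repeating minority rows.
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)

# A classifier that always predicts the majority class looks accurate
# on the original data but never finds a single minority row.
always_majority = [0] * len(y)
baseline_accuracy = accuracy(y, always_majority)
baseline_precision, baseline_recall = precision_recall(y, always_majority)
```

On this toy data the always-majority baseline scores high accuracy while its minority-class recall is zero, which is one way to see why the choice of metric matters. In a real experiment, over sampling would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.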

Keywords: imbalanced data; over sampling; undersampling; classification; cost sensitive; ensemble learning; feature weighing; instance weighing.

DOI: 10.1504/IJDATS.2025.148563

International Journal of Data Analysis Techniques and Strategies, 2025 Vol.17 No.3, pp.238 - 253

Received: 14 Jun 2023
Accepted: 05 Feb 2024
Published online: 12 Sep 2025


