Title: Evaluation of the importance of data pre-processing order when combining feature selection and data sampling

Authors: Ahmad Abu Shanab; Taghi M. Khoshgoftaar; Randall Wald; Jason Van Hulse

Addresses: Department of Computer & Electrical Engineering & Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA (all authors)

Abstract: Two problems often encountered in machine learning are class imbalance and high dimensionality. In this paper we compare three different approaches for addressing both problems simultaneously, by applying both data sampling and feature selection. With the first two approaches, sampling is followed by feature selection. In the first approach, the features are selected based on the sampled data, and then the unsampled data is used with just the selected features. The second approach is similar, but the sampled data is used. Finally, with the third approach, feature selection is performed prior to sampling. To compare the approaches, we use seven datasets from different domains, employ nine feature rankers from three different families, apply three sampling techniques, and inject class noise to better simulate real-world datasets. The results show that the second and third approaches are both very good, with the third approach showing a slight (but not statistically significant) lead.
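
The three orderings described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ranker (absolute difference of class means), the sampler (random undersampling to a balanced class ratio), and all function names are assumptions chosen for brevity, and the abstract does not say which nine rankers or three sampling techniques were used.

```python
import random
from statistics import mean


def rank_features(X, y, k):
    """Return the indices of the top-k features.

    Hypothetical stand-in ranker: scores each feature by the absolute
    difference between its mean in class 1 and its mean in class 0.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    n_features = len(X[0])
    scores = [
        abs(mean(r[j] for r in pos) - mean(r[j] for r in neg))
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])


def undersample(X, y, rng):
    """Random undersampling (an assumed sampler): keep every minority
    instance and an equally sized random subset of the majority class."""
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    minority, majority = sorted((pos_idx, neg_idx), key=len)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]


def project(X, features):
    """Keep only the selected feature columns."""
    return [[row[j] for j in features] for row in X]


def approach_1(X, y, k, rng):
    # Select features on the sampled data, then keep the UNSAMPLED data
    # restricted to those features.
    Xs, ys = undersample(X, y, rng)
    feats = rank_features(Xs, ys, k)
    return project(X, feats), y, feats


def approach_2(X, y, k, rng):
    # Select features on the sampled data, and keep the SAMPLED data.
    Xs, ys = undersample(X, y, rng)
    feats = rank_features(Xs, ys, k)
    return project(Xs, feats), ys, feats


def approach_3(X, y, k, rng):
    # Select features on the original data first, then sample.
    feats = rank_features(X, y, k)
    Xs, ys = undersample(project(X, feats), y, rng)
    return Xs, ys, feats


if __name__ == "__main__":
    # Toy imbalanced data: 10 positives, 90 negatives, 5 features,
    # with only feature 0 carrying class signal.
    rng = random.Random(0)
    y = [1] * 10 + [0] * 90
    X = [
        [rng.gauss(2.0 * label, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)]
        for label in y
    ]
    for approach in (approach_1, approach_2, approach_3):
        X_out, y_out, feats = approach(X, y, 2, random.Random(1))
        print(approach.__name__, "rows:", len(X_out), "features:", feats)
```

Each function returns the data that would be passed to a classifier, together with the selected feature indices. The only differences between the approaches are which data the ranker sees and whether the returned instances are sampled, which is the comparison the paper evaluates.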

Keywords: feature selection; data sampling; data pre-processing order; machine learning; class imbalance; high dimensionality.

DOI: 10.1504/IJBIDM.2012.048730

International Journal of Business Intelligence and Data Mining, 2012 Vol.7 No.1/2, pp.116 - 134

Published online: 12 Nov 2014
