Authors: Weizeng Ni; Nuo Xu; Honghao Dai; Samuel H. Huang
Addresses: Global Risk Analytics, Bank of America, NJ 07310, USA; Quantitative Methods, Collat School of Business, University of Alabama at Birmingham, AL 35294, USA; Risk Analytics, Axcess Financial Service, Cincinnati, OH 45246, USA; School of Dynamic Systems, University of Cincinnati, OH 45221, USA
Abstract: Feature selection is an important and challenging step in learning from data with small sample size and high dimensionality. The widely used wrapper approach potentially introduces feature selection bias due to overfitting. More sophisticated approaches, such as external cross-validation and dual-loop cross-validation, have been proposed to reduce this bias, but they tend to introduce excessive variability on small-sample data. This paper shows that a model-independent approach, namely minimum expected cost of misclassification (MECM), can reduce feature selection bias without cross-validation. An experiment on a synthetic dataset shows that a 10-fold dual-loop cross-validation based wrapper has an error rate around 33% higher than the noise-free error rate and fails to identify discriminative features consistently across all 10 folds. In contrast, MECM selects more discriminative features and is more robust to the choice of classification model. A real-world colon cancer dataset is further used to demonstrate the effectiveness of MECM.
Keywords: feature selection; overfitting; microarray data; model-independent; cross-validation; high dimensionality; small sample data; feature selection bias; model evaluation.
International Journal of Data Science, 2020 Vol.5 No.3, pp.229 - 246
Received: 30 Nov 2019
Accepted: 15 May 2020
Published online: 30 Jan 2021
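The feature selection bias the abstract describes can be illustrated with a minimal sketch (not from the paper; all data, names, and parameters here are illustrative). On pure-noise data, selecting features once on the full dataset and then cross-validating inflates accuracy well above the 50% chance level, whereas refitting the selection inside each fold, as external cross-validation does, avoids that leakage:

```python
import random
import statistics

random.seed(0)

# Illustrative synthetic data: 40 samples, 500 pure-noise features,
# random class labels -- so the true (noise-free) error rate is 50%.
n, p, k = 40, 500, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.randint(0, 1) for _ in range(n)]

def select_features(rows, labels, k):
    """Rank features by absolute difference of class means (a simple filter)."""
    scores = []
    for j in range(p):
        a = [r[j] for r, l in zip(rows, labels) if l == 0]
        b = [r[j] for r, l in zip(rows, labels) if l == 1]
        scores.append(abs(statistics.mean(a) - statistics.mean(b)))
    return sorted(range(p), key=lambda j: -scores[j])[:k]

def centroid_predict(train_rows, train_labels, test_row, feats):
    """Nearest-centroid classifier restricted to the chosen features."""
    dists = {}
    for c in (0, 1):
        cent = [statistics.mean(r[j] for r, l in zip(train_rows, train_labels) if l == c)
                for j in feats]
        dists[c] = sum((test_row[j] - m) ** 2 for j, m in zip(feats, cent))
    return min(dists, key=dists.get)

def loo_accuracy(select_inside_loop):
    """Leave-one-out CV; selection done outside (biased) or inside each fold."""
    outside = select_features(X, y, k)   # selection has seen every test point
    correct = 0
    for i in range(n):
        tr_X = X[:i] + X[i + 1:]
        tr_y = y[:i] + y[i + 1:]
        feats = select_features(tr_X, tr_y, k) if select_inside_loop else outside
        correct += centroid_predict(tr_X, tr_y, X[i], feats) == y[i]
    return correct / n

acc_biased = loo_accuracy(select_inside_loop=False)
acc_nested = loo_accuracy(select_inside_loop=True)
print(f"biased LOO accuracy: {acc_biased:.2f}, nested LOO accuracy: {acc_nested:.2f}")
```

On noise data like this, the biased protocol typically scores well above chance while the nested protocol stays near 50%, which is the bias that motivates external and dual-loop cross-validation; the abstract's point is that these nested schemes in turn add variability on small samples, which MECM avoids by not relying on cross-validation at all.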