Title: Perturbation and candidate analysis to combat overfitting of gene expression microarray data

Authors: Ravi Mathur; J. David Schaffer; Walker H. Land Jr.; John J. Heine; Jonathan M. Hernandez; Timothy Yeatman

Addresses: Department of Bioengineering, Binghamton University, Binghamton, NY 13902, USA; Department of Bioengineering, Binghamton University, Binghamton, NY 13902, USA; Department of Bioengineering, Binghamton University, Binghamton, NY 13902, USA; H. Lee Moffitt Cancer Center & Research Institute and University of South Florida, Tampa, FL 33620, USA; H. Lee Moffitt Cancer Center & Research Institute and University of South Florida, Tampa, FL 33620, USA; H. Lee Moffitt Cancer Center & Research Institute and University of South Florida, Tampa, FL 33620, USA

Abstract: Analysis of gene expression microarray datasets carries a high risk of overfitting (fitting spurious patterns) because the datasets are feature-rich but case-poor. This paper describes our ongoing efforts to develop a method that combats overfitting and determines the strongest signal in the dataset. A GA-SVM hybrid, together with Gaussian noise (manual noise gain), is used to discover feature sets of minimal size that accurately classify the cases under cross-validation. Initial results on a colorectal cancer dataset show that the strongest signal (a modest number of candidates) can be found by a binary search.
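
The abstract and keywords name two ingredients, Gaussian noise perturbation and the Az value (area under the ROC curve), without giving details. The following is a minimal sketch of both, not the authors' implementation. The function names `perturb` and `az_value` are hypothetical, and treating the noise gain as the standard deviation of zero-mean noise added to every expression value is an assumption made here for illustration.

```python
import random


def perturb(matrix, noise_gain, rng):
    """Return a copy of `matrix` with zero-mean Gaussian noise added to each value.

    Assumption: `noise_gain` is used directly as the noise standard deviation.
    `matrix` is a list of rows (cases), each a list of expression values.
    """
    return [[x + rng.gauss(0.0, noise_gain) for x in row] for row in matrix]


def az_value(scores, labels):
    """Area under the ROC curve, computed as the Mann-Whitney statistic.

    This is the fraction of (positive, negative) pairs in which the positive
    case receives the higher score; tied pairs count as one half.
    `labels` holds 1 for positive cases and 0 for negative cases.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the perturbation is repeatable
    data = [[1.0, 2.0], [3.0, 4.0]]
    noisy = perturb(data, noise_gain=0.5, rng=rng)
    print(len(noisy), len(noisy[0]))
    print(az_value([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

In a workflow like the one the abstract describes, classifier scores obtained under cross-validation on perturbed data would be passed to a function such as `az_value`; the classifier and the feature search themselves are outside this sketch.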

Keywords: Az value; colorectal cancer; cross-validation; DNA microarray; GAs; genetic algorithms; noise perturbation; overfitting; ROC curve; support vector machines; SVM; gene expression microarray data.

DOI: 10.1504/IJCBDD.2011.044443

International Journal of Computational Biology and Drug Design, 2011 Vol.4 No.4, pp.307 - 315

Published online: 24 Jan 2015
