Title: A novel random forests-based feature selection method for microarray expression data analysis

Authors: Dengju Yao; Jing Yang; Xiaojuan Zhan; Xiaorong Zhan; Zhiqiang Xie

Addresses: College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China; School of Software, Harbin University of Science and Technology, Harbin 150040, China | College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China | College of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin 150050, China | Department of Endocrinology, First Affiliated Hospital, Harbin Medical University, Harbin 150081, China | School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China

Abstract: High-dimensional data and large numbers of redundant features in bioinformatics research have created an urgent need for feature selection. In this paper, a novel random forests-based feature selection method is proposed that stratifies the feature space and combines generalised sequential backward search and generalised sequential forward search strategies. A random forest variable importance score is used to rank features, and different classifiers serve as the feature subset evaluation function. The proposed method is evaluated on five microarray expression datasets (leukaemia, prostate, breast, nervous and DLBCL), and the average accuracies of the SVM classifier on these datasets are 100%, 95.24%, 85%, 91.67%, and 91.67%, respectively. The results show that the proposed method not only improves classification accuracy but also greatly reduces the computation time of the feature selection process.
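
The flow described in the abstract (rank features by an importance score, then search feature subsets with a classifier as the evaluating function) can be sketched roughly as follows. This is only an illustrative outline under stated assumptions, not the authors' algorithm: the function names `rank_features` and `backward_then_forward`, the group sizes `drop` and `add`, the toy importance scores, and `toy_score` are all hypothetical stand-ins. In practice the scores would come from a random forest and the evaluator from a classifier such as an SVM. The stratification of the feature space described in the paper is not reproduced here.

```python
from typing import Callable, Dict, List, Sequence


def rank_features(importance: Dict[str, float]) -> List[str]:
    """Return feature names sorted from most to least important."""
    return sorted(importance, key=lambda f: importance[f], reverse=True)


def backward_then_forward(
    ranked: Sequence[str],
    evaluate: Callable[[List[str]], float],
    drop: int = 4,
    add: int = 2,
) -> List[str]:
    """Hypothetical two-phase search over a ranked feature list.

    Phase 1 (generalised backward): repeatedly drop the `drop` lowest-ranked
    features as long as the score does not decrease.
    Phase 2 (generalised forward): try re-adding the dropped features in
    groups of `add`, keeping a group only if it strictly improves the score.
    """
    selected = list(ranked)
    best = evaluate(selected)

    # Phase 1: remove groups of low-ranked features.
    while len(selected) > drop:
        candidate = selected[:-drop]
        score = evaluate(candidate)
        if score < best:
            break
        selected, best = candidate, score

    # Phase 2: re-add groups of previously dropped features.
    dropped = [f for f in ranked if f not in selected]
    for i in range(0, len(dropped), add):
        candidate = selected + dropped[i:i + add]
        score = evaluate(candidate)
        if score > best:
            selected, best = candidate, score

    return selected


# Toy data (assumed for illustration only): importance scores that would
# normally come from a random forest, and a stand-in evaluator that rewards
# informative genes and penalises subset size.
importance = {"g1": 0.9, "g2": 0.8, "g3": 0.5, "g4": 0.4,
              "g5": 0.3, "g6": 0.2, "g7": 0.1, "g8": 0.05}
informative = {"g1", "g2", "g5", "g6"}


def toy_score(subset: List[str]) -> float:
    return 2 * sum(f in informative for f in subset) - len(subset)


ranked = rank_features(importance)
selected = backward_then_forward(ranked, toy_score, drop=4, add=2)
print(selected)  # ['g1', 'g2', 'g3', 'g4', 'g5', 'g6']
```

In this toy run the backward phase discards the four lowest-ranked features, and the forward phase recovers `g5` and `g6` because re-adding them raises the score. Replacing `toy_score` with a cross-validated classifier accuracy would make this a wrapper-style evaluation.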

Keywords: wrapper feature selection; filter feature selection; microarray expression data; data analysis; disease biomarkers; biomarker identification; feature ranking; random forests; support vector machines; SVM; bioinformatics; leukaemia dataset; prostate dataset; breast dataset; nervous dataset; DLBCL dataset; cancer classification.

DOI: 10.1504/IJDMB.2015.070852

International Journal of Data Mining and Bioinformatics, 2015 Vol.13 No.1, pp.84 - 101

Received: 10 Jan 2015
Accepted: 25 Jan 2015

Published online: 30 Jul 2015
