Title: Predicting tumour stages of lung cancer adenocarcinoma tumours from pooled microarray data using machine learning methods

Authors: Xin Li; Benjamin Scheich

Addresses: Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University, Washington DC, 20057, USA; Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University, Association of Women's Health, Obstetric & Neonatal Nurses, Washington DC, 20036, USA

Abstract: This paper presents a novel combination of methods for predicting lung cancer adenocarcinoma stages: differential expression analysis (linear modelling) for gene selection, followed by machine learning methods (support vector machines (SVMs) and random forest (RF)), applied to a dataset pooled from multiple publicly available microarray experiments. The raw data of 123 tumour microarray samples were first preprocessed with robust multi-array average (RMA) and then analysed with linear models for microarray data (LIMMA) to screen for significantly differentially expressed genes; two gene lists were identified under different experimental settings. These two gene lists were then used as input to the SVM and RF models to build the prediction models. As a result, both the SVM and RF models predicted lung cancer stage with accuracies ranging from 67% to 71%.
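
The general shape of such a pipeline (select genes, then compare an SVM and a random forest) can be illustrated with a minimal Python sketch. This is not the authors' code and makes several assumptions: the expression matrix is synthetic random data rather than RMA-normalised microarray values, the stage label is a hypothetical two-class variable, and a simple ANOVA F-test from scikit-learn stands in for the LIMMA differential expression step. The function names, the number of selected genes, and all model settings are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_synthetic_expression(n_samples=123, n_genes=500, n_informative=20, seed=0):
    """Build a synthetic samples-by-genes matrix with a hypothetical binary stage label."""
    rng = np.random.default_rng(seed)
    stage = rng.integers(0, 2, size=n_samples)  # 0/1 labels, purely illustrative
    X = rng.normal(size=(n_samples, n_genes))
    # Shift the first few genes for class 1 so that some genes carry signal.
    X[:, :n_informative] += 1.0 * stage[:, None]
    return X, stage


def evaluate(X, y, k_genes=50, seed=0):
    """Cross-validate an SVM and a random forest, each preceded by gene selection."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    models = {
        # F-test selection is an assumed stand-in for LIMMA-based gene screening.
        "svm": make_pipeline(
            SelectKBest(f_classif, k=k_genes),
            StandardScaler(),
            SVC(kernel="linear"),
        ),
        "rf": make_pipeline(
            SelectKBest(f_classif, k=k_genes),
            RandomForestClassifier(n_estimators=200, random_state=seed),
        ),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


X, y = make_synthetic_expression()
scores = evaluate(X, y)  # dict mapping model name to mean cross-validated accuracy
```

Placing the gene selection step inside each pipeline means it is refitted on the training folds only, so the held-out samples do not influence which genes are chosen. The accuracies obtained on this synthetic data say nothing about the 67% to 71% reported in the paper.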

Keywords: machine learning; gene expression; microarrays; lung cancer; tumour stages; adenocarcinoma tumours; gene selection; linear modelling; support vector machines; SVM; random forest; bioinformatics.

DOI: 10.1504/IJCBDD.2015.072109

International Journal of Computational Biology and Drug Design, 2015 Vol.8 No.3, pp.275 - 292

Received: 26 Aug 2014
Accepted: 10 Mar 2015

Published online: 30 Sep 2015
