Title: Comparative studies for developing protein based cancer prediction model to maximise the ROC-AUC with various variable selection methods

Authors: Yongkang Kim; Min-Seok Kwon; Yonghwan Choi; Sung Gon Yi; Junghyun Namkung; Sangjo Han; Wooil Kwon; Sun Whe Kim; Jin-Young Jang; Hyunsoo Kim; Youngsoo Kim; Seungyeoun Lee; Taesung Park

Addresses: Department of Statistics, Seoul National University, Seoul, South Korea ' Interdisciplinary program in Bioinformatics, Seoul National University, Seoul, South Korea ' IVD Business Unit SK Telecom, Seoul, South Korea ' IVD Business Unit SK Telecom, Seoul, South Korea ' IVD Business Unit SK Telecom, Seoul, South Korea ' IVD Business Unit SK Telecom, Seoul, South Korea ' Department of Surgery, Seoul National University Hospital, Seoul, South Korea ' Department of Surgery, Seoul National University Hospital, Seoul, South Korea ' Department of Surgery, Seoul National University Hospital, Seoul, South Korea ' Department of Biomedical Engineering, Seoul National University, Seoul, South Korea ' Department of Biomedical Engineering, Seoul National University, Seoul, South Korea ' Department of Mathematics and Statistics, Sejong University, Seoul, South Korea ' Department of Statistics, Seoul National University, Seoul, South Korea; Interdisciplinary program in Bioinformatics, Seoul National University, Seoul, South Korea

Abstract: Protein data analysis is becoming more practical as quantification experiments such as multiple reaction monitoring (MRM) grow more accurate. Protein data are easier to obtain than genetic variant or gene expression data, which makes them better suited to the early diagnosis of cancer. Because each patient shows a unique pattern of protein data, researchers must select effective markers in order to build a consistent prediction model. This research focuses on finding the most effective variable selection method for the early diagnosis of pancreatic cancer. We compare classical selection methods (stepwise selection based on AIC and BIC), a machine learning based selection method (support vector machine recursive feature selection; SVM-REF), and a stepwise selection method that uses the area under the receiver operating characteristic curve (Step-AUC). Based on simulation and real data analysis, we suggest the Step-AUC method to maximise the prediction performance of early diagnosis from protein data.
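The idea of stepwise selection driven by the AUC can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`auc`, `sum_score`, `forward_step_auc`), the stopping threshold `min_gain`, and the toy data are all assumptions made for illustration. In particular, `sum_score` is a hypothetical stand-in for a fitted model such as logistic regression, and the sketch only performs forward steps on the same data it scores, with no cross-validation.

```python
def auc(scores, labels):
    """ROC-AUC via the Mann-Whitney statistic; tied pairs count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sum_score(rows, features):
    """Hypothetical stand-in for a fitted model: unweighted sum of markers."""
    return [sum(row[j] for j in features) for row in rows]


def forward_step_auc(rows, labels, score_fn=sum_score, min_gain=1e-9):
    """Greedy forward selection: add the marker with the largest AUC gain."""
    selected, best = [], 0.5  # 0.5 is the AUC of a random classifier
    remaining = list(range(len(rows[0])))
    while remaining:
        # AUC obtained by adding each remaining marker to the current set
        trial = {j: auc(score_fn(rows, selected + [j]), labels)
                 for j in remaining}
        j_best = max(trial, key=trial.get)
        if trial[j_best] - best <= min_gain:
            break  # no marker improves the AUC any further
        selected.append(j_best)
        remaining.remove(j_best)
        best = trial[j_best]
    return selected, best


# Toy data (made up): 3 markers, 3 cases (label 1) and 3 controls (label 0).
rows = [[3, 0, 1], [2, 1, 0], [1, 3, 1],
        [2, 0, 0], [0, 1, 1], [0, 0, 0]]
labels = [1, 1, 1, 0, 0, 0]

selected, best = forward_step_auc(rows, labels)
print(selected, best)
```

On this toy data the first marker is chosen first, the second marker raises the AUC further, and the third is left out because it adds nothing. A realistic analysis would replace `sum_score` with a model refitted at each step and would estimate the AUC on held-out data.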

Keywords: AIC; Akaike information criteria; BIC; Bayesian information criteria; SVM-REF; support vector machines; recursive feature selection; stepwise selection; step-AUC; MRM; multiple reaction monitoring; pancreatic cancer; protein data; cancer prediction; variable selection methods; bioinformatics; early diagnosis; machine learning; simulation.

DOI: 10.1504/IJDMB.2016.079803

International Journal of Data Mining and Bioinformatics, 2016 Vol.16 No.1, pp.64 - 76

Received: 17 May 2016
Accepted: 01 Jun 2016

Published online: 04 Oct 2016
