Int. J. of Data Mining and Bioinformatics   »   2016 Vol.16, No.1

 

 

Title: Cross-validation and cross-study validation of chronic lymphocytic leukaemia with exome sequences and machine learning

 

Authors: Abdulrhman Aljouie; Nihir Patel; Bharati Jadhav; Usman Roshan

 

Addresses:
Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA
Department of Genetics and Genomics Sciences, Icahn School of Medicine at Mount Sinai Hospital, Hess Center for Science and Medicine, New York City, NY 10029, USA
Department of Genetics and Genomics Sciences, Icahn School of Medicine at Mount Sinai Hospital, Hess Center for Science and Medicine, New York City, NY 10029, USA
Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA

 

Abstract: The era of genomics brings the potential for better DNA-based risk prediction and treatment. We explore this problem for chronic lymphocytic leukaemia, which has one of the largest whole exome data sets available from the NIH dbGaP database. We apply a standard next-generation sequencing procedure to obtain Single-Nucleotide Polymorphism (SNP) variants and reach a peak mean accuracy of 82% in our cross-validation study. We also cross-validate an Affymetrix 6.0 genome-wide association study of the same samples, where we find a peak accuracy of 57%. We then perform a cross-study validation with exome samples from other studies in the NIH dbGaP database serving as the external data set. There we obtain an accuracy of 70% with the top Pearson-ranked SNPs obtained from the original exome data set. Our study shows that even with a small sample size we can obtain moderate to high accuracy with exome sequences, which is encouraging for future work.
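
As an illustration of the kind of procedure the abstract describes, the sketch below ranks SNPs by absolute Pearson correlation with case/control status and estimates accuracy by k-fold cross-validation. It is a minimal sketch under stated assumptions, not the authors' pipeline: the 0/1/2 genotype encoding, the nearest-centroid classifier (a stand-in, since the abstract does not name the classifier), the re-ranking of SNPs inside each training split, and all function names are hypothetical choices made for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def top_snps(X, y, k):
    """Indices of the k SNP columns with the largest |Pearson r| against the labels."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def centroid_predict(X_train, y_train, x, cols):
    """Stand-in classifier (assumption): nearest class mean on the selected columns."""
    best, best_d = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        d = sum((x[j] - sum(r[j] for r in rows) / len(rows)) ** 2 for j in cols)
        if d < best_d:
            best, best_d = label, d
    return best


def cross_validate(X, y, k_snps, folds=5, seed=0):
    """Mean accuracy over folds; the SNP ranking is redone on each training split."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    accs = []
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        cols = top_snps(X_tr, y_tr, k_snps)  # selection uses training samples only
        hits = sum(centroid_predict(X_tr, y_tr, X[i], cols) == y[i] for i in test)
        accs.append(hits / len(test))
    return sum(accs) / len(accs)
```

Under the same assumptions, a cross-study validation would call `top_snps` once on the full original data set and then score `centroid_predict` on samples from the external data set, rather than on held-out folds.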

 

Keywords: exome wide association studies; chronic lymphocytic leukaemia; machine learning; disease risk prediction; cross-validation; cross-study validation; exome sequences; bioinformatics; next-generation sequencing; single nucleotide polymorphisms; SNPs; SNP variants.

 

DOI: 10.1504/IJDMB.2016.10000562

 

Int. J. of Data Mining and Bioinformatics, 2016 Vol.16, No.1, pp.47 - 63

 

Available online: 04 Oct 2016

 

 
