Title: Leveraging machine learning to advance genome-wide association studies

Authors: Gabrielle Dagasso; Yan Yan; Lipu Wang; Longhai Li; Randy Kutcher; Wentao Zhang; Lingling Jin

Addresses: Department of Mathematics and Statistics, Thompson Rivers University, Kamloops, British Columbia, Canada; Department of Computing Science, Thompson Rivers University, Kamloops, British Columbia, Canada; Department of Plant Sciences, University of Saskatchewan, Saskatoon, Saskatchewan, Canada; Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, Saskatchewan, Canada; Department of Plant Sciences, University of Saskatchewan, Saskatoon, Saskatchewan, Canada; National Research Council of Canada, Saskatoon, Saskatchewan, Canada; Department of Computer Science, University of Saskatchewan, Saskatoon, Saskatchewan, Canada

Abstract: Genome-Wide Association Studies (GWAS) have demonstrated their power in discovering genetic variations associated with particular traits related to agronomically important features in crops. The typical output of a GWAS program is a series of Single Nucleotide Polymorphisms (SNPs) and their significance values. Currently, there is no standard way to compare results across different programs or to select the most 'significant' results uniformly and consistently. To obtain a comprehensive and accurate set of SNPs associated with a trait of interest, we present a novel automated pipeline that leverages machine learning for GWAS discoveries. The pipeline first performs population structure analysis, then runs multiple GWAS programs and combines their results into a single SNP set. Next, it applies Least Absolute Shrinkage and Selection Operator (LASSO) analysis to select from that set the SNPs with high individual and/or joint effects. Finally, the predictive ability of the model is assessed using cross-validation.
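The combine, select and validate steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the SNP names, the two tool hit lists, the simulated genotypes and phenotype, and the penalty values are all hypothetical, the LASSO is a plain coordinate-descent solver written with the Python standard library only, and the population structure step is omitted.

```python
import random

def union_snps(*result_sets):
    """Merge the SNP IDs reported by several GWAS tools into one sorted list."""
    combined = set()
    for hits in result_sets:
        combined.update(hits)
    return sorted(combined)

def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_fit(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - b0 - Xb||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred genotypes
    resid = [v - y_mean for v in y]                              # residual with beta = 0
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic SNP carries no information
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_b
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta

def predict(intercept, beta, X):
    return [intercept + sum(b * x for b, x in zip(beta, row)) for row in X]

def cv_mse(X, y, lam, k=5):
    """k-fold cross-validated mean squared prediction error for one penalty."""
    n = len(X)
    sq_errors = []
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        b0, beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        preds = predict(b0, beta, [X[i] for i in test])
        sq_errors.extend((p_hat - y[i]) ** 2 for p_hat, i in zip(preds, test))
    return sum(sq_errors) / len(sq_errors)

# Hypothetical hit lists from two GWAS tools (illustrative names only).
tool_a_hits = {"snp0", "snp1", "snp3"}
tool_b_hits = {"snp3", "snp4"}
candidates = union_snps(tool_a_hits, tool_b_hits)

# Simulated data: genotypes coded 0/1/2; only snp0 and snp3 affect the trait.
rng = random.Random(0)
n_samples = 120
X = [[rng.randint(0, 2) for _ in candidates] for _ in range(n_samples)]
col = {name: j for j, name in enumerate(candidates)}
y = [1.5 * row[col["snp0"]] - 1.0 * row[col["snp3"]] + rng.gauss(0.0, 0.3) for row in X]

_, coefs = lasso_fit(X, y, lam=0.3)
selected = [name for name, b in zip(candidates, coefs) if b != 0.0]
print("candidates:", candidates)
print("selected:  ", selected)
print("5-fold CV MSE at lam=0.05:", round(cv_mse(X, y, 0.05), 3))
```

In this toy setting the penalty shrinks the coefficients of the two SNPs that do not influence the simulated trait to exactly zero, which is the property that makes LASSO usable as a selection step. In practice the penalty would be chosen by comparing cross-validated error over a grid of values rather than fixed in advance.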

Keywords: genome-wide association studies; machine learning; population structure analysis; cross-validation; LASSO; fusarium head blight.

DOI: 10.1504/IJDMB.2021.116881

International Journal of Data Mining and Bioinformatics, 2021 Vol.25 No.1/2, pp.17 - 36

Received: 23 Mar 2021
Accepted: 05 Apr 2021

Published online: 05 Aug 2021
