International Journal of Data Mining and Bioinformatics (7 papers in press)
Analysis of COVID-19 genetic risk susceptibility using UK Biobank SNP genotype data
by Taewan Goo, Kyulhee Han, Catherine Apio, Taesung Park
Abstract: The coronavirus disease 2019 (COVID-19) has become a global pandemic. Here, we performed a study on host susceptibility to COVID-19 infection using COVID-19 test results and genomic data released by UK Biobank until early October 2020. The data consisted of 27,713 samples, including 2,740 positive cases. We employed genome-wide association, gene-level association and pathway analyses using common and rare variants. Among these analyses, only the pathway analysis based on rare variants yielded significant results, identifying seven significant pathways. Among them, the JAK-STAT pathway and the glycolipid biosynthesis pathway have been reported to be associated with viral infection, especially COVID-19 infection. Further, we found new pathways that were not previously reported, including pathways related to cellular signalling such as the NLR signalling pathway. Additional experiments and studies of these pathways may unveil the pathophysiology of COVID-19 and identify highly susceptible groups.
Keywords: COVID-19; GWAS; host genetics; infection susceptibility; pathway analysis.
Leveraging machine learning to advance genome-wide association studies
by Gabrielle Dagasso, Yan Yan, Lipu Wang, Longhai Li, Randy Kutcher, Wentao Zhang, Lingling Jin
Abstract: Genome-Wide Association Studies (GWAS) have demonstrated their power in discovering genetic variations linked to particular traits related to agronomically important features in crops. The typical output of a GWAS program includes a series of Single Nucleotide Polymorphisms (SNPs) and their significance. Currently, there is no standard way to compare results across different programs or to select the most 'significant' results uniformly and consistently. To obtain a comprehensive and accurate set of SNPs associated with a trait of interest, we present a novel automated pipeline that leverages machine learning for GWAS discoveries. The pipeline first performs population structure analysis, then runs multiple GWAS programs and combines their results into a single SNP set. After that, it selects SNPs from the set with high individual and/or joint effects using Least Absolute Shrinkage and Selection Operator (LASSO) analysis. Finally, the predictivity of the model is assessed using cross-validation.
Keywords: genome-wide association studies; machine learning; population structure analysis; cross-validation; LASSO; fusarium head blight.
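The merge and selection steps described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the choice to merge tool outputs as a union keeping the smallest p-value per SNP, the toy genotype matrix and the penalty value are all assumptions made for illustration. The LASSO step is a plain coordinate-descent solver for centred data without an intercept.

```python
def merge_snp_hits(results_by_tool):
    """Union of SNP hits across tools, keeping the smallest p-value per SNP.

    Assumption: the abstract does not say how results are combined;
    a union with the minimum p-value is one simple possibility.
    """
    merged = {}
    for hits in results_by_tool.values():
        for snp, p_value in hits.items():
            merged[snp] = min(p_value, merged.get(snp, 1.0))
    return merged


def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; return 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=100):
    """Minimise (1/2n) * ||y - X b||^2 + penalty * ||b||_1.

    X is a list of rows (samples) with centred columns; y is centred.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, penalty) / (norm / n) if norm > 0 else 0.0
    return beta


# Hypothetical tool outputs: SNP identifier -> p-value.
merged = merge_snp_hits({
    "tool_a": {"snp1": 1e-6, "snp2": 1e-3},
    "tool_b": {"snp1": 1e-8, "snp3": 1e-4},
})

# Toy centred genotype matrix: the trait depends on the first SNP only.
X = [[-1, 1], [0, -2], [1, 1], [-1, 1], [0, -2], [1, 1]]
y = [2 * row[0] for row in X]
beta = lasso_coordinate_descent(X, y, penalty=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

In this toy case only the first column keeps a non-zero coefficient, which is the behaviour that makes LASSO usable as a SNP selector. In practice the penalty would be chosen by cross-validation, as the abstract indicates for assessing predictivity.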
Structural variation calling and genotyping by moment-based deep convolutional neural networks
by Timothy Becker, Dong-Guk Shin
Abstract: Structural Variation (SV) calling and genotyping remain an ongoing challenge with next-generation sequencing technologies. The gold standard approach for genome consortia has been to utilise multiple SV calling algorithms and then merge the results based on SV type and coordinates and, more recently, to make use of multiple sequencing technologies for each sample cell line. This ensemble strategy provides more comprehensive SV calling but comes at the cost of high computational run time. We make use of popular open-source machine learning libraries to formulate a new data representation suitable for mining whole genome sequences in a fraction of the ensemble time. We then compare the results to several well-established methods and ensembles. Our pure machine learning method demonstrates a new direction in technique, where feature selection and region filtering are no longer required to achieve desirable false positive rates.
Keywords: genomic variation; structural variation; data representation; moment-based tensors; machine learning; convolutional neural networks.
Detection of foetal single gene mutations using only maternal blood samples
by Junghyun Namkung
Abstract: Non-Invasive Prenatal Testing (NIPT) is widely applied for chromosomal aneuploidy but not yet for monogenic diseases. In this study, we have developed new analysis algorithms for detecting foetal single gene mutations that are linked to a Mendelian disease by sequencing maternal blood samples. The proposed algorithm uses two approaches to determine the foetal mutation status. If the mutation is a duplication, we use the allele frequency of the heterozygous site, and if the mutation is a deletion, we use the ratio of the relative read depth of cell-free DNA to that of the parent genomic DNA. The algorithms were applied to real data consisting of four pairs of sequencing results generated using peripheral blood samples from two pregnant women. Both sample providers have a first child with Duchenne Muscular Dystrophy (DMD), a typical X-linked recessive disorder. Sequences were generated using massively parallel sequencing technologies with a targeted sequencing approach.
Keywords: non-invasive prenatal testing; NIPT; cffDNA; monogenic disease.
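The two quantities named in the abstract above, an allele frequency at a heterozygous site and a ratio of relative read depths, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the function names are hypothetical, "relative read depth" is assumed to mean depth in the target region divided by depth in a control region, and no decision thresholds are given because the abstract does not state any.

```python
def allele_fraction(alt_reads, total_reads):
    """Fraction of reads carrying the alternative allele at one site."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def relative_depth(target_depth, control_depth):
    """Depth in the target region normalised by a control region.

    Assumption: the normalisation used in the paper is not specified
    in the abstract; a simple target/control ratio is used here.
    """
    if control_depth <= 0:
        raise ValueError("control_depth must be positive")
    return target_depth / control_depth


def depth_ratio(cfdna_target, cfdna_control, gdna_target, gdna_control):
    """Relative depth in cell-free DNA divided by that in parental genomic DNA."""
    gdna_relative = relative_depth(gdna_target, gdna_control)
    if gdna_relative == 0:
        raise ValueError("genomic DNA relative depth must be non-zero")
    return relative_depth(cfdna_target, cfdna_control) / gdna_relative


# Made-up read counts, for illustration only.
example_fraction = allele_fraction(alt_reads=30, total_reads=100)
example_ratio = depth_ratio(cfdna_target=50, cfdna_control=100,
                            gdna_target=100, gdna_control=100)
```

How such values map to a foetal mutation call depends on the foetal fraction and on the statistical model used, neither of which is described in the abstract.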
Application of SNPViz v2.0 using next-generation sequencing data sets in the discovery of potential causative mutations in candidate genes associated with phenotypes
by Shuai Zeng, Mária Škrabišová, Zhen Lyu, Yen On Chan, Nicholas Dietz, Kristin Bilyeu, Trupti Joshi
Abstract: Single Nucleotide Polymorphisms (SNPs) and insertions/deletions (Indels) are the most common biological markers, widely spread across all chromosomes of the genome. Owing to the large amount of SNP and Indel data that has become available during the last ten years, it is a challenge to intuitively integrate, compare, or visualise them and effectively perform analysis across multiple samples simultaneously. Genome-Wide Association Studies (GWAS) are an approach to finding genetic variants associated with a trait, but they lack an efficient way of investigating genomic variant functions. To tackle these issues, we developed SNPViz v2.0, a web-based tool designed to visualise large-scale haplotype blocks with detailed SNPs and Indels grouped by their chromosomal coordinates, along with their overlapping gene models, phenotype to genotype accuracies, Gene Ontology (GO), protein families (Pfam), and their functional effects. SNPViz v2.0 is available in both SoyKB and KBCommons. For soya bean only, SNPViz v2.0 is available online at: http://soykb.org/SNPViz2/. For other plants such as Arabidopsis thaliana and Zea mays, SNPViz v2.0 in their respective knowledge bases is available online at: https://kbcommons.org.
Keywords: SNP; NGS; genotypes; phenotypes; visualisation.
Evaluation of statistical methods for the analysis of crossover designs with repeated measurements
by Md. Kamruzzaman, Yonggab Kim, Yeni Lim, Oran Kwon, Taesung Park
Abstract: The crossover design is a type of longitudinal study used in clinical trials to evaluate the effectiveness of new drugs and new treatments. In the crossover design, each subject is subsequently switched through all treatments after a washout period. Although the linear mixed-effects model is one of the commonly used methods for crossover designs, it sometimes suffers from convergence problems. In this study, we adopted generalised estimating equations for the crossover design by shifting the position of the variables so that the independent variables of the linear mixed models are regarded as the response variables. The advantage of the generalised estimating equation model lies in its simple computation and relative ease of use. A simulation study showed that the power of generalised estimating equation models is comparable to or slightly better than that of the linear mixed-effects model.
Keywords: correlated data; crossover design; mixed effects model; generalised estimating equation model; local odds ratio.
Dengue fever prediction modelling using data mining techniques
by Wipawan Buathong, Pita Jarupunphol
Abstract: This research experiments with several combinations of feature selection techniques and classifiers to obtain the most efficient classification model for predicting dengue fever. The features of relationship patterns for predicting dengue fever were investigated. In order to obtain the most effective classification model, several feature selection techniques were ranked and tested with well-recognised classifiers. The measurement results of the different models were illustrated and compared. The most efficient model is a neural network with three layers, each containing 100 nodes with the ReLU activation function. Using five features selected by information gain, it achieved 64.9% accuracy, 71.8% F-measure, 65.7% precision, and 79.0% recall. Other competitive machine learning models with similar efficiency are: (1) the combined Naive Bayes and information gain; (2) the combined neural network and ReliefF; (3) the combined Naive Bayes and FCBF. SVM, on the other hand, is considered the least efficient model when combined with the selected feature selection techniques.
Keywords: dengue fever; data mining; classification; feature selection; ranking.
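The F-measure reported in the abstract above is linked to the reported precision and recall by the standard harmonic-mean formula. The short Python sketch below recomputes it from the published, rounded figures; the function and variable names are illustrative and not taken from the paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures as reported in the abstract, rounded to one decimal place.
reported_precision = 0.657
reported_recall = 0.790

recomputed_f = f_measure(reported_precision, reported_recall)
print(f"{recomputed_f:.3f}")  # prints 0.717
```

The recomputed value of about 71.7% is close to the reported 71.8%; the small gap is compatible with rounding of the published precision and recall.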