International Journal of Data Mining and Bioinformatics (10 papers in press)
by Yan Yan, Connor Burbridge, Jinhong Shi, Juxin Liu, Anthony Kusalik
Abstract: Many software packages have been developed for genome-wide association studies (GWAS) based on various statistical models. One key factor influencing the statistical reliability of GWAS is the amount of input data used. In this paper we investigate how input data quantity influences the output of four widely used GWAS programs, PLINK, TASSEL, GAPIT, and FaST-LMM, in the context of plant genomes and phenotypes. Both synthetic and real data are used. Evaluation is based on p- and q-values of output SNPs, and on the Kendall rank correlation between output SNP lists. Results show that, for the same GWAS program, different Arabidopsis thaliana datasets demonstrate similar trends of rank correlation as input quantity is varied, but differ in the numbers of SNPs passing a given p- or q-value threshold. We also show that variations in the number of replicates influence the p-values of SNPs, but do not strongly affect the rank correlation.
Keywords: GWAS; Genome-Wide Association Study; Arabidopsis thaliana; plant phenomics; plant genomics; PLINK; TASSEL; GAPIT; FaST-LMM; statistical power; input data quantity; epistasis.
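The rank-correlation comparison described in the abstract above can be illustrated with a minimal sketch. It computes Kendall's tau-a (no tie correction) between the p-values that two runs assign to the same SNPs. The function name, the five p-values, and the choice of tau-a are assumptions made for illustration; the abstract does not state which variant of the coefficient the authors use.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two equal-length score lists (ties count as neither)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # A pair is concordant when both lists order items i and j the same way.
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical p-values for the same five SNPs from a full and a reduced input set.
run_full = [1e-8, 3e-6, 2e-4, 0.01, 0.3]
run_half = [5e-7, 1e-4, 2e-5, 0.05, 0.4]
tau = kendall_tau(run_full, run_half)
print(tau)
```

Only one of the ten SNP pairs changes order between the two hypothetical runs, so the coefficient stays high even though every individual p-value differs.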
Fast practical on-line exact single and multiple pattern matching algorithms in highly similar sequences
by Nadia Ben Nsira, Thierry Lecroq, Elise Prieur-Gatson
Abstract: With the advent of high-throughput sequencing technologies, more and more genomic sequences of individuals of the same species are available. These sequences differ by only a very small number of variations. There is thus a strong need for efficient algorithms that perform fast pattern matching in such specific sets of sequences. In this paper we propose efficient practical algorithms that solve the on-line exact pattern matching problem in a set of highly similar DNA sequences. We first present a method for exact single pattern matching when k variations are allowed in a window whose size is equal to the pattern length. We then propose an algorithm for exact multiple pattern matching when only one variation is allowed in a window whose size is equal to the length of the longest pattern. Experimental results show that our algorithms, though not optimal in the worst case, perform well in practice.
Keywords: Computational biology; bioinformatics; algorithm design; pattern matching; string matching; DNA sequences; genomic sequences; Landau-Vishkin algorithm; Aho-Corasick algorithm; similar sequences.
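For readers unfamiliar with the Aho-Corasick algorithm named in the keywords above, the following is a minimal sketch of the classic automaton for exact multiple pattern matching in a single text. It is not the authors' algorithm: it does not handle sets of highly similar sequences or variations inside a window, and all function names and the sample sequence are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists (classic Aho-Corasick)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised when reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first traversal so that shallower failure links are ready first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every exact occurrence, in order of end position."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((pos - len(pat) + 1, pat))
    return hits


hits = find_all("GATTACAGATT", ["ATT", "TACA", "GA"])
print(hits)
```

The text is scanned once, left to right, which is what makes the approach on-line: each character is consumed as it arrives and never revisited.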
Topological data analysis can extract subgroups with high incidence rates of Type 2 diabetes
by Hyung Sun Kim, Chahngwoo Yi, Yongkang Kim, Uhnmee Park, Woong Kook, Bermseok Oh, Hyuk Kim, Taesung Park
Abstract: Type 2 diabetes (T2D) is a rapidly increasing worldwide scourge, and the identification of genetic contributors is vital. However, analysing multiple disease-contributing factors and their combined interactions remains quite difficult with traditional approaches. Topological data analysis (TDA) shows what shape a data set can have and facilitates clustering analysis by determining which components are close to each other. Thus, TDA can generate a network from single-nucleotide polymorphism (SNP) data that reveals the genetic relatedness of specific individuals, and can derive multiple ordered subgroups, from one with a low patient concentration to one with a high patient concentration. Since it is widely accepted that T2D pathogenesis is affected by multiple genetic factors, we performed TDA on T2D data from the Korea Association REsource (KARE) project, a population-based, genome-wide association study of the Korean adult population. Since KARE data contain follow-up information about the incidence of T2D, we compared the T2D status of each individual at baseline with that of 10 years later. For the TDA network-driven subgroups, ordered by prevalence, we compared the T2D incidence rate after 10 years for individuals initially without T2D. We found that the TDA network-driven, ordered subgroups had significantly increased incidence rates, linearly correlated with prevalence (p-value = 0.006914). Our results demonstrate the usefulness of TDA in identifying both genetic contributors (e.g., SNPs) and their interrelationships in the pathology of complex diseases.
Keywords: type 2 diabetes; Korea Association REsource (KARE); single-nucleotide polymorphism; topological data analysis; network; subgroup analysis.
Exploratory analysis for detecting population structures by iterative pruning based on independent component analysis
by Mira Park, Eunbin Choi, Heonsu Lee, Yongkang Kim, Taesung Park
Abstract: One of the main issues in genome-wide association studies (GWAS) is detecting population stratification. As population stratification can produce spurious associations in GWAS, it is necessary to find hidden structures and assign individuals to subpopulations in advance. We suggest an exploratory approach for population structure analysis based on independent component analysis (ICA). ICA is an unsupervised approach to identifying and separating mixed sources from observed signals with little prior information. Both ICA and principal component analysis (PCA) reduce data into smaller sets of components. However, unlike PCA, ICA can treat non-Gaussian data and use higher moments. To determine the population structure, we first reduce the dimensionality of samples by projecting the data to a lower-dimensional subspace built by ICA. The samples are then bisected using fuzzy clustering. By repeating this procedure until some predetermined stopping criterion is met, we can detect the population structure and assign individuals to subpopulations. Information about the optimal number of subpopulations can also be obtained. We consider negative entropy as a measure of the importance of each component. To assess the proposed method, we analyze simulated genotypic data with different degrees of structure. We compare the proposed method to current PCA-based methods. Real data from the HapMap project are also analyzed for illustration.
Keywords: Independent component analysis; pruning; genome-wide association study; negative entropy; subpopulation.
Protein Fold Recognition Model based on Cubic Lattice
by Farzad Peyravi, Ali Mohammad Latif, Seyed Mohammad Moshtaghioun
Abstract: Proteins are essential for the biological processes in the human body. They can only perform their functions when they fold into their tertiary structure. We propose a novel fold recognition method for protein tertiary structure prediction based on a hidden Markov model and the 3D coordinates of amino acid residues. The method introduces states based on the basis vectors of the Bravais cubic lattice to recognize the fold of proteins. In the overall experiment, the accuracy of the proposed model is better than that of SAM, optimized 3-HMM, and Markov chain models.
Keywords: Protein structure prediction; Tertiary Structure; Fold recognition; Hidden Markov model; Bravais lattice; Cubic lattice.
Effective Induction of Gene Regulatory Networks Using a Novel Recommendation Method
by Makbule Gulcin Ozsoy, Faruk Polat, Reda Alhajj
Abstract: In this paper, we introduce a method based on recommendation systems to predict the structure of Gene Regulatory Networks (GRNs) using data from multiple sources. Our method is based on a collaborative filtering approach enhanced with multiple criteria to predict the relationships of genes, i.e., which genes regulate others. We conduct experiments on two data sets to demonstrate the applicability and sustainability of our proposal. The first data set is composed of microarray data and transcription factor binding data, and it is evaluated by precision, recall and the F1-measure. The second data set is the Dream4 In Silico Network Challenge data set, and it is evaluated by the measures used during the challenge, namely the area under the precision-recall curve (AUC-PR), the area under the receiver operating characteristic curve (AUC-ROC) and their averages. The experimental results show that applying algorithms from the recommendation systems domain to the problem of inferring GRN structures is effective. We also observed that combining information from multiple data sets gives better results.
Keywords: gene regulatory networks; recommendation systems; collaborative filtering; multiple data sources; Pareto dominance.
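The precision, recall and F1-measure evaluation mentioned in the abstract above can be sketched as a comparison between a set of predicted regulatory edges and a set of reference edges. This is a minimal illustration only: the function name, the gene labels, and the representation of an edge as a (regulator, target) pair are assumptions, not details taken from the paper.

```python
def precision_recall_f1(predicted_edges, reference_edges):
    """Score predicted (regulator, target) edges against a reference edge set."""
    predicted, reference = set(predicted_edges), set(reference_edges)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical edges: each pair reads "first gene regulates second gene".
predicted = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
reference = [("A", "B"), ("B", "D"), ("D", "E")]
precision, recall, f1 = precision_recall_f1(predicted, reference)
print(precision, recall, f1)
```

Here two of the four predicted edges are correct (precision 0.5) and two of the three reference edges are recovered (recall about 0.67), which gives an F1 of 4/7.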
KeSACNN: a protein-protein interaction article classification approach based on deep neural network
by Ling Luo, Zhihao Yang, Lei Wang, Yin Zhang, Hongfei Lin, Jian Wang
Abstract: Automatic classification of protein-protein interaction (PPI) relevant articles from biomedical literature is a crucial step for biological database curation, since it can help reduce the curation burden at the initial stage. However, most popular PPI article classification methods are based on traditional machine learning, and their performance is heavily dependent on feature engineering. In recent years, PPI article classification with neural networks has gained increasing attention, but domain knowledge has rarely been used in these methods. Aiming to exploit domain knowledge, we propose a domain Knowledge-enriched Self-Attention Convolutional Neural Network (KeSACNN) approach for PPI article classification. In this approach, two knowledge embeddings are proposed, and novel convolutional neural network architectures with a self-attention mechanism are designed to leverage biomedical knowledge. The experimental results show that our method achieves state-of-the-art performance on the BioCreative II and III corpora (82.92% and 67.93% in F-scores, respectively).
Keywords: PPI article classification; self-attention; convolutional neural network; domain knowledge.
Chemical-protein Interaction Extraction from Biomedical Literature: a Hierarchical Recurrent Convolutional Neural Network Method
by Cong Sun, Zhihao Yang, Lei Wang, Yin Zhang, Hongfei Lin, Jian Wang, Liang Yang, Kan Xu, Yijia Zhang
Abstract: Mining chemical-protein interactions plays a vital role in biomedical tasks such as knowledge graph construction, pharmacology, and clinical research. Although chemical-protein interactions can be manually curated from the biomedical literature, the process is difficult and time-consuming. Hence, it is of great value to automatically obtain chemical-protein interactions from biomedical literature. Recently, the most popular methods have been based on neural networks, which avoid complex manual processing. However, their performance is usually limited by lengthy and complicated sentences. To address this limitation, we propose a novel model, the hierarchical recurrent convolutional neural network (HRCNN), to learn hidden semantic and syntactic features from sentence sub-sequences effectively. Our approach achieves an F-score of 65.56% on the CHEMPROT corpus and outperforms the state-of-the-art systems. The experimental results demonstrate that our approach can greatly alleviate the weakness of existing methods on long sentences.
Keywords: Chemical-protein interaction; Data mining; Relation extraction; Hierarchical neural network; Recurrent Convolutional Neural Network.
An Integrated Multivariate Group Sparse Approach to Identify Differentially Expressed Genes of Breast Cancer Data
by Namalee Napagoda
Abstract: Identifying differentially expressed genes plays an important role in disease diagnosis and prognosis. In this study, we use Student's t statistic to analyze genes in publicly available breast cancer data. Different t values for the same gene from multiple data sets cannot be used separately to identify cancer-related genes. The presence of noise in gene expression data may also affect the performance of the study. Therefore, we develop an Integrated Multivariate Group Sparse (IMGS) model based on the combined Student's t statistic of multiple independent data sets. Stability selection is used to identify the optimal values of the tuning parameter in the IMGS method. We illustrate the performance of Student's t statistic, GeneMeta, metaMa and the IMGS model on breast cancer genes, with reference genes from GWAS. According to the results, the IMGS model is a more appropriate statistical approach than the other three methods for identifying the most significant genes in multiple gene expression data sets.
Keywords: Differentially expressed genes; GeneMeta; IMGS; metaMa; Stability selection; Student’s t statistic.
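As a reminder of the per-gene quantity that the abstract above starts from, the following minimal sketch computes the classic two-sample Student's t statistic with pooled variance for one gene. It does not implement the IMGS model or the way the authors combine statistics across data sets; the function name and the expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Two-sample Student's t statistic with pooled variance (equal-variance form)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least 2 observations")
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Hypothetical expression values of one gene in tumour and normal samples.
tumour = [5.0, 6.0, 7.0]
normal = [2.0, 3.0, 4.0]
t_value = student_t(tumour, normal)
print(round(t_value, 4))
```

Running the same calculation on the same gene in a second data set would generally give a different t value, which is the situation the abstract describes as motivating an integrated model.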
Mining Hub Genes from RNA-Seq Gene Expression Data using Biclustering Algorithm
by Ankush Maind, Shital Raut
Abstract: Biclustering is a widely used data mining technique for the analysis of gene expression data. Recently, multiple biclustering algorithms have been designed for finding co-expressed genes in microarray gene expression data. Microarray data has some drawbacks, and RNA-Seq, an advanced high-throughput technique, was introduced to overcome them. In this paper, we introduce a new approach for identifying hub genes from RNA-Seq data using a biclustering algorithm. For mining biclusters, the efficient 'runibic' biclustering algorithm is used. The 'runibic' algorithm performs well on various issues such as overlapping, noise, stable output, accuracy, large-scale data, and biological significance. For each significant bicluster, we construct a gene co-expression network (GCN). Each constructed GCN is then used for identifying hub genes. The identified hub genes are specific to subsets of experimental conditions. The extracted hub genes can be useful in several clinical applications as prognostic or diagnostic markers of diseases.
Keywords: biclustering; RNA-Seq data; data mining; bioinformatics; gene co-expression network; hub gene; biomarker.
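The last two steps described in the abstract above, building a gene co-expression network and picking hub genes from it, can be sketched as follows. The abstract does not say how edges or hubs are defined, so this sketch assumes an edge wherever the absolute Pearson correlation meets a threshold and treats the highest-degree genes as hubs. The threshold, the function names, and the expression values are all hypothetical, and the 'runibic' biclustering step is not reproduced.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def hub_genes(expression, threshold=0.7, top_k=1):
    """Rank genes by degree in a thresholded co-expression network."""
    degree = {gene: 0 for gene in expression}
    for gene_a, gene_b in combinations(expression, 2):
        # Assumed edge rule: connect genes whose |correlation| meets the threshold.
        if abs(pearson(expression[gene_a], expression[gene_b])) >= threshold:
            degree[gene_a] += 1
            degree[gene_b] += 1
    ranked = sorted(degree, key=lambda gene: (-degree[gene], gene))
    return ranked[:top_k], degree


# Hypothetical expression of four genes across the four conditions of one bicluster.
bicluster = {
    "g1": [7, 5, 5, 3],
    "g2": [6, 4, 6, 4],
    "g3": [6, 6, 4, 4],
    "g4": [6, 4, 4, 6],
}
hubs, degree = hub_genes(bicluster)
print(hubs, degree)
```

In this toy bicluster g1 correlates with both g2 and g3 while those two are uncorrelated with each other, so g1 ends up with the highest degree. Because the network is built per bicluster, a hub found this way is specific to that bicluster's subset of conditions.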