International Journal of Data Mining and Bioinformatics (29 papers in press)
Molecular pathway identification using a new L1/2 solver and biological network-constrained model
by Hai-Hui Huang, Xiao-Ying Liu, Hui-Min Li, Yong Liang
Abstract: Molecular research is moving into the big data era. Various large-scale, widely used databases have been compiled from different biological processes. Integrating such valuable information with statistical models may shed light on how human cells work from a system-level perspective. In this article, we propose a novel penalized network-constrained regression model with a new L1/2 solver for integrating gene regulatory networks into the analysis of gene expression data, where the network is graph Laplacian regularized. Extensive simulation studies showed that our proposed approach outperforms L1 regularization and the old L1/2 regularization in prediction accuracy and predictive stability. We also apply our method to three kinds of cancer datasets. In particular, our method achieves comparable or higher predictive accuracy than the old L1/2 solver and L1 regularization approaches, while selecting fewer but informative genes and pathways.
Keywords: big data; network analysis; variable selection; regularization; L1/2 penalty.
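For orientation, an objective of the kind the abstract describes can be written as follows. This is only a sketch: the squared-error loss, the tuning parameters \lambda_1 and \lambda_2, and the unnormalized Laplacian L = D - A are assumptions made for illustration, and the authors' exact formulation may differ.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert^{1/2}
  \;+\; \lambda_2\, \beta^{\top} L\, \beta ,
\qquad L = D - A
```

Here A is the adjacency matrix of the gene network and D its diagonal degree matrix, so that \beta^{\top} L \beta = \sum_{(u,v)} A_{uv} (\beta_u - \beta_v)^2 penalizes differences between coefficients of linked genes, while the L1/2 term encourages sparsity.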
A combinational logic network based on binarized gene expression in gastric cancer
by Sungjin Park, Seungyoon Nam
Abstract: In general, gene circuit networks are employed for analyzing time-dependent gene expression datasets, known as time series. However, cancer genomics datasets acquired by next-generation sequencing, which are measured once at a particular point in time (static), have accumulated for enormous numbers of patients. Here, we present a combinational logic network for static gene expression datasets that combines the structural compositions and all sample values using Boolean algebra expressions, rather than using representative values. We then validate this approach by applying it to a real cancer patient dataset, demonstrating the feasibility of using combinational logic networks for graphically representing static gene expression datasets.
Keywords: Boolean logic model; network model; gastric cancer.
Pupylation Sites Prediction with Ensemble Classification Model
by Wenzheng Bao, Zhenhua Huang, Chang-An Yuan, De-Shuang Huang
Abstract: Post-translational modification of proteins is one of the most important biological processes in the field of proteomics and bioinformatics. Pupylation is a novel post-translational modification in which the small, intrinsically disordered prokaryotic ubiquitin-like protein is conjugated to lysine residues of potential segments. Both experimental and computational identification of such modified sites have proved challenging. Computational methods have mainly aimed at extracting effective features from the potential protein segments. In this paper, a statistical feature of adjacent amino acid residues is proposed; the novel feature combines the appearance of adjacent amino acids with the BLOSUM62 matrix. The Neural Network and the Na
Keywords: lysine pupylation; neural network; naïve bayes; post-translational modification.
Discovery of deep order-preserving submatrix in DNA microarray data based on sequential pattern mining
by Zhiwen Liu, Yun Xue, Meihang Li, Bo Ma, Meizhen Zhang, Xin Chen, Xiaohui Hu
Abstract: In recent years, the order-preserving submatrix (OPSM) model has been widely used in many fields, such as gene expression data analysis, recommender systems and financial exploration. Since it focuses on the changes between the elements rather than the real values, it shows better robustness to the data and better statistical significance of results than other models do. However, mining OPSMs is an NP-hard problem, and many current methods are heuristic, so they cannot mine all OPSMs, in particular the deep OPSMs, in the data matrix. Biologists are interested in deep OPSMs, which are of biological significance but are generally ignored by heuristic methods. In this article, an exact algorithm is proposed to find OPSMs by using a frequent sequential pattern mining method. First, we find all common subsequences (ACS) between any two rows through dynamic programming, so that no pattern is missed. Then, we store the ACS in a suffix tree, a data structure that improves the efficiency of the algorithm. In this way, we can obtain all OPSMs that meet the row and column thresholds in this suffix tree, including the deep OPSMs. Real gene data and artificially synthesized data are employed to verify this algorithm, and the results indicate that it is an efficient and meaningful method.
Keywords: OPSM; frequent sequential pattern; all common subsequences; dynamic programming.
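To make the OPSM notion concrete, the sketch below checks which rows of a matrix are strictly increasing under a given column order; a set of such rows together with those columns forms an order-preserving submatrix. It illustrates the definition only and is not the authors' ACS and suffix-tree algorithm; the function name and the toy matrix are hypothetical.

```python
def rows_supporting_order(matrix, column_order):
    """Return indices of rows whose values strictly increase along column_order.

    Rows returned here, restricted to column_order, form an
    order-preserving submatrix (OPSM) in the usual sense.
    """
    supporting = []
    for index, row in enumerate(matrix):
        values = [row[c] for c in column_order]
        # A row supports the order if each value is smaller than the next one.
        if all(a < b for a, b in zip(values, values[1:])):
            supporting.append(index)
    return supporting


# Toy expression matrix (hypothetical data): 3 genes x 3 conditions.
toy_matrix = [
    [1, 5, 3],
    [2, 9, 4],
    [7, 1, 2],
]
supporting_rows = rows_supporting_order(toy_matrix, [0, 2, 1])
```

In this toy case the first two rows follow the order of columns 0, 2, 1, while the third does not.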
A Markov blanket-based approach for finding high-dimensional genetic interactions associated with disease in family-based studies
by Hyo Jung Lee, Jae Won Lee, Hee Jeong Yoo, Seohoon Jin, Mira Park
Abstract: Detecting genetic interactions associated with complex disease is a major issue in genetic studies. Although a number of methods to detect gene-gene interactions for population-based genome-wide association studies (GWAS) have been developed, statistical methods for family-based GWAS have been limited. In this study, we propose a new Bayesian approach called MB-TDT to find high-order genetic interactions in pedigree data. The MB-TDT method combines the Markov blanket algorithm with the classical transmission disequilibrium test (TDT) statistic. The incremental association Markov blanket (IAMB) algorithm was adopted for large-scale Markov blanket discovery. We evaluated the proposed method using both real and simulated datasets. In a simulation study, we compared the power of MB-TDT with conditional logistic regression, multifactor dimensionality reduction (MDR) and MDR-pedigree disequilibrium test (MDR-PDT). We demonstrated the superior power of MB-TDT in many cases. To demonstrate the approach, we analyzed the Korean autism disorder GWAS data. The MB-TDT method can identify a minimal set of causal SNPs associated with a specific disease, thus avoiding an exhaustive search.
Keywords: Genetic associations; Gene-gene interactions; Markov blanket; Pedigree data; Transmission disequilibrium test.
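As background for the TDT component, the classical single-marker TDT statistic compares how often heterozygous parents transmit an allele versus do not transmit it. The sketch below computes that textbook statistic only; it is not the MB-TDT method, and the function name and counts are hypothetical.

```python
def tdt_statistic(transmitted, untransmitted):
    """Classical TDT statistic (b - c)^2 / (b + c).

    transmitted (b): times heterozygous parents transmitted the allele.
    untransmitted (c): times they did not.
    Under the null hypothesis it is approximately chi-square with 1 df.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("at least one informative transmission is required")
    return (transmitted - untransmitted) ** 2 / total


# Hypothetical counts from heterozygous parents.
statistic = tdt_statistic(60, 40)
```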
Validation of the merged co-variation signal in interacting protein pairs by mirror-dendrogram
by Xiaowei Song, Xingjian He, Yajun Wang, Yezhong Tang
Abstract: In the post-genomic era, in silico methods have proven increasingly useful for constructing interactomes, especially protein-protein interaction networks. Here we describe a structural co-variation based approach (i.e. mirror-dendrogram) for predicting binary interacting proteins at a proteome-wide scale. The structural variation was measured in terms of physicochemical traits (i.e. Kyte-Doolittle hydrophobicity, molecular weight and molecular Van der Waals volume). We explored the performance of a series of mirror algorithms (i.e. mirror-tree, tree of life-mirror-tree and mirror-dendrogram) in 1117 protein groups of 21 species of the Enterobacteriaceae family. Interestingly, the degree of sequence divergence of each protein group was found to have an important effect on the performance of these algorithms. The mirror-dendrogram is a robust way to validate the hypothesis that interacting protein pairs possess a mixed co-variation signal, which originates from background co-evolution and structural co-adaptation. We consider that mirror-dendrogram will help distinguish physically interacting proteins from functionally related ones by characterizing the merged co-variation signal.
Keywords: co-evolution; mirror-tree; mirror-dendrogram; physicochemical trait; protein-protein interaction; Enterobacteriaceae.
Transcriptomic and network analyses combine to identify genes that drive the red blood cell cycle of Plasmodium falciparum
by Xinran Yu, Hao Zhang, Timothy Lilburn, Hong Cai, Jianying Gu, Turgay Korkmaz, Yufeng Wang
Abstract: Despite coordinated attempts to control or eliminate it, malaria remains a widespread public health problem, with half the world's population (3.2 billion people) at risk. While the annual death toll attributed to malaria has declined in recent years, the mortality is still very high. The World Health Organisation estimates that in 2015 between 236,000 and 635,000 people died, and the disease cost the continent of Africa, where 91% of cases occur, about USD 12 billion. The contribution of genomics to the defeat of malaria has been relatively small until recently. Although genomic data is available, much of it is difficult to interpret, as this parasite has no well-studied close relatives. This has led to a need for computationally driven tools that will help us understand the dynamic cellular networks in the malaria parasite. This understanding, in turn, will help us identify new antimalarial targets in the parasite. Here, we coupled RNA-Seq analysis and network mining using a PageRank-based algorithm, and examined the temporal-specific expression of parasite genes during the 48-hour red blood cell cycle. We identified genes that appear to influence parasite development and red blood cell invasion. The just-in-time mechanism for gene expression may contribute to a dynamic and effective adaptive strategy of the malaria parasite.
Keywords: malaria; development cycle; RNA-Seq; PageRank; systems biology; Plasmodium falciparum.
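For readers unfamiliar with PageRank-style network mining, a minimal power-iteration sketch is given below. It implements plain PageRank on a directed graph given as an adjacency dictionary; the damping factor of 0.85, the function name and the toy graph are assumptions, and the authors' PageRank-based algorithm may differ in its details.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=1000):
    """Plain PageRank by power iteration.

    adjacency maps each node to a list of nodes it points to.
    Nodes without outgoing edges spread their rank uniformly.
    """
    nodes = sorted(set(adjacency) | {t for ts in adjacency.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank mass held by dangling nodes (no outgoing edges).
        dangling = sum(rank[v] for v in nodes if not adjacency.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source, targets in adjacency.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy network: two genes pointing at a third.
toy_ranks = pagerank({"geneA": ["geneC"], "geneB": ["geneC"], "geneC": []})
```

In the toy network the node that receives both edges ends up with the highest score, which is the intuition behind using such scores to prioritize genes.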
A Novel Feature Selection Based on Apriori Property and Correlation Analysis for Protein Sequence Classification using MapReduce
by Bhavani R, Sudha Sadasivam G
Abstract: Feature selection is a crucial step in the classification of protein sequences into existing superfamilies. Classifying protein sequences into different families based on their sequence patterns is helpful in predicting the structure and function of proteins. This paper proposes a novel feature selection algorithm which first transforms the protein sequences into feature vectors and then reduces the size of the feature vector based on the apriori property and a correlation measure, using the MapReduce programming model of the Hadoop framework. Experimental results show that the proposed method of feature selection reduces the features by 81% to 84% and also improves accuracy by 5% to 6%.
Keywords: apriori property; sequence classification; correlation analysis; feature subset selection; MapReduce; bioinformatics.
Extracting Compact Representation of Knowledge from Gene Expression Data for Protein-protein Interaction
by Haohan Wang, Aman Gupta, Ming Xu
Abstract: DNA microarrays help measure the expression levels of thousands of genes concurrently. A major challenge is to extract biologically relevant information and knowledge from massive amounts of microarray data. In this paper, we explore learning a compact representation of gene expression profiles by using a multi-task neural network model, so that further analyses can be carried out more efficiently on the data. The proposed network performs prediction tasks for Protein-Protein Interactions (PPIs) and Gene Ontology (GO), while simultaneously learning a high-level representation of gene expression data. We argue that deep networks can extract more information from expression data than standard statistical models. We tested the utility of our method by comparing its performance with well-known feature extraction and dimensionality reduction methods on the task of PPI prediction, and found the results to be promising.
Keywords: Feature Extraction; Knowledge Representation; Deep Learning; Computational Biology.
Comparisons of linkage disequilibrium blocks of different populations for positive selection
by Sun-Ah Kim, Suh-Ryung Kim, Yun Joo Yoo
Abstract: Linkage disequilibrium (LD) structure is an important aspect of the study of population genetics and disease-gene association. In particular, analysing extended long haplotypes carrying a derived allele and examining LD block patterns can provide evidence for positive selection. We investigated the LD block structure of East Asian, European, and African populations for previously reported sites of positive selection by comparing LD block construction results based on 1000 Genomes Project data. We confirmed that differences in LD block size in the EDAR, LCT, PCDH15, and LARGE regions among different populations are consistent with previous results regarding positive selection. We found that LD block comparisons can provide additional information for positive selection in the regions of SLC30A19, PDE11A, and BCAS3 for East Asian and European populations based on the LD block patterns.
Keywords: positive selection; linkage disequilibrium; haplotype block; 1000 Genomes Project.
Dynamic Extended Tree conditioned LSTM-based Biomedical Event Extraction
by Lishuang Li, Jieqiong Zheng, Jia Wan
Abstract: Extracting knowledge from unstructured text has become essential to text mining and knowledge discovery tasks in the biomedical field. In this paper, we describe a system to extract biomedical events among biotopes and bacteria from biomedical literature. Since deep learning methods can capture hidden semantic information by iteratively training the neural network, we propose a novel Long Short-Term Memory (LSTM) network framework, DET-BLSTM, for biomedical event extraction. In our framework, a dynamic extended tree, which utilizes syntactic information, is introduced as the input instead of the original sentences. Furthermore, POS and distance embeddings are added to enrich the input information. Finally, considering that shallow machine learning methods can effectively take advantage of domain expert experience, the predictions of an SVM are used for post-processing. Our DET-BLSTM model with post-processing achieves a 58.09% F-score on the test set, which is better than all official submissions to BioNLP-ST 2016 and 2.29% higher than the best system. In addition, the results of LSTM and LSTM variations indicate the stability of our model.
Keywords: LSTM; dynamic extended tree; biomedical event extraction; deep learning.
Rough-Fuzzy Segmentation of HEp-2 Cell Indirect Immunofluorescence Images
by Shaswati Roy, Pradipta Maji
Abstract: The human epithelial type-2 (HEp-2) cell is currently the most recommended substrate in indirect immunofluorescence (IIF) tests to diagnose various connective tissue disorders. The IIF test identifies the presence of antinuclear antibody (ANA) in patient serum. However, the proper detection of HEp-2 cells from the IIF images is an important prerequisite for the recognition of staining patterns of ANAs. The characteristics of HEp-2 cell images, due to fluorescence intensity, make the segmentation process more challenging. Recently, rough-fuzzy clustering algorithms have been shown to provide significant results for image segmentation by handling different uncertainties present in the images. However, the neighborhood information is completely ignored in these algorithms, even though spatial information is useful when the image is distorted by different imaging artifacts. In this regard, the paper presents a segmentation algorithm that incorporates the neighborhood information into the rough-fuzzy clustering algorithm. In the current study, the class label of a pixel is influenced by its neighboring pixels, depending on their local spatial constraint and local gray level constraint. The performance of the proposed method is evaluated on several HEp-2 cell IIF images and compared with that of existing algorithms, both qualitatively and quantitatively.
Keywords: Image segmentation; HEp-2 cell images; rough-fuzzy clustering; spatial information.
Randomized Sequential and Parallel Algorithms for Efficient Quorum Planted Motif Search
by Peng Xiao, Soumitra Pal, Sanguthevar Rajasekaran
Abstract: Discovering patterns in biological sequences is very important for extracting useful information from them. Motifs are crucial patterns that have numerous applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity between families of proteins, etc. Motif search is an important step in obtaining meaningful patterns from biological data. The general problem of motif search is intractable. There are many models of motif search proposed in the literature. Among these, the (l; d)-motif model, which is also known as Planted Motif Search (PMS), is widely studied. However, most of the existing algorithms are deterministic and the role of randomization in this area is still unexploited. This paper proposes an elegant as well as efficient randomized algorithm, named qPMS10, to solve PMS. The idea is based on random sampling. We also prove that if we choose the parameters carefully, then our result will be correct with high probability. We utilize the most efficient PMS solver to date, named qPMS9, as a subroutine. We analyze the time complexity of both algorithms and provide a performance comparison of qPMS10 with qPMS9 on standard benchmark datasets. In addition, we offer a parallel implementation of qPMS10 and run tests using up to 4 processors. Both theoretical and empirical analyses demonstrate that our randomized algorithm outperforms the existing algorithms for solving PMS. We believe that our techniques can also be extended to other motif search models, such as Simple Motif Search (SMS) and Edit-distance based Motif Search (EMS).
Keywords: motif search; planted motif search; (l; d)-motif search; randomized algorithms; DNA and protein sequences; parallel algorithms for motif search.
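The (l; d)-motif condition itself is easy to state in code. The brute-force sketch below checks whether a candidate string of length l occurs within Hamming distance d in enough input sequences (all of them by default, or a quorum if `min_sequences` is given). It only verifies a candidate and is not qPMS9 or qPMS10; the function names and the toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, sequence, d):
    """True if some l-mer of sequence is within Hamming distance d of motif."""
    l = len(motif)
    return any(
        hamming(motif, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def is_quorum_motif(motif, sequences, d, min_sequences=None):
    """Check the (l; d)-motif condition, optionally with a quorum.

    min_sequences=None requires an occurrence in every sequence (plain PMS);
    an integer requires occurrences in at least that many sequences.
    """
    needed = len(sequences) if min_sequences is None else min_sequences
    hits = sum(occurs_within(motif, s, d) for s in sequences)
    return hits >= needed


# Hypothetical toy input.
toy_sequences = ["ACGTACGT", "TTACGATT", "GGGACCTG"]
found = is_quorum_motif("ACGT", toy_sequences, d=1)
```

Exact solvers must search over all 4^l candidate strings (for DNA), which is why sampling and pruning strategies matter in practice.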
Comparative studies on multivariate tests for joint-SNVs analysis and detection for bipolar disorder susceptibility genes
by Jin-Xiong Lv, Han-Chen Huang, Run-Sheng Chen, Lei Xu
Abstract: Instead of single nucleotide variant (SNV) analysis, many joint-SNVs analysis methods have been proposed to tackle the missing heritability problem in genome-wide association studies (GWASs). In this paper, we performed a comparative study involving five typical methods for joint-SNVs analysis and a recently proposed method called the Statistics-space Boundary-based test (S-space BBT). For a fair and comprehensive comparison, we conducted simulation experiments considering dominant single variant, effect direction, minor allele frequency (MAF), odds ratio (OR) and linkage disequilibrium (LD). The results indicated that the S-space BBT not only does not swamp the significant SNV but also maintains stronger detection power under different configurations. We therefore applied the S-space BBT to a bipolar disorder dataset and obtained a list of biomarkers; in addition, a literature survey was conducted to validate the reliability of the results.
Keywords: GWAS; sequence analysis; joint-SNVs analysis; odds ratio; dominant single variant; effect direction; minor allele frequency; linkage disequilibrium; S-space boundary-based test; bipolar disorder.
Comparisons of cancer classifiers based on RNA_seq and miRNA_seq
by Shinuk Kim, Hyowon Lee, Mark Kon
Abstract: Studies in computational cancer genomics have been faced with the challenge of increasing the prediction accuracy of molecular datasets. Here we outline how a feature selection method combined with machine learning may help overcome this challenge for BRCA microRNA-Seq datasets, BRCA RNA-Seq and mRNA microarray datasets, and BLCA microRNA_seq and RNA_seq datasets. We used three different computational approaches: 1) support vector machine, 2) decision tree and 3) k nearest neighbors, and two different feature selection methods: 1) Fisher feature criterion and 2) infinite feature selection. Our computational approaches performed consistently better with RNA_seq datasets than with miRNA_seq or RNA_array datasets.
Keywords: feature selection; machine learning methods; classification methods; RNA_sequence datasets; miRNA_sequence datasets; Breast invasive carcinoma; Bladder urothelial carcinoma.
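As an illustration of a Fisher-type feature criterion, the sketch below scores each feature by the squared difference of class means divided by the sum of class variances, then ranks features by that score. The two-class form, the use of population variance and all names are assumptions for illustration; the exact variant used in the paper is not stated in the abstract.

```python
from statistics import mean, pvariance


def fisher_score(values_a, values_b):
    """Two-class Fisher score: (mean_a - mean_b)^2 / (var_a + var_b)."""
    spread = pvariance(values_a) + pvariance(values_b)
    if spread == 0:
        raise ValueError("feature is constant within both classes")
    return (mean(values_a) - mean(values_b)) ** 2 / spread


def rank_features(samples_a, samples_b):
    """Rank feature indices by descending Fisher score.

    samples_a and samples_b are lists of samples (rows), one list per class.
    Returns (ordered feature indices, per-feature scores).
    """
    n_features = len(samples_a[0])
    scores = [
        fisher_score([row[j] for row in samples_a], [row[j] for row in samples_b])
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical expression values: 3 samples per class, 2 features.
class_a = [[1, 5], [2, 6], [3, 4]]
class_b = [[7, 5], [8, 4], [9, 6]]
feature_order, feature_scores = rank_features(class_a, class_b)
```

The top-ranked features would then be passed to a classifier such as a support vector machine or k nearest neighbors.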
Deep Fusion of Multi-channel Neurophysiological Signal for Emotion Recognition and Monitoring
by Xiang Li, Dawei Song, Peng Zhang, Yuexian Hou, Bin Hu
Abstract: How to fuse multi-channel neurophysiological signals for emotion recognition is emerging as a hot research topic in the Computational Psychophysiology community. Meanwhile, clinical psychiatrists are in urgent need of computer-aided systems that automatically monitor a patient's emotional fluctuations, which can benefit the diagnosis process. Nevertheless, prior related work mainly relies on feature engineering based approaches, which require extracting various domain knowledge related features at a high time cost. Moreover, traditional fusion methods cannot fully utilize correlation information between different channels and frequency components. In this paper, we propose a preprocessing method that encapsulates the multi-channel neurophysiological signals into 3D frame cubes through wavelet and scalogram transforms, which largely reduces the time cost of data preprocessing. We further design a hybrid deep learning model, in which a Convolutional Neural Network (CNN) is utilized for extracting and selecting task-related features, as well as mining inter-channel and inter-frequency correlation; in addition, a Recurrent Neural Network (RNN) is concatenated for integrating contextual information from the frame cube sequence. Experiments are carried out on a trial-level emotion recognition task using the DEAP benchmarking dataset. Experimental results demonstrate that our proposed framework outperforms the classical methods with respect to the emotional dimensions of Valence and Arousal. In addition, the potential of our model in real-time monitoring, as well as critical channel and frequency determination, provides us with a new perspective and direction in Computational Psychophysiology.
Keywords: affective computing; CNN; time series data analysis; EEG; emotion recognition; LSTM; multi-channel data fusion; multi-modal data fusion; physiological signal; RNN.
Dynamics in the neural network of an in vitro epilepsy model
by Bowen Liu, Junwei Mao, Yejun Shi, Qinchi Lu, Peiji Liang, Puming Zhang
Abstract: Epilepsy is increasingly considered a brain network disorder. In this study, epileptiform discharges induced by low-Mg2+ were recorded with a micro-electrode array. Dynamic effective network connectivity was constructed by calculating the time-variant partial directed coherence (tvPDC) of signals. We proposed a novel approach to track the state transitions of epileptic networks, and characterized the network topology by using graphical measures. We found that the network hub nodes coincided with the epileptogenic zone in previous electrophysiological findings. Two network states with distinct topologies were identified during the ictal-like discharges. The small-worldness significantly increased in the second state. Our results indicate the ability of tvPDC to capture the causal interaction between multi-channel signals that is important in identifying the epileptogenic zone. Moreover, the evolution of network states extends our knowledge of the network drivers for the initiation and maintenance of ictal activity, and suggests the practical value of our network clustering approach.
Keywords: epilepsy; microelectrode array; dynamic network; graph theory; granger causality; tonic-clonic; network analysis; hippocampus; entorhinal cortex; small-worldness; low Mg2+.
Template Edge Similarity Graph Clustering for Mining Multiple Gene Expression Datasets
by Saeed Salem
Abstract: High-throughput technologies have enabled the acquisition of large amounts of genomic data, including gene expression and RNA sequencing data for multiple species under various biological and environmental conditions. Recently, researchers have proposed methods for mining biological modules from gene coexpression networks. Biological inference from a single expression dataset suffers from spurious coexpression. Integrating multiple gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on single gene expression data. We propose an integrative mining algorithm that constructs a template edge similarity graph whose nodes are the coexpression edges, and in which a weighted edge connecting two nodes corresponds to the structural similarity of the two edges across the coexpression graphs. Clustering the weighted edge similarity graph yields recurrent coexpression link clusters (modules). Experimental results on human gene expression datasets show that the reported modules are functionally homogeneous, as evidenced by their enrichment with biological process GO terms.
Keywords: Coexpression networks; Edge-Edge Similarity; Biological Modules.
MiRFFS: a functional group-based feature selection method for the identification of microRNA biomarkers
by Yang Yang, Yiqun Xiao, Tianyu Cao, Wei Kong
Abstract: The identification of microRNA biomarkers has been a central task in disease diagnosis, prognosis assessment and drug design, due to the important roles that microRNAs play in the development of complex diseases. Using recent high-throughput experimental technologies, such as microRNA microarray and small-RNA sequencing, microRNA expression profiles have been widely studied, where differentially expressed microRNAs are potential biomarkers. Both statistical methods and machine learning approaches have been applied to the identification of biomarkers. In particular, feature selection and regularization techniques are efficient for filtering informative attributes from a high-dimensional space. In order to enhance their performance, the intrinsic data structure is usually exploited.
In this study, we focus on feature selection for microRNA expression data to identify potential biomarkers. Considering that microRNAs often work together to play their regulatory roles and form functional groups, we utilize GO-based semantic similarity to infer miRNA functional groups, and propose a new feature selection method, called MiRFFS (MiRNA Functional group-based Feature Selection). We also incorporate the functional group information into the sparse group Lasso (SGL), and compare MiRFFS with SGL as well as state-of-the-art feature selection methods. Experimental results on five miRNA microarray profiles of breast cancer show that MiRFFS can achieve a compact feature subset with substantial improvement in accuracy compared with other feature selection and lasso methods.
Keywords: MicroRNA biomarker; Functional group; Feature selection.
BioNimbuZ: A Federated Cloud Platform for Bioinformatics Applications
by Michel Rosa, Breno Moura, Guilherme Vergara, Aletéia Araujo, Maristela Holanda, Maria Emilia Walter
Abstract: Challenges in bioinformatics include tools to handle large-scale processing, mainly due to the large volumes of data generated by high-throughput sequencing machines. However, many of these tools are not user friendly, and do not distribute their workloads properly. In federated cloud environments, even though services and resources are shared and available online, the processes of a workflow execution are almost entirely unautomated, and the majority of these processes do not efficiently balance their workloads. This paper presents a federated cloud platform called BioNimbuZ, a hybrid platform designed to execute bioinformatics applications easily, efficiently, and with good workload balance. Our tests were performed using a real bioinformatics workflow, with fragments generated by the Illumina sequencer, and achieved good performance in practice.
Keywords: BioNimbuZ; cloud computing; federated cloud computing; bioinformatics applications.
DASE2: Differential Alternative Splicing variants Estimation method without reference genome, and comparison with mapping strategy
by Kouki Yonezawa, Keisuke Nakata, Ryuhei Minei, Atsushi Ogura
Abstract: Alternative splicing is a mechanism that produces gene expression diversity under the constraint of a limited number of genes, causing spatiotemporal gene expression in tissues and developmental processes in most organisms. This mechanism has been well studied in model organisms but not in non-model organisms, because the current standard method requires genomic sequences as well as fully annotated information on exons and introns. However, it is essential to uncover the landscape of alternative splicing in organisms to understand its evolutionary impacts and roles. Therefore, we developed a method for condition-specific alternative splicing estimation without a reference genome, based on de novo transcriptome assembly. We also compared the estimation results of DASE with those of a genome mapping method to assess the reliability of our method, and showed that its detection level of alternative splicing is comparable with the mapping strategy and useful for screening condition-specific alternative splicing in non-model organisms. The software is deposited at https://github.com/koukiyonezawa/DASE.
Keywords: RNA-seq; isoforms; expression diversity.
Multivariate summary approach to omics data from crossover design with two repeated factors
by Sunghoon Choi, Soo-yeon Park, Hoejin Kim, Taesung Park, Oran Kwon
Abstract: A crossover design with two repeated factors is commonly used for analyzing tolerance tests, i.e., measurements of physiologic response following ingestion of some exogenous substance. For data analysis using a crossover design, a standard approach is to use linear mixed effect models (LMMs), as these can adequately handle correlated measurements from the crossover design. Alternatively, univariate analyses using single summary statistics can be employed for assessments such as the difference of measurements between time points, incremental area under curve (iAUC), Cmax, etc. However, the use of summary measures may result in the loss of information. In this study, instead of using one single summary measure, we propose using multiple summary measures simultaneously through LMMs by taking their correlation into account. We compare the performance of the proposed method with other existing methods through real data analysis and simulation studies. We show that our proposed method has power equivalent to that of the standard LMM approach, while using far fewer parameters.
Keywords: Linear mixed effect model; Crossover design; Repeated measurements.
A terpenoid metabolic network modeled as graph database
by Waldeyr Silva, Danilo Vilar, Daniel Souza, Maria Emília Walter, Marcelo Brígido, Maristela Holanda
Abstract: Terpenoids are involved in interactions such as signaling for intra- and inter-species communication, signaling molecules to attract pollinating insects, and defense against herbivores and microbes. Due to their chemical composition, many terpenoids possess vast pharmacological applicability in medicine and biotechnology, besides important roles in ecology, industry and commerce. Metabolic networks, which are composed of metabolic pathways, represent the metabolism of an organism. The biosynthesis of terpenes has been widely studied over the years, and it is well known that they can be synthesized from two metabolic pathways: the mevalonate pathway (MVA) and the non-mevalonate pathway (MEP). On the other hand, genome-scale reconstruction of metabolic networks faces many challenges, including organizational data storage and data modeling, to properly represent the complexity of systems biology. Recent NoSQL database paradigms have introduced new concepts of scalable storage and data queries. With regard to biological data, the use of graph databases has grown because of their versatility. In this paper, we propose 2Path, a graph database designed to represent terpenoid metabolic networks. It is modeled in such a way that it preserves important terpenoid biosynthesis characteristics.
Keywords: terpenoid; metabolic network; secondary metabolism; NoSQL; graph database.
Nonlinear-RANSAC parameter optimization for dynamic molecular systems and signaling pathways
by Mingon Kang, Liping Tang, Jean Gao
Abstract: Vigorous mathematical modeling and accurate parameter estimation are indispensable for building reliable models that represent the dynamic characteristics of biological systems. A challenging task in modeling complex biological systems is the accurate estimation of the large number of unknown parameters in the mathematical model. To tackle this problem, we develop a data-driven optimization method, nonlinear RANSAC, based on linear RANdom SAmple Consensus (a.k.a. RANSAC). The conventional RANSAC method is sound and simple, but it is designed for linear system models. Our proposed nonlinear RANSAC extends its capability to nonlinear systems while preserving the strengths of RANSAC. We applied nonlinear RANSAC to the dynamic molecular systems of phagocyte transmigration and signaling pathways. The parameters of the mathematical equations for the phagocyte transmigration system were estimated by the proposed nonlinear RANSAC, and its performance was compared with that of ordinary least squares. Nonlinear RANSAC was also applied to signaling pathways, where the mathematical equations are formulated as ordinary differential equations that represent molecular interactions between two biological components.
Keywords: Nonlinear RANSAC; parameter estimation; dynamic molecular systems; signaling pathway.
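For readers unfamiliar with the conventional RANSAC that the abstract above builds on, the following is a minimal sketch of the linear version for fitting a straight line. It is not the authors' nonlinear RANSAC; the function name, tolerance, iteration count, and toy data are all assumptions chosen for illustration.

```python
import random


def ransac_line(points, n_iter=200, tol=0.5, seed=0):
    """Fit y = slope * x + intercept with basic (linear) RANSAC.

    Repeatedly draws two points, fits the line through them, and keeps
    the candidate line that has the most points within `tol` of it.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line, cannot be written as y = a*x + b
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers


# Toy data: ten points on y = 2x + 1 plus three gross outliers.
clean = [(float(x), 2.0 * x + 1.0) for x in range(10)]
outliers = [(2.0, 30.0), (5.0, -20.0), (8.0, 40.0)]
model, inliers = ransac_line(clean + outliers)
print(model, len(inliers))
```

Ordinary least squares on the same data would be pulled toward the outliers, which is the kind of comparison the abstract mentions. Extending the sampling-and-consensus loop to nonlinear models requires replacing the two-point line fit with a nonlinear fit on each random subset.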
Identifying cis/trans-acting expression Quantitative Trait Loci (eQTL)
by Mingon Kang, Dongchul Kim, Chunyu Liu, Jean Gao
Abstract: Expression Quantitative Trait Loci (eQTL) studies have played an important role in discovering novel susceptibility genes and regulatory mechanisms of human diseases. High-throughput microarray technologies allow the expression of thousands of genes to be measured at the same time, and this advance gives insight into the genetic architecture of gene expression. A number of multivariate methods have been proposed to identify loci associated with gene expression, taking into account interactive effects and relationships between the units. However, large data sets tend to increase false positives in such studies. We propose a Cis/Trans eQTL Association Mapping (CTAM) method to (1) account for co-expressed genes without clustering or partitioning techniques, (2) build a mathematical model for cis- and trans-eQTL based on biological prior knowledge, and (3) identify significant disease-associated genes. The power to detect both the joint effect and the group effect of SNPs and gene expressions is assessed in simulation studies. We also applied the method to a study of psychiatric disorder data. CTAM detects associations between cis/trans-acting eQTLs and genes.
Keywords: eQTL analysis; cis/trans-acting eQTL; multivariate.
Statistical Quality Control Analysis of High Dimensional Omics Data
by Yongkang Kim, Gyu-Tae Kim, Min-Seok Kwon, Taesung Park
Abstract: Quality control (QC) is an important preprocessing procedure to remove unwanted variation in omics data, such as microarray, next generation sequencing, and mass spectrometry data. QC has become a standard procedure for identifying important biological signatures of interest. Although several QC analysis tools are now used widely, these usually require a subjective guideline to determine the quality of the omics data being assessed. Here, we propose a new, simple QC plot for high dimensional omics data that can identify samples of poor quality in a more objective manner. The proposed QC plot identifies samples of poor quality by comparing the between-group and within-group distances across all possible pairs of samples. Through a permutation procedure, the distribution of these distances is derived, generating a p-value for each sample. These p-values can then be used as a more objective criterion to determine the quality of the sample. To exemplify the utility of this approach, we applied the proposed QC plot to the MicroArray Quality Control (MAQC) project 1 data.
Keywords: distance measure; quality control; microarray; omics data.
Integration of Multi-omics Data for Integrative Gene Regulatory Network Inference
by Neda Zarayeneh, Euiseong Ko, Jung Hun Oh, Sang Suh, Chunyu Liu, Jean Gao, Donghyun Kim, Mingon Kang
Abstract: Gene regulatory networks provide comprehensive insights and in-depth understanding of complex biological processes. In most research, the molecular interactions of gene regulatory networks are inferred from a single type of genomic data, e.g., gene expression data. However, gene expression is a product of sequential interactions of multiple biological processes, such as DNA sequence variations, copy number variations, histone modifications, transcription factors, and DNA methylations. Recent rapid advances in high-throughput omics technologies make it possible to measure multiple types of omics data, called 'multi-omics data', that represent the various biological processes. In this paper, we propose an Integrative Gene Regulatory Network inference method (iGRN) that incorporates multi-omics data and their interactions in gene regulatory networks. In addition to gene expression, copy number variations and DNA methylations were considered as multi-omics data in this paper. Intensive experiments were carried out with simulation data to assess iGRN's ability to infer the integrative gene regulatory network. In these experiments, iGRN shows better performance on model representation and interpretation than other integrative methods for gene regulatory network inference. iGRN was also applied to a human brain dataset of psychiatric disorders, and the biological network of psychiatric disorders was analyzed.
Keywords: gene regulatory network inference; multi-omics data; data integration.
Analysis of clustered RNA-seq Data
by Hyunjin Park, Seungyeoun Lee, Ye Jin Kim, Myung-Sook Choi, Taesung Park
Abstract: RNA sequencing (RNA-seq) technology has now become a powerful tool for measuring levels of transcripts. Through this high-throughput technology, we can investigate post-transcriptional modifications, non-coding RNAs, mutations, gene fusion, and changes in gene expression levels. Recently, many methods have been developed to find differentially expressed genes (DEGs) between treatment groups. Most of these methods assume that RNA-seq data are generated independently from different subjects. Nowadays, clustered RNA-seq data, such as paired RNA-seq data from the same patient, are also commonly observed. Unfortunately, existing methods cannot adequately handle clustered RNA-seq data. In this paper, we propose a new testing method based on the Generalized Estimating Equations (GEE) approach, which is widely used to analyze repeatedly measured data. Our GEE-based approach uses the correlations between RNA-seq data appropriately, which results in increased power in detecting DEGs. Through real data analysis and simulation studies, we compare the performance of the GEE method to those of other existing methods. Specifically, our GEE analysis was compared to various other methodologies, particularly with regard to sensitivity to detect DEGs and false discovery rates.
Keywords: RNA-seq; differentially expressed gene; DEG; simultaneously; multivariate; Generalized Estimating Equations; GEE; false discovery rate (FDR).
Evaluating the contributions of GO term properties to semantic similarity measurement
by Young-Rae Cho
Abstract: Ontologies are frameworks that provide structured descriptions of components in a specific domain. Recent systematic approaches for semantic analysis and annotations in bio-ontology databases have advanced the understanding of molecular functions on a genomic scale. Gene Ontology (GO) is one of the most widely used ontology databases. Over the last decade, various semantic similarity measures have been proposed to quantify functional similarity between genes using GO and its annotation data. However, major challenges in the application of current GO data are the increasing complexity of ontology structures and the inconsistency of annotation data. In this study, we explore term properties in GO, such as term specificity and the term balancing effect, and evaluate the contributions of these properties to semantic similarity measurement. Our experiment is designed to predict positive protein-protein interactions (PPIs) by semantic similarities, which are measured using the various term properties. The experimental results show that the accuracy of semantic similarity improves when the GO terms are weighted by term specificity. Among several term specificity measures that are commonly applied to semantic analysis, the information content using the ratio of annotating genes resulted in the highest accuracy. The experimental results also show that balancing terms with respect to their specificity is a significant factor in measuring semantic similarity between proteins.
Keywords: Gene Ontology; semantic similarity; annotations; term specificity; GO; PPI.
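The term specificity measure highlighted in the abstract above, information content based on the ratio of annotating genes, is commonly written as IC(t) = -log(|genes annotated to t| / |all annotated genes|). The sketch below computes it on a toy ontology and uses it in a Resnik-style similarity (the IC of the most informative common ancestor). The term names, gene identifiers, and the choice of the Resnik measure are illustrative assumptions; the abstract does not state which similarity measures were evaluated.

```python
import math


def information_content(gene_sets):
    """Map each term to -log(fraction of all annotated genes it annotates).

    `gene_sets` maps a term to the set of genes annotated to that term
    or to any of its descendants.
    """
    total = len(set().union(*gene_sets.values()))
    return {term: -math.log(len(genes) / total)
            for term, genes in gene_sets.items() if genes}


def resnik_similarity(term_a, term_b, ancestors, ic):
    """IC of the most informative term that is an ancestor of (or equal to) both."""
    common = (ancestors[term_a] | {term_a}) & (ancestors[term_b] | {term_b})
    return max(ic[t] for t in common)


# Hypothetical toy ontology: root -> A -> {B, C}.
gene_sets = {
    "root": {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"},
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2"},
    "C": {"g3"},
}
ancestors = {"root": set(), "A": {"root"}, "B": {"A", "root"}, "C": {"A", "root"}}

ic = information_content(gene_sets)
sim_bc = resnik_similarity("B", "C", ancestors, ic)
print(ic, sim_bc)
```

More specific terms annotate fewer genes and therefore receive a higher IC, which is the sense in which weighting by term specificity favors informative annotations.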