International Journal of Data Mining and Bioinformatics (20 papers in press)
A Markov blanket-based approach for finding high-dimensional genetic interactions associated with disease in family-based studies
by Hyo Jung Lee, Jae Won Lee, Hee Jeong Yoo, Seohoon Jin, Mira Park
Abstract: Detecting genetic interactions associated with complex disease is a major issue in genetic studies. Although a number of methods to detect gene-gene interactions for population-based genome-wide association studies (GWAS) have been developed, the statistical methods for family-based GWAS have been limited. In this study, we propose a new Bayesian approach called MB-TDT to find high-order genetic interactions for pedigree data. The MB-TDT method combines the Markov blanket algorithm with the classical transmission disequilibrium test (TDT) statistic. The incremental association Markov blanket (IAMB) algorithm was adopted for large-scale Markov blanket discovery. We evaluated the proposed method using both real and simulated datasets. In a simulation study, we compared the power of MB-TDT with conditional logistic regression, multifactor dimensionality reduction (MDR) and MDR-pedigree disequilibrium test (MDR-PDT). We demonstrated the superior power of MB-TDT in many cases. To demonstrate the approach, we analyzed the Korean autism disorder GWAS data. The MB-TDT method can identify a minimal set of causal SNPs associated with a specific disease, thus avoiding an exhaustive search.
Keywords: Genetic associations; Gene-gene interactions; Markov blanket; Pedigree data; Transmission disequilibrium test.
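As a rough illustration of the classical TDT statistic mentioned in the abstract above, the sketch below computes the standard single-marker form, (b - c)^2 / (b + c), where b and c count how often heterozygous parents transmit or do not transmit a given allele to affected offspring. It is not the MB-TDT method itself; the function names and the example counts are hypothetical.

```python
import math


def tdt_statistic(transmitted: int, untransmitted: int) -> float:
    """Classical single-marker TDT statistic (McNemar form).

    transmitted / untransmitted: counts of heterozygous parents who did or
    did not pass the allele of interest to an affected child.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("need at least one informative transmission")
    return (transmitted - untransmitted) ** 2 / total


def chi2_1df_pvalue(stat: float) -> float:
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts, for illustration only.
stat = tdt_statistic(transmitted=60, untransmitted=40)
p_value = chi2_1df_pvalue(stat)
```

Under the null hypothesis of no linkage or association, the statistic is approximately chi-square distributed with one degree of freedom, which is what the p-value helper assumes.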
Transcriptomic and network analyses combine to identify genes that drive the red blood cell cycle of Plasmodium falciparum
by Xinran Yu, Hao Zhang, Timothy Lilburn, Hong Cai, Jianying Gu, Turgay Korkmaz, Yufeng Wang
Abstract: Despite coordinated attempts to control or eliminate it, malaria remains a widespread public health problem, with half the world's population (3.2 billion people) at risk. While the annual death toll attributed to malaria has declined in recent years, the mortality is still very high. The World Health Organisation estimates that in 2015 between 236,000 and 635,000 people died, and the disease cost the continent of Africa, where 91% of cases occur, about USD 12 billion. The contribution of genomics to the defeat of malaria has been relatively small until recently. Although genomic data is available, much of it is difficult to interpret, as this parasite has no well-studied close relatives. This has led to a need for computationally-driven tools that will help us understand the dynamic cellular networks in the malaria parasite. This understanding, in turn, will help us identify new antimalarial targets in the parasite. Here, we coupled RNA-Seq analysis and network mining using a PageRank-based algorithm, and examined the temporal-specific expression of parasite genes during the 48-hour red blood cell cycle. We identified genes that appear to influence parasite development and red blood cell invasion. The just-in-time mechanism for gene expression may contribute to a dynamic and effective adaptive strategy of the malaria parasite.
Keywords: malaria; development cycle; RNA-Seq; PageRank; systems biology; Plasmodium falciparum.
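The abstract above does not specify how its PageRank-based algorithm is adapted to gene networks, so the following is only a minimal sketch of plain PageRank by power iteration on a small directed graph. The gene names, edge list, and parameter defaults are hypothetical.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Plain PageRank by power iteration on a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy network: an edge (a, b) means "a points to b".
toy_edges = [
    ("geneA", "geneB"),
    ("geneB", "geneC"),
    ("geneC", "geneA"),
    ("geneD", "geneA"),
]
ranks = pagerank(toy_edges)
top_gene = max(ranks, key=ranks.get)
```

In this toy graph the node with the most incoming influence ends up with the highest score, which is the intuition behind using such scores to prioritise genes in a network.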
Multivariate summary approach to omics data from crossover design with two repeated factors
by Sunghoon Choi, Soo-yeon Park, Hoejin Kim, Taesung Park, Oran Kwon
Abstract: A crossover design, with two repeated factors, is commonly used for analyzing tolerance tests, i.e., measurements of physiologic response, following ingestion of some exogenous substance. For data analysis using a crossover design, a standard approach is to use linear mixed effect models (LMMs), as these can adequately handle correlated measurements from the crossover design. Alternatively, univariate analyses, using single summary statistics, can be employed for assessments such as the difference of measurements between time points, incremental area under curve (iAUC), Cmax etc. However, the use of summary measures may result in the loss of information. In this study, instead of using one single summary measure, we propose using multiple summary measures simultaneously through LMMs by taking their correlation into account. We compare the performance of the proposed method with other existing methods through real data analysis and simulation studies. We show that our proposed method has equivalent power to that of the standard LMM approach, while using far fewer parameters.
Keywords: Linear mixed effect model; Crossover design; Repeated measurements.
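To make the summary measures named in the abstract above concrete, the sketch below computes a net incremental area under the curve (trapezoidal rule on baseline-subtracted values) and Cmax for one subject. This is one common convention, assumed here for illustration; other iAUC definitions (for example, ignoring area below baseline) exist, and the abstract does not say which the authors use. The sample values are hypothetical.

```python
def incremental_auc(times, values):
    """Net incremental AUC: trapezoidal area of (value - baseline) over time.

    The first measurement is taken as the baseline.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two matching time/value points")
    baseline = values[0]
    increments = [value - baseline for value in values]
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (increments[i] + increments[i - 1]) / 2.0
    return area


def c_max(values):
    """Peak observed value."""
    return max(values)


# Hypothetical tolerance-test profile: minutes after ingestion, response level.
times = [0, 30, 60, 120]
values = [5.0, 8.0, 7.0, 5.0]
iauc = incremental_auc(times, values)
peak = c_max(values)
```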
Identifying cis/trans-acting expression Quantitative Trait Loci (eQTL)
by Mingon Kang, Dongchul Kim, Chunyu Liu, Jean Gao
Abstract: Expression Quantitative Trait Loci (eQTL) studies have played an important role in discovering novel susceptibility genes and regulatory mechanisms of human diseases. High-throughput microarray technologies allow the expression of thousands of genes to be measured at the same time, and this advance offers insight into the genetic architecture of gene expression. A number of multivariate methods have been proposed to identify loci associated with gene expression, taking into account interactive effects and relationships between the units. However, the large data tend to increase false positives in the studies. We propose a Cis/Trans eQTL Association Mapping (CTAM) method to (1) take co-expressed genes into account without clustering or partitioning techniques, (2) build a mathematical model for cis- and trans-eQTL based on biological prior knowledge, and (3) identify significant disease-associated genes. The power to detect both the joint effect and the group effect of SNPs and gene expressions is assessed in simulation studies. We also applied the method to psychiatric disorder data. CTAM detects associations between cis/trans-acting eQTLs and genes.
Keywords: eQTL analysis; cis/trans-acting eQTL; multivariate.
Statistical Quality Control Analysis of High Dimensional Omics Data
by Yongkang Kim, Gyu-Tae Kim, Min-Seok Kwon, Taesung Park
Abstract: Quality control (QC) is an important preprocessing procedure to remove unwanted variation in omics data, such as microarray, next generation sequencing, and mass spectrometry data. QC has become a standard procedure for identifying important biological signatures of interest. Although several QC analysis tools are now used widely, these usually require a subjective guideline to determine the quality of the omics data being assessed. Here, we propose a new simple QC plot for high dimensional omics data that can identify samples of poor quality in a more objective manner. The proposed QC plot can easily identify samples of poor quality by comparing the between-group and within-group distances across all possible pairs of samples. Through a permutation procedure, the distribution of these distances is derived, generating p-values for each sample. These p-values can then be used as a more objective criterion to determine the quality of the sample. To exemplify the utility of this approach, we applied the proposed QC plot to MicroArray Quality Control (MAQC) project 1 data.
Keywords: distance measure; quality control; microarray; omics data.
Integration of Multi-omics Data for Integrative Gene Regulatory Network Inference
by Neda Zarayeneh, Euiseong Ko, Jung Hun Oh, Sang Suh, Chunyu Liu, Jean Gao, Donghyun Kim, Mingon Kang
Abstract: Gene regulatory networks provide comprehensive insights and in-depth understanding of complex biological processes. In most research, the molecular interactions of gene regulatory networks are inferred from a single type of genomic data, e.g., gene expression data. However, gene expression is a product of sequential interactions of multiple biological processes, such as DNA sequence variations, copy number variations, histone modifications, transcription factors, and DNA methylations. The recent rapid advances of high-throughput omics technologies enable one to measure multiple types of omics data, called 'multi-omics data', that represent the various biological processes. In this paper, we propose an Integrative Gene Regulatory Network inference method (iGRN) that incorporates multi-omics data and their interactions in gene regulatory networks. In addition to gene expressions, copy number variations and DNA methylations were considered for multi-omics data in this paper. Intensive experiments were carried out with simulation data to assess iGRN's capability to infer the integrative gene regulatory network. In these experiments, iGRN shows better performance on model representation and interpretation than other integrative methods in gene regulatory network inference. iGRN was also applied to a human brain dataset of psychiatric disorders, and the biological network of psychiatric disorders was analyzed.
Keywords: gene regulatory network inference; multi-omics data; data integration.
Analysis of clustered RNA-seq Data
by Hyunjin Park, Seungyeoun Lee, Ye Jin Kim, Myung-Sook Choi, Taesung Park
Abstract: RNA sequencing (RNA-seq) technology has now become a powerful tool for measuring levels of transcripts. Through this high-throughput technology, we can investigate post-transcriptional modifications, non-coding RNAs, mutations, gene fusion, and changes in gene expression levels. Recently, many methods have been developed to find differentially expressed genes (DEGs) between treatment groups. Most of these methods assume that RNA-seq data is generated independently from the different subjects. Nowadays, clustered RNA-seq data are also commonly observed, such as paired RNA-seq data, from the same patient. Unfortunately, existing methods cannot adequately handle clustered RNA-seq data. In this paper, we propose a new testing method, based on the Generalized Estimating Equations (GEE) approach, which is widely used to analyze repeatedly measured data. Our GEE-based approach uses the correlations between RNA-seq data appropriately, which results in increased power in detecting DEGs. Through real data analysis and simulation studies, we compare the performance of the GEE method to those of other existing methods. Specifically, our GEE analysis was compared to various other methodologies, particularly with regard to sensitivity to detect DEGs and false discovery rates.
Keywords: RNA-seq; differentially expressed gene; DEG; simultaneously; multivariate; Generalized Estimating Equations; GEE; false discovery rate (FDR).
Evaluating the contributions of GO term properties to semantic similarity measurement
by Jan Sladek, Young-Rae Cho
Abstract: Ontologies are frameworks that provide structured descriptions of components in a specific domain. Recent systematic approaches for semantic analysis and annotations in bio-ontology databases have advanced understanding of molecular functions on a genomic scale. Gene Ontology (GO) is one of the most widely used ontology databases. Over the last decade, various semantic similarity measures have been proposed to quantify functional similarity between genes using GO and its annotation data. However, major challenges in the application of current GO data are the increasing complexity of ontology structures and the inconsistency of annotation data. In this study, we explore term properties in GO, such as term specificity and the term balancing effect, and evaluate the contributions of these properties to semantic similarity measurement. Our experiment is designed to predict positive protein-protein interactions (PPIs) by semantic similarities which are measured using the various term properties. The experimental results show that the accuracy of semantic similarity improves when the GO terms are weighted by term specificity. Among several term specificity measures that are commonly applied to semantic analysis, the information content using the ratio of annotating genes resulted in the highest accuracy. The experimental results also show that balancing terms with respect to their specificity is a significant factor in measuring semantic similarity between proteins.
Keywords: Gene Ontology; semantic similarity; annotations; term specificity; GO; PPI.
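The abstract above refers to information content computed from the ratio of annotating genes. A minimal sketch of the usual definition, IC(t) = -log(n_t / N), is given below, assuming n_t is the number of genes annotated to term t (directly or through its descendants) and N is the total number of annotated genes. The term identifiers and gene sets are made up for illustration and the exact variant used by the authors is not stated.

```python
import math


def information_content(genes_by_term, total_genes):
    """Information content per term: -log(fraction of genes annotated to it).

    genes_by_term maps a term ID to the set of genes annotated to that term,
    assumed to already include genes annotated to its descendant terms.
    """
    ic = {}
    for term, genes in genes_by_term.items():
        if not genes:
            continue  # IC is undefined for a term with no annotations
        ic[term] = -math.log(len(genes) / total_genes)
    return ic


# Hypothetical annotation counts over 8 genes.
toy_annotations = {
    "root_term": {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"},
    "broad_term": {"g1", "g2", "g3", "g4"},
    "specific_term": {"g1", "g2"},
}
ic_values = information_content(toy_annotations, total_genes=8)
```

More specific terms annotate fewer genes and therefore receive a higher information content, which is why such a measure can serve as a term specificity weight.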
Formal Concept Analysis for Knowledge Discovery from Biological Data
by Khalid Raza
Abstract: Due to the rapid advancement in high-throughput technologies, such as microarrays and next generation sequencing, the volume of biological data is increasing exponentially. The current challenge in computational biology and bioinformatics research is how to analyze these huge raw biological datasets to extract meaningful biological knowledge. Formal concept analysis is a method based on lattice theory and widely used for data analysis, knowledge representation, knowledge discovery and knowledge management across several domains. This paper reviews the applications of formal concept analysis for knowledge discovery from biological data, including gene expression discretization, gene co-expression mining, gene expression clustering, finding genes in gene regulatory networks, enzyme/protein classifications, binding site classifications, and domain-domain interaction. It also presents a list of FCA-based software tools applied to the biological domain, and covers the challenges and future directions in this field.
Keywords: Formal Concept Analysis; Microarray Analysis; Gene Expression Mining; Concept Lattice; Biological Databases.
SSDPrimer: A SSD-based Primer Design Method for a Private Sequence DB
by Kang-Wook Chon, Sang-Hyun Hwang, Kyuhyeon An, Min-Soo Kim
Abstract: Many tools and websites have been proposed to help users design primers for quantitative polymerase chain reaction (qPCR) experiments. Most of these tools and websites require an external step that uses BLAST-like tools to perform homology tests on off-target sequences in order to obtain feasible and valid primers. Although MRPrimer tools have been proposed to design primers without the external step, they have drawbacks: they must run on a cluster of machines, or they only allow users to design primers on a specific sequence DB. However, utilizing a cluster of machines is usually expensive, and designing primers on a user's own private sequence DB is indispensable in many cases. To solve these problems, we propose a new primer design method called SSDPrimer that runs on a single machine but achieves a speed comparable to that of the MRPrimer method by exploiting SSD storage. It allows users to design primers on private sequence DBs. In addition, SSDPrimer supports a web browser interface so that users can easily design high quality primers for target sequences by querying.
Keywords: qPCR; sequence DB; primer design; SSD; single machine.
Recommending Alternative Drugs by Using Generic Drug Names to Minimize Side Effects
by Sohee Hwang, Jungrim Kim, Jeongwoo Kim, Sanghyun Park
Abstract: Healthcare and the treatment of illnesses are among the most fundamental aspects of modern human life, and drugs are the easiest approach to healthcare. For instance, drugs reduce pain, cure diseases, and maintain health. However, consuming drugs leads to diverse effects. We propose the use of generic medicine names, which are used to identify more affordable alternative drug formulations that contain similar chemical ingredients, and therefore perform similar actions, to alleviate concerns of cost. Thereby, people would have the opportunity to select drugs that suit their preferences. It is important to note that while drugs with the same generic name serve similar purposes, they may also cause different side effects. Drugs affect the human body in numerous ways that are not all necessarily beneficial. Drug-induced side effects pose a major problem, and their negative effects often have serious consequences. This paper presents a strategy to address the issue of side effects by recommending alternative drugs that have the same therapeutic effect but fewer detrimental effects. In the healthcare group, which is a platform of social networks, users share their comments and experiences with drugs, and numerical ratings of various drugs can subsequently be extracted. By integrating the generic names of drugs and data from social networks, more data can be obtained to arrive at meaningful conclusions. The process involves identifying a group of drugs with the same generic name and comparing user review-based ratings to determine the drug with fewer side effects. This paper proposes a new approach for analyzing drug-induced side effects by collecting, processing, and using data from social networks.
Keywords: Data mining; Drug recommendation; adverse drug reaction; Social Data.
CoDeT: An Easy-to-Use Community Detection Tool
by Yifei Yue, Chaokun Wang, Xiang Ying, Jun Qian
Abstract: Network data plays an important role in biological research. For example, the interactions between proteins in living cells form large complex networks. The cooperation of cells in a living body also makes up networks. As an important method for uncovering the topology information of networks, community detection has attracted great interest from researchers during the past decade, and many different methods have been developed. However, the diversity of the algorithms also makes it hard for users to choose a suitable one for a specific application. Therefore, we present CoDeT, a system which integrates 11 state-of-the-art community detection algorithms and 12 recognized metrics. In addition, CoDeT can recommend the most suitable algorithm when users choose multiple algorithms for one data set. Experimental results show that the algorithms in our system are effective on bioinformatic networks composed of multiple communities. Furthermore, CoDeT provides C++, Python and web service interfaces, so users can choose the most convenient one to start with.
Keywords: community detection; bioinformatic network analysis; graph algorithms; machine learning.
Gene Selection for Cancer Classification by Combining Minimum Redundancy Maximum Relevancy and Bat-inspired Algorithm
by Osama Alomari, Ahamad Khader, Mohammed Al-Betar, Laith Abualigah
Abstract: In this paper, the bat-inspired algorithm (BA) is tailored to gene selection for cancer classification using microarray datasets. Microarray data contains irrelevant, redundant, and noisy genes. The gene selection problem is tackled by determining the most informative genes in microarray data so that cancer can be diagnosed accurately. The gene selection problem is widely solved by optimization algorithms. BA is a recent swarm-based algorithm, which imitates the echolocation system of bats. It has been successfully applied to several optimization problems. Gene selection is tackled by combining two stages, namely, a filter stage, which uses the Minimum Redundancy Maximum Relevancy (MRMR) method, and a wrapper stage, which uses BA and SVM. To test the accuracy of the proposed method, ten microarray datasets of different sizes were used. For comparative evaluation, the proposed method was compared with popular gene selection methods. The proposed method achieves comparable results on some datasets and produces new results for one dataset.
Keywords: Bat-inspired algorithm; Optimization; Gene Selection; MRMR; SVM; Classification.
Development of a Simulation Result Management and Prediction System Using Machine Learning Techniques
by Ki Yong Lee, Young-Kyoon Suh, Kum Won Cho
Abstract: Computer simulations are widely used in various fields of science and engineering, including bioinformatics, computational biology, and fluid dynamics. Although the cost of executing simulations is rapidly increasing, the reuse of previously obtained simulation results to improve the execution of later requested simulations has not been well addressed. In this article, we propose a new simulation service system that utilizes previous simulation results actively to provide better services for later simulation requests. The proposed system automatically converts and stores completed simulation results in a database so that, if the same simulation is requested again, the corresponding result is returned immediately without re-execution. The system also can predict the results of a requested simulation via machine learning techniques. Thus, the system can avoid unnecessary computation, resulting in a reduced response time. In our experiments, the proposed system achieved very low error rates in prediction, ranging from 0.9% to 7.4%.
Keywords: Simulation Service System; Simulation Result Prediction; Machine Learning; Simulation Result Reuse.
Understanding protein-protein interaction networks from conserved patterns to conserved controllability
by Peng Gang Sun, Juan Chi
Abstract: Conserved patterns in protein-protein interaction (PPI) networks are of great importance for understanding evolution and function in multiple species. By identifying important genes (driver nodes), network controllability based on the minimum dominating set (MDS) provides a new way to study PPI networks, which motivates us to shift the focus from conserved patterns to conserved controllability. In this paper, we study controllability by taking network structures into account, which emphasizes that a driver node can control a non-driver node if they belong to a specific structure, and further extend controllability to multiple species for conserved controllability. We find that driver nodes are more likely to be tumor suppressors/drug targets and essential genes. The results over five species indicate that driver nodes across multiple species tend to be conserved, i.e., the homologous proteins of driver nodes in one species tend to be driver nodes in another species, and this tendency strengthens for the homologous protein pairs with stronger homologies. In addition, an interesting finding can be observed for conserved controllability, i.e., the five species can be classified into two groups, and within groups the conservation is stronger from low species to high species, which is just contrary to the species across the two groups.
Keywords: conserved controllability; protein-protein interaction network; multiple species.
Multi-Kernel LS-SVM Based Integration Bio-Clinical Data Analysis and Application to Ovarian Cancer
by Jaya Thomas, Lee Sael
Abstract: Medical research makes it possible to acquire diverse types of data from the same individual for a particular cancer. Recent studies show that utilizing such diverse data results in more accurate predictions. The major challenge is how to utilize such diverse data sets in an effective way.
In this paper, we introduce a multiple kernel based pipeline for integrative analysis of high-throughput molecular data (somatic mutation, copy number alteration, DNA methylation and mRNA) and clinical data. We apply the pipeline to ovarian cancer data from TCGA. After a combined kernel has been generated as the weighted sum of individual kernels, it is used to stratify patients and predict clinical outcomes.
We examine the survival time, vital status, and neoplasm cancer status of each subtype to verify how well they cluster. We have also examined the power of molecular and clinical data in predicting dichotomized overall survival data and in classifying the tumor grade for the cancer samples. It was observed that the integration of various data types yields a higher log-rank statistic. We were also able to predict clinical status with higher accuracy as compared to using individual data types.
Keywords: Integrative analysis; Least squares multiple kernel; Bio-clinical data; Ovarian cancer.
Ensemble Classifier Design Selecting Important Genes based on Extracted Features
by Soumen Kumar Pati, Asit Kumar Das
Abstract: An ensemble classifier depends heavily on the nature of the dataset, and its efficiency degrades sharply in the presence of irrelevant features. Because of the distinct characteristics inherent to specific cancers, selecting the most informative genes from a high-volume microarray dataset is a challenging bioinformatics research topic. In this paper, the informative genes are selected based on some prominent features generated using statistical and probabilistic concepts. The selected genes are passed to a genetic algorithm that intelligently selects an appropriate combination of classifiers. Non-linear uniform cellular automata are employed to generate the initial population; multipoint crossover and a unique jumping gene mechanism for mutation preserve diversity in the population; and a steady-state fitness function is introduced to achieve maximum accuracy with a minimum number of classifiers. Many classifiers of distinct characteristics are considered as base classifiers. The performance of the proposed method is compared with state-of-the-art algorithms to demonstrate its effectiveness.
Keywords: ensemble classifier; informative gene; bioinformatics research; statistical concept; probabilistic concept; genetic algorithm; cellular automata; jumping gene mutation; microarray dataset.
Pattern Recognition of Chemical Compounds using Multiple Dose-Response Curves
by Jiao Chen, Tianhong Pan, Shan Chen, Xiaobo Zou, Kaili Xu
Abstract: To determine distinct chemical properties characterized by Mechanism of Action (MoA), a pattern recognition algorithm using multiple dose-response curves is developed in this paper. By monitoring the dynamic time-dependent cellular response profiles (TCRPs) of living cells via Real Time Cellular Analyzer, changes in cell number caused by different MoAs are recorded as a time series. Based on the toxic-effect observed in TCRPs, a dose-response curve is established, which reflects the cytotoxicity of the tested chemicals. Features that reflect the levels of cytotoxicity are extracted from the dose-response curves. Singular value decomposition (SVD) is then applied to reduce the effect of collinearity in the extracted features. A k-means clustering method with deterministic initial centers is employed to classify the compressed features. As a result, the tested chemicals are classified into several groups. The proposed method enables relatively high throughput screening for chemical recognition at the cellular level.
Keywords: Mechanism of Action (MoA); Time-dependent cell response profile (TCRP); Toxic-effect; k-means cluster; Dose-response curve.
Iteration method for detecting disease genes in terms of the integration of the cellular compartment information with the protein-protein interaction data
by XiWei Tang, Wei Peng, Minzhu Xie
Abstract: Many computational approaches identify disease genes based on protein-protein interaction (PPI) networks because of the 'Guilt-by-Association' principle. However, the defects of the PPI data severely reduce the accuracy of the prediction methods. In the current study, a new framework called IMIDG is developed to identify causal genes for diseases. First, the reliability of the interactions among proteins is quantified by incorporating subcellular localization information into the human PPI networks, and weighted networks are built. Based on the weighted PPI networks, an iteration function is applied to score and rank the disease candidate genes. Leave-one-out cross-validation (LOOCV) and a literature study are used to test the IMIDG, DADA and ToppNet algorithms. The areas under the curves show that IMIDG outperforms the DADA and ToppNet methods in prioritizing disease candidate genes. Additionally, out of the 18 novel genes in the top 50 gene set, 5 genes are reported in the literature to be associated with colorectal cancer, suggesting the remaining genes for further investigation.
Keywords: iteration method; subcellular localization; protein-protein interaction; disease gene.
Accurate Annotation of Metagenomic data without species-level references
by Haobin Yao, Tak-wah Lam, Hing-Fung Ting, Siu-Ming Yiu, Yadong Wang, Bo Liu
Abstract: In this paper, we propose a novel annotation tool, MetaAnnotator, to annotate metagenomic reads, which outperforms all existing tools significantly when only genus-level references exist in the database. In our experiments, MetaAnnotator assigns 87.5% of reads correctly (67.5% of reads are assigned to the exact genus) with only 8.5% of reads wrongly assigned. The best existing tool (MetaCluster-TA) achieves only 73.4% correct read assignment (with only 50.9% of reads assigned to the exact genus and 22.6% of reads wrongly assigned). The core concepts behind MetaAnnotator are: (i) we only consider exact k-mers in coding regions of the references as they should be more significant and accurate; (ii) to assign reads to taxonomy nodes, we construct genome and taxonomy specific probabilistic models from the reference database; and (iii) we use the BWT data structure to speed up the k-mer matching process.
Keywords: metagenomic data analysis; binning; accurate and fast annotation.
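The exact k-mer matching idea in the abstract above can be sketched in a few lines. Note the substitution: this illustration uses a plain Python set as the k-mer index, not the BWT data structure that MetaAnnotator uses, and it omits the coding-region filter and the probabilistic models. The sequences, the value of k, and the function names are hypothetical.

```python
def kmers(sequence, k):
    """Set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_fraction(read, reference_kmers, k):
    """Fraction of a read's distinct k-mers that occur exactly in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0  # read shorter than k
    return len(read_kmers & reference_kmers) / len(read_kmers)


# Hypothetical reference and reads.
k = 4
reference_index = kmers("ACGTACGT", k)
matching_read_score = shared_kmer_fraction("CGTACG", reference_index, k)
unrelated_read_score = shared_kmer_fraction("CGTTTT", reference_index, k)
```

A set lookup keeps the example short; an index such as the BWT serves the same exact-match purpose with far lower memory use on genome-scale references.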