International Journal of Data Mining and Bioinformatics (17 papers in press)
Identifying cis/trans-acting expression Quantitative Trait Loci (eQTL)
by Mingon Kang, Dongchul Kim, Chunyu Liu, Jean Gao
Abstract: Expression Quantitative Trait Loci (eQTL) studies have played an important role in discovering novel susceptibility genes and regulatory mechanisms of human diseases. High-throughput microarray technologies make it possible to measure the expression of thousands of genes at the same time, and this advance provides insight into the genetic architecture of gene expression. A number of multivariate methods have been proposed to identify loci associated with gene expression, taking into account interactive effects and relationships between the units. However, large data sets tend to increase false positives in these studies. We propose a Cis/Trans eQTL Association Mapping (CTAM) method to (1) take co-expressed genes into account without clustering or partitioning techniques, (2) build a mathematical model for cis- and trans-eQTL based on biological prior knowledge, and (3) identify significant disease-associated genes. The power to detect both joint effects and group effects of SNPs and gene expressions is assessed in simulation studies. We also applied the method to psychiatric disorder data. CTAM detects associations between cis/trans-acting eQTLs and genes.
Keywords: eQTL analysis; cis/trans-acting eQTL; multivariate.
Analysis of clustered RNA-seq Data
by Hyunjin Park, Seungyeoun Lee, Ye Jin Kim, Myung-Sook Choi, Taesung Park
Abstract: RNA sequencing (RNA-seq) technology has become a powerful tool for measuring transcript levels. Through this high-throughput technology, we can investigate post-transcriptional modifications, non-coding RNAs, mutations, gene fusions, and changes in gene expression levels. Recently, many methods have been developed to find differentially expressed genes (DEGs) between treatment groups. Most of these methods assume that RNA-seq data are generated independently from different subjects. However, clustered RNA-seq data, such as paired RNA-seq data from the same patient, are now commonly observed, and existing methods cannot adequately handle them. In this paper, we propose a new testing method based on the Generalized Estimating Equations (GEE) approach, which is widely used to analyze repeatedly measured data. Our GEE-based approach accounts for the correlations between RNA-seq data appropriately, which results in increased power in detecting DEGs. Through real data analysis and simulation studies, we compare the GEE method with existing methods, particularly with regard to sensitivity in detecting DEGs and false discovery rates.
Keywords: RNA-seq; differentially expressed gene; DEG; simultaneously; multivariate; Generalized Estimating Equations; GEE; false discovery rate (FDR).
CoDeT: An Easy-to-Use Community Detection Tool
by Yifei Yue, Chaokun Wang, Xiang Ying, Jun Qian
Abstract: Network data play an important role in biological research. For example, the interactions between proteins in living cells form large complex networks. The cooperation of cells in a living body also forms networks. As an important method for uncovering the topological information of networks, community detection has attracted great interest from researchers during the past decade, and many different methods have been developed. However, the diversity of the algorithms also makes it difficult for users to choose a suitable one for a specific application. Therefore, we present CoDeT, a system that integrates 11 state-of-the-art community detection algorithms and 12 recognized metrics. In addition, CoDeT can recommend the most suitable algorithm when users choose multiple algorithms for one data set. Experimental results show that the algorithms in our system are effective on bioinformatic networks composed of multiple communities. With the provided C++, Python and web service interfaces, users can choose the most convenient one to start with.
Keywords: community detection; bioinformatic network analysis; graph algorithms; machine learning.
Gene Selection for Cancer Classification by Combining Minimum Redundancy Maximum Relevancy and Bat-inspired Algorithm
by Osama Alomari, Ahamad Khader, Mohammed Al-Betar, Laith Abualigah
Abstract: In this paper, the bat-inspired algorithm (BA) is adapted to gene selection for cancer classification using microarray datasets. Microarray data contain irrelevant, redundant, and noisy genes. The gene selection problem is tackled by determining the most informative genes in the microarray data in order to diagnose cancer accurately, and it is widely solved by optimization algorithms. BA is a recent swarm-based algorithm that imitates the echolocation system of bats. It has been successfully applied to several optimization problems. Gene selection is tackled by combining two stages: a filter stage, which uses the Minimum Redundancy Maximum Relevancy (MRMR) method, and a wrapper stage, which uses BA and SVM. To test the accuracy of the proposed method, ten microarray datasets of different sizes were used. For comparative evaluation, the proposed method was compared with popular gene selection methods. The proposed method achieved comparable results on some datasets and produced new results for one dataset.
Keywords: Bat-inspired algorithm; Optimization; Gene Selection; MRMR; SVM; Classification.
Development of a Simulation Result Management and Prediction System Using Machine Learning Techniques
by Ki Yong Lee, Young-Kyoon Suh, Kum Won Cho
Abstract: Computer simulations are widely used in various fields of science and engineering, including bioinformatics, computational biology, and fluid dynamics. Although the cost of executing simulations is rapidly increasing, the reuse of previously obtained simulation results to improve the execution of later simulation requests has not been well addressed. In this article, we propose a new simulation service system that actively uses previous simulation results to provide better service for later simulation requests. The proposed system automatically converts and stores completed simulation results in a database so that, if the same simulation is requested again, the corresponding result is returned immediately without re-execution. The system can also predict the results of a requested simulation via machine learning techniques. Thus, the system can avoid unnecessary computation, resulting in a reduced response time. In our experiments, the proposed system achieved very low prediction error rates, ranging from 0.9% to 7.4%.
Keywords: Simulation Service System; Simulation Result Prediction; Machine Learning; Simulation Result Reuse.
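The result-reuse idea described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name `SimulationResultStore`, the in-memory dictionary standing in for the database, and the toy simulation function are hypothetical and are not taken from the paper. The machine-learning prediction component is not shown.

```python
import json


class SimulationResultStore:
    """Hypothetical in-memory stand-in for a database of completed results."""

    def __init__(self, run_simulation):
        self._run = run_simulation   # function that actually executes a simulation
        self._results = {}           # canonical parameter key -> stored result
        self.executions = 0          # how many times the simulation really ran

    @staticmethod
    def _key(params):
        # Serialise with sorted keys so parameter order does not matter.
        return json.dumps(params, sort_keys=True)

    def request(self, params):
        key = self._key(params)
        if key not in self._results:
            # First request with these parameters: execute and store.
            self.executions += 1
            self._results[key] = self._run(**params)
        # Repeated request: return the stored result without re-execution.
        return self._results[key]


def toy_simulation(velocity, steps):
    # Placeholder for an expensive simulation (assumption for illustration).
    return velocity * steps


store = SimulationResultStore(toy_simulation)
first = store.request({"velocity": 2.0, "steps": 10})
second = store.request({"steps": 10, "velocity": 2.0})  # same request, reordered
```

Here the second request is served from the store, so the simulation function is executed only once.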
Understanding protein-protein interaction networks from conserved patterns to conserved controllability
by Peng Gang Sun, Juan Chi
Abstract: Conserved patterns in protein-protein interaction (PPI) networks are of great importance for understanding evolution and function in multiple species. By identifying important genes (driver nodes), network controllability based on the minimum dominating set (MDS) provides a new way to study PPI networks, which motivates us to shift the focus from conserved patterns to conserved controllability. In this paper, we study controllability by taking network structures into account, emphasizing that a driver node can control a non-driver node if they belong to a specific structure, and we further extend controllability to multiple species to obtain conserved controllability. We find that driver nodes are more likely to be tumor suppressors, drug targets, and essential genes. The results over five species indicate that driver nodes tend to be conserved across species, i.e., the homologous proteins of driver nodes in one species tend to be driver nodes in another species, and this tendency strengthens for homologous protein pairs with stronger homology. In addition, an interesting finding emerges for conserved controllability: the five species can be classified into two groups, and within groups the conservation is stronger from lower to higher species, which is the opposite of what is observed across the two groups.
Keywords: conserved controllability; protein-protein interaction network; multiple species.
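As a rough illustration of the dominating-set idea mentioned in this abstract, the sketch below selects driver nodes so that every node is either a driver or adjacent to one. Finding a true minimum dominating set is NP-hard, so this uses a simple greedy approximation; it is not the structure-aware method of the paper, and the function name and the toy network are hypothetical.

```python
def greedy_dominating_set(adjacency):
    """Greedy approximation of a dominating set.

    adjacency: dict mapping each node to the set of its neighbours
    (undirected graph). Returns the chosen driver nodes in selection order.
    """
    undominated = set(adjacency)
    drivers = []
    while undominated:
        # Choose the node that dominates the most still-undominated nodes
        # (itself plus its neighbours); sorting makes tie-breaking deterministic.
        best = max(
            sorted(adjacency),
            key=lambda node: len(({node} | adjacency[node]) & undominated),
        )
        drivers.append(best)
        undominated -= {best} | adjacency[best]
    return drivers


# Toy PPI network (assumption for illustration): a hub A with a short tail D-E.
ppi = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}
drivers = greedy_dominating_set(ppi)
print(drivers)  # ['A', 'D']
```

The greedy choice gives no guarantee of minimality in general; exact MDS computations typically rely on integer linear programming.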
Multi-Kernel LS-SVM Based Integration Bio-Clinical Data Analysis and Application to Ovarian Cancer
by Jaya Thomas, Lee Sael
Abstract: Medical research makes it possible to acquire diverse types of data from the same individual for a particular cancer. Recent studies show that utilizing such diverse data results in more accurate predictions. The major challenge is how to utilize such diverse data sets effectively.
In this paper, we introduce a multiple-kernel-based pipeline for integrative analysis of high-throughput molecular data (somatic mutation, copy number alteration, DNA methylation and mRNA) and clinical data. We apply the pipeline to ovarian cancer data from TCGA. After a combined kernel has been generated as the weighted sum of the individual kernels, it is used to stratify patients and predict clinical outcomes.
We examine the survival time, vital status, and neoplasm cancer status of each subtype to verify how well they cluster. We have also examined the power of molecular and clinical data in predicting dichotomized overall survival and in classifying the tumor grade of the cancer samples. We observed that the integration of various data types yields a higher log-rank statistic. We were also able to predict clinical status with higher accuracy than when using individual data types.
Keywords: Integrative analysis; Least squares multiple kernel; Bio-clinical data; Ovarian cancer.
Ensemble Classifier Design Selecting Important Genes based on Extracted Features
by Soumen Kumar Pati, Asit Kumar Das
Abstract: The performance of an ensemble classifier depends heavily on the nature of the dataset, and its efficiency degrades considerably in the presence of irrelevant features. Because of the distinct characteristics inherent to specific cancers, selecting the most informative genes from a high-volume microarray dataset is a challenging bioinformatics research topic. In this paper, informative genes are selected based on prominent features generated using statistical and probabilistic concepts. The selected genes are passed to a genetic algorithm that selects an appropriate combination of classifiers, with many classifiers of distinct characteristics considered as base classifiers. Non-linear uniform cellular automata are employed to generate the initial population, multipoint crossover and a unique jumping gene mechanism for mutation are used to preserve diversity in the population, and a steady-state fitness function is introduced to achieve maximum accuracy with a minimum number of classifiers. The performance of the proposed method is compared with state-of-the-art algorithms to demonstrate its effectiveness.
Keywords: ensemble classifier; informative gene; bioinformatics research; statistical concept; probabilistic concept; genetic algorithm; cellular automata; jumping gene mutation; microarray dataset.
Pattern Recognition of Chemical Compounds using Multiple Dose-Response Curves
by Jiao Chen, Tianhong Pan, Shan Chen, Xiaobo Zou, Kaili Xu
Abstract: To determine distinct chemical properties characterized by Mechanism of Action (MoA), a pattern recognition algorithm using multiple dose-response curves is developed in this paper. By monitoring the dynamic time-dependent cellular response profiles (TCRPs) of living cells via a Real Time Cellular Analyzer, changes in cell number caused by different MoAs are recorded as a time series. Based on the toxic effect observed in the TCRPs, a dose-response curve is established, which reflects the cytotoxicity of the tested chemicals. Features that reflect the levels of cytotoxicity are extracted from the dose-response curves. Singular value decomposition (SVD) is then applied to reduce the effect of collinearity in the extracted features. A k-means clustering method with deterministic initial centers is employed to classify the compressed features. As a result, the tested chemicals are classified into several groups. The proposed method enables relatively high-throughput screening for chemical recognition at the cellular level.
Keywords: Mechanism of Action (MoA); Time-dependent cell response profile (TCRP); Toxic-effect; k-means cluster; Dose-response curve.
Accurate Annotation of Metagenomic data without species-level references
by Haobin Yao, Tak-wah Lam, Hing-Fung Ting, Siu-Ming Yiu, Yadong Wang, Bo Liu
Abstract: In this paper, we propose a novel annotation tool, MetaAnnotator, to annotate metagenomic reads, which significantly outperforms all existing tools when only genus-level references exist in the database. In our experiments, MetaAnnotator assigns 87.5% of reads correctly (67.5% of reads are assigned to the exact genus), with only 8.5% of reads wrongly assigned. The best existing tool (MetaCluster-TA) achieves only 73.4% correct read assignment (with only 50.9% of reads assigned to the exact genus and 22.6% of reads wrongly assigned). The core concepts behind MetaAnnotator include: (i) we only consider exact k-mers in coding regions of the references, as they should be more significant and accurate; (ii) to assign reads to taxonomy nodes, we construct genome- and taxonomy-specific probabilistic models from the reference database; and (iii) we use the BWT data structure to speed up the k-mer matching process.
Keywords: metagenomic data analysis; binning; accurate and fast annotation.
A Novel Low-rank Representation Method for Identifying Differentially Expressed Genes
by Xiu-Xiu Xu, Ying-Lian Gao, Jin-Xing Liu, Ya-Xuan Wang, Ling-Yun Dai, Xiang-Zhen Kong, Sha-Sha Yuan
Abstract: Low-rank representation (LRR) has attracted considerable attention in recent years. However, LRR has a major shortcoming: it uses the nuclear norm to approximate the non-convex rank function. This approximation minimizes all singular values, so the nuclear norm may not approximate the rank function well. In this paper, we propose a novel low-rank method that replaces the nuclear norm with the truncated nuclear norm to approximate the rank function, and we apply it to identifying differentially expressed genes. The truncated nuclear norm is defined as the sum of the smaller singular values, which may be a better approximation of the rank function than the nuclear norm. To ensure convergence, the optimization problem is solved by the augmented Lagrange multiplier method, which has a convergence guarantee. The experimental results demonstrate that our method outperforms the LLRR, TRPCA and RPCA methods.
Keywords: differentially expressed genes; truncated nuclear norm; low-rank; augmented Lagrange multiplier; TCGA data.
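For reference, the difference between the two norms named in this abstract can be written out using the commonly used definition of the truncated nuclear norm. The symbols here are assumptions introduced for illustration (a matrix X of size m by n, singular values sorted in decreasing order, truncation parameter r); the paper's own notation and exact formulation may differ.

```latex
\|X\|_{*} = \sum_{i=1}^{\min(m,n)} \sigma_i(X),
\qquad
\|X\|_{r} = \sum_{i=r+1}^{\min(m,n)} \sigma_i(X),
\qquad
\sigma_1(X) \ge \sigma_2(X) \ge \cdots \ge \sigma_{\min(m,n)}(X) \ge 0 .
```

Under this definition the r largest singular values are left unpenalised, so only the smaller ones are driven toward zero.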
Medical Examination Data Prediction with Missing Information Imputation Based on Recurrent Neural Networks
by Han-Gyu Kim, Gil-Jin Jang, Ho-Jin Choi, Myungeun Lim, Jaehun Choi
Abstract: In this work, recurrent neural networks (RNNs) for medical examination data prediction with missing information are proposed. The simple recurrent network (SRN), long short-term memory (LSTM) and gated recurrent unit (GRU) are selected among the many variations of RNNs for missing information imputation, and they are also used to predict future medical examination data. In addition, missing information imputation based on a bidirectional LSTM is proposed to consider both past and future information in the imputation process, whereas traditional RNNs can only consider past information during imputation. We conducted a medical examination results prediction experiment using an examination database of Koreans. The experimental results showed that the proposed RNNs worked better than the baseline linear regression method, and that the bidirectional LSTM performed best for missing information imputation.
Keywords: Medical Examination Data Prediction; Recurrent Neural Network; Long Short-Term Memory; Gated Recurrent Unit; Bidirectional LSTM.
A Markov blanket-based approach for finding high-dimensional genetic interactions associated with disease in family-based studies
by Hyo Jung Lee, Jae Won Lee, Hee Jeong Yoo, Seohoon Jin, Mira Park
Abstract: Detecting genetic interactions associated with complex disease is a major issue in genetic studies. Although a number of methods to detect gene-gene interactions for population-based Genome-Wide Association Studies (GWAS) have been developed, statistical methods for family-based GWAS have been limited. In this study, we propose a new Bayesian approach called MB-TDT to find high-order genetic interactions for pedigree data. The MB-TDT method combines the Markov blanket algorithm with the classical Transmission Disequilibrium Test (TDT) statistic. The Incremental Association Markov Blanket (IAMB) algorithm was adopted for large-scale Markov blanket discovery. We evaluated the proposed method using both real and simulated data sets. In a simulation study, we compared the power of MB-TDT with that of conditional logistic regression, Multifactor Dimensionality Reduction (MDR) and the MDR-pedigree disequilibrium test (MDR-PDT), and MB-TDT showed superior power in many cases. To demonstrate the approach, we analysed the Korean autism disorder GWAS data. The MB-TDT method can identify a minimal set of causal SNPs associated with a specific disease, thus avoiding an exhaustive search.
Keywords: genetic associations; gene-gene interactions; Markov blanket; pedigree data; transmission disequilibrium test.
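The classical TDT statistic referred to in this abstract can be sketched as follows. This shows only the standard single-marker TDT, not MB-TDT or the Markov blanket search; the function name and the example counts are hypothetical. Given b transmissions and c non-transmissions of an allele from heterozygous parents to affected offspring, the statistic is (b - c)^2 / (b + c), which is approximately chi-square distributed with one degree of freedom under the null hypothesis.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """Classical TDT for one biallelic marker.

    transmitted:   times heterozygous parents passed the allele to an
                   affected child (b)
    untransmitted: times they did not (c)
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Survival function of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts for illustration: 60 transmissions versus 40 non-transmissions.
chi2, p_value = tdt_statistic(60, 40)
print(chi2)  # 4.0
```

Only heterozygous parents are informative, which is why homozygous parents do not enter the counts.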
Formal concept analysis for knowledge discovery from biological data
by Khalid Raza
Abstract: Owing to the rapid advancement in high-throughput technologies, such as microarrays and next generation sequencing, the volume of biological data is increasing exponentially. The current challenge in computational biology and bioinformatics research is how to analyse these huge raw biological datasets to extract meaningful biological knowledge. Formal concept analysis is a method based on lattice theory and widely used for data analysis, knowledge representation, knowledge discovery and knowledge management across several domains. This paper reviews the applications of formal concept analysis for knowledge discovery from biological data, including gene expression discretisation, gene co-expression mining, gene expression clustering, finding genes in gene regulatory networks, enzyme/protein classifications, binding site classifications, and domain-domain interaction. It also presents a list of FCA-based software tools applied to the biological domain, and covers the challenges and future directions in this field.
Keywords: formal concept analysis; microarray analysis; gene expression mining; concept lattice; biological databases.
Recommending alternative drugs by using generic drug names to minimise side effects
by Sohee Hwang, Jungrim Kim, Jeongwoo Kim, Sanghyun Park
Abstract: Healthcare and the treatment of illnesses are among the most fundamental aspects of modern human life, and drugs are the easiest approach to healthcare. However, consuming drugs leads to diverse effects. We propose the use of generic drug names; importantly, while drugs with the same generic name serve similar purposes, they may cause different side effects. This paper presents a strategy to address the issue of side effects by recommending alternative drugs that have the same therapeutic effect but fewer detrimental effects. By integrating the generic names of drugs with data from social networks, more data can be obtained to arrive at meaningful conclusions. This paper proposes a new approach for analysing drug-induced side effects by collecting, processing, and using data from social networks.
Keywords: data mining; drug recommendation; adverse drug reaction; social data; side effect; generic name; drug-induced; user comment; alleviated side effect; alternative drug.
Iteration method for detecting disease genes in terms of the integration of the cellular compartment information with the protein-protein interaction data
by Xiwei Tang, Wei Peng, Minzhu Xie
Abstract: Many computational approaches identify disease genes based on protein-protein interaction (PPI) networks, following the 'guilt by association' principle. However, defects in the PPI data severely reduce the accuracy of these prediction methods. In the current study, a new framework called IMIDG is developed to identify causal genes for diseases. First, the reliability of the interactions among proteins is quantified by incorporating subcellular localisation information into the human PPI networks, and weighted networks are built. Based on the weighted PPI networks, an iteration function is applied to score and rank the disease candidate genes. Leave-one-out cross-validation (LOOCV) and a literature study are used to test the IMIDG, DADA and ToppNet algorithms. The areas under the curves show that IMIDG outperforms the DADA and ToppNet methods in prioritising disease candidate genes. Additionally, out of the 18 novel genes in the top 50 gene set, five genes are reported in the literature to be associated with colorectal cancer, which suggests that the remaining genes merit further investigation.
Keywords: iteration method; subcellular localisation; protein-protein interaction; disease gene.
Search for regions with periodicity using the random position weight matrices in the C. elegans genome
by Eugene V. Korotkov, Maria A. Korotkova
Abstract: The present study developed a mathematical method for determining tandem repeats in a DNA sequence. A multiple alignment of periods was calculated by direct optimisation of the position-weight matrix (PWM) without using pairwise alignments or searching for similarity between periods. A new mathematical algorithm for periodicity search was developed using random PWMs. The developed algorithm was applied to the DNA sequences of the C. elegans genome. A total of 25,360 regions were found to possess a periodicity with a length of 2 to 50 bases. On average, a periodicity of ~4000 nucleotides was found to be associated with each region. A significant portion of the revealed regions possess periods of 10 and 11 nucleotides, periods that are multiples of 10 nucleotides, and periods in the vicinity of 35 nucleotides. Only ~30% of the periods found had been discovered previously. The origin of periodicity with insertions and deletions is also discussed.
Keywords: period; sequence; random matrix; alignment; multiple alignment; tandem repeats; weight matrices; similarity; dynamic programming.