Forthcoming articles


International Journal of Data Mining and Bioinformatics


These articles have been peer-reviewed and accepted for publication in IJDMB, but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.


Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.


International Journal of Data Mining and Bioinformatics (23 papers in press)


Regular Issues


  • Pupylation Sites Prediction with Ensemble Classification Model
    by Wenzheng Bao, Zhenhua Huang, Chang-An Yuan, De-Shuang Huang 
    Abstract: Post-translational modification of proteins is one of the most important biological processes in the field of proteomics and bioinformatics. Pupylation is a novel post-translational modification in which the small, intrinsically disordered prokaryotic ubiquitin-like protein is conjugated to lysine residues of potential segments. Both experimental and computational prediction of such modified sites have proved to be challenging. Computational methods mainly aim at extracting effective features from the potential protein segments. In this paper, a statistical feature of adjacent amino acid residues is proposed; this novel feature combines the appearance of adjacent amino acids with the BLOSUM62 matrix. The Neural Network and the Na
    Keywords: lysine pupylation; neural network; naïve bayes; post-translational modification.

  • A Markov blanket-based approach for finding high-dimensional genetic interactions associated with disease in family-based studies
    by Hyo Jung Lee, Jae Won Lee, Hee Jeong Yoo, Seohoon Jin, Mira Park 
    Abstract: Detecting genetic interactions associated with complex disease is a major issue in genetic studies. Although a number of methods to detect gene-gene interactions for population-based genome-wide association studies (GWAS) have been developed, the statistical methods for family-based GWAS have been limited. In this study, we propose a new Bayesian approach called MB-TDT to find high-order genetic interactions for pedigree data. The MB-TDT method combines the Markov blanket algorithm with the classical transmission disequilibrium test (TDT) statistic. The incremental association Markov blanket (IAMB) algorithm was adopted for large-scale Markov blanket discovery. We evaluated the proposed method using both real and simulated datasets. In a simulation study, we compared the power of MB-TDT with conditional logistic regression, multifactor dimensionality reduction (MDR) and MDR-pedigree disequilibrium test (MDR-PDT). We demonstrated the superior power of MB-TDT in many cases. To demonstrate the approach, we analyzed the Korean autism disorder GWAS data. The MB-TDT method can identify a minimal set of causal SNPs associated with a specific disease, thus avoiding an exhaustive search.
    Keywords: Genetic associations; Gene-gene interactions; Markov blanket; Pedigree data; Transmission disequilibrium test.

  • Transcriptomic and network analyses combine to identify genes that drive the red blood cell cycle of Plasmodium falciparum
    by Xinran Yu, Hao Zhang, Timothy Lilburn, Hong Cai, Jianying Gu, Turgay Korkmaz, Yufeng Wang 
    Abstract: Despite coordinated attempts to control or eliminate it, malaria remains a widespread public health problem, with half the world's population (3.2 billion people) at risk. While the annual death toll attributed to malaria has declined in recent years, mortality is still very high. The World Health Organisation estimates that in 2015 between 236,000 and 635,000 people died, and the disease cost the continent of Africa, where 91% of cases occur, about USD 12 billion. The contribution of genomics to the defeat of malaria has been relatively small until recently. Although genomic data are available, much of it is difficult to interpret, as this parasite has no well-studied close relatives. This has led to a need for computationally-driven tools that will help us understand the dynamic cellular networks in the malaria parasite. This understanding, in turn, will help us identify new antimalarial targets in the parasite. Here, we coupled RNA-Seq analysis and network mining using a PageRank-based algorithm, and examined the temporal-specific expression of parasite genes during the 48-hour red blood cell cycle. We identified genes that appear to influence parasite development and red blood cell invasion. The just-in-time mechanism for gene expression may contribute to a dynamic and effective adaptive strategy of the malaria parasite.
    Keywords: malaria; development cycle; RNA-Seq; PageRank; systems biology; Plasmodium falciparum.

  • Randomized Sequential and Parallel Algorithms for Efficient Quorum Planted Motif Search
    by Peng Xiao, Soumitra Pal, Sanguthevar Rajasekaran 
    Abstract: Discovering patterns in biological sequences is very important for extracting useful information from them. Motifs are crucial patterns that have numerous applications including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity between families of proteins, etc. Motif search is an important step in obtaining meaningful patterns from biological data. The general problem of motif search is intractable. There are many models of motif search proposed in the literature. Among these, the (l, d)-motif model, which is also known as Planted Motif Search (PMS), is widely studied. However, most of the existing algorithms are deterministic and the role of randomization in this area is still unexploited. This paper proposes an elegant as well as efficient randomized algorithm, named qPMS10, to solve PMS. The idea is based on random sampling. We also prove that if we choose the parameters carefully, then our result will be correct with a high probability. We utilize the most efficient PMS solver until now, named qPMS9, as a subroutine. We analyze the time complexity of both algorithms and provide a performance comparison of qPMS10 with qPMS9 on standard benchmark datasets. In addition, we offer a parallel implementation of qPMS10 and run tests using up to 4 processors. Both theoretical and empirical analyses demonstrate that our randomized algorithm outperforms the existing algorithms for solving PMS. We believe that our techniques can also be extended to other motif search models, such as Simple Motif Search (SMS) and Edit-distance based Motif Search (EMS).
    Keywords: motif search; planted motif search; (l, d)-motif search; randomized algorithms; DNA and protein sequences; parallel algorithms for motif search.

  • Dynamics in the neural network of an in vitro epilepsy model
    by Bowen Liu, Junwei Mao, Yejun Shi, Qinchi Lu, Peiji Liang, Puming Zhang 
    Abstract: Epilepsy is increasingly considered a brain network disorder. In this study, epileptiform discharges induced by low-Mg2+ were recorded with a micro-electrode array. Dynamic effective network connectivity was constructed by calculating the time-variant partial directed coherence (tvPDC) of signals. We proposed a novel approach to track the state transitions of epileptic networks, and characterized the network topology by using graphical measures. We found that the network hub nodes coincided with the epileptogenic zone in previous electrophysiological findings. Two network states with distinct topologies were identified during the ictal-like discharges. The small-worldness significantly increased in the second state. Our results indicate the ability of tvPDC to capture the causal interaction between multi-channel signals, which is important in identifying the epileptogenic zone. Moreover, the evolution of network states extends our knowledge of the network drivers for the initiation and maintenance of ictal activity, and suggests the practical value of our network clustering approach.
    Keywords: epilepsy; microelectrode array; dynamic network; graph theory; granger causality; tonic-clonic; network analysis; hippocampus; entorhinal cortex; small-worldness; low Mg2+.

  • BioNimbuZ: A Federated Cloud Platform for Bioinformatics Applications
    by Michel Rosa, Breno Moura, Guilherme Vergara, Aletéia Araujo, Maristela Holanda, Maria Emilia Walter 
    Abstract: Challenges in bioinformatics include the need for tools that handle large-scale processing, mainly due to the large volumes of data generated by high-throughput sequencing machines. However, many existing tools are not user friendly, and do not distribute their workloads properly. In federated cloud environments, even though services and resources are shared and available online, the processes of a workflow execution are almost entirely unautomated, and the majority of these processes do not efficiently balance their workloads. This paper presents a federated cloud platform called BioNimbuZ, a hybrid platform designed to execute bioinformatics applications easily, efficiently, and with good workload balance. Our tests were performed using a real bioinformatics workflow, with fragments generated by the Illumina sequencer, and achieved good performance in practice.
    Keywords: BioNimbuZ; cloud computing; federated cloud computing; bioinformatics applications.

  • Multivariate summary approach to omics data from crossover design with two repeated factors
    by Sunghoon Choi, Soo-yeon Park, Hoejin Kim, Taesung Park, Oran Kwon 
    Abstract: A crossover design with two repeated factors is commonly used for analyzing tolerance tests, i.e., measurements of physiologic response following ingestion of some exogenous substance. For data analysis using a crossover design, a standard approach is to use linear mixed effect models (LMMs), as these can adequately handle correlated measurements from the crossover design. Alternatively, univariate analyses using single summary statistics can be employed for assessments such as the difference of measurements between time points, incremental area under curve (iAUC), Cmax, etc. However, the use of summary measures may result in the loss of information. In this study, instead of using one single summary measure, we propose using multiple summary measures simultaneously through LMMs by taking their correlation into account. We compare the performance of the proposed method with other existing methods through real data analysis and simulation studies. We show that our proposed method has equivalent power to that of the standard LMM approach, while using far fewer parameters.
    Keywords: Linear mixed effect model; Crossover design; Repeated measurements.

  • Nonlinear-RANSAC parameter optimization for dynamic molecular systems and signaling pathways
    by Mingon Kang, Liping Tang, Jean Gao 
    Abstract: Rigorous mathematical modeling and accurate parameter estimation are indispensable for building reliable models that represent dynamic characteristics of biological systems. A challenging task in modeling complex biological systems is the accurate estimation of the large number of unknown parameters in the mathematical model. To tackle this problem, we develop a data-driven optimization method, nonlinear RANSAC, based on linear RANdom SAmple Consensus (a.k.a. RANSAC). The conventional RANSAC method is sound and simple, but it is oriented towards linear system models. Our proposed nonlinear RANSAC extends its capability to nonlinear systems, while preserving the strengths of RANSAC. We applied nonlinear RANSAC to the dynamic molecular systems of phagocyte transmigration and signaling pathways. The parameters of mathematical equations for the phagocyte transmigration system were estimated by the proposed nonlinear RANSAC, and its performance was compared with that of ordinary least squares. Nonlinear RANSAC was also applied to signaling pathways, where mathematical equations are formulated using ordinary differential equations that represent molecular interactions between two biological components.
    Keywords: Nonlinear RANSAC; parameter estimation; dynamic molecular systems; signaling pathway.

  • Identifying cis/trans-acting expression Quantitative Trait Loci (eQTL)
    by Mingon Kang, Dongchul Kim, Chunyu Liu, Jean Gao 
    Abstract: Expression Quantitative Trait Loci (eQTL) studies have played an important role in discovering novel susceptibility genes and regulatory mechanisms of human diseases. High-throughput microarray technologies allow thousands of gene expression levels to be measured at the same time, and this advance enables one to gain insight into the genetic architecture of gene expression. A number of multivariate methods have been proposed to identify loci associated with gene expression, taking into account interactive effects and relationships between the units. However, the large data tend to increase false positives in the studies. We propose a Cis/Trans eQTL Association Mapping (CTAM) method to (1) account for co-expressed genes without clustering or partitioning techniques, (2) build a mathematical model for cis- and trans-eQTL based on biological prior knowledge, and (3) identify significant disease-associated genes. The power to detect both joint effect and group effect of SNPs and gene expressions is assessed in the simulation studies. We also applied it to a study of psychiatric disorder data. CTAM detects associations between cis/trans-acting eQTLs and genes.
    Keywords: eQTL analysis; cis/trans-acting eQTL; multivariate.

  • Statistical Quality Control Analysis of High Dimensional Omics Data
    by Yongkang Kim, Gyu-Tae Kim, Min-Seok Kwon, Taesung Park 
    Abstract: Quality control (QC) is an important preprocessing procedure to remove unwanted variation in omics data, such as microarray, next generation sequencing, and mass spectrometry data. QC has become a standard procedure for identifying important biological signatures of interest. Although several QC analysis tools are now used widely, these usually require a subjective guideline to determine the quality of the omics data being assessed. Here, we propose a new simple QC plot for high dimensional omics data that can identify samples of poor quality in a more objective manner. The proposed QC plot can easily identify samples of poor quality by comparing the between-group and within-group distances across all possible pairs of samples. Through a permutation procedure, the distribution of these distances is derived, generating a p-value for each sample. These p-values can then be used as a more objective criterion to determine the quality of the sample. To exemplify the utility of this approach, we applied the proposed QC plot to MicroArray Quality Control (MAQC) project 1 data.
    Keywords: distance measure; quality control; microarray; omics data.

  • Integration of Multi-omics Data for Integrative Gene Regulatory Network Inference
    by Neda Zarayeneh, Euiseong Ko, Jung Hun Oh, Sang Suh, Chunyu Liu, Jean Gao, Donghyun Kim, Mingon Kang 
    Abstract: Gene regulatory networks provide comprehensive insights and in-depth understanding of complex biological processes. In most research, the molecular interactions of gene regulatory networks are inferred from a single type of genomic data, e.g., gene expression data. However, gene expression is a product of sequential interactions of multiple biological processes, such as DNA sequence variations, copy number variations, histone modifications, transcription factors, and DNA methylations. The recent rapid advances of high-throughput omics technologies enable one to measure multiple types of omics data, called 'multi-omics data', that represent the various biological processes. In this paper, we propose an Integrative Gene Regulatory Network inference method (iGRN) that incorporates multi-omics data and their interactions in gene regulatory networks. In addition to gene expressions, copy number variations and DNA methylations were considered for multi-omics data in this paper. Intensive experiments were carried out with simulation data to assess iGRN's capability to infer the integrative gene regulatory network. In these experiments, iGRN showed better performance on model representation and interpretation than other integrative methods in gene regulatory network inference. iGRN was also applied to a human brain dataset of psychiatric disorders, and the biological network of psychiatric disorders was analyzed.
    Keywords: gene regulatory network inference; multi-omics data; data integration.

  • Analysis of clustered RNA-seq Data
    by Hyunjin Park, Seungyeoun Lee, Ye Jin Kim, Myung-Sook Choi, Taesung Park 
    Abstract: RNA sequencing (RNA-seq) technology has now become a powerful tool for measuring levels of transcripts. Through this high-throughput technology, we can investigate post-transcriptional modifications, non-coding RNAs, mutations, gene fusion, and changes in gene expression levels. Recently, many methods have been developed to find differentially expressed genes (DEGs) between treatment groups. Most of these methods assume that RNA-seq data are generated independently from the different subjects. Nowadays, clustered RNA-seq data are also commonly observed, such as paired RNA-seq data from the same patient. Unfortunately, existing methods cannot adequately handle clustered RNA-seq data. In this paper, we propose a new testing method, based on the Generalized Estimating Equations (GEE) approach, which is widely used to analyze repeatedly measured data. Our GEE-based approach uses the correlations between RNA-seq data appropriately, which results in increased power in detecting DEGs. Through real data analysis and simulation studies, we compare the performance of the GEE method to those of other existing methods. Specifically, our GEE analysis was compared to various other methodologies, particularly with regard to sensitivity to detect DEGs and false discovery rates.
    Keywords: RNA-seq; differentially expressed gene; DEG; simultaneously; multivariate; Generalized Estimating Equations; GEE; false discovery rate (FDR).

  • Evaluating the contributions of GO term properties to semantic similarity measurement
    by Jan Sladek, Young-Rae Cho 
    Abstract: Ontologies are frameworks that provide structured descriptions of components in a specific domain. Recent systematic approaches for semantic analysis and annotations in bio-ontology databases have advanced understanding of molecular functions on a genomic scale. Gene Ontology (GO) is one of the most widely used ontology databases. Over the last decade, various semantic similarity measures have been proposed to quantify functional similarity between genes using GO and its annotation data. However, major challenges in the application of current GO data are the increasing complexity of ontology structures and the inconsistency of annotation data. In this study, we explore term properties in GO, such as term specificity and the term balancing effect, and evaluate the contributions of these properties to semantic similarity measurement. Our experiment is designed to predict positive protein-protein interactions (PPIs) by semantic similarities which are measured using the various term properties. The experimental results show that the accuracy of semantic similarity improves when the GO terms are weighted by term specificity. Among several term specificity measures that are commonly applied to semantic analysis, the information content using the ratio of annotating genes resulted in the highest accuracy. The experimental results also show that balancing terms with respect to their specificity is a significant factor in measuring semantic similarity between proteins.
    Keywords: Gene Ontology; semantic similarity; annotations; term specificity; GO; PPI.

  • Formal Concept Analysis for Knowledge Discovery from Biological Data
    by Khalid Raza 
    Abstract: Due to the rapid advancement in high-throughput technologies, such as microarrays and next generation sequencing, the volume of biological data is increasing exponentially. The current challenge in computational biology and bioinformatics research is how to analyze these huge raw biological datasets to extract meaningful biological knowledge. Formal concept analysis is a method based on lattice theory and widely used for data analysis, knowledge representation, knowledge discovery and knowledge management across several domains. This paper reviews the applications of formal concept analysis for knowledge discovery from biological data, including gene expression discretization, gene co-expression mining, gene expression clustering, finding genes in gene regulatory networks, enzyme/protein classifications, binding site classifications, and domain-domain interaction. It also presents a list of FCA-based software tools applied to the biological domain, and covers the challenges and future directions in this field.
    Keywords: Formal Concept Analysis; Microarray Analysis; Gene Expression Mining; Concept Lattice; Biological Databases.

  • SSDPrimer: A SSD-based Primer Design Method for a Private Sequence DB
    by Kang-Wook Chon, Sang-Hyun Hwang, Kyuhyeon An, Min-Soo Kim 
    Abstract: Many tools and websites have been proposed to help users design primers for quantitative polymerase chain reaction (qPCR) experiments. Most of these tools and websites require an external step that uses BLAST-like tools to perform homology tests on off-target sequences for feasible and valid primers. Although MRPrimer tools have been proposed to design primers without the external step, they have drawbacks: they must be run on a cluster of machines, or they only allow users to design primers on a specific sequence DB. However, utilizing a cluster of machines is usually expensive, and designing primers on a user's own private sequence DB is indispensable in many cases. In order to solve the above problems, we propose a new primer design method called SSDPrimer that runs on a single machine, but achieves a speed comparable to that of the MRPrimer method by exploiting SSD storage. It allows users to design primers on private sequence DBs. In addition, SSDPrimer supports a web browser interface so that users can easily design high quality primers for target sequences by querying.
    Keywords: qPCR; sequence DB; primer design; SSD; single machine.

  • Recommending Alternative Drugs by Using Generic Drug Names to Minimize Side Effects
    by Sohee Hwang, Jungrim Kim, Jeongwoo Kim, Sanghyun Park 
    Abstract: Healthcare and the treatment of illnesses are among the most fundamental aspects of modern human life, and drugs are the easiest approach to healthcare. For instance, drugs reduce pain, cure diseases, and maintain health. However, consuming drugs leads to diverse effects. We propose the use of generic medicine names, which are used to identify more affordable alternative drug formulations that contain similar chemical ingredients, and therefore, perform similar actions, to alleviate concerns of cost. Thereby, people would have the opportunity to select drugs that suit their preferences. It is important to note that while drugs with the same generic name serve similar purposes, they may also cause different side effects. Drugs affect the human body in numerous ways that are not all necessarily beneficial. Drug-induced side effects pose a major problem, and their negative effects often have serious consequences. This paper presents a strategy to address the issue of side effects by recommending alternative drugs that have the same therapeutic effect but fewer detrimental effects. In a healthcare group, which is a social network platform, users share their comments and experiences with drugs, and numerical ratings of various drugs can subsequently be extracted. By integrating the generic names of drugs and data from social networks, more data can be obtained to arrive at meaningful conclusions. The process involves identifying a group of drugs with the same generic name and comparing user review-based ratings to determine the drug with fewer side effects. This paper proposes a new approach for analyzing drug-induced side effects by collecting, processing, and using data from social networks.
    Keywords: Data mining; Drug recommendation; adverse drug reaction; Social Data.

  • CoDeT: An Easy-to-Use Community Detection Tool
    by Yifei Yue, Chaokun Wang, Xiang Ying, Jun Qian 
    Abstract: Network data plays an important role in biological research. For example, the interactions between proteins in living cells form large complex networks. The cooperation of cells in a living body also makes up networks. As an important method for uncovering the topology information of networks, community detection has attracted great interest from researchers during the past decade. Different methods have been developed. However, the diversity of the algorithms also makes it difficult for users to choose a suitable one for a specific application. Therefore, we present CoDeT, a system which integrates 11 state-of-the-art community detection algorithms and 12 recognized metrics. In addition, CoDeT can recommend the most suitable algorithm when users choose multiple algorithms for one data set. Experimental results show that the algorithms in our system are effective on bioinformatic networks composed of multiple communities. Finally, among the C++, Python and web service interfaces we provide, users can choose the most convenient one to start.
    Keywords: community detection; bioinformatic network analysis; graph algorithms; machine learning.

  • Gene Selection for Cancer Classification by Combining Minimum Redundancy Maximum Relevancy and Bat-inspired Algorithm
    by Osama Alomari, Ahamad Khader, Mohammed Al-Betar, Laith Abualigah 
    Abstract: In this paper, the bat-inspired algorithm (BA) is tailored to gene selection for cancer classification using microarray datasets. Microarray data contain irrelevant, redundant, and noisy genes. The gene selection problem is tackled by determining the most informative genes taken from microarray data to accurately diagnose cancer. The gene selection problem is widely solved by optimization algorithms. BA is a recent swarm-based algorithm, which imitates the echolocation system of bat individuals. It has been successfully applied to several optimization problems. Gene selection is tackled by combining two stages, namely, a filter stage, which uses the Minimum Redundancy Maximum Relevancy (MRMR) method, and a wrapper stage, which uses BA and SVM. To test the accuracy of the proposed method, ten microarray datasets with different sizes were used. For comparative evaluation, the proposed method was compared with popular gene selection methods. The proposed method achieves comparable results on some datasets and produces new results for one dataset.
    Keywords: Bat-inspired algorithm; Optimization; Gene Selection; MRMR; SVM; Classification.

  • Deep fusion of multi-channel neurophysiological signal for emotion recognition and monitoring
    by Xiang Li, Dawei Song, Peng Zhang, Yuexian Hou, Bin Hu 
    Abstract: How to fuse multi-channel neurophysiological signals for emotion recognition is emerging as a hot research topic in the Computational Psychophysiology community. Nevertheless, prior feature engineering based approaches require extracting various domain knowledge related features at a high time cost. Moreover, traditional fusion methods cannot fully utilise correlation information between different channels and frequency components. In this paper, we design a hybrid deep learning model in which a Convolutional Neural Network (CNN) is utilised for extracting task-related features and mining inter-channel and inter-frequency correlation, while a Recurrent Neural Network (RNN) is concatenated to integrate contextual information from the frame cube sequence. Experiments are carried out in a trial-level emotion recognition task on the DEAP benchmarking dataset. Experimental results demonstrate that the proposed framework outperforms the classical methods on both emotional dimensions, Valence and Arousal.
    Keywords: affective computing; CNN; time series data analysis; EEG; emotion recognition; LSTM; multi-channel data fusion; multi-modal data fusion; physiological signal; RNN.
    DOI: 10.1504/IJDMB.2017.10007183

  • Template edge similarity graph clustering for mining multiple gene expression datasets
    by Saeed Salem 
    Abstract: High-throughput technologies have enabled the acquisition of large amounts of genomic data, including gene expression and RNA sequencing data for multiple species under various biological and environmental conditions. Recently, researchers have proposed methods for mining biological modules from gene co-expression networks. Biological inference from a single expression dataset suffers from spurious co-expression. Integrating multiple gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on single gene expression data. We propose an integrative mining algorithm that constructs a template edge similarity graph whose nodes are the co-expression edges and in which a weighted edge connecting two nodes corresponds to the structural similarity of the two edges across the co-expression graphs. Clustering the weighted edge similarity graph yields recurrent co-expression link clusters (modules). Experimental results on human gene expression datasets show that the reported modules are functionally homogeneous, as evidenced by their enrichment with biological process GO terms.
    Keywords: co-expression networks; edge-edge similarity; biological modules.
    DOI: 10.1504/IJDMB.2017.10007174

  • MiRFFS: a functional group-based feature selection method for the identification of microRNA biomarkers
    by Yang Yang, Yiqun Xiao, Tianyu Cao, Wei Kong 
    Abstract: The identification of microRNA biomarkers has been a central task in disease diagnosis, prognosis assessment and drug design. Both statistical methods and machine learning approaches have been applied to the identification of biomarkers. In particular, feature selection and regularisation techniques are efficient for filtering informative attributes from a high-dimensional space. In order to enhance their performance, the intrinsic data structure is usually exploited. In this study, we utilise the GO-based semantic similarity to infer miRNA functional groups, and propose a new feature selection method, called MiRFFS (MiRNA Functional group-based Feature Selection). We also incorporate the functional group information into the sparse group Lasso (SGL), and compare MiRFFS with SGL as well as the state-of-the-art feature selection methods. Experimental results on five miRNA microarray profiles of breast cancer show that MiRFFS can achieve a compact feature subset with substantial improvement in accuracy compared with other feature selection and lasso methods.
    Keywords: microRNA biomarker; functional group; feature selection; breast cancer.
    DOI: 10.1504/IJDMB.2017.10007184
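    For readers unfamiliar with the sparse group Lasso mentioned in this abstract, the sketch below evaluates its penalty term for a coefficient vector split into groups. It assumes the commonly used parameterisation with a mixing weight `alpha` between the L1 term and the group-wise L2 term, each group scaled by the square root of its size. The function name, the grouping and the numbers are illustrative only; this is not MiRFFS and not necessarily the exact formulation used in the paper.

```python
import math


def sgl_penalty(beta, groups, lam=1.0, alpha=0.5):
    """Sparse group Lasso penalty (assumed standard parameterisation).

    beta   -- list of coefficients, one per feature (e.g. one per miRNA)
    groups -- list of index lists, one per functional group (hypothetical)
    lam    -- overall regularisation strength
    alpha  -- 1.0 gives the plain Lasso penalty, 0.0 the group Lasso penalty
    """
    l1 = sum(abs(b) for b in beta)
    group_l2 = sum(
        math.sqrt(len(idx)) * math.sqrt(sum(beta[i] ** 2 for i in idx))
        for idx in groups
    )
    return lam * (alpha * l1 + (1.0 - alpha) * group_l2)


# Four features in two groups; the second group is entirely zeroed out.
beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]
print(sgl_penalty(beta, groups, lam=1.0, alpha=0.5))
```

    The group term vanishes for any group whose coefficients are all zero, which is what encourages whole functional groups to be dropped, while the L1 term encourages sparsity within the groups that remain.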
  • DASE2: differential alternative splicing variants estimation method without reference genome, and comparison with mapping strategy
    by Kouki Yonezawa, Keisuke Nakata, Ryuhei Minei, Atsushi Ogura 
    Abstract: Alternative splicing is a mechanism that produces gene expression diversity under the constraint of a limited number of genes, giving rise to spatiotemporal gene expression in tissues and developmental processes in most organisms. This mechanism has been well studied in model organisms but not in non-model organisms, because the current standard method requires genomic sequences as well as fully annotated information on exons and introns. However, it is essential to uncover the landscape of alternative splicing across organisms to understand its evolutionary impacts and roles. Therefore, we developed a method for condition-specific alternative splicing estimation without a reference genome, based on de novo transcriptome assembly. We also compared the estimation results of DASE with those of a genome mapping method to assess the reliability of our method, and showed that its detection level of alternative splicing is comparable with that of the mapping strategy and that it is useful for screening condition-specific alternative splicing in non-model organisms. The software is deposited on GitHub.
    Keywords: RNA-seq; isoforms; expression diversity.
    DOI: 10.1504/IJDMB.2017.10007185
  • A terpenoid metabolic network modelled as graph database
    by Waldeyr Mendes Cordeiro Da Silva, Danilo José Vilar, Daniel Da Silva Souza, Maria Emília Machado Telles Walter, Maristela Terto De Holanda, Marcelo De Macêdo Brígido 
    Abstract: Terpenoids are involved in interactions such as signalling for intra- and inter-species communication, the attraction of pollinating insects, and defence against herbivores and microbes. Owing to their chemical composition, many terpenoids possess vast pharmacological applicability in medicine and biotechnology, besides important roles in ecology, industry and commerce. Metabolic networks are composed of metabolic pathways and allow us to represent the metabolism of an organism. The biosynthesis of terpenes has been widely studied over the years, and it is well known that they can be synthesised through two metabolic pathways: the mevalonate pathway (MVA) and the non-mevalonate pathway (MEP). However, genome-scale reconstruction of metabolic networks faces many challenges, including organisational data storage and data modelling, to properly represent the complexity of systems biology. Recent NoSQL database paradigms have introduced new concepts of scalable storage and data queries. With regard to biological data, the use of graph databases has grown because of their versatility. In this paper, we propose 2Path, a graph database designed to represent terpenoid metabolic networks. It is modelled so that it preserves important terpenoid biosynthesis characteristics.
    Keywords: terpenoid; metabolic network; secondary metabolism; NoSQL; graph database.
    DOI: 10.1504/IJDMB.2017.10007186
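    To illustrate the idea of modelling a metabolic network as a property graph, the sketch below builds a tiny in-memory graph and queries it for a route from a precursor to a product. It is not the 2Path schema, which the abstract does not describe: the class, the `Compound` label, the `PRECURSOR_OF` relationship type and the `pathway` property are all hypothetical, and the MVA and MEP routes are heavily simplified, with most intermediate steps collapsed.

```python
from collections import deque


class PropertyGraph:
    """Minimal in-memory property graph (hypothetical, for illustration)."""

    def __init__(self):
        self.nodes = {}  # node id -> dict with a "label" and other properties
        self.out = {}    # node id -> list of (rel type, target id, properties)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}
        self.out.setdefault(node_id, [])

    def add_rel(self, src, rel_type, dst, **props):
        self.out[src].append((rel_type, dst, props))

    def shortest_path(self, start, goal, pathway=None):
        """Breadth-first search, optionally restricted to one pathway tag."""
        queue = deque([[start]])
        visited = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for _rel_type, nxt, props in self.out[path[-1]]:
                if pathway is not None and props.get("pathway") != pathway:
                    continue
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(path + [nxt])
        return None


g = PropertyGraph()
for name in ["acetyl-CoA", "mevalonate", "pyruvate", "DXP", "MEP", "IPP"]:
    g.add_node(name, "Compound")

# Simplified mevalonate (MVA) route; intermediate steps are collapsed.
g.add_rel("acetyl-CoA", "PRECURSOR_OF", "mevalonate", pathway="MVA")
g.add_rel("mevalonate", "PRECURSOR_OF", "IPP", pathway="MVA")
# Simplified non-mevalonate (MEP) route; intermediate steps are collapsed.
g.add_rel("pyruvate", "PRECURSOR_OF", "DXP", pathway="MEP")
g.add_rel("DXP", "PRECURSOR_OF", "MEP", pathway="MEP")
g.add_rel("MEP", "PRECURSOR_OF", "IPP", pathway="MEP")

print(g.shortest_path("acetyl-CoA", "IPP"))
print(g.shortest_path("pyruvate", "IPP", pathway="MEP"))
```

    Storing the pathway as a relationship property is one possible design choice: it lets the same compound node (here IPP) be shared by both routes while still allowing queries to be restricted to a single pathway. A dedicated graph database would express the same traversal in its own query language.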