Forthcoming articles


International Journal of Data Mining and Bioinformatics


These articles have been peer-reviewed and accepted for publication in IJDMB, but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.


Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.


Articles marked with this shopping trolley icon are available for purchase - click on the icon to send an email request to purchase.


Articles marked with this Open Access icon are freely available and openly accessible to all without any restriction except the ones stated in their respective CC licenses.


Register for our alerting service, which notifies you by email when new issues of IJDMB are published online.


We also offer RSS feeds which provide timely updates of tables of contents, newly published articles and calls for papers.


International Journal of Data Mining and Bioinformatics (10 papers in press)


Regular Issues


  • Integration of k-means clustering algorithm with network analysis for drug-target interactions network prediction
    by Reda Alhajj, Ala Qabaja, Sara Aghakhani 
    Abstract: Prediction of the interactions between drugs and target proteins is an important factor in in silico drug discovery. The number of known interactions is very small in comparison to the potential number of interactions. In this paper, a new method is proposed which combines chemical structure data with genomic sequence data. This method uses both supervised and unsupervised learning, as well as network analysis techniques. The proposed approach integrates the k-means clustering algorithm with Social Network Analysis (SNA) techniques for a novel prediction of drug-target interactions. Here, we demonstrate the performance of our approach in the prediction of drug-target interactions by using four classes of drug-target interaction networks in humans: enzymes, ion channels, G protein-coupled receptors (GPCRs), and nuclear receptors. The AUC is used to evaluate the accuracy of the proposed approach using three classifiers: BayesNetwork, NaiveBayes and SVM. We could identify novel drug-protein interactions using the Bayes network classifier. The reported accuracies for enzymes, ion channels, GPCRs, and nuclear receptors are 98%, 85%, 98.6% and 99.2%, respectively.
    Keywords: k-means; clustering; network analysis; drug-protein interactions; network prediction; classification; support vector machine.

  • Neural Relevance Model Using Similarities with Elite Documents for Effective Clinical Decision Support
    by Yanhua Ran, Ben He, Kai Hui, Jungang Xu, Le Sun 
    Abstract: Clinical Decision Support (CDS) is regarded as an information retrieval (IR) task, where medical records are used to retrieve full-text biomedical articles that satisfy the information needs of physicians, aiming at better medical solutions. Recent attempts have brought advances in deep learning to CDS by employing neural IR methods. However, these methods model only the document-query relationship, which yields suboptimal results because a medical record can barely reflect the information included in a relevant biomedical article, which is usually much longer. Therefore, in addition to the document-query relationship, we propose a neural relevance model (DNRM) based on similarities to a set of elite documents, addressing the information mismatch by utilizing the content of relevant articles as a complete picture of the given medical record. Specifically, our DNRM model evaluates a document relative to a query and to several pseudo-relevant documents for the query at the same time, capturing the interactions from both parts with a feed-forward network. Experimental results on the standard Text REtrieval Conference (TREC) CDS track dataset confirm the superior performance of the proposed DNRM model.
    Keywords: Neural relevance matching; Clinical Decision Support; Information Retrieval.

  • Deletion Genotype Calling on the Basis of Sequence Visualization and Image Classification
    by Jing Wang, Jingyang Gao, Cheng Ling 
    Abstract: Widely known genotype calling methods, such as CNVnator, Pindel, and LUMPY, are restricted in terms of detectable length ranges and sequence coverage. Focusing on deletions larger than 50 bp, we propose a new approach with two main steps: (1) visualizing deletions as images and (2) classifying deletion genotypes. Given the coordinates of candidates, this method first generates breakpoint images by fetching reads from BAM files. Convolutional neural networks then perform genotype recognition. We test our approach on both low and high coverage simulated noisy data and compare the results to those of CNVnator, Pindel, and LUMPY. The results indicate that our approach surpasses the other tools, with higher accuracy, a wider detectable deletion length range, and better performance on both low and high coverage data. To summarize, our approach not only provides an intuitive image view of deletion regions, but also achieves better results for genotype calling compared to existing tools.
    Keywords: deletion; genotype calling; convolutional neural network; visualization; image classification.

  • A Multiobjective Feature Selection and Classifier Ensemble Technique for Microarray Data Analysis
    by Rasmita Dash, Bijan Bihari Misra 
    Abstract: In the last few years, microarray technology has found wide application in biomedical research. However, recognizing and applying this technology efficiently in biomedical areas is still very difficult and expensive. Many intelligent models have been developed with different biological interpretations. This work presents a multiobjective feature selection and classifier ensemble (MOFSCE) technique for microarray data analysis. This technique works in two phases. The first phase is a preprocessing step in which a bi-objective optimization technique is used to identify the significantly important genes in the non-dominated set through the Pareto front. Here, seven feature ranking approaches are used to develop twenty-one bi-objective feature selection (BOFS) models. The quality of the selected features is tested using a support vector machine (SVM) classifier. The performance of a BOFS model varies with different datasets. Therefore, a grading system is used to identify a stable BOFS model. In the second phase, a hybrid model is built, which is an ensemble of five classifiers that receives the selected features from the identified BOFS model. The output of the classifiers is presented to a harmony search based functional link artificial neural network (HSFLANN) for the decision. The performance of MOFSCE is evaluated using seven publicly available microarray datasets. The results of MOFSCE are compared with a few other models, and a statistical significance test finds that MOFSCE is a better model than the others.
    Keywords: Feature Selection; Pareto Optimization; Ensemble approaches; Microarray Data Classification; Functional link artificial neural network; Harmony search; Statistical test.

  • Link prediction potentials for biological networks
    by Sadegh Sulaimany, Mohammad Khansari, Ali Masoudi-Nejad 
    Abstract: Improvement of biological networks reconstructed from high-throughput expression data is an important challenge in systems biology. Link prediction is a problem of interest in many application domains that can be used for this purpose. In this paper, after a short review of several biological networks, we present the latest definition of the link prediction problem and review it from several viewpoints. With a comprehensive search of the literature using the PubMed, Science Direct and Google Scholar databases, and a careful review of the related papers having link prediction plus at least one of the biological network terms in their title, abstract or keywords, we classify the results based on the graph type and major link prediction outlooks. Finally, we analyse the performed research to find new insights about potential uses in addition to understanding the current state, and propose several hints and directions for future work.
    Keywords: link prediction; biological networks; biological link prediction; biological link mining.

  • Ensemble Classification for Gene Expression Data based on Parallel Clustering
    by Jun Meng, Dingling Jiang, Jing Zhang, Yushi Luan 
    Abstract: Analysis of large-scale gene expression data is a research hotspot in the field of bioinformatics, and can be used to study abnormal phenomena in the plant growth process. This paper proposes a biological knowledge integration method based on parallel clustering to select gene subsets effectively. Gene ontology is utilized to obtain the biological functional similarity, which is combined with gene expression data. A parallelized affinity propagation algorithm is used to cluster the data, since it can not only obtain more biologically meaningful subsets, but also avoid the loss of some potential value in genes from simple gene primary selection. The algorithm is verified with four typical plant datasets and compared with other well-known integration methods. Experimental results on plant stress response datasets demonstrate that the proposed method can select genes with stronger classification ability.
    Keywords: ensemble classification; microarray data; MapReduce programming model; parallel information fusion.

  • Inversion Detection Using PacBio Long Reads
    by Shenglong Zhu, Scott Emrich, Danny Chen 
    Abstract: Structural variations have received considerable attention in the past decade owing to their importance in disease etiology and ecological adaptation. Many prior efforts have exploited short paired-end reads to detect structural variations and, more recently, improved approaches have combined newer long reads with short ones to better predict variants. In this paper, we propose a new computational framework that uses only long reads to target a specific type of structural variation: large inversions. Our approach is complementary to state-of-the-art methods, but models the identification of inversions as a Max-Cut problem. We show that this new approach is effective for predicting large inversions compared to current structural variation detection tools. This new formulation also uncovers more complex structural variants that are not discovered by alternative frameworks. We conclude that our new approach is potentially powerful for detecting inversions in complex genomes. Our software is freely available.
    Keywords: inversion detection; PacBio long reads; structural variation; short paired-end reads; large inversion; breakpoint detection; max cut; complex inversion; complex genomes; InvDet; range minimum query; simple inversion; approximation algorithm; validated segment; mate pairs.

  • TISRover: ConvNets Learn Biologically Relevant Features for Effective Translation Initiation Site Prediction
    by Jasper Zuallaert, Mijung Kim, Arne Soete, Yvan Saeys, Wesley De Neve 
    Abstract: Being a key component in gene regulation, translation initiation is a well-studied topic. However, recent findings have shown translation initiation to be more complex than initially thought, calling for more effective prediction methods. In this paper, we present TISRover, a multi-layered convolutional neural network architecture for translation initiation site prediction. We achieve state-of-the-art results, outperforming a previous deep learning approach by 4% to 23% in terms of auPRC, and other approaches by at least 68% in terms of error rate. Furthermore, we present a methodology to analyze the decision-making process of our network models, revealing various biologically relevant features for translation initiation site prediction that are automatically learnt from scratch, without any prior knowledge. The most notable features found are the Kozak consensus sequence, the reading frame characteristics, the influence of stop and start codons in the sequence, and the presence of donor splice site patterns.
    Keywords: convolutional neural networks; deep learning; genomics; model interpretation; model visualization; translation initiation site prediction.

  • TPGraph: A Hospital Readmission Prediction Method Based on Temporal Phenotype Graphs
    by Lizhen Cui, Xiangzhen Xu, Shijun Liu, Hui Li, Zhiqi Liu 
    Abstract: Accurate hospital readmission prediction in a vast amount of healthcare data is important for reducing healthcare costs and improving treatment patterns. Due to the temporality and sequentiality of medical records, we propose in this paper a method for predicting hospital readmission based on temporal phenotype graphs, namely TPGraph. Firstly, we constructed a temporal graph for each patient based on their medical event sequence. Then, we developed an approach to identify the most significant frequent subgraphs as temporal phenotype graphs. After that, an improved greedy algorithm was designed to obtain the optimal expression coefficient of the temporal phenotype graphs. Finally, taking the optimal expression coefficient as a feature, we use a random forest algorithm to predict whether the patient will be readmitted to hospital. Our experiments demonstrate the effectiveness of our proposed method, and show that our approach gains better predictive performance compared with the baselines.
    Keywords: Healthcare; Temporal Phenotype; Temporal Phenotype Graphs; Hospital Readmission Prediction; Frequent Subgraph Mining; Optimal Expression Coefficient; Temporal Graph; Medical Event Sequence; AGM; Coronary Heart Disease.

  • Protein Family Structure Signature for Multidomain Proteins
    by Jun Tan, Donald Adjeroh 
    Abstract: The rapid increase in available protein structure datasets requires new techniques for fast yet effective analysis of protein 3D structures. In this work, we propose a structure-based signature for protein families, suitable for rapid analysis of multidomain protein structures. Our method is alignment-free, using protein strings as the basic representation. A key novelty is the two-stage approach, whereby an initial list of candidate protein superfamilies is rapidly identified using the protein family signature, and then information retrieval methods are applied only to the members of the candidate superfamilies. This approach is the key to both improved speed and improved structure retrieval accuracy. Experimental results, including comparative results with state-of-the-art methods, demonstrate the performance of the proposed protein family signature on queries with multidomain protein structures.
    Keywords: protein structure; signature; retrieval; classification; alignment-free; structure analysis.