International Journal of Data Mining and Bioinformatics (11 papers in press)
Integration of k-means clustering algorithm with network analysis for drug-target interactions network prediction
by Reda Alhajj, Ala Qabaja, Reda Alhajj
Abstract: Prediction of the interactions between drugs and target proteins is an important factor in in silico drug discovery. The number of known interactions is very small in comparison to the potential number of interactions. In this paper, a new method is proposed which combines chemical structure data with genomic sequence data. This method uses both supervised and unsupervised learning, as well as network analysis techniques.
The proposed approach integrates the k-means clustering algorithm with Social Network Analysis (SNA) techniques for a novel prediction of drug-target interactions. Here, we demonstrate the performance of our approach in the prediction of drug-target interactions by using four classes of drug-target interaction networks in humans: enzymes, ion channels, G protein-coupled receptors (GPCRs), and nuclear receptors. The area under the curve (AUC) is used to evaluate the accuracy of the proposed approach using three classifiers: BayesNetwork, NaiveBayes and SVM. We could identify novel drug-protein interactions using the Bayes network classifier. The reported accuracies for enzymes, ion channels, GPCRs, and nuclear receptors are 98%, 85%, 98.6% and 99.2%, respectively.
Keywords: k-means; clustering; network analysis; drug-protein interactions; network prediction; classification; support vector machine.
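The clustering step named in this abstract can be illustrated with a minimal k-means (Lloyd's algorithm) sketch. It is not the authors' implementation: the function names, the deterministic initialisation from the first k points, and the toy two-dimensional feature vectors are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: Sequence[Point], k: int, max_iter: int = 100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    # Assumption: initialise centroids with the first k points so the
    # result is deterministic. Real pipelines usually use random restarts.
    centroids: List[Point] = [tuple(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical two-dimensional feature vectors for four drugs.
drug_features = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = k_means(drug_features, k=2)
```

In a setting like the one described, the resulting cluster labels could serve as extra features for a downstream classifier; how the paper actually combines them with network measures is not stated in the abstract.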
Neural Relevance Model Using Similarities with Elite Documents for Effective Clinical Decision Support
by Yanhua Ran, Ben He, Kai Hui, Jungang Xu, Le Sun
Abstract: Clinical Decision Support (CDS) is regarded as an information retrieval (IR) task, where medical records are used to retrieve full-text biomedical articles to satisfy the information needs from physicians, aiming at better medical solutions. Recent attempts have introduced the advances of deep learning by employing neural IR methods for CDS, where, however, only the document-query relationship is modelled, resulting in non-optimal results in that a medical record can barely reflect the information included in a relevant biomedical article which is usually much longer. Therefore, in addition to the document-query relationship, we propose a neural relevance model (DNRM) based on similarities to a set of elite documents, addressing the information mismatch by utilizing the content of relevant articles as a complete picture of the given medical record. Specifically, our DNRM model evaluates a document relative to a query and to several pseudo relevant documents for the query at the same time, capturing the interactions from both parts with a feed forward network. Experimental results on the standard Text REtrieval Conference (TREC) CDS track dataset confirm the superior performance of the proposed DNRM model.
Keywords: Neural relevance matching; Clinical Decision Support; Information Retrieval.
Deletion Genotype Calling on the Basis of Sequence Visualization and Image Classification
by Jing Wang, Jingyang Gao, Cheng Ling
Abstract: Widely known genotype calling methods, such as CNVnator, Pindel, and LUMPY, are restricted in terms of detectable length ranges and sequence coverage. Focusing on deletions larger than 50 bp, we propose a new approach with two main steps: (1) visualizing deletions as images and (2) classifying deletion genotypes. Given the coordinates of candidates, this method first generates breakpoint images by fetching reads from BAM files. Convolutional neural networks then perform genotype recognition. We test our approach on both low and high coverage simulated noisy data and compare the results to those of CNVnator, Pindel, and LUMPY. The results indicate that our approach surpasses the other tools, with higher accuracy, a wider detectable deletion length range, and better performance on both low and high coverage data. To summarize, our approach not only provides an intuitive image view of deletion regions, but also achieves better results for genotype calling compared to existing tools.
Keywords: deletion; genotype calling; convolutional neural network; visualization; image classification.
A Multiobjective Feature Selection and Classifier Ensemble Technique for Microarray Data Analysis
by Rasmita Dash, Bijan Bihari Misra
Abstract: In recent years, microarray technology has found widespread application in biomedical research. However, applying this technology efficiently in biomedical areas is still very difficult and expensive. Many intelligent models have been developed with different biological interpretations. This work presents a multiobjective feature selection and classifier ensemble (MOFSCE) technique for microarray data analysis. The technique works in two phases. The first phase is a preprocessing step in which a bi-objective optimization technique is used to identify the significantly important genes in the non-dominated set through the Pareto front. Here, seven feature ranking approaches are used to develop twenty-one bi-objective feature selection (BOFS) models. The quality of the selected features is tested using a support vector machine (SVM) classifier. Because the performance of a BOFS model varies across datasets, a grading system is used to identify a stable BOFS model. In the second phase, a hybrid model is built: an ensemble of five classifiers that receives the selected features from the identified BOFS model. The output of the classifiers is presented to a harmony search based functional link artificial neural network (HSFLANN) for the decision. The performance of MOFSCE is evaluated using seven publicly available microarray datasets. The results of MOFSCE are compared with those of a few other models, and a statistical significance test indicates that MOFSCE performs better than the others.
Keywords: Feature Selection; Pareto Optimization; Ensemble approaches; Microarray Data Classification; Functional link artificial neural network; Harmony search; Statistical test.
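The idea of selecting genes from the non-dominated set of a Pareto front can be sketched as follows. This is a generic non-dominated filter for two objectives that are both maximised, not the MOFSCE procedure itself; the function name and the gene scores are hypothetical.

```python
from typing import List, Tuple


def pareto_front(scores: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated points (both objectives maximised)."""
    front = []
    for i, (a1, a2) in enumerate(scores):
        # Point i is dominated if another point is at least as good on both
        # objectives and strictly better on at least one of them.
        dominated = any(
            (b1 >= a1 and b2 >= a2) and (b1 > a1 or b2 > a2)
            for j, (b1, b2) in enumerate(scores)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical scores for five genes under two ranking criteria
# (higher is better on both).
gene_scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.95), (0.9, 0.1)]
selected = pareto_front(gene_scores)
```

Here genes 0, 1 and 3 form the front: gene 2 is dominated by gene 1, and gene 4 by gene 0. The quadratic scan is adequate for a sketch; large gene sets would call for a sort-based method.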
Link prediction potentials for biological networks
by Sadegh Sulaimany, Mohammad Khansari, Ali Masoudi-Nejad
Abstract: Improvement of biological networks reconstructed from high-throughput expression data is an important challenge in systems biology. Link prediction is a problem of interest in many application domains that can be used for this purpose. In this paper, after a short review of several biological networks, we present the latest definition of the link prediction problem and review it from several viewpoints. With a comprehensive search of the literature using the PubMed, Science Direct and Google Scholar databases, and a careful review of the related papers that have link prediction plus at least one of the biological network terms in their title, abstract or keywords, we classify the results based on the graph type and major link prediction outlooks. Finally, we analyse the performed research to find new insights into potential uses in addition to understanding the current state, and propose several hints and directions for future work.
Keywords: link prediction; biological networks; biological link prediction; biological link mining.
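As a concrete instance of the link prediction problem, the sketch below scores every unconnected node pair in an undirected graph by the Jaccard coefficient of their neighbourhoods, one of the standard similarity-based predictors. The abstract does not single out this measure; the function name and the toy edge list are assumptions.

```python
from itertools import combinations


def jaccard_link_scores(edges):
    """Score each non-adjacent node pair by neighbourhood Jaccard similarity."""
    # Build an undirected adjacency map.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already linked, so there is nothing to predict
        shared = adj[u] & adj[v]
        union = adj[u] | adj[v]
        scores[(u, v)] = len(shared) / len(union) if union else 0.0
    return scores


# Hypothetical interaction network with five nodes.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
scores = jaccard_link_scores(edges)
# Candidate links, most likely first.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The pair A-D ranks first because the two nodes share both B and C as neighbours, while A-E scores zero because they share none.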
Gene-gene interaction analysis for quantitative trait using cluster-based multifactor dimensionality reduction method
by Youjung Lee, Hyein Kim, Taesung Park, Mira Park
Abstract: With recent advances in high-throughput genotyping techniques, many genome-wide association studies have been conducted to understand the relationship between genes and complex diseases. Though single SNP analysis is common in many genetic studies, this approach is limited in explaining genetic changes in complex diseases. Most complex diseases cannot be explained by a single gene mutation, and the lack of success in many genetic studies could be attributed to gene-gene interactions. Although various methods have been developed to identify gene-gene interactions for binary traits, few statistical methods are currently available for determining the genetic interactions associated with quantitative traits. To address this problem, we propose the CL-MDR method, a modified version of multifactor dimensionality reduction for quantitative traits. The proposed method was examined by simulation studies, which showed that CL-MDR successfully identified interactions associated with quantitative traits. We have also applied our approach to Korean GWAS data for illustration.
Keywords: clustering; genetic associations; gene-gene interactions; multifactor dimensionality reduction; quantitative trait.
A pipeline for identifying endogenous neuropeptides from spectral archives
by Mingze Bai, Mingmin He, Qifeng Sun, Huadong Liao, Kunxian Shu, Henning Hermjakob
Abstract: Shotgun proteomics experiments often produce a large amount of spectral data; however, a large part of it remains unidentified. Many unidentified spectra that are highly likely to originate from peptides could be revealed by data mining methods such as clustering. This idea motivates researchers to build 'spectral archives' to identify more peptides from previously analysed resources. The objective is to build a general way to identify peptides for these high-probability spectra in spectral archives, helping biologists to get more output from the data. Here we propose a novel generic pipeline for this approach, based on the PRIDE cluster resources, rather than building a complete archive from scratch. We applied our pipeline to test the identification of endogenous neuropeptides in rat. Thirty-three high-probability peptide-induced spectra were uncovered among the unidentified rat spectra in the PRIDE cluster archive.
Keywords: spectral library searching; spectral archives; endogenous neuropeptides; PRIDE cluster.
Determining sample size for cross-over designs with multiple groups
by Yongjun Jo, Hyojin Lee, Oran Kwon, Taesung Park
Abstract: In clinical research, determining sample size plays an important role. A cross-over design (CD) is widely used to compare multiple groups in order to verify the statistical significance of mean difference among multiple groups, because it has an advantage of removing any factors caused by subject variability. When multi-omics data such as metabolomics data is analysed, we often adopt CD to identify biomarkers that have group effects. While some methods exist for determining the sample size when comparing two groups, no available method allows comparison of more than two treatment groups. In this research, we propose a novel method for determining the sample size of CD with multiple treatment groups. We first propose a method for testing single biomarkers and then a method for a large number of biomarkers while controlling the false discovery rate or the family wise error rate.
Keywords: sample size calculation; cross-over designs; linear mixed model; FDR; Bonferroni correction.
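The two error-control criteria mentioned in this abstract can be illustrated with the standard Bonferroni correction (family-wise error rate) and the Benjamini-Hochberg step-up procedure (false discovery rate). This sketch covers only the multiple-testing decision rules, not the paper's sample size method, which the abstract does not specify; the function names and p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m (FWER control)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (FDR control)."""
    m = len(p_values)
    # Indices of the tests, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is at most alpha * rank / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that rank.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values for five biomarkers.
p_values = [0.001, 0.012, 0.025, 0.045, 0.2]
fwer_decisions = bonferroni_reject(p_values)
fdr_decisions = benjamini_hochberg_reject(p_values)
```

On these values Bonferroni rejects only the first hypothesis, whereas Benjamini-Hochberg rejects the first three. A stricter per-test threshold generally means a larger sample is needed to reach the same power, which is why the choice of criterion matters for sample size planning.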
Symbolic approach to reduced bio-basis
by Mohamed A. Mahfouz, Yasser El-Sonbaty, M.A. Ismail
Abstract: Reduced bio-basis is the minimal set of fixed-length sub-sequences of a biological sequence with maximum information. Sequence data are not numerical, so centroid-based clustering algorithms are not directly applicable. The main contribution of this paper is to show how to apply centroid-based algorithms to biological sequences. The average similarity between a sub-sequence and other sub-sequences in a cluster is reduced to a similarity between the sub-sequence and an artificial centre formed in a similar way to the formation of the centre of symbolic objects. After applying the hard version of the proposed symbolic clustering algorithm, a possibilistic membership is computed for each sub-sequence, which gives the algorithm a strong outlier rejection capability. Well-studied issues for the centroid-based approach, such as parallelism or scalability, can be applied to the proposed approach. Experimental results on several real datasets show that the proposed approach, in several respects, is superior to traditional methods.
Keywords: robustness; possibilistic clustering; symbolic clustering; relational clustering; bioinformatics; bio-basis; sequence clustering; amino acid mutation matrix.
A mutational co-occurrence network in gastric cancer based on an association index
by Sungjin Park, Seungyoon Nam
Abstract: Gastric cancer (GC) is one of the most lethal, as well as one of the most heterogeneous, cancer types. Possible GC molecular mechanisms could be revealed by mutational co-occurrence analyses. Despite a known association between mutational co-occurrences and GC signalling contexts, no specific mechanisms have been identified. Here, known GC signalling contexts, including cancer hallmarks (DNA repair, WNT signalling, Notch signalling), were inspected in terms of mutational co-occurrences, and in particular, for a specific GC phenotype, microsatellite status (stable or low or high instability). By correlating mutational co-occurrences of gene pairs within cancer hallmarks, we constructed mutational co-occurring networks for each type of microsatellite status. As a result, we found that one status type, microsatellite-stable (MSS), associated with mutation of JAG1, likely co-occurs for genes belonging to the WNT and Notch signalling pathways. Our study may support the feasibility of a new therapeutic strategy of designing compounds that target Notch signalling, in MSS GC patients.
Keywords: gastric cancer; microsatellite stability status; network models; association index; Notch; Wnt; DNA repair.
Risk prediction of type 2 diabetes using common and rare variants
by Sunghwan Bae, Taesung Park
Abstract: The recent development of next generation sequencing technology has led to the identification of several disease-related genetic variants. In this study, we systematically compare the performance of prediction models using common and rare variants from the Whole Exome Sequencing data of the Type 2 Diabetes Genetic Exploration by Next generation sequencing in multi-ethnic samples. We evaluated several methods for predicting binary phenotypes such as Stepwise Logistic Regression, Penalised Regression and Support Vector Machine (SVM). We first constructed prediction models by combining variable selection and prediction methods for Type 2 Diabetes. We then calculated the Area Under the Curve (AUC) to compare the performance of the prediction models. The results indicate that the performance of the common and rare variants combination was better than either that of the common variants only or the rare variants only. Further, the AUC values of SVM were always larger than those of other prediction models.
Keywords: WES; whole exome sequencing; risk prediction model; T2D; type 2 diabetes; penalised regression methods; stepwise selection; SVM; support vector machine.
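The AUC used to compare prediction models can be computed directly from its rank interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below is a generic pairwise implementation; the function name, labels and risk scores are hypothetical and unrelated to the study's data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # the case is ranked above the control
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical phenotype labels (1 = case, 0 = control) and predicted risks.
labels = [1, 1, 0, 1, 0, 0]
risk_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
model_auc = auc(labels, risk_scores)
```

In this example eight of the nine case-control pairs are ordered correctly, giving an AUC of 8/9. The pairwise loop is quadratic in sample size, so a sort-based rank formula would be preferable for data on the scale of a sequencing study.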