International Journal of Data Mining and Bioinformatics (11 papers in press)
MHCherryPan: a novel pan-specific model for binding affinity prediction of class I HLA-peptide
by Xuezhi Xie
Abstract: The human leukocyte antigen (HLA) system, or complex, plays an irreplaceable role in regulating the human immune system. Accurate prediction of peptide binding to HLA can help identify neoantigens, which could substantially change immune drug development. HLA is one of the most polymorphic genetic systems in humans, and thousands of HLA allelic versions exist. Because of this high polymorphism, it is still difficult to predict the binding affinity accurately. In this paper, we propose a novel algorithm that combines a convolutional neural network and long short-term memory to solve this problem. Our model has been tested on the experimental benchmark from IEDB and shows state-of-the-art performance compared with other currently popular algorithms.
Keywords: Bioinformatics; deep learning; health informatics; HLA; MHC.
Ensemble of Deep Learning Models to Predict Platinum Resistance in High Grade Serous Ovarian Cancer
by Kyullhee Han*, Hyeonjung Ham*, SaeIck Kim, YoungSang Song, Taesung Park**, TaeJin Ahn**
Abstract: Deep learning is a machine learning approach that has emerged as a useful method for prediction from provided data. Biological problems often involve complicated interactions of genes and other metabolites. Deep learning has several benefits for predicting such complicated problems and could produce useful estimates. In clinical practice, the prediction of platinum resistance in ovarian cancer is an important problem for newly diagnosed patients, because it alters treatment options for patients and subsequently their quality of life. In this paper, Deep Neural Network (DNN) models are designed and evaluated with several feature selection approaches and network structures. Among the feature selection approaches, genes selected by group difference in gene expression showed the best performance, with test accuracy 0.838 and AUC 0.889. Hybrid ensemble approaches designed to increase the coverage of individual DNN models displayed better performance, with test accuracy 0.90 and test AUC 0.869. An alternative hybrid ensemble model with partially sensitive samples removed achieved test accuracy 0.903 and test AUC 0.914. Based on these results, it is suggested that a hybrid ensemble approach could help the prediction of platinum resistance in ovarian cancer and subsequent treatment practice in clinics.
Keywords: Ovarian cancer; Deep learning; Ensemble; Platinum resistance; Survival analysis.
Sensitivity-controlled Event Trigger Identification in Multi-Level Biomedical Context
by Chen Shen, Hongfei Lin, Zhengguang Li, Yonghe Chu, Zhihao Yang
Abstract: The identification of biomedical event triggers serves as an important step in biomedical event extraction. It is a domain-specific task restricted to limited annotated text and language representations in computational models. To achieve a model that can learn and leverage more semantic information, most conventional methods rely on machine learning models, which require a series of artificially designed features. Moreover, existing methods have been conducted on imbalanced datasets, but have not adjusted for this. Therefore, we propose a novel framework, CBSC, to address imbalanced quantities of training data across biomedical event categories. This framework integrates convolutional and recurrent neural networks for better language representation, and leverages a sensitivity-controlled support vector machine with an enhanced balanced loss function as the classifier of the network. The experiments conducted on the multi-level event extraction dataset show that our approach provides a more balanced solution between P/R and outperforms other state-of-the-art methods.
Keywords: Event trigger identification; biomedical event extraction; imbalanced classification; sensitivity-controlled support vector machine (SCSVM); neural networks.
An amino acid property based method for identifying solenoid proteins
by Senthilnathan Rajendran, Arunachalam Jothi
Abstract: Solenoid proteins are proteins that contain repeating structural units. They are associated with many important biological functions and are also key factors in the onset of many human diseases such as Huntington disease, mental retardation, inherited ataxias, etc. Detecting solenoid proteins from the sequence information alone is a challenging problem. Current methods for identifying solenoid proteins from sequence rely heavily on homology-based approaches. In this work, we propose an alternative method which uses just the amino acid composition and a set of biophysical descriptors to identify solenoid proteins. Four different machine learning approaches, Naive Bayes (NB), support vector machine (SVM), Bayesian Generalized Linear Models (BGLM) and Random Forest (RF), were used for classification. These four classification models were validated using cross-validation. The Area Under the Curve (AUC) was found to be above 0.9 for all the models. The entire procedure was performed using the R programming language.
Keywords: Solenoid proteins; Repeats; Amino acid composition; Biophysical properties; PCA; AUC; SVM; Random forest; Naive Bayes; Machine learning.
Evaluation of cross-ontology association rules weighted by term specificity
by Young-Rae Cho
Abstract: The use of an ontology is currently a prevailing trend for management and analysis of biological big data. The recently created bio-ontologies cover diverse domains in bioinformatics. Consequently, there is strong demand for algorithms that handle complex ontology structures and analyse data in ontologies accurately. One of the interesting topics is to apply association rule mining to ontological data analysis. We can discover association rules between cross-ontology terms, which provide clues for predicting the functions a gene performs or the phenotypes a gene determines. However, because association rule mining algorithms are biased towards rules of more general terms, it has been a challenge to discover rules between conceptually more specific terms. We propose a pairwise cross-ontology weighted rule mining (WRM) approach which uses support and lift weighted by term specificity. This approach was tested using two different conceptual specificity metrics for each ontological term: edge-based and IC-based. For our experiments, the biological process (BP) and molecular function (MF) sub-ontologies of Gene Ontology (GO), and the phenotypic abnormality (PA) sub-ontology of Human Phenotype Ontology (HPO) were used. The experimental results show that IC-based WRM produced rules of more specific terms in BP and PA than the general unweighted version of association rule mining (ARM). This indicates that our weighting strategy improves the discovery of conceptually specific cross-ontology rules between BP and PA. However, no method could generate rules of more specific MF terms than ARM. In comparison with previous methods, the proposed WRM and the level-wise search algorithm COLL had the best performance in discovering specific rules. The rules from BP to PA terms can be used for predicting the specific diseases caused by a gene annotated to BP terms.
Keywords: ontology; Gene Ontology; Human Phenotype Ontology; association rules; term specificity.
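As a rough illustration of the two measures this abstract builds on, the sketch below computes plain, unweighted support and lift for a pair of terms. The term-specificity weighting of WRM is not reproduced, since the abstract does not give its formula. The function names, the term labels and the toy annotation sets are hypothetical.

```python
def support(annotations, terms):
    """Fraction of genes whose annotation set contains every term in `terms`."""
    wanted = set(terms)
    hits = sum(1 for gene_terms in annotations if wanted <= set(gene_terms))
    return hits / len(annotations)


def lift(annotations, antecedent, consequent):
    """Standard lift of the rule antecedent -> consequent (unweighted)."""
    s_a = support(annotations, [antecedent])
    s_c = support(annotations, [consequent])
    if s_a == 0 or s_c == 0:
        return 0.0  # rule is undefined when either term never occurs
    return support(annotations, [antecedent, consequent]) / (s_a * s_c)


# Hypothetical toy data: each set holds the BP and PA terms annotated to one gene.
genes = [
    {"BP:a", "PA:x"},
    {"BP:a", "PA:x"},
    {"BP:b"},
    {"PA:y"},
]
rule_lift = lift(genes, "BP:a", "PA:x")
```

A lift above 1 means the two terms co-occur more often than independence would predict. A weighted variant would scale these quantities by a per-term specificity score.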
A Network Enhancement-based Method for Clustering of Single cell RNA-seq Data
by Xiaoshu Zhu, Lilu Guo, Rongyuan Li, Yunpei Xu, Fang-xiang Wu, Xiaoqing Peng, Hong-Dong Li
Abstract: Single cell RNA sequencing (scRNA-seq) provides a more granular description of gene expression in a single cell. Many clustering methods for scRNA-seq data have been developed to understand cell development and cell differentiation. However, the high dimensionality and strong noise make clustering scRNA-seq data challenging. To overcome this problem, we propose a method for clustering scRNA-seq data, called network enhancement-based similarity combined with Louvain (NES-Louvain). In NES-Louvain, the initial similarity matrix is denoised using a network enhancement method. Then, a path-based similarity measurement is designed to introduce the nodes in high-order paths, based on the assumption that including more relevant nodes would improve the similarity of node pairs. Finally, the Louvain community detection method is improved to cluster single cells. The experimental results show that NES and NES-Louvain achieve better performance than other methods. Furthermore, NES-Louvain is robust to perturbation.
Keywords: similarity measurement; single cell clustering; network enhancement; path-based similarity; Louvain community detection.
A systematic approach for pre-processing electronic health records for mining: case study of heart disease
by Leila Baradaran Sorkhabi, Farhad Soleimanian Gharehchopogh, Jafar Shahamfar
Abstract: Electronic Health Records (EHRs) form a major part of Medical Big Data (MBD) and are enormous resources of knowledge. Mining EHRs can lead us to new generations of medicine (e.g. precision medicine). In practice, however, this is not straightforward, because raw EHRs are unsuitable for mining. Any raw data is naturally dirty, but some special challenges make EHRs more susceptible to being dirty. To extract more precise and reliable knowledge, we must pre-process EHRs. Applying appropriate pre-processing techniques based on the specific properties of EHRs will provide higher-quality and more usable data. Here we introduce PEPMED, a systematic pre-processing approach that consists of three main stages. Each stage includes hybrid methods to deal with the challenges of dirty data. Four well-known subgrouping methods were performed on both raw and pre-processed data to evaluate the approach. We used precision and overall accuracy as measurements. Results show that PEPMED dramatically improved accuracy.
Keywords: EHRs; pre-processing; medical big data; data mining; precision medicine; heart disease; systematic; data quality; accuracy; data volume.
Deep learning-based classification and interpretation of gene expression data from cancer and normal tissues
by TaeJin Ahn, Taewan Goo, Chan-Hee Lee, SungMin Kim, Kyullhee Han, Sangick Park, Taesung Park
Abstract: Outstanding performance has been achieved in resolving recognition and classification problems with deep learning technology. As increasing amounts of gene expression data from cancer and normal samples become publicly available, deep learning may become an integral component of revealing specific patterns within massive data sets. Thus, we aimed to address the extent to which a deep learning model can learn to recognise cancer. We integrated gene expression data from the Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET) and Genotype-Tissue Expression (GTEx) databases, including 13,406 cancer and 12,842 normal gene expression samples from 24 different tissues. We first trained a Deep Neural Network (DNN) to identify cancer and normal samples using various gene selection strategies: genes of high expression or large variance, therapeutic target genes from commercial cancer panels, and genes in NCI-curated cancer pathways. We also suggest a systematic analysis method to interpret trained deep neural networks. We applied the method to find the genes that contribute most to classifying an individual sample as cancer.
Keywords: cancer; deep learning; gene expression; oncogene addiction.
A multi-view classification and feature selection method via sparse low-rank regression analysis
by Yao Lu, Ying-Lian Gao, Pei-Yong Li, Jin-Xing Liu
Abstract: In recent years, multi-view classification and feature selection methods have received close attention in many fields. However, in many practical classification problems, the data in each view may contain a lot of noise. In addition, when data are high-dimensional with small sample sizes, it is difficult to remove redundant features in feature selection experiments. To deal with these problems, a sparse multi-view low-rank regression method is proposed in this paper. The method, based on sparse and low-rank theory, introduces penalty factors in the matrix transformation process to decompose the matrix into sparse and low-rank results. The model is constructed by imposing L2-norm and L2,1-norm constraints on the objective function. Experimental results on sequencing data show that the proposed method has superior performance over several state-of-the-art methods in multi-view classification and feature selection.
Keywords: classification; feature selection; L2,1-norm; low-rank regression; multi-view data; row-sparsity.
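For readers unfamiliar with the L2,1-norm named in this abstract, its standard definition is given below. The symbols are generic notation assumed here, not the paper's: W is an n-by-m matrix and w^i is its i-th row.

```latex
\|W\|_{2,1} \;=\; \sum_{i=1}^{n} \|w^{i}\|_{2}
          \;=\; \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} W_{ij}^{2}}
```

Because the norm sums the Euclidean lengths of the rows, penalising it tends to drive whole rows to zero, which is the row-sparsity listed in the keywords.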
Drug target interaction prediction via multi-task co-attention
by Yuyou Weng, Xinyi Liu, Hui Li, Chen Lin, Yun Liang
Abstract: Drug-Target Interaction (DTI) prediction is a key step in drug discovery and drug repurposing. A variety of machine learning models are considered to be effective means of predicting DTI. Most current studies regard DTI prediction as a classification task (that is, negative or positive labels are applied to indicate the intensity of interaction) or a regression task (a numerical value is used to measure detailed DTI). In this article, we explore how to balance bias and variance through a multi-task learning framework, because classifiers are more likely to produce higher bias, while regression models are more prone to high variance and to overfitting the training data. We propose a novel model, named Multi-DTI, that can predict the precise value and determine the correct labels of positive or negative interactions. In addition, these two tasks are performed with similar feature representations from a CNN, which is combined with a co-attention mechanism. Detailed experiments show that Multi-DTI is superior to state-of-the-art methods.
Keywords: multi-task learning; scientific data management; data integration; drug target interaction.
Foetal weight prediction based on improved PSO-GRNN model
by Fangxiong Chen, Guoheng Huang, Huishi Wu, Ke Hu, Weiwen Zhang, Lianglun Cheng
Abstract: Foetal weight prediction is important for foetal development and the safety of pregnant women. However, foetal weight can only be roughly predicted using the ultrasound data set of pregnant women, and the prediction accuracy is still low. In this paper, we propose a prediction model, termed PSO-GRNN, which is based on the Particle Swarm Optimisation algorithm and the Generalised Regression Neural Network, in order to obtain the foetal weight using the physical examination data and ultrasonic data of pregnant women. The historical examination data of pregnant women are first pre-processed, a prediction model is established by GRNN, and the parameters of the prediction model are then optimised using an improved particle swarm optimisation algorithm to reduce human interference. The experimental results show that, compared with some state-of-the-art algorithms, the Mean Relative Error of the proposed method is on average 1.33% lower and the accuracy of foetal weight prediction is 4.15% higher.
Keywords: foetal weight prediction; pregnant women prenatal; feature normalisation; particle swarm optimisation; GRNN; generalised regression neural network; regression model; deep learning; ultrasound.
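To make the GRNN component of this abstract concrete, here is a minimal sketch of a textbook generalised regression neural network, whose prediction is a Gaussian-kernel weighted average of the training targets. It is not the authors' implementation. The function name, the toy data and the choice of the smoothing factor sigma as the tuned parameter are assumptions, since the abstract does not name the parameters that the improved PSO optimises.

```python
import math


def grnn_predict(train_x, train_y, query, sigma):
    """Textbook GRNN output: kernel-weighted mean of the training targets."""
    weights = []
    for x in train_x:
        # Squared Euclidean distance between the query and one training sample.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every kernel underflowed to zero; fall back to the plain mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Hypothetical one-feature toy data (not from the paper).
toy_x = [[0.0], [2.0]]
toy_y = [1.0, 3.0]
midpoint_prediction = grnn_predict(toy_x, toy_y, [1.0], sigma=1.0)
```

A small sigma makes the prediction follow the nearest training sample, and a large sigma pulls it towards the global mean. An external search, such as the particle swarm optimisation mentioned in the abstract, would choose this value instead of a hand-picked one.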