International Journal of Data Mining and Bioinformatics (12 papers in press)
Systematic investigation of hyperparameters on performance of deep neural networks: application to ovarian cancer phenotypes
by Suhyun Hwangbo, Se Ik Kim, Untack Cho, Dae-Shik Suh, Yong-Sang Song, Taesung Park
Abstract: The application of deep neural networks (DNNs) to medicine has recently emerged as a major approach for prognosis. Many medical researchers have expected DNN algorithms to yield more accurate predictions in their analyses. However, while these applications are currently underway for medical imaging data, they are not yet optimized for clinicopathologic data, with two-dimensional input space. One challenge is the difficulty of optimizing hyperparameters, i.e., of training DNN models that give more accurate predictions. In this study, we identified the hyperparameters having a greater impact on predictive power, by applying DNNs to clinicopathologic data. Specifically, we predicted therapeutic response to platinum-based chemotherapy, based on data from 710 epithelial ovarian cancer patients. Predictive performance was measured by area under the curve (AUC) after optimizing six hyperparameters: the number of hidden layers, number of hidden units, optimization algorithm, weight initialization, activation function for hidden layers, and dropout rate. By identifying the significant main effects, and interaction effects, of these hyperparameters on clinical prediction, we successfully determined combinations of hyperparameters contributing to higher predictive power. These approaches have ramifications for assessing therapeutic response to numerous treatments for various pathologies.
Keywords: DNN; deep neural network; hyperparameter; predictive power.
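The six hyperparameters listed in the abstract above lend themselves to a full-factorial search scored by AUC. The following is only a minimal sketch: the candidate values in GRID are hypothetical (the abstract names the hyperparameters, not the levels tested), and auc() is a generic rank-based estimate, not necessarily the authors' implementation.

```python
from itertools import product

# Hypothetical candidate levels; the abstract names the six hyperparameters
# but does not list the values that were actually compared.
GRID = {
    "hidden_layers": [1, 2, 3],
    "hidden_units": [16, 64],
    "optimizer": ["sgd", "adam"],
    "weight_init": ["glorot_uniform", "he_normal"],
    "activation": ["relu", "tanh"],
    "dropout_rate": [0.0, 0.5],
}


def iter_configs(grid):
    """Yield one dict per combination in the full-factorial grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Each configuration from iter_configs(GRID) would be passed to a model-training routine (not shown), and the resulting test-set scores to auc(); main and interaction effects can then be estimated from the table of configurations and AUC values.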
Sparse Superlayered Neural Network-based Multi-Omics Cancer Subtype Classification
by Prasoon Joshi, Seokho Jeong, Taesung Park
Abstract: Recently, targeted treatment of different subtypes of cancer has become of interest, underscoring an increased need for an accurate understanding of the molecular differentiation of pathological subtypes. To that end, we present a new deep neural network subtype classification model, Sparse CRoss-modal Superlayered Neural Network (SCR-SNN), focusing on integrating high-dimensional RNA sequencing data with DNA methylation data. Our model consists of the following steps: (1) biomarker filtration; (2) biomarker selection, using a cross-modal, superlayered neural network with an L1 penalty; (3) integration of selected biomarkers from gene expression and DNA methylation data; and (4) prediction model building. For comparison, machine-learning methods, such as principal component analysis, penalized logistic regression, and artificial neural networks, were used, alone and in combination. In these analyses, SCR-SNN was applied to gene expression and methylation data of lung adenocarcinoma and squamous cell lung carcinoma from The Cancer Genome Atlas. The SCR-SNN model classified these lung cancer subtypes well, using only a small number of markers. A significant difference in epidermal development and cornification pathway activation levels, between the two lung cancer subtypes, was also found. This approach represents a promising methodology for disease and sub-disease categorization and diagnosis.
Keywords: classification; machine learning; multi-omics data; RNA-sequencing data.
Two-variate Phenotype-targeted Tests for Detecting Phenotypic Biomarkers in Cancers
by Jinxiong Lv, Shikui Tu, Lei Xu
Abstract: Detection of cancer-related phenotypic biomarkers is crucial for clinical research. The traditional pipeline consists of two stages: candidates are first selected for being significantly differentially expressed between tumor-adjacent and tumor conditions, and are then filtered by phenotype-targeted tests (PT tests). Such a two-phase process has low detection power. In this paper, a two-variate PT test, which jointly considers tumor-adjacent data and tumor data, is adopted to strengthen the detection power. We conduct a systematic investigation of three implementations of two-variate PT tests for detecting phenotypic biomarkers in three types of cancers, and provide a practical guideline for the usage of the two-variate PT tests. Experimental analysis indicates that the two-variate PT tests achieve stronger detection power than traditional methods. The tumor-adjacent data provide complementary information to the discriminant analysis, and Fisher discriminant analysis is the best implementation of the two-variate PT test for detecting phenotypic biomarkers in cancers.
Keywords: Two-variate Phenotype-Targeted Test; Phenotypic Biomarkers; Breast Cancer; Lung Cancer; Thyroid Cancer; Body Mass Index; Overall Survival Time; Pathologic Stage; Microarray Expression Data; RNA-seq Expression Data.
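To illustrate the idea of treating tumor-adjacent and tumor expression jointly, the sketch below computes a two-class Fisher discriminant direction for two-variate samples. It assumes each sample is a (tumor-adjacent, tumor) expression pair for one gene and that the two classes are phenotype groups; the abstract does not give the actual test statistic, so this shows only the underlying discriminant, not the authors' PT test.

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter2(points, m):
    """Within-class scatter of 2-D points around m, as (sxx, sxy, syy)."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_direction(class0, class1):
    """Return (w, J): w = Sw^-1 (m1 - m0) and the Fisher criterion J at w.

    Each sample is assumed to be (tumor_adjacent_expr, tumor_expr).
    """
    m0, m1 = mean2(class0), mean2(class1)
    a0, b0, c0 = scatter2(class0, m0)
    a1, b1, c1 = scatter2(class1, m1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1  # pooled scatter Sw = [[a, b], [b, c]]
    det = a * c - b * b
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # Inverse of a symmetric 2x2 matrix: (1/det) * [[c, -b], [-b, a]]
    w = ((c * dx - b * dy) / det, (-b * dx + a * dy) / det)
    # At the optimal w, J = (m1 - m0)^T Sw^-1 (m1 - m0) = w . (m1 - m0)
    return w, w[0] * dx + w[1] * dy
```

A larger J means the two phenotype groups are better separated when both measurements are used together than a single-variate comparison would suggest.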
A Feature-Learning based method for the disease-gene prediction problem
by Lorenzo Madeddu, Giovanni Stilo, Paola Velardi
Abstract: We predict disease-gene relations on the human interactome network using a methodology that jointly learns functional and connectivity patterns surrounding proteins. Contrary to other data structures, the interactome is characterized by high incompleteness and the absence of explicit negative knowledge, which makes predictive tasks particularly challenging. To best exploit latent information in the network, we propose an extended version of random walks, named Random Watcher-Walker ($RW^2$), which is shown to perform better than other state-of-the-art algorithms. We also show that the performance of $RW^2$ and other compared state-of-the-art algorithms is extremely sensitive to the interactome used, and to the adopted disease categorizations, since this influences the ability to capture regularities in the presence of sparsity and incompleteness.
Keywords: network medicine; disease gene prediction; disease gene prioritization; node embedding; random walks; graph-based methods; biological networks; complex diseases.
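For readers unfamiliar with random-walk node embedding, the sketch below generates plain uniform random walks over an adjacency list; such walks are typically fed to a skip-gram model to learn node embeddings. This is the baseline technique only, not $RW^2$, whose extensions are not described in the abstract. The toy graph and its node names are hypothetical.

```python
import random

# Hypothetical toy interactome: node -> list of neighbours (undirected).
TOY_GRAPH = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P3"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P3"],
}


def random_walk(adjacency, start, length, rng):
    """Uniform random walk of up to `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbours = adjacency[walk[-1]]
        if not neighbours:  # dead end: stop early
            break
        walk.append(rng.choice(neighbours))
    return walk


def generate_walks(adjacency, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node, in sorted node order."""
    rng = random.Random(seed)  # seeded for reproducibility
    walks = []
    for _ in range(walks_per_node):
        for node in sorted(adjacency):
            walks.append(random_walk(adjacency, node, length, rng))
    return walks
```

Because an incomplete interactome has missing edges, walks of this kind can only visit neighbourhoods that have been observed, which is one way to see why results depend on the interactome used.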
Identification of protein hot regions by combing structure-based classification, energy-based clustering and sequence-based conservation in evolution
by Jing Hu, Haomin Gan, Nansheng Chen, Xiaolong Zhang
Abstract: Revealing protein hot regions is key to understanding protein-protein interactions. Because experimental methods are slow and labour-intensive, computational methods are very helpful for predicting hot regions more efficiently. Previous methods are typically based on a single aspect, such as structure, energy, or sequence, and each aspect has its limitations. In this paper, we propose a new method that combines structure-based classification, energy-based clustering and sequence-based conservation. The method makes full use of these three aspects of protein features and minimizes the limitations of relying on any single one. Experimental results show that the proposed method increases the prediction accuracy of protein hot regions.
Keywords: hot region; protein structure; energy clustering; sequence conservation; protein-protein interaction.
A Systematic Approach for Pre-processing Electronic Health Records for mining: Case Study of Heart Disease
by Leila Baradaran Sorkhabi, Farhad Soleimanian Gharehchopogh, Jafar Shahamfar
Abstract: Electronic health records (EHRs) form a major part of medical big data (MBD) and are enormous resources of knowledge. Mining EHRs can lead to new generations of medicine (e.g. precision medicine). In practice, however, this is not straightforward, because raw EHRs are unsuitable for mining. Any raw data is naturally dirty, but some special challenges make EHRs especially susceptible to being dirty. To extract more precise and reliable knowledge, EHRs must be pre-processed. Applying appropriate pre-processing techniques based on the specific properties of EHRs yields higher-quality and more usable data. Here we introduce PEPMED, a systematic pre-processing approach that consists of three main stages. Each stage includes hybrid methods to deal with the challenges of dirty data. Four well-known subgrouping methods were applied to both raw and pre-processed data to evaluate the approach. We used precision and overall accuracy as measures. Results show that PEPMED dramatically improved accuracy.
Keywords: EHRs; Pre-processing; Medical big data; Data mining; Precision medicine; Heart disease; Systematic; Data quality; Accuracy; Data volume.
Deep learning-based classification and interpretation of gene expression data from cancer and normal tissues
by TaeJin Ahn, Taewan Goo, Chan-hee Lee, SungMin Kim, Kyullhee Han, Sangick Park, Taesung Park
Abstract: Outstanding performance has been achieved in resolving recognition and classification problems using deep learning technology. As increasing amounts of gene expression data from cancer and normal samples are accumulated and made available to researchers, deep learning is becoming an integral component of revealing specific patterns within massive datasets. Thus, we examined the extent to which deep learning can learn to recognize cancer. We integrated gene expression data from the Gene Expression Omnibus, The Cancer Genome Atlas, Therapeutically Applicable Research To Generate Effective Treatments, and Genotype-Tissue Expression databases. The databases provide 13,406 cancer and 12,842 normal gene expression samples from 24 different tissues. We first trained a deep neural network to identify cancer and normal samples using various gene selection strategies: genes showing high expression or large variance, drug target genes from commercial cancer panels, and genes in National Cancer Institute-curated cancer pathways. We also suggest a systematic analysis method for interpreting trained deep neural networks. We applied the method to identify the genes that contribute most to classifying cancer in an individual sample.
Keywords: Cancer; deep learning; gene expression; oncogene addiction.
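One of the gene selection strategies named in the abstract above, choosing genes with large variance, can be sketched in a few lines. The data layout (a dict mapping gene name to per-sample expression values), the gene names, and the cutoff k are assumptions for illustration; the abstract does not say how many genes were kept.

```python
from statistics import pvariance


def top_variance_genes(expression, k):
    """Return the k gene names with the largest variance across samples.

    `expression` maps a gene name to a list of expression values,
    one value per sample (a hypothetical layout chosen for this sketch).
    """
    ranked = sorted(
        expression,
        key=lambda gene: pvariance(expression[gene]),
        reverse=True,
    )
    return ranked[:k]


# Usage with made-up values for three placeholder genes.
example = {
    "GENE_A": [1.0, 1.0, 1.0],   # constant: variance 0
    "GENE_B": [0.0, 5.0, 10.0],  # most variable
    "GENE_C": [2.0, 3.0, 4.0],
}
selected = top_variance_genes(example, k=2)
```

The selected genes would then form the input features of the classifier; the other strategies in the abstract (high expression, panel genes, pathway genes) replace only the ranking step.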
A Multi-view Classification and Feature Selection Method via Sparse Low-Rank Regression Analysis
by Yao Lu, Ying-Lian Gao, Jin-Xing Liu
Abstract: In recent years, multi-view classification and feature selection methods have received close attention in many fields. However, in many practical classification problems, the data in each view may contain a lot of noise. In addition, when data are high-dimensional with small sample sizes, it is difficult to remove redundant features during feature selection. To deal with these problems, a sparse multi-view low-rank regression method is proposed in this paper. Based on sparse and low-rank theory, the method makes the matrix decomposition produce sparse and low-rank results by adding penalty factors in the matrix transformation process. The model is constructed by imposing L2-norm and L2,1-norm constraints on the objective function. Experimental results on sequencing data show that the proposed method has superior performance over several state-of-the-art methods in multi-view classification and feature selection.
Keywords: classification; feature selection; L2,1-norm; low-rank regression; multi-view data; row-sparsity.
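The L2,1-norm mentioned in the abstract above is what produces row-sparsity: it sums the Euclidean norms of the rows of a matrix, so minimizing it drives whole rows (features) to zero. A minimal sketch, assuming the common row-wise convention and a plain list-of-lists matrix:

```python
import math


def l21_norm(matrix):
    """L2,1-norm: the sum over rows of each row's Euclidean (L2) norm."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


def nonzero_rows(matrix, tol=1e-12):
    """Indices of rows whose L2 norm exceeds tol, i.e. the selected features."""
    return [
        i
        for i, row in enumerate(matrix)
        if math.sqrt(sum(v * v for v in row)) > tol
    ]


# Usage: a 3x2 coefficient matrix whose middle row has been zeroed out.
W = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]
```

Here l21_norm(W) adds the row norms 5, 0 and 13, and nonzero_rows(W) reports that features 0 and 2 survive.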
Drug Target Interaction Prediction via Multi-task Co-attention
by Yuyou Weng, Xinyi Liu, Hui Li, Chen Lin, Yun Liang
Abstract: Drug-Target Interaction (DTI) prediction is a key step in drug discovery and drug repurposing. A variety of machine learning models are considered to be effective means of predicting DTI. Most current studies regard DTI prediction as a classification task (that is, negative or positive labels are applied to indicate the intensity of interaction) or a regression task (a numerical value is used to measure the detailed DTI). In this article, we explore how to balance bias and variance through a multi-task learning framework, because classifiers are more likely to produce higher bias, while regression models are more prone to high variance and to overfitting the training data. We propose a novel model, named Multi-DTI, that can predict the precise value and determine the correct labels of positive or negative interactions. Besides, these two tasks are performed with similar feature representations from a CNN, which is combined with a co-attention mechanism. Detailed experiments show that Multi-DTI is superior to state-of-the-art methods.
Keywords: Multi-task Learning; Scientific Data Management; Data Integration; Drug Target Interaction.
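A common way to train one model on both a classification and a regression task is to minimize a weighted sum of the two losses. The sketch below shows that generic objective using binary cross-entropy and squared error; the weighting parameter alpha and the choice of losses are assumptions, since the abstract does not state how Multi-DTI combines its tasks.

```python
import math


def binary_cross_entropy(label, prob, eps=1e-12):
    """Cross-entropy for one binary label, with prob clipped away from 0 and 1."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def squared_error(target, prediction):
    """Squared error for one regression target."""
    return (target - prediction) ** 2


def multitask_loss(batch, alpha=0.5):
    """Mean of alpha * classification loss + (1 - alpha) * regression loss.

    Each batch item is (label, predicted_prob, affinity, predicted_affinity).
    alpha is a hypothetical trade-off weight, not a value from the paper.
    """
    total = 0.0
    for label, prob, affinity, predicted_affinity in batch:
        total += alpha * binary_cross_entropy(label, prob)
        total += (1.0 - alpha) * squared_error(affinity, predicted_affinity)
    return total / len(batch)
```

Setting alpha closer to 1 emphasizes getting the positive/negative label right, while values closer to 0 emphasize the precise interaction value.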
Fetal Weight Prediction Based on Improved PSO-GRNN Model
by Fangxiong Chen, Guoheng Huang, Huishi Wu, Ke Hu, Weiwen Zhang, Lianglun Cheng
Abstract: Fetal weight prediction is important for fetal development and the safety of pregnant women. However, fetal weight can only be roughly predicted using the B ultrasound dataset of pregnant women, and the prediction accuracy is still very low. In this paper, an improved Particle Swarm Optimization algorithm is proposed to optimize a Generalized Regression Neural Network (GRNN) prediction model of fetal weight based on the physical examination data and ultrasonic data of pregnant women. The historical examination data of pregnant women were first preprocessed, a prediction model was then established with GRNN, and the parameters of the prediction model were optimized with the improved particle swarm optimization algorithm to reduce human interference. The experimental results showed that, compared with some of the most advanced algorithms, the average relative error of the proposed method was 3.2% lower, and the accuracy of fetal weight prediction was 12% higher.
Keywords: fetal weight prediction; pregnant women prenatal; feature normalization; particle swarm optimization; generalized regression neural network; regression model; deep learning; B ultrasound.
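A GRNN prediction is a Gaussian-kernel weighted average of the training targets, controlled by a single smoothing width. The sketch below shows that standard form together with the mean relative error used as the metric in the abstract above. It assumes the smoothing width sigma is the parameter being tuned (the abstract does not say which parameters the improved PSO optimizes), and the PSO search itself is not shown.

```python
import math


def grnn_predict(train_x, train_y, query, sigma):
    """Standard GRNN output: kernel-weighted mean of the training targets."""
    weights = []
    for x in train_x:
        dist_sq = sum((a - b) ** 2 for a, b in zip(x, query))
        weights.append(math.exp(-dist_sq / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every kernel underflowed to zero; fall back to the plain mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def mean_relative_error(y_true, y_pred):
    """Average of |prediction - truth| / |truth| over all samples."""
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def leave_one_out_error(train_x, train_y, sigma):
    """Mean relative error when each sample is predicted from all the others."""
    predictions = []
    for i in range(len(train_x)):
        rest_x = train_x[:i] + train_x[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        predictions.append(grnn_predict(rest_x, rest_y, train_x[i], sigma))
    return mean_relative_error(train_y, predictions)
```

An optimizer such as particle swarm optimization could treat leave_one_out_error as the fitness function and search over sigma instead of leaving the value to manual choice.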
MHCherryPan: a novel pan-specific model for binding affinity prediction of class I HLA-peptide
by Xuezhi Xie
Abstract: The human leukocyte antigen (HLA) system or complex plays an irreplaceable role in regulating the human immune system. Accurate prediction of peptide binding with HLA can efficiently help identify neoantigens, which could potentially bring great change to immune drug development. HLA is one of the most polymorphic genetic systems in humans, and thousands of HLA allelic versions exist. Due to the high polymorphism of the HLA complex, it is still difficult to accurately predict binding affinity. In this paper, we propose a novel algorithm that combines a convolutional neural network and long short-term memory to solve this problem. Our model has been tested on the experimental benchmark from IEDB and shows state-of-the-art performance compared with other currently popular algorithms.
Keywords: Bioinformatics; deep learning; health informatics; HLA; MHC.
Ensemble of Deep Learning Models to Predict Platinum Resistance in High Grade Serous Ovarian Cancer
by Kyullhee Han*, Hyeonjung Ham*, SaeIck Kim, YoungSang Song, Taesung Park**, TaeJin Ahn**
Abstract: Deep learning is a machine learning approach that has emerged as a useful method for prediction based on provided data. Biological problems often include complicated interactions of genes and other metabolites. Deep learning has several benefits for predicting such complicated problems and can produce useful estimates. In clinical practice, the prediction of platinum resistance in ovarian cancer is an important problem for newly diagnosed patients, because it alters treatment options for patients and subsequently their quality of life. In this paper, Deep Neural Network (DNN) models are designed and evaluated with several feature selection approaches and network structures. Among the feature selection approaches, genes selected by group difference in gene expression showed the best performance, with test accuracy 0.838 and AUC 0.889. Hybrid ensemble approaches that are designed to increase the coverage of individual DNN models displayed better performance, with test accuracy 0.90 and test AUC 0.869. An alternative hybrid ensemble model with partially sensitive samples removed displayed test accuracy 0.903 and test AUC 0.914. Based on these results, it is suggested that a hybrid ensemble approach could help the prediction of platinum resistance in ovarian cancer and subsequent treatment practice in clinics.
Keywords: Ovarian cancer; Deep learning; Ensemble; Platinum resistance; Survival analysis.
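As a simple illustration of combining several DNN models, the sketch below averages the predicted probabilities of the ensemble members (soft voting) and scores the result by accuracy. This is a generic ensemble, shown under the assumption that each member outputs a probability of platinum resistance per sample; the abstract does not describe how the authors' hybrid ensemble is constructed.

```python
def soft_vote(member_probs):
    """Average per-sample probabilities across ensemble members.

    `member_probs` is a list with one inner list per model, each holding
    that model's predicted probability for every sample.
    """
    n_members = len(member_probs)
    n_samples = len(member_probs[0])
    if any(len(probs) != n_samples for probs in member_probs):
        raise ValueError("all members must score the same number of samples")
    return [
        sum(probs[i] for probs in member_probs) / n_members
        for i in range(n_samples)
    ]


def to_labels(probs, threshold=0.5):
    """Convert probabilities to 0/1 labels at the given threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Averaging lets a sample that one model scores uncertainly be decided by the others, which is one reason ensembles can outperform their individual members.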