International Journal of Data Mining and Bioinformatics (15 papers in press)
An End-to-End Framework for Biomedical Event Trigger Identification with Hierarchical Attention and Adaptive Cost Learning
by Jinyong Zhang, Dandan Fang, Weizhong Zhao, Jincai Yang, Wen Zou, Xingpeng Jiang, Tingting He
Abstract: Recent years have witnessed rapid growth in the amount of textual data in the biomedical community. How to effectively extract biomedical events from biomedical corpora remains a major challenge. As a prerequisite step in biomedical event extraction, event trigger identification has attracted growing attention in biomedical research. Existing approaches to biomedical event trigger identification have two major drawbacks: 1) each sentence in a biomedical document is handled separately, which ignores the global context of the document; 2) they fail to address the class imbalance induced by the sparseness of event triggers in biomedical documents. To improve the performance of biomedical event trigger identification, we propose a deep neural network-based framework that addresses both of these challenges. Specifically, the syntactic dependency tree and a hierarchical attention mechanism are used to model both local and global contexts, including the syntactic and semantic relationships among words in the same sentence and among sentences in the same document. Moreover, we propose an adaptive cost learning method to address the class imbalance issue in biomedical event trigger identification. Extensive experiments are conducted on two real-world datasets, and the results demonstrate the effectiveness of the proposed framework.
Keywords: Biomedical event trigger identification; End-to-end model; Graph convolutional network; Syntactic dependency tree; Hierarchical attention mechanism; Adaptive cost learning.
Identifying Catheter-Related Events Through Sentence Classification
by Thomas Brox Røst, Christine Raaen Tvedt, Haldor Husby, Ingrid Andås Berg, Øystein Nytrø
Abstract: Infections caused by central venous catheter (CVC) use are a serious and under-reported problem in healthcare. The CVC is almost ubiquitous in critical care because it enables fast circulatory monitoring and central administration of medication and nutrition. However, the CVC exposes the patient to a risk of blood-stream infections (BSI). Explicit documentation of normal CVC usage and exposure is sparse and indirect in the health record. For a clinician, CVC presence is simple to infer from record statements about procedures, plans and results related to the CVC. In order to capture evidence about CVC-related risk of infections and complications, it is important to develop computerized tools that can estimate individual patient days of CVC exposure retrospectively for large cohorts of patients. Towards that objective, we have developed methods for learning classifiers for statements about CVC-related events occurring in the textual health record. This includes developing and testing an annotation ontology of events and indicators, annotation guidelines, and a gold standard of annotated clinical records selected from a corpus of complete health records for more than 800 episodes of care, as well as collecting alternate health register evidence for validation purposes. This paper describes the available data and gold standard, feature selection approaches and our experiments with different classification algorithms. We find that even with limited data it is possible to build reasonably accurate sentence classifiers for the most important events. We also find that making use of document meta information helps improve classification quality by providing additional context to a sentence. Finally, we outline some strategies for using our results in future analysis and reasoning about CVC usage intervals and CVC exposure over individual patient trajectories.
Keywords: medical informatics; health informatics; machine learning; cvc; central venous catheter.
Multi-task transfer learning for biomedical machine reading comprehension
by Wenyang Guo, Yongping Du, Yiliang Zhao, Keyan Ren
Abstract: Biomedical machine reading comprehension aims to extract the answer to a given question from complex biomedical passages, which requires the machine to have a strong ability to comprehend natural language. Recent progress has been made on this task, but it is still severely restricted by insufficient training data due to its domain-specific nature. To solve this problem, we propose a hierarchical question-aware context learning model trained by a multi-task transfer learning algorithm, which can capture the interaction between the question and the passage layer by layer, with multi-level embeddings to strengthen the language representation. The multi-task transfer learning algorithm leverages the advantages of different machine reading comprehension tasks to improve model generalization and robustness, pre-training on multiple large-scale open-domain datasets and fine-tuning on the target-domain training set. Moreover, data augmentation is adopted to create new training samples with various expressions. The public biomedical dataset collected from PubMed and provided by BioASQ is used to evaluate the model performance. The results show that our method is superior to the best recent solution and achieves a new state of the art.
Keywords: biomedical machine reading comprehension; multi-task learning; transfer learning; attention; data augmentation.
Efficient methods for hierarchical multi-omic feature extraction and visualization
by Timothy Becker, Dong-Guk Shin
Abstract: A single DNA alignment file can be resource intensive to visualize at arbitrary scale given current visualization systems. We address this limitation by integrating a parallel out-of-core feature extraction algorithm with a disk based hierarchical data store that is several orders of magnitude faster for visualization tasks. To demonstrate the utility of our approach, we designed a high-performance web application that serves translated data to an interactive client. We incorporate novel visualization of these data features, while allowing user-specified resolution and response. Unlike per-read techniques which can run out of memory when displaying large scale genomic variations, our data structure returns a controllable representation of that region, making the technique ideally suited for visualization of multiple large data sets. We describe our open-source feature extraction framework and web-based visualization while comparing the performance to current systems.
Keywords: feature extraction; sequence alignment visualization.
A functional network construction method to interpret the pathological process of colorectal cancer
by Bolin Chen, Manting Yang, Li Gao, Tao Jiang, Xuequn Shang
Abstract: The prognosis of stage I, II, III and IV cancer patients remains a challenge due to the limited understanding of the underlying pathogenic mechanisms and cancerous processes. A prevalent approach to studying cancer is to analyze the significant dysfunctions in terms of differentially expressed genes (DEGs). However, most studies ignore the fact that DEGs detected from cancer patients tend to be highly heterogeneous, and that significant genetic markers of cancer change along with its pathogenic stages, which can easily bias the enriched dysfunctions toward less relevant functions. Hence, in this study, we propose a new method to generate a functional network with a clear transition route from initial to later stages according to the change of pathogenic stages of colorectal cancer (CRC). Functional interaction networks and a functional evolution network clearly illustrate the functional evolution processes underlying the pathological stages of CRC. Results indicate that the proposed network construction method is better at detecting the most relevant cellular functions than existing methods. It could be employed to explore the evolution processes of cancer mechanisms and may provide new targets for therapeutic intervention.
Keywords: Colorectal cancer; Enrichment analysis; Functional interaction network; Functional evolution network.
GANCDA: A novel method for predicting circRNA-disease associations based on deep generative adversarial network
by Xin Yan, Lei Wang, Zhu-Hong You, Li-Ping Li, Kai Zheng
Abstract: Circular RNA (circRNA) is a single-stranded closed non-coding RNA without 3' and 5' polyadenylated tails, which plays a key regulatory role in life activities. Recognizing the association between circRNA and disease is of great significance for the study of disease mechanisms. However, traditional experimental methods for identifying associations between circRNA and disease are usually undirected and time-consuming. Therefore, methods based on intelligent computing are needed to effectively predict potential circRNA-disease associations and narrow the identification range for biological experiments. In this paper, we propose a model, GANCDA, based on multi-source similarity information and a deep Generative Adversarial Network (GAN) to predict disease-associated circRNAs. First, GANCDA fuses the multi-source information of disease Gaussian interaction profile kernel similarity, circRNA Gaussian interaction profile kernel similarity and disease semantic similarity. It then uses the GAN to extract the essential features of the fusion descriptor in an adversarial learning manner, and finally sends them to a Logistic Model Tree (LMT) classifier for prediction. In 5-fold cross-validation on the circR2Disease dataset, GANCDA achieved 90.6% AUC, 89.2% accuracy and 89.4% precision. In comparison with other feature extraction models and the state-of-the-art SVM classifier model, GANCDA showed strong competitiveness. Moreover, GANCDA prediction results are also supported by biological experiments. Among the top 20 circRNAs with the highest scores in gastric cancer, colorectal cancer and breast cancer, 16, 15 and 17, respectively, have been confirmed by the relevant literature and databases. These results show that GANCDA can accurately predict potential circRNA-disease associations and can be used as an effective assistant tool for biological experiments.
Keywords: Circular RNA; Diseases; CircRNA-disease association; Generative adversarial network; Logistic model tree.
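The Gaussian interaction profile (GIP) kernel similarity named in the GANCDA abstract can be illustrated with a minimal sketch. It assumes the commonly used definition, in which each entity's interaction profile is a binary row of the association matrix and the bandwidth is normalized by the mean squared profile norm. The function name `gip_kernel`, the parameter `gamma_prime` and the toy matrix are illustrative assumptions, not taken from the paper.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between rows of `profiles`.

    `profiles` is a list of equal-length 0/1 rows (one row per circRNA
    or per disease). `gamma_prime` is a hypothetical bandwidth setting.
    """
    n = len(profiles)
    # Normalize the bandwidth by the mean squared norm of the profiles.
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm

    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            # The kernel is symmetric, so fill both triangles at once.
            kernel[i][j] = kernel[j][i] = math.exp(-gamma * sq_dist)
    return kernel


# Toy association matrix: 3 entities (rows) x 3 partners (columns).
toy = [[1, 0, 1],
       [1, 0, 1],
       [0, 1, 0]]
similarity = gip_kernel(toy)
```

Rows with identical profiles get similarity 1, and the value decays as the profiles diverge. How GANCDA fuses this with disease semantic similarity is not specified in the abstract and is not sketched here.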
Two stage clustering analysis to detect pattern change of biomarker expression between experimental conditions
by Iksoo Huh, Sunghoon Choi, Youjin Kim, Soo-Yeon Park, Oran Kwon, Taesung Park
Abstract: In a crossover design, individuals usually undergo all experimental conditions, and the measurements of biomarkers are repeatedly observed at serial time points for each experimental condition. To analyze time-dependent changing patterns of biomarkers, clustering algorithms are commonly used across time points to group together subjects having similar changing patterns. Among the clustering methods, hierarchical- and K-means clustering have been popularly used. However, since they are originally unsupervised approaches, they do not identify different changing patterns between experimental conditions. We propose a new two-stage clustering method focusing on changing patterns. The first stage is to eliminate non-informative biomarkers using Euclidean distances, and the second stage is to allocate the remaining biomarkers to predefined patterns using a correlation-based distance. We demonstrate the advantages of our proposed method by simulation and real data analysis.
Keywords: Two Stage; Pattern Clustering; Biomarker expression; Intervention Study; Cross-over design.
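The two stages described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: each biomarker has one mean time series per experimental condition, stage one drops biomarkers whose Euclidean distance between the two conditions falls below a threshold, and stage two assigns the between-condition difference series to the predefined pattern with the smallest correlation distance (1 minus Pearson r). The threshold, pattern shapes and all names are hypothetical and are not taken from the paper.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no pattern information
    return cov / math.sqrt(var_a * var_b)


def two_stage_clustering(profiles, patterns, threshold):
    """profiles: {biomarker: (series_condition_a, series_condition_b)}
    patterns: {pattern_name: template_series}
    Returns {biomarker: pattern_name} for the informative biomarkers.
    """
    assignments = {}
    for name, (cond_a, cond_b) in profiles.items():
        # Stage 1: drop biomarkers that barely differ between conditions.
        if euclidean(cond_a, cond_b) < threshold:
            continue
        # Stage 2: match the difference series to the closest pattern
        # under the correlation-based distance 1 - r.
        diff = [b - a for a, b in zip(cond_a, cond_b)]
        assignments[name] = min(
            patterns, key=lambda p: 1.0 - pearson(diff, patterns[p]))
    return assignments


toy_profiles = {
    "flat": ([1, 1, 1, 1], [1, 1, 1, 1.1]),
    "up": ([1, 1, 1, 1], [1, 2, 3, 4]),
    "down": ([4, 4, 4, 4], [4, 3, 2, 1]),
}
toy_patterns = {"increasing": [0, 1, 2, 3], "decreasing": [3, 2, 1, 0]}
result = two_stage_clustering(toy_profiles, toy_patterns, threshold=0.5)
```

Using a correlation-based distance in the second stage makes the assignment depend on the shape of the change over time rather than its magnitude, which is why the magnitude filter is applied first.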
Predicting protein functions by using non-negative matrix factorization with multi-networks co-regularization
by Wei Peng, Jielin Du, Lun Li, Wei Dai, Wei Lan
Abstract: In this work, we propose a novel non-negative matrix factorization-based method, namely PONMF-S, to learn protein and GO features from different biological networks for protein function prediction. PONMF-S decomposes the known GO-protein association matrix into a GO matrix and a protein matrix under the co-regularization of a protein-protein interaction (PPI) network and a GO similarity network. Additionally, we extend PONMF-S to other versions by considering the functional influence of the neighbors of proteins and of GO terms. We apply our methods and two state-of-the-art methods (UBiRW and NMFGO) to predict functions for proteins of S. cerevisiae and H. sapiens. The prediction results show that PONMF-S outperforms the other two existing methods when a part of the known function information is randomly removed. When predicting functions for proteins that have no previously known functional information, PONMF-S significantly improves on the prediction performance of NMFGO and is comparable with UBiRW.
Keywords: protein function prediction; regularized nonnegative matrix factorization; protein-protein interaction network; GO functional similarity.
A novel protein complex identifying method based on key protein (PCIM)
by Junmin Zhao, Jingpu Zhang, Yuanyuan Ma, Bin Yang
Abstract: This paper was inspired by the internal nucleate-appendage structure of protein complexes and the close relationships between key proteins and protein complexes. Based on these new findings, we design a novel protein complex identifying method from the angle of point (PCIM). The proposed algorithm includes three steps. First, proteins with high degree and high connection strength are selected as seeds. Then, a preliminary nucleus is produced around each seed protein according to connection strength. Finally, the protein complexes are generated by identifying accessorial proteins for each nucleus through second-order connection strength. We evaluated our algorithm with several common methods; it is superior to the other existing algorithms and identifies protein complexes more effectively. We also conducted a case analysis, in which a new member protein, YGR225W, was mined for complex-60, the anaphase-promoting complex, which confirmed the correctness of our prediction and further showed that our algorithm is effective.
Keywords: protein complex; key protein; protein complex mining.
Graph embedding and ensemble learning for predicting gene-disease associations
by Haorui Wang, Xiaochan Wang, Zhouxin Yu, Wen Zhang
Abstract: The discovery of gene-disease associations is important for preventing, diagnosing and treating diseases. The effective integration of diverse data is critical for developing high-accuracy prediction models. In this paper, we propose two heterogeneous network-based methods that enhance gene-disease association prediction by using graph embedding and ensemble learning, abbreviated as HNEEM and HNEEM-PLUS. We integrate gene-disease associations, gene-chemical associations, gene-gene associations and disease-chemical associations to construct a heterogeneous network, in which the nodes represent different entities and the edges represent associations between entities. We adopt six graph embedding methods to learn the representative vectors of genes and diseases from the network, and build an individual prediction model from each graph embedding representation with a random forest. Then we use the individual models as base predictors and combine them to construct the ensemble model HNEEM by average scoring. To increase the diversity of base predictors, we further introduce the multilayer perceptron as an additional classifier and generate more base predictors, which yields an extended method named HNEEM-PLUS. Through experiments on different datasets, we demonstrate that graph embedding produces satisfactory results in gene-disease association prediction, and that integrating different graph embedding methods produces better performance. In computational experiments, HNEEM produces better results than the state-of-the-art gene-disease prediction methods, and HNEEM-PLUS produces better results than HNEEM. In conclusion, HNEEM and HNEEM-PLUS are effective tools for predicting gene-disease associations.
Keywords: gene-disease association; heterogeneous network; graph embedding.
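The average scoring step that combines base predictors in the abstract above can be illustrated with a short sketch. It assumes each base predictor has already produced one score per candidate gene-disease pair, in the same pair order. The function name and the toy scores are hypothetical placeholders, and the embedding and classifier training steps are not shown.

```python
def average_scores(base_scores):
    """Combine base predictors by averaging their scores per pair.

    `base_scores` is a list with one score list per base predictor;
    all score lists must cover the same pairs in the same order.
    """
    if not base_scores:
        raise ValueError("need at least one base predictor")
    n_pairs = len(base_scores[0])
    if any(len(scores) != n_pairs for scores in base_scores):
        raise ValueError("base predictors scored different numbers of pairs")
    n_models = len(base_scores)
    # zip(*...) regroups the scores so each tuple holds one pair's scores.
    return [sum(pair_scores) / n_models for pair_scores in zip(*base_scores)]


# Three hypothetical base predictors scoring two gene-disease pairs.
toy_scores = [
    [0.9, 0.2],
    [0.7, 0.4],
    [0.8, 0.0],
]
ensemble = average_scores(toy_scores)
```

Under this reading, adding a second classifier per embedding, as the abstract describes for the extended method, would simply contribute more rows to `base_scores`.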
Systematic investigation of hyperparameters on performance of deep neural networks: application to ovarian cancer phenotypes
by Suhyun Hwangbo, Se Ik Kim, Untack Cho, Dae-Shik Suh, Yong-Sang Song, Taesung Park
Abstract: The application of deep neural networks (DNNs) to medicine has recently emerged as a major approach for prognosis. Many medical researchers have expected that the use of the DNN algorithms would provide higher prediction results for their analysis. However, while these applications are currently underway for medical imaging data, they are not yet optimized for clinicopathologic data, with two-dimensional input space. One such challenge is the difficulty of applying deep learning to optimize hyperparameters, i.e., learning DNN models for more accurate prediction results. In this study, we identified parameters having a greater impact on predictive power, by applying DNNs to clinicopathologic data. Specifically, we predicted therapeutic response to platinum-based chemotherapy, based on data from 710 epithelial ovarian cancer patients. Predictive performance was measured by area under the curve (AUC) after optimizing six hyperparameters, including the number of hidden layers, number of hidden units, optimization algorithm, weight initialization, activation function for hidden layers, and dropout rate. By identifying the significant main effects, and interaction effects, of these hyperparameters on clinical prediction, we successfully determined combinations of hyperparameters contributing to higher predictive power. These approaches have ramifications for assessing therapeutic response to numerous treatments for various pathologies.
Keywords: DNN; deep neural network; hyperparameter; predictive power.
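One way to study main effects of the six hyperparameters listed in the abstract above is a full factorial grid. The sketch below enumerates such a grid and averages a performance score per level of one hyperparameter. The candidate values, the function names and the dummy scoring function are assumptions made for illustration; the abstract does not state which values or which statistical analysis were used.

```python
from itertools import product

# Hypothetical candidate values for the six hyperparameters.
GRID = {
    "hidden_layers": [1, 2, 3],
    "hidden_units": [16, 64],
    "optimizer": ["sgd", "adam"],
    "weight_init": ["glorot_uniform", "he_normal"],
    "activation": ["relu", "tanh"],
    "dropout_rate": [0.0, 0.5],
}


def enumerate_configs(grid):
    """Return every combination of hyperparameter values as a dict."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def main_effect(results, name):
    """Mean score per level of one hyperparameter.

    `results` is a list of (config, score) pairs, e.g. score = AUC.
    """
    totals, counts = {}, {}
    for config, score in results:
        level = config[name]
        totals[level] = totals.get(level, 0.0) + score
        counts[level] = counts.get(level, 0) + 1
    return {level: totals[level] / counts[level] for level in totals}


def dummy_auc(config):
    # Stand-in for training a DNN and measuring AUC on held-out data.
    return 0.5 + (0.1 if config["optimizer"] == "adam" else 0.0)


configs = enumerate_configs(GRID)
results = [(config, dummy_auc(config)) for config in configs]
optimizer_effect = main_effect(results, "optimizer")
```

Because every level of each hyperparameter appears with every combination of the others, per-level means are balanced, which is what makes main effects and interaction effects separable in a factorial design.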
Sparse Superlayered Neural Network-based Multi-Omics Cancer Subtype Classification
by Prasoon Joshi, Seokho Jeong, Taesung Park
Abstract: Recently, targeted treatment of different subtypes of cancer has become of interest, underlining an increased need for an accurate understanding of the molecular differentiation of pathological subtypes. To that end, we present a new deep neural network subtype classification model, Sparse CRoss-modal Superlayered Neural Network (SCR-SNN), focusing on integrating high-dimensional RNA sequencing data with DNA methylation data. Our model consists of the following steps: (1) biomarker filtration; (2) biomarker selection, using a cross-modal, superlayered neural network with an L1 penalty; (3) integration of selected biomarkers from gene expression and DNA methylation data; and (4) prediction model building. For comparison, machine-learning methods, such as principal component analysis, penalized logistic regression, and artificial neural networks, were used, alone and in combination. In these analyses, SCR-SNN was applied to gene expression and methylation data of lung adenocarcinoma and squamous cell lung carcinoma from The Cancer Genome Atlas. The SCR-SNN model classified these lung cancer subtypes well, using only a small number of markers. A significant difference in epidermal development and cornification pathway activation levels between the two lung cancer subtypes was also found. This approach represents a promising methodology for disease and sub-disease categorization and diagnosis.
Keywords: classification; machine learning; multi-omics data; RNA-sequencing data.
Two-variate Phenotype-targeted Tests for Detecting Phenotypic Biomarkers in Cancers
by Jinxiong Lv, Shikui Tu, Lei Xu
Abstract: Detection of cancer-related phenotypic biomarkers is crucial for clinical research. The traditional pipeline consists of two stages: candidates are first selected as significantly differentially expressed between tumor-adjacent and tumor conditions, and are then filtered by phenotype-targeted tests (PT tests). Such a two-phase process has low detection power. In this paper, a two-variate PT test, which jointly considers tumor-adjacent data and tumor data, is adopted to strengthen the detection power. We conduct a systematic investigation of three implementations of two-variate PT tests for detecting phenotypic biomarkers in three types of cancers, and provide a practical guideline for the usage of two-variate PT tests. Experimental analysis indicates that the two-variate PT tests achieve stronger detection power than traditional methods. The tumor-adjacent data provide complementary information to the discriminant analysis, and Fisher discriminant analysis is the best implementation of the two-variate PT test for detecting phenotypic biomarkers in cancers.
Keywords: Two-variate Phenotype-Targeted Test; Phenotypic Biomarkers; Breast Cancer; Lung Cancer; Thyroid Cancer; Body Mass Index; Overall Survival Time; Pathologic Stage; Microarray Expression Data; RNA-seq Expression Data.
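The idea of treating tumor-adjacent and tumor expression jointly as a two-variate observation can be illustrated with plain two-class Fisher discriminant analysis. The sketch below assumes each sample is a pair (tumor-adjacent expression, tumor expression) for one gene and that the phenotype is binary. It computes the Fisher direction and the Fisher criterion value only. The significance testing used in the paper is not described in the abstract and is not reproduced here, and all names and toy values are hypothetical.

```python
def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter2(points, mean):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix about `mean`."""
    sxx = sum((p[0] - mean[0]) ** 2 for p in points)
    sxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points)
    syy = sum((p[1] - mean[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_discriminant(class0, class1):
    """Return (w, score) for two classes of 2-D points.

    w = Sw^{-1} (m1 - m0) is the Fisher direction and
    score = (m1 - m0)^T Sw^{-1} (m1 - m0) is the criterion value at w.
    """
    m0, m1 = mean2(class0), mean2(class1)
    s0, s1 = scatter2(class0, m0), scatter2(class1, m1)
    # Within-class scatter is the sum of the per-class scatter matrices.
    sxx, sxy, syy = s0[0] + s1[0], s0[1] + s1[1], s0[2] + s1[2]
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # Closed-form inverse of a symmetric 2x2 matrix applied to (dx, dy).
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    score = w[0] * dx + w[1] * dy
    return w, score


# Toy data: (tumor-adjacent, tumor) expression for two phenotype groups.
group0 = [(0, 0), (1, 0), (0, 1), (1, 1)]
group1 = [(2, 0), (3, 0), (2, 1), (3, 1)]
direction, criterion = fisher_discriminant(group0, group1)
```

In this toy example the two groups differ only along the first coordinate, so the direction has no component along the second. A larger criterion value indicates better separation of the two phenotype groups in the joint space.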
A Feature-Learning based method for the disease-gene prediction problem
by Lorenzo Madeddu, Giovanni Stilo, Paola Velardi
Abstract: We predict disease-gene relations on the human interactome network using a methodology that jointly learns functional and connectivity patterns surrounding proteins. Contrary to other data structures, the interactome is characterized by high incompleteness and the absence of explicit negative knowledge, which makes predictive tasks particularly challenging. To best exploit latent information in the network, we propose an extended version of random walks, named Random Watcher-Walker ($RW^2$), which is shown to perform better than other state-of-the-art algorithms. We also show that the performance of $RW^2$ and of the other compared state-of-the-art algorithms is extremely sensitive to the interactome used and to the adopted disease categorizations, since these influence the ability to capture regularities in the presence of sparsity and incompleteness.
Keywords: network medicine; disease gene prediction; disease gene prioritization; node embedding; random walks; graph-based methods; biological networks; complex diseases.
Identification of protein hot regions by combing structure-based classification, energy-based clustering and sequence-based conservation in evolution
by Jing Hu, Haomin Gan, Nansheng Chen, Xiaolong Zhang
Abstract: Revealing protein hot regions is key to understanding protein-protein interactions. Because experimental methods are time-consuming and labour-intensive, computational methods are very helpful for predicting hot regions more efficiently. Previous methods are typically based on a single aspect, such as structure, energy, or sequence, and each aspect has its limitations. In this paper, we propose a new method that combines structure-based classification, energy-based clustering and sequence-based conservation. This method makes full use of these three aspects of protein features and minimizes the limitations of using a single aspect. Experimental results show that the proposed method increases the prediction accuracy of protein hot regions.
Keywords: hot region; protein structure; energy clustering; sequence conservation; protein-protein interaction.