International Journal of Data Mining and Bioinformatics (11 papers in press)
MU Thyroid Nodule Electronic Database (MU-TNED), a multidisciplinary informatics approach to long-term thyroid nodule and thyroid cancer follow-up
by Terri Benskin, Iris Zachary, Magda Esebua, Uzma Khan
Abstract: Thyroid nodules are common findings, and thyroid cancer is projected to be one of the leading cancers in women. The electronic health record (EHR) contains the data needed to connect clinical research with patient outcomes. The objective of this project was to develop and validate a usable informatics tool that lets clinicians and researchers record, analyse, and manipulate clinical and research data to benefit all collaborators. The tool was specifically designed to enable longitudinal follow-up in support of multiple aspects of research. MU-TNED was designed by a multidisciplinary team from the departments of pathology and anatomical sciences, endocrinology, and health informatics to transfer identified and validated clinical information directly from the EHR into a research database, based on clinicians' and researchers' needs.
Keywords: clinical informatics; electronic data capture; electronic health record; health informatics; thyroid nodules; thyroid cancer.
Deep Learning Approaches in Electron Microscopy Imaging for Mitochondria Segmentation
by Ismail Oztel, Gozde Yolcu, İlker Ersoy, Tommi White, Filiz Bunyak
Abstract: Deep neural networks provide outstanding classification and detection accuracy in biomedical imaging applications. We present a study of mitochondria segmentation in electron microscopy (EM) images. Mitochondria play a significant role in the cell cycle by generating the needed energy, and show quantifiable morphological differences in diseases such as cancer, metabolic disorders, and neurodegeneration. EM imaging allows researchers to observe, at high resolution, the morphological changes in cells that occur as part of the disease process. Manual segmentation of mitochondria in large sequences of EM images is time-consuming and prone to subjective delineation. Thus, manual segmentation may not provide the accuracy needed to quantify morphological changes. We show that a convolutional neural network provides accurate mitochondria segmentation in the CA1 hippocampus area of the brain, imaged by a focused ion beam scanning electron microscope (FIBSEM). We provide a quantitative comparison of our results with other studies on the same data set and with other deep neural network approaches.
Keywords: Deep Learning; Convolutional Neural Networks; Image Segmentation; Electron Microscopy; Mitochondria.
Saccharomyces Cerevisiae Genotype Phenotype Mapping Through Leptokurtic PLS Loading Weights
by Gulshan Sharif, Tahir Mehmood
Abstract: In yeast genotype phenotype mapping, where a small set of influential genes is assumed to explain the variation in phenotypes, Partial Least Squares (PLS) has been used to select influential variables, i.e. genes. Modeling the PLS loading weights, which are an essential indicator for variable selection in PLS, with a probability distribution has shown success in variable selection. We revisit yeast genotype phenotype mapping, where the PLS loading weights appear to be leptokurtic. Hence, modeling the PLS loading weights with a leptokurtic distribution, i.e. the Laplace distribution, can improve the yeast mapping. We introduce Laplace-PLS, in which leptokurtic PLS loading weights are modeled for influential gene selection. Genotype phenotype mapping through Laplace-PLS is compared with PLS, soft-threshold PLS (Soft-PLS), uninformative variable elimination in PLS (UVE-PLS) and distribution-based truncation in PLS (Trunc-PLS). Monte Carlo simulation was used for parameter estimation and performance assessment. The PLS methods are evaluated by the root mean square error (RMSE) of prediction, the number of influential genes and the selectivity index. The results indicate that Laplace-PLS yields the lowest RMSE with a smaller number of influential genes and a higher consistency level. The genotype phenotype mapping is explained through background information such as the existence of premature stop codons, copy number variations, frame shift mutations, etc.
Keywords: Partial least squares; variable selection; genomics; genotype phenotype mapping; leptokurtic distribution.
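As a rough illustration of what "leptokurtic" loading weights means in the abstract above, the sketch below computes the sample excess kurtosis of a weight vector and fits a Laplace distribution by maximum likelihood (location = median, scale = mean absolute deviation from the median). The function names and the toy weight vector are hypothetical; this is not the authors' Laplace-PLS procedure, whose selection rule is not described in the abstract.

```python
from statistics import mean, median


def excess_kurtosis(values):
    """Sample excess kurtosis: m4 / m2**2 - 3 (positive means leptokurtic)."""
    values = list(values)
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m4 = mean([(v - m) ** 4 for v in values])
    if m2 == 0:
        raise ValueError("all values are identical; kurtosis is undefined")
    return m4 / (m2 ** 2) - 3.0


def laplace_mle(values):
    """Maximum-likelihood Laplace fit: (location, scale)."""
    values = list(values)
    loc = median(values)
    scale = mean([abs(v - loc) for v in values])
    return loc, scale


# Toy loading weights (hypothetical): most are zero, two are large.
weights = [0.0] * 8 + [-5.0, 5.0]
print(excess_kurtosis(weights))
print(laplace_mle(weights))
```

A positive excess kurtosis (heavier tails than a normal distribution) is the kind of evidence that would motivate a Laplace model over a Gaussian one for the weights.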
A framework for identifying functional modules in dynamic networks
by XiWei Tang, Xueyong Li, Sai Hu, Bihai Zhao
Abstract: Detecting functional modules in Protein-Protein Interaction (PPI) networks is essential to understanding gene function, biological pathways and cellular organization. Most methods predict functional modules from static PPI networks. However, cellular systems are highly dynamic and regulated by biological networks. Considering the dynamics inherent in these networks, we build time-course PPI networks from gene expression profiles. We then propose a novel framework for identifying functional modules with a core-attachment structure in the dynamic PPI networks. Our algorithm generates the cores by mining co-expression neighborhood graphs whose aggregation degree exceeds a threshold, and expands them to form functional modules. The method is compared with competing algorithms on two different yeast PPI networks. The results show that the proposed framework outperforms state-of-the-art methods.
Keywords: functional module; dynamic network; core-attachment; protein-protein interaction.
Modified Evolutionary Model with Insertion and Deletion (Indel) for Phylogenetic Tree Construction
by Asim Mahadani, Goutam Sanyal, Pradosh Mahadani, Partha Bhattacharjee
Abstract: Biological relationships among species are usually estimated using evolutionary models. Most existing evolutionary models ignore insertion and deletion (Indel) events, which reduces computational difficulty but loses valuable phylogenetic information. In this study, we developed a modified Jukes-Cantor (M J-C) model to include Indel information in phylogenetic analysis. The utility of the M J-C model was tested with existing phylogenetic methods (UPGMA and NJ). Twenty-four Indel-rich chloroplast DNA sequences (trnH-psbA) of different Citrus species were used as the dataset for phylogenetic tree construction. Compared with the J-C model, the proposed M J-C model, which considers Indel information, shows significant improvement in branch support (bootstrap) values in the phylogenetic trees. The new M J-C model may accelerate phylogenetic studies of Indel-rich DNA sequences. Indels are important for phylogenetic tree reconstruction, especially when a low level of divergence is found in the studied taxa.
Keywords: Indel; Gap; J-C Model; Evolutionary Model; Genetic distance; Phylogenetic.
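For context on the baseline that the abstract above modifies, here is a minimal sketch of the classic Jukes-Cantor distance, d = -(3/4) ln(1 - (4/3) p), where p is the proportion of differing sites between two aligned sequences. The function names are hypothetical, and gap columns are simply skipped, which is the behaviour the abstract criticises. The modified (Indel-aware) model itself is not specified in the abstract and is not reproduced here.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites, skipping any column that contains a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # classic J-C discards Indel columns
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return mismatches / compared


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Classic Jukes-Cantor distance between two aligned DNA sequences."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("p-distance >= 0.75; J-C distance is undefined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# One substitution in eight gap-free sites.
print(jukes_cantor_distance("ACGTACGT", "ACGTACGA"))
```

Note that two sequences differing only by a gap get a distance of zero under this baseline, which shows how Indel information is lost when gap columns are discarded.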
A performance evaluation of NoSQL databases to manage proteomics data
by Chaimaa Messaoudi, Rachida Fissoune, Hassan Badir
Abstract: NoSQL databases have recently been introduced as alternatives to traditional relational database management systems because of their capabilities in data storage and query retrieval and their support for various data models, such as graph-oriented databases. Biological datasets can be modeled in various ways, for example, as graphs (protein-protein interaction) or documents (protein sequence information). Applications that involve these two data models can be combined into a single architecture using either the polyglot persistence approach or a multi-model approach. This study evaluates the relative advantages of these approaches when applied to biomedical datasets such as proteomics, comparing the performance of a polyglot persistence approach with that of a multi-model data store. The polyglot persistence approach combines a graph-oriented database (Neo4j) and a document-oriented database (MongoDB); the multi-model system is OrientDB. The comparisons cover the following aspects: importation, single operations (INSERT, UPDATE, DELETE, READ) and query performance. Five proteomics datasets were used in this study: Cancer, Chromatin, Parkinson, Alzheimer and Diabetes, downloaded from the IntAct and UniProt databases. OrientDB demonstrates potential for managing large proteomics datasets for query retrieval and graph importation. However, OrientDB was found to be slow when updating records. No single store performs better in all cases. The depth of graph traversal in queries, the number of fields in the document-oriented database and the size of the graph influence the performance of the NoSQL databases.
Keywords: Proteomics; MongoDB; Multi-model; Neo4j; OrientDB; Polyglot Persistence.
Protein family structure signature for multidomain proteins
by Jun Tan, Donald Adjeroh
Abstract: The rapid increase in available protein structure datasets requires new techniques for fast yet effective analysis of protein 3D structures. In this work, we propose a structure-based signature for protein families, suitable for rapid analysis of multidomain protein structures. Our method is alignment-free, using protein strings as the basic representation. A key novelty is the two-stage approach, whereby an initial list of candidate protein superfamilies is rapidly identified using the protein family signature, and information retrieval methods are then applied only to the members of the candidate superfamilies. This approach is the key to both improved speed and improved structure retrieval accuracy. Experimental results, including comparisons with state-of-the-art methods, demonstrate the performance of the proposed protein family signature on queries with multidomain protein structures.
Keywords: protein structure; protein structure signature; retrieval; classification; alignment-free; structure analysis.
CPredictor 4.0: effectively detecting protein complexes in weighted dynamic PPI networks
by Yunjia Shi, Heng Yao, Jihong Guan, Shuigeng Zhou
Abstract: The identification of protein complexes is important for understanding the mechanisms of cellular processes. To date, many methods have been developed for detecting protein complexes in static PPI networks. However, static PPI networks cannot accurately describe the behaviours of proteins at different stages of the cell life cycle. In this paper, we combine different data sets, including gene expression data, GO terms and high-throughput PPI data, to reconstruct weighted dynamic PPI networks, and propose a new method, CPredictor4.0, that operates on them. Specifically, we first calculate protein active probability and protein functional similarity to construct weighted dynamic PPI networks, then define a high-order topological overlap measure of similarity to extract protein complexes based on the core-attachment model. In our experiments, four PPI datasets are used to detect protein complexes. Experimental results indicate that CPredictor4.0 is superior overall to existing methods.
Keywords: protein to protein interactions; protein complexes; protein active probability; functional similarity.
A comparative review of recent bioinformatics tools for inferring gene regulatory networks using time-series expression data
by Kevin Byron, Jason T.L. Wang
Abstract: The Gene Regulatory Network (GRN) inference problem in computational biology is challenging. Many algorithmic and statistical approaches have been developed to computationally reverse engineer biological systems. However, there are no known bioinformatics tools capable of performing perfect GRN inference. Here, we review and compare seven recent bioinformatics tools for inferring GRNs from time-series gene expression data. Standard performance metrics for these seven tools based on both simulated and experimental data sets are generally low, suggesting that further efforts are needed to develop more reliable network inference tools.
Keywords: DREAM; dialogue for reverse engineering assessments and methods; ESCAPE; embryonic stem cell atlas from pluripotency evidence; GRN; gene regulatory network; reverse engineering; time-series.
A Bayesian approach for genotyping single nucleotide polymorphisms (SNPs)
by Ali Sheikhi, David Ramsey
Abstract: Genotyping is the process of determining the genetic make-up (genotype) of an individual by examining the individual's DNA sequence. It reveals the alleles an individual has inherited from their parents. Single Nucleotide Polymorphism (SNP) genotyping first involves identifying SNPs that are a common source of genetic variation. Genotyping of SNPs has become extremely important to researchers working to understand and treat disease. Most SNPs are biallelic; thus the genotype is one of the three possible combinations of the two alleles. In this study, we present a Bayesian approach for genotyping the detected SNPs. We also present the results of genotyping real genome sequence data.
Keywords: Bayesian; genotype; SNP; single nucleotide polymorphism.
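To make the three-genotype setting in the abstract above concrete, here is a minimal sketch of a textbook Bayesian genotype call at a biallelic site. It assumes a binomial read-count likelihood, a fixed per-read error rate, and a prior of (0.25, 0.5, 0.25) over AA, AB, BB. All of these choices, and the function name, are illustrative assumptions and not the model proposed in the paper.

```python
from math import comb


def genotype_posterior(ref_reads, alt_reads, error_rate=0.01,
                       prior=(0.25, 0.5, 0.25)):
    """Posterior over genotypes AA, AB, BB given reference/alternate read counts.

    A is the reference allele and B the alternate allele. The likelihood is
    binomial in the number of alternate reads (an assumed, generic model).
    """
    n = ref_reads + alt_reads
    # Probability that a single read shows the alternate allele, per genotype.
    alt_prob = {"AA": error_rate, "AB": 0.5, "BB": 1.0 - error_rate}
    unnormalised = {}
    for genotype, prior_prob in zip(("AA", "AB", "BB"), prior):
        p = alt_prob[genotype]
        likelihood = comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        unnormalised[genotype] = prior_prob * likelihood
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}


# Ten reference and ten alternate reads favour the heterozygous genotype.
posterior = genotype_posterior(ref_reads=10, alt_reads=10)
print(max(posterior, key=posterior.get))
```

The genotype with the highest posterior probability is reported as the call; the posterior value itself can serve as a confidence score.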
DiffGRN: differential gene regulatory network analysis
by Youngsoon Kim, Jie Hao, Yadu Gautam, Tesfaye B. Mersha, Mingon Kang
Abstract: Identifying differential gene regulators that change significantly under disparate conditions is essential to understanding the complex biological mechanisms of a disease. Differential Network Analysis (DiNA) examines different biological processes based on gene regulatory networks, which represent regulatory interactions between genes with a graph model. While most DiNA studies have used correlation-based inference to construct gene regulatory networks from gene expression data, owing to its intuitive representation and simple implementation, this approach cannot represent causal effects or multivariate effects between genes. In this paper, we propose an approach named Differential Gene Regulatory Network (DiffGRN) that infers differential gene regulation between two groups. We infer the gene regulatory networks of the two groups using Random LASSO, and then identify differential gene regulations with the proposed significance test. The advantages of DiffGRN are that it captures multivariate effects of genes that regulate a gene simultaneously, identifies causality of gene regulations, and discovers differential gene regulators between regression-based gene regulatory networks. We assessed DiffGRN in simulation experiments and showed that it outperforms the current state-of-the-art correlation-based method, DINGO. DiffGRN was then applied to gene expression data in asthma. The DiNA with asthma data revealed a number of gene regulations, such as ADAM12 and RELB, that are reported in the biological literature.
Keywords: DiNA; differential network analysis; gene regulatory network.