International Journal of Data Mining and Bioinformatics (11 papers in press)
Predicting Microbial Interactions from Time Series Data with Network Information
by Yan Wang, Mingzhi Mao, Xingpeng Jiang, Fang Li, Wenping Deng, Shaowu Shen
Abstract: The evolution of biotechnological knowledge has made it possible to extract large amounts of data, e.g. high-throughput metagenomics or 16S-rRNA sequencing data, from patients suffering from all kinds of diseases. The nature of these data poses new challenges for the study of microbial interactions, which play important roles in the structure and function of complex microbial communities. Co-occurrence patterns of microbial species among multiple samples are often utilized to infer interactions. The vector autoregressive (VAR) model has proved to be an efficient approach to inferring dynamic interactions in biological systems. However, microbial data are high-dimensional, which means that the number of covariates is much larger than the number of observations: the number of covariates easily reaches a few thousand, whereas the number of observations is much lower. Reducing the dimension of the data or selecting suitable covariates has become a critical component of VAR modeling. Many network-regularized VAR models with different penalties or regularization terms are used to predict the temporal interaction patterns among microbial species. In this paper, we develop a graph-regularized vector autoregressive model incorporating network information to infer causal relationships among microbial entities. This method not only considers the signs of the network connections between any two covariates, but also constructs the network weight matrix from microbial topology information. The coordinate descent algorithm for estimating model parameters improves the accuracy of prediction. The experimental results on a time series data set of human gut microbiomes indicate that the proposed approach has better performance than other VAR-based models with penalty functions.
Keywords: Microbiome; Microbial interactions; Vector autoregression model; Laplacian regularization; Coordinate descent; Penalty function.
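As a rough illustration of the ideas named in this abstract (a first-order VAR model, a graph Laplacian penalty and coordinate descent), the following minimal sketch fits each row of a VAR(1) coefficient matrix by minimising 0.5*||y - Xb||^2 + lam1*||b||_1 + 0.5*lam2*b'Lb. It is a generic network-constrained lasso under stated assumptions, not the authors' model: the function names, the signed Laplacian L = D - A, and the parameters lam1 and lam2 are all hypothetical choices made for this example.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator that handles the L1 part of the penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def signed_laplacian(adj):
    """L = D - A with D_ii = sum_k |A_ik| (assumed form; negative edges allowed)."""
    p = len(adj)
    lap = [[-adj[i][j] for j in range(p)] for i in range(p)]
    for i in range(p):
        lap[i][i] = sum(abs(adj[i][k]) for k in range(p) if k != i)
    return lap


def fit_row(X, y, L, lam1, lam2, n_iter=200):
    """Coordinate descent for 0.5*||y - Xb||^2 + lam1*||b||_1 + 0.5*lam2*b'Lb."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that excludes column j.
            rho = 0.0
            for i in range(n):
                others = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
            # The Laplacian term couples b[j] to its network neighbours.
            rho -= lam2 * sum(L[j][k] * b[k] for k in range(p) if k != j)
            denom = sum(X[i][j] ** 2 for i in range(n)) + lam2 * L[j][j]
            b[j] = soft_threshold(rho, lam1) / denom if denom > 0 else 0.0
    return b


def fit_var1(series, adj, lam1, lam2):
    """Fit x[t+1] ~ A x[t]; A[i][j] is the estimated effect of taxon j on taxon i."""
    X = series[:-1]
    L = signed_laplacian(adj)
    p = len(series[0])
    return [fit_row(X, [row[i] for row in series[1:]], L, lam1, lam2)
            for i in range(p)]
```

Calling `fit_var1(series, adj, lam1, lam2)` with `series` as a list of abundance vectors (one per time point) and `adj` as a prior interaction network returns one coefficient row per taxon. Larger `lam1` drives more coefficients to exactly zero, while larger `lam2` pulls the coefficients of connected taxa toward each other.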
Fuzzy-Soft-Fuzzy Set Model For Mining Amino Acid Associations In Peptide Sequences Of Mycobacterium Tuberculosis Complex (MTBC).
by Amita Jain, Kamal Raj Pardasani
Abstract: Massive volumes of molecular data are available in online databases. These data provide new opportunities and challenges for developing models for analysis to extract useful information and knowledge. The different types of inherent uncertainty present in these molecular data pose a major challenge for their analysis. These uncertainties arise due to intentional or unintentional ignorance of relationships among the various fields and parameters of the data. The existing algorithms for association rule mining are not fully capable of dealing with the inherent uncertainties present in molecular data. In the present paper a fuzzy soft fuzzy approach is proposed for mining amino acid associations in molecular sequences of MTBC. The data, consisting of peptide sequences of the Mycobacterium tuberculosis complex, are transformed into a fuzzy transactional dataset using the fuzzy set to incorporate the degree of relationships among amino acids present in the peptide sequences. The soft set is employed to incorporate the degree of relationships of the parameter with the peptide sequences and to transform fuzzy transactions into soft fuzzy transactions. The fuzzy set is again employed to incorporate the relationships of the length of the peptide sequences with the parameter length range of the sequences. Thus the soft fuzzy transactions are transformed into fuzzy soft fuzzy transactions. Association rule mining is performed to generate fuzzy soft fuzzy amino acid association patterns in the peptide sequences of MTBC. The results generated from the proposed approach are compared with the existing approaches for association rule mining on the same dataset. It is observed that the proposed approach prunes the spurious patterns obtained by the existing approaches and recovers the patterns they miss. Thus it is concluded that the proposed approach addresses the issues of uncertainties by incorporating the various hidden relationships in data which have been ignored by the existing approaches.
Keywords: Data mining; Association rule; support; confidence; fuzzy set; soft set.
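The support and confidence measures listed in the keywords can be illustrated for plain fuzzy transactions with the short sketch below. It uses the common minimum t-norm to combine membership degrees and covers only the basic fuzzy step, not the soft-set or second fuzzy-set layers described in the abstract. The function names and the membership values are hypothetical.

```python
def fuzzy_support(transactions, itemset):
    """Mean over transactions of the smallest membership degree in the itemset."""
    total = 0.0
    for t in transactions:
        # An item missing from a transaction has membership 0.
        total += min(t.get(item, 0.0) for item in itemset)
    return total / len(transactions)


def fuzzy_confidence(transactions, antecedent, consequent):
    """Confidence of antecedent -> consequent as supp(A and C) / supp(A)."""
    supp_a = fuzzy_support(transactions, antecedent)
    if supp_a == 0.0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return fuzzy_support(transactions, both) / supp_a


# Hypothetical fuzzy transactions: amino acid -> membership degree per peptide.
peptides = [
    {"A": 1.0, "L": 0.5},
    {"A": 0.5, "L": 0.5},
    {"A": 0.0, "L": 1.0},
]
print(fuzzy_support(peptides, {"A", "L"}))
print(fuzzy_confidence(peptides, {"A"}, {"L"}))
```

A rule is usually kept only when both values exceed user-chosen thresholds, which is where spurious patterns are pruned.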
Iterative segmented least square method for functional microRNA-mRNA module discovery in breast cancer
by Sungmin Rhee, Sangsoo Lim, Sun Kim
Abstract: MicroRNAs (miRNAs) have significant biological roles at the molecular level by regulating genes post-transcriptionally. To understand the functional effects of miRNAs in different biological contexts, it is essential to elucidate miRNA-mRNA regulatory modules (MRMs). The computational complexity of inferring MRMs is very high due to the many-to-many relationships between miRNAs and mRNAs, and inferring MRMs is still a challenging unresolved problem. In this paper, we propose a novel iterative segmented least square method for functional MRM discovery. Our method operates in two steps: 1) grouping and ordering the miRNAs and mRNAs to build per-sample matrices representing miRNA-mRNA regulations, and 2) determining maximum-sized modules from structured miRNA-mRNA matrices. In experiments with human breast cancer data sets from TCGA, we show that our method outperforms existing methods in terms of both GO similarity and cluster evaluation. In addition, we show that modules determined by our method can be used for breast cancer survival prediction and subtype classification.
Keywords: microRNA; Regulatory network inference; Optimization; Dynamic programming.
Artificial Neural Network Classification of Microarray Data Using New Hybrid Gene Selection Method
by Rabia Aziz, C.K. Verma, Manoj Jha, Namita Srivastava
Abstract: This paper proposes a new combined feature selection/extraction approach for artificial neural network (ANN) classification of high-dimensional microarray data, which uses independent component analysis (ICA) as an extraction technique and artificial bee colony (ABC) as an optimization technique. The study evaluates the performance of the proposed ICA+ABC algorithm by conducting extensive experiments on five binary and one multiclass gene expression microarray datasets and compares the proposed algorithm with ICA and ABC. The proposed method shows superior performance as it achieves the highest classification accuracy along with the lowest average number of selected genes. Furthermore, the present work compares the proposed ICA+ABC algorithm with popular filter techniques and with other similar bio-inspired algorithms combined with ICA. The experimental results show that the proposed algorithm gives a more accurate classification rate for the ANN classifier. Therefore, ICA+ABC is a promising approach for solving gene selection and cancer classification problems using microarray data.
Keywords: DNA microarrays; Artificial bee colony (ABC); Independent component analysis (ICA); Artificial neural networks (ANN); Cancer Classification.
Implementing Computational Biology Pipelines using VisFlow
by Xin Mou, Hasan Jamil, Robert Rinker
Abstract: Data integration continues to baffle researchers even though substantial progress has been made. Although the emergence of technologies such as XML, web services, semantic web and cloud computing has helped, a system in which biologists are comfortable articulating new applications and developing them without technical assistance from a computing expert is yet to be realized. The distance between a friendly graphical interface that does little and a traditional system that is clunky yet powerful is deemed too great more often than not. The question that remains unanswered is whether a user can state her query involving a set of complex, heterogeneous and distributed life sciences resources in an easy to use language and execute it without further help from a computer savvy programmer. In this paper, we present a declarative meta-language, called VisFlow, for requirement specification, and a translator for mapping requirements into executable queries in a variant of SQL augmented with integration artifacts.
Keywords: Graphical user interface languages; Query languages for nonrelational engines; Workflow and data management systems.
A Novel Point Density Based Validity Index for Clustering Gene Expression Datasets
by M. Arif Wani, Romana Riyaz
Abstract: Elucidating the patterns hidden in gene expression data offers an opportunity for identifying co-expressed genes and biologically relevant grouping of genes. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the microarray data. A first step toward addressing this challenge is the use of clustering techniques. Validation of results obtained from a clustering algorithm is an important part of the clustering process. In this paper, we propose a new cluster validity index (ARPoints index) for the purpose of cluster validation. A new approach to determine the compactness measure and distinctness measure of clusters is presented. We revisit commonly known indices and conduct a thorough comparison of these indices with the proposed index and provide a summary of performance evaluation of different indices. Experimental results show that the proposed index performs better than the commonly known cluster validity indices.
Keywords: Clustering; Cluster validation; Compactness Measure of Clusters; Distinctness Measure of Clusters; Clustering Gene Data; Gene Expression Analysis.
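The abstract does not define the ARPoints index, so the sketch below instead shows one of the commonly known validity indices of the kind such proposals are compared against: the Dunn index, which divides the smallest distance between points of different clusters (distinctness) by the largest within-cluster diameter (compactness). The function name and the input layout (a list of clusters, each a list of coordinate tuples) are assumptions for this example.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Dunn index: minimum between-cluster distance / maximum cluster diameter."""
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    # Compactness: the largest distance between two points of the same cluster.
    max_diameter = max(
        (math.dist(a, b) for c in clusters for a, b in combinations(c, 2)),
        default=0.0,
    )
    # Distinctness: the smallest distance between points of different clusters.
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    if max_diameter == 0.0:
        raise ValueError("all clusters are single points; index is undefined")
    return min_separation / max_diameter
```

Higher values indicate compact, well-separated clusters, so the index can be used to compare candidate numbers of clusters on the same expression data.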
Prediction of DNA-binding Residues from sequence information Using Convolutional Neural Network
by Jiyun Zhou, Qin Lu, Ruifeng Xu, Lin Gui, Hongpeng Wang
Abstract: Most DNA-binding residue prediction methods overlook the motif features which are important for the recognition between protein and DNA. In order to use the motif features efficiently for prediction, we first propose to use a Convolutional Neural Network (CNN) to extract discriminant motif features. We then propose a neural network classifier, referred to as CNNsite, which combines the extracted motif features, sequence features and evolutionary features. The evaluation on PDNA-62, PDNA-224 and TR-265 shows that motif features perform better than sequence features and evolutionary features. The evaluation on PDNA-62, PDNA-224 and an independent dataset shows that CNNsite also outperforms the previous methods. We also show that many motif features composed of residues which play important roles in DNA-protein interactions have large discriminant power. This indicates that CNNsite has a very good ability to extract important motif features for DNA-binding residue prediction.
Keywords: DNA; Protein; Interaction; Residue; CNN; motif; sequence; PSSM; evolutionary; binding; neural network.
An Aggregation Method for Sparse Logistic Regression
by Zhe Liu
Abstract: L_1 regularized logistic regression has now become a workhorse of data mining and bioinformatics: it is widely used for many classification problems, particularly ones with many features. However, L_1 regularization typically selects too many features, so that so-called false positives are unavoidable. In this paper, we demonstrate and analyze an aggregation method for sparse logistic regression in high dimensions. This approach linearly combines the estimators from a suitable set of logistic models with different underlying sparsity patterns and can balance predictive ability and model interpretability. The numerical performance of our proposed aggregation method is then investigated using simulation studies. We also analyze a published genome-wide case-control dataset to further evaluate the usefulness of the aggregation method in multilocus association mapping.
Keywords: logistic regression; aggregation; sparse model; sample-splitting; Markov chain Monte Carlo method; genome-wide association study.
Concod: An effective Integration Framework of Consensus-based Calling Deletions from Next-generation Sequencing Data
by Lei Cai, Chong Chu, Xiaodong Zhang, Yufeng Wu, Jingyang Gao
Abstract: Detection of structural variations such as deletions with short sequence reads from next-generation sequencing is a significant but challenging problem in the field of genome analysis. This paper proposes a conceptual framework to improve the calling of deletions. Although a great many tools have been produced to date, no single method clearly outperforms all other methods. At present, a widely used way of detecting deletions is merging, which combines all the features to achieve more accurate deletion calling. However, most existing methods using the combining approach are heuristic, and the deletions called by these tools still include many false calls. In this paper, we introduce Concod, an effective integration framework using machine learning to detect deletions. First, Concod collects the candidate deletions from multiple existing deletion detection tools. Then, based on the multiple detection theories, the features of the candidates are extracted from the sequence. Last, according to these features, a machine learning model is trained to distinguish the true candidates from the false ones. We test our framework on real data at different coverages and make a comparison with other existing tools, including Pindel, SVseq2, BreakDancer and DELLY. Results show that Concod improves both precision and sensitivity of deletion detection significantly.
Keywords: structural variations; deletion detection; machine learning; feature extraction; next-generation sequencing.
A novel method to measure the semantic similarity of HPO terms
by Jiajie Peng, Hansheng Xue, Yukai Shao, Xuequn Shang, Yadong Wang, Jin Chen
Abstract: It is critical, yet remains challenging, to make a precise disease diagnosis from complex clinical features and a highly heterogeneous genetic background. Recently, phenotype similarity has been effectively applied to model patient phenotype data. However, the existing measurements are adapted from Gene Ontology-based term similarity models, which are not optimized for human phenotype ontologies. We propose a new similarity measure called PhenoSim. Our model includes a noise reduction component to model the noisy patient phenotype data, and a path-constrained Information Content-based method for phenotype semantic similarity measurement. Evaluation tests compared PhenoSim with four existing approaches. The results showed that PhenoSim could effectively improve the performance of HPO-based phenotype similarity measurement, thus increasing the accuracy of phenotype-based causative gene prediction and disease prediction.
Keywords: Human Phenotype Ontology; Semantic Similarity; Phenotype Similarity; Noise Reduction; Causative Gene Prediction; Disease Prediction.
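To make the notion of Information Content-based term similarity concrete, the sketch below computes the classic Resnik similarity on a toy ontology: the information content (IC) of a term is the negative log of its annotation frequency, and two terms are as similar as their most informative common ancestor. This is the generic baseline idea only. It does not include the path constraint or the noise reduction component of PhenoSim, and the ontology, term names and annotation counts are hypothetical.

```python
import math


def ancestors(parents, term):
    """Return the term together with all of its ancestors in the ontology DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(parents, counts, root):
    """IC(t) = -log(p(t)), where p(t) counts annotations to t or its descendants."""
    cumulative = {t: 0 for t in parents}
    for term, n in counts.items():
        for anc in ancestors(parents, term):
            cumulative[anc] += n
    return {t: -math.log(cumulative[t] / cumulative[root])
            for t in parents if cumulative[t] > 0}


def resnik_similarity(parents, ic, a, b):
    """IC of the most informative common ancestor of terms a and b."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return max(ic[t] for t in common if t in ic)


# Hypothetical toy ontology (child -> parents) and direct annotation counts.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
COUNTS = {"A": 1, "A1": 2, "A2": 1, "B": 4}
IC = information_content(PARENTS, COUNTS, "root")
print(resnik_similarity(PARENTS, IC, "A1", "A2"))
```

Terms that share only the root score zero, because the root is annotated (directly or through descendants) in every case and therefore carries no information.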
Molecular pathway identification using a new L1/2 solver and biological network-constrained model
by Hai-Hui Huang, Yong Liang, Xiao-Ying Liu, Hui-Min Li
Abstract: Molecular research is moving toward the big data era. There are various large-scale popular databases abstracted from different biological processes. Integrating such valuable information with statistical models may shed light on how human cells work from a system-level perspective. In this article, we propose a novel penalized network-constrained regression model with a new L1/2 solver for integrating gene regulatory networks into an analysis of gene expression data, where the network is graph Laplacian regularized. Extensive simulation studies showed that our proposed approach outperforms L1 regularization and the old L1/2 regularization in prediction accuracy and predictive stability. We also apply our method to three kinds of cancer datasets. In particular, our method achieves comparable or higher predictive accuracy than the old L1/2 solver and L1 regularization approaches, while selecting fewer but informative genes and pathways.
Keywords: big data; network analysis; variable selection; regularization; L1/2 penalty.