| Forthcoming Papers: International Journal of Data Mining and Bioinformatics (IJDMB) This page lists papers submitted to IJDMB via the web that have been reviewed and accepted but not yet published. Please note that titles, authors, abstracts and keywords may change upon publication. | International Journal of Data Mining and Bioinformatics (24 papers in press)
- Significance Analysis and Improved Discovery of Disease Specific Differentially Co-expressed Gene Sets in Microarray Data
by Haixia Li, R. Krishna Murthy Karuturi Abstract: Disease specific deregulated pathways of genes may be identified by differential co-expression analysis of gene expression as opposed to differential expression that signifies change of the activity levels of the pathways. Kostka and Spang proposed a statistic (the KS-statistic) and an algorithm (the KS algorithm) to elicit differentially co-expressed gene sets by minimizing the KS-statistic. We show that the statistical distributions of the KS-statistic under the null hypothesis in un-normalized and unit-variance normalized data settings are central and doubly non-central F-distributions respectively. The null distributions facilitate the estimation of the statistical significance, or p-value, of differential co-expression of a gene set. We propose two alternative but equivalent statistics whose null distributions are easier to evaluate than the doubly non-central F-distribution. We propose to objectively set the search parameters for the algorithm by maximizing the statistical significance of the resultant gene set. In addition, we propose the FNs-KS algorithm for improved discovery of differentially co-expressed gene sets; we show that FNs-KS substantially outperforms KS on both simulated and real data. Keywords: Gene expression; Microarray analysis; Differential co-expression; Friendly Neighbors algorithm; Statistical significance; Disease-specific deregulated pathways - Identifying the Overlapping Complexes in Protein Interaction Networks
by Min Li, Jianxin Wang, Jianer Chen, Zhao Cai, Gang Chen Abstract: Identification of protein complexes in large interaction networks is crucial to understand principles of cellular organization and predict protein functions. In this paper, we present a new algorithm for identifying protein complexes based on maximal clique extension (IPC-MCE). The maximal clique is considered as the core of the protein complex. Proteins in a complex are classed into core vertices and peripheral vertices. The relation between the core vertices and peripheral vertices is measured by the interaction probability. We apply the IPC-MCE algorithm to the protein interaction network of Saccharomyces cerevisiae collected from MIPS. Many well known protein complexes are detected. Our algorithm has a recall of more than 80% and a precision of more than 20%. The harmonic mean of precision and recall (f-measure) is more than 34%. By identifying the overlapping protein complexes, we can assign a main function to the unknown proteins or assign a new function category to the known proteins. Keywords: Protein interaction network; protein complexes; graph; dense subgraph; maximal clique - Evaluation of BIC and cross validation for model selection on sequence segmentations
by Niina Haiminen, Heikki Mannila Abstract: Segmentation is a general data mining technique for summarizing and analyzing sequential data. For example, in bioinformatics segmentations can be used to study large-scale genomic structures such as isochores. One of the key questions in segmentation is choosing the number of segments to use. We present extensive experimental studies on standard model selection techniques Bayesian Information Criterion (BIC) and cross validation (CV). The results show that these methods often find the correct number of piecewise constant segments on generated real-valued, binary, and categorical sequences. Also segments having the same means but different variances can be identified. Furthermore, we demonstrate the effect of linear trends and outliers on the results; both phenomena are frequent in real data. The results indicate that BIC is fairly sensitive to outliers, and that CV in general is more robust. Intuitive segmentation results are given for real DNA sequences with respect to changes in their codon, G+C, and bigram frequencies, as well as copy-number variation from CGH data. Keywords: segmentation; model selection; cross validation; BIC; sequence; binary; multinomial; genome; likelihood - A Weighted Local Least Squares Imputation Method for Missing Value Estimation in Microarray Gene Expression Data
by Wai-Ki Ching, Limin Li, Nam-Kiu Tsing, Ching-Wan Tai, Tuen-Wai Ng, Alice S Wong, Kwai-Wa Cheng Abstract: Many clustering techniques and classification methods for
analyzing microarray data require a complete dataset. However,
very often gene expression datasets contain missing values due to
various reasons. Therefore the treatment of missing values is an
important step in the preprocessing of the data. A number of
methods such as the Row Average (RA) method,
K-nearest neighbors Imputation (KNNimpute) algorithm
and Singular Value Decomposition Imputation (SVDimpute)
algorithm have been proposed to estimate the missing
values. Recently, Kim et al. proposed a Local Least Squares
Imputation (LLSI) method. The contribution of this paper is
twofold. We first propose to use vector angle as a measurement for
the similarity between genes. Numerical experiments indicate that
vector angle is more effective. We then propose a Weighted Local
Least Squares Imputation (WLLSI) method for missing value
estimation. WLLSI allows training on the weighting parameter and
can take advantage of the LLSI method and the RA method. Numerical
results on both synthetic data and real microarray data indicate
that the WLLSI method is more robust. The imputation methods are then
applied to a breast cancer dataset and interesting results are
obtained. Keywords: gene expression data analysis; missing value imputation;
vector angle - LIBGS: A MATLAB Software Package For Gene Selection
by Yi Zhang, Dingding Wang, Tao Li Abstract: Many gene selection algorithms have been applied in gene expression
data analysis successfully. Currently, there are several existing
gene selection software packages, such as
rankgene1 and mRMR. Because these toolkits were developed in different environments, it is difficult
to compare different algorithms using them. We have developed a
software package named LIBGS, which can effectively evaluate
gene function in discriminating biological samples of different
types. This package includes: 1) Seven new gene selection algorithms
implemented using MATLAB; 2) A MATLAB interface for Rankgene1.1
which includes another eight selection measures; 3) A MATLAB
interface for two well-known classification tools (e.g., LIBSVM and
WEKA); 4) Programs for converting data formats; 5) A collection of
six popular expression data sets. In LIBGS, we provide consistent
input and output data formats for different gene selection
algorithms, which make it more flexible to perform data analysis and
algorithm comparison. All of the features make LIBGS a useful tool
in gene expression analysis and feature selection.
Keywords: LIBGS; feature selection algorithms; MATLAB - Protein Structural Classification Using Orthogonal Transformation and Class-Association Rules
by Sumeet Dua, Praveen Kidambi Abstract: Protein structure classification and comparison is a central area in the field of bioinformatics. Rapid increase in the size of protein databases has prompted the development of fast, automated methods to classify unknown protein structures. Protein structure databases commonly suffer from the ‘curse of dimensionality,’ necessitating the development of the dimensionality reduction of protein structural information prior to its classification. Moreover, the design and development of efficient manual or semi-automated classification techniques has not kept pace with the growth in such databases. In this paper, we propose a novel automated computational framework for three-dimensional (3D) structure-based classification of proteins using orthogonal transformation of the geometric shape descriptors derived from protein structures, employing an association rule-based supervised clustering approach. We demonstrate our results on two different datasets. The proposed novel computational framework demonstrates the applicability of association rule discovery-based classification of structural descriptors for protein fold classification with improved sensitivity. Keywords: Proteins; protein classification; association rules; dimensionality reduction; dihedral angles; rule classification; laplace accuracy; geometric descriptors; cosine transformation - Bi-k-Bi Clustering: Mining Large Scale Gene Expression Data Using Two-Level Biclustering
by Levent Carkacioglu, Rengul Cetin Atalay, Ozlen Konu, Volkan Atalay, Tolga Can Abstract: Due to the increase in gene expression data sets in recent years, the application of data mining techniques to these data has become a matter of interest. Various methods have been proposed for mining gene expression profiles. However, most of these methods target single gene expression data sets and cannot handle all the available gene expression data in public databases in a reasonable amount of time and space. In this paper, we propose a novel framework, bi-k-bi clustering, for finding association rules of gene pairs that can easily operate on large scale and multiple heterogeneous data sets. Our motivation in this study is to help biologists discover significant gene pair relations among large scale gene expression data sets. We applied the proposed bi-k-bi clustering framework on all available NCBI GEO Homo sapiens data sets and five selected groups of data sets independently. Our results show consistency and relatedness with the available literature and also provide novel associations. Keywords: biclustering; association pattern discovery; spearman rank correlation; gene expression analysis - Prediction of protein-protein interactions from primary sequences
by Qiwen Dong Abstract: Proteome-wide prediction of protein–protein interactions is a difficult and important problem in biology. In this study, an efficient method is presented to predict protein-protein interactions with sequence composition information. Four kinds of basic building blocks of protein sequences are investigated, including N-grams, patterns, motifs and binary profiles. The protein sequences are mapped into high-dimensional vectors by using the occurrence frequencies of each kind of building blocks. The resulting vectors are then taken as input to support vector machine to predict protein-protein interactions. Experiments are conducted over the “small-scale” subset of the DIP database. The experimental results show that there is minor difference in prediction performance among the four kinds of different building blocks. The method based on combination of all building blocks outperforms any of the
building blocks, and gets a sensitivity of 72.73%, a specificity of 77.33% and an overall accuracy of 75.03%. The corresponding ROC score is 0.8233 and ROC50 score is 0.7003. We also demonstrate that the use of latent semantic analysis, which is an
efficient feature extraction technique from natural language processing, can efficiently remove noise and improve the prediction efficiency without significantly degrading the performance. The results obtained here are helpful for the prediction of protein-protein interactions by using only sequence information. Keywords: protein-protein interaction, basic building block, latent semantic analysis - Hierarchical Classification of G-Protein-Coupled-Receptors with Data-Driven Selection of Attributes and Classifiers
by Andrew Secker, Matthew Davies, Alex Freitas, Edward Clark, Jon Timmis, Darren Flower Abstract: We address the prediction of protein function using information derived from a protein’s primary sequence, a challenging and important problem in bioinformatics. We consider here the functional classification of the G-protein-coupled receptors (GPCRs) into a hierarchy of families, sub-families and sub-sub-families. We recast this task as a hierarchical classification problem and tackle it from a data mining perspective using a novel top-down hierarchical classification system where, for each classifier node in the class hierarchy, the set of predictor attributes to be used in that node and the classifier to be applied to the selected attributes are chosen in a data-driven manner. We report computational results for a challenging GPCR dataset having three hierarchical class levels, and 5, 38 and 87 classes in the first, second and third levels. Compared with a previous hierarchical classification system selecting classifiers only, our new system significantly reduces processing time without significantly sacrificing predictive accuracy. Keywords: hierarchical classification, supervised learning, attribute selection, feature selection, classifier selection, protein function prediction, G-protein coupled receptor (GPCR) - Understandable Learning Machine System Design for Transmembrane or Embedded Membrane Segments Prediction
by Hae-Jin Hu, Robert Harrison, Phang C. Tai, Yi Pan Abstract: Learning machine based approaches such as Neural Network or SVM have shown decent performance on transmembrane (TM) segment prediction. However, they were not able to explain a decision making process in a biologically understandable way since they were black-box models. To overcome this limitation, we modified an existing association rule–based classifier CPAR to handle the sequential patterns for prediction. This modified classifier PCPAR was improved further by combining with SVM in parallel or sequential. The experimental results indicate that this hybrid scheme offers biologically meaningful rules on TM/EM segment prediction while maintaining the performance almost as well as the SVM method. The evaluation of the sturdiness and the receiver operating characteristic (ROC) curve analysis proved that this new scheme is robust and competent with SVM on TM/EM segment prediction. The prediction server is available at http://bmcc2.cs.gsu.edu/~haeh2/. Keywords: association rule based classifier, support vector machine, transmembrane segment, embedded membrane segment - A Hybrid Clustering Algorithm for Identifying Modules in Protein-Protein Interaction Networks
by Liang Yu, Lin Gao Abstract: Identifying functional modules in protein-protein interaction (PPI) network is important to the understanding of the organization and the interaction of the cellular processes. As PPI networks are becoming larger and more complete, effective computation methods will be required. In this paper, we present a novel algorithm Combining Molecular Complex Detection (MCODE) with Girvan and Newman (GN) to identify biological complexes in protein interaction networks. Our algorithm can accurately discover denser modules in large-scale protein interaction networks. We applied our clustering algorithm to S.cerevisiae PPI networks and obtained high matching rate between the predicted modules and the known protein complexes in Munich Information Center for Protein Sequences (MIPS). The simulation results show that our clustering algorithm provides an effective, reliable, and scalable method of identifying the protein modules in PPI networks. Keywords: protein-protein interaction (PPI) networks; graph clustering; functional modules; protein complexes - Integrating Flexibility and Interactivity in Bioinformatics Visual Programming Tools with Focus+Context Algorithm
by Xiajiong Shen, Jun Gu Abstract: An improved bioinformatics visual programming prototype system, VBP, which aims to visualize highly complicated bioinformatics data, is described. In this paper we describe the integration of the Focus+Context algorithm with bio-visual programming, which allows dynamic focusing on details without losing the context. Because of this flexible and interactive architecture, VBP makes an ideal bio-visual programming tool for future bioinformatics or systems biology research. Keywords: bioinformatics; information visualization; bio-visual programming; Focus+Context views - An Improved Position Weight Matrix Method based on an Entropy Measure for the Recognition of Prokaryotic Promoters
by Qinqin Wu, Jiajun Wang, Hong Yan Abstract: In this paper, an improved position weight matrix (PWM) method is proposed based on an entropy measure for the recognition of prokaryotic promoters. In this method, the conservative sites of the prokaryotic promoters are extracted according to an entropy measure, and then two improved position weight matrices are constructed based on the training set. By using the values of the matrix elements in the specific columns corresponding to the extracted conservative sites, the test sequences are scored and subsequently classified. Experiment results on several datasets show that the proposed algorithm outperforms the existing ones. Keywords: prokaryotic promoter recognition; conservative sites; information entropy; position weight matrix (PWM) - Microarray Data Classification by Multi-Information Based Gene Scoring Integrated with Gene Ontology
by Vincent S. Tseng, Hsieh-Hui Yu Abstract: In recent years, the advent of microarray technologies has allowed biologists to measure the expression profiles of thousands of genes simultaneously. Meanwhile, data mining techniques have been proposed to analyze gene expression data. Selecting informative genes is one of the most important issues for deciphering biological information hidden in such data. However, because microarray data have small samples and a large number of genes, general feature selection methods that are not biologically relevant become questionable. In this paper, we propose a novel classification method for microarray data by integrating a multi-information based gene scoring method with biological information. The gene ontology (GO) database, which describes the roles of genes and gene products in organisms, is adopted as the biological knowledge base in the study. The main advantage of our method is that it provides biologists with deeper insights into the relations between genes and gene function categories for target classification. Through experimental evaluation, our proposed method is shown to deliver good accuracy in classification. Keywords: microarray data classification; gene expression analysis; gene scoring; gene ontology - Using Hybrid Hierarchical K-means Clustering Algorithm for Protein Sequence Motif Super-Rule-Tree (SRT) Structure Construction
by Bernard Chen, Jieyue He, Stephen Pellicer, Yi Pan Abstract: Analysis of biologically significant regions relies heavily on protein sequence motif information. These conserved regions have the potential to determine the role of the proteins. Many algorithms or techniques to discover motifs require a predefined fixed window size. Due to the fixed size, these approaches often deliver a number of similar motifs simply shifted by some bases or including mismatches. To confront the mismatched motifs problem, we use the super-rule concept to construct a Super-Rule-Tree (SRT) by a modified HHK clustering algorithm, which requires no parameter setup to identify the similarities and dissimilarities between the motifs. Analysis of the motifs generated by our approach shows that they are significant not only in sequence but also in secondary structure similarity. We believe the newly proposed HHK clustering algorithm and SRT can play an important role in similar research that requires a predefined fixed window size. Keywords: Super-Rule-Tree (SRT), Hybrid Hierarchical K-means clustering algorithm, protein sequence motif - Screening SNPs residing in the MicroRNA-Binding Sites of Hepatocellular Carcinoma Related Genes
by Jun Ding, Yuzhen Gao, Yan He, Yifeng Zhou, Moli Huang, Haiyan Liu Abstract: MicroRNAs are small noncoding RNAs that regulate gene expression by targeting mRNAs at the 3’ untranslated regions (3’-UTRs), leading to mRNA cleavage or translational repression in most cases. Single nucleotide polymorphisms located at miRNA-binding sites (miRNA-binding SNPs) are likely to affect the expression of the miRNA targets and may contribute to the susceptibility of humans to common diseases. Here we selected 289 candidate hepatocellular carcinoma (HCC) related genes according to the existing literature and database. We identified putative miRNA binding sites of 52 genes by specialized algorithms (PicTar, miRBase, miRanda, TargetScan), then we screened SNPs in the 3’-UTRs of 50 genes. Using the BLAST program, we identified 5 genes that had SNPs in the regions of miRNA-binding sites, one of which is confirmed in another published study. The SNPs we identified could affect the binding and regulatory activities of the miRNAs. Therefore, we propose these SNPs for further investigations in case-control association studies. Keywords: hepatocellular carcinoma; Single nucleotide polymorphisms (SNPs); miRNA - The paper is from MLBB 2008 - Prediction of inter-residue contact clusters from hydrophobic cores
by Peng Chen Abstract: A contact map is a key factor representing a specific protein structure. As previous works have reported, contact maps play an important role in the folding and stability of proteins and even a corrupted contact map can be used to reconstruct its corresponding protein structure. Thus we can predict the structure of a protein partially through the contact map prediction. To simplify the protein contact map prediction, we predict the inter-residue contact clusters centered at the groups of their surrounding inter-residue contacts instead. In this paper, we adopt a SVM based approach to predict the inter-residue contact cluster centers. The input information of the SVM predictor includes sequence profile, evolutionary rate, and predicted secondary structure. The SVM predictor is based on hydrophobic cores that may be considered as locations of the inter-residue contact clusters. As a result, about 35% clustering centers of inter-residue contacts can be predicted accurately. Keywords: Support Vector Machine, Contact Cluster, Hydrophobic Core - Medical Informatics: Transition from Data Acquisition to Data Analysis By Means of Bioinformatics’ Tools and Resources
by Mahmood A. Mahdavi Abstract: Medical informatics has shifted its focus from acquisition and storage of healthcare data by integrating computational, informational, cognitive, and organizational sciences to semantic analysis of the data for problem solving and clinical decision making. In this transition, bioinformatics tools and resources are the most appropriate means to improve the analysis, as major biological databases now contain clinical data alongside genomics, proteomics and other biological data. This article briefly reviews bioinformatics tools and resources and then discusses their applications in analysing clinical data for diagnostics. Keywords: clinical diagnostics; decision making; microarray; database; homology; bioinformatics - Accuracy of protein hydropathy predictions
by Satu Jääskeläinen, Pentti Riikonen, Tapio Salakoski, Mauno Vihinen Abstract: Hydropathy, the tendency of proteins and amino acids to like or dislike water interaction, is a dominant force in protein folding. Several different hydropathy scales are available for amino acids to do sequence-based predictions. Hydropathy predictions are widely used, without knowing about the accuracy and reliability of the obtained results. We investigated the prediction accuracy of 56 hydropathy scales by correlating predicted values with the accessible surface area in known three dimensional structures of proteins. Results for different amino acids vary greatly within each scale, but are more consistent between the scales. We also investigated prediction accuracies of amino acids separately in secondary structural elements and in protein fold families. One of the most common applications of hydropathy scales is to predict antigenic regions. Some epitopes are located among the most exposed regions. Despite very low overall correlation, hydropathy predictions can still be used in certain applications where the shape of the plot is important instead of the prediction values. Keywords: Hydropathy predictions, hydropathy scales, epitopes - WF-MSB: A Weighted Fuzzy-based Biclustering Method for Gene Expression Data
by Lien-Chin Chen, Philip S. Yu, Vincent S. Tseng Abstract: Biclustering is an important analysis method on gene expression data for finding a subset of genes sharing compatible expression patterns. Although some biclustering algorithms have been proposed, few provided a query-driven approach for biologists to search the biclusters which contain a certain gene of interest. In this paper, we proposed a generalized fuzzy-based approach, namely Weighted Fuzzy-based Maximum Similarity Biclustering (WF-MSB), for extracting a query-driven bicluster based on the user-defined reference gene. A fuzzy-based similarity measurement and condition weighting approach are used to extract significant biclusters in expression levels. Both of the most similar bicluster and the most dissimilar bicluster to the reference gene are discovered by WF-MSB. The proposed WF-MSB method was evaluated in comparison with MSBE on a real yeast microarray data and synthetic datasets. The experimental results show that WF-MSB can effectively find the biclusters with significant GO-based functional meanings. Keywords: Biclustering; gene expression; data mining; fuzzy set; gene similarity measure - An algorithm for network motif discovery in biological networks
by Guimin Qin, Lin Gao Abstract: Network motif discovery is a key problem in analysis of biological networks. In this paper, we present an efficient algorithm for detecting consensus motifs. Firstly, we extend subgraph searching algorithm ESU(Enumerate Subgraphs) to efficiently search non-treelike subgraphs of which the probability of occurrence in random networks is small. Then we classify isomorphic subgraphs into different groups. Finally, we use a hierarchical clustering method to cluster subgraphs, and derive consensus motifs from the clusters. Our algorithm is applied to the PPI (protein-protein interaction) networks and the transcriptional regulatory networks of E. coli and S. cerevisiae. The experiment results show that the algorithm can efficiently discover motifs which are consistent with current biology knowledge. And it can also detect several consensus motifs with a given size, which may help biologists go further into cellular process. Keywords: network motifs; biological networks; PPI networks; transcriptional regulatory networks - Protein Interaction Detection in Sentences via Gaussian Processes: A preliminary evaluation
by Tamara Polajnar, Simon Rogers, Mark Girolami Abstract: Classification methods are vital for efficient access of knowledge hidden in biomedical publications. Support vector machines (SVMs) are modern non-parametric deterministic classifiers that produce state of the art performances in text mining, and across other disciplines, while reducing the need for feature engineering. In this paper we offer a much needed evaluation of the Gaussian Process (GP) classifier, as a non-parametric probabilistic analogue to SVMs, which has been rarely applied to text classification. To this end, we provide an extensive experimental comparison of the performance and properties of these competing classifiers on the challenging problem of protein interaction detection in biomedical publications. Our results show that GPs can match the performance of SVMs without the need for costly margin parameter tuning, whilst offering the advantage of an extendable probabilistic framework for text classification. Keywords: text mining; Gaussian process; support vector machine; protein interaction; sentence classification - SVM-RFE based feature selection for tandem mass spectrum quality assessment
by Jiarui Ding, Jinhong Shi, Fang-Xiang Wu Abstract: In the literature, hundreds of features have been proposed to assess the quality of tandem mass spectra. However, many of these features are nearly irrelevant to describing the quality of a spectrum, and the inclusion of these features can degrade the spectrum quality assessment performance. We propose to use a two-stage recursive feature elimination method based on support vector machines (SVM-RFE) to select the highly relevant features from the collection of features in the literature. To verify the relevance of the selected features, classifiers are trained from different sets of selected features and their performances are analyzed. The results demonstrate that the sets with a small number of features (such as 13 or 15 features) outperform the full set of features, which indicates that these features together can better describe the quality of tandem mass spectra and hence improve the performance of tandem mass spectral quality assessment. Keywords: feature selection; SVM-RFE; tandem mass spectra; quality assessment; proteomics - Detecting Microarray Data Supported MicroRNA-mRNA Interactions
by Hui Liu Abstract: MicroRNAs (miRNAs) have recently emerged as a novel class of endogenous post-transcriptional regulators in a variety of animal and plant species. The regulatory mechanism of miRNAs is known as translational repression or post-transcriptional degradation induced by partial or full binding to the 3’-UTR of their target mRNAs. Experimental analysis and sequence-based computational approaches have revealed a large number of miRNA genes and their targets. However, our knowledge of the biogenesis, functions, and regulatory mechanisms of these miRNAs is still extremely limited. Identifying bona fide miRNA-mRNA interactions is an important but challenging task for gaining insight into the regulatory mechanism of miRNAs. In this paper, we propose a new method to detect miRNA-mRNA interactions by exploiting both genome sequence and microarray data. We employ a bipartite graph to model the relationships between miRNAs and their targets and develop a variant of the affinity propagation algorithm to reveal the interactions supported by microarray data. We first carry out experiments to analyze the impact of the model parameters and tune them to optimal values, then apply our model to refine the sequence-based predictions and compare our method to state-of-the-art methods. Our extensive experiments show that our method is effective in screening the miRNA-mRNA interactions predicted by sequence-based approaches, using microarray data to reduce the number of candidate miRNA targets. Keywords: MicroRNA; Target gene; Bipartite graph; Microarray
|
|
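The last abstract above (Detecting Microarray Data Supported MicroRNA-mRNA Interactions) models sequence-predicted miRNA-target pairs as a bipartite graph and keeps the pairs supported by microarray data. The sketch below illustrates only that general idea, using a much simpler stand-in technique: it keeps a candidate edge when the miRNA and mRNA expression profiles are negatively correlated. It is not the paper's affinity propagation variant. The function names, the Pearson criterion, the `threshold` value, and the identifiers `miR-A`, `gene1` and `gene2` are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def filter_supported_edges(candidate_edges, mirna_expr, mrna_expr, threshold=-0.5):
    """Keep bipartite edges (miRNA, mRNA) whose profiles are anti-correlated.

    candidate_edges: iterable of (mirna_id, mrna_id) pairs, e.g. from a
        sequence-based target predictor (hypothetical input).
    mirna_expr, mrna_expr: dicts mapping ids to expression profiles
        measured over the same samples.
    threshold: assumed cutoff; an edge is kept when r <= threshold.
    Returns a dict: mirna_id -> list of (mrna_id, r).
    """
    supported = {}
    for mirna, mrna in candidate_edges:
        r = pearson(mirna_expr[mirna], mrna_expr[mrna])
        if r <= threshold:
            supported.setdefault(mirna, []).append((mrna, r))
    return supported


if __name__ == "__main__":
    # Toy data (hypothetical): gene1 falls as miR-A rises, gene2 rises with it.
    edges = [("miR-A", "gene1"), ("miR-A", "gene2")]
    mirna_expr = {"miR-A": [1, 2, 3, 4]}
    mrna_expr = {"gene1": [4, 3, 2, 1], "gene2": [1, 2, 3, 4]}
    print(filter_supported_edges(edges, mirna_expr, mrna_expr))
```

A per-edge correlation filter treats every candidate pair independently, which keeps the example short. A message-passing method such as affinity propagation instead scores edges jointly across the whole graph.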