International Journal of Data Mining and Bioinformatics (4 papers in press)
Accurate Annotation of Metagenomic data without species-level references
by Haobin Yao, Tak-wah Lam, Hing-Fung Ting, Siu-Ming Yiu, Yadong Wang, Bo Liu
Abstract: In this paper, we propose a novel annotation tool, MetaAnnotator, for annotating metagenomic reads; it significantly outperforms all existing tools when only genus-level references exist in the database. In our experiments, MetaAnnotator assigns 87.5% of reads correctly (67.5% of reads are assigned to the exact genus), with only 8.5% of reads wrongly assigned. The best existing tool, MetaCluster-TA, achieves only 73.4% correct read assignment (with only 50.9% of reads assigned to the exact genus and 22.6% of reads wrongly assigned). The core concepts behind MetaAnnotator are: (i) only exact k-mers in the coding regions of the references are considered, as they should be more significant and accurate; (ii) to assign reads to taxonomy nodes, genome-specific and taxonomy-specific probabilistic models are constructed from the reference database; and (iii) the BWT data structure is used to speed up the k-mer matching process.
Keywords: metagenomic data analysis; binning; accurate and fast annotation.
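As a rough illustration of exact k-mer matching against labelled references, the following minimal Python sketch indexes reference sequences by k-mer and assigns a read to the taxon with the most exact hits. It is not MetaAnnotator's implementation: a plain dictionary stands in for the BWT index, a simple vote stands in for the probabilistic models, and the function names, taxon labels and sequences are all hypothetical.

```python
from collections import Counter, defaultdict


def build_kmer_index(references, k):
    """Map each k-mer to the set of taxa whose reference sequence contains it.

    `references` is a hypothetical dict of taxon label -> sequence string.
    A dictionary is used here for simplicity; the abstract describes a BWT.
    """
    index = defaultdict(set)
    for taxon, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(taxon)
    return index


def assign_read(read, index, k):
    """Assign a read to the taxon with the most exact k-mer hits.

    Returns None when no k-mer of the read occurs in any reference.
    This simple vote is an assumption, not the paper's probabilistic model.
    """
    votes = Counter()
    for i in range(len(read) - k + 1):
        # .get avoids creating empty entries in the defaultdict
        for taxon in index.get(read[i:i + k], ()):
            votes[taxon] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only
references = {"GenusA": "ACGTACGTGG", "GenusB": "TTTTCCCCAA"}
index = build_kmer_index(references, k=4)
print(assign_read("CGTACG", index, k=4))
print(assign_read("GGGGGG", index, k=4))
```

A read with no matching k-mer is left unassigned rather than forced onto a taxon, which mirrors the abstract's distinction between correctly assigned, wrongly assigned and unassigned reads.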
A Novel Low-rank Representation Method for Identifying Differentially Expressed Genes
by Xiu-Xiu Xu, Ying-Lian Gao, Jin-Xing Liu, Ya-Xuan Wang, Ling-Yun Dai, Xiang-Zhen Kong, Sha-Sha Yuan
Abstract: Low-rank representation (LRR) has attracted much attention in recent years. However, its chief shortcoming is that it uses the nuclear norm to approximate the non-convex rank function. Because this approximation minimizes all singular values, the nuclear norm may not approximate the rank function well. In this paper, we propose a novel low-rank method that replaces the nuclear norm with the truncated nuclear norm to approximate the rank function, and we apply it to identifying differentially expressed genes. The truncated nuclear norm is defined as the sum of only the smaller singular values, which may be a better approximation of the rank function than the nuclear norm. To ensure convergence, the optimization problem is solved with the augmented Lagrange multiplier method, which has the property of convergence. The experimental results demonstrate that our method outperforms the LLRR, TRPCA and RPCA methods.
Keywords: differentially expressed genes; truncated nuclear norm; low-rank; augmented Lagrange multiplier; TCGA data.
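To make the difference between the two norms in the abstract above concrete, here is a minimal Python sketch that computes both from a list of singular values. It assumes the singular values have already been obtained from an SVD routine, and the truncation parameter `r` (the number of largest singular values left out) is an assumed name that the abstract does not specify.

```python
def nuclear_norm(singular_values):
    """Sum of all singular values."""
    return sum(singular_values)


def truncated_nuclear_norm(singular_values, r):
    """Sum of the singular values that remain after dropping the r largest.

    `r` is an assumed parameter name; r = 0 gives the ordinary nuclear norm.
    """
    if not 0 <= r <= len(singular_values):
        raise ValueError("r must be between 0 and the number of singular values")
    ordered = sorted(singular_values, reverse=True)
    return sum(ordered[r:])


# Toy singular values, invented for illustration only
sv = [10, 5, 2, 1]
print(nuclear_norm(sv))               # 18
print(truncated_nuclear_norm(sv, 2))  # 3
```

Minimizing the truncated quantity penalizes only the small singular values, so the dominant components of the matrix are not shrunk along with the noise.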
Medical Examination Data Prediction with Missing Information Imputation Based on Recurrent Neural Networks
by Han-Gyu Kim, Gil-Jin Jang, Ho-Jin Choi, Myungeun Lim, Jaehun Choi
Abstract: In this work, recurrent neural networks (RNNs) are proposed for medical examination data prediction with missing information. The simple recurrent network (SRN), long short-term memory (LSTM) and gated recurrent unit (GRU) are selected from among the many variations of RNNs for missing information imputation, and they are also used to predict future medical examination data. In addition, missing information imputation based on a bidirectional LSTM is proposed so that both past and future information are considered in the imputation process, whereas traditional RNNs can consider only past information during imputation. We conducted a medical examination result prediction experiment using an examination database of Koreans. The experimental results showed that the proposed RNNs worked better than the baseline linear regression method, and that the bidirectional LSTM performed best for missing information imputation.
Keywords: Medical Examination Data Prediction; Recurrent Neural Network; Long Short-Term Memory; Gated Recurrent Unit; Bidirectional LSTM.
A hybrid-ensemble based framework for microarray data Gene selection
by Amirreza Rouhi, Hossein Nezamabadi-pour
Abstract: With the advent and propagation of high-dimensional microarray data, the process of gene selection has become far more difficult and time-consuming, and classic feature selection methods are quickly becoming obsolete. Dealing with high-dimensional biomedical data is associated with problems such as the curse of dimensionality and an increased presence of redundant and irrelevant genes, all of which lead to a significant rise in classification error. This paper provides a framework for the combined use of ensemble and hybrid methods for gene selection in high-dimensional data, with the aim of increasing classification accuracy and reducing dimensionality. The proposed method is benchmarked using several microarray datasets. Comparison with the latest ensemble feature selection methods confirms the good performance of the proposed approach.
Keywords: Gene selection; Feature selection; Microarray data; Hybrid methods; Metaheuristic; Ensemble methods.