Title: Accurate annotation of metagenomic data without species-level references

Authors: Haobin Yao; T.W. Lam; H.F. Ting; S.M. Yiu; Yadong Wang; Bo Liu

Addresses: Department of Computer Science, The University of Hong Kong, Hong Kong, China ' Department of Computer Science, The University of Hong Kong, Hong Kong, China ' Department of Computer Science, The University of Hong Kong, Hong Kong, China ' Department of Computer Science, The University of Hong Kong, Hong Kong, China ' School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China ' School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

Abstract: Taxonomic annotation is a critical first step in the analysis of metagenomic data. Although many tools have been developed, their accuracy is still unsatisfactory, particularly when no close species-level reference exists in the database. In this paper, we propose a novel annotation tool, MetaAnnotator, for annotating metagenomic reads, which significantly outperforms all existing tools when only genus-level references exist in the database. In our experiments, MetaAnnotator assigns 87.5% of reads correctly (67.5% of reads are assigned to the exact genus), with only 8.5% of reads wrongly assigned. The best existing tool (MetaCluster-TA) achieves only 73.4% correct read assignment (50.9% of reads are assigned to the exact genus), with 22.6% of reads wrongly assigned. MetaAnnotator is also the second-fastest tool (1 hour for 20 million reads). The core concepts behind MetaAnnotator are: (i) we consider only exact k-mers in the coding regions of the references, as they should be more significant and accurate; (ii) to assign reads to taxonomy nodes, we construct genome-specific and taxonomy-specific probabilistic models from the reference database; and (iii) we use the BWT data structure to speed up the k-mer matching process.
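
The idea of assigning reads by exact k-mer matches against reference coding regions can be illustrated with a minimal sketch. This is not MetaAnnotator's implementation: it substitutes a plain hash map for the BWT index and a simple hit count for the probabilistic models described in the abstract. The function names, the `min_hits` threshold, and the tie-handling rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of genus labels whose reference contains it.

    `references` is a hypothetical dict: genus label -> coding-region sequence.
    A hash map stands in here for the BWT index named in the abstract.
    """
    index = defaultdict(set)
    for genus, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(genus)
    return index


def assign_read(read, index, k, min_hits=2):
    """Return the genus with the most exact k-mer hits, or None.

    A read is left unassigned when it has fewer than `min_hits` hits
    (an assumed threshold) or when the top two genera are tied.
    """
    hits = Counter()
    for kmer in kmers(read, k):
        for genus in index.get(kmer, ()):
            hits[genus] += 1
    if not hits:
        return None
    ranked = hits.most_common(2)
    best_genus, best_count = ranked[0]
    if best_count < min_hits:
        return None
    if len(ranked) > 1 and ranked[1][1] == best_count:
        return None
    return best_genus
```

Counting raw hits ignores how informative each k-mer is; the probabilistic models mentioned in the abstract would presumably weight matches by genome and taxonomy node, which this sketch does not attempt.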

Keywords: metagenomic data analysis; binning; accurate and fast annotation.

DOI: 10.1504/IJDMB.2017.091354

International Journal of Data Mining and Bioinformatics, 2017 Vol.19 No.4, pp.283 - 297

Received: 11 May 2017
Accepted: 20 May 2017

Published online: 27 Apr 2018
