Title: Statistical methods for metagenomics data analysis

Authors: Chanyoung Lee; Seungyeoun Lee; Taesung Park

Addresses: Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea; Department of Mathematics and Statistics, Sejong University, Seoul, South Korea; Department of Statistics, Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea

Abstract: With the advent of next-generation sequencing (NGS) technology, sequencing of microbes now allows association analyses between genomic features and the environment. Several statistical methods have been proposed for analysing metagenome data. In this study, we propose a novel method, Centred log ratio-transformed, Permutation-based Logistic regression (CPL), a logistic regression model that uses the centred log-ratio transformation and permutation. Together with CPL, we systematically compare the performance of various statistical methods in finding differentially abundant features (DAFs). We first assess the type I error rate of each method and compare the power of each method over different levels of sparsity. Furthermore, we apply the methods to real colorectal cancer (CRC) data and compare the resulting list of taxonomic markers with the results of a previous CRC study. Based on these results, we recommend using CPL, metagenomeSeq and/or ANCOM, because they preserve the type I error rate well while achieving comparable power.
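The abstract names the two ingredients of CPL, a centred log-ratio (CLR) transformation and a permutation-based logistic regression test, but does not give the details. The following is only a minimal sketch of that general idea, not the authors' implementation. The pseudocount of 0.5, the function names, and the choice of test statistic are all assumptions made for illustration: instead of a fitted coefficient, the sketch uses the score statistic for the slope of a logistic model `y ~ x` evaluated at slope zero, and obtains its p-value by permuting the case/control labels.

```python
import math
import random


def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform of one sample's taxon counts.

    The pseudocount (an assumed value) avoids log(0) for absent taxa.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def score_stat(x, y):
    """Score statistic for the slope of a logistic model y ~ x at slope 0.

    With an intercept-only null fit, every fitted probability equals mean(y),
    so the score for the slope is sum(x_i * (y_i - mean(y))).
    """
    y_bar = sum(y) / len(y)
    return sum(xi * (yi - y_bar) for xi, yi in zip(x, y))


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value obtained by shuffling the labels y."""
    rng = random.Random(seed)
    observed = abs(score_stat(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        # Small tolerance so that ties are not lost to floating-point rounding.
        if abs(score_stat(x, y_perm)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: rows are samples, columns are three taxa.
toy_counts = [
    [50, 3, 47], [60, 0, 40], [55, 2, 43], [58, 1, 41],
    [20, 30, 50], [15, 35, 50], [18, 40, 42], [22, 28, 50],
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = control, 1 = case

transformed = [clr(row) for row in toy_counts]
taxon_2 = [row[1] for row in transformed]  # CLR values of the second taxon
p_value = permutation_pvalue(taxon_2, toy_labels)
```

In practice each taxon would be tested in turn and the resulting p-values adjusted for multiple testing; that step is omitted here.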

Keywords: DAFs; differentially abundant features; metagenome; microbiome; association test; statistical methods; 16S rRNA; OTU; operational taxonomic unit; taxa; logistic regression.

DOI: 10.1504/IJDMB.2017.091366

International Journal of Data Mining and Bioinformatics, 2017 Vol.19 No.4, pp.366 - 385

Received: 13 Feb 2018
Accepted: 19 Feb 2018

Published online: 27 Apr 2018
