Title: Correcting imbalanced reads coverage in bacterial transcriptome sequencing with extreme deep coverage

Authors: Xinjun Zhang; Dharanesh Gangaiah; Robert S. Munson Jr.; Stanley M. Spinola; Yunlong Liu

Addresses: Center for Computational Biology and Bioinformatics, School of Informatics and Computing, Indiana University Bloomington, Bloomington, IN 47408, USA ' Department of Microbiology and Immunology, Indiana University School of Medicine, Indianapolis, IN 46202, USA ' The Center for Microbial Pathogenesis in the Research Institute at Nationwide Children’s Hospital, Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH 43210, USA ' Department of Microbiology and Immunology, Department of Medicine, Department of Pathology and Laboratory Medicine, Center for Immunobiology, Indiana University School of Medicine, Indianapolis, IN 46202, USA ' Center for Computational Biology and Bioinformatics, Center for Medical Genomics, Departments of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA

Abstract: High-throughput bacterial RNA-Seq experiments can generate extremely high and imbalanced sequencing coverage. Over- or under-estimation of gene expression levels hinders accurate differential gene expression analysis. Here we evaluated strategies to identify expression differences of genes with high coverage in bacterial transcriptome data using either raw sequence reads or unique reads with duplicate fragments removed. In addition, we proposed a generalised linear model (GLM) based approach to identify imbalance in read coverage based on sequence composition. Our results show that analysis using raw reads identifies more differentially expressed genes, with more accurate fold changes, than analysis using unique reads. We also demonstrate the presence of sequence-composition-related biases that are independent of gene expression levels and experimental conditions. Finally, genes that still showed strong coverage imbalance after correction were tagged using a statistical approach.

Keywords: bacterial transcriptome sequencing; RNA-Seq; gene differential expression; coverage imbalance; tri-nucleotides; GLM; generalised linear modelling; computational biology; RNA sequences; gene expression levels.

DOI: 10.1504/IJCBDD.2014.061646

International Journal of Computational Biology and Drug Design, 2014 Vol.7 No.2/3, pp.195 - 213

Published online: 27 May 2014
