Article Abstract

Title: Mining large-scale repetitive sequences in a MapReduce setting
  Author: Hongfei Cao, Michael Phinney, Devin Petersohn, Benjamin Merideth, Chi-Ren Shyu
  Address: Department of Computer Science, University of Missouri, Columbia, MO 65211, USA ' Department of Computer Science, University of Missouri, Columbia, MO 65211, USA ' Department of Computer Science, University of Missouri, Columbia, MO 65211, USA ' Informatics Institute, University of Missouri, Columbia, MO 65211, USA ' Department of Computer Science, University of Missouri, Columbia, MO 65211, USA; Informatics Institute, University of Missouri, Columbia, MO 65211, USA
  Journal: International Journal of Data Mining and Bioinformatics, 2016, Vol. 14, No. 3, pp. 210-228
  Abstract: Recent research suggests that DNA repeats play critical roles in cellular regulatory functions and disease development. Identifying repeats across a collection of genomes is challenging because of the sheer amount of data stored within DNA and the substantial intermediate data generated by alignment- and hash-based approaches. We present a MapReduce-based method for repeat identification and propose efficient storage and search techniques. Our approach distributes the computation and storage across a cluster of commodity computers, providing a cost-effective, flexible, robust, and scalable solution for identifying various types of repetitive sequences across a collection of genomes. We benchmark our method on a collection of six genomes totalling approximately 14.2 billion base pairs, and demonstrate a tenfold speedup over previous state-of-the-art approaches as well as linear scalability. In addition, we conduct a deeper scalability analysis by processing a collection of 39 genomes, approximately 104 billion base pairs.
  Keywords: repetitive sequences; big data; sequence analysis; genomic sequence analysis; DNA repeats; Hadoop; MapReduce; UCEs; ultra conserved elements; cluster computing; genomes; data mining; cellular regulatory functions; disease development; repeat identification; bioinformatics.
  DOI: 10.1504/IJDMB.2016.074873