Title: Mining large-scale repetitive sequences in a MapReduce setting

Authors: Hongfei Cao; Michael Phinney; Devin Petersohn; Benjamin Merideth; Chi-Ren Shyu

Addresses: Department of Computer Science, University of Missouri, Columbia, MO 65211, USA | Department of Computer Science, University of Missouri, Columbia, MO 65211, USA | Department of Computer Science, University of Missouri, Columbia, MO 65211, USA | Informatics Institute, University of Missouri, Columbia, MO 65211, USA | Department of Computer Science, University of Missouri, Columbia, MO 65211, USA; Informatics Institute, University of Missouri, Columbia, MO 65211, USA

Abstract: Recent research suggests that DNA repeats play critical roles in cellular regulatory functions and disease development. Identifying repeats across a collection of genomes is challenging because of the amount of data stored within DNA and because the intermediate data generated by alignment- and hash-based approaches are substantial. We present a MapReduce-based method for repeat identification and propose efficient storage and search techniques. Our approach distributes computation and storage across a cluster of commodity computers, providing a cost-effective, flexible, robust, and scalable solution for identifying various types of repetitive sequences across a collection of genomes. We benchmark our method on a collection of six genomes totalling approximately 14.2 billion base pairs and demonstrate a tenfold speedup over previous state-of-the-art approaches, along with linear scalability. In addition, we conduct a deeper scalability analysis by processing a collection of 39 genomes, approximately 104 billion base pairs.

Keywords: repetitive sequences; big data; sequence analysis; genomic sequence analysis; DNA repeats; Hadoop; MapReduce; UCEs; ultra conserved elements; cluster computing; genomes; data mining; cellular regulatory functions; disease development; repeat identification; bioinformatics.

DOI: 10.1504/IJDMB.2016.074873

International Journal of Data Mining and Bioinformatics, 2016 Vol.14 No.3, pp.210 - 228

Received: 11 May 2015
Accepted: 11 May 2015
Published online: 22 Feb 2016
