Title: Applying Agrep to r-NSA to solve multiple sequences approximate matching

Authors: Bing Ni; Man-Hon Wong; Chi-Fai David Lam; Kwong-Sak Leung

Addresses: Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Shatin, Hong Kong; CSE Department, CUHK, Shatin, Hong Kong; CSE Department, CUHK, Shatin, Hong Kong; CSE Department, CUHK, Shatin, Hong Kong

Abstract: This paper addresses approximate matching in a database of multiple DNA sequences. The proposed approach applies Agrep to a new truncated suffix array, r-NSA. The structure can be built in time linear in the database size, and indexing a substring in it takes constant time. The number of characters processed when applying Agrep is analysed theoretically, and the resulting upper bound closely approximates the empirical count obtained by enumerating the characters in the structure actually built. Experiments use synthetic random DNA sequences as well as real genome sequences, including Hepatitis-B virus and the X-chromosome. Compared with the straightforward approach of applying Agrep to each sequence individually, the proposed approach solves the matching problem in much less time. The speed-up depends on the sequence patterns; for highly similar homologous genome sequences, which are common in real-life genomes, it can reach several orders of magnitude.
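
For orientation, the straightforward baseline mentioned in the abstract, running an approximate matcher over each sequence individually, might look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's method: it uses a Sellers-style dynamic-programming matcher as a stand-in for Agrep, it does not build r-NSA, and the names `approx_end_positions` and `search_each` are hypothetical.

```python
from typing import Dict, List


def approx_end_positions(pattern: str, text: str, k: int) -> List[int]:
    """Return 0-based end positions in `text` where some substring matches
    `pattern` with at most `k` edits (substitution, insertion, deletion).

    Stand-in matcher (Sellers-style dynamic programming), NOT Agrep itself.
    """
    m = len(pattern)
    # col[i] = fewest edits to align pattern[:i] with a substring of the
    # text ending at the current position; col[0] stays 0 so a match may
    # start anywhere in the text.
    col = list(range(m + 1))
    hits: List[int] = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # value of col[i - 1] from the previous text position
        col[0] = 0
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            hits.append(j)
    return hits


def search_each(pattern: str, sequences: Dict[str, str], k: int) -> Dict[str, List[int]]:
    """Baseline: scan every sequence in the database independently."""
    return {name: approx_end_positions(pattern, seq, k)
            for name, seq in sequences.items()}


if __name__ == "__main__":
    # Toy database; the sequence names and contents are made up for illustration.
    db = {"s1": "TTACGTTT", "s2": "ACCT", "s3": "GGGG"}
    print(search_each("ACGT", db, k=1))
```

The baseline's cost grows with the total length of all sequences, because shared substrings are rescanned in every sequence. That repeated work is what an index over the whole database, such as the r-NSA described above, is meant to avoid.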

Keywords: numerical suffix array; truncated suffix array; Agrep; multiple sequences; approximate matching; DNA sequences; bioinformatics; Hepatitis-B virus; X-chromosome; genome sequences.

DOI: 10.1504/IJDMB.2014.062145

International Journal of Data Mining and Bioinformatics, 2014 Vol.9 No.4, pp.358 - 385

Received: 22 Jul 2010
Accepted: 10 Feb 2011
Published online: 21 Oct 2014
