Article: Querying highly similar sequences Journal: International Journal of Computational Biology and Drug Design (IJCBDD) 2013 Vol.6 No.1/2 pp.119 - 130 Abstract: In this paper, we present a solution to the extreme similarity sequencing problem. The extreme similarity sequencing problem consists of finding occurrences of a pattern p in a set S0, S1, …, Sk, of sequences of equal length, where Si, for all 1≤i≤k, differs from S0 by a constant number of errors - around 10 in practice. We present an asymptotically fast O(n + occ logocc) time algorithm, as well as a practical O(nk/w) time algorithm for solving this problem, where n is the length of a sequence, occ is the number of candidate occurrences reported by our technique, w is the size of the machine word, and the total number of errors is bounded by k - the number of sequences. Inderscience Publishers - linking academia, business and industry through research

Title: Querying highly similar sequences

Authors: Carl Barton; Mathieu Giraud; Costas S. Iliopoulos; Thierry Lecroq; Laurent Mouchard; Solon P. Pissis

Addresses: Department of Informatics, King's College London, London WC2R 2NS, UK ' INRIA Lille Nord-Europe, Le Chesnay Cedex, France ' Department of Informatics, King's College London, London WC2R 2NS, UK; Curtin Business School, Curtin University, Perth, Australia ' Faculty of Science, University of Rouen, LITIS - EA4108, France ' Department of Informatics, King's College London, London WC2R 2NS, UK; System and Information Processing, University of Rouen, LITIS - EA4108, France ' Florida Museum of Natural History, University of Florida, Gainesville, FL 32611-2710, USA; Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

Abstract: In this paper, we present a solution to the extreme similarity sequencing problem. The extreme similarity sequencing problem consists of finding occurrences of a pattern p in a set S₀, S₁, …, S_k, of sequences of equal length, where S_i, for all 1≤i≤k, differs from S₀ by a constant number of errors - around 10 in practice. We present an asymptotically fast O(n + occ logocc) time algorithm, as well as a practical O(nk/w) time algorithm for solving this problem, where n is the length of a sequence, occ is the number of candidate occurrences reported by our technique, w is the size of the machine word, and the total number of errors is bounded by k - the number of sequences.

Keywords: DNA sequencing; highly similar sequences; similarity searching; querying DNA sequences; NGS; next-generation sequencing.

DOI: 10.1504/IJCBDD.2013.052206

International Journal of Computational Biology and Drug Design, 2013 Vol.6 No.1/2, pp.119 - 130

Published online: 18 Sep 2014 *

Full-text access for editors Full-text access for subscribers Purchase this article Comment on this article

Title: Querying highly similar sequences

Keep up-to-date