Authors: Carl Barton; Mathieu Giraud; Costas S. Iliopoulos; Thierry Lecroq; Laurent Mouchard; Solon P. Pissis
Addresses: Department of Informatics, King's College London, London WC2R 2NS, UK ' INRIA Lille Nord-Europe, Le Chesnay Cedex, France ' Department of Informatics, King's College London, London WC2R 2NS, UK; Curtin Business School, Curtin University, Perth, Australia ' Faculty of Science, University of Rouen, LITIS - EA4108, France ' Department of Informatics, King's College London, London WC2R 2NS, UK; System and Information Processing, University of Rouen, LITIS - EA4108, France ' Florida Museum of Natural History, University of Florida, Gainesville, FL 32611-2710, USA; Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
Abstract: In this paper, we present a solution to the extreme similarity sequencing problem. The extreme similarity sequencing problem consists of finding occurrences of a pattern p in a set S0, S1, , Sk, of sequences of equal length, where Si, for all 1≤i≤k, differs from S0 by a constant number of errors - around 10 in practice. We present an asymptotically fast O(n + occ logocc) time algorithm, as well as a practical O(nk/w) time algorithm for solving this problem, where n is the length of a sequence, occ is the number of candidate occurrences reported by our technique, w is the size of the machine word, and the total number of errors is bounded by k - the number of sequences.
Keywords: DNA sequencing; highly similar sequences; similarity searching; querying DNA sequences; NGS; next-generation sequencing.
International Journal of Computational Biology and Drug Design, 2013 Vol.6 No.1/2, pp.119 - 130
Published online: 20 Feb 2013 *Full-text access for editors Access for subscribers Purchase this article Comment on this article