Title: Generalised Sequence Signatures through symbolic clustering

Authors: Dietmar H. Dorr, Anne M. Denton

Addresses: Research and Development, Thomson Reuters, St. Paul, MN 55123, USA; Department of Computer Science, North Dakota State University, Fargo, ND 58102, USA

Abstract: Traditionally, sequence motifs and domains are defined such that insertions, deletions and mismatched regions are small compared with matched regions. We introduce an algorithm for the identification of Generalised Sequence Signatures (GSS), which can be composed of windows distributed throughout the sequence. Our approach is based on clustering analysis of recurring subsequences of a predefined length, which we refer to as symbols. Sequences are grouped so as to maximise the number of symbols they share. We show that using GSS to derive sequence annotations yields higher confidence values than using other signature recognition approaches.
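
Illustration: the abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that a symbol is any length-k window occurring in at least two sequences, and it swaps in a simple greedy agglomerative merge that joins the two groups sharing the most symbols. All function names, parameters (k, min_shared, min_seqs) and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def windows(seq, k):
    """Return the set of all length-k windows of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recurring_symbols(seqs, k, min_seqs=2):
    """Assumed definition: a symbol is a window found in >= min_seqs sequences."""
    counts = Counter(w for s in seqs for w in windows(s, k))
    return {w for w, c in counts.items() if c >= min_seqs}


def cluster_by_shared_symbols(seqs, k, min_shared=1, min_seqs=2):
    """Greedy agglomerative grouping (an assumption, not the published method).

    Each cluster is (member indices, symbols common to all members).
    The pair of clusters sharing the most symbols is merged until no
    pair shares at least min_shared symbols.
    """
    symbols = recurring_symbols(seqs, k, min_seqs)
    clusters = [([i], windows(s, k) & symbols) for i, s in enumerate(seqs)]
    while len(clusters) > 1:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: len(clusters[p[0]][1] & clusters[p[1]][1]),
        )
        shared = clusters[a][1] & clusters[b][1]
        if len(shared) < min_shared:
            break
        merged = (clusters[a][0] + clusters[b][0], shared)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters


# Toy sequences (hypothetical): the shared window sits at different positions.
toy = ["ABCDEXXXX", "YYABCDEYY", "QQQQRSTUV", "RSTUVWWWW"]
result = cluster_by_shared_symbols(toy, k=5)
```

In this sketch, the symbols retained by each final group play the role of its signature; because symbols are matched as sets, their positions within each sequence are free to differ.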

Keywords: sequence motifs; sequence domains; sequence annotations; sequence signatures; symbolic clustering; bioinformatics; data mining; signature recognition.

DOI: 10.1504/IJDMB.2010.037546

International Journal of Data Mining and Bioinformatics, 2010 Vol.4 No.6, pp.656 - 674

Published online: 16 Dec 2010
