Title: Clustering sequences by overlap

Authors: Dietmar H. Dorr, Anne M. Denton

Addresses: Department of Computer Science, North Dakota State University, Fargo, ND, 58105, USA; Department of Computer Science, North Dakota State University, Fargo, ND, 58105, USA

Abstract: A clustering algorithm is introduced that combines the strengths of clustering and motif finding techniques. Clusters are identified based on unambiguously defined sequence sections as in motif finding algorithms. The definition of similarity within clusters allows transitive matches and, thereby, enables the discovery of remote homologies that cannot be found through motif-finding algorithms. Directed Acyclic Graph (DAG) structures are constructed that link short clusters to the longer ones. We compare the clustering results to the corresponding domains in the InterPro database. A second comparison shows that annotations based on our domains are inherently more consistent than those based on InterPro domains.

Keywords: sequence clustering; motif finding; annotation; bioinformatics; DAG; directed acyclic graph; InterPro domains; similarity; transitive matches; remote homologies.

DOI: 10.1504/IJDMB.2009.026701

International Journal of Data Mining and Bioinformatics, 2009, Vol. 3, No. 3, pp. 260-279

Received: 22 Jun 2007
Accepted: 18 Jan 2008

Published online: 23 Jun 2009
