Title: BAG: a graph theoretic sequence clustering algorithm

Authors: Sun Kim, Jason Lee

Addresses: School of Informatics, Center for Genomics and Bioinformatics, Indiana University, Bloomington, IN 47408, USA; School of Informatics, Center for Genomics and Bioinformatics, Indiana University, Bloomington, IN 47408, USA

Abstract: In this paper, we first discuss issues in clustering biological sequences with graph properties, which inspired the design of our sequence clustering algorithm BAG. BAG recursively utilises several graph properties: biconnectedness, articulation points, p-quasi-completeness, and domain knowledge specific to biological sequence clustering. To reduce the fragmentation issue, we have developed a new metric called cluster utility to guide cluster splitting. Clusters are then merged back with less stringent constraints. Experiments with the entire COG database and other sequence databases show that BAG can cluster a large number of sequences accurately while keeping the number of fragmented clusters significantly low.
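
As an illustration of one graph property named in the abstract, the following minimal Python sketch finds articulation points (vertices whose removal disconnects the graph) with a standard depth-first search. It is not the authors' implementation of BAG. The function name `articulation_points`, the toy sequence labels, and the assumption that an edge joins two sequences whose pairwise similarity passes some cutoff are all hypothetical choices made for this example.

```python
from collections import defaultdict


def articulation_points(edges):
    """Return the set of articulation points of a simple undirected graph.

    `edges` is an iterable of (u, v) pairs. In this sketch an edge is
    assumed to mean "these two sequences are similar enough to be linked";
    how that similarity is scored is outside the scope of the example.
    """
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    disc = {}      # DFS discovery time of each vertex
    low = {}       # lowest discovery time reachable from the vertex's subtree
    cut = set()    # articulation points found so far
    timer = [0]

    def dfs(node, parent):
        disc[node] = low[node] = timer[0]
        timer[0] += 1
        children = 0
        for nxt in graph[node]:
            if nxt == parent:
                continue
            if nxt in disc:
                # Back edge: the subtree can reach an earlier vertex.
                low[node] = min(low[node], disc[nxt])
            else:
                children += 1
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                # A non-root vertex is a cut vertex if some child subtree
                # cannot reach above it without passing through it.
                if parent is not None and low[nxt] >= disc[node]:
                    cut.add(node)
        # The DFS root is a cut vertex only if it has several DFS children.
        if parent is None and children > 1:
            cut.add(node)

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return cut


# Toy graph: two tightly linked groups joined only through sequence "m".
toy_edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("a1", "m"), ("a2", "m"),
    ("m", "b1"), ("m", "b2"), ("b1", "b2"),
]
print(articulation_points(toy_edges))
```

In the toy graph, the single vertex joining the two groups is reported as an articulation point; a connected graph with no articulation points is biconnected. The sketch uses recursion for brevity, so a very large graph would need an iterative traversal or a raised recursion limit.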

Keywords: problem solving; control methods; search; complexity measures; performance measures; graph tree search strategies; bioinformatics; graph theory; sequence clustering; biological sequences; cluster splitting.

DOI: 10.1504/IJDMB.2006.010855

International Journal of Data Mining and Bioinformatics, 2006 Vol.1 No.2, pp.178 - 200

Available online: 07 Sep 2006
