Title: Parallel algorithms for clustering biological graphs on distributed and shared memory architectures

Authors: Inna Rytsareva; Timothy Chapman; Ananth Kalyanaraman

Addresses: School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA ' DigiPen Institute of Technology, Redmond, WA 98052, USA ' School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA

Abstract: Graph algorithms on parallel architectures present an interesting case study for irregular applications. In this paper, we address one such irregular application - one of clustering real-world graphs constructed out of biological data using parallel computers. We present the design and evaluation of two different parallel implementations of a serial graph clustering heuristic called the Shingling heuristic, which was developed by Gibson et al. In the OpenMP shared memory implementation pClust-sm, we were able to improve both the asymptotic runtime and memory complexities of the serial implementation, and drastically reduce the time to solution from the order of several days to a few minutes on larger inputs (∼100 M edges). With the Hadoop MapReduce implementation pClust-mr, we were able to demonstrate linear scaling up to 64 cores on modest sized inputs (∼11 M edges) and enhance the problem size reach by about two orders of magnitude relative to a serial implementation.

Keywords: graph clustering; shared memory architectures; OpenMP algorithm; MapReduce algorithm; hash tables; union-find data structure; Shingling heuristic; protein family identification; protein domain families; parallel computing; biological graphs; distributed architectures.

DOI: 10.1504/IJHPCN.2014.062724

International Journal of High Performance Computing and Networking, 2014 Vol.7 No.4, pp.241 - 257

Available online: 11 Jun 2014 *

Full-text access for editors Access for subscribers Purchase this article Comment on this article