
 

 

Title: Error recovery mechanism for grid-based workflow within SLA context

 

Author: Dang Minh Quan

 

Address: Paderborn Center for Parallel Computing (PC2), University of Paderborn, Fuerstenallee 11, Paderborn, 33102, Germany

 

Abstract: Service Level Agreements (SLAs) serve as a foundation for reliable and predictable job execution at remote grid sites. In this paper, we describe an error recovery mechanism for workflows within the SLA context that copes with catastrophic failures in which one or several High Performance Computing Centers (HPCCs) are detached from the grid system. We propose an algorithm to detect all affected sub-jobs when the error happens, and an algorithm to remap those sub-jobs to the remaining healthy HPCCs while optimising the makespan. The experimental results show that our mechanism finds a higher-quality solution in a shorter time than other existing methods.

 

Keywords: grid computing; service level agreements; SLA; error recovery; workflow; mapping; high performance computing.

 

DOI: 10.1504/IJHPCN.2007.015769

 

Int. J. of High Performance Computing and Networking, 2007 Vol.5, No.1/2, pp.110 - 121

 

Available online: 14 Nov 2007

 

 
