Title: A recovery mechanism for errors caused by a late subjob in a system handling SLA-based Grid workflows

Authors: Dang Minh Quan, Jorn Altmann

Addresses: School of Information Technology, International University in Germany, Germany. School of Information Technology, International University in Germany, Germany; Technology Management, Economics and Policy Program (TEMEP), School of Engineering, Seoul National University, South Korea

Abstract: Supporting SLAs (Service Level Agreements) for Grid-based workflows requires mechanisms for handling errors (i.e., the failures of subjobs). In this paper, we propose an error recovery mechanism that can handle one failed subjob of a workflow. The mechanism has at most three phases, depending on the impact of the error. In each phase, a dedicated algorithm remaps the subjobs of the workflow to the resources. The main contributions of the paper are the error recovery mechanism for SLA-based workflows and the mapping algorithm G-map, which is used in the first phase of the recovery mechanism. G-map remaps the groups of subjobs that are directly affected by an error. The efficiency of the proposed algorithm is validated through simulation results.

Keywords: grid computing; service level agreements; SLA; grid-based workflows; error recovery; late subjobs; simulation.

DOI: 10.1504/IJWGS.2008.018490

International Journal of Web and Grid Services, 2008 Vol. 4 No. 1, pp. 35-62

Published online: 25 May 2008
