Title: Reliable job execution with process failure recovery in computational grid

Authors: P. Latchoumy; P. Sheik Abdul Khader

Addresses: Department of Information Technology, B.S. Abdur Rahman University, Chennai, India; Department of Computer Applications, B.S. Abdur Rahman University, Chennai, India

Abstract: Grid computing provides a virtual framework that integrates heterogeneous resources and services distributed across multiple controlled domains. The dynamic nature of the grid environment makes the infrastructure unreliable, which can cause executing jobs to fail. The main challenge is to provide reliable job execution in the presence of resource failures. In this paper, we present a new model called reliable job execution with process failure recovery in computational grid. In this model, the reliability of sites is monitored, and the historical data is used to predict resource failures; these predictions are taken into account when user jobs are dispatched to resources. The model also recovers a failed job after a process failure has occurred. If the process failure is caused by CPU overload or memory thrashing, a backup process resumes execution from the most recently saved checkpoint. The system also ensures checkpoint availability by storing checkpoints on multiple backup checkpoint servers. The experimental results demonstrate that the proposed strategy provides guaranteed service to the grid user within the specified deadline.
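
The checkpoint replication and recovery idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names (Checkpoint, CheckpointServer, replicate, latest_checkpoint) are hypothetical, and the servers are in-memory stand-ins. It only shows how storing each checkpoint on several backup servers lets a backup process resume from the most recent checkpoint that is still reachable.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Checkpoint:
    job_id: str
    sequence: int   # increases with every checkpoint taken for the job
    state: dict     # application state captured at this point


@dataclass
class CheckpointServer:
    """Hypothetical in-memory stand-in for one backup checkpoint server."""
    name: str
    available: bool = True
    store: Dict[str, Checkpoint] = field(default_factory=dict)

    def save(self, cp: Checkpoint) -> bool:
        # An unreachable server silently misses this checkpoint.
        if not self.available:
            return False
        self.store[cp.job_id] = cp
        return True

    def load(self, job_id: str) -> Optional[Checkpoint]:
        if not self.available:
            return None
        return self.store.get(job_id)


def replicate(cp: Checkpoint, servers: List[CheckpointServer]) -> int:
    """Store the checkpoint on every reachable server; return the replica count."""
    return sum(1 for server in servers if server.save(cp))


def latest_checkpoint(job_id: str,
                      servers: List[CheckpointServer]) -> Optional[Checkpoint]:
    """Return the newest checkpoint for the job held by any reachable server."""
    newest = None
    for server in servers:
        cp = server.load(job_id)
        if cp is not None and (newest is None or cp.sequence > newest.sequence):
            newest = cp
    return newest
```

Under these assumptions, a recovery routine would call latest_checkpoint after detecting a process failure and restart the backup process from the returned state. Comparing sequence numbers matters because a server that was down during a save may still hold an older checkpoint.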

Keywords: computational grid; overall resource value; ORV; reliability; service level agreements; SLAs; reliable job execution; checkpoint replication services; process failure recovery; process recovery daemon; PRD; resource utilisation; grid computing.

DOI: 10.1504/IJICT.2015.072042

International Journal of Information and Communication Technology, 2015 Vol.7 No.6, pp.607 - 631

Received: 18 Sep 2013
Accepted: 06 Jan 2014

Published online: 30 Sep 2015
