Title: An algorithm for online distributed fault-tolerant job scheduling in grid computing

Authors: Jun Zeng

Addresses: College of Big Data and Intelligent Engineering, Yangtze Normal University, Fu Ling of Chong Qing, 408100, China

Abstract: To address the various faults that occur in grid computing environments, this paper proposes an online distributed fault-tolerant job scheduling algorithm. The algorithm consists of two main modules: a job scheduling module and a replica management and placement module. The former is based on job replication: each replica is scheduled independently at a different site, and otherwise unused resources run the replicas so that at least one replica completes successfully. In the latter, each remote separate resource manager (SRM) running a job replica reports the status of that replica to the original SRM (PSRM) at each monitoring interval. The PSRM periodically checks the application status table, queries all remote SRMs to obtain the status of the computing machines and the network, and monitors all job replicas running at the site, thereby providing fault tolerance. Experimental results show that, under various failure rates, the proposed algorithm achieves a better average job response time than other grid fault-tolerant scheduling algorithms and non-fault-tolerant scheduling algorithms.

Keywords: grid computing; online distribution model; fault-tolerant; job schedule; fault model.

DOI: 10.1504/IJWGS.2021.118411

International Journal of Web and Grid Services, 2021 Vol.17 No.4, pp.389 - 407

Received: 19 Dec 2019
Accepted: 22 Apr 2020

Published online: 25 Oct 2021
