Title: Self-healing in autonomic distributed systems based on delayed communication-induced checkpointing

Authors: Alberto Calixto Simón; Saul E. Pomares Hernandez; Jose Roberto Perez Cruz; Riadh Ben Halima; Hatem Hadj Kacem

Addresses: Universidad del Papaloapan, UNPA, Av. Ferrocarril s/n, 68400, Loma Bonita, Oax., México ' Department of Computer Science, Instituto Nacional de Astrofísica Óptica y Electrónica, INAOE, Luis Enrique Erro No. 1, 72840, Tonantzintla, Puebla, México ' Department of Computer Science, Instituto Nacional de Astrofísica Óptica y Electrónica, INAOE, Luis Enrique Erro No. 1, 72840, Tonantzintla, Puebla, México ' ReDCAD Lab, University of Sfax, ENIS, B.P. W1173, 3038, Sfax, Tunisia; CNRS, LAAS, Univ de Toulouse, 7 avenue du colonel Roche, F-31400 Toulouse, France ' University of Sfax, FSEG Sfax, Route de l'Aéroport, B.P. 1088, 3018 Sfax, Tunisia

Abstract: An autonomic distributed system is composed of geographically distributed autonomic components. One open challenge in autonomic computing is efficient runtime monitoring that collects the information from which the system itself can detect, diagnose, and repair problems that result from failures in software and/or hardware components. For this purpose, communication-induced checkpointing (CIC) can be a useful tool. CIC aims to form consistent global snapshots from which the system can recover. To achieve this, CIC solutions monitor the information exchanged among the processes to identify dangerous checkpointing patterns. When a dangerous pattern is identified, it is broken by locally triggering a forced checkpoint. Nevertheless, not all triggered forced checkpoints are necessary. In this paper, we present a delayed CIC approach that reduces forced checkpoints by using triggering rules called safe checkpoint conditions. Finally, we present simulation results that show that our proposal is more efficient than other current solutions.
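
The forced-checkpoint mechanism described in the abstract can be illustrated with a minimal sketch of a classic index-based CIC rule: each process piggybacks its checkpoint index on outgoing messages, and a receiver that sees a larger index takes a forced checkpoint before delivering the message. This is not the delayed approach or the safe checkpoint conditions proposed in the paper, which are not specified in the abstract; the names Message, Process, basic_checkpoint, send, and receive are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # sender's checkpoint index, piggybacked on the message


@dataclass
class Process:
    name: str
    sn: int = 0  # index of the latest local checkpoint
    checkpoints: list = field(default_factory=list)  # (index, kind) pairs
    delivered: list = field(default_factory=list)    # payloads delivered so far

    def _save(self, kind: str) -> None:
        # Stand-in for persisting local state to stable storage.
        self.checkpoints.append((self.sn, kind))

    def basic_checkpoint(self) -> None:
        """Checkpoint taken on the process's own schedule."""
        self.sn += 1
        self._save("basic")

    def send(self, payload: str) -> Message:
        # Piggyback the current checkpoint index on every outgoing message.
        return Message(payload, self.sn)

    def receive(self, msg: Message) -> bool:
        """Deliver msg; return True if a forced checkpoint was taken first."""
        forced = msg.sn > self.sn
        if forced:
            # The sender is ahead: checkpoint before delivery so that the
            # checkpoints sharing this index stay mutually consistent.
            self.sn = msg.sn
            self._save("forced")
        self.delivered.append(msg.payload)
        return forced


p, q = Process("p"), Process("q")
p.basic_checkpoint()               # p moves to index 1
first = q.receive(p.send("a"))     # q is behind, so it is forced to checkpoint
second = q.receive(p.send("b"))    # q has caught up, no forced checkpoint
```

In this sketch the rule fires eagerly on every message that carries a larger index. A delayed scheme of the kind the abstract describes would aim to avoid some of these forced checkpoints when they turn out to be unnecessary.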

Keywords: distributed systems; communication-induced checkpointing; delayed CIC; autonomic computing; self-healing; checkpointing patterns; simulation.

DOI: 10.1504/IJAACS.2016.079621

International Journal of Autonomous and Adaptive Communications Systems, 2016 Vol.9 No.3/4, pp.183 - 200

Published online: 06 Oct 2016
