Title: OS-level hang detection in complex software systems

 

Authors: Antonio Bovenzi, Marcello Cinque, Domenico Cotroneo, Roberto Natella, Gabriella Carrozza

 

Addresses:
Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.
Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.
Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.
Dipartimento di Informatica e Sistemistica, Universita degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy.
SESM SCARL, Via Circumvallazione Esterna di Napoli, 80014, Giugliano in Campania, Naples, Italy

 

Journal: Int. J. of Critical Computer-Based Systems, 2011, Vol. 2, No. 3/4, pp. 352-377

 

Abstract: Many critical services are nowadays provided by large and complex software systems. However, the increasing complexity introduces several sources of non-determinism, which may lead to hang failures: the system appears to be running, but some of its services are perceived as unresponsive. Online monitoring is the only way to detect and promptly react to such failures. However, in systems based on off-the-shelf components, online detection can be difficult, since instrumentation and log data collection may not be feasible in practice. This paper proposes a detection framework to cope with software hangs. The framework enables non-intrusive monitoring of complex systems, based on multiple sources of data gathered at the operating system (OS) level. The collected data are then combined to reveal hang failures. The framework is evaluated through a fault injection campaign on two complex systems from the air traffic management (ATM) domain. Results show that combining several monitors at the OS level is effective at detecting hang failures, in terms of coverage and false positives, with a negligible impact on performance.

 

Keywords: failure detection; hang failures; online monitoring; critical software systems; operating systems; complex systems; software hangs; fault injection; air traffic management; air traffic control.

 

DOI: http://dx.doi.org/10.1504/IJCCBS.2011.042333

 

Available online 04 Sep 2011

 

 
