Title: Secure live migration of parallel applications using container-based virtual machines

Authors: Thomas J. Hacker; Fabian Romero; Jeremiah J. Nielsen

Addresses: Computer and Information Technology, Purdue University, West Lafayette, IN 47907, USA; IBM, Chicago, IL 60626, USA; Computer and Information Technology, Purdue University, West Lafayette, IN 47907, USA

Abstract: A parallel application will terminate when a computational node fails. As the number of components in supercomputers increases and applications scale to use these systems, the mean time to failure decreases. Traditional fault tolerance approaches, such as checkpointing, are failing to scale. An alternative approach we explore in this paper is the use of VM-based live migration to move a process from a failing node to a healthy one to reduce the fault rate experienced by an application. We investigate the use of a virtualisation environment based on OpenVZ to perform live migrations of virtual machines on which multi-processor parallel applications are running. We explore the correctness, performance, security, and reliability of this approach along with the overhead of using OS-level virtualised systems for fault recovery. Our results confirm that it is possible to efficiently migrate virtual containers without affecting the correctness or completion of parallel applications.

Keywords: reliability; fault tolerance; network architecture; network design; high performance computing; network security; parallel computing; live migration; node failure; virtual machines; virtual containers; virtualisation.

DOI: 10.1504/IJSSC.2012.045562

International Journal of Space-Based and Situated Computing, 2012 Vol.2 No.1, pp.45 - 57

Received: 10 Jun 2011
Accepted: 21 Nov 2011

Published online: 20 Sep 2014
