Title: Master node fault tolerance in distributed big data processing clusters

Authors: Ivan Gankevich; Yuri Tipikin; Vladimir Korkhov; Vladimir Gaiduchok; Alexander Degtyarev; A. Bogdanov

Addresses: Dept. of Computer Modelling and Multiprocessor Systems, Saint Petersburg State University, Universitetskaia emb. 7-9, 199034 Saint Petersburg, Russia (same address for all six authors)

Abstract: Distributed computing clusters are often built from commodity hardware, whose relatively low reliability leads to periodic failures of processing nodes. While worker node fault tolerance is straightforward, fault tolerance of the master node poses a bigger challenge. In this paper, master node failure handling is based on the concept of master and worker roles that can be dynamically re-assigned to cluster nodes, combined with maintaining a backup of the master node state on one of the worker nodes. In this case, no special component is needed to monitor the health of the cluster, and master node failures can be resolved except when the master and the backup fail simultaneously. We present an experimental evaluation of an implementation of the technique and show benchmarks demonstrating that a failure of the master does not affect the running job, and that a failure of the backup results in re-computation of only the last job step.
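
The role re-assignment idea summarised in the abstract can be illustrated with a small toy model. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Cluster`, `JobState`, `run_step` and `fail` are hypothetical, the "computation" is a placeholder, and a backup failure is simply modelled as one extra execution of the current step. Simultaneous failure of master and backup is not modelled.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class JobState:
    """State the master must preserve: results of the steps completed so far."""
    results: list = field(default_factory=list)


class Cluster:
    """Toy model: master and backup are roles, not fixed machines (hypothetical API)."""

    def __init__(self, names):
        if len(names) < 2:
            raise ValueError("need at least a master and a backup node")
        self.alive = list(names)
        # Assumed convention: the first node starts as master, the second as backup.
        self.master, self.backup = self.alive[0], self.alive[1]
        self.master_state = JobState()
        self.backup_state = JobState()  # copy of the master state kept on the backup
        self.executions = 0             # counts step executions, including re-runs

    def _pick_backup(self):
        """Assign the backup role to any surviving node other than the master."""
        candidates = [n for n in self.alive if n != self.master]
        if not candidates:
            raise RuntimeError("no node left to hold the backup")
        self.backup = candidates[0]
        self.backup_state = copy.deepcopy(self.master_state)

    def fail(self, node):
        """Remove a node; return True if the lost node held the backup role."""
        master_died = node == self.master
        backup_died = node == self.backup
        self.alive.remove(node)
        if master_died:
            # The backup takes over the master role and resumes from its copy.
            self.master = self.backup
            self.master_state = self.backup_state
            self._pick_backup()
        elif backup_died:
            self._pick_backup()
        return backup_died

    def run_step(self, step, fail=None):
        """Run one job step; `fail` optionally names a node that dies during it."""
        self.executions += 1
        if fail is not None and self.fail(fail):
            # Assumption: losing the backup costs one re-run of the current step only.
            self.executions += 1
        self.master_state.results.append(step * step)  # placeholder computation
        self.backup_state = copy.deepcopy(self.master_state)
```

In this model, killing the master during a step leaves the job results intact because the former backup continues from its copy, while killing the backup adds exactly one extra step execution. How the real system detects failures and transfers state is not described in the abstract and is not represented here.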

Keywords: parallel computing; big data processing; distributed computing; backup node; state transfer; delegation; cluster computing; fault-tolerance; high-availability; hierarchy.

DOI: 10.1504/IJBIDM.2019.101264

International Journal of Business Intelligence and Data Mining, 2019 Vol.15 No.2, pp.158 - 172

Received: 03 Feb 2017
Accepted: 06 May 2017

Published online: 30 Jul 2019
