Title: Scalable data management for map-reduce-based data-intensive applications: a view for cloud and hybrid infrastructures

Authors: Gabriel Antoniu; Alexandru Costan; Julien Bigot; Frédéric Desprez; Gilles Fedak; Sylvain Gault; Christian Pérez; Anthony Simonet; Bing Tang; Christophe Blanchet; Raphael Terreux; Luc Bougé; François Briant; Franck Cappello; Kate Keahey; Bogdan Nicolae; Frédéric Suter

Addresses: INRIA Rennes – Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes Cedex, France ' INRIA Rennes – Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes Cedex, France ' LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France ' LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France ' LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France ' LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France ' LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France ' LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France ' LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France ' CNRS/Université Lyon 1, Institut de Biologie et Chimie des Protéines, 7 Passage du Vercors, 69 367 Lyon cedex 07, France ' CNRS/Université Lyon 1, Institut de Biologie et Chimie des Protéines, 7 Passage du Vercors, 69 367 Lyon cedex 07, France ' ENS Cachan – Antenne de Bretagne, Campus de Ker Lann, Avenue Robert Schuman, 35170 Bruz, France ' IBM Products and Solutions Support Center, 1 Rue de la Vieille Poste, 34006 Montpellier, France ' Joint INRIA-UIUC Laboratory for Petascale Computing, National Center for Supercomputing Applications, 9 West Clark Street, Urbana, IL 61801, USA ' Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA ' IBM Research, Server 3, Damastown Industrial Park, Mulhuddart, Dublin 15, Ireland ' CNRS, CC IN2P3, Domaine Scientifique de La Doua, 43 bd du 11 Novembre 1918, 69622 Villeurbanne Cedex, France

Abstract: As map-reduce emerges as a leading programming paradigm for data-intensive computing, today's frameworks that support it still have substantial shortcomings that limit its potential scalability. In this paper, we discuss several directions where there is room for progress: they concern storage efficiency under massive data access concurrency, scheduling, volatility and fault-tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach which aims to overcome the current limitations of existing map-reduce frameworks, in order to achieve scalable, concurrency-optimised, fault-tolerant map-reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bio-informatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.

Keywords: map-reduce; cloud computing; desktop grids; hybrid infrastructures; bioinformatics; task scheduling; fault tolerance; scalable data management; cloud infrastructures; data-intensive computing; storage efficiency; massive data; access concurrency; volatility.

DOI: 10.1504/IJCC.2013.055265

International Journal of Cloud Computing, 2013 Vol.2 No.2/3, pp.150 - 170

Published online: 28 Feb 2014
