Title: Scalable data management for map-reduce-based data-intensive applications: a view for cloud and hybrid infrastructures

 

Author: Gabriel Antoniu; Alexandru Costan; Julien Bigot; Frédéric Desprez; Gilles Fedak; Sylvain Gault; Christian Pérez; Anthony Simonet; Bing Tang; Christophe Blanchet; Raphael Terreux; Luc Bougé; François Briant; Franck Cappello; Kate Keahey; Bogdan Nicolae; Frédéric Suter

 

Addresses:
INRIA Rennes – Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes Cedex, France
INRIA Rennes – Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes Cedex, France
LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France
LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France
LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France
LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France
LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France
LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France
LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France
CNRS/Université Lyon 1, Institut de Biologie et Chimie des Protéines, 7 Passage du Vercors, 69 367 Lyon cedex 07, France
CNRS/Université Lyon 1, Institut de Biologie et Chimie des Protéines, 7 Passage du Vercors, 69 367 Lyon cedex 07, France
ENS Cachan – Antenne de Bretagne, Campus de Ker Lann, Avenue Robert Schuman, 35170 Bruz, France
IBM Products and Solutions Support Center, 1 Rue de la Vieille Poste, 34006 Montpellier, France
Joint INRIA-UIUC Laboratory for Petascale Computing, National Center for Supercomputing Applications, 9 West Clark Street, Urbana, IL 61801, USA
Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
IBM Research, Server 3, Damastown Industrial Park, Mulhuddart, Dublin 15, Ireland
CNRS, CC IN2P3, Domaine Scientifique de La Doua, 43 bd du 11 Novembre 1918, 69622 Villeurbanne Cedex, France

 

Journal: Int. J. of Cloud Computing, 2013 Vol.2, No.2/3, pp.150 - 170

 

Abstract: While map-reduce is emerging as a leading programming paradigm for data-intensive computing, today's frameworks that support it still have substantial shortcomings that limit its potential scalability. In this paper, we discuss several directions where there is room for progress: storage efficiency under massive data access concurrency, scheduling, volatility and fault tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach that aims to overcome the limitations of existing map-reduce frameworks in order to achieve scalable, concurrency-optimised, fault-tolerant map-reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bioinformatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.

 

Keywords: map-reduce; cloud computing; desktop grids; hybrid infrastructures; bioinformatics; task scheduling; fault tolerance; scalable data management; cloud infrastructures; data-intensive computing; storage efficiency; massive data; access concurrency; volatility.

 

DOI: 10.1504/IJCC.2013.055265


 

 
