Title: Scalable data management for map-reduce-based data-intensive applications: a view for cloud and hybrid infrastructures

 

Author: Gabriel Antoniu; Alexandru Costan; Julien Bigot; Frédéric Desprez; Gilles Fedak; Sylvain Gault; Christian Pérez; Anthony Simonet; Bing Tang; Christophe Blanchet; Raphael Terreux; Luc Bougé; François Briant; Franck Cappello; Kate Keahey; Bogdan Nicolae; Frédéric Suter

 

Address: INRIA Rennes – Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes Cedex, France; INRIA Rennes – Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes Cedex, France; LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France; LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France; LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France; LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France; LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France; LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France; LIP/ENS Lyon, 46 allée d'Italie, 69364 Lyon cedex 7, France; CNRS/Université Lyon 1, Institut de Biologie et Chimie des Protéines, 7 Passage du Vercors, 69 367 Lyon cedex 07, France; CNRS/Université Lyon 1, Institut de Biologie et Chimie des Protéines, 7 Passage du Vercors, 69 367 Lyon cedex 07, France; ENS Cachan – Antenne de Bretagne, Campus de Ker Lann, Avenue Robert Schuman, 35170 Bruz, France; IBM Products and Solutions Support Center, 1 Rue de la Vieille Poste, 34006 Montpellier, France; Joint INRIA-UIUC Laboratory for Petascale Computing, National Center for Supercomputing Applications, 9 West Clark Street, Urbana, IL 61801, USA; Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA; IBM Research, Server 3, Damastown Industrial Park, Mulhuddart, Dublin 15, Ireland; CNRS, CC IN2P3, Domaine Scientifique de La Doua, 43 bd du 11 Novembre 1918, 69622 Villeurbanne Cedex, France

 

Journal: Int. J. of Cloud Computing, 2013 Vol.2, No.2/3, pp.150 - 170

 

Abstract: As map-reduce emerges as a leading programming paradigm for data-intensive computing, today's frameworks that support it still have substantial shortcomings that limit its potential scalability. In this paper, we discuss several directions where there is room for progress: storage efficiency under massive data access concurrency, scheduling, volatility and fault tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach which aims to overcome the limitations of existing map-reduce frameworks, in order to achieve scalable, concurrency-optimised, fault-tolerant map-reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bioinformatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.

 

Keywords: map-reduce; cloud computing; desktop grids; hybrid infrastructures; bioinformatics; task scheduling; fault tolerance; scalable data management; cloud infrastructures; data-intensive computing; storage efficiency; massive data; access concurrency; volatility.

 

DOI: 10.1504/IJCC.2013.055265


 

 
