
Title: Improving Map-Reduce for GPUs with cache

Authors: Arun Kumar Parakh; M. Balakrishnan; Kolin Paul

Addresses: Department of Computer Science and Engineering, Indian Institute of Technology Delhi, New Delhi, India; Department of Computer Science and Engineering, Indian Institute of Technology Delhi, New Delhi, India; Department of Computer Science and Engineering, Indian Institute of Technology Delhi, New Delhi, India

Abstract: Applications need specific or custom optimisations to fully exploit the compute capabilities of the underlying hardware. This is often a very tedious task for the programmer. Moreover, many of these applications do not scale well with data size. The Map-Reduce (MR) framework provides a high level of abstraction for mapping these applications onto distributed/parallel architectures, but with a large performance penalty. We analyse a state-of-the-art MR framework to assess its performance penalty. The primary objective of this work is to reduce the performance gap between MR and native compute unified device architecture (CUDA) implementations of the applications (onlyCUDA). This work reports the deployment of three applications on graphics processor units (GPUs) using the MR framework. We study the performance of these applications on modern GPUs with different cache configurations. The results show that the performance of the applications with the MR framework does not decline much if the reconfigurable cache of modern GPUs is utilised properly. We show penalty reductions of 5×, 6.45× and 15.87× for the Smith-Waterman (SW) algorithm, N-body (NB) simulation, and the Blowfish (BF) algorithm, respectively.

Keywords: Map-Reduce; onlyCUDA; CUDA; penalty; performance; SmithWaterman; N-Body; Blowfish; graphics processor units; GPUs; CPU; cache; speed; simulation.

DOI: 10.1504/IJHPSA.2015.070392

International Journal of High Performance Systems Architecture, 2015 Vol.5 No.3, pp.166 - 177

Received: 09 Feb 2015
Accepted: 10 Feb 2015
Published online: 04 Jul 2015
