Title: Performance evaluation of OpenMP's target construct on GPUs - exploring compiler optimisations

Authors: Akihiro Hayashi; Jun Shirako; Ettore Tiotto; Robert Ho; Vivek Sarkar

Addresses: Department of Computer Science, Rice University, Houston, TX, USA; Department of Computer Science, Rice University, Houston, TX, USA; IBM Canada Laboratory, 8200 Warden Ave, Markham, ON L6G 1C7, Canada; IBM Canada Laboratory, 8200 Warden Ave, Markham, ON L6G 1C7, Canada; Department of Computer Science, Rice University, Houston, TX, USA

Abstract: OpenMP is a directive-based shared-memory parallel programming model that has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C, C++, or Fortran without exposing too many details of GPU architectures. However, such high-level programming models generally shift the burden of program optimisation onto compilers and runtime systems; otherwise, OpenMP programs can be slower than fully hand-tuned, or even naive, implementations written in low-level programming models such as CUDA. To study potential performance improvements from compiling and optimising high-level programs for GPU execution, in this paper we: 1) evaluate a set of OpenMP benchmarks on two NVIDIA Tesla GPUs (K80 and P100); 2) conduct a comparative performance analysis between hand-written CUDA programs and GPU code automatically generated by the IBM XL and clang/LLVM compilers.

Keywords: GPUs; OpenMP; CUDA; LLVM; XL compiler; NVPTX; NVVM; Kepler; Pascal; performance evaluation; compilers; OpenMP's target constructs.

DOI: 10.1504/IJHPCN.2019.097051

International Journal of High Performance Computing and Networking, 2019 Vol.13 No.1, pp.54 - 69

Received: 13 Feb 2017
Accepted: 23 Apr 2017

Published online: 17 Dec 2018
