Title: Performance evaluation of OpenMP's target construct on GPUs - exploring compiler optimisations
Authors: Akihiro Hayashi; Jun Shirako; Ettore Tiotto; Robert Ho; Vivek Sarkar
Addresses: Department of Computer Science, Rice University, Houston, TX, USA; Department of Computer Science, Rice University, Houston, TX, USA; IBM Canada Laboratory, 8200 Warden Ave, Markham, ON L6G 1C7, Canada; IBM Canada Laboratory, 8200 Warden Ave, Markham, ON L6G 1C7, Canada; Department of Computer Science, Rice University, Houston, TX, USA
Abstract: OpenMP is a directive-based shared-memory parallel programming model that has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions to accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran, without exposing the details of GPU architectures. However, such high-level programming models shift a greater optimisation burden onto compilers and runtime systems; without those optimisations, OpenMP programs can be slower than fully hand-tuned, or even naive, implementations in low-level programming models like CUDA. To study the potential performance improvements from compiling and optimising high-level programs for GPU execution, in this paper we: 1) evaluate a set of OpenMP benchmarks on two NVIDIA Tesla GPUs (K80 and P100); 2) conduct a comparative performance analysis of hand-written CUDA programs and GPU programs automatically generated by the IBM XL and clang/LLVM compilers.
Keywords: GPUs; OpenMP; CUDA; LLVM; XL compiler; NVPTX; NVVM; Kepler; Pascal; performance evaluation; compilers; OpenMP's target constructs.
International Journal of High Performance Computing and Networking, 2019 Vol.13 No.1, pp.54 - 69
Available online: 11 Dec 2018