Authors: Ahmad Lashgar; Amirali Baniasadi
Addresses: ECE Department, University of Victoria, Engineering Lab Wing Building, Room A220, Victoria, BC, Canada; ECE Department, University of Victoria, Engineering Office Wing Building, Room 323, Victoria, BC, Canada
Abstract: OpenACC's programming model presents a simple interface to programmers, offering a trade-off between performance and development effort. OpenACC relies on compiler technologies to generate efficient code and optimise performance. The cache directive is among the directives that are challenging to implement. It allows the programmer to utilise the accelerator's hardware- or software-managed caches by passing hints to the compiler. In this paper, we investigate the implementation aspects of the cache directive on NVIDIA-like GPUs and propose optimisations for the CUDA backend. We use CUDA's shared memory as the software-managed cache space. We first show that a straightforward implementation can be very inefficient and can degrade performance. We investigate the differences between this implementation and hand-written CUDA alternatives and introduce the following optimisations to bridge the performance gap between the two: 1) improving occupancy by sharing the cache among several parallel threads; 2) optimising cache fetch and write routines via parallelisation and minimising control flow. Investigating three test cases, we show that the best cache directive implementation can perform very close to the hand-written CUDA equivalent and improve performance by up to 2.4× compared to the baseline OpenACC.
Keywords: OpenACC; cache memory; CUDA; software-managed cache; performance; GPGPUs.
International Journal of High Performance Computing and Networking, 2019 Vol.13 No.1, pp.35 - 53
Received: 13 Feb 2017
Accepted: 23 Apr 2017
Published online: 11 Dec 2018