
Title: Array streaming for array programming

Authors: Mads R.B. Kristensen; James E. Avery

Addresses: University of Copenhagen, Niels Bohr Institute, Denmark; University of Copenhagen, Niels Bohr Institute, Denmark

Abstract: A barrier to efficient array programming, for example in Python/NumPy, is that algorithms written as pure array operations completely without loops, while most efficient on small input, can lead to explosions in memory use. This paper presents a solution to this problem using array streaming, implemented in the automatic parallelisation high-performance framework Bohrium. This makes it possible to use array programming in Python/NumPy code directly, even when the apparent memory requirement exceeds the machine capacity, since the automatic streaming eliminates the temporary memory overhead by performing calculations in per-thread registers. Using Bohrium, we automatically fuse, stream, JIT-compile, and execute NumPy array operations on GPGPUs without modification to the user programs. We present performance evaluations of three benchmarks, all of which show dramatic reductions in memory use from streaming, yielding corresponding improvements in speed and utilisation of GPGPU cores. The fusion step is implemented using the theoretical framework presented in Kristensen et al. (2016), using a streaming-maximising cost function. The streaming-enabled Bohrium effortlessly runs programs on input sizes several orders of magnitude beyond sizes that crash on pure NumPy due to exhausting system memory.
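
The memory problem the abstract describes can be illustrated with a small sketch. It is not Bohrium's implementation and uses no Bohrium API: the function names `array_style` and `fused_style` are hypothetical, and the hand-written loop only stands in for what a fusing, streaming backend might generate automatically. The first function is written as pure array operations, so every step allocates a full-size temporary; the second keeps each intermediate in a scalar local variable, which is roughly the role per-thread registers play in the paper's description.

```python
import math
import numpy as np


def array_style(x, y):
    """Pure array operations: each step materialises a full-size temporary."""
    t1 = x * x        # temporary array, same size as x
    t2 = y * y        # second temporary array
    t3 = t1 + t2      # third temporary array
    t4 = np.sqrt(t3)  # fourth temporary array
    return t4.sum()


def fused_style(x, y):
    """Hand-fused equivalent (illustrative only): intermediates are scalars.

    No array-sized temporary is ever allocated; only the running total
    and the per-element values exist at any time.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        total += math.sqrt(xi * xi + yi * yi)
    return total


if __name__ == "__main__":
    x = np.arange(1000, dtype=np.float64)
    y = np.ones(1000, dtype=np.float64)
    # The two versions agree up to floating-point rounding.
    print(np.isclose(array_style(x, y), fused_style(x, y)))
```

In plain Python the fused loop is far slower than the NumPy version; the point of the sketch is only the memory behaviour, which is what a JIT-compiled, streamed kernel would aim to keep while retaining array-level speed.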

Keywords: JIT-compilation; high-productivity; Python; OpenCL; OpenMP; Bohrium; NumPy; Numba; Cython; GP-GPU.

DOI: 10.1504/IJCSE.2018.095848

International Journal of Computational Science and Engineering, 2018 Vol.17 No.3, pp.263 - 282

Received: 18 Jan 2017
Accepted: 13 Apr 2017
Published online: 25 Oct 2018


