Publisher: Inderscience Publishers

Title: Optimising MPI tree-based communication for NUMA architectures

Authors: Christer Karlsson; Zizhong Chen

Addresses: South Dakota School of Mines and Technology, Rapid City, SD 57701, USA; University of California, Riverside, CA 92521, USA

Abstract: Today's computer clusters are often composed of many multi-core processors that are networked together. With this architecture, communication between cores on different nodes is often an order of magnitude slower than communication between cores on the same node. Cores on the same processor communicate faster than cores on different processors on the same node. Most MPI implementations assume a homogeneous network. In this paper, we treat a multi-core node as a heterogeneous unit and optimise MPI scatter/gather communications by scheduling using topology information. We demonstrate that previous heuristics for heterogeneous clusters do improve performance, but might not produce optimal results for communications on multi-core nodes. Our solution modifies the fastest edge first heuristic by accounting for how many messages can be sent in parallel without impeding the bandwidth. We are able to achieve 20% to 30% performance gains over the MPI scatter/gather implementation on homogeneous, multi-core nodes.
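
Illustration: the abstract builds on the fastest edge first heuristic without restating it. The following is a minimal sketch of a plain fastest-edge-first style greedy schedule for a broadcast-like tree, as the heuristic is commonly described for heterogeneous clusters. It is not the authors' modified heuristic: it does not model how many messages a channel can carry in parallel. The function name `fastest_edge_first`, the `cost` matrix, and the example costs are hypothetical and chosen only for illustration.

```python
def fastest_edge_first(cost, root=0):
    """Greedy schedule: repeatedly pick the cheapest edge from a process
    that already holds the data to one that does not (illustrative only)."""
    n = len(cost)
    ready = {root: 0.0}   # time at which each informed process is free to send
    schedule = []         # entries are (sender, receiver, start, finish)
    while len(ready) < n:
        # Cheapest edge crossing from informed to uninformed processes;
        # ties on cost are broken by the sender that is free earliest.
        sender, receiver = min(
            ((s, r) for s in ready for r in range(n) if r not in ready),
            key=lambda e: (cost[e[0]][e[1]], ready[e[0]]),
        )
        start = ready[sender]
        finish = start + cost[sender][receiver]
        ready[sender] = finish    # sender is busy until the transfer completes
        ready[receiver] = finish  # receiver may forward once it has the data
        schedule.append((sender, receiver, start, finish))
    return schedule, max(ready.values())


# Hypothetical costs: processes 0 and 1 share a node, as do 2 and 3.
# Intra-node transfers cost 1 unit, inter-node transfers cost 10 units.
example_cost = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]
example_schedule, example_makespan = fastest_edge_first(example_cost)
```

In this sketch each sender handles one transfer at a time. The abstract's point is that on a multi-core node several messages may proceed in parallel without impeding the bandwidth, which is the aspect a channel-aware variant would have to add.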

Keywords: message passing interface; MPI; multi-core processors; computer clusters; binomial tree; fastest edges first; channel aware ordering; CAO; optimisation; NUMA architectures; scheduling; topology; non-uniform memory access; multi-core nodes.

DOI: 10.1504/IJAACS.2015.073190

International Journal of Autonomous and Adaptive Communications Systems, 2015 Vol.8 No.4, pp.407 - 423

Received: 23 Sep 2013
Accepted: 11 Oct 2013
Published online: 27 Nov 2015
