Title: High-performance and low-power VLIW cores for numerical computations

Authors: Miquel Pericas, Eduard Ayguade, Javier Zalamea, Josep Llosa, Mateo Valero

Addresses: Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Jordi Girona, 1–3, Modul D6, Campus Nord, 08034 Barcelona, Spain (all authors)

Abstract: Issue logic is among the worst-scaling structures in a modern microprocessor. Increasing the issue width increases processor area exponentially, and bigger processors have inherently larger wire delays. In this scenario, technology scaling yields smaller performance improvements because wire delays do not decrease; instead, they start to dominate the clock cycle. To offer higher performance, the wire problem needs to be tackled. This paper discusses two methods that attempt to move the wire problem off the critical path. The first is the clustering technique, which approaches the wire problem directly by combining several smaller execution cores in the processor backend to perform the computations. Each core has a smaller issue width and a much smaller area. The second is the widening technique, which consists of reducing the issue width of the processor while giving the instructions SIMD capabilities. The degree of parallelism here is small (normally two to four) and does not resemble multimedia or vector extensions. Wide processors use wide functional units that compute the same operation on multiple words. The rationale is that reducing the issue width (but not the computational bandwidth) also reduces the issue logic circuitry and the complexity of structures such as the register file and the cache memory. Compared with a centralised core with 128 registers, 8 FPUs and 4 memory ports, our approach, using an equivalent number of hardware units, achieves speedups of up to 1.7.
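
The widening rationale (a narrower issue width with unchanged computational bandwidth) can be illustrated with a toy model. This is only a sketch under idealised assumptions that are not taken from the paper: a fully parallel loop, no dependences, no memory stalls, and hypothetical configurations and function names (`compute_bandwidth`, `cycles_for_loop`).

```python
from math import ceil


def compute_bandwidth(issue_width: int, widening: int) -> int:
    """Words operated on per cycle when every issued operation covers `widening` words."""
    return issue_width * widening


def cycles_for_loop(iterations: int, ops_per_iteration: int,
                    issue_width: int, widening: int) -> int:
    """Idealised cycle count for a fully parallel loop (assumption: no stalls or dependences)."""
    # One wide operation processes `widening` consecutive iterations at once.
    wide_iterations = ceil(iterations / widening)
    total_ops = wide_iterations * ops_per_iteration
    # Each cycle can issue at most `issue_width` operations.
    return ceil(total_ops / issue_width)


# Hypothetical configurations (issue width, widening degree), chosen only for illustration.
configs = {
    "8-issue, scalar": (8, 1),
    "4-issue, 2-wide": (4, 2),
    "2-issue, 4-wide": (2, 4),
}

for name, (issue_width, widening) in configs.items():
    bandwidth = compute_bandwidth(issue_width, widening)
    cycles = cycles_for_loop(1024, 6, issue_width, widening)
    print(f"{name}: {bandwidth} words/cycle, {cycles} cycles")
```

In this simplified model all three configurations sustain the same number of words per cycle, while the narrower ones need fewer issue slots. Real schedules depend on loop dependences, register pressure and memory behaviour, none of which this sketch captures.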

Keywords: ILP; VLIW cores; clustering; FPU widening; floating point units; modulo scheduling; energy-delay; numerical computations; high performance computing; issue logic; issue width reduction.

DOI: 10.1504/IJHPCN.2004.008346

International Journal of High Performance Computing and Networking, 2004 Vol.1 No.4, pp.171 - 179

Published online: 07 Dec 2005
