Title: A latency-conscious SMT branch prediction architecture

Authors: Ayose Falcon, Oliverio J. Santana, Alex Ramirez, Mateo Valero

Addresses: Computer Architecture Department, UPC, Jordi Girona 1-3, Modulo D6, 08034 Barcelona, Spain (all authors)

Abstract: Executing multiple threads has proved to be an effective solution for partially hiding the latencies that appear in a processor. When a thread stalls because a long-latency operation, such as a memory access or a floating-point calculation, is being processed, the processor can switch to another context so that another thread can take advantage of the idle resources. However, fetch stall conditions caused by branch predictor delay are not hidden by current simultaneous multithreading (SMT) fetch designs, causing a performance drop due to the absence of instructions to execute. In this paper, we propose several solutions to reduce the effect of branch predictor delay on the performance of SMT processors. Firstly, we analyse the impact of varying the number of access ports. Secondly, we describe a decoupled implementation of an SMT fetch unit that helps to tolerate the predictor delay. Finally, we present an interthread pipelined branch predictor, based on creating a pipeline of interleaved predictions from different threads. Our results show that, by combining all the proposed techniques, the performance obtained is similar to that of an ideal 1-cycle-access branch predictor.
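The interthread pipelining idea sketched in the abstract can be illustrated with a toy throughput simulation. The sketch below is an illustrative model only, not the paper's actual design: the 3-cycle predictor latency, the round-robin thread-selection policy, and the single pipelined prediction port are all assumptions chosen for the example. It shows how starting a new prediction for a different thread each cycle hides the multi-cycle predictor access that would otherwise stall a single thread.

```python
from typing import Dict

PREDICTOR_LATENCY = 3  # hypothetical multi-cycle predictor access (assumption)

def predictions_completed(num_threads: int, cycles: int) -> int:
    """Toy model of an interthread-pipelined branch predictor.

    Each cycle, at most one new prediction may enter the (pipelined)
    predictor, chosen round-robin among threads; a thread must wait for
    its in-flight prediction to finish before starting the next one.
    Returns how many predictions complete within `cycles` cycles.
    """
    in_flight: Dict[int, int] = {}  # thread id -> cycle its prediction completes
    next_thread = 0
    completed = 0
    for cycle in range(cycles):
        # Retire any predictions that finish this cycle.
        for t in [t for t, done in in_flight.items() if done <= cycle]:
            del in_flight[t]
            completed += 1
        # Start at most one new prediction (single pipelined port),
        # picking the next thread round-robin that has none in flight.
        for _ in range(num_threads):
            t = next_thread
            next_thread = (next_thread + 1) % num_threads
            if t not in in_flight:
                in_flight[t] = cycle + PREDICTOR_LATENCY
                break
    return completed

# With one thread, the fetch engine waits out the full predictor latency
# between predictions; with enough threads, interleaving sustains close
# to one completed prediction per cycle.
print(predictions_completed(1, 30))  # → 9  (one prediction every 3 cycles)
print(predictions_completed(3, 30))  # → 27 (one per cycle after the pipeline fills)
```

The single-thread case completes a prediction only every `PREDICTOR_LATENCY` cycles, while three interleaved threads keep the predictor pipeline full, which is the intuition behind the interthread pipelined predictor described in the paper.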

Keywords: SMT processors; simultaneous multithreading; fetch engine; branch prediction delay; decoupled predictor; predictor pipelining; memory latencies; multiple threads; interthread pipelined branch predictor; high performance computing.

DOI: 10.1504/IJHPCN.2004.009264

International Journal of High Performance Computing and Networking, 2004 Vol.2 No.1, pp.11 - 21

Published online: 14 Mar 2006
