Title: An efficient fault tolerant mechanism to deal with permanent and transient failures in a network on chip

Authors: Muhammad Ali, Michael Welzl, Sven Hessler, Sybille Hellebrand

Addresses: Institute of Computer Science, University of Innsbruck, Austria; Institute of Computer Science, University of Innsbruck, Austria; Institute of Computer Science, University of Innsbruck, Austria; Department of Electrical Engineering, University of Paderborn, Germany

Abstract: Recent advances in silicon technology are enabling VLSI chips to accommodate billions of transistors, leading toward the incorporation of hundreds of heterogeneous components on a single chip. However, the scaling of chips poses serious problems for current bus-based interconnect architectures, which cannot cope with the growing number of on-chip components. To remedy the inefficiency of buses, researchers have drawn on computer networks and parallel computing to find viable solutions for billion-transistor chips. The outcome is a novel and scalable communication paradigm for future Systems on Chip (SoCs), called Network on Chip (NoC). However, as chips scale, the probability of both permanent and transient faults also increases, making Fault Tolerance (FT) a key concern. Alpha particle emissions and Gaussian noise on channels are among the causes of transient faults in data. In addition, electromigration of conductors, corrosion or ageing may cause permanent damage to on-chip modules or links. This paper proposes a comprehensive solution to deal with both permanent and transient errors affecting VLSI chips. On the one hand, we present an efficient packet retransmission mechanism to deal with packet corruption or loss due to transient faults. On the other hand, we propose a deterministic routing mechanism that routes packets on alternate paths when a communication link or a router suffers permanent failure.

Keywords: network-on-chip; NoC; fault tolerance; self-healing; dynamic routing; reliable packet delivery; VLSI; packet retransmission.

DOI: 10.1504/IJHPSA.2007.015397

International Journal of High Performance Systems Architecture, 2007 Vol.1 No.2, pp.113 - 123

Published online: 14 Oct 2007
