Abstract

Scaling CMOS process technology continues to enable increased levels of system integration, leading to on-chip communication demands beyond what traditional digital signaling techniques can efficiently provide with sufficient reliability. In this paper we survey the state of the art of on-chip interconnect techniques for improving performance, energy, and reliability and provide a review of interconnect reliability considerations. Finally, we provide a case study to evaluate the efficiency of error correcting codes on a state-of-the-art energy-efficient low-swing interconnect.

1. Introduction

Energy-efficient on-chip communication has increasingly become a primary concern in large IC designs such as systems-on-chip (SoC) and chip-multiprocessors. While CMOS technology scaling has improved computational performance through increases in transistor density and speed, the energy and latency required to move data between distant on-chip locations at these scaled technology nodes are increasingly becoming design bottlenecks [1–3]. As such, techniques for providing energy-efficient on-chip communication have been explored with increasing interest in the last decade.

As on-chip systems continue to grow in size and complexity, increased device counts, process variation, and reduced signal margins mandate that interconnect reliability considerations take on new priority, even in circuits that were traditionally considered robust. The results are inefficient overmargining of signal levels and timing, reduced yields, and higher susceptibility to both static and transient errors [4]. As a result, both static manufacturing flaws and single-event-induced errors can unacceptably reduce the mean time to failure (MTTF) of traditionally robust circuits [5]. Ensuring the robustness and reliability of energy-efficient interconnect circuits should therefore be a design priority if such circuits are to be considered a viable design choice.

The remainder of this paper is organized as follows. Section 2 surveys recent on-chip interconnect techniques. Section 3 reviews causes of reduced interconnect reliability. Section 4 reviews recent work on error-coding implementations for on-chip interconnect and presents a case study of the area, delay, and energy overhead of applying various error-coding schemes to a low-swing, energy-efficient on-chip interconnect operating at low supply voltages. Finally, Section 5 concludes.

2. On-Chip Interconnect Circuits

On-chip interconnect is a general term for a wide variety of circuit implementations designed to transport information across a chip. Energy efficiency, performance, and implementation overhead have typically been the primary design goals, and the desired tradeoffs between these often competing goals have been pursued through custom circuits, synthesized encoding schemes, and architectural decisions. While fully digital implementations have typically been able to assume bit-error rates (BERs) better than 10^-15 over the operating range of voltages and frequencies, based on the margining provided by digital EDA tools, that assumption does not necessarily hold in custom low-swing interconnect implementations, or even in modern deep submicron (DSM) circuits in general, without the inefficiency of overmargining. In this section we survey recent interconnect implementations with varying design goals; a summary of interconnect circuit implementations is given in Table 1.

2.1. Conventional Logic-Based Interconnects

Because implementing custom low-swing interconnect transceiver circuits is more complex and costly than synthesizing standard-cell-based “transceivers,” previous works have targeted improving the energy efficiency, performance, and/or reliability of conventional standard-cell-based interconnects rather than introducing custom circuits. To this end, [6] proposes a dynamic voltage and frequency scaling (DVFS) approach for network-on-chip (NoC) interconnect circuits that uses recent network traffic history to estimate the current throughput demand of the network and adjusts the interconnect voltage and frequency to just meet that demand. However, this technique can only reduce the signaling voltage down to the Vmin of the digital logic surrounding the interconnect wires.

References [7, 8] propose charge recycling schemes for energy reduction in full-swing interconnect circuits; [7] also applies an information-theoretic approach to studying the fundamental minimum energy required to pass information over on-chip interconnect when encoding schemes are used. Further, [7] introduces an encoding scheme that improves performance by limiting the transitions that cause crosstalk-induced link delays. In the same vein, [9–11] introduce encoding schemes that reduce link energy consumption by reducing line transitions.
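As a concrete illustration of transition-reducing encoding, the sketch below implements bus-invert coding, a classic member of this family. It is shown only as a representative example and is not necessarily the specific code used in [9–11]; the 8-bit bus width, the single extra invert line, and all function names are hypothetical choices for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions that differ between two bus words."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width=8):
    """Bus-invert coding: send each word or its complement, whichever toggles fewer lines.

    Returns a list of (bus_value, invert_flag) pairs; the invert line is counted
    as one more wire when comparing transition costs.
    """
    mask = (1 << width) - 1
    prev_bus, prev_inv = 0, 0  # assumed reset state of the bus and invert line
    out = []
    for w in words:
        w &= mask
        cost_plain = hamming_distance(w, prev_bus) + (prev_inv != 0)
        cost_inv = hamming_distance(~w & mask, prev_bus) + (prev_inv != 1)
        if cost_inv < cost_plain:
            bus, inv = ~w & mask, 1
        else:
            bus, inv = w, 0
        out.append((bus, inv))
        prev_bus, prev_inv = bus, inv
    return out


def bus_invert_decode(encoded, width=8):
    """Undo the optional inversion at the receiver."""
    mask = (1 << width) - 1
    return [(~bus & mask) if inv else bus for bus, inv in encoded]


def count_transitions(encoded):
    """Total wire toggles (data lines plus invert line) starting from an all-zero bus."""
    total, prev_bus, prev_inv = 0, 0, 0
    for bus, inv in encoded:
        total += hamming_distance(bus, prev_bus) + (inv != prev_inv)
        prev_bus, prev_inv = bus, inv
    return total
```

For the worst-case sequence 0x00, 0xFF, 0x00, 0xFF, sending the words unencoded toggles 24 lines, while bus-invert coding toggles only 3 (the invert line alone), at the cost of one extra wire and a small encoder.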

Reference [12] proposes a pulse-width modulation scheme in which three bits are encoded into the clock-relative pulse delay and pulse width in order to reduce power by limiting switching activity on the bus. However, while this approach can be implemented mostly in the digital domain, it requires custom delay and link arbiter cells, operates at low speeds, and requires wires longer than approximately 9 mm before breaking even with the transceiver overheads.

2.2. Low-Swing Interconnect Circuits

Recent custom low-swing on-chip interconnect circuits have been shown to provide high-performance, energy-efficient alternatives to conventional full-swing digital signaling. However, improvements in signaling energy and performance typically come at the expense of increased design complexity, larger circuit and wiring footprints, or decreased noise margins. Previously demonstrated links therefore typically target global interconnect lengths of up to 10 mm, where these overheads are more easily absorbed in exchange for the large energy savings achieved by replacing repeated full-swing digital drivers.

Reference [13] evaluates a wide variety of high-to-low and low-to-high swing interconnect driver and receiver structures for use in interconnect signaling and compares their performance on the basis of SNR, receiver offset and sensitivity, crosstalk susceptibility, power supply rejection, and energy-delay product.

More recent signaling techniques for long wires focus on increasing the useful bandwidth of the wire, which acts as a low-pass RC filter. Capacitively driven links, proposed in [14], use a feed-forward AC-coupling capacitor to increase the effective signaling bandwidth of the channel while reducing the supply voltage. References [15–18] extend this approach by applying feed-forward equalization (FFE) and decision feedback equalization (DFE) techniques, traditionally used in high-speed serial transceivers for off-chip links, to improve the performance and reliability of long on-chip interconnect wires. However, these methods either result in floating AC-coupled wires, which are more susceptible to interference than actively driven nodes, or require static bias currents that burn power even when no signaling is taking place.
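The swing reduction in a capacitively driven link can be understood as a capacitive divider between the series coupling capacitor and the wire capacitance. The sketch below is a first-order model only, assuming an ideal lumped divider that ignores the wire's distributed resistance, any DC bias network, and leakage; the function names and the capacitance values in the example are hypothetical.

```python
def coupled_swing(v_drive: float, c_couple: float, c_wire: float) -> float:
    """Swing seen on the wire for an ideal series-capacitor divider (volts)."""
    return v_drive * c_couple / (c_couple + c_wire)


def drive_energy_per_rising_edge(v_drive: float, c_couple: float, c_wire: float) -> float:
    """Energy drawn from the driver supply on a rising edge (joules).

    The driver sees the series combination of the coupling and wire
    capacitances, so it charges a smaller capacitance than the bare wire.
    """
    c_series = c_couple * c_wire / (c_couple + c_wire)
    return c_series * v_drive ** 2
```

For example, with a hypothetical 100 fF coupling capacitor driving 200 fF of wire from a 1 V driver, the wire swings by about 333 mV and the supply delivers roughly one third of the energy needed to charge the bare wire full swing.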

While cross-chip global communication is the most obvious use of custom low-swing interconnect circuits, applications such as 2D mesh NoCs and SoCs with many discrete heterogeneous components often require many shorter wires of just 1–2 mm. In these cases, the wire capacitance still dominates the energy required for communication, especially in a multihop environment where a 10 mm distance is traversed in smaller steps. However, the wire bandwidth, which limits the bit rate on 10 mm wires, has a much smaller effect on the achievable bit rate between adjacent cores, which is a prime motivator for the NoC paradigm [19]. To this end, a body of work exists proposing approaches to efficient communication over the relatively short interconnects expected in these systems.

Reference [20] extends previous work on capacitively driven interconnect transceivers by removing equalization circuits that are unnecessary on shorter wires and optimizing transceiver area for use in a parallel bus configuration rather than a serial link. References [21, 22] provide examples of low-swing, dual-supply differential wire drivers that achieve excellent energy efficiency on interconnects from 1 mm to 5 mm at the expense of very modest increases in wiring area. Reference [21] further extends this technique by integrating compact low-swing interconnect transceivers into the crossbar switch of NoC routers, an NoC component whose wire-dominated loading has been observed to consume as much as 30% of NoC communication energy.

Reference [23] explores the limits of low-swing interconnect energy efficiency by proposing offset correction and DFE techniques for traditional sense amplifiers that are scalable to subthreshold supply voltages and significantly reduce receiver offset without increasing energy. Further, [23] proposes a capacitive charge-sharing transmitter circuit that allows adjustable low-swing signaling at levels that can be matched to a receiver's measured input offset, reducing the need to overmargin signal swing to account for device offset.

2.3. Low-Latency Serial Links

On-chip serial links exploit the transmission-line properties of on-chip wires to achieve near speed-of-light communication. References [24, 25] demonstrate point-to-point connections with the potential to provide data rates upwards of 40 Gbps with latencies of approximately 10 ps/mm. However, the large transceiver areas and wide multilayer wire paths these approaches use to achieve the required inductance and low resistance result in worse bit-rate-per-wire-area figures of merit than simpler RC-mode signaling. Nevertheless, at the architectural level, a small number of global, low-latency links, such as those proposed in [26] for network flow control and in [27] for multicast signaling, could be used to provide efficient automatic repeat request (ARQ) signaling in error detect-and-recover schemes such as the one described using conventional signals in [5]; such uses deserve further exploration.
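The "near speed-of-light" latency can be sanity-checked against the ideal time of flight of a lossless line in a homogeneous dielectric, which is sqrt(εr)/c per unit length. This is a minimal sketch assuming a uniform dielectric; the relative permittivity of 3.9 (roughly that of SiO2) is a hypothetical input, and real on-chip lines with low-k stacks, non-ideal return paths, and dispersion will deviate from this bound.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s in vacuum


def flight_time_ps_per_mm(eps_r: float) -> float:
    """Ideal time of flight per millimetre for a TEM wave in a uniform dielectric."""
    velocity = SPEED_OF_LIGHT / math.sqrt(eps_r)  # m/s
    return 1e12 * 1e-3 / velocity  # seconds per mm, expressed in ps
```

For εr = 3.9 this gives about 6.6 ps/mm, so the approximately 10 ps/mm quoted above is within a factor of two of the ideal bound.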

2.4. Other Interconnect Techniques

Less conventional on-chip interconnect techniques include RF-band signaling [28], which shows the potential to provide extremely high bandwidth density but requires large areas and complex transceiver designs. Optical communication channels, a long-discussed communication medium for silicon devices, are proposed in [29] for hybrid on- and off-chip signaling in order to bridge the inter- and intrachip communication systems of many-core processors and external memories. Using wavelength-division multiplexing, these techniques show potential for extremely high bandwidth densities, but they require additional off-chip optical systems as well as on-chip circuit area, limiting their potential as an on-chip interconnect to a small number of nodes separated by large distances.

3. Sources of Error

Here we consider two situations in which errors can be introduced to interconnect circuits and discuss potential sources which can cause these errors to occur.

The first is a flip of the value stored in the receiver output latch. Under nominal operating and design conditions this stored value is quite stable; however, in the presence of large process variation, the noise margin of the storage element can be degraded. This leaves the storage cell more susceptible to large external interference sources, such as a radiation impact event or coupling from a strongly driven and poorly routed clock signal, which may overcome the noise margin of the SR latch.

The second error condition occurs when the low-swing signal is sampled incorrectly. Interference and noise sources can induce transient voltages on the interconnect wire or on internal receiver nodes during a sampling event, or static defects in the receiver circuit can cause an incorrect evaluation during sampling; in either case, an erroneous value is stored in the receiver output latch.

Process variation may also cause static logical errors such as stuck-at faults; however, a well-developed field already provides digital fault coverage and yield analysis using advanced tool suites that can be applied to interconnect circuits, so this fault type is not addressed in this paper.

3.1. Crosstalk

Crosstalk from neighboring signal wires can induce errors in the form of glitches, signal speedup, or signal slowdown, as described in [30]. Wire twisting [31] has been shown to mitigate neighbor crosstalk on differential pairs by forcing coupling to occur equally on each wire of the pair. Further, encoding can be applied to reduce the occurrence of aggressive neighbor transitions by disallowing the events most likely to cause error-inducing crosstalk. While this incurs additional circuit overhead, [7] shows that encoding schemes that reduce the switching activity of the wires can both improve performance and reduce energy.
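Crosstalk-induced delay is often summarized by the effective coupling capacitance each switching wire sees: zero units for a neighbor switching in the same direction, one unit for a static neighbor, and two units for a neighbor switching in the opposite direction (the Miller effect). The sketch below classifies a bus transition this way. It is a simplified first-order model that ignores non-adjacent coupling, and the function names are hypothetical; crosstalk-avoidance codes work by forbidding the word sequences that produce the largest factors.

```python
def transitions(prev: int, cur: int, width: int) -> list[int]:
    """Per-wire transition: +1 rising, -1 falling, 0 static (bit 0 is wire 0)."""
    return [((cur >> i) & 1) - ((prev >> i) & 1) for i in range(width)]


def coupling_factors(prev: int, cur: int, width: int) -> list[int]:
    """Coupling-capacitance multiplier seen by each switching wire.

    Each adjacent neighbour contributes |d_i - d_j|: 0 (same direction),
    1 (neighbour static) or 2 (opposite direction). Static wires get 0 here;
    they risk glitches rather than delay.
    """
    d = transitions(prev, cur, width)
    factors = []
    for i in range(width):
        if d[i] == 0:
            factors.append(0)
            continue
        f = 0
        for j in (i - 1, i + 1):
            if 0 <= j < width:
                f += abs(d[i] - d[j])
        factors.append(f)
    return factors


def worst_case_coupling(prev: int, cur: int, width: int) -> int:
    """Largest multiplier on the bus for this transition (0 to 4 for inner wires)."""
    return max(coupling_factors(prev, cur, width))
```

For a 3-bit bus, the transition 101 to 010 drives the middle wire against both neighbors and yields the worst-case factor of 4, whereas 000 to 111 yields 0 because all wires move together.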

Low-swing differential interconnects are resilient to common-mode crosstalk from nearby full-swing signals; however, asymmetric crosstalk remains a concern. Adding ground shielding around signal wires provides excellent protection but increases signal wire capacitance and requires significant additional wiring area. A more judicious approach to shielding is presented in [21], which adds shielding only between signal pairs: this reduces asymmetric coupling while still allowing common-mode interference onto the wires, and therefore adds less capacitance to the signaling wires.

3.2. Intersymbol Interference

Intersymbol interference (ISI) occurs when residual energy from previous bits (post-cursor ISI) or next bits (pre-cursor ISI) is present on the wire at the time the current value is sampled by the receiver. The deterministic nature of ISI allows a variety of techniques to be used to suppress its effect. In addition to FFE, DFE, and ISI-reducing encoding, which were introduced previously, [32] uses line multiplexing to temporally space the use of wires and thus reduce the impact of ISI on their signals, while [33] proposes error-rate-controlled DVFS that increases the delay, and thus adaptively reduces the effect of ISI, when errors are detected.
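To make the DFE idea concrete, the sketch below passes ±1 symbols through a hypothetical channel with one main cursor h0 and one post-cursor tap h1, then applies a one-tap DFE that subtracts h1 times the previous decision before slicing. It is a noise-free illustration of the principle under assumed coefficients, not a model of the circuits in [15–18] or [23]; the function names and the idle level before the first symbol are assumptions.

```python
def channel(bits, h0=1.0, h1=0.6):
    """Map bits to +/-1 and add one post-cursor ISI tap from the previous symbol."""
    out = []
    prev = -1.0  # assumed idle level before the first symbol
    for b in bits:
        s = 1.0 if b else -1.0
        out.append(h0 * s + h1 * prev)
        prev = s
    return out


def dfe(samples, h1_est=0.6):
    """One-tap decision feedback equalizer.

    Returns (decisions, equalized_samples). Each sample has the estimated
    post-cursor contribution of the previous *decision* removed before slicing.
    """
    decisions, equalized = [], []
    prev = -1.0  # same assumed idle level as the channel model
    for x in samples:
        y = x - h1_est * prev
        equalized.append(y)
        bit = 1 if y > 0.0 else 0
        decisions.append(bit)
        prev = 1.0 if bit else -1.0
    return decisions, equalized
```

With h1 = 0.6, the worst-case vertical eye opening before equalization shrinks to h0 - h1 = 0.4, while the DFE restores it to the full main cursor of 1.0. Because decisions are fed back, a wrong decision corrupts the next correction (error propagation), which is one reason the tap weight must be set carefully.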

When sense-amplifier circuits are used as receivers in low-swing interconnect circuits, hysteresis in the amplifier structure has the same effect as ISI and can thus be modeled as a residual voltage at the inputs or as a transient receiver offset. To this end, [23] proposes an energy-efficient, tunable hysteretic DFE circuit embedded in the receiver that exploits this property, either correcting for existing hysteresis or tuning the hysteresis value to compensate for ISI. This is done by asymmetrically controlling the tail nodes of a pseudodifferential sense amplifier, which significantly reduces the additional capacitive loading of internal receiver nodes compared with other embedded DFE approaches.

3.3. Radiation-Induced Soft Errors

Single-event, radiation-induced errors can occur when alpha particles or cosmic rays pass through a p-n junction and induce a current. While radiation impact events occur in normal operating environments, they have traditionally been of concern primarily to space and military applications. Devices designed for these applications are required to operate reliably in environments where higher-energy particles are more common (e.g., space) and therefore more likely to change a stored value when an impact occurs. However, shrinking device sizes and capacitances mean that even the lower-energy particle impacts that are more common in normal operating conditions are increasingly likely to cause soft errors. This is typically of most concern for memory cells, which, due to their small capacitance, take relatively little energy to change value. For this reason, applications that demand high levels of reliability often apply radiation-hardened-by-design (RHBD) layout techniques, such as using edgeless transistors or increasing SRAM cell capacitance. Error codes such as single-error correcting, double-error detecting (SECDED) codes are also commonly used in memory systems to detect and correct errors, and even to periodically scrub stored data to remove errors [34, 35].

Interconnect signal wires are at low risk of radiation-induced errors because currents from radiation impact events are induced only in active regions; such an event can therefore impart charge to a signal wire only via an impact at the transmitter, and the large RC of the wire limits its effect at the receiver. However, when receiver designs target minimal area, such as [21, 23], the receiver storage nodes may be susceptible, particularly around the evaluation clock edge. To evaluate the susceptibility of a structure to particle impact, a current pulse can be applied as described in [36], or alternatively a design layout can be evaluated for robustness using EDA tools as described in [37].
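One widely used current-pulse model for a particle strike is the double-exponential pulse I(t) = Q / (τf − τr) · (exp(−t/τf) − exp(−t/τr)), whose integral equals the collected charge Q. The sketch below assumes this model with hypothetical charge and time constants; whether [36] uses exactly this form is not stated here. The sampled (time, current) pairs could be fed to a piecewise-linear current source injected at a sensitive receiver node.

```python
import math


def double_exp_current(t, q_coll, tau_rise, tau_fall):
    """Double-exponential strike current in amperes (t in seconds).

    Requires tau_fall > tau_rise; the pulse integrates to q_coll over t >= 0.
    """
    if t < 0:
        return 0.0
    return q_coll / (tau_fall - tau_rise) * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))


def pwl_points(q_coll, tau_rise, tau_fall, t_stop, n=200):
    """Sample the pulse uniformly on [0, t_stop] as (time, current) pairs."""
    step = t_stop / (n - 1)
    return [(i * step, double_exp_current(i * step, q_coll, tau_rise, tau_fall)) for i in range(n)]
```

For example, pwl_points(10e-15, 5e-12, 50e-12, 1e-9) describes a hypothetical 10 fC strike with a 5 ps rise and a 50 ps decay over the first nanosecond.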

3.4. Process Variation

Process variation is used here as a general term that includes random dopant fluctuation (RDF), line edge roughness (LER), and other device parameter variations that can typically be modeled as a change in the width, length or effective threshold voltage of a device. Errors introduced by process variation appear in the form of differences in delay for driven signals and input offset and delay for receiver circuits. However, because these errors are generally static in nature they can potentially be mitigated by embedded measurement and calibration circuits [23], localized adaptive body biasing and voltage scaling [38], or device and block sparing [39].
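Because variation-induced receiver offset is largely static, it can be measured once and trimmed. The sketch below shows a successive-approximation calibration loop under stated assumptions: an idealized, noise-free comparator, inputs shorted during calibration, and a hypothetical 5-bit bipolar trim DAC with 2 mV steps. It illustrates the general idea of embedded measurement and calibration and is not the specific circuit of [23]; all names are hypothetical.

```python
def comparator(v_diff: float, v_offset: float, v_trim: float) -> int:
    """Idealized comparator: 1 if the offset-shifted, trimmed input is positive."""
    return 1 if v_diff + v_offset - v_trim > 0.0 else 0


def trim_voltage(code: int, n_bits: int, lsb: float) -> float:
    """Hypothetical bipolar trim DAC centred at mid-code."""
    return (code - (1 << (n_bits - 1))) * lsb


def calibrate_offset(v_offset: float, n_bits: int = 5, lsb: float = 2e-3) -> int:
    """Binary-search the trim code, MSB first, with the receiver inputs shorted."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        # Output 1 means the trim is still below the offset, so keep this bit.
        if comparator(0.0, v_offset, trim_voltage(trial, n_bits, lsb)):
            code = trial
    return code
```

For a 7.3 mV offset the loop settles on code 19 (a 6 mV trim), leaving a residual offset below one trim step; in a real design, comparator noise during calibration would also limit the achievable residual.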

3.5. Clocking

Clock skew and jitter arise from the other error sources described in this section; however, they have a direct effect on the introduction of errors because low-swing interconnect systems depend on an accurate clock signal to sample arriving data correctly. Sufficient timing margins are employed to keep clock skew within design specifications, while judicious power and clock network design and routing, as well as local decoupling near clock buffers, help to reduce jitter. For large chip designs, however, margining results in unacceptable performance degradation when it must account for clock skew and jitter across long on-chip distances and across temperature and process gradients. For this reason, large systems often employ more complex clocking, such as the globally asynchronous, locally synchronous (GALS) schemes used in [40, 41], to limit the size of the clock domain across which jitter and skew must be tightly controlled. These may be combined with wave-pipelining [20] or forwarded-clock interconnect schemes [21] to resynchronize data at its destination.

3.6. Supply Noise

The majority of supply noise in digital systems originates from the simultaneous switching of devices at each clock cycle, which results in resistive voltage drop and inductive voltage changes across packaging and power distribution wires. While this source of noise can in some cases be considered deterministic interference and be compensated for, the scale and complexity of modern digital ICs, as well as the implementation overhead, limit the feasibility of active correction. Instead, sufficient power supply routing and parallel bond pads are used to limit the resistance and inductance so that the voltage drop stays within a set design limit. In general, for interconnect circuits handled in a conventional design flow, supply noise is of concern primarily because of its contribution to clock jitter. However, low-swing signaling implementations can be more susceptible to supply noise, or may not be characterized as robustly as traditional digital cells, and may therefore still introduce errors. For this reason, differential low-swing signaling, which has superior power supply rejection due to its differential nature, is often preferred, as in [14–17, 20–23], while single-ended techniques are more susceptible [13].

3.7. Temperature

Temperature gradients across a chip are caused by differences in switching activity between chip regions. While on-chip temperature gradients change extremely slowly relative to the period of the clock, the temperature of a typical commercial processor can change dramatically over a period of seconds or minutes. Because temperature has a direct effect on many of the characteristics of transistor operation, we identify three temperature-related concerns to interconnect circuits.

First, the delay characteristics of the transceiver circuits, including the clock distribution network, will change over time, introducing slowly time-varying skew. Designs can be given sufficient margins that these delay changes will not typically introduce errors. Additionally, forwarded-clocking, wave-pipelining, and mesochronous clocking schemes help to limit the effect that spatial clock skew distributions caused by temperature gradients have on system operation. Second, in low-swing interconnect circuits, sense-amplifier offset characteristics may change over time, requiring either additional signal margin or periodic recalibration. Finally, thermal noise levels in sense-amplifier receivers, analyzed extensively for DRAM systems, for example in [42], will change over time, requiring that transceivers again either be designed with sufficient signal margins or adaptively adjust signal levels in order to achieve a target BER.
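The link between temperature, thermal noise, and BER can be sketched with two textbook relations: sampled thermal noise of rms value sqrt(kT/C), and a Gaussian-tail error probability when that noise exceeds a fixed signal margin. This is a deliberately simplified model under stated assumptions: the receiver noise is taken to be kT/C noise on a hypothetical 1 fF node, errors occur only when noise exceeds the margin in one direction, and offset and other noise sources are ignored. Real sense-amplifier noise analysis, such as [42], is more involved. The function names are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K


def ktc_noise_rms(temp_k: float, cap_f: float) -> float:
    """RMS sampled thermal (kT/C) noise voltage in volts."""
    return math.sqrt(BOLTZMANN * temp_k / cap_f)


def gaussian_ber(margin_v: float, sigma_v: float) -> float:
    """Probability that zero-mean Gaussian noise exceeds the signal margin (one tail)."""
    return 0.5 * math.erfc(margin_v / (sigma_v * math.sqrt(2.0)))


def margin_for_ber(target_ber: float, sigma_v: float, hi: float = 1.0, iters: int = 100) -> float:
    """Smallest margin (volts) that meets target_ber, found by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gaussian_ber(mid, sigma_v) > target_ber:
            lo = mid
        else:
            hi = mid
    return hi
```

Under these assumptions a 1 fF node at 300 K has about 2.0 mV rms noise, and a BER of 10^-15 needs a margin of roughly 7.9 sigma. Heating to 400 K raises sigma by sqrt(4/3), about 15%, so the required margin grows by the same factor, which is the kind of drift an adaptive signal-level scheme would track.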

4. Case Study

While avoiding errors by providing sufficient signal levels and timing margins is the preferred methodology in today's design flows and EDA tools, recent work shows the potential inefficiency of such overmargining, especially given the extreme variation in device, signal, and timing parameters that must be considered in subthreshold operation [43]. Ensuring reliable operation of energy-efficient communication channels is an obvious prerequisite to their adoption, and an alternative to extreme overmargining in these high-variability operating conditions is the ability to detect and react to errors. Once an error is identified, it can be handled in a variety of ways, depending on the particular system and application requirements and constraints. For instance, the information can be corrected at the receiver using redundant information encoded with forward error correction (FEC), a request can be made to resend the corrupted information (automatic repeat request, ARQ), or the data can simply be discarded. Further, if the error source can be identified, further errors can be mitigated by adaptive techniques that adjust system parameters at runtime.

In this section, we study the overhead of implementing error detection and correction schemes at the link level on the dual-supply, low-swing interconnect circuit presented in [23], which is optimized for area- and energy-efficient differential operation across supply voltages of 0.5 V to 1.0 V. Analysis of the architectural, network-level, and application-level implications of such error correction mechanisms, such as the additional buffering needed to handle an ARQ event, is left for future work where it has not already been performed.

4.1. Background and Related Work

Recent works have recognized that the reliability of on-chip interconnect circuits is quickly becoming a limiting design factor; [5] points out that an interconnect BER of 10^-12 on a 16-node, 200 MHz NoC corresponds to a mean time to failure of just 26 minutes. This has motivated the exploration of techniques to improve system reliability without resorting to the traditional means of adding timing margin, including many of the error mitigation techniques reviewed in the previous section. References [5, 44, 45] take a different approach, applying error-correcting and error-detecting codes to traditionally highly reliable full-swing interconnect circuits, thereby reducing both the need for overdesigned voltage and timing margins and their impact on system performance and energy. To this end, [44] explores the energy overhead of implementing a variety of error-coding schemes on a 32-bit AMBA bus, and [45] extends this evaluation to scaled bus voltages. Reference [5] examines the problem from the perspective of a network architect, evaluating the network overhead of implementing FEC and ARQ in a NoC environment. Here we extend these works with a case study evaluating the overheads of applying similar error-coding schemes to a custom, differential, low-swing interconnect circuit.
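The reasoning behind such MTTF figures is that, with independent errors at a constant BER, the expected time to the first error is the reciprocal of the error rate, that is, 1 / (BER × aggregate bit rate). The sketch below encodes this relation. The link count, link width, clock rate, and activity factor are hypothetical inputs, and reproducing the 26-minute figure of [5] would require that work's exact traffic and link-width assumptions, which are not restated here.

```python
def aggregate_bit_rate(n_links: int, link_width_bits: int, clock_hz: float, activity: float = 1.0) -> float:
    """Total bits per second crossing all links; every argument is an assumption."""
    return n_links * link_width_bits * clock_hz * activity


def mttf_seconds(ber: float, bits_per_second: float) -> float:
    """Mean time to the first bit error, assuming independent errors at a fixed BER."""
    return 1.0 / (ber * bits_per_second)
```

For example, a hypothetical aggregate of 10^9 bit/s at a BER of 10^-12 gives an MTTF of 1000 s, about 17 minutes; the MTTF shrinks in direct proportion as more links and wider links are added.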

4.2. Error Code Implementations

Five well-known error-coding schemes were implemented and simulated in HSPICE using 65 nm standard cells and optimized for energy efficiency in order to characterize the area, delay, and energy overhead of each scheme for encoding and decoding four information bits. These are single-error detecting (SED), single-error correcting (SEC), double-error detecting (DED), four-error detecting cyclic redundancy check (CRC-5), and single-error correcting, double-error detecting (SECDED) codes. The coding schemes and their characteristics are listed in Table 2 in conventional (n,k) format, where k is the number of information bits and n is the total number of bits, including parity.
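For reference, an (8,4) SECDED code can be built as a Hamming (7,4) code plus one overall parity bit. The sketch below is a bit-level software model using one common bit ordering (parity at positions 1, 2, and 4, data at 3, 5, 6, and 7, overall parity at position 0). The parity-check matrices of the 65 nm implementations summarized in Table 2 are not given here, so this ordering and the function names are assumptions. In hardware, each parity equation maps to a small XOR tree.

```python
def secded_encode(data: int) -> list[int]:
    """Encode 4 data bits (0..15) into an 8-bit extended Hamming codeword."""
    d = [(data >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]  # covers positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]  # covers positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]  # covers positions with bit 2 set
    c[0] = sum(c[1:]) % 2      # overall parity over the Hamming codeword
    return c


def secded_decode(codeword):
    """Return (data, status); status is 'ok', 'corrected', or 'double' (data None)."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    overall = sum(c) % 2  # 0 when overall parity holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single error; syndrome 0 means the overall-parity bit itself flipped.
        c[syndrome] ^= 1
        status = "corrected"
    else:
        return None, "double"
    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return data, status
```

Every single-bit error is corrected and every double-bit error is flagged, which is the detection capability an ARQ mechanism would rely on.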

Analyses of each method's error-detection efficacy and effect on BER are given in [44] and [45], respectively; therefore, only the implementation overheads with respect to the interconnect circuit structure are addressed here, in order to give the reader easy-to-use metrics for evaluating each code as a viable design choice.

To evaluate the efficiency of a particular error-coding scheme, we first consider the following equations modeling the energy on the interconnect. Equation (1) describes the energy per bit transmitted on the dual-supply interconnect for a selected supply voltage, where C_wire is the wire capacitance, V_sw1 is the interconnect signal swing required to achieve an acceptable BER without error coding, and E_txrx is the combined transmitter and receiver energy per bit sent, which is essentially constant across signal swing values for a given supply voltage. Equation (2) describes the energy per bit of original information transmitted when error coding is applied with signal swing V_sw2, and includes two additional terms: E_ecc, the energy to encode a data word as reported in Table 2, and an overhead factor k (distinct from the information-bit count k of the (n,k) notation above), which accounts for the additional transceiver and wire energy required to send the additional parity bit(s):

$$E_{b\text{-}noecc} = \frac{1}{2} C_{wire} V_{sw1}^{2} + E_{txrx} \tag{1}$$

$$E_{b\text{-}ecc} = k\left(\frac{1}{2} C_{wire} V_{sw2}^{2} + E_{txrx}\right) + E_{ecc} \tag{2}$$

Next, we consider under what conditions the interconnect implementation with error coding is more energy efficient than the original interconnect without error coding, and solve for the error-coded voltage swing V_sw2 that results in less energy than its uncoded counterpart while achieving the same or better BER:

$$E_{b\text{-}ecc} < E_{b\text{-}noecc} \tag{3}$$

$$V_{sw2}^{2} < \frac{1}{k}\left(V_{sw1}^{2} + \frac{2\left(E_{txrx}(1-k) - E_{ecc}\right)}{C_{wire}}\right) \tag{4}$$

Using (4), we can plot the positive, real values of ΔV = V_sw1 − V_sw2 versus V_sw1 for each error-coding implementation to illustrate the minimum voltage swing reduction at which the error code becomes an energy-efficient BER reduction technique for the interconnect circuit. This is shown for a supply voltage of 500 mV and a 1 mm differential wire pair in Figure 1, while Figure 2 provides a frame of reference for comparing the relative overhead to link length.
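Equations (1) to (4) can be evaluated numerically to reproduce this kind of break-even analysis. The sketch below is a minimal model in which the overhead factor is supplied directly; for an (n,k) code one might take n/k (for example, 2 for an (8,4) code), but that mapping is an assumption. All capacitance and energy values in the example are hypothetical, not the measured values behind Figure 1, and the function names are illustrative.

```python
import math


def energy_per_bit_noecc(c_wire: float, v_sw1: float, e_txrx: float) -> float:
    """Equation (1): energy per information bit without error coding (J)."""
    return 0.5 * c_wire * v_sw1 ** 2 + e_txrx


def energy_per_bit_ecc(c_wire: float, v_sw2: float, e_txrx: float, e_ecc: float, k_overhead: float) -> float:
    """Equation (2); k_overhead is the overhead factor, not the information-bit count."""
    return k_overhead * (0.5 * c_wire * v_sw2 ** 2 + e_txrx) + e_ecc


def min_delta_v(c_wire: float, v_sw1: float, e_txrx: float, e_ecc: float, k_overhead: float):
    """Smallest swing reduction V_sw1 - V_sw2 at which coding breaks even, per Eq. (4).

    Returns None when the right-hand side of (4) is not positive, i.e. when
    even a zero coded swing cannot beat the uncoded link.
    """
    rhs = (v_sw1 ** 2 + 2.0 * (e_txrx * (1.0 - k_overhead) - e_ecc) / c_wire) / k_overhead
    if rhs <= 0.0:
        return None
    return v_sw1 - math.sqrt(rhs)
```

With hypothetical values of C_wire = 200 fF, E_txrx = 5 fJ, E_ecc = 1 fJ, and an overhead factor of 2, a 300 mV uncoded swing must drop by roughly 178 mV before coding saves energy, while a 100 mV uncoded swing cannot break even at all.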

From Figure 1, we can see that because the dual-supply interconnect already provides an acceptable BER with a 100 mV signal swing, none of the codes considered would provide an energy benefit on the 1 mm wire, even if the signaling swing were reduced to 0 mV; that is, the right-hand side of (4) is not positive in this case.

5. Conclusion

Continued CMOS process scaling and system integration continue to increase on-chip communication demands beyond what conventional digital signaling can efficiently provide. To this end, we presented a survey of interconnect implementation techniques, comparing both encoding schemes and low-swing implementations for improving energy efficiency, performance, or reliability.

Reliability in these interconnect systems is a prerequisite for their adoption and is of particular concern when operating at reduced voltage swings, near performance limits, or in the presence of large device variations. We reviewed potential sources of error introduction to interconnect systems in order to provide a better understanding of how these sources may be suppressed or otherwise mitigated in a manner most appropriate and efficient to the source and application.

Finally, the energy efficiency of the studied error codes was found to be insufficient to warrant their application to a particular dual-supply low-swing interconnect circuit. However, while error-mitigation techniques appear to offer better energy efficiency than error detection and correction schemes, we suggest that the latter may reduce the requirements on voltage and timing margins, resulting in higher-performing systems that could achieve the same or improved reliability. Additionally, characteristics of the chosen process technology play a significant role in determining whether an error-coding scheme is warranted, suggesting that error coding for on-chip interconnects may become more viable with continued scaling.

While the preceding analysis considered transient error introduction in interconnect circuits, link transceivers were assumed to be logically functional, which may not be an accurate assumption without extreme signal and timing margins in subthreshold designs or multibillion-transistor processors. Therefore, this analysis should be extended to consider hardware redundancy (device sparing) by including the probability of device failure (e.g., a nonfunctional transceiver line or unacceptably high receiver offset).

Further, while we show that the energy overhead of error coding is generally too high for use with the dual-supply transmitter analyzed under normal operating conditions, future work should explore the potential of these error-coding schemes as a feedback mechanism for implementing BER-adaptive energy and performance control, or for identifying nonfunctional circuit blocks that may be replaced with redundant hardware. This detect-and-react paradigm will likely continue to grow in appeal as CMOS scaling continues to present new challenges from increased communication demands, device density and variation, reduced signal margins, and a nonscaling (or even shrinking) power envelope.