The Scientific World Journal
Volume 2014, Article ID 286084, 12 pages
http://dx.doi.org/10.1155/2014/286084
Research Article

Characterizing the Effects of Intermittent Faults on a Processor for Dependability Enhancement Strategy

1Multicore Research Institute, High Performance CPU Center, Tsinghua University, Building F.I.T, Beijing 100084, China
2School of Computer Science, Harbin Institute of Technology, No. 155 Fanrong Street, Nangang District, Harbin 150001, China
3School of Computer and Communication Engineering, University of Science and Technology Beijing, No. 30 Xueyuan Road, Haidian District, Beijing 100083, China

Received 31 August 2013; Accepted 17 March 2014; Published 28 April 2014

Academic Editors: J. Shu and F. Yu

Copyright © 2014 Chao(Saul) Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

As semiconductor technology scales into the nanometer regime, intermittent faults have become an increasing threat. This paper focuses on the effects of intermittent faults on NET versus REG on the one hand and the implications for dependability strategy on the other. First, the vulnerability characteristics of representative units in OpenSPARC T2 are revealed, and, in particular, the highly sensitive modules are identified. Second, an arch-level dependability enhancement strategy is proposed, showing that events such as core/strand running status and core-memory interface events can be candidates for detectable symptoms. A simple watchdog can be deployed to detect application running status (IEXE event). The SDC (silent data corruption) rate is then evaluated to demonstrate the potential of the strategy. Third and last, the effects of traditional protection schemes in the target CMT on intermittent faults are quantitatively studied in terms of the contribution of each trap type, demonstrating the necessity of taking this factor into account for the strategy.

1. Introduction

Semiconductor technology scaling into the nanometer regime has impelled a resurgence of interest in intermittent faults. The driving forces include shrinking geometries, smaller interconnect dimensions, lower power voltages, and decreased noise margins, all of which have a negative impact on the dependability of circuits under transient, permanent, and, in particular, intermittent faults [1, 2]. In addition, it is forecast that multicore is more vulnerable to intermittent faults in future technology [3].

Unlike transient faults, intermittent faults occur in bursts. Also, in contrast to permanent faults, they arise only in particular situations and do not persist. The following characteristics distinguish intermittent from transient and permanent faults.

(i) Burst. Intermittent faults occur in bursts whose duration can vary across a wide range of timescales from orders of cycles to even milliseconds or more.
(ii) Nonrepeatability. Intermittent faults (e.g., caused by defects) are expected to arise under particular situations (e.g., elevated temperature, voltage droops, etc.).
(iii) Fixed Location. Once activated, intermittent faults repeatedly occur at the same location or from the same module of a processor. Consequently, replacement of the offending component eliminates intermittent faults, which is in contrast to transient faults which cannot be fixed by repair [4].

Above all, intermittent faults are expected to become more frequent in the nanometer regime and have become an increasing threat to multicore.

The above distinguishing features and the complicated sources of failures (SOFs) of intermittent faults leave many uncertainties to be explored. To the best of our knowledge, we are the first to adopt a SPARC T2 chip multithreading (CMT) processor as a case study to characterize the fault effects. On this basis, a dependability enhancement strategy is proposed. This paper focuses on fault effects on NET versus REG on the one hand and the implications for dependability enhancement strategy on the other. Major contributions are as follows.

First, a detailed evaluation of the vulnerability characteristics is made using sensitivity metrics. The target CMT is exercised with two workloads, one memory-intensive and one CPU-intensive. A similar trend in the effect of intermittent faults is revealed for both, and, in particular, the common highly sensitive modules are identified. This corroborates that the susceptibility characteristics do not vary with workloads in terms of the sensitivity metric [5].

Second, through a thorough breakdown of the outcome categories, a novel light-weight arch-level dependability enhancement strategy is proposed, showing that core/strand running status and core-memory interface events can be candidates of the detectable symptoms across all the modules under investigation (DeadLock and Invalid Packet in this paper). Application running status (incomplete execution, IEXE event) can be covered by a simple watchdog to further refine the proposed light-weight arch-level strategy and the silent data corruption (SDC) is estimated demonstrating its potential.

Third, to the best of our knowledge we are the first to make a quantitative study of the effect of traditional protection schemes in the target CMT in terms of the contribution of each trap type, showing the necessity of taking this factor into account for the strategy [6].

In Section 2, we describe the experimental methodology. Section 3 makes a thorough investigation of the vulnerability characteristics by sensitivity metrics. Then, Section 4 proposes an arch-level dependability enhancement strategy against intermittent faults, and the SDC rate is evaluated to demonstrate its potential. Section 5 discusses the protection effect of traditional schemes in the target CMT, including ECC and parity. Related work is described in Section 6, and Section 7 provides a conclusion.

2. Experimental Methodology

2.1. Target System

The target system is a CMT version of the UltraSPARC processor. Representative units are selected as device under test (DUT), including (1) address generation unit (AGEN) in instruction fetch unit (IFU), (2) pick unit (PKU), (3) decoder, (4) arithmetic logic unit (ALU), and (5) integer register file (IRF) [7]. Every unit is composed of several modules and the detailed information of each module is listed in Table 1.

Table 1: Representative units and corresponding modules under investigation in the target CMT.

The target CMT is exercised with two validation test programs from the OpenSPARC T2 package as described in Table 2 [8]. LDST_ATOMIC.S is memory-intensive, while IFU_BASIC_EX_RAW.S is CPU-intensive (abbreviated as LDST and EXU, respectively). The CMT is run in a one-core, one-thread (1c1t) configuration; the multicore configuration is left for future work.

Table 2: Test benches description.

2.2. Fault Injection Framework

A fault injection framework, namely, the Verilog PLI based fault injector (VPFIT), was designed on top of Synopsys VCS to facilitate this work. The overall architecture of VPFIT is depicted in Figure 1 and comprises a fault injector, a trace generator, and a statistics component. A series of programming language interface (PLI) tasks, such as Inject_TransFault, Inject_PermFault, and Inject_IntermFault, together with auxiliary PLI tasks such as Test_ExecTime, were designed for this purpose.

Figure 1: VPFIT (Verilog PLI based fault injector) framework.

The key features of VPFIT include (1) automation of injections into the Verilog description of the target CMT, (2) support for different fault types (e.g., transient, intermittent, and permanent faults), (3) a variety of fault models (e.g., pulse, stuck-at, open, indeterminism, bridge, and delay in NET versus bit-flip and stuck-at in REG), (4) different fault parameters (e.g., burst length, activity time, and inactivity time for intermittent faults), (5) automation of trace generation and data collection, and (6) a variety of back-end scripts for analysis and statistics (e.g., classification into outcome categories, computation of sensitivity, and trap statistics).

As the purpose of this work is to characterize the susceptibility indices to intermittent faults for the dependability enhancement strategy at an early design stage, a Verilog description of the target CMT which is independent of implementation and process technology is adopted. A SWAT-Sim-like hierarchical simulation is left for future work [7].

2.3. Fault Injection Method

Because intermittent faults are characterized by a fixed location, they are injected into each module of a unit (thirteen modules altogether in this work). To put the effects of intermittent faults in context, transient faults and permanent faults are injected at the same sites as a reference.

For each trial (fifty fault injections), transient faults are first injected to generate a random template of the fault sites. Then, intermittent faults (and permanent faults) are injected according to the specific configuration for the trial. A fault site includes the following information: module ID, object type (NET or REG), object ID, faulty bit, and the fault injection instant, which is randomly chosen from the total execution cycles of the golden trace. For each fault injection instance, only one fault is injected and the workload runs to completion.
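The fault-site template described above can be pictured with a short sketch. It is only an illustration under stated assumptions: the class name `FaultSite`, the function `make_template`, and the `(object_type, object_id, width)` tuples are hypothetical and are not part of VPFIT.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultSite:
    """One entry of the fault-site template (field names are illustrative)."""
    module_id: str     # e.g. "PKU_PKD"
    object_type: str   # "NET" or "REG"
    object_id: int     # index of the net/register inside the module
    faulty_bit: int    # bit position within the object
    inject_cycle: int  # injection instant, in cycles of the golden trace


def make_template(module_id, objects, golden_cycles, n_sites=50, seed=0):
    """Draw n_sites random fault sites for one trial.

    objects is a list of (object_type, object_id, width_in_bits) tuples;
    this layout is an assumption made for the sketch.
    """
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        obj_type, obj_id, width = rng.choice(objects)
        sites.append(FaultSite(module_id, obj_type, obj_id,
                               rng.randrange(width),
                               rng.randrange(golden_cycles)))
    return sites
```

Reusing the same template for the transient, intermittent, and permanent runs of a trial keeps the fault sites identical across fault types, so that differences in outcome come from the fault type and its parameters.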

According to the object type of the fault site, for example, NET or REG, different fault models are adopted correspondingly for transient faults, intermittent faults, and permanent faults, as listed in Table 3.

Table 3: Fault injection method.

For transient faults, a pulse of a duration randomly generated from the [0.01T–0.1T] interval is applied to NET, while the bit-flip fault model is applied to REG.

For permanent faults, a fault model randomly chosen from stuck@0/1, indeterminism, and open is applied to NET and held until the end of the simulation run, while the stuck@0/1 model is applied to REG.

For intermittent faults, a fault model randomly chosen from the pulse, indeterminism, and open is applied to NET, while the bit-flip model is applied to REG.
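The selection rules in the three paragraphs above amount to a lookup keyed by fault type and object type. The sketch below is illustrative only: the table and function names are hypothetical, and treating stuck@0 and stuck@1 as two equally likely entries is an assumption, since the text does not state how the random choice is weighted.

```python
import random

# Fault models per (fault type, object type), transcribed from the text.
# Assumption: stuck@0 and stuck@1 are listed as separate, equally weighted entries.
FAULT_MODELS = {
    ("transient", "NET"): ["pulse"],
    ("transient", "REG"): ["bit-flip"],
    ("permanent", "NET"): ["stuck@0", "stuck@1", "indeterminism", "open"],
    ("permanent", "REG"): ["stuck@0", "stuck@1"],
    ("intermittent", "NET"): ["pulse", "indeterminism", "open"],
    ("intermittent", "REG"): ["bit-flip"],
}


def pick_fault_model(fault_type, object_type, rng=random):
    """Randomly pick one admissible fault model for the given fault/object type."""
    return rng.choice(FAULT_MODELS[(fault_type, object_type)])
```

For example, `pick_fault_model("intermittent", "NET")` returns one of the pulse, indeterminism, or open models, while any REG target under a transient or intermittent fault always receives the bit-flip model.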

The activity time and inactivity time of a burst are drawn from a uniform distribution over the ranges [0.01–0.1], [0.1–1], and [1–10], one range per configuration [9].

Here, T designates the clock cycle of the target CMT (1 ns in this work) and the smallest simulated time granularity is 1 ps. The burst length is specified as two, four, or eight [9].

The combination of fault site, fault model, and fault parameters (e.g., burst length, activity time, and inactivity time for intermittent faults) constitutes a configuration.
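To make the role of the intermittent-fault parameters concrete, the sketch below builds the activation windows of one burst. It rests on several assumptions that the text does not confirm: the three parameters are read as a burst length (number of activations) plus an activity time and an inactivity time per activation, both times are drawn uniformly from the same range and expressed in multiples of the clock cycle T, and the function name is hypothetical.

```python
import random

T_PS = 1000  # clock cycle T = 1 ns, expressed at the 1 ps simulation granularity


def burst_schedule(start_ps, burst_length, time_range, rng):
    """Return a list of (on_ps, off_ps) windows during which the fault is active.

    Assumption: each activation lasts a uniformly drawn activity time and is
    followed by a uniformly drawn inactivity time, both in units of T.
    """
    lo, hi = time_range
    t = start_ps
    windows = []
    for _ in range(burst_length):
        active = max(1, round(rng.uniform(lo, hi) * T_PS))
        inactive = max(1, round(rng.uniform(lo, hi) * T_PS))
        windows.append((t, t + active))
        t += active + inactive
    return windows
```

A call such as `burst_schedule(5000, 4, (0.1, 1.0), random.Random(1))` would correspond to a burst length of four in the [0.1, 1] range, starting 5 ns into the run.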

For each module under a specific configuration, fifty injection instances constitute a trial and seven trials constitute a campaign. After a fault injection campaign, the back-end statistics are collected. Overall, a total of 81,900 simulation runs are performed for intermittent faults (350 injections * 13 modules * 3 * 3 * 2 workloads) and 18,200 runs for permanent and transient faults.
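The run counts can be checked with a few lines of arithmetic. The breakdown of the intermittent total follows the formula given in the text; the breakdown of the 18,200 reference runs (one set of 350 injections per module, per workload, for each of the two reference fault types) is an assumption that happens to reproduce the quoted figure.

```python
INJECTIONS_PER_TRIAL = 50
TRIALS = 7
MODULES = 13
BURST_LENGTHS = 3      # two, four, eight
TIME_RANGES = 3        # [0.01-0.1], [0.1-1], [1-10]
WORKLOADS = 2          # LDST and EXU
REFERENCE_TYPES = 2    # transient and permanent (assumed breakdown)

injections_per_config = INJECTIONS_PER_TRIAL * TRIALS  # 350

intermittent_runs = (injections_per_config * MODULES
                     * BURST_LENGTHS * TIME_RANGES * WORKLOADS)
reference_runs = injections_per_config * MODULES * REFERENCE_TYPES * WORKLOADS

print(intermittent_runs, reference_runs)  # 81900 18200
```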

3. Sensitivity and Vulnerability Characteristics

Sensitivity is defined as the percentage of faults in an object (NET or REG) pertaining to a given unit or module that result in a processor architectural state mismatch [5, 10]. In this section, the target CMT is exercised with two workloads, one memory-intensive and one CPU-intensive, and comprehensive fault injections are conducted to make a thorough investigation of the vulnerability characteristics by using sensitivity metrics.
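As a minimal illustration of the definition, sensitivity can be computed from a list of per-injection outcomes. The function name and the boolean encoding of an outcome (True when the architectural state deviates from the golden run) are choices made for this sketch.

```python
def sensitivity(mismatches):
    """Percentage of injected faults that caused an architectural state mismatch.

    mismatches: iterable of booleans, one per fault injection.
    """
    outcomes = list(mismatches)
    if not outcomes:
        raise ValueError("at least one fault injection is required")
    return 100.0 * sum(outcomes) / len(outcomes)


# Example: 3 mismatches out of a 50-injection trial.
print(sensitivity([True] * 3 + [False] * 47))  # 6.0
```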

3.1. Sensitivity at Unit Level

Table 4 provides (1) the sensitivity of NET and REG per unit and (2) the sensitivity to transient, permanent, and, in particular, intermittent faults for each configuration (the combination of burst length and activity/inactivity time range, where the burst length is equal to two, four, or eight, and the time range is one of the intervals [0.01, 0.1], [0.1, 1], and [1, 10]).

Table 4: Sensitivity per unit (%).

Taking the EXU workload as an example, analysis of the data leads to several conclusions.

First, there is clear evidence that for transient faults the sensitivity of NET (on average 1.1% with a random fault duration ranging from 0.01 to 0.1 under the pulse fault model) is not negligible, even though this figure is five times smaller than the sensitivity of REG (5.5% on average). In contrast, for permanent faults the sensitivity of NET is 1.59 times that of REG (25.8% versus 16.2%), except for the IRF unit.

Second, the different configurations of the fault parameters (time ranges of [0.01, 0.1], [0.1, 1], and [1, 10] and burst lengths of two, four, and eight) simulate the exacerbation of the wear-out process. On the whole, the sensitivity for a specific burst length increases with the time range, while for a specific time range the sensitivity increases with the burst length. Note that discrepancies exist under some configurations. In-depth analysis reveals that the randomly generated fault models between corresponding fault types (transient, permanent, and intermittent faults) are the leading factor, and the difference between randomly selected fault sites between trials is another factor.

Third, for units responsible for control, the sensitivity of NET grows more sharply than that of REG as the time range and burst length increase, indicating the need for a protection scheme. To further identify the sphere of protection, an in-depth analysis at a finer granularity (module level) was performed.

3.2. Sensitivity Breakup per Unit

A detailed breakdown of the sensitivity per module for EXU and LDST workload is described, respectively, in Tables 5 and 6.

Table 5: Sensitivity breakup per unit for EXU workload ((a) for NET and (b) for REG).

Table 6: Sensitivity breakup per unit for LDST workload ((a) for NET and (b) for REG).

Note that there are three REG objects in the PKU_PCK module, but its sensitivity is zero. In-depth analysis reveals that two objects belong to the scan chain, which is disabled, and that single-bit faults in the third are logically masked. The collected data lead to several important conclusions providing valuable susceptibility indices for the dependability enhancement strategy.

First, the impact of fault parameters on sensitivity at the module granularity follows a similar trend to that described in the previous section: the sensitivity for a specific burst length increases with the time range, while that for a specific time range increases with the burst length.

Second, although the two workloads have different features, a similar trend of the impact of intermittent faults is revealed, and, in particular, the common highly sensitive modules are identified. This corroborates that susceptible characteristics do not vary with workloads, and thus sensitivity can provide valuable information for dependability enhancement strategy [5].

Sensitivity metrics under two workloads reveal that the following modules become the vulnerable bottlenecks for intermittent faults, as listed in Table 7.

Table 7: Module level vulnerable bottlenecks.

The pick unit (PKU) is a representative unit in the target CMT which is highly sensitive to intermittent faults, wherein the PKU_PKD module, in charge of error detection and checking, and the PKU_SWL module, which implements the state machine, become the bottlenecks. Taking the EXU workload as an example, the sensitivity of PKU_PKD is 24.9% for NET in the B2_0.1–1 configuration versus 12% for PKU_SWL for REG in the B2_0.01–0.1 configuration.

In the target CMT, IRF is well protected from transient faults by the ECC module. However, the data show that both NET and REG in the EXU_IRF module are highly sensitive to intermittent faults in the [1, 10] configuration. In addition, the REG in EXU_RML, a module in charge of register management, is highly sensitive to transient faults, indicating the need for a scheme that protects it from not only intermittent faults but also transient faults. Moreover, the NET of the following modules is highly sensitive: IFU_AGC in AGEN and DEC_DEL in the decoder. The REG of EXU_EDP in ALU is vulnerable as well.

Third, the collected data call for a protection scheme which not only covers all of the highly sensitive modules across a variety of units, including PKU, AGEN, decoder, IRF, and ALU, but is also general enough to protect both the NET and REG object types from transient and intermittent faults. When design and verification complexity are taken into account, previous approaches which either target a specific unit or aim at some particular parts of the processor are no longer viable [5, 8, 11–14].

Hence, a more general and light-weight method at arch-level, which is not only across different fault types (transient, permanent, and intermittent faults) but also independent of various modules (as listed in Table 7), is a better choice.

4. Dependability Enhancement Strategy

4.1. Outcome Categories

The fault injection outcome categories are outlined as follows: dead lock (DLock), invalid packet request (IPacket), short execution (Short), incomplete execution (IEXE), bad trap (BadTrap), and latent (Latent). The detailed description of each category is listed in Table 8.

Table 8: Outcome categories description.

Figure 2 depicts fault propagation from the fault site through processor architectural state to the application. Through latency analysis, two groups are differentiated: the microarchitectural group and the propagated group. Analysis shows that some categories, including DLock, IPacket, and Short, fall into both groups, as marked in Table 8. Note that all the results presented here assume that the probability of the occurrence of intermittent faults is equal for each module.

Figure 2: Outcome categories: microarchitectural level groups versus propagated groups.

4.2. Dependability Enhancement Strategy

Experimental results of the u-architectural and propagated groups for NET and REG under the LDST workload are listed in Table 9. The results for the EXU workload are similar and are omitted due to space constraints.

Table 9: Detectable symptoms.

In-depth analysis shows that the following categories lead to SDC events: DLock, IPacket, IEXE, Short, and BadTrap, as depicted in “u-SDC/p-SDC events” column.

An alarming statistic is observed for the u-architectural group in which the outcome of NET primarily falls into IPacket, while that of REG mainly falls into DLock event.

Covering as many SDC events as possible is of utmost importance for the dependability enhancement strategy. In-depth analysis reveals that DLock and IPacket are detectable symptoms. Data in the “detectable symptoms” column show that for NET the two events account for the majority of the u-architectural group, with percentages of about 79.0, 71.8, 53.1, 8.9, and 25.4 out of 79.6, 72.4, 67.1, 9.1, and 26.7, respectively, for PKU, AGEN, decoder, ALU, and IRF (80.8, 83.3, 57.6, 25.5, and 29.4 out of 81.0, 83.8, 58.1, 27.4, and 34.6 for the EXU workload). This suggests a light-weight protection scheme that treats these two kinds of events as detectable symptoms.
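The share of the u-architectural group that the two detectable symptoms account for can be derived directly from the NET figures quoted above for the LDST workload. The short calculation below is only a reading aid; the variable names are illustrative.

```python
UNITS = ["PKU", "AGEN", "decoder", "ALU", "IRF"]

# NET, LDST workload: DLock + IPacket percentage versus whole u-architectural group.
detectable = [79.0, 71.8, 53.1, 8.9, 25.4]
u_arch_group = [79.6, 72.4, 67.1, 9.1, 26.7]

share = {
    unit: round(100.0 * d / g, 1)
    for unit, d, g in zip(UNITS, detectable, u_arch_group)
}

for unit, pct in share.items():
    print(f"{unit}: {pct}% of the u-architectural group is detectable")
```

Under this reading, the decoder is the unit where the two symptoms cover the smallest fraction of the group, while for the other four units they cover roughly 95% or more.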

For the propagated group, analysis shows that a simple watchdog can be deployed to cover the IEXE event. Thus, the proposed arch-level dependability enhancement strategy can be further improved to contain not only core/strand status and crossbar event but also application running status (DLock and IPacket versus IEXE) as detectable symptoms.
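The text does not describe how the watchdog is built, so the following is a purely hypothetical sketch of the general idea: a countdown timer that the application refreshes while it is making progress, and that flags an incomplete execution (IEXE) when the budget runs out. The class name, the cycle budget, and the use of an explicit `kick` heartbeat are all assumptions.

```python
class Watchdog:
    """Countdown watchdog: expires if no progress heartbeat arrives in time."""

    def __init__(self, budget_cycles):
        self.budget = budget_cycles
        self.remaining = budget_cycles
        self.expired = False

    def kick(self):
        """Progress heartbeat (e.g. the workload reached a checkpoint)."""
        self.remaining = self.budget

    def tick(self, cycles=1):
        """Advance simulated time; returns True once the watchdog has expired."""
        self.remaining -= cycles
        if self.remaining <= 0:
            self.expired = True
        return self.expired


wd = Watchdog(budget_cycles=3)
wd.tick()
wd.tick()
wd.kick()                 # progress observed, countdown restarts
assert not wd.tick()
wd.tick()
assert wd.tick()          # no further progress: flagged as incomplete execution
```

One natural choice for the budget would be a margin above the golden execution time of the workload, but that sizing is likewise an assumption rather than something stated here.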

The SDC rates obtained from the fault injections are listed in the SDC′ and SDC columns of Table 10, after incorporating the u-architectural and the application level symptoms, respectively. The data demonstrate that by incorporating the u-arch level detectable symptoms (DLock and IPacket) the SDC rate is reduced from 6.3% to 0.7% for NET and from 1.3% to 0.2% for REG under the LDST workload. By incorporating another application level symptom, namely, IEXE, a further SDC decrease is obtained, demonstrating the efficacy of the proposed arch-level dependability enhancement strategy against intermittent faults.
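The quoted SDC rates can also be expressed as relative reductions, which makes the NET and REG figures easier to compare. The calculation uses only the four percentages given in the text; the helper name is illustrative.

```python
def relative_reduction(before_pct, after_pct):
    """Relative decrease of the SDC rate, in percent of the original rate."""
    return 100.0 * (before_pct - after_pct) / before_pct


net = relative_reduction(6.3, 0.7)   # NET, LDST workload
reg = relative_reduction(1.3, 0.2)   # REG, LDST workload

print(f"NET: {net:.1f}% fewer SDC events, REG: {reg:.1f}% fewer SDC events")
```

Both object types see a reduction of well over 80% of their original SDC rate from the two u-arch level symptoms alone.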

Table 10: SDC ((a) for NET and (b) for REG).

All in all, the above analysis provides a useful reference: core or strand running status and core-memory interface or crossbar events (DLock and IPacket in this paper) can serve as arch-level symptoms of hardware faults across a variety of modules for the units under test. Application running status (IEXE) can be considered as another symptom to refine the strategy.

5. Effects of Traditional Protection Schemes to Intermittent Faults

Experimental results demonstrate that traps were triggered in 6.5% of the manifested symptoms.

For the dependability enhancement strategy, the capability of traditional protection schemes against intermittent faults cannot be overlooked. In this section, a quantitative study is made of the effect of traditional protection schemes on intermittent faults, demonstrating the necessity of taking this factor into account for the dependability enhancement strategy [6].

In the target CMT, sequential logic is usually protected by traditional schemes such as ECC or parity; in addition, these schemes typically rely on attendant traps to facilitate protection. The detailed breakdown of traps for each outcome category (Latent, Incomplete EXEcution, Bad Trap, InvalidPacketRequest, and DeadLock) is given in Table 11, showing that the majority of traps (99.5%) originate from the propagated group owing to fault propagation.

Table 11: Trap distributions.

5.1. Protection Effects Quantitative Study

Table 12 describes the effect of the traditional protection schemes on intermittent faults for NET and REG, respectively: (1) the priority metric, normalized as the number of trap occurrences per six campaigns (2100 injections), which expresses a relative weight, (2) the fault coverage and recovery rate for each module, and (3) the contribution of the various trap types per module in descending priority. The results for LDST are similar to those for EXU except for some notable load/store characteristics and are omitted due to space constraints.

Table 12: Protection effects breakdown per trap type.

As expected, parity and ECC are more effective for REG than for NET, with an overall priority of 268.9 versus 97.3, and the average fault coverage and recovery rate are higher for REG than for NET (13.9% and 99.2% versus 3.7% and 93.5%).

For REG, the protection capability for decoder, IRF, and AGEN is expressed by a relative priority of 110.0, 92.4, and 66.5, respectively. However, there is no protective effect for PKU and ALU with the priority of zero.

For NET, the priorities of 58.6, 16.4, 10.3, 9.0, and 3.1 express the protection capability for the units PKU, AGEN, decoder, ALU, and IRF, respectively. Of all the modules, PKU_PKD is protected best, with a relative priority of 47.3.

The average fault coverage for NET is only 3.7%. However, once an intermittent fault is covered, the traditional scheme is effective with a recovery rate of nearly 100% except ALU and IRF. For ALU, the fault coverage is only 2.3% with the recovery rate of about 22.2%, while IRF is 0.6% versus 66.7%, respectively, indicating the need to protect the logic in ALU and IRF from intermittent faults.

The contribution of different trap types per module is quantitatively described by a relative priority. Data show that, for NET, trap type 0x10 makes the majority contribution of about 88% (86.4/97.3). In contrast, for REG, trap types 0x10, 0x29, 0x0a, 0x20, and 0x11 together contribute nearly 83% (223.9/268.9). This indicates that the 0x10 fatal trap is of utmost importance for protecting both NET and REG, while other trap types, such as 0x29, 0x0a, and 0x20, are vital for protecting REG from intermittent faults.
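The two contribution figures follow from dividing the priority attributed to the named trap types by the overall priority of the object type. A short check using only the numbers quoted in the text (the helper name is illustrative):

```python
def contribution_pct(trap_priority, overall_priority):
    """Share of the overall priority attributable to the given trap type(s)."""
    return 100.0 * trap_priority / overall_priority


net_share = contribution_pct(86.4, 97.3)    # NET: the dominant trap type alone
reg_share = contribution_pct(223.9, 268.9)  # REG: five trap types together

print(f"NET: {net_share:.1f}%  REG: {reg_share:.1f}%")
```

The results, roughly 88.8% and 83.3%, agree with the rounded values given in the paragraph above.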

5.2. Discussions

The above analysis leads to several prospects for the intermittent faults dependability enhancement strategy.

First, for the traditional protection schemes, the average coverage rate of 3.7% for NET versus 13.9% for REG reinforces the case for deploying an enhancement strategy to counter intermittent faults. The recovery rate of 93.5% versus 99.2% for NET and REG attests to the protection effect of traditional schemes against intermittent faults, demonstrating the necessity of taking this factor into account for dependability enhancement.

Second, in-depth analysis shows that a simple watchdog can be deployed to cover the IEXE event. Thus, the arch-level strategy proposed can be further improved to contain not only core/strand status and crossbar event, but also application running status (DLock and IPacket versus IEXE in this paper) as detectable symptoms. Preliminary estimation shows that on average 0.1% of SDC decrease is acquired for NET across all the units, including AGEN, PKU, decoder, ALU, and IRF under LDST test bench.

Third and last, we are convinced that the trap would be a promising symptom for fault diagnosis or fault prediction, providing valuable information for architects to further refine the dependability strategy, which is the focus of our future work.

6. Related Work

Comprehensive fault injections have been conducted to characterize the effects of transient faults on processors. As semiconductor technology scales into the nanometer regime, a resurgence of interest in intermittent faults has come forth in recent years.

Generally, intermittent faults are assumed to be the prelude to permanent faults. In contrast to transient faults due to single-event upsets (SEUs), intermittent faults are related to irreversible physical defects in the circuit. These defects can be produced either in the design/manufacturing process or during normal operation. For defects produced during normal operation, a series of wear-out mechanisms can occur over the long term, initially manifesting as intermittent faults and finally developing into a permanent fault [2]. The sources of failures (SOFs) of intermittent faults can be categorized as follows.

Design or manufacturing defects constitute one of the most important SOFs. Residues, process variations, or infant mortality provoked by manufacturing processes, together with design defects, aggravate the situation.

Aging or in-progress wear out becomes another SOF. Complex wear-out mechanisms, such as time dependent dielectric breakdown (TDDB), negative bias temperature instability (NBTI), electromigration (EM), stress migration (SM), and thermal cycling (TC) in packages, are expected to become more frequent in the nanometer regime. Devices typically do not fail suddenly but display intermittent behavior for a period of time beforehand and finally evolve to permanent faults.

Environmental triggers are the inducements for intermittent faults. Continuous shrinking of device feature size due to device scaling leads to increasing susceptibility to various inducements, such as PVT variation, increased cross-talk, and environmental interference.

Above all, intermittent faults are expected to be a severe challenge for VLSI circuits in the nanometer regime, especially for multicore processors in future technologies [15–23].

Accordingly, the computer community commenced to explore the impact of intermittent faults [24, 25]. Rashid et al. made a preliminary study of intermittent faults propagation in application, furthered by Wei et al. [26, 27]. Gracia evaluated the effects of intermittent faults on an embedded system [6, 28]. In contrast to previous work targeting an embedded system or a microcontroller, the UltraSPARC CMT processor is used as a case study in this paper to characterize intermittent faults.

Pan et al. proposed the intermittent faults vulnerable factor (IVF), a metric similar to AVF, to estimate the susceptibility of typical sequential units in a processor to intermittent faults [29]. Kim and Somani advocated the sensitivity metric at RTL or lower levels [5]. Saggese et al. made a thorough study of the susceptibility of a superscalar processor to transient faults with the sensitivity metric [10]. In this paper, the sensitivity metric is instead adopted to characterize intermittent faults for a CMT, and a protection strategy is then proposed. The experimental results of this paper corroborate Kim’s analytic findings that the susceptibility characteristics do not vary with workloads in terms of the sensitivity metric [5].

Data in this work demonstrate that previous protection schemes targeting a specific unit or some particular parts of a processor are no longer viable [11–14]. Accordingly, an arch-level dependability enhancement strategy, which is not only independent of fault types (intermittent, transient, and permanent faults) but is also applicable across various sensitive modules, is put forward and its potential is evaluated.

7. Conclusions

To the best of our knowledge, we are the first to use SPARC T2 processor as a case study to characterize the effects of intermittent faults at register transfer level (RTL) and a dependability enhancement strategy is proposed.

First, sensitivity evaluation demonstrates that susceptible characteristics do not vary with workloads, and the similar trend of the effect of intermittent faults is revealed and the common sensitive modules are identified.

Second, a quantitative study of the effect of traditional protection schemes on intermittent faults is made in terms of the contribution of each trap type, reinforcing the case for deploying an enhancement strategy to counter intermittent faults while demonstrating the necessity of taking this factor into account for the dependability strategy.

Third, a thorough breakdown of outcome categories provides a useful reference: core or strand status and core-memory interface events (DLock and IPacket in this paper) can be candidates for arch-level symptoms, whilst workload status (IEXE) can serve as an application level symptom to refine the strategy. Data demonstrate that by incorporating arch-level symptoms (DLock and IPacket) the SDC rate is reduced from 6.3% to 0.7% for NET and from 1.3% to 0.2% for REG. With the additional application level symptom (IEXE), a further SDC decrease is obtained, demonstrating the efficacy of the proposed dependability enhancement strategy for intermittent faults.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant no. 61373025 and no. 90818016, National High Technology Research and Development Program of China (no. 2012AA010905), Beijing Natural Science Foundation (4142034), China Scholarship Council Foundation, and Beijing Higher Education Young Elite Teacher Project (YETP0380). The authors would like to express their great appreciation to Craig Miller for his valuable advice on the writing of this paper.

References

  1. “Process integration, devices and structures,” The International Technology Roadmap for Semiconductors, p. 8, Update, 2006.
  2. P. Gil, J. Arlat, H. Madeira et al., “Fault Representativeness,” Deliverable ETIE2. DBench European Project IST-2000-25425.
  3. P. M. Wells, K. Chakraborty, and G. S. Sohi, “Mixed-mode multicore reliability,” in Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14), pp. 169–180, March 2009. View at Publisher · View at Google Scholar · View at Scopus
  4. C. Constantinescu, “Impact of intermittent faults on nanocomputing devices,” in Proceedings of the Workshop on Dependable and Secure Nanocomputing (WDSN '07), Edinburgh, UK, 2007.
  5. S. Kim and A. K. Somani, “Soft error sensitivity characterization for microprocessor dependability enhancement strategy,” in Proceedings of the International Conference on Dependable Systems and Networks (DNS '02), pp. 416–425, June 2002. View at Scopus
  6. J. Gracia-Moran, D. Gil-Tomas, L. J. Saiz-Adalid, J. C. Baraza, and P. J. Gil-Vicente, “Experimental validation of a fault tolerant microcomputer system against intermittent faults,” in Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '10), pp. 413–418, July 2010.
  7. M.-L. Li, P. Ramachandran, U. R. Karpuzcu, S. K. S. Hari, and S. V. Adve, “Accurate microarchitecture-level fault modeling for studying hardware faults,” in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA '08), pp. 105–116, August 2008.
  8. J. C. Smolens, Fingerprinting: hash-based error detection in microprocessors [Ph.D. thesis], Carnegie Mellon University, 2008.
  9. D. Gil-Tomás, J. Gracia-Morán, J.-C. Baraza-Calvo et al., “Analyzing the impact of intermittent faults on microprocessors applying fault injection,” IEEE Design and Test of Computers, vol. 29, no. 6, pp. 66–73, 2013.
  10. G. P. Saggese, A. Vetteth, Z. Kalbarczyk, and R. Iyer, “Microprocessor sensitivity to failures: control vs. execution and combinational vs. sequential logic,” in Proceedings of the International Conference on Dependable Systems and Networks, pp. 760–769, July 2005.
  11. J. Carretero, X. Vera, P. Chaparro, and J. Abella, “On-line failure detection in memory order buffers,” in Proceedings of the International Test Conference (ITC '08), October 2008.
  12. X. Vera, J. Abella, J. Carretero, P. Chaparro, and A. González, “Online error detection and correction of erratic bits in register files,” in Proceedings of the 15th IEEE International On-Line Testing Symposium (IOLTS '09), pp. 81–86, June 2009.
  13. J. Abella, X. Vera, O. Unsal, O. Ergin, and A. González, “Fuse: a technique to anticipate failures due to degradation in ALUs,” in Proceedings of the 13th IEEE International On-Line Testing Symposium (IOLTS '07), pp. 15–22, July 2007.
  14. J. Abella, P. Chaparro, X. Vera, J. Carretero, and A. González, “On-line failure detection and confinement in caches,” in Proceedings of the 14th IEEE International On-Line Testing Symposium (IOLTS '08), pp. 3–9, July 2008.
  15. C. Constantinescu, “Trends and challenges in VLSI circuit reliability,” IEEE Micro, vol. 23, no. 4, pp. 14–19, 2003.
  16. J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “The case for lifetime reliability-aware microprocessors,” in Proceedings of the 31st Annual International Symposium on Computer Architecture, pp. 276–287, June 2004.
  17. C. M. Tan and A. Roy, “Electromigration in ULSI interconnects,” Materials Science and Engineering R: Reports, vol. 58, no. 1-2, pp. 1–75, 2007.
  18. J. Abella, X. Vera, O. S. Unsal, O. Ergin, A. González, and J. W. Tschanz, “Refueling: preventing wire degradation due to electromigration,” IEEE Computer Society, vol. 28, no. 6, pp. 37–46, 2008.
  19. R. Degraeve, B. Kaczer, and G. Groeseneken, “Reliability: a possible showstopper for oxide thickness scaling?” Semiconductor Science and Technology, vol. 15, no. 5, pp. 436–444, 2000.
  20. X. Li, J. Qin, and J. B. Bernstein, “Compact modeling of MOSFET wearout mechanisms for circuit-reliability simulation,” IEEE Transactions on Device and Materials Reliability, vol. 8, no. 1, pp. 98–121, 2008.
  21. V. Huard, M. Denais, and C. Parthasarathy, “NBTI degradation: from physical mechanisms to modelling,” Microelectronics Reliability, vol. 46, no. 1, pp. 1–23, 2006.
  22. Failure Mechanisms and Models for Semiconductor Devices, JEDEC Publication JEP122-A, 2002.
  23. A. DeHon, H. M. Quinn, and N. P. Carter, “Vision for cross-layer optimization to address the dual challenges of energy and reliability,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '10), pp. 1017–1022, March 2010.
  24. C. Constantinescu, “Intermittent faults and effects on reliability of integrated circuits,” in Proceedings of the 54th Annual Reliability and Maintainability Symposium (RAMS '08), January 2008.
  25. D. Gil, L. J. Saiz, J. Gracia, J. C. Baraza, and P. J. Gil, “Injecting intermittent faults for the dependability validation of commercial microcontrollers,” in Proceedings of the IEEE International High Level Design Validation and Test Workshop (HLDVT '08), pp. 177–184, November 2008.
  26. L. Rashid, K. Pattabiraman, and S. Gopalakrishnan, “Towards understanding the effects of intermittent hardware faults on programs,” in Proceedings of the International Conference on Dependable Systems and Networks Workshops (DSN-W '10), pp. 101–106, July 2010.
  27. J. Wei, L. Rashid, K. Pattabiraman, and S. Gopalakrishnan, “Comparing the effects of intermittent and transient hardware faults on programs,” in Proceedings of the IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W '11), pp. 53–58, June 2011.
  28. J. Gracia, L. J. Saiz, J. C. Baraza, D. Gil, and P. J. Gil, “Analysis of the influence of intermittent faults in a microcontroller,” in Proceedings of the IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems (DDECS '08), pp. 80–85, April 2008.
  29. S. Pan, Y. Hu, and X. Li, “IVF: characterizing the vulnerability of microprocessor structures to intermittent faults,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '10), pp. 238–243, March 2010.