Abstract

Hardware redundancy at different levels of design is a common fault mitigation technique, well known for its effectiveness but also for its area overhead. To reduce this drawback, several fault-tolerant techniques have been proposed in the literature in search of a good trade-off. In this paper, critical constituent gates in arithmetic circuits are detected and graded based on the impact that an error has on the output of the circuit. These critical gates should be hardened first under the area constraint of the design criteria. Indeed, output bits considered crucial to a system receive higher priorities for protection, reducing the occurrence of critical errors. The 74283 fast adder is used as an example to illustrate the feasibility and efficiency of the proposed approach.

1. Introduction

With technology scaling, electronic circuits are becoming increasingly prone to faults and defects. Reliability analysis of logic circuits is thus emerging as an important concern in deep submicron technologies [1, 2]. It is especially critical for systems intended for space, avionics, and biomedical applications. In order to design reliable nanoelectronic devices, different fault-tolerant strategies have been extensively researched over the past years [3, 4].

Modular redundancy is a representative method that provides reliability enhancement at the cost of area overhead. Motivated by the need for economical fault-tolerant designs, researchers have sought better trade-offs between reliability and overhead [5]. A hybrid redundancy method is proposed in [6], which combines information and hardware redundancy to achieve better fault tolerance. Sensitive transistors are protected in [7] by duplicating and sizing a subset of transistors necessary for soft error tolerance in combinational circuits. In [8], Ruano et al. presented a method to automatically apply Triple Modular Redundancy (TMR) to digital circuits; the idea is to meet a reliability constraint while reducing the area overhead of a typical TMR implementation.

Although all the aforementioned works reduce the area overhead compared to classical hardware redundancy schemes, they do not take into account the usage profile of the results. In fact, a designer may use this additional information to make better decisions about which blocks of a circuit are critical and then assign the desired protection priorities to them.

This work proposes a different approach to identify critical logic blocks in arithmetic circuits. It relies on the fact that many digital systems and applications can tolerate some loss of quality or optimality in their primary outputs. In most cases, this trade-off in area is also associated with performance improvements such as faster operation and lower power consumption. The main idea is that different errors have different consequences for different digital applications. For instance, in a binary output word, errors located in the most significant bits tend to be more critical than errors located in the least significant bits.

This paper is organized as follows. Section 2 introduces the practical reliability concept and explains the advantages of using such a metric for reliability analysis. In Section 3, the 74283 fast adder circuit is used as a case study to illustrate and validate the proposed method. As an example, the estimation of peak signal-to-noise ratio (PSNR) in image processing with different fault-prone critical gates is considered, together with an analysis and comparison of results. Finally, Section 4 outlines some conclusions and suggestions for future work.

2. Reliability Evaluation

2.1. Nominal Reliability

Let $Y = (y_{m-1}, y_{m-2}, \ldots, y_0)$ be a vector of $m$ bits representing the output of a circuit. The reliability of a circuit is usually defined as the probability that it produces correct outputs, that is, the probability that all $y_i$ carry correct values (correct 0s and correct 1s). Given that the output bits are independent, this value, also known as nominal reliability [2], is conventionally expressed as in (1), where $r_i$ stands for the reliability of $y_i$:

$$R_{\mathrm{nom}} = \prod_{i=0}^{m-1} r_i. \quad (1)$$

Let us now suppose that the circuit's output $Y$ is coded using a binary scheme, where $y_{m-1}$ and $y_0$ stand for the most significant bit (MSB) and the least significant bit (LSB), respectively. The MSB is the bit position in a binary number having the greatest numerical value. Therefore, errors occurring in the MSB(s) result in larger disparities than errors in any other bit; in a 4-bit word, for instance, an MSB flip changes the value by 8, whereas an LSB flip changes it by only 1. By contrast, errors in the LSB(s) may even be masked by the target application.

In spite of that, nominal reliability assigns equal reliability costs to the bits of $Y$, as shown in (1). In fact, two different architectures for a logic function may have the same nominal reliability value while one of them still has a higher probability of providing acceptable results than the other. For instance, suppose that a designer obtains three different architectures for a 4-bit adder whose output is coded using a binary scheme and has to select one of them based on the reliability of the output. The reliabilities of the output bits of these architectures are presented in Table 1.

Analyzing the nominal reliability values of the obtained architectures, Architecture 1 and Architecture 2 are selected as the best solutions; indeed, no distinction can be made between these two architectures on the basis of the nominal reliability value alone. However, as the output of this circuit is coded using a binary scheme, the first architecture would provide better results (smaller disparities) than the second one. Ideally, a more meaningful analysis should take into account the amount of information that each bit of an output carries (its importance) in order to assign progressively larger costs to the more significant bits. To tackle this problem, a new metric to analyze the reliability of a circuit with a multiple-bit output is presented in Section 2.2.

2.2. Practical Reliability

Practical reliability is a metric that accounts for the importance of each output bit when analyzing the reliability of a circuit. It is evaluated as shown in (2), where the weight factor $w_i$ allows a designer to adjust the importance of a specific output bit $y_i$ to the output of the circuit:

$$R_{\mathrm{prac}} = \prod_{i=0}^{m-1} r_i^{w_i}. \quad (2)$$

Notice that if $w_i = 1$ for all $i$, the practical reliability expression (2) becomes the nominal reliability expression (1). In this work, a standard binary representation is considered, so that $w_i$ is calculated as shown in (3):

$$w_i = 2^i. \quad (3)$$

Note that (2) can also be related to the probability that an error will cause a significant disparity on the output of a circuit (a critical error).
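As an illustration, the short Python sketch below evaluates both metrics under the forms of (1)-(3) reconstructed above. The per-bit reliability values are hypothetical stand-ins for the entries of Table 1; they only show how two architectures with identical nominal reliability can be separated by the practical metric.

```python
def nominal_reliability(r):
    """Product of per-bit reliabilities, Eq. (1); r[i] belongs to bit y_i."""
    out = 1.0
    for ri in r:
        out *= ri
    return out

def practical_reliability(r):
    """Weighted product, Eq. (2), with binary weights w_i = 2**i, Eq. (3)."""
    out = 1.0
    for i, ri in enumerate(r):
        out *= ri ** (2 ** i)
    return out

# r = [r0 (LSB), r1, r2, r3 (MSB)] -- hypothetical per-bit values
arch1 = [0.95, 0.97, 0.99, 0.99]   # weak LSB, strong MSB
arch2 = [0.99, 0.99, 0.97, 0.95]   # same multiset of bits, weak MSB

print(nominal_reliability(arch1), nominal_reliability(arch2))      # equal
print(practical_reliability(arch1), practical_reliability(arch2))  # arch1 wins
```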

Although the proposed metric does not evaluate the true reliability of a circuit, it takes into account both the reliability and the importance of each output bit for the target application, which is of great value in practice. For instance, let us analyze the architectures shown in Table 1. It can be noted that the practical reliability values differ from the values obtained with nominal reliability; actually, even the ranking of the best architectures changes with the proposed metric. Architecture 2, which was previously deemed the best architecture together with Architecture 1, is now the worst choice due to the low reliability of its MSB. In fact, practical reliability punishes architectures that present low reliability in critical bits, thus providing a designer with a more realistic figure for the target application.

3. Selectively Hardening Critical Gates

Critical gates should be hardened first in order to increase hardware usage efficiency and, at the same time, minimize area overhead. The main idea here is to grade the gates of an arithmetic circuit to be protected based on critical factors. In this work, a critical factor captures not only the probability that an error will be introduced by a gate but also how critical this error will be for the target application, as shown in Section 3.1.

3.1. Identifying Critical Gates

In order to explain and validate the proposed method, the 4-bit fast adder 74283 is employed (see Figure 1). The first module produces the generate, propagate, and XOR functions. The second module is the carry-lookahead (CLA) realization of the carry function. Finally, the 8-bit XOR stage produces the sum function.

The fast adder 74283 has 9 inputs and 5 outputs and is composed of 36 logic gates and 4 buffers. All 40 blocks (gates and buffers) are considered fault-prone. Further, these blocks are assumed to be independent and are labeled as shown in Figure 2.

The procedure for detecting the critical gates of this circuit takes two steps: first, a fault emulation platform named FIFA [9] is used to inject faults due to Single Event Upsets (SEUs); next, critical gates are detected by analyzing the errors that appear in the output vector.

The FIFA platform can generate one fault configuration per clock cycle and can inject a large number of simultaneous faults into the circuit [9]. However, this work considers only the occurrence of single faults, so the platform injects just one fault at a time. If the occurrence of multiple simultaneous faults is likely, the platform can be configured to deal with that.

Finally, the results produced by the original and the faulty circuits are compared bit by bit. If these results differ, it is concluded that the effects of the injected fault have propagated to the output bits; otherwise, it is concluded that the fault has been masked.

The fault injection emulation is performed to determine the critical factors. The idea is to inject a single fault in a gate and analyze the output for all possible input vectors. Then, for each output bit $y_i$, the number of errors $e_i$ related to a single fault in that gate is evaluated (see Table 2). The weighted columns of Table 2 correspond to weighted versions of $e_i$. In our case study, as a standard binary representation is considered, the weighted error count $e_i^w$ is obtained as shown in (4):

$$e_i^w = 2^i \cdot e_i. \quad (4)$$

Note that there are $2^9 = 512$ possible input vectors for each faulty gate. All the simulation results are shown in Table 2.
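The sketch below mimics this flow in software (the paper itself uses the FIFA hardware platform; this is only an illustrative analogue). The netlist is a hypothetical 2-bit ripple-carry adder, not the 74283, and an SEU is modeled as a single output bit-flip of one gate while all input vectors are enumerated exhaustively.

```python
from itertools import product

# Hypothetical netlist of a 2-bit ripple-carry adder, listed in
# topological order: (gate name, function, output net, input nets).
NETLIST = [
    ("g1", lambda a, b: a ^ b, "p0",  ("a0", "b0")),
    ("g2", lambda a, b: a & b, "c1",  ("a0", "b0")),
    ("g3", lambda a, b: a ^ b, "p1",  ("a1", "b1")),
    ("g4", lambda a, b: a & b, "g1n", ("a1", "b1")),
    ("g5", lambda a, b: a ^ b, "s1",  ("p1", "c1")),
    ("g6", lambda a, b: a & b, "t1",  ("p1", "c1")),
    ("g7", lambda a, b: a | b, "c2",  ("g1n", "t1")),
]
OUTPUTS = ["p0", "s1", "c2"]          # s0 = p0; output bit i has weight 2**i

def evaluate(inputs, faulty_gate=None):
    nets = dict(inputs)
    for name, fn, out, ins in NETLIST:
        v = fn(*(nets[i] for i in ins))
        if name == faulty_gate:        # SEU modeled as an output bit-flip
            v ^= 1
        nets[out] = v
    return [nets[o] for o in OUTPUTS]

for name, _, _, _ in NETLIST:
    errors = [0] * len(OUTPUTS)        # e_i: per-output-bit error counts
    for bits in product((0, 1), repeat=4):       # all 2**4 input vectors
        inputs = dict(zip(("a0", "b0", "a1", "b1"), bits))
        good = evaluate(inputs)
        bad = evaluate(inputs, faulty_gate=name)
        for i, (g, f) in enumerate(zip(good, bad)):
            errors[i] += int(g != f)
    weighted = sum(e * 2**i for i, e in enumerate(errors))  # Eq. (4), summed
    print(name, errors, weighted)
```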

The critical gates are detected according to the results presented in Table 2. The more critical a gate is, the higher the priority it receives for protection (in this case using TMR). A TMR configuration based on this principle is more efficient in practical applications, as shown in Section 3.2.

In fact, critical factors are assigned to the gates according to the number of weighted errors in Table 2. If the numbers of weighted errors are equal, gates that are closer to the primary outputs receive higher priorities. If the numbers of weighted errors and the distances to the primary outputs are both identical, gates presenting more reconvergent fan-outs are considered more critical. Gates whose three parameters are all equal receive the same critical factor. Note that the rightmost column in Table 2 gives the critical factor of each gate: the higher the factor, the more critical the gate. In this work, critical factors are assigned as integers.
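A possible encoding of this three-level tie-breaking rule is sketched below; the gate names and their parameters are hypothetical, and only the ordering logic follows the rule described above.

```python
# Hypothetical grading data: (name, weighted errors, distance to primary
# outputs, number of reconvergent fan-outs). The numbers are made up.
gates = [
    ("g12", 960, 1, 2),
    ("g07", 960, 2, 1),   # same weighted errors as g12, but farther away
    ("g31", 480, 1, 0),
    ("g05", 480, 1, 0),   # ties g31 on all three keys -> same factor
]

# Sort from least to most critical: fewer weighted errors first, then
# larger distance to the outputs, then fewer reconvergent fan-outs.
def key(gate):
    _, werr, dist, reconv = gate
    return (werr, -dist, reconv)

factor, prev = 0, None
for gate in sorted(gates, key=key):
    if key(gate) != prev:              # a strictly more critical profile
        factor, prev = factor + 1, key(gate)
    print(gate[0], "-> critical factor", factor)
```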

3.2. Reliability Analysis and Comparison

After obtaining the critical gates, the reliability of the redundant adder circuit is evaluated using the SPR tool [1]. The signal reliability of a given signal is the probability that this signal carries a correct value. In fact, assuming that a binary signal can carry incorrect information is equivalent to assuming that it can take four different values: correct zero ($0_c$), correct one ($1_c$), incorrect zero ($0_i$), and incorrect one ($1_i$).

The probabilities of occurrence of each of these four values are represented in a probability matrix, as shown in (5):

$$P_X = \begin{bmatrix} P(x = 0_c) & P(x = 1_i) \\ P(x = 0_i) & P(x = 1_c) \end{bmatrix} = \begin{bmatrix} x_{00} & x_{01} \\ x_{10} & x_{11} \end{bmatrix}. \quad (5)$$

The signal reliability of $X$, denoted as $R(X)$, comes directly from (6), where $P$ stands for the probability function:

$$R(X) = P(x = 0_c) + P(x = 1_c) = x_{00} + x_{11}. \quad (6)$$

The SPR technique generates a matrix representing the output signal of a logical block from the following information: the probability matrices representing the input signals of the block, the logical function of the block, and the probability that the block will not fail. In order to understand this procedure, let us consider a digital block performing a logical function on a signal $X$ to produce a signal $Y$ (see Figure 3). Now, assume that the probability that this operator will fail is $p$ and that $q = 1 - p$ is the probability that it will not fail. Then, the reliability of $Y$ can be obtained by (7):

$$R(Y) = q \cdot R(X) + p \cdot (1 - R(X)). \quad (7)$$

As can be seen in (7), when the input signal is reliable, that is, $R(X) = 1$, the reliability of the output signal is given by $R(Y) = q$, which stands for the probability of success of the logical block itself. This implies that, for fault-free inputs, the reliability of the output signal is given by the inherent reliability of the block that produces it.
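The sketch below propagates a hypothetical four-value probability matrix through a single inverter under the model reconstructed in (5)-(7), assuming a gate failure simply flips the output. The matrix layout and the input values are assumptions for illustration; the actual SPR tool [1] handles arbitrary gate functions and multi-input blocks.

```python
import numpy as np

# Four-value signal matrix, Eq. (5): P[0,0] = P(correct 0),
# P[0,1] = P(incorrect 1), P[1,0] = P(incorrect 0), P[1,1] = P(correct 1).

def reliability(P):
    """Signal reliability, Eq. (6): probability of carrying a correct value."""
    return P[0, 0] + P[1, 1]

def inverter(P, q):
    """Propagate P through an inverter that flips its output with
    probability p = 1 - q."""
    # Fault-free inverter: a correct 0 becomes a correct 1, an incorrect 1
    # becomes an incorrect 0, and so on.
    ideal = np.array([[P[1, 1], P[1, 0]],
                      [P[0, 1], P[0, 0]]])
    # A failure flips the output value, turning correct entries into
    # incorrect ones and vice versa.
    flipped = np.array([[ideal[0, 1], ideal[0, 0]],
                        [ideal[1, 1], ideal[1, 0]]])
    return q * ideal + (1.0 - q) * flipped

P_X = np.array([[0.48, 0.02],
                [0.03, 0.47]])       # hypothetical input signal, R(X) = 0.95
P_Y = inverter(P_X, q=0.99)
print(reliability(P_Y))              # 0.941 = q*R(X) + p*(1 - R(X)), Eq. (7)
```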

Let us now consider hardware redundancy (TMR) as the technique chosen to protect a logic block. Suppose that the area overhead constraint allows a designer to protect up to 5 gates. According to the critical factors presented in Table 2, the proposed method selects the five highest-graded gates as the candidates to be protected. The method presented in [8], under the same area overhead constraint, applies redundancy to a different set of five gates. As the occurrence of single errors is assumed, the protected blocks are considered reliable; that is, $q = 1$.
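A quick check of why a TMR-protected block can be treated as fully reliable under the single-fault assumption: with a perfect majority voter, the block fails only if at least two of its three copies fail. The sketch below evaluates the standard textbook TMR reliability formula (not taken from the paper).

```python
def tmr_reliability(q):
    """Probability that a majority of three independent copies is correct,
    assuming a perfect voter: all three correct, or exactly two correct."""
    p = 1.0 - q
    return q**3 + 3 * q**2 * p

print(tmr_reliability(0.99))   # 0.999702 > 0.99: the block is hardened
# Under the single-fault assumption, at most one copy is ever hit, so the
# voter always recovers the correct value and the block behaves as q = 1.
```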

The reliability of the output bits of the original circuit and of the redundant configurations can be obtained with the SPR technique. Table 3 shows the reliability results for the respective configurations, considering a fixed value of $q$ for the gates that are not protected. Both the nominal reliability and the practical reliability values are given. The output of the 74283 adder comprises a 5-bit binary word, so the practical reliability can be evaluated from (2) and (3).
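Reusing the nominal_reliability and practical_reliability helpers from the sketch in Section 2.2, the fragment below contrasts three hypothetical 5-bit reliability profiles in the spirit of Table 3 (the real values come from the SPR tool and are not reproduced here). The two hardened profiles are permutations of each other, so their nominal reliabilities coincide while the practical metric rewards hardening the most significant bits.

```python
# Bit order: [bit 0 (LSB), ..., bit 4 (MSB)]; all values are hypothetical.
original  = [0.981, 0.981, 0.981, 0.981, 0.981]
msb_first = [0.981, 0.983, 0.987, 0.995, 0.999]   # proposed: MSBs hardened
lsb_first = [0.999, 0.995, 0.987, 0.983, 0.981]   # same gains on the LSBs

for label, r in [("original", original),
                 ("MSB-first", msb_first),
                 ("LSB-first", lsb_first)]:
    print(label, nominal_reliability(r), practical_reliability(r))
```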

The results presented in Table 3 show the effectiveness of the proposed approach. As discussed above, the main idea is to take into account the impact of an error on the output of a circuit in order to prioritize the reliability enhancement of the bits that are most important to the application. Indeed, the proposed hardening method yields a notable increase in the reliability of the most significant bits of the circuit (see Table 3). For instance, the reliabilities of the least significant bits do not present any increase compared to the original circuit, whereas the reliability of the MSB presents the highest improvement, as expected, since it is the most critical bit for this application.

Furthermore, it can be noted that, under the same area overhead, the nominal reliability increases by almost the same amount with both methods (see Figure 4). In fact, nominal reliability assigns equal reliability costs to the output bits of the 74283. This means that the output bits are treated as equally important to the system, so the nominal reliability value does not reveal in which bit the reliability was actually increased. Practical reliability does not suffer from this problem and indeed provides a sharper distinction between these two hardened architectures, as shown in Figure 4.

4. Conclusion

In this paper, we presented a method to selectively harden arithmetic circuits. Critical constituent gates are detected by taking into account not only the probability of error occurrence but also the impact of such an error on the system. Indeed, bits considered critical to the target application receive higher priorities for protection when the proposed method is employed.

Simulation results show the effectiveness of the proposed approach and indicate that critical gates should be hardened first in order to increase hardware usage efficiency while minimizing area overhead. The results could also be combined with approximate computing techniques dedicated to fault-tolerant design [10]. Future work includes approximate logic design based on the gate grading results.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant no. 61401205).