Abstract

Fault tolerance is of great importance for big data systems. Although several software-based application-level techniques exist for fault security in big data systems, there is a potential research space at the hardware level. Big data needs to be processed inexpensively and efficiently, and traditional hardware architectures, although adequate, are not optimal for this purpose. In this paper, we propose a hardware-level fault tolerance scheme for big data and cloud computing that can be used with the existing software-level fault tolerance for improving the overall performance of the systems. The proposed scheme uses the concurrent error detection (CED) method to detect hardware-level faults, with the help of Scalable Error Detecting Codes (SEDC) and its checker. SEDC is an all unidirectional error detection (AUED) technique capable of detecting multiple unidirectional errors. The SEDC scheme exploits data segmentation and parallel encoding features for assigning code words. Consequently, the SEDC scheme can be scaled to any binary data length “n” with constant latency and less complexity, compared to other AUED schemes, hence making it a perfect candidate for use in big data processing hardware. We also present a novel area, delay, and power efficient, scalable fault secure checker design based on SEDC. In order to show the effectiveness of our scheme, we compared the cost of hardware-based fault tolerance with an existing software-based fault tolerance technique used in HDFS and compared the performance of the proposed checker in terms of area, speed, and power dissipation with the well-known Berger code and m-out-of-2m code checkers. The experimental results show that the proposed SEDC-based hardware-level fault tolerance scheme significantly reduces the average cost associated with software-based fault tolerance in a big data application, and the proposed fault secure checker outperforms the state-of-the-art checkers in terms of area, delay, and power dissipation.

1. Introduction

Big data is promising for business applications and is rapidly increasing as an important segment of the IT industry. Big data has also opened doors of significant interest in various fields, including remote healthcare, telebanking, social networking services (SNS), and satellite imaging [1]. Failures in many of these systems may represent significant economic or market share loss and negatively affect an organization’s reputation [2]. Hence, it is always intended that whenever a fault occurs, the damage done should be within an acceptable threshold rather than beginning the whole task from scratch, due to which fault tolerance becomes an integral part in cloud computing and big data [3]. Fault tolerance prevents a computer or network device from failing in the event of an unexpected error [2]. A recent study [4] showed that the cost of fault tolerance in cloud applications with high probability of failure and network latency is around 5% for the range of application sizes, hence providing improved performance at a lower cost.

The fault tolerance schemes in popular big data frameworks like Hadoop and MongoDB are composed of some sort of data replication or redundancy [5, 6]. MongoDB replicates its primary data in secondary devices. In a faulty event, the data is recalled from the secondary or the secondary temporarily acts as a primary. Fault tolerance in Hadoop relies on multiple copies of data stored on different data nodes. Although replication schemes allow complete data recovery, they consume a lot of memory and communication resources. Hence, in recent years, many researchers have proposed fault tolerance algorithms for improved data recovery, effective fault detection, and reduced latency in big data and cloud computing [2, 5–10], all of which detect faults at the software (SW) level. Even though these schemes also detect faults propagated by transient errors in hardware, and software-based techniques are more flexible, the amount of data that must be processed to detect a fault makes them much more costly than hardware- (HW-) based fault tolerance schemes. A recent study [11] investigated the cause of data corruption in a Hadoop Distributed File System (HDFS) and found that when processing uploaded files, HW errors such as disk failure and bit-flips in processor and memory generate exceptions that are difficult to handle properly. Liu et al. [7] implemented some level of HW-based fault tolerance by modelling CPU temperature to anticipate a deteriorating physical machine. They proposed CPU temperature monitoring as an essential step for preventing machine failure due to overheating as well as for improving the data center’s energy efficiency.

Parker [12] discussed how in many cases the faults are a direct consequence of tightly integrating digital and physical components into a single unit at a sensor or field node. In fact, many modern systems rely so heavily on digital technology that the reliability of the system cannot be decomposed and partitioned into physical and SW components due to interactions between them. There is a cost associated with the storage, transmission, and analysis of these higher-dimensional data. Furthermore, many of the SW-based approaches are simulation intensive, which may lead to broad implementation challenges. To overcome some of these challenges, he suggested that onboard, embedded processing will be a practical requirement.

Transient errors in HW, if propagated, may cause a chain reaction of errors at the SW layer, causing potential failure at the node/server level. Detection at the HW level requires less computation time (as low as a single clock cycle) compared with detection at the SW level (several machine cycles), while a simple recovery mechanism called recomputation at the HW level can save a lot of data swapping and signaling at the SW level. As discussed in [13], big data has created opportunities for semiconductor companies to develop more sophisticated systems to cover the challenges faced in big data and cloud computing, and a trend towards integration of more functions onto a single piece of silicon is likely to continue. Also, due to advances in semiconductor processing, there has been a reduction in the cost of digital components [12]. For these reasons, we propose the detection of transient faults, as they occur in HW, through a HW-based fault tolerance scheme, while the SW-based fault tolerance stays at the top level as a second check for HW errors and first check for SW errors. As a result, the transient errors that arise in HW are mostly taken care of by lightweight processing at the HW level with little overhead (in terms of area, power, and delay), saving tremendous computation resources at the system level. The potential for catastrophic consequences in big data systems justifies the overhead incurred by the HW-based fault tolerance method.

On the other hand, fault tolerance has also become an integral part of very large-scale integration (VLSI) circuits, where downsized, large-scaled, and low-power VLSI systems are prone to transient faults [14]. Transient faults or soft errors are transient-induced events on memory and logic circuits caused by the striking of rays emitted from an IC package and high energy alpha particles from cosmic rays [14–18]. Also, in multilevel cell memories like NAND Flash memories, these errors are mostly caused by cell-to-cell interference and data retention errors [19]. Physical protection such as shielding, temperature control, and grounding circuits is not always feasible; hence, in such cases, concurrent error detecting (CED) methods are employed for protection against these errors. Since CED circuits add to the overall area and delay of the system, the selection of appropriate error detecting, and even error correcting, circuits for a particular application leads to an efficient design [18]. It has been reported that the biggest portion of errors that occur in VLSI circuits and memories are unidirectional errors (UE) [19–21] because these errors shift threshold voltage levels to either the positive or negative side [22], causing the circuit node logic to change from “0” to “1” or from “1” to “0,” but not both at the same time.
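The distinction between unidirectional and bidirectional errors can be made concrete with a short Python sketch. The function name `classify_error` and the use of bit strings are illustrative choices, not part of the paper.

```python
def classify_error(sent: str, received: str) -> str:
    """Classify the error between two equal-length binary strings."""
    if len(sent) != len(received):
        raise ValueError("words must have the same length")
    # Count the flips in each direction separately.
    zero_to_one = sum(s == "0" and r == "1" for s, r in zip(sent, received))
    one_to_zero = sum(s == "1" and r == "0" for s, r in zip(sent, received))
    if zero_to_one == 0 and one_to_zero == 0:
        return "none"
    if zero_to_one > 0 and one_to_zero > 0:
        return "bidirectional"
    # All flips go the same way, however many bits are affected.
    return "unidirectional"
```

For instance, "1010" received as "1110" is a unidirectional error, whereas "1010" received as "0110" flips bits in both directions and is bidirectional.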

Many all unidirectional error detection (AUED) schemes have been proposed and implemented, among which the Berger code technique [23] is agreed to be the least redundant. With the ability to detect single- as well as multiple-bit unidirectional errors, this technique provides error detection by simply summing the logic 0’s (a B0 scheme) or 1’s (a B1 scheme) in the information word and expressing the sum in binary. If the information word contains $n$ bits, then a Berger code requires $\lceil \log_2(n + 1) \rceil$ check bits. A Berger code checker employs a 0’s (or 1’s) counter circuitry for reencoding the information word to check bits and then compares it with the preencoded check bits using a two-rail checker [23]. A chain of adders and a tree of two-rail checkers are required to design these checker circuits [23], where area and latency increase drastically as data length increases.
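As a minimal software sketch of the B0 Berger scheme described above (the function names are hypothetical, and bit strings stand in for hardware signals), the check symbol is the count of 0's in the information word, written in binary:

```python
def berger_check_bits(info: str) -> str:
    """Return the B0 Berger check symbol: the number of 0s in `info`, in binary."""
    n = len(info)
    # For n >= 1, n.bit_length() equals ceil(log2(n + 1)), the number of check bits.
    k = n.bit_length()
    return format(info.count("0"), f"0{k}b")


def berger_is_valid(info: str, check: str) -> bool:
    """Re-encode the information word and compare with the received check bits."""
    return berger_check_bits(info) == check
```

A 1-to-0 error in the information word can only increase the 0's count, while a 1-to-0 error in the check symbol can only decrease its value, so the two can never agree again after a unidirectional error. The same argument holds for 0-to-1 errors.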

An m-out-of-n code is one in which all valid code words have exactly “m” 1’s and “n-m” 0’s. These codes can also detect all unidirectional errors when n = 2m. This condition not only increases the code size, but also the checker’s area. Cellular realization of an m-out-of-2m code circuit was deemed by Lala [24] as more area- and delay-efficient than the previous implementations.
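The validity test for an m-out-of-2m code reduces to a weight check, as the following small sketch shows (the function name is an illustrative assumption):

```python
def is_m_out_of_2m(word: str) -> bool:
    """True if `word` has even length 2m and contains exactly m ones."""
    return len(word) % 2 == 0 and word.count("1") == len(word) // 2
```

Any unidirectional error strictly raises or strictly lowers the number of 1's, so the corrupted word can no longer have weight m and is rejected.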

Given the importance of fault tolerance at the HW level in big data and cloud computing applications, in this paper, we present a fault secure (FS) SEDC checker used with SEDC codes [25]. An FS checker has the ability to safely hide or self-check (detect) its own faults as they occur in its circuitry. The SEDC partitions the input data into smaller segments (2, 3, and 4 bits) and encodes them in parallel. This unique scaling feature makes the system faster and less complex to design for any binary data length. The FS SEDC checker inherits all these features of SEDC codes (i.e., simple scalability, constant latency, and less power dissipation) which suits its implementation in online fault detection in processors, cache memories, and NAND Flash-based memories for big data applications. The major contributions of this paper are as follows:
(1) We propose HW-level fault tolerance for circuits designed to process big data and cloud computing applications.
(2) In order to show the effectiveness of the proposed HW-level fault tolerance scheme in a big data scenario, we compare the cost associated with and without the proposed fault tolerance scheme and present results that show a significant reduction in the overall cost of fault tolerance in big data when the proposed HW-based fault tolerance scheme is applied.
(3) We also present a novel FS SEDC checker for use with SEDC-based HW-level fault tolerance systems.
(4) In order to prove the superiority of the FS SEDC checker presented in contrast with state-of-the-art AUED checkers, we show that the FS SEDC checker achieves state-of-the-art performance in terms of area, delay, and power dissipation.

The rest of the paper is organized as follows. We present an overall system diagram of the proposed HW-level fault tolerance system in Section 2. We give a brief mathematical foundation of the SEDC scheme and an example to encode logical circuits using SEDC in Section 3. Design details of the FS SEDC checker are described in Section 4. The proposed checker is shown to be FS through the fault testing methods; and its area, delay, and power comparison with state-of-the-art are derived in Section 5. We compute the fault coverage of the proposed SEDC-based fault tolerance system and present the experimental details and results in Section 5. To show the effectiveness of the proposed method in big data and cloud computation applications, we also perform a cost-performance analysis of fault tolerance at the SW level versus HW level in Section 5. Finally, we conclude the paper in Section 6.

2. Introduction to the Overall System

Figure 1 shows the main components of an HW-level fault tolerance system based on error detecting codes. The functional circuit consists of two subcircuits: an information symbol generator (ISG) and a check symbol generator (CSG). These two circuits do not share any logic. The ISG takes the input, performs some operation, and produces the information bits as its output. The CSG is a carefully chosen logic function that acts as the encoder and generates the check bits from the same input, such that the check bits equal the result of applying the particular coding function to the information bits. The checker normally contains another encoder that reencodes the information bits into a second set of check bits and then compares the two sets. A mismatch between them is treated as an error, which is indicated by the error indication or verification signal.
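The data flow of Figure 1 can be sketched in software as follows. This is only an illustration under stated assumptions: the ISG is a toy 4-bit incrementer, the coding function is a simple 0's count standing in for the SEDC encoder, and all function names are hypothetical.

```python
def zero_count_code(word: str) -> str:
    """Stand-in coding function: the number of 0s in `word`, in binary."""
    return format(word.count("0"), f"0{len(word).bit_length()}b")


def isg(x: int) -> str:
    """Information symbol generator: a toy 4-bit increment."""
    return format((x + 1) % 16, "04b")


def csg(x: int) -> str:
    """Check symbol generator: predicts the check bits from the input alone.

    It recomputes the expected result itself instead of reading isg()'s output,
    mirroring the requirement that the two subcircuits share no logic.
    """
    return zero_count_code(format((x + 1) % 16, "04b"))


def checker(info: str, check: str) -> bool:
    """Re-encode the information bits; return True when an error is indicated."""
    return zero_count_code(info) != check
```

With input 5, `isg` yields "0110" and `csg` yields "010", so the checker raises no error. If a fault turns the information bits into "1110", the re-encoded check bits become "001" and the mismatch is flagged.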

The checker shown in Figure 1 plays a vital role in the overall fault tolerance system. The checker must exhibit a self-checking property or failsafe property to make sure that the whole system is fault secure (FS). If the checker is both self-checking and failsafe, the overall system is said to be totally self-checking (TSC). In order to formally define these properties, let the output of the functional circuit shown in Figure 1 be represented by $F(x, f)$, where $x$ is the input and $f$ is the fault; in fault-free operation, i.e., $f = \emptyset$, the output is represented by $F(x, \emptyset)$. Also, consider the input code space $X$, output code space $Y$, and an assumed fault set $\mathcal{F}$; then, according to the definition of totally self-checking (TSC), the circuit is
(1) self-testing if for each fault in $\mathcal{F}$ there exists at least one input code that produces a noncode output; i.e., $\forall f \in \mathcal{F}, \exists x \in X : F(x, f) \notin Y$,
(2) fault secure (FS) if for all faults in $\mathcal{F}$ and all code inputs in $X$, the output is either correct or is a noncode word; i.e., $\forall f \in \mathcal{F}, \forall x \in X : F(x, f) = F(x, \emptyset)$ or $F(x, f) \notin Y$.
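The two properties can be checked exhaustively on a small model. The sketch below assumes a toy circuit that is not from the paper: a two-rail encoder that maps a bit x to (x, not x), with single stuck-at faults on its two output lines. All names are hypothetical.

```python
from typing import Callable, Iterable, Optional, Set, Tuple

Fault = Optional[Tuple[int, int]]  # (output line, stuck value); None means fault-free


def two_rail(x: int, fault: Fault = None) -> Tuple[int, ...]:
    """Toy circuit: encode bit x as (x, not x), with an optional stuck-at output line."""
    out = [x, 1 - x]
    if fault is not None:
        line, value = fault
        out[line] = value
    return tuple(out)


def is_fault_secure(circuit: Callable, inputs: Iterable, code_space: Set,
                    faults: Iterable) -> bool:
    """Every faulty output is either the correct output or a noncode word."""
    inputs, faults = list(inputs), list(faults)
    return all(
        circuit(x, f) == circuit(x, None) or circuit(x, f) not in code_space
        for f in faults for x in inputs
    )


def is_self_testing(circuit: Callable, inputs: Iterable, code_space: Set,
                    faults: Iterable) -> bool:
    """Every fault is exposed as a noncode output by at least one code input."""
    inputs, faults = list(inputs), list(faults)
    return all(
        any(circuit(x, f) not in code_space for x in inputs)
        for f in faults
    )


INPUTS = [0, 1]
CODE_SPACE = {(0, 1), (1, 0)}
FAULTS = [(line, value) for line in (0, 1) for value in (0, 1)]
```

For this toy circuit both predicates hold, because any stuck-at output either leaves the word unchanged or produces "00" or "11", neither of which is a two-rail code word.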

In the proposed SEDC-based HW-level fault tolerance system, the CSG circuit is realized by an SEDC check symbol generator (SCSG) circuit, which generates the SEDC code words corresponding to the information bits. We presented a realization of an SEDC encoded SCSG circuit in [27], i.e., an SEDC encoded arithmetic logic unit (ALU) of a microprocessor. The SEDC encoded ALU circuit (SCSG) computes the SEDC codes corresponding to the output of the ISG (in [27] a normal ALU). Any fault that causes multiple unidirectional errors at the output of the normal ALU is detected by the SEDC checker. Any logic circuitry including SRAM-based memory cells [28] can be made fault tolerant by encoding them similar to the methods given in [27, 28]. In the next section we briefly introduce the SEDC scheme with an example to encode an adder circuit, while in the rest of the paper we focus on the proposed FS SEDC checker that can be used with any SEDC-based HW-level fault tolerance system.

3. Scalable Error Detection Coding (SEDC) Scheme

The Scalable Error Detection Coding scheme [25] is an AUED scheme formulated and designed in such a way that only the resultant circuit area is scaled, while its latency depends on a small portion of the input data (explained later).

For any binary data word of length $n$ bits, two parameters, $a$ and $b$, are computed using
$$n = 3a + b, \tag{1}$$
where parameter $a$ can only take a positive integer value and parameter $b \in \{2, 3, 4\}$. Satisfying the condition for parameter $b$, the maximum possible value for parameter $a$ is selected. The length of the corresponding SEDC code word, in bits, is computed by
$$2a + \lceil \log_2(b + 1) \rceil. \tag{2}$$
After computing the values for parameters $a$ and $b$, the SEDC code for the binary data is computed. SEDC is designed to generate codes basically for 2-, 3-, and 4-bit data and is accordingly referred to as the SEDC2, SEDC3, and SEDC4 scheme, respectively. It is then extended for any integer values of $a$, as shown in Figure 2(a).
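The segmentation step can be sketched in a few lines of Python. The sketch assumes that (1) has the form n = 3a + b with b restricted to 2, 3, or 4, which is what the 7-bit and 8-bit examples later in the paper imply. The per-segment check-bit counts (2 bits for SEDC2 and SEDC3, 3 bits for SEDC4) are taken from the descriptions of those codes, and the function names are hypothetical.

```python
from typing import Tuple

# Check bits produced per segment width, as described for SEDC2, SEDC3, and SEDC4.
SEGMENT_CHECK_BITS = {2: 2, 3: 2, 4: 3}


def sedc_params(n: int) -> Tuple[int, int]:
    """Return (a, b) with n = 3a + b, b in {2, 3, 4}, and a as large as possible."""
    if n < 2:
        raise ValueError("this sketch covers data lengths of 2 bits or more")
    a = (n - 2) // 3
    b = n - 3 * a
    return a, b


def sedc_code_length(n: int) -> int:
    """Total check bits: 2 per 3-bit segment plus those of the b-bit segment."""
    a, b = sedc_params(n)
    return 2 * a + SEGMENT_CHECK_BITS[b]
```

For example, `sedc_params(8)` gives (2, 2), i.e., one 2-bit segment and two 3-bit segments, and `sedc_code_length(5)` gives 4, matching the 4-bit code for the 5-bit adder output in Section 3.4.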

3.1. SEDC2 Code

A two-dimensional (2D) illustration of a 2-bit SEDC (SEDC2) scheme is shown in Figure 2(b), where nodes represent data words, and their corresponding code words are written in brackets.

The SEDC coding scheme assigns code words to different data words with a unique criterion. Whenever there is a change of a bit (or bits) in a data word from “1” → “0,” as shown with a bold arrow in Figure 2(b), the change is reflected in the code word in the opposite way; i.e., the code changes from “0” → “1,” as shown with the dashed arrow in Figure 2(b), and vice versa. Equation (3) is used to assign 2-bit code words to the 2-bit data words. Interchanging the two bit positions gives another variant of SEDC2 codes and does not affect the code characteristics. In (3), the SEDC code bits are concatenated and are derived with logical operations, and SEDC2 is the basic coding scheme.

3.2. SEDC3 Code

The SEDC3 code for 3-bit data is computed using (4), where the bar sign represents the logical NOT operation.

Figure 3 shows a 3D cube, illustrating the unidirectional error detection mechanism of SEDC3 codes. The same notations are used in Figure 3 as in Figure 2(b). The dashed side of the cube represents the embedded SEDC2 coding scheme in SEDC3. Note that when there is a 2-bit unidirectional change in data word “001” → “111” (the two MSBs changing from “00” → “11”), the code changes in the opposite direction (the least significant bit of the code changes from “1” → “0”). In a similar way, the scheme detects all unidirectional errors in the data word.

3.3. SEDC4 Code

An SEDC4 code for 4-bit data is formulated by (5). The MSB of the code word is completely dependent upon the MSB of the data word for SEDC4; hence, any change in the MSB of the data word is detected. The remaining three data bits are encoded using the same SEDC3 scheme.

It can be observed from (3), (4), and (5) that the SEDC2 is embedded in 3-bit SEDC (SEDC3) and consequently in 4-bit SEDC (SEDC4) to detect all unidirectional errors in 3-bit and 4-bit data, as shown later. This ability to scale codes is not present in any other concurrent error detecting (CED) coding scheme.

In general, for $n > 4$, the $n$-bit binary data is grouped into one $b$-bit segment and $a$ 3-bit segments, and then these segments are encoded using one SEDC$b$ module and $a$ SEDC3 modules in parallel, as shown in Figure 2(a). It is noteworthy that each group of data segments and corresponding code segments is independent of each other. This independence makes our scheme scalable and able to detect some portion of bidirectional errors (BE) (discussed in Section 5.3).

If we interchange and for SEDC3 in Figure 3, the corresponding SEDC3 code is equal to Berger codes for a 3-bit segment, but our way of deriving the SEDC3 code is a lot different from that of Berger codes. SEDC3 codes are basically scaled from SEDC2 codes, and SEDC2 codes have no commonality with 2-bit Berger codes.

3.4. SEDC-Based HW-Level Fault Tolerance System Example

In order to illustrate the design of an HW-level fault tolerance system using the SEDC scheme, we take the example of a 4-bit adder. Let us consider that this 4-bit adder is a part of a processor which processes big data applications, and we want to make this 4-bit adder fault tolerant against transient errors that arise in its circuitry, so the general HW-level fault tolerance system diagram shown in Figure 1 will be converted to the one shown in Figure 4. As shown in Figure 4, the 4-bit adder acts as an ISG and its equivalent SEDC encoder acts as a CSG. The SEDC encoder or CSG can be implemented using (6). Because the output of the 4-bit adder is a 5-bit value, the equivalent SEDC code has a 4-bit value according to (2). We used Altera’s Quartus II software to synthesize the 4-bit adder (ISG), SEDC encoder (CSG), and the SEDC checker shown in Figure 4 and utilized the synthesized circuit for computing the fault coverage of the SEDC scheme, which is presented in Section 5.3. In the next section, we present the proposed FS SEDC checker, which completes the overall proposed SEDC-based HW-level fault tolerance system.

4. The FS SEDC Checker

As shown in Figure 4, the FS SEDC checker takes the $n$ information bits and the SEDC check bits from the functional unit. The FS SEDC checker is also composed of one $b$-bit FS SEDC checker and $a$ 3-bit FS SEDC checkers. With 1-, 2-, and 3-bit FS SEDC checkers, the output can be directly used as an error indication signal, but for wider inputs, one level of wired-AND-OR logic gates is used to combine all the outputs of the subblocks of FS SEDC checkers and generate the 2-bit error indication signal. The following subsections discuss logic and circuit diagrams for primitive FS SEDC checkers (SEDC1, SEDC2, SEDC3, and SEDC4 checkers) which can be used to scale the SEDC checker to an $n$-bit FS SEDC checker (i.e., an FS SEDC$n$ checker).

4.1. The FS SEDC1 Checker

Table 1 shows the logic for a 1-bit SEDC (FS SEDC1) checker. The valid input code words are “10” and “01” and the valid output code word is “10”. The inputs are the 1-bit information word, which is the output of the ISG, and the 1-bit SEDC check bit generated by the SEDC check symbol generator (SCSG). The output is the 2-bit error indication signal of the FS SEDC1 checker, whose two bits are generated by the circuits shown in Figure 5(a).

4.2. The FS SEDC2 Checker

In Figure 5, the symbols P1-P13 and N1-N13 represent the PMOS and NMOS transistors, respectively, and Vss represents the voltage supply. For simplicity, we used the CMOS-based implementation of SEDC checker circuits. Any other technology can be used to design these circuits, but the underlying algorithm, i.e., SEDC, will remain the same.

4.3. The FS SEDC3 Checker

Figure 6(a) shows the block diagram and the logic for a 3-bit FS SEDC checker. Three-bit data from the ISG and 2-bit SEDC check bits from the SCSG are first converted to and , respectively, and then are checked using the same 2-bit FS SEDC checker, as shown in Figure 6(a). When the bit is “1,” and are inverted, whereas if is “0,” then and remain the same. As the outputs of the XOR gates are fed to the FS SEDC2 checker, any error in the XOR gates is detected. This makes the overall 3-bit SEDC checker FS.

4.4. The FS SEDC4 Checker

A 4-bit FS SEDC checker consists of one FS SEDC1 checker and one FS SEDC3 checker, as shown in Figure 6(b). Both the SEDC1 and SEDC3 checkers generate a 2-bit output. Because the valid code word is “10,” to make sure that both checker units generate the “10” output during error-free operation, we “AND” the first output bit of the FS SEDC1 checker with the first output bit of the FS SEDC3 checker. Also, we “OR” the second output bits of both FS SEDC checkers using wired logic gates. We checked and confirmed by fault simulation that wired-AND and wired-OR gates are also FS for single faults (stuck-at-0, stuck-at-1, transistor-stuck-on, and transistor-stuck-off).

4.5. The FS SEDCn Checker

Like the SEDC code generator, the FS SEDC checker also consists of multiple 1-, 2-, and 3-bit FS SEDC checkers, depending upon the value of $a$ and $b$ from (1). For example, if $n = 8$ bits, then (1) ⇒ $a = 2$ and $b = 2$. This requires one FS SEDC2 checker and two FS SEDC3 checkers to realize an 8-bit FS SEDC checker.

The area of the wired-AND-OR gates will also increase as $n$ is increased. Figure 7 shows the block diagram of an $n$-bit FS SEDC checker. For $n = 8$ bits, there will be a total of three FS SEDC checkers, each with a 2-bit output; hence, a 3-input wired-AND and a 3-input wired-OR gate are required to combine all the output bits. In general, for an $n$-bit input, there are $a + 1$ FS SEDC checkers, each with a 2-bit output, so we require $(a + 1)$-input wired-AND and wired-OR gates. With each additional input to the wired-AND-OR network, one extra transistor is required by each of the wired gates. This causes the circuit to expand width-wise; hence, the latency of the wired logic remains constant for any value of $n$.
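The combining logic can be modelled in a few lines. The sketch assumes, following the valid output word “10,” that the first output bits of all subblocks are ANDed and the second output bits are ORed; the function name and the string representation are illustrative only.

```python
from typing import Iterable


def combine_checker_outputs(outputs: Iterable[str]) -> str:
    """Merge the 2-bit outputs of the checker subblocks into one 2-bit signal.

    Assumption: bit 0 of every subblock feeds the wired-AND gate and
    bit 1 feeds the wired-OR gate, so "10" survives only if all agree on it.
    """
    outputs = list(outputs)
    first = all(o[0] == "1" for o in outputs)   # wired-AND of the first bits
    second = any(o[1] == "1" for o in outputs)  # wired-OR of the second bits
    return f"{int(first)}{int(second)}"
```

With three subblocks, as in the 8-bit case, `combine_checker_outputs(["10", "10", "10"])` yields "10", while a single subblock reporting "00", "01", or "11" forces the combined signal away from "10".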

The size of the load transistor driving these wired-AND and -OR gates will also increase with increasing input, so we consider the maximum fan-in of one gate as equal to 4. For , an extra load transistor is connected in parallel. Generally, for k-inputs we require load transistors. A total of transistors is required to design the k-input wired AND-OR network with a constant latency of 1 transistor.

5. Experiments and Results

In this section, we present the experiments we conducted on the proposed FS SEDC checker and the overall proposed SEDC-based HW-level fault tolerance system. The results of each experiment are given along with the experimental details in the subsections below.

5.1. Fault Test on FS SEDC Checker

The FS SEDC1, SEDC2, SEDC3, and SEDC4 circuits in our paper were tested for stuck-at-0, stuck-at-1, transistor-stuck-ON, and transistor-stuck-OFF faults. We assume a single-fault model where faults occur one at a time, and there is enough time between detection of the first fault and the occurrence of another fault [29]. In Table 2, we provide a summary of fault analysis of an SEDC1 checker circuit. We applied one fault at a time in the circuit of Figure 5(a) and observed the output. In single-fault operation, the circuit either produced the correct output or an invalid code word, and never produced an incorrect valid code word (exhibiting the FS property), as shown in Table 2.

Case 1 (transistor stuck ON). In Table 2, we show all six cases of transistor stuck ON faults (one at a time). For the cases with N3 or N4 stuck ON, the circuit shows fault detection by one input code combination (represented with * symbol), and hence, the circuit is self-testing, whereas other cases showed that the circuit is fault secure as well as code disjoint.

Case 2 (transistor stuck OFF). In Table 2, all six cases for transistor stuck OFF faults are shown. In cases where N1 or N2 was stuck OFF, the circuit demonstrates the self-testing property (represented with * symbol) and for the rest of the cases, the circuit is fault secure.

Case 3 (input stuck at 0). When input G0 or S0 is stuck at 0, the circuit demonstrates the self-testing property; otherwise, it remains fault secure.

Case 4 (input stuck at 1). When input G0 or S0 is stuck at 1, the circuit shows the self-testing property; otherwise, it remains fault secure.

There is one case where the output becomes floating (i.e., P3 or P4 stuck OFF). In either case, if we consider the floating voltage as logic high, then the circuit is fault secure, and if we consider the floating voltage as logic low, then the circuit is self-testing. Hence, we can say that the circuit in Figure 5(a), which is a 1-bit SEDC checker, is FS. Similar analysis was carried out when testing 2-, 3-, and 4-bit SEDC checkers, and we found that all these checkers are FS.

5.2. Area, Delay, and Power Comparison

In this section, we compare the area and delay of TSC Berger, FS SEDC, and m-out-of-2m code checkers. We use the two possible TSC Berger checker implementations from Piestrak et al. [23] and Pierce Jr. and Lala [26], with the m-out-of-2m code checker from Lala [24] for comparison. For the sake of fairness, the area overhead was measured in terms of the number of equivalent transistors. We made use of the assumptions by Smith [30] to translate gate-level circuits to transistor-level circuits.

Before comparison, we illustrate the functional dissimilarities of the three checkers with the help of Figure 8. Figure 8(a) shows the general block diagram of a TSC Berger code checker. For all the information symbols that the ISG of the functional circuit can produce in normal operation, the check symbol complement generator (CSCG) outputs the bit-by-bit complement of the expected check symbol. The TSC two-rail checker validates that each bit of the CSCG output is the complement of the corresponding bit of the received check symbol. As the size of the input data increases, the length of the check symbol also increases, resulting in a longer TSC two-rail checker tree and hence a larger delay.

A general block diagram of a TSC m-out-of-2m code checker is shown in Figure 8(b). The checker takes the information bits and check bits and partitions them into two parts. The numbers of 1’s, i.e., the weight, of both parts are mapped to a pair of values which in binary belongs to a code, in most cases a two-rail code. The checker consists of a cellular structure of AND-OR gates as given by Lala [24].

Figure 8(c) depicts the general block diagram for an FS SEDC checker that resembles the structure of an m-out-of-2m code checker and differs from a Berger code checker. The FS SEDC checker block receives the information and check bits from the functional unit. If the input data length increases, the size of the FS checker block increases width-wise. The FS block contains $k$ small SEDC checkers (subblocks). Each subblock of the FS SEDC checker produces “10” as the valid code output. The overall SEDC checker has a final 2-bit output; unlike two-rail codes, only one of the output combinations, “10,” is considered a valid code word. A nonvalid checker output “00,” “01,” or “11” indicates the presence of a fault in the functional circuit or the FS checker itself. The $k$-input wired AND-OR network takes the $k$ pairs of outputs from the SEDC checker subblocks and converts them into a final 2-bit error indication signal.

5.2.1. Area Overhead

The area-optimized realization of TSC Berger code checkers in Piestrak et al. [23] showed less area overhead than m-out-of-2m code checkers, which is apparent from Figure 9. However, if we consider the delay-optimized implementation of the TSC Berger code checker from Pierce Jr. and Lala [26], we see that the TSC Berger code checker requires more area than the FS SEDC and m-out-of-2m code checkers [24], as shown in Table 3. For clarity, we list the area overhead of the code storage and of the code checker separately in Table 3. Also listed separately is the area overhead required by the TRC tree for the TSC Berger code checker and by the wired-AND-OR network for the FS SEDC and m-out-of-2m code checkers.

For a fair comparison, the extra cost of the code storage area is also taken into account. We assumed that 1-bit storage is implemented by 12-MOS transistors [30]. Table 3 lists the area (in terms of the number of transistors) occupied by FS SEDC, delay-optimized Berger code, and m-out-of-2m code checkers for up to 32-bit data.

The FS checker block shown in Figure 8(c) requires fewer gates, implemented with [26 + (a × 50)] MOS transistors if “b = 2,” [50 + (a × 50)] MOS transistors if “b = 3,” and [58 + (a × 50)] MOS transistors if “b = 4.” The m-out-of-2m code checker implementation of Lala [24] requires $2m^2 - 2m + 2$ gates. The gate-level circuit is also translated to transistor-level circuits using data from Smith [30].

The results show that when scaling a 7-bit 0's counter to an 8-bit 0's counter, 154 extra MOS transistors are required. The m-out-of-2m code checker requires 60 extra MOS transistors when scaling a 7-out-of-14 checker to an 8-out-of-16 checker, whereas the SEDC checker requires only 18 extra MOS transistors. That is because a 7-bit SEDC checker is implemented with one SEDC3 and one SEDC4 circuit that contain 50 and 58 MOS transistors, respectively (a total of 108 transistors). An 8-bit SEDC checker is implemented using one SEDC2 and two SEDC3 checkers, requiring 26 and 100 (50 × 2) MOS transistors (a total of 126 transistors). This means that, for this scaling step, SEDC saves 88% of the additional transistors compared to a Berger code checker [26] and 70% compared to m-out-of-2m code checkers. Although Berger and m-out-of-2m checkers are TSC, while the proposed SEDC checker is only FS, all three checkers provide the same fault security.
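The transistor counts quoted above can be reproduced from the per-block figures. The sketch below assumes the segmentation n = 3a + b with b in {2, 3, 4} and counts only the checker subblocks (26, 50, and 58 transistors for the 2-, 3-, and 4-bit blocks), excluding the wired-AND-OR network and code storage, which Table 3 lists separately. The function name is hypothetical.

```python
# Transistors of the b-bit subblock, from the figures given in Section 5.2.1.
BASE_TRANSISTORS = {2: 26, 3: 50, 4: 58}
SEDC3_TRANSISTORS = 50


def sedc_checker_transistors(n: int) -> int:
    """MOS transistors in the subblocks of an n-bit FS SEDC checker (n >= 2)."""
    if n < 2:
        raise ValueError("this sketch covers data lengths of 2 bits or more")
    a = (n - 2) // 3          # number of 3-bit subblocks
    b = n - 3 * a             # width of the remaining subblock: 2, 3, or 4
    return BASE_TRANSISTORS[b] + SEDC3_TRANSISTORS * a


extra_sedc = sedc_checker_transistors(8) - sedc_checker_transistors(7)
saving_vs_berger = 1 - extra_sedc / 154      # Berger needs 154 extra transistors
saving_vs_m_out_of_2m = 1 - extra_sedc / 60  # m-out-of-2m needs 60 extra
```

The 7-bit and 8-bit checkers come out at 108 and 126 transistors, so the step costs 18 transistors, which is about 88% less than the Berger increment and 70% less than the m-out-of-2m increment.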

5.2.2. Delay

As far as delay is concerned, the FS SEDC checker also performs better than Berger and cellular implementations for an m-out-of-2m code checker, as shown in Table 4. For the sake of uniformity, we designed all the basic gates using the same technology transistors (PMOS = 8μ/2μ, NMOS = 4μ/2μ) and evaluated the worst-case propagation delay of each circuit.

The SEDC checker shows almost a constant delay for n > 3 bits due to its parallel implementation, whereas the delay in the Berger code checker increases owing to an increase in gate levels (from 6 to 16) in the critical path, as shown by Pierce Jr. and Lala [26]. The delay for m-out-of-2m code checkers also continues to increase with increasing data lengths because the cellular implementation requires “m (= input data length)” gate levels in the critical path.

5.2.3. Power Dissipation

In order to evaluate the power dissipation of the three checkers, we used the PowerPlay power analyzer tool. We implemented the Berger [26], m-out-of-2m [24], and SEDC checkers in Verilog and synthesized the circuits using Altera’s Quartus II software. We targeted a Cyclone II EP2C5AF256A7 chip, which has the lowest power dissipation in the Cyclone family. We allowed the synthesizer to balance area and delay during synthesis in order to get a better power estimate. We also enabled the synthesis option that takes intensive steps to optimize power for all three circuits. We drove the inputs of each circuit at the default toggle rate and estimated the total thermal power dissipation for different values of input data width.

Figure 10(a) shows a comparison of power dissipation between the three checkers. The Berger and m-out-of-2m checkers exhibited a sudden increase in power dissipation when the input data width was changed from 16 bits to 32 bits, while SEDC showed a minimal change. This is due to the increase in the number of two-rail checkers in the case of the Berger checker and to the growth of the checker circuitry itself in the case of the m-out-of-2m checker. The same trend is evident in Figure 10(b), which depicts an area comparison between the three checkers in terms of the number of logic elements (LEs) occupied by the checkers.

5.3. Fault Coverage of the Proposed HW-Level Fault Tolerance Scheme

To demonstrate the effectiveness of the SEDC CSG and its FS checker, we computed the fault coverage of the proposed SEDC-based HW-level fault tolerance scheme. We applied faults to the example circuit of Figure 4, given in Section 3.4. Because most VLSI combinational circuits designed for arithmetic operations, such as addition, subtraction, multiplication, and division, consist of multiple instances of 1-bit adders (full adders), the example circuit, i.e., a 4-bit adder, is a simple and suitable candidate for presenting the effectiveness of our scheme. We injected two major types of transient errors, i.e., stuck-at-0 and stuck-at-1 [29], at 24 nodes (6 nodes per full adder, as shown in Figure 11(b)). We injected these errors using 2-to-1 multiplexers inserted at these nodes and controlled by the fault enabling and fault signals. In Figure 11(a), the symbols A[3:0], B[3:0], Cin, f_enable, and F[23:0] denote the 4-bit input A, 4-bit input B, 1-bit carry-in, 1-bit fault enabling signal, and 24-bit fault signals, respectively, while Cout is the carry-out and S[3:0] represents the 4-bit sum output of the 4-bit adder. Figure 11(b) shows the detailed schematic of a single full adder.

We considered that the faults can occur at the outputs of the logic gates only and adopted a single-fault model according to which only one fault can occur at a time [29]. We used Altera’s Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.
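
The multiplexer-based injection under a single-fault model can be illustrated in software. The Python sketch below is a behavioural model written for this purpose, not the Verilog design used in the experiments. It assumes a textbook full adder with five gate outputs as fault nodes (the circuit in Figure 11(b) uses six nodes, whose positions are not reproduced here), and the way the multiplexer select is driven is likewise an assumption. All names are hypothetical.

```python
from itertools import product


def mux2(select: int, in0: int, in1: int) -> int:
    """2-to-1 multiplexer: returns in0 when select is 0, else in1."""
    return in1 if select else in0


def full_adder(a, b, cin, fault=None):
    """1-bit full adder with a fault-injection mux on each gate output.

    `fault` is None (fault-free) or a (node_name, stuck_value) pair.
    """
    def node(name, value):
        active = fault is not None and fault[0] == name
        # The mux passes the gate output, or the stuck-at value if selected.
        return mux2(active, value, fault[1] if active else 0)

    x1 = node("x1", a ^ b)          # first XOR
    s = node("s", x1 ^ cin)         # sum output
    g = node("g", a & b)            # carry generate
    p = node("p", x1 & cin)         # carry propagate term
    cout = node("cout", g | p)      # carry output
    return s, cout


def adder4(a, b, cin, fault=None):
    """4-bit ripple-carry adder; `fault` is (stage, node_name, stuck_value)."""
    carry, total = cin, 0
    for stage in range(4):
        local = fault[1:] if fault and fault[0] == stage else None
        s, carry = full_adder((a >> stage) & 1, (b >> stage) & 1, carry, local)
        total |= s << stage
    return total | (carry << 4)     # 5-bit result: Cout followed by S[3:0]


# Single-fault model: exactly one (stage, node, stuck value) is active per run.
NODES = ("x1", "s", "g", "p", "cout")
injected = erroneous = 0
for stage, name, stuck in product(range(4), NODES, (0, 1)):
    for a, b, cin in product(range(16), range(16), (0, 1)):
        injected += 1
        if adder4(a, b, cin, (stage, name, stuck)) != a + b + cin:
            erroneous += 1
print(injected, erroneous)
```

Because the node placement and the set of applied input vectors differ from the experimental setup, the counts printed by this sketch are not expected to match those in Table 5; it only shows how a self-checking comparison against the fault-free sum separates injected faults from faults that actually produce an output error.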

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these error-causing faults resulted in bidirectional errors (BEs), while most of the faults caused unidirectional errors (UEs). This is consistent with the observation that most of the errors in VLSI circuits result in UEs at the output [19–21]. Even though SEDC is an AUED scheme that provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. This is because SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of BEs is also partitioned into multiple UEs and thus detected by the proposed SEDC scheme.
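
The partitioning effect described above can be made concrete with a small example. The Python sketch below classifies an output error as unidirectional or bidirectional and then repeats the classification per segment. It does not model the SEDC code words or the checker, so it shows only why segmentation can turn a BE into several UEs, not whether a particular error is detected. The segment widths (2, 3) and the function names are illustrative assumptions.

```python
def classify_error(expected: int, observed: int) -> str:
    """Classify an output error as 'none', 'UE', or 'BE'.

    UE: all flipped bits change in one direction (only 0->1 or only 1->0).
    BE: at least one 0->1 flip and at least one 1->0 flip.
    """
    rising = ~expected & observed     # bits that went 0 -> 1
    falling = expected & ~observed    # bits that went 1 -> 0
    if not rising and not falling:
        return "none"
    return "BE" if rising and falling else "UE"


def classify_by_segment(expected, observed, widths):
    """Classify the error separately in each segment (LSB segment first)."""
    results, shift = [], 0
    for w in widths:
        mask = (1 << w) - 1
        results.append(classify_error((expected >> shift) & mask,
                                      (observed >> shift) & mask))
        shift += w
    return results


# A 5-bit word whose two low bits rise while two high bits fall.
expected, observed = 0b11000, 0b00011
print(classify_error(expected, observed))                 # whole word
print(classify_by_segment(expected, observed, (2, 3)))    # per segment
```

In this example the error is bidirectional over the whole word, but each segment on its own sees flips in only one direction, which is the situation in which an independently encoded segment can flag the error.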

5.4. Cost Analysis: SW-Based Fault Tolerance Versus HW-Based Fault Tolerance

In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied, and (b) when only SW-based fault tolerance is applied. For simplicity in our analysis, we take the example of a coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image, stored on a secondary NameNode (SNN) (also called DataNode) or Checkpoint Node, or in some cases on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service, and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of the failed SNN data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].
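
The heartbeat timeout that triggers recovery can be sketched as follows. This is a minimal Python illustration of the mechanism described above, not HDFS code: the class `HeartbeatMonitor`, the node names, and the 30-second timeout are hypothetical, and time is passed in explicitly so that the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each node (hypothetical helper)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, node_id: str, now_s: float) -> None:
        """Record a heartbeat received from `node_id` at time `now_s`."""
        self.last_seen[node_id] = now_s

    def out_of_service(self, now_s: float) -> list:
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now_s - seen > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=30.0)   # illustrative timeout value
monitor.heartbeat("snn-1", now_s=0.0)
monitor.heartbeat("snn-2", now_s=0.0)
monitor.heartbeat("snn-1", now_s=25.0)       # snn-2 stays silent
failed = monitor.out_of_service(now_s=40.0)
print(failed)
```

In a real deployment, the list of silent nodes would be the input to the coordinated recovery step, which re-replicates their data from the checkpoint nodes.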

For our cost analysis, we compute the cost associated with the CC data recovery algorithm. We assume a cloud application, such as a message passing interface (MPI) program, that comprises logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends messages to the remaining processes with equal probabilities. We also consider the message sending, checkpointing, and fault occurrence events to be independent of each other. Assuming that a process is modelled as a sequence of deterministic events, i.e., every step taken by the process has a known outcome, and that failure occurs only during message passing, with equal probability, and not during checkpointing or recovery, we use the analytical cost model given in [4] for the cost analysis of fault tolerance at the SW level.

According to [4], the total cost of fault tolerance per process is expressed in terms of the total execution time of a process without fault tolerance and the checkpointing and failure recovery overheads. The average recovery cost in CC per process is obtained from the average time to roll back a failed process and the mean time between failures, which is in turn determined by the probability of failure.

The checkpointing cost is derived from the probability that a process starts checkpointing. This probability gives the probability that none of the processes starts checkpointing and, as its complement, the probability that at least one process starts a checkpoint, which in turn determines the checkpointing interval. A process can be the initiator of checkpointing, in which case it generates request (REQ) and acknowledgement (ACK) signals to the rest of the noninitiators, or it can be a noninitiator, in which case it generates only one ACK signal in response to the initiator. Weighting these two cases by their probabilities gives the average number of messages generated per checkpoint. The average overhead per checkpoint is expressed in terms of the average time to write a checkpoint to a stable node and the average network latency, and the average checkpointing cost for a process follows from it.

Using the cost model given in (9), (10), and (11), we evaluated the cost of data recovery in the CC algorithm with the parameter values given in [4], including a checkpointing frequency of one checkpoint per 15 minutes. Without HW-level fault tolerance, we assume that 100% of the faults in hardware are propagated to the SW level and that one fault occurs every 168 hours (one week). After we apply HW-level fault tolerance, the probability of failure is multiplied by 0.0755, which signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters while keeping the others constant and observe the effect on the data recovery cost with and without the proposed HW-level fault tolerance.
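
The 7.55% share of unhandled faults follows directly from the fault coverage figures of Section 5.3, and its effect on the failure rate seen at the SW level can be worked out numerically. The Python sketch below makes two assumptions that should be read as such: that the 14.42% BE share refers to the faults that caused an output error, and that "one fault per 168 hours" can be treated as a failure probability of 1/168 per hour. The variable names are illustrative.

```python
BE_SHARE = 0.1442        # share of erroneous outputs that are bidirectional
BE_DETECTED = 0.4762     # share of bidirectional errors that SEDC detects
MTBF_HOURS = 168.0       # one fault per week without HW-level protection

# Unidirectional errors are always detected, so only undetected
# bidirectional errors propagate to the SW level.
unhandled = BE_SHARE * (1.0 - BE_DETECTED)

p_fail_sw_only = 1.0 / MTBF_HOURS            # assumed per-hour probability
p_fail_with_hw = unhandled * p_fail_sw_only  # after HW-level fault tolerance
effective_mtbf_hours = 1.0 / p_fail_with_hw

print(f"unhandled fraction: {unhandled:.4f}")
print(f"failures per hour: {p_fail_sw_only:.6f} -> {p_fail_with_hw:.6f}")
print(f"time between SW-visible failures: {effective_mtbf_hours:.0f} h")
```

Under these assumptions, the interval between failures that reach the SW level grows from one week to roughly 2,200 hours, i.e., about three months, which is why the recovery cost curves with HW-level fault tolerance stay much lower.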

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into processes and each process runs on a virtual machine. An increase in the number of processes causes a sharp increase in data recovery cost in the CC algorithm because all processes have to coordinate with one another in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery because it takes a longer time for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of the increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which had a large impact on the fault tolerance overhead. However, if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.

Because of the errors arising at the HW level, the average cost of data recovery in terms of percent increase in runtime in all of the above cases is much higher if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures that could lead to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains failsafe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with the Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the SEDC checker outperformed both on all three metrics. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47.62%, and 92.45% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution such as microprocessors, memories, and parallel computing devices, to system-level advancements such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed considerably since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data, at both the SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University, 2017; the Sogang University Research Grant of 2012 (201210056.01); and MSIP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).