Abstract
Computing systems with field-programmable gate arrays (FPGAs) often achieve fault tolerance in high-energy radiation environments via triple-modular redundancy (TMR) and configuration scrubbing. Although effective, TMR suffers from a 3x area overhead, which can be prohibitive for many embedded usage scenarios. Furthermore, this overhead often worsens because TMR must frequently be applied to existing register-transfer-level (RTL) code that designers created without considering the triplicated resource requirements. Although a designer could redesign the RTL code to reduce resources, modifying RTL schedules and resource allocations is a time-consuming and error-prone process. In this paper, we present a more transparent high-level synthesis approach that uses scheduling and binding to provide attractive tradeoffs between area, performance, and redundancy, while focusing on FPGA implementation considerations, such as resource realization costs, to produce more efficient architectures. Compared to TMR applied to existing RTL, our approach shows resource savings up to 80% with average resource savings of 34% and an average clock degradation of 6%. Compared to a previous approach, our approach shows resource savings up to 74% with average resource savings of 19% and an average heuristic execution time improvement of 96x.
1. Introduction
Recently, computing systems in space and other extreme environments with high-energy radiation (e.g., high-energy physics, high altitudes) have been turning to field-programmable gate arrays (FPGAs) to meet performance and power constraints not met by other computing technologies [1]. One challenge for FPGAs in these environments is susceptibility to radiation-induced single-event upsets (SEUs), which can alter the functionality of a design by changing bits in memories and flip-flops. Although radiation-hardened FPGAs exist, some of those devices are still susceptible to SEUs and commonly have prohibitive costs compared to commercial-off-the-shelf (COTS) devices [2].
To mitigate these issues, designers often use triple-modular redundancy (TMR) on COTS FPGAs. TMR is a well-known form of hardware redundancy that replicates a design into three independent modules with a voter at the outputs to detect and correct errors. Research has shown that TMR with frequent configuration scrubbing (i.e., reconfiguring faulty resources) provides an effective level of fault tolerance for many FPGA-based space applications [3].
One key disadvantage of TMR is the 3x resource overhead, which often requires large FPGAs that may exceed cost or power constraints for embedded systems. Although resource sharing is a common strategy for reducing this overhead, designers often apply TMR to register-transfer-level (RTL) code, where exploring resource-sharing and scheduling options is often impractical due to the productivity challenges involved. Furthermore, RTL code is not available for the common case of using encrypted or pre-synthesized IP cores, making such exploration impossible.
In this paper, we automate this exploration during high-level synthesis (HLS) by integrating resource sharing and TMR into a scheduling and binding heuristic called the Force-Directed Fault-Tolerance-Aware (FDFTA) heuristic, which provides attractive tradeoffs between performance, area, and redundancy. More specifically, the heuristic explores how varying the amount of hardware redundancy affects the design's capability to correct an error, which we measure as an error-correction percentage. Our heuristic is motivated by the observation that, for many situations, error correction is not as critical as error detection. By allowing a designer to specify an error-correction percentage constraint that is appropriate for their application, our heuristic can explore numerous tradeoffs between performance and area that would be impractical to explore manually.
Although other FPGA work has also approached fault tolerance through high-level synthesis, that earlier work mainly focused on different fault models or reliability goals. For example, Golshan et al. [4] introduced an HLS approach for minimizing the impact of SEUs on the configuration stream, which is complementary to our approach. Shastri et al. [5] presented a conceptually similar HLS approach, but that work focused on minimizing the number of coarse-grained resources without considering their FPGA implementation costs, while also suffering from long execution times and the need for manual parameter tuning. This paper presents an extension of [5] that addresses these limitations with an automated approach, showing resource savings of up to 74% compared to the earlier approach, while also reducing heuristic execution time by 96x on average.
Similarly, high-level synthesis for ASICs has introduced conceptually similar techniques [6], but whereas ASIC approaches must deal with transient errors, FPGAs must pay special attention to SEUs in configuration memory, which persist until scrubbing or reconfiguration (referred to as semi-permanent errors for simplicity). Due to these semi-permanent errors, FPGA approaches require significantly different HLS strategies.
Compared to the common strategy of applying TMR to existing RTL code, our heuristic has average resource savings of 34% and displays significant improvements as the latency constraint and the benchmark size increase. For a latency constraint of 2x the minimum-possible latency, our heuristic shows average resource savings of 47%, reaching a maximum of 80% resource savings on the largest benchmark.
The paper is organized as follows. Section 2 describes related work in areas such as FPGA reliability and fault-tolerant HLS. Section 3 defines the problem and the assumptions of our fault model. Section 4 describes the approach and implementation of the FDFTA heuristic. Section 5 explains the experiments used to evaluate the FDFTA heuristic, and Section 6 presents conclusions from the study.
2. Related Work
With growing interest in FPGAs operating in extreme environments, especially within space systems, a number of studies have assessed FPGA reliability in these environments. Some of these studies analyze FPGA reliability through models. For example, Ostler et al. [7] investigated the viability of SRAM-based FPGAs in Earth-orbit environments by presenting a reliability model for estimating the mean time to failure (MTTF) of SRAM FPGA designs in specific orbits and orbital conditions. Similarly, Héron et al. [8] introduced an FPGA reliability model and presented a case study of its application on a XC2V3000 under a number of soft IP cores and benchmarks. In addition to reliability models, a number of studies have emulated and simulated faults in FPGAs (e.g., [9, 10]). Rather than relying on costly testing in a radiation beam, such approaches facilitate cost-effective testing of fault-tolerant designs. Although many studies analyze SRAM-based technologies, other studies have also considered antifuse and flash FPGAs. For example, McCollum compares the reliability of antifuse FPGAs to ASICs [11]. Wirthlin highlights the effects of radiation on all three types of FPGAs and explores the challenges of deploying FPGAs in extreme environments, such as space systems and high-energy physics experiments [2]. In this paper, we complement these earlier studies by presenting an HLS heuristic that transparently adds redundancy to improve the reliability of SRAM-based FPGA designs. As described in Section 3, the fault model for our heuristic makes several assumptions based on the findings of these earlier works.
In our approach, we leverage TMR to apply fault tolerance to FPGA designs. TMR is a well-studied fault-tolerance strategy for FPGAs that has been examined in different use cases and under additional modifications. Morgan et al. [12], for example, compared TMR with several alternative fault-tolerant designs in FPGAs and showed that TMR was the most cost-effective technique for increasing reliability in a LUT-based architecture. Other works have introduced modifications that are complementary to our TMR design and may be incorporated into our heuristic. For example, Bolchini et al. [13] present a reliability scheme that uses TMR for fault masking and partial reconfiguration for repairing erroneous segments of the design. Our work focuses on automatically applying the TMR architecture from high-level synthesis and could incorporate partial reconfiguration regions for enhanced reliability at the cost of producing device-specific architectures.
Work by Johnson and Wirthlin [14] approaches the issue of voter placement for FPGA TMR designs and compares three algorithms for automated voter insertion based on strongly connected component (SCC) decomposition. Compared to the naive approach of placing voters after every flip-flop, these algorithms use feedback analysis to insert fewer voters. Since building TMR designs on FPGAs requires both applying redundancy and inserting voters, our paper focuses specifically on the problem of automated TMR and can potentially be used alongside these voter-insertion algorithms.
Although scheduling and binding in high-level synthesis is a well-studied problem [15–17], many of those studies do not consider fault tolerance or error-correction percentage. More recent works have treated reliability as a primary concern in the high-level synthesis process but focus on different reliability goals in ASIC designs. For example, Tosun et al. [18] introduce a reliability-centric HLS approach that focuses on maximizing reliability under performance and area constraints using components with different reliability characterizations. Our work, in contrast, focuses on minimizing area under performance and reliability constraints using components with the same reliability characterizations. Antola et al. [6] present an HLS heuristic that applies reliability by selectively replicating parts of the datapath for self-checking as an error-detection measure. Our heuristic differs by applying reliability through TMR with resource sharing. Our heuristic additionally varies the reliability through an error-correction percentage constraint that enables selective resource sharing between TMR modules.
Other work has notably used HLS to apply reliability through means other than replication. For example, Chen et al. [19] introduce an HLS approach that uses both TMR and a gate-level hardening technique called gate sizing on different resources to minimize both soft-error rate and area overhead. In another example, Hammouda et al. [20] propose a design flow that automatically generates on-chip monitors to enable runtime checking of control-flow and I/O-timing behavior errors in HLS hardware accelerators. Our heuristic focuses specifically on scheduling and binding and may be complementary to several of these approaches. Although our heuristic could potentially be applied to ASIC design with modification, we have tailored scheduling and binding to the FPGA architecture, which notably has different HLS challenges compared to ASICs [21], and have compared its effectiveness with other FPGA-specific approaches. Our heuristic is especially applicable to FPGAs given the rise of FPGAs in space missions, which commonly implement fault-tolerant logic through TMR and configuration scrubbing [3].
Compared to ASICs, far fewer reliability-centric high-level synthesis studies have targeted FPGAs. Golshan et al. [4] introduced a TMR-based HLS process for datapath synthesis, placement, and routing that targets SRAM-based FPGAs. Although conceptually similar to our work, that study focused on mitigating the impact of SEUs in the configuration bitstreams by enforcing self-containment of SEUs within a TMR module and by minimizing potential SEU-induced bridging faults that connect two separate nets in a routing resource. By contrast, our work focuses on minimizing the resources needed for TMR and can intentionally neglect self-containment of SEUs within a TMR module for additional resource savings. Dos Santos et al. [22] investigated another TMR-based HLS design flow for SRAM-based FPGAs and compared the reliability of the resulting designs with their unhardened equivalents. In contrast, our heuristic focuses on making tradeoffs between redundancy and area, of which full TMR is one extreme. Shastri et al. [5] introduced a TMR-based HLS heuristic that focused solely on minimizing coarse-grained resources under latency and redundancy constraints. By contrast, our work considers each resource's implementation costs and uses improved scheduling and binding algorithms for better scalability on larger benchmarks and increased latency constraints. At a 2x normalized latency constraint, our heuristic provides average resource savings of 34% compared to TMR applied to existing RTL and achieves resource savings of 74% relative to the approach of [5] on the largest benchmark.
3. Problem Definition
Although there are different optimization goals that could be explored while varying the amount of error correction, in this paper we focus on the problem of minimum-resource, latency- and error-constrained scheduling and binding, which for brevity we simply refer to as the problem. To explain the problem, we introduce the following terms:

(i) Fault: a resource with an SEU-induced error.
(ii) Module: one instance of the dataflow graph (DFG), analogous to a module in TMR.
(iii) Error: any fault where one or more of the three modules output an incorrect value.
(iv) Undetectable error: any fault where all three modules output the same incorrect value.
(v) Detectable error: any fault where one or more modules output different values.
(vi) Uncorrectable error: any fault where two or more modules output incorrect values.
(vii) Correctable error: any fault where two or more modules output a correct value.
(viii) Error-correction % (EC%): the percentage of total possible errors that are correctable by a given solution.
Figure 1 illustrates several example error types, where all operations with the same color are bound to a single resource. If the black resource experiences a fault, this binding results in an undetectable error because all three modules will produce the same incorrect value. If there is a fault in the medium-gray resource, this binding causes an uncorrectable error because two modules (2 and 3) will produce incorrect values. A fault in the light-gray resource results in a correctable error because modules 2 and 3 both produce correct outputs. Note that both gray resources result in detectable errors because at least one module outputs a different value than the other modules. We consider the error correction to be 100% if all errors can be classified in this way as correctable errors, although failures that occur in other parts of the system may still cause incorrect outputs.
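The error types above depend only on how many TMR modules have an operation bound to the faulty resource. A minimal sketch of this classification, assuming a single faulty resource and that a module outputs a wrong value exactly when one of its operations is bound to that resource (the set-based representation and function name are illustrative, not from the paper):

```python
# Classify a fault by the set of TMR modules (subset of {1, 2, 3}) whose
# operations are bound to the faulty resource, following the definitions
# above. A fault shared by all three modules yields the same wrong value
# in each module, so the voter cannot detect it.

def classify_error(modules_using_faulty_resource):
    n = len(set(modules_using_faulty_resource))
    if n == 0:
        return "no error"       # fault in a resource no module uses
    if n == 3:
        return "undetectable"   # all three modules agree on the wrong value
    if n == 2:
        return "uncorrectable"  # detectable, but majority voting fails
    return "correctable"        # two good modules outvote the faulty one
```

Applied to Figure 1: the black resource ({1, 2, 3}) is undetectable, the medium-gray resource ({2, 3}) is uncorrectable, and the light-gray resource ({1}) is correctable.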
The input to the problem is a dataflow graph (DFG) G, a latency constraint L expressed in number of cycles, and an error constraint E specified as the minimum acceptable EC%. The output is a solution S, which is a combination of a schedule and binding for a redundant version of G. Given these inputs and outputs, we define the problem as follows:

minimize   the number of resources used by S
subject to latency(S) ≤ L,
           EC%(S) ≥ E,
           all errors in S are detectable.
In other words, the goal of the problem is to find a schedule and binding that minimizes the number of required resources, where the schedule does not violate the latency constraint, the binding does not violate the error constraint, and all errors are detectable. We provide an informal proof that this problem is NP-hard as follows. If we remove both error constraints from the problem definition, the problem is equivalent to minimum-resource, latency-constrained scheduling followed by binding, which are both NP-hard problems [15]. The correctable and detectable error constraints only make the problem harder by expanding the solution space with replicated versions of the input.
Note that a more complete definition of this problem would include other FPGA resources (e.g., DSP units, block RAM), as opposed to solely using LUTs. However, because there is no effective relative cost metric for different FPGA resources, comparing solutions with different types of FPGA resources is difficult. For example, it is not clear whether a solution with 100 DSPs and 10,000 LUTs is preferable to a solution with 10 DSPs and 100,000 LUTs. An alternative would be to minimize one resource while placing constraints on the others. This approach, however, may exclude solutions that minimize multiple selected resources. For ease of explanation and comparison, the presented heuristic implements coarse-grained resources using only LUTs and focuses on minimizing the design's overall LUT count.
We assume that scrubbing occurs frequently enough so there cannot be more than one faulty resource at a time, which is often true due to the low frequency of SEUs in many contexts. For example, in the Cibola Flight Experiment [23], the experiment’s Virtex FPGAs experienced an average SEU rate of 3.51 SEUs/day compared to their scrubbing cycle of 180 ms. With this assumption, based on our definitions, the total number of possible faults (and errors) is equal to the total number of resources used by the solution. Due to the likely use of SRAMbased FPGAs, we assume that all faults persist until scrubbing removes the fault. This contrasts with earlier work that focuses on transient faults [24–26]. We assume the presence of an implicit voter at the output of the modules, potentially using strategies from [14].
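The quoted Cibola figures give a sense of how safe the single-fault assumption is. A back-of-the-envelope sketch, modeling SEU arrivals as a Poisson process (the Poisson model is our illustrative assumption; only the 3.51 SEUs/day rate and 180 ms scrub cycle come from the text):

```python
import math

# Expected SEUs per 180 ms scrub cycle at 3.51 SEUs/day.
seu_rate_per_sec = 3.51 / 86400           # ~4.06e-5 SEUs per second
scrub_period_s = 0.180
lam = seu_rate_per_sec * scrub_period_s   # ~7.3e-6 SEUs per scrub cycle

# Probability of two or more SEUs landing within one scrub cycle
# under a Poisson arrival model: 1 - P(0) - P(1).
p_multi = 1.0 - math.exp(-lam) * (1.0 + lam)
print(f"P(>=2 SEUs per scrub cycle) ~ {p_multi:.2e}")
```

The result is on the order of 10^-11 per scrub cycle, supporting the assumption that at most one faulty resource exists at a time.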
One potential challenge with error correction is the possibility of two modules producing incorrect outputs that have the same value, which we refer to as aliased errors. Although we could extend the problem definition to require no instances of aliased errors, this extension is not a requirement for many use cases (e.g., [26]). In addition, by treating aliased errors as uncorrectable errors, good solutions will naturally tend to favor bindings that have few aliased errors. To further minimize aliased errors, our presented heuristic favors solutions with the highest EC% when there are multiple solutions that meet the error constraint with equivalent resources.
4. Force-Directed Fault-Tolerance-Aware (FDFTA) Heuristic
To solve the problem of minimum-resource, latency- and error-constrained scheduling and binding, we introduce the Force-Directed Fault-Tolerance-Aware (FDFTA) heuristic, which performs scheduling and binding during high-level synthesis while simultaneously applying TMR and resource sharing to reduce overhead. By using an error-correction constraint combined with a latency constraint, the heuristic explores various tradeoffs between area, performance, and redundancy. As described in Algorithm 1, the heuristic first triplicates the DFG, schedules the triplicated DFG under a given latency constraint, and then binds the scheduled operations under a given EC% constraint.
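The triplication step can be sketched concretely. A minimal illustration, assuming a dict-of-successor-lists DFG representation (the representation and function name are ours, not from the paper):

```python
# Triplicate a DFG for TMR: every operation and edge is copied once per
# module, and no edge crosses a module boundary, so the three copies are
# fully independent until the implicit output voter.

def triplicate(dfg):
    """dfg: {op: [successor ops]} -> DFG with ops tagged by module number."""
    tmr = {}
    for module in (1, 2, 3):
        for op, succs in dfg.items():
            tmr[(op, module)] = [(succ, module) for succ in succs]
    return tmr

# The two-operation DFG A -> B used later in Figure 2 becomes six operations.
tmr = triplicate({"A": ["B"], "B": []})
```

Scheduling and binding then operate on this triplicated graph as described in the following subsections.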

The heuristic is divided into two key parts: scheduling and binding. We discuss the scheduling algorithm in Section 4.1 and the binding algorithm in Section 4.2.
4.1. Scheduling
In high-level synthesis, scheduling is the process of assigning each operation to a specific cycle or control state. The resulting schedule for the entire application is then implemented using a finite-state machine. In this section, we discuss the limitations of previous fault-tolerance-aware schedulers (Section 4.1.1) and then present a heuristic that adapts Force-Directed Scheduling (FDS) [27] to address those limitations (Section 4.1.2).
4.1.1. Previous Fault-Tolerance-Aware Scheduling
The previous work on fault-tolerance-aware scheduling from [5] used Random Nonzero-Slack List Scheduling. A variant of minimum-resource, latency-constrained (MRLC) list scheduling, this greedy algorithm makes scheduling decisions on a cycle-by-cycle basis based on a resource bound and operation slack (i.e., the difference between the latest possible start cycle and the cycle under consideration). To minimize resource usage, the algorithm iterates over each cycle and schedules operations ordered from lowest to highest slack up to the resource bound. If there are still zero-slack operations in a particular cycle once the resource bound is reached, those operations are scheduled and the resource bound is updated to match the increased resource usage. This process continues until all operations are scheduled. Unlike MRLC list scheduling, Random Nonzero-Slack List Scheduling schedules nonzero-slack operations with a 50% probability up to the resource bound. With this randomness, the scheduling algorithm is intended to produce different schedules for each TMR module, which increases the likelihood of inter-module operation bindings after scheduling. To escape local optima, this previous work used the scheduling algorithm in a multipass approach that collects the best result from a series, or phase, of scheduling and binding runs. The heuristic then continues to run these phases until the best result of a phase shows no significant improvement over the previous phase's result.
There are several key disadvantages of the previous heuristic and its scheduling algorithm that we address in this paper. One primary disadvantage is the heuristic's lengthy execution time from its multipass approach, which relies on randomness in the schedule to find an improved solution. Using a user-defined percentage, the heuristic continues to another phase with double the number of scheduling and binding runs if the current phase's result is not significantly better than the previous phase's. The execution time can therefore be largely influenced by the randomness and the data dependencies between operations. In addition to long execution times, the heuristic requires fine-tuning of starting parameters to avoid premature exiting, which worsens productivity and can be error-prone. The heuristic also explores a restricted solution space by favoring the scheduling of operations in earlier cycles, a tendency that results from using a fixed 50% scheduling probability for nonzero-slack operations in each cycle. By contrast, our proposed heuristic performs the scheduling and binding process once, using a more sophisticated force-directed scheduler, which generally reduces execution time and improves quality without the need to manually tune heuristic parameters.
In terms of complexity, both heuristics consist of scheduling followed by binding. The proposed heuristic has O(cn^2) complexity for the force-directed scheduler [27] described in the next subsection and O(n log n) complexity for the binder described in Section 4.2 [28]. The overall complexity is therefore O(cn^2), where c is the latency constraint and n is the number of operations. In contrast, the previous heuristic has O(n log n) complexity for its variant list scheduler and O(n^2) complexity for its clique-partitioning binder. Since the previous heuristic will generally limit the number of phases or total number of scheduling and binding runs in its multipass approach, the overall complexity is O(kn^2), where k is a user-defined limit on total iterations and n is the number of operations. The proposed heuristic therefore has a lower complexity than the previous heuristic when c < k, which is commonly true, as the user will generally set k larger than c to avoid premature exiting.
Additionally, the previous heuristic suffers from the same disadvantages as general MRLC list scheduling. Notably, it is a local scheduler that makes decisions on a cycle-by-cycle basis using operation slack as a priority function. Such an approach tends toward locally optimal operation assignments, which often leads to suboptimal solutions when the latency constraint is significantly larger than the minimum-possible latency. Similarly, the heuristic's use of slack may produce suboptimal results in scenarios where operation mobility underestimates resource utilization. By contrast, Force-Directed Scheduling is a global stepwise-refinement algorithm that selects operation assignments from any cycle based on their impact on operation concurrency.
4.1.2. Force-Directed Scheduling
Force-Directed Scheduling is a latency-constrained scheduling algorithm that focuses on reducing the number of functional units by balancing the concurrency of the operations assigned to them. Algorithm 2 presents an overview of the algorithm; a more detailed description can be found in [27].

As shown in Algorithm 2, Force-Directed Scheduling balances operation concurrency using time frames, distribution graphs, and force values. Time frames refer to the possible cycles to which an operation may be assigned such that the resulting schedule does not violate the latency constraint. Intuitively, the assignment of an operation to a specific cycle may impact the time frames of other operations if there are data dependencies. Distribution graphs give, for each cycle within the latency constraint, the probability that a given operation type is assigned to that cycle. For each cycle, the algorithm builds the distribution graph by finding all operations of the same type that can be scheduled in that cycle and summing their individual probabilities.
To better illustrate these structures, Figure 2 shows the time frames and distribution graphs of a DFG that consists of two operations and is subject to a latency constraint of 4 cycles. In Figure 2(a), the DFG is shown scheduled with an as-soon-as-possible (ASAP) schedule and an as-late-as-possible (ALAP) schedule. The ASAP schedule assigns each operation to the earliest possible cycle given its data dependencies. Notice how operation B cannot be scheduled in cycle 1 because it depends on operation A. Similarly, the ALAP schedule assigns each operation to the latest possible cycle. Force-Directed Scheduling uses these two schedules to form the time frame of each operation, which represents the cycle bounds of an operation assignment.
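The ASAP/ALAP construction can be sketched for single-cycle operations. A minimal illustration on the Figure 2 example (the edge-list DFG representation and function name are our assumptions):

```python
# Compute ASAP and ALAP schedules by forward and backward passes over a
# topologically ordered operation list, then form each operation's time
# frame as the cycles between its ASAP and ALAP assignments.

def time_frames(edges, ops, latency):
    preds = {o: [u for u, v in edges if v == o] for o in ops}
    succs = {o: [v for u, v in edges if u == o] for o in ops}
    asap, alap = {}, {}
    for o in ops:                       # ops assumed topologically ordered
        asap[o] = 1 + max((asap[p] for p in preds[o]), default=0)
    for o in reversed(ops):
        alap[o] = min((alap[s] - 1 for s in succs[o]), default=latency)
    return {o: range(asap[o], alap[o] + 1) for o in ops}

# Figure 2: A -> B under a 4-cycle latency constraint.
frames = time_frames([("A", "B")], ["A", "B"], latency=4)
```

For this DFG, A's time frame is cycles 1 through 3 and B's is cycles 2 through 4, matching Figure 2(a).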
Figure 2(b) shows the distribution graphs of the DFG before any operation has been scheduled. Assuming operations A and B are of the same type (or can share a resource), each operation has a uniform 33% chance of being scheduled in any cycle of its 3-cycle time frame. The distribution graph therefore shows a 33% probability of that operation type being scheduled in cycles 1 and 4 and a 66% probability for the cycles where the time frames of the operations overlap.
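The distribution-graph values above follow directly from the time frames. A small sketch, assuming uniform probability over each time frame and that A and B share one type (representation is ours):

```python
# Build a distribution graph: each unscheduled operation contributes
# 1/|time frame| to every cycle inside its time frame.

def distribution_graph(frames, latency):
    dg = [0.0] * (latency + 1)     # index 0 unused; cycles are 1-based
    for frame in frames.values():
        for cycle in frame:
            dg[cycle] += 1.0 / len(frame)
    return dg

# Time frames from Figure 2(a): A in cycles 1-3, B in cycles 2-4.
frames = {"A": range(1, 4), "B": range(2, 5)}
dg = distribution_graph(frames, latency=4)
```

For cycles 1 through 4 this yields roughly 0.33, 0.67, 0.67, 0.33, matching Figure 2(b).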
Force-Directed Scheduling abstracts each distribution graph as a series of springs (one per cycle) connected to each operation of the DFG, so that any spring displacement exerts a "force" on each operation. In this abstraction, the spring's strength is the distribution graph's value in a particular cycle, and the displacement is the change in probability in a cycle due to a tentative assignment. A tentative operation assignment will therefore cause displacements in the cycle it is assigned to, whose probability becomes 100%, and in the cycles it can no longer be assigned to, whose probabilities become 0%. Using this abstraction, each operation assignment has an associated total force value that reflects the assignment's impact on the time frames and distribution graphs of the operation and its preceding and succeeding operations. While unscheduled operations remain, the algorithm commits the operation assignment with the lowest force value, reflecting the smallest impact on operation concurrency.
Figure 3 depicts several examples of how an operation assignment may impact the time frames of other operations using the DFG from Figure 2. In each of these examples, operation A is scheduled in a specific cycle, which limits the operation’s time frame to that cycle. For the first diagram, operation A is scheduled in cycle 1 and has no impact on operation B’s time frame. In contrast, scheduling operation A in cycle 2 or 3 reduces operation B’s time frame, since operation A must be scheduled in a cycle before operation B.
Figure 4 depicts the corresponding distribution graph after each cycle assignment. Notice that an operation assignment in a particular cycle changes the probability of that operation type being scheduled there to 100%. In (a), operation A contributes a 100% probability in cycle 1, while operation B contributes a uniform 33% probability over its time frame from cycle 2 to cycle 4. In (b), operation A still contributes a 100% probability in its scheduled cycle, while operation B contributes a uniform 50% probability over its reduced time frame. Noticeably, in (c), operation A's assignment in cycle 3 effectively forces operation B's later assignment to cycle 4 due to data dependencies. It should be noted that a distribution graph may have probabilities over 100%, which represents multiple operation assignments of that type in the same cycle. The main goal of Force-Directed Scheduling is to balance operation concurrency such that the distribution graph for each operation type has a relatively uniform probability over each cycle after all operations are scheduled. Using the force equations from [27], the assignment of operation A to cycle 1 would have the lowest force value of the three possible assignments, since it balances the operation probability across the most cycles.
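These three assignments can be checked numerically. The sketch below applies the standard force definition from [27] (the force of a tentative assignment is the sum over cycles of the distribution-graph value times the change in assignment probability, accumulated for the operation and any successor whose time frame shrinks) to the Figure 2 example; the code representation is our own illustration:

```python
# Evaluate the force of tentatively assigning operation A to each cycle
# of its time frame, including the successor force induced on B.

def prob(frame):
    return {c: 1.0 / len(frame) for c in frame}

def force(dg, old_frame, new_frame):
    old_p, new_p = prob(old_frame), prob(new_frame)
    cycles = set(old_frame) | set(new_frame)
    return sum(dg[c] * (new_p.get(c, 0.0) - old_p.get(c, 0.0)) for c in cycles)

frame_a, frame_b = [1, 2, 3], [2, 3, 4]   # time frames from Figure 2(a)
dg = [0.0] * 5
for f in (frame_a, frame_b):
    for c in f:
        dg[c] += 1.0 / len(f)

total = {}
for cycle in frame_a:                          # tentative assignments of A
    f = force(dg, frame_a, [cycle])            # self force
    new_b = [c for c in frame_b if c > cycle]  # B must follow A
    f += force(dg, frame_b, new_b)             # successor force
    total[cycle] = f

best = min(total, key=total.get)
```

Here `best` is cycle 1 (force -2/9, versus +1/18 for cycle 2 and -1/9 for cycle 3), agreeing with the discussion above.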
In [27], the authors describe an adjustment for optimization in scenarios where resources have different realization costs. This adjustment scales the force values by a cost factor reflecting the relative realization costs. In our proposed heuristic, we scale the force values by the LUT requirements of the resource.
4.2. Binding
In high-level synthesis, binding refers to the process of assigning operations and memory accesses to hardware resources. When it follows scheduling, the binding process yields a mapping of operations to hardware resources at specific cycles.
4.2.1. Singleton-Share Binding
Our proposed heuristic, referred to as Singleton-Share Binding, uses a two-stage process that first performs binding in each TMR module separately and then merges non-conflicting bindings between TMR modules. By binding operations from different TMR modules to the same resource, the heuristic trades off the circuit's capability to correct an error against the total area needed by the circuit. Using the aforementioned EC% constraint, a designer can limit the total amount of inter-module binding that can occur.
Algorithm 3 displays the pseudocode of Singleton-Share Binding.

In the first stage, Singleton-Share Binding considers each TMR module separately and binds the module's operations to resources using the Left Edge Algorithm (LEA), as originally described in [29]. Although originally introduced as a method for packing wire segments, this algorithm has found popularity in resource binding, where it binds as many operations as possible to one resource before moving on to another.
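The greedy packing behavior of LEA can be sketched by treating each scheduled operation as an interval of cycles. A minimal illustration, assuming an interval representation with start and end cycles (the representation and names are ours, and single-cycle operations simply have start equal to end):

```python
# Left Edge binding within one TMR module: sort operations by start cycle
# (the "left edge") and greedily pack each onto the first resource whose
# previously bound operation has already finished.

def left_edge_bind(intervals):
    """intervals: {op: (start, end)} -> list of resources (lists of ops)."""
    resources = []                      # each entry: {"end": cycle, "ops": [...]}
    for op, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1]):
        for res in resources:
            if res["end"] < start:      # no cycle overlap: reuse this resource
                res["ops"].append(op)
                res["end"] = end
                break
        else:
            resources.append({"end": end, "ops": [op]})
    return [res["ops"] for res in resources]

# A in cycle 1 and B in cycle 2 can share a resource; C spans both cycles.
binding = left_edge_bind({"A": (1, 1), "B": (2, 2), "C": (1, 2)})
```

This packs A and B onto one resource and gives C its own, using two resources in total.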
In the second stage, the binding algorithm collects singleton bindings (i.e., resources with only one mapped operation) and attempts to merge each singleton binding with a non-conflicting binding from another module. After merging, a single resource contains operation bindings from different TMR modules, which reduces the circuit's error-correcting capability in exchange for a lower area cost. This binding process therefore reduces the total number of resources used in the design at the cost of reducing the design's EC%. As mentioned before, the heuristic maintains 100% error detection by limiting resource sharing to only two modules, which allows the design to detect errors, albeit uncorrectable ones, in the event of a fault in a shared resource. Algorithm 3 shows the pseudocode for the second stage, which first merges bindings between singleton and non-singleton bindings and then merges bindings between pairs of singleton bindings. Bindings may only be merged if the resulting binding does not contain multiple operations scheduled in the same cycle or operations from all three modules.
The key requirement of the second stage is ensuring that the EC% does not fall below the error constraint E. Since the number of uncorrectable errors equals the number of resources s that have been shared across two modules, out of the N total resources, this constraint satisfaction is illustrated by

(N − s)/N ≥ E,

which can be reduced to

1 − s/N ≥ E.
These equations, however, do not reveal a limit on the number of shares that can be performed by the second-stage binder. Since sharing a resource also reduces the number of resources (N = R − s, where R is the initial number of bindings), the number of shares is related to the initial number of bindings by

(R − 2s)/(R − s) ≥ E. (4)
Since the number of shares must be an integer, we define the integer s_max as the maximum number of shares performed by the second-stage binder while meeting the error constraint. Simplifying (4) in terms of s gives

s_max = ⌊R(1 − E)/(2 − E)⌋. (5)
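The share limit in (5) can be computed directly. This C++ sketch uses the illustrative name `maxShares` and assumes the error constraint E is given as a fraction:

```cpp
// Maximum number of singleton shares that keeps the EC% at or above the
// constraint E (a fraction), given R initial bindings.  Derived from
// (R - 2s)/(R - s) >= E, i.e., equation (5): floor(R(1 - E)/(2 - E)).
int maxShares(int initialBindings, double ecConstraint) {
    return static_cast<int>(initialBindings * (1.0 - ecConstraint)
                            / (2.0 - ecConstraint));
}
```

For example, `maxShares(6, 0.5)` evaluates to 2, so six initial bindings under a 50% EC constraint permit at most two shares.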
Figure 5 depicts an example of the binding process. In these subfigures, each non-singleton resource is represented by a node with a solid color, whereas each singleton resource is represented by a node with a pattern. Each node is labeled with the operation it performs from the original DFG and the TMR module to which it belongs. Figure 5(a) displays three modules, each with distinct singleton and non-singleton resources, and represents a possible scheduling and binding after the first stage of binding.
(Figure 5: (a) bindings after the first binding stage; (b) after merging a singleton binding with a non-singleton binding; (c) after merging a pair of singleton bindings.)
Using a 50% EC constraint and (5), the second stage of binding may share up to two resources while meeting the EC% constraint. As the first step, the binding process attempts to merge singleton bindings with non-singleton bindings. In Figure 5(a), the only candidates for sharing are two singleton bindings that may merge with the non-singleton binding of three nodes (the light-gray resource). A third singleton binding is ineligible since all non-singleton bindings already have a conflicting operation scheduled during cycle 2. Similarly, no other non-singleton binding can be merged with a singleton, as each contains a conflicting operation during cycles 1 and 2. Merging one candidate singleton binding with the candidate non-singleton binding produces Figure 5(b), in which the merged node has joined the non-singleton binding, as represented by the shared color.
Since there are no more candidate pairs of that kind, the binding process then attempts to merge pairs of singleton bindings. In Figure 5(b), there is only one candidate pair of singleton bindings. Merging these bindings produces a non-singleton binding containing the two nodes, as seen in Figure 5(c) with the solid black color. At this point, the binding process completes because the number of shared resources has reached the limit. Instead of the initial six resources, the final design now produces a scheduling and binding on four resources with a 50% EC. If the limit had not been reached, the binding process would continue until no candidate pairs remained.
For register binding, our heuristic provides a register for each functional unit to avoid relatively expensive multiplexers in the FPGA’s LUT-based architecture, while also sharing registers among compatible variables originating from the same functional unit. Due to the FPGA’s register-rich fabric, register sharing can be more expensive than register duplication and is rarely justified [30].
5. Experiments
In this section, we evaluate the effectiveness of the FDFTA heuristic on a variety of benchmarks. Section 5.1 describes the experimental setup. Section 5.2 compares resource savings with TMR applied to existing RTL. Section 5.3 compares resource savings with a previous approach. Section 5.4 compares execution times with the previous approach. Section 5.5 compares clock frequencies across the different approaches.
5.1. Experimental Setup
To evaluate our heuristic, we implemented Algorithm 1 in C++, using Vivado 2015.2 to provide LUT counts for each resource and the clock frequency of the final solution. Our target architecture was a Xilinx Virtex-7 xc7vx485t-ffg1761-2, which uses Xilinx 7 Series Configurable Logic Blocks with 6-input LUTs.
5.1.1. Benchmarks
Table 1 summarizes the benchmarks, showing the number of nodes and edges and the operation types for each. To represent signal-processing applications, we used DFGs for convolution, an 8-point radix-2 butterfly FFT, a 16-point radix-2 butterfly FFT, and a 4-point radix-4 dragonfly FFT. Of these, the radix-4 FFT efficiently expands the complex arithmetic involved into equivalent real operations as in [31]. For the radix-2 FFTs, we use resources capable of directly performing complex operations. To represent fluid-dynamics and similar applications, we used two DFGs that solve 5-dimensional linear equations using the Jacobi iterative method and the successive over-relaxation (SOR) iterative method [32]. We also used two DFGs that solve Laplace’s equation using the Jacobi and SOR methods. We supplement these real benchmarks with synthetic benchmarks (small0–small7, medium0–medium4) that we created from DFGs generated as random directed acyclic graphs.
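For reference, one common way to generate such random DAGs is to sample edges over a fixed topological order; the following C++ sketch is a hypothetical generator in that style, not necessarily the exact procedure we used:

```cpp
#include <random>
#include <utility>
#include <vector>

// Generate a random DAG with n nodes: for every ordered pair (i, j) with
// i < j, add edge i -> j with probability p.  Node order is a topological
// order by construction, so the graph is guaranteed to be acyclic.
std::vector<std::pair<int, int>> randomDag(int n, double p, unsigned seed) {
    std::mt19937 rng(seed);                   // fixed seed for reproducibility
    std::bernoulli_distribution coin(p);
    std::vector<std::pair<int, int>> edges;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (coin(rng)) edges.emplace_back(i, j);
    return edges;
}
```

Varying n and p produces DFGs with different operation counts and scheduling flexibility, which is the property the synthetic benchmarks are meant to exercise.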
5.1.2. Baselines
Two baselines were used to evaluate our proposed heuristic: a common approach in which designers apply TMR to existing RTL (referred to as TMR-RTL for simplicity) and the approach from [5] (referred to as the previous approach). We use the TMR-RTL baseline to motivate the benefits of our heuristic over common use cases and the previous-approach baseline to demonstrate improvements over the state of the art.
We model the TMR-RTL approach based on two observations: (1) designers often deal with existing RTL code, and (2) designers may not be willing or able to perform an extensive exploration of fault-tolerance tradeoffs. Because most existing RTL is implemented without considering the effects of TMR, the code is generally not written to minimize area. To approximate this use case, we use an ASAP schedule and LEA bindings. Although a designer could certainly use an approach that reduces area, we have observed that ASAP schedules are common in RTL cores, likely due to ease of implementation.
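An ASAP schedule is indeed straightforward to compute, which supports the ease-of-implementation observation. A minimal C++ sketch over a topologically ordered DFG with unit-latency operations (the function name and data layout are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// ASAP schedule: each operation starts as soon as all predecessors finish.
// preds[i] lists the predecessors of operation i; the DFG is assumed to be
// topologically ordered, and every operation takes one cycle.
// Returns the start cycle of each operation.
std::vector<int> asapSchedule(const std::vector<std::vector<int>>& preds) {
    std::vector<int> start(preds.size(), 0);
    for (std::size_t i = 0; i < preds.size(); ++i)
        for (int p : preds[i])
            start[i] = std::max(start[i], start[p] + 1);
    return start;
}
```

Because every operation is pushed to its earliest cycle, operation concurrency peaks early, which is exactly the behavior that inflates resource counts in the TMR-RTL baseline.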
5.2. Comparison with TMR Applied to Existing RTL
In this section, we evaluate the effectiveness of our FDFTA heuristic compared to TMR-RTL under different EC% and latency constraints. Unlike a simply triplicated RTL circuit, an HLS-generated RTL circuit may use more extensive resource sharing during high-level synthesis to automatically target different latency and redundancy constraints. Applying triplication to an existing circuit, in contrast, limits the designer to the circuit’s existing architecture and leaves few opportunities to target different constraints, especially when using encrypted or presynthesized IP cores. Table 2 presents the resource savings, where the rows correspond to different benchmarks and the columns correspond to different EC% and latency constraints. We vary the latency constraint from the DFG’s minimum-possible latency (as determined by an ASAP schedule) up to 2x the minimum latency, in increments of 0.2x. The EC% constraint is explored at 100% and 70%. On average, the FDFTA heuristic shows savings compared to TMR-RTL for each pair of normalized latency and EC% constraints. Although relatively small at the lowest latency constraint, these average savings become significant at higher latency constraints, reaching 45% and 49% at a 2.0x normalized latency constraint for the 100% and 70% EC constraints, respectively. For individual benchmarks, the FDFTA heuristic shows significant savings for each set of constraints except for benchmarks with low operation counts and for certain benchmarks at low normalized latency constraints.
From Table 2, we observe results consistent with expectations. For a normalized latency constraint of 1.0x and an EC constraint of 100%, the FDFTA heuristic yielded no savings for about half of the benchmarks and up to 65% savings for the remainder. The lack of savings for certain benchmarks is expected given that some benchmarks have no flexibility to schedule operations on different cycles when constrained to the minimum-possible latency. In such cases, designers should use the lowest-complexity scheduling algorithm, as all algorithms will produce similar schedules. For the remaining benchmarks, savings grow with scheduling flexibility: the largest savings of 65% corresponds to a DFG with large operation flexibility. Additionally, the FDFTA heuristic displayed minimal degradation, between −2% and −1%, on some benchmarks. Further investigation of each approach’s output reveals that nonoptimal bindings by the FDFTA heuristic resulted in multiplexers with more inputs than their counterparts in the TMR-RTL approach.
For a normalized latency constraint of 2.0x and an EC constraint of 100%, the FDFTA heuristic yielded savings up to 75%, with savings above 50% for about half of the benchmark set. We expect such high savings due to the large operation flexibility granted by the latency constraint, which enables the FDFTA heuristic’s global scheduling approach to balance the operation concurrency over twice as many cycles as the minimum-possible latency. The TMR-RTL approach, in contrast, maintains a static schedule despite the larger latency constraint. Like any basic triplication approach, the TMR-RTL approach is unable to target new constraints without manual intervention and may suffer from a higher area cost due to fewer opportunities for resource sharing. Despite these savings, there are a few benchmarks for which the FDFTA heuristic experienced little to no savings. These benchmarks are, however, very small in operation count and represent cases where the FDFTA heuristic’s attempt to balance operation concurrency yields output similar to the TMR-RTL approach. Overall, these results support our expectation that the FDFTA heuristic performs better with increasing latency constraints due to larger scheduling flexibility.
Comparing the results from nonsynthetic and synthetic benchmarks, Table 2 shows that the trends found in each individual benchmark set are similar to the trends found when both sets are considered together. For nonsynthetic benchmarks, the FDFTA heuristic yielded no savings for about half of the benchmarks and up to 65% savings for the remainder at a latency constraint of 1.0x and an EC constraint of 100%. As the latency constraint increased, the FDFTA heuristic generally showed increased savings, with the exception of two benchmarks that displayed minimal degradation between −2% and −1% under certain latency constraints. At a latency constraint of 2.0x, the FDFTA heuristic yielded savings up to 75%. Similarly, for the synthetic benchmarks, the FDFTA heuristic yielded no savings for about half of the benchmarks and up to 30% savings for the remainder at a latency constraint of 1.0x and an EC constraint of 100%. As the latency constraint increased, the FDFTA heuristic also generally showed increased savings, but at a slower rate than the nonsynthetic benchmarks. At a latency constraint of 2.0x, the FDFTA heuristic showed savings up to 62%. With similar trends, both benchmark sets also showed averages within 5% of their respective total averages.
To better illustrate these trends, Figure 6 shows the LUT savings of certain benchmarks that exemplify the different behaviors observed. In this graph, four benchmarks are displayed with EC constraints of 70% and 100%. Each benchmark entry is shown with six bars representing the LUT savings compared to the TMRRTL approach for the six normalized latency constraints.
As one may expect, the impact of the latency constraint on HLS varies considerably with the DFG structure. For the small benchmarks, increasing the latency constraint generally had a minimal impact on resource savings since the FDFTA heuristic is likely to produce a schedule similar to TMR applied to existing RTL at small DFG operation counts. One larger benchmark likewise experienced only small increases in savings as the latency constraint increased, despite its large initial savings. This behavior is caused by the benchmark’s innate structure, which enables much operation scheduling flexibility even at the minimum-possible latency constraint. Due to this large flexibility, the FDFTA heuristic’s scheduling algorithm can balance the operation concurrency more easily and experiences diminished benefits from the additional flexibility granted by an increased latency constraint. In contrast, the remaining benchmarks show significant increases in savings when the latency constraint is increased. We attribute this behavior to minimal scheduling flexibility at the minimum-possible latency constraint, which is alleviated by larger latency constraints.
When comparing the savings of 70% EC with 100% EC, Table 2 shows that the experiments under the 70% EC constraint had additional savings up to 27% compared to experiments under the 100% EC constraint with the same latency constraint. Similar to previous observations, the decreased EC% largely had no impact on experiments with a normalized latency constraint of 1x, primarily due to the lack of operation flexibility. At this baseline latency constraint, the singletons of each module are likely scheduled on the same cycle and are therefore ineligible for singleton-sharing, as seen in a majority of the benchmarks. Otherwise, the 70% EC iterations generally showed additional savings compared to their counterpart 100% EC iterations as the latency constraint increased. Notably, these additional savings remain relatively similar once savings occur. Such behavior is expected because the EC% determines the number of resources that may be shared across modules, which places an upper bound on the additional savings due to EC.
Despite the overall positive impact of decreasing the EC%, there are a few cases where decreasing the EC% also decreased the LUT savings. In those cases, the savings due to sharing were offset by the cost of increasing the number of inputs on the interconnection multiplexers for the remaining resources. This trend is, however, largely noticeable only in the smaller DFGs, where the cost of input multiplexers may be comparable to the cost of resources. Similarly, there are also cases where decreasing the EC% showed a sudden increase in resource savings for a single latency constraint but not for others, as seen in one benchmark. In these scenarios, the specific latency constraint with increased resource savings may present the only case where Singleton-Share Binding is capable of sharing singletons. In contrast, the heuristic may produce incompatible singleton bindings due to inflexible schedules at lower latency constraints or suboptimal schedules at higher latency constraints caused by the force-directed global scheduling approach. Overall, the impact of the EC% constraint is counterintuitive because it relies primarily on the heuristic’s distribution of singletons among cycles within each module. This distribution cannot be easily determined from the benchmark or constraints, as the scheduling and binding process may vary the singleton distribution with the slightest change in constraints or DFG structure. The impact is further affected by the shared resource type and by whether the sharing has increased the size of the resource’s input multiplexer.
5.3. Comparison with Previous Approach
In this section, we evaluate the effectiveness of our FDFTA heuristic compared to the approach of previous work described in [5] under different EC% and latency constraints. Table 3 presents the resource savings of the FDFTA heuristic compared to the previous approach and uses the same presentation style as Table 2. On average, the FDFTA heuristic shows savings compared to the previous work for each pair of normalized latency and EC% constraints. Although minimal at the lowest latency constraint, these savings become significant at higher latency constraints reaching 21% and 23% average savings at 2.0x normalized latency constraint for 100% and 70% EC, respectively. For individual benchmarks, the FDFTA heuristic commonly shows significant savings at each set of constraints except for benchmarks with low operation counts and for most benchmarks at low normalized latency constraints.
Like the previous experiment, Table 3 shows similar results for benchmarks under the minimum-possible latency constraint. For the normalized latency constraint of 1.0x and a 100% EC constraint, the FDFTA heuristic generally has a small degradation, ranging from −1% down to −5%. Similar to the results in Table 2, the FDFTA heuristic has a tendency to make nonoptimal bindings that result in larger input multiplexers than the previous approach. Although present in all iterations of the experiment, such behavior is usually amortized by better utilization of fewer resources, especially at larger latency constraints. The few cases where the FDFTA heuristic achieved savings under these constraints involved benchmarks with some flexibility to schedule operations on different cycles. For the benchmark with the largest savings, the solution produced by the FDFTA heuristic used far fewer multipliers than the previous approach’s solution. This better utilization of the costly multipliers ultimately contributed to a smaller design.
Table 3 shows clear trends in the results of this experiment: resource savings generally increase as both the latency constraint and the number of operations in the benchmark increase. The experiment displays these trends in a majority of benchmarks and constraint sets, with a few exceptions. These trends are expected, as the FDFTA heuristic was designed to address disadvantages of the previous approach’s scheduling algorithm. As mentioned in Section 4.1, the previous approach’s scheduling algorithm had a number of scalability issues. The first trend, savings increasing with the latency constraint, can be attributed to the MRLC scheduling algorithm on which the previous approach is based. Since MRLC tries to minimize resources on a per-cycle basis, the algorithm keeps a maximum operation count that limits the number of operations that may be scheduled in a cycle. These counts are only increased if the number of zero-slack operations in a cycle exceeds the current resource count. This behavior noticeably causes problems for latency constraints larger than the minimum-possible latency, since all operations are guaranteed to have nonzero slack in the first few cycles. In this case, the algorithm will schedule at most one operation per cycle until it encounters numerous zero-slack operations at later cycles. This lack of initial operation concurrency leads to large operation concurrency at later cycles, forcing the design to instantiate additional resources to meet that concurrency. The result is a large number of resources that are underutilized in early cycles. By contrast, the FDFTA heuristic’s scheduling algorithm aims to balance the operation concurrency among all cycles such that all resources maintain high utilization, and it therefore uses fewer overall resources.
Comparing the results from nonsynthetic and synthetic benchmarks, Table 3 shows that the trends found in each individual benchmark set are similar to the trends found when both sets are considered together. For nonsynthetic benchmarks, the FDFTA heuristic yielded no savings for most benchmarks and up to 19% savings for the remainder at a latency constraint of 1.0x and an EC constraint of 100%. As the latency constraint increased, the FDFTA heuristic generally showed increased savings, with the exception of two benchmarks that decreased in savings. At a latency constraint of 2.0x, the FDFTA heuristic yielded savings up to 67%. Similarly, for the synthetic benchmarks, the FDFTA heuristic yielded no savings for most benchmarks and up to 27% savings for the remainder at a latency constraint of 1.0x and an EC constraint of 100%. As the latency constraint increased, the FDFTA heuristic also generally showed increased savings, but at a slower rate than the nonsynthetic benchmarks. At a latency constraint of 2.0x, the FDFTA heuristic showed savings up to 58%. With similar trends, both benchmark sets also showed averages within 3% of their respective total averages.
To better illustrate these trends, Figure 7 shows the resource savings of certain benchmarks that exemplify the different behaviors observed. In this graph, four benchmarks are displayed with EC constraints of 70% and 100%. Each benchmark entry is shown with six bars representing the LUT savings compared to the previous approach for the six normalized latency constraints.
Similar to Figure 6, the impact of the latency constraint on HLS varies considerably with the DFG structure and with the effectiveness of the previous approach. For the small benchmarks, increasing the latency constraint generally had a small impact on resource savings since the FDFTA heuristic is likely to produce a schedule similar to the previous approach. This behavior is mainly caused by the small operation count, which limits the number of possible schedules. For a 70% EC constraint and a 2.0x normalized latency constraint, the FDFTA heuristic even displays a notable loss in savings. It should be cautioned that, for such small DFGs, any “significant” gain or loss of savings may be only the difference of one additional adder.
For the fft8 and conv5x5 benchmarks, we observe a significant increase in savings as the latency constraint increases. This trend represents the general case where the previous approach has difficulties finding an optimal schedule with larger operation counts. There are exceptions to this trend as seen in the linsor benchmark where the previous approach performs better than the FDFTA heuristic under most latency constraints. In fact, the FDFTA heuristic does progressively worse as the latency constraint increases.
The second trend, increased savings with increased operation count, can be attributed to the previous approach’s iterative approach and the randomness in its scheduling algorithm. Since that approach largely relies on finding an optimal schedule randomly over many iterations, it is much more likely to find an optimal schedule in small DFGs than in larger DFGs, given good starting parameters. As a result, as the size of the DFG increases, the previous approach finds less optimal solutions and encounters its exit condition much earlier. Although a designer could modify the starting parameters to find a better solution at different sizes, such an approach requires much fine-tuning and is not scalable. In contrast, the FDFTA heuristic uses a global scheduling approach that balances operation concurrency over all possible cycles, which allows the heuristic to scale with the size of the DFG at the expense of additional complexity.
5.4. Heuristic Execution Time Comparison
This section compares the average heuristic execution times of the FDFTA heuristic and the previous approach. In this context, execution time refers to the duration of the scheduling and binding process and does not include the placement and routing of the design on an FPGA. We provide this comparison to demonstrate the scalability of these heuristics, which may limit a heuristic’s practical use cases. For FPGAs, long execution times may be an acceptable tradeoff because placement and routing generally dominate compile times. Relatively short execution times may make a heuristic amenable to fast recompilation in areas such as FPGA overlays [33].
Figure 8 presents the results of this experiment, comparing the execution time of the two approaches with a 100% EC constraint and a 2x normalized latency constraint. The figure is arranged such that the horizontal axis lists the benchmarks ordered by operation count from smallest to largest, and the vertical axis reports execution time in milliseconds on a logarithmic scale. It should be noted that each tick of the horizontal axis does not represent the same increase in operation count between benchmarks. Similarly, benchmarks of the same operation count may vary in execution time for both heuristics based on the overall DFG structure.
Generally, the results of this experiment match our expectations on execution time. The figure shows that, for each benchmark, the previous approach’s execution time is orders of magnitude longer than the FDFTA heuristic’s, with an average speedup of 96x, a median speedup of 40x, and a maximum speedup of 570x for one benchmark. Our switch to the proposed FDFTA heuristic was intended to address the previous approach’s long execution time.
Based on the experiments on resource savings, these results suggest that both heuristics and the TMR-RTL approach have Pareto-optimal solutions within the design space of execution times, resource requirements, and latency constraints. Although the TMR-RTL approach does not require any scheduling and binding, results from Table 3 indicate that a much smaller design may be achievable with the other approaches, especially at larger latency constraints. Similarly, the previous approach may take vastly longer than the other approaches but may achieve significantly smaller designs at the minimum-possible latency constraint, with a negligible difference in execution time for small DFGs. Notably, the FDFTA heuristic achieves significantly more savings than the previous work at increased latency constraints and on larger DFGs.
5.5. Clock Results
This section compares the clock frequencies of designs generated by the proposed heuristic, the TMR-RTL approach, and the previous approach. Since increased resource sharing can increase the size of input multiplexers and potentially lengthen the critical path, we include this experiment to explore the impact of each approach on a design’s clock frequency. Since synthesis tools perform placement and routing with a pseudorandom process, we determine an approach’s clock frequency over multiple compilation runs through a script that performs a binary search over the possible clock-constraint space and determines the theoretical maximum clock frequency based on a target frequency and slack values. Because this process is lengthy, we provide this comparison only for a subset of nonsynthetic benchmarks and a reduced set of constraints. We test each selected benchmark at 1x and 2x normalized latency constraints with a 100% EC constraint. Since the TMR-RTL approach does not depend on the latency constraint, we repeat its value for both latency constraints for comparison purposes.
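The binary search over the clock-constraint space can be sketched as follows. One place-and-route run is abstracted behind a hypothetical `slackNs` callback; the real script invokes Vivado and parses its timing report, but that interface is illustrative here:

```cpp
#include <functional>

// Binary search over the clock-period constraint (in ns).  slackNs(period)
// abstracts one place-and-route run and returns the worst slack the tools
// report for that target period (negative means timing failed); it is a
// hypothetical stand-in for invoking Vivado.  Returns the smallest period
// found to meet timing within iters iterations; fmax = 1000 / period in MHz.
double minClockPeriod(const std::function<double(double)>& slackNs,
                      double loNs, double hiNs, int iters) {
    double best = hiNs;                  // hiNs is assumed to meet timing
    for (int i = 0; i < iters; ++i) {
        double mid = 0.5 * (loNs + hiNs);
        if (slackNs(mid) >= 0.0) {       // timing met: try a tighter constraint
            best = mid;
            hiNs = mid;
        } else {                         // timing failed: relax the constraint
            loNs = mid;
        }
    }
    return best;
}
```

Each iteration halves the search interval, so a handful of compilation runs suffices to bracket the maximum achievable frequency.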
To obtain these results, we convert the solution’s netlist produced by the C++ code into RTL code through scripts. We then use Vivado 2015.2 for compilation and for determining clock frequencies after placement and routing. For resources, we use IP cores from Xilinx for each respective resource and configure them to be implemented in LUTs with single-cycle latencies.
Figure 9 shows the results of this experiment. In this figure, each approach has a bar for each benchmark and normalized latency constraint pairing. On the horizontal axis, the results are clustered by benchmark and then by normalized latency constraint relative to the minimumpossible latency. On the vertical axis, the clock frequency is reported in MHz.
For each benchmark and latency constraint pairing, Figure 9 presents similar results for each approach. Compared to the TMR-RTL approach, the previous approach and the FDFTA heuristic had average clock degradations of 7.08% and 6.03%, respectively. This clock overhead is unexpectedly small since sharing increases the steering logic, and the FDFTA heuristic and the previous approach incorporate much more sharing than the TMR-RTL approach. Upon further analysis of the timing reports, each approach had different critical paths. For the TMR-RTL approach, the critical path was generally a path from an input register to the output register of a multiplier. For the other approaches, the critical path was usually a path from the output register of one resource to the output register of a multiplier. These results suggest that the larger routing problem associated with using more resources in the TMR-RTL approach had a similar impact on a design’s clock frequency as the steering logic of the other approaches. The one exception to this trend was one benchmark where the TMR-RTL approach had much higher clock frequencies than the other two approaches. As the latency constraint increased, both the FDFTA heuristic and the previous approach generally showed minor increases in clock frequency.
6. Conclusions
This paper introduced the Force-Directed Fault-Tolerance-Aware (FDFTA) heuristic to solve the minimum-resource, latency- and error-constrained scheduling and binding problem. Compared to TMR applied to existing RTL (TMR-RTL) and to a previous approach, this heuristic provides attractive tradeoffs by efficiently performing redundant operations on shared resources. The FDFTA heuristic improves upon previous work with improved scheduling and binding algorithms for better scalability on larger benchmarks and with increased latency constraints. Compared to TMR-RTL implementations, the FDFTA heuristic had average savings of 34% and average savings of 47% at a 2x normalized latency constraint. Although skewed by small benchmarks, which offer few opportunities for sharing, the FDFTA heuristic had savings up to 80% under a 2x normalized latency constraint, with most savings occurring in the larger benchmarks. Compared to the previous approach, the FDFTA heuristic had average savings of 19% and average savings of 22% at a 2x normalized latency constraint. The FDFTA heuristic’s comparisons with the previous approach are also skewed by small benchmarks and showed up to 74% resource savings under a 2x normalized latency constraint, with most savings occurring in the larger benchmarks. Additionally, the FDFTA heuristic showed a 96x average heuristic execution time improvement compared to the previous approach and produced FPGA designs with a 6% average clock degradation compared to the TMR-RTL implementations. Future work includes support for pipelined circuits and for alternative FPGA fault-tolerance strategies.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work is supported in part by the I/UCRC Program of the National Science Foundation, under Grant nos. EEC-0642422, IIP-1161022, and CNS-1149285.