Abstract

We address the automatic synthesis of DSP algorithms using FPGAs. Optimized fixed-point implementations are obtained by considering (i) a multiple wordlength approach; (ii) a complete datapath formed of wordlength-aware resources (i.e., functional units, multiplexers, and registers); (iii) an FPGA-aware resource usage metric that enables an efficient distribution of logic fabric and embedded DSP resources. The paper shows (i) the benefits of applying a multiple wordlength approach to the implementation of fixed-point datapaths and (ii) the benefits of a judicious use of embedded FPGA resources. The use of a complete fixed-point datapath leads to improvements of up to 35%, and the judicious mapping of operations to FPGA resources (logic fabric and embedded blocks), enabled by the proposed resource usage metric, leads to improvements of up to 54%.

1. Introduction

This paper addresses the architectural synthesis (AS) of Digital Signal Processing (DSP) algorithms implemented using modern FPGAs. High levels of optimization are achieved through the use of multiple wordlength (MWL) fixed-point descriptions of the algorithms, combined with the use of both LUT-based and embedded FPGA resources. The former notably reduces implementation costs, and the latter minimizes area in FPGA implementations.

The MWL implementation of fixed-point DSP algorithms [14] has proved to provide significant cost savings when compared to the traditional uniform wordlength (UWL) design approach. The introduction of MWL issues in AS increases optimization complexity, but it opens the door to significant cost reductions [2, 3, 5, 6].

FPGA devices have been extensively used in the implementation of DSP algorithms, especially since the recent introduction of specialized embedded blocks (e.g., memory blocks, DSP blocks). Traditional approaches to estimating FPGA resource usage do not apply to modern FPGAs, which present a heterogeneous architecture composed of both logic fabric and embedded blocks, since they only account for lookup table- (LUT-) based resources [7]. This situation calls for new resource usage metrics that can be integrated as part of automatic synthesis techniques to fully exploit the possibilities that embedded resources offer [8–10].

The current approaches to MWL-oriented architectural synthesis either are not tuned to modern FPGAs [2, 3] or do not apply an efficient distribution between logic fabric and specialized embedded blocks [11, 12]. Also, the resource set used during the optimization process does not include the multiplexers necessary to transfer data from memory elements to arithmetic resources.

The main contributions of this paper are the following.

(i) The presentation of a novel resource usage metric that guarantees minimum resource usage for heterogeneous FPGA implementations if integrated within an optimization framework.
(ii) The presentation of an architectural synthesis procedure tuned to fixed-point implementations, that handles a complete datapath (functional units, multiplexers, and registers).
(iii) A novel strategy for fixed-point data multiplexing.

The paper is organized as follows. Section 2 introduces the architectural synthesis of DSP datapaths using multiple wordlength systems and modern FPGAs. Section 3 presents the implementation results from synthesizing several DSP benchmarks under different latency and output noise constraints. Finally, Section 4 draws the conclusions.

2. Synthesis of Fixed-Point Datapaths

2.1. Formal Description

This work focuses on the time-constrained resource minimization problem [13]. The notation used is based on [13], and it is similar to that in [2, 4, 6].

Given a sequencing graph , a maximum latency , and a set of resources (e.g., functional units , registers , and steering logic ), it is the goal of AS to find the time step when each operation is executed (scheduling), the types and number of resources forming (resource allocation), and the binding between operations and variables to functional units and registers (resource binding) that comply with the constraints, while minimizing cost (i.e., area). As a result, a datapath able to compute the algorithm's operations (see Figure 1) as well as the required control logic is generated.

is a formal representation of a single iteration of an algorithm, where is the set of operations and is the set of signals that determine the data flow. We consider composed of typical DSP operations: multiplications, gains (multiplication by a constant value), additions, unit delays, and input and output nodes. Signals are in two's complement fixed-point format, defined by the pair (p, n), where p is the number of integer bits [4] and n is the wordlength of the signal, not including the sign bit (see Section 2.5). The values of these pairs are computed during a previously performed wordlength optimization (WLO) [1, 14–16], which is described in Section 2.5.

Functional units () are in charge of executing the set of operations from . Registers () store the data produced by FUs and some intermediate values. Finally, steering logic () interconnects FUs and registers by means of multiplexers. The set of functional units is composed of LUT-based adders, LUT-based generic multipliers, and embedded multipliers. This set of FUs covers a representative set of modern FPGA devices. An FU is defined by its type and by its size, that depends on the input wordlengths. An operation is compatible with an FU if they have compatible types and if the size of the operation is smaller than or equal to the size of the FU [4, 6].

Scheduling is expressed by means of function , which assigns a start time to each operation. Resource binding is divided into FU binding and register binding. FU binding makes use of the compatibility graph [2], which indicates the compatible resources for each by means of the set of edges . The binding between operations and resources is expressed by means of function , where indicates that operation is bound to the th instance of resource . The compatibility rules impose that . In a similar fashion, register binding links variables to registers by means of function . The set of variables is extracted from considering that there is a variable assigned to the output of each operation from the subset and to each delay connected to another delay. Registers have an associated size that determines the maximum allowed wordlength of the variables bound to them.

The steering logic consists of multiplexers connected at the inputs of FUs and registers. They are in charge of sending data to and from these two types of resources. is determined by , , and , since determines when data are generated, when data are used by FUs, and where data are stored.

2.2. Handling Resource Heterogeneity

The recent appearance of specialized blocks in FPGAs calls for new design methods to efficiently exploit their advantages. In [8], the use of a normalized resource usage vector is proposed. Given an FPGA with T different types of resources, each type t with a maximum number of available resources Nt, the resource requirements of a particular design implementation can be expressed as the following normalized area vector:

A = (a1/N1, a2/N2, ..., aT/NT),

where at is the number of resources of type t used. Two useful norms are the l∞-norm and the l1-norm:

||A||∞ = max{a1/N1, ..., aT/NT},    ||A||1 = a1/N1 + a2/N2 + ... + aT/NT.

The inverse of the l∞-norm represents the number of times that the same implementation of a design can be replicated within the FPGA device (see [8]), and the l1-norm gives information about the overall resource usage of the implementation. Each norm is interesting on its own, but both have pitfalls. On the one hand, if two implementations have the same l∞-norm, they can be replicated the same number of times, but there is no way to know which implementation requires fewer resources. On the other hand, the l1-norm can tell whether a design implementation requires fewer resources than another, but that does not guarantee that the implementation with fewer resources can be replicated more times than the other. In the work presented here, a combination of the l∞-norm and the l1-norm, called the l+-norm (plus-norm), is proposed and applied. A metric that exploits the benefits of both norms while suffering none of their drawbacks should fulfill the following conditions:

This can be expressed by means of a combination of the two norms

A feasible solution for can be found by trying to comply with (6) for areas and , such that requires only one type of resource, and has the biggest value that allows

First let us find upper bounds for and

Substituting (5) and (7) into (6) allows

Since and , a possible range of values of that complies with (4) can be expressed in terms of the number of types of resources (T) and the maximum number of resources of any type (max):

The l+-norm guarantees that, for any two implementations and : (i) if , then can be replicated more times than ; (ii) if , then can be replicated more times than , or the same number of times while consuming fewer resources. Therefore, minimizing the l+-norm implies that the design can be replicated within the FPGA the maximum possible number of times while using the minimum possible number of resources.

The l+-norm has a low computational cost, and it is suitable for both integer linear programming [4, 15] and heuristic [6, 17] approaches.
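
As an illustration, the following Python sketch shows one plausible form of the combined metric; it assumes the plus-norm is computed as the l∞-norm plus a small multiple of the l1-norm, with the weight taken from a bound of the kind discussed above (the exact expression used by the authors is not reproduced here).

# A minimal sketch of the combined usage metric, assuming the plus-norm takes the
# form  l_inf + eps * l_1  with eps small enough that the l_1 term only breaks ties
# between implementations that can be replicated the same number of times.
def normalized_area(used, available):
    """used[t]: resources of type t required; available[t]: resources of type t on the device."""
    return [u / n for u, n in zip(used, available)]

def plus_norm(used, available):
    a = normalized_area(used, available)
    t = len(available)                   # number of resource types
    n_max = max(available)               # largest resource count of any type
    eps = 1.0 / (t * n_max + 1)          # assumed weight; the paper derives the admissible range
    return max(a) + eps * sum(a)

# Example: two designs on a device with 256 slices and 4 embedded multipliers.
# Both replicate the same number of times (same l_inf), but the second uses fewer
# slices and therefore gets the smaller plus-norm.
print(plus_norm([128, 2], [256, 4]))     # 0.5 + eps * 1.00
print(plus_norm([64, 2], [256, 4]))      # 0.5 + eps * 0.75  ->  preferred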

2.3. Resource Modeling

Resources are divided into three types: functional units (), registers (), and steering logic (). The area and latency of FUs and registers are expressed as functions of the input and output wordlength information. They are obtained by applying curve fitting to hundreds of synthesis results. The use of accurate delay cost functions has proved to provide significant performance improvements over other existing naive approaches (from 12% to 63%, see [6]). Registers are assumed to have zero latency in terms of clock period, which holds as long as the clock frequency allows setup and hold times to be met.

Note that A is a vector with as many components as there are types of FPGA resources. Thus, it is possible to apply the l+-norm to A in order to optimize the total datapath area. Multiplexer and wiring latencies are neglected; this simplification could easily be overcome by multiplying the clock period by an empirical factor [18].
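
As an illustration of how such wordlength-dependent cost functions might be built, the sketch below fits a simple linear model of LUT-based adder area to a handful of hypothetical synthesis samples; the paper's actual fitted expressions (and the multiplier models, which would involve wordlength products) are not reproduced here.

# A sketch of building a wordlength-dependent area model by curve fitting, assuming
# adder area grows roughly linearly with each input wordlength. The sample triples
# (n_a, n_b, slices) below are placeholders, not synthesis data from the paper.
import numpy as np

def fit_adder_area(samples):
    """samples: list of (n_a, n_b, slices) triples obtained from low-level synthesis runs."""
    X = np.array([[na, nb, 1.0] for na, nb, _ in samples])
    y = np.array([s for _, _, s in samples])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit of slices ~ a*n_a + b*n_b + c
    return lambda na, nb: float(coeffs @ np.array([na, nb, 1.0]))

adder_area = fit_adder_area([(8, 8, 5), (16, 8, 7), (16, 16, 9)])   # made-up samples
print(round(adder_area(12, 12), 1))                                 # interpolated estimate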

2.3.1. MWL Multiplexers

The area of multiplexers in UWL systems is only affected by the data wordlength, which sets the multiplexer size, and by the number of different data sources (e.g., registers or FUs), which determines the multiplexer width. An estimation of the area of an -input multiplexer of wordlength for Virtex-II devices is given by . This estimation is specific to Virtex-II, Spartan-3, and Virtex-4 devices, since their implementation of multiplexers relies on the combination of 4-input LUTs and dedicated multiplexers. Other FPGA architectures (e.g., Altera's Stratix-II) that make use only of 4-input LUTs would require a different estimation.
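
For illustration only, a first-order slice-count model in the spirit of that estimation might look as follows; it assumes an N:1, one-bit multiplexer costs roughly one Virtex-II slice per four inputs (two 4-input LUTs plus the dedicated MUXF multiplexers), which is an assumption and not the paper's expression.

# A first-order estimate (not the paper's fitted expression), assuming an N:1, 1-bit
# multiplexer on Virtex-II-class logic needs about ceil(N/4) slices, so an N-input,
# w-bit multiplexer costs roughly w * ceil(N/4) slices.
import math

def mux_slices(n_inputs, wordlength):
    return wordlength * math.ceil(n_inputs / 4)

print(mux_slices(4, 16))    # ~16 slices for a 4:1, 16-bit multiplexer
print(mux_slices(8, 12))    # ~24 slices for an 8:1, 12-bit multiplexer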

In MWL systems, data must be aligned before being processed by FUs or stored by registers. In [19] the problem of data alignment and multiplexing is tackled by means of alignment blocks introduced before multiplexers. In this work, multiplexers are used for both data multiplexing and data alignment, since the combination of these two tasks leads to a reduction in the number of control signals, and therefore, control logic. In addition, the chances for logic optimization are greater than if two separate blocks (an alignment block and a multiplexer) are used.

Alignment is required at the inputs of adders and at the outputs of both adders and multipliers. On the one hand, adders require the alignment of their inputs in order to obtain a meaningful result. If an adder is shared to compute several additions, an alignment block is required to place the MSB of the inputs in the right position for each operation (different alignments will be necessary for different operations). On the other hand, the outputs of the different arithmetic operations—both additions and multiplications—in an algorithm can have their MSBs in different positions. Again, if the FUs are shared, the output's MSB changes its position depending on the operation executed; therefore, it is necessary to dynamically align the FU's output in order to store the data in a register.

Figure 2 presents three different types of alignments for a 4-input multiplexer with input signals , , , and and output : arbitrary alignment (see Figure 2(a)), least significant bit (LSB) alignment (see Figure 2(b)), and most significant bit (MSB) alignment (see Figure 2(c)). Note that sign extension (see Figures 2(a) and 2(b)) does not offer any opportunity for logic optimization, while zero padding (see Figures 2(a) and 2(c)) does, due to the reduction in the number of signals and the introduction of constant bits (zeros) that can be hard-wired into the multiplexer logic. In fact, MSB alignment (see Figure 2(c)) is the option that allows the greatest logic reduction. Therefore, it is recommended to apply this alignment whenever possible.

A lower bound on the multiplexers' area when MSB alignment is adopted can be computed as a function of the maximum wordlength present and the wordlength of each signal.
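
A minimal sketch of the MSB-alignment idea follows; the helper operates on bit strings of the magnitude bits (sign bit excluded) and simply zero-pads the least significant end up to the widest wordlength, so every source presents its MSB in the same position and the constant-zero bits can be hard-wired into the multiplexer.

# A sketch of MSB alignment via zero padding of the LSBs (illustrative helper only).
def msb_align(bits, n_max):
    """bits: magnitude bits as a string, MSB first; returns the bits padded to n_max."""
    return bits + "0" * (n_max - len(bits))

sources = ["1011", "110", "10110101"]           # wordlengths 4, 3, and 8
n_max = max(len(s) for s in sources)
print([msb_align(s, n_max) for s in sources])   # all 8 bits wide, MSBs aligned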

2.4. Optimization Procedure

In this subsection we extend the work presented in [6, 17], where the optimization was steered by the -norm and registers and multiplexers were not considered. The optimization procedure is based on Simulated Annealing (SA) [20] and is shown in Algorithm 1. The inputs are the sequencing graph and the total latency constraint . The optimization procedure determines the set of datapath resources , the scheduling , the FU binding , and the register binding , which define the datapath, the steering logic, and the timing of the control signals.

Input: ,
Output: , , ,
(1) Extract , ,
(2) Find initial mapping
(3) Compute initial area from
(4)

    iteration = accepted = exit = 0
(5) while exit condition do
(6)
(7)    iteration = iteration + 1
(8)    Perform change to current
(9)    Compute area A from (Algorithm 2)
(10)
(11)   if then
(12)

       accepted = accepted + 1
(13)   else
(14)      ,
(15)      if then
(16)         ,
             accepted = accepted + 1
(17)      end if
(18)   end if
(19)   if equilibrium state then
(20)
(21)      iteration = accepted = exit = 0
(22)   else if frozen state then
(23)
(24)      iteration = accepted = 0
(25)      exit = exit + 1
(26)   end if
(27)   if restart condition then
(28)

(29)   end if
(30) end while

2.4.1. Simulated Annealing

First, the set of functional units , the set of registers , and the compatibility graph are extracted (line 1). An initial resource mapping is selected by mapping each operation to the fastest LUT-based resource among the compatible resources available for that operation (line 2), and the area occupied by the resulting datapath is used as the initial area (line 3). From this point on (lines 5–30), the optimization proceeds following the typical SA behavior: the algorithm iterates while producing changes (line 8)—also referred to as movements—that modify the value of the cost function (i.e., area) until a certain exit condition is reached. If these changes lead to a cost reduction, they are accepted (line 11); if not, they are accepted with a certain probability that depends on the current temperature (line 15). The temperature starts at a high value and decreases with time. Most movements are accepted at the beginning of the process, thus enabling a wide design space exploration. As the temperature decreases, only those movements which produce small cost deviations are accepted. The temperature is decreased when the equilibrium state is reached (line 19). Sporadic restarting [21] is also allowed (line 27), which repositions the optimization variables at the last minimum state found.

A summary of the simulated annealing parameters and conditions is given in Table 1. The annealing factor of 0.95 was chosen empirically, aiming to balance optimality against solving time.

The variation in cost is normalized with respect to the initial area (line 10). This is a simple way to ensure that the behavior of SA is not affected by the complexity of the algorithm [22], which is approximated by the initial area. This value must be set to 1 for homogeneous-architecture FPGAs, and to a different expression for heterogeneous-architecture FPGAs.

The changes to the cost function (line 8) are performed by applying, with equal probability, one of the following movements to the resource mapping function (a condensed sketch of the whole annealing loop follows the list):

(i) map an operation to a non-mapped resource,
(ii) map an operation to another, already mapped resource,
(iii) swap the mappings of two compatible operations mapped to different resources.
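
For illustration, a condensed Python sketch of the annealing loop of Algorithm 1 is given below. Parameter names, the exit condition, and the acceptance rule are assumptions in the usual SA style; the real cost function is Algorithm 2, `neighbours` would hold the three movements listed above, and the restart mechanism is omitted.

# A condensed sketch of the annealing loop (hypothetical parameter values).
# Cost deviations are normalized by the initial area so that the schedule does not
# depend on the size of the benchmark, as discussed above.
import math, random

def anneal(initial_mapping, cost, neighbours, t0=1.0, alpha=0.95, iters_per_t=100, frozen=5):
    current = best = initial_mapping
    a0 = cost(current)                            # initial area, used to normalize deltas
    c_cur = c_best = a0
    t, exit_count = t0, 0
    while exit_count < frozen:                    # assumed form of the "frozen" exit condition
        accepted = 0
        for _ in range(iters_per_t):              # assumed form of the equilibrium condition
            cand = random.choice(neighbours)(current)   # apply one of the three movements
            c_new = cost(cand)
            delta = (c_new - c_cur) / a0          # normalized cost variation (line 10)
            if delta < 0 or random.random() < math.exp(-delta / t):
                current, c_cur = cand, c_new
                accepted += 1
                if c_cur < c_best:
                    best, c_best = current, c_cur
        t *= alpha                                # annealing factor of Table 1
        exit_count = exit_count + 1 if accepted == 0 else 0
    return best, c_best
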
2.4.2. Area Computation

The computation of the area cost is shown in Algorithm 2. First, it is checked whether the current resource mapping complies with the latency constraint (lines 1–4). If it does not, the actual latency is computed. Later on (line 26), any deviation from the design constraints is penalized by increasing the area cost of the solution. Thus, solutions that do not meet the latency constraint are included within the design space exploration [23]. Even though these solutions are never accepted as valid, their inclusion allows a wider architectural exploration than simply rejecting solutions that do not comply with the constraint.

Input: ,
Output: , , , , A
(1) Compute minimum latency for mapping
(2) if then
(3)
(4) end if
(5) Find set of functional units with mapped operations
(6) for all : Compute instances lower bound [24] and upper bound
(7)
(8)
(9) for do
(10)   for all
(11)
(12)   if then
(13)      if then
(14)
(15)
(16)      end if
(17)   end if
(18) end for
(19) for all ,
(20)
(21) Extract ,
(22) Compute register binding
(23) Extract
(24)
(25)
(26)
(27)
(28) if then
(29)
(30) end if

Then, the resource allocation and resource binding that minimize FU area are sought by means of a loop where several list-based scheduling operations are performed (lines 5–18). The purpose of the loop is to check different combinations of the number of instances of the resources. Both lower [24] and upper bounds on the number of instances of each resource are computed (line 6). All possible combinations of instances are computed and stored in the set of vectors . The list-based scheduling performs an ASAP scheduling of the operations sorted by mobility in ascending order, providing a fast way to find a valid solution. Note that the size of is pruned as the loop iterates: all combinations of FU instances that require areas greater than the minimum found so far are removed (line 15). Thus, resource allocation is sped up.
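
As a rough illustration of the list-based scheduling performed inside this loop, the simplified greedy variant below places operations in ascending mobility order on the earliest-free compatible FU instance; the data structures, the tie-breaking policy, and the latency handling are all hypothetical.

# A simplified sketch of mobility-ordered list scheduling under a fixed allocation
# (hypothetical data structures; mobility = ALAP - ASAP start times, precomputed).
def list_schedule(ops, preds, mobility, latency_of, instances):
    """instances: {fu_type: count}; returns the overall latency of this allocation."""
    finish = {}                                                 # op -> finish time step
    busy_until = {t: [0] * n for t, n in instances.items()}     # per-instance availability
    for op in sorted(ops, key=lambda o: mobility[o]):
        ready = max((finish[p] for p in preds[op]), default=0)  # predecessors done
        fu = min(range(len(busy_until[op.fu_type])),
                 key=lambda i: busy_until[op.fu_type][i])       # earliest-free instance
        start = max(ready, busy_until[op.fu_type][fu])
        finish[op] = start + latency_of(op)
        busy_until[op.fu_type][fu] = finish[op]
    return max(finish.values(), default=0)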

Once the minimum-FU-area scheduling is found, the datapath is defined. The tasks of register binding and multiplexer allocation are not commonly included within the optimization loop, in spite of their impact on the final architecture. In this work, these two tasks are part of the optimization procedure.

Register binding is performed by applying a left-edge algorithm [13]. Input signals are assumed to be available during all cycles and do not require storage. Each variable assigned to a delay is initially assigned a register, and after that, the left-edge algorithm is applied as usual.
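
A minimal sketch of left-edge register binding over variable lifetimes is shown below; the wordlength compatibility of variables and registers and the pre-assignment of delay variables mentioned above are omitted.

# A sketch of the left-edge algorithm: lifetimes are sorted by start time and each
# variable is packed into the first register whose previous occupant has expired.
def left_edge(lifetimes):
    """lifetimes: list of (start, end, name); returns the variables bound to each register."""
    registers = []                        # each entry: [last_end, [variables bound to it]]
    for start, end, name in sorted(lifetimes):
        for reg in registers:
            if reg[0] <= start:           # previous variable no longer alive
                reg[0] = end
                reg[1].append(name)
                break
        else:
            registers.append([end, [name]])
    return [names for _, names in registers]

print(left_edge([(0, 2, "v1"), (1, 3, "v2"), (2, 5, "v3"), (3, 6, "v4")]))
# [['v1', 'v3'], ['v2', 'v4']] -> two registers suffice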

From the sets and and the functions , , and it is possible to extract the steering logic resources . Registers have a single multiplexer (see Figure 1), while FUs have two. A goal of multiplexer definition is to maximize the use of MSB alignment. This alignment can be applied directly to registers and multipliers. However, adders require their inputs to be aligned to each other. Thus, if MSB alignment is applied to the mux connected to one of the inputs, it is not possible to do so for the remaining mux, and vice versa. Finally, the control signals can be easily extracted from the scheduling contained in .

The area vector is computed by adding the area of each resource multiplied by the number of instances required (line 25). If the latency constraint is not met, the area is penalized by means of factor . If the implementation does not comply with the latency constraint and the resulting penalized area is smaller than , then the area is forced to be bigger than (see line 28).

Summarizing, the optimization procedure is controlled by iteratively changing the mapping between operations and FUs. These changes impact the structure of the datapath and, therefore, its area cost, which is the function to be minimized. This method provides a robust way to simultaneously perform the tasks of scheduling, resource allocation, and resource binding for multiple wordlength systems. This procedure was satisfactorily applied in [25].

2.5. Wordlength Optimization: A Case Study

Let us introduce this section through a simple LTI case study (Algorithm 3).

Input: , uniformly distributed
Output:
(1) while true do
(2)    Get new value of , , and
(3)
(4)
(5)
(6)
(7)    New value of output:
(8) end while

Algorithm 3 performs the weighted summation of three signals. The operations involved are two constant multiplications (i.e., gains) and two additions. There are a total of 8 signals.

The goal of WLO is to define the fixed-point format of each signal that enables producing a hardware implementation of the algorithm. The fixed-point format, as mentioned in Section 2.1, is composed of the pair (p, n). Thus, the ultimate goal of WLO is to find the proper set of pairs to optimize the hardware realization of an algorithm. Figure 3 depicts the meaning of these parameters: p is the distance in bits from the binary point to the MSB (a zero distance implies that there is no integer part in the number); n is the number of bits used to represent the number, not counting the sign bit. A common way to address WLO is to split it into two sequential subtasks: scaling, where the values of p are selected, and wordlength selection, where the values of n are chosen.
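
For illustration, the following sketch shows how a value can be mapped to and from a (p, n) format, assuming a two's complement representation whose LSB weight is 2^(p - n); the helper names are hypothetical.

# A small illustration of the (p, n) format: p integer bits (distance from the binary
# point to the MSB) and n magnitude bits excluding the sign, so the stored integer
# code is the value scaled by 2^(n - p).
def to_fixed(x, p, n):
    return round(x * 2 ** (n - p))       # integer code (sign handled by Python ints)

def from_fixed(code, p, n):
    return code / 2 ** (n - p)

code = to_fixed(0.638, 0, 8)             # p = 0: no integer part, 8 magnitude bits
print(code, from_fixed(code, 0, 8))      # 163 0.63671875 -> quantization error ~1.3e-3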

Scaling is performed by carrying out a floating-point simulation, gathering the maximum absolute value of each signal, and computing:
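
A minimal sketch of this scaling step follows, assuming the common rule p = ceil(log2(max|x|)); the paper's exact expression, including any treatment of signedness or safety margins, is not reproduced here.

# Scaling sketch (assumed rule): p is the smallest number of integer bits that covers
# the largest absolute value observed for the signal during floating-point simulation.
import math

def compute_p(trace):
    """trace: floating-point samples of one signal gathered during simulation."""
    peak = max(abs(v) for v in trace)
    return math.ceil(math.log2(peak)) if peak > 0 else 0

print(compute_p([0.20, -0.70, 0.45]))    # peak 0.7  -> p = 0 (no integer bits)
print(compute_p([1.30, -2.80, 2.10]))    # peak 2.8  -> p = 2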

Once scaling is accomplished, the values of p are fixed and the values of n are obtained through an optimization process (wordlength selection). The number of bits assigned to a signal (i.e., n) determines the quantization noise that the signal introduces and, therefore, has a high impact on the final precision of the system, producing an error in the output signal. During the optimization process, different combinations of wordlengths are tried in order to look for a particular set that minimizes cost (i.e., area, speed, power) while complying with the output error constraint. The error of the system is typically measured in terms of the peak error value [5, 26], the signal-to-quantization-noise ratio (SQNR) [11, 27], or the variance of the output error [15]. In this work, we adopt the variance of the output error ().
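
For illustration, a simplified greedy wordlength-selection loop is sketched below. It is not the SA-based optimizer used later in this section: the area and output-noise models are assumed to be supplied by the caller, and signals are shaved one bit at a time while the noise budget is respected.

# A simplified greedy wordlength-selection sketch (not the optimizer of [28]).
def greedy_wlo(signals, n_init, area, noise_var, noise_budget, n_min=2):
    """area(n) and noise_var(n) are user-supplied models over the wordlength map n."""
    n = {s: n_init for s in signals}
    improved = True
    while improved:
        improved = False
        best = None
        for s in signals:
            if n[s] > n_min:
                trial = dict(n)
                trial[s] = n[s] - 1                  # try shaving one bit off signal s
                if noise_var(trial) <= noise_budget:
                    if best is None or area(trial) < area(best):
                        best = trial
        if best is not None and area(best) < area(n):
            n = best                                 # keep the cheapest feasible reduction
            improved = True
    return n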

Table 2 contains the fixed-point formats (i.e., the pairs (p, n)) of the signals of Algorithm 3 for both the UWL and MWL WLO approaches under different error constraints ( and ). The UWL synthesis is achieved by computing the minimum values of p and n that, if applied to all signals, make the fixed-point realization comply with the noise constraint [15]. The MWL synthesis is achieved by means of an SA-based approach, which minimizes the area of a resource-dedicated implementation (with no resource sharing) [28].

Let us focus on the results for . The UWL approach clearly requires longer wordlengths than the MWL approach. The main reason for this is that the UWL optimization is far coarser: a single wordlength must accommodate the noise contribution of every signal. Also, note that some signals' wordlengths are decreased considerably in the MWL approach ( and ). This is due to the fact that signal is multiplied by a small constant, so the quantization noise introduced is also small. Similar results are obtained for . In this case the values of n are bigger, since the error constraint is more restrictive.

Summarizing, the MWL approach enables the generation of fixed-point realizations that require a smaller number of bits. The only drawback is that the complexity of the design process increases, and techniques such as the one proposed in this section are required.

3. Results

Here, the implementation results are presented. The following benchmarks are used:

(i) ITU RGB to YCrCb converter (ITU) [15],
(ii) 3rd-order lattice filter () [29],
(iii) 4th-order IIR filter () [30],
(iv) 8th-order linear-phase FIR filter ().

All algorithms are assigned 8-bit inputs and 12-bit constant coefficients. The algorithm implementations have been tested under different latency and output noise constraint scenarios, assuming a system clock of 125 MHz. In particular, the noise constraints were , where is the minimum number that makes as close as possible to the variance of the quantization noise that the output of the benchmark would present if quantized to 8 bits ().
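
As a worked illustration of how such a reference value can be obtained (assuming the standard uniform-rounding noise model and an 8-bit fractional quantization of the output, which is an assumption about how the 8-bit reference is defined):

# Quantization noise variance of rounding to 8 fractional bits: delta^2 / 12 with
# delta = 2^-8, and the closest power of two to that variance.
import math
delta = 2.0 ** -8
var_8bit = delta ** 2 / 12               # ~1.27e-6
k = round(math.log2(var_8bit))           # -20
print(var_8bit, 2.0 ** k)                # ~1.27e-06  ~9.54e-07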

The target devices belong to the Xilinx Virtex-II family. The area results are normalized with respect to the XC2V40 device (256 slices, 4 embedded multipliers) and expressed according to (2). For instance, an area vector with an l∞-norm equal to or smaller than 1 implies that the XC2V40 is the smallest-cost device able to hold the design, whereas an l∞-norm greater than 1 and equal to or smaller than 2 implies that the smallest-cost device able to hold the design is the XC2V80, and so on.
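
A small worked example of this normalization (assuming the XC2V40 reference of 256 slices and 4 embedded multipliers, and that the XC2V80 offers roughly twice those resources):

# Normalized area vector with respect to the XC2V40 and the l_inf criterion above.
def normalized(used, ref=(256, 4)):
    return tuple(u / r for u, r in zip(used, ref))

a = normalized((300, 2))      # 300 slices and 2 embedded multipliers used
print(a, max(a))              # (1.171875, 0.5) -> l_inf = 1.17 -> smallest device: XC2V80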

Before AS, each algorithm is translated to a fixed-point specification by means of two wordlength optimization procedures, that follow a UWL approach and an MWL approach, respectively.

The area results in this section are computed using the resource model explained in Section 2.3, which provides a good estimation of actual synthesis results.

3.1. Uniform Wordlength versus Multiple Wordlength Synthesis: Homogeneous Architectures

Figures 4 and 5 display results comparing UWL versus MWL synthesis using a homogeneous-resource architecture (i.e., only LUT-based resources). Note that the subfigures are arranged in pairs, each pair related to the same benchmark. The left subfigures depict the area versus latency curves for a particular output noise constraint (see Figures 4(a), 4(c), 5(a), and 5(c)), while the right subfigures contain the detailed resource distribution graph for a particular point of their counterparts (see Figures 4(b), 4(d), 5(b), and 5(d)). Let us define as the minimum latency attainable for a UWL-homogeneous implementation of an algorithm, and as the equivalent for an MWL-homogeneous implementation. The latency used for the experiments ranges from to .

Figures 4(a) and 4(b) contain the implementation results of the benchmark with an output noise variance of . Figure 4(a) depicts how both the UWL and MWL areas decrease as the latency increases. This is expected, since the greater the latency, the greater the chance of FU reuse. The comparison of the two implementation curves yields that the improvement obtained by means of the MWL approach ranges from 51% to 77%. Also, the minimum latency that each implementation achieves differs considerably. The fine-grain tradeoff between area and quantization noise performed by the MWL approach allows important area reductions when compared to the UWL approach. Figure 4(b) displays the detailed resource distribution for the UWL and MWL implementations corresponding to and . The overall area savings are 77%, owing to the fact that the wordlengths of the majority of signals, which impact FUs, multiplexers, and registers, have been greatly reduced: the FUs' area has been reduced by 83%, the FUs' multiplexers by 59%, the registers by 62%, and the registers' multiplexers by 39%. It is important to highlight that the area due to multiplexers and registers, although smaller than the FUs' area, makes up a significant part of the total area (20% for UWL and 39% for MWL), hence the importance of including their cost within the optimization loop, which is analyzed in Section 3.4.

The other benchmarks also show large area improvements: up to 49% (see Figure 4(c)), up to 49% (see Figure 5(a)), and up to 28% (see Figure 5(c)). As observed in the detailed resource distribution subfigures (see Figures 4(d), 5(b), and 5(d)), the area of the majority of the resources has been greatly decreased. Also, it is noted that the percentage of area devoted to data multiplexing and storage is high in proportion to the overall implementation area. The minimum latency is also improved (see Figures 4(a), 4(b), and 5(a)).

In Figure 4(c) the MWL area does not decrease as the latency increases. This is due to the fact that the wordlengths are small enough to allow maximum resource sharing for all latencies, hence the coincidence of the area results for the MWL implementations. This situation might change if a different error constraint () were applied during WLO.

Table 3 contains the implementation results for all the benchmarks corresponding to three different quantization noise scenarios. For each quantization scenario the latency ranges from to , and the minimum, maximum, and mean values of the area improvements obtained by the MWL implementations with respect to the UWL implementations are computed. The first column of the table contains the name of the benchmark; the second, the output noise variance; and the third, the area improvement values. The last row holds the minimum, maximum, and average improvements considering all results simultaneously.

The area improvements obtained are remarkable: mean improvements range from 47% to 77%. Note that the minimum improvements obtained for all benchmarks are quite close to both the maximum and the mean. The results clearly show that an MWL-based AS approach achieves significant area reductions.

Regarding latency, the minimum latency achievable by UWL implementations is reduced on average by 22% by means of MWL AS.

3.2. Uniform Wordlength versus Multiple Wordlength Synthesis: Heterogeneous Architectures

Figures 6 and 7 contain results on the comparison of UWL versus MWL synthesis using a heterogeneous-resource architecture (i.e., both LUT-based and embedded resources present). The arrangement of figures is similar to that of the previous subsection. Now, the latency ranges from to (HET indicates heterogeneous implementations).

Figures 6(a), 6(c), 7(a), and 7(c) contain the implementation area versus latency curves. The graphs clearly show how the area is reduced by means of MWL synthesis: up to 79% (see Figure 6(a)), up to 35% (see Figure 6(c)), up to 40% (see Figure 7(a)), and up to 26% (see Figure 7(c)). The detailed resource distributions in Figures 6(b), 6(d), 7(b), and 7(d) show how the majority of resources are decreased; in particular, the embedded multipliers and the FUs' multiplexers are clearly optimized. For instance, the resource distribution for and (see Figure 6(b)) shows an overall area reduction of 72%. The LUT-based resources are reduced by 59% (LUT-based FUs' area is reduced by 32%, FUs' multiplexers by 74%, registers by 32%, and registers' multiplexers by 36%), while the embedded FUs are reduced by 75%.

Note that the area of embedded resources in Figures 6(d) and 7(d) is the same for both the UWL and MWL approaches; in fact, a single multiplier is being used (1 out of 4). This happens because the wordlengths involved in the multiplications, though not the same, are small enough in both the UWL and MWL approaches to enable the use of a single embedded multiplier. However, the LUT-based areas are quite different and, as a result, the overall resource usage is much smaller for the MWL implementation.

In Figure 6(c) the UWL and MWL areas do not decrease as the latency increases. Again, this is due to the fact that the particular wordlengths obtained allow maximum resource sharing for all latencies. Different error constraints () might change this situation.

Again, the figures show how the minimum latency can be significantly improved by means of an MWL approach. Also, it can be seen that the LUT-based resources are devoted almost entirely to data multiplexing and storage.

Table 4 contains the implementation results of all the benchmarks corresponding to three different quantization noise scenarios. For each quantization scenario the latency ranges from to , and the minimum, maximum, and mean area improvements obtained by the MWL implementations in comparison to the UWL implementations are computed considering -norm area, the LUT-based area, and the embedded FUs area. The first column in the table contains the name of the benchmark. The second, the output noise variance applied. The third column contains the minimum, maximum, and mean -norm area improvement values. The fourth column contains the minimum, maximum, and mean values of the LUT-based resource area. And the last column contains the minimum, maximum, and mean values of the embedded FUs area.

The area improvements obtained are considerable; obtains up to 80.77%, up to 48.87%, up to 65.13%, up to 38.01%. Note that the minimum improvements obtained for most of the benchmarks are again quite close to both the maximum and the mean.

The LUT-based area reductions are up to 81.07% for , up to 48.87% for , up to 65.13% for , and up to 43.83% for . The embedded resources are only reduced for benchmarks (up to a 75.00%) and (up to 83.33%). Benchmarks and use the minimum possible number of embedded resources (1 embedded multiplier), hence the 0% improvement.

Area improvements up to 80.77% are achieved. The average improvement is 44.88% for the overall area, 42.76% for the LUT-based resources, and 24.03% for the embedded resources. The results clearly show that an MWL-based approach for AS leads to significant area reductions.

As a final note regarding these area results, the authors would like to emphasize that the plus-norm has been used during the optimization process, but it is not used to present the results, as it cannot be directly related to the percentage of occupation of the FPGA. Thus, the -norm is used instead.

The latency analysis shows that the minimum UWL latency is reduced by an average of 19% by means of MWL AS.

3.3. MWL Synthesis: Heterogeneous versus Homogeneous

Table 5 contains the implementation results of all the benchmarks corresponding to three different quantization noise scenarios. For each quantization scenario the latency ranges from to , and the minimum, maximum, and mean values of the area improvements, in terms of the -norm, obtained by the heterogeneous implementations in comparison to the homogeneous implementations are computed. The first column of the table contains the name of the benchmark; the second, the output noise variance applied; and the third, the minimum, maximum, and mean area improvement values.

The area improvements obtained are remarkable; obtains up to 54.76%, up to 43.09%, up to 48.79%, and up to 44.68%. Note that, again, the minimum improvements obtained for all benchmarks are quite close to both the maximum and the mean. Area improvements of up to 54.76% are achieved, with an average improvement of 40.23%. The results clearly show that the inclusion of embedded resources within AS leads to highly optimized DSP implementations.

Regarding latencies, the minimum latency achievable by both homogeneous and heterogeneous implementations is the same for the experiments performed. This is due to the fact that the latencies of the resources are very similar under the particular conditions used for the tests. The same experiments were repeated after increasing the constant wordlength to 16 bits, showing that heterogeneous implementations reduce the minimum latency by 7% with respect to homogeneous implementations.

3.4. Effect of Registers and Multiplexers

In this subsection, the effect of including the cost of registers and multiplexers within the optimization loop is investigated. As in the previous experiments, the analysis is performed by implementing the benchmarks under different noise and latency constraints. Before AS, gradient-descent quantization [28] is applied according to the given noise constraint. The comparison is done by using Algorithm 1 to perform the AS with two different area cost estimation solutions: (i) Algorithm 2, which is referred to as the complete area estimation algorithm, and (ii) a simplified version of Algorithm 2 (simplified area estimation algorithm) where the cost of registers and multiplexers is neglected. When the simplified area estimation is used, the cost of registers and multiplexers is added after the optimization loop has finished its execution, using the complete area estimation (Algorithm 2).

Table 6 contains the results of this analysis. The latencies range from to , where ARCH refers to the type of FPGA architecture used (homogeneous or heterogeneous). The noise constraints are the same used in the previous subsection (three for each benchmark), though the results have been combined into a single row. The first column contains the type of FPGA architecture. The second column indicates the benchmark used. And the fourth column contains the minimum, maximum, and average area improvements obtained by the complete area estimation synthesis in contrast to the simplified area estimation synthesis. The last row includes the minimum, maximum, and mean improvements for all benchmarks.

The average improvements for the different benchmarks range from 0.00% to 8.09%, with an overall average improvement of 2.09%. The maximum improvement found is 35.57%. These results clearly show that failing to include the cost of registers and multiplexers during the optimization procedure can lead to unwanted area penalties.

4. Conclusions

In this paper, an architectural synthesis procedure able to produce optimized fixed-point implementations using modern FPGA devices has been presented. The key to success is the use of highly accurate models of the datapath resources, a complete datapath resource set that includes multiplexers and registers, a novel method to handle fixed-point data alignment and multiplexing, and the introduction of a novel resource usage metric that can cope with LUT-based and embedded FPGA resources.

The AS procedure produces area improvements of up to 80% when compared to uniform wordlength implementations, and latency improvements of up to 22%. The efficient use of embedded resources achieves area improvements of up to 54% when compared to homogeneous implementations. Also, the inefficiency of current FPGA architectures in implementing data steering was exposed.

These results are intended to be further improved by tightly combining the fixed-point refinement process with the architectural synthesis [4, 31]. Also, the inclusion of the control logic in the resource model is regarded as a future research line.

Acknowledgment

This work was supported by the Spanish Ministry of Education and Science under Research Project TEC2006-13067-C03-03.