Research Article  Open Access
High-Level Synthesis under Fixed-Point Accuracy Constraint
Abstract
Implementing signal processing applications in embedded systems generally requires the use of fixed-point arithmetic. The main problem slowing down the hardware implementation flow is the lack of high-level development tools to target these architectures from an algorithmic specification language using floating-point data types. In this paper, a new method to automatically implement a floating-point algorithm into an FPGA or an ASIC using fixed-point arithmetic is proposed. An iterative process on high-level synthesis and data wordlength optimization is used to improve both of these dependent processes. Indeed, high-level synthesis requires operator wordlength knowledge to correctly execute its allocation, scheduling, and resource binding steps. Moreover, the wordlength optimization requires resource binding and scheduling information to correctly group operations. To dramatically reduce the optimization time compared to fixed-point simulation-based methods, the accuracy evaluation is done through an analytical method. Different experiments on signal processing algorithms are presented to show the efficiency of the proposed method. Compared to classical methods, the average architecture area reduction is between 10% and 28%.
1. Introduction
Implementing signal processing applications in embedded systems generally requires the use of fixed-point arithmetic [1, 2]. In the case of fixed-point architectures, operators, buses, and memories need less area and consume less power than their floating-point equivalents. Furthermore, floating-point operators are more complex and lead to longer execution times.
However, the main problem slowing down the hardware implementation flow is the lack of high-level development tools to target these architectures from an algorithmic specification language using floating-point data types. In this design process, two kinds of high-level Computer-Aided Design (CAD) tools are mainly required for reducing the time-to-market: floating-point to fixed-point conversion and High-Level Synthesis (HLS).
For a hardware implementation such as an FPGA or an ASIC, the floating-point to fixed-point conversion is a complex and error-prone task that converts an application specified with high-precision floating-point data into an algorithm using fixed-point arithmetic, usually under an accuracy constraint. Then, HLS automatically translates the algorithm specified with fixed-point data into an optimized dedicated architecture. In the processing part of this architecture, the number and the type of operators must be defined. Moreover, each operator input and output wordlength must be determined. For complex designs, the wordlength search space is too large for a manual exploration. Thus, time-to-market reduction requires high-level tools to automate the fixed-point architecture synthesis process.
The aim of HLS handling multiple wordlengths is to minimize the implementation cost for a given fixed-point accuracy constraint. This process leads to an architecture where each operator wordlength has been optimized. The best results are obtained when the wordlength optimization (WLO) process is coupled with the HLS process [3, 4]. HLS requires the operator wordlengths to correctly execute its allocation, scheduling, and resource binding steps, but the wordlength optimization process requires the operation-to-operator binding. To deal with this mutual dependency, an iterative refinement process should be used. Many published methodologies [5–9] do not couple the data WLO and HLS processes. Moreover, simulation-based accuracy evaluation is used, which leads to prohibitive optimization times.
In this paper, a new method for HLS under accuracy constraint is proposed. The WLO process and the HLS are combined through an iterative method. Moreover, an efficient WLO technique based on a tabu search algorithm is proposed to obtain better-quality solutions. Compared to existing methods, the HLS process itself is not modified, so the method can take advantage of existing academic and commercial tools. Furthermore, the proposed method benefits from an analytical accuracy evaluation tool [10], which allows obtaining reasonable optimization times. Experiments show that good solutions can be obtained within a few iterations.
This paper is organized as follows. In Section 2, related work in the area of multiple wordlength architecture design is summarized. Then, the proposed fixed-point conversion method for hardware implementation is presented in Section 3. The multiple wordlength architecture optimization is detailed in Section 4. In Section 5, different experiments on various signal processing algorithms are presented to show the efficiency of the proposed method. Finally, Section 6 draws conclusions.
2. Related Work
The classical method used to optimize data wordlengths relies on a uniform wordlength (UWL) for all data, which reduces the search space to one dimension and simplifies the synthesis because all operations are executed on operators with the same wordlength. However, considering a specific fixed-point format for each data leads to an implementation with reduced power, smaller area, and shorter execution time [9].
In the sequential method [5–9], the wordlengths are first optimized and then the architecture is synthesized. The first step gives a fixed-point specification that respects the accuracy constraint. For this purpose, a dedicated resource is used for each operation, so HLS is not considered at this stage because there is no resource sharing. The second step of the process corresponds to HLS. In [9], a heuristic combining scheduling and resource sharing is proposed for a data flow graph with different wordlengths. This method implements a fixed-point application whose numerical accuracy is greater than the constraint: in the first step, the data WLO gives a numerical accuracy close to the accuracy constraint, but, in the second step, the binding to larger operators improves the global numerical accuracy. Consequently, the obtained solution may not be optimized exactly for the specified accuracy constraint, given that the two steps are not coupled.
A method combining wordlength optimization and HLS has been proposed in [11]. This method is based on Mixed Integer Linear Programming (MILP) and leads to an optimal solution. Nevertheless, some simplifications have been introduced to limit the number of variables: the method is restricted to linear time-invariant systems, and the operator latency is restricted to one cycle. Moreover, the execution time to solve the MILP problems can become extremely long, and several hours can be needed for a classical IIR filter.
In [3], the authors propose a method where the HLS is achieved during the WLO phase. The authors take resource sharing into account to reduce the hardware cost but also to reduce the optimization time. Indeed, the accuracy evaluation is obtained through fixed-point simulations; therefore, heuristics are used to limit the search space and to obtain a reasonable optimization time. A first step analyzes the application SFG and groups some data according to rules; for example, addition inputs are specified with the same fixed-point format. The second step determines the required minimum wordlength (MWL) for each data group. The MWL of a group corresponds to the smallest wordlength for the data of the group that fulfills the accuracy constraint when the quantization effect of the other groups is not considered. This MWL is used as a starting point because its computation can be achieved in a reasonable execution time when simulation-based methods are used to evaluate the accuracy. In the third step, the fixed-point specification is scheduled and the groups are bound to operators using the wordlengths found in the previous step. During the combined scheduling-binding, some operations are bound to larger operators. Finally, the last step corresponds to the operator WLO.
The synthesis and WLO processes have to interact and have to end with a final synthesis that exactly implements the fixed-point specification optimized for the given accuracy constraint. Indeed, the last step of the method proposed in [3] optimizes the operator wordlengths, but this process can invalidate the scheduling obtained in the previous step.
In [4], a method combining WLO and HLS through an optimization process based on simulated annealing is proposed. In the following, a movement refers to a modification of the system state in the simulated annealing heuristic. This method starts with the solution obtained with a uniform wordlength (UWL). Movements on the HLS are carried out by changing the mapping of the operations to the operators: an operation can be mapped to an unmapped or to an already mapped resource, or operations can be swapped. Movements on the WLO are carried out by modifying the operation wordlengths: a movement can increase or decrease the wordlength of a signal by one bit, or make the wordlengths of the operations mapped to the same operator more uniform. A movement is accepted if the implementation cost is improved compared to the previous solutions and the accuracy constraint is fulfilled. If the accuracy constraint is fulfilled but the implementation cost is not improved, the movement is accepted with a certain probability decreasing with time. Thus, for each movement, the implementation cost, the fixed-point accuracy, and the total latency of the current solution must be computed. Stochastic algorithms lead to good-quality solutions for optimization problems with local minima, but they are known to require a great number of iterations to reach the optimized solution. Given that each iteration requires an architecture synthesis and an accuracy evaluation, the global optimization time can be very high.
In this paper, a new HLS method under accuracy constraint is proposed. An iterative process is used to link the HLS and WLO processes, and good results are obtained within a few iterations. The accuracy evaluation is carried out through an analytical method, leading to reasonable optimization times. Compared to [3, 4], a classical HLS tool can be used, and no modification of this tool is required. Thanks to the analytical method, the optimized wordlength (OWL) associated with each operation can be computed in a reasonable time and is used as the starting point, as opposed to the MWL used in [3]. The MWL is less relevant than the OWL, since the quantization effects of the other operations are not taken into account.
3. High-Level Synthesis under Accuracy Constraint
A fixed-point data is composed of two parts, corresponding to the integer part and the fractional part, each with a fixed number of bits. Let b_I denote the number of bits for the integer part, including the sign bit, b_F the number of bits for the fractional part, and b = b_I + b_F the total number of bits. The scaling factor associated with the data does not evolve according to the data value as in floating-point arithmetic. So, the aim of the fixed-point conversion process is to determine the optimized number of bits for the integer part and the fractional part.
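To make the format concrete, the sketch below quantizes a real value into such a fixed-point format (a minimal illustration only; the truncation and saturation policies, and the function name, are assumptions rather than rules taken from this paper):

```python
import math

def quantize(x, b_int, b_frac):
    """Quantize x to a signed fixed-point format with b_int integer bits
    (sign bit included) and b_frac fractional bits.
    Truncation (floor) and saturation on overflow are assumed here."""
    scale = 1 << b_frac
    lo = -(1 << (b_int - 1))               # most negative representable value
    hi = (1 << (b_int - 1)) - 1.0 / scale  # most positive representable value
    q = math.floor(x * scale) / scale      # drop the bits below the LSB
    return min(max(q, lo), hi)
```

For instance, with b_int = 2 and b_frac = 4, the value 0.13 is truncated to 0.125 and the value 3.0 saturates to 1.9375.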
The proposed HLS method under accuracy constraint is detailed in Figure 1. The aim is to implement an algorithm specified with floating-point data types into an architecture using fixed-point arithmetic. This method is based on the definition of a co-synthesis and WLO environment. The input of the framework, for multiple wordlength high-level synthesis, is the Data Flow Graph (DFG) of the application. Nevertheless, a Control Data Flow Graph can be used if the HLS tool supports this intermediate representation. For each operation operand, the binary point position has been determined. The binary point position must allow the representation of the data extreme values without overflow while minimizing the number of bits used for the integer part.
In our multiple wordlength high-level synthesis approach, the operator WLO and the HLS are coupled. The goal is to minimize the architecture cost as long as the accuracy constraint is verified. The multiple wordlength HLS is an iterative process, as explained in Section 3.1. The HLS, under throughput constraint, is carried out with the tool GAUT [12]. The HLS and WLO processes use a library composed of various types of operators characterized in terms of performance for different operand wordlengths.
3.1. Multiple Word-Length Architecture Optimization
To obtain an optimized multiple wordlength architecture, the operator wordlength determination and the HLS must be coupled. Indeed, for the HLS, the operation wordlengths must be known, since the operator propagation time depends on the input and output wordlengths. For the operator WLO, the resource sharing must be taken into account. A group is defined as a set of operations that will be computed on the same operator. To determine a group, the operation assignment must be known. The operations executed by the same operator must have the same wordlength, and this condition must be taken into account during the optimization of the group wordlength.
To couple HLS and WLO, the proposed method is based on an iterative process whose aim is to find the optimized operation binding, which minimizes the cost through wordlength minimization. The method efficiently combines the resource sharing obtained through HLS and the WLO search. This process, presented in Figure 2, consists of four steps.
The first step defines the number of groups needed for each type of arithmetic operation. For the first iteration, the number of groups for each operation type is set to one. Indeed, the number of operators required to execute each operation type is unknown and can be defined only after an architecture synthesis. For the other iterations, the number of groups for each operation type is defined from the HLS results obtained in the previous iteration: it is fixed to the number of operators used for this operation type. In the second step, a grouping algorithm is applied. This step, relatively similar to clustering [6, 9], aims at finding the best operation combinations, which lead to interesting results for the WLO process and the HLS. The technique is presented in Section 3.1.1. The third step searches the wordlength combination for this grouping that minimizes the implementation cost and fulfills the accuracy constraint. This optimization process is detailed in Section 4. The fourth step is the synthesis of the architecture processing part from the fixed-point specification obtained in the third step. After this synthesis, the number of operators used for each operation type has to be reconsidered. Indeed, operation wordlengths have been reduced, leading to a decrease of the operator latencies; this can offer the opportunity to reduce the number of operators during the scheduling. Thus, an iterative process is necessary to converge to an optimized solution, and the algorithm stops when successive iterations lead to the same results or when the maximal number of iterations is reached.
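The four steps can be summarized by the following toy loop (a deliberately crude sketch: the grouping, WLO, and synthesis stages are replaced by stand-ins — sorting by optimized wordlength, taking the group maximum, and a fixed operator budget — so only the iteration structure reflects the method):

```python
def iterative_wlo_hls(operations, owl, ops_per_operator, max_iter=10):
    """Toy sketch of the iterative WLO/HLS coupling.
    operations: operation names; owl: optimized wordlength per operation
    (spatial implementation); ops_per_operator: how many operations one
    operator can execute under the throughput constraint (stand-in for HLS)."""
    n_groups = 1                                   # step 1, first iteration
    previous = None
    while max_iter > 0:
        max_iter -= 1
        # Step 2 (stand-in): sort by decreasing OWL, cut into n_groups chunks.
        ordered = sorted(operations, key=lambda o: owl[o], reverse=True)
        size = -(-len(ordered) // n_groups)        # ceiling division
        groups = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        # Step 3 (stand-in): the wordlength of a group is its maximal OWL.
        wl = [max(owl[o] for o in g) for g in groups]
        # Step 4 (stand-in): operator count fixed by the throughput budget.
        used = -(-len(operations) // ops_per_operator)
        if previous == (tuple(wl), used):          # same result twice: converged
            break
        previous = (tuple(wl), used)
        n_groups = used                            # group count for next iteration
    return groups, wl
```

On the four multiplications of the illustrative example of Section 3.1.2 (OWLs of 11, 7, 7, and 10 bits, two operations per operator), this loop converges to the groups {m1, m4} and {m2, m3}.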
3.1.1. Operation Grouping
Operation grouping is achieved from an analysis of the synthesis result. A group is defined for each operator of the synthesized architecture. Each group is associated with a wordlength that corresponds to the maximal optimized wordlength of the operations belonging to the group. The optimized wordlength of an operation is the one obtained with a spatial implementation, that is, when each operation has a dedicated fixed-point operator. It is computed with the optimization algorithm presented in Section 4.3, with one group assigned to each operation.
For each operation, a mobility interval is computed. The mobility index is defined as the difference between the execution dates obtained for two list schedulings, in the direct and reverse directions. Operations are treated in priority order, least-mobility operations first. The mobility index is used to select the most appropriate group for each operation.
To group the operations, the optimized wordlength associated with each operation is considered. An operation is preferably associated with the group whose wordlength is immediately greater than the optimized operation wordlength and compatible in terms of mobility interval. In case of mobility inconsistency, the grouping algorithm tries, firstly, to make the operation take the place of one or more operations having a smaller wordlength, secondly, to place the operation in another group having a greater wordlength, and finally creates a new group with this operation if the other alternatives have failed. The idea of this method is to obtain for each operation the smallest wordlength and to favor placement in smaller-wordlength groups. When an operation has been removed from a group, it returns to the top of the priority list of operations to be assigned. The convergence of the algorithm is ensured by the fact that operations are placed one by one in groups according to their priority and can be returned to the priority list only by operations with a strictly larger wordlength.
3.1.2. Illustrative Example
The following example illustrates the concepts for operation grouping described above. The DFG presented in Figure 3 is considered. The circles represent addition operations (named a1,…,a7), the rectangles represent multiplication operations (named m1,…,m4), the arrows represent data dependencies, the numbers in italics are the optimized wordlengths of the corresponding operations, and the vertical bars represent the scheduling alternatives for the multiplications. The time constraint is set to 6 clock cycles. Table 1 gives the optimized wordlength, the initial mobility index, and the associated priority for the multiplication operations.

In the rest of the example, the scheduling and the cost of the additions are not considered, to simplify the illustration. The latency of the multiplications is equal to two clock cycles. For the second iteration of the iterative process, the algorithm proceeds as follows.
Step 1. The ready operation m1, which has the highest priority, is scheduled and assigned to a resource. As no resource has been selected yet, this operation defines the first resource for type m. This resource is named M1 (with a wordlength of 11 bits).
Step 2. The ready operation m2 (now with the highest priority) is scheduled and assigned to a resource. As there is a free time slot on resource M1 compatible with m2, m2 is assigned to M1.
Step 3. The ready operation m3 (now with the highest priority) is scheduled and assigned to a resource. As there is no compatible time slot on M1 given the mobility of m3, a second resource M2 is created with a wordlength of 7 bits.
Step 4. Operation m4 has been ready from the beginning but with the lowest priority, due to its higher mobility compared to the other operations. A standard list-scheduling algorithm would have allocated this operation to resource M2, since there is no more room on resource M1, increasing the wordlength of group M2 to 10 bits. The proposed algorithm allows operation m4 to deallocate operation m2, which has a smaller optimized wordlength, and m4 is scheduled on resource M1.
Step 5. This step tries to reallocate operation m2 to the resource with the immediately greater wordlength, corresponding to resource M1. As there is no room and no operation with a smaller optimized wordlength, operation m2 is placed on resource M2, whose wordlength is updated to 8 bits. Observe that the mobility of operation m3 is used to maximize the use of the resources and to let operations m2 and m3 fit on resource M2.
After Step 5, there is no more operation to schedule and allocate, so the algorithm finishes. Resource M1 executes operations m1 and m4 with an effective wordlength of 11 bits, and resource M2 executes operations m2 and m3 with an effective wordlength of 8 bits, resulting in a smaller architecture, whereas a more naive algorithm would have required an 11-bit and a 10-bit multiplier.
Figure 4 presents the various assignment steps. The step number of each operation assignment is indicated by circled numbers. Figure 4(a) presents Steps 1 to 3 and Figure 4(b) Steps 4 and 5, after the reassignment of operation m2.
(a) Steps 1–3
(b) Steps 4–5
3.2. High-Level Synthesis Process
The high-level synthesis tool GAUT [12] is used to generate an optimized architecture from the DFG of the application. The aim is to minimize the architecture area for a given throughput constraint. The high-level synthesis process is composed of different steps. The selection module selects the best operator for each operation from the library. This library is the same as the one used for the wordlength optimization: each component is characterized in terms of area, latency, cadence, and energy consumption. A list scheduling is used to schedule the operations. The algorithm is based on a mobility heuristic depending on the availability of the allocated operators. Operation assignment to operators is carried out simultaneously with the scheduling task. Finally, the architecture is globally optimized to obtain a good trade-off between the storage elements (registers, memories) and the interconnection elements (multiplexers, demultiplexers, tri-states, and buses). Different algorithms can be used; the best results for complex applications are obtained with a variant of the left-edge algorithm.
4. Word-Length Optimization
The aim of the WLO is to find the best group wordlengths, that is, those which minimize the architecture cost while the accuracy constraint is fulfilled. This optimization problem can be written as: minimize C(w) subject to λ(w) ≥ λmin, where w is the vector containing the wordlength associated with each group, λ(w) is the numerical accuracy obtained for a given group wordlength vector w, λmin is the accuracy constraint, and C(w) is the cost function. The evaluation of the numerical accuracy is summarized in Section 4.2.1. The cost function is evaluated with the method presented in Section 4.1. For each tested combination, the accuracy and the cost function are evaluated with mathematical expressions, so the optimization time is significantly reduced compared to a simulation-based method.
4.1. Cost Function
The aim of the HLS is to obtain an optimized architecture from the application functional specification. The architecture processing part is built by assembling different logic entities corresponding to arithmetic operators, multiplexers, and registers. These elements come from a library associated to the targeted FPGA or ASIC technology.
4.1.1. Generation of Operator Library
In the case of multiple wordlength synthesis, the arithmetic operator library contains operators with different input and output wordlengths. The library generation flow is described in Figure 5. First, the different library elements are synthesized to obtain placed-and-routed blocks. Then, these elements are characterized with the information collected after synthesis.
A parameterized VHDL description is written for each operator type. From this description, a logic synthesis is achieved separately for each library element. A script-based method is used to automatically generate the library elements for the different wordlengths. This logic synthesis is achieved with the Synplify Pro tool (Synopsys) for FPGA and with Design Compiler (Synopsys) for ASIC.
Each operator of the library is characterized by the number of resources used (logic cells and dedicated multipliers for FPGA, standard cells and flip-flops for ASIC), the propagation time, and the energy consumption, for different input and output wordlengths. For the HLS, the latency of an operator is expressed as a number of cycles of the clock period used for the system.
This information is used in the HLS and WLO processes. The mean power consumption of the components is characterized at the gate level with several random input vectors; the number of vectors is chosen to ensure the convergence to the mean value. This characterization is finally saved as an XML database exploited by the proposed method.
4.1.2. Model of Cost Function
The aim of the WLO is to minimize the architecture cost. Let O denote the subset of operators, taken from the library, that are used in the architecture, and let c(o) denote the cost associated with each operator o of O. This cost depends on the wordlength of operator o. The global cost C for the architecture processing part is defined as the sum of the costs of the operators used in the architecture: C = Σ c(o), the sum being taken over all operators o of O.
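As an illustration, this sum can be transcribed directly; the per-operator LUT figures below are invented for the example, not taken from an actual characterized library:

```python
def architecture_cost(operators, lut_area):
    """Global cost C: sum of the per-operator costs c(o), each depending on
    the operator type and its wordlength."""
    return sum(lut_area[(op_type, wl)] for op_type, wl in operators)

# Hypothetical characterization: (operator type, wordlength) -> LUT count.
lut_area = {
    ('add', 8): 8, ('add', 11): 11,
    ('mul', 8): 72, ('mul', 11): 135,
}
```

For an architecture using an 11-bit multiplier, an 8-bit multiplier, and an 11-bit adder, the cost would be 135 + 72 + 11 = 218 LUTs.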
The cost used in the proposed method corresponds to the architecture area evaluated through the number of LUTs of a functional unit. For FPGA integrating dedicated resources, the user has to define a maximum number for each dedicated resource type. Moreover, other cost functions can be used to optimize energy consumption.
4.2. Constraint Function
The constraint function of the optimization problem corresponds to the fixed-point numerical accuracy. The use of fixed-point arithmetic leads to an unavoidable error between the results in finite precision and in infinite precision. The fixed-point implementation is correct only if the application quality criteria are still fulfilled. Given that the link between the fixed-point operator wordlengths and the application quality criteria is not direct, an intermediate metric is used to define the fixed-point accuracy. The most commonly used metric is the Signal to Quantization Noise Ratio (SQNR) [3]. This metric corresponds to the ratio between the signal power and the quantization noise power.
The accuracy constraint λmin, corresponding to the minimal value of the SQNR, is determined from the system performance constraints. This accuracy constraint is defined such that the system quality criteria are still verified after the fixed-point conversion process.
4.2.1. Fixed-Point Accuracy Evaluation
Two kinds of methods can be used to determine the fixed-point accuracy: simulation-based and analytical methods. Simulation-based methods estimate the quantization noise power statistically from signal samples obtained after fixed-point and floating-point simulations [3]. The floating-point result is considered as the reference because the associated error is negligible compared to the fixed-point one. The fixed-point simulation requires emulating all the fixed-point arithmetic mechanisms. Moreover, to obtain an accurate evaluation, a large number of samples is necessary. The combination of these two phenomena leads to a long simulation time. In the WLO process, the fixed-point accuracy is evaluated at each iteration. For complex systems, where the number of iterations is large, the fixed-point simulation time becomes prohibitive, and the search space cannot be explored.
An alternative to simulation-based methods is the analytical approach, which determines a mathematical expression of the noise power at the system output according to the statistical parameters of the different noise sources induced by quantization. In this case, the execution time required to evaluate the noise power is significantly lower. Indeed, the SQNR expression is determined only once; then, the SQNR is evaluated quickly at each iteration of the WLO process through a mathematical expression. The method used in this paper to compute the accuracy automatically obtains the quantization noise power expression from the signal flow graph (SFG) of the application. The SFG is obtained from the DFG by inserting the delay operations between data.
An analytical approach to evaluate the fixed-point accuracy has been proposed for linear time-invariant systems in [10] and for systems based on smooth operations in [13]. An operation is considered to be smooth if its output is a continuous and differentiable function of its inputs, as is the case for arithmetic operations. In the analytical expression of the output quantization noise power, the gains between the different noise sources and the output are computed from the impulse response of the system between the output and the noise sources. This approach has been implemented in a software tool to automate the process. Our numerical accuracy evaluation tool generates the analytical expression of the output quantization noise from the signal flow graph of the application. This analytical expression is implemented through a C function having the wordlengths of all data as input parameters. This C code can be compiled and dynamically linked to the fixed-point conversion tool for the optimization process.
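The shape of such an analytical expression can be illustrated with a toy model, under the classical assumption that each quantization source with b fractional bits contributes a noise power of 2^(-2b)/12, weighted by a squared path gain to the output; the gains and the signal power used in the example are invented, not computed from a real impulse response:

```python
import math

def output_noise_power(frac_wl, gains):
    """Output quantization noise power: each source contributes
    2**(-2*b) / 12, scaled by its squared path gain to the output."""
    return sum(g * 2.0 ** (-2 * b) / 12.0 for b, g in zip(frac_wl, gains))

def sqnr_db(signal_power, frac_wl, gains):
    """Signal to Quantization Noise Ratio, in dB, for a wordlength vector."""
    return 10.0 * math.log10(signal_power / output_noise_power(frac_wl, gains))
```

With a single dominant source, each extra fractional bit raises the SQNR by about 6.02 dB, the usual rule of thumb.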
4.3. Optimization Techniques
In the proposed method, a deterministic optimization approach is retained to keep the optimization times reasonable. However, classical greedy algorithms based on steepest descent or mildest ascent [14] can lead to low-quality solutions. To improve the solution quality, a tabu search algorithm is used. The proposed method is based on three main steps. First, an initial solution is determined by computing the minimal wordlength associated with each optimization variable. Then, a mildest-ascent greedy algorithm is used to optimize the wordlengths; this algorithm starts from the initial solution and leads to an optimized solution. Finally, this optimized solution is refined with a tabu search algorithm to obtain a better-quality solution.
In the first step, the minimum wordlength combination is determined with the algorithm presented in Algorithm 1. All the variable wordlengths are initially set to their maximal value; in that case, the accuracy constraint is satisfied. Then, for each variable, the minimum wordlength still satisfying the accuracy constraint is determined, while all the other variable wordlengths stay at their maximum value.
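This first step can be sketched as follows; the accuracy callback stands in for the analytical SQNR evaluation, and any concrete accuracy model plugged into it is an assumption for illustration:

```python
def minimum_wordlengths(n_vars, w_max, accuracy, constraint):
    """For each variable, find the smallest wordlength that still meets the
    accuracy constraint while all the other variables stay at their maximum
    value.  accuracy(w) is a placeholder for the analytical evaluation."""
    w_min = []
    for i in range(n_vars):
        w = [w_max] * n_vars
        b = w_max
        while b > 1:
            w[i] = b - 1
            if accuracy(w) < constraint:  # one bit fewer would violate it
                break
            b -= 1
        w_min.append(b)
    return w_min
```

Here accuracy(w) is any function returning the accuracy of a wordlength vector w, for instance a compiled analytical SQNR expression.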

The mildest-ascent greedy algorithm presented in Algorithm 2 is used to optimize the wordlengths. Each variable is set to its minimal value determined in the first step. With this combination, the accuracy constraint is in general no longer satisfied, but the advantage of this starting point is that the wordlengths only have to be increased to reach the optimized solution. At each step of the algorithm, the wordlength of one operator is modified to converge toward the optimized solution, obtained when the accuracy constraint is fulfilled. A criterion has to be defined to select the best direction, that is, the operator whose wordlength has to be modified. The criterion is based on the computation of discrete gradients of the cost and of the accuracy, that is, the variation of each metric when a wordlength is increased.

This gradient of the accuracy, used alone, is the direction criterion of the greedy algorithm of [14]. Among deterministic algorithms, this approach does not always give good results: it sometimes takes the wrong direction and returns poor-quality solutions. To improve this criterion, the cost and the accuracy are taken into account jointly: the selected direction is the one that minimizes the cost increase while maximizing the accuracy increase, that is, the direction with the best ratio between the accuracy gradient and the cost gradient.
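A sketch of the mildest-ascent greedy step with this combined criterion follows (the accuracy and cost callbacks are placeholders for the analytical models; the linear toy models in the usage note are invented):

```python
def greedy_wlo(w_min, w_max, accuracy, cost, constraint):
    """Start from the minimum wordlengths and, while the accuracy
    constraint is violated, add one bit to the variable giving the best
    accuracy gain per unit of cost increase (the combined criterion)."""
    w = list(w_min)
    while accuracy(w) < constraint:
        best, best_score = None, None
        for i in range(len(w)):
            if w[i] >= w_max:
                continue                        # variable already at its maximum
            trial = list(w)
            trial[i] += 1
            d_acc = accuracy(trial) - accuracy(w)
            d_cost = cost(trial) - cost(w)
            score = d_acc / max(d_cost, 1e-12)  # accuracy gain per cost unit
            if best_score is None or score > best_score:
                best, best_score = i, score
        if best is None:                        # no movement is possible
            break
        w[best] += 1
    return w
```

With the toy models accuracy(w) = w[0] + w[1] and cost(w) = w[0] + 3·w[1], the cheaper first variable is grown until the constraint is met.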
Greedy algorithms used in WLO are usually mono-directional, either steepest-descent or mildest-ascent. To improve the solution obtained with these mono-directional algorithms, the proposed algorithm is based on tabu search [15] and allows movements in both directions.
A tabu list contains the tabu variables. When a variable is added to the tabu list, its value is not modified afterwards, and this variable is no longer considered in the optimization process. A direction term indicates whether the current movements are ascending or descending. The best wordlength combination obtained so far and its associated cost are recorded during the search.
The algorithm starts with the solution obtained with the mildest-ascent greedy algorithm presented in Algorithm 2 and iterates until all the variables are in the tabu list (lines 22–23).
For each variable , the possibility of a movement is analyzed in lines 8–15. If a variable reaches its maximal value in the ascending direction, or its minimal value in the descending direction, this variable is added to the tabu list. In the other cases, a movement is possible and the metric for finding the best direction is computed in the lines 16–21. During this metric computation, the cost and the accuracy are compared, respectively, to the best cost and the accuracy constraint , and the best solution is updated if necessary.
After the computation of the metric for each variable, the best possible direction is selected. For the ascending direction, the solution leading to the highest value of is selected (lines 26–28). It corresponds to the solution leading to the best tradeoff between the increase of accuracy and the increase of cost. For the descending direction, the solution leading to the lowest value of is selected (lines 33–35). The aim is to reduce the cost without reducing too much the accuracy.
As soon as the accuracy constraint is crossed, the direction is inverted (lines 29–31 and 36–38). In this case, the operator is added to the tabu list if the direction is ascending (lines 29–31). This algorithm iterates until all the variables are not in the tabu list.
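The bidirectional search described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 3: cost() and accuracy() are toy stand-ins for the implementation-cost model and the analytical SQNR evaluation, and the direction metric is deliberately simplified (highest accuracy first, then lowest cost).

```python
def cost(w):
    # toy cost model: operator area grows quadratically with word-length
    return sum(wl * wl for wl in w)

def accuracy(w):
    # toy accuracy model: the narrowest word-length dominates the SQNR
    return 6.0 * min(w)

def mildest_ascent(w_min, w_max, constraint):
    """Greedy start: all word-lengths minimal, then repeatedly grow the
    variable with the best accuracy-gain / cost-increase ratio until the
    accuracy constraint is satisfied."""
    w = list(w_min)
    while accuracy(w) < constraint:
        best_i, best_ratio = None, float("-inf")
        for i in range(len(w)):
            if w[i] >= w_max[i]:
                continue
            trial = list(w)
            trial[i] += 1
            d_acc = accuracy(trial) - accuracy(w)
            d_cost = cost(trial) - cost(w)
            ratio = d_acc / d_cost if d_cost > 0 else float("inf")
            if ratio > best_ratio:
                best_i, best_ratio = i, ratio
        if best_i is None:                 # every variable is at its maximum
            break
        w[best_i] += 1
    return w

def tabu_wlo(w_min, w_max, constraint):
    """Bidirectional refinement: start from the greedy solution, move one
    variable at a time, invert the direction whenever the accuracy
    constraint is crossed, and stop when every variable is tabu."""
    w = mildest_ascent(w_min, w_max, constraint)
    best_w, best_cost = list(w), cost(w)
    tabu = set()
    direction = -1                         # start by trying to shave bits off
    while len(tabu) < len(w):
        moves = []
        for i in range(len(w)):
            if i in tabu:
                continue
            new_wl = w[i] + direction
            if new_wl < w_min[i] or new_wl > w_max[i]:
                tabu.add(i)                # no further move in this direction
                continue
            trial = list(w)
            trial[i] = new_wl
            moves.append((i, trial))
        if not moves:                      # remaining variables just became tabu
            continue
        # simplified metric: keep accuracy as high as possible, then prefer
        # the cheaper solution (the paper combines both discrete gradients)
        i, w = max(moves, key=lambda m: (accuracy(m[1]), -cost(m[1])))
        acc = accuracy(w)
        if acc >= constraint and cost(w) < best_cost:
            best_w, best_cost = list(w), cost(w)
        if direction < 0 and acc < constraint:
            direction = +1                 # crossed below: climb back up
        elif direction > 0 and acc >= constraint:
            tabu.add(i)                    # crossed above: freeze and descend
            direction = -1
    return best_w
```

On the toy models, the greedy start may waste bits on operands that do not improve the accuracy; the tabu phase then shaves them off while never returning a solution below the constraint, illustrating why the bidirectional search can beat the mono-directional greedy result.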
5. Experiments
5.1. Word-Length Optimization Technique
First, the quality and efficiency of the WLO technique based on the tabu search algorithm (Algorithms 1, 2, and 3) are evaluated on different benchmarks. The tested applications are an eighth-order Infinite Impulse Response (IIR) filter implemented as four second-order cells, as presented in Figure 6, a 128-point Fast Fourier Transform (FFT) using a radix-2 decimation-in-frequency (DIF) structure, and a 128-tap Normalized Least Mean Square (NLMS) adaptive filter using an adaptation step of 0.5. The implementation cost and the optimization time are measured for the proposed technique and compared with the results obtained with the greedy algorithm alone, corresponding to Algorithms 1 and 2. The number of variables inside the optimization problem is adjusted by grouping operations together. The improvement of the solution quality due to the tabu search algorithm is defined as

The overhead in optimization time due to the tabu search algorithm is defined analogously.
For the different experiments, the input signal is normalized, and SQNR constraints between 40 and 60 dB, in steps of 1 dB, are tested. The results presented in Table 2 show the improvement obtained with the tabu search algorithm. In our experiments, this improvement can be significant. The optimization time is noticeably increased compared to the greedy algorithm, but the execution time remains low compared to other combinatorial optimization approaches such as stochastic algorithms.

5.2. Illustrative Example for HLS under Accuracy Constraint
To illustrate the proposed method, an Infinite Impulse Response (IIR) filter example is detailed. This filter is an eighth-order IIR filter implemented as four cascaded second-order cells. The signal flow graph (SFG) of this IIR filter, presented in Figure 6, contains 20 multiplications and 16 additions. The method presented in Section 3 is used to obtain the data dynamic range and the binary-point position, and thus a correct fixed-point specification. The SQNR analytical expression is determined, and the accuracy constraint is set to 60 dB. The Stratix FPGA is used for the experiments, with no dedicated resources.
First, the operation word-lengths are optimized for a spatial implementation, in which one operator is used for each operation. The obtained word-lengths are presented in Figure 7 (numbers between parentheses). For the first iteration, a group is defined for each operation type and the group word-lengths are optimized. Thus, multiplications are executed on a 17 × 17-bit multiplier and additions on 20-bit adders. The minimal system clock frequency is set to 200 MHz, so the operator latency is a multiple of 5 ns. The multiplier and adder propagation times are equal to 10.3 ns and 2.5 ns, respectively, so the latencies of the multiplier and the adder are set to 3 and 1 clock cycles, respectively. The hardware synthesis for this fixed-point specification leads to the scheduling presented in Figure 7. For a 70 ns timing constraint, five multipliers and two adders are needed. In the next step, five new groups for the multiplications and two new groups for the additions are defined. These groups, presented in Figure 7, are built depending on the word-lengths obtained for the spatial implementation and on the operation mobility.
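The latency discretization used above (a 200 MHz clock gives a 5 ns cycle, so a 10.3 ns multiplier occupies 3 cycles and a 2.5 ns adder 1 cycle) amounts to rounding each propagation time up to a whole number of clock periods. A minimal sketch:

```python
import math

def latency_cycles(t_prop_ns: float, clock_mhz: float) -> int:
    """Round an operator's propagation time up to whole clock cycles."""
    period_ns = 1000.0 / clock_mhz     # 200 MHz -> 5 ns period
    return math.ceil(t_prop_ns / period_ns)
```

This rounding is also what later makes a sub-10 ns multiplier fit in 2 cycles instead of 3, enabling the operator saving discussed below.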
A group WLO under accuracy constraint is carried out for these seven groups. This optimization results in smaller word-lengths: the five multiplication group word-lengths are 17, 16, 15, 14, and 14 bits, respectively. The HLS for this new fixed-point specification leads to the scheduling presented in Figure 8. Below 16 bits, multipliers have a critical path shorter than 10 ns, that is, a latency of 2 clock cycles, so only four multipliers are now needed. This architecture therefore uses one multiplier fewer than the previous one. The word-length reduction, combined with the decrease in the number of operators, reduces the overall area.
A uniform word-length architecture optimization leads to five multipliers and two adders with a precision of 19 bits; compared to this architecture, the proposed method saves a significant part of the total operator area. A sequential approach, carrying out a word-length optimization for the spatial implementation followed by a high-level synthesis, leads to the same number of operators as our approach. Nevertheless, the word-lengths of the operators are higher than those obtained with our approach, because each operator word-length is imposed by the operation with the greatest word-length bound to it. Consequently, compared to this sequential approach as well, the proposed method saves total operator area. These results show the interest of using multiple word-length architectures and the efficiency of the proposed method, which couples HLS and WLO.
5.3. Pareto Frontier
The proposed method for multiple word-length HLS generates, for given timing (latency or throughput) and accuracy constraints, an architecture optimized in terms of implementation cost. By testing different accuracy and timing constraints, the Pareto frontier associated with the application can be obtained, and the different trade-offs between implementation cost, accuracy, and latency can be analyzed from this curve.
The results obtained for the searcher module of a WCDMA receiver are presented in Figure 9; the data flow graph of the application can be found in [16]. The targeted architecture is an FPGA and only LUTs are considered. The results show an evolution by plateaus. Along the latency axis, the plateaus are due to the introduction of one or several parallel operators to reduce the application latency. Along the accuracy axis, the evolution is piecewise linear. The smooth portions are due to the gradual increase of the operator word-lengths needed to reach the accuracy constraint; the evolution is linear for this application because the architecture is dominated by additions and subtractions, whose implementation cost is linear in the operand word-length. As for the latency, the abrupt changes are due to the introduction of one or several parallel operators to meet the constraints: increasing the accuracy requires operators with greater word-lengths and thus higher operator latencies, so when the increased operator latency no longer satisfies the global timing constraint, one or more additional operators are required. The location of these abrupt changes in the Pareto frontier is tightly linked to the clock period, since the discretization of the operator latency into an integer number of cycles is what causes them.
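The constraint sweep described above can be sketched as follows; synthesize(accuracy, latency) is a hypothetical stand-in for one run of the coupled WLO+HLS flow returning an area cost, and the filter keeps only non-dominated (accuracy, latency, cost) points.

```python
def pareto_frontier(points):
    """Keep the non-dominated (accuracy, latency, cost) triples:
    higher accuracy is better; lower latency and lower cost are better."""
    def dominates(q, p):
        return q != p and q[0] >= p[0] and q[1] <= p[1] and q[2] <= p[2]
    return [p for p in points if not any(dominates(q, p) for q in points)]

def sweep(synthesize, accuracies, latencies):
    # one synthesis run per (accuracy constraint, timing constraint) pair
    pts = [(a, l, synthesize(a, l)) for a in accuracies for l in latencies]
    return pareto_frontier(pts)
```

Plotting the surviving points against latency and accuracy reproduces the plateau-and-jump shape discussed above.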
5.4. Comparison with Other Solutions
In this section, the solution obtained with the proposed method is first compared with a classical method based on a uniform word-length (UWL) and then with a solution using a single word-length for each type of operation. As in [4, 8], to evaluate the efficiency of the proposed method, the obtained solutions are compared with UWL solutions, in which a single word-length is used for all data. For a Fast Fourier Transform (FFT), the UWL solution with a 16-bit precision leads to an SQNR of 58 dB. The cost is evaluated with the proposed method (OPT) and with the UWL method for this accuracy constraint of 58 dB and for different timing constraints. The results are presented in Figure 10. For this application, the proposed method performs better, with a gain on the implementation cost for every tested timing constraint. When the timing constraint is tight, several operators are used for the same kind of operation, and the multiple word-length approach benefits from the possibility of assigning a different word-length to each operator. When the timing constraint is loose, the number of operators in the architecture is lower, and the difference between the OPT and UWL solutions decreases. In the sequential methods used in [8, 17], the word-lengths are first optimized and then the architecture is synthesized; both report gains compared to the UWL solution. These results show that the combination of WLO and HLS in the proposed method gives better results than the sequential method. In [4], the WLO and HLS processes are combined through a simulated annealing optimization, which also yields gains compared to the UWL solution. The proposed method leads to similar gains but requires significantly fewer iterations to reach a good solution. Moreover, in our case the HLS process is not modified, so existing academic or commercial tools can be used directly.
To analyze the efficiency of the proposed iterative method and the interest of coupling WLO and HLS, the optimized solution obtained after several iterations is compared with the solution obtained at the first iteration. The solution obtained at the first iteration (INIT) corresponds to the case where a single word-length is used for all the operators of the same type; in this case, the binding of operations onto operators is not taken into account. The architecture area reduction compared to the first iteration is measured, and the results obtained for the Stratix FPGA are given in Table 3 for different digital signal processing kernels. The complex correlator computes the correlation between complex signals, as in a WCDMA receiver. The experiments are carried out for different accuracy and timing constraints, and the maximal and mean values are reported. The largest architecture area reduction is obtained in the case of the FFT. On average, the reduction is between 10% and 28%. These results show the efficiency of the iterative method in reducing the implementation cost: the information collected from the previous iterations gradually leads to an efficient operation grouping, which improves the HLS.

6. Conclusion
In this paper, a new HLS method under accuracy constraint is proposed. The coupling between HLS and WLO is achieved through an iterative process. To significantly reduce the optimization time compared to simulation-based methods, the accuracy is evaluated with an analytical method.
The efficiency of the proposed method is shown through experiments. Compared to classical implementations based on a uniform word-length, the proposed method significantly reduces the number of resources used to implement the system. These results show the relevance of using multiple word-length architectures. The interest of coupling HLS and WLO is shown on different digital signal processing kernels: this technique reduces the number of operators used in the architecture and can also reduce the latency.
References
[1] D. Novo, B. Bougard, A. Lambrechts, L. Van Der Perre, and F. Catthoor, "Scenario-based fixed-point data format refinement to enable energy-scalable software defined radios," in Proceedings of the IEEE/ACM Conference on Design, Automation and Test in Europe (DATE '08), pp. 722–727, Munich, Germany, March 2008.
[2] D. Novo, M. Li, B. Bougard, L. Van Der Perre, and F. Catthoor, "Finite precision processing in wireless applications," in Proceedings of the IEEE/ACM Conference on Design, Automation and Test in Europe (DATE '09), pp. 1230–1233, April 2009.
[3] K. Kum and W. Sung, "Combined word-length optimization and high-level synthesis of digital signal processing systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, pp. 921–930, 2001.
[4] G. Caffarena and C. Carreras, "Architectural synthesis of DSP circuits under simultaneous error and time constraints," in Proceedings of the 18th IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC '10), pp. 322–327, Madrid, Spain, June 2010.
[5] B. Le Gal, C. Andriamisainat, and E. Casseau, "Bit-width aware high-level synthesis for digital signal processing systems," in Proceedings of the IEEE International SOC Conference (SOCC '06), pp. 175–178, September 2006.
[6] P. Coussy, G. Lhairech-Lebreton, and D. Heller, "Multiple word-length high-level synthesis," EURASIP Journal on Embedded Systems, vol. 2008, no. 1, Article ID 916867, 11 pages, 2008.
[7] B. Le Gal and E. Casseau, "Latency-sensitive high-level synthesis for multiple word-length DSP design," EURASIP Journal on Advances in Signal Processing, vol. 2011, Article ID 927670, 11 pages, 2011.
[8] J. Cong, Y. Fan, G. Han et al., "Bitwidth-aware scheduling and binding in high-level synthesis," in Proceedings of the ACM/IEEE Asia South Pacific Design Automation Conference (ASP-DAC '05), pp. 856–861, Shanghai, China, 2005.
[9] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, Synthesis and Optimization of DSP Algorithms, Kluwer Academic, 2004.
[10] D. Menard, R. Rocher, and O. Sentieys, "Analytical fixed-point accuracy evaluation in linear time-invariant systems," IEEE Transactions on Circuits and Systems I, vol. 55, no. 10, pp. 3197–3208, 2008.
[11] G. Caffarena, G. Constantinides, P. Cheung, C. Carreras, and O. Nieto-Taladriz, "Optimal combined wordlength allocation and architectural synthesis of digital signal processing circuits," IEEE Transactions on Circuits and Systems II, vol. 53, no. 5, pp. 339–343, 2006.
[12] P. Coussy, C. Chavet, P. Bomel, D. Heller, E. Senn, and E. Martin, "GAUT: a high-level synthesis tool for DSP applications: from C algorithm to RTL architecture," in High-Level Synthesis: From Algorithm to Digital Circuit, Springer, Amsterdam, The Netherlands, 2008.
[13] R. Rocher, D. Menard, P. Scalart, and O. Sentieys, "Analytical accuracy evaluation of fixed-point systems," in Proceedings of the European Signal Processing Conference (EUSIPCO '07), Poznan, Poland, September 2007.
[14] M.-A. Cantin, Y. Savaria, and P. Lavoie, "A comparison of automatic word length optimization procedures," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '02), vol. 2, pp. 612–615, 2002.
[15] F. Glover, "Tabu search, Part I," INFORMS Journal on Computing, vol. 1, no. 3, pp. 190–206, 1989.
[16] H.-N. Nguyen, D. Menard, R. Rocher, and O. Sentieys, "Energy reduction in wireless system by dynamic adaptation of the fixed-point specification," in Proceedings of the Workshop on Design and Architectures for Signal and Image Processing (DASIP '08), Brussels, Belgium, November 2008.
[17] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Wordlength optimization for linear digital signal processing," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 22, no. 10, pp. 1432–1442, 2003.
Copyright
Copyright © 2012 Daniel Menard et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.