Research Article  Open Access
Supriya Aggarwal, Kavita Khare, "Redesigned-Scale-Free CORDIC Algorithm Based FPGA Implementation of Window Functions to Minimize Area and Latency", International Journal of Reconfigurable Computing, vol. 2012, Article ID 185784, 8 pages, 2012. https://doi.org/10.1155/2012/185784
Redesigned-Scale-Free CORDIC Algorithm Based FPGA Implementation of Window Functions to Minimize Area and Latency
Abstract
One of the most important steps in spectral analysis is filtering, where window functions are generally used to design filters. In this paper, we modify the existing architecture for realizing window functions using a CORDIC processor. First, we modify the conventional CORDIC algorithm to reduce its latency and area. The proposed CORDIC algorithm is completely scale-free over a range of convergence that spans the entire coordinate space. Second, we realize the window functions using a single CORDIC processor, as against two serially connected CORDIC processors in the existing technique, thus optimizing the design for area and latency. The linear CORDIC processor is replaced by a shift-add network, which drastically reduces the number of pipelining stages required in the existing design. On average, the proposed design requires approximately 64% fewer pipeline stages and saves up to 44.2% area. Currently, the processor is designed to implement the Blackman windowing architecture, which with slight modifications can be extended to other window functions as well. The details of the proposed architecture are discussed in the paper.
1. Introduction
Window filtering techniques [1, 2] are commonly employed in signal processing to limit time and frequency resolution. Various window functions have been developed to suit different requirements for sidelobe minimization, dynamic range, and so forth. Many hardware-efficient architectures are available for realizing the FFT [3–5], but the same is not true for windowing architectures. The conventional hardware implementation of window functions uses lookup tables, which incur growing area and time complexity as the word length increases. Moreover, they do not allow user-defined variations in the window length. An efficient implementation of flexible and reconfigurable window functions using the CORDIC algorithm is suggested in [6, 7]. Though these designs allow user-defined variations in window length, latency is a major problem. The CORDIC algorithm [8–10] inherently suffers from latency issues, and using two CORDIC processors in series, as is done in [6, 7], further increases the overall latency of the system.
In this paper, a new area-time efficient FPGA implementation to realize the Blackman window function is suggested. We first redesign the conventional CORDIC algorithm to eliminate the scale-factor compensation network and to optimize its micro-rotation sequence identification. We then replace the linear CORDIC processor used in the existing design by a shift-add tree derived using Booth multiplication. These modifications reduce the area consumption of the window architecture and decrease the number of pipeline stages.
The rest of the paper is structured as follows. Section 2 provides background on various window functions and the conventional CORDIC algorithm. In Section 3, we propose a new CORDIC algorithm, the redesigned-scale-free CORDIC. Section 4 deals with the architecture for implementing the window functions. Section 5 presents the FPGA implementation and complexity issues, while Section 6 concludes the paper.
2. Background
2.1. Window Filtering Techniques
Window filtering is a well-known processing technique for limiting a signal to a short-time segment in various fields, such as audio or video signal processing and communication systems. The rectangular, Gaussian, Hamming, Hanning, Blackman-Harris, and Kaiser windows are some of the most commonly available windowing techniques [2, 11]. The window is selected according to the spectral characteristics desired by the application. Equations (1a)–(1c) define the Hanning, Hamming, and Blackman window families as follows:
$$W(n) = 0.5\left[1 - \cos\left(\frac{2\pi n}{M}\right)\right], \tag{1a}$$
where $M$ is the window length;
$$W(n) = \alpha - \beta\cos\left(\frac{2\pi n}{M}\right), \tag{1b}$$
where $\alpha + \beta = 1$.
The values of $\alpha$ and $\beta$ are determined to achieve maximum sidelobe cancellation. For the Hamming window, the coefficients are commonly taken as $\alpha = 0.54$ and $\beta = 0.46$;
$$W(n) = a_0 - a_1\cos\left(\frac{2\pi n}{M}\right) + a_2\cos\left(\frac{4\pi n}{M}\right), \tag{1c}$$
where $n = 0, 1, \ldots, M-1$.
The Blackman-Harris window has three degrees of freedom, which can be used to design a family of window functions having different window amplitudes, roll-off rates, and sidelobe rejections. The Blackman window with the classical coefficients $a_0 = 0.42$, $a_1 = 0.5$, and $a_2 = 0.08$ has a sidelobe roll-off rate of 18 dB/octave and a worst-case sidelobe level of about 58 dB, while a second coefficient set gives a sidelobe level of 71.48 dB with a sidelobe roll-off rate of 6 dB/octave.
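As an illustration, the three window families can be evaluated numerically with a short Python sketch. It assumes the usual textbook forms of the windows, with the cosine argument $2\pi n/M$ for a window of length $M$; the function names and the default coefficients (0.54 and 0.46 for Hamming; 0.42, 0.5, and 0.08 for Blackman) are illustrative choices and not necessarily the values used by the architecture described later.

```python
import math

def hanning(M):
    # Assumed form: 0.5 * (1 - cos(2*pi*n/M)), n = 0 .. M-1
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / M) for n in range(M)]

def hamming(M, alpha=0.54, beta=0.46):
    # Assumed form: alpha - beta * cos(2*pi*n/M), with alpha + beta = 1
    return [alpha - beta * math.cos(2 * math.pi * n / M) for n in range(M)]

def blackman(M, a0=0.42, a1=0.5, a2=0.08):
    # Assumed three-term form: a0 - a1*cos(2*pi*n/M) + a2*cos(4*pi*n/M)
    return [a0
            - a1 * math.cos(2 * math.pi * n / M)
            + a2 * math.cos(4 * math.pi * n / M)
            for n in range(M)]

if __name__ == "__main__":
    for name, w in (("hanning", hanning(16)),
                    ("hamming", hamming(16)),
                    ("blackman", blackman(16))):
        print(name, [round(v, 4) for v in w])
```

Only the two cosine terms need a trigonometric engine; the constants reduce to fixed multiplications, which is the split the architecture in Section 4 exploits.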
The hardware implementation of window functions involves trigonometric computations. The primitive technique for computing trigonometric functions uses LUTs, but this approach fails to support user-defined changes in the window length. Another popular algorithm for computing trigonometric functions is the CORDIC (coordinate rotation digital computer) algorithm. This algorithm is used in [6, 7] for efficient window implementation in hardware and to provide application-dependent changes in the window length. That design uses two serially connected CORDIC processors operating in different modes, one linear and the other circular. The CORDIC algorithm inherently suffers from latency issues, and because the design in [6, 7] operates two CORDIC processors in series, latency is its major drawback. Therefore, we redesign the CORDIC algorithm to minimize the number of iterations and hence reduce latency. Moreover, we replace one of the CORDIC processors with a Booth multiplication shift-add tree to further minimize latency and area.
2.2. CORDIC Algorithm
The conventional CORDIC algorithm [9, 10, 12, 13] has various modes of operation and trajectories. Because window functions use the CORDIC algorithm in rotation mode following a circular trajectory, we restrict our discussion to circular-rotation mode CORDIC only.
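The coordinates of two vectors $[x_0, y_0]$ and $[x_1, y_1]$ separated by an angle $\theta$ are related as
$$\begin{bmatrix} x_1 \\ y_1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}. \tag{2}$$
Equation (2) forms the basic principle for iterative CORDIC coordinate calculations [8]. The key concept for realizing rotations using the CORDIC algorithm is to express the desired rotation angle $\theta$ as an aggregation of predefined elementary angles, defined as
$$\theta = \sum_{i=0}^{N-1} \mu_i \alpha_i, \tag{3}$$
where $\mu_i \in \{1, -1\}$, $\alpha_i = \tan^{-1}(2^{-i})$, and $N$ is the word length.
The rotation matrix in its original form (2) requires determining the sine and cosine values and four multiplication operations. Factoring out the cosine term simplifies the rotation matrix (4) by converting the multiplication operations to shifts, as the tangents of the elementary angles are defined as negative powers of two (3):
$$R_i = \cos\alpha_i \begin{bmatrix} 1 & -2^{-i} \\ 2^{-i} & 1 \end{bmatrix}. \tag{4}$$
The rotation matrix in (4) is applicable for anticlockwise vector rotations. To support both clockwise and anticlockwise CORDIC rotations, the rotation matrix is altered as
$$R_i = \cos\alpha_i \begin{bmatrix} 1 & -\mu_i 2^{-i} \\ \mu_i 2^{-i} & 1 \end{bmatrix}, \tag{5}$$
where $\mu_i = 1$ for anticlockwise rotations and $\mu_i = -1$ for clockwise rotations.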
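A minimal floating-point sketch of the conventional circular-rotation CORDIC described above is given below. It assumes the standard textbook formulation: elementary angles $\tan^{-1}(2^{-i})$, a direction chosen from the sign of the residual angle, and a scale factor (the product of the cosine terms) applied once at the end. The function name and the iteration count are illustrative, and the code models the arithmetic only, not a hardware datapath.

```python
import math

def cordic_rotate(x, y, theta, iterations=16):
    """Rotate (x, y) by theta radians using conventional CORDIC micro-rotations."""
    # Scale factor: product of cos(atan(2**-i)) over all iterations.
    scale = 1.0
    for i in range(iterations):
        scale *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = theta  # residual angle still to be rotated
    for i in range(iterations):
        mu = 1 if z >= 0 else -1          # anticlockwise (+1) or clockwise (-1)
        shift = 2.0 ** -i                 # a right shift by i bits in hardware
        x, y = x - mu * y * shift, y + mu * x * shift
        z -= mu * math.atan(shift)        # elementary angle, read from a ROM in hardware

    # Scale-factor compensation step.
    return x * scale, y * scale

if __name__ == "__main__":
    c, s = cordic_rotate(1.0, 0.0, math.pi / 6)
    print(c, s)  # close to cos(pi/6), sin(pi/6)
```

The sketch makes the three costs discussed next visible: the final multiplication by `scale`, the fixed number of iterations, and the table of `atan` values.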
In its original form, the CORDIC algorithm suffers from major disadvantages such as scale-factor compensation, latency, and the need for optimal identification of micro-rotations. We propose a redesigned-scale-free CORDIC algorithm to overcome these disadvantages.
3. Redesigned-Scale-Free CORDIC Algorithm
The proposed CORDIC algorithm is an improved version of the conventional CORDIC algorithm in circular-rotation mode. The major ideas that lead to the proposed CORDIC algorithm are as follows: (i) redefine the elementary angles to eliminate the ROM required in the conventional CORDIC algorithm to store the elementary angles, (ii) extend the Taylor series approximation of the scaling-free CORDIC [13] to provide a completely scale-free solution over the entire coordinate space, and (iii) obviate the redundant CORDIC iterations using a new micro-rotation sequence identification. The existing scaling-free CORDIC [13] is outperformed by the conventional CORDIC beyond 20-bit implementations. However, since an extensive set of applications work on word lengths up to 16 bits, our aim is to redesign the scaling-free CORDIC for word lengths up to 16 bits.
3.1. Redefining the Elementary Angles
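We redefine the elementary angles used in conventional CORDIC (3) as
$$\alpha_{s_i} = 2^{-s_i}, \tag{6}$$
where $s_i$ is the shift index (the number of right shifts) used in the $i$th iteration.
The above definition of elementary angles obviates the ROM required by the conventional CORDIC algorithm to store the elementary angles.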
3.2. Coordinate Calculation Equations
We derive a new set of coordinate calculation equations by modifying the Taylor series expansion of the sine and cosine functions used in the scaling-free CORDIC [13]. Instead of the second-order approximation of the scaling-free CORDIC, we use a third-order Taylor series approximation. It is necessary to analyze various orders of Taylor series approximation before the third order is finalized for use in the coordinate equations. We compare the mean square errors in the x-coordinate and y-coordinate for various orders of approximation in Table 1. The errors are calculated from the results obtained after simulating the CORDIC processors. The rotation matrix of each CORDIC processor was obtained by substituting the corresponding order of approximation from Table 1 for the sine and cosine terms in (2).

The errors are calculated for a 16-bit word length, for angles lying in the range $[0, \pi/4]$, since for sine/cosine functions this range can be easily extended over the entire coordinate space using the octant wave symmetry. From Table 1, we conclude that the errors are of the same order for the various orders of approximation of the Taylor series expansion. Therefore, to keep the hardware complexity to a minimum, we choose the third order of approximation. Thus, with $2^{-s_i}$ denoting the elementary angle used in the $i$th iteration, the rotation matrix of the proposed CORDIC algorithm is given by
$$R_p = \begin{bmatrix} 1 - 2^{-(2s_i+1)} & -\left(2^{-s_i} - \dfrac{2^{-3s_i}}{3!}\right) \\ 2^{-s_i} - \dfrac{2^{-3s_i}}{3!} & 1 - 2^{-(2s_i+1)} \end{bmatrix}. \tag{8}$$
In order to implement the above rotation matrix using shift-add operations, we approximate $3!$ by $2^3$. With this approximation, the mean square errors in the x-coordinate and y-coordinate, again calculated for a 16-bit word length and angles in the range $[0, \pi/4]$, are of the same order as the errors in Table 1, so this approximation does not affect the accuracy. Finally, the rotation matrix of the proposed CORDIC algorithm is defined as
$$R_p = \begin{bmatrix} 1 - 2^{-(2s_i+1)} & -\left(2^{-s_i} - 2^{-(3s_i+3)}\right) \\ 2^{-s_i} - 2^{-(3s_i+3)} & 1 - 2^{-(2s_i+1)} \end{bmatrix}. \tag{9}$$
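A single scale-free micro-rotation can be sketched as follows. The sketch assumes the third-order Taylor forms described above for an elementary angle of $2^{-s}$ radians, namely $\cos \approx 1 - 2^{-(2s+1)}$ and $\sin \approx 2^{-s} - 2^{-(3s+3)}$ once $3!$ is replaced by $2^3$. The function name `micro_rotate` is hypothetical, and floating-point arithmetic stands in for the fixed-point shifts and adds.

```python
import math

def micro_rotate(x, y, s):
    """One anticlockwise micro-rotation through approximately 2**-s radians."""
    cos_approx = 1.0 - 2.0 ** -(2 * s + 1)            # 1 - a^2/2! with a = 2**-s
    sin_approx = 2.0 ** -s - 2.0 ** -(3 * s + 3)      # a - a^3/8 (3! approximated by 2**3)
    return (cos_approx * x - sin_approx * y,
            sin_approx * x + cos_approx * y)

if __name__ == "__main__":
    for s in (2, 3, 4):
        x, y = micro_rotate(1.0, 0.0, s)
        angle = 2.0 ** -s
        print(s, x - math.cos(angle), y - math.sin(angle))
```

Every coefficient is a sum of powers of two, so no multiplier and no separate scale-factor compensation stage is needed; the price is three shifted terms per coordinate instead of one.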
3.3. Determination of Highest Elementary Angle
The use of Taylor series approximation imposes a restriction on the highest elementary angle being used in CORDIC iterations [13]. This restriction ensures that the higher order terms neglected due to the order of approximation used do not affect the accuracy of the processor. For third order of approximation, fourth and subsequent higher order terms should be zero after the shift operation of CORDIC so that their role in mathematical operations is obviated.
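For a word length of $N$ bits, the $k$th-order term $2^{-k s_i}/k!$ is treated as zero once it is right-shifted by $N$ bits, which defines the shift index as
$$s_i = \left\lfloor \frac{N - \log_2(k!)}{k} \right\rfloor. \tag{10}$$
For the third order of approximation, the first neglected term has $k = 4$, and the smallest value of the shift index and the highest permissible elementary angle are given by
$$s_{\min} = \left\lfloor \frac{N - \log_2(4!)}{4} \right\rfloor, \qquad \alpha_{\max} = 2^{-s_{\min}}. \tag{11}$$
Thus, for a 16-bit word length, $s_{\min} = 2$ and the highest elementary angle permissible is $2^{-2} = 0.25$ radians.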
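The 16-bit figures can be checked with a few lines of Python. This is a sketch under the assumption that the smallest shift index is obtained by rounding $(N - \log_2 k!)/k$ down, where $k$ is the order of the first neglected Taylor term ($k = 4$ for a third-order approximation); the function name `min_shift` is hypothetical.

```python
import math

def min_shift(word_length, order=3):
    """Smallest shift index for a Taylor approximation of the given order.

    Assumption: the first neglected term has order k = order + 1, and the
    bound (word_length - log2(k!)) / k is rounded down.
    """
    k = order + 1
    return math.floor((word_length - math.log2(math.factorial(k))) / k)

if __name__ == "__main__":
    s_min = min_shift(16)
    print(s_min, 2.0 ** -s_min)  # shift index and highest elementary angle in radians
```

With these assumptions a 16-bit word gives a shift index of 2, that is, a highest elementary angle of 0.25 radians, which agrees with the three iterations of the highest elementary angle needed to cover $\pi/4$ in Section 3.5.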
3.4. MicroRotation Sequence Identification
The proposed micro-rotation sequence generation differs from conventional CORDIC micro-rotation identification in three ways. In the conventional technique, each elementary angle is used only once, whereas we allow multiple micro-rotations corresponding to the same elementary angle. Conventional CORDIC must use every elementary angle, whereas our micro-rotations are selective and depend on the angle of rotation. Further, we restrict the micro-rotations to a single direction (anticlockwise), as against bidirectional micro-rotations (clockwise and anticlockwise) in conventional CORDIC.
The micro-rotation sequence generator selects the appropriate elementary angle for the current CORDIC iteration. Using the redefined elementary angles (6), the micro-rotations can be identified using the circuit shown in Figure 1. It comprises a priority encoder and reset circuitry. The input to the micro-rotation sequence generator is the rotation angle $\theta$, represented as an $N$-bit word, where $N$ is the word length. The priorities of the encoder are hooked in reverse order, with the most significant bit having the highest priority and the least significant bit the lowest. The reset circuitry resets a bit of the input rotation angle to generate the residue angle for the next CORDIC iteration. Since the micro-rotation sequence generator produces the shift index for one CORDIC iteration, it is required in every stage of the CORDIC pipeline (the implementation of a CORDIC stage is discussed in the forthcoming sections). The micro-rotation sequence generation block handles angles in the range $[0, \pi/4]$. This range can be extended to the entire coordinate space using the octant symmetry of the sine and cosine functions [14].
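The behaviour of the priority encoder and reset circuitry can be modelled in software as below. This is only a sketch under stated assumptions: the angle is a 16-bit fraction whose bits carry weights $2^{-1}$ to $2^{-16}$, the highest elementary angle is $2^{-2}$ radians, and a leading bit of weight $2^{-1}$ is served by subtracting $2^{-2}$ (so it takes two iterations). The names `next_shift` and `shift_sequence` are hypothetical and do not come from Figure 1.

```python
WORD = 16    # assumed angle format: bit weights 2**-1 (MSB) .. 2**-16 (LSB)
S_MIN = 2    # assumed smallest shift index (highest elementary angle 2**-2 rad)

def next_shift(angle):
    """Model one stage: return (shift_index, residue), or (None, 0) when nothing is left."""
    if angle == 0:
        return None, 0
    p = WORD - angle.bit_length() + 1        # leading one has weight 2**-p (priority encoder)
    s = max(p, S_MIN)                        # never rotate by more than 2**-S_MIN
    residue = angle - (1 << (WORD - s))      # remove the rotated amount from the angle
    return s, residue

def shift_sequence(angle, max_iter=7):
    """Shift indices chosen over at most max_iter stages, plus the unprocessed residue."""
    shifts = []
    for _ in range(max_iter):
        s, angle = next_shift(angle)
        if s is None:
            break
        shifts.append(s)
    return shifts, angle

if __name__ == "__main__":
    print(shift_sequence(0b1100_0000_0000_0000))  # ([2, 2, 2], 0)
    print(shift_sequence(0b1011_1111_1111_1111))  # ([2, 2, 3, 4, 5, 6, 7], 511)
```

For bits of weight $2^{-2}$ and below, subtracting the rotated amount is the same as resetting the leading bit, which is the role of the reset circuitry; the subtraction form is used here only so that the $2^{-1}$ bit can be handled by the same code path.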
3.5. Number of Iterations
The number of iterations required to realize this range of rotation angles is decided based on (i) the maximum number of iterations of the highest elementary angle, denoted here by $n_1$, and (ii) the iterations of the other elementary angles, denoted here by $n_2$. The rotation angle is accordingly realized as up to $n_1$ micro-rotations through the highest elementary angle followed by up to $n_2$ micro-rotations through smaller elementary angles.
The maximum angle that can be handled by the micro-rotation sequence generator is $\pi/4$ radians. Therefore, no more than three iterations of the highest elementary angle (0.25 radians) are required, that is, a maximum of $n_1 = 3$ iterations is required to realize any angle of rotation in the range $[0, \pi/4]$. The remaining $n_2$ iterations determine the accuracy. To select an appropriate value of $n_2$, we simulate the CORDIC processor for a varying number of iterations; the mean square error is tabulated in Table 2. The errors in Table 2 for $n_2 = 4$ and for the next larger setting are of the same order. Therefore, to minimize the number of CORDIC iterations, we select $n_2 = 4$. We require a maximum of $n_1 + n_2 = 7$ iterations for the proposed CORDIC processor.
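The effect of the iteration budget can be explored with a small end-to-end model. The sketch below combines an assumed bit-driven shift selection (16-bit angle with bit weights $2^{-1}$ to $2^{-16}$, highest elementary angle $2^{-2}$) with the third-order shift-add rotation in which $3!$ is replaced by $2^3$; it runs in floating point, so it shows the algorithmic error only and not fixed-point rounding. The function name and parameters are hypothetical.

```python
import math

def rotate_scale_free(theta, max_iter=7, s_min=2, word=16):
    """Approximate (cos(theta), sin(theta)) for theta in [0, pi/4]."""
    angle = int(theta * (1 << word))          # quantise the angle to `word` fractional bits
    x, y = 1.0, 0.0
    for _ in range(max_iter):
        if angle == 0:
            break
        p = word - angle.bit_length() + 1     # weight 2**-p of the leading set bit
        s = max(p, s_min)                     # shift index for this micro-rotation
        angle -= 1 << (word - s)              # residue angle for the next iteration
        c = 1.0 - 2.0 ** -(2 * s + 1)         # third-order cosine term
        t = 2.0 ** -s - 2.0 ** -(3 * s + 3)   # third-order sine term with 3! ~ 2**3
        x, y = c * x - t * y, t * x + c * y
    return x, y

if __name__ == "__main__":
    for iters in (5, 6, 7, 8):
        worst = 0.0
        for k in range(1001):
            th = k / 1000 * math.pi / 4
            x, y = rotate_scale_free(th, max_iter=iters)
            worst = max(worst, abs(x - math.cos(th)), abs(y - math.sin(th)))
        print(iters, worst)
```

Sweeping `max_iter` in this way is a software analogue of the simulation used to fill Table 2, although the numbers it prints are not the table's values.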

3.6. Error Analysis
The error analysis of the proposed CORDIC algorithm is divided into two parts: (i) residue angle error and (ii) error in the coordinate values.
3.6.1. Residue Angle Error
In the proposed methodology, desired angle of rotation is expressed as where is minimum shiftindex (11), is word length.
We identify the micro-rotations by using the bit representation of the desired rotation angle. The residue angle error depends on the number of bits set in the radix-2 representation of the rotation angle and varies for different rotation angles. Therefore, we derive the worst-case angle error in the range of convergence $[0, \pi/4]$.
The maximum number of iterations is fixed for all rotation angles. An input rotation angle with an MSB-nibble value of 4′b1011 requires four iterations, while three or fewer iterations are required for other MSB-nibble values. From the second MSB nibble onwards, each bit set to 1′b1 in the radix-2 representation of the rotation angle requires one iteration; therefore, a maximum of four iterations is required if the second MSB-nibble value is 4′b1111. Since the iteration count is limited to seven, the set bits that remain unprocessed after the seventh iteration form the worst-case residue angle error. This worst-case error is specific to the rotation angle 16′b1011_1111_1111_1111; for other rotation angles the residue angle error is smaller.
3.6.2. Error in Coordinate Values
For fixed-point implementation, the error is represented in terms of bit-error position (BEP). The BEP in the x- and y-coordinates calculated using the proposed CORDIC processor is shown in Figure 2. The word length that the conventional CORDIC requires for a given BEP follows from [15]. For a BEP of 10 bits, as achieved by the proposed CORDIC algorithm, the conventional CORDIC requires a 16-bit word length. We therefore compare the proposed design with the existing design using the conventional CORDIC processor [7] for 16 bits.
4. Architecture for Implementing Window Functions
In this section, we focus on implementing the pipelined architecture to generate window functions. The length of the window function is selected by the user at run time. Currently, the architecture implements the Blackman window, but with slight modifications it can be extended to other window functions as well. In the proposed architecture, the output bit width is set to 16 bits.
Figure 3 shows the block diagram for generating the Blackman window function. The circuit consists of a theta generator unit (TGU), a window coefficient multiplier (WCM), a circular CORDIC processor (CCP), and a FIFO. The TGU generates the two angle values, which are the arguments of the two cosine terms required in the three-term Blackman window function. The WCM multiplies the input signal samples with the window constants using a shift-add tree derived from the Booth multiplication algorithm. The CCP is used for generating the cosine terms in the window function. The FIFO is used for proper synchronization between the window coefficients having cosine terms and the constants.
4.1. Circular CORDIC Processor (CCP)
The CCP is a pipelined implementation of the proposed redesigned-scale-free CORDIC algorithm discussed in Section 3. A total of seven iterations (three of the highest elementary angle and four of the other elementary angles) are required, as discussed in Section 3.5. Since each pipeline stage performs one iteration, the proposed CCP pipeline is seven stages long. Each stage (Figure 4) is a combination of three blocks: (i) the coordinate calculation unit, (ii) the shift-index calculation, and (iii) the micro-rotation sequence generation. The coordinate calculation unit implements (9) using shift-add operations. The shift-index calculation computes the necessary shifts ($2s_i + 1$ and $3(s_i + 1)$, where $s_i$ is the shift index of the stage) required by the coordinate calculation unit. The micro-rotation sequence generation is shown in Figure 1.
The complexity of the coordinate calculation unit is equal to six $N$-bit logic shifters and six $N$-bit adder/subtractors. The shift-index calculation unit requires three small adders, whose width includes the extra bits required to store the sum. Even though the coordinate calculation unit of the proposed redesigned-scale-free CORDIC is more complex than that of the conventional CORDIC [8], the overall gate count of the proposed window architecture using the proposed CCP pipeline is reduced.
4.2. Window Coefficient Multiplier (WCM)
The WCM unit multiplies the input samples with the Blackman window coefficients ($a_0$, $a_1$, and $a_2$). The shift-add tree for multiplication with $a_0$, $a_1$, and $a_2$ is derived using the Booth multiplication algorithm. In the radix-2 representation system, multiplication by 0.5 is equivalent to a single right shift. Therefore, multiplication with $a_1$ is realized using a hardwired shifter. The coefficient $a_0$ is represented in 16-bit fixed-point format as 0001_1011_0010_0000, which requires four 16-bit adders and five hardwired shifters, while $a_2$ is represented as 0000_0101_0000_0010 and requires two 16-bit adders and three hardwired shifters.
The complexity of the WCM unit is equivalent to six 16 bit adders, as hardwired shifters do not incur any hardware costs.
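The adder counts quoted for the two coefficients can be reproduced with a plain shift-add model, in which each set bit of the coefficient contributes one hardwired shift of the sample and the shifted copies are summed. The sketch below uses ordinary binary decomposition rather than Booth recoding, and it assumes a format with 14 fractional bits (under which the two bit patterns evaluate to roughly 0.4238 and 0.0782); both the format and the function name are assumptions.

```python
A0 = 0b0001_1011_0010_0000   # bit pattern quoted in the text for the first coefficient
A2 = 0b0000_0101_0000_0010   # bit pattern quoted in the text for the third coefficient
FRAC_BITS = 14               # assumed number of fractional bits

def shift_add_multiply(sample, coeff, frac_bits=FRAC_BITS):
    """Return (product, adders_used) for an integer sample and a fixed-point coefficient."""
    terms = []
    for bit in range(coeff.bit_length()):
        if (coeff >> bit) & 1:
            shift = frac_bits - bit                     # hardwired shift for this set bit
            terms.append(sample >> shift if shift >= 0 else sample << -shift)
    # Summing k shifted copies needs k - 1 adders.
    return sum(terms), max(len(terms) - 1, 0)

if __name__ == "__main__":
    sample = 1 << 14                                    # the value 1.0 in the assumed format
    print(shift_add_multiply(sample, A0))               # (6944, 4)
    print(shift_add_multiply(sample, A2))               # (1282, 2)
    print(A0 / 2 ** FRAC_BITS, A2 / 2 ** FRAC_BITS)
```

Five set bits give four adders and three set bits give two, which adds up to the six 16-bit adders stated for the WCM unit; arranged as a tree, the five-term sum is three adders deep.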
4.3. Theta Generator Unit (TGU)
The TGU generates the two angles given by where is a multiple of 2 such that .
The difference between the consecutive values of is given by
Using binomial theorem (B.T.), we simplify (15c) to the following:
Generally in most signal processing applications, not less than 16point DFT is used which implies and . Therefore, only three terms of binomial expansion are sufficient for 16 bit accuracy as follows:
For 16 bit word length and , the term always gets a right shift greater than or equal to 16. Therefore, is zero for 16 bit word length.
Figure 5 shows the block diagram representation of the TGU. The angles in the windowing function are uniformly distributed over the entire coordinate space. The CCP unit handles angles in the range $[0, \pi/4]$. Therefore, the TGU divides the entire coordinate space into octants, so that the input angle to the CCP always lies in the range $[0, \pi/4]$. The octants are distinguished as shown in Figure 6; the TGU also generates signals for proper octant mapping of the values generated by the CCP.
The TGU requires three 16-bit adders, two barrel shifters, and one encoder.
4.4. Window Generation
In Figure 7, we compare the Blackman window generated using the proposed processor with that of the MATLAB built-in function blackman().
5. FPGA Implementation and Complexity Issues
The proposed architecture is coded in Verilog, then simulated and synthesized using the Xilinx ISE 9.2i Design Suite and mapped onto a Xilinx Virtex 2Pro (XC2VP50-6FF1148) device. For the 16-bit implementation, the proposed design consumes 1800 slices and 3371 4-input LUTs, with a maximum operating frequency of 101.284 MHz. The total delay of 9.873 ns is distributed as 58.7% logic delay and 41.3% route delay. The total gate count of the proposed design is 34739.
5.1. Comparison with Existing Architecture
The CORDIC processors used in [7], both linear and circular, are designed using the conventional CORDIC algorithm. The scaling-free CORDIC [13] and the enhanced scaling-free CORDIC [16] are currently the best available hardware designs for circular CORDIC implementation. We compare our processor with three designs: (i) the existing design in [7] using the conventional circular CORDIC, (ii) the design in [7] with the conventional circular CORDIC replaced by the scaling-free CORDIC [13], and (iii) the design in [7] with the conventional circular CORDIC replaced by the enhanced scaling-free CORDIC [16]. The area complexity and latency of the proposed design and of these three variants of the existing design [7] are compared in Table 3.
 
Notes to Table 3:
1. Latency is defined in terms of the number of pipelining stages required by the design.
2. The gate count of the proposed design is 34739.
3. The latency of the proposed design is 10.
5.1.1. Area Comparison
The area of the conventional circular CORDIC processor is calculated using Xilinx CORDIC IP v3.0. The Xilinx CORDIC core is optimized for circular CORDIC computation with maximum pipelining for a 16-bit word length; its gate count is 20122. In [13], the complexity of the 16-bit scaling-free CORDIC is computed to be equivalent to 1000 1-bit full adders and 597 1-bit registers, which corresponds to approximately 16776 gates. The SFB4C architecture of the enhanced scaling-free (ESF) CORDIC [16] replaces the initial four scaling-free CORDIC iterations with conventional CORDIC iterations. Thus, the complexity of the 16-bit ESF CORDIC without scale-factor compensation is equivalent to 512 1-bit full adders and 420 1-bit registers, approximately equal to 9504 gates. The complexity of the 16-bit linear conventional CORDIC is equivalent to 512 1-bit full adders and 768 1-bit registers, approximately equal to 12288 gates. The other units required for realizing the window processor, such as the theta generator unit, FIFO, and adders, are common to the proposed and the existing designs.
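The three gate counts derived from full-adder and register counts are mutually consistent with a cost of 12 gates per 1-bit full adder and 8 gates per 1-bit register. These per-cell costs are not stated in the text; they are inferred here from the quoted totals, and the short check below is only a sketch of that arithmetic.

```python
GATES_PER_FULL_ADDER = 12   # inferred from the quoted totals, not stated in the text
GATES_PER_REGISTER = 8      # inferred from the quoted totals, not stated in the text

def gate_count(full_adders, registers):
    """Equivalent gate count for a block built from 1-bit full adders and registers."""
    return full_adders * GATES_PER_FULL_ADDER + registers * GATES_PER_REGISTER

if __name__ == "__main__":
    print("scaling-free circular CORDIC:", gate_count(1000, 597))   # 16776
    print("ESF circular CORDIC:", gate_count(512, 420))             # 9504
    print("linear conventional CORDIC:", gate_count(512, 768))      # 12288
```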
5.1.2. Latency
The throughput of all the designs is the same, that is, one data sample per clock cycle, while the latency differs and is closely related to the number of iterations in the circular and linear CORDIC processors when the designs operate at the same clock frequency. The 16-bit linear CORDIC processor uses a 16-stage pipeline. The conventional circular CORDIC processor also uses a 16-stage pipeline for a 16-bit word length. For the same word length, the scaling-free CORDIC [13] processor uses a 12-stage pipeline, while the ESF CORDIC [16] pipeline is 9 stages long. Therefore, the latency of the existing design in [7] is 32 stages with the conventional circular CORDIC, 28 stages with the scaling-free CORDIC, and 25 stages with the ESF CORDIC.
The new redesigned-CORDIC pipeline is 7 stages long (Section 4.1). The delay of the WCM unit (Section 4.2) is three adders in series, which can be considered equivalent to three linear CORDIC iterations. Hence, the total latency of the proposed design is 10 stages, which is far less than that of the existing design using the best of the available circular CORDIC hardware.
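The stage counts above can be tallied as follows. The sketch simply adds the pipeline depths quoted in this section and averages the per-design reductions; the variable names are illustrative, and averaging the three percentage reductions is one plausible reading of the "approximately 64%" figure in the abstract rather than a stated method.

```python
LINEAR_CORDIC_STAGES = 16
CIRCULAR_CORDIC_STAGES = {
    "conventional": 16,
    "scaling-free": 12,
    "ESF": 9,
}

# Existing design: linear CORDIC followed by a circular CORDIC.
existing_latency = {name: LINEAR_CORDIC_STAGES + stages
                    for name, stages in CIRCULAR_CORDIC_STAGES.items()}

# Proposed design: 7-stage circular CORDIC plus 3 adder levels in the WCM.
proposed_latency = 7 + 3

reduction_percent = {name: 100.0 * (1.0 - proposed_latency / stages)
                     for name, stages in existing_latency.items()}
average_reduction = sum(reduction_percent.values()) / len(reduction_percent)

if __name__ == "__main__":
    print(existing_latency)             # {'conventional': 32, 'scaling-free': 28, 'ESF': 25}
    print(proposed_latency)             # 10
    print(round(average_reduction, 1))  # 64.3
```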
5.1.3. Delay
The delay is the time required to generate one set of window coefficients for a given window length when the design is operating at the maximum clock frequency. The critical path for the proposed design is the TGU. The existing designs using the conventional circular CORDIC and the scaling-free circular CORDIC use the same TGU, while the existing design using the ESF-CORDIC uses a slightly less complex TGU. The TGU for the proposed design and for the existing designs using the conventional and scaling-free circular CORDIC generates angles in the range $[0, \pi/4]$, while for the existing design using the ESF-CORDIC the TGU generates angles in a different range. Therefore, the maximum clock frequency is 101.983 MHz for the existing design using the ESF-CORDIC and 101.284 MHz for the other designs, including the proposed design. Figure 8 compares the delay of the four designs for various window lengths.
6. Conclusion
In this paper, we present an area-time efficient CORDIC-based processor for realizing window functions. The architecture currently implements the Blackman window function and, with slight modification, can be extended to other window functions as well. We also propose a circular CORDIC processor for word lengths up to 16 bits. The redesigned scale-free CORDIC processor uses a third-order Taylor series approximation to realize scale-free CORDIC iterations. However, removal of the scaling factor comes at the cost of more complex coordinate calculations. The micro-rotation sequence generation is optimized using a priority encoder, which reduces the total CORDIC processor pipeline to seven stages. A shift-add tree derived using the Booth multiplication algorithm replaces the linear CORDIC processor in the original design of the window architecture. The proposed Blackman window architecture saves approximately 44.2% area and drastically reduces latency with no effect on accuracy.
References
[1] K. K. Parhi, VLSI Digital Signal Processing Systems, John Wiley & Sons, 1999.
[2] J. G. Proakis and D. G. Manolakis, Digital Signal Processing Principles, Algorithms and Applications, Prentice Hall, 3rd edition, 2006.
[3] A. M. Despain, “Fourier transform computers using CORDIC iterations,” IEEE Transactions on Computers, vol. 23, no. 10, pp. 993–1001, 1974.
[4] T. Sansaloni, A. Pérez-Pascual, and J. Valls, “Area-efficient FPGA-based FFT processor,” Electronics Letters, vol. 39, no. 19, pp. 1369–1370, 2003.
[5] M. Ayinala, M. Brown, and K. K. Parhi, “Pipelined parallel FFT architectures via folding transformation,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 6, pp. 1068–1081, 2011.
[6] K. C. Ray and A. S. Dhar, “CORDIC-based unified VLSI architecture for implementing window functions for real time spectral analysis,” IEE Proceedings: Circuits, Devices and Systems, vol. 153, no. 6, pp. 539–544, 2006.
[7] K. C. Ray and A. S. Dhar, “High throughput VLSI architecture for Blackman windowing in real time spectral analysis,” Journal of Computers, vol. 3, no. 5, pp. 54–59, 2008.
[8] J. E. Volder, “The CORDIC trigonometric computing technique,” IRE Transactions on Electronic Computers, vol. 8, no. 3, pp. 330–334, 1959.
[9] J. S. Walther, “A unified algorithm for elementary functions,” in Proceedings of the 38th Spring Joint Computer Conferences, pp. 379–385, Atlantic City, NJ, USA, 1971.
[10] P. K. Meher, J. Valls, T. B. Juang, K. Sridharan, and K. Maharatna, “50 years of CORDIC: algorithms, architectures, and applications,” IEEE Transactions on Circuits and Systems I, vol. 56, no. 9, pp. 1893–1907, 2009.
[11] J. O. Smith III, Spectral Audio Signal Processing, W3K, 2011.
[12] T. B. Juang, S. F. Hsiao, and M. Y. Tsai, “Para-CORDIC: parallel CORDIC rotation algorithm,” IEEE Transactions on Circuits and Systems I, vol. 51, no. 8, pp. 1515–1524, 2004.
[13] K. Maharatna, S. Banerjee, E. Grass, M. Krstic, and A. Troya, “Modified virtually scaling-free adaptive CORDIC rotator algorithm and architecture,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 11, pp. 1463–1474, 2005.
[14] J. Vankka, Digital Synthesizers and Transmitters for Software Radio, Springer, Dordrecht, Netherlands, 2005.
[15] K. Kota and J. R. Cavallaro, “Numerical accuracy and hardware tradeoffs for CORDIC arithmetic for special-purpose processors,” IEEE Transactions on Computers, vol. 42, no. 7, pp. 769–779, 1993.
[16] F. J. Jaime, M. A. Sánchez, J. Hormigo, J. Villalba, and E. L. Zapata, “Enhanced scaling-free CORDIC,” IEEE Transactions on Circuits and Systems I, vol. 57, no. 7, pp. 1654–1662, 2010.
Copyright
Copyright © 2012 Supriya Aggarwal and Kavita Khare. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.