In the last decade, the CORDIC algorithm has drawn wide attention from academia and industry for applications such as DSP, biomedical signal processing, software defined radio, neural networks, and MIMO systems, to mention just a few. It is an iterative algorithm that requires only simple shift and add operations for the hardware realization of basic elementary functions. Since CORDIC is used as a building block in various single chip solutions, the critical aspects to be considered are high speed, low power, and low area, for achieving reasonable overall performance. In this paper, we first classify the CORDIC algorithm based on the number system and discuss the importance of the number system in the implementation of the CORDIC algorithm. Then, we present a systematic and comprehensive taxonomy of rotational CORDIC algorithms, which are subsequently discussed in depth. Special attention is devoted to the higher radix and flat techniques proposed in the literature for reducing the latency. Finally, a detailed comparison of the various algorithms is presented, which can provide first-order information to designers looking for either further improvement of performance or selection of a rotational CORDIC for a specific application.

1. Introduction

Current research in the design of high speed VLSI architectures for real-time digital signal processing (DSP) algorithms has been directed by the advances in VLSI technology, which have given designers significant impetus for porting algorithms into architectures. Many of the algorithms used in DSP and matrix arithmetic require elementary functions such as trigonometric, inverse trigonometric, logarithm, exponential, multiplication, and division functions. The commonly used software solutions for the digital implementation of these functions are the table lookup method and polynomial expansions, which require a number of multiplications and additions/subtractions. However, digit-by-digit methods exist for the evaluation of these elementary functions, and these compute faster than the software solutions.

Some of the digit-by-digit methods for the computation of the above mentioned elementary functions were described by Henry Briggs in 1624 in “Arithmetica Logarithmica” [1, 2]. These are iterative pseudo division and pseudo multiplication processes, which resemble repeated-addition multiplication and repeated-subtraction division. In 1959, Volder proposed a special purpose digital computing unit known as the COordinate Rotation DIgital Computer (CORDIC), while building a real time navigational computer for use in an aircraft [3, 4]. This algorithm was initially developed for trigonometric functions, which were expressed in terms of basic plane rotations.

The conventional implementation of the 2D vector rotation shown in Figure 1, using the Givens rotation transform, is represented by the equations
$$x_{\text{out}} = x_{\text{in}}\cos\theta - y_{\text{in}}\sin\theta, \qquad y_{\text{out}} = x_{\text{in}}\sin\theta + y_{\text{in}}\cos\theta, \tag{1}$$
where $(x_{\text{in}}, y_{\text{in}})$ and $(x_{\text{out}}, y_{\text{out}})$ are the initial and final coordinates of the vector, respectively. The hardware realization of these equations requires four multiplications, two additions/subtractions, and access to a table stored in memory for the trigonometric coefficients. The CORDIC algorithm computes the 2D rotation using iterative equations that employ only shift and add operations. The versatility of CORDIC was enhanced when Daggett developed algorithms on the same basis to convert between binary and binary coded decimal (BCD) number representations in 1959 [5]. These iterative methods were described using decimal radix for the design of powerful small machines by Meggitt in 1962 [6]. Subsequently, Walther in 1971 [7, 8] proposed a unified algorithm to compute rotation in circular, linear, and hyperbolic coordinate systems using the same CORDIC algorithm, embedding the coordinate system as a parameter.
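As a point of reference for the iterative schemes discussed later, equation (1) can be evaluated directly. The following is a minimal sketch, not taken from the cited works; the function name `givens_rotate` is hypothetical, and the calls to `math.cos` and `math.sin` stand in for the table lookup of the trigonometric coefficients.

```python
import math

def givens_rotate(x_in, y_in, theta):
    """Rotate (x_in, y_in) by theta radians by evaluating equation (1) directly."""
    c = math.cos(theta)  # stands in for the stored cosine coefficient
    s = math.sin(theta)  # stands in for the stored sine coefficient
    # Four multiplications and two additions/subtractions, as counted in the text.
    x_out = x_in * c - y_in * s
    y_out = x_in * s + y_in * c
    return x_out, y_out
```

The four products in the last two assignments are the multiplications that CORDIC replaces with shifts and additions.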

During the last 50 years of the CORDIC algorithm, a wide variety of applications have emerged. The CORDIC algorithm received increased attention after a unified approach was proposed for its implementation [7]. Thereafter, CORDIC based computing has been the choice for scientific calculator applications; the HP-2152A co-processor, the HP-9100 desktop calculator, and the HP-35 calculator are a few such devices based on the CORDIC algorithm [1, 8]. A CORDIC arithmetic processor chip was designed and implemented to perform the various functions possible in the rotation and vectoring modes of the circular, linear, and hyperbolic coordinate systems [9]. Since then, the CORDIC technique has been used in many applications [10], such as single chip CORDIC processors for DSP applications [11–15], linear transformations [16–21], digital filters [17], [22–24], and matrix based signal processing algorithms [25, 26]. More recently, the advances in VLSI technology and the advent of EDA tools have extended the application of the CORDIC algorithm to the fields of biomedical signal processing [27], neural networks [28], software defined radio [29], and MIMO systems [30], to mention a few.

Although CORDIC may not be the fastest technique to perform these operations, it is attractive due to its potential for efficient and low cost implementation of a large class of applications. Several modifications have been proposed in the literature for the CORDIC algorithm during the last two decades to provide high performance and low cost hardware solutions for real time computation of a two dimensional vector rotation and transcendental functions.

A new type of arithmetic operation, called fast rotations or orthonormal 𝜇-rotations over a set of fixed angles, has been proposed [31]. These orthonormal 𝜇-rotations are based on the idea of CORDIC and share the property that performing the rotation requires a minimal number of shift-add operations. These fast rotation methods form a viable low cost alternative to CORDIC arithmetic for certain applications such as FIR filter banks for image processing, the generation of spherical sample rays in 3D graphics, and the computation of the eigenvalue decomposition and the singular value decomposition.

We have carried out a critical study of the different architectures proposed in the literature for 2D rotational CORDIC in the circular coordinate system, as a starting point for further latency reduction or throughput improvement. In this paper, we review the architectures proposed for rotational CORDIC. Specifically, we focus on redundant unfolded architectures that employ techniques suitable for increasing throughput and reducing latency.

The rest of the paper is organized as follows. In Section 2, the basics of redundant arithmetic are presented. In Section 3, we present a review of generalized CORDIC algorithm, radix-2 and radix-4 CORDIC algorithms. In Section 4, general architectures being employed in literature for the implementation of the CORDIC algorithm are discussed. In Section 5, the complete taxonomy of rotational CORDIC algorithms is presented. Section 6 presents the low latency nonredundant CORDIC algorithm. Sections 7–9 provide different redundant CORDIC algorithms along with the architectures being proposed in the literature for the rotational CORDIC, followed by the comparison of different methods in Section 10. Finally, conclusions are presented in Section 11.

2. Redundant Arithmetic [32, 33]

A nonredundant radix-$\rho$ number system has the digit set $\{0, 1, \ldots, \rho-1\}$, and all numbers can be uniquely represented. To avoid the carry propagation delay in addition, a redundant binary number system is employed. The two common redundant number systems employed in CORDIC arithmetic are the signed-digit (SD) [34–37] and the carry-save (CS) [38] number systems. In an SD number system for radix $\rho$, numbers are represented with the digit set $\{-\beta, -\beta+1, \ldots, -1, 0, +1, \ldots, \alpha\}$, where $\alpha \le \rho-1$ and $1 \le \beta \le \rho-1$. For a symmetric digit set, $\alpha = \beta$, and each digit $s$ of the SD number system is represented as $(s^+, s^-)$ by $(p, n)$ encoding such that $s^+ - s^- = s$. In the radix-2 SD number system, numbers are represented with the digits $\{-1, 0, 1\}$. In the carry-save number system, numbers are represented with the digit set $\{0, 1, 2\}$. It may be observed that, in both the SD and CS number systems, each number can be represented in multiple ways. The redundancy in the SD and CS number representations limits the carry propagation from each stage to its immediate more significant bit position only. In both SD and CS adders, all sum bits are generated with a delay of two full adders, independent of the word length. Hence, the application of redundant arithmetic can accelerate additions/subtractions owing to carry-free or limited carry-propagation.
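The two properties described above, multiple representations of one value and addition without a carry chain, can be illustrated with a small software model. This is a sketch under stated assumptions: the helper names `sd_value` and `carry_save_add` are hypothetical, Python integers stand in for hardware words, and the carry-save step models a row of independent full adders (a 3:2 compressor) rather than any specific adder from the cited works.

```python
def sd_value(digits):
    """Value of a radix-2 signed-digit string, most significant digit first."""
    value = 0
    for d in digits:
        if d not in (-1, 0, 1):
            raise ValueError("radix-2 SD digits must be -1, 0, or 1")
        value = 2 * value + d
    return value

def carry_save_add(a, b, c):
    """Reduce three non-negative integers to a (sum, carry) pair.

    Every bit position is handled independently, so no carry ripples
    across the word; sum + carry equals a + b + c.
    """
    partial_sum = a ^ b ^ c                        # full-adder sum bit per position
    carry = ((a & b) | (b & c) | (a & c)) << 1     # full-adder carry bit, moved one place up
    return partial_sum, carry

# Two different SD digit strings that encode the same value (redundancy).
same_value = sd_value([1, 0, -1]) == sd_value([0, 1, 1])
```

The final conversion of the (sum, carry) pair to a conventional number still needs one carry-propagate addition, which is why redundant designs defer it to the end of the computation.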

3. CORDIC Algorithm

The CORDIC algorithm involves the rotation of a vector $v$ on the $XY$-plane in circular, linear, and hyperbolic coordinate systems, depending on the function to be evaluated. Trajectories of the vector $v_i$ for successive CORDIC iterations are shown in Figure 2. It is an iterative convergence algorithm that performs a rotation using a series of specific incremental rotation angles, selected so that each iteration is performed by shift and add operations. The norm of a vector in these coordinate systems is defined as $\sqrt{x^2 + m y^2}$, where $m \in \{1, 0, -1\}$ represents a circular, linear, or hyperbolic coordinate system, respectively. The norm preserving rotation trajectory is a circle defined by $x^2 + y^2 = 1$ in the circular coordinate system. Similarly, the norm preserving rotation trajectories in the hyperbolic and linear coordinate systems are defined by $x^2 - y^2 = 1$ and $x = 1$, respectively. The CORDIC method can be employed in two different modes, namely, the rotation mode and the vectoring mode. The rotation mode is used to perform a general rotation by a given angle $\theta$. The vectoring mode computes the unknown angle $\theta$ of a vector by performing a finite number of microrotations.

3.1. Generalized CORDIC Algorithm

The generalized equations of the CORDIC algorithm for an iteration can be written as [7]
$$x_{i+1} = x_i - m\,\sigma_i\, y_i\, \rho^{-S_{m,i}}, \qquad y_{i+1} = \sigma_i\, x_i\, \rho^{-S_{m,i}} + y_i, \qquad z_{i+1} = z_i - \sigma_i\, \alpha_{m,i}, \tag{2}$$
where $\sigma_i$ represents either the clockwise or the counterclockwise direction of rotation, $\rho$ represents the radix of the number system, $m$ steers the choice of circular ($m=1$), linear ($m=0$), or hyperbolic ($m=-1$) coordinate system, $S_{m,i}$ is the nondecreasing integer shift sequence, and $\alpha_{m,i}$ is the elementary rotation angle. The latter depends directly on $S_{m,i}$ through the relation
$$\alpha_{m,i} = \frac{1}{\sqrt{m}}\tan^{-1}\left(\sqrt{m}\,\rho^{-S_{m,i}}\right). \tag{3}$$
The shift sequence $S_{m,i}$ depends on the coordinate system and the radix of the number system. $S_{m,i}$ affects the convergence of the algorithm, and the number of iterations $n$ affects the accuracy of the final result. A detailed discussion of these is presented later. The value of $\sigma_i$ depends on the radix of the number system and is determined by the following equation, assuming that the vector is in either the first or the fourth quadrant:
$$\sigma_i = \begin{cases} \operatorname{sign}(z_i), & \text{for rotation mode}, \\ -\operatorname{sign}(y_i), & \text{for vectoring mode}, \end{cases} \tag{4}$$
where $z$ and $y$ are the steering variables in rotation and vectoring mode, respectively. The required microrotations are not perfect rotations and increase the length of the vector. In order to maintain a constant vector length, the obtained results have to be scaled by the scale factor
$$K = \prod_{i} k_i, \qquad k_i = \sqrt{1 + m\,\sigma_i^2\, \rho^{-2 S_{m,i}}}, \tag{5}$$
where $k_i$ denotes the elementary scaling factor of the $i$th iteration, and $K$ is the resultant scaling factor after $n$ iterations. The computation of the scale factor and its compensation increase the computational overhead and hardware, depending on the number system employed in the CORDIC arithmetic.
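One way to read equations (2) to (4) is as a single update step parameterized by the coordinate system. The sketch below is illustrative only: the names `elementary_angle` and `cordic_step` are hypothetical, floating-point multiplication by a power of the radix stands in for a hardware shift, and the values of equation (3) for $m=0$ and $m=-1$ are written in their real-valued forms ($\rho^{-S}$ and $\tanh^{-1}\rho^{-S}$), which is an assumption about how the formula is evaluated rather than something stated in the text.

```python
import math

def elementary_angle(m, shift, radix=2):
    """Equation (3) in real-valued form for m in {1, 0, -1}."""
    t = radix ** (-shift)
    if m == 1:
        return math.atan(t)      # circular
    if m == 0:
        return t                 # linear: limit of (3) as m -> 0
    return math.atanh(t)         # hyperbolic: (1/j) * atan(j*t) = atanh(t)

def cordic_step(x, y, z, m, shift, mode="rotation", radix=2):
    """One iteration of equation (2) with the direction chosen by equation (4)."""
    if mode == "rotation":
        sigma = 1 if z >= 0 else -1      # sign(z_i)
    else:
        sigma = -1 if y >= 0 else 1      # -sign(y_i)
    t = radix ** (-shift)
    x_next = x - m * sigma * y * t
    y_next = y + sigma * x * t
    z_next = z - sigma * elementary_angle(m, shift, radix)
    return x_next, y_next, z_next
```

As a usage example, iterating the step with `m=0`, `y=0`, and shifts 1, 2, 3, ... in rotation mode drives `z` to zero and leaves `y` close to the product of the initial `x` and `z`, provided the initial `z` has magnitude below one.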

With the appropriate initial values of 𝑥, 𝑦, and 𝑧, both rotation and vectoring modes can be used to compute commonly used elementary functions [39] given in Table 1.

3.2. CORDIC Algorithm for Circular Coordinate System

We present in this section a detailed description of 2D plane rotation in the circular coordinate system, since this is used in many applications. The CORDIC algorithm calculates trigonometric functions, the rotation of a vector, and the angle of a vector by realizing a two dimensional vector rotation in the circular coordinate system. Figure 3 shows the rotation of a vector of length $M_{\text{in}}$ by a sequence of microrotations through the elementary angles $\alpha_i$. Equation (2) represents the iterative rotation by an angle $\alpha_i$ in the circular coordinate system for $m=1$ and is given by
$$x_{i+1} = x_i - \sigma_i\, y_i\, \rho^{-i}, \qquad y_{i+1} = \sigma_i\, x_i\, \rho^{-i} + y_i, \qquad z_{i+1} = z_i - \sigma_i\, \alpha_i. \tag{6}$$
The values of $\alpha_i$ are chosen such that $\tan(\alpha_i) = \rho^{-i}$, so that the multiplication by the tangent term is reduced to a simple shift operation. It may be observed that the norm of the vector in the $(i+1)$th iteration is extended compared to that in the $i$th iteration, that is, $M_{i+1} = M_i\sqrt{1+\tan^2\alpha_i}$. The increase in the magnitude of the vector over the iterations depends on the radix of the number system and the number of iterations and is represented by the scale factor $K$. The direction of each iterative rotation is determined using $z_i$ or $y_i$, depending on whether rotation mode or vectoring mode is used. The number of microrotations to be performed in both modes depends on the desired computing accuracy and can be constant for a particular computer of finite word length. The number of microrotations in turn decides the number of elementary angles. The iterative equations of the CORDIC algorithm for radix-2 and radix-4 number systems are presented in the following sections.

3.2.1. Rotation Mode

In rotation mode, the input angle $\theta$ is decomposed using a finite number of elementary angles [3]
$$\theta = \sigma_0\alpha_0 + \sigma_1\alpha_1 + \cdots + \sigma_{n-1}\alpha_{n-1}, \tag{7}$$
where $n$ indicates the number of microrotations, $\alpha_i$ is the elementary angle for the $i$th iteration, and $\sigma_i$ is the direction of the $i$th microrotation. In rotation mode, $z$ is the angle accumulator, and $z_0$ is initialized with the input rotation angle. The direction of rotation in every iteration must be chosen to reduce the magnitude of the residual angle in the angle accumulator. Therefore, the direction of rotation in any iteration is determined using the sign of the residual angle obtained in the previous iteration. The coordinates of the vector obtained after $n$ microrotations are
$$x_n = K\left(x_{\text{in}}\cos\theta - y_{\text{in}}\sin\theta\right), \qquad y_n = K\left(x_{\text{in}}\sin\theta + y_{\text{in}}\cos\theta\right), \qquad z_n \longrightarrow 0. \tag{8}$$
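The rotation mode can be sketched in a few lines for the radix-2 case ($\rho = 2$ in (6)). This is a minimal floating-point model, not a hardware description: the function name `cordic_rotate` is hypothetical, multiplication by `2.0**-i` stands in for a right shift by `i` places, and the division by `K` at the end is one simple way of compensating the scale factor that appears in (8).

```python
import math

def cordic_rotate(x_in, y_in, theta, n=32):
    """Rotate (x_in, y_in) by theta using n radix-2 microrotations.

    Returns the compensated coordinates, the residual angle z_n, and K.
    theta is assumed to lie within the convergence range of the algorithm.
    """
    x, y, z = x_in, y_in, theta
    K = 1.0
    for i in range(n):
        sigma = -1 if z < 0 else 1               # direction from the sign of the residual angle
        t = 2.0 ** -i                            # tan(alpha_i); a shift in hardware
        x, y = x - sigma * y * t, y + sigma * x * t
        z -= sigma * math.atan(t)                # alpha_i would come from a lookup table
        K *= math.sqrt(1.0 + t * t)              # elementary scale factor k_i
    return x / K, y / K, z, K

# Usage: cosine and sine of 30 degrees from the unit vector (1, 0).
cos30, sin30, residual, K = cordic_rotate(1.0, 0.0, math.pi / 6)
```

Starting from $(x_{\text{in}}, y_{\text{in}}) = (1, 0)$, the compensated outputs approximate $\cos\theta$ and $\sin\theta$, which is how the trigonometric functions are obtained from (8).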

3.2.2. Vectoring Mode

In vectoring mode, the unknown angle of a vector is determined by performing a finite number of microrotations satisfying the relation [3]
$$-\theta = \sigma_0\alpha_0 + \sigma_1\alpha_1 + \cdots + \sigma_{n-1}\alpha_{n-1}. \tag{9}$$
The vectoring mode rotates the input vector through a predetermined set of $n$ elementary angles so as to bring the $y$ coordinate of the final vector as close to zero as possible. Therefore, the direction of rotation in every iteration must be determined based on the sign of the residual $y$ coordinate obtained in the previous iteration. The coordinates obtained in vectoring mode after $n$ iterations are given by
$$x_n = K\sqrt{x_{\text{in}}^2 + y_{\text{in}}^2}, \qquad y_n \longrightarrow 0, \qquad z_n = \tan^{-1}\left(\frac{y_{\text{in}}}{x_{\text{in}}}\right). \tag{10}$$
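The vectoring mode differs from the rotation mode only in the steering variable, as the following radix-2 sketch suggests. The function name `cordic_vector` is hypothetical; the model assumes $z_0 = 0$ and an input vector in the first or fourth quadrant, and it uses floating-point arithmetic in place of shifts and a lookup table.

```python
import math

def cordic_vector(x_in, y_in, n=32):
    """Drive y to zero with n radix-2 microrotations.

    Returns (magnitude, residual y, angle), with the magnitude already
    divided by the scale factor K.
    """
    x, y, z = x_in, y_in, 0.0
    K = 1.0
    for i in range(n):
        sigma = -1 if y >= 0 else 1              # -sign(y_i): rotate towards the x axis
        t = 2.0 ** -i
        x, y = x - sigma * y * t, y + sigma * x * t
        z -= sigma * math.atan(t)                # accumulates the angle rotated so far
        K *= math.sqrt(1.0 + t * t)
    return x / K, y, z

# Usage: polar form of the vector (3, 4).
magnitude, y_residual, angle = cordic_vector(3.0, 4.0)
```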

3.2.3. Radix-2 CORDIC Algorithm

The iteration equations of the radix-2 CORDIC algorithm [7] in the rotation mode of the circular coordinate system at the $(i+1)$th step are obtained by using $\rho=2$ in (6) and are given by
$$x_{i+1} = x_i - \sigma_i\, 2^{-i} y_i, \qquad y_{i+1} = \sigma_i\, 2^{-i} x_i + y_i, \qquad z_{i+1} = z_i - \sigma_i\, \alpha_i, \tag{11}$$
where $\alpha_i = \tan^{-1}(2^{-i})$ and
$$\sigma_i = \begin{cases} -1, & \text{for } z_i < 0, \\ 1, & \text{otherwise}. \end{cases} \tag{12}$$
In order to maintain a constant vector length, the obtained results have to be scaled by the scale factor $K$ given by
$$K = \prod_{i=0}^{n-1} \sqrt{1 + 2^{-2i}}. \tag{13}$$
For radix-2 CORDIC, $K \approx 1.65$. The major drawback of the conventional CORDIC algorithm is its relatively high latency and low throughput, caused by the sequential nature of the iteration process with carry propagate addition and variable shifting in every iteration. To overcome these drawbacks, pipelined implementations have been proposed [40, 41]. However, the carry propagate addition remained a bottleneck for further throughput improvement. Two major methodologies have been employed in the literature to increase the speed of CORDIC implementations. One reduces the delay of each iteration by applying redundant arithmetic to radix-2 CORDIC [42] to eliminate carry propagate addition. The other reduces the number of iterations by increasing the radix employed for the implementation of the CORDIC algorithm [43].

The redundant radix-2 CORDIC [42] was proposed by employing redundant arithmetic. The directions of rotation $\sigma_i$ are selected from the set $\{-1, 0, 1\}$, in contrast to the set $\{-1, 1\}$ employed in the conventional CORDIC. These $\sigma_i$ values are computed by evaluating a few most significant digits of $z_i$, since determining the sign of a redundant number takes a long time. This redundant CORDIC algorithm performs no rotation extension for $\sigma_i = 0$, which affects the value of the scaling factor $K$ and makes it data-dependent. Therefore, $K$ has to be calculated for each microrotation. This calculation and correction increase the computation time and hardware.

3.2.4. Redundant Radix-4 CORDIC Algorithm

As mentioned above, the speed of a CORDIC implementation can be improved by reducing the number of iterations. The iteration equations for the radix-4 CORDIC algorithm in rotation mode at the $(i+1)$th step are derived by using $\rho=4$ in (6) and are given by
$$x_{i+1} = x_i - \sigma_i\, 4^{-i} y_i, \qquad y_{i+1} = \sigma_i\, 4^{-i} x_i + y_i, \qquad w_{i+1} = w_i - \tan^{-1}\left(\sigma_i\, 4^{-i}\right), \tag{14}$$
where $\sigma_i \in \{-2, -1, 0, 1, 2\}$. The final $x$ and $y$ coordinates are scaled by
$$K = \prod_{i \ge 0} k_i = \prod_{i \ge 0} \left(1 + \sigma_i^2\, 4^{-2i}\right)^{1/2}. \tag{15}$$
Here, the scale factor $K$ depends on the values of $\sigma_i$ and hence has to be computed in every iteration. The range of $K$ is $(1, 2.52)$ for radix-4 CORDIC. In this CORDIC, the direction of rotation is computed based on the estimated value of $w_i$ [43]. The $w$ path involves the computation of the estimated $w_i$ and the evaluation of a selection function to determine $\sigma_i$, which increases the iteration delay compared to that of radix-2. However, the number of iterations required for radix-2 CORDIC can be halved by employing the radix-4 CORDIC algorithm.
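The stated range of the radix-4 scale factor can be checked numerically from (15). The sketch below is illustrative: the function name `radix4_scale_factor` is hypothetical, and the two extreme digit sequences (all zeros and all digits of magnitude two) are chosen here only to probe the ends of the range, not because a real selection function would produce them.

```python
import math

def radix4_scale_factor(sigmas):
    """Evaluate equation (15) for a given sequence of radix-4 digits sigma_i."""
    K = 1.0
    for i, s in enumerate(sigmas):
        if s not in (-2, -1, 0, 1, 2):
            raise ValueError("radix-4 digits must lie in {-2, -1, 0, 1, 2}")
        K *= math.sqrt(1.0 + (s * s) * 4.0 ** (-2 * i))
    return K

k_low = radix4_scale_factor([0] * 16)    # no rotation at any step
k_high = radix4_scale_factor([2] * 16)   # largest digit magnitude at every step
```

Because `K` changes with the digit sequence, two different input angles generally need two different compensation constants, which is the data dependence noted in the text.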

Scale factor computation and compensation, and the convergence and accuracy aspects of the CORDIC algorithm, are presented in the following sections.

3.2.5. Scale Factor Computation

The CORDIC rotation steps change the length of the vector in every iteration, resulting in the distortion of the norm of the vector shown in Figure 3; the distortion is given by (5). In nonredundant radix-2 CORDIC, $K$ is constant since $\sigma_i = \pm 1$. However, $K$ is no longer constant for nonredundant radices higher than 2 or for redundant number systems. For radix-2, the scale factor needs to be computed for only $n/2$ iterations, as $k_i = \sqrt{1+2^{-2i}}$ becomes unity within $n$-bit precision for $i > n/2 + 1$. In redundant radix-4 CORDIC [43], the scale factor (15) is not constant. In addition, it is sufficient to compute $K$ for $n/4$ iterations, as $k_i = \sqrt{1+4^{-2i}}$ becomes unity thereafter.
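The point at which the elementary scale factor stops contributing can be located numerically. The following sketch assumes one particular criterion, namely that $k_i$ counts as unity once $k_i - 1$ falls below $2^{-n}$; the function name `first_unity_index` and this threshold are assumptions made for illustration, so the index it reports may differ by a small constant from the bounds quoted in the text.

```python
import math

def first_unity_index(n, radix=2):
    """Smallest i for which sqrt(1 + radix**(-2*i)) - 1 is below 2**(-n)."""
    threshold = 2.0 ** (-n)
    i = 0
    while math.sqrt(1.0 + float(radix) ** (-2 * i)) - 1.0 >= threshold:
        i += 1
    return i

# For 16-bit precision the index lands near n/2 for radix 2 and near n/4 for radix 4.
radix2_index = first_unity_index(16, radix=2)
radix4_index = first_unity_index(16, radix=4)
```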

3.2.6. Scale Factor Compensation

The scale factor compensation technique involves scaling the final coordinates $(x_n, y_n)$ by $1/K$. The most direct method for this scaling operation is the multiplication of $(x_n, y_n)$ by $1/K$ using the CORDIC module in linear mode [7]. However, this method requires $n$ shift and add operations, which is comparable to the computational effort of the CORDIC algorithm itself. Since $K^{-1}$ is constant for radix-2, the computational overhead can be reduced by using a CSD recoded multiplier. On average, the number of nonzero digits can be reduced to $n/3$ using the CSD representation [32], and hence the effort for multiplication using a CSD recoded multiplier is approximately one third of that required using a conventional multiplier. Further, scaling can also be implemented using a Wallace tree by fully parallelizing the multiplication, which is preferred for applications aiming for low latency at the expense of more silicon area [44].
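The reduction in nonzero digits offered by CSD recoding can be seen on the constant $1/K$ itself. The sketch below uses the standard recoding of an integer into digits $\{-1, 0, 1\}$ with no two adjacent nonzero digits; the function name `to_csd`, the 16-bit quantization, and the decimal value used for $1/K$ are illustrative assumptions rather than details from the cited works.

```python
def to_csd(value):
    """Canonical signed-digit recoding of a non-negative integer.

    Returns digits in {-1, 0, 1}, least significant first, with no two
    adjacent nonzero digits.
    """
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)   # +1 when value % 4 == 1, -1 when value % 4 == 3
            value -= d            # leaves a multiple of 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits

# Hypothetical 16-bit quantization of 1/K for radix-2 CORDIC.
inv_k_q16 = round(0.6072529350 * (1 << 16))
csd_digits = to_csd(inv_k_q16)
binary_nonzeros = bin(inv_k_q16).count("1")
csd_nonzeros = sum(1 for d in csd_digits if d != 0)
```

Each nonzero CSD digit corresponds to one shifted addition or subtraction in the constant multiplier, so `csd_nonzeros` is a rough proxy for the adder count.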

Scaling may also be done by extending the sequence of CORDIC iterations [9, 16, 17], which avoids the additional hardware required in the direct method. A comparison of several scale factor compensation techniques proposed in the literature, along with two additional methods, the additive and multiplicative decomposition approaches, is presented for radix-2 CORDIC in [44]. The presented results show that the additive technique offers a low latency solution and the multiplicative technique offers an area economical solution for applications of CORDIC employing array and pipelined architectures. An algorithm has been proposed [45] to perform scale factor compensation in parallel with the CORDIC rotation using nonredundant and redundant arithmetic, thereby eliminating the final multiplication [3] or the additional scaling iterations [9, 16, 17].

3.2.7. Convergence

The CORDIC algorithm involves the rotation of a vector so as to bring the $z$ or $y$ coordinate of the final vector as close to zero as possible for rotation or vectoring mode, respectively. The maximum angle by which the vector can be rotated depends on the shift sequence [7]. The expected results of the CORDIC algorithm are obtained if the $z$ or $y$ coordinate is driven sufficiently close to zero. In addition, $z$ or $y$ is guaranteed to be driven to zero if the initial values of the vector, $(x_{\text{in}}, y_{\text{in}}, z_{\text{in}})$ or $(x_{\text{in}}, y_{\text{in}})$, lie within the permissible range. These ranges define the domain of convergence of the CORDIC algorithm.

For $n$-bit precision, the given rotation angle can be decomposed as
$$\theta = \sum_{i=0}^{n-1} \sigma_i \alpha_i + \varphi, \tag{16}$$
where $\varphi$ is an angle approximation error such that $|\varphi| < \alpha_{n-1}$ and is negligible in practical computation [7]. This angle approximation error in rotation and vectoring mode can be computed as
$$\varphi(\text{rotation}) = \tan^{-1} z_n, \qquad \varphi(\text{vectoring}) = \tan^{-1}\left(\frac{y_n}{x_n}\right). \tag{17}$$
The magnitude of the elementary angle for the given shift sequence may be predetermined using
$$\alpha_i = \tan^{-1}\left(\rho^{-S_{m,i}}\right), \tag{18}$$
where $\rho$ is the radix of the number system. The direction of rotation $\sigma_i$ must be selected to drive $z$ or $y$ towards zero for rotation or vectoring, respectively. The range of $\sigma_i$ depends on the radix and the digit set being used for the number system. Since the number of iterations and the elementary angles to be traversed by the vector during these iterations are predetermined, the range of $\theta$ for which the CORDIC algorithm can be used, called the domain of convergence, is given by [7]
$$|\theta| \le \sum_{i=0}^{n-1} \alpha_i + \alpha_{n-1}. \tag{19}$$
The convergence range of the CORDIC algorithm can be defined for rotation mode as
$$\left|z_{\text{in}}\right| \le \sum_{i=0}^{n-1} \alpha_i + \alpha_{n-1} \tag{20}$$
and for vectoring mode as
$$\left|\tan^{-1}\left(\frac{y_{\text{in}}}{x_{\text{in}}}\right)\right| \le \sum_{i=0}^{n-1} \alpha_i + \alpha_{n-1}. \tag{21}$$
The expected final results cannot be obtained if the given initial values $x_{\text{in}}$, $y_{\text{in}}$, and $z_{\text{in}}$ do not satisfy these convergence conditions. The range of convergence of the CORDIC algorithm can be extended from $\pm\pi/2$ to $\pm\pi$ using preprocessing techniques [7, 27, 46].
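The convergence condition can be explored numerically for the radix-2 shift sequence $S_i = i$. In the sketch below, `convergence_bound` evaluates the right-hand side of (19) to (21) and `residual_angle` runs only the angle recurrence of the rotation mode; both names are hypothetical, and the two sample angles are arbitrary choices, one inside and one outside the bound.

```python
import math

def convergence_bound(n):
    """Sum of the radix-2 elementary angles plus the last one, as in (19)."""
    alphas = [math.atan(2.0 ** -i) for i in range(n)]
    return sum(alphas) + alphas[-1]

def residual_angle(theta, n):
    """Residual z_n of the rotation-mode angle recurrence after n steps."""
    z = theta
    for i in range(n):
        sigma = -1 if z < 0 else 1
        z -= sigma * math.atan(2.0 ** -i)
    return z

bound = convergence_bound(32)          # roughly 1.743 rad, a little under 100 degrees
inside = residual_angle(1.0, 32)       # |theta| within the bound: residual stays tiny
outside = residual_angle(2.0, 32)      # |theta| beyond the bound: residual does not vanish
```

For the angle outside the bound every microrotation is taken in the same direction, and the accumulator still ends well away from zero, which is the failure the preprocessing techniques are meant to prevent.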

3.3. Accuracy

The accuracy of the CORDIC algorithm is affected by two primary sources of error, namely, the angle approximation error and the rounding error. The error bounds for these two sources of error have been derived through a detailed numerical analysis of the CORDIC algorithm [47]. The approximation error and the rounding error are combined to yield the overall quantization error in the CORDIC computation. The overall quantization error can be kept within the required range by including $\log_2 n$ additional guard bits in the implementation of the CORDIC algorithm [7].

3.3.1. Angle Approximation Error

Theoretically, the rotation angle $\theta$ is decomposed into an infinite number of elementary angles, as shown in Figure 3. For practical implementation, a finite number of microrotations $n$ is considered. Hence, the input rotation angle $\theta$ can only be approximated, resulting in an angle approximation error $\varphi$ with
$$|\varphi| < \alpha_{n-1}, \tag{22}$$
where $\alpha_{n-1}$, the last elementary angle, bounds the residual angle after $n$ microrotations. Hence, the accuracy of the output of the $n$th iteration is principally limited by the magnitude of the last rotation angle.

3.3.2. Rounding Error

The second type of error, called rounding error, is due to the truncation of the CORDIC internal variables by the finite length of the storage elements. In addition, scale factor compensation also contributes to this error. In a binary code, the truncation of intermediate results after every iteration introduces a maximum rounding error of $\log_2 n$ bits. To achieve a final accuracy of 1 bit in $n$ bits, $\log_2 n$ additional guard bits must be included in the implementation of this algorithm [7].

4. CORDIC Architectures

In this section, a few architectures for mapping the CORDIC algorithm into hardware are presented. In general, the architectures can be broadly classified as folded and unfolded as shown in Figure 4, based upon the realization of the three iterative equations (6). Folded architectures are obtained by duplicating each of the difference equations of the CORDIC algorithm into hardware and time multiplexing all the iterations into a single functional unit. Folding provides a means for trading area for time in signal processing architectures. The folded architectures can be categorized into bit-serial and word-serial architectures depending on whether the functional unit implements the logic for one bit or one word of each iteration of the CORDIC algorithm.

The CORDIC algorithm has traditionally been implemented using a bit serial architecture with all iterations executed in the same hardware [3]. This slows down the computational device and hence is not suitable for high speed implementation. The word serial architecture [7, 48] is an iterative CORDIC architecture obtained by realizing the iteration equations (6). In this architecture, the shifters are modified in each iteration to produce the shift required for that iteration. The appropriate elementary angles $\alpha_i$ are accessed from a lookup table. The dominant speed factors during the iterations of the word serial architecture are the carry/borrow propagate addition/subtraction and the variable shifting operations, which render the conventional CORDIC [7] implementation too slow for high speed applications. These drawbacks were overcome by unfolding the iteration process [41, 48], so that each of the processing elements always performs the same iteration, as shown in Figure 5. The main advantage of the unfolded pipelined architecture over the folded architecture is its high throughput, which results from using hardwired shifts rather than time and area consuming barrel shifters and from the elimination of the ROM. It may be noted that the pipelined architecture improves throughput by a factor of $n$ for $n$-bit precision at the expense of increasing the hardware by a factor less than $n$.
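A word serial datapath can be mimicked in software with integers, a variable shift, and a small lookup table of elementary angles. The sketch below is a behavioral model under several assumptions that are not taken from the cited designs: the function name `cordic_fixed_point`, the choice of four guard bits for $n = 16$ (matching $\log_2 n$), and the pre-scaling of the initial $x$ by $1/K$ so that no final compensation is needed.

```python
import math

def cordic_fixed_point(theta, n=16, guard=4):
    """Radix-2 rotation mode on integers; returns (cos(theta), sin(theta)) estimates."""
    frac_bits = n + guard
    one = 1 << frac_bits
    # Lookup table of elementary angles, quantized to the working word length.
    lut = [round(math.atan(2.0 ** -i) * one) for i in range(n)]
    inv_k = 1.0
    for i in range(n):
        inv_k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x = round(inv_k * one)     # assumption: start from 1/K instead of scaling at the end
    y = 0
    z = round(theta * one)
    for i in range(n):
        # One pass through the shared datapath: a variable shift by i, then add/subtract.
        if z < 0:
            x, y, z = x + (y >> i), y - (x >> i), z + lut[i]
        else:
            x, y, z = x - (y >> i), y + (x >> i), z - lut[i]
    return x / one, y / one

cos_half, sin_half = cordic_fixed_point(0.5)
```

In an unfolded realization each value of `i` would map to its own stage, so the shift amount becomes fixed wiring and the table entry becomes a per-stage constant.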

5. CORDIC Taxonomy

The implementation of the CORDIC algorithm has evolved over the years, from conventional nonredundant to redundant forms, to suit the varying requirements of applications. The unfolded implementation with redundant arithmetic initiated the efforts to address the high latency of conventional CORDIC. Subsequently, several modifications have been proposed for the redundant CORDIC algorithm to achieve reductions in iteration delay, latency, area, and power. The evolution of the unfolded rotational CORDIC algorithms is shown in Figure 6. As this taxonomy is fairly rich, the remainder of the review presents it in a top-down manner.

CORDIC is broadly classified as nonredundant CORDIC and redundant CORDIC based on the number system being employed. The major drawback of the conventional CORDIC algorithm [3, 7] was its low throughput and high latency, caused by the carry propagate adder used for the implementation of the iterative equations. This was at odds with the simplicity and novelty of the CORDIC algorithm and attracted the attention of several researchers, who set out to devise methods to increase the speed of execution. The obvious solution is to reduce the time for each iteration, the number of iterations, or both. Redundant arithmetic has been employed to reduce the time for each iteration of the conventional CORDIC. In the following sections, we analyze and present the features of different pipelined and nonpipelined unfolded implementations of the rotational CORDIC.

6. Low Latency Nonredundant Radix-2 CORDIC [49]

A significant improvement to the conventional rotational CORDIC algorithm in the circular coordinate system was proposed in [50], employing a linear approximation to the rotation when the remaining angle is small. This remaining angle $\theta_r$ is chosen such that the first order Taylor series approximations $\sin\theta_r \approx \theta_r$ and $\cos\theta_r \approx 1$ may be employed. The architecture for the implementation of this algorithm using nonredundant arithmetic is presented in [49]. The iteration equations of this algorithm for the first $n/2+1$ microrotations are the same as those of the conventional CORDIC algorithm (11). The $\sigma_i$ values for the first $n/3$ iterations are determined iteratively using the sign of the angle accumulator $z_i$. The rotation directions from iteration $n/3+1$ onwards can be generated in parallel, since the arctangent values of the conventional circular CORDIC approach the radix-2 coefficients progressively for increasing values of the CORDIC iteration index, as evident from the expression
$$\lim_{k \to \infty} \frac{\tan^{-1}\left(2^{-k}\right)}{2^{-k}} = 1. \tag{23}$$
For the range of iterations $(n/3+1) \le i \le (n/2+1)$, all $\sigma_i$ values are determined from the recoded representation of the remaining angle $z_{(n/3+1)}$. These $\sigma_i$ values are used to obtain $z_{(n/2+1)}$ from $z_{(n/3+1)}$. For $i > (n/2+1)$, the CORDIC microrotations are replaced by a single rotation using the remaining angle $z_{(n/2+1)}$. Thus, (11) is modified as
$$x_f = x_{(n/2+2)} = k_{(n/2+1)}\left[x_{(n/2+1)} - \theta_r\, y_{(n/2+1)}\right], \qquad y_f = y_{(n/2+2)} = k_{(n/2+1)}\left[\theta_r\, x_{(n/2+1)} + y_{(n/2+1)}\right], \tag{24}$$
where $\theta_r = z_{(n/2+1)}$, $k_{(n/2+1)}$ is the scale factor in the $(n/2+1)$th iteration, and $(x_f, y_f)$ are the scaled final coordinates.
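Both ingredients of this scheme, the limit in (23) and the single final rotation with $\sin\theta_r \approx \theta_r$ and $\cos\theta_r \approx 1$, can be checked with a short floating-point experiment. The sketch is a simplification and not the architecture of [49] or [50]: the names `arctan_ratio` and `rotate_with_linear_tail` are hypothetical, every direction in the first $n/2+1$ steps is found sequentially (the parallel prediction from the recoded angle is not modeled), and the scale factor is compensated by dividing by the accumulated $K$.

```python
import math

def arctan_ratio(k):
    """Ratio atan(2**-k) / 2**-k, which tends to 1 as k grows (equation (23))."""
    t = 2.0 ** -k
    return math.atan(t) / t

def rotate_with_linear_tail(x_in, y_in, theta, n=32):
    """n/2 + 1 conventional microrotations followed by one linearized rotation."""
    x, y, z = x_in, y_in, theta
    K = 1.0
    for i in range(n // 2 + 1):
        sigma = -1 if z < 0 else 1
        t = 2.0 ** -i
        x, y = x - sigma * y * t, y + sigma * x * t
        z -= sigma * math.atan(t)
        K *= math.sqrt(1.0 + t * t)
    # Remaining angle is small: rotate once using sin(z) ~ z and cos(z) ~ 1.
    x, y = x - z * y, y + z * x
    return x / K, y / K

cos_est, sin_est = rotate_with_linear_tail(1.0, 0.0, 0.8)
```

The final step costs two multiplications by the remaining angle, which is the price paid for removing roughly half of the sequential microrotations.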

Scale Factor
The low latency nonredundant radix-2 CORDIC algorithm achieves a constant scale factor, since $\sigma_i \in \{-1, 1\}$, and performs the scale factor compensation concurrently with the computation of the $x$ and $y$ coordinates, using two multipliers in parallel [49]. This is in contrast to the two multiplications in series required by the algorithm in [50].

7. Constant Scale Factor Redundant Radix-2 CORDIC

Redundant radix-2 CORDIC methods can be classified as variable and constant scale factor methods based on the dependence of the scale factor on the input angle. In redundant radix-2 CORDIC [42], $\sigma_i \in \{-1, 0, 1\}$ and hence the scale factor $K$ is data-dependent. Therefore, $K$ has to be calculated for each microrotation. This calculation and correction increase the computation time and hardware. Several redundant CORDIC algorithms with constant scale factor are available in the literature [51–53] to address the data dependency of the scale factor, as shown in Figure 7. In these methods, the iterative rotations of a point around the origin on the $XY$-plane are considered (see Figure 1). The direction of each rotation depends on the sign of the steering variable $z_i$, which represents the remaining angle of rotation. Since computing the sign of a redundant number requires more time, an estimated value of $z_i$, denoted $\hat{z}_i$, is used to determine the direction of rotation. The estimated value is computed from the three most significant digits of $z_i$. A constant scale factor is achieved by restricting $\sigma_i$ to the set $\{-1, 1\}$, thus facilitating a faster implementation. The constant scale factor methods can be classified, based on the arithmetic employed, as redundant radix-2 CORDIC with signed digit arithmetic and with carry save arithmetic (see Figure 7).

Scale Factor
The scale factor need not be computed for the implementation of all the constant scale factor techniques discussed in this section. In these methods, no specific scale factor compensation technique is considered. It may be noted that a specific compensation technique can be considered depending on the application.

7.1. Constant Scale Factor Redundant CORDIC Using SD Arithmetic

The redundant radix-2 CORDIC using SD arithmetic can be further classified based on the technique employed to achieve constant scale factor (see Figure 7). These methods are implemented using the basic CORDIC iteration recurrences (11) with necessary transformations.

7.1.1. Double Rotation Method [51]

The double rotation method performs two rotation-extensions for each elementary angle during the first 𝑛/2 iterations for 𝑛 bit precision to achieve constant scale factor independent of the operand. One rotation extension is performed for every elementary angle for iterations greater than 𝑛/2. A negative rotation is performed by two negative subrotations, and a positive rotation by two positive subrotations. A nonrotation is performed by one negative and one positive subrotation. Hence, 50% additional iterations are required compared to the redundant CORDIC [42].
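The reason the double rotation method yields a constant scale factor can be seen from the vector length alone: a subrotation stretches the vector by the same amount whichever way it turns. The sketch below illustrates this for one iteration; the function names are hypothetical, and `t` is a stand-in for the tangent of the subrotation angle, whose exact value per iteration is left unspecified here because it is not given in the text above.

```python
import math

def sub_rotation(x, y, direction, t):
    """One rotation-extension by the angle with tangent t, in the given direction."""
    return x - direction * t * y, y + direction * t * x

def double_rotation(x, y, sigma, t):
    """Realize sigma in {-1, 0, 1} with two subrotations of equal magnitude."""
    first, second = {1: (1, 1), 0: (1, -1), -1: (-1, -1)}[sigma]
    x, y = sub_rotation(x, y, first, t)
    return sub_rotation(x, y, second, t)

def gain(sigma, t, x=3.0, y=4.0):
    """Ratio of output length to input length for one double rotation."""
    xo, yo = double_rotation(x, y, sigma, t)
    return math.hypot(xo, yo) / math.hypot(x, y)

# The stretch equals 1 + t*t for every choice of sigma, including the nonrotation.
gains = [gain(sigma, 0.25) for sigma in (-1, 0, 1)]
```

For `sigma = 0` the two subrotations cancel in angle, so the vector keeps its direction and is only stretched, which is what removes the data dependence from the scale factor.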

7.1.2. Correcting Rotation Method [51]

This is another method proposed to achieve constant scale factor for the computation of sine and cosine functions. This method avoids the rotation corresponding to 𝜎𝑖=0 and performs one rotation extension in every iteration depending on 𝑧𝑖. Further, extra rotation extensions are performed at fixed intervals to correct the error introduced by avoiding 𝜎𝑖=0 and to assure convergence. If 𝑏 fractional bits are used to estimate 𝑧𝑖, the interval between correcting iterations should be less than or equal to (𝑏−2) [54]. This method also requires 50% additional iterations if three or four most significant digits are used for sign estimation. The increase in latency of rotational CORDIC due to the double rotation and correcting rotation methods is reduced by the branching algorithm [52].

7.1.3. Branching Method [52]

This method implements the CORDIC algorithm using SD arithmetic, restricting the direction of rotations 𝜎𝑖 to ±1, without the need for extra rotations. This requires two modules in parallel to perform two conventional CORDIC iterations, such that the correct result is retained at the end of each iteration. The two modules perform the rotation in the same direction if the sign of the corresponding 𝑧𝑖 can be determined. Otherwise, branching is performed by making one CORDIC module (𝑧+) perform the rotation with 𝜎𝑖=+1 and the other module (𝑧−) perform the rotation with 𝜎𝑖=−1. The direction of the next rotation is decided by the sign of the 𝑧𝑖 module whose value is smaller. In every iteration 𝑖, the angle accumulator (𝑧+ or 𝑧−) computes the remaining angle (𝑧+𝑖 or 𝑧−𝑖) to determine the direction of rotation for the next iteration. The direction of rotation is determined by examining a window of three digits of 𝑧+𝑖 or 𝑧−𝑖.

The disadvantage of the branching method is the necessity of performing two conventional CORDIC iterations in parallel, which requires an almost twofold effort in terms of implementation complexity. In addition, one of the modules is not utilized when branching does not take place. However, this method offers a faster implementation than the double rotation and correcting rotation methods [51], since it does not require additional iterations to achieve constant scale factor.

7.1.4. Double Step Branching Method [53]

The double step branching method enhances the performance of the branching algorithm by improving the utilization of hardware. This method determines two distinct 𝜎𝑖 values in each step with some additional hardware compared to the branching method, where the two modules do different computations only when branching takes place. The double step branching method determines the two directions of rotation by examining the six most significant digits to do a double step. These six digits are divided into two subgroups of three digits each, and each subgroup is handled in parallel to generate the required 𝜎𝑖 using zeroing modules (𝑧 path). Although the double stepping method introduces a small hardware overhead compared to the branching method, it is better than the latter since it increases the utilization of the 𝑥/𝑦 rotator modules.

7.2. Constant Scale Factor Redundant CORDIC Using CS Arithmetic

It is worth discussing here one more classification related to constant scale factor redundant radix-2 CORDIC (see Figure 7). The implementation of redundant CORDIC with constant scale factor using signed arithmetic results in an increase in the chip area [51–53] and latency [51] by at least 50% compared to redundant radix-2 CORDIC [42]. Low latency CORDIC algorithm [55] and differential CORDIC algorithm [56, 57] with constant scale factor using CS arithmetic have been proposed to reduce this overhead, the details of which are discussed below.

7.2.1. Low Latency Redundant CORDIC [55]

This algorithm is proposed to reduce the latency of redundant CORDIC [51] by subdividing the 𝑛 iterations into different groups and using different techniques for each of these groups. For all the iterations, if 𝜎𝑖=±1, the conventional iteration equations (11) are used. This method avoids 𝜎𝑖=0 for iterations 0≤𝑖≤(𝑛−3)/4 and employs the correcting rotation method [51]. For iterations (𝑛−3)/4<𝑖≤(𝑛+1)/2, 𝜎𝑖=0 is considered a valid choice. Since $k_i = \sqrt{1+2^{-2i}} = 1+2^{-2i-1}$ holds within 𝑛-bit precision for this group of iterations, the vector is not rotated for 𝜎𝑖=0. However, the length of the vector is increased by the scale factor for that iteration, as the final coordinates are scaled assuming constant scale factor. For the iterations 𝑖>(𝑛+1)/2, no correcting factor is required as the scale factor becomes unity.
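The iteration boundary (𝑛−3)/4 can be checked numerically. The following is a small sketch, assuming double-precision floats are accurate enough for the word lengths tried; the helper name is hypothetical. It searches for the first index at which the square-root scale factor and its one-shift approximation agree to within 2^−𝑛.

```python
import math

def first_index_within_precision(n):
    """Smallest i with |sqrt(1 + 2^-2i) - (1 + 2^-(2i+1))| < 2^-n."""
    i = 0
    while True:
        exact = math.sqrt(1 + 2.0 ** (-2 * i))
        approx = 1 + 2.0 ** (-2 * i - 1)   # single shift-and-add form
        if abs(exact - approx) < 2.0 ** -n:
            return i
        i += 1

if __name__ == "__main__":
    for n in (16, 24, 32):
        print(n, first_index_within_precision(n), (n - 3) / 4)
```

For these word lengths the first usable index is the smallest integer above (𝑛−3)/4, which agrees with the grouping of iterations described above.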

7.2.2. DCORDIC [56]

In the sign estimation methods [51–53], half of the computational effort in the 𝑥/𝑦/𝑧 data paths of rotational CORDIC is required to allow for the correction of possible errors, as the sign estimation is not entirely perfect. This problem is reduced by the high speed bit-level pipelining technique with CS arithmetic proposed in [57]. This algorithm involves the transformation of the conventional CORDIC iteration equations (11) into partially fixed iteration equations, given by

$$\left|z_{i+1}\right| = \bigl|\,\left|z_i\right| - \alpha_i\,\bigr|, \qquad x_{i+1} = x_i - \operatorname{sign}(z_i)\,2^{-i}\,y_i, \qquad y_{i+1} = \operatorname{sign}(z_i)\,2^{-i}\,x_i + y_i. \tag{25}$$

It is clear from these expressions that the computation of 𝑥 and 𝑦 requires the actual sign of 𝑧𝑖, while the angle accumulator requires only the absolute value of 𝑧𝑖. The actual sign of 𝑧𝑖 (𝜎𝑖) can be determined by taking into account the initial sign of 𝑧0 and providing information about sign changes during the absolute value computation of 𝑧𝑖. Similarly, all 𝜎𝑖 values are computed recursively. Later, this technique was implemented with SD arithmetic and proposed as the Differential CORDIC (DCORDIC) algorithm [56]. Since the sign calculation of the steering variable (𝑧𝑖) during absolute value computation takes a long time, a most significant digit first absolute value technique is employed. This technique replaces the word level sign dependence by a bit level dependence, reducing the overall computation time. A bit level pipelined architecture is proposed to implement these transformed iteration sequences, thus allowing high operational speed.
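The recursive sign recovery can be sketched in a few lines of floating-point Python. This is only an illustration of the recurrence on |𝑧𝑖|, not of the bit-level pipelined hardware; the function names are hypothetical and sign(0) is taken as +1 by assumption.

```python
import math

def sign(v):
    return 1 if v >= 0 else -1

def conventional_directions(theta, n):
    """Directions taken from the word-level sign of the signed residual z_i."""
    z, sigmas = theta, []
    for i in range(n):
        s = sign(z)
        sigmas.append(s)
        z -= s * math.atan(2.0 ** -i)
    return sigmas

def differential_directions(theta, n):
    """Directions rebuilt from sign(z_0) and the sign changes of |z_i| - alpha_i."""
    s, mag, sigmas = sign(theta), abs(theta), []
    for i in range(n):
        sigmas.append(s)
        diff = mag - math.atan(2.0 ** -i)
        s *= sign(diff)   # the sign of z flips only when |z_i| - alpha_i >= 0 fails to hold the old sense
        mag = abs(diff)   # |z_{i+1}| = ||z_i| - alpha_i|
    return sigmas

if __name__ == "__main__":
    print(conventional_directions(0.3, 8))
    print(differential_directions(0.3, 8))
```

Both routines produce the same direction sequence for generic angles; they can differ only in the tie case where a residual is exactly zero, where either direction is acceptable.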

8. Higher Radix Redundant CORDIC

As mentioned earlier, throughput and latency are important performance attributes in CORDIC based systems. The various radix-2 CORDIC algorithms presented so far may be used to reduce the iteration delay, thereby improving the throughput, with constant scale factor. Higher radix CORDIC algorithms using SD arithmetic [54, 58] and CS arithmetic [43, 59] are proposed to address latency reduction. This is possible, since higher radix representation reduces the number of iterations. The classification of redundant CORDIC algorithms proposed in the literature based on the radix of the number system is shown in Figure 8. The application of radix-4 rotations in the CORDIC algorithm was initially proposed in [54] to accelerate the radix-2 algorithm.

Scale factor need not be computed for the constant scale factor algorithms to be discussed in this section. Since no specific scale factor compensation technique is considered for these methods, a compensation technique can be considered depending on the application.

8.1. Pipelined Radix-4 CORDIC [58]

The generalized CORDIC algorithm for any radix in three coordinate systems, and its implementation in the rotation mode of the circular coordinate system using a radix-4 pipelined CORDIC processor, are presented in [58]. This algorithm performs two successive radix-2 microrotations with the same microrotation angle using the iteration equations

$$\begin{aligned} x_{i+1} &= x_i - \left(\sigma_{i,1}+\sigma_{i,2}\right)4^{-i}\,y_i - \sigma_{i,1}\sigma_{i,2}\,4^{-2i}\,x_i,\\ y_{i+1} &= \left(\sigma_{i,1}+\sigma_{i,2}\right)4^{-i}\,x_i + y_i - \sigma_{i,1}\sigma_{i,2}\,4^{-2i}\,y_i,\\ z_{i+1} &= z_i - \left(\sigma_{i,1}+\sigma_{i,2}\right)\alpha_i, \end{aligned} \tag{26}$$

where $\sigma_{i,1}$ and $\sigma_{i,2}$ are two redundant radix-2 coefficients that decompose the radix-4 coefficient 𝜎𝑖∈{−2,−1,0,+1,+2}, satisfying the relation $\sigma_i = \sigma_{i,1}+\sigma_{i,2}$. The value of 𝛼𝑖 is selected as $\alpha_0 = 2^{-1}$ and $\alpha_i = 4^{-i}$ for 1≤𝑖≤𝑛−1. The selection function for 𝜎𝑖 is determined using the five most significant digits of the 𝑧-coordinate, ensuring the convergence of this algorithm. This algorithm is designed using SD arithmetic and requires two adders/subtractors for each stage of the 𝑥/𝑦 data path, in contrast to the one adder/subtractor required in radix-2 CORDIC [42], for 𝑖<𝑛/4. However, the number of additions required is reduced during the last 𝑛/4 stages.

Scale Factor Computation
The scale factor 𝐾 in the radix-4 CORDIC algorithm is variable, since 𝜎𝑖 takes values from the digit set {−2,−1,0,+1,+2}. 𝐾 is computed in each iteration using a combinational circuit that realizes the expression

$$K = \prod_{i=0}^{n/2-1} k_i = \prod_{i=0}^{n/2-1}\left(1+\left|\sigma_{i,1}\right|4^{-2i}\right)^{1/2}\left(1+\left|\sigma_{i,2}\right|4^{-2i}\right)^{1/2}. \tag{27}$$
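As a sanity check on the 𝑥/𝑦 recurrence and the per-iteration scale factor of this section, the sketch below compares one combined radix-4 step with two successive microrotations that share the shift 4^−𝑖, and compares the resulting vector gain with the product form of 𝑘𝑖. It is a floating-point illustration with hypothetical function names, not a model of the SD hardware.

```python
import math

def radix4_step(x, y, s1, s2, i):
    """One combined iteration with radix-4 digit sigma_i = s1 + s2."""
    t = 4.0 ** -i
    x_new = x - (s1 + s2) * t * y - s1 * s2 * t * t * x
    y_new = (s1 + s2) * t * x + y - s1 * s2 * t * t * y
    return x_new, y_new

def two_radix2_steps(x, y, s1, s2, i):
    """Two successive microrotations using the same shift 4^-i."""
    t = 4.0 ** -i
    for s in (s1, s2):
        x, y = x - s * t * y, y + s * t * x
    return x, y

def stage_gain(s1, s2, i):
    """k_i = (1 + |s1| 4^-2i)^(1/2) * (1 + |s2| 4^-2i)^(1/2)."""
    t2 = 4.0 ** (-2 * i)
    return math.sqrt(1 + abs(s1) * t2) * math.sqrt(1 + abs(s2) * t2)

if __name__ == "__main__":
    for s1, s2 in ((1, 1), (1, 0), (0, 0), (1, -1)):
        x, y = radix4_step(1.0, 0.0, s1, s2, 1)
        print((s1, s2), math.hypot(x, y), stage_gain(s1, s2, 1))
```

The pairs (0, 0) and (1, −1) both encode 𝜎𝑖 = 0 but give different gains in this sketch, which illustrates why the scale factor depends on the digits actually selected.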

8.2. Redundant Radix 2-4 CORDIC [59]

The number of rotations in a redundant radix-2 CORDIC rotation unit is reduced by about 25% by expressing the direction of rotations in radix-2 and radix-4 [54]. This algorithm employs different modified CORDIC algorithms using CS arithmetic for different subsets of iterations. For the iterations 1≤𝑖<𝑛/4, the nonredundant radix-2 CORDIC algorithm with 𝜎𝑖∈{−1,1} is employed. For 𝑛/4≤𝑖≤(𝑛/2+1), the correcting iteration method [51] is employed. For 𝑖>(𝑛/2+1), the redundant radix-4 CORDIC algorithm is employed, thus halving the number of iterations. A unified architecture is proposed for the implementation of this algorithm to operate in the rotation/vectoring modes of the circular and hyperbolic coordinate systems.

Scale Factor Computation
This algorithm achieves constant scale factor, since the rotation corresponding to 𝜎=0 is avoided for 𝑖≤𝑛/2+1. For 𝑖>𝑛/2+1, the scale factor need not be computed, as $k_i = \sqrt{1+4^{-2i}} \approx 1$.

8.3. Radix-4 CORDIC [43]

A redundant radix-4 CORDIC algorithm is proposed using CS arithmetic, to reduce the latency compared to redundant radix-2 CORDIC [42]. This algorithm (14) computes the 𝜎𝑖 values using two different techniques. For the microrotations in the range 0≤𝑖<(𝑛/6), 𝜎𝑖 is determined sequentially using the angle accumulator. For the microrotations in the range 𝑖≥(𝑛/6), the 𝜎𝑖 values are predicted from the remaining angle after the first 𝑛/6 iterations [60]. Thus, the complexity of the 𝑤 path is 𝑛/6, compared to 𝑛 in the other architectures [42–53] presented in the previous sections. For the range 0≤𝑖<(𝑛/6), the microrotations are pipelined in two stages to increase the throughput. A 32-bit pipelined architecture is proposed for the implementation of the radix-4 CORDIC algorithm using CS arithmetic.

Scale Factor Computation
The possible scale factors are precomputed and stored in a ROM. The number of possible scale factors for $\sigma_i^2 \in \{0,1,4\}$ is $3^{n/4+1}$. The size of the ROM and its access time increase with 𝑛. Hence, the scale factors for some iterations are stored in ROM, and these values are used to compute the scale factor for the remaining iterations with combinational logic. This logic is designed by realizing the first few terms of the Taylor series expansion of the scale factor. For this redundant radix-4 implementation, the number of iterations is reduced at the expense of additional hardware for computing the scale factor.

9. Parallel CORDIC Algorithms

The CORDIC algorithms discussed so far have represented 𝜃 using a set of elementary angles 𝛼𝑖, called the arc tangent radix set [3],

$$\theta = \sigma_0\alpha_0 + \sigma_1\alpha_1 + \cdots + \sigma_{n-1}\alpha_{n-1}, \tag{28}$$

where $\alpha_i = \tan^{-1}(2^{-i})$ and 𝜎𝑖∈{−1,1}, satisfying the convergence theorem [7]

$$\alpha_i - \sum_{j=i+1}^{n-1}\alpha_j < \alpha_{n-1}, \tag{29}$$

in contrast to the representation using a normal radix

$$\theta = \sigma_0 2^{0} + \sigma_1 2^{-1} + \cdots + \sigma_{n-1}2^{-n+1}. \tag{30}$$

The direction of rotation 𝜎𝑖 for the 𝑖th iteration is determined after computing the (𝑖−1) iterations sequentially. It is evident from this sequential dependence of the radix system that the speed of the CORDIC algorithm can be improved by avoiding the sequential behavior in the computation of the 𝜎𝑖 values or of the 𝑥/𝑦 coordinates. The various redundant CORDIC algorithms proposed in the literature employing either one or both of these techniques are shown in Figure 9 and are discussed in the following sections.
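The convergence condition on the arc tangent radix set can be checked numerically. The sketch below assumes the inequality is meant for 𝑖 = 0, …, 𝑛−2 (for 𝑖 = 𝑛−1 the sum is empty and the strict inequality cannot hold); the function names are hypothetical. It also runs the sequential greedy decomposition to show the residual angle staying within the last elementary angle.

```python
import math

def atr_set(n):
    """Elementary angles alpha_i = atan(2^-i), i = 0..n-1."""
    return [math.atan(2.0 ** -i) for i in range(n)]

def satisfies_convergence(alphas):
    """Check alpha_i - sum_{j>i} alpha_j < alpha_{n-1} for i = 0..n-2."""
    last = alphas[-1]
    return all(alphas[i] - sum(alphas[i + 1:]) < last
               for i in range(len(alphas) - 1))

def residual_angle(theta, alphas):
    """Sequential decomposition: pick sigma_i from the sign of the residual."""
    z = theta
    for a in alphas:
        z -= a if z >= 0 else -a
    return z

if __name__ == "__main__":
    alphas = atr_set(16)
    print(satisfies_convergence(alphas), sum(alphas))
    print(residual_angle(1.0, alphas), alphas[-1])
```

The loop in `residual_angle` is exactly the sequential dependence discussed above: each direction is known only after the previous residual has been formed.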

9.1. Low Latency Radix-2 CORDIC [55]

The low latency parallel radix-2 CORDIC architecture presented for the rotation mode [55] predicts the 𝜎𝑖's by eliminating the sequential dependency of the 𝑧 path. In order to minimize the prediction error, directions are predicted for a group of iterations at a time rather than for all iterations together. This architecture does not allow rotation for index 𝑖=0. Hence, the convergence range of this architecture is less than (−𝜋/2,+𝜋/2). On the other hand, the requirement of redundant to binary conversion of intermediate results in the 𝑧 path restricts the pipelined implementation of this architecture. In order to reduce the latency of this parallelizing scheme further, a termination algorithm and the Booth encoding method have been proposed.

9.2. P-CORDIC [61]

The sequential procedure in the computation of the direction of rotations of the CORDIC algorithm is eliminated by the P-CORDIC algorithm, while maintaining a constant scale factor. This algorithm precomputes the direction of the microrotations before the actual CORDIC rotation starts iteratively in the 𝑥/𝑦 path. This is obtained by deriving a relation between the constructed binary representation of the direction of rotations 𝑑 and the rotation angle 𝜃 [40, 62], given by

$$\sigma = 0.5\,\theta + 0.5\,c_1 + \operatorname{sign}(\theta)\,\epsilon_0 + \delta, \tag{31}$$

where $c_1 = 2 - \sum_{i=0}^{\infty}\left(2^{-i} - \tan^{-1}(2^{-i})\right)$, $\delta = \sum_{i=1}^{n/3}\sigma_i\epsilon_i$, $\epsilon_0 = 1 - \tan^{-1}(1)$, and $\epsilon_i = 2^{-i} - \tan^{-1}(2^{-i})$. Here, 𝛿 is computed using the partial offset 𝜖𝑖 and the corresponding direction bit 𝜎𝑖 for the first 𝑛/3 iterations only, since 𝜖𝑖 decreases by a factor of about 8 per iteration and is negligible beyond 𝑛/3 iterations. The directions of rotation for any input angle 𝜃 are obtained in binary form by realizing this expression, taking the variable offset 𝛿 from ROM. The unfolded architecture proposed for the implementation of this algorithm eliminates the 𝑧 path and reduces the area of the implementation. This architecture achieves latency and hardware reduction over the radix-2 unfolded parallel architecture [55].
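The behavior of the partial offsets 𝜖𝑖 = 2^−𝑖 − tan^−1(2^−𝑖) can be illustrated numerically. The following sketch, with hypothetical helper names and double-precision arithmetic as an assumption, prints the ratio of consecutive offsets and the first index at which the offset drops below 2^−𝑛.

```python
import math

def epsilon(i):
    """Partial offset eps_i = 2^-i - atan(2^-i)."""
    return 2.0 ** -i - math.atan(2.0 ** -i)

def first_negligible_index(n):
    """Smallest i with eps_i < 2^-n."""
    i = 0
    while epsilon(i) >= 2.0 ** -n:
        i += 1
    return i

if __name__ == "__main__":
    for i in range(3, 9):
        print(i, epsilon(i) / epsilon(i + 1))   # ratio of consecutive offsets
    for n in (16, 32):
        print(n, first_negligible_index(n), n / 3)
```

Since 𝜖𝑖 behaves like 2^−3𝑖/3 for small arguments, the offsets fall below the 𝑛-bit resolution at an index close to 𝑛/3, so only the leading offsets need to be looked up.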

Scale Factor
The scale factor in the implementation of the P-CORDIC algorithm remains constant, as 𝜎𝑖∈{−1,1} is generated for the implementation of the 𝑥/𝑦 path. The scale factor compensation is implemented using the constant factor multiplication technique, as discussed in Section 3.2.6.

9.3. Hybrid CORDIC Algorithm

For an 𝑛-bit fixed point CORDIC processor in the circular coordinate system, nearly 𝑛/3 iterations must be computed sequentially. This holds for both the generation of directions and the rotations themselves, if accuracy is not to be affected [60]. The rotation directions for the last 2𝑛/3 iterations can be generated in parallel, since the conventional circular ATR values approach the radix-2 coefficients progressively with increasing iteration index, that is,

$$\lim_{k\to+\infty}\frac{\tan^{-1}\left(2^{-k}\right)}{2^{-k}} = 1. \tag{32}$$

This behavior is exploited by the hybrid CORDIC algorithms to speed up the conventional CORDIC rotator. The algorithm partitions 𝜃 into 𝜃𝐻 and 𝜃𝐿. The rotations by 𝜃𝐻 are performed as in the conventional CORDIC algorithm, and the iterations related to 𝜃𝐿 can be simplified as in the linear coordinate system. This algorithm led to the development of several parallel CORDIC algorithms [63–65]. These can be categorized broadly as mixed-hybrid CORDIC and partitioned-hybrid CORDIC algorithms.

In mixed-hybrid CORDIC algorithms [65], the input angle 𝜃 and the initial coordinates (𝑥in,𝑦in) are used to compute the rotations for the first 𝑛/3 iterations as in the conventional CORDIC. The remaining angle after these first 𝑛/3 iterations is used for computing the directions for the last 2𝑛/3 iterations. The implementation is designed to keep the fast timing characteristics of redundant arithmetic in the 𝑥/𝑦 path of the CORDIC processing. In the partitioned-hybrid CORDIC [63, 64], the first 𝑛/3 directions of rotation are generated using the first 𝑛/3 bits of 𝜃, and the last 2𝑛/3 directions of rotation are predicted using the 2𝑛/3 least significant bits of 𝜃.
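A rough floating-point sketch of the mixed-hybrid idea follows. It is an illustration under stated assumptions rather than any of the cited architectures: the split point is taken as ⌈𝑛/3⌉, the leftover angle is recoded against plain weights 2^−𝑖 with a software loop (hardware would derive these digits in parallel), and all names are hypothetical.

```python
import math

def hybrid_rotate(theta, n):
    """Return (cos, sin) estimates for theta using a mixed-hybrid schedule."""
    m = math.ceil(n / 3)          # assumed split between sequential and predicted parts
    z, sigmas = theta, []
    # First m directions: conventional sequential angle accumulator.
    for i in range(m):
        s = 1 if z >= 0 else -1
        sigmas.append(s)
        z -= s * math.atan(2.0 ** -i)
    # Remaining directions: recode the leftover angle against radix-2
    # weights, relying on atan(2^-i) ~= 2^-i for i >= m.
    for i in range(m, n):
        s = 1 if z >= 0 else -1
        sigmas.append(s)
        z -= s * 2.0 ** -i
    # x/y path: ordinary shift-and-add microrotations with these directions.
    x, y = 1.0, 0.0
    for i, s in enumerate(sigmas):
        x, y = x - s * 2.0 ** -i * y, y + s * 2.0 ** -i * x
    # Directions are always +/-1, so the scale factor is a constant.
    gain = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(n))
    return x / gain, y / gain

if __name__ == "__main__":
    c, s = hybrid_rotate(0.6, 16)
    print(c - math.cos(0.6), s - math.sin(0.6))
```

In this sketch the angle error introduced by treating tan^−1(2^−𝑖) as 2^−𝑖 in the last iterations is small compared with the residual of the recoding itself, which is the property the hybrid schemes rely on.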

9.3.1. Flat CORDIC [63]

The flat CORDIC algorithm is proposed to eliminate the iterative nature of the 𝑥/𝑦 path in order to reduce the total computation time. This algorithm transforms the 𝑥/𝑦 recurrences (11) of the conventional CORDIC into a parallelized version by successive substitution, expressing the final vectors in terms of the initial vectors and resulting in a single equation for 𝑛-bit precision. The expressions for the final coordinates of a 16-bit sine/cosine generator are

$$\begin{aligned} x_{16} ={}& \Bigl[1 - \sigma_1\sigma_2 2^{-1}2^{-2} - \cdots - \sigma_1\sigma_{23}2^{-1}2^{-23} - \sigma_2\sigma_3 2^{-2}2^{-3} - \cdots - \sigma_9\sigma_{10}2^{-9}2^{-10}\\ &+ \bigl(\sigma_1\sigma_2\sigma_3\sigma_4 2^{-1}2^{-2}2^{-3}2^{-4} + \cdots + \sigma_2\sigma_3\sigma_4\sigma_5 2^{-2}2^{-3}2^{-4}2^{-5} + \cdots + \sigma_3\sigma_4\sigma_6\sigma_7 2^{-3}2^{-4}2^{-6}2^{-7}\bigr)\Bigr] + E_{C-X},\\ y_{16} ={}& \Bigl[\sigma_1 2^{-1} + \sigma_2 2^{-2} + \cdots + \sigma_{16}2^{-16} - \bigl(\sigma_1\sigma_2\sigma_3 2^{-1}2^{-2}2^{-3} + \cdots + \sigma_5\sigma_7\sigma_8 2^{-5}2^{-7}2^{-8}\bigr)\\ &+ \bigl(\sigma_1\sigma_2\sigma_3\sigma_4\sigma_5 2^{-1}2^{-2}2^{-3}2^{-4}2^{-5} + \cdots + \sigma_2\sigma_3\sigma_4\sigma_5\sigma_6 2^{-2}2^{-3}2^{-4}2^{-5}2^{-6}\bigr)\Bigr] + E_{C-Y}, \end{aligned} \tag{33}$$

where $E_{C-X}$ and $E_{C-Y}$ are the error compensation factors in $x_{16}$ and $y_{16}$, respectively. 𝑥in and 𝑦in are initialized with 1/𝐾 and 0, respectively. The 16 sign digits (𝜎1, 𝜎2, …, 𝜎15, 𝜎16) for 16-bit precision represent the polarity of the 16 microrotations required to achieve the target angle. These equations demonstrate the complete parallelization of the conventional CORDIC algorithm.

This technique precomputes the 𝜎𝑖's, which take values from the set {−1,1} to achieve constant scale factor. The 𝜎𝑖's for the first 𝑛/3 iterations are precomputed employing a technique called the Split Decomposition Algorithm (SDA), which limits the input angle range to (0,𝜋/4) [66]. The last 2𝑛/3 𝜎𝑖's are predicted from the remaining angle after 𝑛/3 iterations. The internal word length of the architecture proposed for this technique is taken as (𝑛+log2𝑛) for 𝑛-bit external accuracy [47]. It may be noted that the complete parallelization of the 𝑥/𝑦 iterations leads to an exponential increase in the number of terms to be flattened, affecting the circuit complexity. In addition, the implementation of flat CORDIC needs complex combinational hardware blocks with poor scalability.
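The successive substitution behind the flattened expressions can be illustrated for a small word length. The sketch below assumes 𝑥0 = 1 and 𝑦0 = 0, leaves out scale factor and error compensation, and keeps every product term rather than truncating small ones as a practical design would; the function names are hypothetical. It expands the recurrences into sums over all products of the terms 𝜎𝑖2^−𝑖.

```python
from itertools import combinations
import math

def iterative_xy(sigmas):
    """Conventional recurrence, indices starting at 1 as in the flattened form."""
    x, y = 1.0, 0.0
    for i, s in enumerate(sigmas, start=1):
        x, y = x - s * 2.0 ** -i * y, y + s * 2.0 ** -i * x
    return x, y

def flattened_xy(sigmas):
    """Same result written as one sum over products of sigma_i * 2^-i."""
    terms = [s * 2.0 ** -i for i, s in enumerate(sigmas, start=1)]
    x = y = 0.0
    for k in range(len(terms) + 1):
        total = sum(math.prod(c) for c in combinations(terms, k))
        sign = (-1) ** (k // 2)   # signs +,+,-,-,+,+,... for k = 0,1,2,3,4,5,...
        if k % 2 == 0:
            x += sign * total     # even-order products contribute to x
        else:
            y += sign * total     # odd-order products contribute to y
    return x, y

if __name__ == "__main__":
    sig = [1, -1, 1, 1, -1, 1, -1, -1]
    print(iterative_xy(sig))
    print(flattened_xy(sig), 2 ** len(sig), "product terms")
```

The untruncated expansion has 2^𝑛 product terms, which gives a feel for the growth in circuit complexity noted above.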

Scale Factor
The scale factor in the implementation of the flat CORDIC algorithm is maintained constant, since 𝜎𝑖∈{−1,1}. The scale factor compensation is implemented using a multiplier designed with a CS adder tree.

9.3.2. Para-CORDIC [64]

The Para-CORDIC parallelizes the generation of the direction of rotations 𝜎 from the binary value of the input angle 𝜃 by employing binary to bipolar representation (BBR) and microrotation angle recoding (MAR) techniques. This algorithm computes the 𝑥/𝑦 coordinates iteratively while eliminating the iterative 𝑧 path completely. The input angle 𝜃 is divided into the higher part 𝜃𝐻 and the lower part 𝜃𝐿. The two's complement binary representation of the input angle 𝜃 is

$$\theta = -d_0 + \sum_{i=1}^{l-1} d_i 2^{-i} + \sum_{i=l}^{n} d_i 2^{-i}, \tag{34}$$

where 𝑑𝑖∈{0,1} and $l = (n - \log_2 3)/3$. The (𝑙−1) bits of the input angle are converted into BBR, and the MAR technique is employed to determine the directions of rotation $\sigma_1$ to $\sigma_{l-1}$. Since $\tan^{-1}(2^{-i}) \neq 2^{-i}$, this method performs additional microrotations for every iteration depending on each positional binary weight $2^{-i}$ for 𝑖=1,2,…,𝑙−1. The remaining angle after the first (𝑙−1) rotations is added to 𝜃𝐿. The values of $\sigma_l$ to $\sigma_{n+1}$ are obtained from the BBR of the corrected 𝜃𝐿. This method eliminates the ROM for storing the predetermined directions of rotation. However, it requires additional 𝑥/𝑦 stages for the repetition of certain microrotations and an array of adders to compute the corrected 𝜃𝐿.
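The binary to bipolar step can be illustrated with a standard recoding identity. This is a generic sketch, not necessarily the exact formulation used in [64]: each bit 𝑑𝑖∈{0,1} is mapped to 𝑟𝑖 = 2𝑑𝑖 − 1 ∈ {−1,+1} with weight 2^−(𝑖+1), plus fixed offsets. The function names are hypothetical.

```python
def twos_complement_value(bits):
    """bits = [d0, d1, ..., dn]; value = -d0 + sum_{i>=1} d_i 2^-i."""
    return -bits[0] + sum(d * 2.0 ** -i for i, d in enumerate(bits) if i >= 1)

def bipolar_digits(bits):
    """Map each fractional bit d_i in {0, 1} to r_i = 2 d_i - 1 in {-1, +1}."""
    return [2 * d - 1 for d in bits[1:]]

def bipolar_value(bits):
    """Evaluate the same number from its bipolar digits."""
    n = len(bits) - 1
    r = bipolar_digits(bits)
    # r_i carries weight 2^-(i+1); the other terms are data-independent offsets.
    recoded = sum(ri * 2.0 ** -(i + 1) for i, ri in enumerate(r, start=1))
    return -bits[0] + 0.5 - 2.0 ** -(n + 1) + recoded

if __name__ == "__main__":
    bits = [0, 1, 0, 1, 1, 0, 0, 1, 0]
    print(twos_complement_value(bits), bipolar_value(bits), bipolar_digits(bits))
```

Because every bipolar digit comes from a single input bit, all digits are available at once, and the one-position shift in weight is consistent with direction indices running one position further than the bit indices.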

9.3.3. Semi-Flat CORDIC [65]

The iterative nature of the conventional CORDIC algorithm is partially eliminated by the semi-flat algorithm. It is designed for the semi-parallelization of the 𝑥/𝑦/𝑧 recurrences, to improve the speed of a rotational unfolded CORDIC without increasing the area requirements. The internal precision is taken higher than the required external precision in order to reduce the quantization error encountered in the CORDIC algorithm, as discussed in Section 3.2.6. For the first 𝜆 bits of 𝜎𝑖, the 𝑥/𝑦 recurrences are computed iteratively using the double rotation method [51], resulting in $x_{\lambda-1}/y_{\lambda-1}$. Then, $x_{n-1}/y_{n-1}$ can be expressed in terms of these $x_{\lambda-1}/y_{\lambda-1}$, if all 𝜎𝑖's are predicted. The 𝜎𝑖's for ($n_{\mathrm{int}}/3-\lambda$) bits of the input angle ($n_{\mathrm{int}}$ = internal precision) are precomputed and stored in ROM, which is addressed by ($n_{\mathrm{int}}/3-\lambda$) bits of the input angle. The remaining ($2n_{\mathrm{int}}/3$) 𝜎𝑖's are predicted from the rotation angle [60]. It may be noted that neither a description nor a reference is provided for the split decomposition method employed to precompute the ($n_{\mathrm{int}}/3-\lambda$) 𝜎𝑖's.

The computation time and area of the chip are affected by the choice of 𝜆, which is clear from the simulation results presented in [65]. It is observed from these simulation results that the best trade-off is obtained with 𝜆=6 and 𝜆=8 for a 16-bit CORDIC (internal precision 22 bits) and 32-bit CORDIC (internal precision 39 bits) respectively. After 𝜆 iterations, all the terms of (𝑥𝑛/𝑦𝑛) were added using the Wallace tree, flattening the 𝑥/𝑦 path. However, this architecture has poor scalability.

Scale Factor
This algorithm achieves constant scale factor, since 𝜎𝑖 takes values from the set {−1,1}.

10. Comparison

We have presented a latency estimate comparison of unfolded architectures available in the literature for 2D rotational CORDIC in Table 2. Latency is defined as sum of the delays for the computation of redundant 𝑥/𝑦 coordinates, scale factor compensation and redundant to binary conversion of final 𝑥/𝑦 coordinates. The design detail of scale factor compensation and redundant to binary conversion stages is not made available in the literature for all the architectures as discussed in the previous sections (Sections 6–9). Hence, we have compared all the CORDIC algorithms with respect to the latency required for the rotation computation, excluding the scale factor compensation and redundant to binary conversion stages. All the architectures presented in this table are implemented using redundant arithmetic except the conventional CORDIC [3] and the Low latency nonredundant CORDIC [49].

The nonpipelined and pipelined implementation of the conventional radix-2 CORDIC algorithm [3, 7] requires 𝑛 iterations to compute 𝑥/𝑦 coordinates iteratively. The iteration delay depends on the fast carry propagate adder, which is the bottleneck to increase throughput and reduce latency.

The application of redundant arithmetic [42] to the conventional CORDIC allows 𝜎𝑖 to take values from the set {−1,0,1} instead of the set {−1,1}. The 𝜎𝑖 values are computed iteratively, and the choice of 𝜎𝑖=0 makes the usually constant scale factor variable. The variable scale factor increases the area and delay for scale factor computation. The latency of this implementation is 𝑛𝑡stage, where 𝑡stage is the iteration stage delay in terms of the full adder delay 𝑡FA.

The double rotation and correcting rotation redundant CORDIC methods using SD arithmetic are proposed in [51], to reduce the cost of the scale factor computation. The nonpipelined and pipelined implementation of these methods require latency of 1.5𝑛𝑡stage to compute final 𝑥/𝑦 coordinates iteratively. These methods achieve constant scale factor, increasing the latency by 50% compared to [42].

The low latency CORDIC algorithm [55] reduces the latency to ((9𝑛−3)/8)𝑡stage, compared with 1.5𝑛𝑡stage in [51]. This algorithm iteratively computes the direction of rotations and the 𝑥/𝑦 coordinates. In addition, a nonpipelined architecture using a prediction technique is also proposed in [55]. The latency of this architecture is (𝑛+log3𝑛−1)𝑡stage.

Branching algorithm using signed digit arithmetic is proposed to achieve constant scale factor. The latency of nonpipelined and pipelined implementation of this algorithm is 𝑛𝑡stage. This algorithm achieves 50% latency improvement over [51] to compute final 𝑥/𝑦 coordinates iteratively. However, it requires double the hardware as two sets of 𝑥/𝑦/𝑧 modules are employed.

The directions of rotation computed using the sign estimation methods [51, 52, 55] may not be accurate; therefore, half of the computational effort is required for correction. The DCORDIC algorithm is proposed to determine the directions of rotation iteratively using the sign of the steering variable. However, this method requires an initial latency of 𝑛𝑡FA before the CORDIC rotation starts, to obtain the first direction of rotation. The signs for the remaining iterations are obtained with one full adder delay each, using a bit level pipelined architecture with 𝑛 stages. This implementation requires a latency of 𝑛𝑡stage+(𝑛+1)𝑡FA to compute the final 𝑥/𝑦 coordinates iteratively. In addition, this method requires 2.5𝑛 initial register rows for skewing of the input data.

All the methods presented so far reduce the latency by decreasing the iteration delay using redundant arithmetic. Latency can also be reduced by decreasing the number of iterations, which motivated the radix-4 pipelined CORDIC processor [58], with a latency of (3𝑛/4+1)𝑡stage.

The mixed radix CORDIC algorithm [59] is proposed using radix-2 and radix-4 rotations for designing a pipelined processor that operates in the rotation and vectoring modes of the circular and hyperbolic coordinate systems. This pipelined architecture requires (3𝑛/4+1) stages with three different stage delays (𝑡stage): 31𝑡NAND for 1≤𝑖<𝑛/4, 34𝑡NAND for 𝑛/4≤𝑖≤(𝑛/2+1), and 36𝑡NAND for 𝑖>(𝑛/2+1). The stage delay of this architecture is higher because it is designed for various modes of operation.

The advantage of applying radix-4 rotations for all iteration stages is exploited in [43] with fewer adders than in [58]. For the microrotations in the range 0≤𝑖<(𝑛/6), the pipelined architecture proposed for this algorithm determines the 𝜎𝑖 values sequentially using the angle accumulator. For the microrotations in the range 𝑖≥(𝑛/6), the 𝜎𝑖 values are determined from the remaining angle after 𝑛/6 iterations. The latency of this architecture to compute the final 𝑥/𝑦 coordinates iteratively is (2𝑛/3+2)𝑡stage.
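The stage-count formulas quoted so far in this comparison can be tabulated for a given word length with a short script. This is only a convenience sketch with a hypothetical function name: the values are in units of each design's own 𝑡stage, which differs between architectures, so the numbers compare iteration counts rather than absolute delays.

```python
def rotation_latency_in_stages(n):
    """Latency formulas quoted in the text, in units of each design's t_stage."""
    return {
        "redundant radix-2 [42]": n,
        "double/correcting rotation [51]": 1.5 * n,
        "low latency [55]": (9 * n - 3) / 8,
        "branching [52]": n,
        "pipelined radix-4 [58]": 3 * n / 4 + 1,
        "radix-4 [43]": 2 * n / 3 + 2,
    }

if __name__ == "__main__":
    for n in (16, 32):
        print(f"n = {n}")
        for name, stages in rotation_latency_in_stages(n).items():
            print(f"  {name:32s} {stages:7.3f}")
```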

In [61], P-CORDIC algorithm is proposed to eliminate 𝑧 path completely, using a linear relation between the rotation angle 𝜃 and the corresponding direction of all microrotations for rotation mode. This algorithm computes the 𝑥/𝑦 coordinates iteratively. The latency of the nonpipelined architecture proposed to implement this algorithm for 𝑛-bit precision is (𝑛/12+log2𝑛+1.75+2𝑛)𝑡FA.

The iterative nature in the 𝑥/𝑦/𝑧 path is eliminated at the cost of scalability by the flat CORDIC algorithm [63]. This algorithm transforms 𝑥/𝑦 recurrences (11) of the conventional CORDIC into a parallelized version, by successive substitution to express the final vectors in terms of the initial vectors, resulting in a single equation for 𝑛-bit precision. The direction of rotations are precomputed before initiating the computation of 𝑥/𝑦 coordinates. The final 𝑥/𝑦 coordinates are computed using combinational blocks with the latency of 34𝑡FA/16-bit and 50𝑡FA/32-bit. The expressions for 𝑥 and 𝑦 variables need to be derived and combinational building blocks have to be redesigned with change in precision.

In [64], the Para-CORDIC algorithm is proposed to precompute the directions of rotation without using ROM, while eliminating the iterative 𝑧 path completely. This method uses additional 𝑥/𝑦 stages for the repetition of certain microrotations to predict the directions of rotation, in contrast to the ROM employed in [61, 63, 65]. The latency of Para-CORDIC is $\bigl(2(s(n)+n/2-l+2)+\lceil \log_{1.5} n + 2\rceil\bigr)\,t_{\mathrm{FA}}$, where $l = (n-\log_2 3)/3$ and 𝑠(𝑛) represents the total number of microrotations required in the MAR recoding of (𝑙−1) bits of the input angle. The values of 𝑠(𝑛) for 16/32/64-bit precision are 5, 18, and 52, respectively.

The semiflat technique is proposed in [65], to partially eliminate the iterative nature in 𝑥/𝑦/𝑧 paths for the (𝑛−𝜆) iterations (𝜆=6 for a 16-bit CORDIC and 𝜆=8 for a 32-bit CORDIC, respectively). The latency of the nonpipelined implementation of this algorithm is 33𝑡FA/16-bit and 49𝑡FA/32-bit, respectively. It is observed that this architecture is combinational after 𝜆 iterations and has poor scalability.

In [49], the 𝑥/𝑦 coordinates are computed iteratively for the first (𝑛/2+1) iterations using (𝑛/2+1) fast adders. These values are used to compute the final 𝑥/𝑦 coordinates using two multipliers in parallel and one adder, resulting in a latency of ((𝑛/2+2)𝑡adder+𝑡multiplier). The 𝜎𝑖 values for the first (𝑛/3+1) iterations are determined iteratively using the sign of the angle accumulator 𝑧𝑖. For the range (𝑛/3+1)<𝑖≤(𝑛/2+1), the rotation directions are generated in parallel.

11. Conclusions

In this paper, we have surveyed the algorithms for unfolded implementation of 2D rotational CORDIC algorithms. Special attention has been devoted to the systematic and comprehensive classification of solutions proposed in the literature. In addition to the pipelined implementation of nonredundant radix-2 CORDIC algorithm that has received wide attention in the past, we have discussed the importance of redundant and higher radix algorithms. We have also stressed the importance of prediction algorithms to precompute the directions of rotations and parallelization of 𝑥/𝑦 path. It is worth noting that the considered algorithms should not be implemented as alternatives over the others, rather they should be integrated depending on the design constraints of a specific application.

We can draw final conclusions about the different algorithms for achieving an efficient implementation of an application specific rotational CORDIC algorithm. As far as the application of redundant arithmetic to the pipelined implementation of the conventional radix-2 CORDIC algorithm is concerned, the area is doubled, with a reduction in the adder delay of each stage from (log2𝑛)𝑡FA to 2𝑡FA. Similarly, the hardware and iteration delay of redundant radix-2 CORDIC can be reduced by employing a prediction technique for the precomputation of the directions of rotation. Further latency reduction can be achieved by integrating the prediction technique with redundant radix-4 arithmetic, trading area for variable scale factor computation. Another important observation about the solutions proposed with full parallelization of the 𝑥/𝑦 path is that it affects the modularity and regularity of the architecture, leading to a poorly scalable implementation. Finally, we conclude that a solution that allows the design of a scalable architecture, applying prediction and 𝑥/𝑦 path parallelization techniques to the redundant CORDIC algorithm, can achieve both latency reduction and throughput improvement.