Abstract
In the last decade, the CORDIC algorithm has drawn wide attention from academia and industry for various applications such as DSP, biomedical signal processing, software defined radio, neural networks, and MIMO systems, to mention just a few. It is an iterative algorithm, requiring only simple shift and addition operations, for the hardware realization of basic elementary functions. Since CORDIC is used as a building block in various single chip solutions, the critical aspects to be considered are high speed, low power, and low area, for achieving reasonable overall performance. In this paper, we first classify the CORDIC algorithm based on the number system and discuss its importance in the implementation of the CORDIC algorithm. Then, we present a systematic and comprehensive taxonomy of rotational CORDIC algorithms, which are subsequently discussed in depth. Special attention has been devoted to the higher radix and flat techniques proposed in the literature for reducing the latency. Finally, a detailed comparison of the various algorithms is presented, which can provide first-order information to designers looking for either further improvement of performance or selection of a rotational CORDIC for a specific application.
1. Introduction
Current research in the design of high speed VLSI architectures for real-time digital signal processing (DSP) algorithms has been directed by the advances in VLSI technology, which have provided designers with significant impetus for porting algorithms into architectures. Many of the algorithms used in DSP and matrix arithmetic require elementary functions such as trigonometric, inverse trigonometric, logarithm, exponential, multiplication, and division functions. The commonly used software solutions for the digital implementation of these functions are the table lookup method and polynomial expansions, which require a number of multiplications and additions/subtractions. However, digit-by-digit methods exist for the evaluation of these elementary functions, which compute faster than the software solutions.
Some of the digit-by-digit methods for the computation of the above mentioned elementary functions were described by Henry Briggs in 1624 in "Arithmetica Logarithmica" [1, 2]. These are iterative pseudo division and pseudo multiplication processes, which resemble repeated-addition multiplication and repeated-subtraction division. In 1959, Volder proposed a special purpose digital computing unit known as the COordinate Rotation DIgital Computer (CORDIC), while building a real time navigational computer for use in an aircraft [3, 4]. This algorithm was initially developed for trigonometric functions, which were expressed in terms of basic plane rotations.
The conventional method of implementing the 2D vector rotation shown in Figure 1 using the Givens rotation transform is represented by the equations
$$x' = x\cos\theta - y\sin\theta, \qquad y' = y\cos\theta + x\sin\theta,$$
where $(x, y)$ and $(x', y')$ are the initial and final coordinates of the vector, respectively, and $\theta$ is the rotation angle. The hardware realization of these equations requires four multiplications, two additions/subtractions, and access to a table stored in memory for the trigonometric coefficients. The CORDIC algorithm computes the 2D rotation using iterative equations employing shift and add operations. The versatility of CORDIC was enhanced by Daggett in 1959 [5], who developed algorithms on the same basis to convert between binary and binary coded decimal (BCD) number representations. These iterative methods were described using decimal radix for the design of powerful small machines by Meggitt in 1962 [6]. Subsequently, Walther in 1971 [7, 8] proposed a unified algorithm to compute rotation in circular, linear, and hyperbolic coordinate systems using the same CORDIC algorithm, embedding the coordinate system as a parameter.

During the last 50 years of the CORDIC algorithm, a wide variety of applications have emerged. The CORDIC algorithm received increased attention after a unified approach was proposed for its implementation [7]. Thereafter, CORDIC based computing has been the choice for scientific calculator applications; the HP-2152A co-processor, the HP-9100 desktop calculator, and the HP-35 calculator are a few such devices based on the CORDIC algorithm [1, 8]. A CORDIC arithmetic processor chip was designed and implemented to perform the various functions possible in the rotation and vectoring modes of the circular, linear, and hyperbolic coordinate systems [9]. Since then, the CORDIC technique has been used in many applications [10], such as single chip CORDIC processors for DSP applications [11–15], linear transformations [16–21], digital filters [17], [22–24], and matrix based signal processing algorithms [25, 26]. More recently, the advances in VLSI technology and the advent of EDA tools have extended the application of the CORDIC algorithm to the fields of biomedical signal processing [27], neural networks [28], software defined radio [29], and MIMO systems [30], to mention a few.
Although CORDIC may not be the fastest technique to perform these operations, it is attractive due to its potential for efficient and low cost implementation of a large class of applications. Several modifications have been proposed in the literature for the CORDIC algorithm during the last two decades to provide high performance and low cost hardware solutions for real time computation of a two dimensional vector rotation and transcendental functions.
A new type of arithmetic operation called fast rotations, or orthonormal μ-rotations over a set of fixed angles, has been proposed [31]. These orthonormal μ-rotations are based on the idea of CORDIC and share the property that performing the rotation requires a minimal number of shift-add operations. These fast rotation methods form a viable low cost alternative to CORDIC arithmetic for certain applications such as FIR filter banks for image processing, the generation of spherical sample rays in 3D graphics, and the computation of the eigenvalue decomposition and singular value decomposition.
We have carried out a critical study of the different architectures proposed in the literature for 2D rotational CORDIC in the circular coordinate system, to initiate work on further latency reduction or throughput improvement. In this paper, we review the architectures proposed for rotational CORDIC. Specifically, we focus on redundant unfolded architectures that employ techniques suitable for increasing throughput and reducing latency.
The rest of the paper is organized as follows. In Section 2, the basics of redundant arithmetic are presented. In Section 3, we present a review of the generalized CORDIC algorithm and of the radix-2 and radix-4 CORDIC algorithms. In Section 4, the general architectures employed in the literature for the implementation of the CORDIC algorithm are discussed. In Section 5, the complete taxonomy of rotational CORDIC algorithms is presented. Section 6 presents the low latency nonredundant CORDIC algorithm. Sections 7–9 cover different redundant CORDIC algorithms along with the architectures proposed in the literature for the rotational CORDIC, followed by the comparison of different methods in Section 10. Finally, conclusions are presented in Section 11.
2. Redundant Arithmetic [32, 33]
A nonredundant radix-$r$ number system has the digit set $\{0, 1, \ldots, r-1\}$, and all numbers can be uniquely represented. To avoid carry propagation delay in addition, a redundant binary number system is employed. The two common redundant number systems employed in CORDIC arithmetic are the signed-digit (SD) [34–37] and the carry-save (CS) [38] number systems. In an SD number system for radix $r$, numbers are represented with the digit set $\{-\beta, \ldots, -1, 0, 1, \ldots, \alpha\}$, where $1 \le \alpha, \beta \le r-1$ and $\alpha + \beta + 1 > r$. For a symmetric digit set, $\alpha = \beta$, and each digit $x_i$ of the SD number system is represented as $x_i = x_i^{+} - x_i^{-}$ by encoding it with a positive and a negative component such that $x_i^{+}, x_i^{-} \in \{0, 1, \ldots, \alpha\}$. In the radix-2 SD number system, numbers are represented with the digits $\{-1, 0, 1\}$. In the carry-save number system, numbers are represented with the digit set $\{0, 1, 2\}$. It may be observed that, in both the SD and CS number systems, each number can be represented in multiple ways. The redundancy in the SD and CS number representations limits the carry propagation from each stage to its immediately more significant bit position only. In both SD and CS adders, all sum bits are generated with a delay of two full adders, independent of the word length. Hence, the application of redundant arithmetic can accelerate additions/subtractions through carry-free or limited carry-propagation operation.
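The carry-save idea can be illustrated with a small software model. The sketch below is only an assumption-laden illustration (the function names `carry_save_add` and `accumulate` are hypothetical, and Python integers stand in for hardware words): three operands are reduced to a sum word and a carry word with purely bitwise operations, so no carry ripples across the word until the final conversion.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to a (sum, carry) pair.

    Every bit position is handled independently, which models a row of
    full adders with no carry chain between them.
    """
    partial_sum = a ^ b ^ c                       # per-bit sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1    # carries move one position left only
    return partial_sum, carry


def accumulate(values):
    """Add many operands in carry-save form, resolving carries only once."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c                                  # single carry-propagate addition


print(carry_save_add(13, 7, 9))
print(accumulate([13, 7, 9, 100, 255]))
```

The pair `(sum, carry)` is one of several valid representations of the same value, which is the redundancy referred to above.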
3. CORDIC Algorithm
The CORDIC algorithm involves rotation of a vector on the $xy$-plane in circular, linear, and hyperbolic coordinate systems, depending on the function to be evaluated. Trajectories of the vector for successive CORDIC iterations are shown in Figure 2. This is an iterative convergence algorithm that performs a rotation iteratively using a series of specific incremental rotation angles, selected so that each iteration is performed by shift and add operations. The norm of a vector $(x, y)$ in these coordinate systems is defined as $\sqrt{x^2 + m y^2}$, where $m \in \{1, 0, -1\}$ represents a circular, linear, or hyperbolic coordinate system, respectively. The norm preserving rotation trajectory is a circle defined by $x^2 + y^2 = 1$ in the circular coordinate system. Similarly, the norm preserving rotation trajectories in the hyperbolic and linear coordinate systems are defined by the functions $x^2 - y^2 = 1$ and $x = 1$, respectively. The CORDIC method can be employed in two different modes, namely, the rotation mode and the vectoring mode. The rotation mode is used to perform the general rotation by a given angle $\theta$. The vectoring mode computes the unknown angle $\theta$ of a vector by performing a finite number of microrotations.

Figure 2: Trajectories of the vector for successive CORDIC iterations: (a) circular system, (b) linear system, (c) hyperbolic system.
3.1. Generalized CORDIC Algorithm
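The generalized equations of the CORDIC algorithm for an iteration can be written as [7]
$$x_{i+1} = x_i - m\,\sigma_i\,\rho^{-S_{m,i}}\,y_i, \qquad y_{i+1} = y_i + \sigma_i\,\rho^{-S_{m,i}}\,x_i, \qquad z_{i+1} = z_i - \sigma_i\,\alpha_{m,i}, \tag{2}$$
where $\sigma_i$ represents either the clockwise or the counterclockwise direction of rotation, $\rho$ represents the radix of the number system, $m$ steers the choice of circular ($m = 1$), linear ($m = 0$), or hyperbolic ($m = -1$) coordinate systems, $S_{m,i}$ is the nondecreasing integer shift sequence, and $\alpha_{m,i}$ is the elementary rotation angle. The latter directly depends on $S_{m,i}$ through the relation
$$\alpha_{m,i} = \frac{1}{\sqrt{m}}\tan^{-1}\!\left(\sqrt{m}\,\rho^{-S_{m,i}}\right),$$
which reduces to $\rho^{-S_{m,i}}$ in the limit $m \to 0$. The shift sequence $S_{m,i}$ depends on the coordinate system and the radix of the number system. The shift sequence affects the convergence of the algorithm, and the number of iterations affects the accuracy of the final result. A detailed discussion of these is presented later. The value of $\sigma_i$ depends on the radix of the number system and is determined by the following equation, assuming that the vector is in either the first or the fourth quadrant:
$$\sigma_i = \begin{cases} \operatorname{sign}(z_i), & \text{for } z_i \to 0 \text{ (rotation mode)},\\ -\operatorname{sign}(y_i), & \text{for } y_i \to 0 \text{ (vectoring mode)},\end{cases}$$
where $z_i$ and $y_i$ are the steering variables in rotation and vectoring mode, respectively. The required microrotations are not perfect and increase the length of the vector. In order to maintain a constant vector length, the obtained results have to be scaled by the scale factor
$$K = \prod_{i} k_i = \prod_{i} \sqrt{1 + m\,\sigma_i^2\,\rho^{-2S_{m,i}}}, \tag{5}$$
where $k_i$ denotes the elementary scaling factor of the $i$th iteration, and $K$ is the resultant scaling factor after $n$ iterations. The computation of the scale factor and its compensation increase the computational overhead and hardware, depending on the number system employed in the CORDIC arithmetic.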
With the appropriate initial values of $x_0$, $y_0$, and $z_0$, both rotation and vectoring modes can be used to compute the commonly used elementary functions [39] given in Table 1.
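As a rough illustration of how one recurrence serves all three coordinate systems, the following floating-point sketch runs the rotation mode for radix 2. Several things here are assumptions made for the example rather than details taken from the text: the function names, the use of floats instead of shift-and-add hardware, the linear shift sequence starting at 1, and the hyperbolic shift sequence that repeats iterations 4 and 13 (the commonly used choice for convergence).

```python
import math


def shift_sequence(m, n):
    """Assumed radix-2 shift sequences for circular (1), linear (0), hyperbolic (-1)."""
    if m == 1:
        return list(range(n))            # 0, 1, 2, ...
    if m == 0:
        return list(range(1, n + 1))     # 1, 2, 3, ...
    seq, i, repeat_at = [], 1, 4         # hyperbolic: 1, 2, 3, 4, 4, 5, ..., 13, 13, ...
    while len(seq) < n:
        seq.append(i)
        if i == repeat_at:
            seq.append(i)
            repeat_at = 3 * repeat_at + 1
        i += 1
    return seq[:n]


def elementary_angle(m, shift):
    t = 2.0 ** -shift
    if m == 1:
        return math.atan(t)
    if m == -1:
        return math.atanh(t)
    return t                             # linear system: the angle is the shift value itself


def cordic_rotate(m, x, y, z, n=32):
    """Rotation mode: drive z towards zero; returns (x, y, z, scale factor)."""
    k = 1.0
    for s in shift_sequence(m, n):
        sigma = 1.0 if z >= 0 else -1.0
        t = 2.0 ** -s
        x, y = x - m * sigma * t * y, y + sigma * t * x
        z -= sigma * elementary_angle(m, s)
        k *= math.sqrt(1.0 + m * t * t)
    return x, y, z, k


x, y, _, k = cordic_rotate(1, 1.0, 0.0, 0.5)
print(x / k, y / k)      # approximately cos(0.5), sin(0.5)
x, y, _, k = cordic_rotate(0, 3.0, 0.0, 0.25)
print(y)                 # approximately 3 * 0.25
x, y, _, k = cordic_rotate(-1, 1.0, 0.0, 0.5)
print(x / k, y / k)      # approximately cosh(0.5), sinh(0.5)
```

Only the parameter `m` and the shift sequence change between the three calls, which mirrors the role of the coordinate system as a parameter in the unified formulation.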
3.2. CORDIC Algorithm for Circular Coordinate System
We present in this section the detailed description of 2D plane rotation in circular coordinate system, since this is used in many applications. The CORDIC algorithm calculates trigonometric functions, rotation of a vector and angle of a vector by realizing two dimensional vector rotation in circular coordinate systems. Figure 3 shows the rotation of a vector with length by a sequence of microrotations through the elementary angles . Equation (2) represents the iterative rotation by an angle in circular coordinate system for and is given by The values of are chosen such that and the multiplication of tangent term is reduced to simple shift operation. It may observed that the norm of vector in th iteration is extended compared to that in th rotation, that is . The increase in magnitude of the vector in every iteration depends on the radix of the number system and number of iterations and is represented by the scale factor . The direction of iterative rotation is determined using or depending on rotation mode or vectoring mode respectively. The number of microrotations to be performed in both the modes depends on the desired computing accuracy and can be constant for a particular computer of finite word length. The number of microrotations in turn decides the number of elementary angles. The iterative equations of the CORDIC algorithm for radix-2 and radix-4 number systems will be presented in the following sections.

3.2.1. Rotation Mode
In rotation mode, the input angle $\theta$ is decomposed using a finite number of elementary angles [3]
$$\theta = \sum_{i=0}^{n-1} \sigma_i \alpha_i,$$
where $n$ indicates the number of microrotations, $\alpha_i$ is the elementary angle for the $i$th iteration, and $\sigma_i$ is the direction of the $i$th microrotation. In rotation mode, $z$ is the angle accumulator, initialized with the input rotation angle. The direction of rotation of the vector in every iteration must be determined so as to reduce the magnitude of the residual angle in the angle accumulator. Therefore, the direction of rotation in any iteration is determined using the sign of the residual angle obtained in the previous iteration. The coordinates of a vector obtained after $n$ microrotations are
$$x_n = K\,(x_0\cos\theta - y_0\sin\theta), \qquad y_n = K\,(y_0\cos\theta + x_0\sin\theta).$$
3.2.2. Vectoring Mode
In vectoring mode, the unknown angle of a vector is determined by performing a finite number of microrotations satisfying the relation [3] The vectoring mode rotates the input vector through a predetermined set of elementary angles so as to reduce the coordinate of the final vector to zero as closely as possible. Therefore, the direction of rotation in every iteration must be determined based on the sign of residual coordinate obtained in the previous iteration. The coordinates obtained in vectoring mode after iterations are given by
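As a rough illustration of the vectoring mode described above, the following sketch drives the $y$ coordinate towards zero and accumulates the rotation angle. It is a floating-point model under stated assumptions: radix 2, circular coordinates, an input vector in the first or fourth quadrant, and a hypothetical function name `cordic_vectoring`; the division by the accumulated gain stands in for whatever scale factor compensation an implementation would use.

```python
import math


def cordic_vectoring(x, y, n=32):
    """Return (magnitude, angle) of the vector (x, y), assuming x > 0."""
    z = 0.0
    gain = 1.0
    for i in range(n):
        t = 2.0 ** -i
        sigma = -1.0 if y >= 0 else 1.0   # choose the direction that reduces |y|
        x, y = x - sigma * t * y, y + sigma * t * x
        z -= sigma * math.atan(t)         # accumulate the angle traversed
        gain *= math.sqrt(1.0 + t * t)
    return x / gain, z


magnitude, angle = cordic_vectoring(3.0, 4.0)
print(magnitude, angle)   # approximately 5 and atan2(4, 3)
```

The direction decision uses only the sign of the residual $y$ coordinate from the previous iteration, as stated in the text.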
3.2.3. Radix-2 CORDIC Algorithm
The iteration equations of the radix-2 CORDIC algorithm [7] in rotation mode of the circular coordinate system at the $(i+1)$th step are obtained by using $\rho = 2$ in (6) and are given by
$$x_{i+1} = x_i - \sigma_i 2^{-i} y_i, \qquad y_{i+1} = y_i + \sigma_i 2^{-i} x_i, \qquad z_{i+1} = z_i - \sigma_i \alpha_i, \tag{11}$$
where $\sigma_i \in \{-1, 1\}$ and $\alpha_i = \tan^{-1}(2^{-i})$. In order to maintain a constant vector length, the obtained results have to be scaled by the scale factor $K$ given by
$$K = \prod_{i=0}^{n-1} \sqrt{1 + 2^{-2i}}.$$
For radix-2 CORDIC, $K \approx 1.647$. The major drawback of the conventional CORDIC algorithm is its relatively high latency and low throughput, due to the sequential nature of the iteration process with carry propagate addition and variable shifting in every iteration. To overcome these drawbacks, pipelined implementations have been proposed [40, 41]. However, the carry propagate addition remained a bottleneck for further throughput improvement. Two major methodologies have been employed in the literature to increase the speed of CORDIC implementations. One reduces the delay of each iteration by applying redundant arithmetic to radix-2 CORDIC [42] to eliminate carry propagate addition. The other reduces the number of iterations by increasing the radix employed for the implementation of the CORDIC algorithm [43].
The redundant radix-2 CORDIC [42] employs redundant arithmetic. The directions of rotation $\sigma_i$ are selected from the set $\{-1, 0, 1\}$, in contrast to the set $\{-1, 1\}$ employed in the conventional CORDIC. These values are computed by evaluating a few most significant digits of $z_i$, since the determination of the sign of a redundant number takes a long time. This redundant CORDIC algorithm performs no rotation extension for $\sigma_i = 0$, which affects the value of the scaling factor $K$ and makes it data-dependent. Therefore, $K$ has to be calculated for each microrotation. This calculation and correction increase the computation time and hardware.
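The conventional (nonredundant) radix-2 recurrences of this section can be sketched with integer shifts and additions only, which is closer to what the hardware does than a floating-point model. The fixed-point format (28 fractional bits), the iteration count (24), and the trick of starting from $x_0 = 1/K$ so that no final multiplication is needed are all assumptions chosen for this example, not parameters from the text.

```python
import math

FRAC_BITS = 28    # assumed fixed-point format: 28 fractional bits
N_ITER = 24       # assumed number of microrotations
ONE = 1 << FRAC_BITS

# Lookup table of elementary angles atan(2^-i), quantized to fixed point.
ANGLES = [round(math.atan(2.0 ** -i) * ONE) for i in range(N_ITER)]

# Constant scale factor of nonredundant radix-2 CORDIC and its reciprocal.
K = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_ITER))
INV_K = round(ONE / K)


def cordic_rotate_fixed(theta):
    """Return (cos(theta), sin(theta)) for |theta| within the convergence range."""
    x, y = INV_K, 0                     # pre-scaled start vector (1/K, 0)
    z = round(theta * ONE)              # angle accumulator
    for i in range(N_ITER):
        if z >= 0:                      # sigma = +1
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
        else:                           # sigma = -1
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
    return x / ONE, y / ONE


print(cordic_rotate_fixed(0.6))
```

Each iteration uses two shifts, three additions or subtractions, and one table lookup; the sequential dependence on the sign of `z` is the source of the latency discussed above.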
3.2.4. Redundant Radix-4 CORDIC Algorithm
As mentioned above, the speed of CORDIC algorithm implementation can be improved by reducing the number of iterations. The iteration equations for the radix-4 CORDIC algorithm in rotation mode derived at the th step by using in (6) and are given by where . The final and coordinates are scaled by Here, the scale factor depends on the values of and hence, has to be computed in every iteration. The range of is for radix-4 CORDIC. In this CORDIC, the direction of rotation is computed based on the estimated value of [43]. The path involves the computation of estimated and evaluation of selection function to determine resulting in increase of the iteration delay compared to that of radix-2. However, the number of iterations required for radix-2 CORDIC can be halved by employing the radix-4 CORDIC algorithm.
The scale factor computation and compensation, the convergence of the CORDIC algorithm, and its accuracy are presented in the following sections.
3.2.5. Scale Factor Computation
The CORDIC rotation steps change the length of the vector in every iteration, distorting the norm of the vector as shown in Figure 3; the resulting scale factor is given by (5). In nonredundant radix-2 CORDIC, $K$ is constant since $\sigma_i \in \{-1, 1\}$. However, $K$ is no longer constant for a nonredundant radix higher than 2 or for a redundant number system. For radix-2, the scale factor needs to be computed for $n/2$ iterations, as $k_i$ becomes unity within $n$-bit precision for $i > n/2$. In redundant radix-4 CORDIC [43], the scale factor (15) is not constant. In addition, it is sufficient to compute $K$ for $n/4$ iterations, as $k_i$ becomes unity thereafter.
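The behaviour of the radix-2 scale factor can be checked numerically. The sketch below is illustrative only (the helper names are hypothetical): it evaluates the elementary factor $\sqrt{1 + \sigma_i^2 2^{-2i}}$, shows that the product depends on the data when $\sigma_i = 0$ is allowed, and finds the first iteration at which the elementary factor differs from one by less than $2^{-n}$.

```python
import math


def elementary_scale(i, sigma=1):
    """Length gain of one radix-2 microrotation with direction sigma."""
    return math.sqrt(1.0 + (sigma * 2.0 ** -i) ** 2)


def scale_factor(sigmas):
    """Overall gain for a given sequence of rotation directions."""
    k = 1.0
    for i, sigma in enumerate(sigmas):
        k *= elementary_scale(i, sigma)
    return k


def first_unity_iteration(n_bits):
    """Smallest i for which the elementary factor equals 1 within n_bits."""
    i = 0
    while elementary_scale(i) - 1.0 >= 2.0 ** -n_bits:
        i += 1
    return i


print(scale_factor([1] * 32))                # constant when sigma is always +1 or -1
print(scale_factor([1, 0, 1, 0] + [1] * 28)) # smaller: skipped rotations add no gain
print(first_unity_iteration(16), first_unity_iteration(32))
```

With 16-bit and 32-bit precision the loop stops at iterations 8 and 16, consistent with the $n/2$ figure quoted for radix 2.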
3.2.6. Scale Factor Compensation
The scale factor compensation technique involves scaling the final coordinates $(x_n, y_n)$ by $K^{-1}$. The most direct method for the scaling operation is the multiplication of $(x_n, y_n)$ by $K^{-1}$, which can be realized using the CORDIC module in linear mode [7]. However, this method requires $n$ shift and add operations, which is comparable to the computational effort of the CORDIC algorithm itself. Since $K$ is constant for radix-2, the computational overhead can be reduced by using a CSD recoded multiplier. On average, the number of nonzero digits can be reduced to $n/3$ using the CSD representation [32]; hence, the effort for multiplication using a CSD recoded multiplier is approximately one third of that required using a conventional multiplier. Further, scaling can also be implemented using a Wallace tree by fully parallelizing the multiplication, which is preferred for applications aiming for low latency at the expense of more silicon area [44].
Scaling may also be done by extending the sequence of CORDIC iterations [9, 16, 17], which avoids the additional hardware required in the direct method. A comparison of several scale factor compensation techniques proposed in the literature, along with two additional methods, the additive and multiplicative decomposition approaches, is presented for radix-2 CORDIC in [44]. The presented results show that the additive technique offers a low latency solution and the multiplicative technique offers an area economical solution for applications of CORDIC employing array and pipelined architectures. An algorithm has been proposed [45] that performs the scale factor compensation in parallel with the CORDIC rotation using nonredundant and redundant arithmetic, thereby eliminating the final multiplication [3] or the additional scaling iterations [9, 16, 17].
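The CSD recoded multiplication by a constant mentioned above can be sketched as follows. This is a minimal illustration under assumptions: the constant is the reciprocal of the radix-2 scale factor quantized to 16 fractional bits, the recoding uses the standard nonadjacent-form procedure, and the names `to_csd` and `csd_multiply` are hypothetical.

```python
INV_K_Q16 = round((1 << 16) / 1.6467602581210654)   # 1/K with 16 fractional bits


def to_csd(value):
    """Canonical signed digit recoding of a non-negative integer.

    Returns digits in {-1, 0, 1}, least significant first, with no two
    adjacent nonzero digits.
    """
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)     # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


def csd_multiply(x, digits):
    """Multiply x by the recoded constant using only shifts, adds and subtracts."""
    acc = 0
    for i, d in enumerate(digits):
        if d == 1:
            acc += x << i
        elif d == -1:
            acc -= x << i
    return acc


digits = to_csd(INV_K_Q16)
nonzero = sum(1 for d in digits if d != 0)
print(digits, nonzero, bin(INV_K_Q16).count("1"))
print(csd_multiply(12345, digits) == 12345 * INV_K_Q16)
```

Each nonzero digit costs one addition or subtraction, so the count of nonzero digits printed above is a direct measure of the multiplier effort for this particular constant.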
3.2.7. Convergence
The CORDIC algorithm involves the rotation of a vector to reduce the $z$ or $y$ coordinate of the final vector as closely as possible to zero for rotation or vectoring mode, respectively. The maximum value of the rotation angle by which the vector can be rotated depends on the shift sequence [7]. The expected results of the CORDIC algorithm can be obtained if the $z$ or $y$ coordinate is driven sufficiently close to zero. In addition, $z$ or $y$ can be guaranteed to be driven to zero if the initial values of the vector $(x_0, y_0)$ or $z_0$ lie within the permissible range. These ranges define the domain of convergence of the CORDIC algorithm.
For -bit precision, the given rotation angle can be decomposed as where is an angle approximation error such that and is negligible in practical computation [7]. This angle approximation error in rotation and vectoring mode can be computed as The magnitude of elementary angle for the given shift sequence may be predetermined using where is the radix of the number system. The direction of rotation must be selected to drive or towards zero for rotation or vectoring respectively. The range of depends on the radix and digit set being used for the number system. Since the number of iterations and elementary angles to be traversed by the vector during these iterations are predetermined, the range of for which CORDIC algorithm can be used, called domain of convergence, is given by [7] The convergence range of CORDIC algorithm can be defined for rotation mode as and for vectoring mode as The expected final results cannot be obtained, if the given initial values and do not satisfy these convergence values. The range of convergence of the CORDIC algorithm can be extended from to using preprocessing techniques [7, 27, 46].
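For radix-2 circular CORDIC, the maximum angle that can be accumulated is the sum of the elementary angles $\tan^{-1}(2^{-i})$, which is slightly larger than $\pi/2$. The sketch below computes that sum and shows one possible preprocessing step, an initial rotation by $\pm\pi/2$, for angles in $[-\pi, \pi]$. The quadrant pre-rotation is an assumption chosen for illustration; it is one common technique and not necessarily the one used in the cited works, and the function names are hypothetical.

```python
import math


def convergence_limit(n):
    """Sum of the n radix-2 elementary angles atan(2^-i), i = 0 .. n-1."""
    return sum(math.atan(2.0 ** -i) for i in range(n))


def reduce_to_domain(x, y, theta):
    """Pre-rotate by +/- pi/2 so that the remaining angle lies in [-pi/2, pi/2].

    Assumes theta is already within [-pi, pi]. A rotation by +90 degrees maps
    (x, y) to (-y, x); a rotation by -90 degrees maps (x, y) to (y, -x).
    """
    if theta > math.pi / 2:
        return -y, x, theta - math.pi / 2
    if theta < -math.pi / 2:
        return y, -x, theta + math.pi / 2
    return x, y, theta


print(convergence_limit(32))             # about 1.7433 rad, just above pi/2
print(reduce_to_domain(1.0, 0.0, 2.5))
```

The pre-rotation needs only a swap and a sign change, so it adds no multiplications to the data path.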
3.3. Accuracy
The accuracy of the CORDIC algorithm is affected by two primary sources of error, namely, angle approximation and rounding error. The error bounds for these two sources of error are derived by performing the detailed numerical analysis of the CORDIC algorithm [47]. The approximation error and the rounding error derived are combined to yield the overall quantization error in the CORDIC computation. The overall quantization error can be assured to be within the range by considering an additional guard bits in the implementation of the CORDIC algorithm [7].
3.3.1. Angle Approximation Error
Theoretically, the rotation angle $\theta$ is decomposed into an infinite number of elementary angles, as shown in Figure 3. For practical implementation, a finite number of microrotations $n$ is considered. Hence, the input rotation angle can only be approximated, resulting in an angle approximation error
$$\delta = \theta - \sum_{i=0}^{n-1} \sigma_i \alpha_i,$$
where $\delta$ is the residual angle after $n$ microrotations. Hence, the accuracy of the output of the $n$th iteration is principally limited by the magnitude of the last rotation angle.
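The statement that the residual angle is bounded by the last elementary angle can be checked numerically for radix 2. The sketch below is an illustration under assumptions (floating-point arithmetic, $\sigma_i \in \{-1, 1\}$, input angles inside the convergence range, hypothetical function name): it runs only the angle recurrence and compares the residual with $\tan^{-1}(2^{-(n-1)})$.

```python
import math


def residual_angle(theta, n):
    """Angle left in the accumulator after n radix-2 microrotations."""
    z = theta
    for i in range(n):
        alpha = math.atan(2.0 ** -i)
        z = z - alpha if z >= 0 else z + alpha
    return z


n = 16
bound = math.atan(2.0 ** -(n - 1))      # magnitude of the last elementary angle
for theta in (-1.5, -0.3, 0.0, 0.7, 1.2, 1.7):
    print(theta, abs(residual_angle(theta, n)) <= bound)
```

Each additional iteration roughly halves the bound, which is why the number of microrotations is tied to the target precision.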
3.3.2. Rounding Error
The second type of error, called rounding error, is due to the truncation of the CORDIC internal variables by the finite length of the storage elements. In addition, scale factor compensation also contributes to this error. In a binary code, the truncation of intermediate results after every iteration introduces a maximum rounding error of $\log_2 n$ bits. To achieve a final accuracy of 1 bit in $n$ bits, an additional $\log_2 n$ guard bits must be considered in the implementation of this algorithm [7].
4. CORDIC Architectures
In this section, a few architectures for mapping the CORDIC algorithm into hardware are presented. In general, the architectures can be broadly classified as folded and unfolded as shown in Figure 4, based upon the realization of the three iterative equations (6). Folded architectures are obtained by duplicating each of the difference equations of the CORDIC algorithm into hardware and time multiplexing all the iterations into a single functional unit. Folding provides a means for trading area for time in signal processing architectures. The folded architectures can be categorized into bit-serial and word-serial architectures depending on whether the functional unit implements the logic for one bit or one word of each iteration of the CORDIC algorithm.

The CORDIC algorithm has traditionally been implemented using a bit serial architecture with all iterations executed in the same hardware [3]. This slows down the computational device and hence is not suitable for high speed implementation. The word serial architecture [7, 48] is an iterative CORDIC architecture obtained by realizing the iteration equations (6). In this architecture, the shifters are modified in each iteration to cause the desired shift for the iteration. The appropriate elementary angles, $\alpha_i$, are accessed from a lookup table. The most dominating speed factors during the iterations of the word serial architecture are carry/borrow propagate addition/subtraction and variable shifting operations, rendering the conventional CORDIC [7] implementation slow for high speed applications. These drawbacks were overcome by unfolding the iteration process [41, 48], so that each of the processing elements always performs the same iteration, as shown in Figure 5. The main advantage of the unfolded pipelined architecture compared to the folded architecture is high throughput, due to hardwired shifts in place of time and area consuming barrel shifters and the elimination of the ROM. It may be noted that the pipelined architecture offers a throughput improvement by a factor of $n$ for $n$-bit precision at the expense of increasing the hardware by a factor less than $n$.

5. CORDIC Taxonomy
The implementation of the CORDIC algorithm has evolved over the years, from conventional nonredundant to redundant forms, to suit the varying requirements of applications. The unfolded implementation with redundant arithmetic initiated the efforts to address the high latency of conventional CORDIC. Subsequently, several modifications have been proposed for the redundant CORDIC algorithm to achieve reductions in iteration delay, latency, area, and power. The evolution of the unfolded rotational CORDIC algorithms is shown in Figure 6. As this taxonomy is fairly rich, the remainder of the review presents it in a top-down manner.

CORDIC is broadly classified as nonredundant CORDIC and redundant CORDIC, based on the number system employed. The major drawback of the conventional CORDIC algorithm [3, 7] was low throughput and high latency due to the carry propagate adder used for the implementation of the iterative equations. This contradicted the simplicity and novelty of the CORDIC algorithm, attracting the attention of several researchers to devise methods to increase the speed of execution. The obvious solution is to reduce the time for each iteration, the number of iterations, or both. Redundant arithmetic has been employed to reduce the time for each iteration of the conventional CORDIC. In the following sections, we analyze and present the features of different pipelined and nonpipelined unfolded implementations of the rotational CORDIC.
6. Low Latency Nonredundant Radix-2 CORDIC [49]
A significant improvement for the conventional rotational CORDIC algorithm in circular coordinate system is proposed [50], employing linear approximation to the rotation when the remaining angle is small. This remaining angle is chosen such that a first order Taylor series approximation of and , calling the remaining angle, may be employed as and . The architecture for the implementation of this algorithm using nonredundant arithmetic is presented in [49]. The iteration equations of this algorithm for the first microrotations are same as those for the conventional CORDIC algorithm (11). The values for the first iterations are determined iteratively using the sign of angle accumulator . The rotation directions from iteration onwards can be generated in parallel, since the conventional circular arc tangent radix values approach the radix-2 coefficients progressively for increasing values of CORDIC iteration index as evident from the expression For the range of iterations , all values are determined from the recoded representation of remaining angle . These values are used to obtain from . For , the CORDIC microrotations are replaced by a single rotation using the remaining angle . Thus, (11) is modified as where , is the scale factor in the th iteration and () are the scaled final coordinates.
Scale Factor
The low latency nonredundant radix-2 CORDIC algorithm achieves a constant scale factor since $\sigma_i \in \{-1, 1\}$, and performs the scale factor compensation concurrently with the computation of the $x$ and $y$ coordinates, using two multipliers in parallel [49]. This is in contrast to the two series multiplications required in the algorithm of [50].
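The idea of replacing the later microrotations with a single linear rotation can be sketched numerically. The example below is a simplified floating-point model under assumptions: it runs the first half of the iterations conventionally, then applies one rotation using $\sin z \approx z$ and $\cos z \approx 1$ for the remaining angle. It does not model the parallel generation of rotation directions or the two parallel multipliers of the cited architecture, and the function name is hypothetical.

```python
import math


def cordic_with_linear_tail(theta, n=32):
    """Approximate (cos(theta), sin(theta)) with n/2 microrotations plus one
    first-order rotation by the remaining angle."""
    x, y, z = 1.0, 0.0, theta
    gain = 1.0
    for i in range(n // 2):
        t = 2.0 ** -i
        sigma = 1.0 if z >= 0 else -1.0
        x, y = x - sigma * t * y, y + sigma * t * x
        z -= sigma * math.atan(t)
        gain *= math.sqrt(1.0 + t * t)
    # Remaining angle is small: rotate once using sin(z) ~ z and cos(z) ~ 1.
    x, y = x - z * y, y + z * x
    return x / gain, y / gain


c, s = cordic_with_linear_tail(0.9)
print(c - math.cos(0.9), s - math.sin(0.9))
```

After $n/2$ iterations the remaining angle is at most about $2^{-(n/2-1)}$, so the neglected second-order term is of the order of $2^{-n}$, which is why the approximation does not degrade the target precision.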
7. Constant Scale Factor Redundant Radix-2 CORDIC
Redundant radix-2 CORDIC methods can be classified as variable and constant scale factor methods, based on the dependence of the scale factor on the input angle. In redundant radix-2 CORDIC [42], $\sigma_i \in \{-1, 0, 1\}$ and hence the scale factor $K$ is data-dependent. Therefore, $K$ has to be calculated for each microrotation. This calculation and correction increase the computation time and hardware. Several redundant CORDIC algorithms with a constant scale factor are available in the literature [51–53] to address the data dependency of the scale factor, as shown in Figure 7. In these methods, the iterative rotations of a point around the origin on the $xy$-plane are considered (see Figure 1). The direction of each rotation depends on the sign of the steering variable $z_i$, which represents the remaining angle of rotation. Since the computation of the sign of a redundant number requires more time, an estimated value $\hat{\sigma}_i$ of $\sigma_i$ is used to determine the direction of rotation. The estimated value is computed from the three most significant digits of $z_i$. A constant scale factor is achieved by restricting $\sigma_i$ to the set $\{-1, 1\}$, thus facilitating a faster implementation. The constant scale factor methods can be classified, based on the arithmetic employed, as redundant radix-2 CORDIC with signed digit arithmetic and with carry save arithmetic (see Figure 7).

Scale Factor
The scale factor need not be computed for the implementation of all the constant scale factor techniques discussed in this section. In these methods, no specific scale factor compensation technique is considered. It may be noted that a specific compensation technique can be considered depending on the application.
7.1. Constant Scale Factor Redundant CORDIC Using SD Arithmetic
The redundant radix-2 CORDIC using SD arithmetic can be further classified based on the technique employed to achieve constant scale factor (see Figure 7). These methods are implemented using the basic CORDIC iteration recurrences (11) with necessary transformations.
7.1.1. Double Rotation Method [51]
The double rotation method performs two rotation-extensions for each elementary angle during the first $n/2$ iterations for $n$-bit precision to achieve a constant scale factor independent of the operand. One rotation extension is performed for every elementary angle for iterations greater than $n/2$. A negative rotation is performed by two negative subrotations, and a positive rotation by two positive subrotations. A nonrotation is performed by one negative and one positive subrotation. Hence, 50% additional iterations are required compared to the redundant CORDIC [42].
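The reason the double rotation gives a data-independent scale factor can be seen in a small numerical sketch. The model below is an illustration under assumptions: each step applies two subrotations of angle $\tan^{-1}(2^{-(i+1)})$, the function name is hypothetical, and floating-point values replace signed digit arithmetic. Whatever the value of $\sigma_i$, the vector length grows by the same factor $1 + 2^{-2(i+1)}$.

```python
import math


def double_rotation_step(x, y, sigma, i):
    """Apply one double rotation for sigma in {-1, 0, 1} at iteration i."""
    t = 2.0 ** -(i + 1)
    # Positive rotation: two positive subrotations; negative: two negative;
    # nonrotation: one positive and one negative subrotation.
    subrotations = {1: (1, 1), -1: (-1, -1), 0: (1, -1)}[sigma]
    for s in subrotations:
        x, y = x - s * t * y, y + s * t * x
    return x, y


for sigma in (-1, 0, 1):
    x, y = double_rotation_step(1.0, 0.0, sigma, 2)
    print(sigma, math.hypot(x, y), math.atan2(y, x))
```

The nonrotation leaves the direction of the vector unchanged while still extending its length, so the overall scale factor no longer depends on how often $\sigma_i = 0$ occurs.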
7.1.2. Correcting Rotation Method [51]
This is another method proposed to achieve constant scale factor for the computation of sine and cosine functions. This method avoids rotation corresponding to and performs one rotation extension in every iteration depending on the . Further, extra rotation extensions are performed at fixed intervals for correcting the error introduced by avoiding and to assure convergence. If fractional bits are used to estimate , the interval between correcting iterations should be less than or equal to [54]. This method also requires 50 additional iterations, if three or four most significant digits are used for sign estimation. The increase in latency of rotational CORDIC due to these double rotation and correcting iteration methods is reduced using branching algorithm [52].
7.1.3. Branching Method [52]
This method implements CORDIC algorithm using SD arithmetic, restricting the direction of rotations to , without the need for extra rotations. This requires two modules in parallel to perform two conventional CORDIC iterations, such that, the correct result is retained at the end of each iteration. Two modules perform the rotation in the same direction if the sign of corresponding can be determined. Otherwise, branching is performed by making one CORDIC module perform rotation with and another module perform rotation with . The direction of rotation in the next subsequent rotation is decided by the sign of that module whose value is small. In every iteration , angle accumulator or computes the remaining angle or to determine the direction of rotation for the next iteration. The direction of rotation is determined by examining window of three digits of or .
The disadvantage of branching method is the necessity of performing two conventional CORDIC iterations in parallel which requires almost two fold effort in terms of implementation complexity. In addition, one of the modules will not be utilized when branching does not take place. However, this method offers faster implementation than double and correcting rotation methods [51], since, it does not require additional iterations to achieve constant scale factor.
7.1.4. Double Step Branching Method [53]
The performance of the branching algorithm is enhanced by the double step branching method, which improves the utilization of hardware. This method involves determining two distinct values in each step with some additional hardware compared to the branching method, where the two modules do different computations only when branching takes place. The double step branching method determines the two directions of rotation by examining the six most significant digits to do a double step. These six digits are divided into two subgroups of three digits each, and each subgroup is handled in parallel, to generate the required using zeroing modules ( path). Although the double stepping method introduces a small hardware overhead compared to the branching method, it is better than the latter since it increases the utilization of the rotator modules.
7.2. Constant Scale Factor Redundant CORDIC Using CS Arithmetic
It is worth discussing here one more classification related to constant scale factor redundant radix-2 CORDIC (see Figure 7). The implementation of redundant CORDIC with constant scale factor using signed digit arithmetic results in an increase in the chip area [51–53] and latency [51] by at least 50% compared to redundant radix-2 CORDIC [42]. Low latency CORDIC algorithm [55] and differential CORDIC algorithm [56, 57] with constant scale factor using CS arithmetic have been proposed to reduce this overhead, the details of which are discussed below.
7.2.1. Low Latency Redundant CORDIC [55]
This algorithm is proposed to reduce the latency of redundant CORDIC [51] by subdividing the iterations into different groups and using different techniques for each of these groups. For all the iterations, if , conventional iteration equations (11) are used. This method avoids for iterations between and employs correcting rotation method [51]. For iterations , is considered as a valid choice. Since for this group of iterations holds within -bit precision, vector is not rotated for . However, the length of the vector is increased by the scale factor for that iteration, as the final coordinates are scaled assuming constant scale factor. For the iterations , no correcting factor is required as the scale factor becomes unity.
7.2.2. DCORDIC [56]
In the sign estimation methods [51–53], half of the computational effort in the data paths of rotational CORDIC is required to allow for the correction of possible errors, as the sign estimation is not entirely perfect. This problem is reduced by the high speed bit-level pipelining technique with CS arithmetic proposed in [57]. This algorithm involves the transformation of the conventional CORDIC iteration equations (11) into partially fixed iteration equations, given by It is clear from these expressions that the computation of x and y requires the actual sign of z, while the angle accumulator requires only the absolute value of z. The actual sign of z at each iteration can be determined by taking into account the initial sign of z and the sign changes that occur during the absolute value computation. Similarly, all values are computed recursively. Later, this technique was implemented with SD arithmetic and proposed as the Differential CORDIC (DCORDIC) algorithm [56]. Since the sign calculation of the steering variable during absolute value computation takes a long time, a most-significant-digit-first absolute value technique is employed. This technique replaces the word level sign dependence by a bit level dependence, reducing the overall computation time. A bit level pipelined architecture is proposed to implement these transformed iteration sequences, thus allowing high operational speed.
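The sign-tracking idea can be illustrated with a minimal Python sketch (not taken from [56, 57]). It assumes the standard rotation-mode angle recurrence z_{i+1} = z_i - sign(z_i) atan(2^-i) and an absolute-value recurrence of the form |z_{i+1}| = ||z_i| - atan(2^-i)|, with the sign of z flipping whenever the inner difference is negative. The function names, the floating-point arithmetic, and the tie rule for a zero residual are assumptions made for this example; a hardware DCORDIC works on redundant digits, most significant digit first.

```python
import math


def directions_conventional(theta, n):
    """Directions from the word-level sign of the angle accumulator z."""
    z, out = theta, []
    for i in range(n):
        sigma = 1 if z >= 0 else -1
        out.append(sigma)
        z -= sigma * math.atan(2.0 ** -i)
    return out


def directions_differential(theta, n):
    """Directions from |z| plus a running sign (assumed DCORDIC-style form)."""
    sign = 1 if theta >= 0 else -1   # initial sign of z
    mag = abs(theta)                 # only |z| is propagated
    out = []
    for i in range(n):
        out.append(sign)
        diff = mag - math.atan(2.0 ** -i)
        if diff < 0:
            sign = -sign             # a sign change was detected
        elif diff == 0:
            sign = 1                 # tie rule: match the z >= 0 convention
        mag = abs(diff)
    return out


if __name__ == "__main__":
    for theta in (0.3, -0.7, 1.2, -1.5):
        same = directions_conventional(theta, 16) == directions_differential(theta, 16)
        print(theta, same)
```

Both functions produce the same direction sequence; the second one never needs the signed value of z, only its magnitude and the history of sign changes.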
8. Higher Radix Redundant CORDIC
As mentioned earlier, throughput and latency are important performance attributes in CORDIC based systems. The various radix-2 CORDIC algorithms presented so far may be used to reduce the iteration delay, thereby improving the throughput, with constant scale factor. Higher radix CORDIC algorithms using SD arithmetic [54, 58] and CS arithmetic [43, 59] are proposed to address latency reduction. This is possible, since higher radix representation reduces the number of iterations. The classification of redundant CORDIC algorithms proposed in the literature based on the radix of the number system is shown in Figure 8. The application of radix-4 rotations in the CORDIC algorithm was initially proposed in [54] to accelerate the radix-2 algorithm.

Scale factor need not be computed for the constant scale factor algorithms to be discussed in this section. Since no specific scale factor compensation technique is considered for these methods, a compensation technique can be considered depending on the application.
8.1. Pipelined Radix-4 CORDIC [58]
The generalized CORDIC algorithm for any radix in three coordinate systems and implementation of the same in rotation mode of circular coordinate system using radix-4 pipelined CORDIC processor is presented in [58]. This algorithm performs two successive radix-2 microrotations with the same microrotation angle using the iteration equations where and are two redundant radix-2 coefficients to decompose radix-4 coefficient satisfying the relation . The value of is selected as and for . The selection function for is determined using the five most significant digits of -coordinate, ensuring the convergence of this algorithm. This algorithm is designed using SD arithmetic and requires two adders/subtractors for each stage of data path in contrast to one adder/subtractor required in radix-2 CORDIC [42], for . However, the number of additions required is reduced during the last stages.
Scale Factor Computation
The scale factor in radix-4 CORDIC algorithm is variable, since σ_i takes values from the digit set {-2, -1, 0, 1, 2}. is computed in each iteration using the combinational circuit by realizing the expression
8.2. Redundant Radix 2-4 CORDIC [59]
The number of rotations in a redundant radix-2 CORDIC rotation unit is reduced by about 25% by expressing the direction of rotations in radix-2 and radix-4 [54]. This algorithm employs different modified CORDIC algorithms using CS arithmetic for different subsets of iterations. For the iterations , nonredundant radix-2 CORDIC algorithm with = is employed. For , correcting iteration method [51] is employed. For , redundant radix-4 CORDIC algorithm is employed, thus halving the number of iterations. A unified architecture is proposed for the implementation of this algorithm to operate in rotation/vectoring mode of circular and hyperbolic coordinate systems.
Scale Factor Computation
This algorithm achieves constant scale factor, since the rotation corresponding to is avoided for . For scale factor need not be computed as
8.3. Radix-4 CORDIC [43]
A redundant radix-4 CORDIC algorithm is proposed using CS arithmetic, to reduce the latency compared to redundant radix-2 CORDIC [42]. This algorithm (14) computes values using two different techniques. For the microrotations in the range , is determined sequentially using angle accumulator. For the microrotations in the range , the values are predicted from the remaining angle after the first [60]. Thus, the complexity of the path is , compared to in the other architectures [42–53] presented in the previous sections. For the range , microrotations are pipelined in two stages to increase the throughput. A 32-bit pipelined architecture is proposed for the implementation of the radix-4 CORDIC algorithm using CS arithmetic.
Scale Factor Computation
The possible scale factors are precomputed and stored in a ROM. The number of possible scale factors for is . The size of the ROM and its access time increase with . Hence, the scale factors for some iterations are stored in ROM and these values are used to compute the scale factor for the remaining iterations with combinational logic. This is designed by realizing the first few terms of the Taylor series expansion of the scale factor. For this redundant radix-4 implementation, the number of iterations is reduced at the expense of adding hardware for computing the scale factor.
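To make the growth of the scale factor table concrete, the following Python sketch enumerates the scale factors of a radix-4 rotator. It assumes (this is not stated in the text above) that microrotation i uses the shift 4^-i with a digit from {-2, -1, 0, 1, 2} and contributes the factor sqrt(1 + σ_i^2 4^-2i), and that the index starts at 0; the indexing used in [43] may differ. The function names are hypothetical.

```python
import math
from itertools import product


def radix4_scale_factor(sigmas):
    """Product of sqrt(1 + (sigma_i * 4^-i)^2) over the given digits."""
    k = 1.0
    for i, s in enumerate(sigmas):
        k *= math.sqrt(1.0 + (s * 4.0 ** -i) ** 2)
    return k


def distinct_scale_factors(n_iter):
    """Count distinct scale factors over the first n_iter microrotations.

    The factor depends only on |sigma_i|, so only the magnitudes 0, 1, 2
    need to be enumerated. Values are rounded before comparison, which is
    adequate for the small n_iter used here.
    """
    seen = set()
    for mags in product((0, 1, 2), repeat=n_iter):
        seen.add(round(radix4_scale_factor(mags), 12))
    return len(seen)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, distinct_scale_factors(n))
```

Under these assumptions the count triples with every additional microrotation, which is why only the scale factors of the first few iterations are practical to hold in a table.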
9. Parallel CORDIC Algorithms
The CORDIC algorithms discussed so far have represented using a set of elementary angles called arc tangent radix set [3] where and , satisfying the convergence theorem [7] in contrast to the representation using a normal radix The direction of rotation for the th iteration is determined after computing the iterations sequentially. It is evident from this sequential dependence of the radix system that the speed of CORDIC algorithm can be improved by avoiding the sequential behavior in the computation of values or coordinates. The various redundant CORDIC algorithms proposed in the literature employing either one or both these techniques are shown in Figure 9 and are discussed in the following sections.
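The sequential dependence mentioned above can be seen in a minimal floating-point model of the conventional radix-2 rotation mode. The sketch below assumes the standard elementary angles atan(2^-i) and the usual pre-scaling of the initial vector by the inverse of the constant gain; the function name and the use of floating-point instead of fixed-point arithmetic are choices made for this example only.

```python
import math


def cordic_rotate(theta, n=16):
    """Approximate (cos(theta), sin(theta)) with n radix-2 microrotations."""
    angles = [math.atan(2.0 ** -i) for i in range(n)]
    gain = 1.0
    for i in range(n):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta     # pre-scaled start vector
    sigmas = []
    for i in range(n):
        # sigma_i needs z_i, which is known only after iteration i - 1
        sigma = 1 if z >= 0 else -1
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        z -= sigma * angles[i]
        sigmas.append(sigma)
    return x, y, sigmas


if __name__ == "__main__":
    c, s, sigmas = cordic_rotate(0.5)
    print(c, s, sigmas)
```

Every direction is read from the angle accumulator of the previous step, so neither the directions nor the coordinates can be produced in parallel without one of the prediction or flattening techniques surveyed in this section.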

9.1. Low Latency Radix-2 CORDIC [55]
The low latency parallel radix-2 CORDIC architecture presented for the rotation mode [55] predicts σ_i's by eliminating sequential dependency of the path. In order to minimize the prediction error, directions are predicted for a group of iterations at a time rather than for all iterations together. This architecture does not allow rotation for index . Hence, the convergence range of this architecture is less than . On the other hand, the requirement of redundant to binary conversions of intermediate results in the path restricts the pipelined implementation of this architecture. In order to reduce the latency of this parallelizing scheme further, a termination algorithm and Booth encoding method have been proposed.
9.2. P-CORDIC [61]
The sequential procedure in the computation of direction of rotations of the CORDIC algorithm is eliminated by the P-CORDIC algorithm, while maintaining a constant scale factor. This algorithm precomputes the direction of microrotations before the actual CORDIC rotation starts iteratively in the path. This is obtained by deriving a relation between the constructed binary representation of direction of rotations , and rotation angle [40, 62] given by where , , , and . Here, is computed using the partial offset and the corresponding direction bit for the first iterations, since the value of decreases by a factor of 8 beyond iterations. The direction of rotations for any input angle in binary form are obtained by realizing this expression taking a variable offset from ROM. The unfolded architecture proposed for the implementation of this algorithm eliminates the path and reduces the area of the implementation. This architecture achieves latency and hardware reduction over the radix-2 unfolded parallel architecture [55].
Scale Factor
The scale factor in the implementation of P-CORDIC algorithm remains constant, as being generated for the implementation of path. The scale factor compensation is implemented using constant factor multiplication technique as discussed in Section 3.2.6.
9.3. Hybrid CORDIC Algorithm
For -bit fixed point CORDIC processor in circular coordinate system, nearly iterations must be computed sequentially. This is true for both generation of direction and rotation without affecting accuracy [60]. The subsequent rotation directions for the last iterations can be generated in parallel since the conventional circular ATR values approach the radix-2 coefficients progressively with increasing iteration index, that is, tan^-1(2^-i) approaches 2^-i as i increases. This behavior is exploited by introducing the hybrid CORDIC algorithms to speed up the conventional CORDIC rotator. This algorithm involves partitioning into and . The rotations by are performed as in the conventional CORDIC algorithm and the iterations related to can be simplified as in linear coordinate system. This algorithm led to the development of several parallel CORDIC algorithms [63–65]. These can be categorized broadly as mixed-hybrid CORDIC and partitioned-hybrid CORDIC algorithms. In mixed-hybrid CORDIC algorithms [65], the input angle and initial coordinates are used to compute the rotations for the first iterations as in the conventional CORDIC. The remaining angle after these first iterations is used for computing directions for the last iterations. The implementation is designed to keep the fast timing characteristics of redundant arithmetic in the path of the CORDIC processing. In the partitioned-hybrid CORDIC [63, 64], the first direction of rotations are generated using the first bits of and last direction of rotations are predicted using the least significant bits of .
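How quickly the arc tangent values approach the radix-2 weights can be checked numerically. The short Python sketch below finds the first iteration index at which atan(2^-i) and 2^-i differ by less than one unit in the last place of an n-bit fraction, taken here as 2^-n. The function name and this particular error threshold are assumptions for illustration; [60] may use a different criterion.

```python
import math


def first_parallel_index(n_bits):
    """Smallest i with |2^-i - atan(2^-i)| < 2^-n_bits."""
    i = 0
    while abs(2.0 ** -i - math.atan(2.0 ** -i)) >= 2.0 ** -n_bits:
        i += 1
    return i


if __name__ == "__main__":
    for n in (16, 32):
        print(n, first_parallel_index(n))
```

With this threshold the index comes out at 5 for 16 bits and 11 for 32 bits, that is, roughly a third of the word length, because the leading error term of atan(2^-i) is 2^-3i / 3.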
9.3.1. Flat CORDIC [63]
The flat CORDIC algorithm is proposed to eliminate iterative nature in the path for reducing the total computation time. This algorithm transforms recurrences (11) of the conventional CORDIC into a parallelized version by successive substitution to express the final vectors in terms of the initial vectors, resulting in a single equation for -bit precision. The expressions for final coordinates of 16-bit sine/cosine generator are where and are the error compensation factors in and , respectively. and are initialized with and 0 respectively. The 16 sign digits for 16-bit precision represent the polarity of microrotations required to achieve the target angle. These equations demonstrate the complete parallelization of the conventional CORDIC algorithm. This technique precomputes σ_i, which takes values from the set {-1, 1} to achieve a constant scale factor. The σ_i's for the first iterations are precomputed employing a technique, called Split Decomposition Algorithm (SDA), which limits the input angle range to [66]. The last number of σ_i's are predicted from the remaining angle of iterations. The internal word length of the architecture proposed for this technique is considered as for -bit external accuracy [47]. It may be noted that the complete parallelization of iterations leads to an exponential increase of terms to be flattened, affecting the circuit complexity. In addition, the implementation of flat CORDIC needs complex combinational hardware blocks with poor scalability.
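The effect of successive substitution can be illustrated with a small Python sketch. It is not the formulation of [63], which groups terms and adds error compensation factors; it only assumes the standard radix-2 recurrences, written as the complex product (x + jy)(1 + j σ_i 2^-i), and expands that product into a sum over all subsets of microrotations. The function names are hypothetical.

```python
from itertools import combinations


def iterative(x0, y0, sigmas):
    """Conventional unscaled radix-2 recurrences, one step per direction."""
    x, y = x0, y0
    for i, s in enumerate(sigmas):
        x, y = x - s * y * 2.0 ** -i, y + s * x * 2.0 ** -i
    return x, y


def flattened(x0, y0, sigmas):
    """Same result as a single sum of products (successive substitution)."""
    n = len(sigmas)
    total = 0j
    terms = 0
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            coeff = 1 + 0j                  # the empty subset contributes 1
            for i in subset:
                coeff *= 1j * sigmas[i] * 2.0 ** -i
            total += coeff
            terms += 1
    z = complex(x0, y0) * total
    return z.real, z.imag, terms


if __name__ == "__main__":
    sig = [1, -1, 1, 1, -1, 1]
    print(iterative(1.0, 0.0, sig))
    print(flattened(1.0, 0.0, sig))
```

For n directions the flattened sum has 2^n product terms before any pruning, which illustrates why a fully flattened datapath scales poorly with precision.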
Scale Factor
The scale factor in the implementation of the flat CORDIC algorithm is maintained constant, since σ_i takes values from the set {-1, 1}. The scale factor compensation is implemented using a multiplier designed with CS adder tree.
9.3.2. Para-CORDIC [64]
The Para-CORDIC parallelizes the generation of direction of rotations from the binary value of the input angle by employing binary to bipolar representation (BBR) and microrotation angle recoding (MAR) techniques. This algorithm computes coordinates iteratively while eliminating iterative path completely. The input angle is divided into the higher part and lower part . The two's complement binary representation of input angle is where and . The bits of input angle are converted into BBR, and MAR technique is employed to determine the direction of rotations to . Since , this method performs additional microrotations for every iteration depending on each positional binary weight for . The remaining angle after the first rotations is added to . The values of to are obtained from BBR of the corrected . This method eliminates ROM for storing the predetermined direction of rotations. However, it requires additional stages for the repetition of certain microrotations and an array of adders to compute the corrected .
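The binary to bipolar step can be illustrated with exact rational arithmetic. The Python sketch below covers only an unsigned fraction with bits b_1..b_n of weight 2^-i, using the identity b_i = (1 + r_i) / 2 with r_i in {-1, 1}; the handling of the sign bit in two's complement and the MAR step of [64] are not modeled, and the function names are hypothetical.

```python
from fractions import Fraction


def binary_to_bipolar(bits):
    """Map bits b_i in {0, 1} to digits r_i = 2*b_i - 1 plus a fixed offset."""
    r = [2 * b - 1 for b in bits]
    # sum of 2^-(i+1) for i = 1..n, which does not depend on the bit values
    offset = Fraction(1, 2) - Fraction(1, 2 ** (len(bits) + 1))
    return r, offset


def value_binary(bits):
    """Value of the fraction 0.b_1 b_2 ... b_n."""
    return sum(Fraction(b, 2 ** i) for i, b in enumerate(bits, start=1))


def value_bipolar(r, offset):
    """Value of offset + sum of r_i * 2^-(i+1)."""
    return offset + sum(Fraction(d, 2 ** (i + 1)) for i, d in enumerate(r, start=1))


if __name__ == "__main__":
    bits = [1, 0, 1, 1]
    r, offset = binary_to_bipolar(bits)
    print(r, offset, value_binary(bits), value_bipolar(r, offset))
```

Each bipolar digit is obtained from a single bit with no carry or borrow, so all digits are available at once; this is the property that allows the directions to be generated in parallel from the bits of the angle.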
9.3.3. Semi-Flat CORDIC [65]
The iterative nature in the implementation of the conventional CORDIC algorithm is partially eliminated by the semi-flat algorithm. This is designed for the semi parallelization of the recurrences, to improve the speed of a rotational unfolded CORDIC without increasing the area requirements. The internal precision is taken higher than the required external precision in order to reduce the quantization error encountered in the CORDIC algorithm as discussed in Section 3.2.6. For the first bits of , recurrences are computed iteratively using the double rotation method [51] resulting in . Then, can be expressed in terms of these , if all σ_i's are predicted. The σ_i's for bits ( = internal precision) of input angle are precomputed and stored in ROM, which is addressed by bits of input angle. The remaining number of σ_i's are predicted from rotation angle [60]. It may be noted that neither the description nor the reference is provided for split decomposition method employed to precompute number of σ_i's.
The computation time and area of the chip are affected by the choice of , which is clear from the simulation results presented in [65]. It is observed from these simulation results that the best trade-off is obtained with and for a 16-bit CORDIC (internal precision 22 bits) and 32-bit CORDIC (internal precision 39 bits) respectively. After iterations, all the terms of were added using the Wallace tree, flattening the path. However, this architecture has poor scalability.
Scale Factor
This algorithm achieves a constant scale factor, since σ_i takes values from the set {-1, 1}.
10. Comparison
We have presented a latency estimate comparison of unfolded architectures available in the literature for 2D rotational CORDIC in Table 2. Latency is defined as the sum of the delays for the computation of redundant coordinates, scale factor compensation, and redundant to binary conversion of final coordinates. The design details of the scale factor compensation and redundant to binary conversion stages are not made available in the literature for all the architectures, as discussed in the previous sections (Sections 6–9). Hence, we have compared all the CORDIC algorithms with respect to the latency required for the rotation computation, excluding the scale factor compensation and redundant to binary conversion stages. All the architectures presented in this table are implemented using redundant arithmetic except the conventional CORDIC [3] and the low latency nonredundant CORDIC [49].
The nonpipelined and pipelined implementation of the conventional radix-2 CORDIC algorithm [3, 7] requires iterations to compute coordinates iteratively. The iteration delay depends on the fast carry propagate adder, which is the bottleneck to increase throughput and reduce latency.
The application of redundant arithmetic [42] to the conventional CORDIC allows σ_i to take values from the set {-1, 0, 1} instead of the set {-1, 1}. The σ_i values are computed iteratively, and the choice of σ_i = 0 results in the variability of the usually constant scale factor. The variable scale factor increases the area and delay for scale factor computation. The latency of this implementation is , where is the iteration stage delay in terms of full adder delay .
The double rotation and correcting rotation redundant CORDIC methods using SD arithmetic are proposed in [51], to reduce the cost of the scale factor computation. The nonpipelined and pipelined implementation of these methods require latency of 1.5 to compute final coordinates iteratively. These methods achieve a constant scale factor, increasing the latency by 50% compared to [42].
Low latency CORDIC algorithm [55] reduces the latency to compared to that in [51]. This algorithm computes iteratively the direction of rotations and coordinates. In addition, a nonpipelined architecture is also proposed in this paper using prediction technique. The latency of this architecture is .
Branching algorithm using signed digit arithmetic is proposed to achieve constant scale factor. The latency of nonpipelined and pipelined implementation of this algorithm is . This algorithm achieves 50% latency improvement over [51] to compute final coordinates iteratively. However, it requires double the hardware as two sets of modules are employed.
The direction of rotations computed using the sign estimation methods [51, 52, 55] may not be accurate; therefore, half of the computational effort is required for correction. The DCORDIC algorithm is proposed to determine the direction of rotations iteratively using the sign of the steering variable. However, this method requires an initial latency of before the CORDIC rotation starts, to obtain the first direction of rotation. The signs are obtained for the remaining iterations with one full adder delay using a bit level pipelined architecture with stages. This implementation requires a latency of to compute the final coordinates iteratively. In addition, this method requires initial register rows for skewing of input data.
All the methods presented so far reduce the latency by decreasing the iteration delay using redundant arithmetic. Since latency reduction can also be obtained by reducing the number of iterations, this has motivated the implementation of the radix-4 pipelined CORDIC processor [58], which results in a latency of .
The mixed radix CORDIC algorithm [59] is proposed using radix-2 and radix-4 rotations for designing a pipelined processor to operate in rotation and vectoring modes of circular and hyperbolic coordinate systems. The latency of this pipelined architecture requires stages with three different stage delays () as (), () and (). This architecture takes more stage delay as this is designed for various modes of operation.
The advantage of applying radix-4 rotations for all iteration stages is exploited in [43] with fewer adders as compared to [58]. For the microrotations in the range , the pipelined architecture proposed for this algorithm implementation determines values sequentially using angle accumulator. For the microrotations in the range , the values are determined from the remaining angle after iterations. The latency of this architecture to compute the final coordinates iteratively is .
In [61], P-CORDIC algorithm is proposed to eliminate path completely, using a linear relation between the rotation angle and the corresponding direction of all microrotations for rotation mode. This algorithm computes the coordinates iteratively. The latency of the nonpipelined architecture proposed to implement this algorithm for -bit precision is .
The iterative nature in the path is eliminated at the cost of scalability by the flat CORDIC algorithm [63]. This algorithm transforms recurrences (11) of the conventional CORDIC into a parallelized version, by successive substitution to express the final vectors in terms of the initial vectors, resulting in a single equation for -bit precision. The directions of rotation are precomputed before initiating the computation of coordinates. The final coordinates are computed using combinational blocks with the latency of 34/16-bit and 50/32-bit. The expressions for and variables need to be derived and the combinational building blocks have to be redesigned with each change in precision.
In [64], Para-CORDIC algorithm is proposed to precompute the direction of rotations without using ROM, while eliminating iterative path completely. This method uses additional stages for the repetition of certain microrotations to predict the direction of rotations in contrast to ROM employed in [61, 63, 65]. The latency of this Para-CORDIC is , where and represents the total number of microrotations required in MAR recoding of bits of the input angle. The values of for 16/32/64-bit precision are 5, 18, and 52, respectively.
The semiflat technique is proposed in [65], to partially eliminate the iterative nature in paths for the iterations ( for a 16-bit CORDIC and for a 32-bit CORDIC, respectively). The latency of the nonpipelined implementation of this algorithm is 33/16-bit and 49/32-bit, respectively. It is observed that this architecture is combinational after iterations and has poor scalability.
In [49], the coordinates are computed iteratively for the iterations using number of fast adders. These values are used to compute the final coordinates using two multipliers in parallel and one adder resulting in the latency of . The values for the first iterations are determined iteratively using the sign of angle accumulator . For the range , the rotation directions are generated in parallel.
11. Conclusions
In this paper, we have surveyed the algorithms for unfolded implementation of 2D rotational CORDIC algorithms. Special attention has been devoted to the systematic and comprehensive classification of solutions proposed in the literature. In addition to the pipelined implementation of nonredundant radix-2 CORDIC algorithm that has received wide attention in the past, we have discussed the importance of redundant and higher radix algorithms. We have also stressed the importance of prediction algorithms to precompute the directions of rotations and parallelization of path. It is worth noting that the considered algorithms should not be treated as mutually exclusive alternatives; rather, they should be integrated depending on the design constraints of a specific application.
We can draw final conclusions about the different algorithms to achieve efficient implementation of application specific rotational CORDIC algorithm. As far as the application of redundant arithmetic to the pipelined implementation of the conventional radix-2 CORDIC algorithm is concerned, area is doubled with reduction in the adder delay of each stage from to 2. Similarly, the hardware and iteration delay of redundant radix-2 CORDIC can be reduced by employing prediction technique for the precomputation of direction of rotations. Further latency reduction can be achieved by integrating the prediction technique with redundant radix-4 arithmetic, trading area for variable scale factor computation. Another important observation about the solutions proposed with full parallelization of path is that it affects the modularity and regularity of the architecture, leading to a poorly scalable implementation. Finally, we conclude that a solution that allows the design of a scalable architecture, applying prediction and path parallelization techniques to the redundant CORDIC algorithm, can achieve both latency reduction and throughput improvement.