Research Article  Open Access
Parallel Jacobi EVD Methods on Integrated Circuits
Abstract
Design strategies for parallel iterative algorithms are presented. In order to further study different tradeoff strategies in the design criteria for integrated circuits, a 10 × 10 Jacobi Brent-Luk EVD array with the simplified μCORDIC processor is used as an example. The experimental results show that using the μCORDIC processor is beneficial with respect to the design criteria, as it yields a smaller area, a faster overall computation time, and a lower energy consumption than the regular CORDIC processor. It is worth noting that the proposed parallel EVD method can be applied to real-time and low-power array signal processing algorithms performing beamforming or DOA estimation.
1. Introduction
We are on the edge of many important developments that will require parallel data and information processing. Transmission systems are moving to higher and higher carrier frequencies, reaching 10 GHz and above. Because of the smaller wavelength, more antennas can be implemented on a single device, leading to massive MIMO systems. Parallel VLSI architectures will be needed in order to provide the required computational power for operation at 10 GHz and above, massive MIMO, and big data processing [1, 2].
In parallel matrix computation at the circuit level, implementing an iterative algorithm on a multiprocessor array results in a tradeoff between the complexity of an iteration step and the number of required iteration steps. Therefore, as long as the algorithm's convergence properties are guaranteed, it is possible to adjust the architecture, which can significantly reduce the implementation complexity. Computing the parallel eigenvalue decomposition (EVD) with Jacobi's iterative method, as a preprocessing step to the MUSIC or ESPRIT algorithms, is used as an important example, since the convergence of this method is extremely robust to modifications of the processor elements [3–6].
In [7], it was shown that the Brent-Luk EVD architecture with a modified CORDIC for performing the plane rotations of the Jacobi algorithm can be realized in advanced VLSI design. Based on it, a Jacobi EVD array is realized in this paper by implementing a scaling-free micro-rotation CORDIC (μCORDIC) processor, which performs only a predefined number of CORDIC iterations. Therefore, the size of the processor array can be reduced, allowing a large-scale EVD array to be implemented in parallel VLSI architectures. After that, several modifications of the algorithm/processor are studied, and their impact on the design criteria is investigated for different EVD array sizes. Finally, a strategy to comply with the design criteria is established, especially in terms of balancing the number of micro-iterations against the computational complexity. The proposed architecture is well suited for real-time antenna array applications, such as a flying object carrying an antenna array for beamforming or DOA estimation, which requires a real-time, low-power, and efficient architecture for the EVD, or joint time-delay and frequency estimation using a sensor network.
This paper is organized as follows. Serial and parallel Jacobi methods are described in Section 2. In Section 3, the design issues of the parallel Jacobi EVD array are discussed, leading to the simplification from a regular full CORDIC to the μCORDIC processor with an adaptive number of iterations. Section 4 shows the implementation results. Section 5 concludes this paper.
2. Parallel Eigenvalue Decomposition
2.1. Jacobi Method
An eigenvalue decomposition of a real symmetric matrix \(A \in \mathbb{R}^{n \times n}\) is obtained by factorizing \(A\) into \(A = Q \Lambda Q^T\), where \(Q\) is an orthogonal matrix (\(Q^T Q = I\)) and \(\Lambda\) is a diagonal matrix containing the eigenvalues of \(A\). The Jacobi method approximates the EVD iteratively as follows:

\[ A_{k+1} = J_k^T A_k J_k, \qquad A_0 = A, \]

where \(J_k = J(i, j, \theta)\) is an orthonormal plane rotation by the angle \(\theta\) in the \((i, j)\) plane.
The plane rotations \(J(i, j, \theta)\), where \(1 \le i < j \le n\), can be executed in various orders to obtain the eigenvalues. The most common order of sequential plane rotations is called cyclic-by-row, meaning \((i, j)\) is chosen as follows:

\[ (i, j) = (1, 2), (1, 3), \ldots, (1, n), (2, 3), \ldots, (n-1, n). \]
The execution of all \(n(n-1)/2\) index pairs is called a sweep. The matrix \(A_k\) will converge to a diagonal matrix once enough sweeps are applied, where \(\Lambda\) contains the eigenvalues \(\lambda_1, \ldots, \lambda_n\):

\[ \lim_{k \to \infty} A_k = \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n). \]
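As an illustration, the serial cyclic-by-row Jacobi method above can be sketched in a few lines of NumPy. This is a minimal floating-point model, not the paper's hardware implementation; the function name `jacobi_evd` and the sweep count are ours.

```python
import numpy as np

def jacobi_evd(A, sweeps=10):
    """Cyclic-by-row Jacobi EVD of a real symmetric matrix A.

    Returns (eigenvalues, Q) with A ~ Q diag(eigenvalues) Q^T.
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    Q = np.eye(n)
    for _ in range(sweeps):
        for p in range(n - 1):              # cyclic-by-row index pairs (p, q)
            for q in range(p + 1, n):
                if A[p, q] == 0.0:
                    continue
                # optimal angle: tan(2*theta) = 2*a_pq / (a_qq - a_pp)
                theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q] = s
                J[q, p] = -s
                A = J.T @ A @ J             # two-sided plane rotation
                Q = Q @ J
    return np.diag(A), Q
```

Each rotation zeroes the pivot element \(a_{pq}\), so the off-diagonal Frobenius norm decreases monotonically and the iteration converges to the diagonal matrix of eigenvalues.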
2.2. Jacobi EVD Array
Instead of performing the plane rotations one by one in cyclic-by-row order, they can be separated into multiple subproblems and executed in parallel on a two-dimensional multiprocessor array. Ahmedsaid et al. [3] presented a parallel array based on Jacobi's method. For an \(n \times n\) matrix, it consists of \(n/2 \times n/2\) processing elements (PEs), and each PE contains a \(2 \times 2\) subblock of the matrix \(A\). Figure 1 shows a typical EVD array with 16 PEs. This Jacobi array can solve \(n/2\) rotation subproblems in parallel. Initially, the PE at position \((p, q)\) holds the submatrix

\[ A_{pq} = \begin{pmatrix} a_{2p-1,2q-1} & a_{2p-1,2q} \\ a_{2p,2q-1} & a_{2p,2q} \end{pmatrix}, \]

where \(1 \le p \le n/2\) and \(1 \le q \le n/2\).
A rotation angle \(\theta\) has to be chosen in order to zero out the off-diagonal elements of the submatrix by solving the symmetric \(2 \times 2\) EVD subproblem shown in the following:

\[ \begin{pmatrix} c & -s \\ s & c \end{pmatrix} \begin{pmatrix} a_{pp} & a_{pq} \\ a_{qp} & a_{qq} \end{pmatrix} \begin{pmatrix} c & s \\ -s & c \end{pmatrix} = \begin{pmatrix} \lambda_p & 0 \\ 0 & \lambda_q \end{pmatrix}, \]

where \(c = \cos\theta\) and \(s = \sin\theta\).
The maximal reduction can be obtained by applying the optimal angle of rotation:

\[ \theta = \frac{1}{2} \arctan\!\left( \frac{2\, a_{pq}}{a_{qq} - a_{pp}} \right), \]

where the range of \(\theta\) is limited to \(|\theta| \le \pi/4\).
This optimal angle \(\theta\), which annihilates the off-diagonal elements \(a_{pq}\) and \(a_{qp}\), is computed by the diagonal PEs according to (6). Once these rotation angles are computed, they are sent to the off-diagonal PEs. This transmission is indicated by the dashed lines in the vertical and horizontal directions in Figure 1. All off-diagonal PEs then perform a two-sided rotation with the corresponding rotation angles obtained from their row and column, respectively.
Once these rotations have been applied, the matrix elements are interchanged between processors, as indicated by the diagonal solid lines in Figure 1, in preparation for the next rotation step. One sweep consists of \(n - 1\) of these parallel rotation steps. After several sweeps (iterations) have been executed, the eigenvalues concentrate in the diagonal PEs.
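The parallel ordering above can be illustrated with a round-robin (tournament) pairing scheme, one common way to generate \(n-1\) steps of \(n/2\) disjoint index pairs per sweep. This is a generic sketch; the concrete exchange pattern of the Brent-Luk array in Figure 1 may differ in detail, and the function name is ours.

```python
def round_robin_orderings(n):
    """Parallel index pairings for one sweep over an n x n matrix
    (n even): n - 1 rotation steps, each containing n / 2 disjoint
    index pairs that can be executed concurrently."""
    idx = list(range(n))
    steps = []
    for _ in range(n - 1):
        # pair the first half of idx against the reversed second half
        pairs = [(min(idx[k], idx[n - 1 - k]), max(idx[k], idx[n - 1 - k]))
                 for k in range(n // 2)]
        steps.append(pairs)
        # rotate all indices except the first (tournament scheduling)
        idx = [idx[0]] + [idx[-1]] + idx[1:-1]
    return steps
```

Over the \(n-1\) steps, every index pair \((i, j)\) appears exactly once, so one pass through the schedule corresponds to one sweep.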
3. CORDIC Approach
3.1. Regular CORDIC
Within each PE, a simple way to solve the subproblem of (5) in VLSI, that is, to zero out the off-diagonal elements, is the CORDIC algorithm. An orthogonal CORDIC rotator is defined as [8, 9]

\[ \begin{aligned} x_{i+1} &= x_i - \sigma_i\, 2^{-i}\, y_i, \\ y_{i+1} &= y_i + \sigma_i\, 2^{-i}\, x_i, \\ z_{i+1} &= z_i - \sigma_i \arctan(2^{-i}), \end{aligned} \]

where \(\sigma_i \in \{-1, +1\}\) and \(i = 0, 1, \ldots, b-1\).
In the Cartesian coordinate system, the CORDIC orthogonal rotation mode can be used to compute (5) by separating the two-sided rotation into two one-sided rotations, \(R^T A_{pq}\) and \((R^T A_{pq}) R\). The first part is computed by executing the plane rotation with the desired rotation angle \(\theta\) using two CORDIC rotators, one per column vector of the \(2 \times 2\) submatrix. The CORDIC processors apply \(b\) iteration steps, usually \(b = 32\) for single floating-point precision. A constant scaling value \(K = \prod_{i=0}^{b-1} 1/\sqrt{1 + 2^{-2i}}\) is subsequently required to rescale the rotated vectors in order to retain orthonormality. Similarly, two further CORDIC rotators can be applied to compute the second, right-hand part of the two-sided rotation.
Meanwhile, the angle \(\theta\) can also be determined by using the CORDIC orthogonal vector mode: the CORDIC rotates the input vector through whatever angle is necessary to align the resulting vector with the \(x\)-axis, choosing \(\sigma_i\) in (7) such that \(y_i\) is driven toward zero.
The CORDIC in orthogonal vector mode thus computes the arctangent \(z_b \approx \arctan(y_0 / x_0)\) iteratively, if the angle accumulator is initialized with zero (\(z_0 = 0\)).
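For reference, both CORDIC modes can be sketched in a few lines. This is a floating-point model of the conventional iteration in (7); the hardware uses fixed-point shift-add operations, and the function names are ours.

```python
import math

def cordic_rotate(x, y, theta, n_iter=32):
    """Rotation mode: rotate (x, y) by theta with n_iter micro-rotations,
    steering sigma_i by the sign of the residual angle z."""
    K = 1.0                                  # accumulated gain correction
    z = theta
    for i in range(n_iter):
        sigma = 1.0 if z >= 0 else -1.0      # drive z toward zero
        x, y = x - sigma * y * 2.0**-i, y + sigma * x * 2.0**-i
        z -= sigma * math.atan(2.0**-i)
        K *= 1.0 / math.sqrt(1.0 + 2.0**(-2 * i))
    return K * x, K * y                      # apply scaling constant once

def cordic_vector(x, y, n_iter=32):
    """Vectoring mode: rotate (x, y) onto the x-axis while accumulating
    the angle; returns approximately atan2(y, x) for x > 0."""
    z = 0.0
    for i in range(n_iter):
        sigma = -1.0 if y >= 0 else 1.0      # drive y toward zero
        x, y = x - sigma * y * 2.0**-i, y + sigma * x * 2.0**-i
        z -= sigma * math.atan(2.0**-i)
    return z
```

Note that the convergence range of the conventional scheme is limited to \(|\theta| \lesssim 99.9^\circ\), which covers the \(|\theta| \le \pi/4\) range required by (6).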
In VLSI design, two common approaches can be used to realize the CORDIC dependence flow graph in hardware: the folded (serial) and the parallel (pipelined) structure [10, 11]. Note that we limit our efforts to the conventional CORDIC iteration scheme, as given in (7). In Figure 2(a), the structure of a folded CORDIC PE is shown, which requires a pair of adders for the plane rotation and another adder for steering the next angle direction (computing the following \(z_{i+1}\) and \(\sigma_{i+1}\)). All internal variables are buffered in separate registers until the iteration count is large enough to obtain the result. The signs of all three intermediate variables are fed into a control unit that generates the rotation direction flags \(\sigma_i\), which steer the add/subtract operations and keep track of the rotation angle. For example, an off-diagonal PE can directly apply the flags received from the diagonal PEs to its rotators in (8). After the rotation, the required scaling procedure can be carried out by the circuit part of Figure 2(b), where two multiplexers select the inputs of the barrel shifters. This folded dependence graph is typical for the orthogonal rotation mode and results in a small area in VLSI design.
(a) Rotating
(b) Scaling
In practice, the angle accumulator is not required for the off-diagonal PEs. The direction flags \(\sigma_i\) from (7) can be used to steer the rotators directly. Thus, the transmission on the vertical and horizontal dashed lines in Figure 1 is replaced by a sequence of flags, meaning that the computational effort for determining the optimal angle in the off-diagonal PEs can be omitted.
3.2. Simplified Rotation CORDIC
As process technologies continue to shrink, it becomes possible to directly implement a Brent-Luk EVD array with the Jacobi method [12, 13]. However, the size of the EVD array that can be implemented on a current configurable device with the regular CORDIC is still small. Therefore, the architecture must be simplified in order to integrate more processors. A scaling-free μCORDIC for performing the plane rotation in (5) is used [5, 6], where the number of inner iterations is reduced from 32 iterations to as little as one iteration.
The μCORDIC iteration can be developed from (7) as a single micro-rotation by the elementary angle \(\alpha_k = \arctan(2^{-k})\),

\[ \begin{aligned} x' &= x - \sigma\, 2^{-k}\, y, \\ y' &= y + \sigma\, 2^{-k}\, x, \end{aligned} \]

where \(K_k = 1/\sqrt{1 + 2^{-2k}}\) is the required scaling factor per iteration and \(\sqrt{1 + 2^{-2k}} - 1\) is the scaling error. The idea of the μCORDIC rotation is to reduce the number of iterations of the full CORDIC to only a few. Meanwhile, the scaling error will be small enough to be neglected, as long as the orthonormality is retained. Figure 3 shows four different methods for different sizes of rotation angles, and Table 1 shows a lookup table for the μCORDIC, listing 32 approximated rotation angles together with, for each, the rotation type, the required number of shift-add operations, and the computation cycles. Note that the approximated angles are stored as two times the rotation angle. When the rotation angle is very tiny, Type I with only one iteration complies with the limited working range, if the selected shift index \(k\) is larger than 16. In Figure 3(a), a pair of shift-add operations realizing one iteration step is sufficient. Furthermore, the rotation is scaling-free for such small angles. These orthonormal rotations are chosen such that they satisfy a predefined accuracy condition in order to approximate the original rotation angles, and they are constructed with the least computational effort.

(a) Type I
(b) Type II
(c) Type III
(d) Type IV
Next, for the Type II rotation (as shown in Figure 3(b)), when the index \(k\) is selected from 8 to 15 for small angles, two pairs of shift-add operations are enough to retain the orthonormality. Moreover, when \(k\) is selected from 5 to 7, Type III requires three rotations. No scaling is required by Types I through III. Finally, for large rotation angles, the scaling errors cannot be omitted. Figure 3(d) shows the corresponding dependence flow graph for Type IV. Besides the rotation itself, it requires two pairs of shift-add operations at the beginning of the flow graph, while 2 to 4 pairs of shift-add operations are required to compensate the scaling factor \(K_k\). In general, the cost of Type IV is bounded by a few additional pairs of shift-add operations; for example, when the index \(k\) is 2, the scaling is realized by a short sequence of shift-add operations approximating \(1/\sqrt{1 + 2^{-4}}\).
These four subtypes have three structurally identical parts: the single iteration of Type I, the scaling part of Type IV, and the second iteration of Type II. These three parts can be integrated by using multiplexers to select the data paths, as shown in Figure 4, where 2 adders, 2 shifters, and 4 multiplexers are required [5].
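A single scaling-free micro-rotation (Type I) can be modeled numerically as follows. The point of the sketch is that for shift index \(k \ge 16\), the implicit gain \(\sqrt{1 + 2^{-2k}}\) deviates from 1 by only about \(2^{-33}\), well below the \(2^{-24}\) unit roundoff of IEEE 754 single precision, so no scaling step is needed. The function name is ours.

```python
import math

def micro_rotation(x, y, k, sigma=1):
    """One scaling-free micro-rotation (Type I): rotates (x, y) by
    approximately sigma * arctan(2^-k) using a single shift-add pair."""
    return x - sigma * y * 2.0**-k, y + sigma * x * 2.0**-k

# Implicit gain of one micro-rotation; for k = 16 it is ~1 + 2^-33,
# i.e. negligible at IEEE 754 single precision (unit roundoff 2^-24).
k = 16
gain = math.sqrt(1.0 + 2.0**(-2 * k))
```

For smaller \(k\) (larger angles), the gain grows and the Type II-IV repetitions and scaling corrections described above become necessary.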
3.3. Adaptive CORDIC Iterations
To improve the computational efficiency, the μCORDIC has been modified to perform up to 6 inner iterations per rotation, denoted μCORDIC6. As the global clock in a synchronous circuit is determined by the critical path, the maximum timing delay per iteration is 6 cycles (index 1, Type IV). Therefore, the inner iteration steps of the smaller angles are repeated until their delay approaches that of the critical one. The required number of repetitions is quoted in Table 1. For example, when the rotation angle index is 8, the micro-rotation is repeated three times; when the rotation angle index is 20, it is repeated six times. Alternatively, the number of iterations can be adjusted by selecting the average angle observed during the last sweep; we denote this variant μCORDIC-mean.
4. Experimental Results
4.1. Matlab Simulation
The full CORDIC with 32 iteration steps, the μCORDIC with one iteration step, and the two adaptive modes have been tested using numerous random symmetric matrices of various sizes. Figure 5(a) shows the average number of sweeps needed to compute the eigenvalues/eigenvectors for each size of the EVD array, where the sweep count increases monotonically with the array size.
(a) The average number of sweeps versus the array size
(b) The number of shift-add operations versus the array size
(c) The required number of sweeps versus the off-diagonal norm (double precision)
(d) The required number of sweeps versus the off-diagonal norm (single precision)
For the 10 × 10 Jacobi EVD array, the μCORDIC requires 12 sweeps, while the full CORDIC requires only 6 sweeps per EVD computation. If up to six inner rotations are allowed, the sweep count becomes 10, smaller than that of the μCORDIC but larger than that of the full CORDIC. Note that using the average rotation angle to decide the rotation count, as in μCORDIC-mean, turns out to be an unattractive choice because it requires more sweeps. Although the μCORDIC requires twice as many sweeps as the full CORDIC, it greatly reduces the number of inner CORDIC rotations, which results in a lower overall computational complexity. For example, the full CORDIC PE needs 6 sweeps × 32 inner CORDIC rotations and the μCORDIC6 needs 10 sweeps × 6 inner CORDIC rotations, whereas the μCORDIC PE requires only 12 sweeps × 1 inner CORDIC rotation. Figure 5(b) shows the average number of shift-add operations required by each rotation method for different sizes of EVD arrays; the μCORDIC needs significantly fewer shift-add operations than the others. The adaptive μCORDIC6 method offers a compromise between the hardware complexity and the computational effort.
Figure 5(c) shows the off-diagonal Frobenius norm versus the number of sweeps for each array size with double floating-point precision. Each rotation method converges to the predefined stop criterion. Here \(\operatorname{off}(A)\) denotes the Frobenius norm of the off-diagonal elements of \(A\), that is, \(\operatorname{off}(A) = \sqrt{\sum_{i \ne j} a_{ij}^2}\).
Figure 5(d) shows the reduction of the off-diagonal Frobenius norm versus the number of sweeps for single floating-point precision. It can be noticed that the off-norms do not reach the convergence criterion, and each size of the EVD array has a different attainable stop criterion for each rotation method (default IEEE 754 single precision). Therefore, we first analyze the Frobenius norm of the off-diagonal elements in Matlab and observe it until it reaches its maximal reduction. Afterwards, a lookup table can be generated, and these stop criteria can be assigned directly to the target hardware circuit or IP component.
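The lookup-table generation described above can be sketched as follows: sweep until the off-diagonal norm stops decreasing and record the reachable norm as the per-size stop criterion. Exact Jacobi rotations are used here as a stand-in for the rotation method under test, and the function names are ours.

```python
import numpy as np

def offnorm(A):
    """Frobenius norm of the off-diagonal part of A."""
    return np.sqrt(np.sum(A**2) - np.sum(np.diag(A)**2))

def sweep(A):
    """One cyclic-by-row sweep of exact Jacobi rotations."""
    A = A.copy()
    n = A.shape[0]
    for p in range(n - 1):
        for q in range(p + 1, n):
            th = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
            c, s = np.cos(th), np.sin(th)
            J = np.eye(n)
            J[p, p] = J[q, q] = c
            J[p, q], J[q, p] = s, -s
            A = J.T @ A @ J
    return A

def find_stop_criterion(A, max_sweeps=30):
    """Sweep until off(A) stops decreasing; return the sweep count and
    the reachable off-norm (the lookup-table entry for this size)."""
    prev = offnorm(A)
    for k in range(1, max_sweeps + 1):
        A = sweep(A)
        cur = offnorm(A)
        if cur >= prev:          # maximal reduction reached
            return k, cur
        prev = cur
    return max_sweeps, prev
```

In the paper's flow, the same measurement would be repeated per array size and per rotation method (full CORDIC, μCORDIC, μCORDIC6) to fill the stop-criterion table.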
4.2. VLSI Implementation
The μCORDIC is modeled in VHDL with a resizing feature and compared to the folded full CORDIC. The two methods have been integrated into parallel EVD arrays of several sizes through a configurable interface. They have then been synthesized using the Synopsys Design Compiler with the TSMC 45 nm standard cell library. Note that the word length is 32 bits with IEEE 754 single floating-point precision for both CORDIC methods, using the same floating-point unit from OpenCores. Table 2 lists the synthesis results for area, timing delay, and power consumption.

As expected, the combinational logic area and the power consumption of the μCORDIC PE are much smaller than those of the full CORDIC. Furthermore, the time required to compute the EVD of a symmetric matrix can be estimated as

\[ T_{\mathrm{EVD}} = T_{\mathrm{crit}} \cdot N_{\mathrm{iter}} \cdot N_{\mathrm{sweep}} \cdot n, \]

where \(T_{\mathrm{crit}}\) is the critical timing delay, \(N_{\mathrm{iter}}\) the number of inner CORDIC rotations, \(N_{\mathrm{sweep}}\) the average number of outer sweeps, and \(n\) the size of the matrix.
The total timing delay per EVD operation is thus given by the critical timing delay × the number of inner CORDIC rotations × the average number of outer sweeps × the size of the matrix. It can be observed that the total operation time depends on the relationship between the inner CORDIC rotations and the outer sweeps. A speedup by a factor of 21.4 is obtained by reducing the number of inner CORDIC rotations. Although the reduction in power consumption is less significant, due to the extra controller and multiplexers of the μCORDIC, the μCORDIC consumes much less energy per EVD computation because of the shorter computation time. Note that the μCORDIC PE requires two inner iterations on average, as the rotation cycles range from six down to one inner iteration, as shown in Table 1. Figure 6 shows the energy consumption for the different array sizes. Both simplified rotation methods consume much less energy than the full CORDIC: on average, the μCORDIC6 achieves an energy reduction by a factor of 40.9 and the μCORDIC by a factor of 104.3 compared to the full CORDIC.
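The rotation-count arithmetic behind this speedup can be checked directly from the figures in Section 4.1. The count ratio alone gives a factor of 16; the remaining gain up to the reported 21.4× speedup comes from the shorter critical path of the μCORDIC PE.

```python
# Total inner rotations per EVD = sweeps x inner CORDIC iterations,
# using the sweep counts reported for the 10 x 10 array in Section 4.1.
full_cordic = 6 * 32    # full CORDIC: 6 sweeps x 32 micro-rotations
mu_cordic6 = 10 * 6     # adaptive mode: 10 sweeps x up to 6 repetitions
mu_cordic = 12 * 1      # simplified:  12 sweeps x 1 micro-rotation

# 16x fewer rotations; the reported 21.4x total speedup additionally
# reflects the shorter critical path of the simplified processor.
reduction = full_cordic / mu_cordic
```

The same bookkeeping explains the energy figures: fewer shift-add operations per EVD directly translate into less switching activity and hence lower energy per computation.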
In [14], a Jacobi cyclic-by-row EVD algorithm [15] has been implemented with a single CORDIC processor. Since it requires a very complex controller and lookup tables, its throughput is not comparable with a true Brent-Luk parallel EVD array [13]. In [13], the full CORDIC for the Jacobi Brent-Luk EVD parallel architecture was implemented in an FPGA; however, current configurable devices can only accommodate a small EVD array. Our experimental results show that performing the unitary rotation with the μCORDIC processor is a good solution: it requires a smaller area, improves the overall computation time, and reduces the energy consumption. Furthermore, the unitary-rotation method can also be applied to other, more efficient CORDIC architectures, as long as the orthogonality is retained during the CORDIC iterations, for example, pipelined CORDIC [16, 17] or rotators implemented with better adder structures [18, 19].
5. Conclusions
The EVD was computed by the parallel Jacobi method, which was selected as an example of a typical iterative algorithm with very robust convergence properties. A configurable Jacobi EVD array with both the full CORDIC and the μCORDIC was implemented in order to further study the tradeoff strategies in the design criteria for parallel integrated circuits. The experimental results indicate that the presented μCORDIC method reduces the size of the combinational logic, speeds up the overall computation, and lowers the energy consumption.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The authors would like to thank the National Science Council of Taiwan for the support under the contract of NSC 1012218E150001.
References
S. Aggarwal and K. Khare, "CORDIC-based window implementation to minimise area and pipeline depth," IET Signal Processing, vol. 7, no. 5, pp. 427–435, 2013.
H. M. Ahmed, J. Delosme, and M. Morf, "Highly concurrent computing structure for matrix arithmetic and signal processing," IEEE Computer Magazine, vol. 15, no. 1, pp. 65–82, 1982.
A. Ahmedsaid, A. Amira, and A. Bouridane, "Improved SVD systolic array and implementation on FPGA," in Proceedings of the IEEE International Conference on Field-Programmable Technology, pp. 3–42, 2003.
R. Andraka, "Survey of CORDIC algorithms for FPGA based computers," in Proceedings of the ACM/SIGDA 6th International Symposium on Field Programmable Gate Arrays (FPGA '98), pp. 191–200, February 1998.
I. Bravo, P. Jiménez, M. Mazo, J. L. Lázaro, and A. Gardel, "Implementation in FPGAs of Jacobi method to solve the eigenvalue and eigenvector problem," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '06), pp. 1–4, Madrid, Spain, August 2006.
R. P. Brent and F. T. Luk, "The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays," SIAM Journal on Scientific and Statistical Computing, vol. 6, no. 1, pp. 69–84, 1985.
G. H. Golub and C. F. van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, Md, USA, 3rd edition, 1996.
J. Götze and G. J. Hekstra, "An algorithm and architecture based on orthonormal μ-rotations for computing the symmetric EVD," Integration, the VLSI Journal, vol. 20, no. 1, pp. 21–39, 1995.
J. Götze, S. Paul, and M. Sauer, "An efficient Jacobi-like algorithm for parallel eigenvalue computation," IEEE Transactions on Computers, vol. 42, no. 9, pp. 1058–1065, 1993.
A. Hakkarainen, J. Werner, K. R. Dandekar, and M. Valkama, "Widely-linear beamforming and RF impairment suppression in massive antenna arrays," Journal of Communications and Networks, vol. 15, no. 4, pp. 383–397, 2013.
S. Klauke and J. Götze, "Low power enhancements for parallel algorithms," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 234–237, 2001.
Y. Liu, C.-S. Bouganis, and P. Y. K. Cheung, "Hardware architectures for eigenvalue computation of real symmetric matrices," IET Computers and Digital Techniques, vol. 3, no. 1, pp. 72–84, 2009.
P. K. Meher and S. Y. Park, "CORDIC designs for fixed angle of rotation," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 2, pp. 217–228, 2013.
K. K. Parhi and T. Nishitani, Digital Signal Processing for Multimedia Systems, Marcel Dekker, 1999.
S. Purohit and M. Margala, "Investigating the impact of logic and circuit implementation on full adder performance," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 7, pp. 1327–1331, 2012.
B. Ramkumar and H. M. Kittur, "Low-power and area-efficient carry select adder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 2, pp. 371–375, 2012.
F. Rusek, D. Persson, B. K. Lau et al., "Scaling up MIMO: opportunities and challenges with very large arrays," IEEE Signal Processing Magazine, vol. 30, no. 1, pp. 40–46, 2013.
C.-C. Sun and J. Götze, "VLSI circuit design concept for parallel iterative algorithms in nanoscale," in Proceedings of the 9th International Symposium on Communications and Information Technology (ISCIT '09), pp. 688–692, Icheon, Republic of Korea, September 2009.
J. S. Walther, "A unified algorithm for elementary functions," in Proceedings of the Spring Joint Computer Conference, pp. 379–385, 1971.
Copyright
Copyright © 2014 Chi-Chia Sun et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.