Mathematical Problems in Engineering

Volume 2015, Article ID 870569, 11 pages

http://dx.doi.org/10.1155/2015/870569

## High Level Synthesis FPGA Implementation of the Jacobi Algorithm to Solve the Eigen Problem

^{1}Electronics Department, University of Alcalá, Alcalá de Henares, 28805 Madrid, Spain

^{2}School of Computing, Telecommunications and Networks, Birmingham City University, Millennium Point, Birmingham B4 7XG, UK

Received 2 December 2014; Revised 19 January 2015; Accepted 19 January 2015

Academic Editor: José R. C. Piqueira

Copyright © 2015 Ignacio Bravo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We present a hardware implementation of the Jacobi algorithm to compute the eigenvalue decomposition (EVD). The computation of eigenvalues and eigenvectors has many applications where real time processing is required, and thus hardware implementations are often mandatory. Some of these implementations have been carried out with field programmable gate array (FPGA) devices using low level register transfer level (RTL) languages. In the present study, we used the Xilinx Vivado HLS tool to develop a high level synthesis (HLS) design and evaluated different hardware architectures. After analyzing the design for different input matrix sizes and various hardware configurations, we compared it with the results of other studies reported in the literature, concluding that although resource usage may be higher when HLS tools are used, the design performance is equal to or better than low level hardware designs.

#### 1. Introduction

Eigenvalue calculation is a problem that arises in many fields of science and engineering, such as computer vision [1], business and finance [2], power electronics [3], and wireless sensor networks [4]. In these applications, eigenvalues are normally used in a decision process, so real time performance is often required.

Since eigenvalue computation is usually based on operations such as matrix multiplication and matrix inversion, and a large amount of data must be handled, it is computationally expensive and can represent a potential bottleneck for the rest of the algorithm.

The algorithms involved in eigenvalue and eigenvector computations are generally iterative and present strong data dependencies between iterations; however, many similar operations can be carried out in parallel in the same iteration, such as multiply and accumulate in matrix products.

Although this parallelism cannot be fully exploited by general purpose processors or by more specialized digital signal processors (DSPs), some devices leverage this feature by executing repetitive operations in parallel.

For example, graphic processing units (GPUs) are useful when a large number of multiply and accumulate operations are involved, since they are designed to work in graphics processing, which is based on vector and matrix operations.

For similar reasons, field programmable gate arrays (FPGAs) display a remarkable capacity to carry out repetitive operations in parallel.

There are many applications where matrices are real and symmetric, with sizes not exceeding [5, 6]. In this situation, EVD computation is easier, and there are many algorithms that take advantage of this.

Some of the applications where the symmetric eigenvalue problem arises include the principal component analysis (PCA) statistical technique used for dimension reduction in data analysis [1, 7] and the multiple signal classification (MUSIC) algorithm [8] which uses a covariance matrix (which is real and symmetric by definition). In these studies, FPGA devices were employed to solve the problem, using a specialized hardware module to compute the EVD.

As mentioned previously, most of the methods used to solve the eigen problem are iterative. Depending on how the operations are performed on the initial matrix and the results obtained, a distinction can be made between purely iterative methods (Jacobi, power iterations) [9] and algorithms that perform some kind of transformation of the initial matrix (Lanczos-bisection, Householder-QR) [10, 11]. A distinction can also be made between algorithms that output all the eigenvalues (Jacobi, QR) and those that only compute extremal eigenvalues or the one closest to a given value (power iterations, Lanczos).

In the studies cited above, the main method employed for eigenvalue and eigenvector computation was the Jacobi algorithm, since its characteristics render it highly suitable for a parallel implementation. The Jacobi algorithm can be implemented in such a way that there are almost no data dependencies between operations in the same iteration and it is possible to use systolic architectures, as has been proposed by Brent et al. [12]. In addition, all the operations involved can be carried out by using the CORDIC algorithm [13], as has been shown by Cavallaro and Luk in [14], and this can be implemented simply with add and shift operations, which only use a small amount of hardware resources compared to multiplication and division operations. Another advantage of the Jacobi method is that its round-off error is very stable [9].

However, there are other methods for computing extremal eigenvalues that can take advantage of the computing power provided by FPGAs, such as the QR algorithm or the Lanczos [15] method, since some of their operations can be carried out in parallel. In this study, we also present an FPGA floating point implementation of the QR algorithm. In terms of a practical FPGA implementation, it is a challenging task to make efficient use of the available resources while at the same time meeting real time requirements. This is mainly because the analysis of algorithm data dependencies and their parallelization is extremely difficult.

Furthermore, FPGA implementations are usually carried out in register transfer level (RTL) languages where there is a very low level of abstraction, and it is therefore necessary to manage resources and timing carefully. This also implies that when some architectural decisions are made, it is difficult to go back without recoding most of the system. Nevertheless, this kind of system can be very efficient in terms of the total amount of resources used.

When an algorithm with a significant mathematical workload, such as the ones mentioned before, has to be implemented in an FPGA, a large part of the design process consists of partitioning the system into small units to achieve the required degree of concurrency. Design time can be shortened by using a high level synthesis (HLS) tool. Such tools raise the abstraction level, allowing the algorithm to be specified in a high level programming language such as C or C++. Thanks to this, the designer can focus on the algorithm itself, while the tool performs the software to hardware conversion, guided by user directives that range from resource usage to degree of concurrency. Some of the tools that have become popular in recent years are Impulse C [16], Catapult C by Calypto [17], and Vivado HLS by Xilinx [18].

Hardware implementations of the Jacobi algorithm are very popular because its characteristics permit different architectural designs depending on the required performance. In the literature there are systolic implementations where speed is the most important requirement [19, 20], serial designs where low resource consumption is selected [1, 21], and a meet in the middle approach (semiparallel) where both criteria are balanced [8]. In this study, the Jacobi algorithm was implemented using the Xilinx Vivado HLS tool, to explore different hardware implementations and compare them to select the most efficient one in terms of execution time and FPGA resources used.


To gain a better understanding of the results obtained, we also carried out a floating point implementation of the QR algorithm using Vivado HLS. The results obtained were then compared to similar systems described in the literature.

The rest of the paper is organized as follows. In Section 2, the mathematical background behind the Jacobi and QR algorithms is presented together with some details relevant to implementation. The proposed implementations and their parameters are discussed in Sections 3 and 4, respectively. Finally, the system performance is compared with other design approaches in Section 5 and we present our conclusions in Section 6.

#### 2. Mathematical Background

The eigenvalue problem [22] is one of the main questions in numerical algebra, and due to the many applications in which it is useful it has attracted much research attention in recent years [23–25]. The formulation of the problem is simple, but it can be approached from very different angles:

$$A v = \lambda v, \tag{1}$$

where $\lambda$ and $v$ are an eigenvalue and an eigenvector, respectively, of a matrix $A$. Although there is an analytical solution, it involves the calculation of polynomial roots, which can be expressed in closed form only up to degree four, and therefore it is only suitable for low order matrices ($n \leq 4$). When the problem has a higher order, numerical algorithms must be used. In this section, the mathematical background of the Jacobi and QR algorithms is presented as applied to computing the eigenvalues of real, symmetric matrices.

##### 2.1. The Jacobi Algorithm

The Jacobi algorithm [26] was first proposed in 1846 but did not become popular until its computational implementation was explored. The singular value decomposition (SVD) of a matrix $A$ is given by

$$A = U \Sigma V^{T}, \tag{2}$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix storing the singular values of $A$ on its diagonal. For the case where $A$ is real and symmetric, we can rewrite (2) as

$$A = V \Lambda V^{T}, \tag{3}$$

where $\Lambda$ is a diagonal matrix holding the eigenvalues of $A$ and the columns of $V$ are the corresponding eigenvectors.

The idea behind the Jacobi method is to perform a series of similarity transformations (i.e., pre- and postmultiplication by an orthogonal matrix) on $A$, rendering it more diagonal in each iteration. This can be written as an iterative expression:

$$A_{k+1} = J_{k}^{T} A_{k} J_{k}, \tag{4}$$

where $k$ represents the current iteration and $k+1$ the next one. The sequence of transformation matrices $J_k$ is chosen in such a way that the double multiplication renders the updated $A_{k+1}$ a little more diagonal than its predecessor $A_k$. A popular choice for $J_k$ is the Givens rotation matrix $J(p, q, \theta)$, which is equal to the identity except for four elements:

$$J_{pp} = c, \quad J_{pq} = s, \quad J_{qp} = -s, \quad J_{qq} = c, \tag{5}$$

where $c$ and $s$ are the cosine and sine of the rotation angle $\theta$, respectively. The parallelism of the algorithm resides in this choice: if the product (4) is studied, it can easily be seen that only the elements in rows and columns $p$ and $q$ of $A_k$ are modified, so multiple similarity transformations acting on disjoint index pairs can be carried out at the same time.
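To make the transformation (4)–(5) concrete, the following sketch applies one Jacobi rotation to a symmetric matrix in floating point C++. This is our own minimal illustration, not the paper's fixed-point HLS design; the constant `N` and the helper name `jacobi_rotate` are assumptions.

```cpp
#include <cassert>
#include <cmath>

// One Jacobi similarity transformation J^T * A * J on a symmetric N x N
// matrix, using the Givens rotation angle that zeroes the (p, q) pair.
// Illustrative floating-point sketch; names and sizes are our assumptions.
const int N = 3;

void jacobi_rotate(double A[N][N], int p, int q) {
    // Rotation angle: theta = 0.5 * atan2(2*a_pq, a_qq - a_pp), as in (8).
    double theta = 0.5 * std::atan2(2.0 * A[p][q], A[q][q] - A[p][p]);
    double c = std::cos(theta), s = std::sin(theta);
    for (int k = 0; k < N; ++k) {            // A <- A * J (columns p, q)
        double akp = A[k][p], akq = A[k][q];
        A[k][p] = c * akp - s * akq;
        A[k][q] = s * akp + c * akq;
    }
    for (int k = 0; k < N; ++k) {            // A <- J^T * A (rows p, q)
        double apk = A[p][k], aqk = A[q][k];
        A[p][k] = c * apk - s * aqk;
        A[q][k] = s * apk + c * aqk;
    }
}
```

Note that only rows and columns `p` and `q` are touched, which is precisely what allows rotations on disjoint index pairs to run concurrently.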

Eigenvectors of $A$ are obtained by accumulating the rotation matrices used in eigenvalue computation, as shown in the following expression:

$$V_{k+1} = V_{k} J_{k}, \qquad V_{0} = I. \tag{6}$$

To take full advantage of this, Brent and Luk [19] proposed a systolic architecture in which the initial $n \times n$ matrix is mapped onto a set of $2 \times 2$ matrices, such as the one shown in expression (7). This leaves a subset of $2 \times 2$ diagonalization problems which can easily be solved in hardware making use of the CORDIC (COordinate Rotation DIgital Computer) algorithm [27], as shown in [14]:

$$\begin{pmatrix} a_{pp} & a_{pq} \\ a_{qp} & a_{qq} \end{pmatrix}. \tag{7}$$

Since $2 \times 2$ diagonalization problems now have to be solved, rotation angles ($\theta$) are also required. Given that $A$ is symmetric, the angle is selected from the $2 \times 2$ submatrices of (7) according to expression (8), so up to $n/2$ off-diagonal pairs (where $n$ is the order of $A$) can be eliminated in each iteration. As will now be shown, angle calculation can also be carried out by the CORDIC algorithm, making efficient use of resources in the implementation:

$$\theta = \frac{1}{2} \arctan\left( \frac{2 a_{pq}}{a_{qq} - a_{pp}} \right). \tag{8}$$

After each update of $A_k$, it is necessary to bring new nonzero elements near the main diagonal so that they can be eliminated in the next iteration, repeating the entire process until the off-diagonal elements are negligible and $A_k$ is diagonal to the desired precision. To this end, several rows and columns are interchanged, as will be explained in Section 3.
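The overall iteration just described (rotate, then repeat until the off-diagonal mass is negligible) can be sketched in plain C++ as follows. This is our own floating point illustration with a simple cyclic ordering of index pairs, not the paper's systolic fixed-point design; `off_norm`, `jacobi_eigenvalues`, and the tolerance are assumed names and values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Cyclic Jacobi sketch: sweep over all (p, q) pairs, rotating each one,
// until the Frobenius norm of the off-diagonal part drops below eps.
// The diagonal of A then holds the eigenvalues.
const int N = 3;

double off_norm(const double A[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            if (i != j) s += A[i][j] * A[i][j];
    return std::sqrt(s);
}

void jacobi_eigenvalues(double A[N][N], double eps) {
    while (off_norm(A) > eps) {
        for (int p = 0; p < N - 1; ++p)
            for (int q = p + 1; q < N; ++q) {
                if (std::fabs(A[p][q]) < 1e-15) continue;  // already zero
                double theta =
                    0.5 * std::atan2(2.0 * A[p][q], A[q][q] - A[p][p]);
                double c = std::cos(theta), s = std::sin(theta);
                for (int k = 0; k < N; ++k) {   // A <- A * J
                    double akp = A[k][p], akq = A[k][q];
                    A[k][p] = c * akp - s * akq;
                    A[k][q] = s * akp + c * akq;
                }
                for (int k = 0; k < N; ++k) {   // A <- J^T * A
                    double apk = A[p][k], aqk = A[q][k];
                    A[p][k] = c * apk - s * aqk;
                    A[q][k] = s * apk + c * aqk;
                }
            }
    }
}
```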

As mentioned before, the CORDIC algorithm [13, 27] can be used to solve the double rotation (4) on every $2 \times 2$ submatrix and to calculate the rotation angles. One great advantage of CORDIC is that it can be expressed by means of shift and add operations, which are easy to implement in reconfigurable hardware. It also facilitates working with fixed point representation, where the word length is not fixed to 32 or 64 bits as occurs with the floating point standard, thus making more efficient use of FPGA resources. For example, in the present study, the entire system used a word length of 18 bits, since this is the input word length of the Xilinx FPGA multipliers.

The fundamental idea behind CORDIC is to carry out a series of rotations (called microrotations) on an input vector to obtain another vector as output. This is performed in three different coordinate systems (linear, circular, and hyperbolic), and it is possible to obtain various basic functions, such as multiplication, trigonometric operations, logarithms, and exponentials [28], in the components of the output vector, given some initial input vector conditions [27]. Microrotation values are selected as shown in the following expression:

$$\alpha_{i} = \arctan\left( 2^{-i} \right), \quad i = 0, 1, 2, \ldots \tag{9}$$

The choice of $\tan \alpha_i$ as an inverse power of 2 is what allows the algorithm to be implemented using shifts, since multiplication or division by a power of 2 is equivalent to a shift to the left or the right, respectively, in binary representation.

In the circular coordinate system, it is possible to obtain the two operations required for the Jacobi algorithm: the Givens rotation and the angle calculation (arctangent). Since CORDIC performs a series of rotations on a vector, it can be seen as an iterative expression:

$$\begin{aligned} x_{i+1} &= x_{i} - \sigma_{i}\, 2^{-i}\, y_{i}, \\ y_{i+1} &= y_{i} + \sigma_{i}\, 2^{-i}\, x_{i}, \\ z_{i+1} &= z_{i} - \sigma_{i}\, \alpha_{i}, \end{aligned} \tag{11}$$

where $(x_i, y_i)$ is a vector determined by its horizontal and vertical components and $z_i$ is a variable that keeps track of the total rotated angle. In each iteration the direction of the rotation, $\sigma_i = \pm 1$, is determined by the operation that is to be performed. In the particular case of the circular coordinate system, the two algorithm operation modes are as follows.

(i) *Rotation mode:* the algorithm takes a vector $(x_0, y_0)$ and an angle $z_0$ and outputs a new vector $(x_n, y_n)$, which corresponds to $(x_0, y_0)$ rotated by $z_0$ radians. To do so, $\sigma_i$ is selected in such a way that $z_i$ approaches zero in each iteration. When $z_n \approx 0$, it can be assumed that the starting vector has been fully rotated by the angle $z_0$.

(ii) *Vectoring mode:* the algorithm input is again a vector $(x_0, y_0)$, whose phase we want to compute. To do so, the vector is rotated until its vertical component is zero, selecting $\sigma_i$ accordingly. Since $z_i$ accumulates the microrotation angles, when $y_n \approx 0$ the variable $z_n$ will correspond to the initial phase of the vector. In addition, since $y_n \approx 0$, $x_n$ will store the absolute value of the input vector (up to the CORDIC gain).

If this is expressed mathematically, in *rotation mode* $\sigma_i = \operatorname{sign}(z_i)$, and the relation between the input and the output is as follows:

$$\begin{aligned} x_{n} &= K \left( x_{0} \cos z_{0} - y_{0} \sin z_{0} \right), \\ y_{n} &= K \left( y_{0} \cos z_{0} + x_{0} \sin z_{0} \right), \\ z_{n} &= 0. \end{aligned} \tag{12}$$

On the other hand, if the algorithm works in *vectoring mode*, $\sigma_i$ must be computed as $\sigma_i = -\operatorname{sign}(y_i)$ (for $x_0 > 0$), and the relation between the input and the output is now as shown in the following expression:

$$\begin{aligned} x_{n} &= K \sqrt{x_{0}^{2} + y_{0}^{2}}, \\ y_{n} &= 0, \\ z_{n} &= z_{0} + \arctan\left( \frac{y_{0}}{x_{0}} \right). \end{aligned} \tag{13}$$

As can be observed, in expressions (12) and (13) the result is scaled by a constant value $K$. This is due to the nonorthogonality of the microrotations, which modifies the absolute value of the vector. To solve this, the results obtained can be scaled by $1/K$. In addition, $K$ is known in advance and only varies with the number of microrotations performed. Since the number of microrotations $n$ is fixed, $1/K$ can be precomputed and stored in a memory element to correct the output data:

$$K = \prod_{i=0}^{n-1} \sqrt{1 + 2^{-2i}} \approx 1.6468. \tag{14}$$
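The vectoring mode of (11)–(14), including the gain correction, can be sketched as follows. This is our own floating point illustration of the shift-and-add idea (the paper's design uses 18-bit fixed point and hardware shifts); `ITER` and `cordic_vectoring` are assumed names.

```cpp
#include <cassert>
#include <cmath>

// CORDIC vectoring mode in the circular coordinate system: drive y toward
// zero; z then approximates atan2(y0, x0) (for x0 > 0) and x approximates
// K * sqrt(x0^2 + y0^2). The gain K is precomputed from the iteration count.
const int ITER = 24;

void cordic_vectoring(double x0, double y0, double *mag, double *phase) {
    double x = x0, y = y0, z = 0.0;
    double pow2 = 1.0;                           // 2^{-i}, halved each step
    for (int i = 0; i < ITER; ++i) {
        double sigma = (y >= 0.0) ? -1.0 : 1.0;  // sigma_i = -sign(y_i)
        double xn = x - sigma * pow2 * y;        // microrotation (11)
        double yn = y + sigma * pow2 * x;
        z -= sigma * std::atan(pow2);            // accumulate alpha_i
        x = xn;
        y = yn;
        pow2 *= 0.5;
    }
    // Gain K = prod_i sqrt(1 + 2^{-2i}), as in (14); correct the output.
    double K = 1.0;
    for (int i = 0; i < ITER; ++i)
        K *= std::sqrt(1.0 + std::pow(2.0, -2.0 * i));
    *mag = x / K;
    *phase = z;
}
```

In hardware, the multiplications by `pow2` become right shifts and `std::atan(pow2)` becomes a small lookup table, which is what makes CORDIC so cheap in FPGA resources.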

##### 2.2. The QR Algorithm

As indicated in Section 1, to show the effectiveness of the Jacobi algorithm over other methods when it is implemented in reconfigurable logic, a QR algorithm implementation was also carried out using Vivado HLS.

In the QR algorithm, another condition must be added to our initial matrix, which was real and symmetric: now it must also be tridiagonal, like the one presented in the following expression:

$$A = \begin{pmatrix} a_{1} & b_{1} & & \\ b_{1} & a_{2} & \ddots & \\ & \ddots & \ddots & b_{n-1} \\ & & b_{n-1} & a_{n} \end{pmatrix}. \tag{15}$$

To obtain the eigenvalues of $A$, a reduction strategy is used where the off-diagonal entries $b_i$ are zeroed. When, for example, $b_{n-1} \approx 0$, the last row and column can be removed, since $a_n$ is then an eigenvalue of $A$.

To accomplish this, the QR algorithm produces a sequence of matrices similar to $A$ ($A_0 = A$) [29]. To start with, the initial matrix is factored as follows:

$$A_{k} = Q_{k} R_{k}, \tag{16}$$

where $Q_k$ is an orthogonal matrix and $R_k$ is an upper triangular matrix. In a similar way, $A_{k+1}$ is defined as

$$A_{k+1} = R_{k} Q_{k} = Q_{k}^{T} A_{k} Q_{k}. \tag{17}$$
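One step of this factor-and-recombine iteration can be sketched for a $2 \times 2$ symmetric matrix, where $Q$ reduces to a single Givens rotation. This is our own unshifted illustration, not the paper's implementation; the name `qr_step_2x2` is an assumption.

```cpp
#include <cassert>
#include <cmath>

// One unshifted QR step A_{k+1} = R_k * Q_k on a symmetric 2x2 matrix.
// Q is the Givens rotation that zeroes A[1][0] in the factorization A = Q*R.
void qr_step_2x2(double A[2][2]) {
    double r = std::hypot(A[0][0], A[1][0]);
    double c = A[0][0] / r, s = A[1][0] / r;
    // R = Q^T * A, with Q = [[c, -s], [s, c]]; R[1][0] is zero by design.
    double R[2][2] = {
        {  c * A[0][0] + s * A[1][0],  c * A[0][1] + s * A[1][1] },
        { -s * A[0][0] + c * A[1][0], -s * A[0][1] + c * A[1][1] }
    };
    // A_{k+1} = R * Q (still symmetric, similar to the original A).
    A[0][0] =  R[0][0] * c + R[0][1] * s;
    A[0][1] = -R[0][0] * s + R[0][1] * c;
    A[1][0] =  R[1][0] * c + R[1][1] * s;
    A[1][1] = -R[1][0] * s + R[1][1] * c;
}
```

Iterating this step drives the off-diagonal entry toward zero, leaving the eigenvalues on the diagonal; the shift discussed next accelerates exactly this convergence.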

As presented thus far, the convergence of the method is slow. To speed it up, a shift $\mu_k$ close to the eigenvalue we are looking for can be applied to the matrix. Thus, we now have a new sequence of matrices such that

$$A_{k} - \mu_{k} I = Q_{k} R_{k}, \qquad A_{k+1} = R_{k} Q_{k} + \mu_{k} I. \tag{18}$$

After each shift, the eigenvalues of $A_k$ will be shifted by the same amount, and it is easy to recover the original values by accumulating the shifts after each iteration. To obtain an approximate value of $\mu_k$ for each iteration, a popular choice is to calculate the eigenvalues of the trailing $2 \times 2$ submatrix

$$\begin{pmatrix} a_{n-1} & b_{n-1} \\ b_{n-1} & a_{n} \end{pmatrix} \tag{19}$$

and take the one closest to $a_n$. This is called the Wilkinson shift [26]. Finally, the factors $Q_k$ are chosen to be products of Givens rotation matrices that eliminate the appropriate subdiagonal elements.
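The Wilkinson shift from the trailing $2 \times 2$ block (19) has a simple closed form, sketched below. This is our own illustration of the standard formula, not code from the paper; the helper name `wilkinson_shift` is an assumption.

```cpp
#include <cassert>
#include <cmath>

// Wilkinson shift: the eigenvalue of the trailing 2x2 block
// [[a1, b], [b, a2]] that is closest to a2.
double wilkinson_shift(double a1, double b, double a2) {
    double d = 0.5 * (a1 - a2);
    double sgn = (d >= 0.0) ? 1.0 : -1.0;    // sign(d), with sign(0) = +1
    return a2 + d - sgn * std::sqrt(d * d + b * b);
}
```

For a $2 \times 2$ matrix the shift is an exact eigenvalue, which is why shifted QR deflates the last row and column so quickly.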

#### 3. Implementation Details

As discussed in Section 1, the implementation tool used in this study was the Xilinx Vivado HLS [18], which allows the use of a high level programming language (C, C++, or SystemC) instead of a hardware description language such as VHDL or Verilog. Although it presents some drawbacks, this means of implementing the algorithms enabled us to test very different hardware configurations of the system with very few design modifications.

Due to its iterative nature, the Jacobi method can easily be expressed as a series of loops, which is the main element used in Vivado HLS to implement the algorithms.

This can be done via #pragma directives or Tcl scripting, giving each part of the design different attributes that determine the degree of concurrency or the desired resource constraints. Some important operations that can be performed on design loops include the following.

(i) *Pipeline:* tries to achieve parallelism between the operations performed in the same iteration. The pipeline directive can also be applied to the entire design, so that all internal loops are unrolled, making parallelism between two system iterations possible.

(ii) *Unroll:* implements each iteration as a separate hardware instance.

(iii) *Dataflow:* tries to achieve parallelism between different loops that are data dependent, implementing communication channels such as ping-pong memories.
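As a hypothetical sketch of how such directives appear in Vivado HLS C++ source (PIPELINE, UNROLL, and DATAFLOW are real Vivado HLS pragma names; the function, loop label, and sizes here are our own illustration), a pipelined multiply-accumulate loop might look like this. On a standard software compiler the pragma is ignored, so the function remains testable on a PC.

```cpp
#include <cassert>

// Multiply-accumulate loop with an HLS pipeline directive. With II=1 the
// synthesized loop accepts one new iteration per clock cycle; in software
// the pragma has no effect and the function computes an ordinary dot product.
const int N = 8;

int dot_product(const int a[N], const int b[N]) {
    int acc = 0;
MAC_LOOP:
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];   // one multiply-accumulate per iteration
    }
    return acc;
}
```

The loop label (`MAC_LOOP`) is what lets the equivalent Tcl directive (`set_directive_pipeline`) target this loop without touching the source.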

As mentioned before, we can also apply resource constraints to the designs, limiting the number of instances of a particular hardware element. The elements that can be limited range from operators (+, ×) to specific resources such as DSP48E and BRAM memory blocks. User defined elements, such as a function called multiple times, can also be constrained, and this is very important for design optimization, as will be shown later.

Figure 1 shows a functional diagram of the Jacobi algorithm. Since the C programming language was used, each task was implemented as a for loop and labeled to make optimization easier.