Abstract

We present a hardware implementation of the Jacobi algorithm to compute the eigenvalue decomposition (EVD). The computation of eigenvalues and eigenvectors has many applications where real time processing is required, and thus hardware implementations are often mandatory. Some of these implementations have been carried out with field programmable gate array (FPGA) devices using low level register transfer level (RTL) languages. In the present study, we used the Xilinx Vivado HLS tool to develop a high level synthesis (HLS) design and evaluated different hardware architectures. After analyzing the design for different input matrix sizes and various hardware configurations, we compared it with the results of other studies reported in the literature, concluding that although resource usage may be higher when HLS tools are used, the design performance is equal to or better than low level hardware designs.

1. Introduction

Eigenvalue calculation is a problem that arises in many fields of science and engineering, such as computer vision [1], business and finance [2], power electronics [3], and wireless sensor networks [4]. In these applications, eigenvalues are normally used in a decision process, so real time performance is often required.

Since eigenvalue computation is usually based on operations such as matrix multiplication and matrix inversion, and a large amount of data must be handled, it is computationally expensive and can represent a potential bottleneck for the rest of the algorithm.

The algorithms involved in eigenvalue and eigenvector computations are generally iterative and present strong data dependencies between iterations; however, many similar operations can be carried out in parallel in the same iteration, such as multiply and accumulate in matrix products.

Although this is not important when working with general purpose processors or more specialized digital signal processors (DSPs), some devices leverage this feature by allowing the execution of repetitive operations in parallel.

For example, graphic processing units (GPUs) are useful when a large number of multiply and accumulate operations are involved, since they are designed to work in graphics processing, which is based on vector and matrix operations.

For similar reasons, field programmable gate arrays (FPGAs) display a remarkable capacity to carry out repetitive operations in parallel.

There are many applications where matrices are real and symmetric, with sizes not exceeding [5, 6]. In this situation, EVD computation is easier, and there are many algorithms that take advantage of this.

Some of the applications where the symmetric eigenvalue problem arises include the principal component analysis (PCA) statistical technique used for dimension reduction in data analysis [1, 7] and the multiple signal classification (MUSIC) algorithm [8] which uses a covariance matrix (which is real and symmetric by definition). In these studies, FPGA devices were employed to solve the problem, using a specialized hardware module to compute the EVD.

As mentioned previously, most of the methods used to solve the eigen problem are iterative. Depending on how the operations are performed on the initial matrix and the results obtained, a distinction can be made between purely iterative methods (Jacobi, power iterations) [9] and algorithms that perform some kind of transformation of the initial matrix (Lanczos-bisection, householder-QR) [10, 11]. A distinction can also be made between algorithms that output all the eigenvalues (Jacobi, QR) and those that only compute extremal eigenvalues or the one closest to a given value (power iterations, Lanczos).

In the studies cited above, the main method employed for eigenvalue and eigenvector computation was the Jacobi algorithm, since its characteristics render it highly suitable for a parallel implementation. The Jacobi algorithm can be implemented in such a way that there are almost no data dependencies between operations in the same iteration and it is possible to use systolic architectures, as has been proposed by Brent et al. [12]. In addition, all the operations involved can be carried out by using the CORDIC algorithm [13], as has been shown by Cavallaro and Luk in [14], and this can be implemented simply with add and shift operations, which only use a small amount of hardware resources compared to multiplication and division operations. Another advantage of the Jacobi method is that its round-off error is very stable [9].

However, there are other methods for computing extremal eigenvalues that can take advantage of the computing power provided by FPGAs, such as the QR algorithm or the Lanczos [15] method, since some of their operations can be carried out in parallel. In this study, we also present an FPGA floating point implementation of the QR algorithm. In terms of a practical FPGA implementation, it is a challenging task to make efficient use of the available resources while at the same time meeting real time requirements. This is mainly because the analysis of algorithm data dependencies and their parallelization is extremely difficult.

Furthermore, FPGA implementations are usually carried out in register transfer level (RTL) languages where there is a very low level of abstraction, and it is therefore necessary to manage resources and timing carefully. This also implies that when some architectural decisions are made, it is difficult to go back without recoding most of the system. Nevertheless, this kind of system can be very efficient in terms of the total amount of resources used.

When an algorithm with relevant mathematical workload, such as the ones mentioned before, has to be implemented in an FPGA, a great part of the design process consist in the partition of the system in small units to achieve the required grade of concurrency. Design time can be shortened by using a high level synthesis (HLS) tool. This tool attempts to raise the abstraction level allowing algorithm specification by means of a high level programming language such as C or C++. Thanks to this, the designer can focus on the algorithm itself, while the tool makes the software to hardware conversion guided by some user directives, which range from resource usage to concurrency grade. Some of the tools that have become popular in the recent years are Impulse C [16], Catapult C by Calypto [17], and Vivado HLS by Xilinx [18].

Hardware implementations of the Jacobi algorithm are very popular because its characteristics permit different architectural designs depending on the required performance. In the literature there are systolic implementations where speed is the most important requirement [19, 20], serial designs where low resource consumption is selected [1, 21], and a meet in the middle approach (semiparallel) where both criteria are balanced [8]. In this study, the Jacobi algorithm was implemented using the Xilinx Vivado HLS tool, to explore different hardware implementations and compare them to select the most efficient one in terms of execution time and FPGA resources used.

In this study, the Jacobi algorithm was implemented using the Xilinx Vivado HLS tool, to explore different hardware implementations and compare them to select the most efficient one in terms of execution time and FPGA resources used.

To gain a better understanding of the results obtained, we also carried out a floating point implementation of the QR algorithm using Vivado HLS. The results obtained were then compared to similar systems described in the literature.

The rest of the paper is organized as follows. In Section 2, the mathematical background behind the Jacobi and QR algorithms is presented together with some details relevant to implementation. The proposed implementations and its parameters are discussed in Sections 3 and 4, respectively. Finally, the system performance is compared with other design approaches in Section 5 and we present our conclusions in Section 6.

2. Mathematical Background

The eigenvalue problem [22] is one of the main questions in numerical algebra, and due to the many applications in which it is useful it has attracted much research attention in recent years [2325]. The formulation of the problem is simple (1), but it can be approached from very different angles: where and are an eigenvalue and an eigenvector, respectively, of a matrix . Although there is an analytical solution, it involves the calculation of polynomial roots, and therefore it is only suitable for low order () matrices (). When the problem has a higher order, numerical algorithms must be used. In this section, the mathematical background of Jacobi and QR algorithms is presented when they are used to compute the eigenvalues from real and symmetric matrices.

2.1. The Jacobi Algorithm

The Jacobi algorithm [26] was first proposed in 1846 but did not become popular until its computational implementation was explored. The singular value decomposition (SVD) of a matrix is given by where and are orthogonal matrices and is a diagonal matrix storing the singular values of on its diagonal. For the case where is real and symmetric, we can rewrite (2) as

The idea behind the Jacobi method is to perform a series of similarity transformations (i.e., before and after multiplication with an orthogonal matrix) of , to render it more diagonal in each iteration. This can be seen as an iterative expression: where represents the current iteration and the next one. The sequence of transformation matrices is chosen in such a way that the double multiplication renders the updated a little more diagonal than its predecessor . A popular choice for is the Givens rotation matrix (5). Consider where and are the sine and cosine of , respectively. Parallelism of the algorithm resides in this choice. If the product is studied, it can easily be seen that only four elements of are modified, so multiple similarity transformations can be carried at the same time.

Eigenvectors of are obtained by accumulating the rotation matrices used in eigenvalue computation as shown in the following expression:

To take full advantage of this, Brent and Luk [19] proposed a systolic architecture in which the initial matrix is mapped onto a set of matrices, such as the one shown in expression (7). This leaves a subset of diagonalization problems which can easily be solved in hardware making use of the CORDIC (COordinate Rotation DIgital Computer) algorithm [27], as shown in [14]:

Since diagonalization problems now have to be solved, rotation angles () are also required. Given that is symmetric, the angle is selected from the submatrices where , according to expression (8), so elements (where is the order of ) can be eliminated in each iteration. As will now be shown, angle calculation can also be carried out by the CORDIC algorithm, making an efficient use of resources in the implementation:

After each update of , it is necessary to bring new nonzero elements near the main diagonal so that they can be eliminated in the next iteration, repeating the entire process until . To this end, several rows and columns are interchanged as it will be explained in Section 3.

As mentioned before, the CORDIC algorithm [13, 27] can be used to solve the double rotation (4) on every submatrix and to calculate the rotation angles. One great advantage of CORDIC is that it can be expressed by means of shift and add operations, which are easy to implement in reconfigurable hardware. It also facilitates working with fixed point codification, where the word length is not fixed to 32 or 64 bits as occurs with the floating point standard, thus making more efficient use of FPGA resources. For example, in the present study, the entire system was fixed to a word length of 18 bits, since this is the entry word length of Xilinx FPGA multipliers.

The fundamental idea behind CORDIC is to carry out a series of rotations (called microrotations) on an input vector to obtain another vector as output. This is performed in three different coordinate systems (linear, circular, and hyperbolic), and it is possible to obtain various basic functions, such as multiplication, trigonometric operations, logarithms, and exponentials [28], in the components of the output vector, given some initial input vector conditions [27]. Microrotation values are selected as shown in the following expression: The choice of as an inverse power of 2 is what will allow us to implement the algorithm using shifts, since multiplication or division for a power of 2 is equivalent to a shift to the left or the right, respectively, in binary codification.

In the circular coordinate system, it is possible to obtain the two operations required for the Jacobi algorithm: Givens rotation and angle calculation (arctangent). Since CORDIC performs a series of rotations on a vector, it can be seen as an iterative expression (11). Consider where is a vector determined by its vertical and horizontal components ( and ) and is a variable that keeps track of the total rotated angle. In each iteration the direction of the rotation is determined by the operation that is to be performed. In the particular case of the circular coordinate system, the two algorithm operation modes are as follows. (i)Rotation mode: the algorithm takes a vector and an angle and outputs a new vector , which corresponds to rotated radians. To do so, is selected in such a way that approaches zero in each iteration. When , it can be assumed that the starting vector has been fully rotated a angle.(ii)Translation mode: algorithm input is again a vector , whose phase we want to compute. To do so, the vector is rotated until its component is zero, selecting accordingly. Since accumulates microrotation angles, when the variable will correspond to the initial phase of the vector. In addition, since , will store the absolute value of .

If this is expressed mathematically, in rotation mode , the relation between the input and the output is as follows:

On the other hand, if the algorithm works in vectoring mode, must be computed as and the relation between the input and the output is now as shown in the following expression:

As can be observed, in expressions (12) and (13) the result is scaled by a constant value of . This is due to the nonorthogonality of the microrotations, which modifies the absolute value of the vector. To solve this, results obtained can be scaled by . In addition,   (14) is known in advance and only varies with the number of microrotations performed. Since is fixed, can be precomputed and stored in a memory element to correct the output data:

2.2. The QR Algorithm

As indicated in Section 1, to show the effectiveness of the Jacobi algorithm over other methods when it is implemented in reconfigurable logic, a QR algorithm implementation was also carried out using Vivado HLS.

In the QR algorithm, another condition must be added to our initial matrix, which was real and symmetric: now it must also be a tridiagonal matrix, like the one presented in the following expression:

To obtain the eigenvalues of , a reduction strategy is used where the entries are zeroed. When, for example, the row and the column involved can be removed, where is an eigenvalue of .

To accomplish this in the QR algorithm a sequence of matrices similar to () is produced [29]. To start with, the initial matrix is factored as follows: where is an orthogonal matrix and is an upper triangular matrix. In a similar way is defined as

As presented thus far, the convergence of the method is slow. To speed it up a shift can be performed on the original matrix for a quantity close to the eigenvalue we are looking for. Thus, we now have a new sequence of matrices such that After each shift, the eigenvalues of will be shifted by the same amount, and it is easy to recover the original values by accumulating the shifts after each iteration. To obtain an approximated value of for each iteration, a popular choice is to calculate the eigenvalues of the matrix   (19) and take the one closest to . This is called the Wilkinson shift [26]. Finally, and are chosen to be Givens rotation matrices to eliminate the appropriate element:

3. Implementation Details

As discussed in Section 1, the implementation tool used in this study was the Xilinx Vivado HLS [18], which allows the use of a high level programming language (C, C++ or System C) instead of a hardware description language such as VHDL or Verilog. Although it presents some drawbacks, this means of implementing the algorithms enabled us to test very different hardware configurations of the system with very few design modifications.

Due to its iterative nature, the Jacobi method can easily be expressed as a series of loops, which is the main element used in Vivado HLS to implement the algorithms.

This can be done via  #pragma directives or  tcl scripting, giving each part of the design different attributes that determine the degree of concurrency or the desired resource constraints. Some important operations that can be performed on design loops include the following.(i)Pipeline: which will try to achieve parallelism between the operations performed in the same iteration. Pipeline operation can also be specified throughout the entire design so that all internal loops are unrolled making parallelism between two system iterations possible.(ii)Unroll: which will implement each iteration as a separate hardware instance.(iii)Dataflow: which will try to achieve parallelism between different loops that are data dependent, implementing communication channels such as ping-pong memories.

As mentioned before, we can also apply resource constraints to the designs, limiting the number of instances of a particular hardware element. The elements that can be limited range from operators (+,) to specific resources such as DSP48E and BRAM memory blocks. User defined elements, such as a function called multiple times, can also be constrained, and this is very important for design optimization, as will be shown later.

Figure 1 shows a functional diagram of the Jacobi algorithm. Since the C programming language was used, each task was implemented as a  for  loop and labeled to make optimization easier.

When the design starts its operation (a), the input matrix is buffered in a memory element . In addition, to compute eigenvectors, is loaded with an identity matrix (). After the initial operations, there are three tasks (a, b, and c) to be performed times in the external loop, where corresponds to the total number of Jacobi iterations (elimination of the elements adjacent to the main diagonal) that the algorithm will execute and is selected in such way that eigenvector error is minimized.

The first task that must be done in each Jacobi iteration (b) is to calculate rotation angles (). As previously mentioned, this was achieved using the CORDIC algorithm in vectoring mode.

The next tasks to be performed (c.1 and c.2) are eigenvalue and eigenvector rotations, which consist of pre- and postmultiplication with the Givens rotation matrix. Since there are no data dependencies between these two tasks, they are placed in different loops and thus parallelization is possible. Pre- or postmultiplication with a Givens rotation matrix can be expressed as two vector rotations, so it can be performed with two rotation mode CORDIC algorithm executions. This leaves us with six CORDIC executions, four of them which correspond to eigenvalue calculation and the other two to eigenvector computation.

The final operation to be performed before the next Jacobi iteration (d) is a series of row and column permutations with the purpose of introducing new nonzero elements into the diagonal submatrices.

This element rearrangement was implemented as described by Brent and Luk [19] but with some modifications. In their study, they proposed a systolic architecture where each submatrix corresponded to an independent hardware unit called a processor, and all processors were interconnected so that data interchanges were possible.

Since our design did not have a defined hardware architecture at this point, element interchange was substituted with the corresponding row and column permutations in order to fulfill the requirement of introducing new nonzero elements into the diagonal submatrices.

Finally, when all the Jacobi rotations are completed, eigenvalues and eigenvectors are output as arrays, which can be implemented as BRAM ports or AXI buses so that the design can be integrated in different kinds of system.

All tasks were implemented in the C programming language using a fixed point data type, following the XQN quantification scheme, where a number is represented by bits of integer part and bits of fractional part, with a total number of bits (word length) of for signed values and for unsigned values.

Starting from the initial C prototype, different hardware architectures can be considered depending on the design requirements. Mainly, resource usage (LUT, Flip-Flops, multipliers, memory elements) and timing performance (latency (), throughput or initiation interval () and maximum clock frequency) are the parameters that we attempt to optimize:(i)the latency (), which specifies how much time will take until the system generate the first valid result;(ii)the initiation interval (II), which corresponds to the throughput or the number of clock cycles between two valid sets of results, after the latency period.

As a result of the Vivado HLS optimization directives three architectures can be achieved, depending on which parameters are considered more important.(i)Serial architecture: this hardware scheme pursues minimal resource usage at the expense of a higher latency. To reach this optimization level, a pipeline directive is specified on the external loop of the design, which performs the Jacobi iterations.(ii)Parallel architecture: in this kind of system, the main goal is to achieve the best performance in terms of execution time (latency and throughput) at the expense of resource usage. To achieve this, pipeline is applied to the entire design, which means that the external loop will be unrolled and parallelism between two algorithm executions will be possible.(iii)Semiparallel architecture: although timing is still important in this scheme, resource usage is also considered, and attempts are made to minimize it while affecting execution time as little as possible. In our particular case, this was achieved by limiting the number of CORDIC instances of the design in the parallel architecture.

Although all three alternatives were analyzed, the full parallel architecture was found to use a massive amount of internal FPGA resources, greatly exceeding the capacity of the device. This was mainly due to an excessive amount of CORDIC units in the rotation mode, which led to 55 hardware instances of it in the case of an input matrix. Consequently, we ruled out the possibility of parallel architecture.

It is important to clarify the difference between semiparallel and serial architecture. In both cases, we always attempted to maximize concurrency between operations, such as arithmetic operations and memory accesses; however, with parallel architecture it was also possible to achieve concurrence between two full executions of the algorithm.

The excessive amount of CORDIC units implemented by the Vivado HLS synthesizer was taken as a starting point for optimization of the design. Since Vivado HLS allowed us to limit the number of elements of a particular hardware resource or entity, the number of CORDIC instances in both semiparallel and serial architecture was limited to different values, allowing Vivado HLS to redesign the RTL implementation and subsequently analyzing the design performance obtained.

First, we analyze the case of the semiparallel architecture. Since is a common size in the literature for input matrices, the design was optimized for this matrix size, extending the conclusions reached to other matrix sizes. The results obtained (resource consumption and timing performance) are shown in Figure 2.

An analysis of timing performance indicated that although latency did not vary very much, the initiation interval (II) was considerably reduced when the number of rotation mode CORDIC instances was increased from one to four, slowly decreasing as more CORDIC modules were added.

This result was mainly due to data dependencies between operations performed in the Jacobi algorithm. As previously indicated, four rotations must be performed for eigenvalue calculation, and the result of the first two is required to perform the last two, whereas two independent rotations must be performed for eigenvector calculation.

From this, it can be concluded that four CORDIC modules will greatly increase the performance of the system. When timing performance is related to resource usage, it is clear that it increased linearly with CORDIC module usage, and the slope variation was due to the different control logic implemented by Vivado with an odd and an even number of CORDIC modules.

The last conclusion that can be reached from Figure 2 is that there was parallelism between two algorithm iterations, since . This means that the first valid result will be obtained after clock cycles, but the next results will be obtained more rapidly, after every II clock cycles.

The results obtained for the serial architecture are shown in Figure 3. This time, only the latency is shown, since there was no parallelism between two algorithm executions and they only differed in one clock cycle that the system needed to start the next execution ().

As indicated earlier, the execution time between two valid result sets is the same as the latency period (), and in the semiparallel design alternative, it evolved similarly to the initiation interval.

For similar reasons, there was a greater improvement in latency in the first four CORDIC module additions, which did not decrease at all after six. The increase in resource usage was also linear, now with a constant slope, although it used fewer resources.

4. Assessment of Design Parameters

4.1. Optimal Number of Jacobi Iterations

To evaluate the quality of the results obtained using the Jacobi method, we conducted an experimental error analysis to determine how many Jacobi iterations were required to obtain satisfactory results.

The word length (WL) selected for the fixed point codification scheme presented in Section 2 was 18 bits with a fractional part of 15 bits (2Q15), since Xilinx FPGA multiplier entry is fixed to this value and the purpose of using the CORDIC algorithm was to minimize resource usage. In addition, as will now be shown, this codification provided sufficient precision to avoid overflow and growing rounding errors.

Simulating the system for multiple real and symmetric random matrices, we calculated the eigenvalue and eigenvector maximum () and quadratic () error, comparing FPGA Jacobi fixed point results with MATLAB floating point implementation (eig).

When the results were analyzed, we found that the eigenvector error stabilized later than the eigenvalue error (more iterations were required). Consequently, Figure 4 shows the eigenvector component error, calculated according to the following expressions: where represents the total number of samples. The error obtained was plotted as a function of the number of Jacobi iterations (i.e., the number of rotations performed for each submatrix) and was used to determine the optimal number of iterations () that the algorithm must perform to obtain sufficiently accurate results.

Our analysis of matrices indicated that was the best number of iterations for both eigenvectors and eigenvalues in order to obtain a sufficiently accurate solution, since the error was constant from this value on.

After performing the same tests for other input matrix sizes, it was concluded that, in general, expression (21) can be used to determine the optimal number of iterations to be performed for an matrix:

4.2. Serial/Parallel Jacobi Analysis

Since an expression was available for the optimal number of Jacobi iterations (), we also conducted a study of the design performance depending on initial matrix size.

With the parallel architecture, we found that although the Vivado HLS generated an RTL design for matrices with an order , the Vivado suite synthesizer was unable to implement it successfully. However, the serial architecture design performed well for higher order matrices ().

Based on the parametric study performed in Section 3, we decided that a good trade-off between execution time and resource usage would be to use four CORDIC modules, which would perform iterations in both parallel and serial architectures, where WL was the data word length.

Implementation results for the serial architecture are shown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device.

Turning first to resource usage, it can be seen that this did not not increase as fast as execution time, since when matrix size is increased with a fixed amount of CORDIC units, only storage elements and control logic will be added. Consequently, it becomes evident that in terms of resources used, the design has an complexity.

Regarding execution time (), the Vivado HLS enabled us to examine the sequence of operations performed by the design in each clock cycle. An analysis of this information indicated that the design latency, which corresponds to execution time in the serial architecture, follows the following expression: where is the size of the processed matrix and is the number of Jacobi iterations performed, following expression (21). Meanwhile,   (23) corresponds to the time required to read sufficient data from an external memory to start the first Jacobi iteration, and is the amount of clock cycles required to perform a Jacobi iteration:

A similar study was performed for different numbers of rotation CORDIC units, which revealed that the time to perform a Jacobi iteration varied. However, it remained a function of , so system latency () was still related to the size of the input matrix, as shown in the following expression: where is the time required to perform a Jacobi iteration.

Regarding the maximum size () of the input matrix, we assumed that given the serial nature of the system, resource usage would be the limit; however, we found that for , the Vivado HLS took a very long time to guess the control logic, and implementation eventually failed. Despite this, the design has been coded in such a way that input matrix size and the codification scheme can be modified simply by changing a few constants, which is a vast improvement over the classic RTL designs.

5. Comparison with Other Proposals

The results obtained were also compared with those reported in other studies of matrices, since these have received much research attention.

Figure 6 shows the implementation results for the two architectures proposed here (serial and parallel), a Jacobi serial architecture carried out in VHDL [1], a Jacobi semiparallel architecture from the multiple signal classification (MUSIC) algorithm [8], and the Vivado HLS linear algebra library svd implementation.

The criteria selected to compare the different designs were as follows: FPGA resource consumption, in terms of the discrete elements FF and LUT, computation time (latency and initiation interval), and maximum clock frequency ().

The VHDL serial Jacobi algorithm features a fully customized system using two ad hoc CORDIC modules, one of which works in rotation and vectoring mode while the other is optimized to work only in rotation mode.

The Jacobi module used in the MUSIC algorithm features a semiparallel architecture based on the Brent and Luk proposal [12] and also uses the CORDIC algorithm for the calculations. However, it does not implement the full systolic array, only making use of a subset of this to save resources.

Turning first to the two Jacobi algorithm implementations proposed here, it is evident that the best performance in terms of speed was achieved by the semiparallel architecture, whereas the serial architecture only used half of the resources required to implement the parallel one.

The other Vivado HLS implementation considered was the svd function from the Vivado HLS library, which can only be implemented in single precision floating point format. Resource usage is not dissimilar to our Vivado HLS implementations; however, it is slower and makes use of 58 DSP48E blocks.

The difference arises from the use of floating point cores which are implemented with DSP units, while the fixed point CORDIC approach employed in the present study only required two DSP48E multipliers per unit.

As previously mentioned, a floating point QR algorithm was implemented in the same FPGA device also using Vivado HLS. The main purpose of this was to determine whether other algorithms presented greater computational efficiency than the Jacobi method. Although we used different coding systems in the Jacobi and QR algorithms, respectively, an analysis of both implementations revealed that QR was less suitable for a hardware implementation since the operations performed can vary between different iterations.

Nevertheless, we found that although the performance of the algorithm in terms of execution time was poor, it did not use a large amount of FPGA resources, and it therefore represents an alternative to consider when floating point codification is required.

Looking next at the eigen solver implemented as part of the MUSIC algorithm, it can be seen that it did not perform very well. On paper, it is similar to the semiparallel architecture presented here, but the control logic was designed by hand. This shows that HLS performs better than traditional hardware design flows when control logic becomes too complicated, since it is implemented automatically.

Our final comparison was with the VHDL serial architecture, in which only two CORDIC modules were used. Since everything in this design was customized for the application, timing performance was similar to HLS designs, but it used far fewer resources.

As regards the maximum clock frequency at which the designs can work, Figure 7 shows a comparison of all the alternatives discussed.

On the one hand, we have Vivado HLS designs, where the difference between floating point and fixed point is evidenced by the results. While floating point designs can easily go above 200 MHz, fixed point designs stay around 120 MHz.

On the other hand, we find the VHDL design. In this case, the performance of the design proposed in [1] was similar to the floating point designs since it was hand-coded at a very low optimization level.

6. Conclusions

In this paper, we have described a high level synthesis implementation of the Jacobi algorithm for eigenvalue and eigenvector calculations, and this has been compared to similar or related systems reported in the literature.

The main purpose of using a HLS tool was to try to achieve a similar performance to that of a custom RTL system coded in, for example, VHDL, in a fraction of the design time required.

In recent years, the amount of resources available in FPGA devices has grown exponentially, and thus rapidity is becoming more important than efficiency when implementing computational algorithms.

Furthermore, HLS tools have evolved, making it possible to achieve good systems that are easy to integrate into larger designs as hardware coprocessors together with a CPU, or as a processing block in regular RTL designs.

As final paper conclusion, we can say that the use of high level implementation tools reduce the design time. In most cases, the designs are initially coded, simulated, and validated with high level languages, where there exist standard libraries and reference code available. Manual RTL coding is not usually an easy step because there is not a straight way to translate the functional algorithm to RTL description. In addition, the required level of RTL expertise is very high.

Another inconvenience for RTL-based designs appears when new changes have to be included in the algorithm. The RTL structure is not very flexible and any modification to the algorithm requires a lot of time (simulation and recoding). Nevertheless, RTL designs generate hardware reconfigurable approaches with less resources and higher clock frequency than high level hardware designs (see Figures 6 and 7).

The developed algorithm in this work was initially made with a traditional RTL flow [1]. As a result, we can compare the results of a complex algorithm such as Jacobi based on RTL or HLS flows. According to Figures 6 and 7, the achieved results for the HLS methodology compared with RTL-based design are favorably from different points of view.

We have shown that the design process varies from the traditional RTL flow, since it is possible to iterate on the same code to obtain different hardware architectures and perform parametric simulations of the hardware in order to obtain the best results.

Therefore, the designer should balance what are the prior requirements in the system, clock frequency/resources, or design time and choose the methodology (RTL/HLS) that best fit for that particular case.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work has been supported by the Spanish Ministry of Economy and Competitiveness through the ALCOR Project (DPI2013-47347-C2-1-R).