Abstract

Cholesky factorization is a fundamental kernel in many engineering and scientific applications. When dealing with a large sparse matrix, numerical decomposition consumes most of the time. We present a vector architecture to parallelize the numerical decomposition of Cholesky factorization, and we construct an integrated, analytical, parameterized performance model to accurately predict the execution times of typical matrices under varying parameters. Our approach is general for accelerators and is limited by neither field-programmable gate arrays (FPGAs) nor application-specific integrated circuits. We implement a simplified module on an FPGA to verify the accuracy of the model. The experiments show that, for most cases, the differences between the predicted and measured execution times are less than 10%. Based on the performance model, we optimize the parameters and obtain a balance of resources and performance after analyzing the performance of varied parameter settings. Compared with state-of-the-art CPU and GPU implementations, the performance of the optimal parameter setting is twice that of the CPU. Our model offers several advantages, particularly in power consumption, and it provides guidance for the design of future acceleration components.

1. Introduction

In engineering and scientific computations, the solution of large sparse linear systems of equations is a fundamental problem in most applications. In general, two methods exist for solving this problem: direct and iterative methods. Among the direct methods, the main factorization algorithms are LU decomposition, QR decomposition, and Cholesky decomposition [1]. Cholesky decomposition is a special form of LU decomposition that applies to symmetric positive definite matrices; its computational complexity is roughly half that of the LU algorithm. The Cholesky algorithm is widely used in applications because most matrices involved are symmetric positive definite.

The solution of large sparse linear systems of equations can be adapted quite easily to parallel architectures. Supercomputers and multiprocessor systems became the main focus of this kind of research [2-4]. However, these systems are costly to turn into marketable products [5].

Many software- and hardware-based approaches have been developed to obtain better solutions for Cholesky decomposition. Field-programmable gate arrays (FPGAs) have unique advantages in solving such problems. First, a dedicated architecture can be designed according to the characteristics of the algorithm [6]. Second, the computational resources, memory, and I/O bandwidth are reconfigurable. Third, the energy consumption and cost are relatively low while still meeting the required performance. Fourth, FPGAs provide an experimental and verification platform for future coprocessor designs. Yang et al. [7, 8] presented Cholesky decompositions based on GPUs and FPGAs; their results show that the dedicated FPGA implementation has the highest efficiency. FPGA designs have been able to combine parallel floating-point operations with high memory bandwidth in order to achieve higher performance than microprocessors [9].

A sparse matrix, in which the ratio of nonzero values is low, is especially suitable for direct methods. In our implementation, we chose the Cholesky factorization method over LU decomposition.

We chose Cholesky factorization because it requires fewer operations: the matrix is symmetric, so only half of the factors need to be calculated, and the computation maps well to parallel architectures. In parallel execution, there is also less interprocessor communication. A large body of work exists on parallelizing Cholesky factorization [3, 4]. The memory requirement is significantly lower than that of other direct methods because only the lower triangular part is used for factorization, and the intermediate and final results (the Cholesky factor $L$) overwrite the original matrix.

Three algorithms exist for implementing Cholesky factorization: row-oriented, column-oriented, and submatrix algorithms. The difference among them lies in the order of the three loops [3]. Among these, the column-oriented algorithm is the most popular for sparse matrices. Once each column is regarded as a node, the data dependency of each node in Cholesky factorization can be described as an elimination tree [10]. In the tree, parent nodes depend on their descendants: the descendants must update them before they can be decomposed. Thus, the factorization can be viewed as eliminating the tree from bottom to top. The algorithm has two main operations: a scaling operation within a column, $cdiv(j)$, and an update operation between dependent columns, $cmod(j, k)$. Based on the updating order, the algorithm can be classified into left-looking and right-looking variants.

A number of studies have optimized sparse Cholesky factorization on GPUs. Vuduc and Chandramowlishwaran mapped the submatrix tasks directly to the GPU [11], George scheduled the tasks on the GPU [12], and Lucas et al. used multiple CPU threads to handle small-scale computing tasks in parallel and used the GPU to speed up large-scale computing tasks [13]. The small-scale tasks lie at the bottom of the elimination tree, whereas the large-scale tasks lie at the top. Lacoste et al. modified the task scheduling module of the Cholesky factorization algorithm to provide a unified programming interface on CPU-GPU heterogeneous platforms [14]. Hogg et al. proposed a supernodal algorithm in the HSL_MA87 solver for multicore CPU platforms [15]. Chen et al. introduced a GPU version of the CHOLMOD solver for CPU-GPU platforms [16]. Zou et al. improved the generation and scheduling of GPU tasks with a queue-based approach and designed a subtree-based parallel method for multi-GPU systems [17].

To decrease storage and computation costs, the sparse matrix is stored in a compressed format (CSC, Compressed Sparse Column). However, during Cholesky factorization many originally zero entries become nonzero (fill-in); they increase the memory overhead and dynamically change the compressed data structure. To avoid this situation, the matrix is preprocessed and symbolically decomposed before numerical factorization: the symbolic phase generates the nonzero pattern of the matrix and its elimination tree, and reorganizing the data yields the vector arrays. Preprocessing exposes parallelism by reordering matrix columns, and symbolic decomposition identifies the sparsity pattern of the result matrix in advance. With symbolic factorization technology being well developed, numerical decomposition is the most time-consuming step. This paper discusses and evaluates the performance of numerical decomposition based on the preprocessing and symbolic factorization results.
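As an illustration, the following sketch stores the lower triangle of a small symmetric matrix in CSC form; the matrix and variable names are only examples, not the data structures of our implementation.

```python
import numpy as np

# A small symmetric positive definite matrix; we store only its lower triangle.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])

# Compressed Sparse Column (CSC) storage of the lower triangle:
# values[k]    -- k-th nonzero, scanned column by column
# row_index[k] -- row of that nonzero
# col_ptr[j]   -- position in `values` where column j starts (col_ptr[n] = nnz)
values, row_index, col_ptr = [], [], [0]
n = A.shape[0]
for j in range(n):
    for i in range(j, n):          # lower triangle only
        if A[i, j] != 0.0:
            values.append(A[i, j])
            row_index.append(i)
    col_ptr.append(len(values))

print(values)     # [4.0, 1.0, 3.0, 2.0, 5.0]
print(row_index)  # [0, 1, 1, 2, 2]
print(col_ptr)    # [0, 2, 4, 5]
```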

After analyzing the Cholesky algorithm for large sparse matrices, we rewrite the algorithm in a vector version. To explore the relationship between parallel scale and performance, and to predict the performance on a field-programmable gate array (FPGA) platform, we also design a corresponding vector architecture and construct a parameterized model. The performance depends on the underlying architecture and system parameters; the capacity of on-chip memory, the amount of computational resources, and the I/O bandwidth are the critical ones. Our model takes these factors into consideration and focuses on their relationships. Using the model, we obtain the best performance when choosing the capacity of on-chip memory and the amount of computational resources.

The rest of this paper is organized as follows. Section 2 introduces Cholesky factorization. Section 3 presents the vector version of the algorithm and its architecture. Section 4 describes our performance model for the ideal, memory-bound, and computational-bound cases. Section 5 verifies the consistency between the model predictions and the implementation and then analyzes the parameters to find an optimal configuration.

2. Cholesky Factorization

Cholesky factorization decomposes a symmetric positive definite matrix $A$ into $A = LL^{T}$, where $L$ is a lower triangular matrix. The main operations of the algorithm are the column scale operation $cdiv(j)$ and the update operation $cmod(j, k)$ between dependent columns. $cdiv(j)$ divides the subdiagonal elements of column $j$ by the square root of the diagonal element in that column, and $cmod(j, k)$ modifies column $j$ by subtracting a multiple of column $k$. The algorithm can be classified into left-looking and right-looking variants based on the updating order. The right-looking algorithm performs all the updates using column $j$ immediately after $cdiv(j)$. By contrast, the left-looking algorithm accumulates all necessary updates for column $j$ just before $cdiv(j)$.
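To make the two primitives concrete, the following is a minimal dense sketch of the left-looking variant, using the cdiv/cmod naming from the description above; it is an illustration, not our FPGA implementation.

```python
import numpy as np

def cdiv(L, j):
    """Scale column j: square root of the diagonal, divide the subdiagonal part."""
    L[j, j] = np.sqrt(L[j, j])
    L[j+1:, j] /= L[j, j]

def cmod(L, j, k):
    """Update column j (j > k) by subtracting a multiple of column k."""
    L[j:, j] -= L[j, k] * L[j:, k]

def left_looking_cholesky(A):
    """Accumulate all cmod(j, k) for column j just before cdiv(j)."""
    L = np.tril(A).astype(float)
    n = L.shape[0]
    for j in range(n):
        for k in range(j):
            if L[j, k] != 0.0:      # in the sparse case, only dependent columns
                cmod(L, j, k)
        cdiv(L, j)
    return L

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
L = left_looking_cholesky(A)
print(np.allclose(L @ L.T, A))      # True
```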

From Algorithm 1 we can note that the parallelism lies mainly among columns without data dependencies and among the elements within a column, corresponding to parallel elimination among different paths of the elimination tree (internode parallelism) and parallel scaling among different subdiagonal elements of a node (intranode parallelism). Moreover, because the columns of a dense matrix depend on each other in strict order, parallelism among columns exists only in sparse matrix decomposition.

 Input: symmetric positive definite matrix A
 Output: Cholesky factor L (stored over A)
(1) L = A
(2) for  j = 0; j < n; j++  do
(3)     L[j][j] = sqrt(L[j][j]);               // cdiv(j): scale the diagonal
(4)     for  i = j + 1; i < n; i++  do
(5)          L[i][j] = L[i][j] / L[j][j];      // cdiv(j): scale the subdiagonal
(6)     end
(7)     for  k = j + 1; k < n; k++  do
(8)         if  L[k][j] != 0  then
(9)           for  i = k; i < n; i++  do
(10)              L[i][k] = L[i][k] - L[i][j] * L[k][j];    // cmod(k, j)
(11)          end
(12)       end
(13)   end
(14) end

In Algorithm 1, lines (2), (7), and (9) form a nest of three loops. The amount of memory access to the array $L$ in each loop level is $O(n)$ ($n$ is the scale of the matrix), so the overall memory access complexity is $O(n^{3})$, whether the algorithm is left-looking or right-looking. As the scale of the matrix increases, this complexity becomes unacceptable on a platform that lacks cache space and has small on-chip memory [18]. Duff and Reid [19] proposed the multifrontal Cholesky algorithm. The method reorganizes the overall factorization of a sparse matrix into a sequence of partial factorizations of smaller dense matrices. The multifrontal algorithm accumulates the update factors into a matrix; in this way, it reduces the memory access time [20] and the update operations.

Algorithm 2 shows the multifrontal approach. The decomposition of each column $j$ only needs to read its frontal matrix $F_{j}$ and its corresponding update matrix $U_{j}$. The updating effects of the descendants are accumulated in the matrix $U_{j}$, thus avoiding accesses to the descendants themselves. In Algorithm 2, $\oplus$ denotes the extended addition operation between two sparse matrices.

 Input: symmetric positive definite matrix A
 Output: Cholesky factor L
(1) L = A
(2) for  j = 0; j < n; j++  do
(3)   {i1, ..., ir} are the abscissas of non-zeros in column j of L;
(4)   {c1, ..., cs} are the child nodes of node j;
(5)   U = U_{c1} ⊕ U_{c2} ⊕ ... ⊕ U_{cs};   // construct the accumulated update matrix of node j
(6)   F_j = F_j ⊕ U;                        // update the front matrix F_j
(7)   decompose F_j: obtain column j of L and the update matrix U_j for parent(j)
(8) end

While decomposing the frontal matrix $F_{j}$, the update matrix $U_{j}$ is also generated. Specifically, after the first column of $F_{j}$ is factored into column $j$ of $L$, the remaining block is the Schur complement
$$U_{j} = \bar{F}_{j} - \begin{pmatrix} L_{i_{1},j} \\ \vdots \\ L_{i_{r},j} \end{pmatrix} \begin{pmatrix} L_{i_{1},j} & \cdots & L_{i_{r},j} \end{pmatrix},$$
where $\bar{F}_{j}$ denotes $F_{j}$ with its first row and column removed and $i_{1}, \ldots, i_{r}$ are the rows of the subdiagonal nonzeros of column $j$.
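The following sketch walks through one such multifrontal step on small dense data: the frontal matrix is assembled by extended addition, its first column is factored, and the Schur complement becomes the update matrix for the parent. The index lists and numerical values are illustrative only.

```python
import numpy as np

def extend_add(F, rows_F, U, rows_U):
    """Extended addition: add U into F at the positions given by the row index lists."""
    pos = {r: i for i, r in enumerate(rows_F)}
    for a, ra in enumerate(rows_U):
        for b, rb in enumerate(rows_U):
            F[pos[ra], pos[rb]] += U[a, b]
    return F

def factor_front(F):
    """Factor the first column of the frontal matrix and return (column, update matrix)."""
    col = F[:, 0].copy()
    col[0] = np.sqrt(col[0])
    col[1:] /= col[0]
    # Update matrix U_j = trailing block minus the outer product of the computed column.
    U = F[1:, 1:] - np.outer(col[1:], col[1:])
    return col, U

# Frontal matrix of node j over rows {j, i1, i2} (illustrative values).
F = np.array([[4.0, 2.0, 2.0],
              [2.0, 0.0, 0.0],
              [2.0, 0.0, 0.0]])
# Update matrix contributed by one child, over rows {i1, i2}.
U_child = np.array([[3.0, 1.0],
                    [1.0, 2.0]])
F = extend_add(F, rows_F=[0, 1, 2], U=U_child, rows_U=[1, 2])
col_j, U_j = factor_front(F)
print(col_j)   # column j of L restricted to its nonzero rows
print(U_j)     # update matrix passed to the parent of node j
```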

3. Vector Version and Architecture

This section chiefly introduces our vector version of the Cholesky algorithm and its implementation on FPGA.

3.1. Vector Version of Cholesky Algorithm

In Algorithm 3, we rewrite the multifrontal algorithm in a more detailed fashion. $U$ is a storage structure that gathers, for the $j$th node, the information of all the update matrices contributed by its children; $childcount[j]$ counts the number of these update matrices, that is, the number of child nodes of the $j$th node; and $parent$ is an array that records the corresponding parent node, so $parent[j]$ is the parent node of the $j$th node. In line (6) of Algorithm 3, the $j$th column of $L$ is calculated from the first column of the frontal matrix $F_{j}$. Lines (8) to (12) generate the update matrix for its parent node and store it in $U$. Notably, the update matrix is symmetric; thus, only its lower part is generated and stored.

 Input: matrix A, elimination tree (parent array), child counts
 Output: Cholesky factor L
(1) for  j = 0; j < n; j++  do
(2)    F_j = the nonzero entries of column j of A;
(3)    for  c = 0; c < childcount[j]; c++  do
(4)     F_j = F_j ⊕ U[c];                    // extended addition of the c-th child update matrix
(5)    end
(6)    L(:, j) = cdiv(F_j(:, 0));            // the j-th column of L from the first column of F_j
(7)    p = parent[j];
(8)    for  k = 1; k < size(F_j); k++  do
(9)       for  i = k; i < size(F_j); i++  do
(10)         U_j(i, k) = F_j(i, k) - L(i, j) * L(k, j);    // lower part only, stored in U for node p
(11)      end
(12)   end
(13) end

The execution of lines (3) to (5) of Algorithm 3 is illustrated in a vector fashion in Algorithm 4, where we exploit the strip-mining technique. $VL$ is the size of the current strip, while $MVL$ is the number of lanes in our vector architecture. $s$ is the number of child nodes of the processed node, and $U_{c}(:, i)$ is the $i$th column of the update matrix from the $c$th child. Each update of a column is assigned to one lane at a time. The calculation of $L(:, j)$ and the generation of the update matrix $U_{j}$ follow a similar vector processing method.

 Input: F_j, {U_1, ..., U_s}
 Output: F_j (after extended addition of all child update matrices)
(1) for  c = 1; c <= s; c++  do
(2)   low = 0;
(3)   VL = (n_c mod MVL);                     // n_c: number of columns of U_c
(4)   for  k = 0; k <= (n_c / MVL); k++  do
(5)    for  i = low; i < low + VL; i++  do    // one column per lane
(6)       F_j(:, i) = F_j(:, i) ⊕ U_c(:, i);
(7)     end
(8)      low = low + VL;
(9)      VL = MVL;
(10)   end
(11) end
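As a software analogy for Algorithm 4 above, the sketch below shows only the strip-mining pattern: the columns are split into strips of at most MVL columns, with the first strip taking the remainder, and each column within a strip would be assigned to one lane. The function and values are illustrative.

```python
def strip_mine(n_cols, mvl):
    """Partition n_cols column indices into strips of at most mvl columns,
    following the classic strip-mined loop: the first strip takes the remainder."""
    strips = []
    low = 0
    vl = n_cols % mvl or mvl        # first (possibly shorter) strip
    while low < n_cols:
        strips.append(list(range(low, low + vl)))
        low += vl
        vl = mvl                    # every later strip is full-width
    return strips

# With 4 lanes, 10 columns are processed in strips [0,1], [2..5], [6..9];
# within each strip, each lane handles one column of the child update matrix.
print(strip_mine(10, 4))
```
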
3.2. Architecture

The primary architecture of the vector machine contains $p$ independent processing elements (PEs), each divided into extended-addition, decomposition, and update modules. Figure 1 gives the details of the implementation. The design is mainly composed of three computation modules and a series of storage components; the storage components include five FIFOs (First Input First Output, implemented with RAM) and RAMs used as buffers when data is transferred. The extended-addition module reads data from the FIFO that stores the frontal matrix ($F_{j}$ in Algorithm 3) and from the FIFO that stores the update matrices (the $U$ structure of Algorithm 3), and it combines the data with the index information to perform the extended addition operation. Then the decomposition module starts up. Its task is to decompose the column, which includes division and the extraction of a square root; both are based on the SRT algorithm (usually called the digit recurrence algorithm) and share one SRT implementation. A FIFO stores the results of the decomposition module, and its output is divided into two parts: the final results, which are stored in a result RAM, and the data kept for updating, which is stored in another FIFO. The update module reads this FIFO to generate the update matrix; it performs the multiplication and subtraction operations, following the same strip-mined vector scheme as Algorithm 4. The initial data of the Cholesky factorization is obtained from DDR memory, and the results after factorization are also stored in DDR.

Each PE is thus divided into an extended-addition part and a decomposition part; Figures 2(a) and 2(b) show the two architectures, and each module has $MVL$ lanes. From the two figures, we can trace the data flow within one PE as described in Algorithm 3. For the extended addition between $F_{j}$ and the child update matrices, each lane processes one column at a time. More concretely, Figure 2(a) shows that with two lanes the first two columns of the update matrix and the corresponding columns of $F_{j}$ are processed first; after that, all lanes are fully parallel most of the time. In Figure 2(b), only the first column of $F_{j}$ needs to be processed when calculating the corresponding column of $L$; the remaining columns of $F_{j}$ are used when generating the update matrix for the parent node.

4. Performance Model

Based on the analysis of the Cholesky algorithm, we construct a performance model on the FPGA platform to predict the performance of our vector architecture. The performance is related to the underlying architecture and its parameters. Ignoring the concrete implementation, we can predict the peak performance from the on-chip memory, the computational resources, and the bandwidth.

Figure 3 depicts the nonzero pattern of a matrix and its elimination tree. The elimination tree of the matrix is defined to be the structure with $n$ nodes in which node $p$ is the parent of node $j$ if and only if
$$p = \min\{\, i > j : L_{ij} \neq 0 \,\}.$$

In the elimination tree, each column of the matrix is represented as a node. In Figure 3, the numbers on the right side of the elimination tree represent the levels of the corresponding nodes, and the path drawn with a dashed line is the critical path. Given the dependency between a parent node and its children, the number of nodes in the critical path, which is determined by the algorithm and the nonzero pattern of the matrix, is the least number of nodes that must be processed sequentially.
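The sketch below builds the elimination tree from a hypothetical nonzero pattern of $L$ using this definition and computes the length of the critical path; it is an illustration of the definitions above, not part of our implementation.

```python
import numpy as np

def elimination_tree(L_pattern):
    """parent[j] = min{ i > j : L[i, j] != 0 }, or -1 if column j has no such row."""
    n = L_pattern.shape[0]
    parent = np.full(n, -1, dtype=int)
    for j in range(n):
        below = np.nonzero(L_pattern[j+1:, j])[0]
        if below.size:
            parent[j] = j + 1 + below[0]
    return parent

def critical_path_length(parent):
    """Height of the tree, i.e. the least number of nodes processed sequentially."""
    n = len(parent)
    level = np.ones(n, dtype=int)
    for j in range(n):                       # children always have smaller indices
        if parent[j] != -1:
            level[parent[j]] = max(level[parent[j]], level[j] + 1)
    return level.max()

# Nonzero pattern (True = nonzero) of a small lower-triangular factor.
L_pattern = np.array([[1, 0, 0, 0],
                      [1, 1, 0, 0],
                      [0, 1, 1, 0],
                      [0, 0, 1, 1]], dtype=bool)
parent = elimination_tree(L_pattern)
print(parent)                        # [1 2 3 -1]
print(critical_path_length(parent))  # 4: the columns form a single chain
```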

To construct the model on FPGA, we define these variables and their descriptions as listed in Table 1.

Assuming that the data initially reside in DDR3 and the results are finally written back to DDR3, the total data traffic RW contains the input data, the output data, and the intermediate values. The intermediate values must be stored to and loaded from DDR3 when the small on-chip memory cannot hold them. Thus
$$RW = \mathit{input} + \mathit{output} + \mathit{temp}.$$
For a given problem, the input and output data volumes are fixed; only $\mathit{temp}$ is determined by both the problem itself and the FPGA on-chip memory.

The amount of operations (AO) is determined by the number of nonzero entries of $L$. Scaling column $j$ takes one square-root operation and $n_{j}$ division operations, where $n_{j}$ is the number of subdiagonal nonzeros in that column. Given that the update and frontal matrices are symmetric, only the lower part of each matrix needs to be considered, so the number of multiplication operations for column $j$ is $n_{j}(n_{j}+1)/2$, and the number of additions is the same. AO is the sum of all these operations:
$$AO = n + \sum_{j} n_{j} + 2 \sum_{j} \frac{n_{j}(n_{j}+1)}{2}.$$
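A minimal sketch of this counting argument, assuming $n_{j}$ denotes the number of subdiagonal nonzeros of column $j$, is given below; the column counts in the example are made up.

```python
def operation_count(subdiag_nnz):
    """Estimate the amount of floating-point operations (AO) for the factorization,
    given n_j = number of subdiagonal nonzeros in each column j of L."""
    sqrt_ops = len(subdiag_nnz)                      # one square root per column
    div_ops = sum(subdiag_nnz)                       # one division per subdiagonal nonzero
    mult_ops = sum(nj * (nj + 1) // 2 for nj in subdiag_nnz)  # lower part of each update
    add_ops = mult_ops                               # one addition per multiplication
    return sqrt_ops + div_ops + mult_ops + add_ops

# Example: a 4-column factor whose columns have 3, 2, 1 and 0 subdiagonal nonzeros.
print(operation_count([3, 2, 1, 0]))   # 4 + 6 + 10 + 10 = 30
```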

The chip resources contain on-chip memory and computational resources. The on-chip memory is the size of the memory that could be loaded immediately. It is a critical factor in RW. Moreover, computational resources determine the extent of parallel execution. In the next two subsections, we construct two different performance models for both memory- and computational-bounded cases.

4.1. Memory-Bound

Assuming the FPGA has enough computational resources and the system is entirely memory-bound, the performance is determined by the memory access time; thus, the time of decomposition can be expressed as
$$T_{\mathrm{mem}} = \frac{RW}{BW}.$$

For DDR3, once the data storage format and the access pattern are determined, BW is determined. RW is a function of the on-chip memory and the matrices. When memory-bound, the peak performance is
$$\mathit{Perf}_{\mathrm{mem}} = \frac{AO}{T_{\mathrm{mem}}} = \frac{AO \cdot BW}{RW}.$$
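As a small worked example, the sketch below evaluates these two expressions for assumed values of AO, RW, and BW; the numbers are illustrative, not measurements.

```python
def memory_bound_time(rw_bytes, bw_bytes_per_s):
    """T_mem = RW / BW: total traffic to and from DDR3 divided by the sustained bandwidth."""
    return rw_bytes / bw_bytes_per_s

def memory_bound_performance(ao_flops, rw_bytes, bw_bytes_per_s):
    """Peak performance when memory-bound: AO / T_mem = AO * BW / RW (FLOP/s)."""
    return ao_flops * bw_bytes_per_s / rw_bytes

# Example: 3e9 operations, 16 GB of traffic, 12.8 GB/s of DDR3 bandwidth.
print(memory_bound_time(16e9, 12.8e9))                # 1.25 s
print(memory_bound_performance(3e9, 16e9, 12.8e9))    # 2.4 GFLOP/s
```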

4.2. Computational-Bound

Figure 4(a) demonstrates the decomposition process of the matrix shown in Figure 3 with two PEs. The bottleneck appears in the first PE; since the extended-addition and update modules can execute in parallel, the total time is determined by this PE.

From Figure 4(b), a fixed number of clocks is needed to finish the scale operation of a column when the operations are pipelined; similarly, the update operation of a column takes a fixed number of clocks. $t_{\mathrm{sqrt}}$, $t_{\mathrm{div}}$, $t_{\mathrm{mult}}$, and $t_{\mathrm{add}}$ are the clock counts of the square root, division, multiplication, and addition operations; in our implementation, these modules need 16, 27, 1, and 8 clocks, respectively. Generally, when the number of nonzeros in a column is much larger than these latencies, the pipeline fill time can be neglected, and we obtain the per-node clock count used below.

Assuming that resources are sufficient, the time for one node becomes independent of the number of nonzeros: all elements are processed in parallel, and each node takes the time of one square root, one division, one multiplication, and one addition executed sequentially. Besides that, only the nodes on the critical path are counted in the total time. In this ideal case, the total time is determined solely by the length of the critical path, and the corresponding performance reaches the roofline value [21].

To be more realistic, we assume that one PE contains $MVL$ lanes, so after the diagonal element of a node is processed, up to $MVL$ nonzeros are handled in parallel. We further assume that one PE requires a fixed number of resource units; when the computational resources are bounded and the FPGA provides a given number of resource units, one FPGA can contain $p$ PEs, which means that at most $p$ nodes are processed in parallel.

Assuming that the bottleneck PE is assigned $k$ nodes, the number of clocks it needs is $k$ times the per-node clock count. If the FPGA working frequency is $f$, the total processing time is this number of clocks divided by $f$.

Given that at most $p$ nodes can be processed in parallel, decomposing the $l$th level of the elimination tree, which contains $k_{l}$ nodes, needs $\lceil k_{l}/p \rceil$ sequential node steps. Summing over the whole elimination tree gives the total number of nodes processed sequentially. Consequently, when the computational resources are bounded, the computation time is this count multiplied by the per-node time.

When computational resources and data dependency are both taken into consideration, the processing time is therefore the larger of the critical-path time and the resource-limited time described above.

Furthermore, the corresponding performance is AO divided by this processing time.
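The sketch below illustrates this estimate under a simplifying assumption that every node costs the same number of clocks: each level with $k$ nodes needs $\lceil k/p \rceil$ sequential node steps on $p$ PEs. The tree shape, clock count, and frequency in the example are made up.

```python
import math

def compute_bound_time(nodes_per_level, clocks_per_node, p, freq_hz):
    """Estimate decomposition time when computational resources are bound to p PEs.
    nodes_per_level[l] = number of elimination-tree nodes on level l,
    clocks_per_node    = clocks one PE needs to process one node (simplified constant)."""
    total_steps = sum(math.ceil(k / p) for k in nodes_per_level)
    return total_steps * clocks_per_node / freq_hz

# Example: a tree with 8, 4, 2 and 1 nodes per level, 2 PEs, 1000 clocks per node,
# and a 200 MHz clock.
print(compute_bound_time([8, 4, 2, 1], clocks_per_node=1000, p=2, freq_hz=200e6))
```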

4.3. Summary

When considering both the I/O bandwidth and the computational resources, the overall performance follows from (4), (5), (6), (7), (8), and (9) as the smaller of the memory-bound and computational-bound estimates:
$$\mathit{Perf} = \min(\mathit{Perf}_{\mathrm{mem}}, \mathit{Perf}_{\mathrm{comp}}).$$

The analysis and summary are demonstrated in Table 2. Once the parameters of FPGA and matrix are ascertained, we can obtain the peak performance and bottleneck.

Specifically, when the system is computational-bound, the performance is determined by two factors, whose influence is discussed in Table 3. Both factors depend on the number of PEs and the number of lanes in each PE, and greater values of these two parameters do not guarantee better performance: beyond certain values, both factors reach their limits.
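The sketch below shows how such a parameter search could be organized on top of the model: it sweeps candidate (number of PEs, lanes per PE) pairs, discards those that exceed the resource budget, and keeps the best predicted performance. The cost and performance functions are placeholders, not our model.

```python
def best_configuration(resource_budget, pe_cost, candidates, predict_perf):
    """Return the (performance, p, m) triple with the highest predicted performance
    whose resource usage p * pe_cost(m) fits into the available FPGA resources."""
    best = None
    for p, m in candidates:
        if p * pe_cost(m) > resource_budget:
            continue
        perf = predict_perf(p, m)
        if best is None or perf > best[0]:
            best = (perf, p, m)
    return best

# Placeholder cost/performance functions for illustration only.
pe_cost = lambda m: 100 + 50 * m                  # resource units per PE with m lanes
predict_perf = lambda p, m: min(p * m, 16) * 1e8  # performance saturates at 16 parallel lanes
candidates = [(p, m) for p in (1, 2, 4, 8) for m in (1, 2, 4)]
print(best_configuration(resource_budget=2000, pe_cost=pe_cost,
                         candidates=candidates, predict_perf=predict_perf))
```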

5. Experimental Evaluation

The matrices used in this paper are part of the University of Florida sparse matrix collection and come from fields such as electromagnetics and structural design. The data format is double-precision floating point (64 bits), requiring 8 bytes per entry when using the Compressed Sparse Column format and ignoring the indices.

5.1. Accuracy of Performance Modeling

This paper implements a double-precision floating-point version of the sparse Cholesky algorithm on a Xilinx Virtex-7 FPGA VC709 evaluation board (xc7vx485t-2ffg1761). We implement this version at four different scales (1 PE with 1 module, 1 PE with 2 modules, 2 PEs with 1 module each, and 2 PEs with 2 modules each) to verify our performance model. The memory chip in our platform is 6.7 MB and the frequency is 200 MHz.

We adopt the following steps in our experiments: (1) predicting the performance of commonly used matrices with our model, (2) measuring the actual performance of the same matrices on our implementation, and (3) assessing the consistency of the predicted and measured values. The results are shown in Figure 5 for accuracy evaluation and in Figure 6 for the performance difference rates. In Figure 5, $p$ is the number of PEs and $m$ is the number of modules in one PE. The percentage of relative difference between a predicted value and a measured value is calculated as $|P - M|/M \times 100\%$, where $P$ is the predicted value and $M$ is the measured value. Many optimizations exist in the hardware implementation (such as the DDR storage access mode and hardware configuration parameters). After optimizing the implementation and selecting some general classes of matrices to obtain better agreement, we make the following observations from our experimental data.

The accuracy of the model is critical because we need it to infer large-scale parallel performance, so we focus on the performance and on the difference rate of the performance. Figure 5 compares the measured performance with the performance predicted by our model for four matrix sets (shipsec1, nd24k, Trefethen_20000b, and nd3k) at different parallel scales (1 PE 1 module, 1 PE 2 modules, 2 PEs 1 module, and 2 PEs 2 modules). The measured performances are relatively stable and accurate. Judged against the established literature on performance modeling [22-24], the predicted and measured values show good consistency. Figure 6 presents the performance difference rates: as the scale of parallelism increases, the difference rates drop from about 10 percent to less than 5 percent.

As shown in Figure 5, the estimated performance is slightly higher than the tested performance. This is because we simplify the data processing in the computational-bound model and because, in the memory-bound model, there is a deviation in the bandwidth actually achieved between DDR3 and the FPGA.

5.2. Optimization

To obtain a balance between the FPGA resources and the performance, we set the memory chip to 64 MB and choose the number of PEs and the number of modules per PE according to the model.

We compare the performance of our optimal implementation with four state-of-the-art supernodal solvers: the HSL_MA87 solver for multicore CPU platforms [15], the GPU version of the CHOLMOD solver for CPU-GPU platforms [16], and the subtree-based parallel methods for CPU-GPU platforms [17]. The factorization times obtained are shown in Table 4. We draw the following conclusions from the performance comparison.

The frequencies of the HSL_MA87, CHOLMOD, and GPU platforms are 2.5 GHz, 3.2 GHz, and 2.93 GHz, respectively (in [16], the CPU runs at 2.93 GHz and the GPU at 1.15 GHz), whereas our frequency is 200 MHz. After testing the matrix sets nd3k, nd24k, and Trefethen_20000b, our performance is on the same level as the GPU performance. When our work is implemented as a coprocessor, that is, when the frequency increases, our implementation has an even greater performance advantage, and our model can guide the design of such a coprocessor.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61602496).