Mathematical Problems in Engineering

Volume 2017 (2017), Article ID 3021591, 11 pages

https://doi.org/10.1155/2017/3021591

## Sparse Cholesky Factorization on FPGA Using Parameterized Model

School of Computer, National University of Defense Technology, Deya Road No. 109, Kaifu District, Changsha, Hunan 410073, China

Correspondence should be addressed to Yichun Sun; yichunsun@126.com

Received 6 March 2017; Revised 19 August 2017; Accepted 12 September 2017; Published 17 October 2017

Academic Editor: Oleg V. Gendelman

Copyright © 2017 Yichun Sun et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Cholesky factorization is a fundamental problem in many engineering and scientific computing applications. When dealing with a large sparse matrix, the numerical decomposition phase consumes the most time. We present a vector architecture to parallelize the numerical decomposition of Cholesky factorization. We construct an integrated analytical parameterized performance model to accurately predict the execution times of typical matrices under varying parameters. Our approach is general to accelerators and is restricted neither to field-programmable gate arrays (FPGAs) nor to application-specific integrated circuits. We implement a simplified module on an FPGA to validate the accuracy of the model. The experiments show that, for most cases, the differences between predicted and measured execution times are less than 10%. Based on the performance model, we optimize the parameters and obtain a balance between resources and performance after analyzing the performance of various parameter settings. Compared with state-of-the-art implementations on CPU and GPU, the performance with the optimal parameters is 2x that of the CPU. Our design also offers advantages, particularly in power consumption, and provides guidance for the design of future acceleration components.

#### 1. Introduction

In engineering and science computations, the solution of large sparse linear systems of equations is a fundamental problem in most applications. In general, two methods exist for solving this problem: the direct method and the iterative method. Among direct factorization algorithms, we have LU decomposition, QR decomposition, and Cholesky decomposition [1]. It should be noted that Cholesky decomposition is a special form of LU decomposition that deals with symmetric positive definite matrices; its computational complexity is roughly half that of the LU algorithm. The Cholesky algorithm is widely used because most matrices involved in applications are symmetric positive definite.
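To make the factorization concrete, the following is a minimal dense Cholesky sketch in pure Python (Cholesky–Banachiewicz ordering) applied to a classic 3x3 symmetric positive definite example. This is illustrative only and is not the paper's FPGA implementation.

```python
import math

def cholesky(A):
    """Return the lower triangular L such that A = L * L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares of already-computed factors.
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            # Subdiagonal entries of column j.
            t = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = t / L[j][j]
    return L

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
# For this example L = [[2, 0, 0], [6, 1, 0], [-8, 5, 3]]
```

Note that the factor only needs the lower triangle of A, which is why Cholesky costs roughly half as many operations as LU.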

The solution of large sparse linear systems of equations adapts quite easily to parallel architectures, so supercomputers and multiprocessor systems became the main focus of this line of research [2–4]. However, these systems are costly to turn into marketable products [5].

Many software- and hardware-based approaches have been developed to improve Cholesky decomposition. Field-programmable gate arrays (FPGAs) have unique advantages for such problems. First, a dedicated architecture can be designed according to the characteristics of the algorithm [6]. Second, the computational resources, memory, and I/O bandwidth are reconfigurable. Third, energy consumption and cost are relatively low while still meeting performance requirements. Fourth, an FPGA provides an experimental and verification platform for future coprocessor designs. Yang et al., 2010 [7, 8], presented Cholesky decompositions based on GPUs and FPGAs; their results show that the dedicated FPGA implementation has the highest efficiency. FPGA designs have been able to combine parallel floating-point operations with high memory bandwidth to achieve higher performance than microprocessors [9].

A sparse matrix, in which the ratio of nonzero values is low, is especially suitable for direct methods. In our implementation, we chose the Cholesky factorization method over LU decomposition.

We chose Cholesky factorization because it requires fewer operations: the matrix is symmetric, so only half of the factors need to be computed, and the method is well suited to parallel architectures. In parallel execution, there is less interprocessor communication, and much work exists on parallelizing Cholesky factorization [3, 4]. Its memory requirement is also significantly lower than that of other direct methods, because only the lower triangular matrix is used for factorization, and the intermediate and final results (the Cholesky factor L) overwrite the original matrix.

The three algorithms in the implementation of Cholesky factorization are the row-oriented, column-oriented, and submatrix algorithms; they differ in the order of the three nested loops [3]. Among them, the column-oriented algorithm is the most popular for sparse matrices. Once each column is regarded as a node, the data dependency of Cholesky factorization can be described as an elimination tree [10]. In the tree, parent nodes depend on their descendants: the descendants must update them before they can be decomposed. Thus, the process of factorization can be viewed as eliminating the tree from bottom to top. The algorithm has two main operations: the scaling operation on a column, cdiv(j), and the updating operation between dependent columns, cmod(j, k). Based on the updating order, the algorithm can be classified into left-looking and right-looking algorithms.
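The elimination tree can be read directly off the nonzero pattern of the factor: the parent of column j is the row index of its first subdiagonal nonzero, parent(j) = min{i > j : L[i][j] != 0}. The following sketch illustrates this definition on a small hypothetical 4x4 pattern (the function name and pattern are our own illustration, not from the paper).

```python
def elimination_tree(L_pattern):
    """Compute parent[] from a set of (row, col) nonzero positions of L.

    parent(j) = min{i > j : L[i][j] != 0}; -1 marks a root.
    """
    n = 1 + max(max(i, j) for i, j in L_pattern)
    parent = [-1] * n
    for j in range(n):
        # Row indices of subdiagonal nonzeros in column j.
        below = [i for (i, jj) in L_pattern if jj == j and i > j]
        if below:
            parent[j] = min(below)
    return parent

# Pattern of a 4x4 factor: column 0 updates column 2, column 1 updates column 3.
pattern = {(0, 0), (2, 0), (1, 1), (3, 1), (2, 2), (3, 2), (3, 3)}
parent = elimination_tree(pattern)
# parent = [2, 3, 3, -1]: columns 0 and 1 are leaves, column 3 is the root,
# so columns 0 and 1 can be factored in parallel (internode parallelism).
```

Production symbolic factorization codes build this tree more efficiently (e.g., with path compression), but the dependency structure it encodes is the same.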

A number of studies have optimized sparse Cholesky factorization on GPUs. Vuduc and Chandramowlishwaran mapped the submatrix tasks directly to the GPU [11], George scheduled the tasks on the GPU [12], and Lucas et al. used multiple CPU threads to execute small-scale computing tasks in parallel while using the GPU to speed up large-scale computing tasks [13]. The small-scale computing tasks lie at the bottom of the elimination tree, whereas the large-scale computing tasks lie at the top. Lacoste et al. modified the task scheduling module of the Cholesky factorization algorithm to provide a unified programming interface on CPU-GPU heterogeneous platforms [14]. Hogg et al. proposed a supernodal algorithm in the HSL_MA87 solver for multicore CPU platforms [15]. Chen et al. introduced a GPU version of the CHOLMOD solver for CPU-GPU platforms [16]. Zou et al. improved the generation and scheduling of GPU tasks with a queue-based approach and designed a subtree-based parallel method for multi-GPU systems [17].

To decrease storage and computation complexity, the sparse matrix is compressed into a compact format (CSC, Compressed Sparse Column format). However, during Cholesky factorization, many originally zero entries become nonzero; this fill-in increases memory overhead and dynamically changes the compressed data structure. To avoid this situation, the matrix should be preprocessed and symbolically decomposed before numerical factorization (symbolic decomposition generates the nonzero pattern of the matrix and its elimination tree, and reorganizing the data yields the vector arrays). Preprocessing exposes parallelism by reordering matrix columns, and symbolic decomposition is executed in advance to identify the sparsity pattern of the result matrix. With symbolic factorization techniques now well developed, numerical decomposition is the most time-consuming phase. This paper discusses and evaluates the performance of numerical decomposition based on the preprocessing and symbolic factorization results.
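As a reminder of what CSC stores, the sketch below converts a small dense matrix into the three CSC arrays: nonzero values, their row indices, and per-column offsets. This is a generic illustration; the function and variable names are not from the paper.

```python
def to_csc(dense):
    """Convert a dense matrix (list of rows) to CSC arrays."""
    n = len(dense)
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n):            # walk columns
        for i in range(n):        # collect the nonzeros of column j
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_idx.append(i)
        # Column j occupies values[col_ptr[j]:col_ptr[j + 1]].
        col_ptr.append(len(values))
    return values, row_idx, col_ptr

M = [[4, 0, 0],
     [1, 5, 0],
     [0, 2, 6]]
values, row_idx, col_ptr = to_csc(M)
# values  = [4, 1, 5, 2, 6]
# row_idx = [0, 1, 1, 2, 2]
# col_ptr = [0, 2, 4, 5]
```

Fill-in is disruptive precisely because inserting a new nonzero into `values`/`row_idx` shifts every later column's data and all subsequent `col_ptr` offsets, which is why the pattern is fixed symbolically before numerical factorization begins.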

After analyzing the Cholesky algorithm on large sparse matrices, we rewrote the algorithm in a vector version. To explore the relationship between parallel scale and performance and to predict performance on an FPGA platform, we also designed a corresponding vector architecture and constructed a parameterized model. Performance depends on the underlying architecture and system parameters: the capacity of on-chip memory, the amount of computational resources, and the I/O bandwidth are critical parameters. Our model takes these factors into account and focuses on their relationships. Using the model, we obtain the best performance when choosing the on-chip memory capacity and the amount of computational resources.

The rest of this paper is organized as follows. Section 2 introduces the vector version of the Cholesky algorithm. Section 3 describes our performance model in the ideal, memory-bound, and computation-bound cases. Section 4 validates the consistency between the model's predictions and the implementation and then analyzes the parameters to find an optimal configuration.

#### 2. Cholesky Factorization

Cholesky factorization decomposes a symmetric positive definite matrix A into A = LL^T, where L is a lower triangular matrix. The main operations of the algorithm are the scale operation on a column, cdiv(j), and the update operation between dependent columns, cmod(j, k). cdiv(j) divides the subdiagonal elements of column j by the square root of the diagonal element in that column, and cmod(j, k) modifies column j by subtracting a multiple of column k. The algorithm can be classified into left-looking and right-looking variants based on the updating order. The right-looking algorithm performs all updates using column k immediately after cdiv(k). By contrast, the left-looking algorithm accumulates all necessary updates for column j before cdiv(j).
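The two kernel operations and the left-looking schedule can be sketched as follows, using cdiv/cmod in the column-Cholesky convention and a dense working copy for simplicity (the paper's implementation operates on the compressed sparse form; this sketch is only meant to show the operation ordering).

```python
import math

def cdiv(L, j):
    """Scale column j: divide its subdiagonal entries by sqrt of the diagonal."""
    n = len(L)
    L[j][j] = math.sqrt(L[j][j])
    for i in range(j + 1, n):
        L[i][j] /= L[j][j]

def cmod(L, j, k):
    """Update column j using the already-factored column k (k < j)."""
    n = len(L)
    for i in range(j, n):
        L[i][j] -= L[i][k] * L[j][k]

def left_looking_cholesky(A):
    """Left-looking order: gather all cmod(j, k) for column j before cdiv(j)."""
    n = len(A)
    L = [row[:] for row in A]   # the factor overwrites a working copy
    for j in range(n):
        for k in range(j):      # accumulate pending updates for column j
            cmod(L, j, k)
        cdiv(L, j)
    return L

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = left_looking_cholesky(A)
# Lower triangle of L: [[2], [6, 1], [-8, 5, 3]]
```

A right-looking variant would instead, immediately after cdiv(k), apply cmod(j, k) to every later column j that depends on column k; for a sparse matrix only the columns in k's elimination-tree ancestors are touched.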

From Algorithm 1 we can note that the parallelism lies mainly among columns with no data dependency and among the elements within a column, corresponding to parallel elimination along different paths of the elimination tree (*internode parallelism*) and parallel scaling of the different subdiagonal elements within a node (*intranode parallelism*). Moreover, because the columns of a dense matrix depend on each other strictly in order, parallelism among columns exists only in sparse matrix decomposition.