Recent Advances in Industrial Mathematics and ApplicationsView this Special Issue
GPU Preconditioning for Block Linear Systems Using Block Incomplete Sparse Approximate Inverses
Solving sparse triangular systems is the building block for incomplete LU- (ILU-) based preconditioning, but parallel algorithms, such as the level-scheduling scheme, are sometimes limited by available parallelism extracted from the sparsity pattern. In this study, the block version of the incomplete sparse approximate inverses (ISAI) algorithm is studied, and the block-ISAI is considered for preconditioning by proposing an efficient algorithm and implementation on graphical processing unit (GPU) accelerators. Performance comparisons are carried out between the proposed algorithm and serial and parallel block triangular solvers from PETSc and cuSPARSE libraries. The experimental results show that GMRES (30) with the proposed block-ISAI preconditioning achieves accelerations 1.4 × –6.9 × speedups over that using the cuSPARSE library on NVIDIA Tesla V100 GPU.
Sparse triangular solves are essential steps for the incomplete LU- (ILU-) factorized preconditioning  in a Krylov subspace solver. However, they are sequentially designed and pose a performance challenge on today’s parallel computers. To make full use of the state of art parallel architectures such as multicore CPUs and GPU accelerators, various works have been studied and presented. The well-known algorithm is the level-scheduling scheme [1–5] which groups the possible parallelism in levels of sets each of which can be performed in parallel. The implementations on multicore architectures and GPU accelerators are discussed in [1, 4, 5] and [2, 3], respectively. The NVIDIA’s cuSPARSE  library offers a function interface for this algorithm which can be called by users directly in practice. However, there are two drawbacks in some of the applications. One is that the extracted parallelism depends on the sparsity pattern of the matrix; thus, the parallelism and performance is highly limited for some cases. The other is that it is usually a time-consuming job to search parallelism (known as preprocessing) before the actual computation is conducted. Liu et al. [7, 8] proposed a method in which the preprocessing step is not necessarily performed when the sparse matrix is expressed in compressed sparse column (CSC) format, and they showed performance improvement on GPU over the level scheduling method in cuSPARSE.
Compared to using exact computation for preconditioning, some inexact preconditioning ideas [9–13] are attractive in recent years because they seek a tradeoff between exactness and parallelism. Chow et al. [12, 13] proposed a fine-grained algorithm to compute ILU factorization asynchronously on Intel MIC architecture  and GPUs. They showed that 5 asynchronous sweeps usually make the inexact ILU factorization comparable to the exact one, but with a significant factor of speedups. Anzt et al.  implemented an asynchronous method on GPU for solving triangular systems using several number of Jacobi iterations. Although the inexact preconditioning step results in more number of solver iterations, it shows an advantage over the exact preconditioning in terms of the total compute time of the linear solver. Some methods are based on sparse approximate inverses (SAI) [15–19] where the matrix inverse of the triangular factors is estimated through solving least squares problems on a preset pattern. And the resulting inverses can transform the preconditioning step into sparse matrix-vector multiplications with much more possible of concurrency computing on parallel computers. Most recently, Anzt et al. [10, 11] proposed an incomplete SAI (ISAI) algorithm in which the least square problems are replaced with solving square systems with cheaper computations and faster convergence.
Linear systems arising from block sparse matrices are widely used in scientific computing especially in multiphysics problems. Most of the numerical algorithms for block linear systems are derived from that for scalar systems. Even though they are quite similar to each other in a mathematical formula, the implementation strategies and performance tuning techniques could be very different from each other, especially on GPU accelerators. For example, the block sparse matrix-vector multiplication (BSpMV) on GPU  is performed in the multiplications of blocks and vectors, whereas a general SpMV is realized in scalar multiplications. Due to the principles of global memory access and the use of shared memory on the GPU, direct migration of the code from scalar case to block case may hamper the performance much . Therefore, algorithms for block matrices usually require redesigned work. In multiphysics problems, the coupling feature of the physical fields results in block matrices that usually have blocks with a small size such that the inverse of a block can be explicitly expressed. Motivated by this kind of numerical applications and the ISAI preconditioning proposed by Anzt et al. [10, 11], we focus on the GPU preconditioning in a block format in this study. The main contributions of this study are the following.(1)The GPU preconditioning framework [10, 11] is extended to block matrices with block sizes up to 5.(2)An efficient, warp-based GPU implementation exploiting fine-grained programming model for block-ISAI preconditioning is proposed and elaborately explained.(3)Detailed comparisons are made between the proposed algorithm and block triangular solvers from popular libraries, including PETSc  and cuSPARSE . On block matrices selected from the SuiteSparse collection  and real multiphysics areas, the proposed algorithm shows an advantage over the PETSc’s serial and cuSPARSE’s parallel implementations of block triangular solvers in terms of total computing time for GMRES (30).
The rest of this study is organized as follows. In Section 2, some backgrounds, including sparse approximate inverse (SAI), ISAI, and block matrices, are introduced. In Section 3, the GPU preconditioning framework for block linear systems is proposed, and the strategy for GPU implementation is introduced and discussed. In Section 4, performance results and comparisons for four testing cases are shown. Concluding remarks are given in Section 5.
2.1. Incomplete Sparse Approximate Inverses
For a given sparse matrix with rows and columns, the SAI algorithm [1, 15–19] gives an approximation of the inverse of A by minimizing the Frobenius norm of aswhere is the estimated inverse of for a given sparsity pattern , is the identity matrix, and represent the column of and , respectively. Solving (1) is equivalent to solvewhere is the non-zero pattern for , represents the indices of the rows that contain non-zero values at columns, and is a submatrix extracted from the rows and columns of . Note that if the diagonal values of are non-zeros, and (2) can be viewed as the solve of a batch of least-squares problems. The SAI algorithm fits parallel computers because the problems in (2) can be solved independently of each other.
This simplification makes a square matrix, and the least-squares problems in (2) are changed into linear systems in (3). In this way, the ISAI can be easily applied to the matrices having special sparsity patterns. For example, in the case where incomplete LU (ILU) factorizations-based preconditioning is performed, the inverse of the lower triangular factor (L) and upper triangular factor (U) are approximately calculated with high concurrency using (3) [10, 11]. This avoids solving large sparse triangular systems which are regarded as the bottleneck in today’s state-of-art parallel computers and accelerators.
2.2. Block Sparse Matrix
Block sparse matrices are widely used in scientific computing applications, especially in multiphysics problems. Figure 1 shows a block sparse matrix with a block size of 2. It is neither a general sparse matrix nor a dense matrix, but it falls somewhere in between. It has the sparse property in global but dense features in local blocks.
There are many storage formats for block sparse matrices, such as block compressed sparse row (BCSR) and block compressed sparse column (BCSC) in PETSc  and cuSPARSE . Different from the general CSR format for storing scalar values, BCSR format treats each block as a unit and stores all non-zero positions block by block (Figure 1). The values in each block are stored consecutively either in a row-major or column-major format. In the following parts of this study, block matrices are focused on, and column-major format is employed to store non-zero blocks in BCSR or BCSC formats.
3. Block-ISAI Preconditioning on GPUs
To solve an asymmetric block linear system , the Krylov subspace-based generalized minimal residual (GRMES) algorithm  with right preconditioning is employed aswhere is a preconditioner.
The well-known method to construct a preconditioner is to perform a block incomplete LU factorization (BILU) on [21, 24]. Then, the preconditioning step in which the inverse of applies to a vector , i.e., , can be executed bywhere L and U are the block lower and upper triangular matrices factorized by , respectively, and is the preconditioned vector.
3.1. Block-ISAI on GPU
Traditionally, the blocked-based forward and backward substitutions are applied to the two triangular systems in (5). However, these are totally sequential operations. Instead of computing (5) accurately, Anzt et al. [10, 11] introduced an inexact idea in which the inverse of the lower and upper factors are estimated by highly parallel ISAI, and (5) is transformed into the problems of ISAI relaxation steps or ISAI SpMV. This method seeks a tradeoff between parallelism and exactness and shows an advantage in terms of total computing time of a linear solver.
Motivated by the scalar version of ISAI on a single GPU [10, 11] and the block version  on Intel’s MIC (many integrated core) architecture , we propose a GPU-enabled block-ISAI algorithm for preconditioning in block matrices. For the convenience of our statement, we assume all the matrices in the following discussion are block matrices.
In the context of incomplete factorization preconditioners, we applied the incompleteness idea of (3) to the block lower factor L and upper factor U to have the estimations of the inverses, NL and NU, for the factors aswhere and are the indices of non-zero blocks at the block column of L and U, respectively. As and are square block matrices, the solve of (6) is equivalent to the solve of a series of triangular systems in the block format independently
The sizes of the systems in (7) are determined by the numbers of non-zero blocks in block columns, and they usually are much smaller than the general large sparse systems. Therefore, the solve of the systems provides relatively fine-grained parallelism that fits the GPU architecture. We show the GPU-enabled block-ISAI in Algorithm 1.
To compute the preconditioned vector , Algorithm 1 applies the preconditioner to the input vector on the GPU. Specifically, the GPU accepts the preconditioner expressed in two block triangular factors L and U and a vector as input parameters and outputs the preconditioned vector . This can be implemented in two phases. The first phase aims to obtain the approximate inverses for L and U (denoted NL and NU, respectively) by the block-ISAI consisting of four main steps. The first step gives the guesses of the sparsity patterns for NL and NU. Following the strategy in [10, 11], we use and as the sparsity patterns for NL and NU, respectively, where and represent the sparsity patterns for the multiplications of times of absolute values of L and U, respectively. The second step extracts two block triangular systems, denoted and , according to the non-zero patterns and at each column of NL and NU. The extraction process is illustrated in Figure 2, which shows how the block lower triangular system corresponding to the fourth column of NL is formed. The non-zero pattern for NL is constructed using which is denser than . The non-zero pattern for the fourth column of NL provides a row and column set for indexing L, and then, the 4th, 5th, 7th, and 8th block rows and columns are extracted from L as a small block lower triangular matrix. In the third step, looping over all block columns of NL (NU) and storing all corresponding triangular matrices consecutively, we form two groups of small block triangular systems, LGSBTS and UGSBTS. The fourth triangular system in the LGSBTS, also shown in Figure 2, is formed bywhere is the solution for the fourth block column of NL, and is the right hand side with the first block being identity matrix. The last step is to solve the LGSBTS and UGSBTS block column by block column for the solutions of NL and NU, respectively. Generally, since the preconditioning matrix remains unchanged during the solve of , the first phase needs to be performed only once.
Once the approximate solutions of NL and NU are obtained, the second phase can be accomplished by performing two block sparse matrix-vector multiplications, i.e., and . These two operations must be conducted in every iteration because changes at every iteration.
3.2. GPU Implementation
Although the block version of the ISAI looks similar to the scalar version, the computations in the two versions are entirely different. For example, in (8) consists of four blocks instead of four scalars, matrix-matrix multiplications is conducted instead of scalar-scalar multiplications, and the inverse of a non-zero block is computed instead of the inverse of a scalar. Therefore, to make full use of fine-grained features of a GPU, the implementations and tuning strategies for the two versions are also different. In the following discussion, an implementation of Algorithm 1 is introduced by developing several efficient GPU kernels and exploiting the cuSPARSE library .
For the first step in S1 of Algorithm 1, the essence of estimating the sparsity patterns of NL and NU is to conduct block matrix-matrix multiplication, which usually involves symbolic multiplication and numerical multiplication. However, since only the non-zero patterns of NL and NU are needed, only the symbolic multiplication of and is employed on the GPU. This is realized by calling the presetup function which estimates the non-zero pattern for the multiplication result of two general sparse matrices.
In the scalar version of the ISAI, there are two strategies introduced in [10, 11] for the GPU implementation, including the following three steps. One can separately design three kernels, each of which is responsible for a single step of Algorithm 1. In this way, the LGSBTS and UGSBTS must be formed explicitly for the solve . An alternative method  is to merge all three steps into a single kernel in which the data (or ) extracted from L (or U) are directly used for the partial solution of one system from LGSBTS (UGSBTS). This means each system is formed temporarily, and the data for one system could be overwritten by subsequent systems. The main advantage of the first strategy is that one can explicitly express the LGSBTS (or UGSBTS) in an order that the memory pattern is perfect for coalesced memory accesses on the GPU. However, the drawback is obvious. To solve NL’s block column with blocks, according to (7), it requires to explicitly storing the corresponding lower triangular block matrix that contains blocks, i.e., scalar elements, where s is the block size. By adding the number of blocks up for all block columns in NL targeting parallel computing, it requires approximately double precision memory spaces for storing all matrices in (7). It is estimated that explicitly storing LGSBTS for = 200, 000, = 3, and requires over 750 MB of memory, and the memory requirement exceeds 2 GB when the block size is 5. The memory requirements will double if the LGSBTS and UGSBTS are both stored. By considering the space complexity of the first strategy and the limited memory resources on the GPU, in the block format, explicitly forming the LGSBTS and UGSBTS in the implementation proposed in this study is avoided.
A key point in S1 is the solving of the small block triangular systems (SBTSs) that fit the GPU architecture very well because they are independent of each other. However, the GPU strategy [10, 11] for the scalar version is not an optimized choice for the block version here because mapping a thread to a matrix block for numerical operations leads to significantly uncoalesced memory accesses on the GPU. Moreover, the experiments on block sparse matrix-vector multiplication in  show that using consecutive threads (usually 32 threads in a warp) collaboratively for the computing of consecutive blocks is good for bandwidth utilization. Therefore, 32 threads in a warp are employed for the solve of a SBTS. The idea of the algorithm is shown in Figure 3, which shows the first warp in a thread block of size 128 for the solution of (8) with block size 4 in parallel. The small block triangular matrix (SBTM) is extracted from L in columns, so that a warp can handle a column with possible data concurrency. Specifically, when the block column is accessed by a warp, the block of , denoted as , is ready to be solved, and once is obtained, it can contribute to the solving of by performing in parallel
Taking the SBTS’s 0th block column consisting of four blocks as an example, the 0th block is used to solve , and this is followed by accumulating and simultaneously. As a warp can only cover two blocks, it restarts the computation for the rest (i.e., ) in the next cycle, until it has gone through all blocks in the column.
The GPU kernel developed by the CUDA C/C++ language for solving the LGSBTS in Algorithm 2 is now described. The one-dimensional thread organization is employed. Since a warp (32 threads) is chosen and implicitly dispatched by CUDA (compute unified device architecture), for solving of a SBTS, the total number of launched threads is where is the number of block rows (columns) of the matrix. The total number of CUDA blocks is calculated by ceil (. By default, the one-dimensional block organization is employed, but when the number of blocks exceeds the maximum number of blocks along one-dimension provided by the CUDA architecture, the two-dimensional block arrangement is conducted.
To identify which warp corresponds to the desired column of NL, a global thread identifier is first transformed to the global warp identifier and relative warp identifier in a thread block (lines 5 and 6). Then, the local index within a warp for a thread, denoted , can be obtained by . For a block size , the maximum number of complete blocks a warp can cover at a time is floor , and the maximum number of active threads in a warp is estimated by . As a result, the following task of the kernel (from lines 10 to 42) is limited in range threads. As illustrated in Figure 3, the warp goes through the row indices of the block column of NL. For each row index , the first task is to extract the blocks at the rows in the block column of L. Since L is expressed in BCSR format, the block column cannot be accessed directly and continuously. Moreover, a warp may not be able to cover all blocks in the column simultaneously, and the type of computation on the first (diagonal) block is different from that on the remaining blocks in the column. By considering these issues, the task of going through a column of L is divided into two subtasks.
First, the first threads solve a block of the solution by performing a block-block multiplication (lines 17–24). To exploit the fine-grained feature of the GPU, the computing of only one element for the result of multiplication is assigned to a thread. The thread corresponding to the row and column of the result is responsible for the inner product between the row of the left-hand block and column of the right-hand block. Because values in the two blocks are repeatedly visited, the multiplication is implemented through the shared memory, where the two operating blocks are loaded to avoid repeated access from the global memory. Note that the first block (diagonal block of a SBTM) is inversed by using the symbolic expression of the adjoint matrix of a block. For example, letting be a block, then the row and column of the adjoint matrix of B, denoted as , can be expressed explicitly asand therefore, each thread of the threads is only computing one element of the adjoint matrix. The symbolic expressions for and can be derived in a similar way.
Second, the entire warp executes (9) by going over all remaining blocks in the column (lines 25–40). In the circumstance in which a warp is unable to cover all of the remaining blocks at one time, several cycles are performed until the end block is reached. In each cycle, before loading blocks into the available shared memory, every threads search from right to left to identify whether there is a non-zero block at the column. The block is set to a zero block if the searching fails. Once the searching is completed, up to floor blocks update the right-hand side by block-block multiplications.
The implementation for the preconditioning step comprises two block matrix-vector multiplications that can be realized by calling cusparseDbsrmv in the cuSPARSE library. Compared to performing triangular solves in a traditional way, the block matrix-vector multiplication offers more possible parallelism that may greatly reduce the computing overhead of the preconditioning step in each iteration.
4.1. Experiment Setup
A computing node of the GPU cluster was used for all experiments. The node is configured with two Intel E5-2640 [email protected] GHz CPUs, with 128 GB of memory and four Nvidia Tesla V100 cards. The Tesla V100 card, with a compute capability of 7.5, is configured by 4096-bit HBM2 16 GB memory and has 80 stream multiprocessors (SMs) with 5120 FP32 cores, 2560 FP64 cores, and 640 Tensor cores in total. The peak single-precision and double-precision floating point performance is 15.7 and 7.8 TFLOPS, respectively.
The algorithms presented in this study are implemented based on the PETSc-3.14.2 (portable, extensible toolkit for scientific computation) framework  and CUDA 10.0. The GPU version of the GMRES algorithm with restart 30, developed for block matrices based on PETSc’s data structure and cuSPARSE library, is used as the test solver.
For each test case, comparisons are made between different implementations for the preconditioning step. In the first implementation, the preconditioning step is realized using PETSc’s block triangular solver on the CPU, and the results are copied to the GPU for the remaining part of computing by the GMRES. The second implementation uses parallel block triangular solves on the GPU from the cuSPARSE library. The input for the cuSPARSE’s triangular solver is obtained by calling the ILU factorization routine in BCSR format. The last implementation is the block-ISAI preconditioning algorithm on the GPU as proposed in the present study. The lower and upper block triangular factors are provided by the block ILU factorization in PETSc and used as the input of Algorithm 1. For convenience, the wall clock time of computing the preconditioning step for 30 iterations is denoted as , the total computing time until the GMRES converges to the relative tolerance is denoted as , and the total number of GMRES iterations is denoted as .
4.2. Atmosmodd Case
The first test matrix, atmosmodd, is from the SuiteSparse matrix collection . It consists of 1,270,432 rows and columns and 8,814,880 non-zeros. The matrix is transformed into the BCSR format with block size 4 using cuSPARSE library . As a result, the block matrix consists of 317,608 block rows and columns and 2,190,844 non-zero blocks. The relative tolerance for GMRES convergence is set to .
Table 1 provides the time costs for 30 preconditioning steps and the GMRES (30) method, as well as the total number of GMRES iterations using different preconditioning methods. It is reported in  that the sparsity pattern for atmosmodd provides valid parallelism and make the level-scheduling algorithm on the GPU gain accelerations over serial triangular solvers on the CPU. It is observed from Table 1 that the level-scheduling scheme in block format is also more efficient than the serial block triangular solves implemented in PETSc, and approximately 12.4x speedup can be achieved. The preconditioning costs decrease significantly when block-ISAI and block-ISAI are applied, which is attributed to the high parallelism provided by the block sparse matrix-vector multiplication in the proposed algorithm. The resulting , however, is more than that produced by exact triangular solves using PETSc and cuSPARSE. This is because we only compute an approximate inverse of the lower and upper block triangular matrices based on a guessing sparsity pattern.
The overhead consisting of three parts for setting up block-ISAI(|X|) and block-ISAI is shown in Figure 4. As described in Algorithm 1, NL and NU are solved in columns, and thus, the BCSR to BCSC transformations are required for NL and NU after the estimations of sparsity patterns have been made. Once the block values in NL and NU are filled by the GPU kernel, they are transformed into BCSR formats for the block matrix-vector multiplications in the preconditioning step. Therefore, four format transformations in total have to be conducted. Compared to using block-ISAI , using block-ISAI results in a lower number of GMRES iterations but more computing time for the GMRES algorithm and constructing an effective preconditioner. Although the overhead of generating block-ISAI increases with making the sparsity pattern denser, it only costs 2.5% and 5.8% compared to for block-ISAI and block-ISAI , respectively.
By adding the overhead of setting up block-ISAI, the overall times for the above four methods, as well as the speedups over the baseline GPU GMRES (30) with PETSc’s preconditioning are shown in Figure 5. A speedup of 12.6x is obtained by block-ISAI over the baseline GMRES (30), which is the best among the listed methods. Compared to cuSPARSE, both versions of block-ISAI are faster.
4.3. af_shell3 Case
The second matrix, af shell3, is also from the SuiteSparse matrix collection . The size of the matrix is 504, 855 504, 855 with 17,588,875 non-zeros. By setting the block size to 5, the matrix is transformed into the block format. The number of block rows and columns is 100,971 and that of non-zero blocks is 703,555. The relative tolerance for GMRES (30) is set to .
In this case, , , and are considered for the estimation the inverses of L and U. Table 2 provides the comparisons of , , and obtained by the five methods. Note that the PETSc, cuSPARSE, and block-ISAI methods result in less than 30 GMRES iterations, so the corresponding values are reported using the actual number of iterations; otherwise, for 30 iterations is reported. It is observed that the exact triangular solver from cuSPARSE is efficient in extracting possible parallelism from the sparsity pattern, and it runs approximately 2.55x faster than the PETSc’s serial block triangular solver. However, the preconditioning overhead of cuSPARSE, , still accounts for 88% of the total computing time. In contrast, the block-ISAI method results in much less time for , but at the sacrifice of increasing the total number of GMRES iterations. In addition, decreases when the estimated patterns for inverses become denser.
By counting the overhead of setting up block-ISAI in Figure 6, the overall time and speedups are shown in Figure 7. Although the costs of block-ISAI are comparable to , the overall times obtained by block-ISAI methods are still much less than those in the PETSc’s and cuSPARSE implementations. Figure 7 indicates that all three block-ISAI with different sparsity patterns can achieve more than 5x over PETSc’s implementation. Block-ISAI with is the most efficient in terms of the overall time and is nearly 3.9x faster than the cuSPARSE implementation.
4.4. Lid-Driven Cavity Case
In the third experiment, a linear system with the block matrix from a lid-driven cavity case is solved on a 300300 mesh in the velocity-vorticity formulation using the five-point finite difference method. The block matrix, with a block size of 3, is extracted from the first Newton step of the nonlinear solver. The number of block rows and columns is 90,000 and that of non-zero blocks is 448,800. The relative tolerance is set to for GMRES (30).
It is observed from Table 3 that cuSPARSE on the Tesla V100 fails to accelerate the preconditioning step that is composed of two block triangular systems. This is due to the strong data dependency in the sparsity pattern. Block-ISAI, in contrast, can offer a better solution in this circumstance. Eventhough the number of GMRES iterations using block-ISAI is 14%, 34%, and 75% more than that using exact triangular solves, block-ISAI preconditioning is still more efficient in terms of . In Figure 8, the breakdown costs of block-ISAI with different sparsity patterns indicates that compared to , the overhead is only up to 4%, which keeps the cost from affecting the overall time.
Figure 9 shows the comparisons of overall times obtained by PETSc, cuSPARSE, and block-ISAI. Block-ISAI preconditioning shows an obvious advantage over the serial and parallel preconditioning from PETSc and cuSPARSE. It is further observed that both and obtained by block-ISAI preconditioning decreases upon increasing the density level of the sparsity pattern. GMRES (30) using block-ISAI preconditioning is approximately 6.9x faster than that using PETSc or cuSPARSE preconditioning.
4.5. Venkat50 Case
The last matrix is from the unstructured two-dimensional Euler solver provided by Venkatakrishnan at the SuiteSparse matrix collection . The fluid fields associated with mesh points are ordered continuously so that the resulting matrix has a block structure naturally, with a block size of 4, which can be viewed in Figure 10. The matrix has 15,606 block rows and columns with 107,362 non-zero blocks in total. The relative tolerance is for the GMRES (30) solver.
It is given in Table 4 that the preconditioning (triangular solver) using cuSPARSE is nearly 4 times slower than the serial ones provided by PETSc. This is normal because the level scheduling-based scheme cannot always be efficient on the GPU when slight parallelism is extracted from a given sparsity pattern . Compared to obtained by PETSc, using block-ISAI does not show an obvious advantage because the approximate inverses lack of accuracy and result in 6 times more GMRES iterations in . Upon increasing the density of the sparsity pattern, the decrease of both and is observed, and block-ISAI is shown to be the best preconditioner in term of . The breakdown of costs for computing block-ISAI upon increasing the dense levels is shown in Figure 11. The overall speedups on GMRES are calculated and shown in Figure 12 by counting the overhead for computing block-ISAI, and speed improvements of approximately 1.92x and 6.69x are achieved compared to PETSc’s and cuSPARSE’s preconditioning, respectively.
In this study, block-ISAI preconditioning for block sparse matrices is investigated by proposing an efficient, warp-based algorithm to approximate the inverses of block triangular factors on a Tesla V100 GPU. Instead of solving triangular systems globally for preconditioning, a group of small triangular systems with very high concurrency is solved to approximate the inverses and conduct block SpMV in the preconditioning step. The results on four cases from a matrix collection website and multiphysics areas show that the proposed GPU algorithm outperforms the serial and parallel block triangular solver-based preconditioning from the PETSc and cuSPARSE libraries in terms of the total computing time of the GMRES (30) algorithm. It is noted that all comparisons are performed under the condition that the L and U factors are provided because the target for comparison is the block triangular solves in the preconditioning step. Planned future work includes extending the algorithm to multiple GPUs that are connected by the NVLink technology using block-Jacobi or the additive Schwarz method (ASM) preconditioners, as well as the coupling of sparsity patterns and relaxation steps  for extreme cases.
The main matrix data used to support the findings of this study are available from the SuiteSparse matrix collection (https://sparse.tamu.edu/).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported by a grant from the National Key R&D Program of China (2019YFB1704202), the National Natural Science Foundation of China (61702438), and the Nanhu Scholar Program of XYNU and the Innovation Team Support Plan of University Science and Technology of Henan Province (19IRTSTHN014). The authors thank LetPub for its linguistic assistance during the preparation of this manuscript.
Y. Saad, Iterative Methods for Sparse Linear Systems, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2 edition, 2003.
M. Naumov, Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU, NVIDIA Corporation, Santa Clara, CF, USA, 2011, NVIDIA Technical Report NVR-2011-001.
H. Kabir, J. D. Booth, G. Aupy, A. Benoit, Y. Robert, and P. Raghavan, “STS-K: a multilevel sparse triangular solution scheme for NUMA multicores,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Austin, TX, USA, November 2015.View at: Google Scholar
J. Park, M. Smelyanskiy, N. Sundaram, and P. Dubey, “Sparsifying synchronization for high-performance shared-memory sparse triangular solver,” in Proceedings of International Supercomputing Conference, Springer International Publishing, Leipzig, Germany, June 2014.View at: Google Scholar
NVIDIA cuSPARSE, https://docs.nvidia.com/cuda/archive/10.0/cusparse/index.html. (accessed on 30 December 2020).
W. Liu, A. Li, J. D. Hogg, I. S. Duff, and B. Vinter, “Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides,” Concurrency and Computation: Practice and Experience, vol. e4244, pp. 1–18, 2017.View at: Google Scholar
H. Anzt, E. Chow, and J. Dongarra, “Iterative sparse triangular solves for preconditioning,” in Proceedings of European Conference on Parallel Processing, Springer, Berlin, Heidelberg, August 2015.View at: Google Scholar
H. Anzt, E. Chow, T. Huckle, and J. Dongarra, “Batched generation of incomplete sparse approximate inverses on GPUs,” in Proceedings of Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, IEEE, Salt lake, UT, USA, November 2016.View at: Google Scholar
Intel Many Integrated Core Architecture (Intel MIC Architecture)-Advanced. https://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html. (accessed on 31 December 2020).
A. C. N. V. Duin, “Scalable parallel preconditioning with the sparse approximate inverse of triangular matrices,” SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 4, 1999.View at: Google Scholar
R. Eberhardt and M. Hoemmen, “Optimization of block sparse matrix-vector multiplication on shared-memory parallel architectures,” in Proceedings of 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, Chicago, IL, USA, May 2016.View at: Google Scholar
Balay S.; Abhyankar S.; Adams M. F.; et al. PETSc Web Page. https://www.mcs.anl.gov/petsc.(accessed on 30 December 2020).
B. Yang and H. Liu, “Accelerating the GMRES solver with block ILU(k) preconditioner on GPUs in reservoir simulation,” Journal of Geology and Geophysics, vol. 4, no. 2, pp. 1–7, 2015.View at: Google Scholar
A. Kashi and S. Nadarajah, Fine-grain Parallel Smoothing by Asynchronous Iterations and Incomplete Sparse Approximate Inverses for Computational Fluid Dynamics, AIAA, Orlando, FL, USA, 2020.