Research Article  Open Access
Ichitaro Yamazaki, Stanimire Tomov, Jack Dongarra, "Computing LowRank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations", Scientific Programming, vol. 2015, Article ID 246019, 17 pages, 2015. https://doi.org/10.1155/2015/246019
Computing LowRank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations
Abstract
Lowrank matrices arise in many scientific and engineering computations. Both computational and storage costs of manipulating such matrices may be reduced by taking advantages of their lowrank properties. To compute a lowrank approximation of a dense matrix, in this paper, we study the performance of QR factorization with column pivoting or with restricted pivoting on multicore CPUs with a GPU. We first propose several techniques to reduce the postprocessing time, which is required for restricted pivoting, on a modern CPU. We then examine the potential of using a GPU to accelerate the factorization process with both column and restricted pivoting. Our performance results on two eightcore Intel Sandy Bridge CPUs with one NVIDIA Kepler GPU demonstrate that using the GPU, the factorization time can be reduced by a factor of more than two. In addition, to study the performance of our implementations in practice, we integrate them into a recently developed software StruMF which algebraically exploits such lowrank structures for solving a general sparse linear system of equations. Our performance results for solving Poisson's equations demonstrate that the proposed techniques can significantly reduce the preconditioner construction time of StruMF on the CPUs, and the construction time can be further reduced by 10%–50% using the GPU.
1. Introduction
In applied and numerical mathematics or in scientific and engineering simulations, we often encounter lowrank matrices, and more frequently, we encounter matrices whose submatrices are lowrank. We can reduce both computational and storage requirements of manipulating many of these matrices by taking advantages of their lowrank properties. In this paper, we study the performance of the following two algorithms for computing a lowrank approximation of a dense matrix on multicore CPUs and investigate the potential of using a GPU to accelerate the process: (1) the QR factorization with column pivoting (QP3) [1, 2] and (2) the QR factorization with restricted pivoting (QPR) [3, 4]. In addition, to study the performance of QP3 and QPR in practice, we integrate our implementations of QP3 and QPR into StruMF [5, 6] which is a recently developed software for solving a general sparse linear system of equations. StruMF algebraically exploits a lowrank structure, referred to as a hierarchically semiseparable (HSS) structure [7], of a coefficient matrix while computing and applying a preconditioner, and uses QP3 and QPR for computing the lowrank approximation of the dense submatrices. In many cases, the preconditioner construction time of StruMF is dominated by the lowrank approximation of the submatrices.
The rest of the paper is organized as follows. First, in Section 2, we review QP3 and QPR. Then, in Section 3, after analyzing the performance of StruMF on a CPU, we propose several techniques to improve the QPR performance on the CPU. Next, in Section 4, we describe our implementations of QP3 and QPR using a GPU. Finally, in Section 5, we show their performance on multicore CPUs with a GPU and its impact on the performance of StruMF. We provide our final remarks in Section 6. Throughout this paper, the th column of a matrix is denoted by , while is the submatrix consisting of the th through the th rows and the th through the th columns of .
2. QR Algorithms with Column Pivoting
An by matrix has a numerical rank with respect to a threshold when where are the singular values of in the descending order (i.e., ) (we assume that without the loss of generality). Then, an RRQR factorization of has a form where is an by orthonormal matrix, is an by uppertriangular matrix, and is a permutation matrix chosen to reveal the numerical rank: Since the interlacing properties of the singular values [8] states that satisfying (3) or (4) is equivalent to finding the permutation matrix such that one of the following two tasks is satisfied, respectively: More detailed discussion on the RRQR factorizations can be found in [9] and the references therein.
In the following subsections, we review the existing algorithms to compute such RRQR factorizations. Namely, we first review the blocked versions of Householder QR [8] (Section 2.1). We then outline QP3 [1, 2] which is a greedy algorithm for solving task1 and is implemented in LAPACK [10] (Section 2.2). Next, we describe how an algorithm solving task1 can be modified to solve task2 and present three types of socalled hybrid algorithms [9] that are theoretically guaranteed to solve task1 or task2, or both (Section 2.3). Finally, we discuss QPR [3] that uses the hybrid algorithm as a postprocessing scheme to reduce the computational bottleneck of QP3, while ensuring the rankrevealing properties of the computed factorization (Section 2.4).
2.1. Blocked QR Algorithm
The th step of Householder QR [8] generates the Householder transformations to zero out the offdiagonal elements in the th column of (for ). To improve the data locality of the factorization, a blocked version of the algorithm (QR3) accumulates Householder transformations and uses a BLAS3 to apply the accumulated transformations at once to the trailing submatrix; that is, for , and (we assume that both and are multiples of , but the discussion can be easily extended for other cases), where is the product of the Householder matrices; that is, and . (The matrices is not explicitly formed. Instead, we store in the lowertriangular part of and store in an length vector.) This matrix can be represented in a socalled form [11]: where . Since , these Householder matrices are applied to the vector in addition to the matrix ; this blocked algorithm requires about times more floating point operations (flops) than the unblocked algorithm. However, on a modern computer, the blocked algorithm takes advantage of the memory hierarchy and often obtains a significant speedup over an unblocked algorithm.
The additional by workspace to store can be reduced by factorizing into the following form [12]: where is an by uppertriangular matrix such that and is an length vector given by . Algorithm 1 shows the pseudocode of the resulting blocked QR algorithm with BLAS3 (QR3).

2.2. QP3 Algorithm
At each step of an RRQR factorization, selecting an optimal pivot to satisfy task1 (7) or task2 (8), or both, is likely to be a combinatorial optimization problem. To reduce the computational cost, a greedy algorithm is generally used. In this section, we discuss such a greedy algorithm for solving task1. Specifically, at the th step (for ), assuming that the wellconditioned columns of have been already selected and factorized, the next pivot column is selected from the remaining columns such that the smallest singular value of the selected columns are maximized: Since selecting such a column is still computationally expensive, heuristics are used. First, since a Householder transformation can zero out the elements of below the diagonal, we have where . Furthermore, we can approximate the smallest singular value of the matrix by the reciprocal of the largest row norm of its inverse [9, 13]: In addition, since is assumed to be wellconditioned, all the row norms of are expected to be small. Hence, it leads to the following approximation: where Based on these heuristics, QR with column pivoting (QRP) [1] selects the next pivot column that is farthest away in the Euclidean norm from the subspace spanned by the already selected columns (i.e., the column with the largest ). Since an orthogonal transformation does not change the column norms, once these norms are initialized ( for ), it can be cheaply downdated at the end of the th step ( for ).
Algorithm 2 shows a blocked version (QP3) of QRP [2]. One difference between QP3 and QR3 of Algorithm 1 is at the trailing submatrix update. Specifically, QR3 stores the product of the Householder matrices in the form (10) (or using the matrix to reduce the memory requirement) and updates the trailing submatrix by two matrixmatrix multiplies (our implementation takes advantage of the triangular structures of both and ); that is, after the th panel factorization, the trailing submatrix is updated by where . On the other hand, the th step of the th QP3 panel factorization computes the th column of the matrixmatrix product and uses the resulting vector to update the th row of , which, in return, is needed to downdate the column norms of the trailing submatrix (Steps 1.5 and 1.6 of Algorithm 2). Finally, QP3 updates the trailing submatrix by a single matrixmatrix multiply, . Since QP3 saves the result of the matrixmatrix product in the auxiliary matrix , QP3 and QR3 require about the same numbers of flops. However, QP3 performs about a half of the total flops using BLAS2, while QR3 performs most of its flops using BLAS3.

In some cases, due to the roundoff errors, the downdated norms diverge significantly from the true norms [14]. When this occurs, the trailing submatrix is immediately updated with the outstanding Householder transformations, and the column norms are recomputed. If the column norms must be frequently recomputed, then in comparison to QR3, QP3 not only requires significantly more flops for recomputing the norms but also exhibits a poorer data locality since the trailing submatrices are updated using smaller blocks.
2.3. Hybrid Algorithms
In this section, we first show how an algorithm solving task1 can be used to solve task2, and then describe socalled hybrid algorithms for solving task1 or task2, or both. The algorithms presented in this subsection are used as a postprocessing to improve the numerical properties of an RRQR factorization, and the matrix is of uppertriangular form. For instance, Algorithm 3(a) shows the QP3 algorithm applied to an upper triangular matrix . To simplify our notation, we write the RRQR factorization as where is the by leading submatrix. Then, task2 is equivalent to Hence, to solve task2, we apply an algorithm that solves task1 to the transposeinverse of the matrix . Namely, if we have such an RRQR factorization, then we also have Moreover, if is the permutation matrix with ones on the antidiagonal, then where is a permutation matrix, is an orthogonal matrix, and is minimized. Hence, the factorization (21) is an RRQR factorization solving task2. This is called the unification principle of the algorithms solving task1 and task2 [9].

We now introduce the ChanI algorithm for solving task1, whose task2 version is used as the postprocess scheme in Section 2.4. Just like the QRP algorithm, the ChanI algorithm pivots the column with the largest norm, but it uses an additional heuristic. Namely, at the th step, the column norm is approximated using the rightsingular vector corresponding to the largest singular value of the submatrix . Specifically, if is the singular value decomposition of , then the th column norm of is approximated by Hence, at the th step, the algorithm pivots the th column, whose corresponding element of the most dominant right singular vector has the largest module. Algorithm 3(b) shows the ChanII algorithm which is the task2 version of the ChanI algorithm and pushes the column that minimizes into the trailing submatrix at each step.
Finally, Algorithm 4 shows a single iteration of three hybrid algorithms HybridI, HybridII, and HybridIII [9] that solve task1, task2, and both task1 and task2, respectively. The iteration is terminated when the th column is not moved.

2.4. QPR Algorithm
Even when QP3 and QR3 performs about the same number of flops, QP3 is often slower. This is largely because QR3 performs most of its flops using BLAS3, while QP3 performs about half of its flops using BLAS2. Since this BLAS2 is needed to select the pivot among all the remaining columns, QPR [3] tries to reduce the bottleneck by selecting the pivot among a fixed number, , of the columns. Algorithm 5(a) shows the pseudocode of QPR. At each step of the panel factorization, while QP3 computes the matrixvector product with the whole trailing submatrix, QPR updates the columns within this window using a Householder matrix (both by BLAS2). Since the pivots are now selected only within the window, in order to ensure the rankrevealing properties, QPR uses a condition estimator and accepts only the pivots that satisfy (5). After all the columns are either accepted or rejected (Step 1 in Algorithm 5(a)), the QR factorization of the rejected columns is computed to obtain the uppertriangular matrix (Steps 2 and 3). At the end, in comparison to QP3, QPR performs a fewer flops using BLAS2 to select the pivots. However, the columns of the trailing submatrix are now updated using BLAS2. In Sections 3 and 5.1, we compare the performance of QPR and QP3 on a CPU and on multicore CPUs with a GPU, respectively.

Since QPR selects the pivots only within the windows, it requires postprocessing to globally ensure the rankrevealing property. Though there are two postprocessing options [9, 15], here, we focus on the first [9] because it has shown to be more effective in our experiments. Algorithm 5(b) shows the pseudocode of this postprocessing scheme, which modifies the HybridIII algorithm of Algorithm 4 to improve its convergence rate [3].
3. Case Studies with StruMF on a CPU
The performance of both QP3 and QRP depends on the input matrix . To study their performance, in this section, we provide their case studies with StruMF for solving 3D sevenpoint Poisson’s equations on one core Intel Sandy Bridge CPU. In our experiments, we use the same default parameters used for solving Poisson’s equations with StruMF in [5] (e.g., the numerical rank tolerance is set to be , and FGMRES(30) is considered to be converged when the residual norm is reduced at least by the order of ), and for QPR, we used the default window size recommended in [3] (i.e., ).
3.1. Performance of Original StruMF and QRP
Figure 1(a) shows the breakdown of the StruMF time using QP3. It clearly shows that StruMF spends most of its preconditioner construction time in QP3. Moreover, the percentage of the time spent in QP3 often increases as the global matrix dimension increases. To analyze the performance of QP3, Figure 2 shows the dimensions of the offdiagonal blocks for which the lowrank approximations are computed. Most of these offdiagonal blocks are short and wide having a few hundreds rows and tens of thousands of columns. In addition, the dimensions of the offdiagonal blocks, especially, their numbers of columns, often increase with respect to the global dimension. Hence, if the compression time of these shortwide blocks can be reduced using another algorithm or a GPU, then the StruMF solution time may be significantly reduced.
(a) Relative time spent in QP3
(b) Memory relative to direct factorization
(a)
(b)
(c)
(d)
Figure 3 compares the StruMF solution times using QP3 and QPR. The figure clearly indicates that even though the factorization time is reduced using QPR, the postprocessing can be expensive. In the next subsection, we propose modifications to the QPR implementation, which often reduce the postprocessing time and make QPR more competitive. Just for reference, Figure 3 compares the StruMF solution time with that of a direct multifrontal factorization. We see that for a large enough system, StruMF can reduce not only the memory requirement (see Figure 1(b)) but also the solution time of the direct factorization.
3.2. Proposed Modifications to Original QPR Implementation
The performance of QPR depends strongly on the condition estimator used to evaluate Step 1.1.1 of Algorithm 5(a). In the original implementation, the smallest singular value of is estimated using an incremental condition estimator (ICE) [16, 17], while the largest singular value is estimated by the product of the largest column norm and the third root of the matrix dimension (i.e., where is of dimension , and is the th column of ) [3]. This simple estimator is used for the largest singular value because though the estimator may lead to a greater estimation error, it requires a fewer flops (i.e., ICE requires flops to update the estimation of ). During the postprocessing phase, the same estimators are used at Step 1 of Algorithm 3(b), but in order to obtain an accurate final factorization, both the largest and smallest singular values are estimated by ICE at Step 3 of Algorithm 5(b). We found that, in our numerical experiments using random matrices, this simple estimator underestimates the largest singular values. As a result, many components that should be rejected are accepted. In addition, during the preconditioner construction for solving Poisson’s equation by StruMF, the simple estimator overestimates the singular values, rejecting the components which should be accepted. In either case, this estimation error could significantly increase the postprocessing cost. Figure 4(a) shows that when a more accurate condition estimator (i.e., ICE) is used, the postprocessing time can be dramatically reduced. In addition, ICE reduced the factorization time slightly because most of the components are accurately accepted by Step 1 of Algorithm 5(a), and Step 2 has less work.
(a) Effect of condition estimator
(b) Effect of tuning the postprocess
During the postprocessing (Algorithm 5(b)), the GolubI algorithm requires the column norms of the trailing submatrix , and the ChanII algorithm requires the column norms of the leading submatrix . At the beginning of Step 1 to call the ChanII algorithm, the original implementation computes the column norms of the leading submatrix . Then, the norms and are computed as they are permuted into . On the other hand, for each GolubI call, the column norms of the trailing submatrix are recomputed. This is because when the ChanII algorithm applies Given’s rotation to the leading submatrix , the column norms of the trailing submatrix changes. In our experiments, this norm computation could become expensive, especially when Step 1 requires many iterations or a large tolerance is used. To reduce this computational cost, we incrementally update or downdate the norms when Given’s rotation is applied and when the submatrix size changes at the outer iteration of Algorithm 5(b). In addition, we use the same criteria used in QP3 [14] to detect if the updated norms have diverged from the actual norms due to the roundoff errors. When this happens, the column norms are recomputed. Figure 4(b) clearly indicates that the postprocessing time of StruMF can be significantly reduced by avoiding the norm computation.
For Step 2 of the factorization in Algorithm 5(a), the original implementation uses the columnwise QRP algorithm. On a modern computer, a blocked algorithm can obtain significant speedups. Hence, we replaced it with QP3. In addition, we use ICE to detect the numerical rank instead of the simple estimator used in the original implementation (i.e., QP3 terminates when the estimated condition number of the leading submatrix becomes greater than the threshold ). Figure 5(a) shows that the blocked algorithm reduces the factorization time, but the overall improvement is not significant since after ICE is integrated, only a short time is spent at Step 2 for solving this Poisson’s equation.
(a) Columnwise (left) and blocked (right bar) algorithm
(b) Ratio of exact and estimated conditioner number of leading triangular factors after the QPR factorization
(c) StruMF solution time using task1 (left), task2 (middle), and task1 and task2 (right bar) postprocessing
(d) StruMF solution time with QP3 (left) and QPR (right bar)
(e) Hybrid paradigm
(f) Algorithmic flow
Finally, while QP3 aims to solve task1, the QPR postprocessing tries to solve both task1 and task2. Hence, we replaced it with the postprocessing algorithms for solving only either task1 or task2. Even though the postprocessing solved different tasks, StruMF converged in a similar number of iterations. However, the task1 postprocessing converged slightly faster than the task2 postprocessing, which was faster than the original postprocessing. We have also used the test matrices from the original paper [3] to evaluate the quality of the factorization after the postprocessing. Figure 5(c) shows that the quality of the factorization did not change significantly when different postprocessing schemes were used. We use the original postprocessing scheme in the remaining of the paper since it provides the most robust behavior, while the overhead is not significant after all the proposed modifications are integrated. Figure 5(d) compares the StruMF solution time using QP3 and QPR after all the modifications are made. Now, the solution time using QPR is shorter than that using QP3.
4. GPU Implementations
We now describe our QP3 and QPR implementations that utilizes a GPU.
4.1. QP3 Implementation
MAGMA (http://icl.utk.edu/magma/) extends LAPACK (http://www.netlib.org/lapack/) to heterogeneous architectures based on a hybrid programming and static scheduling. For instance, for the blocked QR factorization, the latencylimited BLAS2 based panel factorization is scheduled on the CPUs, while the computeintensive BLAS3 based trailing submatrix update is scheduled on the GPU [18]. Furthermore, as soon as the next panel is updated on the GPU, it is copied to the CPU such that the panel factorization on the CPUs can be hidden behind the remaining submatrix update on the GPU (this is commonly referred to as a lookahead). As a result, for a large enough matrix, the QR factorization by MAGMA can obtain the performance of the BLAS3, which exhibits a highlevel of data parallelism and can be efficiently implemented on the GPU [19].
Our first QP3 implementation is based on the same hybrid paradigm, where the CPUs factorize the panel while the GPU updates the trailing submatrix. However, in contrast to the QR panel factorization that accesses only a panel, each step of the QP3 panel factorization accesses the whole trailing submatrix to look for the pivot. Hence, the entire trailing submatrix must be updated before the panel factorization starts. As a result, though the transfer of the panel to the CPU can be overlapped with the remaining submatrix update, the actual panel factorization cannot be overlapped with the update (the lookahead is not possible). In addition, the QP3 factorization time is often dominated by the BLAS2 based panel factorization that performs about a half of the total flops. Hence, our second implementation accelerates this panel factorization using a GPU. Because the flop count of the panel factorization is dominated by the matrixvector product with the trailing submatrix (Step 1.4 of Algorithm 2), the GPU is used to accelerate this BLAS2 kernel. As shown in Figure 5(e), our implementation computes the matrixvector product with the top block row of the trailing submatrix on the CPUs, while the rest of the product is computed on the GPU. This is because this top block row is needed for the column norm downdating (Step 1.6), which is performed on the CPUs. Hence, this hybrid paradigm avoids the transfer of a row from the GPU to the CPU at each step of the panel factorization. Figure 5(f) illustrates this hybrid paradigm.
Unfortunately, in comparison to the QR factorization, this hybrid implementation of the QP3 factorization lead to much lower speedups. This is mainly because its performance is limited by the memory bandwidth to move a column between the CPU and GPU at each step of the panel factorization. This data transfer is needed for the column pivoting and matrixvector multiply (see Figure 5(f)). As an execution trace in Figure 6 also shows, this CPUGPU data transfer could become as expensive as the actual computation at each step of the panel factorization. To avoid this data transfer, our second implementation of QP3 performs all the computation on the GPU. At each step of the factorization, this GPU implementation still requires two synchronizations on the CPU; one to pick a pivot and the other one to check if the column norms must be recomputed (Steps 1.1 and 1.6, resp.). Furthermore, in many cases, the CPUs obtain higher performance of BLAS1 (e.g., Householder vector generation) than the GPU. However, once the matrix is copied from the CPUs to the GPU, this GPU implementation does not transfer any vector or matrix between the CPUs and GPU and could obtain a higher performance than our hybrid implementation could (see Section 5).
(a) Whole trace ()
(b) Partial zoomedin trace
Initially, our GPU implementations used an individual GPU kernel for each BLAS or LAPACK routine used for the panel factorization. However, many of these routines perform only small amounts of computation or require a scalar to be on the CPU. In order to improve the performance, we merged several computational kernels into one kernel in order to avoid the kernel launch overhead and unnecessary GPUCPU communication. In addition, we have tuned these kernels (e.g., block size and thread grid) for the matrix dimensions typical for the QPR factorization. We will show the effects of these optimization techniques in Section 5. A similar GPUimplementation is proposed in [20].
4.2. QPR Implementation
In contrast to QP3, QPR allows lookaheads. Namely, once the GPU updates all the columns of the next window, the CPU can start the panel factorization while the GPU updates the remaining submatrix. However, in contrast to the QR panel factorization that only requires the panel, the QPR panel factorization updates all the columns within the window at each step of the panel factorization. Hence, when the window size is large in comparison to the trailing submatrix dimension, it becomes difficult to hide the BLAS2 based panel factorization on the CPUs behind the BLAS3 based trailing submatrix update on the GPU (we have investigated a hybrid panel factorization. However, in many cases, it was less efficient due to the CPUGPU synchronization and communication, especially in our experiments with StruMF, where only a few columns were accepted at each panel factorization).
For Step 2 of the QPR factorization in Algorithm 5(a), we use the GPU implementation of QP3 (described in Section 4.1). Since at the end of Step 1, the coefficient matrix is on the GPU, Step 2 does not require any data transfer to the GPU. For Step 3, if the remaining submatrix is relatively small ( in our experiments), then we compute its QR factorization using LAPACK on the CPU. Otherwise, the QR factorization is computed using MAGMA on the GPU. We found that, in many cases, QP3 does not accept any of the rejected columns from Step 1. Hence, in order to hide the cost of copying the matrix from the GPU to the CPU for Step 3, the matrix is asynchronously copied to the CPU, while QP3 is performed on the GPU. Only when QP3 accepted a column, the matrix is resent to the CPU after the whole matrix is factorized. Finally, to generate the final orthogonal matrix , the Householder vectors from both QP3 and QR are accumulated on the GPU.
4.3. Integration into StruMF
To call our GPU kernels from StruMF, we copy the matrix to the GPU, compute the factorization, and then copy the result back to the CPU. Initially, we allocate a fixed amount of GPU memory for the workspace, and the workspace is reallocated when the current workspace is not large enough. Since the interfaces of most of the MAGMA routines are identical to those of LAPACK, replacing the LAPACK routine with the MAGMA routine is relatively easy.
5. Performance Studies with a GPU
We now study the performance of our QP3 and QPR implementations with a GPU (Section 5.1) and its impacts on the performance of StruMF (Section 5.2). We compiled our codes using the C compiler gcc 4.4.6 and the CUDA compiler nvcc 5.0.35 and linked them to the threaded version of MLK 2013.4.183. We emphasize that our CPU codes have been optimized. Namely, our QPR code integrates all the modifications described in Section 3.2, and our QP3 code computes a partial factorization, where the factorization is terminated at the numerical rank specified by the tolerance . Computing a partial factorization obtains a significant speedup compared to the QP3 routine of MKL that computes the full factorization. This was especially true in our experiments, where a relatively large is used.
5.1. Kernel Performance
Figure 7(a) shows the performance of our QP3 implementations to factorize square random matrices with an NDIVIA Tesla K20c GPU () and compares it with that of MKL on two eightcore Intel Sandy Bridge CPUs. Our GPU implementation improves the performance of our hybrid implementation because it avoids the data transfer between the CPU and GPU. Figure 7(b) demonstrates the advantage of the GPU implementation for the shortwide matrices. In addition, our optimized GPU implementation (Section 4.1) obtains significant speedups for both small and shortwide matrices, and this is used for the remaining of the experiments.
(a) Square Matrices
(b) Shortwide Matrices
Figures 7(a) and 7(b) also show that all of our implementations obtain significant speedups over MKL. We observed that, for large matrices, our GPU implementation and MKL obtain similar performance relative to the practical peak performance on the corresponding hardware, where the practical peak performance is measured by computing the required matrixvector products with the trailing submatrices and the matrixmatrix products for the submatrix update without any synchronization. In other words, our GPU implementation obtains the speedups over MKL because it effectively utilizes the higher memory and computational bandwidths of the GPU.
Figure 8(a) shows the breakdown of the QP3 factorization time. Up to 70% of the factorization time is spent in the BLAS3 matrixmatrix and BLAS2 matrixvector multiplies. Even though these BLAS3 and BLAS2 perform about the same numbers of flops, the bandwidthlimited BLAS2 dominates the factorization time. Figure 8(b) compares the performance of our QPR implementations with that of the QP3 implementations on random shortwide matrices. It shows that after the proposed modifications were integrated, the QPR implementation obtained the higher performance and greater scalability on the CPUs. Furthermore, our hybrid QPR implementation outperformed our QP3 GPU implementation due to the fewer BLAS2 flops required to select the pivots. Since the performance of QP3 and QPR (especially that of QPR) depends greatly on the input matrix (e.g., the number of columns accepted at each panel factorization), in the next subsection, we study the performance of these two algorithms within StruMF.
(a) Breakdown of factorization time
(b) Comparison with QP3 on random matrices
5.2. Performance of StruMF
Figure 9(a) shows the performances of StruMF using our QP3 implementations for solving a 3D Poisson’s equation. First, by comparing the performance on a single CPU in the figure with that of Figure 5(a), we clearly see the advantage of computing the partial factorization by QP3 (the factorization is terminated at the numerical rank specified by the tolerance ). Next, the preconditioner construction time of StruMF using QP3 is often dominated by BLAS2 or BLAS3 on small submatrices, and the construction time did not scale well on the CPUs. As a result, our GPU kernel was able to accelerate the construction time by 30%–50% over StruMF running on up to 16 CPUs. Figure 9(b) then shows the performance of StruMF using QPR. Using QPR, the preconditioner construction slowed down on a single CPU, but it scaled better and was faster than using QP3 on multiple CPUs. Furthermore, the GPU reduced the construction time by 10%–20%. In comparison to QP3, the GPU acceleration was smaller in QPR mainly because the trailing submatrices were updated using smaller blocks. Most of the columns in each window are rejected, and significant time is spent swapping the rejected columns to the end of the matrix.
(a) QP3
(b) QPR
Figure 10 shows the results of solving a 2D Poisson equation, where the same tolerance from [5] was used (i.e., ). Though the sizes of the dense submatrices were significantly smaller in the 2D problem (see Figure 11), our GPU kernel, especially the QP3 implementation, still obtained significant speedups. Unfortunately, the compression time was less dominant in the 2D problem, and the reduction in the preconditioner construction time was less significant using the GPU.
(a) QP3
(b) QPR
(a) 2D problem ()
(b) 2D problem ()
6. Conclusion
We studied the performance of QP3 and QPR for computing the lowrank approximation of dense submatrices. We first proposed several modifications to the original QPR implementation to improve its performance on the CPUs. We then investigated the potential of using a GPU to accelerate the factorization time. Our performance results demonstrated that the proposed modifications could significantly improve the performance of the QPR factorization on the CPU, and the factorization time can be further reduced using a GPU. In addition, we provided the case studies with an hierarchically semiseparable linear solver StruMF, which showed that the preconditioner construction time of StruMF can be reduced by 30%–50% and 10%–20% using the GPU for the QP3 and QPR factorizations, respectively. Though we only show the results of solving Poisson’s equations in this paper, the performance is representative of many cases and good indications for other cases. We emphasize that our focus is to study the performance of computing lowrank approximations, and our aim is not on improving the numerical performance of StruMF. The techniques discussed in this paper are applicable to other software which computes lowrank approximations or numerical ranks of dense matrices, including those on distributedmemory systems.
We did not study the impact of the input parameters on the performance of our implementations. For instance, we used all the default parameters of StruMF and QPR (e.g., compression rate , iteration stopping and restarting criteria for StruMF, and window size for QPR). In particular for StruMF, a smaller compression rate would lead to larger dense blocks increasing the effectiveness of our GPU kernels, while it would also reduce the iteration count, potentially reducing the total solution time. For QPR, the smaller window size would improve the effectiveness of the lookahead and may improve the performance of our hybrid implementation. On the other hand, the larger window size could improve the performance of the trailing submatrix updates and may also lead to more accurate factorization, potentially reducing the postprocessing time. We are currently studying the effects of these parameters on the performance of StruMF. In addition, we mentioned that, in comparison to QP3, QPR obtained smaller speedups on the GPU, mainly because it uses a smaller block to update the trailing submatrix. We are currently studying prepossessing techniques that would increase the block size and improve the performance of the QPR implementation (e.g., before the factorization, reorder the matrix columns in the descending order of their norms). We also discussed that the performance of the QPR implementation depends on the performance of the condition number estimator. We are investigating if more accurate estimators are needed for other test matrices. Finally, we are studying the performance of the randomization algorithm [21] to compute the lowrank approximation on a GPU [22] and would like to investigate the performance of our QP3 implementation in a communicationavoiding version of the algorithm [23] on multiple GPUs (e.g., distributedmemory system).
Conflict of Interests
The authors declare that there is no conflict of interests regar