Research Article

Computing Low-Rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations

Figure 6

Execution trace of the hybrid QP3 implementation. The top trace is on the CPU, while the remaining two traces are on the GPU with two GPU streams. Matrix-vector multiply, matrix-matrix multiply, column swap, pivot selection, reflector generation, norm computation, and communication are shown in green, purple, orange, magenta, red, cyan, and black, respectively. Since the BLAS matrix-vector multiply routine does not support a vector-matrix multiply, a matrix-matrix multiply is used for the computation at Step 1.4 of Algorithm 2. The second GPU stream is used to transfer the next panel and top block row to the CPU.
(a) Whole trace
(b) Partial zoomed-in trace