Abstract

Sparse matrix-vector multiplication (SpMV) is an important operation in computational science and needs be accelerated because it often represents the dominant cost in many widely used iterative methods and eigenvalue problems. We achieve this objective by proposing a novel SpMV algorithm based on the compressed sparse row (CSR) on the GPU. Our method dynamically assigns different numbers of rows to each thread block and executes different optimization implementations on the basis of the number of rows it involves for each block. The process of accesses to the CSR arrays is fully coalesced, and the GPU’s DRAM bandwidth is efficiently utilized by loading data into the shared memory, which alleviates the bottleneck of many existing CSR-based algorithms (i.e., CSR-scalar and CSR-vector). Test results on C2050 and K20c GPUs show that our method outperforms a perfect-CSR algorithm that inspires our work, the vendor tuned CUSPARSE V6.5 and CUSP V0.5.1, and three popular algorithms clSpMV, CSR5, and CSR-Adaptive.

1. Introduction

Given their many-core structures, graphics processing units (GPUs) have sufficient computation power for scientific computations. Processing big data by using GPUs has drawn considerable attention over the recent years. Following the introduction of the compute unified device architecture (CUDA), a programming model that supports the joint CPU/GPU execution of applications, by NVIDIA in 2007 [1], GPUs have become strong competitors as general-purpose parallel programming systems.

Sparse matrix-vector multiplication (SpMV) has proven to be of great importance in computational science. It represents the dominant cost in iterative methods for solving linear systems and eigenvalue problems which arise in a wide variety of scientific and engineering applications [25]. Recently, with the increasing size of matrices, GPU-accelerated SpMVs have attracted considerable attention.

SpMV performance is heavily dependent on the format that is used to store the sparse matrix in the memory. Compressed sparse row (CSR) has historically been a frequently used format because it efficiently compresses both structured and unstructured matrices. The naive CSR-based parallel SpMV, known as CSR-scalar, assigns each row of the sparse matrix to a separate thread [6]. However, the memory-access pattern of CSR-scalar is rarely coalesced, which results in disappointing performance. CSR-vector [7, 8] improves the performance of CSR-scalar by using warps to access the CSR structure in a contiguous but not generally aligned fashion, which implies partial coalescing. Since then, researchers have developed many highly efficient CSR-based SpMV implementation techniques on the GPU by optimizing the memory-access pattern of the CSR structure [919].

Among the existing CSR-based algorithms on the GPU, a representative, called a perfect-CSR algorithm (PCSR) [13], achieves high performance by fully coalesced accesses to the CSR arrays. PCSR involves two kernels. The first kernel is used to perform many parallel coalesced loads from the CSR arrays and place values into global memory . The second kernel is composed of two stages. The first stage is to load the global memory into the shared memory in a fully coalesced manner. Next, the second stage performs a scalar-style reduction (with each thread working on a single row), and each row in is reduced to an output value. We can observe that the process of loading and reading the global memory is coalesced in PCSR but it still needs some cost. If a matrix has a great number of nonzero values, the operation that is similar to the vector sum in the first kernel will bottleneck the PCSR performance [20]. Moreover, when each row has large and significantly different number of nonzero values for a matrix, PCSR may leave many threads inactive during the scalar-style reduction process for each block and the thread with a long row often has a slower reduction speed than that with a short row. This greatly decreases the reduction efficiency.

These observations motivate us to enhance the performance of PCSR. We propose an improved PCSR called IPCSR to calculate SpMV on the GPU. IPCSR reduces two kernels in PCSR to one kernel while keeping the coalesced loads of the CSR arrays intact and saves the cost of loading and reading global memory . IPCSR only includes two stages. In the first stage, IPCSR calculates dynamically the suitable number of rows for each block and quickly loads values from the CSR arrays into the shared memory . This stage merges operations in the first kernel of PCSR and the first stage of the second kernel of PCSR. This process of loads is coalesced and effectively utilizes the GPU’s DRAM bandwidth, alleviating the bottleneck of PCSR. IPCSR supports two reduction approaches: a scalar-style reduction (with each thread working on a single row) and a multiple scalar-style reduction (with multiple threads working on a single row). How many threads are assigned to a single row for each block is dynamically determined by a decision tree. Thus, IPCSR improves the reduction efficiency by keeping threads active as much as possible.

IPCSR loses efficiency when a block operates a row, and it becomes inoperative if a row has more values than can be allocated in the shared memory. To overcome this limitation, we specially design a highly efficient algorithm for this case and can dynamically determine which block executes IPCSR with a set of rows or a long row. Our experiments on a set of 20 sparse matrices show that IPCSR outperforms PCSR for all test cases. IPCSR achieves average performance improvement of (single precision) and (double precision) over PCSR on the C2050 GPU and (single precision) and (double precision) over PCSR on the K20c GPU. In addition, compared to CUSPARSE V6.5 [21], our proposed IPCSR achieves performance improvement by up to on average in single precision and on average in double precision on the C2050 GPU and on average in single precision and on average in double precision on the K20c GPU. Compared to CUSP V0.5.1 [22], which chooses the best of six algorithms that include COO, CSR-scalar, CSR-vector, DIA, ELL, and HYB, IPCSR achieves a performance gain of up to on average in single precision and on average in double precision on the C2050 GPU and on average in single precision and on average in double precision on the K20c GPU. Compared to clSpMV [23], which combines advantages of many existing algorithms, IPCSR achieves performance improvement by up to on average in single precision and on average in double precision on the C2050 GPU and on average in single precision and on average in double precision on the K20c GPU. Compared to CSR5 [18], IPCSR achieves average performance improvement of (single precision) and (double precision) over PCSR on the C2050 GPU and (single precision) and (double precision) over PCSR on the K20c GPU. IPCSR has a slightly better behavior than CSR-Adaptive [17] and achieves performance improvement by up to on average in single precision and on average in double precision on the C2050 GPU and on average in single precision and on average in double precision on the K20c GPU.

The main contributions in this paper are summarized as follows:(i)We propose a novel CSR-based SpMV on the GPU. Our proposed algorithm is inspired by the work of Gao et al. [13] but achieves performance improvement by up to on average on the C2050 GPU and on average on the K20c GPU compared to the work of Gao et al.(ii)In our proposed algorithm, each block may have different number of rows, and the optimal number of rows each block involves is dynamically calculated.(iii)Each block can dynamically determine whether to execute our proposed algorithm with a set of rows or a long row on the basis of the number of rows it involves.

The rest of this paper is organized as follows. Section 2 addresses the related work. Section 3 introduces the CSR format and the PCSR algorithm. Section 4 details our IPCSR algorithm. Experimental results are presented in Section 5. Section 6 summarizes our conclusions and suggestions for future research.

Sparse matrix-vector multiplication (SpMV) is so important that there has existed a great deal of work on accelerating it. We only discuss the most relevant ones here. Initial work about accelerating the SpMV on the CUDA-enabled GPU is presented by Bell and Garland [7]. They implement several well-known formats including DIA, ELL, CSR, and COO and a new hybrid format HYB on the GPU. Each storage format has its ideal matrix type. They conclude that the best performance of SpMV on the GPU depends on the storage format and advocate the column-major ELL and HYB. These conclusions are carried forward to the CUSP [22] library, which relies heavily on HYB for SpMV.

A lot of work has been proposed for GPUs using the variants of the CSR, ELL, and COO formats such as the compressed sparse eXtended [24], bit-representation-optimized compression [25], block CSR [26], ELLPACK-R [27], sliced ELL [28, 29], SELL-C- [30], sliced COO [31], and blocked compressed COO [32]. Specialized storage formats provide definitive advantages. However, as a great number of software programs use CSR, the conversion from CSR to these other formats will present a large engineering hurdle and can incur large runtime overheads and require extra storage space.

In [7, 8], Bell and Garland declare that CSR-scalar loses efficiency due to its rarely coalesced memory accesses. CSR-vector improves the performance of CSR-scalar, but it accesses the CSR structure in a contiguous but not generally aligned fashion, which implies partial coalescing. Compared to CSR-scalar and CSR-vector, ELL, DIA, and HYB implementation techniques on the GPU benefit from full coalescing. Following the idea of optimizing the memory-access pattern of the CSR structure, researchers have developed many algorithms. Lu et al. [33] optimize CSR-scalar by padding the CSR arrays and achieve 30% improvement of the memory-access performance. In [10], Dehnavi et al. propose a prefetch-CSR method that partitions the matrix nonzeros into blocks of the same size and distributes them among GPU resources. The method obtains a slightly better behavior than CSR-vector by padding rows with zeros to increase data regularity, using parallel reduction techniques, and prefetching data to hide global memory accesses. Furthermore, authors increase the performance of this method by replacing it with three subkernels in [11]. Gao et al. [13] propose a perfect-CSR (PCSR) algorithm on the GPU, which consists of two kernels. This algorithm implements the fully coalesced accesses to the CSR arrays by introducing a middle array. Greathouse and Daga suggest a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs [14]. Thus, authors enhance CSR-Adaptive and make it work well for both regular and irregular matrices while keeping the CSR format unchanged [17]. Liu and Schmidt present LightSpMV, a novel fast CUDA-compatible SpMV algorithm using the standard CSR format, which achieves high parallelism by benefiting from the fine-grained dynamic distribution of matrix rows over warps/vectors [16]. Other related work can be found in [9, 12, 18].

Our proposed algorithm is inspired by Gao et al.’s work [13]. It alleviates the bottleneck of PCSR that is suggested by Gao et al. while keeping fully coalesced accesses to the CSR arrays and thus outperforms PCSR.

3. The CSR Format and PCSR

3.1. The CSR Format

The compressed sparse row (CSR) format is a popular, general-purpose sparse matrix representation. CSR stores a sparse matrix via three arrays: (1) the array contains all the nonzero entries of , (2) the array contains column indices of the nonzero entries stored in , and (3) entries of the array point to the first entry of subsequence rows of in the arrays and .

For example, the matrix is stored in the CSR format by

3.2. PCSR

PCSR is CSR-based SpMV implementation on the GPU and involves two kernels. Algorithm 1 shows the main procedure of the first kernel. This kernel is to load values from the CSR arrays into the global memory in a coalesced manner. Each thread only calculates a nonzero entry. Algorithm 2 shows the main procedure of the second kernel. We observe that the second kernel is composed of two stages. The first stage performs many parallel coalesced loads from and quickly places values into the shared memory . The second stage completes a scalar-style reduction (with each thread working on a single row). Each row in is reduced to an output value.

Input: , , , ;
Output: ;
() ;
() ;
() while
()  ;
()    +=  ;
() end while
Input: , ;
Output: res;
()    ;
()    ;
()    ;
()    ;
()    ;
()    if()
       ;
()    ();
()    ; ; ;
()    for to with   +=   do
()    ;
()    ();
       //Load temp into the shared memory temp_
()    for to do
()      if then
()       ;
()         +=  ;
()      end
()    done
()    ();
       //Perform a scalar-style reduction
()    if ) is false then
()      ;
()      ;
()      for to do
()         +=  ;
()      done
()    end
()    ;
() done

4. Improved PCSR

On the basis of the above discussions, we propose an improved PCSR on the GPU, called IPCSR, as shown in Algorithm 3. Compared to PCSR, IPCSR reduces two kernels to one kernel and only involves two stages. The first stage is to have each thread block perform many parallel coalesced loads from the and arrays. This quickly places values into the shared memory (see lines (7)–(9) in Algorithm 3). This stage combines operations in the first kernel of PCSR (see lines (3)–(6) in Algorithm 1) and the first stage of the second kernel of PCSR (see lines (12)–(17) in Algorithm 2). The second stage, similar to the second stage in the second kernel of PCSR, completes a scalar-style reduction (with each thread working on a single row).

Input: , , , , ;
Output: ;
()    ;
()    ; ;
()    ;
()    ;
()    ;
()    ;
    //Assemble into shared memory
()    for to with   +=   do
()       ;
()    done
() ();
    //Perform a scalar-style reduction from temp_
() ;
() if then
()    ;
()    ;
()    ;
()    for to do
()      +=  ;
()    done
()    ;
() end

For IPCSR, the size of shared memory is set as follows for a given (number of threads per block): where , , and are the maximum number of threads per multiprocessor, maximum amount of shared memory per multiprocessor, and maximum number of blocks per multiprocessor, respectively. , , and are constant for a specific GPU.

The number of rows each block calculates for PCSR is an invariable value that is equal to the number of threads per block. However, in IPCSR, how many rows are assigned to a block is dynamically calculated. IPCSR splits the problem into row blocks, where the nonzero values within a block fit into a statically sized amount of shared memory (i.e., SHARED_SIZE). For example, assume that there are 128 rows for a matrix, each with 16 nonzero values, and , two blocks, each with 64 rows, are required. If the first row instead has 32 nonzero values and the remaining 127 rows still have 16 nonzero values, three blocks will be needed. The first, second, and third blocks work on 63, 64, and 1 rows, respectively. With the assumption that the number of nonzero values per row is not more than SHARED_SIZE, the main procedure of generating row blocks is shown in Algorithm 4.

Input: , , , ;
Output: , ;
()    ; ; ; ;
()    for to do
      //Compute non-zeros and the total rows
()        +=  ;
()      ++;
()      if
       (  &&  ) then
       //This row fills up SHARED_SIZE or threads per block
()       ; ++; ; ;
()      else if then
       //This row is an extra one that is excluded
()       ; ++; ; ; −−;
()      end
() done
      //Extra case
() if   !=   then
()    ;
() else
()    −−;
() end

Here, the procedure in Algorithm 4 is executed on the CPU. Of course, this procedure can also be performed on the GPU. However, the test results show that the use of the GPU for this procedure does not increase the performance. For a specific matrix, the blocks need to be calculated only once. The complexity of generating the blocks is related to the number of rows, and the cost of generating the blocks is generally less than 1% of the cost that it takes to generate the CSR data structure.

In summary, compared to PCSR, IPCSR has the following improvement:(i)Reducing two kernels to one kernel and avoiding loading/reading the global memory by merging operations in the first kernel of PCSR and the first stage of the second kernel of PCSR, which saves the cost of loading/reading the global memory.(ii)Fitting the number of rows dynamically to the amount of space, which simplifies the work done on the GPU.

4.1. Optimizing the Scalar-Style Reduction

For the scalar-style reduction in Algorithm 3, a single row is only calculated by one thread in a block. If a sparse matrix has nearly equal and small number of nonzero values per row, the scalar-style reduction will be efficient because most threads for each block are active and can almost synchronously complete the scalar-style reduction. However, when a sparse matrix has a large or significantly different number of nonzero values per row, many threads for each block are inactive and the thread with a long row has a slower reduction speed than that with a short row, which greatly decreases the performance. In order to improve the efficiency of the scalar-style reduction, we take full advantage of threads per block and propose a decision tree to fit the number of threads dynamically to a single row, as shown in Figure 1. For each block, this decision tree enables selecting the appropriate number of threads to calculate a single row on the basis of (number of rows that are assigned to it). When using multiple threads to calculate a single row in a block, the scalar-style reduction in Algorithm 3 is modified to the multiple scalar-style reduction (Algorithm 5).

    
()    ;
()    ;
    //Perform a multiple scalar-style reduction from temp_
()    ;
()    ;   & ();
()    if then
     //Perform a partial reduction from temp_
()     ;
()     ;
()     ;
()     for to with   +=   do
()     +=  ;
()  done
()  ;
()  ();
     //Perform a warp reduction from bVAL_s
()  if   &&  
()    += ;
()   ();
     
()  if   &&    >=  16
()    += ; ();
()  if   &&    >= 8
()    += ; ();
()  if   &&  
()    += ; ();
()  if   &&  
()   ;
() end

The multiple scalar-style reduction consists of two steps: a partial-reduction step and a warp-reduction step. In the partial-reduction step, each one of the threads in each thread group (warpNums threads are grouped into a thread group) performs a partial reduction and saves the reduction value to the shared memory. Then, in the warp-reduction step, the partial-style reduction values in the shared memory for each thread group are reduced to an output value in parallel. The main procedure of IPCSR with the adaptive number of threads per block is shown in Algorithm 7.

4.2. Optimizing Very Long Rows

If a single row has nonzero values that are more than SHARED_SIZE, IPCSR in Algorithm 7 will not be suitable for calculating the extremely long individual row because it cannot move these values into the shared memory .

This is solved by giving each of the very long rows to a single block. To assign each of these very long rows to a single block, the code in the 8th line of Algorithm 4 is modified as Algorithm 6.

if then
 //This row is an extra one that is excluded
; −−;
else
 //This row is more than SHARED_SIZE non-zeros
;
end
++; ; ;
Input: , , , , , ;
Output: ;
()    _;
()    _;
()    ; ;
()    ; ;
()    ; ;
    //Assemble into shared memory
()    for to with   +=   do
()      ;
()    done
()    ();
() ;
() if then
()   //Omitted: Perform a scalar-style reduction
() else
     //Omitted: Perform a multiple scalar-style reduction
() end

The main procedure where a single long row is calculated by one block is listed in Algorithm 8. First, the thread block performs many parallel coalesced loads from the and arrays. This quickly places values into the shared memory (see lines (2)–(4) in Algorithm 8). Second, the values in are reduced in parallel to an output value (see lines (7)–(15) in Algorithm 8).

    
    //a long row is calculated
()    ;
()    for to with   +=   do
()       +=  ;
()    done
()    ;
()    ();
()    if
()       +=  ; ();
()    
() if     +=  ; ();
() if     +=  ; ();
() if     +=  ; ();
() if     +=  ; ();
() if     +=  ; ();
() if   ;

The main procedure of IPCSR with very long rows is shown in Algorithm 9. By experiments, if one block only involves a single row whose nonzeros are not more than SHARED_SIZE, the performance that is obtained by the multiple scalar-style reduction in Algorithm 5 is always less than that obtained by a single long row reduction in Algorithm 8. Therefore, we also use the single long row reduction in Algorithm 8 for each block involving a single row whose nonzeros are not more than SHARED_SIZE.

Input: , , , , , ;
Output: ;
()    ;
()    ;
()    ; ;
()    ; ;
()    ; ;
()    ;
()    if   &&    !=  1 then
()     //Omitted: With the adaptive number of threads per block;
()    else
         //Omitted: With very long rows
() end
4.3. Optimizing the Accesses to

In IPCSR, the vector in the global memory is randomly accessed, which results in decreasing its performance. Therefore, the texture memory is utilized to alleviate the high latency of randomly accessing the vector in the global memory [34]. For single precision, can be rewritten as

Because the texture does not support the double value, the following function () shown in Algorithm 10 is adopted to transfer the int2 value to the double value.

(1) fetch_double(texture, )
(2) ;
(3) return2double();
(4)

Furthermore, for double precision, based on the function (), we rewrite as

5. Experimental Results

Here, in addition to comparing IPCSR with PCSR, we also use CUSPARSE V6.5 [21], CUSP V0.5.1 [22], clSpMV [23], CSR-Adaptive [17], and CSR5 [18] for the performance comparison. CSRMV, a CSR-based algorithm, is only considered for CUSPARSE. CUSP supports six algorithms that include COO, CSR-scalar, CSR-vector, DIA, ELL, and HYB. For each test matrix, we choose the best of the six CUSP algorithms. For clSpMV, we test all the single formats and choose the best performing single format for each matrix.

5.1. Experimental Setup

We use a total of 20 sparse matrices: 14 of them are from [35] and 6 of them are from the University of Florida Sparse Matrix Collection [36]. Table 1 summarizes the information of the sparse matrices, including the number of rows, number of columns, number of nonzeros, and number of nonzeros per row. These matrices have been widely used in some previous work [8, 14, 23, 28, 29, 32, 37].

Experimental environments include one machine that is equipped with an Intel Xeon Quad-Core CPU and an NVIDIA Tesla C2050 GPU and another machine with an Intel Xeon Quad-Core CPU and an NVIDIA Tesla K20c GPU. Table 2 shows an overview of NVIDIA GPUs.

The performance of all algorithms is measured in terms of GFlop/s (second) or GByte/s (GB/s). Measured performance does not include the data transfer (from GPU to CPU or from CPU to GPU). For each algorithm, we randomly execute it 50 times and then take the average performance as its performance.

5.2. Single Precision

Figures 2 and 3 illustrate the performance of various implementation techniques in single precision in terms of GFlop/s for all test matrices on C2050 and K20c, respectively. GFlop/s values are calculated on the basis of the assumption of two flops per nonzero entry for a matrix [8]. Table 3 presents speedups that are achieved by IPCSR over other implementation techniques on both GPUs for all test matrices. We can observe that the performance of our proposed IPCSR for each test matrix is much better than that of PCSR on both GPUs. IPCSR achieves speedups, ranging from 1.03 to 3.30 on C2050 and 1.04 to 3.25 on K20c, over PCSR (Table 3). The average speedups on the C2050 and K20c are 1.76 and 1.79, respectively (Figure 4).

Furthermore, from Figure 2, we can see that IPCSR outperforms CUSPARSE, CUSP, and clSpMV on C2050 for all the matrices except for Wind Tunnel, QCD, and Epidemiology. IPCSR has only slightly lower performance than clSpMV and CUSP for Wind Tunnel and QCD, respectively. For Epidemiology, the number of rows in many blocks is nearly half the number of threads, and other rows have small and nearly equal number of nonzeros except that a few rows have large number of nonzeros. This greatly decreases the performance of performing the scalar-style reduction in IPCSR. For Wind Tunnel, QCD, and Epidemiology, IPCSR has some performance degradation, but it achieves average performance improvement of over CUSPARSE, over CUSP, and over clSpMV (Figure 4). In addition, IPCSR has also better behavior than CSR5 for all test cases and achieves average performance improvement of over CSR5 (Figure 4). The performance of IPCSR is slightly higher than that of CSR-Adaptive for all test matrices except for Protein and mip1, and IPCSR only achieves average performance improvement of over CSR-Adaptive (Figure 4). For Protein and mip1, there are many rows that have a large number of nonzeros. CSR-Adaptive can use several blocks to calculate a row whereas IPCSR at most uses a block to calculate a row, which can be the reason resulting in the improvement of the CSR-Adaptive performance compared to our proposed IPCSR.

Similar to C2050, we also observe that IPCSR has higher performance than CUSPARSE, CUSP, and clSpMV for all test matrices except for Wind Tunnel and Epidemiology on K20c from Figure 3. IPCSR achieves an average speedup of over CUSPARSE, over CUSP, and over clSpMV, respectively (Figure 4). IPCSR outperforms CSR5 and the performance of IPCSR is slightly better than that of CSR-Adaptive for all test cases. IPCSR achieves average performance improvement of over CSR5 and over CSR-Adaptive, respectively (Figure 4).

5.3. Double Precision

The performance of CUSPARSE, CUSP, clSpMV, CSR5, CSR-Adaptive, PCSR, and IPCSR in double precision for all test matrices on C2050 and K20c is listed in Figures 5 and 6, respectively. Table 4 presents speedups that are achieved by IPCSR over other implementation techniques on both GPUs for all test matrices. Compared to PCSR, IPCSR achieves higher performance for all the matrices on both GPUs. The speedups of IPCSR over PCSR range from 1.07 to 2.72 on C2050 and 1.08 to 2.68 on K20c. The average speedups on C2050 and K20c are 1.53 and 1.56, respectively.

For CUSPARSE, CUSP, clSpMV, CSR5, CSR-Adaptive, and IPCSR, there is not a single approach that is best for all the matrices on both GPUs. However, IPCSR usually equals or outperforms other approaches and is the best approach on average. Similar to single precision, the major exceptional matrices are still Wind Tunnel, QCD, and Epidemiology in double precision. Although the performance of IPCSR is not the best for these exceptional matrices, IPCSR achieves average performance improvement of over CUSPARSE, over CUSP, over clSpMV, over CSR5, and over CSR-Adaptive on C2050 (Figure 4). Similarly, IPCSR achieves average performance improvement of over CUSPARSE, over CUSP, over clSpMV, over CSR5, and over CSR-Adaptive on K20c (Figure 4).

Therefore, we can conclude that IPCSR greatly improves the performance of PCSR and shows better behavior than CUSPARSE, CUSP, clSpMV, CSR5, and CSR-Adaptive for all test matrices on both GPUs.

5.4. Effective Bandwidth and Evaluations

Here, we only take single precision on C2050 to test the effective bandwidth of all algorithms. The effective bandwidth is calculated by the following:Here, the effective bandwidth is in units of GB/s; is the number of bytes read per kernel and is the number of bytes written per kernel. Figure 7 lists the effective bandwidth of CUSPARSE, CUSP, clSpMV, CSR5, CSR-Adaptive, PCSR, and IPCSR for all test matrices on C2050. The unit is GB/s. The red line is the peak theoretical bandwidth (144 GB/s) of C2050. We can observe that IPCSR achieves effective bandwidth ranging from 72.63 to 139.48 GB/s for all test matrices on C2050 and its average effective bandwidth of IPCSR is 119.22 GB/s. On the basis of the definition of the bandwidth [20], we can conclude that IPCSR has high parallelism. For CUSPARSE, CUSP, clSpMV, CSR5, CSR-Adaptive, and PCSR, their average effective bandwidth for all test cases on C2050 is 61.64, 89.92, 85.35, 100.29, 117.43, and 73.19 GB/s, respectively. This further verifies that IPCSR is much better than CUSPARSE, CUSP, clSpMV, CSR5, and PCSR and slightly outperforms CSR-Adaptive.

Compared to CSR5, the implementation of IPCSR is simpler. IPCSR keeps the CSR arrays intact and only incorporates an extra array (i.e., ) whereas CSR5 only leaves one of the three CSR arrays unchanged, stores the other two arrays in an in-place tile-transposed order, and adds two groups of extra auxiliary information. Therefore, the limitations of CSR5 are its complexity and the matrix transpose, which can cause large transformation overheads [17]. CSR-Adaptive involves three algorithms: CSR-Stream, CSR-vector, and CSR-VectorL, which correspond to the matrices with short, long, and very long rows, respectively. Their switches are complicated because some thresholds are required (i.e., and ). IPCSR uses one algorithm, and the only difference is that the reduction method can be different for different types of rows. And the reduction method can be easily chosen utilizing a decision tree. For very long rows, compared to a row per block that is used in IPCSR, CSR-Adaptive uses multiple blocks to calculate a row. However, the efficiency is often offset by the overhead of the atomic operation and block synchronization.

Like other GPU-accelerated SpMV algorithms, the performance of IPCSR is also sensitive to the GPU architecture. For different GPU architectures, IPCSR usually has a little difference. There are the following two main reasons: (1) some parameters (i.e., ) that have a great influence on the IPCSR performance depend on the GPU feature and (2) the GPU features and performance parameters usually are different for different GPU architectures. For IPCSR, based on (3), the value in single precision is twice that in double precision. If so, each block in single precision will usually involve more rows and have higher reduction efficiency than that in double precision. Other algorithms do not have this feature. This can result in the performance improvement of IPCSR being faster than other algorithms from double precision to single precision.

6. Conclusion

In this study, we propose a novel CSR-based algorithm on the GPU, which improves the performance of PCSR. Compared to PCSR, our proposed algorithm IPCSR has the following distinct characteristics: (1) each thread block is assigned to different number of rows, (2) a multiple scalar-style reduction is involved besides the single scalar-style reduction, and (3) optimization implementation is specially designed for very long rows. Experimental results on C2050 and K20c show that IPCSR outperforms PCSR on both GPUs. Furthermore, IPCSR also shows higher performance than the vendor tuned CUSPARSE V6.5, CUSP V0.5.1, and three popular algorithms clSpMV, CSR5, and CSR-Adaptive.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The research has been supported by the Chinese Natural Science Foundation under Grant no. 61379017 and the Natural Science Foundation of Zhejiang Province, China, under Grant no. LY17F020021.