Research Article  Open Access
Guixia He, Jiaquan Gao, "A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs", Mathematical Problems in Engineering, vol. 2016, Article ID 8471283, 12 pages, 2016. https://doi.org/10.1155/2016/8471283
A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs
Abstract
Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format to store sparse matrices. However, CSR-based SpMVs on graphics processing units (GPUs), for example, CSR-scalar and CSR-vector, usually perform poorly because of irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU that is called PCSR. PCSR involves two kernels and accesses CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR outperforms CSR-scalar, CSR-vector, and CSRMV and HYBMV in the vendor-tuned CUSPARSE library and is comparable with a recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR on a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that, whether or not the communication between GPUs is considered, PCSR on multiple GPUs achieves good performance and has high parallel efficiency.
1. Introduction
Sparse matrix-vector multiplication (SpMV) has proven to be an important operation in scientific computing. It needs to be accelerated because SpMV represents the dominant cost in many iterative methods for solving large-sized linear systems and eigenvalue problems that arise in a wide variety of scientific and engineering applications [1]. Initial work on accelerating the SpMV on CUDA-enabled GPUs is presented by Bell and Garland [2, 3]. The corresponding implementations in the CUSPARSE [4] and CUSP libraries [5] include optimized codes of the well-known compressed sparse row (CSR), coordinate list (COO), ELLPACK (ELL), hybrid (HYB), and diagonal (DIA) formats. Experimental results show speedups between 1.56 and 12.30 compared to an optimized CPU implementation for a range of sparse matrices.
SpMV is a largely memory bandwidth-bound operation. Reported results indicate that different access patterns to the matrix and vectors on the GPU influence the SpMV performance [2, 3]. The COO, ELL, DIA, and HYB kernels benefit from full coalescing. However, the scalar CSR kernel (CSR-scalar) shows poor performance because of its rarely coalesced memory accesses [3]. The vector CSR kernel (CSR-vector) improves the performance of CSR-scalar by using warps to access the CSR structure in a contiguous but not generally aligned fashion [3], which implies partial coalescing. Since then, researchers have developed many highly efficient CSR-based SpMV implementations on the GPU by optimizing the memory access pattern of the CSR structure. Lu et al. [6] optimize CSR-scalar by padding CSR arrays and achieve a 30% improvement of the memory access performance. Dehnavi et al. [7] propose a prefetch-CSR method that partitions the matrix nonzeros into blocks of the same size and distributes them amongst GPU resources. This method performs slightly better than CSR-vector by padding rows with zeros to increase data regularity, using parallel reduction techniques, and prefetching data to hide global memory accesses. Furthermore, Dehnavi et al. enhance the performance of the prefetch-CSR method by replacing it with three subkernels [8]. Greathouse and Daga suggest a CSR-Adaptive algorithm that keeps the CSR format intact and maps well to GPUs [9]. Their implementation efficiently accesses DRAM by streaming data into the local scratchpad memory and dynamically assigns different numbers of rows to each parallel GPU compute unit. In addition, numerous works have been proposed for GPUs using variants of the CSR storage format such as the compressed sparse eXtended [10], bit-representation-optimized compression [11], block CSR [12, 13], and row-grouped CSR [14].
Besides using the variants of CSR, many highly efficient SpMVs on GPUs have been proposed by utilizing variants of the ELL and COO storage formats such as the ELLPACK-R [15], ELLR-T [16], sliced ELL [13, 17], SELL-C-σ [18], sliced COO [19], and blocked compressed COO [20]. Specialized storage formats provide definite advantages. However, as many programs use CSR, the conversion from CSR to other storage formats presents a large engineering hurdle, can incur large runtime overheads, and requires extra storage space. Moreover, CSR-based algorithms generally have a lower memory usage than those that are based on other storage formats such as ELL, DIA, and HYB.
All the above observations motivate us to further investigate how to construct efficient SpMVs on GPUs while keeping CSR intact. In this study, we propose a perfect CSR algorithm, called PCSR, on GPUs. PCSR is composed of two kernels and accesses CSR arrays in a fully coalesced manner. Experimental results on C2050 GPUs show that PCSR outperforms CSR-scalar and CSR-vector and performs better than CSRMV and HYBMV in the vendor-tuned CUSPARSE library [4] and a recently proposed CSR-based algorithm, CSR-Adaptive.
The main contributions in this paper are summarized as follows: (i) A novel SpMV implementation on a GPU, which keeps CSR intact, is proposed. The proposed algorithm consists of two kernels and alleviates the deficiencies of many existing CSR algorithms that access CSR arrays in a rarely or partially coalesced manner. (ii) Our proposed SpMV algorithm on a GPU is extended to multiple GPUs. Moreover, we suggest two methods to balance the workload among multiple GPUs.
The rest of this paper is organized as follows. Following this introduction, the matrix storage, CUDA architecture, and SpMV are described in Section 2. In Section 3, a new SpMV implementation on a GPU is proposed. Section 4 discusses how to extend the proposed SpMV algorithm on a GPU to multiple GPUs. Experimental results are presented in Section 5. Section 6 contains our conclusions and points to our future research directions.
2. Related Techniques
2.1. Matrix Storage
To take advantage of the large number of zeros in sparse matrices, special storage formats are required. In this study, only the compressed sparse row (CSR) format is considered, although there are many other sparse matrix storage formats, such as the ELLPACK (or ITPACK) [21], COO [22], DIA [1], and HYB [3]. Using CSR, a sparse matrix $A$ is stored via three arrays: (1) the value array contains all the nonzero entries of $A$, (2) the column-index array contains the column indices of the nonzero entries that are stored in the value array, and (3) the entries of the row-pointer array point to the first entry of each subsequent row of $A$ in the value and column-index arrays.
For example, the following matrix [matrix display not preserved] is stored in the CSR format by [array display not preserved]
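As an illustration of the three-array layout described above, the following is a minimal sketch on a small matrix of our own choosing; it is not the example from the paper. The names `dense_to_csr`, `values`, `col_idx`, and `row_ptr` are assumptions introduced here for readability.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix (list of lists) into CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, entry in enumerate(row):
            if entry != 0:
                values.append(entry)   # nonzero entry
                col_idx.append(j)      # its column index
        # position in `values` where the next row starts
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


# Illustrative 4 x 4 matrix (hypothetical, not taken from the paper).
A = [
    [4, 0, 0, 1],
    [0, 3, 0, 0],
    [2, 0, 5, 0],
    [0, 0, 0, 6],
]
values, col_idx, row_ptr = dense_to_csr(A)
print(values)   # [4, 1, 3, 2, 5, 6]
print(col_idx)  # [0, 3, 1, 0, 2, 3]
print(row_ptr)  # [0, 2, 3, 5, 6]
```

Row $i$ occupies positions `row_ptr[i]` up to, but not including, `row_ptr[i + 1]` of `values` and `col_idx`, so the row-pointer array has one more entry than the matrix has rows.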
2.2. CUDA Architecture
The compute unified device architecture (CUDA) is a heterogeneous computing model that involves both the CPU and the GPU [23]. Executing a parallel program on the GPU using CUDA involves the following: (1) transferring required data to the GPU global memory; (2) launching the GPU kernel; and (3) transferring results back to the host memory. The threads of a kernel are grouped into a grid of thread blocks. The GPU schedules blocks over the multiprocessors according to their available execution capacity. When a block is given to a multiprocessor, it is split into warps that are composed of 32 threads. In the best case, all 32 threads have the same execution path and the instruction is executed concurrently. If not, the execution paths are executed sequentially, which greatly reduces the efficiency. The threads in a block communicate via the fast shared memory, but the threads in different blocks communicate through high-latency global memory. Major challenges in optimizing an application on GPUs are global memory access latency, different execution paths in each warp, communication and synchronization between threads in different blocks, and resource utilization.
2.3. Sparse Matrix-Vector Multiplication
Assume that $A$ is a sparse matrix and $x$ is a vector of matching size; a sequential version of CSR-based SpMV is described in Algorithm 1. Obviously, the order in which the elements of the three CSR arrays and of $x$ are accessed has an important impact on the SpMV performance on GPUs, where memory access patterns are crucial.
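For reference, a sequential CSR-based SpMV can be sketched as follows. This is a standard formulation written for illustration and is not a transcription of Algorithm 1; the names `spmv_csr`, `values`, `col_idx`, and `row_ptr` are assumptions.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Sequential y = A * x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Walk the nonzeros of row i; x is read at scattered positions.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Illustrative 4 x 4 matrix with six nonzeros (hypothetical data).
values = [4.0, 1.0, 3.0, 2.0, 5.0, 6.0]
col_idx = [0, 3, 1, 0, 2, 3]
row_ptr = [0, 2, 3, 5, 6]
x = [1.0, 2.0, 3.0, 4.0]
y = spmv_csr(values, col_idx, row_ptr, x)
print(y)  # [8.0, 6.0, 17.0, 24.0]
```

The inner loop reads `values` and `col_idx` contiguously but reads `x` through an indirection, which is the access pattern that matters on the GPU.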

3. SpMV on a GPU
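In this section, we present a perfect implementation of CSR-based SpMV on the GPU. Different from other related work, the proposed algorithm involves the following two kernels: (i) Kernel 1: calculate the middle array, whose $i$th element is the product of the $i$th stored nonzero value and the entry of the input vector selected by the $i$th column index, for every stored nonzero, and then save it to global memory. (ii) Kernel 2: accumulate the element values of the middle array row by row, that is, for each row sum the elements whose positions lie between that row's pointer and the next row's pointer, and store the sums to the output array in global memory.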
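The two-kernel split can be sketched sequentially as below. This is a host-side emulation under our own naming (`kernel1`, `kernel2`, `values`, `col_idx`, `row_ptr`, `v`), intended only to show the data flow through the middle array; it does not reproduce the GPU thread mapping of the paper.

```python
def kernel1(values, col_idx, x):
    """Element-wise products: one independent product per stored nonzero."""
    return [values[i] * x[col_idx[i]] for i in range(len(values))]


def kernel2(v, row_ptr):
    """Row-wise accumulation of the middle array v."""
    n_rows = len(row_ptr) - 1
    return [sum(v[row_ptr[j]:row_ptr[j + 1]]) for j in range(n_rows)]


# Illustrative CSR data (hypothetical, not from the paper).
values = [4.0, 1.0, 3.0, 2.0, 5.0, 6.0]
col_idx = [0, 3, 1, 0, 2, 3]
row_ptr = [0, 2, 3, 5, 6]
x = [1.0, 2.0, 3.0, 4.0]

v = kernel1(values, col_idx, x)   # middle array, one entry per nonzero
y = kernel2(v, row_ptr)           # one entry per row
print(v)  # [4.0, 4.0, 6.0, 2.0, 15.0, 24.0]
print(y)  # [8.0, 6.0, 17.0, 24.0]
```

In the first step, entry `i` depends only on position `i` of the value and column-index arrays, so consecutive threads can read consecutive addresses; the row structure enters only in the second step.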
We call the proposed SpMV algorithm PCSR. For simplicity, the symbols used in this study are listed in Table 1.

3.1. Kernel 1
The detailed procedure of Kernel 1 is shown in Algorithm 2. We observe that the accesses to the value and column-index arrays in global memory are fully coalesced. However, the input vector in global memory is randomly accessed, which decreases the performance of Kernel 1. On the basis of the evaluations in [24], texture memory is the best memory space in which to place an array that is accessed randomly. Therefore, texture memory is used here to hold the input vector instead of global memory. For the single-precision floating-point texture, the fourth step in Algorithm 2 is rewritten so that the vector element is fetched through the texture. Because the texture does not support double values, the following function is suggested to convert the int2 value to the double value.

Furthermore, for the double-precision floating-point texture, based on this function, we rewrite the fourth step in Algorithm 2 so that the vector element is fetched as an int2 value and converted to a double value.
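The int2-to-double conversion amounts to reinterpreting two 32-bit words as one 64-bit floating-point value. In CUDA this is commonly done with the `__hiloint2double` intrinsic applied to the two components of the fetched int2; the paper's own function is not preserved here, so the following is only a host-side sketch of the bit reinterpretation, with `double_to_int2` and `hilo_int2_to_double` as hypothetical helper names.

```python
import struct


def double_to_int2(d):
    """Split a double into (lo, hi) signed 32-bit words, little-endian order."""
    lo, hi = struct.unpack("<ii", struct.pack("<d", d))
    return lo, hi


def hilo_int2_to_double(hi, lo):
    """Reassemble a double from its high and low 32-bit words."""
    return struct.unpack("<d", struct.pack("<ii", lo, hi))[0]


lo, hi = double_to_int2(1.0)
print(lo, hi)                        # 0 1072693248
print(hilo_int2_to_double(hi, lo))   # 1.0
```

No arithmetic conversion takes place: the 64 bits of the double are stored as two integers so that they can pass through an integer texture, and are reassembled unchanged after the fetch.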
3.2. Kernel 2
Kernel 2 accumulates the element values of the middle array obtained by Kernel 1; its detailed procedure is shown in Algorithm 3. This kernel is mainly composed of the following three stages: (i) In the first stage, the row-pointer array in global memory is piecewise assembled into the shared memory of each thread block in parallel. Each thread of a thread block is only responsible for loading one element of the row-pointer array into shared memory, except for thread 0 (see lines (05)-(06) in Algorithm 3). The detailed procedure is illustrated in Figure 1. We can see that the accesses to the row-pointer array are aligned. (ii) The second stage loads the element values of the middle array in global memory, from the position given by the first row pointer held by the thread block to the position given by the last one, into shared memory for each thread block. The assembling procedure is illustrated in Figure 2. In this case, the access to the middle array is fully coalesced. (iii) The third stage accumulates the element values of the middle array, as shown in Figure 3. The accumulation is highly efficient because of the use of the two shared-memory arrays.

Obviously, Kernel 2 benefits from shared memory. Using the shared memory, not only are the data accessed fast, but also the accesses to data are coalesced.
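The three stages of Kernel 2 can be emulated on the host as follows. This is a sequential sketch under simplifying assumptions: each outer iteration stands for one thread block, Python lists stand in for the shared-memory arrays, and the block's slice of the middle array is assumed to fit in shared memory. The names `kernel2_blocked`, `ptr_s`, `v_s`, and `threads_per_block` are ours, and the code is not a transcription of Algorithm 3.

```python
def kernel2_blocked(v, row_ptr, threads_per_block):
    """Emulate block-wise accumulation of the middle array v."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for first_row in range(0, n_rows, threads_per_block):
        last_row = min(first_row + threads_per_block, n_rows)
        # Stage 1: stage this block's row pointers (one more than its rows).
        ptr_s = row_ptr[first_row:last_row + 1]
        # Stage 2: stage the contiguous slice of v covered by those pointers.
        base = ptr_s[0]
        v_s = v[base:ptr_s[-1]]
        # Stage 3: each "thread" t sums its own row from the staged data.
        for t in range(last_row - first_row):
            acc = 0.0
            for k in range(ptr_s[t] - base, ptr_s[t + 1] - base):
                acc += v_s[k]
            y[first_row + t] = acc
    return y


# Middle array and row pointers of an illustrative 4-row matrix.
v = [4.0, 4.0, 6.0, 2.0, 15.0, 24.0]
row_ptr = [0, 2, 3, 5, 6]
y = kernel2_blocked(v, row_ptr, threads_per_block=2)
print(y)  # [8.0, 6.0, 17.0, 24.0]
```

Because each block needs one contiguous slice of the row pointers and one contiguous slice of the middle array, both loads can be issued as consecutive addresses; the irregular per-row lengths are handled only after the data are staged.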
From the above procedures for PCSR, we observe that PCSR needs additional global memory space to store a middle array besides storing the three CSR arrays. Saving data into the middle array in Kernel 1 and loading data from it in Kernel 2 decrease the performance of PCSR to a degree. However, PCSR benefits from the middle array because introducing it makes PCSR access the three CSR arrays in a fully coalesced manner. This greatly improves the speed of accessing CSR arrays and alleviates the principal deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing).
4. SpMV on Multiple GPUs
In this section, we present how to extend PCSR on a single GPU to multiple GPUs. Note that only the case of multiple GPUs in a single node (single PC) is discussed because of its good expansibility (e.g., it is also used in the multi-CPU and multi-GPU heterogeneous platform). To balance the workload among multiple GPUs, the following two methods can be applied: (1) For the first method, the matrix is equally partitioned by rows into as many submatrices as there are GPUs. Each submatrix is assigned to one GPU, and each GPU is only responsible for multiplying its assigned submatrix by the complete input vector. (2) For the second method, the matrix is equally partitioned into submatrices according to the number of nonzero elements. Each GPU only multiplies its submatrix by the complete input vector.
In most cases, the two partitioning methods mentioned above give similar results. However, in some exceptional cases, for example, when most nonzero elements of a matrix are concentrated in a few rows, the submatrices obtained by the first method differ markedly in their numbers of nonzero elements, and those obtained by the second method differ in their numbers of rows. Which method is the preferred one for PCSR?
If each GPU has the complete input vector, PCSR on multiple GPUs will not need to communicate between GPUs. In practice, however, SpMV is often used inside iterative methods, where the sparse matrix is repeatedly multiplied by vectors and the output of one multiplication feeds the input of a later one. Therefore, if each GPU only holds a part of the input vector before SpMV, communication between GPUs will be required in order to execute PCSR. Here PCSR implements the communication between GPUs using NVIDIA GPUDirect.
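The two partitioning methods can be contrasted with a small sketch that works directly on the CSR row pointers. It assumes that rows are never split between GPUs and that the nonzero-balanced cut is placed at the first row boundary reaching each target count; the paper does not spell out these details, and the function names are hypothetical.

```python
import bisect


def partition_by_rows(row_ptr, num_gpus):
    """Method 1: give each GPU (nearly) the same number of rows."""
    n_rows = len(row_ptr) - 1
    return [(g * n_rows // num_gpus, (g + 1) * n_rows // num_gpus)
            for g in range(num_gpus)]


def partition_by_nnz(row_ptr, num_gpus):
    """Method 2: give each GPU (nearly) the same number of nonzeros."""
    n_rows = len(row_ptr) - 1
    nnz = row_ptr[-1]
    bounds = [0]
    for g in range(1, num_gpus):
        target = g * nnz / num_gpus
        # First row boundary whose cumulative nonzero count reaches target.
        r = bisect.bisect_left(row_ptr, target)
        bounds.append(min(max(r, bounds[-1]), n_rows))
    bounds.append(n_rows)
    return list(zip(bounds[:-1], bounds[1:]))


def nnz_per_part(row_ptr, parts):
    return [row_ptr[end] - row_ptr[start] for start, end in parts]


# Skewed example: row 0 holds 9 nonzeros, rows 1..7 hold 1 each.
row_ptr = [0, 9, 10, 11, 12, 13, 14, 15, 16]
by_rows = partition_by_rows(row_ptr, 2)
by_nnz = partition_by_nnz(row_ptr, 2)
print(by_rows, nnz_per_part(row_ptr, by_rows))  # [(0, 4), (4, 8)] [12, 4]
print(by_nnz, nnz_per_part(row_ptr, by_nnz))    # [(0, 1), (1, 8)] [9, 7]
```

On this skewed input the row-balanced split leaves one GPU with three times the nonzeros of the other, while the nonzero-balanced split evens out the work at the cost of very different row counts.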
5. Experimental Results
5.1. Experimental Setup
In this section, we test the performance of PCSR. All test matrices come from the University of Florida Sparse Matrix Collection [25]. Their properties are summarized in Table 2.

All algorithms are executed on one machine that is equipped with an Intel Xeon Quad-Core CPU and four NVIDIA Tesla C2050 GPUs. Our source codes are compiled and executed using the CUDA toolkit 6.5 under GNU/Linux Ubuntu v10.04.1. The performance is measured in terms of GFlop/s or GByte/s.
5.2. Single GPU
We compare PCSR with CSR-scalar, CSR-vector, CSRMV, HYBMV, and CSR-Adaptive. CSR-scalar and CSR-vector in the CUSP library [5] are chosen in order to show the effects of accessing CSR arrays in a fully coalesced manner in PCSR. CSRMV in the CUSPARSE library [4] is a representative of CSR-based SpMV algorithms on the GPU. HYBMV in the CUSPARSE library [4] is a finely tuned HYB-based SpMV algorithm on the GPU and usually performs better than many existing SpMV algorithms. CSR-Adaptive is a recently proposed CSR-based algorithm [9].
We select 15 sparse matrices with distinct sizes ranging from 25,228 to 2,063,494 as our test matrices. Figure 4 shows the single-precision and double-precision performance results in terms of GFlop/s of CSR-scalar, CSR-vector, CSRMV, HYBMV, CSR-Adaptive, and PCSR on a Tesla C2050. GFlop/s values in Figure 4 are calculated on the basis of the assumption of two Flops per nonzero entry for a matrix [3, 13]. In Figure 5, the measured memory bandwidth results for single precision and double precision are reported.
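Figure 4: Performance in GFlop/s on a Tesla C2050. (a) Single precision. (b) Double precision.
Figure 5: Measured memory bandwidth on a Tesla C2050. (a) Single precision. (b) Double precision.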
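The GFlop/s figures follow from the assumption of two Flops (one multiplication and one addition) per nonzero entry. A minimal sketch of that calculation is given below; the function name and the sample numbers are hypothetical and are not measurements from the paper.

```python
def spmv_gflops(nnz, seconds):
    """GFlop/s of one SpMV, counting two Flops per stored nonzero."""
    return 2.0 * nnz / seconds / 1e9


# Hypothetical run: 5 million nonzeros processed in 1 millisecond.
rate = spmv_gflops(nnz=5_000_000, seconds=1e-3)
print(round(rate, 6))  # 10.0
```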
5.2.1. Single Precision
From Figure 4(a), we observe that PCSR achieves high performance for all the matrices in the single-precision mode. In most cases, a performance of over 9 GFlop/s can be obtained. Moreover, PCSR outperforms CSR-scalar, CSR-vector, and CSRMV for all test cases, with average speedups of 4.24x, 2.18x, and 1.62x compared to CSR-scalar, CSR-vector, and CSRMV, respectively. Furthermore, PCSR performs slightly better than HYBMV for all the matrices except for af_shell9 and cont300. The average speedup is 1.22x compared to HYBMV. Figure 6 shows the visualization of af_shell9 and cont300. We find that af_shell9 and cont300 have a similar structure and that the rows of each matrix have very similar numbers of nonzero elements, which makes them well suited to storage in the ELL section of the HYB format. PCSR and CSR-Adaptive have close performance: on average, PCSR is nearly 1.05 times faster than CSR-Adaptive.
Figure 6: Visualization of (a) cont300 and (b) af_shell9.
Furthermore, PCSR has nearly the best memory bandwidth utilization among all algorithms for all the matrices except for af_shell9 and cont300 (Figure 5(a)). The maximum memory bandwidth of PCSR exceeds 128 GByte/s, which is about 90 percent of the peak theoretical memory bandwidth of the Tesla C2050. Based on the performance metrics [26], we can conclude that PCSR achieves good performance and has high parallelism.
5.2.2. Double Precision
From Figures 4(b) and 5(b), we see that, for all algorithms, both the double-precision performance and memory bandwidth utilization are lower than the corresponding single-precision values because of the slow software-based operation. PCSR is still better than CSR-scalar, CSR-vector, and CSRMV and slightly outperforms HYBMV and CSR-Adaptive for all the matrices. The average speedup of PCSR is 3.33x compared to CSR-scalar, 1.98x compared to CSR-vector, 1.57x compared to CSRMV, 1.15x compared to HYBMV, and 1.03x compared to CSR-Adaptive. The maximum memory bandwidth of PCSR exceeds 108 GByte/s, which is about 75 percent of the peak theoretical memory bandwidth of the Tesla C2050.
5.3. Multiple GPUs
5.3.1. PCSR Performance without Communication
Here we take the double-precision mode as an example to test the PCSR performance on multiple GPUs without considering communication. We call PCSR with the first method and PCSR with the second method PCSR-I and PCSR-II, respectively. Some large-sized test matrices in Table 2 are used. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 3 and 4, respectively. In Tables 3 and 4, ET, SD, and PE stand for the execution time, standard deviation, and parallel efficiency, respectively. The time unit is millisecond (ms). Figures 7 and 8 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.
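Parallel efficiency is not defined explicitly in the text; the sketch below assumes the usual definition, namely the single-GPU execution time divided by the product of the number of GPUs and the multi-GPU execution time. The function name and the timings are hypothetical.

```python
def parallel_efficiency(t_single_ms, t_multi_ms, num_gpus):
    """Speedup over one GPU divided by the number of GPUs (assumed definition)."""
    return t_single_ms / (num_gpus * t_multi_ms)


# Hypothetical timings: 8.0 ms on one GPU, 2.5 ms on four GPUs.
pe = parallel_efficiency(8.0, 2.5, 4)
print(f"{pe:.2%}")  # 80.00%
```

Under this definition a value of 100% means that the execution time shrinks in exact proportion to the number of GPUs.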


On two GPUs, Table 3 and Figure 7 show that PCSR-II has better parallel efficiency than PCSR-I for all the matrices except for G3_circuit. The maximum, average, and minimum parallel efficiency of PCSR-II are 98.06%, 91.64%, and 77.41%, which match or exceed the corresponding maximum, average, and minimum parallel efficiency of PCSR-I, 98.06%, 88.16%, and 72.20%. Moreover, PCSR-II has a smaller standard deviation than PCSR-I for all the matrices except for ecology2, Transport, and G3_circuit. This implies that the second method balances the workload on two GPUs better than the first method.
On four GPUs, in terms of the parallel efficiency and standard deviation, PCSR-II outperforms PCSR-I for all the matrices except one (Table 4 and Figure 8). The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.35%, 85.14%, and 64.17%, which exceed the corresponding maximum, average, and minimum parallel efficiency of PCSR-I, 96.21%, 78.89%, and 59.94%. In particular, for five of the matrices, the parallel efficiency of PCSR-II is noticeably higher than that obtained by PCSR-I.
On the basis of the above observations, we conclude that PCSR-II has high performance and is on the whole better than PCSR-I. For PCSR on multiple GPUs, the second method is our preferred one.
5.3.2. PCSR Performance with Communication
We again take the double-precision mode as an example to test the PCSR performance on multiple GPUs when communication is considered. PCSR with the first method and PCSR with the second method are still called PCSR-I and PCSR-II, respectively. The same test matrices as in the above experiment are utilized. The execution time comparison of PCSR-I and PCSR-II on two and four Tesla C2050 GPUs is listed in Tables 5 and 6, respectively. The time unit is ms. ET, SD, and PE in Tables 5 and 6 are the same as those in Tables 3 and 4. Figures 9 and 10 show the parallel efficiency of PCSR-I and PCSR-II on two and four GPUs, respectively.


On two GPUs, PCSR-I and PCSR-II have close parallel efficiency for most matrices (Figure 9 and Table 5). By comparison, PCSR-II slightly outperforms PCSR-I. The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 96.34%, 88.51%, and 80.44%, which exceed the corresponding maximum, average, and minimum parallel efficiency of PCSR-I, 96.05%, 86.03%, and 73.57%.
On four GPUs, in terms of the parallel efficiency and standard deviation, PCSR-II is better than PCSR-I for all the matrices, except that PCSR-I has slightly better parallel efficiency for three of the matrices and a slightly smaller standard deviation for four of them (Figure 10 and Table 6). The maximum, average, and minimum parallel efficiency of PCSR-II for all the matrices are 90.12%, 74.50%, and 58.27%, which are better than the corresponding maximum, average, and minimum parallel efficiency of PCSR-I, 88.77%, 65.69%, and 56.39%.
Therefore, although communication lowers the performance of PCSR-I and PCSR-II relative to the runs without communication, both still achieve good performance. Because PCSR-II overall outperforms PCSR-I on the test matrices, the second method is still our preferred one for PCSR on multiple GPUs in this case.
6. Conclusion
In this study, we propose a novel CSR-based SpMV on GPUs (PCSR). Experimental results show that our proposed PCSR on a GPU is better than CSR-scalar and CSR-vector in the CUSP library, CSRMV and HYBMV in the CUSPARSE library, and a recently proposed CSR-based algorithm, CSR-Adaptive. To achieve high performance on multiple GPUs for PCSR, we present two matrix-partitioning methods to balance the workload among multiple GPUs. We observe that PCSR shows good performance with and without considering communication using the two matrix-partitioning methods. Of the two, the second method is our preferred one.
Next, we will do further research in this area and develop other novel SpMVs on GPUs. In particular, future work will apply PCSR to some well-known iterative methods and thus solve scientific and engineering problems.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
The research has been supported by the Chinese Natural Science Foundation under Grant no. 61379017.
References
[1] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, Pa, USA, 2nd edition, 2003.
[2] N. Bell and M. Garland, “Efficient Sparse Matrix-vector Multiplication on CUDA,” Tech. Rep., NVIDIA, 2008.
[3] N. Bell and M. Garland, “Implementing sparse matrix-vector multiplication on throughput-oriented processors,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 14–19, Portland, Ore, USA, November 2009.
[4] NVIDIA, CUSPARSE Library 6.5, 2015, https://developer.nvidia.com/cusparse.
[5] N. Bell and M. Garland, “Cusp: Generic parallel algorithms for sparse matrix and graph computations, version 0.5.1,” 2015, http://cusplibrary.googlecode.com.
[6] F. Lu, J. Song, F. Yin, and X. Zhu, “Performance evaluation of hybrid programming patterns for large CPU/GPU heterogeneous clusters,” Computer Physics Communications, vol. 183, no. 6, pp. 1172–1181, 2012.
[7] M. M. Dehnavi, D. M. Fernández, and D. Giannacopoulos, “Finite-element sparse matrix vector multiplication on graphic processing units,” IEEE Transactions on Magnetics, vol. 46, no. 8, pp. 2982–2985, 2010.
[8] M. M. Dehnavi, D. M. Fernández, and D. Giannacopoulos, “Enhancing the performance of conjugate gradient solvers on graphic processing units,” IEEE Transactions on Magnetics, vol. 47, no. 5, pp. 1162–1165, 2011.
[9] J. L. Greathouse and M. Daga, “Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769–780, New Orleans, La, USA, November 2014.
[10] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, “An extended compression format for the optimization of sparse matrix-vector multiplication,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930–1940, 2013.
[11] W. T. Tang, W. J. Tan, R. Ray et al., “Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13), pp. 1–12, Denver, Colo, USA, November 2013.
[12] M. Verschoor and A. C. Jalba, “Analysis and performance estimation of the conjugate gradient method on multiple GPUs,” Parallel Computing, vol. 38, no. 10-11, pp. 552–575, 2012.
[13] J. W. Choi, A. Singh, and R. W. Vuduc, “Model-driven autotuning of sparse matrix-vector multiply on GPUs,” in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10), pp. 115–126, ACM, Bangalore, India, January 2010.
[14] G. Oyarzun, R. Borrell, A. Gorobets, and A. Oliva, “MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner,” Computers & Fluids, vol. 92, pp. 244–252, 2014.
[15] F. Vázquez, J. J. Fernández, and E. M. Garzón, “A new approach for sparse matrix vector product on NVIDIA GPUs,” Concurrency Computation Practice and Experience, vol. 23, no. 8, pp. 815–826, 2011.
[16] F. Vázquez, J. J. Fernández, and E. M. Garzón, “Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach,” Parallel Computing, vol. 38, no. 8, pp. 408–420, 2012.
[17] A. Monakov, A. Lokhmotov, and A. Avetisyan, “Automatically tuning sparse matrix-vector multiplication for GPU architectures,” in High Performance Embedded Architectures and Compilers: 5th International Conference, HiPEAC 2010, Pisa, Italy, January 25–27, 2010. Proceedings, pp. 111–125, Springer, Berlin, Germany, 2010.
[18] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, “A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014.
[19] H.-V. Dang and B. Schmidt, “CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations,” Parallel Computing. Systems & Applications, vol. 39, no. 11, pp. 737–750, 2013.
[20] S. Yan, C. Li, Y. Zhang, and H. Zhou, “YaSpMV: yet another SpMV framework on GPUs,” in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pp. 107–118, February 2014.
[21] D. R. Kincaid and D. M. Young, “A brief review of the ITPACK project,” Journal of Computational and Applied Mathematics, vol. 24, no. 1-2, pp. 121–127, 1988.
[22] G. Blelloch, M. Heroux, and M. Zagha, “Segmented operations for sparse matrix computation on vector multiprocessor,” Tech. Rep., School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa, USA, 1993.
[23] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with CUDA,” ACM Queue, vol. 6, no. 2, pp. 40–53, 2008.
[24] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, “Exploiting memory access patterns to improve memory performance in data-parallel architectures,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 105–118, 2011.
[25] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011.
[26] NVIDIA, “CUDA C Programming Guide 6.5,” 2015, http://docs.nvidia.com/cuda/cuda-c-programming-guide.
Copyright
Copyright © 2016 Guixia He and Jiaquan Gao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.