Research Article

A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs

Algorithm 3

Kernel 2.
Input:
   CUDA-specific variables:
   (i) threadIdx.x: index of a thread within its block
   (ii) blockIdx.x: index of a block within the grid
   (iii) blockDim.x: number of threads per block
   (iv) gridDim.x: number of blocks per grid
Output:
(01) define shared memory with size
(02) define shared memory with size
(03) threadIdx.x + blockIdx.x * blockDim.x;
(04) threadIdx.x;
 /* Load ptr into the shared memory ptr_s */
(05) [] [];
(06) if == 0 then _s[] [ + ];
(07) __syncthreads();
(08) ([] − )/ + 1;
(09) ( , );
(10) 0.0; [];
(11) for to with += do
(12)  index ;
(13)  __syncthreads();
  /* Load into the shared memory */
(14)  for to do
(15)   if then
(16)   [ ] ;
(17)    += ;
(18)   end
(19)  done
(20)  __syncthreads();
   /* Perform a scalar-style reduction */
(21)  if ( or ) is false  then
(22)    ([] − );
(23)    ;
(24)   for to do
(25)     += ;
(26)   done
(27)  end
(28) done
(29) ;
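
For orientation, the result that any CSR-based SpMV kernel has to produce can be checked against a sequential reference. The sketch below is not the kernel listed above and does not use shared memory. It only shows the row-wise, scalar-style accumulation over the CSR arrays, written as host-side C++. The names `val`, `col`, `ptr`, `x`, and the helper `spmv_csr_reference` follow the usual CSR convention and are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for y = A * x with A stored in CSR form.
//   val: nonzero values, stored row by row
//   col: column index of each nonzero
//   ptr: ptr[i] is the offset of the first nonzero of row i,
//        so ptr.size() == number_of_rows + 1
// All names are illustrative assumptions, not taken from the listing.
std::vector<double> spmv_csr_reference(const std::vector<double>& val,
                                       const std::vector<int>& col,
                                       const std::vector<int>& ptr,
                                       const std::vector<double>& x) {
    assert(!ptr.empty());
    assert(val.size() == col.size());
    const std::size_t rows = ptr.size() - 1;
    std::vector<double> y(rows, 0.0);
    for (std::size_t i = 0; i < rows; ++i) {
        double sum = 0.0;
        // Scalar-style reduction: one accumulator walks the nonzeros of row i.
        for (int j = ptr[i]; j < ptr[i + 1]; ++j) {
            sum += val[j] * x[col[j]];
        }
        y[i] = sum;
    }
    return y;
}
```

For the 3 x 3 matrix with rows (1, 0, 2), (0, 3, 0), (4, 0, 5) and x = (1, 2, 3), the CSR arrays are `val = {1, 2, 3, 4, 5}`, `col = {0, 2, 1, 0, 2}`, `ptr = {0, 2, 3, 5}`, and the function returns (7, 6, 19). A GPU kernel's output can be compared against this reference element by element, allowing for floating-point rounding differences when the summation order differs.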