input:
output:
(1)   shared memory
(2)   for all threads in every thread block do in parallel
(3)       local variables
(4)
(5)
(6)
(7)       if     then
(8)           if     then
(9)
(10)          else
(11)
(12)      if     then
(13)          if     then
(14)
(15)          else
(16)
(17)
(18)
(19)
(20)
(21)
(22)
Algorithm 3: CUDA kernel for tridiagonal matrix-vector multiplication.
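As an illustration only, the following is a minimal sketch of a kernel with the same overall shape as the listing: a shared-memory tile, per-thread local indices, two boundary branches, and a final per-row product. Everything specific is an assumption and is not taken from Algorithm 3: the storage of the matrix as three arrays a (sub-diagonal), b (diagonal), and c (super-diagonal), the block size BLOCK, the reading of the two if/else branches as halo loads, and the names tridiag_matvec_kernel and tridiag_matvec. Error checking on the CUDA runtime calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // threads per block (assumed value)

// Computes y = A * x for a tridiagonal A stored as three diagonals (assumed layout):
// a[i] = A[i][i-1] (a[0] unused), b[i] = A[i][i], c[i] = A[i][i+1] (c[n-1] unused).
__global__ void tridiag_matvec_kernel(const float* a, const float* b, const float* c,
                                      const float* x, float* y, int n)
{
    // Shared tile of x with one halo cell on each side of the block.
    __shared__ float xs[BLOCK + 2];

    // Local variables: thread index within the block and global row index.
    const int t = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + t;

    // Every thread loads its own entry of x; rows past the end load zero.
    xs[t + 1] = (i < n) ? x[i] : 0.0f;

    // The first thread of the block loads the left halo cell.
    if (t == 0) {
        if (i > 0 && i - 1 < n) xs[0] = x[i - 1];
        else                    xs[0] = 0.0f;
    }
    // The last thread of the block loads the right halo cell.
    if (t == blockDim.x - 1) {
        if (i + 1 < n) xs[t + 2] = x[i + 1];
        else           xs[t + 2] = 0.0f;
    }
    __syncthreads();  // the tile must be complete before any thread reads it

    if (i < n) {
        // Guard the first and last rows so the unused a[0] and c[n-1] are never read.
        const float left  = (i > 0)     ? a[i] * xs[t]     : 0.0f;
        const float right = (i < n - 1) ? c[i] * xs[t + 2] : 0.0f;
        y[i] = left + b[i] * xs[t + 1] + right;
    }
}

// Hypothetical host wrapper: copies the inputs to the device, launches the kernel
// with BLOCK threads per block, and copies the result back.
std::vector<float> tridiag_matvec(const std::vector<float>& a, const std::vector<float>& b,
                                  const std::vector<float>& c, const std::vector<float>& x)
{
    const int n = static_cast<int>(x.size());
    std::vector<float> y(n);
    if (n == 0) return y;  // a launch with zero blocks is invalid

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *da, *db, *dc, *dx, *dy;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dc, c.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);

    const int grid = (n + BLOCK - 1) / BLOCK;  // enough blocks to cover all n rows
    tridiag_matvec_kernel<<<grid, BLOCK>>>(da, db, dc, dx, dy, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}
```

In this sketch each entry of x is read from global memory once per block and then reused from shared memory by up to three neighbouring threads. The kernel assumes it is launched with exactly BLOCK threads per block, as the wrapper does.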