input:
output:
(1) for all threads in every thread block do in parallel
(2)           local variables
(3)           
(4)           
(5)           for     to     by Step  1 do
(6)                     
(7)           
Algorithm 4: Optimized CUDA kernel for constant vector multiplication and vector-vector addition.
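
The symbols of the listing above did not survive extraction, so the following is only a minimal CUDA sketch of a kernel of this kind, not a reconstruction of Algorithm 4. It assumes that the operation is y ← αx + y on single-precision vectors of length n, that the local variables of step (2) are a global thread index and a stride, and that the loop of step (5) is a grid-stride loop. The names `scale_add_kernel`, `run_scale_add_demo`, `alpha`, `x`, `y`, and `n`, as well as the launch configuration, are hypothetical.

```cuda
#include <cmath>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (assumed semantics): y[i] = alpha * x[i] + y[i], 0 <= i < n.
__global__ void scale_add_kernel(int n, float alpha, const float* x, float* y)
{
    // Local variables: global index of this thread and total number of launched threads.
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Grid-stride loop (an assumption): each thread handles elements
    // tid, tid + stride, tid + 2 * stride, ... so any grid size covers all n elements.
    for (int i = tid; i < n; i += stride) {
        y[i] = alpha * x[i] + y[i];
    }
}

// Hypothetical host helper: runs the kernel on test data and returns the
// maximum absolute deviation from a CPU reference. CUDA error checking is
// omitted for brevity.
float run_scale_add_demo(int n, float alpha)
{
    std::vector<float> hx(n), hy(n), ref(n);
    for (int i = 0; i < n; ++i) {
        hx[i]  = static_cast<float>(i % 100);
        hy[i]  = 1.0f;
        ref[i] = alpha * hx[i] + hy[i];
    }

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* dx = nullptr;
    float* dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    // Deliberately launch fewer threads than elements to exercise the loop.
    const int threads = 256;
    const int blocks  = 64;
    scale_add_kernel<<<blocks, threads>>>(n, alpha, dx, dy);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i) {
        max_err = fmaxf(max_err, fabsf(hy[i] - ref[i]));
    }
    return max_err;
}
```

Calling `run_scale_add_demo(1 << 16, 2.0f)` from a host `main` exercises the sketch. With a grid-stride loop, consecutive threads of a warp access consecutive elements in every iteration, so loads and stores stay coalesced while the grid size can be chosen independently of n.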