Research Article

Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU

Figure 11

Sharing 2048 registers among (a) a larger number of threads, each using fewer registers, versus (b) a smaller number of threads, each using more registers.
(a) Eight blocks, 64 threads per block, and 4 registers per thread
(b) Eight blocks, 32 threads per block, and 8 registers per thread
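
The register arithmetic behind the two panels can be checked with a short sketch. It assumes every thread receives the same number of registers and that the 2048-register budget is the only limit. The helper name `registers_used` and the dictionary layout are illustrative and do not come from the article.

```python
# Illustrative check of the register budgets in panels (a) and (b).
# Assumption: every thread gets the same register allocation and
# no other resource limit (shared memory, block count) applies.

REGISTER_BUDGET = 2048  # total registers, as stated in the caption


def registers_used(blocks, threads_per_block, regs_per_thread):
    """Total registers consumed by all resident threads."""
    return blocks * threads_per_block * regs_per_thread


# Panel (a): more threads, fewer registers each.
config_a = {"blocks": 8, "threads_per_block": 64, "regs_per_thread": 4}
# Panel (b): fewer threads, more registers each.
config_b = {"blocks": 8, "threads_per_block": 32, "regs_per_thread": 8}

for label, cfg in (("a", config_a), ("b", config_b)):
    threads = cfg["blocks"] * cfg["threads_per_block"]
    total = registers_used(**cfg)
    print(f"({label}) {threads} threads x {cfg['regs_per_thread']} registers = {total}")
```

Both configurations use the full budget of 2048 registers. Doubling the registers per thread from 4 to 8 halves the number of resident threads from 512 to 256.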