Research Article
Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU
Figure 11
Sharing 2048 registers among (a) a larger number of threads, each using fewer registers, versus (b) a smaller number of threads, each using more registers.
(a) Eight blocks, 64 threads per block, and 4 registers per thread
(b) Eight blocks, 32 threads per block, and 8 registers per thread
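
The arithmetic behind the two panels can be checked with a short sketch. It is only an illustration of the register budget stated in the caption: the function name `register_usage` and the constant `REGISTER_FILE` are hypothetical names introduced here, and the model ignores real hardware details such as allocation granularity.

```python
REGISTER_FILE = 2048  # register budget taken from the figure caption


def register_usage(blocks, threads_per_block, regs_per_thread):
    """Return (total threads, total registers) for one configuration."""
    threads = blocks * threads_per_block
    return threads, threads * regs_per_thread


# Panel (a): 8 blocks, 64 threads per block, 4 registers per thread
config_a = register_usage(8, 64, 4)

# Panel (b): 8 blocks, 32 threads per block, 8 registers per thread
config_b = register_usage(8, 32, 8)

for name, (threads, regs) in (("a", config_a), ("b", config_b)):
    print(f"({name}) threads={threads}, registers={regs}, "
          f"fits={regs <= REGISTER_FILE}")
```

Both configurations use the full 2048 registers, but doubling the registers per thread halves the number of threads that share them (512 in panel (a) versus 256 in panel (b)).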