Research Article

A Fast Fully Parallel Ant Colony Optimization Algorithm Based on CUDA for Solving TSP

Figure 4

Execution process of the kernel function. When a CUDA program executes, the kernel function is loaded onto the GPU and organized as a grid (Figure 4(a)). Before launching the kernel, the programmer sets the dimensions of the grid, which consists of several blocks. The GPU's controller then identifies the blocks and allocates them to SMs; it determines which SM each block runs on. As shown in Figure 4(b), a block runs on one SM until all instructions within that block have finished. Figure 4(c) shows the scheduling of threads. Within an SM, the warp scheduler schedules the threads of a block warp by warp (a warp consists of 32 threads), an architecture called single instruction, multiple threads (SIMT) [24], and sends instructions to the dispatch unit, which dispatches the threads to the CUDA cores for execution. More details about thread execution can be found in [25–27]. (a) Kernel functions are loaded from the host to the GPU. (b) A grid is divided into several blocks that are allocated to SMs. (c) The controller of an SM dispatches threads to each CUDA core for execution.
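To make the grid/block/warp hierarchy described above concrete, the following minimal CUDA sketch (not code from the paper; the kernel name scale and the problem size n are hypothetical) shows how a programmer sets the grid and block dimensions before launching a kernel. Each launched block is assigned to an SM, where its 256 threads execute as 8 warps of 32 under SIMT.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one array element.
// Threads within a block execute in warps of 32 (SIMT).
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // The programmer chooses the grid and block dimensions; the GPU's
    // controller then allocates each block to an SM, where the warp
    // scheduler issues its threads warp by warp to the CUDA cores.
    dim3 block(256);                        // 256 threads = 8 warps per block
    dim3 grid((n + block.x - 1) / block.x); // enough blocks to cover n elements
    scale<<<grid, block>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```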