Abstract

The lattice Boltzmann method (LBM) has become an attractive and promising approach in computational fluid dynamics (CFD). In this paper, a parallel algorithm of the D3Q19 multiple-relaxation-time LBM with large eddy simulation (LES) is presented to simulate 3D flow past a sphere on multiple GPUs (graphics processing units). To deal with complex boundaries, a judgement method for boundary lattices is devised. The 3D domain decomposition method is applied to improve scalability on a cluster, and an overlapping mode is introduced to hide the communication time by dividing each subdomain into an inner part and an outer part. Numerical results show good agreement with the literature, and 12 Kepler K20M GPUs perform about 5100 million lattice updates per second, which indicates considerable scalability.

1. Introduction

Driven by the market demand for real-time, high-definition 3D graphics over large graphics data sets, the graphics processing unit (GPU) was originally developed for rendering tasks. Due to its high computational power and the emergence of the compute unified device architecture (CUDA) [1], the GPU has drawn much attention as an accelerator for nongraphics computing [2–5].

In the recent two decades, the lattice Boltzmann method (LBM), derived from kinetic theory, has become an increasingly popular numerical method in CFD. There are several variations of LBM, including the lattice Bhatnagar-Gross-Krook (LBGK) [6] or single-relaxation-time LBM (SRT-LBM) [7], entropic LBM (ELBM) [8], two-relaxation-time LBM (TRT-LBM) [9], and multiple-relaxation-time LBM (MRT-LBM) [10, 11]. Among these methods, MRT-LBM improves stability and gives accurate results for higher Reynolds number flow simulations [12]. Moreover, LBM is especially suitable for parallel computing because of its inherent parallelism.

Large eddy simulation (LES) is a mathematical model for turbulence applied in CFD [13] and has become an effective way to simulate physical flows. In order to capture the turbulent flow physics, Hou et al. combined SRT-LBM with the Smagorinsky model [14], and Krafczyk et al. incorporated the Smagorinsky model into MRT-LBM [15]. The improved MRT-LBM with LES still retains the inherent parallelism. In this paper, a parallel algorithm of MRT-LBM-LES on multi-GPUs is studied.

Due to its algorithmic simplicity and suitability for parallel computing, numerous implementations of LBM on homogeneous and heterogeneous clusters have been reported. Before the advent of the GPU, much research had been carried out on LBM to obtain high performance and scalability. Kandhai et al. [16] presented different kinds of domain decomposition methods and parallel strategies for D2Q9 SRT-LBM. Velivelli and Bryden [17] examined cache optimization for two-dimensional LBM in both serial and parallel implementations. Pan et al. [18] investigated five domain decomposition methods and the parallel efficiency of simulating single and multiphase flow in porous medium systems using D3Q15 LBM on various parallel computing platforms. Wu and Shao [19] simulated the lid-driven cavity flow using MRT-LBM compared with SRT-LBM in a parallel implementation. Schepke et al. [20] presented blocked parallel implementations of D2Q9 and D3Q19 SRT-LBM; they divided the computational domain into subdomains to reduce the MPI communication time. Since the GPU has many computational units and the localized communication mode of each lattice matches the data-parallel characteristic of the GPU well, Tölke [21] used shared memory to propagate parts of the particle distribution functions (PDFs), avoiding costly misaligned access to the global memory on a single GPU, and Habich et al. [22] extended this strategy from D2Q9 to D3Q19 LBM. Due to the improvement of NVIDIA hardware, it is no longer necessary to use shared memory. Obrecht et al. [23] found that the cost of a misaligned access was nearly equal to that of an ordinary global memory access; they presented two schemes, a split one and a reversed one, without using shared memory. Zhou et al. [24] simulated flows with curved boundaries; they gave a detailed implementation to deal with curved boundaries in D2Q9 SRT-LBM. For large-scale simulations a single GPU cannot afford the computational requirement, so more schemes on multi-GPUs were presented. Xian and Takayuki [25] introduced the overlapping technique between computation and communication, hiding the communication time with two streams running simultaneously based on 3D domain decomposition. Building on this work, Hong et al. [3] presented schemes to simulate 3D cavity flow based on D3Q19 MRT-LBM and evaluated the performance of various schemes. Mawson and Revell [26] examined and analysed previous optimization strategies; the implementation including misaligned accesses to the global memory was found to be more efficient. Ye et al. [27] proposed a hybrid CPU-GPU computing environment for D3Q19 ELBM on a single-node multi-GPU platform; their method not only takes advantage of the GPU but also exploits the available computing power according to model-based optimal load balancing. Tran et al. developed a high performance parallelization of LBM on a GPU by reducing the overheads associated with uncoalesced memory accesses and improving the cache locality using the tiling optimization with a data layout change [28].

The rest of the paper is organized as follows. First, a short overview of MRT-LBM with LES is given in Section 2. In Section 3, the multi-GPU architecture is introduced. Next, the MRT-LBM with LES parallel strategy is presented in Section 4. Then the numerical experiments are shown in Section 5. Finally, the major conclusions are summarized in Section 6.

2. MRT-LBM with LES

The LBM is based on the lattice Boltzmann equation (LBE), which can be described as [12]

$$f_\alpha(\mathbf{x} + \mathbf{e}_\alpha \delta t, t + \delta t) - f_\alpha(\mathbf{x}, t) = \Omega_\alpha, \quad (1)$$

where $f_\alpha(\mathbf{x}, t)$ represents the PDFs at site $\mathbf{x}$ and time $t$, $\Omega_\alpha$ denotes the collision operator, $\delta t$ is the time step, and $\mathbf{e}_\alpha$ expresses the particle velocity in the $\alpha$th direction.

In the present study, the D3Q19 lattice model (see Figure 1) and the MRT collision operator are adopted. The discrete velocities of D3Q19 can be defined by

$$\mathbf{e}_\alpha = \begin{cases} (0,0,0), & \alpha = 0,\\ (\pm 1,0,0)c,\ (0,\pm 1,0)c,\ (0,0,\pm 1)c, & \alpha = 1,\ldots,6,\\ (\pm 1,\pm 1,0)c,\ (\pm 1,0,\pm 1)c,\ (0,\pm 1,\pm 1)c, & \alpha = 7,\ldots,18. \end{cases} \quad (2)$$

Here, the lattice speed $c = \delta x / \delta t$, where $\delta x$ is the lattice size.

In the MRT model, the collision operator can be expressed as

$$\boldsymbol{\Omega} = -\mathbf{M}^{-1}\mathbf{S}\left(\mathbf{m} - \mathbf{m}^{(\mathrm{eq})}\right), \quad (3)$$

where the transformation matrix $\mathbf{M}$ linearly transforms the PDFs $\mathbf{f}$ and the equilibrium distribution functions (EDFs) $\mathbf{f}^{(\mathrm{eq})}$ to the velocity moments $\mathbf{m} = \mathbf{M}\mathbf{f}$ and $\mathbf{m}^{(\mathrm{eq})} = \mathbf{M}\mathbf{f}^{(\mathrm{eq})}$, respectively. $\mathbf{S}$ is the relaxation diagonal matrix

$$\mathbf{S} = \operatorname{diag}\left(s_0, s_1, \ldots, s_{18}\right), \quad (4)$$

where the viscosity-related relaxation frequency $s_\nu$ and the kinematic viscosity $\nu_0$ satisfy

$$\nu_0 = c_s^2\left(\frac{1}{s_\nu} - \frac{1}{2}\right)\delta t, \quad (5)$$

and the remaining relaxation frequencies can be referenced in [29]. The EDF can be obtained by

$$f_\alpha^{(\mathrm{eq})} = w_\alpha \rho\left[1 + \frac{\mathbf{e}_\alpha \cdot \mathbf{u}}{c_s^2} + \frac{(\mathbf{e}_\alpha \cdot \mathbf{u})^2}{2c_s^4} - \frac{\mathbf{u} \cdot \mathbf{u}}{2c_s^2}\right], \quad (6)$$

where $c_s$ is the speed of sound, $c_s = c/\sqrt{3}$, and the coefficients $w_\alpha$ are the weighting factors listed as follows:

$$w_\alpha = \begin{cases} 1/3, & \alpha = 0,\\ 1/18, & \alpha = 1,\ldots,6,\\ 1/36, & \alpha = 7,\ldots,18. \end{cases} \quad (7)$$
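For reference, the discrete velocity set (2) and the weighting factors (7) can be kept in device constant memory. The following is a minimal sketch; the ordering of the 19 directions is one common convention and is an assumption, so it need not match the numbering used in Figure 1.

// D3Q19 discrete velocities (one common ordering; Figure 1 may use a different one).
__constant__ int ex[19] = { 0, 1,-1, 0, 0, 0, 0, 1,-1, 1,-1, 1,-1, 1,-1, 0, 0, 0, 0};
__constant__ int ey[19] = { 0, 0, 0, 1,-1, 0, 0, 1,-1,-1, 1, 0, 0, 0, 0, 1,-1, 1,-1};
__constant__ int ez[19] = { 0, 0, 0, 0, 0, 1,-1, 0, 0, 0, 0, 1,-1,-1, 1, 1,-1,-1, 1};
// Weighting factors: 1/3 for the rest particle, 1/18 for axis directions, 1/36 for diagonals.
__constant__ float w[19] = {1.f/3.f,
    1.f/18.f, 1.f/18.f, 1.f/18.f, 1.f/18.f, 1.f/18.f, 1.f/18.f,
    1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f,
    1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f};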

The macroscopic density $\rho$ and velocity $\mathbf{u}$ based on the PDFs can be written as

$$\rho = \sum_{\alpha} f_\alpha, \qquad \rho\mathbf{u} = \sum_{\alpha} \mathbf{e}_\alpha f_\alpha. \quad (8)$$

Based on (1), the LBM can be expressed by two steps: collision (see (9)) and propagation (see (10)):

$$\mathbf{f}^{+}(\mathbf{x}, t) = \mathbf{f}(\mathbf{x}, t) - \mathbf{M}^{-1}\mathbf{S}\left[\mathbf{m}(\mathbf{x}, t) - \mathbf{m}^{(\mathrm{eq})}(\mathbf{x}, t)\right], \quad (9)$$

$$f_\alpha(\mathbf{x} + \mathbf{e}_\alpha \delta t, t + \delta t) = f_\alpha^{+}(\mathbf{x}, t), \quad (10)$$

where $f_\alpha^{+}$ expresses the postcollision state of the PDF.

With the purpose of simulating turbulent flow at high Reynolds numbers, Krafczyk et al. incorporated MRT-LBM with LES [15]. The total viscosity $\nu_{\mathrm{total}}$ can be expressed as the sum of the molecular viscosity $\nu_0$ and the turbulent eddy viscosity $\nu_t$:

$$\nu_{\mathrm{total}} = \nu_0 + \nu_t, \qquad \nu_t = (C_s \Delta)^2 \left|\bar{S}\right|,$$

where $C_s$ is the Smagorinsky constant and is set to 0.16 [15], and $\Delta$ is the filter width; $\bar{S}_{ij} = \frac{1}{2}(\partial_i \bar{u}_j + \partial_j \bar{u}_i)$ in Cartesian coordinates is the strain rate tensor, and its magnitude $|\bar{S}|$ can be computed directly from the nonequilibrium moments $\mathbf{m}^{(\mathrm{neq})} = \mathbf{m} - \mathbf{m}^{(\mathrm{eq})}$; the explicit expression is given in [15].

Then, the corresponding relaxation frequencies can be modified in terms of (5), with $\nu_0$ replaced by $\nu_{\mathrm{total}}$.
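As a concrete illustration, a device-side sketch of this modification is given below. It assumes the strain-rate magnitude has already been evaluated from the nonequilibrium moments as in [15], and all parameter names are illustrative rather than taken from the original code; the function returns the relaxation time $1/s_\nu$ consistent with (5).

// Sketch: Smagorinsky subgrid model applied to the viscosity-related relaxation time.
__device__ float relaxationTimeLES(float nu0,    // molecular (lattice) viscosity
                                   float Smag,   // |S|, filtered strain-rate magnitude
                                   float Cs,     // Smagorinsky constant, 0.16 in [15]
                                   float delta,  // filter width (lattice spacing)
                                   float cs2,    // squared speed of sound, 1/3 in lattice units
                                   float dt)     // time step
{
    float nut   = Cs * Cs * delta * delta * Smag;  // turbulent eddy viscosity
    float nutot = nu0 + nut;                       // total viscosity
    return nutot / (cs2 * dt) + 0.5f;              // 1/s_nu such that nu_total = cs2*(1/s_nu - 1/2)*dt
}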

In the present work, a nonequilibrium extrapolation method is used for the nonslip static straight wall [30], and a unified boundary treatment for curved boundaries is applied to model the nonslip boundary condition on curved walls [31]. In Figure 2, the solid circles are solid nodes, the hollow circles are fluid nodes, and a curved wall separates the solid nodes from the fluid nodes. The lattice node on the fluid side of the boundary is denoted as $\mathbf{x}_f$ and that on the solid side as $\mathbf{x}_b$. $\mathbf{x}_{ff}$ is the adjacent fluid node of $\mathbf{x}_f$. The filled small rhombi on the boundary wall denote the intersections $\mathbf{x}_w$ of the wall with the various lattice links. The boundary velocity at $\mathbf{x}_w$ is denoted as $\mathbf{u}_w$. The curved boundary scheme of [31] then reconstructs the unknown PDF $f_{\bar{\alpha}}(\mathbf{x}_f, t + \delta t)$, where $\bar{\alpha}$ is the opposite direction of $\alpha$ and the wall fraction $q$ is written by

$$q = \frac{\left|\mathbf{x}_f - \mathbf{x}_w\right|}{\left|\mathbf{x}_f - \mathbf{x}_b\right|}, \qquad 0 \le q \le 1.$$

When dealing with a curved boundary, it is necessary to judge the lattice type (solid, fluid, or boundary lattice) and the value of $q$. Algorithm 1 shows the judgement of the lattice type and how to obtain $q$. The ray method is employed to generate the lattice type: a ray is cast from each lattice node, and when the number of intersections with the surface is odd, the lattice lies inside the geometry and is marked as solid.

Generate the solid lattices from the grid file using the ray method.
Obtain the curved boundary lattices according to the solid lattices and (2).
Compute the distances $|\mathbf{x}_f - \mathbf{x}_w|$ and $|\mathbf{x}_f - \mathbf{x}_b|$; get the value of $q$.
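A minimal host-side sketch of the parity test in Algorithm 1 is given below, assuming the geometry from the grid file is available as a list of triangles; the types and helper names are illustrative, degenerate hits on triangle edges are ignored, and the paper's actual grid-file handling is not reproduced. The same ray-triangle distance along a boundary link can be reused to obtain $q$.

#include <array>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
static Vec3  sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3  cross(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static float dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Moller-Trumbore test: does the ray origin + t*dir (t > 0) hit triangle (v0, v1, v2)?
static bool rayTriangle(Vec3 origin, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, float *t)
{
    const float eps = 1e-7f;
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    Vec3 p  = cross(dir, e2);
    float det = dot(e1, p);
    if (fabsf(det) < eps) return false;            // ray parallel to the triangle plane
    float inv = 1.0f / det;
    Vec3 s = sub(origin, v0);
    float u = dot(s, p) * inv;
    if (u < 0.f || u > 1.f) return false;
    Vec3 q = cross(s, e1);
    float v = dot(dir, q) * inv;
    if (v < 0.f || u + v > 1.f) return false;
    *t = dot(e2, q) * inv;
    return *t > eps;
}

// A lattice node is solid when a ray cast in +x crosses the surface an odd number of times.
bool isSolid(Vec3 node, const std::vector<std::array<Vec3, 3>> &tris)
{
    int crossings = 0;
    float t;
    for (const auto &tri : tris)
        if (rayTriangle(node, {1.f, 0.f, 0.f}, tri[0], tri[1], tri[2], &t)) ++crossings;
    return (crossings % 2) == 1;
}
// For a boundary link from x_f towards a solid neighbour x_b, the hit distance t of that
// link direction gives |x_f - x_w|, and q follows from q = |x_f - x_w| / |x_f - x_b|.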

3. Multi-GPUs Architecture

In the earliest days, computers consisted only of central processing units (CPUs). However, many-core accelerators (GPUs and the Intel Xeon Phi) have become prevalent in the past decade. Over time, GPUs have become more and more powerful and generalized for general-purpose parallel computing tasks with excellent performance and high efficiency [1, 32]. The GPU architecture consists of a scalable array of Streaming Multiprocessors (SMX). Each SMX supports the concurrent execution of hundreds of threads, so thousands of threads can be executed concurrently on a single GPU. CUDA employs the Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. The threads in a warp perform the same instruction at the same time, while each thread in a warp has its own instruction address counter and register state and executes the current instruction on its own data. In this paper, the K20 based on the GK110 architecture is used. It contains 13 SMXs, 1.25 MB of L2 cache, 5 GB of memory, and six 64-bit memory controllers. Each SMX includes 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST). Each SMX also contains 4 warp schedulers and 8 instruction dispatchers, so 4 warps can be issued and executed concurrently. The theoretical peak single-precision performance is 3.5 TFLOPS, and the memory bandwidth reaches 208 GB/s.

Because of the size of the problems treated with the LBM, a single GPU cannot provide the required computing power and memory space, so the problems have to be solved on multiple GPUs. As shown in Figure 3, each CPU process transfers data to and from its GPU device with cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, and the CPU processes communicate with each other through MPI_Isend and MPI_Irecv.
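The sketch below illustrates this transfer pattern for one packed halo buffer: device-to-host copy, nonblocking MPI exchange, host-to-device copy. Buffer names and the neighbour rank are placeholders and error checking is omitted; with a CUDA-aware MPI library the host staging step could be skipped, but the staged form matches Figure 3.

#include <cuda_runtime.h>
#include <mpi.h>

// Sketch: exchange one packed halo buffer of 'count' floats with rank 'neighbour'.
void exchangeFace(float *d_send, float *d_recv,   // device-side packed buffers
                  float *h_send, float *h_recv,   // (pinned) host staging buffers
                  int count, int neighbour, int tag)
{
    cudaMemcpy(h_send, d_send, count * sizeof(float), cudaMemcpyDeviceToHost);

    MPI_Request req[2];
    MPI_Irecv(h_recv, count, MPI_FLOAT, neighbour, tag, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(h_send, count, MPI_FLOAT, neighbour, tag, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    cudaMemcpy(d_recv, h_recv, count * sizeof(float), cudaMemcpyHostToDevice);
}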

4. MRT-LBM with LES on Multi-GPUs

4.1. Domain Decomposition Method and Data Exchange

Owing to the local nature of the LBM, it is a good choice to adopt a domain decomposition method that partitions the flow domain into subdomains. For three-dimensional flow simulations, there are 1D, 2D, and 3D (see Figure 4) domain decomposition methods [20], in which the fluid domain is divided along one, two, or three directions. It can be seen from [20, 25] that the 3D one is the best option when running on a cluster. In the 3D decomposition, each MPI process deals with one part of the computational domain, and the process can be denoted by a triple $(p_x, p_y, p_z)$ according to its MPI rank $r$. The triple can be obtained by

$$p_z = \left\lfloor \frac{r}{P_x P_y} \right\rfloor, \qquad p_y = \left\lfloor \frac{r \bmod (P_x P_y)}{P_x} \right\rfloor, \qquad p_x = r \bmod P_x,$$

where $P_x$ and $P_y$ denote the number of subdomains along the $x$ and $y$ directions, respectively. Then, the computational range along the $x$ direction of process $(p_x, p_y, p_z)$ can be obtained by

$$x_{\mathrm{start}} = p_x \frac{N_x}{P_x}, \qquad x_{\mathrm{end}} = (p_x + 1)\frac{N_x}{P_x} - 1,$$

where $N_x$ is the lattice node count along the $x$ direction, $x_{\mathrm{start}}$ is the start lattice node along the $x$ direction, and $x_{\mathrm{end}}$ is the end one. The ranges along the $y$ and $z$ directions can be calculated similarly.
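A host-side sketch of this mapping is given below. It assumes the rank ordering runs fastest along $x$ and that $P_x$ divides $N_x$ evenly; both are illustrative assumptions, not details taken from the paper.

// Sketch: map an MPI rank to its subdomain triple (px, py, pz) and its x-range.
void decompose(int rank, int Px, int Py, int Nx,
               int *px, int *py, int *pz, int *xStart, int *xEnd)
{
    *pz = rank / (Px * Py);
    *py = (rank % (Px * Py)) / Px;
    *px = rank % Px;

    int chunk = Nx / Px;              // lattice nodes per subdomain along x
    *xStart = (*px) * chunk;          // first lattice node of this subdomain
    *xEnd   = *xStart + chunk - 1;    // last lattice node (y and z are analogous)
}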

Based on (9) and (10), the only communication occurs in the propagation step. After collision, each MPI process needs to exchange the interface data with its neighboring MPI processes. In 3D, each MPI process has to send the data of 6 surfaces and 12 edges and receive the data of 6 surfaces (see Figure 5) and 12 edges (see Figure 6). When transferring a surface, there is no need to transfer all 19 PDFs of its lattices; only the PDFs expected by the propagation operation along the transfer direction are required. For example, only the five PDFs whose discrete velocities have a positive $x$ component are expected when transferring the surface from MPI process $(p_x, p_y, p_z)$ to $(p_x + 1, p_y, p_z)$.
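Which five distribution indices these are depends on the direction numbering of Figure 1, so the sketch below simply derives them from the velocity table of Section 2 instead of hard-coding them; the function name is illustrative.

// Sketch: collect the distribution indices that must be sent to the +x neighbour,
// i.e. the five D3Q19 directions whose discrete velocity has a positive x-component.
// ex[] is the velocity table from the sketch in Section 2.
int collectPlusX(const int ex[19], int out[5])
{
    int n = 0;
    for (int i = 0; i < 19; ++i)
        if (ex[i] > 0) out[n++] = i;
    return n;                         // returns 5 for D3Q19
}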

4.2. Data Structure

The grouping of threads into warps is relevant not only to computation but also to global memory access [32]. The GPU coalesces the global memory loads and stores issued by the threads of a warp to minimize DRAM traffic. In the GPU implementation of LBM, it is a simple and efficient way to assign one lattice node to one thread. Each thread needs to store the PDFs, the macroscopic density and velocity, and the lattice type. Figures 7 and 8 illustrate the memory layouts of Array of Structures (AoS) and Structure of Arrays (SoA), respectively. Storing the PDFs in AoS format on the GPU and executing an operation that only requires one of them would result in a loss of bandwidth, as the other PDFs are implicitly loaded in each 32-byte segment or 128-byte cache line. In the meantime, the AoS format also wastes L2 cache space on unneeded PDF values. Storing the data in SoA format makes full use of the GPU memory bandwidth because it provides coalesced memory access. In order to obtain coalesced global memory access and efficient global memory utilization, it is better to choose the SoA layout rather than the AoS layout.
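The two layouts of Figures 7 and 8 can be written as follows; the index helper shows why SoA gives coalesced access when consecutive threads handle consecutive nodes along $x$. The type and function names are illustrative.

// Array of Structures: the 19 PDFs of one node are adjacent in memory, so a warp
// that only needs direction k still pulls the other 18 values into each cache line.
struct NodeAoS { float f[19]; };

// Structure of Arrays: all nodes' values of one direction are contiguous, so 32
// consecutive threads reading direction k touch one contiguous memory segment.
struct LatticeSoA {
    float *f[19];                       // f[k][node] for k = 0..18
};

// Linear node index for an Nx x Ny x Nz block; neighbouring threads share y and z.
__host__ __device__ inline int nodeIndex(int x, int y, int z, int Nx, int Ny)
{
    return x + Nx * (y + Ny * z);
}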

4.3. Framework of GPU Implement of MRT-LBM-LES

There are two propagation schemes: out-of-place propagation (see Figure 9), in which the collision step is carried out before the propagation step, and in-place propagation (see Figure 10), in which the collision step is executed after the propagation step [23]. Hong et al. [3] compared the two schemes, and out-of-place propagation, which we adopt, is more efficient than in-place propagation on multi-GPUs. Based on [3], the reverse scheme can improve the performance greatly. The out-of-place scheme is illustrated as Algorithm 2 and sketched as a fused kernel after the algorithm.

Execute the propagation operation while reading data from global memory.
Deal with the boundary conditions, including the curved boundary and the nonslip static straight wall.
Calculate the macroscopic quantities (density $\rho$ and velocity $\mathbf{u}$) according to (8).
Do the collision operation.
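The sketch below shows how these four steps can be fused into a single kernel with one thread per node, reading from one set of SoA arrays and writing post-collision values to another. The boundary treatment, macroscopic computation, and MRT-LES collision are abbreviated to placeholder device functions that are assumed to be defined elsewhere, halo layers are assumed so that the pulled neighbours stay in range, and the velocity tables ex, ey, ez are those from the Section 2 sketch; none of these names come from the original code.

#define SOLID 1   // illustrative flag value for a solid node

// Placeholder device functions (assumed to be implemented elsewhere in a real code).
__device__ void applyBoundary(float f[19], unsigned char nodeType);
__device__ void macroscopic(const float f[19], float *rho, float *ux, float *uy, float *uz);
__device__ void collideMRTLES(float f[19], float rho, float ux, float uy, float uz);

// Sketch of the fused out-of-place step (Algorithm 2), one thread per lattice node.
__global__ void stepKernel(float *const *fIn, float *const *fOut,
                           const unsigned char *type,   // solid/fluid/boundary flag per node
                           int Nx, int Ny, int Nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;
    if (x >= Nx || y >= Ny || z >= Nz) return;

    int idx = x + Nx * (y + Ny * z);
    if (type[idx] == SOLID) return;

    float f[19];
    for (int k = 0; k < 19; ++k) {                     // pull-style propagation while reading
        int xs = x - ex[k], ys = y - ey[k], zs = z - ez[k];
        f[k] = fIn[k][xs + Nx * (ys + Ny * zs)];
    }

    applyBoundary(f, type[idx]);                       // curved / straight wall treatment

    float rho, ux, uy, uz;
    macroscopic(f, &rho, &ux, &uy, &uz);               // density and velocity, Eq. (8)

    collideMRTLES(f, rho, ux, uy, uz);                 // MRT collision with Smagorinsky model

    for (int k = 0; k < 19; ++k)                       // write the post-collision values back
        fOut[k][idx] = f[k];
}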

Based on the out-of-place scheme, the framework of the GPU implementation of MRT-LBM with LES can be described as Algorithm 3 and Figure 11. Compared with [3], it is not necessary to execute the collision step on the lattices stored in the buffers when the MPI communication occurs after the collision step. In order to improve parallel efficiency, the overlapping mode is applied to hide the communication time. The subdomain is divided into two parts: an inner part and an outer part [25]. The inner part does not need to be exchanged with other MPI processes, but the PDFs of the outer part of the subdomain are needed by the neighboring MPI processes after propagation. In our implementation, two streams are designed: one for the inner part and the other for the boundary (outer) part (see Figure 12); a schematic fragment is given after Algorithm 3. While the computation of the inner part is executed, the MPI communication of the boundary part is processed simultaneously.

Read grid file.
Domain decomposition.
Memory allocation on host and device.
Initialization.
Copy data from host to device.
Judgement of the lattice type.
Iterative computation until satisfying convergence condition on GPUs.
Read data from global memory and propagation.
Deal with boundary condition.
Calculate macroscopic quantities.
Collision and write data back.
Data exchange of outer subdomain using MPI.
Copy data from device to host.
Gather and write data back to host memory.
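The following fragment sketches the two-stream overlap referred to above. It reuses the illustrative stepKernel and exchangeFace sketches from earlier, omits the declarations of the grid, block, and buffer variables, and shows control flow only; a real code would additionally restrict each kernel launch to its own inner or outer sub-range.

// Sketch of the overlapping mode (control flow only; variables are illustrative).
cudaStream_t s_inner, s_outer;
cudaStreamCreate(&s_inner);
cudaStreamCreate(&s_outer);

// Outer part first, so its faces and edges become available for communication early.
stepKernel<<<gridOuter, block, 0, s_outer>>>(fIn, fOut, type, Nx, Ny, Nz);

// Inner part runs concurrently and hides the communication below.
stepKernel<<<gridInner, block, 0, s_inner>>>(fIn, fOut, type, Nx, Ny, Nz);

// Exchange the 6 faces and 12 edges once the outer part has finished.
cudaStreamSynchronize(s_outer);
for (int n = 0; n < numNeighbours; ++n)
    exchangeFace(d_send[n], d_recv[n], h_send[n], h_recv[n],
                 counts[n], neighbours[n], /*tag=*/n);

// Wait for the inner part; the communication time has been overlapped with it.
cudaStreamSynchronize(s_inner);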

5. Numerical Results and Discussion

The numerical experiments are performed on a three-node GPU cluster; each node is equipped with four NVIDIA Kepler K20M GPU devices and two Intel Xeon E5-2680 CPUs, and the nodes are connected by a 56 Gb/s FDR InfiniBand network.

5.1. Validation of Flow Past a Sphere

In this section, the simulation of 3D incompressible flow past a sphere with a constant inflow velocity is used as a validation test. Figure 13 shows the flow geometry, the coordinate system, and the computational domain, together with the sphere radius and the length, width, and height of the domain. The lattice scale is 512 × 128 × 128.

Figure 14 shows the sphere surface grid used for the test. Figures 15 and 16 show the lattices around the sphere on two coordinate-plane slices through the sphere, respectively.

The drag force $F_d$ on the sphere is computed with the momentum-exchange method [33], and the drag force coefficient [34] can be expressed by

$$C_d = \frac{F_d}{\frac{1}{2}\rho_0 u_0^2 \pi r^2},$$

where $\rho_0$ is the fluid density, $u_0$ is the inflow velocity, and $r$ is the sphere radius.
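A schematic of the momentum-exchange summation is sketched below. It follows the usual form in which every cut link $\alpha$ at a fluid boundary node contributes $\mathbf{e}_\alpha\,[f_\alpha^{+} + f_{\bar{\alpha}}]$ to the force (in lattice units); the data layout and names are illustrative, not the paper's implementation.

// Sketch of the momentum-exchange force evaluation [33] (illustrative data layout).
struct BoundaryLink { int node; int k; };          // fluid node index and cut direction k

void momentumExchange(const BoundaryLink *links, int nLinks,
                      float *const *fPost,         // post-collision PDFs f^+ at time t
                      float *const *fNew,          // PDFs after propagation, time t + dt
                      const int ex[19], const int ey[19], const int ez[19],
                      const int opp[19],           // opposite direction of each k
                      float F[3])                  // resulting force on the body
{
    F[0] = F[1] = F[2] = 0.f;
    for (int l = 0; l < nLinks; ++l) {
        int i = links[l].node, k = links[l].k;
        float m = fPost[k][i] + fNew[opp[k]][i];   // momentum exchanged across the link
        F[0] += ex[k] * m;
        F[1] += ey[k] * m;
        F[2] += ez[k] * m;
    }
}

The drag coefficient then follows from the streamwise component of the force via the expression above.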

Figure 17 shows the computed drag coefficient compared with the theoretical value. Figures 18 and 19 show the vortical structures at Re = 1000 and Re = 3700, respectively.

5.2. Parallel Performance

The parallel performance is examined in terms of MLUPS (million lattice updates per second), which can be obtained by

$$\mathrm{MLUPS} = \frac{N_x N_y N_z \times N_{\mathrm{iter}}}{T \times 10^6},$$

where $N_x N_y N_z$ is the total number of lattice nodes, $N_{\mathrm{iter}}$ is the number of iterations, and $T$ is the runtime in seconds.

Figures 20 and 21 show the strong scaling and the weak scaling, respectively. According to Figure 20, the ideal MLUPS is proportional to the number of GPUs, and the measured MLUPS of our algorithm is nearly equal to the ideal value. The weak scaling test in Figure 21 shows that the MLUPS per GPU remains almost the same when the same grid size is assigned to each GPU, regardless of the number of GPUs. Hence the multi-GPU computation of MRT-LBM-LES has considerable strong and weak scalability for the fixed grid size used here. A decomposition denoted $P_x \times P_y \times P_z$ indicates that the $x$ direction of the fluid domain is divided into $P_x$ parts, the $y$ direction is split into $P_y$ parts, and the $z$ direction is decomposed into $P_z$ parts. Comparing the overlapping and nonoverlapping modes, Figure 22 shows that the overlapping mode obtains a 66% improvement over the nonoverlapping mode and improves the scalability when the present algorithm is executed on 12 GPUs.

6. Conclusion

In this paper, a scalable parallel algorithm of D3Q19 MRT-LBM with LES on a platform of 12 NVIDIA Kepler K20M GPUs is presented to simulate three-dimensional flow past a sphere. The numerical results of flow past a sphere show good agreement with the literature. In order to handle complex geometries, a method to judge the lattice type is also described. The 3D domain decomposition method and the overlapping mode are employed to improve the scalability on a heterogeneous cluster, where the overlapping mode yields a 66% improvement. Numerical results show that the present algorithm can reach 5100 MLUPS on 12 Kepler K20M GPUs, which indicates that it is efficient and scalable.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank all members of the high performance computing group at Shanghai University for their good advice. This research is supported by the National Natural Science Foundation of China (no. 91630206 and no. 11601315) and Shanghai Sailing Program (no. 16YF1404000).