#### Abstract

Direct numerical simulation (DNS) and large eddy simulation (LES) of wall-bounded flow at were performed using the lattice Boltzmann method (LBM) and multiple GPUs (Graphics Processing Units). In the DNS, 8 K20M GPUs were adopted. The maximum number of meshes is , which results in a nondimensional mesh size of for the whole solution domain. It took 24 hours for the GPU-LBM solver to simulate LBM steps. The aspect ratio of the solution domain was varied to obtain accurate DNS results. As a result, both the mean velocity and turbulent variables, such as the Reynolds stress and velocity fluctuations, agree perfectly with the results of Kim et al. (1987) when the aspect ratios in the streamwise and spanwise directions are 8 and 2, respectively. As for the LES, the local grid refinement technique was tested and then used. Using grids and a Smagorinsky constant , good results were obtained. The ability and validity of the LBM for simulating turbulent flow were thus verified.

#### 1. Introduction

The lattice Boltzmann method has been regarded as a promising alternative for the simulation of fluid flows with complex physics over the past two decades. Unlike conventional numerical schemes based on discretizations of macroscopic continuum equations, the LBM is based on microscopic models and mesoscopic kinetic equations. It has many advantages, including easy implementation of boundary conditions, easy programming, and fully parallel algorithms [1]. As a result, the LBM has been applied in many fields, such as biofluids and porous media [2]. However, its feasibility and validity in both direct numerical simulation and large eddy simulation of turbulent flow still require further validation [3, 4].

Fully developed turbulent wall-bounded flow is a classic problem: seemingly simple, yet rich in complexity. DNS and LES of such flows, including the boundary layer flow, have become important tools for investigating the mechanisms of turbulence. The simple geometry has long attracted experimental and theoretical investigations of the complex turbulence interactions in the vicinity of walls. Among them, the direct numerical simulation by Kim et al. [5, 6] is the most cited for comparison. Their DNS results at ( and being the friction velocity and channel half-width, resp.) have been regarded as the benchmark for numerical simulations. Because of the large computational cost of DNS and LES, published results on wall-bounded flow remain rare despite the increasing power of high performance computing (HPC) over the recent decade.

Nowadays, the advent of the GPU makes large-scale DNS and LES for turbulence prediction feasible. The GPU on a computer's graphics board was originally designed to process large graphics data sets quickly. Recently, with the emergence and development of computing platforms such as CUDA [7] and OpenCL, the use of GPUs to accelerate nongraphic computations has drawn more and more attention [8, 9]. Owing to their high floating-point performance, wide memory bandwidth, and improved programmability, GPUs have been further used for general purposes, that is, GPGPU. Grid-based computational fluid dynamics (CFD) is one of the important applications. According to our experience, solving the incompressible Navier-Stokes equations with the marker and cell (MAC) solver on a single GPU is times faster than heavily optimized CPU-based implementations, and GPU-based LBE calculations can achieve speedups of more than 100 [10–13]. Aoki's group [14] has been devoted to GPU high-performance computing since 2006. The team has performed various numerical simulations using hundreds or even thousands of GPUs, such as weather prediction, crystal growth processes, and air flow in cities, and a good performance of PFLOPS was achieved [15–18]. At present, GPU applications for engineering purposes in fields such as energy, environment, and aerospace have become hotspots.

In the present work, DNS and LES of wall-bounded flow at were performed using the D3Q19 model of the lattice Boltzmann method and multiple GPUs. There are two objectives. One is to demonstrate high performance computing with the LBM and multiple GPUs, which may provide a practical route to the prediction of turbulent flow. The other is to verify the validity and ability of the LBM in simulating turbulent flow.

This paper is organized as follows. In Section 2, lattice Boltzmann equations (LBE) for DNS and LES with forcing term are described. Section 3 introduces the model system of the simulation and data transfer in CUDA-MPI. The results of DNS and LES on fully developed turbulent wall-bounded flow are discussed in Sections 4 and 5, respectively. The summary and conclusions are presented in Section 6.

#### 2. Lattice Boltzmann Equation (LBE) Formulation for DNS and LES of Turbulence

##### 2.1. LBE for DNS

In LBM, firstly, the following Boltzmann equation for the discrete velocity distribution is solved on a discrete lattice [19]:
where is the particle velocity distribution function, is the particle velocity in the th direction, and is the number of velocities. In the LBM, the different methods are classified by lattice using the DnQm naming scheme: here "" stands for " dimensions," while "" stands for " speeds." For 2D models, the 9-velocity LBM is the most common. For 3D models, there are several cubic lattice models, such as D3Q13, D3Q15, D3Q19, and D3Q27 (with 13, 15, 19, or 27 velocities, respectively). is the collision operator. Using the Boltzmann-BGK approximation [20],
we get
where is the local equilibrium distribution and is the relaxation time. To obtain the Navier-Stokes equations, the equilibrium distribution functional form must be carefully chosen. In the 9-speed square lattice and the 13-, 15-, 19-, and 27-speed cubic lattices, a suitable equilibrium distribution function has been proposed as [2]
In the present study, the D3Q19 model was adopted with (0, 0, 0), (1, 0, 0), (−1, 0, 0), (0, 1, 0), (0, −1, 0), (0, 0, 1), (0, 0, −1), (1, 1, 0), (−1, 1, 0), (1, −1, 0), (−1, −1, 0), (1, 0, 1), (−1, 0, 1), (1, 0, −1), (−1, 0, −1), (0, 1, 1), (0, −1, 1), (0, 1, −1), and (0, −1, −1). Figure 1 shows the D3Q19 model used here. The corresponding weighting factors are , , and . and **u** are the macroscopic quantities and can be evaluated as [2]
The speed of sound is , and the equation of state is that of an ideal gas as follows:
Equation (3) can be further discretized in physical space and time . The completely discretized form of (3) is
Equation (7) is the well-known LBGK model, where is the nondimensional relaxation time. The viscosity in the macroscopic Navier-Stokes equation can be derived from (7) as
Equation (7) is usually solved in its standard form by assuming that , via the following two steps. Collision step:
Streaming step:
where and denote the pre- and postcollision states of the distribution function, respectively.
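The collision-streaming split described above can be sketched in a few lines. For brevity, the sketch below uses the 2D D2Q9 lattice rather than the paper's D3Q19, with the standard D2Q9 weights and velocity set; it is an illustrative analogue of the scheme, not the authors' solver.

```python
import numpy as np

# Standard D2Q9 weights and discrete velocities (illustrative 2D analogue
# of the paper's D3Q19 model).
W = np.array([4/9] + [1/9]*4 + [1/36]*4)
E = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])

def equilibrium(rho, u):
    """Second-order equilibrium distribution; c_s^2 = 1/3 in lattice units."""
    eu = np.einsum('qd,xyd->qxy', E, u)        # e_i . u per direction
    usq = np.einsum('xyd,xyd->xy', u, u)       # u . u
    return W[:, None, None] * rho * (1 + 3*eu + 4.5*eu**2 - 1.5*usq)

def step(f, tau):
    """One LBGK step: BGK collision followed by periodic streaming."""
    rho = f.sum(axis=0)
    u = np.einsum('qd,qxy->xyd', E, f) / rho[..., None]
    f_post = f - (f - equilibrium(rho, u)) / tau   # collision step
    for q, (ex, ey) in enumerate(E):               # streaming step
        f_post[q] = np.roll(np.roll(f_post[q], ex, axis=0), ey, axis=1)
    return f_post
```

A fluid initialized at rest equilibrium is left unchanged by this step, which is a quick sanity check on any LBGK implementation.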

##### 2.2. LBE for LES

The filtered form of the LBE for LES is defined as [21–23] where and represent the distribution function and the equilibrium distribution function of the resolved scales, respectively. Moreover, and are the relaxation times corresponding to the molecular viscosity and the eddy viscosity , respectively. Accordingly, is given by [21–23] The eddy viscosity arising from the standard Smagorinsky model can be calculated from the filter length scale and the filtered strain rate tensor as [23] Here, is the Smagorinsky constant. In addition, a finite-difference approximation was chosen to compute the strain rate tensor in the present work.
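As a sketch, the Smagorinsky closure and the resulting effective relaxation time can be computed as below, assuming lattice units with unit space and time steps so that nu = (tau - 1/2)/3; the constant values used are illustrative, not the paper's.

```python
import numpy as np

def eddy_viscosity(S, Cs=0.1, delta=1.0):
    """Standard Smagorinsky model: nu_t = (Cs*delta)^2 * |S|,
    with the strain-rate magnitude |S| = sqrt(2 * S_ij S_ij)."""
    S_mag = np.sqrt(2.0 * np.sum(S * S))
    return (Cs * delta)**2 * S_mag

def effective_tau(nu0, nu_t):
    """Effective relaxation time tau* = tau_0 + tau_t, using the lattice-unit
    relation nu = (tau - 1/2)/3, i.e. tau* = 3*(nu0 + nu_t) + 1/2."""
    return 3.0 * (nu0 + nu_t) + 0.5
```

In a solver, `S` would come from the finite-difference approximation of the filtered strain rate tensor mentioned above, evaluated at each node.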

##### 2.3. External Force

For the external forcing term, many implementation methods exist. In the present wall-bounded flow simulation, the modified velocity method [24] was applied; that is, a modified velocity is used to compute the equilibrium distribution function. It can be described as follows: where is the external force. As the wall-bounded flow simulated here is driven by a constant body force (the uniform pressure drop), we define as Here, and represent the friction velocity and the channel half-width, respectively.

The equilibrium distribution function is then calculated with instead of , so (4) becomes
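One common realization of the modified-velocity idea is the Shan-Chen-type velocity shift sketched below; whether reference [24] uses exactly this shift is an assumption on our part. The channel body force follows the usual balance against the mean pressure gradient, and the numbers are illustrative.

```python
import numpy as np

def body_force(rho, u_tau, delta):
    """Constant streamwise driving force equivalent to the mean pressure
    drop of a channel flow: F_x = rho * u_tau^2 / delta."""
    return np.array([rho * u_tau**2 / delta, 0.0, 0.0])

def shifted_velocity(u, rho, F, tau):
    """Velocity used inside the equilibrium distribution in place of the
    bare u (Shan-Chen-type shift; assumed variant of the method in [24])."""
    return u + tau * F / rho
```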

##### 2.4. Boundary Conditions

As for the boundary conditions, a periodic boundary condition was used in the streamwise and spanwise directions in the present work. For the walls, the following boundary condition was applied: where was computed from the macroscopic values and on the boundaries according to (4). cannot be computed directly and was assumed equal to the value at its neighboring inner node. and were computed by (9) and (4), respectively, with the corresponding and .

#### 3. Model System and Parameters

Figure 2 shows the model system. The calculation domain lengths are , , and , corresponding to the streamwise , wall-normal , and spanwise directions, respectively. is the half-width in the direction normal to the walls, which is used to define the Reynolds number. The standard Reynolds number is given by , with the mean centerline velocity. In addition, it is common to define the Reynolds number in such a flow using the "wall unit," or viscous length, . The skin friction is calculated as , and the corresponding "wall unit" time is . Thereby, the friction Reynolds number is obtained as , which implies . As usual, a superscript (+) indicates quantities measured in "wall units" and . In a fully developed turbulent wall-bounded flow beyond the transitional range of Reynolds numbers, .
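The wall-unit definitions above amount to simple ratios of the friction velocity, half-width, and kinematic viscosity. A minimal sketch (with illustrative numbers, since the paper's values were not recoverable):

```python
def friction_quantities(u_tau, delta, nu):
    """Viscous length, wall-unit time, and friction Reynolds number
    from the friction velocity u_tau, half-width delta, and viscosity nu."""
    l_visc = nu / u_tau            # viscous length ("wall unit")
    t_wall = nu / u_tau**2         # wall-unit time
    re_tau = u_tau * delta / nu    # friction Reynolds number
    return l_visc, t_wall, re_tau
```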

To carry out a numerical simulation of turbulence, the nondimensional physical parameters describing the problem must be defined first. For wall-bounded flow, the spatial extent of the computational domain plays an important role in obtaining accurate results. In particular, the aspect ratios are and . Referring to the study in [26], Spasov et al. [27] performed direct numerical simulations at and . Their mean velocities were overpredicted compared with the results obtained by Kim et al. [5], and the Reynolds stress did not agree well either. In Section 4, we tested various aspect ratios for the DNS.

For DNS, the grid size cannot be larger than the local Kolmogorov length . In fully developed turbulent flow, in the region near the wall [28], and increases with distance from the wall. Owing to the uniform mesh used in the LBM, the grid scale in the whole field should satisfy . As for the LBM, , , which results in . Accordingly, with the friction Reynolds number of the present work, the half-width of the computational domain should be larger than 120.

As for the initial conditions, , and was given by the logarithmic law of friction: , in which the von Karman constant , , and was set to 0.1. The velocity field was initialized as follows: , , and , where the fluctuating terms were obtained from , , , and was based on a frequently used engineering approximation for the universal mean velocity profile near the wall. The dimensionless velocity is measured in "wall units." The particular values chosen ( for the velocity profile inflection point, for the von Karman constant, and ) are not expected to affect the eventual flow profile used for the computation of developed turbulence statistics.
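The near-wall mean-velocity approximation can be sketched as the usual piecewise law of the wall. The inflection point y+ = 11.6 and the constants kappa = 0.4 and B = 5.5 below are typical textbook values assumed here, since the paper's exact figures were not recoverable from the text.

```python
import math

def u_plus(y_plus, kappa=0.4, B=5.5, y_c=11.6):
    """Piecewise law of the wall: u+ = y+ in the viscous sublayer,
    u+ = (1/kappa) ln(y+) + B above the inflection point y_c."""
    if y_plus <= y_c:
        return y_plus
    return math.log(y_plus) / kappa + B
```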

The present parallel computation used 1D domain partitioning over 8 GPUs. Figure 3 shows the 1D partitioning in the -direction executed in this study. Data are exchanged with the top and bottom ranks; for details of the data transfer, refer to our previous work [12]. Data communication among GPUs is done via the Message Passing Interface (MPI) and cudaMemcpy. First, the data are copied from GPU to CPU with the CUDA API call cudaMemcpy(…, cudaMemcpyDeviceToHost). Then the data are exchanged between the corresponding CPUs using MPI. Finally, the exchanged data are copied from CPU to GPU with cudaMemcpy(…, cudaMemcpyHostToDevice).
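The three-stage transfer (device-to-host copy, MPI exchange, host-to-device copy) follows a standard halo-exchange pattern. The sketch below emulates that exchange for a 1D partition within a single process; in the real solver the marked assignments would be cudaMemcpy calls plus MPI sends and receives between ranks.

```python
import numpy as np

def exchange_halos(slabs):
    """Single-process emulation of a periodic 1D halo exchange.

    slabs: list of per-"rank" arrays shaped (n_local + 2, ...), with ghost
    rows at indices 0 and -1. Each ghost row receives the neighboring
    rank's adjacent interior row (cudaMemcpy + MPI in the real solver)."""
    n = len(slabs)
    for r in range(n):
        top, bot = (r + 1) % n, (r - 1) % n    # periodic neighbor ranks
        slabs[r][-1] = slabs[top][1]           # top ghost <- top rank's first interior row
        slabs[r][0] = slabs[bot][-2]           # bottom ghost <- bottom rank's last interior row
    return slabs
```

Because only ghost rows are written and only interior rows are read, the in-place loop is safe; the same property lets the real MPI exchanges overlap with interior computation.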

#### 4. DNS for Wall-Bounded Flow

The DNS was performed on turbulent wall-bounded flow with and various aspect ratios. The corresponding nondimensional mesh size is for the whole solution domain. Figure 4 shows (a) the velocity pattern at a constant streamwise location and (b) the isosurface of the second invariant of the velocity gradient tensor at aspect ratio , .

**(a)**

**(b)**

##### 4.1. Effect of Aspect Ratio

Figure 5 shows the profile of the averaged velocity in the wall-normal -direction for the aspect ratios (a) , ; (b) , ; (c) , ; and (d) , . Our DNS results are shown by solid lines, and the dotted lines represent the DNS results of Kim et al. [5]. It is obvious that the averaged velocity at , fits the result of Kim et al. perfectly. Although the periodic boundary condition is used in the spanwise direction, for the same streamwise dimension, a larger spanwise ratio makes the results match better (compare Figure 5(a) with Figure 5(b), and Figure 5(c) with Figure 5(d)). However, the streamwise dimension, that is, , has little effect on the mean velocity (compare Figure 5(a) with Figure 5(c), and Figure 5(b) with Figure 5(d)).

**(a)**

**(b)**

**(c)**

**(d)**

Figure 6 presents the turbulent statistics calculated by LBM-DNS and a comparison with the DNS data of Kim et al. [5] for (i) the Reynolds stress in the wall-normal direction and (ii) the root-mean-square (rms) of the streamwise, spanwise, and wall-normal velocity fluctuations, at (a) , and (b) , , respectively. Our LBM-DNS results and Kim's data [5] are generally in good agreement for both aspect ratios. There are slight discrepancies in the rms values at , , especially the rms of the streamwise velocity fluctuations indicated by _rms. Compared with the mean velocity and the Reynolds stresses, the DNS results for the velocity fluctuations are more sensitive to the streamwise length chosen for the simulation.

**(a)**

**(b)**

##### 4.2. Performance of Computation by GPUs

For this DNS work, it took about 24 hours to simulate LBM steps on LBM grids with 8 K20M GPUs. The performance reaches 2333 MLUPS; that is, meshes are processed per second. Compared with a similar DNS in [27] performed on 36 CPUs, whose performance was 5.6 MLUPS, the present GPU performance is about 416 times higher. With LBM on GPUs, high performance can be achieved without great programming difficulty.
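The MLUPS figure of merit is simply the number of lattice-node updates per wall-clock second, in millions. The sketch below uses hypothetical grid and timing numbers, since the paper's exact mesh count did not survive extraction.

```python
def mlups(nx, ny, nz, steps, seconds):
    """Million lattice-node updates per second for an nx*ny*nz grid
    advanced `steps` times in `seconds` of wall-clock time."""
    return nx * ny * nz * steps / seconds / 1e6
```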

#### 5. LES for Wall-Bounded Flow

Since the DNS approach resolves all relevant spatial and temporal scales, it can predict all possible fluid motions with high fidelity, but at large cost. In this part, we simulated wall-bounded flow by applying the standard Smagorinsky model in LES. Moreover, the local grid refinement technique of the LBM was tested and then used.

##### 5.1. Local Grid Refinement

Local grid refinement is usually applied in regions where large changes of the solution are expected. In the LBM, locally refined patches are superposed on the global coarse grid, saving computational time to some extent while enabling high resolution where needed. Grid refinement is performed by dividing the space step by a refinement factor . The kinematic viscosity, defined in the frame of the LBGK model, depends on the step size through . To keep the viscosity constant and the Reynolds number the same on the coarse grids () and fine grids (), the relaxation time in (8) has to be redefined as [29] Here, and represent the relaxation times on the fine and coarse grids, respectively. Taking into account that the LBGK scheme becomes unstable when is very close to 2, and that , we estimate from (20) that the upper limit of is about 50. In the present study the refinement factor is 2. Accordingly, the time step on the fine grid is decreased by , where is the time step on the coarse grid. Because of the continuity of the hydrodynamic variables and their derivatives on the interface between the two grids, the relationships between the postcollision distribution functions on the coarse and fine grid nodes, obtained from (4), (9), and (20), can be described as follows: The strategy of the numerical realization with refinement factor is shown in Figure 7. First, carry out a simulation on the coarse grids up to . Then, map the solution to the fine grid using second-order interpolation in space and time from the coarse-grid results at , according to (22). Finally, advance the dependent variables on the coarse and fine grids in time, proceeding with the obtained fine-grid boundary conditions.
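A minimal sketch of the relaxation-time rescaling, assuming the standard interface formula tau_f = n*(tau_c - 1/2) + 1/2 from the grid-refinement literature cited as [29]: the lattice viscosity (tau - 1/2)/3 on the fine grid then scales by n, which compensates the smaller space and time steps and keeps the physical viscosity matched across the interface.

```python
def refine_tau(tau_c, n=2):
    """Relaxation time on the fine grid for a coarse-grid value tau_c
    and refinement factor n, keeping the physical viscosity constant
    (assumed standard formula; see the refinement reference [29])."""
    return n * (tau_c - 0.5) + 0.5

def lattice_viscosity(tau):
    """nu = (tau - 1/2)/3, in the lattice units of the grid it lives on."""
    return (tau - 0.5) / 3.0
```

With tau_c near the stability floor and the fine-grid tau bounded below 2, this formula also reproduces the upper limit on n of about 50 estimated in the text.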

The validity of the adopted local grid refinement was verified by simulating the classical lid-driven cavity flow at , with and without local grid refinement. A coarse grid (64 × 32 nodes) was used in the middle part, and a fine grid (16 × 127 nodes) in the top and bottom quarters of the computational region, respectively. Figure 8 compares the NSE results of Ghia et al. [25] with our LBE results, with and without grid refinement, through the velocity profile on the centerline in the normal direction. The LBE results with grid refinement clearly coincide better with those of Ghia et al. [25] near the boundaries, where the grids are refined, than those of the uniform coarse-grid system.

##### 5.2. Results and Discussion

For the LES, the number of grid points was first decreased by a factor of 4 in each direction relative to the DNS, with the aspect ratio kept at , ; that is, the total number of grid points for the LES is 1/64 of that for the DNS. Figure 9 displays the profile of the averaged velocity in the wall-normal -direction using LBM-LES at various for a grid system , whose grid scale is . The results at different are the same, which means that the subgrid-scale model does not take effect at such a large grid scale. Considering grid scale and computational cost simultaneously, the grids in the adjacent to the top and bottom walls were doubled. Figure 10 shows the mean velocity distribution in the wall-normal direction at various , computed on a locally refined grid system in which the top and bottom parts adjacent to the walls were refined to () and the middle part was kept at (). As observed, the subgrid model takes effect, and the results match best when , which is less than the typical value of used in NS-LES [28].

Figure 11 shows the profile of the Reynolds stress in the wall-normal -direction using LBM-LES at for the same locally refined grid system mentioned above. The values deviate from the LBM-DNS results near the walls, which means that the grid is not fine enough to capture the very fine eddies in the boundary layer. Also, the SGS model with a constant is less effective near the walls.

#### 6. Conclusion

In the present work, DNS and LES of wall-bounded flow at were performed using the lattice Boltzmann method and multiple GPUs. In the DNS, 8 K20M GPUs were adopted. The effect of the aspect ratio was examined, and the number of grid points is for the largest aspect ratio. The results show that although periodic boundaries are used in the spanwise and streamwise directions, the dimensions in both directions affect the DNS results. For the case of , , both the mean velocity and the turbulent statistics, such as the Reynolds stress and velocity fluctuations, agree perfectly with the results of Kim et al. [5]. At the same time, a high performance of 2333 MLUPS was obtained. As for the LES, the local grid refinement technique was tested and then used. Using grids and , good results for the mean values were obtained. However, the turbulent statistics deviate from the LBM-DNS data near the wall, which suggests that more grid points should be used near the walls and a dynamic Smagorinsky constant should be utilized to obtain more accurate results. Moreover, multi-GPU computing, whose supercomputing power matches the good parallelism of the LBM perfectly, delivered remarkably high performance in this work.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The authors would like to acknowledge the financial support for this work provided by the National Natural Science Foundation of China (nos. 11302165 and 11242010). This research was also supported in part by the Japan Society for the Promotion of Science (JSPS) KAKENHI, through a Grant-in-Aid for Scientific Research (B) no. 23360046 and a Grant-in-Aid for Young Scientists (B) no. 25870226, and by the Japan Science and Technology Agency (JST) Core Research of Evolutional Science and Technology (CREST) research program on "Highly Productive and High Performance Application Frameworks for Post-Petascale Computing."