Abstract

Computational fluid dynamics (CFD) plays an important role in the optimal design of aircraft and the analysis of complex flow mechanisms in the aerospace domain. The graphics processing unit (GPU) offers strong floating-point performance and high memory bandwidth for data-parallel workloads, which brings great opportunities to CFD. A cell-centred finite volume method is applied to solve the three-dimensional compressible Navier–Stokes equations on structured meshes, with the upwind AUSM+UP scheme for spatial discretization and a four-stage Runge–Kutta method for time discretization. The compute unified device architecture (CUDA) is used as the parallel computing platform and programming model for the GPUs, which reduces the complexity of programming. The main purpose of this paper is to design a highly efficient multi-GPU parallel algorithm based on MPI+CUDA and to use it to study hypersonic flow characteristics. Solutions of hypersonic flow over an aerospace plane model are provided at different Mach numbers, and the agreement between numerical computations and experimental measurements is favourable. The acceleration performance of the parallel platform is studied with one, two, and four GPUs. For the single-GPU implementation, the speedup reaches 63 on the coarser mesh and 78 on the finest mesh, confirming that GPUs are better suited to compute-intensive tasks than traditional CPUs. For multi-GPU parallelization, the speedup with four GPUs reaches 77 on the coarser mesh and 147 on the finest mesh, far greater than the acceleration achieved with one or two GPUs. The multi-GPU parallel algorithm is therefore a promising approach for hypersonic flow computations.

1. Introduction

Generally, real flows exhibit different levels of compressibility. In comparison with incompressible flow, the density of a compressible flow varies, which increases the complexity of solving the governing equations. According to the Mach number, compressible flow can be divided into subsonic flow (Ma < 0.8), transonic flow (Ma = 0.8–1.2), supersonic flow (Ma = 1.2–5.0), and hypersonic flow (Ma > 5.0) [1]. Hypersonic flight has been a reality since the 1960s, when the North American X-15 achieved successful flight at Mach 6.7. Over the last six decades, with the development of scramjet engines, high-speed aircraft, hypersonic missiles, and hypersonic reentry vehicles, research on hypersonic flow theory has made considerable progress and attracted much attention from scientists [2, 3]. In hypersonic flow, physical phenomena that are negligible at lower speeds become important. For flow over a hypersonic body, a strong shock wave with a high post-shock temperature is formed. The flow field between the shock wave and the body is called the shock layer, and this layer can be very thin. At high temperatures, the vibrational energy of the gas may be excited and chemical reactions can occur. Accurate simulation of hypersonic flow therefore requires much finer meshes, which increases the computational work remarkably [4]. Moreover, engineers and researchers want to obtain accurate numerical results as quickly as possible. The graphics processing unit (GPU) has achieved significant success in accelerating the solution of hypersonic flow problems [5, 6].

With the rapid development of computational fluid dynamics (CFD) and computer technology, CFD now plays a vital role in the optimal design of aircraft and the analysis of complex flow mechanisms [7]. At present, CFD is used to simulate problems ranging from the molecular level to macroscopic scales, from simple two-dimensional configurations to three-dimensional complete aircraft, and from subsonic to hypersonic flow [8, 9]. The widespread application of CFD can reduce development costs and provide technical support for aircraft research. High-Reynolds-number flows, which are ubiquitous in practice and involve turbulence, vortices, separation, aeroacoustics, and other multiscale phenomena, require very fine computational meshes [10]. When real gas effects are modelled, chemical reaction models containing multiple species further increase the computational demands [11]. Such high-fidelity simulations remain out of reach for many engineering and scientific applications because of their large-scale computing requirements. To reduce the computational time and obtain accurate numerical results quickly, these extensive calculations can only be completed by parallel computing. In recent years, however, progress in central processing units (CPUs) has run into a bottleneck because of limitations in power consumption and heat dissipation, and simply adding CPU cores no longer improves the overall performance of CPU-based parallel clusters.

In comparison with Intel CPUs, GPUs offer much higher floating-point throughput and memory bandwidth for data-parallel workloads, as illustrated in Figures 1 and 2 [12]. Traditionally, GPUs were limited to graphics rendering. In 2007, NVIDIA introduced the compute unified device architecture (CUDA), a parallel computing platform and programming model for GPUs that allows developers to use C/C++ as a high-level language and reduces the complexity of programming. The emergence of CUDA marked the beginning of the widespread use of GPUs in general-purpose computing. Currently, GPUs are widely used in scientific computing, including molecular dynamics (MD), direct simulation Monte Carlo (DSMC), CFD, weather forecasting (WF), and artificial intelligence (AI) [13–17]. The development of GPU-based parallelization brings new opportunities and challenges for high-performance computing. In the field of CFD, GPU parallelization has already produced numerous achievements. Brandvik et al. [18] solved the three-dimensional Euler equations on an NVIDIA 8800GTX GPU with explicit time-stepping schemes and obtained a speedup of 16. Khajeh-Saeed et al. [19] performed direct numerical simulation (DNS) of turbulence on GPU-accelerated supercomputers, demonstrating that scientific problems can benefit significantly from advanced hardware. Wang et al. [20] carried out DNS and large eddy simulation (LES) of turbulent wall-bounded flow with the lattice Boltzmann method on multiple GPUs and analysed the acceleration performance. Emelyanov et al. [21] studied the popular CFD benchmark of flow over a smooth flat plate on a GPU with various meshes and reported speedups of more than 46. Zhang et al. [22] implemented an implicit meshless method for compressible flow on an NVIDIA GTX TITAN GPU, and the solutions agreed well with experimental results.

The main purpose of this paper is to design a highly efficient multi-GPU parallel algorithm to study the hypersonic flow characteristics of an aerospace plane model; the acceleration performance of the algorithm is analysed with a view to applying it to large-scale scientific computations.

The remainder of this paper is organized as follows. Section 2 presents the governing equations and numerical schemes. Section 3 introduces CUDA and the GPU parallel algorithm for CFD, and then multi-GPU parallel algorithm based on MPI+CUDA is established. Section 4 describes the physical model and mesh generation of an aerospace plane in sufficient detail. Section 5 shows the computed results compared with experimental measurements and analyses the performance of multi-GPU parallel platform. Section 6 provides the conclusions of this work.

2. Governing Equations and Numerical Schemes

2.1. Governing Equations

In the field of CFD, neglecting volume sources due to body forces and volumetric heating, the three-dimensional compressible Navier–Stokes (NS) equations can be written in integral form as follows [23]:

$$\frac{\partial}{\partial t}\int_{\Omega}\vec{W}\,d\Omega+\oint_{\partial\Omega}\left(\vec{F}_{c}-\vec{F}_{v}\right)dS=0,$$

where $\vec{W}$ is the vector of conservative variables, which contains five components in three dimensions; $\vec{F}_{c}$ is the vector of convective fluxes; $\vec{F}_{v}$ is the vector of viscous fluxes; $\Omega$ is the control volume; and $\partial\Omega$ is its surface. These vectors take the form

$$\vec{W}=\begin{bmatrix}\rho\\ \rho u\\ \rho v\\ \rho w\\ \rho E\end{bmatrix},\qquad
\vec{F}_{c}=\begin{bmatrix}\rho V\\ \rho uV+n_{x}p\\ \rho vV+n_{y}p\\ \rho wV+n_{z}p\\ \rho HV\end{bmatrix},\qquad
\vec{F}_{v}=\begin{bmatrix}0\\ n_{x}\tau_{xx}+n_{y}\tau_{xy}+n_{z}\tau_{xz}\\ n_{x}\tau_{yx}+n_{y}\tau_{yy}+n_{z}\tau_{yz}\\ n_{x}\tau_{zx}+n_{y}\tau_{zy}+n_{z}\tau_{zz}\\ n_{x}\Theta_{x}+n_{y}\Theta_{y}+n_{z}\Theta_{z}\end{bmatrix}.$$

In the above equations, $\rho$ is the density; $u$, $v$, and $w$ are the local Cartesian velocity components; $p$ is the static pressure; $T$ is the static temperature; $n_{x}$, $n_{y}$, and $n_{z}$ are the components of the unit normal vector of the cell face; $V=n_{x}u+n_{y}v+n_{z}w$ is the scalar product of the velocity vector and the unit normal vector; $H$ is the total enthalpy; and $E$ is the total energy, which is related to the static pressure $p$ through the equation of state

$$p=(\gamma-1)\,\rho\left[E-\tfrac{1}{2}\left(u^{2}+v^{2}+w^{2}\right)\right],\qquad p=\rho R T,$$

where $\gamma$ is the ratio of specific heat capacities, taken as 1.4 for a calorically perfect gas, and $R$ is the specific gas constant.
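As an illustration of how these per-cell quantities are handled in the GPU implementation described later, the primitive variables can be recovered from the conservative vector with the equation of state in a simple kernel. The following is a sketch assuming single precision and five contiguous variables per cell; the kernel name and array layout are illustrative assumptions, not the authors' implementation.

```cuda
// Recover (rho, u, v, w, p) from (rho, rho*u, rho*v, rho*w, rho*E) per cell.
__global__ void conservativeToPrimitive(const float* __restrict__ W,
                                        float* __restrict__ Q,
                                        float gamma, int nCells)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nCells) return;

    float rho = W[c * 5 + 0];
    float u   = W[c * 5 + 1] / rho;
    float v   = W[c * 5 + 2] / rho;
    float w   = W[c * 5 + 3] / rho;
    float E   = W[c * 5 + 4] / rho;
    float p   = (gamma - 1.0f) * rho * (E - 0.5f * (u * u + v * v + w * w));

    Q[c * 5 + 0] = rho;
    Q[c * 5 + 1] = u;
    Q[c * 5 + 2] = v;
    Q[c * 5 + 3] = w;
    Q[c * 5 + 4] = p;   // the static temperature follows from T = p / (rho * R)
}
```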

For the vector of viscous fluxes, the components of the viscous stress tensor and the heat flux terms are expressed as

$$\tau_{ij}=\mu\left(\frac{\partial u_{i}}{\partial x_{j}}+\frac{\partial u_{j}}{\partial x_{i}}\right)-\frac{2}{3}\,\mu\,\frac{\partial u_{k}}{\partial x_{k}}\,\delta_{ij},\qquad
\Theta_{x}=u\tau_{xx}+v\tau_{xy}+w\tau_{xz}+k\frac{\partial T}{\partial x},$$

with analogous expressions for $\Theta_{y}$ and $\Theta_{z}$, where $\mu$ and $k$ are the dynamic viscosity and thermal conductivity coefficients, respectively. They are obtained from Sutherland's law and the definition of the Prandtl number:

$$\mu=\mu_{0}\left(\frac{T}{T_{0}}\right)^{3/2}\frac{T_{0}+S}{T+S},\qquad k=\frac{c_{p}\,\mu}{Pr},$$

where $T_{0}$ and $\mu_{0}$ are the reference temperature (in K) and the corresponding reference viscosity (in kg/(m·s)), $S$ is the Sutherland constant (110.4 K for air), $c_{p}$ is the specific heat at constant pressure, and $Pr$ is the Prandtl number, which equals 0.72.
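As a small illustration, Sutherland's law and the Prandtl-number relation translate directly into device functions; the reference constants are passed in as parameters rather than hard-coded. This is a sketch, not the authors' code.

```cuda
// Dynamic viscosity from Sutherland's law: mu = mu0 * (T/T0)^1.5 * (T0 + S)/(T + S).
__device__ float sutherlandViscosity(float T, float mu0, float T0, float S)
{
    return mu0 * powf(T / T0, 1.5f) * (T0 + S) / (T + S);
}

// Thermal conductivity from the Prandtl number: k = cp * mu / Pr (Pr = 0.72 for air).
__device__ float thermalConductivity(float mu, float cp, float Pr)
{
    return cp * mu / Pr;
}
```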

2.2. Nondimensionalisation of Governing Equations

Because of the complexity of the NS equations, it is convenient to nondimensionalise them using the freestream values: the density ($\rho_{\infty}$), the speed of sound ($c_{\infty}$), the static temperature ($T_{\infty}$), and a reference length ($L$). The nondimensional variables are defined as follows [24]:

$$\rho^{*}=\frac{\rho}{\rho_{\infty}},\quad u^{*}=\frac{u}{c_{\infty}},\quad v^{*}=\frac{v}{c_{\infty}},\quad w^{*}=\frac{w}{c_{\infty}},\quad p^{*}=\frac{p}{\rho_{\infty}c_{\infty}^{2}},\quad T^{*}=\frac{T}{T_{\infty}},\quad x_{i}^{*}=\frac{x_{i}}{L},\quad t^{*}=\frac{t\,c_{\infty}}{L}.$$

2.3. Numerical Schemes

Either structured or unstructured meshes can be used to discretize the governing equations. An outstanding advantage of structured meshes is that the grid indices directly correspond to how the flow variables are stored in computer memory (see the index mapping sketched below), which allows computational algorithms to be implemented more efficiently. Moreover, structured-mesh CFD codes map naturally onto GPUs because their regular data structures are convenient for parallel computing. However, generating structured meshes for complex geometries is more complicated than generating unstructured ones.
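For instance, on a structured block of dimensions NI × NJ × NK, the cell (i, j, k) can be addressed in a flat device array by a simple index formula. This is a sketch; the particular ordering chosen here is an assumption.

```cuda
// Map structured indices (i, j, k) to the linear index of a flat array.
__host__ __device__ inline int cellIndex(int i, int j, int k, int NI, int NJ)
{
    return (k * NJ + j) * NI + i;   // i varies fastest, then j, then k
}
```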

The cell-centred finite volume method on structured meshes is used for the spatial discretization of the governing equations. For a control volume $\Omega_{I,J,K}$, the discrete form of the integral equation above can be written as

$$\frac{d\vec{W}_{I,J,K}}{dt}=-\frac{1}{\Omega_{I,J,K}}\left[\sum_{m=1}^{N_{F}}\left(\vec{F}_{c}-\vec{F}_{v}\right)_{m}\Delta S_{m}\right],$$

where the sum runs over the $N_{F}=6$ faces of the cell and $\Delta S_{m}$ is the face area. The term in square brackets on the right-hand side is called the residual, denoted by $\vec{R}_{I,J,K}$. The convective fluxes are computed with the upwind AUSM+UP scheme [25], which offers high resolution and computational efficiency for all speeds. The underlying idea of the approach is to decompose the convective flux into a convective part and a pressure part based on the mass flux function:

$$\left(\vec{F}_{c}\right)_{1/2}=\dot{m}_{1/2}\,\vec{\Psi}_{L/R}+\vec{p}_{1/2},\qquad
\vec{\Psi}=\begin{bmatrix}1\\ u\\ v\\ w\\ H\end{bmatrix},\qquad
\vec{p}_{1/2}=\begin{bmatrix}0\\ n_{x}p_{1/2}\\ n_{y}p_{1/2}\\ n_{z}p_{1/2}\\ 0\end{bmatrix},$$

where the subscript $1/2$ refers to the cell face; the subscripts $L$ and $R$ refer to the control volumes on the left and right sides of the face; $\vec{p}_{1/2}$ denotes the pressure part; $\dot{m}_{1/2}$ denotes the mass flux; and $\vec{\Psi}_{L/R}$ is the vector of quantities convected by $\dot{m}_{1/2}$, taken from the left state if $\dot{m}_{1/2}>0$ and from the right state otherwise. In this paper, second-order accuracy is achieved through the monotone upstream-centred schemes for conservation laws (MUSCL) [26] with Van Leer's limiter, which prevents the generation of oscillations and spurious solutions in regions with large gradients.
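Once the interface mass flux and pressure flux of the AUSM+UP scheme have been evaluated, assembling the convective flux at a face reduces to the upwind selection above. The following is a minimal device-function sketch of that last step only; the evaluation of $\dot{m}_{1/2}$ and $p_{1/2}$ themselves, which is the core of AUSM+UP, is omitted, and the function and array names are assumptions.

```cuda
// F = mdot * Psi_{L/R} + p_{1/2}, with Psi = (1, u, v, w, H) and upwind selection.
__device__ void assembleConvectiveFlux(float mdot,          // interface mass flux
                                       float p,             // interface pressure flux
                                       const float* psiL,   // (1, u, v, w, H) of left cell
                                       const float* psiR,   // (1, u, v, w, H) of right cell
                                       const float* n,      // face unit normal (nx, ny, nz)
                                       float* F)            // resulting flux, 5 components
{
    const float* psi = (mdot > 0.0f) ? psiL : psiR;         // upwind selection
    F[0] = mdot * psi[0];
    F[1] = mdot * psi[1] + n[0] * p;
    F[2] = mdot * psi[2] + n[1] * p;
    F[3] = mdot * psi[3] + n[2] * p;
    F[4] = mdot * psi[4];                                   // energy: mdot * H
}
```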

The central scheme is selected for the viscous fluxes because of their elliptic mathematical character. The values at the cell face are obtained by averaging the two adjacent cells:

$$\vec{W}_{1/2}=\tfrac{1}{2}\left(\vec{W}_{L}+\vec{W}_{R}\right).$$

For the temporal discretization of the semidiscrete equation above, the method of lines is used, so that space and time integration can be handled separately. The space integration, consisting of the evaluation of the convective and viscous fluxes, was introduced above. The four-stage Runge–Kutta time-stepping method [23, 27] is employed for the temporal discretization:

$$\vec{W}^{(0)}=\vec{W}^{\,n},\qquad
\vec{W}^{(m)}=\vec{W}^{(0)}-\alpha_{m}\,\frac{\Delta t_{I,J,K}}{\Omega_{I,J,K}}\,\vec{R}^{(m-1)},\quad m=1,\dots,4,\qquad
\vec{W}^{\,n+1}=\vec{W}^{(4)}.$$

This scheme is second-order accurate in time, where $\Delta t_{I,J,K}$ is the local time step of the control volume and $(\alpha_{1},\alpha_{2},\alpha_{3},\alpha_{4})$ are the stage coefficients. The multistage scheme is well suited to data parallelism, which is of considerable significance for parallel computing. Furthermore, it reduces the storage cost because only the initial conservative variables and the latest computed residual need to be kept in memory. The implicit residual smoothing method is used to accelerate the convergence of the solution.
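Because each cell is updated independently within a stage, one Runge–Kutta stage maps naturally onto a CUDA kernel. The following is a minimal sketch, assuming single precision and an array-of-structures layout with five conservative variables per cell; the kernel and array names are illustrative, not the authors' implementation.

```cuda
// One Runge-Kutta stage: W <- W0 - alpha * (dt/Omega) * R, for every cell.
__global__ void rkStageUpdate(const float* __restrict__ W0,        // W at start of the time step
                              float* __restrict__ W,               // W being advanced
                              const float* __restrict__ R,         // latest residual
                              const float* __restrict__ dtOverVol, // local dt / cell volume
                              float alpha, int nCells)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= nCells) return;

    float s = alpha * dtOverVol[cell];
    for (int k = 0; k < 5; ++k) {              // five conservative variables
        int idx = cell * 5 + k;
        W[idx] = W0[idx] - s * R[idx];
    }
}
```

The kernel would be launched four times per time step with the successive stage coefficients, with the residual recomputed between stages.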

3. CUDA and GPU Parallel Algorithm for CFD

3.1. CUDA Overview and GPU Architecture

CUDA is a general-purpose parallel computing platform and programming model that lets researchers run code on GPUs flexibly to solve complex scientific problems [9, 12, 28]. In CUDA, a C program is extended with kernels, which are executed by a large number of threads. Individual threads are grouped into thread blocks, and the block size is at most 1,024 threads for current devices. Thread blocks are organized into a grid, and the number of thread blocks in a grid is usually determined by the size of the data being processed, so that each thread handles the computation of one mesh cell. Figure 3 shows the relationship among threads, blocks, and the grid.
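For example, a launch configuration that assigns one thread to each cell of a flattened structured block might be chosen as follows. This is a sketch, reusing the hypothetical rkStageUpdate kernel from Section 2.3; dW0, dW, dR, and dDtOverVol are assumed device arrays.

```cuda
// One thread per mesh cell; the block size must not exceed 1,024.
dim3 block(256);
dim3 grid((nCells + block.x - 1) / block.x);   // enough blocks to cover all cells
rkStageUpdate<<<grid, block>>>(dW0, dW, dR, dDtOverVol, alpha, nCells);
cudaDeviceSynchronize();                       // wait for completion before timing or copying back
```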

The GPU architecture contains different types of memory to access data, such as global, constant, texture, shared, and local memories and registers [12, 21]. Figure 4 shows the memory hierarchy of a GPU. At a low level, local memory and registers are private to the threads. The input or intermediate output variables in the threads will be stored in registers or local memory. At a high level, each thread block has shared memory, and all threads in the same thread block can cooperate with one another. At an even higher level, the texture and constant memories are used to store read-only parameters that need to be frequently accessed. The global memory is available to CPU and GPU and enables data transfer between them. The memory hierarchy of a GPU guarantees that high-performance computing can be achieved.
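As one illustration of this hierarchy, small read-only parameters that every thread needs, such as the ratio of specific heats or the freestream Mach number, can be placed in constant memory. The variable names below are assumptions, not the authors' code.

```cuda
__constant__ float cGamma;      // ratio of specific heats, read by all threads
__constant__ float cMachInf;    // freestream Mach number

// Copy the host values into constant memory once, before the kernels run.
void uploadConstants(float gamma, float machInf)
{
    cudaMemcpyToSymbol(cGamma,   &gamma,   sizeof(float));
    cudaMemcpyToSymbol(cMachInf, &machInf, sizeof(float));
}
```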

The GPU architecture is built around an array of streaming multiprocessors (SMs). A multiprocessor is designed to execute hundreds of threads concurrently, which helps the GPU hide memory access latency; to manage so many threads it employs the Single-Instruction, Multiple-Thread (SIMT) execution model. When the host invokes a kernel function, the thread blocks of the grid are distributed to the SMs. On each SM, thread blocks are executed in turn: as old thread blocks terminate, new ones are launched immediately. Threads within the same block are guaranteed to run on the same SM concurrently, and inter-thread communication is accomplished through shared memory.

3.2. GPU Parallel Algorithm for CFD

Figure 5 shows the GPU parallel algorithm for the CFD solver. The solid boxes represent the general cooperation of CPU and GPU, and the dashed box represents the solution of the governing equations implemented on the GPU. The GPU parallel algorithm follows seven steps: device setting, memory allocation, data transfer from host to device, kernel execution, data transfer from device to host, freeing memory, and device resetting; a host-side sketch of these steps is given below. For the solution of the governing equations, each time step consists of a series of kernel functions, including the processing of boundary conditions, the calculation of the cell-face fluxes, and the update of the primitive variables. The exchange of primitive variables and their gradients between GPUs is essential for accurate face-flux calculation when running the GPU-based CFD code on a multi-GPU parallel cluster.
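The following host-side skeleton is a sketch of the seven steps, assuming single-precision arrays; the kernel names applyBoundaryConditions, computeFluxesAndResidual, and updatePrimitiveVariables are placeholders, not the authors' identifiers.

```cuda
// Illustrative host-side skeleton of the seven steps in Figure 5.
cudaSetDevice(deviceId);                                      // 1. device setting
float *dW, *dR;
size_t bytes = (size_t)nCells * 5 * sizeof(float);
cudaMalloc(&dW, bytes);                                       // 2. memory allocation
cudaMalloc(&dR, bytes);
cudaMemcpy(dW, hW, bytes, cudaMemcpyHostToDevice);            // 3. host -> device

dim3 block(256), grid((nCells + 255) / 256);
for (int step = 0; step < nSteps; ++step) {                   // 4. kernel execution
    applyBoundaryConditions<<<grid, block>>>(dW);
    computeFluxesAndResidual<<<grid, block>>>(dW, dR);
    updatePrimitiveVariables<<<grid, block>>>(dW, dR);
}

cudaMemcpy(hW, dW, bytes, cudaMemcpyDeviceToHost);            // 5. device -> host
cudaFree(dW);  cudaFree(dR);                                  // 6. freeing memory
cudaDeviceReset();                                            // 7. device resetting
```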

3.3. Multi-GPU Parallelization Based on MPI+CUDA

The Message Passing Interface (MPI) is widely used on shared- and distributed-memory machines to implement large-scale calculations on multi-GPU parallel clusters [5, 29, 30]. Figure 6 shows the multi-GPU programming model based on MPI+CUDA. At the beginning of the calculation, the MPI_Init function is called to enter the MPI environment. The computing tasks are then distributed evenly to each GPU by a one-dimensional domain decomposition method [31] to achieve load balance. At the end of the calculation, the MPI_Finalize function is called to exit the MPI environment; a minimal setup sketch is given below. Here, each node consists of one Intel Xeon E5-2670 CPU with eight cores and four NVIDIA GTX 1070 GPUs.
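A minimal sketch of this setup, assuming one MPI rank per GPU; the rank-to-device mapping, the cell count, and the slab decomposition shown here are illustrative assumptions rather than the authors' code.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);                    // enter the MPI environment
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    cudaSetDevice(rank % nDevices);            // bind this rank to one GPU

    // 1D domain decomposition: each rank owns an equal slab of cells in one
    // index direction, plus ghost layers shared with neighbouring ranks.
    const long nCellsTotal = 1780000;          // e.g., the coarser mesh level
    long cellsPerRank = nCellsTotal / size;
    printf("rank %d of %d: %ld cells on GPU %d\n",
           rank, size, cellsPerRank, rank % nDevices);

    // ... allocate device memory, transfer data, and iterate as in Section 3.2 ...

    MPI_Finalize();                            // exit the MPI environment
    return 0;
}
```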

In a multi-GPU parallel cluster, ghost-cell and singularity data must be exchanged between GPUs. On the devices used here, data cannot be transferred directly between GPUs, so the CPU acts as an intermediary. Figure 7 shows the data transfer process between GPUs. The function cudaMemcpy with the cudaMemcpyDeviceToHost flag is called to transfer data from a GPU to its host CPU through the PCIe bus. After the data exchange between CPUs is completed via MPI, cudaMemcpy with the cudaMemcpyHostToDevice flag is called to transfer the data from the host to the target GPU; a sketch of this staged exchange is given below. The latest device introduced by NVIDIA, the Tesla V100, provides the NVLink interconnect [32], which allows GPUs to communicate with each other directly.
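The staged exchange described above might look as follows in code; the buffer names (hSend, hRecv, dBoundary, dGhost), the halo sizes, and the neighbouring rank are placeholders for illustration.

```cuda
// Illustrative ghost-cell exchange between two neighbouring ranks, staged
// through host memory over the PCIe bus (all names are assumptions).
// 1. GPU -> host: copy the local boundary cells into a host send buffer.
cudaMemcpy(hSend, dBoundary, haloBytes, cudaMemcpyDeviceToHost);

// 2. host <-> host: exchange halo data with the neighbouring MPI rank.
MPI_Sendrecv(hSend, haloCount, MPI_FLOAT, neighbour, 0,
             hRecv, haloCount, MPI_FLOAT, neighbour, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

// 3. host -> GPU: copy the received values into the device ghost layer.
cudaMemcpy(dGhost, hRecv, haloBytes, cudaMemcpyHostToDevice);
```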

4. Physical Model and Mesh Generation

4.1. Physical Model

Figure 8 shows the physical model of an aerospace plane. Reference [33] provides a wide variety of experimental data sets for this model, obtained in a hypersonic wind tunnel. The aerospace plane is of great importance for future space warfare and transportation, which are drawing increasing interest. In this paper, the hypersonic flow over the aerospace plane model is studied on the multi-GPU parallel platform. Orthographic views and the main dimensions (given in mm) of the model are presented in Figure 9; they correspond exactly to the experimental model.

Two Mach numbers, 8.02 and 10.02, are chosen to analyse the hypersonic flow characteristics of the aerospace plane. The sideslip angle is zero for all cases, and the angles of attack considered at each Mach number are listed in Table 1. The temperature of the isothermal wall is set to 300 K. Table 1 summarises the flow conditions for this study, from which the freestream values are obtained.

4.2. Mesh Generation

The commercial software Pointwise is used to generate the multiblock structured mesh. Only half of the flow field is simulated because the flow is symmetric with no sideslip. The mesh is refined near the walls and at the corners, and the height of the first grid layer is on the order of 10⁻⁶ m so as to capture the boundary layer accurately. The mesh in the boundary layer (the orange region in Figures 10 and 11) is generated separately from that of the main flow to ensure grid quality in this region; this approach also offers considerable flexibility in mesh generation. The cell counts of the grid-refinement levels used in this study are given in Section 5.4. Figures 10 and 11 show the computational mesh on the symmetry plane and the base plane, respectively. Figure 12 shows the boundary conditions, marked in different colours; four types are employed: wall, symmetry, inflow, and outflow.

5. Results and Discussion

5.1. Multi-GPU Parallel Platform

In this paper, we use CUDA 8.0 and Visual Studio 2013 for the C code, and MPICH2 1.4.1 for MPI communication. All simulations are performed on one node, which contains one Intel Xeon E5-2670 CPU running at 2.60 GHz and four NVIDIA GTX 1070 GPUs. The main parameters of the multi-GPU parallel platform are listed in Table 2. Because the theoretical peak single-precision floating-point performance of the GTX 1070 is much higher than its double-precision performance, single-precision arithmetic is used for the GPU parallelization.

5.2. Flow over the Physical Model at Mach Number 8.02

The Mach number and pressure contours on the symmetry plane at Mach number 8.02 are shown in Figures 13 and 14. The GPU parallel algorithm clearly captures the wave structures in the flow field. The approximate boundary layer thickness can be seen in Figure 13. The bow shock at the head and the expansion wave over the upper surface (shown in Figure 14) have an important influence on the distributions of the flow field parameters.

Figures 15 and 16 show the surface pressure distributions on the symmetry plane at Mach number 8.02, together with the experimental results from [33] for direct comparison. The pressure reaches a maximum at the head owing to the bow shock and then decreases rapidly. On the upper surface, the pressure drops at the cone–cylinder corner because of the expansion and then remains roughly constant. As the angle of attack increases, the upper-surface pressure decreases, while the opposite holds for the lower surface. The numerical results agree well with the experimental data, and the overall agreement is favourable.

5.3. Flow over the Physical Model at Mach Number 10.02

In this section, the computed results of cases 3 and 4 are discussed. Figure 17 shows the heat flow contours of the aerospace plane. As clearly shown in Figure 17, extreme values occur at the head and at the wing twist owing to drastic changes in the temperature gradients. Figure 18 shows the temperature contours on the symmetry plane. A high-temperature zone is formed at the head of the physical model owing to the influence of the bow shock.

Figures 19 and 20 show the heat flow distributions on the symmetry plane for cases 3 and 4, which are close to the experimental data reported in [33]. The heat flow is presented in nondimensional form, normalised by a reference value. As can be seen from the figures, the heat flow distributions follow the same trend as the pressure. Nevertheless, discrepancies exist between the simulation and the measurements, especially at the cone–cylinder corner, owing to the grid resolution. Overall, the agreement between the numerical simulation and the experimental measurements is again favourable.

5.4. GPU Parallel Performance Analysis

For case 1, numerical computations are performed on six grid-refinement levels: coarser (1.78 million cells), coarse (3.56 million), medium (5.02 million), fine (7.08 million), finer (10.04 million), and finest (20.08 million), to analyse the performance of the multi-GPU parallel algorithm.

A good parallel system should deliver high acceleration, and speedup is an important metric for measuring the performance of a parallel algorithm. The speedup is defined as the runtime on one CPU with eight cores divided by the runtime on the GPU(s). In this paper, we simulate ten thousand time steps, and the runtime of one iteration step is obtained by averaging the total execution time (a compact statement of this definition is given below). The runtimes of one iteration step on one CPU and on a single GPU are presented in Table 3 (times in ms). Figure 21 shows that the speedup of the single GPU increases with the grid size. In comparison with CPU-based parallel computing, GPU parallelization considerably improves the computational efficiency. The speedup reaches 63.22 for the coarser mesh and 77.59 for the finest mesh. The speedup increases rapidly and then gradually flattens as the grid size grows, because the proportion of uniform, compute-intensive work (the iteration itself, etc.) increases relative to branch-heavy overhead (data transfer, etc.) as the grid size increases. GPUs are better suited to such compute-intensive tasks than traditional CPUs. Once the grid size reaches a "limit value", the speedup remains essentially unchanged; this limit value is closely related to the properties of the GPU architecture.
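In compact form, the speedup reported in Figures 21 and 22 is

$$S_{n}=\frac{\bar{t}_{\mathrm{CPU}}}{\bar{t}_{n\text{-}\mathrm{GPU}}},\qquad \bar{t}=\frac{1}{N}\sum_{i=1}^{N}t_{i},\quad N=10{,}000,$$

where $\bar{t}_{\mathrm{CPU}}$ is the average wall-clock time of one iteration on the eight-core CPU, $\bar{t}_{n\text{-}\mathrm{GPU}}$ is the corresponding average on $n$ GPUs, and $t_{i}$ is the measured time of iteration $i$. For example, the single-GPU speedup of 63.22 on the coarser mesh means that one iteration on the GPU takes roughly 1/63 of the CPU time for that grid.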

Table 4 shows the runtime of one iteration step for multiple GPUs (times in ms). Figure 22 shows that the speedup increases with grid size for every GPU count, from a single GPU up to four GPUs. Multi-GPU parallelization markedly improves the computational efficiency as the grid size increases. The speedup with four GPUs reaches 76.83 for the coarser mesh and 146.55 for the finest mesh, far greater than the acceleration achieved with one or two GPUs, which indicates that the multi-GPU parallel algorithm established in this paper can be applied to large-scale scientific computations.

6. Conclusions

In the present work, a multi-GPU parallel algorithm based on MPI+CUDA is established to accelerate the computation of hypersonic flow problems. Flow over an aerospace plane model at Mach numbers 8.02 and 10.02 is studied on NVIDIA GTX 1070 GPUs to verify the algorithm, and the computed pressure and heat flow distributions agree well with experimental data. The speedup of a single GPU reaches 63.22 for the coarser mesh and 77.59 for the finest mesh; compared with CPU-based parallel computing, GPU parallelization considerably improves the computational efficiency. The speedup with four GPUs reaches 76.83 for the coarser mesh and 146.55 for the finest mesh, far greater than that achieved with one or two GPUs, which indicates that the multi-GPU parallel algorithm established in this paper can be applied to large-scale scientific computations.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The first author would like to thank Dr. Feng Liu. All authors would like to express their thanks for the support from the National Natural Science Foundation of China (No. 11472004).