Scientific Programming

Volume 2019, Article ID 4254676, 12 pages

https://doi.org/10.1155/2019/4254676

## Implementation and Optimization of a CFD Solver Using Overlapped Meshes on Multiple MIC Coprocessors

^{1}College of Computer and Information Technology, Xinyang Normal University, Henan 464200, China
^{2}Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China

Correspondence should be addressed to Wenpeng Ma; mawp@xynu.edu.cn

Received 2 February 2019; Revised 27 March 2019; Accepted 23 April 2019; Published 27 May 2019

Academic Editor: Basilio B. Fraguela

Copyright © 2019 Wenpeng Ma et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In this paper, we develop and parallelize a CFD solver that supports overlapped meshes on multiple MIC architectures using multithreading techniques. We optimize the solver through several considerations, including vectorization, memory arrangement, and an asynchronous strategy for data exchange among multiple devices. Comparisons of different vectorization strategies are made, and the performances of the core functions of the solver are reported. Experiments show that a speedup of about 3.16x can be achieved for the six core functions on a single Intel Xeon Phi 5110P MIC card, and a 5.9x speedup can be achieved using two cards, compared to an Intel E5-2680 processor, for a case with two ONERA M6 wings.

#### 1. Introduction

Computing with accelerators such as the graphics processing unit (GPU) [1] and the Intel many integrated core (MIC) architecture [2] has been attractive in computational fluid dynamics (CFD) in recent years because it offers researchers the possibility of accelerating or scaling their numerical codes through various parallel techniques. Meanwhile, the fast development of computer hardware and emerging programming techniques requires researchers to explore suitable parallel methods for engineering applications. The Intel MIC architecture consists of processors that inherit many key features of Intel CPU cores, which makes code migration less expensive and has made the architecture popular for the development of parallel algorithms.

Many CFD-based codes or solvers have been studied on the Intel MIC architecture. Gorobets et al. [3] used various accelerators, including AMD GPUs, NVIDIA GPUs, and Intel Xeon Phi coprocessors, to conduct direct numerical simulations of turbulent flows and compared the results across these accelerators. Farhan et al. [4] used the native and offload modes of the MIC programming model to parallelize the flux kernel of PETSc-FUN3D, and by exploring a series of shared memory optimization techniques, they obtained about 3.8x speedup with offload mode and 5x speedup with native mode. Graf et al. [5] ran their PDE codes on single and multiple Intel Xeon Phi Knights Landing (KNL) nodes using an MPI + OpenMP programming model with different thread affinity types. Cai et al. [6] computed nonlinear dynamic problems on Intel MIC by employing offload mode and an overlapped data transfer strategy, obtaining a 17x speedup on MIC over the sequential host version for the simulation of a bus model. Saini et al. [7] investigated various numerical codes on MIC, comparing different host-coprocessor computing modes to seek performance improvements, and also presented a load-balancing approach for the symmetric use of MIC coprocessors. Wang et al. [8] reported the large-scale computation of a high-order CFD code on the Tianhe-2 supercomputer, which consists of both CPUs and MIC coprocessors. Other CFD-related work on the Intel MIC architecture can be found in references [9–12]. Working as coprocessors, GPUs have also been popular in CFD. Many researchers [13–17] have studied GPU computing on structured meshes, involving coalesced computation techniques [13], heterogeneous algorithms [15, 17], numerical methods [16], etc. Corrigan et al. [18] investigated an Euler solver on the GPU employing unstructured grids and achieved a significant speedup over CPUs. Subsequently, many results followed on the GPU platform, including data structure optimizations [19, 20], numerical techniques [21], and applications [22] based on unstructured meshes. For GPU simulations on overlapped (overset) meshes, Soni et al. [23] developed a steady CFD solver on unstructured overset meshes using the GPU programming model, which accelerated both the grid (mesh) assembly procedure and the numerical calculations. They then extended their solver to unsteady flows [24, 25], making their GPU implementation capable of handling dynamic overset meshes. More CFD-related computing on GPUs can be found in the overview reference [26].

However, most existing works either used purely structured or unstructured meshes without mesh overlapping over the computational domain on the MIC architecture, or studied overlapped meshes on GPUs. For the past several decades, the majority of CFD simulations involving overlapped meshes were implemented on distributed systems through the message passing interface (MPI) [27] without using coprocessors. Specifically, Djomehri and Jin [28] reported the parallel performance of an overset solver using a hybrid programming model with both MPI [27] and OpenMP [29]. Prewitt et al. [30] reviewed parallel implementations of overlapped mesh methods. Roget and Sitaraman [31] developed an efficient overlapped mesh assembler and investigated unsteady simulations on nearly 10000 CPU cores. Zagaris et al. [32] discussed a range of problems regarding parallel computing for moving body problems using overlapped meshes and presented preliminary performance results on parallel overlapped grid assembly. Other overlapped mesh-related works can be found in [33–36]. Although the work in [7] conducted tests using a solver that is compatible with overlapped meshes on MIC coprocessors, it accessed multiple MIC coprocessors through the native or symmetric mode of MIC. In this paper, we focus on the offload mode of MIC and investigate the parallelization of a solver with overset meshes on a single host node with multiple MIC coprocessors.
The contributions of this work are as follows:

(i) We parallelize an Euler solver using overlapped meshes and propose an asynchronous strategy for calculations on multiple MIC coprocessors within a single host node.

(ii) We investigate the performance of the core functions of the solver employing offload mode with different thread affinity types on the MIC architecture, and we make detailed comparisons between the results obtained with Intel MIC vectorization and those obtained with Intel SSE vectorization.

(iii) A speedup of 5.9x is obtained on two MIC coprocessors over a single Intel E5-2680 processor for the two M6 wings case.

The remainder of the paper is organized as follows. We first introduce the MIC architecture and programming model. This is followed by the equations and numerical algorithms implemented in the solver. In Section 4, we discuss implementation and optimization aspects, including data transfer, vectorization, and the asynchronous data exchange strategy on multiple MIC coprocessors. The performances of the core functions with different thread affinity types are reported and compared in Section 5. The last section summarizes our work.

#### 2. MIC Architecture and Programming Model

The many integrated core (MIC) architecture [2] integrates many x86 cores into one processor, providing highly parallel computing power. The architecture used in the first Intel Xeon Phi product is called Knights Corner (KNC). A KNC coprocessor can have up to 61 dual-issue, in-order x86 computing cores. Each core has a 512-bit vector processing unit (VPU) that supports 16 single-precision or 8 double-precision floating-point operations per cycle, and each core can run 4 hardware threads. A 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 512 KB L2 cache are available to each core. The coprocessor used in this work is the Intel Xeon Phi 5110P [37], which consists of 60 cores, each running at 1.05 GHz. It can run a total of 240 threads simultaneously, and 8 GB of GDDR5 memory is available on it.

Anyone familiar with the C, C++, or Fortran programming language can develop codes for MIC coprocessors without major revision of their source codes. MIC provides very flexible programming models, including native host mode, native MIC mode, offload mode, and symmetric mode [38]. Coprocessors are not used in native host mode, and programmers can run their codes on CPUs just as they did before the MIC architecture was introduced. By contrast, codes run only on coprocessors in native MIC mode when they are compiled with the “-mmic” option. Symmetric mode allows programmers to run codes on both CPU cores and coprocessors. Offload mode is the most commonly used mode on a single coprocessor or on multiple coprocessors within a single host node. Its basic use is to annotate a code segment with offload directives so that the segment runs on the MIC coprocessor. To take full advantage of the computational resources on MIC, the offloaded code segment can be executed in parallel by employing multithreading techniques such as OpenMP [29].

#### 3. Equations and Numerical Algorithms

##### 3.1. Compressible Euler Equations

The three-dimensional, time-dependent compressible Euler equations over a control volume $\Omega$ with boundary $\partial\Omega$ can be expressed in integral form as

$$\frac{\partial}{\partial t}\int_{\Omega}\mathbf{W}\,\mathrm{d}\Omega + \oint_{\partial\Omega}\mathbf{F}_c\,\mathrm{d}S = 0, \quad (1)$$

where $\mathbf{W} = (\rho,\ \rho u,\ \rho v,\ \rho w,\ \rho E)^{T}$ represents the vector of conservative variables and $\mathbf{F}_c$ denotes the vector of convective fluxes, with the pressure closed by the equation of state $p = (\gamma - 1)\,\rho\left[E - \tfrac{1}{2}\left(u^2 + v^2 + w^2\right)\right]$.

##### 3.2. Numerical Algorithms

The flux-difference splitting (FDS) [39] technique is employed to calculate the spatial derivative of the convective fluxes. In this method, the flux at a cell interface (for example, in the *i* direction), denoted by $\mathbf{F}_{i+1/2}$, is computed by solving an approximate Riemann problem as

$$\mathbf{F}_{i+1/2} = \frac{1}{2}\left[\mathbf{F}(\mathbf{W}_L) + \mathbf{F}(\mathbf{W}_R) - \left|\tilde{A}\right|\left(\mathbf{W}_R - \mathbf{W}_L\right)\right],$$

where $\tilde{A}$ is the Roe-averaged flux Jacobian and the left and right states $\mathbf{W}_L$ and $\mathbf{W}_R$ are constructed by the Monotonic Upstream-Centered Scheme for Conservation Laws (MUSCL) [40] with the min-mod limiter.
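For reference, one common second-order form of the MUSCL reconstruction with the min-mod limiter (the paper does not state the exact variant used, so this is a representative sketch) reads

```latex
\mathbf{W}^{L}_{i+1/2} = \mathbf{W}_i
  + \tfrac{1}{2}\,\operatorname{minmod}\!\left(\mathbf{W}_i - \mathbf{W}_{i-1},\;
    \mathbf{W}_{i+1} - \mathbf{W}_i\right),
\qquad
\mathbf{W}^{R}_{i+1/2} = \mathbf{W}_{i+1}
  - \tfrac{1}{2}\,\operatorname{minmod}\!\left(\mathbf{W}_{i+1} - \mathbf{W}_i,\;
    \mathbf{W}_{i+2} - \mathbf{W}_{i+1}\right),
```

with $\operatorname{minmod}(a,b) = \tfrac{1}{2}\left[\operatorname{sgn}(a) + \operatorname{sgn}(b)\right]\min(|a|,|b|)$, which reverts to first order at extrema and thereby suppresses spurious oscillations near discontinuities.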

Equation (1) is solved in time in this work by employing an implicit approximate-factorization method [41], which achieves first-order accuracy in steady-state simulations.
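A representative form of the implicit approximate-factorization scheme [41] (symbols here are generic; the exact operators of the solver are not shown in the paper) splits the implicit operator along the three curvilinear directions:

```latex
\left(I + \Delta t\,\delta_{\xi}\tilde{A}\right)
\left(I + \Delta t\,\delta_{\eta}\tilde{B}\right)
\left(I + \Delta t\,\delta_{\zeta}\tilde{C}\right)\Delta\mathbf{W}^{n}
  = -\Delta t\,\mathbf{R}^{n},
```

where $\delta$ denotes a spatial difference operator, $\tilde{A}$, $\tilde{B}$, $\tilde{C}$ are flux Jacobians, $\mathbf{R}^{n}$ is the residual, and $\Delta\mathbf{W}^{n} = \mathbf{W}^{n+1} - \mathbf{W}^{n}$. Each factor reduces to block-tridiagonal systems solved direction by direction, which is far cheaper than inverting the unfactored operator.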

Euler wall boundary conditions (also called inviscid surface conditions), inflow/outflow boundary conditions, and symmetry plane boundary conditions are applied to equation (1) by using the ghost cell method [42].

##### 3.3. Mesh Technique

In this work, we aim to solve the 3D Euler equations on multiple MIC devices using the overlapped mesh technique. In this technique [33, 42], each entity or component is configured with a multiblock mesh system, and different multiblock mesh systems are allowed to overlap each other. During the numerical calculations, overlapped regions need to receive interpolated information from each other via interpolation methods [43, 44]. The process of identifying interpolation information among overlapped regions, termed mesh (grid) assembly [31–34, 45], must be carried out before the numerical calculations start. This technique reduces the difficulty of generating meshes for complex geometries in engineering because an independent, high-quality mesh system can be designed for each component without considering the others. However, it increases the complexity of conducting parallel calculations (via MPI [27], for example) on this kind of mesh system. As mesh blocks are distributed over separate processors where data cannot be shared directly, extra work is needed for data exchange among mesh blocks. There are two types of data communication in an overlapped mesh system when calculations are conducted on distributed computers: one is the data exchange across the shared interfaces where one block connects to another, and the other is the interpolation data exchange where one block overlaps with other blocks. Both types of communication are discussed in this work.

#### 4. Implementation

##### 4.1. Calculation Procedure

The procedure for solving equation (1) depends mainly on the mesh technique employed. For the overlapped mesh system, the steps of the calculation are organized in Listing 1. As described in Section 3.3, performing mesh assembly is a necessary step (Listing 1, line 1) before conducting numerical computations on an overlapped mesh system. Mesh assembly identifies the cells that need to receive interpolation data (CRI) as well as the cells that provide interpolation data (CPI) for the CRIs in overlapped regions, and it creates a map recording where the data associated with each block should be sent to or received from. This data map stays unchanged during the steady-state procedure and is used for data exchange (Listing 1, line 8) within each iteration. Then, the mesh blocks that connect with other blocks share the solutions at their common interfaces. When the data from the blocks that provide interpolation and from the neighbouring blocks are ready, a loop (Listing 1, lines 10–13) is launched to compute fluxes and update the solution on each block one by one.