Journal of Computational Engineering

Volume 2015 (2015), Article ID 295393, 10 pages

http://dx.doi.org/10.1155/2015/295393

## Asynchronous Parallelization of a CFD Solver

^{1}Department of Applied Mathematics, Naval Postgraduate School, 1 University Circle, Monterey, CA 93943, USA

^{2}Department of Civil and Environmental Engineering, University of Western Ontario, 1151 Richmond Street, London, ON, Canada N6A 3K7

Received 24 July 2015; Accepted 13 October 2015

Academic Editor: Fu-Yun Zhao

Copyright © 2015 Daniel S. Abdi and Girma T. Bitsuamlak. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A Navier-Stokes equations solver is parallelized to run on a cluster of computers using the domain decomposition method. Two approaches to communication and computation are investigated, namely, synchronous and asynchronous methods. Asynchronous communication between subdomains is not commonly used in CFD codes; however, it has the potential to alleviate scaling bottlenecks incurred when processors have to wait for each other at designated synchronization points. A common way to avoid this idle time is to overlap asynchronous communication with computation. For this to work, however, there must be something useful and independent that a processor can do while waiting for messages to arrive. We investigate an alternative approach to computation, namely, conducting asynchronous iterations to improve the local subdomain solution while communication is in progress. An in-house CFD code is parallelized using the message passing interface (MPI), and scalability tests are conducted that suggest asynchronous iterations are a viable way of parallelizing CFD code.

#### 1. Introduction

Atmospheric boundary layer simulations on complex terrain require a tremendous amount of computational resources (CPU hours and memory) [1–3]. Even for a CFD simulation around a single building, tens of millions of grid cells may be required to fully resolve all scales of motion. The use of complex turbulence models, for instance, large eddy simulation instead of the standard k-epsilon model, further adds to the computational demand [4]. Therefore, one often resorts to methods that trade accuracy for speed, such as the use of wall models instead of resolving the near-wall flow [5], to get results within a reasonable time. Parallel computation on a cluster of machines can deliver results faster without sacrificing their quality. In this work, an in-house CFD program is developed and parallelized to run on a cluster of computers using two communication and computation approaches, namely, synchronous and asynchronous methods.

*(1) Domain Decomposition*. Wind flow simulations on complex terrain produce a large amount of data at every time step, making it virtually impossible to simulate the whole domain on a single commodity computer. To counter this problem, many CFD programs use domain decomposition (DD) methods, in which each processor takes care of only a part of the domain. Information is then exchanged between the subdomains during the solution stage to enforce at least weak coupling. DD can be thought of as a “divide and conquer” strategy that can be used either when the problem is too big to fit in the memory space of one computer or when the subdomains are more easily solvable than the original undecomposed problem. The method has been used extensively in aerospace engineering since the early days to conduct finite element calculations on parts of an airplane [6]. In those days, computer memory was very limited; therefore, the only way to solve a big problem was to decompose it and solve the subdomains one by one while imposing special boundary conditions to couple them. Nonoverlapping DD methods include the Dirichlet-Neumann and Neumann-Neumann methods, as well as adaptive variations suitable for hyperbolic convection problems. While the motivation for these methods was to solve large problems that did not fit in the memory space of a single computer, the current study is concerned with exploiting concurrency on a cluster of computers capable of holding the whole computational domain in distributed memory.
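As a rough illustration of the decomposition idea, the sketch below splits a one-dimensional row of grid cells into contiguous subdomains and lists the neighbouring cells (often called halo or ghost cells) whose values a subdomain would need to receive from other processors. The function names `partition` and `halo_cells` and the 1D layout are assumptions made for this example; they are not taken from the solver described in this paper, which works on general CFD meshes.

```python
def partition(n_cells, n_parts):
    """Split cell indices 0..n_cells-1 into n_parts contiguous, nearly equal
    half-open ranges (lo, hi). Hypothetical helper for illustration only."""
    base, extra = divmod(n_cells, n_parts)
    ranges = []
    start = 0
    for rank in range(n_parts):
        # The first `extra` subdomains receive one additional cell.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def halo_cells(ranges, rank, n_cells):
    """Indices of cells owned by neighbouring subdomains that `rank`
    needs in order to update its own boundary cells (1D, nearest neighbour)."""
    lo, hi = ranges[rank]
    halo = []
    if lo > 0:
        halo.append(lo - 1)   # last cell of the left neighbour
    if hi < n_cells:
        halo.append(hi)       # first cell of the right neighbour
    return halo


ranges = partition(10, 3)
print(ranges)                     # [(0, 4), (4, 7), (7, 10)]
print(halo_cells(ranges, 1, 10))  # [3, 7]
```

Only the halo values cross subdomain boundaries, which is why the volume of communication grows with the surface of a subdomain rather than with its size.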

In this study, we implement an implicit parallel CFD solver and compare two approaches to communication and computation, namely, synchronous and asynchronous methods. The motivation for investigating the asynchronous method is that the synchronous method can incur significant performance loss on a cluster of computers due to its use of collective communication calls, such as global reductions and barriers. For example, calculating global error norms requires an expensive global reduction, which is often the subject of optimization on massively parallel computers. Gropp et al. [7] describe a way of parallelizing the Poisson equation using the synchronous approach and Jacobi iterations. The popularity of the synchronous method derives from the fact that a computation done in parallel gives results identical, aside from floating point truncation errors, to those obtained with a serial computation. This gives the developer and the user great confidence that the parallel implementation is correct. However, synchronizing all processors at each iteration can harm scalability on thousands of processors. For instance, a recent study [8] of the open source CFD code OpenFOAM found barrier calls to be responsible for poor scaling performance. Synchronous communication is the standard way of implementing parallel CFD code, and it has been used by [9–11] among many others.
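The synchronous approach can be mimicked in a single process. The sketch below is a simplified stand-in, not the MPI implementation of this paper or of [7]: it applies Jacobi sweeps to a 1D Poisson-type system, 2u_i - u_{i-1} - u_{i+1} = b_i with zero boundary values, once serially and once with the cells split into subdomains that all exchange data before every sweep. The names `jacobi_serial` and `jacobi_synchronous`, the model problem, and the use of a shared snapshot in place of real messages are assumptions for illustration. The sum over `partial_sums` plays the role of the global reduction needed for an error norm.

```python
import math


def jacobi_serial(b, n_iter):
    """Plain Jacobi sweeps for 2*u[i] - u[i-1] - u[i+1] = b[i], u = 0 outside."""
    n = len(b)
    u = [0.0] * n
    for _ in range(n_iter):
        old = u[:]
        for i in range(n):
            left = old[i - 1] if i > 0 else 0.0
            right = old[i + 1] if i < n - 1 else 0.0
            u[i] = 0.5 * (b[i] + left + right)
    return u


def jacobi_synchronous(b, ranges, n_iter):
    """Same sweeps, with each (lo, hi) range acting as one processor's subdomain."""
    n = len(b)
    local = [[0.0] * (hi - lo) for lo, hi in ranges]
    norms = []
    for _ in range(n_iter):
        # Synchronization point: every subdomain sees its neighbours' values
        # from the previous sweep. (A real code would exchange only halo cells.)
        snapshot = [value for part in local for value in part]
        partial_sums = []
        for rank, (lo, hi) in enumerate(ranges):
            sq = 0.0
            for i in range(lo, hi):
                left = snapshot[i - 1] if i > 0 else 0.0
                right = snapshot[i + 1] if i < n - 1 else 0.0
                new = 0.5 * (b[i] + left + right)
                sq += (new - snapshot[i]) ** 2
                local[rank][i - lo] = new
            partial_sums.append(sq)
        # Stand-in for the collective reduction that yields a global norm.
        norms.append(math.sqrt(sum(partial_sums)))
    solution = [value for part in local for value in part]
    return solution, norms


b = [1.0] * 12
ranges = [(0, 4), (4, 8), (8, 12)]
serial = jacobi_serial(b, 50)
parallel, norms = jacobi_synchronous(b, ranges, 50)
print(parallel == serial)     # the decomposed sweeps reproduce the serial ones
print(norms[0] > norms[-1])   # the global update norm shrinks
```

Because every subdomain waits for fresh neighbour data before each sweep, the decomposed result matches the serial one; the price is that the slowest processor dictates the pace of every iteration.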

The second alternative uses asynchronous communication and computation between subdomains to avoid all synchronization between processors. The use of asynchronous methods in CFD is not common, but discussions and implementations have been reported by some researchers [12–14]. Some of these authors conclude that asynchronous computation is likely to become more attractive on heterogeneous clusters, whose compute units (for example, CPUs, GPUs, and FPGAs) have variable communication latencies.
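The effect of iterating without waiting for neighbours can be imitated in a single process. In the sketch below, which is a deterministic toy model and not the MPI code of this paper, each subdomain keeps relaxing its own cells on every sweep, while its halo values are refreshed only every few sweeps to mimic messages that arrive late. The function name `async_relaxation`, the `refresh_every` parameter, and the 1D model problem 2u_i - u_{i-1} - u_{i+1} = b_i are assumptions for illustration. Convergence with stale data is expected for this diagonally dominant model problem; it is not guaranteed for arbitrary systems.

```python
def async_relaxation(b, ranges, refresh_every, n_sweeps):
    """Jacobi-type sweeps per subdomain with possibly stale halo values.

    refresh_every[rank] = k means subdomain `rank` receives new neighbour
    values only on every k-th sweep and reuses old ones in between.
    """
    local = [[0.0] * (hi - lo) for lo, hi in ranges]
    # halo[rank] = [value left of the subdomain, value right of the subdomain];
    # the outermost entries stay 0.0 and act as the physical boundary values.
    halo = [[0.0, 0.0] for _ in ranges]
    last = len(ranges) - 1
    for sweep in range(1, n_sweeps + 1):
        for rank, (lo, hi) in enumerate(ranges):
            if sweep % refresh_every[rank] == 0:
                # A "message" has arrived: pick up the neighbours' current values.
                if rank > 0:
                    halo[rank][0] = local[rank - 1][-1]
                if rank < last:
                    halo[rank][1] = local[rank + 1][0]
            # Keep improving the local solution regardless of message arrival.
            old = local[rank][:]
            m = hi - lo
            for j in range(m):
                left = old[j - 1] if j > 0 else halo[rank][0]
                right = old[j + 1] if j < m - 1 else halo[rank][1]
                local[rank][j] = 0.5 * (b[lo + j] + left + right)
    return [value for part in local for value in part]


n = 12
b = [1.0] * n
ranges = [(0, 4), (4, 8), (8, 12)]
u = async_relaxation(b, ranges, refresh_every=[1, 2, 3], n_sweeps=5000)

# Exact solution of the model problem with b = 1 and zero boundary values.
exact = [(i + 1) * (n - i) / 2.0 for i in range(n)]
max_error = max(abs(ui - ei) for ui, ei in zip(u, exact))
print(max_error < 1e-8)
```

The subdomains never idle in this model, but individual sweeps use older information than in the synchronous case, so more sweeps may be needed to reach the same accuracy.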

*(2) Iterative Algorithms*. Finite volume discretizations of the governing equations of fluid dynamics yield matrices that are highly sparse. Such systems of linear equations are solved more efficiently with iterative algorithms, which successively update a solution vector starting from an initial guess, than with direct methods. Direct methods construct either the inverse of the matrix or its LU decomposition, which is time consuming and results in a full matrix even when the original matrix is sparse. Moreover, iterative methods allow for the use of asynchronous iterations, which are the focus of the current study, to reduce the idle time incurred by the need to synchronize processors. In the following, we briefly discuss the iterative algorithms that are relevant to the current study.
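The loss of sparsity mentioned above can be checked on a small case. The sketch below builds the tridiagonal matrix of a 1D Poisson-type problem and inverts it with a textbook Gauss-Jordan elimination, then counts nonzero entries before and after. The helper names and the 6-by-6 example are assumptions for illustration, and the elimination omits pivoting, which is acceptable only because this particular matrix is symmetric positive definite. The example concerns the explicit inverse; the amount of fill-in in an LU factorization depends on the matrix structure and ordering.

```python
def laplacian_1d(n):
    """Dense storage of the tridiagonal matrix with 2 on the diagonal, -1 beside it."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 2.0
        if i > 0:
            a[i][i - 1] = -1.0
        if i < n - 1:
            a[i][i + 1] = -1.0
    return a


def inverse(a):
    """Gauss-Jordan inversion without pivoting (assumes nonzero pivots)."""
    n = len(a)
    # Augment A with the identity matrix: [A | I].
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = aug[col][col]
        aug[col] = [x / pivot for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


def count_nonzeros(a, eps=1e-12):
    return sum(1 for row in a for x in row if abs(x) > eps)


a = laplacian_1d(6)
a_inv = inverse(a)
print(count_nonzeros(a))      # 16 of 36 entries are nonzero
print(count_nonzeros(a_inv))  # 36: every entry of the inverse is nonzero
```

For a mesh with millions of cells, storing such a dense inverse would be out of the question, whereas an iterative method only ever needs the sparse matrix and a few vectors.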

The Krylov subspace methods, such as the preconditioned conjugate and biconjugate gradient methods (Algorithm 1), are popular among iterative algorithms due to their fast convergence properties. Even though simple relaxation algorithms (Algorithm 2) are not commonly used on their own for solving linear systems of equations, they scale very well on a large number of processors, as mentioned in [15]. In addition, they can be used as preconditioners for the fast-converging Krylov subspace methods [16] and as smoothers in a multigrid solver.
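The difference in convergence speed between the two families can be seen on a small model problem. The sketch below is a textbook preconditioned conjugate gradient loop with a Jacobi (diagonal) preconditioner next to a plain Jacobi relaxation; it is not a transcription of Algorithm 1 or Algorithm 2. The 1D system 2u_i - u_{i-1} - u_{i+1} = b_i, the function names, and the tolerance are assumptions for illustration. Since the diagonal of this matrix is constant, the preconditioner here only shows where a relaxation step plugs into the Krylov method and does not change its convergence.

```python
import math


def apply_a(u):
    """Matrix-free product with the tridiagonal matrix (2 on diagonal, -1 beside it)."""
    n = len(u)
    return [2.0 * u[i]
            - (u[i - 1] if i > 0 else 0.0)
            - (u[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def pcg(b, tol=1e-10, max_iter=1000):
    """Conjugate gradients preconditioned with the diagonal of A (all entries 2)."""
    x = [0.0] * len(b)
    r = b[:]                          # residual b - A*x for the zero initial guess
    z = [ri / 2.0 for ri in r]        # preconditioner solve: z = D^{-1} r
    p = z[:]
    rz = dot(r, z)
    for k in range(1, max_iter + 1):
        ap = apply_a(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if math.sqrt(dot(r, r)) < tol:
            return x, k
        z = [ri / 2.0 for ri in r]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


def jacobi(b, tol=1e-10, max_iter=100000):
    """Plain Jacobi relaxation: u <- u + D^{-1} (b - A*u)."""
    u = [0.0] * len(b)
    for k in range(max_iter):
        r = [bi - ai for bi, ai in zip(b, apply_a(u))]
        if math.sqrt(dot(r, r)) < tol:
            return u, k               # k updates were performed
        u = [ui + ri / 2.0 for ui, ri in zip(u, r)]
    return u, max_iter


b = [1.0] * 12
x_cg, cg_iters = pcg(b)
x_jac, jacobi_iters = jacobi(b)
print(cg_iters, jacobi_iters)   # CG needs far fewer iterations than Jacobi
```

Note that each conjugate gradient iteration needs dot products, which become global reductions in a parallel code, whereas a relaxation sweep needs only neighbour values. This is one reason relaxation-type methods remain attractive when synchronization is expensive.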