Abstract

A parallel adaptive pseudo-transient Newton-Krylov-Schwarz (NKS) method for the solution of compressible flows is presented. Multidimensional upwind residual distribution schemes are used for the space discretisation, while an implicit time-marching scheme is employed for the discretisation of the (pseudo-)time derivative. The linear system arising from the Newton method applied to the resulting nonlinear system is solved by means of Krylov iterations with Schwarz-type preconditioners. A scalable and efficient data structure for the NKS procedure is presented. The main computational kernels are considered, and an extensive analysis is reported to compare the Krylov accelerators and the preconditioning techniques. Results, obtained on a distributed memory computer, are presented for 2D and 3D problems of aeronautical interest on unstructured grids.

1. Introduction

The aim of this paper is to provide an overview of the methods required for an efficient parallel solution of the compressible Euler equations (CEE) on unstructured 2D and 3D grids. The ingredients include space and time discretisation schemes for the underlying partial differential equations (PDEs), a nonlinear solver based on Newton's method, and a parallel Krylov accelerator with domain decomposition preconditioners.

Nowadays, unstructured grids are of particular interest for industrial applications. With respect to structured grids, it is often easier to produce unstructured grids of good quality in domains of complex shape, especially if the unstructured grid generator can be coupled to a CAD system. This can provide—at least in principle—a fast process to solve the problem at hand, with minimal intervention of the user, once the geometry of the domain and the boundary conditions have been specified. However, such techniques imply an almost perfect geometric representation (CAD level) and automatic grid generation remains a challenge in computational engineering, especially for moving boundaries.

The framework presented here can be successfully applied to the solution of sets of PDE problems discretised on unstructured grids. The focus in this paper is restricted to the solution of the CEE. In particular, the space discretisation technique considered is based on the so-called multidimensional upwind residual distribution (MURD) schemes; see for instance [15]. These schemes are designed to provide higher accuracy and fewer spurious numerical anomalies, since they take into account the multidirectionality of the wave structure, rather than being based upon a summation of one-dimensional Riemann problems [6, 7]. First developed for a scalar advection equation, MURD schemes have been extended to homogeneous-flux hyperbolic systems such as the Euler equations. They can be interpreted as Petrov-Galerkin finite element methods with compact-stencil basis functions. This turns out to be of particular advantage for their parallel implementation, since both first-order and second-order in space schemes share the same communication pattern.

For compressible flow simulations, we will consider here the Euler equations. Steady-state solutions are found as the steady states of their time-dependent version. The time derivative is discretised using the backward Euler method, which leads to a nonlinear system to be solved at each time step. Hence, for these cases, time accuracy is not an issue. In fact, the pseudo-time derivative is added as a means to obtain a “good” initial iterate for Newton's method (which can even diverge for initial iterates too far away from the solution). Moreover, only a few steps of Newton's method are considered, and the linear system with the Jacobian matrix is solved only approximately.

Newton's method applied to this nonlinear system results in a large, ill-conditioned linear system involving the system's Jacobian matrix. Direct methods appear to be too computationally expensive, especially in 3D. For our kind of applications, preconditioned iterative solvers of Krylov type are a prime choice and can also be combined with multigrid techniques, another challenging approach, especially for compressible flow simulations. A possible way to derive the preconditioner is to adopt a domain decomposition approach [8, 9], and more particularly to use a Schwarz preconditioner. This means that the computational domain Ω is decomposed into overlapping subdomains Ω_i whose union covers Ω. Then, a Dirichlet problem is solved in each subdomain. Since typically each subdomain is assigned to a different processor, this leads to a parallel preconditioner. Further, a coarse level is added, resulting in an overall—although cheap—communication among the subdomains. Here, to construct the coarse level, we consider algebraic procedures based on the concept of aggregation.

The resulting framework is usually referred to as the Newton-Krylov-Schwarz framework, a quite generic procedure that can be applied to nonlinear systems of PDEs [10–16]. The idea is to obtain (asymptotically) second-order convergence in time through the Newton iterations and a reasonable parallelism thanks to the use of a Krylov solver complemented with a Schwarz-type preconditioner. The resulting algorithm will be referred to as NKS.

For an efficient and scalable implementation of NKS methods, parallel codes have to be developed, extending their sequential counterparts. Data, like grid structures, the Jacobian matrix, the state variables, and work vectors, must be partitioned among the processors in order to minimise the amount of interprocessor communication. The definition of a distributed data structure, well suited for all the computational phases, is the first problem to be addressed. To that aim, all the main computational kernels of the NKS algorithm are analysed, and suitable algorithms are presented and implemented in order to make use of as much local data as possible, thus minimising communication. Note that robust and accurate methods are necessary to treat the large variety of flow regimes which typically arise in aeronautical applications relying on the solution of the Euler equations. The interest of these techniques is that many software packages, such as PETSc and ISL++, offer these functionalities in a user-friendly way.

The paper is organised as follows. Section 2 describes the numerical solution of the compressible Euler equations and the NKS scheme. In Section 3 the data structure used for parallelisation is introduced. Section 4 outlines the main computational kernels of the NKS scheme. Section 5 gives some numerical results obtained on a SGI Origin 3800 for academic test cases as well as typical full aircraft problems. Although this machine is nowadays replaced by Linux clusters, the scaling of the results for multiple processor clusters is significant. Indeed, all techniques and results described here scale in exactly the same way on Linux clusters under MPI. Finally, Section 6 outlines the conclusions.

2. Numerical Solution of the Compressible Euler Equations

The compressible Euler equations, describing nonviscous and non-heat-conducting compressible fluid flows in a domain Ω, can be written as

∂u/∂t + Σ_{j=1}^{3} ∂F_j(u)/∂x_j = 0,  (1)

where u = (ρ, ρu, ρv, ρw, ρE)^T is the vector of conservative variables. For the applications of interest, (1) are posed on a bounded domain and completed by appropriate boundary conditions and an initial condition (see for instance [17]).

In (1), ρ is the density; u, v, and w are the x-, y-, and z-components of the velocity vector; and E is the specific total energy. The quantities F_j, j = 1, 2, 3, are homogeneous functions of u, called fluxes,

F_j(u) = (ρu_j, ρu_j u + p δ_{1j}, ρu_j v + p δ_{2j}, ρu_j w + p δ_{3j}, ρu_j H)^T,  with (u_1, u_2, u_3) = (u, v, w).

Here δ_{ij} is the Kronecker symbol, p is the (static) pressure, and H = E + p/ρ is the specific enthalpy. The gas is considered to be a calorically perfect gas [18, Chapter 2]; therefore p = (γ − 1) ρ (E − (u² + v² + w²)/2), with γ = c_p/c_v, c_v and c_p being the specific heat coefficients at constant volume and constant pressure, respectively.

The space discretisation of (1) is conducted using MURD schemes, which are based on the quasilinear form

∂u/∂t + Σ_{j=1}^{3} A_j(u) ∂u/∂x_j = 0,  (2)

with

A_j(u) = ∂F_j(u)/∂u,  j = 1, 2, 3.  (3)

The basic idea of MURD schemes is to compute the residual (or fluctuation) at element level, defined on each element E as

Φ^E = ∫_E Σ_{j=1}^{3} ∂F_j(u_h)/∂x_j dΩ,  (4)

and then to redistribute this quantity to the nodes of the element, according to the choice of the scheme.
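To fix ideas, the following C++ sketch illustrates the generic residual-distribution loop for a single scalar unknown on a triangular grid. The data layout and the fluctuation/distribution callbacks are illustrative assumptions, not the actual THOR implementation; the distribution coefficients depend on the particular MURD scheme chosen.

```cpp
#include <algorithm>
#include <array>
#include <functional>
#include <vector>

// Generic residual-distribution loop for one scalar unknown on triangles
// (illustrative data layout; the system case works element by element in the
// same way, with small blocks instead of scalars).
struct Triangle { std::array<int, 3> node; };  // vertex indices of the element

using Fluctuation = std::function<double(const Triangle&, const std::vector<double>&)>;
using Distribution = std::function<std::array<double, 3>(double, const Triangle&,
                                                         const std::vector<double>&)>;

void assemble_nodal_residuals(const std::vector<Triangle>& elements,
                              const std::vector<double>& u,
                              const Fluctuation& fluctuation,   // computes the element residual
                              const Distribution& distribute,   // splits it according to the scheme
                              std::vector<double>& nodal_res) {
  std::fill(nodal_res.begin(), nodal_res.end(), 0.0);
  for (const Triangle& T : elements) {
    const double phi = fluctuation(T, u);                 // element residual (fluctuation)
    const std::array<double, 3> contrib = distribute(phi, T, u);
    for (int i = 0; i < 3; ++i)
      nodal_res[T.node[i]] += contrib[i];                 // compact stencil: only the 3 local nodes
  }
}
```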

We will not describe MURD schemes here, referring the reader to, for example, [19–24] and the references therein. We just give a very broad overview of the schemes.

Initially, MURD schemes were developed for scalar equations. The extension to systems is straightforward if the system is diagonalisable: in this case, scalar schemes can be used for each scalar equation of the diagonalised system. For nondiagonalisable systems, like the CEE, three main orientations can be identified. The first consists in the formal extension of the scalar scheme; this leads, for instance, to the system N-scheme [4]. A second orientation was introduced by Roe in 1986 [25] and consists in decomposing the initial residual as a sum of simple wave solutions. The last one, initially proposed by Deconinck and coworkers, see [26], is an elliptic/hyperbolic splitting, which decomposes the CEE into an acoustic subsystem of three equations (for 3D problems) plus two advection equations, one of them being the entropy advection. The last approach is followed here. It was shown in [27] that this decomposition can lead to numerical methods with minimal spurious entropy production.

MURD schemes are targeted to the solution of advection-dominated systems on unstructured grids because they are not constructed by concentrating on any particular direction of the grid. An advantage is that, at least for scalar equations, one can construct a fully second-order accurate scheme on triangular grids with a very compact stencil: only the nodes in the triangle are used in the evaluation of the fluctuation. Note that these schemes can be interpreted as Petrov-Galerkin finite-element schemes, see [28, Section 4.1].

The spatial discretisation of (1) results in a system of nonlinear equations of the type

M du/dt + R(u) = 0,  u(0) = u_0,  (6)

where M is a nonsingular (lumped) mass matrix and R(u) is the vector of nodal residuals. In particular, we seek the steady-state solution of system (6), which will be one of the roots of

R(u) = 0.  (7)

Note that the Newton method applied directly to system (7) does not suffice, since an initial iterate sufficiently near to the root is generally not available. This is particularly true for complex flows containing shock waves and contact discontinuities, like the ones considered in this paper. Framing the steady-state equation into a time-dependent setting constitutes a sort of continuation process, which can be considered as an attempt to widen the domain of convergence of the Newton method, or as a procedure to obtain sufficiently close starting points.

System (6) is discretised using a backward Euler scheme. Starting from a given u^n, the solution u^{n+1} at the pseudo-time step n+1 is found by solving the nonlinear system of equations

M (u^{n+1} − u^n)/Δt^n + R(u^{n+1}) = 0,  (8)

or, equivalently,

G(u^{n+1}) = 0,  with G(v) := M (v − u^n)/Δt^n + R(v),  (9)

where Δt^n is the time increment at step n.

A strategy for the definition of Δt^n is needed. As a general rule, one should keep the time step small until all the main flow features are well resolved; then large time steps may be taken near the steady state to obtain superlinear or quadratic convergence of Newton's method. In this paper we have adopted the so-called exponential rule [29], that is,

Δt^n = Δt^0 β^n,  (10)

where Δt^0 is the initial time step and β > 1 is a prescribed growth factor. Other strategies have been proposed in the literature, like the so-called switched evolution relation rule (see [29]) or the expert rule (see [30]).
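As a simple illustration of rule (10), the following C++ fragment computes the pseudo-time step sequence; the numerical values are placeholders and not those used in the computations of this paper.

```cpp
#include <cmath>
#include <cstdio>

// Exponential rule: dt_n = dt_0 * beta^n (placeholder values, for illustration only).
double exponential_rule(double dt0, double beta, int n) {
  return dt0 * std::pow(beta, static_cast<double>(n));
}

int main() {
  const double dt0 = 1.0, beta = 2.0;   // hypothetical initial step and growth factor
  for (int n = 0; n < 10; ++n)
    std::printf("pseudo-time step %d: dt = %g\n", n, exponential_rule(dt0, beta, n));
  return 0;
}
```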

A Newton iteration at time level n+1 for the solution of (9) would then read

(M/Δt^n + J(u^{n+1,k})) δu^k = −G(u^{n+1,k}),  u^{n+1,k+1} = u^{n+1,k} + δu^k,  (11)

with k being the index associated to the Newton procedure, J = ∂R/∂u the Jacobian of the residuals, and u^{n+1,0} = u^n. The Newton iteration should stop when either a certain tolerance level is reached, or after a fixed number of iterations. Although it is possible in principle to take more Newton correction iterates, convergence results from [31] show that, for steady-state computations, quadratic convergence can eventually be achieved using only one step of process (11), provided that a suitable time-step evolution strategy for Δt^n is found. Therefore, in the following we take just one Newton iteration per time step. This leads to the linear problem

(M/Δt^n + J(u^n)) (u^{n+1} − u^n) = −R(u^n),  (12)

that can be written, dropping the index n, as

A x = b,  (13)

where A = M/Δt + J(u), b = −R(u), and x is the update of the solution.
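The overall pseudo-transient procedure with a single Newton step per time level can be summarised by the following C++ sketch; the residual, linear-solver, and norm callbacks are assumed interfaces, not the actual THOR/AZTEC routines.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Hooks supplied by the flow solver (assumed interfaces, not the THOR/AZTEC API):
//   residual(u)            -> R(u)
//   solve_linear(u, dt, r) -> du, an (inexact) solution of (M/dt + dR/du(u)) du = r
//   norm(v)                -> a norm of v
struct NksOps {
  std::function<Vec(const Vec&)> residual;
  std::function<Vec(const Vec&, double, const Vec&)> solve_linear;
  std::function<double(const Vec&)> norm;
};

// Pseudo-transient continuation with one Newton step per pseudo-time level:
// solve (M/dt_n + J(u_n)) du = R(u_n), set u_{n+1} = u_n - du, and grow dt.
inline void pseudo_transient_nks(const NksOps& ops, Vec& u,
                                 double dt0, double beta,
                                 double tol, int max_steps) {
  double dt = dt0;
  for (int n = 0; n < max_steps; ++n) {
    const Vec r = ops.residual(u);
    if (ops.norm(r) < tol) break;            // steady state reached
    const Vec du = ops.solve_linear(u, dt, r);
    for (std::size_t i = 0; i < u.size(); ++i) u[i] -= du[i];
    dt *= beta;                              // exponential time-step rule (10)
  }
}
```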

A preliminary analysis of the memory and computational time requirements of direct techniques suggests the use of an iterative solver of Krylov type for the solution of (13).

The problem with iterative methods is that their convergence rate depends on the spectral properties of the coefficient matrix, in particular on the condition number κ(A) = ‖A‖ ‖A⁻¹‖, where ‖·‖ is a matrix norm. To obtain a more favourable condition number, one can solve the modified system P⁻¹Ax = P⁻¹b, where the preconditioner P is a nonsingular matrix. P should approximate A as closely as possible, so that κ(P⁻¹A) ≪ κ(A), while still being reasonably cheap to apply, both in memory and in CPU time. Also, it should be optimal, that is, κ(P⁻¹A) should not depend on the problem dimension, and scalable, that is, it should not depend on the number of processors used. Indeed, one of the main aims of a preconditioner is to transform the system into one whose matrix P⁻¹A is close to the identity matrix.

Many parallel preconditioners have been proposed in the literature. Here, we have resorted to preconditioners based on the domain decomposition (DD) approach. The basic idea of DD methods is to decompose the computational domain Ω into smaller parts Ω_i, called subdomains, whose union covers Ω. The original problem can then be reformulated within each subdomain Ω_i, of smaller size. This family of subproblems is coupled one to the other through the values of the unknown solution at the subdomain interfaces. This coupling is then removed at the expense of introducing an iterative process which involves, at each step, solutions on the Ω_i with additional interface conditions.

We will focus on overlapping domain decomposition methods, also known as Schwarz methods. In these methods, the computational domain is subdivided into overlapping subdomains, and local Dirichlet-type problems are then solved on each subdomain. The communication between the solutions on the different subdomains is here guaranteed by the overlapping region.

These DD methods are usually rather inefficient when used as solvers of the linear problem; however, they can be reformulated as efficient parallel preconditioners. To express the preconditioner, we need to fix some notation. Let R_i be the rectangular matrix which, from a global vector defined on Ω, returns the vector of its entries at the nodes internal to Ω_i; let A_i = R_i A R_i^T be the local matrix; and let R_i^T A_i^{-1} R_i r be the correction on subdomain Ω_i for a given residual r. Using this notation, the one-level (additive) Schwarz preconditioner can be written as

P_AS^{-1} = Σ_i R_i^T A_i^{-1} R_i.  (15)
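The action of the one-level preconditioner (15) on a residual vector can be sketched as follows. The code is written serially for clarity (in the parallel implementation each processor computes only the term of the sum associated with its own subdomain), and the local solve is an assumed callback, for example an ILU(0) forward/backward substitution.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// One-level additive Schwarz applied to a residual r: z = sum_i R_i^T A_i^{-1} R_i r.
// Serial version for clarity; the local solve is an assumed (possibly approximate) callback.
Vec apply_additive_schwarz(
    const std::vector<std::vector<int>>& subdomain_nodes,            // R_i as index lists
    const std::function<Vec(std::size_t, const Vec&)>& local_solve,  // r_i -> A_i^{-1} r_i
    const Vec& r) {
  Vec z(r.size(), 0.0);
  for (std::size_t i = 0; i < subdomain_nodes.size(); ++i) {
    const std::vector<int>& idx = subdomain_nodes[i];
    Vec r_i(idx.size());
    for (std::size_t k = 0; k < idx.size(); ++k) r_i[k] = r[idx[k]];   // restriction R_i r
    const Vec z_i = local_solve(i, r_i);                               // local (approximate) solve
    for (std::size_t k = 0; k < idx.size(); ++k) z[idx[k]] += z_i[k];  // prolongation and sum
  }
  return z;
}
```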

This preconditioner is particularly well suited for the class of problems considered here. It is quite easy to implement and has reduced memory requirements. However, since P_AS acts only locally, its scalability is hindered by the weak coupling between far away subdomains, and its performance degrades rapidly as the number of subdomains increases.

This behaviour is typical of all one-level preconditioners, which, by construction, are composed only of local corrections on the subdomains. In fact, all state-of-the-art DD preconditioners consist of a local and a global component. The local part, acting at the subdomain level, captures the strong couplings that appear between neighbouring subdomains, while the global part provides an overall—although inexpensive—communication among the subdomains. The global component is usually referred to as a “coarse space correction,” since it is usually defined on a space that is coarse with respect to the fine space containing the solution. The complexity of this auxiliary problem is much lower than that of the original problem, and its role is to diffuse information among the subdomains. In a manner analogous to multigrid methods, the coarse space is used to correct the “smooth” part of the error, whereas the local preconditioner is used to damp the “high-frequency” part of the error.

In all generality, the global correction term has the form R_0^T A_0^{-1} R_0, where A_0 corresponds to the discretisation of the original PDE problem on a coarse space and R_0 is the restriction operator from the fine space to the coarse space. The resulting two-level preconditioner reads

P_AS,2^{-1} = R_0^T A_0^{-1} R_0 + Σ_i R_i^T A_i^{-1} R_i.  (16)

This preconditioner is fully additive, since all the corrections on the subdomains and on the coarse space are added together. Alternatively, a hybrid preconditioner can be used, in which the coarse correction is combined multiplicatively with the (additive) sum of the subdomain corrections; roughly speaking, such a preconditioner is called hybrid because the subdomain corrections are treated additively among themselves, while the coarse correction enters in a multiplicative way. For more details, the reader is referred to [9].

The key element to obtain a scalable and efficient preconditioner is the proper definition of the coarse space. The general approach is to discretise the original problem on a coarse grid. However, the construction of the coarse grid and of the corresponding restriction operator can be difficult or computationally expensive for problems defined on unstructured grids in domains of complex shape, as is typical in aeronautical applications. For this reason, here we consider agglomeration procedures to construct the coarse matrix for a two-level Schwarz preconditioner. The procedure is completely algebraic and very well suited for unstructured grids.

We now detail the procedure used to define the coarse space, referring the reader to [28, 32] for more details. Our approach is based on the concept of aggregation [33–35]. On each subdomain Ω_i we build a set of aggregates, an aggregate being a set of contiguous vertices; see a 2D example in Figure 1. The total number of aggregates represents the dimension of the coarse space; the coarse degrees of freedom are thus split among the subdomains, for example, simply by assigning the same number of aggregates to each subdomain. Finally, inside Ω_i, each aggregate is associated to a vector whose elements are built following a simple rule: an entry is equal to 1 if the corresponding vertex belongs to the aggregate, and 0 otherwise.

The algorithm uses a simple procedure to build the operators R_0 and R_0^T and to form the coarse matrix as

A_0 = R_0 A R_0^T.  (19)

For a comparison between the definition of the coarse space using a coarse grid and using aggregation, the reader is referred to [36], or to [28, Chapter 3], where the authors compare the two approaches for a model problem.

3. Definition of the Distributed Data Structure

The development of a scalable parallel NKS code requires the definition of an efficient distributed data structure. This data structure must efficiently handle the preconditioning phase, which is probably the most important aspect of the NKS solver, as well as all the other computational kernels of the scheme.

Throughout our description, we make the following assumptions. (i) The parallel computer is of multiple instruction multiple data (MIMD) architecture. (ii) The parallel computer is assumed to use a distributed memory architecture. In fact, this is not a restriction at all, since distributed memory can be emulated on almost every type of parallel computer, including shared memory computers. (iii) Communications are achieved using a message passing interface, like MPI [37]. This is not a limitation since message passing libraries are currently available on almost every type of architecture.

We also suppose that the starting grid, of moderate size (say, up to a million elements for an inviscid solution), has been generated on a sequential computer. More refined grids will be obtained by means of parallel adaptation techniques, as described in [38].

The first step is the partition of the grid among the processors. To that aim, we have used k-way graph partitioning algorithms [39]. Given a graph G = (V, E) with n vertices, we partition V into k subsets V_1, ..., V_k such that the subsets are pairwise disjoint, contain roughly n/k vertices each, and cover V, and such that the number of edges of E whose incident vertexes belong to different subsets is minimised. A k-way partition is commonly represented by a partition vector P of length n, such that for every vertex v of V, P[v] is an integer between 1 and k, indicating the partition to which vertex v belongs. Given a partition, the number of edges whose incident vertexes belong to different subsets is called the edge-cut of the partition.
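As a concrete illustration of the edge-cut just defined, the following C++ function computes it for a graph stored in compressed sparse row (CSR) form, given a partition vector such as the one returned by a k-way partitioner; the storage convention is an assumption made for the example.

```cpp
#include <cstddef>
#include <vector>

// Edge-cut of a k-way partition: number of graph edges whose endpoints lie in
// different subsets. The graph is stored in CSR form (xadj/adjncy), and part[v]
// is the subset index of vertex v. Each undirected edge appears twice in CSR,
// hence the division by 2.
std::size_t edge_cut(const std::vector<std::size_t>& xadj,
                     const std::vector<int>& adjncy,
                     const std::vector<int>& part) {
  std::size_t cut = 0;
  for (std::size_t v = 0; v + 1 < xadj.size(); ++v)
    for (std::size_t e = xadj[v]; e < xadj[v + 1]; ++e)
      if (part[v] != part[adjncy[e]]) ++cut;
  return cut / 2;
}
```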

The k-way algorithm can be used to decompose either the graph of the grid or its dual graph, leading to vertex-oriented (VO) or element-oriented (EO) decompositions. In the former, each vertex of the grid is assigned to a processor, while in the latter the decomposition is element-wise. A numerical comparison between the two approaches for the compressible Euler equations using an explicit time-marching scheme can be found in [40], while results concerning implicit time-marching schemes can be found in [41].

We have followed a VO decomposition. As a result, each vertex of the computational grid is assigned to a unique processor, whereas there exist elements shared by more than one processor. These elements are called cut elements—since their edges have been intersected by the partition—and they form a layer of elements that actually belong to more than one processor. The cut elements are stored on all processors which have at least one node of the element in their update set. For example, in Figure 2 the shaded elements are stored on both processors. The nodes assigned to a processor form its update set; the sizes of the update sets of all the processors sum up to the global dimension of the problem. The nodes belonging to the subdomain, but not to the update set, are called external nodes and form the external set.

From the point of view of the parallel application, grid generation and graph partitioning result in input files [40]; each input file contains a list of elements (grid connectivity and spatial coordinates of each vertex) and a list of the (physical) boundary nodes, grouped in so-called patches. Each patch corresponds to a part of the boundary, like, for instance, the nacelle, the wing, or the symmetry plane, on which the same boundary conditions are imposed. The grid is represented in terms of an element-based data structure. Also, a local-to-global mapping must be provided, to enable communication among the processors.
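The information contained in each input file can be gathered in a per-processor data structure along the following lines; the field names are illustrative and do not reproduce the actual layout used in the code.

```cpp
#include <array>
#include <vector>

// Sketch of the per-processor data produced by the vertex-oriented (VO)
// decomposition; names and types are illustrative assumptions.
struct LocalGrid {
  // Update set: vertices owned by this processor (local indices 0..n_update-1).
  int n_update = 0;
  // External set: vertices referenced by local (cut) elements but owned elsewhere
  // (local indices n_update..n_update+n_external-1).
  int n_external = 0;
  // Local-to-global vertex numbering, of size n_update + n_external,
  // needed to set up the communication pattern.
  std::vector<long> local_to_global;
  // Element connectivity in local numbering (tetrahedra in 3D); cut elements
  // are duplicated on every processor owning at least one of their vertices.
  std::vector<std::array<int, 4>> elements;
  // Boundary patches: for each patch, the local boundary nodes on which the
  // same physical boundary condition is imposed.
  std::vector<std::vector<int>> patches;
  // Vertex coordinates (x, y, z) for the update and external vertices.
  std::vector<std::array<double, 3>> coords;
};
```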

After convergence of the NKS solver, each processor writes the solution corresponding to its update set as a part of the global data. Often, these partitioned data are merged into a single file for global solution analysis and visualisation. However, certain packages also work directly on the partitioned solution and can also monitor the solution. We will not consider automatic visualisation issues in the following, since we have used stand-alone software tools.

4. NKS Solver Main Operations Analysis

The main computational kernels of our solver are as follows: (1)construction of the distributed matrix; (2)vector updates and AXPY operations; (3)vector-vector products; (4)matrix-vector products; (5)application of the preconditioner.

In this section we analyse these operations.

Let us focus on the construction of the distributed matrix. As is typical for finite elements, the matrix is constructed by looping over all the elements of the local grid and computing the elemental matrix corresponding to each element. On a given processor, the contribution of an elemental row is taken into account only if the corresponding global row is contained in the update set of that processor. This means that the matrix is constructed in a distributed way, that is, each row is stored on a unique processor. In an analogous way, every vector is distributed: each processor stores only the components which correspond to its update set.
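A sketch of this distributed assembly is given below; the sparse-matrix interface and the element data are illustrative assumptions (the actual code uses the AZTEC distributed formats), but the ownership test on the row index is the essential point.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Toy distributed-assembly sketch: rows with local index >= n_update belong to
// other processors and are simply skipped, so that each global row of the
// matrix is assembled and stored on exactly one processor.
struct Triplet { int row, col; double val; };
struct DistMatrix {
  int n_update = 0;                      // number of rows owned by this processor
  std::vector<Triplet> entries;          // toy storage; the real code uses a sparse format
  void add(int r, int c, double v) { entries.push_back({r, c, v}); }
};

template <int NN>  // nodes per element: 3 for triangles, 4 for tetrahedra
void assemble(DistMatrix& A,
              const std::vector<std::array<int, NN>>& elements,  // local connectivity
              const std::vector<std::array<std::array<double, NN>, NN>>& elem_mats) {
  for (std::size_t e = 0; e < elements.size(); ++e)
    for (int i = 0; i < NN; ++i) {
      const int row = elements[e][i];
      if (row >= A.n_update) continue;   // row owned by another processor: skip it
      for (int j = 0; j < NN; ++j)
        A.add(row, elements[e][j], elem_mats[e][i][j]);
    }
}
```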

AXPY operations (i.e., y ← αx + y for two given vectors x and y and a given real number α) are done as follows. First, one has to verify that all processors have the same value of α; then, for all the elements of the local vectors, one computes y_i ← α x_i + y_i. Apart from the synchronisation of the value of α (a step which is often avoided), no communication occurs in vector updates and AXPY operations.
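In code, the local part of an AXPY reduces to a plain loop over the update components; a minimal sketch follows, assuming α is already known by every processor.

```cpp
#include <cstddef>
#include <vector>

// AXPY, y <- alpha*x + y, on the local (update) components only; alpha is
// assumed to be known by every processor, so no communication is needed.
void axpy_local(double alpha, const std::vector<double>& x_local,
                std::vector<double>& y_local) {
  for (std::size_t i = 0; i < y_local.size(); ++i)
    y_local[i] += alpha * x_local[i];
}
```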

The vector-vector (dot) product proceeds as follows: (1) each processor computes the local value, summing the products of the components in its update set; (2) the local contributions of all the processors are summed up. Note that communication is required only in the second step. Before the final value can be computed, all processors must have finished the computations of the first step. For this reason, vector-vector products always act, in a parallel environment, as synchronisation points and require global communications.
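A minimal MPI sketch of this two-step dot product is the following; the vectors are assumed to contain only the update components of each processor.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Parallel dot product: local sum over the update set, then a global reduction
// which acts as a synchronisation point among all processors.
double parallel_dot(const std::vector<double>& x_local,
                    const std::vector<double>& y_local, MPI_Comm comm) {
  double local = 0.0;
  for (std::size_t i = 0; i < x_local.size(); ++i)
    local += x_local[i] * y_local[i];
  double global = 0.0;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
  return global;
}
```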

In the matrix-vector product y = Ax, a processor computes only the elements of y in its update set. Hence, the local parallel matrix-vector product reads

y_loc = A_loc x_loc + A_ext x_ext,  (20)

where x_loc and x_ext are the values of x corresponding to the update and external sets, respectively, A_loc is the submatrix representing the influence of update nodes on themselves, while A_ext represents the action of external nodes on update ones.

From (20), the steps required to perform the parallel matrix-vector product are (1) the exchange of boundary data, that is, the update of the values of the external nodes; (2) the computation of the local matrix-vector product A_loc x_loc; (3) the computation of the external matrix-vector product A_ext x_ext.

Steps (1) and (2) can be overlapped. Note that an efficient implementation of step (1) is particularly important to obtain a scalable matrix-vector product. As a general rule, the data to be sent to a processor should be packed and sent using only one MPI call, to avoid latency problems occurring within the network. However, since the grid is unstructured, it is not trivial to determine the owner of an external node and, moreover, the nodes to be sent to a certain processor are not necessarily stored contiguously in the local vector. This operation of packing data into a buffer creates contiguous messages from an irregularly spaced data structure; it requires a preprocessing of the matrix, plus the definition of a certain number of auxiliary data structures.
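The exchange of step (1) can be sketched as follows with nonblocking MPI calls; the communication lists (which local values are packed for which neighbour, and where the received values are scattered) are assumed to have been built in the preprocessing phase mentioned above, and the names are illustrative.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

struct Neighbour {
  int rank;                        // neighbouring processor
  std::vector<int> send_idx;       // local update indices to pack and send
  std::vector<int> recv_idx;       // local external indices to fill on receipt
};

// Update of the external values of a distributed vector x (update part first,
// external part appended): one packed message per neighbouring processor.
void exchange_external(std::vector<double>& x,
                       const std::vector<Neighbour>& nbrs, MPI_Comm comm) {
  std::vector<std::vector<double>> sbuf(nbrs.size()), rbuf(nbrs.size());
  std::vector<MPI_Request> reqs;
  reqs.reserve(2 * nbrs.size());
  for (std::size_t n = 0; n < nbrs.size(); ++n) {            // post the receives first
    rbuf[n].resize(nbrs[n].recv_idx.size());
    reqs.emplace_back();
    MPI_Irecv(rbuf[n].data(), static_cast<int>(rbuf[n].size()), MPI_DOUBLE,
              nbrs[n].rank, 0, comm, &reqs.back());
  }
  for (std::size_t n = 0; n < nbrs.size(); ++n) {            // pack and send
    sbuf[n].resize(nbrs[n].send_idx.size());
    for (std::size_t k = 0; k < sbuf[n].size(); ++k)
      sbuf[n][k] = x[nbrs[n].send_idx[k]];
    reqs.emplace_back();
    MPI_Isend(sbuf[n].data(), static_cast<int>(sbuf[n].size()), MPI_DOUBLE,
              nbrs[n].rank, 0, comm, &reqs.back());
  }
  MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
  for (std::size_t n = 0; n < nbrs.size(); ++n)              // scatter received values
    for (std::size_t k = 0; k < rbuf[n].size(); ++k)
      x[nbrs[n].recv_idx[k]] = rbuf[n][k];
}
```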

We want to underline the fact that for both first-order and second-order MURD schemes, the computational stencil of a generic vertex is compact, that is, it only depends on information from vertexes that are one edge-distance away in the grid structure. Therefore, the Jacobian matrix will have nonzero elements only for the contributions corresponding to the first-order neighbouring nodes, even for second-order schemes. This simplifies considerably the communication and allows the use of the same data structure for first-order and second-order space discretisation schemes.

Finally, let us consider the application of the Schwarz preconditioner. Using the VO decomposition outlined in Section 3, it is particularly easy to implement a one-level Schwarz preconditioner with minimal overlap, that is, with an overlap among the subdomains of one element only. Referring again to Figure 2, the overlapping region among the subdomains is represented by the elements in shaded colour.

On a given processor, the computation of the preconditioned residual then simply reads

z_loc = A_loc^{-1} r_loc,  (21)

and the matrix A_loc is already formed and stored in the local memory of the processor. Note that Schwarz preconditioners with wider overlaps would require communication. In general, the optimal amount of overlap is a compromise between the effectiveness of the preconditioner and its computational complexity. For this class of problems, we have found it convenient to use the minimal-overlap version. This is equivalent to a block-Jacobi procedure, where each block has the size of the corresponding update set. An ILU(0) or BILU(0) factorisation (using a reverse Cuthill-McKee reordering) of the block matrices gives satisfactory results (in terms of CPU time) with respect to other “richer” incomplete factorisations [41]. In fact, compared with other variants, ILU(0) is computationally fast and memory efficient.

Now, let us focus on the two-level Schwarz preconditioner described in Section 2, with a coarse space based on aggregation. We show that this preconditioner can be applied using only information available at the subdomain level. To obtain a convenient representation of the restriction matrix R_0, we order the set of nodes by putting first the nodes of one subdomain, then those of the next one, and so on; the coarse degrees of freedom are ordered analogously, subdomain by subdomain. As a result, R_0 has a block-diagonal structure, each block collecting the aggregate vectors built on the corresponding subdomain.

Note that each aggregate vector can be constructed at the subdomain level, using only local information about the grid of that subdomain (hence no communications). Very often, the graph of the local matrix furnishes all the information required to produce the aggregates, and hence grid data are not required by the algorithm. This is notably the case of finite-element codes with compact-stencil basis functions, as used in this paper. Note moreover that the aggregate vectors are not explicitly required by the algorithm. We use one integer vector, whose k-th entry gives the aggregate associated to the local degree of freedom k; this vector has the size of the local update set. First, communications occur to exchange the number of aggregates of each subdomain among the processors, so that a consistent global numbering of the coarse degrees of freedom can be set up. Then, in the aggregation procedure used to compute the coarse matrix as in (19), no communication is required to compute, on each subdomain, the rows of A_0 corresponding to the coarse degrees of freedom contained in that subdomain.
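When R_0 is made of 0/1 indicator vectors of the aggregates, the Galerkin product (19) reduces to summing matrix entries aggregate by aggregate; the following serial C++ sketch (the CSR storage and the integer aggregate vector are assumptions of the example) shows the local computation performed on each subdomain.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Aggregation-based coarse matrix A_0 = R_0 A R_0^T for 0/1 indicator vectors:
// entry (I, J) of A_0 is the sum of all entries A(i, j) with i in aggregate I
// and j in aggregate J. A is given in CSR form; agg[i] is the (global)
// aggregate index of row i.
std::map<std::pair<int, int>, double>
aggregate_coarse_matrix(const std::vector<std::size_t>& row_ptr,
                        const std::vector<int>& col_idx,
                        const std::vector<double>& val,
                        const std::vector<int>& agg) {
  std::map<std::pair<int, int>, double> A0;        // sparse coarse matrix
  for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i)
    for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
      A0[{agg[i], agg[col_idx[k]]}] += val[k];
  return A0;
}
```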

5. Numerical Results

In this section the NKS solver presented previously is validated. The underlying Euler code we have used is called THOR, developed within the BRITE-EURAM project IDeMAS. The basic code was provided by the project co-ordinator, the von Karman Institute VKI (Belgium), and includes contributions from the partners, DASSAULT AVIATION (France), ALENIA (Italy), DASA/EADS (Germany), INRIA (France), and EPFL (Switzerland). THOR involves several high-level numerical schemes aiming to produce highly accurate multidimensional upwinding techniques, see [5]. The graph partitioning is based on the public domain software METIS [42] and its parallel counterpart ParMETIS [43]. The linear systems involved are solved using the library AZTEC [44], developed at the Sandia National Laboratories (USA). The Schwarz algorithms and the Krylov solvers have been implemented by the authors.

The numerical results have been obtained on an SGI Origin 3800, equipped with 128 MIPS R14000 processors running at 500 MHz. Each processor has 512 Mbytes of RAM, with 32 Kbytes of first-level and 8 Mbytes of second-level cache memory. This type of machine can be replaced by Linux cluster-type machines, the results being significant for both types of architectures. The parallel communication paradigm used is MPI [37].

Three test cases of aeronautical interest have been selected to illustrate the variation of the performance with the various parameters of the algorithm and to validate the full NKS cycle.

In all the numerical tests, P_AS indicates the one-level Schwarz preconditioner, with ILU(0) factorisations and minimal overlap among the subdomains, while P_AS,2 denotes the two-level Schwarz preconditioner detailed in Section 2; the local dimension of the coarse space is the number of aggregates assigned to each processor. The coarse problem is solved redundantly on each processor, using a direct LU decomposition.

5.1. Falcon Aircraft

We have studied the influence of the Krylov accelerator on a small 3D test case, namely, a Falcon aircraft with a computational grid composed of 45387 nodes and 255944 elements, in subsonic flight conditions. This test case is representative of a complete aeronautical configuration of an industrial airplane, with body, wing, and nacelles, and it is the only subsonic test case. The mesh is quite coarse, and we have used this test case to validate the Krylov accelerators and the basic properties of the preconditioners (see also [23, 45] for a more extensive analysis on this subject).

Slip boundary conditions are imposed on the airplane. Since both the problem and the boundary conditions are symmetric, the computational domain is half a sphere, on whose boundary far-field and symmetry-plane conditions are imposed. The pseudo-time step follows the exponential rule (10). For this test case, only the first-order N-scheme for systems was used, since the aim was to validate the choice of the Krylov accelerator.

Figure 3 reports the norm of the residual (scaled by the norm of the initial residual) versus the linear system solver iterations, for the 10th time step. The accelerators considered are Bi-CGSTAB, CGS, TFQMR, GMRES(25), and GMRES(60). The preconditioner is the one-level Schwarz preconditioner P_AS.

Figure 4 reports the number of iterations needed to reach the prescribed tolerance on the relative residual at each time iteration. From the figure, it is evident that the CFL number has a strong influence on the iterations to converge. In fact, as the CFL number increases, the linear system becomes less symmetric and less diagonally dominant, and hence more iterations are required to converge to the desired accuracy. Note that, for low values of the CFL number, the number of iterations to converge is almost the same with one-level and two-level methods. This may suggest adapting the preconditioner to the problem to be solved, introducing the coarse problem only for certain values of the CFL number.

For every combination of solver and preconditioner, the linear system solver reaches convergence at every time step. GMRES is slightly faster than the other methods: it takes more iterations to converge, but each single iteration is cheaper, since it requires only one matrix-vector product instead of two as in the other methods. However, it is quite sensitive to the dimension of the Krylov space, especially as the number of processors used in the computation increases. The convergence history at the 10th time step (when the CFL number is 10240) reveals that TFQMR has a quite erratic convergence, while the one provided by CGS is (unexpectedly) quite regular. Bi-CGSTAB performs quite well but is still erratic. The differences among the Krylov accelerators are particularly evident as the number of processors grows. This may reflect the fact that the performance of Schwarz preconditioners with no coarse correction worsens with many subdomains.

As a result of this analysis, GMRES appears to be the most interesting of the iterative schemes for our class of problems. This confirms what was already pointed out in [23]. The only drawback of GMRES is its memory requirement. Therefore, if Krylov subspaces of low dimension must be used, one may consider other iterative schemes, like TFQMR or Bi-CGSTAB.

The results also show a reasonable scalability, as illustrated in Figure 5. The whole convergence process is self-regulating; in general, 20–200 linear system iterations are performed at each of the 4 to 16 time levels, as illustrated in Figure 6.

5.2. X29 Aircraft

The next test case is an X29 experimental aircraft, flying at almost transonic speed. It represents the experimental Grumman Model 712, designated X-29A by the USAF, a single-seat jet aircraft fitted with a forward-swept wing mounted at the rear of the fuselage and with canards just behind the cockpit. The pseudo-time step evolution again follows the exponential rule (10).

The initial grid is composed of 136767 nodes and 726713 elements. The first-order N-scheme for systems was used on the nonadapted grid, while a second-order blended scheme was used on the adapted grids. In all the computations, the Jacobian matrix is formed starting from the first-order space approximation to save computational time, because the evaluation of the second-order distributed residual is more time consuming than that of the first-order scheme.

In Table 1, the cost of the following adaptation sequence is reported. The complete solution required at least 8 hours on 16 processors, and less than one hour on 64 processors. This result is mainly due to the specific architecture of the SGI Origin, which allows a processor to use (when required) the memory of another processor; this is notably what happens for the X29 test case when run with only 16 processors, due to the size of the problem, whereas with 64 processors only the local processor memory was used.

The NKS cycle is as follows. Starting from an initial grid, two steps of refinement, derefinement, and structural optimisation gave a second grid, on which a smoothing optimisation was then performed. Then, on a large scale, refinement and optimisation gave a final grid of 1.47 million nodes and 8.7 million elements, which is a very interesting performance test case.

The effect of the smoothing optimisation can be seen in Figure 7. Figure 8 details the solutions on these first refinements. The mesh refinement criteria taken here are segment length and Mach gradients and are detailed in [38].

A comparison of iterations to converge and CPU times is reported in Table 2 for the first 14 iterations of the NKS procedure. In the table, iters indicates the sum of the Krylov iterations over the backward Euler time steps, and CPU time the sum of the (elapsed) CPU times (in seconds) needed to solve the Jacobian linear systems. More detailed results about one-level and two-level Schwarz preconditioners are reported in Table 3. As previously noted, the CFL number has a strong influence on the convergence. One may note that the two-level preconditioner performs better than the one-level one. For small problems, a large coarse space is to be preferred, while bigger problems require a fine tuning of the coarse space dimension to obtain the best numerical results.

It is worth noting that the two-level preconditioner clearly outperforms the one-level one in terms of iterations, while in terms of CPU time the improvement is only of about 10%. This is due to the following facts: (i) the coarse correction requires an additional matrix-vector product, which is computationally expensive; (ii) the solution of the coarse problem requires intensive communication among the processors.

5.3. ONERA M6 Wing

The last test case is the ONERA M6 wing, a widely used test case for 3D flows from subsonic to transonic regimes. The selected transonic conditions result in several shock waves over the wing. In the flow field, a shock originates partway along the wing span; a second, leading-edge shock is caused by the sweep angle of the M6 wing; and near the tip of the wing, a lambda-shock structure is observed.

The first-order N-scheme for systems is used to obtain the solution on the first (nonadapted) grid, while the second-order blended scheme is used for the adapted grids. As for the X29 test case, the Jacobian matrix is formed using the first-order scheme.

The far field is represented by a half sphere with a radius of 12.5 root-chord lengths. The NKS procedure has been performed as follows. The starting nonoptimised grid is composed of 94493 nodes and 666569 elements, and an initial constant solution is used. Then, two steps of refinement are performed; each adaptation phase corresponds to at least two passes of adaptation (refinement, coarsening, and optimisation). The final grid (after two adaptation cycles) is composed of 585725 nodes and 3477090 elements.

The pressure coefficient at 20% of the wing span is reported in Figure 9. The picture on the left shows the improvement from first-order to second-order schemes: the second-order scheme captures the first shock wave, near the leading edge, with much greater precision than the first-order scheme. The over- and undershoots are noticeably reduced by using adapted grids, as one can notice from the picture on the right.

An analysis of the flow field on the upper part of the wing is reported in Figure 10, while Figure 11 reports the nonadapted grid and the adapted grid.

As regards the convergence of the NKS scheme, one can see from Figure 12 that, in the case of 16 processors, the first-order N-scheme converges to the steady-state solution very rapidly. In this phase, the analytical Jacobian is used. These initial iterations are needed to place the shocks in a more or less correct position, so that they do not have to move substantially when second-order schemes are employed. For the second-order schemes, different values of the time-stepping parameters have been used.

As already reported in [23, 46], the iterative convergence of second-order schemes is poor; basically, for nonadapted meshes, the density residuals stagnate after a few orders of magnitude of reduction with respect to the first iteration. The situation is slightly improved on adapted meshes, even if it was not possible to reduce the residual by more than a certain factor with respect to the first iteration on the nonadapted grid.

6. Conclusions

Multidimensional upwind schemes have reached a certain degree of maturity for the solution of the steady-state compressible Euler equations on unstructured grids made up of triangles (in 2D) and tetrahedra (in 3D). They can be used for complex aerodynamic flows in the subsonic, transonic, and supersonic regimes. Nevertheless, it is clear that the full potential of the approach can be revealed only if these methods lead to scalable parallel codes. In particular, this requires, among other things, the definition and implementation of a scalable preconditioner and of a parallel grid adaptation technique to improve the solution accuracy whilst optimising the computational resources. To that aim, we have presented here the adaptive pseudo-transient Newton-Krylov-Schwarz framework.

Most parallel implementations of MURD schemes are based on a one-level Schwarz preconditioner and do not automatically include a grid adaptation procedure. This paper aims to present a complete framework which couples these two aspects with the characteristic features of MURD schemes.

The main kernels of this algorithm have been analysed. A data structure, based on a vertex-oriented decomposition of the grid, has been described. This data structure allows an efficient implementation of all the main computational kernels. Of particular importance is the definition of the preconditioner. One-level and two-level Schwarz preconditioners have been presented. The coarse matrix is constructed automatically and for any computational grid with no input from the user (except for the matrix and the dimension of the coarse space). This simplicity has its roots mainly in the way the restriction and interpolation operators are defined. The results show an improvement of performance with respect to the one-level Schwarz method, which is the preconditioner usually adopted in the literature for multidimensional upwind residual distribution schemes.

Numerical results, obtained on distributed memory computers, have been presented for large-scale computations, in the sense of both problem complexity and mesh adaptation, on parallel machines. The results show that the conjunction of the Krylov-Schwarz approach with a grid adaptation procedure leads to an effective solution of the compressible Euler equations on unstructured grids. The same techniques apply to the Navier-Stokes system of equations solved by similar finite-volume techniques on unstructured grids [18, 47].

Acknowledgments

This paper summarises the activity of the authors, partially financed within the BRITE/EURAM European Union funded project IDeMAS (Contract no. BRPR-CT07-0591) and the Swiss National Science Foundation project no. 200020-112221. The Swiss Federal Office for Education and Science (OFES) is also acknowledged for financial support. The authors would like to thank Professor A. Quarteroni and Professor L. Formaggia for useful discussions and help. VKI is acknowledged for the basic THOR code. The DIT/EPFL is acknowledged for technical assistance.