Abstract

Three different speed-up methods (viz., additive multigrid method, adaptive mesh refinement (AMR), and parallelization) have been combined in order to provide a highly efficient parallel solver for the Poisson equation. Rather than using an ordinary tree data structure to organize the information on the adaptive Cartesian mesh, a modified form of the fully threaded tree (FTT) data structure is used. The Hilbert space-filling curve (SFC) approach has been adopted for dynamic grid partitioning (resulting in a partitioning that is near optimal with respect to load balancing on a parallel computational platform). Finally, an additive multigrid method (BPX preconditioner), which itself is parallelizable to a certain extent, has been used to solve the linear equation system arising from the discretization. Our numerical experiments show that the proposed parallel AMR algorithm based on the FTT data structure, Hilbert SFC for grid partitioning, and additive multigrid method is highly efficient.

1. Introduction

Although an unstructured mesh is highly flexible for dealing with very complex geometries and boundaries [1], the discretization schemes used for these types of meshes (usually in conjunction with control-volume-based finite element methods) are generally much more cumbersome than those used for structured meshes. In particular, the data structure used by an unstructured mesh requires information on "forming points" that are needed to describe the (arbitrary) shape and location of the faces of control volumes where the conservation laws need to be satisfied and to define the adjacent neighbors for each of these control volumes through an appropriate connectivity matrix. These types of operations generally require indirect memory referencing, and, as a consequence, efficient matrix solvers widely used for structured meshes cannot be easily adapted for unstructured meshes. In view of this, computational methods based on unstructured meshes are generally less efficient than those based on structured meshes, particularly those that use a Cartesian grid. In order to utilize the computational efficiency afforded by a Cartesian structured grid while retaining the flexibility and accuracy in the discretization process required to address problems involving complex geometries and large ranges of spatial scales, adaptive mesh refinement (AMR) can be applied to focus the computational effort and memory usage where they are required for an accurate representation of the solution at minimal cost.

There are two main approaches that have been developed for structured grid AMR. The first approach utilizes a block-structured style of mesh refinement [2–5], in which a sequence of nested structured grids at different hierarchies or levels (cell sizes are the same for all grids at the same hierarchy or level) are overlapped with or patched onto each other. Although a tree-like (or directed acyclic graph) data structure is usually used to facilitate the communications between the regular Cartesian grids at the various resolutions, it should be noted that each node in this tree-like data structure represents an entire grid rather than a single cell, with the result that efficient solvers for structured grids can be used to solve the system of algebraic equations resulting from the discretization on each node. However, this type of AMR framework is limited in its flexibility in the sense that the clustering methods used to define the subgrids in the hierarchy of Cartesian grids can leave portions of the computational domain covered by a highly refined mesh where it is not needed (e.g., where the solution is smooth), resulting in wasted computational effort. The hierarchical structured grid approach for AMR developed by Berger and Oliger [2] was originally implemented as a serial code. More recently, a number of researchers [6–8] have implemented the classical Berger and Oliger AMR approach on parallel machines, using an ordinary tree data structure to organize the blocks of grid cells corresponding to grids on different levels (resolutions). The disadvantage of this type of data structure is that an expensive search is required to find all the neighbors of a given computational cell in the ordinary tree. In particular, two levels of the tree need to be traversed on average before a neighbor can be found. More importantly, in the worst-case scenario, the entire tree may need to be traversed in order to find all the neighbors of a cell, making it difficult to parallelize the operations since the tree traversal may need to extend across a number of processors.

In the second approach for AMR, the hierarchy of grids at different levels is represented by a quadtree/octtree data structure in two/three dimensions (2D/3D), with each node representing an individual grid cell (rather than a block of grid cells as in the first approach). Because the mesh is only locally refined in the regions where it is required, the second approach results in increased computational efficiency, increased storage savings, and better control of the grid resolution in comparison with the first approach. In consequence, it is easier to parallelize the second approach for AMR, and, to this purpose, the fully threaded tree (FTT) data structure [9] can be employed to enhance the parallel efficiency of the algorithm. The application of FTT is advantageous in that it guarantees that no more than one level of the tree needs to be traversed to access all the neighbors of a particular cell. As a result, a one-cell-thick layer of ghost cells is all that is required for communication between processors in the parallel computation, because this is sufficient to guarantee that all the nearest neighbors of a given cell can be found.

Although a number of researchers [9–12] use the FTT data structure to manage the AMR, the implementation required to parallelize FTT is only addressed briefly by Popinet [12]. However, Popinet only considers a static approach for partitioning and mapping the mesh onto the processors. In this approach, the grid is partitioned and mapped to the processors only once after the initial grid has been generated. The disadvantage of this approach is that the parallel efficiency will deteriorate as the parallel computation proceeds and as the resolution in the computational grid evolves when the mesh is refined and derefined. Furthermore, to simplify the implementation of the parallelization procedure, Popinet kept the size and level of the cells adjacent to the buffer or ghost region exactly the same as the cells comprising the ghost boundary. Unfortunately, this approach is wasteful in terms of the number of cells required to represent the solution at the boundary between two processors (and also wasteful of the computational resources).

In the present paper, the efficient parallelization of an AMR algorithm that uses the FTT data structure and dynamic partitioning of the mesh is investigated. In particular, the challenging problem associated with dynamic grid partitioning and optimal load balancing for conducting AMR in a massively parallel computational environment is addressed. The paper is organized as follows. In Section 2, the parallelization of AMR is described in detail, which includes (1) domain decomposition and grid partitioning within the context of the FTT data structure using the Hilbert space-filling curve approach; (2) the coarsening and refinement of grids in parallel; (3) construction of the ghost boundary cells for information transfer between processors; (4) application of the BPX (parallel) multigrid approach for solving the algebraic equations arising from the AMR discretization in the parallel environment. In Section 3, numerical experiments using the Poisson equation are conducted in order to demonstrate the parallel efficiency that can be achieved using our proposed approach. Finally, Section 4 contains our conclusions.

2. Numerical Methods

2.1. Modification of the FTT Data Structure

When a computational domain is spatially discretized using AMR, the individual cells in this discretization are usually organized hierarchically in the form of an octtree in 3D (quadtree in 2D). In a conventional oct-tree discretization and representation, such as that used by MacNeice et al. [6], the connectivity information between an individual cell and its neighbors needs to be stored explicitly, with each cell requiring 17 words of computer memory to store the following information:
(i) level of the cell in the tree;
(ii) cell location (position);
(iii) identification (ID) of the parent cell;
(iv) IDs of the eight child cells;
(v) IDs of the six neighbor cells.

In addition to the large memory overhead required to maintain this oct-tree data structure, it is also very difficult to parallelize owing to the fact that the neighbors of each cell are connected (linked) to each other. Consequently, removing or adding one cell in the process of adaptive mesh refinement will affect all the neighboring cells of this one cell. If the IDs of the neighboring cells are not stored explicitly, the oct-tree must be traversed to find these neighboring cells. On average, it is expected that two levels of the oct-tree must be traversed before a neighbor can be found, but in the worst-case scenario it may be necessary to traverse the entire tree to access the nearest neighbors. Tree traversals of this type can extend from one processor to another making it difficult to vectorize and parallelize these operations. The fully threaded tree [9] has been developed to reduce the memory overhead required to maintain the information embodied in the tree and to facilitate rapid and easy access to the information stored in the tree.

In the FTT structure employed by a serial code (its parallel implementation will be discussed later in Section 2.2.2), all cells are organized in groups referred to as octs. As shown in Figure 1 and as the name implies, each oct maintains pointers to eight (child) cells and a parent cell. Each oct incorporates information about its level which is equal to the level of the parent cell and maintains the information about its position in the computational domain. Furthermore, each oct maintains pointers to the parent cells of the six neighboring octs and each of the eight cells in the oct maintains a pointer to the “parent oct’’ (viz., the oct containing the eight child cells, which are cells 1 to 8 in Figure 1) and a pointer to the “child oct’’ if the cell is not a leaf cell (or to a null pointer if the cell has no children). In the FTT data structure proposed originally by Khokhlov [9], each cell of the oct only contained a pointer to its child oct (or to a null pointer if the cell has no children). As a result, finding an individual cell in the original FTT data structure was complicated by the fact that the parent oct of the cell must be accessed first.

In a parallelization of AMR that utilizes the original FTT structure, each oct becomes the basic unit in the grid partitioning, which greatly complicates the load balancing on each processor (because the latter operation is then based on the coarse-grained oct, rather than the fine-grained cell, as the basic unit). In light of this, an extra word of computer memory is added to maintain a pointer to the parent oct for each cell of the oct. With this modification to the FTT structure, it is now possible to traverse each cell instead of each oct as implied in the original FTT data structure, with the result that the neighboring cells of this cell (either parent cells or child cells) can be accessed much more efficiently using the present modified oct structure (see also [12]). This is because even for a leaf cell (whose pointer to its child oct is null) the neighboring information can still be accessed directly through its parent oct. However, this modification incurs a memory cost: the storage requirement is now 3.125 words per cell, rather than the 2.125 words per cell required for the implementation of the original FTT structure. Nevertheless, the memory requirement for our modified FTT oct data structure is still significantly less than that required for the conventional (ordinary) tree data structure (which requires 17 words of storage for each cell).
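To make the bookkeeping concrete, the following is a minimal sketch (not the authors' code) of how the modified FTT storage described above might be declared. The type and field names are illustrative assumptions; the extra parentOct reference per cell is the additional word of memory discussed in the text.

```cpp
// Sketch of the modified FTT storage, assuming 3D octs and integer global IDs.
#include <array>
#include <cstdint>

using CellId = std::int64_t;          // global cell identifier (-1 = null)
constexpr CellId kNull = -1;

// One oct groups eight sibling cells and carries the tree connectivity.
struct Oct {
    int    level;                          // level of the parent cell
    double centre[3];                      // position of the oct in the domain
    CellId parentCell;                     // cell whose refinement created this oct
    std::array<CellId, 6> neighbourParent; // parent cells of the 6 neighbouring octs
    std::array<CellId, 8> child;           // the eight cells contained in this oct
};

// Per-cell storage.  The original FTT keeps only childOct; the modification adds
// parentOct, so any cell can reach its neighbours without first locating its
// parent oct by a search (this is the extra word per cell mentioned above).
struct Cell {
    CellId parentOct;   // oct that contains this cell (the added pointer)
    CellId childOct;    // oct formed by this cell's children, or kNull for a leaf
    double phi;         // example solution variable stored at the cell
};
```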

Using our modified FTT structure, it is now possible to access the neighbors of a cell without the need for a search operation. Figure 2 shows an example of how this can be done: suppose we need to find the four neighbors of cell 6 in the two-dimensional FTT quad structure exhibited here. To access the west and south neighbors of cell 6, cells 5 and 8 can be found directly using a simple add/subtract operation when provided with the parent oct ID for cell 6, which is stored explicitly in our modified quad structure. The north neighbor of cell 6 can be accessed directly from the north neighbor parent cell by using its parent quad structure. To access the east neighbor of cell 6, the east neighbor parent cell 4 is found first through its parent quad structure. Since cell 4 possesses a child quad, the desired east neighbor (or cell 9) can be found through a simple add/subtract operation using the child quad for cell 4.
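As an illustration of this kind of pointer arithmetic, here is a hedged 2D sketch of an east-neighbor lookup through the modified quad links. It assumes a fixed local child ordering (bit 0 = x position, bit 1 = y position) rather than the particular cell numbering of Figure 2, and the hash-table storage anticipates Section 2.2.2; all names are illustrative.

```cpp
// Illustrative 2D neighbour lookup through the modified quad/FTT links.
#include <array>
#include <cstdint>
#include <unordered_map>

using CellId = std::int64_t;
constexpr CellId kNull = -1;

struct Quad {                               // 2D analogue of the oct
    CellId parentCell;
    std::array<CellId, 4> neighbourParent;  // W, E, S, N parent cells of neighbouring quads
    std::array<CellId, 4> child;            // local index i: bit 0 = x, bit 1 = y
};
struct Cell { CellId parentQuad, childQuad; };

std::unordered_map<CellId, Cell> cells;     // hash tables as in Section 2.2.2
std::unordered_map<CellId, Quad> quads;     // keyed by the quad's parent cell ID

// East neighbour of the cell with local index 'i' inside quad 'q'.
CellId eastNeighbour(const Quad& q, int i) {
    if ((i & 1) == 0)                       // sibling inside the same quad:
        return q.child[i | 1];              // flip the x bit (simple add/subtract)
    CellId np = q.neighbourParent[1];       // parent cell of the eastern quad
    const Cell& parent = cells.at(np);
    if (parent.childQuad == kNull)          // eastern neighbour is one level coarser
        return np;
    const Quad& eq = quads.at(parent.childQuad);
    return eq.child[i & ~1];                // same row, western column of that quad
}
```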

2.2. Domain Decomposition for Adaptive Mesh Refinement

Implementing an AMR algorithm in a parallel computational environment requires a grid partitioning strategy. More specifically, the grid has to be decomposed into a number of partitions and each partition needs to be mapped to one processor in the parallel computing system. The grid partitioning has to be done at program execution time since there is usually no prior knowledge about where the grid needs to be refined in order to provide an acceptably accurate solution that reduces the local error below a tolerable threshold. The key issue in the implementation of a computationally efficient AMR algorithm in a parallel computing environment is the development of an appropriate strategy for grid partitioning. This will require the consideration of the following issues. Firstly, the computational load has to be equally distributed amongst the processors of the parallel computing system. Secondly, the volume of data that is required to be transferred between the processors during the partitioning and mapping operations and during the simulation itself needs to be minimized. Finally, the computational cost associated with the load balancing and mapping strategies needs to be (significantly) less than that used for the actual computation (or simulation).

2.2.1. Parallel Load-Balancing Using Space-Filling Curves

Our proposed solution to the grid partitioning and subsequent mapping of partitions-to-processors problem is to utilize the space-filling curve (SFC) approach. There are three important properties associated with an SFC that make it an attractive approach for grid partitioning in the context of AMR: namely, mapping, locality, and compactness [13]. Figure 3 shows two space-filling curves that have been used previously for grid partitioning in parallel computation: namely, the Morton (or N-ordering) and the Hilbert (or U-ordering) space-filling curves [14–16]. These space-filling curves involve the mapping of the points in a higher-dimensional space onto the points along a one-dimensional curve, thus reducing the topology to that of a one-dimensional coordinate along the curve. These curves can be constructed recursively from their underlying generators (e.g., U-shaped and N-shaped generators for the Hilbert and Morton SFCs, resp.), as shown in Figure 3. Here, three different levels of SFCs are shown, which connect 4, 16, and 64 points in the two-dimensional (square) domain at the first, second, and third levels, respectively.

Any point in the computational domain maps to a unique point on a space-filling curve. In view of this, to partition the computational domain in $N$ dimensions (usually, $N = 2$ or 3) and to map each partition to a processor of the parallel computational system, all we need to do is fill the computational domain with an SFC, encode the grid points of this domain as one-dimensional coordinates along this curve, and divide the ordered points thus encoded into equal (or approximately equal) parts (or workloads) in order to achieve the best load balancing. If the workload for each Cartesian cell is the same, the partitioning problem is straightforward; namely, an equal number of points along the SFC should be assigned to each processor. However, it is possible for the workload of each cell to be different. For example, if a cut-cell approach [17] is used to handle a complex geometry on a Cartesian mesh, the workload associated with a cut cell will generally be larger than that associated with a normal cell owing to the fact that more computational effort is required to determine the centroid, volume, and surface area of a cut cell. In this case, different weights need to be assigned to the various cells to reflect the workload associated with them, and these weights should be used in the partitioning and mapping with the SFC. In particular, the points along the SFC need to be partitioned so that the workload assigned to each processor is (at least approximately) equal, as this choice provides the best load balancing.
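A minimal sketch of this weighted one-dimensional partitioning step is given below; it assumes the cells have already been sorted by their SFC coordinate and simply assigns contiguous chunks of approximately equal total weight to successive processor ranks. The names are illustrative, not the authors' implementation.

```cpp
// Sketch: partition cells (already sorted by SFC key) into nRanks contiguous
// chunks of approximately equal total workload.
#include <cstdint>
#include <vector>

struct SfcCell {
    std::uint64_t key;     // coordinate along the space-filling curve
    double        weight;  // estimated workload (e.g., larger for cut cells)
};

// Returns, for each cell (assumed sorted by key), the rank it is assigned to.
std::vector<int> partitionByWeight(const std::vector<SfcCell>& cells, int nRanks) {
    double total = 0.0;
    for (const auto& c : cells) total += c.weight;
    std::vector<int> rank(cells.size());
    double target = total / nRanks, accumulated = 0.0;
    int r = 0;
    for (std::size_t i = 0; i < cells.size(); ++i) {
        rank[i] = r;
        accumulated += cells[i].weight;
        // advance to the next rank once its share of the workload is reached
        if (accumulated >= (r + 1) * target && r < nRanks - 1) ++r;
    }
    return rank;
}
```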

An important property to consider when using a space-filling curve for grid partitioning is the property of compactness which relates to whether the encoding of the coordinates of the curve requires only local information in the 𝑁-dimensional computational domain. This property is important because compactness implies that the SFC can be constructed in parallel on each local processor without the need to exchange information with other neighboring processors. This is because the compactness property implies that it is possible to compute the coordinates of a cell location in the SFC (encoding operation) using only the information embodied in the cell coordinates in the 𝑁-dimensional computational domain. In this regard, both the Morton and Hilbert SFCs satisfy the compactness property. However, we note that construction of an SFC with Morton ordering is generally easier and faster than that with Hilbert ordering owing to the fact that the N-shaped generator used for the Morton ordering does not need to be reflected and/or rotated in order to keep the ends of successive generators adjacent as is required for the U-shaped generator used for the Hilbert ordering (cf. the recursive construction of Hilbert and Morton curves in Figures 3(a) and 3(b), resp.).

Using the integer indexing for Cartesian grids described by Aftosmis et al. [13], the encoding of a Morton integer from a cell's integer coordinates in an $N$-dimensional space is easily accomplished by interleaving the bitwise representations of these integer coordinates. For example, Figure 4 illustrates how this encoding is accomplished for a two-dimensional Cartesian mesh. To map the point $p$ with integer coordinates (6,4) in the two-dimensional Cartesian grid shown here, we first construct the bitwise representation of these coordinates using $m$-bit ($m = 3$) integers to give (110,100). One then immediately obtains the location of $p$ along the Morton curve by simply interleaving the bits of this representation to give a $2m$-bit (here, 6-bit) integer 111000 corresponding to the coordinate of point $p$ along the Morton curve (cf. Figure 4(b)).
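The bit-interleaving step can be written in a few lines. The following sketch reproduces the (6,4) to 111000 example of Figure 4; the function name is illustrative.

```cpp
// Bitwise interleaving of m-bit integer coordinates into a 2m-bit Morton key.
#include <cassert>
#include <cstdint>

std::uint64_t mortonKey2D(std::uint32_t x, std::uint32_t y, int m) {
    std::uint64_t key = 0;
    for (int b = m - 1; b >= 0; --b) {           // from the highest-order bit down
        key = (key << 1) | ((x >> b) & 1u);      // x bit first ...
        key = (key << 1) | ((y >> b) & 1u);      // ... then y bit
    }
    return key;
}

int main() {
    // Point p = (6,4) = (110,100) in binary interleaves to 111000 (Figure 4).
    assert(mortonKey2D(6, 4, 3) == 0b111000);
}
```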

The encoding of a cell's integer coordinates using the Morton ordering is simpler than that using the Hilbert ordering. However, the Hilbert curve preserves locality better than the Morton curve; indeed, the Hilbert curve has the greatest degree of locality possible for any space-filling curve. Locality means that contiguity along the curve implies contiguity in the $N$-dimensional space of the Cartesian mesh. In consequence, with the U-ordering used for the Hilbert curve, cells that are contiguous along the curve are face-neighboring cells of the Cartesian mesh. This is advantageous when using the Hilbert curve as a grid partitioner, as it results in reduced computational overhead arising from interprocessor communications owing to the fact that nearby points in the Cartesian grid are likely to be assigned to the same processor. For this reason, we have chosen to use the Hilbert SFC as the grid partitioning algorithm in conjunction with the modified FTT data structure (described earlier). The penalty to be paid here is that encoding a Hilbert integer is more complex than encoding a Morton integer, owing to the fact that rotations and inversions of the U-shaped generator (used for the recursive construction of the Hilbert curve) are required in order to ensure that the ends of successive generators are kept adjacent (see Figure 3(a)). This complicates the encoding of a Hilbert integer, but there exist orientation and ordering tables [14, 15] that facilitate this encoding.

The algorithm that is used for encoding a Hilbert integer to represent the one-to-one mapping between points in an $N$-dimensional computational domain and points on the Hilbert curve can be described as follows. We focus the description on a two-dimensional computational domain, but the methodology can be applied to higher-dimensional spaces [14, 15]. For this case, the U-ordering of the Hilbert curve, involving four orientations and orderings in two dimensions ($N = 2$), is summarized in Table 1. Here, the rows of the ordering and orientation tables determine the ordering and orientation of the offspring when a parent with the orientation given by that row is refined. Figure 5 illustrates how to generate the first two levels of a two-dimensional Hilbert ordering using the ordering and orientation tables shown in Table 1. The root quadrant has orientation 00, so the offspring at the first level are ordered in accordance with the Morton index sequence provided by row 00 of the ordering table: {00,01,11,10}. Next, these offspring are assigned orientations according to row 00 of the orientation table: {01,00,00,10}. The next (second) level uses this information to determine the order and orientation of the offspring of these quadrants. For the example shown in Figure 5(a), the orientation for quadrant 00 is 01, so we use row 01 of the ordering and orientation tables to ascertain the offspring order (at the second level) and their offspring orientations (at the third level) as {00,10,11,01} and {00,01,01,11}, respectively. In a similar manner, the offspring order and orientation at the second level for the other three quadrants can be determined, yielding the results summarized in Figure 5(b). Proceeding in this fashion, all higher levels of the two-dimensional Hilbert ordering can be generated recursively using this simple and efficient procedure.

Using the ordering and orientation tables, the encoding of a Hilbert integer corresponding to a point $p$ in an $N$-dimensional computational domain is now straightforward. The encoding is accomplished by taking the highest-order bit from each spatial coordinate and packing them together as 2 bits (for a two-dimensional computational domain) in the order $xy$, where $x$ refers to the highest-order bit of the word. The initial orientation is 0. The packed bits $xy$ are located in row 0 of the ordering table (the row corresponding to the initial orientation); the column index at which they are found constitutes the first two bits of the encoding of the location along the Hilbert curve. The next orientation value is then found by looking up the corresponding element of the orientation table (at that column index in row 0). This recursive procedure continues until the last two bits of the spatial coordinates are processed. At recursion level $i$, the $i$th bits from each spatial dimension of point $p$ are packed together as the data to be looked up in the ordering and orientation tables. The ordering table lookup locates these packed two bits in the row determined by the orientation value from the previous step; the column index at which they are found is appended to form the next portion of the Hilbert integer (or coordinate along the Hilbert curve corresponding to the point $p$). Next, the orientation table is used to find the successive orientation value at the element corresponding to this column index and to the row given by the orientation value determined in the previous step of the recursion. Figure 6 summarizes this procedure for a simple example involving the encoding of a point $p$ in a two-dimensional Cartesian grid with integer coordinates (6,4) as the Hilbert (binary) integer 101110 (or 46 in decimal notation).
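The procedure can be expressed compactly in table-driven form, as in the sketch below. Rows 00 and 01 of the ordering and orientation tables are the ones quoted above; rows 10 and 11 are not given explicitly in the text, so they have been completed here from the recursive construction (they reproduce the worked example of Figure 6) and should be checked against the paper's Table 1.

```cpp
// Table-driven 2D Hilbert encoding following the procedure described above.
#include <cassert>
#include <cstdint>

// ordering[r][c]   : packed bits xy of the child visited c-th by a parent of orientation r
// orientation[r][c]: orientation handed to that child
constexpr int ordering[4][4]    = {{0b00,0b01,0b11,0b10},
                                   {0b00,0b10,0b11,0b01},
                                   {0b11,0b01,0b00,0b10},   // assumed (not quoted in text)
                                   {0b11,0b10,0b00,0b01}};  // assumed (not quoted in text)
constexpr int orientation[4][4] = {{0b01,0b00,0b00,0b10},
                                   {0b00,0b01,0b01,0b11},
                                   {0b11,0b10,0b10,0b00},   // assumed (not quoted in text)
                                   {0b10,0b11,0b11,0b01}};  // assumed (not quoted in text)

std::uint64_t hilbertKey2D(std::uint32_t x, std::uint32_t y, int m) {
    std::uint64_t key = 0;
    int r = 0;                                              // initial orientation
    for (int b = m - 1; b >= 0; --b) {
        int q = (((x >> b) & 1u) << 1) | ((y >> b) & 1u);   // packed bits in the order xy
        int c = 0;
        while (ordering[r][c] != q) ++c;                    // column index = next two key bits
        key = (key << 2) | static_cast<unsigned>(c);
        r = orientation[r][c];                              // orientation for the next level
    }
    return key;
}

int main() {
    // Point p = (6,4) encodes to the Hilbert integer 101110 = 46 (Figure 6).
    assert(hilbertKey2D(6, 4, 3) == 0b101110);
}
```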

The spatial coordinates of points in an $N$-dimensional computational domain are real numbers (rather than integers). Even so, it should be noted that, within the computer, these spatial coordinates will come from a finite set of numbers determined by the finite precision of the computer hardware. This observation suggests a digital representation for the spatial coordinates, which is convenient since, to compute the Hilbert integer corresponding to a given point of the $N$-dimensional computational domain using the ordering and orientation tables, we need to provide a representation of the various spatial coordinates of the point in terms of an unsigned integer. To this end, taking the computer word length to be $W$ bits (usually $W = 32$) implies that an unsigned integer can be used to encode up to $W$ levels of refinement of a Hilbert SFC. In view of this, it is convenient to map the $N$-dimensional computational domain into the unit hypercube $[0,1]^N$ and let the available spatial coordinates in this hypercube be odd multiples of $2^{-(W+1)}$, labelled by unsigned integers $k$ from 0 to $2^W - 1$: $x = 2^{-W}(k + 1/2)$.
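For completeness, a small sketch of this quantization step is given below, recovering the unsigned integer $k$ from a coordinate $x$ in the unit interval under the convention $x = 2^{-W}(k + 1/2)$; the function name and the clamping of round-off at the upper boundary are assumptions.

```cpp
// Quantise a real coordinate in [0,1) to the unsigned integer k used by the
// Hilbert encoding, under the convention x = 2^{-W} (k + 1/2).
#include <cmath>
#include <cstdint>

std::uint32_t quantise(double x, int W) {
    double scaled = std::ldexp(x, W);             // x * 2^W
    auto k = static_cast<std::uint32_t>(scaled);  // floor recovers k from x
    std::uint32_t kmax = (W < 32) ? ((1u << W) - 1u) : 0xFFFFFFFFu;
    return (k > kmax) ? kmax : k;                 // guard against round-off at x -> 1
}
```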

Figure 7 illustrates the application of the Hilbert SFC for grid partitioning. Here, we show an example of AMR applied to a two-dimensional Cartesian mesh, which is partitioned on four processors using a Hilbert SFC. In this example, all the cells are assumed to have equal weight in terms of their workload. Only the leaf cells have been shown in Figure 7.

2.2.2. Data Distribution

A shared-memory computer uses a global address space, whereas, in a distributed-memory computer, each processor has its own local address space. In view of this, parallelization of AMR on a distributed-memory computer is more difficult than on a shared-memory computer. This problem arises because if a pointer is used to refer to a cell in the adaptive Cartesian mesh, this pointer will need to be changed when the cell is migrated from one processor to another (during the dynamic adaptive mesh refinement) for otherwise the pointer of this cell will refer to an incorrect memory address on the processor to which it is migrated. To circumvent this problem, we propose that a unique global ID be assigned to each cell in the adaptive Cartesian mesh and used in the parallel code. Because each cell now has a unique identifier, this global ID does not need to be changed during the course of the simulation.

In a tree-based serial code for AMR, it is possible (and easy) to access any cell in the Cartesian mesh as the tree is traversed. However, it is very difficult to access an arbitrary cell in the tree using this technique in a parallel code owing to the fact that the parent of a particular cell may not even reside on the same processor. To address this problem, we use a hash table to store the information for all cells that reside on a particular processor. This technique is better than one based on a linked list owing to the fact that any destination (target) cell can be located immediately once the hash key (a unique entity) for the hash table is given. Here, the global identifier for each cell is used as the hash key because it is unique to the cell. Furthermore, the FTT data structure is maintained to provide the full connectivity information for each cell, such as its parent cell, its child cells, and its neighbor cells. This connectivity information is critical for the discretization of the governing equations. To access a particular cell in the adaptive Cartesian mesh, the ID of the processor to which the cell belongs needs to be known. To this purpose, it makes little sense to explicitly store the processor ID for each cell, because this ID will change dynamically at run time as the adaptive mesh is refined and derefined. Rather, it is more efficient (and very easy) to calculate the processor ID associated with a particular cell from its spatial coordinates using the Hilbert integer encoding. This only requires that the range of the coordinates of the Hilbert curve handled by each processor be stored.
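A schematic of this storage scheme is sketched below, assuming 64-bit global IDs and that each processor stores the first Hilbert key owned by every rank; the names and types are illustrative only.

```cpp
// Sketch of the cell hash table and of recovering a cell's owner rank from the
// Hilbert-key ranges held by each processor.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct CellData { int level; double phi; /* FTT connectivity, etc. */ };

// Cells resident on this processor, keyed by their immutable global ID.
std::unordered_map<std::uint64_t, CellData> localCells;

// rangeStart[r] = first Hilbert key owned by rank r (sorted, one entry per rank).
std::vector<std::uint64_t> rangeStart;

int ownerRank(std::uint64_t hilbertKey) {
    auto it = std::upper_bound(rangeStart.begin(), rangeStart.end(), hilbertKey);
    return static_cast<int>(it - rangeStart.begin()) - 1;   // rank whose range contains the key
}
```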

If a cell is marked to be migrated to another processor as a result of grid partitioning using the Hilbert space-filling curve, all connectivity information (e.g., parent oct data, child oct data, etc.) associated with this cell will also need to be migrated and the FTT data structure for each processor will need to be regenerated. To facilitate this process, the oct data structure is stored in another hash table. We do not need to assign a global ID for each oct to use as the hash key. Instead, we use the global ID of the parent cell of the oct as the hash key.

2.2.3. Adaptive Mesh Refinement and Coarsening

The refined mesh is generated level by level in parallel, starting from a base-level mesh, using some form of user-supplied criterion (which is usually based on an estimate of the local truncation error in the solution). However, we impose a constraint on the refinement/derefinement process in order to maintain the FTT data structure: no neighboring cells with a difference in level greater than one are allowed in the Cartesian mesh. This rule guarantees that no sharp gradients in the solution will exist at the boundaries for each level of the refinement. Figure 8(a) shows an example of the application of this rule. Here, it is seen that once cell A has been refined, cell B cannot remain a leaf cell (a cell with no children) and must necessarily be refined if the constraint stated above is not to be violated. Also, the refinement of a cell in conjunction with the satisfaction of the user-supplied criterion for refinement/derefinement can provoke refinement of cells not only in the immediate neighborhood of the refined cell but also in the neighborhood of these immediate neighbors, leading to a possible propagation of refinement throughout the entire computational domain. As shown in Figures 8(b) and 8(c), after cell B has been marked for refinement, cell C, which is not immediately adjacent to cell A, may need to be refined as well, and this refinement can potentially propagate through the entire domain. Fortunately, the enforcement of the refinement constraint stated above on a level-by-level basis, in conjunction with the user-supplied criterion for mesh refinement/derefinement, alleviates the potential problem associated with a "runaway" refinement process.

We use the following strategy to enforce the refinement constraint, which can be illustrated using Figure 8. Firstly, for each processor, all the cells flagged to be refined are added to a local linked list. Suppose cell A in Figure 8 has been tagged for refinement and added to the local linked list. At this stage, we need to check the state of the two neighbors of the marked cell's parent, which from Figure 8(a) happen to be cells B and D. This process is repeated or iterated until the lengths of all the local linked lists (one for each processor) no longer change. At this point, interprocessor message passing is required. Let us assume that cells B and C belong to different processors. It is necessary to send the global ID of the parent cell of cell B (and its associated oct) to the processor where cell C resides. Since C is a leaf cell and cell B (which is one level higher than cell C) is flagged to be refined, cell C is also flagged to be refined in order to maintain the FTT data structure constraint mentioned previously, and it is appended to the linked list of the processor where cell C resides. Generally, this iterative process converges after about one or two iterations. Figure 9 shows an example of an adaptive Cartesian mesh partitioned on four processors before and after one enforcement of the refinement constraint.
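The following serial sketch illustrates the essence of this constraint-enforcement sweep (flagging any coarser neighbor of a flagged cell and iterating to convergence); the parallel version described above additionally exchanges the global IDs of affected boundary cells between sweeps. The data layout is an assumption made purely for illustration.

```cpp
// Serial sketch: enforce the "no more than one level difference" rule by
// repeatedly flagging coarser neighbours of flagged cells until nothing changes.
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct MeshCell { std::uint64_t id; int level; std::vector<std::uint64_t> neighbours; };

void enforceRefinementConstraint(const std::vector<MeshCell>& mesh,
                                 std::unordered_set<std::uint64_t>& flagged) {
    std::unordered_map<std::uint64_t, const MeshCell*> byId;
    for (const auto& c : mesh) byId[c.id] = &c;
    bool changed = true;
    while (changed) {
        changed = false;
        std::vector<std::uint64_t> toAdd;
        for (std::uint64_t id : flagged) {
            const MeshCell& c = *byId.at(id);
            for (std::uint64_t nid : c.neighbours) {
                const MeshCell& n = *byId.at(nid);
                // a coarser neighbour of a cell about to be refined must refine too
                if (n.level < c.level && !flagged.count(nid)) toAdd.push_back(nid);
            }
        }
        for (std::uint64_t id : toAdd) { flagged.insert(id); changed = true; }
    }
}
```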

Even if the user-supplied criterion for cell coarsening (derefinement) is satisfied, the cell cannot be coarsened immediately; other conditions need to be satisfied before the coarsening can take place. These auxiliary conditions for coarsening can be understood with reference to Figure 10. Firstly, all the children of a cell proposed for coarsening (cell A in Figure 10) should be leaf cells. Secondly, for each neighbor of cell A that is not a leaf, it is necessary to examine the two child cells of that neighbor which border on cell A. In Figure 10, these cells are marked by a circle. If any of these cells is not a leaf and its child cells are not marked to be coarsened, cell A cannot be coarsened. In this coarsening procedure, only the nearest neighbor cells are involved. Owing to the use of the FTT data structure, when a cell is marked to be coarsened it is only necessary to remove the child cells of the coarsened cell, and the corresponding oct consisting of these child cells, from the hash table. In consequence, it is not necessary to regenerate global IDs for all the neighbors of the coarsened cell, as would be required in a parallel AMR code that uses an ordinary tree data structure [6]. The coarsening procedure described here is therefore greatly simplified compared to the procedure described by MacNeice et al. [6].

2.2.4. Construction of the Ghost Boundary Cells for Each Processor

The discretization of the governing equations for various conservation laws for parallel AMR will require the consideration of ghost regions needed to facilitate interprocessor communications. To this end, we will focus on the Poisson equation and use it to illustrate the process that is involved in the construction of ghost boundary cells for parallel AMR. If a second-order accurate scheme is used to discretize the Poisson equation, the ghost boundary needs to be only one cell thick. In the FTT data structure, an oct is the basic structural unit used to organize cells in such a manner that they are able to efficiently locate their neighbors and ascertain their level and position in the adaptive Cartesian mesh. In view of this, before the ghost boundary cells can be constructed, it is necessary to generate the corresponding oct data structure first.

To see how this is done, refer to Figure 11. Here, there are seven cells (marked by the 4 circles, 2 squares, and 1 triangle) in each coordinate direction that are directly related to oct A in the FTT data structure, depending on the level of the neighbor. If the neighboring cells (marked by triangles) and the central cell (marked by the solid circle) are located on the same processor, we can determine directly the level of all the neighbor cells and ascertain which cells depend on the oct A structure. If this is not the case, message passing between two or more processors will be required and this can involve multiple exchanges of data between these processors in order to determine the levels associated with all the neighbor cells. An example of this is shown in Figure 11. For this example, let us assume that octs A and B reside on different processors. To find the southern neighbor cell to A (cell B), interprocessor message passing will be needed to ascertain whether cell B is a leaf cell. If cell B has children as in the example shown here, then additional message passing will be required to determine if cells C and D (two child cells of cell B) are leaf cells. This type of communications overhead involving multiple exchanges of data between processors can be potentially expensive, so to reduce this overhead and hence improve the parallel efficiency, we compute the Hilbert coordinates of all the neighbor cells of A in order to ascertain their processor IDs. After this computation, we simply broadcast (send) the oct data for cell A to these processors.

Once the necessary oct structures have been constructed, the ghost boundary cells required for each processor can be determined from these oct structures. As an example, let us assume that one of the faces of cell A in Figure 12 coincides with an interprocessor boundary. Figure 12 displays all possible cells (marked by the 8 circles, 4 squares, and 2 triangles) potentially associated with cell A that will be required to construct the ghost boundary cells for this cell, depending on the level of the neighbor. The latter information is contained in the oct structure for the ghost boundary cells. An example of ghost boundary cells is shown in Figure 13, which displays the local leaf cells along with their corresponding ghost leaf boundary cells on two processors.

2.3. Discretization of the Poisson Equation

We will use the Poisson equation to illustrate aspects of the parallel AMR algorithm proposed in this paper. The Poisson equation has the general form
\[
\nabla^2 \phi = f. \qquad (2.1)
\]
Using Stokes' theorem, the integral form of the Poisson equation can be expressed as
\[
\oint_S \nabla\phi \cdot \mathbf{n} \, ds = \int_V f \, dV, \qquad (2.2)
\]
where $V$ is the control volume, $S$ is the control surface, and $\mathbf{n}$ is the unit normal to the control surface $S$.

Given a control volume (cell), the discrete form of (2.2) can be written as
\[
\sum_d S_d \, \nabla_d \phi = a f, \qquad (2.3)
\]
where $d$ refers to a coordinate direction, $S_d$ is the surface area of a face of the control volume normal to the direction $d$, and $a$ is the volume of the cell. The gradient $\nabla_d \phi$ is evaluated at the center of the control-volume face in the direction $d$.

To achieve second-order accuracy in the discretization, the gradient flux of $\phi$ at the interface between cells at different levels is approximated using the value of $\phi$ evaluated at an auxiliary node, as illustrated in Figure 14. Here, using the auxiliary node $P'$, the gradient flux through the east face $F_e$ can be approximated as follows:
\[
F_e = S_e \nabla_e \phi \approx \frac{\phi_E - \phi_{P'}}{x_E - x_{P'}} S_e. \qquad (2.4)
\]
Because $\phi_{P'}$ can be approximated as $\phi_{P'} \approx \phi_P + (\nabla\phi)_P \cdot (\mathbf{r}_{P'} - \mathbf{r}_P)$ (where $\mathbf{r}_{P'} - \mathbf{r}_P$ is the position vector from $P$ to $P'$), the gradient flux on the east face of cell $P$ can be written as
\[
F_e = \frac{\phi_E - \phi_P}{x_E - x_P} S_e + \frac{S_e}{x_E - x_P} \left[ \left( \frac{\partial\phi}{\partial y} \right)_E (y_e - y_E) - \left( \frac{\partial\phi}{\partial y} \right)_P (y_e - y_P) \right]. \qquad (2.5)
\]
In (2.5), the gradient flux has been decomposed into two parts: the first involving the cell-centered unknowns ($\phi_P$ and $\phi_E$) and the second involving the cell-centered gradients of $\phi$ [$(\partial\phi/\partial y)_P$ and $(\partial\phi/\partial y)_E$]. We will refer to the first part as the active part and the second part as the lagged part, so
\[
F_e(\phi) = F_e^{\mathrm{active}}(\phi) + F_e^{\mathrm{lagged}}(\phi). \qquad (2.6)
\]
Note that, in our case, at least one of the gradient terms in the lagged part will be zero (because $y_e$ and either $y_P$ or $y_E$ will have the same value).
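As a small illustration, the east-face flux of (2.5) and (2.6) might be evaluated as in the sketch below, with the geometry passed in explicitly for clarity. This is a hedged sketch, not the authors' implementation, and the function and variable names are assumptions.

```cpp
// Sketch of the east-face gradient flux of (2.5)-(2.6), split into the active
// part (cell-centred unknowns) and the lagged part (cell-centred gradients).
struct FaceFlux { double active, lagged; };

FaceFlux eastFaceFlux(double phiP, double phiE,        // cell-centred unknowns
                      double dphidyP, double dphidyE,  // lagged cell-centred y-gradients
                      double xP, double xE,            // cell-centre x coordinates
                      double yP, double yE, double ye, // cell-centre and face-centre y
                      double Se) {                     // face area
    double dx = xE - xP;
    FaceFlux f;
    f.active = Se * (phiE - phiP) / dx;
    f.lagged = Se / dx * (dphidyE * (ye - yE) - dphidyP * (ye - yP));
    return f;   // F_e = f.active + f.lagged
}
```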

To approximate the contribution of the lagged part in (2.5), we have to evaluate the cell-centered gradient of $\phi$ at each cell. To this end, we use the least-squares approach [18, 19] for the reason that it is exact for a linear profile and can be easily parallelized, since only values of $\phi$ at the neighbor cells are needed to construct the cell-centered gradient of $\phi$. For an arbitrary cell $P$ and its set of immediate neighbors $\eta_P$, the change in value of $\phi$ between the central cell $P$ and an immediate neighbor cell $j$ ($j \in \eta_P$) is given by the difference $\phi_j - \phi_P$. If $\phi$ is linear, then this difference is given exactly by
\[
\phi_j - \phi_P = (\nabla\phi)_P \cdot (\mathbf{r}_j - \mathbf{r}_P). \qquad (2.7)
\]
However, unless the solution $\phi$ is linear, the gradient here is not exact owing to the fact that cell $P$ has more neighbors than the gradient has components. The least-squares approximation to the gradient is the one that minimizes the cost function given by
\[
E = \sum_{j \in \eta_P} w_j \left[ (\nabla\phi)_P \cdot (\mathbf{r}_j - \mathbf{r}_P) - (\phi_j - \phi_P) \right]^2, \qquad (2.8)
\]
where $w_j$ are weights that we choose as $w_j = 1 / |\mathbf{r}_j - \mathbf{r}_P|^2$.

For the two-dimensional case, the solution to this least-squares problem reduces to the solution of the following linear system of algebraic equations:
\[
\begin{pmatrix}
\sum w_j \, \Delta x_j \, \Delta x_j & \sum w_j \, \Delta x_j \, \Delta y_j \\
\sum w_j \, \Delta x_j \, \Delta y_j & \sum w_j \, \Delta y_j \, \Delta y_j
\end{pmatrix}
\begin{pmatrix} \phi_x \\ \phi_y \end{pmatrix}
=
\begin{pmatrix}
\sum w_j \, \Delta x_j \, \Delta\phi_j \\
\sum w_j \, \Delta y_j \, \Delta\phi_j
\end{pmatrix}, \qquad (2.9)
\]
where
\[
\Delta x_j \equiv x_j - x_P, \qquad \Delta y_j \equiv y_j - y_P, \qquad \Delta\phi_j \equiv \phi_j - \phi_P, \qquad (2.10)
\]
and $\phi_x$ and $\phi_y$ are the two components of the cell-centered gradient $(\nabla\phi)_P$ of $\phi$ at $P$.
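A direct translation of (2.9) and (2.10) assembles the weighted 2x2 normal equations from the neighbor set and solves them by Cramer's rule, as in the sketch below (illustrative names, 2D only).

```cpp
// Least-squares cell-centred gradient from (2.9)-(2.10), with
// inverse-distance-squared weights w_j = 1/|r_j - r_P|^2.
#include <cstddef>
#include <vector>

struct Vec2 { double x, y; };

Vec2 leastSquaresGradient(const Vec2& rP, double phiP,
                          const std::vector<Vec2>& rj,
                          const std::vector<double>& phij) {
    double axx = 0, axy = 0, ayy = 0, bx = 0, by = 0;
    for (std::size_t j = 0; j < rj.size(); ++j) {
        double dx = rj[j].x - rP.x, dy = rj[j].y - rP.y;
        double dphi = phij[j] - phiP;
        double w = 1.0 / (dx * dx + dy * dy);   // w_j = 1/|r_j - r_P|^2
        axx += w * dx * dx;  axy += w * dx * dy;  ayy += w * dy * dy;
        bx  += w * dx * dphi;  by += w * dy * dphi;
    }
    double det = axx * ayy - axy * axy;         // solve the 2x2 system by Cramer's rule
    return { (bx * ayy - by * axy) / det, (axx * by - axy * bx) / det };
}
```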

2.4. Parallel Multigrid Preconditioner with Conjugate Gradient Method

The discretized Poisson equation can be written in the following form:
\[
\mathcal{L}(\phi) = f, \qquad (2.11)
\]
where $\mathcal{L}$ is a linear operator. To solve this linear system efficiently, we will use a multigrid method, which is advantageous because the convergence rate of these methods does not depend on the dimension of the system. There are two kinds of multigrid methods that can be considered, namely, multiplicative and additive multigrid methods [20]. The difference between these two types of multigrid methods can be seen as follows. Consider a "V-cycle" whose correction form for the linear system of (2.11) is
\[
\mathcal{L}(\phi + \delta\phi) = f \;\Longrightarrow\; \mathcal{L}(\delta\phi) = R \quad \text{with} \quad R = f - \mathcal{L}(\phi). \qquad (2.12)
\]
Figure 15 summarizes the sequence of operations involved in the V-cycle multiplicative multigrid method. The values of $\phi$ in (2.11) at the finest grid level are first approximated by performing a few iterations with a smoother (resulting in a smooth error or residual at this grid level). The residual is then calculated on the finest grid and transferred to the next coarser level using a restriction operation. The correction $\delta\phi$ to the solution of (2.12) at the coarser level is then approximated by performing a few iterations with the smoother. This procedure is applied recursively down to the coarsest level. The correction $\delta\phi$ of the solution on the coarsest grid level is then transferred back to the next finer grid level (prolongation operation). After performing a few iterations at this level, the correction of the solution on this level is transferred to the next finer level, until the finest level is finally reached. The whole V-cycle is repeated until the residuals on the finest level have been reduced below some desirable (small) level. An examination of the sequence of operations involved here reveals that the multiplicative multigrid method cannot process the coarse- and fine-grid level information simultaneously (in parallel), because the processing on the coarser grids requires information from the finer-grid residuals and the finer grids require information from the coarser-grid corrections. Hence, the multiplicative multigrid method is not amenable to parallelization across grid levels.

In contrast, the additive multigrid method is amenable to some degree of parallelization. More specifically, in this multigrid method the smoothing operations on the different grid levels can be performed simultaneously (or in parallel). Restriction and prolongation, however, still need to be conducted sequentially in the additive multigrid method. Figure 16 summarizes the operations involved in the V-cycle additive multigrid method. These operations involve three major stages: restriction (in sequence), smoothing (in parallel), and prolongation (in sequence). After the residual (referred to as the defect in Figure 16) of a linear system such as (2.11) has been computed at the finest grid level, it is restricted down level by level from the finest to the coarsest level. On all the grid levels, the correction form of the linear system (2.12) can then be solved in parallel. After this step, the correction $\delta\phi$ is prolongated level by level from the coarsest to the finest level. Owing to the increased level of parallelization allowed by the additive multigrid (or BPX [21]) method (the BPX designation derives from the initials of the last names of the three authors, Bramble, Pasciak, and Xu, who originally developed the method), we choose to use this method for the solution of the discretized Poisson equation. It should be noted that the BPX method generally does not converge if used as a stand-alone solver. For this reason, we use the BPX method as a preconditioner in conjunction with a conjugate gradient method.

A volume-weighted average is applied to restrict the residuals at a given grid level to a coarser grid level:
\[
R^{l} = \frac{1}{4} \sum_{i=1}^{4} R^{l+1}_{i}, \qquad (2.13)
\]
where the summation is over the four children of the cell considered. The correction form of (2.7) is used to prolongate the correction $\delta\phi$ from a coarser grid level to a finer grid level as follows:
\[
(\delta\phi)_{f_i} = (\delta\phi)_{c} + \left( \nabla(\delta\phi) \right)_{c} \cdot \left( \mathbf{r}_{f_i} - \mathbf{r}_{c} \right). \qquad (2.14)
\]
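The per-cell restriction and prolongation operations of (2.13) and (2.14) are simple enough to sketch directly; the 2D version below assumes a fixed indexing of the four children and uses illustrative names.

```cpp
// Volume-weighted restriction (2.13) and gradient-based prolongation (2.14)
// for one coarse cell and its four children in 2D.
#include <array>

struct Vec2 { double x, y; };

// Restriction: the residual on the coarse cell is the average over its 4 children.
double restrictResidual(const std::array<double, 4>& childResidual) {
    return 0.25 * (childResidual[0] + childResidual[1] +
                   childResidual[2] + childResidual[3]);
}

// Prolongation: linear reconstruction of the coarse correction at a child centre.
double prolongCorrection(double dphiCoarse, const Vec2& gradDphiCoarse,
                         const Vec2& rCoarse, const Vec2& rChild) {
    return dphiCoarse + gradDphiCoarse.x * (rChild.x - rCoarse.x)
                      + gradDphiCoarse.y * (rChild.y - rCoarse.y);
}
```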

When a cell and all its child cells are located on the same processor, the restriction and prolongation operations can be done locally without having to incur any communications overhead. Otherwise, the information required between processors in the execution of the BPX method will need to be passed on a level-by-level basis. However, given the oct data structure and the Hilbert coordinates for the cells in the adaptive Cartesian mesh, the global and processor IDs of these cells can be easily determined and used to construct the appropriate communication map.

3. Numerical Results

A two-dimensional Poisson equation is considered in this study as our test problem:
\[
\nabla^2 \phi = -\pi^2 \left( k^2 + l^2 \right) \sin(\pi k x) \sin(\pi l y), \qquad (3.1)
\]
with $k = l = 3$. The computational domain for the test problem is $[-0.5, 0.5]^2$, and Neumann boundary conditions are applied on the four boundaries of the domain. All wall-clock times shown in Tables 2–4 are measured on a Beowulf cluster on SHARCNET [22]. On this cluster, a high-speed wide-area network with a throughput of 1 Gigabit per second was used for the message passing between processors.

The Poisson equation was chosen for our test problem because the ratio of computational work to communications overhead is low compared to the computationally more demanding problem of solving the Navier-Stokes equations. As a result, the simple Poisson equation provides a more difficult test for our algorithm with respect to achieving a high parallel efficiency.

In the first test case, we consider only uniform grids with no adaptive mesh refinement and derefinement. Table 2 shows the wall-clock times for the solution of the Poisson equation (3.1) on a sequence of six different regular grids using up to 64 processors. In these tests, the grid at level $n$ has a size of $2^n \times 2^n$. From Table 2, it is observed that using more processors does not always reduce the wall-clock time monotonically. For example, for the cases corresponding to grids with levels less than 8, the wall-clock time required for the solution of the Poisson equation increases as the number of processors increases from 16 to 32. This is because the communications overhead becomes dominant as the size of the problem handled by each processor becomes smaller. As the overall size of the problem increases (e.g., for problems associated with levels 7 and above), the computational effort outweighs the communications overhead, so the parallel efficiency increases. For the last case in Table 2, a parallel efficiency of $T_8 / \left[ (64/8) \, T_{64} \right] \times 100\% = 98\%$ is achieved with 64 processors relative to the wall-clock time on 8 processors (here, $T_n$ denotes the wall-clock time required to solve the problem using $n$ processors).

In the next test case, we consider various grids generated using AMR. The leaf cells in the grids are refined if $|\nabla\phi|$ is larger than the mean value of this quantity over all the leaf cells. Table 3 provides the wall-clock times for the solution of the Poisson equation (3.1) for 6 different levels of adaptive grids on different numbers of processors. It can be seen from Table 3 that, in conformity with what we observed in Table 2, the total time decreases monotonically as the number of processors increases for those problems with large grid sizes. For the last case in Table 3, a parallel efficiency of $T_8 / \left[ (64/8) \, T_{64} \right] \times 100\% = 106\%$ is achieved. This superlinear speedup is likely due to the efficient use of cache memory when the grid size handled by each processor becomes small enough that the information in the grid can be stored in the cache memory.

Table 4 shows the wall-clock time for solving the Poisson equation (3.1) using the Hilbert SFC for grid partitioning and subsequent mapping of the data to each processor. The percentage of the total computational time required for the grid partitioning is shown in the same table for up to 64 processors. It is observed from Table 4 that this percentage increases slightly when a large number of processors is used. This is because a larger volume of data needs to be migrated over a larger number of processors. However, the method proposed here for parallel AMR is still very efficient. For example, in the case of 64 processors, the ratio of the time spent in the grid partitioning to the total computational time is only $2.2 \times 10^{-3}$. Taken together, the results of Tables 3 and 4 demonstrate that our proposed parallel AMR algorithm, when used in conjunction with the additive multigrid method and the Hilbert SFC for grid partitioning, is highly efficient. Furthermore, comparing the run times summarized in Tables 2 and 4, it can be seen that combining the three different speed-up methods resulted in speed-up factors of $381.52/0.011 \approx 34{,}700$ and $48.7/0.0085 \approx 5{,}730$ using 8 and 64 processors, respectively, when comparing the parallel AMR solution against the solution obtained using a uniform grid with 10 levels (viz., a uniform 2D grid with $2^{10} \times 2^{10}$ cells).

4. Conclusions

From a numerical point of view, it is important to develop an efficient Poisson solver, since many numerical methods for more complex systems of equations (e.g., the Navier-Stokes equations) require the solution of the Poisson equation. In this paper, three different speed-up methods (viz., additive multigrid, adaptive mesh refinement, and parallelization) are combined to solve the Poisson equation efficiently. Instead of using an ordinary tree data structure to organize the information on the adaptive Cartesian mesh, an FTT data structure is used because this structure requires minimal computer memory and all operations on the FTT can be performed efficiently and are amenable to parallelization. The Hilbert space-filling curve approach has been used in conjunction with the FTT for dynamic grid partitioning (resulting in a partitioning that is near optimal with respect to load balancing on a parallel computer system). Furthermore, the refinement and derefinement of the adaptive Cartesian mesh can be parallelized as well. Finally, an additive multigrid method (BPX preconditioner), which itself is parallelizable to a certain extent, has been used to solve the linear equation system arising from the discretization. Our numerical experiments show that the proposed parallel AMR algorithm based on the FTT data structure, Hilbert SFC for grid partitioning, and additive multigrid method is highly efficient.

Acknowledgments

This work has been supported by the Chemical Biological Radiological-Nuclear and Explosives Research and Technology Initiative (CRTI) program under project no. CRTI-07-0196TD. This work was made possible by the facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET).