Abstract
We construct quasioptimal elimination trees for 2D finite element meshes with singularities. These trees minimize the complexity of the solution of the discrete system. The computational cost estimates of the elimination process model the execution of the multifrontal algorithms in serial and in parallel sharedmemory executions. Since the meshes considered are a subspace of all possible mesh partitions, we call these minimizers quasioptimal. We minimize the cost functionals using dynamic programming. Finding these minimizers is more computationally expensive than solving the original algebraic system. Nevertheless, from the insights provided by the analysis of the dynamic programming minima, we propose a heuristic construction of the elimination trees that has cost , where is the number of elements in the mesh. We show that this heuristic ordering has similar computational cost to the quasioptimal elimination trees found with dynamic programming and outperforms stateoftheart alternatives in our numerical experiments.
1. Introduction
We present a dynamic programming algorithm to find quasioptimal elimination tree for twodimensional grids with point and edge singularities. We consider two cost functions: one models the sequential solver execution, while the other models the execution cost of a parallel sharedmemory solver. The dynamic programming algorithm finds elimination trees that minimize the appropriate cost function for a multifrontal solver. These minimizers belong to a class of elimination trees obtained by recursive partition of the computational mesh along straight lines. These quasioptimal trees are expressed as graphgrammar productions which define our solver. To optimize execution time, we use the GALOIS scheduler [1]. From the analysis of the quasioptimal trees we propose a heuristic algorithm that constructs in time (where denotes the number of elements of the mesh) the elimination trees with similar performance to the one constructed with the dynamic programming algorithm. To determine the efficiency of our algorithms, we compare our elimination trees to popular alternatives. The comparison is performed using MUMPS as an efficient interface to commonly used elimination algorithms, such as approximate minimum fill, approximate minimum degree, SCOTCH, PORD, METIS, and AMD with quasirow detection. In particular, we estimate the number of FLOPs (floatingpoint operations) in sequential execution of our graphgrammar solver using elimination trees generated by our dynamic programming and heuristic algorithms. We compare these to the FLOPs resulting from execution of MUMPS using execution time as a proxy for FLOPs. We also use numerical experiments to compare the execution times of our sequential and parallel solvers against those of sequential and parallel MUMPS. Seeking to improve the parallel performance, we consider the tree rotation algorithm to well balance the elimination trees for parallel computations. We show that both sequential and parallel graphgrammarbased solver with GALOIS scheduler, using the elimination trees constructed by the dynamic programming optimization as well as the heuristic algorithm, outperform MUMPS with any ordering.
In this paper we present new contributions in the following research areas.
Multifrontal Solvers. The computational cost of the multifrontal solver algorithm depends on the quality of the elimination tree (which in sequential mode can be called an ordering). The problem of finding of an optimal elimination tree for a given mesh resulting in minimal computational cost of either sequential or parallel multifrontal solver algorithm is NPcomplete [2]. However, for a fixed grid it is possible to define a large class of elimination trees, estimate the computational costs for the solver algorithm for each tree, and select the best one in each class. We introduce a dynamic programming algorithm to find elimination trees for a given mesh that minimizes the computational cost of sequential multifrontal direct solver algorithm. The elimination trees obtained by our dynamic programming algorithm are obtained by considering recursive partitions of the computational mesh along straight lines. Thus we call the resulting trees quasioptimal, since we do not consider all possible elimination trees. We also restrict our research to the case of initially structured twodimensional grids with rectangular finite elements, where the partitions along straight lines are possible to implement. From our experience deriving quasioptimal elimination trees for ordering of meshes, we developed insights and abstractions that allowed us to propose a heuristic algorithm that constructs elimination trees with similar properties to the trees constructed by dynamic programming algorithm. The heuristic algorithm can construct the trees in computational cost, where is the number of elements in the mesh. We have executed the multifrontal solver algorithm for some representative grids, namely, for grids with a point singularity and grids with an edge singularity. We estimated the number of FLOPs of the multifrontal solver algorithm for the elimination trees constructed by both the dynamic programming and heuristic algorithms. We compare them to the FLOPs resulting from execution of the stateoftheart ordering algorithms. These are approximate minimum fill, approximate minimum degree, SCOTCH, PORD, METIS (nesteddissection), and AMD with quasirow detection. We show that our elimination trees resulting from both dynamic programming and the heuristic algorithms outperform the alternative ordering algorithms in terms of FLOPs for sequential solver execution.
GraphGrammarBased Solvers. We express our elimination trees and the resulting multifrontal solver as a sequence of graphgrammar productions. Namely, the graphgrammar productions construct frontal matrices, merge Schur complement matrices of children nodes of each node of the binary elimination tree, eliminate fully assembled degrees of freedom, and execute backward substitutions. The graphgrammar productions are implemented in the GALOIS environment [1]. The dependency relation between graphgrammar productions follows the structure of the elimination tree and can be expressed as a directed acyclic graph (DAG). The graphgrammar productions are implemented as GALOIS tasks working on the DAG obtained directly from the elimination tree. We compare the execution time of our graphgrammarbased GALOIS solver with the execution time of sequential MUMPS and show that our solver outperforms this implementation.
Parallelism. We execute the dynamic programming algorithm with a modified cost function that reflects the computational cost of the parallel sharedmemory multifrontal solver algorithm. The dynamic programming algorithm finds elimination trees that minimize the computational cost for the parallel sharedmemory multifrontal solver, within the class of elimination trees obtained by recursive partitions of the computational mesh along the straight lines. We use tree rotation to improve the balancing of the obtained elimination trees. As before, we express the resulting solver as a sequence of graphgrammar productions which can be optimally scheduled using the DAG analysis of GALOIS. These optimally scheduled graphgrammar productions are run on multithreaded execution on a sharedmemory machine. We use four different elimination trees: the quasioptimal dynamic programming using a multithreaded cost function and its rotated tree as well as a heuristic elimination tree and its rotated counterpart. We compare these four solvers against parallel MUMPS on the same machine. All four solvers outperform MUMPS.
1.1. Finite Element Method (FEM), Multifrontal Solver, and Elimination Trees
In this paper we focus on a class of two dimensional structured meshes with rectangular finite elements, subject to refinement as it is described by Demkowicz in [3]. Let us focus on a simple 2D finite element mesh. The domain is described by two elements and fifteen supernodes, that is, two interiors, seven edges, and six vertices (see Figure 1). In the 2D adaptive FEM, described in [3], we utilize basis functions related to abstract element supernodes. In this example (see Figure 2), we have linear basis functions associated with element vertices, namely, with supernodes 1, 3, 5, 11, 13, and 15, quadratic basis functions associated with element edges, namely, with supernodes 2, 4, 6, 8, 10, 12, and 14, and quadratic basis functions associated with element interiors, namely, with supernodes 7 and 9. This case corresponds to the polynomial order of approximation . In the general case of order , we have basis functions related to each element edge and basis functions related to each element interior. In 2D FEM [3], we construct the algebraic system by computing inner products of these basis functions or their derivatives over the analyzed domain. Thus, each entry of the resulting matrix system corresponds to the interaction between particular basis functions. The numerical values of these interactions are determined by the choice of weak form and the support of the basis functions. The connectivity, which is the controlling characteristic of the computational complexity, is only determined by the supports and the weak form. Interior basis functions have support over an element only, edge basis functions have support over one or two elements, and vertex basis functions have their support spread over one or many elements, depending on the grid connectivity.
(a)
(b)
(c)
The multifrontal solver introduced by Duff and Reid [4, 5] is a popular solver for systems of linear equations, which is a generalization of the frontal solver algorithm described in [6]. In a multifrontal solver, connectivity analysis is performed using an elimination tree. The elimination tree in classical solvers is obtained from a planar graph analysis. The graph is constructed based on the sparsity of the global matrix. In our solver, however, we construct the elimination tree by analyzing the mesh. A computational domain is decomposed into a hierarchy of subdomains, which defines an elimination tree (Figure 3). The construction of the elimination tree for an arbitrary mesh is a complex task. The elimination tree is constructed using the graph representing the connectivities of the mesh. This graph is partitioned using algorithms such as nesteddissection from the METIS library [7, 8] or minimum degree algorithms [9]. Usually, solvers like MUMPS [10–12] are not aware of the structure of the mesh, and they need to reconstruct the connectivity pattern by analyzing the sparsity pattern of the matrix given to the solver. The sparse representation of the mesh connectivity for linear order FE method, finite differences, and particle methods directly implies the mesh (or the topological structure of the mesh). For the highorder FE methods, the sparse representation does not precisely reflect the mesh structure (it represents the discretization explicitly). In the multifrontal approach, the solver generates a frontal matrix for each element of the mesh. This is illustrated in Figures 4 and 5. Fully assembled supernodes are eliminated within each frontal matrix, and the resulting Schur complement matrices are merged at the parent level of the tree. This is illustrated in Figure 6. Finally, the solver computes the solution at the elimination tree root node followed by backward substitutions at child nodes. This process is presented in Figure 7.
1.2. GraphGrammarBased Solver
The topological structure of the mesh [13–16] as well as the multifrontal solver algorithm [17, 18] can be expressed by graphgrammar productions [19–22]. In this section we express the multifrontal solver algorithm by graphgrammar productions that directly follow the structure of the elimination tree. We present the implementation of the multifrontal solver in the GALOIS system for sequential and concurrent execution of graphgrammar productions. The input for our solver is the elimination tree, coded in the following way: 2 < polynomial order of approximation 2 < number of elements 1 1 0 0 1 1 < first element id (1, 1) level 1, element 1, and its coordinates (0,0), (1,1) 1 2 1 0 2 1 < first element id (1, 2) level 1, element 2, and its coordinates (1, 0), (2, 1)
3 < number of nodes in elimination tree 1 2 1 1 1 2 2 3 < tree node id = 1, 2 elements, (1, 1) and (1, 2), pointers to son nodes 1, 2 2 1 1 1 < tree node id = 2, 1 element (1, 1) no son nodes 3 1 1 2 < tree node id = 3, 1 element (1, 2) no son nodes.
This example tree corresponds to the case presented in Figure 3.
Given the elimination tree, the operations performed by the solver can be coded as graphgrammar productions, working over the elimination tree. In particular, each merging and elimination operation can be represented as a single graphgrammar production. The above example contains the following graphgrammar productions: The dependency relation between these graphgrammar productions strictly follows the elimination tree and it is represented as a directed acyclic graph (DAG). This representation is equivalent to the one obtained by the trace theory [17, 23]. These graphgrammar productions are implemented as GALOIS tasks working on the DAG representing the elimination tree. The DAG for our simple example is presented in Figure 8. We start from graphgrammar productions located at the leaves of the elimination tree, then we go up to the root, and finally we go back down to the leaves. The tasks are then managed and scheduled by GALOIS [1]. That is, there is a direct relation between the structure of the elimination tree, the graphgrammar productions, the dependency relation between them, and the scheduling based on DAG in GALOIS. Thus, we present the elimination trees generated by our algorithms and we refrain from listing the graphgrammar productions.
2. Computational Cost Estimates for Sequential and Parallel Multifrontal Solver Algorithm
In order to estimate the number of floatingpoint operations (FLOPs) executed by the multifrontal algorithm we start with estimation of the FLOPs number during elimination of rows from square matrix of size (see Figure 9). The FLOPs number as derived in [24] is equal to
To estimate the sequential execution cost of the multifrontal solver we sum up the costs of all the nodes of the elimination tree. To estimate the parallel sharedmemory execution cost of the multifrontal solver we sum the maximum cost of each level of the elimination tree. Additionally we assume that we have enough cores to process all frontal matrices from all levels of the elimination tree in parallel, which is not always the case in practical computations. This is illustrated in Table 1.
Our solver uses partial pivoting. The pivoting is performed over the local frontal matrices at all the levels of the elimination tree. Partial pivoting compares the values of all the entries located below the diagonal within the rows that are fully assembled, which can be eliminated at this level. We do not pivot with nonfully assembled rows. From the point of view of the FLOPs, pivoting does not require FLOPs; it requires a few comparisons followed by a swift of the integers in the vector representing rows order. Thus we do not include the cost of pivoting in the cost function. The implementation of the pivoting requires just one loop through diagonal column, followed by swap of the two indexes in the row indexes vector. The cost of pivoting is negligibly small in comparison to the factorization itself. In our previous papers [18, 25] we have developed one twodimensional solver and one threedimensional solver using such partial pivoting algorithm and we have solved a number of computational problems with , , and adaptivity, including linear elasticity [26], Poisson equation [27], Maxwell equations [28], propagation of acoustics waves over the human head [29], and the Stokes flow problem [30]. We have not encountered convergence or roundoff error problems.
3. QuasiOptimal Elimination Trees by Dynamic Programming
The search for the optimal elimination tree can be represented by the directed acyclic graph (DAG) presented in Figure 10. The DAG root node represents the entire computational mesh, while child DAG nodes represent possible partitions of the mesh. We consider partitions of the mesh along straight lines which can be either horizontal or vertical. Thus, child DAG nodes of a root DAG node represent all possible partitions of the root along straight lines. We repeat this partition process recursively, until we reach leaves representing single finite elements. Some subbranches of the DAG are identical. For example, we identify identical branches in Figure 10 by red or green. These subbranches do not need to be regenerated, since we can use the pointer to an already generated identical subbranch. The elimination trees are represented as binary subtrees of this DAG. The optimization procedure is executed twice, once for each cost function defined recursively below. One cost function corresponds to sequential solver cost; the other one corresponds to parallel solver cost. The cost of processing the internal DAG node is defined as for the optimization performed for the sequential solver execution and for the optimization performed for the parallel sharedmemory solver execution. Again, this is only true under the assumption that we have enough available threads to process all frontal matrices from a given level of the elimination tree at the same time, which is not always the case in practical application. Each node of the elimination tree contains a frontal matrix with size having some number of fully assembled degrees of freedom. Leaf nodes contain element frontal matrices with fully assembled internal supernodes which can be eliminated. The cost of elimination of fully assembled supernodes from frontal matrix of size has been defined in 2. This is just the number of operations for the partial forward elimination algorithm. Given a geometric description of the finite element mesh, the dynamic programming algorithm works in two steps. In the first step, the DAG representing the subproblems and dependency relations between them is constructed. Then, the DAG is optimized in a bottomup approach. The DAG is constructed as follows. The algorithm adds a first DAG node to the DAG corresponding to the initial mesh. At any subsequent step , any unprocessed DAG node is processed and this DAG node is marked as processed. The algorithm terminates once all DAG nodes are processed. To process a DAG node we list all possible bisections of the (sub)mesh the DAG node represents, which we denote as a nodal mesh. The nodal submesh bisections use straight vertical or horizontal lines to split the mesh into two. These straight lines are called separators. For each separator (bisection of the submesh) two children DAG nodes are assigned to the parent DAG node under analysis which are formed by an edge of the graph. Once all possible separators are applied the DAG node analysis is complete and is marked as processed.
After completing the construction of the DAG, we start the optimization stage based on a cost analysis that is built as follows. First, we assign to each DAG node with zero out degrees the cost of evaluating its Schur complement. These DAG nodes are called sinks and correspond to individual finite elements. All DAG nodes with descendants are called parents. The cost assigned to each parent corresponds to the child DAG nodes of partitions with minimal cost, that is, for a DAG node with only sink children, the cost corresponding to the sum of the children in serial execution or the cost of the most expensive children in parallel as listed in Table 1. For parent DAG nodes with children which have outnode connections, the cost corresponds to the path to sinks with minimal cost. The optimization began by assigning the cost to each sink. Then, for each parent whose children’s cost has been fully processed, each partition is assigned a cost based on the separator used and the full cost of the submeshes. The partitions with minimum cost are kept and all other children DAG nodes are removed. The optimization procedure continues this way until reaching the root of the tree.
The dynamic programming algorithm itself for the case of sequential solver optimization has been described in conference proceedings paper [31]. We also refer to [32] for examples on the usage of the GALOIS solver over these trees. In this paper we focus on the trees constructed by the heuristic algorithm presented in the next section, delivering a computational cost similar to the one obtained by the quasioptimal trees obtained with dynamic programming. Below, we provide a short summary of a quasioptimal elimination tree found by our dynamic programming algorithm for meshes with point and edge singularities with three refinement levels.
3.1. Description of QuasiOptimal Orderings Based on Dynamic Programming
3.1.1. Point Singularity in Sequential Execution
The dynamic programming optimization algorithm sequential execution for a mesh with a point singularity results in the elimination tree presented in Figure 11. The pattern of the elimination tree is invariant with the number of refinement levels. The tree follows levelbylevel elimination pattern.
3.1.2. Point Singularity for a Parallel Solver
This time we have executed the dynamic programming optimization algorithm for the point singularity with the cost function designed for a parallel sharedmemory solver. The obtained elimination tree is presented in Figure 12. The tree is no longer cutting layers, one by one, but rather trying to balance the load over each partition. However, the pattern is not invariant with refinement level.
3.1.3. Edge Singularity in Sequential Execution
We switch now to an edge singularity. For the dynamic optimization algorithm optimizing for sequential solution on a mesh with an edge singularity the optimization procedure resulted in the elimination tree presented in Figure 13. The general elimination tree pattern is invariant with the level of refinements. The optimizer cuts the largest two elements, and then it partitions the remaining mesh into two parts. The optimizer does this sequence recursively until the leaves.
3.1.4. Edge Singularity for a Parallel Solver
As before, switching the cost estimates for parallel execution leads to a tree which is not invariant with the refinement level. The quasioptimal tree is presented in Figure 14. This result of the optimization on the bisection sequence scales to optimize load balancing for the selected separators.
4. Heuristic Algorithm for Construction of the Elimination Trees
The dynamic programming strategy algorithm described above to check the computational cost of the multifrontal solver resulting from elimination trees is obtained by recursive partitioning of the computational mesh along straight lines. There are many such elimination trees for a single mesh, and the dynamic programming algorithm can only be used as a learning tool, since the cost of finding the elimination tree resulting in minimal cost for a large mesh is actually orders of magnitude larger than the cost of the solution itself. The dynamic programming algorithm allowed us to construct a heuristic algorithm that provides similar elimination trees in computational cost, where is the number of elements of the mesh. Thus, we propose an area and neighbors algorithm for construction elimination trees for multifrontal solvers for refined grids.
The heuristic algorithm can be utilized under the following assumptions.(i)The computational mesh is twodimensional, and it is obtained by performing a sequence of isotropic refinements from initial structured regular mesh with rectangular elements.(ii)When constructing the refined mesh, only isotropic refinement is allowed. In other words, selected rectangular elements are always broken into four smaller son elements.(iii)The elements in the initial mesh are numbered, and their numbers are topologically sorted, left to right, row by row.(iv)Each element of the initial mesh has assigned refinement level equal to 1.(v)When the adaptive algorithm breaks an element into four son elements, the refinement level of all son elements is equal to the refinement level of the parent element plus one.(vi)The area of each element is defined as .(vii)When the adaptive algorithm breaks an element into four son elements, they get the new numbers in the global numbering of elements, and their numbers increase clockwise.(viii)The mesh fulfills the 1irregularity rule, telling that an element can be broken only once without breaking an adjacent large element.(ix)When there are one element on one side of an edge and two elements on the other side of the mesh, we call this common edge a constrained edge.(x)When we compute the maximum number of common edges between two adjacent patches of elements in the mesh, we count each constrained edge as one. The assumptions listed above correspond to the computational grids generated by twodimensional adaptive finite element method called described in [3]. However in this paper we consider only uniform polynomial order of approximation (only refinement with uniform ). The exemplary process of generation of the numbering for point and edge singularity is presented in Figures 15 and 16. The corresponding numbering of refinement levels for particular elements is reported in Tables 2 and 3.
(a)
(b)
(c)
(d)
(a)
(b)
(c)
(d)
(e)
(f)
(g)
The area and neighbors algorithm executed on the computational mesh with elements can be summarized as shown in Algorithm 1.

Notice the following remarks.(i)Thanks to our definition of numbering of elements, the initial forest sorted according to the numbering of elements is also sorted according to elements area.(ii)When we rotate the point or edge singularity by 90, 180, or 270 degrees our numbering algorithm followed by the area and neighbors algorithm will generate similar quasioptimal elimination trees resulting in a similar number of FLOPs.(iii)In the general case, the algorithm may not deliver quasioptimal elimination trees, since it is a heuristic algorithm, and the problem of construction of an optimal elimination tree is NPcomplete.
Let us illustrate the heuristic algorithm on the mesh examples with point and edge singularities. The meshes are presented in Figures 15 and 16. Let us focus first on the mesh with point singularity, as presented in Figure 17.(1)The elements are sorted according to their numbering, in the reversed order, which is equivalent to sorting according to their area, from smallest to the largest.(2)We select the subforest with smallest elements , , , with minimum area equal to (cf. Figure 17(a)).(3)We find out that the maximum number of common edges between elements , , , is equal to 1.(4)The two pairs of elements and with minimum area and maximum number of common edges are formed. The elements , , , are removed from the list; the newly created trees with pairs of elements are put at the beginning of the list (cf. Figure 17(b)).(5)The subforest with trees built from smallest patches is now , and the minimum area is equal to (cf. Figure 17(b)).(6)We find out that maximum number of common edges between patches , is equal to 2.(7)We form a new tree from a pair ; the pairs , are removed from the list, and the new tree is added at the beginning of the list (cf. Figure 17(c)).(8)The subforest with trees built from smallest patches is now , , , and the minimum area is equal to (cf. Figure 17(c)).(9)We find out that maximum number of common edges between patches , , , is equal to 1.(10)We form a new tree from a patch and element as well as a new tree out of elements . They are removed from the list, and the new trees are added at the beginning of the list (cf. Figure 17(d)).(11)The subforest with smallest patches is now , and the minimum area is (cf. Figure 17(d)).(12)We find out that the maximum number of common edges between patches , is equal to 2.(13)We form a new tree from patches . They are removed from the list, and the new tree is added at the beginning of the list (cf. Figure 17(e)).(14)The subforest with smallest patches is now , , , and the minimum area is (cf. Figure 17(e)).(15)We find out that the maximum number of common edges between patches is equal to 1.(16)We form new trees and . The original patches are removed from the list, and the new trees are added at the beginning of the list (cf. Figure 17(f)).(17)The subforest with smallest patches is now and the minimum area is (cf. Figure 17(f)).(18)We find out that the maximum number of common edges between the patches is equal to 2.(19)We form a new tree from patches . They are removed from the list, and the new tree is added at the beginning of the list (cf. Figure 17(g)).(20)The scenario is repeated recursively until one tree is formed. Let us focus now on the case of the mesh with edge singularity, as presented in Figure 18. For the sake of simplicity, we present only a short description for this case. (1)The elements are sorted according to their area.(2)The eight pairs of elements with minimum area and maximum number of common edges are selected.(3)The created eight patches of elements still have minimal areas. They are selected to form four new patches.(4)At this point, patches and have the same minimal area as the four created multielement patches. All these six patches are merged now into two new patches.(5)At this point patches and the two multielement patches have minimal area and they are merged.(6)Now we have to merge two patches that have minimal area.(7)The situation is repeated again.
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
(a)
(b)
(c)
(d)
(e)
(f)
5. Tree Rotation Algorithm
We apply a tree rotation algorithm to improve the balance of the elimination tree [33, 34]. The algorithm browses the tree in breadthfirst search order and performs a sequence of local rotations every time a branch is much deeper than the other one. For a detailed description of the tree rotation algorithm please refer to [33, 34].
6. Numerical Results
6.1. Comparison of Computational Cost for Different Elimination Trees
In this section we compare the number of floatingpoint operations (FLOPs) of our heuristic elimination trees constructed for the meshes with point and edge singularities, with alternative ordering algorithms available through MUMPS. In particular, we compare the number of FLOPs of our GALOIS solver based on our heuristic elimination trees with number of FLOPs required by MUMPS using the ordering provided by approximate minimum fill (AMF), approximate minimum degree (AMD), SCOTCH, PORD, METIS, and AMD with quasirow detection. We utilize sequential version of our GALOIS solver, with heuristic elimination tree, without tree rotations. The results of the comparison are presented in Tables 4 and 5 as well as in Figures 19 and 20 using loglog scale. In all the cases, our heuristic algorithms deliver a lower number of FLOPs. We also estimated the exponent factors for both algorithms and obtain for our algorithm and for the best alternative ordering for the point singularity, as well as for our algorithm and for the best alternative ordering for the edge singularity.
6.2. Comparison of Dynamic Programming and Heuristic Algorithms
In this section we analyze the computational performance of the quasioptimal and heuristic elimination trees in actual implementations. To compare the methods we use the execution time. This allows us to account for FLOPs and memory transfers in this comparison. We schedule the resulting sequences of graphgrammar productions using GALOIS to obtain an optimized performance both in serial and in parallel execution. We compare the solver execution time over three elimination trees:(i)the dynamic programming tree using the sequential cost function, without tree rotations,(ii)the dynamic programming tree using the parallel cost function, without tree rotations,(iii)the heuristic elimination tree, without tree rotations.
The tests are performed on a GILBERT machine with 64 cores. We focus on the mesh with point and edge singularity. The solvers use an increasing number of threads, from 1, 2, 4, 8, 16 to 32. The results are presented in Figure 21. The results indicate that the heuristic algorithms result in similar execution times like the quasioptimal trees generated by the dynamic programming algorithm.
We can draw the following conclusions from the performed numerical experiments.(i)Both dynamic programming orderings provide similar execution times; the parallel ordering becomes slightly faster for a large number of threads.(ii)Our heuristic ordering provides a similar execution time to that of the dynamic programming orderings.(iii)We conclude that we can safely use the elimination trees generated by the heuristic ordering instead of expensive dynamic programming orderings; however, we will continue comparison of these two approaches for other kinds of meshes and in 3D in our future work.
6.3. Comparison with MUMPS
In this example we compare our GALOIS solver with heuristic trees without the rotation algorithm against MUMPS with AMD ordering provided by METIS, since AMD results in a minimum number of FLOPs. We compile our GALOIS based solver with the gcc4.8.0 compiler. Our solver does not use any optimized numerical libraries and is a pure C code. We compare against MUMPS solver version 4.10.0 compiled with gfortran4.8.0 and linked with metis4.0.3, atlas3.10.1, LAPACK3.4.2, and ScaLAPACK2.0.2. In MUMPS we utilize Cholesky factorization (the problem is symmetric, positive definite). We use a simple model problem, the heat transfer equation. In our solver we utilize LU factorization. We compare execution times as well as parallel efficiency and speedup. The tests are performed on a single node of ATARI Linux cluster with 16core Intel Xeon CPU, with 2.4 GHz, total 16 GB RAM. The point singularity results in very small computational meshes, and the comparison of parallel solvers makes no sense there. In the following experiments we focus on the mesh with an edge singularity. The comparison of the execution times for an edge singularity is presented in Figure 22. The comparison of the parallel efficiency for an edge singularity is presented in Figure 23. The comparison of the parallel speedup for an edge singularity is presented in Figure 24.
We can draw the following conclusions from the performed numerical experiments.(i)For small problem sizes (less than 10000 degrees of freedom) MUMPS and GALOIS solvers behave like for point singularity case.(ii)For larger problem sizes both MUMPS and GALOIS speed up when we increase the number of cores.(iii)For larger problem sizes GALOIS scales well up to 8 cores; however MUMPS scales well up to 4 cores only.(iv)For larger problems the GALOIS solver with any thread number outperforms multithreaded MUMPS.
6.4. Tests Using Tree Rotation Algorithm
In this section we compare execution times for the meshes with point and edge singularities, using our heuristic elimination trees, before and after execution of the rotation algorithm. The tests are performed on a GILBERT machine with 64 cores. From these experiments we do not observe a significant improvement on the performance of the proposed heuristic elimination trees. That is, for some instances rotation improves performance while in others it does not. In all cases the performance is comparable.
7. Conclusions
In this paper we discussed a dynamic programming algorithm for finding quasioptimal elimination trees for twodimensional grids with singularities. We performed the optimization for the cost function reflecting sequential and parallel execution. We introduce a heuristic algorithm to construct the elimination trees. These heuristic elimination trees have similar performance to that of the quasioptimal trees obtained with dynamic programming. The quasioptimal and heuristic elimination trees are compared against stateoftheart solvers implemented in popular libraries such as MUMPS. We compare number of FLOPs for each solver (using execution times as proxies for MUMPS) and show that our elimination trees deliver better computational costs in terms of the exponent factors. We also executed the algorithm for rotation of our elimination trees to check if they are well balanced and well suited to parallel computations. We verified experimentally that the tree rotation may improve the execution time of the multifrontal solver working with our elimination trees, but this is not always the case. Finally our elimination trees were expressed as graphgrammar productions and implemented in our graphgrammarbased solver using GALOIS scheduler. The solver is compared with MUMPS. We show that the graphgrammarbased solver outperforms MUMPS for large problems and provides comparable execution times for small ones.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The work of Anna Paszyńska, Maciej Paszyński, Konrad Jopek, Maciej Woźniak, Damian Goik, and Piotr Gurgul was supported by the Polish National Science Center Grants nos. DEC2012/07/B/ST6/01229, DEC2011/03/B/ST6/01393, and DEC2012/06/M/ST1/00363. The work of Keshav Pingali and Andrew Lenerth was supported by NSF Grants CCF 1337281, CCF 1218568, ACI 1216701, and CNS 1064956. Donald Nguyen was supported by a DOE Sandia Fellowship. The work of Hassan AbouEisha, Mikhail Moskkov, and Victor Manuel Calo and visits of Anna Paszyńska, Maciej Paszyński, and Maciej Woźniak at KAUST were supported by the Center for Numerical Porous Media at KAUST. The visits of Maciej Paszyński at ICES were supported by J. T. Oden Research Faculty Fellowship.