Preconditioning for Sparse Linear Systems at the Dawn of the 21st Century: History, Current Developments, and Future Perspectives
Iterative methods are currently the solvers of choice for large sparse linear systems of equations. However, it is well known that the key factor for accelerating, or even allowing for, convergence is the preconditioner. The research on preconditioning techniques has characterized the last two decades. Nowadays, there are a number of different options to be considered when choosing the most appropriate preconditioner for the specific problem at hand. The present work provides an overview of the most popular algorithms available today, emphasizing the respective merits and limitations. The overview is restricted to algebraic preconditioners, that is, general-purpose algorithms requiring the knowledge of the system matrix only, independently of the specific problem it arises from. Along with the traditional distinction between incomplete factorizations and approximate inverses, the most recent developments are considered, including the scalable multigrid and parallel approaches which represent the current frontier of research. A separate section devoted to saddle-point problems, which arise in many different applications, closes the paper.
1. Historical Background
The term “preconditioning” generally refers to “the art of transforming a problem that appears intractable into another whose solution can be approximated rapidly” , while the “preconditioner” is the mathematical operator responsible for such a transformation. In the context of linear systems of equations, where and , the preconditioner is a matrix which transforms (1.1) into an equivalent problem for which the convergence of an iterative method is much faster. Preconditioners are generally used when the matrix is large and sparse, as it typically, but not exclusively, arises from the numerical discretization of Partial Differential Equations (PDEs). It is not rare that the preconditioner itself is the operator which makes it possible to solve numerically the system (1.1), that otherwise would be intractable from a computational viewpoint.
The development of preconditioning techniques for large sparse linear systems is strictly connected to the history of iterative methods. As mentioned by Benzi , the term “preconditioning” used to identify the acceleration of the iterative solution of a linear system can be found for the first time in a 1968 paper by Evans , but the concept is pretty older and was already introduced by Cesari back in 1937 . However, it is well known that the earliest tricks to solve numerically a linear system were already developed in the 19th century. As reported in the survey by Saad and Van der Vorst  and according to the work by Varga , Gauss set the pace for iterative methods in an 1823 letter, where the famous German mathematician describes a number of clever tricks to solve a singular linear system arising from a geodetic problem. Most of the strategies suggested by Gauss compose what we currently know as the stationary Gauss-Seidel method, which was later formalized by Seidel in 1874. In his letter, Gauss jokes saying that the method he suggests is actually a pleasant entertainment, as one can think about many other things while carrying out the repetitive operations required to solve the systems. Actually, it is rather a surprise that such a method attracted a great interest in the era of automatic computing! In 1846 another famous algorithm was introduced by Jacobi to solve a sequence of linear systems enforcing diagonal dominance by plane rotations. Other similar methods were later developed, such as the Richardson  and Cimmino  iteration, but no significant improvements in the field of numerical solution of linear systems can be reported until the 40s. This period is referred to as the prehistory of this subject by Benzi .
As it often occurred in the past, a dramatic and tragic historical event like World War II increased the research fundings for military reasons and helped accelerate the development and progress also in the field of numerical linear algebra. World War II brought a great development of the first automatic computing machines, which later led to the modern digital electronic era. With the availability of such new computing tools, the interest in the numerical solution of relatively large linear systems of equations suddenly grew. The more significant advances were obtained in the field of direct methods. Between the 40s and 70s the algorithms of choice for solving automatically a linear system are based on the Gaussian elimination, with the development of several important variants which helped greatly improve the native method. First, it was recognized that pivoting techniques are of paramount importance to stabilize the numerical computations of the triangular factors of the system matrix and reduce the effects of rounding errors, for example, . Second, several reordering techniques were developed in order to limit the fill-in of the triangular factors and the number of operations required for their computation. The most famous bandwidth reducing algorithms, such as Cuthill-McKee and its reverse variant , nested dissection [11–13], and minimum degree [14–16], are still very popular today. Third, appropriate scaling strategies were introduced so as to balance the system matrix entries and help stabilize the triangular factor computation, for example, [17, 18].
In the meantime, iterative methods lived in a kind of niche, despite the publication in 1952 of the famous papers by Hestenes and Stiefel  and Lanczos  who discovered independently and almost simultaneously the Conjugate Gradient (CG) method for solving symmetric positive definite (SPD) linear systems. These papers did not go unnoticed, but at that time CG was interpreted as a direct method converging in steps, if is the number of distinct eigenvalues of the system matrix . As it was soon realized , this property is lost in practice when working in finite arithmetics, so that convergence is actually achieved in a number of iterations possibly much larger than . This is why CG was considered an attractive alternative to Gaussian elimination in a few lucky cases only and was not rated much more than just an elegant theory. The iterative method of choice of the period was the Successive Overrelaxation (SOR) algorithm, which was introduced independently by Young  in his Ph.D. thesis and Frankel  as an acceleration of the old Gauss-Seidel stationary iteration. The possibility of estimating theoretically the optimal overrelaxation parameter for a large class of matrices having the so-called property A  gained a remarkable success for SOR especially in problems arising from the discretization of PDEs of elliptic type, for example, in groundwater hydrology or nuclear reactor diffusion .
In the 60s the Finite Element (FE) method was introduced in structural mechanics. The new method had a great success and gave rise to a novel set of large matrices which were very sparse but neither diagonally dominant nor characterized by property A. Unfortunately, SOR techniques were not reliable in many of these problems, either converging very slowly or even diverging. Therefore, it is not a surprise that direct methods soon became the reference techniques in this field. The irregular sparsity pattern of the FE-discretized structural problems gave a renewed impulse to the development of more appropriate ordering strategies based on the graph theory and more efficient direct algorithms, leading in the 70s and early 80s to the formulation of the modern multifrontal solvers [13, 24, 25]. In the 80s the first pioneering commercial codes for FE structural computations became available and the solvers of choice were obviously direct methods because of their reliability and robustness. At this time the solution of sparse linear systems by direct techniques appears to be quite a mature field of research, well summarized in famous reference textbooks such as the ones by Duff et al.  and Davis .
In contrast, during the 60s and 70s iterative solvers were living their infancy. CG was rehabilitated as an iterative method by Reid in 1971  who showed that for reasonably well-conditioned sparse SPD matrices the convergence could have been reached after far fewer than steps, being the size of . This work set the pace for the extension of CG to nonsymmetric and indefinite problems, with the introduction of the earliest Krylov subspace methods such as the Minimal Residual (MINRES)  and the Biconjugate Gradient (Bi-CG)  algorithms. However, the crucial event for the future success of Krylov subspace methods was the publication in 1977 by Meijerink and van der Vorst of the Incomplete Cholesky CG (ICCG) algorithm . Incomplete factorizations had been already introduced in the Soviet Union  and independently by Varga  in the late 50s, and the seminal ideas for improving the conditioning of the system matrix in the CG iteration can be found also in the original work by Hestenes and Stiefel  and in Engeli et al. , but Meijerink and van der Vorst were the first who put the things together and showed the great potential of the preconditioned CG. The idea was later extended by Kershaw  to an SPD matrix not necessarily of -type and gained soon a great popularity. The key point was that CG as well as any other Krylov subspace method with a proper preconditioning could become competitive with the latest direct methods, though requiring much less memory and so being attractive for large three-dimensional (3D) simulations.
The 80s and 90s are the years of the great development of Krylov subspace methods. In 1986 Saad and Schultz introduced the Generalized Minimal Residual (GMRES) method  which soon became the algorithm of choice among the iterative methods for nonsymmetric linear systems. Some of the drawbacks of Bi-CG were addressed in 1989 by Sonnenveld who developed the Conjugate Gradient Squared (CGS) method , on the basis of which in 1992 van der Vorst presented the Biconjugate Gradient Stabilized (Bi-CGSTAB) algorithm . Along with GMRES, Bi-CGSTAB is the most successful technique for nonsymmetric linear systems, with the advantage of relying on a short-term recurrence. As demonstrated in the famous Faber-Manteuffel theorem , Bi-CGSTAB is not optimal in the sense that it does not theoretically lead to an approximate solution with some minimum property in the current Krylov subspace; however, its convergence can be faster than GMRES. Another family of Krylov subspace methods includes the Quasi-Minimal Residual (QMR) algorithm , with its transpose-free (TFQMR) and symmetric (SQMR) variants [40, 41]. Several other variants of the methods above, including truncated, restarted, and flexible versions, along with nested optimal techniques, for example, [42, 43], have been introduced by several researchers until the late 90s. A recent review of Krylov subspace methods for linear systems is available in , while several textbooks have been written on this subject at the beginning of the 21st century, among which the most famous is perhaps the one by Saad .
Initially, Krylov subspace methods were seen with some suspicion by the practitioners, especially those coming from structural mechanics, because of their apparent lack of reliability. Numerical experiments in different fields, such as FE elastostatics, geomechanics, consolidation of porous media, fluid flow, and transport, for example, [46–51], showed a growing interest for these methods, mainly because they potentially could allow for the solution of much bigger and more detailed problems. Whereas direct methods typically scale poorly with the matrix size, especially in 3D models, Krylov subspace methods appeared to be virtually mandatory with large problems. On the other hand, it became soon clear that the key factor to improve the robustness and the computational efficiency of any iterative solver is preconditioning. Nowadays, it is quite well accepted that it is preconditioning, rather than the selected Krylov algorithm, the most important issue to address.
This is why research on the construction of effective preconditioners has significantly grown over the last two decades, while advances on Krylov subspace methods have progressively faded. Currently, preconditioning appears to be a much more active and promising research field than either direct and iterative solution methods, and particularly so within the context of the fast evolution of the hardware technology. On one hand, this is due to the understanding that there are virtually no limits to the available options for obtaining a good preconditioner. On the other hand, it is also generally recognized that an optimal general-purpose preconditioner is unlikely to exist, so that there is the possibility to improve the solver efficiency in different ways for any specific problem at hand within any specific computing environment. Generally, the knowledge of the governing physical processes, the structure of the resulting system matrix, and the available computer technology are factors that cannot be ignored in the design of an appropriate preconditioner. It is also recognized that theoretical results are few, and frequently “empirical” algorithms may work surprisingly well despite the lack of a rigorous foundation. This is why finding a good preconditioner for solving a sparse linear system can be viewed rather as “a combination of art and science”  than a rigorous mathematical exercise.
2. Basic Concepts
Roughly speaking, preconditioning means transforming system (1.1) into an equivalent mathematical problem which is expected to converge faster using an iterative solver. For instance, given a nonsingular matrix , premultiplying (1.1) by yields If the matrix is better conditioned than for a Krylov subspace method, then is the preconditioner and (2.1) denotes the left preconditioned system. Similarly, can be applied on the right: thus producing a right preconditioned system. The use of left or right preconditioning depends on a number of factors and can produce quite different behaviors; see, for example, . For instance, right preconditioning has the advantage that the residual of the preconditioned system is the same as the native system, while with left preconditioning it is not, and this can be important in the convergence check or using a residual minimization algorithm such as GMRES. By the way, with the right preconditioning a back transformation of the auxiliary variable must be carried out to resume the original unknown . If the preconditioner can be written in a factorized form: a split preconditioning can be also used: thus potentially exploiting the advantages of both left and right formulations.
Writing the preconditioned Krylov subspace algorithms is relatively straightforward. Simply, the basic algorithms can be implemented replacing with either , or , or , and then back substituting the original variable where the auxiliary vector is used. It can be easily observed that it is not necessary to build the preconditioned matrix, which could be much less sparse than , but just to compute the product of , or and , if the factorized form (2.3) is used, by a vector. This operation is called application of the preconditioner.
Generally speaking, there are three basic requirements for obtaining a good preconditioner: (i)the preconditioned matrix should have a clustered eigenspectrum away from 0, (ii)the preconditioner should be as cheap to compute as possible, (iii)its application to a vector should be cost-effective. The origin of condition (i) relies on the convergence properties of CG. It is quite intuitive that if resembles in some sense the preconditioned matrix should be close to the identity, thus making the preconditioned system somewhat “easier” to solve and accelerating the convergence. If and are SPD, it has been proved that the iteration count of the CG algorithm to go below the tolerance depends on the spectral conditioning number of the preconditioned matrix , for example : so that clustering the eigenvalues of away from 0 is of paramount importance to accelerate convergence. If is not SPD, things can be much more complicated. For example, the performance of GMRES depends also on the conditioning of the matrix of the eigenvectors of , and the fact that the preconditioned matrix has a clustered eigenspectrum does not guarantee a fast convergence . Anyway, it is quite a common experience that a clustered eigenspectrum away from 0 often yields a rapid convergence also in nonsymmetric problems.
The importance of conditions (ii) and (iii) depends on the specific problem at hand and may be highly influenced by the computer architecture. From a practical point of view, they can also be more important than condition (i). For example, if a sequence of linear systems has to be solved with the same matrix or with slight changes only, as it often may occur when using a Newton method for a set of nonlinear equations, the same preconditioner can be used several times, with the cost for its computation easily amortized. In this case, it could be convenient spending more time to build an effective preconditioner. On the other hand, the computational cost for the preconditioner application basically depends on its density. As has to be applied once or twice per iteration, according to the selected Krylov subspace method, it is of paramount importance that the increased cost of each iteration is counterbalanced by an adequate reduction of their number. Typically, pursuing condition (i) reduces the iteration count but is in conflict with the need for a cheap computation and application of the preconditioner, so that in the end preconditioning efficiently the linear system (1.1) is always the result of an appropriate tradeoff.
An easy way to build a preconditioner is based on the decomposition of . Recalling the definition of stationary iteration, where is the current residual, the matrix can be used as a preconditioner. Splitting as , where , , and are the diagonal, the lower, and the upper parts of , respectively, then are known as the Jacobi, Gauss-Seidel, SOR, and Symmetric SOR preconditioners, respectively, where is the overrelaxation factor. Replacing with in (2.8) and (2.9) gives rise to the backward Gauss-Seidel and SOR preconditioners, respectively. The preconditioners obtained from the splitting of do not require additional memory and have a null computation cost, while their application can be simply done by a forward or backward substitution. With a simple diagonal scaling is actually required.
Because of their simplicity, Jacobi, Gauss-Seidel, SOR, and Symmetric SOR preconditioners are still used in some applications. Assume, for instance, that matrix has two clusters of eigenvalues due to the heterogeneity of the underlying physical problem. This is a quite common experience in geotechnical soil/structure or contact simulations, for example, [49, 54–58]. In this case the matrix can be written as where the square diagonal blocks are responsible of the two clusters of eigenvalues and usually arise in a natural way from the original problem. Factorizing in (2.11) according to Sylvester's theorem of inertia: where is the Schur complement, a Generalized Jacobi (GJ) preconditioner , based on the seminal idea in , can be obtained as a diagonal approximation of the central block diagonal factor in (2.12): where is a user-specified parameter and is an approximation of in (2.13) with used instead of . Quite recently Bergamaschi and Martínez  have indirectly proved that an optimal value for is the ratio between the largest eigenvalues of and . Another very simple algorithm which has proved effective in similar applications is based on SOR and Symmetric SOR preconditioners, for example, [62–64].
Though the algorithms above cannot compete with more sophisticated tools in terms of global performance [58, 65], they are an example of how even somewhat naive ideas can work and be useful for practitioners. Generally speaking, it is always a good idea to develop an effective preconditioner starting from the specific problem at hand. Interesting strategies have been developed approximating directly the differential operator the matrix arises from, by using a “nearby” PDE or a less accurate problem discretization, for example, [66–69]. For instance, popular preconditioners typically used within the transport community are often physically based, for example, [70–72].
For the developers of computer codes, however, using physically based preconditioners is not always desirable. In fact, it is usually simpler introducing the linear solver as a black box which is expected to work independently of the specific problem at hand, and especially so with less specialized codes such as the commercial ones. This is why a great interest has arisen also in the development of purely algebraic preconditioners that claim to be as much “general purpose” as possible. Algebraic preconditioners are usually robust algorithms which require the knowledge of the matrix only, independently of the underlying physical problem. This paper will concentrate on such a class of algorithms. Subdividing algebraic preconditioners into categories is not straightforward, as especially in recent years there have been several contaminations from adjacent fields with the development of “hybrid” approaches. Traditionally, and as already done by Benzi , algebraic preconditioners can be roughly divided into two classes, that is, incomplete factorizations and approximate inverses. The main distinction is that with the former approach is actually computed and is applied but never formed explicitly, whereas with the latter is directly built and applied. Within each class, moreover, different strategies can be pursued according to a number of factors, such as whether is SPD or not, its sparsity structure and size, the magnitude of its coefficients, and ultimately the computer hardware available to solve the problem, so that a large number of variants have been proposed by the researchers. As anticipated before, the unavoidable contamination from adjacent fields, for example, the direct solution methods, has made the boundaries between different groups of algorithms often vague and certainly questionable.
In this work, the traditional distinction between incomplete factorizations and approximate inverses is followed, describing the most successful algorithms belonging to each class and the most significant variants developed to address particular occurrences and increase efficiency. Then, two additional sections will be devoted to the most recent results obtained in the fields that currently appear to be the more active frontier of research in the area of algebraic preconditioning, that is, multigrid techniques and parallel algorithms. Finally, a few words will be spent on a special class of problems characterized by indefinite saddle-point matrices, which arise quite frequently in the applications and have attracted much interest in recent years.
3. Incomplete Factorizations
Given a nonsingular matrix such that the basic idea of incomplete factorizations is to approximate the triangular factors and in a cost-effective way. If and are computed so as to satisfy (3.1), for example, by a Gaussian elimination procedure, typically fill-in takes place, with and much less sparse than . This is what limits the use of direct methods in large 3D problems. The approximations and of and , respectively, can be obtained by simply discarding a number of fill-in entries according to some rules. The resulting preconditioner is generally denoted as Incomplete LU (ILU). Quite obviously, in (3.2) is never built explicitly as it is much denser than both and . Its application to a vector is therefore performed by forward and backward substitutions. Moreover, the ILU factorized form (3.2) is well suited for split preconditioning. In case is SPD, the upper incomplete factor is replaced by and the related preconditioner is known as Incomplete Cholesky (IC).
The native ILU algorithm runs as follows. Define a set of position , with , where either if or if has a nonzero entry. The set is also denoted as the nonzero pattern of the incomplete factors. To avoid breakdowns in the ILU computation, it is generally recommended to include all the diagonal entries in . Then, make a copy of over an auxiliary matrix which will contain and in the lower and upper parts, respectively, at the end of the computation, and set to zero all entries such that . If the main diagonal belongs to , then it is implicitly assumed that the diagonal of is unitary and vice versa. For every position , the following update is computed: with . At the end of the loop, the ILU factors are stored in the copy of .
The procedure described above is clearly an incomplete Gauss elimination. Hence, no surprise that several implementation tricks developed to improve the efficiency of direct solution methods can be borrowed in order to reduce the computational cost of the ILU computation. For example, the algorithm can follow either the KIJ, or the IKJ, or the IJK elimination variants. To give an idea, Algorithm 1 provides a sketch of the sequence of operations required by the IKJ implementation, which is generally preferred whenever the sparse matrix along with both and is stored in a compact form by rows. It can be easily observed that Algorithm 1 proceeds by rows, computing first the row of and then that of and accessing the previous rows only to gather the required information (Figure 1). Many other efficient variants have been developed, for example, storing in a sparse skyline format or computing by columns [73, 74].
3.1. Fill-In Strategies
The several existing versions of the ILU preconditioner basically differ for the rules followed to select the retained entries in and . The simplest idea is to define statically at the beginning of the algorithm and equal to the set of the nonzero positions in . This idea was originally implemented by Meijerink and van der Vorst  in their 1977 seminal work, giving rise to what is traditionally referred to as ILU(0), or IC(0) for SPD matrices. This simple choice is easy to implement and very cost-effective, as the preconditioner construction is quite cheap and its application to a vector costs as much as a matrix-by-vector product with . Moreover, ILU(0) can be quite efficient, especially for matrices arising from the discretization of elliptic problems, for example, [75–77].
Unfortunately, there are a number of problems where the ILU(0) preconditioner converges very slowly or even fails. There are several different reasons for this, depending on both the unstable computation and the unstable application of and , as was experimentally found by Chow and Saad on a large number of test matrices . For example, ILU(0) often exhibits a bad behavior with structural matrices and the highly nonsymmetric and indefinite matrices arising from computational fluid dynamics. In these cases the nonzero pattern of proves a too crude approximation of the exact and structure, so that a possible remedy is to allow for a larger number of entries in and in order to make them closer to and . In the limiting case, the exact factors are computed and the solution to (1.1) is obtained in just one iteration. Obviously, enlarging the ILU pattern on one hand reduces the iteration count but on the other increases the cost of computation and application of the preconditioner, so an appropriate tradeoff must be found.
There are several different ways for enlarging the initial pattern , or computing it in a dynamic way. In any specific problem, the winning algorithm is the one that proves able to capture the most significant entries in and at the lowest cost. One of the first attempts for enlarging dynamically is based on the so-called level-of-fill concept [79, 80]. The idea is to assign an integer value , denoted as level-of-fill, to each entry in position computed during the ILU construction process. The lower the level, the more important the entry. The initial level is set to During the ILU construction, with reference for instance to the Algorithm 1, the level-of-fill of each entry is updated according to the level-of-fill of the entries used for its computation using the following rule: In practice, any position corresponding to a nonzero entry in remains with a zero level-of-fill. The terms computed using entries with a zero level-of-fill have a level equal to 1. Those built using one entry of the first level have level 2 and so on. The new level-of-fill of each entry is computed adding 1 to the sum of the levels of the entries used for it. This kind of algorithm creates a hierarchy among the potential entries that allows for selecting the “most important” nonzero terms only. The resulting preconditioner is denoted as ILU(), where is the integer user-specified level-of-fill of the entries retained into each row of and . It follows immediately that ILU(0) coincides exactly with the zero fill-in ILU previously defined.
In most applications, ILU is already a significant improvement to ILU(0). The experience shows that the fill-in increases quite rapidly with , so that for the cost for building and applying the preconditioner can become a price too high to pay. Moreover it can be observed that at each level there are also many terms which are very small in magnitude, thus providing a small contribution to the preconditioning of the original matrix . Hence, a natural way to avoid such an occurrence is to add a dropping rule according to which the terms below a user-specified tolerance are neglected. Several criteria have been defined for selecting and the way the ILU entries are discarded. A usual choice is to compute the th row of both and and drop an entry whenever smaller than , being an appropriate vector norm and the th row of , but other strategies have been also attempted, such as comparing the entry of with  or with an estimate of the norm of the th column of [82, 83].
Similarly to the level-of-fill approach, a drawback of the drop tolerance strategy is that the amount of fill-in of and is strongly dependent on the selected user-specified parameter and the specific problem at hand and is unpredictable a priori. To address such an issue a second user-specified parameter can be introduced, denoting the largest number of entries that can be retained per row. In some versions, is also defined as the number of entries added at most per row in excess to the number of entries of the same row of . The use of this dual threshold strategy, which may be quite simple and effective, has been proposed by Saad , and the related preconditioner is referred to as ILUT(). A sketch of the sequence of steps required for its computation is provided in Algorithm 2.
As the drop tolerance may be quite sensitive to the specific problem and so difficult to implement in a black box solver, ILU variants that use only the fill-in parameter have been proposed [85, 86]. These versions are generally outperformed by ILUT() but are much easier to set and generally less sensitive to the user-specified parameter, allowing the user for a thorough control of the memory occupation. More recent ILU variants developed with the aim of improving the preconditioner performance and better exploiting the hardware potential are based on multilevel techniques, dense block identification, and adaptive selections of the setup parameters [87–89], with significant contaminations from the methods developed for the latest multifrontal solvers.
3.2. Stabilization Techniques
Both Algorithms 1 and 2 require the execution of divisions where the denominator, that is, the pivot, is not generally guaranteed to be nonzero. It is well known that in finite arithmetics also small pivots can create numerical difficulties. If is SPD, the pivot has to be also positive in order to produce a real incomplete factorization. These issues are very important because they can jeopardize the existence of the preconditioner itself and so the robustness of the overall iterative algorithm.
Meijerink and van der Vorst  demonstrated that for an M-matrix a zero fill-in IC is guaranteed to exist. However, if is SPD but not of M-type zero or negative pivots may arise, thus theoretically preventing from the IC computation. Typically, such an inconvenience may be avoided by increasing properly the fill-in. In this way the incomplete factor tends to the exact factor, which is guaranteed to be SPD. However, this may generate a fill-in degree that is unacceptably high, lowering the performance of the preconditioned solver. To avoid this undesirable occurrence a number of tricks have been proposed which can greatly increase the IC robustness. The first, and maybe simplest, idea was advanced by Kershaw  who just suggested to replace zero or negative pivots with an arbitrary positive value, for example, the last valid pivot. The implementation of this trick is straightforward, but it seldom provides good results as the resulting IC often exhibits a very low quality. In fact, recall from Algorithm 1 that modifying arbitrarily a pivot in the th row means influencing the computation of all the following rows. Another simple strategy recently advanced in  relies on avoiding the computation of the square root of the pivot, required in the IC algorithm, thus eliminating a source of breakdown. In  a diagonal compensated reduction of the positive off-diagonal entries is suggested in order to transform the native matrix into an M-matrix before the incomplete factorization process is performed. A different algorithm is proposed in  with the incomplete factors obtained through an -orthogonalization instead of a Cholesky elimination.
A famous stabilization technique to ensure the IC existence is the one advanced by Ajiz and Jennings , that has proved quite effective especially in structural applications [50, 90]. Following this approach, we can write where is a residual matrix collecting with all the entries discarded during the incomplete procedure. From (3.6), is actually the exact factorization of . As is SPD, it follows immediately that a sufficient condition for to be real is that is negative semidefinite. Suppose that the first column of has been computed and the term has to be dropped along with the corresponding term in . Hence, the incomplete factorization is equivalent to the exact factorization of where the entries and are replaced by zero, and the error matrix contains in position and . For example, in a matrix with , we have The residual matrix can be forced to be negative semidefinite by adding two coefficients and in position and , such that the submatrix is negative semidefinite. The matrix must be modified accordingly by subtracting and from the corresponding diagonal terms. Though different choices are possible, for example, [50, 93], the option is particularly attractive because in this way the sum of the absolute values of the arbitrarily introduced entries is minimized. This procedure ensures that is the exact factorization of a positive definite matrix, hence its existence is guaranteed, but obviously such a matrix can be quite different from . The quality of this stabilization strategy basically depends on the quality of as an approximation of .
Unstable factors and negative pivots typically arise if the diagonal terms of are remarkably smaller than the off-diagonal ones, or if there are significant contrasts in the coefficient magnitude. The latter occurrence is particularly frequent in heterogeneous problems or in models coupling different physical processes, such as multiphase flow in deformable media, coupled flow and transport, themomechanical models, fluid-structure interactions, or multibody systems, for example, [94–97]. All these problems give naturally rise to a multilevel matrix: where the system unknowns can be subdivided into contiguous groups. By distinction with multigrid methods, a multilevel approach assumes that the different levels do not necessarily correspond to model subgrids. Several strategies have been proposed to handle properly the matrix levels, for example, [98–102]. Generally speaking, all such algorithms can be viewed as approximate Schur complement methods that differ for the strategy used to perform the level subdivision and the restriction and prolongation operators adopted [103, 104]. For a recent review, see, for instance, . The basic idea of multilevel ILU proceeds as follows. Consider the partial factorization of in (3.9), where is the submatrix obtained from without the first level, and and are the rectangular submatrices collecting the blocks and , : In (3.10) and is the Schur complement: Replacing and with the ILU factors and of gives rise to a partial incomplete factorization of that can be used as a preconditioner: where As and tend to be much denser than and , some degree of dropping can be conveniently enforced, thus replacing these blocks with and , respectively. Recalling the level structure of , the inverse of the approximated Schur complement can be also replaced by a partial incomplete factorization with the same structure as in (3.12). Repeating recursively the procedure above starting from , at the th level, is replaced by thus giving rise to an th level Schur complement . The algorithm stops when the size of equals , that is, when equals the user specified number of levels . Multilevel ILU preconditioners are generally much more robust than standard ILU and can be very effective provided that a suitable level subdivision is defined. If is SPD, however, it is necessary to enforce the positive definiteness of all approximate Schur complements, which is no longer ensured when incomplete decompositions and dropping procedures are used. Recently, Janna et al.  have developed a stabilization strategy which can guarantee the positive definiteness of each . Basically, they extend to levels of the factorization procedure introduced by Tismenetsky  where the Schur complement is updated by a quadratic formula. The theoretical robustness of Tismenetsky's algorithm has been proved in . Recalling (3.6), the matrix can be additively decomposed as where is the residual matrix obtained from the IC decomposition of . Assuming that an appropriate stabilization technique has been used, for example, the one by Ajiz and Jennings , is negative semidefinite, so is positive definite. Let us collect all the entries dropped in the computation in the error matrix , thus writing . The Schur complement of can be now rewritten as Ignoring the last three terms of (3.17) we obtain that is the standard Schur complement usually computed in a multilevel algorithm; see (3.14). As is generally indefinite, in (3.18) may be indefinite too. As a remedy, add and subtract to the right-hand side of (3.17): Recalling that (3.19) becomes Neglecting the last term in (3.21) we obtain another expression for the Schur complement: which is always SPD. In fact, (3.21) yields which is SPD independently of the degree of dropping enforced on . The procedure above is generally more expensive than the standard one but is also more robust.
If is not SPD, enforcing the positivity of pivots or the positive definiteness of the Schur complements is not necessary. However, the experience shows that, in many problems, especially those with a high degree of nonsymmetry and a lack of diagonal dominance, ILU can be much more unstable in both the computation and the application, leading to frequent solver failures . In this case no rigorous methods have been introduced to avoid numerical instabilities. The effect of a preliminary reordering and/or scaling of has been debated for quite a long time, without achieving a shared conclusion. The fact that the matrix nonzero pattern has an effect on the quality of the resulting ILU preconditioner, similarly to what happens with direct methods, is quite well understood, but not all the methods performing well with direct techniques appear to be efficient for ILU, for example, [90, 109–112]. A similar result is obtained with a preliminary scaling. As proved in , scaling the native matrix does not change the spectral properties of an ILU preconditioned matrix; however it can greatly stabilize the ILU computation by reducing significantly the impact of the round-off errors. Useful scaling strategies can be taken from the GJ preconditioning of (2.14) or the Least Square Log algorithm introduced in .
4. Approximate Inverses
Incomplete factorizations can be very powerful preconditioners; however the issues concerning their existence and numerical stability can undermine their robustness in several examples. One reason for such a weakness can be understood using the following simple argument. Consider the incomplete factorization along with the corresponding residual matrix for a general matrix : Left and right multiplying both sides of (4.1) by and , respectively, yields The matrix controlling the convergence of a preconditioned Krylov subspace method is actually which differs from the identity by . This means that a residual matrix close to 0 is not sufficient to guarantee a fast convergence. In fact, if the norms of either or are large, can be anyway far from the identity and yields a slow convergence, that is, even an accurate incomplete factorization can provide a poor outcome. This is why problems with strongly nonsymmetric matrices and a lack of diagonal dominance incomplete factorizations often have a bad reputation.
To cope with these drawbacks the researchers in the past have developed a second big class of preconditioners with the aim of computing an explicit form for , thus avoiding the solution of the linear system needed to apply the preconditioner. Generally speaking, an approximate inverse is an explicit approximation of . Quite obviously, it has to be sparse in order to maintain a workable cost for its computation and application. It is no surprise that several different methods have been proposed to build an approximate inverse. The most successful ones can be roughly classified according to whether is computed monolithically or as the product of two matrices. In the first case, the approximate inverse is generally computed by minimizing the Frobenius norm: thus obtaining either a right or a left approximate inverse, which is generally different. In the second case, the basic idea is to find an explicit approximation of the inverse of the triangular factors of : and then use in the factored form (2.3). Different approaches have been advanced within each class and also in between these two groups.
Early algorithms for computing an approximate inverse via the Frobenius norm minimization appeared as far back as in the 70s [114, 115]. The basic idea is to select a nonzero pattern for including all the positions where nonzero terms may be expected. Denoting by the set of matrices with nonzero pattern , any algorithm belonging to this class looks for a matrix such that is minimum. Recalling the definition of Frobenius norm: it follows that where and are the th column of and , respectively. The sum at the right-hand side of (4.6) is minimum only if each addendum is minimum. Therefore, the computation of is equivalent to the solution of least-squares problems with the constraint that has nonzero components only in the positions such that . This requirement reduces enormously the computational cost required to build . Denote by the set and the subvector of made of the components included in . In the product only the columns of with index in will provide a nonzero contribution. Moreover, as is sparse, only a few rows will have a nonzero term in at least one of these columns (Figure 2). Denoting by the set and the submatrix of made of the coefficients such that and , each least square problem at the right-hand side of (4.6) reads and is typically quite small. The procedure illustrated above allows for the computation of a right approximate inverse. To build a left approximate inverse just observe that Hence, (4.9) can still be used for any replacing with and obtaining the rows of . It turns out that for strongly nonsymmetric matrices the right and left approximate inverse can be quite different.
The main difference in the algorithms developed within this class relies on how the nonzero pattern is selected, and this is indeed the key factor for the overall quality of the approximate inverse. A more detailed discussion on this point will be provided in the next section. Some of the most popular algorithms in this class have been advanced in [116–121], each one with its pros and cons. Probably the most successful approach is the one proposed by Grote and Huckle  which is known as SPAI (SParse Approximate Inverse). Basically, the SPAI algorithm follows the lines described above dynamically enlarging for each column of until the least square problem (4.9) is solved with a prescribed accuracy. SPAI has proved to be quite a robust method allowing for a good control of its quality by the user. Unfortunately, its computation can be quite expensive also on a parallel machine. A sketch of the SPAI construction is reported in Algorithm 3.
A different way to build an approximate inverse of relies on approximating the inverse of its triangular factors. Assuming that all the leading principal minors of are not null, that is, can be factorized as the product where is unit lower triangular, is diagonal, and is upper unit triangular, then implying that the rows of and the columns of are two sets of -biconjugate vectors. In other words, denoting by and the th row of and the th column of , respectively, (4.11) means that If and are the matrices collecting all vectors and , , satisfying the condition (4.12) and , then the inverse of reads and can be computed explicitly. There are infinite sets of vectors satisfying (4.12), and they can be computed by means of a biconjugation process, for example, via a two-sided Gram-Schmidt algorithm, starting from two arbitrary initial sets. Quite obviously, these vectors are dense and their explicit computation is practically unfeasible. However, if the process is carried out incompletely, for instance, dropping the entries of and , , whenever smaller than a prescribed tolerance, the matrices and can be acceptable approximations of and though preserving a workable sparsity. The resulting preconditioner is denoted as AINV (Approximate INVerse) and has been introduced in the late 90s by Benzi and coauthors [122–124]. In the following years the native algorithm has been further developed by other researchers as well, for example, [125–130]. A sketch of the basic procedure to build the AINV preconditioner is provided in Algorithm 4.
Quite obviously, if is SPD, then , and in Algorithm 4 only the update of must be carried out. The resulting AINV preconditioner can be written as , is SPD, and exists if for all . Unfortunately, the same does not hold true for nonfactored sparse approximate inverses based on minimizing the Frobenius norm in (4.3). In general, using an algorithm developed along the lines of SPAI there is no guarantee that is SPD if is so. This is quite a strong limitation, as it prevents from using the powerful CG method to solve an SPD linear system. Such a weak point motivated the development of the Factorized Sparse Approximate Inverse (FSAI) preconditioner by Kolotilina and Yeremin . The FSAI algorithm lies somewhat between the two classes of approximate inverses previously identified, as it produces a factored preconditioner in the form where is a sparse approximation of the inverse of the exact Cholesky factor of an SPD matrix obtained with a Frobenius norm minimization procedure. Select a lower triangular nonzero pattern and find the matrix , where is the set of matrices with pattern such that The minimization (4.15) can be performed recalling that Deriving the right-hand side of (4.16) with respect to the nonzero entries of and setting to 0 yield where the symbol denotes the entry in position of the matrix in the brackets. As is a lower triangular pattern, the entry of is nonzero only if and the off-diagonal terms of are not required. Scaling (4.17) so that the right-hand side is either 1 or 0 yields where is the scaled factor and the Kronecker delta. Solving (4.18) by rows leads to a sequence of dense SPD linear systems with size equal to the number of nonzeroes set for each row of . More precisely, denoting by the set of column indices where the th row of has a nonzero entry, the following SPD systems have to be solved: where has zero components except the th one that is unitary, and is the cardinality of . The FSAI factor is finally restored from as such that the preconditioned matrix has all diagonal entries equal to 1. It has been demonstrated that FSAI is optimal in the sense that it minimizes the Kaporin conditioning number of for any [131, 132]. Another remarkable property of FSAI is that the minimization in (4.15) does not require the explicit knowledge of , which would make the algorithm unfeasible. A sketch of the procedure to build FSAI is reported in Algorithm 5. The FSAI preconditioner was later improved in several ways , with the aim of increasing its efficiency by dropping the smallest entries while preserving its optimal properties. Between the late 90s and early 00s further developments of FSAI were abandoned, as AINV proved generally superior. In recent years, however, a renewed interest for FSAI has started, mainly because of its high intrinsic parallel degree, for example, [134–136].
The native FSAI algorithm has been developed for SPD matrices. A generalization to nonsymmetric matrices is possible and quite straightforward [137, 138]; however the resulting preconditioner is much less robust because the solvability of all local linear systems and the nonsingularity of the approximate inverse are no longer theoretically guaranteed. This is why with general sparse matrices SPAI and AINV are generally preferred.
4.1. Nonzero Pattern Selection and Stability Issues
A key factor for the efficiency of approximate inverse preconditioning based on the Frobenius norm minimization is the choice of the nonzero pattern or . If the nonzero pattern is not good for the problem at hand, the result is generally a failure in the Krylov solver convergence. This is due to the fact that approximate inverse techniques assume that a sparse matrix can be a good approximation, in some sense, of . This is not obvious at all, as the inverse of a sparse matrix is structurally dense. If is diagonally dominant, the entries of decay exponentially away from the main diagonal , hence a banded pattern for typically provides good results. Other favorable situations include block tridiagonal or pentadiagonal matrices , but for a general sparse matrix choosing a good nonzero pattern is not trivial at all.
With respect to the nonzero pattern selection, approximate inverses can be defined as static or dynamic, according to whether and are initially chosen and kept unvaried during the computation, or and are generated by an adaptive algorithm which selects the position of the nonzero entries in an a priori unpredictable way. Typically, dynamic approximate inverses are more powerful than static ones, but their implementation especially on parallel computers can be much more complicated. The most popular and common strategy to define a static nonzero pattern for is to take the nonzero pattern of a power of as is justified in terms of the Neumann series expansion of . In real problems, however, it is uncommon to set a value larger than 2 or 3, as can soon become too dense, so it is often preferable working with sparsified matrices. The sparsification of can be done by either a prefiltration, or a postfiltration strategy, or both. For example, Chow  suggests dropping the entries of smaller than a threshold, thus obtaining the sparsified matrix and using the pattern of . An a posteriori sparsification strategy is advanced for the FSAI algorithm by Kolotilina and Yeremin  with the aim of reducing the cost for the preconditioner application by dropping its smallest entries. Such a dropping, however, is nontrivial and is performed so as to preserve the property that the FSAI preconditioned matrix has unitary diagonal entries, with a possibly nonnegligible computational cost. Anyway, pre- and/or postfiltration of a static approximate inverse is generally to be recommended, as the cost of the preconditioner application can be dramatically reduced with a relatively small increase of the iteration count, for example, [135, 136, 142]. By the way, this suggests that the pattern of is often much larger than what would be actually required to get a good preconditioner, the difficulty relying on the a priori recognition of the most significant entries.
Dynamic approximate inverses are based on adaptive strategies which start from a simple initial guess, for example, a diagonal pattern, and progressively enlarge it until a certain criterion is satisfied. For instance, a typical dynamic algorithm is the one proposed by Grote and Huckle  for the SPAI preconditioner; see Algorithm 3. More specifically, they suggest to collect the indices of the nonzero components of the residual vector in a new set and to enlarge the current set by adding the new column indices that appear in the rows with index in . An improvement can be obtained by retaining only a few new column indices, specifically the indices such that the new residual norm: is larger than a user-specified tolerance. Then, the set is enlarged correspondingly, and a new least square problem is solved. Other dynamic techniques for generating have been advanced by Huckle for both nonsymmetric  and SPD matrices . More recently, Jia and Zhu  have advanced a novel dynamic approximate inverse denoted as PSAI (Power Sparse Approximate Inverse), and Janna and Ferronato  have developed a new adaptive strategy for Block FSAI that has proved quite efficient also for the native FSAI algorithm and will be presented in more detail in the sequel.
In contrast with approximate inverses based on the Frobenius norm minimization, the AINV algorithm does not need the selection of the nonzero pattern, so it can be classified as a purely dynamic preconditioner. However, its computation turns out to be pretty similar to that of an incomplete factorization, see Algorithms 2 and 4, thus raising concerns on its numerical stability. In particular, due to incompleteness a zero pivot may result, causing a breakdown, or small pivots can trigger unstable behaviors. To avoid such difficulties a stabilized AINV preconditioner has been developed independently in [124, 145] for SPD problems. For generally nonsymmetric problems, especially with large off-diagonal entries, the numerical stability in the AINV computation may still be an issue.
Theoretically, for an arbitrary nonzero pattern it is not guaranteed that is nonsingular. This could also happen for a few combinations of the user-specified tolerances set for generating dynamically the approximate inverse pattern. However, such an occurrence is really unlikely in practice, and typically approximate inverses prove to be very robust in problems arising from a variety of applications. According to the specific problem at hand, the choice of the most efficient type of approximate inverse may vary. For a thorough comparative study of the available options as of 1999 see the work by Benzi and Tůma . Generally, SPAI is a robust algorithm able to converge also in difficult problems where incomplete factorizations fail, for example, , allowing the user to have a complete control of the quality of the preconditioner through the parameter , see Algorithm 3. However, its computation is quite expensive and can be rarely competitive with AINV if only one system has to be solved so that the construction time cannot be amortized by recycling a few times the same preconditioner. For SPD problems, SPAI is not an option and FSAI is always well defined and a very stable preconditioner, even if its convergence rate may be lower than AINV. For unsymmetric matrices, however, FSAI is not reliable and may often fail. On the other hand, AINV seldom fails and generally outperforms both SPAI and FSAI.
Numerical experiments in different problems typically show that, whenever a stable ILU-type preconditioner can be computed, that is probably the most efficient preconditioner on a scalar machine [129, 148, 149]. However, approximate inverses play an important role in offering a stable and robust alternative to ILU in ill-conditioned problems. Moreover, they can become the preconditioner of choice for parallel computations, as it will be discussed in more detail in the sequence.
5. Algebraic Multigrid Methods
It is well known that the convergence rate of any Krylov subspace method preconditioned by either an incomplete factorization or an approximate inverse tends to slow down as the problem size increases. The convergence deterioration along with the increased number of operations per iteration may lead to an unacceptably high computational cost, thus limiting de facto the size of the simulated model even though large memory resources are available. This is why in recent years much work has been devoted to the so-called scalable algorithms, that is, solvers where the iteration count is insensitive to the characteristic size of the discretization mesh.
Multigrid methods can provide an answer to such a demand. Pioneering works on multigrid are due to Brandt [150, 151] who promoted these techniques for the solution of regularly discretized elliptic PDEs. The matrices arising from this kind of problem have a regular structure for which all eigenvalues and eigenvectors are known. For example, let us consider the 1D model problem: with homogeneous Dirichlet boundary conditions discretized by centered differences over equally spaced points , . The grid characteristic size is therefore . This discretization results in the tridiagonal linear system , where, as is well known, the eigenvalues of are with the associated eigenvectors In other words, the th component of is actually Let us solve the discretized model problem by a stationary method, for example, the Jacobi iteration. As the main diagonal of has all components equal to 2, the Jacobi iteration matrix is just and the th component of the error along the direction of the th eigenvector of is damped at each iteration by the reduction coefficient : The reduction coefficients with around are very small, and the related error components are damped very rapidly in a few iterations. This part of the error is often referred to as the oscillatory part corresponding to high frequency modes. However, if is small, there are also several reduction coefficients close to 1, related to both the smallest and largest values of , that slow down significantly the stationary iteration. This part of the error is usually referred to as the smooth part corresponding to low frequency modes. Let us now consider a coarser grid where the characteristic size is doubled, that is, we simply keep the nodes of the initial discretization with an even index. Recalling (5.4) it follows that a smooth mode on the fine grid, that is, and , is now automatically injected into an oscillatory mode on the coarse grid. Building and , it is now possible to damp in a few iterations the error on the coarse grid and then project back the result on the fine grid. Similar conclusions can be drawn also using other stationary iterations, such as Gauss-Seidel, though in a much less straightforward way.
Recalling the previous observations, the basic idea of multigrid proceeds as follows. The operator , usually denoted as the smoother in the multigrid terminology, is applied to the fine grid discretization for iterations. The resulting fine grid residual is computed and injected into the coarse grid through a restriction operator . Recalling the relationship between error and residual, the coarse grid problem is solved using the residual as known term, thus providing the coarse grid error. Such an error is then projected back to the original fine grid using a prolongation operator , and the fine grid solution, obtained after the initial smoothing iterations, is corrected. Finally, the fine grid smoother is applied again until convergence. The algorithm described above is known as two-grid cycle and is the basis of multigrid techniques. Quite naturally, the two-grid cycle can be extended in a recursive way. At the coarse level, another two-grid cycle can be applied moving at a coarser grid, defining appropriate smoother, restriction, prolongation operators, and so on, until the coarsest problem is so small that it can be efficiently solved by a direct method. The recursive application of the two-grid cycle gives rise to the so-called V-cycle. Other more complex recursions can provide the so-called W-cycle; see, for instance,  for details.
Multigrid methods have been introduced as a solver for discretized PDEs of elliptic type, and indeed in such problems they have soon proved to be largely superior to existing algorithms. The first idea to extend their use to other applications was to look at the multigrid as a purely algebraic solver, where one has to define the smoother, restriction, and prolongation operators knowing the system matrix only independently of the grid and the discretized PDE it actually comes from. This strategy, known as Algebraic Multigrid (AMG), was first introduced in the mid 80s [153–155] and became soon quite popular. The second idea was to regard AMG not as a competitor with Krylov subspace solvers, rather as a potentially effective preconditioner. In fact, to work as a preconditioner it is simply sufficient to fix the number of smoothing iterations needed to reconstruct the system solution at the finest level, thus obtaining an inexact solution. The scheme of a V-cycle AMG application as a preconditioner is provided in Algorithm 6.
Basically, AMG preconditioners vary according to the choice of both the restriction operator and the smoother, while the prolongation operator is often defined as the transposed of the restrictor. The basic idea for defining a restriction is to coarsen the native pattern of and prescribe an appropriate interpolation over the coarse nodes. Classical coarsening strategies have been introduced in [155, 156] which separate the variables into coarse (C-) and fine (F-) points according to the strength of each matrix connection. A point strongly influences the point if where is the user-defined strength threshold. For each point that strongly influences an F-point , is either a C-point or is strongly influenced by another C-point . Moreover, two C-points should not be connected to each other. An alternative successful coarsening strategy is based on the so-called smoothed aggregation (SA) technique , where a coarse node is defined by an aggregate of a root point and all the neighboring nodes such that Another possibility is to use a variant of the independent set ordering (ISO) technique which subdivides a matrix into a block structure and projects the native system into the Schur complement space. Using ISO or SA can lead to an elegant relationship between AMG and multilevel ILU as exploited in [98, 101, 158, 159]. More recently, a number of both parallel and scalar coarsening strategies have been also advanced, for example, [160–164]. As far as the smoother is concerned, for many years the method of choice has been the standard Gauss-Seidel iteration. With the development of parallel computers other smoothers have been introduced with the aim of increasing the parallel degree of the resulting AMG algorithm. Multicoloring approaches  can be used in connection to standard Gauss-Seidel, or also polynomial and C-F Jacobi smoothers [166, 167]. It is no surprise that also approximate inverses can be efficiently used as a smoother, for example, [168, 169].
The last decade has witnessed an explosion of research on AMG techniques. The key factor leading to such a great interest is basically their potential scalability with the size of the problem to be solved, in the sense that the iteration count to converge for a given problem does not depend on the number of the mesh nodes. Several theoretical results have been obtained with the aim of generalizing as much as possible AMG to nonelliptic problems, for example, [158, 170–174]. AMG techniques have been used in several different applications with promising results, such as fluid-structure interactions, meshless methods, Maxwell and Helmoltz equations, structural mechanics, sedimentary basin reconstruction, and advection-convection and reactive systems, for example, [175–185]. The above list of references is just representative and of course largely incomplete. Nonetheless, much work is still to be done in order to make AMG the method of choice. In particular, difficulties can be experienced where the system matrix arises from highly distorted and irregular meshes, or in presence of strong heterogeneities. In these cases, even the most advanced AMG strategies can fail.
6. Parallel Preconditioners
Parallel computing is widely accepted as the only pathway toward the possibility of managing millions of unknowns . At the same time, hardware companies are improving the available computational resources with an increasing degree of parallelism rather than accelerating each single computing core. This is why the interest in parallel computations is continuously growing in recent years. Krylov subspace methods are in principle almost ideally parallel, as their kernels are matrix-vector and scalar products along with vector updates. Unfortunately, the same is not true for most algebraic preconditioners. There are basically two approaches for developing parallel preconditioners: the first tries to extract as much parallelism as possible from existing algorithms with the aim of transferring them to high-performance platforms, while the second is based on developing novel techniques which would not make sense on a scalar computer. Quite obviously, the former approach is easier to understand from a conceptual point of view, as the native algorithm is not modified and the difficulties are mainly technological. The latter implies an additional effort to develop new explicitly parallel methods.
ILU-based preconditioners are highly sequential in both their computation and application. Anyway, some degree of parallelism can be achieved through graph coloring techniques. These strategies have been known from a long time by numerical analysts, for example, , and used in the context of relaxation methods. The aim is to build a partition of the matrix graph such that all vertices in the same group form an independent set, so that all unknowns in the same subset can be solved in parallel. Other useful orderings are based on domain decomposition strategies minimizing the edges connecting different subdomains, with a local ordering applied also to each unknown subset. Parallel ILU implementations based on these concepts have been developed in [187–190]; however it was soon made clear that generally such algorithms could not have a promising scalability.
In contrast, approximate inverses are intrinsically much more parallel than ILU factorizations, as they can be applied by a matrix-vector product instead of forward and backward substitutions. The construction of an approximate inverse, however, can be difficult to parallelize. For example, the AINV Algorithm 4 proceeds by columns and is conceptually pretty similar to an ILU factorization. A parallel AINV construction can be obtained by means of graph partitioning techniques which decompose the adjacency graph of in a number of subsets of roughly the same size with a minimal number of edge cuts . In such a way the inner variables of each subset can be managed independently by each processor, while communications occur only with the boundary variables. By distinction, the construction of an approximate inverse based on a Frobenius norm minimization is naturally parallel if a static nonzero pattern is defined a priori. For example, the computation of both static SPAI and FSAI consists of a number of independent least-square problems and dense linear systems, respectively, which can be trivially partitioned among the available processors. Hence, static SPAI and FSAI are almost perfectly parallel algorithms. However, when the nonzero pattern is defined dynamically parallelization is no longer straightforward, as the operation of gathering the required submatrix of , see Algorithms 3 and 5, gives rise to unpredictable communications. For example, details of the nontrivial parallel implementation of the dynamic SPAI algorithm are provided in .
The theoretical scalability properties of AMG make it very attractive for parallel computations. This is why in the last few years the research on AMG techniques has concentrated on high-performance massively parallel implementations. The main difficulties may arise in parallelizing the Gauss-Seidel smoother and the coarsening stage. As mentioned before, these problems can be overcome by using naturally parallel smoothers, such as Jacobi, relaxed Jacobi, polynomial or a static approximate inverse smoother, for example, [162, 166], and parallel coarsening strategies, for example, [160, 162, 192, 193]. Currently, efficient parallel implementations of AMG are already available in several software packages, for example, Hypre  and Trilinos .
An alternative strategy for developing parallel preconditioners relies on building new algorithms which should consist of matrix-vector products and local triangular solves only. Perhaps the earliest technique of this kind belongs to the class of the polynomial preconditioners which was first introduced by Cesari as back as in 1937 , even though obviously not in the context of Krylov subspace methods and supercomputing. Modern polynomial preconditioners have been developed in [196–198]. A polynomial preconditioner is defined as where is a polynomial of degree not exceeding . The preconditioned system becomes: where and commute, so that right and left preconditionings coincide. Quite obviously, is never formed explicitly and its application to a vector is simply performed by a sequence of matrix-vector products. Therefore, polynomial preconditioners can be ideally implemented in a parallel environment. There are different polynomial preconditioners according to the choice of . One of the simplest options, advanced in , is using the Neumann expansion , where and is a scaling parameter. The degree cannot be too large, and this can limit the effectiveness of this preconditioner. Another option, advanced in , is to select a somewhat “optimal” polynomial, where optimality is evaluated in some sense. A natural choice is prescribing that the eigenspectrum of the preconditioned matrix is as close as possible to that of the identity, that is, finding such that where is the eigenspectrum of . The min-max problem above can be addressed by using Chebyshev polynomials but cannot be solved exactly, as is generally not known. An approximate minimization can be done replacing with a continuous set which encloses . A third option is to compute as the polynomial that minimizes where the symbol denotes the 2-norm induced by the inner product on the space with respect to some nonnegative weight function defined in the interval : The resulting polynomial is known as least-squares and was actually introduced by Stiefel in the solution of eigenproblems . According to Saad , least-squares polynomials tend to perform better than Chebyshev and Neumann. However, a major drawback for the use of this kind of preconditioners stems from the need at least for estimates of the smallest and largest eigenvalues of and often their performance is quite poor in terms of convergence rate .
Another popular parallel preconditioner is based on the Additive Schwarz (AS) methods . The idea is to use a domain decomposition approach for dividing the variables into a number of possibly overlapping subdomains and to address separately each subdomain at a local level (Figure 3). On the overlaps the contribution of each subdomain is added or treated in some special way by averaging techniques. AS preconditioners can be easily parallelized by assigning each subdomain to a single processor. For instance, a sequential ILU factorization can be computed for each subdomain, thus miming a kind of parallel ILU. If no overlaps are allowed for, AS preconditioners with inner ILU factorizations simply reduce to an incomplete Block Jacobi. This class of preconditioners has experienced some success in the 90s as it is conceptually perfectly parallel. Unfortunately, the quality of the preconditioner deteriorates as the number of processors, that is, subdomains, grows, because the size of each subset progressively decreases. This causes an increase of the iteration count which can also completely offset the advantage of using a larger number of processors.
A novel recent parallel preconditioner which tries to combine the positive features of both approximate inverses based on the Frobenius norm minimization and AS methods is Block FSAI . The basic idea exploits the fact that in the Frobenius norm minimization the preconditioned matrix can be actually forced to resemble any target matrix , by computing a pseudoapproximate inverse such that for all matrices having a prescribed nonzero pattern . This concept, originally introduced in , can be used to choose a target matrix that is particularly attractive for parallel computing. The idea of Janna et al.  is to generalize the FSAI approach using as a target a block diagonal matrix. As FSAI is a robust preconditioner for SPD problems, Block FSAI has been originally developed for this class of matrices, although a generalization to nonsymmetric indefinite linear systems has also been attempted . So let be SPD with and a sparse lower triangular and a dense block diagonal nonzero pattern, respectively. Denote by the number of diagonal blocks and the size of the th block, and let be an arbitrary full-rank matrix with nonzero pattern . The Block FSAI preconditioner of is defined as the product , where is a lower block triangular factor with nonzero pattern such that with the exact lower Cholesky factor of . As is arbitrary, it goes without saying that as defined in (6.8) is also. Similarly to the FSAI algorithm, minimization of (6.8) yields a linear relationship for the th row of : where is the set of column indices in the th row of with nonzero entries and is the null vector except for the last components which are arbitrary, being the block index the row belongs to (Figure 4). For the system (6.9) to have a unique solution components of can be set arbitrarily because of the arbitrariness of . According to  the most convenient choice is to prescribe all the components of falling within the th diagonal block to be null, with the exception of the diagonal entries set to 1. This implies that is actually a unit lower triangular matrix with structure Similarly to the static FSAI algorithm, the factor can be efficiently built in parallel as all systems (6.9) are independent. According to (6.8), the preconditioned matrix should resemble , that is, a block diagonal matrix for any arbitrary . In other words, has the block structure where are residual blocks and tend to the diagonal blocks of . As is arbitrary, there is no reason for to be better than in a CG iteration, so it is necessary to precondition again. Assuming that the residual off-diagonal blocks in (6.11) are close to 0, an effective choice could be using as “outer” preconditioner a nonoverlapped AS method containing a local preconditioner for each block . Writing the “outer” preconditioner in the factored block diagonal form , the resulting preconditioned matrix reads with the final Block FSAI preconditioner:
The Block FSAI preconditioner described above can be improved by defining dynamically. An adaptive algorithm for the pattern identification can be implemented starting from the theoretical optimal properties of Block FSAI. Janna and Ferronato  have demonstrated that under the hypothesis that the contains the exact inverse of each diagonal block the Kaporin conditioning number of the preconditioned matrix : satisfies the following bound: where is a constant scalar independent of and
Block FSAI has the theoretical property of minimizing for any given pattern . The basic idea of the adaptive algorithm is to select the off-block diagonal nonzero entries in any row of so as to reduce as much as possible. This is feasible because each factor turns out to be a quadratic form depending on only. Denoting by the subvector of including the off-block diagonal entries only, the square submatrix of built from the first to the th row/column, with the sum of size of the first diagonal blocks of , and the subrow of with the first elements of the th row (Figure 5), each factor in (6.16) reads Minimizing every , , is equivalent to minimizing . The adaptive pattern search for can be therefore implemented as follows. Start from an initial guess for the nonzero pattern of the off-block diagonal part of , for example, the empty pattern, and compute the gradient of each quadratic form : Then, add to the position into each row corresponding to the largest component of computed in (6.18), so as to obtain an augmented pattern . After computing the new factor over , the procedure above can be iterated in order to build , , and so on. The search into each row can be stopped when either a maximum number of entries are added to the initial pattern or the relative variation of in two consecutive steps is smaller than a prescribed tolerance . The construction of the dynamic Block FSAI is summarized in Algorithm 7. The Block FSAI computed with the dynamic algorithm described above is known as Adaptive Block FSAI (ABF) and has proved very effective even if the outer preconditioner does not contain the exact inverse of the diagonal blocks . As any dynamic algorithm, the parallel construction of ABF is not straightforward. However, an almost ideal efficiency can be obtained in massively parallel computations as well. For more details, see .