#### Abstract

During the last decades, the continuous expansion of supercomputing infrastructures necessitates the design of scalable and robust parallel numerical methods for solving large sparse linear systems. A new approach for the additive projection parallel preconditioned iterative method based on semiaggregation and a subspace compression technique, for general sparse linear systems, is presented. The subspace compression technique utilizes a subdomain adjacency matrix and breadth first search to discover and aggregate subdomains to limit the average size of the local linear systems, resulting in reduced memory requirements. The depth of aggregation is controlled by a user defined parameter. The local coefficient matrices use the aggregates computed during the formation of the subdomain adjacency matrix in order to avoid recomputation and improve performance. Moreover, the rows and columns corresponding to the newly formed aggregates are ordered last to further reduce fill-in during the factorization of the local coefficient matrices. Furthermore, the method is based on nonoverlapping domain decomposition in conjunction with algebraic graph partitioning techniques for separating the subdomains. Finally, the applicability and implementation issues are discussed and numerical results along with comparative results are presented.

#### 1. Introduction

Let us consider a sparse linear system of the following form:where is the coefficient matrix, is the right-hand-side vector, and is the solution vector. In order to solve the linear system, in (1), a direct or an iterative method can be used. However, a direct method is computationally expensive with excessive memory requirements. Krylov subspace iterative methods, compare [1], have gained the attention of the scientific community during the recent decades, due to their efficiency and memory requirements for solving large sparse linear systems. Preconditioned iterative methods improve the convergence rate and have been used extensively for solving large sparse linear systems. The left preconditioned form of the linear system, in (1), is as follows:where is a suitable preconditioning matrix. Moreover, there is the right and the symmetric preconditioned form, compare [1]. The preconditioner has to satisfy the following conditions: (i) should have a “clustered” spectrum, (ii) can be efficiently computed in parallel, and (iii) the product of “ by vector” should be fast to compute in parallel, compare [1, 2]. The domain decomposition method has been extensively used for solving (very) large sparse linear systems in parallel systems, compare [3–8]. Furthermore, domain decomposition-type methods have been combined with Krylov subspace methods or have been used in hybrid schemes in conjunction with direct solvers, compare [6, 9–11]. During the last decade, research efforts were directed towards solvers for general sparse linear systems that have improved convergence behavior and are scalable.

Various overlapping domain decomposition methods based on Lagrange multipliers have been proposed. The Finite Element Tearing and Interconnect (FETI), compare [12], and its variants, compare [13, 14], are scalable and efficient solvers for second-order and fourth-order partial differential equation problems. Related methods are the Balancing Domain Decomposition (BDD), compare [15], and the Balancing Domain Decomposition by constraints (BDDC), compare [16], which are multiplicative and additive domain decomposition methods, respectively. The aforementioned methods are suitable for solving linear systems derived from the finite element method. Additionally, there are domain decomposition-type solvers using graph partitioning algorithms. These algebraic domain decomposition methods do not require any geometric property of the problem as the partitioning is based on the corresponding graph of the coefficient matrix. Parallel hybrid algebraic domain decomposition-type solvers that combine direct and iterative methods have been proposed, compare [17, 18]. Additionally, algebraic domain decomposition method in conjunction with algebraic multigrid method, compare [19–22], has satisfactory convergence behavior and is suitable for parallel systems. An algebraic domain decomposition method, based on semiaggregation and subspace compression techniques, namely, multiprojection method with subspace compression (MPMSC), for solving large sparse linear systems is proposed. Several efficient software libraries for graph partitioning, such as Metis, compare [23], or Scotch, compare [24], can be used for algebraic partitioning. The aggregation techniques have been previously used in the context of algebraic multigrid methods and have been shown to be efficient for a wide variety of model problems, compare [25–27]. In the MPMSC scheme, a direct method is used for solving the local linear systems derived from the semiaggregated domains. The semiaggregation procedure results in multiple grids composed of a fine part, corresponding to the unknowns of a subdomain, and aggregated part, corresponding to unknowns related to other subdomains. The proposed scheme utilizes approximations for the fine part of each subdomain, obtained with respect to its aggregated components, as opposed to aggregation based algebraic multigrid methods, which obtain approximations for the solution of the entire domain, through smoothing, based on a hierarchy of coarser grids. Moreover, the proposed scheme does not require the computation of a hierarchy of coarser grids. The subspace compression technique is an algorithmic procedure, based on breadth first search with limited depth on the adjacency matrix of the subdomains. By this technique the neighboring subdomains are discovered and aggregated together to an aggregated (coarse) component. The aggregation process reduces the number of aggregated (coarse) unknowns of the local linear systems, resulting in a substantial reduction in memory requirements. The computation of the rows and columns corresponding to the aggregates, during the formation of the local coefficient matrices as well as the subdomain adjacency matrix, is reused to avoid recomputation. Avoiding the recomputation enhances the performance of the proposed scheme. Moreover, the rows and columns corresponding to the newly formed aggregates have a large number of nonzero elements and thus are ordered last to further reduce fill-in during the factorization of the local coefficient matrices. Further, the number of aggregates of the multiprojection method with subspace compression is controlled by a parameter that describes the level sets of aggregates to be compressed in a single aggregate.

In Section 2, the multiprojection method with subspace compression is presented, implementation details are given, and the aggregation strategy is discussed. In Section 3, numerical results illustrating the applicability and efficiency of the MPMSC scheme are presented. Moreover, comparative results for the performance and convergence behavior are given.

#### 2. Multiprojection Method with Subspace Compression

The multiprojection method with subspace compression is a domain decomposition-type method for solving large sparse linear systems (see (1)). The graph, corresponding to the coefficient matrix , has vertices, the unknowns of the linear system, in (1), and edges, the nonzero elements of the coefficient matrix . Since is expressed as a graph, the algebraic properties of the problem can be utilized in order to partition the graph into nonoverlapping subgraphs (subdomains). The graph partitioning process is performed using the algorithm provided in the Metis software library, compare [23]. Consider the domain partitioned into nonoverlapping subdomains with , so thatThe number of vertices, in each subdomain , is denoted by and is the index set defined by the following expression:where are the indices corresponding to the vertices of each subdomain.

##### 2.1. Subspace Compression

Let us consider the subspace , of the domain , which contains aggregated (coarse) components, corresponding to the subdomains , and the prolongation matrix from onto , , which is denoted according to the following expression:The projection of to subspace is the matrix of dimension () and is given by the following equation:The graph , corresponding to the matrix , concerns the relations between the aggregated (coarse) components. The subspace compression technique is based on the reaggregation of the (number of subdomains) aggregated (coarse) components into (number of aggregates). The number of aggregates is not known a priori and thus is a result of the subspace compression algorithm. Breadth first search with limited depth (BFS-LD) is performed on to find the distance neighborhood. The components in a distance neighborhood of are reaggregated together. Then the neighborhood’s vertices are removed from the set and the BFS-LD is performed again on the new graph. When every vertex is removed from the set , the procedure is terminated. The order in which the reaggregation process is performed may result in a slight variation of the number of aggregates; however this does not affect the efficiency or the convergence behavior of the proposed scheme significantly. The order of the vertices used in the reaggregation scheme is lexicographical with respect to the enumeration of the subdomains by the graph partitioning scheme. Other approaches can be used for the reaggregation process, compare [26, 27]; however the computational work of these schemes will affect negatively the performance of the preprocessing phase, especially for large numbers of subdomains. The subspace compression algorithm is as follows:(1), .(2)** For each ** or** until ** , where is the -distance neighborhood of component and is the index of the aggregated (coarse) components **End For**The parameter of the neighborhood specifies the maximum distance between the vertex and any other vertex in the neighborhood. The value of is selected with respect to the adjacency matrix , considering that the larger the value of , the less the number of aggregates. Distance one neighborhood is chosen by default; however for parallel systems with limited memory resources, distance two or distance three neighborhood should be chosen to reduce the memory requirements of the proposed preconditioning scheme. An increase in the value of might affect negatively the convergence behavior of the proposed scheme. The computational complexity for finding the aggregates depends on the number of subdomains and it is a searching procedure on the graph with negligible execution time compared to the whole initialization phase of the preconditioning scheme.

##### 2.2. Semiaggregation

Let us consider the semiaggregated subdomains , containing the fine components of subdomain and aggregated components. Furthermore, the semiaggregated subdomains are associated with their respective prolongation matrices , with . The matrices are restriction operators mapping an arbitrary full vector of the domain into a vector of the semiaggregated subdomain with size (). The first columns of the matrix are the columns of the identity matrix with dimension (), corresponding the indices . The remaining () column vectors of correspond to the aggregates. The components of the column vector that are included in the corresponding aggregated subdomain have value and the rest of the components are zero, where , with and .

Hence, the matrix is defined as follows:where is the column vectors corresponding to the aggregates as they are described above. Moreover, additional prolongation matrices are required, namely, . The prolongation matrices of dimension () are used to map a vector defined on the subdomain to a vector defined on the domain by prolonging only the components corresponding to the subdomain and discarding the aggregated (coarse) components. The th column of the matrix is given as follows:The linear systems corresponding to each subdomain are as follows:where is a coefficient matrix of order ().

The proposed method is used as a preconditioner to a Krylov subspace iterative method, such as preconditioned GMRES(), compare [28, 29]. The product of the preconditioning matrix by a vector can be described by the following compact algorithmic scheme: **For ** **End For**Considering that is the exact solution and that , the error at th iteration can be written as follows:or equivalentlywhere is an orthogonal projector with respect to the Euclidean inner product. Considering that if is symmetric positive definite, is an orthogonal projector with respect to the -norm inner product, compare [8, 30]. is a product of the two orthogonal projections, which are orthogonal to different inner products; thus the projection is not orthogonal, compare [31]. Taking into consideration the above, the method is categorized into the class of oblique projection procedures, compare [1].

The preconditioned matrix using the MPMSC scheme is given as follows:which is a sum of projections.

Let us consider the unit square domain , which is discretized with mesh size as depicted in Figure 1. The resulting grid is partitioned into 16 subdomains, containing 4 vertices each. The corresponding matrix is projected onto the subdomains , in (10), and the grid corresponding to the projected subdomain is presented schematically in Figure 2.

In the MPMSC scheme, () linear systems of the form in (10) have to be solved in every iteration. Equivalently, the linear system, in (10), can be written in a reordered block form as follows:where the size of is () and the size of is . The vector is prolonged to the full solution vector of domain , and the aggregated (coarse) components are used as auxiliary. In the MPMSC preprocessing step each matrix is factorized by sparse factorization method, compare [32]:Therefore, the preconditioning operation can be described by the following algorithmic scheme: **For ** **End For**The solution, in (19)-(20), of the local linear systems is performed inherently in parallel; thus each core computes one backward-forward substitution process. The aggregated (coarse) components are ordered last in the coefficient matrix , because they have many connections with the fine components and consequently many nonzeros in their rows/columns. By this ordering, the fill-in of the and matrices is reduced, compare [32]. Direct solvers have memory and computational complexity limitations for large linear systems. In order to keep the size of below these limits, the coefficient matrix should be partitioned into a suitable number of subdomains. Maintaining the size of almost constant, regardless of the size of , results in almost constant factorization computational work.

It should be stated that the MPMSC scheme is inherently parallel and thus suitable for distributed systems. The MPMSC scheme is used as a preconditioner to the Parallel Preconditioned restarted Generalized Minimum RESidual, namely, (PPGMRES()), method which is a parallel variant of the preconditioned GMRES() method proposed by Douglas et al., compare [28], and is based on Gram–Schmidt orthogonalization, compare [33].

##### 2.3. Memory Requirements

The memory requirements of a sparse matrix of size with nnz nonzero elements stored in CSR (Compressed Sparse Row) format are denoted byThe size of the coefficient matrix is denoted by “”, is the expected value operator, and nnz() is an operator returning the number of nonzero elements of a matrix. Also, let us denote by “” the average number of components that correspond to a node, which usually is equal to The restart parameter of is denoted by “.” MPMSC requires the following matrices and vectors in the preprocessing phase.

The coefficient matrix in CSR with memory requirements , the right-hand-side vector , the partitioning vector and (number of subdomains/nodes) vectors that map the local index of the components of each subdomain to the global index , and also the (number of subdomains/nodes) sparse (CSR) matrices with memory requirements are required. The matrices are not stored, since they can be applied explicitly. Moreover, the factorization process requires a large fraction of the total memory requirements. Each node requires (number of subdomains/nodes) and sparse (CSR) matrices with memory requirements and . The MPMSC requires also the sparse (CSR) matrices and for the subspace compression process; however they are temporary as they are deleted before the factorization step. The PPGMRES() algorithm has the following extra memory requirements: six auxiliary vectors , the dense matrix retaining the basis vectors of the Krylov subspace , the upper Hessenberg matrix , and three auxiliary vectors for the Givens rotation .

#### 3. Numerical Results

In this section, the effectiveness and applicability of the MPMSC scheme are examined. The numerical experiments were performed on a BlueGene/P (BG/P) supercomputer with the following specifications: CPU: 1024x Quad-Core PowerPC-450 850 Mhz; RAM: 4 GB/node; interconnect: 3D-Torus network with bandwidth 5.1 GBps. The methods are designed for distributed systems with multicore nodes, using MPI and OpenMP environments, compare [34]. Each multicore node executes an MPI process, which undertakes the computations for as many subdomains as the number of its cores. The parallelization in each multicore node is achieved with OpenMP, so that every subdomain is mapped to one core. Having equal number of cores and subdomains allows the parallel computation of every factorization. Moreover, the software libraries BLAS (Basic Linear Algebra Subprograms), compare [35], and Metis, compare [23], have been used. The restart parameter of PPGMRES() was set to 20 and the maximum number of iterations was set to 500 iterations. The termination criterion was . The execution time is given in “seconds.” The number of nonzero elements in the matrix is denoted by . The convergence behavior of the PPGMRES() is given by outer (inner) iterations. The inner iterations are the number of iterations after the last restart of the PPGMRES() and outer iterations are the number of times PPGMRES() was restarted. The neighborhood distance parameter has been set to one, unless stated otherwise explicitly.

The formation of the prolongation operators and the factorization of the local matrices are included in the initialization phase of the preconditioning scheme, which is part of the preprocessing time (Pre-Time). The time required for the graph partitioning, using Metis, compare [23], is not included in the Pre-Time. It should be noted that parallel graph partitioning schemes, such as ParMETIS, compare [36], can be used to increase parallel performance. The communication time of the PPGMRES() is given as Comm-Time and the GMRES-Time is the execution time of the PPGMRES(). It should be mentioned that the Comm-Time is included in the GMRES-Time. Total-Time is the sum of the Pre-Time and the GMRES-Time. The speedup is computed using the execution on 64 cores as a baseline unless stated otherwise explicitly.

##### 3.1. 3D Poisson Problem

The PPGMRES() in conjunction with MPMSC scheme has been used for solving the Poisson equation in three space variables subject to Dirichlet boundary conditions. The Poisson equation was discretized with the 7-point stencil. The right-hand side of the linear systems was computed as the product of coefficient matrix and the solution vector set to , where denotes the order of the linear system. In Table 1, the convergence behavior and the parallel performance of the PPGMRES() method, in conjunction with the MPMSC scheme, for the 3D Poisson problem () are presented for various numbers of cores-subdomains. The communication time is relative to the required iterations for convergence to the prescribed tolerance and the number of cores. It can be easily seen that as the number of cores and subdomains increases, the number of required iterations for convergence to the prescribed tolerance decreases for the 3D Poisson problem. In Table 2, the convergence behavior and the parallel performance of the PPGMRES() method, in conjunction with the Block Jacobi preconditioner, for the 3D Poisson problem () with various numbers of cores-subdomains are given. It should be stated that the MPMSC scheme presents better convergence behavior than the Block Jacobi preconditioning scheme and therefore requires less overall communication time, especially for many cores where the convergence behavior of the Block Jacobi method is downgraded. Therefore, it should be noted that the MPMSC scheme has better scalability than the Block Jacobi preconditioner. In Figure 3, the speedup of the PPGMRES() method, in conjunction with the MPMSC scheme, for the 3D Poisson problem () is depicted. In Figure 4, the weak scalability diagram of time versus cores for MPMSC scheme is presented. The order of the linear system increases as the number of cores increases since the local linear system should be of constant order. This results in augmenting, by 10,000 more unknowns, the linear system for every added core. The Total-Time should ideally remain constant. It should be noted that the Pre-Time is almost constant, as the initialization phase of the preconditioner is an inherently parallel process. From Figures 3 and 4 it is evident that the proposed method is scalable up to large number of cores, for the efficient solution of large linear systems. It should be mentioned that partitioning the domain into more subdomains improves the convergence behavior of the method, since the grid corresponding to the aggregated (coarse) components becomes finer, which improves the accuracy of the local approximations to the solution. The improved convergence behavior reduces the communications and hence improves the performance of the proposed scheme. Due to the reduced communications, while the local computational work remains almost constant, it results in superlinear speedups, especially in the case of moderate number of subdomains. For small number of cores the Pre-Time is dominant compared to the GMRES-Time and as the preprocessing phase is an inherently parallel process, both methods have satisfactory speedup. When the number of cores is 1024, the Pre-Time is a small percentage of the Total-Time; therefore the speedup reduces, since it relies on the parallelization of the PPGMRES(). It should be noted that the GMRES-Time increases proportionally to the problem size, as global communications, related to the synchronization of distributed vectors, and sequential redundant operations of PPGMRES() depend on the size of global vectors, compare [28], as it can be seen in Figure 4. However, the inherently parallel preprocessing phase scales, with satisfactory speedup, up to large number cores-subdomains. Thus, the Total-Time required by the proposed scheme is improved.

Furthermore, the convergence behavior of MPMSC scheme for 3D Poisson problem with 10,000 unknowns per subdomain is shown in Figure 5. The total number of iterations is computed as . It should be stated that the outer iterations of MPMSC scheme are almost constant for large number of subdomains. Moreover, the convergence behavior of the MPMSC scheme is related to the quality of the graph partitioning, provided by Metis, compare [23]. In Figure 6, the number of aggregates () of the MPMSC scheme against the number of subdomains for the 3D Poisson problem with 10,000 unknowns per subdomain is depicted.

In Table 3, the performance and convergence behavior of MPMSC scheme with distance two neighborhood for the 3D Poisson problem () with various numbers of cores-subdomains are presented. It should be mentioned that using distance two neighborhood in the subspace compression technique results in less aggregates compared to distance one neighborhood, which decreases the memory requirements. However, since the required number of iterations for convergence increases, the performance of the iterative scheme is downgraded.

Furthermore, the parallel performance of the proposed method is compared to sequential preconditioned GMRES() in conjunction with dual threshold incomplete factorization (ILUT()) preconditioner, compare [1]. The speedup of PPGMRES() in conjunction with MPMSC scheme with respect to the preconditioned GMRES() in conjunction with the ILUT() preconditioner, with and , and various numbers of cores, for the 3D Poisson problem (), is presented in Figure 7. The preconditioned GMRES() in conjunction with the ILUT() preconditioner required 2 () iterations for convergence to the prescribed tolerance. The PPGMRES() in conjunction with the MPMSC scheme performed better, in terms of total execution time, compared to the preconditioned GMRES() in conjunction with the ILUT() preconditioner, especially for large numbers of cores-subdomains.

##### 3.2. UFL Matrices

In order to assess the applicability, scalability, and performance of MPMSC scheme for solving general large sparse linear systems, seven matrices from the University of Florida sparse matrix collection, compare [37], have been considered. Information concerning the aforementioned matrices is given in Table 4. The right-hand-side vectors for atmosmodd, atmosmodl, and thermal2 problems were given by the University of Florida sparse matrix collection, compare [37], whereas, for the rest of the problems, they were computed as the product of coefficient matrix and the solution vector set to .

The convergence behavior and the parallel performance of the PPGMRES() method, in conjunction with the MPMSC scheme for the atmosmodd, atmosmodl, tmt_sym, apache2, and cage14 problems, are presented in Table 5. In Tables 6 and 7, the convergence behavior and the parallel performance of the PPGMRES() method, in conjunction with the MPMSC scheme for the ecology2 and the thermal2 problems, are given. In Figures 8 and 9, the speedup of the PPGMRES() method, in conjunction with the MPMSC scheme for the ecology2 and the thermal2 problem, is shown.

For the ecology2 problem, the local coefficient matrices have reduced number of nonzero elements and hence Pre-Time is a small fraction of the Total-Time, not affecting the Total-Speedup curve, since the factorization requires reduced number of floating point operations. In contrast, for the thermal2 problem compared to ecology2 problem, the local coefficient matrices have more nonzero elements resulting in increased Pre-Time, since the factorization requires increased number of floating point operations, especially for small number of cores/subdomains. As the number of cores/subdomains increases, the Pre-Time decreases rapidly, because there are less nonzero elements in the local coefficient matrices resulting in improved performance of the factorization. Therefore, superlinear speedup is achieved for the thermal2 problem and not for the ecology2 problem.

The speedup of PPGMRES() in conjunction with MPMSC scheme with respect to the preconditioned GMRES() in conjunction with the ILUT() preconditioner, with and , and various numbers of cores, for the thermal2 problem, is presented in Figure 10. The preconditioned GMRES() in conjunction with the ILUT() preconditioner required 11 () iterations for convergence to the prescribed tolerance. The PPGMRES() based on MPMSC scheme performed better, in terms of total execution time, compared to the preconditioned GMRES() in conjunction with the ILUT() preconditioner, especially for large numbers of cores-subdomains.

Ιt should be noted that the preconditioned GMRES() in conjunction with the ILUT() preconditioner for the ecology2 problem exceeds the available memory resources and does not converge to the prescribed tolerance for any compatible choice of or not exceeding available memory resources.

#### 4. Conclusion

The proposed scheme, namely, multiprojection method with subspace compression, has been shown to be effective in solving various problems in distributed memory parallel systems. Further, the PPGMRES() in conjunction with the MPMSC scheme scales up to a large number of cores exhibiting superlinear speedups, especially for problems that require large computational work in the initialization phase of the preconditioning scheme. The proposed scheme has improved convergence behavior as the number of subdomains increases compared to the extant domain decomposition based preconditioners. Future work will be concentrated towards using asynchronous and neighboring communications between heterogeneous compute nodes, in order to assess the performance of the method in modern cloud environments. Moreover, the choice of the most suitable parallel graph partitioning algorithm, according to the performance and the partitioning quality, will be studied.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

The authors would like to express their thanks to Professor Dana Petcu, Department of Computer Science, West University of Timisoara (UVT), for the provision of computational facilities through the UVT HPC Center.