Abstract

This work introduces a new parallel wavelet-based algorithm for the algebraic multigrid method (PWAMG), using a variation of the standard parallel implementation of discrete wavelet transforms. This new approach eliminates the grid coarsening process of the traditional algebraic multigrid setup phase, simplifying its implementation on distributed memory machines. The PWAMG method is used as a parallel black-box solver and as a preconditioner on linear systems of equations arising from circuit simulations and 3D finite element electromagnetic problems. The numerical results evaluate the efficiency of the new approach as a standalone solver and as a preconditioner for the biconjugate gradient stabilized iterative method.

1. Introduction

The algebraic multigrid (AMG) method is one of the most efficient algorithms for solving large sparse linear systems. Especially in the context of large-scale problems and massively parallel computing, the most desirable property of AMG is its potential for algorithmic scalability: in the ideal case, for a matrix problem with n unknowns, the number of iterative V-cycles required for convergence is independent of the problem size n, and the work in the setup phase and in each V-cycle is linearly proportional to n [1, 2]. For these reasons, the need to solve linear systems arising from problems posed on extremely large, unstructured grids has generated great interest in parallelizing AMG.

However, there are two major problems. First, the core of the AMG setup phase includes the grid coarsening process, which is inherently sequential in nature [1–3]. For traditional AMG, this coarsening scheme can lead to computational complexity growth as the problem size increases, resulting in elevated memory use and execution time and in reduced scalability [4, 5]. Second, most parallel AMG algorithms are based on domain decomposition ideas, which have proved to be very efficient but require a hierarchy of meshes, eliminating the algebraic character of AMG and precluding its use as a black-box method.

Due to these difficulties and the importance of developing efficient parallel preconditioners for large, sparse systems of linear equations, the investigation of new parallel approaches has been the main subject of many researchers [6–12]. In this context, a great amount of work has aimed at extracting some parallelism from serial preconditioners such as factorization-based methods, which provide effective preconditioning on sequential architectures [8–10]. However, scalable parallel implementation of incomplete factorization preconditioners presents many limitations and challenges, and although some interesting approaches have shown good performance for certain classes of problems, truly scalable parallel algorithms for this kind of preconditioner do not seem to be available [8].

The development of preconditioning approaches with inherently parallel characteristics has also received much attention in recent years. In particular, approximate inverse preconditioners have proven extremely promising and effective for the solution of general sparse linear systems of equations [6, 7, 10–12]. Unfortunately, this kind of preconditioner also has some drawbacks: in general, as the discretization is refined the amount of work per grid point grows with problem size, and it is inherently difficult to approximate the inverse of very ill-conditioned linear systems with a sparse matrix [13].

In this work we introduce a new parallel algorithm for wavelet-based AMG (PWAMG) using a variation of the parallel implementation of discrete wavelet transforms. This new approach eliminates the grid coarsening process present in the standard AMG setup phase, significantly simplifying the implementation on distributed memory machines and allowing the use of PWAMG as a parallel black-box solver and preconditioner. The parallel algorithm uses the Message Passing Interface (MPI), which provides a standard for message passing on parallel computers and workstation clusters.

A sequential version of WAMG was introduced recently [14], and it has proved to be a very efficient and promising method for several problems related to the computation of electromagnetic fields [15, 16]. Here, the method is used as a parallel black-box solver and as a preconditioner on linear systems resulting from circuit simulations and 3D finite element electromagnetic problems. The numerical results evaluate the efficiency of the new approach as a standalone solver and as a preconditioner for the biconjugate gradient stabilized iterative method.

2. The Discrete Wavelet Transform

The discrete wavelet transform (DWT) corresponds to the application of low-pass and high-pass filters, followed by the elimination of one out of every two samples (decimation or subsampling). The discrete signal, which in one dimension is represented by a vector of values, is filtered by a set of digital filters that are associated with the wavelet adopted in the analysis.

Starting from a vector y of length N at level 0, two sets of coefficients are generated at each level l of the process: a set d^l of wavelet coefficients (detail coefficients) and a set c^l of approximation coefficients. This procedure can then be applied again, using c^l as the input vector to create new coefficients c^(l+1) and d^(l+1), successively.

Very important classes of filters are those with finite impulse response (FIR). The main characteristic of these filters is their convenient time-localization properties. These filters originate from compact-support wavelets and are calculated analytically. An example of a FIR filter is the length-2 scaling filter with Haar or Daubechies-2 coefficients, given by (2.1):

\[
h_{D2} = \left[ h_0, h_1 \right] = \left[ \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}} \right]. \tag{2.1}
\]

For more details about compact FIR filters see [17].

For a given vector y = (y_1, y_2, ..., y_N), the Haar wavelet transform creates an approximation vector c and a detail vector d according to (2.2) and (2.3), respectively:

\[
c = \left( c_1, c_2, \ldots, c_{N/2} \right), \qquad c_i = \frac{y_{2i-1} + y_{2i}}{\sqrt{2}}, \tag{2.2}
\]

\[
d = \left( d_1, d_2, \ldots, d_{N/2} \right), \qquad d_i = \frac{y_{2i-1} - y_{2i}}{\sqrt{2}}. \tag{2.3}
\]

It is not difficult to see that this procedure will work only if N is even.
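As a concrete illustration of (2.2) and (2.3), the following C++ sketch applies one level of the Haar decomposition to a vector of even length; the function name and signature are only illustrative and are not part of any implementation described in this paper.

#include <vector>
#include <cmath>

// One level of the 1D Haar transform: splits y (assumed of even length) into
// approximation coefficients c (low-pass), following (2.2), and detail
// coefficients d (high-pass), following (2.3).
void haar_level(const std::vector<double>& y,
                std::vector<double>& c, std::vector<double>& d)
{
    const std::size_t half = y.size() / 2;
    const double s = 1.0 / std::sqrt(2.0);
    c.assign(half, 0.0);
    d.assign(half, 0.0);
    for (std::size_t i = 0; i < half; ++i) {
        c[i] = s * (y[2 * i] + y[2 * i + 1]);  // c_i = (y_{2i-1} + y_{2i}) / sqrt(2)
        d[i] = s * (y[2 * i] - y[2 * i + 1]);  // d_i = (y_{2i-1} - y_{2i}) / sqrt(2)
    }
}

Applying haar_level repeatedly to the successive approximation vectors c reproduces the multilevel decomposition described above.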

In the 2D case, in which the discrete signal is represented by a matrix, the DWT is obtained through the application of successive steps of the 1D transform to the rows and columns of the matrix. This process generates a matrix formed by four types of coefficients: the approximation coefficients and the detail coefficients (horizontal, vertical, and diagonal), as illustrated in Figure 1. Blocks H and G represent, respectively, the low-pass and high-pass filters.

In both cases, the approximation coefficients keep the most important information of the discrete signal, whereas the detail coefficients have very small values, close to zero. The approximation coefficients contain low-pass information, which is essentially a low-resolution version of the signal, and represent a coarse version of the original data.

3. A Wavelet-Based Algebraic Multigrid

The approximation property of wavelets is exploited by the wavelet-based algebraic multigrid (WAMG) to create the hierarchy of matrices. The method uses a modified discrete wavelet transform to construct the transfer operators and the hierarchy of matrices in the multigrid approach.

A two-dimensional modified DWT is applied to produce an approximation of the matrix at each level of the wavelet multiresolution decomposition process. An operator formed only by low-pass filters is created and applied to the rows and columns of the matrix. This same operator is used for the intergrid transfer in the AMG. If a length-2 (first-order) scaling filter is used, for example, which means applying the operation defined in (2.2) to the rows and columns of the matrix, the restriction operator is defined in matrix form as (3.1):

\[
P_k^{k+1} =
\begin{pmatrix}
h_1 & h_0 & 0 & 0 & \cdots & \cdots & 0 \\
0 & 0 & h_1 & h_0 & \cdots & \cdots & 0 \\
\vdots & & & \ddots & & & \vdots \\
0 & \cdots & \cdots & 0 & 0 & h_1 & h_0
\end{pmatrix}_{(N/2) \times N}. \tag{3.1}
\]

The prolongation operator is defined in the usual form,

\[
P_{k+1}^{k} = \left( P_k^{k+1} \right)^{T}, \tag{3.2}
\]

and the matrix at the next level is obtained with the Galerkin condition,

\[
A_{k+1} = P_k^{k+1} A_k \left( P_k^{k+1} \right)^{T}, \tag{3.3}
\]

recalling that A_0 = A is the matrix of the original system.
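For the Haar filter, the Galerkin product in (3.3) has a particularly simple form: each entry of the coarse matrix is a scaled sum of a 2x2 block of the fine matrix. The following C++ sketch shows this for a dense matrix of even dimension; the dense storage and the function name are assumptions made only for illustration, since the actual method operates on sparse matrices.

#include <vector>
#include <cmath>

using Matrix = std::vector<std::vector<double>>;  // dense storage, illustration only

// Build A_{k+1} = P_k^{k+1} A_k (P_k^{k+1})^T for the length-2 (Haar) low-pass
// filter of (3.1), with h0 = h1 = 1/sqrt(2). Each coarse entry combines a 2x2
// block of the fine matrix:
// A_{k+1}(i,j) = 0.5 * (A(2i,2j) + A(2i,2j+1) + A(2i+1,2j) + A(2i+1,2j+1)).
Matrix galerkin_coarse(const Matrix& A)
{
    const std::size_t n = A.size() / 2;      // assumes A has even dimension
    const double h = 1.0 / std::sqrt(2.0);
    Matrix Ac(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            Ac[i][j] = h * h * (A[2*i][2*j]     + A[2*i][2*j + 1]
                              + A[2*i + 1][2*j] + A[2*i + 1][2*j + 1]);
    return Ac;
}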

Once the intergrid transfer operators and the hierarchy of matrices are created, the multigrid method (V-cycle) can be defined as usual, using a recursive call of the following two-level algorithm.

Algorithm 3.1 (Two-level multigrid).

Input: the right-hand side vector b, the original matrix A, and the coarse-grid matrix PAP^T.
Output: the approximation x̃.

(1) Choose an initial guess x and apply ν1 smoothing steps to Ax = b.
(2) Compute r = b − Ax.
(3) e = (PAP^T)^{-1} P r.
(4) x̃ = x + P^T e.
(5) Apply ν2 smoothing steps to Ax̃ = b.
(6) Return x̃.
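The following C++ sketch mirrors Algorithm 3.1 for a dense matrix, with ν1 = ν2 = 1 as used in this paper. The helper routines, the dense storage, and the use of repeated Gauss-Seidel sweeps as an approximate coarse solver are assumptions made only to keep the sketch self-contained; they are not the paper's implementation.

#include <vector>
#include <cmath>

using Vec = std::vector<double>;
using Matrix = std::vector<std::vector<double>>;   // dense storage, illustration only

// y = A x
static Vec matvec(const Matrix& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

// Gauss-Seidel sweeps on A x = b, used as smoother and (with more sweeps)
// as an approximate coarse solver in this sketch.
static void gauss_seidel(const Matrix& A, Vec& x, const Vec& b, int sweeps) {
    for (int s = 0; s < sweeps; ++s)
        for (std::size_t i = 0; i < A.size(); ++i) {
            double sum = b[i];
            for (std::size_t j = 0; j < A.size(); ++j)
                if (j != i) sum -= A[i][j] * x[j];
            x[i] = sum / A[i][i];
        }
}

// Haar restriction (P r) and prolongation (P^T e), matching operator (3.1).
static Vec restrict_vec(const Vec& r) {
    const double h = 1.0 / std::sqrt(2.0);
    Vec rc(r.size() / 2, 0.0);
    for (std::size_t i = 0; i < rc.size(); ++i)
        rc[i] = h * (r[2*i] + r[2*i + 1]);
    return rc;
}
static Vec prolong(const Vec& e, std::size_t n_fine) {
    const double h = 1.0 / std::sqrt(2.0);
    Vec ef(n_fine, 0.0);
    for (std::size_t i = 0; i < e.size(); ++i) {
        ef[2*i]     += h * e[i];
        ef[2*i + 1] += h * e[i];
    }
    return ef;
}

// Two-level cycle following Algorithm 3.1, with A the fine matrix and
// Ac = P A P^T the coarse matrix built in the setup phase.
Vec two_level(const Matrix& A, const Matrix& Ac, const Vec& b, Vec x)
{
    gauss_seidel(A, x, b, 1);                        // (1) pre-smoothing
    Vec Ax = matvec(A, x);
    Vec r(b.size());
    for (std::size_t i = 0; i < b.size(); ++i)
        r[i] = b[i] - Ax[i];                         // (2) residual
    Vec rc = restrict_vec(r);
    Vec e(rc.size(), 0.0);
    gauss_seidel(Ac, e, rc, 50);                     // (3) approximate coarse solve
    Vec ef = prolong(e, b.size());
    for (std::size_t i = 0; i < b.size(); ++i)
        x[i] += ef[i];                               // (4) correction x~ = x + P^T e
    gauss_seidel(A, x, b, 1);                        // (5) post-smoothing
    return x;                                        // (6)
}

Calling two_level recursively, with the coarse solve in step (3) replaced by another two-level cycle on the next matrix of the hierarchy, yields the V-cycle used in this paper.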

In this paper, a V-cycle multigrid with ν1 = ν2 = 1 is applied iteratively as a solver and also as a preconditioner inside the biconjugate gradient stabilized (BiCGStab) iterations.

3.1. The Choice of the Wavelet Filters

A common problem in the choice of the filters is the trade-off between fill-in control and wavelet properties. As the WAMG often deals with sparse matrices, controlling the number of nonzero entries is a very important task. In this case, if the matrix A_k is sparse, then the number of nonzero elements in the next-level matrix A_{k+1} = P_k^{k+1} A_k (P_k^{k+1})^T depends on the order of the filter used in the restriction and prolongation operators. In fact, the longer the filter used, the larger the number of nonzero entries in the next computed matrix. Consequently, most wavelet-based multigrid methods use short filters such as the Haar or Daubechies-2 coefficients [18–23]. This is also the case in this paper.

4. A Parallel Wavelet-Based Algebraic Multigrid

One advantage of the proposed approach is the elimination of the coarsening scheme from the setup phase. The application of the operation defined in (2.2) to the rows and columns of a matrix makes it possible to create an approximation without any information about meshes or the adjacency graph of the matrix. Thus the implementation on distributed memory machines becomes simpler, allowing the use of the method as a parallel black-box solver and preconditioner. Our approach, based on the parallel implementation of the discrete wavelet transform presented in [24], avoids any communication among processors in the setup phase if first-order filters are used.

The parallelization strategy starts by dividing the N rows of the matrix equally among the processors, in such a way that the 1D transform defined in (2.2) can be applied entirely locally to the rows. As each matrix column is equally divided, the transform in the column direction is carried out partially in each processing unit, according to the size of the column segment in its local memory. For np processors, for example, each processor applies the wavelet transform to the largest even number of column elements stored in its memory, that is, to at most N/np elements. If the number of locally stored elements is odd, the last element of the column in the local memory is left unchanged, as illustrated in Figure 2 for N = 10 and np = 2. In the figure, c^l_{i,j} denotes the ith local element of the level-l vector c in processor j.

Thus the resulting coarse level matrix is calculated entirely locally.
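A minimal C++ sketch of the local part of this setup-phase column transform is given below, assuming each process stores a contiguous block of a column; the function name is illustrative only.

#include <vector>
#include <cmath>

// Per-processor low-pass transform of the locally stored part of a column:
// the largest even number of local elements is transformed with the Haar
// filter, and a trailing odd element, if any, is kept unchanged, as
// described for Figure 2.
std::vector<double> local_column_lowpass(const std::vector<double>& col_local)
{
    const double h = 1.0 / std::sqrt(2.0);
    const std::size_t pairs = col_local.size() / 2;
    std::vector<double> out;
    out.reserve(pairs + (col_local.size() % 2));
    for (std::size_t i = 0; i < pairs; ++i)
        out.push_back(h * (col_local[2*i] + col_local[2*i + 1]));
    if (col_local.size() % 2 != 0)
        out.push_back(col_local.back());   // odd leftover element is unchanged
    return out;
}

For N = 10 and np = 2, each process holds 5 elements, transforms the first 4 into 2 approximation coefficients, and carries the 5th over unchanged, in agreement with Figure 2.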

In the solver phase, interprocessor communication is necessary only for operations involving matrices. More specifically, the vector must be updated after each smoothing step and after the matrix-vector product in the residual calculation (lines 1, 2, and 5 of Algorithm 3.1). In a matrix-vector product A*b = c, for example, the matrix A is distributed by rows, the vector b is shared by all processors, and the vector c is calculated in parallel, as illustrated in Figure 3 for 4 processors. The resulting vector c is then updated across the processors through the Message Passing Interface (MPI) library. This task is accomplished by using the MPI collective communication function MPI_Allgather [25]. The effect of the MPI_Allgather function is shown in Figure 4. It is important to highlight that only the resulting vector needs to be updated; each processor communicates to the others only the few elements that it stores. However, in order for the process to continue, the whole vector must be updated on each processor, and some kind of collective communication has to take place.
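The following C++/MPI sketch shows the row-distributed matrix-vector product and the vector update with MPI_Allgather described above. Dense local rows and a global size divisible by the number of processes are simplifying assumptions of the sketch, not properties of the actual implementation.

#include <mpi.h>
#include <vector>

// Row-distributed product c = A * b: A_local holds the locally owned rows,
// b is replicated on every process, and MPI_Allgather rebuilds the full
// result vector c on all processes.
std::vector<double> parallel_matvec(const std::vector<std::vector<double>>& A_local,
                                    const std::vector<double>& b,
                                    MPI_Comm comm)
{
    // Local part of the product: one entry per locally owned row.
    std::vector<double> c_local(A_local.size(), 0.0);
    for (std::size_t i = 0; i < A_local.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            c_local[i] += A_local[i][j] * b[j];

    // Each process contributes its chunk and receives the whole vector,
    // as in Figure 4 (assumes equal chunk sizes on all processes).
    std::vector<double> c(b.size(), 0.0);
    MPI_Allgather(c_local.data(), static_cast<int>(c_local.size()), MPI_DOUBLE,
                  c.data(),       static_cast<int>(c_local.size()), MPI_DOUBLE,
                  comm);
    return c;
}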

The PWAMG method in this work uses a hybrid Jacobi/Gauss-Seidel method as the smoother and the V-cycle as the resolution scheme. Gauss-Seidel smoothing is applied inside each processor and the Jacobi method is applied between processors.
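A possible reading of this hybrid smoother is sketched below in C++/MPI: within its own block of rows each process performs a Gauss-Seidel sweep, while entries owned by other processes keep their values from the previous sweep (Jacobi between processes), and the vector is then rebuilt with MPI_Allgather. The dense storage, equal row blocks, and function name are assumptions of the sketch.

#include <mpi.h>
#include <vector>

// One sweep of a hybrid Jacobi/Gauss-Seidel smoother for A x = b with A
// distributed by rows. row_offset is the global index of the first locally
// owned row; x is the full vector replicated on every process.
void hybrid_smooth(const std::vector<std::vector<double>>& A_local,
                   std::vector<double>& x,
                   const std::vector<double>& b_local,
                   int row_offset,
                   MPI_Comm comm)
{
    const int n_local = static_cast<int>(A_local.size());

    for (int i = 0; i < n_local; ++i) {
        const int gi = row_offset + i;               // global row index
        double sum = b_local[i];
        for (std::size_t j = 0; j < x.size(); ++j)
            if (static_cast<int>(j) != gi)
                sum -= A_local[i][j] * x[j];         // local entries: already updated;
                                                     // remote entries: previous sweep
        x[gi] = sum / A_local[i][gi];
    }

    // Exchange the updated local parts so that every process holds the
    // new full vector (assumes equal block sizes).
    std::vector<double> x_new(x.size(), 0.0);
    MPI_Allgather(&x[row_offset], n_local, MPI_DOUBLE,
                  x_new.data(),   n_local, MPI_DOUBLE, comm);
    x = x_new;
}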

5. The Numerical Test Problems

The parallel algorithm uses version 1 of MPI, which provides a standard for message passing on parallel computers and workstation clusters. The method has been implemented in C++ and tested on a homogeneous Beowulf cluster with 6 machine nodes (Core 2 Duo, 1 GB RAM) connected by a switched fast Ethernet network, as illustrated in Figure 5.

The multigrid V-cycle approach was applied as a resolution scheme to solve linear systems resulting from circuit simulations and finite element electromagnetic problems. The parallel method was also applied as a preconditioner for the iterative biconjugate gradient stabilized (BiCGStab) method, which was implemented using the same vector updating approach.

The first three matrices are related to 3D finite element electromagnetic analysis. The matrices 2cubes_sphere and offshore come from a study of edge-based finite-element time-domain solvers for 3D electromagnetic diffusion equations. The matrix dielFilterV2real is real and symmetric and comes from a high-order vector finite element analysis of a 4th-pole dielectric resonator [26]. The last matrix, which is real and unsymmetric, comes from a circuit simulation problem in the Tim Davis sparse matrix collection [27]. The matrix properties are presented in Table 1.

6. Results

The problems were first solved using the sequential version of the proposed method. For comparison, the BiCGStab method preconditioned by an incomplete LU factorization was used.

The results are presented in Table 2. In all cases, convergence is defined by ||r_n|| / ||b|| < 10^{-6}, where r_n is the residual vector at the nth iteration and the right-hand side vector b is chosen so that the solution is a unitary vector. The setup and solver times are in seconds, and n is the number of iterations.

The parallel results for the circuit simulation and electromagnetic problems are presented in Tables 3, 4, 5, and 6 for the PWAMG solver and preconditioner. The setup and solver times are in seconds and are presented in the form t1 (t2), where t1 is the processing time returned by the C clock() function and t2 is the total time spent in the corresponding phase, which includes the MPI communication time and is measured using the MPI function MPI_Wtime().

In some cases the CPU and wall-clock times are nearly equal, which means that processes waiting for synchronization are still consuming full CPU time. Moreover, the CPU time and the wall-clock time are practically the same in the setup phase, indicating that there is no interprocessor communication during the setup. This is a great advantage of the proposed approach, since this phase is the most time-consuming task in multilevel approaches, as can be seen in Table 2. For single-processor runs the times t1 and t2 are the same.

Three important aspects related to the parallel results should be highlighted.

(1) The proposed method uses a hybrid Jacobi/Gauss-Seidel method as a smoother and as the coarse-system solver. This method applies the Gauss-Seidel method inside each processor and the Jacobi method between processors. So, as the number of processors increases, both the smoother and the coarse solver change. Moreover, this approach is not reproduced in the sequential version of the method. This helps to explain the difference in the number of iterations between the parallel and sequential versions.

(2) The way in which the matrices are split among the processors (the same number of rows in each) may cause load imbalance in some cases, which can affect the overall performance of the method. For the matrix circuit5M_dc, for example, the number of nonzero elements in processor one is 25% larger than in processor two when 2 processors are used. A new way to divide the matrix among the processors should be investigated.

(3) Despite the use of a homogeneous Beowulf cluster, the type of network connection, a switched fast Ethernet network, proves to be an important bottleneck. Therefore, the effects of the fast Ethernet connection on the results should be evaluated separately. In order to do so, the time(p) required by the solver phase execution on p processors, in which there is interprocessor communication, was evaluated as a relative time with respect to the MPI π calculation example [25]. That is, time(p) is defined as

\[
time(p) = \frac{wtime(p)}{\pi\text{-}time(p)}, \tag{6.1}
\]

in which wtime(p) is the total wall-clock time spent in the corresponding phase, which includes the MPI communication time and is measured using the MPI function MPI_Wtime(), and π-time(p) is the time spent by the MPI π example, both with p processors. As an example, Table 7 presents the values of π-time(p), wtime(p), time(p), and speedup(p), for p = 1, ..., 6, for the matrix 2cubes_sphere, which were used to create the illustration in Figure 6(a). The values of π-time(p) were obtained as the mean of 5 runs. All the other results in Figure 6 were derived in a similar way.

As usual, the absolute speedup S_p is used for analyzing the parallel performance; it is defined by (6.2):

\[
S_p = speedup(p) = \frac{time(1)}{time(p)}, \tag{6.2}
\]

in which time(1) is the time spent by the best sequential algorithm and time(p) is as defined in (6.1).

As the MPI π example uses only point-to-point communication functions, the relative time approach can help to clarify whether the apparent poor scalability is due to a high collective communication cost or mainly due to the type of network connection used.

The solver and preconditioner speedups are illustrated in Figure 6 for the matrices circuit5M_dc, 2cubes_sphere, and offshore.

7. Conclusions

The PWAMG method proposed in this work has been applied as a black-box solver and preconditioner to circuit simulation and finite element electromagnetic problems with good results.

An important characteristic of the proposed approach is its small demand for interprocessor communication. In fact, no communication is required in the setup phase if first-order filters are used. This characteristic is confirmed by the setup times presented in Tables 3–6, in which the times measured by the MPI_Wtime() and C clock() functions are practically the same. This is a great advantage of the proposed approach, since this phase is the most time-consuming one.

In the solver phase, in which there is interprocessor communication, the numerical results show poor performance with more than 2 processors, even when the number of iterations does not increase with the number of processors. These results may be caused in part by the use of a collective communication function to update the vector after the matrix operations, which is motivated by the manner in which the matrix is divided among the processors.

To update the vector, each processor must send its part of the vector to, and receive the remaining parts from, all of the other processors. Naturally, as the number of processors increases, this work also grows.

An alternative to overcome this problem may be to apply a graph partitioning method and develop an approach in which each processor communicates only with its neighbors. Such an approach is under development, but it is outside the scope of this paper.

However, the type of network connection, a switched fast Ethernet network, proves to be an important bottleneck, and its effects should be evaluated separately. As the MPI π example uses only point-to-point communication functions, the relative time approach can help to clarify whether the apparent poor scalability is due to a high collective communication cost or mainly due to the type of network connection used. The speedup results based on the relative times show that the proposed approach is promising and that the kind of network connection that has been used is perhaps the most important drawback.

Moreover, in spite of the speedup being less than linear in some cases, it is important to mention an important aspect of this application: in the context of large sparse linear systems of equations, in which this work is situated, the problems have large memory requirements. In such cases, as presented in [28], the speedup necessary to be cost effective can be much less than linear, because a parallel program does not need p times the memory of a single processor; parallelizing a job rarely multiplies its memory requirements by p.

Finally, the number of iterations of the proposed method seems to be independent of the number of processors. However, it is necessary to carry out more tests with a larger number of processors in order to draw more definitive conclusions. The authors are currently seeking new resources and/or partnerships to enable the continuation of this work.

Acknowledgments

The authors would like to thank the São Paulo Research Foundation (FAPESP) (grants 06/59547-5 and 2011/06725-1) and Nove de Julho University (UNINOVE) for the financial support.