#### Abstract

An efficient parallel iterative method with parameters is investigated for solving banded linear systems on distributed-memory multicomputers. At each iterative step, the algorithm proceeds in alternating directions, based on a splitting of the coefficient matrix and a proper choice of parameters. The algorithm requires only two communications between adjacent processors at each iterative step, so the method has high parallel efficiency. Convergence theorems are given for several classes of coefficient matrices, such as Hermitian positive definite matrices and M-matrices. Numerical experiments on an HP rx2600 cluster show that the algorithm has higher efficiency and lower memory requirements than the multisplitting method, and a considerable advantage in CPU time over the BSOR method. For Example 1, the efficiency is significantly better than that of the BSOR method. For Example 2, the speedup and efficiency of the algorithm are better than those of the PEk inner iterative method.

#### 1. Introduction

In recent years, high-performance parallel computing technology has developed rapidly. Large sparse banded linear systems are frequently encountered when finite difference or finite element methods are used to discretize partial differential equations in many practical scientific and engineering computing problems, especially in computational fluid dynamics (CFD). Many such problems can be solved efficiently on sequential computers but are difficult to solve efficiently on parallel computers, because communication takes up a significant part of the total execution time. More effort is therefore needed to develop more efficient parallel algorithms.

Parallel algorithms for large sparse linear systems have been widely investigated in [1–8]. In particular, the multisplitting algorithm in [1] is currently a popular method. In [3], the authors provide a method for solving block-tridiagonal linear systems in which local lower and upper triangular incomplete factors are combined into an effective approximation of the global incomplete lower and upper triangular factors of the coefficient matrix, based on two-dimensional domain decomposition with small overlapping. The algorithm is applicable to any preconditioner of incomplete type. Duan et al. presented a parallel strategy based on the Galerkin principle for solving block-tridiagonal linear systems in [4]. In [5], a parallel direct algorithm based on the divide-and-conquer principle and the decomposition of the coefficient matrix is investigated for solving block-tridiagonal linear systems on distributed-memory multicomputers; it requires only two communications between adjacent processors. In [7], a direct method for solving circular-tridiagonal block linear systems is presented. Other parallel algorithms for solving linear systems can be found in [9–14]. The algorithm in this paper builds on the advantages of the one in [2].

The goal of this paper is to develop an efficient and stable parallel iterative method for distributed-memory multicomputers and to give some theoretical analysis. We appropriately choose the splitting matrices and to establish the iterative scheme. Two examples were run on the HP rx2600 cluster; the experimental results indicate that the parallel algorithm achieves higher parallel speedup and efficiency than the multisplitting one.

The rest of this paper is organized as follows. In Section 2, the parallel iterative algorithm is described. In Section 3, the parallel iterative process is discussed. The convergence analysis is given in Section 4. The numerical results are shown in Section 5 and analyzed in Section 6. Section 7 presents the conclusions.

#### 2. Parallel Algorithm

Let a banded linear equation be represented as where is a matrix, and are and matrices, respectively, and and are -dimensional real column vectors. In general, assuming that there are processors available and (, ), we denote the th processor by (for ) and split the coefficient matrix into .

Then, we use the alternating direction iterative scheme in [2] and obtain the new iterative scheme here and are nonsingular matrices and . And hence (2) is changed into here, is the so-called iterative matrix and .

Obviously, the matrices and should be nonsingular, and the choice of and is the key to solving the linear systems by (3) in this paper. If and are suitable, the algorithm has good parallelism and low CPU time. So we choose and as follows

From (3), let ; we obtain then the detailed calculation procedure is as follows: here, and is a -dimensional row vector.

Let ; then we have , and where and is a -dimensional row vector. Then according to the aforementioned formulas, we get .
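As a rough illustration of the two-sweep structure of an alternating-direction splitting iteration, the following Python sketch alternates a forward and a backward triangular sweep. The splitting used here (diagonal plus lower triangle, then diagonal plus upper triangle, that is, symmetric Gauss-Seidel without parameters) is a stand-in chosen for illustration and is not the pair of splitting matrices chosen in this paper; the function names, the tolerance, and the stopping test on successive iterates are likewise assumptions.

```python
def forward_sweep(A, b, x):
    """First half-step: solve (D + L) y = b - U x by forward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = b[i]
        for j in range(i):            # strictly lower part uses new values y
            s -= A[i][j] * y[j]
        for j in range(i + 1, n):     # strictly upper part uses old values x
            s -= A[i][j] * x[j]
        y[i] = s / A[i][i]
    return y


def backward_sweep(A, b, y):
    """Second half-step: solve (D + U) x = b - L y by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = b[i]
        for j in range(i):            # strictly lower part uses half-step values y
            s -= A[i][j] * y[j]
        for j in range(i + 1, n):     # strictly upper part uses new values x
            s -= A[i][j] * x[j]
        x[i] = s / A[i][i]
    return x


def alternating_iteration(A, b, tol=1e-10, max_iter=500):
    """Repeat the two sweeps until successive iterates differ by less than tol."""
    x = [0.0] * len(b)
    for k in range(1, max_iter + 1):
        x_half = forward_sweep(A, b, x)
        x_new = backward_sweep(A, b, x_half)
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new, k
        x = x_new
    return x, max_iter
```

For a small symmetric, diagonally dominant tridiagonal matrix (an illustrative case only), `alternating_iteration(A, b)` returns the approximate solution together with the number of two-sweep iterations used.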

#### 3. Process of Parallel Iterative Algorithm

Here, we show the storage method and computational procedure of the parallel algorithm as follows.

##### 3.1. Storage Method

The coefficient matrix is divided into from left to right as banded order. Let vectors .

The corresponding relationship is as follows:

Then, assign () rows to each processor. The processor stores the corresponding vectors , with . Here and are the upper and lower bandwidths, respectively. This storage scheme saves considerable memory, although it makes programming more difficult. Note that if is not divisible by , some processors store row blocks of , sequentially, and the others store row blocks; meanwhile, each processor stores the corresponding vectors of and . This keeps the load on the processors nearly balanced and shortens waiting time.
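The row distribution described above can be sketched as follows. This is a minimal illustration, assuming that the first few processors receive the larger blocks when the number of rows is not divisible by the number of processors; the function name `block_row_ranges` and the half-open index convention are hypothetical and not taken from the paper.

```python
def block_row_ranges(n, p):
    """Split n rows among p processors as evenly as possible.

    Returns a list of half-open (start, stop) row ranges, one per processor.
    Assumption: the first n % p processors each take one extra row.
    """
    if p <= 0 or n < p:
        raise ValueError("need at least one row per processor")
    q, r = divmod(n, p)
    ranges = []
    start = 0
    for rank in range(p):
        rows = q + 1 if rank < r else q
        ranges.append((start, start + rows))
        start += rows
    return ranges
```

With this layout, block sizes differ by at most one row, which is what keeps the per-processor load close to balanced.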

##### 3.2. Cycle Process

performs a parallel communication to obtain , and then computes
and performs one step of LU decomposition, where , , , and are the th (for ) blocks of , , , and , respectively.

performs one parallel communication to obtain and then computes and performs one step of LU decomposition; here is the th (for ) block of .

On the processor, check whether the inequality ( is the error bound, ) holds. Stop if the inequality holds on every processor; otherwise return to and continue cycling until all inequalities are satisfied.
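The stopping rule of the cycle can be sketched in a serial simulation as below. The local test used here (maximum absolute change between successive iterates of a block) is an assumption, since the exact inequality is not reproduced in this text, and the function names are hypothetical. On a real distributed-memory machine the final combination step would typically be a logical-AND reduction across processors rather than a Python `all`.

```python
def local_converged(x_new_block, x_old_block, eps):
    """Local test on one processor's block (assumed max-norm of the update)."""
    return max(abs(a - b) for a, b in zip(x_new_block, x_old_block)) < eps


def all_converged(new_blocks, old_blocks, eps):
    """Global test: stop only if every processor's local test holds."""
    return all(
        local_converged(new, old, eps)
        for new, old in zip(new_blocks, old_blocks)
    )
```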

#### 4. Analysis of Convergence

To analyze the convergence of the parallel algorithm, we introduce some notation, definitions, and several lemmas.

Symbols and Definitions

(i) represents the space of real matrices.

(ii) represents the unit matrix of order .

(iii) , represent the conjugate transposes of , , respectively.

(iv) represents the inverse matrix of .

*Definition 1 (see [15]).* Suppose and , where and ; then is called normal splitting of matrix .

*Definition 2 (see [15]).* Suppose and , where ; then is called weak normal splitting of matrix .

*Definition 3 (see [15]).* Suppose and , where is a Hermitian positive definite matrix; then is called -normal splitting of matrix .

*Definition 4 (see [15]).* Let , if () and ; then the matrix is an M-matrix.

Here, we give some theoretical analysis for convergence of the parallel iterative algorithm.

Lemma 5 (see [9]). *Let , if the splitting is a weak normal splitting or normal splitting of coefficient matrix ; then if and only if .*

Lemma 6 (see [10]). *Let be an M-matrix. If any element of increases while the off-diagonal elements remain nonpositive, then the resulting matrix is also an M-matrix and .*

Lemma 7 (see [15]). *Let be a nonsingular Hermitian matrix. If is a -normal splitting of the matrix , then if and only if is a positive definite matrix.*

Theorem 8. *Let be a Hermitian positive definite matrix. If , , and , then the iterative scheme (3) is convergent for any vector .*

*Proof.* Since and
we have ; here , ,

Since

here
and let
then we have

here
Obviously, is a positive semidefinite matrix or a positive definite matrix. Hence the matrix
is a Hermitian positive definite matrix.

Therefore, is a -normal splitting of the matrix , and by Lemma 7 the iterative scheme of our algorithm is convergent.

By the theorem, we know that the parallel algorithm is convergent if is a Hermitian positive definite matrix.

Theorem 9. *Let be an M-matrix. If for , here , and is the diagonal element of ; then the iterative scheme (3) is convergent for any vector .*

*Proof.* Since , , and

we have

Here
Hence, we know that , , and , (), are all M-matrices by Lemma 6. Then , , , , and ; we obtain . Similarly, we can obtain , and .

Since for , we have and . That is, is obtained and is a normal splitting. Since is an M-matrix, we have ; hence by Lemma 5, and the iterative scheme (3) is convergent.

By the theorem, we know that the parallel algorithm is convergent if is an M-matrix and for .

#### 5. Numerical Examples

We performed two numerical experiments on the HP rx2600 cluster. The results are shown as follows.

*Example 1.* Consider a banded linear system ; here
Let the initial value be and . We apply this algorithm with the optimal relaxation factor, the multisplitting method, and the BSOR method to the systems on the HP rx2600 cluster. Here is the number of processors, is the run time (seconds), is the speedup ( on one processor divided by on all processors), is the number of iterations, is the efficiency (), and the error . See Tables 1, 2, and 3 and Figures 1 and 2.
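The two performance measures used in the tables can be computed as in the short sketch below. It assumes the usual definitions, speedup as one-processor run time divided by the run time on all processors and efficiency as speedup divided by the number of processors; the function name and argument names are hypothetical.

```python
def speedup_and_efficiency(t_one, t_all, p):
    """Return (speedup, efficiency) from run times in seconds.

    t_one: run time on one processor
    t_all: run time on p processors
    Assumption: efficiency is defined as speedup / p.
    """
    if t_one <= 0 or t_all <= 0 or p <= 0:
        raise ValueError("run times and processor count must be positive")
    speedup = t_one / t_all
    return speedup, speedup / p
```

For instance, with made-up timings of 8.0 s on one processor and 2.5 s on four processors (not measurements from this paper), the function gives a speedup of about 3.2 and an efficiency of about 0.8.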

*Example 2.* Consider an elliptic partial differential equation
equipped with the boundary conditions , ; here , , , , , , and are all constants.

We denote , . Using the finite difference method, we obtain two block-tridiagonal linear systems on condition that the step sizes . Then, we apply this algorithm with the optimal relaxation factor, the BSOR method, the PEk method, and the multisplitting algorithm to the systems on the HP rx2600 cluster. The numerical results are shown in Tables 4, 5, 6, and 7 and Figures 3 and 4.

#### 6. Results Analysis

From Tables 1 to 7, we can draw the following conclusions.

(i) The results of the parallel algorithm agree with the theoretical analysis. The conditions in the theorems are only sufficient conditions.

(ii) The numerical results show that the parallel algorithm has good parallelism.

(iii) For Examples 1 and 2, the efficiency of the algorithm is better than that of the multisplitting method, and its parallel speedup is as good as that of the BSOR method. For Example 2, the efficiency of the algorithm is also better than that of the PEk method.

(iv) The parallel algorithm is easy to implement on a parallel computer and is more flexible and simpler in practice than the one in [1].

#### 7. Conclusions

An efficient parallel iterative method on a distributed-memory multicomputer has been presented for solving large banded linear systems. We make full use of the decomposition of the coefficient matrix to choose and to save computational cost. The storage strategy saves memory space. The algorithm requires only two communications between adjacent processors at each iterative step. Theoretical analysis and experiments show that the algorithm has good parallelism and high efficiency. The results also confirm the convergence theorems. When the coefficient matrix is a Hermitian positive definite matrix or an M-matrix, the parallel algorithm is convergent provided the given conditions hold. Our algorithm has higher efficiency than the multisplitting one.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

This research was supported by the National Natural Science Foundation of China under Grant nos. 11002117 and 11302173 and Xianyang Normal University Research Foundation under Grant nos. 09XSYK209 and 09XSYK204.