#### Abstract

It is very time consuming to solve fractional differential equations. The computational complexity of two-dimensional fractional differential equation (2D-TFDE) with iterative implicit finite difference method is . In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on single Intel Xeon X5540 CPU runs 3.16–4.17 times faster than the serial algorithm on single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that the parallel computing technology will become a very basic method for the computational intensive fractional applications in the near future.

#### 1. Introduction

Building fractional mathematical models for specific phenomenon and developing numerical or analytical solutions for these fractional mathematical models are very hot in recent years. Fractional diffusion equations have been used to represent different kinds of dynamical systems [1]. But the fractional applications are rare. One reason for rare fractional applications is that the computational cost of approximating for fractional equations is too much heavy. The idea of fractional derivatives dates back to the 17th century. A fractional differential equation is a kind of equation which uses fractional derivatives. Fractional equations provide a powerful instrument for the description of memory and hereditary properties of different substances.

There has been a wide variety of numerical methods proposed for fractional equations [2, 3], for example, finite difference method [47], finite element method [8, 9], spectral method [10, 11], and meshless techniques [12]. Zhuang and Liu [4] presented an implicit difference approximation for two-dimensional time fractional diffusion equation (2D-TFDE) on a finite domain and discussed the stability and convergence of the method. The numerical result of an example agrees well with their theoretical analysis. Tadjeran and Meerschaert presented a numerical method, which combines the alternating directions implicit (ADI) approach with a Crank-Nicolson discretization and a Richardson extrapolation to obtain an unconditionally stable second-order accurate finite difference method, to approximate a two-dimensional fractional diffusion equation [13]. Two ADI schemes based on the approximation and backward Euler method are considered for the two-dimensional fractional subdiffusion equation [14].

It is very time consuming to numerically solve fractional differential equations for high spatial dimension or big time integration. Short memory principle [15] and parallel computing [16, 17] can be used to overcome this difficulty. Parallel computing is used to solve computation intensive applications simultaneously [1821]. Large scale applications in science and engineering such as particle transport [2224], different linear and nonlinear systems [25], nonnumerical intelligent algorithm [26], and computational fluid dynamics [27] can rely on parallel computing. Diethelm [17] implemented the fractional version of the second-order Adams-Bashforth-Moulton method on a parallel computer and discussed the precise nature of the parallelization concept. This is the first attempt for parallel computing on fractional equations. Following that, Gong et al. [16] presented a parallel algorithm for one-dimensional Riesz space fractional diffusion equation with explicit finite difference method. The numerical solution of Riesz space fractional equations has global dependence on grid points, which means the approximation of a grid point will depend on the approximation of all grid points in one time step. The numerical solution of time fractional equations has global dependence on time steps, which means that the approximation of a grid point will depend on the approximation of the grid point in all time steps. Global dependence means the nonlocal property of fractional deviates on time or space. Explicit method is easy to be parallelized but is restrict by its stability condition. Implicit method is hard to be solved by Gauss elimination method and often uses the iterative scheme. Until today, the power of parallel computing for high dimensional and time fractional differential equations has not been tried.

This paper focuses on the two-dimensional time fractional diffusion equation studied by Zhuang and Liu [4]: where . The fractional derivative is in the Caputo form.

#### 2. Background: Numerical Solution

The fractional derivative of in the Caputo sense is defined as [15]

If is continuous bounded derivatives in for every , we can get

Define ,,,,, and , for , , and . Let , , , , , and be the numerical approximation to , , , , and . We can get the implicit approximating scheme [4] for (1): where , , and . The and are the step size along X and Y directions defined above.

#### 3. Parallel Algorithm

##### 3.1. Analysis

Let , and let ; (4) can be rewritten as

The explicit schemes are conditionally stable and need very small for high dimensional problems for both classical and fractional equations. The implicit schemes are unconditionally stable but need to get the inverse of the coefficient matrix. Sometimes the sparse coefficient matrix is too large, making a direct method too difficult to use. So, the iterative method can be used to avoid matrix inverse: until is smaller than a predefined threshold . are the iterative variables. are the known variables for the unknown time step.

It is very time consuming to solve the 2D-TFDE by iterative method of (6). For determining and assuming if there are iterations for each time step on average, there are about arithmetical logical operations ignoring the computation of the coefficients. So, the computational complexity is , which is much more heavy than the classical integer order 2D partial differential equations .

Besides the heavy computational cost, the memory space requirement is the other problem. Because each unknown time step needs to use all the values of the previous time steps, all the values of need to be stored into the memory space. When is big enough, the memory complexity is , which is far bigger than the classical integer order 2D partial differential equations .

The computation of (6) can be divided into two parts.(i). The unknown value of grid point at the time step relies on the value of grid point at all previous time steps of .(ii). The unknown value of grid point relies on the value of . The data dependence of 2D-TFDE is shown in Figure 1. relies on the neighboring grid points at the same time step and the same position of all the previous time steps.

##### 3.2. Task Distribution Model and Data Layout

The task distribution of the total computation should be designed on distributed memory systems, with the goal of making the total computations as efficient as possible. There are three main issues in choosing a task distribution model for these computations:(i)load balance: ensure splitting of the computations reasonably evenly among all computing processors/processes throughout the time stepping;(ii)less communication: the task distribution model should keep the communication among different computing processes as less as possible;(iii)convenient programming: the parallel algorithm based on the task distribution model should not change the serial algorithm too much. The goal of keeping attention on these issues is achieving high execution efficiency and high scalability of the parallel algorithm on distributed memory systems for 2D-TFDE.

Refer to (6). computation has no data dependence. computation has data dependence among neighboring grid points. There are mainly two kinds of task distribution models. The first one is one-dimensional distribution (ODD): splitting the domain of all grid points along the X or Y direction on average. The task distribution model of the parallel algorithm [16] for the one-dimensional Riesz space fractional equation is ODD. The parallel algorithm based on ODD will not change the serial algorithm much and the load balance is guaranteed. If task is divided along X direction and is very big, the communication will influence the scalability of the parallel algorithm. The second one is two-dimensional distribution (TDD): splitting the domain of all grid points along the X and Y direction on average. So, the computing processes have a two-dimensional grid layout, with process id and . are the dimension size of the processes grid. The task distribution with TDD is shown in Figure 2.

With the TDD, the data layout is described in Figure 3. Each subdomain with a process may have less than four virtual boundaries to receive the boundary data from its nearest neighbors. The virtual boundary is shown with dotted lines. The process has four virtual boundaries. The process only has three virtual boundaries since there is no process that stays on its right hand. A virtual boundary may have several layer grid points, which depends on the discrete scheme on space. In this paper, there is only one layer grid point for a virtual boundary with (4). In every iteration of (6), the processes exchange the data near the virtual boundaries shown in Figure 3. After the exchange, every process performs its own computation according to (6).

##### 3.3. Implementation

The parallel algorithm for 2D-TFDE uses the mechanisms of process level parallelism. The process level parallelism is a kind of task level parallelism. The parallel algorithm for (1) is described in Algorithm 1.

 (1) init parallel environment (2) for  all MPI processes do in parallel (3)    get the input parameters like , , , , , . (4)    allocate local memory , , , , Part1, and so forth (5)    init variables and arrays (6)    get process id (7)    compute the initial condition with and boundary condition (8)    record time (9)    for     to     do (10)     compute , , et al. (11)      with (12)      with (13)     with (14)     for     to     do (15)      with (16)     while     do (17)      ( ) with (18)      if     then (19)       send right boundary to its right neighbor (20)       receive left boundary of its right neighbor (21)      if      then (22)       send top boundary to its top neighbor (23)       receive bottom boundary of its top neighbor (24)      if     then (25)       send left boundary to its left neighbor (26)       receive right boundary of its left neighbor (27)      if     then (28)       send bottom boundary to its bottom neighbor (29)       receive top boundary of its bottom neighbor (30)        with (31)     get global maximum of ϵ of all processes (32)        with (33)  record time (34)  output (35)  stop parallel environment

Each process only allocates its local memory. Assuming are divisible by , the process with four virtual boundaries will allocate memory space for array . The calculation of process id has three steps:step 1:get the MPI global id ;step 2:;step 3:.

The computations of , , , and so forth depend on the particular functions of coefficient and source terms. Performing these computations, every time step is a good choice. If these computations are performed out of the main loop (lines 9–32), a lot of memory space is required. If these computations are performed in the “While" loop (lines 16–32), it is too time consuming. The stands for the zero time step and stands for . means the iteration . If a process has neighbors, it should exchange the boundary data with its neighbors. The received boundary data are stored into the designed virtual boundaries. The lines 3–7 of Algorithm 1 are the preprocessing for the parallel algorithm. The lines 9–32 are the main time marching loops. are used to record the execution time.

#### 4. Experimental Results and Discussion

The experiment platform is a cluster with distributed memory system (DSM) architecture. One computing node consists of two Intel Xeon E5540 CPUs. The specifications of the cluster are listed in Table 1. The code runs on double precision floating point operations and is compiled by the mpif90 compiler with level three optimization (-O3). For convenience to compare the runtime, the inner loop (lines 16–32) of Algorithm 1 is fixed as 3.

##### 4.1. Numerical Example and Convergence of the Parallel Algorithm

The following time fractional () differential equation [4] was considered: where , and is the boundary of . The exact solution of the above equation is .

The computational results for different at and are shown in Figure 4. Figure 4 shows that the order of the fractional time derivative governs the value of unknown . With the increase of to 1, (1) approaches the classical PDE. Figure 5 shows the numerical solutions with .

The parallel algorithm compares well with the exact analytic solution to the fractional partial differential equation in this test case of (7) with , shown in Figure 6. The and are and . The maximum absolute error is .

##### 4.2. Performance Improvement

For fixed , the performance comparison between single process and four processes (single CPU) is shown in Figure 7. The X step number in (6) is , which is the x-coordinate of Figure 7. ranges from 2048 to 10240. With , the runtime of one process is 23.45 seconds and the runtime of four processes is 6.64 seconds. The speedup is 3.53. With , the runtime of one process is 803.88 seconds and the runtime of four processes is 192.76 seconds. The speedup is 4.17. From Figure 7, the parallel algorithm with fixed is more than 4 times faster than the serial algorithm.

For fixed , the performance comparison between single process and four processes is shown in Figure 8. For single process, the X, Y step number is . For four processes, the X, Y step number is 1280 with . ranges from 16 to 512. With , the runtime of one process is 17.63 seconds and the runtime of four processes is 4.65 seconds. The speedup is 3.79. With , the runtime of one process is 4415.78 seconds and the runtime of four processes is 1394.99 seconds. The speedup is 3.16. The performance of four processes is about 3.2 times higher than the performance of single process with .

##### 4.3. Scalability

The scalability of the parallel algorithm on the large scale cluster system is shown in Figure 9. The technical specifications of the cluster system are listed in Table 1. is fixed with 10 for all conditions. Each process has the same with and . varies from 16650, 33300, and 49950 for 9, 36, and 81 processes. The runtime of 9 processes is 83.02 seconds and the runtime of 81 processes is 94.08 seconds. The parallel efficiency of 81 processes is 88.24% compared with 9 processes. Here, the parallel efficiency is defined as the ratio of the runtime of different number of processes with the same work load on each process.

##### 4.4. Discussion

The parallel Algorithm 1 will have good parallel scalability on distributed memory system. From Figure 3, we can see that each subdomain has only virtual boundary at every direction (top, bottom, left, and right). Assuming that the size of the subdomain is , the inner iteration of line 16 in Algorithm 1 has about arithmetic operations with precomputed. It needs to establish 8 communications for neighbors except the global communication for . The arithmetic operation of each time step besides the inner iteration is constant as . is bigger than . The communication data is grid point. Assuming that finishing one arithmetic operation needs time and there are inner iterations, the computing time of each time step is . Assume that is the time to establish the communication, is the transform time for a grid point, and is the global communication time. So, the total communication time for a time step is . The communication/computation ratio is as follows: The computation time is determined with the multiplication of and the communication time is determined with the addition of and . The extreme of is as follows: That means we can enhance the parallel efficiency by enlarging the size of subdomain.

The time and number of grid points will affect the convergence property. The exact solution of (7) shows that .(1)The bigger becomes, the more inner iterations are needed. With , , the first inner time step needs 5 Jacobi iterations and the last inner time step needs 31 iterations for . For , becomes 7 and becomes 61.(2)The bigger becomes, the more inner iterations are needed. The is fixed as 1.0. For , becomes 6 and becomes 66. For , becomes 3 and becomes 136. The reason for the phenomenon above is that () changes dramatically if the source term is big. The iteration times with are shown in Table 2.

The parallel algorithm is compatible with short memory principle [15]. The computing time will become small with a smaller , which is determined by . The Gauss-Seidel iteration method will have better convergent speed than Jacobi iteration method, but it is hard to parallelize the Gauss-Seidel method.

As analyzed in Section 3.1, the computational complexity is . Define the following function: varies almost linearly, as shown in Figure 10. Figure 10 shows that the heavy computation is a real challenge from the point of view of computer science.

The heavy memory usage is the other challenge besides the heavy computation. Ignoring the memory usage of the coefficients and the source term , needs bytes memory space. It needs 100 GB memory with ,, and . As discussed above, the bigger the are, the smaller the (communication/computation ratio) is. So, the heavy memory usage will limit the parallel efficiency of the parallel algorithm. This kind of contradictions exists in many places. One contradiction is the easy parallelization with bad convergence of the Jacobi iterative method. Another contradiction is the hard parallelization and good convergence of the Gauss-Seidel iterative method.

#### 5. Conclusions and Future Work

In this paper, we present a parallel algorithm for 2D-TFDE with implicit differential method. The parallel solution is analyzed and implemented with MPI programming model. The experimental results show that the parallel algorithm compares well with the exact solution and can scale well on large scale distributed memory cluster system. So, the power of parallel computing for the time consuming fractional differential equations should be recognized.

The numerical solution for fractional equations is very computationally intensive. As a part of the future work, first, the numerical solution of high dimensional space fractional equations has global reliance on almost whole grid points, which is very challenging for real applications. Second, the Krylov subspace method with preconditioner will enhance the convergence for (4) and should be paid attention to. Third, accelerating the parallel algorithm on heterogeneous system [28] should be paid attention to.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

This research work is supported by the National Natural Science Foundation of China under Grant no. 11175253, also by 973 Program of China under Grant no. 61312701001. The authors would like to thank the anonymous reviewers for their helpful comments also.