`The Scientific World JournalVolume 2014 (2014), Article ID 219580, 8 pageshttp://dx.doi.org/10.1155/2014/219580`
Research Article

## A Parallel Algorithm for the Two-Dimensional Time Fractional Diffusion Equation with Implicit Difference Method

1College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China
2Science and Technology on Space Physics Laboratory, Beijing 100076, China
3School of Computer Science, National University of Defense Technology, Changsha 410073, China
4Department of Engineering Science, University of Oxford, Oxford OX2 0ES, UK

Received 9 January 2014; Accepted 6 February 2014; Published 12 March 2014

Academic Editors: F. Liu, A. Sikorskii, and S. B. Yuste

Copyright © 2014 Chunye Gong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

It is very time consuming to solve fractional differential equations. The computational complexity of two-dimensional fractional differential equation (2D-TFDE) with iterative implicit finite difference method is . In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on single Intel Xeon X5540 CPU runs 3.16–4.17 times faster than the serial algorithm on single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that the parallel computing technology will become a very basic method for the computational intensive fractional applications in the near future.

#### 1. Introduction

Building fractional mathematical models for specific phenomenon and developing numerical or analytical solutions for these fractional mathematical models are very hot in recent years. Fractional diffusion equations have been used to represent different kinds of dynamical systems [1]. But the fractional applications are rare. One reason for rare fractional applications is that the computational cost of approximating for fractional equations is too much heavy. The idea of fractional derivatives dates back to the 17th century. A fractional differential equation is a kind of equation which uses fractional derivatives. Fractional equations provide a powerful instrument for the description of memory and hereditary properties of different substances.

There has been a wide variety of numerical methods proposed for fractional equations [2, 3], for example, finite difference method [47], finite element method [8, 9], spectral method [10, 11], and meshless techniques [12]. Zhuang and Liu [4] presented an implicit difference approximation for two-dimensional time fractional diffusion equation (2D-TFDE) on a finite domain and discussed the stability and convergence of the method. The numerical result of an example agrees well with their theoretical analysis. Tadjeran and Meerschaert presented a numerical method, which combines the alternating directions implicit (ADI) approach with a Crank-Nicolson discretization and a Richardson extrapolation to obtain an unconditionally stable second-order accurate finite difference method, to approximate a two-dimensional fractional diffusion equation [13]. Two ADI schemes based on the approximation and backward Euler method are considered for the two-dimensional fractional subdiffusion equation [14].

It is very time consuming to numerically solve fractional differential equations for high spatial dimension or big time integration. Short memory principle [15] and parallel computing [16, 17] can be used to overcome this difficulty. Parallel computing is used to solve computation intensive applications simultaneously [1821]. Large scale applications in science and engineering such as particle transport [2224], different linear and nonlinear systems [25], nonnumerical intelligent algorithm [26], and computational fluid dynamics [27] can rely on parallel computing. Diethelm [17] implemented the fractional version of the second-order Adams-Bashforth-Moulton method on a parallel computer and discussed the precise nature of the parallelization concept. This is the first attempt for parallel computing on fractional equations. Following that, Gong et al. [16] presented a parallel algorithm for one-dimensional Riesz space fractional diffusion equation with explicit finite difference method. The numerical solution of Riesz space fractional equations has global dependence on grid points, which means the approximation of a grid point will depend on the approximation of all grid points in one time step. The numerical solution of time fractional equations has global dependence on time steps, which means that the approximation of a grid point will depend on the approximation of the grid point in all time steps. Global dependence means the nonlocal property of fractional deviates on time or space. Explicit method is easy to be parallelized but is restrict by its stability condition. Implicit method is hard to be solved by Gauss elimination method and often uses the iterative scheme. Until today, the power of parallel computing for high dimensional and time fractional differential equations has not been tried.

This paper focuses on the two-dimensional time fractional diffusion equation studied by Zhuang and Liu [4]: where . The fractional derivative is in the Caputo form.

#### 2. Background: Numerical Solution

The fractional derivative of in the Caputo sense is defined as [15]

If is continuous bounded derivatives in for every , we can get

Define ,,,,, and , for , , and . Let , , , , , and be the numerical approximation to , , , , and . We can get the implicit approximating scheme [4] for (1): where , , and . The and are the step size along X and Y directions defined above.

#### 3. Parallel Algorithm

##### 3.1. Analysis

Let , and let ; (4) can be rewritten as

The explicit schemes are conditionally stable and need very small for high dimensional problems for both classical and fractional equations. The implicit schemes are unconditionally stable but need to get the inverse of the coefficient matrix. Sometimes the sparse coefficient matrix is too large, making a direct method too difficult to use. So, the iterative method can be used to avoid matrix inverse: until is smaller than a predefined threshold . are the iterative variables. are the known variables for the unknown time step.

It is very time consuming to solve the 2D-TFDE by iterative method of (6). For determining and assuming if there are iterations for each time step on average, there are about arithmetical logical operations ignoring the computation of the coefficients. So, the computational complexity is , which is much more heavy than the classical integer order 2D partial differential equations .

Besides the heavy computational cost, the memory space requirement is the other problem. Because each unknown time step needs to use all the values of the previous time steps, all the values of need to be stored into the memory space. When is big enough, the memory complexity is , which is far bigger than the classical integer order 2D partial differential equations .

The computation of (6) can be divided into two parts.(i). The unknown value of grid point at the time step relies on the value of grid point at all previous time steps of .(ii). The unknown value of grid point relies on the value of . The data dependence of 2D-TFDE is shown in Figure 1. relies on the neighboring grid points at the same time step and the same position of all the previous time steps.

Figure 1: The data dependence of 2D-TFDE of grid point of time step .
##### 3.2. Task Distribution Model and Data Layout

The task distribution of the total computation should be designed on distributed memory systems, with the goal of making the total computations as efficient as possible. There are three main issues in choosing a task distribution model for these computations:(i)load balance: ensure splitting of the computations reasonably evenly among all computing processors/processes throughout the time stepping;(ii)less communication: the task distribution model should keep the communication among different computing processes as less as possible;(iii)convenient programming: the parallel algorithm based on the task distribution model should not change the serial algorithm too much. The goal of keeping attention on these issues is achieving high execution efficiency and high scalability of the parallel algorithm on distributed memory systems for 2D-TFDE.

Refer to (6). computation has no data dependence. computation has data dependence among neighboring grid points. There are mainly two kinds of task distribution models. The first one is one-dimensional distribution (ODD): splitting the domain of all grid points along the X or Y direction on average. The task distribution model of the parallel algorithm [16] for the one-dimensional Riesz space fractional equation is ODD. The parallel algorithm based on ODD will not change the serial algorithm much and the load balance is guaranteed. If task is divided along X direction and is very big, the communication will influence the scalability of the parallel algorithm. The second one is two-dimensional distribution (TDD): splitting the domain of all grid points along the X and Y direction on average. So, the computing processes have a two-dimensional grid layout, with process id and . are the dimension size of the processes grid. The task distribution with TDD is shown in Figure 2.

Figure 2: The two-dimensional task distribution model for 2D-TFDE.

With the TDD, the data layout is described in Figure 3. Each subdomain with a process may have less than four virtual boundaries to receive the boundary data from its nearest neighbors. The virtual boundary is shown with dotted lines. The process has four virtual boundaries. The process only has three virtual boundaries since there is no process that stays on its right hand. A virtual boundary may have several layer grid points, which depends on the discrete scheme on space. In this paper, there is only one layer grid point for a virtual boundary with (4). In every iteration of (6), the processes exchange the data near the virtual boundaries shown in Figure 3. After the exchange, every process performs its own computation according to (6).

Figure 3: Data layout for 2D-TFDE.
##### 3.3. Implementation

The parallel algorithm for 2D-TFDE uses the mechanisms of process level parallelism. The process level parallelism is a kind of task level parallelism. The parallel algorithm for (1) is described in Algorithm 1.

Algorithm 1: Parallel algorithm for 2D-TFDE.

Each process only allocates its local memory. Assuming are divisible by , the process with four virtual boundaries will allocate memory space for array . The calculation of process id has three steps:step 1:get the MPI global id ;step 2:;step 3:.

The computations of , , , and so forth depend on the particular functions of coefficient and source terms. Performing these computations, every time step is a good choice. If these computations are performed out of the main loop (lines 9–32), a lot of memory space is required. If these computations are performed in the “While" loop (lines 16–32), it is too time consuming. The stands for the zero time step and stands for . means the iteration . If a process has neighbors, it should exchange the boundary data with its neighbors. The received boundary data are stored into the designed virtual boundaries. The lines 3–7 of Algorithm 1 are the preprocessing for the parallel algorithm. The lines 9–32 are the main time marching loops. are used to record the execution time.

#### 4. Experimental Results and Discussion

The experiment platform is a cluster with distributed memory system (DSM) architecture. One computing node consists of two Intel Xeon E5540 CPUs. The specifications of the cluster are listed in Table 1. The code runs on double precision floating point operations and is compiled by the mpif90 compiler with level three optimization (-O3). For convenience to compare the runtime, the inner loop (lines 16–32) of Algorithm 1 is fixed as 3.

Table 1: Technical specifications of the experiment platform.

##### 4.1. Numerical Example and Convergence of the Parallel Algorithm

The following time fractional () differential equation [4] was considered: where , and is the boundary of . The exact solution of the above equation is .

The computational results for different at and are shown in Figure 4. Figure 4 shows that the order of the fractional time derivative governs the value of unknown . With the increase of to 1, (1) approaches the classical PDE. Figure 5 shows the numerical solutions with .

Figure 4: The numerical approximation whose transport is governed by the TFDE (7) for various , , , when .
Figure 5: The approximation solution of (7) when and .

The parallel algorithm compares well with the exact analytic solution to the fractional partial differential equation in this test case of (7) with , shown in Figure 6. The and are and . The maximum absolute error is .

Figure 6: Comparison of exact solution to the solution of the parallel algorithm at time = 1.0.
##### 4.2. Performance Improvement

For fixed , the performance comparison between single process and four processes (single CPU) is shown in Figure 7. The X step number in (6) is , which is the x-coordinate of Figure 7. ranges from 2048 to 10240. With , the runtime of one process is 23.45 seconds and the runtime of four processes is 6.64 seconds. The speedup is 3.53. With , the runtime of one process is 803.88 seconds and the runtime of four processes is 192.76 seconds. The speedup is 4.17. From Figure 7, the parallel algorithm with fixed is more than 4 times faster than the serial algorithm.

Figure 7: Performance comparison between one process and four processes on E5540 with fixed .

For fixed , the performance comparison between single process and four processes is shown in Figure 8. For single process, the X, Y step number is . For four processes, the X, Y step number is 1280 with . ranges from 16 to 512. With , the runtime of one process is 17.63 seconds and the runtime of four processes is 4.65 seconds. The speedup is 3.79. With , the runtime of one process is 4415.78 seconds and the runtime of four processes is 1394.99 seconds. The speedup is 3.16. The performance of four processes is about 3.2 times higher than the performance of single process with .

Figure 8: Performance comparison between one process and four processes on E5540 with fixed .
##### 4.3. Scalability

The scalability of the parallel algorithm on the large scale cluster system is shown in Figure 9. The technical specifications of the cluster system are listed in Table 1. is fixed with 10 for all conditions. Each process has the same with and . varies from 16650, 33300, and 49950 for 9, 36, and 81 processes. The runtime of 9 processes is 83.02 seconds and the runtime of 81 processes is 94.08 seconds. The parallel efficiency of 81 processes is 88.24% compared with 9 processes. Here, the parallel efficiency is defined as the ratio of the runtime of different number of processes with the same work load on each process.

Figure 9: Scalability of the parallel algorithm on the cluster system.
##### 4.4. Discussion

The parallel Algorithm 1 will have good parallel scalability on distributed memory system. From Figure 3, we can see that each subdomain has only virtual boundary at every direction (top, bottom, left, and right). Assuming that the size of the subdomain is , the inner iteration of line 16 in Algorithm 1 has about arithmetic operations with precomputed. It needs to establish 8 communications for neighbors except the global communication for . The arithmetic operation of each time step besides the inner iteration is constant as . is bigger than . The communication data is grid point. Assuming that finishing one arithmetic operation needs time and there are inner iterations, the computing time of each time step is . Assume that is the time to establish the communication, is the transform time for a grid point, and is the global communication time. So, the total communication time for a time step is . The communication/computation ratio is as follows: The computation time is determined with the multiplication of and the communication time is determined with the addition of and . The extreme of is as follows: That means we can enhance the parallel efficiency by enlarging the size of subdomain.

The time and number of grid points will affect the convergence property. The exact solution of (7) shows that .(1)The bigger becomes, the more inner iterations are needed. With , , the first inner time step needs 5 Jacobi iterations and the last inner time step needs 31 iterations for . For , becomes 7 and becomes 61.(2)The bigger becomes, the more inner iterations are needed. The is fixed as 1.0. For , becomes 6 and becomes 66. For , becomes 3 and becomes 136. The reason for the phenomenon above is that () changes dramatically if the source term is big. The iteration times with are shown in Table 2.

Table 2: Impact of the source term on iteration times.

The parallel algorithm is compatible with short memory principle [15]. The computing time will become small with a smaller , which is determined by . The Gauss-Seidel iteration method will have better convergent speed than Jacobi iteration method, but it is hard to parallelize the Gauss-Seidel method.

As analyzed in Section 3.1, the computational complexity is . Define the following function: varies almost linearly, as shown in Figure 10. Figure 10 shows that the heavy computation is a real challenge from the point of view of computer science.

Figure 10: The linear variation of .

The heavy memory usage is the other challenge besides the heavy computation. Ignoring the memory usage of the coefficients and the source term , needs bytes memory space. It needs 100 GB memory with ,, and . As discussed above, the bigger the are, the smaller the (communication/computation ratio) is. So, the heavy memory usage will limit the parallel efficiency of the parallel algorithm. This kind of contradictions exists in many places. One contradiction is the easy parallelization with bad convergence of the Jacobi iterative method. Another contradiction is the hard parallelization and good convergence of the Gauss-Seidel iterative method.

#### 5. Conclusions and Future Work

In this paper, we present a parallel algorithm for 2D-TFDE with implicit differential method. The parallel solution is analyzed and implemented with MPI programming model. The experimental results show that the parallel algorithm compares well with the exact solution and can scale well on large scale distributed memory cluster system. So, the power of parallel computing for the time consuming fractional differential equations should be recognized.

The numerical solution for fractional equations is very computationally intensive. As a part of the future work, first, the numerical solution of high dimensional space fractional equations has global reliance on almost whole grid points, which is very challenging for real applications. Second, the Krylov subspace method with preconditioner will enhance the convergence for (4) and should be paid attention to. Third, accelerating the parallel algorithm on heterogeneous system [28] should be paid attention to.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

This research work is supported by the National Natural Science Foundation of China under Grant no. 11175253, also by 973 Program of China under Grant no. 61312701001. The authors would like to thank the anonymous reviewers for their helpful comments also.

#### References

1. R. L. Magin, Fractional Calculus in Bioengineering, Begell House, Redding, Calif, USA, 2006.
2. F. Liu, I. Turner, V. Anh, Q. Yang, and K. Burrage, “A numerical method for the fractional Fitzhugh-Nagumo monodomain model,” ANZIAM Journal, vol. 54, pp. C608–C629, 2013.
3. H. Ding and C. Li, “Mixed spline function method for reaction-diffusion equations,” Journal of Computational Physics, vol. 242, pp. 103–123, 2013.
4. P. Zhuang and F. Liu, “Finite difference approximation for two-dimensional time fractional diffusion equation,” Journal of Algorithms & Computational Technology, vol. 1, no. 1, pp. 1–15, 2007.
5. S. B. Yuste and L. Acedo, “An explicit finite difference method and a new von Neumann-type stability analysis for fractional diffusion equations,” SIAM Journal on Numerical Analysis, vol. 42, no. 5, pp. 1862–1874, 2005.
6. F. Liu, P. Zhuang, V. Anh, I. Turner, and K. Burrage, “Stability and convergence of the difference methods for the space-time fractional advection-diffusion equation,” Applied Mathematics and Computation, vol. 191, no. 1, pp. 12–20, 2007.
7. S. B. Yuste and J. Quintana-Murillo, “A finite difference method with non-uniform timesteps for fractional diffusion equations,” Computer Physics Communications, vol. 183, no. 12, pp. 2594–2600, 2012.
8. X. Zhang, P. Huang, X. Feng, and L. Wei, “Finite element method for two-dimensional time-fractional tricomi-type equations,” Numerical Methods for Partial Differential Equations, vol. 29, no. 4, pp. 1081–1096, 2013.
9. O. P. Agrawal, “A general finite element formulation for fractional variational problems,” Journal of Mathematical Analysis and Applications, vol. 337, no. 1, pp. 1–12, 2008.
10. C. Li, F. Zeng, and F. Liu, “Spectral approximations to the fractional integral and derivative,” Fractional Calculus and Applied Analysis, vol. 15, no. 3, pp. 383–406, 2012.
11. N. N. Leonenko, M. M. Meerschaert, and A. Sikorskii, “Fractional pearson diffusions,” Journal of Mathematical Analysis and Applications, vol. 403, no. 2, pp. 532–546, 2013.
12. P. Zhuang, Y. T. Gu, F. Liu, I. Turner, and P. K. D. V. Yarlagadda, “Time-dependent fractional advection-diffusion equations by an implicit MLS meshless method,” International Journal for Numerical Methods in Engineering, vol. 88, no. 13, pp. 1346–1362, 2011.
13. C. Tadjeran and M. M. Meerschaert, “A second-order accurate numerical method for the two-dimensional fractional diffusion equation,” Journal of Computational Physics, vol. 220, no. 2, pp. 813–823, 2007.
14. Y.-N. Zhang and Z.-Z. Sun, “Alternating direction implicit schemes for the two-dimensional fractional sub-diffusion equation,” Journal of Computational Physics, vol. 230, no. 24, pp. 8713–8728, 2011.
15. I. Podlubny, Fractional Differential Equations, Academic Press, San Diego, Calif, USA, 1999.
16. C. Gong, W. Bao, and G. Tang, “A parallel algorithm for the Riesz fractional reaction-diffusion equation with explicit finite difference method,” Fractional Calculus and Applied Analysis, vol. 16, no. 3, pp. 654–669, 2013.
17. K. Diethelm, “An efficient parallel algorithm for the numerical solution of fractional differential equations,” Fractional Calculus and Applied Analysis, vol. 14, no. 3, pp. 475–490, 2011.
18. J. Yan, G.-M. Tan, and N.-H. Sun, “Optimizing parallel ${S}_{n}$ sweeps on unstructured grids for multi-core clusters,” Journal of Computer Science and Technology, vol. 28, no. 4, pp. 657–670, 2013.
19. Z. Mo, A. Zhang, X. Cao et al., “JASMIN: a parallel software infrastructure for scientific computing,” Frontiers of Computer Science in China, vol. 4, no. 4, pp. 480–488, 2010.
20. F. Chen and J. Shen, “A GPU parallelized spectral method for elliptic equations in rectangular domains,” Journal of Computational Physics, vol. 250, pp. 555–564, 2013.
21. J. B. Haga, H. Osnes, and H. P. Langtangen, “A parallel block preconditioner for large-scale poroelasticity with highly heterogeneous material parameters,” Computational Geosciences, vol. 16, no. 3, pp. 723–734, 2012.
22. C. Gong, J. Liu, L. Chi, H. Huang, J. Fang, and Z. Gong, “GPU accelerated simulations of 3D deterministic particle transport using discrete ordinates method,” Journal of Computational Physics, vol. 230, no. 15, pp. 6010–6022, 2011.
23. A. Talamo, “Numerical solution of the time dependent neutron transport equation by the method of the characteristics,” Journal of Computational Physics, vol. 240, pp. 248–267, 2013.
24. C. Gong, J. Liu, H. Huang, and Z. Gong, “Particle transport with unstructured grid on GPU,” Computer Physics Communications, vol. 183, no. 3, pp. 588–593, 2012.
25. S. Pennycook, S. Hammond, S. Wright, J. Herdman, I. Miller, and S. Jarvis, “An investigation of the performance portability of OpenCL,” Journal of Parallel and Distributed Computing, vol. 73, no. 11, pp. 1439–1450, 2012.
26. J. Gu, X. Gu, and M. Gu, “A novel parallel quantum genetic algorithm for stochastic job shop scheduling,” Journal of Mathematical Analysis and Applications, vol. 355, no. 1, pp. 63–81, 2009.
27. F. Salvadore, M. Bernardini, and M. Botti, “GPU accelerated flow solver for direct numerical simulation of turbulent flows,” Journal of Computational Physics, vol. 235, pp. 129–142, 2013.
28. X.-J. Yang, X.-K. Liao, K. Lu, Q.-F. Hu, J.-Q. Song, and J.-S. Su, “The TianHe-1A supercomputer: its hardware and software,” Journal of Computer Science and Technology, vol. 26, no. 3, pp. 344–351, 2011.