Abstract

We present a parallel GPU solution of the Caputo fractional reaction-diffusion equation in one spatial dimension with an explicit finite difference approximation. The parallel solution, which is implemented with the CUDA programming model, consists of three procedures: preprocessing, parallel solver, and postprocessing. The parallel solver involves parallel tridiagonal matrix vector multiplication, vector-vector addition, and constant vector multiplication. The most time consuming loop, consisting of constant vector multiplication and vector-vector addition, is optimized, yielding an impressive performance improvement. The experimental results show that the GPU solution compares well with the exact solution. The optimized GPU solution on an NVIDIA Quadro FX 5800 is 2.26 times faster than the optimized parallel CPU solution on a multicore Intel Xeon E5540 CPU.

1. Introduction

The idea of fractional derivatives dates back to the 17th century. A fractional differential equation is an equation that involves fractional derivatives. Fractional equations can describe some physical phenomena more accurately than classical integer order differential equations [1]. Reaction-diffusion equations play an important role in dynamical systems in mathematics, physics, chemistry, bioinformatics, finance, and other research areas.

Some analytical methods have been proposed for fractional differential equations [2, 3]. The stability of Cauchy fractional differential equations has been studied [4, 5], and more attention should be paid to the interesting Ulam-type stability [6]. A wide variety of numerical approximation methods have been proposed for fractional equations [7, 8], for example, the finite difference method [9], the finite element method, and spectral techniques [10]. Interest in fractional reaction-diffusion equations has increased [11]. In 2000, Henry and Wearne [12] derived a fractional reaction-diffusion equation from a continuous-time random walk model with temporal memory and sources. The fractional reaction-diffusion system with activator and inhibitor variables was studied by Gafiychuk et al. [13]. Haubold et al. [14] developed a solution in terms of the H-function for a unified reaction-diffusion equation. The generalized differential transform method [15] was presented for fractional reaction-diffusion equations. Saxena et al. [16] investigated a closed-form solution of a generalized fractional reaction-diffusion equation.

Parallel computing is used to solve computation intensive applications [17–19]. In recent years, computing accelerators such as the graphics processing unit (GPU) have provided a new parallel way of accelerating computation intensive simulations [20–22]. General purpose GPU computing has been made practical by advances in programming models and hardware resources. GPU programming models such as NVIDIA's compute unified device architecture (CUDA) [23] have become more mature than before and simplify the development of nongraphics applications. The GPU presents an energy efficient architecture for computation intensive domains like particle transport [24, 25] and molecular dynamics [26].

It is time consuming to numerically solve fractional differential equations in high spatial dimensions or over long time integrations. The short memory principle [27, 28] and parallel computing [29–32] can be used to overcome this difficulty. Parallel algorithms for one- and two-dimensional time fractional equations have been studied, and good parallel scalability has been achieved [31, 32]. An optimization of the sum of constant vector multiplications has been presented, yielding a twofold speedup [31]. The parallel implicit iterative algorithm was studied for the two-dimensional time fractional problem for the first time in [32].

Gong et al. [29] presented a parallel algorithm for Riesz space fractional equations; its parallel efficiency on 64 processes reaches 79.39% of that on 8 processes on a distributed memory cluster system. Diethelm [30] implemented the fractional version of the second-order Adams-Bashforth-Moulton method for fractional ordinary equations on a parallel computer. Domain decomposition is regarded as the basic mathematical background for many parallel applications, and a domain decomposition algorithm for the time fractional reaction-diffusion equation with an implicit finite difference method was proposed [33]; it keeps the same parallelism as Jacobi iteration in each time step but needs much fewer iterations. However, nothing has been reported on accelerating the numerical solution of the Caputo fractional reaction-diffusion equation on GPU.

This paper focuses on the Caputo fractional reaction-diffusion equation

$$\frac{\partial^\alpha u(x,t)}{\partial t^\alpha} = K_\alpha \frac{\partial^2 u(x,t)}{\partial x^2} + C u(x,t) + f(x,t) \tag{1}$$

on a finite domain $x_L < x < x_R$ and $0 \le t \le T$. The $K_\alpha$ and $C$ are constants. If $\alpha$ equals 1, (1) is the classical reaction-diffusion equation. The fractional derivative is in the Caputo form.

2. Background

2.1. Numerical Solution

The fractional derivative of $u(x,t)$ in the Caputo sense is defined as [27]

$$\frac{\partial^\alpha u(x,t)}{\partial t^\alpha} = \frac{1}{\Gamma(1-\alpha)} \int_0^t \frac{\partial u(x,\eta)}{\partial \eta}\,(t-\eta)^{-\alpha}\, d\eta, \quad 0 < \alpha < 1. \tag{2}$$

If $u(x,t)$ has a continuous bounded time derivative in $[0,T]$ for every $x$, we can get

$$\frac{\partial^\alpha u(x,t_{m+1})}{\partial t^\alpha} = \frac{1}{\Gamma(1-\alpha)} \sum_{j=0}^{m} \int_{t_j}^{t_{j+1}} \frac{\partial u(x,\eta)}{\partial \eta}\,(t_{m+1}-\eta)^{-\alpha}\, d\eta, \tag{3}$$

with the time grid $t_j$ defined below.

Define $\tau = T/M$, $t_m = m\tau$ for $m = 0, 1, \ldots, M$, $h = (x_R - x_L)/N$, and $x_i = x_L + ih$ for $i = 0, 1, \ldots, N$. Define $u_i^m$ and $f_i^m$ as the numerical approximations to $u(x_i, t_m)$ and $f(x_i, t_m)$. We can get [11]

$$\frac{\partial^\alpha u(x_i, t_{m+1})}{\partial t^\alpha} \approx \sigma \sum_{j=0}^{m} b_j \left( u_i^{m+1-j} - u_i^{m-j} \right), \tag{4}$$

where $\sigma = \tau^{-\alpha}/\Gamma(2-\alpha)$, $b_0 = 1$, and

$$b_j = (j+1)^{1-\alpha} - j^{1-\alpha}. \tag{5}$$
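As a quick illustration of the weights in (5), take the illustrative value $\alpha = 0.5$:

$$b_0 = 1, \quad b_1 = \sqrt{2} - 1 \approx 0.4142, \quad b_2 = \sqrt{3} - \sqrt{2} \approx 0.3178, \quad b_3 = 2 - \sqrt{3} \approx 0.2679.$$

The weights decay slowly and every one of them multiplies a stored time level, which is why each new time step must revisit the whole solution history.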

Using the explicit centered difference scheme for $\partial^2 u/\partial x^2$, we can get

$$\frac{\partial^2 u(x_i, t_m)}{\partial x^2} \approx \frac{u_{i+1}^m - 2u_i^m + u_{i-1}^m}{h^2}. \tag{6}$$

The explicit finite difference approximation for (1) is

$$\sigma \sum_{j=0}^{m} b_j \left( u_i^{m+1-j} - u_i^{m-j} \right) = K_\alpha \frac{u_{i+1}^m - 2u_i^m + u_{i-1}^m}{h^2} + C u_i^m + f_i^m. \tag{7}$$

Define $r$, $d$, $c_j$, $U^m$, and $F^m$ as

$$r = \frac{K_\alpha}{\sigma h^2}, \qquad d = 1 - b_1 - 2r + \frac{C}{\sigma}, \qquad c_j = b_j - b_{j+1},$$
$$U^m = \left( u_1^m, u_2^m, \ldots, u_{N-1}^m \right)^T, \qquad F^m = \frac{1}{\sigma} \left( f_1^m, f_2^m, \ldots, f_{N-1}^m \right)^T. \tag{8}$$

Equation (7) evolves as

$$U^{m+1} = A U^m + \sum_{j=1}^{m-1} c_j U^{m-j} + b_m U^0 + F^m, \quad m \ge 1, \tag{9}$$

where matrix $A$ is a tridiagonal matrix, defined by

$$A = \begin{pmatrix} d & r & & \\ r & d & r & \\ & \ddots & \ddots & \ddots \\ & & r & d \end{pmatrix}. \tag{10}$$

(The first step, $U^1 = A U^0 + b_1 U^0 + F^0$, follows from (7) with $m = 0$.)

2.2. GPU Architecture and CUDA Programming Model

The architecture of the GPU is optimized for rendering real-time graphics, a computation and memory access intensive problem domain with enormous inherent parallelism. Unlike the CPU, a much larger portion of the GPU's resources is devoted to data processing rather than to caching or control flow. The NVIDIA Quadro FX 5800 GPU has 30 multiprocessor units called streaming multiprocessors (SMs). Each SM contains 8 SIMD CUDA cores, and each CUDA core runs at 1.30 GHz. The multiprocessors create, manage, and execute concurrent threads in hardware with near zero scheduling overhead. The single instruction multiple thread (SIMT) unit, which is akin to SIMD vector organizations, creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps [23].

The programming model is a bridge between hardware and application. CUDA comes with a software environment that allows developers to use C as a high-level programming language. There are three key abstractions in CUDA: a hierarchy of thread execution (grid or kernel, thread block, and thread), shared memory, and barrier synchronization. These abstractions provide fine-grained data level parallelism and thread parallelism, nested within coarse-grained data level parallelism and task level parallelism. Each CUDA kernel launch creates a single grid, which consists of many thread blocks. Each thread block groups threads that can share data through on-chip shared memory and synchronize with one another using barrier synchronization. Threads within a block are organized into warps, which run in SIMT fashion. CUDA threads may access data from multiple memory spaces during their execution: global, texture, and constant memory for the grid, on-chip shared memory for the thread block, and private register files for the thread. The access time to the different memory spaces varies from several to hundreds of cycles, and accesses to global memory must be coalesced to achieve good performance.
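As a minimal illustration of these abstractions (not code from the solver; the names are hypothetical), the following kernel adds two vectors with one thread per element; consecutive threads touch consecutive addresses, so the global memory accesses are coalesced:

// Minimal CUDA illustration: one thread per element, coalesced access.
__global__ void vec_add(const double *x, const double *y, double *z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                  // guard against the last partial block
        z[i] = x[i] + y[i];     // neighboring threads, neighboring addresses
}

// Host-side launch with n/BS thread blocks of BS threads each:
// vec_add<<<(n + BS - 1) / BS, BS>>>(d_x, d_y, d_z, n);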

3. Details of GPU Solution

The parallel solution consists of three parts. The first part is preprocessing, which prepares the initial matrices, vectors, and so on. The second part is the parallel solver, which focuses on the time step iteration of (9). The third part is postprocessing, which outputs the final results and so on.

The preprocessing includes initialization of the parallel environment, distribution of the computing task, allocation of local memory space, and initialization of variables and arrays. The matrix $A$ and the coefficient arrays are prepared before the computation of (9); for example, matrix $A$ can be obtained according to (10). The postprocessing is simple. The exact solution is evaluated, and the maximum absolute error between the exact and parallel solutions is computed and output. Both the exact and parallel solutions are saved in files for plotting. Other postprocessing operations include freeing memory space and stopping the parallel environment.

In order to get $U^{m+1}$, the right-hand side of (9) must be computed. It involves mainly one tridiagonal matrix vector multiplication, many constant vector multiplications, and many vector-vector additions: (1) the tridiagonal matrix vector multiplication is $AU^m$; (2) the constant vector multiplications are $c_1 U^{m-1}$, $c_2 U^{m-2}$, and so on; (3) the vector-vector additions are $AU^m + F^m$ and so on.

The parallel solution uses the data level parallelism of the GPU architecture. The parallel solution with CUDA for (1) is described in Algorithm 1. The preprocessing involves lines 1 to 3. The parallel solver, which is the most time consuming procedure, involves lines 4 to 12. The postprocessing involves lines 13 to 14; additional operations are not shown in Algorithm 1. Because the time spent on the preprocessing and postprocessing is trivial when the number of time steps is big enough, it is omitted from the measured time. Two timers record the measured times of the parallel CPU and GPU solutions.

(1)    Init CUDA environment
(2)   Allocate GPU global memory for matrix $A$, the vectors $U^m$ and $F^m$, and the coefficients
(3)   Init variables and arrays on GPU
(4)   record start time
(5)   call the initial condition kernel
(6)   for $m \leftarrow 0$ to $M - 1$ by Step 1 do
(7)               call the tridiagonal matrix vector multiplication kernel
(8)               call the constant vector multiplication and vector-vector addition kernel
(9)               for $j \leftarrow 1$ to $m - 1$ by Step 1 do
(10)                     call the constant vector multiplication and vector-vector addition kernel
(11)              call the constant vector multiplication kernel
(12) record stop time
(13) output the results and the measured time
(14) free GPU memory and stop CUDA environment

In Algorithm 1, BS stands for the CUDA thread block size, and $n$/BS is the number of CUDA thread blocks, where $n$ is the vector length. BS is a predefined constant such as 16, 32, 64, and so forth.
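For concreteness, the solver loop can be organized on the host as in the following sketch. Everything here is an illustrative assumption rather than the paper's exact code: the kernel names match the sketches in Section 3.1 below, the solution history is kept as an array of device pointers (d_U[m] holds $U^m$), d_F[m] holds $F^m$, c and b are host copies of the coefficients in (8), and the first time step is special-cased as noted below (9).

// Hypothetical host-side sketch of the solver loop of Algorithm 1.
// d_U[0..M] are device vectors of length n (n a multiple of BS),
// d_F[0..M-1] hold the source terms, c and b are coefficient arrays.
// BS must match the thread block size assumed by the kernels.
void solver_loop(double **d_U, const double *d_A, double **d_F,
                 const double *c, const double *b, int n, int M, int BS)
{
    int grid = n / BS;
    // first step: U^1 = A U^0 + b_1 U^0 + F^0
    tri_mv<<<grid, BS>>>(d_A, d_U[0], d_U[1], n);
    caxpy<<<grid, BS>>>(b[1], d_U[0], d_U[1], d_U[1], n);
    caxpy<<<grid, BS>>>(1.0, d_F[0], d_U[1], d_U[1], n);
    // remaining steps follow (9)
    for (int m = 1; m < M; ++m) {
        tri_mv<<<grid, BS>>>(d_A, d_U[m], d_U[m + 1], n);            // A U^m
        caxpy<<<grid, BS>>>(1.0, d_F[m], d_U[m + 1], d_U[m + 1], n); // + F^m
        for (int j = 1; j < m; ++j)           // history terms; fused in 3.2
            caxpy<<<grid, BS>>>(c[j], d_U[m - j], d_U[m + 1], d_U[m + 1], n);
        caxpy<<<grid, BS>>>(b[m], d_U[0], d_U[m + 1], d_U[m + 1], n); // + b_m U^0
    }
}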

Except for the initialization of variables and arrays, there are four CUDA kernels. The first kernel computes the initial condition of (1). The second kernel performs the tridiagonal matrix vector multiplication. The third kernel performs the constant vector multiplication and vector-vector addition. The fourth kernel performs the constant vector multiplication. The first and fourth kernels are simple and will not be described in detail.

3.1. Implementation

The CUDA kernel for constant vector multiplication and vector-vector addition is shown in Algorithm 2. Algorithm 2 computes $Z = cX + Y$ and saves the final vector into $Z$.

       input: $c$, $X$, $Y$
       output: $Z$
(1) for  all threads in each thread block do in parallel
(2)         local variables $i$, $t$
(3)         $i \leftarrow$ blockIdx.x $\times$ BS $+$ threadIdx.x
(4)         $t \leftarrow c \times X[i] + Y[i]$
(5)         $Z[i] \leftarrow t$
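A minimal CUDA sketch of Algorithm 2 follows; the kernel name and the separate output vector Z are illustrative assumptions, and the guard allows vector lengths that are not a multiple of the block size:

// Illustrative CUDA version of Algorithm 2: Z = c * X + Y,
// one thread per vector element.
__global__ void caxpy(double c, const double *X, const double *Y,
                      double *Z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                    // guard for a partial last block
        Z[i] = c * X[i] + Y[i];   // multiply by constant, then add
}

Passing Z = Y performs the update in place, which is how the solver loop accumulates the history terms.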

Most elements of the tridiagonal matrix $A$ are zero. The most common data structure used to store a sparse matrix for sparse matrix vector multiplication computations is the compressed sparse row (CSR) format [34]. Here the three (possibly zero-padded) nonzeros of each row of $A$ are stored contiguously, as shown in

$$\begin{pmatrix} 0 & d & r \\ r & d & r \\ \vdots & \vdots & \vdots \\ r & d & r \\ r & d & 0 \end{pmatrix}. \tag{11}$$

So in the following parts of this paper, matrix $A$ stands for the format of (11), not the format of (10). With the format of (11), the global memory accesses are coalesced, which improves the performance of global memory access.
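As a small host-side illustration of this layout (the function name and the zero padding convention are assumptions consistent with (10) and (11)):

// Illustrative packing of the tridiagonal matrix A into three entries
// per row, as in (11); d and r are the diagonal and off-diagonal values
// from (10), and boundary rows are padded with explicit zeros.
void pack_tridiagonal(double *A, int n, double d, double r)
{
    for (int i = 0; i < n; ++i) {
        A[3 * i]     = (i == 0)     ? 0.0 : r;  // subdiagonal
        A[3 * i + 1] = d;                       // main diagonal
        A[3 * i + 2] = (i == n - 1) ? 0.0 : r;  // superdiagonal
    }
}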

The CUDA kernel for tridiagonal matrix vector multiplication is shown in Algorithm 3. One thread block deals with the multiplication of one contiguous block of rows of matrix $A$ with the vector $U$, one row per thread; $n$ denotes the vector length. Algorithm 3 computes $V = AU$ and saves the final vector into $V$. The shared memory is used to improve the memory access speed. The synchronization function is used to ensure the correctness of the logic of the algorithm.

           input: $A$, $U$
           output: $V$
(1)    shared memory $us[\mathrm{BS} + 2]$
(2)   for  all threads in every thread block do in parallel
(3)                local variables $i$, $tid$, $v$
(4)                $tid \leftarrow$ threadIdx.x
(5)                $i \leftarrow$ blockIdx.x $\times$ BS $+ tid$
(6)                $us[tid + 1] \leftarrow U[i]$
(7)                if  $tid = 0$  then
(8)                          if  blockIdx.x $= 0$  then
(9)                                    $us[0] \leftarrow 0$
(10)                        else
(11)                                   $us[0] \leftarrow U[i - 1]$
(12)              if  $tid = \mathrm{BS} - 1$  then
(13)                        if  $i = n - 1$  then
(14)                                  $us[\mathrm{BS} + 1] \leftarrow 0$
(15)                        else
(16)                                  $us[\mathrm{BS} + 1] \leftarrow U[i + 1]$
(17)              call __syncthreads()
(18)              $v \leftarrow 0$
(19)              $v \leftarrow v + A[3i] \times us[tid]$
(20)             $v \leftarrow v + A[3i + 1] \times us[tid + 1]$
(21)              $v \leftarrow v + A[3i + 2] \times us[tid + 2]$
(22)             $V[i] \leftarrow v$

In Algorithm 3, each GPU thread deals with the multiplication of one row of the tridiagonal matrix $A$ with the vector $U$. Each thread needs three elements of vector $U$, and neighboring threads share two of them, so we can use the shared memory to improve the memory access performance. The elements of vector $U$ that will be used by the threads of a thread block are therefore staged in shared memory, as shown between lines 6 and 16 of Algorithm 3. The real computation of the tridiagonal matrix vector multiplication is shown between lines 18 and 21. Finally, the results are stored into the global memory of $V$.
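For concreteness, the following is a minimal CUDA sketch of Algorithm 3. The kernel and array names are illustrative assumptions; the matrix is stored in the three-entries-per-row layout of (11), and the vector length n is assumed to be a multiple of the block size BS:

#define BS 64  // thread block size; must match the launch configuration

// Illustrative CUDA sketch of Algorithm 3: V = A * U for a tridiagonal
// matrix A stored with three entries per row (format of (11)).
// One thread computes one row; the needed elements of U are staged in
// shared memory so that neighboring threads can share them.
__global__ void tri_mv(const double *A, const double *U, double *V, int n)
{
    __shared__ double us[BS + 2];          // BS elements plus two halos
    int tid = threadIdx.x;
    int i = blockIdx.x * BS + tid;         // row handled by this thread
    us[tid + 1] = U[i];                    // load own element
    if (tid == 0)                          // left halo of the block
        us[0] = (blockIdx.x == 0) ? 0.0 : U[i - 1];
    if (tid == BS - 1)                     // right halo of the block
        us[BS + 1] = (i == n - 1) ? 0.0 : U[i + 1];
    __syncthreads();                       // all loads must be visible
    double v = 0.0;
    v += A[3 * i]     * us[tid];           // subdiagonal entry
    v += A[3 * i + 1] * us[tid + 1];       // main diagonal entry
    v += A[3 * i + 2] * us[tid + 2];       // superdiagonal entry
    V[i] = v;                              // store result to global memory
}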

3.2. Optimization

In Algorithm 1, the kernels at lines 7, 8, and 11 are invoked once per time step, $M$ times in total. In each time step $m$, the kernel at line 10 is invoked $m - 1$ times, so its total number of invocations grows quadratically with $M$. The most time consuming part of Algorithm 1 is therefore the loop of line 9. This loop can be combined into one CUDA kernel, as shown in Algorithm 4. The coefficient array $c$ of (8) is kept in global memory.

        input: $c$, the solution history $U^1, \ldots, U^{m-1}$, $Y$, $m$
        output: $Y$
(1) for  all threads in every thread block do in parallel
(2)           local variables $i$, $s$
(3)           $i \leftarrow$ blockIdx.x $\times$ BS $+$ threadIdx.x
(4)           $s \leftarrow Y[i]$
(5)           for  $j \leftarrow 1$  to  $m - 1$  by Step  1 do
(6)                     $s \leftarrow s + c_j \times U^{m-j}[i]$
(7)           $Y[i] \leftarrow s$

So the optimized parallel solution for the Caputo fractional reaction-diffusion equation is similar to Algorithm 1, except that the loop (lines 9-10) in Algorithm 1 is replaced with the optimized CUDA kernel of Algorithm 4.
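A minimal CUDA sketch of Algorithm 4 is given below. The names are illustrative assumptions; the sketch assumes the solution history is stored contiguously in one device array (step m at offset m*n), and c is the coefficient array of (8) in global memory:

// Illustrative CUDA sketch of Algorithm 4: accumulate all history terms
// Y = Y + sum_{j=1}^{m-1} c_j * U^{m-j} in a single kernel launch.
__global__ void history_sum(const double *c, const double *Uhist,
                            double *Y, int n, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // element index
    if (i >= n) return;
    double s = Y[i];                   // partial result of lines 7-8
    for (int j = 1; j < m; ++j)        // loop over the whole history
        s += c[j] * Uhist[(m - j) * n + i];
    Y[i] = s;                          // one write instead of m - 1 writes
}

Compared with launching the kernel of Algorithm 2 once per history term, the fused kernel removes all but one of those kernel launches per time step and reads and writes $Y$ once instead of once per term, which is where the measured speedup comes from.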

4. Experimental Results

4.1. Experiment Platforms

The experiment platforms consist of one GPU and one CPU, listed in Table 1. For the purpose of fair comparison, we measure the performance provided by the GPU against the MPI code running on the multicore CPU [31]. Both codes use double precision floating point operations. The CPU code is compiled by the mpif90 compiler with level three optimization. The GPU code is compiled by the NVCC compiler provided by CUDA version 3.0, also with level three optimization.

4.2. Numerical Example

The following Caputo fractional reaction-diffusion equation [11] was considered as the test case (13); its domain, fractional order $\alpha$, initial condition, and source term $f(x,t)$, as well as the exact solution of (13), are taken from [11].

4.3. Accuracy of the GPU Implementation

The GPU solution compares well with the exact solution to the time fractional diffusion equation in the test case (13), as shown in Figure 1. The maximum absolute error was computed for three values of $\alpha$. In fact, the accuracy and convergence of the GPU solution are the same as those of the serial and parallel MPI solutions on CPU [31].

4.4. Total Performance Improvement

In this section, the performance of the optimized GPU solution presented in this paper is compared with that of the parallel CPU solution [31]. The parallel CPU solution makes full use of the four cores of the E5540. The optimized GPU solution is presented in Section 3.2.

For fixed , the performance comparison between GPU and multicore CPU is shown in Table 2. The thread block size is 32.

For fixed , the performance comparison between optimized GPU solution and parallel CPU solution is shown in Table 3. The thread block size is 32.

4.5. Performance Issues of GPU Solution

With , , and thread block size 64, the runtime of Algorithm 1 on the Quadro FX 5800 is 1.228 seconds. Without the loop of line 9 in Algorithm 1, the runtime is only 0.032 seconds. That is to say, about 97.39% of the runtime is spent on the loop of line 9. This is why we developed the optimized GPU solution with the optimized CUDA kernel of Algorithm 4.
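The fraction follows directly from the two measured times:

$$\frac{1.228 - 0.032}{1.228} \approx 0.9739 \approx 97.39\%.$$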

The impact of the optimized CUDA kernel on constant vector multiplication and vector-vector addition with fixed is shown in Table 4. The performance improvement with fixed is shown in Table 5. All thread block sizes are 64. The basic GPU solution is Algorithm 1 and the optimized GPU solution uses the optimized CUDA kernel of Algorithm 4.

The thread block size (BS) is a key parameter for parallel GPU algorithms. The impact of BS is shown in Table 6. From Table 6, we can see that thread block size 32 is the best choice.

5. Conclusions and Future Work

In this paper, the numerical solution of the Caputo fractional reaction-diffusion equation with an explicit finite difference method is accelerated on GPU. The iteration of each time step consists of tridiagonal matrix vector multiplication, constant vector multiplication, and vector-vector addition. The details of the GPU solution and some basic CUDA kernels are presented. The most time consuming loop (constant vector multiplication and vector-vector addition) is optimized. The experimental results show that the GPU solution compares well with the exact analytic solution and is much faster than the parallel CPU solution. So the power of parallel computing on GPU for solving fractional applications should be recognized.

As part of future work, first, the stability, such as Ulam-type stability, of different fractional equations deserves attention [35–37]. Second, parallelizing implicit numerical methods for fractional differential equations is challenging.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research work is supported in part by the National Natural Science Foundation of China under Grants no. 11175253 and no. 61170083, in part by Specialized Research Fund for the Doctoral Program of Higher Education under Grant no. 20114307110001, and in part by 973 Program of China under Grant no. 61312701001. The authors would like to thank the anonymous reviewers for their helpful comments as well.