Abstract

A parallel implementation of a method of the semi-Lagrangian type for the advection equation on a hybrid-architecture computation system is discussed. A difference scheme with a variable stencil is constructed on the basis of an integral equality between neighboring time levels. The proposed approach allows one to avoid the Courant-Friedrichs-Lewy restriction on the relation between the time step and the mesh size. The theoretical results are confirmed by numerical experiments. The performance of a sequential algorithm and of several parallel implementations with the OpenMP and CUDA technologies in the C language has been studied.

1. Introduction

Many physical phenomena in transport processes are modeled by time-dependent hyperbolic conservation laws [1–4]. The finite volume method (FVM) is a standard conservative method for constructing numerical approximations of hyperbolic conservation problems. Modern modifications of FVM [5–8] provide well-established conservative methods for solving the governing advection equations. Moreover, some of them were developed to treat high gradients and discontinuities of a solution [7, 8]. In spite of their advantages for hyperbolic equations, these methods suffer from a time-step restriction imposed mainly for the sake of stability. On the other hand, during the last three decades the idea of applying the method of characteristics to advect quantities forward in time has rapidly developed and has gained popularity in many areas [9–13]. In contrast to traditional Eulerian schemes, semi-Lagrangian algorithms provide unconditional stability and allow using large time steps. Despite their unconditional stability, these methods are explicit and therefore well suited for parallelization. Semi-Lagrangian methods are now intensively studied, and their efficiency for convection-dominated problems has been demonstrated. For a more detailed comparison of traditional Eulerian and semi-Lagrangian schemes for hyperbolic conservation laws, see [6, 14, 15].

Initially, semi-Lagrangian algorithms, as methods of characteristics, were developed for applications in climate prediction [16–21]. The simplest schemes approximate a trajectory (or curvilinear characteristic) by a straight line and employ a low-order interpolation to compute a numerical solution. Nowadays, the simplicity and efficiency of these schemes make them quite popular in different fields of numerical modeling like fluid dynamics applications [9, 12, 22], shallow water equations [10], fiber dynamics described by the Fokker-Planck equation [11], the heat-conduction equation [23], and so forth. Modern semi-Lagrangian algorithms involve a higher-order approximation of a curvilinear characteristic and employ a higher-order interpolation; see, for example, [22]. Recently, considerable efforts have been made to construct conservative semi-Lagrangian methods [9, 20, 24–28]. For instance, Scroggs and Semazzi [9] presented a semi-Lagrangian finite volume method that used a rectangular grid for a system of conservation laws and satisfied the discrete conservation relation, but the numerical results demonstrate some violation of full conservation. Early modifications of the semi-Lagrangian approach use a rectangular grid which is fixed throughout the simulation [9, 17, 24, 25]. Semi-Lagrangian schemes allow the spatial grids at different time levels to be chosen independently of one another. As a result, adaptive grids are widely used in modern versions of this approach [20, 28]. In spite of the progress in semi-Lagrangian methods, convergence has not been theoretically proved for most of them [9–12, 20, 24, 28].

In this paper, we present a sketch of the theoretical proof and a numerical justification for a difference scheme of the semi-Lagrangian family. We start with a theorem about an exact equality that involves two spatial integrals over domains at neighboring time levels and a third integral over the inflow boundary. To prove convergence theoretically, we use a square grid, bilinear interpolation, and the fourth-order Runge-Kutta method for the approximation of characteristics. This allows us to prove first-order convergence in a discrete analogue of the $L_1$-norm. The theoretical convergence estimates are confirmed by numerical results.

In the remaining part of the paper, several parallel implementations of this method are studied. We discuss the design subtleties of parallel implementations of the algorithm with the OpenMP technology for shared-memory computational systems and with the CUDA technology for general-purpose GPU programming. In addition, the influence of the HyperThreading technology on the performance of our OpenMP-code is studied. Moreover, the difficulties of implementing the algorithm on hybrid-architecture computation systems and the resulting performance are discussed for our CUDA codes.

2. Formulation of the Problem

Let $\Omega = (0, X) \times (0, Y)$ be a rectangle, let $T > 0$, and put $\bar{\Omega}_T = \bar{\Omega} \times [0, T]$. In the closed domain $\bar{\Omega}_T$ consider the two-dimensional advection equation
$$\frac{\partial \rho}{\partial t} + \frac{\partial (u\rho)}{\partial x} + \frac{\partial (v\rho)}{\partial y} = 0,$$
where the velocity components $u = u(t, x, y)$ and $v = v(t, x, y)$ are known and sufficiently smooth in $\bar{\Omega}_T$. We suppose for simplicity that the coefficients satisfy the no-slip conditions $v = 0$ at the upper and lower sides of $\Omega$ and the flow conditions $u > 0$ at the left and right sides of $\Omega$. For the unknown function $\rho$, an initial condition (4) at $t = 0$ and a boundary condition (5) on the inflow part of the boundary are specified.

3. Numerical Scheme

Subdivide the time segment $[0, T]$ into time levels $t_k = k\tau$, $k = 0, 1, \dots, K$, with the time step $\tau = T/K$. Let a closed quadrangle at the time level $t_{k+1}$ be given. For each of its points, on the segment $[t_k, t_{k+1}]$ we construct the characteristics defined by the system of ordinary differential equations
$$\frac{dx}{dt} = u(t, x, y), \qquad \frac{dy}{dt} = v(t, x, y)$$
with the initial value at the level $t_{k+1}$ as a parameter. With the help of these characteristics the edges of the quadrangle generate four surfaces with their lower edges at the level $t_k$ (Figure 1).

If the quadrangle is located near the inflow boundary, these surfaces can cross the plane of this boundary. In this case we get an additional curvilinear polygon on that plane (Figure 2).

Generally speaking, these domains can be triangular, pentagonal, or empty. If one of them is empty, then the integral over the empty domain is supposed to be equal to zero. Since there is no fundamental difference, we consider only the most common case with quadrangular domains. For these domains the following statement is valid.

Theorem 1. For a smooth solution $\rho$ of the problem (1)–(5), the integral of $\rho$ over the quadrangle at the level $t_{k+1}$ equals the sum of the integral of $\rho$ over the corresponding curvilinear polygon at the level $t_k$ and the integral of the inflow flux of $\rho$ over the curvilinear polygon on the inflow boundary.

Proof. Denote by $V$ the volume bounded by these surfaces and by $\partial V$ its boundary. Apply the Gauss-Ostrogradsky theorem to the left-hand side of the equality; this yields equality (10). Here $\mathbf{n}$ is the outer normal to $\partial V$ and the sign "$\cdot$" means the scalar product. On the upper and lower faces the normal is directed along the time axis, and on the inflow face it is directed along the spatial axis orthogonal to the inflow boundary. For any point of the lateral surfaces the normal is orthogonal to all their tangent directions, including the tangents of the characteristics; therefore the corresponding flux term vanishes. Taking this reasoning into account in (10), we get the equality that is equivalent to the statement of the theorem.

Now construct the uniform mesh with mesh size $h$ in both spatial directions. We will find an approximate solution at each time level as a grid function whose nodal values, generally speaking, differ from the values of the exact solution.

To construct the difference scheme with a variable stencil, we suppose that the grid function at the time level $t_k$ is already known and we need to find it at the level $t_{k+1}$. To compute the value at an inner node, we take the square with the four corresponding vertices and apply Theorem 1. To determine the values on the boundary of the domain, we use the rectangles which adjoin this boundary from inside (Figure 3).

Note that are known from the boundary condition (5).

Without loss of generality we describe the construction of the difference equations for inner nodes only. Thus, due to Theorem 1 we get equality (15). Here the curvilinear polygons are formed by the characteristics (6) that issue out of the edges of the square. To compute the second integrals in (15) numerically, we first replace the exact function by its bilinear interpolant with the help of the corresponding basis functions. To compute the integral over the domain on the inflow boundary, we also use a bilinear interpolant with the corresponding basis functions. The left-hand side of (15) is approximated by the midpoint quadrature rule with second-order accuracy. So, instead of the exact equality (15) we get the approximate equality (21). To simplify the numerical computation of the right-hand side in (21), we approximate the domains by simpler ones. Since in the more general case both domains are curvilinear quadrangles, we demonstrate the approximation only for the quadrangular case. Introduce four additional points on the square at the time level $t_{k+1}$ (one on each edge) and denote the resulting eight nodes accordingly. Out of each node we pass the corresponding characteristic to the time level $t_k$, which gives the corresponding point at that level (Figure 4).

To compute the coordinates of the point numerically, we solve the system of ordinary differential equations (6) with the initial condition by the fourth-order Runge-Kutta method [29]. Thus, we find the approximation of the point . The nodes , , define the polygon which is considered as a quadrangle with four parabolic edges (Figures 4-5). The constructed domain approximates . In the same way we construct the polygon which approximates . For the above approximation the following statement is valid [30].
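For illustration, a minimal C sketch of one classical fourth-order Runge-Kutta step for the characteristic system (6) is given below. The functions u() and v() are hypothetical stand-ins for the known velocity field, assumed to be implemented elsewhere, and the step dt is taken negative when the characteristic is traced backward from the level t_{k+1} to the level t_k.

double u(double t, double x, double y);   /* known velocity components,    */
double v(double t, double x, double y);   /* assumed implemented elsewhere */

/* One RK4 step for dx/dt = u(t,x,y), dy/dt = v(t,x,y); (x, y) is updated in place. */
void rk4_step(double t, double dt, double *x, double *y)
{
    double k1x = u(t, *x, *y);
    double k1y = v(t, *x, *y);
    double k2x = u(t + 0.5 * dt, *x + 0.5 * dt * k1x, *y + 0.5 * dt * k1y);
    double k2y = v(t + 0.5 * dt, *x + 0.5 * dt * k1x, *y + 0.5 * dt * k1y);
    double k3x = u(t + 0.5 * dt, *x + 0.5 * dt * k2x, *y + 0.5 * dt * k2y);
    double k3y = v(t + 0.5 * dt, *x + 0.5 * dt * k2x, *y + 0.5 * dt * k2y);
    double k4x = u(t + dt, *x + dt * k3x, *y + dt * k3y);
    double k4y = v(t + dt, *x + dt * k3x, *y + dt * k3y);
    *x += dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0;
    *y += dt * (k1y + 2.0 * k2y + 2.0 * k3y + k4y) / 6.0;
}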

Lemma 2. Let the coordinates of the nodes be computed within fourth-order accuracy, and assume that the ratio between the time step $\tau$ and the mesh size $h$ is fixed. Then the measures of the constructed polygons approximate the measures of the exact curvilinear domains accordingly, where the notation $\operatorname{mes}$ means the measure of a domain.

Thus, the replacement of the curvilinear domains by their polygonal approximations reduces the approximate equality (21) to another one. Divide it by the cell area and replace the exact function by the interpolant of the known grid function at the previous level. As a result, we get equation (26) for finding the approximate nodal value. To compute the integrals numerically, we decompose the domain into several triangles, each of which lies in only one cell and has only one parabolic edge and two straight-line edges parallel to the coordinate axes. Then we replace the integral over the domain by the sum of the integrals over these triangles. Thus we compute the integrals directly, without any quadrature rule.

To evaluate the order of convergence, we use a discrete analogue of the $L_1$-norm. For the numerical solution computed by (26) the following theorem is valid.

Theorem 3. Let the solution of the problem (1)–(5) be sufficiently smooth and let the discrete solution be computed by (26). Assume that the ratio between the time step $\tau$ and the mesh size $h$ is fixed. Then we have the estimate (29) with constants independent of the discretization parameters.

Proof. We use induction on the time level. At the zero level inequality (29) is valid because of the exact initial condition (4). Suppose the estimate (29) is valid at some level and prove it at the next one.
From Theorem 1 we get (30). For the first integral we have the equality (31).
In the second and third integrals we replace the curvilinear polygons by their approximations according to the foregoing construction. Then we replace the integrands by their piecewise bilinear interpolants. Due to Lemma 2 and the boundedness of the functions involved, formulae (30)-(31) are represented in the approximate form (33). Now multiply (26) and subtract it from (33). Then sum up the moduli of the resulting equation over all nodes and use the decomposition at the previous level with a grid function satisfying the estimate of the induction hypothesis. After the remaining transformations this leads to the required inequality; thus, due to the fixed relation between $\tau$ and $h$, we get inequality (29).

Corollary 4. Let the conditions of Theorem 3 be valid. Then for one has the estimate

Remark 5. Let the functions in the initial and boundary conditions (4)-(5) be greater than zero. Then, due to (3), the interpolants are nonnegative. Integrating them results in nonnegative values in (26). Thus, by induction we can prove the nonnegativity of the discrete solution.

Remark 6. The strategy of approximating the domains with 8 nodes is not optimal, of course. In fact, it is enough to use 4 nodes for rectangles, but the theoretical justification then becomes much more complicated. We do not demonstrate it here, since the main purpose of the paper is to verify the parallel properties of the proposed algorithm. Of course, such an optimization reduces the number of arithmetical operations but has no influence on the parallel properties of the algorithm. The same applies to difference schemes of higher order.

4. The Numerical Algorithm and Its Parallel Implementations

The constructed algorithm is implemented in the following way.

Algorithm 7 (sequential).
(1) Set the grid values at the zero time level as the initial data (4).
(2) Time loop: for each time step do:
    Space loop: for each cell do:
    (2.1) For each node, solve the system (18)–(20) and determine the corresponding vertex coordinates of the polygon.
    (2.2) If a certain characteristic intersects the plane of the inflow boundary, then the coordinates of this cross-point are determined.
    (2.3) Compute the integrals according to (26), where the integrals are calculated over each nonempty intersection of the polygons with the cells of the previous time level and with the inflow boundary separately.
    (2.4) Compute the new grid value.
    The end of the space loop.
    (2.5) Put the values of the actual time level into the array of the previous one for all cells.
    (2.6) If necessary, calculate the norms of the solution, the error, and other statistical data at the actual time step.
The end of the time loop.
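A minimal C sketch of the overall structure of Algorithm 7 is given below; the grid sizes NX, NY, NT and the helper functions set_initial_data(), process_cell(), and compute_statistics() are hypothetical placeholders for items (1) and (2.1)–(2.6).

#define NX 160
#define NY 160
#define NT 1000

void   set_initial_data(double *rho);                               /* item (1)          */
double process_cell(const double *rho_prev, int i, int j, int k);   /* items (2.1)-(2.4) */
void   compute_statistics(const double *rho, int k);                /* item (2.6)        */

void advect(double *rho_prev, double *rho_next)
{
    set_initial_data(rho_prev);
    for (int k = 0; k < NT; ++k) {                 /* time loop  */
        for (int i = 0; i < NX; ++i)               /* space loop */
            for (int j = 0; j < NY; ++j)
                rho_next[i * NY + j] = process_cell(rho_prev, i, j, k);
        for (int n = 0; n < NX * NY; ++n)          /* item (2.5) */
            rho_prev[n] = rho_next[n];
        compute_statistics(rho_next, k);
    }
}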

Note that items (2.1)–(2.3) are compute-intensive, especially the procedure of determining the mutual arrangement of the polygon and the cells at the previous time level in item (2.3).

The algorithm is explicit in time, since the values at each time level are calculated using only data from the previous time level.

Another advantage of Algorithm 7 is the data independence in the general space loop; that is, items (2.1)–(2.4) are carried out independently for each cell. This makes it possible to exploit data parallelism.

In the shared memory case for the OpenMP-technology it is sufficient to parallelize the general space loop at each time level using an OpenMP directive like the following one:

#pragma omp parallel for collapse(2)

For the parallelization to be correct, the data-sharing attributes of all variables holding intermediate results of items (2.1)–(2.4) have to be private for each thread.
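A minimal sketch of such a parallelized space loop is shown below, with NX and NY as in the previous sketch; process_cell() is again a hypothetical placeholder for items (2.1)–(2.4), and all intermediate variables are declared inside the loop body so that they are automatically private to each thread.

double process_cell(const double *rho_prev, int i, int j, int k);   /* hypothetical loop body */

void space_loop_omp(const double *rho_prev, double *rho_next, int k)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < NX; ++i)
        for (int j = 0; j < NY; ++j) {
            /* "value" and everything inside process_cell() are private to the thread */
            double value = process_cell(rho_prev, i, j, k);
            rho_next[i * NY + j] = value;
        }
}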

Another natural approach to parallelizing the algorithm is to use the NVIDIA CUDA technology for GPUs. The main aspects of the parallel implementation related to various features of general-purpose GPU programming are briefly discussed below.

All functions used in the numerical calculations on a CPU must be recompiled for a GPU. If we use NVIDIA CUDA, these functions must be declared with the special qualifiers __host__ __device__, which instruct the NVCC compiler to create two versions of the executable code: one for the CPU (host) and one for the GPU (device). The GPU calls the device version of a function, while the CPU calls its host version.
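A minimal CUDA C sketch of such a dually compiled function is shown below; bilinear_basis() is a hypothetical helper illustrating the qualifiers rather than a function taken from our code.

#include <math.h>

/* NVCC emits both a host and a device version of this function. */
__host__ __device__ double bilinear_basis(double x, double y,
                                          double x0, double y0, double h)
{
    double wx = 1.0 - fabs(x - x0) / h;   /* hat function in the x direction */
    double wy = 1.0 - fabs(y - y0) / h;   /* hat function in the y direction */
    return (wx > 0.0 && wy > 0.0) ? wx * wy : 0.0;
}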

The principles of efficient CUDA programming are as follows: (1) the maximal use of inherent parallelism of the problem and (2) the optimization of memory access.

The first version of our parallel CUDA-algorithm is based on the inherent parallelism of our explicit numerical approach. Every thread treats only one cell; hence, the space loop body (items (2.1)–(2.4) of Algorithm 7) is the general computation kernel.

While programming for a GPU, it is important to define the kernel configuration correctly. The kernel configuration includes two parameters, namely, the number of blocks in a grid (blockCount) and the number of threads per block (blockSize). There is a limit of 1024 threads per block for our NVIDIA GPU hardware. In the first CUDA-algorithm no optimization of the number of threads is used.
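A hedged sketch of choosing this configuration for a one-dimensional array of cells is shown below; cell_kernel and the array names are hypothetical, and blockSize = 256 is just one admissible value below the hardware limit of 1024.

__global__ void cell_kernel(const double *rho_prev, double *rho_next, int size);

void launch_basic_kernel(const double *d_rho_prev, double *d_rho_next, int cellCount)
{
    int blockSize  = 256;                                      /* threads per block (<= 1024)       */
    int blockCount = (cellCount + blockSize - 1) / blockSize;  /* enough blocks to cover all cells  */
    cell_kernel<<<blockCount, blockSize>>>(d_rho_prev, d_rho_next, cellCount);
}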

Consequently, the simplest parallel version of Algorithm 7 for the CPU/GPU hybrid architecture is the following.

Algorithm 8 (CUDA parallel, version 1).
(1) Calculate the kernel configuration (blockSize, blockCount) using data about the computational domain.
(2) Allocate host (CPU) and device (GPU) memory; copy the initial data from host to device.
(3) Time loop: for each time step do:
    (3.1) Call the first CUDA kernel (basic):
        (3.1.1) For each cell, execute items (2.1)–(2.4) of Algorithm 7 in parallel.
    (3.2) Synch point: wait for the calculations to be completed.
    (3.3) Call the second CUDA kernel (assistive):
        (3.3.1) Copy data from the actual time level array to the previous one in parallel.
    (3.4) Synch point: wait for the copying to be completed.
    (3.5) If necessary, copy results from device to host.
    (3.6) If necessary, calculate the norms of the solution, the error, and other statistical data at the actual time step.
The end of the time loop.
(4) Copy the results from device to host.

In order to decrease the execution time of Algorithm 8, items (3.5)-(3.6) must be performed as rarely as possible, for instance, only at time levels where accuracy control, data for drawing, and so forth are needed.
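A hedged sketch of the host part of Algorithm 8 is given below; basic_kernel, copy_kernel, the array names, and OUTPUT_PERIOD are hypothetical, and the copy back to the host is deliberately performed only every OUTPUT_PERIOD steps.

#include <cuda_runtime.h>

__global__ void basic_kernel(const double *rho_prev, double *rho_next, int size);
__global__ void copy_kernel(const double *rho_next, double *rho_prev, int size);

void time_loop(double *d_rho_prev, double *d_rho_next, double *h_rho,
               int cellCount, int numSteps, int blockCount, int blockSize)
{
    const int OUTPUT_PERIOD = 100;                 /* hypothetical output frequency */
    for (int k = 0; k < numSteps; ++k) {
        basic_kernel<<<blockCount, blockSize>>>(d_rho_prev, d_rho_next, cellCount);
        cudaDeviceSynchronize();                   /* item (3.2): synch point */
        copy_kernel<<<blockCount, blockSize>>>(d_rho_next, d_rho_prev, cellCount);
        cudaDeviceSynchronize();                   /* item (3.4): synch point */
        if (k % OUTPUT_PERIOD == 0) {              /* items (3.5)-(3.6), performed rarely */
            cudaMemcpy(h_rho, d_rho_next, cellCount * sizeof(double),
                       cudaMemcpyDeviceToHost);
        }
    }
    cudaMemcpy(h_rho, d_rho_next, cellCount * sizeof(double),
               cudaMemcpyDeviceToHost);            /* item (4) */
}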

Algorithm 8 has two general disadvantages: (1) a small speedup in comparison with the sequential version (Figure 8) and (2) the impossibility of execution on a fine computational mesh (Table 2). What causes these problems?

First, the general loop has a lot of selection statements.

The main selection is between two different ways of treating a cell in item (3.1.1) of Algorithm 8 (more precisely, in item (2.3) of the sequential Algorithm 7). The cells whose trajectories intersect the boundary and the internal cells are processed in different ways. We can use two different kernels which execute in parallel.

We use data parallelism only in Algorithm 8. However, there is yet another class of parallelism to be exploited on NVIDIA GPU. This parallelism is similar to the task parallelism that is found in multithreaded CPU applications. NVIDIA CUDA task parallelism is based on CUDA streams. A CUDA stream represents a queue of GPU operations such as kernel launches, memory copies, and event starts and stops. The order in which operations are added to the stream specifies the order in which they will be executed. Each stream may be considered as a certain task on the GPU, and there are opportunities for these tasks to execute in parallel [31]. Thus we apply CUDA streams to our two kernels to improve parallelism and total GPU utilization.
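A hedged sketch of launching the two kernels in two CUDA streams is shown below; boundary_kernel, inner_kernel, and the array names are hypothetical.

__global__ void boundary_kernel(const double *rho_prev, double *rho_next, int nBoundary);
__global__ void inner_kernel(const double *rho_prev, double *rho_next, int nInner);

void one_step_with_streams(const double *d_rho_prev, double *d_rho_next,
                           int nBoundary, int nInner, int blockSize)
{
    cudaStream_t streamBoundary, streamInner;
    cudaStreamCreate(&streamBoundary);
    cudaStreamCreate(&streamInner);

    int blocksB = (nBoundary + blockSize - 1) / blockSize;
    int blocksI = (nInner + blockSize - 1) / blockSize;

    /* the two kernels are queued in different streams and may overlap on the GPU */
    boundary_kernel<<<blocksB, blockSize, 0, streamBoundary>>>(d_rho_prev, d_rho_next, nBoundary);
    inner_kernel<<<blocksI, blockSize, 0, streamInner>>>(d_rho_prev, d_rho_next, nInner);

    cudaDeviceSynchronize();        /* synch point: wait for both streams */

    cudaStreamDestroy(streamBoundary);
    cudaStreamDestroy(streamInner);
}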

There are many selections in item (2.3) for determining the mutual arrangement of the polygon and the cells of the mesh at the previous time level. Unfortunately, we cannot avoid these selections.

Secondly, the CUDA kernel in item (3.3) of Algorithm 8 performs no computations; it only copies data. We apply loop unrolling in order to eliminate this kernel.
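A hedged sketch of this unrolling is given below: two time steps are performed per loop iteration and the roles of the two device arrays simply alternate, so the assistive copy kernel is no longer needed; basic_kernel and the array names are hypothetical.

__global__ void basic_kernel(const double *rho_prev, double *rho_next, int size);

void time_loop_unrolled(double *d_rho_a, double *d_rho_b,
                        int cellCount, int numSteps, int blockCount, int blockSize)
{
    for (int k = 0; k + 1 < numSteps; k += 2) {
        basic_kernel<<<blockCount, blockSize>>>(d_rho_a, d_rho_b, cellCount);  /* step k     */
        cudaDeviceSynchronize();
        basic_kernel<<<blockCount, blockSize>>>(d_rho_b, d_rho_a, cellCount);  /* step k + 1 */
        cudaDeviceSynchronize();
    }
    if (numSteps % 2 != 0) {   /* odd number of steps: one extra launch, result ends up in d_rho_b */
        basic_kernel<<<blockCount, blockSize>>>(d_rho_a, d_rho_b, cellCount);
        cudaDeviceSynchronize();
    }
}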

Thirdly, the basic CUDA kernel of Algorithm 8 is not optimal with respect to memory access. To optimize concurrent read access to global memory by simultaneously running threads, it is preferable to use constant memory. Applying this approach in our program, we allocate all invariable values in the constant memory of the GPU.
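A hedged sketch of placing the invariable parameters in constant memory is shown below; the structure Params and its fields are hypothetical stand-ins for the actual set of constants.

struct Params { double h, tau; int nx, ny; };

__constant__ struct Params d_params;        /* read-only for all threads of all kernels */

void upload_params(const struct Params *host_params)
{
    /* copied once from the host before the time loop starts */
    cudaMemcpyToSymbol(d_params, host_params, sizeof(struct Params));
}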

Moreover, for the optimization of the kernel launch parameters we use the CUDA Occupancy Calculator. It calculates the optimal streaming multiprocessor (SM) utilization taking into account the GPU compute capability, the CUDA device properties, the number of blocks in a grid, the number of threads per block, the size of the shared-memory space, and the number of registers per thread.
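As an aside, in later CUDA toolkits (6.5 and newer) a similar estimate can be obtained at run time with the occupancy API; a minimal sketch under that assumption is shown below, with basic_kernel again a hypothetical kernel name.

__global__ void basic_kernel(const double *rho_prev, double *rho_next, int size);

void launch_with_occupancy_hint(const double *d_rho_prev, double *d_rho_next, int cellCount)
{
    int minGridSize = 0, blockSize = 0;
    /* suggests a block size that maximizes multiprocessor occupancy for this kernel */
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, basic_kernel, 0, 0);
    int blockCount = (cellCount + blockSize - 1) / blockSize;
    basic_kernel<<<blockCount, blockSize>>>(d_rho_prev, d_rho_next, cellCount);
}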

Consequently, the second CUDA-version of Algorithm 7 for the CPU/GPU hybrid architecture is the following.

Algorithm 9 (CUDA parallel, version 2).
(1) Calculate the kernel configuration (blockSize, blockCount) using data about the computational domain.
(2) Allocate host (CPU) and device (GPU) memory, copy the initial data from host to device, and copy the constant data from host to the constant memory of the device.
(3) Time loop: for each time step do:
    (3.1) Call the first CUDA kernel for the boundary cells in the first CUDA stream:
        (3.1.1) Execute items (2.1)–(2.4) of Algorithm 7 in parallel for each cell whose characteristics intersect the boundary.
    (3.2) Call the second CUDA kernel for the inner cells in the second CUDA stream:
        (3.2.1) Execute items (2.1), (2.3)-(2.4) of Algorithm 7 in parallel for each internal cell.
    (3.3) Synch point: wait for the calculations of both kernels to be completed.
    (3.4) Call the first CUDA kernel for the boundary cells in the first CUDA stream:
        (3.4.1) Execute items (2.1)–(2.4) of Algorithm 7 in parallel for each cell whose characteristics intersect the boundary.
    (3.5) Call the second CUDA kernel for the inner cells in the second CUDA stream:
        (3.5.1) Execute items (2.1), (2.3)-(2.4) of Algorithm 7 in parallel for each internal cell.
    (3.6) Synch point: wait for the calculations of both kernels to be completed.
    (3.7) If necessary, copy results from device to host.
    (3.8) If necessary, calculate the norms of the solution, the error, and other statistical data at the actual time step.
The end of the time loop.
(4) If the number of time steps is odd, then repeat items (3.1)-(3.2).
(5) Copy the results from device to host.

5. Numerical Experiments

Specify the velocities and take the initial and boundary conditions in the following form:

Numerical experiments were performed on the ICM SB RAS FLAGMAN computation system with the following configuration.

Hardware. CPU: Intel Xeon, 12 physical cores, HyperThreading; 40 GB DDR3; GPU: NVIDIA Tesla C2050 (compute capability 2.0).

Software. OS: Ubuntu 11.04; C/C++: GCC 4.5.2, Intel C++ Compiler 13.1.0; CUDA C/C++: NVCC 5.0; NVIDIA CUDA 5.0; Boost 1.53; NVIDIA CUDA-GDB 5.0.

One of the purposes of the numerical experiments was to check the convergence order in the mesh size and the time step. Therefore the computations were performed on a sequence of regular square grids with successively refined mesh size, and the number of time steps was defined by the fixed ratio between the time step and the mesh size.

Assume that $\{\rho_h\}$ is the set of solutions found on the sequence of square grids. The expression $\log_2\bigl(\lVert \rho_h - \rho \rVert / \lVert \rho_{h/2} - \rho \rVert\bigr)$, considered as a function of $h$, can be regarded as the observed order of convergence (Figure 6). The corresponding exact values of $\rho$ were computed directly by the method of characteristics. In Figure 6 we can see the first order of convergence in $h$.

In our sequential and OpenMP computational experiments we compare the GCC and Intel C++ compilers. The execution time of the best Intel-compiled code is on average 15% less than that of the best GCC-compiled code. All presented numerical results were obtained with code compiled at the -O2 optimization level. Let us remark that we also tried other compiler options (-O3, -parallel, -AVX), but the performance increases only slightly or sometimes even decreases. For the CUDA-version we used the NVCC compiler.

The computation speedup results of the OpenMP-version are presented in Table 1 and in Figure 8. The first line of the table shows the speedup (or rather slowdown) of one thread executing the OpenMP-code as compared with the sequential code. These data allow one to estimate the effect of compiler optimization under the OpenMP library and the overhead related to the synchronization of OpenMP-threads. As we can see, in our case the overhead is small.

One of the purposes of the studies is to assess the influence of the HyperThreading (HT) technology on the parallelism. Since only 12 physical cores are available on the CPU, we have access to 24 logical cores when HT is enabled and to 12 logical cores when HT is disabled. Experiments show that with HT enabled the execution time with 24 threads is about 14% less than that with 12 threads with HT disabled (Figure 7). As our code is compute-intensive, the advantage of HT is probably related to the optimization of memory access.

The execution time of the sequential, OpenMP, and two CUDA versions of the code is given in Table 2. The corresponding symbol signifies that the result has not been reached in reasonable time; for the CUDA-versions it means that the kernel launch failed due to a register bottleneck. For OpenMP, the results on 12 threads with HT disabled and on 24 threads with HT enabled are demonstrated.

Comparative information on the possibilities of code optimization by the GCC compiler is also presented in Table 2. The execution time of the code compiled without optimization and with the -O2 and -O3 optimization levels, respectively, was measured. It is clear from Table 2 that compiler optimization considerably (by more than a factor of two) reduces the execution time of the sequential code. Since this compiler does not optimize the CUDA-code, the execution time of the CUDA-versions does not depend on the compiler options.

Table 3 and Figure 8 present the data on the computation speedup of the best OpenMP-version and the two CUDA-versions in comparison with the sequential program compiled with -O2. The speedup of the second CUDA-version in comparison with the OpenMP-version is given as well. Numerical experiments show that, on fine grids, the best OpenMP program with 24 threads gives a speedup of more than 12 over the sequential program, and the second CUDA-version shows a speedup of about 16.

6. Discussion

Nowadays there are many algorithms in the family of semi-Lagrangian methods. As we mentioned above, the presented method is based on a square grid only and uses auxiliary algorithms, which makes the computation more resource-intensive. Nevertheless, such a complication allows us to prove first-order convergence. Furthermore, a theorem is presented which allows one to take into account the volume of substance that has passed through the boundary; this enables us to prove the balance equation. Numerical experiments completely confirm the theoretical convergence results.

As for parallel implementation of our approach, we can note the following.

Though the algorithm is explicit, we are not satisfied with the results of the CUDA-versions.

First of all, there is a problem with the feasibility of the CUDA-code on fine grids (more than 640 × 640 nodes). Profile-feedback analysis shows at least two causes: (1) the adopted decomposition of the computational domain and (2) the hardware constraint on the number of registers per streaming multiprocessor.

Concerning the first item, it should be noted that in our approach a 2D computational domain is mapped to a 1D array of cells, and every thread treats one cell (e.g., see Pseudocode 1).

__global__ void unit_foo(float* inA, float* out, int size) {
    /* one thread treats exactly one cell */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
        out[i] = do_smb_with(inA[i]);
}

For the kernel launch parameters to be readily adaptable, we can apply "thread reuse"; namely, every thread treats several uncoupled cells (see Pseudocode 2).

__global__ void vec_foo(float* inA, float* out, int size) {
    /* "thread reuse": each thread processes several uncoupled cells
       with a grid-stride step */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < size;
         i += blockDim.x * gridDim.x)
        out[i] = do_smb_with(inA[i]);
}

Besides, we can use several video adapters. In this case we should solve the problem of distributing the computational cells between the devices under the following restriction: we do not know in advance how many cells of the previous time level are required to calculate an actual value (i.e., the width of the required "shadow" strip of cells varies).

The problem with registers in our case is related to the deep nesting of function calls in item (2.3) for determining the mutual arrangement of the polygon and the cells of the mesh at the previous time level. Indeed, this is a bottleneck of our sequential algorithm. Unfortunately, to resolve this issue we would have to modify the initial sequential algorithm.

In addition, item (2.3) has many flow control instructions (mainly "if" statements), but we cannot say that this branching creates a significant performance penalty in terms of the SIMT architecture. Indeed, in a CUDA kernel, any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge [32]. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

Fortunately, this penalty arises only in the cases where the control flow depends on the thread ID. In our case, however, many branches do not depend on the thread ID. Inside our computational kernel, the main part of the branches looks like Pseudocode 3.

if ((indCurSqOx[1] >= 0) && (indCurSqOy[1] >= 0)) {
    do_smth();
} else {
    do_smth_else();
}

As we can see, this "if" statement does not depend on a thread ID. Thus, this type of branch does not affect the performance of the CUDA kernel.

7. Conclusion

We have presented an unconditionally stable semi-Lagrangian scheme of first-order accuracy. The numerical experiments confirm the theoretical studies. The performance of the sequential algorithm and of several parallel implementations with the OpenMP and CUDA technologies in the C language was studied. The optimization potential of the CUDA-version is not yet exhausted.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by Project 14-11-00147 of the Russian Scientific Foundation.