Abstract

Parallel computation can substantially reduce the runtime of numerical simulations of electromagnetic radiation. We simulate electromagnetic radiation on clusters with multicore CPU and GPU parallel architectures using MPI-OpenMP and MPI-CUDA hybrid parallel algorithms. This is an effective solution compared with the traditional finite-difference time-domain method, whose serial implementation cannot handle problems with large data volumes in reasonable space and time. Moreover, we use regional segmentation, subregional data communication, and data consolidation to improve the nested parallelism of the program, and we verify the correctness of the calculation results. Running these two hybrid parallel models on a high-performance cluster, we conclude that both are suitable for large-scale numerical calculations and that the MPI-CUDA hybrid model achieves higher speedup.

1. Introduction

The finite-difference time-domain (FDTD) method has become a common method for solving Maxwell's equations [1]. It is a full-vector method and naturally yields both the time-domain and frequency-domain information that users need, which is a unique advantage in electromagnetic and photonic applications. The FDTD algorithm discretizes both time and space, so the structure of the electromagnetic field must be described on a grid composed of Yee cells. Because Maxwell's equations are discretized in time, the time step is closely related to the mesh size. In the limit where the mesh size tends to zero, the discrete model describes Maxwell's equations exactly.

Recently, general-purpose computing on graphics processing units (GPGPU) has received considerable attention in many scientific fields [2-4] because it offers high computational performance at low cost. In addition, the Intel Xeon Phi coprocessor, based on the Many Integrated Core (MIC) architecture, packs up to 1 TFLOPS of double-precision performance into one chip. It runs a Linux operating system, provides x86 compatibility, and supports several popular programming models used on multicore architectures, including MPI, OpenMP, and Threading Building Blocks. High-performance computer architectures are trending toward hybrid systems, which in turn calls for mixed programming models in software design. The GPGPU and MIC accelerators that have appeared in recent years provide an opportunity to improve the performance of the parallel FDTD algorithm. We therefore implemented a parallel three-dimensional FDTD algorithm based on the MPI-CUDA model.

The FDTD algorithm is widely applied in many fields of electromagnetic radiation, such as antenna radiation analysis, scattering calculations, electronic packaging, and radar. With the development of high-performance computing, MPI has addressed the weakness that the computing time of the FDTD algorithm is too long [5, 6]. However, as the amount of computation increases, the MPI processes on a single node bear a growing computational burden. By using two different hybrid models, MPI-OpenMP and MPI-CUDA, to address this problem, we can exploit distributed and shared memory features to improve parallel speedup and scalability [7, 8].

The rest of the paper is organized as follows. We present the FDTD algorithm with the uniaxial perfectly matched layer (UPML) in Section 2. We then describe the procedure and present the basic steps for accelerating the method with the MPI-OpenMP and MPI-CUDA paradigms in Sections 3 and 4. In Section 5, the parallel performance of the two methods is compared and the factors that affect performance are analyzed. Conclusions are given in Section 6.

2. FDTD Algorithm

The FDTD algorithm is a numerical method based on Maxwell's equations. It uses a leapfrog scheme in which the electric and magnetic fields, arranged in space according to the Yee cell, are updated alternately, offset by half a time step [9].

From Figure 1, we can see that the Yee cell has the following characteristics: each magnetic field component is surrounded by four electric field components, each electric field component is surrounded by four magnetic field components, and the relative positions of the field components in the Yee cell automatically satisfy the continuity conditions at material interfaces. This sampling not only suits the finite-difference form of Maxwell's equations but also naturally satisfies Faraday's law of electromagnetic induction and Ampere's law [10]. The method therefore advances the entire electromagnetic field recursively. First, the explicit update equations for the electric and magnetic field components are given (the $E_x$ component is representative) by

$$E_x^{n+1}(i,j,k) = CA \cdot E_x^{n}(i,j,k) + \frac{CB}{\Delta}\Big[\big(H_z^{n+1/2}(i,j,k) - H_z^{n+1/2}(i,j-1,k)\big) - \big(H_y^{n+1/2}(i,j,k) - H_y^{n+1/2}(i,j,k-1)\big)\Big], \tag{1}$$

where

$$CA = \frac{2\varepsilon_r\varepsilon_0 - \sigma\Delta t}{2\varepsilon_r\varepsilon_0 + \sigma\Delta t}, \qquad CB = \frac{2\Delta t}{2\varepsilon_r\varepsilon_0 + \sigma\Delta t},$$

$\varepsilon_r$ is the relative permittivity, $\sigma$ is the conductivity of the tissue [S/m], $\Delta$ is the mesh size, and $\Delta t$ is the time step. Figure 2 shows the workflow of the FDTD method.
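As a concrete illustration of this update, a minimal C sketch of one $E_x$ sweep over a Yee grid is given below. It is a sketch under our own naming assumptions (the arrays Ex, Hy, Hz, the index helper ID, and the extents nx, ny, nz are placeholders), not the authors' code.

 /* Minimal sketch of one E_x leapfrog update on a Yee grid (illustrative only).
    Ex, Hy, Hz are flattened 3-D arrays; CA and CB are the update coefficients
    defined above; delta is the mesh size. The ID() index helper is hypothetical. */
 #define ID(i, j, k) (((i) * ny + (j)) * nz + (k))

 void update_Ex(double *Ex, const double *Hy, const double *Hz,
                int nx, int ny, int nz, double CA, double CB, double delta)
 {
     for (int i = 0; i < nx; ++i)
         for (int j = 1; j < ny; ++j)
             for (int k = 1; k < nz; ++k)
                 Ex[ID(i, j, k)] = CA * Ex[ID(i, j, k)]
                     + CB * ((Hz[ID(i, j, k)] - Hz[ID(i, j - 1, k)])
                           - (Hy[ID(i, j, k)] - Hy[ID(i, j, k - 1)])) / delta;
 }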

The explicit iterative process of the FDTD algorithm requires initial field values and boundary conditions. Traditionally, the initial field values over the whole space are set to 0, and the field values of two successive time steps are stored as the starting values for the next step. There are a variety of boundary conditions, such as the Mur absorbing boundary, the perfectly matched layer (PML), and the uniaxial perfectly matched layer (UPML) [11]. With the improvement of the FDTD algorithm, these three absorbing layers all achieve good absorption.

In our experiments, we use the uniaxial perfectly matched layer (UPML) as the absorbing boundary condition. The UPML inherits the good absorption characteristics of the PML, but unlike the PML it is derived directly from Maxwell's equations. This keeps the form of the FDTD calculation model consistent with Maxwell's equations, which makes it easier to understand and program.

3. FDTD Algorithm Based on MPI-OpenMP

3.1. MPI-OpenMP Hybrid Model

The MPI programming model is a standardized message-passing interface for parallel programming that operates in heterogeneous network environments. It provides a reliable transport mechanism and uses secure channels for communication, data exchange, and processing among all tasks within a process group. However, because MPI parallelism relies on interprocess communication, it suffers from lower efficiency, extra memory overhead, and more difficult programming. The OpenMP programming model uses a fork-join execution mode and a shared memory model in which a process can be divided into several parallel threads. Within a single node, when the main thread enters an OpenMP parallel region, it spawns a team of threads to achieve parallelism. However, since the OpenMP programming model is restricted to a single node, the available CPU computing power is greatly limited.

The MPI-OpenMP programming model combines these two models: distributed-memory MPI is used between machines, and each MPI process uses the multithreaded OpenMP shared memory model for its computation [12]. This model reduces the number of MPI processes and hence the number of messages passed. Moreover, OpenMP parallelism within each node saves memory overhead. The hybrid model therefore has certain advantages over either single model.
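A minimal sketch of this hybrid pattern is shown below; it is not the authors' code, and the printed diagnostics are only for illustration. MPI_Init_thread with MPI_THREAD_FUNNELED is one common way to request an MPI library that tolerates OpenMP threads, provided only the master thread makes MPI calls.

 /* Hybrid MPI-OpenMP skeleton (illustrative sketch, not the authors' code).
    One MPI process per node; OpenMP threads share the node's memory. */
 #include <mpi.h>
 #include <omp.h>
 #include <stdio.h>

 int main(int argc, char **argv)
 {
     int provided, rank, size;
     /* Request thread support: only the master thread will call MPI. */
     MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &size);

     #pragma omp parallel
     {
         /* Each thread would work on a slice of the process-local subdomain. */
         printf("rank %d of %d, thread %d of %d\n",
                rank, size, omp_get_thread_num(), omp_get_num_threads());
     }

     MPI_Finalize();
     return 0;
 }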

3.2. The Division of MPI Model

Each division scheme has different characteristics depending on the scale of the computational region. Common schemes are one-dimensional, two-dimensional, and three-dimensional division. For a cubic model, every communication surface has the same size in each direction, so the communication traffic is proportional to the number of communication surfaces, and the three-dimensional division produces the smallest communication surface. Therefore, the three-dimensional division should be selected when the computational region is balanced in all directions. But when the region is dominant in one or two directions, the communication traffic is no longer proportional to the number of communication surfaces, and a one-dimensional or two-dimensional division gives better results. For example, a long waveguide is divided one-dimensionally along the waveguide direction, and an aircraft wing is usually divided two-dimensionally along its width direction.

In this paper, the computational model, which resembles a phone box, is a strongly elongated rectangular model. We performed a one-dimensional division of the mesh and assigned the processes evenly across the nodes to obtain high efficiency. This division reduces the communication time between nodes for the rectangular model (Figure 3).
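For a one-dimensional division along the dominant direction, each rank only needs the extent of its own slab plus one ghost layer on each interior side. A hedged sketch of how the slab bounds could be computed follows; the variable names are illustrative and not taken from the paper's code.

 /* One-dimensional domain decomposition along z (illustrative sketch).
    nz_total cells are split as evenly as possible among 'size' ranks;
    any remainder cells go to the lowest ranks. */
 void slab_bounds(int nz_total, int rank, int size, int *z_begin, int *z_end)
 {
     int base = nz_total / size;      /* cells every rank gets            */
     int rem  = nz_total % size;      /* extra cells for the first ranks  */
     *z_begin = rank * base + (rank < rem ? rank : rem);
     *z_end   = *z_begin + base + (rank < rem ? 1 : 0);   /* exclusive */
 }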

3.3. MPI Parallel Communication Design

As shown in (1), the field value at each FDTD step depends only on the magnetic field intensity and the surrounding grid values, which means the algorithm is strongly localized. Therefore, the MPI algorithm simply divides the computational area into a number of subregions, assigns each subregion to an MPI node, and exchanges the boundary data of the mesh at each time step. Finally, we synchronize the data of all nodes at the end of each step. In order to obtain a high parallel speedup, we partition the region along the dominant direction and assign the processes evenly to the nodes. This partitioning reduces the internode communication time along the dominant direction of the rectangular parallelepiped model.

The computational area is thus divided into a number of subregions, and the data communication is shown in Figure 4. To update a boundary field component that belongs to process II, the adjacent electric field components stored in process I are required, so process I transfers those values to process II; the update can begin once the data exchange is completed. Obviously, the data in the other direction are communicated in the same way.
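One way to implement this boundary exchange without risking deadlock is MPI_Sendrecv on each face. The sketch below assumes contiguous boundary buffers and a one-dimensional rank ordering; it is an illustration, not the authors' implementation.

 /* Ghost-layer exchange between neighboring slabs (illustrative sketch).
    send_lo/send_hi hold the rank's own boundary planes; recv_lo/recv_hi
    receive the neighbors' planes. MPI_PROC_NULL makes the edge ranks no-ops. */
 #include <mpi.h>

 void exchange_ghosts(double *send_lo, double *recv_lo,
                      double *send_hi, double *recv_hi,
                      int plane_size, int rank, int size)
 {
     int lo = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
     int hi = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

     /* Send my low plane down, receive my low ghost plane from below. */
     MPI_Sendrecv(send_lo, plane_size, MPI_DOUBLE, lo, 0,
                  recv_lo, plane_size, MPI_DOUBLE, lo, 1,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     /* Send my high plane up, receive my high ghost plane from above. */
     MPI_Sendrecv(send_hi, plane_size, MPI_DOUBLE, hi, 1,
                  recv_hi, plane_size, MPI_DOUBLE, hi, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
 }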

3.4. OpenMP Threaded Design and Optimization

In electromagnetic computing, the electric and magnetic field strengths depend only on the electromagnetic field data of the previous step, so the updates have natural parallelism. On a multicore node, the program uses the OpenMP model to increase its parallelism.

For better memory utilization, load balancing may suffer; therefore, performance optimization requires a compromise between memory utilization and load balancing. The OpenMP model provides four commonly used scheduling kinds: static, dynamic, guided, and runtime. With static scheduling, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads of the team in round-robin fashion in order of thread number; with dynamic scheduling, each thread executes a chunk of iterations and then requests another chunk until no chunks remain.

From these definitions, we can see that static scheduling suits the situation where all CPUs have similar performance, while dynamic scheduling suits the situation where the CPUs differ greatly in capability. Because the CPUs in our experimental environment have very similar computing power, we use static scheduling so that each thread receives an equal share of the computing tasks [13] (see Algorithm 1).

 #pragma omp parallel for schedule(static)
 for (i = 1; i < i_tot; ++i)
  for (j = 1; j < j_tot; ++j)
   for (k = 1; k < k_tot; ++k)
    ⋯
 In order to increase the OpenMP load balance and obtain better scalability, we merge the nested loops, reducing the nesting from 3 levels to 2 (here i, j, k index the Yee grid and i_tot, j_tot, k_tot are the local grid extents).
 j_end = j_tot;
 ij_end = i_tot * j_tot;
 ij_start = j_end + 1;
 #pragma omp parallel for schedule(static)
 for (ij = ij_start; ij < ij_end; ++ij)
 {
  i = ij / j_end;      /* recover the i index */
  j = ij % j_end;      /* recover the j index */
  if (!j) continue;    /* skip the j = 0 boundary layer */
  for (k = 1; k < k_tot; ++k)
   ⋯
 }

It is obvious that the x, y, and z directions involve similar calculations.

3.5. MPI-OpenMP Hybrid Programming

As the scale of the computation grows, the per-process overhead on shared-memory processors limits parallel performance. We solve this problem with the hybrid OpenMP and MPI model, in which each node runs MPI processes and each process uses the optimal number of threads. The computation is thereby shared among multiple threads, reducing communication overhead and easing the process workload on each node.

The FDTD algorithm based on the MPI-OpenMP model allocates one process to each compute node. After MPI initialization and initialization of the calculation parameters, each process starts OpenMP parallel computing. In this phase, the program updates the electric field values in a loop and exchanges the boundary electric field values with the adjacent nodes. After the communication is completed, the CPUs use the OpenMP programming model to update the magnetic field values. When the program reaches the maximum time step, the MPI processes and the main program end. The absorbing boundary treatment is carried out within a single process at every time step, so it can also be parallelized with the OpenMP model.

The pseudocode is as shown in Algorithm 2.

 MPI_Init(&argc, &argv); //MPI parallel environment initialization
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 MPI_Comm_size(MPI_COMM_WORLD, &size);
 MPI_Get_processor_name(processor_name, &namelen);
 MPI_Barrier(MPI_COMM_WORLD);
 node.init(x_size/size, y_size, z_size); //Grid data initialization (local slab sizes)
 UPMLcompute(); //Set up the absorbing boundary
 for (int t = 0; t < max_step; ++t) //Iterate over the time steps
 {
 #pragma omp parallel for schedule(static)
  for (int i = 1; i < i_tot; ++i)
  {
   …Compute; //Calculating the electric field and magnetic field values
  }
 MPI_Send(datasend_right, data_size, MPI_DOUBLE, NodeRank + 1, 1,
 MPI_COMM_WORLD); //Send data to the neighboring node
 MPI_Recv(datarecv_right, data_size, MPI_DOUBLE, NodeRank + 1, 1, MPI_COMM_WORLD,
&status); //Receive data from the adjacent node
 MPI_Barrier(MPI_COMM_WORLD);
 }
 MPI_Finalize();
 Out(EHdata); //Output the results

4. MPI + GPU Programming Model Design

4.1. FDTD Parallel Algorithm Based on GPU Model

FDTD is the most popular method for computational electromagnetic simulation because of its simple algorithm and high computational efficiency. Figure 5 shows the flow of the three-dimensional FDTD method for a single-GPU computation [14]. There are four tasks within each time step on the GPU side in this figure: the electric field computation (E_field), the uniaxial perfectly matched layer (UPML) computation for the electric field (E_upml) as the absorbing boundary condition, the magnetic field computation (H_field), and the UPML computation for the magnetic field (H_upml). All field updates in each time step can be parallelized and are offloaded to the GPU [15].

Our GPU implementation of the FDTD method is based on C++ code that runs on the ZiQiang4000 high-performance computer cluster of Shanghai University. Since all the magnetic and electric field computations can be vectorized and parallelized, they are natural candidates for GPU computation with CUDA kernels, namely E_field, E_upml, H_field, and H_upml. Our three-dimensional GPU FDTD program requires no host-device data transfer for field updates because all computations within each time step are performed on the GPU. The CUDA kernel function [16] is shown in Algorithm 3.

__global__ void caluH(arguments)
{
 /* i, j, k, the field arrays B_x, E_y, E_z, H_x, and the coefficients
    C1..C6 are representative names */
 int i = threadIdx.x + blockIdx.x * blockDim.x;
 int j = threadIdx.y + blockIdx.y * blockDim.y;
 int k = threadIdx.z + blockIdx.z * blockDim.z;
 if (i < x_size && j < y_size && k < z_size)
 {
  int id = i * y_size * z_size + j * z_size + k;
  double tmp = B_x[id];
  B_x[id] = C1 * B_x[id] - C2 * ((E_z[id + z_size] - E_z[id]) - (E_y[id + 1] - E_y[id])) / delta;
  H_x[id] = C3 * H_x[id] + C4 * (C5 * B_x[id] - C6 * tmp);
 }
}
The host:
dim3 block(BlockDim[0], BlockDim[1], BlockDim[2]);
Grid[0] = x_size / BlockDim[0];
Grid[1] = y_size / BlockDim[1];
Grid[2] = z_size / BlockDim[2];
dim3 grid(Grid[0], Grid[1], Grid[2]);
caluH<<<grid, block>>>(arguments);

4.2. The Optimized Memory of Access Patterns

The kernel code, which runs on the GPU, plays a role similar to a C function such as caluH. The built-in variables threadIdx, blockIdx, and blockDim are used to compute the array indexes i, j, and k. The host-side code then calls the kernel with caluH<<<grid, block>>>(arguments), where grid and block specify the division into thread blocks and the number of threads in each block.
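The host code above uses plain integer division, which only covers the full volume when the field dimensions are exact multiples of the block dimensions. A common variant, shown here as our own assumption rather than the paper's code, rounds the grid size up and relies on the kernel's bounds check to discard the excess threads.

 // Ceiling division so that the grid covers sizes that are not exact
 // multiples of the block dimensions (illustrative sketch only).
 dim3 block(BlockDim[0], BlockDim[1], BlockDim[2]);
 dim3 grid((x_size + block.x - 1) / block.x,
           (y_size + block.y - 1) / block.y,
           (z_size + block.z - 1) / block.z);
 caluH<<<grid, block>>>(arguments);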

As on-chip GPU memory, registers and shared memory have high access speed and low data latency. Therefore, loading data from global memory into shared memory can improve the efficiency of GPU parallelism. Consider (with representative array and index names)

 __shared__ double new_E[TILE_SIZE];
 new_E[tid] = E[id];

The first line declares the array in shared memory. Since shared memory space is very limited, the programmer should choose between dynamically and statically allocated arrays according to the data size. The second line loads the data from global memory into shared memory, which increases memory access speed through data sharing. Obviously, the other field data should be optimized in the same way.
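As a fuller illustration of the pattern, the following sketch stages a one-dimensional array into shared memory with halo cells before computing a simple three-point stencil. The kernel name, tile size, and stencil weights are our assumptions and do not come from the paper's FDTD kernels.

 // Shared-memory staging sketch (illustrative; names and sizes are assumptions).
 // Launch with blockDim.x == TILE.
 #define TILE 256

 __global__ void smooth1d(const double *in, double *out, int n)
 {
     __shared__ double tile[TILE + 2];                  // one halo cell per side
     int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global index
     int lid = threadIdx.x + 1;                         // local index (shifted)

     tile[lid] = (gid < n) ? in[gid] : 0.0;             // stage own value
     if (threadIdx.x == 0)                              // stage left halo
         tile[0] = (gid > 0 && gid - 1 < n) ? in[gid - 1] : 0.0;
     if (threadIdx.x == blockDim.x - 1)                 // stage right halo
         tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0;
     __syncthreads();

     if (gid < n)                                       // neighbors now read from
         out[gid] = 0.5 * tile[lid]                     // fast shared memory
                  + 0.25 * (tile[lid - 1] + tile[lid + 1]);
 }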

4.3. MPI-CUDA Hybrid Programming

In the MPI programming model, a problem is divided into multiple subtasks so that each process executes one task. But as the scale grows, the MPI communication overhead increases, which reduces parallel efficiency. Moreover, this model only provides coarse-grained parallelism between nodes, and the computing capacity within each node is not fully utilized. The CUDA model makes up for this deficiency: the multithreaded shared memory mechanism on the GPU provides fine-grained data-parallel computation, which can achieve higher speedup [17].

In order to reduce the computational load on each GPU, we use a multi-GPU programming model within a single node [18]: each process calls cudaSetDevice() to select its GPU, so that both GPUs in a node are used.
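A minimal sketch of binding each MPI process to one of the node's GPUs by rank follows; the round-robin mapping is our assumption, not taken from the paper's code.

 // Bind each MPI process on a node to one local GPU (illustrative sketch).
 #include <mpi.h>
 #include <cuda_runtime.h>

 void bind_gpu_to_rank(void)
 {
     int rank = 0, num_gpus = 0;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     cudaGetDeviceCount(&num_gpus);              // e.g. 2 on an M2090 node
     cudaSetDevice(rank % num_gpus);             // simple round-robin mapping
 }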

The workflow of the three-dimensional FDTD program based on the MPI-CUDA model is as follows (a skeleton of the time-step loop is sketched after this list).
(1) Initialize the calculation parameters, the GPU, and the MPI environment.
(2) Divide the computational region, allocate GPU memory, and copy the initial data to the GPU.
(3) In the time-step for-loop, using the MPI model,
 (a) execute the electric field computation on the GPU;
 (b) execute the magnetic field computation on the GPU.
(4) Process and output the results.
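A skeleton of this workflow is sketched below under the assumption of one kernel launch per field update and a host-side ghost exchange in every step. The kernel names, buffer names, and the exchange helper are hypothetical and assumed to be defined elsewhere; this is not the authors' implementation.

 // MPI-CUDA time-step skeleton (illustrative sketch).
 #include <cuda_runtime.h>

 // Declarations only; definitions are assumed to exist elsewhere.
 __global__ void update_E_kernel(double *fields);
 __global__ void update_H_kernel(double *fields);
 void exchange_ghosts(double *halo, int rank, int size);   // MPI, see Section 3.3

 void run_fdtd(double *dev_fields, double *dev_halo, double *host_halo,
               size_t halo_bytes, dim3 grid, dim3 block, int max_step,
               int rank, int size)
 {
     for (int t = 0; t < max_step; ++t) {
         update_E_kernel<<<grid, block>>>(dev_fields);     // E and E-UPML update
         cudaMemcpy(host_halo, dev_halo, halo_bytes,
                    cudaMemcpyDeviceToHost);               // stage boundary planes
         exchange_ghosts(host_halo, rank, size);           // swap with neighbors
         cudaMemcpy(dev_halo, host_halo, halo_bytes,
                    cudaMemcpyHostToDevice);               // push received planes
         update_H_kernel<<<grid, block>>>(dev_fields);     // H and H-UPML update
     }
     cudaDeviceSynchronize();                              // finish the last step
 }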

5. Experiments and Analysis

5.1. Experimental Environment

The ZiQiang4000 CPU cluster consists of 40 IBM X240 server blades; each blade contains two Intel E5-2680 processors (16 cores in total) and 64 GB of shared memory. The GPU cluster consists of 11 IBM DX360 nodes, each with two Intel E5-2680 processors, 64 GB of shared memory, and two NVIDIA M2090 GPUs. The software environment includes the CentOS Linux 6.3 operating system, the NVCC compiler, a C/C++ compiler with OpenMP support, and the MPICH2 parallel environment.

5.2. Model Validation and Analysis

Verifying the correctness of the parallel programs is a prerequisite for performance analysis. In order to verify the correctness of the program and the absorption effect of the UPML absorbing boundary, this paper analyzes three-dimensional point-source radiation. We compared the run time on GPUs and CPUs for a Gaussian excitation at 900 MHz. The mesh size is 0.02 mm, the UPML absorbing layer is 10 cells thick, and the maximum time step is 1000. Figure 6 shows the electric field strength in the XY cross sections at layers 50 and 65 after 1000 time steps. As we can see in the figure, the electromagnetic wave spreads outward, and the farther the cross-sectional layer is from the excitation source, the smaller the field values become. The excitation source is at the center of the field map, and the field gradually spreads toward the borders, which matches the propagation of electromagnetic waves.

5.3. Parallel Analysis

In order to obtain more intuitive results, we expand the calculation scale, and the UPML absorbing boundary is set to 16 cells. The results after 1000 time steps using the MPI model are as follows.

Table 1 shows the point-source radiation calculation time based on the MPI model, together with the speedup and parallel efficiency for the communication scheme. In this paper, we ran 16 processes on each node. The table shows that, as the number of processes increases, the computing time is significantly reduced and the speedup is nearly linear, so the electromagnetic radiation program has good scalability and parallelism in a distributed parallel environment. But as the number of processes grows, the parallel efficiency decreases, because more processes enlarge the communication area, which increases communication overhead and causes load imbalance between nodes.
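For reference, assuming Table 1 uses the standard definitions, the speedup and parallel efficiency on $p$ processes are

$$S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p},$$

where $T_1$ is the single-process runtime and $T_p$ is the runtime on $p$ processes.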

From Figure 7, we can see that at small calculation scales the acceleration from the GPU is not obvious. This is because too few threads are created to give each SM enough blocks to execute concurrently, which reduces GPU occupancy. As the calculation size increases, the GPU computing resources are gradually used and the computation speed improves. Therefore, at smaller scales the GPU speedup is limited, whereas in large-scale electromagnetic calculations the GPU acceleration is very impressive, reaching about 9 times the CPU. After optimizing the memory access patterns with shared memory, the maximum speedup reaches 23 at the largest grid size, because the partitioned data blocks fit into GPU memory.

The MPI-OpenMP hybrid programming model combines the two models, taking advantage of both shared memory and message passing and improving the parallelism of the FDTD algorithm. The FDTD program consists mainly of parameter initialization, the time-step loop, the absorbing boundary calculation, the electric field update, and data processing. Over 98% of the total time is spent in two stages: updating the absorbing boundaries and updating the electromagnetic fields. We can use the OpenMP programming model to improve parallelism in these two stages. In this paper, each node runs two processes, and the number of parallel threads is 8. Similarly, the MPI-CUDA model uses shared memory and message passing, with the electric field updates performed on the GPU and the thread blocks divided as described in Section 4. The calculation results are shown in Table 2.

From Columns 2 and 3 of Table 2, it can be seen that the pure MPI program outperforms the MPI-OpenMP model when the number of computing cores is small. This is because MPI communication is very light when the number of processes is small, while the OpenMP program adds thread overhead. But as the number of cores increases, the growth rate of the MPI model declines because MPI communication and synchronization overhead increases, while the MPI-OpenMP program obtains better parallelism because the saved communication offsets the thread overhead. Comparing the two parallel methods, the MPI program performs better than the MPI-OpenMP program when fewer than 32 cores are used, but as the number of cores increases, the advantage of the MPI-OpenMP program gradually appears and it achieves better speedup.

From Columns 2 and 4 of Table 2, it can be seen that the MPI-CUDA program is better than the pure MPI model, and its speedup increases almost linearly when the number of processes is small. But as the number of processes grows, the efficiency decreases noticeably: the MPI communication overhead accounts for an increasing share of the total time, and as the overall calculation region is divided into more subregions, the problem size per process shrinks, which lowers GPU parallel efficiency. We find that the speedup of the MPI-CUDA programming model is about 20 times that of the pure MPI program, which effectively improves the parallelism of the program.

Figure 8 shows the results of the FDTD algorithm based on the two hybrid models. Thanks to the GPU's fast floating-point operations, the MPI-CUDA parallel model clearly performs better; however, this model is limited by memory and is device-dependent. The MPI-OpenMP program saves communication costs and has better scalability. Both parallel models are suitable for the massive-scale electromagnetic calculations that arise in practice. Therefore, the programmer should choose between the hybrid models according to the circumstances.

6. Conclusions

With the development of high-performance computing, new technologies provide more room for improving the FDTD algorithm, and the variety of parallel models has made hybrid programming the mainstream of high-performance computing. In this study, we implemented the three-dimensional FDTD method with the MPI-OpenMP model. In addition, we implemented the FDTD method in a multi-GPU cluster environment with CUDA and MPI and carried out the simulation of three-dimensional electromagnetic radiation. The results indicate that hybrid programming models can better exploit distributed and shared memory to improve parallel performance.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Major Research Plan of NSFC [no. 91330116], the National Science Foundation of China [no. 71203135], Key Project of Science and Technology Commission of Shanghai Municipality [no. 11510500300], and the Research Fund for the Doctoral Program of Higher Education [no. 20113108120022].