#### Abstract

A parallel FDTD method is applied to analyze electromagnetic problems involving electrically large targets on a supercomputer. It is well known that using more processors reduces the computing time. Nevertheless, for a fixed number of processors, computing efficiency is affected by the choice of MPI virtual topology. The influence of different virtual topology schemes on the parallel performance of parallel FDTD is therefore studied in detail. General rules are presented for obtaining the highest efficiency of the parallel FDTD algorithm by optimizing the MPI virtual topology. To show the validity of the presented method, several numerical results are given. Various comparisons are made and some useful conclusions are summarized.

#### 1. Introduction

The Finite Difference Time Domain (FDTD) method, introduced by Yee in 1966 [1], is one of the most popular three-dimensional methods in computational electromagnetics. The FDTD method has been applied to many radar cross section (RCS) calculations with accurate results. However, although it is a powerful numerical technique, the FDTD method is limited by the available computational resources when analyzing the scattering of electrically large targets. To overcome this problem, a parallel FDTD algorithm using the Message Passing Interface (MPI) library was developed by Volakis et al. in 2001 [2]. It is easy to implement because the Yee scheme is explicit. The FDTD computational space in Cartesian coordinates can be easily divided into many subdomains, and each computer in a parallel system deals with one or several subdomains. The FDTD algorithm is combined with MPI to run on a parallel system. MPI functions are employed to exchange the tangential electric (magnetic) fields on the subdomain boundaries between adjacent neighbors [3–5]. Parallel computation of the E-H components with an MPI Cartesian 2D topology is explained step by step in [6], which, according to its authors, is the first paper on parallel FDTD using the MPI protocol. Zhang et al. used an MPI Cartesian 3D topology in their research [7]. As an extension of this work, we have developed a parallel FDTD code using an MPI Cartesian 3D topology. It has been run successfully on a PC cluster [5, 6] and a blade server [8], both of which belong to the Science and Technology on Antenna and Microwave Laboratory in China.

The code has now been extended to solve larger-scale electromagnetic problems. It is run on a Linux-based supercomputer at the Shanghai Supercomputer Center of China (SSC). Numerical examples show that the virtual topology strongly affects the computational efficiency of parallel FDTD. In this paper, the influence of different virtual topology schemes on the parallel performance of parallel FDTD is studied in detail. General rules are presented for obtaining the highest efficiency of the parallel FDTD algorithm by optimizing the MPI virtual topology.

In Section 2, the parallel FDTD algorithm is presented briefly. Section 3 describes the hardware platforms. In Section 4, numerical results are given, which show that the method is efficient and accurate; the current distribution over the surface of the scatterer and the near-field distribution are plotted, and the influence of different virtual topology schemes and different numbers of processors on the parallel performance of parallel FDTD is discussed. Finally, some useful conclusions are summarized in Section 5.

#### 2. Parallel Algorithm

MPI was proposed as a standard by a broadly based committee of vendors, implementers, and users. It has become a standard interface for communication among a cluster of computers or the processors of a multiprocessor parallel computer. The key problem in MPI-based programming is how to distribute the user's tasks among processors according to the capability of each processor while keeping the communication among processors as small as possible. Reducing communication is especially crucial because communication is far slower than computation.

FDTD is easy to implement because the Yee scheme is explicit. In addition, it has the principal advantage that, since the grid is regular and orthogonal, electromagnetic field components are easily indexed in Cartesian coordinates. The parallelism of the FDTD algorithm is based on a spatial decomposition of the simulation problem geometry into contiguous nonoverlapping rectangular subdomains. The computational space can be easily divided into nearly equal parts along the three directions, and each processor in a parallel system deals with one or several subdomains, as shown in Figure 1(a). The virtual topology of the processors’ distribution is chosen to have a shape similar to that of the problem partition. Each subdomain is mapped to its associated node, where all the field components belonging to this subdomain are computed. To update field values lying on interfaces between subdomains, it is necessary to exchange data between neighboring processors. An *x-z* slice of the computational volume at one of the interfaces between nodes in the *y*-direction is shown in Figure 1(b). In this paper, we adopt a 3D communication pattern, which is introduced as follows.
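The division into nearly equal parts can be illustrated with a short sketch. The function names `split_extent` and `subdomain_bounds`, and the grid and topology sizes in the usage note, are hypothetical and chosen only for illustration; the actual partitioning code used in this work may differ.

```python
def split_extent(n_cells, n_parts):
    """Split n_cells grid cells into n_parts contiguous, nearly equal ranges.

    Returns a list of half-open (start, stop) index ranges.
    """
    if n_parts < 1 or n_parts > n_cells:
        raise ValueError("need 1 <= n_parts <= n_cells")
    base, extra = divmod(n_cells, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        # The first `extra` parts receive one additional cell each.
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def subdomain_bounds(grid, topology, coords):
    """Index ranges along x, y, z owned by the process at Cartesian `coords`."""
    return tuple(
        split_extent(n, p)[c] for n, p, c in zip(grid, topology, coords)
    )
```

For example, with an assumed grid of 100 × 60 × 40 cells and a 2 × 2 × 2 topology, `subdomain_bounds((100, 60, 40), (2, 2, 2), (1, 0, 1))` gives the ranges `((50, 100), (0, 30), (20, 40))`.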

**Figure 1:** (a) Partition of the simulated problem; (b) field components communication.

The FDTD algorithm is combined with MPI to run on a parallel system. MPI functions are employed to exchange the tangential electric (magnetic) fields on the subdomain boundaries between adjacent neighbors.
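To make the notion of adjacent neighbors concrete, the following sketch emulates in plain Python the rank-to-coordinate mapping of a non-periodic Cartesian topology. It assumes the row-major rank ordering that MPI uses for Cartesian topologies; in an MPI program the same information would normally come from `MPI_Cart_create`, `MPI_Cart_coords`, and `MPI_Cart_shift`. The helper names are hypothetical, and `None` stands in for `MPI_PROC_NULL`.

```python
def rank_to_coords(rank, dims):
    """Row-major mapping from a rank to Cartesian coordinates (last axis fastest)."""
    coords = []
    for extent in reversed(dims):
        rank, c = divmod(rank, extent)
        coords.append(c)
    return tuple(reversed(coords))


def coords_to_rank(coords, dims):
    """Inverse of rank_to_coords."""
    rank = 0
    for c, extent in zip(coords, dims):
        rank = rank * extent + c
    return rank


def neighbors(rank, dims):
    """Return {axis: (lower_rank, upper_rank)} for a non-periodic topology.

    None marks a missing neighbor at the edge of the processor grid.
    """
    coords = rank_to_coords(rank, dims)
    result = {}
    for axis, extent in enumerate(dims):
        pair = []
        for step in (-1, 1):
            c = coords[axis] + step
            if 0 <= c < extent:
                shifted = coords[:axis] + (c,) + coords[axis + 1:]
                pair.append(coords_to_rank(shifted, dims))
            else:
                pair.append(None)
        result[axis] = tuple(pair)
    return result
```

In a 2 × 2 × 2 topology, for instance, rank 5 sits at coordinates (1, 0, 1) and has exactly one neighbor along each axis.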

The parallel algorithm can be described as follows.

1. Initialization.
   1. MPI initialization.
   2. Reading the modeling parameters from the input files.
   3. Creation of the three-dimensional Cartesian topology.
   4. Creation of the derived data types for communication.
   5. Start of time measurement.
   6. Allocation of memory.
   7. Setting all field components to zero.
2. At each time step.
   1. Exciting the source only on processors that include the source plane.
   2. Calculation of new magnetic field components on each processor.
   3. Communication of the magnetic field components between processors.
   4. Calculation of new electric field components on each processor.
   5. Communication of the electric field components between processors.
   6. Calculation of transmission only on processors that include the transmission plane.
   7. Collection of field variables only on processors that include detection points.
3. Reducing the transmission to a fixed processor and writing it to file.
4. Saving results in files.
5. Deallocation of memory.
6. Stop of time measurement.
7. MPI finalization.
8. End.
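The order of operations inside the time-stepping loop can be illustrated with a deliberately simplified sketch. It is not the three-dimensional MPI code used in this work: it is a one-dimensional FDTD in normalized units, the subdomains are emulated inside a single Python process, and the "communication" steps are plain copies into ghost entries. The Courant number, the Gaussian source, and all function names are assumptions made for the example.

```python
import math

COURANT = 0.5  # normalized time step; an assumption for this sketch


def source(step):
    """Hypothetical Gaussian pulse used as a soft source."""
    return math.exp(-((step - 30) / 10.0) ** 2)


def run_serial(n, steps, src):
    """Reference 1D FDTD: E on nodes 0..n-1 (PEC at both ends), H between nodes."""
    e = [0.0] * n
    h = [0.0] * (n - 1)
    for step in range(steps):
        e[src] += source(step)
        for i in range(n - 1):
            h[i] += COURANT * (e[i + 1] - e[i])
        for i in range(1, n - 1):
            e[i] += COURANT * (h[i] - h[i - 1])
    return e


def run_decomposed(n, steps, src, bounds):
    """Same scheme on contiguous subdomains [lo, hi), one ghost value per field."""
    if any(lo == src for lo, _ in bounds[1:]):
        raise ValueError("sketch assumes the source is not on a subdomain's first node")
    # e[p][k] = E[lo + k]; the extra last entry is a ghost copy of E[hi].
    # h[p][k] = H[lo - 1 + k]; entry 0 is a ghost copy of H[lo - 1].
    e = [[0.0] * (hi - lo + 1) for lo, hi in bounds]
    h = [[0.0] * (hi - lo + 1) for lo, hi in bounds]
    for step in range(steps):
        for p, (lo, hi) in enumerate(bounds):
            if lo <= src < hi:  # excite only on the subdomain holding the source
                e[p][src - lo] += source(step)
        for p, (lo, hi) in enumerate(bounds):  # update H on each subdomain
            for k in range(hi - lo):
                if lo + k < n - 1:
                    h[p][k + 1] += COURANT * (e[p][k + 1] - e[p][k])
        for p in range(1, len(bounds)):  # "communicate" H to the upper neighbor
            h[p][0] = h[p - 1][-1]
        for p, (lo, hi) in enumerate(bounds):  # update E on each subdomain
            for k in range(hi - lo):
                if 1 <= lo + k <= n - 2:
                    e[p][k] += COURANT * (h[p][k + 1] - h[p][k])
        for p in range(len(bounds) - 1):  # "communicate" E to the lower neighbor
            e[p][-1] = e[p + 1][0]
    # Gather the owned values (ghost entries dropped) into one global array.
    return [v for p, (lo, hi) in enumerate(bounds) for v in e[p][: hi - lo]]
```

With these assumptions, `run_decomposed(60, 100, 10, [(0, 20), (20, 40), (40, 60)])` reproduces `run_serial(60, 100, 10)`, which is the property a correct boundary exchange should preserve.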

#### 3. Hardware Platform

*(i) Think Station*

Its machine-type model is 4155-D43, with a total of 24 Intel(R) Xeon(R) X5650 CPU cores (2.67 GHz per CPU) and approximately 64 GB of RAM in total.

*(ii) Shanghai Supercomputer Center (SSC)*

We use 37 nodes of the Magic-cube machine, with a total of 512 AMD CPU cores (1.9 GHz per CPU, 4 cores per CPU), 16 CPU cores per node, 4 GB of RAM per core, and approximately 2.3 TB of RAM in total. InfiniBand is used for the network interconnection.

#### 4. Numerical Results

For the absorbing UPML medium, we use a thickness of 5 cells in the following examples.

##### 4.1. An Example for Validation

For validation, the bistatic RCS is calculated for a PEC sphere with a 1 m () radius. The incident plane wave arrives from the −*x* axis and is polarized along the −*z* axis. The working frequency is 1.0 GHz. The increment m is used here, and the amount of FDTD grids is . The bistatic RCS of the sphere is shown in Figure 2(a). The result is compared with the one obtained by the Method of Moments (MoM), and the two show good agreement. The current distribution over the surface of the sphere is given in Figure 2(b). In Figures 3(a) and 3(b), smooth contour fills of the amplitudes of the E field and H field distributions in the frequency domain are plotted, respectively. This problem is calculated on the *Think Station*. The total computation times (in seconds) for different numbers of processors and different virtual topology schemes over *1000 time-steps* are compared in Table 1.

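**Figure 2:** (a) Bistatic RCS of the sphere; (b) current distribution over the surface of the sphere.

**Figure 3:** (a) Amplitude of the E field distribution in the frequency domain; (b) amplitude of the H field distribution in the frequency domain.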

In Table 1, each virtual topology scheme is described by the number of processors along each of the three directions. A value of 1 in some direction implies that there is no topology in that direction. For example, a scheme with a value of 1 in two directions is actually a one-dimensional virtual topology. Similarly, a scheme with a value of 1 in only one direction is actually a two-dimensional virtual topology.
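This convention is easy to state in code. The sketch below treats a scheme as a triple of processor counts along *x*, *y*, and *z*; the function names and the three eight-processor triples used in the usage note are illustrative assumptions, not entries taken from Table 1.

```python
def effective_dimension(scheme):
    """Number of directions that are actually partitioned (value greater than 1)."""
    return sum(1 for p in scheme if p > 1)


def processor_count(scheme):
    """Total number of processors used by a scheme (product of its values)."""
    total = 1
    for p in scheme:
        total *= p
    return total
```

For eight processors, `(8, 1, 1)` is effectively one-dimensional, `(4, 2, 1)` two-dimensional, and `(2, 2, 2)` three-dimensional, and `processor_count` returns 8 for each of them.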

From Table 1, it is clear that increasing the number of processors rapidly reduces the computation time. However, different virtual topology schemes lead to different computation times even when the code is run with the same number of processors.

We now discuss the parallel performance of parallel FDTD using virtual topologies of different dimensions with the same number of processors. In this case, the computing times are compared with the shortest time taken as the reference. Take the case of eight processors as an example. The reference computing time is 91.25 seconds using the virtual topology scheme:

From the above, it is clear that, for the same number of processors, the higher the dimension of the virtual topology, the less computation time is required.

Parallel efficiency is shown in Figure 4, in which the computing time for each number of processors is taken as the shortest time among the schemes for that case. Parallel efficiency decreases as the number of processors increases. This is because the amount of transferred data grows with the number of processors, so the time spent on communication increases.
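A minimal sketch of this calculation follows. It assumes the common definition of efficiency as speedup divided by the increase in processor count, measured relative to the smallest processor count available, and it uses the shortest time among the schemes for each count as described above. The timing values in the usage note are hypothetical and are not the values in Table 1 or Figure 4.

```python
def parallel_efficiency(times_by_procs):
    """Map processor count -> efficiency relative to the smallest count.

    `times_by_procs` maps a processor count to the list of wall times measured
    for its different virtual topology schemes; the shortest one is used.
    """
    best = {p: min(times) for p, times in times_by_procs.items()}
    p_ref = min(best)
    t_ref = best[p_ref]
    # efficiency = (t_ref / t) / (p / p_ref)
    return {p: (t_ref * p_ref) / (t * p) for p, t in sorted(best.items())}
```

With hypothetical timings such as `{1: [800.0], 2: [420.0, 410.0], 4: [230.0, 220.0], 8: [130.0, 125.0]}`, the efficiency falls from 1.0 on one processor to 0.8 on eight, the kind of decreasing trend discussed above.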

In addition, for virtual topologies of the same dimension, partitioning along the direction in which the number of FDTD grids is larger saves computation time. Different subdomain divisions with the same dimensional virtual topology lead to different amounts of transferred data. The total number of grids lying on the interfaces between processors is determined by the total numbers of grids in the *x*, *y*, and *z* directions and by the values of the virtual topology in the three directions. The topology values are integers whose product must equal the number of processors. Therefore, the topology scheme should be created along the directions in which the number of FDTD grids is larger, so as to reduce the amount of transferred data and thus the communication time.
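One simple way to quantify this, assuming planar cuts, is to count the grid cells on the internal cut planes as $(P_x-1)N_yN_z + (P_y-1)N_xN_z + (P_z-1)N_xN_y$ subject to $P_xP_yP_z = P$. The symbols $N_x, N_y, N_z$ (grid sizes), $P_x, P_y, P_z$ (topology values), and $P$ (number of processors) are notation introduced for this sketch, and the count may differ from the expression used in this work by constant factors. The grid sizes in the usage note are hypothetical.

```python
def topology_schemes(n_procs):
    """All ordered triples (px, py, pz) of positive integers with px*py*pz == n_procs."""
    schemes = []
    for px in range(1, n_procs + 1):
        if n_procs % px:
            continue
        rest = n_procs // px
        for py in range(1, rest + 1):
            if rest % py == 0:
                schemes.append((px, py, rest // py))
    return schemes


def interface_cells(grid, scheme):
    """Grid cells lying on the internal cut planes between subdomains."""
    nx, ny, nz = grid
    px, py, pz = scheme
    return (px - 1) * ny * nz + (py - 1) * nx * nz + (pz - 1) * nx * ny


def rank_schemes(grid, n_procs):
    """Schemes for n_procs processors, smallest interface first."""
    return sorted(topology_schemes(n_procs), key=lambda s: interface_cells(grid, s))
```

Under this count, a cubic 100 × 100 × 100 grid on eight processors favors the three-dimensional scheme `(2, 2, 2)`, whereas a strongly elongated 400 × 50 × 50 grid favors `(8, 1, 1)`, that is, cutting only along the longest direction.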

At this point, the general rules on how to obtain the highest efficiency of the parallel FDTD algorithm by optimizing the MPI virtual topology can be drawn as follows.

1. If possible, the virtual topology scheme should be created in three dimensions; the next best choice is two dimensions, which gives higher efficiency than one dimension.
2. For virtual topologies of the same dimension, the topology scheme should be created along the directions in which the number of FDTD grids is larger.

##### 4.2. Radiation of the Waveguide with Ten Slots

A waveguide with ten slots is analyzed by parallel FDTD. The dimensions of the waveguide and the slot structure in this example are chosen as follows: the thickness of the waveguide wall is 1.27 mm, the length of each slot is 15.785 mm, the width of each slot is 2.54 mm, and the offset of every slot is 6.35 mm. Its FDTD model is shown in Figure 5(a). Its working frequency is 10 GHz. The increment mm is used here, and the total amount of FDTD grids is . We introduce a sinusoidally modulated Gaussian pulse excitation by modifying the updating equation for the component in the excitation plane as follows:
where . By properly setting , and , we can get a useful frequency bandwidth. Finally, a smooth contour fill of the electric field distribution in the frequency domain on plane is shown in Figure 5(b). The radiation patterns obtained by parallel FDTD agree very well with those obtained by HFSS, as shown in Figures 6(a) and 6(b). The total computation times (in seconds) with eight processors and different virtual topology schemes over *1000 time-steps* are compared in Table 2.
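As an illustration of such an excitation, the sketch below uses one common form of a sinusoidally modulated Gaussian pulse, a Gaussian envelope of width `tau` centered at `t0` multiplying a sine at the carrier frequency `f0`. The exact expression and parameter names used in this work are not reproduced here; `f0`, `t0`, `tau`, and both function names are assumptions made for the example.

```python
import math


def modulated_gaussian(t, f0, t0, tau):
    """Gaussian envelope centered at t0 with width tau, modulating a sine at f0."""
    envelope = math.exp(-((t - t0) / tau) ** 2)
    return envelope * math.sin(2.0 * math.pi * f0 * (t - t0))


def tau_for_half_width(half_width_hz):
    """Envelope width for which the spectrum amplitude drops to 1/e of its
    peak at an offset of half_width_hz from the carrier frequency."""
    return 1.0 / (math.pi * half_width_hz)
```

For a 10 GHz carrier, a hypothetical choice such as `tau = tau_for_half_width(2e9)` with `t0 = 4 * tau` keeps the pulse close to zero at `t = 0` while concentrating its spectrum around the working frequency.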

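**Figure 5:** (a) FDTD model of the waveguide with ten slots; (b) electric field distribution in the frequency domain.

**Figure 6:** (a), (b) Radiation patterns obtained by parallel FDTD compared with those obtained by HFSS.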

From Table 2, the shortest computing time obtained with a two-dimensional virtual topology is shorter than that obtained with a one-dimensional one. However, the three-dimensional virtual topology is not better than the two-dimensional one. As shown in Table 2, the transferred data for the virtual topology scheme is much more than the one for the case. Thus, for virtual topologies of the same dimension, the topology scheme should be created along the directions in which the number of FDTD grids is larger.

##### 4.3. Analysis of the Scattering of an Airplane

Next, we analyze the scattering of a perfectly conducting airplane whose FDTD model is shown in Figure 7(a). Its working frequency is 200 MHz. The increment is used here. The incident wave travels in the −*x* direction and is polarized along +*y*. The induced current distribution over the airplane surface is given in Figure 7(b). Figure 8 presents the bistatic RCS of the conducting airplane obtained by using parallel FDTD. The frequency-domain near-field distribution on the plane is given in Figure 9.

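**Figure 7:** (a) FDTD model of the airplane; (b) induced current distribution over the airplane surface.

**Figure 8:** (a), (b) Bistatic RCS of the conducting airplane.

**Figure 9:** (a), (b) Frequency-domain near-field distribution.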

This example is calculated on the supercomputer at the SSC using 512 cores. The times consumed by two virtual topology schemes with the same number of cores are listed in Table 3. Total amount of the FDTD grids is . The results are consistent with the rules presented above. For virtual topologies of the same dimension, the parallel performance is mainly affected by the amount of transferred data, especially in large-scale problems.

#### 5. Conclusions

In this paper, a parallel FDTD method is applied to analyze the scattering of electrically large targets. The code we developed has been run successfully on the supercomputer at the Shanghai Supercomputer Center of China (SSC). The influence of different virtual topology schemes on the parallel performance of parallel FDTD is studied in detail. The results show that the computational efficiency can be improved by properly choosing the MPI virtual topology scheme. By following the two general rules given above, the highest computational efficiency can be obtained.

#### Acknowledgments

This work is partly supported by the Fundamental Research Funds for the Central Universities of China (JY10000902002 and K50510020017) and the National Science Foundation of China (61072019). This work is also supported by Shanghai Supercomputer Center of China (SSC).