Abstract
Finite difference time domain (FDTD) method is a very poplar way of numerically solving partial differential equations. FDTD has a low operational intensity so that the performances in CPUs and GPUs are often restricted by the memory bandwidth. Recently, deeply pipelined FPGA accelerators have shown a lot of success by exploiting streaming data flows in FDTD computation. In spite of this success, many FPGA accelerators are not suitable for realworld applications that contain complex boundary conditions. Boundary conditions break the regularity of the data flow, so that the performances are significantly reduced. This paper proposes an FPGA accelerator that computes commonly used absorbing and periodic boundary conditions in many 3D FDTD applications. Accelerator is designed using a “Clike” programming language called OpenCL (open computing language). As a result, the proposed accelerator can be customized easily by changing the software code. According to the experimental results, we achieved over 3.3 times and 1.5 times higher processing speed compared to the CPUs and GPUs, respectively. Moreover, the proposed accelerator is more than 14 times faster compared to the recently proposed FPGA accelerators that are capable of handling complex boundary conditions.
1. Introduction
Finite difference time domain (FDTD) method is a very important and widely used one in many areas such as electromagnetic field analysis [1], optoelectronics [2], and antennas [3]. FDTD computation is an iterative method where a grid is updated in each iteration according to a fixed computation pattern. It is one of the most researched subjects and there are many proposals for new algorithms and accelerators. FPGA accelerators [4–10] are getting popular recently since they use deeply pipelined architectures by exploiting streaming data flows in FDTD computation. Streaming data flow computing is ideal for applications with low operational intensity [11] such as FDTD. The operational intensity refers to the amount of operations per data movement. In spite of the success, many FPGA accelerators are purely academic and cannot be used efficiently for realworld applications. The main reason for this is the lack of support for boundary conditions or the inefficiency of computing them.
Commonly available boundary conditions in FDTD computation are Dirichlet, periodic, and absorbing boundary conditions. Dirichlet or the fixed boundary condition is the simplest one where a constant (usually zero) is used for the data on the boundaries. Fixed boundaries are used when the outside of a grid is a perfect electric or a magnetic conductor. In this case, the fields on the boundaries equal zero. Most of the existing works use fixed boundary conditions to gain high speedups by preserving the regularity of the data flow. Absorbing boundary condition (ABC) is often used to simulate infinite domains by applying absorbing layers close to the boundary. Implementing ABC computation could break the regularity of the data flow due to the data dependency near the boundaries. Periodic boundary condition (PBC) is used to simulate infinite periodic structures by computing a small portion called a “unit.” PBCs are applied at the boundaries so that the same unit cell is replicated. Implementing PBC computation without breaking the regularity of the data flow is even more difficult, due to the data dependency among different boundaries. However, many realworld FDTD applications [12–14] use both absorbing and periodic boundaries, and their efficient implementation is very important.
In this paper, we propose an FPGA accelerator for 3D FDTD computation that efficiently supports absorbing and periodic boundary conditions. This paper is an extension of our previous work [15] which introduces the basic FPGA accelerator architecture. In this paper, we improve the FPGA architecture and implement it using a “Clike” programming language called “OpenCL” (open computing language) [16]. Therefore, we can use the proposed accelerator for different applications and boundary conditions by just changing the software code. Moreover, OpenCL provides a complete framework for system design that includes not only the FPGA accelerator design, but also the device drivers and API to control and transfer data between the accelerator and the host [17]. According to the experimental results, we achieved over 3.3 and 1.5 times higher processing speeds compared to those of the CPUs and GPUs, respectively. Moreover, the proposed accelerator is more than 14 times faster compared to the recently proposed FPGA accelerators that are capable of handling complex boundary conditions.
2. Related Works
FPGA accelerators for FDTD computation with simple boundary conditions are already implemented in previous works [5–7]. Although those works use different design methods such as OpenCL [16], HDL, or MaxCompiler [18], the basic architecture is very similar. To explain the accelerator architecture, we consider a simple example of 2D stencil computation as shown in Figure 1. Note that FDTD method is also regarded as a stencil computation. Figure 1(a) shows the data streaming order in a 2D grid. A gridpoint is called a cell. Data of the cells are streamed from lefttoright along the axis and bottomtotop along the axis. Figure 1(b) shows the computations of two consecutive iterations. To compute in the iteration (), data of its surrounding cells belonging to the iteration are required. When the computation of is in progress in the iteration (), all the data required for the computation of are available. Therefore, the computations of in iteration and in iteration can be done in parallel. We call this “iterationparallel” computation. Detailed description of the iterationparallel computation can be found in [19].
(a) Order of the data flow
(b) Parallel computation of the cells in multiple iterations. Computations of and can be done in parallel
Figure 2 shows the flowchart of the whole computation. Computations of iterations are done in parallel on FPGA. If the total number of iterations is larger than (which is the usual case), the FPGA computation is done for “” times. In this method, the global memory is read once in each iteration. Iterationparallel computation reduces the global memory access while allowing sufficient amount of parallel operations.
The works in [5–7] propose FPGA accelerators based on iterationparallel computation. A high level design tool called MaxCompiler [18] is used in [5] to accelerate 3D FDTD computation. OpenCL is used in [6, 7] for 2D FDTD computation. Although these works use iterationparallelism to increase the processing speed, they only support simple boundary conditions such as fixed boundaries where constants are used as the boundary data.
The works in [8–10] propose FPGA accelerators that support complex boundary conditions. Impedance boundary condition is used in [8]. In terms of computation, impedance boundary is more complex compared to a fixed boundary, since it requires more computations. In terms of data dependency, it is easy to implement since the outputs are produced in the same order of the input data stream. However, it can be used only for 2D FDTD and the computations are done in 32bit fixed point. Therefore, the usage of such an accelerator is severely restricted. Absorbing boundary conditions are supported by the FPGA accelerator proposed in [9]. However, it has not considered the periodic boundary conditions. The work in [10] is one the very few works that proposes an FPGA accelerator to compute both absorbing and periodic boundary conditions. The boundaryrelated data streamed out from the FPGA to the CPU through the PCI express bus. They are processed in the CPU and streamed into the FPGA after the computation of the core area is finished. In this method, the frequent data transfers and synchronizations between CPU and FPGA reduce the performance. Although the accelerators in [8–10] process boundary conditions, they have not used the iterationparallel computation which is the most efficient way on FPGAs to increase the processing speed. Since the boundary conditions distort the regularity of the data flow, it is usually difficult to implement the iterationparallelism.
3. FPGA Accelerator for 3D FDTD Computation
3.1. 3D FDTD with Periodic Boundary Conditions
In this section, we explain the FPGA acceleration for 3D FDTD computation when there are periodic and absorbing boundary conditions. Figure 3 shows the simulation area. Simulating very large grids could take enormous amount of processing time. Instead, we divide the grid into multiple boxes and consider that the same unitbox is replicated. We use periodic boundary conditions to correctly update the box. Using this method, we can reduce the computation amount significantly. This method is used for realworld FDTD simulations in many works such as [20–22].
Figure 4 shows the 3D computation domain of the unitbox. It contains six boundaries and the core. A cell in this 3D domain is denoted by the coordinates , , and . The electric fields , , and belonging to three directions are computed in each cell. The computations are classified into the core area and the boundary area computations. Absorbing boundary conditions are used to compute the top and the bottom boundaries ( boundaries), while periodic boundary conditions are used for the other four boundaries ( and boundaries). Figure 5 shows a cross section of the computation domain that represents one plane. To compute the electric fields of , we need the data of the of the same iteration. Therefore, the must be computed before computing the . This is against the order of the input data stream shown in Figure 1(a).
Figure 6 shows the flowchart of the 3D FDTD algorithm. In usual Yee’s FDTD algorithm [1], both the firstorder differential equations of electric and magnetic fields are calculated. However, if we are interested only in electric field, we can simply substitute the magnetic field terms by equations that only contain electric field terms. The details of this method are available in [23, chapter 2]. Since the same equations are used, the physical model and the numerical values are the same as Yee’s algorithm. However, we need the electric field data of the current and the previous iteration to compute the next iteration. In each iteration, both periodic and absorbing boundary data are computed. The periodic boundary data of iteration are computed using the electric field data of the same iteration. The absorbing boundary data are computed using the electric field data of the iterations and . Since the boundary computations require the data of the core cells, data dependencies exist among different cells of the same iteration. Therefore, the computations of the current iteration must be completed before starting the computations of the next iteration. Due to this problem, we cannot use the existing methods that are based on the iterationparallel computation.
The computation of the electric field in direction at time step () is shown in (1). To compute , we need the electric field data at the time steps and . We also need the electric field data in and directions at the time step and also some constants such as ~ and ssn. Note that the computations of and are also done similar to Constant data set ssn contains wave impedance and refractive index data. It is given by Note that is the refractive index data for each coordinate, and it is given initially as an input data file. The minimum value of the refractive index data is given by . The stability constant is given by CFL and we use the value 0.99. The values of and do not change with iterations. Those are constants for all iterations. Therefore, we can compute those on CPU and the processing time required for this computation is very small, compared to the processing time of the whole FDTD computation.
3.2. Data Flow Regulation
Our goal in the proposed implementation is to maintain the regularity of the input and output data streams even when there are boundary conditions. However, as explained in Section 3.1, the order of the computations is different from that of the input data stream. To solve this problem, we regulate the output data stream by reordering the computed data. For example, let us consider the plane shown in Figure 5 and assume that and . We consider the computations of the cells to when on plane. In this example, the data of the core are required to compute the boundary . Let us assume that the data of is computed at the clock cycle . Using the data of the as the input, we can compute the boundary at the clock cycle . To maintain the regularity of the output data stream, the output data should be in the order of cells. If the data of the is released to the output data stream at the clock cycle , the data of the must be released at the clock cycle and even it was computed before the . That is, the data of the must be stored for 7 clock cycles before being released to the output data stream. Similarly, the core data of the cells also have to be delayed for 7 cycles each.
Figure 7 shows the proposed pipelined computation module (PCM) of our FPGA accelerator. It contains modules to compute core and the boundary data and a shiftregister array to store the computed data. The input data streamed into the PCM in a regular pattern. The computation of the core starts before the boundaries. The computed data are written to the appropriate locations of the shiftregister array. The locations of the shiftregister array are calculated in such a way to preserve the data order. According to the example in Figure 5, the data of the is stored in seven places after the data of the , irrespective of the order of the computation. After the data area is arranged in the correct order, they are released to the output stream.
Figure 8 shows the overall architecture. It consists of PCMs where each of which processes one iteration. This architecture is very similar to the one discussed in [5–7]. The difference is the structure of a PCM which can process both core and boundary data and also produces a regular output data stream. To compute the electric field of the next iteration, we need the data of the current and the previous iterations. Fortunately, they are already available in the shiftregisters of the current and the previous PCMs. For example, if we have 3 PCMs, the initial data of the iteration and the data of three iterations , and are available simultaneously. Therefore, PCMs and an additional shiftregister array are sufficient to compute iterations in parallel as shown in Figure 8.
However, there are a few disadvantages in this method. Since we have to use shiftregisters to temporally store the computed data, the area of a PCM is increased. As a result, the number of PCMs can be decreased so that the processing time can be increased. The computed data have to be stored for several clock cycles until they are arranged correctly. This increases the latency and also the processing time.
3.3. OpenCLBased Implementation and Optimization
This section explained how to implement and optimize the proposed FPGA accelerator using OpenCL [16]. OpenCL is a framework to write programs to execute across a heterogeneous parallel platform that consists of a host and devices. “Intel FPGA SDK for OpenCL” [24] is used for FPGA and CPU based systems, where the CPU is the host and the FPGA is the device. The “kernels” are the functions executed on FPGA and they are implemented using OpenCL. The inputs for the kernels are transferred from the host, and the outputs are transferred back to the host.
Listing 1 shows an extract of a Cprogram used to compute the electric field in direction according to (1). Arrays , , and are used to store the data of the electric fields in iterations , , and , respectively. The electric fields at the coordinates and are represented by array indexes Enx[i][j][k] and Enx[i−1][j][k], respectively. Similar method is used to compute electric fields in and directions. Note that notation is used for the cell coordinates in equations, and [i][j][k] notation is used to show the data array indexes used in the Cprogram.

Since is defined at as shown in (1), we have to use the average of ssn[i+1][j][k] and ssn[i][j][k] to determine ssn at direction (ssnx). Similar process is applied to compute ssny and ssnz also, as shown in Listing 2. Note that arrays ssnx, ssny, and ssnz are used in the Cprogram to store the data of , , and , respectively. Constants , , and are the ssn values of , , and directions, respectively.

To calculate the electric fields , , and of a cell, we have to load nine data values that include electric field data of the present and the previous iterations of all three directions (Enx,Epx,Eny,Epy,Enz,Epz) and constants (ssnx,ssny,ssnz).
If all the data are stored in different offsets in the global memory, we require at least 9 “load transactions.” To reduce the amount of transactions, we use an arrayofstructure in the global memory to store the data. Each element in the array has the data of the present and the previous electric fields and the constants. Therefore, one array element provides all the data necessary to compute one cell. One array element contains 36 bytes ((float)) in single precision and 72 bytes in double precision. On the other hand, one memory load/store transaction accesses 64 bytes from the global memory. If the data are not aligned, one or two transactions are required to access an array element. If the data are aligned to 64byte boundary, only one transaction is required. Similarly, two memory transactions are required to load one array element in double precision for aligned data and more than two are required for nonaligned data.
To reduce the amount of memory transactions, we have to reduce the data amount of one array element. To reduce the data amount, we compute the constant in the FPGA instead of loading the already computed ones from the memory. For example, if we compute ssnx and ssny on FPGA, we have to load only ssn and ssnz from the memory. This reduces 4 bytes per cell. If we reduce the size of an array element to 32 bytes, we can load two array elements in one transaction. The computations of the constants can be done in parallel to the computations of the electric fields. However, it requires additional computation units and increases the logic area of the FPGA.
Algorithm 1 shows the pseudo code of the computation kernel of the accelerator. The inputs and outputs of the kernel are represented by the arrayofstructures in line . After an array element is read, the current and the previous electric field data of all three directions and constants are stored in the shiftregisters. Shiftregisters are included in each PCM to store the data of its inputs from the previous PCM. Shiftregisters are defined using twodimensional arrays as shown from line in Algorithm 1. One array dimension represents the number of PCMs and the other represents the lifetime of the data. The lifetime represents the number of steps that one data value should be stored in the FPGA until it is no longer required for any further computation. Note that the lifetime is quite large since the output of one iteration is used in next two iterations. We also need another set of shiftregisters to temporally store the computed data in order to regulate the data order. Those shiftregisters are shown from line . The lifetime of the data stored in those shiftregisters is small since the data are used in the same iteration. The behaviors of the shiftregisters are coded from lines to .

The computations of the cells in the core and the boundary are shown from line . Note that the computed data are temporary stored in the shiftregisters to regulate the output data stream. The locations the shiftregisters are determined by processing some conditional branches as shown from line 37. After the output data are available and the order is fixed, data are written to the shiftregisters of the next PCM. If it is the last PCM, data are written back to the global memory.
4. Evaluation
For the evaluation, we use two FPGA boards, two GPUs, and two multicore CPUs. The FPGA boards are DE5 [25] and 395D8 [26]. FPGAs are configured using Quartus 16.0 with SDK for OpenCL. The CPU code is written in C language with OpenMP directives and compiled using Intel C compiler 2016 (Intel Parallel Studio XE 2016) with relevant optimization options. GPU code is written in CUDA C code considering multithreaded dataparallel computation. GPUs are programmed using CUDA 7.5 compiler. The operating system is CentOS 6.7. We use a unitbox that has a periodic structure for the simulation. It is 960 nm × 960 nm × 9,540 nm long in , , and dimensions, respectively. The objective of the simulation is to find various optical phenomena such as guidedmode resonance, at the wavelength range of 1,000 nm~2,000 nm. To simulate such resonance phenomena, a fine spatial mesh is needed. Thus, we set the grid division as 20 nm for each direction. This leads to a grid of in each , , and axis, respectively. Note that the grid size is given by parameters in the OpenCL code, and we can use a different grid by changing the values of the parameters. Computation is done for 8,192 iterations.
Table 1 shows the comparisons of the accelerators with different data access methods. We used DE5 board that contains Stratix V 5SGXEA7N2F45C2 FPGA. The computations are done in singleprecision floating point. Method uses “noncoalesced” memory access since it loads 9 data values from different locations in the global memory. Therefore, nine memory transactions are required for the computation of a cell. Since one transaction loads 64 bytes of data, the computations of the neighboring cells can be done by accessing the cache instead of the global memory. Memory access is coalesced in method by using an arrayofstructures. An array element contains 36 bytes so that the data are not aligned to the 64byte boundary of a transaction. As a result, multiple transactions are required to access array elements, and the processing time is increased compared to method . This problem is easily solved in method by aligning the data to the 64byte boundary. After the data are aligned, only a single transaction is required to access an array element. Moreover, the clock frequency is also improved due to the simplicity of the data access. As shown in Table 1, the processing time is reduced significantly compared to method . In method , the memory access is further reduced by decreasing the input data amount. It is done by computing the constants ssnx and ssny inside the FPGA in every cycle, instead of loading precalculated constants from the memory. Using this method, we reduce the size of an array element to 32 bytes so that two array elements can be read in a single transaction. The clock frequency is improved and the processing time is reduced significantly compared to method . In method , the size of an array element is further reduced to 28 bytes by computing all constants (ssnx,ssny,ssnz) on FPGA. However, this has increased the number of access points to ssn as shown in Listing 2. Moreover, the logic area is increased compared to method . Some of these reasons could be the cause for the processing time increase and clock frequency reduction. The best balance of the computation and the memory access is achieved in method and it is the fastest implementation.
Table 2 shows the comparisons against the CPU and GPU based implementations. In this comparison, we used singleprecision floatingpoint computation. The processing speed using DE5 FPGA board is 3.3 times and 1.5 times higher compared to those of the CPUs and GPUs, respectively. The processing speed using 395D8 FPGA board is also higher than those of the CPUs and GPUs. The performances of GPUs are very sensitive to the memory bandwidth. Recent highend GPU such as K40 with 288 GB/s bandwidth could provide better performance compared to this work. However, we believe that our evaluation using GTX680 (192.2 GB/s bandwidth) is a reasonable one, since it has a similar bandwidth compared to the highend K20 GPU. In addition, it would be fair to compare devices such as Stratix V and GTX680, which are available in the same era. Despite having lower bandwidth and peak performance, DE5 FPGA gives the highest processing speed. Note that the peak performance of the FPGAs is calculated according to [27]. Unlike CPUs and GPUs, FPGAs are reconfigurable devices, and we can design the most suitable architecture for a given application considering its operations. We can use the resources efficiently by designing computing units that do only the required computation.
The comparison using doubleprecision floatingpoint computations is shown in Table 3. Although FPGAs give better performance compared to CPUs, they cannot beat the performances of the GPUs. Doubleprecision performance on FPGA is around 25% of the singleprecision performance. The current generation FPGAs (Stratix V series) do not have dedicated floatingpoint units. Multiplications are done in DSPs and the additions are done using logic blocks. Due to the large logic block requirement for doubleprecision computation, the performances are decreased. However, the nextgeneration FPGAs such as Aria 10 and Stratix 10 devices contain dedicated floatingpoint units. Therefore, we can expect a significant reduction of the logic area and increase of the processing speed.
Table 4 shows the comparisons with other recent works that propose FPGA accelerators for 3D FDTD. All computations are done in singleprecision floating point. The accelerator proposed in [9] provides 1000 mega cell/s processing speed using absorbing boundary conditions. However, it has not used periodic boundary conditions. Accelerators proposed in [5, 10] provide 325 and 1,820 mega cell/s processing speeds, respectively, for fixed boundary conditions. When absorbing boundaries are used, the performance of [10] is reduced to 65.5%. When periodic boundary conditions are used, the performance is reduced to just 5.5% and 8.4%, respectively, comparing with the ones that use fixed and absorbing boundary conditions. The accelerator proposed in this paper provides 1,427 mega cells/s processing speed on a single FPGA even using both absorbing and periodic boundary conditions together. This is a 14.2 times improvement compared to [10]. Since we used a single FPGA compared to four FPGAs in [10], our achievement is significant.
Memory bandwidths of the method in [10] are 26.8 GB/s and 29.9 GB/s when using fixed and absorbing boundaries, respectively. MAX3 board with Virtex6 FPGAs used in [10] has a theoretical memory bandwidth of 38.6 GB/s, so that the accelerator is not memory bound. The theoretical memory bandwidth of the DE5 board used in this work is 25.6 GB/s. If the method in [10] is implemented on the DE5 board, it will become memory bound and would perform worse compared to the implementation on MAX3. On the other hand, our method uses only 13.4 GB/s bandwidth so that it is not memory bound in either of the FPGA boards. We use a single but very deep pipeline so that the application does not become memory bound. However, [10] uses multiple but shorter pipelines in parallel, and that requires reasonably large bandwidth. When [10] uses both absorbing and periodic boundaries, the boundary data are processed in the host CPU. This requires frequent data transfers between host and FPGA. Such transfers depend on the PCIe bandwidth, and it is much lower than the memory bandwidth. Moreover, there could be some control overheads, synchronizing overheads, and so on that slow down the computation.
Table 5 shows the performance comparison against FPGA accelerators that use simple boundary conditions. All the implementations are done on DE5 FPGA board using singleprecision floatingpoint computation. The 2D FDTD computation with fixed boundaries can be regarded as one of the simplest FDTD computations. The most recent and the fastest implementation of the 2D FDTD computation is proposed in [7]. It produces 150.3 Gflop/s of processing speed. We also measured the performance of the 3D FDTD computation with absorbing boundaries. It produces 98.5 Gflop/s of processing speed. The proposed 3D FDTD computation with absorbing and periodic boundary conditions produces 89.9 Gflop/s of processing speed. The performance of the proposed method is slightly small compared to the 3D FDTD implementation with only absorbing boundaries. However, the performance is reduced to nearly 60% compared to the 2D FDTD with fixed boundaries. Despite this reduction, we believe that retaining nearly 60% of the processing speed of the simplest 2D FDTD and yet processing the complex boundary conditions is a significant achievement.
We may improve the performance using manual HDLbased designs. However, we believe that the gap between the performances of OpenCLbased and HDLbased designs is getting narrower. In our earlier work in [7], we found that OpenCLbased design provides 76% of the FPGA peak performance for 2D FDTD with fixed boundaries. In this paper, we achieved over 51% of the peak performance for 3D FDTD with absorbing boundary and 46% of the peak performance with both absorbing and periodic boundaries. Moreover, the work in [28] reports that, although HDLbased designs use less resources compared to OpenCLbased designs, there is not much difference in the clock frequency and performance.
5. Conclusion
We have proposed an FPGA accelerator for 3D FDTD that efficiently supports absorbing and periodic boundary conditions. The data flow is regulated in a PCM after the computation of the boundary data. This allows data streaming between multiple PCMs, so that we can implement iterationparallel computation. The FPGA architecture is implemented using OpenCL. Therefore, we can use it for different applications and boundary conditions by just changing the software code. Since OpenCL is a system design method, we can implement the proposed accelerator on any system that contains an OpenCLcapable FPGA. According to the experimental results, we achieved over 3.3 times and 1.5 times higher processing speeds compared to the CPUs and GPUs, respectively. Moreover, the proposed accelerator is more than 14 times faster compared to the recently proposed FPGA accelerators that are capable of handling boundary conditions.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.