Research Article  Open Access
Faisal Mahmood, Märt Toots, LarsGöran Öfverstedt, Ulf Skoglund, "Algorithm and Architecture Optimization for 2D Discrete Fourier Transforms with Simultaneous Edge Artifact Removal", International Journal of Reconfigurable Computing, vol. 2018, Article ID 1403181, 17 pages, 2018. https://doi.org/10.1155/2018/1403181
Algorithm and Architecture Optimization for 2D Discrete Fourier Transforms with Simultaneous Edge Artifact Removal
Abstract
Twodimensional discrete Fourier transform (DFT) is an extensively used and computationally intensive algorithm, with a plethora of applications. 2D images are, in general, nonperiodic but are assumed to be periodic while calculating their DFTs. This leads to crossshaped artifacts in the frequency domain due to spectral leakage. These artifacts can have critical consequences if the DFTs are being used for further processing, specifically for biomedical applications. In this paper we present a novel FPGAbased solution to calculate 2D DFTs with simultaneous edge artifact removal for highperformance applications. Standard approaches for removing these artifacts, using apodization functions or mirroring, either involve removing critical frequencies or necessitate a surge in computation by significantly increasing the image size. We use a periodic plus smooth decompositionbased approach that was optimized to reduce DRAM access and to decrease 1D FFT invocations. 2D FFTs on FPGAs also suffer from the socalled “intermediate storage” or “memory wall” problem, which is due to limited onchip memory, increasingly large image sizes, and strided columnwise external memory access. We propose a “tilehopping” memory mapping scheme that significantly improves the bandwidth of the external memory for columnwise reads and can reduce the energy consumption up to . We tested our proposed optimizations on a PXIebased Xilinx Kintex 7 FPGA system communicating with a host PC, which gives us the advantage of further expanding the design for biomedical applications such as electron microscopy and tomography. We demonstrate that our proposed optimizations can lead to reduced FPGA and DRAM energy consumption when calculating highthroughput 2D FFTs with simultaneous edge artifact removal. We also used our highperformance 2D FFT implementation to accelerate filtered backprojection for reconstructing tomographic data.
1. Introduction
Discrete Fourier Transform (DFT) is a commonly used and vitally important function for a vast variety of applications including, but not limited to, digital communication systems, image processing, computer vision, biomedical imaging, and biometrics [1, 2]. Fourier image analysis simplifies computations by converting complex convolution operations in the spatial domain to simple multiplications in the frequency domain. Due to the fundamental nature of 2D DFTs, they are commonly used in a variety of image processing applications such as tomographic image reconstruction [3], nonlinear interpolation, texture analysis, tracking, image quality assessment, and document analysis [4]. Because of their computational complexity, DFTs often become a computational constraint for applications requiring highthroughput and realtime or near realtime operations, specifically for machine vision applications [5]. Image sizes for many of these applications have also increased over the years, further contributing to the problem.
The CooleyTukey fast Fourier transform (FFT) algorithm [6], first proposed in 1965, reduces the complexity of DFTs from to for a 1D DFT. However, in the case of 2D DFTs, 1D FFTs have to be computed in two dimensions, increasing the complexity to , thereby making 2D DFTs a significant bottleneck for realtime machine vision applications [7]. Recently, there has been substantial effort to achieve highperformance implementations of multidimensional FFTs to overcome this constraint [5, 7–14]. Due to their inherent parallelism and reconfigurability, Field Programmable Gate Arrays (FPGAs) are attractive targets for accelerating FFT computations. Being a highly flexible platform, FPGAs can fully exploit the parallel nature of the FFT algorithm. 2D FFTs are generally calculated in stages where all elements of the first stage must be available before the second stage can be calculated. This creates the socalled “intermediate storage” problem associated with strided external memory access, specifically for large datasets.
While calculating 2D DFTs, it is assumed that the image is periodic, which is usually not the case. The nonperiodic nature of the image leads to artifacts in the Fourier transform, usually known as edge artifacts or series termination errors. These artifacts appear as several crosses of highamplitude coefficients in the frequency domain, as seen in [15, 16]. Such edge artifacts can be passed to subsequent stages of processing, and in biomedical applications they may lead to critical misinterpretations of results. Efficiently removing such artifacts without compromising resolution is a major problem. Moreover, simultaneously removing these spurious artifacts while calculating the 2D FFT adds to the existing complexity of the 2D FFT kernel.
Contributions. In this paper we present solutions for a highperformance 2D DFT with simultaneous edge artifact removal (EAR) for applications which require high frame rate 2D FFTs such as realtime medical imaging systems and machine vision for control. Our proposed optimizations lead to a highperformance solution for removing edge artifacts while the transform is being calculated, thus preventing time consuming and possibly erroneous postprocessing steps. Moreover, the proposed optimizations reduce the overall energy consumption. This work builds on our previous work presented in [8]. Major contributions include the following:(1)We propose optimized periodic plus smooth decomposition (OPSD) as an optimization for standard periodic plus smooth decomposition (PSD) for edge artifact removal (Section 4).(2)Based on OPSD, we propose an architecture that can reduce the access to DRAM and can decrease the number of 1D FFT invocations by performing columnbycolumn operations on the fly (Section 4).(3)Since OPSD is heavily dependent on efficient FPGAbased 2D FFT implementation which is limited by DRAM access problems, we design a memory mapping scheme based on “tilehopping”, which can reduce row activation overhead while accessing columns of data from the DRAM (Sections 5 and 6).(4)The proposed OPSD and memory “tilehopping” optimizations also lead to better energy performance as compared to rowmajor access (Section 6.4).(5)We use our implementation as an accelerator for filtered backprojection (FBP), an analytical tomographic reconstruction method, and show that for large datasets our 2D FFT with edge artifact removal (EAR) can significantly improve reconstruction run time (Section 7).
As compared to our previous work [8], the current implementation achieves better runtime, i.e., ms as compared to ms for a image. This increased performance is achieved by using stateoftheart hardware for highbandwidth communication between the FPGA and CPU modules, and by utilizing an efficient memory mapping scheme. The current work also analyzes the energy consumption of the proposed paradigm. Moreover, we also show how the realtime 2D FFTs can be used for tomographic reconstructions.
Paper Outline. The paper follows FPGA image processing design methodology outlined in [16, 17], which involves carefully profiling the software solution to understand computational bottlenecks and overcoming them through careful reformulation of the algorithm within a parallel hardware framework. Section 2 gives a comprehensive background of highperformance 2D FFTs using FPGAs, the DRAM intermediate storage problem, and edge artifacts. Section 3 presents PSD in detail which is the implementation objective. Section 4 presents OPSD, an optimized solution to reduce the latency of the serial part of the algorithm which limits overall performance. Section 5 presents a memory mapping scheme that can reduce the columnwise strided external memory access. Section 6 presents experimental results and explains target selection in detail, experimental setup, and benchmarks results. Section 7 presents filtered backprojection as a proposed application. Section 8 presents conclusions.
2. Background
2.1. HighPerformance 2D FFTs Using FPGAs
There are several resourceefficient, highthroughput implementation approaches of multidimensional DFTs on a variety of different platforms. Many of these methods are softwarebased and have been optimized for efficient performance on generalpurpose processors (GPPs), for example, Intel MKL [11], FFTW [9], and Spiral [10]. Implementations on GPPs can be readily adapted for a variety of scenarios. However, GPPs consume more power as compared to dedicated hardware and are not ideal for realtime embedded applications. Several applicationspecific integrated circuit (ASIC) based approaches have also been proposed [18–20], but since it is not easy to modify ASICs, they are not costeffective solutions for rapid prototyping of image processing systems. Graphical Processing Units (GPUs) on the other hand can achieve relatively high throughput but are energy inefficient and limit the portability of largescale imaging systems.
Due to their inherent parallelism and reconfigurability, FPGAs are attractive for accelerating FFT computations, since they fully exploit the parallel nature of the FFT algorithm. FPGAs are particularly an attractive target for medical and biomedical imaging apparatus and instruments such as electron microscopes and tomographic scanners. Such devices do not have to be manufactured in bulk to justify applicationspecific solutions and require high bandwidth. Moreover, increasing mobility and portability constitute a future objective for many medical imaging systems. FPGAs are also more efficient for prototyping machine vision applications since they are relatively more finegrained when compared to GPPs and GPUs and can serve as a bridge between generalpurpose and applicationspecific acceleration solutions.
2.2. DRAM Intermediate Storage Problem
There have been several highthroughput 2D FFT FPGAbased implementations over the past few years. Most of these rely on repeated invocations of 1D FFTs by row and column decomposition (RCD) with efficient use of memory [5, 7, 12, 21, 22]. RCD makes use of the fact that a 2D Fourier transform is separable and can be implemented in stages; i.e., a rowbyrow 1D FFT can be proceeded by a columnbycolumn 1D FFT with intermediate storage (Figure 1). Most of the previous RCDbased 2D FFT FPGA implementation approaches have two major design challenges: The 1D FFT implementation needs to have a reasonably high throughput and needs to be resource efficient. Moreover, spatial parallelism needs to be exploited by running several 1D FFTs simultaneously. External DRAM needs to be efficiently addressed and to have a high bandwidth.
(a) Dataset level view
(b) Memory level view
Since the columnbycolumn 1D FFT requires data from all rows, intermediate storage becomes a major problem for large datasets. Many implementations rely on local memory such as resourceimplemented block RAM for intermediate storage which is not possible for large datasets [21]. Large datasets have to be offloaded to external DRAM because only a portion of the dataset that fits on the chip can be operated on at a given time. For complex image processing applications, this means repeated storage and access to the external memory during every stage of processing.
As shown in Figure 2(a) DRAM hierarchy from top to bottom is rank, chip, bank, row, and column. Each DRAM bank (Figure 2(b)) has a row buffer that holds the most recently referred row. There is only one row buffer per bank which means only one row from the datagrid can be accessed at once. Before accessing a row, it has to be activated by transferring the contents from internal capacitor storage into a set of parallel sense amplifiers. The row buffer is the socalled “fast buffer”, because when a row is activated and placed in the buffer, any element can be accessed at random.
(a) DRAM hierarchy
(b) Single DRAM bank
(c) Row buffer hit and miss
If a new row has to be activated and accessed into the row buffer, a row buffer miss occurs and requires a higher latency, (Figure 2(c)). On the contrary if the desired row is already in the buffer, a row buffer hit or page hit occurs and the latency to access elements is substantially lower, . This implies , where is the overhead associated with accessing a new row to read a specific element (Figure 2(c)) [12]. There is also overhead involved in writing the row back to the datagrid (precharge), say, . However, both and can be concealed by interleaving (switching between banks). Since rowwise access is trivial, the rowbyrow 1D FFT part of RCDbased 2D FFT is easily accomplished. However, once the rowbyrow 1D FFT is stored in the DRAM in standard rowmajor order, to access a single column, each row of the DRAM has to be accessed into the row buffer, rendering the read process extremely inefficient. This is typically the major bottleneck for highthroughput 2D FFTs (Figure 1(b)). We address this problem by designing a custom memory mapping scheme (Section 5).
2.3. Edge Artifacts
While calculating 2D DFTs, it is assumed that the image is periodic, which is usually not the case. The nonperiodic nature of the image leads to artifacts in the Fourier transform, usually known as edge artifacts or series termination errors. These artifacts appear as several crosses of highamplitude coefficients in the frequency domain (Figure 3(b)). Such edge artifacts can be passed to subsequent stages of processing, and in biomedical applications they may lead to critical misinterpretations of results. No current 2D FFT FPGA implementation addresses this problem directly. These artifacts may be removed during preprocessing, using mirroring, windowing, zero padding, or postprocessing, e.g., filtering techniques. These techniques are usually computationally intensive, involve an increase in image size, and also tend to modify the transform.
(a)
(b)
(c)
(d)
(e)
(f)
The most common approach is by ramping the image at corner pixels to slowly attenuate the edges. Ramping is usually accomplished by an apodization function such as a Tukey (tapered cosine) or a Hamming window, which smoothly reduces the intensity to zero. Such an approach can be implemented on an FPGA as a preprocessing operation by storing the window function in a lookup table (LUT) and multiplying it with the image stream before calculating the FFT [16]. Although this approach is not extremely computationally intensive for small images, it inadvertently removes necessary information from the image. Loss of this information may have serious consequences if the image is being further processed with several other images to reconstruct a final image that is used for diagnostics or other decisioncritical applications. Another common method is by mirroring the image from to . Doing so makes the image periodic and reduces edge artifacts. However, this not only increases the size of the image by , but also makes the transform symmetric, which generates an inaccurate phase component.
Simultaneously removing the edge artifacts while calculating a 2D FFT imposes an additional design challenge, regardless of the method used. However, these artifacts must be removed in applications where they may be propagated to subsequent processing levels. An ideal method for removing these artifacts should involve making the image periodic while removing minimal information from the image. Periodic plus smooth decomposition (PSD), first presented by Moisan [4] and used in [23–25], is an ideal method for removing edge artifacts (specifically for biomedical applications) because it does not directly intervene with pixels beside those of the boundary and does not increase image size. Moreover, its inherently parallel nature makes it ideal for a highthroughput, FPGAbased implementation. We have further optimized the original PSD decomposition algorithm to make the overall implementation much more efficient, by decreasing the number of required 1D FFT invocations and by reducing external DRAM utilization (Section 4).
2.4. LabVIEW FPGA HighLevel Design Environment
A major concern while designing complex image processing hardware accelerators is how to fully harness on the divideandconquer approach. Algorithms that have to be mapped to multiple FPGAs are often marred by communication problems, and custom FPGA boards reduce flexibility for largescale and evolving designs. For rapid prototyping of our algorithms, we used LabVIEW FPGA 2016 (National Instruments), a robust dataflowbased graphical design environment. LabVIEW FPGA provides integration with National Instruments (NI) Xilinxbased reconfigurable hardware, allowing efficient communication with a host PC and highthroughput communication between multiple FPGAs through a PXIe (PCI eXtensions for Industry Express) bus. LabVIEW FPGA also enables us to integrate external Hardware Description Language (HDL) code and gives us the flexibility to expand our design for future processing stages. We used NI PXIe7976R FPGA boards that have a Xilinx Kintex 7 FPGA and 2GB highbandwidth (10GB/s) external memory. This platform has already been extensively used for rapid prototyping of communication standards and protocols before moving to ASIC designs. The optimizations and designs we present here are scalable to most reconfigurable computingbased systems. Moreover, LabVIEW FPGA provides efficient highlevel control over memory via a smart memory controller.
3. Periodic Plus Smooth Decomposition (PSD) for Edge Artifact Removal (EAR)
Periodic plus smooth decomposition (PSD) involves decomposing the image into a periodic and smooth component to remove edge artifacts with minimal loss of information from the image [4]. This section presents an overview of the PSD algorithm and profiles the algorithm for possible parallelization and optimization to achieve efficient FPGA implementation.
Let us have discrete by grayscale image on a finite domain . The discrete Fourier transform (DFT) of is defined asThis is equivalent to a matrix multiplication , whereand has the same structure as but is mdimensional. has period which means that ; therefore,
Since in general is not periodic, there will be highamplitude edge artifacts present in the DFT stemming from sharp discontinuities between the opposing edges of the image as shown in Figure 3(b). Reference [4] proposed a decomposition of into a periodic component , which is periodic and captures the essence of the image with all highfrequency details, and a smoothly varying background , which recreates the discontinuities at the borders, so . Periodic plus smooth decomposition can be computed by first constructing a border image , where represents the boundary discontinuities when transitioning rowwise and when going columnwise.It is obvious that the structure of the border image is simple with nonzero values only in the edges as shown below:The DFT of the smooth component can be then found by the following formula:
The DFT of the image with edge artifacts removed is then . Figures 3(c) and 3(d) show the DFT of the smooth and periodic components, respectively. Figures 3(e) and 3(f) show the reconstructed periodic and smooth components. On reconstruction, it is evident that there is negligible visual difference between the actual image and the periodic reconstructed image for this example.
3.1. Profiling PSD for FPGA Implementation
Algorithm 1 summarizes the overall PSD implementation. There are several ways of arranging the algorithm. We have arranged it so that DFTs of the periodic and smooth components are readily available for further processing stages. For best results, both the periodic and smooth components should undergo similar processing stages and should be added back together before displaying the result. However, depending on the application it might be acceptable to discard the periodic component completely. For an × image, steps A and C have a complexity of and steps B and D have complexity and , respectively. Computationally the performance of PSD is limited by steps A and C. Figure 4 shows a proposed toplevel architecture where step A and steps B, C, and D are completed on separate FPGAs while step E can be done on the host PC. Two high end FPGAs are used instead of one because resources on one FPGA are insufficient to compute a large size 2D FFT as well its edge artifact removal components. The overall performance may be limited by FPGA 2 where most of the serial part of the algorithm lies. There are two major factors which limit the throughput of such a design:(1)While FPGA 1 and FPGA 2 can run in parallel, the result of step A from FPGA 1 has to be stored on the host PC while steps B, C, and D are completed on FPGA 2 before step E can be completed on the host PC.(2)The DRAM intermediate storage problem explained in Section 2.2 and Figures 1 and 2 has to be addressed since strided access to the DRAM for columnwise operations can significantly limit throughput.

As for , it has been addressed in the next section where we make use of the inherent symmetry of the boundary image to reduce the time required to compute the 2D FFT of the boundary image. As for , it has been addressed by designing a semicustom memory mapping controller which "tiles" the DRAM floor and "hops" between several tiles so as to minimize strided memory access.
4. Proposed: Optimized Periodic Plus Smooth Decomposition (OPSD)
In this section, we optimize the original PSD algorithm. This optimization is to effectively reduce the number of 1D FFT invocations and the number of times the DRAM is accessed.
Equation (6) shows that, except the corners, the boundary image is symmetrical in the sense that boundary rows and columns are algebraic negation of each other. In total has unique elements, with the following relations between corners with respect to columns and rows:
In computing the FFT of , one normally proceeds by first running 1D FFTs columnbycolumn and then 1D FFTs rowbyrow or vice versa. An FFT of a column vector with length is , where is given in (4). The columnwise FFT of the matrix is thenLet us have a closer look at the first column, denoted by . The 1D FFT of this vector is
It can be shown that the 1D FFT of the column isand the 1D FFT of the last column isTherefore, the columnwise FFT of the matrix is
To compute the columnbycolumn 1D FFT of the matrix, , we only have to compute the FFT of the first vector and then use the appropriately scaled vector, , to derive the remainder of the columns. The rowbyrow FFT has to be calculated in a row burst normal way. Algorithm 2 presents a summary of the shortcut for calculating . The steps presented in Algorithm 2 can replace step C in Algorithm 1. By reducing columnbycolumn 1D FFT computations for the boundary image, this method can significantly reduce the number of 1D FFT invocations, reduce the overall DRAM access, and eliminate problematic columnwise strided DRAM access for an efficient FPGAbased implementation. For columnwise operations, a single 1D FFT of size is required rather than 1D FFTs of size . Moreover, since one has to simply store one column of data, it can be stored on the onchip local memory (BRAM or SRAM). This can be implemented by temporarily storing the initial vector and scaling factors in the block RAM/register memory, drastically reducing DRAM access, and lowering the number of required 1D FFT invocations. Performance evaluation for this has been presented in the results section.

Table 1 shows a comparison of mirroring, PSD, and our proposed OPSD with respect to DRAM access points. Mirroring has been used for comparison purposes because it is an alternative technique that reduces edge artifacts while maintaining maximum amplitude information. However, due to replication of the image, most of the phase information is lost. Figure 5 graphically shows that our OPSD method significantly reduces reading from external memory and reduces the overall number of DFT computations required. It should be noted that such optimization is only possible for either columnwise or rowwise operations because after either of these operations the output is not symmetrical anymore. Completing the columnwise operation first prevents strided reading; however, this results in strided writing to the DRAM before rowwise traversal can start. This can be minimized by making efficient use of the local block RAM. Output columns are stored in the block RAM before being written to the DRAM in patches such that each row buffer access writes elements in several columns.

5. Proposed: TileHopping Memory Mapping for 2D FFTs
In this section we propose a tilehopping external memory access pattern for efficiently addressing external memory during intermediate storage between rowwise and columnwise 1D FFT operations to calculate a 2D FFT. As explained in Section 2.2 and Figure 2, columnwise reads from DRAM can be costly due to the overhead associated with activating and precharging. In the worst case scenario it can limit DRAM bandwidth up to 80% [26]. This is a problem with all such image processing operations where one stage of the processing has to be completed on all elements before the next stage can start. In the past there have been several implementations using local memory; however, with growing demand for larger image sizes external memory has to be used. There have been several DRAM remapping attempts before, such as [5, 13]. They propose a tilebased approach where an image (input array) is divided into tiles where is the size of the DRAM row buffer which allows for very highbandwidth DRAM access. Although, this method may be ideal to maximize the DRAM performance for 2D FFTs, it incurs a high resource cost associated with local memory transposition and storing large chunks of data (entire row/column of tiles) in the local memory. Moreover, tiling in the image domain also requires remapping rowbyrow operations. Another approach to reducing strided DRAM access has been presented in [27]. They present a 2D decomposition algorithm which decomposes the problem into smaller subblock 2D FFTs which can be performed locally. This introduces extra row and column data exchanges, and total number of operations are increased from to . Other implementations do not address the external memory issue in detail.
We propose tilehopping address mapping which reduces the number of row activations required to access a single column. Unlike [5], our approach does not require significant local operations or storage. The proposed memory mapping controller was designed on top of LabVIEW FPGA’s existing memory controller which efficiently controls interleaving and issues activation and precharge commands in parallel with data transfer. The reduced number of row activations also reduces the amount of energy required by the DRAM. This will be further discussed in the energy evaluation presented in Section 6.4 and Table 3 where we demonstrate that the proposed tilehopping address mapping can reduce energy consumption up to 53%.
Instead of writing the results of the rowbyrow 1D FFT in rowmajor order we remap the results in a blocked or tiled pattern as shown in Figure 6. This means that when accessing an image column, several elements of that column can be retrieved from a single DRAM row access. For an image, each row of size can be divided into tiles (i.e., , where is the number of elements in each tile). These tiles can be remapped onto the DRAM floor as shown in Figure 6. If the size of the row is small enough, it may be possible to convert it into a single tile (i.e., ). However, this is unlikely for realistic image sizes. For a tile of size , a single row of the image is written into the DRAM by transitioning through rows. If is the size of the row buffer, there are distinct tiles represented in each DRAM row and it contains the same number of elements from a single image column. Given regular rowmajor storage when accessing columnwise elements, one would have to transition through DRAM rows to read a single image column. However, with this approach, when accessing an image column, elements of that column could be read from a single DRAM row which has been referred to in the row buffer. Although the cost of writing an image row is higher when compared to a standard rowmajor DRAM writing pattern, (i.e., referring to rather than rows), the number of DRAM row referrals during columnwise read is reduced to which is less row referrals for a single column read.
(a) Dataset view
(b) Memory view
(c) Memory view
We refer to this method as tilehopping because it entails mapping data onto several DRAM tiles and then hopping between the tiles such that several elements of the image column exist in a DRAM row which has been referred to in the row bank. Although this mapping scheme has been developed for columnwise access required during 2D FFT calculation, the scheme is general and can be adapted to other applications. Performance evaluation of this method has been presented in the experiments section.
6. Experimental Results and Analysis
6.1. Hardware Configuration and Target Selection
Since 2D DFTs are usually used for simplifying convolution operations in complex image processing and machine vision systems, we needed to prototype our design on a system that is expandable for next levels of processing. As mentioned earlier, for rapid prototyping of our proposed OPSD algorithm and tilehopping memory mapping scheme, we used a PXIebased reconfigurable system. PXIe is an industrial extension of a PCI system with an enhanced bus structure that gives each connected device dedicated access to the bus with a maximum throughput of 24GB/s. This allows a highspeed dedicated link between a host PC and several FPGAs. The LabVIEW FPGA graphical design environment is efficient for rapid prototyping of complicated signal and image processing systems. It allows us to effectively integrate external HDL code and LabVIEW graphical design on a single platform. Moreover, it allows a combination of highlevel synthesis (HLS) and custom logic. Since current HLS tools have limitations when it comes to complex image and signal processing tasks, LabVIEW FPGA tries to bridge these gaps by streamlining the design process.
We used FlexRIO (Flexible Reconfigurable I/O) FPGA boards plugged into a PXIe chassis. PXIe FlexRIO FPGA boards are adaptable and can be used to achieve high throughput, because they allow direct data transfer between multiple FPGAs at rates as high as 8GB/s. This can significantly simplify multiFPGA systems, which usually communicate via a host PC. This feature allows expansion of our system to further processing stages, making it flexible for a variety of applications. Figure 7 shows a basic overview of a PXIebased, multiFPGA system with a host PC controller connected through a highspeed bus on a PXIe chassis.
Specifically, we used two NI PXIe7976R FlexRIO boards which have Kintex 7 FPGA and 2GB external DRAM with theoretical data bandwidth up to 10GB/s. This FPGA board was plugged into a PXIe1085 chassis along with a PXIe8880 Intel Xeon PC controller. PXIe1085 can hold up to 16 FPGAs and has 8 GB/s perslot dedicated bandwidth and an overall system bandwidth of 24 GB/s.
6.2. Experimental Setup
As per Algorithms 1 and 2, discussed in previous sections, implementation involves five stages: (A) calculating the 2D FFT of an image frame, (B) calculating the boundary image, (C) calculating the 2D FFT of the boundary image, (D) calculating the smooth component, and (E) subtracting the smooth component from the 2D FFT of the original image to achieve the periodic component. The bottleneck consistently occurs in A and the serial part of the algorithm (A B C). The limitation due to A is reduced removing the socalled “memory wall” by using our proposed tilehoppingbased memory mapping. The limitations due to the serial part of the algorithm are reduced by using OPSD rather than PSD. For quantification, the delay for A is 0.62ms for a 512 × 512 image.
The design flow presented in Figure 4 was followed. Dataflow is clearly shown in a graphical programming environment making it easier to visualize how a design efficiently fits on an FPGA. Highly efficient implementations of 1D FFT were used from LabVIEW FPGA for parallel rowbyrow operations and by integrating Xilinx LogiCORE for columnbycolumn operations. Each stage of the design was dynamically tested and benchmarked. The image was streamed from the host PC using a Direct Memory Access (DMA) FIFO. 1D FFTs are performed in parallel rows of 8 and stored in the DRAM via local memory in a tiled pattern as explained in the previous section. This follows reading several rows to extract a single column which is Fouriertransformed using Xilinx LogiCORE and is sent back to the host PC. If the image is being streamed directly from an imaging device which scans and provides random or nonlinear sequence of rows, it is necessary to store a frame of the image in a buffer. This can also be accomplished by streaming the image flow from the host PC or using a smart camera which can delay image delivery by a single frame. Local memory shown in Figure 9 is used to buffer data between external memory and 1D FFT cores. This memory is divided into read and write components and is implemented using FPGA slices. Block RAM (BRAM) is used for temporary storage of vectors required for calculating the 2D FFT of the boundary image (in the case of FPGA 2). The Control Unit (CU) organizes scheduling for transferring data between local and external memory. CU is based on LabVIEW’s existing memory controller and our memory mapping scheme presented in Section 5.
Step B was accomplished using standard LabVIEW FPGA HLS tools for programming (5) using the graphical programming environment. In step C the 2D FFT of the boundary image needs to be calculated by row and column decomposition. However, as shown mathematically in the previous section, the initial rowwise FFTs can be calculated by computing the 1D FFT of the first (boundary) vector and the FFTs of reaming vectors can be computed by appropriate scaling of this vector.
We need the boundary column vector for 1D FFT calculation of the first and last columns. We also need the boundary row vector for appropriate scaling of for the 1D FFT of every column between the first and last columns. Row and column vectors of the boundary image are stored in block RAM (BRAM). Figure 8 shows a functional block diagram of the overall 2D FFT with optimized PSD process. Steps D and E are performed on the host PC to minimize memory clashes and to access the periodic and smooth components of each frame as they become available. 86% of resources are used on FPGA 1 and 41% resources are used on FPGA 2. The resource utilization is reported according to LabVIEW FPGA synthesis and compilation experiments. It should be noted that part of the reason for high resource utilization is because of using LabVIEW FPGA highlevel synthesis tools as well as Xilinx LogiCORE tools. Using standardized tools makes it harder to optimize for resource utilization. The current implementation was optimized for performance in terms of run time and energy consumption.
6.3. Performance Evaluation
The overall performance of the system was evaluated using the setup presented in Figure 9. The data was streamed from the host PC; in certain cases high frame rate videos as well as direct camera input were streamed from the host. All results presented are for 16bit fixedpoint precision. Figure 10(a) presents the effectiveness of our proposed tilehopping memory mapping scheme. It clearly shows the effectiveness of our proposed memory mapping since it is closer to the theoretical peak performance. Figure 10(b) presents the overall results comparing PSD and OPSD and demonstrating the effectiveness of our proposed optimization. PSD was also implemented on the same platform but the optimization presented in Section 4 was not used. This rendered the serial portion of the algorithm to be the bottleneck which reduced overall performance. Table 2 shows a comparison of our implementation in contrast to recent 2D FFT FPGA implementation approaches and shows that we achieve a better performance even with simultaneous edge artifact removal. Although, our implementation is tested with 16bit fixedpoint precision which limits the accuracy of the transform, the precision may be sufficient for a variety of speed critical applications where alternative edge artifact removal methods (e.g., filtering) may decrease overall system performance. Although, the dynamic range of fixedpoint data is smaller than floatingpoint data, which can lead to errors in the 2D FFT; for certain applications it is more important to have an artifactfree transform as compared to a highly accurate transform.
 
Optimized periodic + smooth decomposition (OPSD) Simultaneous edge artifact removal Runtime (ms). 
 
Energy per read (EPR). Columnwise memory access (CW). 
(a) 2D FFT performance: DRAM tilehopping
(b) 2D FFT with EAR performance: PSD versus OPSD
6.4. Energy Evaluation
The overall energy consumption of the custom computing system depends on power performance of the system components and throughput disparity. Throughput disparity results in idle time for at least one of the components and lowers the overall system throughput. A throughputoptimized system minimizes instances where certain components of the architecture are idle. The proposed optimizations in Sections 4 and 5 clearly reduce throughput disparity and minimize the idle time of the system. Thus, besides causing delays due to significant overhead, standard columnwise DRAM access also contributes to the overall energy consumption. This is not only due to high count of DRAM row charges but also because of energy consumed by the FPGA in idle state. Ideally, maximizing the DRAM bandwidth limits the amount of energy consumption. Proposed “tilehopping” memory mapping scheme improves the DRAM bandwidth as seen in Figure 10 and hence reduces the overall energy consumption. The same is true for the proposed OPSD method where reduced DRAM access and 1D FFT invocations lead to reduced energy consumption. In this section we analyze the amount of improvement in energy consumption based on the proposed optimizations.
We estimate the DRAM power consumption for both the baseline (standard, strided) and the optimized (“tilehopping”) memory access using MICRON DRAM power calculator. The energy is calculated in for each read, i.e., energy per read. This is accomplished by calculating the run time for a specific 2D FFT and estimating the amount of energy consumed by the DRAM power calculator. Table 4 depicts the DRAM energy consumption for 2D FFTs before and after the proposed “tilehopping” optimization for columnwise DRAM access. As mentioned earlier, rowwise DRAM access is fast and row buffer size data can be accessed by a single row activation. According to Table 4 the energy required for DRAM access is reduced by 42.7%, 48.8%, and 52.9% for , , and size 2D FFTs, respectively.
 
Energy per point (EPP). 
The metric used to compare the overall energy optimization achieved for 2D FFTs with EAR is energy per point, i.e., the amount of average energy required to compute the 2D FFT of a single point in an image with simultaneous edge artifact removal. This was achieved by calculating the energy consumed by Xilinx LogiCORE IP for 1D FFTs, the DRAM, and the edge artifact removal part separately. The estimated energy calculated does not include energy consumed by the PXIe chassis and the host PC. Essentially, the FPGAbased architecture presented here could be used without the host controller. The energy consumption incorporates dynamic as well as static power. The overall energy consumption per point is reduced by 56.9%, 58.6%, and 62% for calculating , , and size 2D FFTs with EAR, respectively.
7. Application: Filtered BackProjection for Tomography
In order to further demonstrate the effectiveness of our implementation, we use the created 2D FFT module as an accelerator for reducing the run time for filtered backprojection (FBP). FBP is a fundamental analytical tomographic image reconstruction method. In depth details regarding the basic FBP algorithm have been left out for brevity, but can be found in [28, 29]. The method can be used to reconstruct primitive 3D tomograms from 2D data, which can then be used as a basis for more complex regularizationbased methods such as [30–32]. The algorithmic flow is based on the Fourier slice theorem; i.e., 2D Fourier transforms of projections are an angular component of the 3D Fourier transform of the 3D reconstructed volume. Our 2D FFT accelerator was used to calculate the 2D FFTs of the projections as well as for initial stages of the 3D FFT which was then completed on the host PC. Similar to the 2D FFT, the 3D FFT is separable and can be divided into 2D FFTs and 1D FFTs. The results have been shown in Table 5. It can be seen that the improvement for smaller size densities is not significant because their FFTs are quite fast on generalpurpose CPUs. However, for larger densities the FFT accelerator can give a significant improvement. If the remaining components are also implemented on an FPGA, significant speed increase can be achieved. Results of a thin slice from a 3D simulated SheppLogan [33] phantom have been shown in Figure 11. It can be seen that the results from the hardware accelerated FBP are of slightly lower quality. This is due to the fact that our 2D FFT implementation is less accurate (16 bit, fixedpoint) as compared to the CPUbased implementation (FFTW, doubleprecision floating). The accelerated FBP was also tested with real Electron Tomography (ET) data.

8. Conclusion
2D FFTs often become a major bottleneck for highperformance imaging and vision systems. The inherent computational complexity of the 2D FFT kernel is further enhanced if effective removal (using PSD) of spurious artifacts introduced by the nonperiodic nature of reallife images is taken into account. We developed and implemented an FPGAbased design for calculating highthroughput 2D DFTs with simultaneous edge artifact removal. Our approach is based on a PSD algorithm that splits the frequency domain of a 2D image into a smooth component which contains the highfrequency, crossshaped artifacts and can be subtracted from the 2D DFT of the original image to obtain a periodic component that is artifactfree. Since this approach calculates two 2D DFTs simultaneously, external memory addressing and repeated 1D FFT invocations become problematic. To solve this problem we optimized the original PSD algorithm to reduce the number of DFT samples to be computed and DRAM access. Moreover, to reduce strided access from the DRAM during columnwise reads we presented and analyzed “tilehopping”, a memory mapping scheme which reduces the number of DRAM row activations when reading a single column of data. This memory mapping scheme is general and may be used for a variety of other applications. We demonstrate that “tilehopping” memory mapping can reduce the DRAM energy consumption by 52.9%. Moreover, we show that the proposed optimizations lead to less energy consumption for the overall 2D FFT with EAR architecture.
Our methods were tested using extensive synthesis and benchmarking using a Xilinx Kintex 7 FPGA communicating with a host PC on a highspeed PXIe bus. Our system is expandable to support several FPGAs and can be adapted to various largescale computer vision and biomedical applications. Despite decomposing the image into periodic and smooth frequency components, our design requires less run time, compared to traditional FPGAbased 2D DFT implementation approaches and can be used for a variety of highly demanding applications. One such application, filtered backprojection, was accelerated using the proposed implementation to achieve better results specifically for larger size raw tomographic data.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by Japanese Government OIST Subsidy for Operations (Ulf Skoglund) Grant no. 5020S7010020. Faisal Mahmood and Märt Toots were additionally supported by the OIST Ph.D. Fellowship. The authors would like to thank National Instruments Research for their technical support during the design process. The authors would also like to thank Dr. Steven D. Aird for language assistance and Shizuka Kuda for logistical arrangements.
Supplementary Materials
File: 2D FFTEARDemo.mp4: video demonstrating the FPGA output of 2D FFT with edge artifact removal. (Supplementary Materials)
References
 R. N. Bracewell, The fourier transform and iis applications, vol. 5, New York, NY, USA, 1965.
 Theory and application of digital signal processing, vol. 1, PrenticeHall, Inc, Englewood Cliffs, NJ, USA, 1975. View at: Publisher Site
 R. A. Brooks and G. Di Chiro, “Theory of image reconstruction in computed tomography,” Radiology, vol. 117, no. 3 I, pp. 561–572, 1975. View at: Publisher Site  Google Scholar
 L. Moisan, “Periodic plus smooth image decomposition,” Journal of Mathematical Imaging and Vision, vol. 39, no. 2, pp. 161–179, 2011. View at: Publisher Site  Google Scholar  MathSciNet
 B. Akin, P. A. Milder, F. Franchetti, and J. C. Hoe, “Memory bandwidth efficient twodimensional fast Fourier transform algorithm and implementation for large problem sizes,” in Proceedings of the 20th IEEE International Symposium on FieldProgrammable Custom Computing Machines, FCCM 2012, pp. 188–191, Canada, May 2012. View at: Google Scholar
 J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965. View at: Publisher Site  Google Scholar  MathSciNet
 H. Kee, N. Petersen, J. Kornerup, and S. S. Bhattacharyya, “Systematic generation of FPGAbased FFT implementations,” in Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 1413–1416, USA, April 2008. View at: Google Scholar
 F. Mahmood, M. Toots, L.G. Ofverstedt, and U. Skoglund, “2D Discrete Fourier Transform with simultaneous edge artifact removal for realTime applications,” in Proceedings of the International Conference on Field Programmable Technology, FPT 2015, pp. 236–239, New Zealand, December 2015. View at: Google Scholar
 M. Frigo and S. G. Johnson, “FFTW: an adaptive software architecture for the FFT,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, pp. 1381–1384, IEEE, May 1998. View at: Google Scholar
 M. Püschel, J. M. F. Moura, J. R. Johnson et al., “SPIRAL: code generation for DSP transforms,” Proceedings of the IEEE, vol. 93, no. 2, pp. 232–273, 2005. View at: Publisher Site  Google Scholar
 E. Wang, Q. Zhang, B. Shen et al., HighPerformance Computing on the Intel Xeon Phi, Springer International Publishing, 2014. View at: Publisher Site
 B. Akin, F. Franchetti, and J. C. Hoe, “FFTS with nearoptimal memory access through block data layouts,” in Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2014, pp. 3898–3902, Italy, May 2014. View at: Google Scholar
 B. Akin, F. Franchetti, and J. C. Hoe, “Understanding the design space of DRAMoptimized hardware FFT accelerators,” in Proceedings of the 25th IEEE International Conference on ApplicationSpecific Systems, Architectures and Processors, ASAP 2014, pp. 248–255, Switzerland, June 2014. View at: Google Scholar
 C.L. Yu, K. Irick, C. Chakrabarti, and V. Narayanan, “Multidimensional {DFT} {IP} generator for {FPGA} platforms,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 4, pp. 755–764, 2011. View at: Publisher Site  Google Scholar  MathSciNet
 D. He and Q. Sun, “A practical printscan resilient watermarking scheme,” in Proceedings of the IEEE International Conference on Image Processing 2005, ICIP 2005, pp. 257–260, Italy, September 2005. View at: Google Scholar
 D. G. Bailey, Design for embedded image processing on FPGAs, John Wiley Sons, 2011.
 B. G. Batchelor, “Implementing Machine Vision Systems Using FPGAs,” in Machine Vision Handbook, 1136, 1103 pages, Springer, London, UK, 2012. View at: Publisher Site  Google Scholar
 T. Lenart, M. Gustafsson, and V. Öwall, “A hardware acceleration platform for digital holographic imaging,” Journal of Signal Processing Systems, vol. 52, no. 3, pp. 297–311, 2008. View at: Publisher Site  Google Scholar
 G. H. Loh, “3DStacked Memory Architectures for Multicore Processors,” ACM SIGARCH Computer Architecture News, vol. 36, no. 3, pp. 453–464, 2008. View at: Publisher Site  Google Scholar
 Z. Qiuling, B. Akin, H. E. Sumbul et al., “A 3Dstacked logicinmemory accelerator for applicationspecific data intensive computing,” in Proceedings of the IEEE International 3D Systems Integration Conference, pp. 1–7, October 2013. View at: Publisher Site  Google Scholar
 I. S. Uzun, A. Amira, and A. Bouridane, “FPGA implementations of fast Fourier transforms for realtime signal and image processing,” pp. 283–296. View at: Google Scholar
 T. Dillon, “Two VirtexII FPGAs deliver fastest, cheapest, best highperformance image processing system,” '' Xilinx Xcell Journal, vol. 41, pp. 70–73, 2001. View at: Google Scholar
 A. Hast, “Robust and invariant phase based local feature matching,” in Proceedings of the 22nd International Conference on Pattern Recognition, ICPR 2014, pp. 809–814, Sweden, August 2014. View at: Google Scholar
 B. Galerne, Y. Gousseau, and J.M. Morel, “Random phase textures: theory and synthesis,” IEEE Transactions on Image Processing, vol. 20, no. 1, pp. 257–267, 2011. View at: Publisher Site  Google Scholar  MathSciNet
 R. Hovden, Y. Jiang, H. L. Xin, and L. F. Kourkoutis, “Periodic Artifact Reduction in Fourier Transforms of Full Field Atomic Resolution Images,” Microscopy and Microanalysis, vol. 21, no. 2, pp. 436–441, 2014. View at: Publisher Site  Google Scholar
 D. G. Bailey, “The advantages and limitations of high level synthesis for FPGA based image processing,” in Proceedings of the the 9th International Conference, pp. 134–139, Seville, Spain, September 2015. View at: Publisher Site  Google Scholar
 W. Wang, B. Duan, C. Zhang, P. Zhang, and N. Sun, “Accelerating 2D FFT with nonpoweroftwo problem size on FPGA,” in Proceedings of the 2010 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2010, pp. 208–213, Mexico, December 2010. View at: Google Scholar
 F. Natterer, The Mathematics of Computerized Tomography, John Wiley & Sons, 1986. View at: MathSciNet
 R. A. Crowther, D. J. DeRosier, and A. Klug, “The Reconstruction of a ThreeDimensional Structure from Projections and its Application to Electron Microscopy,” Proceedings of the Royal Society A Mathematical, Physical and Engineering Sciences, vol. 317, no. 1530, pp. 319–340, 1970. View at: Publisher Site  Google Scholar
 U. Skoglund, L.G. Öfverstedt, R. M. Burnett, and G. Bricogne, “Maximumentropy threedimensional reconstruction with deconvolution of the contrast transfer function: A test application with adenovirus,” Journal of Structural Biology, vol. 117, no. 3, pp. 173–188, 1996. View at: Publisher Site  Google Scholar
 F. Mahmood, N. Shahid, P. Vandergheynst, and U. Skoglund, “Graphbased sinogram denoising for tomographic reconstructions,” in Proceedings of the 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2016, pp. 3961–3964, USA, August 2016. View at: Publisher Site  Google Scholar
 F. Mahmood, N. Shahid, U. Skoglund, and P. Vandergheynst, “Adaptive GraphBased Total Variation for Tomographic Reconstructions,” IEEE Signal Processing Letters, vol. 25, no. 5, pp. 700–704, 2018. View at: Publisher Site  Google Scholar
 L. A. Shepp and B. F. Logan Jr., “The Fourier reconstruction of a head section,” IEEE Transactions on Nuclear Science, vol. 21, no. 3, pp. 21–43, 1974. View at: Google Scholar
Copyright
Copyright © 2018 Faisal Mahmood et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.