VLSI Design

Volume 2014, Article ID 651943, 24 pages

http://dx.doi.org/10.1155/2014/651943

## A Self-Reconfigurable Platform for the Implementation of 2D Filterbanks with Real and Complex-Valued Inputs, Outputs, and Filter Coefficients

Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131, USA

Received 18 October 2013; Revised 18 March 2014; Accepted 19 March 2014; Published 4 May 2014

Academic Editor: Wieslaw Kuzmicz

Copyright © 2014 Daniel Llamocca and Marios Pattichis. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We introduce a dynamically reconfigurable 2D filterbank that supports both real and complex-valued inputs, outputs, and filter coefficients. This general purpose filterbank allows for the efficient implementation of 2D filterbanks based on separable 2D FIR filters that support all possible combinations of input and output signals. The system relies on the use of dynamic reconfiguration of real/complex one-dimensional filters to minimize the required hardware resources. The system is demonstrated using an equiripple and a Gabor filterbank, with results for both real and complex-valued input images. We summarize the performance of the system in terms of the required processing times, energy, and accuracy.

#### 1. Introduction

Filterbank implementations in hardware require significant resources. Each filter in the filterbank is often implemented using multiply-and-add circuitry (modifiable coefficients) or fixed-coefficient circuitry. In what follows, we will use the term *static implementations* to describe filterbank implementations that are based on fixed hardware. A clear limitation of static implementations comes from the fact that hardware resources cannot be adjusted based on the number of coefficients or the number of filters. Furthermore, energy consumption is often a function of the implemented static hardware regardless of whether the hardware resources are actually used.

Dynamic partial reconfiguration (DPR) technology is becoming increasingly popular for addressing the aforementioned problems by time-multiplexing FPGA resources [1]. For filterbanks, DPR lets us implement only one filter at a time, which can potentially result in significant power savings over static implementation approaches.

In this paper, we introduce a self-reconfigurable implementation for a 2D complex filterbank that only requires one 1D filter to be present on the FPGA at a time. We study all the possible cases and report results in terms of accuracy, energy, and performance. Applications of complex filterbanks can be found in [2, 3].

For efficient hardware realizations, we focus on 2D FIR separable filtering that allows implementations by means of two 1D FIR filters. This separability property allows us to consider a DPR approach that keeps only one filter (row or column) at a time.

We presented some related earlier work in [4–7]. In [4], we described an efficient 1D FIR filtering system that combined the distributed arithmetic (DA) technique with DPR. An early 2D real filterbank implementation was presented in [5]. We presented a dynamic real 2D FIR filter and a dynamic complex 2D FIR filter in [6, 7], respectively. Extending this prior work, the main contributions of the current work include (i) a novel implementation for 2D separable complex filterbanks, (ii) a platform to generate 2D filter coefficients for the filterbanks, (iii) a thorough description of all the possible filterbank instances: real/complex input images, real/complex coefficients, (iv) a characterization of the filterbank hardware implementation in terms of energy, performance, accuracy, and memory requirements, and (v) a comprehensive comparison with a static implementation.

Since our 2D filterbank is implemented via dynamic reconfiguration of 1D FIR filters, our approach depends on the use of an efficient DPR controller. Refer to [8–10] for research related to efficient DPR controller development.

The rest of the paper is organized as follows. Section 2 presents background and related work. Section 3 details the dynamic 2D complex filterbank implementation. Section 4 describes how energy, performance, and accuracy are measured. Section 5 presents the experimental setup (specific filterbanks employed). Section 6 presents the results and their analysis, followed by the conclusions and further work.

#### 2. Background and Related Work

Most image processing systems rely on static architectures that cannot be modified at run-time. There have been, however, a number of works that make use of DPR in their image processing systems: a design that dynamically reconfigures discrete cosine transform (DCT) modules of various sizes [9], a dynamic systolic array accelerator for Kalman and wavelet filters [11], a fingerprint image processing hardware whose stages (e.g., segmentation, normalization, and smoothing) are time-multiplexed via DPR [12], and a 3D Haar wavelet transform (HWT) implemented by dynamically reconfiguring a 1D HWT core thrice [13]. The work in [14] presents a pixel processor that reconfigures input/output widths, the number of pixels processed in parallel, and the single-pixel operation.

Our specific focus is on filter/filterbank implementations. We start by listing the static implementations. Multiply-and-add approaches for 2D separable filters are summarized in [15]. The work in [16] presented a design methodology that decomposes a 2D filter into 2D separable and nonseparable filters, and efficiently allocates the heterogeneous resources (embedded multipliers, LUTs, FFs) on an FPGA. A filterbank design for 2D discrete wavelet transforms was presented in [17]. Gabor filterbank implementations appear in [18, 19]. The work in [19] details the implementation of a separable/nonseparable Gabor filterbank (using DSP slices) for modeling simple cells. The authors in [20] described an interesting static architecture for particle filters that, by tuning buffer parameters and interconnection switches, allowed run-time modification of the number of particles (trading off accuracy and performance) and the degree of parallelism (trading off power and performance).

As for filter/filterbank implementations using DPR, we note a 2D real filterbank based on the coefficient-only reconfiguration of a single 1D FIR filter [5], a system that switches between a mean and a median filter via DPR [22], and a 2D FIR filter that can dynamically reconfigure its coefficients [23]. In [6], we implemented a real input, real coefficients 2D FIR filter by dynamically reconfiguring between the row and column filter. Similarly, a complex input, complex coefficients 2D FIR filter was presented in [7]. This prior research was fundamentally limited in that it lacked the generality for implementing arbitrary filterbanks, it had not been applied to both images and videos, and it did not include a comparison with static architectures.

The current paper significantly extends prior work by including (i) a scalable architecture (fully parameterized VHDL core) for implementing a separable digital filter that considers arbitrary combinations of both real and complex inputs, outputs, and coefficients, (ii) a parameterized 2D filterbank implementation that uses separable filters to support multiple scales and directions (see examples in Section 5.1), (iii) architecture validation that includes full implementations of the ubiquitous Gabor filterbanks, using a number of scales and orientations, applied to both digital images and videos of different sizes (e.g., QCIF, VGA, 720p), and (iv) an extensive comparison to a static approach and a discussion that supports the use of the proposed approach for larger image sizes.

The reconfiguration time overhead is a limiting factor in the use of DPR. Techniques to reduce this overhead include improving the access speed to the configuration memory [8], reducing the size of the reconfigurable area [24], efficient run-time task scheduling [25], and compressing the bitstreams while they are moved through the slow parts of the system [9].

#### 3. Implementation of a Run-Time Reconfigurable 2D Filterbank for Video Processing Applications

Here, we describe the dynamic implementation of a 2D separable real/complex input, real/complex coefficients filterbank. We first describe the stand-alone real/complex input, real/complex coefficients 1D FIR filter IP. Then, we explain how to use DPR to implement a 2D filter with only one 1D filter at a time. Finally, we describe the 2D filterbank implementation.

##### 3.1. Distributed Arithmetic Stand-Alone 1D Real/Complex Input, Real/Complex Coefficients FIR Filter

This 1D real/complex FIR core, depicted in Figure 1(a), is implemented with distributed arithmetic and it is fully parameterized. This constant-coefficient filter supports symmetric, antisymmetric, and nonsymmetric coefficients. It has a parameter “*mode*,” where four cases are supported:

(i) real input, real coefficients (rIrH): this single-filter hardware was presented in [6, 21]. See Figure 1(a);

(ii) complex input, complex coefficients (cIcH): this hardware was presented as a core in [7]. See Figure 1(d);

(iii) real input, complex coefficients (rIcH): this hardware allows the implementation of phase-based methods that produce a complex-valued image from a real-input image [2]. See Figure 1(e);

(iv) complex input, real coefficients (cIrH): this is the most general case that supports the implementation of Gabor filterbanks that dominate image analysis applications (see numerous applications in [2]). See Figure 1(b).

Figure 1 depicts the 1D real/complex FIR core with its I/Os and parameters for four different cases of parameter “*mode*.” The parameters , , , and (shared by all the individual FIR DA cores) denote the number of coefficients, the input bitwidth, the coefficients’ bitwidth, and the LUT input size, respectively. Symmetry can be (nonsymmetric), *YES* (symmetric), and *AYES* (antisymmetric). The parameter controls the truncation/saturation scheme of the real and imaginary outputs, each with bits. The input () and output () bitwidths refer to the bitwidth of each part (real and imaginary).
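To make the distributed arithmetic (DA) idea concrete, the following is a minimal software sketch, not the VHDL core itself. It assumes integer coefficients, two's-complement inputs, and one lookup table per group of taps; the names `build_luts`, `da_dot`, `bits`, and `lut_inputs` are hypothetical and merely stand in for the input bitwidth and LUT input size parameters described above.

```python
def build_luts(coeffs, lut_inputs):
    """For each group of `lut_inputs` taps, precompute every possible sum of coefficients."""
    luts = []
    for start in range(0, len(coeffs), lut_inputs):
        group = coeffs[start:start + lut_inputs]
        # Entry `addr` holds the sum of the coefficients whose bit is set in `addr`.
        lut = [sum(c for k, c in enumerate(group) if (addr >> k) & 1)
               for addr in range(1 << len(group))]
        luts.append(lut)
    return luts


def da_dot(samples, coeffs, bits, lut_inputs=4):
    """Inner product of `bits`-wide two's-complement samples with integer coefficients,
    computed one bit plane at a time (no multipliers, only LUT reads, shifts, and adds)."""
    luts = build_luts(coeffs, lut_inputs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]  # two's-complement bit patterns
    acc = 0
    for b in range(bits):
        partial = 0
        for g, lut in enumerate(luts):
            group = words[g * lut_inputs:(g + 1) * lut_inputs]
            addr = sum(((w >> b) & 1) << k for k, w in enumerate(group))
            partial += lut[addr]
        # The sign bit has weight -2^(bits-1); every other bit has weight +2^b.
        acc += -(partial << b) if b == bits - 1 else (partial << b)
    return acc


samples = [-3, 7, 0, -128, 5]   # hypothetical 8-bit inputs
coeffs = [2, -1, 4, 3, -6]      # hypothetical integer coefficients
direct = sum(s * c for s, c in zip(samples, coeffs))
```

Here `da_dot(samples, coeffs, bits=8)` gives the same value as `direct`, which illustrates why a constant-coefficient filter can be built from LUTs and adders alone. A smaller `lut_inputs` trades table size for more adders.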

We let represent the number of output bits with fractional bits. The output format is then expressed as []. Then, we let the fixed-point input format be [] and the coefficients’ format [], with normalized inputs/coefficients to be in the interval of . Table 1 lists the required output format (for both real/imaginary outputs) for each “*mode.*” In cases where we need smaller output formats, overflow is avoided by performing LSB truncation and saturation (controlled by the parameter). Table 1 also lists the I/O latency (register levels between input and output) where (i) and for symmetric/antisymmetric filters, and (ii) and for nonsymmetric filters. The VHDL code of this IP core is available online at http://www.ivpcl.org/.
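The LSB truncation and saturation step can be sketched as follows. This is an illustrative model only: the function name and its arguments (`frac_in`, `bits_out`, `frac_out`) are hypothetical stand-ins for the fixed-point formats discussed above, and truncation is assumed to be an arithmetic right shift.

```python
def truncate_and_saturate(value, frac_in, bits_out, frac_out):
    """Convert a signed fixed-point integer with `frac_in` fractional bits
    to a `bits_out`-bit signed word with `frac_out` fractional bits."""
    shift = frac_in - frac_out
    if shift < 0:
        raise ValueError("output cannot have more fractional bits than the input")
    truncated = value >> shift            # drop LSBs (rounds toward minus infinity)
    lo = -(1 << (bits_out - 1))           # most negative representable word
    hi = (1 << (bits_out - 1)) - 1        # most positive representable word
    return max(lo, min(hi, truncated))    # saturate instead of wrapping around


in_range = truncate_and_saturate(37, frac_in=3, bits_out=8, frac_out=1)
clipped = truncate_and_saturate(1000, frac_in=4, bits_out=8, frac_out=2)
```

Saturation keeps an out-of-range result at the nearest representable value, which is usually less damaging to an image than the wrap-around that plain truncation of the MSBs would produce.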

##### 3.2. A Dynamically Reconfigurable System for the Real/Complex 1D FIR Filter

The constant-coefficients 1D real/complex FIR filter is turned into a flexible FIR filter via DPR. Here, we allow for the entire filter to be reconfigured, so that we can modify every parameter and thus explore many different realizations.

Figure 2(a) shows an embedded system that allows for full reconfiguration of the real/complex FIR filter parameters. The FIR core is placed in a partial reconfiguration region (PRR); the logic outside the PRR is called the static region. The embedded interface containing the 1D real/complex FIR core is called “real/complex FIR filter processor.” This filter processor contains the FIR core along with the glue logic for PLB interfacing.

The real/complex FIR filter processor and the processor interact via a 32-bit PLB bus. The bitstreams (filter realization data) and input data are initially stored in the compact flash (CF) card. At run-time the main memory holds input and output data and the bitstreams. To perform DPR, the bitstreams in memory are streamed to the ICAP. The filter processor supports burst transfers, hence the use of the Xilinx Central DMA core. The Ethernet link allows for data communication, throughput measurements, and system status reporting. Some interfacing is included in the PRR so as to allow for changes to our filter core I/O bitwidth during DPR.

The real/complex FIR filter processor is a hardware IP. It includes the parameters of the real/complex FIR filter IP along with two new parameters: input stream length () and filter output choice (*style*). As for the I/O bitwidth, the 32-bit PLB interface of this FIR filter processor only supports three cases: , , and , .

Figure 2(c) depicts the real/complex FIR filter processor in detail. We clock the logic inside the PRR at *clkfx*, while the rest is clocked at *PLB_clk*. The FIFOs (iFIFO, oFIFO) keep the *PLB_clk* and *clkfx* clock regions separated.

With as the input stream length, the -coefficient filter peripheral generates . The filter processor offers four filter output choices (*style*): (i) basic: first output samples, (ii) centered: samples in the range , (iii) full: all of the samples, and (iv) streaming: , infinite output samples.

Figure 3 depicts the input/output interface between the filter IP and the FIFOs. The hardware implementation of this interface depends on (i) the parameter *mode* and (ii) the I/O allowed cases (, , or , ).

As for dynamic frequency control, the dynamic reconfiguration port (DRP) of the multimode clock manager (MMCM) inside Virtex-6 FPGAs can adjust the frequency at run-time without loading a new bitstream. Figure 2(b) depicts the frequency and partial reconfiguration (PR) core. The parameter controls frequency at run-time. Also, the “*PR_done*” signal is asserted after a DPR process is completed, so that the PRR flip-flops are cleared [21]. The PRR outputs toggle during DPR, but this is harmless because the FIFO is reset as soon as “*PR_done*” is asserted.

This embedded interface lets us dynamically switch from a real/complex filter to another real/complex filter. We next explain how to implement a 2D real/complex separable FIR filter based on this feature.

##### 3.3. Dynamic Implementation of a 2D Real/Complex Separable FIR Filter

In the embedded system of Figure 2, a 2D real/complex separable FIR filter (represented by 2 bitstreams: row and column filter) is implemented by cyclically swapping the row filter with the column filter via DPR [6]. This approach only requires resources for one 1D FIR filter at a time and can yield significant savings over static implementations of 2D filters. In addition, the following needs to be considered.

(1) Usually, the output image size has to match the input image size. The real/complex FIR filter processor sets its parameter *style = centered* (center of the convolution output).

(2) The column filter and row filter differ not only in the coefficients, but also in other parameters (I/O format, number of coefficients, etc.), requiring a reconfiguration of the entire 1D filter.

(3) Unless the image is square, the length of the input signal in the row filtering case is different than in the column case. This means that we have to change to match the input size.

(4) Two reconfigurations are performed per frame. The row filter processes and stores the result in a sequential row-by-row fashion, but the column filter does so in a sequential column-by-column fashion. As a result, the row-filtered output images need to be transposed prior to column filtering.

(5) This cyclic swapping between the row and column filter usually implies a reconfiguration rate of two reconfigurations per frame. Besides input and output frames, we require memory to store the intermediate row-filtered frame.
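The row filter, transpose, column filter sequence can be modeled in software as below. This is a behavioral sketch under stated assumptions, not the hardware data path: the helper names are hypothetical, and the centered output is assumed to start at offset `(len(coeffs) - 1) // 2` of the full convolution.

```python
def fir_centered(stream, coeffs):
    """Full 1D convolution of one stream, cropped to the centered len(stream) samples."""
    n, m = len(stream), len(coeffs)
    full = [0] * (n + m - 1)
    for i, x in enumerate(stream):
        for k, h in enumerate(coeffs):
            full[i + k] += x * h
    start = (m - 1) // 2          # assumed centering offset
    return full[start:start + n]


def transpose(image):
    """Swap rows and columns so that columns can be streamed like rows."""
    return [list(col) for col in zip(*image)]


def separable_filter(image, row_coeffs, col_coeffs):
    """Apply the row filter, transpose, apply the column filter, transpose back."""
    row_filtered = [fir_centered(row, row_coeffs) for row in image]
    col_streams = transpose(row_filtered)
    col_filtered = [fir_centered(s, col_coeffs) for s in col_streams]
    return transpose(col_filtered)


# A unit impulse in the middle of a 5x5 frame reveals the equivalent 2D kernel.
impulse = [[0] * 5 for _ in range(5)]
impulse[2][2] = 1
result = separable_filter(impulse, row_coeffs=[1, 2, 1], col_coeffs=[1, 0, -1])
```

Because the same 1D routine serves both passes, only one filter needs to exist at any time, which is the property the DPR approach exploits. The routine also works unchanged with Python complex numbers for the complex cases.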

We now have to deal with 2D real/complex input, 2D real/complex coefficients, and a 2D real/complex output. Thus, we must establish what 1D row and column filters (*mode*) we should be using in each case.

We represent the complex input image by: . We denote the separable complex coefficients by , , . The complex output image is then given by . Figure 4 presents a conceptual implementation of the four cases we might encounter, each with a unique set of applications.

(i) *Real image, real 2D coefficients* (RIRH): in this simple case, the row filter and column filter are of type rIrH (real input, real coefficients).

(ii) *Complex image, real 2D coefficients* (CIRH): here the row and column filters are of type cIrH (complex input, real coefficients).

(iii) *Real image, complex 2D coefficients* (RICH): here the row filter must be of type rIcH (real input, complex coefficients), while the column filter must be of type cIcH (complex input, complex coefficients).

(iv) *Complex image, complex 2D coefficients* (CICH): here, both the row and column filter must be of type cIcH (complex input, complex coefficients).
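The choice of row and column filter *mode* for each 2D case follows one rule: the column filter sees a complex input as soon as either the image or the coefficients are complex. A small sketch of that rule (the function name `modes_for` and the lookup table are illustrative, not part of the IP core):

```python
# The four 2D cases listed above, as (row mode, column mode).
ROW_COL_MODES = {
    "RIRH": ("rIrH", "rIrH"),
    "CIRH": ("cIrH", "cIrH"),
    "RICH": ("rIcH", "cIcH"),
    "CICH": ("cIcH", "cIcH"),
}


def modes_for(image_is_complex, coeffs_are_complex):
    """Derive the 1D filter modes for the row pass and the column pass."""
    i = "c" if image_is_complex else "r"
    h = "c" if coeffs_are_complex else "r"
    row_mode = f"{i}I{h}H"
    # The row-filtered image is complex if the image or the coefficients are complex.
    col_in = "c" if (image_is_complex or coeffs_are_complex) else "r"
    col_mode = f"{col_in}I{h}H"
    return row_mode, col_mode


derived = {
    "RIRH": modes_for(False, False),
    "CIRH": modes_for(True, False),
    "RICH": modes_for(False, True),
    "CICH": modes_for(True, True),
}
```

The derived pairs match the table entry for entry, which shows that RICH is the only case where the row and column filters need different modes.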

The input bitwidth of the 2D filter matches the input bitwidth of the row filter. The output bitwidth of the 2D filter matches the output bitwidth of the column filter.

Based on the three allowed I/O cases for and , a common approach is to match the input bitwidth of the column filter with the output bitwidth of the row filter and to set to be the same for the row and column filters. Example: , : row filter (, ), column filter (, ).

##### 3.4. 2D Real/Complex Filterbank Implementation

To implement filterbanks, we extend the cyclic approach by sequentially applying DPR to move along each 2D filter’s row and column bitstreams. No extra reconfiguration penalty exists since we are always reconfiguring twice per frame. The execution time grows linearly with the number of 2D filters.

Figure 5 depicts the 2D separable filterbank implementation on an FPGA. Figure 5(a) shows how a set of filterbanks can be stored in memory. Figure 5(b) provides a conceptual implementation of the filterbank using DPR. Figure 5(c) shows how the 1D filters are loaded cyclically on the FPGA. We refer to [5] for early work on a 2D real input, real coefficients filterbank example (coefficient-only reconfiguration).
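The cyclic loading of 1D filters can be pictured as a fixed schedule that is replayed for every frame. The sketch below only enumerates the steps; the step labels and bitstream names (`f0_row`, `f0_col`, and so on) are hypothetical.

```python
def filterbank_schedule(filter_names):
    """Yield, in order, the steps applied to one frame by a bank of 2D separable filters."""
    for name in filter_names:
        yield ("reconfigure", f"{name}_row")   # load the row-filter bitstream via DPR
        yield ("filter_rows", name)
        yield ("transpose", name)              # row-filtered frame is transposed in memory
        yield ("reconfigure", f"{name}_col")   # load the column-filter bitstream via DPR
        yield ("filter_columns", name)


steps = list(filterbank_schedule(["f0", "f1", "f2"]))
reconfigurations = sum(1 for op, _ in steps if op == "reconfigure")
```

With three 2D filters the schedule contains six reconfigurations per frame, two for each 2D filter, so the number of steps grows linearly with the size of the filterbank.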

#### 4. Energy, Performance, and Accuracy Measurements

In this section, we describe how we compute and/or measure the quantities that will let us evaluate the filterbank hardware implementation.

##### 4.1. Performance Calculations

We report the performance of the 2D filterbank from the IP angle, that is, assuming a continuous streaming of the input signal and maximum reconfiguration speed. We consider the embedded system as a generic test-bed. In general, the embedded system performance depends on many factors that are specific to each particular system (e.g., cache size, processor type, bus usage, etc.).

###### 4.1.1. Filter Processing Time

A 2D filter operates on a stream-by-stream basis. After each stream (a single row or column) is processed, the register chain in the FIR filter is cleared, ready for a new stream. Let the lengths of the input streams be defined as and for the row filter and column filter, respectively (input video frame of size rows by columns). The processing time is then given by
where and represent the number of row and column filter coefficients, respectively. REG_LEVELS*r* and REG_LEVELS*c* denote the initial latencies (see Table 1) and is the clock period. Here, note that , cycles are needed to provide centered row/column convolution outputs.

###### 4.1.2. Reconfiguration Time

Based on the PRR bitstream size and the reconfiguration speed, the reconfiguration time is given by

The maximum reconfiguration speed (400 MB/s in Virtex-4) is achieved with a direct link between the ICAP port and the BRAMs (local memory inside the FPGA) that hold the bitstreams [8, 24]. The approach is limited by the available BRAMs. While BRAMs are scarce in Virtex-4 devices, there are far more BRAM resources in the (newer) Virtex-6 devices [26], making the approach more practical. We will use the maximum reconfiguration speed in our performance results.

###### 4.1.3. 2D Filterbank Performance

Filtering a frame of pixels entails row filtering, column filtering, and two reconfigurations. The processing time of a filterbank with “” 2D filters is given by

The filterbank performance (in frames per second) is then given by

In (3), the transposition time for the row-filtered image is not considered. The transposition operation is a software routine in the embedded platform (e.g., see [6]). For completeness, we report the transposition time in the embedded results section.
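The timing model of this section can be sketched numerically. The expressions below are an assumption-based reading of the text, not a transcription of the paper's equations: each stream is assumed to cost its length plus half the number of coefficients (for the centered output) plus the register latency, in clock cycles; the reconfiguration time is the bitstream size divided by the reconfiguration speed; and the row and column bitstreams are assumed to have the same size. All numeric values are hypothetical.

```python
def filter_time(num_streams, stream_length, num_coeffs, reg_levels, clock_period):
    """Time to filter all streams (rows or columns) of one frame, in seconds."""
    cycles_per_stream = stream_length + num_coeffs // 2 + reg_levels  # assumed cycle count
    return num_streams * cycles_per_stream * clock_period


def reconfiguration_time(bitstream_bytes, speed_bytes_per_s):
    """Time to load one partial bitstream, in seconds."""
    return bitstream_bytes / speed_bytes_per_s


def filterbank_time(num_filters, t_row, t_col, t_reconfig):
    """Each 2D filter needs a row pass, a column pass, and two reconfigurations."""
    return num_filters * (t_row + t_col + 2 * t_reconfig)


def frames_per_second(t_filterbank):
    return 1.0 / t_filterbank


# Hypothetical example: 288 rows by 352 columns, 16 coefficients, 10 register levels,
# 100 MHz clock, 400 kB partial bitstream loaded at 400 MB/s, 4 filters in the bank.
T = 1.0 / 100e6
t_row = filter_time(288, 352, 16, 10, T)
t_col = filter_time(352, 288, 16, 10, T)
t_rec = reconfiguration_time(400e3, 400e6)
t_bank = filterbank_time(4, t_row, t_col, t_rec)
fps = frames_per_second(t_bank)
```

In this made-up configuration the two reconfigurations take about as long as the two filtering passes, which is why the reconfiguration speed matters so much for the achievable frame rate. The software transposition time is excluded here, as in the text.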

##### 4.2. Energy Measurement Estimation

Here, we report the energy spent by the 2D filterbank and the reconfiguration process for processing a single frame. Energy provides more information than power since 2D filtering involves several stages (row filtering, column filtering, and reconfiguration) that draw different amounts of power.

###### 4.2.1. Power Measurements

Power inside the FPGA is drawn by the following power supply rails: (i) internal supply voltage VCCINT with current ICCINT and (ii) auxiliary supply voltage VCCAUX with current ICCAUX. The output supply power is not considered since it is only associated with the power drawn by the external pins.

Power at each supply rail is divided into static and dynamic power. The *static power* is divided into *device static power* and *design static power*. These quantities are thoroughly summarized in [14].

We report the energy spent by the real/complex FIR peripheral and the reconfiguration process for processing one frame. For power consumption of the PLB peripheral , we use Xilinx Power Analyzer (XPA) at 25°C for v and v rails. XPA provides an accurate estimate based on simulated switching activity of the place-and-routed circuit and exact utilization statistics [27]. We report the power as the sum of the dynamic power and design static power, ignoring the device static power (fixed quantity that depends on the operating conditions and device size: 1.5 W for the XC6VLX240T FPGA at 25°C):

The TI power adaptor inside the ML605 board can monitor the current of the FPGA voltage rails (via I²C). In [7], we found that, during DPR, only the VCCINT supply current increased by 50 mA. The reconfiguration power increase is then VCCINT × 50 mA.

###### 4.2.2. Energy per Frame

The total energy per frame is the sum of the energy spent by the following processes: (i) row filtering, (ii) turning a row filter into a column filter via DPR, (iii) column filtering, and (iv) turning a column filter into a row filter via DPR. We apply this process for each 2D filter in the filterbank. Using the power and the processing times of each process, the energy per frame (epf) is given by
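As a rough illustration of how the four contributions combine, the sketch below multiplies each power by its duration and sums over the 2D filters. It is a simplified model under assumptions: both reconfigurations are taken to have the same power and duration, the function and argument names are hypothetical, and the numbers (including the 1.0 V supply used in the example) are placeholders rather than measured values.

```python
def reconfiguration_power(vccint_volts, current_increase_amps=0.050):
    """Power increase during DPR: supply voltage times the 50 mA current increase."""
    return vccint_volts * current_increase_amps


def energy_per_frame(num_filters, p_row, t_row, p_col, t_col, p_reconfig, t_reconfig):
    """Energy (joules) for one frame: row pass, column pass, and two reconfigurations
    per 2D filter, summed over all filters in the bank."""
    per_filter = p_row * t_row + p_col * t_col + 2 * p_reconfig * t_reconfig
    return num_filters * per_filter


p_rec = reconfiguration_power(1.0)   # assumed 1.0 V supply, for illustration only
epf = energy_per_frame(num_filters=2,
                       p_row=0.1, t_row=1e-3,
                       p_col=0.2, t_col=1e-3,
                       p_reconfig=p_rec, t_reconfig=1e-3)
```

Because every term is a power multiplied by a time, running the filter at a higher clock frequency raises the dynamic power but shortens the filtering time, so the energy per frame does not necessarily increase.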

##### 4.3. Accuracy Measurements

We measure the accuracy of the filtered images (magnitude response) using the peak signal-to-noise ratio (PSNR) between the fixed-point filterbank output from the FPGA and the 2D filterbank output computed with double-precision floating point. Here, we compare the output of each 2D filter.
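A minimal sketch of the PSNR computation on magnitude images follows. The choice of `peak` is left to the caller because it depends on the output format, and the function name and sample values are illustrative.

```python
import math


def psnr(reference, test, peak):
    """PSNR in dB between the magnitudes of two equally sized images (lists of rows).
    `reference` would be the floating-point result and `test` the fixed-point result."""
    ref = [abs(v) for row in reference for v in row]   # magnitude of each (complex) pixel
    out = [abs(v) for row in test for v in row]
    if len(ref) != len(out):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return math.inf                                 # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


reference = [[3 + 4j, 0]]   # hypothetical floating-point output (magnitudes 5 and 0)
fixed = [[5, 1]]            # hypothetical fixed-point output
value = psnr(reference, fixed, peak=255)
```

Every additional fractional bit kept at the output roughly halves the quantization error, which is why wide output words translate into high PSNR values.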

#### 5. Experimental Setup

This section describes the filterbanks selected as well as the test images. A 2D FIR filterbank is represented by its associated parameters (for each 2D filter), bitstreams (row and column filters), and its energy-performance-accuracy values.

##### 5.1. Generation of the Set of 2D Filterbanks

We test our system by using the four cases our system can support: (i) real image, real coefficients, (ii) real image, complex coefficients, (iii) complex image, real coefficients, and (iv) complex image, complex coefficients. To this end, we selected two filterbank types.

Table 2 summarizes the characteristics of the filterbanks we selected: equiripple ( real coefficients) and Gabor ( complex coefficients). Figure 6 depicts the frequency response of the equiripple filterbank, while Figure 7 depicts that of the Gabor filterbank. The Gabor filterbank is defined in terms of Gabor functions given by with , denoting the central frequency of the Gabor filter, and is the Gaussian function.

The filters are sampled versions of . The coefficients’ real part is symmetric, while the imaginary part is antisymmetric. The 2D Gabor and equiripple filterbank examples considered here are motivated by their wide range of applications [2, 28, 29].
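The symmetry property can be checked on a sampled 1D Gabor function. The sketch below assumes the common form of a Gaussian envelope multiplied by a complex exponential, sampled symmetrically around zero; the function name, `sigma`, and the example values are assumptions, and no normalization or quantization is applied.

```python
import cmath
import math


def gabor_coeffs(num_coeffs, center_freq, sigma):
    """Sample g(n) = exp(-n^2 / (2 sigma^2)) * exp(j * center_freq * n)
    at num_coeffs points placed symmetrically around n = 0."""
    half = (num_coeffs - 1) / 2.0
    coeffs = []
    for k in range(num_coeffs):
        n = k - half
        envelope = math.exp(-(n * n) / (2.0 * sigma * sigma))
        coeffs.append(envelope * cmath.exp(1j * center_freq * n))
    return coeffs


h = gabor_coeffs(num_coeffs=9, center_freq=math.pi / 4, sigma=2.0)
real_is_symmetric = all(abs(h[k].real - h[-1 - k].real) < 1e-12 for k in range(len(h)))
imag_is_antisymmetric = all(abs(h[k].imag + h[-1 - k].imag) < 1e-12 for k in range(len(h)))
```

The real part follows a cosine (even) and the imaginary part a sine (odd), so a complex-coefficient filter can reuse the symmetric and antisymmetric structures of the 1D core. A separable 2D Gabor filter is obtained by using one such set of coefficients for the rows and another for the columns.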

The coefficients are generated in a text file and loaded into the real/complex FIR filter core as a parameter.

We report results based on 8-bit input images. For complex images, the real and imaginary parts are assumed to have 8 bits each.

The input bitwidth of the 2D filters is set to be 8 and the output bitwidth to be 16. In the case of complex filter, these bitwidths apply to both the real and imaginary parts. In the context of the 32-bit PLB peripheral, this means that row filters have , , and column filters have . We also constrain the output to be in the same range as the input: (). For the outputs, we use an arithmetic mode that uses truncation of the LSB followed by saturation. We also set , , and to be the same for both row and column filters.

##### 5.2. Generation of Real/Complex Images and Video

We selected “*lena*” as the real image and “*foreman*” as the real video. Both use 8-bit pixels at CIF () resolution. Since we deal with CIF-sized frames, we need , (PLB peripheral parameters).

For a complex image, we consider a 2D Hilbert extended image that is to be processed through a bank of bandpass filters: this is called the analytical image. The 2D Hilbert extended image is computed using [29]
where is a 2D Hilbert transform applied along each row of the real-valued image , and is a complex-valued image. Figures 8(a) and 8(b) show the real image *lena* and the analytical (complex-valued) image *lena*. Figures 8(c) and 8(d) depict the Gabor and equiripple filterbanks.
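One way to build such a complex-valued image in software is the frequency-domain construction of the discrete analytic signal, applied row by row: keep DC (and the Nyquist bin for even lengths), double the positive frequencies, and zero the negative ones. The sketch below uses a naive DFT so that it stays dependency-free; it is an illustration of the idea, not necessarily the procedure used to produce Figure 8, and all names are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for a short demonstration)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def analytic_row(row):
    """Return row + j * Hilbert(row) by suppressing the negative frequencies."""
    n = len(row)
    spectrum = dft(row)
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            gain = 1          # DC and Nyquist are left unchanged
        elif k < n / 2:
            gain = 2          # positive frequencies are doubled
        else:
            gain = 0          # negative frequencies are removed
        spectrum[k] *= gain
    return dft(spectrum, inverse=True)


def analytic_image(image):
    """Apply the 1D construction along each row of a real-valued image."""
    return [analytic_row(row) for row in image]


row = [math.cos(2 * math.pi * n / 8) for n in range(8)]
analytic = analytic_row(row)
```

For this cosine row the result is the complex exponential with the same frequency: the real part reproduces the input and the imaginary part is the corresponding sine.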

The following is the list of cases that we will test:

(i) RIRH: real image (*lena*), real coefficients (equiripple filterbank);

(ii) CIRH: complex image (analytical *lena*), real coefficients (equiripple filterbank);

(iii) RICH: real image (*lena*), complex coefficients (Gabor filterbank);

(iv) CICH: complex image (analytical *lena*), complex coefficients (Gabor filterbank).

Note that the 300-frame “*foreman*” video undergoes the same process as the “*lena*” image. We generate an analytical image out of each “*foreman*” frame.

We also control the frequency of operation of the real/complex FIR filter PLB peripheral. Figure 2(b) shows how the frequency of “*clkfx*” can be modified at run time. We selected , , and five different frequencies (MHz): 100 (), 66.66 (), 50 (), 40 (), and 33.33 ().

##### 5.3. Platform Testing Scheme

The system is implemented on the Xilinx ML605 development board, which houses an XC6VLX240T Virtex-6 FPGA. It also comes with 512 MB of DDR3 RAM. The soft-core MicroBlaze and peripherals run at 100 MHz.

#### 6. Results and Analysis

##### 6.1. Hardware Resource Utilization

###### 6.1.1. 1D Real/Complex FIR Filter Processor IP Utilization

Table 3 presents the resource breakdown as required by the row and column filters (this is the complete PLB peripheral, PRR included) for each of the four considered cases.

###### 6.1.2. Size of Reconfigurable Area

The PRR size is set to the largest possible filter realization, which in this case is given by the column filter. Hence, the PRR size is the same for the cases RICH and CICH (the column filter is always cIcH), while the PRR size is different for the cases RIRH and CIRH (the column filters are rIrH and cIrH, resp.). Table 4 shows the PRR sizes in number of slices and bitstream size.

##### 6.2. Embedded System Results

###### 6.2.1. Embedded System Resource Utilization

Table 5 shows the hardware resource utilization of the embedded FIR filtering system that can implement all the cases. The PRR includes a 1D real/complex FIR filter and the PLB interface. The largest realization occupies 2160 slices (98% of the allocated space for the PRR). This is slightly lower than what Table 4 reports since the results are obtained by compiling the embedded system (also we only include the part inside the PRR). For transposition of the row-filtered image, we have 9.53 ms for a CIF frame (output bit-width = 16).

For reconfiguration, we used the Xilinx ICAP core, obtaining average reconfiguration speed of 16.28 MB/s. Significant improvements can be obtained with the use of custom-built controllers as reported in [18].

###### 6.2.2. Embedded System Performance

Performance metrics are listed in Table 6. The performance is heavily affected by the transposition time and the reconfiguration time (using the 16.28 MB/s speed). Results are acceptable in the sense that we are processing video images using a filterbank, not a single filter. Among the suggested improvements to the embedded system, we note (i) reducing the reconfiguration time by using a custom-built ICAP controller (mentioned in Section 2), (ii) improving the algorithm for fast transposition in memory, and (iii) increasing the DMA burst length (currently 16) and the FIFOs’ depth (FIFOs are shown in Figure 2).

###### 6.2.3. Memory Requirements

The 2D filters of a filterbank are represented by a set of bitstreams. The larger the filterbank is, the larger the memory overhead is. Table 7 lists the memory requirements for the filterbank implementations.

##### 6.3. Energy-Performance-Accuracy Results

Tables 8 and 9 list the processing time and frames per second of the filterbanks, respectively (using a CIF frame size). These values are listed as a function of the frequency.

Figure 9 shows the energy consumed by the PLB peripheral and the reconfiguration process for processing a single CIF frame. The energy per frame is shown as a function of the frequency for every type of 2D filterbank. We note that the processing time decreases as the operating frequency increases. As a result, we observe a slight overall decrease in the energy required to process each frame. The energy savings occur despite the fact that the increased operating frequency results in higher dynamic power.

Table 10 lists the accuracy (PSNR) values for the four considered cases for the *lena* image (CIF). Notice that the accuracy values are very high (90 dB); this is due to the fact that we are working with 16-bit output images. In addition, we streamed the *foreman* video sequence (CIF) through all the filterbanks; the accuracy results are shown in Figure 10. We can see that the (high) values are very close to those of *lena*.

Finally, Figure 11 shows the actual FPGA outputs (magnitude, magnitude spectra, and real component) of the equiripple filterbank for an input real image (*lena*). Figure 12 shows the same equiripple filterbank outputs for an input complex image (*analytical lena*). Figures 13, 14, and 15 show the actual FPGA outputs (magnitude, magnitude spectra, and real component) of the Gabor filterbank for an input real image (*lena*). Figures 16, 17, and 18 show the same Gabor filterbank results for an input complex image (*analytical lena*).

##### 6.4. Visual Assessment of the Results

We present a visual assessment of the results to demonstrate the effectiveness of the proposed approach. In Figures 13–18, we present extensive results on the *lena* image based on the use of a Gabor filterbank. The examples allow the readers to assess the performance of our methods by using a popular image with a widely used filterbank as documented in [2]. Furthermore, to allow for comparisons to more recent methods, we also include results from the use of equiripple filterbanks as documented in [28].

To analyze the results, we note that the magnitude output of each Gabor filter represents an estimate of the instantaneous amplitude (IA) as documented in [2, 28].

Large values of the IA indicate the presence of significant frequency components that fall within the passband of the corresponding filter. The IA outputs are given in the left column of each example in Figures 13–18. For these figures, in the images presented in the right column we have the real part of the output. These images represent the frequency-modulation (FM) components that capture the spatial frequency variations in the image [2, 28]. The passband of each filter can be visualized in the middle columns of Figures 13–18. Here, we note that the image spectrum that falls within the passband is extracted from the rest of the spectrum.

We present two sets of Gabor filterbank results on the *lena* image. The first set is based on the use of a real-valued *lena* image without prefiltering with the directional Hilbert transform as shown in Figure 8(a). These results are shown in Figures 13–15. The second set of results comes from the use of a complex *lena* input image that is formed from the output of the directional Hilbert transform as shown in Figure 8(b). This set of results is shown in Figures 16–18.
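As a rough illustration of how a complex-valued input image can be formed, the sketch below builds a row-wise analytic signal by suppressing negative frequencies along each row. This is one common construction and is only assumed to resemble the directional Hilbert prefiltering of Figure 8(b); the function names and the naive DFT are illustrative choices, not the authors' implementation.

```python
import cmath
import math

def dft(samples, inverse=False):
    """Naive O(N^2) DFT; adequate for a short illustrative line of pixels."""
    n = len(samples)
    sign = 1 if inverse else -1
    out = [
        sum(x * cmath.exp(sign * 2j * math.pi * k * m / n)
            for m, x in enumerate(samples))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out

def analytic_line(samples):
    """Suppress the negative frequencies of one real line (even length assumed)."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(1, n // 2):
        spectrum[k] *= 2      # double the positive frequencies
    for k in range(n // 2 + 1, n):
        spectrum[k] = 0       # zero the negative frequencies
    return dft(spectrum, inverse=True)

def analytic_image(image):
    """Apply the row-wise construction to every row of a real image."""
    return [analytic_line(row) for row in image]

# One-row test image: a cosine of amplitude 3 at DFT bin 4.
row = [3.0 * math.cos(2 * math.pi * 4 * n / 32) for n in range(32)]
complex_row = analytic_image([row])[0]
print(max(abs(z) for z in complex_row))
```

The real part of the result reproduces the original row, while its magnitude is constant and equal to the cosine amplitude.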

A visual evaluation of the IA for each filter in Figures 16–18 clearly indicates that both the direction and magnitude of each frequency component are well represented in the IA outputs. For the lower frequency components, the real-valued output images presented in the right-hand column reconstruct the overall image structure, with strong emphasis on the image structures that exhibit frequency components falling within the passband of each filter. The results indicate that Gabor filters with higher frequency magnitude components provide very good spatial localization of the underlying frequency components. Similar observations apply to the outputs from the real-valued *lena* image as shown in Figures 13–15. However, prefiltering with the directional Hilbert transform provides better results, as indicated by the finer localization of the frequency components.

We also present results based on the equiripple filter design in Figure 11 (real input) and Figure 12 (complex input). The results indicate strong spatial localization of a wide range of frequency components (e.g., see first row for both figures).

We also present filterbank results on the *foreman* video in Figure 10. From the results, it is clear that we have very high accuracy since PSNR remains above 85 dB for all 300 video frames. Furthermore, it is clear that accuracy is a function of the frequency band. For example, the highest, diagonal frequency components captured in Filter 3 of the equiripple design (Figure 10(b)) maintain much higher accuracy than the vertical frequency components of Filter 2. A comparison of the Gabor filterbank results between the real and complex-valued input video frames clearly demonstrates the consistency of prefiltering with the directional Hilbert transform.

##### 6.5. Comparison with a Static Implementation

Here, we provide a comparison of our dynamic filterbank implementation with an efficient static implementation (no DPR). The static implementation is based on a multiply-and-add 1D FIR core that implements the row and column filters by loading new coefficients into registers (instead of performing DPR). The 1D multiply-and-add FIR core is shown in Figure 19. Note that we show the case rIrH; the other cases (cIrH, rIcH, cIcH) are generated as in Figure 1. The I/O latency is given by for the cIcH mode and for the cIrH, rIcH, and rIrH modes. Prior to each 1D filtering, we need to feed the coefficients into a register chain. As in the dynamic filterbank implementation, we have row filtering, transposition, and column filtering.
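The row filtering, transposition, and column filtering sequence can be sketched in software as follows. This is a behavioural model only: the helper names are hypothetical, the final transposition is added here simply to restore the original orientation for comparison, and the small complex-valued taps are illustrative rather than coefficients from the paper. The result is checked against a direct 2D convolution with the outer-product kernel.

```python
def fir_1d(samples, taps):
    """Full linear convolution of one line of samples with the FIR taps."""
    out = [0] * (len(samples) + len(taps) - 1)
    for n, x in enumerate(samples):
        for k, h in enumerate(taps):
            out[n + k] += h * x
    return out

def filter_lines(image, taps):
    """Apply the same 1D FIR filter to every row of the image."""
    return [fir_1d(row, taps) for row in image]

def transpose(image):
    return [list(column) for column in zip(*image)]

def separable_filter(image, row_taps, col_taps):
    """Row filtering, transposition, column filtering, then transpose back."""
    after_rows = filter_lines(image, row_taps)
    after_cols = filter_lines(transpose(after_rows), col_taps)
    return transpose(after_cols)

def direct_2d(image, row_taps, col_taps):
    """Reference: full 2D convolution with the outer-product kernel."""
    height = len(image) + len(col_taps) - 1
    width = len(image[0]) + len(row_taps) - 1
    out = [[0] * width for _ in range(height)]
    for i, row in enumerate(image):
        for j, x in enumerate(row):
            for a, hc in enumerate(col_taps):
                for b, hr in enumerate(row_taps):
                    out[i + a][j + b] += hc * hr * x
    return out

image = [[1, 2, 3], [4, 5, 6]]   # real input
row_taps = [1, 2j]               # complex row coefficients
col_taps = [1 - 1j, 3]           # complex column coefficients

separable = separable_filter(image, row_taps, col_taps)
direct = direct_2d(image, row_taps, col_taps)
print(separable == direct)
```

Because the same 1D routine serves both passes, only the coefficients change between row and column filtering, which mirrors the role of the single 1D FIR core described above.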

Table 11 shows the resource consumption of the static implementation using the same parameters as in the case of the dynamic system: row filters with , , and column filters with . Also, and are set to be the same for both row and column filters. The static system uses far more resources than our dynamic system (Table 3). In the static system, since the hardware implementation is always that of the largest column filter (since we cannot reconfigure dynamically), the power consumption of the row and column filters is about the same (the row filter spends slightly less power since some input ports are driven to zero).

Performance and energy formulas (4) and (6) apply for the static system (omitting the reconfiguration time). In the static system, computing , requires additional cycles in the inner summation of (1), since we need to feed the coefficients. Table 12 shows performance comparison for the dynamic and static implementations at 100 MHz. As the frame size increases, the performance of the dynamic system gets very close to that of the static system. Table 13 shows the energy per frame (mJ) for the given filterbanks and for the dynamic and static cases at 100 MHz. Results are given for different frame sizes. For most cases, the static system spends more energy than the dynamic one, except for the small QCIF and CIF frame sizes. This is because the dynamic system incurs a reconfiguration time overhead, which is offset as the frame size increases. Also, it is important to point out that the video frame size has a negligible effect on the hardware resources.
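The trade-off between reconfiguration overhead and frame size can be made concrete with a simplified energy model. The sketch below is not a restatement of formulas (4) and (6): it assumes energy equals power times processing time plus a fixed per-frame reconfiguration cost, it assumes the same cycle count per pixel for both systems, and every numeric value is a placeholder chosen only for illustration, not a measurement from the paper.

```python
def energy_per_frame_mj(power_mw, cycles_per_pixel, pixels, clock_hz, overhead_mj=0.0):
    """Energy (mJ) = power (mW) x processing time (s) + fixed per-frame overhead (mJ)."""
    processing_time_s = cycles_per_pixel * pixels / clock_hz
    return power_mw * processing_time_s + overhead_mj

def break_even_pixels(dyn_power_mw, static_power_mw, cycles_per_pixel, clock_hz, reconfig_mj):
    """Frame size above which the dynamic system uses less energy, or None."""
    saving_per_pixel_mj = (static_power_mw - dyn_power_mw) * cycles_per_pixel / clock_hz
    if saving_per_pixel_mj <= 0:
        return None
    return reconfig_mj / saving_per_pixel_mj

# Placeholder values for illustration only; NOT taken from the paper.
CLOCK_HZ = 100e6
DYN_POWER_MW = 100.0
STATIC_POWER_MW = 200.0
CYCLES_PER_PIXEL = 1.0
RECONFIG_MJ = 0.2

threshold = break_even_pixels(DYN_POWER_MW, STATIC_POWER_MW,
                              CYCLES_PER_PIXEL, CLOCK_HZ, RECONFIG_MJ)

frame_sizes = {"QCIF": (176, 144), "CIF": (352, 288), "4CIF": (704, 576)}
for name, (width, height) in frame_sizes.items():
    pixels = width * height
    dynamic = energy_per_frame_mj(DYN_POWER_MW, CYCLES_PER_PIXEL, pixels,
                                  CLOCK_HZ, overhead_mj=RECONFIG_MJ)
    static = energy_per_frame_mj(STATIC_POWER_MW, CYCLES_PER_PIXEL, pixels, CLOCK_HZ)
    print(name, "dynamic lower" if dynamic < static else "static lower")
```

Under this model the reconfiguration cost is fixed per frame while the power saving grows with the number of pixels, so a break-even frame size exists whenever the dynamic system draws less power than the static one.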

#### 7. Conclusion

The paper presents an efficient implementation of a general-purpose 2D filterbank that supports any possible combination of real/complex inputs, filter coefficients, and outputs. The efficiency of the approach is based on the use of separable implementations and dynamic partial reconfiguration so that the hardware requirements are reduced to a single, general-purpose 1D FIR filter.

The system has been demonstrated using both real and complex-valued images and video frames along with real and complex 2D separable filterbanks. The dynamic implementation was evaluated in terms of performance, energy per frame, and accuracy values for a CIF frame and five different frequencies of operation. Also, we included a comprehensive comparison with a static implementation that demonstrates that the dynamic approach is promising for large image sizes. The results show that this implementation is very efficient in terms of resources used while delivering high accuracy results using fixed-point hardware.

Further work can focus on the implementation of image analysis systems. For example, once the filterbank outputs are obtained we can implement AM-FM feature extraction and classification [28].

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### References

- J. Becker, M. Hübner, G. Hettich, R. Constapel, J. Eisenmann, and J. Luka, “Dynamic and partial FPGA exploitation,” *Proceedings of the IEEE*, vol. 95, no. 2, pp. 438–452, 2007.
- A. Bovik, *The Essential Guide to Image Processing*, Elsevier/Academic Press, 2nd edition, 2009.
- A. Bovik, *The Essential Guide to Video Processing*, Elsevier/Academic Press, 2nd edition, 2009.
- D. Llamocca, M. Pattichis, and G. A. Vera, “Partial reconfigurable FIR filtering system using distributed arithmetic,” *International Journal of Reconfigurable Computing*, vol. 2010, Article ID 357978, 14 pages, 2010.
- D. Llamocca and M. Pattichis, “Real-time dynamically reconfigurable 2-D filterbanks,” in *Proceedings of the IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI '10)*, pp. 181–184, Austin, Tex, USA, May 2010.
- D. Llamocca, C. Carranza, and M. Pattichis, “Separable FIR filtering in FPGA and GPU implementations: energy, performance, and accuracy considerations,” in *Proceedings of the 21st International Conference on Field Programmable Logic and Applications (FPL '11)*, pp. 363–368, Chania, Greece, September 2011.
- D. Llamocca, C. Carranza, and M. Pattichis, “Dynamic multiobjective optimization management of the Energy-Performance-Accuracy space for Separable 2-D complex filters,” in *Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL '12)*, pp. 579–582, Oslo, Norway, August 2012.
- M. Liu, W. Kuehn, Z. Lu, and A. Jantsch, “Run-time partial reconfiguration speed investigation and architectural design space exploration,” in *Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL '09)*, pp. 498–502, Prague, Czech Republic, September 2009.
- J. Huang and J. Lee, “A self-reconfigurable platform for scalable DCT computation using compressed partial bitstreams and BlockRAM prefetching,” *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 19, no. 11, pp. 1623–1632, 2009.
- M. S. Pattichis and J. C. Hoffman, “A high-speed dynamic partial reconfiguration controller using direct memory access through a multiport memory controller and overclocking with active feedback,” *International Journal of Reconfigurable Computing*, vol. 2011, Article ID 439072, 10 pages, 2011.
- A. Sudarsanam, R. Barnes, J. Carver, R. Kallam, and A. Dasu, “Dynamically reconfigurable systolic array accelerators: a case study with extended Kalman filter and discrete wavelet transform algorithms,” *IET Computers and Digital Techniques*, vol. 4, no. 2, Article ID ICDTA6000004000002000126000001, pp. 126–142, 2010.
- M. Fons, F. Fons, and E. Cantó, “Fingerprint image processing acceleration through run-time reconfigurable hardware,” *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 57, no. 12, pp. 991–995, 2010.
- A. Ahmad and A. Amira, “Efficient reconfigurable architectures for 3D medical image compression,” in *Proceedings of the International Conference on Field-Programmable Technology (FPT '09)*, pp. 472–474, Sydney, Australia, December 2009.
- D. Llamocca and M. Pattichis, “A dynamically reconfigurable pixel processor based on power/energy-performance-accuracy optimization,” *IEEE Transactions on Circuits and Systems for Video Technology*, vol. 23, no. 3, pp. 488–502, 2013.
- H. Neoh and A. Hazanchuk, “Adaptive edge detection for real-time video processing using FPGAs,” in *Proceedings of the Global Signal Processing Expo and Conference (GSP '04)*, 2004.
- C. Bouganis, S. Park, G. Constantinides, and P. Cheung, “Synthesis and optimization of 2D filter designs for heterogeneous FPGAs,” *ACM Transactions on Reconfigurable Technology Systems*, vol. 1, no. 4, 2009.
- P. Longa, A. Miri, and M. Bolic, “A flexible design of filterbank architectures for discrete wavelet transforms,” in *Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07)*, pp. III1441–III1444, April 2007.
- N. Voß and B. Mertsching, “Design and implementation of an accelerated Gabor filter bank using parallel hardware,” in *Proceedings of the International Conference on Field Programmable Logic and Applications*, 2001.
- Y. C. P. Cho, S. Bae, Y. Jin, K. M. Irick, and V. Narayanan, “Exploring Gabor filter implementations for visual cortex modeling on FPGA,” in *Proceedings of the 21st International Conference on Field Programmable Logic and Applications (FPL '11)*, pp. 311–316, Chania, Greece, September 2011.
- S. Hong, J. Lee, A. Athalye, P. M. Djurić, and W.-D. Cho, “Design methodology for domain specific parameterizable particle filter realizations,” *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 54, no. 9, pp. 1987–2000, 2007.
- D. Llamocca, *Dynamically Reconfigurable Management of Energy, Performance, and Accuracy applied to Digital Signal, Image, and Video Processing Applications [Ph.D. thesis]*, University of New Mexico, Albuquerque, New Mexico, 2012.
- S. U. Bhandari, S. Subbaraman, S. S. Pujari, and R. Mahajan, “Real time video processing on FPGA using on the fly partial reconfiguration,” in *Proceedings of the International Conference on Signal Processing Systems (ICSPS '09)*, pp. 244–247, Singapore, May 2009.
- T. Raikovich and B. Fehér, “Application of partial reconfiguration of FPGAs in image processing,” in *Proceedings of the 6th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME '10)*, Berlin, Germany.
- Y. Hori, A. Satoh, H. Sakane, and K. Toda, “Bitstream encryption and authentication with AES-GCM in dynamically reconfigurable systems,” in *Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '08)*, pp. 23–28, Heidelberg, Germany, September 2008.
- J. A. Clemente, J. Resano, C. González, and D. Mozos, “A hardware implementation of a run-time scheduler for reconfigurable systems,” *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 19, no. 7, pp. 1263–1276, 2011.
- Xilinx, *Virtex-6 Family Overview (DS150)*, Xilinx, San Jose, Calif, USA, 2010.
- Xilinx, *Power Methodology Guide (UG786)*, vol. 13.1, Xilinx, San Jose, Calif, USA, 2011.
- V. Murray, P. Rodríguez, and M. S. Pattichis, “Multiscale AM-FM demodulation and image reconstruction methods with improved accuracy,” *IEEE Transactions on Image Processing*, vol. 19, no. 5, pp. 1138–1152, 2010.
- A. C. Bovik, M. Clark, and W. S. Geisler, “Multichannel texture analysis using localized spatial filters,” *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 12, no. 1, pp. 55–73, 1990.