Reconfigurable Computers (RCs) with hardware (FPGA) co-processors can achieve significant performance improvement compared with traditional microprocessor ()-based computers for many scientific applications. The potential amount of speedup depends on the intrinsic parallelism of the target application as well as the characteristics of the target platform. In this work, we use image processing applications as a case study to demonstrate how hardware designs are parameterized by the co-processor architecture, particularly the data I/O, i.e., the local memory of the FPGA device and the interconnect between the FPGA and the . The local memory has to be used by applications that access data randomly. A typical case belonging to this category is image registration. On the other hand, an application such as edge detection can directly read data through the interconnect in a sequential fashion. Two different algorithms of image registration, the exhaustive search algorithm and the Discrete Wavelet Transform (DWT)-based search algorithm, are implemented on hardware, i.e., Xilinx Vertex-IIPro 50 on the Cray XD1 reconfigurable computer. The performance improvements of hardware implementations are and , respectively. Regarding the category of applications that directly access the interconnect, the hardware implementation of Canny edge detection can achieve speedup.

1. Introduction

Reconfigurable Computers (RCs) are traditional computers extended with co-processors based on reconfigurable hardware like FPGAs. Representative RC systems include SGI RC100 [1], SRC-6 [2], and Cray XD1 [3]. These enhanced systems are capable of providing significant performance improvement for scientific and engineering applications [4]. The performance of a hardware design on an FPGA device depends on both the intrinsic parallelism of the design as well as the characteristics of the FPGA co-processor architecture, which consists of the FPGA device itself and the surrounding data interface. Due to the limited size of the internal Block RAM memory, it is not applicable to store large amounts of data inside the FPGA device. Therefore external, however, local SRAM modules are generally connected to the hardware co-processor for data storage, such as the example shown in Figure 1. Furthermore, an FPGA co-processor can directly access the host memory through the interconnect, which generally provides a sustained bandwidth up to several GB/s. The available number and data width of local memory banks and the interconnect channels play important roles in the hardware implementation on an FPGA device and decide the parallelism a design can achieve in many cases. In this work, image processing algorithms are adopted as a case study to demonstrate how hardware designs can be parameterized by the data I/O of the FPGA co-processor in order to achieve the best performance.

Image processing applications are capable of gaining a significant performance improvement with hardware design [5, 6]. Two categories of image processing applications are selected to represent hardware designs that present different data I/O requirements. The first category of applications is image registration, which requires the use of local memory for data access. The second category of applications is edge detection, which directly reads from or writes to the interconnect in a streaming fashion. Image registration is a very important image processing task. It is used to align or match pictures taken in different conditions (at a different time, angle, or from different sensors). A vast majority of automatic image processing systems require the use of image registration processes. Common image registration applications include target recognition (identification of a small target inside a much bigger image), satellite image alignment in order to detect changes such as land usage or forest fires [7], matching stereo images to recover depth information, and superposing medical images taken at different moments for diagnosis [8, 9]. As an example of the applications in the second category, edge detectors encompass image processing algorithms that identify the positions of edges in an image. Edges are discontinuities in terms of intensity or orientation or both and generally represent meaningful characteristics of the image (boundaries of objects, e.g.). Commonly, edge detectors are used to filter relevant information in an image. Thus, they greatly reduce the amount of processing needed for interpreting the information contents of an image. One of the most important edge detection algorithms is the Canny edge detection [10]. The Canny edge detection operator was developed by Canny in 1986 and uses a multistage algorithm to detect a wide range of edges in images. It remains until now, as a state-of-the-art edge detector used in many applications.

The implementation of image registration and edge detection on reconfigurable computers has been previously reported in [11, 12], respectively. In this paper, we not only present the detail of hardware design itself, but also exploit the role of data I/O of the FPGA co-processor in the design process. More precisely, we demonstrate how the design of image processing applications is parameterized by local memory architecture and DMA interface on reconfigurable computers. Furthermore, the hardware processing time of both applications are formalized in terms of clock cycles. The remaining text is organized as follows. In Section 2, we discuss the related work based on literature survey. In Section 3, two related image registration algorithms, the exhaustive search algorithm and the DWT-based search algorithm, and their hardware implementations are discussed. Section 4 focuses on the design of Canny edge detection in hardware. The implementation of all applications on the Cray XD1 reconfigurable computer is presented in Section 5. Finally, Section 6 concludes this work.

Since image registration is a computation-demanding process in general, hardware (e.g., FPGA device) is leveraged to improve the processing performance. In [8], Dandekar and Shekhar introduced an FPGA-based architecture to accelerate mutual information-(MI)-based deformable registration during computed tomography-(CT)-guided interventions. Their reported implementation was able to reduce the execution time of MI-based deformable registration from hours to a few minutes. Puranik and Gharpure presented a multilayer feedforward neural network (MFNN) implementation in Xilinx XL4085 for template search in standard sequential scan and detect (SSDA) image registration [13]. In [14], Liu et al. proposed a PC-FPGA geological image processing system in which the FPGA was used to implement Fast Fourier Transform-(FFT)-based automatic image registration. In [15], El-Araby et al. prototyped an automatic image registration methodology for remote sensing using a reconfigurable computer. However, these previous work only emphasized the image registration algorithm itself. The FPGA data I/O as a factor in the design was not discussed in detail.

Low-level image processing operators, such as digital filters, edge detectors and digital transforms are good candidates for hardware implementation. In [16], a generic architectural model for implementing image processing algorithms of real-time applications was proposed and evaluated. In [17], a Canny edge detection application written in Handel-C and implemented in the FPGA device was discussed. The proposed architecture is capable of producing one edge-pixel every clock cycle. The work in [18] illustrated how to use design patterns in the mapping process to overcome image processing constraints on FPGAs. However, most of the previous works, for example, [1618], only focused on the algorithms alone and did not consider the platform characteristics as a factor in the design.

3. Image Registration

In this section, we discuss how to implement image registration algorithms in hardware to exploit the processing parallelism, which is bounded by the local memory architecture.

3.1. Background

Image registration can be defined as a mapping between two images, the reference image and the test image , both spatially and with respect to the intensity [19]. If these images are defined as two 2D arrays of a given size denoted by and where and each map to their respective intensity values, then the mapping between images can be expressed as where is a 2D spatial-coordinate transformation and is a 1D intensity or radiometric transformation. More precisely, is a transformation that maps two spatial coordinates, and , to new spatial coordinates and : is used to compensate gray value differences caused by different illuminations or sensor conditions.

According to [19], image registration can be viewed as the combination of four components: (1)a feature space, that is, the set of characteristics used to perform the matching and which are extracted from the reference and test images; (2)a search space, that is, the class of potential transformations that establish the correspondence between the reference and test images; (3)a search strategy, which is used to decide how to choose the next transformation from the search space; (4)a similarity metric, which evaluates the match between the reference image and the transformed test image for a given transformation chosen in the search space.

The fundamental characteristic of any image registration technique is the type of spatial transformation or mapping used to properly overlay two images. The most common transformations are rigid-body, affine, projective, perspective, and global polynomial. Rigid-body transformation is composed of a combination of a rotation (), a translation , and a scale change (). An example is shown in Figure 2. It typically has four parameters, , which map a point () of the first image to a point () of the second image as follows: where and are the coordinate vectors of the two images; is the translation vector; is a scalar scale factor; is the rotation matrix. Since the rotation matrix is orthogonal, the angles and lengths in the original image are preserved after the registration. Because of the scalar scale factor , rigid-body transformation allows changes in length relative to the original image, but they are the same in both and axes. (Please note both and are coordinates in the same Cartesian coordinate system with the origin .)

Computing the correlation coefficient is the basic statistical approach to registration and is often used for template matching or pattern recognition. A correlation coefficient is a similarity measure or match metric, that is, it gives a measure of the degree of similarity between a template (the reference image) and an image (the transformed test image). The correlation coefficient between the reference image and the image , which is the test image after rigid-body transformation, is given as where and are mean of the image and . If the image matches , the correlation coefficient will have its peak with the corresponding transformation. Therefore, by computing correlation coefficients over all possible transformations, it is possible to find the transformation that yields the peak value of the correlation coefficient.

In this work, rigid-body transformation is selected for the registration between two images and the correlation coefficient is used to measure the similarity. Further, we assume that both the reference image and the test image are 8-bit grayscale and share the same size.

Given a search space, (,,) theoretically all tuples of (,,) are to be tested to find the tuple that generates the maximum correlation coefficient between the reference image and the transformed test image (In this work, the scale factor is fixed at 1.). Figure 3 shows the two steps to test each tuple. The first step is to apply a rigid-body transformation on the test image to get . The second step is to calculate the correlation coefficient between and the reference image . As shown in Figure 3(b), only the pixels of both images within the shaded region are used during the calculation. Since the test image is rotated and translated to obtain , some parts of are beyond the shaded region. In other words, some portions of the shaded region (shown as crossed regions in the left Cartesian coordinate system of Figure 3(b)) do not belong to . For the pixels belonging to these crossed regions, their values are treated as zeros in the calculation.

In the remaining part of this section, two different approaches based on rigid-body transformation are discussed in details. The first approach literally tests the whole search space to find the best tuple of (,,). The second approach applies DWT on both the reference image and the test image to reduce the search resolutions in order to improve the search efficiency.

3.2. Exhaustive Search Algorithm

As its name implies, the exhaustive search algorithm tests all possible tuples of (,,) with a fixed search resolution, , , and , on each dimension, respectively, in order to find the tuple that produces the highest correlation coefficient between the transformed test image and the reference image. If this algorithm is implemented on a scalar microprocessor, these tuples have to be tested in sequence. However, if the same algorithm is implemented in hardware, multiple tuples can be tested in parallel to improve the performance.

Since the size of an image is normally bigger than the amount of available Block RAM inside an FPGA device, the local external memory is used to store images. Assuming there are individual local memory banks connected directly to the FPGA device, and one bank keeps the reference image , one bank keeps the test image . The other remaining banks are used to store transformed test images using different tuples of (,,), as shown in Figure 4. If we further assume that each memory bank has its own independent read and write ports, transformations of the test image can be carried out concurrently. The calculation of correlation coefficients between the image and different can be performed in parallel as well.

Given the coordinate of one pixel in the original image, (3) is used to calculate the coordinate of the corresponding pixel in the transformed image. Then, the intensity of the pixel in can be written into at the coordinate of . If we assume that there are pixels in the original image and the hardware implementation is fully pipelined, the transformation step would take approximately clock cycles. Furthermore, some extra clock cycles are needed to initialize the intensities of all pixels within the shaded region to zero due to two reasons. First, there are several regions within which the pixels do not belong to , as shown in Figure 3(b). Second, there may exist artifacts whose coordinates are within both the shaded region and , but are not calculated due to discretization, as shown in Figure 5. If the intensities of these pixels are left randomly, it may affect the accuracy of the correlation coefficient. Therefore, it is necessary to initialize the intensities of all pixels in the shaded region to zero in the first place. Since the data width of the memory bank's access ports is multiple-byte, multiple pixels can be initialized in one clock cycle. If we assume that the data width is -byte, then the initialization process would take roughly clock cycles. Overall, the transformation step of tuples would take clock cycles. The mean intensity of each can be calculated during the transformation step, hence it takes no extra time. The mean intensity can be precalculated by the microprocessor and forwarded to the FPGA device later since it remains unchanged during the whole image registration process.

Although the calculation of the correlation coefficient as (4) between and is more complicated than the transformation step, it takes the same time as the initialization process since pixels can be read and processed in the same clock cycle. Altogether, these three steps, including initialization, transformation and correlation coefficient calculation, would take clock cycles for testing tuples of (,,). If the entire search space consists of tuples, the whole registration process would take clock cycles. Apparently, the image registration time in hardware can be significantly reduced by increasing the number of local memory banks. Widening the data width of access port of local memory can improve the performance as well; however, it can also hit the upper bound very quickly due to the fact that

3.3. DWT-Based Search Algorithm

Although the exhaustive search algorithm is quite straightforward, it is computation-demanding as well. In [20], a DWT-based image registration approach was proposed. As shown in Figure 6, both the test image and the reference image go through several levels of Discrete Wavelet Transform before applying image registration. After each level of DWT, the image size is shrunk to of the previous level. In the meantime the image resolution is reduced to half. For example, if levels of DWT are applied on both the test image and the reference image , two series of images, , and , , are obtained. The registration process starts from the exhaustive search between and among the search space of ) with the search resolution, , , and , on each dimension. The registration result between and , (,,) becomes the center of the search space of the registration between and . In other words, the registration between and is among the search space of . However, the search resolution is increased to , , and , on each dimension. In general, when the registration process traces back one level, the search scope is reduced to half on each dimension, and the search resolution is increased two times on each dimension, respectively. This search strategy is illustrated in Figure 7. Table 1 details the search space and search resolution at each step in which rotation is taken as an example.

Different DWT decomposition processes have to be carried out in a sequence. Similarly the search processes at different levels need to be performed one after the other. Due to these two reasons, the original image and the decomposed images can be stored in the same memory bank, as shown in Figure 8 in which or denote the decomposed image at the level .

If we use the same assumptions as in Section 3.2, that is, both original images consist of pixels, the data width of the local memory is -byte, and there are independent local memory banks, then the DWT decomposition step alone will take clock cycles.

The search between the decomposed reference image and the test image at each level can use the same method described in Section 3.2. By observing Table 1, it is found that the search space at each level, except the level , is 125 tuples of (,,). Therefore, the search from level to level will take clock cycles. The search at the level itself takes clock cycles.

4. Edge Detection

Edge detection aims at identifying pixels in a digital image at which the image brightness changes sharply, that is, having discontinuities. Most edge detection algorithms involve the convolution process between the image and a kernel. Convolution provides a way of “multiplying together” two arrays of numbers, generally of different sizes, but of the same dimensionality, to produce a third array of numbers of the same dimensionality. This can be used in image processing to implement operators whose output pixel values are simple linear combinations of certain input pixel values. In an image processing context, one of the input arrays is normally just a grayscale image. The second array is usually much smaller, and is also two dimensional (although it may be just a single coefficient). The second array is always known as the kernel, as shown in Figure 9. If the image has rows and columns, and the kernel has rows and columns, then the size of the output image will consist of rows and columns. Mathematically we can write the convolution between the image and the kernel as where runs from to and runs from to .

From Figure 9 and (10), we can find that the calculation of different pixels in the output image is independent to each other. Therefore, the intensities of multiple output pixels can be computed in parallel in a hardware design. Since the input image reading and the output image writing are both in a sequential mode, the data storage in local memory can be avoided. Instead, the user logic can access the interconnect directly for fetching the source image and storing the result image. In order to optimize the hardware performance, the interconnect and the user logic are chained into a pipeline, and the data run through the pipeline as a stream. We call this architecture Streaming Data Transfer Mode, shown in Figure 10. Two DMA engines work in parallel to retrieve raw data blocks from and return result data blocks to the main memory. Under ideal circumstances, the reading DMA engine receives one raw data block from the input channel and the writing DMA engine puts one result data block to the output channel every clock cycle.

Being one stage of the overall pipeline, the design of the algorithm logic is parameterized by the characteristics of other components in the pipeline, that is, the data width of the interconnect between the FPGA device and the . Because the data width of the interconnect fabric is multiple-byte wide in general, several pixels are fed into the algorithm logic in the same clock cycle. To maximize the throughput of the overall architecture, the algorithm logic has to be capable of performing the operations of multiple pixels concurrently and taking new data input every clock cycle. In the following discussion, we assume that () the image is in 8-bit grayscale, () the image size is and the kernel size is , and () the data width of the interconnect is -byte. Furthermore, pixels in the original image are delivered into the algorithm logic in a stream, starting with the pixel at the top-left corner, ending with the pixel at the bottom-right corner.

The diagram of the algorithm logic is shown in Figure 11. The architecture consists of four components, one Line Buffer, one Data Window, an array of PEs, and the Data Concatenating Block.

The quantity of PEs is , that is, the data width of the interconnect in byte. Every PE is fully pipelined and is capable of taking a new input, that is, one block of pixels, every clock cycle. The output of one PE is the intensity of one pixel in the result image.

The Data Window is a two-dimension register array of in charge of providing image blocks to the downstream PEs. Analogously, the Data Window scans the original image from left to right and from top to bottom with a horizontal stride of pixels and a vertical stride of 1 pixel. Figure 12 demonstrates the scanning from left to right and shows the contents in two consecutive steps of the data window. Once the data window receives a valid input, that is, one block of pixels, it does a -byte left shift for all rows with the input tailing the right most columns.

The Line Buffer is a two-dimension register array of . The original image is transferred into the FPGA as a stream. However, PEs request image blocks that spread different rows. The purpose of the line buffer is to keep all pixels in registers until one block forms. One block, shown in gray in Figure 11, comprises the new arrived pixels and another pixels that reside at the head of every row of the line buffer. Every time the line buffer receives pixels of the original image, it performs two actions simultaneously. The first action is to deliver the new formed image block to the data window. The second action is to do a -byte left shift in a zigzag form, in which two neighbor rows are linked together by connecting their tail and head.

The design of the Data Concatenating Block is straightforward because what it does is to concatenate the outputs of the upstream PEs together. Once it receives valid output from the PEs, that is, consecutive pixels in the output image, it sends them into the output channel.

In general, the four components in Figure 11 form a pipeline chain and each of them is fully pipelined as well. This architecture is able to accept pixels of the original image every clock cycle and output pixels of the result image every clock cycle at the same time. In case of a multiple-stage algorithm, such as the four-stage Canny edge detector, various stages can be chained together and each stage consists of these components with different parameters and functionalities. Under ideal scenario, it would take clock cycles to perform an edge detection operation on an input image. However, the real performance is upper-bounded by the sustained bandwidth of the interconnect in general.

5. Implementation and Results

These two image registration algorithms and a Canny edge detection algorithm have been implemented on the Cray XD1 reconfigurable computer with Xilinx XC2VP50 FPGA devices. On the Cray XD1 platform, each FPGA device is connected to four local SRAM modules, 4 MB each, as shown in Figure 13. Every local memory module has separate reading and writing ports connected to the FPGA device, and is able to accept reading or writing transactions every clock cycle. Furthermore, the interconnect has separate input and output channels, both of them are 64-bit wide. The maximum operating clock rate for the user logic is 200 MHz, which is achieved in all our hardware implementations.

For the exhaustive search approach, the reference image and the test image are stored in two separate local memory modules. Every time, two possible tuples of (,,) are tested, that is, two different rigid-body transformations are performed on the test image simultaneously and produce two transformed test images, and , which are stored in two additional local memory modules separately. The calculation of the correlation coefficients between these two images and the reference image is carried concurrently as well. All the hardware modules are fully pipelined in our design in order to realize the full potential of a hardware implementation. Compared with the performance of the software implementation running on a single microprocessor, an AMD Opteron 2.4 GHz, the performance of the hardware implementation on a single FPGA device is approximately 10 better, as shown in Table 2. The performance improvement of hardware implementation compared with software implementation is mainly due to the fully pipelined hardware design. For example, during the correlation coefficient calculation step, it only takes one clock cycle for hardware to calculate 8 pixels. On the other hand, the software implementation has to calculate one pixel each time. The measured time on the Cray XD1 is the end-to-end time including data transfer time between the and the FPGA and the hardware processing time on the FPGA. In the 16.193 s, data transfer time is merely 6 . Therefore, the performance of hardware implementation in this case is almost upper-bounded by the available number of local memory banks since only 45% of the FPGA slices are used. As shown in (5), the hardware processing time is reciprocal to the number of memory banks that are used to store transformed test images. Therefore the speedup of hardware implementation compared with software version is linearly proportional to the number of local memory banks.

For the DWT-based approach, three levels of DWT are applied on both the test image and the reference image before the registration process starts. All the original image and decomposed images of different levels are stored in local memory modules, as depicted in Figure 8. Since the whole process consists of two steps, the DWT decomposition and the search, the hardware utilization almost doubles in this case. More built-in multipliers are used in the DWT decomposition as well. Compared with the software implementation, the hardware is barely 2 faster for the DWT-based search approach. The comparatively low speedup achieved by hardware in this case is due to the fact that the amount of computation is significantly reduced because of the DWT decomposition.

For both cases, the hardware implementation is mainly characterized by the local memory architecture, and the performance can be improved if more processing concurrency is allowed, given that more independent local memory modules are available.

For the category of applications that use the interconnect for data access, a Canny edge detection algorithm is implemented following the architecture in Figure 11. The Canny edge detector comprises four stages in which each stage takes the output from the preceding stage and feeds its output to the following stage as follows:

Because the interconnect is 8-byte wide, 8 edge detection operators are implemented and execute in parallel. Figure 14 shows the original image and the corresponding output after applying the Canny edge detection algorithm. In Canny edge detection algorithm, the processing of each pixel in the output image involves neighbor pixels in the input image. Furthermore, this processing consists of 4 stages and takes hundreds of clock cycles as latency. Due to the fully pipelined design in hardware implementation, the FPGA device is capable of computing 8 pixels in the output image every clock cycle no matter how complicated the computation of each pixel is. On the other hand, the pixels in the output image are computed one by one in the software implementation. Further, it would take thousands of cpu cycles to compute one pixel. Therefore hardware implementation of the Canny edge detection algorithm achieves 544 speedup compared with the corresponding software implementation. Higher speedup can be achieved if multiple images can be processed simultaneously given several interconnect channels are connected to the same FPGA co-processor and there are enough hardware resources to implement multiple instances of edge detection operators.

6. Conclusions

In this paper, we demonstrate how the parallelism of a hardware design on reconfigurable computers is parameterized by the co-processor architecture, particularly the number and the data width of local memory banks and the interconnect. Image registration algorithms based on rigid-body transformation are adopted as a case study to represent applications that use local memory. Two related; however, different algorithms, the exhaustive search algorithm and the DWT-based search algorithm, are described in detail. For the exhaustive search algorithm, the performance is linearly proportional to the available number of local memory banks. On the other hand, the DWT-based search algorithm improves the efficiency by applying DWT decomposition on both the reference and test images before the search in order to reduce the search scope. Compared with software implementations, hardware implementations of exhaustive search and DWT-based search achieve 10 and 2 speedup, respectively. For the category of applications that directly access the host interconnect, edge detection is selected as a case study. A streaming data transfer mode in which the user logic and the interconnect are chained into a pipeline is proposed. A user logic hardware architecture whose parallelism is decided by the data width of the interconnect is discussed in detail. A Canny edge detection application following the proposed architecture is capable of achieving 544 speedup compared with the corresponding software design.

As a future work, we will extend our work to the Tile 64 platform [21]. Parameters such as the external memory bandwidth and the intertile bandwidth will be taken into account to implement image processing algorithms. We expect to take advantages from Tile 64's capability of creating multicore pipelines internally.


This work was supported in part by Arctic Region Supercomputing Center (ARSC) at the University of Alaska Fairbanks.