Abstract

Accurate real-time motion estimation is critical to many computer vision tasks. However, because of its computational power and processing speed requirements, it is rarely used in real-time applications, especially on micro unmanned vehicles. In our previous work, an FPGA system was built to compute optical flow vectors at 64 frames per second. Compared to software-based algorithms, this system achieved a much higher frame rate but only marginal accuracy. In this paper, a more accurate optical flow algorithm is proposed. Temporal smoothing is incorporated in the hardware structure, which significantly improves the algorithm's accuracy. To accommodate temporal smoothing, the hardware structure is composed of two parts: the derivative (DER) module produces intermediate results and the optical flow computation (OFC) module calculates the final optical flow vectors. Software running on a built-in processor on the FPGA chip directs the data flow and manages the hardware components. This new design has been implemented on a compact, low-power, high-performance hardware platform for micro unmanned vehicle applications. It processes 15 frames per second with much improved accuracy. Higher frame rates can be achieved with further optimization and additional memory space.

1. Introduction

Optical flow measures the motion field from the apparent motion of the brightness pattern. It is one of the most important descriptions of an image sequence and is widely used in 3D vision tasks such as motion estimation, structure from motion (SfM), and time-to-impact estimation. An accurate, hardware-friendly optical flow algorithm is proposed and its hardware design is presented in this paper. This design can be used for unmanned air vehicle (UAV) navigation, moving object detection, and other applications which require fast and accurate motion estimation.

The basic assumption of optical flow algorithms is the brightness constancy constraint, which assumes that image brightness changes are due only to motion. In other words, if motion between frames is small, other effects causing brightness changes (such as changes in lighting conditions) can be neglected. However, one limitation of optical flow is its computational requirement. The processing time of existing optical flow algorithms is usually on the order of seconds or longer per frame. This long processing time prevents optical flow algorithms from being used in most real-time applications such as autonomous navigation for unmanned vehicles.

In recent years, a number of different schemes have been proposed to implement optical flow algorithms for real-time applications. The basic idea behind them is to use pipelined or parallel processing architectures to speed up computation. For example, graphics processing unit (GPU) devices have recently been used for optical flow implementation with good results [1–3]. Alternatively, Correia and Campilho [4] proposed a design which can process the Yosemite sequence in 47.8 milliseconds using a pipeline image processor. Compared to GPUs or general-purpose processors, FPGAs are more suitable for small UAV operation because of their compact size and low power consumption. FPGAs have been used to calculate optical flow [5–13] in the past few years because of their configuration flexibility and high data processing speed. For example, the iterative algorithm proposed by Horn and Schunck [14] was implemented on FPGAs in [5–7]. The classical Lucas and Kanade approach [15] was also implemented in [7, 8] and provided a tradeoff between accuracy and processing efficiency.

To improve accuracy and provide high-performance solutions, many different optical flow algorithms have been developed in the last two decades. During this time, 3D tensor techniques have shown their superiority in producing dense and accurate optical flow fields [16–19]. The 3D tensor provides a powerful, closed-form representation of the local brightness structure. In our previous work, a tensor-based optical flow algorithm was implemented on an FPGA to process 640 × 480 images at 64 frames per second.

Despite the fact that hardware-based optical flow designs can achieve much higher processing rates than software, they are usually less accurate. Using the design in [12] as an example, its accuracy measured in angular error is 12.7° on the Yosemite sequence, while state-of-the-art software algorithms with sufficient memory space can achieve accuracy close to 1.0° on the same sequence. There are two main reasons for this gap.

(1) Algorithm limitations: many software-based algorithms use iteration and apply various optimization techniques to find optimal values. However, hardware is best suited to algorithms which can be pipelined and processed in parallel. Therefore, only a very limited subset of the available software-based algorithms is a good fit for hardware, and sacrificing accuracy for processing speed is inevitable in hardware implementations.

(2) Hardware resource limitations: for optical flow algorithms, smoothing is a very important operation for suppressing noise and extracting accurate motion information. However, the smoothing schemes used in software are not directly suitable for hardware implementation, so a tradeoff between accuracy and feasibility must be made. Moreover, software-based algorithms mostly use floating-point computations. When implementing them in hardware with fixed-point arithmetic, data truncation, rounding, and saturation operations lower the algorithm's accuracy.

Our goal in this paper is to improve the accuracy of our previous hardware optical flow design while preserving processing speed and hardware feasibility. The design described herein uses temporal smoothing in addition to spatial smoothing to improve the accuracy and stability of the algorithm. To accommodate temporal smoothing (which is not trivial in hardware), a new hardware organization is used. The hardware pipeline is decomposed into two modules: the first calculates derivative frames for each image frame, and the second calculates optical flow from the derivative frames. The experimental analysis shows that this design achieves an angular error of 6.7° on the Yosemite sequence, roughly halving the error of our previous work [12]. The temporal smoothing prevents a fully pipelined design and requires additional memory bandwidth, leading to a processing rate of 15 frames per second for 640 × 480 images. Higher frame rates could be achieved with further optimization and additional memory space. However, this frame rate is sufficient for many unmanned vehicle applications that require higher accuracy.

The proposed algorithm is implemented on a newly developed standalone hardware platform. Images captured by the on-board CMOS camera are processed by the hardware and can be transferred to a PC for debugging and evaluation purposes.

This paper is organized as follows. In Section 2, the algorithm is formulated and our modifications are introduced. In Section 3, the hardware structure and tradeoffs made in the design are discussed. In Section 4, the hardware platform is introduced. Performance analysis of the proposed design on synthetic and real sequences is also shown. Conclusions and future work are discussed in Section 5.

2. Optical Flow Algorithm

2.1. Algorithm Description

An image sequence can be treated as volume data g(x, y, t), where x and y are the spatial components and t is the temporal component. According to the brightness constancy constraint, object movement in the spatiotemporal domain will generate brightness patterns with certain orientations. The 3D tensor is a compact representation of local orientation, and there are different types of 3D tensors based on different formulations [20, 21]. The gradient tensor is used in this design because it is easier to implement with pipelines in hardware.

The outer product O of the averaged gradient is defined as

O(x) = (Σ_i w_i ∇g(x_i)) (Σ_i w_i ∇g(x_i))^T,   (2)

where ∇g = (g_x, g_y, g_t)^T is the spatiotemporal gradient, the x_i are pixels in a small neighborhood of x, and w_i are the weights for averaging the gradients. The gradient components are calculated using the simple five-tap derivative mask shown in (3), which is the same as in [12] and was chosen as a tradeoff between accuracy and efficiency. The gradient tensor T is a positive semidefinite 3 × 3 matrix that is constructed by weighting O in a small neighborhood as

T(x) = Σ_i c_i O(x_i),   (4)

where c_i are the weights of the tensor construction mask.

Optical flow v = (v_x, v_y)^T is measured in pixels per frame and can be extended to a 3D spatiotemporal vector ṽ = (v_x, v_y, 1)^T. For an object with only translational movement and without noise in the neighborhood, ṽ^T T ṽ = 0. In the presence of noise and rotation, ṽ^T T ṽ will not be zero. Instead, ṽ can be determined by minimizing ṽ^T T ṽ. A cost function can be defined for this purpose as

e(ṽ) = ṽ^T T ṽ.   (5)

The velocity vector is the 3D spatiotemporal vector ṽ which minimizes the cost function at each pixel. Writing the symmetric tensor as T = [T_11 T_12 T_13; T_12 T_22 T_23; T_13 T_23 T_33], the initial optical flow can be solved in closed form as

(v_x, v_y)^T = −[T_11 T_12; T_12 T_22]^{-1} (T_13, T_23)^T.   (6)

The initial optical flow vector is smoothed in a local neighborhood to suppress noise further. The final optical flow vector is formulated as

v̄(x) = Σ_i m_i v(x_i),   (7)

where m_i are the weights of the velocity smoothing mask.
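For reference, the computation in (2)–(7) can be sketched in a few lines of NumPy. This is a software illustration only: the uniform averaging windows below stand in for the Gaussian masks w_i, c_i, and m_i, the window sizes follow Section 2.2, and the function name and eps regularization are illustrative rather than part of the hardware design.

```python
from scipy.ndimage import uniform_filter

def tensor_optical_flow(gx, gy, gt, eps=1e-9):
    """Per-pixel tensor-based optical flow from derivative frames
    (assumed to be temporally smoothed already, as in the OFC pipeline)."""
    # (2): average the gradient over a 5 x 5 spatial neighborhood (w_i).
    gxa = uniform_filter(gx, size=5)
    gya = uniform_filter(gy, size=5)
    gta = uniform_filter(gt, size=5)

    # Outer product of the averaged gradient; only the entries of the
    # symmetric 3 x 3 matrix needed for the solve are kept.
    o11, o12, o13 = gxa * gxa, gxa * gya, gxa * gta
    o22, o23 = gya * gya, gya * gta

    # (4): weight the outer products over a 3 x 3 neighborhood (c_i) to form T.
    t11 = uniform_filter(o11, size=3)
    t12 = uniform_filter(o12, size=3)
    t13 = uniform_filter(o13, size=3)
    t22 = uniform_filter(o22, size=3)
    t23 = uniform_filter(o23, size=3)

    # (6): minimizing e(v~) = v~^T T v~ with v~ = (vx, vy, 1)^T gives a
    # 2 x 2 linear system in (vx, vy); solve it in closed form.
    det = t11 * t22 - t12 * t12 + eps
    vx = (t12 * t23 - t22 * t13) / det
    vy = (t12 * t13 - t11 * t23) / det

    # (7): smooth the initial flow over a 7 x 7 neighborhood (m_i).
    return uniform_filter(vx, size=7), uniform_filter(vy, size=7)
```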

The algorithm proposed in this paper is similar to the one in [17], which assumes a constant motion model. An affine motion model is often used to incorporate tensors in a small neighborhood [17], where pixels in the neighborhood are assumed to belong to the same motion model. To conserve hardware resources, the constant model is used in this design; it performs almost as well as the affine model when operating in a small neighborhood.

2.2. Smoothing Masks

As mentioned in Section 2.1, smoothing is very important to the algorithm's performance. Smoothing is performed with the three masks wi, ci, and mi shown in (2), (4), and (7), where wi is used for gradient averaging, ci for tensor construction, and mi for velocity smoothing. The first mask is a spatiotemporal smoothing mask, and the other two are spatial smoothing masks.

To reach an optimal tradeoff between algorithm performance and hardware resources, the configurations of these masks were evaluated carefully by simulation before implementation. The smoothing masks are determined by three factors: mask shape, mask size, and kernel coefficients. In software, large smoothing masks (e.g., 19 × 19 or larger) are often used. In hardware, smaller masks (e.g., 7 × 7 or smaller) must be used because of resource limitations. As for mask shape, a square mask is usually used for the sake of simplicity and efficiency. Spatial and temporal smoothing can use different mask sizes to improve performance. While spatial smoothing can be readily performed in either software or hardware, spatiotemporal smoothing is much more challenging in hardware than in software. Temporal smoothing is significantly more complicated than spatial smoothing because it involves multiple image frames. However, temporal smoothing is important for enforcing motion field consistency over a short period of time, and incorporating it substantially improves algorithm performance, as shown in the experimental analysis section. In this design, the size of the first smoothing mask wi in (2) is 5 × 5 × 3, where 3 is the temporal smoothing size (three frames). The number of frames used for temporal smoothing is largely determined by hardware resources and processing speed requirements. The sizes of the other two spatial smoothing masks ci and mi are 3 × 3 and 7 × 7, respectively. All smoothing masks have Gaussian-shaped coefficients. To save hardware resources, each 2D Gaussian mask is decomposed into two 1D Gaussian masks which are cascaded and convolved along the x and y directions separately [12]. Different settings of these masks were simulated in software at bit-level accuracy and evaluated on a synthetic sequence with ground truth to obtain a combination that works well in practice.
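As a small illustration of this decomposition (the kernel length and sigma below are illustrative choices, not the exact hardware parameters), a 2D Gaussian mask is the outer product of two 1D Gaussians, so one 7 × 7 convolution (49 multiplies per pixel) can be replaced by two cascaded 7-tap 1D convolutions (14 multiplies per pixel):

```python
import numpy as np
from scipy.ndimage import convolve1d

def gaussian_kernel_1d(size=7, sigma=1.5):
    """1D Gaussian kernel, normalized to sum to one."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_separable(frame, size=7, sigma=1.5):
    """Apply a 2D Gaussian smoothing mask as two cascaded 1D convolutions,
    first along x (columns), then along y (rows)."""
    k = gaussian_kernel_1d(size, sigma)
    tmp = convolve1d(frame, k, axis=1)  # smooth along the x direction
    return convolve1d(tmp, k, axis=0)   # then along the y direction
```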

3. Hardware Structure

The diagram of the hardware platform, which uses a Xilinx Virtex-4 FX series FPGA to implement the optical flow algorithm discussed in Section 2, is shown in Figure 1. Most hardware components are connected to either the processor local bus (PLB) or the on-chip peripheral bus (OPB). The PLB connects hardware components that require high bandwidth, such as the camera and USB ports. The OPB connects relatively slower units such as the UART and the interrupt controller. The PLB and OPB are connected through a PLB2OPB bridge. The two main components of this design, the derivative (DER) module and the optical flow computation (OFC) module, are connected to the PLB through a bus interface. The DER and OFC modules share the SRAM through a multiport SRAM arbiter to obtain optimal access bandwidth. The slower SDRAM is accessed through the PLB.

The main difference between this design and our previous design [12] is that it incorporates temporal smoothing in the pipeline. In this design, three sets of the derivative frames gx, gy, and gt shown in (2) are temporally smoothed.

With temporal smoothing, multiple sets (three in this design) of derivative frames must be stored as they are calculated and then reloaded during the smoothing process. It is impossible to store multiple sets of derivative frames on-chip with the available hardware resources. Therefore, the optical flow calculation pipeline has to be divided into two steps. The first part (the DER module) generates derivative frames and the second part (the OFC module) handles the rest of the calculations. These two hardware modules must be managed to synchronize their computation tasks and handle exceptions such as dropped frames. Software running on the built-in on-chip PowerPC processor is used for this management task.

Figure 2 shows the diagram of the DER module. Each time a new image is captured directly into the SDRAM through the PLB bus, reading logic reads the captured image from the SDRAM into a pipeline and stores it in the SRAM. Whenever five consecutive image frames are stored in the SRAM, they are all read out to compute the derivative frames gx, gy, and gt using (3). gx and gy are calculated from the current incoming frame, and gt is calculated from the four previous frames and the current incoming frame. The resulting derivative frames are stored in the SRAM as well as the SDRAM for future use. The duplicate copy stored in the SDRAM is needed for temporal smoothing of future frames. Storing and retrieving derivative frames from the SDRAM consumes PLB bus bandwidth and hence slows down processing. As shown in Figure 1, if the hardware platform had sufficient SRAM (currently only 4 Mb), then all nine derivative frames (three sets of gx, gy, and gt) could be stored in the SRAM and take advantage of the high-speed multiport memory interface.
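A minimal software model of the DER stage is sketched below. The five-tap central-difference kernel is an assumption used for illustration; the actual mask in (3) follows [12] and its coefficients may differ.

```python
import numpy as np
from scipy.ndimage import convolve1d

# Assumed five-tap central-difference kernel; illustrative only.
DERIV_KERNEL = np.array([-1.0, 8.0, 0.0, -8.0, 1.0]) / 12.0

def der_module(frame_buffer):
    """Software model of the DER stage.

    frame_buffer: the five most recent grayscale frames (2D float arrays),
    oldest first.  gx and gy come from the newest frame; gt needs all five."""
    current = frame_buffer[-1]
    gx = convolve1d(current, DERIV_KERNEL, axis=1)  # horizontal derivative
    gy = convolve1d(current, DERIV_KERNEL, axis=0)  # vertical derivative

    # Temporal derivative: the same kernel applied across the five buffered
    # frames, pixel by pixel.
    stack = np.stack(frame_buffer, axis=0)          # shape (5, H, W)
    gt = np.tensordot(DERIV_KERNEL, stack, axes=1)  # weighted sum over time
    return gx, gy, gt
```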

Figure 3 shows the dataflow of the OFC module. Once a new set of derivative frames is calculated, software triggers the OFC module to start the optical flow calculation. The derivative frames for the current frame in the SRAM (gx, gy, and gt) and the two previous sets of derivative frames already stored in the SDRAM are first read into the pipeline for temporal smoothing. Because the size of the temporal smoothing mask is 3, derivative frames from three consecutive frames are averaged to obtain the temporally smoothed derivative frames (Figure 3). These frames are then spatially smoothed. The spatially smoothed derivative frames are then used to construct the tensors as shown in (4). Six tensor components t1, t2, t3, t4, t5, and t6 are obtained in this step. These six components are spatially smoothed using a 3 × 3 smoothing mask. Then, the velocity components vx and vy are calculated from these smoothed components as shown in (6). The two motion components vx and vy are spatially smoothed with the 7 × 7 mask mi to obtain the final optical flow vectors as shown in (7).
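The temporal smoothing step at the head of the OFC pipeline can be modeled as a weighted average over three derivative-frame sets; the weights below are illustrative stand-ins for the temporal part of the Gaussian mask wi:

```python
def temporal_smooth(der_t2, der_t1, der_t, weights=(0.25, 0.5, 0.25)):
    """Temporally smooth the derivative frames at the head of the OFC stage.

    der_t2, der_t1, der_t: (gx, gy, gt) tuples of arrays for the two previous
    frames (read back from SDRAM) and the current frame (read from SRAM)."""
    w2, w1, w0 = weights
    return tuple(w2 * a + w1 * b + w0 * c
                 for a, b, c in zip(der_t2, der_t1, der_t))
```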

Figure 4 shows the system software diagram. There are three types of frames in the system: (1) image frames captured by the camera; (2) derivative frames calculated by the DER module; (3) optical flow fields calculated by the OFC module. The DER module takes the raw images as input, and the OFC module takes the output of the DER module (derivative frames) as its input. Three linked lists are used to store these frames and maintain their temporal correspondence. A frame table entry (FTE) stores an image frame, a DER_FTE stores a set of derivative frames, and an OFC_FTE stores an optical flow frame. As shown in Figure 4, there are 5 corresponding pairs of FTE and DER_FTE and 3 pairs of DER_FTE and OFC_FTE.

It is noted that these modules execute asynchronously. When a new raw image is captured (FTE7 in this case), the camera core invokes an interrupt. This interrupt is sensed by the FTE interrupt handler in software and a trigger signal is generated and sent to the DER module to initiate a derivative computation. When a new set of derivative frames is calculated (DER_FTE4 in this case), the DER module invokes an interrupt. This interrupt is sensed by the DER_FTE interrupt handler in software and a trigger signal is generated and sent to the OFC module to initiate an optical flow computation.
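The bookkeeping performed by this software can be sketched as follows; the class and method names are hypothetical and only model the triggering logic, not the actual firmware API:

```python
from collections import deque

class FrameManager:
    """Models the triggering logic only; the real firmware programs the DER
    and OFC modules through their bus interfaces."""

    def __init__(self, trigger_der, trigger_ofc):
        self.ftes = deque(maxlen=5)       # raw image frames (FTE)
        self.der_ftes = deque(maxlen=3)   # derivative frame sets (DER_FTE)
        self.trigger_der = trigger_der    # callable: start a DER computation
        self.trigger_ofc = trigger_ofc    # callable: start an OFC computation

    def on_camera_interrupt(self, fte):
        """Camera core captured a new frame: queue it and, once five
        consecutive frames are available, trigger the DER module."""
        self.ftes.append(fte)
        if len(self.ftes) == 5:
            self.trigger_der(list(self.ftes))

    def on_der_interrupt(self, der_fte):
        """DER module finished a derivative set: queue it and, once three
        sets are available, trigger the OFC module."""
        self.der_ftes.append(der_fte)
        if len(self.der_ftes) == 3:
            self.trigger_ofc(list(self.der_ftes))
```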

4. Results

4.1. Hardware Platform

This design was implemented on the BYU Helios Robotic Vision board [22, 23]. This embedded vision sensor board was developed in the Robotic Vision Laboratory at Brigham Young University for real-time visual computing. The Helios board, shown in Figure 5(a), is compatible with Xilinx Virtex-4 FX series FPGAs, up to the FX60. The Virtex-4 FX60 has 25,280 slices and two built-in 400 MHz PowerPC processors, allowing for both custom hardware and software development. The Autonomous Vehicle Toolkit (AVT) daughterboard (Figure 5(c)), which supports up to two CMOS cameras (Figure 5(b)), was designed to extend the functionality of the Helios board. The CMOS camera (capable of acquiring 60 frames per second) was set to capture 640 × 480 8-bit color image data at 30 frames per second in this design.

For computer vision tasks, high-speed memory is critical. There are currently 32 Mb of SDRAM and 4 Mb of SRAM on the Helios platform. The SRAM is faster and easier to access than the SDRAM. In this design, we tried to store as much intermediate data as possible (e.g., image frames and derivative frames) in the SRAM. The rest of the data was stored in the SDRAM and accessed through the PLB bus. There are other hardware resources and features on the Helios that are useful for other real-time vision tasks but were not used in this design. This hardware platform is suitable for small UGVs or UAVs which require low-power, real-time computation capability. Detailed information can be found in [22, 23].

The whole system executes at a clock speed of 100 MHz in a single clock domain. At this frequency, it is able to calculate 15 optical flow fields per second. All computations use fixed-point representations to save hardware resources. Temporal smoothing requires much more data transfer than spatial smoothing, and the PLB bandwidth is the factor limiting processing speed. We estimate that higher frame rates can be achieved with further optimization work and additional memory space (specifically on-board SRAM). These processing speed improvements are discussed in Section 5. The performance analysis in the following sections shows that, with temporal smoothing, the accuracy of the estimated optical flow improves substantially on a synthetic sequence that is commonly used for benchmarking.

The whole design used 21,481 slices (84% of the 25,280 slices on a Virtex-4 FX60 FPGA). The DER module used 2,196 slices (8% of the total) and the OFC module used 6,324 slices (25% of the total). The remainder was used for the camera core, I/O, and other interface circuitry. A graphical user interface (GUI) was developed to transfer the status of the Helios board and the real-time video to a PC for display through the USB interface.

4.2. Synthetic Sequences

This design was tested on two synthetic sequences to show its effectiveness. One of these sequences (Yosemite) has an established ground truth and is commonly used for benchmarking. A bit-level simulation coded in MATLAB was used to evaluate the algorithm's accuracy. The error is measured as angular error, which is widely used for evaluating optical flow algorithms.
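For reference, the angular error treats each flow vector as a 3D spatiotemporal direction (vx, vy, 1) and measures the angle between the estimated and ground-truth directions; a sketch of the metric:

```python
import numpy as np

def mean_angular_error(u_est, v_est, u_true, v_true):
    """Mean angular error (degrees) between estimated and ground-truth flow,
    computed on the 3D direction vectors (u, v, 1)."""
    num = u_est * u_true + v_est * v_true + 1.0
    den = np.sqrt((u_est ** 2 + v_est ** 2 + 1.0) *
                  (u_true ** 2 + v_true ** 2 + 1.0))
    ang = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    return float(ang.mean())
```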

Figure 6(a) shows one frame of the Yosemite sequence with the calculated optical flow vectors superimposed for visual feedback. The angular error on the Yosemite sequence was 6.7°. To the best of our knowledge, this is the most accurate result generated to date using hardware, and it is roughly half the error of our previous design [12], which achieved 12.7°. There are two major reasons for this improvement. The first is that a better hardware platform (the Helios platform) was used, which supported the implementation of a more complex algorithm in hardware. The second is the incorporation of temporal smoothing in the algorithm. Without temporal smoothing, the error would increase to approximately 10.5°. This experiment demonstrates that temporal smoothing has a significant impact on algorithm performance. Trading some processing speed for accuracy through temporal smoothing is acceptable for applications that require higher accuracy.

Figure 6(b) shows the result of the other synthetic sequence (flower garden). While the proposed algorithm did not perform well in regions without much texture, this is a shortcoming of all optical flow and motion estimation algorithms. In regions with noticeable texture, the algorithm performed very well.

4.3. Real Sequences

Two real sequences were also used to test the performance of the proposed algorithm. Figure 7(a) shows the result on the SRI tree sequence, another standard benchmark sequence for testing optical flow algorithms on captured image data. It can be seen that the proposed algorithm performed very well on this sequence. The optical flow vectors were generally very smooth except for a few noisy vectors along the motion boundary. These were caused by violation of the assumption that the motion is constant in a local neighborhood, a shortcoming shared by all optical flow and motion estimation algorithms.

Figure 7(b) shows the result of the corridor sequence, an image sequence consisting of around 500 frames captured in our lab. As can be seen, there are noisy motion vectors in the background area which has less texture than other regions. In this case, even stronger smoothing would help suppress the noise and “infer” the velocity from the local neighborhoods.

The data width transferred between the Helios board and the GUI was 8 bits per pixel. To transfer the optical flow vector values, we defined the data as a signed 4.4-format fixed-point number, that is, 1 sign bit, 3 integer bits, and 4 fraction bits. Small optical flow vectors are therefore coded with gray intensity values. Large positive optical flow vectors (object moves down or to the right) are coded with bright intensity values. Similarly, large negative optical flow vectors (object moves up or to the left) are coded with dark intensity values. This coding mechanism was used to display the motion field in real time for debugging purposes.
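A sketch of this encoding is shown below; the offset mapping from the fixed-point value to a displayed gray level is an assumption for illustration:

```python
import numpy as np

def to_fixed_4_4(flow_component):
    """Quantize a flow component to a signed 4.4 fixed-point byte
    (1 sign bit, 3 integer bits, 4 fraction bits)."""
    q = np.round(np.asarray(flow_component) * 16.0)  # 4 fraction bits
    return np.clip(q, -128, 127).astype(np.int8)     # saturate to 8 bits

def to_gray(q):
    """Map the fixed-point value to an 8-bit gray level for display
    (assumed offset mapping: zero flow -> mid gray, positive -> bright,
    negative -> dark)."""
    return (q.astype(np.int16) + 128).astype(np.uint8)
```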

Figure 8(a) shows a screenshot of an original video frame that was captured using our hardware platform, transferred to the PC through the USB interface, and displayed in the GUI. The palm moved downwards from the top of the screen at a steady speed. Figure 8(b) shows the motion field (mostly in the vertical direction). Since the palm was moving downwards, the resulting optical flow vectors are coded with low intensity values as explained previously. The light gray intensity values in the background represent very small motion caused by the flickering of the fluorescent lights in the lab and digitization noise from the camera.

5. Conclusion

High computation and low power consumption requirements make it difficult to use general-purpose processors or GPUs to implement optical flow algorithms for real-time small unmanned vehicle applications. A hardware-accelerated design of a tensor-based optical flow algorithm has been developed and implemented on an FPGA platform. Two synthetic sequences and two real sequences were used to test its effectiveness. The accuracy on the Yosemite sequence was shown to be substantially better than our previous design and other hardware implementations found in the literature.

The accuracy improvement in this design is due to two factors. (1) A better hardware platform is used, making it possible to compute a more complex algorithm. (2) Temporal smoothing is incorporated into the calculation; it increases temporal coherency, provides stronger smoothing, and improves performance.

To accommodate temporal smoothing, the pipelined hardware structure is divided into two parts: the DER and OFC modules. Accordingly, the amount of image data that must be stored and transferred increases dramatically. Currently, this design runs slower (15 fps for 640 × 480 images) than our previous design. The limiting factor for both processing speed and accuracy is the PLB bus bandwidth. This limitation can be alleviated by reducing the image size or increasing the SRAM size. Processing speed could reach 60 fps if the image size were reduced to 320 × 240. Increasing the SRAM size would avoid storing and retrieving derivative frames from the slower SDRAM through the PLB bus. It would also allow the use of a larger temporal smoothing kernel, which would significantly improve the optical flow computation accuracy.

Our future work includes improving the hardware structure so that the design uses less bus bandwidth and can therefore achieve a higher frame rate. Another goal is to develop navigation algorithms for obstacle avoidance and implement the whole design on a UGV. Once this goal has been achieved, we will look into hallway and canyon navigation applications for UAVs.

Acknowledgment

This work was supported in part by David and Deborah Huber.