#### Abstract

The robustness of the human visual system recovering motion estimation in almost any visual situation is enviable, performing enormous calculation tasks continuously, robustly, efficiently, and effortlessly. There is obviously a great deal we can learn from our own visual system. Currently, there are several optical flow algorithms, although none of them deals efficiently with noise, illumination changes, second-order motion, occlusions, and so on. The main contribution of this work is the efficient implementation of a biologically inspired motion algorithm that borrows nature templates as inspiration in the design of architectures and makes use of a specific model of human visual motion perception: Multichannel Gradient Model (McGM). This novel customizable architecture of a neuromorphic robust optical flow can be constructed with FPGA or ASIC device using properties of the cortical motion pathway, constituting a useful framework for building future complex bioinspired systems running in real time with high computational complexity. This work includes the resource usage and performance data, and the comparison with actual systems. This hardware has many application fields like object recognition, navigation, or tracking in difficult environments due to its bioinspired and robustness properties.

#### 1. Introduction

Bioinspired systems emulate the behavior of biological ones. Neuromorphic approximations [1] are based on the way how the nervous systems create physical architectures and computations, attending to the morphology, information coding, robustness against damage, and so on. Neuromorphic systems usually deliver good primitives for the building of more complex systems, being the output of each system simpler than its input. This data reduction helps in the task of integrating every response associated with all information channels [2].

Attending to the estimation of a pixel motion inside the image sequence, there are many models and algorithms that could be classified as belonging to the matching domain approximations [3], energy models [4], and gradient models [5]. Related to this last family, different studies [6–8] show that this represents an admissible choice for keeping a tolerable tradeoff between accuracy and computing resources. For designing systems operating efficiently, it is required to deal with many challenges, such as robustness, static patterns, illumination changes, different kinds of noise, contrast invariance, and so on. If bioinspirational behavior is required, that is, ability to detect correct motion related to optical illusions or avoiding operations like matrix inverse or iterative methods that are not biologically justified, we have to select carefully a model that carries out this kind of requirements. This is the Multichannel Gradient Model (McGM) [9–12].

Motivated by these previous results and analysis, we present the architecture and implementation of a customizable optical flow embedded processing core running in real time. This system works in the framework of a codesign scheme that is able to manage complex situations in real environments [13] better than other algorithms [14] and mimic some behavior of the mammalians [15].

This paper is organized as follows. First, the stages of McGM model are explained very briefly; after that, we tackle the precision study of every conceptual stage, obtaining a set of bit width values which models the filters and the bit width stage required to obtain results that match with the statistical error metric requirements. From this previous study, we design the customizable architecture implementation attending to the original model plus several hardware modifications in order to improve the feasibility of the system. An example of this is the design of IIR filters replacing the original FIR filters due to the memory limitation of the prototyping platform, or the use of several information channels with a few bit width, replicating the nature of the brain (large number of neurons with very little precision for a few channels with huge information capacity) [14]. After that, we explain the coarse pipeline processing architecture and the platform and language used in our systems. Finally, quality results, hardware associated cost, and comparisons to other implementations are shown.

#### 2. Multichannel Gradient Model (McGM)

The original algorithm was proposed by Johnston and Clifford, and we have applied Johnston’s description of the McGM model [9], while adding several variations to improve the viability of the hardware implementation. Figure 1 shows a simplified scheme of the algorithm.

##### 2.1. Iir Filtering

A temporal IIR filter is modeled
from its original FIR description due to the limitation of available memory in
our prototyping platform [15, 16]. The result
is a recursive filter with only two frames of latency, being *o* and *i* the output and input, respectively, of the filter and *a _{i}* ,

*b*the coefficients from our previous work [14, 15]: where ; , , and

_{i}*α*drives the peak in the temporal impulse response function. It is calibrated with a frame peak value equals to 10 following a critical flicker fusion limit of 60 Hz, according to the human visual system evidences [11]: Attending to the original algorithm, we need to perform the order zero, one and two derivatives, which represent our first triplet of information to be processed, as shown in Figure 1. The derivatives are obtained applying a gradient operator of minimal length (+1,1) to (1):

##### 2.2. Fir Spatial Filtering

A set of spatial FIR filters is modeled by the next impulse response corresponding to bidimensional Gaussians and their separable derivatives:

where represents the spread of
the Gaussian and is the
Hermite polynomial of order *n* . The
convolution is done in a separable way, taking derivatives in *x* and *y* directions up to sixth and second order, respectively, due to
bioinspired and robustness reasons [11–13]. The aim of
this stage is to cover enough spread area of information channels that allow us to contribute
to the calculus when any of them are null due to many reasons, such as noise.
Therefore, we have three spatial structures, each one containing a pyramidal
set of several filters corresponding to Gaussians and their different
derivatives.

##### 2.3. Steering Stage

The steering stage represents the
approach to projecting the space-temporal filters calculated in previous
stages, under the different orientations. Being *n* and *m* the order in *x* and *y* directions, respectively, the angle projected, *D* the derivative operator, and the Gaussian expression, we
obtain the general expression of the filter rotated in the space as a linear
combination of filters belonging to the same order basis [14]. Thus, we have to
apply this transformation to each value:

##### 2.4. Taylor Expansion Stage

In this stage a truncated Taylor expansion is done, substituting
it for the point on the space-time image in order to further enhance the
algorithm. To perform this, it is necessary to use each oriented filter
previously calculated. This expansion is highly versatile and represents a
robust information structure of the sequence in space and time:
With this, it is necessary to
differentiate each Taylor
expansion respect to *x*, *y*, *t*,
calling these derivatives *X*, *Y*, *T*,
and forming the following sextet of quotient as shown in the quotient stage:

##### 2.5. Quotient Stage (general Primitives) and Following Stages

This is the last stage belonging to
the common path, where a quotient of every sextet’s component is computed from
every measurement of the product of steered Taylor expansion differentiates:The architecture of the core is branched
in two separated ways, modulus and phase, with different bit operations working
independently, containing products, several quotients, and even trigonometric
operations as arctangent, which are performed in software. The details of the
software stages can be found in previous works [14, 15] being the final
aim to recover a dense representation of motion. Therefore, we have two values
for each input pixel corresponding to modulus and phase of the velocity, *that is*, velocity projection in *x* and *y* directions, following the next expressions

#### 3. Precission Study (Bit Width Analysis)

We have designed
a specific strategy to define the bit width required in each conceptual stage
following this previous algorithm. The basic idea is to transform every
calculus in the model, applying a chained process of quantization. For the sake
of clarity, if the parameters of the convolution are the bit width of the input *I*, the length of the filter *L*, the mask size *M,* we can compute the output bit width simply shifting the range the
output bit width *O*: Applying this
method in each stage, we obtain a set of values that throw back the
transformation between floating point domain and integer domain, getting a
tradeoff between bit width and affordable error.

As the metric
error value, we have proposed the most common ones used in the specialized
literature, such as Barron’s vector [8] and Galvin’s couple of metrics [6],
where *Vc* and *Ve* are the values of the correct and experimental velocities,
respectively, and is the
normal component to the Galvin vector difference: We have also
taken into account the simple error measures (absolutes and relatives) relative
to modulus and phase: Regarding the
stimuli, we have used synthetic compositions of sine waves of different spatial
frequencies and the famous stimulus of diverging tree and translating tree [17],
commonly used to evaluate optical flow. As a result, we obtain the set of
precision parameters that are applied in the model attending to the range of
affordable error. Figure 2 shows the bit width of the stages performed in
hardware, and Table 1 contains the final values chosen, for an FIR Blur filter
length of 5 pixels, FIR spatial filter of 23 pixels, and IIR temporal filter
equivalent to an FIR length of 21 frames, with a more detailed analysis
available in [14].

**(a)**

**(b)**

#### 4. Codesign Process

The system has been designed as a codesign process working with an asynchronous pipeline (micropipeline). The PC feeds the FPGA with a stream of frames through a bank of memory connected to PCI bus. The board takes a continuous stream of pixels at its input (1 byte/pixel); however, we employ 32 bits at the output, coming back to the PC, where they are reordered and written to the hard disk. We have selected Handel C to implement this core, using DK tool [18]. Relating to the prototyping board, an AlphaData RC1000 board has been used, which includes a Virtex 2000E-BG560 chip and 4 SRAM banks of 2 MB each [19]. The memory banks can be accessed both from the FPGA and the PCI bus, Figure 3 showing the communication scheme of the codesign system between the external memory banks, FPGA, and the host platform.

We have implemented a bit width precision defined version of the model, that we called “semihardware” version or SmHW; furthermore the next step is to implement different hardware cores for examining the tradeoff between accuracy and efficiency. We have developed in the FPGA two kinds of platforms that are called “basic” (HWbas) and “extended” (HWext) architectures. The SW version is implemented using the temporal FIR filtering, 24 orientations (each ), the SmHW version keeps the same number of orientations, although the implementation of the IIR filters and the Taylor Expansion is not completed (only are used the 65% of the weights). The basic architecture has one less order of spatial differentiation than the versions commented above, and it has only 18 orientations (each ), remaining the rest of the parameters constant. The extended architecture has one additional order less than the basic and also decreases in the number of orientations, taking 8 orientations (each ). Table 2 summarizes the main differences between these versions attending to the nature of temporal filter, the final spatial derivative order, the number of orientations, and the density of the weights used in the expansion.

#### 5. Results

We have analyzed the resources required by the platform and also the number of cycles (NCs) of each stage in Table 3. Every stage belonging to both architectures has been designed as customizable, scalable, and modular.

The basic architecture computes
initial blur filter in order to remove aliasing components, IIR temporal filtering
that performs the temporal derivatives, FIR spatial filtering, *that is,* spatial derivatives, and
steering filtering that project the results onto the whole space (the SW prefix
denotes that these stages are performed in software). This architecture contains
the processing scheme belonging to most of gradient-based optical flow models, thus
it could be considered as a motion preprocessor [15, 16]. The extended
architecture is able to cover more stages and is focused in the specific McGM
algorithm, implementing all the stages commented previously, plus a Taylor
expansion, Taylor product (their derivative products), and the quotient stage as shown
in Figure 4.

##### 5.1. Hardware Cost

The basic architecture consumes 41%
of the board slices, with every stage being performed with parameter values
very close to the original model (derivatives in *x* up to order 5, 18 orientations in the steering stage),
implementing 4 stages. Nevertheless, the extended architecture requires 97% of the
development board.

##### 5.2. Performance

Related to the number of cycles, we
have noted the Xilinx timing analyzer tool [20] to be very conservative; thus
we can increase the throughput around 25%–35% if we clock
the system manually from the values obtained. The slower stage in the
basic architecture is the FIR filtering, while the last stages designed need
the maximum number of Block RAMs and slices due to the computation being
performed replicating the spatial convolution (FIR filter) concurrently for *n* orientations until order *m* in *x*.
Nevertheless, in the extended architecture we must keep resources for the next
stages, removing some contributions and parallelizing the processing scheme in
discrete groups, which replace the whole group entirely concurrently. For
instance, the steering stage is performed with fewer terms and with reduced parallelization
level, requiring almost the double of cycles. Applying this strategy of keeping
enough resources in the prototyping board, we can extend the model to
additional stages. We can see in Figure 4 the global codesign scheme and the
two architectures involved, representing the transactions between external RAM
(grey blocks) and the stages. The stage corresponding to IIR filter has to keep
3 frames using the bank number one, the steering stage reads the orientation
weights from bank number three, and the send/receive modulus connects the
input/output data between the FPGA and the host system via the PCI bus using
DMA transfer. Figure 5 shows the performance for the whole systems using
chained stages, attending to the pixel/seconds processed, concluding that it is
possible to compute 177 frames/second with a resolution of pixels in the basic architecture, and
37.9 frames/second for the extended architecture.

##### 5.3. Quality of The Results

An accuracy analysis has been carried out, being possible to examine the quality of the results under different transformations and metrics, as we can see in Figure 6. The phase and modulus metrics (difference between values) show a good behavior regarding the implementation changes, while Barron’s metric seems to go well keeping the proportion accuracy under changes, but Galvin and Galvin perpendicular metrics suffer with the implementation change from SW to HW. It is due to the nature of the metric, which gives an idea about how the algorithm copes with the Aperture Problem [8], this topic being discussed in previous work [14]. Despite restricting every version in terms of precision parameters one step further until finally taking the extended architecture, in general the error values are delimited reasonably.

##### 5.4. Some Visual Results

Figure 7 shows some visual
results corresponding to different versions of our system, concretely SW versus
HWbas. It can be noted that while the SW version keeps a calculus density close
to 100% (middle row in Figure 7), HWbas loses some points due to precision bit
width (bottom row in Figure 7), *that is*,
the bit number of the parameters in each stage. The input sequence, called diverging
tree (upper row in Figure 7) has a divergent structure where the modulus is
supposed to vary poorly and the phase is changing regularly over . Since we
are working with synthetic sequences, we can estimate the error without any ambiguity.
Also we have used the translating tree sequence, where modulus is changing from
left to right and the phase is practically almost the same.

#### 6. Comparison with other Approaches

There are other gradient optical flow models implemented in hardware [21, 22], belonging to the Lucas and Kanade algorithms [23] and to Horn and Schunk approximations [24, 25], while in Table 4 we can see the average error for different metrics, although only we compare the Barron’s metric since the cited authors do not provide other measurements.

Attending to the errors, our
implementation provides better results than the other approaches, even with calculation
density 100%. Nevertheless, the final results are improved if the points where
the scene structure changes, *that is,* points smaller than
a determinate temporal derivative, are filtered. This is caused by a least
squares process being performed at the end of the algorithm for calculating the
modulus and the phase final values.
The points filtered would force the slope of the linear regression to be very
small, with the value of velocity is almost null.

Regarding throughput, we are able to calculate more than 2000 Kpixel/s in the basic Architecture and about 1000 Kpixel/s in the extended. It would locate our implementation between those in [23, 26], enough for real-time purposes, although it could be improved using a board with more resources that is used here and increasing the parallelism level.

The error using the diverging and translating tree sequences [17] is shown in Figure 7, and it is obtained with different metrics regarding the expressions (12)-(13).

#### 7. Conclusion

We have developed an FPGA-based implementation of a bioinspired robust motion estimation system with an associated complexity higher than those found in other gradient-based models commonly used in the literature. The study of precision calibrates the model and adjusts the bit width needed for keeping a tradeoff solution between accuracy and efficiency, acting as a bridge between software and hardware and estimating the cost to convert every stage from floating to fixed point. Taking the results from this precision study, different hardware moduli have been designed, organizing this in two high parallelized architectures. The first one, referred to as basic architecture and common to optical flow gradient models, is a superconvolutive processor orientated along multiple angles. It could be used as a starting point for many computer vision algorithms, not necessarily restricted to the motion estimation field, like change detection, stereo, or even biometry techniques such as real-time signature recognition. The second architecture, called extended, is focused in the Mutichannel Gradient Model, and includes the truncated Taylor expansion representation of space temporal information of the scene, its three differentiates respect space and time, and the quotients of the products of these last functions. The rest of the stages, called velocity primitives, corresponding to the expressions (8)-(9) are performed in software in the framework of a codesign process, where the final modulus value is a quotient of determinants and the final phase is an arctangent. This extension can be implemented using a board with more resources than the VIRTEX 2000 E and, depending on the accuracy required, using a structure based on LUTs or implementing a CORDIC core. Both architectures are scalable and modular, and also extensible to one device with more resources that our prototyping platform.

Additionally, the resources consumed have been evaluated as well as the throughput and the accuracy of the designed coprocessors. All models come forward with asynchronous segmented architectures (micropipelines). Regarding quality, the average error has been compared using Barron’s metric, since other authors do not provide results with other metrics; also the throughput of the design has been compared with other implementations. This work generates dense optical flow maps up to 80 frames/second and 185 frames/second for a resolution of in the extended and basic architectures, respectively. The present contribution opens the door to embed complex bioinspired systems that require a huge quantity of computation. We are currently improving the system to extend the model to a fully stand alone platform also to deal with stereo vision. Several application fields are though to use it, such as motion illusion detection or video compression.

#### Acknowledgments

This work was partially supported by Projects TEC2007-68074-C02-01/MIC, TIN2005-05619-2004-07032 (Spain), EU Project DINAM-VISION (DPI2007-61683), and an EU “Marie Curie” Fellowship (QLK5-CT-1999-50523). The authors would like to thank the anonymous reviewers for their insightful suggestions and Professor Johnston and Dr. Dale, from the Vision Group at University College London, for their help and support during this research.