The numerous benefits of real-time 3D awareness for autonomous vehicles have motivated the incorporation of stereo cameras into the perception units of intelligent vehicles. The availability of the distance between camera and objects is essential for applications such as automatic guidance and safeguarding; however, a poor estimate of the position of the objects in front of the vehicle can result in dangerous actions. There is an emphasis, therefore, on the design of perception engines that can make available a rich and reliable interval of ranges in front of the camera. The objective of this research is to develop a stereo head that is capable of capturing 3D information from two cameras simultaneously, sensing different, but complementary, fields of view. In order to do so, the concept of bifocal perception was defined and physically materialized in an experimental bifocal stereo camera. The assembled system was validated through field tests, and results showed that each stereo pair of the head excelled at a particular range interval. The fusion of both intervals led to a more faithful representation of reality.
1. Introduction
The advantages and flaws of stereoscopic vision systems have been described many times since compact cameras entered the arena of perception sensors. However, significant
advances in electronics and processor speed have enhanced the benefits of stereo vision and diminished its disadvantages. When
compared to monocular cameras, the most important advantage brought by stereo
cameras is the availability of ranges, that is, the possibility of estimating
distances between the camera and objects located in its field of view. The
addition of the range to the two coordinates already available with monocular
cameras implies the possibility of registering a three-dimensional (3D) point
cloud, representing the scene more faithfully. In spite of this, the access to
the range can be a double-edged sword if ranges are measured outside an interval
of acceptable reliability; if the baseline is shortened to focus on a short
range, errors at a long distance will increase, and if the contrary is done,
the accuracy of short ranges will decrease [1]. The configuration of the
camera, especially the baseline and focal length of the lenses, determines the
boundaries of such range interval. Rovira-Más et al. [2] studied the
relationship between the camera configuration and the recommended range for
safe operations. The fact that ranges are measured reliably only within a
relatively narrow band very often limits those applications that require a certain
flexibility and versatility in perception.
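The dependence of the reliable range interval on baseline and focal length can be illustrated with the ideal pinhole stereo model, in which the range is R = fB/d for a disparity d. The following Python sketch is only an illustration and is not taken from the cited works: the pixel pitch and the maximum disparity searched are assumed values, and the two configurations are the ones used later in the experiments (11 cm with 4 mm lenses, and 22 cm with 16 mm lenses).

```python
PIXEL_UM = 7.5          # assumed pixel pitch (micrometers), not a value from the text
MAX_DISPARITY_PX = 64   # assumed disparity search range, not a value from the text


def focal_in_pixels(focal_mm, pixel_um):
    """Focal length expressed in pixels."""
    return focal_mm * 1000.0 / pixel_um


def range_from_disparity(baseline_m, focal_mm, pixel_um, disparity_px):
    """Range (m) for a given disparity under the ideal pinhole stereo model."""
    return baseline_m * focal_in_pixels(focal_mm, pixel_um) / disparity_px


def range_resolution(baseline_m, focal_mm, pixel_um, range_m, disparity_step_px=1.0):
    """Approximate change in range (m) caused by one disparity step at range_m."""
    return range_m ** 2 * disparity_step_px / (
        baseline_m * focal_in_pixels(focal_mm, pixel_um))


def min_range(baseline_m, focal_mm, pixel_um, max_disparity_px):
    """Closest range (m) that still falls inside the disparity search window."""
    return range_from_disparity(baseline_m, focal_mm, pixel_um, max_disparity_px)


for label, baseline_m, focal_mm in (("short", 0.11, 4.0), ("long", 0.22, 16.0)):
    nearest = min_range(baseline_m, focal_mm, PIXEL_UM, MAX_DISPARITY_PX)
    print(f"{label}: nearest measurable range ~ {nearest:.2f} m")
    for range_m in (5.0, 10.0, 20.0):
        step = range_resolution(baseline_m, focal_mm, PIXEL_UM, range_m)
        print(f"  R = {range_m:4.1f} m -> one-disparity step ~ {step:.2f} m")
```

Under these assumptions the range step grows with the square of the range and shrinks with the product of baseline and focal length, while the nearest measurable range grows with that same product, which is the tradeoff described above.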
The most
popular applications of stereo to intelligent vehicles are autonomous guidance,
safeguarding, localization, and (3D) mapping. The requirements of each particular
operation are different and can even vary with time or vehicle speed. The
challenge of navigating through well-structured crop rows to a great extent
differs from the difficulties found by autonomous mobile robots operating in a
manufacturing environment [3]. Autonomous driving, for instance, often entails
a variable look-ahead distance according to the traveling speed. Safeguarding
cannot be restricted to too narrow a set of ranges if protection needs to be
assured in a wide variety of situations; detecting people around unmanned
vehicles to facilitate a safe operation is one of the highest priority issues
in perception technology for autonomous navigation [4]. Simultaneous
localization and mapping (SLAM) benefits from registering wide areas with each
stereo pair, the wider the better, as long as the data acquired is consistent
enough to be incorporated into the map. The introduction of global navigation
satellite systems and vision has led to the adaptation of SLAM methods
originally developed for indoors to outdoor environments [5]. The advantage of
gathering three-dimensional information cannot be fully realized
unless the camera can cover the necessary field of view for a given
application. Figure 1 illustrates two fundamental perceptive needs faced by
autonomous vehicles: protection around the vehicle ensured by a safety range, and the
location of guidance targets at intermediate look-ahead distances. This
research intends to enlarge the dimensions of the space sensed by a stereo head
to efficiently and simultaneously detect short and medium ranges.
Figure 1: Perceptive
needs for autonomous agricultural vehicles.
Before
enlarging the capacity of a stereo camera in terms of range potential, it is
important to establish the expected situations, or at least what scenarios
are considered normal for a particular application. The farther away the target ranges
are, the wider the stereo baseline has to be, which complicates the
design of compact and light stereo systems. Therefore, it is important to
define as accurately as possible the limits of the “projected field of view.” An
extreme case of wide baseline stereo was solved by Olson et al. [6] in their
Mars rover, where kilometric distances were pursued, although they had to
sacrifice real-time performance. In agricultural robotics, kilometric distances
are not required, but what kind of perception is necessary inside conventional
farm fields? Subramanian and Burks [7] provided an example of how accurate
perception has to be in order to maneuver in an orchard with an autonomous
vehicle. The vehicle navigated satisfactorily with maximum errors of 9 cm inside
a path of 3.5 m width. Once the boundaries for the field of view have been
approximately defined, the following step is to find the combination of
baseline and optics to sense such portion of space. RASCAL, an autonomous
vehicle which participated in DARPA Grand Challenge [8], was set to detect
objects in a range interval of 5–25 m with a
baseline of 30 cm and lenses of 8.5 mm focal length. A similar arrangement was made for an autonomous vehicle performing collision avoidance in on-highway
driving [9]: 30 cm baseline and 7.5 mm lenses to cover a range span between 2 and
20 m. These two vehicles managed to carry out the desired task with just one
stereo camera. However, a more demanding situation in both accuracy and
reliability could benefit from multicamera perception, although applications
of this kind are very rare.
Bostelman
et al. [10] equipped a mobile field robot with a dual stereo vision system (two
stereo cameras in the same frame). The objective was to develop two world
models (WM1, WM2) simultaneously with a different resolution grid and constant
number of cells (). The total extent of the map was 40 m for WM1 and
120 m for WM2. Information was fused after integrating the other sensors in the
map instead of merging it directly from the stereo cameras. Another example of
multicamera stereo perception was reported by Broggi et al. [11]. For this
application, three cameras were combined to provide three baselines (0.5, 1,
and 1.5 m) although the optics was the same for all the cameras (6 mm lenses).
The flexibility of the design, together with the good performance of the
processing algorithm, resulted in a successful driving of the vehicle,
finishing the DARPA Grand Challenge. The same objective could be reached by
dynamically changing the focal length of the lenses instead of the baseline.
The fact that there are two lenses per stereo device that need to have the same
focal length increases the complexity of this solution since variations of the
focal length are carried out mechanically with zoom lenses. Nevertheless, a
research team at the University of Central Florida has
recently developed zoom lenses that can alter their focal length nearly
instantaneously without changing the position of the lenses. These adaptive
lenses are based on the ability of a liquid-crystal layer to alter the degree
to which it can refract light when exposed to an electric field [12].
The
objective of this investigation is to develop a stereo system that is capable
of capturing 3D information from two different fields of view at the same time.
In order to do so, the concept of bifocal
perception was defined and an experimental system was assembled and
tested.
2. Concept of Bifocal Perception
Bifocal
lenses are a special kind of lenses with
two distinctive areas, each one having a different eyesight correction. This
special optics is designed for people who need assistance for both near and far
distances, and prefer a solution in just one single lens. If the correction
varies progressively, the lenses are then called varifocal lenses. The human eye can automatically focus according
to the portion of the lens the eye is looking through. Unfortunately, a camera
cannot mimic this behavior and typically each lens has a unique focal length f. As a result, bifocal perception can only be achieved in machine vision with at
least two cameras: one lens covering short ranges and the other one in charge
of sensing at long distances. Both cameras working concurrently cover a wider
area of the target scene.
Different physical realizations can be devised to
realize the concept of bifocal perception. If a compact stereo camera is
assembled with two lenses of different focal lengths, far distance and near
distance, information can be acquired at the same time. Obviously, since the
lenses have a different focal length, the stereo effect is impossible to
obtain. The system would function just as a superposition of two monocular
cameras, losing the opportunity of retrieving stereo information, and therefore
the 3D awareness. In addition, the conventional stereovision calibration
algorithms cannot be applied, and consequently the original images cannot be
easily rectified for lens aberration. Each camera would require its own
calibration test.
The best way of achieving long-range and short-range
perception simultaneously in a single system is by mounting two stereo cameras in a
perception head. This configuration is clearly more advantageous than the
simple union of two monocular cameras to sense different fields: first, stereo
data is available and either 2D or 3D information is obtainable for both fields
of view; second, each stereo rig can be calibrated independently, so that all the
images are properly rectified; third, stereo calibration is faster, easier, and
usually more accurate than the calibration of monocular cameras. The arrangement of
two stereo cameras in a single perception head is the physical realization of
the idea of stereo bifocal perception.
A basic design for a bifocal stereo head is depicted in Figure 2(a), where B1 and B2 are the baselines for the short-range and long-range cameras,
respectively. Such a layout can be further elaborated by increasing compactness
and adding two lateral (monocular) cameras for side perception, as shown in Figure
2(b).
Figure 2: Design of a
bifocal stereo head: (a) basic assembly; (b) compact model.
The basic idea behind the concept of bifocal stereo
for intelligent vehicles is the capability of sensing at medium ranges (e.g., for
guidance purposes) as well as at short ranges (for obstacle detection). In
order to succeed in this endeavor, both fields of view must be different and,
if possible, complementary. A typical configuration can be given by a 22 cm
baseline and 16 mm lenses for the long-range camera combined with a camera of 10 cm
baseline equipped with 4 mm lenses for detecting short ranges. Given that the
fields of view will have a significantly different angle as a consequence of
large differences in the focal lengths, the composed field of view can be
homogenized through the concept of density grids and validity box (as defined
in [13]).
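The difference in angular coverage between the two lens choices can be estimated with the pinhole relation FOV = 2 atan(w / 2f). The Python sketch below assumes a sensor width w of 4.8 mm (a nominal 1/3-inch format); this value is not given in the text and is used only to show the order of magnitude of the difference between 4 mm and 16 mm lenses.

```python
import math

SENSOR_WIDTH_MM = 4.8  # assumed sensor width, not a value from the text


def horizontal_fov_deg(focal_mm, sensor_width_mm=SENSOR_WIDTH_MM):
    """Horizontal field of view (degrees) of a single pinhole camera."""
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * focal_mm)))


def footprint_width_m(focal_mm, range_m, sensor_width_mm=SENSOR_WIDTH_MM):
    """Width (m) of the scene imaged by a single camera at a given range."""
    return range_m * sensor_width_mm / focal_mm


for focal_mm in (4.0, 16.0):
    fov = horizontal_fov_deg(focal_mm)
    width = footprint_width_m(focal_mm, 10.0)
    print(f"f = {focal_mm:4.1f} mm: FOV ~ {fov:.1f} deg, width at 10 m ~ {width:.1f} m")
```

These figures refer to one lens; the field shared by the two lenses of a stereo pair, where ranges can actually be computed, is somewhat narrower and depends on the baseline.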
Density grids are regular grids, either in two or three dimensions,
where each cell is characterized by its three-dimensional density (d3D),
defined as the number of validly stereo-correlated points per unit volume of
the cell. When registering the grids, the overlap is easily
eliminated by selecting consecutive validity boxes. Nevertheless, a certain
overlap is recommended for redundancy purposes, which helps to check that
objects registered by both cameras have the same position and dimensions.
Figure 3 illustrates the management and handling of the two fields of view
sensed by a bifocal stereo camera through density grids and validity boxes.
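A minimal Python sketch of these two ideas follows. It computes the 3D density of each cell as the number of points divided by the cell volume, and merges two clouds by keeping from each one only the points inside its own validity box. The function names, the cell size, and the box limits are hypothetical choices made for illustration; they are not the implementation of [13].

```python
from collections import Counter


def density_grid(points, cell_size_m):
    """Map each occupied cell index (i, j, k) to its 3D density in points per m^3."""
    volume = cell_size_m ** 3
    counts = Counter(
        (int(x // cell_size_m), int(y // cell_size_m), int(z // cell_size_m))
        for x, y, z in points
    )
    return {cell: n / volume for cell, n in counts.items()}


def in_box(point, box):
    """True if the point lies inside box = ((x0, x1), (y0, y1), (z0, z1))."""
    return all(low <= value < high for value, (low, high) in zip(point, box))


def merge_with_validity_boxes(short_cloud, long_cloud, short_box, long_box):
    """Keep from each cloud only the points inside its own validity box."""
    return ([p for p in short_cloud if in_box(p, short_box)]
            + [p for p in long_cloud if in_box(p, long_box)])


# Illustrative consecutive boxes: the second starts where the first ends.
SHORT_BOX = ((-10.0, 10.0), (0.0, 12.0), (0.0, 10.0))
LONG_BOX = ((-10.0, 10.0), (12.0, 30.0), (0.0, 10.0))
```

Consecutive boxes remove the overlap entirely; letting the range limits of the two boxes overlap by a few meters would keep the redundancy mentioned above.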
Figure 3: Data
management for bifocal stereo cameras with density grids.
3. System Architecture
Bifocal stereo requires two stereo cameras working
independently, although their frame rate has to be high enough to register the
two stereo pairs of images almost simultaneously. The perception head was
assembled with two compact stereo cameras manufactured by Videre Design (Menlo Park, Calif, USA).
One of the cameras has a fixed baseline of 22 cm whereas the other camera
features a variable baseline between 10 cm and 20 cm. Both cameras supported
interchangeable lenses. Figure 4(a) shows the bifocal perception head employed
in the experiments. Although one of the two cameras allowed for baseline
variations, all the images captured with that camera were acquired with an 11 cm
baseline. The choice of baselines and lenses responded to the objective of sensing
short and medium ranges; therefore, the 11 cm baseline camera was equipped with
4 mm lenses, and the 22 cm baseline camera supported 16 mm lenses. There are several
ways to position one camera with relation to the other; for instance, they can
share the same centerline, or, on the contrary, the reference lenses (left
lenses) can be aligned one over the other. This simple detail is relevant
because the final 3D cloud should have a single origin of coordinates, and a
coordinate frame translation needs to be applied to the data coming from one of
the two cameras. The schematic of Figure 4(b) represents the relative position
between the reference lenses (left lenses) of both stereo cameras, given by the difference in X coordinates and the difference in Z coordinates, together with the ground coordinates of the
point P acquired by the long-baseline
camera. Since both rigs are coplanar, Y coordinates (representing ranges) do not need to be adjusted. In
the system mounted and represented in Figure 4(a), the origin of
camera coordinates was set at the left lens of the short-baseline camera,
placed under the long-baseline camera. Consequently, the coordinates of the points
detected by the long-baseline camera had to be translated according to
(1), which maps the ground coordinates registered by the
camera set for far ranges to
the ground coordinates of the merged 3D cloud.
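A translation of this kind can be sketched as follows. The notation is assumed here for illustration: (x_f, y_f, z_f) are the ground coordinates given by the far-range camera, (x, y, z) are the coordinates in the merged cloud, and Δx and Δz are the signed offsets of the long-baseline reference lens with respect to the short-baseline reference lens, so the sign of each term depends on the axis orientation of Figure 5.

```latex
\begin{equation*}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix} x_f \\ y_f \\ z_f \end{bmatrix}
+
\begin{bmatrix} \Delta x \\ 0 \\ \Delta z \end{bmatrix}
\end{equation*}
```

The zero in the second row reflects that both rigs are coplanar and ranges need no adjustment.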
Figure 4: (a) Bifocal head used in the experiments; (b) relative
position between reference lenses.
Before both clouds were merged, the original camera
coordinates were transformed to the ground coordinates represented in Figure 5.
Since every camera was operated from a different computer, the final fusion of
data took place after the data was logged; however, future implementations will
consider the possibility of running the bifocal stereo head from a single
processor, and therefore obtaining the final 3D cloud as the sole output. A
complete diagram of the system architecture is illustrated in Figure 6. The
distinct feature of a variable baseline for one of the stereo units resulted in
the necessity of two IEEE-1394 ports in one computer, achieved with a FireWire
hub for port multiplication. This need was caused by the complete separation of
left and right sensors to ensure mobility in the variable-baseline camera. A
heavy-duty battery typically used in marine applications (recreational boats)
guaranteed stable and durable power to run both computers and cameras. The
computers executed the same custom-programmed C++ software in a Windows
environment.
Figure 5: Definition of ground coordinates.
Figure 6: System architecture for bifocal stereo camera.
4. Design of Experiments
The goal of the experimental design is to demonstrate
that bifocal stereo heads provide a richer and more robust level of perception
than conventional binocular cameras without incurring a high extra cost,
either computational or economic. In particular, the following tests try to
show that both cameras are complementary and that merging their three-dimensional
information results in a denser cloud covering a wider interval of ranges,
which is desirable for the perception engines of intelligent vehicles. The procedure
envisioned analyzes the data coming from each sensor independently, generating
a composed 3D cloud, where it is possible to check the completeness of the
rendered scene and how well it matches the actual scenario. The proposed
quantitative analysis counts how many points fall in each 10-meter interval
(decameter) from the camera for each sensor. This count was used to
verify how both sensors complement each other in their perceptive capabilities.
The qualitative analysis consisted of a visual confirmation that the 3D virtual
image coincided with the real scene and therefore included the most important
features located inside the field of view of the stereo head.
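The quantitative count can be sketched in a few lines of Python. The sketch assumes that each point is an (x, y, z) tuple in ground coordinates with y as the range, and that each cloud is kept separate by sensor; the function names are hypothetical and the sample data in the test are invented.

```python
from collections import Counter


def points_per_decameter(cloud, bin_m=10.0):
    """Count points per range interval; key 0 is 0-10 m, key 1 is 10-20 m, etc."""
    return dict(Counter(int(y // bin_m) for _, y, _ in cloud if y >= 0))


def share_by_sensor(short_cloud, long_cloud, bin_m=10.0):
    """Percentage of points contributed by each sensor in every range interval."""
    short_counts = points_per_decameter(short_cloud, bin_m)
    long_counts = points_per_decameter(long_cloud, bin_m)
    shares = {}
    for k in sorted(set(short_counts) | set(long_counts)):
        n_short = short_counts.get(k, 0)
        n_long = long_counts.get(k, 0)
        total = n_short + n_long
        shares[k] = (100.0 * n_short / total, 100.0 * n_long / total)
    return shares
```

A pair such as `(75.0, 25.0)` for key 0 would mean that three quarters of the points in the first decameter came from the short-range camera.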
The study of bifocal stereo was carried out through
five experiments. Two different situations were especially interesting to
examine: first, the detection of objects far enough apart to
present a challenge if perceived by a conventional binocular camera; second,
the perception of a continuous row of trees, where such continuity should be
traceable in the point cloud without any loss of relevant information when the
medium-range area takes over from the short-range section. The availability of
perception at two range levels, medium and short, should provide a rich
representation of all objects located within the amalgamated field of view, and
consequently, no lack of cohesion should be found when objects extend from one
range level to the other.
5. Results
In order to explore the boundaries of the ranges
determined by each camera, in Test 1 two objects were set far apart and
captured by the bifocal stereo head. The configuration of the head was such
that near objects were scanned with an 11 cm baseline and 4 mm lenses whereas
longer distances were sensed with a 22 cm baseline and 16 mm lenses. Figure 7(a) shows the left image captured by the short-range unit, where a person stood in
front of the head at approximately 9.1 m (30 feet) from the image
plane. Behind the target person there was a tree, which was considered the main
target for the midrange unit, as shown in Figure 7(b). The distance between the
bifocal head and the tree of interest was 23.5 m (77 feet); therefore, the
gap between both targets was approximately 14.4 m. The 3D representation of the
merged cloud, given in Figure 8(a), shows, as expected, an accumulation of
points in two different areas: one near the camera and the other further away.
The points coming from the short-range unit covered fairly well the first 15 m,
and therefore captured the person located at 9.1 m. The background tree, on the
contrary, was well defined by the section of the cloud obtained with the 22 cm
baseline unit.
Figure 7: Test 1: (a) left image taken with the short-baseline camera; (b) left image
captured with the long-baseline camera.
Figure 8: Test 1: (a) 3D point cloud; (b) side view of 3D point cloud.
The raw data from which navigation and awareness
information is extracted are the point clouds represented in Figures 8, 11, and 12. The first stage in the signal processing protocol was the filter embedded
in the stereo correlation software which eliminated from the disparity image
those pixels with low probability of being correct. The second stage applies
the concept of validity box [13], which removes a small quantity of points that
are obviously wrong, such as negative points (underground) or points ten meters
above the ground (confusion with clouds). The third stage involves the processing of
the cloud for decision making through the concept of density grids [13], which computes
the 3D density in cells, mitigating the effect of outliers. This third step
falls outside the scope of this paper and therefore is not shown in the
included figures.
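The second stage can be sketched as a simple bounds check. The following Python example assumes (x, y, z) ground coordinates with y as range and z as height above the ground; the height limits follow the description above, whereas the 30 m range limit and the function name are assumptions made for illustration rather than the settings of [13].

```python
def validity_box_filter(cloud, z_min=0.0, z_max=10.0, y_min=0.0, y_max=30.0):
    """Split a cloud into the points inside the validity box and the rejected ones."""
    kept, rejected = [], []
    for point in cloud:
        _, y, z = point
        if z_min <= z <= z_max and y_min <= y <= y_max:
            kept.append(point)
        else:
            rejected.append(point)
    return kept, rejected


cloud = [(0.0, 5.0, 1.0), (0.0, 5.0, -0.5), (1.0, 20.0, 12.0), (0.0, 40.0, 1.0)]
kept, rejected = validity_box_filter(cloud)
print(f"kept {len(kept)} of {len(cloud)} points")
```

Returning the rejected points as well makes it easy to check that only a small fraction of the cloud is discarded at this stage.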
If the midrange sensor assembly provides reliable
perception up to the tree located further than 20 m from the bifocal head, it
might lead to the conclusion that there is no need for the short-range unit.
The side view depicted in Figure 8(b) provides an answer to such conjecture.
The dark points representing the information gathered by the midrange rig show
a noticeable set of noisy points for ranges between 5 and 20 m, where only empty
space is expected as demonstrated by Figure 7(b). The separation of space
according to the optimal camera arrangement not only assures that the relevant
objects are sensed with the best possible hardware, but can also mitigate the
effect of noise in the final 3D point cloud. The selection of the proper density
grid, as indicated in Figure 3, can help to make the perception engine more
reliable.
The distribution of detected ranges for each baseline-lenses
combination is plotted in Figure 9. The points were counted for each interval
of ten meters (decameter of study) from the camera. The plot shows that the two
units that comprise the bifocal head complement each other to output a more
regular cloud along the field of view. Looking at the images given in Figure 7,
it is expected to find a decline in the number of 3D points for the
intermediate decameter, which mainly captures the empty space between the
person and the tree. The first decameter contained a total of 15671 points, the
second one decreased to 6501 points, and the third one increased again to 9969
points. The number of points constitutes the “critical mass” of the perceived
scene; if there are no points, there is no perception. The occurrence of points
is a necessary condition to perceive an object, but it is not sufficient. There
is a need to process the point cloud to extract information robustly because
the reliability in the detection of an object cannot be solely indicated by the
number of points, but, evidently, the number of points implies richness of
perception, which is the primary condition to be met.
Figure 9: Test 1: distribution of points in the complete 3D point cloud.
The visual capture of two targets separated by a
distance on the order of 15 m can be effective even if the space that lies between them is not
accurately sensed; after all, the focus of this experiment (Test 1) is,
exclusively, on the objects rather than the space between them. A successful
perception for the scene portrayed in Test 1 (Figure 7) can mask a lack of
continuity in the cloud for those ranges around the approximate boundary
between the two studied areas; the transition between them should be smooth and
coherent. Test 4 was one of the experiments designed to analyze such an important
case. Figure 10 represents a turf lane bounded by two rows of trees separated by 6 m
(19 feet).
Figure 10(a) provides the left image captured by the short-range camera (B = 11 cm;
f = 4 mm), whereas Figure 10(b) is the left image acquired by the long-baseline
camera (B = 22 cm; f = 16 mm). Both images illustrate how regularly the trees are
placed.
Figure 10: Test 4: (a) left image from short-baseline camera; (b) left image from
long-baseline camera.
Figure 11: Test 4: 3D point cloud.
Figure 12: Test 4: (a) front view of 3D point cloud; (b) side view of 3D point cloud.
The 3D representation of the scene is shown by the
point cloud of Figure 11, where the points obtained with the long-baseline
camera are darker than the points generated by the short-baseline rig.
This composed view of the cloud gives an idea of the
selective perception achieved through the concept of bifocal stereo, but the
side view of Figure 12(b) demonstrates that the accumulation of the majority of
the points occurs at two adjacent range intervals: between 5 and 12 m, and
between 12 and 20 m. The portion of space beyond 12 m from the bifocal head is
not reliably sensed by the short-range camera, as can be seen in the drop
of density shown by Figure 12(b). Likewise, the optimal range for the midrange
camera is also indicated by the high concentration of the 3D cloud; outside
these confidence intervals, noise is likely to occur. Finally, the front view
of the complete scene, portrayed in Figure 12(a), confirms the consistency
between both partial clouds; tree height and row spacing are equivalent for the
resulting clouds gathered with the two sensors comprising the bifocal head.
The distribution of points measured by decameters is
graphed in Figure 13. This plot demonstrates again the high degree of
complementation between both sensors to improve the perception reliability in a
range from 5 to 20 m. The first 10 m were represented by a total of 28971 points,
and the second decameter totaled 21956 points. Between 20 and 30 m, only 5804
points gave a picture of the end of the row, which meant a sharp decline in
the perception capabilities of the stereo head.
Figure 13: Test 4: distribution of points according to their origin.
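The point counts reported for Tests 1 and 4 can be turned into the share of the merged cloud that falls in each decameter. The short Python sketch below uses only the totals given in the text; the function name is hypothetical.

```python
def shares_percent(counts):
    """Percentage of the total represented by each count, rounded to one decimal."""
    total = sum(counts)
    return [round(100.0 * c / total, 1) for c in counts]


# Points in the first, second, and third decameter, as reported in the text.
test1_counts = [15671, 6501, 9969]
test4_counts = [28971, 21956, 5804]

print(shares_percent(test1_counts))  # [48.8, 20.2, 31.0]
print(shares_percent(test4_counts))  # [51.1, 38.7, 10.2]
```

In Test 4 roughly 90% of the points lie within the first 20 m, which matches the sharp decline described for the 20 to 30 m interval.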
Table 1 summarizes the results found in the five tests
designed to evaluate the bifocal stereo head. The superposition of perception
zones took place in every case, following the tendency seen in Figures 9 and 13. On average, the camera setup for near ranges acquired 74% of the points
located in the first ten meters, but 78% of the points falling between 10 and
30 m from the bifocal head were obtained by the long-baseline camera. There was
an important effect of noise on the point cloud, not only found with too far
ranges, but also generated by the long-baseline camera when sensing near
ranges. Each sensor had a clearly marked area of recommended perception and
either excessive ranges or too short distances resulted in noisy outcomes.
Table 1: Distribution of points (%) for each camera according
to the range interval and test.
6. Conclusion
The novel
concept of bifocal stereo is feasible and can be realized in practice at a
reasonable cost and effort. A working head was assembled for this research
project and evaluated through several field experiments with positive outcomes.
Results showed that bifocal perception provides a more reliable and richer
representation of the target scene than conventional binocular cameras covering
range intervals within 30 m, as each camera can be set up to sense only
its recommended interval of ranges. Following this procedure, camera fields
of view can be adjusted to register a unified and larger portion of space. In
the particular case developed for this study, the fusion of both cameras
covered the ranges in front of the head between 5 and 25 m. An envisioned
implementation of this system on an autonomous vehicle would mount the stereo
head on the vehicle front, a tractor cabin, for example, and would process the
perceived data in an independent processor fixed under the driver seat. The
processor would filter the data and extract the significant information from
the unified field of view. Based on the elaborated perception information, the
processor would send navigation and safeguarding commands at least at 10 Hz to
the vehicle actuators, that is, brakes and steering controller. The resolution
of the grids is decisive for reaching this minimum frequency for safe
navigation, as there is a tradeoff between resolution and processing speed.
Data from other sources such as a laser rangefinder or a GPS receiver might
also be integrated with vision data at the processor level for a more robust
solution.
Several
improvements can be introduced in the design of bifocal stereo heads to
increase their compactness and efficiency: first, the relative location of the
lenses can be arranged in a one-row configuration; second, a single computer can process all the
information acquired from two, or more, stereo sensors instead of using an
independent unit for each camera. The implementation
of the entire system in an autonomous vehicle, to verify the advantages of
bifocal stereo over conventional stereo in a real situation, remains for future projects.
Acknowledgments
The material presented
in this paper was based upon work supported partially by the Spanish Ministry
of Education and Science Funds (AGL2006-09656/AGR), the USDA Hatch Funds
(ILLU-10-352 AE), and Bruce Cowgur Mid-Tech Memorial Funds. Any opinions,
findings, and conclusions expressed in this publication are those of the
authors and do not necessarily reflect the views of the University of Illinois,
the Spanish Ministry of Education and Science, the USDA, or Midwest
Technologies Inc.