Abstract

The numerous benefits of real-time 3D awareness for autonomous vehicles have motivated the incorporation of stereo cameras into the perception units of intelligent vehicles. The availability of the distance between camera and objects is essential for such applications as automatic guidance and safeguarding; however, a poor estimation of the position of the objects in front of the vehicle can result in dangerous actions. There is an emphasis, therefore, on the design of perception engines that can make available a rich and reliable interval of ranges in front of the camera. The objective of this research is to develop a stereo head capable of capturing 3D information from two cameras simultaneously, sensing different, but complementary, fields of view. To do so, the concept of bifocal perception was defined and physically materialized in an experimental bifocal stereo camera. The assembled system was validated through field tests, and results showed that each stereo pair of the head excelled at a particular range interval. The fusion of both intervals led to a more faithful representation of reality.

1. Introduction

The advantages and flaws of stereoscopic vision systems have been described many times since compact cameras entered the arena of perception sensors. However, significant advances in electronics and processor speed have enhanced the benefits of stereo vision and diminished its disadvantages. When compared to monocular cameras, the most important advantage brought by stereo cameras is the availability of ranges, that is, the possibility of estimating distances between the camera and objects located in its field of view. The addition of the range to the two coordinates already available with monocular cameras makes it possible to register a three-dimensional (3D) point cloud, representing the scene more faithfully. In spite of this, access to the range can be a double-edged sword if ranges are measured outside an interval of acceptable reliability: if the baseline is shortened to focus on a short range, errors at long distances will increase, and if the contrary is done, the accuracy of short ranges will decrease [1]. The configuration of the camera, especially the baseline and the focal length of the lenses, determines the boundaries of this range interval. Rovira-Más et al. [2] studied the relationship between the camera configuration and the recommended range for safe operations. The fact that ranges are reliably measured only within a relatively narrow band often limits those applications that require a certain flexibility and versatility in perception.
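
For reference, the dependence of range reliability on camera configuration follows from standard stereo geometry (not derived explicitly in this text): the range Z recovered from a disparity d by a camera of focal length f and baseline B, and the growth of the range error with distance, can be written as

```latex
Z = \frac{f\,B}{d}, \qquad
\delta Z \approx \frac{Z^{2}}{f\,B}\,\delta d ,
```

where δd is the disparity resolution (typically a fraction of a pixel). The quadratic growth of δZ with Z is why a short baseline loses accuracy at long ranges, while a long baseline and long focal length sacrifice near-field coverage.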

The most popular applications of stereo to intelligent vehicles are autonomous guidance, safeguarding, localization, and (3D) mapping. The requirements of each particular operation are different and can even vary with time or vehicle speed. The challenge of navigating through well-structured crop rows differs to a great extent from the difficulties found by autonomous mobile robots operating in a manufacturing environment [3]. Autonomous driving, for instance, often entails a variable look-ahead distance according to the traveling speed. Safeguarding cannot be restricted to too narrow a set of ranges if protection needs to be assured in a wide variety of situations; detecting people around unmanned vehicles to facilitate a safe operation is one of the highest-priority issues in perception technology for autonomous navigation [4]. Simultaneous localization and mapping (SLAM) benefits from registering wide areas with each stereo pair, the wider the better, as long as the data acquired are consistent enough to be incorporated into the map. The introduction of global navigation satellite systems and vision has led to the adaptation of SLAM methods originally developed for indoors to outdoor environments [5]. The advantage of gathering three-dimensional information cannot be realized to its full extent unless the camera covers the necessary field of view for a given application. Figure 1 illustrates two fundamental perceptive needs faced by autonomous vehicles: protection around the vehicle, ensured by a safety range, and the location of guidance targets at intermediate look-ahead distances. This research intends to enlarge the dimensions of the space sensed by a stereo head to efficiently and simultaneously detect short and medium ranges.

Before enlarging the capacity of a stereo camera in terms of range potential, it is important to establish the expectable situations, or at least what scenarios are considered normal for a particular application. The farther the target ranges, the wider the stereo baseline has to be, which complicates the design of compact and light stereo systems. Therefore, it is important to define as accurately as possible the limits of the “projected field of view.” An extreme case of wide-baseline stereo was solved by Olson et al. [6] in their Mars rover, where kilometric distances were pursued, although they had to sacrifice real-time performance. In agricultural robotics, kilometric distances are not required, but what kind of perception is necessary inside conventional farm fields? Subramanian and Burks [7] provided an example of how accurate perception has to be in order to maneuver in an orchard with an autonomous vehicle. The vehicle navigated satisfactorily with maximum errors of 9 cm inside a path of 3.5 m width. Once the boundaries of the field of view have been approximately defined, the following step is to find the combination of baseline and optics to sense such a portion of space. RASCAL, an autonomous vehicle which participated in the DARPA Grand Challenge [8], was set to detect objects in a range interval of 5–25 m with a baseline of 30 cm and lenses of 8.5 mm focal length. A similar arrangement was used for an autonomous vehicle performing collision avoidance in on-highway driving [9]: a 30 cm baseline and 7.5 mm lenses to cover a range span between 2 and 20 m. These two vehicles managed to carry out the desired task with just one stereo camera. However, a situation more demanding in both accuracy and reliability could benefit from multicamera perception, although such applications are very rare.

Bostelman et al. [10] equipped a mobile field robot with a dual stereo vision system (two stereo cameras in the same frame). The objective was to develop two world models (WM1, WM2) simultaneously, with a different resolution grid and a constant number of cells. The total extent of the map was 40 m for WM1 and 120 m for WM2. Information was fused after integrating the other sensors in the map instead of merging it directly from the stereo cameras. Another example of multicamera stereo perception was reported by Broggi et al. [11]. For this application, three cameras were combined to provide three baselines (0.5, 1, and 1.5 m), although the optics were the same for all the cameras (6 mm lenses). The flexibility of the design, together with the good performance of the processing algorithm, resulted in the vehicle successfully finishing the DARPA Grand Challenge. The same objective could be reached by dynamically changing the focal length of the lenses instead of the baseline. The fact that there are two lenses per stereo device that need to have the same focal length increases the complexity of this solution, since variations of the focal length are carried out mechanically with zoom lenses. Nevertheless, a research team at the University of Central Florida has recently developed zoom lenses that can alter their focal length nearly instantaneously without changing the position of the lenses. These adaptive lenses are based on the ability of a liquid-crystal layer to alter the degree to which it refracts light when exposed to an electric field [12].

The objective of this investigation is to develop a stereo system capable of capturing 3D information from two different fields of view at the same time. In order to do so, the concept of bifocal perception was defined and an experimental system was assembled and tested.

2. Concept of Bifocal Perception

Bifocal lenses are a special kind of lens with two distinct areas, each one providing a different eyesight correction. These optics are designed for people who need assistance at both near and far distances and prefer a solution in a single lens. If the correction varies progressively, the lenses are called varifocal lenses. The human eye can automatically focus according to the portion of the lens the eye is looking through. Unfortunately, a camera cannot mimic this behavior, and typically each lens has a unique focal length f. As a result, bifocal perception can only be achieved in machine vision with at least two cameras: one lens covering short ranges and the other one in charge of sensing at long distances. Both cameras working concurrently cover a wider area of the target scene.

Different physical layouts can be devised to realize the concept of bifocal perception. If a compact stereo camera is assembled with two lenses of different focal lengths, far-distance and near-distance information can be acquired at the same time. Obviously, since the lenses have different focal lengths, the stereo effect is impossible to obtain. The system would function just as a superposition of two monocular cameras, losing the opportunity of retrieving stereo information, and therefore the 3D awareness. In addition, the conventional stereo vision calibration algorithms cannot be applied, and consequently the original images cannot be easily rectified for lens aberration. Each camera would require its own calibration test.

The best way of achieving long-range and short-range perception simultaneously in a single system is by mounting two stereo cameras in a perception head. This configuration is clearly more advantageous than the simple union of two monocular cameras sensing different fields: first, stereo data is available and either 2D or 3D information is obtainable for both fields of view; second, each stereo rig can be calibrated independently, with all the images properly rectified; third, stereo calibration is faster, easier, and usually more accurate than monocular camera calibration. The arrangement of two stereo cameras in a single perception head is the physical realization of the idea of bifocal stereo perception. A basic design for a bifocal stereo head is depicted in Figure 2(a), where B1 and B2 are the baselines for the short-range and long-range cameras, respectively. Such a layout can be further elaborated by increasing compactness and adding two lateral (monocular) cameras for side perception, as shown in Figure 2(b).

The basic idea behind the concept of bifocal stereo for intelligent vehicles is the capability of sensing at medium ranges (e.g., for guidance purposes) as well as at short ranges (for obstacle detection). In order to succeed in this endeavor, both fields of view must be different and, if possible, complementary. A typical configuration combines a 22 cm baseline and 16 mm lenses for the long-range camera with a 10 cm baseline camera equipped with 4 mm lenses for detecting short ranges. Given that the fields of view will have significantly different angles as a consequence of the large difference in focal lengths, the composed field of view can be homogenized through the concepts of density grids and validity box (as defined in [13]). Density grids are regular grids, either in two or three dimensions, where each cell is characterized by its three-dimensional density (d3D), defined as the number of validly stereo-correlated points per unit volume of the cell. When registering the grids, the overlap is easily eliminated by selecting consecutive validity boxes. Nevertheless, a certain overlap is recommended for redundancy purposes, which helps to verify that objects registered by both cameras have the same position and dimensions. Figure 3 illustrates the management and handling of the two fields of view sensed by a bifocal stereo camera through density grids and validity boxes.
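
As an illustration of how such a grid could be populated, the sketch below accumulates validly correlated points into cells and reports their 3D density; the struct names, cell dimensions, and box limits are illustrative assumptions, not the implementation of [13].

```cpp
#include <cmath>
#include <vector>

// Illustrative 3D point in ground coordinates (meters): X lateral, Y range, Z height.
struct Point3D { double x, y, z; };

// Hypothetical validity box: only points inside these limits are kept.
struct ValidityBox {
    double xMin, xMax, yMin, yMax, zMin, zMax;
    bool contains(const Point3D& p) const {
        return p.x >= xMin && p.x <= xMax &&
               p.y >= yMin && p.y <= yMax &&
               p.z >= zMin && p.z <= zMax;
    }
};

// Regular grid over the ground plane; each cell stores the number of validly
// correlated points, from which the three-dimensional density d3D is derived.
class DensityGrid {
public:
    DensityGrid(double cellSize, int cols, int rows, double cellHeight)
        : cellSize_(cellSize), cols_(cols), rows_(rows),
          cellHeight_(cellHeight), counts_(cols * rows, 0) {}

    void addPoint(const Point3D& p) {
        int c = static_cast<int>(std::floor(p.x / cellSize_)) + cols_ / 2;
        int r = static_cast<int>(std::floor(p.y / cellSize_));
        if (c >= 0 && c < cols_ && r >= 0 && r < rows_)
            ++counts_[r * cols_ + c];
    }

    // d3D of a cell: correlated points per cubic meter of cell volume.
    double density(int col, int row) const {
        return counts_[row * cols_ + col] / (cellSize_ * cellSize_ * cellHeight_);
    }

private:
    double cellSize_;    // cell side on the ground plane (m)
    int cols_, rows_;
    double cellHeight_;  // vertical extent considered per cell (m)
    std::vector<int> counts_;
};

int main() {
    ValidityBox box{-5.0, 5.0, 0.0, 15.0, 0.0, 10.0};  // short-range box (m), illustrative
    DensityGrid grid(0.5, 40, 30, 10.0);               // 0.5 m cells over a 20 m x 15 m area
    Point3D p{1.2, 7.4, 0.8};                          // one example correlated point
    if (box.contains(p)) grid.addPoint(p);             // box test before accumulation
    return 0;
}
```

In a bifocal head, the points of each stereo unit would first be tested against the validity box assigned to that unit, which is how the overlap between the two fields of view of Figure 3 would be trimmed, or deliberately kept for redundancy checks.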

3. System Architecture

Bifocal stereo requires two stereo cameras working independently, although their frame rate has to be high enough to register the two stereo pairs of images almost simultaneously. The perception head was assembled with two compact stereo cameras manufactured by Videre Design (Menlo Park, Calif, USA). One of the cameras has a fixed baseline of 22 cm, whereas the other features a variable baseline between 10 cm and 20 cm. Both cameras supported interchangeable lenses. Figure 4(a) shows the bifocal perception head employed in the experiments. Although one of the two cameras allowed for baseline variations, all the images captured with that camera were acquired with an 11 cm baseline. The choice of baselines and lenses followed from the objective of sensing short and medium ranges; therefore, the 11 cm baseline camera was equipped with 4 mm lenses, and the 22 cm baseline head supported 16 mm lenses. There are several ways to position one camera with relation to the other; for instance, they can share the same centerline, or, on the contrary, the reference lenses (left lenses) can be aligned one over the other. This detail is relevant because the final 3D cloud should have a unique center of coordinates, so a coordinate frame translation needs to be applied to the data coming from one of the two cameras. The schematic of Figure 4(b) represents the relative position between the reference lenses (left lens) of both stereo cameras, where Δx is the difference in X coordinates, Δz is the difference in Z coordinates, and (xf, yf, zf) are the ground coordinates of the point P acquired by the long-baseline camera. Since both rigs are coplanar, Y coordinates (representing ranges) do not need to be adjusted. In the system mounted and represented in Figure 4(a), the definitive center of camera coordinates was set at the left lens of the short-baseline camera, placed under the long-baseline camera. Consequently, the coordinates of the points detected by the long-baseline camera had to be translated according to (1), where (xf, yf, zf) are the ground coordinates registered by the camera set for far ranges and (x, y, z) are the ground coordinates in the merged 3D cloud.
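
The translation of (1) amounts to adding the fixed lens offsets Δx and Δz, measured on the assembled head, to every point registered by the long-baseline camera. A minimal sketch follows; the numeric values are placeholders, and the signs depend on the offset convention used when measuring the head.

```cpp
// Illustrative 3D point in ground coordinates (meters): X lateral, Y range, Z height.
struct Point3D { double x, y, z; };

// Offsets between the left lens of the long-baseline camera and the common
// origin at the left lens of the short-baseline camera. Placeholder values;
// the real offsets (and their signs) are measured on the assembled head.
const double kDeltaX = 0.0;  // difference in X coordinates (m), assumption
const double kDeltaZ = 0.0;  // difference in Z coordinates (m), assumption

// Translate a point acquired by the long-baseline (far-range) camera into the
// common frame. Y, the range, is untouched because both rigs are coplanar.
Point3D toCommonFrame(const Point3D& pFar) {
    return Point3D{ pFar.x + kDeltaX, pFar.y, pFar.z + kDeltaZ };
}
```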

Before both clouds were merged, the original camera coordinates were transformed to the ground coordinates represented in Figure 5. Since each camera was operated from a different computer, the final fusion of data took place after the data was logged; however, future implementations will consider the possibility of running the bifocal stereo head from a single processor, therefore obtaining the final 3D cloud as the sole output. A complete diagram of the system architecture is illustrated in Figure 6. The distinct feature of a variable baseline for one of the stereo units resulted in the need for two IEEE-1394 ports in one computer, achieved with a FireWire hub for port multiplication. This need was caused by the complete separation of left and right sensors to ensure mobility in the variable-baseline camera. A heavy-duty battery typically used in marine applications (recreational boats) guaranteed stable and durable power to run both computers and cameras. The computers executed the same specially programmed C++ software in a Windows environment.
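
The acquisition software itself is not listed in the paper; the sketch below only illustrates the structure of a per-computer capture-and-log loop, with StereoRig and its methods as hypothetical stand-ins for the manufacturer's capture and correlation library, not its actual API.

```cpp
#include <fstream>
#include <vector>

struct Point3D { double x, y, z; };  // ground coordinates (m)

// Hypothetical stand-in for one stereo rig; the real software relied on the
// camera manufacturer's capture/correlation library, which is not shown here.
struct StereoRig {
    // Acquire one synchronized stereo pair; return false when acquisition stops.
    bool grabFrame() { return false; /* placeholder */ }
    // Correlate the current pair and return the validly matched 3D points.
    std::vector<Point3D> correlate() const { return {}; /* placeholder */ }
};

int main() {
    StereoRig rig;                       // one rig handled per computer in this setup
    std::ofstream log("cloud_log.txt");  // clouds from both computers merged off-line
    while (rig.grabFrame()) {
        for (const Point3D& p : rig.correlate())
            log << p.x << ' ' << p.y << ' ' << p.z << '\n';
    }
    return 0;
}
```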

4. Design of Experiments

The goal of the experimental design is to demonstrate that bifocal stereo heads provide a richer and more robust level of perception than conventional binocular cameras without paying a high extra cost for it, either computational or economic. In particular, the following tests try to show that both cameras are complementary and that, by merging their three-dimensional information, the result is a denser cloud covering a wider interval of ranges, which is desirable for the perception engines of intelligent vehicles. The procedure envisioned analyzes the data coming from each sensor independently and generates a composed 3D cloud, where it is possible to check the completeness of the rendered scene and how well it matches the actual scenario. The quantitative analysis counts how many points fall in each 10 m interval (decameter) from the camera for each sensor. This determination was used to verify how both sensors complement each other in their perceptive capabilities. The qualitative analysis consisted of a visual confirmation that the 3D virtual image coincided with the real scene and therefore included the most important features located inside the field of view of the stereo head.
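
The decameter count used in the quantitative analysis can be sketched as follows, assuming the range is stored in the Y coordinate in meters; the function name and fixed 30 m span are illustrative.

```cpp
#include <array>
#include <vector>

struct Point3D { double x, y, z; };  // ground coordinates (m), Y is the range

// Count how many 3D points fall in each 10 m interval (decameter) from the camera.
std::array<int, 3> countPerDecameter(const std::vector<Point3D>& cloud) {
    std::array<int, 3> bins{0, 0, 0};  // 0-10 m, 10-20 m, 20-30 m
    for (const Point3D& p : cloud) {
        if (p.y >= 0.0 && p.y < 30.0)
            ++bins[static_cast<int>(p.y / 10.0)];
    }
    return bins;
}
```

Applied separately to the clouds of the short-range and medium-range units, counts of this kind correspond to the distributions plotted in Figures 9 and 13.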

The study of bifocal stereo was carried out through five experiments. Two situations were especially interesting to look into: first, the detection of objects separated far enough from each other to present a challenge if perceived by a conventional binocular camera; second, the perception of a continuous row of trees, where such continuity can be traced in the point cloud without any loss of relevant information when the medium-range area takes over from the short-range section. The availability of perception at two range levels, medium and short, should provide a rich representation of all objects located within the amalgamated field of view, and consequently, no lack of cohesion should be found when objects extend from one range level to the other.

5. Results

In order to explore the boundaries of the ranges covered by each camera, in Test 1 two objects were set far apart and captured by the bifocal stereo head. The configuration of the head was such that near objects were scanned with an 11 cm baseline and 4 mm lenses, whereas longer distances were sensed with a 22 cm baseline and 16 mm lenses. Figure 7(a) shows the left image captured by the short-range unit, where a person stood in front of the head at approximately 9.1 m (30 feet) from the image plane. Behind the target person there was a tree, which was considered the main target for the midrange unit, as shown in Figure 7(b). The distance between the bifocal head and the tree of interest was 23.5 m (77 feet); therefore, the gap between both targets was approximately 14.4 m. The 3D representation of the merged cloud, given in Figure 8(a), shows, as expected, an accumulation of points in two different areas: one near the camera and the other farther away. The points coming from the short-range unit covered the first 15 m fairly well, and therefore captured the person located at 9.1 m. The background tree, on the contrary, was well defined by the section of the cloud obtained with the 22 cm baseline unit.

The raw data from which navigation and awareness information is extracted are the point clouds represented in Figures 8, 11, and 12. The first stage in the signal processing protocol was the filter embedded in the stereo correlation software, which eliminated from the disparity image those pixels with a low probability of being correct. The second stage applied the concept of the validity box [13], which removed a small number of points that were obviously wrong, such as negative points (underground) or points more than ten meters above the ground (confusion with clouds). The third stage involves the processing of the cloud for decision making through the concept of density grids [13], which computes the 3D density in each cell, palliating the effect of outliers. This third step falls outside the scope of this paper and is therefore not shown in the included figures.
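
The second stage could be expressed as the simple filter sketched below; the height limits are those mentioned in the text (the ground plane and ten meters above it), while the struct and function names are illustrative.

```cpp
#include <vector>

struct Point3D { double x, y, z; };  // ground coordinates (m), Z is the height

// Second-stage filter: keep only points inside the validity box, discarding
// points below the ground plane and points more than ten meters above it.
std::vector<Point3D> applyValidityBox(const std::vector<Point3D>& cloud) {
    std::vector<Point3D> kept;
    kept.reserve(cloud.size());
    for (const Point3D& p : cloud) {
        if (p.z >= 0.0 && p.z <= 10.0)
            kept.push_back(p);
    }
    return kept;
}
```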

If the midrange sensor assembly provides reliable perception up to the tree located more than 20 m from the bifocal head, one might conclude that there is no need for the short-range unit. The side view depicted in Figure 8(b) provides an answer to this conjecture. The dark points representing the information gathered by the midrange rig show a noticeable set of noisy points for ranges between 5 and 20 m, where only empty space is expected, as demonstrated by Figure 7(b). The separation of space according to the optimal camera arrangement not only assures that the relevant objects are sensed with the best possible hardware, but also palliates the effect of noise in the final 3D point cloud. The selection of the proper density grid, as indicated in Figure 3, can help to make the perception engine more reliable.

The distribution of detected ranges for each baseline-lens combination is plotted in Figure 9. The points were counted for each interval of ten meters (decameter of study) from the camera. The plot shows that the two units that comprise the bifocal head complement each other to output a more regular cloud along the field of view. Looking at the images given in Figure 7, a decline in the number of 3D points is expected for the intermediate decameter, which mainly captures the empty space between the person and the tree. The first decameter summed up a total of 15671 points, the second one decreased to 6501 points, and the third one increased again to 9969 points. The number of points constitutes the “critical mass” of the perceived scene; if there are no points, there is no perception. The occurrence of points is a necessary condition to perceive an object, but it is not sufficient. The point cloud still needs to be processed to extract information robustly, because the reliability in the detection of an object cannot be indicated solely by the number of points; nevertheless, the number of points implies richness of perception, which is the primary condition to be met.

The visual capture of two targets separated by a distance on the order of 15 m can be effective even if the space that lies between them is not accurately sensed; after all, the focus of this experiment (Test 1) is exclusively on the objects rather than the space between them. A successful perception of the scene portrayed in Test 1 (Figure 7) can therefore mask a lack of continuity in the cloud for those ranges around the approximate boundary between the two studied areas; the transition between them should be smooth and coherent. Test 4 was one of the experiments designed to analyze this important case. Figure 10 represents a turf lane bounded by two rows of trees separated 6 m (19 feet). Figure 10(a) provides the left image captured by the short-range camera (B = 11 cm; f = 4 mm), whereas Figure 10(b) is the left image acquired by the long-baseline camera (B = 22 cm; f = 16 mm). Both images illustrate how regularly the trees are placed.

The 3D representation of the scene is shown by the point cloud of Figure 11, where the points obtained with the long-baseline camera are darker than the points generated by the short-baseline rig.

This composed view of the cloud gives an idea of the selective perception achieved through the concept of bifocal stereo, but the side view of Figure 12(b) demonstrates that the majority of the points accumulate in two adjacent range intervals: between 5 and 12 m, and between 12 and 20 m. The portion of space beyond 12 m from the bifocal head is not reliably sensed by the short-range camera, as can be seen in the drop of density shown in Figure 12(b). Likewise, the optimal range for the midrange camera is also indicated by the high concentration of the 3D cloud; outside these confidence intervals, noise is likely to occur. Finally, the front view of the complete scene, portrayed in Figure 12(a), confirms the consistency between both partial clouds; tree height and row spacing are equivalent for the clouds gathered with the two sensors comprising the bifocal head.

The distribution of points measured by decameters is graphed in Figure 13. This plot demonstrates again the high degree of complementarity between both sensors, improving the perception reliability in a range from 5 to 20 m. The first 10 m were represented by a total of 28971 points, and the second decameter totaled 21956 points. Between 20 and 30 m, only 5804 points gave a picture of the end of the row, which meant a sharp decline in the perception capabilities of the stereo head.

Table 1 summarizes the results found in the five tests designed to evaluate the bifocal stereo head. The superposition of perception zones took place in every case, following the tendency seen in Figures 9 and 13. On average, the camera set up for near ranges acquired 74% of the points located in the first ten meters, whereas 78% of the points falling between 10 and 30 m from the bifocal head were obtained by the long-baseline camera. Noise had an important effect on the point cloud, not only at excessively long ranges but also when the long-baseline camera sensed near ranges. Each sensor had a clearly marked area of recommended perception, and either excessive ranges or too-short distances resulted in noisy outcomes.

6. Conclusion

The novel concept of bifocal stereo is feasible and can be realized in practice at a reasonable cost and effort. A working head was assembled for this research project and evaluated through several field experiments with positive outcomes. Results proved that bifocal perception provides a more reliable and richer representation of the target scene than conventional binocular cameras covering range intervals of up to 30 m, as each camera can be set up to sense only over its recommended interval of ranges. Following this procedure, camera fields of view can be adjusted to register a unified and larger portion of space. In the particular case developed for this study, the fusion of both cameras covered the ranges in front of the head between 5 and 25 m. An envisioned implementation of this system on an autonomous vehicle would mount the stereo head on the front of the vehicle, a tractor cabin for example, and would process the perceived data in an independent processor fixed under the driver seat. The processor would filter the data and extract the significant information from the unified field of view. Based on the elaborated perception information, the processor would send navigation and safeguarding commands at a rate of at least 10 Hz to the vehicle actuators, that is, the brakes and the steering controller. The resolution of the grids is decisive in reaching this minimum frequency for safe navigation, as there is a tradeoff between resolution and processing speed. Data from other sources such as a laser rangefinder or a GPS receiver might also be integrated with vision data at the processor level for a more robust solution.

Several improvements can be introduced in the design of bifocal stereo heads to increase their compactness and efficiency: first, the relative location of the lenses can be arranged in a one-row configuration; second, a single computer can process all the information acquired from two, or more, stereo sensors instead of using an independent unit for each camera. The implementation of the entire system in an autonomous vehicle, to verify the advantages of bifocal stereo over conventional stereo in a real situation, remains for future projects.

Acknowledgments

The material presented in this paper was based upon work supported partially by the Spanish Ministry of Education and Science Funds (AGL2006-09656/AGR), the USDA Hatch Funds (ILLU-10-352 AE), and Bruce Cowgur Mid-Tech Memorial Funds. Any opinions, findings, and conclusions expressed in this publication are those of the authors and do not necessarily reflect the views of the University of Illinois, the Spanish Ministry of Education and Science, the USDA, and Midwest Technologies Inc.