Abstract

Developing matching algorithms that obtain correct disparity maps from stereo image pairs for 3D reconstruction has been the focus of intensive research. An algorithm with constant computational complexity, which aggregates dissimilarity for disparity assessment using separable successive weighted summation (SWS) along the horizontal and vertical directions, was previously proposed, but its results are still not satisfactory. This paper presents a novel method that enables decoupled dissimilarity measures in the aggregation, further improving the accuracy and robustness of stereo correspondence. The aggregated cost is also used to refine disparities through a local curve-fitting procedure. According to our experimental results on the Middlebury benchmark evaluation, the proposed approach performs comparably to the selected state-of-the-art algorithms and achieves the lowest overall mismatch rate. In addition, the refinement procedure is shown to preserve object boundaries and depth discontinuities while smoothing out the disparity maps.

1. Introduction

Stereo vision is the technique of constructing a 3D description of a scene from stereo image pairs, which is important in many computer vision tasks such as inspection [1], 3D object recognition [2], robot manipulation [3], and autonomous navigation [4]. Stereo vision systems can be active or passive. Active techniques utilize ultrasonic transducers, structured light, or lasers to simplify the stereo matching problem. On the other hand, passive stereo vision, which relies only on stereo image pairs, is less intrusive and typically provides a compact and affordable solution for range sensing.

For passive stereo vision systems, stereo matching algorithms, which find for each pixel in one image the corresponding pixel in the other image, are crucial for correct and accurate depth estimation. A 2D map of the displacements between corresponding pixels of a stereo image pair is called a disparity map [5].

Reference [6] is a widely cited taxonomy of stereo matching algorithms for rectified image pairs. The paper decomposes most algorithms into four sequential steps: matching cost calculation, cost aggregation, disparity computation, and disparity refinement. Among these steps, cost aggregation largely determines the performance of an algorithm in terms of computational complexity and accuracy.

Cost aggregation can be local [7-12] or global [13-16], depending on the extent of the supporting regions or windows. Global methods assume that the scene is piecewise smooth and search for disparity assignments over the whole stereo pair [6], which is computationally expensive. Local methods, also known as window-based methods, typically require less memory and computation. As a result, window-based algorithms are popular for fast disparity calculation [17].

Local methods, however, tend to be sensitive to noise, and their accuracy in regions with sparse texture or near depth discontinuities relies on a proper selection of the window size. To overcome this problem, [7] proposed variable windows for matching calculation, while [8] proposed multiple windows to enhance accuracy in regions near depth discontinuities. Nevertheless, the performance of these approaches is limited, since the same aggregation weights are applied over the windows.

In recent years, adaptive support weight approaches [9] have been proposed to improve the quality of disparity maps. Unfortunately, these approaches require an independent support weight calculation for each pixel, which dramatically increases the computational complexity.

To simplify the computation, [10] introduced a joint histogram to reduce the disparity search region, and [11] proposed the use of a sparse Census mask. A summed normalized cross-correlation was proposed in [12] to calculate the matching cost in two stages. Segmentation and plane fitting on disparity planes [13-16] are also popular for improving disparity accuracy, but their performance relies on the correctness of both the segmentation and the plane fitting.

An effective local stereo matching algorithm is introduced in [18], which significantly simplifies the intensity-dependent aggregation procedure of local methods. The algorithm aggregates cost values in the manner of bilateral filtering using only four passes over the image, called separable successive weighted summation (SWS), eliminating iteration and support-area dependency. However, the dissimilarity measures are coupled, which significantly restricts the flexibility in weighting the aggregated costs.

In this paper, we present an improved stereo matching algorithm. Similar to [18], our algorithm uses whole regions as matching primitives to assess disparity based on SWS along the horizontal and vertical directions. We also use basic metrics, namely the truncated absolute difference and the truncated absolute gradient difference, as dissimilarity measures to provide a trade-off between accuracy and complexity.

The main contribution of this paper is to provide a decoupled aggregation algorithm to assess the stereo matching cost under the framework of SWS. The algorithm is simple yet efficient and robust. In addition, the resultant disparity map lies in a discrete space, which is unsuitable for image-based rendering. We therefore propose a subpixel refinement technique that employs inferior candidate disparities, rather than spatial neighbors, to smooth out the discrete values in the disparity map. With this arrangement of the curve fitting, even regions near depth discontinuities can be correctly refined. Moreover, the technique increases the resolution of a stereo algorithm with marginal additional computation.

2. Aggregation Algorithm Design

Our stereo matching algorithm consists of four main stages. First, initial cost values are calculated from the dissimilarity measures between pixels in the reference and target images, and the costs are aggregated using the proposed method. Second, we perform initial disparity estimation using a winner-takes-all minimum search over the aggregated costs. Third, we check the differences between the disparity values of corresponding pixel pairs to detect occluded regions and patch them with the smallest disparity values of nearby regions. Finally, the disparity map is refined by the proposed curve-fitting procedure.

2.1. Cost Definition

Assuming that the image pair is rectified and horizontally aligned, two dissimilarity measures between the pixels on the reference image and the target image are used in this work.

The truncated absolute difference cost, C_AD, is defined as the truncated absolute difference between the intensity of a pixel on the reference image and that of the corresponding pixel shifted by d pixels along the horizontal direction:

    C_AD(x, y, d) = min( Σ_{c ∈ {R, G, B}} |I_c^r(x, y) − I_c^t(x − d, y)|, T_AD ),    (1)

where I_c^r and I_c^t are the intensities of the pixels on the reference and target images with c corresponding to the three color channels R, G, and B, (x, y) are pixel coordinates, and T_AD is a threshold value for C_AD. The use of a threshold to restrict the cost value has been a well-adopted practice to reduce the effects of noise and potential mismatches in occluded regions.

Besides, the truncated absolute gradient difference cost is defined as

    C_GRAD(x, y, d) = min( Σ_{c ∈ {R, G, B}} ( |∇x I_c^r(x, y) − ∇x I_c^t(x − d, y)| + |∇y I_c^r(x, y) − ∇y I_c^t(x − d, y)| ), T_GRAD ),    (2)

where ∇x and ∇y are the horizontal and vertical gradient operators, respectively, and T_GRAD is a threshold value for C_GRAD.
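To make the two measures concrete, the following Python/NumPy sketch computes the truncated cost volumes for a rectified pair. The channel-wise summation, the (d_max + 1, H, W) volume layout, and the function and variable names are our own assumptions for illustration; only the truncation by T_AD and T_GRAD follows the definitions above.

import numpy as np

def truncated_cost_volumes(ref, tar, d_max, t_ad, t_grad):
    """Truncated absolute-difference and absolute-gradient-difference costs.

    ref, tar : (H, W, 3) float arrays, the rectified reference/target pair.
    Returns two cost volumes of shape (d_max + 1, H, W); pixels whose match
    would fall outside the target image keep the truncation value.
    """
    h, w, _ = ref.shape
    gx_ref, gy_ref = np.gradient(ref, axis=1), np.gradient(ref, axis=0)
    gx_tar, gy_tar = np.gradient(tar, axis=1), np.gradient(tar, axis=0)

    c_ad = np.full((d_max + 1, h, w), float(t_ad))
    c_grad = np.full((d_max + 1, h, w), float(t_grad))
    for d in range(d_max + 1):
        # Reference pixel (x, y) is compared with target pixel (x - d, y).
        diff = np.abs(ref[:, d:] - tar[:, :w - d]).sum(axis=2)
        c_ad[d, :, d:] = np.minimum(diff, t_ad)

        gdiff = (np.abs(gx_ref[:, d:] - gx_tar[:, :w - d])
                 + np.abs(gy_ref[:, d:] - gy_tar[:, :w - d])).sum(axis=2)
        c_grad[d, :, d:] = np.minimum(gdiff, t_grad)
    return c_ad, c_grad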

2.2. Cost Aggregation

Aggregation of the primary costs determines the correctness and accuracy of disparity estimation [19-21]. Based on the observation that different costs, such as the absolute difference and the absolute gradient difference, represent different dissimilarity characteristics, they should be independently weighted in the aggregation. This section proposes a new method that is compatible with separable successive weighted summation (SWS) [18] while efficiently providing decoupled dissimilarity aggregation for robust stereo matching.

Once the C_AD and C_GRAD cost measures are obtained, as defined in the last subsection, the aggregated cost function in (3) is set as the weighted sum of both measures according to a weighting factor.

In (3), the definitions of the four weightings, given in (4), are based on the operational principles of bilateral filters [22] to dramatically reduce computation.

The weighting values increase as the intensity difference between neighboring pixels decreases. Also, the weightings are products of horizontal and vertical weightings, which decrease as the distance to the reference pixel increases. Hence, for each pixel, neighboring pixels with similar intensity provide stronger support during the aggregation.
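The exact weighting expressions of (4) are not reproduced here; as a minimal sketch, the snippet below assumes an exponential, permeability-style form exp(-|ΔI| / sigma) between horizontally and vertically adjacent pixels, where sigma is a hypothetical bandwidth parameter. The actual weightings used in the paper may differ.

import numpy as np

def support_weights(img, sigma):
    """Horizontal and vertical support weights between adjacent pixels.

    img   : (H, W, 3) reference image.
    sigma : assumed intensity bandwidth; larger values give wider support.
    Returns w_h of shape (H, W - 1) and w_v of shape (H - 1, W); both decrease
    as the color difference between the adjacent pixels increases.
    """
    d_h = np.abs(np.diff(img, axis=1)).sum(axis=2)  # difference to right neighbor
    d_v = np.abs(np.diff(img, axis=0)).sum(axis=2)  # difference to lower neighbor
    return np.exp(-d_h / sigma), np.exp(-d_v / sigma)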

The aggregation is a two-dimensional convolution. To reduce the computational complexity, each convolution is further decomposed into four one-dimensional convolutions [18]. These one-dimensional convolutions operate from left to right, from right to left, from top to bottom, and from bottom to top, respectively.

Let us take the absolute difference part of the aggregated cost in (3) as an example. We first define the left-to-right weighted sum which, using the definition of the weightings in (4), can be written in a recursive form: the sum at each pixel equals the local cost plus the weighted sum accumulated at its left neighbor, as given in (10). Similarly, we may define the right-to-left weighted sum, which can also be written in a recursive form accumulating from the right neighbor, as given in (12).

Combining the left-to-right and right-to-left sums then yields the horizontally aggregated cost. Note that, in (13), we have defined this horizontal aggregate to simplify the following derivation.

In the vertical direction, the top-to-bottom and bottom-to-top weighted sums are defined analogously and combined to form the full two-dimensional aggregation in (16). In calculating (16), both vertical sums are obtained recursively, as given in (17).

Hence, the first part of the aggregated cost, the absolute difference part, can be efficiently calculated by (10), (12), and (17).

The gradient part of the aggregated cost is obtained with a similar procedure, using the corresponding directional weighted sums of the gradient measure. These terms can all be written in recursive forms analogous to (10), (12), and (17).
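The derivation above can be summarized in a short sketch. The function below performs the four recursive one-dimensional passes on one cost slice (a fixed disparity) using the weights from the previous sketch; the exact recursions and normalization of (10), (12), and (17) may differ, and the blending of the two measures is shown only as a hedged comment.

import numpy as np

def sws_aggregate(cost, w_h, w_v):
    """Separable successive weighted summation of one (H, W) cost slice.

    A minimal sketch of the four passes: left-to-right, right-to-left,
    top-to-bottom, and bottom-to-top.  w_h has shape (H, W - 1) and w_v has
    shape (H - 1, W), as in the previous sketch.
    """
    cost = np.asarray(cost, dtype=np.float64)
    h, w = cost.shape

    # Horizontal passes along each scanline.
    s_lr, s_rl = cost.copy(), cost.copy()
    for x in range(1, w):
        s_lr[:, x] += w_h[:, x - 1] * s_lr[:, x - 1]
    for x in range(w - 2, -1, -1):
        s_rl[:, x] += w_h[:, x] * s_rl[:, x + 1]
    horizontal = s_lr + s_rl - cost  # the center cost is counted twice

    # Vertical passes on the horizontally aggregated result.
    s_tb, s_bt = horizontal.copy(), horizontal.copy()
    for y in range(1, h):
        s_tb[y] += w_v[y - 1] * s_tb[y - 1]
    for y in range(h - 2, -1, -1):
        s_bt[y] += w_v[y] * s_bt[y + 1]
    return s_tb + s_bt - horizontal

# Decoupled aggregation (one plausible reading of (3), stated as an assumption):
# each measure is aggregated independently and then blended with a factor alpha,
#   aggregated[d] = (1 - alpha) * sws_aggregate(c_ad[d], w_h, w_v) \
#                 + alpha * sws_aggregate(c_grad[d], w_h, w_v)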

2.3. Disparity Computation

In the last subsection, the matching cost is aggregated through weighted summation over the entire image, and the disparity which provides the minimum cost is assigned to the corresponding pixel. That is, the assigned disparity for a pixel in the reference image is the one with the minimum aggregated matching cost over the disparity search space {0, 1, ..., d_max}, where d_max is the maximum possible disparity value.
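With the cost volume layout assumed in the earlier sketches (disparity along the first axis), the winner-takes-all search reduces to a per-pixel argmin over the disparity axis:

import numpy as np

def wta_disparity(agg_cost):
    """Winner-takes-all: pick, per pixel, the disparity with minimum cost.

    agg_cost : (d_max + 1, H, W) aggregated cost volume.
    Returns an (H, W) integer disparity map.
    """
    return np.argmin(agg_cost, axis=0)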

The initial disparity map normally contains occluded outlier regions. The disparities in these regions differ significantly when they are computed with different reference images. If d_L and d_R are the disparities obtained with the left and right images as reference images, respectively, we can apply the left-right consistency check (LRC) [23] to determine whether a pixel is located in an occluded region: a pixel is flagged when d_L and the corresponding d_R are significantly different.

Once the occluded regions are found, occlusion handling [24] can be applied to patch them with the smallest disparity values of nearby regions; the corresponding costs are also assigned for disparity refinement, to be used in the next subsection.
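The check and the filling step might look as follows. The tolerance value and the rule of taking the smaller of the nearest valid left/right disparities on the scanline are assumptions made for this sketch; the handling in [23, 24] may differ in detail.

import numpy as np

def lrc_occlusion_fill(disp_left, disp_right, tol=1):
    """Left-right consistency check and a simple occlusion fill.

    disp_left, disp_right : (H, W) integer disparity maps computed with the
    left and right image as reference, respectively.  A pixel is flagged as
    occluded when |d_L(x, y) - d_R(x - d_L(x, y), y)| > tol.
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :]
    x_r = np.clip(xs - disp_left, 0, w - 1)
    rows = np.arange(h)[:, None]
    occluded = np.abs(disp_left - disp_right[rows, x_r]) > tol

    filled = disp_left.astype(np.float64)
    for y in range(h):
        for x in np.where(occluded[y])[0]:
            # Nearest valid disparities to the left and right on the scanline.
            left = next((disp_left[y, k] for k in range(x - 1, -1, -1)
                         if not occluded[y, k]), None)
            right = next((disp_left[y, k] for k in range(x + 1, w)
                          if not occluded[y, k]), None)
            valid = [v for v in (left, right) if v is not None]
            if valid:
                filled[y, x] = min(valid)  # smaller value, i.e. the background
    return filled, occluded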

2.4. Disparity Refinement

The disparity map obtained by the method proposed in the previous subsections is discrete, since the disparity search space is an integer set. We propose a smoothing technique for disparity refinement in this subsection.

Considering that the initial disparity has the smallest aggregated cost in the potential solution space, we may interpolate a refined value by fitting data sets to upward curves. Besides, rather than directly using the neighboring disparities for refinement, we use both the costs and the disparities in the curve fitting.

Assuming that C(d) is the aggregated cost value corresponding to disparity d at a pixel on the reference image, we denote the initial disparity by d0 and write c0 = C(d0), c− = C(d0 − 1), and c+ = C(d0 + 1) to simplify the following presentation.

Firstly, the disparity-cost pairs around the pixel, (d0 − 1, c−), (d0, c0), and (d0 + 1, c+), are fitted to a hyperbolic function.

The first refined disparity candidate, d_h, is taken as the location of the minimum of this curve.

Secondly, an upward parabola is used to fit another set of disparity-cost pairs around the pixel, and the second refined candidate, d_p, is taken as the location of the minimum of the parabola.

The averaged value, (d_h + d_p)/2, is then used as the refined disparity.
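As a minimal sketch of this refinement, the snippet below fits an upward parabola to the aggregated costs at the candidate disparities d0 − 1, d0, and d0 + 1 and takes the location of its minimum; the additional hyperbolic fit and the averaging of the two minima described above are omitted, so the result only approximates the proposed procedure.

import numpy as np

def refine_subpixel(disp, agg_cost):
    """Parabola-based subpixel refinement of an integer disparity map.

    disp     : (H, W) integer disparities (e.g., the WTA result).
    agg_cost : (d_max + 1, H, W) aggregated cost volume.
    For each pixel the costs at d0 - 1, d0, and d0 + 1 are fitted to an upward
    parabola whose minimum gives the refined value, clipped to +/- 0.5.
    """
    d_max = agg_cost.shape[0] - 1
    h, w = disp.shape
    refined = disp.astype(np.float64)
    ys, xs = np.mgrid[0:h, 0:w]

    valid = (disp > 0) & (disp < d_max)
    d0 = disp[valid].astype(np.intp)
    yv, xv = ys[valid], xs[valid]
    c_m = agg_cost[d0 - 1, yv, xv]
    c_0 = agg_cost[d0, yv, xv]
    c_p = agg_cost[d0 + 1, yv, xv]

    # Minimum of the parabola through (-1, c_m), (0, c_0), (+1, c_p).
    denom = c_m - 2.0 * c_0 + c_p
    offset = np.where(np.abs(denom) > 1e-12,
                      0.5 * (c_m - c_p) / np.where(denom == 0.0, 1.0, denom),
                      0.0)
    refined[valid] = d0 + np.clip(offset, -0.5, 0.5)
    return refined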

3. Experimental Study

In contrast to the approach of [18], the proposed algorithm applies independent weights to the dissimilarity measures while preserving comparable computational efficiency. Below, we restate the aggregated cost function (3) as (28) for comparison.

The aggregated cost function of [18] is equivalent to (29), in which the second dissimilarity term is the Census measure [25, 26]. It is clear from the comparison between (28) and (29) that the proposed formulation in (28) enables separate weightings of the different measures in the aggregation.

A performance comparison between the proposed method and the algorithm of [18], denoted as InfoPerm [18], using the Middlebury stereo test bench [27, 28] is presented in Figure 1. In the computation, the five parameters of the proposed algorithm are set to 22, 38, 32, 23, and 0.6, respectively.

In Figure 1, and in the following presentations, disparities with errors larger than 0.5 disparity levels are regarded as mismatches, which are denoted in gray in non-occluded regions and in black in occluded regions. The percentages of mismatches are further calculated and summarized in Table 1, which shows that the proposed algorithm outperforms InfoPerm [18] in accuracy at non-occluded regions in the benchmark tests.

In addition to InfoPerm [18], several state-of-the-art methods, namely SNCC [12], HistAggr2 [10], RTCensus [11], AdaptWeight [9], FeatureGC [13], ObjectStereo [15], and AdaptingBP [16], were also evaluated on the Middlebury stereo test bench [27, 28] for a complete performance comparison. Among them, SNCC [12], HistAggr2 [10], RTCensus [11], InfoPerm [18], and AdaptWeight [9] are local stereo matching algorithms, while FeatureGC [13], ObjectStereo [15], and AdaptingBP [16] are global techniques.

Figure 2 shows a comparison of the disparity maps produced by RTCensus [11], SNCC [12], AdaptWeight [9], and the proposed approach, where all the disparity maps have been refined. The results show that the proposed method performs comparably to these state-of-the-art methods and that the refinement strategy introduced in Section 2.4 is able to preserve clear boundaries.

The complete comparison of the mismatch rates between these algorithms is summarized in Table 2. In the table, “nonocc.” denotes the pixels in non-occluded regions, and “disc.” denotes the discontinuous but visible pixels near the occluded regions. According to Table 2, the proposed algorithm outperforms AdaptWeight [9] in all of the mismatch evaluations, although it appears inferior to AdaptWeight [9] in producing sharp boundaries, as shown in Figure 2.

It is also interesting to note that the local stereo matching algorithms, such as RTCensus [11] and SNCC [12], outperform the other algorithms on the Teddy and Cones image pairs, whereas the global stereo matching algorithms, such as FeatureGC [13] and AdaptingBP [16], perform better on the Tsukuba and Venus image pairs. This observation indicates that the performance of both the local and global approaches is scene-dependent. In contrast, the proposed approach performs comparably in all of the cases and has the lowest overall mismatch rate in the benchmark evaluation.

4. Conclusion

Stereo matching algorithms are crucial for correct and accurate depth estimation in passive stereo vision systems. A stereo matching algorithm processes rectified stereo image pairs to generate the disparity map, which is used to calculate the depth image (z-map), and hence the 3D point cloud in camera coordinates. For practical applications, the algorithms should require less computational resources and provide precise disparity maps.

In this paper, we proposed an efficient stereo matching algorithm and a refinement strategy for the disparity maps. The algorithm effectively aggregates cost values in the manner of bilateral filtering using only four passes over the image, and it provides decoupled dissimilarity measure aggregation while preserving computational efficiency. Besides, the refinement strategy is a simple application of the aggregated costs that uses both the costs and the disparities in the curve fitting, rather than directly using the neighboring disparities for refinement.

Experimental results on the Middlebury stereo test bench [27, 28] show that the algorithm performs comparably to the state-of-the-art algorithms and outperforms the representative algorithms in the overall mismatch rate.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This paper was sponsored by Chang Gung Memorial Hospital, Chang Gung University, and the National Science Council, Taiwan, under Contract nos. CMRPD1C0021, CMRPD2C0051, NSC 100-2221-E-182-008, NSC 101-2221-E-182-006, and NSC 102-2221-E-182-073 and the National Science Foundation of China, under Contract no. 61271326.