Abstract

This paper presents an effective cost aggregation strategy for dense stereo matching. Building on the guided image filter (GIF), we propose a new aggregation scheme, called Pervasive Guided Image Filtering (PGIF), which introduces weightings into the energy function of the filter so that the entire image pair is taken into account. The filter parameters of PGIF are calculated as two-dimensional convolutions based on the intensity and spatial differences between corresponding pixels, and they can be computed incrementally for efficient aggregation. The complexity of the proposed algorithm is O(N), linear in the number of image pixels. The algorithm can be further simplified to O(N/4) without significantly sacrificing accuracy if subsampling is applied in the parameter calculation stage. We also found that a step function is required in calculating the weights to attenuate noise. Experimental evaluation on version 3 of the Middlebury stereo evaluation datasets shows that the proposed method achieves superior disparity accuracy over state-of-the-art aggregation methods with comparable processing speed.

1. Introduction

Stereoscopic vision relies only on pairs of images captured from different viewpoints, which makes it a noninvasive and valuable technique for many applications, such as 3D reconstruction and environmental perception for autonomous vehicles. In stereo vision systems, stereo matching algorithms are critical for correct and accurate disparity estimation: for each pixel in the reference image, they find the corresponding pixel in the matching image.

According to [1], dense stereo matching algorithms fall into two categories: local methods and global methods. Global methods treat disparity calculation as a minimization problem whose objective function consists of a measurement part and a penalty part. The measurement part indicates the similarity between matched pixels of the image pair, and the penalty part suppresses abrupt changes in disparity. Representative global methods include belief propagation [2], graph cuts [3], and dynamic programming [4]. These techniques are computationally expensive and therefore unsuitable for real-time applications.

In contrast, local methods are popular for fast disparity calculation. Local approaches handle the effects of illumination changes through local windows and are categorized into parametric [5] and nonparametric [6] methods. Local algorithms generally perform four stages [1]: (1) preliminary matching cost calculation, (2) cost aggregation over support regions, (3) disparity estimation, and (4) disparity refinement.

Cost aggregation plays a key role in local stereo matching algorithms. Early approaches, such as [7], achieved limited performance, especially in discontinuity and occlusion regions. To better satisfy the assumption that pixels within the same aggregation window have similar disparities [8], many adaptive weighting aggregation methods, such as [9–11], were presented.

Recent years have seen a trend to treat cost aggregation as image filtering. The guided image filter (GIF) [12] provides superior edge-preserving behavior without gradient-reversal artifacts and was successfully applied to cost aggregation in [13]. To reduce the computational load of the GIF, [14] recommends subsampling the cost volume and the guidance image when calculating the filter coefficients. Ref. [15] proposed the Weighted Guided Image Filtering (WGIF), which improves the GIF by modifying the regularization term of its energy function. In addition, [16, 17] applied the WGIF to stereo matching, but with limited performance because pixel information outside the fixed windows is not used.

In order to improve the precision of stereo matching, a series of approaches with adaptive guided filters were proposed [18–20] to remove the limitation of the fixed-window formulation. Among them, [20] adaptively tunes the size of rectangular support windows based on both the intensity difference and the distance between each neighboring pixel and the central pixel. However, these approaches still suffer from the loss of information outside the windows. Ref. [21] uses the whole image for matching cost aggregation, where the aggregation weights are computed from a measure of similarity between neighboring pixels in the guidance image. In addition, a scheme called weight propagation is proposed there to calculate the weights efficiently.

In this work, we extend the GIF scheme [12] for disparity cost aggregation, as suggested in [13], by introducing bilateral weights that take distance and intensity differences into account, as suggested in [21]. This approach uses the whole image for aggregation. Similar to the convolution procedure of [21], the complexity of the proposed algorithm is O(N), linear in the image resolution. We call our approach Pervasive Guided Image Filtering, denoted as PGIF.

The main contribution of this paper is a new cost aggregation algorithm for stereo matching which can be summarized as follows:

(1) An innovative aggregation scheme is proposed that introduces weights into the energy function of the GIF so that the entire image pair can be taken into account.

(2) We demonstrate that constraining the aggregation weights with a step function is crucial to avoid excessive attenuation of the weights during the weight propagation process. This minor modification further improves disparity accuracy.

(3) The proposed aggregation can be calculated as two-dimensional convolutions with complexity O(N) using the weight-propagation method of [21]. Moreover, the algorithm can be further simplified to O(N/4) without significantly sacrificing accuracy if subsampling is applied when calculating the parameters of the guided image filter, as suggested by [14].

(4) A performance evaluation on version 3 of the Middlebury stereo evaluation dataset demonstrates the superiority of the proposed method over most state-of-the-art aggregation methods in terms of disparity accuracy, with comparable processing speed.

2. Aggregations of Preliminary Cost

2.1. Cost Function Definition and GIF Aggregation

In a local stereo matching procedure, a disparity map is obtained through four steps: (1) computation of the preliminary matching cost, (2) aggregation of the cost via volumetric filtering, (3) disparity selection via winner-take-all (WTA), and (4) postprocessing for disparity refinement. In the first step, the preliminary matching cost is a three-dimensional array of dissimilarity measures computed for every pixel over all potential disparities. When a pixel at $p = (x, y)$ is assigned a disparity value $d$, we denote the preliminary cost, computed with the left image as reference, as $C(p, d)$.

There are many metrics that can be used to measure the degree of matching between image patches, such as the sum of squared differences and normalized cross-correlation. In the following investigations, we use a basic metric, the truncated absolute difference of the gradient, for the cost, as given in (1), where $I_L$ and $I_R$ are the left and right images of the stereo pair and $\tau$ is a truncation threshold, normally assigned as 2, which is applied to reduce mismatches in noisy or occluded regions. This simple choice also makes the effectiveness of the proposed aggregation scheme itself more apparent.
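As a concrete illustration, the following NumPy sketch builds such a cost volume. It assumes the gradient is the horizontal intensity gradient and uses the truncation value 2; the exact gradient operator and boundary handling in the paper may differ.

```python
import numpy as np

def gradient_x(img):
    """Horizontal intensity gradient (central differences)."""
    g = np.zeros_like(img, dtype=np.float64)
    g[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0
    return g

def preliminary_cost_volume(left, right, max_disp, tau=2.0):
    """Truncated absolute difference of horizontal gradients.

    Returns cost[d, y, x] for d = 0..max_disp-1 with the left image as
    reference; pixels shifted outside the right image keep the value tau.
    """
    gl, gr = gradient_x(left), gradient_x(right)
    h, w = left.shape
    cost = np.full((max_disp, h, w), tau, dtype=np.float64)
    for d in range(max_disp):
        diff = np.abs(gl[:, d:] - gr[:, :w - d])
        cost[d, :, d:] = np.minimum(diff, tau)
    return cost
```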

The general form of cost aggregation in traditional local stereo matching algorithms can be written as a weighted sum of the preliminary matching cost:
$$\tilde{C}(p,d)=\sum_{q\in\omega_p}W(p,q)\,C(q,d),\tag{2}$$
where $\tilde{C}(p,d)$ is the aggregated cost, $W(p,q)$ is the aggregation weight, and $q$ ranges over the pixels within a window $\omega_p$ centered at pixel $p$.

Equation (2) is a general form of window-based cost aggregation. When the aggregation weight is set to 1, we obtain the simplest stereo matching algorithm, as in [1]. The performance of window-based approaches depends on a correct choice of the support window size, since the contributions of pixels outside the window are ignored.
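For instance, when all weights are set to 1, the aggregation reduces to summing each cost slice over a square window; a minimal sketch using a box filter:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def box_aggregate(cost_volume, radius=4):
    """Aggregate each cost slice with uniform weights (simple box window)."""
    size = 2 * radius + 1
    return np.stack([uniform_filter(slice_d, size=size, mode='reflect') * size ** 2
                     for slice_d in cost_volume])
```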

Full-Image Guided Filtering (FIF) [21] is one of the methods that use the entire image as the supporting window; it implements a scheme called weight propagation for the cost aggregation, given in (3). In (3), $\sigma$ is a constant, the weight of a pixel pair $(p,q)$ is accumulated along the path between them, and each step contributes according to the Euclidean distance between the intensities of the two adjacent pixels on the path. Note that the guided image filtering (GIF) scheme [12] is not employed in this method.

On the other hand, the Fast Cost-Volume Filtering method (FCVF) [13] achieves significant performance compared to most local stereo matching methods. In this method, the weights of (2) are not explicitly calculated. Instead, following the principle of the GIF [12], the filtered cost is assumed to follow a local linear model in terms of the guidance image $I$:
$$\tilde{C}(q,d)=a_p\,I(q)+b_p,\quad q\in\omega_p,\tag{4}$$
where $a_p$ and $b_p$ are obtained by minimizing an energy function within the supporting window $\omega_p$. The energy function for the parameters $a_p$ and $b_p$ is defined as
$$E(a_p,b_p)=\sum_{q\in\omega_p}\left[\left(a_p\,I(q)+b_p-C(q,d)\right)^2+\epsilon\,a_p^2\right],\tag{5}$$
where $\epsilon$ is the regularization parameter used to limit the magnitude of $a_p$, usually chosen to be 0.0001. The optimal values of the linear parameters in (5), denoted as $a_p^\ast$ and $b_p^\ast$, are
$$a_p^\ast=\frac{\frac{1}{|\omega|}\sum_{q\in\omega_p}I(q)\,C(q,d)-\mu_p\bar{C}_p}{\sigma_p^2+\epsilon},\qquad b_p^\ast=\bar{C}_p-a_p^\ast\,\mu_p,\tag{6}$$
where $|\omega|$ is the number of pixels in the support window centered on $p$, $\mu_p$ and $\sigma_p^2$ are the mean and variance of $I$ within $\omega_p$, and $\bar{C}_p$ is the mean of $C(\cdot,d)$ within $\omega_p$. This GIF-based aggregation can also be rearranged into the form of (2) by replacing the parameters of (4) with the best values of (6):
$$W_{\mathrm{GIF}}(p,q)=\frac{1}{|\omega|^2}\sum_{k:\,(p,q)\in\omega_k}\left(1+\frac{(I(p)-\mu_k)(I(q)-\mu_k)}{\sigma_k^2+\epsilon}\right),\tag{7}$$
where $\mu_k$ is the mean value of $I$ in the window $\omega_k$ and $\sigma_k^2$ is the variance of $I$ in $\omega_k$.
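For reference, a minimal NumPy sketch of this window-based GIF aggregation for a single cost slice is given below. It follows the standard box-filter formulation of [12, 13]; the window radius and the value of $\epsilon$ are illustrative choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter_cost_slice(I, cost_d, radius=9, eps=1e-4):
    """Aggregate one cost slice with the guided image filter of [12].

    I      : grayscale guidance image, float values in [0, 1]
    cost_d : preliminary cost for a single disparity d
    radius : half size of the square support window
    eps    : regularization parameter limiting the magnitude of a
    """
    size = 2 * radius + 1
    box = lambda x: uniform_filter(x, size=size, mode='reflect')

    mean_I  = box(I)
    mean_C  = box(cost_d)
    mean_IC = box(I * cost_d)
    var_I   = box(I * I) - mean_I ** 2

    a = (mean_IC - mean_I * mean_C) / (var_I + eps)   # optimal a, cf. (6)
    b = mean_C - a * mean_I                           # optimal b, cf. (6)

    # The output uses window-averaged parameters, as in [12].
    return box(a) * I + box(b)
```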

2.2. The Proposed Aggregation Scheme: Pervasive Guided Image Filtering (PGIF)

According to the investigation in the last section, a proper image filtering algorithm is critical for the accuracy of stereo matching. However, most GIF-based local stereo matching methods suffer from the same problem of missing information outside the supporting window.

In this section, we propose a GIF-based scheme that uses the full image for the filtering and call it the Pervasive Guided Image Filtering (PGIF). Similar to (4), the scheme also approximates the filtered cost as a linear model in terms of the guidance image $I$. However, the corresponding energy function for the linear parameters is defined as a weighted version of (5):
$$E(a_p,b_p)=\sum_{q\in I}W(p,q)\left[\left(a_p\,I(q)+b_p-C(q,d)\right)^2+\epsilon\,a_p^2\right],\tag{8}$$
where $\epsilon$ is a regularization parameter that restricts the value of $a_p$, and the weight $W(p,q)$ reflects the relative importance of the pixel $q=(i,j)$ with respect to the pixel $p=(x,y)$. Importantly, the weight is calculated as the product of horizontal and vertical weighting factors, $W^h$ and $W^v$:
$$W(p,q)=W^h(p,q)\,W^v(p,q),\tag{9}$$
where the horizontal factor (10) and the vertical factor (11) are formed by successively multiplying per-step weighting factors, determined by the constant filter factor $\sigma$, the step function $T$, and the intensity difference between adjacent pixels, along the horizontal and vertical segments of the path from $p$ to $q$.

The parameter $\sigma$ is a constant filter factor, where a larger value corresponds to a higher weight. Since each per-step factor is no greater than 1, successive multiplication of these factors ensures smaller weights for longer distances. Furthermore, the step function defined in (11) is introduced to avoid the loss of information when there is a significant local intensity difference along the path from pixel $p$ to pixel $q$. By this arrangement, the values of the weightings stay within the interval $[0, 1]$ and are proportional to intensity similarity.

The necessity of introducing the step function $T$ is illustrated in Figure 1, where all pixel values are 10 except for a single outlier pixel whose value is 99. With $T$ applied, the weights accumulated across the outlier remain at a usable level. However, if $T$ is not applied, the large intensity jump drives the accumulated weights to nearly zero, making the intensity of that pixel effectively invisible to its neighbors on both sides.
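The effect can be reproduced numerically. The sketch below is purely illustrative: it assumes each per-step factor has the form sigma**T(|dI|), with T clipping the intensity difference at a threshold, which may differ from the exact definitions in (10) and (11); the numbers only mimic the situation of Figure 1.

```python
import numpy as np

def path_weight(intensities, sigma=0.9, clip=None):
    """Product of per-step factors along a one-dimensional path.

    Each step contributes sigma ** diff, where diff is the absolute
    intensity difference to the next pixel, optionally clipped to mimic
    the step function T (clipping is a hypothetical choice here).
    """
    diffs = np.abs(np.diff(np.asarray(intensities, dtype=float)))
    if clip is not None:
        diffs = np.minimum(diffs, clip)
    return float(np.prod(sigma ** diffs))

# A row of pixels that are all 10 except one outlier of 99 (cf. Figure 1).
row = [10, 10, 99, 10, 10]
print(path_weight(row, sigma=0.9))           # ~7e-9: the outlier cuts the path
print(path_weight(row, sigma=0.9, clip=4))   # ~0.43: information still propagates
```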

The optimal values of the linear filter parameters, $a_p$ and $b_p$, are obtained by minimizing the energy function (8). This is achieved by setting its first derivatives with respect to $a_p$ and $b_p$ to zero, which yields (13).

Solving (13), the best estimates of the parameters, $a_p^\ast$ and $b_p^\ast$, are given in (14), where the terms collected in (15) are weighted sums over the entire image computed with the weights $W(p,q)$.
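As one concrete reading of (14), and assuming that the regularization term in (8) is weighted together with the data term, the parameters can be computed per pixel from the whole-image weighted sums of 1, I, C, I*I, and I*C (which can be supplied by the propagation passes sketched further below):

```python
import numpy as np

def linear_parameters(S1, SI, SC, SII, SIC, eps=1e-4):
    """Per-pixel a*, b* from whole-image weighted sums.

    S1, SI, SC, SII, SIC are arrays holding, for every pixel p, the sums
    of W(p,q) times 1, I(q), C(q,d), I(q)**2, and I(q)*C(q,d).
    """
    mean_I = SI / S1
    mean_C = SC / S1
    var_I  = SII / S1 - mean_I ** 2          # weighted variance of I
    cov_IC = SIC / S1 - mean_I * mean_C      # weighted covariance of I and C
    a = cov_IC / (var_I + eps)
    b = mean_C - a * mean_I
    return a, b
```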

To reduce the computational complexity of (15), these weighted sums can be decomposed into four one-dimensional convolutions using the principle of weight propagation developed in [21]. Let us take one of the terms in the numerator of $a_p^\ast$ as an example, given in (16), where the weight is calculated according to (10) and the sums run over the full width and height of the guidance image. Firstly, we define the left-to-right and the right-to-left weighted sums along each image row, as in (17). Because the per-step weights accumulate multiplicatively, the left-to-right sum can be written in the recursive form (18); similarly, the right-to-left sum follows the recursion (19).

Secondly, we define the horizontal weighted sum (20) by combining the two directional sums of (18) and (19). Following a similar procedure in the vertical direction, we define the top-to-bottom and bottom-to-top weighted sums of (20), as in (21) and (22). Hence, the full weighted sum of (16) can be calculated as in (23).

In sum, we compute (16) via (23), where the two vertical sums are obtained from (21) and (22). The quantity propagated in (21) and (22) is in turn the horizontal sum calculated from (20). Finally, the two directional sums in (20) are recursively calculated using (18) and (19).
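The following sketch illustrates how one such whole-image weighted sum can be accumulated with four one-dimensional recursive passes, in the spirit of the weight propagation of [21]. The per-step weight used here, sigma raised to a clipped intensity difference, is an illustrative assumption and may differ from the exact factors of (10) and (11).

```python
import numpy as np

def step_weights(diff, sigma=0.9, t=4.0):
    """Per-step weight between adjacent pixels (illustrative form)."""
    return sigma ** np.minimum(np.abs(diff), t)

def propagate_1d(f, w):
    """Weighted sums along the last axis via two recursive passes.

    f : values to aggregate, shape (..., n)
    w : per-step weights between positions k and k+1, shape (..., n-1)
    Returns S with S[..., x] = sum_i W(x, i) * f[..., i], where W(x, i) is
    the product of the per-step weights between positions x and i.
    """
    f = np.asarray(f, dtype=float)
    left = f.copy()
    for x in range(1, f.shape[-1]):            # forward pass, cf. (18)
        left[..., x] += w[..., x - 1] * left[..., x - 1]
    right = f.copy()
    for x in range(f.shape[-1] - 2, -1, -1):   # backward pass, cf. (19)
        right[..., x] += w[..., x] * right[..., x + 1]
    return left + right - f                    # f itself is counted once, cf. (20)

def full_image_weighted_sum(I, f, sigma=0.9, t=4.0):
    """Whole-image weighted sum of f guided by I, using four 1-D passes."""
    I = np.asarray(I, dtype=float)
    wh = step_weights(np.diff(I, axis=1), sigma, t)  # horizontal per-step weights
    S = propagate_1d(f, wh)                          # along rows, cf. (17)-(20)
    wv = step_weights(np.diff(I, axis=0), sigma, t)  # vertical per-step weights
    return propagate_1d(S.T, wv.T).T                 # along columns, cf. (21)-(23)
```

Calling full_image_weighted_sum with f equal to an all-ones array, I, C, I*I, or I*C yields the whole-image sums needed by the linear parameters, each in O(N) operations.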

Consequently, all of the two-dimensional convolutions in (15) can be decomposed into one-dimensional convolution operations in four directions with computational complexity O(N), where N is the total number of pixels of the guidance image.

Once we have the optimal filter parameters $a_p^\ast$ and $b_p^\ast$, the aggregated matching cost volume is calculated as in (24).

We may substitute the parameters of (24) by the optimal values of (14) and rearrange the aggregation into the explicit weighted-sum form of (25).

The disparity map can then be obtained by applying the winner-take-all operation to the aggregated cost volume: for each pixel, the disparity with the minimum aggregated cost over the range of disparity values is selected.
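In code, the winner-take-all selection is simply a per-pixel arg-min over the disparity axis of the aggregated cost volume:

```python
import numpy as np

def winner_take_all(aggregated_cost):
    """Select, per pixel, the disparity with minimum aggregated cost.

    aggregated_cost : array of shape (max_disp, height, width)
    Returns an integer disparity map of shape (height, width).
    """
    return np.argmin(aggregated_cost, axis=0)
```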

A close comparison between (7) and (25) indicates that the effective support region of (25) extends from the local aggregation window of (7) to the entire guidance image $I$, and that the weight in (25) embeds the weight $W(p,q)$ of the energy function defined in (9).

In addition, in order to speed up the calculation, [14] suggests subsampling both the cost volume and the guidance image when calculating the linear coefficients. We can follow the same process by subsampling $I$ and $C$ to calculate the optimal filter parameters and then using those parameters for aggregation at the original resolution. This accelerates the algorithm to O(N/4).
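A possible realization of this speed-up, following the idea of [14], is sketched below: the linear parameters are computed on half-resolution copies of the guidance image and of a cost slice and then upsampled before forming the output. The resizing choices and the helper param_fn are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import zoom

def aggregate_slice_subsampled(I, cost_d, param_fn, scale=2):
    """Compute (a, b) on subsampled inputs, then apply them at full resolution.

    param_fn(I_small, cost_small) must return the per-pixel linear
    parameters (a, b) at the reduced resolution, e.g. from the
    weighted sums described above.
    """
    I_small = zoom(I, 1.0 / scale, order=1)
    c_small = zoom(cost_d, 1.0 / scale, order=1)
    a_small, b_small = param_fn(I_small, c_small)
    # Upsample the parameters back to the original grid.
    factors = np.array(I.shape) / np.array(a_small.shape)
    a = zoom(a_small, factors, order=1)
    b = zoom(b_small, factors, order=1)
    return a * I + b
```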

3. Experimental Study

Extensive experiments were conducted to verify the effectiveness of the proposed scheme in calculating disparity maps. We studied five representative stereo matching algorithms and two versions of the proposed scheme:
(i) the Fast Cost-Volume Filtering, denoted as FCVF [13];
(ii) the Full-Image Guided Filtering for fast stereo matching, denoted as FIF [21];
(iii) the Fast Guided Image Filtering, denoted as FGIF [14];
(iv) the Adaptive Guided Image Filter, denoted as AGIF [20];
(v) the Weighted Guided Image Filtering, denoted as WGIF [17];
(vi) the proposed scheme with computational complexity O(N), denoted as PGIF(N);
(vii) the proposed scheme with downsampling in calculating the linear parameters, with computational complexity O(N/4), denoted as PGIF(N/4).

It is worth noting that there is only one design parameter in our approach, the parameter of (11). For all of the computational experiments, we fixed its value at 4. The experiments were performed in MATLAB 2017b on an Intel Core i7 3610 CPU with 16 GB RAM.

We tested these frameworks on the Middlebury (version 3) benchmark stereo database [22], which provides a collection of stereo pairs for matching performance evaluation. Among them, the 15 pairs of the "trainingQ" image set, from "Adirondack" to "Vintage", were used. Middlebury defines two evaluation regions: the nonoccluded (non-occ) region and the all region. The "error rates" presented below correspond to the "bad 1.0" measure of the benchmark, which is the percentage of "bad" pixels whose disparity error is greater than 1.0. Furthermore, the "weighted average error" is calculated using different weights for different image pairs, such that the pairs "PianoL", "Playroom", "Playtable", "Shelves", and "Vintage" contribute only half of their error values. This arrangement of weights is applied, according to the remarks on the benchmark website, "to compensate for the varying difficulty of the different datasets."
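For clarity, the "bad 1.0" measure over a chosen evaluation region (nonoccluded or all pixels) corresponds to the following computation:

```python
import numpy as np

def bad_pixel_rate(disp, gt, mask, threshold=1.0):
    """Percentage of evaluated pixels whose disparity error exceeds threshold.

    disp : estimated disparity map
    gt   : ground-truth disparity map
    mask : boolean map of pixels to evaluate (e.g., nonoccluded pixels)
    """
    err = np.abs(disp[mask] - gt[mask])
    return 100.0 * np.mean(err > threshold)
```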

Figure 2 shows the disparity maps obtained by these algorithms without disparity refinement. The corresponding error rates in the nonoccluded region and in the all region are shown in Tables 1 and 2, respectively. Erroneous pixels in the nonoccluded area are marked in red, while those in the occluded area are marked in green.

According to the results, FIF [21] performs best for Recycle and Shelves, and AGIF [20] performs best for PlaytableP in the nonoccluded (non-occ) region; in the all region, FIF [21] performs best for Recycle and Shelves, and AGIF [20] performs best for MotorcycleE, Pipes, and PlaytableP. PGIF(N) performs best in all other cases, in both the nonoccluded region and the all region. PGIF(N) also has the best weighted average error, 14.66% for the nonoccluded region and 20.49% for the all region, while PGIF(N/4) is second with 15.10% and 20.74%, respectively.

The times required to run these stereo matching algorithms are shown in Table 3. As expected, PGIF(N/4) needs only about a quarter of the time required by PGIF(N). Comparing the overall running times, PGIF(N) and FIF [21] are at the same level, while PGIF(N/4) and FGIF [14] are at the same level. We may conclude that PGIF(N/4) is the best scheme when both matching accuracy and computational efficiency are considered.

Similar to the proposed scheme, FIF [21] also uses the entire image for aggregation. However, in addition to the differences in cost-volume filtering, a step function $T$, defined in (12), is introduced in the proposed scheme to reduce the effects of noise. It is therefore important to clarify whether the performance improvement of the proposed solution is mainly due to this function. Extensive experiments were conducted for this purpose and are summarized in Figure 3, in which the same color convention as in Figure 2 is applied. Based on these results, we find that adding the step function does improve the performance of FIF [21]. Importantly, our method is still better than the modified FIF [21].

In addition, there are several parameters that may affect the performance of the algorithms under study, including the kernel parameter of FCVF [13], the parameter $\sigma$ of FIF [21], and the design parameter of PGIF(N). A close look at the filter kernel of FCVF [13] in (7) shows that the resulting disparity map becomes smoother as its parameter increases. The parameter $\sigma$ in the filter kernel of FIF [21], presented in (3), and the design parameter of the proposed scheme behave similarly.

We studied their effects on the error rates through extended experimental calculations, and the results are summarized in Figure 4. Based on these results, the best values of these parameters are 5 for FCVF [13], 0.11 for FIF [21], and 4 for the proposed scheme. These values were applied in the experimental calculations presented above.

4. Conclusions

Inspired by FIF [21] and FCVF [13], we propose a new GIF-based aggregation scheme for the calculation of dense disparity maps. The scheme uses the entire image for cost aggregation and exploits the weight propagation method of [21] for efficient computation.

We redesign the energy function of the guided image filter [12] so that we not only preserve the original weight propagation structure of [21] but also embed the effects of distance and intensity differences in the weights. This arrangement allows the entire image to be used efficiently when the GIF [12] scheme is employed for cost aggregation. Moreover, the proposed scheme has only one design parameter. We also introduce a step function in the calculation of the weights to attenuate noise.

A performance evaluation on the Middlebury (version 3) benchmark stereo database [22] shows that the proposed solution provides superior disparity accuracy with comparable processing speed relative to representative aggregation methods. The experimental results also justify the use of the step function when calculating the aggregation weights.

Data Availability

The dataset used to support the findings of this study is included in the article, which is cited at relevant places within the text as [22].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research is supported by the National Natural Science Foundation of China, under Grant no. 61471263; the Natural Science Foundation of Tianjin, China, under Grant no. 16JCZDJC31100; the Ministry of Science and Technology, ROC, under Grant nos. MOST 106-2221-E-182-033 and 107-2221-E-182-078; and Chang Gung Memorial Hospital, Taiwan, under Grant no. CORPD2H0011.