Abstract

We propose a new dense local stereo matching framework for gray-level images based on an adaptive local segmentation using a dynamic threshold. We define a new validity domain of the frontoparallel assumption based on the local intensity variations in the 4 neighborhoods of the matching pixel. The preprocessing step smoothes low-textured areas and sharpens texture edges, whereas the postprocessing step detects and recovers occluded and unreliable disparities. The algorithm achieves high stereo reconstruction quality in regions with uniform intensities as well as in textured regions. The algorithm is robust against local radiometrical differences and successfully recovers disparities around the objects edges, disparities of thin objects, and the disparities of the occluded region. Moreover, our algorithm intrinsically prevents errors caused by occlusion to propagate into nonoccluded regions. It has only a small number of parameters. The performance of our algorithm is evaluated on the Middlebury test bed stereo images. It ranks highly on the evaluation list outperforming many local and global stereo algorithms using color images. Among the local algorithms relying on the frontoparallel assumption, our algorithm is the best-ranked algorithm. We also demonstrate that our algorithm is working well on practical examples as for disparity estimation of a tomato seedling and a 3D reconstruction of a face.

1. Introduction

Stereo matching has been a popular topic in computer vision for more than three decades, ever since one of the first papers appeared in 1979 [1]. Stereo images are two images of the same scene taken from different viewpoints. Dense stereo matching is a correspondence problem with the aim to find for each pixel in one image the corresponding pixel in the other image. A map of all pixel displacements in an image is a disparity map. To solve the stereo correspondence problem, it is common to introduce constraints and assumptions, which regularize the stereo correspondence problem.

The most common constraints and assumptions for stereo matching are the epipolar constraint, the constant brightness or the Lambertian assumption, the uniqueness constraint, the smoothness constraint, the visibility constraint and the ordering constraint [24]. Stereo correspondence algorithms belong to one of two major groups, local or global, depending on whether the constraints are applied to a small local region or propagated throughout the whole image. Local stereo methods estimate the correspondence using a local support region or a window [5, 6]. Local algorithms generally rely on an approximation of the smoothness constraint assuming that all pixels within the matching region have the same disparity. This approximation of the smoothness constraint is known as the frontoparallel assumption. However, the frontoparallel assumption is not valid for highly curved surfaces or around disparity discontinuities. Global stereo methods consider stereo matching as a labeling problem where the pixels of the reference image are nodes and the estimated disparities are labels. An energy functional embeds the matching assumptions by its data, smoothness, and occlusion terms and propagates them along the scan line or through the whole image. The labeling problem is solved by energy functional minimization, using dynamic programming, graph cuts, or belief propagation [79]. A recent review of both local and global stereo vision algorithms can be found in [10].

Algorithms based on rectangular window matching give an accurate disparity estimation provided the majority of the window pixels belong to the same smooth object surface with only a slight curvature or inclination relative to the image plain. In all other cases, window-based matching produces an incorrect disparity map: the discontinuities are smoothed, and the disparities of the high-textured surfaces are propagated into low-textured areas [11]. Another restriction of window-based matching is the size of objects of which the disparity is to be determined. Weather the disparity of a narrow object can be correctly estimated depends mostly on the similarity between the occluded background, visible background, and object [12]. Algorithms which use suitably shaped matching areas for cost aggregation result in a more accurate disparity estimation, [1318]. The matching region is selected using pixels within certain fixed distances in RGB, CIELab color space, and/or Euclidean space.

To alleviate the frontoparallel assumption, some approaches allow the matching area to lie on the inclined plane, such as in [19, 20]. The alternative to the idea that properly shaped areas for cost aggregation can result in more accurate matching results is to allocate different weights to pixels in the cost aggregation step. In [21], the pixels closer in the color space and spatially closer to the central pixel are given proportionally more significance, whereas, in [22], the additional assumption of connectivity plays a role during weight assignment.

Our stereo algorithm belongs to the group of local stereo algorithms. Within the stereo framework, we rely on some standard and some modified matching constraints and assumptions. We use the epipolar constraint to convert the stereo correspondence into a one-dimensional problem. However, we modify the interpretation of the frontoparallel assumption and the Lambertian constraint. A novel interpretation of the frontoparallel assumption is based on local intensity variations. By adaptive local segmentation in both matching windows, we constrain the frontoparallel assumption only to the intersection of the central matching segments of the initial rectangular window. This mechanism prevents the propagation of the matching errors caused by occlusion and enables an accurate disparity estimation for narrow objects. The algorithm estimates correctly disparities of both textured as well as textureless surfaces, and disparities around depth discontinuities, disparities of the small as well as large objects independently of the initial window size. We apply the Lambertian constraint to local intensity differences and not to the original gray values of the pixels in the segment. In the postprocessing step, we apply the occlusion constraint without imposing the ordering constraint, which enables successful disparity estimation for narrow objects. Also, our stereo algorithm is suitable for a fast real-time implementation, because it is local algorithm for gray-valued images which uses a local segmentation and only a small subset of window pixels for cost calculation.

Our main contribution is the introduction of the relationship between the frontoparallel assumption and the local intensity variation and its applications to the stereo matching. In addition, we introduce a preprocessing step that smoothes low-textured areas and sharpens texture edges producing the image more favorable for a proper local adaptive segmentation.

The paper is organized as follows: in Section 2, we explain our stereo matching framework: the preprocessing step, the adaptive local segmentation, the matching region selection, the stereo matching, and the postprocessing step; in Section 3, we show and discuss the results of our algorithm on different stereo images; in Section 4, we draw conclusions.

2. Stereo Algorithm

Our algorithm consists of three steps: a preprocessing step, a matching step, and a postprocessing step. The flow chart of the algorithm is shown in Figure 1. Input to the algorithm is a pair of rectified stereo images 𝐼𝑙 and 𝐼𝑟, where one of them, for instance 𝐼𝑙, is considered as the reference image. For each pixel in the reference image, we perform matching along the epipolar line for each integer-valued disparity within the disparity range. Firstly, the input images are preprocessed, as explained in Section 2.1. The preprocessing step is applied to each image individually. Next, we calculate the local intensity variations maps for the preprocessed images and used them to determine the dynamic threshold for adaptive local segmentation, and elaborated in Section 2.2. Further, the stereo matching comprises a final region selection from segments, a matching cost calculation for all disparities from the disparity range and disparity estimation by a modification of the winner-take-all estimation method, see Section 2.3. The result of the matching is two disparity maps, 𝐷LR and 𝐷RL, corresponding to the left and right images of the stereo pair. Finally, postprocessing step calculates the final disparity map corresponding to the reference image as described in Section 2.4.

2.1. Preprocessing

We apply a nonlinear intensity transformation to the input images in order to make them more suitable for adaptive local segmentation. The presence of the Gaussian noise and the sampling errors in image can produce erroneous segments for matching. The noise is dominant in the low-textured and uniform regions, while the sampling errors are pronounced in the high-textured image regions. The sampling effects can be tackled by choosing a cost measure insensitive to sampling as in [23], or by interpolating the cost function as in [24]. We handle these problems differently and within the preprocessing step. The applied transformation suppresses the noise in low-textured regions while simultaneously suppressing the sampling effects in the high-textured regions.

The transformation is based on the interpolated subpixel samples by bicubic transform in the 4 neighborhoodS and by consistently replacing the central pixel value by maximum or by minimum value of the set, depending on the relation between the mean and the median of the set. We form a set of samples of the observed pixel at the position (𝑥,𝑦) and the intensities in horizontally and vertically interpolates image at the subpixel level at 𝛿𝑖: 𝛿𝑖7=81+𝑖8𝐼,𝑖{0,1,,14},𝑣=𝑥𝛿𝑖,𝑦,𝐼𝑥,𝑦𝛿𝑖.𝑖{0,1,,14}(1)

The intensity transformation is performed by replacing the intensity 𝐼(𝑥,𝑦) with the new intensity as 𝐼(𝑥,𝑦)=max{𝑣}ifmedian{𝑣}>mean{𝑣}min{𝑣}otherwise.(2)

All intensity values are corrected in the same manner. If the pixel intensity differs significantly from its four neighbors, as in the high-textured regions, it will be replaced by the maximum value in the interpolated subpixel set 𝑣, resulting in the sharpening effect. On the other hand, in low-textured regions, the intensity change is small, and replacing the initial intensity value systematically with the minimum value of the interpolated subpixel set 𝑣, it produces the favorable denoising effect. These positive effects originate from the image resampling done by bicubic interpolation, because the bicubic interpolation exhibits overshoots at locations with large differences between adjacent pixels, see Chapter 4.4 in [25] and Chapter 6.6 in [26]. These favorable effects are lacking if the interpolation method is linear.

We illustrate the effect of the preprocessing step for an image from a stereo pair from the Middlebury evaluation database in Figure 2. Therefore, the preprocessing step modifies regions with high-intensity variations and results in the sharper image. Further, in Section 3, we show the influence of this step to overall algorithm score.

2.2. Adaptive Local Segmentation

Adaptive local segmentation establishes a new relationship between the local intensity variation and the frontoparallel assumption applied to stereo matching. Adaptive local segmentation selects a central subset of pixels from a large rectangular window for which we assume that the frontoparallel assumption holds for the segment. The segment contains the central window pixel and pixels, spatially connected to the central pixel, whose intensities lie within the dynamic threshold from the intensity of the central window. Starting from the segment, we form a final region selection for matching, see Section 2.3.

The idea behind the adaptive local segmentation is to prevent that the matching region contains the pixels with significantly different disparities prior to actually estimating disparity. We accomplish this aim by conveniently choosing threshold for segmentation based on the local texture. If local texture is uniform with local intensity variations caused only by the Gaussian noise, we opt for a small threshold value. In this way, because the intensity variations are small, the segment will comprise the whole uniform region. We assume that these pixels originate from the smooth surface of one object and therefore that the frontoparallel assumption holds for the segment. On the other hand, if the window is textured, that is, intensity variations are significantly larger than the noise level, it is not possible to distinguish based only on the pixel intensities and prior to matching, whether the pixels originate from one textured object or from several different objects at different distances from the camera. In this case, relying on the high texture for an accurate matching result, it is good to select small segment in order to assure that the segment contains pixels from only one object and does not contain depth discontinuity. Due to the high local intensity variations, this is achieved by large threshold.

We introduce local intensity variation measure in order to determine the level of local texture and subsequently the dynamic threshold. We define the local intensity variation measure as a sharpness of local edges in the 4 neighborhoods of the central window pixel. The sharper local edges are, the larger is the local intensity variation. We calculate the local intensity variation using the maximum of the first derivatives in the horizontal and the vertical directions at the half-pixel interpolated image by benefiting again from overshooting effect of the bicubic interpolation.

The horizontal central difference for a pixel at the position (𝑥,𝑦) in image 𝐼 is calculated as |||𝐼1𝐻=𝑥21,𝑦𝐼𝑥+2|||,,𝑦(3) where 𝐼(𝑥1/2,𝑦) and 𝐼(𝑥+1/2,𝑦) are horizontal half-pixel shifts of image 𝐼 to the left and to the right. The vertical central difference for a pixel at the position (𝑥,𝑦) in image 𝐼 is calculated as |||𝐼1𝑉=𝑥,𝑦21𝐼𝑥,𝑦+2|||,(4) where 𝐼(𝑥,𝑦1/2) and 𝐼(𝑥,𝑦+1/2) are vertical half-pixel shifts of image 𝐼. We define the intensity variation measure as 𝑀𝑡(𝑥,𝑦)=max(𝑉,𝐻).(5)

We divide local intensity variations into four ranges based on the preselected constant 𝑇 and define a dynamic threshold for each range by a look-up table: 𝑇𝑑𝑇(𝑥,𝑦)=2𝑀𝑡𝑇(𝑥,𝑦)0,43𝑇4𝑀𝑡𝑇(𝑥,𝑦)4,𝑇2𝑇𝑀𝑡𝑇(𝑥,𝑦)2,𝑇2𝑇𝑀𝑡[(𝑥,𝑦)𝑇,).(6)

Figure 3 shows a color-coded dynamic threshold map, or equivalently local intensity variation ranges, for the left image from Tsukuba stereo pair from the Middlebury stereo evaluation set [27].

The dynamic threshold 𝑇𝑑(𝑥,𝑦) defined by (6) for the reference pixel in the reference image is also used for the adaptive local segmentation in the nonreference image for all potentially corresponding pixels from the disparity range.

The adaptive local segmentation pseudocode for the reference pixel 𝐼𝑙(𝑥,𝑦) in the left image is given by Algorithm 1. The segmentation is performed for reference and nonreference windows independently using the same threshold 𝑇𝑑(𝑥,𝑦). Thus, in the 𝑊×𝑊 window, where 𝑊=2𝑤+1, around the pixel at the position (𝑥,𝑦) in the reference image, we declare that the pixel at (𝑖,𝑗) position, where 𝑖, 𝑗=1,,𝑊 in the reference window, belongs to the segment if its gray value 𝑝𝑙𝑖,𝑗 differs from the central pixel's gray value 𝑐𝑙=𝑝𝑙𝑤+1,𝑤+1 for less than the dynamic threshold 𝑇𝑑(𝑥,𝑦). The segment pixels in the nonreference window are chosen in similar way using the same threshold 𝑇𝑑(𝑥,𝑦). Next, the central 8-connected components in the dilated masks are selected. The final segments are defined by the binary 𝑊×𝑊 maps, 𝐵𝑙 and 𝐵𝑟, with ones if the pixels belong to the segment. Dilation is performed by 3×3 squared structured element to include additional neighbor pixels into segments and to merge isolated but close-selected pixels.

Step 𝟏 : Dynamic thresholding
for   𝑖 = 1 to 𝑊   do
for   𝑗 = 1 to 𝑊   do
  if   | 𝑝 𝑖 , 𝑗 𝑙 / 𝑟 𝑐 𝑙 / 𝑟 | < 𝑇 𝑑 ( 𝑥 , 𝑦 )   then
   set 𝐁 𝑖 , 𝑗 𝑙 / 𝑟 to 1
  end if
end for
end for
Step 𝟐 : Dilation
Dilate 𝐁 𝑙 / 𝑟 with 3 × 3 squared structured element
Step 𝟑 : Imposing connectivity
for   𝑖 = 1 to 𝑊   do
for   𝑗 = 1 to 𝑊   do
  if   𝐁 𝑖 , 𝑗 𝑙 / 𝑟 = 1 and not connected to 𝐁 𝑤 + 1 , 𝑤 + 1 𝑙 / 𝑟   then
   set 𝐁 𝑖 , 𝑗 𝑙 / 𝑟 to 0
  end if
end for
end for

2.3. Stereo Correspondence

The matching region is defined by the overlap of the adaptive local segments in the reference and nonreference windows. Thus, the matching region is defined by binary map 𝐵, which has ones if and only if both binary maps, 𝐵𝑙 and 𝐵𝑟, have ones at the same positions, as given in Algorithm 2.

for   𝑖 = 1 to 𝑊   do
for   𝑗 = 1 to 𝑊   do
  if   𝐁 𝑙 𝑖 , 𝑗 𝐁 𝑟 𝑖 , 𝑗   then
   set   𝐁 𝑖 , 𝑗   to 1
  end if
end for
end for

We assume that the corresponding pixels have similar intensities and that the differences exist only due to the Gaussian noise with the variance 𝜎2𝑛. One-dimensional vectors, 𝐳𝑙 and 𝐳𝑟, are formed from the pixels from the left and right matching window at positions of ones within the binary map 𝐁. Besides the noise, differences between vectors can occur due to different offsets and due to occlusion. To make the matching vectors insensitive to local different offsets, we subtract the central pixel values 𝑐𝑙 and 𝑐𝑟 from vectors 𝐳𝑙 and 𝐳𝑟, given by Algorithm 3. In this way, the intensity information is transformed from the absolute intensities to the differences of intensities with respect to the central window pixels. Further, we impose the Lambertian assumption on the pixels after the central pixel subtraction and not on the original pixel intensities. To prevent the occlusion influence in matching we eliminate the occlusion outliers by keeping only the coordinates of vectors which differ for less than threshold 𝑇 as given by Algorithm 4.

𝑁 𝑝 is the length of the vectors 𝐳 𝑙 and 𝐳 𝑟
𝑐 𝑙 and 𝑐 𝑟 are the central intensities in the left and in the right window
for   𝑖 = 1 to 𝑁 𝑝   do
𝐳 𝑙 ( 𝑖 ) = 𝐳 𝑙 ( 𝑖 ) 𝑐 𝑙
𝐳 𝑟 ( 𝑖 ) = 𝐳 𝑟 ( 𝑖 ) 𝑐 𝑟
end for

𝑁 𝑝 is the length of the initial vectors 𝐳 𝑙 and 𝐳 𝑟
𝑘 = 0
for   𝑖 = 1 to 𝑁 𝑝   do
if   | 𝐳 𝑙 ( 𝑖 ) 𝐳 𝑟 ( 𝑖 ) | 𝑇   then
  Remove 𝐳 𝑙 ( 𝑖 ) and 𝐳 𝑟 ( 𝑖 )
end if
end for
𝑁 𝑝 is the length of the final vectors 𝐳 𝑙 and 𝐳 𝑟

We calculate the matching cost using the sum of squared differences (SSDs) [7, 28]. To compare the costs with different length of vectors 𝐳𝑙 and 𝐳𝑟 for different disparities, we introduce the normalized SSD: 𝐶𝑛SSD1(𝑑)𝑁𝑝𝐳𝑙𝐳𝑟24𝜎2𝑛,(7) where 𝑁𝑝 is the length of vectors 𝐳𝑙 and 𝐳𝑟 for disparity 𝑑.

The winner-take-all (WTA) method selects the disparity with the minimal cost for the observed reference pixel. In our algorithm, besides the cost, the number of pixels participating in the cost calculation is also an indication of a correspondence. This ordinal measure cannot be used directly in the disparity estimation, because it is not always a reliable indication of the correspondence as in the case of occlusion. If the number of pixels used in the cost calculation is very low, it may be due to occlusion. However, a reliable match has a substantial ordinal support.

We combine the cost and the number of participating pixels in the disparity estimation and introduce a hybrid WTA; we consider only disparities supported by a sufficient number of pixels as potential candidates for a disparity estimate. Thus, the final disparity estimate is chosen from a subset of the all possible disparities from the disparity range. We term these disparity candidates as the reliable disparity candidates [13, 29].

The reliable disparity candidates have at least 𝑁𝑠=𝐾𝑝max{𝑁𝑝𝑥,𝑦} supporting pixels, where 𝑁𝑝𝑥,𝑦 is a set containing the number of pixels participating in the cost aggregation step for each possible disparity value from the disparity range [𝐷min,𝐷max]. 𝐾𝑝 is the ratio coefficient 0<𝐾𝑝1. The estimated disparity 𝑑(𝑥,𝑦) is 𝑑(𝑥,𝑦)=argmin𝑑𝑖{𝐷min,,𝐷max}𝐶𝑛𝑥,𝑦SSD𝑑𝑖𝑁𝑝𝑥,𝑦𝑑𝑖>𝑁𝑠,(8) where 𝑥=1,,𝑅 and 𝑦=1,,𝐶, for image of the dimension 𝑅×𝐶 pixels, and 𝑑𝑖 belongs to the set of all possible disparities from the disparity range [𝐷min,𝐷max].

The final result of the hybrid WTA is the disparity map 𝐷[][]𝐷={𝑑(𝑥,𝑦)𝑥1,𝑅𝑦1,𝐶}.(9)

We calculate two disparity maps, one disparity map, 𝐷LR, with the left image 𝐼𝑙 as the reference, and the other, 𝐷RL, as the right image 𝐼𝑟 as the reference.

2.4. Postprocessing

In the postprocessing, we detect the disparity errors and correct them. There are some areas of incorrect disparity values caused by low-textured areas larger than the initial window. There are some isolated disparity errors with significantly different disparity from the neighborhood disparities, so called outliers, caused by isolated pixels or groups of several pixels if the adaptive local segmentation did not result in sufficiently large segment due to high local intensity variation. Also, there are disparity errors caused by occlusion. Although the matching procedure is the same for both occluded and nonoccluded pixels, our stereo matching algorithm does not propagate error caused by occlusions because the boundaries of objects are taken into account by both the adaptive local segmentation and the final matching region selection. However, occluded pixels do not have corresponding pixels, and the estimated disparities for the occluded pixels are incorrect.

The postprocessing consists of several steps including median filtering of the initial disparity maps, disparity refinement of the individual disparity maps, consistency check, and propagation of the reliable disparities.

First, we apply 𝐿×𝐿  median filter to both disparity maps, 𝐷LR and 𝐷RL, and eliminate disparity outliers. Second, we refine the filtered disparity maps individually to correct low-textured areas with erroneous disparities, in an iterative procedure. The refinement step propagates disparities by histogram voting to the regions with close intensities defined by a look-up table given in (10) across the whole image as illustrated in propagation scheme in Figure 4. Some similar notions to this approach appear separately in the literature, [18, 30], and we were inspired by them. In [30], the cost aggregation is done along the 16 radial directions in disparity space, while in [18], histogram voting is used within the segment for disparity refinement. We refine our disparity maps by histogram voting of accumulating disparities along 8 radial directions across the whole disparity map with constraint of the maximum allowed intensity difference with the pixel being refined. The maximum intensity difference is defined by a dynamic threshold 𝑇𝑝 with the same logic behind as in local intensity variation measure in Section 2.2, with the difference that here we distinguish three ranges of intensity differences. Thus, the histogram is formed using disparities of the pixels with close intensities along 8 radial directions, see Figure 4 and Table 1. The pixels are close in intensities, and their disparities are taken into account in histogram forming if they lie within the threshold 𝑇𝑝 from the intensity of the pixel at the observed position (𝑥,𝑦). The threshold 𝑇𝑝(𝑥,𝑦) is selected based on a look-up table: 𝑇𝑝𝑇(𝑥,𝑦)=2𝑀𝑡𝑇(𝑥,𝑦)0,23𝑇4𝑀𝑡𝑇(𝑥,𝑦)2,3𝑇4𝑇𝑀𝑡(𝑥,𝑦)3𝑇4,.(10)

The histogram 𝐻 with a number of bins equal to the number of disparities within the disparity range is formed by counting the disparities along 8 radial directions for the pixels whose intensity is within threshold 𝑇𝑝(𝑥,𝑦): 𝐻𝑑𝑥tmp,𝑦tmp𝑑𝑥=𝐻tmp,𝑦tmp||𝐼𝑥+1iftmp,𝑦tmp||𝐼(𝑥,𝑦)<𝑇𝑝(𝑥,𝑦),(11) where 𝑥tmp and 𝑦tmp are given by Table 1.

We calculate disparity 𝑑 as a disparity of the normalized histogram maximum: (𝑖)=𝐻(𝑖)𝑖𝐻(𝑖),𝑖=𝐷minto𝐷max,𝑑(12)=argmax𝑖(𝑖),𝑖=𝐷minto𝐷max.(13) The initial disparity 𝑑(𝑥,𝑦) is replaced by the new value 𝑑 if it is significantly supported, that is, if the normalized histogram value (𝑑) is greater than 𝛼; otherwise, it is left unchanged: 𝑑𝑑(𝑥,𝑦)=||𝑑if||𝑑𝑑(𝑥,𝑦)>1>𝛼𝑑(𝑥,𝑦)otherwise,(14) where 𝛼, 0𝛼<1, is a significance threshold. The steps given by (11), (12), (13), and (14) are repeated iteratively until there are no more updates to disparities in the map.

Next, we detect occluded disparities by the consistency check between two disparity maps: ||𝐷RL𝑥,𝑦𝐷LR(𝑥,𝑦)𝐷LR||(𝑥,𝑦)1.(15)

If the condition in (15) is not satisfied for disparity 𝐷LR(𝑥,𝑦), we declare it as inconsistent and eliminate it from the disparity map. The missing disparities are filled in by an iterative refinement procedure similar to the previously applied procedure for the disparity propagation by histogram voting. In the iterative step to fill in the inconsistent disparities, we use the threshold look-up table (10) as in the disparity refinement step. We calculate the histogram of the consistent disparities with close intensities along 8 radial directions as given by (11) and (12). The missing disparity is filled in with the disparity 𝑑 with the largest support in the histogram, provided that the histogram is not empty. The remaining unfilled inconsistent disparities, and we fill in by the disparity of the nearest neighbor with known disparities with the smallest intensity differences. As a last step in the postprocessing, we apply 𝐿×𝐿  median filter to obtain the final disparity map.

3. Experiments and Discussion

We have used the Middlebury stereo benchmark [4] to evaluate the performance of our stereo matching algorithm. The parameters of the algorithm are fixed for all four stereo pairs as required by the benchmark. There are five free parameters in our algorithm. The threshold value is set to 𝑇=12. The half-window size is 𝑤=15, and the window size is 𝑊×𝑊 where 𝑊=31. The noise variance 𝜎2𝑛 is a small and constant scaling factor in (7). The ratio coefficient in hybrid WTA is 𝐾𝑝=0.5. In the postprocessing step, the median filter parameter is 𝐿=5, and the significance threshold in histogram voting is 𝛼=0.45.

Figure 5 shows results for all four stereo pairs from the Middlebury stereo evaluation database: Tsukuba, Venus, Teddy, and Cones. The leftmost column contains the left images of the four stereo pairs. The ground truth (GT) disparity maps are shown in the second column, the estimated disparity maps are shown in the third column, and the error maps are shown in the forth column. In the error maps, the white regions denote correctly calculated disparity values which do not differ for more than 1 from the ground truth. If the estimated disparity differs for more than 1 from the ground truth value, it is marked as an error. The errors are shown in black and gray, where black represents the errors in the nonoccluded regions, and gray represents errors in the occluded regions. The quantitative results in the Middlebury stereo evaluation framework are presented in Table 2.

The results show that our stereo algorithm preserves disparity edges. It estimates successfully the disparities of thin objects and successfully deals with subtle radiometrical differences between images of the same stereo pair. Occlusion errors are not propagated, and occluded disparities are successfully filled in the postprocessing step. A narrow object is best visible in the Tsukuba disparity map (the lamp construction) and in Cones disparity map (pens in a cup in the lower right corner). Our algorithm correctly estimates disparities of both textureless and textured surfaces, for example, the examples of large uniform surfaces in stereo pairs Venus and Teddy are successfully recovered.

The images in the Middlebury database have different sizes, different disparity ranges, and different radiometric properties. The stereo pairs Tsukuba, 384×288 pixels, and Venus, 434 × 383 pixels, have disparity ranges from 0 to 15 and from 0 to 19. The radiometric properties of the images in these stereo pairs are almost identical, and the offset compensation given by Algorithm 3 is not significant for these two example pairs, as we demonstrated in [13]. As required by the Middlebury evaluation framework, we apply the offset compensation to all four stereo pairs. The stereo pairs Teddy, 450×375 pixels, and Cones, 450×375 pixels, have disparity ranges from 0 to 59. The images of these stereo pairs are not radiometrically identical, and the offset compensation successfully deals with these radiometrical differences [13].

The error percentages together with ranking in the Middlebury evaluation online list are given in Table 2. The numbers show error percentages for nonoccluded regions (NONOCC), discontinuity regions (DISC), and the whole (ALL) disparity map. The overall ranking of our algorithm in the Middlebury evaluation table of stereo algorithms is the 28th place out of 123 evaluated algorithms. Thus, our stereo algorithm outperforms many local as well as global algorithms. Among the algorithms ranked in the Middlebury stereo evaluation, there are only two local algorithms ranked higher than our algorithm, but both of them do not impose the frontoparallel assumption strictly: a local matching method using image geodesic-supported weights GeoSup [5] and a matching approach with slanted support windows PatchMatch [31]. Both of these algorithms use colored images, while our algorithm works with intensity images and achieves comparable results. Although these approaches have better general ranking in the Middlebury stereo evaluation list, our approach with matching based on frontoparallel regions outperforms the PatchMatch algorithm for Tsukuba stereo pair, and the GeoSup algorithm for Tsukuba, Teddy, and Cones stereo pairs. Thus, our approach with region selection by threshold produces more accurate disparity maps for cluttered scenes than GeoSup algorithm with region selection using geodesic support weights.

To investigate the contribution of the preprocessing and the postprocessing steps to the overall result, we show in Table 3 the results we obtained on the benchmark stereo pairs with or without the preprocessing and the postprocessing steps in the algorithm. We show the results if neither, only one, and both steps are applied. If our postprocessing step was omitted, the 𝐿×𝐿 median filter was applied. From the results in Table 3, we conclude that both steps, if individually applied, improve the qualities of the final disparity maps. If we apply both steps, the accuracy of the disparity maps is the highest. Furthermore, the improvement contribution of the preprocessing step is greater than the postprocessing step only for Venus stereo pair. This is because the sampling effects were most pronounced in Venus scene. In addition, we show in Figure 6 the disparity maps for Tsukuba stereo pair for all four combinations: if the preprocessing and the postprocessing steps are included or not in the algorithm. We conclude that the preprocessing step plays a significant role in accurate disparity estimation of textureless areas, while the postprocessing step especially helps in an accurate estimation of disparity discontinuities.

To illustrate the subtle features of our algorithm not captured in the standard test bed images, and we apply our stereo algorithm, while retaining the parameter values, on some other images from the Middlebury site in Figure 7. For two other stereo pairs, Art and Dolls, we show the left images of two stereo pairs in the leftmost column. The ground truth (GT) disparity maps are in the second column. The third column shows our estimation of the disparity maps. The fourth column shows the error maps with regard to the ground truth. The algorithm successfully recovers the disparities of very narrow structures as in Art disparity map. The disparity of the cluttered scene is successfully estimated, as in Dolls disparity map.

Next, we demonstrate that the presented local stereo algorithm works well on practical problems. Examples of disparity map estimation and 3D reconstruction of a face are shown for stereo pair Sanja in Figure 8. The disparity map estimation of a plant in stereo pair Tomato seedling is shown in Figure 9. The parameters of the algorithm are kept the same as in the previous examples. Thus, our algorithm successfully estimates the disparity of the smooth low-textured objects and is suitable also for application to 3D face reconstruction, Figure 8(d). Our algorithm also successfully estimated the disparity map of the tomato seedling. Tomato seedling stereo images represent a challenging task for a stereo matching algorithm in general, because the viewpoints significantly differ and the structure of the plant is narrow, that is, much smaller than the window dimension.

As far as the initial window size is concerned, our algorithm is not influenced by the window size above certain size. In principle, we could apply our algorithm using the whole image as the initial window around the reference pixel. This would result in a sufficiently large region selection for uniform regions in the image and make the ordinal measure within the hybrid WTA more reliable. On the other hand, in matching windows with high local intensity variations, the selected region is always significantly smaller than the window and does not change if the window is enlarged because of the connectivity constraint with the reference central pixel.

4. Conclusion

In our local stereo algorithm, we have introduced a new approach for stereo correspondence based on the adaptive local segmentation by a dynamic threshold so that the frontoparallel assumption holds for a segment. Further, we have established a relationship among the local intensity variation in an image and the dynamic threshold. We have applied the novel preprocessing procedure on both stereo images to eliminate the influence of noise and sampling artifacts. The mechanism for the final matching region selection prevents error propagation due to disparity discontinuities and occlusion. In the postprocessing step, we introduce a new histogram voting procedure for disparity refinement and for filling in the eliminated inconsistent disparities. Although the starting point in matching is the large rectangular window, disparity of narrow structures is accurately estimated.

We evaluated our algorithm on the stereo pairs from the Middlebury database. It ranks highly on the list, outperforming many local and global algorithms that use color information while we use only intensity images. Our algorithm is the best performing algorithm in the class of local algorithms which use intensity images and the frontoparallel assumption without weighting the intensities of the matching region. Furthermore, our algorithm matches textureless as well as textured surfaces equally well, handles well the local radiometric differences, preserves edges in disparity maps, and successfully recovers the disparity of thin objects and the disparities of the occluded regions. We demonstrated the performance of our algorithm on two additional examples from the Middlebury database and on two practical examples. The results on this additional examples show that the disparity maps of scenes of different natures are successfully estimated: smooth low-textured objects as well as textured cluttered scenes, narrow structures, and textureless surfaces. Moreover, our algorithm has also other positive aspects making it suitable for real-time implementation: it is local; it has just five parameters; intensity variations are locally calculated, and there is no global segmentation algorithm involved.