Advances in 3DTV: Theory and Practice
An Occlusion Approach with Consistency Constraint for Multiscopic Depth Extraction
We present a new approach for handling occlusions in multiview stereovision algorithms, using images destined for autostereoscopic displays. It takes advantage of information from all views and ensures the consistency of their disparity maps. We demonstrate its application in a correlation-based method and a graph-cuts-based method. The latter uses a new energy, which merges both dissimilarity and occlusion evaluations. We discuss the results on real and virtual images.
Augmented reality has many applications in several domains such as games or medical training. Meanwhile, autostereoscopic display is an emerging technology, which adds a perception of depth that enhances the user's immersion. Augmented reality can be applied to autostereoscopic displays in a straightforward way by adding virtual objects to each image. However, it is much more interesting to use the depth-related information of the real scene so that virtual objects can be hidden by real ones.
To that end, we need to obtain one depth map for each view. The particular context of images destined to be viewed on autostereoscopic displays allows us to work on a simplified geometry (e.g., no rectification is needed, epipolar pairs are horizontal lines of the same rank, and disparity vectors are thus aligned along the abscissa). However, our aim is to obtain a good assessment of depth in all kinds of scenes, without making any assumption about their contents. Indeed, images may contain mostly homogeneous colors or widely varied ones. Also, due to the principle of autostereoscopic displays, users can see two images at the same time. It is therefore crucial to have strongly consistent depth maps. For example, if a virtual object is drawn in front of a real object in one view, it has to be drawn in the same order in all views. Therefore we introduce a new occlusion approach for multiview stereovision algorithms, which aims to ensure the consistency of the depth maps.
We propose an example of application of our approach in a correlation-based method and in a symmetrical graph-cuts-based method. Finally we discuss their results.
2. Related Work
Stereovision algorithms aim to find the disparity maps in order to deduce the depth maps. That is the reason why we will use the phrase “disparity maps” instead of “depth maps” in the following lines. Depth maps can be easily obtained from disparity maps using a triangulation step.
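As a minimal sketch of this triangulation step, assume an ideal parallel pinhole pair in which depth is inversely proportional to disparity. Capture systems with shifted sensors add an offset to the disparity, which is ignored here. The function and parameter names are illustrative and are not taken from the paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline):
    """Depth along the optical axis for an ideal parallel camera pair.

    Assumptions (not from the paper): the focal distance is expressed in
    pixels, the baseline (distance between optical centers) in the unit
    wanted for the depth, and a zero disparity means a point at infinity.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to get a finite depth")
    return focal_px * baseline / disparity_px


if __name__ == "__main__":
    # A 20-pixel disparity with an 800-pixel focal and a 0.5 m baseline.
    print(depth_from_disparity(20, 800.0, 0.5))
```

Larger disparities give smaller depths, which is why the nearest objects carry the largest disparities in the discussion that follows.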
Let us assume we have a set of $N$ images, numbered from the left ($0$) to the right ($N-1$), shot with a parallel capture system specifically designed for autostereoscopic displays. Figure 1 illustrates these shooting conditions for a fixed number of views. $f$ is the focal distance and dioc is the Distance Intra Optical Centers. $I_i$ designates the set of pixels of image $i$. $d_i$ is a function associating a disparity vector $d_i(p)$ with any pixel $p$ from $I_i$. This vector is the difference between the coordinates of the corresponding pixel of $p$ in image $i+1$ and those of $p$ in image $i$. The corresponding pixel of $p$ in the next image is then at position $p + d_i(p)$. Since we are using a simplified geometry, $d_i(p)$ is of dimension one. It is an integer too ($d_i(p) \in \mathbb{Z}$), so that we do not have to deal with subpixels. Moreover, it is possible to find the corresponding pixel of $p$ in any image using only the disparity $d_i(p)$. In any image $j$, the corresponding pixel is given by $p + (j - i)\,d_i(p)$. For example, the corresponding pixel of $p$ in the previous image is $p - d_i(p)$.
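The correspondence rule described above (one integer disparity per pixel, applied once per step between views along a horizontal epipolar line) can be sketched as follows. The function name and the 0-based view indices are assumptions made for the illustration.

```python
def corresponding_x(x, disparity, view, target_view):
    """Abscissa of the pixel matching (view, x) in target_view.

    The ordinate is unchanged because epipolar lines are horizontal lines
    of the same rank. The disparity is the shift between two consecutive
    views, so it is applied once per step between the two view indices.
    """
    return x + (target_view - view) * disparity


if __name__ == "__main__":
    # Pixel at x = 100 in view 1 with disparity 3.
    for j in range(4):
        print(j, corresponding_x(100, 3, 1, j))
```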
Optical flow algorithms are based on a cost $c(p, q)$, which evaluates the color dissimilarity between two pixels $p$ and $q$. In the case of color images, it is given by equation (1),
where $R_p$, $G_p$, and $B_p$ (resp., $R_q$, $G_q$, and $B_q$) are the red, green, and blue components of $p$ (resp., $q$). Several methods use the multiview aspect in order to make such algorithms more robust and less sensitive to noise. The local cost of a pixel $p$ ($p \in I_i$) according to disparity $d$ is then given by equation (2).
The reader can refer to Scharstein and Szeliski  for a complete taxonomy of dense stereovision algorithms. In many recent publications from this domain, authors use color segmentation in their methods [2–5]. However, color segmentation and other primitive extraction methods are applied independently to each view. It is then impossible to ensure the consistency between the disparity maps. Moreover, segmentation may be a cause of errors when applied to images with mainly homogeneous colors. Therefore these methods are incompatible with the objective presented previously, which imposes working in a local colorimetric context. We cannot make any assumption about the content of the images and, thus, we cannot extract other features. So we aim to make up for this lack of information by taking advantage of redundancies in the images.
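The multiview local cost discussed above can be sketched as follows. This is only an illustration under stated assumptions: the dissimilarity is taken to be a sum of absolute differences over the three color components, the image pairs are the consecutive views along the line of correspondences, and pairs falling outside the image are simply skipped. The names `color_dissimilarity` and `multiview_local_cost` and the scanline data layout are hypothetical.

```python
def color_dissimilarity(p, q):
    """Sum of absolute differences over the (R, G, B) components."""
    return sum(abs(a - b) for a, b in zip(p, q))


def multiview_local_cost(rows, view, x, d):
    """Cost of giving disparity d to pixel x of the given view.

    rows[j] is the scanline of view j on the shared epipolar line, stored
    as a list of (R, G, B) tuples. The cost adds the dissimilarities of
    every pair of consecutive views along the correspondence line.
    """
    total = 0
    for j in range(len(rows) - 1):
        xa = x + (j - view) * d
        xb = x + (j + 1 - view) * d
        # Assumption: pairs leaving the image do not contribute.
        if 0 <= xa < len(rows[j]) and 0 <= xb < len(rows[j + 1]):
            total += color_dissimilarity(rows[j][xa], rows[j + 1][xb])
    return total


if __name__ == "__main__":
    black, white = (0, 0, 0), (255, 255, 255)
    # A white point moving by 2 pixels per view on a black background.
    rows = [[white if x == 2 + 2 * j else black for x in range(10)]
            for j in range(3)]
    print(multiview_local_cost(rows, 0, 2, 2))  # correct disparity
    print(multiview_local_cost(rows, 0, 2, 1))  # wrong disparity
```

Skipping out-of-image pairs lowers the cost of border pixels, so a real implementation would probably normalize by the number of valid pairs.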
Many algorithms deal with occlusions in order to obtain better disparity maps, which preserve discontinuities at object boundaries. The first step in dealing with occlusions is to be able to detect them. Egnal and Wildes  compare five approaches. Some of them are based on the idea that color discontinuities correspond to depth discontinuities. These approaches are called photometry-based approaches. Alvarez et al.  use the gradient of the gray levels in order to locally adjust the smoothness constraint of their energy: the lower the value of the gradient, the stronger the smoothness constraint. On the other hand, geometry-based approaches use disparities in order to detect occlusion areas. The reader can refer to  for a complete comparison of such methods. We prefer geometry-based approaches to the photometry-based ones since they do not make any assumption on colors and allow disparity maps from the images to interact and to be linked. This link ensures the consistency of the disparities and makes up for the lack of information previously discussed. The most widely used geometry-based method is the Left Right Checking (LRC) approach [8–10]. The principle is that a pixel should match a pixel from another image with the same disparity; otherwise an occlusion occurs. In the case of two images (numbered $i$ and $j$), this can be expressed by equation (3),
where the LRC value is close to zero when there is no occlusion and high when pixel $p$ is occluded in image $j$. There have been several attempts to improve the robustness of this method [11, 12]. However, the original LRC is still the most popular approach [2, 5, 13, 14]. We propose an LRC-based approach, which differs in two respects: (i) it is extended to the multiview context, (ii) it ensures a geometric consistency between depth maps.
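A left-right check along one epipolar line can be sketched as follows. The sketch assumes the convention used in this paper, where the matching pixel in the next image sits at $p + d$ and both images store disparities with the same sign, so the check compares the two disparities by absolute difference. The function name and the `None` return value for pixels leaving the image are choices made for the illustration.

```python
def lrc(disp_a, disp_b, x, step=1):
    """Left-right check for pixel x of scanline a against scanline b.

    disp_a and disp_b are integer disparity scanlines of two views,
    b being `step` views to the right of a (negative for the left).
    Returns 0 when both pixels agree on the disparity, a positive value
    otherwise, and None when the match falls outside the image.
    """
    xb = x + step * disp_a[x]
    if not 0 <= xb < len(disp_b):
        return None
    return abs(disp_a[x] - disp_b[xb])


if __name__ == "__main__":
    # Foreground (disparity 2) over a background (disparity 0).
    disp_a = [0, 0, 2, 2, 0, 0, 0, 0]
    disp_b = [0, 0, 0, 0, 2, 2, 0, 0]
    print([lrc(disp_a, disp_b, x) for x in range(len(disp_a))])
```

In this toy case the background pixels at positions 4 and 5 of the first view fail the check, because the foreground covers them in the second view.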
After the detection step, the main difficulty is to handle occlusions in the matching algorithms. Woetzel and Koch  propose a correlation-based algorithm, which does not add up dissimilarity costs for the whole set of image pairs but for a subset of it. They replace (2) by equation (4),
where $S$ is the set of chosen image pairs. The authors propose two methods to choose this set. The first one is to select the furthest left or the furthest right images. The second one is to select the pairs with the smallest costs. This method reduces the impact of occlusions on the results but introduces many errors in the images that are the furthest away.
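The second selection strategy (keeping the pairs with the smallest costs) can be illustrated with a few lines. This is a sketch of the idea only: the function name, the input format, and the parameter `k` (the number of pairs kept) are assumptions, not details taken from Woetzel and Koch.

```python
def best_pairs_cost(pair_costs, k):
    """Sum of the k smallest pair dissimilarities.

    pair_costs holds one dissimilarity per image pair for a given pixel
    and disparity. Dropping the largest values discards the pairs that
    are most likely affected by an occlusion.
    """
    if not 0 < k <= len(pair_costs):
        raise ValueError("k must be between 1 and the number of pairs")
    return sum(sorted(pair_costs)[:k])


if __name__ == "__main__":
    # Two pairs match well, two are probably occluded.
    print(best_pairs_cost([5, 40, 3, 90], 2))
```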
There are two categories of methods based on energy minimization performing the matching while taking occlusions into account.
The first category contains iterative methods [8, 10] based on Algorithm 1. In order to apply this principle, Proesmans et al. , who work with two images, use four maps, one disparity map plus one occlusion map per image. The occlusion maps are computed using the LRC approach. Strecha and Van Gool  extend this principle to the multiview context. First, disparity maps are computed from the views. Then for each view, occlusion maps are computed as being the LRC evaluation with all other views.
In order to obtain better results, some methods start again at step 2 when step 3 is over and loop until the system converges. The problem with iterative methods is that the disparity and occlusion estimations are independent, do not interact with each other and, thus, do not ensure a global geometric consistency.
The second category is then composed of methods to estimate occlusions and disparities simultaneously. In the context of two views, Alvarez et al.  introduce the following energy with
where the data term evaluates dissimilarities between corresponding pixels, the smoothness term corresponds to the smoothness constraint, and the occlusion term is the sum of the LRC values for every pixel of the image. Two weighting factors balance these terms.
Note that even if pixels are detected as occluded, their dissimilarities are still taken into account in the dissimilarity term. This means that this term contains dissimilarities of mismatching pixels, which have nothing in common. This is a problem since it introduces noise into the energy. In order to solve this, Ince and Konrad  use an energy similar to (5) with the dissimilarity terms of the form given in equation (6),
where $w$ is a weighting function, which approaches zero when the occlusion measure is high (i.e., occlusions are detected). Moreover, they use the smoothness term in order to extrapolate the disparities in occluded areas.
By the same token, we have proposed a multiview graph-cuts-based method in , which integrates occlusion penalties in its energy function without integrating dissimilarities of mismatching pixels.
In spite of the fact that these methods use smooth and discontinuity preserving functions, they still can contain inconsistencies that we will detail in Section 3.
3. A New Approach for Occlusion Detection
Let us imagine a standard scene with a man behind a wall. Four views of this scene are shot; Figure 2(a) shows corresponding epipolar lines from each image and superimposes them. This representation, which we call a matching graph, is useful for defining matches and occlusions. In Figure 2(b), a pixel $p$ of the man (green circle) has a given disparity. Its corresponding pixel in one image is also a pixel of the man, so there is a match between these pixels. In another image, the corresponding pixel of $p$ is part of the wall (blue square). It has a larger disparity (since the wall is nearer). Therefore there is an occlusion between these two pixels, as shown in Figure 2(b) with an orange diamond. The scene described in Figure 2(c) is an example of an impossible matching graph. This graph shows that the wall pixel in one image corresponds to a pixel in another image, which is not a pixel of the wall but a representation of the man. It has a smaller disparity. This situation is impossible as it supposes that the wall is hidden by the man, whose disparity implies that he is behind it. This is what we call a consistency error. The LRC does not handle this kind of inconsistency.
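We propose the following rules in order to define our approach to occlusions. Let us consider a pixel $p \in I_i$ with disparity $d_i(p)$ and its corresponding pixel $q \in I_j$ with disparity $d_j(q)$: (i) if $d_i(p) = d_j(q)$, $p$ and $q$ match, (ii) if $d_i(p) < d_j(q)$, $p$ is occluded, (iii) if $d_i(p) > d_j(q)$, there is a consistency error.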
In order to simplify writing, we call the occlusion image when an occlusion occurs at image number , that is, when corresponding pixels from images and do not match. In Figure 2(b), for instance, there is an occlusion between images and , is then equal to .
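The three rules can be sketched as a small classification function. The ordering of the comparisons follows the wall and man example: a pixel whose corresponding pixel has a larger disparity (a nearer surface) is occluded, and a pixel whose corresponding pixel has a smaller disparity yields a consistency error. The names `Link` and `classify` are hypothetical.

```python
from enum import Enum


class Link(Enum):
    MATCH = "match"
    OCCLUSION = "occlusion"
    CONSISTENCY_ERROR = "consistency error"


def classify(d_p, d_q):
    """Relation between a pixel p and its corresponding pixel q.

    d_p is the disparity of p and d_q the disparity of the pixel found
    at the position p maps to in the other image.
    """
    if d_p == d_q:
        return Link.MATCH
    if d_p < d_q:
        # q belongs to a nearer surface that hides p.
        return Link.OCCLUSION
    # q is farther than p, yet p cannot be seen where it should be.
    return Link.CONSISTENCY_ERROR


if __name__ == "__main__":
    man, wall = 2, 5  # the wall is nearer, so its disparity is larger
    print(classify(man, man).value)
    print(classify(man, wall).value)
    print(classify(wall, man).value)
```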
3.2. Energy Function
In order to take the rules presented above into account, we use an energy function of the form
where $D$ is the set of all disparity functions $d_i$. One term of this energy is the smoothness term, which may be the same as the one used in (5). This smoothness constraint is applied to each disparity function. The other term contains all the dissimilarity, occlusion, and consistency penalties.
In the case of Figure 2(b), this penalty term will include the three dissimilarities between the four pixels of the wall, plus one dissimilarity between the two pixels of the man, plus one occlusion penalty between the two mismatching pixels. In Figure 2(c), it is the sum of the three dissimilarities between the man pixels, the dissimilarity between the wall pixels, and the consistency penalty. Of course these examples take a very small number of pixels into account whereas the full term considers all of them.
Finally, this term is given by
where each local cost is evaluated between a pixel $p$ of image $i$ and its corresponding pixel either in the previous or the next image. Let us call this pixel $q$. We have
The term for the other adjacent image is the same except that $q$ is taken in that image. Two constant values correspond to the occlusion penalty and the consistency penalty, respectively. Due to our specific domain of application (augmented reality on autostereoscopic displays), the consistency constraint must be very strong and the value of the consistency penalty is then very high to ensure that this case will never happen.
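A sketch of this penalty term on a single epipolar line may help. It assumes that each pixel contributes one local cost towards the previous image and one towards the next image, that a match costs the color dissimilarity (here a sum of absolute differences), and that links leaving the image are ignored. The default values are illustrative: 100 for the occlusion penalty and an infinite consistency penalty as one way to make that case impossible. All names are hypothetical.

```python
def dissimilarity(p, q):
    """Sum of absolute differences over the (R, G, B) components."""
    return sum(abs(a - b) for a, b in zip(p, q))


def penalty_energy(colors, disps, k_occ=100.0, k_cons=float("inf")):
    """Dissimilarity, occlusion, and consistency penalties of one line.

    colors[i][x] is the (R, G, B) color and disps[i][x] the integer
    disparity of pixel x in view i.
    """
    n = len(colors)
    total = 0.0
    for i in range(n):
        for x, d in enumerate(disps[i]):
            for j in (i - 1, i + 1):  # previous and next image
                if not 0 <= j < n:
                    continue
                xq = x + (j - i) * d
                if not 0 <= xq < len(disps[j]):
                    continue  # assumption: no cost outside the image
                dq = disps[j][xq]
                if d == dq:
                    total += dissimilarity(colors[i][x], colors[j][xq])
                elif d < dq:
                    total += k_occ
                else:
                    total += k_cons
    return total


if __name__ == "__main__":
    bg, fg = (10, 10, 10), (200, 200, 200)
    disps = [[0, 0, 2, 2, 0, 0, 0, 0], [0, 0, 0, 0, 2, 2, 0, 0]]
    colors = [[fg if d else bg for d in row] for row in disps]
    print(penalty_energy(colors, disps))
```

With a finite consistency penalty the term can rank inconsistent configurations, whereas an infinite one rules them out altogether.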
This energy function can be used in different methods as we will see in Section 4.
In this section, we present two applications of our occlusion approach. The first one does not use any smoothness constraint and focuses on our approach of occlusions in order to emphasize its relevance on a correlation-based method. The second one is an application of the energy function as defined in the previous section on a graph-cuts-based method.
Both methods use the same occlusion penalty. We found empirically that a value of 100 gives good results with our different sets. We give the consistency penalty a very high value.
4.1. Correlation-Based Method
This method uses two distinct local costs. The first one supposes there is no occlusion and the second one supposes there is exactly one occlusion. These two costs are in competition by means of a Winner Takes All (WTA) algorithm.
The first cost could be any local cost found in the literature. Our implementation uses the cost described in (2), where the dissimilarity is the absolute difference of intensities summed over the three color components.
The second cost is a local subset of (8). Indeed, the cost of a pixel $p$ includes the energies of (9) for all pixels linked to $p$ in the images, assuming that there is only one occlusion and two disparities (on the left and the right of the occlusion). In order to ensure the consistency of the matching graph implicitly, only values meeting the consistency condition are tested. Therefore the consistency penalty never appears. This cost finally contains one penalty due to the occlusion, plus the dissimilarities between matching pixels. The local cost for $p$ in image $i$ is
where is the occlusion image and and are two disparities, respectively, on the left and on the right of the occlusion. Figure 3 shows the three terms of with and with the same configuration as in Figure 2(b), that is, it shows the costs of each term. is the corresponding pixel of in image : is equal to if is on the left of the occlusion (), and equal to if it is on the right (). and contain dissimilarity costs on the left and on the right sides of the occlusion, respectively. They are given by
In order to ensure the consistency of the matching graph, only disparities meeting the following condition are tested:
Finally, the selection is based on a WTA algorithm: if the minimum cost for a pixel is obtained using the first cost, then the corresponding disparity is assigned to the pixel. If it is obtained using the second cost, then the disparity assigned to the pixel is either the left or the right disparity, depending on whether the image of the pixel is, respectively, on the left or on the right of the occlusion.
4.2. Graph-Cuts-Based Method
Our method is based on the energy function previously described in (7) and (8). We use the graph-cuts method in order to minimize our energy. Please refer to publications by Boykov et al.  for a complete presentation of the graph-cuts method and by Kolmogorov and Zabih  for an explanation of the graph construction. We use the $\alpha$-expansion algorithm and, unlike others, we loop only once for each disparity $\alpha$. Moreover, we always loop from the highest disparity to the lowest one, since we have found that this gives more accurate occlusion areas in the results. The graph we present in this section is based on this assumption, as we will see further. Our graph is composed of one node for each pixel from all the images. It also has a source ($s$) and a sink ($t$), which mean “keep the same disparity” and “change to disparity $\alpha$”, respectively.
Now, we will see how to construct the graph corresponding to (9). In fact, a pixel $p$ will not have the same corresponding pixel in the other image depending on whether it changes its disparity or not. The graph corresponding to (9) is then composed of three nodes: $p$ and its two possible corresponding pixels, $q$ and $q_\alpha$, reached when $p$ is cut from the source or from the sink. However, it can be separated into two graphs using pairs ($p$, $q$) and ($p$, $q_\alpha$). Figure 4 shows the general graph corresponding to such a pair.
The smoothness of the result is ensured by the smoothness term of the energy given in (7). We use the following definition, where $\mathcal{N}_i$ is the set of neighbouring pixel pairs in image $i$. We use two implementations of the smoothness constraint. In the first one, $\mathcal{N}_i$ contains only horizontal neighbours in order to obtain a 1D smoothness constraint. Such a constraint allows each epipolar line to be processed independently and therefore in parallel. The second one uses a 2D smoothness constraint and includes both horizontal and vertical neighbours.
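The difference between the two neighbourhoods can be sketched as follows. The penalty between two neighbours is assumed here to be the weighted absolute difference of their disparities, which is a metric and is therefore compatible with expansion moves; the exact form used in this paper may differ. The function name and the weight `lam` are hypothetical.

```python
def smoothness(disp, lam=1.0, two_d=False):
    """Smoothness energy of one disparity map (list of rows).

    With two_d=False only horizontal neighbour pairs are counted, so each
    epipolar line can be handled independently. With two_d=True vertical
    neighbour pairs are added as well.
    """
    height, width = len(disp), len(disp[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += lam * abs(disp[y][x] - disp[y][x + 1])
            if two_d and y + 1 < height:
                total += lam * abs(disp[y][x] - disp[y + 1][x])
    return total


if __name__ == "__main__":
    disp = [[0, 0, 1],
            [0, 2, 2]]
    print(smoothness(disp))              # horizontal pairs only
    print(smoothness(disp, two_d=True))  # horizontal and vertical pairs
```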
To compare our methods, first with each other and then with existing ones, we use two sets of 8 images. The first one is a set of images of a virtual scene, which allows us to compare results against ground truth. The second one is a set of photographs taken at Palais du Tau in Reims . Images in both sets have the same dimensions. Figure 5 shows one image of each set. The photograph has homogeneous colors whereas the virtual scene has varied colors. This allows us to test our methods in both cases.
We compare three pairs of methods. The first pair is composed of correlation-based methods: one uses the cost of (2) and the other is our own correlation-based method. The second and third pairs are graph-cuts-based methods, the second with a smoothness constraint along epipolar lines and the third with a 2D smoothness constraint. Each of these two pairs is composed of a method using our energy function and one using a standard energy of the form
Using the ground truth of the virtual scene, we give the error rate corresponding to each method in Table 1. This error results from the absolute differences between real disparities and our disparities summed on all pixels and divided by the number of pixels. The result is then the average disparity error per pixel. Our local methods have the highest error rate, and our global method has the smallest error rate. For each category, the error rate of the method using our occlusion approach is always lower than the method which does not use it. We observe that the error rate of the correlation-based method without occlusion handling is lower than the one of the method using 1D smoothness constraint. We think that is due to the fact that our virtual images contain no noise and no specular light at all. That is the reason why the correlation-based method gives particularly good results in this case. Therefore, the comparison between correlation-based methods and energy minimization based methods is meaningless.
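The error measure described above (absolute differences summed over all pixels and divided by the number of pixels) can be written in a few lines. The function name and the list-of-rows layout are choices made for the illustration.

```python
def mean_abs_disparity_error(estimated, ground_truth):
    """Average absolute disparity error per pixel.

    Both maps are lists of rows with identical dimensions.
    """
    total = 0
    count = 0
    for est_row, gt_row in zip(estimated, ground_truth):
        if len(est_row) != len(gt_row):
            raise ValueError("maps must have the same dimensions")
        for est, gt in zip(est_row, gt_row):
            total += abs(est - gt)
            count += 1
    if count == 0:
        raise ValueError("maps must not be empty")
    return total / count


if __name__ == "__main__":
    est = [[1, 2], [3, 4]]
    gt = [[1, 3], [3, 2]]
    print(mean_abs_disparity_error(est, gt))
```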
Table 1 also gives computation times on both sets of images. We used an Intel Core 2 Duo CPU E4700 and 2 GB of memory. Both correlation-based methods are implemented using CUDA on an NVIDIA Quadro FX 3700 graphics card. Times include the computation of the whole set of disparity maps. Overall, methods using the occlusion approach are slower than the other methods. This is because they have more possibilities to take into account: the correlation-based method has more tests to carry out, and the graph-cuts-based method has a more complicated graph structure, which is slower to solve.
Figure 6 shows results obtained without and with occlusion handling. Figure 6(a) illustrates the standard method, whereas Figure 6(b) corresponds to the method we have presented. We can see in these images that our method has precisely detected occlusions on the boundaries of the columns. However, areas without depth discontinuities, like the background, contain errors. We think that this is due to the noise sensitivity of our occlusion detection. The first three extracts given in Figures 7(a), 7(b), and 7(c) show the details of these results on the front column. We observe that our correlation-based method has accurately defined discontinuities fitting the real boundaries of the column, but disparities in occluded areas contain errors (noise in Figure 7(c)), since the method has difficulty finding them in such areas. Ultimately, our approach is not well suited to the principle of correlation-based methods. In fact, in such methods the selection of disparities is independent for each pixel, and the consistency from one disparity map to another cannot be ensured.
On the other hand, graph-cuts based methods allow the symmetrical minimization of energy, ensuring a strong consistency. Figures 6(c) and 6(d) show results with the 1D smoothness constraint, and Figures 6(e) and 6(f) show results with the 2D smoothness constraint. Again, the method using our occlusion approach has very accurate depth discontinuities at object boundaries. In our 1D smoothness method (Figure 6(d)), the horizontal line effect is more visible than in the classical method (Figure 6(c)). This is due to the fact that our occlusion approach penalizes any disparity variation, because it is detected as an occlusion. Our method tends to add a strong smoothness constraint along epipolar lines. The 2D smoothness constraint allows compensating for this artifact. The images in Figures 7(d) and 7(e) show the front column obtained with these methods. We notice that without occlusion handling the method cannot find the disparities of some pixels, which are not visible in all images. On the other hand, it accurately detects the occlusion using our occlusion approach. However, we observe in Figure 6(f) that the column in the background (in a white ellipse) is not well defined. This is due to the principle of plane-sweeping algorithms and to the fact that this column is actually between two planes. The reader can refer to  for a presentation of our refinement step, which solves this problem.
We have introduced a new approach for handling the occlusions of a scene in a multiview context. As a proof of the relevance of this new detection rule, we have presented two methods with the particularity of handling object boundaries very accurately. Even if these methods can handle two-view stereovision, they are designed for the multiview context with any number of views. The results we obtain show that our occlusion approach succeeds in detecting object boundaries, at the expense of computation time, and can still create disparities even for pixels that are not visible in all views. Moreover, used on symmetrical energy-minimization-based methods, our approach ensures a geometric consistency, which is crucial for autostereoscopic displays. However, computation time is the main problem of our methods. For this reason, our objective is to find a means of minimizing the energy faster. One idea is a GPU implementation of the graph cuts. Some work has already been done in this domain [19, 20] but we are not using it for the moment since it imposes many constraints on the graph structure and must be adapted to our energy. Another possibility that we are working on is to reduce the number of nodes used in our graph, in order to simplify the maximum flow problem.
The work presented in this paper was supported in part by the “Agence Nationale de la Recherche” as part of the CamRelief project. This project is a collaboration between the University of Reims Champagne-Ardenne and 3DTV Solutions. The authors wish to thank Didier Debons, Michel Frichet, and Florence Debons for their contribution to the project.
A. Klaus, M. Sormann, and K. Karner, “Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure,” in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), vol. 3, pp. 15–18, Hong Kong, August 2006.
Q. Yang, L. Wang, R. Yang, H. Stewénius, and D. Nistér, “Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 492–504, 2009.
M. Proesmans, L. van Gool, E. Pauwels, and A. Oosterlinck, “Determination of optical flow and its discontinuities using non-linear diffusion,” in Proceedings of the 3rd European Conference on Computer Vision (ECCV '94), vol. 2, pp. 295–304, Springer, Secaucus, NJ, USA, 1994.
L. Alvarez, R. Deriche, T. Papadopoulo, and J. Sánchez, “Symmetrical dense optical flow estimation with occlusions detection,” in Proceedings of the 7th European Conference on Computer Vision-Part I (ECCV '02), pp. 721–735, Springer, London, UK, 2002.
C. Strecha and L. Van Gool, “PDE-based multi-view depth estimation,” in Proceedings of the 1st International Symposium on 3D Data Processing Visualization and Transmission (3DPVT '02), pp. 416–425, 2002.
P.-M. Jodoin, C. Rosenberger, and M. Mignotte, “Detecting half-occlusion with a fast region-based fusion procedure,” in Proceedings of the British Machine Vision Conference, pp. 417–426, 2006.
J. Woetzel and R. Koch, “Real-time multi-stereo depth estimation on GPU with approximative discontinuity handling,” in Proceedings of the 1st European Conference on Visual Media Production (CVMP '04), pp. 245–254, London, UK, March 2004.