Abstract

Based upon the framework of structural support vector machines, this paper proposes two approaches to depth restoration for different scenes, namely, margin rescaling and slack rescaling. The results show that both approaches converge quickly, while the slack approach yields better prediction accuracy. However, its nondecomposable nature limits the application of the slack approach. This paper therefore introduces a novel approximate slack method to solve this problem, in which we propose a modified way of defining the loss function that ensures the decomposability of the objective function. During training, a bundle method is used to improve computational efficiency. The results on the Middlebury datasets show that the proposed depth inference method overcomes the nondecomposability of the slack scaling method and achieves acceptable accuracy. Our approximate approach can thus serve as an alternative to the slack scaling method when efficient computation is required.

1. Introduction

Learning for stereo vision has been a challenging subject for a long time. Owing to the growing availability of ground-truth datasets, considerable progress has been achieved, for example, using the scene structure of input images to learn a probability distribution model for matching [14] and adopting an expectation-maximization algorithm to estimate disparity and then relearn the model parameters based on that estimate [5]. Although these methods have shown exciting results, their shortcoming is obvious: the parameters must be preset or initialized manually on the basis of prior knowledge. In [6], a new supervised machine learning method based on conditional random fields (CRFs) was proposed to handle this problem, and the results showed a promising future.

As mentioned above, supervised image labeling has been a long-standing problem in computer vision. In recent years, CRFs have become a popular way to address this problem [7, 8]: the spatial correlations among neighboring pixels are incorporated by defining proper unary and pairwise potential functions on the related pixels. Support vector machines have also been widely used in image labeling [9], but they are less successful, producing noisy labeling results because spatial correlations are not taken into account.

Recently, structured prediction has attracted widespread attention, and many new approaches have been proposed. Structured learning approaches solve the above-mentioned problems: in their computation, both inputs and outputs are well structured, and strong internal correlations are exploited. Structured learning is formulated as the learning of complex functional dependencies between multivariate input and output representations. It has had a significant impact on important computer vision tasks including image denoising [10], stereo [11], segmentation [12, 13], object localization [14, 15], and human pose estimation [16, 17]. A common way is to generalize max-margin binary/multiclass classification to incorporate structured information [14, 18–20]. It has been utilized in many areas, such as sequence labeling, image segmentation, grammar parsing, dependency parsing, bipartite matching, and text segmentation [21]. Furthermore, with the development of SVMs, the introduction of structured information has generated two new support vector machines, named the margin-scaling-based and slack-scaling-based SSVMs, respectively.

The max-margin method, thanks to the decomposability of its error function, makes it possible to find the most violated constraint using a maximum a posteriori (MAP) inference algorithm [21]. But its shortcomings are also obvious: it requires the error function to be linearly comparable with the features, and it is sensitive to the most violating label; a label with a large error greatly decreases the separability of the other labels. An alternative choice is the slack scaling method. It keeps a fixed margin of 1 and scales the violations in proportion to their errors, which provides excellent accuracy. However, due to the nondecomposability of its error function, the slack method is not widely used. We therefore propose an approximation method which modifies the slack method while preserving its desirable properties. Depending on the given task, the proposed approximation method makes it possible to design a suitable loss function and generate the corresponding solver.

This paper is organized as follows. In Section 2, we briefly discuss the principles of the SSVM. Our approach is proposed in Section 3, including the steps to construct the structural support vector machine, the typical max-margin method, and the expression of the improved slack method. Section 4 elaborates an approximation of the slack method. Section 5 describes the feature vectors used in our algorithm. In Section 6, the relevant conditions and strategies for training are discussed and improved to make the training more efficient. Finally, we apply both methods to depth restoration and make a detailed comparison between them.

2. Structural Support Vector Machine

Derived from statistical machine learning, discriminative models focus on the posterior probability and have been viewed as the most successful techniques for structured prediction. Here $x \in \mathcal{X}$ is an input sample in the input space $\mathcal{X}$ and $y \in \mathcal{Y}$ is the associated output in the output space $\mathcal{Y}$. Given a feasible training set $\{(x^{i}, y^{i})\}_{i=1}^{n}$ of training samples $x^{i}$ and their associated ground-truth outputs $y^{i}$, firstly a model for $P(y \mid x)$ is learnt such that the correct label has a higher probability than any wrong label, that is, $P(y^{i} \mid x^{i}) > P(y \mid x^{i})$ for all $y \neq y^{i}$, and secondly, the model performs prediction by MAP estimation for a new sample $x$:
$$y^{*} = \arg\max_{y \in \mathcal{Y}} P(y \mid x).$$
Under the framework of CRFs, $P(y \mid x)$ is modeled by a log-linear model, which is often assumed to be
$$P(y \mid x) = \frac{1}{Z(x)} \exp\bigl(\langle w, \Psi(x, y) \rangle\bigr), \qquad Z(x) = \sum_{y \in \mathcal{Y}} \exp\bigl(\langle w, \Psi(x, y) \rangle\bigr),$$
where $\Psi(x, y)$ is a joint feature map encoding a certain relationship between the input and its output, and the second term, the normalization factor $Z(x)$, makes $P(y \mid x)$ a valid probability distribution.

By adopting the framework of the max-margin method, the structural support vector machine tries to learn the weight vector $w$, denoting the $w$-parameterized model, to predict the correct output labels. The optimization problem that results from the learning can then be written as
$$\min_{w, \xi} \ \frac{1}{2}\|w\|^{2} + \frac{C}{n} \sum_{i=1}^{n} \xi_{i}$$
subject to
$$\langle w, \Psi(x^{i}, y^{i}) \rangle - \langle w, \Psi(x^{i}, y) \rangle \geq \Delta(y^{i}, y) - \xi_{i}, \quad \forall y \neq y^{i}.$$
Here, $i$ runs from 1 to $n$ over the different samples, $y$ is a label that is not equal to the true label $y^{i}$, $\Delta(y^{i}, y)$ denotes the loss between the two labels, and $\xi_{i}$ are the slack variables. Thus, the most violated constraint can be found by solving
$$\hat{y} = \arg\max_{y \in \mathcal{Y}} \ \Delta(y^{i}, y) + \langle w, \Psi(x^{i}, y) \rangle,$$
where $\langle w, \Psi(x, y) \rangle$ is the discriminative function. Therefore, defining the energy $E(x, y; w) = -\langle w, \Psi(x, y) \rangle$, the search is reformulated as the minimization problem
$$\hat{y} = \arg\min_{y \in \mathcal{Y}} \ E(x^{i}, y; w) - \Delta(y^{i}, y).$$
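As a minimal illustration of this loss-augmented inference step, the following Python sketch finds the most violated label over a small, explicitly enumerable label space; the feature map psi, the loss delta, and the label set are hypothetical placeholders, and in practice the maximization is carried out by a MAP inference algorithm rather than by enumeration.

import numpy as np

def most_violated_label(w, psi, delta, x, y_true, label_space):
    """Brute-force loss-augmented inference for margin rescaling.

    Finds argmax_y [ Delta(y_true, y) + <w, Psi(x, y)> - <w, Psi(x, y_true)> ],
    which equals argmin_y [ E(x, y; w) - Delta(y_true, y) ] when the energy is
    defined as E(x, y; w) = -<w, Psi(x, y)>.
    """
    best_y, best_violation = None, -np.inf
    score_true = w @ psi(x, y_true)
    for y in label_space:
        violation = delta(y_true, y) + w @ psi(x, y) - score_true
        if violation > best_violation:
            best_y, best_violation = y, violation
    return best_y, best_violation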

3. Our Approach

3.1. Problem Formulation

In stereo matching tasks, stereo images are two (or more) images of the same scene taken from different views, named the left image (reference image) and the right image, respectively. Assume that the right view image is just a horizontal shift of the left view, and that the two images are of the same size $M \times N$. Denote by $p_{ij}^{L}$ the pixel at the intersection of the $i$th row and $j$th column of the reference image, and by $p_{ij}^{R}$ the pixel at the same position in the right image. The matching is aimed at finding the pixel-wise disparity $d$ which minimizes the energy
$$E(d) = \sum_{p} D_{p}(d_{p}) + \sum_{(p, q) \in \mathcal{N}} V(d_{p}, d_{q}),$$
where $d_{p}$ denotes the local disparity at pixel $p$, $D_{p}$ is the data term, and $V$ is the smoothness term, which usually takes the form of the Potts model
$$V(d_{p}, d_{q}) = \lambda \cdot [d_{p} \neq d_{q}],$$
where $p$ and $q$ are the indices of neighboring pixels, $d_{p}$ and $d_{q}$ represent the neighboring disparity labels, and $\lambda$ is a constant penalty.
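For concreteness, the following sketch evaluates this energy for a given disparity map on a 4-connected grid, assuming hypothetical precomputed per-pixel data costs; it only illustrates the energy and is not the solver used in this paper.

import numpy as np

def stereo_energy(data_cost, disparity, penalty):
    """Energy of a disparity map: unary data costs plus a Potts smoothness term.

    data_cost : (H, W, L) array; data_cost[i, j, d] is the cost of assigning
                disparity d to pixel (i, j) (hypothetical precomputed costs).
    disparity : (H, W) integer array of disparity labels.
    penalty   : Potts constant added for every pair of 4-connected neighbours
                that carry different labels.
    """
    H, W = disparity.shape
    rows, cols = np.indices((H, W))
    data = data_cost[rows, cols, disparity].sum()
    smooth = penalty * ((disparity[:, :-1] != disparity[:, 1:]).sum()
                        + (disparity[:-1, :] != disparity[1:, :]).sum())
    return data + smooth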

Normally the features of $p^{L}$ and $p^{R}$ represent certain categories of visual information, for example, color, texture, or gradient. However, each category suits different situations. Texture features work well in boundary regions, which are usually richly textured, but are not applicable in weakly textured regions; gradient-based features have the opposite character. In addition, different categories of features are not easy to combine for learning. Simply expanding the dimension of the feature vector to involve more features from different categories is dangerous because of sampling effects and differing scales: the highly weighted features will greatly influence the final results and suppress the other features. Therefore, the data term should be constructed as a weighted combination, $D_{p}(d_{p}) = \sum_{k} w_{k}\bigl(f_{k}^{L}(p) - f_{k}^{R}(p - d_{p})\bigr)^{2}$, where the unary weight parameters $w_{k}$ balance the components of the combined feature vector against the sampling effect and the different scales. These parameters can be learnt from training examples.

By expanding the squared difference in the data term, we obtain three terms, namely $f^{L}(p)^{2}$, $-2 f^{L}(p)\, f^{R}(p - d_{p})$, and $f^{R}(p - d_{p})^{2}$. During training we use constraints of the form $E(x, y; w) - E(x, y^{gt}; w)$, where $y^{gt}$ is the ground truth; the term $f^{L}(p)^{2}$ is canceled out by the subtraction because it is independent of the label $d_{p}$. Parameters acting on the remaining terms can balance the difference between $f^{L}$ and $f^{R}$ caused by the sampling effect and the camera settings. Overall, for each feature channel the per-pixel data features are built as $\bigl[\, f^{R}(p - d_{p})^{2}, \; -2 f^{L}(p)\, f^{R}(p - d_{p}) \,\bigr]$, concatenated over all channels.

3.2. Max-Margin Formulation for Stereo Learning

Assuming a learnt pairwise weight $w_{s}$, the parameter vector can be denoted as $w = [w_{u}, w_{s}]$, and the energy is written as $E(x, y; w) = \langle w, \Phi(x, y) \rangle$. Here $\Phi(x, y)$ is the joint feature vector including the data term features and also the smoothness term. The energy on the ground truth should be minimal, that is, for all possible $y$ we have $E(x, y^{gt}; w) \leq E(x, y; w)$. By adopting margin scaling and adding the slack variables to account for violations, the optimization problem reads, for $i = 1, \ldots, n$,
$$\min_{w, \xi} \ \frac{1}{2}\|w\|^{2} + \frac{C}{n} \sum_{i=1}^{n} \xi_{i} \quad \text{s.t.} \quad E(x^{i}, y; w) - E(x^{i}, y^{i}; w) \geq \Delta(y^{i}, y) - \xi_{i}, \quad \forall y \in \mathcal{Y}.$$
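A minimal sketch of how such a joint feature vector can be assembled is given below, assuming hypothetical precomputed per-pixel data features; the last component counts label disagreements between neighbouring pixels, so that its weight plays the role of the Potts penalty.

import numpy as np

def joint_feature(data_features, disparity):
    """Joint feature vector Phi(x, y) such that E(x, y; w) = <w, Phi(x, y)>.

    data_features : (H, W, L, K) array; data_features[i, j, d] is the K-dim
                    unary feature of assigning disparity d to pixel (i, j)
                    (a hypothetical stand-in for the data term of Section 3.1).
    disparity     : (H, W) integer array of disparity labels.
    """
    H, W, L, K = data_features.shape
    rows, cols = np.indices((H, W))
    unary = data_features[rows, cols, disparity].reshape(-1, K).sum(axis=0)
    disagreements = ((disparity[:, :-1] != disparity[:, 1:]).sum()
                     + (disparity[:-1, :] != disparity[1:, :]).sum())
    return np.concatenate([unary, [disagreements]])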

3.3. Slack Scaling Formulation

The margin rescaling method requires the label loss $\Delta(y^{i}, y)$ to be linearly comparable with the feature values. However, this is normally hard to satisfy in structured learning, since $\Delta$ accumulates the loss over every pixel in the image, so the aggregate value is much larger than the feature values. In stereo matching tasks in particular, the pixel-wise loss may reach up to hundreds, which makes the overall loss even larger. We would therefore like to adopt slack scaling, as it is invariant to the scale of the label loss. Nevertheless, the slack rescaling formulation is difficult to solve, because no efficient algorithm exists for its loss-augmented inference problem. We follow the method introduced in [21] to solve this problem.

The slack rescaling optimization formulation is as follows:
$$\min_{w, \xi} \ \frac{1}{2}\|w\|^{2} + \frac{C}{n} \sum_{i=1}^{n} \xi_{i} \quad \text{s.t.} \quad E(x^{i}, y; w) - E(x^{i}, y^{i}; w) \geq 1 - \frac{\xi_{i}}{\Delta(y^{i}, y)}, \quad \forall y \neq y^{i}.$$

4. The Approximation for Slack Scaling

For the slack scaling optimization formulation, the inference engine problem is to find the most violated label
$$\hat{y} = \arg\max_{y \in \mathcal{Y}} \ \Delta(y^{i}, y)\, H(y), \qquad H(y) = 1 - \bigl(E(x^{i}, y; w) - E(x^{i}, y^{i}; w)\bigr),$$
where $\hat{y}$ is the most violating label, the associated slack variable is $\xi_{i} = \max\bigl(0, \Delta(y^{i}, \hat{y})\, H(\hat{y})\bigr)$, and $\mathcal{Y}$ is the set of all candidate labelings.

As can be seen from this formulation, because the label loss $\Delta(y^{i}, y)$ multiplies the whole margin term and must therefore be evaluated over the entire labeling, the second part of the formula cannot be decomposed easily. Thus, an approximation is used to take the place of this product and make it decomposable into local parts.

It should be noted that the resulting objective is concave, and it has been proved in [22] that it can be approximated in the form of a function that is linear with respect to an auxiliary variable. The linearization and the approximation procedure are shown in the following parts.

4.1. Linearization and Approximation

According to [22], a concave function can be expressed in a linear form with respect to an auxiliary variable $\lambda > 0$. The slack-scaled objective above can therefore be rewritten, for every label $y$, as a function $G(y, \lambda)$ that is linear in $\lambda$ and that, for a fixed $\lambda$, decomposes into local parts. The aim of the inference problem is then to find the optimal label, which requires optimizing this expression over both $y$ and $\lambda$; therefore, we arrive at the following formulation.

Here, let
$$F(\lambda) = \min_{y \in \mathcal{Y}} G(y, \lambda),$$
which leads to the simplified formulation: for each fixed $\lambda$, only a decomposable minimization over labels remains, followed by a one-dimensional search over $\lambda$.

For a fixed $\lambda$, firstly the optimal label
$$y_{\lambda} = \arg\min_{y \in \mathcal{Y}} G(y, \lambda)$$
can be computed through this minimization (e.g., by graph cuts).

Then, $y_{\lambda}$ can be substituted into the formula to evaluate $F(\lambda) = G(y_{\lambda}, \lambda)$. We can then find a $\lambda$ at which $F$ attains its maximum: since $F$ is the pointwise minimum of a set of functions that are linear in $\lambda$, it is concave in $\lambda$ and has a single maximum on the interval of interest.

With the help of a line search algorithm such as golden search, the maximum of $F(\lambda)$ can be acquired in an efficient way. During the search procedure, many different values of $\lambda$ are encountered; by evaluating $y_{\lambda}$ for each $\lambda$, we obtain different candidate labels. The goal is to find, among these candidates, the label that attains the optimum of the original slack-scaled objective, which is denoted as $\hat{y}$.
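As a minimal illustration of this search, the sketch below runs a golden-section line search to maximize a concave $F(\lambda)$ over a given interval and returns the best candidate encountered; the callable F, which is assumed to evaluate the inner minimization over labels for a fixed $\lambda$ and to return its value together with the minimizing label, and the interval bounds are placeholders rather than the exact implementation used in this paper.

def lambda_search(F, lam_lo, lam_hi, tol=1e-4):
    """Golden-section search for the maximiser of a concave F over [lam_lo, lam_hi].

    F(lam) -> (value, label): value of the inner minimisation over labels for a
    fixed lambda, together with the minimising label.
    """
    c = (5 ** 0.5 - 1) / 2          # reduction factor, approximately 0.618
    a, b = lam_lo, lam_hi
    x1, x2 = b - c * (b - a), a + c * (b - a)
    f1, y1 = F(x1)
    f2, y2 = F(x2)
    while b - a > tol:
        if f1 > f2:                 # the maximum lies in [a, x2]
            b, x2, f2, y2 = x2, x1, f1, y1
            x1 = b - c * (b - a)
            f1, y1 = F(x1)
        else:                       # the maximum lies in [x1, b]
            a, x1, f1, y1 = x1, x2, f2, y2
            x2 = a + c * (b - a)
            f2, y2 = F(x2)
    return (y1, f1) if f1 > f2 else (y2, f2)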

4.2. The Determination of the Interval for $\lambda$

Since the simple constraint $\lambda > 0$ has already been given, zero is an obvious lower bound for $\lambda$. However, if $\lambda$ is too small, it will be hard to distinguish between different labels in the early iterations, because the different losses $\Delta(y^{i}, y)$ are then almost neglected. The lower bound $\lambda_{\min}$ is therefore chosen according to $\Delta_{\max}$, the maximal possible label loss, and $\epsilon$, the tolerance on the difference between two consecutive iterations of the algorithm. In this way, a proper lower bound is obtained.

Then we come to determining the upper bound of $\lambda$. It is sufficient to find an upper bound $\lambda_{\max}$ such that the inner minimization returns the same label for any $\lambda \geq \lambda_{\max}$; this requirement leads to a condition on $\lambda_{\max}$.

Here, let $\delta_{\min}$ be the minimal possible difference between the losses of two distinct labels, for example $\delta_{\min} = 1$ for the Hamming loss. The right-hand side of the condition then depends only on $\delta_{\min}$, which determines the required value of $\lambda_{\max}$.

Since $\lambda_{\max}$ only needs to be an upper bound, it can simply be set to this value.

5. Construction of Feature Vector

Image features are terms used to describe images and provide cues for distinguishing between them. Some image features are basic visual features, while others are defined for specific applications. Three types of features are used in this paper: color, texture, and edge features.

5.1. Color Features

Color features are the basic visual description of images. Generally, color features are based on the characteristics of pixels, and each pixel in the image or image region makes its own contribution to the color features. As global features, however, they are not sensitive to changes in the size or orientation of the image or image region; in other words, color features cannot capture the local characteristics of the image. Moreover, because they are not unique, pixels belonging to different objects may share the same color features. Two basic color descriptions are the RGB color space and the YCbCr color space. While RGB concentrates on the gray levels of the pixels, YCbCr pays close attention to intensity, chromaticity, and color difference: the Y channel represents the intensity of the color, while the Cb and Cr channels denote the chromaticity for blue and red, respectively. The YCbCr color space can be obtained from the RGB color space by a simple linear transformation. Both the RGB and YCbCr color features are shown in Figure 1. In this paper, we use both RGB and YCbCr as color features in the training process.
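As an illustration of this linear transformation, the following sketch converts an RGB image to YCbCr; the ITU-R BT.601 coefficients used here are a common choice, and it is an assumption that this paper uses this particular variant.

import numpy as np

def rgb_to_ycbcr(image):
    """Convert an (H, W, 3) RGB image with values in 0..255 to YCbCr (BT.601)."""
    rgb = image.astype(np.float64)
    transform = np.array([[ 0.299,     0.587,     0.114   ],   # Y
                          [-0.168736, -0.331264,  0.5     ],   # Cb
                          [ 0.5,      -0.418688, -0.081312]])  # Cr
    ycbcr = rgb @ transform.T
    ycbcr[..., 1:] += 128.0   # offset the chroma channels
    return ycbcr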

5.2. Texture Features

Similar to color features, texture features are also global features. The major difference is that texture features describe the statistical characteristics of the pixels in an image region. Texture features have the properties of rotational invariance and noise immunity, but they are sensitive to the resolution of the image: if the resolution changes, different features may be generated. On top of that, lighting and reflections on the surfaces of objects may make it hard to compute the texture features.

In [23], Laws developed a method for computing texture features. According to this method, different convolution kernels, named Laws' masks, are applied to the images, and the results capture certain characteristics of the images. Here, the 2-D Laws' masks can be generated from the following small 1-D kernels of lengths 3 and 5:
$$L3 = [1, 2, 1], \qquad E3 = [-1, 0, 1], \qquad S3 = [-1, 2, -1],$$
$$L5 = [1, 4, 6, 4, 1], \quad E5 = [-1, -2, 0, 2, 1], \quad S5 = [-1, 0, 2, 0, -1], \quad W5 = [-1, 2, 0, -2, 1], \quad R5 = [1, -4, 6, -4, 1].$$

Here, $L$ denotes the average gray level, $E$ denotes the edge features, $S$ stands for extracting spots in the image, $W$ stands for extracting the wave feature, and $R$ stands for extracting ripples in the image.

In order to generate the 2-D Laws' masks, we multiply a vertical 1-D kernel by a horizontal 1-D kernel (an outer product), such as $L3^{T} E3$. Taking the 3 × 3 masks as an example, all the possible masks are listed in Table 1. After convolving an image of size M × N with these masks, a gray-scale texture feature image of size (M − mask_size + 1) × (N − mask_size + 1) is generated. Figure 2 demonstrates the texture feature results generated by the 3 × 3 Laws' masks.
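The following sketch illustrates this construction with the 3 × 3 masks: the nine masks are formed as outer products of the 1-D kernels listed above, and a "valid" convolution yields the (M − mask_size + 1) × (N − mask_size + 1) feature maps. It is a sketch under the assumption that the standard Laws kernels are used.

import numpy as np
from scipy.signal import convolve2d

L3 = np.array([ 1, 2, 1])   # level / average grey value
E3 = np.array([-1, 0, 1])   # edge
S3 = np.array([-1, 2, -1])  # spot

def laws_masks_3x3():
    """All nine 3x3 Laws masks as outer products of a vertical and a horizontal kernel."""
    kernels = {'L3': L3, 'E3': E3, 'S3': S3}
    return {a + b: np.outer(ka, kb) for a, ka in kernels.items()
                                    for b, kb in kernels.items()}

def laws_texture_features(gray):
    """Convolve a grey-scale image with every 3x3 Laws mask ('valid' mode)."""
    return {name: convolve2d(gray, mask, mode='valid')
            for name, mask in laws_masks_3x3().items()}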

5.3. Edge Features

An object edge is a visual feature marking a discontinuity in a local image region, where the intensity changes significantly. Generally, the pixels along an edge have smoothly changing gray levels, whereas in the direction perpendicular to the edge the pixel intensity changes sharply.

The features described above are local visual features and, as such, describe the surface appearance of objects. The edge features, on the other hand, measure local compatibility. In this paper, four Prewitt edge detectors oriented at 0°, 45°, 90°, and 135° are adopted to extract the edge features. The detectors in the different directions and the corresponding results are shown in Figure 3. By applying the four detectors, almost all the edges in the images can be captured.
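The sketch below applies four directional Prewitt kernels to a grey-scale image; the paper does not list the exact coefficients, so the kernels used here are one common choice and should be treated as an assumption.

import numpy as np
from scipy.signal import convolve2d

PREWITT = {
    0:   np.array([[ 1,  1,  1], [ 0,  0,  0], [-1, -1, -1]]),
    45:  np.array([[ 1,  1,  0], [ 1,  0, -1], [ 0, -1, -1]]),
    90:  np.array([[-1,  0,  1], [-1,  0,  1], [-1,  0,  1]]),
    135: np.array([[ 0,  1,  1], [-1,  0,  1], [-1, -1,  0]]),
}

def prewitt_edge_features(gray):
    """One edge-response map per direction for a grey-scale image."""
    return {angle: convolve2d(gray, kernel, mode='same')
            for angle, kernel in PREWITT.items()}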

6. Parameter Learning and Inference Problem

6.1. Bundle Method for Parameter Learning

For parameter learning, this paper utilizes the bundle method. Starting from the optimization formulation above, in order to obtain the optimal parameters, the constraints can be rearranged in the following form:
$$\xi_{i} \geq \Delta(y^{i}, y) - \bigl(E(x^{i}, y; w) - E(x^{i}, y^{i}; w)\bigr), \quad \forall y \in \mathcal{Y}.$$

This formula means that the slack variable $\xi_{i}$ is lower bounded by the violation of every constraint. It then generates the objective function used to find the most violated constraint:
$$\hat{y} = \arg\max_{y \in \mathcal{Y}} \ \Delta(y^{i}, y) - \bigl(E(x^{i}, y; w) - E(x^{i}, y^{i}; w)\bigr).$$

Thus, this forms an inference problem. The bundle method can guarantee the optimal solution within a small number of iterations, so the problem can be solved efficiently. Algorithms 1 and 2 provide the parameter learning algorithms for the margin and slack methods, respectively.

Input: data $\{x^{t}\}$, labels $\{y^{t}\}$, size $T$, tolerance $\epsilon$
Initialize parameter $w \leftarrow 0$, constraint set $\mathcal{W} \leftarrow \emptyset$
Repeat
  for t = 1 to T
    $\hat{y}^{t} \leftarrow \arg\min_{y} \ E(x^{t}, y; w) - \Delta(y^{t}, y)$
  end for
  increase constraint set $\mathcal{W} \leftarrow \mathcal{W} \cup \{\hat{y}^{1}, \ldots, \hat{y}^{T}\}$
  $(w, \xi) \leftarrow$ solve the QP using all the existing constraints in $\mathcal{W}$
Until the change of the objective between two iterations is less than $\epsilon$

Input: data $\{x^{t}\}$, labels $\{y^{t}\}$, size T, tolerance $\epsilon$
Initialize parameter $w \leftarrow 0$, constraint set $\mathcal{W} \leftarrow \emptyset$
Repeat
  for t = 1 to T
    $\hat{y}^{t} \leftarrow$ the most violated label under slack scaling, found by the golden search over $\lambda$ (Section 4)
  end for
  increase constraint set $\mathcal{W} \leftarrow \mathcal{W} \cup \{\hat{y}^{1}, \ldots, \hat{y}^{T}\}$
  $(w, \xi) \leftarrow$ solve the QP using all the existing constraints in $\mathcal{W}$
Until the change of the objective between two iterations is less than $\epsilon$

Both the margin and slack methods involve optimal inference problems, so the best solution for them can be obtained via a standard graph-cuts algorithm (see [8] for details). The two frameworks look the same, but the inference engine in Algorithm 2 differs from that in Algorithm 1: the slack objective must first be approximated into a linear form, after which the best $\lambda$ is searched for in the interval $[\lambda_{\min}, \lambda_{\max}]$ by the golden search algorithm.
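The following Python skeleton sketches the shared cutting-plane (bundle) training loop; the inference and qp_solve callables stand in for the inference engine (graph cuts for margin rescaling, the $\lambda$ search of Section 4 for slack rescaling) and for the QP solver over the working set, and are assumptions rather than the paper's actual implementation.

import numpy as np

def cutting_plane_training(samples, dim, inference, qp_solve, tol, max_iter=100):
    """Skeleton of the cutting-plane / bundle parameter learning loop.

    samples   : list of (x, y_true) training pairs.
    inference : callable (w, x, y_true) -> most violated label for the current w.
    qp_solve  : callable (working_set) -> (w, xi), re-optimising the QP over the
                constraints collected so far.
    """
    w = np.zeros(dim)
    xi = 0.0
    working_set = []
    for _ in range(max_iter):
        for x, y_true in samples:
            working_set.append((x, y_true, inference(w, x, y_true)))
        w_new, xi_new = qp_solve(working_set)
        if abs(xi_new - xi) < tol:   # objective change below the tolerance
            return w_new
        w, xi = w_new, xi_new
    return w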

6.2. Golden Searching

In this paper, we adopt the golden search algorithm when searching for the best approximation of the optimal label.

Firstly, suppose that there exists a continuous function over an interval $[a, b]$ with only one minimum or maximum in the interval. Taking the minimum case as an example, binary search is not the optimal algorithm for locating the minimum, as shown in the following.

Take the middle point $m = (a + b)/2$; then two different probe points $x_{1}$ and $x_{2}$ are placed around $m$ such that $x_{1} < x_{2}$. If $f(x_{1}) < f(x_{2})$, the interval is updated to $[a, x_{2}]$; otherwise $[x_{1}, b]$ becomes the new interval. Obviously, each iteration step requires two new function evaluations, which is not optimal.

In order to optimize the iteration process, we introduce a factor capable of reducing the interval, denoted $c$. For the two probe points $x_{1}$ and $x_{2}$ in the interval $[a, b]$, there are two different cases.

If the minimum lies in $[a, x_{2}]$ (Case 1), the interval becomes $[a, x_{2}]$ and its size is compressed by $c$, that is, $x_{2} - a = c\,(b - a)$. Requiring that the old point $x_{1}$ can be reused as a probe point of the new interval gives $x_{1} - a = c\,(x_{2} - a) = c^{2}(b - a)$; since $x_{1} - a = (1 - c)(b - a)$, as a result $c^{2} = 1 - c$.

If the minimum lies in $[x_{1}, b]$ (Case 2), the interval is similarly compressed by $c$ and the new interval is $[x_{1}, b]$; by the symmetric argument, $c$ is again obtained from $c^{2} = 1 - c$, which gives $c = (\sqrt{5} - 1)/2 \approx 0.618$.

Obviously, once the factor $c$ is determined, it is easy to locate the points $x_{1}$ and $x_{2}$ in the interval. There are two update rules for Cases 1 and 2, respectively, given after Algorithm 3, while Algorithm 3 shows the golden search algorithm.

Input: interval $[a, b]$, reduction factor $c$, tolerance $\epsilon$
Initialize $x_{1} \leftarrow b - c\,(b - a)$, $f_{1} \leftarrow f(x_{1})$
       $x_{2} \leftarrow a + c\,(b - a)$, $f_{2} \leftarrow f(x_{2})$
Repeat
  If $f_{1} < f_{2}$
       $b \leftarrow x_{2}$, $x_{2} \leftarrow x_{1}$, $f_{2} \leftarrow f_{1}$
       $x_{1} \leftarrow b - c\,(b - a)$
       $f_{1} \leftarrow f(x_{1})$
  Else
       $a \leftarrow x_{1}$, $x_{1} \leftarrow x_{2}$, $f_{1} \leftarrow f_{2}$
       $x_{2} \leftarrow a + c\,(b - a)$
       $f_{2} \leftarrow f(x_{2})$
  end If
Until abs$(b - a) < \epsilon$

Rule 1. If $f(x_{1}) < f(x_{2})$, set $b \leftarrow x_{2}$ and $x_{2} \leftarrow x_{1}$; then compute the new $x_{1} = b - c\,(b - a)$.

Rule 2. If $f(x_{1}) \geq f(x_{2})$, set $a \leftarrow x_{1}$ and $x_{1} \leftarrow x_{2}$; then compute the new $x_{2} = a + c\,(b - a)$.

7. Experiments and Results

We test the proposed methods on the Middlebury stereo datasets. The dataset contains several different scenes, namely, art, books, dolls, laundry, moebius, and reindeer. Each scene consists of two ground-truth images, corresponding to view 1 and view 5, together with several images captured from other views. The ground-truth images are used as the label images of each scene, and their labels are compressed from 0–255 to 0–22 for computational efficiency; two neighboring view images are used to extract the different features.

Two groups of features are used in our experiments. The first group consists of local visual features, such as color and texture: the 3-dimensional RGB color channels, the 3-dimensional YCbCr color channels, the 9-dimensional texture features (the outputs of the 3 × 3 Laws' masks), and the 4-dimensional edge features (the outputs of the four Prewitt edge detectors). The second group consists of graph edge features, namely, the absolute difference between the labels of neighboring pixels and a one-dimensional bias constant. In practice, this way of constructing features can produce a large number of dimensions, which supplies a rich set from which suitable features can be chosen to learn the parameters of the desired model. Using these features and the max-margin method, reasonable depths can be obtained for the different scenes, as shown in Figure 4.

7.1. Comparison on Inference Accuracy with Different Feature Combination

Suppose that the ground truth is denoted as $y^{gt}$ and the output result as $\hat{y}$. Defining $N_{c}$ as the number of pixels on which $\hat{y}$ and $y^{gt}$ match and $N_{w}$ as the number of pixels on which $\hat{y}$ differs from $y^{gt}$, the inference accuracy can be written as
$$\text{accuracy} = \frac{N_{c}}{N_{c} + N_{w}},$$
which stands for the ratio of correct output.
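A direct implementation of this measure is straightforward; the short sketch below computes it for two label images of the same size.

import numpy as np

def inference_accuracy(ground_truth, prediction):
    """Ratio of correctly labelled pixels: N_correct / (N_correct + N_wrong)."""
    correct = np.count_nonzero(ground_truth == prediction)
    wrong = np.count_nonzero(ground_truth != prediction)
    return correct / (correct + wrong)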

In order to study the effects of different features, we tested different combinations of image features. For convenience of expression, 1 denotes that a feature was chosen, and 0 otherwise. Figure 5 shows the inference accuracy of the different feature combinations for the second scene, books. Note that the features, arranged from left to right, are RGB, YCbCr, the 3 × 3 Laws' masks, and the edge features; for example, 1000 denotes that only the RGB feature was chosen.

Combining features does not always boost the accuracy of the results: some features have a negative effect on the results while others have a positive one. To test this, we compared feature sets containing a certain feature with the corresponding sets without it. The results show that an offset effect does exist between some features, such as between the color and edge features, while other features do boost the result, such as the textures in most situations (see Figures 6(a), 6(b), 6(c), and 6(d)).

7.2. Comparison between Margin and Slack Methods

To overcome the above-mentioned shortcomings of the max-margin method, this paper adopts the slack scaling method to improve the results. In order to solve the nondecomposability problem, we introduce the approximate algorithm described in Section 4 to make the slack method feasible. Both methods are tested on the Middlebury database (see Figure 7). As shown in Figure 8, the comparison of inference accuracy on the scene art shows that the slack method performs better than the margin method.

7.3. Comparison on the Convergent Properties

To take this a step further, the convergence properties of the margin and slack methods are compared. In the training procedure, the convergence of both methods relies on the bundle method and the one-slack trick. Taking the margin method as an example, the bundle method is applied by rearranging the terms, so that the constraints become
$$\xi_{i} \geq \max_{y \in \mathcal{Y}} \ \Delta(y^{i}, y) - \bigl(E(x^{i}, y; w) - E(x^{i}, y^{i}; w)\bigr).$$

This means that the violation of each constraint is upper bounded by the slack variable $\xi_{i}$. Given the current parameters, the objective function can be optimized using the bundle method, where the most violated constraint is
$$\hat{y} = \arg\max_{y \in \mathcal{Y}} \ \Delta(y^{i}, y) - \bigl(E(x^{i}, y; w) - E(x^{i}, y^{i}; w)\bigr).$$

While the bundle method is able to reach the optimal solution, the one-slack trick makes the procedure converge in a small number of iterations. The computing process of the margin and the slack methods is examined to observe the convergence speed over iterations; the change of the objective function between two consecutive iterations is denoted as itaeps. Figure 9 shows the convergence behavior, indicating that both methods converge within several iterations, while the slack method yields better accuracy without much loss in convergence speed.

8. Conclusion

This paper presented two methods for the depth restoration of different scenes using structural support vector machines. The proposed methods, margin rescaling and slack rescaling, have their own advantages and disadvantages. While the margin rescaling formulation can easily be decomposed into local parts, the slack rescaling method does not admit such a decomposition; in contrast, the slack method clearly outperforms the margin rescaling method in accuracy. Besides this gain in accuracy, the slack rescaling method does not have to sacrifice much convergence speed when computing the parameters. The proposed approximation of the slack rescaling approach solves the decomposability problem and makes it computable in an efficient way. A limitation is that the approximation requires the formulation to be concave, which may be an overly strong constraint. Our future work will focus on these optimization algorithms, including improving the computing speed and enhancing the accuracy of the results.