Abstract

Current semiautomatic 2D-to-3D methods assume that user input is perfectly accurate. However, it is difficult to obtain 100% accurate user scribbles, and even small errors in the input degrade the conversion quality. This paper addresses the issue with a scribble confidence that considers color differences between labeled pixels and their neighbors. First, for each labeled pixel, the numbers of neighbors with similar and with different color values are counted. The ratio between these two numbers at each labeled pixel is regarded as its scribble confidence. Second, the sparse-to-dense depth conversion is formulated as a confident optimization problem by introducing a confidence-weighted data cost term together with local and k-nearest depth-consistency regularization terms. Finally, the dense depth-map is obtained by solving a sparse linear system. The proposed approach is compared with existing methods on several representative images. The experimental results demonstrate that the proposed method can tolerate some errors in user input and can reduce depth-map artifacts caused by inaccurate user input.

1. Introduction

3D videos have attracted more and more attention because they provide an immersive, realistic visual experience by exploiting depth information [1]. With rapid advances in 3D display technologies, the shortage of 3D content has become one of the bottlenecks restricting the development of the entire 3D industry [2]. To remedy this issue, many 2D-to-3D conversion methods have been developed to convert existing 2D images/videos into 3D format by creating depth-maps [3]. Semiautomatic 2D-to-3D conversion can produce high-quality depth-maps from sparse user scribbles by using sparse-to-dense depth propagation. However, current methods assume that user scribbles are perfectly accurate [4–6], and depth quality degrades significantly when inaccurate scribbles are present. Figure 1 shows an experimental result where the scribbles are partly inaccurate. As shown in Figures 1(d) and 1(e), existing methods generate visual artifacts around inaccurately labeled regions. To handle inaccurate input, this paper introduces a confident sparse-to-dense propagation algorithm that obtains accurate depth-maps even from erroneous user scribbles, as in Figure 1(f).

The proposed method is based on the observation that inaccurate input often occurs at or near object boundaries and that the number of correct scribbles is much larger than the number of incorrect ones. The rest of this paper is organized as follows. Section 2 reviews related work on sparse-to-dense depth propagation for 2D-to-3D conversion. The proposed method is described in Section 3. Experimental results are provided in Section 4. Finally, the conclusion and future work are given in Section 5.

2. Related Work

2D-to-3D conversion algorithms can be categorized into manual, automatic, and semiautomatic methods. Manual methods offer the highest-quality conversion results but require precise per-pixel depth assignment, which is time consuming and costly. Automatic methods infer depth information in images/videos by exploiting different depth perception cues such as motion, occlusion, vanishing points, and defocus. Recently, with the popularity of deep learning, many neural networks have been proposed for automatic depth estimation [7–9]. However, existing automatic methods generally provide only a limited 3D effect due to ambiguities between depth and perception cues [2]. Semiautomatic methods are the most widely used schemes for 3D content creation, since they balance conversion quality and production cost. The core step of semiautomatic methods is sparse-to-dense depth propagation on key frames, in which dense depth-maps are estimated from user-assigned sparse depth values. The conversion quality largely depends on the accuracy of the depth-maps at key frames. Thus, this paper focuses on sparse-to-dense depth propagation for semiautomatic 2D-to-3D conversion.

Phan and Androutsos [10] combine random walks (RW) with graph cuts (GC) for sparse-to-dense depth estimation, but incorrectly segmented object boundaries provided by GC may degrade depth quality. Rzeszutek and Androutsos [11] use the domain transform filter to propagate sparse labels throughout an image, but it may smooth out depth edges. Iizuka et al. [12] utilize superpixel-based geodesic distance weighting interpolation and optimization-based edge-preserving smoothing to compute dense depth from user scribbles. Similarly, Wu et al. [13] apply a superpixel-based optimization method to obtain dense depth-maps from sparse input. However, these superpixel-based methods are affected by the performance of superpixel segmentation. Yuan et al. [14] propose a nonlocal RW algorithm to produce a dense depth-map from user scribbles on a single 2D image. Liang and Shen [15] further extend this scheme to process videos. However, RW-based methods cannot modify user-assigned labels, so erroneous input seriously degrades depth accuracy. Lopez et al. [16] incorporate perspective and equality/inequality constraints into an optimization framework for dense depth estimation, but this may place an additional burden on the user. Vosters and Haan [17] propose a line scanning-based sparse-to-dense propagation method with low computation cost, but accuracy may be lost. Revaud et al. [18] use an edge-aware geodesic distance for sparse-to-dense optical flow interpolation, but the result is vulnerable to inaccurate input.

None of the above methods, however, accounts for the possibility of inaccurate scribbles. Thus, these methods give reliable results only for accurate input. To address this issue, the confidence of scribbles is calculated in this paper based on local color variation. Some recent works address error-tolerant interactive image segmentation [19–21]. However, these methods are not well suited to 2D-to-3D conversion, since they mainly focus on foreground and background separation.

3. Method

As shown in Figure 2, the proposed method works as follows. Firstly, the user draws sparse scribbles on the 2D image/key frame, where brighter red regions are closer to the camera. Secondly, depth values at labeled pixels are extracted according to the intensities of the scribbles. Thirdly, the confidence of the scribbles is calculated based on the color variation at labeled regions. Fourthly, an energy function is built in which the scribble confidence is incorporated into the data cost. Finally, the energy function is minimized by solving a sparse linear system to obtain the dense depth-map.

3.1. Scribble Confidence

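It can be observed that pixels in accurately labeled regions often have similar color values, while erroneous input mainly appears at object boundaries with strong variations in color. Based on this observation, the scribble confidence is calculated using the following formula:

$$s_i = \frac{\sum_{j \in \mathcal{N}_i} \delta\left(\left\|\mathbf{c}_i - \mathbf{c}_j\right\|_2\right)}{\left\|\left(\left\|\mathbf{c}_i - \mathbf{c}_j\right\|_2\right)_{j \in \mathcal{N}_i}\right\|_0 + \varepsilon}, \quad i \in \Omega, \tag{1}$$

where $\mathbf{c}_i$ denotes the Lab color values at pixel i. The reason for using the Lab color space is that it takes human perception into account [22]. $\mathcal{N}_i$ is the set of 8 neighbors of pixel i. $\delta(\cdot)$ is the Dirac delta function, which here equals 1 when its argument is zero and 0 otherwise, so the numerator counts the neighbors whose color matches that of pixel i. $\|\cdot\|_0$ and $\|\cdot\|_2$ are the L0 norm and L2 norm, respectively, so the L0 norm in the denominator counts the neighbors whose color differs from that of pixel i. $\varepsilon$ is a small positive constant that prevents division by zero and is set to $10^{-5}$. $\Omega$ is the set of labeled pixels.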

It can be seen from formula (1) that the confidence of a labeled pixel is lower when the color difference between it and its neighboring pixels is larger. Since inaccurate input is mainly located at or near object boundaries, around which the color changes significantly, the proposed method penalizes inaccurate scribbles. One may question whether the confidence of correct scribbles is always high. The confidence of labeled pixels in textured regions is indeed low, so correct labels in these regions will be mistaken for incorrect ones. However, the number of correct scribbles is much larger than the number of incorrect ones. Therefore, the impact on the accurate scribbles can be tolerated.

Figure 3 gives an example of how scribble confidence works. The confidence of inaccurate scribbles that cross object boundaries is low, as shown in Figure 3(d). The current optimization method [6] generates visual artifacts around inaccurately labeled regions, as can be seen from the regions within the red circles in Figure 3(e). These artifacts can be removed if scribble confidence is incorporated into the optimization method, as shown in Figure 3(f).

As shown by the blue square in Figure 3(a), when user scribbles are inside objects, the color variation between a labeled pixel and its neighbors is small; in this case the scribble confidence of the pixel at the center of the blue square is 0.6. When user scribbles approach object boundaries, the color variation between a labeled pixel and its neighbors becomes larger, as shown by the yellow and pink squares in Figure 3(a). The scribble confidence of the pixel at the center of the yellow square is 0.3, while the confidence of the center pixel of the pink square is 0.0. Since erroneous scribbles mainly appear at object boundaries, the proposed method can suppress erroneous input by using color differences between labeled pixels and their neighbors.
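The counting procedure described in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the author's implementation: the function name `scribble_confidence`, the replicated image border, and the choice to treat only exactly equal Lab colors as "similar" are assumptions made for the example.

```python
import numpy as np

def scribble_confidence(lab, labeled, eps=1e-5):
    """Ratio of matching to non-matching 8-neighbors at each labeled pixel.

    lab:     (H, W, 3) float array of Lab colors.
    labeled: (H, W) boolean mask of scribbled pixels.
    Returns an (H, W) array that is 0 at unlabeled pixels.
    """
    h, w, _ = lab.shape
    # Assumption: replicate the border so that every pixel has 8 neighbors.
    padded = np.pad(lab, ((1, 1), (1, 1), (0, 0)), mode="edge")
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0)]
    similar = np.zeros((h, w))
    for dy, dx in offsets:
        # neighbor[y, x] holds the color of pixel (y + dy, x + dx).
        neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        dist = np.linalg.norm(lab - neighbor, axis=2)
        # Assumption: "similar" means an exactly identical color.
        similar += dist == 0
    different = len(offsets) - similar
    confidence = similar / (different + eps)
    return np.where(labeled, confidence, 0.0)

# A labeled center pixel with 3 matching and 5 non-matching neighbors.
lab = np.full((3, 3, 3), 10.0)
lab[0, :] = 0.0   # top row matches the center
lab[1, 1] = 0.0   # center pixel
labeled = np.zeros((3, 3), dtype=bool)
labeled[1, 1] = True
print(scribble_confidence(lab, labeled)[1, 1])  # close to 3 / 5 = 0.6
```

Under these assumptions, a pixel whose eight neighbors all match gets a very large confidence (8 divided by the small constant), while a pixel with no matching neighbor gets 0.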

3.2. Energy Function

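Let n be the total number of pixels and w and h the image width and height in pixels, that is, n = w × h. For each pixel i, $d_i$, $\tilde{d}_i$, and $\mathbf{p}_i$ denote, respectively, the estimated depth value, initial depth value, and spatial coordinate. Here, $\tilde{d}_i = 0$ if pixel i is unlabeled, that is, $i \notin \Omega$; otherwise, $\tilde{d}_i$ is the user-assigned depth value. The n × 1 lexicographically ordered column vectors $\mathbf{d}$ and $\tilde{\mathbf{d}}$ are the vector representations of the estimated dense depth-map and the initial sparse depth-map, respectively. The objective of 2D-to-3D conversion is to recover $\mathbf{d}$ from $\tilde{\mathbf{d}}$. This is an ill-posed inverse problem. Local and k-nearest smoothness terms are introduced to regularize the problem, giving the following energy function:

$$E(\mathbf{d}) = \sum_{i=1}^{n} s_i \left(d_i - \tilde{d}_i\right)^2 + \lambda_l \sum_{i=1}^{n} \sum_{j \in \mathcal{N}_i} g_{ij} \left(d_i - d_j\right)^2 + \lambda_k \sum_{i=1}^{n} \sum_{j \in \mathcal{K}_i} g_{ij} \left(d_i - d_j\right)^2, \tag{2}$$

where the first term is the data cost, the second term is the local smoothness over the local neighbors $\mathcal{N}_i$, and the last term is the k-nearest smoothness. $s_i$ is the scribble confidence at pixel i obtained from formula (1), with $s_i = 0$ for unlabeled pixels. $\mathcal{K}_i$ is the set of k-nearest neighbors of pixel i. A feature vector $\mathbf{f}_i$ is used to find the k-nearest neighbors. Here, $\gamma$ is a parameter of the feature vector and is set to 30 in all experiments. $\lambda_l$ and $\lambda_k$ in formula (2) are parameters that weigh the local smoothness term and the k-nearest smoothness term, respectively. $g_{ij}$ is the Gaussian weight that measures the color similarity between pixels i and j and is defined as follows:

$$g_{ij} = \exp\left(-\frac{\left\|\mathbf{c}_i - \mathbf{c}_j\right\|_2^2}{2\sigma^2}\right), \tag{3}$$

where $\mathbf{c}_i$ is the Lab color at pixel i and $\sigma$ is the bandwidth parameter, fixed at 0.03 in all experiments.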

In formula (2), the data cost measures the consistency between the estimated and user-assigned depth values. Since scribble confidence is incorporated, the proposed method is robust to inaccurate user input. The local smoothness term encourages neighboring pixels with similar colors to have similar depth values. To reduce the impact on correct scribbles in textured regions, the k-nearest smoothness term is introduced so that distant pixels with similar features also have similar depth values.
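To make the two smoothness terms more concrete, the following sketch shows one way to evaluate a Gaussian color weight and to find k-nearest neighbors in feature space. It is an illustration under stated assumptions: the helper names `gaussian_weight` and `k_nearest` are hypothetical, the `2 * sigma**2` scaling of the exponent is an assumed form of the Gaussian weight, and the content of the feature vectors is left to the caller.

```python
import numpy as np

def gaussian_weight(ci, cj, sigma=0.03):
    """Gaussian color similarity; the 2 * sigma**2 scaling is an assumption."""
    diff = np.asarray(ci, dtype=float) - np.asarray(cj, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def k_nearest(features, k):
    """Brute-force k-nearest neighbors, excluding each point itself.

    features: (n, m) array with one (hypothetical) feature vector per pixel.
    Returns an (n, k) array of neighbor indices, nearest first.
    """
    features = np.asarray(features, dtype=float)
    # Pairwise squared Euclidean distances between all feature vectors.
    sq_dist = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq_dist, np.inf)  # a pixel is not its own neighbor
    return np.argsort(sq_dist, axis=1)[:, :k]

# Four pixels with 1-D features: two tight pairs far from each other.
features = np.array([[0.0], [0.1], [1.0], [1.2]])
print(k_nearest(features, k=1).ravel())
print(gaussian_weight([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]))
```

The brute-force search needs memory quadratic in the number of pixels, so it is only suitable for tiny examples; a spatial index such as a k-d tree would be the usual choice for full images.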

3.3. Solver

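The energy function in formula (2) is minimized to obtain the dense depth-map $\mathbf{d}$ from the sparse depth-map $\tilde{\mathbf{d}}$. To facilitate implementation, formula (2) is rewritten in matrix form as follows:

$$E(\mathbf{d}) = \left(\mathbf{d} - \tilde{\mathbf{d}}\right)^{T} \mathbf{S} \left(\mathbf{d} - \tilde{\mathbf{d}}\right) + \lambda_l \mathbf{d}^{T} \mathbf{L}_l \mathbf{d} + \lambda_k \mathbf{d}^{T} \mathbf{L}_k \mathbf{d}, \tag{4}$$

where $\lambda_l$ and $\lambda_k$ are the weights of the local and k-nearest smoothness terms, and $\mathbf{S}$ is an n × n diagonal matrix whose i-th diagonal entry is the scribble confidence $s_i$. $\mathbf{L}_l$ is the n × n Laplacian matrix for local neighbors, defined as $\mathbf{L}_l = \mathbf{D}_l - \mathbf{A}_l$, where $\mathbf{A}_l$ is the n × n affinity matrix for local neighbors and $\mathbf{D}_l$ is an n × n diagonal matrix whose i-th diagonal entry is $\sum_j \mathbf{A}_l(i, j)$. $\mathbf{L}_k$ is the n × n Laplacian matrix for k-nearest neighbors, defined as $\mathbf{L}_k = \mathbf{D}_k - \mathbf{A}_k$, where $\mathbf{A}_k$ is the n × n affinity matrix for k-nearest neighbors and $\mathbf{D}_k$ is an n × n diagonal matrix whose i-th diagonal entry is $\sum_j \mathbf{A}_k(i, j)$.

The energy function in formula (4) is convex, so taking its derivative with respect to $\mathbf{d}$ and setting it to zero leads to the following system of linear equations:

$$\left(\mathbf{S} + \lambda_l \mathbf{L}_l + \lambda_k \mathbf{L}_k\right) \mathbf{d} = \mathbf{S} \tilde{\mathbf{d}}. \tag{5}$$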

The coefficient matrix of the linear system in formula (5) is sparse and positive definite, which means the solution can be obtained using the conjugate gradient method.
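The solver step can be sketched with SciPy's sparse matrices and its `cg` routine. This is a simplified illustration under stated assumptions, not the author's code: it keeps only the data cost and the local smoothness term (the k-nearest term is omitted for brevity), it uses a symmetric 8-neighbor affinity with an assumed Gaussian weight scaled by `2 * sigma**2`, and the function names `local_laplacian` and `propagate_depth` are hypothetical. With this symmetric affinity, the quadratic form of the Laplacian equals half of the double sum over neighbor pairs, a constant factor that can be absorbed into the smoothness weight.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags
from scipy.sparse.linalg import cg

def local_laplacian(color, sigma=0.03):
    """Laplacian D - A of the 8-connected pixel graph of an (H, W, C) image."""
    h, w, _ = color.shape
    index = np.arange(h * w).reshape(h, w)
    rows, cols, vals = [], [], []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # Pixels whose (dy, dx) neighbor lies inside the image ...
            src = (slice(max(0, -dy), min(h, h - dy)),
                   slice(max(0, -dx), min(w, w - dx)))
            # ... and the positions of those neighbors.
            dst = (slice(max(0, dy), min(h, h + dy)),
                   slice(max(0, dx), min(w, w + dx)))
            diff = color[src] - color[dst]
            # Assumed Gaussian color weight with bandwidth sigma.
            weight = np.exp(-(diff ** 2).sum(axis=2) / (2.0 * sigma ** 2))
            rows.append(index[src].ravel())
            cols.append(index[dst].ravel())
            vals.append(weight.ravel())
    n = h * w
    affinity = coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n, n),
    ).tocsr()
    degree = np.asarray(affinity.sum(axis=1)).ravel()
    return diags(degree) - affinity

def propagate_depth(color, sparse_depth, confidence, lam_local=1.0, sigma=0.03):
    """Solve (S + lam_local * L) d = S d_tilde with conjugate gradients."""
    h, w, _ = color.shape
    S = diags(confidence.ravel().astype(float))
    system = (S + lam_local * local_laplacian(color, sigma)).tocsr()
    rhs = S @ sparse_depth.ravel().astype(float)
    depth, info = cg(system, rhs)
    if info != 0:
        raise RuntimeError(f"conjugate gradient did not converge (info={info})")
    return depth.reshape(h, w)

# Two flat regions, one confident scribble in each.
color = np.zeros((4, 6, 1))
color[:, 3:, 0] = 1.0
sparse_depth = np.zeros((4, 6))
confidence = np.zeros((4, 6))
sparse_depth[0, 0], confidence[0, 0] = 0.2, 1.0
sparse_depth[3, 5], confidence[3, 5] = 0.8, 1.0
print(np.round(propagate_depth(color, sparse_depth, confidence), 2))
```

In this toy example each scribble spreads over its own flat region and stops at the color edge, because the assumed Gaussian weight across the edge is practically zero. The system is only positive definite if every connected region contains at least one scribble with nonzero confidence.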

4. Experimental Results

4.1. Experimental Setup

Four representative test images, RGBZ_01, RGBZ_03, RGBZ_05, and RGBZ_07, from the RGBZ dataset [23] are used to evaluate the performance. The proposed method is compared with several state-of-the-art methods, including RW [4], optimization (OPT) [6], hybrid GC and RW (HGR) [10], superpixel-based optimization (SOPT) [13], and nonlocal RW (NRW) [14]. In the proposed method, the regularization weights $\lambda_l$ (local smoothness) and $\lambda_k$ (k-nearest smoothness) are fixed to 1 and $10^{-5}$, respectively. The local neighborhoods in formulas (1) and (2) are empirically set to 3 × 3 square windows centered at each pixel. The parameter k of the k-nearest neighbors in formula (2) is set to 9. Structural similarity (SSIM) and PSNR are used as the quantitative measures for comparison, with the SSIM parameters set to the default values suggested by Wang et al. [24].
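For reference, PSNR between an estimated depth-map and its ground truth can be computed as in the sketch below. The function name `psnr` and the default peak value of 1.0 (depth values normalized to [0, 1]) are assumptions for illustration; the paper does not state the depth range used in its evaluation.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two depth-maps.

    peak is the maximum possible depth value (assumed to be 1.0 here).
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical depth-maps
    return float(10.0 * np.log10(peak ** 2 / mse))

ground_truth = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)   # constant error of 0.1 everywhere
print(psnr(ground_truth, estimate))
```

A uniform error of 0.1 on a unit depth range gives a mean squared error of 0.01, which corresponds to about 20 dB.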

4.2. Experiments in the Absence of Inaccurate Scribbles

In this section, user input is assumed to be perfectly accurate, and the performance of the proposed method in this case is examined. The SSIM comparison is listed in Table 1, and the PSNR comparison is shown in Table 2. As shown in Tables 1 and 2, the proposed method is comparable to the current optimization method [6] when user input is accurate. Figures 4–7 show qualitative comparisons of the different methods on the four test images. It can be seen that the proposed method is better at reducing depth bleeding artifacts than the previous optimization method [6]. The reason is that the k-nearest smoothness term in formula (2) is effective in preserving sharp depth boundaries [14]. In summary, the proposed method can be safely used in the case of accurate input.

4.3. Experiments in the Presence of Inaccurate Scribbles

In this section, some inaccurate scribbles are added to the inputs of the above experiments, either by roughly drawing labels across randomly selected object boundaries (see the regions within white squares in Figures 8(b)–11(b)) or by drawing wrong labels inside randomly selected regions (see the regions within white circles in Figures 8(b)–11(b)).

The SSIM comparison in this case is listed in Table 3, and the PSNR comparison is shown in Table 4. It can be seen that the proposed method is superior to the other approaches when inaccurate input is present and obtains the highest SSIM and PSNR values on average. This shows that scribble confidence helps resist inaccurate scribbles. The performance of current methods degrades significantly in the case of inaccurate input, since they assume that all scribbles are perfectly accurate.

The qualitative comparisons of the different methods are shown in Figures 8–11. Current methods generate undesirable visual artifacts around inaccurately labeled regions. Thanks to the scribble confidence, the proposed method successfully reduces these artifacts caused by inaccurate input.

4.4. Experiments for Sparse Labeling

In this section, user input is assumed to be very sparse. The first row of Figure 12 shows results for a very sparse input that contains only seven strokes. In the second row, two strokes are added to the input image. It can be seen that the performance of all methods improves as the number of accurate scribbles increases. NRW and the proposed method are superior to the others in preserving depth discontinuities, since they both use nonlocal regularization. As analyzed in Section 3.1, the proposed method may mistake correct labels for incorrect ones when they are located in textured regions. Nevertheless, it obtains acceptable results even with very sparse scribbles thanks to the k-nearest smoothness in formula (2), as shown in Figures 12(g) and 12(n).

5. Conclusion

Semiautomatic 2D-to-3D conversion has proven to be an effective solution for alleviating 3D content shortage. The key is sparse-to-dense depth conversion from user scribbles. Existing methods assume user input is entirely accurate, and even small errors may degrade the depth quality dramatically. To alleviate this problem, color difference between labeled pixels and neighbors is used to compute scribble confidence, and a confident optimization method is proposed for sparse-to-dense depth conversion. Furthermore, k-nearest smoothness is introduced to make the proposed method perform well even with very sparse input. The experiments demonstrate that the proposed method is superior to existing methods when inaccurate input is present, while at the same time competitive results are obtained when all scribbles are accurate.

Currently, the proposed method focuses on 2D-to-3D conversion for images. In future work, the proposed method will be extended to videos.

Conflicts of Interest

The author declares that there are no conflicts of interest in the publication of this article.

Acknowledgments

This research was supported by the Zhejiang Provincial Natural Science Foundation of China (LY16F010014 and LY18F020025), the Ningbo City Natural Science Foundation of China (2017A610109), and the Educational Commission of Zhejiang Province of China (Y201533511).