Abstract

This paper proposes a two-stage method for hand depth image denoising and superresolution, using bilateral filters and dictionaries learned via noise-aware orthogonal matching pursuit (NAOMP) based K-SVD. The bilateral filtering phase recovers singular points and removes artifacts on silhouettes by averaging depth data over neighborhood pixels on which both depth difference and RGB similarity restrictions are imposed. The dictionary learning phase uses NAOMP to train dictionaries that separate faithful depth from noisy data. Compared with traditional OMP, NAOMP adds a residual reduction step which effectively weakens the noise term within the residual during its decomposition in terms of atoms. Experimental results demonstrate that the bilateral filtering phase and the NAOMP-based dictionary learning phase jointly denoise both virtual and real depth images effectively.

1. Introduction

With the development of 3D range imaging devices such as laser scanners, the Kinect sensor, and Time-of-Flight (ToF) cameras, depth images are widely used in various research fields including computer vision, computer graphics, virtual reality, and human-computer interaction. While laser scanners provide 3D measurements with precise accuracy, the Kinect sensor and ToF cameras provide a convenient way to accomplish 3D range imaging in real time, which facilitates many applications with high requirements on efficiency and convenience.

However, depth images provided by the Kinect sensor or ToF cameras, whether based on the structured-light principle or on the ToF principle, suffer from low quality and resolution because of the deficiency of received light speckles and the noise incurred by the ranging environment. Typically, depth images exhibit holes, missing regions, unstable boundaries, and nonzero-mean Gaussian noise (see Figure 1).

The research work devoted to the enhancement of depth images, including superresolution and denoising of depth images, can be roughly divided into three categories: filtering methods, probabilistic methods, and sparse representation methods. In general, filtering methods [1–6] perform depth enhancement by using filters, based on the assumption that faithful depth data and noise are separable in the frequency domain; probabilistic methods [7–12] formulate depth enhancement as an uncertainty problem of depth measurement and use probabilistic graphical models to resolve it; sparse representation methods [13–18] model the depth enhancement problem as a sparse optimization by assuming that faithful depth data have an underlying sparse or low-rank structure.

Hand depth image denoising and superresolution are important for hand-based human-machine interaction. Although many research works have been proposed for RGB/depth image denoising and superresolution, conventional approaches do not work well for hand depth images. This is because, at the resolution of depth images captured from the Kinect sensor, the hand occupies only a very small subregion of the frame in our experiments. Thus traditional approaches usually confuse depth with noise and are unsuitable for such small-scale depth data.

This paper proposes hand depth image denoising and superresolution using bilateral filters and NAOMP-based dictionaries. The bilateral filtering phase recovers singular points and removes artifacts on silhouettes by averaging depth data under restrictions of both depth difference and RGB similarity. The dictionary learning phase uses the K-SVD method with NAOMP to train dictionaries that separate faithful depth from noisy data. While traditional orthogonal matching pursuit (OMP) works well for training dictionaries to denoise RGB images, its performance on depth images deteriorates badly because depth images involve nonzero-mean Gaussian noise. As a result, the traditional dictionary learning algorithm (e.g., OMP-based K-SVD) cannot prevent noisy data from penetrating the dictionaries, which results in an unsatisfactory denoising effect. To adapt traditional OMP-based K-SVD to the nonzero-mean Gaussian noise that frequently appears in depth data, we propose NAOMP to replace traditional OMP in K-SVD, where the noise term within the residual is weakened in each atom updating step. Such a modification yields dictionaries capable of representing faithful depth data more precisely. Experimental results show that the proposed bilateral filters and NAOMP-based dictionaries jointly give promising results for denoising both virtual and real depth data, compared with traditional bilateral filters and traditional OMP-based dictionaries.

This paper is organized as follows. Section 2 reviews previous work on enhancing depth images. Section 3 proposes the bilateral filtering phase using RGBD data, which first recovers singular points, then removes incorrect points on silhouettes, and finally treats the remaining nonsingular points over nonsilhouette regions. Sections 4 and 5 propose depth image denoising and superresolution, respectively, both using dictionaries learned via NAOMP-based K-SVD. Section 6 shows experimental results with both virtual and real hand depth images.

2. Previous Work

Previous work on enhancing depth images, including superresolution and denoising of depth images, is reviewed in this section in the following three categories.

2.1. Filtering Methods

Yang et al. [1] construct a 3D volume of depth probability, impose a bilateral filter over the volume iteratively, and obtain high-resolution depth images by taking a winner-takes-all approach on the weighted volume followed by a subpixel refinement. Huhle et al. [2] present a two-stage depth enhancement method, which first removes outliers from depth data and then performs smoothing via a nonlocal means filter using the similarity of both color and intrapatch depth. Wasza et al. [3] propose GPU-based depth image preprocessing, including normalized convolution for restoring depth images, bilateral temporal averaging for dynamic scenes, and a guided filter for edge-preserving denoising. Min et al. [4] propose a joint histogram of depth maps for measuring color similarity between reference and neighboring pixels and find a global mode solution of the histogram for depth video enhancement. Fu et al. [5] propose a spatial-temporal denoising algorithm which exploits both the intraframe spatial correlation and the interframe temporal correlation to fill depth holes and suppress depth noise. Camplani and Salgado [6] propose a joint-bilateral filtering framework for denoising depth images, evaluating missing depth values with a filter applied to neighboring pixels involving both spatial and temporal information, with the filter weights selected as a function of a photometric similarity measure of the neighboring pixels.

2.2. Probabilistic Methods

Mac Aodha et al. [7] explore the height field of patches of low-resolution depth images and select high-resolution candidate depth patches by solving a Markov Random Field (MRF) labeling problem. Shen and Cheung [8] use depth layers to account for the differences between foreground objects and the background scene, the missing depth value phenomenon, and the correlation between color and depth channels, and consider the depth layer labeling as a maximum a posteriori estimation problem. Wang et al. [9] evaluate the confidence of the depth map for adaptive weighting of MRF energy terms and introduce a guided depth recovery method in the framework of MRF optimization for handling large holes across multiple image regions. Li et al. [10] segment the input low-resolution depth image into several regions with different labels which correspond to high-resolution counterparts in training images and formulate depth superresolution as an MRF-based patchwork assembly problem. Hui and Ngan [11] propose a variational depth map enhancement by fusing the depth maps from the active sensor of a moving RGBD system with the depth cues from an induced optical flow. Yang et al. [12] propose a regression method for enhancing depth images using RGB-D data, which first fits a regression model for depth images and then designs pixel-wise regression predictors using the similarity of depth images and the accompanying color images.

2.3. Sparse Representation Methods

Schuon et al. [13] combine several low-resolution noisy depth images of a static scene taken from slightly displaced viewpoints and merge them into a high-resolution depth image with ToF calibration data; the depth superresolution problem is formulated as an optimization of a data reconstruction term plus a sparsity term on the spatial gradient for separating noise from features. Li et al. [14] propose a joint example-based depth map superresolution method which reconstructs high-resolution depth images by learning a mapping function from a set of training samples of an image database. Kiechle et al. [15] propose a joint intensity and depth cosparse model for depth map superresolution, by assuming that the cosupports of corresponding intensity and depth image structures are aligned. Zheng et al. [16] propose constructing multiple dictionaries with different structures and different numbers of atoms for sparsely representing each low-resolution patch of depth images. Xie et al. [17] learn a coupled dictionary with local coordinate constraints, incorporate an adaptively regularized shock filter to sharpen the edges, and implement both depth superresolution and depth denoising. Lu et al. [18] assemble similar RGBD patches into a low-rank matrix in order to counteract the noise and the weak correlation between color and depth.

3. RGBD-Based Bilateral Filters

This section proposes the bilateral filtering phase, which preprocesses depth images using bilateral filters with both depth and RGB restrictions. The phase includes the following three steps. First, the depth of each singular point of the depth image is corrected by averaging depths over the neighborhood according to a rule combining RGB comparison and depth histogram difference. Second, the depth of each point on silhouettes is corrected by averaging depths over the neighborhood according to a rule combining RGB comparison and depth difference. Finally, the depth of the remaining points is corrected using traditional bilateral filters, that is, by averaging depths over a spatial neighborhood.

Let $I(p)$ be the intensity value/depth value at the pixel $p = (i, j)$ of an intensity/depth image. Traditional bilateral filters at the pixel $p$ with respect to its neighborhood $\Omega(p)$ are given by

$$\hat{I}(p) = \frac{1}{W(p)} \sum_{q \in \Omega(p)} G_{\sigma_s}(\|p - q\|)\, G_{\sigma_r}(|I(p) - I(q)|)\, I(q), \quad (1)$$

where $\Omega(p)$ denotes a $w \times w$ spatial neighborhood of $p$, with $i$ and $j$ representing the row index and the column index of $p$ within the image, respectively, $W(p)$ denotes the normalization term, $G_{\sigma_s}$ is the 2D Gaussian smoothing kernel (known as the domain term) which measures the closeness of the pixels, and $G_{\sigma_r}$ is the 1D Gaussian smoothing kernel (known as the range term) which measures the similarity of RGB/depth values of the pixels in RGB/depth images.
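For readers who prefer code to notation, the following is a minimal NumPy sketch of the traditional bilateral filter (1); the window size and the kernel widths sigma_s and sigma_r are illustrative values, not the parameters used in our experiments.

import numpy as np

def bilateral_filter(img, w=5, sigma_s=2.0, sigma_r=10.0):
    """Traditional bilateral filter, eq. (1): weight each neighbor by
    spatial closeness (domain term) and value similarity (range term)."""
    rows, cols = img.shape
    r = w // 2
    out = np.zeros_like(img, dtype=np.float64)
    padded = np.pad(img.astype(np.float64), r, mode='edge')
    # 2D Gaussian domain kernel over the window
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    domain = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma_s ** 2))
    for i in range(rows):
        for j in range(cols):
            patch = padded[i:i + w, j:j + w]
            # 1D Gaussian range kernel on value differences
            rng = np.exp(-(patch - img[i, j]) ** 2 / (2 * sigma_r ** 2))
            weights = domain * rng
            out[i, j] = np.sum(weights * patch) / np.sum(weights)
    return out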

3.1. Recovering Singular Points

The singular points of a depth image are the points whose depth is undetected. For each singular point $p$ of the depth image, we first set an initial depth value $D_0(p)$ at $p$ as the average of the neighborhood depths that are no smaller than a threshold determined by $D_{\max}$, the maximum depth over the neighborhood $\Omega(p)$ (we choose such an initial value because experimental results indicate that the depths over the neighborhood of a singular point generally fall into two regions: an interval determined by the depths of the target hand and an interval determined by the depths of the background; experimental results show that this threshold separates the two intervals well, and hence $D_0(p)$ gives an initial approximation of the depth at $p$). We then choose a suitable subset $\Omega_1(p)$ of $\Omega(p)$ whose depth-histogram count is much greater than that of the initial depth at $p$ and whose RGB values are close to those at $p$ (choosing such a subregion of the neighborhood of $p$ improves on directly choosing a spatial neighborhood of $p$, because the RGB comparison raises the confidence of the pixels and because the depth histogram avoids producing shadows within the silhouette; we illustrate this improvement in Figure 2). Finally, we obtain the filtered depth at the pixel $p$ by averaging the depth data with respect to both the domain term and the range term, provided that there are enough confidence pixels; otherwise the singular point keeps its value (such untreated singular points are far fewer than before, are dispersed within the depth image, and can hence easily be treated in the second phase). That is,

$$\hat{D}(p) = \frac{1}{W(p)} \sum_{q \in \Omega_1(p)} G_{\sigma_s}(\|p - q\|)\, G_{\sigma_r}(\|C(p) - C(q)\|)\, D(q), \quad (2)$$

where $C(q)$ and $D(q)$ denote the RGB and depth values at the pixel $q$ of an image, respectively, $W(p)$ denotes the normalization term, and $\Omega_1(p)$ denotes the confidence subset of a square neighborhood of $p$ obtained by imposing both a depth difference restriction and an intensity similarity restriction, with $h(d)$ denoting the number of pixels within $\Omega(p)$ whose depth value is equal to $d$ (used in the histogram restriction that defines $\Omega_1(p)$).
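As a point of reference, the sketch below implements the singular-point recovery step (2) under illustrative assumptions: a depth value of 0 marks an undetected pixel, the initial estimate averages neighborhood depths no smaller than half of the maximum neighborhood depth, and tau_d, tau_c, and n_min stand in for the depth, RGB, and quantity restrictions; the exact thresholds used in the paper may differ.

import numpy as np

def recover_singular_point(depth, rgb, i, j, w=7,
                           sigma_s=2.0, sigma_r=20.0,
                           tau_d=30.0, tau_c=25.0, n_min=8):
    """Recover one singular (zero-depth) pixel in the spirit of eq. (2):
    average neighborhood depths restricted by both a depth-difference rule
    and an RGB-similarity rule. All thresholds here are illustrative."""
    r = w // 2
    rows, cols = depth.shape
    ys, xs = np.mgrid[max(i - r, 0):min(i + r + 1, rows),
                      max(j - r, 0):min(j + r + 1, cols)]
    d = depth[ys, xs].astype(np.float64)
    c = rgb[ys, xs].astype(np.float64)           # neighborhood RGB values
    valid = d > 0                                 # ignore other singular pixels
    if not valid.any():
        return depth[i, j]
    # initial depth guess: average of the larger depths in the window
    d_init = d[valid & (d >= 0.5 * d[valid].max())].mean()
    # confidence subset: close to the initial depth AND similar RGB
    conf = valid & (np.abs(d - d_init) < tau_d) \
                 & (np.linalg.norm(c - rgb[i, j], axis=-1) < tau_c)
    if conf.sum() < n_min:
        return depth[i, j]                        # leave it for the next phase
    domain = np.exp(-((ys - i) ** 2 + (xs - j) ** 2) / (2 * sigma_s ** 2))
    rng = np.exp(-(d - d_init) ** 2 / (2 * sigma_r ** 2))
    wgt = domain * rng * conf
    return np.sum(wgt * d) / np.sum(wgt)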

3.2. Removing Incorrect Points on Silhouettes

For each nonsingular point $p$ located on the silhouettes of the depth image, we modify the depth value at $p$ by first choosing a suitable subset $\Omega_2(p)$ of $\Omega(p)$ whose depths are much greater than the depth at $p$ and whose RGB values are similar to those at $p$, and then averaging the depth data within this background domain (corresponding to the restriction of depth difference) with similar RGB values, provided that there are enough such pixels. That is,

$$\hat{D}(p) = \frac{1}{W(p)} \sum_{q \in \Omega_2(p)} G_{\sigma_s}(\|p - q\|)\, G_{\sigma_r}(\|C(p) - C(q)\|)\, D(q), \quad (3)$$

where $C$, $D$, $W$, $G_{\sigma_s}$, and $G_{\sigma_r}$ have the same meanings as in (2) and $\Omega_2(p)$ denotes the confidence subset of a square neighborhood of $p$ obtained by imposing both a depth difference restriction and an intensity similarity restriction.

3.3. Removing Other Incorrect Points

For all other nonsingular points of the depth image, we apply a smoothing filter based on the similarity of both depth values and RGB values. That is,

$$\hat{D}(p) = \frac{1}{W(p)} \sum_{q \in \Omega_3(p)} G_{\sigma_s}(\|p - q\|)\, G_{\sigma_r}(|D(p) - D(q)|)\, G_{\sigma_c}(\|C(p) - C(q)\|)\, D(q), \quad (4)$$

where $C$, $D$, $W$, $G_{\sigma_s}$, and $G_{\sigma_r}$ have the same meanings as in (2), $G_{\sigma_c}$ is a 1D Gaussian kernel on RGB differences, and $\Omega_3(p)$ denotes a $w \times w$ square neighborhood of $p$. (Different from the subtle filtering of (2) and (3), where both neighborhoods are determined by a depth difference restriction and an intensity similarity restriction, the filtering (4) simply selects a square neighborhood without additional restrictions. This is because (2) and (3) treat singular points and incorrect points on silhouettes, which require careful treatment, whereas (4) performs a smoothing effect over the whole square neighborhood of a point which merely needs smoothing.)
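Filters (2)–(4) share the same weighted-averaging core and differ only in which subset of the square window is kept and in the reference value used by the range term; the sketch below factors out that core. The thresholds are illustrative, and the mask builder summarizes our reading of the restrictions described above.

import numpy as np

def weighted_depth_average(depth, i, j, ys, xs, mask, ref,
                           sigma_s=2.0, sigma_r=20.0):
    """Shared core of filters (2)-(4): bilateral average of depth over a
    selected subset (mask) of the window around pixel (i, j); 'ref' is the
    depth the range term compares against (the initial estimate for (2),
    the background level for (3), the pixel's own depth for (4))."""
    d = depth[ys, xs].astype(np.float64)
    domain = np.exp(-((ys - i) ** 2 + (xs - j) ** 2) / (2 * sigma_s ** 2))
    rng = np.exp(-(d - ref) ** 2 / (2 * sigma_r ** 2))
    w = domain * rng * mask
    return np.sum(w * d) / np.sum(w) if w.sum() > 0 else depth[i, j]

def silhouette_mask(depth, rgb, i, j, ys, xs, tau_d=30.0, tau_c=25.0):
    """Restriction of filter (3): keep neighbors clearly farther than p
    (background, depth-difference rule) with RGB values similar to p's."""
    d = depth[ys, xs].astype(np.float64)
    c = rgb[ys, xs].astype(np.float64)
    return (d - depth[i, j] > tau_d) & \
           (np.linalg.norm(c - rgb[i, j], axis=-1) < tau_c)

# Filter (4) simply passes mask = np.ones(ys.shape, dtype=bool): the whole
# square window is used and the filter reduces to plain bilateral smoothing.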

Figures 2 and 3 show comparison results of depth images processed with traditional filters [6] and with the proposed filters (2)–(4). In short, the proposed filters recover singular points using neighboring points with a quantity restriction (2) and correct artifacts using neighboring points with an intensity restriction (3), while traditional bilateral filters accomplish such tasks by directly averaging points within a spatial neighborhood. Therefore, traditional filters tend to produce new artifacts, while the proposed filters either correct the artifacts over silhouettes or reduce them to a small number without creating new ones, so that they can easily be treated in the subsequent dictionary learning phase.

4. Hand Depth Denoising Using Noise-Aware Dictionaries

This section proposes the dictionary-based denoising phase, which follows the bilateral filtering phase given in Section 3. Recovering the original image $x$ from a degraded image $y$ with additive white Gaussian noise can be modelled by solving the ill-posed system $y = x + v$, where $y$ denotes the degraded image, $x$ denotes the original image, and $v$ denotes the noise term. Sparse representation is an important tool for solving such an ill-posed system. According to sparse representation, natural images can be represented by a linear combination of a series of overcomplete bases (known as a dictionary) with very few nonzero combination coefficients. Therefore, by denoting $D \in \mathbb{R}^{n \times K}$ to be such a dictionary whose columns are basis vectors (known as atoms) with $K > n$, the original image can be obtained by imposing an $\ell_0$-norm constraint on the coefficients of the system $y = D\alpha + v$. In particular, we select image patches of size $\sqrt{n} \times \sqrt{n}$ pixels randomly from $y$ as training data and order each patch lexicographically as a column vector $y_i \in \mathbb{R}^n$. Then we obtain a trained dictionary $\hat{D}$ and sparse coefficients $\hat{\alpha}_i$ simultaneously via the following $\ell_0$-minimization over all coefficients of the training data:

$$\{\hat{D}, \hat{\alpha}_i\} = \arg\min_{D, \alpha_i} \sum_{i \in \Omega} \|\alpha_i\|_0 \quad \text{s.t.} \quad \|y_i - D\alpha_i\|_2 \leq \varepsilon, \; i \in \Omega, \quad (5)$$

where $\Omega$ denotes the set of indices of the patches of $y$ we randomly select for training. The original image is finally reconstructed by assembling all image patches $\hat{D}\hat{\alpha}_i$, with $\hat{\alpha}_i$ given by

$$\hat{\alpha}_i = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad \|y_i - \hat{D}\alpha\|_2 \leq \varepsilon. \quad (6)$$

Because (5) and (6) are both nonconvex systems involving the $\ell_0$-norm minimization, developing efficient algorithms for solving them is an important task. Elad and Aharon [19] propose the K-SVD algorithm to solve (5), constructing a dictionary iteratively from training signals with a sparse coding phase and an atom update phase. Within the sparse coding phase of K-SVD and in the optimization problem (6), OMP is a greedy algorithm frequently used for solving the $\ell_0$-norm minimization.
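A minimal NumPy sketch of the patch-based pipeline described above, using a plain OMP in the spirit of Algorithm 2; dictionary training, overlapping-patch extraction, and the final averaging step implied by (6) are simplified, and the sparsity level is illustrative.

import numpy as np

def omp(y, D, sparsity, eps=1e-6):
    """Plain orthogonal matching pursuit: greedily pick the atom most
    correlated with the residual, then re-fit all selected atoms by
    least squares (cf. Algorithm 2)."""
    residual = y.astype(float).copy()
    idx, x = [], np.zeros(D.shape[1])
    for _ in range(sparsity):
        if np.linalg.norm(residual) < eps:
            break
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coeff, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coeff
        x[idx] = coeff
    return x

def denoise_patches(noisy_patches, D, sparsity=4):
    """Sparse-code each vectorized patch over the dictionary D and return
    its reconstruction; the denoised image would then be assembled by
    averaging the overlapping patch estimates, in the spirit of (6)."""
    return np.stack([D @ omp(p, D, sparsity) for p in noisy_patches])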

While traditional K-SVD denoises ordinary images well under zero-mean additive Gaussian noise, it is not well suited for denoising depth images with nonzero-mean noise, because traditional OMP can hardly separate the noise from the noiseless image when the amplitude of the noise has an irregular distribution. To remove such noise from depth images more effectively, traditional OMP is improved by modifying the entries of the residual whenever a small number of entries have large amplitude. Figure 4 helps readers understand how this idea works: the vertical axis represents the value of each component of the residual, while the horizontal axis represents the index of each component. Let $r$ be the residual obtained in an iteration of OMP. The left subfigure of Figure 4 shows how traditional OMP represents residuals when the noise is zero-mean Gaussian. Although $r$ contains four components heavily contaminated by noise (denoted by blue cubes, i.e., the original training data have noise terms in the positions of those four components), the least-squares fit obtained when computing the sparse coefficients approximates the noiseless components well and hence makes most of the noiseless components vanish at the next iteration before the noise terms are decomposed with respect to the current atoms, because the stopping criterion on the number of sparse coefficients is satisfied. The middle subfigure of Figure 4 shows how traditional OMP fails to remove noise from depth images when the noise is nonzero-mean. In this case, the least-squares fit deviates from most of the noiseless terms, and the noisy terms begin to be decomposed in the next iteration, because neither the criterion on the number of sparse coefficients nor the criterion on the residual amplitude is satisfied. From this step on, the newly obtained atoms are contaminated by noise. The right subfigure of Figure 4 shows how NAOMP addresses this issue. When the current residual contains entries of greater magnitude and those entries are relatively few, it is reasonable to believe that they correspond to the noisy components of the training data. In this case, each of those components of $r$ is modified by reevaluating it as the value of the corresponding component of the fitting line, and the updated residual is then redecomposed over the atoms. By doing so, the least-squares fit approximates the noiseless terms of the residual and weakens the effect of the noisy terms. We illustrate the idea with numerical examples in the Appendix.

The NAOMP algorithm is given in Algorithm 3. After the traditional OMP steps are performed, the number of residual components whose magnitude is relatively greater than that of the others is checked. When this number is small, the current residual is reevaluated so that the values of the greater components approach those of the other components. The atom and the residual are then re-updated as in OMP. Detailed parameter settings are given in Section 6.
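The following NumPy sketch gives one reading of Algorithm 3: after each OMP iteration, residual entries whose magnitude exceeds the threshold tau are counted; when there are fewer than n_tau of them, they are treated as noise, pulled onto a line fitted to the remaining residual entries, and the atom choice and coefficients of the current iteration are recomputed on the cleaned signal. The line-fit interpretation of the "fitting line" and the parameter names are our assumptions, not the paper's exact formulation.

import numpy as np

def naomp(y, D, sparsity, tau, n_tau):
    """Noise-aware OMP sketch (one reading of Algorithm 3): a few residual
    entries with large magnitude are treated as noise, re-set to the value
    of a line fitted to the trusted entries, and the current iteration's
    atom selection and coefficients are recomputed on the cleaned signal."""
    y = y.astype(float).copy()              # working (possibly cleaned) target
    idx, x = [], np.zeros(D.shape[1])
    residual = y.copy()
    for _ in range(sparsity):
        # standard OMP step
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coeff, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coeff
        # noise-aware correction: few but large residual entries
        big = np.abs(residual) > tau
        if 0 < big.sum() < n_tau and (~big).sum() >= 2:
            pos = np.arange(residual.size)
            # least-squares line through the trusted residual entries
            line = np.poly1d(np.polyfit(pos[~big], residual[~big], 1))
            residual[big] = line(pos[big])          # weaken the noisy entries
            y = D[:, idx] @ coeff + residual        # cleaned target signal
            # redo this iteration's atom choice on the cleaned residual
            idx[-1] = int(np.argmax(np.abs(D.T @ residual)))
            coeff, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
            residual = y - D[:, idx] @ coeff
        x[:] = 0.0
        x[idx] = coeff
    return x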

5. Hand Depth Image Superresolution Using Noise-Aware Dictionaries

We apply NAOMP to hand depth image superresolution, where we use NAOMP for joint dictionary training. The main idea is similar to the work of [20]; hence we only describe the part of our work that differs, namely the dictionary training algorithm. We randomly select patches of virtual hand depth images as a training set, each of which is stacked as a vector, denoted by $y^h_i$. We then obtain its corresponding downsampled version $y^l_i$ and form pairwise training sets $\{(y^h_i, y^l_i)\}$. We denote by $D_h$ and $D_l$ the joint dictionary, which gives the sparse representations of high-resolution and low-resolution image patches, respectively. The joint dictionary is given by the following optimization:

$$\min_{D_h, D_l, \{\alpha_i\}} \sum_i \left( \frac{1}{N} \big\|y^h_i - D_h \alpha_i\big\|_2^2 + \frac{1}{n} \big\|y^l_i - D_l \alpha_i\big\|_2^2 \right) \quad \text{s.t.} \quad \|\alpha_i\|_0 \leq T, \quad (7)$$

where $N$ and $n$ denote the dimensions of the high-resolution and low-resolution patch vectors and $T$ is the sparsity. By rewriting the first and the second terms as a single 2-norm term, we obtain an optimization model similar to that of the last section and solve it using the NAOMP-based K-SVD algorithm. To recover the high-resolution version of an input image $y$, we find a sparse representation of each patch of $y$ with respect to $D_l$ and then obtain the corresponding high-resolution patch by combining the high-resolution atoms of $D_h$ with the same sparse coefficients.
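A minimal sketch of the joint training step under the stacking trick described above; the ksvd routine is assumed to implement Algorithm 1 (with OMP or NAOMP inside), and the 1/sqrt(N), 1/sqrt(n) balancing follows the common coupled-dictionary formulation of [20] rather than the paper's exact weights.

import numpy as np

def train_joint_dictionary(Yh, Yl, ksvd, **ksvd_kwargs):
    """Joint dictionary training sketch: Yh (N x M) holds high-resolution
    patch vectors column-wise and Yl (n x M) the matching low-resolution
    ones. Stacking them turns the two reconstruction terms of (7) into a
    single 2-norm term, so an ordinary (NAOMP-based) K-SVD can be reused."""
    N, n = Yh.shape[0], Yl.shape[0]
    # balance the two terms by their dimensions, then concatenate
    Y = np.vstack([Yh / np.sqrt(N), Yl / np.sqrt(n)])
    D = ksvd(Y, **ksvd_kwargs)               # assumed Algorithm 1 routine
    Dh = D[:N, :] * np.sqrt(N)               # undo the scaling
    Dl = D[N:, :] * np.sqrt(n)
    return Dh, Dl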

6. Experimental Results

The experimental results are given in this section. The experiments are run on a Core 2 Quad Q6600 2.4 GHz machine with 2 GB RAM using Visual Studio 2010. The proposed filters are given by (2)–(4), and the NAOMP is given by Algorithm 3. All the parameters used in traditional filters [6], the proposed filters (2)–(4), OMP-based K-SVD [19] (Algorithms 1 and 2), and NAOMP-based K-SVD (Algorithms 1 and 3) are given in Table 1. One can see from Table 1 that traditional OMP-based K-SVD has to select different values of the residual threshold according to the intensity of the noise, while NAOMP-based K-SVD selects a single value. In fact, according to [19], the denoising effect depends heavily on the choice of the residual threshold. Such a threshold is difficult to determine for OMP-based K-SVD when no a priori information about the noise is given, while it is fixed and easily determined in NAOMP-based K-SVD.

  input: training data {y_i}_{i ∈ Ω}, residual threshold ε, and iteration number J
  output: a dictionary D
(1) Initialize D as an overcomplete DCT dictionary;
(2) for j = 1, ..., J do
(3)   foreach i ∈ Ω do  /* sparse coding phase */
(4)     Use OMP (Algorithm 2 or Algorithm 3) to solve
        x_i = argmin_x ||x||_0  s.t.  ||y_i − D x||_2 ≤ ε;
(5)   end
(6)   for k = 1, ..., K do  /* atom update phase */
(7)     Find the set of patches which use the k-th atom:
        ω_k = {i : x_i(k) ≠ 0};
(8)     foreach i ∈ ω_k do
(9)       e_i = y_i − Σ_{l ≠ k} d_l x_i(l);  /* d_l denotes the l-th column of D */
(10)    end
(11)    Set E_k as the matrix whose columns are {e_i}_{i ∈ ω_k};
(12)    Apply the SVD decomposition E_k = U Δ V^T;
(13)    Update the k-th column of the dictionary D by the first column of U;
(14)    Update the coefficient values {x_i(k)}_{i ∈ ω_k} to be the first column of V multiplied by Δ(1, 1);
(15)  end
(16) end
  input: vector y, dictionary D, sparsity T
  output: sparse coefficient x of y with respect to D
(1) Initialize residual r_0 = y, index set Λ_0 = ∅, and atom matrix D_0 = [ ];
(2) for t = 1, ..., T do
(3)   Denote d_1, ..., d_K to be all columns of D;
(4)   λ_t = argmax_k |⟨r_{t−1}, d_k⟩|;
(5)   Update the index set Λ_t = Λ_{t−1} ∪ {λ_t};
(6)   Update the atom matrix D_t = [D_{t−1}, d_{λ_t}];
(7)   x_t = argmin_x ||y − D_t x||_2;
(8)   Update residual r_t = y − D_t x_t;
(9) end
  input: vector y, dictionary D, sparsity T, residual threshold τ, noise index threshold N_τ
  output: sparse coefficient x of y with respect to D
(1) Initialize residual r_0 = y, index set Λ_0 = ∅, and atom matrix D_0 = [ ];
(2) for t = 1, ..., T do
(3)   Denote d_1, ..., d_K to be all columns of D;
(4)   λ_t = argmax_k |⟨r_{t−1}, d_k⟩|;
(5)   Update the index set Λ_t = Λ_{t−1} ∪ {λ_t};
(6)   Update the atom matrix D_t = [D_{t−1}, d_{λ_t}];
(7)   x_t = argmin_x ||y − D_t x||_2;
(8)   Update residual r_t = y − D_t x_t;
(9)   I_t = {j : |r_t(j)| > τ};  /* collect all indices of great components of residual r_t */
(10)  if |I_t| < N_τ then  /* when r_t has only a small number of great components, we re-update the atom by removing those
      components from r_t */
(11)    foreach j ∈ I_t do re-set r_t(j) to the value of the fitting line at component j;
(12)    Re-select λ_t = argmax_k |⟨r_t, d_k⟩| using the modified residual;
(13)    Re-update the index set Λ_t = Λ_{t−1} ∪ {λ_t};
(14)    Re-update the atom matrix D_t = [D_{t−1}, d_{λ_t}];
(15)    Recompute the coefficients x_t by least squares;
(16)    Re-update residual r_t;
(17)  end
(18) end
6.1. Denoising Virtual Hand Depth Images

Artificial Gaussian noise is added to three virtual hand models and the images are denoised using OMP-based K-SVD and NAOMP-based K-SVD, both without filter preprocessing. Qualitative results are given in Figure 5 (with 0.5%, 2%, and 5% noise) and quantitative results are given in Table 2. We see that traditional OMP fails to recover the wrist part of the models while NAOMP treats it well. Moreover, the proposed method provides a higher PSNR than OMP-based K-SVD in all cases except the first example with 0.5% noise.

6.2. Denoising Hand Depth Images Obtained from Kinect v2

We show qualitative results of denoising and superresolution of six hand depth images obtained from the Kinect v2 sensor in Figure 6 and give the average running times of the different approaches in Table 3. The comparison includes five approaches: traditional bilateral filters [6], OMP-based K-SVD [19], NAOMP-based K-SVD, the bilateral filters (2)–(4) plus OMP-based K-SVD, and the bilateral filters (2)–(4) plus NAOMP-based K-SVD.

In general, both kinds of bilateral filters preprocess depth images well in that singular points are removed (the different effects of the two filters are shown in Section 3 using Figures 2 and 3). Moreover, one can see from the fifth and sixth columns that NAOMP-based K-SVD produces clearer silhouettes than OMP-based K-SVD, as the dictionaries trained with NAOMP exclude noisy terms well.

6.3. Discussions

The two-stage depth image denoising and superresolution enhance hand depth images well mainly for the following two reasons. First, the proposed bilateral filter functions choose more suitable neighborhood pixels, with finer depth and RGB restrictions, than traditional filters do. Second, NAOMP modifies the noisy terms of the residual in each atom updating step so that they are reevaluated with values close to the noiseless terms. This enables the residual to be decomposed in a less noisy fashion and results in dictionaries that are less contaminated by noise.

It should be noted that the proposed method may fail in denoising and superresolution of depth images of objects other than human hands. This is because, while the RGB values of human hand skin lie within a small range, other objects do not offer such an advantage, which makes the proposed bilateral filters fail to select suitable neighborhoods for the filter functions.

In future work, the proposed image denoising and superresolution framework will be developed for enhancing depth images of more complex objects or scenes. More accurate RGB/depth restrictions can be designed for the filter functions that preprocess depth images, so that the remaining singular points and artifacts are more dispersed. Furthermore, the modification of residuals within NAOMP can be improved so that the updated atoms represent the noiseless data more precisely.

Appendix

A. OMP versus NAOMP: The First Example

Let $s = s_0 + v$ be an input signal, where $s_0$ denotes the faithful signal and $v$ the noise term, and let the atom number be one. Without loss of generality we assume that both traditional K-SVD and NAOMP-based K-SVD return the same DCT dictionary $D$ (the expression of the trained dictionary depends on the training set and the residual threshold we choose; we make this assumption here so that the dictionary can exactly represent the faithful signal). After the dictionary is obtained, we recover the signal using OMP and NAOMP, respectively. OMP obtains sparse coefficients $x$ and the reconstruction signal $\hat{s} = Dx$. NAOMP, in contrast, obtains the same sparse coefficients but finds that the first component of the residual is relatively greater than the second and the third ones. As a result, the residual is modified following the rule described in Section 4. The sparse coefficients are then recomputed from the modified residual, and the resulting reconstruction signal $\hat{s}'$ approximates the faithful signal $s_0$ better than $\hat{s}$ does.

B. OMP versus NAOMP: The Second Example

Let $s = s_0 + v$ be an input signal, where $s_0$ denotes the faithful signal, $v$ denotes the noise term, and $s_0$ is a combination of two atoms of a given dictionary $D = [d_1, d_2, \ldots]$. According to the expression of $s_0$, the best choice of atoms consists of the two atoms used to build it. Suppose that the coefficients and the noise satisfy the magnitude conditions used in the calculations below, and let us recover the signal using OMP and NAOMP, respectively, both with two iteration steps. For OMP, in the first step, the atom taking the maximum inner product with $s$ is one of the two correct atoms, and the residual $r_1$ is computed accordingly. In the second step, a simple calculation shows that the atom taking the maximum inner product with $r_1$ is not the remaining correct atom; the signal is then recovered from the two selected atoms with the corresponding least-squares coefficients, yielding a certain reconstruction error. For NAOMP, the first step finds the same atom and the same residual as OMP. Since the first component of $r_1$ is greater than the other three components, the residual is modified as described in Section 4, which weakens the noisy component. We reselect the atom taking the maximum inner product with the modified residual, which is still the same atom, and recompute the residual. In the second step, a simple calculation shows that the atom taking the maximum inner product with the new residual is the remaining correct atom. The residual is then updated; because no single component has a relatively great magnitude, the modification step is not triggered, and the reconstruction signal is returned with the corresponding coefficients. The resulting reconstruction error is smaller than that of OMP. Moreover, NAOMP selects the correct atoms while OMP does not.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of China (Grant nos. 61227004, 61402024, and 61300065), the Beijing Natural Science Foundation (Grant nos. 4162009, 4152009, and 4142010), the Beijing Municipal Commission of Education (Grant nos. km201410005013 and km201610005033), the Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality (Grant no. IDHT20150504), and the Jing-Hua Talents Project of the Beijing University of Technology.