Although the nonlocal means (NLM) algorithm takes a significant step forward in image filtering field, it suffers from a high computational complexity. To deal with this drawback, this paper proposes an acceleration strategy based on a correlation operation. Instead of per-pixel processing, this approach performs a simultaneous calculation of all the image pixels with the help of correlation operators. Complexity analysis and experimental results are reported and show the advantage of the proposed algorithm in terms of computation and time cost.

1. Introduction

Digital image denoising has been a fundamental and challenging issue for several decades [1, 2]. Many contributions have been devoted to recover the image degraded by Gaussian noise. Much attention has been paid to the Partial Differential Equation (PDE) approaches among which the total variation (TV) and the Perona and Malik (PM) model are well known [35]. Other methods rely on the image transform from the spatial domain to another domain (such as the Fourier domain, the wavelet domain, and the DCT domain). After adjusting the transform coefficients, the image is restituted by applying the inverse transform [6, 7].

The algorithm of nonlocal means (NLM) filtering was proposed by Buades et al. [8]. They suggested that a denoised pixel is equivalent to the weighted average of its neighboring pixels, with the weights calculated by the normalized Gaussian weighted Euclidean distance between the blocks centred at those pixels. The NLM algorithm has demonstrated better performance than other main-stream filter methods, such as bilateral filter and TV model in both visual performance and objective measure [9]. Unfortunately, the computational cost of computing the weights is too expensive in many applications.

Some solutions have been proposed to alleviate this high computation burden for weights’ calculation. Liu et al. proposed an approximation to the similarity of neighborhood windows by employing an efficient Summed Square Image (SSI) further combined with Fast Fourier Transform (FFT) [10, 11]. In [12, 13], the NLM algorithm is accelerated by eliminating some computation of weights through a preclassification step based on a hard threshold of local block measures (average intensities, gradients, and first- and second-order moments). Also, some Probabilistic Early Termination (PET) schemes, such as cluster tree, Singular Value Decomposition (SVD), or dictionaries for image blocks and image edges, were also employed to speed up the weights’ calculation [1417].

In this paper, compared with the traditional NLM algorithm, in which the calculations of weights are by the pixel by pixel way, our proposed fast strategy was performed on the whole image. Specifically, correlation operators are applied to compute the differential image and lead to a straightforward shortcut to achieve all the weights. By doing this, a lot of redundancy computation can be successfully avoided. Thereby, the computation speed of NLM is improved.

This paper is organized as follows. In Section 2, we briefly review the NLM algorithm. The repetitive computation causing the original per-pixel NLM slowness is analyzed and our proposed fast NLM algorithm is described in Section 3. In Section 4, some experiments are conducted and they show that a significant improvement over competitive approaches is brought by our method. Section 5 concludes this paper and some relative discussions are given.

2. The Non-Local Means Algorithm

2.1. The Principle of NLM

The image denoising model for an image , where is the real value domain, can be formulated bywhere denotes the noisy image value, represents the original noise-free image value, and is the noise value at the pixel . The idea behind the NLM filter is to consider that the denoised pixel value of is equivalent to the weighted mean of all pixels’ values of the noisy image (indexed in image ) [1, 8]. However, considering the high computational cost, it was suggested in [1, 8] that we can estimate the denoised pixel value using the pixels within a larger search neighborhood centred at the pixel i, where denotes the radius of , whose pixel values are scanned column by column and then concatenated to a vector:where is the number of pixels within . The superscript “” denotes the transpose and , , is the value of the th pixel in the search neighborhood . In this situation, the NLM algorithm can be expressed bywhere is the filtered image value of the pixel and represents the weight between and . The weight is acquired bywhere is a normalization factor:and denotes the exponential Gaussian weighted Euclidean distance:In (6), acts as a filtering parameter and is the Gaussian Euclidean distance between blocks and centred at the pixels and , respectively, and is given by In (7),obtained by scanning the pixel values of the block column by column and then concatenating them and is the number of pixels within the square block whose radius is set as . That is to say, . Similarly,is achieved by scanning the pixel values of the block column by column and then concatenating them. Using (8) and (9), (7) can be deduced step by step as follows:where is the th component of the discrete Gaussian kernel , whose standard deviation is . Apparently it has the same size as the block .

2.2. Computation Cost Using Per-Pixel Algorithm

From (10)–(12), we can clearly see the cause of the NLM computational complexity. For every pixel , the NLM algorithm must compute the dissimilarity between and all in the search neighborhood . For instance, to filter a noisy image of size , it needs multiplications and ( additions. By cutting down the search neighborhood into 15 × 15, with the block set as 3 × 3, for a general 512 × 512 size image , it needs 2 × 92  × 152  × 5122 multiplications and 17 × 152  × 5122 additions. Hence, a fast NLM algorithm is required. This motivated us to propose a correlation based accelerated nonlocal denoising algorithm.

3. The Fast Nonlocal Means Algorithm

3.1. Repetitive Computation of the Per-Pixel Algorithm

From the above observations, we can see that, for one specific central pixel , the Gaussian weighted Euclidean distance between the block centred at and block centred at in the neighboring needs to be computed. However, for any pixel adjoining the pixel , the Gaussian weighted Euclidean distance between the block centred at and block centred at in the neighboring also needs to be computed. Because the two pixels are adjacent, and have lots of overlap and are the same as and . Therefore, by the per-pixel way to implement the NLM algorithm, the blocks based Gaussian weighted Euclidean calculation is highly repetitive.

More precisely, for a block with size , the repetitive calculated square term is . The repetitive rate is . To get an intuitive observation, a diagram is depicted in Figure 1 in which we set the size of block as =3 × 3. It can be observed that there are 2/3 repetitive operators of intensities differences when calculating the weights for the adjacent centered values of pixels and .

3.2. Our Proposed Fast Algorithm

To avoid repetitive computations of the pixel difference , in (12) by the traditional pixel by pixel way, the proposed correlation based accelerated fast NLM algorithm deals with the image as a whole. In this case, it is necessary to deal with (1)–(9) in matrix form. Mathematically, it is trivial to extend them to the nonscalar matrix case. Then, the variables in (1)–(9) are restricted to all the pixels in the noisy image rather than some specific pixel .

To deal with the operation with the whole image , our algorithm can be accordingly divided into 8 steps being consistent with the opposite sequence of (1)–(9).

Step 1. Compute the “differential image” , which is matrix case consistent with in (12).

This step can be efficiently implemented via a correlation operator:where “” denotes the correlation operation and denotes the th differential image of the original image . is a 2D correlation kernel, which contains only two nonzero elements; that is, the center pixel value is and the th pixel value is . Note that, for every different values, there is a different correlation kernel since determines the position of the pixel value . Figure 2 gives an illustration of the = 5th differential image of the whole image within an 11 × 11 search neighborhood.

By means of (13), the achieved differential image contains all the available information in (12). In fact, (13) can be straightforwardly implemented by a translation operation:where represents the image shifted by a coordinate vector , . Here, the th vector points out the coordinate (, ) from the origin () as shown in Figure 2. Note that the correlation kernel and , , are one-to-one mapping; that is, when where () is the coordinate of “” in shown in Figure 2, then, its corresponding is given by

Step 2. Compute the “square image” , which is matrix case consistent with in (11).

This step can be easily conducted by an element-wise square operation as follows:where “” acts as an element-wise product operator.

Step 3. Compute the “Gaussian weighted Euclidean distance image” , which is matrix case consistent with in (10).

This step can also be implemented by a correlation as follows: is the Gaussian weighted Euclidian distance corresponding to the th pixel in the search neighborhood for the whole image . Equation (18) can be efficiently implemented by FFT as follows:

Step 4. Compute the exponential Gaussian weighted Euclidian distance image , which is matrix case consistent with in (6).

Step 5. Then, repeat Steps and compute all the , .

Step 6. Compute the normalization factor image , which is matrix case consistent with in (5).

Step 7. Compute the weights , which is matrix case consistent with in (4). This step can be implemented by an element-wise division operation.

Step 8. Compute the processed in (3).

3.3. Symmetry Property

Note that the computational cost of Step can be further reduced by exploiting the symmetry property of the ; that is, where denotes the th translation vector. The symmetry property is shown in Figure 3.

Thus, when we calculate the th weights for the image, the th weights of the image are also meanwhile obtained as a secondary product, which saves the computational cost by one half.

Specifically, when the kernel is employed to yield the th differential image according to (14), the th differential image is achieved by means of (20) as follows:

Next, when is calculated by (17), is also meanwhile achieved by

In Steps , only once same operation is implemented, and are simultaneously obtained.

Intuitively, this can be observed in Figure 3. By means of , for each pixel within the image , the 13th differential image () is available which is corresponding to difference between the red and green blue rectangle. At the same time, the negative () is also available which is corresponding to the difference between the red and blue rectangle. Following Steps , only once operation is implemented, and are simultaneously obtained.

Therefore, the necessary translation operations in terms of can be reduced from () to according to the symmetry property (see (20)). Thus, the computational cost will be reduced by one half.

3.4. Computational Complexity

Note also that the differences between the proposed fast algorithm and the traditional per-pixel NLM algorithm lie in Steps . Steps to in the proposed algorithm include the same operations as those in the original NLM algorithm. So the computational complexity in these steps is not taken into account in the following subsection of complexity analysis.

The computational complexity of the proposed algorithm is analyzed in this section. We assume that FFT and Inverse Fourier transform (IFFT) need the same computational complexity. In general, the computational complexity of two-dimensional FFT for an image needs approximately multiplications and additions [18].

Step 1. Computing in (14) for all requires shifts; therefore, ( additions are necessary.

Step 2. Computing in (17) for all needs multiplications.

Step 3. Computing in (19) for all involves FFTs, IFFTs, and multiplications for element-wise multiplications. Therefore, it needs multiplications and additions.
The computational cost of using the proposed algorithm, the original algorithm [8], and the FFT based algorithm [11] are, respectively, given in Table 1. Both the proposed algorithm and the algorithm of [11] are superior to the original NLM algorithm in terms of the computational complexity. According to Table 1, if (23) holds, the complexity relations in (24) and (25) will also hold:Here, to satisfy (23), the minimum pixel number in the search neighborhood will be and that can be used to deduce the minimal radius of search region. Table 2 lists the and values according to different image sizes. These requirements on minimal values can be easily met in most practical cases; thus, the proposed method is superior to that of [11].

4. Experiments

In this section, we assess the performance of the proposed algorithm by conducting experiments with the “Lena” image with size 512 × 512 (Figure 4). The degradation is simulated by adding zero-mean Gaussian noise with standard deviations 10, 20, and 30, respectively. For Gaussian noise with different standard deviations, we use different size settings for search neighborhood and block to obtain the optimal denoising effect. For the standard deviations 10, 20, and 30, the sizes of block are set as 3 × 3, 5 × 5, and 7 × 7 and the search neighborhoods are set as 15 × 15, 17 × 17, and 21 × 21, respectively. For a fair comparison with the methods in [8, 11], we use the simplified weight calculation strategy based on the Euclidian distance rather than Gaussian weighted one. All the experiments are performed in the Matlab R2015a environment using a PC computer with 16 G ROM and Intel i7 CPU. Our experiments are compared with Buades’s classical per-pixel algorithm [8] and the Wang’s algorithm [11].

We conduct quality assessment in terms of peak signal to noise ratio (PSNR) and structural similarity index measuring (SSIM) [19]. These quantitative metrics are defined by (26):where and are the means of the 8 × 8 square window and and are the image size. and are the corresponding standard deviations and is the corresponding covariance. and are constant and set according to [19].

The experiments results are reported in Table 3. We can note that the proposed algorithm leads to a very significant decrease in computation cost. Because all the methods are based on the same Euclidean distance, the three different methods yield the same values of PSNR and SSIM in this table.

5. Conclusion and Discussion

In this paper, we proposed a correlation based strategy to accelerate NLM algorithms. This method filters the whole image simultaneously instead of filtering pixel by pixel as the original NLM one. This new approach shows that both the computational complexity and the running time are greatly reduced when compared to the original NLM method [8] and the method reported in [11]. The acceleration method in [11] has to be applied using the simplified Euclidean distance but the proposed approach can overcome this limit by effectively calculating the Gaussian kernel convolution in (18). Our future work will consider its parallelization using GPU techniques and will address applications in medical image processing such as low-dose CT scanner [2022] and also color image processing [2325].

Recently, the collaborative filtering algorithms, such as the BM3D algorithm (block matching and 3D filtering algorithm), which are extension of the general concepts of grouping and block matching from the NLM, have shown the superior denoising performance. These are described in [26] and applied in [27] to complex data sets with good results. However, the method also encounters the problem of the large computational complexity costed to group. The application of our proposed correlation based strategy can simply deal with the problem. Actually, the grouping process is consistent with the process of weights’ calculation within NLM algorithm. Thereby, our method can be employed to group. In addition, because of the existence of noise, the tradition grouping strategy is easily disturbed [26, 27]. Our method can overcome the issue to a large extent to enhance grouping in terms of the adjustment of the Gaussian filtering kernel employed in the NLM algorithm (see (18)).

Eventually, referring to Table 3, it can be observed that the computational time increases versus the augment of noise standard deviation. This is because, for larger noise corrupted image, a larger search neighborhood and block are necessary, and vice versa. Fortunately, the concept of random resampling mask has been developed in [28, 29], which can lower the noise standard deviation. Thus, the random mask prefiltering helps to reduce the computational time of these methods, which will be further studied in our future work.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper. They also solidly confirm that the mentioned received grants in Acknowledgments did not lead to any conflict of interests regarding the publication of this paper.


This research was supported by National Basic Research Program of China under Grant 2010CB732503, by National Natural Science Foundation under Grants 81370040, 81530060, 31100713, and 61201344, and by the Qing Lan Project in Jiangsu Province.