Abstract

Two common problems are often encountered in analysis dictionary learning (ADL) algorithms. The first one is that the original clean signals for learning the dictionary are assumed to be known, which otherwise need to be estimated from noisy measurements. This, however, renders a computationally slow optimization process and potentially unreliable estimation (if the noise level is high), as represented by the Analysis K-SVD (AK-SVD) algorithm. The other problem is the trivial solution to the dictionary, for example, the null dictionary matrix that may be given by a dictionary learning algorithm, as discussed in the learning overcomplete sparsifying transform (LOST) algorithm. Here we propose a novel optimization model and an iterative algorithm to learn the analysis dictionary, where we directly employ the observed data to compute the approximate analysis sparse representation of the original signals (leading to a fast optimization procedure) and enforce an orthogonality constraint on the optimization criterion to avoid the trivial solutions. Experiments demonstrate the competitive performance of the proposed algorithm as compared with three baselines, namely, the AK-SVD, LOST, and NAAOLA algorithms.

1. Introduction

Sparse signal representation has been the focus of much recent research in signal processing fields such as image denoising, compression, and source separation. Sparse representation is often established on the synthesis model [13]. Considering a signal , the synthesis sparse representation of can be described as where represents the pseudonorm, defined as the number of nonzero elements of a vector. is a possibly overcomplete dictionary (), and , containing the coding coefficients, is assumed to be sparse with .

Recently, an alternative form of sparse representation model called analysis model was proposed in [414]. In this model, an overcomplete analysis dictionary or operator () is sought to transform to a high dimensional space; that is, where is called the analysis representation of and assumed to be sparse and is cosparsity of the analysis model, which is the number of zeros in the vector .

Both models can be used to reconstruct an unknown signal from the corrupted measurement ; that is, where is a Gaussian noise vector. This is an ill-posed linear inverse problem, which has been studied extensively [15, 16]. Using the synthesis model, the original signal can be estimated by solving and then calculated as , where denotes the noise floor. It is now well known [6] that the original signal can also be recovered by solving the following analysis model based optimization problem: For the squared and invertible dictionary, the synthesis and the analysis models are identical with [9]. However, the two models depart if we concentrate on the redundant case ( and ). In the analysis model, the signal is described by the zero elements of , that is, zero entries in that define a subspace which the signal belongs to, as opposed to the few nonzero entries of in the synthesis model.

The performance of both the synthesis and analysis models hinges on the representation of the signals with an appropriately chosen dictionary. In the past decade, a great deal of effort has been dedicated to learning the dictionary for the synthesis model; however, the ADL problem has received less attention with only a few algorithms proposed recently [5, 6, 10, 1719]. In these works, the dictionary is often learned from the observed signals measured in the presence of additive noise; that is, where contains the original signals, is a Gaussian noise matrix, and is the number of signals.

Exploiting the fact that a row of the dictionary is orthogonal to a subset of training signals , a sequential minimal eigenvalue based ADL algorithm is proposed in [5]. Once the subset is found, the corresponding row of the dictionary can be updated with the eigenvector associated with the smallest eigenvalue of the autocorrelation matrix of these signals. However, as increases, so does the computational cost of the method. In [68], an -norm penalty function is applied to , and a projected subgradient algorithm (called NAAOLA) is proposed for analysis operator learning. These works employ a uniformly normalized tight frame as a constraint on the dictionary to avoid the trivial solution (e.g., ), which however limits the range of possible to be learned. In [10], the Analysis K-SVD (AK-SVD) algorithm is proposed for ADL. By keeping fixed, the optimal backward greedy algorithm (OBG) is employed to estimate a submatrix of whose rows are then used to determine a submatrix of and the eigenvector associated with the smallest eigenvalue of this submatrix is then used to update . A generalized analysis model, that is, the transform model, is proposed in the learning sparsifying transform (LST) algorithm [11]. Unlike the analysis model (2), the sparse representation of a signal in the LST model is not constrained to lie in the range space of . Such a generalization allows the transform model to accommodate a wider class of signals. A closed-form solution to the LST problem is also developed in [13]. The LST algorithm [11] is further extended to the overcomplete case, leading to the LOST algorithm in [12]. The transform K-SVD algorithm proposed recently in [14] is essentially a combination of the ideas in the LOST and AK-SVD algorithms.

There are two problems that have been observed or already studied in some of the ADL algorithms discussed above. The first one is associated with the computational cost in the optimization process of the ADL algorithms. For example, in the AK-SVD algorithm, needs to be estimated before learning the dictionary, and, as a result, the optimization process becomes computationally demanding. The second one is associated with the trivial solutions to the dictionary, which may lead to spurious sparse representations. To eliminate such trivial solutions, extra constraints are required, such as the full-rank constraint on employed in [1113] and the mutual coherence constraint on the rows of in [12, 14].

In this paper, we propose a new optimization model and algorithm, attempting to provide alternative solutions to the two potential problems mentioned above. More specifically, similar in spirit to our recent work [17], we directly use the observed data to compute the approximate analysis sparse representation of the original signals, without having to preestimate for learning the dictionary. This leads to a computationally very efficient algorithm. Moreover, different from the LOST algorithm, we enforce an orthogonality constraint on , which, as shown later, has led to an improved learning performance.

The paper is organized as follows. In Section 2, we discuss the novel ADL model and algorithm. In Section 3, we show some experimental results, before concluding the paper in Section 4.

2. The Proposed ADL Algorithm

In some algorithms discussed above such as [10], the optimization criterion is based on . In practice, however, is unknown and required to be estimated from . This unfortunately results in a computationally very slow optimization process as shown in Section 3.1. To address this issue, we introduce a new model and algorithm for ADL, where, similar to [17], we use as an approximation to , which is further replaced by and then used as a new constraint for the reconstruction term . This leads to the following model of ADL: where is a regularization parameter. According to the analysis model (2), would be an analysis coefficient matrix only if its columns lie in the range space of the analysis dictionary . However, is not explicitly enforced in (7), and hence it is called the transform coefficient matrix as in [19]. Using (7), the analysis dictionary is learned from without having to preestimate as opposed to the AK-SVD algorithm, so the computational effort for estimating from is exempted.

With the model (7), however, there is a trivial solution; that is, , . To avoid such a solution, additional constraints on are required. In [12], a full column rank constraint has been imposed on through the use of a negative log determinant of . However, the inverse of needs to be computed in the conjugate gradient method in [12], and the inverse of may affect the stability of the numerical computation when the initial of is ill conditioned. To address this problem, in our work, a function defined on is employed as a constraint term which enforces to be a full column rank matrix, based on the fact that the ranks of and its corresponding Gram matrix are equal. This leads to the following new optimization criterion: Using a Lagrangian multiplier , the optimization problem can be reformulated as

It is worth noting that our proposed objective function (9) is essentially different from the one used in [7]. In [7], the objective function is defined as follows: where is the th row of . With (10), the analysis dictionary is learned from the clean signal and the sparse constraint is enforced on . In our proposed ADL algorithm, however, the analysis dictionary is directly learned from the noisy observed signal and thus can tolerate some sparsification errors as in [11]. The results of the experiments demonstrate that our proposed approach is more robust.

Note also that when is sparse, minimizing can be obtained by the minimization of , subject to the sparsity constraint. However, both and are unknown. To solve the problem, we propose an iterative method to alternatively update the estimation of and .

2.1. Estimating

Given the dictionary , we first consider the optimization of only. In this case, the objective function (9) can be modified as The first-order optimality condition of implies that Therefore, we have where and are the indices of the matrix elements. It is well known that the above solution for is called soft thresholding. Indeed, the sparsity constraint is only an approximation to . In order to promote sparsity and to improve the approximation, one could instead use the hard thresholding method [20] as an alternative, that is, setting the smallest components of the vectors to be zeros while retaining the others: As such, the solution obtained using the constraint will be closer to that using .

2.2. Dictionary Learning

For a given , the objective function (9) is nonconvex with respect to , similar to the objective function (10); that is, The local minimum of the objective function (15) can be found by using a simple gradient descent method where is a step size and . If the rows of have different scales of norm, we cannot directly use the hard thresholding method to obtain . The phenomenon is called scaling ambiguity. Moreover, the constraint can enforce to be of full column rank but cannot avoid a subset of rows of to be possibly zeros. Hence, we normalize the rows of to prevent the scaling ambiguity and replace the zero rows, if any, by the normalized random vectors; that is, where is the th row of and is a normalized random vector. There may be other methods to prevent this problem, for example, by adding the normalization constraint to the objective function [7] or by adding the mutual coherence constraint on the rows of to the objective function [12, 14]; however, this is out of the scope of our work. Although, after the row normalization, the columns of the dictionary are no longer close to the unit norm, the rank of will not be changed.

2.3. Convergence

In the step of updating , the algorithm to optimize the objective function (11) is analytical. Thus, the algorithm is guaranteed not to increase (11) and converges to a local minimum of (11) [21]. Furthermore, in the dictionary update step, is updated by minimizing a fourth-order function, and thus a monotonic reduction in the cost function (15) is guaranteed.

Our algorithm (called orthogonality constrained ADL with iterative hard thresholding, OIHT-ADL) to learn analysis dictionary is summarized in Algorithm 1.

Input: Observed data , the initial dictionary , , the number of iterations
Output: Dictionary
Initialization: Set
For     do
 (i) Fixing , compute
 (ii) Update the analysis coefficients using (14)
 (iii) Fixing , update the dictionary using (16)
 (iv) Normalize rows in using (17)
End

3. Computer Simulations

To validate the proposed algorithm, we perform two experiments. In the first experiment, we show the performance of the proposed algorithm for synthetic dictionary recovery problems. In the second experiment, we consider the natural image denoising problems. In these experiments, is the initial dictionary in which each row is orthogonal to a random set of training data and is also normalized [10]. For performance comparison, the AK-SVD [10], NAAOLA [8], and LOST [12] algorithms are used as baselines.

3.1. Experiments on Synthetic Data

Following the work in [10], synthetic data are used to demonstrate the performance of the proposed algorithm in recovering an underlying dictionary . In this experiment, we utilize the methods to recover a dictionary that is used to produce the set of training data. The convergence curves and the dictionary recovery percentage are used to show their performance. If , a row in the true dictionary is regarded as recovered, where is an atom of the trained dictionary. is generated with random Gaussian entries, and the dataset consists of signals each residing in a 4-dimensional subspace with both the noise-free setup and the noise setup (,  dB). The parameters of the proposed algorithm are set empirically by experimental tests, and we choose the parameters as , , and which give the best results in atom recovery rate. We have tested the parameters with different values in the ranges of , , and . For the AK-SVD algorithm, the dimension of the subspace is set to be following the setting in [10]. For the NAAOLA algorithm, we set which is close to the default value in [8]. The parameters in the LOST algorithms are set as those in [11]; that is, , , , and . The convergence curves of the error by the first term of (7) versus the iteration number are shown in Figure 1(a). It can be observed that the compared algorithms take different numbers of iterations to converge (the terms used to measure the rates of convergence are slightly different in these algorithms, because different cost functions have been used), which are approximately 200, 100, 1000, and 300 iterations for the OIHT-ADL, AK-SVD, NAAOLA, and LOST algorithms, respectively. Thus, the recovery percentages of these algorithms are measured at these different iteration numbers (to ensure that each algorithm converges to the stable solutions). It is observed from Figure 2 that , , , and of the rows in the true dictionary are recovered for the noise-free case and , , , and for the noisy case, respectively. Note that the recovery rates for the LOST algorithm are obtained with random initialization. With the DCT and identity matrix initialization, the LOST algorithm can recover and rows of the true dictionary in noise-free and noise cases, respectively, after 300 iterations. Although it may not be necessary to recover the true dictionary in practical applications, the use of atom recovery rate allows us to compare the proposed algorithm with the baseline ADL algorithms as it has been widely used in these works. The recovery rate by the NAAOLA algorithm is relatively low, mainly because it employs a uniformly normalized tight frame as a constraint on the dictionary and this constraint has limited the possible to be learned. The AK-SVD algorithm performs the best in terms of the recovery rate, that is, taking fewer iterations to reach a similar recovery percentage. However, the running time in each iteration of the AK-SVD algorithm is significantly higher than that in our proposed OIHT-ADL algorithm. The total runtime of our algorithm for 200 iterations is about 825 and 832 seconds for the noise-free and noise case, respectively. In contrast, the AK-SVD algorithm takes about 11544 or 10948 seconds, respectively, for only 100 iterations (Computer OS: Windows 7, CPU: Intel Core i5-3210M @ 2.50 GHz, and RAM: 4 GB). This is because our algorithm does not need to estimate in each iteration, as opposed to the AK-SVD algorithm.

3.2. Experiments on Natural Image Denoising

In this experiment, the training set of 20,000 image patches, each of pixels, obtained from three images (Lena, House, and Peppers) that are commonly used in denoising [10], has been used for learning the dictionaries, each of which corresponds to a particular noisy image. Noise with different level , varying from 5 to 20, is added to these image patches. The dictionary of size is learned from the training data which are normalized to have zero mean. For fair comparison, the dictionary size is the same as that in [10]. The dictionary with a bigger size may lead to a sparser representation of the signal but may have a higher computational cost. Similar to Section 3.1, enough iterations are performed for the tested algorithms to converge. The parameters for the OIHT-ADL algorithm are set the same as those in Section 3.1, except that . For the AK-SVD algorithm, the dimension of the subspace is set to be following the setting in [10]. For the NAAOLA algorithm, we set (), (), and (). For the LOST algorithm, the parameter values are set as , , , and . The parameters for the NAAOLA algorithm and the LOST algorithm are chosen empirically. The examples of the learned dictionaries are shown from the top to the bottom row in Figure 3, where each atom is shown as a pixel image patch. The atoms in the dictionaries learned by the AK-SVD algorithm are sorted by the number of training patches which are orthogonal to the corresponding atoms, and therefore the atoms learned by the AK-SVD algorithm appear to be more structured. Sorting is however not used in our proposed algorithm. Then we use the learned dictionary to denoise each overlapping patch extracted from the noisy image. Each patch is reconstructed individually and finally the entire image is formed by averaging over the overlapping regions. We use the OBG algorithm [10] to recover each patch image. For fair comparison, we do not use the specific image denoising method designed for the corresponding ADL algorithm such as the one in [12]. The denoising performance is evaluated by the peak signal to noise ratio (PSNR) defined as (dB), where and are the th pixel value in noisy and denoised images, respectively. The results, averaged over 10 trials, are presented in Table 1, from which we can observe that the performance of the OIHT-ADL algorithm is generally better than that of the NAAOLA algorithm and the LOST algorithm. It is also better than that of the AK-SVD algorithm when the noise level is increased.

4. Conclusion

We have presented a novel optimization model and algorithm for ADL, where we learn the dictionary directly from the observed noisy data. We have also enforced the orthogonality constraint on the optimization criterion to remove the trivial solutions induced by a null dictionary matrix. The proposed method offers computational advantage over the AK-SVD algorithm and also provides an alternative to the LOST algorithm in avoiding the trivial solutions. The experiments performed have shown its competitive performance as compared to the three state-of-the-art algorithms, the AK-SVD, NAAOLA, and LOST algorithms, respectively. The proposed algorithm is easy to implement and computationally efficient.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grants nos. 61162014 and 61210306074) and the Natural Science Foundation of Jiangxi (Grant no. 20122BAB201025) and was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) of the UK (Grant no. EP/K014307/1).