Abstract

This paper deals with the problem of overcomplete transform learning. An alternating minimization based procedure is proposed for solving the formulated sparsifying transform learning problem. A closed-form solution is derived for the minimization involved in transform update stage. Compared with existing ones, our proposed algorithm significantly reduces the computation complexity. Experiments and simulations are carried out with synthetic data and real images to demonstrate the superiority of the proposed approach in terms of the averaged representation and denoising errors, the percentage of successful and meaningful recovery of the analysis dictionary, and, more significantly, the computation efficiency.

1. Introduction

Signal representations play a role of cornerstones in signal processing. They have experienced the evolution from the Fourier transforms, which are traced back to the later part of the 19th century, to the wavelet expansions in the 1980s and then the sparse representations which have been one of the hotspots in signal processing community for the last two decades. Instead of the traditional expansions in terms of bases and frames, sparse redundant representations look for the best approximation of a signal vector with a linear combination of few atoms from an overcomplete set of well-designed vectors [1]. This topic is closely related to the sparsifying dictionary learning [2, 3] and the compressed sensing (CS) [46], an emerging area of research in which the sparsity of the signals to be recovered is a prerequisite.

Let . The -norm of is defined as For convenience, is used to denote the number of nonzero elements in though is not a norm as it does not satisfy the homogeneity.

1.1. Signal Models and Dictionary Learning

Let be the signal under consideration, where is the clean signal and is an additive noise or modelling error. There are basically two models used for the clean signal . The first one is called synthesis model which assumes that the clean signal is given by where is usually called a synthesis dictionary (throughout this paper, a lower-case letter, say , is usually used to denote a column vector of the matrix . Furthermore, and denote the th column vector and th entry of matrix , resp.) and is said overcomplete if , and is the coefficient vector of in . The signal is said -sparse in if .

The procedure for finding with and given is referred to as analysis. Clearly, is a solution to this problem as long as , where is the identity matrix of dimension . It should be pointed out that (2) and (3) are equivalent when is square and nonsingular. When is overcomplete, such a claim does not hold as (2) is not an one-to-one mapping between and . Equation (3) is referred to as analysis model [712] and is called analysis dictionary or analysis operator. When , which is assumed throughout this paper, the dictionary is said redundant or overcomplete.

Let be a set of signal samples belonging to a certain class of signals. For a given sparsity level, say , sparsifying synthesis dictionary learning is to find such that , with . Methods for learning the dictionary from signal samples have been one of the hottest topics in signal processing, among which the method of optimal directions (MOD) [13] and K-singular value decomposition (KSVD) based algorithms [14, 15] have been considered as state of the art in this area. The basic idea behind these methods is based on the alternating minimization strategy [1618], in which the dictionary and the sparse coefficients are updated alternatively using the orthogonal matching pursuit (OMP) techniques [1921].

Compared with the synthesis model-based sparsifying dictionary learning, its analysis model-based counterpart has received less attention. The analysis dictionary learning problem can be stated as where , denoting the number of zero elements in , is the cosparsity given and denotes the th column of . (throughout this paper, we use “tilde” (say ) to denote a (matrix/vector) variable, while is used for the optimal/true one in a problem). As is the clean signal, (4) is also signal denoising.

The first attempt to address such a problem was given in [7], where it is suggested to incrementally learn the dictionary one row at a time. The performance of the proposed algorithm, however, depends heavily on a randomized initialization strategy. A different approach to the dictionary learning was proposed in [810]. As proposed in [10], (4) is investigated with the following formulation: where is a specified set within which the optimal operator is searched and is the parameter corresponding to the noise level. Such a problem is alternatively addressed using a regularized version in [10].

Another signal model, which is very much close to the analysis model, is the transform model raised in [22]. Such a model is characterized with where is called a transform and is sparse with , while is the representation/modeling error in transform domain and is assumed to be very small. It is interesting to note that, for an analysis signal with sparse, one has which is always in the form specified in (6), while if belongs to the transform model-based class in , which means there exists a pair with being sparse such that , may have no solution for , at all given that the number of equations is bigger than that of unknowns due to assumed.

When is available, the signal is transformed to -sparse vector : The solution can be achieved simply by thresholding the product and retaining only the largest coefficients (magnitudewise) of this product. The estimate of the original signal can be found from and by minimizing with respect to . This yields as long as is full rank (under the assumption that ), where denotes the transpose operator.

The sparsifying transform learning problem is formulated as This problem is not well defined and has the problem of so-called scale ambiguity. It is observed that a trivial solution for the above problem is and . One approach to prevent such solutions is to add some penalty functions into the objective. In [22], the penalty was included in the cost function for learning a complete transform with dimension . The term is used to force the full rank of and the term is applied to avoid the scaling issue. See where and are two positive constants, while is the natural logarithmic operator applied to the determinant of a matrix. The optimal transform design was then reformulated as subject to . Such a problem was solved in [22, 23].

1.2. Contributions

In this paper, we will consider solving (9) with given . An efficient algorithm is proposed to solve the optimal sparsifying transform design problem. It should be pointed out that our proposed approach is more general than those in [22, 23] as it deals with overcomplete transform learning and is more efficient than [24] since the transform is updated using a closed-form solution rather than the one obtained with a gradient-based numerical procedure. All of these will be elaborated further in next section. Roughly speaking, (4) has been attacked using alternative optimization, in which with obtained, is updated based on , . This can be considered with As seen, this problem is exactly the same as (9) with replaced with . So, the efficiency of the analysis KSVD proposed in [12] is expected to be improved using our proposed algorithm for analysis dictionary update. We will demonstrate the potential of this proposed algorithm in a series of experiments using both synthetic and real data and compare it with the approach in [12]. As to be seen, our approach yields a performance comparable to that by the latter in terms of denoising and recovery of the reference dictionary and more importantly a much improved computation efficiency.

The outline of this paper is given as follows. In Section 2, after briefing some related work, we formulate the problem to be investigated in the context of transform learning and how it is related to analysis dictionary learning. Section 3 is devoted to solving the optimal transform design. An efficient algorithm is derived for that purpose, in which a closed-form solution is obtained for updating the transform. The performance of this algorithm and its application to signal denoising are discussed in this section. In order to demonstrate the superiority of the proposed algorithm, experiments are carried out in Section 4, which show promising performance of the proposed approach. Some concluding remarks are given in Section 5.

2. Preliminaries and Problem Formulation

2.1. Related Works

The first attempt to address the analysis dictionary learning was given in [7], where it is suggested to incrementally learn the dictionary one row at a time. The performance of the proposed algorithm, however, depends heavily on a randomized initialization strategy. A different approach to the dictionary learning was proposed in [810], where the goal of sparsifying the representations is formulated using an -based penalty function rather than the -based one. To avoid the trivial solutions, the dictionary is constrained to be uniform normalized tight frame; that is, , the identity matrix of dimension . Such a constraint clearly limits the searching space for the optimal dictionary.

Recently, it was argued in [12] that, in the analysis model, one should emphasize the zeros of . Based on this argument, (4) was reformulated in [12] as where denotes the th row vector of matrix , is the cosupport of , defined as the set of indices for the rows of that are orthogonal to , and is the submatrix of that contains only the rows indexed by , while is the dimension of the subspace to which belongs.

An algorithm, called analysis KSVD, was proposed in [12] to address (13). The analysis KSVD, denoted as AKSVD, was developed in a similar way as the one adopted by its counterpart KSVD [14] for the synthesis model and as shown in [12], outperforming the one reported in [7]. It has been observed, however, that the AKSVD is very time consuming and therefore more efficient algorithms are needed.

In order to restrict the solution set of the transform learning problem (9) to an admissible set, was applied in [25], which means forcing the row of the transform to be a tight frame. However, only tight frame constraint is insufficient to prevent undesired solutions of (9) for learning an overcomplete transform (), for which it is highly possible that practical algorithms result in a local minimal solution of the form with being a tight frame. In [24], the overcomplete transform learning (9) was attacked with the following penalty function to be embedded to the cost function: where and are weighting factors and is usually set to and . An iterative procedure using the alternating minimization strategy was proposed to solve the corresponding problem, where a gradient-based algorithm is utilized for updating the transform. Clearly, such a method is time consuming and may suffer from local minimum issue due to the use of gradient-based algorithm. Furthermore, the possibility of having same rows in can be reduced with the second term of the penalty but zero rows may occur if is not (row) normalized.

2.2. Problem Formulation

Based on the previous discussions, the optimal sparsifying transform learning problem can be formulated as subject to

To implement (15), we consider the following cost function: where is a penalty term (i.e., a regularizer) defined as Both and are positive parameters to be discussed later.

The problem to be addressed in this paper is finally formulated as

Comment 1. (i) The additional regularizer in (17) serves as a constraint on the space for searching the true transform . The term is to ensure that is not too big, while − is to guarantee that is not too small and hence the constraint of full rank is met. The combination of the two terms ensures that the optimal dictionary is searched within a well behaved space such that the undesired scaling problem can be avoided.
(ii) As to be seen in the next subsection, the minimum of this regularizer is a set of tight frames. Therefore, if (4) has a solution close to a tight frame, it can be well estimated by (19).
(iii) Comparing (19) with (11), one realizes that (19) can be used for optimal overcomplete sparsifying transform learning. Besides, instead of −, − is adopted in our approach. By doing so, the constraint is no longer needed.

2.3. Analysis Dictionaries and Transforms

Consider the analysis model-based dictionary learning (4). Define and as the representation error and the sparsification error, respectively, where . Then It is assumed in the sequel that the true dictionary is full rank. Such an assumption ensures that the row vectors of can span the entire signal space .

The following lemma specifies a class of true dictionaries for which the equivalence between (15) and (4) holds.

Lemma 1. Equation (15) is equivalent to (4) if the true dictionary is a -tight frame (TF) (); that is, has an SVD of form where both and are orthonormal matrices of proper dimension.

Proof. With (20) and (21), we have Then, Lemma 1 follows immediately from the fact that the two constraints and are equivalent as .

Comment 2. As seen from Lemma 1, the equivalence between the two problems holds if the true dictionary is a -tight frame. This is in general not true. However, it is observed that for given the problem defined by (4) has in general more than one solution . In fact, define and . It can be seen that if is a solution, is : where is any given nonsingular diagonal matrix, while both and are arbitrary except such that . This means that there exist a lot of degrees of freedom in the solution set of (4) and that by forcing finding the solution for which the dictionary is the closest to a TF, (15) is then expected to yield a transform that is a good estimate of a solution for the analysis dictionary learning problem (4).

2.4. Rule of Thumb for Choosing and

Now, we present a result regarding minimization of . A similar result was given in [22].

Lemma 2. Let be defined in (18). Then, with the equality achieved if and only if is a -tight frame.

Proof. Let be the SVD of , where with , . Clearly, .
Note that and . It follows from that the solution for minimizing is given by and hence (24) holds. Clearly, the equality is attained if and only if , .

Comment 3. (i) Based on Lemma 2, the cost function is bounded from below with .
(ii) It follows from Lemmas 1 and 2 that if the reference analysis operator happens to be a tight frame, then the solution of (19) should yield the same operator no matter what is taken.
(iii) As indicated in Comment 2, the solution can be expected to be close to a tight frame. This suggests that should take a value much larger than one since when is very large, minimizing can be approximated with minimizing .
(iv) As to be seen from (19), we need to normalize the dictionary such that , . Under this constraint, the constant in (21) turns out to be . Since holds for overcomplete analysis dictionary, should be ensured.
Before turning to next section, we should point out that though the proposed approach is not directly related to the overcomplete analysis dictionary learning (4) as (13) is in general, it can provide a good estimate of the true dictionary in a very efficient way, which is believed to be useful in improving the analysis KSVD algorithm.

3. The Proposed Algorithm

This section is devoted to attacking the optimal sparsifying transform learning problem by deriving an algorithm to solve (19).

3.1. The Algorithm

Note that the problem defined in (19) is nonconvex. A practical approach to such problems is to adopt the alternating minimization strategy [1618]. The basic idea is to update the dictionary and the (column) sparse matrix alternatively. The proposed algorithm is outlined below.

Initialization. Set the training data matrix , the number of iterations , and the parameters ; initiate with . (there are several ways to initialize . A simple way is to set . Other way is that for the th row () we first randomly choose columns from the training data matrix to form a submatrix , then set , where and denotes the pseudoinverse of . At last we normalize each row to unit norm. In the simulations to be presented, we choose the second approach to initialize in order to have a fair comparison with [12]).

Begin ()

Step 1. Update such that is minimized under the constraint , . This is equivalent to

Step 2. With just obtained, update by solving

Step 3. Normalize such that all the rows of are of unit length. This can be achieved by setting the diagonal with its th entry equal to the inverse of the length for the th row of for .

End. Output , the estimate of the true dictionary .

Now, let us elaborate the three steps involved in the proposed algorithm further.

The first one involves updating with (26), whose solution , as mentioned before, can be achieved directly with the product in the following way.

Begin ()(i)Determine which is equal to the absolute value of the th largest coefficients (magnitudewise) in the th column of .(ii)For , set the th element of to where denotes the th element of matrix .

End. Output . (In case where there are more than nonzero elements in , just keep any set of the first largest ones and set the others to zero.)

Now, let us consider (27) involved in Step 2. Define It can be shown that the cost function can be rewritten as where denotes the trace operator of a matrix and is used to emphasize the fact that is fixed in the minimization to be considered.

Therefore, (27) is just a special case of the following minimization problem:

Theorem 3. Let , , and three matrices independent of matrix with , and an SVD of the (symmetric) positive-definite and . Furthermore, let be an SVD of , where . A solution to (31) is then given by where with

The detailed proof can be found in Appendix.

As seen from (29), no matter what is taken the corresponding matrix in our problem is always (symmetric) positive-definite as long as and . Therefore, (27) in Step 2 of the proposed algorithm can be solved using Theorem 3.

Step 3 is to normalize the transform such that has all its rows of unit length. This is for avoiding zero rows in the transform and the scaling ambiguity [22]. Such scaling has been popularly used in sparsifying dictionary design.

Comment 4. As noted, the proposed algorithm shares the same framework as the one used in [22]. It should be pointed out that besides the fact that our algorithm is more general in the sense that it can deal with analysis dictionaries (transforms) of any dimension rather those square nonsingular ones, the main difference between the two lies in the way of how to solve (27) in Step 2. In [22], a gradient-based algorithm was used to attack (27). Such an algorithm suffers in general from slow convergence and more seriously convergence to a local minimum of the problem. Therefore, our proposed algorithm is more efficient and faster convergence is then expected.

3.2. Convergence

Consider the situation when the row normalization (i.e., Step 3) is not executed in proposed . As the solutions for the minimization involved in both Steps 1 and 2 are exact, the cost function is ensured to be decreasing. Given that the cost function is bounded from below, the proposed without Step 3 can ensure that the cost function converges. As mentioned before, the obtained transform may have zero rows during the iterative procedure. In order to avoid this phenomenon, Step 3 has to be applied when the transform is overcomplete and, in that case, the cost function is found empirically to converge as well.

As argued in [22], the convergence of cost function does not indicate that of the iterates. There have been no theoretical results achieved regarding convergence of the iterates for the proposed algorithm. However, extensive experiments showed that the proposed algorithm does demonstrate good behavior of convergence.

3.3. Computation Efficiency

A thorough comparison between the algorithm proposed in [22], denoted as , and some relevant algorithms in terms of computational cost can be found in [22]. Since our proposed algorithm shares the same framework with and is derived as an alternative of , the one recently reported in [12], we mainly focus on comparing our proposed and . For the sake of completeness, we present the computation issues of our proposed and as follows.

Our algorithm is an iterative method. In each iteration, there are three steps. They are sparse coding, dictionary updating, and the dictionary normalizing. In the sparse coding step, we simply keep the largest entries of each column in the matrix by full sorting. This, as indicated in [26], requires operations. Knowing that computing requires operations, one concludes that the sparse coding step needs roughly flops. In the dictionary update step, what we mainly do is just computing the SVD form of a matrix . Thus, this step costs approximately flops, while in the normalization step, it needs flops. Thus the total cost per iteration of the proposed algorithm is the sum of the three, which is dominated by under the assumption of .

As to , there are two steps involved in each iteration, sparse coding and dictionary updating, and the computation cost is dominated by the sparse coding step. As given in [12], the sparse coding step requires operations if the optimized backward greedy (OBG) algorithm is utilized. Therefore, the overall computation complexity scales as .

Based on the computations above, one can see that our proposed algorithm is much more efficient than . This is confirmed by experiments to be given in the next section.

3.4. Signal Denoising

For a measurement and a (row normalized) dictionary given, the clean signal is estimated in via solving the problem where is the set of row indices and is the dimension of the subspace to which the signal belongs. The OBG is used to attack such a problem and it has to be run times for each iteration in order to obtain the clear signal matrix and the final estimate of the true dictionary . That is why is very time consuming.

In , instead of the (column) sparse matrix is used. As mentioned before, the latter can be updated efficiently using (28). Taking the transform obtained using as the analysis dictionary, we can run the OBG to get the corresponding estimate of for .

Alternatively, following the same approach as in [22] we can consider the following: where the objective function is defined as with the same as defined before and constant determining the importance of the representation error.

This is simultaneous optimization of the dictionary , the clean signals , and the sparse coefficients . It is easy to understand that this problem can be addressed using alternating minimization and a similar algorithm to can be derived. In fact, such an algorithm can be obtained from with adding (i) to the initialization and (ii) a new step between Steps 1 and 2 for updating the clean signals : The above is equivalent to minimizing with respect to , a standard least squares problem.

In this paper, we focus on the OBG based denoising approach as comparisons are given between proposed and .

4. Experimental Results

In this section, we present some experiments to illustrate the performance of our proposed algorithm and compare it with that by given in [12]. For convenience, the dictionaries, obtained using these two algorithms, are denoted with and , respectively. All the experiments are performed on a computer with an Intel Core i7 CPU at 3.40 GHz, 16 GB memory, and 64-bit Windows 7 operating system.

4.1. Synthetic Experiments

In order to have a fair comparison, the setup used here is similar to [12]. Let , , and = 50,000. A matrix is generated with normally distributed entries and then , where is such a diagonal matrix that all the rows of have unit -norm. This serves as the true overcomplete analysis dictionary. Then, a set of vectors of is produced with each living in a -dimensional subspace. Each vector is obtained as follows: we first construct a submatrix by randomly choosing rows from and then compute the orthonormal basis for the null space of , which can be obtained from the SVD of . With , calculate , where each element in is randomly generated by Gaussian distribution of independent and identically distributed (i.i.d.) zero-mean and unit variance. At last, each vector is normalized so that for all . We can see that has at least zero components for all () generated in such a way.

Let denote a hard thresholding operator that keeps the largest (in magnitude) elements of a vector. We define the effective sparsity of a length- vector as Figure 1 represents the histogram of the effective sparsity level of . As can be seen, the effective sparsity level is varying from to and is most concentrated near .

The measurements are generated with where each entry of the additive noise vector is a white Gaussian noise with i.i.d. zero-mean and -variance. Define the signal-to-noise ratio (SNR) as , where denotes the mathematical expectation operator. When , dB.

We then estimate the analysis dictionary using and with (see (12)), respectively, for both noise-free case (i.e., ) and noisy case with noise level dB. In both cases, we fix and , while and for . (As we said before, has at least zero components for all . Thus, we can choose such that . It is illustrated in Figure 1 that the effective sparsity is concentrated near , and therefore we choose .) Both and Alg are initialized with a dictionary in which each row is generated as follows: we first randomly choose columns from the training data matrix to form a submatrix ; then set , where and , as defined before, denotes the pseudoinverse of ; at last we normalize each row to unit norm [12].

Convergence Behavior of . The convergence behavior of our proposed is shown in Figures 2 and 3 in terms of the objective function and the relative change of iterates which is defined as .

Comment 5. (i) The results are self-explanatory. As expected, the objective function decreases smoothly in both noise-free case and noisy case. This is confirmed in Figure 2, from which it is seen that the latter case yields larger than the noise-free case because of the larger sparsification error in the latter case.
(ii) The convergence behavior of the iterates is depicted in Figure 3. One observes that the iterates of the proposed algorithm show a very good trend to convergence in both cases.

Figures 4 and 5 show the distribution of singular values of the two analysis dictionaries and .

As observed, the ratio between the maximal and the minimal singular values of is almost the same as that of , smaller than the one of . This implies that our proposed algorithm, as expected, yields a dictionary closer to a TF than owing to the penalty term .

Performance Evaluation of the Obtained Dictionaries. The averaged representation error is defined as for the dictionary , where is the estimate of the clean signals which are available for and can be obtained using the OBG for .

The performance of a given dictionary for denoising is usually measured with the averaged denoising error defined as , where is the true (clean) signals.

Figures 6 and 7 show comparison of the two algorithms in terms of the averaged representation error and the averaged denoising error, respectively. (In Figure 7, the plots for the noise-free case are not presented as they are the same as those shown in Figure 6 due to the fact that .)

Now, let us look at how close and are to the reference dictionary .

We say a row in the true analysis dictionary is recovered in if Denote as the number of recovered rows in . The percentage of the recovered rows denoted as in is calculated with . This is one of the performance indicators used in [12] for evaluating a learnt dictionary.

With signals generated above and the same setup as the one used in [12] except instead of 100, we run each of the and and obtain a sequence of analysis dictionaries . Figure 8 shows the evolution of the percentage of the recovered rows for the two algorithms.

Comment 6. (i) It is seen that our proposed algorithm surprisingly outperforms in terms of averaged representation error, averaged denoising error, and percentage of recovered rows for both cases.
(ii) According to equation () of [12], the averaged denoising error in the oracle setup (i.e., the true dictionary and all the cosupports are known) is given by , which is equal to for this example. As observed from Figure 7, the averaged denoising error for both algorithms seems to converge to a noise variance close to . This indicates these two algorithms just yield one of the local minimal points of the problems though very close to the reference one.

Figures 911, as another example, display the results for the case of , , and and the vector living in a -dimensional subspace, leading to . Here, we set and for our proposed algorithm. The results are self-explanatory.

What is the most significant improvement of our proposed algorithm over is the implementation efficiency. This can be seen from Figure 12 which yields the relationship between the time (in seconds) used and the number of training samples for the setting: , , and . As an example, one notes that, for , requires about seconds, while our proposed algorithm needs less than seconds.

4.2. Experiments on Image Denoising

Now, let us consider image denoising for five widely used images, “Lenna,” “Barbara,” “Boats,” “House,” and “Peppers.”

For a given noisy image with noise level (which means each pixel is contaminated by a white Gaussian noise with i.i.d. zero-mean and -variance), we generate a training set consisting of = 20,000 overlapping patch examples. Then, we run and with , , , , and to learn the analysis dictionaries of size .

We apply the learnt dictionaries for patch-based image denoising. Each noised patch is denoised by using error-based OBG for sparse coding, which is the same as that in [12]. The denoising results with respect to image PSNR in dB are given in Table 1.

It is observed that denoising performance by is in general better than that by in terms of PSNR.

Figure 13 shows the visual effects of denoising on the image “Barbara” with a noise level of .

5. Conclusions

In this paper, we have investigated the problem of overcomplete transform learning for analysis model. An alternating minimization based iterative procedure has been proposed to solve the formulated optimal transform design problem. A closed-form solution has been derived for updating the transform. The superiority of the proposed approach in terms of the averaged representation and denoising errors, the percentage of the recovered rows, and, more significantly, the computation efficiency has been demonstrated with a series of experiments using both synthetic and real data.

As observed, both the AKSVD and our proposed algorithm demonstrate good convergence to a local minimum. Deeper studies along those lines [27, 28] are needed for theoretical guarantees of convergence to a global minimum and developing high performance algorithms for sparsifying dictionary and transform design. Furthermore, it is expected that the performance of the analysis KSVD algorithms in [12] can be improved with the proposed approach for updating dictionary embedded. Investigations along these lines are ongoing.

Appendix

Proof of Theorem 3

According to Laplace’s formula, where is the th minor of matrix , that is, the determinant of , matrix that results from by removing the th row and the th column, while is known as the th cofactor of . Clearly, and hence as long as .

Therefore, the gradient of with respect to is given by where is the adjugate matrix of , whose th element is given by the th cofactor of and satisfies .

It can then be shown that for with and hence The optimal should satisfy , which leads to . Equivalently, where and with the square root matrix of defined before.

Let be an SVD of and , respectively. The above equation becomes A solution to the above is with constrained by which yields (33) and hence (32) follows. This completes the proof.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work was supported by NSFC Grants 61273195, 61304124, 61473262, and 61503339.