Mathematical Problems in Engineering

Volume 2016 (2016), Article ID 5737381, 11 pages

http://dx.doi.org/10.1155/2016/5737381

## An Efficient Algorithm for Learning Dictionary under Coherence Constraint

College of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310023, China

Received 30 March 2016; Revised 8 June 2016; Accepted 23 June 2016

Academic Editor: Srdjan Stankovic

Copyright © 2016 Huang Bai et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Dictionary learning has been an active research topic for decades. Most existing learning methods train the dictionary to adapt to a particular class of signals. However, as the number of dictionary atoms is increased to represent the signals more sparsely, the coherence between the atoms becomes higher. According to greedy-pursuit and compressed sensing theory, this works against accurate sparse coding. In this paper, a novel approach is proposed to learn a dictionary that minimizes the sparse representation error on the training signals while taking the coherence into consideration. The coherence is constrained by making the Gram matrix of the desired dictionary approximate an identity matrix of proper dimension. The proposed model is handled mainly by an alternating minimization procedure in which a closed-form solution is derived at each step. A series of experiments on synthetic data and audio signals demonstrates the promising performance of the learnt incoherent dictionary and the superiority of the learning method over existing ones.

#### 1. Introduction

Sparse representation (SR) theory [1, 2] indicates that a signal can be represented by a linear combination of a few *atoms* of a prespecified *dictionary*. It is an evolving field, with state-of-the-art results in many signal processing tasks, such as coding, denoising, face recognition, deblurring, and compressed sensing [3–7].

A fundamental consideration in employing the above theory is the choice of the dictionary and this leads to the famous dictionary learning (DL) problem. DL has attracted a lot of attention since its introduction at the end of last century [8, 9]. Most of the research has been done to learn a data adaptive dictionary so that a particular class of signals can be sparsely represented in this dictionary with low approximation error.

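Under the SR framework, a signal vector $x \in \mathbb{R}^{N \times 1}$ can be expressed in the form of

$$x = \Psi s, \tag{1}$$

where $\Psi \in \mathbb{R}^{N \times L}$ is the dictionary with its columns $\{\Psi(:,l)\}$ referred to as atoms (throughout this paper, MATLAB notations are used) and $s \in \mathbb{R}^{L \times 1}$ is the corresponding sparse coefficient vector.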

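Let $s = [s(1)\ \cdots\ s(L)]^T$ with $s(l)$ being its $l$th element. The $\ell_p$-norm of vector $s$ is defined as

$$\|s\|_p = \Big(\sum_{l=1}^{L} |s(l)|^p\Big)^{1/p}. \tag{2}$$

Note that $\|\cdot\|_p$ is not a norm in a strict sense for $p < 1$. For convenience, $\|s\|_0$ is used to denote the number of nonzero elements in $s$. A vector $x$ given by (1) is said to be $K$-sparse in $\Psi$ if $\|s\|_0 \le K$.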

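Let $X = [x_1\ x_2\ \cdots\ x_J] \in \mathbb{R}^{N \times J}$ be a set of training samples from a class of signals to be considered. The basic problem of DL is to find a dictionary $\Psi$ such that, for each $x_j$, there exists a vector $s_j$ that is sparse. Such a problem has been widely investigated during the last decade or so [10–12] and can be formulated as

$$\min_{\Psi,\,S}\ \|X - \Psi S\|_F^2 + \sum_{j=1}^{J} \lambda_j \|s_j\|_0, \tag{3}$$

where $\|\cdot\|_F$ denotes the *Frobenius* norm, $\{\lambda_j\}$ are proper constants, $\|s_j\|_0$ is the sparsity of the sparse vector $s_j$, and

$$S = [s_1\ s_2\ \cdots\ s_J] \in \mathbb{R}^{L \times J}. \tag{4}$$

Such a problem is difficult to solve, as it is nonconvex in $\Psi$ and $S$, and $\|\cdot\|_0$ is nonsmooth and highly unstable. A popular approach is based on the alternating minimization strategy. A two-stage procedure is usually carried out for solving the above problem and also for avoiding the selection of the constants $\{\lambda_j\}$ [10–12]. The problem in the first stage is referred to as *sparse coding*, aiming at finding the (column) sparse matrix $S$ with a given $\Psi$; that is,

$$\min_{S}\ \|X - \Psi S\|_F^2 \quad \text{s.t.}\quad \|s_j\|_0 \le K,\ \forall j. \tag{5}$$

Note that an equivalent expression to (5), in which the constraint is a fixed sparse representation error (SRE) level, can also be formulated as

$$\min_{s_j}\ \|s_j\|_0 \quad \text{s.t.}\quad \|x_j - \Psi s_j\|_2^2 \le \epsilon,\ \forall j, \tag{6}$$

with $\epsilon$ being the error threshold. Such a problem can be solved using the orthogonal matching pursuit (OMP) based methods [13, 14]. Furthermore, it can be shown that, under suitable conditions on $\Psi$ and the sparsity level, the solution of the above problem is the same as the one of the $\ell_1$-based minimization below:

$$\min_{s_j}\ \|s_j\|_1 \quad \text{s.t.}\quad \|x_j - \Psi s_j\|_2^2 \le \epsilon,\ \forall j, \tag{7}$$

while the latter can be addressed using algorithms such as basis pursuit (BP) [15] and the $\ell_1$-based optimization techniques [16].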
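The sparse coding stage can be illustrated with a minimal OMP sketch in NumPy. This is a textbook version of the greedy procedure, not the specific implementation of [13, 14]; the function name `omp` and the variable names `Psi`, `x`, and `K` are assumptions chosen for this illustration.

```python
import numpy as np

def omp(Psi, x, K):
    """Greedy K-sparse approximation of x in the dictionary Psi (atoms = columns)."""
    L = Psi.shape[1]
    norms = np.linalg.norm(Psi, axis=0)
    residual = np.asarray(x, dtype=float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(K):
        # Select the atom most correlated with the current residual.
        corr = np.abs(Psi.T @ residual) / norms
        if support:
            corr[support] = 0.0  # never reselect an atom
        support.append(int(np.argmax(corr)))
        # Least-squares fit on the selected atoms, then refresh the residual.
        coef, *_ = np.linalg.lstsq(Psi[:, support], x, rcond=None)
        residual = x - Psi[:, support] @ coef
        if np.linalg.norm(residual) < 1e-12:
            break
    s = np.zeros(L)
    s[support] = coef
    return s
```

Applying `omp` to each column of the training matrix gives one column of the coefficient matrix at a time, which matches the column-by-column nature of the sparse coding problem.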

Many algorithms for solving (3) differ from each other mainly in the second stage, that is, *dictionary updating*. For the dictionary $\Psi \in \mathbb{R}^{N \times L}$, in order to code the signals of interest more sparsely, we usually set $N < L$, which means that $\Psi$ is overcomplete. However, this redundancy increases the pairwise similarity of dictionary atoms. According to the work in [13], such similarity has a direct influence on the dictionary's performance, especially on the accuracy of the sparse coding stage. If any two atoms degenerate to the same vector, this will lead to overfitting to the training data. Thus, an incoherent dictionary is expected to improve the performance of the SR model.

Yaghoobi et al. proposed a design method for parametric dictionaries [17]. The authors attempted to optimize the dictionary so that the corresponding Gram matrix approximates the Gram matrix of an equiangular tight frame (ETF), which possesses good coherence behavior. However, this method relies heavily on a priori knowledge of an appropriate parameter selection criterion that is tied to a given class of signals. A new algorithm named INK-SVD was developed in [18]. In each iteration of the K-SVD algorithm [11], the dictionary updating stage is followed by an additional decorrelation step: each pair of atoms whose coherence exceeds the threshold has its inner angle increased symmetrically so as to reduce the coherence. However, this procedure implicitly destroys the original SR result from the K-SVD algorithm. To compensate for this problem, the authors of [19] improved the work of [18] by incorporating a new decorrelation step (also related to the ETF because of its low coherence) and a dictionary rotation operation into the update stage. In [20], a weighting model was formulated to balance the coherence of the dictionary against the sparse representation ability, and a gradient-based method was used to solve the corresponding problem.

The main objective of this paper is to propose a new incoherent dictionary learning (IDL) method that constrains the coherence of the dictionary while minimizing the SRE. The contributions are threefold:

(i) A novel model is proposed for learning the incoherent dictionary. The main contribution lies in the dictionary updating procedure: when the SRE, that is, $\|X - \Psi S\|_F^2$, is minimized, the dictionary $\Psi$ is kept under the coherence constraint by making the corresponding Gram matrix approximate an identity matrix of proper dimension.

(ii) An iterative algorithm that updates the sparse coefficients and the components of the dictionary alternately is put forward to solve the design problem. In every step of dictionary updating, the solution of each component of the dictionary is derived analytically.

(iii) A series of experiments on synthetic data and audio signals is carried out to demonstrate the performance of each compared algorithm.

The remainder of this paper is arranged as follows. In Section 2, some preliminaries are provided and the main issue of learning incoherent dictionary is also formulated in this part. The algorithm proposed for addressing the corresponding design problem is investigated in Section 3. Simulations are carried out in Section 4 to examine the performance of the proposed algorithm and to compare with the existing ones. Some concluding remarks are given in Section 5.

#### 2. Preliminaries and Problem Formulation

In this section, some preliminaries are introduced and the two main methods used for comparison in this paper are reviewed in detail. Based on these, we formulate the problem of incoherent dictionary learning, with the purpose of improving the approximation performance of the dictionary for a particular class of signals under the coherence constraint.

The most fundamental quantity associated with a dictionary is the *mutual coherence* (MC) [21]. MC indicates the degree of similarity between different dictionary columns. It equals the maximum absolute normalized inner product between two distinct atoms:

$$\mu(\Psi) = \max_{1 \le i \ne j \le L} \frac{|\Psi(:,i)^T \Psi(:,j)|}{\|\Psi(:,i)\|_2\,\|\Psi(:,j)\|_2}, \tag{8}$$

where $^T$ denotes the transpose operator. As shown in [21], a $K$-sparse signal generated according to (1) can be exactly recovered with OMP as long as

$$K < \frac{1}{2}\left(1 + \frac{1}{\mu(\Psi)}\right). \tag{9}$$

Roughly speaking, MC measures how much two atoms can look alike. Equation (9) is just a worst-case bound and only reflects the most extreme correlations in the dictionary. Nevertheless, MC is easy to manipulate and it captures well the behavior of some dictionaries. Generally, a dictionary is called incoherent if the corresponding MC is small [18, 19]. Besides, as pointed out in [19], the coherence of a dictionary is related to the condition number of its subdictionaries. This implies that achieving a low MC value results in well-conditioned subdictionaries.

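Define the Gram matrix of $\Psi$ as

$$G = \Psi^T \Psi. \tag{10}$$

It is common to study MC in (8) via the Gram matrix. Let $D$ be the diagonal matrix whose $l$th diagonal element is given by $1/\sqrt{G(l,l)}$ for $l = 1, \ldots, L$. The Gram matrix of $\Psi D$, denoted as $\bar{G}$, is then normalized, such that $\bar{G}(l,l) = 1$, $\forall l$. Obviously, $\mu(\Psi) = \max_{i \ne j} |\bar{G}(i,j)|$.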
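A minimal NumPy sketch of computing MC through the normalized Gram matrix, assuming the dictionary is stored as an array `Psi` with atoms as columns (the function name `mutual_coherence` is illustrative and not taken from the paper):

```python
import numpy as np

def mutual_coherence(Psi):
    """Largest absolute inner product between two distinct normalized atoms."""
    # Normalize each atom to unit l2-norm, then form the Gram matrix.
    Psi_n = Psi / np.linalg.norm(Psi, axis=0, keepdims=True)
    G = Psi_n.T @ Psi_n
    # The diagonal is all ones; MC is the largest off-diagonal magnitude.
    np.fill_diagonal(G, 0.0)
    return float(np.max(np.abs(G)))
```

For an orthonormal basis the function returns zero, and it approaches one as two atoms become parallel.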

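For $\Psi \in \mathbb{R}^{N \times L}$ with $N < L$, it has been shown in [22] that $\mu(\Psi)$ is bounded with

$$\underline{\mu} \le \mu(\Psi) \le 1, \tag{11}$$

with $\underline{\mu} = \sqrt{\dfrac{L-N}{N(L-1)}}$ being the Welch bound. If the magnitude of each inner product between distinct normalized atoms meets this bound, the dictionary is called an ETF. An ETF has very favorable MC behavior and has been used in optimal dictionary design [17, 19].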
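As a small worked example of the Welch bound, the sketch below evaluates $\sqrt{(L-N)/(N(L-1))}$ and checks it on the three-vector "Mercedes-Benz" frame in the plane, a standard ETF example that is not discussed in the paper. The names `welch_bound`, `Psi`, and `off_diag` are assumptions made for this illustration.

```python
import numpy as np

def welch_bound(N, L):
    """Lower bound on the mutual coherence of an N x L dictionary with N < L."""
    return np.sqrt((L - N) / (N * (L - 1)))

# Mercedes-Benz frame: three unit vectors in the plane, 120 degrees apart.
angles = np.pi / 2 + 2 * np.pi * np.arange(3) / 3
Psi = np.vstack([np.cos(angles), np.sin(angles)])  # shape (2, 3)

# Every off-diagonal Gram entry has the same magnitude, equal to the bound.
G = Psi.T @ Psi
off_diag = np.abs(G[~np.eye(3, dtype=bool)])
```

Here all entries of `off_diag` coincide with `welch_bound(2, 3)`, which is the defining property of an ETF.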

##### 2.1. Related Works

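It is worth noting that ETFs only exist for matrices whose dimensions satisfy $L \le N(N+1)/2$ if the atoms are real. So, one usually replaces the set of ETF Grams with a relaxed version [17, 19] that is defined as

$$\mathcal{H}_\xi = \left\{ H \in \mathbb{R}^{L \times L} : H = H^T,\ H(i,i) = 1,\ \max_{i \ne j} |H(i,j)| \le \xi \right\}, \tag{12}$$

where $\xi > 0$ is a constant to control the searching space. Clearly, when $\xi \ge \underline{\mu}$ (the Welch bound), $\mathcal{H}_\xi$ contains all the ETF Grams.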

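Besides the space $\mathcal{H}_\xi$, the authors of [19] define a spectral constraint set as

$$\mathcal{F} = \left\{ F \in \mathbb{R}^{L \times L} : F = F^T,\ \operatorname{eig}(F) \ge 0,\ \operatorname{rank}(F) \le N \right\}. \tag{13}$$

Here $\operatorname{eig}(\cdot)$ returns the vector of eigenvalues and $\operatorname{rank}(\cdot)$ is the rank operator. The algorithm for learning an incoherent dictionary proposed in [19] can be outlined as follows:

(i) *Sparse coding* with OMP.

(ii) *Dictionary updating* employing K-SVD.

(iii) *Atoms decorrelation* through an iterative projection procedure.

(iv) *Dictionary rotation* to minimize the approximation error while keeping the MC unchanged.

The main contributions of [19] lie in the last two steps. The atoms decorrelation is executed by iteratively projecting the Gram matrix of the output dictionary of K-SVD between the sets $\mathcal{H}_\xi$ and $\mathcal{F}$ until a stopping criterion is met. With the singular value decomposition (SVD) of the resulting positive semidefinite Gram matrix $G$ being expressed as

$$G = Q \Lambda Q^T, \tag{14}$$

where $Q$ is orthonormal and $\Lambda$ is the diagonal singular value matrix with all its elements being nonnegative, the incoherent dictionary can be obtained as

$$\Psi = W \left[ \Lambda_N^{1/2}\ \ 0 \right] Q^T, \tag{15}$$

with $\Lambda_N$ denoting the leading $N \times N$ block of $\Lambda$ and $W$ being an *arbitrary* $N \times N$ orthonormal matrix. Finally, the authors use this degree of freedom to further reduce the SRE by solving

$$\min_{W \in \mathcal{O}}\ \left\| X - W \left[ \Lambda_N^{1/2}\ \ 0 \right] Q^T S \right\|_F^2, \tag{16}$$

where $\mathcal{O}$ is the set of $N \times N$ orthonormal matrices. This is the rotation procedure.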

*Remark 1.* Compared with the decorrelation operation in [18], the above-mentioned atoms decorrelation can achieve a much smaller MC value. Besides, the additional rotation procedure can slightly restore the SR ability. However, the approximation performance of the dictionary is severely damaged by the iterative projections. Although the dictionary rotation procedure is carried out for compensation, the effect of this single degree of freedom on the SR ability is quite limited.

In [20], the authors consider another strategy for IDL, where the dictionary's coherence is minimized along with the SRE. The cost function can be expressed as

$$f(\Psi) = \|X - \Psi S\|_F^2 + \lambda \|\Psi^T \Psi - I_L\|_F^2, \tag{17}$$

with $I_L$ denoting the identity matrix of dimension $L$. It is clear that $I_L$ is the simplest ETF Gram (with $\mu = 0$). The Lagrange multiplier $\lambda$ controls the trade-off between minimizing the SRE and minimizing the dictionary's coherence. With the gradient of $f$ being calculated as

$$\nabla_{\Psi} f = 2(\Psi S - X) S^T + 4 \lambda \Psi (\Psi^T \Psi - I_L), \tag{18}$$

the update of the dictionary is then executed by the steepest descent algorithm [20].
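The gradient of a cost of this type can be checked numerically. The sketch below assumes the cost has the form $\|X - \Psi S\|_F^2 + \lambda\|\Psi^T\Psi - I\|_F^2$ and compares its analytical gradient with central finite differences. All function and variable names (`cost`, `grad`, `numerical_grad`, `lam`) are illustrative, and the random test data are not from the paper.

```python
import numpy as np

def cost(Psi, X, S, lam):
    """Sparse representation error plus a Gram-to-identity penalty."""
    L = Psi.shape[1]
    sre = np.linalg.norm(X - Psi @ S, "fro") ** 2
    penalty = np.linalg.norm(Psi.T @ Psi - np.eye(L), "fro") ** 2
    return sre + lam * penalty

def grad(Psi, X, S, lam):
    """Analytical gradient of cost() with respect to Psi."""
    L = Psi.shape[1]
    return 2 * (Psi @ S - X) @ S.T + 4 * lam * Psi @ (Psi.T @ Psi - np.eye(L))

def numerical_grad(Psi, X, S, lam, h=1e-6):
    """Central finite differences, one entry of Psi at a time."""
    G = np.zeros_like(Psi)
    for idx in np.ndindex(*Psi.shape):
        E = np.zeros_like(Psi)
        E[idx] = h
        G[idx] = (cost(Psi + E, X, S, lam) - cost(Psi - E, X, S, lam)) / (2 * h)
    return G

rng = np.random.default_rng(0)
N, L, J = 4, 6, 10
Psi = rng.standard_normal((N, L))
S = rng.standard_normal((L, J))
X = rng.standard_normal((N, J))
max_err = np.max(np.abs(grad(Psi, X, S, 0.5) - numerical_grad(Psi, X, S, 0.5)))
```

A steepest descent step would then be `Psi - step * grad(Psi, X, S, lam)` for some step size `step`, whose choice is not specified here.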

*Remark 2.* (i) The choice of the trade-off parameter $\lambda$ remains open-ended, and no selection criterion is introduced in [20]. According to the simulation results of [20], a larger $\lambda$ yields better performance.

(ii) It is well known that gradient-based algorithms may easily fall into a local minimum if the initialization is not properly set [23, 24]. Since a gradient-based method is used to solve (17), there is room to improve both its efficiency and its accuracy.

##### 2.2. Problem Formulation

Let $\Psi \in \mathbb{R}^{N \times L}$ be the dictionary, let $X \in \mathbb{R}^{N \times J}$ be the signal set, and let $S \in \mathbb{R}^{L \times J}$ be the corresponding sparse coefficient matrix in $\Psi$, with column sparsity $K$ as defined previously. For the problem indicated in (3), we update $S$ and $\Psi$ alternately. For a fixed $\Psi$, $S$ can be calculated by the greedy algorithms or the $\ell_1$-based convex optimization methods. In the following, we focus our discussion on the dictionary updating stage. For the traditional case, that is,

$$\min_{\Psi}\ \|X - \Psi S\|_F^2, \tag{19}$$

the authors of [10] simply update the dictionary as $\Psi = X S^T (S S^T)^{-1}$. But when $S S^T$ is not full rank, this method fails to work. The K-SVD algorithm [11] minimizes (19) for each atom separately. When updating the dictionary, the coefficients are also renewed simultaneously. Every iteration between coefficients and dictionary requires $L$ SVD operations. It is therefore a time-consuming algorithm, and it is not well suited to enforcing a coherence constraint, which is important for the implementation of sparse coding.
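The least-squares dictionary update attributed to [10] can be sketched as follows. The function name `mod_update` is hypothetical, and the use of the pseudoinverse is a substitution made here for illustration: it coincides with $X S^T (S S^T)^{-1}$ when $S S^T$ is invertible and still returns a minimum-norm least-squares solution when it is not.

```python
import numpy as np

def mod_update(X, S):
    """Least-squares dictionary for fixed coefficients: argmin_Psi ||X - Psi S||_F."""
    # pinv(S) equals S^T (S S^T)^{-1} whenever S has full row rank.
    return X @ np.linalg.pinv(S)

rng = np.random.default_rng(0)
N, L, J = 4, 6, 20
Psi_true = rng.standard_normal((N, L))
S = rng.standard_normal((L, J))       # dense here, so full row rank
X = Psi_true @ S
Psi_hat = mod_update(X, S)
Psi_explicit = X @ S.T @ np.linalg.inv(S @ S.T)
```

In this noiseless, full-rank setting both `Psi_hat` and `Psi_explicit` recover `Psi_true` up to numerical precision. Note that neither version controls the coherence of the result.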

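As an ETF can achieve a small MC value, this motivates us to design a dictionary that is as close as possible to an ETF [17–20]. So the following constrained model is proposed:

$$\min_{\Psi,\,S}\ \|X - \Psi S\|_F^2 \quad \text{s.t.}\quad \Psi \in \arg\min_{\tilde{\Psi}} \|\tilde{\Psi}^T \tilde{\Psi} - I_L\|_F^2,\quad \|s_j\|_0 \le K,\ \forall j. \tag{20}$$

The closed-form solution set of $\min_{\tilde{\Psi}} \|\tilde{\Psi}^T \tilde{\Psi} - I_L\|_F^2$ has been derived in [7, 24] as

$$\Psi = U \left[ I_N\ \ 0 \right] V^T, \tag{21}$$

where $U$ and $V$ are both *arbitrary* orthonormal matrices of dimensions $N \times N$ and $L \times L$, respectively. So (20) can be rewritten as

$$\min_{U,\,V,\,S}\ \left\| X - U \left[ I_N\ \ 0 \right] V^T S \right\|_F^2 \quad \text{s.t.}\quad U^T U = I_N,\ V^T V = I_L,\ \|s_j\|_0 \le K,\ \forall j. \tag{22}$$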
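To make the structure of such a solution concrete, the sketch below builds a dictionary of the form $U[I_N\ 0]V^T$ from two random orthonormal factors and inspects its singular values and its Gram matrix. The helper name `dictionary_from_factors` and the random factors are assumptions for illustration only. The value $L - N$ for the squared distance to the identity follows from the Gram matrix having exactly $N$ unit eigenvalues and $L - N$ zero eigenvalues.

```python
import numpy as np

def dictionary_from_factors(U, V):
    """Form U [I_N 0] V^T from orthonormal U (N x N) and V (L x L)."""
    N, L = U.shape[0], V.shape[0]
    E = np.hstack([np.eye(N), np.zeros((N, L - N))])  # [I_N 0]
    return U @ E @ V.T

rng = np.random.default_rng(0)
N, L = 4, 7
U, _ = np.linalg.qr(rng.standard_normal((N, N)))  # random orthonormal factor
V, _ = np.linalg.qr(rng.standard_normal((L, L)))
Psi = dictionary_from_factors(U, V)

# All N singular values equal one, so the spectrum is perfectly flat.
singular_values = np.linalg.svd(Psi, compute_uv=False)
# Squared Frobenius distance from the Gram matrix to the identity.
gram_gap = np.linalg.norm(Psi.T @ Psi - np.eye(L), "fro") ** 2
```

The atoms of such a dictionary are generally not unit-norm, so its MC should be evaluated on the normalized Gram matrix.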

*Remark 3.* (i) Here we choose the identity matrix as the target Gram for the following reasons: it is easy to handle (avoiding the iterative projection between the relaxed ETF Gram set $\mathcal{H}_\xi$ and the spectral constraint set $\mathcal{F}$ carried out in [19]), and expression (21) contains more degrees of freedom than (15) for further minimizing the SRE.

(ii) As pointed out in [20], a flatter singular value spectrum of the dictionary indicates a less coherent dictionary. Our design strategy operates under constraint (21), which means that the nonzero singular values of our designed dictionary are all equal (the same as (17) with $\lambda \to \infty$). Hence, better coherence performance can be expected.

#### 3. Coherence Constrained Dictionary Learning

In this section, an alternating minimization algorithm is developed to address the dictionary learning problem (22).

##### 3.1. Algorithm for IDL

To solve the above multivariate problem and also avoid the selection of the weighting constants $\{\lambda_j\}$ in (3), the alternating minimization strategy introduced for addressing (3) seems a natural choice. The pseudocode of the proposed algorithm (named CCDL, standing for Coherence Constrained Dictionary Learning) is summarized as follows:

*Initialization* : initial random dictionary; : training data; : number of iterations between sparse coding and dictionary updating; : number of iterations between updating and .Calculate the SVD of as and set .

For , do the following.

*Step 1. *Set , and update column by column by solving with an OMP based algorithm and get the approximate solution .

Set and and .

*Step 2. *While , do the following:(a)For fixed , update by solving The analytical solution will be given in the next subsection.(b)For fixed , update by solving The solution will be derived in the next subsection.(c)Return to Step 2 with .

*Step 3. *Set and . End for if .

*End*. Output .

##### 3.2. Update the Components of Dictionary

Now, let us focus on solving (24) and (25). For convenience, we omit the superscript in the expressions. As the sparse coefficient matrix is assumed to be fixed in Step 2, we can rewrite the cost function of (22) as where and can be updated alternately. Let have the following SVD (for arbitrary matrix , the general SVD form can be expressed as ):Definewith . We then have two alternative expressions for : Assume that and be given. In what follows, we derive a procedure for updating such that

###### 3.2.1. Update

First of all, considerThis model can be solved by the following theorem [19].

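Theorem 4. *For both $A$ and $B$ belonging to $\mathbb{R}^{n \times m}$, the solution of $\min_{U:\,U^T U = I_n} \|A - U B\|_F^2$ can be characterized as $U = P Q^T$, where $P$ and $Q$ are both orthonormal matrices given by the following SVD: $A B^T = P \Sigma Q^T$.*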
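A result of this kind is the classical orthogonal Procrustes solution. The sketch below, with generic matrices `A` and `B` and a hypothetical function name `procrustes`, computes the orthonormal matrix that minimizes $\|A - UB\|_F$ from the SVD of $AB^T$. It illustrates the theorem in isolation and is not the full dictionary update of the paper.

```python
import numpy as np

def procrustes(A, B):
    """Orthonormal U minimizing ||A - U B||_F, obtained from the SVD of A B^T."""
    P, _, Qt = np.linalg.svd(A @ B.T)
    return P @ Qt

rng = np.random.default_rng(0)
n, m = 4, 9
U_true, _ = np.linalg.qr(rng.standard_normal((n, n)))  # ground-truth rotation
B = rng.standard_normal((n, m))
A = U_true @ B
U_hat = procrustes(A, B)
```

Because `A` is an exact rotation of `B` and `B` has full row rank, `U_hat` matches `U_true` up to numerical precision. With noisy data it returns the best orthonormal fit in the Frobenius sense.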

LetThe solution to (31) can be derived as

###### 3.2.2. Update

Now, let us considerObviously, (30) is satisfied with and obtained as the solutions of (31) and (37), respectively.

Note that . DefineIt is clear that such a function has the following properties:where is defined in (28).

Denotewhere each with is constructed using It follows from (38) and Theorem 4 that the solution to (41), as understood, is given bywhereIt turns out from (39) and (40) that while (41) indicates thatThis implies that constructing using (42) and hence makes a decreasing sequence and, therefore, the solution to (37) can be estimated with

*Remark 5. *(i) It may be possible that , but is always true. Therefore, can be updated with .

(ii) For the whole CCDL, there actually exist three loops (indexed by , , and , resp.). For the loop indexed by , (44) and (45) indicate that the procedure of updating makes decrease as increases. So the solution (or an approximate one) of (37) can be gotten. Besides, the solution of (31) is derived analytically as (36). All these result in the convergence of the second loop indexed by , that is, dictionary updating. Assuming that OMP performs perfectly in the sparse coding stage, the nonincreasing trend of Step 1 is ensured. To sum up, the cost function (22) decreases in every step and hence the convergence of CCDL is guaranteed.

#### 4. Experiment Results

In this section, we evaluate the performance of the proposed model and algorithm with synthetic data and audio signals.

##### 4.1. Convergence Performance

Firstly, several simulations will be carried out to verify the convergence performance of the proposed CCDL. As the main contributions of the new method lie in the dictionary updating stage, we focus on the performance of designing the orthonormal matrices and , that is, solving (26).

Set , , and number of signals . There exist two loops in dictionary updating as introduced in the second point of Remark 5 indexed by and , respectively. The maximum iteration numbers for and are both fixed to .

###### 4.1.1. For Synthetic Dictionary

is taken as a Gaussian random matrix. Two orthonormal matrices and are generated to form the authentic dictionary :Then is produced as . The performance is evaluated bywith being the learnt dictionary.

Starting from an initial random dictionary, Figures 1 and 2 show the convergence performance of the loops indexed by (with ) and , respectively.