Abstract

Sparse models have a wide range of applications in machine learning and computer vision. Using a learned dictionary instead of an "off-the-shelf" one can dramatically improve performance on a particular dataset. However, learning a new dictionary for each subdataset (subject) at a fine granularity may be unwarranted or impractical, due to the restricted availability of subdataset samples and the tremendous number of subjects. To remedy this, we consider the dictionary customization problem, that is, specializing an existing global dictionary corresponding to the total dataset with the aid of auxiliary samples obtained from the target subdataset. Motivated by empirical observation and supported by theoretical analysis, we employ a regularizer that penalizes the difference between the global and the customized dictionary. By minimizing the sum of the reconstruction errors and the above regularizer under sparsity constraints, we exploit the characteristics of the target subdataset contained in the auxiliary samples while maintaining the basic sketches stored in the global dictionary. An efficient algorithm is presented and validated with experiments on real-world data.

1. Introduction

Sparse models are of great interest in machine learning and computer vision, owing to their applications in image denoising [1], face recognition [2–4], traffic sign recognition [5], visual-tactile fusion [6, 7], and so forth. In sparse coding, samples or signals are represented as sparse linear combinations of the column vectors (called atoms) of a redundant dictionary. This dictionary can be a predefined one, such as the DCT bases and wavelets [8], or a learned one based on a specific task or dataset of interest.

With sufficient samples, learning a specialized dictionary instead of using an "off-the-shelf" one has been shown to dramatically improve performance. Generally, the dictionary and the coefficients are estimated by minimizing the sum of least squared errors under a sparsity constraint. Batch algorithms such as MOD [9] and K-SVD [10] and nonparametric Bayesian methods [11] have shown state-of-the-art performance. Further, Mairal et al. [12] developed an online approach to handle large numbers of samples.
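In generic notation (the symbols here are ours, not the paper's), this standard formulation reads
\[
\min_{D \in \mathcal{D},\, X} \; \|Y - D X\|_F^2 \quad \text{s.t.} \quad \|x_i\|_0 \le T_0 \;\; \text{for all } i,
\]
where $Y$ collects the samples as columns, $\mathcal{D}$ is the set of dictionaries with unit-norm atoms, $X$ is the coefficient matrix with columns $x_i$, and $T_0$ is the sparsity level.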

Recently, theoretical analysis of sparse dictionary learning has attracted much attention. Schnass [13] presented theoretical results for the dictionary identification problem. Sample complexity has been estimated in [14, 15]. Gribonval et al. [16] analyzed the local minima of dictionary learning. Moreover, to extend its capacity, dictionary learning with specific motivations [17–19] has also attracted considerable interest. For instance, robust face recognition [3] is dedicated to a particular application, and Hawe et al. [20] require the dictionary to have a separable structure. While a learned dictionary performs well on a given dataset, attaining further specialized dictionaries for subdatasets at a finer granularity is an interesting and useful concept as well. For instance, given a dictionary corresponding to facial images of all humans, we want to obtain a customized dictionary for each particular individual. However, in this case, standard dictionary learning approaches may be unwarranted or impractical: on one hand, samples for a particular individual (subject) are restricted and insufficient in most cases; on the other hand, even with enough data, learning so many dictionaries becomes inefficient in computation and storage. Further examples include customizing handwriting models to different writing styles, matching flower images to various species, or matching paper corpora to specific proceedings.

In terms of classification tasks, approaches such as Yang et al. [18] and Ma et al. [2] learn a structured dictionary which consists of subdictionaries representing different subjects. However, they are often infeasible: firstly, as a part of the global dictionary, the coding performance of a subdictionary is always worse than that of the global one. Secondly, the subdictionaries for all subjects must be learned together, which becomes inflexible and demanding for a huge number of subjects. Thirdly, once the global dictionary is obtained, specialization for a newly added subdataset would be impossible.

In this paper, we look for an effective, economic, and flexible dictionary customization approach, which is expected to have the following characteristics:
(i) We specialize an existing global dictionary by utilizing auxiliary samples obtained from the target subdataset, which remains valid for finer granularity and a small quantity of examples (hence less computation).
(ii) Compared with the global one, the customized dictionary has the same size but smaller reconstruction errors and better representation of the target subdataset.
(iii) The customization for each subdataset is independent; thus we can customize an arbitrary number of subdatasets or attain a particular one alone.

As depicted in Figure 1, we first observed that the corresponding atoms of the global dictionary and of the dictionaries for particular subjects often look "similar." This is reasonable, as the dictionary atoms describe the sketches of the object and the basic shapes of all the subjects are consistent. For a more rigorous theoretical analysis, we further considered dictionary identifiability [13] for mixed bounded signal models, that is, signals generated from more than one source (reference dictionary), and we proved that if the reference dictionaries are close in the sense of the Frobenius norm, the global dictionary learned from the mixed signals is close to each of them. In fact, the global dictionary captures the common basic shapes of all the subdatasets, regarding the characteristics of individual subjects as noise and discarding them.

Thus, formulating the dictionary customization problem, we introduced a regularizer penalizing the difference between the global and the customized dictionary. By minimizing the sum of the reconstruction error and the above regularizer under sparsity constraints, we exploit the characteristics of the target subdataset contained in the auxiliary samples while maintaining the basic shapes stored in the global dictionary. As a result, a better dictionary for the target subdataset, which remains close to the global one, is obtained. The solution is an asymptotically unbiased estimate of the underlying dictionary and can be seen as a trade-off between learning a new dictionary from data and using an existing one.

To minimize the objective function, we consider a general strategy analogous to that of dictionary learning, that is, coding the samples and updating the atoms alternately in each iteration. Further, we present an algorithm that shares its core idea with K-SVD [10], which we call C-Ksvd. The flow chart of our method is shown in Figure 2. Experiments on tasks such as denoising and superresolution illustrate that our approach handles the customization problem effectively and efficiently, outperforming both the global dictionary and the standard dictionary learning approach. In addition, our model is also promising for further tasks such as enhancing an insufficiently learned dictionary.

2. Notations

Throughout this paper, we write matrices as uppercase letters and vectors as lowercase letters. Given $p \ge 1$, the $\ell_p$-norm of a vector $x$ is defined as $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$. In particular, the $\ell_0$-norm $\|x\|_0$ counts the nonzero entries of $x$. Let $\operatorname{sign}(x)$ denote the vector whose $i$th entry is equal to zero if $x_i = 0$ and to one (resp., minus one) if $x_i > 0$ (resp., $x_i < 0$).

The Frobenius norm of a matrix $A$ is denoted $\|A\|_F$ and its matrix $\ell_{1,1}$-norm $\|A\|_{1,1} = \sum_j \|a_j\|_1$. Define the operator norm $\|A\|_{1 \to 2} = \max_j \|a_j\|_2$, where $a_j$ denotes the $j$th column vector of $A$.

3. Dictionary Learning with Mixed Signals

Dictionary identifiability [13], that is, recovering a reference dictionary that is assumed to generate the observed signals, is important for the interpretation of the learned atoms. In particular, Gribonval et al. [16] proved that the loss function of dictionary learning admits a local minimum in the neighborhood of the dictionary generating the signals.

In this section, we consider that there are multiple reference dictionaries and that the signals generated from them are mixed. Further, we prove that if reference dictionaries are close to each other in the sense of the Frobenius norm, dictionary learning with mixed signals admits a local minimum near both reference dictionaries simultaneously.

Without loss of generality, we analyze the case of two signal sources $S_1$ and $S_2$. In particular, for the signal source $S_i$ ($i = 1$ or $2$), assume its signals are generated by the model
\[
y = D_i x + \varepsilon, \tag{2}
\]
where $D_i$ is the reference dictionary of $S_i$, $x$ is the coefficient vector, and $\varepsilon$ is the noise.

In particular, the coefficient $x$ is drawn on an index set $J$ such that $x$ is zero outside $J$ and its restriction to $J$ is a random vector. Assume that $x$ and $\varepsilon$ satisfy the following assumptions, similar to [15].

Assumption 1 (basic and bounded signal assumption). There exist random variables , , values , , and , such that

Remark 2. Almost all sparse signal models, such as sparse Gaussians and Laplacians, satisfy the first five formulas, so the assumption can be seen as an abstraction and generalization of the basic sparse signal model.

Further, the additional assumptions that the signal is upper- and lower-bounded are standard and mainly used to keep the analysis simple and clear [15]. In practice, as digital data are gathered by sensors with limited dynamic range and stored in floating-point format with limited precision, the boundedness assumption seems reasonably relevant.

The index set $J$ is called the support of $x$, and the sparsity is defined as the number of elements in $J$. Thus the signal model is parameterized by the sparsity, the expected coefficient energy, the minimum coefficient magnitude, the maximum norm, and the flatness.

Note that these assumptions can easily be generalized to the multiple-source case, and thus we have the following definition.

Definition 3 (mixed bounded signal source). A mixed signal source $S$ is defined as the union of several signal sources $S_1, \dots, S_m$; that is, $S = \bigcup_{i=1}^{m} S_i$, where each source $S_i$ generates its signals in the way described in (2). Further, if all the $S_i$ satisfy the basic and bounded signal assumptions (3) simultaneously, we say that $S$ is a mixed bounded signal source or satisfies a mixed bounded signal model.

Further, for the two-source case, assume that $D_1$ and $D_2$ are close in the sense of the Frobenius distance; that is, there is a small $\tau > 0$ such that $\|D_1 - D_2\|_F \le \tau$. (As discussed in [15], a dictionary is invariant under sign flips and permutations of its atoms, and we simply assume the atoms have been aligned to attain the minimum distance.) Denoting by $d_j$ the $j$th column of $D$, the cumulative coherence of a dictionary $D$ is defined as
\[
\mu_k(D) = \max_{|J| \le k} \; \max_{j \notin J} \; \sum_{i \in J} |\langle d_i, d_j \rangle| .
\]

The term $\mu_k(D)$ measures the level of correlation between columns of $D$. Moreover, the lower restricted isometry constant of a dictionary $D$, denoted $\delta_k(D)$, is the smallest number such that, for any vector $z$ supported on at most $k$ indices,
\[
(1 - \delta_k(D)) \, \|z\|_2^2 \le \|D z\|_2^2 .
\]

Recall that, for a set of signals $Y$, the loss function of dictionary learning is
\[
F_Y(D) = \frac{1}{n} \sum_{i=1}^{n} \min_{x} \Big( \frac{1}{2} \|y_i - D x\|_2^2 + g(x) \Big),
\]
where $g$ is a penalty function promoting sparsity. Now consider a set of mixed signals $Y = Y_1 \cup Y_2$, where $Y_1$ is drawn from $S_1$ and $Y_2$ from $S_2$; dictionary learning can then be formulated as
\[
\min_{D \in \mathcal{D}} \; F_Y(D),
\]
where $\mathcal{D}$ denotes the set of dictionaries with unit-norm atoms. Further, we have the following asymptotic result.

Theorem 4. Let $S$ be a mixed bounded signal source as described above, consisting of two signal sources $S_1$ and $S_2$. Without loss of generality, assume that the cumulative coherence and the sparsity level satisfy

Further, we define And assume . Define . Moreover, let with a regularization parameter and denote , , . Then there exists a radius which satisfies , , and such that the expectation of the function admits a local minimum that ,

Let us consider the assumptions in Theorem 4 in more detail.
(i) The first conditions assume upper bounds on the correlation level between columns of the dictionary and on the sparsity. This is common in the analysis of sparse learning [21].
(ii) The second condition is satisfied for a small dictionary distance; the smaller the distance is, the larger the admissible range becomes.
(iii) The third condition imposes an upper limit on admissible regularization parameters. Note that limits on regularization parameters are also frequent [22].
(iv) The fourth condition restricts the noise level. In particular, the noiseless situation is a special case. Besides, the required bound is particularly small; for example, a noise level of 30 dB already makes it very small.
(v) The remaining conditions can be rewritten so that they are satisfied for a small dictionary distance, in line with the assumption that the two reference dictionaries are close.

To conclude, the assumptions hold for a small cumulative coherence, sparsity, noise level, and dictionary distance, together with a suitably chosen regularization parameter.

Remark 5. The radius is lower-bounded by the quantities defined above. When the dictionary distance is fixed and the sparsity is particularly small, the corresponding term is very small as well and the lower bound of the radius is close to the dictionary distance. When the sparsity is fixed and the dictionary distance tends to zero, that is, when the mixed signal model degenerates into the single-source case, the condition always holds and Theorem 4 degenerates into the case in [15], implying that the discussion in [15] can be seen as a special case of ours.

Moreover, the upper bound of the radius is implied to be less than 0.15, which can be concluded by a discussion similar to [15].

Remark 6. Theorem 4 can easily be generalized to the multiple-source case by considering the corresponding loss, and the proof is similar.

Proof. Define the closed ball around a dictionary $D$ with radius $r$ as $\mathcal{B}(D; r) = \{D' \in \mathcal{D} : \|D' - D\|_F \le r\}$.

Now consider and , as and , and the two balls and have intersection with contained. Denote as the boundary of .

Further, for a set of samples and two dictionaries , define

Note that ; then we have

When the penalty is as above, the loss function is Lipschitz continuous with respect to the Frobenius metric on the compact constraint set [16]. Thus, by choosing a radius such that the stated condition holds, the compactness of the closed set implies the existence of a local minimum of the loss in the corresponding ball. Now let us bound each term of (18).

First note that, under the stated assumptions, by the proof of Theorem 1 in [15], for any admissible radius and any dictionary satisfying the corresponding condition, we have

Further, the bound is monotonically increasing on the relevant range, which yields the desired estimate for the first term. For the second term, we obtain the same lower bound similarly.

Moreover, for the dictionary and any coefficient with sparsity , we have

Then, by Theorem 2 and Lemma 6 in [16], when the penalty $g$ is the $\ell_1$ norm, we have the corresponding bound. Thus

Assume is a sample in and its sparse coefficient and noise coefficient are and . As , we have taking expectation on each side of it, by assumptions in (2), as and , then thus . Taking expectation on each side of (23), we have

By (18), (20), and (26), as long as we have which means that admits a local minimum in ; that is, , .

The result is reasonable: when the reference dictionaries are similar, the dictionary learned from the mixed signals should be similar to each of them in order to achieve smaller reconstruction errors for each subdataset and hence a lower total loss.

4. The Regularizer and Dictionary Customization Problem

Now we return to the dictionary customization problem. In particular, the dataset $Y$ consists of several separable subdatasets $Y_1, \dots, Y_m$; that is, $Y = \bigcup_i Y_i$. Further, $D_0$ is an existing global dictionary corresponding to $Y$. This setting is common: the dictionary for facial images would typically be well trained, and the corresponding dataset can be divided by individuals. We would then like to customize $D_0$ with some auxiliary samples $\widetilde{Y}$ drawn from the target subdataset, requiring that the customized dictionary $D$ have the same size but behave better on the target subdataset.

Obviously, $D$ should provide sparse representations and small reconstruction errors on $\widetilde{Y}$, which corresponds to minimizing $\|\widetilde{Y} - DX\|_F^2$ under a sparsity constraint on the coefficient matrix $X$. Further, note that $Y$ can be regarded as a collection of several signal sources and, hence, as a mixed bounded model. Moreover, owing to the fine granularity, the differences between the subdatasets are small and their basic sketches are consistent, implying that the underlying dictionaries of all subdatasets are similar. Thus, according to Theorem 4, $D_0$ should be close to our customized dictionary as well, which is also in accordance with the practical observation. Considering the distance induced by the Frobenius norm, this leads directly to the regularizer $\|D - D_0\|_F^2$.

The customization model can then be formulated as minimizing the sum of the reconstruction errors and the above regularizer; that is,
\[
\min_{D,\, X} \; \|\widetilde{Y} - D X\|_F^2 + \lambda \|D - D_0\|_F^2
\quad \text{s.t.} \quad \|x_i\|_0 \le T_0 \;\; \text{for all } i, \tag{29}
\]
where $x_i$ represents the $i$th column vector of $X$, $T_0$ is the sparsity number, and $\lambda$ is the parameter balancing the prior knowledge in $D_0$ against the information in $\widetilde{Y}$.
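As a minimal computational sketch (the function and variable names are ours, not the paper's), the objective in (29) can be evaluated as follows:

```python
import numpy as np

def customization_objective(Y_aux, D, X, D0, lam):
    """Reconstruction error on the auxiliary samples plus the penalty
    that keeps the customized dictionary D close to the global one D0."""
    recon = np.linalg.norm(Y_aux - D @ X, "fro") ** 2
    penalty = lam * np.linalg.norm(D - D0, "fro") ** 2
    return recon + penalty
```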

It is worth noting that problem (29) is connected with the matrix version of total least squares (TLS) problems [23], which generalize least squares by allowing noise in both the dependent and the independent variables. This is interpretable: as mentioned above, the atoms of the global dictionary only capture the main sketches; they regard the characteristics belonging to different subdatasets as noise and discard them. As a result, when considering a particular subdataset, the characteristic information is absent and thus the corresponding atoms of $D_0$ can be seen as noisy. Different from TLS, the tuning parameter is necessary, as the noise levels in $\widetilde{Y}$ and $D_0$ are different and should be balanced. We further characterize model (29) with the following properties.

Theorem 7. Consider customization problem (29), where $\widetilde{Y}$ is the auxiliary data, $D_0$ is the global dictionary, $D^*$ is the true dictionary corresponding to the target subdataset, and $D$ is the customized one attained from (29). Then:
(1) denote ; for any , ;
(2) for a fixed $\lambda$, when the number of auxiliary samples tends to infinity, $D$ converges to $D^*$; in other words, the minimizer of (29) is an asymptotically unbiased estimator of $D^*$;
(3) the tuning parameter $\lambda$ reflects the confidence in $D_0$; in particular, as $\lambda \to \infty$, $D = D_0$, whereas for $\lambda = 0$, (29) degrades into a common dictionary learning problem.

Proof. For 1, as is the optimal solution of problem (29), then, for any and , we have Let ; then we have and the equality holds only when .
For (2), divide the loss function by the number $n$ of auxiliary samples, that is, rewrite it as
\[
\frac{1}{n} \|\widetilde{Y} - D X\|_F^2 + \frac{\lambda}{n} \|D - D_0\|_F^2 .
\]
When $n$ tends to infinity, the penalty term tends to zero and thus the loss function degenerates into the common dictionary learning form.
For (3), it is easy to see that $\lambda$ reflects the weight of the penalty in the loss function, and the conclusion follows.

According to the third property of Theorem 7, customization can be seen as a trade-off between learning a dictionary and using an existing one, which fills the gap between them and implies a more flexible dictionary selection strategy. In particular, for datasets with coarse granularity, learn a dictionary from large amounts of samples; for subjects with fine granularity, customize the existing dictionary with some auxiliary samples; and use a predefined dictionary if no samples are available.

We also emphasize that our model (29) remains valid as long as the closeness assumption is satisfied (i.e., the global dictionary is close to the dictionary underlying the target data). As demonstrated in the experiments, this admits further applications, such as improving an insufficiently learned dictionary or correcting a contaminated one. In addition, other matrix norms can be selected for the regularizer as well; for example, with the distance induced by the matrix $\ell_{1,1}$-norm, the correction $D - D_0$ becomes sparse, as sketched below, and the costs of storage and transmission are greatly reduced.
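In generic notation (a sketch of the variant just mentioned, using the same symbols as in (29)), this sparse-correction model would read
\[
\min_{D,\, X} \; \|\widetilde{Y} - D X\|_F^2 + \lambda \|D - D_0\|_{1,1}
\quad \text{s.t.} \quad \|x_i\|_0 \le T_0 \;\; \text{for all } i,
\]
so that only a few entries of $D - D_0$ are nonzero and the correction can be stored or transmitted cheaply alongside the shared global dictionary.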

5. Optimization

In this section, we first introduce a general optimization strategy and then devise a more straightforward dictionary updating strategy similar to K-SVD [10].

5.1. A General Strategy

A general optimization strategy, not necessarily leading to a global optimum, can be found by splitting the problem into two parts which are alternately solved within an iterative loop. The two parts are as follows.

5.1.1. Sparse Coding

Keeping the dictionary $D$ fixed, find the coefficients $X$ by solving
\[
\min_{X} \; \|\widetilde{Y} - D X\|_F^2 \quad \text{s.t.} \quad \|x_i\|_0 \le T_0 \;\; \text{for all } i.
\]
This can be solved by pursuit algorithms such as OMP [24] and FOCUSS [25], or relaxed to a Lasso formulation [26].
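A minimal sketch of this stage using scikit-learn's OMP solver (the helper name and array shapes are our choices, not the paper's):

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def sparse_coding(Y, D, n_nonzero):
    """Code each column of Y (dim x n_samples) over the dictionary D
    (dim x n_atoms, atoms roughly unit-norm); returns the coefficient
    matrix X of shape (n_atoms, n_samples) with at most `n_nonzero`
    nonzero entries per column."""
    return orthogonal_mp(D, Y, n_nonzero_coefs=n_nonzero)
```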

5.1.2. Dictionary Updating

Keeping the coefficients $X$ fixed, find the dictionary $D$ by solving
\[
\min_{D} \; \|\widetilde{Y} - D X\|_F^2 + \lambda \|D - D_0\|_F^2 .
\]
This is a quadratic problem with the closed-form solution $D = (\widetilde{Y} X^\top + \lambda D_0)(X X^\top + \lambda I)^{-1}$.
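A sketch of this update in NumPy, obtained from the normal equations of the quadratic objective above (the function and variable names are ours):

```python
import numpy as np

def update_dictionary(Y_aux, X, D0, lam):
    """Closed-form dictionary update with the coefficients X fixed:
    solves D (X X^T + lam I) = Y_aux X^T + lam D0 for D."""
    K = D0.shape[1]
    rhs = Y_aux @ X.T + lam * D0           # (dim, K)
    gram = X @ X.T + lam * np.eye(K)       # (K, K), symmetric positive definite for lam > 0
    return np.linalg.solve(gram, rhs.T).T  # equivalent to rhs @ inv(gram)
```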

5.2. C-Ksvd Algorithm

We now turn to a more involved dictionary updating strategy: rather than freezing the coefficient matrix $X$, we update the dictionary together with the nonzero coefficients (i.e., only the support is kept fixed).

In particular, assume that both the dictionary and the coefficient matrix are fixed except for one column of the correction matrix $D - D_0$ and the coefficients that correspond to it, namely the $k$th row of $X$, denoted $x^k$. Then the loss function can be rewritten as a function of these two variables plus a constant, where $E_k$ represents the error when the $k$th dictionary atom is removed.

Now we restrict the loss function to the support of the row vector $x^k$. Define $\omega_k$ as the group of indices pointing to the samples that use the atom $d_k$, that is, the indices of the nonzero entries of $x^k$. Further, define $\Omega_k$ as the selection matrix with ones on the entries indexed by $\omega_k$ and zeros elsewhere. Then problem (29) is transformed into the restricted subproblem (37) in terms of $E_k^R = E_k \Omega_k$ and $x_k^R = x^k \Omega_k$. For this subproblem, we have the following result.

Theorem 8. Suppose the largest singular value and the corresponding singular vectors of the matrix in question are $\sigma_1$, $u_1$, and $v_1$, and let $v_1^{(1)}$ denote the first element of $v_1$. Then the unique solution of problem (37) is given in closed form in terms of $\sigma_1$, $u_1$, and $v_1^{(1)}$.

Proof. Denote ; then

As it is the outer product of two vectors, its rank is one. Then problem (37) can be rewritten as

Thus it is the best rank-one approximation, which, by the Eckart–Young–Mirsky theorem [27], is given by the top singular triple. Substituting back into the original problem (37), it becomes a least squares problem, and we have

Thus, problem (37) has a closed-form solution, and the main computation is the top SVD of a single matrix. In the dictionary updating stage, we can therefore minimize with respect to each dictionary column and the corresponding coefficient row in sequence (for simplicity, we use the restricted quantities directly in the update), keeping the support of the coefficients fixed. The complete algorithm, named "C-Ksvd", is described as Algorithm 9. Note that while K-SVD computes the top SVD of the restricted error matrix for the $k$th column, C-Ksvd computes the top SVD of a modified matrix that accounts for the regularizer (Theorem 8).
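To see, in generic notation, why a modified matrix arises (this is an illustrative identity, not a restatement of Theorem 8), note that for any column $d$, row vector $x^R$, restricted error matrix $E^R$, and global atom $d^0$,
\[
\|E^R - d\, x^R\|_F^2 + \lambda \|d - d^0\|_2^2
= \Big\| \big[\, \sqrt{\lambda}\, d^0 \;\; E^R \,\big] - d \big[\, \sqrt{\lambda} \;\; x^R \,\big] \Big\|_F^2 ,
\]
so the regularized per-atom subproblem amounts to fitting a rank-one matrix to the restricted error matrix augmented with a column proportional to the corresponding global atom (with the first entry of the row factor tied to $\sqrt{\lambda}$); relaxing this to an unconstrained rank-one approximation is what brings the top SVD of the augmented matrix into play.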

Assuming that the sparse coding stage is performed perfectly, convergence to a local minimum is guaranteed, as the loss function is nonincreasing at each atom update and a series of such steps ensures a monotonic reduction. Compared with the general strategy, this update is more straightforward, as it also allows tuning of the values of the corresponding coefficients. In addition, each atom can have its own regularization parameter, reflecting the confidence level of that atom.

Algorithm 9 (C-Ksvd algorithm).
Initialization: a global dictionary $D_0$ and auxiliary samples $\widetilde{Y}$; set $D = D_0$.
Repeat:
(i) Sparse coding stage: use any sparse recovery algorithm to compute the coefficients for each sample by approximately solving the sparse coding problem with the current dictionary.
(ii) Dictionary updating stage: for each column $d_k$ of $D$, update it as follows:
(a) compute the error matrix $E_k$ obtained when the $k$th atom is removed;
(b) define the group of samples that use this atom as $\omega_k$; restrict $E_k$ and $x^k$ by choosing the columns corresponding to $\omega_k$, obtaining $E_k^R$ and $x_k^R$;
(c) apply a top SVD decomposition to the matrix of Theorem 8 and update $d_k$ and $x_k^R$ accordingly.
Until convergence (stopping rule).
Output: a better (customized) dictionary $D$.
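The sketch below puts the two stages together end to end (generic notation; the function names and the `lam` parameter are ours). The per-atom step uses a few alternating least-squares refinements of the regularized rank-one subproblem as a stand-in for the exact SVD-based closed form of Theorem 8, and atoms are left unnormalized for simplicity.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def c_ksvd(Y, D0, n_nonzero, lam, n_iter=10):
    """Customize the global dictionary D0 (dim x K) with auxiliary
    samples Y (dim x n); lam weights the penalty ||D - D0||_F^2."""
    D = D0.copy()
    for _ in range(n_iter):
        # Sparse coding stage: OMP with a fixed number of nonzeros per sample.
        X = orthogonal_mp(D, Y, n_nonzero_coefs=n_nonzero)
        # Dictionary updating stage: sweep over atoms, support kept fixed.
        for k in range(D.shape[1]):
            omega = np.flatnonzero(X[k])           # samples using atom k
            if omega.size == 0:
                continue
            # Error on the supporting samples when atom k is removed.
            E = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, k], X[k, omega])
            d, x = D[:, k].copy(), X[k, omega].copy()
            # Alternating closed-form steps on
            #   min_{d, x} ||E - d x^T||_F^2 + lam * ||d - D0[:, k]||_2^2
            for _ in range(3):
                d = (E @ x + lam * D0[:, k]) / (x @ x + lam)
                x = E.T @ d / (d @ d)
            D[:, k], X[k, omega] = d, x
    return D
```

For instance, `D = c_ksvd(patches, D_global, n_nonzero=5, lam=1.0)` would return a dictionary of the same size as `D_global` that adapts to the auxiliary patches while staying close to it.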

6. Experiments

We first showed the effectiveness of our approach on the denoising task, with an analysis of the customized dictionary and the tuning parameter $\lambda$. Further, a novel superresolution experiment was illustrated, sharing the idea of transferring knowledge from a related auxiliary data source. In addition, we conducted an experiment that enhances an insufficiently learned dictionary with C-Ksvd, illustrating that our model is also valid for further tasks.

6.1. Denoising

We demonstrated the customization results on denoising tasks with facial images drawn from the PIE database [28]. The denoising process was similar to [1], including sparse coding of each patch of the noisy image. As the coding performance relied heavily on the dictionary, we could assess the dictionary by the denoising results, which were evaluated by PSNR (Peak Signal-to-Noise Ratio).
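For reference, PSNR for 8-bit images can be computed as follows (a generic helper, not code from the paper):

```python
import numpy as np

def psnr(clean, denoised, peak=255.0):
    """Peak Signal-to-Noise Ratio (dB) between two images of equal size."""
    mse = np.mean((np.asarray(clean, float) - np.asarray(denoised, float)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```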

In particular, the noisy images were produced by adding Gaussian noise with zero mean and different standard deviations. The patch size and the redundancy factor were set to 16 × 16 and 4, respectively. (We chose them for the best visual effect, while similar comparisons can be obtained for different values.) OMP was used for coding, and atoms were accumulated until the average error passed an empirically chosen threshold. Results corresponding to three dictionaries were compared, that is, the global dictionary $D_0$, the one generated by K-SVD, and the one produced by our customization approach. In both K-SVD and customization, $D_0$ was used as initialization and the iteration number was set to 10. Moreover, three kinds of global dictionaries were considered, denoted "global I", "global II," and "DCT," respectively: (1) a dictionary learned by K-SVD with 40,000 noiseless patches picked from 100 individuals; (2) a dictionary similar to (1) but learned with noisy patches; (3) the predefined DCT (discrete cosine transform) dictionary.

Each experiment was repeated 5 times, and the results are depicted in Table 1 and Figure 3. Customization outperformed the global dictionary and K-SVD in both PSNR and visual effects, owing to the fact that both the common sketches in $D_0$ and the characteristics in the auxiliary samples had been utilized. In particular, note that denoising with the global dictionary tended to be too smooth, while the results of K-SVD were likely to be too rough. Regarding DCT as a suboptimal global dictionary, the results also show that our customization is valid for a wide range of global dictionaries. Conducted on an i7-3770 CPU and processed with the same dataset, the average running times for K-SVD and customization were 173.34 s and 48.21 s, respectively, showing that our approach is competitive; in particular, for K-SVD, 119.31 s were spent on removing identical atoms. We also display the three dictionaries as images in Figure 4, showing that the customized one was similar to the global one, while the one corresponding to K-SVD was not.

In addition, we plotted the relations among the tuning parameter $\lambda$, the average number (AN) of coefficients per patch, and the PSNR after denoising in Figure 5. It was shown that $\lambda$ could be chosen as the value attaining the minimum average number of coefficients by a quick one-dimensional search. What is more, experimentally, for a fixed noise level, the best $\lambda$ for different individuals was the same, which implies that we only need to tune it once while customizing.
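A sketch of this selection heuristic (the helper names and the `customize` callback are ours; any customization routine returning a dictionary could be plugged in):

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def average_num_coeffs(Y, D, tol):
    """Average number of atoms error-constrained OMP needs per sample to
    reach the squared-residual tolerance `tol` with dictionary D."""
    X = orthogonal_mp(D, Y, tol=tol)
    return np.count_nonzero(X) / Y.shape[1]

def select_lambda(Y, D0, customize, tol, grid=np.logspace(-2, 2, 9)):
    """One-dimensional search: pick the lambda whose customized dictionary
    attains the minimum average number of coefficients on the samples."""
    scores = [average_num_coeffs(Y, customize(Y, D0, lam), tol) for lam in grid]
    return grid[int(np.argmin(scores))]
```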

6.2. Superresolution

Yang et al. [29] proposed a scale-up algorithm via sparse signal representation, which contains two steps: dictionary learning and patch-pair construction. To reduce the dimension and speed up processing, Elad [30] applied PCA to the samples and used K-SVD for training. However, this learned dictionary is still a global dictionary, which means that we can further improve the performance of superresolution by customization.

Consider a global dictionary and patches sampled from related high-resolution images; we can customize this dictionary to a finer granularity. In particular, by substituting C-Ksvd for K-SVD, the low-resolution dictionary and the coefficients were customized, and the corresponding high-resolution dictionary was then obtained from the customized quantities together with the initial high-resolution and low-resolution dictionaries.

In this experiment, similar to the settings in [31], we evaluated the proposed approach on the Yale Face Database [32], which contains 11 different 100 × 100 facial images for each of 15 individuals. A downscaled image was taken as the low-resolution object, and the downscale factor was set to 3. Further, other images of the same individual were considered as high-resolution auxiliary data. The patch size was set to 3 × 3. The global dictionary was trained on 34,650 patches sampled from 80% of the downscaled total dataset. (This is to highlight the relevance of the auxiliary data and simulate real conditions, as the total training set is relatively small and clean in our experiment.) Results produced by the global dictionary, by K-SVD (i.e., the original version with the global dictionary as initialization and the auxiliary patches as training data), and by customization were compared. For customization, 225 patches were taken from each auxiliary image. For K-SVD, the total number of sampled patches was fixed at 6,000 to obtain the best results.

Varying the number of auxiliary images and repeating the experiments on different individuals, we evaluated the performance by PSNR. Some of the results are summarized in Table 2, where "DL" and "Cus" represent K-SVD and customization, respectively, and "3," "6," or "9" denotes the number of auxiliary images. "Bicubic," that is, simple bicubic interpolation, is shown as a baseline method.

It is seen that when the number of auxiliary images is small, the results produced by K-SVD are worse than those of the global dictionary, implying that relearning is not worthwhile in this regime. However, even with little auxiliary data (675 patches from 3 images), superresolution by customization shows significant improvements. Further, customization still outperforms, or is no worse than, the learning approach when new data are added. Recall that the number of patches K-SVD needs is much larger than that needed for customization, meaning that more computation and time are required. Also note that once the dictionary has been customized, it is valid for all the images of the person.

6.3. Enhancing

As mentioned above, model (29) can be applied to further tasks as long as the assumption that the given dictionary and the underlying reference dictionary are close is satisfied. In this subsection, we consider enhancing an existing dictionary by C-Ksvd and evaluate the performance on classification.

In particular, LC-KSVD [33], one of the state-of-the-art methods for image classification, introduced a triple model consisting of the dictionary, the parameters of the label-consistency term, and the linear classifier. Regarding the stacked triple as a new dictionary, the resulting objective function can be solved by K-SVD.
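In generic notation, and following the formulation in [33] (the weights $\alpha$ and $\beta$ and the matrices $Q$ and $H$ are theirs, not symbols defined in this paper), the LC-KSVD objective reads
\[
\min_{D, A, W, X} \; \|Y - D X\|_F^2 + \alpha \|Q - A X\|_F^2 + \beta \|H - W X\|_F^2
\quad \text{s.t.} \quad \|x_i\|_0 \le T_0 ,
\]
where $Q$ encodes the label-consistent sparse codes and $H$ the class labels; stacking $(Y, \sqrt{\alpha}Q, \sqrt{\beta}H)$ and $(D, \sqrt{\alpha}A, \sqrt{\beta}W)$ turns this into a standard dictionary learning problem that K-SVD, or here C-Ksvd, can handle.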

Sometimes the learned model is not good enough, because the training data may be insufficient or too noisy; moreover, over time the past training data often become unavailable. In this case, we can further enhance the model with our customization approach, simply replacing the K-SVD procedure with C-Ksvd. In accordance with [33], we used the Extended Yale-B dataset [34] to demonstrate the performance, and the data were divided into three parts: training data to obtain the initial model, auxiliary data for model enhancement, and test data for evaluation. The parameters of the label-consistency and classifier terms were tuned while training the initial models and then kept fixed. The initial model, LC-KSVD, and C-Ksvd were compared, where LC-KSVD used the initial model as initialization and the auxiliary data as training data. Results were analyzed in three ways.

(1) Initial models of different quality levels were obtained by tuning the number of training samples, and we then tried to improve each model with 800 auxiliary images. After repeating the experiment 5 times for each level, the averaged recognition accuracies are summarized in Table 3.

It is seen that C-Ksvd is valid over a wide range of initial models and always significantly outperforms LC-KSVD. Besides, the influence of the initial models on LC-KSVD is relatively small, in accordance with our previous analysis.

(2) For fixed initial models, varying the number of auxiliary images from 100 to 1100, we plotted the corresponding recognition results in Figure 6 and found that the accuracy increased significantly, even when the number of auxiliary images was relatively small. To obtain a competitive result with LC-KSVD, a large number of images was required, which was unaffordable.

(3) In the previous discussion, the auxiliary images were uniformly sampled from all 38 individuals. We then considered the nonuniform case where only images of several classes (named "enhanced classes") were available. Setting the number of enhanced classes to 19 and taking 31 images from each class, we report the results in Table 4.

While C-Ksvd improved the accuracy on the enhanced classes, the accuracy on the remaining classes was only slightly reduced, owing to the similarity of the original and new dictionaries. LC-KSVD presented a sharp contrast.

7. Conclusion

In this paper, we considered the dictionary customization problem, which can be seen as a trade-off between learning a new dictionary from data and using an existing one. We supported our hypothesis with theoretical analysis and formulated a model by introducing a specific regularizer. An efficient algorithm was proposed, and experiments on real-world data demonstrated that our approach is promising.

Competing Interests

The authors declare that they have no competing interests.