Abstract

Nonnegative matrix factorization (NMF) is a useful tool for learning a basis representation of image data. However, its performance and applicability in real scenarios are limited because of the lack of image information. In this paper, we propose a constrained matrix decomposition algorithm for image representation which contains parameters associated with the characteristics of image data sets. In particular, we impose label information as additional hard constraints on the α-divergence-NMF unsupervised learning algorithm. The resulting algorithm is derived by using the Karush-Kuhn-Tucker (KKT) conditions as well as the projected gradient, and its monotonic local convergence is proved by using auxiliary functions. In addition, we provide a method for selecting the parameters of our semisupervised matrix decomposition algorithm in the experiments. Compared with state-of-the-art approaches, our method with the selected parameters achieves the best classification accuracy on three image data sets.

1. Introduction

Learning an efficient representation of image information is a key problem in machine learning and computer vision. Efficiency of the representation refers to the ability to capture significant information from a high-dimensional image space. Data in such a high-dimensional space are difficult to manipulate and compute with; dimension reduction therefore becomes a crucial tool for coping with this problem. Fortunately, matrix factorization is an effective approach to dimension reduction, and it has a long and successful history in image representation [1–3]. Representative matrix factorization methods include principal component analysis (PCA) [4], singular value decomposition (SVD) [5], vector quantization (VQ) [6], and nonnegative matrix factorization (NMF) [7].

Among all techniques for matrix factorization, NMF is distinguished from the others by its use of nonnegativity constraints in learning a basis representation of image data [8] and has been applied in face recognition [9–11], medical imaging [12, 13], electroencephalogram (EEG) classification for brain-computer interfaces [14], and many other areas. However, NMF is an unsupervised learning algorithm and is inapplicable to learning a basis representation from limited image information. Thus, to make up for this deficiency, extra constraints are implicitly or explicitly incorporated into NMF to derive semisupervised matrix decomposition algorithms. In [15], the authors impose label information as additional hard constraints on NMF based on the squared Euclidean distance and the Kullback-Leibler divergence. Such a representation encodes the data points from the same class, using the indicator matrix, identically in a new representation space, where the obtained parts-based representation is more discriminating.

However, none of the semisupervised NMF algorithms mentioned above contain parameters associated with the characteristics of image data sets. In this paper, we introduce the α-divergence-NMF algorithm [16], where α is a positive parameter. We impose the label constraints on the α-divergence-NMF algorithm to derive a generic constrained matrix decomposition algorithm which includes some existing algorithms as special cases: one of them is CNMFKL [15], recovered in the limit α → 1. We then obtain the proposed algorithm using the Karush-Kuhn-Tucker (KKT) method as well as the projected gradient method and prove its monotonic local convergence using an auxiliary function. Comparing with the current semisupervised NMF algorithms, we analyze the classification accuracy for two fixed values of α on three image data sets.

The algorithm does not work well for a fixed value of α. Since the parameter α is associated with the characteristics of a learning machine, the model distribution is more inclusive as α → +∞ and more exclusive as α → −∞. The selection of the optimal value of α plays a critical role in determining the discriminative basis vectors. In this paper, we provide a method to select the parameter for our semisupervised algorithm, where the variation of α is associated with the characteristics of the image data sets. Compared with the algorithms in [15, 16], our algorithm is more complete and systematic.

The rest of the paper is organized as follows. In Section 2, we give a brief overview of the standard NMF algorithm and the constrained NMF algorithm. The detailed algorithms with labeled constraints and the theoretical proof of their convergence are provided in Sections 3 and 4, respectively. Section 5 presents experimental results that show the advantages of our algorithm. Finally, a conclusion is given in Section 6.

2. NMF and Constrained NMF

NMF, proposed by Lee and Seung [7], is considered to provide a parts-based representation and has been applied to diverse examples of nonnegative data [17–21], including text data mining, subsystem identification, spectral data analysis, audio and sound processing, and document clustering.

Suppose X = [x_1, x_2, ..., x_n] is a set of n training images, where each column vector x_i consists of the nonnegative pixel values of a training image. NMF seeks two nonnegative matrix factors W ∈ R^{m×r} and H ∈ R^{r×n} that approximate the original image matrix,

$X \approx WH$,   (1)

where the positive integer r is smaller than m or n.
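To make the factorization concrete, the following is a minimal NumPy sketch of the classical multiplicative updates of Lee and Seung [7] for the KL-divergence cost (the initialization, iteration count, and eps smoothing are illustrative choices of ours, not taken from the paper):

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for X ~= W @ H (KL cost)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r)) + eps          # nonnegative random init
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H <- H * (W^T (X / WH)) / (W^T 1)
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W <- W * ((X / WH) H^T) / (1 H^T)
        W *= ((X / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```

Each update multiplies the current factor elementwise by a nonnegative ratio, so nonnegativity is preserved automatically; the same multiplicative structure reappears in the constrained updates derived in Section 3.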

The nonnegativity constraints of NMF make the learned representation purely additive, but the learning remains entirely unsupervised. NMF is therefore inapplicable to learning a basis representation from limited image information. To make up for this deficiency, extra constraints such as locality [22], sparseness [9], and orthogonality [23] were implicitly or explicitly incorporated into NMF to identify better local features or to provide a sparser representation.

In [15], the authors impose label information as additional hard constraints on the NMF unsupervised learning algorithm to derive a semisupervised matrix decomposition algorithm, which makes the obtained representation more discriminating. The label information is incorporated as follows.

Suppose X = [x_1, ..., x_n] is a data set consisting of n training images. Assume that the first l images carry label information and that the remaining n − l images are unlabeled. Assume there exist c classes and that each labeled image from X is designated to exactly one class. Then we have an indicator matrix C ∈ R^{l×c}, which can be represented as c_{ij} = 1 if x_i belongs to class j and c_{ij} = 0 otherwise. From the indicator matrix C, a label constraint matrix A can be defined as

$A = \begin{pmatrix} C & 0 \\ 0 & I_{n-l} \end{pmatrix}$,

where I_{n−l} denotes an (n − l) × (n − l) identity matrix.
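As an illustration of this construction, the sketch below builds C and A with NumPy under the conventions stated above (classes indexed 0, ..., c−1; labeled samples placed first); the function name is ours:

```python
import numpy as np

def label_constraint_matrix(labels, n, c):
    """Indicator matrix C (l x c) and constraint matrix
    A = [[C, 0], [0, I]] of size n x (c + n - l); `labels` gives the
    class of each of the first l (labeled) samples."""
    l = len(labels)
    C = np.zeros((l, c))
    C[np.arange(l), labels] = 1.0      # c_ij = 1 iff x_i is in class j
    A = np.zeros((n, c + n - l))
    A[:l, :c] = C                      # labeled block
    A[l:, c:] = np.eye(n - l)          # identity block for unlabeled samples
    return C, A
```

For example, labels = [0, 0, 1] with n = 5 and c = 2 gives two identical first rows of A, so any coefficient matrix of the form V = AZ assigns those two same-labeled samples identical representations.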

Imposing the label information as an additional hard constraint through an auxiliary matrix Z, the coefficient matrix takes the form V = AZ. This construction verifies that v_i = v_j if x_i and x_j have the same label. With the label constraints, standard NMF is transformed into factorizing the large matrix X into the product of three smaller matrices W, A, and Z,

$X \approx W(AZ)^T$.

Such a representation encodes the data points from the same class, using the indicator matrix, identically in the new representation space.

3. A Constrained Algorithm Based on α-Divergence

The exact form of the error measure in (1) is as crucial as the nonnegativity constraints to the success of the NMF algorithm in learning a useful representation of image data. In the research on NMF, a large number of error measures have been investigated, such as Csiszár's f-divergences [24], Amari's α-divergence [25], and Bregman divergences [26]. Here, we introduce a generic multiplicative updating algorithm [16] which iteratively minimizes the α-divergence between X and W(AZ)^T. We define the α-divergence as

$D_\alpha(X \,\|\, Y) = \frac{1}{\alpha(1-\alpha)} \sum_{ij} \left( \alpha x_{ij} + (1-\alpha) y_{ij} - x_{ij}^{\alpha} y_{ij}^{1-\alpha} \right)$,   (4)

where α is a positive parameter. We combine the label constraints with (4) to derive the following objective function, which is based on the α-divergence between X and W(AZ)^T,

$F(W, Z) = D_\alpha(X \,\|\, W(AZ)^T)$.   (6)

With the nonnegativity constraints W ≥ 0 and Z ≥ 0, the minimization of F can be formulated as a constrained minimization problem with inequality constraints. In the following, we present two methods to find a local minimum of (6).
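The display above follows the common Amari parameterization of the α-divergence; a small NumPy sketch of this formula, including the standard limiting cases, reads as follows (the eps clipping is a numerical safeguard we add):

```python
import numpy as np

def alpha_divergence(X, Y, alpha, eps=1e-12):
    """D_alpha(X||Y) = (1/(alpha*(1-alpha))) * sum(alpha*x + (1-alpha)*y
    - x**alpha * y**(1-alpha)).  alpha -> 1 recovers KL(X||Y),
    alpha -> 0 the dual KL, and alpha = 0.5 a Hellinger-type cost."""
    X = np.maximum(X, eps)
    Y = np.maximum(Y, eps)
    if np.isclose(alpha, 1.0):         # KL limit
        return np.sum(X * np.log(X / Y) - X + Y)
    if np.isclose(alpha, 0.0):         # dual KL limit
        return np.sum(Y * np.log(Y / X) - Y + X)
    return np.sum(alpha * X + (1 - alpha) * Y
                  - X**alpha * Y**(1 - alpha)) / (alpha * (1 - alpha))
```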

3.1. KKT Method

Let Ψ = [ψ_{ik}] and Φ = [φ_{jk}] be the Lagrangian multipliers associated with the constraints w_{ik} ≥ 0 and z_{jk} ≥ 0, respectively. The Karush-Kuhn-Tucker conditions require that both the optimality conditions (7) and (8) and the complementary slackness conditions (9) are satisfied. If w_{ik} = 0 and z_{jk} = 0, then either multiplier can take any value. At the same time, if w_{ik} > 0 and z_{jk} > 0, then ψ_{ik} = 0 and φ_{jk} = 0. Hence we need both ψ_{ik} w_{ik} = 0 and φ_{jk} z_{jk} = 0, which follows from (9) and is recorded in (10). We multiply both sides of (7) and (8) by w_{ik} and z_{jk}, respectively, and incorporate (10), and then we obtain the multiplicative updating rules (11) and (12).
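Since the displayed rules (11) and (12) are not reproduced above, the following NumPy sketch instantiates the standard multiplicative α-NMF updates of [16] with the coefficient matrix constrained as V = AZ, so that X ≈ W(AZ)^T; it illustrates the structure of the derived updates rather than claiming to be the paper's exact formulas:

```python
import numpy as np

def constrained_alpha_nmf(X, A, r, alpha=0.5, n_iter=200, eps=1e-9, seed=0):
    """Alternating multiplicative updates for X ~= W @ (A @ Z).T under
    the alpha-divergence; A is the fixed label constraint matrix."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    p = A.shape[1]
    W = rng.random((m, r)) + eps
    Z = rng.random((p, r)) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        R = (X / (W @ (A @ Z).T + eps)) ** alpha   # elementwise ratio^alpha
        # Z update: z <- z * (A^T R^T W / A^T 1 W)^(1/alpha)
        Z *= ((A.T @ R.T @ W) / (A.T @ ones.T @ W + eps)) ** (1.0 / alpha)
        R = (X / (W @ (A @ Z).T + eps)) ** alpha
        V = A @ Z
        # W update: w <- w * (R V / 1 V)^(1/alpha)
        W *= ((R @ V) / (ones @ V + eps)) ** (1.0 / alpha)
    return W, Z
```

Setting A to the identity recovers the unconstrained α-NMF updates, and in the limit α → 1 the ratios reduce to the familiar multiplicative form of KL-NMF.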

3.2. Projected Gradient Method

Considering the gradient descent algorithm [24, 25], the updating rules for the objective function (6) can also be derived by using the projected gradient method [27]: the factors are moved along the negative gradient and then projected back onto the nonnegative orthant, where the projection is a suitably chosen function and η_W and η_Z are two parameters that control the step size of the gradient descent. Choosing the step sizes so that the updating rules (11) and (12) hold, it follows from (7) and (8) that the resulting updates are the same as the updating rules (11) and (12). We have thus shown that the algorithm can be derived using the Karush-Kuhn-Tucker conditions and have presented an alternative derivation using the projected gradient. The agreement of the two methods theoretically guarantees the correctness of the algorithm.

In the following, we will give a theorem to guarantee the convergence of the iterations in updates (11) and (12).

Theorem 1. For the objective function (6), D_α is nonincreasing under the updating rules (11) and (12). The objective function is invariant under these updates if and only if W and Z are at a stationary point.

Multiplicative updates for our constrained algorithm based on the α-divergence are given in (11) and (12). These updates find a local minimum of the objective function (6), which is the final solution of (11) and (12). Note that, in the limit α → 1, the updates (11) and (12) coincide with the CNMFKL algorithm [15], which is thus a special case of our generic constrained matrix decomposition algorithm. In the following, we give the proof of Theorem 1.

4. Convergence Analysis

To prove Theorem 1, we will make use of an auxiliary function that was used in the expectation-maximization algorithm [28, 29].

Definition 2. A function G(h, h′) is defined as an auxiliary function for F(h) if the following two conditions are both satisfied:

$G(h, h') \ge F(h)$,   $G(h, h) = F(h)$.

Lemma 3. Assume that the function G is an auxiliary function for F; then F is nonincreasing under the update

$h^{(t+1)} = \arg\min_h G(h, h^{(t)})$.   (18)

Proof. Consider $F(h^{(t+1)}) \le G(h^{(t+1)}, h^{(t)}) \le G(h^{(t)}, h^{(t)}) = F(h^{(t)})$.

It can be observed that equality holds only if h^{(t)} is a local minimum of G(h, h^{(t)}). Iterating the update in (18) yields a sequence of estimates that converges to a local minimum $h_{\min} = \arg\min_h F(h)$ of the objective function. In the following, we show that the objective function (6) is nonincreasing under the updating rules (11) and (12) by defining appropriate auxiliary functions with respect to Z and W.

Lemma 4. The function G(Z, Z̃) defined in (20) is an auxiliary function for the objective F(Z) in (21), the α-divergence cost regarded as a function of Z alone.

Proof. Obviously, G(Z, Z) = F(Z). According to the definition of an auxiliary function, we only need to prove that G(Z, Z̃) ≥ F(Z). To do this, we use the convexity, for positive arguments, of the function underlying the α-divergence to rewrite the cost as a weighted sum whose weights are nonnegative and sum to one by construction. Applying Jensen's inequality [30] to this convex combination yields an upper bound on F(Z), and from this inequality it follows that G(Z, Z̃) ≥ F(Z), which satisfies the condition of an auxiliary function.

Reversing the roles of Z and W in Lemma 4, we define an auxiliary function for the update (12).

Lemma 5. The function G(W, W̃), defined analogously to (20) with the roles of W and Z reversed, is an auxiliary function for the objective regarded as a function of W alone.

This can be proved in the same way as Lemma 4. From Lemmas 4 and 5, we can now prove Theorem 1.

Proof. To guarantee that the update of Z does not increase the objective, by Lemma 3 we only need to obtain the minimum of G(Z, Z̃) with respect to Z. Setting the gradient of (20) to zero and solving, it follows that the minimizer has exactly the form of the updating rule (11). Similarly, to guarantee that the updating rule (12) holds, the minimum of G(W, W̃), which can be determined by setting the gradient of (26) to zero, must exist.

Since G(Z, Z̃) is an auxiliary function, according to Lemma 4, the objective in (21) is nonincreasing under the update (11). Alternating the updates (11) and (12), we can find a local minimum of the objective function (6).

5. Experiments

In this section, the proposed algorithm is systematically compared with the current constrained NMF algorithms on three image data sets, namely, the ORL Database [31], the Yale Database [32], and the Caltech 101 Database [33]. The details of these three databases are described individually later. We first introduce the evaluated algorithms.
(i) CNMF: the constrained nonnegative matrix factorization algorithm in [15] that minimizes the F-norm cost.
(ii) The constrained nonnegative matrix factorization algorithm proposed in this paper with parameter α = 0.5, which minimizes the Hellinger divergence cost.
(iii) CNMFKL: the constrained nonnegative matrix factorization algorithm in [15] that minimizes the KL-divergence cost; it is the best previously reported algorithm in image representation.
(iv) The constrained nonnegative matrix factorization algorithm proposed in this paper with a second fixed value of the parameter α, which minimizes the corresponding α-divergence cost.
(v) The constrained nonnegative matrix factorization algorithm proposed in this paper with parameter α*, where the parameter is associated with the characteristics of the image database and is designed by the presented method. The CNMFKL algorithm is a special case of our algorithm with α → 1.

We apply these algorithms to a classification problem and evaluate their performance on three image data sets which contain a number of different categories of images. For each data set, the evaluations are conducted with different numbers of clusters; here the number of clusters k varies from 2 to 10. We randomly choose k categories from one image data set and mix the images of these categories into the collection X. Then, for the semisupervised algorithms, we randomly pick 10 percent of the images from each category in X and record their category numbers as the available label information, from which we obtain the label constraint matrix A; a sketch of this protocol is given below. For some special data sets, the labeling process is different, and we describe the details later.
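A minimal sketch of this labeling protocol (function and variable names are ours) follows; it reveals a fixed fraction of labels per category:

```python
import numpy as np

def sample_labeled_indices(category_of, frac=0.10, seed=0):
    """Indices of the samples whose labels are revealed: a random
    `frac` of each category (at least one sample per category)."""
    rng = np.random.default_rng(seed)
    labeled = []
    for cat in np.unique(category_of):
        idx = np.flatnonzero(category_of == cat)
        n_lab = max(1, int(round(frac * idx.size)))
        labeled.extend(rng.choice(idx, size=n_lab, replace=False))
    return np.sort(np.array(labeled))
```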

Suppose a data set has k categories c_1, ..., c_k, and the cardinalities of the labeled images in these categories are m_1, ..., m_k, respectively. Since the label constraint matrix A is composed of the indicator matrix C and the identity matrix I, the indicator matrix C plays a critical role in the classification performance for the different categories in X. To determine the effectiveness of C, we define in (30) a measure Δ that represents the relationship between the cardinalities of the labeled samples and the total samples, where m_max and m_min denote the maximum and minimum labeled cardinalities over the categories. For a fixed cluster number k in the data set, Δ differs if the number of samples in each category differs. We then compute the parameter α* from Δ according to (31). The value of α* computed by (31) is associated with the characteristics of the image data sets, since its variation is caused both by the cardinalities of the labeled samples in each category and by the total samples. Both quantities are available to our semisupervised algorithms. However, we cannot obtain the cardinalities of labeled images exactly in many real-world applications. Moreover, the value of α varies depending on the data set. It is still an open problem how to select the optimal α [16].

To evaluate the classification performance, we use classification accuracy as the first measure. Our algorithm described in (11) and (12) provides a classification label r_i for each sample. Suppose X is a data set consisting of n training images. For each sample x_i, let l_i be the true class label provided by the image data set. More specifically, if the image x_i is designated to its true class, we evaluate it as a correct label and set δ(r_i, l_i) = 1; otherwise, it is counted as a false label and δ(r_i, l_i) = 0. Eventually, we compute the percentage of correct labels by defining the accuracy measure as

$AC = \frac{1}{n} \sum_{i=1}^{n} \delta(r_i, l_i)$.
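In code, the accuracy measure is a one-liner; the sketch below assumes the predicted labels are directly comparable to the true ones, which holds here because the revealed labels anchor the class identities (otherwise one would first map clusters to classes, e.g., with the Hungarian algorithm):

```python
import numpy as np

def classification_accuracy(pred, truth):
    """AC = (1/n) * sum_i delta(r_i, l_i): the fraction of samples
    whose predicted label matches the true label."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.mean(pred == truth))
```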

To evaluate the classification performance further, we compute the normalized mutual information, which is used to measure how similar two sets of clusters are, as the second measure. Given two sets of clusters C and C′, their normalized mutual information is defined as

$NMI(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))}$,

which takes values between 0 and 1. Here p(c_i) and p(c′_j) denote the probabilities that an image arbitrarily chosen from the data set belongs to the clusters c_i and c′_j, respectively, and p(c_i, c′_j) denotes the joint probability that this arbitrarily selected image belongs to the cluster c_i as well as c′_j at the same time; MI(C, C′) is the mutual information computed from these probabilities, and H(C) and H(C′) are the entropies of C and C′, respectively.
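A self-contained NumPy estimate of this quantity from two integer-labeled clusterings might look as follows (the max-entropy normalization matches the definition above; labels are assumed to be 0-based integers):

```python
import numpy as np

def normalized_mutual_information(u, v):
    """NMI(C, C') = MI(C, C') / max(H(C), H(C')), estimated from the
    empirical joint distribution of clusterings u and v."""
    u, v = np.asarray(u), np.asarray(v)
    n = u.size
    joint = np.zeros((u.max() + 1, v.max() + 1))
    for a, b in zip(u, v):
        joint[a, b] += 1.0 / n                      # empirical p(c_i, c'_j)
    pu, pv = joint.sum(axis=1), joint.sum(axis=0)   # marginal distributions
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pu, pv)[nz]))
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return float(mi / max(entropy(pu), entropy(pv)))
```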

Experimental results on each data set are presented in terms of classification accuracy and normalized mutual information; for the ORL Database they are summarized in Tables 1 and 2.

5.1. ORL Database

The Cambridge ORL Face Database has 400 images of 40 different people, 10 images per person. The images of some people were taken at different times, with slightly varying lighting, facial expressions (open/closed eyes, smiling/nonsmiling), and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position, with slight left-right out-of-plane rotation. To locate the faces, the input images are preprocessed: they are resized to 32 × 32 pixels with 256 gray levels per pixel and normalized in orientation so that the two eyes in the facial areas are aligned at the same position.

There are 10 images for each category in ORL, so 10 percent is just one image. For the fixed-parameter algorithms, we randomly choose two images from each category to provide the label information. Note that identical labeled cardinalities in every category render the measure in (30) meaningless. To obtain a usable Δ, we divide the 40 categories into 3 groups of 10, 20, and 10 categories. In the first 10 categories, we pick 1 image from each category to provide the label information; we pick 2 images from each category in the 20 categories of the second group; and we pick 3 from each category in the remaining categories. The dividing process is repeated 10 times, and the obtained average classification accuracy is recorded as the final result.

Figure 1 shows the classification accuracy rates and normalized mutual information on the ORL Database. Note that a fixed setting of α is used if the samples in the collection X all come from the same group. Because each category has the same number of samples, the variation of Δ is small even though we label different cardinalities of samples. Compared with the constrained nonnegative matrix factorization algorithms with fixed parameters, our algorithm gives the best performance, since the selection of α* is suited to the collection X. Table 1 summarizes the detailed classification accuracy and error bars for the three α settings. It shows that our algorithm achieves a 1.92 percent improvement over the best reported CNMFKL algorithm [15] in average classification accuracy. For normalized mutual information, the details and the error bars of our constrained algorithms are listed in Table 2. Compared with the best baseline algorithm, CNMF, our algorithm achieves a 0.54 percent improvement.

5.2. Yale Database

The Yale Database consists of 165 grayscale images of 15 individuals, 11 images per person. One image per facial expression or configuration was taken for each subject: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. We preprocess all the images of the Yale Database in the same way as the ORL Database. Each image is linearly stretched to a 1,024-dimensional vector in image space.

The Yale Database also has the same number of samples in each category. To obtain an appropriate Δ, we perform label processing similar to that for the ORL Database: divide the 15 individuals evenly into 3 groups, choose 1 image from each category in the first group, choose 2 from each category in the second, and choose 3 from each category in the remaining group. We repeat the process 10 times and record the average classification accuracy as the final result.

Figure 2 shows the classification accuracy and normalized mutual information on the Yale Database. A special setting is used when k = 10, since the samples in the collection X then come from only two groups; that is, 15 images are chosen from the 10 categories of the first and second groups. The algorithm with α* achieves an extraordinary performance in all cases, with the fixed-α variants following. This suggests that the constrained nonnegative matrix factorization algorithm attains higher classification accuracy when the value of α is close to the selected α*. Compared with the best reported CNMFKL algorithm, α* achieves a 2.42 percent improvement in average classification accuracy. For normalized mutual information, α* achieves a 7.2 percent improvement compared with CNMF. The details of classification accuracy and normalized mutual information are provided in Tables 3 and 4, which contain the error bars for the three α settings.

5.3. Caltech 101 Database

The Caltech 101 Database, created at Caltech, contains images of 101 different object categories. Each category contains about 31 to 800 images, with a total of 9,144 samples of size roughly 300 × 200 pixels. This database is particularly challenging for learning a basis representation of image information, because the number of training samples per category is exceedingly small. In our experiment, we select the 10 largest categories (3,044 images in total), excluding the background category. To represent the input images, we preprocess them using codewords generated from SIFT features [34]. We obtain 555,292 SIFT descriptors and generate 500 codewords. By assigning the descriptors to the closest codewords, each image in the Caltech 101 database is represented by a 500-dimensional frequency histogram.
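The histogram step of this pipeline is easy to state in code; the sketch below assumes the 500 codewords were obtained beforehand (e.g., by k-means on the pooled SIFT descriptors) and performs only the nearest-codeword assignment:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each local descriptor (e.g., a 128-d SIFT vector) to its
    nearest codeword and return the normalized frequency histogram."""
    # squared Euclidean distances: |d|^2 - 2 d.c + |c|^2
    d2 = (np.sum(descriptors**2, axis=1)[:, None]
          - 2.0 * descriptors @ codebook.T
          + np.sum(codebook**2, axis=1)[None, :])
    nearest = np.argmin(d2, axis=1)
    hist = np.bincount(nearest, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)
```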

We randomly select k categories from the Faces-Easy category in the Caltech 101 database and convert the images to gray scale. The labeling process is repeated 10 times, and the obtained values of α* computed by (31) are listed in Table 7. The variation of Δ over the same categories is great. That is, selecting an appropriate α plays a critical role for a mixture of different categories in one image data set, especially when the number of samples in each category differs. The choice of α can fully reflect the effectiveness of the indicator matrix C.

Figure 3 shows the effect of the proposed algorithm with α*. The upper part of the figure shows the original samples, which contain 26 images; the middle part shows their gray-scale images; and the lower part shows the combination of the basis vectors learned with α*.

The classification accuracy results and normalized mutual information for the Faces-Easy category in the Caltech 101 database are detailed in Tables 5 and 6, which contain the error bars for the three α settings. The graphical results of classification performance are shown in Figure 4. The best performance in this experiment is achieved when the parameters listed in Table 7 are selected. In general, our method demonstrates much better classification effectiveness by choosing α*. Compared with the best reported algorithm other than ours, α* achieves a 2.4 percent improvement in average classification accuracy, and compared with the CNMF algorithm [15], α* achieves an 8.77 percent improvement in average classification accuracy. For normalized mutual information, α* achieves a 2.29 percent improvement and consistently outperforms the other algorithms.

6. Conclusion

In this paper, we present a generic constrained nonnegative matrix factorization algorithm by imposing label information as an additional hard constraint on the α-divergence-NMF algorithm. The proposed algorithm can be derived using the Karush-Kuhn-Tucker conditions, and an alternative derivation is given using the projected gradient; the agreement of the two methods theoretically guarantees the correctness of the algorithm. The image representation learned by our algorithm contains a parameter α. Since the α-divergence is a parametric discrepancy measure and the parameter α is associated with the characteristics of a learning machine, the model distribution is more inclusive as α → +∞ and more exclusive as α → −∞. The selection of the optimal value of α plays a critical role in determining the discriminative basis vectors. We provide a method to select the parameter for our semisupervised algorithm; the variation of α is caused both by the cardinalities of labeled samples in each category and by the total samples. In the experiments, we apply two fixed parameter values as well as the selected α* to analyze the classification accuracy on three image databases. The experimental results demonstrate that the algorithm with the selected α* has the best classification accuracy. However, we cannot obtain the cardinalities of labeled images exactly in many real-world applications. Moreover, the optimal value of α varies with the data set. It is still an open problem how to select the optimal α for a specific image data set.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (61375038), National Natural Science Foundation of China (11401060), Zhejiang Provincial Natural Science Foundation of China (LQ13A010023), Key Scientific and Technological Project of Henan Province (142102210010), and Key Research Project in Science and Technology of the Education Department Henan Province (14A520028, 14A520052).