Abstract

The convex nonnegative matrix factorization (CNMF) is a variation of nonnegative matrix factorization (NMF) in which each cluster is expressed as a linear combination of the data points and each data point is represented as a linear combination of the cluster centers. When there is nonlinearity in the manifold structure, both NMF and CNMF are incapable of characterizing the geometric structure of the data. This paper introduces a neighborhood preserving convex nonnegative matrix factorization (NPCNMF), which imposes an additional constraint on CNMF that each data point can be represented as a linear combination of its neighbors. Our method is thus able to reap the benefits of both nonnegative data factorization and the preservation of the manifold structure. An efficient multiplicative updating procedure is derived, and its convergence is guaranteed theoretically. The feasibility and effectiveness of NPCNMF are verified on several standard data sets with promising results.

1. Introduction

Nonnegative matrix factorization (NMF) [1, 2] has been widely used in information retrieval, computer vision, pattern recognition, and DNA gene expression analysis [3, 4]. NMF decomposes the data matrix as the product of two matrices that possess only nonnegative elements. Many researchers have noted that such a decomposition has a number of favorable properties over other similar decompositions, such as PCA. One of the most useful properties of NMF is that it usually leads to a parts-based representation because it allows only additive, not subtractive, combinations. Such a representation encodes much of the data, making it easy to interpret. NMF can be traced back to the 1970s and was studied extensively by Paatero and Tapper [5]. The work of Lee and Seung [1] brought much attention to NMF in the machine learning and data mining fields. Since then, various extensions and variations of NMF have been proposed. Li et al. [4] proposed the local nonnegative matrix factorization (LNMF) algorithm, which imposes extra constraints on the cost function to obtain more localized and parts-based image features. Hoyer [6, 7] employed sparsity constraints to improve local data representation, while nonnegative tensor factorization was studied in [8, 9] by Hazan et al. to handle data encoded as high-order tensors. All the methods mentioned above are unsupervised. Wang et al. [10] and Zafeiriou et al. [11] independently proposed Fisher-NMF, which was further studied by Kotsia et al. [12], by adding an additional constraint seeking to maximize the between-class scatter and minimize the within-class scatter in the subspace spanned by the bases.

One of the most important drawbacks of NMF and its variants is that these methods have to be performed in the original feature space of the data points, so NMF cannot be kernelized and the powerful idea of the kernel method cannot be applied to it. Ding et al. [13] proposed convex nonnegative matrix factorization (CNMF), which strives to address this problem while inheriting the strengths of NMF; it models each cluster as a linear combination of the data points and each data point as a linear combination of the cluster centers. The major advantage of CNMF over NMF is that it can be performed on any data representation, either in the original space or in a reproducing kernel Hilbert space (RKHS).

Recently, there has been a lot of interest in geometrically motivated approaches to data analysis in high dimensional spaces. When the data lie on or close to a nonlinear low dimensional manifold embedded in the high dimensional ambient space [14, 15], Euclidean distance is incapable of characterizing the geometric structure of the data, and hence traditional methods like NMF and CNMF no longer work well. Neither NMF nor CNMF exploits the geometric structure of the data; both assume that the data points are sampled from a Euclidean space. To address this problem, Cai et al. proposed graph regularized NMF (GNMF) [16] and locally consistent concept factorization (LCCF) [17], which assume that nearby data points are likely to belong to the same cluster, that is, the cluster assumption [18, 19]. The Euclidean and manifold geometry are unified through a regularization framework, which has a better interpretation from the manifold perspective.

In this paper, we introduce a novel matrix factorization algorithm, called neighborhood preserving convex nonnegative matrix factorization (NPCNMF), which is based on the assumption that if a data point can be reconstructed from its neighbors in the input space, then it can be reconstructed from its neighbors by the same reconstruction coefficients in the low dimensional subspace, that is, the local linear embedding assumption [20]. NPCNMF not only inherits the advantages of CNMF, for example, nonnegativity, but also overcomes its shortcomings, that is, the Euclidean assumption. We also propose a multiplicative algorithm to efficiently solve the corresponding optimization problem, and its convergence is theoretically guaranteed.

The rest of this paper is organized as follows. In Section 2, we briefly review NMF and CNMF. In Section 3, we introduce our algorithm and prove its convergence. Experiments on three benchmark face recognition data sets are reported in Section 4. Finally, we draw conclusions and suggest directions for future work.

2. A Review of NMF and CNMF

Nonnegative matrix factorization (NMF) factorizes the data matrix into one nonnegative basis matrix and one nonnegative coefficient matrix. Given a nonnegative data matrix $X = [x_1, \ldots, x_n] \in \mathbb{R}^{m \times n}$, each column of $X$ is a sample point. NMF aims to find two nonnegative matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{k \times n}$ which minimize the following objective function:
\[ O_F = \| X - UV \|_F^{2}, \qquad (1) \]
where $\|\cdot\|_F$ is the Frobenius norm.

The objective function $O_F$ in (1) is a joint optimization problem in the basis matrix $U$ and the coefficient matrix $V$. Although $O_F$ is not jointly convex in $U$ and $V$, it is convex with respect to each of them when the other is fixed. Therefore, it is unrealistic to expect an algorithm to find the global minimum of $O_F$. To optimize the objective, Lee and Seung [2] presented an iterative multiplicative updating algorithm as follows:
\[ u_{ik} \leftarrow u_{ik} \frac{(XV^{T})_{ik}}{(UVV^{T})_{ik}}, \qquad v_{kj} \leftarrow v_{kj} \frac{(U^{T}X)_{kj}}{(U^{T}UV)_{kj}}. \]

It has been proved that the above update steps find a local minimum of the objective function in (1).
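For concreteness, the following is a minimal NumPy sketch of these multiplicative updates (our own illustration rather than code from [1, 2]); the random initialization, the fixed iteration count, and the small constant added to the denominators are implementation choices.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for min ||X - U V||_F^2 with U, V >= 0.

    X : nonnegative data matrix of shape (m, n), one sample per column.
    k : number of basis vectors (k << min(m, n)).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))
    V = rng.random((k, n))
    for _ in range(n_iter):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # u_ik <- u_ik (X V^T)_ik / (U V V^T)_ik
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # v_kj <- v_kj (U^T X)_kj / (U^T U V)_kj
    return U, V
```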

In reality, we have $k \ll m$ and $k \ll n$. Thus, NMF essentially tries to find a compressed approximation of the original data matrix, $X \approx UV$. We can view this approximation column by column as
\[ x_j \approx \sum_{k} u_k v_{kj}, \]
where $u_k$ is the $k$th column vector of $U$. Thus, each data vector $x_j$ is approximated by a linear combination of the columns of $U$, weighted by the components of $V$.

One limitation of NMF is that the nonnegativity requirement is not applicable to applications where the data involve negative numbers. A second is that it is not clear how to effectively perform NMF in a transformed data space, so the powerful kernel method cannot be applied. To overcome these problems, Ding et al. [13] proposed the convex nonnegative matrix factorization (CNMF) algorithm, which applies to both nonnegative and mixed-sign data matrices. CNMF assumes that each basis vector can be characterized by a linear combination of the data points, while each data point can be approximated by a linear combination of all the bases. Translating these statements into mathematics, we have
\[ f_k = \sum_{i} w_{ik} x_i, \qquad (4) \]
\[ x_j \approx \sum_{k} g_{jk} f_k, \qquad (5) \]
where $w_{ik}$ is a nonnegative weight relating data point $x_i$ to the $k$th basis $f_k$ and $g_{jk}$ is a nonnegative projection value of $x_j$. Replacing $f_k$ in (5) with (4), we have
\[ x_j \approx \sum_{k} \sum_{i} g_{jk} w_{ik} x_i. \qquad (6) \]
We form the data matrix $X$ using the feature vector $x_i$ of data point $i$ as the $i$th column, the weight matrix $W = [w_{ik}]$ that defines the bases, and the projection matrix $G = [g_{jk}]$ using the projection values. From (6), we have
\[ X \approx X W G^{T}. \qquad (7) \]

Equation (7) can be interpreted as an approximation of the original data set. Minimizing the squared error between the data and its approximation leads to the CNMF objective [13]
\[ \min_{W \ge 0,\; G \ge 0} \; \| X - X W G^{T} \|_F^{2}, \qquad (8) \]
where $X \in \mathbb{R}^{m \times n}$, $W \in \mathbb{R}_{+}^{n \times k}$, and $G \in \mathbb{R}_{+}^{n \times k}$. The matrices $W$ and $G$ are updated iteratively until convergence using the following rules:
\[ G_{jk} \leftarrow G_{jk} \sqrt{ \frac{ (Y^{+}W)_{jk} + (GW^{T}Y^{-}W)_{jk} }{ (Y^{-}W)_{jk} + (GW^{T}Y^{+}W)_{jk} } }, \qquad W_{ik} \leftarrow W_{ik} \sqrt{ \frac{ (Y^{+}G)_{ik} + (Y^{-}WG^{T}G)_{ik} }{ (Y^{-}G)_{ik} + (Y^{+}WG^{T}G)_{ik} } }, \]
where $Y = X^{T}X$ and the matrices $Y^{+}$ and $Y^{-}$ are given by
\[ Y^{+} = \frac{|Y| + Y}{2}, \qquad Y^{-} = \frac{|Y| - Y}{2}, \]
respectively.
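The CNMF updates can likewise be sketched in a few lines of NumPy; this is our own illustration of the rules above, with Ypos and Yneg standing for $Y^{+}$ and $Y^{-}$ and a small constant added for numerical safety.

```python
import numpy as np

def cnmf(X, k, n_iter=300, eps=1e-10, seed=0):
    """Convex NMF of Ding et al. [13]: X ~= X W G^T with W, G >= 0.

    X may be mixed sign; only Y = X^T X enters the updates.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((n, k))
    G = rng.random((n, k))
    Y = X.T @ X
    Ypos = (np.abs(Y) + Y) / 2.0   # Y^+
    Yneg = (np.abs(Y) - Y) / 2.0   # Y^-
    for _ in range(n_iter):
        G *= np.sqrt((Ypos @ W + G @ (W.T @ Yneg @ W) + eps)
                     / (Yneg @ W + G @ (W.T @ Ypos @ W) + eps))
        W *= np.sqrt((Ypos @ G + Yneg @ W @ (G.T @ G) + eps)
                     / (Yneg @ G + Ypos @ W @ (G.T @ G) + eps))
    return W, G
```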

3. Neighborhood Preserving Convex Nonnegative Matrix Factorization

In this section, we introduce our neighborhood preserving convex nonnegative matrix factorization method, which takes the local linear embedding constraint as an additional requirement. The method presented in this paper is fundamentally motivated by neighborhood preserving embedding.

3.1. The Objective Function

Many real world data sets are actually sampled from a nonlinear low dimensional manifold embedded in the high dimensional ambient space. Both NMF and CNMF perform the factorization in the Euclidean space. They fail to discover the local geometric structure of the data space, which is essential to the clustering problem. Neighborhood preserving embedding (NPE) aims at preserving the local manifold structure. Specifically, each data point is represented as a linear combination of its neighboring data points, and the combination coefficients are stored in a weight matrix. One can then find an optimal embedding such that the combination coefficients are preserved in the low dimensional subspace.

For each data point $x_i$, we find its $p$ nearest neighbors and characterize the local geometric structure by the linear coefficients that reconstruct $x_i$ from these neighbors. The reconstruction coefficients are computed by minimizing the following objective function:
\[ \min_{S} \; \sum_{i} \Big\| x_i - \sum_{j} s_{ij} x_j \Big\|^{2}, \qquad \text{s.t. } \sum_{j} s_{ij} = 1, \]
and $s_{ij} = 0$ if $x_j \notin N_p(x_i)$, where $N_p(x_i)$ denotes the $p$-nearest neighborhood of $x_i$.
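The reconstruction weights can be computed with a standard LLE/NPE-style least squares solver; the sketch below is our own illustration, and the regularization of the local Gram matrix is a common numerical safeguard rather than part of the formulation above.

```python
import numpy as np

def reconstruction_weights(X, p=5, reg=1e-3):
    """Compute the neighborhood reconstruction weight matrix S (n x n).

    X : data matrix of shape (m, n), one sample per column.
    Row i of S holds the coefficients that reconstruct x_i from its
    p nearest neighbors; all other entries are zero.
    """
    m, n = X.shape
    S = np.zeros((n, n))
    sq = np.sum(X**2, axis=0)
    D = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
    for i in range(n):
        idx = np.argsort(D[i])
        nbrs = [j for j in idx if j != i][:p]
        Z = X[:, nbrs] - X[:, [i]]              # shift neighbors to the origin
        C = Z.T @ Z                             # local Gram matrix (p x p)
        C += reg * np.trace(C) * np.eye(p)      # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(p))
        S[i, nbrs] = w / w.sum()                # enforce the sum-to-one constraint
    return S
```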

Then the reconstruction weights $s_{ij}$ can be preserved in the dimensionality-reduced space by minimizing
\[ \sum_{i} \Big\| g_i - \sum_{j} s_{ij} g_j \Big\|^{2} = \mathrm{Tr}\big( G^{T} (I - S)^{T} (I - S) G \big) = \mathrm{Tr}\big( G^{T} M G \big), \qquad (12) \]
where $g_i$ denotes the $i$th row of $G$ (the low dimensional representation of $x_i$), $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix, $I$ is an identity matrix, and $M = (I - S)^{T}(I - S)$. By minimizing (12), we essentially formalize our intuition that if a data point can be represented by its neighbors in the original space, then it can be represented by its neighbors with the same combination coefficients in the dimensionality-reduced space.

With the neighborhood preserving constraint, we incorporate (8) and (12) and minimize the following objective function:
\[ O = \| X - X W G^{T} \|_F^{2} + \lambda\, \mathrm{Tr}\big( G^{T} M G \big), \qquad \text{s.t. } W \ge 0,\; G \ge 0, \qquad (13) \]
where $\lambda \ge 0$ is a regularization parameter controlling the contribution of the additional constraint. We call (13) neighborhood preserving convex nonnegative matrix factorization (NPCNMF). Letting $\lambda = 0$, (13) degenerates to the original CNMF.
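As a sanity check, the objective in (13) is straightforward to evaluate; the short sketch below follows the notation above ($M = (I - S)^{T}(I - S)$, with the penalty placed on $G$) and is only an illustration.

```python
import numpy as np

def npcnmf_objective(X, W, G, S, lam):
    """Evaluate ||X - X W G^T||_F^2 + lam * Tr(G^T M G), M = (I - S)^T (I - S)."""
    n = X.shape[1]
    M = (np.eye(n) - S).T @ (np.eye(n) - S)
    recon = np.linalg.norm(X - X @ W @ G.T, 'fro') ** 2
    smooth = np.trace(G.T @ M @ G)
    return recon + lam * smooth
```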

3.2. The Algorithm

We introduce an iterative algorithm to find a local minimum of the optimization problem (13). By defining $Y = X^{T}X$ and using the matrix properties $\|A\|_F^{2} = \mathrm{Tr}(A^{T}A)$, $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$, and $\mathrm{Tr}(A) = \mathrm{Tr}(A^{T})$, we can rewrite the objective function as follows:
\[ O = \mathrm{Tr}(Y) - 2\,\mathrm{Tr}\big(G^{T}YW\big) + \mathrm{Tr}\big(GW^{T}YWG^{T}\big) + \lambda\,\mathrm{Tr}\big(G^{T}MG\big). \]

This is a typical constrained optimization problem and can be solved using the Lagrange multiplier method. Let $\psi_{ik}$ and $\phi_{jk}$ be the Lagrange multipliers for the constraints $w_{ik} \ge 0$ and $g_{jk} \ge 0$, respectively, and let $\Psi = [\psi_{ik}]$ and $\Phi = [\phi_{jk}]$. The Lagrangian function is
\[ L = \mathrm{Tr}(Y) - 2\,\mathrm{Tr}\big(G^{T}YW\big) + \mathrm{Tr}\big(GW^{T}YWG^{T}\big) + \lambda\,\mathrm{Tr}\big(G^{T}MG\big) + \mathrm{Tr}\big(\Psi W^{T}\big) + \mathrm{Tr}\big(\Phi G^{T}\big). \]

The partial derivatives of $L$ with respect to $W$ and $G$ are
\[ \frac{\partial L}{\partial W} = -2YG + 2YWG^{T}G + \Psi, \qquad \frac{\partial L}{\partial G} = -2YW + 2GW^{T}YW + 2\lambda MG + \Phi. \]

Using the Karush-Kuhn-Tucker conditions $\psi_{ik} w_{ik} = 0$ and $\phi_{jk} g_{jk} = 0$, we get the following equations for $w_{ik}$ and $g_{jk}$:
\[ \big( -YG + YWG^{T}G \big)_{ik}\, w_{ik} = 0, \qquad \big( -YW + GW^{T}YW + \lambda MG \big)_{jk}\, g_{jk} = 0. \]

The corresponding equivalent formulas are as follows:

We introduce the decomposition $M = M^{+} - M^{-}$, where $M^{+} = (|M| + M)/2$ and $M^{-} = (|M| - M)/2$.

These equations lead to the following updating formulas:
\[ w_{ik} \leftarrow w_{ik} \sqrt{ \frac{ \big( Y^{+}G + Y^{-}WG^{T}G \big)_{ik} }{ \big( Y^{-}G + Y^{+}WG^{T}G \big)_{ik} } }, \qquad (20) \]
\[ g_{jk} \leftarrow g_{jk} \sqrt{ \frac{ \big( Y^{+}W + GW^{T}Y^{-}W + \lambda M^{-}G \big)_{jk} }{ \big( Y^{-}W + GW^{T}Y^{+}W + \lambda M^{+}G \big)_{jk} } }. \qquad (21) \]
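To make the procedure concrete, the sketch below implements the update loop in the form written above; it is our own illustration (initialization, iteration count, and the small constant eps are our choices), not a reference implementation.

```python
import numpy as np

def npcnmf(X, S, k, lam=100.0, n_iter=300, eps=1e-10, seed=0):
    """Multiplicative updates for min ||X - X W G^T||_F^2 + lam Tr(G^T M G).

    S is the neighborhood reconstruction weight matrix (n x n);
    M = (I - S)^T (I - S) is split into M^+ - M^-, and Y = X^T X into Y^+ - Y^-.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((n, k))
    G = rng.random((n, k))
    Y = X.T @ X
    Ypos, Yneg = (np.abs(Y) + Y) / 2.0, (np.abs(Y) - Y) / 2.0
    M = (np.eye(n) - S).T @ (np.eye(n) - S)
    Mpos, Mneg = (np.abs(M) + M) / 2.0, (np.abs(M) - M) / 2.0
    for _ in range(n_iter):
        # W update, cf. (20)
        W *= np.sqrt((Ypos @ G + Yneg @ W @ (G.T @ G) + eps)
                     / (Yneg @ G + Ypos @ W @ (G.T @ G) + eps))
        # G update with the neighborhood preserving term, cf. (21)
        G *= np.sqrt((Ypos @ W + G @ (W.T @ Yneg @ W) + lam * Mneg @ G + eps)
                     / (Yneg @ W + G @ (W.T @ Ypos @ W) + lam * Mpos @ G + eps))
    return W, G
```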

Note that the solution minimizing the criterion function is not unique. If $W$ and $G$ are a solution, then $WD$ and $GD^{-1}$ also form a solution for any positive diagonal matrix $D$. To make the solution unique, we further require that the entries of each column of $W$ sum to one, where $w_k$ is the $k$th column vector of $W$. The matrix $G$ is adjusted accordingly so that $XWG^{T}$ does not change. This can be achieved by
\[ w_{ik} \leftarrow \frac{w_{ik}}{\sum_{i} w_{ik}}, \qquad g_{jk} \leftarrow g_{jk} \sum_{i} w_{ik}. \]
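A possible implementation of this rescaling (shown here with column sums of $W$, matching the normalization written above) is the following; the product $XWG^{T}$ is left unchanged.

```python
import numpy as np

def normalize(W, G, eps=1e-10):
    """Rescale the columns of W to sum to one and compensate in G."""
    d = W.sum(axis=0) + eps      # column sums of W
    return W / d, G * d          # w_ik / sum_i w_ik,  g_jk * sum_i w_ik
```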

3.3. Convergence Analysis

In this section, we investigate the convergence of the updating formulas (20) and (21). We use the auxiliary function approach [16] to prove the convergence. Here we first introduce the definition of an auxiliary function [16].

Definition 1. $Z(h, h')$ is an auxiliary function of $F(h)$ if the conditions $Z(h, h') \ge F(h)$ and $Z(h, h) = F(h)$ are satisfied.

Lemma 2. If $Z$ is an auxiliary function for $F$, then $F$ is nonincreasing under the update
\[ h^{(t+1)} = \arg\min_{h} Z\big(h, h^{(t)}\big). \]

Proof. Consider $F\big(h^{(t+1)}\big) \le Z\big(h^{(t+1)}, h^{(t)}\big) \le Z\big(h^{(t)}, h^{(t)}\big) = F\big(h^{(t)}\big)$.

Lemma 3. For any nonnegative matrices $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{k \times k}$, $S \in \mathbb{R}^{n \times k}$, and $S' \in \mathbb{R}^{n \times k}$, where $A$ and $B$ are symmetric, the following inequality holds:
\[ \sum_{i=1}^{n} \sum_{p=1}^{k} \frac{ (A S' B)_{ip}\, S_{ip}^{2} }{ S'_{ip} } \; \ge \; \mathrm{Tr}\big( S^{T} A S B \big). \]

The correctness and convergence of the algorithm are addressed in the following.

For a given $G$, fixing $G$ and considering any element $w_{ab}$ in $W$, we use $F$ to denote the part of the objective $O$ that is relevant only to $w_{ab}$. We get

Theorem 4. One rewrites as follows: where , , , and .
Then the following function is an auxiliary function of : that is, it satisfies the requirements and . Furthermore, it is a convex function of and its global minimum is
From its minima and setting and , one recovers (20), letting , , , , and .

Proof. The function is
We find upper bounds for each of the two positive terms and lower bounds for each of the two negative terms. For the third term in , by applying Lemma 3, we obtain an upper bound
The second term of is bounded by
To obtain lower bounds for the two remaining terms, we use the inequality $z \ge 1 + \log z$, which holds for any $z > 0$, and the first term in $F$ is bounded by
The last term in is bounded by
Collecting all bounds, we obtain $Z$ as in (29). Obviously, $Z(w, w') \ge F(w)$ and $Z(w, w) = F(w)$.
To find the minimum of , we take
To find the minimum of $Z$, we take the Hessian matrix of $Z$, which is a diagonal matrix with positive entries:
Thus, $Z$ is convex. Therefore, we obtain the global minimum by setting the derivative in (36) to zero and solving. Rearranging, we obtain (30).

Theorem 5. Updating using (20) will monotonically decrease the value of the objective in (13); hence it converges.

Proof. By Lemma 2 and Theorem 4, we can get that $F\big(W^{(t+1)}\big) \le Z\big(W^{(t+1)}, W^{(t)}\big) \le Z\big(W^{(t)}, W^{(t)}\big) = F\big(W^{(t)}\big)$, so $F$ is monotonically decreasing. Since $F$ is obviously bounded below, the theorem is proved.
For a given $W$, fixing $W$ and considering any element $g_{ab}$ in $G$, we use $F$ to denote the part of the objective $O$ that is relevant only to $g_{ab}$. We get

Theorem 6. One rewrites as follows: where , , and .
Then the following function is an auxiliary function of : that is, it satisfies the requirements and . Furthermore, it is a convex function of and its global minimum is
From its minima and setting and , one recovers (21), letting , , , and .

Proof. The function is
We find upper bounds for each of the three positive terms and lower bounds for each of the three negative terms. For the third term in , by applying Lemma 3 and setting , , we obtain an upper bound
The second term of is bounded by using the inequality , which holds for any .
For the fifth term in , setting , , and , we obtain an upper bound
To obtain lower bounds for the three remaining terms, we use the inequality $z \ge 1 + \log z$, which holds for any $z > 0$, and the first term in $F$ is bounded by
The fourth term in is bounded by
The last term in is bounded by
Collecting all bounds, we obtain $Z$ as in (41). Obviously, $Z(g, g') \ge F(g)$ and $Z(g, g) = F(g)$.
To find the minimum of , we take
We have
Therefore
The Hessian matrix containing the second derivatives is a diagonal matrix with positive entries
Thus, $Z$ is convex. Therefore, we obtain the global minimum by setting the derivative of (41) to zero and solving. Rearranging, we obtain (21).

Theorem 7. Updating using (21) will monotonically decrease the value of the objective in (13); hence it converges.

Proof. By Lemma 2 and Theorem 6, we can get that $F\big(G^{(t+1)}\big) \le Z\big(G^{(t+1)}, G^{(t)}\big) \le Z\big(G^{(t)}, G^{(t)}\big) = F\big(G^{(t)}\big)$, so $F$ is monotonically decreasing. Since $F$ is obviously bounded below, the theorem is proved.

4. Experimental Results

In this section, we evaluate the performance of the proposed method on face recognition and compare it with popular subspace learning algorithms: four unsupervised ones, namely, principal component analysis (PCA) [21], neighborhood preserving embedding (NPE) [20], local nonnegative matrix factorization (LNMF) [4], and convex nonnegative matrix factorization (CNMF) [13], and one supervised algorithm, linear discriminant analysis (LDA) [21]. We use the nearest neighbor (NN) classifier in the original space as the baseline. We apply the different algorithms to obtain new representations for each chosen data set, and the NN classifier is then applied in the new representation spaces.

4.1. Data Preparation

The experiments are conducted on three data sets: the Cambridge ORL face database, the Yale face database, and the CMU PIE face database. The important statistics of these data sets are described below.

The Yale database contains 165 gray scale images of 15 individuals. All images demonstrate variations in lighting condition (left-light, center-light, right-light), facial expression (normal, happy, sad, sleepy, surprised, and wink), and with/without glasses.

The ORL database contains ten different images of each of 40 distinct subjects, thus 400 images in total. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

The CMU PIE face database contains more than 40 000 facial images of 68 people. The images were acquired over different poses, under variable illumination conditions, and with different facial expressions. In our experiment, we choose the images from the frontal pose (C27) and each subject has around 49 images from varying illuminations and facial expressions.

In all the experiments, images are preprocessed so that faces are located. Original images are first normalized in scale and orientation such that the two eyes are aligned at the same position. Then the facial areas are cropped to form the final images used in the experiments. Each cropped image is normalized to the same size, with 256 gray levels per pixel.

4.2. Parameter Settings

For each data set, we randomly divide it into training and testing sets and evaluate the recognition accuracy on the testing set. In detail, for each individual in the ORL and Yale data sets, we randomly select 2, 3, and 4 images per individual, respectively, as training samples and use the remaining images as test samples, while for each individual in the PIE data set we randomly select 5, 10, and 20 images per individual as training samples. For each partition, we repeat the experiment 20 times and report the average recognition accuracy. In general, the recognition rate varies with the dimensionality of the face subspace. The best result obtained in the optimal subspace and the corresponding dimensionality are reported for each method.
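The random split protocol can be sketched as follows; evaluate_once is a hypothetical placeholder for training one of the compared methods on a split and running the NN classifier, and is not part of the original text.

```python
import numpy as np

def random_split(labels, n_train, seed):
    """Randomly pick n_train samples per class for training, the rest for testing."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.array(train), np.array(test)

# Average accuracy over 20 random splits (evaluate_once is a placeholder):
# accs = [evaluate_once(X, labels, *random_split(labels, 3, s)) for s in range(20)]
# print(np.mean(accs))
```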

For the face recognition experiments, several parameters need to be set beforehand. For LDA, we use PCA as a first dimensionality reduction step to avoid the singularity problem; the dimensionality of the PCA step is fixed in advance, and LDA is then performed in the PCA subspace. There are two parameters in our NPCNMF and in the NPE approach: the number of nearest neighbors $p$ and the regularization parameter $\lambda$. Throughout our experiments, we empirically set the number of nearest neighbors to 5 and the regularization parameter $\lambda$ to 100.

Each testing sample $x$ is projected into the linear subspace spanned by the column vectors of the basis matrix $F = XW$, namely, $y = F^{\dagger} x$, where $F^{\dagger}$ indicates the pseudoinverse of the matrix $F$.
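A minimal sketch of this classification step (our own illustration) projects training and test samples with the pseudoinverse of $F = XW$ and applies a 1-NN classifier in the projected space.

```python
import numpy as np

def nn_accuracy(X_train, y_train, X_test, y_test, W):
    """Project onto the subspace spanned by F = X_train W, then classify by 1-NN."""
    F = X_train @ W                      # basis matrix (m x k)
    P = np.linalg.pinv(F)                # pseudoinverse, shape (k, m)
    Z_train = P @ X_train                # projections of training samples (k x n_train)
    Z_test = P @ X_test                  # projections of test samples (k x n_test)
    correct = 0
    for j in range(Z_test.shape[1]):
        d = np.linalg.norm(Z_train - Z_test[:, [j]], axis=0)
        correct += y_train[np.argmin(d)] == y_test[j]
    return correct / Z_test.shape[1]
```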

4.3. Classification Results

Tables 1, 2, and 3 show the evaluation results of all the methods on the three data sets, respectively, where the value in each entry represents the average recognition accuracy over 20 independent trials and the number in brackets is the corresponding projection dimensionality. These experiments reveal a number of interesting points.

(1) It is clear that dimensionality reduction is beneficial for face recognition. There is a significant increase in performance from using LDA, NPE, NMF, LNMF, and CNMF. However, PCA fails to improve over the baseline. This is because PCA does not encode discriminative information.

(2) The performances of the nonnegative algorithms NMF, LNMF, and CNMF are much worse than that of the supervised algorithm LDA, which shows that, without considering the labeled data, nonnegative algorithms cannot guarantee good discriminating power.

(3) Our NPCNMF algorithm outperforms all the other five methods. The reason lies in the fact that NPCNMF considers the geometric structure of the data and therefore achieves better performance than the other algorithms. This shows that, by leveraging the power of both the parts-based representation and the intrinsic geometric structure of the data, NPCNMF can learn a better compact representation in the sense of semantic structure.

5. Conclusion and Future Work

In this paper, we have presented a novel matrix factorization method, called NPCNMF, for dimensionality reduction, which respects the local geometric structure of the data. As a result, NPCNMF has more discriminating power than the ordinary NMF and CNMF approaches, which only consider the Euclidean structure of the data. Experimental results on face data sets show that NPCNMF provides a better representation in the sense of semantic structure.

Several challenges remain to be investigated in our future work.

(1) A suitable value of $\lambda$ is important to our algorithm. It remains unknown how to do model selection theoretically and efficiently.

(2) NPCNMF is currently limited to linear projections, and nonlinear techniques (e.g., kernel tricks) may further boost the algorithmic performance. We will investigate this in our future work.

(3) Another research direction is how to extend the current framework to tensor-based nonnegative data decomposition.

(4) The NPCNMF algorithm is iterative and sensitive to the initialization of $W$ and $G$. It is unclear how to choose optimal initialization parameters in a principled manner.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.