Robust Structure Preserving Nonnegative Matrix Factorization for Dimensionality Reduction
As a linear dimensionality reduction method, nonnegative matrix factorization (NMF) has been widely used in many fields, such as machine learning and data mining. However, there are still two major drawbacks for NMF: (a) NMF can only perform semantic factorization in Euclidean space, and it fails to discover the intrinsic geometrical structure of high-dimensional data distribution. (b) NMF suffers from noisy data, which are commonly encountered in real-world applications. To address these issues, in this paper, we present a new robust structure preserving nonnegative matrix factorization (RSPNMF) framework. In RSPNMF, a local affinity graph and a distant repulsion graph are constructed to encode the geometrical information, and noisy data influence is alleviated by characterizing the data reconstruction term of NMF with -norm instead of -norm. With incorporation of the local and distant structure preservation regularization term into the robust NMF framework, our algorithm can discover a low-dimensional embedding subspace with the nature of structure preservation. RSPNMF is formulated as an optimization problem and solved by an effective iterative multiplicative update algorithm. Experimental results on some facial image datasets clustering show significant performance improvement of RSPNMF in comparison with the state-of-the-art algorithms.
With the development of data acquisition technology, high-dimensional data is ubiquitous in many areas. Directly learning in original data space is prone to disturbance by noise; meanwhile, it is hard to discover the latent semantic structure , so low-dimensional representation learning becomes a fundamental problem in numerous data analysis algorithm [2–10]. As a common used tool for dimensionality reduction, matrix factorization has received more and more attention in recent years [2, 3, 11]. Unlike typical matrix factorization methods, such as QR and SVD, which seek for exact factorization of original matrix, nonnegative matrix factorization (NMF) approximately factorizes the original matrix into two low-dimensional nonnegative submatrixes. It is the nonnegative constraint which assures NMF can learn a sparse and part-based representation of the sample matrix. Psychological and physiological evidences have proved that NMF is consistent with how the brain interprets an object in a part-based, nonsubtractive way [12–14]. Recently, NMF has exhibited great success in a variety of applications, like text analysis , image processing [16, 17], and video representation .
In order to detect the latent geometrical structure hidden in high-dimensional data space, many manifold learning algorithms have been proposed, for instance, isometric feature mapping (ISOMAP) , locally linear embedding (LLE) , Laplacian eigenmaps (LE) , and local tangent space alignment (LTSA) . Motivated by the benefit of NMF and manifold learning, a graph regularized NMF (GNMF) method is proposed in . Although GNMF displays some improvement on clustering performance, it suffer greatly from the following two problems in real-world applications: (a) GNMF only consider the ideal case where the input data is noise-free. When the input data samples is corrupted by noise, GNMF even leads to an inaccurate matrix factorization. (b) GNMF overemphasizes the affinity relationship between neighboring data samples; the repulsion relationship between data samples which is favorable for clustering is even neglected thoroughly.
To address the abovementioned issues of GNMF, in this paper, we propose a new robust structure preserving nonnegative matrix factorization (RSPNMF) method. In RSPNMF, the data reconstruction term is formulated in the form of robust -norm instead of -norm, which will improve the model robustness to noisy data; meanwhile, the rotation invariance property of the model is also preserved. The success of GNMF relies on the assumption that two neighboring data samples share the same semantic label; it characterizes the compactness relationship among neighboring data samples. Inspired by this observation, we propose a new assumption; that is, data samples which are remote are generally semantic difference; and then we constrain that the low-dimensional representation of distant data samples in the original data space should also be far apart in embedding subspace. We consider that the preservation of neighboring data sample’s compactness together with distant data sample’s repulsion is helpful to describe the intrinsic geometrical structure of data distribution.
The contributions of this paper are presented as follows:(1)A novel nonnegative matrix factorization algorithm, named RSPNMF, is proposed, in which the intrinsic manifold structure is approximated with the local affinity and distant repulsion graphs. By incorporating the corresponding regularization term into robust NMF framework, our proposed algorithm can find the latent subspace of the original matrix with nature of manifold structure learning.(2)We formulate our approach as an optimization problem and provide an iterative multiplicative update algorithm to solve it. The convergence proof of the proposed algorithm is also provided.(3)Our experiments on several datasets for clustering validate our proposed algorithm. Compared with several state-of-the-art methods, the results of our RSPNMF show the better performance.
The rest of the paper is organized as follows. In Section 2, we briefly review the background of NMF and related works. In Section 3, our RSPNMF approach and the convergence proof of optimization scheme are presented. Some experimental results and comparison with other approach are presented in Section 4. We end this paper with some conclusions and discussion in Section 5.
2. Related Works
Given a nonnegative matrix , is a data sample. NMF attempts to find two low-dimensional nonnegative submatrixes whose product provides a best approximation to the original matrix ; that is, With the purpose of finding a compact factorization for the original data matrix, the number of basis component usually is set less than and .
One common data reconstruction function used to quantify the approximation of NMF is formulated as follows:where denotes the Frobenius norm.
Due to no-convexity of the objective function in (2) on both and , it is unrealistic to expect an algorithm to find its global minimum. An iterative multiplicative update algorithm which can find the local minima of (2) is proposed in , which are As the update algorithm allows only additive combination among different basis components, it is believed that NMF can learn a sparse and part-based representation .
Inspired by the work exhibited in , many of NMF variants have been proposed over the past years. In order to explicitly impose sparseness constraint, sparseness constrained NMF is proposed in [9, 24–28]; these works expect more sparse representation than the original NMF. To characterize different residual error model, two new distance metrics are introduced in [29, 30]. Similar to SVD presented in , a nonnegative matrix can also be factorized into three nonnegative components. In , a fast nonnegative matrix trifactorization is proposed, and then orthogonality constrained nonnegative matrix trifactorization is studied in . Different from unsupervised NMF variants, a semisupervised nonnegative matrix decomposition algorithm is proposed in , where the label information of data samples was introduced to NMF as an additional constraint. To improve discriminating power of MMF, in [33, 34], Fisher’s criterion as an additional regularization term is incorporated into NMF framework. In spite of the success of the abovementioned NMF variants in various scenarios, all of them perform matrix factorization on Euclidean space.
In [39, 40], a dual regularized seminonnegative matrix trifactorization is proposed, in which data samples and feature samples manifold structure are also considered. Although the graph-based NMF algorithm incorporates the prior knowledge of manifold structure into NMF framework, overemphasizing local structure preservation sometimes will degrade its performance. A structure preserving nonnegative matrix factorization (SPNMF) is proposed in , in which the intrinsic manifold is approximated with local and distant graphs. Experiment results demonstrate SPNMF outperforms GNMF and its variants.
All the abovementioned NMF variants are based on the premise that the input data are noise-free; it is impractical to real-world data samples which are often corrupted by noise. For boosting NMF robustness to outliers, -norm based NMF reconstruction function is applied in [41, 42]. In , -norm and -norm are used to formulate the graph embedding and data reconstruction function, which endows the algorithm with robustness to unreliable graphs and noisy labels.
Among the existing methods, SPNMF is closely relevant to our method and performs better on some specific datasets. There are still some differences between our proposed algorithm and SPNMF. Firstly, the Laplacian matrix of distant repulsion graph from SPNMF needs to be updated in each iteration, which is time consuming. Additionally, to learn a better parts-based representation, the basis vector of SPNMF is required to be as orthogonal as possible. Finally, SPNMF is sensitive to noisy data points.
3. The Proposed Method
Our goal is to perform matrix factorization on the hidden semantic subspace. To this end, we impose the local affinity and distant repulsion structure preservation on the new robust NMF framework. In this section, we will first explain the structure preservation and robust data reconstruction terms of RSPNMF, followed by its optimization algorithm. The convergence of the proposed algorithm is proved finally.
3.1. Problem Formulation
Despite the diverse motivations of various dimensionality reduction algorithms, most of them can be explained within a graph embedding framework. Let be an undirected weighted graph, where vertex set corresponds to a dataset and is an affinity matrix whose elements measure the similarity of each pair of vertices. In graph embedding algorithm, graph characterizes the prior knowledge of geometric structure from data distribution.
If we use the -nearest neighboring graph to characterize the local structure of data distribution, the weight matrix can be defined as follows:where is the bandwidth parameter and denotes the set of nearest neighbors of .
Local invariance assumption which encourages neighboring data pairs in the original space to be still close in the low-dimensional embedding subspace can be formulated aswhere is the corresponding low-dimensional representation of .
By defining the diagonal matrix and Laplacian matrix of graph are as we can rewrite (5) aswhere denotes the trace of a matrix.
Equation (7) essentially holds the smoothness of dimensionality reduction process. It is based on two vital assumptions; firstly, the neighboring data samples in high-dimensional space are semantic similarity, and secondly the preservation of affinity structure plays great role for low-dimensional representation.
Local invariance assumption essentially exploits the favorite relationship among similar data samples under unsupervised condition; however, it ignores unfavorite relationship between divergent data pairs. In this paper, we conjecture that the distant data pairs are always semantic difference. A new distant neighboring graph which is used to describe the repulsion relationship between dissimilar data pairs is also constructed. The corresponding weight matrix is defined as follows:where denotes the remotest data samples of in givensample dataset.
From (8), we can we can find that the more the distance between and , the smaller the value of . If we want the corresponding low-dimensional representation and are also faraway, the value of must be as large as possible. Inspired by this observation, we propose a new distant repulsion assumption, which emphasizes that the low-dimensional counterpart of distant data pairs is also far apart in embedding subspace. This distant repulsion constraint is characterized by the objective function defined in
The distant repulsion assumption in (9) can maintain the repellent relationship between each remote data pair during dimensionality reduction, which can be rewritten aswhere and are the diagonal matrix and Laplacian matrix of graph , respectively, and , for any given .
With the assumption that semantic label of distant data pairs is certainly different, (10) actually emphasizes the unfavorite relationship among dissimilar data samples.
Although (7) and (10) together can better preserve the manifold structure of data distribution, their optimized object is opposite, one for minimum and the other for maximum. To address this issue, we divide the basis matrix into two parts; that is,correspondingly, the coefficient matrix is also partitioned into two parts:Then, the original matrix can be reconstructed with and in an additive manner:where , , , , and . and are the coefficient matrix of the original matrix on and , respectively.
If we restrict the fact that the submatrix retains the property of graph and avoids the property of graph , then submatrix is the low-dimensional representation of matrix with local invariance assumption. With the added constraint that the property of basic matrices and is consistent, (7) can be translated asOn the contrary, we assume that submatrix preserves the property of graph and avoids the property of graph ; meanwhile, we constrain the fact that the property of matrix is complementary to matrix . This will lead to the fact that the calculation of finding maximum value in (11) is transformed into the calculation of finding minimum value showed in
In (2), the error of each data sample entered the objective function of NMF with squared residue, a few outliers with large errors will easily dominate the objective function. In this paper, we employ -norm to define the data reconstruction term of NMF framework. Compared with -norm, -norm can keep the feature rotation invariance together with minimizing outlier impact. The -norm of matrix can be defined ascorrespondingly, the robust loss function of NMF can be formulated as
With the purpose of finding a robust low-dimensional representation with the nature of structure preservation, we explicitly incorporate (14) and (15) into the robust NMF framework. The objective function of the proposed RSPNMF is described as follows:From (18), we can find that RSPNMF divides the ultimate matrix factorization framework into two parts, one for structure preservation and the other for data reconstruction. The constants and are the tradeoff parameters to balance the contribution of different regularization terms. When , RSPNMF degenerates to GNMF . When and , RSPNMF even degrade to the same form of NMF .
Because the objective function showed in (18) is not convex on both and together, in the following, we will manage to achieve its local minima with multiplicative iterative update algorithm.
Let and be the Lagrange multiplier for constraints and , respectively, and and are corresponding Lagrange multiplier matrix and the Lagrangian function of can be expressed as
The partial derivatives of with respect to and , are as follows:where and is a column vector with all elements equal to one.
Using the KKT conditions , , and , we obtain the following multiplicative update rules for RSPNMF:
To avoid the scale transfer problem which was pointed out in , we normalize and as follows:
3.3. Convergence Analysis
Due to the objective function shown in (18) is nonconvex, only local optimal solutions can be obtained. In this section, we use the property of auxiliary function approach to prove the objective function is nonincreasing under the update formulas in (21) and (22).
Convergence Analysis for
Lemma 1. If there exists an auxiliary function for which satisfies and , then is nonincreasing under the update
In order to prove the objective function is nonincreasing under the update formula in (21), we only need to construct a proper auxiliary function which satisfies and , where denotes the part of which is only relevant to and is the th element of .
The Taylor series expansion of can be formulated aswhere
Lemma 2. Functionis an auxiliary function for which satisfies and .
Proof. Obviously, is true; we only need to prove . For this purpose, we compare with the Taylor series expansion of .
Sincewe can find that , and then we can conclude that .
Let ; there must be , which makes attain the minima, so we can obtain that
Convergence Analysis for . Similar to the convergence analysis for , the first and the second derivate of with respect to can be described as
The Taylor series expansion of is as follows:
Now, we construct a new auxiliary function for : and we prove which satisfies and
Proof. Obviously, is true; we only need to prove .
Since we can deduce that
Let ; we can find thatFinally, we have
The convergence analysis for is similar to the convergence analysis for , and it is omitted.
Computational Complexity Analysis. With the update rules in (21) and (22), it is not hard to calculate the extra computational cost of our proposed algorithm in each iteration, and we summarize the result in Table 1. It is important to note that matrix is sparse, because the local graph is constructed with neighbors. The average nonzero element on each row of is . Thus, we only need and flam (a floating point addition and multiplication) to compute .
Besides the multiplicative updates, RSPNMF also needs to construct the local affinity graph and distant repulsion graph . If the multiplicative updates stop after iterations, the overall cost for RSPNMF is . The overall cost for GNMF is and the overall cost for SPNMF is .
4. Experimental Results
In this section, we carry out several experiments to examine the clustering performance of our approach in comparison with some state-of-the-art methods.
4.1. Datasets Description
Three image datasets which are widely employed in dimensionality reduction literatures are used in our experiments.
(1) ORL Dataset. The ORL dataset contains 400 face images of 40 subjects under different lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (with glasses or no glasses). All these images were taken against a dark homogeneous background with the subjects in an upright, frontal position. In our experiments, original images are firstly normalized in scale and orientation, such that two eyes are aligned at the same position. Then, the facial areas are cropped into the final images for clustering. Finally, each image is normalized to array of 32 × 32 pixels and reshaped to a long vector with 256 gray levels per pixel.
(2) YALE Dataset. The YALE dataset includes 165 face images collected from 15 volunteers. Each image has different facial expression and facial details. The same preprocessing as for the ORL dataset is implemented on YALE dataset.
(3) FERET Dataset. For the FERET dataset, we use 70 subjects with 6 facial images for each subject. Each image is also represented by a 1024-dimensional vector in image space.
The statistic property of these datasets is listed in Table 2, which includes the number of samples, the dimension of features, and the number of object classes.
As showed in Figure 1, to evaluate the robustness of mentioned algorithms to noisy data, all the images were occluded with local and global noise. The local noise consists of occlusion patch of 12 × 12 pixels with random black and white dots. The global noise is replaced by Gaussian noise with mean 0 and variance 36.
4.2. Compared Methods
To investigate the clustering performance of our proposed approach, we compared it with the following methods:(i)Canonical -means (means in short) clustering method as a baseline.(ii)-means clustering in the principle component subspace (PCA in short).(iii)Nonnegative matrix factorization based clustering (NMF in short). We use the -norm formulation.(iv)Graph regularized nonnegative matrix factorization based clustering with -norm formulation (GNMF in short).(v)Structure preserving nonnegative matrix factorization based clustering (SPNMF in short).
Except the means, for each image, a new low-dimensional representation can be derived from the matrix factorization, and then we perform clustering by using -means in the new embedding subspace.
4.3. Evaluation Metrics
To evaluate the clustering performance of all involved methods, we adopted two popular metrics, the accuracy (AC) and the normalized mutual information (NMI), as used in .
The accuracy indicates one-to-one relationship between obtained cluster label and ground truth label. The accuracy is defined as follows:where is the delta function which judges whether the ground truth is equal to the obtained label and is a permutation mapping function that maps each obtained cluster label to the equivalent label from the dataset.
The second measure metrics is the normalized mutual information (NMI), which is used for determining the quality of clustering.
Let and denote the sets of cluster from the ground truth and the clustering method, respectively. The mutual information metric is defined as follows:where and denote the extent of a sample belonging to the clusters and , respectively. is a joint probability that an arbitrarily selected sample belongs to as well as at the same time. The NMI can be defined as follows:where and are the entropies of and , respectively. It can be seen that ranges from 0 to 1. The larger the is, the better the clustering result will be.
4.4. Clustering Results and Analysis
In order to fair examine the clustering performance of all mentioned methods, 20 test runs are conducted on the optimal parameter setting, and then the means and standard deviations are recorded. In each test, means is applied 20 times with different initialization and the best result is reported. All clustering performances on the three noise contaminated datasets are reported in Tables 3 and 4.
From these results, we can derive the following observations: (a) SPNMF outperforms means, PCA, and NMF in most cases. This demonstrates the importance of structure preservation in discovering the hidden factors. In addition, the performance of SPNMF is superior to GNMF. This reveals that, by leveraging the ratio of local affinity and distant repulsion regularization, SPNMF can learn a better low-dimensional representation. (b) The performance of RSPNMF exceeds SPNMF; this demonstrates the robustness of -norm in handling noise data samples.
4.5. Parameter Sensitiveness
Our RSPNMF has five essential parameters: the number of nearest neighbors , the number of distant neighbors , the regularization parameter , the regularization parameter , and the number of basis vectors . For all mentioned dimensionality reduction algorithms, the number of basis vectors is equal to the number of clusters. The complement space and the original space reduced dimension is set to . For all graph-based methods, the number of nearest neighbors is set to . For SPNMF and RSPNMF, the number of distant points is also set to . For GNMF, the regularization parameter is set by the grid . For SPNMF and RSPNMF, we tune and by searching the grid . For SPNMF, is set to 1.
The result of parameter sensitiveness on different datasets with local and global noise is showed in Figure 2. In order to compare with GNMF, we fix the parameter and tune the parameter . From Figure 2, we can find that the performance of RSPNMF and SPNMF is superior to GNMF; meanwhile, they are all stable with respect to the change of parameter . This demonstrates the importance of the distant repulsion preservation on characterizing the intrinsic distribution of data samples. Besides, RSPNMF still outperforms SPNMF when range within . This is because the robust data reconstruction term of RSPNMF alleviates the noisy data impact on clustering performance.
The relevant samples selection plays an important role in all graph regularized NMF method. For GNMF, which only use -nearest neighboring graph to capture the local geometric structure of data distribution; meanwhile, assume that the neighboring samples are with the same semantic label. For noise corrupted data samples, this assumption may not hold. In particular, as increases, clustering performance of GNMF is more likely to fail as shown in Figure 3. For RSPNMF and SPNMF, the local and distant data samples relationship is also considered. This will definitely reinforce the algorithm robustness to -nearest neighboring samples increase. In addition, from Figure 3 we can find that the introduction of -norm improves RSPNMF clustering results.
4.6. Convergence and Computational Cost
The effectiveness of the proposed RSPNMF has been demonstrated by the above clustering experiments. Here, we evaluate the efficiency of RSPNMF, that is, how fast the update rules in (21) and (22) are converged.
Figure 4 shows the convergence curves of RSPNMF on all the three datasets. For each figure, the -axis is the value of objective function and the -axis denotes the iteration number. From the convergence curves, it can be observed that the convergence rate of RSPNMF is fast, and the runtime cost of RSPNMF is acceptable.
In this paper, we proposed a novel nonnegative matrix factorization approach named RSPNMF. RSPNMF performs the subspace learning with the nature of manifold structure preservation and robust data reconstruction. As a result, RSPNMF is more discriminating than GNMF which only considers the local structure preservation of the clean data. Experimental results on face image clustering show that RSPNMF provides a better representation in the sense of semantic subspace.
The superiority of RSPNMF arises in the following two aspects: (a) the objective function of RSPNMF uses the -norm as the discrepancy measure and therefore alleviates the outlier issue that is commonly existing in image clustering. (b) The regularization term of RSPNMF incorporates the local and distant geometric information, which leads the nonnegative matrix factorization to be performed on the specific manifold. The added geometric information generally improves the clustering results.
The authors declare that they have no competing interests.
This work was supported by the Natural Science Foundation of China under Grants 61333019 and 61303168.
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley-Interscience, Hoboken, NJ, USA, 2000.
X. He and P. Niyogi, “Locality preserving projections,” Advances in Neural Information Processing Systems, vol. 45, no. 1, pp. 186–197, 2005.View at: Google Scholar
W. Xu, X. Liu, and Y. Gong, “Document clustering based on non-negative matrix factorization,” in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval (SIGIR '03), pp. 267–273, Toronto, Canada, July 2003.View at: Publisher Site | Google Scholar
D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proceedings of the 14th Annual Neural Information Processing Systems Conference (NIPS '00), Denver, Colo, USA, December 2000.View at: Google Scholar
P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” Journal of Machine Learning Research, vol. 5, no. 12, pp. 1457–1469, 2004.View at: Google Scholar
J. Kim and H. Park, Sparse Nonnegative Matrix Factorization for Clustering, Georgia Institute of Technology, 2008.
S. Z. Li, X. W. Hou, H. J. Zhang, and Q. S. Cheng, “Learning spatially localized, parts-based representation,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '01), pp. I-207–I-212, IEEE, Kauai, Hawaii, USA, December 2001.View at: Google Scholar
C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix t-factorizations for clustering,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), pp. 126–135, Philadelphia, Pa, USA, August 2006.View at: Publisher Site | Google Scholar
M. E. Wall, A. Rechtsteiner, and L. M. Rocha, “Singular value decomposition and principal component analysis,” Physics, vol. 5, pp. 91–109, 2002.View at: Google Scholar
H. Wang, F. Nie, H. Huang, and F. Makedon, “Fast nonnegative matrix tri-factorization for large-scale data co-clustering,” in Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI '11), pp. 1553–1558, Barcelona, Spain, July 2011.View at: Publisher Site | Google Scholar
R. Zhi, M. Flierl, Q. Ruan, and W. B. Kleijn, “Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 1, pp. 38–52, 2011.View at: Publisher Site | Google Scholar
Q. Gu and J. Zhou, “Transductive classification via dual regularization,” in Machine Learning and Knowledge Discovery in Databases, W. Buntine, M. Grobelnik, D. Mladenić, and J. Shawe-Taylor, Eds., vol. 5781 of Lecture Notes in Computer Science, pp. 439–454, 2009.View at: Publisher Site | Google Scholar