Abstract

Nonnegative matrix factorization (NMF) is a popular tool for analyzing the latent structure of nonnegative data. For a positive pairwise similarity matrix, symmetric NMF (SNMF) and weighted NMF (WNMF) can be used to cluster the data. However, neither of them is very effective for an ill-structured pairwise similarity matrix. In this paper, a novel model, called relationship matrix nonnegative decomposition (RMND), is proposed to discover the latent clustering structure from the pairwise similarity matrix. The RMND model is derived from nonlinear NMF algorithms. RMND decomposes a pairwise similarity matrix into a product of three low-rank nonnegative matrices: the pairwise similarity matrix is represented as a transformation of a positive semidefinite matrix which pops out the latent clustering structure. We develop a learning procedure based on multiplicative update rules and the steepest descent method to calculate the nonnegative solution of RMND. Experimental results on four different datasets show that the proposed RMND approach achieves higher clustering accuracy.

1. Introduction

Nonnegative matrix factorization (NMF) [1] has been introduced as an effective technique for analyzing the latent structure of nonnegative data such as images and documents. A variety of real-world applications of NMF have been found in areas such as machine learning, signal processing [2–4], data clustering [5, 6], and computer vision [7].

Most applications focus on the clustering aspect of NMF [8, 9], in which each sample is represented as a linear combination of clustering centroids. Recently, a theoretical analysis has shown the equivalence between NMF and k-means/spectral clustering [10]. Symmetric NMF (SNMF) [10] is an extension of NMF. It aims at learning the clustering structure from a kernel matrix or pairwise similarity matrix that is positive semidefinite. When the similarity matrix is not positive semidefinite, SNMF is not able to capture the clustering structure contained in the subspace associated with negative eigenvalues. In order to overcome this limitation, weighted NMF (WNMF) [10] was developed. In the WNMF model, the indefiniteness of the pairwise similarity matrix is passed onto a specific low-rank matrix, and WNMF improves the clustering performance of SNMF. When a portion of the data is labeled, it is desirable to incorporate the class label information into WNMF in order to improve the clustering performance. To this end, a semisupervised NMF (SSNMF) [11] has been studied which incorporates domain knowledge into WNMF to extract more clustering structure information.

In SNMF, WNMF, and SSNMF, a low-rank approximation to the pairwise similarity matrix is used, and the goal is to learn the latent clustering structure by minimizing the reconstruction error. However, since there is no prior knowledge about the data, the kernel matrix is often obtained from pairwise Euclidean distances in the high-dimensional space and is therefore sensitive to unexpected noise. Consequently, minimizing a purely reconstruction-based objective of the form used by SNMF, WNMF, and SSNMF may produce undesirable clustering performance. In this paper, we present a novel model, called relationship matrix nonnegative decomposition (RMND), for data clustering tasks. The RMND model is derived from nonlinear NMF algorithms which take advantage of kernel functions in the high-dimensional feature space. RMND decomposes a pairwise similarity matrix into a product of a positive semidefinite matrix, a distribution matrix of similarity on latent features, and an encoding matrix. The positive semidefinite matrix pops out the clustering structure and, through an appropriate transformation, is treated as a more convincing pairwise similarity matrix; in this way RMND learns the correct relationship matrix adaptively. Furthermore, owing to the positive semidefiniteness, the SNMF formulation is incorporated into RMND, and a more tractable representation of the pairwise similarity matrix is obtained. We develop a learning procedure for RMND to discover the latent clustering structure. Experimental results show that the proposed RMND leads to significant improvements in clustering performance.

The rest of the paper is organized as follows. In Section 2, we briefly review SNMF and WNMF. In Section 3, we present the proposed RMND model and its learning procedure. Experimental results on several datasets are shown in Section 4. Finally, conclusions and final remarks are given in Section 5.

2. Symmetric NMF (SNMF) and Weighted NMF (WNMF)

A pairwise similarity matrix $K \in \mathbb{R}^{n \times n}$ is a nonnegative matrix, since the pairwise similarities between different objects cannot be negative. For the linear kernel case, $K = X^{T}X$ is the standard inner-product kernel matrix, where $X$ is a nonnegative data matrix of size $m \times n$; the construction extends to any other kernel. The NMF technique is powerful for discovering the latent structure in $K$. Since $K$ is a symmetric matrix, Ding et al. [10] introduced the SNMF model as follows:

$$\min_{H \geq 0} \left\| K - HH^{T} \right\|_{F}^{2}, \tag{2.1}$$

where $H$ is a nonnegative matrix of size $n \times k$ whose rows denote the degrees of the samples related to the centroids of clusters.

In (2.1), $HH^{T}$ is a positive semidefinite matrix. When the similarity matrix $K$ is indefinite, $K$ has negative eigenvalues, and $HH^{T}$ cannot provide a good approximation because it cannot absorb the subspace associated with the negative eigenvalues. For kernel matrices, the decomposition in (2.1) is feasible; however, a large number of similarity matrices are nonnegative but not positive semidefinite. Ding et al. [10] therefore introduced an improved factorization model,

$$\min_{H \geq 0,\, S \geq 0} \left\| K - HSH^{T} \right\|_{F}^{2}, \tag{2.2}$$

where $S$ is a nonnegative matrix of size $k \times k$ which inherits the indefiniteness of $K$. The detailed update rules for $H$ and $S$ can be found in [10].
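
For concreteness, the following NumPy sketch shows gradient-ratio multiplicative updates for the two objectives above (assuming a symmetric nonnegative $K$). The update forms are obtained directly from the gradients of $\|K - HH^{T}\|_{F}^{2}$ and $\|K - HSH^{T}\|_{F}^{2}$; the damped updates and convergence analysis in [10] differ in detail, so this is an illustration rather than a reproduction of that algorithm.

```python
import numpy as np

def snmf(K, k, n_iter=500, eps=1e-9, seed=0):
    """Sketch of SNMF: K ~= H H^T with H >= 0, via gradient-ratio updates."""
    rng = np.random.default_rng(seed)
    H = rng.random((K.shape[0], k)) + eps
    for _ in range(n_iter):
        # ratio of the negative to the positive part of the gradient
        H *= (K @ H) / (H @ (H.T @ H) + eps)
    return H

def wnmf(K, k, n_iter=500, eps=1e-9, seed=0):
    """Sketch of WNMF: K ~= H S H^T with H and S nonnegative entrywise."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    H = rng.random((n, k)) + eps
    S = rng.random((k, k)) + eps
    for _ in range(n_iter):
        HtH = H.T @ H
        H *= (K @ H @ (S + S.T)) / (H @ S @ HtH @ S.T + H @ S.T @ HtH @ S + eps)
        HtH = H.T @ H
        S *= (H.T @ K @ H) / (HtH @ S @ HtH + eps)
    return H, S
```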

3. Relationship Matrix Nonnegative Decomposition (RMND)

3.1. Proposed RMND Model

Both SNMF and WNMF are powerful methods for learning the clustering structure from a pairwise similarity matrix or a kernel matrix. If the data from different latent classes are well separated, $K$ is approximately a block diagonal matrix, and it is easy to find the clustering structure by SNMF and WNMF: the approximations $HH^{T}$ and $HSH^{T}$ are then also approximately block diagonal, and SNMF and WNMF learn a good approximation to $K$. This is the simplest case in data clustering. In many real applications, however, data from different latent classes overlap and corrupt each other, so $K$ is not a block diagonal matrix even when the samples are rearranged appropriately. By minimizing the reconstruction errors in (2.1) and (2.2), SNMF and WNMF then may not find a favorable clustering structure, because $HH^{T}$ and $HSH^{T}$ would need to be approximately block diagonal for a good clustering. Consequently, it is desirable to build a new model that finds the correct clustering structure and approximates $K$ well at the same time.

Recently, NMF has been extended to nonlinear nonnegative component analysis algorithms (referred to as KNMF) by Zafeiriou and Petrou [12]. KNMF is proposed to model efficiently the nonlinearities that are present in most real-life applications; the idea is to perform NMF in a high-dimensional feature space. Specifically, KNMF finds a set of nonnegative weights and nonnegative basis vectors such that the nonlinearly mapped training vectors can be written as linear combinations of the nonlinearly mapped nonnegative basis vectors. Let $\phi: \mathbb{R}_{+}^{m} \to \mathcal{F}$ be a mapping that projects the data to a Hilbert space $\mathcal{F}$ of arbitrary dimensionality. KNMF attempts to find a set of basis vectors $\{w_{j}\}_{j=1}^{k}$ and a set of nonnegative weights $\{h_{ji}\}$ such that

$$\phi(x_{i}) \approx \sum_{j=1}^{k} h_{ji}\, \phi(w_{j}), \quad i = 1, \ldots, n, \tag{3.1}$$

where $w_{j} \geq 0$ and $h_{ji} \geq 0$. The nonlinear mapping $\phi$ is related to a kernel function $k(\cdot, \cdot)$ through the operation $k(x, y) = \phi(x)^{T}\phi(y)$. The detailed algorithms for directly learning $w_{j}$ and $h_{ji}$ can be found in [12].

In this paper, we focus on the convex nonlinear nonnegative component analysis algorithm (referred to as CKNMF) in [12]. Instead of finding both $w_{j}$ and $h_{ji}$ simultaneously, Zafeiriou and Petrou followed similar lines as convex-NMF [9] and assumed that each centroid $\phi(w_{j})$ lies in the space spanned by the mapped data $\phi(x_{1}), \ldots, \phi(x_{n})$. Formally, $\phi(w_{j})$ can be written as

$$\phi(w_{j}) = \sum_{i=1}^{n} g_{ij}\, \phi(x_{i}), \quad g_{ij} \geq 0, \quad \sum_{i=1}^{n} g_{ij} = 1, \tag{3.2}$$

which means that the centroid $\phi(w_{j})$ can be interpreted as a convex weighted combination of the mapped data points. Using (3.2), approximation (3.1) is reformulated in matrix form as

$$K \approx KGH, \tag{3.3}$$

where $K$ is the kernel matrix with entries $K_{ij} = k(x_{i}, x_{j})$, $G = [g_{ij}]$ is an $n \times k$ nonnegative matrix whose columns sum to one, and $H = [h_{ji}]$ is a $k \times n$ nonnegative matrix. Equation (3.3) provides a new decomposition of the kernel matrix in which each matrix has an explicit interpretation: $K$ is the relationship matrix between different objects based on a certain kernel function, each column of $G$ denotes a relationship distribution on a certain latent feature according to the property of convex combinations, and $H$ is the encoding coefficient matrix. In particular, we rewrite (3.3) in entry form as

$$K_{ij} \approx \sum_{l=1}^{k} \left( \sum_{p=1}^{n} K_{ip}\, g_{pl} \right) h_{lj}. \tag{3.4}$$

It can be noted from (3.4) that $\sum_{p} K_{ip}\, g_{pl}$ represents the weighted average relationship measure, correlated to object $i$, on the $l$th latent feature; the relationship measure between objects $i$ and $j$ is then a linear combination of these weighted average relationship measures on the latent features.
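
As a quick illustration of (3.3) and (3.4), the following NumPy sketch builds a toy linear kernel matrix, forms the approximation $K \approx KGH$ with a column-stochastic $G$, and checks the entry-wise reading of a single element; all sizes and variable names are ours, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 8, 3
X = rng.random((m, n))              # toy nonnegative data, one sample per column
K = X.T @ X                         # linear kernel matrix (any other kernel could be used)

G = rng.random((n, k))
G /= G.sum(axis=0, keepdims=True)   # each column of G is a convex-combination weight vector
H = rng.random((k, n))              # encoding coefficient matrix

K_hat = K @ G @ H                   # matrix form of the CKNMF approximation, K ~= K G H

# entry form (3.4): K_hat[i, j] = sum_l ( sum_p K[i, p] * G[p, l] ) * H[l, j]
i, j = 2, 5
entry = sum((K[i, :] @ G[:, l]) * H[l, j] for l in range(k))
assert np.isclose(entry, K_hat[i, j])
```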

However, (3.3) or (3.4) is not convincing for clustering tasks, since the kernel matrix $K$ cannot represent the relationship between different objects faithfully; it is more desirable to discover the latent relationship adaptively. Consequently, we replace the $K$ on the right-hand side of (3.3) by a latent relationship matrix $R$:

$$K \approx RGH, \tag{3.5}$$

where $R$ denotes the correct relationship matrix. In (3.5), the correct relationship matrix is adaptively learned from the kernel matrix $K$, and $K$ is a linear transformation of $R$. A relationship matrix $R$ which pops out the latent clustering structure is approximately a block diagonal matrix under a suitable rearrangement of the samples, and it would be a positive semidefinite matrix. The SNMF formulation is therefore reasonable for learning a low-rank representation of $R$. Thus, we derive our new model, referred to as relationship matrix nonnegative decomposition (RMND), as follows:

$$K \approx CC^{T}GH, \tag{3.6}$$

where $C$ is a nonnegative matrix of size $n \times k$ whose rows denote the degrees of the samples related to the centroids of clusters. The corresponding optimization problem of RMND is given by

$$\min_{C, G, H}\; F(C, G, H) = \frac{1}{2} \left\| K - CC^{T}GH \right\|_{F}^{2}, \quad \text{s.t. } C \geq 0,\; G \geq 0,\; H \geq 0,\; \sum_{i=1}^{n} G_{ij} = 1. \tag{3.7}$$

The objective function of RMND in (3.7) is not convex in $C$, $G$, and $H$ simultaneously; therefore, it is unrealistic to expect an algorithm to find the global minimum of $F$. Since $GH = (GD^{-1})(DH)$, where $D$ is a diagonal matrix with $D_{jj} = \sum_{i} G_{ij}$, the normalization of the columns of $G$ can easily be handled after $G$ is updated. Therefore, we only consider the nonnegativity constraints on the factors. When $C$ is fixed, let $\alpha$ and $\beta$ be the Lagrange multipliers for the constraints $G \geq 0$ and $H \geq 0$, respectively. Writing $R = CC^{T}$, the Lagrangian is

$$L = \frac{1}{2} \left\| K - RGH \right\|_{F}^{2} + \operatorname{tr}\left( \alpha G^{T} \right) + \operatorname{tr}\left( \beta H^{T} \right),$$

where $\|\cdot\|_{F}$ denotes the Euclidean (Frobenius) norm and $\operatorname{tr}(\cdot)$ is the trace function. The partial derivatives of $L$ with respect to $G$ and $H$ are

$$\frac{\partial L}{\partial G} = R^{T}\left( RGH - K \right) H^{T} + \alpha, \qquad \frac{\partial L}{\partial H} = G^{T}R^{T}\left( RGH - K \right) + \beta.$$

Using the KKT conditions $\alpha_{ij} G_{ij} = 0$ and $\beta_{ij} H_{ij} = 0$, we get the following equations:

$$\left( R^{T}RGHH^{T} \right)_{ij} G_{ij} - \left( R^{T}KH^{T} \right)_{ij} G_{ij} = 0, \qquad \left( G^{T}R^{T}RGH \right)_{ij} H_{ij} - \left( G^{T}R^{T}K \right)_{ij} H_{ij} = 0.$$

The above equations lead to the following multiplicative update rules:

$$G_{ij} \leftarrow G_{ij}\, \frac{\left( R^{T}KH^{T} \right)_{ij}}{\left( R^{T}RGHH^{T} \right)_{ij}}, \qquad H_{ij} \leftarrow H_{ij}\, \frac{\left( G^{T}R^{T}K \right)_{ij}}{\left( G^{T}R^{T}RGH \right)_{ij}}.$$

For the factor matrix $C$, the corresponding partial derivative of $F$ is

$$\frac{\partial F}{\partial C} = \left( CC^{T}GH - K \right)\left( GH \right)^{T} C + GH \left( CC^{T}GH - K \right)^{T} C.$$

Our algorithm essentially takes a step in the direction of the negative gradient and subsequently projects $C$ onto the constraint space, making sure that the step taken is small enough that the objective function is reduced at every iteration. The learning procedure for RMND is summarized as Algorithm 1.

Input: Positive matrix $K$ of size $n \times n$ and a positive integer $k$.
Output: Nonnegative factor matrices $C$, $G$, and $H$.
Learning Procedure
 (S1) Initialize $C$, $G$, and $H$ to random positive matrices, normalize each column of $C$ to unit norm, and set $R = CC^{T}$.
 (S2) Repeat the iterations until convergence:
  (1) $G \leftarrow G \odot \left( RKH^{T} \right) \oslash \left( RRGHH^{T} \right)$.
  (2) $H \leftarrow H \odot \left( G^{T}RK \right) \oslash \left( G^{T}RRGH \right)$.
  (3) $G \leftarrow GD^{-1}$ and $H \leftarrow DH$, where $D$ is a diagonal matrix with $D_{jj} = \sum_{i} G_{ij}$.
  (4) Repeatedly select a smaller positive constant $\eta$ until the objective function is decreased:
   (i) $C \leftarrow C - \eta\, \partial F / \partial C$.
   (ii) Project each column of $C$ to be a nonnegative vector with unit norm.
  (5) $R \leftarrow CC^{T}$.
Above, $\odot$ and $\oslash$ denote elementwise multiplication and division, respectively.
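
The following NumPy sketch is one possible implementation of Algorithm 1 in the notation above (with $R = CC^{T}$). The random initialization, the initial step size, the bisection factor, and the iteration limits are our own assumptions rather than prescriptions from the text.

```python
import numpy as np

def rmnd(K, k, n_iter=500, sd_max=20, eps=1e-9, seed=0):
    """Sketch of the RMND learning procedure (Algorithm 1).

    Minimizes F = 0.5 * ||K - C C^T G H||_F^2 over nonnegative C, G, H,
    with the columns of G renormalized to sum to one at every iteration."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    C = rng.random((n, k)) + eps
    C /= np.linalg.norm(C, axis=0, keepdims=True)           # unit-norm columns
    G = rng.random((n, k)) + eps
    H = rng.random((k, n)) + eps

    def objective(C, G, H):
        return 0.5 * np.linalg.norm(K - C @ C.T @ G @ H, 'fro') ** 2

    for _ in range(n_iter):
        R = C @ C.T
        # (1)-(2): multiplicative updates for G and H
        G *= (R @ K @ H.T) / (R @ R @ G @ (H @ H.T) + eps)
        H *= (G.T @ R @ K) / (G.T @ R @ R @ G @ H + eps)
        # (3): G <- G D^{-1}, H <- D H, with D_jj = sum_i G_ij
        d = G.sum(axis=0) + eps
        G /= d
        H *= d[:, None]
        # (4): steepest descent on C with step-size bisection and projection
        M = G @ H
        E = C @ C.T @ M - K
        grad_C = E @ M.T @ C + M @ E.T @ C                   # dF/dC
        f_old, eta = objective(C, G, H), 1.0
        for _ in range(sd_max):
            C_new = np.maximum(C - eta * grad_C, 0.0)        # keep C nonnegative
            C_new /= np.linalg.norm(C_new, axis=0, keepdims=True) + eps
            if objective(C_new, G, H) < f_old:
                C = C_new                                    # accept the step
                break
            eta *= 0.5                                       # otherwise shrink the step
    return C, G, H
```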

3.2. Computational Complexity Analysis

In this subsection, we discuss the extra computational cost of our proposed algorithm in comparison with SNMF and WNMF by counting the arithmetic operations of each algorithm. Based on the updating rules in [10], it is not hard to count the arithmetic operations of each iteration in SNMF and WNMF. For RMND, the steepest descent method is used to update the factor matrix $C$, and we use bisection to determine the small positive step size $\eta$. Let $\bar{t}$ be the maximum number of iterations in the steepest descent step. We summarize the operation counts per iteration in Table 1. Suppose that the algorithms stop after $t$ iterations; the overall cost for both SNMF and WNMF is then of order $O(tn^{2}k)$, while the overall cost for RMND is of order $O(t\,\bar{t}\,n^{2}k)$. The overall cost of SNMF, WNMF, and RMND is thus quadratic in $n$, the number of samples, so much time is needed for large-scale data clustering tasks. For RMND, the overall cost is additionally affected by the maximum iteration number $\bar{t}$ of the steepest descent method. Nevertheless, RMND will be shown to be capable of improving the clustering performance in Section 4. We will develop algorithms with faster convergence and lower computational complexity in future work.

4. Numerical Experiments

We evaluate the performance of five different methods, RMND, k-means clustering, spectral clustering (SpeClus) [13], SNMF, and WNMF, on data clustering tasks. In RMND, once the factorization of $K$ is learned, we denote by $\tilde{H}$ the modification of the learned encoding factor with normalized rows; k-means clustering is applied on $\tilde{H}$, and likewise on the factor matrices learned by SNMF and WNMF from $K$. Finally, the best clustering result of RMND is obtained.
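
A minimal sketch of this clustering step is given below, assuming the learned factor is arranged with one row per sample and using scikit-learn's KMeans as the k-means implementation; the row normalization mirrors the description above.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_from_factor(H, n_clusters, n_init=5, seed=0):
    """Row-normalize a learned factor matrix and cluster its rows with k-means.

    H is assumed to have one row per sample (for a k x n encoding factor,
    pass its transpose)."""
    H_tilde = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)
    km = KMeans(n_clusters=n_clusters, n_init=n_init, random_state=seed)
    return km.fit_predict(H_tilde)
```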

4.1. Datasets

We use four datasets for evaluating the clustering performance of the algorithms. The detailed descriptions of the datasets are listed below.
(1) JAFFE [14] is a face database often used in the face recognition literature. The JAFFE database contains 213 face images of 10 different persons with varying facial expressions.
(2) Coil20 is a dataset consisting of 1440 images from 20 classes with varying rotations (http://www1.cs.columbia.edu/CAVE/software/softlib/coil-20.php). For simplicity, the first 10 classes are used in our experiments.
(3) The Reuters dataset [15] is drawn from the news articles of the Reuters newswire. The Reuters-21578 corpus contains 21,578 documents in 135 categories. We use the data preprocessed by Cai et al. [15]: documents with multiple category labels are discarded, yielding a dataset of 8067 documents in the largest 30 categories. We then randomly select at most 5 categories for efficiency.
(4) USPS is a dataset of handwritten digits from 0 to 9 (http://www.zjucadcg.cn/dengcai/Data/MLData.html). These image data have also been preprocessed by Cai et al. [15]. The USPS dataset used here consists of mixed-sign data. The first 100 samples from each class are used in our experiments.

4.2. Evaluation Metrics for Clustering and Kernel Function

In all these methods, we set $k$, the dimensionality of the feature subspace, equal to the number of classes in the dataset. Two performance measures, clustering accuracy and normalized mutual information, are used to evaluate the clustering performance of the algorithms. If we denote the true label of the $i$th sample by $t_{i}$ and the estimated label by $\hat{t}_{i}$, the clustering accuracy can be computed by

$$AC = \frac{1}{n} \sum_{i=1}^{n} \delta\left( t_{i}, \hat{t}_{i} \right),$$

where $\delta(x, y) = 1$ for $x = y$ and $\delta(x, y) = 0$ for $x \neq y$. The clustering accuracy achieves its maximum value 1 when the clustering result is perfect. Let $\mathcal{C}$ be the set of clusters obtained from the ground truth and $\mathcal{C}'$ the set obtained from our algorithm. The normalized mutual information measure is defined by

$$\mathrm{NMI}\left( \mathcal{C}, \mathcal{C}' \right) = \frac{\mathrm{MI}\left( \mathcal{C}, \mathcal{C}' \right)}{\max\left( H(\mathcal{C}), H(\mathcal{C}') \right)},$$

where $\mathrm{MI}(\mathcal{C}, \mathcal{C}')$ is the mutual information between $\mathcal{C}$ and $\mathcal{C}'$, and $H(\mathcal{C})$ and $H(\mathcal{C}')$ denote the entropies of $\mathcal{C}$ and $\mathcal{C}'$, respectively. The value of $\mathrm{NMI}$ varies between 0 and 1, and the greater the normalized mutual information, the better the clustering quality.
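
Both measures can be computed as in the following sketch. The accuracy function first aligns predicted cluster labels with ground-truth classes via the Hungarian algorithm, a common convention when cluster labels are arbitrary (the formula above assumes the labels are already matched), and the NMI call uses the max-entropy normalization that matches the definition given here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(true_labels, pred_labels):
    """AC: fraction of samples whose best-matched predicted label equals the true label."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    classes, clusters = np.unique(true_labels), np.unique(pred_labels)
    # cost[i, j] = minus the number of samples shared by cluster i and class j
    cost = np.zeros((clusters.size, classes.size))
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            cost[i, j] = -np.sum((pred_labels == c) & (true_labels == t))
    rows, cols = linear_sum_assignment(cost)                 # Hungarian matching
    mapping = {clusters[r]: classes[c] for r, c in zip(rows, cols)}
    mapped = np.array([mapping.get(p, -1) for p in pred_labels])
    return float(np.mean(mapped == true_labels))

def clustering_nmi(true_labels, pred_labels):
    """NMI normalized by max(H(C), H(C')), as defined in the text."""
    return normalized_mutual_info_score(true_labels, pred_labels, average_method='max')
```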

In our experiments, we use a Gaussian kernel function to calculate the kernel matrix and then evaluate the clustering performance of the different algorithms; for a fair comparison, the same kernel matrix is used by all algorithms. The Gaussian kernel function used here is

$$K_{ij} = \begin{cases} \exp\left( -\dfrac{\left\| x_{i} - x_{j} \right\|^{2}}{t} \right), & \text{if } x_{i} \in N_{p}(x_{j}) \text{ or } x_{j} \in N_{p}(x_{i}), \\ 0, & \text{otherwise}, \end{cases} \tag{4.3}$$

where $t$ is a heat kernel parameter [16] and $N_{p}(x_{i})$ denotes the set of $p$ nearest neighbors of $x_{i}$. For simplicity, we present a tunable way to set $t$ from the square average distance between different samples [17]:

$$t = \gamma\, \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \left\| x_{i} - x_{j} \right\|^{2}, \tag{4.4}$$

where $\gamma$ is a scale factor.
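
Under this reading of (4.3) and (4.4), the similarity matrix can be constructed as in the sketch below; details such as the handling of self-neighbors and ties are our own choices, so this should be treated as an assumption-laden illustration rather than the exact construction used in the experiments.

```python
import numpy as np

def heat_kernel_similarity(X, p=5, gamma=1.0):
    """Pairwise similarity matrix with Gaussian (heat kernel) weights on
    p-nearest-neighbor pairs; X holds one sample per column."""
    n = X.shape[1]
    p = min(p, n - 1)
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # n x n squared distances
    t = gamma * sq.sum() / (n * n)                            # (4.4): scaled average squared distance
    W = np.exp(-sq / t)
    # keep only edges between p-nearest neighbors, symmetrized, as in (4.3)
    order = np.argsort(sq, axis=1)
    mask = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), p + 1)
    mask[rows, order[:, :p + 1].ravel()] = True               # +1 keeps the self-neighbor
    mask = mask | mask.T
    return W * mask
```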

4.3. Experimental Results on the JAFFE Dataset

To demonstrate how our method improves the performance of data clustering, we first fix the number of nearest neighbors $p$ and the scale factor $\gamma$, choosing $p$ on the order of $n$, the number of samples, so that the pairwise similarity matrix $K$ is the weighted adjacency matrix of the fully connected graph, similar to those used in spectral clustering. Figure 1 displays the pairwise similarity matrix obtained from the JAFFE dataset; it can be noted that $K$ is ill structured. In order to discover the latent clustering structure, we apply the RMND, SNMF, and WNMF algorithms to obtain the corresponding decompositions of $K$. The factor matrices are randomly initialized with values in the range [0, 1]. Figure 2 shows that the objective function value decreases with increasing iteration number, and the reconstruction error of RMND is smaller than those of SNMF and WNMF after 500 iterations. Figures 3, 4, and 5 display the estimated pairwise similarity matrices of the SNMF, WNMF, and RMND algorithms, respectively. The estimated pairwise similarity matrix learned by RMND is more highly structured: RMND produces a better representation of the kernel matrix than SNMF and WNMF, while SNMF and WNMF yield similar representations of $K$ at convergence.

4.4. Clustering Performance Comparison

Tables 2, 3, and 4 show the experimental results on the Coil20, Reuters, and USPS datasets, respectively, with fixed settings of the number of nearest neighbors $p$ and the scale factor $\gamma$. The evaluations are conducted with different numbers of classes, ranging from 4 to 12 for the Coil20 dataset, 2 to 5 for the Reuters dataset, and 2 to 10 for the USPS dataset. The cluster number $k$ indicates the number of classes used in an experiment; in our experiments, the first $k$ classes of the database are used. For each given class number $k$, 20 independent tests are conducted under different initializations (for the Reuters dataset, different randomly chosen classes are used in each test), and the average performance is calculated over these 20 tests. For each test, the k-means algorithm is applied 5 times with different starting points, and the best result in terms of the k-means objective function is recorded. From these tables, it can be noticed that RMND yields the best average clustering results on the three datasets, although the clustering accuracy of RMND is slightly lower than that of the other methods for certain class numbers.

4.5. Clustering Performance Evaluation on Various Pairwise Similarity Matrix

In graph embedding methods, the pairwise similarity matrix (also referred to as the affinity matrix) has been widely used. In this subsection, we test our algorithm under different adjacency graph constructions to show how the graph structure affects the clustering performance. The number of nearest neighbors $p$ used in this paper defines the locality of the graph. Figures 6 and 7 show the average clustering accuracy and normalized mutual information versus the number of nearest neighbors, over 20 independent runs with a fixed scale factor $\gamma$. As can be seen, RMND performs better when the number of nearest neighbors is larger than 60, and the maximum achieved clustering accuracy is 86.62% when 190 nearest neighbors are used in (4.3). The normalized mutual information is likewise better for larger neighborhood sizes, with a maximum of 0.8429. For SNMF and WNMF, the best clustering accuracies are 84.55% and 82.82%, respectively, and the best normalized mutual information values are 0.8373 and 0.8214, respectively. This implies that RMND is more suitable for discovering the clustering structure contained in a smoothed pairwise similarity matrix.

Note that the choice of the parameters in (4.4) is still an open problem. To this end, we explore a range of possible values of the scale factor $\gamma$ to determine the heat kernel parameter. Figures 8 and 9 show the clustering accuracy and normalized mutual information on the JAFFE dataset under different scale factors; a fixed number of nearest neighbors is used in this experiment. As $\gamma$ increases, the performance decreases. The reason might be that the differences between the pairwise similarities become small for larger values of $\gamma$, so that the pairwise similarity matrix becomes more and more ill structured. Nevertheless, RMND leads to better clustering performance than SNMF and WNMF.

5. Conclusions and Future Work

We have presented a novel relationship matrix nonnegative decomposition (RMND) model for data clustering tasks. The RMND model is formulated by decomposing a pairwise similarity matrix into a product of three low-rank nonnegative matrices which have explicit interpretations, and the correct relationship matrix is adaptively learned from the pairwise similarity matrix. We developed a learning procedure based on multiplicative update rules and the steepest descent method to calculate the nonnegative solution of RMND. Extensive numerical experiments confirm that (1) RMND provides a favorable low-rank representation of the pairwise similarity matrix; (2) by using an appropriate kernel function, the ability of RMND, SNMF, and WNMF to deal with mixed-sign data makes them useful for many applications, in contrast to the original NMF; and (3) RMND improves the clustering performance of SNMF and WNMF.

Future work includes the following topics. The first is to develop algorithms with faster convergence and better solutions in terms of minimizing the objective function. The second is to investigate the behavior of RMND with different kinds of kernel functions.

Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program) (Grant no. 2007CB311002), the National Natural Science Foundation of China (Grant nos. 60675013, 10531030, and 61075006), and the Research Fund for the Doctoral Program of Higher Education of China (Grant no. 20100102120048). The authors are grateful for the helpful comments, which led to substantial improvements of the paper.