Abstract

Most sparse or low-rank-based subspace clustering methods treat the construction of the affinity matrix and the final clustering as two independent steps. We propose to integrate the affinity matrix and the data labels into a single minimization model, so that they can interact with and promote each other and ultimately improve clustering performance. Furthermore, a block diagonal structure of the representation matrix is the most desirable for subspace clustering. We define a folded concave penalty (FCP) based norm to approximate the rank function and apply it to the combination of the label matrix and the representation vector. This FCP-based regularization term effectively enforces the block diagonal structure of the representation matrix. We minimize the difference of the l1 norm and l2 norm of the label vector so that it has only one nonzero element, since each data point belongs to only one subspace. The index of that nonzero element is associated with the subspace from which the data point comes and can be determined by a variant of graph Laplacian regularization. We conduct experiments on several popular datasets. The results show that our method achieves better clustering results than several state-of-the-art methods.

1. Introduction

In machine learning and data mining, clustering is one of the most important topics in unsupervised learning. Given a set of data points, the goal of clustering is to partition these points into several groups, each of which is called a cluster, such that data points in the same group have higher similarities than those in different groups. Over the past decades, an enormous number of clustering algorithms have been proposed, such as K-means and its variants [1, 2], spectral clustering [3], nonnegative matrix factorization (NMF) [4], and subspace clustering [5]. In this paper, we focus on methods for subspace clustering.

We are now in an era of explosive growth of high-dimensional data. However, such high-dimensional data contain redundant information, so their intrinsic dimension is often much smaller. In many computer vision and machine learning problems, one often assumes that the data are drawn from a union of multiple low-dimensional linear subspaces, and subspace clustering of such data has therefore been studied extensively. Recently, many works have argued that the linear subspace assumption may not always hold for real high-dimensional data, which may be better modeled by nonlinear manifolds [6–8]. The common strategy employed in these works is to use the logarithm map to project the data onto the tangent space at each data point, which is a linear space, so that all the techniques applicable to linear spaces can be used. These methods ultimately still need to establish a model in a linear space, so research on modeling methods in linear spaces remains essential. The existing methods for subspace clustering can be roughly divided into four groups: statistical learning-based methods, factorization-based methods, algebra-based methods, and sparsity-based methods (e.g., Sparse Subspace Clustering (SSC) [9] and Low-Rank Representation (LRR) [10]). In this paper, we focus on the fourth group.

In recent years, many methods based on deep learning [11] have been proposed and have obtained extremely competitive results in many image processing and computer vision tasks. A hotly discussed question in computational imaging is whether it is time to discard classic methods and fully replace them with deep learning-based methods. On the one hand, a prerequisite for deep learning-based methods is a huge number of samples. However, there exist situations in which no data or only a small amount of data can be obtained; in such cases, knowledge-based modeling methods are more suitable. On the other hand, classic methods have a clear structure and theoretical guarantees. They are based on knowledge of the problem we are trying to solve rather than seeking the best performance by intuitively choosing architectures or by trial and error. To the best of our knowledge, most existing learning-based clustering methods only learn autoencoder features [12], and the final clustering is still obtained by applying k-means or SSC. Another challenge for deep learning is that a clustering model trained on some datasets may not be effective on others, whereas classic methods have better generalization ability.

Let be a collection of -dimensional data drawn from the union of linear subspaces . Each includes data of . The task of subspace clustering is to cluster the data in according to these independent subspaces. For each , consider the data as a linear combination of all the data in , i.e., , where is called a representation vector; we assume to avoid the trivial solution. Suppose ; then only the coefficients associated with the data from are nonzero, and all the others are zero. Assuming , we want to find the solution to the following problem:

which means to find the oracle solution [9].
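As a concrete illustration of the self-expressiveness idea behind problem (1), the following Python sketch computes a sparse representation vector for one data point via an l1-relaxed surrogate solved by a simple proximal gradient (ISTA) loop, as in SSC [9]. The function name, the regularization weight `lam`, and the step size are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def self_expressive_code(X, j, lam=0.1, n_iter=500):
    """Sparse self-expressiveness for one column (illustrative sketch).

    Solves  min_c  0.5*||x_j - X c||_2^2 + lam*||c||_1   with c_j = 0
    using a simple ISTA (proximal gradient) loop.
    """
    D, N = X.shape
    x_j = X[:, j]
    c = np.zeros(N)
    # Step size from the Lipschitz constant of the gradient.
    L = np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ (X @ c - x_j)
        c = c - grad / L
        # Soft-thresholding (proximal operator of the l1 norm).
        c = np.sign(c) * np.maximum(np.abs(c) - lam / L, 0.0)
        c[j] = 0.0  # exclude the trivial solution x_j = x_j
    return c
```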

According to the above statement, there must be at least one matrix satisfying

Equation (2) actually has an infinite number of solutions. Any solution is called a representation matrix. When handling 2D data, where each data point is a matrix, [13] shows that the strategy of converting all data to vectors severely damages the inherent structural information and correlations of the original data. They proposed to learn a 2D projection matrix such that the most expressive structural information is retained in the spanned subspace. In our work, we still consider the vectorized data for simplicity, since our method is also suitable for projected vectorized data, and how to deal with 2D data is beyond the scope of this paper.

A good representation matrix should be sparse between subspaces and dense within a subspace, which means each query data point is represented by a small number of subspaces, and once one subspace is selected, it is preferable to use more data from the same subspace. The work of [14] introduced a family of new regression models and estimated a representation model while accounting for discriminativeness between clusters. Here we achieve this property of the representation matrix by forcing it to have a block diagonal structure, since the block diagonal structure of directly induces a segmentation of the data (each block corresponds to a cluster). Reference [15] states that under ideal conditions, i.e., the data are noiseless and the subspaces are independent (i.e., none of the subspaces can be represented by the other subspaces), as long as the regularizer for satisfies the EBD (Enforced Block Diagonal) conditions, the optimal representation matrix is block diagonal. However, since contains noise, which is inevitable in any application, may not be a strictly block diagonal matrix. It is therefore difficult to decide how large the representation coefficient between two data points should be in order to group them into the same subspace. Most previous subspace clustering methods first find a matrix; then, using it as an affinity matrix, spectral clustering such as normalized cuts is applied to obtain the subspace clustering result, as the classic SSC and LRR and many later variants have done. All these methods split the computation of the representation matrix and the final clustering into two independent steps. We propose to integrate the affinity matrix and the data labels into one model so that they interact with and promote each other and ultimately improve the clustering performance.
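For contrast with the integrated model proposed here, the classic two-step pipeline described above can be sketched as follows: build a symmetric nonnegative affinity from a representation matrix (commonly the sum of its absolute value and its transpose) and feed it to spectral clustering. This is a minimal sketch of the baseline approach using scikit-learn's spectral clustering, not of our method.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def two_step_subspace_clustering(C, n_clusters):
    """Classic two-step pipeline: representation matrix -> affinity -> spectral clustering."""
    # Symmetric nonnegative affinity built from the representation matrix.
    W = np.abs(C) + np.abs(C).T
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed",
                                assign_labels="discretize").fit_predict(W)
    return labels
```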

Furthermore, in terms of the penalty on the representation matrix, most methods based on SSC and LRR seek the sparsest or lowest-rank solution. In fact, for subspace clustering, it is more important for the representation matrix to have a block diagonal structure than to be the sparsest or lowest-rank matrix. We give a low-rank-based regularization term that enforces the block diagonal structure directly. This regularization term is applied to a combination of the semisupervised label matrix and the representation vector rather than to the representation matrix itself. Note that the rank function of a matrix is the l0-norm of its singular value vector, and solving such an l0 minimization problem is usually difficult or even NP-hard. The standard approach is to replace the rank function with the convex nuclear norm [16, 17]. It has been proved that under certain incoherence assumptions on the singular values of the matrix, solving the convex nuclear norm regularized problem leads to a near-optimal low-rank solution [18]. On the other hand, it has been pointed out that the nuclear norm is not an accurate rank approximation. Recent works develop more accurate approximations to the rank function, such as the log-determinant rank approximation [19, 20], which significantly improves learning performance. In addition, the nuclear norm of a matrix is the l1-norm of its singular value vector. Fan and Li pointed out that the l1-norm penalty overpenalizes large entries of vectors, and therefore the nuclear norm overpenalizes large singular values. Moreover, they proposed three criteria for good penalty functions [21]: unbiasedness, sparsity, and continuity at the origin. The l1-norm satisfies both sparsity and continuity, but it is biased.

Recently, nonconvex penalties have attracted more and more attention in sparse learning problems because one of the possible solutions of a nonconvex penalization problem may overcome the drawbacks of the unique solution of a convex penalization problem. As a common practice, the l1-norm can be replaced by the lq-norm with 0 < q < 1 if a sparser solution is expected [22, 23]. However, there is no theoretical guarantee that the lq-norm reduces the modeling bias of the l1-norm. Based on the three properties above, Fan and Li proposed a new penalty function called the smoothly clipped absolute deviation (SCAD) penalty [21]. Later, Zhang proved that the so-called minimax concave penalty (MCP) [24] also possesses the three properties and achieves better performance than SCAD. Both SCAD and MCP are nonconvex and nearly unbiased. Extensive experiments [21, 24–30] have demonstrated the superiority of SCAD and MCP over the l1-norm penalty. Furthermore, folded concave penalty (FCP) methods, including SCAD and MCP, have been shown to have the strong oracle property for high-dimensional sparse estimation. As described above, we do expect to obtain the oracle solution of problem (1).

In this paper, an FCP regularized subspace clustering model is presented as follows:

We define a subspace-dependent label matrix , where is a label vector for data point ; here, is the number of subspaces and is the number of data points. Since we do not know , the dimension of only needs to be set larger than and at most equal to . Here, we still use to denote the dimension of . In our work, we mainly discuss the semisupervised case; i.e., some data already have labels and the others do not. The index sets for all data and for the labeled data are denoted by and , respectively; then represents the index set for the unlabeled data. For every , we assume when is in the ith class and otherwise. The label vectors for need to be solved, and the set of all the unknown labels is denoted by . Encouraged by the good properties of the FCP penalty, we define an FCP-based norm to approximate the rank function. The last term of our model (3) minimizes the rank of the matrix product . , , and are parameters used to balance the roles of the three regularization terms; obviously, the larger a parameter, the more important the corresponding term in the minimization problem.
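To make the semisupervised label matrix concrete, a minimal sketch of its construction is given below: labeled columns are one-hot, and the columns of the unlabeled data are left at zero to be estimated by the model. The function and variable names (e.g., `H`, `known_labels`) are illustrative and do not necessarily match the paper's notation.

```python
import numpy as np

def build_label_matrix(n_classes, n_samples, known_labels):
    """Build a semisupervised label matrix (illustrative sketch).

    known_labels: dict mapping sample index -> class index (the labeled set).
    Labeled columns are one-hot; unlabeled columns stay zero and are
    estimated by the model.
    """
    H = np.zeros((n_classes, n_samples))
    for j, c in known_labels.items():
        H[c, j] = 1.0
    return H
```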

We use three regularization terms. The first minimizes the difference of the l1 norm and l2 norm of the label vector so that it has only one nonzero element; the clustering result for each data point is given by the index of that nonzero element. The second is a variant of graph Laplacian regularization, which captures the nonlinear structure of the data. This term ensures that data from the same subspace have the same label as much as possible, so the nonzero element of each label vector can properly correspond to the subspace from which the data point is drawn. The third applies a low-rank constraint to the combination of the label matrix and the representation vector to enforce the block diagonal structure of the representation matrix. In our work, the clustering result, namely the label of each data point, is solved directly from the model rather than obtained by spectral clustering. The labels and the representation matrix appear together in two regularization terms. The first part of the third term of (3) makes the nonzero element of each label vector accurately correspond to the subspace from which each query data point comes if the representation matrix has the block diagonal structure. Conversely, if each label has only one nonzero element associated with the correct subspace, the second part of the third term and the fourth term make the representation matrix better satisfy the block diagonal structure. Therefore, the labels and the representation matrix interact with and promote each other throughout the computing process.
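The effect of the first regularizer can be checked numerically: the difference of the l1 norm and l2 norm of a vector is always nonnegative and equals zero exactly when the vector has at most one nonzero element. A small sketch:

```python
import numpy as np

def l1_minus_l2(h):
    return np.linalg.norm(h, 1) - np.linalg.norm(h, 2)

print(l1_minus_l2(np.array([0.0, 1.0, 0.0])))   # 0.0: a single nonzero entry
print(l1_minus_l2(np.array([0.5, 0.5, 0.0])))   # > 0: mass spread over two entries
print(l1_minus_l2(np.array([0.4, 0.3, 0.3])))   # even larger: three nonzero entries
```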

The problem can be solved within the Alternating Direction Method of Multipliers (ADMM) framework. The resulting nonconvex FCP minimization problem can be solved by the Local Linear Approximation (LLA) method [23], which minimizes a surrogate function that upper bounds the objective; the surrogate is constructed by linearizing the penalty function. LLA is guaranteed to decrease the objective function value in each iteration. Due to the nonconvexity of the FCP problem, there are usually multiple local solutions, and the oracle property is established only for one of them. Breheny and Huang [25] have shown that with a Lasso-based initialization, LLA can avoid local maxima and saddle points, and with high probability an oracle solution can be obtained by one-step LLA. In addition, once the oracle estimator is obtained, the LLA algorithm converges; namely, it produces the same estimator in the next iteration. We use the result of singular value thresholding as the initialization, which corresponds to the Lasso-based solution when applied to the nuclear norm minimization problem.
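The one-step LLA strategy can be sketched as follows for a generic FCP-penalized low-rank problem: initialize with plain singular value thresholding (the nuclear-norm, i.e., Lasso-type, solution), linearize the concave penalty at that point to obtain per-singular-value weights, and then perform one weighted thresholding step. The use of the MCP derivative and the parameters `lam`, `gamma`, and `tau` are illustrative assumptions.

```python
import numpy as np

def mcp_derivative(t, lam, gamma):
    """Derivative of the MCP penalty on [0, inf): p'(t) = max(0, lam - t/gamma)."""
    return np.maximum(0.0, lam - t / gamma)

def one_step_lla_low_rank(M, lam, gamma, tau=1.0):
    """One-step LLA for an FCP-penalized low-rank estimate (illustrative sketch).

    Step 1: initialize with plain singular value thresholding (the convex,
            nuclear-norm solution, i.e., the Lasso-type initializer).
    Step 2: linearize the concave penalty at the initializer, which yields
            per-singular-value weights, and apply weighted thresholding.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s0 = np.maximum(s - tau * lam, 0.0)      # SVT initialization
    w = mcp_derivative(s0, lam, gamma)       # LLA weights p'(sigma_i)
    s1 = np.maximum(s - tau * w, 0.0)        # weighted SVT (one LLA step)
    return U @ np.diag(s1) @ Vt
```

Because the MCP derivative vanishes for large singular values, those values receive no shrinkage after the first step, which reflects the (near) unbiasedness discussed above.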

Some notations used in this work are defined as follows. For a vector , and are l1 norm and l2 norm, respectively. is a diagonal matrix with diagonal entries being . For a matrix , denotes the Frobenius norm, is the nuclear norm, and here, is the ith singular value of . represents the element-wise absolute value of . denotes the transpose of . describes the diagonal matrix with diagonal components being . is a vector with entries . is the trace of the square matrix . and denote the identity matrix and the zero matrix. means all entries of are nonnegative.

The remainder of this paper is organized as follows. In Section 2.1, we first present the three regularization terms, and then the low-rank-based semisupervised subspace clustering model is proposed. In Section 2.2, we define an MCP- and SCAD-based norm to approximate the rank function, resulting in an FCP-based nonconvex minimization model for subspace clustering. In Section 3, we present an algorithm for solving the proposed model that combines several approaches, such as ADMM, LLA, and weighted singular value thresholding. In Section 4, we conduct a series of experiments on several datasets to demonstrate the superiority of our method. In Section 5, we conclude the paper with a summary and future plans.

2. The Proposed Model

2.1. The Low-Rank Model Integrating Affinity and Clustering

We set as the submatrix of composed of the label vectors of all the data from subspace . In the ideal case, the elements in its ith row are all one, and those in the other rows are all zero. With the expression and the ideal structure of , we can automatically seek the block diagonal structure of by minimizing the rank of the matrix product (see [31, 32] for a detailed interpretation). The sparsity between subspaces and density within a subspace implied by the block diagonal structure are exactly what subspace segmentation aims for. We therefore first give the following minimization model. Here, for every , we set , and denotes the number of elements in .

It is reasonable for the label vector to have only one nonzero element since, in real applications, one data point belongs to only one subspace; for example, one face image cannot belong to two persons. Note that has only one nonzero element if and only if . Under the constraint , its unique nonzero element has the value one. For computational simplicity, we relax as , enforce the difference of the l1 norm and l2 norm of to be minimized, and use as a regularized constraint. The problem becomes

The remaining question is how to make sure the nonzero element of properly associates with the subspace from which the query data point comes. To this end, we use the following term:

We denote the elements of as . The quantity describes the similarity between and . We analyze from the following aspects:
(1) For the case , , or , , that is, one of and already has a label. Taking , for instance: since already has a label, should belong to the same subspace as when is large enough. This can be achieved by the first part of formula (6). In this case, the index of the nonzero element of the unknown label can be properly correlated with the subspace to which the data point belongs.
(2) When neither nor has a label, the first part of (6) also works. Since both and have only one nonzero element, the indexes of their nonzero elements must be the same when is large enough. Therefore, data points and can be clustered into the same subspace, although which subspace cannot yet be decided. Figure 1 shows that labels can be properly propagated from labeled data to unlabeled data, so the index of the nonzero element of each label can correspond to the correct subspace even in this case.
(3) If both and have labels, then and do not need to be solved, and only the representation coefficient is unknown. If and are in the same subspace, namely , then should be large. If and are not in the same subspace, i.e., , then had better be zero. To this end, we use the second part of (6). Here the parameter needs to take a large value.

Using all the data points as nodes, two nodes and are connected when is large enough and unconnected when is small; we thus obtain a graph of all the data. Since, for each data point, its most similar data point must come from the same subspace, each node must be connected to at least one node in the same subspace. When there is at least one labeled data point in each subspace, the labels can be propagated from the labeled nodes to the unlabeled nodes through formula (6), so connected nodes share the same label. This is illustrated in Figure 1, in which there are only two classes and two labeled data points (one for each subspace).
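Formula (6) propagates labels through the representation coefficients; the mechanism illustrated by Figure 1 can be mimicked with a generic harmonic-function label propagation over a similarity graph, sketched below. This is an illustrative stand-in for the effect of the graph term, not the authors' exact regularizer, and it assumes each connected component of the graph contains at least one labeled node (as in Figure 1).

```python
import numpy as np

def propagate_labels(W, H_known, labeled_idx):
    """Harmonic-function style label propagation over a similarity graph (a sketch).

    W:           (N, N) symmetric nonnegative affinity, e.g. built from |Z| + |Z|^T.
    H_known:     (K, N) label matrix, one-hot on the labeled columns.
    labeled_idx: indices of labeled samples.
    Returns a (K, N) soft label matrix; hard labels are the column-wise argmax.
    """
    N = W.shape[0]
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # unnormalized graph Laplacian
    u = np.setdiff1d(np.arange(N), labeled_idx)
    # Solve L_uu * H_u^T = -L_ul * H_l^T for the unlabeled block.
    Luu = L[np.ix_(u, u)]
    Lul = L[np.ix_(u, labeled_idx)]
    H = H_known.copy()
    H[:, u] = -np.linalg.solve(Luu, Lul @ H_known[:, labeled_idx].T).T
    return H
```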

So far, we obtain the complete low-rank-based semisupervised subspace clustering model

The minimization problem (7) is difficult to solve due to the combinatorial nature of the rank function. The standard approach is to replace the rank function with the nuclear norm. Considering that the matrix nuclear norm tends to overpenalize large singular values and thus usually leads to a biased estimate, and given the advantages of FCP over the l1 norm described in many works, we instead utilize the FCP of the singular value vector of a matrix to approximate the rank function. The model (7) is thus transformed into our model (3) presented in the introduction, where the matrix norm is defined as follows:

Here, is the kth singular value of the matrix with . We choose the function to be either MCP or SCAD, defined as follows. MCP is of the following form:

SCAD is of the following form:

The matrix FCP norm defined in (8) satisfies the following properties.

Proposition 1. For (MCP) or (SCAD), then(1), with equality iff (2), and (3)is unitarily invariant, that is, whenever and are orthonormal(4) is concave w. r. t. matrix , where

Property (1) is obvious. The second property can be easily verified since both the SCAD and MCP penalty functions are upper bounded by the Lasso penalty function and tend to the Lasso penalty function as . The third property follows from the fact that the singular values do not change from to whenever and are orthonormal. The fourth property holds because of the concavity of the FCP penalty function. It is worth noting that the matrix FCP norm is not a true norm since it does not satisfy the triangle inequality.
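For reference, the scalar MCP and SCAD penalties in their usual parameterizations from [21, 24], together with the induced matrix FCP norm of (8) (the chosen penalty summed over the singular values), can be written as follows. The parameter names `lam` and `gamma` are illustrative and may differ from the paper's symbols.

```python
import numpy as np

def mcp(t, lam, gamma):
    """MCP penalty (Zhang [24]), standard parameterization, gamma > 1."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

def scad(t, lam, gamma):
    """SCAD penalty (Fan and Li [21]), standard parameterization, gamma > 2."""
    t = np.abs(t)
    small = lam * t
    mid = (2.0 * gamma * lam * t - t ** 2 - lam ** 2) / (2.0 * (gamma - 1.0))
    large = 0.5 * (gamma + 1.0) * lam ** 2
    return np.where(t <= lam, small, np.where(t <= gamma * lam, mid, large))

def matrix_fcp_norm(M, penalty, lam, gamma):
    """FCP-based matrix 'norm': the penalty summed over the singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    return penalty(s, lam, gamma).sum()
```

Note that both penalties become constant for arguments larger than gamma*lam, which is why large singular values are not overpenalized, in contrast with the nuclear norm.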

3. Optimization

To solve problem (3), the original problem is converted to the following equivalent problem:

The solution to problem (11) is difficult to obtain directly. We adopt the Augmented Lagrangian Multiplier (ALM) scheme to derive the following unconstrained optimization problem, where , , are Lagrangian multipliers and is the penalty parameter. Instead of optimizing all arguments simultaneously, since , , , , are separable, we solve for them individually and iteratively.
(1) By fixing , , , , we optimize every column of via the following subproblem, where is the jth column of and is the jth column of . The solution can be obtained by solving a linear system as follows:
(2) By fixing , , , , we optimize every via the following subproblem, which can be solved using the following equation. Here, and represent the jth columns of the matrices and , respectively, and is the jth element of the vector . is then projected onto the set by the algorithm presented in [33].
(3) For or , the optimal can be obtained using a weighted soft-thresholding algorithm (a sketch is given after this list). As and , since and are known, we obtain from the following equation:
(4) To update , the following subproblem is solved, and problem (18) can be solved by the soft-thresholding operator.
(5) To update , the following subproblem is solved, which is handled by LLA using the result of singular value thresholding as the initial value for .
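A minimal sketch of the two thresholding operators referred to in steps (3) and (4) is given below; the weights in the weighted variant are assumed to come from linearizing the chosen FCP penalty at the current iterate (LLA), as in the one-step LLA sketch in the introduction.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau*||.||_1 (used for the l1-type subproblem)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def weighted_soft_threshold(x, tau, w):
    """Element-wise weighted soft-thresholding, prox of tau*sum_i w_i*|x_i|.

    With LLA-derived weights, large entries receive small or zero shrinkage.
    """
    return np.sign(x) * np.maximum(np.abs(x) - tau * np.asarray(w), 0.0)
```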

The Lagrangian multipliers are updated as follows:

Steps (14)–(23) are repeated until the convergence conditions are met. In this algorithm, the penalty parameter starts from and is updated in each iteration as with .
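For completeness, one dual update of the ALM/ADMM scheme, together with a penalty parameter schedule of the kind described above, can be written as the following small helper. The schedule constants `mu` and `rho_max` are assumed values, not those used in the paper.

```python
def dual_update(Y, rho, constraint_residual, mu=1.1, rho_max=1e6):
    """One ADMM dual step: gradient ascent on the multiplier, then increase rho.

    constraint_residual is the current violation of the corresponding linear
    constraint (e.g. A - B for a constraint A = B).  The schedule
    rho <- min(mu * rho, rho_max) is an illustrative assumption.
    """
    Y_new = Y + rho * constraint_residual
    rho_new = min(mu * rho, rho_max)
    return Y_new, rho_new
```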

4. Experiments

To evaluate the performance of the proposed method, we conduct experiments on five benchmark datasets: the face image datasets Extended Yale B [34] and ORL [35], the object image dataset COIL20 [36], the handwritten digit image dataset MNIST [37], and the handwritten digit and letter image dataset Alphadigits. We directly cite some results reported in [12] on the ORL and Alphadigits datasets. We compare our method with ten baseline algorithms: SSC, kernel sparse subspace clustering (KSSC) [38], SSC by orthogonal matching pursuit (SSC-OMP) [39], S3C [40], SSRSC [41], LRR, low-rank subspace clustering (LRSC) [42], efficient dense subspace clustering (EDSC) [43], TLRR [44], and KTRR [12]. Among these methods, SSC, KSSC, SSC-OMP, S3C, and SSRSC are sparsity-based; LRR, LRSC, EDSC, and TLRR are low-rank-based; and KTRR is based on ridge regression.

4.1. Datasets

The datasets used in our experiments are briefly described as follows. (1) Extended Yale B: this database consists of 2432 face images in total from 38 subjects, with 64 frontal face images per subject taken under different illumination conditions. The face images of each subject lie approximately in a low-dimensional subspace. In our experiments, each image is downsampled from 192 × 168 to 32 × 32 pixels. (2) ORL: this dataset contains face images of size 32 × 32 pixels from 40 individuals. Each individual has 10 images taken at different times, with varying facial expressions, facial details, and lighting conditions. (3) COIL20: this dataset contains 1440 grayscale images of 20 objects. Each image was downsampled to 32 × 32 pixels. (4) MNIST: this dataset contains 70,000 images of handwritten digits 0–9 with a size of 28 × 28 pixels. (5) Alphadigits: this is a binary dataset collecting handwritten digits 0–9 and letters A–Z. In total, there are 36 classes with 39 samples per class, and each sample has a size of 20 × 16 pixels.

4.2. Evaluation Metrics

We use three evaluation metrics to verify the effectiveness of the proposed method: clustering accuracy (CA), normalized mutual information (NMI), and purity. All experiments are run ten times with randomly chosen subjects each time. For each subject, we randomly select images, of which images are prelabeled for our method. The average CA, NMI, and purity are then reported. We compute the clustering accuracy as follows:

Clustering accuracy measures the accuracy rate of clustering. NMI measures the quality of the clusters. Purity measures the extent to which each cluster contains samples primarily from the same subject. Higher values of these metrics indicate better clustering quality. More details on the last two metrics can be found in [45]. The best results are shown in bold. Experiments show that there is no obvious difference between MCP and SCAD when applied to our model, so we do not report them separately. In all of our experiments, we use the intensity feature of each image and stretch it into a column of the data matrix .
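The three metrics can be computed as follows (a sketch assuming integer labels starting from 0; CA uses the Hungarian algorithm to find the best one-to-one match between predicted clusters and ground-truth classes, and NMI is taken from scikit-learn):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """CA: best one-to-one match between predicted clusters and true classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)   # maximize the number of matched samples
    return cost[row, col].sum() / y_true.size

def purity(y_true, y_pred):
    """Purity: each cluster is credited with its majority class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        total += counts.max()
    return total / y_true.size

def nmi(y_true, y_pred):
    return normalized_mutual_info_score(y_true, y_pred)
```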

4.3. Parameter Selection

The parameters of all methods are tuned to obtain their best clustering results. Our method involves six parameters: , , , , , and . Among them, and are associated with the FCP penalty functions; in our experiments, and yield the best results. , , and are used to balance the roles of the three regularization terms; the best choices are for , for , and for . The parameter needs to be set large enough to penalize large representation coefficients when two labeled data points have different labels. We fix , which achieves the best results in all the following experimental settings.

4.4. Initialization and Stopping Criterion

All the variables , , , and the Lagrange multipliers , , are initialized with all elements equal to zero. is initialized to , where consists of the known labels and the initial unknown labels . We use the three errors , , , all being less than , as the stopping criterion of the iterations.

4.5. Performances

The average CA, NMI, and purity of all methods on the two face image datasets, i.e., Extended Yale B and ORL, are reported in Tables 1 and 2, respectively. For Extended Yale B, we use random choices of subjects each time; for each subject, we randomly select images, where , among which images are prelabeled for our method. For ORL, we randomly choose persons, each with 10 images. With only a small portion of the data being prelabeled, our method achieves results that are competitive with the state-of-the-art methods.

The previous experiments target face clustering. To show the generality of our algorithm, we also evaluate it on the COIL20 object image dataset; the results are reported in Table 3. In contrast to the face datasets, in which faces are well aligned and have similar structures, the object images in COIL20 are more diverse, and even samples of the same object differ from each other due to the change of viewing angle, which makes this dataset challenging for subspace clustering techniques. For each of the 10 chosen subjects, we randomly select images, of which images already have labels for our method. Similar to the above datasets, our method performs best among all eleven methods.

To further verify the effectiveness of the proposed method, we conduct experiments on two handwriting datasets, i.e., MNIST and Alphadigits (the latter also contains 26 letters). For MNIST, we randomly choose images for each digit 0–9, where , and then apply all methods to cluster the images; for our method, of the images for each digit are prelabeled. For Alphadigits, we use random choices of subjects and use all the images of each subject, among which images are prelabeled for our method. The CA, NMI, and purity are reported in Tables 4 and 5. Our method achieves the best clustering results among all compared methods.

5. Conclusion

We have proposed a novel nonconvex formulation for the subspace segmentation problem. In our work, the labels of all data are solved directly from the model rather than obtained by spectral clustering algorithms. We give two regularization constraints on the label vectors. One forces each label vector to have only one nonzero element by minimizing the difference of its l1 norm and l2 norm. The other ensures that data from the same subspace have the same label as much as possible, so that the index of the nonzero element of each label vector indicates the subspace from which the data point comes. Owing to the many advantages of FCP over the l1 norm, we present an FCP-based low-rank approximation norm and apply it to the combination of the semisupervised label matrix and the representation vector. This term enforces the representation matrix to better meet the block diagonal structure. The labels and the representation matrix appear together in two regularization terms, so they interact with and promote each other during the computing process. We present a solution algorithm based on ADMM, LLA, weighted singular value thresholding, and weighted soft thresholding.

In the future, we will continue to study estimation or learning methods for the parameters in the proposed model, since currently these parameters must be tuned to achieve good results. It may also be fruitful to integrate classical knowledge-based approaches into deep learning architectures, so that the algorithm enjoys both the flexibility of deep learning-based methods and the clear structure of classical approaches.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61772389), Natural Science Basic Research Program of Shaanxi Province, China (2020 JM-569, 2021 JM-440, and 2020 JQ-817), Science and Technology Plan of Shaanxi Province (2020 GY-066), Principal Fund Program of Xi’an Technological University (XAGDXJJ17027), Science and Technology Plan of Weiyang District, Xi’an, Shaanxi Province (201925), and Key Research and Development Program of Shaanxi (2021 GY-137).