Abstract

Traditional clustering methods neglect data quality and perform clustering directly on the original data. Consequently, their performance easily deteriorates, since real-world data usually contain noisy samples in high-dimensional space. To resolve this problem, a new method is proposed that builds on the low-rank representation approach. The proposed method first learns a low-rank coefficient matrix from the data by exploiting the data's self-expressiveness property. A regularization term is then introduced to ensure that the representation coefficients of two samples that are similar in the original high-dimensional space remain close, so that the samples' neighborhood structure is maintained in the low-dimensional space. As a result, the proposed method obtains the clustering structure directly from the low-rank coefficient matrix, guaranteeing optimal clustering performance. A wide range of experiments shows that the proposed method is superior to the compared state-of-the-art methods.

1. Introduction

Clustering is a powerful technique in unsupervised machine learning that requires a measure of similarity to group data samples into classes. Traditional clustering algorithms such as k-means [1] perform clustering on the original data samples directly by assuming that similar samples reside around a centroid. This assumption is noted in [2] as too restrictive in real-world settings, where two samples can be far away from each other yet belong to the same cluster, and vice versa. Furthermore, as real-world data are often high dimensional, the k-means approach is also computationally inefficient. Efforts have therefore been made in the literature [3] to enhance its efficiency. However, the robustness needed for accurate clustering under the aforementioned real-world circumstances was largely ignored, since these methods do not consider data quality.

Therefore, considering that corresponding low-dimensional subspaces exist for high-dimensional data, Zheng et al. [4] and Cai et al. [5] proposed graph regularized sparse coding (GraphSC) and graph regularized nonnegative matrix factorization (GNMF), respectively, to tackle the previously mentioned lapses. The objective of both methods is to enhance discriminability and computational efficiency simultaneously and thereby improve clustering performance. Nonetheless, since they do not fully utilize the self-expressiveness property of data, a good representation of the data cannot be guaranteed in all cases. As a result, subsequent spectral-based approaches were built on the self-expressiveness property of data, whereby a data sample can be represented as a combination of the bases of the whole dataset [6]. Specifically, low-rank representation (LRR) [7] and sparse subspace clustering (SSC) [8] are the two classical methods, and newer ones are built on the principles of one or the other. An example is Laplacian regularized low-rank representation (LapLRR) [9], proposed to complement LRR's focus on the global data structure by further capturing the data's intrinsic nonlinear geometric information. Similarly, Zhang et al. [10] proposed spectral-spatial sparse subspace clustering (S4C) using SSC's ℓ1-norm regularization strategy. In addition, considering the local data structure that LRR previously ignored, low-rank representation with adaptive graph regularization (LRR_AGR) [11] was proposed to learn optimal clustering. However, because real-world data often contain noise, these spectral methods cannot guarantee that two similar samples in the original high-dimensional space will have close representation coefficients in the low-dimensional space.

To address the previously mentioned concern, a new method is proposed that adopts LRR's approach to first learn the data's low-rank coefficient matrix. A regularization term is then introduced to ensure that the samples' neighborhood structure is maintained in the low-dimensional space. Thus, the proposed method learns the optimal clustering structure directly from the low-rank coefficient matrix without spectral postprocessing. This is achieved by imposing a constraint on the low-rank coefficient matrix to promote a robust affinity matrix; a rank constraint is further utilized to make the affinity matrix express the clustering structure. The main contributions of this work are summarized as follows:
(1) We propose a novel method that first uses the LRR strategy to obtain the data's low-rank coefficient matrix and introduces a regularization term to ensure that the samples' neighborhood structure is maintained in the low-dimensional space. This approach differs from existing ones, which simply assume such a structure without a robust strategy to tackle the influence of noise.
(2) We constrain the low-rank coefficient matrix to guide a robust affinity matrix and obtain the optimal clustering directly, avoiding the spectral postprocessing of the affinity matrix required by most existing methods.
(3) Several experiments are performed to evaluate the effectiveness of the proposed method using the accuracy (ACC), normalized mutual information (NMI), and purity evaluation metrics. The results demonstrate the superiority of the proposed method over similar state-of-the-art methods.

2. Related Work

To resolve the inefficiency of traditional clustering methods, several spectral-type methods based on the self-expressiveness property of data have been proposed over the years. These methods can be categorized into two groups according to the strategy used to obtain the low-dimensional coefficient matrix of given data. The first group contains methods like those in [10, 12–14], which use ℓ1-norm regularization [8] to acquire the coefficient matrix. For example, Zhang et al. [10] proposed a spectral-spatial sparse subspace clustering algorithm for hyperspectral remote sensing images, which obtains the final clustering by applying a spectral clustering algorithm to an adjacency matrix. Also using the ℓ1-norm approach, Li et al. [12] proposed structured sparse subspace clustering (S3C), which learns an affinity matrix and the data segmentation jointly to improve clustering accuracy. The main advantage of ℓ1-norm regularization is that it can obtain a sparse representation of the data samples. However, it ignores the data's global structure and can hence be vulnerable to noise.
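For reference, the ℓ1-norm self-expressive model of SSC [8] is typically written as follows (the notation X for the data matrix and C for the coefficient matrix is the standard one in that literature and is used here only for illustration):

\[
\min_{C}\ \|C\|_{1} \quad \text{s.t.} \quad X = XC,\ \ \operatorname{diag}(C) = 0,
\]

where the columns of X are the data samples and each column of C expresses one sample as a sparse combination of the others.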

In contrast, the nuclear norm regularization approach was proposed in [7] to capture the global data structure, thus providing some robustness against noise and possible outliers. The second category therefore contains methods like those in [6, 9, 11, 15–24], which apply the aforementioned nuclear norm technique to learn the coefficient matrix. Illustratively, the LapLRR method was proposed in [9] based on the nuclear norm; it mainly focuses on capturing the data's nonlinear geometric structures to improve clustering performance. The compound rank-k projection (CRP) algorithm [23] was proposed for bilinear analysis; specifically, CRP uses multiple rank-k projection models to enhance discriminant ability. Meanwhile, constrained low-rank representation (CLRR) [15] emphasizes increasing discriminating ability by incorporating supervision information as hard constraints. Since LRR methods are based on affinity graph construction, Luo et al. [24] argue that affinity measurement in the original feature space usually suffers from the curse of dimensionality; thus, they proposed a method that assumes similarity between instances only if they have a large probability of being neighbors. However, because a fixed graph often does not guarantee optimal performance, Wen et al. [11] introduced flexibility into graph learning and proposed the LRR_AGR method. Besides, a technique founded on a finite mixture of exponential power (MoEP) distributions was proposed in [19] to handle complex noise contamination in data. Furthermore, the recent work of [25] introduced an adaptive kernel into LRR to boost clustering accuracy, and a coupled low-rank representation (CLR) strategy was presented in [6] to learn accurate clustering from data using the block diagonal regularizer [26]. Aside from that, Yan et al. [27] proposed a novel self-weighted robust linear discriminant analysis (LDA) for multiclass classification, especially with edge classes.
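For completeness, the standard nuclear-norm model of LRR [7] can be stated as follows (again in the generic notation of that literature, with Z the low-rank coefficient matrix and E the error matrix; the symbols used in this paper's own equations may differ):

\[
\min_{Z,\,E}\ \|Z\|_{*} + \lambda \|E\|_{2,1} \quad \text{s.t.} \quad X = XZ + E,
\]

where the nuclear norm promotes a low-rank coefficient matrix that captures the global structure of the data and the ℓ2,1-norm on E models sample-specific corruptions.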

This study also adopts the nuclear norm regularization approach. However, unlike the previously mentioned techniques, the proposed method further injects a regularization term to guarantee that two similar samples in the original high-dimensional space have close representation coefficients in the low-dimensional space, thereby handling noise distortion in the data more holistically.

3. The Proposed Method

In this section, the proposed method is formulated first. Then, an optimization method is proposed to solve the model.

3.1. Model Formulation

LRR's nuclear norm regularization has been shown in many studies, not limited to those previously cited, to be an effective technique for capturing a robust data coefficient matrix owing to its global orientation. Thus, LRR is integrated into our model to capture the data's low-rank coefficient matrix as in (1), where the coefficient matrix is required to be low rank and an error matrix accounts for the assumption that part of the given data is corrupted. However, because the level of corruption in real-world data is unknown in advance, one cannot guarantee that the coefficient matrix will capture an accurate similarity of the data samples. To address this concern, a regularization term is introduced into (1) to ensure that, when two samples are similar in the original high-dimensional space, their representation coefficients are also similar in the low-dimensional space; the corresponding neighborhood weight is set to 1 for similar sample pairs and 0 otherwise, which yields problem (2). Once the coefficient matrix is obtained, existing methods directly build an affinity matrix from it and then apply a spectral clustering algorithm to the affinity matrix to obtain the clustering structure. Differently, we utilize a constraint to ensure that the coefficients relating any two samples in both directions are the same, together with nonnegativity constraints that keep the entries of the coefficient matrix nonnegative. As a result, spectral postprocessing of the affinity matrix is avoided by imposing a rank constraint on its Laplacian matrix, which allows it to express the clustering structure directly according to Theorem 1. Therefore, the proposed model is formulated as in (3), where the Laplacian matrix is built from the affinity matrix and its diagonal degree matrix, and balance parameters weight the respective terms.
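To make the neighborhood-preserving regularization concrete, the sketch below builds a binary k-nearest-neighbor weight matrix in the original space (weight 1 for similar samples, 0 otherwise, as described above) and evaluates the induced smoothness penalty on the columns of a coefficient matrix. The function names, the choice of a symmetric k-NN graph, and the use of the unnormalized Laplacian are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def knn_weight_matrix(X, k=5):
    """Binary k-NN affinity: w_ij = 1 if x_j is among the k nearest neighbors
    of x_i (symmetrized), 0 otherwise.  X is d x n, samples stored as columns."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2 * X.T @ X   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(dist[i])[1:k + 1]           # skip the sample itself
        W[i, idx] = 1.0
    return np.maximum(W, W.T)                        # keep the graph symmetric

def smoothness_penalty(Z, W):
    """Regularization value sum_ij w_ij ||z_i - z_j||^2 = 2 * tr(Z L Z^T);
    it is small when neighboring samples receive similar coefficients."""
    L = np.diag(W.sum(axis=1)) - W                   # unnormalized graph Laplacian
    return 2.0 * np.trace(Z @ L @ Z.T)
```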

Theorem 1 (see [6]). If the affinity matrix is nonnegative, the multiplicity of the zero eigenvalue of its graph Laplacian equals the number of connected components in the graph associated with it.
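Theorem 1 can be checked numerically with a few lines of code (our own illustration, not part of the original derivation): for a nonnegative affinity matrix whose graph has two connected components, the graph Laplacian has exactly two zero eigenvalues.

```python
import numpy as np

# Block-diagonal nonnegative affinity matrix: two connected components.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 2],
              [0, 0, 0, 2, 0]], dtype=float)

L = np.diag(W.sum(axis=1)) - W            # unnormalized graph Laplacian
eigvals = np.linalg.eigvalsh(L)           # L is symmetric positive semidefinite
num_zero = int(np.sum(np.isclose(eigvals, 0.0, atol=1e-8)))
print(num_zero)                           # -> 2, the number of connected components
```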

3.2. Optimization

To solve (3), the augmented Lagrange multiplier (ALM) [28, 29] approach is adopted, but an auxiliary variable is introduced first to make the problem easily solvable. Besides, because the rank constraint is not linear, a strategy similar to that employed in [6] is followed, replacing the rank constraint with the sum of the c smallest eigenvalues of the Laplacian matrix. Thus, given a sufficiently large penalty, (3) is equivalent to (4).

According to Ky Fan's theorem [30], the sum of the c smallest eigenvalues of the Laplacian is equivalent to minimizing a trace term over a matrix with orthonormal columns. Hence, we obtain (5).
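For clarity, Ky Fan's theorem as it is usually invoked in this setting states, in generic notation, that the sum of the c smallest eigenvalues of a symmetric positive semidefinite Laplacian L equals a trace minimization over matrices with orthonormal columns:

\[
\sum_{i=1}^{c} \sigma_{i}(L) \;=\; \min_{F \in \mathbb{R}^{n \times c},\; F^{\top}F = I} \operatorname{Tr}\left(F^{\top} L F\right),
\]

where σ_i(L) denotes the i-th smallest eigenvalue; the minimizer F consists of the corresponding eigenvectors.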

Following conventional practice, the augmented Lagrangian function of (5) is obtained as follows:

At this point, the terms that are not coupled in (6) are separated, where the four auxiliary matrices introduced in (6) are the Lagrangian multipliers. Hence, the optimal value of each variable is obtained in the following order.

3.2.1. Problem

When the other variables are fixed, this variable can be obtained by minimizing the following formula:

By setting the corresponding partial derivative of the Lagrangian to zero, the variable is obtained as follows:

3.2.2. Problem

When the other variables are fixed, this variable can be obtained by minimizing the following formula:

By setting the corresponding partial derivative of the Lagrangian to zero, the variable is obtained as follows:

3.2.3. Problem

This variable can be obtained by solving the following problem with the other variables fixed:

Then, the solution is obtained by applying the singular value thresholding (SVT) operator [31], giving (12).
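A minimal sketch of the SVT operator used in this step is given below; it is the proximal operator of the nuclear norm, following [31]. The function name and the threshold argument tau are our own choices.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: the closed-form minimizer of
    tau * ||X||_* + 0.5 * ||X - A||_F^2 over X."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)    # soft-threshold the singular values
    return (U * s_shrunk) @ Vt             # rebuild the matrix with the shrunken spectrum
```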

3.2.4. Problem

This variable is obtained by solving the following minimization problem:

Denote T = , .

Each column of the solution is then given by

3.2.5. Problem

This variable is obtained by solving the following minimization problem:

For simplicity, the objective is first expressed in an equivalent form; then, we rewrite (15) as

Denoting , we get

Considering the previously mentioned constraints, we have the following:

Then, taking the derivative with respect to the variable and setting it to zero, we have

The j-th entry of the solution is given as follows:

According to the KKT conditions,

3.2.6. Problem

When the other variables are fixed, this variable can be obtained by solving the following minimization problem, which involves the Laplacian matrix of the current affinity matrix. This problem can be simply solved via eigenvalue decomposition, and its solution is the set of c eigenvectors corresponding to the c smallest eigenvalues of the Laplacian.
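A short sketch of this eigenvector step follows (assuming, as in comparable formulations, that the variable being updated is an n × c matrix with orthonormal columns and that L denotes the symmetric Laplacian of the current affinity matrix; the names are placeholders):

```python
import numpy as np

def update_F(L, c):
    """Return the c eigenvectors of the Laplacian L associated with its c smallest
    eigenvalues; they minimize Tr(F^T L F) subject to F^T F = I (Ky Fan's theorem)."""
    _, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
    return eigvecs[:, :c]
```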

The complete solution is given in Algorithm 1.

Input: Data, number of clusters c, and regularization parameters
Initialize:; ; ;  =  ; ;
While not converged do
(1) Update by equation (8);
(2) Update by equation (10);
(3) Update by equation (12);
(4) Update by equation (14);
(5) Update by equation (21);
(6) Update by equation (22);
   
   
   
(11) Update by
(12) Check the convergence conditions:
     , and
Output:
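Because the variable symbols in Algorithm 1 were lost in typesetting, the following runnable sketch only illustrates the general ALM/ADMM structure the algorithm follows (alternating closed-form block updates, Lagrange multiplier updates, penalty growth, and a convergence check). It solves the plain LRR problem rather than the full model with graph regularization and rank constraint, and all names and parameter values are our own assumptions.

```python
import numpy as np

def l21_shrink(A, tau):
    """Column-wise shrinkage: proximal operator of tau * ||.||_{2,1}."""
    norms = np.linalg.norm(A, axis=0)
    scale = np.maximum(norms - tau, 0.0) / (norms + 1e-12)
    return A * scale

def svt(A, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def lrr_alm(X, lam=0.1, max_iter=500, tol=1e-7, mu=1e-2, rho=1.1, mu_max=1e8):
    """Inexact ALM for  min ||Z||_* + lam*||E||_{2,1}  s.t.  X = XZ + E,  Z = J."""
    d, n = X.shape
    Z = np.zeros((n, n))
    J = np.zeros((n, n))
    E = np.zeros((d, n))
    Y1 = np.zeros((d, n))   # multiplier for X = XZ + E
    Y2 = np.zeros((n, n))   # multiplier for Z = J
    XtX = X.T @ X
    for _ in range(max_iter):
        # closed-form update of each block in turn
        J = svt(Z + Y2 / mu, 1.0 / mu)
        Z = np.linalg.solve(np.eye(n) + XtX,
                            X.T @ (X - E) + J + (X.T @ Y1 - Y2) / mu)
        E = l21_shrink(X - X @ Z + Y1 / mu, lam / mu)
        # Lagrange multiplier and penalty parameter updates
        R1 = X - X @ Z - E
        R2 = Z - J
        Y1 = Y1 + mu * R1
        Y2 = Y2 + mu * R2
        mu = min(rho * mu, mu_max)
        # convergence check on the constraint violations
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:
            break
    return Z, E
```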
3.3. Computational Complexity

Matrix inversion, eigendecomposition, and the nuclear norm minimization solved via SVT are the three key operations that determine the computational cost of the proposed algorithm. The cost of the eigendecomposition depends on the number of clusters, while the matrix inversion and the nuclear norm minimization each contribute the dominant per-iteration cost. As a result, the total computational complexity of each iteration is governed by these operations, because the cost of computing the multipliers and of basic matrix operations such as addition, subtraction, and division is negligible in comparison.

4. Experiments

4.1. Experimental Settings

In order to demonstrate the effectiveness of the proposed method, several experiments were performed on the COIL20 (https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php), UCI (https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits), ORL (http://cam-orl.co.uk/facedatabase.html), FERET (https://www.nist.gov/itl/products-and-services/color-FERET-database), and BBC (http://mlg.ucd.ie/datasets/segment.html) datasets (see Table 1 for a summary of each dataset and Figure 1 for sample images). Using the accuracy (ACC), normalized mutual information (NMI), and purity (PUR) metrics, the performance of the proposed method was evaluated against Kmeans, LapLRR [9], GraphSC [4], GNMF [5], LRR_AGR [11], and nonnegative self-representation (NSFRC) [32], which are state-of-the-art methods. For each method, the parameter settings reported in the corresponding literature were adopted. The compared methods are briefly described as follows (a sketch of the evaluation metrics is given after this list):
LapLRR [9]: uses LRR's nuclear norm strategy to capture the data's intrinsic nonlinear geometric information.
GraphSC [4]: considers the local manifold structure of the data to learn a sparse representation.
GNMF [5]: constructs an affinity graph to encode the geometrical information of the data.
LRR_AGR [11]: learns optimal clustering by considering the local data structure that LRR previously ignored.
NSFRC [32]: uncovers the data's intrinsic structure by joint nonnegative self-representation and adaptive distance regularization.
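As referenced above, the three evaluation metrics can be computed as follows. This is a standard implementation (Hungarian matching from SciPy for ACC and scikit-learn's NMI), not code taken from any of the compared works.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score  # NMI

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between predicted clusters and true labels,
    found with the Hungarian algorithm on the contingency table."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    _, t = np.unique(y_true, return_inverse=True)
    _, p = np.unique(y_pred, return_inverse=True)
    C = np.zeros((p.max() + 1, t.max() + 1), dtype=int)
    for i, j in zip(p, t):
        C[i, j] += 1
    rows, cols = linear_sum_assignment(-C)        # maximize matched counts
    return C[rows, cols].sum() / len(y_true)

def purity(y_true, y_pred):
    """PUR: assign every cluster to its majority true class and count the hits."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    hits = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        hits += counts.max()
    return hits / len(y_true)

# NMI is available directly: normalized_mutual_info_score(y_true, y_pred)
```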

4.2. Experimental Results

This section analyzes the results of the various experiments conducted to evaluate the proposed method's effectiveness. Note that the best results are highlighted in bold in the tables.

4.2.1. Clustering Performance on Original Data

Tables 2–6 display the clustering results in terms of ACC, NMI, and PUR for the different algorithms on the COIL20, UCI, ORL, FERET, and BBC benchmark datasets, respectively. On the COIL20 dataset, the proposed method achieves the best result on all evaluation metrics, followed closely by NSFRC in ACC and LRR_AGR in NMI. This performance is not surprising, because LRR_AGR considers the local structure of the data to improve its performance, while NSFRC uses an adaptive affinity matrix learning approach to uncover the intrinsic structure of the data and boost clustering performance. It is also unsurprising that Kmeans has the worst performance on the COIL20 dataset, because it performs clustering directly on the original data, which promotes noise interference in the clustering structure. Furthermore, one may observe that Kmeans achieves its best performance on the UCI dataset compared to the other datasets; however, its performance is still far below that of the other methods, especially the proposed method and NSFRC. It can also be observed that the performance of NSFRC and LRR_AGR is consistently close to that of the proposed method on the ORL, FERET, and BBC datasets. Specifically, on the ORL dataset, NSFRC and LRR_AGR achieve ACC of 85.45% and 81.45%, respectively, which is lower than that of the proposed method by 1.28% and 5.28%, respectively. This trend is also maintained on the relatively difficult FERET and BBC datasets. Overall, the proposed method achieves the best performance, demonstrating the effectiveness of the regularization term in our model.

4.2.2. Clustering Performance on Data Corrupted with Pepper Noise and Stripe Occlusion

In this section, several experiments are performed to evaluate the robustness of each algorithm against noise and occlusion. Two settings were adopted. First, stripe occlusion of various degrees (0%, 5%, 10%, and 15%) was randomly applied to the ORL and FERET datasets. Second, pepper noise of the same degrees was randomly applied to the UCI dataset. Figures 2–4 present each algorithm's performance on the corrupted ORL, FERET, and UCI datasets, respectively. Although it can be seen clearly from these figures that all methods degrade as the corruption level increases, our proposed method shows more robustness than the other methods on all three datasets. More specifically, the proposed method's performance drops more steadily on the ORL dataset than the sharp drop exhibited by NSFRC in the ACC metric. It can also be observed that its NMI performance with 10% stripe occlusion is better than in the 5% case, which confirms that the proposed method can guarantee superior robustness to a large extent. Besides, its performance on the UCI dataset under random pepper noise corruption further validates the above, with a clear improvement over the other compared methods.
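For reproducibility, the two corruption schemes can be generated roughly as follows (our own sketch; the exact stripe placement, orientation, and noise convention used in the experiments are not specified in the text).

```python
import numpy as np

def add_pepper_noise(img, ratio, rng=None):
    """Set a random fraction `ratio` of pixels to the minimum intensity (pepper)."""
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    mask = rng.random(img.shape) < ratio
    out[mask] = img.min()
    return out

def add_stripe_occlusion(img, ratio, rng=None):
    """Zero out a randomly placed horizontal stripe covering `ratio` of the rows."""
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    rows = int(round(ratio * img.shape[0]))
    if rows == 0:
        return out
    start = rng.integers(0, img.shape[0] - rows + 1)
    out[start:start + rows, :] = 0
    return out
```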

4.3. Parameter Sensitivity

This section presents experiments showing the sensitivity of the proposed method's clustering ACC to its parameters. The proposed method has three regularization parameters that need to be set in advance; they balance the importance of the low-rank constraint term, the error term, and the rank constraint, respectively. Generally, the larger a parameter value is, the more impact the corresponding term has. To demonstrate the effect of these three parameters on data clustering, a candidate range is first defined for each parameter, and the proposed method is then run with different combinations of the parameters. We first fixed two of the parameters and executed the proposed method with different values of the third to show its influence on the clustering ACC.

From Figure 5, it is obvious that the clustering ACC is insensitive to the rank-constraint parameter within a suitable range. This is mainly because, if this parameter is too large, the corresponding rank constraint term plays the dominant role in graph learning while the local and global structure preservation is ignored. In this case, although the obtained graph still has the corresponding connected components, it cannot reveal the intrinsic structure of the data. In the experiments, a small value can therefore be selected for this parameter. Figure 6 shows the clustering ACC versus different values of the other two parameters when the rank-constraint parameter is fixed. As shown in the figure, the clustering ACC is sensitive to these parameters to some extent, and the best clustering results are obtained when both lie in a feasible range. This is mainly because a very large or very small value leads to an error term that is too small or too large to compensate well for the sparse data noise. In this case, the model cannot learn the intrinsic similarity graph for data clustering. Thus, in the experiments, these two parameters can be selected from the candidate range according to the degree of noise corruption of the data.

As far as we know, adaptively selecting optimal parameters for different datasets is still an open problem. In the experiments, we first fix the rank-constraint parameter, since the clustering ACC is insensitive to it, and then run the method to find the optimal values of the other two parameters in a candidate domain where the optimum may lie. Then, by a similar strategy, we fix those two parameters to find the optimal value of the rank-constraint parameter in its candidate domain. Finally, the optimal combination of these parameters is obtained in the 3D candidate space composed of the three candidate domains.
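The search strategy described above can be sketched as follows, where cluster_and_score is a hypothetical user-supplied callable that runs the proposed method with a given parameter triple and returns the clustering ACC; the names lam1, lam2, and gamma are placeholders for the paper's three regularization parameters.

```python
import itertools

def parameter_search(cluster_and_score, grid, gamma_init=1e-3):
    """Coordinate-style search: fix the (insensitive) rank-constraint parameter,
    grid-search the two balance parameters jointly, then tune the fixed one."""
    # step 1: fix gamma and jointly search the two balance parameters
    lam1, lam2 = max(itertools.product(grid, repeat=2),
                     key=lambda p: cluster_and_score(p[0], p[1], gamma_init))
    # step 2: fix the balance parameters and search the rank-constraint parameter
    gamma = max(grid, key=lambda g: cluster_and_score(lam1, lam2, g))
    return lam1, lam2, gamma
```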

4.4. Convergence Study

The convergence of ADMM-style algorithms with two blocks has been proven in general in [29]. However, Algorithm 1 involves six blocks and its objective function is not smooth, so it is not easy to prove convergence in theory. Three of the subproblems have clear analytical solutions, and with continued iteration they can be expected to reach their optimal values. One subproblem can be solved through its characteristic equations, another is solved with the SVT method [33, 34], and for the remaining subproblem the effectiveness and convergence of the solution have also been confirmed [35]. Therefore, the algorithm can be expected to reach a local optimal solution. In Figure 7, we further verify the algorithm's convergence empirically on real data.

Furthermore, Table 7 lists each method's average runtime on three benchmark datasets. It can be observed that, although the proposed method is not the fastest, its computational time is at the same level as that of most of the compared methods.

5. Conclusion

A novel method is proposed in this study that injects a regularization term into LRR to ensure that the neighborhood structure of data samples in the original high-dimensional space is also maintained in the low-dimensional space. With this strategy, the proposed method guarantees robust clustering performance and resolves the limitation of most existing methods. This is demonstrated experimentally, with several results showing that the proposed method substantially outperforms similar state-of-the-art methods in terms of the ACC, NMI, and PUR evaluation metrics. In future work, our approach will be extended to multiview learning, which several works [36] show can improve single-view learning models.

Data Availability

All datasets used in this paper are open source, meaning they are freely available for research purposes. They can be accessed using the following links. COIL-20: https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php; UCI: https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits; ORL: http://cam-orl.co.uk/facedatabase.html; FERET: https://www.nist.gov/itl/products-and-services/color-FERET-database; and BBC: http://mlg.ucd.ie/datasets/segment.html.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was funded in part by the National Natural Science Foundation of China (No. 61572240).