Abstract

Correlation learning is a technique utilized to find a common representation in cross-domain and multiview datasets. However, most existing methods are not robust enough to handle noisy data, so the learned common representation matrix can easily be corrupted by noisy samples inherent in the different instances of the data. In this paper, we propose a novel correlation learning method based on low-rank representation, which learns a common representation between two instances of data in a latent subspace. Specifically, we first learn a low-rank representation matrix and an orthogonal rotation matrix to handle the noisy samples in one instance of the data so that a second instance of the data can linearly reconstruct the low-rank representation. Our method then finds a similarity matrix that closely approximates the common low-rank representation matrix, on which a rank constraint on the Laplacian matrix reveals the clustering structure explicitly without any spectral postprocessing. Extensive experimental results on the ORL, Yale, Coil-20, Caltech 101-20, and UCI digits datasets demonstrate that our method outperforms other state-of-the-art methods on six evaluation metrics.

1. Introduction

Cross-domain and multiview datasets can help to improve accuracy in typical clustering problems, such as face image clustering [1–5] and object clustering [6–8]. The reason is that different instances of cross-domain and multiview datasets often share a common property that is useful in the overall clustering task [9, 10]. As such, several correlation learning techniques, including canonical correlation analysis (CCA) [11–15], cotraining [15–17], nonnegative matrix factorization [18, 19], and subspace clustering [20–26], have been utilized over the years to find a common representation in cross-domain and multiview datasets. For example, CCA [11] learns the linear relations between two sets of correlated variables, with a nonlinear variant proposed in [12]. The study in [13] builds on CCA, using it to obtain a low-dimensional subspace spanned by a set of correlated but different datasets. In addition, Qin et al. [14] utilized cluster CCA to align domains in a projected correlated subspace. Cotraining [16] is another classical method that maximizes the agreement between two distinct views of data to aid the learning process by alternately adapting the knowledge gained from one view to the second view. Motivated by the success of cotraining, Kumar and Daumé [17] incorporated its idea into multiview spectral clustering by learning a common clustering matrix that agrees across different views. Furthermore, Sun and Jin [15] proposed a method named robust cotraining, which combines the principles of cotraining and CCA, where CCA is used to analyze the labels predicted by cotraining on unlabeled samples. While the clustering methods based on CCA [13, 14] and cotraining [17] are shown to be effective, their clustering performance may degrade with noisy data, which is ubiquitous in practice [27].

Recently, subspace clustering methods [20–24] have provided a more robust solution to the abovementioned shortcomings. These methods utilize either low-rank representation (LRR) [28, 29] or a combination of LRR and sparse subspace clustering (SSC) [30] to minimize the discrepancy between cross-domain and multiview datasets in a latent subspace. Specifically, they exploit the self-expressiveness property, selecting similar samples in the original dataset to reconstruct each other, to find a low-dimensional representation. This is possible because high-dimensional data are assumed to lie in multiple low-dimensional subspaces, and the task becomes finding those subspaces together with their cluster members [31]. Accordingly, Xia et al. [20] combined LRR and SSC techniques to learn a low-rank transition matrix shared across multiple views, which is used subsequently for clustering. Similarly, Ding and Fu and Zhang et al. [22, 24] utilized the LRR technique to pursue a domain-invariant representation that agrees across domains. Brbic and Kopriva [23] used the combination of LRR and SSC to find a low-dimensional representation matrix for each view; these structures are then used to obtain a joint similarity matrix that balances the agreement between the different views. However, the methods mentioned above find the common representation matrix shared across views by assuming that similar samples in each view reside near each other. While this assumption may hold, it becomes problematic in practice when the data samples in each view are randomly distributed and noisy. In that case, two unrelated data samples are more likely to be selected to reconstruct one another, and the resulting common representation matrix can be faulty enough to degrade clustering performance. The recent work in [26], which extends [25], attempts to resolve this issue by allowing a common similarity matrix to approximate the low-dimensional embedding matrices of the different views. However, this method still ignores the proximity issue mentioned above when constructing the individual view structures, which allows noise to propagate into the different low-dimensional embedding matrices.

To this end, in this paper, we propose a novel correlation learning method, which finds a common low-rank matrix between two different instances of data in a latent subspace. The core idea is that we learn this common low-rank matrix from one instance of the data in such a way that a second instance can linearly reconstruct it. First, we consider the real-world scenario where two instances of data have different dimensions. Hence, we utilize matrix factorization to align a low-rank matrix learned in one instance with the other through an orthogonal constraint. This allows the low-rank matrix to become a common one shared between the two domains and avoids the propagation of noise across different low-rank matrices. Our method then finds an ideal similarity matrix that closely approximates the common low-rank matrix, with a rank constraint on the Laplacian matrix so that the clustering structure is revealed explicitly without spectral clustering postprocessing. Extensive experimental results on the ORL, Yale, Coil-20, Caltech 101-20, and UCI digits datasets demonstrate that our method outperforms other state-of-the-art methods on six evaluation metrics, namely, accuracy (ACC), normalized mutual information (NMI), adjusted Rand index (AR), F-score, precision, and recall.

Our major contributions are summarized as follows:
(1) We propose a novel method based on LRR for correlation learning between two instances of data. Specifically, our method learns a common low-rank matrix from one instance of the data and uses matrix factorization to align the second instance with it. In this way, our method avoids the propagation of noise across different low-rank matrices, which would otherwise produce a faulty common matrix.
(2) Our method obtains a clustering structure without spectral postprocessing of the low-dimensional embedding matrix. To achieve this, we find an ideal similarity matrix that best approximates the common low-rank matrix such that a rank constraint on the Laplacian matrix reveals the clustering structure explicitly.
(3) Extensive experimental results on five benchmark datasets demonstrate that our method outperforms other state-of-the-art methods on six evaluation metrics.

2. Related Work

The purpose of correlation learning in cross-domain and multiview clustering tasks is to learn a common representation that maximizes the agreement between different instances or distributions of given data to improve clustering performance [17]. Many studies [27, 31–33] provide an extensive review of the existing methods used to learn the correlation between cross-domain and multiview datasets. Among them are two classical methods, namely, CCA [11] and cotraining [16], which have inspired several other multiple instance-based methods such as [13, 17]. While [13] is built on CCA principles, [17] is based on cotraining, in which a common structure is learned in a low-dimensional subspace to reduce the disagreement between distinct data views. The more recent subspace clustering methods [22, 24, 34], built on LRR [28, 29], provide more robust approaches. These methods construct a common similarity matrix to align the different views or cross-domain datasets using the low-rank representation of each view, where similar data samples are selected to reconstruct each other linearly [35]. For simplicity, assume we have a single-view data matrix X with some noisy samples. LRR can obtain a low-rank representation matrix Z using nuclear norm regularization [36] as follows:

min_{Z, E}  ||Z||_* + lambda ||E||_{2,1}   s.t.   X = X Z + E,     (1)

where X serves as the self-dictionary, ||.||_* denotes the nuclear norm, and E denotes the error matrix. Upon obtaining Z for each view in a multiview or cross-domain setting, the subspace clustering methods mentioned above can find a common similarity matrix to balance the agreement between the different Z matrices. Besides, some methods, such as [20, 23], combine both LRR and SSC [30] to learn a common similarity matrix.
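As a concrete illustration of equation (1) in the noiseless special case (our own sketch, assuming clean data so that the error term vanishes), the minimizer of the nuclear norm subject to exact self-reconstruction is the well-known shape interaction matrix obtained from the skinny SVD of the data:

```python
# A minimal sketch (assuming the noiseless case X = XZ) of the closed-form
# LRR solution: the minimizer of ||Z||_* subject to X = XZ is the shape
# interaction matrix Z = V V^T, where X = U S V^T is the skinny SVD of X.
import numpy as np

def lrr_noiseless(X, tol=1e-10):
    """Closed-form low-rank representation for clean data (columns = samples)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = np.sum(s > tol)            # numerical rank of X
    V = Vt[:r].T                   # right singular vectors spanning the row space
    Z = V @ V.T                    # n x n low-rank self-representation
    return Z

# Usage: columns of X drawn from a union of low-dimensional subspaces
X = np.random.randn(50, 5) @ np.random.randn(5, 200)   # rank-5 toy data
Z = lrr_noiseless(X)
print(np.linalg.norm(X - X @ Z))   # ~0: X reconstructs itself through Z
```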

Regardless, the subspace clustering methods mentioned above use a two-phase learning approach, obtaining a clustering structure by applying spectral clustering [37] to the learned similarity matrix. In this case, the clustering performance can degrade when the similarity matrix produced in the first phase is faulty. Considering this drawback, Yang et al. [38] utilized block diagonal constraints to encourage a proper common matrix, and Liang et al. [39] explored the diversity between domains to improve performance. Nonetheless, [38, 39] are still two-phase methods. In contrast, the method in [26] and the more recent one in [40] obtain a clustering structure directly, without applying spectral clustering to the consensus similarity matrix. They both achieve this by imposing a rank constraint on the Laplacian matrix to guarantee exactly c connected components.

Our proposed method is related to the LRR-based subspace clustering methods because we also obtain a similarity matrix through a low-rank representation. However, unlike these methods, our method learns a common low-rank representation matrix from only one instance of the data. Then, following an approach similar to [26, 40], we directly pursue a block diagonal clustering structure without performing any spectral postprocessing.

3. Proposed Method

In this section, we present our proposed low-rank correlation representation and clustering method. First, we formulate the common low-rank matrix and then incorporate clustering directly into our model.

3.1. Model Formulation

Given two multi-instance datasets whose feature dimensions differ but which share the same number of samples, a naive approach would be to learn a separate low-rank matrix for each instance using equation (1) and then merge the two matrices into a common one. This approach, however, has a limitation: the common matrix becomes fallible if either of the low-rank matrices is faulty. We therefore suppose that a common matrix can be obtained differently, through only one instance of the data, since multiview data originate from one underlying latent subspace [41]. This is nevertheless tricky for two reasons. First, the dimensions of the two instances are different. Second, the redundant instance-specific structure and the common structure are mixed together in each instance. To tackle this, we utilize matrix factorization [42, 43] to adaptively find a low-rank matrix that captures the structure common to both instances. We first obtain the common structure from the first instance in equation (2), where U denotes a low-rank matrix and P is a factorized variable, constrained to be column orthogonal (P^T P = I), that makes the factorization possible. Then, U can be perceived to hold the common structure of the second instance as well through the model in equation (3), where Q plays a role similar to P and aligns the common structure of the two domains under the orthogonal constraint Q^T Q = I, so that U captures an accurate manifold structure of both instances through P and Q.
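To illustrate the factorization form only (this is not the adaptive procedure defined by equations (2) and (3)), note that any data matrix can be written as a column-orthogonal factor times a low-rank factor, for example via a truncated SVD:

```python
# An illustration of the "column-orthogonal factor times low-rank factor"
# form via truncated SVD; variable names are our own and this is not the
# adaptive learning procedure of equations (2)-(3).
import numpy as np

def orthogonal_lowrank_factorization(X, rank):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = U[:, :rank]                        # d x r factor with P^T P = I
    U_low = np.diag(s[:rank]) @ Vt[:rank]  # r x n low-rank factor
    return P, U_low

X1 = np.random.randn(200, 300)             # one data instance, d1 x n
P, U_low = orthogonal_lowrank_factorization(X1, rank=20)
print(np.allclose(P.T @ P, np.eye(20)))     # the orthogonality constraint holds
```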

Upon learning U, our method finds an ideal similarity matrix S by making it approximate U as closely as possible. We formulate our model accordingly in equation (4), where a nonnegativity constraint ensures that all entries of S are nonnegative, and the remaining constraint normalizes S so that its Laplacian is the normalized Laplacian matrix. The rank constraint on this Laplacian then allows S to become the clustering structure with exactly c connected components, avoiding spectral postprocessing of the low-dimensional embedding matrix, through the following theorem [44, 45].

Theorem 1. If the similarity matrix S is nonnegative, then the multiplicity of the eigenvalue zero of its Laplacian matrix equals the number of connected components in the graph associated with S.

A detailed proof of Theorem 1 is given in Proposition 2 of [45].
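The following small numerical check (our own illustration, with assumed variable names) verifies Theorem 1 on a toy similarity matrix with two disconnected blocks: its normalized Laplacian has exactly two zero eigenvalues.

```python
# Numerical check of Theorem 1: for a nonnegative similarity matrix with two
# disconnected blocks, the graph Laplacian has exactly two zero eigenvalues.
import numpy as np

def normalized_laplacian(S):
    S = (S + S.T) / 2                      # symmetrize the similarity matrix
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt

# Block-diagonal similarity: two connected components of sizes 4 and 3
S = np.zeros((7, 7))
S[:4, :4] = 1.0
S[4:, 4:] = 1.0
eigvals = np.linalg.eigvalsh(normalized_laplacian(S))
print(np.sum(eigvals < 1e-8))              # prints 2 = number of components
```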

3.2. Discussion

Here, we discuss the importance of the matrices P and Q, constrained to be column orthogonal via P^T P = I and Q^T Q = I, respectively, in our model. In equation (2), the low-rank matrix U is learned from the first instance such that it captures the structure common to both instances. This is possible because U is allowed to differ from the first instance only through the rotation P rather than arbitrarily. Then, Q is used in equation (3) to adaptively match the common structure inside the second instance, in its original dimension, to U. For this, we suppose that the relative error of the reconstruction term should be very small for U to capture the common structure of both instances correctly. Therefore, the proposed method offers more discriminative ability in practice: U is learned adaptively with the nuclear norm, and soft thresholding under this norm [46] sets small singular values to zero, suppressing the noisy data. This approach differs from existing ones, where the data samples themselves are used as dictionaries to learn a low-dimensional representation for each instance before a common space can be found.

3.3. Optimization

We propose an optimization method built on augmented Lagrange multipliers [47] to solve equation (4) by iteratively updating all variables. To make equation (4) easier to solve, we first consider the c smallest eigenvalues of the Laplacian of S; by Proposition 1 of [45], these eigenvalues are nonnegative because the Laplacian is a positive semidefinite matrix. Therefore, given a large enough penalty value, equation (4) is equivalent to the relaxed problem in equation (5), in which the rank constraint is replaced by a penalty on the sum of the c smallest eigenvalues of the Laplacian.

Furthermore, when the penalty value is large enough, the sum of the c smallest eigenvalues of the Laplacian is driven to zero, so that the rank constraint is satisfied. According to Ky Fan's theorem [48], minimizing this sum is equivalent to minimizing Tr(F^T L F) subject to F^T F = I, where L denotes the Laplacian of S, because an optimal F contains the eigenvectors corresponding to the c smallest eigenvalues of L. For a better understanding of Ky Fan's theorem, Zhan et al. [26] provided a simplified version as follows.

Theorem 2. Let the eigenvalues of L be ordered as lambda_1 <= lambda_2 <= ... <= lambda_n, with corresponding eigenvectors v_1, v_2, ..., v_n. Then the inequality sum_{i=1}^{c} lambda_i <= sum_{i=1}^{c} f_i^T L f_i holds for any orthonormal vectors f_1, f_2, ..., f_c.
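As a sketch of how Theorem 2 is used (our own illustration, with assumed variable names), the minimizer of Tr(F^T L F) over column-orthonormal F is formed by the eigenvectors of L associated with its c smallest eigenvalues, and the minimum value equals the sum of those eigenvalues:

```python
# Ky Fan step: the minimizer of Tr(F^T L F) over F with F^T F = I is given by
# the eigenvectors of L for its c smallest eigenvalues; the minimum equals
# the sum of those eigenvalues.
import numpy as np

def ky_fan_minimizer(L, c):
    eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalues for symmetric L
    F = eigvecs[:, :c]                     # n x c matrix with orthonormal columns
    return F, eigvals[:c].sum()

# Toy check of the trace value
L = np.random.randn(10, 10)
L = L @ L.T                                # symmetric positive semidefinite
F, val = ky_fan_minimizer(L, 3)
print(np.allclose(np.trace(F.T @ L @ F), val))   # True
```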

Thus, we rewrite equation (5) as equation (6), and we also introduce an intermediate variable J to make equation (6) easier to solve, which yields equation (7).

The augmented Lagrangian of equation (7) is given in equation (8), where the Lagrange multiplier is attached to the introduced equality constraint. We then divide equation (8) into several subproblems and update each one while fixing the others, in the following order.

3.3.1. J Subproblem

We solve equation (9) by taking the singular value decomposition (SVD) [46] of the involved matrix and soft-thresholding its singular values, which yields the closed-form update for J in equation (10).
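A minimal sketch of the singular value thresholding step behind this update, assuming the J subproblem takes the standard nuclear-norm proximal form min_J ||J||_* + (mu/2)||J - M||_F^2 with threshold 1/mu (the exact penalty used in equation (9) is an assumption here):

```python
# Singular value thresholding (SVT): the closed-form proximal operator of the
# nuclear norm. The threshold tau = 1/mu is an assumption about the penalty.
import numpy as np

def svt(M, tau):
    """Proximal operator of tau * nuclear norm evaluated at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)      # soft-threshold the singular values
    return U @ np.diag(s_shrunk) @ Vt

M = np.random.randn(30, 20)
J = svt(M, tau=0.5)
print(np.linalg.matrix_rank(J) <= np.linalg.matrix_rank(M))  # True: rank never grows
```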

3.3.2. P Subproblem

The orthogonal Procrustes problem in equation (11) is difficult to solve directly because the feasible set satisfying P^T P = I is not convex. Fortunately, the SVD [46] provides a unique closed-form solution: P is updated from the SVD of the corresponding cross-term matrix, as given in equation (12).
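A minimal sketch of the SVD-based Procrustes solution, assuming the subproblem reduces to maximizing Tr(P^T M) over P with P^T P = I for some cross-term matrix M (the exact form of equation (11) is an assumption here); the same trick applies to the Q subproblem below:

```python
# Orthogonal Procrustes step: maximize Tr(P^T M) subject to P^T P = I, whose
# solution is P = U V^T from the thin SVD M = U S V^T.
import numpy as np

def procrustes(M):
    """Closest column-orthogonal matrix to M in the trace sense."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

M = np.random.randn(100, 10)                # e.g. a d x k cross-term matrix
P = procrustes(M)
print(np.allclose(P.T @ P, np.eye(10)))     # True: the constraint holds exactly
```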

3.3.3. Q Subproblem

The Q subproblem can be solved in the same way as the P subproblem; the resulting update is given in equation (13).

3.3.4. U Subproblem

Setting the derivative of the Lagrangian with respect to U to zero and solving the resulting linear system, we obtain the closed-form update for U.

3.3.5. S Subproblem

Since F in equation (16) contains the eigenvectors corresponding to the c smallest eigenvalues of the Laplacian of S, we rewrite equation (16) as equation (17).

Then, with a suitable change of variables, equation (17) can be rewritten as equation (18).

Similar to [26], the optimal solution is obtained by a Euclidean projection onto the simplex, as given in equation (19).
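For reference, the following sketch implements the standard Euclidean projection onto the probability simplex, which is the routine behind row-wise updates of S of the kind used in equation (19); the row-wise simplex constraint is an assumption consistent with [26]:

```python
# Euclidean projection onto the probability simplex (Duchi et al., 2008),
# assuming each row of S is constrained to be nonnegative and sum to one.
import numpy as np

def project_simplex(v):
    """Project v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

v = np.array([0.4, 1.2, -0.3, 0.1])
x = project_simplex(v)
print(x, x.sum())                           # nonnegative entries summing to 1
```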

3.4. Complexity Analysis

The computational complexity of our proposed method, summarized in Algorithm 1, is determined mainly by the nuclear norm calculation in equation (10), the matrix inverse and multiplications in equation (15), and the Euclidean projection onto the simplex in equation (19). The nuclear norm step costs O(n^3) due to the SVD, and the inverse of an n x n matrix in equation (15) also consumes O(n^3). The time complexity of a single matrix multiplication is likewise O(n^3); since equation (15) involves several multiplications, the overall multiplication cost is (k + 1) O(n^3), which is not negligible when the number of data samples is large. Finally, the Euclidean projection onto the simplex in equation (19) has a comparatively small cost.

Input: training datasets; number of clusters c
Initialize: the required matrices are constructed from a nearest-neighbor graph; F is formed by the eigenvectors of the corresponding Laplacian associated with its c smallest eigenvalues;
While not converged do
 Update J while fixing the other variables by equation (10)
 Update P while fixing the other variables by equation (12)
 Update Q while fixing the other variables by equation (13)
 Update U while fixing the other variables by equation (15)
 Update S while fixing the other variables by equation (19)
Update the multipliers.
Update F by the eigenvectors of the current Laplacian associated with its c smallest eigenvalues.
End while
Output: the similarity matrix S with exactly c connected components

4. Experiments

In this section, we conduct extensive experiments to demonstrate the effectiveness of our proposed method on the five benchmark datasets described in Section 4.1. We compare our method with seven state-of-the-art methods, namely, MLRSSC [23], RMSL [22], MCGC [26], DiMSC [21], MVGL [25], GMC [40], and SM2SC [38].

4.1. Datasets

We perform experiments on the ORL, Yale, Coil-20, Caltech 101-20, and UCI digits datasets to demonstrate the superiority of our method on face image clustering, object image clustering, and handwritten digit clustering, respectively. For each dataset, we select two types of features to represent the two instances. Specifically, we extract LBP and Gabor features for ORL, Yale, Coil-20, and Caltech 101-20, and we select Fourier coefficients of the character shapes (FOU) and profile correlations (FAC) for UCI digits; an illustrative feature-extraction sketch is given after the following list. A detailed description of each dataset follows, while Table 1 and Figure 1 provide a summary and sample pictures of the datasets, respectively.
(1) ORL: this face dataset contains 400 images of 40 distinct subjects. Each subject has different images taken under varying conditions such as time, lighting, and facial expression. The dimensions of the LBP and Gabor features extracted from this dataset are 3304 and 6750, respectively.
(2) Yale: this dataset contains 165 gray-scale images of 15 individuals, with 11 images per individual. The variations include images taken with center light, with glasses, in a happy mood, with left light, without glasses, etc. The LBP and Gabor feature dimensions are the same as those extracted for ORL above.
(3) COIL-20: this dataset is from the Columbia Object Image Library and contains 1440 images of 20 different classes, with 72 images per class. The LBP and Gabor features of this dataset have 944 and 4096 dimensions, respectively.
(4) Caltech 101-20: this image dataset has 101 categories of images. We select the widely used twenty classes to obtain 2368 images, each represented by 48-D Gabor and 928-D LBP features.
(5) UCI digits: this dataset contains handwritten digits 0 to 9 from the UCI machine learning repository and is composed of 2000 data samples. The two features extracted are FOU and FAC, with 76 and 216 dimensions, respectively.
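As a hypothetical illustration of the feature-extraction setup mentioned above, the following sketch computes LBP and Gabor descriptors for a single gray-scale image with scikit-image; the specific LBP/Gabor parameters used in our experiments are not reproduced here.

```python
# Illustrative LBP + Gabor feature extraction for one gray-scale image.
# The parameter choices below are placeholders, not the paper's settings.
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.filters import gabor

def lbp_gabor_features(image, lbp_points=8, lbp_radius=1, frequencies=(0.1, 0.2, 0.3, 0.4)):
    lbp = local_binary_pattern(image, lbp_points, lbp_radius, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=lbp_points + 2, density=True)
    gabor_feats = []
    for f in frequencies:
        for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
            real, _ = gabor(image, frequency=f, theta=theta)
            gabor_feats.extend([real.mean(), real.var()])
    return lbp_hist, np.array(gabor_feats)

image = np.random.rand(32, 32)                      # stand-in for a face/object image
lbp_vec, gabor_vec = lbp_gabor_features(image)
print(lbp_vec.shape, gabor_vec.shape)               # (10,) (32,)
```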

4.2. Experimental Setting

We fine-tuned the parameters of each compared method in strict compliance with the experimental settings reported in the respective literature. For our proposed method, two trade-off parameters need tuning, and we use a grid search over a set of candidate values to find the best combination. We run all experiments ten times and report the mean performance to guarantee fairness.

4.3. Evaluation Metrics

To evaluate our method, we utilize six standard evaluation metrics, i.e., accuracy (ACC), normalized mutual information (NMI), adjusted Rand index (AR), F-score, precision, and recall. These metrics capture different aspects of performance and together demonstrate the superiority of our method over the compared state-of-the-art methods. For example, ACC measures the percentage of correctly clustered data samples in the learned clustering structure compared with the ground-truth labels, whereas NMI is an information-theoretic measure that relies on the amount of statistical information shared by random variables.
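To make these two headline metrics concrete, the following sketch (our own illustration, not code from the paper) computes clustering accuracy with an optimal cluster-to-label matching via the Hungarian algorithm and NMI with scikit-learn; the remaining metrics follow their standard definitions.

```python
# Clustering accuracy via optimal label matching (Hungarian algorithm) and
# NMI from scikit-learn.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy between predicted cluster ids and ground truth."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)      # maximize matched samples
    return cost[rows, cols].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                         # same clustering, permuted labels
print(clustering_accuracy(y_true, y_pred))          # 1.0
print(normalized_mutual_info_score(y_true, y_pred)) # 1.0
```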

4.4. Experimental Results
4.4.1. Image Clustering

In this section, we present the image clustering performance results on all five benchmark datasets using the six evaluation metrics described earlier.

Table 2 displays the performance results on the ORL and Yale datasets. On the ORL dataset, our method outperforms the seven state-of-the-art compared methods with 81.39%, 91.32%, 71.13%, and 83.10% in ACC, NMI, precision, and recall, respectively. Surprisingly, DiMSC outperforms the more recent methods, such as MCGC, MVGL, and MLRSSC, with 78.87% in ACC and 91.19% in NMI. This may result from the way DiMSC learns the common representation matrix: it introduces a diversity regularizer so that the representation matrix obtained for each view enhances the diversity across the different views. One can also observe that GMC and SM2SC, which are more recent methods, perform better than the other compared methods in most evaluation metrics. In particular, SM2SC has the best AR of 78.36%, which is just 1% higher than that obtained by our proposed method. Nonetheless, our proposed method has the best performance on the Yale dataset in all evaluation metrics, with 74.48% in ACC, 71.96% in NMI, 50.11% in AR, 51.84% in F-score, 50.42% in precision, and 59.18% in recall.

Table 3 illustrates the clustering performance on the Coil-20 and Caltech 101-20 object datasets. Our method has the best performance in all evaluation metrics, except on the Coil-20 dataset, where SM2SC attains the best ACC by a small margin of 0.29%.

Table 4 shows that our proposed method outperforms the other state-of-the-art methods in five evaluation metrics, including NMI and AR, where it reaches 86.76% and 78.42%, respectively.

Overall, our method outperforms GMC, SM2SC, and MCGC only slightly, especially on the object datasets, which is not surprising: SM2SC, for instance, employs a block diagonal regularization to enforce a proper common structure, while GMC and MCGC utilize a rank constraint to find a unified clustering structure that balances the agreement between the different views, likewise avoiding k-means spectral postprocessing of the low-dimensional embedding matrix. Notwithstanding, our proposed method outperforms the remaining state-of-the-art methods by a wide margin in all evaluation metrics. Furthermore, we provide more intuition about our approach by visualizing the clustering structure learned on three datasets in Figure 2. We can clearly see that our proposed method yields a block diagonal clustering structure on all three datasets with exactly c connected components. We therefore conclude that our common low-rank learning approach is very effective.

4.4.2. Image Recognition with 0%, 10%, and 20% Levels of Corruption

In this section, we present experimental results for image recognition with respect to the accuracy metric. In this experiment, we study the robustness of each algorithm against noise corruption by gradually injecting 10% and 20% noise into the five benchmark datasets. We keep the same parameter settings described in Section 4.2 and apply the K nearest neighbor (KNN) classifier to evaluate the classification accuracy of each algorithm. Tables 5–9 show that the performance of all algorithms degrades as the noise level increases. Specifically, our proposed method loses only about 3–5% on all datasets at a 10% level of noise corruption, whereas all algorithms suffer a significant reduction in performance at a 20% level of corruption. Still, we can observe that our proposed method is generally more robust against noise than the other compared methods.
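Since the text does not spell out the noise model, the following sketch illustrates one plausible version of this protocol under our own assumptions: additive Gaussian noise injected into a randomly chosen fraction of samples, followed by a KNN accuracy check with scikit-learn; the stand-in data and labels are purely illustrative.

```python
# Illustrative corruption-and-evaluation protocol (our assumptions, not the
# paper's exact setup): perturb a fraction of samples with Gaussian noise and
# measure KNN classification accuracy.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def corrupt(X, fraction, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    X = X.copy()
    idx = rng.choice(len(X), size=int(fraction * len(X)), replace=False)
    X[idx] += sigma * rng.standard_normal(X[idx].shape)
    return X

X = np.random.rand(300, 50)                         # stand-in feature matrix
y = np.repeat(np.arange(10), 30)                    # stand-in labels
for level in (0.0, 0.1, 0.2):
    Xc = corrupt(X, level)
    Xtr, Xte, ytr, yte = train_test_split(Xc, y, test_size=0.3, random_state=0)
    acc = KNeighborsClassifier(n_neighbors=5).fit(Xtr, ytr).score(Xte, yte)
    print(f"noise {int(level*100)}%: accuracy {acc:.3f}")
```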

4.5. Parameter Sensitivity

Figure 3 shows the clustering performance of our method in terms of NMI and ACC when varying the two trade-off parameters on the Yale, COIL-20, and UCI digits datasets. First of all, note that the penalty parameter attached to the rank constraint does not need tuning like the other parameters because, following [26], its value is determined heuristically to accelerate the process: it is initialized once and then automatically increased or decreased in each iteration when the number of connected components is smaller or larger than c, respectively. As Figure 3 shows, our method is more sensitive, in both NMI and ACC, to the parameter that controls the learning of the common low-rank matrix U; hence, our proposed method can be assured of good performance once a suitable value for this parameter is found. On the other hand, our method is only slightly sensitive to the second parameter, which is therefore less critical to our model. Still, these clustering results show that our proposed method is relatively stable regardless of the dataset.

4.6. Convergence Analysis

Although the convergence of the inexact augmented Lagrange multiplier (ALM) method with more than three subproblems is still difficult to prove theoretically [29], we compute the relative error of the constraint term to illustrate the convergence behavior of our proposed method. Figure 4 shows the errors over the iterations; on all three benchmark datasets, our method converges within 150 iterations.

5. Conclusion

We proposed a novel method based on LRR, which learns a common low-rank representation matrix shared by two multi-instance datasets to improve clustering accuracy. Specifically, our proposed method obtains this common low-rank representation matrix using only one instance of the data such that a second instance of the data can linearly reconstruct it. Utilizing this common low-rank representation, our method then obtains an ideal similarity matrix, which explicitly reveals the clustering structure without any spectral postprocessing. This approach differs from existing state-of-the-art methods, where a projection matrix is learned for each instance of the data to obtain a common structure. Extensive experiments on five benchmark datasets demonstrate our method's superiority over seven state-of-the-art methods in all six evaluation metrics. In future work, we will extend our ideas to deep multiview learning.

Data Availability

The data used in this study are available in the following websites: (1) http://cam-orl.co.uk/facedatabase.html. (2) http://vision.ucsd.edu/content/yale-face-database. (3) https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php. (4) http://www.vision.caltech.edu/Image_Datasets/Caltech101/. (5) https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was funded in part by the National Key Research and Development Program of China, Grant no. 2020YFC1511800.