Special Issue: Developments in Mobile Multimedia Technologies
Robust Graph Structure Learning for Multimedia Data Analysis
Abstract
With the rapid development of computer network technology, we can acquire a large amount of multimedia data, and analyzing these data has become a very important task. Since graph construction or graph learning is a powerful tool for multimedia data analysis, many graph-based subspace learning and clustering approaches have been proposed. Among the existing graph learning algorithms, the sample reconstruction-based approaches have become the mainstream. Nevertheless, these approaches not only ignore the local and global structure information but are also sensitive to noise. To address these limitations, this paper proposes a graph learning framework, termed Robust Graph Structure Learning (RGSL). Different from the existing graph learning approaches, our approach adopts the self-expressiveness of samples to capture the global structure while utilizing data locality to depict the local structure. Specifically, in order to improve the robustness of our approach against noise, we introduce the $\ell_{2,1}$-norm regularization criterion and a nonnegative constraint into the graph construction process. Furthermore, an iterative updating optimization algorithm is designed to solve the objective function. A large number of subspace learning and clustering experiments are carried out to verify the effectiveness of the proposed approach.
1. Introduction
With the rapid growth of information technology and computer network technology, a large amount of multimedia data can be collected from many research fields such as computer vision, image processing, and natural language processing. However, most multimedia data are high-dimensional and have complex structures [1, 2]. Therefore, how to accurately analyze these data becomes a vital problem. Inspired by pattern recognition and machine learning techniques, many multimedia data analysis approaches based on subspace learning and clustering have been put forward recently [3–6]. A key issue in multimedia data analysis is learning or constructing a valuable graph that describes the pairwise similarities or relationships among the samples.
Nowadays, a series of graph learning approaches have been proposed, among which the heat-kernel function is the most widely used graph construction manner, as in the $k$-nearest-neighborhood graph ($k$-NN graph) or the $\varepsilon$-nearest-neighborhood graph ($\varepsilon$ graph). The edges between vertices are determined by the Euclidean distances among samples, and the weight of the edge between two vertices is then estimated by the heat kernel. However, there are two main limitations in these approaches. First, the choice of parameters, such as the neighbor number $k$ or the radius $\varepsilon$, is very challenging and can impact the final performance of the task. Second, the processes of neighbor selection and weight calculation are independent, which makes them sensitive to noise and often unable to reveal the real similarities of samples.
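As a concrete illustration, the $k$-NN heat-kernel construction described above can be sketched as follows (a minimal NumPy sketch; the function name and parameter choices are ours, not from the paper):

```python
import numpy as np

def knn_heat_kernel_graph(X, k=5, t=1.0):
    """Build a k-NN affinity graph with heat-kernel weights.

    X: (n, d) array, one sample per row.
    Returns a symmetric (n, n) weight matrix W with
    W_ij = exp(-||x_i - x_j||^2 / t) if x_j is among the k nearest
    neighbours of x_i (or vice versa), and 0 otherwise.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    D2 = np.maximum(D2, 0.0)              # guard against round-off
    np.fill_diagonal(D2, np.inf)          # exclude self-loops
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D2[i])[:k]       # k nearest neighbours of x_i
        W[i, idx] = np.exp(-D2[i, idx] / t)
    return np.maximum(W, W.T)             # symmetrise the graph
```

Note that both `k` and the kernel width `t` must be tuned by hand, which is exactly the first limitation discussed above.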
To overcome these drawbacks, the sparse representation (SR) based graph construction approach has been proposed, often called the $\ell_1$-graph or sparse graph. In the $\ell_1$-graph, each sample in turn is regarded as the query sample, and the remaining samples are considered as the dictionary used to represent it; the similarities between the query sample and the remaining samples can thereby be measured. Since the $\ell_1$-graph imposes an $\ell_1$-norm constraint on the regression model to select a few important samples, it has better discriminability and more robustness to noise. In the past decades, a series of excellent learning approaches based on the $\ell_1$-graph have been designed and successfully applied in different areas. Although the $\ell_1$-graph can reveal the linear relationship between a single point and the other points, it still has some limitations. First, the $\ell_1$-graph strictly assumes that the dictionary of the regression is overcomplete, which is unsatisfied in many real applications, especially for graph learning. Second, the $\ell_1$-graph pays too much attention to sparsity while neglecting the correlations between samples, so it cannot offer a smooth data representation. Therefore, SR is not a good choice for graph construction. To overcome these problems, Zhang et al. introduced a Collaborative Representation (CR) linear regression approach employing $\ell_2$-norm rather than $\ell_1$-norm sparsity regularization. Compared to SR, CR provides more relaxation for the regression coefficients and obtains a smoother data representation.
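The CR coding step admits a closed-form ridge-regression solution, which is what makes it cheaper than the $\ell_1$ problem. A minimal sketch (function name and parameter defaults are our assumptions):

```python
import numpy as np

def collaborative_representation(D, y, lam=0.1):
    """Collaborative representation of a query y over dictionary D.

    Solves  min_a ||y - D a||_2^2 + lam * ||a||_2^2
    in closed form:  a = (D^T D + lam I)^{-1} D^T y.
    D: (d, m) dictionary (one atom per column), y: (d,) query.
    """
    m = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(m), D.T @ y)
```

Unlike the iterative $\ell_1$ solver, this is a single linear solve, at the cost of a dense (non-sparse) coefficient vector.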
Since both SR and CR reveal only the linear relationship between a single data point and the other data points, the global structure of the data is ignored. To address this problem, Liu et al. suggested Low-Rank Representation (LRR) for subspace clustering. The main purpose of LRR is to find a coefficient matrix that reconstructs each data point as a linear combination of all the other data points, which is called self-representation. Different from the traditional distance-based similarity measures, i.e., $k$-nearest neighborhood or $\varepsilon$-nearest neighborhood, the representation-based approaches, such as SR, CR, and LRR, measure the similarity between data points by solving an optimization problem. These approaches better capture the intrinsic structure of the data and thus achieve better classification and clustering performance overall. However, the objective function of LRR is not differentiable, and solving the rank minimization problem has a high computational complexity. To efficiently address this limitation of LRR, Lu et al. proposed Least Squares Regression (LSR), which groups highly correlated data together and is robust to noise. Compared with LRR, LSR is simpler and more efficient.
In recent years, researchers have found that the relationships between data points in real applications are usually high-dimensional and nonlinear, so the aforementioned linear representation approaches can hardly achieve good performance, and more attention has been paid to revealing the nonlinear relationships between data points of interest [16–27]. For example, Wang et al. explored the criterion of Locally Linear Embedding (LLE) and used it to construct the graph by computing the weights between pairs of samples. Wei and Peng adopted a criterion similar to that of LLE to construct a neighborhood-preserving graph for semisupervised dimensionality reduction. Furthermore, Yu et al. found that the nonzero coefficients of sparse coding are always assigned to the neighbor samples of the query sample. To encourage the coding to be local, some local feature-based coding approaches have been proposed, which achieve excellent performance in classification and clustering tasks. Exploiting the merits of local constraints, Peng et al. put forward Locality-Constrained Collaborative (LCC) representation, which achieves better classification performance than nonlocal approaches. Chen and Yi took the local constraint and LSR into consideration and designed Locality-Constrained LSR (LCLSR) for subspace clustering. LCLSR explores both the global structure of the data points and their local linear relationships, forcing the representation to prefer the selection of neighborhood points. Although LCLSR considers the locality structure of data, it still has some limitations. On the one hand, its objective function is based on the Frobenius norm, which is very sensitive to noise; on the other hand, the sample reconstruction process ignores the relationships between sample representations: similar original samples should generate similar coding vectors, and ignoring this weakens the effectiveness of graph learning approaches.
To combat these issues, we design a novel graph learning approach, named Robust Graph Structure Learning (RGSL). Specifically, the self-expressiveness of samples and an adaptive neighbor selection approach are introduced to preserve both the local and global structures of data. To enhance the robustness of graph construction, we impose the $\ell_{2,1}$-norm criterion and a nonnegative constraint on the adjacency graph weight matrix to reduce the influence of noisy points. Therefore, the proposed approach estimates the graph from the data alone, through self-expressiveness and data locality, without relying on a predefined affinity matrix. We assess the benefits of the proposed approach on subspace learning and clustering tasks, and extensive experiments verify its effectiveness over other state-of-the-art approaches. The framework of the proposed approach is shown in Figure 1.
The outline of this paper is as follows. Section 2 briefly reviews related work. Section 3 presents the proposed approach in detail. Section 4 reports extensive experiments that demonstrate the effectiveness of the proposed approach. Section 5 concludes the paper.
2. Related Work
In this section, first, many classic and widely used graph construction approaches are introduced. Then, two kinds of multimedia data analysis techniques including subspace learning and spectral clustering are presented in detail accordingly.
2.1. Graph Construction Approaches
Recently, many graph construction approaches have been proposed for multimedia data analysis. In this subsection, we will review some graph construction approaches related to our work as below.
Liu et al. proposed a Low-Rank Representation (LRR) graph construction approach, in which each sample is represented by a linear combination of all samples while a low-rank constraint is imposed on the coefficient matrix. Given a high-dimensional database $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $d$ is the data dimensionality and $n$ is the number of samples, the LRR graph can be obtained by optimizing the following problem:
$$\min_{Z} \|Z\|_* \quad \text{s.t.} \quad X = XZ,$$
where $\|\cdot\|_*$ denotes the nuclear norm of a matrix, i.e., the sum of the singular values of the matrix, and $Z \in \mathbb{R}^{n \times n}$ denotes the coefficient matrix of the data with the lowest rank.
Although the LRR graph can capture the global structure of data, solving the nuclear norm problem is very time-consuming. Hence, Lu et al. utilized the Frobenius norm in place of the nuclear norm for fast computation of the weight matrix. The LSR graph is defined as
$$\min_{Z} \|X - XZ\|_F^2 + \lambda \|Z\|_F^2 \quad \text{s.t.} \quad \operatorname{diag}(Z) = 0,$$
where $\|\cdot\|_F$ is the Frobenius norm and $\operatorname{diag}(\cdot)$ denotes the diagonal operation of a matrix.
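A sketch of the LSR graph construction (our NumPy sketch; we zero the diagonal after solving as a simple surrogate for the exact $\operatorname{diag}(Z) = 0$ constraint, and symmetrize the result into an affinity matrix, which is a common post-processing step rather than part of the paper's formulation):

```python
import numpy as np

def lsr_graph(X, lam=0.1):
    """LSR self-expression graph.

    X: (d, n) data matrix, one sample per column.
    Solves Z = (X^T X + lam I)^{-1} X^T X (unconstrained closed form),
    then zeroes the diagonal and symmetrises |Z| into an affinity W.
    """
    n = X.shape[1]
    Z = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ X)
    np.fill_diagonal(Z, 0.0)              # surrogate for diag(Z) = 0
    W = (np.abs(Z) + np.abs(Z.T)) / 2.0   # symmetric affinity matrix
    return W
```

The whole graph is obtained from one linear solve, which is why LSR is markedly faster than LRR.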
In order to make full use of the advantage of locality constraints, Chen and Yi combined LSR and the locality constraints into a unified framework and proposed the LCLSR approach for graph construction. The objective function of LCLSR is
$$\min_{Z} \|X - XZ\|_F^2 + \lambda_1 \|Z\|_F^2 + \lambda_2 \|D \odot Z\|_F^2,$$
where $\lambda_1$ and $\lambda_2$ are two balance parameters and the symbol $\odot$ represents the Hadamard product. $D$ denotes the distance matrix between samples, with $D_{ij} = d(x_i, x_j)$, where the function $d(\cdot, \cdot)$ is a distance metric, such as the Euclidean distance.
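Under a per-column reading of this objective (our sketch under stated assumptions; the authors' exact formulation may differ), each column of $Z$ has a closed form, since the locality penalty is a diagonal reweighting of the ridge term:

```python
import numpy as np

def lclsr_graph(X, lam1=0.1, lam2=0.1):
    """Column-wise closed form for an LCLSR-style objective (sketch).

    X: (d, n) data, one sample per column. For each column j solves
    min_z ||x_j - X z||^2 + lam1 ||z||^2 + lam2 ||d_j * z||^2,
    where d_j holds the Euclidean distances from x_j to all samples,
    giving z_j = (X^T X + lam1 I + lam2 diag(d_j^2))^{-1} X^T x_j.
    """
    d, n = X.shape
    G = X.T @ X
    sq = np.sum(X**2, axis=0)
    Dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0))
    Z = np.zeros((n, n))
    for j in range(n):
        A = G + lam1 * np.eye(n) + lam2 * np.diag(Dist[:, j] ** 2)
        Z[:, j] = np.linalg.solve(A, G[:, j])
    return Z
```

Distant samples get a large per-coefficient penalty, so the representation prefers nearby points, which is the locality effect described above.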
2.2. Subspace Learning
Locality Preserving Projection (LPP) is a well-known subspace learning approach used to discover the geometric properties of a high-dimensional feature space. Suppose that the adjacency graph weight matrix $W$ is given. LPP aims at ensuring that if the original high-dimensional samples $x_i$ and $x_j$ are "close," then the low-dimensional representations $y_i = P^T x_i$ and $y_j = P^T x_j$ should be close as well. Using the weight matrix as a penalty, LPP minimizes the following objective function:
$$\min_{P} \sum_{i,j} \|P^T x_i - P^T x_j\|_2^2 W_{ij} = \min_{P} \operatorname{tr}(P^T X L X^T P),$$
where $L = D - W$ is the Laplacian matrix, in which $D$ is a diagonal matrix with diagonal elements $D_{ii} = \sum_j W_{ij}$, and $\operatorname{tr}(\cdot)$ is the trace of a matrix. $D_{ii}$ measures the local density around $x_i$, and a bigger $D_{ii}$ indicates that $x_i$ is more important. Hence, a natural constraint can be imposed as $P^T X D X^T P = I$. Based on this constraint, the LPP model can be rewritten as
$$\min_{P} \operatorname{tr}(P^T X L X^T P) \quad \text{s.t.} \quad P^T X D X^T P = I.$$
The projection matrix $P$ is constructed from the eigenvectors associated with the smallest nonzero eigenvalues, which can be solved via the generalized eigenvalue problem
$$X L X^T p = \lambda X D X^T p.$$
For a new high-dimensional sample $x$, using the obtained projection matrix $P$, we can obtain a low-dimensional representation by $y = P^T x$.
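The LPP solution above can be sketched with a generalized symmetric eigensolver; the function name, data layout, and the small ridge term `reg` (added so the right-hand matrix is positive definite) are our assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, W, dim=2, reg=1e-6):
    """LPP projection matrix (sketch).

    X: (d, n) data, one sample per column; W: (n, n) symmetric affinity.
    Solves X L X^T p = lam X D X^T p and keeps the eigenvectors of the
    `dim` smallest eigenvalues; `reg` regularises the right-hand side.
    """
    D = np.diag(W.sum(axis=1))
    L = D - W                              # graph Laplacian
    A = X @ L @ X.T                        # penalty matrix
    B = X @ D @ X.T + reg * np.eye(X.shape[0])
    vals, vecs = eigh(A, B)                # eigenvalues in ascending order
    return vecs[:, :dim]                   # projection matrix P, shape (d, dim)
```

A new sample `x` is then embedded as `P.T @ x`, matching the projection step described above.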
2.3. Spectral Clustering
Spectral clustering is a popular clustering approach that uses the eigenvectors of a symmetric matrix derived from the distances between data points [35, 36]. Given a data set consisting of $n$ data points $\{x_1, x_2, \ldots, x_n\}$, spectral clustering aims at partitioning the data into $c$ disjoint clusters by exploiting the top eigenvectors of the graph Laplacian. Suppose that the graph matrix $W$ is obtained by a graph construction approach; the new representation $F$ can be acquired by optimizing the following objective function:
$$\min_{F} \operatorname{tr}(F^T L F) \quad \text{s.t.} \quad F^T F = I,$$
where $L = D - W$ is the Laplacian matrix of $W$, in which $D$ is the diagonal matrix with $D_{ii} = \sum_j W_{ij}$, and $c$ is the number of selected clusters. Each row of $F \in \mathbb{R}^{n \times c}$ represents the new discriminative representation of the corresponding original sample.
Finally, data clustering can be accomplished by performing $k$-means on the new representation $F$.
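This two-stage pipeline (Laplacian eigenvectors, then $k$-means on the embedding) can be sketched as follows; this is a minimal unnormalized-Laplacian variant with names of our choosing:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def spectral_clustering(W, c, seed=0):
    """Spectral clustering sketch.

    W: (n, n) symmetric nonnegative affinity matrix; c: cluster count.
    Takes the c eigenvectors of L = D - W with the smallest eigenvalues
    as the embedding F, then runs k-means on the rows of F.
    """
    D = np.diag(W.sum(axis=1))
    L = D - W                              # unnormalized graph Laplacian
    vals, vecs = eigh(L)                   # eigenvalues ascending
    F = vecs[:, :c]                        # (n, c) spectral embedding
    _, labels = kmeans2(F, c, seed=seed, minit='++')
    return labels
```

Because $k$-means is initialization-sensitive, results are usually averaged over several runs, as the experiments in Section 4.4 also do.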
3. Proposed Method
In this section, some notations are introduced first. Second, we give some detailed descriptions of the proposed RGSL approach. At last, an iterative update algorithm is designed to solve our RGSL approach.
3.1. Notations
Let $X \in \mathbb{R}^{d \times n}$ be the given high-dimensional original data matrix, where $d$ is the dimensionality of the samples and $n$ corresponds to the total number of samples. For a matrix $Z$, the Frobenius norm and the $\ell_{2,1}$-norm are defined as
$$\|Z\|_F = \sqrt{\sum_{i}\sum_{j} Z_{ij}^2} \qquad \text{and} \qquad \|Z\|_{2,1} = \sum_{i} \|z^i\|_2,$$
in which $z^i$ and $z_j$ are the $i$-th row and the $j$-th column of $Z$, respectively.
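These two norms can be computed directly; a small sketch for reference:

```python
import numpy as np

def frobenius_norm(Z):
    """Frobenius norm: square root of the sum of squared entries."""
    return np.sqrt(np.sum(Z**2))

def l21_norm(Z):
    """l2,1-norm: sum of the Euclidean norms of the rows of Z."""
    return np.sum(np.sqrt(np.sum(Z**2, axis=1)))
```

The $\ell_{2,1}$-norm penalizes whole rows at once, which is why it promotes row sparsity and robustness to sample-level noise.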
3.2. Objective Function
First, in order to enhance the robustness of the graph learning algorithm to noise and obtain a discriminative graph structure, the $\ell_{2,1}$-norm measure criterion is introduced into the traditional LSR model, which is defined as
$$\min_{Z} \|X - XZ\|_{2,1} + \lambda \|Z\|_{2,1} = \min_{Z} \operatorname{tr}\big((X - XZ)^T D_1 (X - XZ)\big) + \lambda \operatorname{tr}(Z^T D_2 Z),$$
where $\|\cdot\|_{2,1}$ and $\|\cdot\|_F$ denote the $\ell_{2,1}$-norm and the Frobenius norm, respectively, and $\lambda$ is a balance parameter. $D_1$ and $D_2$ are diagonal matrices whose diagonal elements are, respectively, defined as
$$(D_1)_{ii} = \frac{1}{2\|(X - XZ)^i\|_2 + \epsilon} \qquad \text{and} \qquad (D_2)_{ii} = \frac{1}{2\|z^i\|_2 + \epsilon},$$
where $\epsilon$ is a small nonnegative constant for preventing the denominator from being zero.
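Under this iteratively reweighted treatment of the $\ell_{2,1}$-norm, each diagonal matrix is formed from row norms. A sketch (assuming the $1/(2\|\cdot\|_2 + \epsilon)$ form stated above; the helper name is ours):

```python
import numpy as np

def reweight_diag(A, eps=1e-8):
    """Diagonal reweighting matrix for an l2,1 term.

    Entry (i, i) is 1 / (2 ||a^i||_2 + eps) for row a^i of A, so that
    tr(A^T D A) approximates 0.5 * ||A||_{2,1} when eps is small.
    """
    row_norms = np.sqrt(np.sum(A**2, axis=1))
    return np.diag(1.0 / (2.0 * row_norms + eps))
```

This identity is what lets the nonsmooth $\ell_{2,1}$ terms be handled by a sequence of smooth weighted least-squares subproblems.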
Second, the relationship between representation coefficients is ignored in the sample reconstruction; i.e., similar original samples should generate similar coding vectors, and ignoring this weakens the effectiveness of graph learning. To solve this issue, a manifold constraint on the coding coefficients is designed, which is defined as
$$\min_{Z} \frac{1}{2}\sum_{i,j} \|z_i - z_j\|_2^2 W_{ij} = \min_{Z} \operatorname{tr}(Z L_W Z^T),$$
where $W_{ij}$ denotes the similarity weight between sample $x_i$ and sample $x_j$. The elements of $W$ are computed from the pairwise similarities of the samples, $D_W$ is a diagonal matrix whose diagonal elements are $(D_W)_{ii} = \sum_j W_{ij}$, and $L_W = D_W - W$ is the Laplacian of the graph matrix $W$.
At last, the nonnegative constraint is also imposed on the representation coefficients, and the final objective function of the proposed approach is
$$\min_{Z} \|X - XZ\|_{2,1} + \lambda_1 \|Z\|_{2,1} + \lambda_2 \operatorname{tr}(Z L_W Z^T) \quad \text{s.t.} \quad Z \ge 0,$$
where $\lambda_1$ and $\lambda_2$ are two positive balance parameters.
3.3. Optimization
In this section, we give the optimization procedure for the objective function of the proposed approach in Equation (10). From Equation (10), we can observe that the objective function involves the $\ell_{2,1}$-norm; it is therefore nonsmooth in the variable $Z$, and a closed-form solution to Equation (10) cannot be given. To address this, an iterative updating algorithm is designed to optimize the objective function.
3.3.1. Fix $D_1$, $D_2$, and $L_W$, Update $Z$
First, we fix the matrices $D_1$, $D_2$, and $L_W$. After removing the irrelevant terms, the optimization problem with respect to $Z$ in Equation (10) can be simplified to
$$\min_{Z \ge 0} \operatorname{tr}\big((X - XZ)^T D_1 (X - XZ)\big) + \lambda_1 \operatorname{tr}(Z^T D_2 Z) + \lambda_2 \operatorname{tr}(Z L_W Z^T).$$
The Lagrangian function of Equation (11), with multiplier matrix $\Theta$ for the constraint $Z \ge 0$, is represented as
$$\mathcal{L}(Z) = \operatorname{tr}\big((X - XZ)^T D_1 (X - XZ)\big) + \lambda_1 \operatorname{tr}(Z^T D_2 Z) + \lambda_2 \operatorname{tr}(Z L_W Z^T) - \operatorname{tr}(\Theta Z^T).$$
By computing the derivative of Equation (12) with respect to $Z$ and setting it equal to zero, we obtain
$$\frac{\partial \mathcal{L}}{\partial Z} = -2X^T D_1 X + 2X^T D_1 X Z + 2\lambda_1 D_2 Z + 2\lambda_2 Z L_W - \Theta = 0.$$
According to the KKT condition $\Theta_{ij} Z_{ij} = 0$, we update the solution for $Z$ as below:
$$Z_{ij} \leftarrow Z_{ij} \, \frac{\big(X^T D_1 X + \lambda_2 Z W\big)_{ij}}{\big(X^T D_1 X Z + \lambda_1 D_2 Z + \lambda_2 Z D_W\big)_{ij}}.$$
3.3.2. Fix $Z$, Update $D_1$, $D_2$, and $L_W$
When $Z$ is fixed and all the irrelevant terms are removed, the solutions can be formulated as
$$(D_1)_{ii} = \frac{1}{2\|(X - XZ)^i\|_2 + \epsilon}, \qquad (D_2)_{ii} = \frac{1}{2\|z^i\|_2 + \epsilon},$$
and the graph Laplacian $L_W$ is refreshed accordingly.
In conclusion, the proposed optimization algorithm for RGSL can be summarized as below.
In Algorithm 1, the convergence condition is defined as follows: the change of the value of the objective function in Equation (10) is less than a threshold, or a predefined maximum number of iterations is reached.
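The overall alternating scheme might be sketched as follows. This is our reconstruction under stated assumptions (reweighted $\ell_{2,1}$ terms, a multiplicative KKT-style update, a fixed similarity graph $W$, and nonnegative data such as pixel intensities so the update stays nonnegative), not the authors' exact algorithm:

```python
import numpy as np

def rgsl(X, W, lam1=0.1, lam2=0.1, n_iter=50, eps=1e-8):
    """Sketch of the RGSL alternating optimisation (our reconstruction).

    X: (d, n) nonnegative data, one sample per column.
    W: (n, n) fixed symmetric similarity graph for the manifold term.
    Alternates: (1) refresh the l2,1 reweighting matrices D1, D2;
    (2) multiplicatively update the nonnegative coefficients Z.
    """
    d, n = X.shape
    Z = np.full((n, n), 1.0 / n)          # nonnegative initialisation
    Dw = np.diag(W.sum(axis=1))           # degree matrix of W
    for _ in range(n_iter):
        # Step 1: reweighting for ||X - XZ||_{2,1} and ||Z||_{2,1}.
        R = X - X @ Z
        D1 = np.diag(1.0 / (2 * np.sqrt(np.sum(R**2, axis=1)) + eps))
        D2 = np.diag(1.0 / (2 * np.sqrt(np.sum(Z**2, axis=1)) + eps))
        # Step 2: KKT-style multiplicative update; Z stays >= 0 because
        # numerator and denominator are nonnegative for nonnegative X.
        num = X.T @ D1 @ X + lam2 * Z @ W
        den = X.T @ D1 @ X @ Z + lam1 * D2 @ Z + lam2 * Z @ Dw + eps
        Z *= num / den
    return Z
```

In practice one would also monitor the objective value and stop once its change falls below a threshold, mirroring the convergence condition of Algorithm 1.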
4. Experiment and Results
In this section, we first introduce the databases used in our experiments. Next, the compared graph learning approaches are listed. Finally, subspace learning and clustering tasks are employed to verify the effectiveness of the proposed approach.
4.1. Databases
Four commonly used, publicly available multimedia databases, Yale, AR, CMU PIE, and Extended YaleB, are used to verify the effectiveness of the proposed approach. Detailed statistical information about the four databases is given in Table 1.
Yale database: it contains 165 face images captured from 15 different subjects. Each subject has 11 different images with the varied facial expressions, under different illumination conditions, and wearing glasses or not. Some example images of the Yale database are depicted in Figure 2(a).
AR database: it consists of over 4000 facial images obtained from 70 male and 56 female faces. Images of each person were captured with 26 frontal face images with anger, smiling, and screaming, under varied illumination conditions, and with sunglass and scarf occlusions. Some examples of the AR database are shown in Figure 2(b).
CMU PIE database: there are 41,368 face images of 68 different subjects. Images of each person are captured under 43 different illumination conditions with 13 different poses and 4 different expressions. Here, we employ a subset of CMU PIE which consists of 24 images per subject. A part of example images is illustrated in Figure 2(c).
Extended YaleB database: there are 38 individuals and each individual has 64 images. For each individual, the face images are taken from different illumination conditions with small changes in head pose and facial expression. Example images from this database are shown in Figure 2(d).
Although the graph structure can be obtained from the proposed approach, it is intractable to assess graph learning approaches using the estimated graph alone. Hence, we assess the quality of the learned graph through two multimedia data analysis tasks: subspace learning and spectral clustering. In our experiments, we fix the learning task, vary the graph construction approach, and observe the resulting performance on the subspace learning and spectral clustering tasks.
4.2. Comparison among Several Graph Learning Approaches
To investigate the performance of our approach on subspace learning and clustering, several state-of-the-art graph learning approaches are chosen for comparison, as listed below:
(i) KNN graph: the graph edges connecting two vertices are generated by the Euclidean distance-based $k$-nearest-neighbor rule, and the heat-kernel function is used to measure the weight of an edge
(ii) LLE graph: each sample is linearly reconstructed by its neighbors within a local area to preserve the local manifold structure
(iii) $\ell_1$ graph: the graph is built by exploiting the sparsity structure of the data through $\ell_1$ sparse representation optimization
(iv) LRR graph: based on the self-expressive property, a low-rank graph is obtained
(v) LSR graph: the self-expressive property and the Frobenius norm are used for fast computation of the weight matrix
(vi) LCLSR graph: it combines the locality constraint and LSR to explore both the global structure of data points and the local linear relationships of data points
(vii) SGLS graph: it integrates manifold constraints on the unknown sparse codes as a graph regularizer
(viii) Our proposed RGSL graph: our approach takes both the global and local structure information into consideration and also introduces the $\ell_{2,1}$-norm regularization criterion and the nonnegative constraint into the graph construction process to enhance robustness
4.3. Subspace Learning Experiment and Analysis
In this section, we employ Locality Preserving Projections (LPP), a representative unsupervised subspace learning approach, to verify the effectiveness of the proposed approach. In our experiments, different graphs are plugged into the LPP approach for subspace learning, and the classification accuracy is then used for performance comparison. For each database, we randomly select $l$ images from each class as training samples and treat the remaining images as test samples; the value of $l$ is fixed per database for Yale, AR, CMU PIE, and Extended YaleB. To test the performance of the proposed approach more effectively and fairly, the random sample selection is repeated 20 times, and the average classification accuracy and standard deviation are reported as the final results. We employ the nearest neighbor classifier with the Euclidean distance for classification due to its simplicity. To compare the performance of the different approaches, the classification accuracy rate is chosen as the evaluation criterion, which is defined as
$$\text{Accuracy} = \frac{N_c}{N} \times 100\%,$$
where $N_c$ is the number of test samples correctly classified by the nearest neighbor classifier and $N$ is the total number of test samples.
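A minimal sketch of this evaluation protocol (function and variable names are ours): each test sample in the projected subspace takes the label of its nearest training sample, and the accuracy is reported as a percentage.

```python
import numpy as np

def nn_accuracy(train_X, train_y, test_X, test_y):
    """Nearest-neighbour classification accuracy in percent.

    train_X, test_X: (m, k) / (t, k) arrays of (projected) samples,
    one sample per row; train_y, test_y: integer label arrays.
    """
    # Squared Euclidean distances between every test and training sample.
    d2 = np.sum((test_X[:, None, :] - train_X[None, :, :]) ** 2, axis=2)
    pred = train_y[np.argmin(d2, axis=1)]   # label of the closest sample
    return 100.0 * np.mean(pred == test_y)
```

In the experiments above, `train_X` and `test_X` would hold the low-dimensional representations produced by LPP with each candidate graph.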
All the experiments are conducted using MATLAB 2016b on a 3.60 GHz machine with 8 GB RAM. To acquire the optimal parameters of the different approaches, we employ grid search in our experiments. Tables 2–5 report the average classification accuracy rates and standard deviations of the compared approaches on the Yale, AR, CMU PIE, and Extended YaleB databases, respectively. The numbers in brackets in Tables 2–5 denote the data dimensionality at which the best classification accuracy rate is achieved.
From the results in Tables 2–5, we can clearly observe that most of the graph learning approaches perform better than the KNN graph and the LLE graph, which indicates that graph construction based directly on the Euclidean distance is very sensitive to noisy points, weakening classification performance. Besides, compared to the $\ell_1$ graph, the LRR, LSR, SGLS, and LCLSR graphs exploit the self-expressive structure of the data (and, in SGLS and LCLSR, also its locality structure) during graph construction and thus achieve better performance. Finally, the proposed RGSL approach performs best among all of the compared approaches. The main reasons are as follows: first, both the global structure and the local structure are essential to graph learning; second, the $\ell_{2,1}$-norm regularization criterion and the nonnegative constraint are introduced into the graph construction process to improve robustness against noise. Therefore, our approach further improves the classification performance.
There are two parameters, $\lambda_1$ and $\lambda_2$, in the objective function of our proposed approach, so how to appropriately set their values is very important. In this study, we tune the values of $\lambda_1$ and $\lambda_2$ by grid search in an alternating manner. The best results for different parameter values on the four databases are shown in Figure 3.
As we can see from Figure 3, when the values of $\lambda_1$ and $\lambda_2$ are relatively small, the performance of the proposed approach is relatively poor. As $\lambda_1$ and $\lambda_2$ increase, the performance improves; however, after the best classification result is reached, the performance decreases dramatically as the two parameters increase further. Therefore, the proposed approach obtains its best classification results when $\lambda_1$ and $\lambda_2$ are set neither too large nor too small. Finally, the convergence curves of RGSL on the four databases are shown in Figure 4, where the $x$-axis and the $y$-axis denote the iteration number and the value of the objective function, respectively. As seen from Figure 4, the value of the objective function declines at each iteration and converges very quickly on all of the databases.
4.4. Clustering Experiment and Analysis
In spectral clustering, the initialization has a major impact on the performance of the $k$-means clustering algorithm. Therefore, we repeat the clustering process 50 times with different random initializations, and the average clustering results with standard deviations are reported as the final results. In the experiments, three widely used clustering evaluation indicators, Accuracy (ACC), Normalized Mutual Information (NMI), and Purity, are used to evaluate the performance of the proposed approach.
For a given sample $x_i$, suppose that the obtained clustering result is $r_i$ and the true label is $l_i$. The clustering accuracy is calculated as
$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta\big(l_i, \mathrm{map}(r_i)\big)}{n},$$
where $\delta(a, b) = 1$ if $a = b$ and $\delta(a, b) = 0$ otherwise, $\mathrm{map}(\cdot)$ maps each clustering label to the corresponding ground-truth label, and $n$ is the number of samples. The Kuhn-Munkres algorithm is employed to find the best mapping.
Assuming that $C$ and $C'$ are, respectively, the clustering result and the true label set obtained by different approaches, the Mutual Information (MI) is defined as
$$\mathrm{MI}(C, C') = \sum_{c_i \in C,\; c'_j \in C'} p(c_i, c'_j) \log \frac{p(c_i, c'_j)}{p(c_i)\, p(c'_j)},$$
where $p(c_i)$ and $p(c'_j)$ represent the probabilities that a sample randomly selected from the dataset belongs to $c_i$ and $c'_j$, respectively, and $p(c_i, c'_j)$ represents the joint probability that a randomly selected sample belongs to both $c_i$ and $c'_j$.
Let $H(C)$ and $H(C')$ be the entropies of $C$ and $C'$, respectively. The Normalized Mutual Information (NMI) is calculated as
$$\mathrm{NMI}(C, C') = \frac{\mathrm{MI}(C, C')}{\max\big(H(C), H(C')\big)}.$$
Purity is defined as follows:
$$\mathrm{Purity} = \sum_{k=1}^{K} \frac{n_k}{n} \cdot \frac{m_k}{n_k},$$
where $K$ represents the number of clusters, $m_k$ is the number of elements of the most numerous category in cluster $k$, and $n_k$ is the number of elements in cluster $k$.
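The ACC and Purity metrics above can be sketched as follows; the Kuhn-Munkres (Hungarian) mapping is done with SciPy's assignment solver, and the function names are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(labels_pred, labels_true):
    """Clustering accuracy under the best one-to-one cluster-to-class
    mapping, found with the Kuhn-Munkres (Hungarian) algorithm."""
    pred = np.asarray(labels_pred)
    true = np.asarray(labels_true)
    k = max(pred.max(), true.max()) + 1
    cost = np.zeros((k, k), dtype=int)    # contingency table
    for p, t in zip(pred, true):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)   # maximise matched pairs
    return cost[row, col].sum() / len(pred)

def purity(labels_pred, labels_true):
    """Purity: fraction of samples belonging to the majority
    ground-truth class of their assigned cluster."""
    pred = np.asarray(labels_pred)
    true = np.asarray(labels_true)
    total = 0
    for c in np.unique(pred):
        members = true[pred == c]
        total += np.bincount(members).max()   # m_k for cluster c
    return total / len(pred)
```

NMI can likewise be obtained from the contingency table via the MI and entropy formulas above.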
Tables 6–9 show the best ACC, NMI, and Purity values of the eight approaches on the Yale, AR, CMU PIE, and Extended YaleB databases, respectively. From the results in Tables 6–9, the following conclusions can be drawn. First, since the KNN graph and the LLE graph are based on the Euclidean distance, they are very sensitive to noisy points, outliers, and parameter values, so the clustering performance based on them is lower than that of the other compared approaches. Second, the performance of the LRR, LSR, SGLS, and LCLSR graphs is superior to that of the $\ell_1$ graph because they exploit the self-expressive structure of the data (and, in SGLS and LCLSR, also its locality structure) during graph construction. However, their objective functions are all based on the Frobenius norm, so they are very sensitive to noisy data. Besides, the relationship between the representation coefficients is ignored in the sample reconstruction; i.e., similar original samples should generate similar coding vectors, and ignoring this weakens the effectiveness of graph learning. To overcome these disadvantages, our RGSL approach combines the $\ell_{2,1}$-norm with manifold constraints on the coding coefficients to learn a local and smooth representation. Therefore, the performance of the proposed approach is superior to that of all of the compared approaches.
Similar to the subspace learning experiment, we also tune $\lambda_1$ and $\lambda_2$ by grid search in an alternating manner. The objective function contains three terms. When $\lambda_1$ and $\lambda_2$ are set too small, the effect of the second and third terms is weakened and the role of the first term is overemphasized. Conversely, when they are set too large, the second and third terms dominate, reducing the effect of the first term. Therefore, the proposed RGSL approach achieves its best performance when $\lambda_1$ and $\lambda_2$ are set to moderate values, which is consistent with the discussion of the subspace learning experiments.
5. Conclusion
This paper proposes a novel graph learning framework, named Robust Graph Structure Learning (RGSL), for effective multimedia data analysis. To preserve both the local and global structures of data, we employ data self-representativeness to capture the global structure and an adaptive neighbor approach to describe the local structure. Furthermore, we introduce the $\ell_{2,1}$-norm regularization criterion and a nonnegative constraint into graph learning to improve the robustness of the model against noise. Extensive experimental results on subspace learning and clustering tasks show that the proposed approach outperforms state-of-the-art graph learning approaches. Since graph construction is affected when the dimensionality of the data is high, in the future we will unify dimensionality reduction, subspace learning, and graph learning in a single framework to address this issue.
Data Availability
The data are derived from public domain resources.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is supported in part by grants from the National Natural Science Foundation of China (Nos. 61903262, 62062040, 61772091, 61802035, and 61962006), the China Postdoctoral Science Foundation (No. 2019M661117), the Scientific Research Fund Project of Liaoning Provincial Department of Education (Nos. JYT19040 and JYT19053), the Scientific Research Funds of Shenyang Aerospace University under Grant (18YB01 and 19YB01), the Natural Science Foundation of Liaoning Province Science and Technology Department (No. 2019-ZD-0234), the Sichuan Science and Technology Program under Grant Nos. 2021JDJQ0021, 2020YFG0153, 2020YJ0481, 2020YFS0466, 2020YJ0430, 2020JDR0164, and 2019YFS0067, the CCF-Huawei Database System Innovation Research Plan under Grant No. CCF-HuaweiDBIR2020004A, the Natural Science Foundation of Guangxi under Grant No. 2018GXNSFDA138005, and the Major Project of Digital Key Laboratory of Sichuan Province in Sichuan Conservatory of Music under Grant No. 21DMAKL02.
References
G. Liu, Z. Lin, and Y. Yu, "Robust subspace segmentation by low-rank representation," in Proceedings of the 27th International Conference on Machine Learning, pp. 663–670, Haifa, Israel, 2010.
C.-Y. Lu, H. Min, Z.-Q. Zhao, L. Zhu, D.-S. Huang, and S. Yan, "Robust and efficient subspace segmentation via least squares regression," in European Conference on Computer Vision, pp. 347–360, Springer, 2012.
K. Yu, T. Zhang, and Y. Gong, "Nonlinear learning using local coordinate coding," in Advances in Neural Information Processing Systems (NIPS), pp. 2223–2231, 2009.
D. Arpit, G. Srivastava, and Y. Fu, "Locality-constrained low rank coding for face recognition," in International Conference on Pattern Recognition, pp. 1687–1690, Tsukuba, Japan, 2012.
A. Martinez, The AR Face Database, CVC Technical Report 24, 1998.