Abstract
Automatic image annotation assigns labels to images to enable more accurate image retrieval and classification. This paper proposes a semisupervised framework based on graph embedding and multiview nonnegative matrix factorization (GENMF) for automatic annotation of multilabel images. First, we construct a graph embedding term in the multiview NMF based on the association diagrams between labels to impose semantic constraints. Then, the multiview features are fused and their dimension is reduced by the multiview NMF algorithm. Finally, image annotation is achieved by applying a KNN-based approach to the new features. Experiments show that the proposed algorithm achieves competitive performance in terms of accuracy and efficiency.
1. Introduction
The advent of the Internet age has brought explosive growth in image resources. Although managing and retrieving images by semantic tags is common and effective, a large number of images remain untagged or only partially tagged. Manual annotation is difficult because of its labor cost and because of the semantic nuances of annotation across cultures, religions, and languages. Moreover, the cognitive bias caused by subjectivity can introduce semantic discrepancies. Thus, designing an efficient automatic image annotation algorithm that provides accurate labels for untagged images has become an urgent problem.
Automatic image annotation (AIA) refers to the process by which computers algorithmically assign one or more semantic tags that reflect the content of a given image. It is a mapping from images to semantic concepts, that is, a process of understanding images. Image annotation is based on image feature representations, and the features used in different tasks have different representation abilities [1–3]. For example, global color and texture features have been successfully used in retrieving similar images [4], while local structure features perform well in object classification and matching [5, 6]. In general, features that depict images from different views provide complementary information. Thus, a rational fusion of multiview features yields a more comprehensive depiction of images, which benefits image search, classification, and other related tasks.
Many multiview learning algorithms have been proposed for tasks such as classification, retrieval, and clustering based on multiview features. According to the level of feature fusion, multiview learning methods can be grouped into two categories [7]: feature-level fusion, such as MKL [8], SVM2K [9], and CCA [10], and classifier-level fusion, such as hierarchical SVM [11]. Some experimental studies show that classifier-level fusion outperforms simple feature concatenation, whereas sophisticated feature-level fusion usually performs better than classifier-level fusion [11, 12].
Recently, many image annotation algorithms have used a variety of underlying features to improve annotation performance [8–10]. On one hand, multiview features improve accuracy; on the other hand, they decrease the efficiency and applicability of the algorithms because the feature dimension increases. Moreover, many existing multiview learning algorithms are unsupervised; that is, they do not use the label information in the training set. Features fused in this way may not effectively capture the semantic relationships between samples. This paper proposes a semisupervised learning framework based on graph embedding and multiview NMF (GENMF). In GENMF, feature fusion and dimension reduction are first performed by the proposed graph embedded multiview NMF algorithm, and the newly obtained features are then used to annotate images through a KNN-based approach.
2. Related Works
Existing image annotation algorithms can be roughly divided into two categories [13]: model-based learning methods and database-based retrieval methods. Model-based methods explore the relationship between high-level semantic concepts and low-level visual features to discover a mapping function for image annotation through machine learning or knowledge models. Unlike model-based methods, database-based methods do not need to build a mapping function from the training set; instead, they directly provide a sequence of candidate labels according to the already annotated images in the database.
There are three kinds of model-based learning methods for image annotation: classification-based methods, probability-based methods, and topic-model-based methods. Classification-based methods [14–16] treat tags as specific class labels and explore the mapping between low-level visual features and labels through machine learning. In essence, these methods transform image annotation into image classification. Different classifiers are used to establish mapping functions between low-level features (from images or regions) and semantic concepts, and the labels to which the classifiers assign high confidence are used to annotate the images. In contrast, probability-based methods [17, 18] do not use classifiers to build the mapping functions but explore the relationship between the underlying image features and the semantic labels based on unsupervised probabilistic and statistical models. They use these relations to calculate the joint probability of images and labels, or the conditional probability of labels given an image, and then estimate the likely labels through statistical inference. Topic-model-based methods [19, 20] use latent topics to associate low-level visual features with high-level semantic concepts.
The model-based methods face three difficulties in practical applications. First, learning models trained on datasets with finite image types and semantic labels can hardly reflect the feature distributions of the real world, which leads to unsatisfactory annotation performance on new features and semantic labels. Second, the limited size of training sets may result in overfitting and poor generalization. Third, low-level features often fail to express high-level semantic information because they belong to different feature spaces. Thus, this semantic gap also makes it hard to establish a mapping model between image features and semantic concepts.
The essence of retrieval-based methods is to directly provide a list of candidate labels for the images to be tagged, based on existing datasets with complete and valid label information. The most common retrieval methods are based on KNN [21–23]: they retrieve the k images most similar to the input image from the database, and the labels of these k images are ranked by their statistical or weighted statistical relationship to generate the candidate labels of the input image. The other category is graph-based methods [24–27], which use image feature distances to establish relevance graphs of samples. Based on the assumption that neighboring images in the relevance graph have similar labels (label smoothness), the similarity between nodes and the global structure of the graph are used to propagate and enrich node information, including labels and classes. This kind of semisupervised learning method is suitable for the partially tagged datasets found on the Internet.
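The KNN-based retrieval step described above can be illustrated with a minimal sketch. The function name `knn_candidate_labels`, the Euclidean distance, and the `1 / (1 + distance)` vote weight are assumptions chosen for illustration; the cited methods [21–23] use their own similarity measures and weighting schemes.

```python
from collections import Counter
import math

def knn_candidate_labels(query, database, k=3, top=2):
    """Rank candidate labels for `query` from its k nearest annotated images.

    `database` is a list of (feature_vector, labels) pairs.
    Each neighbour votes for its labels with weight 1 / (1 + distance),
    which is an illustrative weighting, not the one used in [21-23].
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Keep the k annotated images closest to the query.
    neighbours = sorted(database, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter()
    for feat, labels in neighbours:
        weight = 1.0 / (1.0 + dist(query, feat))
        for label in labels:
            votes[label] += weight
    return [label for label, _ in votes.most_common(top)]

# Hypothetical toy database of 2-D features with label sets.
database = [
    ([0.0, 0.0], {"sky", "sea"}),
    ([0.1, 0.0], {"sky", "beach"}),
    ([0.0, 0.2], {"sea"}),
    ([5.0, 5.0], {"tiger", "grass"}),
]
print(knn_candidate_labels([0.05, 0.05], database, k=3, top=2))
```

With this toy data the distant "tiger" image is never among the three nearest neighbours, so its labels receive no votes.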
Traditional graph-based methods usually label images by aggregating multiple features into one feature and building a relation graph based on this feature. The authors of [25] point out that traditional methods cannot effectively capture the unique information of each feature and propose to use different features to establish relation subgraphs and then link these subgraphs to form a supergraph. Label propagation is then carried out on the supergraph through a graph-based method. In [26], different feature graphs are built from different image features, and the relationship between images is then constructed through a graph-based method on these feature graphs. The relationship between images and different features is also constructed. Finally, the two relationships, namely, the relation between images and the relation between images and different features, are fused by a designed objective function to obtain good label candidates.
In [27], a graph learning KNN (GLKNN) is proposed by combining a KNN-based method and a graph-based method. GLKNN first uses the graph-based method to propagate the labels of the K nearest neighbors to the new image and obtain one sequence of candidate labels; it then employs the naive-Bayes nearest neighbor algorithm to establish the relationship between labels and image features and obtain another sequence of candidate labels. Finally, the two candidate label sequences are linearly combined to give the final predicted labels. In [28], graph embedding discriminant analysis is applied to classify marine fish species by constructing an intraclass similarity graph and an interclass penalty graph. Although the algorithm improves classification and clustering performance by using class labels to build the graph embedding term, the traditional graph embedding algorithm is not suitable for multilabel problems with multilabel images because there are no intraclass and interclass relationships. In [21, 22], different models based on metric methods are proposed to enhance the representation ability of features and further improve annotation performance. However, metric-based feature processing only linearly embeds the original features and does not reduce the feature dimension. In [13], multiple features are fused by concatenation, which ignores the manifold characteristics of the different features, and the high feature dimension makes the algorithm inefficient.
To reduce the dimension of each feature for annotation, an extended local sensitive discriminant analysis algorithm is proposed in [29] by constructing relevant and irrelevant graphs. In general, feature dimension reduction methods based on NMF are designed for single-view features. References [30, 31] extend this approach to multiview features by simply concatenating multiple vectors into one feature vector before further dimension reduction. However, such concatenation can lead to the curse of dimensionality. Besides, multiview features describe images from different views, so simple concatenation is not well justified. A multiview NMF model based on a shared coefficient matrix is then developed to capture the latent feature patterns in multiview features [32], where each view has its own basis matrix and all views share one coefficient matrix. That model is used for classification and clustering problems and is not suitable for multilabel problems with multilabel images.
Based on the above review, this paper proposes a semisupervised learning model based on multiview NMF and graph embedding. A novel multiview NMF algorithm based on graph embedding is developed to fuse the multiview features and reduce the dimension of the fused features by designing appropriate graph embedding regularization terms. Image annotation is then performed by applying a KNN-based algorithm to the new features.
3. The Proposed Methods
In this section, we describe the proposed semisupervised framework for automatic image annotation. First, the graph embedding terms for multilabel problems are constructed from a semantic similarity matrix. Second, an objective function is established by adding graph embedded semantic constraints. Third, the update rules for optimization are derived in detail. Finally, the overall framework of the algorithm is presented.
3.1. Graph Embedding for Multilabel Problem
The traditional graph embedding model was introduced for classification problems, in which each sample has only one label, so the two Laplacian matrices can be determined by whether two samples belong to the same category. In multilabel problems, however, a sample usually carries multiple category labels, so traditional graph embedding methods cannot be applied directly. In this paper, we construct a relation matrix according to whether samples are semantically related. By setting appropriate thresholds, the relevant matrix and the irrelevant matrix are obtained, and they are used to calculate the two corresponding Laplacian matrices.
Let denote the ith sample and denote label matrix, where is the number of samples in the training set, is the number of labels, represents the ith row of Y, and represents the ith column. The semantic similarity between sample i and sample j can be formulated as , where C is a priori label relation matrix similar to that in [33]. denotes the sample vector and denotes the L2norm of . Then, the semantic similarity matrix of samples can be obtained by the following formula:Given thresholds and (), samples with similarity greater than are relevant, and samples with similarity less than are irrelevant. Therefore, the relevant matrix W and the irrelevant matrix are constructed as follows:The corresponding Laplacian matrices are formulated as follows:where and .
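The thresholding step and the Laplacian construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulas: the similarity matrix `S` is taken as given, the two graphs use binary 0/1 weights, the diagonal is excluded, and the Laplacian is the standard unnormalised form L = D - W with D_ii = sum_j W_ij. The names `relevance_graphs` and `laplacian`, and the assignment of the value 2 to the relevance threshold and 1 to the irrelevance threshold, are likewise assumptions.

```python
def relevance_graphs(S, t_rel, t_irr):
    """Threshold a symmetric similarity matrix S into two 0/1 adjacency matrices.

    W[i][j]  = 1 if S[i][j] > t_rel (samples i and j treated as relevant)
    Wp[i][j] = 1 if S[i][j] < t_irr (samples i and j treated as irrelevant)
    Self-pairs (i == j) are left out of both graphs (an assumption).
    """
    n = len(S)
    W = [[1.0 if i != j and S[i][j] > t_rel else 0.0 for j in range(n)]
         for i in range(n)]
    Wp = [[1.0 if i != j and S[i][j] < t_irr else 0.0 for j in range(n)]
          for i in range(n)]
    return W, Wp

def laplacian(W):
    """Unnormalised graph Laplacian L = D - W, where D_ii = sum_j W_ij."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical similarity values for four samples.
S = [
    [9.0, 3.0, 0.5, 0.0],
    [3.0, 9.0, 1.5, 0.2],
    [0.5, 1.5, 9.0, 4.0],
    [0.0, 0.2, 4.0, 9.0],
]
W, Wp = relevance_graphs(S, t_rel=2.0, t_irr=1.0)
L, Lp = laplacian(W), laplacian(Wp)
for row in L:
    print(row)
```

Pairs whose similarity falls between the two thresholds (here samples 1 and 2, with similarity 1.5) appear in neither graph, so they contribute to neither constraint.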
Having the relevant and irrelevant matrices, the following two constraint items and are incorporated to make feature representations in the new feature space consistent with semantic concepts: where denotes the number of samples in the training set and and represent the visual feature vectors of sample i and sample j, respectively.
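Graph embedding constraints of this kind are commonly written as a weighted sum of squared distances between pairs of feature vectors, which for a symmetric weight matrix equals a trace expression in the graph Laplacian. The sketch below checks the standard identity sum_ij W_ij ||v_i - v_j||^2 = 2 tr(V L V^T) with L = D - W numerically. The pairwise form, the absence of any scaling factor, and the function names are assumptions for illustration; the exact form of the paper's two constraint terms may differ.

```python
def pairwise_penalty(V, W):
    """sum_ij W[i][j] * ||v_i - v_j||^2, where v_i is column i of V (K x N)."""
    K, N = len(V), len(V[0])
    total = 0.0
    for i in range(N):
        for j in range(N):
            d2 = sum((V[k][i] - V[k][j]) ** 2 for k in range(K))
            total += W[i][j] * d2
    return total

def trace_form(V, W):
    """2 * tr(V L V^T) with L = D - W and D_ii = sum_j W[i][j]."""
    K, N = len(V), len(V[0])
    deg = [sum(row) for row in W]
    total = 0.0
    for k in range(K):
        for i in range(N):
            for j in range(N):
                L_ij = (deg[i] if i == j else 0.0) - W[i][j]
                total += V[k][i] * L_ij * V[k][j]
    return 2.0 * total

# Hypothetical example: three samples on a chain graph, 2-D features.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
V = [[1.0, 2.0, 4.0],
     [0.0, 1.0, 1.0]]
print(pairwise_penalty(V, W), trace_form(V, W))
```

Minimising this quantity for the relevant graph pulls the features of related samples together, while the same quantity for the irrelevant graph measures how far apart unrelated samples are.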
3.2. An Automatic Image Annotation Model Based on Multiview Feature NMF and Graph Embedding
Let denote the data matrix, where is the feature matrix corresponding to the th view, is the dimension of feature vectors, M is the number of views, and N is the number of samples. The objective function can be formulated aswhere and are nonnegative matrices and K denotes the dimension of the new lowdimensional feature.
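To make the shared-coefficient factorization concrete, the sketch below minimises sum_m ||X_m - U_m V||_F^2 over nonnegative per-view bases U_m and one shared coefficient matrix V, using the standard Lee-Seung-style multiplicative updates for this unregularised objective. It is a sketch under stated assumptions only: it omits the graph embedding terms, it is not the paper's update rules (23) to (25), it assumes equal view weights, and all function and variable names are hypothetical.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ar, br)] for ar, br in zip(A, B)]

def scale(A, num, den, eps):
    """Element-wise multiplicative step A * num / (den + eps)."""
    return [[a * p / (q + eps) for a, p, q in zip(ar, pr, qr)]
            for ar, pr, qr in zip(A, num, den)]

def loss(Xs, Us, V):
    """Sum over views of the squared Frobenius error ||X_m - U_m V||^2."""
    total = 0.0
    for X, U in zip(Xs, Us):
        R = matmul(U, V)
        total += sum((x - r) ** 2
                     for xrow, rrow in zip(X, R) for x, r in zip(xrow, rrow))
    return total

def multiview_nmf(Xs, K, iters=200, eps=1e-9, seed=0):
    """Factorise each view X_m (d_m x N) as U_m (d_m x K) times a shared V (K x N)."""
    rng = random.Random(seed)
    N = len(Xs[0][0])
    Us = [[[rng.random() for _ in range(K)] for _ in X] for X in Xs]
    V = [[rng.random() for _ in range(N)] for _ in range(K)]
    history = [loss(Xs, Us, V)]
    for _ in range(iters):
        Vt = transpose(V)
        VVt = matmul(V, Vt)
        # Per-view basis update: U_m <- U_m * (X_m V^T) / (U_m V V^T)
        Us = [scale(U, matmul(X, Vt), matmul(U, VVt), eps)
              for X, U in zip(Xs, Us)]
        # Shared coefficient update: numerator and denominator summed over views
        num = [[0.0] * N for _ in range(K)]
        den = [[0.0] * N for _ in range(K)]
        for X, U in zip(Xs, Us):
            Ut = transpose(U)
            num = add(num, matmul(Ut, X))
            den = add(den, matmul(matmul(Ut, U), V))
        V = scale(V, num, den, eps)
        history.append(loss(Xs, Us, V))
    return Us, V, history

# Toy data: two views (4-D and 3-D) of six samples generated from a shared 2-D code.
rng = random.Random(1)
V_true = [[rng.random() for _ in range(6)] for _ in range(2)]
Xs = [matmul([[rng.random() for _ in range(2)] for _ in range(d)], V_true)
      for d in (4, 3)]
Us, V, history = multiview_nmf(Xs, K=2)
print(history[0], history[-1])
```

Each column of the returned `V` is the fused K-dimensional representation of one sample, which is what a downstream KNN-based annotator would consume in place of the concatenated original features.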
Furthermore, graph embedding regularization terms (7) and (8) are combined with the above loss function, then where and and are two equilibrium coefficients. Equation (10) consists of two terms: the first is the error term, and the second is the constraint term that imposes semantic constraints on V through graph embedding regularization. It implies that the features of semantically related samples are closer, and vice versa. It is worth noting that the model is semisupervised, since the graph embedding term constrains only the part of V that corresponds to labelled data.
3.3. Update Rules Derivation
The established model is semisupervised, and only part of the data has label information. The objective function can therefore be rewritten in block-matrix form. The derivation of the update rules is given below.
The update rule of formula (10) is derived as follows:Let and be the Lagrange multipliers of constraint conditions and , respectively, . Then the Lagrange function can be written asThe partial derivative of L with respect to is as follows:where , , and , the symbol means labelled and the symbol means unlabelled. Thus, and refer to the data with labels. Then (12) can be rewritten as Separating the terms associated with and , the above equation can be written asThe partial derivatives of L with respect to and are as follows:Using the KKT conditions and (i.e., and ), consider formulae (13), (18), and (19) and let the derivatives equal 0; the following three equations can be obtained:The following update rules can be obtained through the above three equations:It is mentioned in [34] that in order to ensure the convexity of the loss function, β needs to be taken as an appropriately small value, which is suggested by . Besides, [35] gives a modified strategy to the original update rules to ensure convergence. The same strategy can be applied to the derived update rules.
3.4. Framework of the GENMF
The schematic diagram of the proposed GENMF model is shown in Figure 1. First, multiview features are extracted from images as the input matrix X in (10). Equations (1)-(8) are used to build the graph embedding regularization terms, which provide the two input Laplacian matrices in (10). Then, the factor matrices are updated iteratively using update equations (23) to (25) until the maximum number of iterations is reached or the loss value is within the permissible range. Finally, the new features of the test set and the training set are input to the KNN-based labelling algorithm to obtain the predicted labels. The flowchart of the algorithm is shown in Figure 2.
Algorithm 1 gives the pseudocode of the GENMF.

4. Experimental Studies
4.1. Dataset and Experiment Design
The main purpose of the proposed algorithm is to improve the performance of automatic image annotation by fusing the multiview features and reducing the feature dimension, so that semantic concepts are better represented under semantic constraints in the new low-dimensional feature space. We therefore select the Corel5k dataset with 15 different features. Corel5k consists of 4500 training images and 499 test images and is available at http://lear.inrialpes.fr. The 15 features are all low-level image features: Gist, DenseSift, DenseSiftV3H1, HarrisSift, HarrisSiftV3H1, DenseHue, DenseHueV3H1, HarrisHue, HarrisHueV3H1, Rgb, RgbV3H1, Lab, LabV3H1, Hsv, and HsvV3H1. In the experiments, we select a local feature (DenseSiftV3H1), a global feature (Gist), and a color feature (Hsv).
In the experiments, all features except Gist are L2-normalized, and the normalized features are input into the GENMF to obtain low-dimensional representations. The low-dimensional feature vectors are then input into the 2PKNN annotation algorithm to obtain the predicted labels for the test set. The performance of the algorithm is evaluated in terms of four metrics: Pre, Rec, F1, and N+. Table 1 lists the parameters used in the experiments.
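The four metrics are not defined explicitly in this section. The sketch below follows the convention commonly used in the image annotation literature, which is assumed here rather than taken from the paper: Pre and Rec are the precision and recall averaged over all labels in the vocabulary, F1 is their harmonic mean, and N+ is the number of labels that are correctly predicted for at least one test image. The function name and the toy data are hypothetical.

```python
def annotation_metrics(truth, predicted, vocabulary):
    """Return (Pre, Rec, F1, N+) for parallel lists of per-image label sets."""
    precisions, recalls, n_plus = [], [], 0
    for label in vocabulary:
        # Images where the label is both in the ground truth and predicted.
        tp = sum(1 for t, p in zip(truth, predicted) if label in t and label in p)
        n_pred = sum(1 for p in predicted if label in p)
        n_true = sum(1 for t in truth if label in t)
        precisions.append(tp / n_pred if n_pred else 0.0)
        recalls.append(tp / n_true if n_true else 0.0)
        if tp > 0:
            n_plus += 1  # label recalled at least once
    pre = sum(precisions) / len(vocabulary)
    rec = sum(recalls) / len(vocabulary)
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1, n_plus

# Hypothetical three-image test set with a three-label vocabulary.
vocabulary = ["sky", "sea", "tiger"]
truth = [{"sky", "sea"}, {"sky"}, {"tiger"}]
predicted = [{"sky", "sea"}, {"sea"}, {"sky"}]
print(annotation_metrics(truth, predicted, vocabulary))
```

In this toy case "tiger" is never predicted, so it contributes zero precision and recall and is not counted in N+.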
4.2. Experimental Results
4.2.1. Convergence Curve of Loss Function
Figure 3 shows the convergence curves of the loss function with different parameters. It can be observed that the loss curve stabilizes after about 300 iterations.
4.2.2. The Influence of Different Thresholds
The relation matrix can be established according to formula (2). In our experiments, the maximum entry of the relation matrix is 12.9554 and the minimum is 0. The values of the two thresholds are traversed, and Figure 4 shows how the annotation performance changes with the selected values. On the whole, the algorithm obtains the highest F1 value when the two thresholds are 2 and 1, respectively, so these values are used in the following experiments.
4.2.3. The Influence of Different
Figure 5 shows the varying curve of Pre, Rec, F1, and N+ in the case of K = 300 with different values. Figure 5-1 shows that the annotation accuracy increases first and then decreases with the increase of . When , the accuracy reaches the highest value. Figure 5-2 shows that the recall rate generally increases first and then decreases. When , the recall rate reaches the highest value. From Figure 5-3, it can be seen that the F1 value also increases first and then decreases with the increase of , but a concave point appears at . When , the F1 value reaches the highest value. In Figure 5-4, the N+ value fluctuates in the interval , and its value reaches the highest value at and decreases afterwards.
4.2.4. The Influence of Different Feature Dimensions
Figure 6 shows the annotation performance curves when α is taken as 0, 1000, and 2000, respectively, and K varies from 100 to 800 in steps of 100. The three curves with different values of α show a consistent trend. In Figure 6-1, the accuracy increases with the dimension because more information is retained, and the curve becomes stable once α reaches 2000. The worst performance occurs at α = 0. Figure 6-2 shows that the recall rate decreases slightly as the dimension increases, because retrieval becomes more demanding in higher dimensions. In Figure 6-3, F1 reflects the combined effect of accuracy and recall. F1 initially increases with the dimension and then tends to be stable, except for α = 0. Figure 6-4 shows that the N+ value fluctuates but the overall trend is stable. In general, on all four metrics the proposed algorithm outperforms the original features when α = 1000 or α = 2000 and the dimension is in the range 200-800.
4.2.5. Comparison with Existing Annotation Algorithms
Table 2 presents the comparison with existing annotation algorithms. RMLF [36] optimizes the final predicted tag score by fusing the predicted tag scores of 15 different features. LDMKL [14] and SDMKL [14] annotate images with different classifiers based on the nonlinear kernel of a three-layer network. 2PKNN [22] annotates in two steps: after dealing with data imbalance, images are annotated through a KNN-based method on the balanced dataset. LJNMF [31], merging features [31], and Scoefficients [31] consider different kinds of NMF modeling, extract new features, and annotate images through a KNN-based method. TagProp (ML) [21] and TagProp (σML) [21] learn a discriminative feature fusion on the training set by designing a metric learning model and annotate images using a weighted KNN method. JEC [37] is a KNN-based algorithm that uses the average distance over multiple features and serves as a benchmark algorithm for image annotation. MRFA [38] proposes a semantic context modeling and learning method based on multiple Markov random fields. SML [39] is a discriminative model that treats each label as one class in a multiclass classification problem; GS [38] introduces a regularization-based feature selection algorithm to exploit the sparsity and clustering properties of features.
In Table 2, the note (3f) denotes the three features selected in this paper, and the note (3f’) denotes three features that differ from those used in this paper. The results of the other algorithms are taken directly from the respective papers and use all 15 features. Our algorithm uses only three features, and Table 2 shows that the proposed GENMF nevertheless achieves competitive performance.
4.2.6. The Best, Average, and Standard Deviation of the Results
Table 3 shows the best, average, and standard deviation of the results over 10 independent runs. NMF-based algorithms involve some randomness, and different initial values may produce different results. Table 3 shows that the influence of the initialization is limited, although better performance could be expected with a better initialization strategy. In addition, labelling all 499 test images takes 13.945 seconds on average with the new low-dimensional features of the proposed GENMF, whereas labelling with the original features takes 34.652 seconds, which is about 2.5 times as long.
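As a quick check of the reported ratio, using only the two timings given in this subsection (the per-image figure is derived here and is not reported in the paper):

```latex
\frac{34.652\ \text{s}}{13.945\ \text{s}} \approx 2.485 \approx 2.5,
\qquad
\frac{13.945\ \text{s}}{499\ \text{images}} \approx 0.028\ \text{s per image}.
```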
5. Conclusions
In this paper, we propose a semisupervised framework based on graph embedding and multiview nonnegative matrix factorization for automatic annotation of multilabel images. The main purpose of the proposed algorithm is to improve the performance of automatic image annotation by fusing multiview features and reducing the feature dimension, so that semantic concepts are better represented under semantic constraints in the new low-dimensional feature space. For feature fusion and dimension reduction, a novel graph embedding term is constructed from the relevant graph and the irrelevant graph. The fusion of multiview features and the reduction of dimensionality are then realized with the multiview NMF model, and the update rules of the model are derived. Finally, images are annotated using a KNN-based approach. Experimental results show that the proposed algorithm achieves competitive performance in terms of accuracy and efficiency.
Data Availability
The code used in this paper, written in Matlab, is available at https://github.com/MenSanYan/imageannotation.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors are grateful for the support of the National Natural Science Foundation of China (61572104, 61103146, 61425002, and 61751203), the Fundamental Research Funds for the Central Universities (DUT17JC04), and the Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172017K03).