Abstract

Automatic image annotation remains a challenging task due to the semantic gap between visual features and semantic concepts. To reduce this gap, this paper puts forward a kernel-based multiview joint sparse coding (KMVJSC) framework for image annotation. In KMVJSC, different visual features as well as label information are treated as distinct views and are mapped to an implicit kernel space, in which the originally nonlinearly separable data become linearly separable. All the views are then integrated into a multiview joint sparse coding framework that adaptively finds a set of optimal sparse representations and discriminative dictionaries, effectively exploiting the complementary information of the different views. An optimization algorithm is presented by extending the K-singular value decomposition (KSVD) and accelerated proximal gradient (APG) algorithms to the kernel multiview setting. In addition, a label propagation scheme based on sparse reconstruction and a weighted greedy label transfer algorithm is proposed. Comparative experiments on three datasets demonstrate the competitiveness of the proposed approach relative to other related methods.

1. Introduction

With the surge of private and online images, automatic image annotation has become of great research interest in computer vision since it is a vital step for image retrieval and management [1, 2]. Image annotation aims to automatically predict a set of semantic labels, such as “sea,” “beach,” and “sand,” for unannotated images by learning the relevance among images. However, visually similar images may not be semantically correlated with each other, which keeps the problem challenging.

In the past decade, considerable research efforts [3–12] have been devoted to automatic annotation. They can be roughly classified into three types of models: discriminative models, generative models, and nearest neighbor models. Discriminative models [3–5] treat image annotation as a multilabel classification problem, in which each label is a single class. Given a test image, a semantic label is propagated to it only if the corresponding classifier decides that the image belongs to that class. However, this type of model neglects the correlation between different labels, which is also very important for image annotation. Generative models [6, 7] attempt to infer the correlations or joint probabilities between images and semantic concepts. By learning statistical models, the test image is annotated based on probability computation. However, these models contain many parameters, which leads to a heavy computational cost for parameter estimation. Nearest neighbor (NN) based models [8–12] predict labels by taking a weighted combination of label absence or presence among neighboring images. Because of their simplicity and efficiency, NN-based methods have attracted increasing research attention.

Recently, the sparse coding scheme and its variations have been successfully used in image annotation tasks. It is substantially related to the NN-based model since the test image is often represented by a few representative samples in a low-dimensional manifold. For example, Wang et al. [13] reconstructed the test image sparsely and transferred labels according to the sparse coefficients. Gao et al. [14] considered each (sub)class of images as a (sub)group and employed multilayer group sparsity to classify and annotate single-label images concurrently. Cao et al. [15] utilized the group sparse reconstruction framework to learn the label correlation for the dictionary and reconstructed the test image for label prediction under the weakly supervised case. Lu et al. [16] presented a more descriptive and robust visual bag-of-words (BOW) representation by a semantic sparse recoding method for image annotation and classification. Jing et al. [17] learned a label-embedded dictionary as well as a sparse representation for image annotation. In addition, the sparse representation can be further enhanced by exploiting kernel mapping, since it maps nonlinearly separable features into a higher dimensional feature space, in which features with similar labels are closely grouped together while those without the same labels become linearly separable. Moran and Lavrenko [18] introduced a sparse kernel learning framework for the continuous relevance model, which can adaptively select different kernels for different features, and obtained a great improvement.

Despite these efforts, most existing annotation methods combine the information from different image features by concatenating them into a long feature vector, which treats all the features equally and ignores their different contributions to the final decision. To solve this problem, Kalayeh et al. [19] introduced a multiview learning technique into image annotation, in which each type of feature as well as the label matrix is considered as a view, and all the views are adaptively integrated to exploit the complementary information; however, the sparsity prior on the training samples used to reconstruct the test image is not considered. Liu et al. [20] introduced a multiview sparse coding framework for semisupervised image annotation, but they assumed that the different views share a common sparse pattern, which ignores the diversity between the views. Yuan et al. [21] adopted a multitask joint sparse representation for image classification, allowing different coefficient representations for different tasks and enforcing the similarity among different tasks by joint sparsity. However, image annotation is a multilabel classification problem, so that framework cannot be used directly.

Motivated by the previous research, we formulate image annotation as a kernel-based multiview joint sparse coding (KMVJSC) learning problem. Figure 1 describes the framework of KMVJSC. In particular, we map the feature views and the label view to an implicit high-dimensional space and integrate all the views adaptively by multiview learning to strengthen the power of the multiple modalities of an image. We aim to find a set of optimal sparse representations as well as dictionaries for each view simultaneously. On the one hand, different views of an image should have similar coding coefficients to jointly represent the same image; on the other hand, these coefficients should retain some diversity to reflect the distinctive properties of the different views. Thus, we adopt different sparse coefficients for different views, allowing each view to be flexibly coded over its associated dictionary; at the same time, we employ a joint sparsity constraint to encourage the sparse coefficients of different views to be similar. The optimization algorithm and label prediction scheme under the proposed framework are developed. Given a test image, we map its multiple features into the same kernel space and reconstruct the test image by joint sparse coding over the learned atom representation dictionary. The product of the atom representation dictionary and the corresponding sparse coefficients is considered as the scores of the near neighbors of the test image, and a greedy label transfer scheme is used to obtain the annotation. Experiments on three datasets demonstrate the effectiveness of our proposed method and its competitive performance compared with related methods.

In summary, the major contributions of this paper are as follows: (1) An effective kernel-based multiview joint sparse coding framework is proposed and successfully applied in image annotation; (2) the optimization algorithm is proposed by extending the accelerated proximal gradient (APG) and K-singular value decomposition (KSVD) algorithms into a kernel-based multiview case; (3) a label prediction algorithm is proposed for kernel sparse representation framework based on the sparse reconstruction and weighted greedy label transfer scheme.

The rest of this paper is organized as follows: Section 2 briefly discusses the related work. Section 3 describes the details of our kernel-based multiview joint sparse representation, optimization algorithm, and label prediction scheme. Experiment results are reported and analyzed in Section 4. Finally, we conclude this paper in Section 5.

2. Related Work

In this section, we review related work on sparse representation, kernel sparse learning, and multiview learning.

2.1. Sparse Representation Based Image Annotation

Sparse coding aims to represent the observed data as a linear combination of dictionary entries (or training samples), with the constraint that each image feature vector is represented by only a subset of all the available dictionary atoms. Denote the feature matrix formed by the original training samples as $Y = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{d \times N}$, where $y_i$ ($i = 1, 2, \ldots, N$) is the feature vector of the $i$th sample image and $d$ is the feature dimension. Then, the conventional sparse representation aims to seek a sparse coefficient matrix $X$ and the associated dictionary $D$ by solving the following optimization problem [22]:
$$\min_{D, X}\ \|Y - DX\|_F^2 + \lambda \|X\|_1, \tag{1}$$
where $\|X\|_1$ is the L1-norm sparsity regularization and $\lambda$ is a trade-off parameter used to balance the sparsity and the reconstruction error.
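
To make this concrete, the following minimal Python sketch solves the sparse coding subproblem of (1) for a fixed dictionary by iterative soft thresholding (ISTA). The function names and the choice of ISTA are ours for illustration; they are not necessarily the solver used in [22].

import numpy as np

def soft_threshold(V, t):
    # Elementwise soft thresholding: the proximal operator of t * ||.||_1.
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def sparse_code_ista(Y, D, lam=0.1, n_iter=200):
    # Solve min_X ||Y - D X||_F^2 + lam * ||X||_1 with D held fixed.
    X = np.zeros((D.shape[1], Y.shape[1]))
    L = 2.0 * np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ X - Y)    # gradient of ||Y - D X||_F^2
        X = soft_threshold(X - grad / L, lam / L)
    return X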

For label transfer, given a test image, [13] adopts all the training images as the dictionary and propagates the labels from the training images to the test image directly through the product of the label matrix of the training images with the learned sparse vector. The method of [14] predicts the labels of the test image based on the reconstruction error within the (sub)groups and assigns labels from the (sub)group with the minimum reconstruction error to the test image. However, the discriminative power of sparse representation in the original space is still limited compared to that in the kernel space.

2.2. Kernel Sparse Representation

Kernel sparse coding improves (1) by introducing the kernel trick on both the training images and the dictionary and has obtained great success in image classification. Let $\Phi(Y) = [\phi(y_1), \phi(y_2), \ldots, \phi(y_N)]$ denote the matrix whose columns are obtained by embedding the input features into a feature space $\mathcal{F}$ using the mapping $\phi: \mathbb{R}^d \to \mathcal{F}$. Furthermore, let $\Phi(D)$ denote the dictionary in this mapped feature space; the kernel sparse representation [23] problem is defined as follows:
$$\min_{\Phi(D), X}\ \|\Phi(Y) - \Phi(D)X\|_F^2 + \lambda \|X\|_1. \tag{2}$$

Equation (2) can be rewritten as (3), which depends only on a Mercer kernel function $\kappa(\cdot,\cdot)$, not on the mapping $\phi$:
$$\min_{X}\ \operatorname{tr}\big(K(Y,Y)\big) - 2\operatorname{tr}\big(K(Y,D)X\big) + \operatorname{tr}\big(X^{T}K(D,D)X\big) + \lambda\|X\|_1, \tag{3}$$
where $K(Y,Y)$ is a positive semidefinite matrix with the entries computed from the Mercer kernel, $[K(Y,Y)]_{ij} = \kappa(y_i, y_j)$, and $K(Y, d)$ is an $N$-dimensional vector with $[K(Y,d)]_i = \kappa(y_i, d)$. A Mercer kernel is a function $\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$ representing the nonlinear similarity between two vectors $x$ and $y$. Some commonly used kernels include polynomial kernels, Gaussian kernels, and histogram intersection kernels (HIK).
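
As a small illustration of how (3) is evaluated without the explicit mapping $\phi$, the sketch below computes the kernelized objective for a single sample from precomputed kernel values; the argument names are hypothetical.

import numpy as np

def kernel_sc_objective(x, K_DD, k_Dy, k_yy, lam):
    # ||phi(y) - Phi(D) x||^2 + lam * ||x||_1, written purely with kernels:
    # K_DD[i, j] = kappa(d_i, d_j), k_Dy[i] = kappa(d_i, y), k_yy = kappa(y, y).
    recon = k_yy - 2.0 * k_Dy @ x + x @ K_DD @ x
    return recon + lam * np.abs(x).sum()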

For image classification, a given test image is assigned to the class with the minimum reconstruction error obtained by the learned dictionary and sparse coefficients for that class in the mapped space. However, image annotation is a multilabel classification problem and cannot adopt this decision rule for label transfer directly. In addition, the label transfer scheme of [13] in the original space cannot be carried over to the mapped space since the mapping function is unknown. Thus, the application of kernel sparse representation to image annotation has been limited.

2.3. Multiview Learning

While the previous framework has proven successful for many tasks, it has only been applied to the single-view case. With multiple types of input modalities, [20] proposed a multiview learning based sparse representation model, which builds on the following general framework:
$$\min_{\{D^v\}, X}\ \sum_{v=1}^{V} \big\|Y^v - D^v X\big\|_F^2 + \lambda \|X\|_1. \tag{4}$$

Here, each image sample is represented by $V$ different feature views; $Y^v$ and $D^v$ are the feature matrix of all the training images and the learned dictionary of the $v$th view, respectively, and $X$ is the sparse coefficient matrix shared by all views.

In [20], labels are treated as an additional $(V+1)$th view; then, $Y^{V+1}$ and $D^{V+1}$ represent the label view matrix of the training samples and the associated label view dictionary, respectively. For a new test sample with feature views $\{y^v\}_{v=1}^{V}$, the corresponding sparse code $\hat{x}$ can be obtained by solving the convex problem
$$\hat{x} = \arg\min_{x}\ \sum_{v=1}^{V} \big\|y^v - D^v x\big\|_2^2 + \lambda \|x\|_1, \tag{5}$$
and the label view of the test sample is then predicted directly by $D^{V+1}\hat{x}$. Although [20] implements image annotation effectively, it adopts shared sparse coefficients for all views, which does not hold in reality because of the diversity of different views.
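
The sketch below illustrates the shared-coefficient prediction of (5) with a simple proximal gradient loop; it is our own minimal rendering under hypothetical variable names, not the solver of [20].

import numpy as np

def predict_labels_shared_code(y_views, D_views, D_label, lam=0.1, n_iter=200):
    # Fit one sparse code x jointly to all feature views of a test image
    # (Eq. (5)), then read off the label view as D_label @ x.
    x = np.zeros(D_views[0].shape[1])
    L = 2.0 * sum(np.linalg.norm(D, 2) ** 2 for D in D_views)  # step bound
    for _ in range(n_iter):
        grad = sum(2.0 * D.T @ (D @ x - y) for D, y in zip(D_views, y_views))
        v = x - grad / L
        x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)  # soft threshold
    return D_label @ x  # soft scores over the label vocabulary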

3. The Proposed Method

In this section, we introduce our proposed method, the optimization procedure, and the label propagation algorithm in detail. Throughout this paper, given a matrix $X$, we use $x_i$ to denote its $i$th column vector and $x^j$ to denote its $j$th row vector.

3.1. Kernel-Based Multiview Joint Sparse Representation

Motivated by the previous works [20, 21], we utilize a different sparse coefficient representation for each view, allowing flexibility of the sparse coefficients across distinct views; we then introduce a joint sparsity constraint into our kernel-based multiview sparse learning framework to keep the correlation among the multiple views. Specifically, let $\{Y^v\}_{v=1}^{V}$ be a set of training image samples obtained from $V$ different feature views; each view contains the sample features denoted by $Y^v = [y_1^v, y_2^v, \ldots, y_N^v] \in \mathbb{R}^{d_v \times N}$, where $y_i^v$ is the $i$th sample. The label information is considered as another view $Y^{V+1} \in \{0,1\}^{C \times N}$, where $C$ is the number of labels, $y_i^{V+1}$ is the label vector of the $i$th image, and each entry is either 1 or 0, representing the presence or absence of a given label in the image. Let $D^v \in \mathbb{R}^{d_v \times K}$ be an overcomplete dictionary ($K > d_v$), and let $d_j^v$ denote the $j$th column of $D^v$. Let $\Phi(Y^v)$ and $\Phi(D^v)$ denote the corresponding sample matrix and dictionary of the $v$th view obtained using the mapping $\phi^v$, respectively. Thus, we have $\Phi(Y^v) = [\phi^v(y_1^v), \ldots, \phi^v(y_N^v)]$ and $\Phi(D^v) = [\phi^v(d_1^v), \ldots, \phi^v(d_K^v)]$. Then, the objective function of kernel-based multiview joint sparse representation can be defined as
$$\min_{\{\Phi(D^v)\}, \{X^v\}}\ \sum_{v=1}^{V+1} \big\|\Phi(Y^v) - \Phi(D^v)X^v\big\|_F^2 + \lambda \sum_{i=1}^{N} \big\|\widetilde{X}_i\big\|_{1,2}, \tag{6}$$
where $X^v = [x_1^v, \ldots, x_N^v]$ corresponds to the sparse representation coefficients of $\Phi(Y^v)$ over the dictionary $\Phi(D^v)$ in the mapped space, $\widetilde{X}_i = [x_i^1, x_i^2, \ldots, x_i^{V+1}] \in \mathbb{R}^{K \times (V+1)}$ collects the coefficients of the $i$th sample across all views, and $\|\widetilde{X}_i\|_{1,2}$ is defined as the sum of the $\ell_2$-norms of all rows of the matrix $\widetilde{X}_i$, which encourages $\widetilde{X}_i$ to be sparse in the column direction and dense in the row direction. It helps to enforce the sparse coefficients of different views to share a similar pattern. $\lambda$ is the tuning parameter that controls the regularization term.
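
The joint sparsity regularizer in (6) is simple to compute; a one-function sketch in our notation follows.

import numpy as np

def l12_norm(X_tilde):
    # Sum of the l2 norms of the rows of X_tilde, which has shape (K, V+1):
    # one column of coefficients per view for a single sample.
    return np.linalg.norm(X_tilde, axis=1).sum()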

The dictionaries from different views must be correlated with each other since they are learned from the same training images. To keep this correlation, we employ $\Phi(D^v) = \Phi(Y^v)A$, since dictionary atoms lie within the subspace spanned by the input data, where $A \in \mathbb{R}^{N \times K}$ is called the atom representation dictionary and $\Phi(Y^v)$ is called the base dictionary [24]. Then, (6) can be rewritten as
$$\min_{A, \{X^v\}}\ \sum_{v=1}^{V+1} \big\|\Phi(Y^v) - \Phi(Y^v)AX^v\big\|_F^2 + \lambda \sum_{i=1}^{N} \big\|\widetilde{X}_i\big\|_{1,2}. \tag{7}$$

Compared with the dictionary obtained by using all the training samples directly, in which each dictionary atom corresponds to a training sample, the dictionary learned by (6) has no explicit physical meaning in its structure; that is, atoms located in the same column of different dictionary views may not originate from the same training sample, and the correlations among different views are also lost. By introducing the shared $A$, we can learn an implicit dictionary common to the multiple views in a similar way as (6), while the explicit nature (atom location information) of the dictionary is maintained in the base dictionary $\Phi(Y^v)$, which is very important for the subsequent label transfer in kernel space.
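
Because $\Phi(D^v) = \Phi(Y^v)A$, each view's reconstruction error in (7) can be evaluated from the Gram matrix alone, a direct consequence of the kernel trick; the sketch below (our variable names) shows this.

import numpy as np

def kernel_recon_error(K_v, A, X_v):
    # ||Phi(Y^v) - Phi(Y^v) A X^v||_F^2 = tr(R^T K^v R) with R = I - A X^v,
    # where K_v = Phi(Y^v)^T Phi(Y^v) is the Gram matrix of the view.
    R = np.eye(K_v.shape[0]) - A @ X_v
    return np.trace(R.T @ K_v @ R)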

3.2. Optimization

To solve the problem in (7), we adopt an iterative procedure based on the alternating optimization technique [25], optimizing with respect to $A$ and $\{X^v\}$ while holding the other fixed. In the following, we provide a brief description of the alternating optimization for our method. For convenience, the Notation section lists the important notations used in this paper.

First, keeping $A$ fixed, the problem in (7) can be simplified as
$$\min_{\{X^v\}}\ \sum_{v=1}^{V+1} \big\|\Phi(Y^v) - \Phi(Y^v)AX^v\big\|_F^2 + \lambda \sum_{i=1}^{N} \big\|\widetilde{X}_i\big\|_{1,2}. \tag{8}$$

In the next step, keeping $\{X^v\}$ fixed, the problem in (7) is simplified as
$$\min_{A}\ \sum_{v=1}^{V+1} \big\|\Phi(Y^v) - \Phi(Y^v)AX^v\big\|_F^2, \tag{9}$$
which can also be represented, via the kernel trick, in the matrix form
$$\min_{A}\ \sum_{v=1}^{V+1} \operatorname{tr}\Big(\big(I - AX^v\big)^{T} K^{v} \big(I - AX^v\big)\Big). \tag{10}$$
Here, $K^{v} = K(Y^v, Y^v)$ is the kernel Gram matrix of the $v$th view.

This process is iterated until the solutions of $A$ and $\{X^v\}$ converge to a local minimum. The following is a detailed description of the optimization algorithms for solving problems (8) and (10).

Learning Sparse Codes. Equation (8) can be decoupled into $N$ distinct subproblems; the $i$th ($i = 1, 2, \ldots, N$) subproblem is formulated as follows:
$$\min_{\widetilde{X}_i}\ \sum_{v=1}^{V+1} \big\|\phi^v(y_i^v) - \Phi(Y^v)Ax_i^v\big\|_2^2 + \lambda \big\|\widetilde{X}_i\big\|_{1,2}. \tag{11}$$

The objective function in (11) is a nonsmooth convex function since the $\ell_{1,2}$-norm term is nondifferentiable. It can be solved by the accelerated proximal gradient (APG) method [26, 27], extended so that the signals now lie in the multiview kernel space, which has a very high dimension.

The optimization then proceeds by alternating generalized gradient mapping and aggregation steps [27].

In the generalized gradient mapping step, we denote the smooth part of (11) by $f(\widetilde{X}_i) = \sum_{v=1}^{V+1} \|\phi^v(y_i^v) - \Phi(Y^v)Ax_i^v\|_2^2$, $K^v = K(Y^v, Y^v)$, and $k^v = K(Y^v, y_i^v)$; then, the gradient of $f$ with respect to $x_i^v$ is calculated as follows:
$$\nabla_{x_i^v} f = 2\big(A^{T}K^{v}Ax_i^{v} - A^{T}k^{v}\big). \tag{12}$$
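
A direct transcription of (12), assuming the kernel matrices are precomputed, might look as follows.

import numpy as np

def grad_view(K_v, k_v, A, x_iv):
    # Gradient of ||phi(y_i^v) - Phi(Y^v) A x_i^v||^2 with respect to x_i^v,
    # with K_v = K(Y^v, Y^v) and k_v = K(Y^v, y_i^v).
    return 2.0 * (A.T @ (K_v @ (A @ x_iv)) - A.T @ k_v)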

Then, we apply the gradient mapping to (11) to get the following equivalent objective function:
$$Q\big(\widetilde{X}_i, Z\big) = f(Z) + \big\langle \nabla f(Z),\ \widetilde{X}_i - Z \big\rangle + \frac{\eta}{2}\big\|\widetilde{X}_i - Z\big\|_F^2 + \lambda\big\|\widetilde{X}_i\big\|_{1,2}. \tag{13}$$

It comprises the regularization term and the approximation of $f$ by its first-order Taylor expansion at the aggregation point $Z$, regularized by the squared Euclidean distance between $\widetilde{X}_i$ and $Z$, where $\eta$ is a parameter controlling the step penalty.

In the $k$th iteration, we calculate $\widetilde{X}_i^{(k)}$ by minimizing (13) at $Z = Z^{(k)}$, which reduces to
$$\widetilde{X}_i^{(k)} = \arg\min_{\widetilde{X}_i}\ \lambda\big\|\widetilde{X}_i\big\|_{1,2} + \frac{\eta}{2}\Big\|\widetilde{X}_i - U^{(k)}\Big\|_F^2, \quad U^{(k)} = Z^{(k)} - \frac{1}{\eta}\nabla f\big(Z^{(k)}\big). \tag{14}$$

The work in [26] gives the closed-form solution of the optimization problem (14) as
$$\tilde{x}^{j} = \max\Big(0,\ 1 - \frac{\lambda}{\eta\,\|u^{j}\|_2}\Big)u^{j}, \tag{15}$$
where $u^{j}$ and $\tilde{x}^{j}$ are the $j$th rows of the matrices $U^{(k)}$ and $\widetilde{X}_i^{(k)}$, respectively.
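
Equation (15) is the row-wise group soft thresholding operator; a minimal sketch follows (the small constant guarding against division by zero is our addition).

import numpy as np

def row_group_shrink(U, tau):
    # Proximal operator of tau * sum_j ||row_j||_2: scale each row of U
    # toward zero and zero out rows whose l2 norm falls below tau.
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12)) * U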

In the aggregation step, $Z^{(k+1)}$ is computed as a linear combination of $\widetilde{X}_i^{(k)}$ and $\widetilde{X}_i^{(k-1)}$:
$$Z^{(k+1)} = \widetilde{X}_i^{(k)} + \frac{t_k - 1}{t_{k+1}}\Big(\widetilde{X}_i^{(k)} - \widetilde{X}_i^{(k-1)}\Big), \tag{16}$$
where we conventionally set $t_{k+1} = \big(1 + \sqrt{1 + 4t_k^2}\,\big)/2$ with $t_0 = t_1 = 1$.

The steps of the optimization algorithm for (8) by the extended APG are elaborated in Algorithm 1.

Input:
  Training image samples $\{Y^v\}_{v=1}^{V+1}$,
  Atom representation dictionary $A$ (i.e., $\Phi(D^v) = \Phi(Y^v)A$),
  Regularization parameter $\lambda$ and step size $\eta$.
Output:
  Sparse coefficient matrices $\{X^v\}_{v=1}^{V+1}$.
(1) For $i = 1$ to $N$
(2) Initialization: Initialize $\widetilde{X}_i^{(0)}$, $\widetilde{X}_i^{(1)}$ to be zero matrices; $t_0 = t_1 = 1$; and set $k = 1$.
(3) While not converged Do
    Generalized gradient mapping step:
(4)   Calculate $\nabla f(Z^{(k)})$ by Eq. (12) and form $U^{(k)}$ as in Eq. (14)
(5)   For $j = 1$ to $K$
(6)    Calculate the $j$th row of $\widetilde{X}_i^{(k)}$ by Eq. (15)
(7)   End For
    Aggregation step:
(8)    $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$
(9)    $Z^{(k+1)} = \widetilde{X}_i^{(k)} + \frac{t_k - 1}{t_{k+1}}(\widetilde{X}_i^{(k)} - \widetilde{X}_i^{(k-1)})$
(10)    $k = k + 1$
(11) End While
(12) End For
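
For reference, the following compact Python sketch assembles Algorithm 1 for a single sample from the pieces above. It is an illustrative reconstruction under our notation and stated assumptions (precomputed kernel matrices, a conservative step size), not the authors' released code.

import numpy as np

def kmvjsc_sparse_code(K_list, k_list, A, lam=0.1, n_iter=100):
    # APG over the stacked coefficient matrix X_tilde of shape (K, V+1).
    # K_list[v] = K(Y^v, Y^v); k_list[v] = K(Y^v, y_i^v) for the i-th sample.
    n_views, n_atoms = len(K_list), A.shape[1]
    # Conservative step penalty eta: bound on the per-view Lipschitz constants.
    eta = 2.0 * max(np.linalg.norm(A.T @ Kv @ A, 2) for Kv in K_list)
    X = np.zeros((n_atoms, n_views))
    X_prev, Z = X.copy(), X.copy()
    t, t_prev = 1.0, 1.0
    for _ in range(n_iter):
        # Generalized gradient mapping step (Eqs. (12), (14), and (15)).
        G = np.stack([2.0 * (A.T @ (Kv @ (A @ Z[:, v])) - A.T @ kv)
                      for v, (Kv, kv) in enumerate(zip(K_list, k_list))], axis=1)
        U = Z - G / eta
        norms = np.linalg.norm(U, axis=1, keepdims=True)
        X_prev, X = X, np.maximum(0.0, 1.0 - (lam / eta) / np.maximum(norms, 1e-12)) * U
        # Aggregation step (Eq. (16)).
        t_prev, t = t, (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Z = X + ((t_prev - 1.0) / t) * (X - X_prev)
    return X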

Dictionary Updating. For problem (10), considering the stacked matrices $\bar{I} = [I_N; \ldots; I_N]$, $\bar{X} = [X^1; \ldots; X^{V+1}]$, and the block-diagonal kernel matrix $\bar{K} = \operatorname{blkdiag}(K^1, \ldots, K^{V+1})$, we obtain
$$\min_{A}\ \operatorname{tr}\Big(\big(\bar{I} - \bar{A}\bar{X}\big)^{T}\,\bar{K}\,\big(\bar{I} - \bar{A}\bar{X}\big)\Big), \quad \bar{A} = \operatorname{blkdiag}(A, \ldots, A). \tag{17}$$

This is equivalent to the dictionary update stage in kernel dictionary learning algorithms. We adopt the kernel KSVD dictionary update strategy of [24] to solve it efficiently: the atoms are updated one at a time, and for the $k$th atom, an eigendecomposition $(E_k^R)^{T}\bar{K}E_k^R = V\Delta V^{T}$ of the error matrix restricted to the atom's support is computed. Then, $a_k$ is chosen as the first column of $E_k^R V\Delta^{-1/2}$, and the corresponding coefficient row is updated accordingly.
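
For intuition, the sketch below shows a single-view version of the kernel KSVD atom update in the spirit of [24]; the multiview update operates analogously on the stacked matrices of (17). The support restriction and the eigendecomposition substitute for an SVD of $\Phi(Y)E_k$, which is never formed explicitly.

import numpy as np

def kernel_ksvd_atom_update(K, A, X, k):
    # Update atom a_k and the k-th coefficient row to minimize
    # ||Phi(Y)(E_k - a_k x^k)||_F^2, restricted to samples that use atom k.
    omega = np.nonzero(X[k, :])[0]
    if omega.size == 0:
        return A, X
    E = np.eye(K.shape[0]) - A @ X + np.outer(A[:, k], X[k, :])
    E_R = E[:, omega]
    # Eigendecomposition of E_R^T K E_R stands in for the SVD of Phi(Y) E_R.
    w, V = np.linalg.eigh(E_R.T @ K @ E_R)
    sigma, v1 = np.sqrt(max(w[-1], 1e-12)), V[:, -1]
    A[:, k] = E_R @ v1 / sigma       # new atom, normalized in the mapped space
    X[k, omega] = sigma * v1         # new coefficient row on the support
    return A, X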

3.3. Label Prediction

The proposed method treats labels as an additional view, so we can infer the label information from the sparse coefficients. In particular, given a test image represented by multiple feature views $\{z^v\}_{v=1}^{V}$, the visual features of the training images $\{Y^v\}_{v=1}^{V}$, the learned dictionary $A$ from the features, and the labels $Y^{V+1}$ of the training images, we obtain the sparse coefficient vectors $\hat{x}^v$ ($v = 1, 2, \ldots, V$) for the test image over the learned dictionary by solving the following convex problem, which can be solved similarly to problem (11):
$$\min_{\{\hat{x}^v\}}\ \sum_{v=1}^{V} \big\|\phi^v(z^v) - \Phi(Y^v)A\hat{x}^v\big\|_2^2 + \lambda \big\|\widehat{X}\big\|_{1,2}, \quad \widehat{X} = [\hat{x}^1, \ldots, \hat{x}^V]. \tag{18}$$

Since the magnitude of the product term $A\hat{x}^v$ can be considered as the importance of the corresponding training samples in the reconstruction of the test image, the priority of the samples used for label propagation is based on the magnitude of the corresponding entries of $A\hat{x}^v$. Although the label view coefficient $\hat{x}^{V+1}$ is unknown, it must share a similar pattern with the $\hat{x}^v$, since they are used to represent the same image. So, we iterate over each feature view $v$, choosing the samples corresponding to the top five values in $A\hat{x}^v$. Then, the tag propagation for the test image follows a weighted version of the greedy label transfer scheme [8]. The proposed label prediction and propagation scheme is summarized in Algorithm 2.

Input:
  The feature views of the test image $\{z^v\}_{v=1}^{V}$,
  Feature views of the training images $\{Y^v\}_{v=1}^{V}$, label view $Y^{V+1}$,
  Learned dictionary $A$,
  Tuning parameter $\lambda$.
Output:
  Predicted labels for the test image.
(1) Calculate $\{\hat{x}^v\}_{v=1}^{V}$ by Eq. (18)
(2) For $v = 1$ to $V$
(3)  Find the five largest values in $A\hat{x}^v$ and the corresponding indices
(4)  Find the sample label column vectors in $Y^{V+1}$ corresponding to these indices
(5) End For
(6) Compute and rank the weighted frequency of the labels appearing in all the found samples, with weights equal to the corresponding product coefficients of $A\hat{x}^v$
(7) Transfer labels according to their calculated weighted frequency.
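
A minimal sketch of steps (2)-(7) of Algorithm 2 is given below; the greedy tie-breaking details of [8] are omitted, so this approximates the transfer scheme rather than transcribing it exactly.

import numpy as np

def weighted_label_transfer(A, x_hats, Y_label, n_labels=5, n_neighbors=5):
    # For each feature view, score training samples by A @ x_hat, keep the
    # top `n_neighbors`, and accumulate label votes weighted by those scores.
    votes = np.zeros(Y_label.shape[0])
    for x_hat in x_hats:                        # one sparse code per view
        scores = A @ x_hat                      # importance of each sample
        for i in np.argsort(scores)[-n_neighbors:]:
            votes += scores[i] * Y_label[:, i]  # weighted label presence
    return np.argsort(votes)[-n_labels:][::-1]  # indices of the top-5 labels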

4. Experiments

To validate the effectiveness of the proposed KMVJSC for automatic image annotation task, we conduct experiments and compare the results with related algorithms on three popular benchmark databases.

4.1. Experimental Settings

The three datasets we use to evaluate the performance of our approach are as follows.

Corel5K is the most popular dataset for annotation evaluation [8–10, 13, 17–19, 28, 29]. It consists of 5000 images from 50 different topics, such as “beach,” “aircraft,” and “horse,” and each topic includes 100 similar images. The training and testing sets contain 4500 and 500 images, respectively. Each image is manually annotated with 1 to 5 labels from a dictionary of 260 labels, with 3.5 labels on average.

The IAPRTC12 [30] dataset consists of 19627 images of natural scenes dealing with sports and actions and photographs of people, animals, cities, landscapes, and so on. In the dataset, 17665 images are selected for training and the remaining 1962 images are chosen for testing. The number of labels in this dataset is 291; each image has up to 23 labels, with an average of 5.7 labels per image. Besides, each label is associated with 347 images on average.

MIRFlickr25K [31] contains 25000 images collected from the social photography site https://www.flickr.com and split equally into training and test sets. It provides 38 labels as the ground-truth annotation, such as “animals,” “baby,” and “baby*,” where a label marked with “*” indicates a further strict annotation. Each image is associated with up to 17 labels, with 4.7 labels on average, while each label is associated with 1560 images on average. Besides, the dataset provides 1386 tags. Since the Flickr tags are noisy, we keep the tags that appear at least 50 times, resulting in a vocabulary of 457 tags, and use them as another type of feature.

We adopt the histogram intersection kernel as the kernel function because it is parameter-free and has obtained excellent performance in evaluating the similarity between two histograms [14].
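
For completeness, the histogram intersection kernel matrix between two sets of column-stored histograms can be computed as in this straightforward sketch.

import numpy as np

def hik_gram(Y1, Y2):
    # K[i, j] = sum_d min(Y1[d, i], Y2[d, j]) for histograms stored as columns.
    return np.minimum(Y1[:, :, None], Y2[:, None, :]).sum(axis=0)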

Following [32], we adopt 15 visual features but limit the size of the large feature vectors. Specifically, in order to reduce the computational complexity of both the training and testing procedures, we use a color histogram with 10 bins per color channel. For features encoded with spatial layout information, we quantize the color with 8 bins in each channel and reduce the number of k-means cluster centers to 500 for the SIFT features. Thus, the 15 distinct features include 3 color features (RGB, LAB, HSV, 1000D), 2 SIFT features (DenseSIFT, HarrisSIFT, 1000D), 2 Hue features (DenseHue, HarrisHue, 100D), the 7 above features with layout encoding (RGBV3H1, LABV3H1, HSVV3H1, 1536D; DenseSIFTV3H1, HarrisSIFTV3H1, 1500D; DenseHueV3H1, HarrisHueV3H1, 300D), and a GIST feature (512D). In addition, for the MIRFlickr25K dataset, we endow each image with a binary vector indicating the absence or presence of each tag, which leads to a Tag feature of length 457D.

There are two parameters that need to be tuned, that is, the dictionary size $K$ and the regularization parameter $\lambda$. Both are selected by fivefold cross validation of KMVJSC on the training set, and a sensitivity analysis is conducted in Section 4.3. Finally, we set $\lambda = 0.1$ on the Corel5K dataset and $\lambda = 0.001$ on the other two datasets, with the dictionary size $K$ chosen per dataset by the same cross validation. Due to the random entries in the initialization, we repeat all the experiments 5 times separately and report the average results.

Following most research [10, 13, 17, 19, 32], all the test images are annotated with the top 5 relevant labels. To evaluate the performance, we calculate the precision ($P$), recall ($R$), and $F_1$-measure for each label. For a given label, the three metrics are defined by $P = N_c/N_a$, $R = N_c/N_g$, and $F_1 = 2PR/(P + R)$. Here, $N_g$ is the number of images annotated with the label in the ground truth, $N_a$ is the number of images annotated with the label by our automatic annotation algorithm, and $N_c$ is the number of correctly annotated images for the label. We report the mean value over all labels for each metric. Besides, the number of labels with nonzero recall ($N^+$) is also used.
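
Given binary annotation matrices, these per-label metrics can be computed as in the sketch below (a helper of ours, with guards against labels that are never predicted).

import numpy as np

def per_label_metrics(Y_true, Y_pred):
    # Y_true, Y_pred: binary matrices of shape (n_labels, n_images).
    Nc = (Y_true * Y_pred).sum(axis=1)          # correct predictions per label
    Na = np.maximum(Y_pred.sum(axis=1), 1)      # predicted count per label
    Ng = np.maximum(Y_true.sum(axis=1), 1)      # ground-truth count per label
    P, R = Nc / Na, Nc / Ng
    F1 = np.where(P + R > 0, 2 * P * R / np.maximum(P + R, 1e-12), 0.0)
    return P.mean(), R.mean(), F1.mean(), int((R > 0).sum())  # P, R, F1, N+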

4.2. The Compared Methods

We compare our KMVJSC with several sparse coding based baselines, including multilabel sparse coding (MSC) [13], kernel sparse coding (KSC) [23], multiview Hessian discriminative sparse coding (MVHDSC) [20], multiview joint sparse coding (MVJSC) in the original space, and kernel-based multiview sparse coding (KMVSC) without the joint sparsity across views. For the single-view methods (MSC and KSC), we concatenate the multiple features into a long vector. For KSC, we use the same label transfer scheme as our method, except that the learned dictionary and reconstruction coefficients are used directly; for MVJSC, we calculate the average sparse coefficient over the multiple feature views, use it as the sparse coefficient of the label view, and transfer labels in the same way as MVHDSC; and for KMVSC, we use the same label transfer method as our KMVJSC. Moreover, we take the same parameter selection strategy for these methods, and all the training, validation, and test sets are exactly the same as those used for our method.

4.3. Experimental Results

Table 1 lists some image examples from the three datasets and the labels predicted by our method, together with the ground truths. Each listed example contains at least one mismatched label compared with the ground-truth labels (perfectly matched annotations are not listed here), and the differing predicted labels are marked in italic font. The results in Table 1 show that, in many cases, some predicted labels missed by the ground-truth annotation can still explain the image well, such as “meadow” for the first image from the Corel5K dataset. Besides, some semantically similar words are also treated as errors, such as “cathedral” for the last image from the IAPRTC12 dataset, which shows the potential effectiveness of our proposed method for the automatic image annotation task.

The annotation performance of our proposed algorithm on the three datasets, along with that of the other related methods, is listed in Table 2, where the methods marked with “*” represent implementations using our features, and KMVJSC+T refers to annotation with our method using the additional Tag feature for MIRFlickr25K besides the 15 feature views and the label view. The maximum values are shown in bold. From the results in Table 2, we can make the following observations.

(1) For the single-view methods, KSC is slightly better than MSC; for the multiview methods, the kernel methods KMVSC and KMVJSC both outperform MVJSC, which demonstrates the power of kernel mapping. We can also see that MVHDSC remains comparable with KMVSC, probably because MVHDSC additionally enforces structured sparsity on the dictionary.

(2) All the multiview methods dramatically outperform the single-view methods on all datasets. Although MSC takes multilabel information into consideration, its performance is inferior to that of the multiview methods including MVJSC, which shows that multiview learning can harness the label view and feature views in a more natural way and capture both the relations among multiple labels and those between labels and visual features for discrimination.

(3) Our KMVJSC clearly outperforms KMVSC, which neglects the joint sparsity across multiple views; this shows that the joint sparsity constraint across different views is valuable for enforcing robustness in the coefficient estimation among different views.

(4) Adding the Flickr tags as features helps to improve the annotation performance further, which illustrates that tags are another type of information complementary to the visual features and the label view and can be effectively exploited in multiview learning.

4.4. Sensitivity Analysis of Parameters

We investigate the sensitivity of the parameters $K$ and $\lambda$ in our approach on the three datasets. The corresponding results are presented in Figures 2 and 3.

It should be noted that the results in these two figures are obtained by fivefold cross validation on the training sets of the three datasets, respectively. Figure 2 illustrates the $F_1$-measure versus different dictionary sizes $K$ with $\lambda$ fixed (0.1 for the Corel5K dataset and 0.001 for the other two datasets). From the results, we observe that although the $F_1$-measure varies with the value of $K$, it is not very sensitive to this choice and remains fairly stable. Considering that a large dictionary incurs an expensive computational cost, we choose moderate dictionary sizes for the Corel5K, IAPRTC12, and MIRFlickr25K datasets to balance performance and cost. Figure 3 shows the $F_1$-measure versus different $\lambda$ with the selected $K$ fixed. It is observed that the $\ell_{1,2}$-norm joint sparsity regularization term is effective when $\lambda$ is neither too small nor too large: if it is too small, the correlations between different views are lost, while a large value limits the flexibility of the individual views. The best result of our approach is reached with $\lambda = 0.1$ on the Corel5K dataset and $\lambda = 0.001$ on the other two datasets.

5. Conclusions

In this paper, we presented a kernel-based multiview joint sparse coding framework for the image annotation problem. It learns a set of optimized sparse representations as well as dictionaries in a multiview kernel space. We consider the label view (or tag view) as an additional view and adaptively exploit the relationship between the label view and the visual feature views to find more discriminative sparse representations by incorporating multiview learning and the $\ell_{1,2}$-norm joint sparsity. We extend the KSVD and APG algorithms to a multiview kernel version to solve our optimization problem. The priority of sparse coefficients in the kernel space is used to propagate labels to the test images by a weighted greedy label transfer scheme. Experimental results on three widely used datasets show the effectiveness of our proposed method compared to related annotation methods. Considering that some non-ground-truth annotations can still describe the image well in our experiments, in future research we plan to involve human perception [33] to measure the relevance between non-ground-truth annotations and images, showing the potential flexibility of our framework. Besides, we plan to apply deep learning features [10, 34] and multiple kernel learning [18] in our framework.

Notation

$N$: Number of training samples with labels
$Y^v$: Training sample data of the $v$th view
$y_i^v$: $i$th training sample of the $v$th view
$d_v$: Dimension of the $v$th view
$\phi^v(\cdot)$: Kernel mapping of the $v$th view
$D^v$: Dictionary of the $v$th view
$V$: Number of feature views
$K$: Number of dictionary atoms
$A$: Atom representation dictionary
$X^v$: Sparse coefficients of $\Phi(Y^v)$ over $\Phi(D^v)$ in the mapped space
$\widetilde{X}_i$: Stacked coefficient matrix $[x_i^1, \ldots, x_i^{V+1}]$ of the $i$th sample
$\|\cdot\|_{1,2}$: Sum of the $\ell_2$-norms of the rows of a matrix
$\lambda$: Parameter of the regularization term
$z^v$: Test image data of the $v$th view
$\hat{x}^v$: Sparse coefficient of $\phi^v(z^v)$ over $\Phi(D^v)$ in the mapped space.

Competing Interests

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by Chinese National Natural Science Foundation (61371143), Beijing Natural Science Foundation (4132026), and Scientific Research Project of Beijing Municipal Education Commission (KM201510009006).