Abstract

Zero-shot learning (ZSL) is a powerful and promising learning paradigm for classifying instances of classes that have not been seen in training. Although graph convolutional networks (GCNs) have recently shown great potential for ZSL tasks, these models use constant connection weights between the nodes of the knowledge graph, so all neighbor nodes contribute equally to the classification of the central node. In this study, we apply an attention mechanism to adjust the connection weights adaptively so that the model learns the information that matters most for classifying unseen target nodes. First, we propose an attention graph convolutional network for zero-shot learning (AGCNZ) by integrating the attention mechanism and the GCN directly. Then, to prevent the dilution of knowledge from distant nodes, we apply the dense graph propagation (DGP) model to the ZSL tasks and propose an attention dense graph propagation model for zero-shot learning (ADGPZ). Finally, we propose a modified loss function with a relaxation factor to further improve the performance of the learned classifier. Experimental results under different pre-training settings verify the effectiveness of the proposed attention-based models for ZSL.

1. Introduction

Image classification is the task of assigning a given image to its correct class. Many supervised models have achieved significant success in image classification, such as K-nearest neighbors (KNN) [1] and support vector machines (SVM) [2], and in recent years deep learning techniques have made great progress. However, most existing recognition models require a large number of training samples and can only classify instances belonging to the classes covered by the training data. Humans can recognize about 30,000 classes [3]; labeling all of them is a huge workload, and the set of classes may grow over time. In contrast, humans are very good at recognizing unseen classes via reasoning. For example, if we have seen cats and spotted dogs and are told that a leopard is a cat with spots, we can recognize a leopard without ever having seen one. Hence, it is important for agents to acquire the ability to recognize unseen classes, and zero-shot learning (ZSL) has been proposed accordingly.

Zero-shot learning [4] is an inevitable trend of target classification, whose general idea is to transfer the knowledge contained in the training instances to the task of classifying testing instances. As no labeled instances of the unseen classes are available, some auxiliary information is necessary. The auxiliary information used by existing ZSL methods is usually semantic information [5]. Semantic attributes and semantic word vectors are two typical kinds of semantic information. However, both require learning a mapping from the semantic space to the visual space, which makes it difficult for the model to learn semantic vector representations from structured information.

As a non-Euclidean data structure, a knowledge graph cannot be processed well by the traditional convolutional neural network (CNN). To solve this problem, the graph convolutional network (GCN) [6] was proposed with local graph operators. In a GCN, all neighbor nodes have the same influence on the central node, and the GCN is affected by Laplacian oversmoothing, which restricts it to a shallow network. To allow the central node to receive information from distant nodes, Kampffmeyer et al. proposed the DGP model [7]. However, there is still no good explanation of how much each neighbor node contributes to the central node. Hence, we apply the attention mechanism to the GCN to enhance the interpretability of the model, so that the model can evaluate the contribution of different neighbor nodes to the central node.

Zero-shot learning aims at recognizing classes that are unseen during training. Therefore, the classes of the testing dataset cannot be included in the training dataset. In recent studies, many models have adopted a pretrained model [8], and we consider whether the pretrained model affects the result. It is clear that more training samples help a model achieve better test results. In zero-shot learning, however, only the relationship between the training set and the testing set is usually considered, and the influence of pre-training is ignored. Many models evaluated on small-scale datasets, such as Animals with Attributes 2 (AWA2) [9], use a model pretrained on the ImageNet dataset. However, the ImageNet training set usually contains more classes than the training sets of AWA2 and other small-scale datasets. When only a small-scale dataset is available for zero-shot learning, the task should be carried out using only the training classes of that dataset. Therefore, we divide zero-shot learning into three settings according to the pre-training method, namely, the small-scale setting, the classifier setting, and the large-scale setting, and integrate the results of the three settings to make the evaluation of a model more accurate for practical tasks.

In this article, we proposed the attention-based graph convolutional network for zero-shot learning with pre-training to improve the performance of the task for unseen classes and improve the generalization ability of the model. For the unseen classes, we use the relationship of the classes to establish a connection between the seen classes and the unseen classes. We use knowledge graph as a prior knowledge of agents, which allows the agents to learn to reason. Then, we use the GCN to process the knowledge graphs and train the classifier for the unseen classes. The main contributions of this article are threefold:

(1) We integrate the attention mechanism and graph convolutional network for zero-shot learning. Specifically, we propose two attention-based models, AGCNZ and ADGPZ, to learn adaptive connection weights of the nodes to achieve more accurate predictions.

(2) We present a modified loss function with a relaxation factor, which has a positive effect on the performance.

(3) We have a complete discussion of the setting of ZSL and propose three settings to certify the effect of pre-training for zero-shot learning. Extensive experiments show that the proposed attention-based models can effectively improve the performance of zero-shot learning.

The rest of the article is organized as follows. Section 2 introduces the related work of ZSL. In Section 3, the proposed approach is presented with the overall framework followed by specific algorithms. In Section 4, the three pre-training settings are introduced and the experimental results demonstrate the success of the proposed algorithms. Conclusions are given in Section 5.

2. Preliminaries

Zero-shot learning (ZSL) was first proposed in 2009 [10, 11] and has become one of the important fields of machine learning, because it can identify specific unseen classes and meets the future demand for target recognition. In ZSL, seen classes and unseen classes are connected in a high-dimensional semantic representation space, which includes the attribute space, the word vector space, and the text description space. The attribute space was the first to be introduced in ZSL, where the essential idea is to train a classifier for each attribute of the input, use the trained classifiers to predict attributes, and pay more attention to the correlation between attributes during the training stage. For example, DAP [12] first estimates the posterior of each attribute in the image and then predicts the class label with the learned probabilistic attribute classifiers. Later, to address the limitations of the DAP model, Akata et al. [13] introduced a function that measures the compatibility between an image and a label embedding, whose parameters are learned from a set of training samples to ensure that, for a given image, the correct classes rank higher than the incorrect classes. Li et al. [5] also focus on attribute-based ZSL and present an end-to-end network that automatically discovers discriminative regions with a zoom network and learns the discriminative semantics of user-defined and latent attributes in an augmented space.

As for the word vector space, Socher et al. [14] recognize objects in images without training data for those objects by using a large unsupervised text corpus. Frome et al. [8] presented a new deep visual semantic embedding model (DeViSE) that uses labeled image data and semantic information extracted from unlabeled text to identify visual objects. Inheriting the DeViSE method, Norouzi et al. [15] proposed a simple method to construct an image embedding system from an existing $n$-way image classifier and a semantic word embedding model containing the class labels. In the text description space [16–18], text descriptions are used to classify unseen classes, and Kodirov et al. [19] proposed to address the domain shift problem in zero-shot learning by learning a semantic autoencoder (SAE). Wang et al. [20] introduced the GCN in their research, using structured information and complex relationships to generate classifiers for unseen classes. A knowledge graph is a semantic network that represents the relationships between entities, and each class is represented as an entity on the knowledge graph, as shown in Figure 1. In the zero-shot learning semantic representation space, attribute descriptions require attribute annotations and text descriptions require sentence descriptions, so a large number of manual annotations are required. Therefore, their cost is relatively high, and the advantages of the word vector space are considerably attractive.

The graph convolutional network (GCN) has become a hot spot of research in recent years. In non-Euclidean data, the number of neighbors around a central node varies from node to node. Hence, many scholars have begun to study how to deal with graph data structures. A GCN is a network model that can process graph-structured data, and its most important part is the convolutional kernel. Like a CNN, the GCN aims to define convolutions, but on graphs. Therefore, the essence of graph convolution is to find a learnable convolution kernel suitable for graphs. Bruna et al. [21] first proposed spectral convolutional neural networks. Spectral-domain graph convolutional networks implement convolution operations on topological graphs through spectral graph theory, but the method has disadvantages such as high computational complexity and nonlocal connections. In addition, Defferrard et al. [22] proposed to fit the convolution kernel using Chebyshev polynomials to reduce the computational complexity. Based on the previous works, Kipf and Welling [6] proposed a simple and effective layer-wise propagation method via first-order approximation, which became the pioneering work of the graph convolutional network (GCN). Because of its advantages in processing graph data, the GCN has gradually been applied to a wide range of research fields [23–25], and there are also other studies on graphs, such as Deepwalk [26] and Node2vec [27].

3. Problem Statement

In this section, a schematic framework of the proposed approach is shown in Figure 2 with specific methods of introducing the attention mechanism to different GCN models for zero-shot learning. In addition, a modified loss function is also proposed between the predicted classifier and the ground-truth classifier. Then, the algorithms are presented in detail as shown in Algorithms 1 and 2.

Algorithm 1: AGCNZ.
Input: Adjacency matrix A, number of nodes N, input node features X, pretrained ResNet50 model classifier parameters
Output: Predicted classifier parameters, predicted categories of unseen classes
(1) Initialize the graph convolutional network parameters.
(2) while not converged do
(3)   Compute the first-layer output by equation (4);
(4)   for each attention layer do
(5)     Update the attention (propagation matrix) by equation (7);
(6)     Update the layer output by equation (5);
(7)   end for
(8)   Loss = LossFunction(predicted classifier, ground-truth classifier), where LossFunction is given by equation (2) or (3);
(9)   Loss.backward;
(10) end while
(11) return the predicted classifier parameters
(12) The predicted categories of unseen classes are obtained by using the predicted classifier parameters for classification.

Algorithm 2: ADGPZ.
Input: Knowledge graph, number of nodes N, input node features X, pretrained ResNet50 model classifier parameters
Output: Predicted classifier parameters, predicted categories of unseen classes
(1) Initialize the graph convolutional network parameters.
(2) Convert the graph into a dense graph and obtain its adjacency matrix A.
(3) while not converged do
(4)   Compute the first-layer output by equation (4);
(5)   for each attention layer do
(6)     Update the attention by equation (10);
(7)     Update the layer output by equation (9);
(8)   end for
(9)   Loss = LossFunction(predicted classifier, ground-truth classifier), where LossFunction is given by equation (2) or (3);
(10)  Loss.backward;
(11) end while
(12) return the predicted classifier parameters
(13) The predicted categories of unseen classes are obtained by using the predicted classifier parameters for classification.

3.1. Attention-Based Graph Convolution Network for Zero-Shot Learning

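Given a graph $G$, each node of $G$ represents a category. The adjacency matrix is expressed as $A$, which is used to characterize the relation between categories. The propagation formula between GCN layers is defined as

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)}\right), \qquad (1)$$

where $I_N$ is the identity matrix, $\tilde{A} = A + I_N$, the degree matrix is expressed as $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $\sigma(\cdot)$ is the nonlinear activation function, and $W^{(l)}$ is the weight matrix of layer $l$, whose number of columns is the number of feature maps in the output of that layer. $H^{(l)} \in \mathbb{R}^{N \times D}$ is the matrix of activations in the $l$-th layer, where $N$ is the number of nodes and $D$ is the feature dimension [6].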
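
As an illustration of the layer-wise propagation of [6] described above (self-loops added to the adjacency matrix, symmetric degree normalization, then a linear map and a nonlinearity), the following is a minimal sketch in plain Python. The function names, the toy path graph, and the identity weight matrix are assumptions made for this example only; the experiments in this article use PyTorch.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # every degree >= 1 because of self-loops
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(adj, h, w):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    a_hat = normalized_adjacency(adj)
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Toy graph (hypothetical): a path 0 - 1 - 2 with two input features per node.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W0 = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the arithmetic easy to follow
H1 = gcn_layer(A, H0, W0)
```

With identity weights, each row of `H1` is simply the normalized weighted sum of the features of a node and its neighbors, which is the smoothing effect discussed in the text.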

In the above formula, each vertex is connected not only to its neighbors but also to itself. The propagation therefore performs Laplacian smoothing: the new feature of a vertex is a weighted average of the features of the vertex itself and its neighbors. Because vertices of the same cluster tend to be more tightly connected, this smoothing makes the classification task easier. Although a single graph convolution is already very effective, two-layer GCNs are much better than one-layer GCNs, because smoothing the first-layer activations again makes the features of vertices in the same category even more similar. However, as the number of GCN layers increases further, the performance decreases, because each additional layer performs additional Laplacian smoothing and the features of vertices from different categories gradually become harder to distinguish. Consequently, we use a 2-layer network in this article.
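
The oversmoothing effect can be observed on a toy example. The sketch below is an illustration under simplifying assumptions, not the model used in this article: it uses plain neighborhood averaging (each node takes the mean over itself and its neighbors) instead of the symmetric normalization, one scalar feature per node, and no learned weights. The names `smooth_once` and `spread_after` are hypothetical.

```python
def smooth_once(adj, x):
    """Replace each node value by the mean over itself and its neighbors."""
    n = len(adj)
    out = []
    for i in range(n):
        group = [x[j] for j in range(n) if adj[i][j] or j == i]
        out.append(sum(group) / len(group))
    return out

def spread_after(adj, x, steps):
    """Apply `steps` smoothing rounds and return max - min of the result."""
    for _ in range(steps):
        x = smooth_once(adj, x)
    return max(x) - min(x)

# Path graph 0 - 1 - 2 - 3 where only node 0 starts with a distinctive value.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
x0 = [1.0, 0.0, 0.0, 0.0]
spreads = [spread_after(A, x0, k) for k in (0, 1, 2, 10, 50)]
```

The gap between the largest and smallest node value never grows from one round to the next and approaches zero after many rounds, so deep stacks of smoothing layers leave little information for separating the nodes.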

3.1.1. Attention Mechanism

As an important concept in neural networks, the attention mechanism was first used in machine translation [28]. It now has many applications in various fields, such as computer vision [29–31] and natural language processing [28, 32, 33]. Whether in computer vision or natural language processing, the attention mechanism can be summarized as giving more attention to the target areas that need to be focused on. In this article, when solving ZSL tasks with knowledge graphs, we represent each category as a node on the knowledge graph and then use a GCN to process the knowledge graph. It is therefore crucial that the result of processing the knowledge graph with the GCN fully expresses the real influence of each neighbor node on the central node. For this reason, we use cosine distance to calculate the attention of the nodes [34] and to capture the degree of association between node $i$ and node $j$, as shown in Figure 3, and then use the improved GCN to process the knowledge graph for ZSL.

3.1.2. Loss Function for Predicted Classifier

A node represents a class in the knowledge graph, and we use a word embedding vector for each node. The word embedding vectors of all $N$ nodes in the knowledge graph form the input $X$ to the graph convolutional network. The classifier parameters of the seen classes serve as the ground-truth, and the loss function [7], equation (2), measures the error between the predicted classifier and this ground-truth over the seen classes. The optimized loss function, equation (3), additionally introduces a relaxation factor, a parameter that adjusts the error between the ground-truth and the predicted classifier output by the graph convolutional network model. We aim to minimize this error, while the relaxation factor enhances the generalization ability of the model: the predicted classifier, which is trained against the ground-truth so that it can classify unseen classes, does not have to be exactly the same as the ground-truth.
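
To make the role of the two losses concrete, here is a minimal sketch in plain Python. `original_loss` follows the mean squared error over the seen-class classifier weights used in [7]. `relaxed_loss` is only one hypothetical way to realize a relaxation factor, namely ignoring deviations smaller than a tolerance `eps`; it is an assumption made for illustration and is not claimed to be the exact form of equation (3). All function names and the toy numbers are invented for this example.

```python
def original_loss(pred, truth):
    """Squared error between predicted and ground-truth classifier weights,
    summed over all entries and divided by 2M (M = number of seen classes)."""
    m = len(truth)
    total = sum((p - t) ** 2
                for pred_row, truth_row in zip(pred, truth)
                for p, t in zip(pred_row, truth_row))
    return total / (2 * m)

def relaxed_loss(pred, truth, eps):
    """Hypothetical relaxed variant: deviations up to eps are not penalized."""
    m = len(truth)
    total = 0.0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            excess = max(0.0, abs(p - t) - eps)  # only the part beyond eps counts
            total += excess ** 2
    return total / (2 * m)

# Two seen classes with two-dimensional classifier weights (toy values).
predicted = [[1.0, 2.0], [0.0, 1.0]]
ground_truth = [[1.1, 2.0], [0.5, 1.0]]
loss_ol = original_loss(predicted, ground_truth)
loss_ml = relaxed_loss(predicted, ground_truth, eps=0.2)
```

In this sketch the relaxed loss is never larger than the original loss and coincides with it when `eps` is zero, which matches the idea that the predicted classifier need not reproduce the ground-truth exactly.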

3.1.3. Pre-Training Zero-Shot Learning Setting

We propose three pre-training settings for zero-shot learning to better evaluate the model. The architecture of the three settings is given in Figure 4. We use the ResNet50 [35] model, which has been pretrained on the large-scale dataset. In the large-scale setting, the classifier parameters obtained on the large-scale dataset continue to be used. In the classifier setting, the classifier parameters are instead trained on the training set of the small-scale dataset that is used for testing. In the small-scale setting, the ResNet50 model itself is trained on the training set of the small-scale dataset to obtain the pretrained model.

3.2. Attention Graph Convolutional Network for Zero-Shot Learning (AGCNZ)

In zero-shot learning tasks, we consider the relationship between the training set (seen classes) and the testing set (unseen classes). The ground-truth classifier parameters are obtained by training on the training set. The knowledge graph is established by using the classes of ImageNet and AWA2 and reflects the relationships between the classes.

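In the GCN, we introduce the attention mechanism and use cosine distance to calculate the similarity between nodes. The propagation formula [34] of the first layer is given as follows:

$$H^{(1)} = \mathrm{ReLU}\left(XW^{(0)}\right), \qquad (4)$$

where $W^{(0)}$ is the weight matrix of the first layer. A parameter $\beta^{(l)}$ introduced in layer $l$ is guided by the attention mechanism, and the rule [34] of AGCNZ propagation for the attention layer is

$$H^{(l+1)} = P^{(l)}H^{(l)}, \qquad (5)$$

where $P^{(l)} \in \mathbb{R}^{N \times N}$ is the propagation matrix and $l$ denotes the layer index. The output row vector [34] of node $i$ is recorded as

$$H_i^{(l+1)} = \sum_{j \in \mathcal{N}(i) \cup \{i\}} P_{ij}^{(l)} H_j^{(l)}, \qquad (6)$$

where $\mathcal{N}(i)$ is the neighborhood of node $i$. To ensure that each row of the propagation matrix sums to 1, the softmax function is used, so that the total influence of the adjacent nodes on the central node is 1. In summary, the attention [34] between node $i$ and node $j$ is

$$P_{ij}^{(l)} = \frac{\exp\left(\beta^{(l)} \cos\left(H_i^{(l)}, H_j^{(l)}\right)\right)}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp\left(\beta^{(l)} \cos\left(H_i^{(l)}, H_k^{(l)}\right)\right)}, \qquad (7)$$

where $\cos(x, y) = x^{T}y / (\|x\|\,\|y\|)$. It calculates the similarity between node $i$ and node $j$ and pays more attention to the neighbors that are more similar to the central node. The AGCNZ algorithm is shown in Algorithm 1. Meanwhile, the architecture of the attention is shown in Figure 5.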
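
The attention described above (cosine similarity between a node and each of its neighbors, followed by a row-wise softmax so that every row of the propagation matrix sums to 1) can be sketched in plain Python as follows. This is a minimal illustration in the spirit of [34], not the released implementation; the scaling parameter `beta`, the function names, and the toy star graph are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def attention_matrix(adj, h, beta=1.0):
    """Row i holds softmax(beta * cos(h_i, h_j)) over node i and its neighbors."""
    n = len(adj)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j] or j == i]
        scores = [beta * cosine(h[i], h[j]) for j in nbrs]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for j, e in zip(nbrs, exps):
            p[i][j] = e / total
    return p

def propagate(p, h):
    """Attention-layer update: each output row is a P-weighted sum of input rows."""
    return [[sum(p[i][j] * h[j][k] for j in range(len(h)))
             for k in range(len(h[0]))] for i in range(len(h))]

# Star graph centered at node 0; node 1 resembles node 0, node 2 does not.
A = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
H = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
P = attention_matrix(A, H, beta=2.0)
H_next = propagate(P, H)
```

In this toy graph the central node assigns a larger weight to the neighbor whose features point in a similar direction, while non-adjacent pairs keep a weight of zero.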

3.3. Attention Dense Graph Propagation for Zero-Shot Learning (ADGPZ)

A GCN is limited to shallow architectures; in our experiments, a two-layer GCN performs best, so the central node cannot receive information from remote nodes. To solve this problem, Kampffmeyer et al. [7] proposed the dense graph propagation (DGP) model and achieved good performance. However, we would like to better balance the weights of different neighbors. Because not all edges represent the same degree of association, it is desirable to focus on the nodes that are more related to the central node. The attention mechanism tends to choose the neighbor nodes of the same class as the central node and gives them stronger association strength.

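Instead of directly processing the knowledge graph with a GCN, the DGP model transforms the knowledge graph into a dense graph in which ancestors and descendants are directly connected to the central node, and then the dense graph is processed by the GCN. For a given graph, the DGP layer-to-layer propagation rule [7] is

$$H = \sigma\left(D_a^{-1}A_a\,\sigma\left(D_d^{-1}A_dX\Theta_d\right)\Theta_a\right), \qquad (8)$$

where $A_a$ and $A_d$ denote the adjacency matrices of the direct connections to ancestors and descendants, respectively, $D_a$ and $D_d$ are the corresponding degree matrices, and $\Theta_a$ and $\Theta_d$ are the corresponding weight matrices. The ADGPZ propagation rule of the attention layer is given by equation (9), where $l$ represents the layer index, and the attention of descendants is given by equation (10). In the same way, we can get the corresponding terms for the ancestors.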
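
The conversion of a class hierarchy into a dense graph can be sketched as follows. This is a minimal illustration in plain Python, assuming the hierarchy is given as a parent-to-children mapping and that self-connections are included; the distance-based weighting and the hop limit used by DGP are omitted. The function names and the toy hierarchy are hypothetical.

```python
def descendants_of(children, node):
    """Collect all descendants of `node` in a hierarchy given as parent -> children."""
    seen = set()
    stack = list(children.get(node, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(children.get(cur, []))
    return seen

def dense_adjacency(nodes, children):
    """Return (A_a, A_d): each node linked directly to all of its ancestors
    (A_a) and to all of its descendants (A_d), with self-loops in both."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    a_a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    a_d = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for name in nodes:
        for desc in descendants_of(children, name):
            a_d[index[name]][index[desc]] = 1.0  # name reaches its descendant
            a_a[index[desc]][index[name]] = 1.0  # descendant reaches its ancestor
    return a_a, a_d

# Toy hierarchy: animal -> feline -> {cat, leopard}.
nodes = ["animal", "feline", "cat", "leopard"]
children = {"animal": ["feline"], "feline": ["cat", "leopard"]}
A_a, A_d = dense_adjacency(nodes, children)
```

After the conversion, "cat" is linked directly to "animal" although the two are two hops apart in the original hierarchy, so a single propagation step already passes information between them.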

The introduction of attention into the model provides some explanation information. At the same time, the acquired propagation matrix can also reflect the attention of center node to neighbor node in the process of feature aggregation, which represents the influence of node on node in the classification process. The ADGPZ algorithm is shown in Algorithm 2, and the architecture of the attention is shown in Figure 6.

4. Experiment

4.1. Datasets

We carried out several groups of experiments on both large-scale and small-scale datasets. The ImageNet dataset [36] contains about 14 million images, which are divided into more than 20 000 classes (synsets), of which 1000 classes are used for training. We used the 2-hops split for the test, with 1549 classes. Animals with Attributes 2 (AWA2) [9] contains 50 animal species, of which 40 are used for training and 10 for testing. The training set contains 29 409 pictures, and the test set contains 7913 pictures. Attribute Pascal and Yahoo (aPY) [9] contains 32 classes; 20 classes from Pascal are used for training, and 12 classes from Yahoo are used for testing. The experimental settings guarantee that the ImageNet training set does not contain the unseen classes of the testing set and that the classes of each dataset are in the knowledge graph. Three classes from ImageNet and AWA2 are added as a supplement for the unavailable data in the split aPY testing set.

4.2. Experimental Settings

The knowledge graph of the relationships between classes is established by using the class names of the ImageNet and AWA2 datasets together with WordNet. The GloVe [37] text model trained on the Wikipedia dataset is used to represent each class as a word embedding vector. In the experiment, we only use a half of the graph without attributes and reconcile the words by WordNet. Our models are trained and tested in PyTorch using the Adam optimizer [38] with a preset learning rate and weight decay. The nonlinear activation function is ReLU, and dropout is set to 0.5. We use a two-layer model in the GCN model. In the ADGPZ and DGP models, we only consider the different effects of neighbors within 5 hops on the central node. To better compare and discuss the effects of the attention mechanism and pre-training for ZSL, the fine-tuning method of [7] is not used in any experiment in this article.

4.3. Results on Small-Scale Datasets

The results of all the comparisons are shown in Table 1 and show that our models outperform the baseline and other methods; the annotations in the table mark results taken from [9] and from [7]. OL and ML stand for original loss and modified loss. These methods use pretrained models that have been trained on the ImageNet dataset. It can be seen from the table that the classification performance is significantly improved in the ImageNet setting with the attention mechanism. Our model outperforms the best existing model, DGP, by 4.8% on the AWA2 dataset and also shows better performance on the aPY dataset.

To demonstrate the effectiveness of our methods, we compare the results in different settings. In Tables 2 and 3, "all" denotes the small-scale setting and "classifier" denotes the classifier setting on the small-scale dataset. We compared the accuracy of the four methods in the classification of unseen classes under the small-scale setting and the classifier setting. For both AGCNZ and ADGPZ, the classification accuracy under the two settings (50.7%, 37.0%, 55.6%, and 39.6%) is better than that of the baseline (43.9% and 36.5%) on the AWA2 dataset. Similar results are found on the aPY testing set, where the classification accuracy of AGCNZ and ADGPZ (66.8%, 50.8%, 65.6%, and 48.3%) is better than that of the baseline method. Among them, the best model is 6.8% better than the baseline method.

4.3.1. Effect of Pre-Training on the Model

We further design comparative experiments to demonstrate the effect of pre-training for ZSL. Comparing Tables 1–3, we find that the accuracy of the model can reach 82.1% on the AWA2 dataset and around 91% on the aPY dataset. Among the three settings, the classification accuracy of the large-scale setting is the best. The results show that classifier parameters trained with small-scale datasets are not as effective as those obtained by pre-training with large-scale datasets. The model parameters pretrained with the ImageNet training set are effectively trained with 1000 classes. Although these 1000 classes do not include any class of the test set, they clearly affect the classification of unseen classes. To compare the influence of pre-training on the model more intuitively, we show it in Figure 7. Under the three pre-training settings, on both the AWA2 and aPY datasets, the classification accuracy of the classifier setting is higher than that of the small-scale setting, and the classification accuracy of the large-scale setting is higher than that of the classifier setting. In short, different pre-training settings produce different results for ZSL, which further indicates that pre-training has an impact on the model. In the future, it is necessary to consider different pre-training settings when evaluating model competence.

4.3.2. Effect of Modified Loss Function on the Model

In Table 1, when the DGP model uses the modified loss function, its classification accuracy improves by nearly 4%. In all the tables, the model with the optimized loss function is better than the model with the original loss function. Without the modified loss function, the classification accuracy of ADGPZ is 3% higher than that of the baseline method. Introducing the attention mechanism into the baseline method gives significantly better results than the baseline method itself, as exhibited in Tables 2 and 3. This shows that, when calculating the error between the predicted classifier and the ground-truth, adding a parameter to the loss function to adjust this error yields a classifier that performs better on the unseen classes. The accuracy is also shown in Figure 8. On both datasets, the classification accuracy of each method with the modified loss function is higher than that with the original loss function. The experimental results show that introducing the relaxation factor into the loss function helps the model classify better in ZSL.

4.4. Discussion on Large-Scale Datasets

We further test the proposed models on large-scale datasets. As shown in Table 4, the experimental results of AGCNZ and ADGPZ were not as good as those of GCN and DGP, and the experimental results of ADGPZ were the worst. The reason is that in a large-scale dataset, the number of classes in the training set is larger than that in a small-scale dataset. ADGPZ is the most complex model, and the number of added parameters is larger than on the small-scale datasets; thus, the model overfits. In Table 4, using the modified loss function still improves the accuracy on unseen classes; the annotated entries indicate results taken from [20, 39, 40].

4.5. Further Analysis

We further analyzed which parameters the model with the modified loss function is more sensitive to. We ran experiments that vary the learning rate and the weight decay, keeping all other implementation details unchanged. The experimental results show that the model is more sensitive to changes in the learning rate. Meanwhile, the ADGPZ model is more sensitive to parameter variation than AGCNZ, which is due to the more complex propagation of the ADGPZ attention layer.

In the experiments, we found that the effect of unseen classes classifier obtained by using the large-scale dataset pretrained parameters is much better than that obtained by using the small-scale dataset training parameters. Hence, we hold the opinion that using a large number of training samples for the pre-training is more likely to improve the classification of unseen classes.

In real life, large-scale and small-scale datasets have different application scenarios. Small-scale datasets are sufficient for identifying and classifying objects in a specific domain, while large-scale datasets can be applied to a wide range of scenarios. In the experiments, we found that the ResNet50 model pretrained with the large-scale dataset has 30% higher classification accuracy for unseen classes than the model trained with the small-scale training set. It is clear that the more classes the agent has seen in training, the better it can recognize unseen classes. Storing more categories for the training of the agents may help identify unseen classes in later ZSL tasks in an incremental learning paradigm.

5. Conclusion

In this article, we combine the attention mechanism with the GCN, propose two models, AGCNZ and ADGPZ, with a modified loss function, and propose three pre-training settings for zero-shot learning. The experimental results demonstrate the success of the attention mechanism and of the proposed models with the modified loss function under the three pre-training settings, and show that pre-training is an influencing factor when evaluating a model in ZSL. Extended experiments also reveal more characteristics of the proposed approach, with detailed discussion. The emergence of the ZSL task avoids the cost of labeling and training when new categories are added and gives the model the reasoning ability to recognize unknown categories, which promotes the development of image recognition. Our future work will consider more ways to improve the loss function, not just by introducing a relaxation factor. We will also focus on more applications of the attention-based GCN in specific fields and on algorithm improvement with online adaptation.

Data Availability

The underlying data supporting the results of this study can be found at https://github.com/xf-wu/ZSL.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was partially supported by the National Key Research and Development Program of China (No. 2018AAA0101100) and the National Natural Science Foundation of China (No. 62073160).