Abstract

Zero-shot learning addresses the classification of unseen categories, while generalized zero-shot learning aims to classify samples drawn from both seen and unseen classes, where seen classes are those available during training and unseen classes are those that are not. With the advance of deep learning, the performance of zero-shot learning has improved greatly. Generalized zero-shot learning is a challenging topic with promising prospects in many realistic scenarios. Although zero-shot learning has made gratifying progress, existing methods still exhibit a strong bias between seen classes and unseen classes. Recent methods focus on learning a unified semantic-aligned visual representation to transfer knowledge between the two domains, while ignoring the intrinsic characteristics of visual features, which are discriminative enough to be classified by themselves. To solve these problems, we propose a novel model that uses the discriminative information of visual features to optimize the generative module, where the generative module is a dual generation network composed of a conditional VAE and an improved WGAN. Specifically, the model uses the discriminative information of visual features, synthesizes the visual features of unseen categories from the relevant semantic embeddings with the learned generator, and then trains the final softmax classifier on the generated visual features, thus realizing the recognition of unseen categories. In addition, this paper analyzes the effect of additional classifiers with different structures on the transmission of discriminative information. We have conducted extensive experiments on six commonly used benchmark datasets (AWA1, AWA2, APY, FLO, SUN, and CUB). The experimental results show that our model outperforms several state-of-the-art methods for both traditional and generalized zero-shot learning.

1. Introduction

In recent years, deep learning [14] has achieved great success in a wide range of computer vision and machine learning tasks [5], including face recognition, emotion classification, and visual question answering. In many cases, deep learning models are more effective than human beings, because they can capture information in pictures that human eyes may overlook. However, human beings, the inventors of neural networks, are better at identifying objects they have never seen before by using prior semantic knowledge about these novel objects. In this respect, deep learning does not perform as well as humans. This is precisely because deep learning models for image recognition rely heavily on fully supervised training and therefore need a very large amount of labeled data. However, data for some object classes are difficult to obtain, such as images of endangered species and newly produced commodities. Moreover, even when labeled data for such classes are available, the data are often imbalanced. It is very difficult to obtain images of these objects, let alone a large number of labeled samples. Therefore, training models with a large amount of labeled data is unrealistic in these cases. Against this background, the concept of zero-shot learning was put forward; it has attracted wide attention in the field of computer vision and has developed rapidly.

As there are too many classes in the real world, it is impossible to collect enough labeled data for each class. In this case, zero-shot learning is desirable but challenging. In the literature [6–10], zero-shot learning is usually realized by using the labeled samples of seen categories together with category-related semantic embeddings, which are regarded as auxiliary information. The semantic embedding, which encodes the interclass relationships, is usually an attribute vector, a word vector, or a sentence embedding. Therefore, seen classes and unseen classes share the same semantic embedding space. In the traditional zero-shot learning setting [11, 12], the goal is to train an image classifier on the seen classes and then test the trained classifier on unseen classes, where the seen classes and unseen classes are disjoint. However, this setting is not realistic and is not always applicable in the real world, because test images can also come from the seen classes. Therefore, there is a trend toward classifiers that can identify both unseen and seen classes, which is called generalized zero-shot learning [13, 14]. In the rest of this paper, we refer to traditional zero-shot learning as ZSL and to generalized zero-shot learning as GZSL. The main difference between ZSL and GZSL is whether the label space contains seen classes at test time. In this work, we conduct comparative experiments to study both ZSL and GZSL by synthesizing visual features of unseen classes using the valuable discriminative information of visual features.

In this paper, we point out the problems of recently reported ZSL and GZSL works, and we analyze the effectiveness of the proposed dual generative network as well as of the discriminative information of visual feature representations. As illustrated in Figure 1(a), most early methods [7, 11, 15–18] mapped image visual features to the semantic space to solve ZSL tasks based on class attribute embeddings or other side knowledge. However, using the semantic space as the mapping space suffers from the hubness problem pointed out in [19–21]. This is because projecting high-dimensional visual features to a low-dimensional semantic space greatly reduces the diversity of the features, so that points from different classes may cluster together as a hub, as shown in Figure 2. To alleviate the hubness problem, some works [19–21] proposed to map semantic features into the visual space, as illustrated in Figure 1(b). However, this leads to another problem called domain shift. For example, the tail of a pig and the tail of a horse are similar in the semantic space, but they are quite different in the visual space, as shown in Figure 3. Then, the concept of a shared latent space was put forward: visual features and semantic attributes are mapped into a latent space at the same time, as shown in Figure 1(c), and nearest neighbor search is performed to calculate the average per-class top-1 accuracy. This shared latent space was considered to alleviate the hubness and domain shift problems, but the generalization ability of this method is poor: when such mapping methods are used for GZSL, the performance degrades significantly. Our dual generation model combines the advantages of the improved WGAN and the conditional VAE, which can alleviate the hubness and domain shift problems, thus effectively achieving the goals of ZSL and GZSL.

In contrast, most recent ZSL and GZSL approaches [8, 22–25] are based on generative adversarial networks (GANs) [26], which aim at directly optimizing the divergence between the real and generated data distributions. The work of Xian et al. [8] learns a GAN by using the seen class visual features and the corresponding semantic embeddings, which are manually annotated attributes or word vector [27] representations. Fake visual features of the unseen categories are synthesized using the trained generator and then used together with the real visual features of seen classes to train ZSL classifiers in a fully supervised setting. However, GANs often suffer from mode collapse and unstable training. Inspired by the idea of generative adversarial networks, our proposed dual generative framework combines the advantages of the conditional variational autoencoder (VAE) and the improved WGAN, and exploits discriminative information through an additional classifier trained on the seen classes to increase the diversity and distinguishability of the samples produced by the generator. Here, the improved WGAN can overcome the mode collapse problem, and the VAE can alleviate the instability of GAN training, so that our model can stably and quickly generate class-specific visual features from semantic embeddings.

As described above, we combine the advantages of the improved WGAN and the conditional VAE with the intrinsic characteristics of the visual feature representation itself, exploited through an additional classifier, and propose a new model called dual generative network with discriminative information (DGDI). Compared with previous generative methods for ZSL, whose models suffer from mode collapse [28, 29], our model is more stable because the conditional VAE assists the GAN in generating visual features. In this work, our main task is to obtain a robust generator to synthesize visual features of the unlabeled classes. In particular, if the generator learns to produce discriminative visual features with sufficient variation, the generated data should be useful for supervised learning. Moreover, we consider that our dual generative framework, composed of the improved WGAN and the conditional VAE, can learn the complementary information of the semantic space, so we believe that our model can produce higher-quality visual features from semantic embeddings.

Our main contributions are summarized as follows: (1) We propose a novel generative model named DGDI that combines the advantages of the improved WGAN and the conditional VAE and can learn complementary information from semantic embeddings. (2) In contrast to previous zero-shot learning works, we add an additional classifier loss to train the generator by using the intrinsic characteristics of the visual feature representation, which makes the synthesized visual features more diverse and distinguishable. (3) We conduct extensive experiments on six widely used benchmark datasets that demonstrate the effectiveness of our proposed model, which maintains high accuracy for both ZSL and GZSL. In addition, in order to make better use of the discriminative information expressed by visual features, we also analyze the effects of classifiers with different structures. (4) We also visualize synthetic visual features of unseen classes with t-SNE [30], which intuitively illustrates the generation ability of our model.

2. Related Work

In this section, we discuss relevant works on (generalized) zero-shot learning as well as generative models.

We are interested in both ZSL and GZSL tasks, where the former aims at predicting the labels of unseen classes, while the latter tries to predict the labels of both seen and unseen classes. The visual feature representation itself is highly distinguishable, but this has often been ignored in previous research and thus left unexploited. In this paper, a discriminative classifier is added to capture the intrinsic distinguishable information of visual features, and it is applied to the dual generation module to synthesize more distinctive feature representations according to the corresponding semantic attributes of categories.

Early works [31, 32] associated seen and unseen classes by directly learning attribute classifiers. Most recent works, however, either learn a compatibility function between the image feature and class embedding spaces [7, 11, 16, 17, 21] or represent unseen classes as mixtures of seen classes [33–35]. For example, SYNC [33, 36, 37] tries to predict the labels of unseen classes by learning linear classifiers. Wang et al. [38] proposed to combine the knowledge graph with a graph convolutional network [39] and semantic embeddings. Rohrbach et al. [40] and Ye and Guo [9] project image features to the semantic embedding space, followed by label propagation. Verma and Rai [41] treat the unknown labels of unseen class images as latent variables and apply expectation-maximization (EM). All the abovementioned models are nongenerative and suffer from the hubness and domain shift problems. In contrast, our proposed method uses a dual generative model to transform ZSL or GZSL into traditional supervised learning by generating fake visual features of unseen classes, which is considered to alleviate the problems of embedding methods.

In recent years, generative models have been widely used. Generative adversarial networks [26] were originally proposed as an image synthesis method based on a particular image data distribution [42] and have achieved state-of-the-art results. A generative adversarial network [26, 42, 43] is composed of a generator that synthesizes a fake data distribution and a discriminator that distinguishes fake data from real data. However, GANs suffer from unstable training and mode collapse [44, 45]. To alleviate these problems and improve the quality of synthesized features, many researchers have put forward their own methods. Arjovsky et al. [44] proposed WGAN, which optimizes the GAN on an approximate Wasserstein distance by enforcing 1-Lipschitz smoothness. Although WGAN has better theoretical properties than the original GAN, it still suffers from vanishing and exploding gradients because it uses weight clipping to enforce the 1-Lipschitz constraint on the discriminator. Gulrajani et al. [45] then proposed an improved version of WGAN, called WGAN-GP, which enforces the Lipschitz constraint [3] through a gradient penalty. Our method builds on the idea of the improved WGAN. Different from existing works that directly generate images, our proposed model generates visual features instead, which can be directly used to train a discriminative classifier for zero-shot learning.

Further, Zhu et al. [46] proposed an interesting application of GANs named CycleGAN, which translates an image from one domain to another domain and then back to the original domain to form a closed loop. Schonfeld et al. [47] proposed an approach in which cross-alignment and distribution-alignment losses are introduced for aligning the visual features and the corresponding embeddings in a shared latent space by using two variational autoencoders [48]. The work of [25], which is similar to our model, introduces an f-VAEGAN framework that combines a VAE and a GAN by sharing the decoder of the VAE and the generator of the GAN for feature synthesis. Xian et al. [8] used a conditional Wasserstein GAN [44] along with a seen category classifier to learn the generator for unseen class feature synthesis. Our proposed model combines the VAEGAN idea of [25] and the seen class classifier of [8] to encourage the generator to synthesize more discriminative features, which improves the performance of ZSL and GZSL to a certain extent.

The abovementioned generative methods for ZSL and GZSL largely ignore the inherent distinguishability of visual feature representations between categories, which is actually very important for classification. Therefore, we apply the key discriminative information of visual feature representations to the proposed dual generation framework, which makes the visual feature representations synthesized by the learned generator easier to distinguish from each other. In this paper, we also analyze the role of additional classifiers with different structures in the transmission of discriminative information.

3. Proposed Model

In this section, we first formally define the zero-shot learning and generalized zero-shot learning problems, then give an overview of our proposed dual generative model, which uses the discriminative information of visual feature representations through an additional classifier, and finally introduce each component of our model in detail.

3.1. ZSL and GZSL Problem Formulation

In this paper, we study both conventional and generalized zero-shot learning. Specifically, let the source dataset be defined as $S = \{(x, y, c(y)) \mid x \in X^s, y \in Y^s, c(y) \in \mathcal{C}\}$, where $S$ stands for the training data of seen classes, $x \in \mathbb{R}^{d_x}$ is the image's visual feature produced by a pretrained neural network, which is usually ResNet101 trained on ImageNet1K, $X^s$ is the set of visual features from seen classes, $y$ is the label of the image visual feature $x$, $Y^s$ is the set of labels for seen classes, and $c(y) \in \mathcal{C}$ is the semantic embedding for the class $y$. Similarly, we define the test set, i.e., the target dataset, as $U = \{(x, y, c(y)) \mid x \in X^u, y \in Y^u, c(y) \in \mathcal{C}\}$, where $X^u$ represents the set of image features from unseen classes, $Y^u$ represents the set of labels for unseen classes, and $Y^s \cap Y^u = \emptyset$. The tasks in ZSL and GZSL are to learn the classifiers $f_{zsl}: X \to Y^u$ and $f_{gzsl}: X \to Y^s \cup Y^u$, respectively.
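
The only difference between the two tasks is the label search space at test time. The following minimal Python sketch illustrates this; the function name `search_space` and the toy class names are hypothetical and not part of the original formulation.

```python
def search_space(seen_labels, unseen_labels, generalized):
    """Return the set of labels a test sample may be assigned to.

    ZSL searches only the unseen labels; GZSL searches the union of
    seen and unseen labels. The two label sets must be disjoint.
    """
    seen, unseen = set(seen_labels), set(unseen_labels)
    if seen & unseen:
        raise ValueError("seen and unseen label sets must be disjoint")
    return seen | unseen if generalized else unseen


# Toy example with hypothetical class names.
zsl_space = search_space(["horse", "pig"], ["zebra"], generalized=False)
gzsl_space = search_space(["horse", "pig"], ["zebra"], generalized=True)
```

Raising an error on overlapping label sets makes the disjointness requirement of the formulation explicit instead of silently merging classes.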

3.2. Model Overview

The overall architecture of our proposed framework is illustrated in Figure 4. There are four main components in our model, i.e., an encoder, a generator/decoder, a discriminator, and a pretrained classifier, in which the encoder, the generator/decoder, and the discriminator form a dual generative framework, i.e., a VAE-GAN. Our method is based on the recently introduced f-VAEGAN [25], which combines the advantages of the VAE [48] and the GAN [26] and has achieved impressive results for ZSL classification. Building on [25], we add an extra classifier, pretrained on the seen classes, that classifies the generated visual features of the seen classes and thereby exploits discriminative information. We believe that the additional classifier loss helps the generator learn to synthesize more discriminative visual features. The core component of our model is the dual generative framework, whose role is to generate diverse visual features conditioned on a given class semantic embedding. In this paper, we make full use of the inherent discriminative information of visual feature representations and apply it to the dual generation module to encourage the generator to synthesize visual feature representations that are easier to classify based on the corresponding category semantic attributes. In the following, we introduce in detail the main components of the proposed model, namely the dual generative network and the additional classifier, together with their loss functions.

3.3. Dual Generative Framework

In this work, we propose a dual generative framework to synthesize visual feature representations of unseen classes stably and efficiently. The dual generative network combines the strengths of improved WGAN and conditional VAE, which can deal with the mode collapse and unstable training problems well.

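As we can see from Figure 4, the conditional VAE network is composed of a latent noise encoder $E$ and a visual feature representation decoder, which is shared with the generator $G$. The conditional VAE is a generative method that maps a random noise vector $z$ drawn from the prior $p(z \mid c)$ to a data point in the data distribution, conditioned on the semantic embedding $c(y)$, abbreviated as $c$. We train the conditional VAE by minimizing the following loss function:

$$L_{VAE} = \mathrm{KL}\big(q(z \mid x, c) \,\|\, p(z \mid c)\big) - \mathbb{E}_{q(z \mid x, c)}\big[\log p(x \mid z, c)\big], \tag{1}$$

where $\mathrm{KL}$ represents the Kullback–Leibler divergence between $q(z \mid x, c)$ and $p(z \mid c)$, the conditional distribution $q(z \mid x, c)$ is modeled as $E(x, c)$, $p(x \mid z, c)$ is equal to $G(z, c)$, and $p(z \mid c)$ is treated as a unit Gaussian distribution.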
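
Because the prior is a unit Gaussian, the KL term has a closed form whenever the encoder outputs a diagonal Gaussian. The sketch below assumes such an encoder, parameterized by a mean vector and a log-variance vector; this parameterization and the function names are assumptions for illustration, since the text does not specify the form of the encoder output.

```python
import math


def kl_to_unit_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form.

    Assumes a diagonal Gaussian posterior; each dimension contributes
    0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1).
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, eps):
    """Sample z = mu + sigma * eps, with eps drawn from N(0, I)."""
    return [m + math.exp(0.5 * lv) * e
            for m, lv, e in zip(mu, logvar, eps)]


# A posterior equal to the prior has zero divergence.
kl_zero = kl_to_unit_gaussian([0.0, 0.0], [0.0, 0.0])
# Shifting the mean by 1 in one dimension costs 0.5 nats.
kl_shift = kl_to_unit_gaussian([1.0], [0.0])
z = reparameterize([1.0], [0.0], [2.0])
```

The reparameterized sample is what would be concatenated with the semantic embedding and passed to the decoder to evaluate the reconstruction term.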

As shown in Figure 4, the improved WGAN is composed of a generator $G$ and a discriminator $D$. We aim to learn a generator conditioned on semantic embeddings. The generator takes the class embedding $c(y)$ and random Gaussian noise $z$ as inputs and outputs a fake visual feature $\tilde{x}$ of class $y$. The loss function of our improved WGAN is

$$L_{WGAN} = \mathbb{E}\big[D(x, c(y))\big] - \mathbb{E}\big[D(\tilde{x}, c(y))\big] - \lambda\, \mathbb{E}\Big[\big(\|\nabla_{\hat{x}} D(\hat{x}, c(y))\|_2 - 1\big)^2\Big], \tag{2}$$

where $\tilde{x} = G(z, c(y))$, $\hat{x} = \alpha x + (1 - \alpha)\tilde{x}$ with $\alpha \sim U(0, 1)$, and $\lambda$ is the penalty coefficient, initialized to 10. Different from the original GAN, the discriminator of WGAN is defined as $D: X \times \mathcal{C} \to \mathbb{R}$, which eliminates the sigmoid layer and outputs a real value. The first two terms of Equation (2) approximate the Wasserstein distance, and the third term is the gradient penalty, which enforces the gradient of $D$ to have unit norm along the straight line between real and generated visual feature pairs. We also calculate the value of the gradient penalty term in each training epoch to adjust the hyperparameter $\lambda$.
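
The gradient penalty can be illustrated for a single real/generated pair. In the sketch below, the gradient of the discriminator at the interpolated point is passed in as a plain list; in actual training it would come from automatic differentiation. All function names are hypothetical, and the per-sample objective follows the usual WGAN-GP form rather than a confirmed implementation of the paper.

```python
import math


def interpolate(real, fake, alpha):
    """Point on the straight line between a real and a generated feature."""
    return [alpha * r + (1.0 - alpha) * f for r, f in zip(real, fake)]


def gradient_penalty(grad, lam=10.0):
    """lam * (||grad||_2 - 1)^2 for one interpolated sample."""
    norm = math.sqrt(sum(g * g for g in grad))
    return lam * (norm - 1.0) ** 2


def wgan_objective(d_real, d_fake, grad, lam=10.0):
    """Per-sample value of D(x) - D(x_fake) - gradient penalty.

    The discriminator maximizes this value; the generator minimizes it.
    """
    return d_real - d_fake - gradient_penalty(grad, lam)


x_hat = interpolate([1.0, 1.0], [0.0, 0.0], alpha=0.25)
penalty = gradient_penalty([3.0, 4.0])          # gradient norm is 5
objective = wgan_objective(2.0, 0.5, [3.0, 4.0])
```

A gradient with unit norm incurs no penalty, which is how the soft Lipschitz constraint replaces weight clipping.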

Once the dual generative model learns to generate visual features of seen classes conditioned on the seen class semantic embeddings $c(y^s)$, it can also generate visual features $\tilde{x}$ of any unseen category through its class semantic embedding $c(y^u)$. Thus, the zero-shot learning and generalized zero-shot learning problems can be transformed into traditional supervised learning.

3.4. Additional Classifier for Discriminative Information

To ensure that the visual features generated by the improved WGAN are well suited for training a discriminative classifier, we add a classifier $C$ that makes use of the discriminative information of visual feature representations, as shown in Figure 4. The classifier is pretrained on the real features of seen classes and encourages the generator to generate distinctive features. For this purpose, module $C$ minimizes the negative log likelihood, i.e., the classification loss over the generated features, in the following formulation:

$$L_{CLS} = -\mathbb{E}_{\tilde{x} \sim p_{\tilde{x}}}\big[\log P(y \mid \tilde{x}; \theta)\big], \tag{3}$$

where $\tilde{x} = G(z, c(y))$, $y$ is the class label of $\tilde{x}$, and $P(y \mid \tilde{x}; \theta)$ denotes the probability of $\tilde{x}$ being predicted with its true class label $y$. The conditional probability is computed by a linear softmax classifier parameterized by $\theta$. The classification loss can be regarded as a regularizer that forces the generator to construct discriminative features. In the next section, we carry out experiments to analyze the performance of different classifiers for zero-shot learning and generalized zero-shot learning.
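
The classification loss is the average negative log likelihood of the true labels under a linear softmax classifier. The following pure-Python sketch shows the computation on toy data; the helper names and the weight layout (one weight row per class) are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(x, weights, bias):
    """Logits of a linear classifier; weights holds one row per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def classification_loss(features, labels, weights, bias):
    """Mean of -log P(y | x; theta) over (generated) features."""
    losses = [-math.log(softmax(linear_logits(x, weights, bias))[y])
              for x, y in zip(features, labels)]
    return sum(losses) / len(losses)


# An uninformative classifier over two classes gives a loss of log(2).
uniform_loss = classification_loss([[1.0, 2.0]], [0],
                                   [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
# A classifier that separates the classes gives a much smaller loss.
confident_loss = classification_loss([[1.0, 0.0]], [0],
                                     [[5.0, 0.0], [0.0, 5.0]], [0.0, 0.0])
```

During generator training the classifier parameters would stay fixed, so the loss only pushes generated features toward regions that the pretrained classifier assigns to the correct class.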

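In summary, our proposed model optimizes the following objective function:

$$\min_{G, E} \max_{D}\; L_{VAE} + L_{WGAN} + \beta L_{CLS}, \tag{4}$$

where $L_{VAE}$, $L_{WGAN}$, and $L_{CLS}$ are the conditional VAE loss, the improved WGAN loss, and the classification loss introduced above, and $\beta$ is a hyperparameter weighting the classification loss.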

As shown in Figure 5, once the model has been trained, we predict the labels of unseen classes by first generating pseudovisual features for each unseen class using the learned generator. For GZSL, we then construct a new dataset by combining these pseudovisual features with the real features of the seen classes. After that, we can train any classifier on this new dataset, which contains visual features of both seen and unseen classes. In this way, the GZSL task is transformed into a supervised learning problem. Here, as in [24], we use a self-learning classifier that fine-tunes itself to improve the accuracy.
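
The construction of the GZSL training set can be sketched as follows. The function `toy_generator` is a hypothetical stand-in for the learned generator and only perturbs the class embedding with noise; the class names, feature values, and function names are illustrative assumptions.

```python
import random


def toy_generator(z, class_embedding):
    """Hypothetical stand-in for the learned generator G(z, c)."""
    return [c + 0.1 * n for c, n in zip(class_embedding, z)]


def build_gzsl_training_set(real_seen, unseen_embeddings, generator,
                            n_per_class, seed=0):
    """Combine real seen-class features with synthesized unseen-class ones.

    real_seen: list of (feature, label) pairs from seen classes.
    unseen_embeddings: dict mapping unseen label -> semantic embedding.
    The noise dimension is set equal to the embedding dimension.
    """
    rng = random.Random(seed)
    dataset = list(real_seen)
    for label, embedding in unseen_embeddings.items():
        for _ in range(n_per_class):
            z = [rng.gauss(0.0, 1.0) for _ in embedding]
            dataset.append((generator(z, embedding), label))
    return dataset


real_seen = [([0.9, 0.1], "horse"), ([0.2, 0.8], "pig")]
unseen_embeddings = {"zebra": [1.0, 0.0]}
train_set = build_gzsl_training_set(real_seen, unseen_embeddings,
                                    toy_generator, n_per_class=3)
```

Any standard classifier can then be fitted on `train_set`, since every class, seen or unseen, now has labeled feature vectors.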

4. Experiments

In this section, we report extensive experiments on six public benchmark datasets for both ZSL and GZSL. The experimental setup is detailed in the respective subsections. To make better use of the discriminative information, we also discuss the influence of classifiers with different structures through comparative experiments and analyze the corresponding results.

4.1. Datasets and Settings

We compare our proposed model with several baselines on six widely used datasets, i.e., Oxford Flowers (FLO) [49], Animals with Attributes 1 (AWA1), Animals with Attributes 2 (AWA2) [14], Caltech-UCSD-Birds (CUB) [50], SUN Attribute (SUN) [51], and aPascal-aYahoo (APY). Among these datasets, APY contains 15,339 images of 32 categories from PASCAL VOC 2008 and Yahoo. AWA2 is a coarse-grained, medium-size dataset that contains 30,475 images, 50 classes, and 85 attributes. CUB, FLO, and SUN are medium-scale but fine-grained datasets. CUB contains 11,788 images of 200 bird species annotated with 312 attributes. FLO contains 8,189 images of 102 flower categories without attribute annotations; instead, we use the fine-grained visual descriptions collected by [27]. SUN contains 14,340 images of 717 scenes annotated with 102 attributes. Statistics of the datasets are presented in Table 1.

For real visual features, we extract the 2048-dim top-layer pooling units of ResNet101 [56] from the entire image. We apply neither image preprocessing such as cropping nor any other data augmentation technique. ResNet101 is pretrained on ImageNet1K and not fine-tuned. For pseudovisual features, we generate 2048-dim features using our model. For the class semantic embeddings, we use per-class attributes for AWA (85-dim), CUB (312-dim), SUN (102-dim), and APY (64-dim). For the FLO dataset, we extract 1024-dim character-based features from fine-grained visual descriptions with a CNN-RNN [57].

At test time, in the ZSL setting, the goal is to assign unseen class labels correctly, i.e., the search space is the unseen label set $Y^u$, and in the GZSL setting, the search space includes both seen and unseen classes, i.e., $Y^s \cup Y^u$. We use the unified evaluation protocol in [58]. In the ZSL setting, we first calculate the accuracy of each category independently and then average over all categories to get the average per-class top-1 accuracy (T1). In the GZSL setting, we compute the average per-class top-1 accuracy on seen classes, denoted as s, and the average per-class top-1 accuracy on unseen classes, denoted as u, and then calculate their harmonic mean as the final measure, i.e., $H = 2 \cdot s \cdot u / (s + u)$.
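
The two evaluation measures can be computed in a few lines. The sketch below is a minimal illustration on toy labels; the function names are hypothetical, and returning 0 for the harmonic mean when both accuracies are 0 is a convention chosen here to avoid division by zero.

```python
from collections import defaultdict


def per_class_top1(y_true, y_pred):
    """Average of the per-class accuracies (not the per-sample accuracy)."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(s, u):
    """H = 2 * s * u / (s + u); defined as 0 when both inputs are 0."""
    return 0.0 if s + u == 0 else 2.0 * s * u / (s + u)


# Class "a" has accuracy 2/3 and class "b" has accuracy 1.
t1 = per_class_top1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
h = harmonic_mean(0.5, 0.5)
```

Averaging per class prevents large classes from dominating the score, and the harmonic mean is low whenever either s or u is low, which penalizes models that are biased toward seen classes.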

4.2. Implementation Details

In our proposed model, the encoder, the generator, and the discriminator are all implemented as multilayer perceptrons (MLPs). Through experiments, we find that zero-shot learning performs best when the dimensions of the semantic embeddings and the Gaussian random noise are the same. Therefore, we set the dimension of the Gaussian random noise to the dimension of the semantic embeddings of each dataset. The latent vector and the semantic embedding are concatenated and fed into the generator. Similarly, the discriminator takes the concatenation of image features and class embeddings as input. The discriminator, the encoder, and the generator are all two-layer fully connected (FC) networks with 4096 hidden units. All components use LeakyReLU as the nonlinear activation function, except for the output layer of G, which uses a sigmoid activation so that the BCE loss can be applied. Our experiments show that the discriminative information of visual feature representations is exploited best when the extra classifier is a single-layer perceptron. The model is trained using the Adam optimizer with a learning rate of 0.0001. Following the suggestion of the WGAN paper [44], we update the generator once every 5 discriminator iterations. The two weighting hyperparameters of the loss are initialized to 1 and 10, respectively, and then tuned by cross-validation.
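
The layer sizes and the update schedule described above can be summarized in a short sketch. The function names are hypothetical; the single-output discriminator follows from the real-valued critic output, and the encoder is omitted because its output size is not stated in the text.

```python
def layer_shapes(embedding_dim, feature_dim=2048, hidden_dim=4096):
    """(input, output) sizes of the two FC layers of G and D.

    The noise dimension is set equal to the embedding dimension.
    """
    noise_dim = embedding_dim
    return {
        "generator": [(noise_dim + embedding_dim, hidden_dim),
                      (hidden_dim, feature_dim)],
        "discriminator": [(feature_dim + embedding_dim, hidden_dim),
                          (hidden_dim, 1)],
    }


def update_schedule(num_discriminator_steps, n_critic=5):
    """Sequence of updates: one generator step per n_critic D steps."""
    steps = []
    for i in range(1, num_discriminator_steps + 1):
        steps.append("D")
        if i % n_critic == 0:
            steps.append("G")
    return steps


awa_shapes = layer_shapes(embedding_dim=85)   # 85-dim attributes
schedule = update_schedule(10)
```

For the 85-dim attribute embeddings, the generator input is 170-dim (noise plus embedding) and the discriminator input is 2133-dim (feature plus embedding).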

4.3. Comparing with State-of-the-Art Methods

We compare our approach with ALE [6], f-WGAN [8], SE-GZSL [52], Cycle-WGAN [22], LisGAN [24], f-VAEGAN [25], TCN [53], and SAE [45] for both ZSL and GZSL; two more approaches, CADA-VAE [54] and DVBE [55], are compared for GZSL only. These methods are either representative or state-of-the-art methods published in the past few years. Following previous work [24, 25], we report the average per-class top-1 accuracy. Specifically, for ZSL, we report the top-1 accuracy on unseen samples by searching only the unseen label space. For GZSL, we report the accuracy on both seen classes and unseen classes with the same settings as in [58]. Some of the results reported in this paper are cited from [5].

Table 2 reports the results of ZSL. In these experiments, the categories of test samples are searched only from the unseen label set $Y^u$. The classification accuracies obtained on AWA1, APY, FLO, SUN, and CUB are 71.4%, 44.9%, 73.6%, 65.1%, and 62.6%, respectively. Our proposed framework improves the state-of-the-art performance on the AWA1, APY, FLO, SUN, and CUB datasets by 0.3%, 1.8%, 3.3%, 0.4%, and 1.6%, respectively. On AWA2, our accuracy is on par with the best previous work. From Table 2, we can also observe that generation-based methods, e.g., LisGAN, f-CLSWGAN, and ours, generally achieve better results than embedding-based ones, e.g., ALE. GAN-based methods transform ZSL into a supervised learning problem by generating visual features of unseen classes, while embedding methods deal with unseen classes in an indirect way. This supports the validity of generative models for the ZSL problem. Overall, our method achieves the best performance among the compared methods on five of the six datasets.

Table 3 summarizes the results of GZSL. From Table 3, we can observe that our proposed model performs better than existing methods, which is consistent with the conclusion drawn from Table 2. Our method predicts seen and unseen classes stably. Although some previous methods, such as ALE, performed well in identifying unseen samples in the ZSL setting, their performance in GZSL decreased significantly. When the number of unseen classes becomes larger, ZSL models tend to be confused, resulting in performance degradation. This phenomenon is especially obvious when the number of unseen classes is much larger than that of seen classes. Moreover, in real life, the number of seen classes that can be manually annotated is far smaller than that of unseen classes. Therefore, the applicability of these ZSL methods in practice is limited, and GZSL is the development trend that better matches reality.

We use the harmonic mean, which is considered more stable than the arithmetic and geometric means, to summarize the accuracies on seen and unseen classes. From the results reported in Table 3, we find that our method is more stable than the existing methods and avoids unbalanced and extreme results between s and u. In terms of the harmonic mean H, we achieve improvements of 0.3%, 0.2%, 3.1%, 0.8%, and 1.1% on AWA2, APY, FLO, SUN, and CUB, respectively, i.e., 1.1% on average over the five datasets. Although our model does not perform best on AWA1, its performance is almost equal to the previous state of the art. The results show that our method reduces the accuracy gap between seen and unseen classes to a certain extent, which verifies its generalization ability.

Considering that both f-WGAN and f-CLSWGAN leverage GANs to synthesize unseen visual features, the performance improvement of our method can be attributed to two aspects. One is that we introduce a classifier trained on seen classes to ensure that the generated features of each class can be distinguished from each other, which constitutes the use of discriminative information. The other is our classifier self-learning mechanism at test time, which leverages confident predictions to fine-tune itself. In general, the results verify that it is beneficial to leverage the additional classifier to train the VAEGAN. The correct classification of generated unseen visual features ensures that each synthesized feature is highly related to its category and is more distinguishable.

4.4. Discussion of the Additional Classifier

Here, we analyze how the additional loss of classifiers with different structures influences the performance of zero-shot learning and generalized zero-shot learning. The experimental results on the SUN and CUB datasets are shown in Table 4.

As we can see from Table 4, the single-layer perceptron performs best among all tested classifiers, except for the ZSL accuracy on SUN. The output layer of all classifiers uses sigmoid as the activation function to calculate the classification loss, thus constraining the dual generation network to synthesize visual feature representations that are easy to classify. By comparing the experimental results from lines 2 to 4 and lines 3 to 7 in Table 4, we find that using ReLU as the activation function of the hidden layer works best. At the same time, the last three rows and the top three rows in Table 4 show that a hidden layer with 1024 units works better than one with 512 units for both ZSL and GZSL. Through experiments, we find that using a single-layer neural network as the additional classifier to capture the discriminative information not only gives the best results but also reduces the running time.

4.5. Analysis of Synthetic Image Features

To provide an intuitive evaluation of our proposed model, we visualize synthetic image features together with the corresponding real image features of unseen classes. The results are shown in Figure 6. For convenience, we choose 10 unseen categories of the AWA2 dataset for visualization. First, we obtain the semantic embeddings and the real image features of the selected categories. Second, we input these semantic embeddings and Gaussian random noise into the learned generator to obtain the synthetic image features. Finally, we use t-SNE [30] to reduce the dimension of the synthetic and real visual features from 2048 to 2 and draw the resulting points as a scatter plot.

From the visualization of real feature samples in Figure 6(a), it can be seen that some categories overlap to a large extent, such as seals, walruses, blue whales, and dolphins. This overlap is reasonable, because blue whales, dolphins, seals, and walruses are biologically similar and look very similar visually. The visualization of synthetic image features is shown in Figure 6(b). By comparing Figures 6(a) and 6(b), we can clearly see that for most categories, such as seals and dolphins, the synthetic image features are very close to the real samples, and some of them even overlap well with the real samples, such as horses, sheep, and giraffes. One failure case is the rat, for which the synthesized features are far from the real features. Another disadvantage is that there is almost no confusion between the categories of synthetic samples, which is contrary to the actual situation. Nevertheless, the finally trained softmax classifier can predict the labels of most categories of test images well.

5. Conclusion

In this paper, we discuss the generalized zero-shot learning task and propose a model called DGDI, a dual generative framework that combines the advantages of the conditional VAE and the improved WGAN and exploits discriminative information through an added classification loss to obtain a more robust generative model. We make full use of the discriminative information of visual feature representations between categories to further improve our dual generative module by adding a softmax classifier pretrained on the seen classes, which encourages the generator to learn the discriminative information. The experimental results on six datasets show the effectiveness of our proposed framework; our method achieves good performance on almost all datasets, which demonstrates the importance of the discriminative information between the visual feature representations of categories. Improving the accuracy and generalization ability of zero-shot learning is a meaningful problem, and we will study it further.

Data Availability

The datasets used in this study can be downloaded from http://datasets.d2.mpi-inf.mpg.de/xian/xlsa17.zip.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant no. 2019YFA0706200, the National Major Research Program of China under Grant no. 2018AAA0102002, and the National Natural Science Foundation of China (NSFC) under Grant nos. 61976076 and 61632007.