Research Article  Open Access
Jun Yue, Zelang Miao, Yueguang He, Nianchun Du, "Loss Architecture Search for Few-Shot Object Recognition", Complexity, vol. 2020, Article ID 1041962, 10 pages, 2020. https://doi.org/10.1155/2020/1041962
Loss Architecture Search for Few-Shot Object Recognition
Abstract
Few-shot object recognition, which exploits a set of well-labeled data to build a classifier for new classes that have only several samples per class, has received extensive attention from the machine learning community. In this paper, we investigate the problem of designing an optimal loss function for few-shot object recognition and propose a novel few-shot object recognition system that includes the following three steps: (1) generate a loss function architecture using a recurrent neural network (the generator); (2) train a base embedding network with the generated loss function on a training set; (3) fine-tune the base embedding network using the few-shot instances from a validation set to obtain the accuracy and use it as a reward signal to update the generator. This procedure is repeated and implemented in the reinforcement learning framework for finding the best loss architecture such that the embedding network yields the highest validation accuracy. Our key insight is to create a search space of loss function architectures and evaluate the quality of a particular loss function on the dataset of interest. We conduct experiments on three popular datasets for few-shot learning. The results show that the proposed approach achieves better performance than state-of-the-art methods.
1. Introduction
The object recognition problem under conventional supervised learning has been thoroughly studied, with many successful models available, especially deep convolutional neural networks (DCNNs), including VGG [1], GoogLeNet [2], ResNet [3], and NASNet [4]. DCNNs can achieve impressive performance on large datasets like ImageNet [5] and Open Images [6]. However, training a DCNN model typically requires large amounts of labeled training data and training time. Some works use pretrained DCNNs as off-the-shelf deep feature extractors [7]. While this procedure can reduce training time, it still requires many labeled images to fine-tune the models. This is in contrast with how the human visual recognition system works: a person does not need to see thousands of instances of an object, but only a small number of them, to remember and generalize for recognizing the object later [8].
Motivated by the high capability of the human visual recognition system, one/few-shot learning has attracted considerable attention from the computer vision community, including work on one-shot or few-shot learning for object recognition [9–14], object detection [15], image segmentation [16], image captioning [17], face identification [18], and person re-identification [19]. Different from standard supervised learning, few-shot learning aims to solve the problem where only a few labeled samples are available for training, while its extreme scenario, called one-shot learning, handles the more challenging situation where only one instance per class is available. Since a classifier trained with only one or a few instances per class can hardly achieve a desired result, studies on one/few-shot object recognition have focused on training with a set of well-labeled data and then generalizing to new classes [10, 20].
For one/few-shot object recognition, due to the scarcity of labeled examples from each new class, a natural metric learning scheme is adopted in [21], which aims to learn object representations and categorize instances based on the absolute similarities between pairs of instances. This absolute-similarity-based method, however, does not pay enough attention to the inter-class and intra-class correlations, which may not result in satisfactory accuracy.
Instead of absolute similarity, relative-similarity-based classification has a higher capability of decreasing the intra-class variations and increasing the inter-class variations [22]. While it is difficult to separate or group objects based only on an absolute similarity measure, relative similarities can conveniently deal with these variations by requiring the intra-class similarity to be larger than the inter-class similarity. Figure 1 shows an example of the advantage of relative similarity. There have been many metric learning methods based on relative similarity [23–26]. However, different metric learning losses can lead to different performances, and there is no rule of thumb for designing the best loss for a given dataset.
Hence, in this paper, we develop a novel object-representation-based metric learning method that induces suitable relative similarities with a particular loss generated by loss architecture search (LAS), inspired by neural architecture search (NAS) [4, 27]. LAS, as shown in Figure 2, is a gradient-based method for finding good loss architectures. We use a recurrent neural network as the loss function architecture generator to generate a suitable loss for training the embedding network. The generator is updated with a reward signal, which measures how good the relative-similarity-based metric learning loss is. To obtain the reward signal, the following two steps are carried out. First, we train the embedding network with the generated loss using the training set. Second, we use the few-shot instances from the validation set to fine-tune the embedding network and obtain the validation accuracy. With this accuracy as the reward signal, we can compute the policy gradient by proximal policy optimization (PPO) [28] to update the loss generator. As a result, in the next iteration, the generator gives higher probabilities to those losses that drive the embedding network to higher accuracies. In other words, the generator learns to improve its search over time. After obtaining the final loss function with the highest validation accuracy, the few-shot object recognition system is completed by training the embedding network on the training set and evaluating it on the testing set (see the right part of Figure 2).
The proposed approach trains the object embedding network on the well-labeled training set under a loss function generated by LAS, such that instances from new classes can be categorized based on their similarities with the few-shot instances in the learned embedding space. Our main contributions are threefold:
(1) We create a search space for automatically determining the hyperparameters of metric learning losses, including their magnitudes, margins, and angle.
(2) We present a few-shot recognition system where a proximal policy optimization algorithm is used to find the best loss architecture sampled from the search space, such that the embedding network yields the highest validation accuracy.
(3) Our few-shot object recognition system with LAS by reinforcement learning obtains the best results compared with recent state-of-the-art methods.
2. Related Work
2.1. Few-Shot Learning
Recent works on few-shot learning have adopted different strategies in different domains [8, 9, 11, 18, 21, 29–33], among which the most relevant ones are the metric-learning-based methods [8, 21]. The Siamese network developed in [21] aims to learn absolute similarity between pairs of objects. It enlarges or lessens the similarity between a pair of samples according to whether they are from the same class. The authors of [8] propose a matching network, which can also be considered a metric-learning-based method. This approach minimizes a cosine-distance-based loss between the embeddings of instances with a bidirectional Long Short-Term Memory (biLSTM). Another line of research uses meta-learning for few-shot learning, which trains a meta-learner from many relevant tasks. In [29], model-agnostic meta-learning is proposed to train a model that can quickly adapt to a new task with limited training data. The authors of [9] propose an LSTM-based meta-learner model to train another learner in the few-shot regime. Another meta-learning-based method is proposed in [30], which learns to rapidly parameterize an underlying neural network by employing meta-information. The memory-augmented neural network is adopted in [31] for few-shot learning. It enables each instance to be encoded and retrieved efficiently by a memory module. This module consists of key-value pairs, where keys and values are the embeddings and labels of examples, respectively. Another work [32] also uses a memory module for training, with a biLSTM to predict the parameters of the embedding model. The authors of [33] propose a task-agnostic approach to improve the generalizability of few-shot meta-learning. In addition to the research mentioned above, a few other works generate synthetic data for learning [11, 18].
Different from these previous methods, we design our few-shot object recognition system in the reinforcement learning framework, which automatically determines the optimal loss function for training the embedding network.
2.2. Metric Learning
Metric learning aims to learn a similarity function from examples, which measures the similarity between object pairs. Recently, quite a few methods have been proposed for metric learning. Some works [34–36] adopt pairwise constraints to train the models. As discussed in Section 1, these models' capacity is limited due to the use of absolute similarity. Most recent works [23, 26, 37] use DCNNs as embedding functions and use triplet-based constraints instead of pairwise constraints to capture similarities between objects. Their results show that the combination of a deep embedding model with a triplet-based constraint is effective in learning similarity functions. Nevertheless, the problems solved in these works still fall into the conventional supervised learning setting, where a set of well-labeled data is needed. In our work, instead of manually designing a metric learning loss and training with a large amount of labeled data, we use LAS for metric learning, which can automatically determine the hyperparameters of relative-similarity-based losses such as the margin, magnitude, and angle.
2.3. AutoML
The success of deep learning in computer vision tasks is largely due to its automation of the feature extraction process; hierarchical feature extractors are learned in an end-to-end fashion from data rather than being manually designed. To obtain better performance and handle more difficult problems, researchers have manually developed more and more complex networks. Automating the design of network architectures through neural architecture search [4, 27] is thus a logical next step in machine learning. NAS can be seen as a subfield of AutoML. The work from [38] proposes a method to automatically determine the parameters of a softmax-based loss architecture by using AutoML. In our work, we propose LAS for finding the best architecture of the metric-learning-based loss function for few-shot learning, which can be seen as another subfield of AutoML. We believe that combining NAS and LAS will push AutoML further; we leave this as future work.
3. Loss Architecture Search
We formulate the problem of finding the best loss architecture as a discrete search problem. In our search space, a loss architecture consists of five sub-loss architectures, each of which has two hyperparameters: (1) the magnitude of the operation and (2) the margin or angle of the sub-loss.
3.1. Losses
We choose five promising losses to construct the search space: Hard Mining Triplet Loss (HMTL) [23], Margin Sample Mining Loss (MSML) [24], Angular Loss (AL) [25], Triplet Center Loss (TCL) [26], and Quadruplet Center Loss (QCL). The first four are from the metric learning community and achieve state-of-the-art results in few-shot learning, face recognition, vehicle re-identification, person re-identification, and so forth. The last one, QCL, is proposed in this paper as a complement to the first four losses.
3.1.1. Hard Mining Triplet Loss
The triplet loss is based on a relative distance comparison among triplet instances. Specifically, for any triplet $(x_a, x_p, x_n)$, where $x_a$ and $x_p$ are two distinct samples of the same class and $x_n$ comes from another class as a negative example, their embedding vectors are generated by an embedding function $f(\cdot)$ such that $f_a = f(x_a)$, $f_p = f(x_p)$, and $f_n = f(x_n)$. The relative distance constraint of each triplet is defined as

$$D(f_a^i, f_p^i) + m < D(f_a^i, f_n^i), \qquad (1)$$

where $m$ is a predefined constant parameter representing the minimum margin between the matched and mismatched pairs, $D(\cdot, \cdot)$ is a metric function that measures the distance between two vectors, and $i$ denotes the $i$th triplet. In order to ensure fast convergence, it is crucial to select triplets that violate the triplet constraint in equation (1). This means that, given $x_a$, we want to select a hard positive sample $x_p$ that maximizes $D(f_a, f_p)$ and a hard negative sample $x_n$ that minimizes $D(f_a, f_n)$. For each sample $x_a$ (as an anchor) from a batch of size $B$, we select a hard positive sample and a hard negative sample (also from the batch) to construct a triplet. Thus, the HMTL function can be defined as

$$L_{\mathrm{HMTL}} = \frac{1}{B} \sum_{a=1}^{B} \Big[\, m + \max_{p} D(f_a, f_p) - \min_{n} D(f_a, f_n) \,\Big]_{+}. \qquad (2)$$
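As a concrete sketch of the batch-hard mining described above (a schematic NumPy re-implementation, not the authors' code; the Euclidean distance is assumed for the metric $D$):

```python
import numpy as np

def hard_mining_triplet_loss(emb, labels, margin=0.3):
    """Batch-hard triplet loss: each anchor is paired with its hardest
    (farthest) positive and hardest (closest) negative in the batch."""
    emb, labels = np.asarray(emb, dtype=float), np.asarray(labels)
    # Pairwise Euclidean distances between all embeddings in the batch.
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]      # same-class mask
    not_self = ~np.eye(len(emb), dtype=bool)       # exclude the anchor itself
    losses = []
    for a in range(len(emb)):
        pos = dist[a][same[a] & not_self[a]]       # distances to positives
        neg = dist[a][~same[a]]                    # distances to negatives
        if pos.size and neg.size:
            losses.append(max(0.0, margin + pos.max() - neg.min()))
    return float(np.mean(losses))
```

For a toy batch of two well-separated classes, a small margin leaves the hinge inactive (zero loss); enlarging the margin activates it.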
3.1.2. Margin Sample Mining Loss
The quadruplet loss (QL) [22] extends the triplet loss by adding a different negative pair. A quadruplet contains four different instances $(x_i, x_j, x_k, x_l)$, where $x_i$ and $x_j$ are samples of the same class, while $x_k$ and $x_l$ are samples of two other classes. The quadruplet loss is formulated as

$$L_{\mathrm{QL}} = \frac{1}{N} \sum \big[\, m_1 + D(f_i, f_j) - D(f_i, f_k) \,\big]_{+} + \frac{1}{N} \sum \big[\, m_2 + D(f_i, f_j) - D(f_l, f_k) \,\big]_{+}, \qquad (3)$$

where $m_1$ and $m_2$ are the margins of the two terms, respectively, and $N$ stands for the number of quadruplets. The first term is the same as the triplet loss. The second term tries to enforce intra-class distances to be smaller than inter-class distances [22]. Based on QL, a strategy named margin sample mining is proposed in [24]. It picks the most dissimilar positive pair and the most similar negative pair in the whole batch and formulates its loss as

$$L_{\mathrm{MSML}} = \Big[\, m + \max_{y_i = y_j} D(f_i, f_j) - \min_{y_k \neq y_l} D(f_k, f_l) \,\Big]_{+}. \qquad (4)$$
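The mining step itself is compact; a minimal NumPy sketch (schematic, with the Euclidean distance assumed for $D$):

```python
import numpy as np

def margin_sample_mining_loss(emb, labels, margin=0.3):
    """MSML sketch: compare the most dissimilar positive pair with the
    most similar negative pair over the whole batch."""
    emb, labels = np.asarray(emb, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(emb), dtype=bool)
    hardest_pos = dist[same & not_self].max()  # most dissimilar positive pair
    hardest_neg = dist[~same].min()            # most similar negative pair
    return max(0.0, margin + hardest_pos - hardest_neg)
```

Note that only one scalar hinge is produced per batch, which is exactly the sparsity discussed next.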
As shown in equation (4), the constraints of MSML are extremely sparse: only two pairs in a batch are used to compute the gradient of the embedding network in the training phase, which may seem to waste a lot of training data. In fact, the two chosen pairs are determined by all the data in one batch, and during training, more and more pairs will be selected [24].
3.1.3. Angular Loss
Let a triplet $(x_a, x_p, x_n)$ form a triangle whose angles at the three vertices are denoted as $\angle a$, $\angle p$, and $\angle n$. The original triplet constraint in equation (1) enforces that edge $\overline{an}$ is longer than edge $\overline{ap}$. Because the anchor and positive samples are of the same class, a symmetrical triplet constraint can be derived, which enforces that edge $\overline{pn}$ is also longer than edge $\overline{ap}$. According to the cosine formula, it can be proved that the angle $\angle n$, surrounded by the two longer edges $\overline{an}$ and $\overline{pn}$, has to be the smallest one; that is, $\angle n \le \min(\angle a, \angle p)$. Furthermore, because $\angle a + \angle p + \angle n = 180°$, $\angle n$ has to be less than $60°$, which can be used to constrain the upper bound for each triplet triangle [25]:

$$\angle n \le \alpha, \qquad (5)$$

where $\alpha$ is a predefined parameter. However, a straightforward implementation of equation (5) becomes unstable in some special cases [25]. In order to solve this problem, the angular loss function is defined in the following equation:

$$L_{\mathrm{AL}} = \frac{1}{N} \sum_{i=1}^{N} \Big[\, \|f_a^i - f_p^i\|^2 - 4 \tan^2 \alpha \, \|f_n^i - f_c^i\|^2 \,\Big]_{+}, \qquad (6)$$

where $f_c^i = (f_a^i + f_p^i)/2$ is the midpoint of the anchor and positive embeddings and $N$ is the number of triplets, which is also equal to the batch size in our implementation.
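The tangent-based hinge form of the angular loss (our reading of [25], with the negative compared against the anchor-positive midpoint and Euclidean embeddings assumed) can be sketched as:

```python
import numpy as np

def angular_loss(f_a, f_p, f_n, alpha_deg=45.0):
    """Angular-loss sketch: penalize triplets whose angle at the negative
    exceeds alpha via the hinge ||f_a - f_p||^2 - 4 tan^2(alpha) ||f_n - f_c||^2."""
    f_a, f_p, f_n = (np.asarray(x, dtype=float) for x in (f_a, f_p, f_n))
    f_c = (f_a + f_p) / 2.0                      # anchor-positive midpoint
    tan2 = np.tan(np.radians(alpha_deg)) ** 2
    per_triplet = np.sum((f_a - f_p) ** 2, axis=-1) \
        - 4.0 * tan2 * np.sum((f_n - f_c) ** 2, axis=-1)
    return float(np.mean(np.maximum(per_triplet, 0.0)))
```

A negative far from the anchor-positive pair gives zero loss; a negative sitting near their midpoint leaves only the positive-pair spread in the hinge.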
3.1.4. Triplet Center Loss
The goal of TCL is to leverage the advantages of both the triplet loss and the center loss [39], that is, to effectively decrease the intra-class distances and increase the inter-class distances simultaneously. Let a given training batch consist of $M$ samples whose embeddings $\{f_i\}_{i=1}^{M}$ have the associated labels $\{y_i\}_{i=1}^{M}$. In TCL, it is assumed that the features of each class share one center. Thus, the TCL is defined as

$$L_{\mathrm{TCL}} = \sum_{i=1}^{M} \Big[\, D(f_i, c_{y_i}) + m - \min_{j \neq y_i} D(f_i, c_j) \,\Big]_{+}, \qquad (7)$$

where $c_{y_i}$ is the center of class $y_i$ and $\min_{j \neq y_i} D(f_i, c_j)$ is the distance from $f_i$ to its nearest negative center.
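As a small illustration (not the authors' code), the TCL hinge can be sketched in NumPy; the Euclidean distance is assumed for the metric, and the class centers are passed in as fixed vectors, whereas in actual training they are learned jointly with the network:

```python
import numpy as np

def triplet_center_loss(emb, labels, centers, margin=0.5):
    """TCL sketch: pull each sample toward its own class center and keep it
    at least `margin` closer to that center than to any other center."""
    emb = np.asarray(emb, dtype=float)
    total = 0.0
    for f, y in zip(emb, labels):
        d_own = np.linalg.norm(f - centers[y])                  # own center
        d_neg = min(np.linalg.norm(f - centers[c])              # nearest
                    for c in centers if c != y)                 # negative center
        total += max(0.0, d_own + margin - d_neg)
    return total / len(emb)
```

Samples sitting near their own center incur no loss; a sample equidistant from two centers pays the full margin.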
3.1.5. Quadruplet Center Loss
In equation (7), TCL aims to make the distance between the anchor $f_i$ and its corresponding center $c_{y_i}$ smaller than the distance between the anchor and its nearest negative center. For example, in Figure 3(a), the anchor can be pushed to its center. However, in some cases such as the one in Figure 3(b), it cannot, even though it is still far away from its center. In order to solve this problem, we propose a quadruplet center loss:

$$L_{\mathrm{QCL}} = \sum_{i=1}^{M} \Big[\, D(f_i, c_{y_i}) + m_1 - \min_{j \neq y_i} D(f_i, c_j) \,\Big]_{+} + \sum_{i=1}^{M} \Big[\, D(f_i, c_{y_i}) + m_2 - \min_{j \neq k;\; j, k \neq y_i} D(c_j, c_k) \,\Big]_{+}, \qquad (8)$$

where $(c_j, c_k)$ is a pair of negative centers. Thus, the proposed loss function guarantees that an anchor that is not close enough to its center will be moved closer, while an anchor that is already close enough to its center will be neglected.
3.2. Search Space
Our motivation for constructing a search space and automatically finding an optimal loss in it is threefold: (1) designing a loss architecture requires a lot of expert knowledge; (2) the available losses are all task- and dataset-dependent; (3) there is no rule of thumb for choosing the hyperparameters of each loss.
Therefore, we use the five loss architectures to construct the search space. Each loss comes with a magnitude. We discretize the range of magnitudes into 11 uniformly spaced values from 0.0 to 1.0 with a step of 0.1 so that we can use a discrete search algorithm to find them. Similarly, we discretize the margin of each margin-based loss into 51 uniformly spaced values from 0.0 to 5.0 with a step of 0.1. For the angular loss, we discretize the angle into 51 uniformly spaced values from 10° to 60° with a step of 1°. Finding an optimal loss architecture now becomes a search problem in a space of $(11 \times 51)^5 \approx 5.6 \times 10^{13}$ possibilities. Notice that there is no explicit discard action in our search space; this action is implicit and can be achieved by setting the magnitude of a sub-loss architecture to 0.
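The size of the resulting space follows directly from the discretization; a quick sanity check:

```python
# Discretized hyperparameter grids for the five sub-losses.
magnitudes = [round(0.1 * i, 1) for i in range(11)]   # 0.0 .. 1.0, step 0.1
margins    = [round(0.1 * i, 1) for i in range(51)]   # 0.0 .. 5.0, step 0.1
angles     = list(range(10, 61))                       # 10 .. 60 degrees, step 1

per_margin_loss = len(magnitudes) * len(margins)       # 561 choices per margin loss
per_angle_loss  = len(magnitudes) * len(angles)        # also 561 choices for AL
total = per_margin_loss ** 4 * per_angle_loss          # four margin losses + AL

print(total)  # 561**5, about 5.6e13
```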
3.3. Search Algorithm
The search algorithm we use to find the optimal loss architecture is a reinforcement learning algorithm, specifically proximal policy optimization (PPO), inspired by [4, 27]. The loss generator is a one-layer LSTM with 100 hidden units. As shown in Figure 4, at each step, the generator takes an action according to the maximum probability produced by a softmax; the output is then fed into the next step. In total, the generator has 10 softmax layers in order to predict 5 sub-loss architectures, each with 2 actions (selecting a magnitude and a margin/angle).
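Schematically, the generator makes 10 sequential choices per architecture. The stand-in below replaces the LSTM generator and PPO update with plain uniform random search, purely to illustrate the interface; `reward_fn` stands for the train-then-validate procedure of Figure 2:

```python
import random

MAGNITUDES = [round(0.1 * i, 1) for i in range(11)]
MARGINS    = [round(0.1 * i, 1) for i in range(51)]
ANGLES     = list(range(10, 61))
SUB_LOSSES = ("HMTL", "MSML", "AL", "TCL", "QCL")

def sample_architecture(rng):
    """Draw the 10 decisions (a magnitude plus a margin/angle per sub-loss)
    that the paper's LSTM generator produces with its 10 softmax heads."""
    return {name: (rng.choice(MAGNITUDES),
                   rng.choice(ANGLES) if name == "AL" else rng.choice(MARGINS))
            for name in SUB_LOSSES}

def search(reward_fn, iters=200, seed=0):
    """Keep the best-rewarded architecture; PPO instead updates the generator
    so that high-reward architectures become more probable over time."""
    rng = random.Random(seed)
    best, best_r = None, float("-inf")
    for _ in range(iters):
        arch = sample_architecture(rng)
        r = reward_fn(arch)  # in the paper: fine-tuned validation accuracy
        if r > best_r:
            best, best_r = arch, r
    return best, best_r
```

Swapping the uniform sampler for a learned, reward-updated policy is precisely what distinguishes the paper's method from this random-search baseline.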
As shown in Figure 2, the generator is trained with a reward signal, which measures how good the loss architecture is at improving the accuracy of the object embedding network. After a loss architecture is generated, the embedding network is trained on the training set and fine-tuned using the few-shot instances. It is then evaluated on the validation set to measure the validation accuracy (the reward signal). At the end of the search, we obtain the loss architecture with the highest accuracy on the validation set, which is then used to train the final embedding network.
4. Experiments and Results
To evaluate the effectiveness of the proposed approach, we conduct experiments on three popular datasets for few-shot learning: CIFAR-100 [40], Omniglot [41], and miniImageNet [9]. The Omniglot dataset is the most popular few-shot object recognition benchmark with handwritten characters, and the miniImageNet dataset is a recently released subset of ImageNet [5].
4.1. Datasets
4.1.1. CIFAR-100 Dataset
The CIFAR-100 dataset [40] contains 60,000 images in 100 classes. We use 60, 20, and 20 classes for training, validation, and testing, respectively. The validation set is used to generate validation accuracies as the reward signals for LAS.
4.1.2. Omniglot Dataset
This dataset contains 32,460 images of handwritten characters, with 1,623 different characters within 50 alphabets. We follow the most common split in [8], splitting the dataset into a background set of 1200 classes and a testing set of 423 classes. We further split the background set into a training set of 800 classes and a validation set of 400 classes.
4.1.3. miniImageNet Dataset
The miniImageNet dataset is proposed by [9] as a benchmark with images of much higher resolutions and complexity. This dataset contains 100 classes randomly sampled from the ImageNet dataset, and each class contains 600 images. It is further split into a training set of 64 classes, a validation set of 16 classes, and a testing set of 20 classes [9].
4.2. Experimental Settings
4.2.1. Evaluation Setting
Typically, a few-shot learning task with N new classes and k instances per class is referred to as an N-way k-shot task. In this paper, all experiments are N-way k-shot object recognition, and the evaluation setting is similar to those of the compared methods. For each iteration in LAS (see the left part of Figure 2), we train the embedding network with the generated loss using the training set. To evaluate the generated loss, we randomly select a support set consisting of N classes with k labeled images per class from the validation set for fine-tuning the embedding network. Then, we randomly select 15 images (disjoint from the support set) within the selected classes from the validation set to evaluate the fine-tuned embedding network by 1-Nearest Neighbor (1-NN). We repeat this evaluation procedure 5 times for each support set and use the mean accuracy as the final validation accuracy (i.e., the reward signal) corresponding to the generated loss.
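The episode construction and 1-NN scoring described above can be sketched as follows (schematic; `features` stands for the embeddings produced by the fine-tuned network, and the synthetic data in the usage example is illustrative):

```python
import numpy as np

def sample_episode(features, labels, n_way, k_shot, n_query, rng):
    """Draw an N-way k-shot support set plus n_query disjoint query
    embeddings per class."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    sx, sy, qx, qy = [], [], [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        sx.append(features[idx[:k_shot]]);                  sy += [c] * k_shot
        qx.append(features[idx[k_shot:k_shot + n_query]]);  qy += [c] * n_query
    return np.concatenate(sx), np.array(sy), np.concatenate(qx), np.array(qy)

def one_nn_accuracy(support_x, support_y, query_x, query_y):
    """Label each query with the class of its nearest support embedding."""
    d = np.linalg.norm(query_x[:, None, :] - support_x[None, :, :], axis=-1)
    pred = support_y[d.argmin(axis=1)]
    return float((pred == query_y).mean())
```

With well-separated synthetic clusters standing in for a good embedding, 1-NN classification of an episode is perfect.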
After obtaining the final loss architecture for one dataset, we use the testing procedure to obtain the final accuracy (see the right part of Figure 2). We first train the embedding network with the final loss architecture using the same training set. In the testing stage, we fine-tune the embedding network using a randomly selected support set from the testing set. After that, we randomly select 15 images (disjoint from the support set) within the selected classes and then measure the classification accuracy by 1-NN. To make the accuracy more convincing, we repeat this testing procedure 600 times and report the final mean accuracy as well as the 95% confidence intervals (CIs).
4.2.2. Network Architecture and Parameter Setting
Our embedding network follows the same architecture as that used in [8, 9], which has four convolutional layers (Conv-4). Each convolutional layer consists of a 3 × 3 convolution with 64 filters, followed by batch normalization, a ReLU nonlinearity, and max pooling.
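Assuming the standard configuration of this backbone (3 × 3 convolutions that preserve the spatial size via padding, with each block's 2 × 2 max pooling halving it, consistent with [8, 9]), the feature-map shape can be tracked with a few lines:

```python
def conv4_feature_shape(height, width, blocks=4, filters=64):
    """Spatial size after the Conv-4 backbone: each block's 3x3 conv
    (with padding) preserves H and W, and its 2x2 max pool halves them
    (floor division)."""
    for _ in range(blocks):
        height, width = height // 2, width // 2
    return filters, height, width

# 84x84 miniImageNet crops -> 64 feature maps of size 5x5
print(conv4_feature_shape(84, 84))
```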
On each dataset, the generator samples 5,000 loss architectures. We follow the training procedure and hyperparameters from [4] for training the generator. For training the embedding network, we use Adam [42] to perform stochastic optimization over the learning objective. The hyperparameters of Adam (i.e., $\beta_1$, $\beta_2$, and $\epsilon$) are set to 0.9, 0.999, and $10^{-8}$, respectively.
4.3. Ablation Study
Since LAS integrates the five sub-losses into the reinforcement learning framework, we conduct an ablation study on CIFAR-100 to understand the contribution of each sub-loss. In each case, we remove one sub-loss from the search space and perform the whole loss architecture search process shown in Figure 2. The results are presented in Table 1.
 
In each of the first five cases, one sub-loss is removed from the search space. In the last case, all the sub-losses are used.
From Table 1, we can see that removing any one sub-loss results in an obvious performance drop compared with using all of them. This implies that the five losses are not redundant. In addition, removing QCL, the loss proposed in this paper, causes the most significant performance reduction on this dataset, which shows that QCL may contribute the most among the five losses.
4.4. Compared Methods
To verify the performance of our LAS for few-shot learning, we compare it with the following 18 state-of-the-art methods:
(1) Siamese Network (SN) [21], which performs few-shot classification by learning a deep convolutional embedding network with a pairwise Siamese loss.
(2) Matching Network (MN) [8], which performs few-shot learning by embedding a small labeled support set and unlabeled examples.
(3) Matching Network with Full Contextual Embeddings (MN-FCE) [8], an upgraded version of MN that utilizes a biLSTM to contextually embed samples.
(4) Model-Agnostic Meta-Learning (MAML) [29], which learns hyperparameters through gradient descent in a meta-learning fashion.
(5) Meta-Learner LSTM (ML-LSTM) [9], which uses an LSTM as a meta-learner to learn the parameter update rule for optimizing the network.
(6) Siamese with Memory (SM) [31], which proposes a large-scale lifelong memory module to remember past training samples and makes predictions based on them.
(7) Meta-Network (MetaN) [30], which learns meta-level knowledge across tasks for rapid generalization.
(8) Meta-Stochastic Gradient Descent (Meta-SGD) [43], which learns the initialization, update direction, and learning rates by meta-learning.
(9) Prototypical Networks (PN) [10], which learn a prototype representation for each class.
(10) Relation Network (RN) [20], which computes relation scores between few-shot instances and testing instances.
(11) Memory Matching Network (MM-Net) [32], which augments CNNs with memory and learns the network parameters for unlabeled images.
(12) MM [32], a variant of MM-Net that uses a mixed training strategy.
(13) Graph Neural Network (GNN) [44], which casts few-shot learning as a message-passing task.
(14) Task-Agnostic Meta-Learning (TAML) [33], which learns an unbiased initial model with the largest uncertainty.
(15) Task-Dependent Adaptive Metric (TADAM) [45], which proposes a practical end-to-end optimization procedure to learn a task-dependent metric space.
(16) Simple Neural AttentIve Learner (SNAIL) [46], which proposes a generic meta-learner architecture for few-shot learning.
(17) Deep Nearest Neighbor Neural Network (DN4) [14], which replaces an image-level feature-based measure with a local feature-based image-to-class measure.
(18) Deep learning with knowledge transfer architecture (KTN) [47], which jointly incorporates classifier learning, knowledge inferring, and visual feature learning into one framework.
Note that, for a fair comparison, the backbones of LAS and all these compared methods have the same architecture as that used in [8, 9], which is a shallow network with only 4 convolutional layers.
4.5. Results on CIFAR100
Table 2 compares LAS with 4 state-of-the-art methods (MN, SM, MAML, and MetaN) and the 5 sub-losses described in the last section. The codes of these 4 methods are publicly available and thus can be used to run this experiment. The results across the 5-way 1-shot task and the 5-way 5-shot task show that our LAS method obtains the best performance against these methods. In particular, the 1-shot and 5-shot accuracies of our method reach 55.85% and 71.91%, respectively, on 5-way learning, an absolute improvement over MAML of 6.56% and 5.87%.

Table 2 also shows the performances of the 5 metric learning approaches for few-shot object recognition (HMTL [23], MSML [24], AL [25], TCL [26], and QCL). The results indicate that no single sub-loss dominates the loss searched by LAS.
4.6. Results on Omniglot
Table 3 compares LAS with 8 state-of-the-art methods on the Omniglot dataset. The results of these methods are from [14, 32, 33]. The results across the 20-way 1-shot task and the 20-way 5-shot task indicate that our LAS method again achieves the best performance against other state-of-the-art techniques, including embedding models (SN, MN, SM, PN, and MM-Net) and meta-learning approaches (MetaN, MAML, and TAML). In particular, the 1-shot and 5-shot accuracies of LAS reach 97.69% and 99.21% on 20-way learning, respectively, an absolute improvement over MAML of 1.89% and 0.31%, which is significant on this dataset.
4.7. Results on miniImageNet
The performance comparison with 14 state-of-the-art methods on miniImageNet is summarized in Table 4. The results of these methods are from [14, 20, 32, 33]. Our LAS method again performs the best. In particular, the 1-shot and 5-shot accuracies reach 54.97% and 71.92% on 5-way learning, respectively, an absolute improvement over MM-Net of 1.60% and 4.95%. Compared with the most recent work DN4, LAS improves the accuracy on the two tasks by 3.73% and 0.90%, respectively. Our LAS also gains 4.53% and 6.60% improvements over RN. RN computes relation scores between few-shot instances and testing instances, while LAS finds the best metric-learning-based loss to generate the best embedding functions.
 
To evaluate the proposed method on deeper networks, several experiments on miniImageNet are conducted using ResNet-12, a 12-layer residual network, as the backbone. The performance comparison with ResNet-12 as the backbone on miniImageNet is summarized in Table 5. The results across the 5-way 1-shot task and the 5-way 5-shot task show that the proposed method achieves the best performance against other state-of-the-art techniques. In particular, the 1-shot and 5-shot accuracies of LAS with ResNet-12 as the backbone reach 59.44% and 77.82% on 5-way learning, respectively, an absolute improvement over MAML of 9.08% and 10.48%, which is significant on this dataset.
 
Results from [48]. 
4.8. Training and Inference Time
Similar to other reinforcement learning algorithms, the training of LAS is time-consuming. On an Nvidia Tesla P100 GPU, LAS takes about two days to find the best loss architecture on the miniImageNet dataset for the 5-way 5-shot task. However, after training, its inference is fast; for example, it takes only 2 ms to perform one inference for the same task.
5. Conclusion
In this paper, we have proposed an automatic approach to loss architecture search for few-shot object recognition. Five metric-learning-based sub-losses (one of which is developed by us) are used to construct the search space. The loss generator is trained by a reinforcement learning algorithm. Our experiments show that the proposed few-shot object recognition method outperforms other state-of-the-art methods on three popular benchmarks. Future work includes combining network architecture search and loss architecture search for better AutoML.
Data Availability
The datasets and source codes used to support the findings of this study are available from the author upon request via email (jyue@pku.edu.cn).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 41701500, in part by the Natural Science Foundation of Hunan Province under Grants 2018JJ3641 and 2019JJ60001, and in part by InnovationDriven Project of Central South University under Grant 2020CX036.
References
 K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” in Proceedings of the ICLR, San Diego, CA, USA, May 2015. View at: Google Scholar
 C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolutions,” in Proceedings of the ICLR, San Diego, CA, USA, May 2015. View at: Google Scholar
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the ICLR, San Juan, Puerto Rico, May 2016. View at: Google Scholar
 B. Zoph, V. Vasudevan, J. Shlens, V. Quoc, and Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the ICLR, Vancouver, BC, Canada, April 2018. View at: Google Scholar
 O. Russakovsky, D. Jia, H. Su et al., “Imagenet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, 2015. View at: Publisher Site  Google Scholar
 A. Kuznetsova, R. Hassan, A. Neil et al., “The open images dataset v4,” International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020. View at: Google Scholar
 R. Ali Sharif, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features offtheshelf: an astounding baseline for recognition,” in Proceedings of the CVPR Workshops, Boston, MA, USA, April 2014. View at: Google Scholar
 Oriol Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in Proceedings of the NeurIPS, Barcelona, Spain, December 2016. View at: Google Scholar
 S. Ravi and H. Larochelle, “Optimization as a model for fewshot learning,” in Proceedings of the ICLR, San Juan, Puerto Rico, May 2016. View at: Google Scholar
 Jake Snell, S. Kevin, and R. Zemel, “Prototypical networks for fewshot learning,” in Proceedings of the NeurIPS, Barcelona, Spain, December 2017. View at: Google Scholar
 B. Hariharan and R. Girshick, “Low-shot visual recognition by shrinking and hallucinating features,” in Proceedings of the ICCV, Venice, Italy, October 2017. View at: Google Scholar
 E. Schwartz, L. Karlinsky, J. Shtok et al., “Delta-encoder: an effective sample synthesis method for few-shot object recognition,” in Proceedings of the NeurIPS, Montreal, Canada, December 2018. View at: Google Scholar
 S. Krishnagopal, Y. Aloimonos, and M. Girvan, “Similarity learning and generalization with limited data: a reservoir computing approach,” Complexity, vol. 2018, Article ID 6953836, pp. 1–15, 2018. View at: Publisher Site | Google Scholar
 W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo, “Revisiting local descriptor based image-to-class measure for few-shot learning,” in Proceedings of the CVPR, Long Beach, CA, USA, June 2019. View at: Google Scholar
 X. Dong, L. Zheng, F. Ma, Y. Yang, and D. Meng, “Few-example object detection with model communication,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1641–1654, 2019. View at: Google Scholar
 S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, “One-shot video object segmentation,” in Proceedings of the CVPR, Honolulu, HI, USA, July 2017. View at: Google Scholar
 X. Dong, L. Zhu, D. Zhang, Y. Yang, and F. Wu, “Fast parameter adaptation for few-shot image captioning and visual question answering,” in Proceedings of the ACM MM, Seoul, Republic of Korea, October 2018. View at: Google Scholar
 J. Choe, S. Park, K. Kim, J. H. Park, D. Kim, and H. Shim, “Face generation for low-shot learning using generative adversarial networks,” in Proceedings of the ICCV Workshops, Venice, Italy, October 2017. View at: Google Scholar
 Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang, “Exploit the unknown gradually: one-shot video-based person re-identification by stepwise learning,” in Proceedings of the CVPR, Salt Lake City, UT, USA, June 2018. View at: Google Scholar
 F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, “Learning to compare: relation network for few-shot learning,” in Proceedings of the CVPR, Salt Lake City, UT, USA, June 2018. View at: Google Scholar
 G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in Proceedings of the ICML Workshops, Lille, France, July 2015. View at: Google Scholar
 W. Chen, X. Chen, J. Zhang, and K. Huang, “Beyond triplet loss: a deep quadruplet network for person re-identification,” in Proceedings of the CVPR, Honolulu, HI, USA, July 2017. View at: Google Scholar
 F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: a unified embedding for face recognition and clustering,” in Proceedings of the CVPR, Boston, MA, USA, June 2015. View at: Google Scholar
 Q. Xiao, H. Luo, and C. Zhang, “Margin sample mining loss: a deep learning based method for person re-identification,” 2017, http://arxiv.org/abs/1710.00478. View at: Google Scholar
 J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin, “Deep metric learning with angular loss,” in Proceedings of the ICCV, Venice, Italy, October 2017. View at: Google Scholar
 X. He, Y. Zhou, Z. Zhou, S. Bai, and X. Bai, “Triplet-center loss for multi-view 3D object retrieval,” in Proceedings of the CVPR, Salt Lake City, UT, USA, June 2018. View at: Google Scholar
 B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in Proceedings of the ICLR, Toulon, France, April 2017. View at: Google Scholar
 J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017, http://arxiv.org/abs/1707.06347. View at: Google Scholar
 C. Finn, P. Abbeel, and S. Levine, “Modelagnostic metalearning for fast adaptation of deep networks,” in Proceedings of the ICML, Sydney, Australia, August 2017. View at: Google Scholar
 T. Munkhdalai and H. Yu, “Meta networks,” Proceedings of Machine Learning Research, vol. 70, pp. 2554–2563, 2017. View at: Google Scholar
 Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio, “Learning to remember rare events,” in Proceedings of the ICLR, Toulon, France, April 2017. View at: Google Scholar
 C. Qi, Y. Pan, T. Yao, C. Yan, and T. Mei, “Memory matching networks for one-shot image recognition,” in Proceedings of the CVPR, Salt Lake City, UT, USA, June 2018. View at: Google Scholar
 M. A. Jamal and G.-J. Qi, “Task agnostic meta-learning for few-shot learning,” in Proceedings of the CVPR, Long Beach, CA, USA, June 2019. View at: Google Scholar
 S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Proceedings of the CVPR, San Diego, CA, USA, June 2005. View at: Google Scholar
 L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, “Fullyconvolutional siamese networks for object tracking,” in Proceedings of the ECCV, Amsterdam, The Netherlands, October 2016. View at: Google Scholar
 R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Proceedings of the CVPR, New York, NY, USA, June 2006. View at: Google Scholar
 J. Wang, Y. Song, T. Leung et al., “Learning fine-grained image similarity with deep ranking,” in Proceedings of the CVPR, Columbus, OH, USA, June 2014. View at: Google Scholar
 A. Li, T. Luo, T. Xiang, W. Huang, and L. Wang, “Few-shot learning with global class representations,” in Proceedings of the ICCV, Seoul, Republic of Korea, October 2019. View at: Google Scholar
 Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in Proceedings of the ECCV, Amsterdam, The Netherlands, October 2016. View at: Google Scholar
 A. Krizhevsky and G. Hinton, Learning Multiple Layers of Features from Tiny Images, Technical Report, University of Toronto, Toronto, Canada, 2009.
 B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum, “One shot learning of simple visual concepts,” in Proceedings of the Annual Meeting of the Cognitive Science Society, Boston, MA, USA, July 2011. View at: Google Scholar
 D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” in Proceedings of the ICLR, San Diego, CA, USA, May 2015. View at: Google Scholar
 Z. Li, F. Zhou, F. Chen, and H. Li, “Meta-SGD: learning to learn quickly for few-shot learning,” 2017, http://arxiv.org/abs/1707.09835. View at: Google Scholar
 V. Garcia and J. Bruna, “Few-shot learning with graph neural networks,” in Proceedings of the ICLR, Vancouver, BC, Canada, April 2018. View at: Google Scholar
 B. Oreshkin, P. Rodríguez López, and A. Lacoste, “TADAM: task dependent adaptive metric for improved few-shot learning,” in Proceedings of the NeurIPS, Montreal, Canada, December 2018. View at: Google Scholar
 N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, “A simple neural attentive meta-learner,” in Proceedings of the ICLR, Vancouver, BC, Canada, April 2018. View at: Google Scholar
 Z. Peng, Z. Li, J. Zhang, Y. Li, G.-J. Qi, and J. Tang, “Few-shot image recognition with knowledge transfer,” in Proceedings of the ICCV, Seoul, Republic of Korea, October 2019. View at: Google Scholar
 S. Gidaris and N. Komodakis, “Generating classification weights with GNN denoising autoencoders for few-shot learning,” in Proceedings of the CVPR, Long Beach, CA, USA, June 2019. View at: Google Scholar
Copyright
Copyright © 2020 Jun Yue et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.