Abstract

Few-shot object recognition, which exploits a set of well-labeled data to build a classifier for new classes that have only a few samples per class, has received extensive attention from the machine learning community. In this paper, we investigate the problem of designing an optimal loss function for few-shot object recognition and propose a novel few-shot object recognition system that includes the following three steps: (1) generate a loss function architecture using a recurrent neural network (generator); (2) train a base embedding network with the generated loss function on a training set; (3) fine-tune the base embedding network using the few-shot instances from a validation set to obtain the accuracy and use it as a reward signal to update the generator. This procedure is repeated and implemented in the reinforcement learning framework for finding the best loss architecture such that the embedding network yields the highest validation accuracy. Our key insight is to create a search space of loss function architectures and evaluate the quality of a particular loss function on the dataset of interest. We conduct experiments on three popular datasets for few-shot learning. The results show that the proposed approach achieves better performance than state-of-the-art methods.

1. Introduction

The object recognition problem under conventional supervised learning has been thoroughly studied with many successful models available, especially deep convolutional neural networks (DCNNs), including VGG [1], GoogLeNet [2], ResNet [3], and NASNet [4]. DCNNs can achieve impressive performance on large datasets like ImageNet [5] and Open Images [6]. However, in order to train a DCNN model, large amounts of labeled training data and training time are typically needed. Some works use pretrained DCNNs as off-the-shelf deep feature extractors [7]. While this procedure can reduce training time, it still requires lots of labeled images to fine-tune the models. This is in contrast with how a human vision recognition system works: a person does not need to see thousands of instances of an object, but only a small number of them, to remember and generalize for the recognition of the object later [8].

Motivated by the high capability of the human visual recognition system, one-/few-shot learning has attracted considerable attention from the computer vision community, including work on one-shot or few-shot learning for object recognition [9–14], object detection [15], image segmentation [16], image captioning [17], face identification [18], and person reidentification [19]. Different from standard supervised learning, few-shot learning aims to solve the problem where only a few labeled samples are available for training, while its extreme scenario, called one-shot learning, handles the more challenging situation where only one instance per class is available. Because it is difficult to train a classifier that achieves a desired result using only one or a few instances per class, studies on one-/few-shot object recognition have focused on training it with a set of well-labeled data and then generalizing it to new classes [10, 20].

For one-/few-shot object recognition, due to the scarcity of the labeled examples from each new class, a natural metric learning scheme is adopted in [21], which aims to learn object representation and categorize instances based on the absolute similarities between pairs of instances. This absolute similarity-based method, however, does not pay enough attention to the interclass and intraclass correlation, which may not result in a satisfactory accuracy.

Instead of absolute similarity, the scheme of relative similarity-based classification is more capable of decreasing intraclass variations and increasing interclass variations [22]. While it is difficult to separate or group objects based only on an absolute similarity measure, relative similarities can conveniently deal with these variations by requiring the intraclass similarity to be larger than the interclass similarity. Figure 1 shows an example of the advantage of relative similarity. There have been many metric learning methods based on relative similarity [23–26]. However, different metric learning losses can lead to different performances, and there is no rule of thumb for designing the best loss for a dataset.

Hence, in this paper, we develop a novel object representation-based metric learning method that induces suitable relative similarities with a particular loss generated by loss architecture search (LAS), inspired by neural architecture search (NAS) [4, 27]. LAS, as shown in Figure 2, is a gradient-based method for finding good loss architectures. We use a recurrent neural network as the loss function architecture generator to generate a suitable loss for training the embedding network. The generator is updated with a reward signal, which measures how good the metric learning loss is based on the relative similarity. To obtain the reward signal, the following two steps are carried out. First, we train the embedding network with the generated loss using the training set. Second, we use the few-shot instances from the validation set to fine-tune the embedding network and obtain the validation accuracy. With this accuracy as the reward signal, we can compute the policy gradient (i.e., the gradient of the expected reward with respect to the generator parameters) by proximal policy optimization (PPO) [28] to update the loss generator. As a result, in the next iteration, the generator gives higher probabilities to those losses that drive the embedding network to obtain higher accuracies. In other words, the generator learns to improve its search over time. After obtaining the final loss function with the highest validation accuracy, the few-shot object recognition system is completed by updating the embedding network with the training set and the testing set (see the right part of Figure 2).

The proposed approach trains the object embedding network on the well-labeled training set under a loss function generated by LAS, such that the instances from new classes can be categorized based on their similarities with the few-shot instances in the learned embedding space. Our main contributions are threefold: (1) we create a search space for automatically determining the hyperparameters of metric learning losses, including their magnitudes, margins, and angle; (2) we present a few-shot recognition system where a proximal policy optimization algorithm is used to find the best loss architecture sampled from the search space such that the embedding network yields the highest validation accuracy; (3) our few-shot object recognition system with LAS by reinforcement learning obtains the best results compared with recent state-of-the-art methods.

2. Related Work

2.1. Few-Shot Learning

Recent works for few-shot learning have adopted different strategies in different domains [8, 9, 11, 18, 21, 29–33], among which the most relevant ones are the metric learning-based methods [8, 21]. The Siamese network developed in [21] aims to learn absolute similarity between pairs of objects. It enlarges or lessens the similarity between a pair of samples according to whether they are from the same class. The authors of [8] propose a matching network which can also be considered as a metric learning-based method. This approach minimizes the cosine distance-based Siamese loss between the embeddings of instances with a bidirectional Long Short-Term Memory (bi-LSTM). Another line of research uses metalearning for few-shot learning, which trains a metalearner from many relevant tasks. In [29], model agnostic metalearning is proposed to train a model that can quickly adapt to a new task with limited training data. The authors of [9] propose an LSTM-based metalearner model to train another learner in the few-shot regime. Another metalearning-based method is proposed in [30], which learns to rapidly parameterize an underlying neural network by employing metainformation. The memory-augmented neural network is adopted in [31] for few-shot learning. It enables each instance to be encoded and retrieved efficiently by a memory module. This module consists of key-value pairs where keys and values are the embeddings and labels of examples, respectively. Another work from [32] also uses a memory module for training, with a bi-LSTM to predict the parameters of the embedding model. The authors of [33] propose a task agnostic approach to improve the generalizability of few-shot metalearning. In addition to the research mentioned above, a few other works generate synthetic data for learning [11, 18]. Different from these previous methods, we design our few-shot object recognition system in the reinforcement learning framework, which automatically determines the optimal loss function for training the embedding network.

2.2. Metric Learning

Metric learning aims to learn a similarity function from examples, which measures the similarity between object pairs. Recently, quite a few methods have been proposed for metric learning. Some works [34–36] adopt pairwise constraints to train the models. As discussed in Section 1, these models’ capacity is limited due to the use of absolute similarity. Most recent works [23, 26, 37] use DCNNs as embedding functions and use triplet-based constraints instead of pairwise constraints to capture similarities between objects. Their results show that the combination of a deep embedding model with a triplet-based constraint is effective in learning similarity functions. Nevertheless, the problems solved in these works still fall into the conventional supervised learning setting where a set of well-labeled data are needed. In our work, instead of manually designing a metric learning loss and training with a large amount of labeled data, we use LAS for metric learning, which can automatically determine the hyperparameters of relative similarity-based losses such as margin, magnitude, and angle.

2.3. AutoML

The success of deep learning in computer vision tasks is largely due to its automation of the feature extraction process; hierarchical feature extractors are learned in an end-to-end fashion from data rather than being manually designed. To obtain better performance and handle more difficult problems, researchers have manually developed more and more complex networks. To automate the design of network architectures, neural architecture search [4, 27] is thus a logical next step in machine learning. NAS can be seen as a subfield of AutoML. The work from [38] proposes a method to automatically determine the parameters of a softmax-based loss architecture by using AutoML. In our work, we propose LAS for finding the best architecture of the metric learning-based loss function for few-shot learning, which can be seen as another subfield of AutoML. We believe that combining NAS and LAS will push AutoML further, which we leave as future work.

3. Loss Architecture Search

We formulate the problem of finding the best loss architecture as a discrete search problem. In our search space, a loss architecture consists of five subloss architectures, each of which has two hyperparameters: (1) the magnitude of the operation and (2) the margin or angle of the subloss.

3.1. Losses

We choose five promising losses to construct the search space: Hard Mining Triplet Loss (HMTL) [23], Margin Sample Mining Loss (MSML) [24], Angular Loss (AL) [25], Triplet Center Loss (TCL) [26], and Quadruplet Center Loss (QCL). The first four come from the metric learning community and have achieved state-of-the-art results in few-shot learning, face recognition, vehicle reidentification, person reidentification, and so forth. The last one, QCL, is proposed in this paper as a complement to the first four losses.

3.1.1. Hard Mining Triplet Loss

The triplet loss is based on a relative distance comparison among triplet instances. Specifically, for any triplet $(x_i^a, x_i^p, x_i^n)$, where $x_i^a$ and $x_i^p$ are two distinct samples of the same class and $x_i^n$ comes from another class as a negative example, their embedding vectors $f(x_i^a)$, $f(x_i^p)$, and $f(x_i^n)$ are generated by an embedding function $f(\cdot)$. The relative distance constraint of each triplet is defined as

$$D\big(f(x_i^a), f(x_i^p)\big) + \alpha < D\big(f(x_i^a), f(x_i^n)\big), \tag{1}$$

where $\alpha$ is a predefined constant parameter representing the minimum margin between the matched and mismatched pairs, $D(\cdot, \cdot)$ is a metric function that measures the distance between two vectors, and $i$ denotes the $i$-th triplet. In order to ensure fast convergence, it is crucial to select triplets that violate the triplet constraint in equation (1). This means that, given $x_i^a$, we want to select an $x_i^p$ (hard positive sample) that maximizes $D\big(f(x_i^a), f(x_i^p)\big)$ and an $x_i^n$ (hard negative sample) that minimizes $D\big(f(x_i^a), f(x_i^n)\big)$. For each sample $x_i^a$ (as an anchor) from a batch of size $N$, we select a hard-positive sample and a hard-negative sample (also from the batch) to construct a triplet. Thus, the HMTL function can be defined as

$$L_{\text{HMTL}} = \frac{1}{N} \sum_{i=1}^{N} \Big[ \max_{x^p} D\big(f(x_i^a), f(x^p)\big) - \min_{x^n} D\big(f(x_i^a), f(x^n)\big) + \alpha \Big]_+ , \tag{2}$$

where $[\cdot]_+ = \max(\cdot, 0)$ denotes the hinge function and the maximum and minimum are taken over the positive and negative samples of $x_i^a$ within the batch.
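As an illustration, a minimal PyTorch-style sketch of this batch-hard mining scheme follows; the function name, the use of Euclidean distance, and the assumption that every class appears at least twice in the batch are our own choices for the example rather than details taken from [23].

```python
import torch

def hard_mining_triplet_loss(embeddings, labels, margin=0.5):
    """Batch-hard triplet loss sketch: for each anchor, use its farthest
    positive and nearest negative within the batch (Euclidean distance assumed;
    each class is assumed to appear at least twice in the batch)."""
    dist = torch.cdist(embeddings, embeddings, p=2)              # (N, N) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)            # (N, N) same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # Hardest positive: largest distance among same-class pairs (excluding self).
    hardest_pos = (dist * (same & ~eye)).max(dim=1).values
    # Hardest negative: smallest distance among different-class pairs.
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values

    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```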

3.1.2. Margin Sample Mining Loss

The quadruplet loss (QL) [22] extends the triplet loss by adding a second negative pair. A quadruplet contains four different instances $(x_i, x_j, x_k, x_l)$, where $x_i$ and $x_j$ are samples of the same class, while $x_k$ and $x_l$ are samples of two other classes. The quadruplet loss is formulated as

$$L_{\text{QL}} = \frac{1}{N_q} \sum_{(i,j,k,l)} \Big[ D\big(f(x_i), f(x_j)\big) - D\big(f(x_i), f(x_k)\big) + \alpha_1 \Big]_+ + \frac{1}{N_q} \sum_{(i,j,k,l)} \Big[ D\big(f(x_i), f(x_j)\big) - D\big(f(x_l), f(x_k)\big) + \alpha_2 \Big]_+ , \tag{3}$$

where $\alpha_1$ and $\alpha_2$ are the margins of the two terms, respectively, and $N_q$ stands for the number of quadruplets. The first term is the same as the triplet loss. The second term tries to enforce intraclass distances to be smaller than interclass distances [22]. Based on QL, a strategy named margin sample mining is proposed [24]. It picks the most dissimilar positive pair and the most similar negative pair in the whole batch and formulates its loss as

$$L_{\text{MSML}} = \Big[ \max_{y_a = y_p} D\big(f(x_a), f(x_p)\big) - \min_{y_m \ne y_n} D\big(f(x_m), f(x_n)\big) + \alpha \Big]_+ , \tag{4}$$

where the maximum is taken over all positive pairs and the minimum over all negative pairs in the batch.

As shown in equation (4), the constraints of MSML are extremely sparse: only two pairs in a batch contribute to the gradient of the embedding network at each training step, which may seem to waste a lot of training data. In fact, the two chosen pairs are determined by all the data in the batch, and more and more different pairs are selected as training proceeds [24].
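The following sketch illustrates this mining scheme over one batch, again assuming Euclidean distances and a batch that contains at least one positive pair; names and defaults are illustrative.

```python
import torch

def margin_sample_mining_loss(embeddings, labels, margin=0.5):
    """MSML sketch: compare the hardest positive pair with the hardest
    negative pair over the whole batch; only these two pairs receive a gradient."""
    dist = torch.cdist(embeddings, embeddings, p=2)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    hardest_pos = dist[same & ~eye].max()   # most dissimilar positive pair
    hardest_neg = dist[~same].min()         # most similar negative pair
    return torch.relu(hardest_pos - hardest_neg + margin)
```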

3.1.3. Angular Loss

Let a triplet $(x_a, x_p, x_n)$ form a triangle in the embedding space whose sides are denoted as $e_{ap}$, $e_{an}$, and $e_{pn}$. The original triplet constraint in equation (1) enforces that $e_{an}$ is longer than $e_{ap}$. Because the anchor and positive samples are of the same class, a symmetrical triplet constraint can be derived, which enforces that $e_{pn}$ is also longer than $e_{ap}$. According to the cosine formula, it can be proved that the angle $\angle n$ surrounded by the longer edges $e_{an}$ and $e_{pn}$ has to be the smallest one; that is, $\angle n \le \min(\angle a, \angle p)$. Furthermore, because $\angle a + \angle p + \angle n = 180^\circ$, $\angle n$ has to be less than $60^\circ$, which can be used to constrain the upper bound for each triplet triangle [25]:

$$\angle n \le \alpha, \tag{5}$$

where $\alpha$ is a predefined parameter. However, a straightforward implementation of equation (5) becomes unstable in some special cases [25]. In order to solve this problem, the angular loss function is defined in the following equation:

$$L_{\text{AL}} = \frac{1}{N} \sum_{i=1}^{N} \Big[ \big\| f(x_i^a) - f(x_i^p) \big\|^2 - 4 \tan^2 \alpha \, \big\| f(x_i^n) - f_i^c \big\|^2 \Big]_+ , \tag{6}$$

where $f_i^c = \big(f(x_i^a) + f(x_i^p)\big)/2$ is the midpoint of the positive pair and $N$ is the number of triplets, which is also equal to the batch size in our implementation.
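A small sketch of equation (6), assuming the triplets are already arranged as three aligned tensors of embeddings; the default angle is an arbitrary illustrative value.

```python
import math
import torch

def angular_loss(anchor, positive, negative, alpha_deg=45.0):
    """Angular loss sketch (per [25]): bound the angle at the negative point
    of each triplet triangle by alpha. Inputs are (N, d) embedding tensors."""
    alpha = math.radians(alpha_deg)
    center = (anchor + positive) / 2.0                    # midpoint of the anchor-positive edge
    sq_ap = ((anchor - positive) ** 2).sum(dim=1)         # ||f(x_a) - f(x_p)||^2
    sq_nc = ((negative - center) ** 2).sum(dim=1)         # ||f(x_n) - f_c||^2
    return torch.relu(sq_ap - 4.0 * math.tan(alpha) ** 2 * sq_nc).mean()
```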

3.1.4. Triplet Center Loss

The goal of TCL is to leverage the advantages of the triplet loss and the center loss [39], that is, to effectively decrease the intraclass distances and increase the interclass distances simultaneously. Let a given training batch consist of $M$ samples $\{x_i\}_{i=1}^{M}$ with the associated labels $\{y_i\}_{i=1}^{M}$. In TCL, it is assumed that the features of each class share one center. Thus, the TCL is defined as

$$L_{\text{TCL}} = \sum_{i=1}^{M} \Big[ D\big(f(x_i), c_{y_i}\big) + m - \min_{j \ne y_i} D\big(f(x_i), c_j\big) \Big]_+ , \tag{7}$$

where $c_{y_i}$ is the center of class $y_i$, $\min_{j \ne y_i} D\big(f(x_i), c_j\big)$ is the distance to the negative center nearest to $f(x_i)$, and $m$ is a margin.
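A sketch of equation (7) with Euclidean distances; maintaining and updating the class centers (e.g., as learnable parameters, as in the center loss [39]) is omitted here and the function name is illustrative.

```python
import torch

def triplet_center_loss(embeddings, labels, centers, margin=0.5):
    """TCL sketch: pull each sample toward its class center and push it away
    from the nearest other-class center. `centers` has one row per class."""
    dist_to_centers = torch.cdist(embeddings, centers, p=2)            # (M, C)
    pos = dist_to_centers.gather(1, labels.view(-1, 1)).squeeze(1)     # D(f(x_i), c_{y_i})
    masked = dist_to_centers.scatter(1, labels.view(-1, 1), float("inf"))
    nearest_neg = masked.min(dim=1).values                             # nearest negative center
    return torch.relu(pos + margin - nearest_neg).sum()
```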

3.1.5. Quadruplet Center Loss

In equation (7), TCL aims to make the distance between the anchor $f(x_i)$ and its corresponding center $c_{y_i}$ smaller than the distance between the anchor and its nearest negative center. For example, in Figure 3(a), the anchor can be pushed to its center. However, in some cases such as the one in Figure 3(b), it cannot, even though it is still far away from its center. In order to solve this problem, we propose a quadruplet center loss:

$$L_{\text{QCL}} = \sum_{i=1}^{M} \Big[ D\big(f(x_i), c_{y_i}\big) + m_1 - \min_{j \ne y_i} D\big(f(x_i), c_j\big) \Big]_+ + \sum_{i=1}^{M} \Big[ D\big(f(x_i), c_{y_i}\big) + m_2 - \min_{j \ne k,\; j, k \ne y_i} D\big(c_j, c_k\big) \Big]_+ , \tag{8}$$

where $(c_j, c_k)$ is a pair of centers of classes other than $y_i$ and $m_1$ and $m_2$ are margins. Thus, the proposed loss function guarantees that an anchor that is not close enough to its center will be moved closer, while an anchor that is already close enough to its center will be neglected.

3.2. Search Space

Our key insight in constructing a search space and automatically finding an optimal loss in it lies in three observations: (1) designing a loss architecture requires a lot of expert knowledge; (2) the available losses are all task- and dataset-dependent; (3) there is no rule of thumb for choosing the hyperparameters of each loss.

Therefore, we use the five loss architectures to construct the search space. Each loss comes with a magnitude. We discretize the range of magnitudes into 11 uniformly spaced values from 0.0 to 1.0 with a step of 0.1 so that we can use a discrete search algorithm to find them. Similarly, we discretize the margin of each loss that has a margin into 51 uniformly spaced values from 0.0 to 5.0 with a step of 0.1. For the angular loss, we discretize the angle into 51 uniformly spaced values from 10° to 60° with a step of 1°. Finding an optimal loss architecture now becomes a search problem in a space of $(11 \times 51)^5 \approx 5.6 \times 10^{13}$ possibilities. Notice that there is no explicit discard action in our search space; this action is implicit and can be achieved by setting the magnitude of a subloss architecture to 0.
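For concreteness, the discretized space can be written down as follows. How the selected sublosses are combined into one training objective is not spelled out above, so the magnitude-weighted sum in `combined_loss`, as well as all names, is our own assumption for this sketch.

```python
import random

# Discretized hyperparameter grids for the five subloss architectures.
MAGNITUDES = [round(0.1 * i, 1) for i in range(11)]   # 0.0, 0.1, ..., 1.0
MARGINS    = [round(0.1 * i, 1) for i in range(51)]   # 0.0, 0.1, ..., 5.0
ANGLES     = list(range(10, 61))                      # 10, 11, ..., 60 (degrees)
SUBLOSSES  = ["HMTL", "MSML", "AL", "TCL", "QCL"]

def sample_loss_architecture(rng=random):
    """Sample one point of the search space: a (magnitude, margin-or-angle)
    pair per subloss. A magnitude of 0.0 implicitly discards that subloss."""
    return {name: (rng.choice(MAGNITUDES),
                   rng.choice(ANGLES if name == "AL" else MARGINS))
            for name in SUBLOSSES}

def combined_loss(arch, subloss_fns, embeddings, labels):
    """Assemble the searched objective as a magnitude-weighted sum of sublosses
    (assumed combination rule). `subloss_fns[name](emb, lab, hp)` computes one
    subloss with margin or angle `hp`."""
    return sum(mag * subloss_fns[name](embeddings, labels, hp)
               for name, (mag, hp) in arch.items() if mag > 0.0)
```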

3.3. Search Algorithm

The search algorithm we use to find the optimal loss architecture is a reinforcement learning algorithm (specifically, the proximal policy optimization (PPO)), inspired by [4, 27]. The loss generator is a one-layer LSTM with 100 hidden units. As shown in Figure 4, at each step, the generator takes an action according to the maximum probability produced by a softmax; the output is then fed into the next step. In total, the generator has 10 softmax layers in order to predict 5 subloss architectures, each with 2 actions (selecting a magnitude and a margin/angle).
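A minimal sketch of such a controller is given below. The layer sizes follow the text; embedding the previous decision as the next input and sampling from the softmax (rather than greedily taking its maximum, as the description above suggests) are our assumptions, made so that the policy-gradient update has a log-probability to work with.

```python
import torch
import torch.nn as nn

class LossGenerator(nn.Module):
    """One-layer LSTM controller with 100 hidden units and 10 softmax heads:
    a (magnitude, margin-or-angle) decision for each of the 5 sublosses."""
    def __init__(self, hidden=100, n_magnitudes=11, n_margins=51):
        super().__init__()
        self.hidden = hidden
        self.cell = nn.LSTMCell(hidden, hidden)
        self.embed = nn.Embedding(max(n_magnitudes, n_margins), hidden)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_magnitudes if t % 2 == 0 else n_margins)
             for t in range(10)])

    def forward(self):
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        x = torch.zeros(1, self.hidden)              # start token
        actions, log_probs = [], []
        for head in self.heads:
            h, c = self.cell(x, (h, c))
            dist = torch.distributions.Categorical(logits=head(h))
            a = dist.sample()                        # decision at this step
            actions.append(a.item())
            log_probs.append(dist.log_prob(a))
            x = self.embed(a)                        # feed the decision into the next step
        return actions, torch.stack(log_probs).sum()
```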

As shown in Figure 2, the generator is trained with a reward signal, which measures how good the loss architecture is at improving the accuracy of the object embedding network. After generating a loss architecture with probability , the embedding network is trained on the training set and fine-tuned using the few-shot instances. It is then evaluated on the validation set to measure the validation accuracy (the reward signal). At the end of the search, we obtain the best loss architecture with the highest accuracy on the validation set, which is then used to train the final embedding network.
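Putting the pieces together, the outer search loop looks roughly as follows. The paper updates the generator with PPO [28]; to keep the sketch short we show a plain REINFORCE step with a moving-average baseline instead, and the helper names (`train_embedding_fn`, `evaluate_fn`) as well as the learning rate are placeholders for the procedures described above.

```python
import torch

def loss_architecture_search(generator, train_embedding_fn, evaluate_fn,
                             steps=5000, lr=1e-3):
    """Outer LAS loop (sketch). `train_embedding_fn(arch)` trains the embedding
    network under the sampled loss; `evaluate_fn(net)` fine-tunes it on few-shot
    validation instances and returns the mean accuracy used as the reward."""
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    baseline, best_arch, best_reward = 0.0, None, -1.0
    for _ in range(steps):
        arch, log_prob = generator()                  # sample a loss architecture
        net = train_embedding_fn(arch)                # train on the training set
        reward = evaluate_fn(net)                     # few-shot validation accuracy
        baseline = 0.95 * baseline + 0.05 * reward    # moving-average baseline
        surrogate = -(reward - baseline) * log_prob   # policy-gradient surrogate
        opt.zero_grad()
        surrogate.backward()
        opt.step()
        if reward > best_reward:
            best_reward, best_arch = reward, arch
    return best_arch
```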

4. Experiments and Results

To evaluate the effectiveness of the proposed approach, we conduct experiments on three popular datasets for few-shot learning: CIFAR-100 [40], Omniglot [41], and mini-ImageNet [9]. The Omniglot dataset is the most popular few-shot object recognition benchmark of handwritten characters, and the mini-ImageNet dataset is a recently released subset of ImageNet [5].

4.1. Datasets
4.1.1. CIFAR-100 Dataset

The CIFAR-100 dataset [40] contains 60,000 images with 100 classes. We use 60, 20, and 20 classes for training, validation, and testing, respectively. The validation set is used to generate validation accuracies as the reward signals for LAS.

4.1.2. Omniglot Dataset

This dataset contains 32,460 images of handwritten characters, with 1,623 different characters within 50 alphabets. We follow the most common split in [8], splitting the dataset into a background set of 1200 classes and a testing set of 423 classes. We further split the background set into a training set of 800 classes and a validation set of 400 classes.

4.1.3. mini-ImageNet Dataset

The mini-ImageNet dataset is proposed by [9] as a benchmark with images of much higher resolutions and complexity. This dataset contains 100 classes randomly sampled from the ImageNet dataset, and each class contains 600 images. It is further split into a training set of 64 classes, a validation set of 16 classes, and a testing set of 20 classes [9].

4.2. Experimental Settings
4.2.1. Evaluation Setting

Typically, a few-shot learning task with N new classes and k instances per class is referred to as an N-way k-shot task. In this paper, all experiments are for N-way k-shot object recognition, and the evaluation setting is similar to those of the compared methods. For each iteration in LAS (see the left part of Figure 2), we train the embedding network with the generated loss using the training set. To evaluate the generated loss, we randomly select a support set consisting of N classes with k labeled images per class from the validation set for fine-tuning the embedding network. Then, we randomly select 15 images (disjoint from the support set) within the selected classes from the validation set to evaluate the fine-tuned embedding network by 1-Nearest Neighbor (1NN). We repeat this evaluation procedure 5 times for each support set and use the mean accuracy as the final validation accuracy (i.e., the reward signal) for the generated loss.

After obtaining the final loss architecture for one dataset, we use the testing procedure to obtain the final accuracy (see the right part of Figure 2). We first train the embedding network with the final loss using the same training set. In the testing stage, we fine-tune the embedding network using a randomly selected support set from the testing set. After that, we randomly select 15 images (disjoint from the support set) within the selected classes and then measure the classification accuracy by 1NN. To make the accuracy more convincing, we repeat such a testing procedure 600 times and report the final mean accuracy as well as the 95% Confidence Intervals (CIs).
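A sketch of this episodic testing loop, assuming an `embed_fn` that maps raw images to NumPy embedding vectors and a hypothetical `sample_episode` helper that draws one support/query split; the fine-tuning step on the support set is omitted for brevity.

```python
import numpy as np

def evaluate_episodes(embed_fn, sample_episode, n_episodes=600):
    """Episodic evaluation sketch: in each episode, classify the query images
    by 1-nearest-neighbor against the support embeddings, then report the mean
    accuracy and its 95% confidence interval over all episodes."""
    accs = []
    for _ in range(n_episodes):
        support_x, support_y, query_x, query_y = sample_episode()
        s_emb, q_emb = embed_fn(support_x), embed_fn(query_x)
        # 1-NN: assign each query the label of its closest support embedding.
        d = np.linalg.norm(q_emb[:, None, :] - s_emb[None, :, :], axis=2)
        pred = support_y[np.argmin(d, axis=1)]
        accs.append(np.mean(pred == query_y))
    accs = np.array(accs)
    mean = accs.mean()
    ci95 = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))   # 95% CI of the mean
    return mean, ci95
```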

4.2.2. Network Architecture and Parameter Setting

Our embedding network follows the same architecture as that used in [8, 9], which has four convolutional layers (Conv-4). Each convolutional layer is designed with 3 × 3 convolution with 64 filters followed by batch normalization, ReLU nonlinearity, and max pooling.
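For reference, a sketch of this backbone: the 3 × 3 kernels and 64 filters follow the description above, while the padding and the final flattening into a vector are our assumptions.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One Conv-4 block: 3x3 convolution, batch norm, ReLU, 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2))

class Conv4Embedding(nn.Module):
    """Four-layer convolutional embedding network (Conv-4), as used in [8, 9]."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_block(in_channels, 64),
            conv_block(64, 64),
            conv_block(64, 64),
            conv_block(64, 64))

    def forward(self, x):
        return self.encoder(x).flatten(start_dim=1)   # flatten to an embedding vector
```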

On each dataset, the generator samples 5,000 loss architectures. We follow the training procedure and hyperparameters from [4] for training the generator. For training the embedding network, we use Adam [42] to perform stochastic optimization over the learning objective, with the Adam hyperparameters $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, respectively.

4.3. Ablation Study

Since LAS integrates the five sublosses into the reinforcement learning framework, we conduct an ablation study on CIFAR-100 to understand the contribution of each subloss. In each case, we remove one subloss from the search space and perform the whole process of loss architecture search as shown in Figure 2. The results are presented in Table 1.

From Table 1, we can see that removing any one subloss results in an obvious performance drop compared with using all of them. This implies that the five losses are not redundant. In addition, removing QCL, the loss proposed in this paper, causes the most significant performance reduction on this dataset, which shows that QCL may contribute the most among the five losses.

4.4. Compared Methods

To verify the performance of our LAS for few-shot learning, we compare it with the following state-of-the-art methods: (1) Siamese Network (SN) [21], which presents a strategy for performing few-shot classification by learning a deep convolutional embedding network with a pairwise Siamese loss, (2) Matching Network (MN) [8], which performs few-shot learning by embedding a small labeled support set and unlabeled examples, (3) Matching Network with Full Contextual Embeddings (MN-FCE) [8], which is an upgraded version of MN by utilizing a bi-LSTM to contextually embed samples, (4) Model Agnostic Metalearning (MAML) [29], which learns hyperparameters through gradient descent in a metalearning fashion, (5) Metalearner LSTM (ML-LSTM) [9], which uses an LSTM as a metalearner to learn the parameter update rule for optimizing the network, (6) Siamese with Memory (SM) [31], which proposes a large-scale life-long memory module to remember past training samples and makes predictions based on stored previous samples, (7) Metanetwork (Meta-N) [30], which learns metalevel knowledge across tasks for rapid generalization, (8) Meta-Stochastic Gradient Descent (Meta-SGD) [43], which learns the initialization, update direction, and learning rates by metalearning, (9) Prototypical Networks (PN) [10], which learn a prototype representation for each class, (10) Relation Network (RN) [20], which computes relation scores between few-shot instances and testing instances, (11) Memory Matching Network (MM-Net) [32], which augments CNNs with memory and learns the network parameters for unlabeled images, (12) a variant of MM-Net [32] that uses a mixed training strategy, (13) Graph Neural Network (GNN) [44], which casts few-shot learning as a message passing task, (14) Task Agnostic Metalearning (TAML) [33], which learns an unbiased initial model with the largest uncertainty, (15) task-dependent adaptive metric (TADAM) [45], which proposes a practical end-to-end optimization procedure to learn a task-dependent metric space, (16) Simple Neural AttentIve Learner (SNAIL) [46], which proposes a generic metalearner architecture for few-shot learning, (17) Deep Nearest Neighbor Neural Network (DN4) [14], which replaces an image-level feature-based measure with a local feature-based image-to-class measure, and (18) deep learning with knowledge transfer architecture (KTN) [47], which jointly incorporates classifier learning, knowledge inferring, and visual feature learning into one framework.

Note that, for fair comparison, the backbones of LAS and all these compared methods are of the same architecture as that used in [8, 9], which is a shallow network with only 4 convolutional layers.

4.5. Results on CIFAR-100

Table 2 compares LAS with 4 state-of-the-art methods (MN, SM, MAML, and Meta-N) and the 5 sublosses described in the last section. The code of these 4 methods is publicly available and can thus be used to run this experiment. The results across the 5-way 1-shot task and the 5-way 5-shot task show that our LAS method obtains the best performance among these methods. In particular, the 1-shot and 5-shot accuracies of our method reach 55.85% and 71.91%, respectively, on 5-way learning, yielding absolute improvements of 6.56% and 5.87% over MAML.

Table 2 also shows the performances of 5 metric learning approaches for few-shot object recognition (HMTL [23], MSML [24], AL [25], TCL [26], and QCL). The results indicate that no single subloss dominates the loss searched by LAS.

4.6. Results on Omniglot

Table 3 compares LAS with 8 state-of-the-art methods on the Omniglot dataset. The results of these methods are from [14, 32, 33]. The results across the 20-way 1-shot task and the 20-way 5-shot task indicate that our LAS method again achieves the best performance against other state-of-the-art techniques, including embedding models (SN, MN, SM, PN, and MM-Net) and metalearning approaches (Meta-N, MAML, and TAML). In particular, the 1-shot and 5-shot accuracies of LAS reach 97.69% and 99.21% on 20-way learning, respectively, yielding absolute improvements of 1.89% and 0.31% over MAML, which is significant on this dataset.

4.7. Results on mini-ImageNet

The performance comparison with 14 state-of-the-art methods on mini-ImageNet is summarized in Table 4. The results of these methods are from [14, 20, 32, 33]. Our LAS method also performs the best. In particular, the 1-shot and 5-shot accuracies reach 54.97% and 71.92% on 5-way learning, respectively, yielding absolute improvements of 1.60% and 4.95% over MM-Net. Compared with the most recent work DN4, LAS improves the accuracy on the two tasks by 3.73% and 0.90%, respectively. Our LAS also gains 4.53% and 6.60% improvements over RN. RN computes relation scores between few-shot instances and the testing instances, while LAS finds the best metric learning-based loss to generate the best embedding functions.

To evaluate the proposed method on deeper networks, several experiments on mini-ImageNet are conducted using ResNet-12, a 12-layer residual network, as the backbone. The performance comparison with ResNet-12 as the backbone on mini-ImageNet is summarized in Table 5. The results across the 5-way 1-shot task and the 5-way 5-shot task show that the proposed method achieves the best performance against other state-of-the-art techniques. In particular, the 1-shot and 5-shot accuracies of LAS with ResNet-12 as the backbone reach 59.44% and 77.82% on 5-way learning, respectively, yielding absolute improvements of 9.08% and 10.48% over MAML, which is significant on this dataset.

4.8. Training and Inference Time

Similar to other reinforcement learning algorithms, the training of LAS is time-consuming. On an Nvidia Tesla P100 GPU, LAS takes about two days to find the best loss architecture on the mini-ImageNet dataset for the 5-way 5-shot task. After training, however, inference is fast; for example, it takes only 2 ms to perform one inference for the same task.

5. Conclusion

In this paper, we have proposed an automatic approach to loss architecture search for few-shot object recognition. Five metric learning-based sublosses (including one developed by us) are used to construct the search space, and the loss generator is trained by a reinforcement learning algorithm. Our experiments show that the proposed few-shot object recognition method outperforms other state-of-the-art methods on three popular benchmarks. Future work includes combining neural architecture search with loss architecture search for better AutoML.

Data Availability

The datasets and source codes used to support the findings of this study are available from the author upon request via e-mail.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 41701500, in part by the Natural Science Foundation of Hunan Province under Grants 2018JJ3641 and 2019JJ60001, and in part by Innovation-Driven Project of Central South University under Grant 2020CX036.