Abstract

Object detection plays an important role in many computer vision applications. Deep-learning-based object detection methods such as Faster R-CNN, YOLO, and SSD have achieved state-of-the-art detection accuracy. There have been few studies to date on object detection with the addition of new classes, however, even though this problem is often encountered in industry; the issue therefore has both research significance and practical value. On the premise that the old-class samples are available, we established a method of reserving nodes in advance in the output layer (RNOL) in this study. Experiments show that RNOL achieves high detection accuracy on both new and old classes over a short training time while outperforming the traditional fine-tuning method.

1. Introduction

Object detection involves two distinct tasks: object recognition and object localization. It is necessary not only to identify the class of an object in an image but also to locate the object within a rectangular area. In [1], for example, the object is only recognized, not located within a rectangular area. Object detection is a common component of artificial intelligence and information technology systems, including robot vision, unmanned aerial vehicle surveillance, automatic driving, intelligent video surveillance, and medical image analysis.

Many scholars have studied object detection. Most traditional methods are based on background subtraction [2–4]. Recently, numerous object detection methods based on deep learning have been developed, such as Faster R-CNN [5], YOLO v3 [6], and SSD [7], which achieve state-of-the-art detection accuracy. When new classes are added, however, it is very time-consuming to train an object detection model from scratch even when the old classes are available. How can the training time be reduced without sacrificing detection accuracy on both new and old classes? This problem is often encountered in industry, so it has both research significance and practical value.

Fine-tuning [8] is currently the most common method for solving the new-class addition problem. The fine-tuning method reuses the weights of the old model except for those of the last output layer. Although this method can train the model in a short time, its detection accuracy is relatively low.

In this study, we developed the reserving nodes in advance in the output layer (RNOL) method to solve the object detection problem when new classes are added, building on Faster R-CNN and the fine-tuning method. We conducted a series of experiments on the PASCAL VOC 2007 dataset to validate the proposed method. The results show that, on the premise that the old classes are available, RNOL can train the model well and quickly when new classes are added. RNOL also demonstrated higher detection accuracy on both new and old classes than fine-tuning, as discussed in detail below.

2. Related Work

Early object detection was mainly based on geometric principles first developed in the 1960s. With the emergence of neural network and support vector machine techniques, object detection methodology shifted from geometric to statistical. In recent years, advancements in computing and deep learning technology have brought about object detection frameworks based on deep learning such as R-CNN [8], Fast R-CNN [9], Faster R-CNN [5], YOLO [10], YOLO v2 [11], YOLO v3 [6], and SSD [7].

The new-class addition problem has a long history in the machine learning and artificial intelligence field [12–15]. The problem may be approached when the old classes are not available [16, 17] or when they are available; considerably less research has centered on the latter scenario. Rebuffi et al. [18] researched the problem using a small number of old-class samples. Existing methods do not achieve ideal detection accuracy on new or old classes, so it is difficult to meet industrial needs at present. In this paper, we discuss only scenarios wherein the old classes are available.

3. Reserving Nodes in Advance in the Output Layer

For object detection problems involving the addition of new classes, the RNOL method works by reserving an appropriate number of extra nodes in the output layer, so that the number of output nodes exceeds the number of old classes. To operate RNOL, we first use Faster R-CNN to build a model with the reserved nodes in the output layer and train the model on the old classes before saving the model and its weights. Then, when new classes are added, the saved model and weights are loaded and the model is trained on both the new and old classes. Finally, we use the fully trained model to detect the test samples. A minimal sketch of this two-stage workflow is given below.
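The following sketch illustrates the two-stage RNOL workflow on a simplified classification head; the full Faster R-CNN detector also contains a region-proposal network and a box-regression branch, which are omitted here. The layer sizes, file names, and feature dimensions are illustrative assumptions, not the authors' exact implementation.

```python
import tensorflow as tf

NUM_OLD = 10       # old classes trained in stage one
NUM_RESERVED = 10  # nodes reserved in advance for future classes

def build_head(num_outputs):
    # Simplified classifier head standing in for the Faster R-CNN
    # output layer; backbone and RPN are omitted for brevity.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu", input_shape=(4096,)),
        tf.keras.layers.Dense(num_outputs, activation="softmax"),  # includes reserved nodes
    ])

# Stage 1: train on the old classes with NUM_OLD + NUM_RESERVED output nodes.
model = build_head(NUM_OLD + NUM_RESERVED)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9, nesterov=True),
    loss="sparse_categorical_crossentropy")
# model.fit(old_features, old_labels, ...)  # old labels use indices 0..NUM_OLD-1
model.save_weights("old_model.weights.h5")

# Stage 2: new classes arrive; ALL weights, output layer included, are reused.
model.load_weights("old_model.weights.h5")
# model.fit(all_features, all_labels, ...)  # new classes fill indices NUM_OLD..NUM_OLD+k-1
```

By contrast, fine-tuning would discard and reinitialize the output-layer weights in stage two; the reserved nodes are what allow RNOL to carry the old output-layer weights over unchanged.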

A diagram of the RNOL method is shown in Figure 1, where hollow dots in the output layer represent reserved nodes. As mentioned above, the number of nodes in the output layer is larger than the number of old classes. The number of reserved nodes can be set manually; the method is effective as long as the number of new classes does not exceed the number of reserved nodes. In Figure 1, the person belongs to an old class and the horse to a new class. The proposed method thus resolves the problem of the coexistence of new and old classes. Compared to fine-tuning, the advantage of RNOL is that it can effectively utilize more of the old model's weight information, including the weights of the output layer. One plausible way to map new classes onto the reserved nodes is sketched below.
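The paper does not spell out how class labels are assigned to the reserved nodes, so the helper below is our assumption of one natural scheme: old classes keep their original indices, and each newly added class takes the next free reserved slot.

```python
def assign_class_indices(old_classes, new_classes, num_reserved):
    """Map class names to output-node indices: old classes keep their
    original indices; new classes fill the reserved slots in order."""
    if len(new_classes) > num_reserved:
        raise ValueError("more new classes than reserved nodes")
    index = {name: i for i, name in enumerate(old_classes)}
    for offset, name in enumerate(new_classes):
        index[name] = len(old_classes) + offset
    return index

# e.g., the first 10 VOC classes in alphabetical order as old classes,
# with "horse" added later into the first reserved slot
idx = assign_class_indices(
    ["aeroplane", "bicycle", "bird", "boat", "bottle",
     "bus", "car", "cat", "chair", "cow"],
    ["horse"], num_reserved=10)
assert idx["horse"] == 10
```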

The architecture of RNOL is consistent with that of Faster R-CNN, except for the number of nodes in the output layer. The activation function of the neurons in the output layer is softmax.

The old classes are denoted $C_{old}$, and the model trained on the old classes is denoted $M_{old}$; the new classes are denoted $C_{new}$, and the model trained on both new and old classes is denoted $M_{new}$.

4. Experiments

4.1. Datasets and Evaluation

We evaluated our method on the PASCAL VOC 2007 dataset, as mentioned above. VOC 2007 consists of 5K images in the trainval split and 5K images in the test split covering 20 object classes. We used the standard mean average precision (mAP) at an intersection-over-union (IoU) threshold of 0.5 as the evaluation metric; evaluation of the VOC 2007 experiments was conducted on the test split. A minimal sketch of the IoU computation underlying this metric is given below.
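For reference, the IoU between a predicted box and a ground-truth box, which the mAP@0.5 metric thresholds at 0.5, can be computed as follows; the `(x1, y1, x2, y2)` corner convention is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a true positive under mAP@0.5 when IoU >= 0.5
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.1428...
```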

4.2. Implementation Details

We randomly initialized all new layers by drawing weights from a zero-mean Gaussian distribution with a standard deviation of 0.01. We used stochastic gradient descent (SGD) with Nesterov momentum [19] to train the network in all experiments, with the learning rate set to 0.001, decayed to 0.0001 after 50K iterations, and a momentum of 0.9. In the second stage of training, i.e., learning the extended network with new classes, we used a learning rate of 0.001 decayed to 0.0001 after 10K iterations. The network was trained for 70K iterations on PASCAL VOC 2007; it was trained for 20K iterations when only one class was added and for 30K iterations when 10 classes were added simultaneously. For Faster R-CNN, we used batches of two images. All other layers (i.e., the shared convolutional layers) of the network were initialized from a model pretrained for ImageNet classification [20]. We implemented the method in TensorFlow [21]. A sketch of this optimizer setup is given below.
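A minimal sketch of the optimizer and learning-rate schedule described above, written against the Keras API shipped with TensorFlow; the exact schedule implementation in the original code is an assumption.

```python
import tensorflow as tf

# Step decay: 0.001 for the first 50K iterations, then 0.0001
# (the second training stage uses the same decay at 10K iterations).
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[50_000], values=[1e-3, 1e-4])

# SGD with Nesterov momentum of 0.9, as described in the paper
optimizer = tf.keras.optimizers.SGD(
    learning_rate=schedule, momentum=0.9, nesterov=True)

# New layers: zero-mean Gaussian initialization with standard deviation 0.01
initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)
new_output = tf.keras.layers.Dense(
    20, activation="softmax", kernel_initializer=initializer)
```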

4.3. Effects of RNOL

We sought to determine whether reserving nodes in advance in the output layer increases the computing time or affects the object detection accuracy compared to the traditional method.

We took 10 classes in alphabetical order from the VOC 2007 dataset and ran two experiments. In Experiment 1, we reserved 10 nodes in the output layer (i.e., the number of neurons in the output layer was 20). In Experiment 2, no positions were reserved in the output layer (i.e., the number of neurons in the output layer was 10).

The detection accuracies of the two experiments are shown in Table 1, and the training times are shown in Table 2. We observed no significant difference in test results or training time between the two experiments, which suggests that RNOL neither increases the training time nor affects the detection accuracy.

4.4. Addition of One Class

In this experiment, we took 19 classes in alphabetical order from the VOC 2007 dataset as $C_{old}$ and the remaining one as the only new class, $C_{new}$. We then trained the network $M_{old}$ on $C_{old}$ and the network $M_{new}$ on the VOC trainval containing all 20 classes. A summary of the evaluation of these networks on the VOC test set is shown in Table 3, and the full results are listed in Table 4.

As shown in Table 3, when applying the RNOL method on the basis of the old network $M_{old}$, the new network $M_{new}$ achieves 69.0% mAP after 20K iterations. When using the fine-tuning method on the basis of the old network without RNOL, the new network only achieves 68.1% mAP after 20K iterations. When applying the training from scratch (TFS) method, the network trained from scratch only achieves 59.2% mAP after 20K iterations; TFS needs 70K training iterations to reach ideal accuracy. These results suggest that RNOL outperforms both fine-tuning and TFS: when adding one class, it yields higher accuracy in a shorter training time.

We next compared the RNOL and fine-tuning methods on the new network $M_{new}$ when adding one class. Each was trained for 30K iterations, the weights were saved every 5K iterations, and each saved set of weights was evaluated on the test set (a sketch of this checkpoint evaluation appears below). The test results are shown in Figure 2: the RNOL method achieves its highest detection accuracy at 20K training iterations and then begins to decline, while the fine-tuning accuracy increases slowly over the experiment but does not surpass that of the RNOL method.
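The checkpoint-and-evaluate loop used to produce curves like Figure 2 can be sketched as follows; `train_steps` and `evaluate_map` are hypothetical helpers standing in for the training loop and the VOC mAP@0.5 evaluation, which the paper does not detail.

```python
def train_steps(model, num_steps):
    """Hypothetical helper: run num_steps optimizer steps (body omitted)."""

def evaluate_map(model, dataset):
    """Hypothetical helper: compute VOC mAP@0.5 on dataset (body omitted)."""
    return 0.0

# Evaluate saved checkpoints every 5K iterations over a 30K-iteration run.
CHECKPOINT_EVERY = 5_000
TOTAL_ITERS = 30_000

map_curve = []
for step in range(CHECKPOINT_EVERY, TOTAL_ITERS + 1, CHECKPOINT_EVERY):
    train_steps(model, num_steps=CHECKPOINT_EVERY)
    model.save_weights(f"ckpt_{step}.weights.h5")
    map_curve.append((step, evaluate_map(model, voc_test)))

# map_curve now holds (iteration, mAP) pairs for plotting Figure 2-style curves.
```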

4.5. Addition of Multiple Classes

In this experiment, we took 10 classes in alphabetical order from the VOC 2007 dataset as $C_{old}$ and the remaining 10 classes as $C_{new}$. We then trained the network $M_{old}$ on $C_{old}$ and the network $M_{new}$ on the VOC trainval containing all 20 classes. A summary of the evaluation of these networks on the VOC test set is shown in Table 5, and the full results are listed in Table 6.

As shown in Table 5, on the basis of the old network $M_{old}$ with RNOL, the new network $M_{new}$ achieves 68.2% mAP after 30K iterations. On the basis of the old network with fine-tuning alone and no RNOL, the new network only achieves 67.3% mAP after 30K iterations. When using the TFS method, the network trained from scratch only achieves 62.5% mAP after 30K iterations; TFS needs 70K training iterations to reach ideal accuracy. Once again, RNOL outperforms both fine-tuning and TFS: when adding 10 classes, it achieves higher accuracy in a shorter training time.

When adding 10 classes, we again compared the RNOL and fine-tuning methods on the new network $M_{new}$. Each was trained for 30K iterations, the weights were saved every 5K iterations, and each saved set of weights was evaluated on the test set. The results are shown in Figure 3: the detection accuracy of RNOL remains higher than that of fine-tuning throughout the 30K iterations. More iterations are needed to achieve high detection accuracy as the number of added classes grows beyond one.

5. Conclusion

For object detection with the addition of new classes when the old classes are available, we improved the Faster R-CNN model in this study by reserving nodes in advance in the output layer. Our experimental results show that RNOL can achieve high detection accuracy on both new and old classes in a short training time. Although the proposed method outperforms fine-tuning, its detection accuracy still has room for improvement. One possible way to improve it is to increase the number of training iterations, though this would increase the training-time cost.

Data Availability

We evaluated our method on the PASCAL VOC 2007 dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.