Abstract

In the field of object detection, tremendous progress has been made recently, but detecting and identifying objects both accurately and quickly remains a very challenging task. Human beings can detect and recognize multiple objects in images or videos with ease, regardless of the objects' appearance, but it is difficult for computers to identify and distinguish between things. In this paper, a modified neural network based on YOLOv1 is proposed for object detection. The new network model improves on YOLOv1 in the following ways. Firstly, the loss function of the YOLOv1 network is modified: the improved model replaces the margin style with a proportion style. Compared to the old loss function, the new one is more flexible and more reasonable in optimizing the network error. Secondly, a spatial pyramid pooling layer is added. Thirdly, an inception model with a 1 × 1 convolution kernel is added, which reduces the number of weight parameters of the layers. Extensive experiments on the Pascal VOC 2007/2012 datasets show that the proposed method achieves better performance.

1. Introduction

Human beings can easily detect and identify objects in their surroundings regardless of the circumstances: no matter what position the objects are in, whether they are upside down, differ in color or texture, are partly occluded, and so on. Humans therefore make object detection look trivial. The same detection and recognition on a computer requires a lot of processing to extract information about the shapes and objects in a picture.

In computer vision, object detection refers to finding and identifying an object in an image or video. The main steps involved in object detection are feature extraction [1], feature processing [2–4], and object classification [5]. Object detection achieved excellent performance with many traditional methods, which can be described from four aspects: bottom-level feature extraction, feature coding, feature aggregation, and classification. Feature extraction plays an essential role in the object detection and recognition process [6]. Dense extraction retains more redundant information, which can be modeled to achieve better performance than earlier point-of-interest detection. The previously used scale-invariant feature transform (SIFT) [7] and histogram of oriented gradients (HOG) [8] belong to this category.

Object detection is critical in different applications, such as surveillance, cancer detection, vehicle detection, and underwater object detection. Various techniques have been used to detect objects accurately and efficiently in these applications; however, the proposed methods still suffer from a lack of accuracy and efficiency. To tackle these problems, machine learning and deep neural network methods have proven more effective for accurate object detection.

Thus, in this study, a modified new network is proposed based on the YOLOv1 [9] network model. The performance of the modified YOLOv1 is improved through the following points:
(i) The loss function of the YOLOv1 network is optimized.
(ii) The inception model structure is added.
(iii) A spatial pyramid pooling layer is used.
(iv) The proposed model effectively extracts features from images, performing much better in object detection.

The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 presents the methodology and describes the network architecture in detail. Section 4 presents the analysis of the improved network from various aspects. In Section 5, the experimental setup, results, and comparisons with other networks are discussed. The conclusion and future work are given in Section 6.

2. Related Work

Detecting and identifying multiple objects in an image is hard for machines to recognize and classify. However, noteworthy effort has been devoted in the past years to detecting objects using convolutional neural networks (CNNs). Neural networks have been used in object detection and recognition for decades, but they became prominent thanks to improvements in hardware and to new techniques for training these networks on large datasets [10, 11]. In object detection and recognition, researchers have used deep learning to learn features directly from the image pixels, which is more effective than manual features [4, 12]. Recent deep learning-based algorithms drop manual feature extraction methods and instead learn features directly [13] from the original images. This methodology has been successfully proven in the feature pyramid network (FPN) [14], the single shot detector (SSD) [15], and the deconvolutional single shot detector (DSSD) [16].

Deep learning is a prevailing direction in the field of machine learning [17]. In [18, 19], researchers showed that CNNs inherit the advantages of deep learning, which greatly improves their object detection and recognition results compared with traditional methods. Researchers made many efforts to use stochastic gradient descent and backpropagation to train deep networks for object detection [20]. Those networks were able to learn but were too slow in practice to be useful in real-time applications; the technique in [12] showed that stochastic gradient descent with backpropagation was effective in training CNNs. CNNs came into use but fell out of fashion with the rise of the support vector machine [21] and simpler methods such as linear classifiers [22]. Techniques developed more recently [23, 24] show higher image classification accuracy in the ImageNet large scale visual recognition challenge [25]. These techniques have made it much easier to train large and deeper networks and have shown enhanced performance.

Recently, approaches have been established to identify vehicles and other objects in videos or static images using deep convolutional neural networks (DCNNs) [26–30]. For example, faster R-CNN [19] proposes candidate regions and uses a CNN to confirm candidates as valid objects. YOLO uses an end-to-end, unified, fully convolutional network structure that predicts the objectness confidence and the bounding boxes concurrently over the whole image. SSD [31] outperforms YOLO by discretizing the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. YOLO-2 [32] achieves state-of-the-art performance in object detection by improving various aspects of its earlier version. A fully convolutional network has been utilized for object detection from three-dimensional (3D) range scan data with LIDAR. A 2D-DBN design has been proposed, which uses second-order planes instead of first-order vectors as inputs and uses bilinear projection to retain discriminative information and improve the recognition rate [33]. Although DCNN-based approaches achieve state-of-the-art detection or classification accuracy, they often require intensive computation and a considerable amount of labeled training data. Over the past few years, a substantial amount of work has been done to address these two problems and make deep neural networks economical to use in real-time applications [34, 35].
In this study, a different, modified architecture for object detection is presented, which is capable of providing high accuracy and speed.

3. Methodology

In this section, the proposed model is described in detail. Firstly, the improvement based on the loss function is presented. Secondly, the improvement based on the inception structure model is described. Lastly, the improvement based on the spatial pyramid pooling layer is presented. The symbolic representations are described in Table 1.

3.1. Improvement in Network Design

The following improvements to the YOLO network model are made while maintaining the original model's dominant idea.

3.1.1. Improvement Based on Loss Function

The loss function of the original YOLOv1 network takes the same error for large and small objects, which makes the model's predictions for neighboring objects unsatisfactory: if two objects appear in the same grid cell, only one can be detected, and small objects are problematic to detect. Compared with the old loss function, the new loss function is more flexible and better optimized; in it, the original difference is replaced by proportionality. Equation (1) shows the original loss function of YOLOv1, which uses one single loss function for both the bounding boxes and the classification of the object. The loss function can be described in five parts: the first and second focus on the loss of the bounding box coordinates, the third and fourth are responsible for the difference in the confidence of having an object in the grid cell, and the fifth part is responsible for the difference in class probability. The $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$ terms are scalars that weight each part of the loss function; $\lambda_{\text{coord}}$ is set to 5 and $\lambda_{\text{noobj}}$ to 0.5 by the original author of YOLOv1.
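For reference, the original YOLOv1 loss referred to as equation (1) has the following well-known form (reproduced here from the YOLOv1 formulation, with $S^2$ grid cells, $B$ boxes per cell, and $\mathbb{1}_{ij}^{\text{obj}}$ indicating that box $j$ of cell $i$ is responsible for an object):

$$\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned} \tag{1}$$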

In convolutional neural networks, the variance function is often used as the loss function [36] of the network. For example, for a multiclass problem, let the total number of categories be $C$ and the number of training samples be $N$. A multiclassification algorithm first needs to find weights and biases that make the output $a$ of the neural network close to the labeled category $y(x)$ for all training inputs $x$; to quantify how close the output is to $y(x)$ over all training inputs, the loss function is defined as
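$$C(w, b) = \frac{1}{2N} \sum_{x} \left\| y(x) - a \right\|^{2}, \tag{2}$$

which is the standard quadratic (variance) form, reconstructed here from the surrounding definitions.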

Here, $y(x)$ represents the label of the input object, and $a$ represents the actual output value of the network for that input. The variance form is chosen as the loss function to facilitate subsequent optimization. Moreover, in practice, the current training level can be assessed by observing how severely the loss value fluctuates.

In the YOLOv1 network loss function design, the variance function is used as part of the entire loss function. The normalization idea of contrast is used to improve it: the improved model replaces the margin style with a proportion style, so the size of the object in the picture is taken into account. The specific modified loss function is shown in equation (3).

Here, $\mathbb{1}_{ij}^{\text{obj}}$ indicates that the target object is assumed to be present in the $j$th bounding box of grid cell $i$; $x$ and $y$ represent the current position in the image, and $w$ and $h$ represent the width and height of the image. $C$ is the total number of object classes to be identified, and $p_i(c)$ is the probability that the object belongs to a specific class $c$. It should be noted that the loss function guides both the optimization of the class to which the object belongs and the optimization of the position of the bounding box used to detect the object.

3.1.2. Improvement of Inception Structure Model

The third and fourth layers of the original network are replaced with new inception models. The inception model itself deepens and widens the network, enhancing it. In addition, a 64 × 1 × 1 convolutional layer is added between the first and second layers of the original network, which reduces the number of network parameters. Figure 1 shows the structure of the YOLOv1 network after the inception model is added. The inception architecture is used to find out how an optimal local sparse structure in a convolutional neural network can be approximated and covered by readily available dense components.

The inception model can deepen and widen the network, and convolutional kernels of different scales are connected in parallel. Thus, multiscale features can be exploited more effectively, and the hidden information in the image can be used more efficiently.
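To make the structure concrete, the following is a minimal sketch of an inception-style block of the kind described above, written in PyTorch; the branch widths (64, 128, 32, and 32 output channels) are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel branches with kernels of different scales, concatenated
    along the channel axis; 1 x 1 convolutions reduce parameters first."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)  # plain 1x1 branch
        self.branch3 = nn.Sequential(                       # 1x1 reduce, then 3x3
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(                       # 1x1 reduce, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(                   # pool, then 1x1 project
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        # All branches preserve the spatial size, so their outputs
        # can be concatenated channel-wise.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)
```

The 1 × 1 convolutions in front of the larger kernels are what cut the parameter count: they shrink the channel dimension before the expensive 3 × 3 and 5 × 5 operations.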

3.1.3. Improvement of SPP Structure Model

Figure 2 shows the addition of the spatial pyramid pooling (SPP) layer; its advantages are as follows:
(i) It can output a fixed-size representation for an input image of any size or aspect ratio.
(ii) It can extract pooled features at varying scales.

A classifier (SVM/Softmax), as well as the fully connected layers, requires a fixed-length vector. Such a vector can be generated through Bag-of-Words (BoW) [35, 37, 38]; spatial pyramid pooling improves on BoW because it preserves spatial information by pooling in spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size. This makes SPP [39, 40] not only improve network performance but also dramatically reduce the required computation time by avoiding repeated evaluation of the convolutional features.

By using the SPP layer, richer image feature information is obtained, and a great improvement in the network's time efficiency is also observed. Hence, this technique shows remarkable detection accuracy.
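As an illustration, a minimal SPP layer can be sketched as follows in PyTorch; the pyramid levels (1, 2, 4) are an assumed configuration, not necessarily the one used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pools a feature map of arbitrary spatial size into a fixed-length
    vector by max-pooling over n x n grids of bins at several scales."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):
        # x has shape (batch, channels, H, W) with arbitrary H and W.
        pooled = [F.adaptive_max_pool2d(x, output_size=n).flatten(start_dim=1)
                  for n in self.levels]
        # Output length = channels * sum(n * n), independent of H and W.
        return torch.cat(pooled, dim=1)

# Usage: feature maps of different sizes yield vectors of the same length.
spp = SpatialPyramidPooling()
print(spp(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 19, 19)).shape)  # torch.Size([1, 5376])
```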

4. Analysis of the Network

Following is a comprehensive analysis of our proposed network and improved YOLO model based on the results of the experimental tests:
(i) By analyzing the confusion matrix, we observed for which kinds of samples the new network's detection performance is better and for which it is not, how to distinguish the easily confused categories, and the advantages and disadvantages of the network.
(ii) We examined the network architecture of the new network model, including a comparison of the number of network parameters, and assessed its performance.

4.1. Confusion Matrix

The test results are analyzed through the confusion matrix. A confusion matrix tabulates, for each actual class, how the test data are classified, so that we can observe which categories of samples are easily confused by the modified network. In the confusion matrix, the rows represent the true categories of the test images, and the columns show the classes assigned to the test images by the network in the actual test.

In the original Pascal VOC dataset, there are 20 categories of objects; here, some representative categories that easily cause misidentification are selected.

Table 2 is the confusion matrix of the modified network model on the Pascal VOC 2007 dataset. It can be noticed from Table 2 that airplanes are mistakenly recognized as birds, and samples belonging to birds are identified as airplanes. The reason is that the overall shapes are too similar: the airplane has two wings, and so does the bird, and the airplane's body shape is very similar to that of a bird. As a result, 22% of the airplanes are mistakenly identified as birds, and 36% of the birds are incorrectly identified as airplanes. In addition, chairs and sofas are relatively easy to misidentify: in real life it is very easy to differentiate between them, but in pictures chairs and sofas can easily look the same, which readily causes misidentification. The same applies to sheep, horses, dogs, and cats.

From Table 2, it can be seen that the overall average misrecognition rate is not too high, indicating that the overall ability of the network to extract features and detect target objects in the image is relatively reliable.

4.2. Network Architecture

Here, the proposed network architecture is described. Before going into detail, note the layer structure: the first and second layers are the same, both being a convolutional layer plus a downsampling layer; the third and fourth layers are the same, both being inception + pooling structures; the fifth and sixth layers are the same, both being convolutional cascade structures; the seventh layer is the spatial pyramid pooling layer; and the eighth and ninth layers are fully connected layers.

For the first layer, it is assumed that the input is an image, where $r$ is the number of rows and $c$ is the number of columns of the image input to the first layer of the network, the convolution kernel is of size $k_1 \times k_1$, and the sliding step is $s_1$; the computational cost of obtaining one feature map is shown in the following equation:
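$$\left( \frac{r - k_1}{s_1} + 1 \right)\left( \frac{c - k_1}{s_1} + 1 \right) k_1^{2}, \tag{4}$$

that is, the number of output positions times the kernel area (a standard reconstruction; the symbols $c$ and $k_1$ follow the assumptions stated above).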

The computed area is the area of the convolution kernel, which yields the result of equation (4). Then, assuming the first layer has $n_1$ feature maps, the total calculation of the first layer is
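$$n_1 \left( \frac{r - k_1}{s_1} + 1 \right)\left( \frac{c - k_1}{s_1} + 1 \right) k_1^{2}, \tag{5}$$

i.e., $n_1$ times the per-map cost of equation (4).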

and the size of each feature map after convolution will become
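$$\left( \frac{r - k_1}{s_1} + 1 \right) \times \left( \frac{c - k_1}{s_1} + 1 \right) \tag{6}$$

(reconstructed under the same assumptions as equation (4)).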

Next is the maximum downsampling layer. Since the downsampling layer does not change the number of feature maps, the number of feature maps remains equal to that of the previous layer. Assuming a downsampling window of size $d_1 \times d_1$, the size of the feature map obtained after downsampling is
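$$\frac{1}{d_1}\left( \frac{r - k_1}{s_1} + 1 \right) \times \frac{1}{d_1}\left( \frac{c - k_1}{s_1} + 1 \right), \tag{7}$$

assuming, as is standard, a non-overlapping $d_1 \times d_1$ pooling window.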

The total calculation over all the feature maps then becomes

The following is the second convolutional layer; assume that its number of feature maps is $n_2$.

The calculation of the convolution operation with the previous layer's feature maps will then be as follows.

Assuming that the maximum downsampling layer in the second layer has a downsampling window of size $d_2 \times d_2$ with step size $s_2$, the total calculation of this layer can be obtained in the same way.

From the above, the output feature size of MaxPool2 can be obtained. In the inception structure, the step size is 1, and the calculation proceeds from left to right. The third layer's inception structure model is shown in Figure 3 and expressed mathematically in the corresponding equations.

Thus, the whole calculation of the inception layers (the third and fourth layers) can be done in the above way. Next is the fifth convolutional layer, whose total calculation is given in equation (13).

Since the sixth layer and the fifth layer have the same structure, the calculation is the same as (13).

The seventh layer is the pyramid layer with $L$ levels, indexed by $n = 1, 2, \ldots, L$. The calculation amount of the pyramid layer is given in equation (14).

The eighth layer is fully connected. Assume that the number of input features is $N_{\text{in}}$ and the number of output features is $N_{\text{out}}$. Because the input of this layer comes from the previous layer, it is processed after all the features of the maps are gathered into a single vector, so $N_{\text{in}}$ is given by equation (15).

Because the fully connected layer is derived from the classical neural network, its calculation method is the same as that of a neural network, so the computational cost of the layer is
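$$N_{\text{in}} \cdot N_{\text{out}}, \tag{16}$$

i.e., one multiply-accumulate per input-output connection; this is the standard fully connected cost, reconstructed here with the symbols assumed above.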

From the above analysis of the network architecture, it can be seen that the network's overall calculation depends on the input image size, the convolution kernel sizes, and the number of convolutional layers, which shows that network depth and width have a big impact.

5. Experiment

Pascal VOC is divided into two datasets: Pascal VOC 2007 and Pascal VOC 2012. The newly designed network was tested on both datasets [41]. The Pascal VOC dataset consists of 20 categories: person, bird, cat, cow, horse, sheep, dog, airplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, and TV monitor. Figures 4 and 5 show sample images.

The whole experiment was conducted on an NVIDIA GeForce GTX 1060 GPU under the Ubuntu operating system. The number of iterations was 40,000.

5.1. Results and Discussion

The results are discussed, and the network performance is examined using the t-SNE visualization tool, which shows the extent to which the new network is able to extract rich features from images.

Next, the features of a large number of samples are visualized in 2D using the t-SNE visualization tool, which maps high-dimensional data to a low-dimensional space [42].
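For illustration, a feature visualization of this kind can be produced with scikit-learn and matplotlib as sketched below; `features` and `labels` are hypothetical arrays holding the network's extracted feature vectors and class indices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: one feature vector per test image and its class index.
features = np.load("features.npy")   # shape (num_samples, feature_dim)
labels = np.load("labels.npy")       # shape (num_samples,)

# Map the high-dimensional features to 2D for visualization.
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(features)

# One color per class; well-separated clusters suggest discriminative features.
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=5)
plt.colorbar(label="class index")
plt.savefig("tsne_features.png", dpi=150)
```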

Figure 6 shows ten categories selected from the Pascal VOC dataset (bird, chair, sofa, bike, airplane, horse, sheep, dog, cat, cow), visualized with the t-SNE tool. In the figure, different colors represent different classes; if two classes are fused together, they are easily confused with one another.

About seven of the categories do not overlap with each other, indicating that the feature differences among these seven classes are relatively large and that they are relatively easy to identify. Several other classes partially merge, meaning their features have a certain degree of similarity, which easily causes misidentification. Overall, however, the new network extracts features effectively and robustly, although it is still imperfect and needs further improvement. The improved network was tested on Pascal VOC 2007 and Pascal VOC 2012, respectively; the results are shown in Tables 3 and 4.

The data in Tables 3 and 4 are expressed in percentages. To make the comparisons consistent, the training data used are the train/val sets of Pascal VOC 2007 and Pascal VOC 2012. The data presented in Tables 3 and 4 are the test results for each of the 20 object classes. Our modified network's average detection rate is 65.6% on Pascal VOC 2007 and 58.7% on Pascal VOC 2012. To check the performance, we compared the results of our modified network with those of R-CNN and YOLOv1; Table 5 shows the comparative test results on Pascal VOC 2007, and Table 6 shows those on Pascal VOC 2012.

It can be seen from the tables that our modified model improves recognition over the YOLOv1 and R-CNN models in almost every class. Table 7 depicts the time taken by the three networks, R-CNN, YOLOv1, and our improved YOLO, to process the same test image: R-CNN takes 6.9 seconds, YOLO takes 0.14 seconds, and our model takes 0.11 seconds. Figures 7 and 8 show the testing results on Pascal VOC 2007 and Pascal VOC 2012 dataset images [41].

From the testing results, the robustness of the improved network is noticed; it classifies each class accurately and detects the desired class.

6. Conclusion

In this paper, we proposed a YOLOv1-based neural network for object detection, obtained by modifying the loss function and adding a spatial pyramid pooling layer and an inception module with 1 × 1 convolution kernels. The new network is trained in an end-to-end manner, and extensive experiments on the challenging Pascal VOC 2007/2012 datasets show the effectiveness of the improved network, with detection results of 65.6% and 58.7%, respectively. The results of the proposed network have been compared with those of R-CNN and YOLOv1, demonstrating the effectiveness of the proposed method.

In the future, we expect to extend our work further to make our own benchmark dataset and a hybrid detector for small object detection.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding this paper.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant 2018YFC0831404 and the State Grid Corp of China Science and Technology Project “Research on Key Technologies of Knowledge Discovery Based ICT System Fault Analysis and Assisted Decision”.