Abstract

Vehicle detection is expected to be robust and efficient in various scenes. We propose a multivehicle detection method based on YOLO under the Darknet framework. We also improve the YOLO-voc structure according to changes in the target scene and traffic flow. The classification training model is obtained on ImageNet, and the parameters are fine-tuned according to the training results and the vehicle characteristics. Finally, we obtain an effective YOLO-vocRV network for road vehicle detection. To verify the performance of our method, experiments are carried out on different vehicle flow states and compared with the classical YOLO-voc, YOLO 9000, and YOLO v3. The experimental results show that our method achieves detection rates of 98.6% in the free flow state, 97.8% in the synchronous flow state, and 96.3% in the blocking flow state. In addition, the proposed method has a lower false detection rate than previous works and shows good robustness.

1. Introduction

Intelligent Transportation Systems (ITS) [1] are an effective way to alleviate urban traffic congestion in the future. Vehicle detection is an important link in the data acquisition of an ITS [2] and provides valid data for intelligent traffic control applications such as congestion monitoring and the extraction of evidence for traffic violations. In the last decade, multitarget detection [3] for traffic data based on machine learning [4] has been a research hotspot.

Traditional machine learning methods first extract target features, such as the histogram of oriented gradients (HOG) [5], the scale-invariant feature transform (SIFT) [6], or local binary patterns (LBP) [7]. The extracted features are then fed to a pretrained classifier, such as a support vector machine (SVM) [8] or AdaBoost (AB) [9]. For a particular identification task, in which the data size is always limited, the extracted features may generalize poorly, so it is difficult to achieve precise identification in practical problems.

Machine learning [10] studies supervised or unsupervised methods for extracting and transforming features. As a subarea of machine learning, deep learning [11] models the complex relationships between data through multilevel representations. It transforms the original data into higher-level and more abstract expressions through simple, nonlinear models, showing good results in target detection. For example, the mAP (mean Average Precision) reaches 30% in [12] on the VOC2007 dataset. RCNN [13] combines traditional machine learning with deep learning and increases the mAP on VOC2007 to 48%. Despite its advantages in detection, RCNN requires training several SVM classifiers, resulting in high computational complexity. By modifying the network structure, the mAP was further increased to 66% in 2014. Subsequent network optimizations then produced SPP-Net [14], Fast RCNN [15], Faster RCNN [16], and YOLO [17].

YOLO is a new real-time target detection method that uses a deep convolutional neural network (CNN). It further increases the mAP to 78.6% on the VOC2007 dataset. At the same time, unlike traditional feature extraction algorithms, deep convolutional neural networks have a certain degree of invariance to geometric transformation, deformation, and illumination, and effectively overcome the difficulties caused by the changing appearance of vehicles. In addition, the feature description can be constructed adaptively from the training data, showing higher flexibility and generalization ability.

In summary, in this paper a deep learning method is proposed for the multitarget detection of vehicles under several challenging conditions such as color changes and complex road scenes. First, a vehicle dataset is built in the format of the VOC2007 dataset. Then, the parameters of an improved network, YOLO-vocRV, are fine-tuned through retraining to obtain multiobject detection. Finally, the detection method is tested under different road traffic conditions by comparison with the YOLO-voc, YOLO 9000, and YOLO v3 models.

2. YOLO v2 Algorithm

YOLO v2 can distinguish target regions from the background. In YOLO v2, both the probability and the location of multiple targets can be predicted in real time, a property essential for multiobjective vehicle detection.

2.1. Feature Acquisition Based on Convolutional Neural Network

In target detection, the CNN avoids complicated image preprocessing and is characterized by invariance to displacement, scaling, and other forms of distortion. At the same time, because the neurons on the same feature mapping plane share weights, the network learns in parallel with reduced complexity [18]. In this paper, the CNN in YOLO v2 is composed of convolution layers and pooling layers, as shown in Figure 1.

The 1×1 convolution kernel has a size of only 1×1 and does not need to consider the relationship between a pixel and its surrounding pixels. It is mainly used to adjust the number of channels, combining pixels across channels linearly before a nonlinear operation, and it can thus increase or reduce dimensionality. The 3×3 kernel is the smallest size capable of capturing a pixel's eight-neighborhood. Alternating 3×3 and 1×1 convolution kernels reduces the network parameters without reducing network performance and is more conducive to the extraction of vehicle features. The pooling layers downsample the extracted features, remove the less important parts, and output the final feature map.
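To make the alternating-kernel idea concrete, the following is a minimal PyTorch sketch of one such convolution block; the channel widths and input size are illustrative assumptions, not the exact YOLO-vocRV configuration.

import torch
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size):
    # 3x3 kernels keep the spatial size (padding=1); 1x1 kernels only
    # mix channels and need no padding.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),           # BN layer after the convolution
        nn.LeakyReLU(0.1, inplace=True),  # leaky activation used by Darknet
    )

block = nn.Sequential(
    conv_bn_leaky(128, 256, 3),  # 3x3: extracts local vehicle features
    conv_bn_leaky(256, 128, 1),  # 1x1: linear channel mixing, dimension reduction
    conv_bn_leaky(128, 256, 3),  # 3x3: re-expands the representation
    nn.MaxPool2d(2, 2),          # pooling downsamples the feature map
)

x = torch.randn(1, 128, 52, 52)  # a dummy feature map
print(block(x).shape)            # torch.Size([1, 256, 26, 26])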

2.2. Vehicle Target Detection Algorithm Based on YOLO v2

The design concept of YOLO v2 follows end-to-end training and real-time detection: the input image is divided into an S×S grid of cells for feature learning. If the center of a vehicle falls within a cell, that cell is responsible for detecting the vehicle and directly predicting the target location in the output feature map. Bounding box regression is used to fine-tune the window, and we use the VOC dataset to perform clustering statistics on the ground-truth boxes (K-means algorithm). Our results indicate that from K=1 to K=5 the IOU (Intersection over Union) curve rises quickly, as shown in Figure 2, which means the corresponding matching degree is high. Trading off accuracy against complexity, we use 5 bounding boxes, which makes positioning more accurate. Each grid cell outputs B=5 bounding boxes, and each bounding box has four position parameters $(t_x, t_y, t_w, t_h)$ and a confidence score, as shown in Figure 3.

$(\sigma(t_x), \sigma(t_y))$ is the offset of each bounding box's center from the boundary of its grid cell, and $(t_w, t_h)$ encode the true width and height of the target relative to the entire image. The real position of a bounding box is then computed as in formula (1):

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h} \qquad (1)$$

where $(c_x, c_y)$ is the offset of the upper left corner of the grid cell from the image origin, and $(p_w, p_h)$ are the width and height of the corresponding box prior.
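As a concrete illustration of formula (1), the following Python sketch decodes one raw prediction into an image-relative box; the cell offset, prior size, and input values are invented for illustration.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, S=13):
    # Map raw predictions (tx, ty, tw, th) of the cell at offset (cx, cy)
    # with prior (pw, ph) to a box relative to the whole image.
    bx = (cx + sigmoid(tx)) / S   # center x, normalized to [0, 1]
    by = (cy + sigmoid(ty)) / S   # center y
    bw = pw * math.exp(tw)        # width, scaled from the prior box
    bh = ph * math.exp(th)        # height, scaled from the prior box
    return bx, by, bw, bh

# e.g., a prediction in cell (6, 6) with a 0.2 x 0.1 prior
print(decode_box(0.3, -0.1, 0.2, 0.0, 6, 6, 0.2, 0.1))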

The confidence score reflects both the possibility that a target exists in the bounding box and how accurately the bounding box predicts the target position, i.e., $\text{confidence} = \Pr(\text{Object}) \times \mathrm{IOU}^{\text{truth}}_{\text{pred}}$. If there is no target in the bounding box, then $\Pr(\text{Object}) = 0$ and the confidence score is 0.

In addition, each cell in the YOLO v2 algorithm also predicts C conditional class probabilities $\Pr(\text{Class}_i \mid \text{Object})$, which are used to determine the best target location. The specific detection procedure is summarized as follows (a single vehicle class is used as an example).

Step 1. The input image is divided into S×S grid cells (S=13).

Step 2. For a single vehicle class (C=1), five bounding boxes (B=5) are predicted for each grid cell, yielding a prediction vector of length S×S×(B×5+C)=13×13×26.

Step 3. Obtain the optimal target detection position by applying non-maximum suppression to the S×S×(B×5+C) prediction vector, and use formula (3) to judge whether each target identification box is retained. A sketch of this step is given below.
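The following is a hedged Python sketch of Step 3; the confidence threshold of 0.25 follows Section 3.2.1, while the greedy suppression logic and the 0.45 overlap threshold are common defaults assumed here rather than values given in the paper.

def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def non_max_suppression(boxes, scores, conf_thresh=0.25, iou_thresh=0.45):
    # Drop low-confidence boxes, then greedily keep the best-scoring box
    # and suppress any remaining box that overlaps it too strongly.
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thresh),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept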

The whole process of the algorithm is shown in Figure 4.

3. Improved Network Based on YOLO v2

Different network structures may affect the result of model training: different learning rates, activation functions, and layer configurations lead to different parameter sets. Therefore, we perform parameter selection and optimization experiments on vehicle targets in order to obtain an improved network model for vehicle target detection.

3.1. Improved Network for Vehicle Multitarget Detection

YOLO v2 removes the last fully connected layer, following the GoogLeNet design: both the offsets and the confidence scores of the target boxes are predicted by convolutional layers (as described in Section 2.1) to obtain the probability and location of a single vehicle target. In addition, YOLO v2 can train and learn on images with different resolutions.

To obtain a network framework better adapted to multitarget vehicle detection, we set different structural parameters based on the YOLO-voc network structure and evaluate a variety of network frameworks. The results are reported in Table 1 and Figure 5, respectively.

In Table 1, YOLO-voc-v1.0 adopts 23 convolutional layers and 5 max-pooling layers. Convolutional layers 1 to 22 have a BN layer and use the Leaky ReLU activation function, while the last convolutional layer uses a linear activation function; the initial learning rate is set to 0.001. YOLO-voc-v1.1 adopts 20 convolutional layers and 5 max-pooling layers, and all convolutional layers include a BN layer. The initial learning rate of YOLO-voc-v1.2 is set to 0.0001 based on v1.0, and that of v1.3 is changed to 0.01. YOLO-voc-v1.5 has one more max-pooling layer than v1.1, and changing its last activation function to ReLU yields YOLO-voc-v1.4.

The test results in Figure 5 show that changes in the convolutional layers, the activation function, additional average-pooling layers, and additional BN layers all affect the detection performance. We also find that when BN is added to the last layer, an initial learning rate that is too small leads to duplicate detections, while one that is too large prevents the network from learning vehicle characteristics at all. Based on these experiments, we take the YOLO-voc-v1.1 network structure and further fine-tune its parameters to obtain an improved model, which we call YOLO-vocRV (YOLO-voc for Road Vehicles). Starting from the YOLO-voc network structure, the last three to five convolutional layers are removed to form the network structure shown in Figure 6.

3.2. YOLO-vocRV Network
3.2.1. Training Stage

First, we use the ImageNet dataset [19] to carry out high-resolution pretraining. On the basis of this classification model, fine-tuning is used to train the pretrained convolutional neural network on our own vehicle dataset.

To verify the validity and robustness of the proposed method, in the training phase we augment the data by random scaling and by random changes of saturation and exposure. These operations improve the diversity of our samples. A whole picture is then fed into the neural network at once. The network divides the image into different regions, predicts bounding boxes and probabilities for each region, and weights all boxes by their probabilities. Finally, we keep only the detections whose confidence score exceeds a certain threshold, which is set to 0.25 in our experiments. The entire training process is illustrated in Figure 7.
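A minimal sketch of the augmentations named above is given below, using Pillow; the jitter ranges are assumptions for illustration, not the exact values used to train YOLO-vocRV.

import random
from PIL import Image, ImageEnhance

def augment(img):
    # random scaling
    scale = random.uniform(0.8, 1.2)
    w, h = img.size
    img = img.resize((int(w * scale), int(h * scale)))
    # random saturation jitter
    img = ImageEnhance.Color(img).enhance(random.uniform(0.7, 1.3))
    # random exposure (brightness) jitter
    return ImageEnhance.Brightness(img).enhance(random.uniform(0.7, 1.3))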

3.2.2. Selection of Loss Function

Because a target sometimes occupies only a small proportion of the grid cells in an image, we use the confidence gradient when determining a target. In this paper, balance weights ($\lambda_{coord}$ and $\lambda_{noobj}$ in (5) below) are introduced in the model's training, so that bounding boxes with and without targets contribute in different proportions.

In the training stage, the mean square error is a common choice for the loss function. However, applying the mean square error directly to box widths and heights allows large candidate boxes to dominate the loss. Therefore, we employ the square root, which weakens the weight of large boxes. Each target corresponds to at least one predicted box, while the remaining boxes regress toward the background. We take the size, scale, and type of the target as the features of each box. The loss function is formulated as below.

Loss = coordinate prediction error + (box confidence prediction error with target + box confidence prediction error without target) + classification error. Written out, loss function (5) takes the standard YOLO form:

$$
\begin{aligned}
\text{Loss} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2 \qquad (5)
\end{aligned}
$$

The loss function penalizes classification error only if there is a target in that grid cell, and it penalizes bounding box coordinate error only if that predictor is "responsible" for the ground truth box. In (5), $\mathbb{1}_{ij}^{obj}$ indicates that cell $i$ contains a target and that bounding box $j$ of that cell is responsible for predicting it. The first and second terms are the localization errors: $(x, y)$ determines the box center and $(w, h)$ its size, corresponding to $(b_x, b_y)$ and $(b_w, b_h)$ described in Section 2.2. The use of the square root reduces the influence of different vehicle sizes. The third and fourth terms form the IOU error, corresponding to the confidence score in Section 2.2, i.e., the box confidence prediction error with and without a target. The fifth term is the classification error, which determines whether the target is a vehicle.
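A quick numeric check illustrates why the square root appears in the width and height terms: the same absolute size error is penalized relatively more for a small box than for a large one, so small vehicles are not drowned out. The box sizes below are arbitrary examples.

import math

def sq_err(w, w_hat):    # plain squared-error term
    return (w - w_hat) ** 2

def sqrt_err(w, w_hat):  # square-root term from loss (5)
    return (math.sqrt(w) - math.sqrt(w_hat)) ** 2

# a 10-pixel width error on a small (30 px) and a large (300 px) box
print(sq_err(30, 40), sq_err(300, 310))      # 100 vs 100: equal penalty
print(sqrt_err(30, 40), sqrt_err(300, 310))  # ~0.72 vs ~0.08: small box dominates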

The loss function is designed to keep a good balance among the coordinate $(x, y, w, h)$ error, the confidence error, and the classification error. In the training stage, we want to ensure a one-to-one correspondence between a bounding box and a target, so the IOU between each bounding box and the ground truth is calculated. The box with the best result is taken as the final bounding box, while the others are treated as containing no detected target. Based on this processing, a bounding box and its category are evaluated only when detected targets exist in the grid cell.

4. Experimental Results and Analysis

4.1. Experiment Preparation
4.1.1. Experimental Equipment and Testing System

In order to verify the effectiveness of the proposed method, we collected road traffic flows in an actual traffic system with camera resolutions of 300×225 and 550×448 pixels. In the training stage, we use a workstation equipped with an Intel i7 5930 CPU, four NVIDIA GeForce Titan X 12GB GPUs, and 8GB of memory. In the testing stage, we use an ordinary computer.

4.1.2. Experimental Construction

The experimental data in this paper are collected from the roof of a building of Xi'an Polytechnic University at No. 19 Jinhua South Road, Beilin District, Xi'an, China. The location and angle of the camera are consistent with the traffic cameras used by the existing traffic control department. In the training stage, if the sample set is not representative, it is difficult to select good features. To ensure the diversity of the dataset, we collect images of the same road section in different time periods, which ensures the diversity of vehicle types and illumination conditions. At the same time, according to differences in light and traffic density [20], we collect three datasets under different traffic densities, namely, free flow from 6:00 to 7:00 in the morning (fewer than 300 vehicles/hour), synchronous flow from 9:00 to 11:00 (between 300 and 900 vehicles/hour), and blocking flow from 7:30 to 8:30 (between 900 and 1300 vehicles/hour). Some samples are shown in Figure 8.

In the experiment, the original data samples are labeled and processed. We randomly split the samples into 80% for training and 20% for testing (see Section 4.2.1). In addition, we conduct detailed comparisons and analysis for traffic flows with different densities (see Sections 4.2.2 and 4.2.3).

4.2. Analysis of Experimental Results
4.2.1. Efficiency Analysis of Improved Net

In this paper, the improved YOLO-vocRV model is compared with YOLO-voc, YOLO 9000, and YOLO v3. The parameters of these network models are initialized from ImageNet. During training, the weights are updated once per iteration. We set the initial learning rate to 0.001 and multiply it by 0.1 at 2000, 8000, and 11200 iterations, respectively. Finally, we compare the multitarget detection performance of the different models under different traffic density conditions.
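The step schedule just described can be written as a small Python helper; the convention mirrors Darknet's steps/scales settings, and this sketch is for illustration only.

def learning_rate(iteration, base_lr=0.001,
                  steps=(2000, 8000, 11200), scale=0.1):
    # Multiply the learning rate by `scale` each time an iteration
    # milestone in `steps` has been passed.
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= scale
    return lr

for it in (0, 2000, 8000, 11200):
    print(it, learning_rate(it))  # 0.001, 1e-04, 1e-05, 1e-06 (up to rounding)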

Figure 9 illustrates the loss curves of YOLO-voc and YOLO 9000 under synchronous flow and of YOLO v3 under blocking flow. At the beginning of training, the three curves are mostly divergent: the loss value of YOLO-voc reaches 400 and that of YOLO 9000 exceeds 1000. With more iterations, the three curves tend to converge at about 300. Since the numbers of targets in the three datasets differ greatly, YOLO v3 cannot train on the free flow and synchronous flow datasets, where the loss values are all "NaN"; only the blocking flow dataset can be used for its training. The three models eventually reach convergence. We found that the convergence rate has a very limited effect on the detection performance, and the improved YOLO-voc achieves better detection results than the YOLO 9000 and YOLO v3 models (see Section 4.2.2).

The test results on the validation set are shown in Figures 10 and 11, respectively. The IOU of a result is the intersection of the predicted box and the ground-truth label divided by their union, $\mathrm{IOU} = \operatorname{area}(box_{pred} \cap box_{truth}) / \operatorname{area}(box_{pred} \cup box_{truth})$, and the IOU represents the quality of the predicted results. Recall is the ratio of the number of vehicles detected to the number of vehicles in the test set, i.e., $\text{Recall} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$; Precision is $\mathrm{TP} / \text{proposal}$, where proposal is the number of predicted bounding boxes whose scores exceed the threshold.
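The two measures can be computed from raw counts as in the short sketch below; matching a predicted box to a ground-truth vehicle by an IOU threshold is assumed to have been done beforehand.

def recall_precision(true_positives, num_ground_truth, num_proposals):
    recall = true_positives / num_ground_truth  # detected / vehicles in test set
    precision = true_positives / num_proposals  # detected / boxes above threshold
    return recall, precision

# e.g., 47 of 50 vehicles found among 52 proposals above the threshold
print(recall_precision(47, 50, 52))  # (0.94, ~0.90)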

The Recall curves of the three models remain at 100% as the number of targets increases, indicating that the correct rate can be guaranteed under the free flow condition. However, the Precision curve of the YOLO 9000 model shows larger fluctuations, and the Precision curve of the YOLO-voc model jumps as the ID number increases. The proposed YOLO-vocRV stays stable at 100% and keeps good accuracy. Comparing the three models in terms of their IOU curves at the same time, the IOU value of YOLO 9000 lies between 75% and 85%, indicating low testing stability. The YOLO-voc model performs better than YOLO 9000, with IOU values between 80% and 85%. The IOU value of the YOLO-vocRV model is only 73% in the initial stage but quickly increases to 85%.

Figure 11 displays the test results of the four models under blocking flow. As the ID number increases, the IOU value of YOLO 9000 reaches 80%, relatively higher than the other compared methods; the IOU value of YOLO-voc is between 75% and 80%, that of YOLO-vocRV is about 75%, and that of YOLO v3 is about 80%. However, in this paper we are more concerned with the differences between the detected and actual vehicle counts. The Precision and Recall curves demonstrate that the YOLO-vocRV model has more advantages in multitarget detection than the compared methods. For example, the Precision of YOLO-vocRV remains around 95% in most tests, compared with roughly 90% for YOLO 9000 and YOLO-voc and 85% for YOLO v3.

The experimental results in Figures 10 and 11 show that the same model produces different IOU, Recall, and Precision values as the vehicle flow density increases. For training samples under the same conditions, the Precision of the three network models remains above 90%, with the exception of YOLO v3. Under different training and test sets, the YOLO-vocRV model improves the Recall value with less loss of Precision, which gives our model better classification performance, as shown in Figure 11. In order to evaluate the universality of the YOLO-vocRV network model for multivehicle target detection, we conducted a cross experiment of validation sets and training sets.

4.2.2. Analysis of Experimental Results of Detection Cross-Data

To further verify the effectiveness of the improved model, we perform a second experiment, using weights trained on different samples and testing under different flow densities on real traffic video. The models are trained on three different VOC-format datasets (examples are shown in Figure 8), according to the distribution of traffic over time.

(1) Testing Analysis under Free Flow Training Samples. With free flow training, as shown in Figure 12(a), the test results on blocking flow using YOLO 9000 at 10000 iterations are slightly better than at 20000 iterations. However, there are obvious false detections in the middle of the image (a1 and a2 in Figure 12(a)) and an obvious mismatch between the predictions and the marked boxes. The detection result of the YOLO-voc model is better than YOLO 9000: it detects the upper left target b1 at 10000 iterations and the rear target b2 at 20000 iterations. Compared with 10000 iterations, the duplicate detection at the top left of the image (b3 in Figure 12(b)) is improved at 20000 iterations. Based on our experimental results, we found that such false detections may be due to the insufficient number of vehicles passing at the same time in free flow, leading to a serious lack of training samples. Although YOLO-vocRV can accurately detect the lower left vehicle c1 of blocking flow after 20000 iterations, many false detections remain over the entire sequence.

(2) Testing Analysis under the Condition of Synchronous Flow Training. For the synchronous flow training samples, all three models obtain a better detection effect on free flow. For the overall test set, the YOLO 9000 and YOLO-voc models still show large error rates (see Section 4.2.3), and the YOLO-voc model produces more duplicate detections in the distance (e1 in Figure 13(e)). The YOLO-vocRV model has better detection results than the previous two models. When the blocking flow dataset is used as the testing target, as shown in Figure 13(f), the white vehicle group in the center is detected correctly, and the tiny target (f1 in Figure 13(f)) is also detected.

(3) Testing Analysis under Blocking Flow Training Conditions. When the blocking flow samples are used for training, the YOLO 9000, YOLO-voc, and YOLO-vocRV models all achieve good performance, so only the experimental results on synchronous and blocking flow are analyzed in this subsection. Figure 14(g) shows that when the YOLO 9000 model is iterated 10000 times, a false detection appears in the upper right (g1 in Figure 14(g)); at 20000 iterations this problem is solved and the small target in the upper right corner is detected (g2 in Figure 14(g)). Using YOLO-voc at 10000 iterations, as shown in Figure 14(h), the small targets in the upper right corner (h1, h2 in Figure 14(h)) are missed; the problem is solved at 20000 iterations, but the vehicle in the upper right corner (h1) is detected twice. Under the same conditions, YOLO-vocRV at 10000 iterations shows the same missed detections, but at 20000 iterations the small targets in the upper right corner are detected correctly. In the added YOLO v3 experiment, distant small targets are detected well, but nearby targets are not (j1, j2, j3, j4, and j5 in Figure 14(j)).

According to the detection results in Figures 12, 13, and 14, when free flow is used as the training condition, the training samples are relatively homogeneous, and none of the three models can learn good features for distant small targets, occluded moving vehicles, and targets with unclear contours. Therefore, in the case of high traffic flow density, the tests show more false detections (a1, a2 in Figure 12(a)). Trained on synchronous flow, the performance improves, but a small number of missed detections and heavy duplicate detections remain (e1 in Figure 13(e)). The reason is that the learning samples and traffic environments provided are not sufficiently diverse. Further increasing the complexity of the samples by using blocking flow for training, the detection clearly misses fewer targets (g2, h1, and h2 in Figure 14). The experimental results show that the detection effect at 20000 iterations is better than at 10000 iterations, and YOLO-vocRV has better detection performance than the other three models.

4.2.3. Accuracy Analysis of Experimental Results

Because YOLO is greatly influenced by the training samples, it requires a diverse and representative training dataset. After applying the sample training model collected in this paper, we use two different training methods to verify the accuracy of the detections under different conditions and then use the separately trained models to detect images of different traffic densities. The test results are tabulated in Tables 2 and 3. Since the test results often contain false detections, missed detections, and duplicate detections, we separately count the actual number of real vehicles (denoted $N_r$), the number of true detections ($N_t$), and the false detection rate ($R_f$) for each traffic density. $N_t$ is the exact number of vehicles detected after removing the erroneous cases, $N_t = N_r - N_e$, where $N_e$ is the total of false detections + missed detections + duplicate detections. We then define $R_f = N_e / N_r$. In this way, $R_f$ prevents the numbers of false, missed, and duplicate detections from offsetting one another and thereby understating the false positive rate.
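The statistics defined above can be expressed as a short Python sketch; the variable names mirror the reconstructed definitions, and the example counts are invented.

def detection_stats(n_false, n_missed, n_duplicate, n_real):
    # N_e lumps together all erroneous outcomes so that false, missed,
    # and duplicate detections cannot offset one another.
    n_error = n_false + n_missed + n_duplicate
    n_true = n_real - n_error        # correctly detected vehicles, N_t
    return n_true, n_error / n_real  # (N_t, R_f)

# e.g., 500 real vehicles with 8 false, 6 missed, and 4 duplicate detections
print(detection_stats(8, 6, 4, 500))  # (482, 0.036)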

Experimental results show that, for synchronous flow, the combined error detection rate is 11.6% for YOLO 9000 and 11.1% for YOLO-voc, while the rate for YOLO-vocRV is 9.0%; YOLO v3 cannot train a model on these data. The improved YOLO-vocRV model is thus more accurate than YOLO-voc. When YOLO-vocRV is trained on blocking flow samples, its detection error rate on free flow is as low as 1.4%, on synchronous flow 2.2%, and on blocking flow 3.7%. Therefore, for multivehicle target detection, the proposed YOLO-vocRV model shows a better detection effect.

Due to possible camera jitter, distractors such as road edges, green belts, and lane markings may introduce considerable noise into vehicle appearance and further harm the learning process. YOLO-vocRV augments the training data to exploit the overall information (random scaling, changes in saturation and exposure, etc.), adapts better to the various appearances, colors, and movements of vehicles, and ultimately improves the robustness and universality of the detection model.

5. Conclusion

This article focuses on the multiobjective visual detection problem in road lanes. We study the YOLO v2 target detection algorithm and propose a new network, named YOLO-vocRV. To obtain more accurate test results, the proposed model converts the detection problem into a binary classification problem. Through a large number of trials, the training set we use is built from blocking flow, and the YOLO-vocRV network is trained for more than 20000 iterations. The average accuracy of our improved YOLO-vocRV model for different traffic densities exceeds 90%. Compared with traditional machine learning, our proposed method improves both accuracy and efficiency. Compared with the YOLO 9000, YOLO-voc, and YOLO v3 models, YOLO-vocRV loses significantly less Recall while obtaining better Precision. The final detection results show that the improved method is more suitable for multitarget detection under different traffic densities: the error rate on free flow is 1.4%, the false detection rate on blocking flow is only 3.7%, and the accuracy on blocking flow reaches 96.3%. Our dataset was built under good visible-light conditions; therefore, extending the network model to night or low-light conditions is our future research direction.

Data Availability

The traffic image data and the Python code used to support the findings of this study have been deposited in the "Baidu cloud" repository (https://pan.baidu.com/s/1lbgyQrriLy1263FQGZHHJg). All the information in this paper is freely available and can be found at https://fairsharing.org under "databases > biodbcore-001087". More detailed information is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by Dr. Startup funds of Xi’an Polytechnic University (no. BS1507) and the Natural Science Basic Research Plan (Surface Project) in Shaanxi Province of China under Grant no. 2018JM6089.