Abstract

Convolutional neural network- (CNN-) based object detection has achieved remarkable success. However, existing CNN-based algorithms struggle to detect small-scale objects, whose responses may be lost by the time the feature maps reach a certain depth, while it is common for the scale of objects (such as cars, buses, and pedestrians) in traffic images and videos to vary greatly. In this paper, we present a 32-layer multibranch convolutional neural network named MBNet for fast detection of objects in traffic scenes. Our model uses three detection branches, in which feature maps with sizes of 16 × 16, 32 × 32, and 64 × 64 are employed, respectively, to optimize the detection of large-, medium-, and small-scale objects. By means of a multitask loss function, our model can be trained end-to-end. The experimental results show that our model achieves state-of-the-art performance in terms of precision and recall rate, and its detection speed (up to 33 fps) is fast enough to meet the real-time requirements of industry.

1. Introduction

Automatically detecting various objects (such as vehicles and pedestrians) in images or videos from traffic scenes is a basic premise for many intelligent transportation systems. Reasonable traffic management and control based on the movement of vehicles and pedestrians can reduce the occurrence of traffic accidents, road congestion, etc. In this regard, considerable efforts have been made over the past decade. Challenging benchmarks such as KITTI [1] and LSVH [2] have also been proposed to evaluate and compare the performance of various detection algorithms. Because features extracted by convolutional neural networks generalize far better than traditional handcrafted features, CNN-based object detection methods have achieved remarkable success in vehicle detection, pedestrian detection, and many other object detection tasks [3–10].

One of the most popular object detection pipelines uses sliding windows to generate candidate regions, extracts features from these regions, and applies pretrained classifiers to determine whether each region contains an object. However, this approach incurs a huge computational cost. Hence, researchers have begun to exploit ways of performing object detection efficiently. Two strategies are commonly employed: region proposal-based methods and regression-based methods. The former first uses region generation algorithms such as selective search (SS) [11] and edge boxes [12] to generate candidate regions (namely, region proposals) and then processes them with a convolutional neural network; these methods have high accuracy but cannot meet real-time requirements. Representative algorithms include RCNN [4], fast RCNN [7], faster RCNN [9], and mask RCNN [13], which are typical two-stage methods (they generate proposals with a region generation method and then classify and regress the proposals). The other strategy treats detection as a regression problem and directly predicts the location and classification of the objects. These are typical one-stage methods; they are fast, but their accuracy is relatively lower than that of two-stage methods. Representative algorithms include YOLO [14], SSD [15], YOLOv2 [16], and YOLOv3 [17].

Despite the powerful performance of CNNs, one of the main conundrums when applying them to object detection in traffic scenes is that traditional CNN-based methods are scale sensitive, while it is quite common for the scale of objects to range greatly in traffic images or surveillance videos. For example, as shown in Figure 1, the bus has the largest scale and contains far more effective pixels than the other objects. Accurately localizing these multiscale instances is quite challenging because the fully connected layers in a CNN require fixed-size input, and traditional RoI pooling simply replicates parts of a region proposal to fill the extra space when producing feature maps of the specified size, which may destroy the original structure of small objects. In the network training phase, filling in duplicate values not only leads to inaccurate forward propagation but also accumulates errors during backward propagation, impeding parameter updates. These two aspects mislead the training of the network and prevent it from detecting small-scale objects accurately. At the same time, small objects may have lost their response by the time the feature map reaches a certain depth, which undoubtedly makes it even more difficult for these methods to detect small objects accurately.

Existing CNN-based studies address the scale-variance problem mainly from two aspects: training on images of different resolutions [18–20] or fusing feature maps of different scales within the CNN [5, 8, 10, 21, 22]. Either way, the adaptability of the network to detection tasks with various scales is improved. However, because of irrationally designed detection branches, it remains difficult for these approaches to detect objects at all scales; alternatively, the expensive computational overhead caused by their large number of parameters prevents them from meeting the real-time requirement that is essential for unmanned supermarkets, autonomous driving, face recognition, parts inspection, and many other real-time application scenarios.

As this discussion suggests, the network architecture for such tasks should consist of multiple branches that handle large-, medium-, and small-scale objects, respectively. Recent CNN architectures exploit the property that higher-level features are obtained by composing lower-level ones. Motivated by this idea, we present a multibranch convolutional neural network, named MBNet, to detect multiscale objects in traffic scenes accurately and efficiently. Schematic illustrations of the proposed MBNet and related methods are shown in Figure 2. MBNet is a regression-based end-to-end network consisting of convolution layers, max pooling layers, upsample layers, route layers, and YOLO detection layers; specific explanations are provided in the following sections. In particular, it assigns dynamic weights to each subbranch with respect to the scale of objects and combines multilevel features to detect objects of different scales. Therefore, MBNet achieves outstanding detection performance over a wide range of input scales and is computationally efficient.

In summary, the main contributions of this paper include:
(1) A novel multibranch scale-aware network is proposed for object detection in traffic scenes, incorporating three subnetworks into a unified architecture, each specialized for a range of input scales, which boosts the final detection performance with fewer parameters.
(2) A scale-aware mechanism is proposed to adjust the weights accordingly, performing accurate detection for large-, medium-, and small-scale objects in various traffic scenes; it achieves better performance than other methods in terms of precision and recall rate and also meets the real-time requirements of applications.
(3) We construct an urban traffic dataset with large scale variance, which provides a practical platform for comparing the performance of various detection algorithms in dealing with objects of different scales.

2. Related Work

As in other fields, object detection in traffic scenes has experienced a long period of development, and related tasks include vehicle detection, pedestrian detection, license plate location, and so on. In this paper, we consider a method for detecting 7 kinds of objects, including pedestrians, car plates, and various vehicles. Early works detect vehicles using relative motion cues between foreground and background, such as the Gaussian mixture model (GMM) [23, 24] and the sigma-delta model [25]. They accomplish the task by modeling the distribution of the background, which appears more frequently than the foreground that occupies only a small portion of the image. Later, handcrafted feature-based statistical learning methods that directly detect objects in images (video frames) were applied to object detection in traffic scenes. These methods use common features such as HOG, SURF [26], Gabor [27], and Haar-like features [28, 29] to describe image regions, and then pretrained classifiers such as SVM, artificial neural networks [27], and Adaboost [28] classify the regions into categories such as object area and nonobject area. Aiming at the problem that existing pedestrian detection algorithms miss detections in complex scenes or when the scale of an object is too small, Chen et al. [30] proposed cascading simple aggregated channel features (ACF) and rich deep convolutional neural network (DCNN) features for efficient and effective pedestrian detection in complex scenes. In reference [31], a robust license plate location method based on wavelet transform and empirical mode decomposition (EMD) analysis is proposed to deal with challenging practical problems such as illumination changes and complex backgrounds. Some studies combine optical flow with hardware implementations [32] and dense correspondence fields [33] to detect objects. However, these kinds of approaches are unable to distinguish detailed categories of moving objects, such as bicycles, cars, buses, vans, or pedestrians. In addition, they need a slew of complex postprocessing algorithms, such as occlusion recognition and shadow detection, to refine the detection results.

It is known that traditional CNN-based methods are sensitive to scale, and many subsequent studies have been devoted to addressing this scale-sensitivity issue. In reference [2], Hu et al. proposed a context-aware RoI pooling method to replace traditional RoI pooling, which may destroy the original structure of small objects, and further presented a multibranch decision network for box regression and classification. Li et al. [34] proposed using generative adversarial networks (GANs) to detect small-scale objects and achieved good results. Most existing solutions were inspired by two kinds of pyramid representations. One applies the concept of the image pyramid (Figure 2(a)), which uses input images of multiple sizes to make the network fit inputs of all sizes [6, 18, 19, 35]. The main drawback of this scheme is that it is computationally heavy, which limits its application in real-time detection. The other is conducted by means of the feature pyramid, which exploits multiple feature maps extracted from different layers to detect objects of various scales (as shown in Figure 2(b)): small-scale objects are detected with high-resolution shallow features and large-scale objects with low-resolution deep features. This strategy has been adopted in SSD [15], MS-CNN [10], FCN [36], and SDP [5]. However, since shallow feature maps lack semantic information and small objects may have lost their response by the time the feature map reaches a certain depth, the detection performance of these methods on small objects is poor.

To make full use of deep-layer information in dealing with the scale change of objects, some researchers propose combining feature maps of different layers to train a network (Figure 2(c)), such as HyperNet [37] and MultiPath [22]. However, small objects are still difficult to detect because downsampling operations prevent them from maintaining ample spatial information once the feature map reaches a certain depth. To take full advantage of the detailed information in shallow features and the semantic information in deep features, another solution is to use high-resolution shallow feature maps together with upsampled deep feature maps to predict small-scale objects, such as in [21, 38]. This scheme better preserves the information of small objects in the deep feature maps, and it is exactly the idea adopted in this paper (Figure 2(d)).

In short, through the reasonable design and adjustment of the three detection branches, our approach achieves a balance between time cost and detection accuracy, detecting objects of various scales while meeting the real-time requirements of applications. We explore a simple and effective framework consisting of three subnetworks that generate detection results on each branch; filtering algorithms (such as NMS) then refine these results to produce the final detections.

3. MBNet

Motivated by the concept of the feature pyramid, we propose a new algorithm, MBNet. MBNet is an ensemble of three subnetworks in which scale-specific feature maps are employed to detect large-, medium-, and small-scale objects in traffic scenes, respectively, as shown in Figure 3. By fusing features from different layers, the feature maps used for detection carry both the rich semantic information of high-level features and the detailed information of low-level features, which effectively improves the detection of small objects. This design enables MBNet to accurately capture the characteristics of objects at different scales on different branches and then classify and locate them. Finally, a series of filtering algorithms screens the detection results to obtain the final output. The details of the MBNet framework are given in Figure 3.

3.1. Predictions across Scales

Drawing on the idea of faster RCNN, we use k-means to cluster the anchor boxes over a series of labeled ground truth boxes, which automatically determines the sizes and the number of anchors. Nine clusters are selected, and features are further extracted through several convolutional layers to produce feature maps specialized for each range of object scales. Specifically, in the experiments with the urban traffic dataset (Section 4.1.1), we predict 3 bounding boxes on the feature map of each detection branch, so the output tensor is N × N × (3 × (4 + 1 + 7)) for the 4 bounding box offsets, 1 objectness prediction, and 7 class predictions. For the three branches of MBNet, N stands for 16, 32, and 64, respectively. As the anchors affect not only detection efficiency but also localization quality, k-means clustering is used to find the proper k by minimizing the objective function as done in YOLOv2 [16], with the distance metric

$$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid}),$$

where box represents the bounding box and centroid represents the cluster center; the appropriate value of k after clustering is 9. The resolution of each image in our handcrafted dataset is 512 × 512, and the 9 clusters on the dataset are (11 × 12), (15 × 30), (43 × 32), (37 × 74), (62 × 87), (69 × 139), (173 × 145), (255 × 278), and (453 × 432).
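
For illustration, the following is a minimal NumPy sketch of this clustering step, assuming ground truth boxes are given as (width, height) pairs; it is not the authors' implementation, only the standard YOLOv2-style procedure described above.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, treating every box as anchored at the origin."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None]
             + (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster ground truth (w, h) pairs with d(box, centroid) = 1 - IOU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Minimizing d is equivalent to maximizing IOU.
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]  # sort by area
```

Running kmeans_anchors on the labeled boxes of the training set yields 9 (w, h) anchor priors, which are then split evenly across the three detection branches.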

With semantic information and traffic scene details insufficiently encoded, the feature extractor alone describes appearance only at a coarse level. To capture complementary information, we integrate deep semantic feature maps into the original object detection framework. In detail, through a series of convolution and max pooling operations, a feature map of a specified size is learned automatically and serves as the first detection branch. Next, we take the feature map from several layers back, improve its resolution by a factor of 2, and combine it with another feature map to form the second detection branch. We also fetch a lower-level feature map and merge it with another upsampled feature map by concatenation, and several convolution operations are then performed on this combined feature map before it serves as the last detection branch. Combining low-level descriptors with high-level features can lead to better performance in distinguishing fine-grained categories of objects. In short, MBNet carefully designs three different detection branches to cover large-, medium-, and small-scale objects in traffic scenes as fully as possible.
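
The upsample-and-concatenate pattern described above can be sketched in PyTorch as follows; the module and channel counts are illustrative assumptions, not the exact MBNet layers.

```python
import torch
import torch.nn as nn

class FusionBranch(nn.Module):
    """Illustrative fusion step: upsample a deep feature map by a factor of 2,
    concatenate it with a shallower map (a route layer), then refine the result
    with a 3x3 convolution block before it feeds a detection branch."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.refine = nn.Sequential(
            nn.Conv2d(deep_ch + shallow_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, deep, shallow):
        x = torch.cat([self.up(deep), shallow], dim=1)  # channel concatenation
        return self.refine(x)

# e.g., fuse a 16x16 deep map with a 32x32 shallow map to form the 32x32 branch
branch = FusionBranch(deep_ch=256, shallow_ch=128, out_ch=128)
out = branch(torch.randn(1, 256, 16, 16), torch.randn(1, 128, 32, 32))
print(out.shape)  # torch.Size([1, 128, 32, 32])
```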

3.2. Network Training Process

Table 1 illustrates the architecture of MBNet in detail. The network consists of 32 layers, including 17 convolution layers for feature extraction, 6 max pooling layers for simplifying feature maps, 2 upsample layers for obtaining higher-resolution feature maps (each upsamples a layer by a factor of 2 so it can be concatenated to another layer), and 3 YOLO layers for receiving the output feature maps, which also serve as the three detection branches of the network. Besides, 4 route layers are used to take the feature map at a certain layer or to fuse feature maps from different layers. In the convolution layers, we use regularization to suppress overfitting and to increase the weight of important parameters in the convolution kernels so that more accurate feature maps are extracted. A batch normalization layer is added after each convolution layer to normalize the output, which greatly improves the training speed and avoids gradient vanishing. Throughout the network, we use the Leaky ReLU function as the activation function.

The network treats the whole detection task as a regression task, dividing the input image into 16 × 16, 32 × 32, and 64 × 64 small regions (grid cells), respectively. Each grid cell predicts three bounding boxes that might contain objects as well as the probability of each category in that region. These boxes are then compared with the ground truth to compute the error. The whole training process is shown in Figure 4: we treat the trained network as a function with parameters, abbreviated as $y = f(x; W)$, where $x$ represents the input, $y$ stands for the output, and $W$ denotes the network parameters. First, the network is initialized randomly, and the images in the training set are used as input to obtain the corresponding output, that is, the bounding box coordinate predictions, the objectness prediction, and the 7 category predictions, as shown in Figure 4. Since the gradient with respect to the input of a module can be computed by working backwards from the gradient with respect to its output, the BP algorithm is used to update the network parameters for the next round of training. We iterate in this way until the loss falls within a certain range or the number of iterations reaches a preset limit. We then choose the network weights corresponding to the most representative loss value as the final parameters for prediction. In the test phase, for each input image, the network produces outputs at various scales on the different detection branches; we combine them and use filtering algorithms such as nonmaximum suppression (NMS) to refine the results.
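
The shapes involved can be made concrete with a short PyTorch sketch; the tensors are random stand-ins for real network outputs, illustrating only the N × N × (3 × (4 + 1 + 7)) layout described in Section 3.1.

```python
import torch

B, NUM_CLASSES = 3, 7  # boxes per cell, classes in the urban traffic dataset
for n in (16, 32, 64):  # the three MBNet grid resolutions for a 512x512 input
    raw = torch.randn(1, B * (4 + 1 + NUM_CLASSES), n, n)  # stand-in output
    pred = raw.view(1, B, 4 + 1 + NUM_CLASSES, n, n)
    txty = pred[:, :, 0:2]              # center offsets within each grid cell
    twth = pred[:, :, 2:4]              # width/height scales w.r.t. the anchor
    obj = pred[:, :, 4:5].sigmoid()     # objectness via logistic regression
    cls = pred[:, :, 5:].sigmoid()      # 7 independent class scores
    print(n, raw.shape, "->", pred.shape)
```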

We trained the network for 100,000 iterations and recorded how the average IOU and the loss function vary with the number of iterations, as shown in Figures 5 and 6, respectively. From these figures, we can conclude that the loss converges over the iterative process while the average IOU increases and approaches 1.

3.3. Bounding Box Prediction and Class Prediction

Following faster RCNN and some other works, our system predicts bounding boxes using dimension clusters (obtained by k-means) as anchor box priors. When an input image is divided into an S × S grid, each grid cell predicts B bounding boxes (the 9 clusters are divided up evenly across the 3 branches, so here B is 3) under the prerequisite that MBNet predicts three types of anchor boxes on the three detection branches (16 × 16, 32 × 32, and 64 × 64). An object is predicted by the grid cell in which its center falls, and the network predicts 4 coordinates for each bounding box, $t_x$, $t_y$, $t_w$, and $t_h$, where $(t_x, t_y)$ is the offset of the center of the ground truth box from the top left corner of the grid cell responsible for the prediction and $(t_w, t_h)$ is the scale by which the anchor box is zoomed to a size similar to the ground truth box. They are calculated according to

$$t_x = g_x - c_x, \quad (1)$$
$$t_y = g_y - c_y, \quad (2)$$
$$t_w = \log(g_w / p_w), \quad (3)$$
$$t_h = \log(g_h / p_h). \quad (4)$$

If the cell is offset from the top left corner of the image by $(c_x, c_y)$ (as shown in equations (1) and (2)) and the anchor box prior has width $p_w$ and height $p_h$, then the coordinates of the predicted bounding box are obtained by

$$b_x = \sigma(t_x) + c_x, \quad (5)$$
$$b_y = \sigma(t_y) + c_y, \quad (6)$$
$$b_w = p_w e^{t_w}, \quad (7)$$
$$b_h = p_h e^{t_h}, \quad (8)$$

where $g_x$, $g_y$, $g_w$, and $g_h$ refer to the center coordinates and the width and height of the ground truth, respectively, and $p_w$ and $p_h$ denote the width and height of the anchor box. From equations (1)–(8), the 4 predictive output coordinates of the bounding box are obtained. Using $\sigma(\cdot)$ to compress $t_x$ and $t_y$ into the $[0, 1]$ region effectively ensures that the object center stays in the grid cell that carries out the prediction and prevents excessive deviation.
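
A minimal NumPy sketch of the decoding step in equations (5)–(8), using the standard YOLOv2-style parameterization described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw predictions (t_x, t_y, t_w, t_h) into a box (b_x, b_y, b_w, b_h)
    given the cell offset (c_x, c_y) and anchor prior (p_w, p_h); the sigmoid
    keeps the predicted center inside its grid cell."""
    bx = sigmoid(tx) + cx          # equation (5)
    by = sigmoid(ty) + cy          # equation (6)
    bw = pw * np.exp(tw)           # equation (7)
    bh = ph * np.exp(th)           # equation (8)
    return bx, by, bw, bh
```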

During training, we use a sum of squared error loss; the total loss function of our network is given in equation (9) and is the same as that used in YOLOv2 [16]. The loss function is designed to balance the coordinates, the confidence of the bounding boxes, and the classes. The gradient is the ground truth value (calculated from the ground truth box) minus our prediction, as in the fourth and fifth items of the following equation:

$$\begin{aligned}
\text{loss} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left[(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right] \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}}\left(C_i-\hat{C}_i\right)^2 \\
&+ \lambda_{\text{obj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left(C_i-\hat{C}_i\right)^2 \\
&+ \lambda_{\text{class}} \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c\in\text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2. \quad (9)
\end{aligned}$$

In the loss function, $p_i(c)$ is the real category probability, $\hat{p}_i(c)$ is the predicted category probability, $(x_i, y_i, w_i, h_i)$ is the information of the ground truth box, $(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)$ is the information of the predicted bounding box, and $\lambda_{\text{coord}}$, $\lambda_{\text{obj}}$, $\lambda_{\text{noobj}}$, and $\lambda_{\text{class}}$ are the weight parameters. MBNet predicts a confidence for each bounding box using logistic regression. The value should be 1 when the anchor box overlaps a ground truth object more than any other anchor box does. Its calculation is shown in equation (10):

$$C = \Pr(\text{object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}. \quad (10)$$

Unlike YOLOv3 [17], we choose several bounding boxes with relatively high confidence and average their coordinates rather than selecting only the single box with maximum confidence from a group of highly overlapping detections, as done in [2]. In this way, the localization accuracy for occluded objects is improved, and the recall rate increases by 6.8%. If an anchor box is not responsible for predicting a ground truth object, that is, its IOU with the ground truth box does not meet the preset threshold, it incurs no loss for class predictions, only for confidence predictions, and has a very small weight in the coordinate predictions. The complete bounding box regression process is shown in Figure 7.
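
Since the exact averaging rule is not spelled out above, the following NumPy sketch shows one plausible reading: a greedy NMS-style loop that replaces each group of overlapping boxes with a confidence-weighted average instead of keeping only the top-scoring box.

```python
import numpy as np

def iou(a, b):
    """IOU of box a against an array of boxes b; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def merge_boxes(boxes, scores, iou_thr=0.5):
    """NMS variant: average the coordinates of each group of overlapping
    high-confidence boxes (confidence-weighted) instead of discarding all
    but the single best box. Assumes scores lie in (0, 1]."""
    order = np.argsort(scores)[::-1]   # highest confidence first
    merged = []
    while order.size > 0:
        overlaps = iou(boxes[order[0]], boxes[order])
        group = order[overlaps >= iou_thr]          # includes order[0] itself
        weights = scores[group] / scores[group].sum()
        merged.append((boxes[group] * weights[:, None]).sum(axis=0))
        order = order[overlaps < iou_thr]
    return np.array(merged)
```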

Besides the 4 coordinates $t_x$, $t_y$, $t_w$, and $t_h$ and the confidence, each bounding box also predicts 7 class scores, corresponding to the 7 classes of our handcrafted dataset. Multiplying the confidence value by each of the 7 class scores gives the score of the bounding box for a particular category, as shown in equation (11):

$$\Pr(\text{class}_i \mid \text{object}) \times \Pr(\text{object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}. \quad (11)$$

These class-specific confidence scores can be compared with the preset threshold to determine which categories should be retained. Each bounding box uses multilabel classification to predict the categories it might contain. For good performance, we do not use softmax but simply use independent logistic classifiers, trained with binary cross-entropy loss for the class predictions. Using a softmax imposes the assumption that each box has exactly one class, which is often not the case; for example, an object may carry the labels apple, fruit, and food at the same time. The multilabel method models such data better.
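
A small sketch of this per-class scoring, with made-up logits; the per-class sigmoid (rather than a softmax across classes) is what allows multiple labels to exceed the threshold simultaneously.

```python
import numpy as np

def class_confidences(objectness_logit, class_logits):
    """Class-specific confidence as in equation (11): independent logistic
    classifiers for the classes, multiplied by the box objectness score."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return sigmoid(objectness_logit) * sigmoid(class_logits)

logits = np.array([2.0, -1.0, 0.3, -3.0, 0.0, 1.5, -0.5])  # 7 classes
scores = class_confidences(1.2, logits)
keep = scores >= 0.5          # compare with the preset threshold per class
print(scores.round(3), keep)
```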

4. Experiments

4.1. Dataset and Evaluation Metrics

A series of comparative experiments is carried out in this paper; we use a handcrafted urban traffic dataset and the public KITTI dataset to evaluate the performance and effectiveness of the proposed algorithm.

4.1.1. Urban Traffic Dataset

Traffic scenes commonly contain objects (such as all kinds of vehicles and pedestrians) with large-scale variations, as the surveillance cameras usually cover a large and long view of the road. Although publicly available benchmarks have contributed to progress in this area of object detection, existing traffic object datasets often contain a limited range of contents (only cars or pedestrians) and scales, making it difficult to assess real-world performance. In order to demonstrate the proposed method in more practical scenes, we construct a new dataset named urban traffic dataset to provide a better benchmark and focus research effort on these difficult cases.

The urban traffic dataset contains objects with a vast variance of scales in traffic scenes, comprising 10500 well-labeled images covering different roads, times, weathers, and traffic states, as shown in Figure 8. The dataset has been divided into three subsets with a training : testing : verification ratio of 5 : 3 : 2. In detail, it consists of 5125 images for training and 3188 images for testing, and the verification set contains 2197 images. The dataset covers seven categories, namely, car, car plate, pedestrian, bus, bicycle, motorcycle, and tricycle, which are also the objects to be detected in input images; it is worth pointing out that we treat the car plate as a separate class for training and testing.

To better fit into the network presented in this paper, we have resized all the images to 512 × 512 resolution. The data distribution of our handcrafted dataset is shown in Table 2. As illustrated in Table 2, the objects are classified into 7 categories under three different scenes (sparse, crowded, and nighttime). We consider a scene as a crowded scene if it contains more than 15 objects per image; otherwise, it is considered as a sparse scene.

4.1.2. KITTI Dataset

KITTI [1] is a widely used benchmark for vehicle detection, which contains objects with different scales in different scenes. The dataset includes 7481 images for network training (including 2494 images for use as a verification set) and 7518 images for testing the model. KITTI dataset provides a 3D border annotation for a moving object captured by using cameras, and the categories of objects include cars, trucks, pedestrians, and bicycles. According to the difference of object size, occlusion, and truncation criteria, the dataset organizer divides the dataset into three levels: easy, moderate, and hard, which can be used to judge the comprehensive performance of various object detection algorithms.

4.1.3. Evaluation Metrics

We employ the universally recognized recall rate, average precision (AP), and intersection over union (IOU) metrics [39] to evaluate the performance of MBNet on our handcrafted dataset; these metrics are widely used to evaluate object detection algorithms [1, 39]. We evaluate the performance of our model for car, pedestrian, bus, bicycle, and other classes in all scene conditions, such as crowded or sparse and daytime or nighttime. In the experiments, the threshold is set in the range of 0.1 to 0.65, meaning that a detection is considered correct only if the overlap between the predicted bounding box and the ground truth is greater than or equal to this value. Besides, we use P-R curves and average precision (AP) to present the detection performance of MBNet for cars, cyclists, and pedestrians in scenes of different complexity (easy, moderate, and hard) on the KITTI dataset. All experimental results can be seen in Section 4.4.
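
For reference, a simplified sketch of how precision and recall follow from the IOU threshold; it assumes each detection has already been matched to its best ground truth box and skips the one-to-one matching bookkeeping used in full AP evaluation.

```python
import numpy as np

def precision_recall(matched_ious, num_gt, iou_thr):
    """A detection counts as a true positive only when its best IOU with a
    ground truth box meets the threshold; the rest are false positives."""
    tp = int(np.sum(matched_ious >= iou_thr))
    fp = matched_ious.size - tp
    precision = tp / (tp + fp) if matched_ious.size else 0.0
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall

# e.g., 5 detections matched against 6 ground truth boxes at IOU >= 0.5
print(precision_recall(np.array([0.9, 0.72, 0.55, 0.4, 0.1]), 6, 0.5))
```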

4.2. Experimental Configuration

Our experiments are implemented on a computer running Ubuntu 16.04, equipped with an NVIDIA 1060 GPU and an Intel(R) Core i7-6700K @ 4.0 GHz∼4.2 GHz CPU. Besides, the GPU development package CUDA 8.0 and the deep learning acceleration library cuDNN 6.0 are installed, and MBNet is trained in a Python 2.7 environment. The specific training parameters are as follows: the initial learning rate is 0.001; the policy is steps; the batch size is 64; the steps are 100, 25000, and 50000, respectively; the maximum number of batches is 100000; the scales are 10, 0.1, and 0.1; the momentum is 0.9; and the decay is 0.0005. As shown in Figure 5, the horizontal axis represents the number of iterations, ranging from 0 to 100000. After more than 60000 iterations, the parameters have basically stabilized. During training, the changes in region average IOU and loss are important indicators of training quality; as can be seen from Figures 5 and 6, the loss falls and approaches a small constant, while the average IOU is approximately equal to 1, which basically meets the requirements of training.

4.3. Explanation of Various Scales

We propose MBNet to effectively detect large, medium, and small objects in traffic scenes, so as to reduce the rate of missed detections. The experiments are carried out on our handcrafted dataset and KITTI, both of which contain objects of different scales. Through statistical analysis of the bounding boxes in the dataset, the objects are divided into three categories: objects with a height or width greater than 10 pixels and smaller than 47 pixels belong to the "small" category; objects with a height or width between 47 and 99 pixels are in the "medium" category; other objects with a height or width greater than 99 pixels are in the "large" category. The three detection branches of MBNet can effectively detect these objects in scenes both sparse and crowded. The experimental results show that the reasonable design of the detection branches greatly improves the recall rate and the detection precision, and because MBNet is a 32-layer lightweight network, processing each image takes only about 30 ms (33 fps), which basically meets the real-time requirements of industry.
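
A trivial helper makes these buckets concrete; since the text leaves it open whether "height or width" means either side or the larger side, this sketch assumes the larger side decides the category.

```python
def scale_category(w, h):
    """Assign a box to the scale buckets used above (sizes in pixels),
    assuming the larger side of the box determines its category."""
    m = max(w, h)
    if m < 47:
        return "small"    # larger side in (10, 47) px
    if m <= 99:
        return "medium"   # larger side in [47, 99] px
    return "large"        # larger side > 99 px

print(scale_category(15, 30), scale_category(62, 87), scale_category(255, 278))
```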

4.4. Comparison with the State-of-the-Arts
4.4.1. Urban Traffic Dataset

Based on the configuration above, we conduct experiments on our prelabeled dataset and compare against RCNN, faster RCNN, SSD, mask RCNN, SINet, and YOLOv3, respectively. We make comparative analyses of the recall rate, the average precision, the average IOU, and the time consumption. It is worth noting that the YOLOv3 network divides the input images into 13 × 13, 26 × 26, and 52 × 52 small regions (grid cells), whereas in this paper we divide the original images into 16 × 16, 32 × 32, and 64 × 64 small regions, respectively. To verify the effectiveness of our method, we compare it with the other methods under different thresholds, as shown in Tables 3–9.

Because the recall rate, the detection precision, and the IOU value of each model change with the threshold, we compare these metrics under thresholds from 0.1 to 0.65. As shown in Table 3, we compare the average precision of the 7 frameworks on the testing set under different thresholds. As seen in the table, the average precision of each method increases as the threshold increases, because a smaller threshold counts in some incorrect predictions. Our model obtains the highest average precision in most cases. When the threshold is 0.1 (minimum), the average precision of our method reaches 58.25%, which is 2.84% higher than SINet and 10.51% higher than SSD; when the threshold is 0.65 (maximum), the precision of our method reaches 83.68%, which is 1.25% higher than SINet and 9.29% higher than SSD. The average precision of our method reaches nearly 60% even at the 0.1 threshold, which shows that the proposed network structure is suitable for predicting various objects. Table 4 compares the recall rate of the various methods, and our model basically has the highest recall rate at all thresholds. This shows that our method has a lower missed-detection rate and is more suitable for detecting objects of different scales.

Tables 5–7 present statistics of the detection results for the seven categories at a threshold of 0.5, and from these tables we can conclude that our method achieves the best detection results among the compared methods under the different scenes.

IOU (intersection over union) measures the overlap between the predicted bounding box and the ground truth: the higher the value, the more accurate the prediction. The threshold set in the experiments is in fact a threshold on this IOU value. As shown in Table 8, we compare the average IOU over the seven categories for all methods. At a threshold of 0.1, the average IOU of our model reaches 89.25%, which is 3.38% higher than YOLOv3 and 16.8% higher than RCNN. At a threshold of 0.65, the average IOU of our method reaches 61.58%, again the highest of all methods; under the other thresholds, our model also shows a clear advantage over the other frameworks. Table 9 analyzes the time consumption of each framework. Because RCNN is not an end-to-end network, its time consumption is very high, reaching 3.13 s per image. Our time consumption is also lower than that of SSD and mask RCNN. Finally, compared with YOLOv3, although we partition the original images more finely, our network has only 32 layers, so the overall time consumption is lower than that of YOLOv3.

To show that the grid sizes of 16 × 16, 32 × 32, and 64 × 64 are the best fit for our model, we selected five different size settings for comparison, with the results shown in Figure 9. For tidiness of presentation, each size stands for the smallest scale of its group (for example, 16 × 16 stands for the group 16 × 16, 32 × 32, and 64 × 64). Because each grid cell predicts 3 boxes, the time consumption increases with the feature map scale. As seen from Figure 9, dividing an image into at most a 20 × 20 grid performs worse than the 16 × 16 setting adopted in this paper. In addition, when the input is divided into an 8 × 8 grid, the accuracy decreases rapidly as the threshold increases, which is clearly unsuitable. The 16 × 16 setting is therefore a compromise: its time consumption is moderate while its accuracy is the best, and since our network itself has few parameters, the total time consumption remains low, basically meeting the requirement for real-time performance.

In this section, we compare the recall rate, the average precision, the average IOU, and the time consumption with different methods, and the accuracy under different partition patterns is also discussed. To sum up, our network shows a good advantage over most existing models in the above aspects and can also meet the industry requirements for real-time performance.

In Figure 10, we show some detection results by MBNet on our handcrafted dataset. The results show that the algorithm is effective in detecting objects with different scales, especially for some small-scale objects (such as car plate) in traffic scenes under different conditions such as crowded, sparse, and insufficient illumination. This proves that the proposed MBNet has a good application prospect and has the potential to become an important part of intelligent transportation systems.

4.4.2. KITTI Dataset

To further analyze the effectiveness of the proposed method, we train our model on the KITTI training set and evaluate it on the testing set of the KITTI benchmark. Specifically, we compare the detection performance of RCNN, faster RCNN, SSD, YOLOv3, mask RCNN, SINet, and our method for different objects (cars, cyclists, and pedestrians). The experimental results are shown in Figure 11.

As can be seen from Figure 11, the area under the P-R curves of the objects detected by our method is larger than that of the other methods; that is, the average precision of our method is higher, which means its detection performance is better. In addition, we have calculated the average precision (AP) for scenes of different complexity (easy, moderate, and hard) on the KITTI dataset, and the results are shown in Table 10.

As shown in Table 10, our model can better detect different objects in scenes with different complexity degree, which is due to the reasonable structure design of our model. In Figure 12, we show some of the detection results by MBNet on KITTI dataset. It can be seen from Figure 12 that the network proposed in this paper has a good effect on the detection of vehicles with different scales, which proves the superiority of this algorithm in detecting various objects by using feature maps with different scales.

5. Conclusions

To summarize, we propose a 32-layer multibranch network, denoted MBNet, for fast detection of objects with a large variance of scales in traffic scenes. Through the design of three detection branches, it can accurately detect large-, medium-, and small-scale objects in various traffic scenes, whether sparse, crowded, daytime, or nighttime. Besides, we construct a novel labeled dataset containing objects with large scale variance in traffic scenes, which provides a practical platform for evaluating different detection algorithms. MBNet achieves state-of-the-art performance in both precision and recall rate, and its detection speed is fast enough for real-time detection. Future work will apply MBNet to more challenging datasets and explore changes to the overall network structure for better performance. Moreover, in view of the poor performance of most detection algorithms in dark scenes, our follow-up work will also focus on improving detection in scenes with insufficient light.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (41571401).