Abstract

In recent years, vehicle type detection has played an important role in traffic management. This paper presents G-YOLOX, a lightweight detection network based on multiscale ghost convolution that is suitable for practical deployment on embedded devices. Specifically, ordinary convolutions and ghost convolutions are combined to fully exploit different feature information, and a series of cheap linear transformations is designed to generate ghost feature maps so that the network remains lightweight. Moreover, a dataset of images showing different vehicles in a city environment was established: altogether, 20,000 road scene images were collected, and seven categories of vehicles were annotated. Extensive experiments on the benchmark datasets VOC2007 and VOC2012 and on our dataset demonstrate the superiority of the proposed G-YOLOX over the original YOLOX. G-YOLOX achieves nearly the same mean average precision at an IoU threshold of 0.5, while the size of the weight file decreases by 40% and the number of parameters decreases by 67% compared with the original YOLOX network.

1. Introduction

Deep convolutional neural networks (CNNs) have achieved impressive results in recent years, especially in computer vision tasks such as image recognition [1], object detection [2], and semantic segmentation [3]. As a representative computer vision technology, object detection has attracted increasing attention from researchers and has been widely applied in everyday life and in industry, driven largely by the development of deep convolutional networks and the growth in GPU computing power. To explore different trade-offs between accuracy and efficiency, deeper and densely connected backbones have been proposed, such as ResNet [4], ResNeXt [5], and AmoebaNet [6]. To improve detection accuracy, researchers have used deeper, densely connected backbone networks instead of traditional shallow network structures. Mask R-CNN [7] uses ResNet instead of the VGG [8] backbone used in Faster R-CNN [9] because ResNet captures richer features; despite its additional parallel mask branch, Mask R-CNN performs better on object detection tasks. Lin et al. used Faster R-CNN with a ResNet-FPN backbone [10]; this backbone extracts region-of-interest features from different levels according to the scale information of the feature pyramid, achieving good detection accuracy and processing speed. Researchers at Google Brain adopted a neural architecture search method to find new feature pyramid architectures, named NAS-FPN [11]. This method uses top-down and bottom-up connections to fuse information from different feature layers in the pyramid; the feature pyramid network (FPN) architecture found by the search is repeated N times and connected into a larger architecture.

Because the number and size of detection targets vary greatly, Zhao et al. [12] proposed a multilevel FPN that better extracts information from the feature map. The authors design the enhanced feature pyramid in three steps. First, as in FPN, multilevel features are extracted from multiple layers of the backbone and fused into base features. Second, the base features are fed into a module composed of alternately connected thinned U-shaped modules and feature fusion modules, with the decoding layers of each thinned U-shaped module serving as input to the next. Finally, the decoding layers with the same scale are aggregated to construct a feature pyramid containing multilevel features.

A detector that predicts multiple categories in a single stage is called a single-shot detector [13]. Such a detector uses a fixed set of default bounding boxes with different scales for each location in the feature map and directly predicts the category scores and box offsets at the same time. Within each feature map, the default bounding boxes differ in aspect ratio and scale; across feature maps, the default box scales are computed by spacing them regularly between the highest and lowest levels. Each feature map thus learns to respond to objects in a particular size range.
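As a concrete illustration, the sketch below computes SSD-style default-box scales spaced regularly across feature-map levels; the values s_min = 0.2 and s_max = 0.9 and the aspect ratios are the defaults from the SSD paper and are assumptions here, not details of the exact configuration in [13].

```python
# Sketch of the SSD-style rule for spacing default-box scales evenly
# between the lowest and highest feature-map levels.

def default_box_scales(num_levels: int, s_min: float = 0.2, s_max: float = 0.9):
    """Return one scale per feature-map level, evenly spaced."""
    return [s_min + (s_max - s_min) * k / (num_levels - 1)
            for k in range(num_levels)]

def default_box_shapes(scale: float, aspect_ratios=(1.0, 2.0, 0.5)):
    """Width/height (relative to image size) for each aspect ratio."""
    return [(scale * ar ** 0.5, scale / ar ** 0.5) for ar in aspect_ratios]

if __name__ == "__main__":
    for level, s in enumerate(default_box_scales(6)):
        print(f"level {level}: scale {s:.2f}, boxes {default_box_shapes(s)}")
```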

Although traditional CNNs achieve satisfactory accuracy on many computer vision tasks, traditional object detection models are usually large and computationally expensive. Thus, to make deep neural networks practical on mobile devices (e.g., smartphones and self-driving cars), researchers have increasingly studied portable and efficient network structures. Efficient neural networks are becoming common in mobile applications, enabling new user experiences. They also help protect personal privacy, because users can run the networks directly on their devices without sending personal data to a server. For example, GhostNet [14] uses fewer model parameters to generate an equivalent set of feature maps: half of the required output feature maps are generated by conventional convolution, the other half are produced by cheap depthwise separable convolutions, and the two parts are concatenated to form the required output. With this method, the number of convolution parameters is only about half that of the traditional approach. Our proposed Ghost-4 module is a plug-and-play component for upgrading an existing CNN. A neural network deployed on a mobile device must be small and have few parameters; this paper therefore proposes a small network structure with low computational cost that allows developers to deploy applications on resource-constrained mobile devices.
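To make the idea concrete, the following is a minimal PyTorch sketch of a ratio-2 ghost module in the spirit of GhostNet; the exact layer choices (batch normalization, ReLU, kernel sizes) are assumptions, not the configuration used in [14].

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Minimal sketch of a GhostNet-style ghost module (ratio 2):
    half of the output channels come from an ordinary convolution,
    the other half from a cheap depthwise convolution applied to the
    first half. Assumes out_ch is even; BN/ReLU choices are assumptions."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 1, cheap_kernel: int = 3):
        super().__init__()
        primary_ch = out_ch // 2
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.ReLU(inplace=True),
        )
        # Depthwise conv (groups == channels) is the "cheap linear transformation".
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, primary_ch, cheap_kernel,
                      padding=cheap_kernel // 2, groups=primary_ch, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

# Example: GhostModule(64, 128)(torch.randn(1, 64, 56, 56)).shape -> (1, 128, 56, 56)
```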

In this paper, we use YOLOX to extract feature map information, and the ghost module is used to reduce the number of parameters in the whole object detection framework. We call the overall network G-YOLOX. This network generates more features with fewer parameters and uses a dual attention module to fuse more feature maps. Our contributions can be summarized as follows:
(i) A multiscale ghost convolution structure is devised to fully extract features while controlling the number of parameters. We also propose a feature fusion module between different parts of the network to further increase the performance of the whole CNN.
(ii) We establish VOC2019, a dataset of images showing different vehicles in a city environment. This dataset can be used to train models that detect vehicle types, vehicle colors, traffic violations, and so on.

2. Related Work

In this section, we introduce some lightweight network structures and briefly review existing vehicle datasets.

2.1. Lightweight Network Structures

MobileNet [15] is an efficient, low-computation model suitable for mobile and embedded vision applications. The authors build lightweight deep neural networks using a streamlined architecture based on depthwise separable convolutions. The design exposes two simple global hyperparameters that trade off latency and accuracy, allowing the model builder to select an appropriately sized model for its application according to the needs of the detection task.
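For reference, a minimal PyTorch sketch of the depthwise separable building block is shown below; the BN/ReLU placement follows the MobileNet paper, while the function signature and defaults are illustrative assumptions.

```python
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, stride: int = 1):
    """Sketch of a MobileNet-style depthwise separable convolution:
    a per-channel 3x3 depthwise conv followed by a 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),     # depthwise: one filter per channel
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False), # pointwise: mixes channels
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )
```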

MobileNetV2 [16] is also a lightweight network, with the following improvements over MobileNet. The first is the linear bottleneck layer: because a ReLU layer applied after a convolution clears negative activations and thereby loses feature information, replacing it with a linear bottleneck reduces this loss. The second is an optimized residual structure: a 1x1 convolution first expands the feature map to enrich the number of features, a depthwise separable convolution then extracts features, and finally a 1x1 convolution projects the result to the desired number of feature maps. Because this ordering is the opposite of a standard residual block, it is called an inverted residual structure. These two improvements let MobileNetV2 extract more image features from the feature map, so its detection performance is greatly improved compared with MobileNet.
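A minimal sketch of such an inverted residual block is given below, assuming stride 1 and equal input/output channels so the residual connection applies; the expansion factor of 6 follows the MobileNetV2 paper, and other details are assumptions.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of a MobileNetV2-style inverted residual block:
    a 1x1 conv expands the channels, a 3x3 depthwise conv extracts
    features, and a linear (no activation) 1x1 conv projects back."""

    def __init__(self, ch: int, expand: int = 6):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False),            # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),            # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, ch, 1, bias=False),            # linear bottleneck
            nn.BatchNorm2d(ch),                              # no ReLU: keeps negatives
        )

    def forward(self, x):
        return x + self.block(x)  # residual (stride 1, same channel count)
```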

MobileNetV3 [17] uses a novel architecture obtained by a combination of complementary search techniques: hardware-aware network architecture search combined with the NetAdapt algorithm tunes the network structure so that it runs efficiently on embedded devices such as mobile phone CPUs. These researchers took the lead in exploring how automatic search algorithms and network design can work together, using complementary methods to improve network performance.

After a systematic study of model scaling, researchers proposed EfficientNet [18]. They found that balancing the depth, width, and input resolution of a neural network improves performance, and they therefore proposed a new design method that uniformly scales all three dimensions with a single, simple compound coefficient. They also showed that this method remains effective on MobileNet and ResNet.
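The compound scaling rule can be summarized in a few lines. In the sketch below, the coefficients alpha = 1.2, beta = 1.1, and gamma = 1.15 are the values reported in the EfficientNet paper (found by grid search under alpha * beta**2 * gamma**2 ~= 2); phi is the user-chosen compound coefficient.

```python
# Sketch of EfficientNet-style compound scaling.
alpha, beta, gamma = 1.2, 1.1, 1.15  # base coefficients from the EfficientNet paper

def compound_scale(phi: float):
    depth = alpha ** phi        # multiplier on the number of layers
    width = beta ** phi         # multiplier on the number of channels
    resolution = gamma ** phi   # multiplier on the input image size
    return depth, width, resolution

print(compound_scale(1))  # ~ (1.2, 1.1, 1.15): the EfficientNet-B1 multipliers
```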

YOLO [19] is a typical one-stage object detection network. Unlike other detection networks, it formulates detection as a regression problem. A single convolutional neural network is applied to the full image; the image is divided into a grid, and the network predicts class probabilities and bounding boxes for each grid cell. Finally, non-maximum suppression filters the redundant bounding boxes for each object class, as sketched below. Compared with other object detection networks, this unified architecture completes detection tasks more efficiently.
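For clarity, here is a minimal NumPy sketch of greedy non-maximum suppression, applied per class; the box format (x1, y1, x2, y2) and the IoU threshold are assumptions.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """Greedy NMS sketch. boxes: (N, 4) as (x1, y1, x2, y2);
    returns indices of the kept boxes, highest score first."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes overlapping too much
    return keep
```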

To improve detector performance, YOLOv3 [20] uses a fully convolutional network to extract image features. Many residual structures are used throughout the network, and strided convolutions replace pooling operations to avoid the information loss that pooling introduces. To better extract image features, the network also uses a multiscale feature fusion scheme similar to FPN.
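As an illustration, the snippet below contrasts a learned stride-2 convolution with the fixed pooling layer it replaces; the channel numbers and the LeakyReLU slope are illustrative assumptions.

```python
import torch.nn as nn

# A stride-2 convolution downsamples while learning its own filter:
downsample_conv = nn.Sequential(
    nn.Conv2d(64, 128, 3, stride=2, padding=1, bias=False),  # learned downsampling
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.1, inplace=True),
)

# ...versus the fixed, parameter-free pooling it replaces:
downsample_pool = nn.MaxPool2d(kernel_size=2, stride=2)
```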

Gaussian YOLOv3 [21] not only models the bounding boxes of YOLOv3 with Gaussian parameters but also redesigns the detector's loss function. The proposed algorithm runs in real time and improves detection performance, making it one of the most representative one-stage detectors of its generation. In addition, the authors proposed a predicted localization uncertainty that indicates the reliability of each bounding box. Using this uncertainty during detection significantly reduces the number of false positives and increases the number of true positives, improving the overall performance of the detector.

YOLOv5 [22] uses residual networks in both the backbone and the neck; this enhances the network's ability to fuse features and retain richer feature information. YOLOX [23] modified the YOLO detector to be anchor-free and implemented advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA, to achieve state-of-the-art performance across various object detection models.

2.2. Existing Vehicle Datasets

Several vehicle datasets exist. For example, BDD100K [24] is the largest dataset of driving videos, with 100,000 videos and 100,000 images, and is used to evaluate progress in the image recognition algorithms needed for autonomous driving. The images cover a wide range of geographic, environmental, and weather conditions, making the dataset useful for training models that are less likely to be surprised by new conditions. However, it distinguishes only four types of cars. Researchers typically use this dataset to detect vehicles, road signs, pedestrians, drivable areas, and other aspects of road scenes.

UA-DETRAC [25] is a large-scale vehicle dataset covering four weather conditions: cloudy, night, sunny, and rainy. It is suitable for vehicle detection and vehicle tracking tasks. The dataset was collected mainly with surveillance cameras mounted on overpasses, and 8,250 vehicles and 1.21 million object boxes were manually annotated. Vehicle types are divided into four classes: car, bus, van, and other.

The images in these existing vehicle datasets were captured by cameras on aerial vehicles or in traffic monitoring systems. As shown in Figure 1, such images do not clearly show the vehicle license plates, so no further information about each vehicle can be extracted from them. These datasets are therefore not suitable for road violation detection implemented on a vehicle-mounted mobile platform.

Therefore, we created a vehicle dataset called VOC2019. We collected 20,000 pictures of vehicles on urban roads and classified the vehicles into seven types. Further details are given in Section 3.3.

3. Approach

In this section, we first introduce the YOLOX network, an enhanced member of the YOLO series and a high-performance detector. Second, we describe our multiscale convolution method for extracting features from the feature map, which uses fewer parameters without degrading detection performance. Third, we introduce our own vehicle dataset, VOC2019.

3.1. YOLOX: A New High-Performance Detector

In computer vision object detection, the conflict between classification and regression is well known, so most one-stage and two-stage detectors use decoupled heads for classification and localization. However, while the backbones and feature pyramids of the YOLO series have evolved (e.g., FPN and the path aggregation network (PAN)), their detection heads have remained coupled, as shown in Figure 2.

For each FPN level, the decoupled head first uses a 1x1 conv layer to reduce the feature channels to 256. It then has two parallel branches, one for classification and one for regression, with two 3x3 conv layers each; an intersection-over-union (IoU) branch is added alongside the regression branch.
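This structure can be sketched in PyTorch as follows; the SiLU activation and batch normalization are common YOLOX choices but should be read as assumptions rather than the exact implementation.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Sketch of a YOLOX-style decoupled head for one FPN level:
    a 1x1 conv reduces the channels to 256, then two parallel branches
    (two 3x3 convs each) handle classification and regression, with an
    IoU branch alongside regression."""

    def __init__(self, in_ch: int, num_classes: int, hidden: int = 256):
        super().__init__()
        def conv(cin, cout, k):  # conv + BN + SiLU (assumed activation)
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(cout), nn.SiLU(inplace=True))
        self.stem = conv(in_ch, hidden, 1)
        self.cls_branch = nn.Sequential(conv(hidden, hidden, 3), conv(hidden, hidden, 3))
        self.reg_branch = nn.Sequential(conv(hidden, hidden, 3), conv(hidden, hidden, 3))
        self.cls_pred = nn.Conv2d(hidden, num_classes, 1)
        self.reg_pred = nn.Conv2d(hidden, 4, 1)   # box offsets and size
        self.iou_pred = nn.Conv2d(hidden, 1, 1)   # IoU branch

    def forward(self, x):
        x = self.stem(x)
        cls_feat, reg_feat = self.cls_branch(x), self.reg_branch(x)
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.iou_pred(reg_feat)
```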

Experiments in that work show that a coupled detection head may reduce detector performance, while replacing the YOLO head with a decoupled one greatly improves convergence speed. The decoupled head is also essential for the end-to-end version of YOLOX: end to end, the average precision (AP) decreased by 4.2% with the coupled head, whereas the decrease was only 0.8% with a decoupled head. The lightweight decoupled head does add 1.1 ms of inference time per batch on an NVIDIA V100 graphics card (11.6 vs. 10.5 ms), but its detection performance justifies this cost.

YOLOX also uses Mosaic and MixUp to improve detection performance. Mosaic is an augmentation strategy proposed in ultralytics-YOLOv3 that effectively improves detector performance and is also widely used in YOLOv4, YOLOv5, and other detectors. MixUp was originally designed for image classification tasks and was later adapted, as part of a bag-of-freebies approach, to the training stage of object detection. YOLOX adopts MixUp and Mosaic during training but disables them for the last 15 epochs. With these two augmentation strategies, ImageNet pretraining no longer brings a significant benefit to the detector.
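A minimal sketch of MixUp for detection is shown below, assuming the two images are NumPy arrays of the same shape; the Beta-distribution parameter alpha is an assumption, and the mixed label set simply keeps the boxes from both images.

```python
import numpy as np

def mixup(img_a, boxes_a, img_b, boxes_b, alpha: float = 1.5):
    """Blend two images and keep the boxes from both.
    img_a, img_b: (H, W, 3) arrays of the same shape;
    boxes_a, boxes_b: (N, 4) box arrays."""
    lam = np.random.beta(alpha, alpha)  # mixing weight
    mixed = lam * img_a.astype(np.float32) + (1 - lam) * img_b.astype(np.float32)
    return mixed, np.concatenate([boxes_a, boxes_b], axis=0)
```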

Anchor-free detectors have developed rapidly in the past two years, and their performance is now comparable to that of anchor-based detectors. An anchor-free mechanism significantly reduces the number of design parameters that need heuristic tuning, and fewer tricks (e.g., anchor clustering or grid-sensitive encoding) are needed to achieve good performance, making the detector, especially its training and decoding phases, considerably simpler.

Making YOLO anchor-free was quite simple. In YOLOX, the number of predictions per location was reduced from 3 to 1, directly predicting four values: two offsets with respect to the top-left corner of the grid cell and the height and width of the predicted box. The center of each object is assigned as the positive sample, and a predefined scale range designates the FPN level for each object. This modification reduced the detector's parameters and GFLOPs and improved its detection speed, with detection performance increasing to 42.9% AP.
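A minimal sketch of this anchor-free decoding for one FPN level is given below; the log-scale encoding of width and height is a common convention and an assumption here (requires PyTorch >= 1.10 for the meshgrid indexing argument).

```python
import torch

def decode_anchor_free(preds, stride: int):
    """Decode anchor-free predictions at one FPN level (YOLOX-style).
    preds: (N, H, W, 4) holding two center offsets relative to the grid
    cell plus log-scale width and height. Returns (N, H, W, 4) as
    (cx, cy, w, h) in image coordinates."""
    n, h, w, _ = preds.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()   # (H, W, 2) grid coordinates
    centers = (preds[..., :2] + grid) * stride     # offsets -> image coordinates
    sizes = preds[..., 2:].exp() * stride          # log-scale w/h -> pixels
    return torch.cat((centers, sizes), dim=-1)
```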

3.2. Cheap Operation for More Features

The efficient CNN structures built by ShuffleNet [26] rely on depthwise convolution or shuffle operations, but the remaining convolution layers still occupy considerable memory and require many FLOPs. Given the common redundancy in the intermediate feature maps computed by mainstream CNNs, as shown in Figure 3, we propose to reduce the resources required, i.e., the convolution filters used for generating them. In practice, given input data $X \in \mathbb{R}^{c \times h \times w}$, where $c$ is the number of input channels and $h$ and $w$ are the height and width of the input data, respectively, the operation of an arbitrary convolutional layer for producing $n$ feature maps can be formulated as

$Y = X * f + b$,   (1)

where $*$ is the convolution operation, $b$ is the bias term, $Y \in \mathbb{R}^{h' \times w' \times n}$ is the output feature map, which has $n$ channels, and $f \in \mathbb{R}^{c \times k \times k \times n}$ is the set of convolution filters in this layer. In addition, $h'$ and $w'$ are the height and width of the output data, and $k \times k$ is the kernel size of the convolution filters $f$. During this convolution procedure, the required number of FLOPs is $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$, which is often as large as hundreds of thousands since the number of filters $n$ and the number of channels $c$ are generally very large (e.g., 256 or 512).

Since the ghost module generates the same number of feature maps as an ordinary convolution layer, it can be dropped into existing neural network architectures to reduce the number of parameters and the computational cost of the framework. Here, we further analyze the savings in memory usage and the theoretical speed-up from employing the ghost module. As shown in Figure 3, the numbers of parameters for ordinary convolution and ghost convolution are calculated as follows:

$p_1 = n \cdot c \cdot k \cdot k$,   (2)

$p_2 = \frac{n}{2} \cdot c \cdot k \cdot k + \frac{n}{2} \cdot d_1 \cdot d_1$,   (3)

$p_3 = \frac{n}{4} \cdot c \cdot k \cdot k + \frac{n}{4} \cdot d_1 \cdot d_1 + \frac{n}{2} \cdot d_2 \cdot d_2$,   (4)

where $p_1$ is the number of parameters for the ordinary convolution operation, $p_2$ is the number for ghost convolution, and $p_3$ is the number for the Ghost-4 module. $c$ is the number of channels of the input feature map, and $n$ is the number of channels of the output feature map. $k$ is the kernel size of the convolution filters, and $d_1$ and $d_2$ are the kernel sizes of the different depthwise separable convolutions. According to Equations (2), (3), and (4), for the same output feature map, the number of parameters in the convolutional ghost module is only about half that of a traditional CNN, and the number of parameters of the convolutional Ghost-4 module is only about a quarter.
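As a quick sanity check on these ratios, the short computation below evaluates Equations (2)-(4) for illustrative layer sizes; the concrete values of c, n, k, d1, and d2 are assumptions.

```python
# Parameter-count comparison following Equations (2)-(4) above.
c, n, k = 256, 256, 3   # input channels, output channels, primary kernel size
d1, d2 = 3, 3           # cheap depthwise kernel sizes (assumed)

p1 = n * c * k * k                                                    # Eq. (2)
p2 = (n // 2) * c * k * k + (n // 2) * d1 * d1                        # Eq. (3)
p3 = (n // 4) * c * k * k + (n // 4) * d1 * d1 + (n // 2) * d2 * d2   # Eq. (4)

print(f"ghost / ordinary  : {p2 / p1:.3f}")   # ~0.50
print(f"Ghost-4 / ordinary: {p3 / p1:.3f}")   # ~0.25
```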

3.3. Vehicle Dataset

In this work, we captured 20,000 vehicle images in Wenzhou, China. MV-CA050-10GM/GC digital cameras with 5-megapixel sensors captured images of city roads under different daytime weather conditions. For each image, we manually labeled the position (x, y) of each corner of an irregular quadrilateral around the target; these corner annotations can be used to rectify the annotated image patch. Figure 4 shows some images from our dataset together with their annotations. Because they were captured on city roads, the backgrounds are complicated, typically showing trees, people, houses, roads, traffic signs, bikes, and so on, and the target objects appear at very different sizes. These properties make the dataset a useful testbed for object detection research.

We annotated each vehicle in our dataset with its type. The numbers of vehicles in each of the seven classes are given in Table 1. Figure 5 shows the imbalance across the types, which poses an additional challenge for our research.

We also annotated the color of each vehicle, using one of ten colors. Table 2 and Figure 6 show the number of vehicles of each color. In this experiment, however, we only detected the vehicle type and did not use the color annotations; in future work, we will attempt to detect vehicle color as well.

Tables 1 and 2 show that the distributions of both vehicle types and colors are extremely imbalanced. About 51% of the vehicles are annotated as cars, whereas only 2.3% are trailers, mainly because trailers are banned from the roads in Wenzhou.

The sizes of the bounding boxes for different objects in the same image are also quite imbalanced. As shown in Figure 6, the nearest car covers many pixels, whereas more distant objects occupy only a few; the figure also illustrates the color imbalance. In real scenes, some of the more distant vehicles are only partly visible, which further increases the difficulty of object detection.

4. Experiments

On the VOC2007, VOC2012, and VOC2019 datasets, we tested the influence of different techniques on training accuracy. In the tests, we compared and analyzed the number of model parameters (parameters), the amount of computation (GFLOPs), the size of the model file (weight file, MB), the mean average precision at an IoU threshold of 50% (mAP50), and the mean average precision averaged over IoU thresholds from 50% to 95% in steps of 5% (mAP50_95).

In this object detection experiment, the training parameters are set as follows: the number of training epochs is 500, the initial learning rate is 0.001, and the training image size starts from a fixed initial value but is automatically adjusted within 10% during training. The training hardware runs Ubuntu 18.04; the server has 8 NVIDIA XP graphics cards with 12 GB of memory each, of which we used only one during training. In subsequent performance tests, the default settings are used except for the algorithm changes we describe.

In our work, we use an NVIDIA Jetson Nano development board as the hardware deployment platform. This device is a powerful small computer with a 128-core Maxwell GPU and 4 GB of 64-bit LPDDR4 memory with 25.6 GB/s of bandwidth. Users can deploy applications such as image classification, object detection, segmentation, and voice processing on the platform and run multiple neural networks in parallel, and the device's power consumption is low. Using an industrial camera as the image acquisition module, we implemented the complete pipeline of image acquisition, preprocessing, vehicle type detection, and result saving on the device.

From Tables 3 and 4, we can see that compared with YOLOX_s, the number of parameters of G-YOLOX decreased by almost 70%, the number of GFLOPs decreased by over 70%, and the size of the weight file decreased by over 40%, while the detection performance remained almost unchanged.

Figure 7 shows a practical application of our detector to road scenes. Since the detector is small, the whole model can run on an embedded mobile system and provide real-time detection. For conventional vehicles on the road, our algorithm correctly identifies the vehicle types.

Figure 8(a) illustrates the setup of our embedded platform, which comprises a power supply module, an image acquisition module, and a processing module. For each vehicle in an image, the system outputs the vehicle type together with its probability and coordinates. Figure 8(b) shows the power consumption of the embedded system with a 5 V DC supply (the vertical axis is power consumption): it is less than 5 W when idle and nearly 8 W when processing images.

5. Conclusion

In this work, we presented the G-YOLOX detector, which needs fewer parameters and has a smaller model size than the original YOLOX. It achieved good performance on the VOC2007 and VOC2012 datasets and on our vehicle dataset, VOC2019. The detector can be trained and used on a conventional GPU with 12 GB of VRAM, which makes it suitable for a broad range of uses, and it further demonstrates the viability of the one-stage anchor-free detector concept. G-YOLOX trades a slight decrease in detection performance for a drastic reduction in the number of parameters. In the future, we will further optimize the model for object detection.

Data Availability

In this paper, we use the public PASCAL VOC 2007 and PASCAL VOC 2012 datasets and our own vehicle type dataset, which we call VOC2019.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Zhejiang Provincial Major Research and Development Project of China under Grant 2022C01062 and in part by the Zhejiang Provincial Key Lab of Equipment Electronics. We thank the Senken group Co., Ltd. for providing the images in the vehicle dataset.