Abstract

Vehicle detection is one of the most important environment perception tasks for autonomous vehicles. Traditional vision-based vehicle detection methods are not accurate enough, especially for small and occluded targets, while light detection and ranging- (lidar-) based methods are good at detecting obstacles but are time-consuming and have a low classification rate for different target types. To address these shortcomings and make full use of the depth information provided by lidar and the obstacle classification ability of vision, this work proposes a real-time vehicle detection algorithm that fuses vision and lidar point cloud information. Firstly, the obstacles are detected by the grid projection method using the lidar point cloud information. Then, the obstacles are mapped to the image to obtain several separated regions of interest (ROIs). After that, the ROIs are expanded based on a dynamic threshold and merged to generate the final ROI. Finally, a deep learning method named You Only Look Once (YOLO) is applied to the ROI to detect vehicles. The experimental results on the KITTI dataset demonstrate that the proposed algorithm has high detection accuracy and good real-time performance. Compared with the detection method based only on YOLO deep learning, the mean average precision (mAP) is increased by 17%.

1. Introduction

The core technologies of unmanned driving include environmental perception, precise positioning, and path planning. Complex road environments, especially mixed traffic environments, pose great difficulties for the environment perception of autonomous vehicles. Moreover, vehicle detection is an important part of environment perception and plays a vital role in the safe driving of the unmanned ground vehicle (UGV).

Currently, the mainstream obstacle detection sensors are cameras and lidar. Cameras have been widely used in intelligent driving because of their low cost and their ability to capture the textures and colors of targets, which are especially important for recognizing traffic lights and traffic signs. In [1], vehicles were detected based on the ground plane assumption. In [2, 3], the histogram of oriented gradients (HOG) and Haar features, respectively, were used to detect vehicles. In [4], a scale-insensitive convolutional neural network (SINet) was proposed to detect vehicles of different sizes in images. However, a single camera cannot obtain depth information and is greatly affected by lighting conditions.

On the other hand, lidar can acquire distance and three-dimensional information over a long detection range. Lidar is not affected by illumination conditions and has strong robustness, so it has been widely used in environmental perception. In [5], geometry-based target recognition was used, but that method was highly influenced by environmental conditions, such as the presence of bushes, which could be misidentified. Moreover, in [6], lidar was used to detect and classify pedestrians: the point clouds acquired by the lidar were projected onto a grid and clustered by the global nearest neighbor (GNN) method, and for each candidate, an eigenvector was calculated and classified by a support vector machine (SVM) with a radial basis function (RBF) kernel. In [7], a heuristic algorithm based on the shape and location of the road network was used, and the clustered data were processed to select the objects that could be vehicles. In [8], pedestrians and cyclists on the road were classified by a saliency network. In [9], local features were extracted from 3D voxels, and the targets were classified by decision trees. In [10], a deep neural network model was used to classify objects at multiple scales. In [11], vehicles and pedestrians were classified using a Gaussian mixture model (GMM) classifier. In [12], a 3D object recognition method that does not require segmentation was proposed: 3D Haar-like local features were used instead of global features, and the local features were combined with AdaBoost training. To improve computational efficiency, a 3D summed-area table was used, and to reduce false alarms, the classifier was retrained multiple times using the false detections as negative samples. However, lidar-based methods also have drawbacks: the lidar used was very expensive, the amount of calculation was large, color information could not be obtained, and the resolution decreased with increasing distance, so only very sparse information on long-distance objects could be obtained.

The use of multisensor fusion schemes that combine camera and lidar in UGVs has been gradually increasing. There are three mainstream integration schemes: (1) detecting targets with lidar and camera separately and then merging the results [13, 14], (2) detecting targets in the images obtained by the camera and then using lidar for confirmation [15], and (3) using lidar to determine the region of interest and the camera for target detection [16, 17]. In [14], the obstacles in the image and the point cloud were processed separately to detect and classify pedestrians; target matching was then carried out, and the results were merged for pedestrians with the same detection and classification results. In [15], vehicle hypotheses were generated from the camera image using Haar-like features and AdaBoost, and the hypotheses were then confirmed by lidar. In [16], the authors used lidar to determine the region of interest and an SVM to classify the target in the image, but the proposed method could detect only the rear features of a vehicle, and the detection rate was low. In [17], vehicle features were extracted from the side view of a tire, so the detection rate for vehicles parallel to the autonomous vehicle was low. The main idea of this paper is the same as that of [16, 17]: lidar is used to extract the region of interest, and then YOLO, which has high detection accuracy, is used to detect the objects within the ROIs of the images.

In this paper, a max-min elevation grid map is first constructed from the point cloud, morphological expansion is applied to the obstacle cells, and the result is clustered using eight-connected region labeling. After that, the minimum bounding rectangle of each connected domain is calculated, and the points within these rectangles are projected onto the image. After projection, the coordinate extremes of each rectangle on the image are obtained; these regions are enlarged and merged to obtain the regions of interest in the image, and finally, YOLO is used to classify the obstacles. The process is shown in Figure 1.

2. Region of Interest Extraction

2.1. Build Grid Map

The Velodyne HDL-64 lidar used in this paper returns about 1.3 million points per second. This huge amount of data helps the environment perception of the UGV, but it poses a great challenge to the real-time performance of the algorithm. In response to this problem, this paper uses the max-min elevation map for environment modeling. Within the range of $M \cdot r = 64$ m in front of the vehicle, an $M \times N$ grid map with the bottom-left corner as the origin is established, where each cell is a square with side length $r$. We set $M$ to 320, $N$ to 160, and $r$ to 0.2 m. The projection of a point from the lidar coordinate system onto the grid coordinate system is given as follows:

$$i = \lfloor x / r \rfloor, \qquad j = \lfloor y / r \rfloor + N/2,$$

where $(x, y, z)$ is the lidar returned point in the lidar coordinate system. The lidar coordinate system takes the vehicle forward direction as the $x$-axis, the vertical direction as the $z$-axis, and the direction perpendicular to the forward direction to the right as the $y$-axis. $i$ and $j$ correspond to the grid coordinates of the point. When $i$ and $j$ satisfy $i < 0$ or $i \geq M$ or $j < 0$ or $j \geq N$, the point is dropped. An example of a grid map is shown in Figure 2.

After traversing all the points, the points within the grid map are projected onto the corresponding cells, and the height data are preserved. We consider that there is an obstacle in a cell when the difference between the maximum height and the minimum height within that cell is greater than a set threshold. The grid map provides a computationally efficient approximation to the terrain gradient in a cell [18], and the amount of data we need to deal with is greatly reduced from $n$ to $M \times N$, where $n$ is the number of points returned by the lidar.
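As an illustration of this step, the following NumPy sketch builds the max-min elevation grid with the parameter values quoted above and marks a cell as an obstacle when its height spread exceeds a threshold. The threshold value of 0.3 m is an assumed placeholder, not a figure taken from the paper, and the lateral centering of the grid is likewise an assumption.

```python
import numpy as np

M, N, R = 320, 160, 0.2          # grid size and cell side length (Section 2.1)
HEIGHT_DIFF_THRESH = 0.3         # assumed value; the paper does not quote its threshold

def build_grid_map(points):
    """points: (n, 3) array of lidar returns (x forward, y right, z up).

    Returns a boolean (M, N) occupancy map based on the max-min elevation.
    """
    # Grid indices: x maps to rows, y (shifted so the map is centred laterally) to columns.
    i = np.floor(points[:, 0] / R).astype(int)
    j = np.floor(points[:, 1] / R).astype(int) + N // 2

    # Drop points that fall outside the map.
    keep = (i >= 0) & (i < M) & (j >= 0) & (j < N)
    i, j, z = i[keep], j[keep], points[keep, 2]

    # Maximum and minimum elevation per cell.
    z_max = np.full((M, N), -np.inf)
    z_min = np.full((M, N), np.inf)
    np.maximum.at(z_max, (i, j), z)
    np.minimum.at(z_min, (i, j), z)

    occupied = z_max > -np.inf
    return occupied & ((z_max - z_min) > HEIGHT_DIFF_THRESH)
```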

Considering that the distance between the scanning lines of a multiline lidar increases with the scanning distance, obstacles may be scanned incompletely, especially in the vertical direction, and distant obstacles may be covered by only a few scanning lines. Therefore, this paper applies a morphological expansion (dilation) operation to the obstacle cells in the grid map using a fixed structuring element.
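A minimal dilation of the obstacle cells can be written with SciPy as in the sketch below; the 3 x 3 all-ones structuring element shown here is our assumption about the kernel, chosen to match the 8-neighbourhood used for clustering in Section 2.2.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def dilate_obstacles(obstacle_map):
    """Morphologically expand obstacle cells to compensate for sparse distant scans."""
    structure = np.ones((3, 3), dtype=bool)   # assumed 3 x 3 structuring element
    return binary_dilation(obstacle_map, structure=structure)
```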

Taking the scene shown in Figure 3 as an example, the top view of the original point cloud, the created grid map, and the grid map after the expansion operation are shown in Figures 4, 5, and 6, respectively.

2.2. Grid Clustering

In this paper, eight-connected region labeling is applied for clustering. If an obstacle cell is connected to another obstacle cell in any of the eight neighbouring directions (upper, lower, left, right, upper-left, lower-left, upper-right, or lower-right), the two cells are considered to belong to the same obstacle, so they are represented by the same label.
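Eight-connected labeling is available directly in SciPy; the sketch below assigns the same integer label to cells belonging to the same obstacle, which is one way the clustering described above can be realised.

```python
import numpy as np
from scipy.ndimage import label

def cluster_obstacles(obstacle_map):
    """Label 8-connected obstacle cells; cells sharing a label form one obstacle."""
    eight_connectivity = np.ones((3, 3), dtype=int)  # include diagonal neighbours
    labels, num_obstacles = label(obstacle_map, structure=eight_connectivity)
    return labels, num_obstacles
```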

2.3. Projection from Grid Map to Image

We first convert the grid coordinate extremes of each obstacle to the lidar coordinate system and then extract all the points of obstacle $k$ from the point cloud as follows:

$$P_k = \{(x, y, z) \in P \mid x_{\min}^{k} \leq x \leq x_{\max}^{k}, \ y_{\min}^{k} \leq y \leq y_{\max}^{k}\},$$

where $k$ denotes the obstacle number. After that, we project the extracted points onto the image, where $P$ denotes the matrix of input points and $T$ is the projection matrix from the point cloud to the image. Since the dimension of $P$ is $n \times 3$ and the dimension of $T$ is $4 \times 3$, it is necessary to fill 1 in the last column of $P$ to obtain $P'$. The projection formula is given as follows:

$$Q = P' T,$$

where $P'$ denotes the matrix of the input point cloud after extension with 1 in the last column and $Q$ denotes the nonnormalized homogeneous coordinates of the obstacle points.

The normalized homogeneous coordinates are obtained as follows:

$$U = Q_{12} \, / \, Q_{3},$$

where $Q_{12}$ represents a new matrix consisting of the first and second column elements of $Q$, and its dimension is $n \times 2$; $Q_{3}$ represents a new matrix composed of the elements of the third column of $Q$, and its dimension is $n \times 1$; the division is performed row by row. The dimension of $U$ is $n \times 2$, and it denotes the set of projection coordinates of the point cloud on the image. The image coordinate system takes the upper-left corner of the image as the origin, the right direction as the positive $u$-axis, and the downward direction as the positive $v$-axis.
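The projection and normalisation described above can be implemented in a few NumPy lines. In the sketch below, T is assumed to be the 4 x 3 lidar-to-image projection matrix obtained from the lidar-camera calibration (for KITTI it can be composed from the provided calibration files); the row-vector convention follows the formulas above.

```python
import numpy as np

def project_points_to_image(P, T):
    """Project lidar points onto the image plane.

    P: (n, 3) lidar points, T: (4, 3) projection matrix (row-vector convention).
    Returns U: (n, 2) pixel coordinates (u, v).
    """
    # Extend P with a column of ones to obtain homogeneous coordinates P'.
    P_h = np.hstack([P, np.ones((P.shape[0], 1))])   # (n, 4)
    Q = P_h @ T                                       # (n, 3) non-normalised coordinates
    # Normalise by the third column to obtain pixel coordinates.
    U = Q[:, :2] / Q[:, 2:3]
    return U
```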

The projection effect is shown in Figure 7.

2.4. Extract Regions of Interest

From the coordinates of the points in $U$, the extremes of the horizontal and vertical coordinates of each obstacle in the image coordinate system, $u_{\min}$, $u_{\max}$, $v_{\min}$, and $v_{\max}$, can be obtained. The rectangular region formed by these four extreme values indicates the region of interest of the obstacle in the image, as shown in Figure 8. For the lidar, the farther the scanning distance is, the lower the density of the point cloud is, so when an obstacle is far away from the UGV, the lidar can only scan a part of it. At the same time, the lidar scanning points cannot penetrate obstacles, so occlusion between obstacles causes the same problem. If only the abovementioned region of interest extraction method is used, often only a part of the target obstacle is extracted, which affects obstacle classification. To address this, this paper proposes a region of interest enlargement method based on a dynamic threshold.

For obstacle $k$, the average distance $d_k$ between the lidar scanning points belonging to the obstacle and the UGV is calculated, and the enlargement threshold $t_k$ is determined dynamically from $d_k$.

The rectangular region of interest that obstacle $k$ belongs to is then enlarged by the threshold $t_k$ on each side, and its width $w_k$ and height $h_k$ are recomputed from the enlarged coordinate extremes. The rectangular frame in Figure 9 denotes the enlarged region of interest.
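The sketch below computes each obstacle's pixel extremes and enlarges them by a distance-dependent margin. Since the paper's exact threshold formula is not reproduced here, a simple linear function of the average obstacle distance with an assumed coefficient `alpha` is used as a placeholder; the clipping to the image boundary is also our addition.

```python
import numpy as np

def enlarge_roi(U_k, distances_k, image_shape, alpha=0.5):
    """Build and enlarge the ROI of one obstacle.

    U_k: (m, 2) pixel coordinates of obstacle k, distances_k: (m,) point-to-UGV
    distances in metres, image_shape: (height, width). `alpha` and the linear
    threshold below are assumptions, not the paper's formula.
    """
    u_min, v_min = U_k.min(axis=0)
    u_max, v_max = U_k.max(axis=0)

    d_k = distances_k.mean()          # average distance of the obstacle
    t_k = alpha * d_k                 # placeholder dynamic threshold (pixels)

    h, w = image_shape[:2]
    u_min = max(0, u_min - t_k)
    v_min = max(0, v_min - t_k)
    u_max = min(w - 1, u_max + t_k)
    v_max = min(h - 1, v_max + t_k)
    return np.array([u_min, v_min, u_max, v_max])
```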

The regions of interest may overlap after enlargement. In this paper, the overlapping rectangles are merged into one region of interest whose boundary is the smallest rectangle enclosing all of the overlapping rectangles, i.e., its parameters are obtained by taking the minima of the lower coordinate extremes and the maxima of the upper coordinate extremes of the merged rectangles.
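Merging can be realised by repeatedly replacing any two overlapping rectangles with their enclosing rectangle, as in the following sketch.

```python
def overlaps(a, b):
    """Axis-aligned rectangles given as (u_min, v_min, u_max, v_max)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def merge_rois(rois):
    """Merge overlapping ROIs into their smallest enclosing rectangles."""
    rois = [list(r) for r in rois]
    merged = True
    while merged:
        merged = False
        for i in range(len(rois)):
            for j in range(i + 1, len(rois)):
                if overlaps(rois[i], rois[j]):
                    a, b = rois[i], rois[j]
                    # Replace the pair by its enclosing rectangle.
                    rois[i] = [min(a[0], b[0]), min(a[1], b[1]),
                               max(a[2], b[2]), max(a[3], b[3])]
                    del rois[j]
                    merged = True
                    break
            if merged:
                break
    return rois
```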

The merged region of interest is shown in Figure 10.

3. Obstacle Classification by YOLO

YOLO is a target detection algorithm that uses a single neural network to predict bounding boxes and class probabilities directly from the complete image in only one evaluation. Moreover, it can directly optimize detection performance end-to-end, so it has high real-time performance. The basic YOLO model can process images in real time at 45 frames/sec while achieving more than twice the mAP of other common real-time detectors. However, YOLO has a certain positioning error [19]. In this work, YOLO v3 is used to classify the obstacles. YOLO v3 combines various advanced methods to overcome the shortcomings of the previous two generations of the YOLO algorithm (low detection accuracy, poor performance on small objects, etc.) and improves the detection accuracy while guaranteeing excellent real-time performance. YOLO v3 draws on the idea of ResNet and builds on the Darknet-19 network of YOLOv2 to propose a new feature extraction network named Darknet-53. Regarding classification accuracy, Darknet-53 is close to ResNet-101 and ResNet-152, but much faster [20]. The running time of YOLO v3 on 320 x 320 pixel images is 22.2 ms, reaching 28.2 mAP, which is as accurate as the single shot multibox detector (SSD) but three times faster [21]. However, YOLO shows a certain decrease in the detection rate when targets are dense or occluded.
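When the detector is run on ROI crops rather than on the full image, the predicted boxes must be shifted back into full-image coordinates. The following sketch assumes a generic `detector` callable that returns boxes as (x1, y1, x2, y2, score, class) in crop coordinates; this is a hypothetical interface for illustration, not the API of any particular YOLO implementation.

```python
def detect_in_rois(image, rois, detector):
    """Run a detector on each ROI crop and map boxes back to image coordinates.

    `detector` is a hypothetical callable: crop -> list of
    (x1, y1, x2, y2, score, cls) boxes in crop coordinates.
    """
    detections = []
    for (u_min, v_min, u_max, v_max) in rois:
        crop = image[int(v_min):int(v_max), int(u_min):int(u_max)]
        for (x1, y1, x2, y2, score, cls) in detector(crop):
            # Shift the box by the ROI offset to return to full-image coordinates.
            detections.append((x1 + u_min, y1 + v_min,
                               x2 + u_min, y2 + v_min, score, cls))
    return detections
```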

3.1. Dataset Selection and Processing

In this work, we used the KITTI dataset [22] to train and verify the YOLO network. The KITTI dataset was jointly created by the Karlsruhe Institute of Technology (KIT) in Germany and the Toyota Technological Institute at Chicago to provide a benchmark for computer vision algorithms in autonomous driving environments. The KITTI dataset contains eight obstacle types: cars, vans, trucks, pedestrians, sitting persons, cyclists, trams, and miscellaneous objects. In this work, cars, vans, and trucks were considered to be the same type of obstacle (vehicles), and the other types of obstacles were ignored.
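The class remapping can be done directly on the KITTI label files, where the first token of each line is the object type and the fifth to eighth fields hold the 2D bounding box. A minimal sketch of keeping car, van, and truck as a single "vehicle" class and dropping everything else is shown below.

```python
VEHICLE_TYPES = {"Car", "Van", "Truck"}

def load_vehicle_boxes(label_path):
    """Read one KITTI label file and return 2D boxes for the merged 'vehicle' class."""
    boxes = []
    with open(label_path) as f:
        for line in f:
            fields = line.split()
            if fields[0] in VEHICLE_TYPES:
                # KITTI 2D bounding box: left, top, right, bottom (pixels).
                left, top, right, bottom = map(float, fields[4:8])
                boxes.append((left, top, right, bottom))
    return boxes
```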

3.2. Network Training

Before the training, we created the training and verification sets. We randomly divided all of the 7481 images in the KITTI training dataset into a training set and a verification set at a ratio of 7 : 3. The images provided by KITTI have a resolution of approximately 1242 x 375 pixels.
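A reproducible 7 : 3 random split can be obtained as in the short sketch below; the fixed seed and the zero-padded image indexing are illustrative assumptions.

```python
import random

def split_dataset(image_ids, train_ratio=0.7, seed=0):
    """Randomly split KITTI image ids into training and verification sets (7 : 3)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# Example: 7481 KITTI training images indexed 000000 to 007480.
train_ids, val_ids = split_dataset(f"{i:06d}" for i in range(7481))
```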

The main parameters of the experimental platform were as follows: CPU, Intel Xeon E5-2687W v4 at 3.00 GHz; memory, 128 GB; GPU, NVIDIA Quadro M4000. The deep learning framework was Keras.

The training parameters of the YOLO v3 are given in Table 1.

4. Results

4.1. Overall Results

In the test, the test set provided by the KITTI was used. The result of the original YOLO v3 algorithm is shown in Table 2(a). The result of vehicle detection using the proposed method is shown in Table 2(b). The precision and recall chart of the YOLO v3 algorithm and the proposed algorithm for vehicle detection on the KITTI test set are presented in Figures 11(a) and 11(b), respectively.

4.2. Comparison of Experimental Results in Sample Scenarios

Figures 12, 13, and 14 show some of the scenarios in the experiment. As can be seen in these figures, there was a large number of vehicles in each scene, and the occlusion was severe. Figure 12(b) shows the results of vehicle detection using the YOLO v3 algorithm, and Figure 12(c) shows the results of vehicle detection using the proposed algorithm.

5. Discussion

The experimental results showed that the proposed algorithm significantly improved the vehicle detection accuracy at all detection difficulty levels compared to the original YOLO v3 algorithm, especially for vehicles with severe occlusion. Under the easy, moderate, and hard difficulties, the average precision (AP) improved by nearly 12%, 20%, and 17%, respectively. On the KITTI training dataset, we calculated that the area of all ROIs accounts for 55% of the total image area, while the ROIs contain 96% of all ground-truth objects. At the same time, because the grid map reduces the dimensionality of the point cloud and only the points within the 64 m range in front of the vehicle are selected, the computational complexity of the algorithm is greatly reduced. Therefore, the time consumption of the proposed algorithm does not increase much compared to YOLO v3. We tested it on our experimental platform with 1000 frames of data randomly selected from the KITTI dataset; the average processing time per frame of the proposed algorithm was about 0.09 s, while that of the YOLO v3 algorithm was about 0.05 s.

We selected 50 images each at the easy, moderate, and hard difficulty levels from the KITTI test set and compared the obtained results with the results given in [16, 17], as shown in Table 3.

The experimental results show that the proposed algorithm has great advantages under various difficulty conditions compared with the algorithms proposed in [16, 17].

However, the proposed algorithm still has certain shortcomings. For instance, when the target vehicle is so far away from the UGV that it cannot be scanned by the lidar, the algorithm cannot detect it. At the same time, if a large vehicle such as a container truck is close to the autonomous vehicle, the ROI may cover only a portion of that vehicle even after enlargement, so the algorithm could fail to identify it.

6. Summary and Outlook

A vehicle detection method based on the multisensor fusion is proposed in this paper. Using the calibration relationship between the lidar and camera, the region of interest extracted by the lidar is projected into the image obtained by the camera, and the region of interest in the image is obtained and processed. Finally, the YOLO v3 algorithm is used to detect the vehicle in the region of interest. The effectiveness of the proposed algorithm is verified by experiments.

In our future work, we will optimize the extraction of the region of interest to achieve better target extraction. At the same time, different ROI sizes mean different input image sizes. Although they are resized to the same size at the input of YOLO, the proportion of the obstacle area in the image changes greatly, which decreases the detection accuracy of the model. To solve this problem, we will improve the YOLO algorithm or use other, more targeted networks to achieve better detection accuracy.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

H.W. and Y.C. did the methodology, X.L. and Y.L. worked on the software, and L.C. worked on the project administration of the study.

Acknowledgments

This research was funded by the National Key Research and Development Program of China (2018YFB0105003), the National Natural Science Foundation of China (U1764264, 51875255, U1664258, U1764257, 61601203, and 61773184), the Key Research and Development Program of Jiangsu Province (BE2016149), the Natural Science Foundation of Jiangsu Province (BK20180100), the Key Project of the Development of Strategic Emerging Industries of Jiangsu Province (2016-1094, 2015-1084), the Key Research and Development Program of Zhenjiang City (GY2017006), and the Overseas Training Program for Universities of Jiangsu Province.