Abstract

Object recognition based on LIDAR data is crucial in automotive driving and is the subject of extensive research. However, the lack of accuracy and stability in complex environments obstructs the practical application of real-time recognition algorithms. In this study, we propose a new real-time network for multicategory object recognition. Manually extracted bird’s eye view (BEV) features are adopted to replace the resource-consuming 3D convolutional operation. Besides the subject network, we designed two auxiliary networks that help the network learn pointwise and boxwise features, aiming to improve the accuracy of the category and bounding box predictions. The KITTI dataset was adopted to train and validate the proposed network. Experimental results showed that, for hard mode, the total average precision (AP) of the category reached 97.4%. For intersection over union (IoU) thresholds of 0.5 and 0.7, the total AP of regression reached 93.2% and 85.5%, respectively; in particular, the AP of car regression reached 95.7% and 92.2%. The proposed network also showed consistent performance on the Apollo dataset, with a processing duration of 37 ms. The proposed network exhibits stable and robust object recognition performance in complex environments (multiobject, unordered objects, and multicategory). However, it is sensitive to occlusion of the LIDAR system and insensitive to close large objects. The proposed multifunction method simultaneously achieves real-time operation, high accuracy, and stable performance, indicating its great potential value in practical applications.

1. Introduction

Autonomous driving is a futuristic technology that will transform mobility industries and ease the burden of driving. Autonomous driving is currently supported by relatively mature planning, decision-making, and algorithm implementation but is mainly hindered by poor perception. As an efficient and precise remote sensing technique, LIDAR has been widely applied in real-time intelligent systems, such as self-driving vehicles [1, 2]. The data acquired from LIDAR are point clouds, i.e., sets of points containing coordinates and other feature-related information, such as reflectivity. Detecting objects accurately within a point cloud is crucial and has been a widespread research subject. However, the key challenge is that the raw point cloud data are irregular, unstructured, and unordered. Consequently, processing methods that require data in a regular form are not suitable for direct application.

The convolution operation is an efficient approach for extracting deep features [3–7], but it requires a regular grid as input, which a point cloud does not satisfy. Therefore, the first step is to transform the unstructured point cloud into a regular form. The structured data can be graphs [8–12], ordered points [13–17], or voxels [18–22]. In graph-based methods, nodes represent points and edges represent the relationships between points; however, this abstract representation is difficult to interpret. Although point-based methods can achieve better performance by taking the raw point clouds as their input and predicting bounding boxes based on each point, their inference time generally cannot meet the demands of a real-time system. Therefore, they are restricted primarily to offline analysis. Voxels are popular because they have a clear physical structure similar to images. VoxelNet [20] is a classic voxelization method that performs impressively in 3D object recognition tasks. However, its strong performance relies heavily on several 3D convolution operations, resulting in a time- and memory-consuming process. To avoid 3D convolution operations, a structure that replaces 3D voxels with pillars, thereby erasing the vertical dimension, has been reported [21]. This method, also called the bird’s eye view method, leads to an improved processing speed, although its performance is unstable due to the lack of vertical information. Because point clouds also lack color information, this instability is more severe than in image recognition tasks [22]. A compromise is to retain the necessary vertical information concisely by using the maximum height, the density of the point set, and the reflectivity of the highest point to express the pillar feature [23].

To achieve higher precision of the bounding boxes, RGB images can be fused with LIDAR data [24], thus obtaining a richer expression of the environment. However, introducing camera data means that such methods depend on the trigger consistency of the two data sources [25, 26] and on the calibration accuracy between the camera and LIDAR coordinate systems, which may cause robustness problems in practical applications. Inspired by the better performance of point-based methods, an alternative method aggregates the voxels into a small number of key points [27], thus combining the advantages of both voxel- and point-based methods. This method adopts farthest point sampling (FPS) to select key points. However, FPS is extremely time-consuming, especially for a large-scale scene, and the sampling time is not discussed in [27]. Therefore, finding an optimal balance between performance and processing time remains a challenge.

Most researchers use only a single category of data when training networks and assign independent evaluation indexes to the recognition of each category. This excludes the interference of other categories from the results. However, it deviates from the practical requirement that multiple categories be recognized in a single forward propagation, so it cannot reflect the actual effect in application.

The present study focused on developing a LIDAR-based 3D object recognition method for road scenes. Considering the significant progress in image recognition, we aim to transfer advanced image recognition methods to the point cloud recognition task. Hence, the proposed method is a voxel-based recognition method that can simultaneously predict multiple object categories. We evaluated the method based on 3D localization and bounding box precision, object recognition accuracy, and processing time. Unlike most existing methods that rely heavily on 3D convolutional operations, we consider that the bird’s eye view (BEV) based method has not yet exhausted its performance potential. Thus, we improved the head network and designed additional auxiliary networks to improve the prediction accuracy. The network was trained and evaluated on the KITTI dataset and its benchmark. The results verify that the new components are beneficial to the network.

The rest of this paper is structured as follows: Section 2 presents the proposed network architecture; Section 3 outlines the implementation of the proposed network and presents the results; Section 4 discusses the specific recognition effects that are not obvious in the evaluation indicators; and Section 5 presents our conclusions.

2. Methods

The proposed network is divided into preprocessing, backbone network, neck network, head network, and auxiliary networks:
(1) The preprocessing stage transforms the unordered point cloud into ordered data.
(2) The backbone and neck networks are used to extract scene features.
(3) The head network transforms the scene features into predicted outputs.
(4) The auxiliary networks are set up to help the subject network learn pointwise and boxwise features. They do not participate in the prediction process, so they do not add any computational burden to the network at inference time.

2.1. Preprocessing

We follow the method outlined in [23]. First, the irregular points are transformed into a pillar map according to their locations. Besides the three channels used in [23], we add a channel containing the pillar’s minimum height, which expresses the difference between the edge and the inside of an object. Therefore, four channels represent the vertical distribution of the points in each pillar: the first channel records the number of points in the pillar; the second and third record the maximum and minimum vertical coordinates of the points in the pillar; and the fourth records the reflectivity of the highest point in the pillar. Finally, a four-channel bird’s eye view (4C-BEV) is obtained as the network input. This method is essentially equivalent to taking the upper cover shell of the spatial point cloud from a top-down perspective. Because of Earth’s gravity, very few objects are suspended in the air, and obstacles can usually be distinguished clearly by direct observation of such shells. The channels’ values need to be normalized, particularly the first channel, because the point cloud density decreases with distance. Here, a distance factor Kd is added so that pillars at different locations have a similar degree of characteristic expression, where Np represents the number of points in each pillar and Kd is a distance-dependent factor with coefficient ke.
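For illustration, the following is a minimal NumPy sketch of this pillar encoding. The four channel definitions follow the description above; the grid extents, the count scaling, and the exact form of the distance factor Kd are assumptions, since the paper’s normalization equations are not reproduced here.

```python
import numpy as np

def build_4c_bev(points, x_range=(0.0, 41.6), y_range=(-20.8, 20.8),
                 resolution=0.2, ke=0.05):
    """Encode a point cloud of rows [x, y, z, reflectivity] into a 4-channel BEV.

    Channels: 0 = normalized point count, 1 = maximum height, 2 = minimum height,
    3 = reflectivity of the highest point in each pillar.
    """
    h = int((x_range[1] - x_range[0]) / resolution)   # 208 for the ranges above
    w = int((y_range[1] - y_range[0]) / resolution)   # 208 for the ranges above
    bev = np.zeros((h, w, 4), dtype=np.float32)

    # Keep only the points inside the BEV range.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    rows = ((pts[:, 0] - x_range[0]) / resolution).astype(np.int32)
    cols = ((pts[:, 1] - y_range[0]) / resolution).astype(np.int32)

    for r, c, p in zip(rows, cols, pts):
        z, refl = p[2], p[3]
        bev[r, c, 0] += 1.0                              # point count
        if bev[r, c, 0] == 1.0 or z > bev[r, c, 1]:
            bev[r, c, 1] = z                             # maximum height
            bev[r, c, 3] = refl                          # reflectivity of highest point
        if bev[r, c, 0] == 1.0 or z < bev[r, c, 2]:
            bev[r, c, 2] = z                             # minimum height

    # Hypothetical distance factor Kd: far pillars receive fewer points, so the
    # count channel is scaled up with distance before being clipped to [0, 1].
    jj, ii = np.meshgrid(np.arange(w), np.arange(h))
    cx = x_range[0] + (ii + 0.5) * resolution
    cy = y_range[0] + (jj + 0.5) * resolution
    kd = 1.0 + ke * np.sqrt(cx ** 2 + cy ** 2)
    bev[:, :, 0] = np.clip(bev[:, :, 0] * kd / 32.0, 0.0, 1.0)
    return bev
```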

As shown in Figure 1, the objects are clearly visible to the naked eye in the 4C-BEV, indicating that this method preserves the point cloud’s vertical information while compressing the data efficiently.

2.2. Backbone Network

The 4C-BEV is entirely consistent with an image in terms of data structure. Therefore, many popular backbone networks for image recognition can be used directly, such as ResNets [28], CSPDarknet53 [3], or VGG16 [29]. However, there are several differences between object recognition in images and in point clouds. First, multiple scales are not necessary. In image recognition, the perspective phenomenon is one of the main factors considered in network design: the network contains output nodes representing different scales, or several anchor boxes are predefined to represent different scales. When the feature map is constructed in LIDAR coordinates, objects keep their real-world sizes, so this perspective phenomenon is not encountered. Second, the relative scales of objects differ. In image recognition, the performance generally varies between large- and small-scale objects when the same network is used. Typically, the area of interest appears near the observer, which means the identification accuracy of large targets is more important than that of others. The image passes through a multilayer network, which significantly reduces its scale and improves the recognition of large-scale objects. Taking CSPDarknet53 [3] as an example, after an input image is propagated forward, the scales of the three outputs are reduced by factors of 8, 16, and 32, respectively. With this encoding approach, the feature vector at each position fuses features from a broader receptive field, helping identify large-scale objects. However, for LIDAR-based tasks, the objects are relatively small compared with the scene size, and a heavily downsampled output feature map degrades recognition accuracy. Third, the orientation of the bounding boxes needs to be predicted. Maximum pooling layers play an essential role in a backbone network because they prevent overfitting and improve generalization; however, they also enhance rotation invariance, which conflicts with the need to predict orientation.

The backbone network architecture resembles a tiny version of CSPDarknet53. Figure 2 illustrates the modified architecture. We use Conv (k, s, p, cout) to represent a 2D convolutional operator, where cout is the number of output channels and k, s, and p are the kernel size, stride, and padding size, respectively. The “Conv” operation consists of a 2D convolutional operator, a group normalization (GN) layer, and an activation function layer in sequence when it acts as a convolutional middle layer. We use several small residual blocks to fuse the features of the current layer with those of the previous layer, and big residual blocks to fuse shallow features with deep features. The nodes of the backbone network measure h × w × c, where h and w are the spatial dimensions and c is the channel dimension. The input is a feature map with a fixed size of h × w × 4. The backbone network has two outputs: one with a fixed size of h/2 × w/2 × 128 and the other with a fixed size of h/4 × w/4 × 512.
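As a minimal sketch, assuming TensorFlow 2.11 or later (where tf.keras.layers.GroupNormalization is available), the “Conv” middle layer and a small residual block could be written as follows. The group count and the LeakyReLU activation are assumptions, and the full wiring of Figure 2 is not reproduced; the 208 × 208 × 4 input matches the 4C-BEV of Section 2.8.

```python
import tensorflow as tf

def conv_block(x, k, s, p, c_out, groups=16):
    """Conv(k, s, p, c_out): 2D convolution + group normalization + activation."""
    if p > 0:
        x = tf.keras.layers.ZeroPadding2D(padding=p)(x)
    x = tf.keras.layers.Conv2D(c_out, kernel_size=k, strides=s, use_bias=False)(x)
    x = tf.keras.layers.GroupNormalization(groups=groups)(x)
    return tf.keras.layers.LeakyReLU(0.1)(x)   # activation choice is an assumption

def small_residual_block(x, c_out):
    """Fuse features of the current layer with those of the previous layer."""
    shortcut = x
    x = conv_block(x, k=1, s=1, p=0, c_out=c_out // 2)
    x = conv_block(x, k=3, s=1, p=1, c_out=c_out)
    return tf.keras.layers.Add()([shortcut, x])

# Example: a 4C-BEV input downsampled by 2 on its way to the h/2 x w/2 x 128 output.
inputs = tf.keras.Input(shape=(208, 208, 4))
x = conv_block(inputs, k=3, s=1, p=1, c_out=32)
x = conv_block(x, k=3, s=2, p=1, c_out=128)        # stride-2 downsampling
x = small_residual_block(x, c_out=128)
backbone_stub = tf.keras.Model(inputs, x)
```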

2.3. Neck Network

The role of the neck network is to perform further feature extraction and connect the backbone to the head. Figure 3 shows the architecture of our neck network. The residual blocks are retained to aid further feature extraction. Upsampling operations are used to unify the scale of the feature map. Although there is no perspective phenomenon, different categories of objects have different sizes, and multiscale features play a positive role in the network.
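A minimal sketch of the scale-unifying step is given below, reusing conv_block and small_residual_block from the backbone sketch above. It assumes that the deeper h/4 output is upsampled and concatenated with the h/2 output; the channel counts and the exact residual layout of Figure 3 are assumptions.

```python
def neck_stub(feat_half, feat_quarter):
    """Unify the two backbone outputs to the h/2 x w/2 scale and fuse them."""
    up = tf.keras.layers.UpSampling2D(size=2)(feat_quarter)        # h/4 -> h/2
    fused = tf.keras.layers.Concatenate(axis=-1)([feat_half, up])  # multiscale fusion
    fused = conv_block(fused, k=3, s=1, p=1, c_out=256)            # further extraction
    return small_residual_block(fused, c_out=256)
```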

2.4. Head Network

The head network is custom-designed for our specific 3D object recognition task and is divided into three parts. The first part is used for confidence prediction, with the sigmoid function, σ(x) = 1/(1 + e^(−x)), used to limit the result range to [0, 1].

Two channels are assigned to each category, representing the regression confidence of this category based on horizontal and vertical anchors.

The second part is used for predicting bounding boxes; the spatial position and physical dimensions are predicted here. As there is no perspective effect, it is reasonable to regress the bounding boxes against a standard reference value. Therefore, we predefined an anchor map as the standard reference, in which each position has 2 × Nc anchors, where Nc is the number of predicted categories. In general, the orientation and border predictions are conducted simultaneously by regressing the angle directly [20, 21, 23]. This representation cannot express the closeness of the two ends of the angular interval: orientations near the two ends are physically similar yet numerically produce the greatest divergence, which is incorrect. To keep the orientation prediction continuous, we adopt a method that combines anchor-free [30] and anchor-based [20, 21, 23] approaches. Six channels are assigned to represent the regression parameters (except for orientation) of the two anchors at each position, and the sine and cosine values are used to represent the orientation indirectly.
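The indirect orientation encoding can be illustrated as follows. Decoding via atan2 is standard; treating the network’s raw outputs directly as sine/cosine values is an assumption.

```python
import numpy as np

def encode_orientation(theta):
    """Represent a yaw angle by its sine and cosine so the target stays continuous."""
    return np.sin(theta), np.cos(theta)

def decode_orientation(sin_pred, cos_pred):
    """Recover the yaw angle from predicted sine/cosine values."""
    return np.arctan2(sin_pred, cos_pred)

# Angles near the two ends of the interval (-pi, pi] are numerically far apart but
# physically close; the sine/cosine pair removes that discontinuity.
print(decode_orientation(*encode_orientation(np.pi - 0.01)))    # ~  3.13
print(decode_orientation(*encode_orientation(-np.pi + 0.01)))   # ~ -3.13
```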

Most studies offer little discussion of multicategory object prediction. By default, when predicting multicategory objects, regression parameters are produced for all categories, which leads to low information utilization (only 1/Nc of the information is useful) and greatly reduces convergence efficiency. In this study, the network is designed to give only one set of border predictions at each position. The category at the box center is determined according to the ground truth; the categories of the other positions are determined by the overlap between the standard anchor and the ground truth bounding box.

For convenience, it is assumed that the category is determined and that there are two anchors at each position. The ground truth bounding box regression value Rgt of one anchor at each location is expressed relative to the anchor parameters, where A denotes the parameters of the anchor at that position.
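The exact residual encoding is not reproduced above. A typical form used by the cited anchor-based methods [20, 21, 23], which we assume here for illustration, is

\[
\Delta x = \frac{x_{gt} - x_a}{d_a}, \quad
\Delta y = \frac{y_{gt} - y_a}{d_a}, \quad
\Delta z = \frac{z_{gt} - z_a}{h_a}, \quad
\Delta l = \log\frac{l_{gt}}{l_a}, \quad
\Delta w = \log\frac{w_{gt}}{w_a}, \quad
\Delta h = \log\frac{h_{gt}}{h_a},
\]

where the subscripts gt and a denote the ground truth box and the anchor A, respectively, and \(d_a = \sqrt{l_a^2 + w_a^2}\) is the diagonal of the anchor’s base.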

The third part is used for category prediction, for which Nc channels are assigned. The softmax function, p_i = e^(x_i) / Σ_j e^(x_j), is used to transform the result into Nc probabilities, each limited to the range [0, 1].

2.5. Auxiliary Network

Because of their ability to capture more detailed pointwise characteristics, point-based methods usually achieve higher accuracy than voxel-based methods. To enhance our method’s accuracy, pointwise features were introduced into the network. Following SA-SSD [31], the pointwise feature learning network is set up as an auxiliary network that works only during training and plays no role in inference, thus avoiding the additional computational overhead of extra feature extraction. The penultimate layer of the neck network serves as the feature extraction layer for the auxiliary network, which is ultimately a voxelwise category prediction network. The auxiliary network is elaborated in Figure 4.

The accuracy of border regression depends strongly on the accuracy of category prediction. Therefore, the primary task is to improve the accuracy of object category prediction through the added category information of the point cloud. Unlike the category prediction part of the head network, which focuses only on the category prediction of the bounding boxes’ center voxels, the auxiliary network focuses on the category prediction of the voxels around the bounding box center. Since each voxel contains only one highest point, voxel features are equivalent to pointwise features. To save memory, we randomly extract no more than 1000 points inside and no more than 250 points outside the bounding boxes. We recreate the voxel category label depending on whether its highest point lies within a bounding box. The whole operation is similar to an additional “drop-out” process, which improves the generalization performance of the network.

The second task of the auxiliary network is to enhance the accuracy of bounding box regression. In this step, we randomly sample no more than 50 highest points within each bounding box and use the inverse distance to the bounding box center as each point’s weight. The weighted average and the maximum of the pointwise features among all sampled points in the bounding box region are combined to express a boxwise feature.
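A NumPy sketch of this boxwise feature is given below. It assumes that point_feats holds the pointwise features of the (at most 50) sampled highest points inside one bounding box and that the two pooled vectors are concatenated; the small epsilon guarding the division is also an assumption.

```python
import numpy as np

def boxwise_feature(point_feats, point_xy, box_center_xy, eps=1e-6):
    """Combine weighted-average and max pooling of pointwise features in one box.

    point_feats:   (n, c) pointwise features of the sampled points inside the box.
    point_xy:      (n, 2) BEV coordinates of those points.
    box_center_xy: (2,)   BEV coordinates of the bounding box center.
    """
    # The inverse distance to the box center is used as the weight of each point.
    dist = np.linalg.norm(point_xy - box_center_xy, axis=1)
    weights = 1.0 / (dist + eps)
    weights /= weights.sum()

    weighted_avg = (weights[:, None] * point_feats).sum(axis=0)    # (c,)
    max_pooled = point_feats.max(axis=0)                           # (c,)
    return np.concatenate([weighted_avg, max_pooled])              # (2c,) boxwise feature
```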

2.6. Loss

The loss contains a central part and an auxiliary part. The central part contains the confidence loss, regression loss, and category loss. The auxiliary part contains the point category loss and the box regression loss. We adopted the smooth L1 function [5] to calculate the bounding box regression loss, defined as 0.5x² when |x| < 1 and |x| − 0.5 otherwise.

The smooth L1 function converges stably for large deviations and adequately for small ones. The category and confidence predictions are converted into probability values within the interval [0, 1], so the cross-entropy function, Lce = −Xgt log(Xp) − (1 − Xgt) log(1 − Xp), is applicable to their losses, where Xp and Xgt are the predicted value and ground truth value, respectively.
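For reference, minimal NumPy versions of the two loss terms are shown below using their standard definitions; the paper’s masking and reductions (and the categorical form used for the one-hot category labels) are applied elsewhere.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1: quadratic for small residuals, linear for large ones."""
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)

def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy between predicted probabilities in [0, 1] and 0/1 labels."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob))
```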

The ground truth values of the category are labeled in one-hot form. The focal loss [32] can address the problem that, when the proportions of positive and negative samples are unbalanced, the positive samples are submerged in the negative ones. Although the distributions of positive and negative labels are extremely uneven, the ratio of positive to negative samples is specified explicitly in our loss. To avoid the focal loss affecting the convergence rate, we do not adopt it.

Not all losses at every position of a feature map are calculated. Grids far from the center of an object are inaccurate and can be neglected. The positive confidence label is vital because it can be used as a mask to filter out untrusted data from the loss calculation. In Section 2.4, an anchor map was established. We exclude the angle parameters and determine the confidence by calculating the intersection over union (IoU) between the ground truth bounding box and the anchors in the map. Because the confidence feature map does not depend on the vertical position, the vertical projections of the ground truth bounding box and the anchors onto the BEV plane are used when calculating the IoU, defined as the area of their intersection divided by the area of their union.
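A sketch of the axis-aligned BEV IoU used to assign the confidence labels is given below; representing the vertical projections as [x1, y1, x2, y2] rectangles with the rotation already excluded is the assumption described above.

```python
def bev_iou(box_a, box_b):
    """IoU of two axis-aligned BEV rectangles given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)
```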

The final loss L is defined as a weighted combination of the confidence, regression, and category losses and the auxiliary losses.
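The weighting coefficients are not recoverable here; assuming per-term weights \(\lambda_i\), the combination has the form

\[
L = \lambda_1 L_{conf} + \lambda_2 L_{reg} + \lambda_3 L_{cat} + \lambda_4 L_{aux\text{-}cat} + \lambda_5 L_{aux\text{-}reg}.
\]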

In the final loss, the confidence loss is expressed in terms of Pp, the positive confidence prediction, Pgt, the corresponding positive ground truth, and Ngt, the corresponding negative ground truth.

The regression loss is expressed in terms of Rp, the bounding box regression prediction, and Rgt, the corresponding ground truth.

The category loss is expressed in terms of Mgt, the maximum last-channel value of Pgt.

The auxiliary parts of the loss are defined in terms of Wb, the inverse-distance weights described in Section 2.5, the boxwise regression features and category predictions Bp of the sampled points, and their corresponding ground truths Bgt.

2.7. Dataset

Most 3D object recognition networks are trained using the KITTI dataset [33]. The KITTI dataset contains 7481 frames, from which we selected 2000 frames as the validation set and used the remaining 5481 frames as the training set. Among the object categories, we were interested in cars, trucks, vans, pedestrians, and cyclists; trucks and vans were merged into one class. In this study, the Apollo dataset [34] was also adopted.

In contrast to the KITTI dataset, the Apollo dataset contains continuous frame data. When a vehicle turns, the surrounding objects show disordered orientations and spatial positions, which are more complicated than those in the nonturning state. Such representative disordered data occur frequently in continuous frame data. The Apollo dataset contains 16 scenes. Each scene contains 2–5 sections of continuous frame data collected at a frequency of 2 Hz, each lasting 1 min. We take one section of each scene as the validation set and the rest as the training set. The final training set consisted of 3,943 frames, while the validation set consisted of 1,650 frames. We were interested in four types of labeled data: small vehicles, big vehicles, pedestrians, and riders (i.e., motorcyclists and bicyclists), which were labeled with the abbreviations “VEH, TRU, PED, CYC,” respectively, during the visualization step. The data augmentation technique [34] was adopted during the training process.

2.8. Details

In this study, points inside the range covered from 41.6 m in front, 20.8 m left and right, and 2 m above and below the LIDAR coordinate system were used to construct the BEV feature map. The resolution of the grid was set as 0.2 m. Therefore, the BEV feature map was divided into 208 × 208 grids.

Minibatch gradient descent was conducted with a batch size of 1. Because of the small batch size, we replaced all batch normalization layers in the network with group normalization layers. Each full pass over the training set was defined as an epoch. The number of epochs was set to 100; the first 75 epochs used a learning rate of 0.001, and the remaining epochs used a learning rate of 0.0001. All algorithms were run on a workstation with a Core i7 CPU, 8 GB of RAM, an NVIDIA 1080Ti GPU, and the open-source deep learning framework TensorFlow. Nonmaximum suppression was deployed to filter out excess bounding boxes, with the IoU threshold set to 0.1.
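A sketch of the schedule and the post-processing call under these settings is shown below. The learning rates, epoch split, and IoU threshold follow the text; the optimizer choice, the box ordering expected by tf.image.non_max_suppression ([y1, x1, y2, x2]), and the maximum number of kept boxes are assumptions.

```python
import tensorflow as tf

def learning_rate(epoch):
    """Piecewise-constant schedule: 1e-3 for the first 75 epochs, then 1e-4."""
    return 1e-3 if epoch < 75 else 1e-4

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)   # optimizer choice is assumed

def filter_boxes(boxes, scores):
    """Keep boxes whose axis-aligned BEV projections overlap by less than IoU 0.1."""
    keep = tf.image.non_max_suppression(boxes, scores,
                                        max_output_size=100,   # assumed cap
                                        iou_threshold=0.1)
    return tf.gather(boxes, keep), tf.gather(scores, keep)
```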

2.9. Evaluation Indicators

Assume that np objects are predicted and ngt objects are labeled in the ground truth. First, each predicted object is paired with a ground truth object by calculating the IoU. Taking the predicted objects as the benchmark and matching all predicted objects to the labeled objects by maximum IoU yields the precision. Similarly, taking the labeled objects as the benchmark and matching all labeled objects to the predicted objects by maximum IoU yields the recall. As the recall rises, the precision drops. Using recall as the horizontal axis and precision as the vertical axis, the area enclosed by the recall-precision curve and the coordinate axes is the average precision (AP), which is widely adopted to evaluate network performance. We set the parameters consistent with the KITTI benchmark, where the IoU threshold is 0.7 for cars and 0.5 for pedestrians and cyclists.
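A simplified sketch of the AP computation from matched detections is shown below. It assumes the detections have already been matched to the ground truth by maximum IoU against the category’s threshold, and it integrates the precision-recall curve directly rather than with the KITTI interpolation.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one category."""
    if scores.size == 0:
        return 0.0
    order = np.argsort(-scores)                 # rank detections by confidence
    tp = np.cumsum(is_true_positive[order])
    fp = np.cumsum(~is_true_positive[order])
    recall = tp / max(num_ground_truth, 1)
    precision = tp / np.maximum(tp + fp, 1)
    # Step integration of precision over recall.
    return np.sum((recall[1:] - recall[:-1]) * precision[1:]) + recall[0] * precision[0]
```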

3. Results and Discussion

3.1. Loss Curves

Validation was performed on one batch of data from the validation set every 100 iterations. Because the raw loss fluctuated violently during training, an exponentially weighted moving average with a decay of 0.99 was used to represent the changing trend of the loss. The evolution of the loss curves throughout the training process is shown in Figure 5. By the time the iteration count reaches 500,000, the network has converged. The selected trained model is marked with red points in the figures.

3.2. Speed

The acquisition frequencies commonly used by LIDAR are 5, 10, 15, and 20 Hz. The entire recognition process is divided into preprocessing and inference. The mean preprocessing and inference process duration was approximately 5.7 and 31.2 ms, respectively. The mean recognition process duration was approximately 37 ms, which meets the real-time requirement.

3.3. Accuracy

The recall-precision curves (trained on the KITTI dataset) are given in Figure 6. The APs of the regression and category are listed in Table 1 (trained on the KITTI dataset), with the data of main interest marked in bold. The AP for cars was 92.5, which is relatively high considering that the regression is based on the anchor determined by the category prediction. Trucks, which are not considered in the KITTI benchmark, are included in the identification. Due to the uneven distribution of training data, the APs of the other categories are slightly lower. The total AP (0.5), total AP (0.7), and total AP (categories) reach 93.2, 85.5, and 97.4, respectively. In the Apollo dataset, the labeled bounding boxes cover only the objects’ visible parts, which differ from the actual physical size, resulting in lower indicators. For clearly visible objects, the Apollo dataset shows performance similar to that of the KITTI dataset, as described in Section 3.5.

3.4. Comparison

The main contribution of this paper lies in the combined anchor-free and anchor-based prediction method and the auxiliary networks specially designed for the bird’s eye view network. The comparison is shown in Table 2. Because the data augmentation technique introduces randomness, training results can differ slightly; indicators within ±1% were regarded as equal performance. Our proposed method significantly improves the prediction results for pedestrians and cyclists; the data of main interest are marked in bold.

3.5. Scene Analysis

The method described in this paper was designed for a complex environment (multiobject, unordered objects, multicategory). In this section, some typical scenes are selected to analyze the network’s performance concerning object recognition. The continuous frame data from the Apollo dataset provides sufficient verification of the stability of the recognition effect.

Figure 7 shows three frames of a congested traffic scene. This scene features many vehicles in a dense array, precisely what the proposed network has been designed to identify.

The recognition result for a turning scene is shown in Figure 8. Most objects are recognized accurately, and the performance remains stable. Despite the large rotation of the scene, the orientations of the objects were also predicted accurately.

Figure 9 depicts a scene containing several objects belonging to different categories, including a vehicle, truck, and cyclist. Each object was recognized consistently and accurately across the consecutive frames.

The recognition visualization of the KITTI dataset is shown in Figure 10, and a series of typical complex scenarios are selected, including multiobject, unordered objects, and multicategory.

3.6. Discussion

In this paper, the orientation of the bounding box is expressed indirectly by its sine and cosine values. Compared with direct angle prediction, an inaccurate prediction does not deviate wildly; this is the advantage of continuous prediction, which makes the network robust. However, because the calculation involves an additional prediction dependency, the accuracy worsens when only high-quality predictions are compared. Similarly, in the proposed multicategory object recognition network, the bounding box regression is highly dependent on the category prediction; the regression therefore does not achieve the highest possible precision, but the prediction is more robust.

Our approach is not very sensitive to large objects that are very close to the observer. This is why the AP of cars in easy mode was slightly worse than in moderate mode. Expanding the receptive field can alleviate this problem but increases the depth of the network. Our experimental results showed that deeper networks increase the inference time without significantly improving accuracy, which is very different from image recognition tasks.

The indicators in this paper are much higher than those on the KITTI ranking list. The main reason is that the perception range selected in this paper is smaller than the standard: our range covers 41.6 m (70.4 m in the standard) in front, 20.8 m (40 m in the standard) left and right, and 2 m above and below the LIDAR coordinate system, which meets our application’s requirements.

4. Conclusions

The main aim of this study was to design a LIDAR-based object recognition method for autonomous vehicle systems. Thus, we proposed a new multifunctional network that operates in real time with high accuracy and stable performance. Because several recognition methods show considerable performance differences across datasets, the Apollo dataset was adopted in addition to the KITTI dataset, making the validation results more consistent with actual application scenes. Hence, the proposed recognition method has high practical value. The key findings of this study are outlined below:
(1) The proposed network realizes the accurate recognition of multiple types of objects in real time.
(2) To tackle inaccurate category prediction, an auxiliary network was designed to help the network learn pointwise features. It is not limited to the category of the object’s center point, making the prediction result more robust.
(3) To tackle inaccurate bounding box prediction, the validity of expressing the orientation angle indirectly by sine and cosine values was first verified. In addition, another auxiliary network was designed to help the network learn boxwise features.
(4) The proposed network delivers stable and robust object recognition performance in complex environments (multiobject, unordered objects, and multicategory), reflecting its high practical value.
(5) The proposed network’s performance is impacted negatively when the LIDAR system is obscured, and it is not sensitive to large objects that are very close to the observer. Further research is necessary to address these weaknesses.

In this study, we considered several possible problems in practical application scenarios. Although our proposed method needs further improvement, it has demonstrated very high practical application potential. Whereas most current methods rely heavily on 3D sparse convolutional operations [35], the stable performance of our method shows that handcrafted bird’s eye view features can serve the same purpose as three-dimensional convolution.

Data Availability

The data used to support the findings of this study are available upon request to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was funded by the National Natural Science Foundation of China, under Grant no. 51775548.