Abstract

Object detection in point clouds is a critical component of most autonomous driving systems. In this paper, to improve the effectiveness of feature extraction and the accuracy of point cloud detection, we propose a pillar-based 3D point cloud object detection algorithm with a multiattention mechanism, which incorporates three attention mechanisms: SOCA, SOPA, and SAPI. The results show that the recognition accuracy of the optimized algorithm for cars, pedestrians, and cyclists on the KITTI dataset is significantly improved on both the BEV and 3D detection benchmarks. Despite using only LiDAR, our algorithm outperforms PointPillars, one of the state-of-the-art algorithms for 3D object detection, on both the 3D and BEV KITTI benchmarks while maintaining a relatively competitive speed.

1. Introduction

Machine learning (ML) and the Internet of Things (IoT) can be applied in almost every industry [1, 2], from AI digital assistants to supply chain automation. With the development of machine learning technologies, autonomous driving [3] has become more and more popular. The significance of IoT to autonomous driving cannot be overstated: information is the bridge between IoT sensors and self-driving cars, and by relying on IoT sensors for information collection and analysis, self-driving cars move one step closer to large-scale application. A self-driving car carries many sensors, such as LiDAR, radar, cameras, and IoT devices that communicate with each other. Deep learning models based on convolutional neural networks (CNNs) enable the car to learn automatically from the data received and to continuously improve its detection performance with time and experience.

At present, 3D point cloud data obtained from LiDAR is the main input used by autonomous vehicles for object detection. Compared with two-dimensional images, LiDAR provides more reliable depth and shape information and locates objects with higher accuracy. However, due to nonuniform sampling, occlusion, and reflection in 3D space, LiDAR point clouds are sparse and have highly variable density. The accuracy of traditional 3D object detection algorithms based on hand-crafted features often suffers as a result. In recent years, 3D point cloud object detection algorithms based on deep neural networks [4–6] have improved in accuracy, as deep neural networks have shown excellent feature extraction capability and can handle high-dimensional data. Nevertheless, there is still considerable room for improvement in the detection accuracy of some categories, owing to the highly sparse and inherently irregular nature of point clouds. Some earlier works employ 3D convolution or project point clouds into a perspective view, another extensively used representation of LiDAR; VeloFCN [7] and LaserNet [8] are representative works along this line. Newer research favors the BEV of the LiDAR point cloud, whose advantage is near-total absence of occlusion. In 2018, Simony et al. [6] introduced Complex-YOLO, a model that projects point clouds onto a 2D plane and employs a 2D image approach for object detection, thus speeding up network inference. However, the projection is limited by the sparsity of the point cloud, which restricts the features that convolution can extract. To cope with this issue, a common method rasterizes the point cloud into 3D voxel grids and encodes each voxel with hand-crafted features. However, manual design not only fails to make full and effective use of the object's 3D information but also transfers poorly to other sensors. Qi et al. proposed PointNet [9], an end-to-end deep neural network that learns point features directly from point clouds. In 2018, Zhou and Tuzel [4] proposed the first end-to-end trainable network, VoxelNet, a universal 3D detection framework. Unlike most previous work, VoxelNet learns an information-rich feature representation directly and simultaneously from point clouds. However, 3D convolution is time-consuming and computationally complex, leading to slow network inference. Later, Yan et al. [10] proposed SECOND, which reduces memory consumption and speeds up computation through sparse convolution. To exploit the standard 2D convolution detection pipeline and improve inference speed, Lang et al. [11] proposed PointPillars, which encodes point clouds into vertical pillars, a special division of voxels. To further improve point cloud detection in challenging situations, Liu et al. [12] presented TANet in 2019, which utilizes a combination of attention mechanisms.

The so-called "attention mechanism" is a way of perception that mimics the human brain and human vision, a mechanism for focusing on local information [13, 14]. An attention mechanism can dynamically select regions of interest as the task changes, which is achieved by adaptively assigning weights according to the significance of the inputs. The point-wise attention mechanism proposed by TANet assigns weights based on the importance of individual points, but it does not consider the correlations between points, resulting in the loss of a fraction of the valuable geometric information. Considering that the points within a pillar are semantically linked, we propose a new method called second order of point attention (SOPA), which relates the points within a pillar to each other. The experimental results show that the pillar-based 3D object detection method with SOPA achieves better detection accuracy than the one with point-wise attention. In particular, for pedestrians and cyclists, two categories that are currently detected with relatively low accuracy and are more challenging to improve, SOPA improves the accuracy of each category. Similarly, considering the channel-to-channel relationship, we propose the second order of channel attention mechanism (SOCA) based on pillars, placed between the feature extraction network and the backbone network, which can extract more effective information. In addition, since not all features in the pseudoimage contribute equally to the detection task, we propose the spatial attention of pseudoimage mechanism (SAPI), which, at the pseudoimage generation stage, assigns each position in the pseudoimage a weight according to the importance of its region to the task, leading to more accurate detection results. Compared with existing 3D point cloud object detection algorithms, a model that integrates these three second order attention mechanisms achieves higher detection performance at a relatively competitive speed.

2.1. PointPillars Network

PointPillars [11] comprises three main stages: (1) a feature encoder network that transforms point clouds into sparse pseudoimages, (2) a 2D convolutional backbone network that converts pseudoimages into high-level representations, and (3) a detection head that detects and regresses 3D boxes.

2.1.1. Point Cloud to Pseudoimage

The space is partitioned into pillars [11]; the raw points are assigned to these pillars and then converted into a sparse pseudoimage. Let $l$ denote a point in the raw point cloud, with coordinates $x$, $y$, $z$ and reflection intensity $r$. First, the input point cloud is partitioned into multiple pillar cells: each pillar is a 3D grid cell obtained by dividing the point cloud in the $x$-$y$ plane with steps of [0.16, 0.16] m, yielding a set of pillars. Each point in a pillar is encoded as a nine-dimensional vector, parameterized as $(x, y, z, r, x_c, y_c, z_c, x_p, y_p)$. Here, the subscript $c$ denotes the offset of the point from the geometric center of all points in the pillar, and the subscript $p$ denotes the offset of the point from the pillar center in the $x$-$y$ plane.

Random sampling is performed if a pillar contains more than $N$ points, and zero padding is applied if it contains fewer, so that the number of points in each pillar is fixed at $N = 32$. In this way, a feature tensor of size $(D, P, N)$ is obtained, where $D = 9$ is the dimension of the decorated point, $P$ is the number of nonempty pillars, and $N$ is the number of points per pillar; feature extraction is then performed on this tensor. The original point dimension of 9 is expanded to $C = 64$, producing a feature tensor of size $(C, P, N)$. A 2D feature map of size $(C, P)$ is obtained by max pooling over the third dimension. The last step generates a pseudoimage by a scatter operation: based on the recorded index of each pillar, the learned feature tensor replaces the original pillar on the canvas, creating a pseudoimage of size $(C, H, W)$. Here, $H$ and $W$ denote the height and width of the canvas. In the process of constructing stacked pillars from the point cloud, the coordinates of each pillar are recorded; when the pseudoimage is constructed from the learned features, each pillar is filled with its corresponding learned features according to the pillar index.
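To make the encoding and scatter steps concrete, the following is a minimal PyTorch sketch, assuming points have already been grouped and zero-padded into pillars. The function names, tensor layouts, and the masking of padded points are illustrative assumptions, not the reference implementation.

```python
import torch

def decorate_points(points, pillar_centers):
    """Expand raw (x, y, z, r) points to the 9D decorated form described above.

    points:         (P, N, 4) pillar-grouped points, zero-padded to N per pillar
    pillar_centers: (P, 2) x/y center of each pillar cell
    returns:        (P, N, 9) decorated point features
    """
    mask = (points.abs().sum(-1, keepdim=True) > 0).float()   # 1 for real points
    n_pts = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    mean_xyz = (points[..., :3] * mask).sum(dim=1, keepdim=True) / n_pts
    offs_c = points[..., :3] - mean_xyz                       # offset to point mean
    offs_p = points[..., :2] - pillar_centers[:, None, :]     # offset to cell center
    return torch.cat([points, offs_c, offs_p], dim=-1) * mask

def scatter_to_pseudoimage(pillar_feats, coords, H, W):
    """Scatter learned per-pillar features back onto the (C, H, W) canvas.

    pillar_feats: (P, C) per-pillar features after the max pooling step
    coords:       (P, 2) integer (row, col) indices recorded during pillarization
    """
    C = pillar_feats.shape[1]
    canvas = pillar_feats.new_zeros(C, H * W)
    canvas[:, coords[:, 0] * W + coords[:, 1]] = pillar_feats.t()
    return canvas.view(C, H, W)
```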

2.1.2. Backbone

The backbone network is similar to that of VoxelNet [4] and comprises two subnetworks: the first is a top-down structure with successively decreasing resolution, in which the low-resolution stages extract high-level features; the second performs upsampling operations so that features of corresponding sizes can be stitched together. The first subnetwork is composed of three blocks, each consisting of convolutional layers followed by BatchNorm [15] and ReLU [16] layers. The stride is two, so the resolution is halved in the $x$-$y$ direction after each block; after passing through the three blocks, the resolution has dropped three times, down to one-eighth of the original. At the same time, the channel dimension expands from $C$ to $4C$.

For the second subnetwork, the upsampling operations bring the outputs of the three blocks to the same size. The features of equal resolution obtained after deconvolution are then concatenated to form an integrated feature.
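For reference, here is a minimal PyTorch sketch of the two subnetworks. The layer counts (4/6/6 convolutions per block), the channel progression $C \to 2C \to 4C$, and the $2C$-wide upsampling branches follow the PointPillars design; the exact kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, n_extra, stride):
    """One top-down block: a strided conv, then n_extra 3x3 convs,
    each followed by BatchNorm and ReLU as described above."""
    layers = [nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    for _ in range(n_extra):
        layers += [nn.Conv2d(c_out, c_out, 3, padding=1, bias=False),
                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class Backbone(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.blocks = nn.ModuleList([conv_block(c, c, 3, 2),
                                     conv_block(c, 2 * c, 5, 2),
                                     conv_block(2 * c, 4 * c, 5, 2)])
        # Transposed convs bring all three scales back to a common resolution.
        self.ups = nn.ModuleList([
            nn.ConvTranspose2d(c, 2 * c, 1, stride=1),
            nn.ConvTranspose2d(2 * c, 2 * c, 2, stride=2),
            nn.ConvTranspose2d(4 * c, 2 * c, 4, stride=4)])

    def forward(self, x):                     # x: (B, C, H, W) pseudoimage
        outs = []
        for block, up in zip(self.blocks, self.ups):
            x = block(x)                      # halve resolution, grow channels
            outs.append(up(x))                # upsample back to a shared size
        return torch.cat(outs, dim=1)         # concatenated multi-scale feature
```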

2.1.3. Detection Head

The SSD detection head [17] is employed for 3D object detection in the final stage. Two anchors, at angles of 0° and 90°, are placed at the center of each pillar. For the IoU [18] computation, rotated IoU [19] offers the best accuracy.

2.2. Attention Mechanism

Attention mechanisms were first proposed in the field of image processing in the 1990s, and their development has gone through four main phases. In the first, spatial attention was implemented with RNNs [20] and reinforcement learning. In the second, Jaderberg et al. [21] proposed the STN, which learns affine transformations to select important regions. The third phase is represented by CBAM [22] and ECA-Net [23], whose attention mechanisms adaptively predict potentially important features. In the fourth phase, self-attention has attracted great interest [24]. Wang et al. were the first to introduce self-attention into computer vision and achieved great success in video understanding and object detection [25]. In recent years, owing to its remarkable performance, an increasing number of studies based on attention mechanisms have emerged in the field of computer vision [26–28].

Existing attention methods can be divided into channel attention [23, 29], spatial attention [30, 31], temporal attention, and branch attention [32].

2.2.1. Point-Wise Attention

In the pillar-based 3D point cloud object detection algorithm, for a pillar in space, the global feature of each point is retained after the max pooling process, yielding a vector $E \in \mathbb{R}^{N \times 1}$, where $N$ represents the number of points. To limit the complexity of the network, only two fully connected layers are employed. The point-wise attention mechanism can be expressed by the following formula:

$$S = \sigma\left(W_2\,\delta\left(W_1 E\right)\right)$$

Here, $\delta$ is the ReLU activation function between the two fully connected layers, and $W_1 \in \mathbb{R}^{(N/r) \times N}$ and $W_2 \in \mathbb{R}^{N \times (N/r)}$ denote the weights of the two fully connected layers, where $N/r$ is the output length of the first fully connected layer and $N$ is the output length of the second. $\sigma$ is the sigmoid function, and $S$ represents the point-wise attention of the points in the pillar.
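As a concrete illustration, the following is a minimal PyTorch sketch of this baseline; the reduction ratio and tensor layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PointWiseAttention(nn.Module):
    """Baseline point-wise attention (after TANet): each point gets an
    independent weight; correlations between points are not modeled."""
    def __init__(self, n_points, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(n_points, n_points // reduction)
        self.fc2 = nn.Linear(n_points // reduction, n_points)

    def forward(self, pillar):                 # pillar: (P, N, C)
        e = pillar.max(dim=2).values           # (P, N) max over channels
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(e))))
        return pillar * s.unsqueeze(-1)        # reweight each point
```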

2.3. Contributions

(1) We propose three effective methods: the second order of point attention mechanism (SOPA) based on pillars, the second order of channel attention mechanism (SOCA), and the spatial attention of pseudoimage mechanism (SAPI) applied after the pseudoimage generation stage, to implement high-precision real-time object detection.
(2) We conducted experiments on the KITTI dataset [33] and present the latest detection results for cars, pedestrians, and cyclists on the BEV, 3D, and AOS benchmarks. Our model runs at 34 Hz, while the detection accuracy for cyclists and pedestrians, which is typically low, is substantially improved by about 6% mAP (on both BEV and AOS) over the other methods.
(3) We performed several ablation experiments to examine the key factors behind the performance improvement.

3. Multiattention Mechanism Network

The architecture of the multiattention mechanism network is shown in Figure 1. The network accepts the raw point cloud as input and predicts 3D bounding boxes to identify cars, pedestrians, and cyclists. It is composed of the following stages: (1) the point cloud is voxelized and the SOPA operation is performed on it, followed by a pillar feature network that converts it into a sparse pseudoimage; (2) the generated sparse pseudoimage is then subjected to the SOCA operation; (3) after the backbone network, the SAPI operation is performed on the output features, and finally the 3D bounding boxes of objects are predicted by the detection head.

3.1. Second Order of Point Attention

As presented in Figure 2, when point features are fed into the second order of attention module, the SOPA weights are obtained as the output; in this case, the module is called the SOPA module.

In a pillar, all points form a matrix $P \in \mathbb{R}^{N \times C}$, where $N$ represents the maximum number of points and $C$ refers to the number of channels. Through a max pooling layer along the channel dimension, a vector of maximum values $E \in \mathbb{R}^{N \times 1}$ is obtained, i.e., a vector with $N$ rows and 1 column. To maintain a large model capacity and further preserve the representation capability, $E$ is fed into a fully connected layer $W_1$, giving a new vector $E' \in \mathbb{R}^{N' \times 1}$, where $N'$ is the number of points after reduction by the fully connected layer. A ReLU activation $\delta$ then guards against gradient vanishing. Next, the covariance matrix between points in the same pillar is computed to capture their correlation; this covariance matrix is expressed as $\Sigma \in \mathbb{R}^{N' \times N'}$ (the dimension is the number of points in SOPA, and the number of channels in SOCA). A row-by-row convolution is then applied to the covariance matrix, yielding a vector $v \in \mathbb{R}^{N' \times 1}$. This vector is fed into a second fully connected layer $W_2$, and the $N$-dimensional attention vector $S$ is obtained with a sigmoid function. SOPA can be expressed as follows:

$$S = \sigma\left(W_2\,\mathrm{RC}\left(\mathrm{Cov}\left(\delta\left(W_1\,\mathrm{maxpool}(P)\right)\right)\right)\right)$$

Here, $\mathrm{Cov}(\cdot)$ calculates the covariance matrix of the points, and $\mathrm{RC}(\cdot)$ denotes the row convolution. $\delta$ is the ReLU activation function, while $\sigma$ is the sigmoid function. $W_1$ and $W_2$ represent the two different fully connected layers, respectively, and $P$ is the set of points in the given pillar.
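The following is a minimal PyTorch sketch of this block. Interpreting the covariance step as an outer product of the centered point responses, and implementing the row-by-row convolution as a grouped `Conv1d` (one filter per row), are our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SOPA(nn.Module):
    """Second order of point attention: a covariance-style matrix over the
    reduced point vector captures pairwise correlations between points
    before the attention weights are produced."""
    def __init__(self, n_points, reduced):
        super().__init__()
        self.fc1 = nn.Linear(n_points, reduced)
        self.row_conv = nn.Conv1d(reduced, reduced, kernel_size=reduced,
                                  groups=reduced)   # one filter per row
        self.fc2 = nn.Linear(reduced, n_points)

    def forward(self, pillar):                  # pillar: (P, N, C)
        e = pillar.max(dim=2).values             # (P, N) max over channels
        e = torch.relu(self.fc1(e))               # (P, N') reduced vector
        # Second-order term: outer product of the centered responses.
        centered = e - e.mean(dim=1, keepdim=True)
        cov = centered.unsqueeze(2) * centered.unsqueeze(1)   # (P, N', N')
        v = self.row_conv(cov).squeeze(-1)        # row-by-row conv -> (P, N')
        s = torch.sigmoid(self.fc2(v))            # (P, N) attention weights
        return pillar * s.unsqueeze(-1)
```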

3.2. Second Order of Channel Attention

SOCA is similar to SOPA. As presented in Figure 2, when channel features are fed as input to the second order of attention module, the output is the SOCA weight. For pseudoimage features $F \in \mathbb{R}^{C \times H \times W}$ generated by the pillar feature network, SOCA can be expressed by the following equation:

$$S = \sigma\left(W_2\,\mathrm{RC}\left(\mathrm{Cov}\left(\delta\left(W_1\,\mathrm{maxpool}(F)\right)\right)\right)\right)$$

where the max pooling is taken over the spatial positions, and $H$ and $W$ are the height and width of the pseudoimage.
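Since the machinery is the same as SOPA with the pooled axis swapped, a sketch can reuse the SOPA module above; the flattening of the spatial grid into the pooled axis is our illustrative assumption.

```python
class SOCA(nn.Module):
    """Second order of channel attention: the same second-order block as
    SOPA, but max pooling collapses the H x W grid so that the attention
    weights are per channel rather than per point."""
    def __init__(self, n_channels, reduced):
        super().__init__()
        self.body = SOPA(n_channels, reduced)   # reuse the SOPA sketch above

    def forward(self, feats):                   # feats: (B, C, H, W)
        b, c, h, w = feats.shape
        x = feats.view(b, c, h * w)             # spatial grid as pooled axis
        x = self.body(x)                        # per-channel reweighting
        return x.view(b, c, h, w)
```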

3.3. Spatial Attention of Pseudoimage

Not all regional features in the space contribute equally to the task; only the regions relevant to the task are of interest. Pixels in each layer of the spatial feature map are therefore assigned different weights, so that task-relevant parts of the space are selected and then processed. Here, the spatial attention operation is performed on the pseudoimage, which is why we refer to it as spatial attention of pseudoimage. If the feature map $x$ and the gating signal $g$ are given as input, the final output yields the spatial attention weights $\alpha$. SAPI can be formulated as

$$\alpha = \sigma_2\left(\psi\left(q\left(x, g\right)\right) + b_\psi\right)$$

The correlation between $x$ and $g$ can be expressed as

$$q\left(x, g\right) = \sigma_1\left(W_x x + W_g g + b_g\right)$$

Here, $W_x$, $W_g$, and $\psi$ are linear transformations computed with $1 \times 1$ convolutions, with bias terms $b_g$ and $b_\psi$; $\sigma_1$ denotes the ReLU activation and $\sigma_2$ the sigmoid activation.
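Below is a minimal PyTorch sketch of this gate, following additive attention-gate designs; the class name, channel arguments, and the final elementwise reweighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SAPI(nn.Module):
    """Additive spatial attention over the pseudoimage: the feature map x
    and gating signal g are projected by 1x1 convs, fused, and squeezed to
    a per-pixel weight map."""
    def __init__(self, c_x, c_g, c_mid):
        super().__init__()
        self.w_x = nn.Conv2d(c_x, c_mid, 1, bias=False)
        self.w_g = nn.Conv2d(c_g, c_mid, 1, bias=True)    # carries b_g
        self.psi = nn.Conv2d(c_mid, 1, 1, bias=True)      # carries b_psi

    def forward(self, x, g):                   # x, g: (B, C, H, W)
        q = torch.relu(self.w_x(x) + self.w_g(g))
        alpha = torch.sigmoid(self.psi(q))     # (B, 1, H, W) spatial weights
        return x * alpha
```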

4. Implementation Details

4.1. Loss Function

We use the same loss functions as presented in SECOND and PointPillars. We parameterize a 3D ground truth box as $(x, y, z, w, l, h, \theta)$, where $(x, y, z)$ represents the center location, $(w, l, h)$ is the size, and $\theta$ is the heading angle of the bounding box. The regression residuals between ground truth and anchor boxes are computed as follows:

$$\Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad \Delta z = \frac{z^{gt} - z^{a}}{h^{a}},$$
$$\Delta w = \log\frac{w^{gt}}{w^{a}}, \quad \Delta l = \log\frac{l^{gt}}{l^{a}}, \quad \Delta h = \log\frac{h^{gt}}{h^{a}}, \quad \Delta\theta = \sin\left(\theta^{gt} - \theta^{a}\right),$$

where the superscript $gt$ denotes the ground truth, $a$ denotes the anchor box, and $d^{a} = \sqrt{(w^{a})^{2} + (l^{a})^{2}}$. All of the losses are summed up as the total loss of the overall network model, with the overall loss function defined as

$$L = \frac{1}{N_{pos}}\left(\beta_{loc} L_{loc} + \beta_{cls} L_{cls} + \beta_{dir} L_{dir}\right),$$

where $N_{pos}$ represents the number of positive anchors. We set $\beta_{loc} = 2$, $\beta_{cls} = 1$, and $\beta_{dir} = 0.2$.

The regression loss is given by the following equation:

$$L_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}\left(\Delta b\right)$$

For the classification loss, we adopt the focal loss [34]:

$$L_{cls} = -\alpha_{a}\left(1 - p^{a}\right)^{\gamma} \log p^{a},$$

where $p^{a}$ denotes the probability of an anchor being positive. We adopt the settings $\alpha = 0.25$ and $\gamma = 2$.
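A small sketch of the residual encoding above, written against the SECOND/PointPillars parameterization; the function name and tensor layout are illustrative assumptions.

```python
import torch

def encode_box_residuals(gt, anchor):
    """Regression targets for (x, y, z, w, l, h, theta), following the
    parameterization reproduced above. gt, anchor: (..., 7) tensors."""
    xa, ya, za, wa, la, ha, ta = anchor.unbind(-1)
    xg, yg, zg, wg, lg, hg, tg = gt.unbind(-1)
    da = torch.sqrt(wa ** 2 + la ** 2)       # anchor box diagonal
    return torch.stack([(xg - xa) / da,
                        (yg - ya) / da,
                        (zg - za) / ha,
                        torch.log(wg / wa),
                        torch.log(lg / la),
                        torch.log(hg / ha),
                        torch.sin(tg - ta)], dim=-1)
```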

5. Experiment

5.1. Dataset

All test results are evaluated using KITTI's official evaluation metrics, including bird's eye view (BEV), 3D, 2D, and average orientation similarity (AOS), where AOS evaluates the average orientation similarity of 2D detections. The KITTI dataset provides easy, moderate, and hard difficulty levels, and the official KITTI leaderboard is ranked by performance on the moderate difficulty. Performance is measured as mean average precision (mAP) on the KITTI validation set.

All experiments employ the KITTI 3D object detection benchmark dataset, which is composed of 7,481 training samples and 7,518 test samples. The KITTI benchmark requires detecting three categories [33]: cars, pedestrians, and cyclists. We follow the commonly used training-validation split, which contains 3,712 training samples and 3,769 validation samples.

5.2. Settings

Here, we use an $x$-$y$ resolution of 0.16 m, a maximum number of points per pillar ($N$) of 100, and a maximum number of pillars ($P$) of 12,000. Our approach is based on the PyTorch framework, with all networks trained on an NVIDIA 2080Ti computing platform.

We train for 160 epochs with the Adam [35] optimizer and an initial learning rate that is decayed by a factor of 0.8 every 15 epochs.
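For clarity, the schedule corresponds to the following PyTorch setup. The initial learning rate is not recoverable from the text, so the value below is an assumed placeholder, and the stand-in model exists only to make the snippet self-contained.

```python
import torch
import torch.nn as nn

# Stand-in model so the snippet runs; the real network is the detector above.
model = nn.Conv2d(64, 64, 3, padding=1)

# 2e-4 is an assumed placeholder; the decay (x0.8 every 15 epochs) follows the text.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.8)

for epoch in range(160):
    # ... one training epoch over the KITTI training split ...
    scheduler.step()
```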

5.3. Results

In this section, we present the results of our object detection algorithm using the three types of attention mechanisms. The tables below show the effect of adding the three attention mechanisms to the network. In addition, adding either of the two attention mechanisms (SOPA, SOCA) on its own yields an overall mAP boost of 2–3%.

As shown in Tables 1–3, our object detection network with the second order multiattention mechanism exceeds most published networks (mAP on both the AOS and BEV benchmarks). As listed in Tables 4 and 5, our method combining the three attention mechanisms achieves BEV mAP of 88.37%, 54.13%, and 67.38% and 3D mAP of 76.22%, 49.45%, and 63.58% at the moderate difficulty for the car, pedestrian, and cyclist categories, respectively. Moreover, compared with most methods using only LiDAR, better results are achieved in all categories across the three difficulty levels.

We show several qualitative results in Figure 3. Although we trained only on LiDAR point clouds, the 3D bounding boxes are projected into the camera coordinate system for clarity of interpretation. Overall, our model produces highly accurate 3D bounding boxes in all categories.

6. Ablation Experiments

In this section, we provide the results of ablation experiments to evaluate the key factors that affect detection accuracy.

As shown in Tables 1–3, the ablation results indicate that SOPA improves overall accuracy by 1–2% mAP compared to point-wise attention. The points within a pillar are correlated with each other, so processing the points individually inevitably drops part of the useful geometric information and harms detection accuracy. SOPA relates the points within the same pillar to each other and thereby retains more meaningful information.

As indicated by the results in Tables 6–8, adding SOCA achieves higher accuracy than the fused channel attention mechanism alone, with an overall improvement of 4–5% mAP over the existing PointPillars method. In particular, Tables 5 and 9 show that the detection results for the cyclist category, whose accuracy is relatively low, are improved by a large margin (6% 3D mAP and 7% AOS mAP) on the 3D and AOS benchmarks. In the backbone network, each channel is otherwise processed in isolation; ignoring the correlation between channels loses valuable information and decreases detection precision. SOCA associates channels with one another to retain more useful feature information.

The remaining influencing factors may include the choice of various hyperparameters: the network design (convolution size, number of convolutional layers, convolution type, and number of channels); whether the projection uses only a bird's eye view or also incorporates a front view; whether the pitch angle is chosen as a main parameter in the pillar feature net; the choice of single or multiple detection heads; and more. Separating and evaluating each of these potential factors requires further experimental study.

7. Conclusions

This paper presents an object detection algorithm based on PointPillars that combines multiple attention mechanisms. A novel deep network and encoder improve the traditional end-to-end algorithm by adding multiple attention mechanisms to the network structure at the feature extraction stage, improving the effectiveness of feature extraction. On the KITTI dataset, the algorithm delivers higher detection performance (BEV, 3D, and AOS mAP) at a relatively competitive speed. Our experimental results show that the proposed multiattention point cloud object detection algorithm is a strong network for LiDAR 3D object detection, and the comparison with PointPillars further demonstrates the effectiveness of the method proposed in this paper. It is worth noting that the proposed object detection algorithm can be extended to intelligent transportation systems [43, 44] and private transmission systems [45, 46].

Data Availability

The experiments in this paper use the publicly available KITTI 3D object detection benchmark dataset [33].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Guangzhou Science and Technology Project under Grant 202102021132, in part by the National Nature Science Foundation of China under Grant 62173101, in part by the Guangzhou Key Laboratory of Software-Defined Low Latency Network under Grant 202102100006, in part by the Open Research Project of Zhijiang Laboratory under Grant 2021KF0AB06, and in part by the International Collaborative Research Program of Guangdong Science and Technology Department under Grant 2020A0505100061.