Abstract

Object detection in the 2D domain is well developed owing to the wide application of CMOS image sensors and the great success of deep learning technologies in recent years. However, under circumstances such as autonomous driving, variations in weather and lighting conditions make reliable detection with regular 2D image sensors difficult. 3D data generated by a Lidar or Radar is more robust to such environments, hence serving as an essential complement to 2D data in these scenarios. Well-established anchor-based detectors in the 2D domain suffer from time-consuming anchor configuration and cannot be directly exploited to process 3D data. This paper proposes an anchor-free network that encodes the raw point cloud into a hierarchical pillar representation to locate objects. Without predefined anchors and NMS postprocessing, our method directly predicts the center points and box properties to accomplish the detection task efficiently. In addition, a PCA-based initialization for the convolutional kernels is proposed to accelerate the training process. Experiments on the KITTI benchmark show that our method achieves performance competitive with anchor-based methods. Comprehensive ablation studies further verify the validity and rationality of each part of the proposed method.

1. Introduction

Object detection is one of the most important tasks in the field of computer vision and has a wide range of applications in individual recognition, content understanding, autonomous driving, etc. In general, the task of object detection is to mark the locations and determine the categories of key targets with bounding boxes. In the past decades, huge amounts of 2D data have been collected from widely deployed commercial image sensors. Taking advantage of cutting-edge deep learning technology, many convolutional neural network- (CNN-) based algorithms [1–9] have been designed for 2D object detection and have shown their superiority and effectiveness. Because of the 3D nature of many real-world problems, 3D object detection has attracted more and more attention.

3D data, usually represented as point clouds, can effectively depict the real world with accurate geometric information, which is robust to changing light conditions, different object textures, and color variations. With its increasing availability, 3D data has been serving as an essential complement to conventional 2D sensors in many scenarios. However, the sparse and unordered structure of point clouds cannot be directly processed by a conventional CNN, which calls for novel network structures to encode the point clouds.

Considering point clouds’ irregular and sparse properties, most existing data-driven 3D object detection approaches can be categorized into point-based and voxel-based ones. Inspired by the pioneering works PointNet [10] and PointNet++ [11], point-based methods take raw point clouds as input and extract features without any data transformation or information loss. However, the real-time performance and effectiveness of point-based methods are unsatisfactory due to the time-consuming point sampling procedure and the limited perceptual ability of the encoder. On the other hand, voxel-based methods [12, 13] transform the point clouds into regular data representations that can be processed by a CNN. Furthermore, the introduction of sparse convolution dramatically improves the performance and speed of voxel-based methods. Nevertheless, these methods are sensitive to the parameters of the voxel partition and inevitably cause local information loss in the raw point clouds. Recently, PointPillars [14] utilizes vertical columns called pillars to organize point clouds and avoids complicated 3D convolutional operations; it alleviates the parameter configuration problem in preprocessing and shows considerable accuracy and speed. Pillar-based methods organize the raw point cloud into two-dimensional regular grids so that traditional 2D convolution operations can be applied. This is highly effective for sparse Lidar point clouds; however, it is prone to missing small or far-away objects. Motivated by this, this paper focuses on how to solve the local information loss problem in a pillar-based object detection model.

Current popular voxel-based (including pillar-based) methods often leverage anchors, which are manually designed bounding boxes, to accomplish detection and classification. Although anchors provide useful priors and enable the methods to predict offsets directly, applying them to 3D object detection is difficult. First, hyperparameters including aspect ratios, orientations, and anchor numbers need to be predetermined and adjusted for different datasets. Manual hyperparameter tuning is time-consuming and inaccurate, which limits applicability. Second, a great number of anchor boxes are generated during training and inference so that all possible locations of the ground truth bounding boxes can be covered, which introduces huge memory consumption and a severe class imbalance between positive and negative anchors. Third, the Non-Maximum Suppression (NMS) required to determine the final detection results leads to an extra computational burden.

One solution is to employ an anchor-free detector. Recently, anchor-free detectors have seen continuous development and breakthroughs in the field of 2D object detection [15, 16]. They directly estimate the key points and sizes of the objects without anchor hyperparameter configuration or anchor generation. Subsequently, anchor-free 3D object detection approaches [17] have been proposed and outperform classical anchor-based ones.

In summary, the pillar-based approach improves on voxel-based approaches by encoding the raw data into a lower-dimensional representation; it achieves faster inference at the cost of local information loss. The anchor-free detector learns from the data rather than relying on predefined anchors and boxes; it can regress more accurate bounding boxes and has achieved great success in 2D object detection, so it deserves more investigation in the 3D domain.

Motivated by these facts, this paper proposes a hierarchical pillar-based anchor-free 3D object detection model. Compared with other pillar-based approaches, we further partition the pillars into subpillars and learn hierarchical features of local regions. The proposed method then aggregates multilevel features with the CNN backbone to generate high-quality spatial representations. Beyond existing anchor-free approaches, we introduce an improved center point allocation strategy to further improve accuracy and alleviate the positive-negative imbalance problem. At the training stage, we exploit a principal component analysis- (PCA-) based method to initialize the convolutional kernels. At the inference stage, our model generates the center locations directly and avoids NMS postprocessing. Experiments and ablation studies are carried out on the well-known KITTI benchmark [18] to evaluate the performance of the proposed method.

The contributions of this paper can be summarized as follows: (1) An anchor-free detector for point cloud 3D object detection without NMS is proposed; it can be jointly optimized end-to-end and achieves performance competitive with anchor-based methods. (2) The point cloud is encoded into a hierarchical pillar-based feature representation, which captures the local structure and mitigates the information loss in preprocessing; subsequent multilevel feature aggregation in the CNN backbone extracts robust features and enhances detection accuracy. (3) Our previously proposed PCA-based initialization [19] is incorporated into the CNN backbone for 3D object detection; the convolutional kernels are initialized with more informative values, which accelerates the training of the CNN and reduces the gradient diffusion caused by random initialization. (4) A novel center point allocation strategy is designed to train the model, and experimental results demonstrate its effectiveness for the 3D object detection problem.

The rest of the paper is organized as follows: Section 2 describes the related works. Section 3 provides an introduction to the proposed method for 3D object detection. Experimental results on the dataset are presented and discussed in Section 4. Section 5 concludes the paper.

2. Related Works

2.1. 3D Object Detection with Point Clouds

A point cloud is a set of points with sparse distribution and irregularity. PointNet [10] is the pioneering work that takes raw point clouds as input and extracts 3D features with shared multilayer perceptrons. PointNet++ [11] further proposes set abstraction levels to capture local patterns among the point clouds. Subsequent object detection works [20–22] that build on PointNet or PointNet++ to process the original points directly are called point-based approaches. [20] detects 2D proposals from the RGB images and projects them into 3D frustums; a PointNet is then applied to extract RoI features of the points in the frustums and refine the 3D bounding box. [21] directly generates 3D proposals from the point cloud and combines the local spatial features learned in canonical coordinates with global semantic features to obtain better locations. [22] produces initial predictions with a voxel representation as input and generates fused features of the interior points for further refinement. [23] proposes a two-stage sparse-to-dense 3D object detection approach: in the first stage, it makes proposals at all foreground points; in the second stage, it combines the point cloud feature and the semantic feature to refine the bounding box. [24] improves the set abstraction layer in PointNet++ and designs a novel sampling strategy called F-FPS; it then uses an anchor-free detector to regress the object positions. These point-based methods usually rely on a second refinement stage to regress more accurate box locations. Although they show impressive performance, they trade efficiency for accuracy and are not suitable for real-time applications.

Another category falls into voxel-based approaches, which preprocess the raw point clouds into compact regular representations. VoxelNet [12] organizes the points into voxels and then extracts 3D dense features through the voxel feature encoding (VFE) layer and 3D convolution. SECOND [13] utilizes a sparse convolution network to accelerate the convolution operations in training and inference. More recently, PointPillars [14] generates 2D pseudoimages by encoding the point cloud on vertical columns (pillars) and eliminates the time-consuming 3D convolution. Most voxel-based methods are one-stage detectors with high computational efficiency but suffer from information loss due to voxelization. [25] proposes a two-stage pillar-based approach to address the imbalance issue caused by anchors; it incorporates the concept of pillars and multiview feature learning, and a pillar-to-point projection is then employed to refine the results. Our method aims to preserve more local information using a hierarchical pillar representation at a minimal cost in speed.

2.2. Anchor-Free Object Detection

Most existing object detection methods rely on a large number of predefined anchors for bounding box generation, which results in complex hyperparameter configuration and huge memory consumption. Anchor-free detectors instead directly predict the key points and sizes of the bounding boxes at high speed. The success of anchor-free methods in 2D object detection [15, 16] has inspired researchers to investigate anchor-free 3D detectors. VoteNet [26] aggregates votes toward object centroids to obtain object proposals directly from point clouds. However, VoteNet is not a completely anchor-free model because it employs anchor templates in the size prediction process. Later, [17] proposes an anchor-free detector and further simplifies the postprocessing to increase detection efficiency. However, the performance of anchor-free approaches depends strongly on the center point allocation strategy and the 3D bounding box regression.

All the related works are summarized in Table 1.

3. Proposed Method

In this section, we introduce the proposed hierarchical pillar-based anchor-free 3D object detection model. As shown in Figure 1, the overall network is composed of the following parts: a point cloud encoder that transforms the unordered point cloud into a 2D pseudoimage; a CNN backbone based on PCA initialization that further extracts features; and an anchor-free detector. In the following, we describe each part of the proposed method in detail.

3.1. Point Cloud Encoder

Our proposed point cloud encoder is based on the high-efficiency PointPillars [14] but can further capture the local structure and mitigate the information loss in the point cloud encoding process. The input point cloud P is a set of points with irregular distribution in Euclidean space.

First, P is discretized into vertical columns (pillars) with uniform grids in the x-y plane. Considering the sparsity of point cloud data, zero-padding is applied when a pillar contains too few points, and only nonempty pillars are preserved. Second, several hierarchical feature extraction (HFE) levels are introduced to group the points into local patterns and aggregate the information. At the first HFE level, each pillar with enough points is divided into several subpillars of equal height according to the resolution parameter in the vertical direction. Following [14], the points in each subpillar are augmented into a 9-dimensional representation, and the average vertical coordinate of the subpillar is computed. Random sampling is applied to subpillars containing more than a fixed number of points. Then, each selected subpillar is processed by a linear layer followed by a batch normalization (BN) layer, a ReLU layer, and a max operation to produce an output subpillar feature.
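As an illustration of this partitioning step, the following NumPy sketch assigns points to pillars in the x-y grid and to evenly spaced vertical subpillars; the detection range, the number of subpillars, and the function name are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np

def assign_subpillars(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                      z_range=(-3.0, 1.0), pillar_size=0.16, n_sub=4):
    """Assign each point (x, y, z, reflectance) to a pillar in the x-y grid and to
    one of n_sub evenly spaced vertical subpillars inside that pillar.
    The ranges and n_sub are illustrative values, not the paper's exact settings."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # keep only points inside the detection range
    mask = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    pts = points[mask]
    # 2D pillar index in the x-y plane (0.16 m side length, as in the experiments)
    col = ((pts[:, 0] - x_range[0]) / pillar_size).astype(np.int64)
    row = ((pts[:, 1] - y_range[0]) / pillar_size).astype(np.int64)
    # vertical subpillar index: n_sub slices of equal height
    sub_h = (z_range[1] - z_range[0]) / n_sub
    sub = ((pts[:, 2] - z_range[0]) / sub_h).astype(np.int64).clip(0, n_sub - 1)
    return pts, row, col, sub
```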

In the subsequent HFE levels, we employ the farthest point sampling (FPS) algorithm [11] to sample the input subpillars according to their average vertical coordinates. Then, for each selected subpillar, we group the information of the two adjacent subpillars in the vertical direction to generate fewer, larger subpillars. By applying the same structure as in the first HFE level, we obtain the corresponding output features of the current level. Through this hierarchical grouping of points and subpillars, our encoder abstracts local patterns of the points and retains the vertical-direction information in the final point cloud feature. At last, a 2D pseudoimage is created by scattering the features back to the original pillar locations.
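The scattering step can be sketched in PyTorch as follows; the tensor shapes and the function name are illustrative and not taken from the paper's implementation.

```python
import torch

def scatter_to_pseudoimage(pillar_feats, rows, cols, H, W):
    """Scatter per-pillar feature vectors (P, C) back to their grid locations in the
    x-y plane, producing a dense (C, H, W) pseudoimage for the 2D CNN backbone."""
    P, C = pillar_feats.shape
    canvas = torch.zeros(C, H * W, dtype=pillar_feats.dtype, device=pillar_feats.device)
    flat_idx = rows * W + cols              # (P,) linear indices of nonempty pillars
    canvas[:, flat_idx] = pillar_feats.t()  # empty pillars stay zero
    return canvas.view(C, H, W)
```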

3.2. CNN Backbone Based on PCA Initialization

Inspired by our previous works [17, 19], we design a CNN backbone based on PCA initialization to create the dense features for the following anchor-free detector. As depicted in Figure 2, the backbone is composed of a top-down part and an upsampling-concatenation part. The top-down part is formulated as several blocks of 2D convolution and downsampling operations that extract increasingly semantic features with decreasing spatial size. Each block applies several convolution layers, a BN layer, and a ReLU layer sequentially. The convolutional kernels are all initialized by PCA as follows:

For a convolution layer with a given number of input and output channels, we first randomly divide the input feature maps into groups. Then, we perform a dense, fully covering sampling on each feature map group with the convolution kernel size to obtain a patch set. After mean normalization, the covariance matrix and the eigenvector matrix of each patch set are calculated, and the eigenvector with the largest eigenvalue is selected to initialize the convolution kernel weights of the corresponding group. The initialization thus points along the direction of maximum variance, i.e., maximum information entropy, of the sampled patches.
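A minimal PyTorch sketch of this initialization scheme is given below; the grouping details, the choice to leave uncovered output kernels at their default initialization, and the function names are assumptions made for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pca_init_conv(conv, sample_maps, groups):
    """Initialize conv.weight (C_out, C_in, k, k) from the principal direction of
    densely sampled patches. sample_maps: (N, C_in, H, W) example activations used
    only for initialization. Outputs not covered by a group keep their default init."""
    C_out, C_in, k, _ = conv.weight.shape
    perm = torch.randperm(C_in)                    # random split of the input channels
    out_per_group = C_out // groups
    for g, ch in enumerate(perm.chunk(groups)):
        # fully covering k x k patch sampling on this channel group
        patches = F.unfold(sample_maps[:, ch], kernel_size=k)        # (N, |ch|*k*k, L)
        patches = patches.permute(0, 2, 1).reshape(-1, ch.numel() * k * k)
        patches = patches - patches.mean(dim=0, keepdim=True)        # mean normalization
        cov = patches.t() @ patches / max(patches.shape[0] - 1, 1)
        _, eigvecs = torch.linalg.eigh(cov)                          # eigenvalues ascending
        principal = eigvecs[:, -1].view(ch.numel(), k, k)            # largest-eigenvalue direction
        for o in range(g * out_per_group, (g + 1) * out_per_group):
            conv.weight[o, ch] = principal       # same principal direction for the group
    return conv
```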

The features produced by each block are then upsampled to the same spatial size by transposed convolutions, each followed by BN and ReLU. Finally, the features from the different blocks are concatenated to form the final dense feature map.
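A sketch of this fusion part in PyTorch is given below, assuming two blocks with 64 and 128 channels (as in the experimental setup of Section 4.1); the upsampling strides and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UpsampleConcat(nn.Module):
    """Fusion part of the backbone: upsample the multiscale block outputs to a common
    resolution with transposed convolutions and concatenate them. Channel counts
    follow the experimental setup (64 and 128); the strides are assumptions."""
    def __init__(self, in_channels=(64, 128), out_channels=128, strides=(1, 2)):
        super().__init__()
        self.ups = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose2d(c, out_channels, kernel_size=s, stride=s, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for c, s in zip(in_channels, strides)
        ])

    def forward(self, feats):  # feats: list of block outputs at decreasing resolution
        return torch.cat([up(f) for up, f in zip(self.ups, feats)], dim=1)
```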

3.3. Anchor-Free Detector

The proposed anchor-free detector contains two modules to accomplish proposal generation and classification: (1) a center point classification module that produces a keypoint heatmap in the x-y plane for each object category and (2) a bounding box annotation module that regresses the offset, the 3D object size properties, and the orientation. All heads of the two modules share the common features from the backbone, and each head consists of two convolution layers of its own.
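The head layout could be sketched as follows; the kernel sizes, channel widths, and backbone feature dimension (256) are assumptions for illustration rather than the paper's exact values.

```python
import torch.nn as nn

def make_head(in_channels, out_channels, mid_channels=64):
    """One prediction head on top of the shared backbone features. The kernel sizes
    and channel widths here are illustrative assumptions, not the paper's values."""
    return nn.Sequential(
        nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_channels, out_channels, kernel_size=1),
    )

heads = nn.ModuleDict({
    "heatmap": make_head(256, 1),  # one channel per category (car-only setting)
    "offset":  make_head(256, 2),  # x-y center offset, shared by all categories
    "z":       make_head(256, 1),  # z coordinate of the center
    "size":    make_head(256, 3),  # length, width, height
    "rot":     make_head(256, 2),  # sine and cosine of the yaw angle
})
```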

3.3.1. Center Point Classification Module

The center point classification module generates the keypoint heatmap, which describes the object centers in the x-y plane. The 3D ground truth bounding box parameters are converted to the center point location label in the discretized x-y coordinate system by subtracting the origin location, dividing by the pillar side length, and applying the floor operation.
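A short sketch of this conversion, with illustrative origin values (the hypothetical function name and defaults are not from the paper):

```python
import numpy as np

def center_to_grid(cx, cy, x_min=0.0, y_min=-39.68, pillar_size=0.16):
    """Convert a ground-truth box center (cx, cy) in Lidar coordinates to discrete
    pillar-grid coordinates: subtract the grid origin, divide by the pillar side
    length, then take the floor. The origin values here are illustrative."""
    u = int(np.floor((cx - x_min) / pillar_size))
    v = int(np.floor((cy - y_min) / pillar_size))
    return u, v
```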

In the traditional center point allocation, only one pixel is selected as the positive sample for each ground truth center, which results in a severe imbalance between positive and negative samples in center point classification. To mitigate this problem, we propose an improved center point allocation strategy that computes a soft heatmap label for every pixel of the pseudoimage from two quantities: the diagonal length of the ground truth 2D bounding box and the maximum Euclidean distance between the pixel of the pseudoimage and the ground truth 2D bounding box centroid along the x- and y-axes in BEV.

Because some pixels around the ground truth center location can produce a bounding box with sufficient IoU with the ground truth box, we divide the pixels of the pseudoimage into a positive set and a negative set according to thresholds on the heatmap label. All other pixels are ignored in the training stage. To further balance the gradients of the positive and negative sets, we introduce the following focal loss [9] as the center classification loss to train the heatmap:

$$L_{\mathrm{cls}} = -\frac{1}{N}\left(\sum_{p \in \mathcal{P}} \bigl(1-\hat{Y}_p\bigr)^{\alpha}\log\hat{Y}_p + \sum_{p \in \mathcal{N}} \bigl(1-Y_p\bigr)^{\beta}\,\hat{Y}_p^{\alpha}\log\bigl(1-\hat{Y}_p\bigr)\right)$$

where $\mathcal{P}$ and $\mathcal{N}$ denote the positive and negative pixel sets, $\hat{Y}_p$ and $Y_p$ are the predicted and label heatmap values at pixel $p$, $N$ is the number of center points in the detection range, and $\alpha$ and $\beta$ are the hyperparameters set to 2 and 4 in the experiments.
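A minimal sketch of this classification loss, assuming the penalty-reduced focal form written above (the function name is hypothetical):

```python
import torch

def center_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss over the class heatmaps, matching the form above.
    pred, target: (N, C, H, W); pred lies in (0, 1). Pixels that the allocation
    strategy marks as ignored could additionally be masked out (omitted here)."""
    pos = target.eq(1).float()
    neg = 1.0 - pos
    pos_loss = pos * (1 - pred).pow(alpha) * torch.log(pred + eps)
    neg_loss = neg * (1 - target).pow(beta) * pred.pow(alpha) * torch.log(1 - pred + eps)
    num_pos = pos.sum().clamp(min=1.0)  # number of center points in the detection range
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```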

3.3.2. Bounding Box Annotation Module

This module regresses the corresponding bounding box annotation for each positive center point, which includes a two-dimensional center point offset regression, a z-axis center coordinate regression, a three-dimensional object size regression, and a two-dimensional orientation regression.

A discretization error is introduced when the floating-point center point locations are converted to 2D pillar coordinates in the center point classification module. Moreover, the increase in positive center point samples and occasional wrong predictions in the heatmap can lead to inaccurate center point locations in BEV. To recover the deviation caused by these factors and obtain more precise object centers, the offset head generates an offset map for the center points in the x-y plane, which is shared by all object categories. A logistic function is applied to constrain the output values to fall between 0 and 1. We use the L1 loss [4] as the offset loss.
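As a small sketch of how the offset target and loss might be computed (the function name and argument layout are assumptions):

```python
import torch.nn.functional as F

def offset_loss(pred_offset, gt_centers_xy, grid_uv):
    """L1 loss on the sub-pillar offset at each positive center. pred_offset: (M, 2)
    sigmoid outputs gathered at positive pixels; gt_centers_xy: (M, 2) continuous
    grid coordinates; grid_uv: (M, 2) their floored integer locations."""
    target = gt_centers_xy - grid_uv.float()  # fractional part, lies in [0, 1)
    return F.l1_loss(pred_offset, target)
```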

To further obtain the center points in the 3D Lidar coordinate system, the z-axis coordinate head regresses the center location along the z-axis. This head creates a z-value map for the center points, which is shared by all object categories. Due to the unconstrained range of the z-value regression, the gradients of inliers and outliers are imbalanced under the traditional L1 loss, making the regression difficult. Following [27], we use the balanced L1 loss to train the z-axis coordinate:

$$L_b(x) = \begin{cases}\dfrac{\alpha}{b}\bigl(b|x|+1\bigr)\ln\bigl(b|x|+1\bigr)-\alpha|x|, & |x|<1\\ \gamma|x|+C, & \text{otherwise}\end{cases}$$

where $x$ is the regression error between the predicted and ground truth z values, and $\alpha$, $b$, and $\gamma$ are the hyperparameters, which satisfy $\alpha\ln(b+1)=\gamma$ and are set empirically in the experiments.
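A sketch of this loss in PyTorch; the hyperparameter values below are the common defaults of [27], not necessarily the values used in the paper.

```python
import math
import torch

def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5):
    """Balanced L1 loss in the form given above; alpha and gamma here are the common
    defaults of [27], not necessarily the paper's settings."""
    b = math.exp(gamma / alpha) - 1.0         # from the constraint alpha * ln(b + 1) = gamma
    x = (pred - target).abs()
    small = alpha / b * (b * x + 1) * torch.log(b * x + 1) - alpha * x
    large = gamma * x + gamma / b - alpha     # constant chosen so the loss is continuous at |x| = 1
    return torch.where(x < 1.0, small, large).mean()
```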

We also regress the length, width, and height of the bounding box in the object size head. Similar to the z values, the size loss takes the balanced L1 form and is computed between the predicted object sizes and the ground truth values.

Finally, the yaw rotation around the z-axis is predicted in the orientation head. To avoid angle ambiguity, we regress two trigonometric components (the sine and the cosine) of the rotation angle and decode the angle at the inference stage. We employ the L1 loss as the orientation loss, computed between the predicted orientation feature map and the ground truth values.
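A small sketch of the sine-cosine encoding and its decoding (the function names are hypothetical):

```python
import torch

def encode_yaw(yaw):
    """Regression target of the orientation head: the (sin, cos) pair of the yaw angle."""
    return torch.stack([torch.sin(yaw), torch.cos(yaw)], dim=-1)

def decode_yaw(rot_pred):
    """Recover the yaw angle from the predicted (sin, cos) pair at inference time."""
    return torch.atan2(rot_pred[..., 0], rot_pred[..., 1])
```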

The overall training loss of the proposed network is defined as a weighted sum of the center point classification loss and the regression losses (offset, z coordinate, size, and orientation), where a weight balances the classification and regression terms.

3.4. Inference Stage

In this stage, we employ the max pooling operation to filter the peaks in the generated heatmap as the predicted centers, which is efficient and can avoid the time-consuming NMS. The inference algorithm for generating the detected 3D bounding boxes is shown in Algorithm 1.

Input: the set of detected center locations of each category in BEV; the number of detected centers of each category; the pillar side length
Output: detected bounding box set
1: for each detected center location do
2:  Obtain the corresponding x-y offset, z coordinate, size, and orientation predictions at that location and fine-tune the BEV center with the offset
3:  Decode the 3D bounding box in the Lidar coordinate system (scale the refined BEV center by the pillar side length, shift it by the detection range origin, and recover the yaw angle from the predicted sine and cosine), then add it to the output set
4: end for
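The NMS-free peak picking that produces the detected center locations consumed by Algorithm 1 can be sketched in PyTorch as below; the neighborhood size, the top-k truncation, and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def extract_centers(heatmap, k=50, kernel=3):
    """NMS-free peak picking: a pixel is kept if it equals the maximum of its local
    neighborhood (max pooling followed by the AND operation), then the top-k peaks
    are taken. heatmap: (N, C, H, W) class heatmaps after the sigmoid."""
    pad = kernel // 2
    local_max = F.max_pool2d(heatmap, kernel, stride=1, padding=pad)
    peaks = heatmap * (heatmap == local_max).float()   # zero out non-maximum pixels
    scores, idx = peaks.view(peaks.shape[0], -1).topk(k)
    C, H, W = heatmap.shape[1:]
    category = torch.div(idx, H * W, rounding_mode="floor")
    v = torch.div(idx % (H * W), W, rounding_mode="floor")  # row index in the BEV grid
    u = idx % W                                             # column index in the BEV grid
    return scores, category, v, u
```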

4. Experiments and Result Analysis

In this section, we first describe the dataset and summarize the implementation details. Then, to verify the validity and the improvements of our method on the 3D object detection problem, we provide ablation studies and compare its performance with other detection models on the dataset.

4.1. Dataset and Implementation Details

We employ the KITTI benchmark dataset [18] to evaluate the proposed method for the 3D object detection problem, and we only use the Lidar point clouds. The dataset contains a total of 7,481 training samples with annotations and 7,518 testing samples without labels. Following the standard convention, we split the annotated samples into a training set of 3,712 samples and a validation set of 3,769 samples. Three classes of objects, i.e., cars, pedestrians, and cyclists, are annotated. Since the car class has the most samples and the greatest diversity, as in other works [17, 22], only the car category is considered in the evaluation in this paper. Following the official evaluation protocol, average precision (AP) with an IoU threshold of 0.7 is selected as the metric for the car class.

For KITTI car detection, we follow PointPillars [14] in restricting the detection range along the x-, y-, and z-axes. The pillar side length is set to 0.16 m in the x-y plane. The maximum number of pillars and the maximum number of points in each pillar are set to 12,000 and 100, respectively. We arrange two HFE levels in the point cloud encoder. Two CNN blocks are applied to the generated pseudoimage; the number of convolutional layers is set to 7 and 8, and the number of feature map channels to 64 and 128, for the two blocks, respectively. We utilize the Adam optimizer to train the network. The batch size is set to 4, the learning rate to 0.0001, and the network is trained for 180 epochs. At inference time, the max pooling and AND operations are applied to obtain the center points.

4.2. Ablation Studies

In this section, we verify the effectiveness and reliability of the different parts of the proposed method for the 3D object detection problem. The ablation studies are carried out in four aspects, and the baseline model is a simplified version of the proposed method in which the normal pillar representation and the traditional center point allocation strategy are applied, similar to PointPillars.

The studies are carried out on a small subset of the KITTI validation dataset, and the results are summarized in Table 2.

To evaluate the effectiveness of the proposed point cloud encoder, we replace the traditional pillar-based encoder with it and denote the resulting model as method2 (m2). Compared with the baseline method, m2 takes 1.95x the time and gains 2.45% mAP, which shows that the vertical subdivision further improves the method.

Method3 (m3) improves m2 with PCA initialization of the CNN kernels. As shown in Table 2, m3 slightly improves m2 on both time and accuracy.

Method4 (m4) employs the anchor-free detector but without the size loss term. The anchor-free detector reduces the inference time significantly, owing to the avoidance of generating a large number of anchor boxes, and also improves the mAP slightly.

Method5 (HPAF) adds the size loss term to the total loss on top of method4. With this term added, the final proposed HPAF achieves the best mAP among all the configurations evaluated while spending no more time than m4.

4.3. Comparison with Other Methods

To further test the effectiveness and robustness of the proposed model in 3D object detection, we compare it with state-of-the-art models, including several one-stage methods and some two-stage detectors. The AP results for 3D detection and BEV detection on the KITTI test set are shown in Table 3. As can be seen from Table 3, most of the two-stage methods outperform the one-stage ones, which indicates that the second stage helps refine the object locations produced by the first stage and enhances the detection performance. However, these two-stage methods are time-consuming and have complicated models. Among all the one-stage methods, the proposed method outperforms VoxelNet, SECOND, PointPillars, and AFDet by 15.57%, 4.29%, 0.89%, and 5.40% in 3D mAP, respectively. On the other hand, the anchor-based methods need to predefine a large number of anchors and employ postprocessing to filter the predicted bounding boxes, which brings a computational burden and slows them down. Though the proposed anchor-free method is outperformed by most two-stage methods, it achieves competitive AP results compared to the state-of-the-art one-stage methods and shows its superiority in detection speed, as illustrated in Figure 3.

5. Conclusion

In this paper, we proposed a hierarchical pillar-based anchor-free detector to address the 3D object detection task. It encodes the raw point cloud into a hierarchical pillar representation and predicts object center points directly without predefined anchors. Experiments are conducted on the KITTI dataset to examine its performance. Our method achieves performance competitive with anchor-based methods and improves model efficiency by introducing PCA-based initialization and avoiding NMS postprocessing.

However, like other pillar-based methods, organizing the point cloud into regular grid structures, such as voxels and pillars, inevitably causes local information loss. As a result, hyperparameters such as the voxel/pillar size and the number of hierarchy levels should be carefully tuned to the dataset to obtain the best performance. Future work will focus on solving 3D object detection problems with the incorporation of RGB images and on developing a more suitable loss function for object parameter regression to improve the accuracy.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no competing interests.

Authors’ Contributions

X. Ren and S. Li were responsible for the conceptualization, the methodology, and the review and editing of the paper. X. Ren was responsible for the software, validation, formal analysis, investigation, resources, data curation, original draft preparation, and visualization. S. Li was responsible for the supervision, project administration, and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

This research work is funded by the National Natural Science Foundation of China under Grant 61971283 and the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0102.