Currently, there are many kinds of voxel-based multisensor 3D object detectors, while point-based multisensor 3D object detectors have not been fully studied. In this paper, we propose a new two-stage 3D object detection method based on point cloud and image fusion to improve the detection accuracy. To address the insufficient semantic information of point clouds, we perform multiscale deep fusion of LiDAR points and camera images in a point-wise manner to enhance point features. Because the density of LiDAR points is imbalanced, the point cloud of an object in the long-distance area is sparse. We therefore design a point cloud completion module that predicts the spatial shape of the objects in the candidate boxes and extracts structural information, which improves the feature representation and further refines the boxes. The framework is evaluated on the widely used KITTI and SUN-RGBD datasets. Experimental results show that our method outperforms all state-of-the-art point-based 3D object detection methods and has comparable performance to voxel-based methods as well.

1. Introduction

3D object detection is particularly useful in autonomous driving applications, because various types of dynamic objects must be recognized in the driving environment, such as surrounding vehicles, pedestrians, and cyclists. In recent years, various 3D detectors using LiDAR point clouds have been proposed, including PointRCNN [1], Part-$A^2$ [2], PV-RCNN++ [3], 3DSSD [4], and CIA-SSD [5]. Although LiDAR points capture the three-dimensional structure of an object and contain accurate depth information, they lack sufficient semantic information and suffer from point sparsity. Compared with LiDAR point clouds, RGB images have a more regular and dense data format and richer semantic information for distinguishing vehicles from the background. Therefore, some research works [6, 7] try to estimate the position and size of objects from monocular or stereo images. However, the biggest challenge of camera-based 3D object detection is that it cannot obtain accurate depth information, which is very important for 3D object detection. Because the representation under each sensor view has its own shortcomings, a single input view is not enough for a 3D object detector in autonomous driving. This prompts us to design an effective framework that integrates features from different perspectives to achieve accurate 3D object detection. Early multisensor feature fusion methods, such as MVF [8] and AVOD [9], take the RGB image, front view, and bird’s eye view (BEV) as input and then directly combine the features by cropping and resizing to generate 3D candidate boxes, but they ignore the different perspectives of the image and the BEV. In order to reduce the accuracy loss caused by different viewing angles, ContFuse [10] uses continuous convolution to improve feature fusion, and MVAF-Net [11] uses bilinear interpolation to correct features.
Although continuous convolution or bilinear interpolation can correct the alignment and thus mitigate the problem of different perspectives, quantizing the 3D structure of the point cloud into BEV pseudoimages in order to fuse image features inevitably causes a loss of accuracy. There are also some research works [12, 13] that use a 3D frustum projected from 2D bounding boxes to estimate 3D bounding boxes, but these methods require additional 2D annotations and their performance is limited by the 2D detectors. The above multisensor feature fusion methods all transform point clouds from a sparse format into a compact representation by projecting them into images or subdividing them into uniformly distributed voxels. We call these methods voxel-based multimodal feature fusion methods, since they voxelize the entire point cloud. However, voxel-based feature fusion inevitably loses some information and is relatively sensitive to the voxel parameters. There are also some methods that fuse image features directly with the LiDAR point cloud, instead of with the BEV or the voxelized pseudoimage of the point cloud. These methods are called point-based multimodal feature fusion methods. For example, PI-RCNN [14] directly fuses image features and point features, and EPNet [15] and MOT [16] perform deep fusion between point features and image features. In addition, since object detection serves the perception system of the autonomous vehicle, the farther away an object is detected, the more time is left for the decision and planning system, and the safer the autonomous vehicle will be. However, due to the imbalanced density of point clouds, the point clouds of short-distance objects are dense, while the point clouds of long-distance objects are sparse and contain less spatial information, which increases the difficulty of detecting distant objects.
In order to improve the detection accuracy of difficult cases, SIENet [17] predicts the shape of distant objects through a point completion network to enhance the spatial structure information. Inspired by these multitask works (EPNet and SIENet), this paper proposes a point-based multimodal fusion 3D object detection method with enhanced spatial structure.

The main contributions of this paper are as follows: (1) We design a new backbone network for multimodal feature fusion, which combines LiDAR points and camera images in a point-wise manner to enhance point features without point cloud voxelization or image annotation. (2) We propose a spatial structure enhancement module that predicts the shape of the object in the candidate box and learns structural information to further refine the box. (3) We propose a new two-stage 3D object detection framework based on point cloud and image fusion. The test results on the KITTI benchmark show that the accuracy of our method is higher than that of all the current multisensor-based 3D object detection methods.

2. Related Work

3D object detection based on LiDAR: due to the sparsity and irregularity of LiDAR point clouds, traditional convolutional neural networks (CNNs) cannot be directly applied to them. Many algorithms have tried various point cloud representations to solve this problem. Currently, there are three types of point cloud representation for the input of a 3D detector.

(1) Voxel representation. This approach converts point clouds into regular grids through voxelization, so that 3D CNNs can be applied directly. SECOND [18] divides the point cloud into voxels and uses sparse convolution to learn voxel features and generate 3D bounding boxes. PointPillars [19] converts point clouds into pseudoimages, eliminating the time-consuming 3D convolution operations. Fast-PointRCNN [20] introduces an attention mechanism to enhance the localization ability of the network. The RoI-aware pooling proposed by Part-$A^2$ [2] refines the candidate boxes and improves the 3D detection accuracy. Voxel-based methods have high perceptual ability, but voxelization of the point cloud causes information loss, and the storage and computing efficiency of 3D CNNs is very low.

(2) Point representation. This approach does not transform the original point cloud and directly uses PointNet++ [21] to process it and obtain global features, thus retaining the original geometric information as much as possible. F-PointNet [12] applies PointNet++ [21] to 3D detection on the point cloud cropped by 2D image bounding boxes. PointRCNN [1] is the first point-based 3D object detection method that uses only the point cloud as network input. 3DSSD [4] proposes a lightweight and efficient point-based single-stage 3D object detection framework, which achieves a good balance between accuracy and speed.

(3) Point-voxel joint representation. This approach takes points and voxels as inputs and fuses the point and voxel features at different stages of the network for 3D object detection, as in Part-$A^2$ [2] and PV-RCNN++ [3]. These methods combine voxel-based perception capabilities (i.e., 3D sparse convolution) and point-based geometric structure capabilities (i.e., set abstraction) to achieve high computational efficiency and a flexible receptive field, thereby improving 3D detection performance.

3D object detection based on multiple sensors: in recent years, great progress has been made in research on fusing multiple sensors such as cameras and LiDAR. AVOD [9] uses the RGB image and the BEV as input, proposes a feature pyramid backbone to extract features in the BEV, and combines features from the BEV feature map and the RGB feature map through cropping and resizing operations. ContFuse [10] applies continuous convolution to overcome the problem of different viewing angles between the image and the BEV. MVAF-Net [11] proposes a multiview adaptive fusion module to enhance feature fusion among the image, front view, and BEV. The above methods all try to fuse the features of the image and the BEV, but quantizing the 3D structure of the point cloud into a BEV pseudoimage in order to fuse image features inevitably causes accuracy loss. F-PointNet [12] uses a 3D frustum projected from 2D bounding boxes to estimate 3D bounding boxes, but this method requires additional 2D annotations, and its performance is limited by the 2D detector. There are also some methods that fuse image features directly with the LiDAR point cloud rather than with the LiDAR BEV or the voxelized pseudoimage of the point cloud. PI-RCNN [14] directly attaches the image semantic segmentation information to the LiDAR point cloud through the transformation matrix and then uses a LiDAR detector for 3D object detection. EPNet [15] and MOT [16] establish a deep fusion between the point cloud feature extractor and the image feature extractor to enhance the point cloud features. Although various sensor fusion networks have been proposed, they do not easily outperform LiDAR-only detectors, because the fusion of multiview features introduces interference and noise.

3. Our Framework

In this section, we introduce a new two-stage 3D object detection framework based on point cloud and image fusion. Firstly, we describe our proposed multiscale deep fusion strategy and proposal generation layers. Next, we propose a spatial structure prediction network, including point cloud region pooling, spatial structure enhancement, and refined regression head. Finally, the loss function is discussed. Our overall framework is shown in Figure 1.

3.1. Multiscale Feature Fusion RPN

As shown in Figure 2, our multiscale feature fusion RPN consists of a point branch and an image branch. Specifically, we first use a four-layer four-scale PointNet++ to extract point features from the point cloud. Meanwhile, the image branch extracts semantic features from the image through a four-layer four-scale Unet [22] segmentation network. Finally, the proposed adaptive attention fusion (AAF) module is used to fuse the point features at different scales with corresponding image semantic features to enhance the point features.

3.1.1. Point Branch

The point branch takes the LiDAR point cloud as input and generates 3D candidate boxes. It is composed of four paired set abstraction (SA) and feature propagation (FP) layers for extracting point cloud features. Each SA layer consists of a farthest point sampling (FPS) layer, a multiscale grouping (MSG) layer, and a PointNet layer, which downsample the points to improve efficiency and expand the receptive field. Each FP layer consists of bilinear interpolation and a multilayer perceptron (MLP), which propagate features to the points dropped during downsampling so that all points are recovered. Because the LiDAR point cloud lacks sufficient semantic information, we use the LI-Fusion module [15] to fuse rich image semantic features with point features. In addition, multiscale deep fusion of point clouds and images further enriches the point semantic features and yields compact and discriminative feature representations. The multiscale feature fusion method is shown in Figure 2.
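The downsampling step of an SA layer can be illustrated with a minimal farthest point sampling sketch. It is written in plain Python for readability and is not the implementation used in our network; the function name and the `start_index` parameter are assumptions made for this example.

```python
import math


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick num_samples indices so that the chosen points are far apart."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    selected = [start_index]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        # The next sample is the point farthest from everything selected so far.
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            d = math.dist(p, points[next_index])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

Each iteration costs time linear in the number of points, so practical implementations run this step on the GPU.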

3.1.2. Image Branch

In order to perform multiscale semantic feature fusion, we choose the lightweight semantic segmentation network Unet, which also has an encoder and a decoder, for image semantic feature extraction. Unet consists of four convolution blocks and four upsampling layers. Each of its convolution blocks has two repeated convolution layers and a max pooling layer. In order to obtain strong semantic features while limiting GPU memory usage, we modify the convolution block of Unet. Our convolution block consists of two repeated convolution layers (stride 1, padding 1) and one convolution layer (stride 2, padding 1). Each of the first two convolution layers is followed by a batch normalization layer and a ReLU activation function, as shown in Figure 3.
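The effect of the stride-2 convolution on the feature map size can be checked with a short calculation. The sketch below assumes a 3 × 3 kernel, which is not stated above, and uses the 1280 × 384 input resolution given in Section 4.2; the function names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor convention, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def block_output_size(size, kernel=3):
    """Size after two stride-1 convolutions and one stride-2 convolution (padding 1)."""
    for stride in (1, 1, 2):
        size = conv_output_size(size, kernel, stride, padding=1)
    return size


# Pass the input resolution through four convolution blocks.
width, height = 1280, 384
for _ in range(4):
    width, height = block_output_size(width), block_output_size(height)
```

Under these assumptions each block halves the resolution, so four blocks reduce the input by a factor of 16 in each dimension.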

3.1.3. Adaptive Attention Fusion Module

In order to fuse data from two different views, we first use the projection method to establish the relationship between LiDAR points and image pixels. Then, we obtain the semantic features of each point through grid sampling. Finally, the proposed adaptive attention fusion (AAF) module is used to perform feature fusion. Specifically, we map each point coordinate $p$ through the projection matrix $M$ to the corresponding image coordinate $p'$, which can be written as

$$p' = M \times p, \tag{1}$$

where $M$ contains the internal parameters of the camera and has size $3 \times 4$. Note that we convert $p$ and $p'$ into four-dimensional and three-dimensional vectors in homogeneous coordinates in projection formula (1). After establishing the correspondence, we use the grid sample function of the PyTorch framework to obtain the semantic features of each point on the image. Because the projected point may fall between adjacent pixels, bilinear interpolation is needed to obtain the image feature at the continuous coordinates, which can be written as

$$F^{(p)} = \mathcal{B}\left(F^{(\mathcal{N}(p'))}\right),$$

where $F^{(p)}$ is the corresponding image feature for point $p$, $\mathcal{B}$ is the bilinear interpolation function, and $F^{(\mathcal{N}(p'))}$ is the image feature of the adjacent pixels of the projection point $p'$. Finally, in order to better integrate point cloud features and image features, we design an adaptive attention fusion module to suppress the interference of noninterested areas and extract effective information for fusion, as shown in Figure 4. The adaptive attention fusion module takes the point cloud features and the point-wise image semantic features as input, forms extended features and two-branch attention features, and outputs the fusion features. It is built from fully connected layers, element-wise addition, element-wise multiplication, the Sigmoid and Tanh activation functions, and the concatenation operation.
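The projection and sampling steps can be illustrated with a minimal plain-Python sketch. It is a simplified stand-in for the PyTorch grid sampling used in our implementation: it works on nested lists, addresses pixels by integer coordinates rather than normalized ones, and clamps coordinates at the image border. The function names and the example layout of the feature map (height × width × channels) are assumptions made for this example.

```python
import math


def project_point(proj, point):
    """Project a 3D point (x, y, z) to pixel coordinates (u, v) with a 3x4 matrix."""
    x, y, z = point
    homog = (x, y, z, 1.0)  # four-dimensional homogeneous coordinates
    u_h, v_h, w_h = (sum(m * c for m, c in zip(row, homog)) for row in proj)
    if w_h <= 0:
        raise ValueError("point is behind the camera")
    return u_h / w_h, v_h / w_h


def bilinear_sample(feature_map, u, v):
    """Interpolate an H x W x C feature map at continuous pixel coordinates (u, v)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so that the four neighbouring pixels stay inside the map.
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, width - 1), min(v0 + 1, height - 1)
    du, dv = u - u0, v - v0
    channels = len(feature_map[0][0])
    return [
        (1 - du) * (1 - dv) * feature_map[v0][u0][c]
        + du * (1 - dv) * feature_map[v0][u1][c]
        + (1 - du) * dv * feature_map[v1][u0][c]
        + du * dv * feature_map[v1][u1][c]
        for c in range(channels)
    ]
```

Because the interpolation weights depend only on the projected coordinates, the sampling is differentiable with respect to the image features, which is what allows the image branch to be trained jointly with the point branch.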

3.2. Spatial Structure Enhancement Module

For each candidate box generated in the RPN stage, the denser the foreground point set is, the more spatial information is retained. Therefore, the central idea of our spatial structure enhancement module is to predict the complete shapes of candidate objects and extract structural information to enhance the feature representation. To this end, we need to solve two subtasks, namely, how to predict the spatial shape, and how to extract the spatial structural information and integrate it into the model to further refine the candidate box.

3.2.1. Point Cloud Region Pooling

After obtaining 3D bounding box proposals, we use RoI pooling [1] to optimize the box locations and orientations. Specifically, 512 candidate regions of RPN are sampled through NMS to obtain 64 candidate regions of RCNN. For each 3D box $b_i = (x_i, y_i, z_i, h_i, w_i, l_i, \theta_i)$, we slightly enlarge it to create a new 3D box $b_i^e = (x_i, y_i, z_i, h_i + \eta, w_i + \eta, l_i + \eta, \theta_i)$, so as to obtain additional context information, where $(x_i, y_i, z_i)$ is the center location of object $i$, $(h_i, w_i, l_i)$ is the size of object $i$, $\theta_i$ is the object orientation in the bird’s eye view, and $\eta$ is a constant used to expand the box size. For each point, through the segmentation mask, we perform an inside/outside test to determine whether the point is within the expanded bounding box proposal $b_i^e$. If it is an internal point, the point and its features are retained to refine the box $b_i$. Finally, we obtain 512 points for each candidate box and encode them to get the pooling feature $F \in \mathbb{R}^{512 \times C}$, where $C$ represents the number of channels.
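The box enlargement and the inside/outside test can be sketched in a few lines of plain Python. The sketch assumes that a box is stored as (x, y, z, h, w, l, theta) with (x, y, z) at the geometric center, that z is the vertical axis, and that the length runs along the heading direction; these conventions, the function names, and the constant name `eta` are assumptions for illustration and may differ from the coordinate system of a given dataset.

```python
import math


def enlarge_box(box, eta):
    """Grow a box (x, y, z, h, w, l, theta) by eta along each size dimension."""
    x, y, z, h, w, l, theta = box
    return (x, y, z, h + eta, w + eta, l + eta, theta)


def point_in_box(point, box):
    """Test whether a point lies inside an oriented box centred at (x, y, z)."""
    px, py, pz = point
    x, y, z, h, w, l, theta = box
    # Rotate the horizontal offset into the box frame (rotation about the vertical axis).
    dx, dy = px - x, py - y
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    local_x = dx * cos_t + dy * sin_t
    local_y = -dx * sin_t + dy * cos_t
    return abs(local_x) <= l / 2 and abs(local_y) <= w / 2 and abs(pz - z) <= h / 2
```

A point just outside the original box can fall inside the enlarged one, which is how the pooled region gains surrounding context.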

3.2.2. Spatial Shape Prediction

The foreground points of the candidate box constitute a shape that provides semantic clues; however, this shape is usually incomplete. Therefore, based on the point completion framework PCN [23], we design a spatial structure prediction network to complete the missing part of the object in the candidate box. As shown in Figure 5, the network takes incomplete points as input and predicts the corresponding dense shape through an encoder-decoder. The encoder consists of two simple PointNet units (SharedMLP + Maxpool), each SharedMLP consisting of a convolution layer, a BN layer, a ReLU layer, and a convolution layer. The number of convolution output channels for the first SharedMLP is (128, 256), and the number of convolution output channels for the second SharedMLP is (512, 1024). The decoder consists of two stacked fully connected layers (Linear + BN + ReLU) and one fully connected layer (Linear), and the output is a matrix. The number of output channels for the three fully connected layers is (1024, 1024, ). Unlike the coarse-to-fine pipeline in PCN, we believe that the coarse output is sufficient for subsequent processing, so we remove the fine output branch, thus saving GPU memory. In order to reduce the burden of training, we pretrain our spatial shape prediction network on the KITTI car dataset [23].

Figure 6 shows part of the visualization results of our spatial structure prediction model. It can be seen from the figure that our spatial structure prediction model performs well on automobiles and has a good generalization prospect.

3.2.3. Structure Information Extraction and Fusion

To obtain the local and global context from the predicted spatial shapes, we use a PointNet++ [21] module to extract the structural information. First, we use the FPS algorithm to select 512 points from the predicted shape. Then, for each point, we use the ball query algorithm to generate a local area. Finally, PointNet units are applied to capture the local area feature of each point, thereby obtaining the enhanced features. In the refinement subnetwork, we use a 3D box refinement network similar to that of PointRCNN [1] to further refine the box and its confidence. The input of the refinement subnetwork consists of the canonically transformed coordinates of each pooled point, the pooling features, and the extracted spatial structure features. Since the pooling features and the spatial structure features come from different sources, concatenating them without any additional processing may cause interference. In order to better fuse the spatial structure features and the pooling features, we adopt the perspective-channel attention fusion [24] to obtain the merged feature.
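The grouping step can be illustrated with a minimal ball query sketch in plain Python. The radius and the neighbor limit are hypothetical parameters chosen for this example; they are not values reported in this paper.

```python
import math


def ball_query(points, center, radius, max_neighbors):
    """Return indices of up to max_neighbors points within radius of center."""
    neighbors = []
    for i, p in enumerate(points):
        if math.dist(p, center) <= radius:
            neighbors.append(i)
            if len(neighbors) == max_neighbors:
                break
    return neighbors
```

Applying the query once per FPS-selected center yields one local region per center, and a PointNet unit then summarizes each region into a single feature vector.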

3.3. Loss Function

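The proposed network is trained in an end-to-end manner. Our overall loss includes the two-stream RPN loss in stage 1 and the box refinement network loss in stage 2:

$$L_{total} = \lambda_{rpn} L_{rpn} + \lambda_{rcnn} L_{rcnn},$$

where $\lambda_{rpn}$ and $\lambda_{rcnn}$ are the coefficients that control the balance weight; we set $\lambda_{rpn} = 1.0$ and $\lambda_{rcnn} = 1.0$. $L_{rpn}$ and $L_{rcnn}$ adopt similar optimization objectives, including a classification loss, a regression loss, and a consistency enhancement loss. For the classification loss at the RPN stage, we use a focal loss similar to [25] to balance positive and negative samples:

$$L_{cls}^{rpn} = -\frac{1}{N} \sum_{i=1}^{N} \alpha \left(1 - p_t^{(i)}\right)^{\gamma} \log p_t^{(i)}, \qquad p_t^{(i)} = \begin{cases} p_i, & c_i = 1, \\ 1 - p_i, & \text{otherwise}, \end{cases}$$

where $c_i$ is the target classification label, $p_i$ is the positive sample prediction probability, $N$ represents the number of targets, and $\alpha$ and $\gamma$ are the focal loss hyperparameters. For the regression loss in the RPN stage, we adopt a bin-based regression loss similar to [1] to regress the center point $(x, y, z)$, size $(h, w, l)$, and orientation $\theta$:

$$L_{reg}^{rpn} = \sum_{u \in \{x, z, \theta\}} E\left(b_u, \hat{b}_u\right) + \sum_{u \in \{x, y, z, h, w, l, \theta\}} S\left(r_u, \hat{r}_u\right),$$

where $E$ denotes the cross-entropy classification loss, $S$ denotes the smooth $L_1$ loss, $\hat{b}_u$ and $\hat{r}_u$ denote the bins and residuals of the ground truth, and $b_u$ and $r_u$ denote the predicted bins and residuals. In addition, in order to improve the consistency of localization confidence and classification confidence, we add a consistency enhancement loss:

$$L_{ce}^{rpn} = -\log\left(c \times \frac{\mathrm{Area}(D \cap G)}{\mathrm{Area}(D \cup G)}\right),$$

where $D$ represents the predicted bounding box, $G$ represents the ground truth, and $c$ represents the classification confidence of the predicted box. In summary, $L_{rpn}$ is a weighted sum of the three loss functions:

$$L_{rpn} = \lambda_{cls} L_{cls}^{rpn} + \lambda_{reg} L_{reg}^{rpn} + \lambda_{ce} L_{ce}^{rpn},$$

where $\lambda_{cls}$, $\lambda_{reg}$, and $\lambda_{ce}$ are the balance coefficients that control the relative importance of the losses. We set $\lambda_{cls} = 1.0$, $\lambda_{reg} = 1.0$, and $\lambda_{ce} = 5.0$. Similarly, the RCNN loss also includes a classification loss, a regression loss, and a consistency enhancement loss. For the RCNN classification loss, we adopt the binary cross-entropy loss:

$$L_{cls}^{rcnn} = -\frac{1}{N} \sum_{i=1}^{N} \left[c_i \log p_i + (1 - c_i) \log (1 - p_i)\right],$$

where $c_i$ is the target classification label, $p_i$ is the target prediction probability, and $N$ is the number of targets. The RCNN regression loss $L_{reg}^{rcnn}$ and consistency enhancement loss $L_{ce}^{rcnn}$ are defined in the same way as for the RPN. The RCNN loss is the weighted sum of the three loss functions:

$$L_{rcnn} = \lambda'_{cls} L_{cls}^{rcnn} + \lambda'_{reg} L_{reg}^{rcnn} + \lambda'_{ce} L_{ce}^{rcnn},$$

where $\lambda'_{cls}$, $\lambda'_{reg}$, and $\lambda'_{ce}$ are the corresponding balance coefficients.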
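The two classification losses and the weighted combination can be illustrated with a minimal plain-Python sketch. The focal loss follows the standard formulation of [25]; the default values `alpha=0.25` and `gamma=2.0` are the common defaults from that work and are assumptions here, since our hyperparameter values are not listed above. The default weights (1.0, 1.0, 5.0) are the RPN coefficients stated in this section, and the function names are hypothetical.

```python
import math


def focal_loss(labels, probs, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean focal loss over binary labels and predicted positive probabilities."""
    total = 0.0
    for c, p in zip(labels, probs):
        p_t = p if c == 1 else 1.0 - p  # probability assigned to the true class
        p_t = min(max(p_t, eps), 1.0)
        total += -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(labels)


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy over labels and predicted positive probabilities."""
    total = 0.0
    for c, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(c * math.log(p) + (1 - c) * math.log(1.0 - p))
    return total / len(labels)


def weighted_loss(cls_loss, reg_loss, ce_loss, weights=(1.0, 1.0, 5.0)):
    """Weighted sum of classification, regression, and consistency enhancement losses."""
    w_cls, w_reg, w_ce = weights
    return w_cls * cls_loss + w_reg * reg_loss + w_ce * ce_loss
```

With `alpha=1.0` and `gamma=0.0` the focal loss reduces to the binary cross-entropy, which makes the role of the modulating factor explicit: it only down-weights samples that are already classified with high confidence.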

4. Experiments

In this section, we evaluate our method on two common 3D object detection datasets, namely the outdoor dataset KITTI [26] and the indoor dataset SUN-RGBD [27]. In Section 4.1, we introduce these datasets and evaluation metrics. In Section 4.2, we provide the implementation details of the experiment. In Section 4.3 and Section 4.4, we show the comparison results on the outdoor and indoor datasets, respectively. Finally, we conduct an extensive ablation study to analyze our proposed 3D object detection model in Section 4.5.

4.1. Datasets and Evaluation Metric

KITTI is the most popular standard benchmark dataset for autonomous driving, consisting of 7,481 samples for training and 7,518 samples for testing. As a common practice, the training samples are divided into a train set with 3,712 samples and a val set with 3,769 samples. The KITTI 3D object detection benchmark uses the average precision (AP) with a bounding box overlap threshold of 0.7 as the evaluation metric for cars, where three difficulty levels (easy, moderate, and hard) are taken into consideration. SUN-RGBD is a benchmark dataset for indoor 3D object detection. The dataset consists of 10,335 images with oriented 3D bounding boxes of 37 object categories, including 5,285 images for training and 5,050 images for testing. We follow the same settings as VoteNet [28] and report the performance on 10 classes of SUN-RGBD. We use the average precision (AP) with a 3D overlap threshold of 0.25 as the evaluation metric for SUN-RGBD. We compare our method with the state-of-the-art methods on the KITTI and SUN-RGBD test sets.

4.2. Implementation Details

The two-stream RPN takes LiDAR point clouds and camera images as input. We select 16,384 points from the raw LiDAR point cloud as the input of the point stream and take the image with a resolution of 1280 × 384 as the input of the image stream, which is the same as EPNet [15]. We use four SA layers (4096, 1024, 256, and 64 points) to subsample the input LiDAR point cloud and use four FP layers to recover the size of the point cloud for foreground segmentation and candidate box generation. Similarly, we use four convolutional blocks to downsample the input image and four transposed convolutional layers to restore the size of the image. In the NMS process, we select 8000 proposals generated by the two-stream RPN based on the classification confidence and then filter the redundant proposals with an NMS threshold of 0.8 to obtain 64 proposals for the refinement network. In the process of refining candidate boxes, we train a spatial structure prediction model in advance and then initialize the spatial shape prediction network with its weights. In the ablation experiment, we refer to the two-stage image classification strategy [29] to analyze the speed of our method. We train the model in an end-to-end manner on a GeForce RTX 3090 with the Adam optimizer [30], an initial learning rate of 0.002, and a weight decay of 0.001. The minibatch size is set to 2 and the model is trained for 40 epochs.
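The proposal filtering step can be sketched as a greedy NMS in plain Python. As a simplification, the sketch measures overlap with axis-aligned rectangles in the bird's eye view, whereas 3D detectors normally use the IoU of rotated boxes. The default values mirror the settings above (8000 proposals before NMS, a threshold of 0.8, and 64 proposals after NMS); the function and parameter names are assumptions for this example.

```python
def bev_iou(a, b):
    """IoU of two axis-aligned rectangles given as (x_min, y_min, x_max, y_max)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.8, pre_nms_top_k=8000, post_nms_top_k=64):
    """Greedy NMS: keep high-scoring boxes that do not overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order[:pre_nms_top_k]:
        if all(bev_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == post_nms_top_k:
                break
    return keep
```

A high threshold such as 0.8 removes only near-duplicate proposals, so that enough distinct candidates remain for the refinement stage.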

4.3. Experimental Results in KITTI

We compare the proposed two-stage detector with other state-of-the-art methods and submit the results to the KITTI server for evaluation. As shown in Table 1, we evaluate our method on the BEV detection benchmark and the 3D object detection benchmark of the KITTI test set. It can be seen that our method is ahead of the advanced single-stage multisensor methods ContFuse [10], MAFF [31], MVX-Net [32], and MVAF-Net [11] in terms of 3D mAP by 10.32%, 5.64%, 4.18%, and 1.01%, respectively. It should be pointed out that our method is a point-based two-stage multisensor method, so we focus on the performance comparison with the point-based multisensor methods. It can be seen that our method outperforms the advanced point-based multisensor methods F-PointNet [12], IDMOD [33], PI-RCNN [14], and EPNet [15] by 10.84%, 5.45%, 5.29%, and 0.47%, respectively. At the same time, our method is also superior to most voxel-based methods.

The visualization results of our method on KITTI are shown in Figure 7. For better visualization, we project the 3D bounding box of LiDAR coordinates to the RGB image. The upper part is the image 3D detection result, and the lower part is the point cloud scene detection result. It can be seen that our method performs well in capturing distant cars, although these objects are difficult to identify in RGB images and are susceptible to sparse point clouds.

4.4. Experimental Results in SUN-RGBD

We further perform experiments on the SUN-RGBD dataset to verify the effectiveness of our method in indoor scenarios. Table 2 shows the results compared with the most advanced methods. Our method achieves excellent detection performance, outperforming PointFusion [35], F-PointNet [12], VoteNet [28], MBDF-NET [36], and EPNet [15] by 16.1%, 6.2%, 2.5%, 0.7%, and 0.4%, respectively. Specifically, F-PointNet and VoteNet both estimate 3D bounding boxes of point clouds based on 2D bounding box projections of images. PointFusion combines point cloud features and image features by concatenation. Different from them, our method establishes a correspondence between image features and point features, thus providing a clearer representation. In addition, the comparison with the multisensor-based methods EPNet and MBDF-NET is particularly valuable, because they also establish a mapping between image features and point features. However, EPNet and MBDF-NET do not consider the sparsity of the point cloud, and MBDF-NET is a three-branch detector.

The visualization results of our method on SUN-RGBD are shown in Figure 8. Unlike the KITTI dataset, the SUN-RGBD dataset contains objects of multiple categories and different scales. It can be seen from Figure 8 that our method can better detect a variety of objects with obvious scale changes, including small objects (such as chair and dressing table) and large objects (such as sofa and bed).

4.5. Ablation Studies

We conduct a series of ablation studies on the KITTI dataset to analyze multiscale fusion RPN and spatial structure enhancement modules. All models are trained on the training set and evaluated on the validation set of the KITTI dataset for car detection. All evaluations on the validation set are conducted through 40 recall positions.
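The AP over 40 recall positions can be illustrated with a minimal plain-Python sketch. It takes a precision-recall curve as two parallel lists and averages the interpolated precision at the recall levels 1/40, 2/40, ..., 1; the function name is an assumption, and the construction of the curve from matched detections is omitted.

```python
def average_precision_r40(recalls, precisions):
    """Mean interpolated precision at the 40 recall levels 1/40, 2/40, ..., 1."""
    total = 0.0
    for k in range(1, 41):
        threshold = k / 40
        # Interpolated precision: best precision at any recall >= threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(candidates, default=0.0)
    return total / 40
```

For example, a detector with precision 1.0 up to recall 0.5 and precision 0.5 at recall 1.0 obtains an AP of 0.75 under this definition.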

4.5.1. Effect of Multiscale Fusion RPN

In Table 3, we investigate the effectiveness of different structures in the multiscale fusion RPN. We analyze the effect of each structure on stage 1 by removing one structure while leaving the others unchanged. For a fair comparison, all the experiments share the same fixed stage 2. In the first row, we remove the image semantic branch, and the performance decreases significantly, which demonstrates the advantage of semantic segmentation. Then we compare two different fusion schemes. One is the single-scale feature propagation layer (SFP) fusion, which is similar to the multisensor feature fusion backbone network of EPNet [15]: the image semantic features are fused only with the last feature propagation layer. The other is the multiscale feature propagation layer (MFP) fusion, where image semantic features are fused with each feature propagation layer (see Figure 2). The results show that MFP is better than SFP by 0.25% in 3D mAP, which indicates that applying semantic features at multiple feature propagation layers is effective. We also give the inference time in Table 3. It can be seen that the inference time of SFP is similar to the baseline, and the time consumption of MFP does not increase much.

4.5.2. Effect of Convolution Layer

Table 4 shows the effect of the number of convolution layers on the performance of image semantic segmentation. We take the number of convolution layers in the Unet convolution block as the baseline. When the number of convolution layers in the convolution block is increased moderately, the AP increases slightly, but increasing it excessively reduces the AP. This is because a convolutional neural network of reasonable depth can extract more image semantic features, whereas a network that is too deep leads to overfitting and hinders convergence. At the same time, it can be seen from Table 4 that the inference time increases slightly as the number of convolution layers increases.

4.5.3. Effect of Point Cloud Region Pooling

Table 5 shows the effects of different pooling context widths on performance. When no context information is pooled, the accuracy of 3D object detection decreases significantly, especially for difficult instances. Because the object might be occluded or far away from the sensor, difficult cases often have fewer points in the candidate box and therefore require more contextual information to classify and refine the candidate box. As shown in Table 5, a pooling context width that is too large also results in performance drops, because the pooled region of the current candidate box may include noisy foreground points of other objects.

4.5.4. Effect of Spatial Structure Enhancement

We explore the effects of the spatial structure enhancement module in Table 6. In the first row, we do not use the spatial structure enhancement module. In the second row, we add the spatial structure enhancement module and use only the simplest concatenation fusion, which reduces the AP. This is because the pooling features and the spatial structure features come from different sources, and concatenating them without any additional processing produces interference. In the third row, we use perspective-channel attention fusion to fuse the features of the spatial structure enhancement module, and the gain in mAP is 0.42%. This is because the spatial structure enhancement module helps the model better capture spatial information. In addition, the spatial structure enhancement module increases the inference time by only 11 ms compared with the RCNN baseline.

5. Conclusion

In this paper, we introduce a multiscale fusion RPN for feature extraction and proposal generation. We also propose a novel spatial structure enhancement module for detecting 3D objects from point clouds with imbalanced density. Specifically, the module predicts the complete shape of the object in the candidate box and learns structural information to enhance the features for box refinement. Extensive experiments verify the effectiveness of our proposed framework.

Data Availability

The data used to support the findings of this study are available at https://github.com/liuhuaijjin/EPFSI/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments


This work was supported by the Natural Science Foundation of China (nos. 61673186 and 61871196), the Natural Science Foundation of Fujian Province of China (no. 2019J01082), and the Promotion Program for Young and Middle-Aged Teachers in Science and Technology Research of Huaqiao University (ZQN-YX601).