Abstract

3D object detection is a significant aspect of the perception module in autonomous driving; however, with current technology, data sharing between vehicles and cloud servers for cooperative 3D object detection under strict latency requirements is limited by the communication bandwidth. Sixth-generation (6G) networks significantly accelerate the transmission of sensor data by providing low-latency, high-speed links. However, which sensor data format to transmit, and when to transmit it, remain open challenges. To address these issues, this study proposes a cooperative perception framework that combines a pillar-based encoder with Octomap-based compression at the edge, allowing connected autonomous vehicles to reduce missed detections in blind spots and at longer distances. By applying a fixed encoder to learn a representation of the point clouds (LiDAR sensor data), the approach satisfies the accuracy constraints of the perception task and provides drivers or autonomous vehicles with sufficient reaction time. Extensive experimental results show that the proposed approach outperforms previous cooperative perception schemes while running at 30 Hz and improves the accuracy of object bounding boxes at longer distances (greater than 12 m). Furthermore, it achieves a lower total delay for processing the fused data and transmitting the cooperative perception messages. To the best of our knowledge, this study is the first to introduce a pillar-based encoder and Octomap-based compression framework for cooperative perception between vehicles and edges in connected autonomous driving.

1. Introduction

Object perception is indispensable for sensing the surrounding environment and improving driving safety in autonomous vehicles (AVs). However, the perception capability of onboard sensors can be impaired by external factors such as obstacles and weather conditions [1]. To address this issue, the European Telecommunications Standards Institute (ETSI) has proposed generation rules for Cooperative Perception Messages (CPMs) [2], which specify that connected autonomous vehicles (CAVs) may generate CPMs once their onboard object detection algorithm has produced object results from local sensor data. CPMs thus improve the perception capability of CAVs by expanding the effective sensing range and reducing missed detections [3].

Nevertheless, CAVs still cannot perceive vehicles that are missed by their local sensor data, because the content of CPMs only includes object results. Figure 1 compares CPM-based detection with detection on fused sensor data. Two cases illustrate undetected objects when each vehicle perceives only a sparse point cloud: the upper pictures show an intersection, and the lower pictures show a bidirectional road. Figure 1(a) shows the object detection result (yellow bounding boxes) from one vehicle, and Figure 1(b) shows the result (green bounding boxes) from the other vehicle. Finally, the fusion result in Figure 1(c) shows that the previously undetected vehicle (white boxes) can be detected.

Cooper [4] introduced the basic idea of cooperative perception: transmitting CPMs containing LiDAR data from multiple CAVs and fusing the point clouds to detect 3D objects. However, transmitting raw sensor data (such as point clouds) is challenging even with sixth-generation (6G) communication technology. Notably, a single frame from a 64-beam LiDAR can be about 3-4 MB, and the sample rate of a typical LiDAR is 30 Hz, so the required network capacity is at least 960 Mb/s per vehicle. In addition, the raw point cloud is a sparse representation due to the way the laser scanner measures the scene. As a result, transmitting raw sensor data is inefficient and wastes limited communication resources.
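For illustration, the bandwidth figure above can be checked with a simple back-of-the-envelope calculation (the frame size and rate are the approximate values quoted in the text, not measurements):

# Rough bandwidth estimate for streaming raw 64-beam LiDAR frames.
frame_size_mbyte = 4.0      # approx. 3-4 MB per frame (upper bound used here)
frame_rate_hz = 30          # typical LiDAR sample rate assumed in the text

required_mbit_per_s = frame_size_mbyte * frame_rate_hz * 8
print(f"Required capacity per vehicle: {required_mbit_per_s:.0f} Mb/s")  # ~960 Mb/s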

To overcome the above challenges, this work proposes a cooperative perception scheme integrating pillar-based encoding of sensor data with Octomap-based compression. To the best of our knowledge, this pillar-based fusion for cooperative perception at the edge is the first study that tackles the problem of missed object detection in this setting. Furthermore, the proposed solution reduces communication delay and improves the perception capability of CAVs. The main contributions of this study can be summarized as follows. First, this study proposes a new data transmission type for exchanging cooperative perception messages between CAVs and edges, exploring the potential of the "intermediate outputs" of 3D object detection algorithms. Second, the detection scheme runs at 30 Hz, and the precision on distant vehicles (more than 20 m away) is improved compared with the feature map fusion solution [5]; autonomous vehicles therefore have sufficient reaction time to handle traffic emergencies. Finally, the scheme running on the edges does not require a specific onboard object detection algorithm; in other words, CAVs do not rely solely on object results from the edges. The proposed solution can be seen as an enhancement of the perception capability of CAVs enabled by 6G communication.

The remainder of this work is organized as follows. Section 2 presents the preliminaries and background. Section 3 introduces the proposed object perception scheme on CAVs, including the phases for transmitting the specific cooperative perception messages and receiving semantic and precise object results; it then presents a pillar-based fusion algorithm on the edges to produce cooperative perception results, integrating the CAV-based and edge-based schemes into one framework. Simulation results and conclusions are given in Sections 4 and 5, respectively.

2. Preliminaries and Background

Given the state of the art in 3D object detection for autonomous driving, this work starts by briefly reviewing the data choices, including data type, fusion level [6], compression, and reconstruction in general. It then focuses on algorithms for object detection from LiDAR point clouds. Finally, this section reviews existing schemes for cooperative perception.

2.1. Data Choice
2.1.1. Type

The rapid development of 3D object detection has motivated an increasing number of studies on efficient representations for detecting vehicles in point clouds and images. Images provide rich color properties and detailed texture information, and numerous studies on image-based 3D object detection [7-11] benefit from mature 2D image-based object detection technology. In these algorithms, 2D bounding boxes are first obtained by processing the images; the bounding boxes are then lifted from 2D to 3D using different methods, such as template matching [7, 8, 12, 13] and geometric properties [14-16]. However, images lack depth information and are susceptible to lighting conditions, so the accuracy of such detection algorithms is bounded by the capability of depth estimation and light compensation. Point clouds provide accurate 3D position, speed, and depth information of objects and have advantages in the spatial dimension over 2D images [4]. LiDAR is less affected by the environment; it works even in darkness or bad weather and generates real-time, high-resolution point clouds of the surroundings. In addition, the point cloud representation, which gives each point's spatial position and reflectivity, is easy to process. Therefore, this work is dedicated to exchanging data derived from point clouds.

2.1.2. Fusion Level

This study uses the classification of sensor data fusion defined in [6]: low-level, feature-level, and high-level. Briefly, low-level fusion exchanges raw data from the onboard sensors without processing. Although low-level fusion keeps the original spatial information of objects, its transmission requires ultrahigh bandwidth and is difficult to support even in a 6G network. Feature-level fusion, on the other hand, exchanges a preprocessed output derived from the raw data before the object results are produced; it can therefore reduce the data size while keeping the main features. Finally, in high-level fusion, each onboard sensor independently processes its raw data to generate an object list, and the lists are then fused. This is less complex to implement in practice, but it cannot recover objects that were never detected before fusion. Considering the advantages and disadvantages of the different fusion levels, this study focuses on feature-level fusion.

2.1.3. Compression

The point cloud provides a highly realistic view of the surrounding environment, but it also costs large bandwidth to transmit uncompressed. Therefore, compression of point clouds is necessary. The approaches proposed in current research can be summarized into three modes: 1D traversal, 2D projection, and 3D correlation [17]. 1D prediction methods construct a prediction tree to convert the geometry data into a 1D representation; although relatively simple to implement, they do not consider the 3D spatiality of point clouds. 2D projection methods convert the 3D point cloud into a 2D representation by projection or mapping [18]. Because of the 3D correlations within point clouds, the most common way to compress them is to convert them into a 3D representation [17], which is the approach used in this study. There are various coders for this purpose, such as the octree-based coder [19], the hierarchical clustering coder [20], and the context-based intracoder [21]. The octree-based coder is adopted by this work because its lossless encoding is suitable for fusion.

2.1.4. Reconstruction

The received data cannot be processed straightforwardly because it is generated at different positions and angles. Thus, the edges need to reconstruct the exchanged point cloud data into their UTM (Universal Transverse Mercator) coordinate system. In order to merge point clouds, cooperative perception messages need GPS and IMU (Inertial Measurement Unit) readings as additional information to calculate the offset, expressed as Euler angles, unit quaternions, or rotation vectors [22]. The Euler angles represent a rotation by the yaw, pitch, and roll angles $\psi$, $\theta$, and $\phi$ about the $z$-, $y$-, and $x$-axes, respectively; they can be calculated from the difference between the IMU values of the transmitter and receiver. A unit quaternion can be represented as

$$q = q_w + q_x i + q_y j + q_z k, \qquad q_w^2 + q_x^2 + q_y^2 + q_z^2 = 1.$$

Given the Euler angles $\psi$, $\theta$, and $\phi$, the quaternion can be calculated as

$$
\begin{aligned}
q_w &= \cos\tfrac{\phi}{2}\cos\tfrac{\theta}{2}\cos\tfrac{\psi}{2} + \sin\tfrac{\phi}{2}\sin\tfrac{\theta}{2}\sin\tfrac{\psi}{2},\\
q_x &= \sin\tfrac{\phi}{2}\cos\tfrac{\theta}{2}\cos\tfrac{\psi}{2} - \cos\tfrac{\phi}{2}\sin\tfrac{\theta}{2}\sin\tfrac{\psi}{2},\\
q_y &= \cos\tfrac{\phi}{2}\sin\tfrac{\theta}{2}\cos\tfrac{\psi}{2} + \sin\tfrac{\phi}{2}\cos\tfrac{\theta}{2}\sin\tfrac{\psi}{2},\\
q_z &= \cos\tfrac{\phi}{2}\cos\tfrac{\theta}{2}\sin\tfrac{\psi}{2} - \sin\tfrac{\phi}{2}\sin\tfrac{\theta}{2}\cos\tfrac{\psi}{2}.
\end{aligned}
$$

Subsequently, the rotation matrix can be written as

$$
R = \begin{pmatrix}
1-2(q_y^2+q_z^2) & 2(q_x q_y - q_w q_z) & 2(q_x q_z + q_w q_y)\\
2(q_x q_y + q_w q_z) & 1-2(q_x^2+q_z^2) & 2(q_y q_z - q_w q_x)\\
2(q_x q_z - q_w q_y) & 2(q_y q_z + q_w q_x) & 1-2(q_x^2+q_y^2)
\end{pmatrix}.
$$

Finally, the coordinates of points from the different sensors can be transformed as

$$P_{\mathrm{UTM}} = R\,P_{\mathrm{sensor}} + T,$$

where $P_{\mathrm{sensor}}$ denotes the original coordinates in the sensor's own frame, $P_{\mathrm{UTM}}$ denotes the coordinates transformed into the UTM system, and $T$ is the translation offset obtained from GPS.
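As an illustrative sketch of this reconstruction step (assuming the Euler-angle offsets come from the IMU difference and the translation from GPS; SciPy's rotation utilities are used here instead of the explicit quaternion formulas, and the function and variable names are hypothetical):

import numpy as np
from scipy.spatial.transform import Rotation

def to_utm(points_xyz, yaw, pitch, roll, translation):
    """Transform an (N, 3) point cloud from the sender's sensor frame into the
    receiver's UTM-aligned frame.

    yaw, pitch, roll: Euler-angle offsets (radians) obtained from the IMU
    difference between transmitter and receiver.
    translation: (3,) offset between the two sensor origins, obtained from GPS.
    """
    # Intrinsic z-y-x rotation, equivalent to the quaternion/rotation-matrix
    # expressions given above.
    rot = Rotation.from_euler("ZYX", [yaw, pitch, roll])
    return points_xyz @ rot.as_matrix().T + np.asarray(translation)

# Example: rotate a point by 90 degrees of yaw and shift it by (10, 5, 0).
p_utm = to_utm(np.array([[1.0, 0.0, 0.0]]), np.pi / 2, 0.0, 0.0, [10.0, 5.0, 0.0])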

2.2. 3D Object Detection in Point Clouds

A 3D object detection algorithm can be divided into three parts: data representation, feature extraction (i.e., the backbone), and the detection network (detection head). Data representation organizes the point cloud generated by the LiDAR into a structure to which convolution operations can be applied. The main approaches are voxel-based, point-based, frustum-based, pillar-based, and 2D projection-based [23]. A voxel (volume pixel) divides the point cloud into 3D grids at a specific resolution in a 3D Cartesian coordinate system; many voxel-based methods were proposed in earlier studies, such as VoxelNet [24], SECOND [25], and Voxel-FPN [26]. Their advantage is that neural network operations such as convolution can easily be applied to 3D grids; the disadvantage is low efficiency due to the large amount of discrete computation and encoding. Point-based methods process the raw data into a sparse representation, aggregate the features of adjacent points, and extract a feature for each point [23]; however, they place more stringent requirements on the hardware than voxel-based methods. Pillar-based methods organize the point cloud on the x-y plane: since the point cloud is three-dimensional, the points are grouped into vertical columns called pillars, as in PointPillars [27]. Owing to the exclusion of binning along the z-axis and the use of a fixed encoder, the pillar-based method offers a significant processing-rate improvement over other algorithms while achieving top efficiency and performance on recent 3D object detection benchmarks. Therefore, this study chooses the pillar-based method as the data representation for its speed and accuracy. Convolutional neural network (CNN) architectures [28] used for feature extraction can be seen as the state of the art for both images and point clouds. The detection head used in this work, a layer that compensates for the localization capability of the CNN, is the Single Shot Detector proposed in [29].

2.3. Current Cooperative Perception Schemes Proposed

The benefits of cooperative perception have been studied in recent years [30-33]. Motivated by its potential, much research has focused on several areas, including communication protocols [34], generation rules for CPMs [1], and algorithm design. CarSpeak [34] proposes a content-centric communication protocol built around the needs of cooperative perception; it focuses on sensor information sharing at the medium access control (MAC) layer and is fully integrated with autonomous driving. Thandavarayan et al. [1, 3] analyzed the performance of cooperative perception messages and offered generation rules defining which information should be included. In algorithm design, Chen et al. [4, 5] and Guo et al. [35] proposed a series of feature map fusion methods on CAVs using vehicle-to-vehicle (V2V) communication, and Aoki et al. [32] proposed a scheme that uses deep reinforcement learning (DRL) to select the data to transmit. Infrastructure-sensor-supported cooperative perception has also been proposed in [36, 37]. However, most of these studies rely only on sensor data from assisting CAVs and cannot operate independently on local data. Moreover, 3D object detection is a computation-intensive task for CAVs and is limited by battery and computing capacity, so some works focus on dynamic task offloading for mobile edge computing [38, 39]; collaborative task offloading between vehicles and the cloud is challenged by the growing scale of the offloading problem and of its solution space. These studies are meaningful and significant but beyond the scope of this paper. Distinguished from the previous works, this study focuses on an algorithm on the edges that enhances the detection of missed objects.

3. Pillar-Based Cooperative Perception from Point Clouds

Inspired by the advantages of a fixed encoder such as PointPillars [27] and of Octomap compression [40], this study proposes a pillar-based cooperative perception framework built from point clouds on CAVs and edges. Two schemes are presented: Cooperative Perception on Vehicles (CPV) and Cooperative Perception on Edges (CPE). This section presents the details and architecture of the two schemes.

3.1. Cooperative Perception on Vehicles

The proposed architecture of CPV, depicted in Figure 2, consists of three components that are executed sequentially: Octomap compression, pillar generation, and object fusion. The latter two start to execute when the vehicle receives the corresponding messages from the edge. It should be noted that the 3D object detection algorithm in this work is PointPillars [27]; however, this does not mean that CPV relies only on PointPillars, as the detection algorithm can be replaced by others. At first, the vehicle compresses the downsampled 3D LiDAR point cloud, in which each point carries its 3D position and a reflection value, into an Octomap [40]. The Octomap is a tree structure that uses probabilistic occupancy estimates and supports lossless compression. Specifically, the CAV converts the range of the point cloud into a cubic space of a given side length, which is divided recursively into eight subvoxels until the minimum voxel size is reached. A leaf node in the Octomap stores only its occupancy probability, while an inner node stores eight pointers to its children (or null pointers where a child does not exist), as shown in Figure 3. The occupancy probability of an inner node $n$ is the maximum occupancy of its eight subvoxels:

$$P(n) = \max_{i=1,\dots,8} P(n_i).$$
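A minimal sketch of this octree construction is given below; it uses a plain dictionary in place of the actual Octomap library, and the side length, resolution, and origin are illustrative values rather than those used in the paper:

import numpy as np

def build_octree(points, side_len=80.0, min_voxel=0.2, origin=(-40.0, -40.0, -3.0)):
    """Simplified Octomap-style encoding: a cube of side `side_len` is split
    recursively into eight subvoxels down to `min_voxel`, and only occupied
    leaves are kept. Returns {(depth, ix, iy, iz): occupancy_probability}.
    A real Octomap additionally merges eight identical children into their
    parent, which is what yields the lossless compression."""
    depth = int(np.ceil(np.log2(side_len / min_voxel)))   # depth of the finest level
    res = side_len / (2 ** depth)                          # actual leaf size
    idx = np.floor((points[:, :3] - np.asarray(origin)) / res).astype(int)
    nodes = {(depth, *i): 1.0 for i in map(tuple, idx)}    # occupied leaf voxels

    # Bottom-up pass: an inner node stores the maximum occupancy of its children,
    # matching the rule P(n) = max_i P(n_i) above.
    for d in range(depth, 0, -1):
        for (nd, ix, iy, iz), p in list(nodes.items()):
            if nd == d:
                parent = (d - 1, ix // 2, iy // 2, iz // 2)
                nodes[parent] = max(nodes.get(parent, 0.0), p)
    return nodes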

After that, the CAV transmits the Octomap data to the edge. The CPV then executes the next component once it has received the cuboid regions from the edge (how the cuboid regions are generated on the edge is detailed in Section 3.2). The CAV discretizes the points into pillars, which impose no binning along the $z$-dimension, based on the cuboid regions. Each point in a pillar is represented with nine dimensions,

$$(x, y, z, r, x_c, y_c, z_c, x_p, y_p),$$

where $(x, y, z, r)$ are the point's coordinates and reflectance, the subscript $c$ denotes the offset to the arithmetic mean of all points in the pillar, and the subscript $p$ denotes the offset from the pillar center in the $x$-$y$ plane. If there are too many points in a pillar, the scheme randomly discards points; conversely, pillars with too few points are padded with zeros. Once the pillars have been generated, the CAV transmits the pillar features to the edge together with their locations.
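An illustrative sketch of this pillar generation step is shown below (the pillar size, the per-pillar point cap, and the variable names are hypothetical; only the nine-dimensional decoration follows the description above):

import numpy as np

def make_pillars(points, pillar_size=0.16, max_points=100):
    """Discretize (N, 4) points [x, y, z, reflectance] into pillars on the x-y
    plane and decorate each point with the 9-dim feature described above.
    Returns {pillar_index: (max_points, 9) array}; parameter values are
    illustrative."""
    pillars = {}
    ij = np.floor(points[:, :2] / pillar_size).astype(int)
    for p, key in zip(points, map(tuple, ij)):
        pillars.setdefault(key, []).append(p)

    out = {}
    for key, pts in pillars.items():
        pts = np.asarray(pts)
        if len(pts) > max_points:                    # too many points: random subsample
            pts = pts[np.random.choice(len(pts), max_points, replace=False)]
        mean_xyz = pts[:, :3].mean(axis=0)           # arithmetic mean of the pillar
        center_xy = (np.asarray(key) + 0.5) * pillar_size
        feat = np.hstack([pts,                       # x, y, z, r
                          pts[:, :3] - mean_xyz,     # x_c, y_c, z_c: offset to the mean
                          pts[:, :2] - center_xy])   # x_p, y_p: offset to the pillar center
        padded = np.zeros((max_points, 9))           # too few points: zero padding
        padded[:len(feat)] = feat
        out[key] = padded
    return out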

Finally, the CAV receives the object list from the edge and executes the fusion component to obtain the final 3D bounding boxes of the objects around it. The details of the fusion algorithm are shown in Algorithm 1.

Data: Object list O_e from the cloud and object list O_v from the vehicle
Result: Fused object list O_final
1 List O_final, group list G: initialization; IoU threshold σ; detection range R
 /* Remove the objects which are out of the range of the vehicle */
2 while index i in list O_e do
3  if distance(O_e[i], ego position) > R
   then
4   remove O_e[i] from O_e
 /* Store the candidate boxes of the same object in one group of G */
5 while index j in list O_e ∪ O_v do
  // σ is the threshold of IoU
6  if IOU(O_j, G_k) > σ for some existing group G_k then
7   add O_j to G_k
8  else
9   insert a new group {O_j} into G
10   continue
 /* Calculate the final bounding box and location of each object in list G */
11 while group G_k in list G do
  // Weighted average
12  B_k ← weighted average of the bounding boxes in G_k
13  append B_k to O_final
14 RETURN O_final

In the algorithm, the vehicle first initializes the parameters, namely the group list G of same-object candidates, the IoU (Intersection over Union) threshold σ, and the detection range R. The algorithm then removes the objects that lie outside the range R. Secondly, it collects candidate 3D bounding boxes belonging to the same physical object and stores them in one group of the list G, so that G contains multiple groups. Finally, the algorithm outputs the object list obtained by taking a weighted average within each group. Discarding the out-of-range bounding boxes, which are of no value to the vehicle, effectively reduces the size of the detection list the vehicle must maintain.
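For illustration, a minimal Python sketch of Algorithm 1 is given below; it assumes axis-aligned bird's-eye-view boxes for the IoU test and a confidence-weighted average within each group, and the range and threshold values are placeholders rather than the paper's settings:

import numpy as np

def bev_iou(a, b):
    """IoU of two axis-aligned bird's-eye-view boxes [x_min, y_min, x_max, y_max]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse_object_lists(edge_objs, local_objs, ego_xy, det_range=50.0, iou_thr=0.5):
    """Merge the object list from the edge with the local list.
    Each object is (bev_box[4], confidence). Values are illustrative."""
    # 1) Drop edge objects outside the vehicle's range of interest.
    keep = [o for o in edge_objs
            if np.linalg.norm(0.5 * (np.asarray(o[0][:2]) + np.asarray(o[0][2:]))
                              - np.asarray(ego_xy)) <= det_range]
    # 2) Group boxes that overlap, i.e., candidates for the same physical object.
    groups = []
    for obj in keep + list(local_objs):
        for g in groups:
            if bev_iou(obj[0], g[0][0]) > iou_thr:
                g.append(obj)
                break
        else:
            groups.append([obj])
    # 3) Confidence-weighted average of each group's boxes.
    fused = []
    for g in groups:
        boxes = np.array([o[0] for o in g])
        w = np.array([o[1] for o in g])
        fused.append((boxes.T @ w / w.sum(), float(w.max())))
    return fused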

3.2. Cooperative Perception on Edges

The CPE consists of four main components, executed sequentially: Octomap fusion, cuboid region extraction, pillar fusion, and object detection. As in the CPV, the 3D object detection algorithm in this work is based on [27]. The Octomap fusion and pillar fusion components are executed once the edge has received the corresponding messages from the vehicles.

3.2.1. Octomap Fusion

The CPE executes the Octomap fusion component, shown in Figure 4, when the edge receives messages from the CAVs containing their Octomaps and vehicle locations. First, all Octomaps are transformed into the UTM coordinate system using the data reconstruction described in Section 2.1.4. Because of the hierarchical tree structure, four conditions arise when merging a leaf node $n_i$ of a received Octomap $M_i$ into the final map $M_f$, as shown in Figure 5: (1) $n_i$ does not yet exist in $M_f$; (2) a node at the same depth exists in $M_f$; (3) $n_i$ is represented in $M_f$ only by a coarser-resolution node; and (4) $n_i$ is coarser than the corresponding nodes in $M_f$. In condition (1), the node is added directly to $M_f$:

$$L_f(n_i) = L_i(n_i),$$

where $L_f(\cdot)$ and $L_i(\cdot)$ denote the logarithmic occupancy probabilities (log-odds) of a leaf node in $M_f$ and in $M_i$, respectively. In condition (2), when the corresponding node already exists in $M_f$ at the same depth, the probability is updated by log-odds addition:

$$L_f(n_i) \leftarrow L_f(n_i) + L_i(n_i).$$

In condition (3), if $n_i$ is represented in $M_f$ by a coarser-resolution node $n_f$, then $n_f$ is divided into eight child nodes that all inherit its occupancy probability, and the child corresponding to $n_i$ is updated:

$$L_f(n_i) \leftarrow L_f(n_f) + L_i(n_i).$$

Similarly, in condition (4), when $n_i$ is coarser than the nodes $n_f^{(k)}$ of $M_f$ that it covers, each of those nodes is updated:

$$L_f(n_f^{(k)}) \leftarrow L_f(n_f^{(k)}) + L_i(n_i).$$

The final Octomap $M_f$ can be converted back into a point cloud once all the received Octomaps $M_i$ have been merged into it. The details of the Octomap fusion algorithm on the edge are given in Algorithm 2.

Data: Octomaps M_1, ..., M_K from the vehicles
Result: Fused Octomap M_f
1 M_f ← ∅, i ← 0: initialization;
2 while i < K do
3  i = i + 1
4  M_i ← reconstruction of M_i into the UTM frame (Section 2.1.4)
  // Octomap Fusion(M_f, M_i)
5  for each leaf node n_i in M_i do
6   if n_i is not in M_f then
7    Add node: L_f(n_i) = L_i(n_i)
8   else
9    d_i ← the depth of n_i
10    for each leaf node n_f in M_f covering n_i do
11     d_f ← the depth of n_f
12     if d_i == d_f then
13      L_f(n_i) ← L_f(n_i) + L_i(n_i)
14     else if d_i > d_f && n_f is a coarser node
      then
15      Add child nodes to n_f:
        L_f(n_i) ← L_f(n_f) + L_i(n_i);
16      continue
17     else if d_i < d_f && n_i is a coarser node
      then
18      Add child nodes to n_i:
        L_f(n_f) ← L_f(n_f) + L_i(n_i)
19      continue
20 RETURN M_f
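A compact Python sketch of this merge is shown below, reusing the dictionary-based octree of the earlier sketch but with log-odds values; condition (4) is omitted for brevity, and the splitting logic is a simplification of what the Octomap library actually does:

def merge_octomaps(m_final, m_in):
    """Merge octree m_in into m_final. Both are dicts
    {(depth, ix, iy, iz): log_odds_occupancy}. Conditions (1)-(3) of the text
    are covered; condition (4), where the incoming node is coarser, is
    symmetric and omitted here."""
    def split(m, key):
        # Replace a node by its eight children, each inheriting its log-odds.
        d, ix, iy, iz = key
        lo = m.pop(key)
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    m[(d + 1, 2 * ix + dx, 2 * iy + dy, 2 * iz + dz)] = lo

    def coarse_ancestor(m, key):
        # Return the coarser node of m that covers `key`, if any.
        d, ix, iy, iz = key
        for dd in range(1, d + 1):
            anc = (d - dd, ix >> dd, iy >> dd, iz >> dd)
            if anc in m:
                return anc
        return None

    for key, lo in m_in.items():
        if key in m_final:                        # condition (2): same resolution
            m_final[key] += lo                    # log-odds addition
            continue
        anc = coarse_ancestor(m_final, key)
        if anc is None:                           # condition (1): unknown in m_final
            m_final[key] = lo
        else:                                     # condition (3): m_final is coarser
            while anc is not None and anc[0] < key[0]:
                split(m_final, anc)               # refine m_final down to key's depth
                anc = coarse_ancestor(m_final, key)
            m_final[key] = m_final.get(key, 0.0) + lo
    return m_final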
3.2.2. Cuboid Region Extraction

The cuboid region extraction is executed after the edge obtains the fused point cloud. The points are classified into clusters by Euclidean clustering, and each cluster is converted into a coarse-grained 3D bounding box. First, points outside the ranges of interest along the $x$-, $y$-, and $z$-axes are removed. The ground plane is then fitted from the surviving points; with the ground plane removed, the remaining unrecognized obstacles can be clustered by Euclidean distance. Finally, a bounding box is proposed for each cluster. The cuboid region extraction is an adapted version of the traditional and efficient point cloud segmentation algorithm implemented in PCL [41]; its flowchart is shown in Figure 6.
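An alternative sketch of this pipeline using Open3D instead of PCL is given below; the RoI, the RANSAC threshold, and the clustering parameters are illustrative, not the values used in the paper:

import numpy as np
import open3d as o3d

def extract_cuboid_regions(points_xyz, roi=((-40, 40), (-40, 40), (-3, 1))):
    """Crop to the RoI, remove the ground plane, cluster the remaining points,
    and return one axis-aligned bounding box per cluster."""
    # 1) Remove points outside the region of interest.
    (x0, x1), (y0, y1), (z0, z1) = roi
    m = ((points_xyz[:, 0] > x0) & (points_xyz[:, 0] < x1) &
         (points_xyz[:, 1] > y0) & (points_xyz[:, 1] < y1) &
         (points_xyz[:, 2] > z0) & (points_xyz[:, 2] < z1))
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points_xyz[m]))

    # 2) Fit and remove the ground plane (RANSAC).
    _, inliers = pcd.segment_plane(distance_threshold=0.2, ransac_n=3,
                                   num_iterations=100)
    obstacles = pcd.select_by_index(inliers, invert=True)

    # 3) Euclidean-style clustering of the remaining obstacle points.
    labels = np.array(obstacles.cluster_dbscan(eps=0.8, min_points=10))

    # 4) One coarse bounding box (cuboid region) per cluster.
    boxes = []
    pts = np.asarray(obstacles.points)
    for lbl in set(labels) - {-1}:
        cluster = o3d.geometry.PointCloud(
            o3d.utility.Vector3dVector(pts[labels == lbl]))
        boxes.append(cluster.get_axis_aligned_bounding_box())
    return boxes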

3.2.3. Pillar Fusion

After the vehicles transmit the pillar data, the CPE executes the pillar fusion component on the edge, as shown in Figure 4. For the fixed backbone architecture, this study first converts the point clouds into pseudoimages. The received pillars are fused, based on their locations, into the different cuboid regions (which can also be viewed as subtasks). Within a subtask, some pillars lie in the overlapping area of several vehicles, as shown in Figure 4. The fusion rule of this study keeps the pillars in the nonoverlapping areas of a subtask as they are and, in the overlapping areas, randomly samples one of the pillars contributed by the different vehicles.
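The following sketch illustrates this fusion rule; the dictionary keys and array shapes follow the hypothetical pillar-generation sketch given earlier:

import random

def fuse_pillars(pillars_by_vehicle):
    """Fuse pillar dictionaries {pillar_index: (max_points, 9) array} received
    from several vehicles into one subtask. Where only one vehicle covers a
    cell, its pillar is kept; where cells overlap, one contributing pillar is
    sampled at random, as in the fusion rule described above."""
    fused = {}
    for pillars in pillars_by_vehicle:
        for key, feat in pillars.items():
            fused.setdefault(key, []).append(feat)
    return {key: candidates[0] if len(candidates) == 1 else random.choice(candidates)
            for key, candidates in fused.items()}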

3.2.4. Object Detection

The fused pillars are fed through a linear layer followed by BatchNorm [42], ReLU [43], and max pooling [44] to generate the pillar features, which are then reshaped into a pseudoimage tensor. The backbone and detection head are similar to those of PointPillars [27].
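A minimal PyTorch sketch of this pillar-feature stage is shown below; the channel sizes and tensor shapes are illustrative, and the backbone and detection head that follow (those of PointPillars) are not reproduced here:

import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Linear -> BatchNorm -> ReLU -> max over points, producing one feature
    vector per pillar that is then scattered into a pseudoimage."""
    def __init__(self, in_dim=9, out_dim=64):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, pillars):                            # (P, N, 9): P pillars, N points
        x = self.linear(pillars)                           # (P, N, C)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)     # BatchNorm over channels
        x = torch.relu(x)
        return x.max(dim=1).values                         # (P, C): max pooling over points

# Illustrative usage: 1000 pillars with up to 100 points each.
feats = PillarFeatureNet()(torch.randn(1000, 100, 9))      # -> (1000, 64)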

The loss function is based on ground truth boxes and anchors defined by $(x, y, z, w, l, h, \theta)$. The total loss is

$$\mathcal{L} = \beta_{\mathrm{loc}}\mathcal{L}_{\mathrm{loc}} + \beta_{\theta}\mathcal{L}_{\theta} + \beta_{\mathrm{cls}}\mathcal{L}_{\mathrm{cls}},$$

with weighting coefficients $\beta_{\mathrm{loc}}$, $\beta_{\theta}$, and $\beta_{\mathrm{cls}}$. $\mathcal{L}_{\mathrm{loc}}$ is the regression loss for $(x, y, z, w, l, h)$, computed with the Smooth L1-loss [45]:

$$\mathcal{L}_{\mathrm{loc}} = \sum_{b \in (x, y, z, w, l, h)} \mathrm{SmoothL1}(\Delta b),$$

in which the residuals are defined relative to the anchors following [27], e.g., $\Delta x = (x^{gt} - x^{a})/d^{a}$ and $\Delta w = \log(w^{gt}/w^{a})$, where $d^{a} = \sqrt{(w^{a})^2 + (l^{a})^2}$ is the diagonal of the anchor.

The Smooth L1-loss is less sensitive to outliers than other regression losses; it can be interpreted as a combination of the L1- and L2-losses, behaving like the L1-loss when the absolute value of the argument is large and like the L2-loss when it is close to zero. $\mathcal{L}_{\theta}$ is the angle regression loss,

$$\mathcal{L}_{\theta} = \mathcal{L}_{\theta}^{\mathrm{cls}} + \mathcal{L}_{\theta}^{\mathrm{res}},$$

where the superscripts cls and res represent the angle class and residual, respectively: $\mathcal{L}_{\theta}^{\mathrm{cls}}$ is the orientation classification loss and $\mathcal{L}_{\theta}^{\mathrm{res}}$ is the residual prediction loss.

For the detection classification loss $\mathcal{L}_{\mathrm{cls}}$, this study uses the focal loss

$$\mathcal{L}_{\mathrm{cls}} = -\alpha_a (1 - p^{a})^{\gamma} \log p^{a},$$

with class-balancing weight $\alpha_a$ and focusing parameter $\gamma$, where $p^{a}$ is the class probability of an anchor. In addition, this study uses a softmax loss for the orientation classification.
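An illustrative PyTorch sketch of how these loss terms combine is given below; the weighting coefficients and the focal-loss defaults are placeholders (the values commonly used with these losses), not the values adopted in the paper:

import torch
import torch.nn.functional as F

def focal_loss(pred_logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for the classification branch; `target` is a 0/1
    float tensor with the same shape as `pred_logits`."""
    p = torch.sigmoid(pred_logits)
    pt = p * target + (1 - p) * (1 - target)               # prob. of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (-alpha_t * (1 - pt) ** gamma * torch.log(pt + 1e-9)).mean()

def total_loss(loc_pred, loc_gt, cls_logits, cls_target, dir_logits, dir_target,
               beta_loc=2.0, beta_cls=1.0, beta_dir=0.2):  # placeholder weights
    """Combine the box regression, classification, and heading losses."""
    loss_loc = F.smooth_l1_loss(loc_pred, loc_gt)           # Smooth L1 box regression
    loss_cls = focal_loss(cls_logits, cls_target)           # focal classification loss
    loss_dir = F.cross_entropy(dir_logits, dir_target)      # softmax heading loss
    return beta_loc * loss_loc + beta_cls * loss_cls + beta_dir * loss_dir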

4. Simulations

This section presents simulation results comparing the performance of the proposed framework with a baseline and with F-Cooper [5] under the same datasets and scenarios. To ensure identical settings, the experiments are carried out on our own dataset for cooperative 3D object detection, consisting of scenarios simulated in Gazebo [46]. All data are collected from vehicles equipped with a 64-beam LiDAR, one GPS, and one IMU; the detailed parameters are listed in Table 1. The LiDAR parameter settings are representative of most products on the market. The "Near" and "Far" vehicle distances are defined as in [5], and the remaining parameters follow [27]. In the experiments, the proposed framework and F-Cooper run on a computing device with an NVIDIA RTX 3090 GPU.

4.1. Detection Precision Analysis

Table 2 shows the performance of the proposed algorithm against the baseline and F-Cooper [5], counting detected vehicles with a confidence score threshold of 0.5. Three scenarios are evaluated: a multilane road test, road intersection test 1, and road intersection test 2. The two intersection tests differ in the number of vehicles: the former, with more vehicles, represents a traffic jam, while the latter represents an uncongested intersection. Vehicle 1 is chosen as the ego vehicle so that the "Near" and "Far" results can be compared conveniently.

In the multilane road scenario, vehicle 1 achieves good "Near" detection accuracy but has a weak view of "Far" vehicles, with only 22.22% precision. Vehicle 2 can detect only 16.67% of nearby objects and cannot sense any vehicles at the "Far" distance due to severe occlusion. F-Cooper and the proposed algorithm achieve a detection precision similar to the baseline at the "Near" distance, but both significantly improve the precision on "Far" vehicles, reaching 42.78% and 37.5%, respectively, compared with 22.22% in the baseline test. This means that drivers gain a larger reaction window thanks to the more accurate perception of distant vehicles.

In the road intersection 1 case, the CAVs are not affected by occlusion, so detection without fusion already performs well at the "Near" and "Far" distances, e.g., 75% and 42.86% on vehicle 1, respectively. The fusion methods perform better still, with 85.71% by the proposed approach and 80.21% by F-Cooper at the near distance, and 50% and 46.42%, respectively, at the far distance.

Road intersection 2 is a special case because the vehicles in this scenario are far away from each other; therefore, it mainly serves to show the performance at the far distance. The detection precision is 25% on vehicle 1 and 40% on vehicle 2, while F-Cooper and the method proposed in this study achieve 36.41% and 48.81%, respectively. This means that our method improves the fusion precision by approximately 10%.

4.2. Data Volume Evaluation

The data sizes generated in the different scenarios are approximately equal, so this study only reports the data from road intersection 1, in which vehicles are detected both near and far. In Figure 7, the bars depict the amount of data that needs to be transmitted and processed, compared with the raw point cloud and the original pillars.

The raw data gathered directly from the 64-beam LiDAR region of interest (RoI) is about 4 MB. Similarly, the original data volume of the pillars ranges from 3.4 MB to 3.7 MB because of the many null points. After processing and compression, the data size of the different strategies falls sharply to less than a quarter of the raw data volume. The average voxel feature map used by F-Cooper is about 1.25 MB, which satisfies the transmission constraint of 6G communication (about 125 KB per millisecond in this experimental setting); however, the processing time of the feature map is too long for real-time operation before transmission, as discussed in the following subsection. Point clouds compressed into an Octomap are significantly smaller, approximately 124.6 KB, and the data volume of the pillars transmitted by the CAVs is about 238.29 KB. Both the Octomap and the transmitted pillars can be processed within time slots that meet the real-time requirement (details are given later).

It should be noted that Octomap compression is necessary for transmission, as evidenced by the difference in data volume between the original pillars and the transmitted pillars. The dominant factors affecting detection precision in PointPillars [27] are the width and length of a pillar: a smaller pillar size yields more accurate object detection results. While preserving the pillar granularity, Octomap compression significantly reduces the data size before the pillars are transmitted.

4.3. Time Consumption Evaluation

As shown in Figure 8, the time used by each component of the proposed architecture is less than 15 ms, and the time consumption of the whole framework is approximately 39 ms, which satisfies the 30 Hz requirement. The transmission of the Octomap and of the pillars consumes the most transmission time, because their data size is much larger than that of the object lists used in the other components; even so, the time spent is still less than 2 ms, so the transmission time is almost negligible compared with the processing time. The total time reported for the fusion components on the edge includes the corresponding preprocessing on the vehicles before the algorithms in this work are run. The primary processing time is spent on the point cloud compression and fusion and on the 3D object detection; therefore, future work on improving cooperative detection should focus mainly on raw data compression technology and the 3D object perception algorithm.

In addition, Figure 9 compares the time consumption of the different strategies mentioned above. Even when communicating with edges/vehicles over 6G, the raw data exchange in Cooper [4] and the feature map sharing in F-Cooper spend 32.76 ms and 10.24 ms on transmission, respectively; this does not meet the low-latency requirement of object detection. Moreover, both Cooper and F-Cooper are based on VoxelNet, a 3D CNN backbone, and their processing time is approximately 237 ms, almost 15 times longer than that of the algorithm based on PointPillars.

5. Conclusions

In this study, a novel pillar-based perception framework has been proposed for cooperative autonomous vehicles and edges. The simulation results and analysis show that the proposed algorithm runs at 30 Hz and achieves higher detection accuracy for "Far" vehicles than previous schemes. It can be deployed on both CAVs and roadside edges to improve perception in real-world scenarios.

In the future, the authors will focus on additional data exchange types for CAVs to achieve higher detection accuracy under low-delay constraints. We will also investigate object detection algorithms with higher frame rates, such as 60 Hz, without loss of detection precision.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the Jilin Province Science and Technology Project (No. 20200501012GX).