Abstract

Moving object classification is essential for autonomous vehicles to complete high-level tasks such as scene understanding and motion planning. In this paper, we propose a novel approach for classifying moving objects into four classes of interest using 3D point clouds in urban traffic environments. Unlike most existing work on object recognition, which relies on dense point clouds, our approach combines extensive feature extraction with multiframe classification optimization to solve the classification task when partial occlusion occurs. First, the point cloud of each moving object is segmented by a data preprocessing procedure. Then, effective features are selected from an extended feature set via the Gini index criterion. Next, Bayes Decision Theory (BDT) is employed to incorporate the preliminary results of a posterior probability Support Vector Machine (SVM) classifier at consecutive frames. Point cloud data acquired from our own LIDAR as well as the public KITTI dataset are used to validate the proposed moving object classification method in the experiments. The results show that the proposed SVM-BDT classifier based on 18 selected features can effectively recognize the moving objects.

1. Introduction

Autonomous driving has become an increasingly popular domain in intelligent transportation systems [1, 2]. Moving object classification is a critical step toward reliable planning of driving trajectories for autonomous vehicles in dynamic environments, and prior knowledge of the category attribute helps to build an appropriate dynamic model for each moving object [3–5]. The most commonly used sensors for object recognition are cameras and LIDAR. Compared with cameras, LIDAR obtains accurate 3D measurements and is robust to weather and illumination changes. Extensive research has been devoted to object recognition using LIDAR. Conventional techniques can be coarsely divided into two categories.

The first category of methods determines the object semantics by calculating the similarity between the scanned object and a predefined template. Simple geometric or motion models are constructed to classify rigid objects, but such models have difficulty recognizing pedestrians. Fang and Duan [6] employ an iterative endpoint fitting algorithm to fit the segmented point cloud and calculate the number and size of line segments to determine whether the object is a vehicle. Petrovskaya and Thrun [7] combine a rectangular model of the point cloud in a 2D occupancy grid map with a motion model established by a Rao-Blackwellized particle filter to improve vehicle classification accuracy when partial occlusion temporarily occurs.

The second category of methods mainly focuses on effective feature descriptors of the objects of interest as well as training specific classifiers [8, 9]. For vehicle recognition, Yang and Dong [10] calculated geometric features based on the optimal neighborhood size of each point and classified the segments using SVMs. Lee and Coifman [11] checked the shape feature of each point cloud cluster and classified vehicles into six classes. For pedestrian recognition, Kim et al. [12] used an SVM classifier with 31 layer-based features. Arras et al. [13] defined 14 static features, including roundness and compactness, to train a pedestrian classifier based on the point cloud of the legs. Although this method achieves good recognition results in indoor environments, it is not suitable for outdoor scenes. For multiclass object recognition, Azim and Aycard [14] used simple ratio characteristics of the 3D bounding box of the point cloud, such as the width-height and length-height ratios, to recognize vehicles and pedestrians. However, frequent occlusion in real traffic environments leads to false size ratios, so the recognition performance is poor. Wang et al. [15] proposed a 120-dimensional feature set, including rotating images, shape factors, point normal vectors, and Euclidean distances, and employed an SVM classifier to recognize multiple classes of moving objects. Teichman et al. [16] constructed a boosting classifier to recognize moving vehicles, pedestrians, and bicycles by integrating geometric and motion features. Moreover, occupancy-grid-based methods [17, 18] have been presented to detect moving objects efficiently, but they only estimate the kinematic state of the object without classifying its category.

The conclusions drawn from the abovementioned literature can be summarized as follows. First, existing LIDAR-based moving object classification methods are designed for the relatively dense point clouds returned from the scanned object. Second, temporary or partial occlusion at consecutive frames is seldom considered in the majority of object classification schemes, and the effectiveness of the extracted features has not been analyzed from the perspective of the object category. Furthermore, most of the aforementioned classification methods recognize only the common moving objects, including vehicles, pedestrians, and bicycles. In real traffic scenarios, pedestrians often appear as independent individuals or as a small crowd. When two pedestrians are too close, the returned point cloud is hard to segment cleanly, which makes it difficult to identify each pedestrian individually. As shown in Figure 1, the pedestrians marked as A and B are wrongly segmented as a whole. It is very common for two or three people to walk together, and such a crowd may be regarded as another class of moving object due to point cloud under-segmentation. Motivated by the abovementioned analysis, we propose a LIDAR-based classification method for four categories of moving objects, namely, vehicle, pedestrian, bicycle, and crowd. A Velodyne HDL-64E LIDAR is adopted to collect the 3D point cloud of the surrounding environment. Our method classifies moving objects from the raw point cloud as follows. First, the points measured on moving objects are segmented from the rest of the 3D point cloud; this process consists of ground segmentation, clustering of nonground points, and moving object detection. Second, both global and layer-based features are extracted to describe the geometric characteristics, and the Gini index criterion is utilized to select the effective features based on the category attributes of the training samples. Next, a posterior probability SVM classifier is employed to obtain a classification result at each frame, and the BDT algorithm is further used to optimize the classification result of the tracked object over consecutive frames. Finally, the proposed SVM-BDT-based classification method is validated using the point cloud dataset collected by our own LIDAR as well as the public KITTI dataset.

The contributions of this paper are two-fold. First, we describe a novel approach for classifying moving objects into vehicle, pedestrian, bicycle, and crowd, since the point cloud segment of a crowd can be confused with other object types and thus reduce the accuracy of object recognition. This approach makes progress toward the application of moving object classification for autonomous vehicles in real traffic environments. Second, we adapt the idea of an SVM-BDT classifier to incorporate multiframe classification results based on the effective features, transforming moving object classification into a maximum posterior probability estimation problem.

The remainder of this paper is organized as follows. Section 2 introduces the point cloud preprocessing. Section 3 presents feature extraction. Section 4 describes the classification method. Section 5 demonstrates experimental results. Finally, Section 6 offers conclusions and future works.

2. Point Cloud Preprocessing

The point cloud is characterized by coordinates in the world coordinate system, with the LIDAR position marked as the origin. In this section, ground points are first removed from the 3D raw point cloud using the ground segmentation method in [19], which combines Markov random field models with a loopy belief propagation algorithm. Then, the nonground points are divided into independent clusters. Since the number of point cloud clusters of moving objects in the surrounding environment is unknown and the density of the point cloud varies with range, the mean shift clustering algorithm in [20] is selected. In order to reduce the influence of a fixed bandwidth on the stability of the clustering results, an improved mean shift clustering algorithm based on adaptive bandwidth is proposed as follows:
(1) The nonground points are represented by $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^d$. Given the initial kernel bandwidth $h_0$, the Gaussian kernel function $G$, and the tolerance $\varepsilon$, the initial kernel density estimate is calculated by
$\hat{f}_{h_0}(x) = \frac{1}{n h_0^d} \sum_{i=1}^{n} G\!\left(\frac{x - x_i}{h_0}\right),$
where $d$ is the dimension of the data space.
(2) The adaptive bandwidth is calculated for each point by $h_i = h_0 \left[\lambda / \hat{f}_{h_0}(x_i)\right]^{1/2}$, where $\lambda$ is a proportionality constant calculated as the geometric mean $\lambda = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n} \log \hat{f}_{h_0}(x_i)\right)$.
(3) The initial kernel centroid is marked as $y_0$, and the weighted mean value at $y_j$ is computed using the kernel function $G$ and the weights $w(x_i)$:
$y_{j+1} = \frac{\sum_{i=1}^{n} G\!\left(\frac{y_j - x_i}{h_i}\right) w(x_i)\, x_i}{\sum_{i=1}^{n} G\!\left(\frac{y_j - x_i}{h_i}\right) w(x_i)},$
where $m(y_j) = y_{j+1} - y_j$ is the mean shift vector. Note that $y_{j+1}$ is iteratively calculated until the gradient of the density estimate converges to zero, i.e., $\|m(y_j)\| < \varepsilon$.
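
For concreteness, the adaptive-bandwidth mean shift step above can be sketched in Python as follows. This is a minimal sketch, not the exact implementation: the function name `adaptive_mean_shift`, the uniform point weights, the dropped kernel normalization constant, and the geometric-mean choice of the proportionality constant are illustrative assumptions.

```python
import numpy as np

def adaptive_mean_shift(points, h0=1.0, eps=1e-3, max_iter=100):
    """Illustrative adaptive-bandwidth mean shift (Gaussian kernel).

    points : (n, d) array of nonground points.
    h0     : initial kernel bandwidth.
    eps    : convergence tolerance on the mean shift vector.
    Returns the converged mode for every point; clusters are then formed
    by grouping points whose modes coincide (grouping not shown here).
    """
    n, d = points.shape

    # Pilot kernel density estimate at every point with fixed bandwidth h0.
    diff = (points[:, None, :] - points[None, :, :]) / h0
    f0 = np.exp(-0.5 * np.sum(diff**2, axis=2)).sum(axis=1) / (n * h0**d)

    # Adaptive bandwidth: points in sparse regions get a larger radius.
    lam = np.exp(np.mean(np.log(f0)))           # proportionality constant (geometric mean)
    h = h0 * np.sqrt(lam / f0)                  # per-point bandwidth h_i

    modes = points.copy()
    for _ in range(max_iter):
        u = (modes[:, None, :] - points[None, :, :]) / h[None, :, None]
        w = np.exp(-0.5 * np.sum(u**2, axis=2))            # kernel weights
        new_modes = (w @ points) / w.sum(axis=1, keepdims=True)
        shift = np.linalg.norm(new_modes - modes, axis=1)  # mean shift vector norm
        modes = new_modes
        if shift.max() < eps:                              # gradient approximately zero
            break
    return modes
```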

Next, in order to detect moving objects, a local grid map is constructed using the 3D occupancy grid algorithm in [6] to divide the surrounding environment into occupied, free, and unknown voxels. When new measurements arrive, dynamic voxels are detected in the grid map based on inconsistencies between occupied and free space. Then, all moving clusters in the dynamic voxels are extracted, as shown in Figure 2.
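
A minimal sketch of this inconsistency test is given below. The voxel size, the set-based grid representation, and the function name `detect_dynamic_voxels` are placeholders for illustration only; the actual implementation follows the 3D occupancy grid algorithm in [6].

```python
import numpy as np

def detect_dynamic_voxels(occupied_now, free_before, voxel_size=0.2):
    """Sketch of inconsistency-based motion detection.

    occupied_now : (m, 3) array of point coordinates measured in the current scan.
    free_before  : set of (i, j, k) voxel index tuples previously observed as free.
    A voxel that is occupied now but was previously traversed by laser rays
    (i.e., observed as free) is flagged as dynamic.
    """
    idx_now = set(map(tuple, np.floor(occupied_now / voxel_size).astype(int)))
    dynamic = idx_now & free_before          # occupied now, free before
    return dynamic
```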

3. Feature Extraction

In general, the height of the point cloud clusters of moving objects, including vehicles, pedestrians, bicycles, and crowds, ranges from 1 m to 4 m. When partial occlusion occurs, the height of a cluster can be less than 1 m. Since layer-based features describe local shape characteristics at a more detailed level than global features, we divide the point cloud cluster into eight layers along the vertical direction. 2D features at each layer are employed to supplement the 3D geometric features and to reduce the disturbance caused by partial occlusion. In addition to collecting the existing feature descriptors in the literature, the differences in point cloud characteristics among the four classes of moving objects are analyzed, and number-of-points-based features, shape features, and statistical features are selected, as shown in Tables 1–3.
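
The layer-wise computation can be illustrated with the following sketch. The specific per-layer descriptors shown here (point count, planar extents, planar variance) are simplified stand-ins for the full feature set listed in Tables 1–3, and the function name `layer_features` is illustrative.

```python
import numpy as np

def layer_features(cluster, n_layers=8):
    """Illustrative layer-wise 2D features for one point cloud cluster.

    cluster : (n, 3) array of x, y, z coordinates of a segmented object.
    The cluster is sliced into n_layers horizontal bands along z, and simple
    per-layer descriptors are stacked into one feature vector.
    """
    z = cluster[:, 2]
    edges = np.linspace(z.min(), z.max() + 1e-6, n_layers + 1)
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        layer = cluster[(z >= lo) & (z < hi)]
        if len(layer) == 0:
            feats.extend([0.0, 0.0, 0.0, 0.0])      # empty layer (e.g., occlusion)
            continue
        extent = layer[:, :2].max(axis=0) - layer[:, :2].min(axis=0)  # 2D bounding box
        feats.extend([len(layer), extent[0], extent[1],
                      float(np.var(layer[:, :2]))])                   # planar variance
    return np.asarray(feats)
```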

In order to remove the features that have no significant effect on the object classification results, the Gini index criterion of the CART decision tree algorithm [21] is used for feature selection. The forward search mechanism of the feature subset is combined with the subset evaluation mechanism to select effective features in order of priority, so that all samples falling at a subnode belong to the same category, i.e., the highest purity is achieved at each subnode. Define the proportion of the $k$-th class in the training sample set $U$ as $u_k$ ($k = 1, 2, 3, 4$); then the Gini index representing the purity of the probability distribution of the sample set $U$ is
$\mathrm{Gini}(U) = 1 - \sum_{k=1}^{4} u_k^2.$

The attribute $a$ is used to divide the sample set $U$, producing $V$ branch nodes. The samples taking the $v$-th value of attribute $a$ are denoted as $U^v$, and the weight of the branch node is set as $|U^v|/|U|$. Given the attribute $a$, the Gini index of the sample set $U$ is defined by
$\mathrm{Gini\_index}(U, a) = \sum_{v=1}^{V} \frac{|U^v|}{|U|}\, \mathrm{Gini}(U^v).$

The attribute with the minimum Gini index is selected as the optimal splitting attribute. Based on the optimal attribute, the features are allocated to the two subnodes generated from the current node. The above calculation is carried out recursively until the Gini index falls below the preset threshold. The category attributes of the training samples are divided separately for vehicle, pedestrian, bicycle, and crowd, and four decision trees of hierarchical features, one per category of moving object, are obtained, as shown in Figure 3. In this figure, a solid line denotes yes and a dotted line indicates no. Based on a comprehensive analysis of the four hierarchical feature decision trees, 18 effective features are selected from the initial 68 features, namely f1, f2, f4, f6, f11, f16, f19, f24, f40, f43, f47, f55, f59, f62, f64, f65, f66, and f68.
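
As an illustration only, a Gini-based feature ranking can be approximated with an off-the-shelf CART implementation. The sketch below uses scikit-learn's impurity-decrease importances on a single tree, whereas this work builds one hierarchical decision tree per category and merges the results; the function name `select_features_by_gini` and the single-tree simplification are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def select_features_by_gini(X, y, n_keep=18):
    """Sketch of Gini-based feature ranking with a CART tree.

    X : (n_samples, 68) feature matrix; y : category labels (0..3).
    A CART tree split on the Gini criterion assigns each feature an
    impurity-decrease importance; the top-ranked features are kept.
    """
    tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
    ranking = np.argsort(tree.feature_importances_)[::-1]   # most important first
    return ranking[:n_keep]
```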

4. Classification

4.1. Training SVM Classifier

An SVM classifier based on posterior probability is employed to calculate the probability that a point cloud cluster belongs to each category of moving objects. First, the standard nonlinear SVM classifier is used as the basic classification function:
$f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b,$
where $K(\cdot,\cdot)$ is the kernel function, $\alpha_i$ are the Lagrange multipliers, and $b$ is the bias.

The standard SVM classifier only outputs a hard class decision. In order to preserve the sparsity of the support vectors of the SVM classifier and the accuracy of the classification results, a sigmoid function is used to convert the output of the standard SVM into a posterior probability [22]:
$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)},$
where $P(y = 1 \mid f)$ denotes the probability of correct classification when the output is $f$, and $A$ and $B$ are the parameters to be fitted. Define the training set as $(f_i, t_i)$ with target probability $t_i = (y_i + 1)/2$, where $y_i \in \{-1, +1\}$ is the sample category, so that $t_i \in \{0, 1\}$. The parameters $A$ and $B$ are obtained by minimizing the negative log-likelihood on the training set:
$\min_{A,B}\; -\sum_{i=1}^{l} \left[ t_i \log p_i + (1 - t_i) \log (1 - p_i) \right],$
where $p_i = \frac{1}{1 + \exp(A f_i + B)}$, $f_i = f(x_i)$, and $i = 1, 2, \ldots, l$.
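
A minimal sketch of this Platt-style calibration is shown below, assuming the SVM decision values on a calibration set are available; the function name `fit_platt` and the Nelder-Mead optimizer are illustrative choices, and probability-enabled SVM implementations provide an equivalent calibration out of the box.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(f, y):
    """Fit the sigmoid parameters A, B of p = 1 / (1 + exp(A*f + B)).

    f : SVM decision values f(x_i) on a calibration set.
    y : labels in {-1, +1}; targets t_i = (y_i + 1) / 2 as in the text.
    """
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=float)
    t = (y + 1.0) / 2.0

    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)      # numerical safety
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    res = minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead")
    return res.x  # fitted A, B
```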

4.2. BDT-Based Classification Optimization

The point cloud of each moving object is associated with a tracker at consecutive frames, and the tracker is updated based on the association result. The location of each moving object at the next frame is predicted using a linear Kalman filter. The moving object model is denoted as {I, L, W, x, y, if, tI, Gm}, where I denotes the object index; L and W denote the size of the rectangle fitted to the point cloud cluster; x and y denote the center location of the point cloud cluster; if denotes whether the object has an associated tracker (the initial value 0 indicates no tracker); tI denotes the associated tracker index; and Gm denotes the minimum value of the cost function between the object and the associated tracker. The tracker model is denoted as {I, L, W, x, y, vx, vy, ifLost, oI}, where I denotes the tracker index; L and W denote the size of the fitted rectangle of the moving object matched with the tracker at the last frame; x, y, vx, and vy denote the position and velocity predicted by the filter at the current frame; ifLost denotes whether the tracked object is lost at the current frame (the initial value 1 indicates that the tracked object is lost); and oI indicates the index of the moving object corresponding to the tracker at the current frame when the tracked object has not been lost.

A deterministic data association algorithm based on the fusion of multiple features is used to associate moving objects with trackers. The location and geometric features of the moving object are utilized as the primary and secondary constraints, respectively, and objects are associated with trackers by minimizing a cost function. Assuming that m moving objects are generated by the point cloud preprocessing procedure at the (t + 1)-th frame and n trackers exist, the cost of associating the i-th moving object with the j-th tracker at the (t + 1)-th frame is
$G(i, j) = \omega_1 \frac{\mathrm{pos}(i, j)}{\max_n|\mathrm{pos}(i, \cdot)|} + \omega_2 \frac{\mathrm{box}(i, j)}{\max_n|\mathrm{box}(i, \cdot)|},$
where pos(i, j) denotes the cost component between the position of the moving object and the position of the tracker; box(i, j) denotes the cost component between the size of the fitted rectangle of the moving object and that of the tracker; $\omega_1$ and $\omega_2$ denote the weights of the position and size components, respectively (with $\omega_1 + \omega_2 = 1$; since the weight of the position is higher than that of the size, $\omega_1 > \omega_2$); and $\max_n|\cdot|$ represents the maximum of the association values between the i-th moving object and the n trackers at the (t + 1)-th frame.
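
The association step can be sketched as a greedy nearest-cost assignment, as below. The dictionary-based object and tracker representations, the weights `w_pos` and `w_box`, and the specific box cost are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def associate(objects, trackers, w_pos=0.7, w_box=0.3):
    """Greedy object-to-tracker association using a position/size cost.

    objects, trackers : lists of dicts with keys 'x', 'y', 'L', 'W'.
    The per-pair costs are normalized by the maximum over all trackers for
    each object and combined with w_pos > w_box (w_pos + w_box = 1).
    One-to-one enforcement and gating are omitted for brevity.
    """
    pairs = []
    for i, o in enumerate(objects):
        pos = np.array([np.hypot(o['x'] - t['x'], o['y'] - t['y']) for t in trackers])
        box = np.array([abs(o['L'] - t['L']) + abs(o['W'] - t['W']) for t in trackers])
        cost = w_pos * pos / (pos.max() + 1e-9) + w_box * box / (box.max() + 1e-9)
        pairs.append((i, int(np.argmin(cost)), float(cost.min())))  # (object, tracker, G_m)
    return pairs
```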

Since both the number of points and the shape of the point cloud of the same moving object vary from frame to frame, and partial occlusion may exist in a few frames, the posterior probability of the category estimate changes frame by frame. Based on the prior probability distribution given by the object recognition result at each individual frame, a posterior distribution model over the multiframe classification is established, and the maximum of the posterior distribution is recursively solved to generate the optimal classification result at the current frame. The classification results of the 10 previous consecutive frames are used to estimate the optimal category at the current frame. Define the object category set as $S = \{\text{vehicle}, \text{pedestrian}, \text{bicycle}, \text{crowd}\}$, $S_i \in S$, and suppose there are J moving clusters at the t-th frame. According to equation (6), the category decision for the k-th cluster $C_k$ ($1 \le k \le J$) is generated as $\{P_t^k(S_i)\}$, $i = 1, \ldots, 4$. The category decision vector for the k-th cluster tracked by the tracker $T_t$ at the t-th frame is denoted as $D_t$. Assuming that the observation $D_t$ at the t-th frame depends only on the given category $S_i$, the posterior probability that the k-th moving object cluster belongs to category $S_i$ at the t-th frame is updated from the state at the (t − 1)-th frame:
$p(S_i \mid D_1, \ldots, D_t) = \frac{p(D_t \mid S_i)\, p(S_i \mid D_1, \ldots, D_{t-1})}{\sum_{j=1}^{4} p(D_t \mid S_j)\, p(S_j \mid D_1, \ldots, D_{t-1})},$
where the likelihood at the t-th frame is taken as the per-frame SVM output, $p(D_t \mid S_i) = P_t^k(S_i)$. At the end, the maximum a posteriori probability is used to estimate the category of the k-th cluster tracked by the tracker at the t-th frame.
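
The recursive update can be written compactly as follows. This is a sketch under assumptions: the uniform initial prior, the dummy per-frame SVM outputs, and the function name `bdt_update` are chosen for illustration.

```python
import numpy as np

CLASSES = ["vehicle", "pedestrian", "bicycle", "crowd"]

def bdt_update(prior, svm_posterior):
    """One Bayes update step for a tracked cluster.

    prior         : P(S_i | D_1..D_{t-1}), shape (4,), from the previous frame.
    svm_posterior : per-frame SVM probabilities, used as likelihood P(D_t | S_i).
    Returns the normalized posterior P(S_i | D_1..D_t).
    """
    post = svm_posterior * prior + 1e-12      # small epsilon for numerical safety
    return post / post.sum()

# Usage sketch with dummy per-frame SVM outputs (10 frames, 4 classes).
rng = np.random.default_rng(0)
svm_outputs = rng.dirichlet(np.ones(4), size=10)
belief = np.full(4, 0.25)                     # uniform prior at the first frame
for frame_probs in svm_outputs:
    belief = bdt_update(belief, frame_probs)
print(CLASSES[int(np.argmax(belief))])        # MAP category estimate
```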

5. Experimental Results

5.1. Data Collection

In order to test the performance of the proposed moving object classification method, four categories of point cloud samples, including vehicle, pedestrian, bicycle, and crowd, are collected using a Velodyne HDL-64E LIDAR mounted on our autonomous vehicle (Figure 4). The videos generated by 3 external cameras on the autonomous vehicle are acquired synchronously to manually label the ground-truth categories of the samples. Meanwhile, 3D LIDAR data from the public KITTI dataset [23] is also used to supplement the point cloud samples. The point cloud clusters of moving objects are extracted with the data preprocessing procedure. Note that the extracted clusters of moving objects with the same category are sensed from different view directions and distances. Figure 5 shows a few examples of moving objects. All the experiments are processed on an Intel i7-4700, 3.20 GHz processor with 8 GB RAM using C++ code.

5.2. SVM-BDT-Based Classification Results

The framework for moving object classification is tested on the task of calculating the posterior probability that a point cloud cluster belongs to each category of moving objects at consecutive frames. We run the proposed SVM-BDT classification method using the 18 selected features, and the outputs of the posterior probability in multiple scenarios are shown in Figure 6. In each subgraph, the upper picture is the scene image captured synchronously, the rectangle marks one moving object tracked by the 3D LIDAR, and the bottom picture shows the variation of the posterior probability frame by frame. As shown in Figure 6(b), the pedestrian in the red rectangle suddenly appears and changes from partially occluded to completely exposed in the point cloud. The posterior probability that the cluster belongs to the pedestrian class increases rapidly after several initial frames and then stays at the maximum value. Figure 6(c) shows that as the bicycle gradually moves away, the number of points returned from the bicycle decreases, and the posterior probability that the point cloud cluster belongs to the bicycle class decreases correspondingly. As shown in Figure 6(d), multiple pedestrians walk away from the LIDAR. At first, the pedestrians walk so close to each other that the posterior probability of the crowd class is the highest. Then, as the distance among the pedestrians gradually increases, the point cloud of the crowd is successfully segmented into multiple pedestrians, and the posterior probability of the pedestrian class increases accordingly. Later, the distances among the pedestrians decrease again, and the posterior probability of the crowd class rises once more.

To evaluate the performance of the proposed SVM-BDT method quantitatively, 2000 groups of point cloud samples are selected for each category of moving object, and 5-fold cross validation is conducted. Each group contains 10 consecutive frames. Figure 7 shows the confusion matrix of the recognition results. The recognition accuracies of vehicle, pedestrian, bicycle, and crowd are 97%, 95%, 91%, and 90%, respectively. The proposed classification method shows the best recognition performance on moving vehicles, and the point cloud cluster of the crowd is the most likely to be confused with the other types of moving objects. Overall, the average recognition accuracy of the SVM-BDT method is 93.25%, which satisfies the requirements of the autonomous vehicle for recognizing surrounding moving obstacles. The total running time of the proposed SVM-BDT method increases with the number of moving objects in the traffic scenario, especially the time cost of the BDT-based classification optimization stage, since the point cloud of each moving object is associated with the trackers at consecutive frames and the trackers are updated based on the association result.

5.3. Classification Performance Comparison

To demonstrate that both the number of features and the choice of classifier affect the recognition performance, we compare four moving object classification methods, namely, (1) 18 features + SVM, (2) 68 features + SVM, (3) 18 features + SVM + BDT, and (4) 68 features + SVM + BDT. All methods are tested with 5-fold cross validation using 2400 groups of point cloud samples, where each category of moving object has 600 groups. Note that each group of point cloud samples contains 10 consecutive frames and partial occlusion occurs at several frames. The Receiver Operating Characteristic (ROC) curves for the four methods are shown in Figure 8. The larger the area under the ROC curve (AUC), the better the performance of the classification method. The AUC values and run times of the four methods are listed in Table 4. The run time denotes the total time cost of both the feature extraction and classification stages. We can see that, for the SVM-BDT-based moving object classification method, the AUC value obtained with 18 features is close to that obtained with 68 features; thus, the characteristics of the four categories of moving objects can be explained well by the 18 selected features. For the same feature set, the SVM-BDT-based classification method outperforms the SVM-based method, which demonstrates that the BDT algorithm using consecutive frames effectively optimizes the classification result at each individual frame and even overcomes partial occlusion. Moreover, for the same classifier, the method using 18 features runs in less time than the one using 68 features. Considering the recognition accuracy, computational complexity, and operational efficiency, we conclude that the SVM-BDT classifier based on 18 features is the best choice for recognizing moving objects including vehicle, pedestrian, bicycle, and crowd.

The crowd is regarded as a special moving object distinct from the single pedestrian in this paper. To further validate the crowd recognition performance of the proposed SVM-BDT-based method, several commonly used recognition methods are compared. The ROC curves of the classification results are shown in Figure 9, and the results clearly show the superiority of the proposed SVM-BDT-based method over the Adaboost algorithm [24], the Naive Bayes algorithm [25], and the FLDA algorithm [26]. Although the crowd recognition accuracy of the MCI-NN algorithm [27] is higher than that of the proposed SVM-BDT-based method, the MCI-NN algorithm consumes more memory and takes more operation time due to its use of the Markov kernel function. Therefore, considering both recognition accuracy and efficiency, the proposed SVM-BDT-based method shows better crowd recognition performance.

6. Conclusions and Future Works

In this paper, we propose an approach for moving object classification using 3D point clouds in urban traffic environments. The approach classifies moving objects into four classes, namely, vehicle, pedestrian, bicycle, and crowd. Accurate moving object classification from 3D point clouds consists of several procedures that all affect the final classification results. To obtain effective features of moving objects, instead of a simple feature description, the Gini index criterion is employed based on the characteristics of each category of moving objects to select from the extracted features, which include number-of-points-based features, shape features, and statistical features. In the classification procedure, unlike previous works where the classifier is modeled with the point cloud at a single frame, the moving object is recognized with the SVM-BDT classifier, which incorporates multiframe classification results. The presented method has three benefits. First, it can classify the common moving objects in urban environments even if pedestrians walk close together or partial occlusion occurs. Second, it digs deep into the point cloud distribution based on the category attribute to recognize moving objects efficiently. Moreover, the BDT-based classification optimization is conducted on the results of the posterior probability SVM classifier at consecutive frames to improve the classification performance. The method is tested using the point cloud dataset collected by our own LIDAR as well as the public KITTI dataset. The results reveal that the proposed SVM-BDT method based on 18 features achieves better classification accuracy for vehicle, pedestrian, bicycle, and crowd compared with several other methods. Note that the point cloud samples used in this work are collected within 40 meters, beyond which the declining resolution of the point cloud causes many mistakes. The proposed method is therefore limited in classifying moving objects at long range, and this challenge is treated as a subject of our future work. Another aspect of future work is a deeper understanding of object behaviour using 3D LIDAR by integrating motion cues with the classification results of surrounding objects in urban traffic environments.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by grants from National Key R&D Program of China (2018YFB1600500), National Natural Science Foundation of China (51905007 and 51775053), the Great Wall Scholar Program (CIT&TCD20190304), Ability Construction of Science and Beijing Key Lab Construction Fund (PXM2017-014212-000033), and NCUT Start-Up Fund.