Multiobject Tracking (MOT) is one of the most important abilities of autonomous driving systems. However, most existing MOT methods rely on a single sensor, such as a camera, which limits their reliability. In this paper, we propose a novel MOT method that fuses the deep appearance features and motion information of objects. In this method, the locations of objects are first determined by a 2D object detector and a 3D object detector. We use the Nonmaximum Suppression (NMS) algorithm to combine the detection results of the two detectors to ensure detection accuracy in complex scenes. After that, we use a Convolutional Neural Network (CNN) to learn the deep appearance features of objects and employ a Kalman Filter to obtain their motion information. Finally, the MOT task is achieved by associating the motion information and deep appearance features; a successful match indicates that the object was tracked successfully. A set of experiments on the KITTI Tracking Benchmark shows that the proposed method can effectively perform the MOT task, reaching a Multiobject Tracking Accuracy (MOTA) of up to 76.40% and a Multiobject Tracking Precision (MOTP) of up to 83.50%.

1. Introduction

The objective of Multiobject Tracking (MOT) is to track multiple objects at the same time and estimate their current states, such as locations, velocities, and sizes, while maintaining their identities. Hence, MOT is one of the most important abilities of autonomous systems, but it remains challenging because target objects may be occluded or confused with objects of similar shape. Owing to the rapid development of object detectors, several tracking-by-detection methods [1–5] have been proposed to address the MOT problem. Typically, tracking-by-detection methods involve two main computational steps: object detection and tracking. These methods first detect the locations of objects and then compute the trajectories of the objects based on the detection results [6–8]. The accuracy of object tracking is therefore highly dependent on the performance of object detection. A key requirement of MOT is to start tracking new targets whenever they appear and to recover lost targets from the detections and re-associate them. However, most tracking-by-detection methods are based on vision-based object detection. In the case of occlusion and overexposure, vision-based object detection may lead to false associations with existing trajectories. For example, Figure 1(a) shows a failure of vehicle detection in an image where the vehicle is occluded by humans, and Figure 1(b) shows that the camera fails under overexposure.

An autonomous driving scene may contain multiple objects, and the states of the objects are usually uncertain [9, 10]. In this case, vision-based object detection is susceptible to occlusion or overexposure, which easily leads to false detections or loss of the tracked target. Besides, one major challenge of MOT is how to reduce incorrect identity switches. Because the tracked objects often look highly similar, it is challenging to track objects correctly and perform correct Re-Identification (RE-ID).

Multimodal data fusion has the potential to improve the stability and accuracy of MOT. However, a majority of traditional methods that use the camera, LiDAR, or radar rely on hand-crafted features [11]. Such features are often not precise enough, which makes it difficult to guarantee tracking performance. Hence, it is necessary to design a feature learning method that can automatically learn appearance features from raw visual data. Moreover, in autonomous driving systems, since the objects are moving rather than stationary, the motion information of objects should be integrated with the appearance features to achieve the MOT task. In addition, some MOT methods include depth information in the tracking process by using a depth camera in order to improve tracking performance. For example, Mehner et al. [12] used an ordinary camera to obtain 2D information of objects and used a depth camera to obtain depth information to assist in locating the objects in world coordinates. Although this can improve the accuracy, the depth camera has a small field of view, high noise, and is easily affected by sunlight, so it is not as effective as LiDAR. Moreover, they only use a Kalman Filter for tracking, which does not work well in complex scenarios.

In this paper, we propose a multimodal MOT method that fuses the motion information and the deep appearance features of objects. This paper employs a 2D object detector, i.e., You Only Look Once (YOLOv3) [3], and a 3D object detector, i.e., PointRCNN [5], to process the RGB image and the laser point cloud, respectively. The combination of 2D detection and 3D detection helps to improve the robustness of object detection. Then, MOT is achieved by associating the motion information and the deep appearance features of the target object. A set of experiments on the KITTI Tracking Benchmark is performed to demonstrate the effectiveness of the proposed MOT method. Our contributions are summarized as follows:

(1) The 2D object detection based on the image and the 3D object detection based on the laser point cloud are combined to detect the location of objects, which is robust against light changes and occlusion.

(2) We apply a CNN that is pretrained to discriminate vehicles on a large-scale vehicle Re-Identification dataset to automatically extract the deep appearance features of the target object without manually designing features.

(3) A multimodal MOT method is proposed that fuses the motion information and deep appearance features of the object to achieve the MOT task. In addition, the proposed method obtains competitive qualitative and quantitative tracking results on the KITTI tracking benchmark.

The rest of the paper is organized as follows. Section 2 introduces related works. Section 3 presents the proposed multimodal MOT method. Experiments and their results are presented in Section 4. Finally, the conclusion and future work are summarized in Section 5.

2. Related Work

This section provides an overview of two related research topics: multiobject tracking and object detection.

2.1. Multiobject Tracking

The MOT problem first appeared in the tracking of object trajectories, for example, the tracking of multiple enemy aircraft or passing missiles. With the development of computer vision, researchers have proposed several MOT methods from different aspects in the past few decades. For example, single-object tracking methods have been extended to support multiple objects. According to how data association is performed, the existing MOT methods can be divided into two categories: offline and online MOT methods. In offline methods [13–16], the detections of all frames in the sequence are combined to obtain the object trajectories robustly. These methods need to construct a global graph structure, which leads to high computational complexity. In online MOT methods [17–20], by contrast, the detections are only associated with the existing trajectories frame by frame. Hence, online methods are more suitable for real-time tracking.

Most of the existing MOT methods rely on motion information produced from Kalman Filter [21], Hungarian algorithm with Kalman Filter [17], Particle Filter [22], or probability hypothesis density filter [23]. However, in autonomous driving systems, due to the uncertainty of the scene, it is impossible to track objects stably only by using motion information. Therefore, more recent methods combine the motion features with the appearance features to improve the re-identification of target objects. Traditionally, the appearance features of objects are manually designed [24], which cannot provide reliable features, especially, in complex scenes. Owing to the rapid development of deep learning, deep convolutional networks [9, 25, 26] have been widely used to extract the appearance features from raw visual data. For example, Wojke et al. [17] used CNN to extract the pedestrian image features and measure the distance between features for human detection.

2.2. Object Detection

Most of the existing 2D object detection methods are based on CNNs and can be divided into two-stage detectors and one-stage detectors. Two-stage detectors, such as RCNN [27], Fast RCNN [28], Faster RCNN [1], and FPN [29], first generate candidate regions, for example by selective search or a Region Proposal Network (RPN), and then perform bounding-box classification and regression. For example, RCNN starts with the extraction of a set of object proposals by selective search. Then, each proposal is rescaled to a fixed-size image and fed into a CNN model that is trained on ImageNet. In this way, the presence of an object within each region is predicted and its category is recognized. Although the two-stage detectors have made great progress, their main drawback is that the redundant feature computation over a large number of overlapping proposals results in a very slow detection speed.

One-stage detectors include YOLO [3, 30, 31], Single Shot MultiBox Detector (SSD) [2], and RetinaNet [32]. These detectors do not need the RPN. They directly generate the category probabilities and bounding boxes of the objects, so the final detection results are obtained in a single stage of computation. For example, YOLO applies a single neural network to the whole image. This network divides the image into regions and predicts the bounding boxes and the probabilities for each region simultaneously. Compared with the two-stage detectors, the one-stage detectors have a higher detection speed.

Because the point-cloud data contains richer geometric features, 3D object detection has attracted more and more attention. Compared with 2D object detection, 3D object detection is more challenging because it needs to process the point clouds of the scene. Chen et al. [33] projected point cloud to the bird’s view and used 2D CNNs to learn the features of point cloud for 3D boxes’ generation. Song and Xiao [34, 35] divided the point cloud into equally spaced 3D voxels and used 3D CNNs to learn the features of voxels to generate 3D boxes. Shi et al. [36] used PointNet++ [37] to process the point-cloud inputs for 3D boxes’ generation. Besides, some methods [38, 39] estimate 3D bounding boxes based on images.

3. Method

This section introduces the proposed multimodal MOT method that tracks multiple objects at the same time and records their trajectories. The proposed MOT method includes the four main computations: object detection with Nonmaximum Suppression, motion information extraction, learning deep appearance feature, and object tracking with data association. Figure 2 shows an overview of the proposed MOT method. We combine the result of 2D object detection and 3D object detection such that the location of the object can be detected robustly. Based on this, the motion information and appearance features of objects are computed respectively. Finally, the motion information and appearance features of objects are associated to track the target object.

3.1. Object Detection with NMS

The first task of MOT is to detect the location of objects in the scene. In this paper, we propose to combine the results of 2D object detection and 3D object detection for robust object detection. We use the 2D detector, i.e., YOLOv3 [3], trained on the training set of the KITTI 2D object detection benchmark, and the 3D detector, i.e., PointRCNN [5], trained on the training set of the KITTI 3D object detection benchmark.

The 2D detector processes the RGB image. The output of 2D object detection is a set of detections $D_t^{2D} = \{d_1^{2D}, \dots, d_n^{2D}\}$, where $n$ is the number of objects at frame $t$. The 3D detector processes the point clouds that were collected from a LiDAR. The output of 3D object detection is $D_t^{3D} = \{d_1^{3D}, \dots, d_m^{3D}\}$, where $m$ is the number of objects at frame $t$. For further calculation, we project the LiDAR points in 3D space into 2D space according to the combined camera and LiDAR calibration:

$$\mathbf{x} = P_{\mathrm{rect}} \, R_{\mathrm{rect}} \, T_{\mathrm{velo}}^{\mathrm{cam}} \, \mathbf{X}, \tag{1}$$

where $\mathbf{x}$ is the projected point in the RGB image and $\mathbf{X}$ denotes the 3D LiDAR point, both in homogeneous coordinates. $P_{\mathrm{rect}}$ and $R_{\mathrm{rect}}$ are the intrinsic camera parameters: $P_{\mathrm{rect}}$ is the camera matrix, and $R_{\mathrm{rect}}$ is the rectification matrix that makes the images co-planar. $T_{\mathrm{velo}}^{\mathrm{cam}}$ projects the point $\mathbf{X}$ in the LiDAR coordinates onto the camera coordinate system. Both the intrinsic and extrinsic parameters are available in the KITTI dataset [40]. Figure 3 shows an example of point projections.
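The projection chain described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name and the argument names `P_rect`, `R_rect`, and `T_velo_cam` are hypothetical, the rectification and LiDAR-to-camera matrices are assumed to be padded to 4 × 4 homogeneous form, and points behind the camera are flagged rather than discarded.

```python
import numpy as np

def project_lidar_to_image(points_velo, P_rect, R_rect, T_velo_cam):
    """Project N x 3 LiDAR points to pixel coordinates.

    P_rect:     3 x 4 camera projection matrix (assumed name)
    R_rect:     4 x 4 rectification matrix in homogeneous form (assumed name)
    T_velo_cam: 4 x 4 rigid transform from LiDAR to camera frame (assumed name)
    Returns an N x 2 array of pixels and a boolean mask of points in front
    of the camera (positive depth).
    """
    n = points_velo.shape[0]
    X = np.hstack([points_velo, np.ones((n, 1))])  # N x 4 homogeneous points
    cam = R_rect @ T_velo_cam @ X.T                # 4 x N, rectified camera frame
    in_front = cam[2] > 0                          # depth along the optical axis
    img = P_rect @ cam                             # 3 x N homogeneous pixels
    with np.errstate(divide="ignore", invalid="ignore"):
        pixels = (img[:2] / img[2]).T              # perspective division
    return pixels, in_front
```

In practice, only the points with `in_front` set and with pixel coordinates inside the image bounds would be kept before the projected 3D boxes are compared with the 2D boxes.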

After the 3D point clouds are projected onto the image, two overlapping boxes will appear on the same object. This paper further uses the Nonmaximum Suppression (NMS) algorithm to get rid of the extra boxes. NMS sorts all detection boxes on the basis of their scores and selects the box $M$ with the highest score. All other detection boxes that overlap strongly with $M$ are suppressed by using a predefined threshold $N_t$:

$$s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) \le N_t, \\ 0, & \mathrm{IoU}(M, b_i) > N_t, \end{cases} \tag{2}$$

where $b_i$ is the detection box to be screened and $s_i$ is its score; when $\mathrm{IoU}(M, b_i)$ is greater than $N_t$, $b_i$ is removed. In our experiment, $N_t$ is set to 0.7. Figure 4 compares the detection results without NMS and with NMS.
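The suppression rule can be illustrated with a short pure-Python sketch. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` tuples and a single pooled list of 2D boxes and projected 3D boxes; the function names are illustrative and are not taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.7):
    """Return the indices of the boxes kept by greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)  # box M with the highest remaining score
        keep.append(m)
        # Remove every remaining box whose overlap with M exceeds the threshold.
        order = [i for i in order if iou(boxes[m], boxes[i]) <= iou_threshold]
    return keep
```

With the threshold of 0.7 used in the experiments, two boxes on the same vehicle are merged only when they overlap strongly, so neighbouring vehicles with moderate overlap are both retained.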

3.2. Learning Object Appearance Features

Before implementing the MOT, we need to extract the appearance features of the object. This paper employs CNN to automatically learn the deep appearance features of objects from raw visual data. The CNN is trained on a large-scale benchmark dataset [41]. The dataset contains over 50,000 images of 776 vehicles captured by 20 cameras. Figure 5 shows several samples in this dataset.

Table 1 illustrates the architecture of the CNN used in this paper. The CNN model is inspired by the wide residual network [40, 42] and consists of two convolution layers and six residual blocks. Dense layer 10 extracts a 128-dimensional global feature. The final batch and $\ell_2$ normalization layer projects the feature onto the unit hypersphere. The tracked vehicle image is resized to 224 × 224 and fed into the network, which yields a 128-dimensional feature vector that is used as the deep appearance feature of the object.
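To make the role of the final normalization step concrete, the following minimal NumPy sketch shows the projection of feature vectors onto the unit hypersphere. The function name and the `eps` guard against zero-length vectors are assumptions for illustration; the network itself is not reproduced here.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale each feature vector (one per row) to unit Euclidean length."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / np.maximum(norms, eps)  # eps avoids division by zero
```

Once the features have unit length, the cosine similarity of two features reduces to their dot product, which is what makes a cosine-distance appearance metric cheap to evaluate.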

3.3. Extraction of Object Motion Information

Since the objects are usually moving rather than stationary, it is necessary to extract the motion information of objects for MOT. This paper employs the Kalman filter to predict the state of the object and then extract its motion information. We use eight parameters $(u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h})$ to describe the tracking state at frame $t$, where $(u, v)$ is the bounding box center position, $\gamma$ is the aspect ratio, $h$ is the height of the bounding box, and $(\dot{u}, \dot{v}, \dot{\gamma}, \dot{h})$ represents the corresponding velocities in the image coordinate system.

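Because the time interval between consecutive frames is very short, the motion can be regarded as a linear constant-velocity model. We get the predicted object state at the next frame and calculate the error covariance matrix between the predicted state and the true state:

$$\hat{x}_{t|t-1} = A \hat{x}_{t-1}, \qquad P_{t|t-1} = A P_{t-1} A^{T} + Q, \tag{3}$$

where $\hat{x}_{t|t-1}$ is the predicted object state at frame $t$, $A$ is the state transition matrix, and $\hat{x}_{t-1}$ is the object state at frame $t-1$. $Q$ is the covariance matrix of the prediction noise. Then, we can get the Kalman gain matrix $K_t$ and calculate the estimated state $\hat{x}_t$:

$$K_t = P_{t|t-1} H^{T} \left( H P_{t|t-1} H^{T} + R \right)^{-1}, \qquad \hat{x}_t = \hat{x}_{t|t-1} + K_t \left( z_t - H \hat{x}_{t|t-1} \right), \tag{4}$$

where $z_t$ is the measured value and $H$ is the conversion matrix from $\hat{x}_{t|t-1}$ to $z_t$. $R$ is the covariance matrix of the measurement noise. Finally, we update the covariance matrix $P_t$:

$$P_t = \left( I - K_t H \right) P_{t|t-1}. \tag{5}$$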
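The prediction and update steps can be sketched with NumPy as follows. This is a minimal illustration of a standard linear Kalman filter with a constant-velocity model, not the authors' implementation; the function names, the unit time step, and the noise matrices passed by the caller are assumptions.

```python
import numpy as np

def make_constant_velocity_model(dt=1.0, dim=4):
    """Build A and H for a state of `dim` positions followed by `dim` velocities."""
    A = np.eye(2 * dim)
    A[:dim, dim:] = dt * np.eye(dim)                    # position += dt * velocity
    H = np.hstack([np.eye(dim), np.zeros((dim, dim))])  # only positions are measured
    return A, H

def predict(x, P, A, Q):
    """Propagate the state and its error covariance to the next frame."""
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    return x_pred, P_pred

def update(x_pred, P_pred, z, H, R):
    """Correct the prediction with the measurement z."""
    S = H @ P_pred @ H.T + R                            # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x_est = x_pred + K @ (z - H @ x_pred)
    P_est = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_est, P_est
```

For the eight-dimensional state used here, `dim=4` gives an 8 × 8 transition matrix and a 4 × 8 measurement matrix that selects the box center, aspect ratio, and height.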

3.4. Object Tracking Based on Data Association

The next step is to associate the deep appearance features and the motion information of the object for MOT. First, this paper uses the (squared) Mahalanobis distance to measure the motion correlation between the predicted state of the Kalman Filter and the newly detected bounding boxes:

$$d^{(1)}(i, j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i), \tag{6}$$

where $d_j$ denotes the $j$th bounding box detection, and $y_i$ and $S_i$ represent the mean and covariance of the $i$th predicted bounding box. A threshold $t^{(1)}$ can be adjusted to control the minimum confidence of the motion information association between objects $i$ and $j$. We denote this decision with an indicator $b_{i,j}^{(1)}$, as shown in equation (7). The indicator is equal to 1 if the Mahalanobis distance is smaller than or equal to the threshold $t^{(1)}$, which is set to 9.4877, the 0.95 quantile of the chi-square distribution with four degrees of freedom, for our four-dimensional measurement space:

$$b_{i,j}^{(1)} = \mathbb{1}\left[ d^{(1)}(i, j) \le t^{(1)} \right]. \tag{7}$$
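A minimal NumPy sketch of this motion gate is given below. The function names are illustrative, and the predicted mean and covariance are assumed to be already projected into the four-dimensional measurement space.

```python
import numpy as np

CHI2_95_4DOF = 9.4877  # gating threshold quoted in the text

def mahalanobis_sq(detection, mean, cov):
    """Squared Mahalanobis distance between a detection and a predicted box."""
    diff = np.asarray(detection, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov * y = diff instead of forming the inverse explicitly.
    return float(diff @ np.linalg.solve(cov, diff))

def motion_gate(detection, mean, cov, threshold=CHI2_95_4DOF):
    """Return 1 if the association is admissible by motion, else 0."""
    return 1 if mahalanobis_sq(detection, mean, cov) <= threshold else 0
```

Because the distance is scaled by the predicted covariance, a track whose state is uncertain accepts detections that are farther away than a track with a confident prediction.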

The Mahalanobis distance is a suitable association metric only when the motion uncertainty is very low. In the image space, however, the prediction of the Kalman filter framework alone is rough. Therefore, this paper also adopts a second metric, which measures the smallest cosine distance between the appearance features of the $i$th track and the $j$th detection:

$$d^{(2)}(i, j) = \min \left\{ 1 - r_j^{T} r_k^{(i)} \;\middle|\; r_k^{(i)} \in R_i \right\}, \tag{8}$$

where $r_j$ is the appearance feature vector of detection $j$ and $r_k^{(i)}$ represents the feature vector of the $i$th tracked object at a recent frame $k$, taken from the set $R_i$ of stored features of that track. In our experiment, at most the 100 most recent feature vectors are kept for each track. In addition, in order to determine whether the appearance features are related, we introduce a binary indicator, as shown in equation (9). The threshold $t^{(2)}$ for this indicator is determined on the VeRi dataset:

$$b_{i,j}^{(2)} = \mathbb{1}\left[ d^{(2)}(i, j) \le t^{(2)} \right]. \tag{9}$$
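The appearance metric can be sketched as follows, assuming each track stores its most recent features in a bounded queue. The names `GALLERY_SIZE`, `new_gallery`, and `min_cosine_distance` are hypothetical; the features are re-normalized inside the function so that the sketch also works for inputs that are not yet unit length.

```python
from collections import deque

import numpy as np

GALLERY_SIZE = 100  # number of recent feature vectors kept per track (from the text)

def new_gallery():
    """Create an empty per-track feature store that drops the oldest entries."""
    return deque(maxlen=GALLERY_SIZE)

def min_cosine_distance(gallery, detection_feature):
    """Smallest cosine distance between a detection and a track's stored features."""
    feats = np.asarray(list(gallery), dtype=float)
    r = np.asarray(detection_feature, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    r = r / np.linalg.norm(r)
    return float(np.min(1.0 - feats @ r))
```

Taking the minimum over the stored features means that a detection only has to resemble one past view of the vehicle, which helps when the vehicle reappears after occlusion at a different pose.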

The Mahalanobis distance determines whether the position predicted by the Kalman filter is related to the new detection, which is especially useful for short-term prediction. The cosine distance considers the appearance of the tracked objects, which is especially useful for recovering identities after a long period of occlusion. Therefore, this paper combines the two metrics using a weighted sum:

$$c_{i,j} = \lambda \, d^{(1)}(i, j) + (1 - \lambda) \, d^{(2)}(i, j), \tag{10}$$

where we call an association admissible if $b_{i,j}^{(1)} = 1$ and $b_{i,j}^{(2)} = 1$. The hyperparameter $\lambda$ controls the influence of each metric on the combined association cost. For example, when there is substantial object motion, the prediction of the constant-velocity motion model becomes less effective, so the appearance metric is given more weight by reducing the value of $\lambda$. Conversely, when there are few vehicles on the road and no long-term partial occlusions, increasing the value of $\lambda$ raises the importance of the motion distance metric.
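As an illustration, the weighted cost and the admissibility mask can be computed for all track and detection pairs at once. The appearance threshold default of 0.2 is a placeholder assumption, since the text does not state its value, and the greedy matcher is a simplification introduced here for the example, not the assignment procedure of the paper.

```python
import numpy as np

def combined_cost(d_motion, d_appearance, lam, t_motion=9.4877, t_appearance=0.2):
    """Weighted association cost and admissibility mask (tracks x detections)."""
    d1 = np.asarray(d_motion, dtype=float)
    d2 = np.asarray(d_appearance, dtype=float)
    cost = lam * d1 + (1.0 - lam) * d2
    admissible = (d1 <= t_motion) & (d2 <= t_appearance)  # both gates must pass
    return cost, admissible

def greedy_match(cost, admissible):
    """Pair tracks and detections in order of increasing cost (simplified)."""
    masked = np.where(admissible, cost, np.inf)
    matches, used_rows, used_cols = [], set(), set()
    for flat in np.argsort(masked, axis=None):
        i, j = (int(v) for v in np.unravel_index(flat, masked.shape))
        if not np.isfinite(masked[i, j]):
            break  # only inadmissible pairs remain
        if i in used_rows or j in used_cols:
            continue
        matches.append((i, j))
        used_rows.add(i)
        used_cols.add(j)
    return matches
```

An optimal assignment solver, such as the Hungarian algorithm mentioned in the related work, could replace the greedy step without changing how the cost and the mask are built.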

Finally, our implementation limits the number of frames, $A_{\max}$, for which a target may remain lost. To avoid redundant computation, if a tracked object is not re-identified within the $A_{\max}$ most recent frames since its last successful association, it is assumed to have left the scene. If the object is seen again, a new ID is assigned to it. A new track is hypothesized when a detection cannot be associated with any of the existing tracks. If the predicted position of this object can be correctly associated with a detection in the required number of consecutive frames, we confirm that a new target has appeared.
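The track life cycle described above can be sketched as a small state update. The class and function names are hypothetical, the default `max_age=30` is a placeholder because the value used in the experiments is not given in the text, and dropping an unconfirmed track on its first miss is an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    hits: int = 1            # consecutive matches, counting the creating detection
    misses: int = 0          # consecutive frames without a match
    confirmed: bool = False

def step(track, matched, n_init=3, max_age=30):
    """Update a track for one frame; return False if it should be deleted."""
    if matched:
        track.hits += 1
        track.misses = 0
        if track.hits >= n_init:
            track.confirmed = True   # enough consecutive matches: a new target
        return True
    track.misses += 1
    if not track.confirmed:
        return False                 # tentative track lost before confirmation
    return track.misses <= max_age   # confirmed track survives up to max_age misses
```

A deleted track is never revived in this sketch, so an object that reappears later starts a new tentative track and receives a new ID, as stated in the text.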

4. Experiment

This section introduces the dataset, evaluation metric, training parameters, and experimental evaluation results in the experiments on the KITTI Tracking Benchmark.

4.1. Dataset

The proposed method was evaluated on the KITTI tracking benchmark [43]. The KITTI dataset was collected under four different scenarios: city, residential, road, and campus. Some samples of the KITTI dataset are shown in Figure 6. The dataset consists of 21 training sequences and 29 test sequences. For each sequence, LiDAR point clouds, RGB images, and calibration files are provided. In the training sequences, eight different classes are labeled, including car, pedestrian, and cyclist. The objects in the images are annotated with 3D and 2D bounding boxes across frames, and each object has a unique ID. In this work, we used all 29 test sequences for model validation and evaluated only on the car subset, because it has the most instances of all object types.

4.2. Evaluation Metric

The metrics used to evaluate the performance of the proposed MOT method were as follows:

(1) Mostly Tracked (MT): objects are successfully tracked for at least 80% of their trajectories during their life span.

(2) Mostly Lost (ML): objects are successfully tracked for less than 20% of their trajectories during their life span.

(3) Identity Switches (IDS): the number of times objects' identities have changed during their life span.

(4) Fragmentation (Frag): the number of times a trajectory is interrupted due to missing detections.

(5) FP and FN: the total number of false positives and false negatives (missed targets).

(6) Multiobject Tracking Accuracy (MOTA): it combines three error sources, i.e., FP, FN, and IDS [44]. Equation (11) shows the computation of the MOTA, where $t$ is the index of the frame and $GT_t$ is the number of ground-truth objects in frame $t$:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( FN_t + FP_t + IDS_t \right)}{\sum_t GT_t}. \tag{11}$$

(7) Multiobject Tracking Precision (MOTP): the alignment accuracy between the annotated and the predicted bounding boxes [44].
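As a worked example of the MOTA definition, the following sketch computes the score from per-frame counts. The function name and the list-based inputs are assumptions for illustration; the official benchmark tools also handle the matching between predictions and ground truth, which is not shown here.

```python
def mota(fn, fp, ids, gt):
    """MOTA from per-frame counts of misses, false positives, switches, and GT."""
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - (sum(fn) + sum(fp) + sum(ids)) / total_gt
```

For two frames with five ground-truth objects each, one miss in the first frame and one false positive in the second give a MOTA of 1 − 2/10 = 0.8. The score can become negative when the number of errors exceeds the number of ground-truth objects.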

4.3. Training Parameters

This paper trained the 2D detector, i.e., YOLOv3, on the training set of the KITTI 2D object detection benchmark [5], and trained the 3D detector, i.e., PointRCNN, on the training set of the KITTI 3D object detection benchmark [36]. The IOU threshold of the NMS module was set to 0.7, and the minimum number of matched frames required to create a new trajectory was set to 3. A maximum number of frames for which a lost target is retained was also fixed. Because the prediction of the Kalman Filter is rough and the KITTI dataset contains many scenes with long-term partial occlusions, the weight of the combined association metric was chosen to emphasize the appearance metric.

4.4. Qualitative Evaluation

We evaluated the proposed tracking method qualitatively by using the KITTI test sequence. Different scenarios including occlusions, clutter, parked vehicles, and false positives from detectors were considered in the qualitative evaluation.

Figure 7 shows an example from test sequence 0 of the test set. Each vehicle was assigned a tracking ID as a reference. Although the vehicles are parked densely and irregularly, the proposed MOT method can continuously detect and track them. Moreover, the figure shows that, because the image is easily affected by the environment, for example by illumination changes and partial occlusion, the shape of the detected target changes, and the scale of the target objects may differ considerably. Even in this case, the proposed MOT method obtained a relatively high tracking performance. The experimental results show that our method can locate each car well even in cluttered scenes with strong lighting and keep the ID of each car unchanged.

Figure 8 shows another example from the test sequence 1. Figure 8(a) shows that the object detector produces a false detection result, and Figure 8(b) shows the false positive of the detector is overcome by data association. In the case of transient errors in object detection, the proposed MOT method can still track the target stably. Hence, these experimental results demonstrated the robustness of the proposed MOT method.

4.5. Benchmark Results

We further evaluated the proposed MOT method on the KITTI Tracking Benchmark. In this evaluation, we considered some published online MOT methods for comparison. The results are presented in Table 2. It can be seen that the proposed MOT method is very competitive. In particular, the proposed MOT method returns the fewest identity switches, while maintaining competitive MOTA scores, MOTP scores, and track fragmentations. The tracking accuracy is mainly affected by a large number of false positives. Given their overall impact on the MOTA score, the combination of the 2D and 3D object detection results can significantly improve the performance of MOT. Besides, because we set a maximum allowed track age and associate the object motion information with the appearance features, the proposed MOT method has the fewest identity switches. Therefore, the proposed MOT method can generate a relatively stable trajectory of the target object.

4.6. Ablation Study

The ablation study evaluated the effects of the hyperparameters on the performance of the proposed MOT method. Table 3 shows the results of the ablation study on the KITTI benchmark. The first hyperparameter is the IOU threshold of the NMS module, and the second is the minimum number of matching frames required to create a new trajectory. From the table, it can be seen that a smaller IOU threshold may miss some correct detection results, because the number of detected objects is reduced. A larger IOU threshold may admit some wrong detection results, which is also the reason why it produces the most IDS. Starting a track immediately when a new target is detected, that is, requiring only a single matching frame, leads to more IDS and FRAG. Another setting of the minimum number of matching frames yields the fewest IDS, but its MOTA is lower. Therefore, we finally set the IOU threshold to 0.7 and the minimum number of matching frames to 3.

5. Conclusion

This paper proposed a multimodal MOT method by fusing the motion information and the deep appearance feature of objects. In this method, we use a Nonmaximum Suppression algorithm to combine a 2D object detector and a 3D object detector for robust object detection. Then, the deep appearance features of objects are learned by a CNN, and the motion information of objects is computed by the Kalman Filter. The MOT task is achieved by associating the appearance features and the motion information of the target object. The effectiveness of the proposed MOT method was demonstrated in a set of experiments. The proposed MOT method can track objects stably in crowded scenes and effectively avoid false detection. In the KITTI tracking benchmark, the proposed method also shows competitive results.

Although 3D object detection is used in the proposed MOT method, it only serves as auxiliary information for 2D object detection. 3D object detection can provide accurate position and size estimation for autonomous driving. Therefore, our future work will focus on 3D multitarget tracking that can adapt to more complex environments.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was supported by the National Natural Science Foundation of China (Project no. 61673115). This work was also partly funded by the German Science Foundation (DFG) and National Science Foundation of China (NSFC) in project Cross Modal Learning under contract Sonderforschungsbereich Transregio 169.