Abstract

Target tracking has long been a popular research area in computer vision, and many important methods have been proposed. However, most methods can handle only partial and slight occlusion. If the target is lost, a common solution is to keep detecting, reidentify the target when it reappears, and then link the broken tracks together, but this makes tracking discontinuous. There are two key points in this problem: continuous tracking and occlusion judgment. In this paper, we propose a target tracking method with a short-time prediction function to solve this problem. For continuous tracking, we establish a 3D dynamic model to estimate the motion state of the target in each frame. For occlusion judgment, we use a depth prediction network to estimate the depth of the target and then determine from the depth whether the target is occluded. Without relying on depth sensors or multiple cameras, we achieve depth estimation using only a single monocular image, which greatly expands the applicability of our method. Benefiting from the introduction of motion estimation and depth prediction, the tracking accuracy of our method is significantly improved, in particular its robustness to occlusion. Even when the target is completely occluded, it can be tracked for a short time without reidentification. In addition, knowledge distillation increases the speed of depth prediction by 2.08 times, and the final tracking speed reaches 52.6 Hz on a GPU, which meets real-time tracking requirements.

1. Introduction

Modern society produces a large number of videos every day. As an important means of video analysis, video object tracking has a wide range of applications, such as autonomous driving [1], robotics [2], and augmented reality [3]. Although great progress has been made in recent years, most methods are based on the assumption that the target is visible. Therefore, these methods can handle only partial and slight occlusion. In daily life, however, the target is often completely obscured. There are some ways to solve this problem, the most commonly used being reidentification, but this breaks the continuity of tracking. Even if the target is not visible, it still exists, and we should infer its location based on prior experience. In some cases, such as online intelligent driving, we cannot afford the consequences of ignoring a completely missing target. To solve this problem, we can start from two aspects: continuous tracking and occlusion judgment.

The key to continuous tracking is that the tracker should be able to use the historical information of the target’s motion to estimate the current motion state of the target. We usually assume that the motion of the target conforms to certain rules, so we can establish a motion model. Even if the target is lost, we can still infer its state according to the model. Regression models are common prediction models in engineering, but they are not applicable to state prediction in tracking, because a regression model requires that all points in the state space roughly conform to a known relationship, such as a linear or polynomial relationship, which is almost impossible to satisfy in visual tracking. In addition, a regression model reflects the overall trend, and the prediction error at a single point may be very large, which may even lead directly to tracking failure. A more practical prediction method is Kalman filtering (KF) [4], which has been widely used in tracking problems. It first predicts the ideal state value through the state transition equation and then corrects the predicted state value and the model based on the actual measured value. Compared with regression models, KF has three advantages. The first is low computational cost: only five formulas need to be evaluated in each time step. The second is flexibility: we can customize the state values of the target to be estimated, such as coordinates, aspect ratio, speed, and acceleration, and we can also define the relationship between state values ourselves. The third is strong adaptability: some parameters in the KF model can be adjusted over time to ensure that the model conforms to the current motion state as closely as possible. For these reasons, KF is used for state estimation in this paper.

Occlusion judgment is a very challenging problem, especially for 2D tracking. In visual tracking, occlusion develops frame by frame, and the target and the occluder gradually blend together. There is no general and effective way to separate the target from the occluder or to measure the level of occlusion. In 3D tracking, the tracker usually calculates the distance between the target and the camera, that is, the depth. It is then easy to determine whether and how much the target is occluded by simply checking the depth value of key points on the target. Since the occluder is nearer to the camera than the target, the depth measured in the target region changes abruptly once occlusion occurs (the depth drops, or equivalently the inverse depth rises). The only problem is how to obtain the depth value. The most direct way to obtain depth information is to use a depth sensor, such as a Kinect or a laser scanner. Although hardware-based approaches can obtain accurate depths, the dependence on hardware greatly limits their applicability, and they cannot handle images that were captured without depth information. For most tracking tasks, we only need to distinguish the relative position of each object in space, not the absolute position, so estimating the relative position of the object through a model is a more practical solution. This has been a hot topic in 3D tracking in recent years, and we discuss it in detail in Section 2. In this paper, we adopt a practical neural network [5] to predict the depth because of its excellent performance and its published depth dataset. To speed up depth prediction, we apply knowledge distillation to the original network [5] and obtain a substantial improvement.

Another point to focus on is target detection. Most state-of-the-art (SOTA) tracking methods follow the tracking-by-detection (TBD) framework, and the performance of the detector directly affects the accuracy of target localization. Before the rise of deep learning, landmark target detection methods included the Viola-Jones detector [6], the HOG detector [7], and the Deformable Part-based Model (DPM) [8]. These carefully designed hand-crafted feature detectors often achieve good results on specific tasks but have poor generalization capabilities. Since 2014, deep learning-based target detection methods have continuously set new records, and many classic methods have been proposed, such as RCNN [9], Faster RCNN [10], YOLO [11], SSD [12], RetinaNet [13], and Fast YOLO [14]. Here, we select Fast YOLO [14] as our private target detector because of its high speed.

In this paper, we study the pedestrian tracking problem in video surveillance and aim to achieve real-time continuous 3D tracking using a single camera. Inspired by SORT [15], we propose a multipedestrian tracking method incorporating depth information. The basic idea is to use a Kalman filter to continuously estimate the motion state parameters of each target. To ensure the continuity of tracking, we rely on prediction when occlusion happens or when the track does not match any target. The measurement of a KF consists of two parts, plane information and depth information, which are obtained by a target detection network and a depth prediction network, respectively. To ensure real-time tracking, the neural networks are trained in advance and are not updated online. Figure 1 shows the tracking results of our method. When pedestrians are partially or completely occluded, our tracker keeps tracking. In contrast, most trackers give up tracking once they cannot detect the targets.

To summarize, this paper presents a method for 3D tracking with a fixed monocular camera. The contributions of our work are summarized as follows:
(1) We utilize KF’s short-term prediction capabilities to achieve continuous tracking. The original SORT method cannot detect occlusion, so when the target is occluded, its trajectory has to be ended. In contrast, we introduce depth information, which makes occlusion detection very easy.
(2) We fuse the uncalibrated depth information to achieve 3D tracking on 2D images. We use a neural network to estimate the depth information of the target in real time, thereby turning the original 2D tracking into 3D tracking.
(3) We take a series of measures to speed up tracking. The most important strategy is knowledge distillation of the depth prediction network. In addition, we select a faster target detector and remove the reidentification step that is usually used in MOT methods.

The remainder of this paper is organized as follows. Section 2 introduces some related studies about multitarget detection, tracking, and depth prediction. Section 3 presents the overall architecture of our method and details its improved parts. In Section 4, the experimental results verify the effectiveness of the proposed method. In Section 5, we provide a comprehensive summary of this study, and a future research direction is presented.

2.1. Multiple Pedestrian Tracking

Compared with single object tracking (SOT), multiple object tracking (MOT) is more complicated because there are extra issues to be considered, such as matching trajectories to targets and target reidentification. Multiple pedestrian tracking (MPT) has become the main research direction of MOT. The related research focuses on four areas of improvement: (a) designing association methods, (b) jointly leveraging other vision tasks, (c) applying deep learning to MPT, and (d) multimodality-based MPT.

The core of the TBD framework is data association, and some classical association methods are still used as basic algorithms. The Hungarian method is a classic algorithm for solving the minimum weight matching problem on bipartite graphs and was introduced into MPT by Singh et al. [16]. Although the algorithm is fast, its accuracy is not high because of its locally optimal nature. NF [17] extends local optimization to global optimization, and CRF [18] further considers association dependencies. GMMCP [19] was proposed to address the problem of high computational complexity. Unlike these graph-based methods, MCSM [20] formulates association as a minimum cost subgraph multicut problem that jointly links and clusters multiple plausible person detections over time and space.

Several researchers have leveraged other vision tasks to improve tracking performance. One approach is to treat MOT as an extension of SOT. For example, Hu et al. [21] use Siamese-RPN to locate the target. Another is to combine tracking with image segmentation. Voigtlaender et al. [22] propose a new baseline method that jointly addresses detection, tracking, and segmentation with a single convolutional network.

Since Kim et al. [23] first utilized a CNN to extract 4096-dimensional features for each detection box, deep neural network models have been widely used in MPT, such as VGGNet [24, 25] and GoogleNet [21, 26]. When a single type of data is unreliable, researchers turn to multimodal data: for example, Zhang et al. [27] use image and point cloud features, and Gautam et al. [28] use image and radar features.

2.2. Depth Prediction

Scene depth estimation is a long-standing problem in vision. The most direct way to obtain depth information is to use a depth sensor, such as a Kinect or a laser scanner. Although sensors can capture accurate depth information, they are applicable only to specific scenes and cannot process existing RGB images. In recent years, using deep neural networks to estimate scene depth has become a mainstream research direction. Various DCNN-based methods for depth prediction focus on designing structural features. Fu et al. [29] propose an encoder-decoder network, which extracts multiscale features from the encoder and is trained in an end-to-end manner without iterative refinement. Jiao et al. [30] propose an attention-driven loss, which merges semantic priors to improve prediction precision on datasets with unbalanced distributions. Chen et al. [31] apply generative adversarial training to lead the network to learn a context-aware, patch-level loss automatically.

The basic idea of these methods is to train a deep network model with an RGBD dataset and then reconstruct the 3D structure of the target from the RGB image to predict the depth information. Deep network models are data-driven and achieve good results when trained with large numbers of samples, but obtaining large amounts of RGBD data is still not an easy task. Silberman et al. [32] constructed the NYU dataset using Kinect, but it is limited to indoor scenes. Although the Make3D [33] and KITTI [34] datasets, built with laser scanners, cover outdoor scenes, they were collected in specific scenarios (a university campus and atop a car, respectively). Another way to collect depth data for training is to ask people to manually annotate depth in images, but this is time-consuming and laborious and yields only the relative position of objects. Estimating geometry from Internet photo collections has been an active research area for a decade. Li et al. [5] propose a method to generate a dataset of essentially unlimited size. First, a large number of images are collected from the Internet; then, structure-from-motion (SfM) and multiview stereo (MVS) methods are used to generate depth maps, which are further processed to form the large-scale depth dataset MegaDepth (MD).

2.3. Pedestrian Detection

Deep feature-based detection approaches have become the main direction of pedestrian detection research because of their SOTA performance. In fact, many general detection methods are also suitable for pedestrian detection, so we do not distinguish between them. We refer readers to the surveys [35, 36], which give a detailed summary of target detection. Here, we review only some classical algorithms. Ren et al. [37] propose a Recurrent Rolling Convolution (RRC) architecture, which can selectively integrate contextual information into the bounding box regressor. Liu et al. [12] propose SSD, the first deep network-based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. Liu et al. [38] propose ALFNet, which trains the SSD in multiple steps and significantly improves the accuracy of pedestrian detection while maintaining the efficiency of a single-stage detector. Most detection methods are trained on datasets without occlusion or with reasonable occlusion, and once heavy occlusion occurs, their performance decreases significantly. Therefore, recent benchmarks pay special attention to pedestrian detection under heavy occlusion. Zhang et al. [39] design a new regression loss and introduce a part occlusion-aware region of interest (PORoI) pooling unit to solve the problem of occluded pedestrian detection in crowded scenes. Tian et al. [40] design a set of component detectors, each of which handles a specific occlusion mode. Zhou and Yuan [41] use a neural network to locate the full body and the visible part of a pedestrian, respectively.

3. Method

The flowchart of the proposed multipedestrian tracking algorithm is shown in Figure 2. First, camera motion is calculated to reduce the global error. Then, each KF predicts its trajectory at time $k$, denoted as $\hat{x}_{k|k-1}$, according to the previously estimated trajectory. Next, a target detection network and a depth prediction network extract the plane information and the depth information of the target, respectively, from the image at the current time $k$; together these constitute the measurement for the KF. After that, the tracks and the detected targets are matched with the Hungarian method. Matching has three possible outcomes, each with its own processing. (1) Unmatched target: if a target does not match any track, the target is treated as the starting point of a new track and a new KF is initialized. (2) Unmatched track: if a track has not matched any target for several consecutive frames, the track is considered to have lost the target completely, and continuing to track is meaningless, so the track is deleted from the tracking system. Otherwise, we treat the target as temporarily missing and update the KF. (3) Matched track: if a track matches a target, the system checks for occlusion and then updates the KF. Finally, the newly added tracks and the updated tracks compose the tracking results at time $k$. In the following subsections, we elaborate on several key points of the algorithm.
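The matching step and its three outcomes can be illustrated with a minimal sketch. It assumes an IoU-based cost between predicted track boxes and detected boxes (this cost is an assumption; the cost is not specified here) and, for brevity, replaces the Hungarian method with an exhaustive search over assignments, which finds the same optimal assignment on small inputs but is not the algorithm used in the paper. The names `iou`, `match`, and `min_iou` are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match(tracks, detections, min_iou=0.3):
    """Return (matches, unmatched_tracks, unmatched_detections) as index lists.

    Exhaustive search stands in for the Hungarian method (small inputs only).
    min_iou is a hypothetical gate that rejects implausible pairs.
    """
    n_t, n_d = len(tracks), len(detections)
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p)) for p in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(p, range(n_d))) for p in permutations(range(n_t), n_d))
    best_pairs, best_score = [], -1.0
    for pairs in candidates:
        score = sum(iou(tracks[t], detections[d]) for t, d in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    matches = [(t, d) for t, d in best_pairs if iou(tracks[t], detections[d]) >= min_iou]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_tracks = [t for t in range(n_t) if t not in matched_t]
    unmatched_dets = [d for d in range(n_d) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched detections would start new tracks, unmatched tracks would fall back on prediction, and matched pairs would go through the occlusion check before the KF update.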

3.1. State Estimation
3.1.1. KF Prediction

The Kalman filter is a classic time series estimation method that is well suited to estimating the motion state of the target. The state can be any parameter related to the target, such as center position, height, aspect ratio, and their respective velocities in image coordinates. According to the state of the target at the previous time, KF calculates a prediction value at time $k$ by

$$\hat{x}_{k|k-1} = f(\hat{x}_{k-1}) + w_k,$$

where $w_k$ is a Gaussian noise that reflects the accuracy of the process model and $f$ is a function that describes the state change law of the target in two adjacent frames. In multitarget tracking, a constant noise model is usually assumed.

3.1.2. Observation Calculation

A target detection network and a depth prediction network process the image at time $k$ and obtain the plane information and the depth map of the targets, respectively, which together constitute the measurement value. We adopt Fast YOLO [14] for target detection, which greatly reduces the number of deep inferences and speeds up detection. For depth prediction, we follow the open-source monocular depth estimation method of [5]. To reduce computation, we use knowledge distillation to transfer the original depth prediction network to a 5-layer student CNN. The teacher network is trained on the MD dataset [5]. Then, the student network is further trained to predict semantic segmentation maps from the depth image. Eventually, the accuracy of the student network is close to that of the teacher network, but the model is much smaller, which accelerates computation. Both networks are trained offline and are not updated online during tracking.
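The idea of distillation can be shown with a toy sketch in which a small student is fitted to the outputs of a teacher rather than to ground-truth depth. Everything here is a stand-in chosen for illustration: the `teacher` function, the linear student `a*x + b`, the squared-error loss, and the learning rate are assumptions, whereas the student in this paper is a 5-layer CNN whose training loss is not specified.

```python
def teacher(x):
    """Stand-in for a trained teacher depth network (hypothetical)."""
    return 2.0 * x + 1.0

def distill(inputs, steps=2000, lr=0.5):
    """Fit a linear student a*x + b to the teacher's outputs by gradient descent."""
    a, b = 0.0, 0.0
    # Soft targets come from the teacher; no ground-truth depth is used.
    targets = [teacher(x) for x in inputs]
    n = len(inputs)
    for _ in range(steps):
        errors = [a * x + b - t for x, t in zip(inputs, targets)]
        # Gradients of the mean squared error with respect to a and b.
        grad_a = 2.0 * sum(e * x for e, x in zip(errors, inputs)) / n
        grad_b = 2.0 * sum(errors) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

student_a, student_b = distill([0.0, 0.25, 0.5, 0.75, 1.0])
```

After training, the student reproduces the teacher's outputs on these inputs while having far fewer parameters than a real teacher network would, which is the source of the speedup.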

3.1.3. KF Update

The final state value is composed of two parts, the predicted value and the correction term:

$$\hat{x}_k = \hat{x}_{k|k-1} + K_k\left(z_k - H\hat{x}_{k|k-1}\right), \tag{1}$$

where $K_k$ is the Kalman gain, $z_k$ is the measurement, and $H$ is the transition matrix from the state to the measurement. For multitarget tracking, we match tracks with targets and get three results: a target does not match any track, a track matches a target, and a track does not match any target. The first case indicates that the target has newly emerged, so a new KF needs to be instantiated. Both the second and third cases can be calculated by Equation (1), which will be detailed in Section 3.3.
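As a minimal sketch of the predict and update cycle, the following code implements the five standard Kalman filter formulas for a one-dimensional constant-velocity state `[position, velocity]` with a unit time step and a position-only measurement. The reduced state, the function names, and the noise parameters `q` and `r` are assumptions made for illustration; the state used in this paper has more dimensions.

```python
def kf_predict(x, P, q):
    """Predict step: x' = F x and P' = F P F^T + Q, with F = [[1, 1], [0, 1]], Q = q*I."""
    pos, vel = x
    x_pred = [pos + vel, vel]
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    P_pred = [[p00 + p01 + p10 + p11 + q, p01 + p11],
              [p10 + p11, p11 + q]]
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, r):
    """Update step with H = [1, 0]: gain, state correction, covariance correction."""
    s = P_pred[0][0] + r                               # innovation covariance
    k0, k1 = P_pred[0][0] / s, P_pred[1][0] / s        # Kalman gain K = P' H^T / s
    y = z - x_pred[0]                                  # innovation z - H x'
    x = [x_pred[0] + k0 * y, x_pred[1] + k1 * y]       # x = x' + K y
    P = [[(1 - k0) * P_pred[0][0], (1 - k0) * P_pred[0][1]],
         [P_pred[1][0] - k1 * P_pred[0][0], P_pred[1][1] - k1 * P_pred[0][1]]]
    return x, P

# Usage: follow a target that moves 2 units per frame, starting from a vague prior.
x, P = [0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]]
for k in range(1, 31):
    x, P = kf_predict(x, P, q=0.01)
    x, P = kf_update(x, P, z=2.0 * k, r=1.0)
```

When no usable measurement is available, calling only `kf_predict` advances the state along the estimated velocity, which is what allows a track to continue for a short time through occlusion.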

3.2. Constructing 3D Motion Model

Equipped with depth estimates, we construct a 3D linear motion model with a constant velocity assumption. The state of each target is modelled on the ten dimensional state space that contains the bounding box center position, aspect ratio, height, and their respective velocities. The observations of the target are .

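The state change of the target between two adjacent moments can be described by the state equation

$$x_k = F x_{k-1} + w_k, \qquad w_k \sim \mathcal{N}(0, Q),$$

where $F$ is the state transition matrix of the constant velocity model, $w_k$ is the process noise, and $Q$ is the process noise covariance matrix. For a position or shape component $s$ of the state with velocity $\dot{s}$, and with the frame interval taken as the unit of time, this gives

$$s_k = s_{k-1} + \dot{s}_{k-1} + w_{s,k}, \qquad \dot{s}_k = \dot{s}_{k-1} + w_{\dot{s},k}.$$

The equations of the remaining components have the same form. The intrinsic relationship between state and observation can be described by the measurement equation

$$z_k = H x_k + v_k, \qquad v_k \sim \mathcal{N}(0, R),$$

where $v_k$ is the observation noise and $R$ is the observation noise covariance matrix.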

Inverse depth is a commonly used representation because it can represent points at infinity and model uncertainty in pixel disparity space, so the depth used here is the inverse depth. To simplify notation, we assume that the camera focal length is a constant. In fact, the focal length can be folded into a motion noise parameter and can be easily tuned on a training set. Then, the adjusted parameters of the bounding boxes are as follows:

This means that we dynamically scale the object with inverse depth. If depths are smooth over time, we can take the depth at the previous frame as an approximation of the depth at the current frame, so we derive the following formulas from Equation (3), which is

The equation suggests that one can approximately apply a Kalman filter on 2D image measurements augmented with a temporal noise model that is scaled by the estimated inverse-depth of the object.
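The scaling by inverse depth can be illustrated with a short derivation for one lateral coordinate under a pinhole camera assumption. The symbols $X$ (lateral 3D position), $Z$ (depth), $d$ (inverse depth), $u$ (image coordinate), $f$ (focal length), and $w_k$ (process noise) are introduced here for illustration and are not necessarily those of the original equations.

```latex
\begin{align}
u &= \frac{f X}{Z} = f\,d\,X, \qquad d = \frac{1}{Z}
  && \text{(pinhole projection with inverse depth } d\text{)} \\
X_k &= X_{k-1} + \dot{X}_{k-1} + w_k
  && \text{(constant velocity in 3D with process noise } w_k\text{)} \\
u_k &= f\,d_k X_k \approx u_{k-1} + f\,d_{k-1}\dot{X}_{k-1} + f\,d_{k-1}\,w_k
  && \text{(using } d_k \approx d_{k-1}\text{)}
\end{align}
```

In image coordinates, both the velocity term and the noise term are multiplied by $f\,d_{k-1}$, so a distant target (small inverse depth) moves less and carries less process noise in pixels than a nearby one.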

3.3. Occlusion and Unmatched Tracks

Because depth information is available, we can easily determine whether the target is occluded by comparing the depth values in the observation and in the prediction, because objects with smaller depth are always in front of objects with larger depth. To avoid accidental errors, we take the average depth of all points near the predicted location as the observed depth. The size of this area is set relative to the predicted bounding box. When the comparison indicates that a nearer object covers the predicted location, we consider that the target is occluded; the observation then carries the information of the occluder rather than the target, so the state cannot be directly calculated by Equation (1). In this case, we take the KF prediction as the approximate state value, which amounts to assuming that the observation agrees with the prediction without error.
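The occlusion test can be sketched as follows. The sketch assumes that the depth map stores depth values in which smaller means nearer, that the averaging region is a centered sub-box whose relative size is `shrink`, and that occlusion is declared when the observed depth falls below the predicted depth by a relative `margin`. The values of `shrink` and `margin` and both function names are hypothetical, since the actual region size and threshold are not given here.

```python
def mean_depth_in_region(depth_map, box, shrink=0.5):
    """Average depth inside a centered sub-region of box = (x1, y1, x2, y2).

    depth_map is a list of rows; shrink is a hypothetical relative region size.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    hw, hh = (x2 - x1) * shrink / 2.0, (y2 - y1) * shrink / 2.0
    rows = range(max(0, int(cy - hh)), min(len(depth_map), int(cy + hh) + 1))
    cols = range(max(0, int(cx - hw)), min(len(depth_map[0]), int(cx + hw) + 1))
    values = [depth_map[r][c] for r in rows for c in cols]
    return sum(values) / len(values) if values else None

def is_occluded(observed_depth, predicted_depth, margin=0.2):
    """True if something clearly nearer than the predicted target depth is observed."""
    return observed_depth < predicted_depth * (1.0 - margin)
```

Averaging over a region rather than reading a single pixel reduces the effect of isolated errors in the predicted depth map, as described above.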

If a track does not match any target, it may be because the target has moved out of the image, was not detected, or was not matched. To reduce the impact of missed detections or matching errors, we introduce a counter for each track that counts the number of frames since the last successful measurement association. When the track matches a target, the counter is reset to 0. When the track does not match any target, KF continues to predict the state of the target while the counter increases. If the track does not match any target for several consecutive frames, there is a high probability that the target is lost, and the tracking of this track should be ended. Note that measurement association also works as identification. In most MOT methods, once a target is lost, its track breaks off, so when a new target appears, an extra reidentification step is needed to judge whether it is a new target or a lost one. In our method, however, thanks to KF and occlusion detection, the track is not interrupted, so reidentification is omitted.

In short, there are two ways to update the state according to the situation:

In our experiment, we set.

3.4. Camera Motion

Camera motion is an important factor in visual tracking: it not only changes the coordinates of the object but may also blur the image. In most videos, the motion of dynamic objects is assumed to be small relative to the scene motion, so we use an image alignment algorithm to approximate camera motion, following the practical approach of Philipp et al. [42]. We first estimate a nonlinear pixel warp between neighbouring frames, which maps pixel coordinates in one frame to the next, and then use this warp to align boxes forecasted using frames up to $k-1$ with frame $k$.
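Aligning a forecasted box with the new frame can be sketched as warping its corners and taking the enclosing axis-aligned box. The `warp` callable is hypothetical: it stands for whatever pixel mapping the image alignment step produces, and the pure translation in the usage lines is only a simple stand-in for a nonlinear warp.

```python
def align_box(box, warp):
    """Map box = (x1, y1, x2, y2) from the previous frame into the current frame.

    warp is a hypothetical callable (x, y) -> (x', y') estimated by image alignment.
    """
    x1, y1, x2, y2 = box
    corners = [warp(x, y) for x, y in ((x1, y1), (x2, y1), (x1, y2), (x2, y2))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    # The enclosing axis-aligned box keeps the result in the same box format.
    return (min(xs), min(ys), max(xs), max(ys))

# Usage with a stand-in warp: the camera shifted the scene by (+5, -2) pixels.
shifted = align_box((0, 0, 10, 20), lambda x, y: (x + 5, y - 2))
```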

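Finally, we summarize the proposed algorithm in pseudocode as follows (Algorithm 1).

Input: the state vector at time k-1, the image at time k, the max unmatched consecutive frames
Output: the state vector at time k
1: estimate camera motion;
2: calculate the predicted state vector;
3: detect all pedestrians on the image;
4: predict the depth of the image;
5: match each track with the detected targets;
6: if a target does not match any track
  go to step 7;
 else if a track does not match any target
  count the number of unmatched frames, and go to step 8;
 else
  go to step 9;
7: initialize a new KF tracker with the target;
8: if the number of unmatched frames is still within the limit
  take the KF prediction as the state;
 else
  delete the track;
9: if the target is occluded
  take the KF prediction as the state;
 else
  calculate the state by Equation (1);
10: update KF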
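Steps 8 to 10 for a single track can be sketched in Python as follows. The `Track` class, the `step_track` function, and the `correct` callable (a stand-in for the correction of Equation (1)) are hypothetical names, and deleting a track once its counter exceeds `max_misses` is one possible reading of the limit on unmatched consecutive frames.

```python
from dataclasses import dataclass

@dataclass
class Track:
    state: list        # current state estimate
    predicted: list    # KF prediction for the current frame
    misses: int = 0    # frames since the last successful association

def step_track(track, measurement, occluded, max_misses, correct):
    """Update one track; return False if the track should be deleted.

    measurement is None for an unmatched track. correct(predicted, measurement)
    is a hypothetical stand-in for the KF correction.
    """
    if measurement is None:              # unmatched track: count and predict
        track.misses += 1
        if track.misses > max_misses:
            return False
        track.state = track.predicted
    elif occluded:                       # matched, but the measurement is the occluder
        track.misses = 0
        track.state = track.predicted
    else:                                # matched and visible: normal correction
        track.misses = 0
        track.state = correct(track.predicted, measurement)
    return True
```

Because an unmatched or occluded track keeps its identity while it coasts on predictions, no separate reidentification step is needed when the target becomes visible again within the limit.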

4. Results and Discussion

We conducted tracking experiments on the popular multitarget tracking datasets, and the results will be shown in Section 4.3. In addition, we also discuss the effect of depth prediction in Section 4.2.

4.1. Experiment Setting

We evaluated our method on two popular MOT datasets: MOT17 [43] and MOT20 [44]. MOT17 contains 14 videos, 7 for training and 7 for testing. Faster R-CNN [10], SDP [45], and DPM [8] are provided as public target detectors. MOT20 contains 8 videos, 4 for training and 4 for testing, and only Faster R-CNN [10] is provided. Both datasets are very challenging, including crowded scenes with heavy occlusions, camera motion, and both day and night sequences.

To evaluate the performance of the tracking methods, we adopted the widely used CLEAR MOT metrics [46]. Multiple object tracking accuracy (MOTA) combines false positives (FP), false negatives (FN), and identity switches (IDS) into a single accuracy score, where IDS is the total number of identity switches. IDF1, MT, ML, and Hz are also considered. IDF1 is the ratio of correctly identified detections to the average number of ground-truth and computed detections, and it indicates the average maximum consistent tracking rate. MT is the number of mostly tracked trajectories, that is, those tracked successfully for at least 80% of their length. ML is the number of mostly lost trajectories, that is, those tracked successfully for at most 20% of their length. Hz indicates the processing speed (in frames per second). Among these metrics, MOTA and IDF1 are usually considered the most important.
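For reference, the two main scores can be computed from the underlying counts as in the sketch below, which follows the standard definitions of MOTA and IDF1. The function and argument names are chosen here for illustration, and the thresholds in `trajectory_class` are the 80% and 20% limits stated above.

```python
def mota(fn, fp, ids, num_gt):
    """MOTA = 1 - (FN + FP + IDS) / number of ground-truth detections."""
    return 1.0 - (fn + fp + ids) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)

def trajectory_class(tracked_ratio):
    """Classify a ground-truth trajectory by the fraction of its length that is tracked."""
    if tracked_ratio >= 0.8:
        return "MT"   # mostly tracked
    if tracked_ratio <= 0.2:
        return "ML"   # mostly lost
    return "PT"       # partially tracked
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth detections.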

For target detection, both the public detectors provided with the datasets and the private detector (Fast YOLO [14]) are used to allow a thorough comparison. The private detector is trained on the 7 training videos in MOT17. All models are trained in advance and are not updated at run time. Our experiments were conducted in PyTorch on a desktop with an Intel(R) Xeon(R) CPU and a 1080Ti GPU.

4.2. Pedestrian Detection and Depth Prediction
4.2.1. Pedestrian Detection

To evaluate pedestrian detection, we tested Faster R-CNN [10], SDP [45], DPM [8], and Fast YOLO [14] on MOT17. Since many well-trained network models are available, we simply fine-tuned them on the training set and then detected pedestrians on the test set. Table 1 shows the results. An up arrow indicates that larger values are better, a down arrow indicates that smaller values are better, and italic values indicate the best result. AP, MODA, MODP, and FAF denote the average precision taken over a set of reference recall values, multiobject detection accuracy, multiobject detection precision, and the average number of false alarms per frame, respectively. SDP has the best detection performance in most metrics; however, it runs at only 0.6 FPS. Although Fast YOLO obtains the second-best detection results, it is 245 times faster than SDP. For target tracking, trading a very small loss in detection accuracy for such a large gain in detection speed is clearly worthwhile.

4.2.2. Depth Prediction

To evaluate depth prediction, we selected many pictures with different backgrounds for experiments, and five representative results are shown in Figure 3. Pedestrians in all images are correctly detected. The depth prediction is satisfactory and can distinguish people from the background. After careful observation, we found that depth prediction has two characteristics. The first is that the estimation is more accurate for large targets. If the target is small, the depth difference between the target and the background is smaller, and the target is not well marked on the depth map. As shown in Figure 3(d), the depth of the car on the right is similar to that of the person on the left, but the outline of the car is obviously clearer. The second is that the estimation is more accurate against a simple background. The depth distinction between pedestrians and background is obvious in Figures 3(a) and 3(b), while in the depth maps in Figures 3(c) to 3(e), some people are difficult to distinguish from the background. In Figure 3(a), it is easy to see the difference in depth between the pedestrians and the car, which reflects the advantage of using depth to judge occlusion. In Figure 3(b), the background can be divided into three parts: the ground, the building at the left rear, and the trees at the right rear. Because the building and trees are far away from the people, they have little influence on the depth prediction of the pedestrians, so the depth image is also very clear. By comparison, in Figure 3(c), the depth of the people nearby is obvious, while the people far away are difficult to distinguish from the background. Interestingly, in the depth map of Figure 3(d), the pedestrian in the middle seems to have blended into the background, while the rear car is vaguely discernible, which seems to be related to pixel contrast. In addition, text on the image also has a significant impact. In the depth map in Figure 3(e), the human silhouette is not clear, probably because of strong interference from the ground. Although the result of depth estimation is sometimes not very accurate, it can help us judge occlusion as long as the target can be distinguished from the background depth.

4.3. Pedestrian Tracking
4.3.1. Results on MOT17 and MOT20

Our goal is to build a real-time tracking system, so the algorithms that we select from MOT17 and MOT20 for comparison all reach a speed of at least 25 frames per second. In addition, all of these algorithms have been published in papers.

Table 2 shows the results on MOT17. IOU [47], SORT [15], GMPHD_Rd [48], and FlowTracker [49] are selected as baselines because most of them also appear in MOT20, which helps the comparison. All trackers use the same public detectors (Faster R-CNN [10], SDP [45], and DPM [8]). Our method achieves the best performance on MOTA, IDF1, MT, ML, FN, and IDS. IOU [47] obtains the smallest FP score and the highest tracking speed, and our method obtains the second smallest FP score and the third fastest speed. IDS scores largely reflect the ability to track continuously. Our method uses different strategies for different matching results, so it obtains a better IDS score.

The results on MOT20 are shown in Table 3. Compared with MOT17, MOT20 contains fewer frames and fewer trajectories. Because the scenes are more crowded and contain more pedestrians, the detection task is much more challenging. On the whole, tracking is significantly slower, and MT and ML are better, but there is no clear trend in the other metrics. Our method obtains the best scores in most metrics, the exceptions being MT, IDS, and Hz.

Next, we compared our method with more published real-time trackers on MOT17. Ten trackers are selected, including 4 trackers using private detectors (TrTrack [51], Fair [52], RekTCL [53], TraDeS [54]) and 6 trackers using public detectors (GMPHDOGM [55], GMPHD_Rd [48], PHD_LMP [56], IOU17 [47], SORT [15], FlowTracker [49]). Table 4 shows the detailed results. Benefiting from better detectors and better training strategies, trackers using private detectors achieve higher MOTA scores than trackers using public detectors. Among the trackers using public detectors, ours_pub performs best. Although ours_pri ranks last among the trackers using private detectors, it has a relatively high speed. Because an additional network is introduced to calculate the depth, the speed of our method is inevitably reduced. In the future, we plan to use lightweight architectures to optimize depth prediction. Better detection and association methods will also be considered to improve accuracy.

We show the relationship between tracker accuracy and speed in Figure 4 (excluding IOU17). The farther to the right a tracker lies, the higher its MOTA score, and the higher it lies, the faster its tracking speed. The figure clearly shows that the methods using private detectors have a great advantage in tracking accuracy and are not slow. Our method is in the middle in both speed and accuracy, and there is still much room for improvement.

4.3.2. Impact of Different Detectors

To explore the influence of different detectors on multiobject tracking, we ran our tracker with public and private detectors on MOT17. The results are shown in Table 5. As expected, a better detector improves the performance of the tracking algorithm, mainly in three respects: the total number of detected targets, the number of correctly recognized targets, and the detection precision. Ranked from high to low MOTA, the detectors are ours_pri, SDP, FrRCNN, and DPM, an ordering that directly reflects detector performance. The private detector performs very well, especially on FN, which is 11704 lower than that of the best public detector. Tracking speed depends not only on detection speed but also on the number of detected targets, because the more targets there are to track, the more computation is needed to match trajectories to targets. The private detector achieves 52.6 Hz, which is much higher than the public detectors.
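The dependence of matching cost on the number of targets can be made concrete with a small sketch. It assumes axis-aligned boxes in (x1, y1, x2, y2) form and a plain pairwise IOU table; the helper names `iou` and `iou_matrix` are hypothetical, and the sketch illustrates only why the work grows with the product of trajectories and detections, not the full association procedure.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_matrix(tracks, detections):
    """One IOU per (trajectory, detection) pair: len(tracks) * len(detections) evaluations."""
    return [[iou(t, d) for d in detections] for t in tracks]


# Hypothetical boxes, for illustration only.
tracks = [(0, 0, 2, 2), (10, 10, 12, 12)]
detections = [(1, 1, 3, 3), (10, 10, 12, 12), (50, 50, 52, 52)]
table = iou_matrix(tracks, detections)
print(len(table) * len(table[0]), "IOU evaluations")
```

Doubling both the number of trajectories and the number of detections quadruples the size of this table, which is consistent with slower tracking in crowded scenes.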

4.3.3. Impact of Distillation

We compare the tracking performance with and without distilling the HG network; the results are shown in Table 6. Tracking accuracy improves as depth prediction accuracy improves after distillation. The most striking change is in speed. The tracking speed was only 17.1 Hz before distillation and reached 52.6 Hz after distillation, an increase of 35.5 Hz, or 2.08 times the original speed (about 3.08 times as fast overall).
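The speed figures can be checked with a few lines of arithmetic. The sketch below uses only the two speeds reported above (17.1 Hz and 52.6 Hz); the function name `speed_change` is hypothetical. It separates the absolute gain, the relative increase, and the overall ratio, which are easy to confuse.

```python
def speed_change(before_hz, after_hz):
    """Return (absolute gain in Hz, relative increase, overall speed ratio)."""
    gain = after_hz - before_hz
    relative_increase = gain / before_hz
    ratio = after_hz / before_hz
    return gain, relative_increase, ratio


gain, relative_increase, ratio = speed_change(17.1, 52.6)
print(f"gain = {gain:.1f} Hz")
print(f"relative increase = {relative_increase:.2f} times the original speed")
print(f"overall ratio = {ratio:.2f}x")
```

The relative increase (gain divided by the original speed) corresponds to the 2.08 figure, while the overall ratio of the two speeds is larger by exactly one.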

4.3.4. Impact of Different Components

To gain deeper insight into our method, we run several experiments to test the effect of each component. The complete method with the private detector is set as the baseline. We remove one component at a time and then evaluate the method on MOT17. The results are shown in Table 7.

KF serves two main functions: predicting the target search area and smoothing the trajectory. Since the detector is not updated online, its performance remains stable, and KF affects the detection result only by changing the target search area. Therefore, the accuracy-related indicator (IDF1) changes little. When occlusion occurs or a trajectory is unmatched, KF can continue to estimate the trajectory, effectively avoiding interruption, so the impact on IDS is very obvious. Overall, MOTA drops by 6.2 without KF, mainly due to the increase in IDS. Since KF requires very little computation, it has little influence on tracking speed.
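How a Kalman filter keeps a trajectory alive through missing detections can be illustrated with a minimal sketch. It is a simplified one-dimensional constant-velocity filter, not the 3D dynamic model used in this paper; the class name, the noise parameters `q` and `r`, and the synthetic measurements are all assumptions made for illustration. When a detection is available the filter predicts and then updates; when the target is occluded (measurement is `None`) it only predicts, so the estimate keeps moving and its variance grows.

```python
class ConstantVelocityKF:
    """1D Kalman filter with state (position, velocity) and a position-only measurement."""

    def __init__(self, position, velocity=0.0, dt=1.0, q=0.01, r=1.0):
        self.p, self.v = position, velocity
        self.dt, self.q, self.r = dt, q, r      # q, r: assumed noise levels
        self.P = [[1.0, 0.0], [0.0, 1.0]]       # state covariance

    def predict(self):
        """Propagate state and covariance: x <- F x, P <- F P F^T + q I."""
        dt = self.dt
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the prediction with a measured position z."""
        (a, b), (c, d) = self.P
        s = a + self.r                           # innovation variance
        k0, k1 = a / s, c / s                    # Kalman gain
        y = z - self.p                           # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


def track(measurements, start=0.0):
    """Predict every frame; update only when a detection is matched (z is not None)."""
    kf = ConstantVelocityKF(start)
    estimates = []
    for z in measurements:
        kf.predict()
        if z is not None:
            kf.update(z)
        estimates.append(kf.p)
    return estimates


# Synthetic target moving 2 units per frame, then occluded for 5 frames.
observed = [2.0 * t for t in range(1, 21)] + [None] * 5
print([round(e, 2) for e in track(observed)[-6:]])
```

In this sketch the occluded frames extend the trajectory along the last estimated velocity, which is the behavior that prevents the track from being interrupted and later reidentified.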

Occlusion detection involves depth prediction and occlusion judgment. Occlusion judgment affects how a trajectory is processed after IOU matching, so it has a greater impact on IDS but almost no impact on IDF1. Since the depth prediction network is time-consuming, the tracking speed increases by 26.8 frames per second without occlusion detection, the largest increase among the three components.
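The occlusion judgment itself is cheap compared with depth prediction. A minimal sketch follows, assuming the rule that a sudden, large decrease in the predicted depth at the target's location indicates an occluder in front of it. The function name `is_occluded` and the relative threshold `drop_ratio` are hypothetical; the actual threshold and comparison used in our system are not restated here.

```python
def is_occluded(prev_depth, cur_depth, drop_ratio=0.3):
    """Flag occlusion when depth falls by more than drop_ratio relative to the previous frame.

    drop_ratio is an assumed, illustrative threshold.
    """
    if prev_depth <= 0:
        return False  # no valid reference depth to compare against
    return (prev_depth - cur_depth) / prev_depth > drop_ratio


# Hypothetical depth values for one target over consecutive frames.
print(is_occluded(10.0, 9.5))   # small change: target still visible
print(is_occluded(10.0, 5.0))   # sudden drop: something closer covers the target
```

A relative threshold is used in this sketch so that the same setting applies to near and far targets; an absolute threshold would be an equally plausible choice.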

The position of the target in the image is determined by the motion of both the target and the camera. When using Kalman filtering, we assume that the camera is fixed. In fact, in the MOT17 dataset, three sequences are captured by a stationary camera and four by a moving camera. Ignoring the motion of the camera inevitably degrades the Kalman filter. Table 7 shows that motion estimation has an obvious impact on overall performance: without it, MOTA drops by 5.9, IDF1 drops by 1.1, IDS increases by 46, and tracking speed increases by 11.3 Hz.

Reading the results in Table 7 by column gives the impact of each component on a single indicator. Figure 5 shows this impact intuitively. KF, occlusion detection, and motion estimation all contribute to MOTA, IDF1, and IDS. For MOTA and IDF1, the effect of KF and motion estimation is slightly greater than that of occlusion detection, because KF and motion estimation have a greater impact on target detection. For IDS, KF is particularly useful because it handles missing observations, which significantly reduces the number of reidentified targets. More components mean more computation, so tracking speed inevitably decreases. Occlusion detection has the greatest impact on speed because it computes depth with a deep neural network. Accuracy and speed are often difficult to balance. Fortunately, our method improves accuracy while keeping a high speed.

5. Discussion

In Section 4.2, we evaluate the performance of depth prediction. Experiments show that depth prediction is more effective when the target is relatively large or differs greatly from the background. Except in very few cases, the depth prediction is reliable. Although the estimated depth values cannot replace the true depth values, they preserve the depth ordering of the objects, which is enough for us to judge the spatial location of the targets.

In Section 4.3, we evaluate our method with both public and private detectors on MOT17 and MOT20. Our method achieves high accuracy while maintaining a high speed. It should be pointed out that the methods we selected for comparison are all real-time methods; many other methods trade speed for accuracy. Through the ablation study, we show the impact of each component on the tracking results. The KF framework has the greatest impact on IDS, as it is the key to maintaining continuous tracking. Occlusion detection plays the most important role in IDF1, because it determines the update method of KF, which in turn affects the final accuracy. For tracking speed, depth estimation involves neural network computation and has the greatest impact, while KF has a relatively small impact. On the whole, the components make similar contributions to MOTA.

6. Conclusions

In this paper, we study pedestrian detection and tracking with a fixed monocular camera. To improve the robustness of tracking, we focus on continuous tracking and occlusion detection to deal with heavy occlusion. To maintain continuous tracking, we introduce a Kalman filter to estimate the motion state of the target in each frame. With the help of KF’s predictive ability, we can continue to estimate the state of the target while it is occluded. Once the target reappears, the tracker can quickly locate it, thus avoiding reidentification and ensuring the continuity of the target’s trace during the occlusion period. To detect occlusion, we introduce depth information into our tracking system. Because objects closer to the camera have smaller depth, a sudden, large decrease in the depth of the target indicates that it is obscured by something in front of it. Depth information is usually obtained through depth sensors or multiview images, which is not practical in daily life, because most cameras are monocular and cannot provide depth values. Here, we introduce a light depth prediction network, which is distilled from a large but well-performing network. Distillation not only slightly improves accuracy (a gain of 2.6 in MOTA) but also greatly accelerates model inference (a gain of 35.5 Hz). We evaluated our method on public object tracking datasets. The results show that our method achieves not only high accuracy (65.1 MOTA) on MOT17 but also high tracking speed (52.6 Hz). In the future, we intend to introduce better association rules to improve the tracking accuracy, continuously optimize the target detection network and depth estimation network, and strive to achieve real-time tracking in a more universal environment.

Data Availability

The data used to support the findings of this study are included within the article. All datasets used in our research are publicly available and are cited in our article.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Beijing Municipality (4212023), the Fundamental Research Funds for the Central Universities, USTB (FRF-GF-20-04A), the Major Program of National Social Science Foundation of China (No. 17ZDA331), the National Key Research and Development Program of China (2018YFC2001700), the Scientific and Technological Innovation Foundation of Shunde Graduate School, and the Engineering Research Center of Intelligence Perception and Autonomous Control, Ministry of Education, Beijing (100124).