Abstract

In this paper, we address the problem of online updating of visual object trackers for car sharing services. The key idea is to adjust the updating rate adaptively according to the tracking performance of the current frame. Instead of setting a fixed weight for all frames when updating the object model, we assign the current frame a larger weight if its tracking result is relatively accurate and unoccluded, and a smaller weight otherwise. To implement this, the intersection over union (IOU) of the current estimated bounding box is calculated by an IOU predictor, which is trained offline on a large number of image pairs, and is used as a guidance to adjust the updating weights online. Finally, we embed the proposed model update strategy in a lightweight baseline tracker. Experimental results on both traffic and nontraffic datasets verify that, although errors in the predicted IOU are inevitable, the proposed method still improves the accuracy of object tracking compared with the baseline object tracker.

1. Introduction

Car sharing services provide customers with access to shared vehicles for short-term use. They can reduce inner-city traffic, trip costs, congestion, and environmental pollution, and they have developed rapidly in recent years. To achieve better safety and operating efficiency, more and more intelligent vehicle technologies have been applied in car sharing services [1, 2]. Visual object tracking is a fundamental component of these technologies: given an object's initial location in the first frame, its locations in subsequent frames can be estimated continually. Moreover, the object's trajectory and velocity can be calculated from the tracking results and used for assisted or automated driving of shared vehicles. Compared with radar tracking, visual tracking is cheaper and can perceive richer semantic information about the traffic scene. Its disadvantage is that several factors, such as real-time variation of illumination, weather conditions, and interaction between traffic elements, often reduce tracking performance in complex traffic scenes. Therefore, there is still large room for the development of visual object tracking for car sharing services.

A typical visual object tracking method consists of five components, namely, feature extraction, motion model, appearance model, model updating, and integration process [3]. Most studies focus on feature extraction and the appearance model. The features used for object tracking include hand-crafted features, such as color, HOG, LBP, and CN, and automatically learned convolutional features. The main appearance models can be classified into generative and discriminative ones and have received much attention. By contrast, the model updating component is less studied. Most object trackers use the simplest linear weighting for model updating, in which a new appearance model is obtained by weighting the old model and the tracking result of the current frame (a sketch of this scheme follows this list). The drawback of this method is that the weight factor of the current frame is fixed and has no connection with the tracking performance of the current frame during the updating process. In fact, if the tracking result of the current frame is reliable and the object is not occluded, a small weight factor may cause the appearance model to be updated inadequately. On the contrary, if the tracking result of the current frame is inaccurate or the object is occluded, a large weight factor may cause the appearance model to be updated improperly. In both situations, errors may be introduced into the appearance model, and as the updating proceeds, these errors may accumulate and make the appearance model drift away from the object. From the above analysis, it is necessary to assign a suitable weight factor according to an evaluation of the current tracking performance. Nonetheless, how to update the tracking model online based on such an evaluation is still an open problem. This study tries to bridge this research gap, and the main contributions are as follows:

(1) We introduce an object-specific IOU predictor, trained offline on a large number of image pairs, to estimate the performance of the current tracking result for object model updating.

(2) We propose a dynamic updating mechanism based on IOU prediction. The updating principle is to assign the current tracking result a larger weight if it is relatively accurate and unoccluded, and a smaller weight otherwise.

(3) We integrate the IOU predictor into a lightweight correlation filter tracker and update the tracker online using the proposed updating mechanism.
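To make the drawback concrete, here is a minimal sketch of the fixed-rate linear weighting scheme discussed above; the template shapes and the rate value 0.01 are illustrative only, not the implementation of any particular tracker.

```python
import numpy as np

def linear_update(model, current_result, gamma=0.01):
    """Plain linear weighting used by most trackers: the new
    appearance model is a convex combination of the old model and
    the tracking result of the current frame. Because gamma is fixed
    for every frame, reliable and unreliable results contribute with
    exactly the same weight."""
    return (1.0 - gamma) * model + gamma * current_result

# e.g., a template polluted by an (unreliable) occluded observation
template = linear_update(np.ones((32, 32)), np.zeros((32, 32)))
```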

This paper is organized as follows: Section 2 reviews related work. Section 3 introduces the baseline object tracker and the IOU predictors used in computer vision and proposes our visual object tracker with online updating. Section 4 presents the experimental results and the corresponding analysis. Finally, Section 5 presents the conclusions and future research directions.

2. Related Work

As mentioned above, existing visual object tracking methods can be divided into two categories: generative ones and discriminative ones. In generative methods, the appearance model contains only the object's information, and object tracking is achieved by searching for the candidate region that best matches the appearance model. Template tracking is the earliest generative tracking method; it takes the original spatial intensity distribution of the object region as the template and tracks the object by template matching. To address the drift problem caused by inadequate updating of templates, Matthews et al. [4] kept the first template, used it to align the current template, and thereby reduced the drift phenomenon to a certain extent. As another classical generative tracking method, the mean shift method [5] takes the object's kernel histogram in the first frame as the appearance model and employs a metric derived from the Bhattacharyya coefficient as the similarity measure to perform the matching. Throughout this tracking process, the appearance model remains unchanged. To update the appearance model dynamically, Peng et al. [6] employed a Kalman filter to filter the kernel histogram using the previous appearance model and the current candidate region. The modified method could partly keep up with changes in object appearance, but its hidden assumption that the object appearance obeys a Gaussian distribution may not hold in many practical situations. Besides the intensity template and the kernel histogram, the low-dimensional linear subspace is another generative appearance model; it was first introduced into object tracking by Hager and Belhumeur [7] to handle appearance variation caused by illumination. To update the linear subspace model adaptively, Ross et al. [8] proposed an incremental learning-based tracking method, which collects the object locations in previous frames and employs incremental PCA to update the linear subspace model. Through this updating operation, the linear subspace model can adapt better to the variation of object appearance.

Different from generative object tracking methods, discriminative methods consider not only the object's information but also the background's information. They treat object tracking as a binary classification problem and train a classifier to separate the object from the background; they have attracted more attention owing to their ability to deal with objects in complex environments. Most traditional tracking-by-detection methods train their binary classifiers online to update the appearance model, and the updating process always has two steps: (i) the generation and labelling of samples based on the estimated object locations in previous frames and (ii) the online updating of the classifiers [9]. However, the generated samples' labels are often noisy. To increase the classifiers' robustness to poorly labelled samples, several improvements, such as robust loss functions [10, 11], semisupervised learning [12, 13], and multiple instance learning [14, 15], have been proposed.

With the fast development of deep learning, modern visual object trackers, such as correlation filtering-based trackers and siamese trackers, generally use deep features to build their appearance models, and the corresponding model updating mechanisms have also been studied. The MOSSE filter [16], the first correlation filtering-based tracker, updates the object model by linearly weighting the current estimated object region and the previous object model, and this linear weighting method has also been used in many other correlation filtering-based trackers [17–20]. The siamese tracker is another kind of modern object tracker, whose basic principle is to learn a similarity metric offline and search online for the candidate region that best matches the object appearance template. SiamFC is the original siamese tracker, in which the object template is initialized in the first frame and then kept fixed during the remainder of the video [21]. Most siamese trackers [22–24] implement the same model updating strategy as MOSSE, and there are two problems in the updating of these trackers. First, the weight factor of the current frame is fixed and cannot change adaptively during updating. Second, only the object information is updated, while the updating of the background information is ignored. Aiming at the second limitation, Huang et al. [25] modeled the context between the object and its surroundings by an object-aware weight vector and took the spatial-temporal context into account in the updating process. Besides the above, there are some learning-based model updating methods. Taking the initial template, the accumulated template, and the template of the current frame as inputs, Zhang et al. [26] utilized a convolutional neural network to learn the optimal template of the next frame in an offline way. Li et al. [27] learned an RNN-based model updater on offline videos by metalearning. In general, to make the learned mechanism adapt to arbitrary targets, these learning-based model updating methods need a large number of samples with different kinds of appearance variation.

To consider the feedback from tracking results in object model updating, Wang et al. [28] used the response map's peak value and average peak-to-correlation energy (APCE) to measure the confidence of the current tracking result. The object model was updated only if these indexes were greater than certain thresholds and remained unchanged otherwise. Similarly, Sun et al. [29] calculated the peak-to-sidelobe ratio (PSR) of the response map to evaluate the quality of the tracking result in each frame and used it to update the template of a siamese tracker. In addition, Zhu et al. [30] took the peak-versus-noise ratio (PNR) as an evaluation index; when the PNR and the maximum value of the response map exceeded certain thresholds simultaneously, one-step stochastic gradient descent with a small learning rate was used to update the object model.
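For reference, the following sketch shows how two of these response-map confidence indices are commonly computed; the APCE formula follows its usual definition in [28], while the square sidelobe exclusion window used for PSR here is an illustrative choice.

```python
import numpy as np

def apce(response):
    """Average peak-to-correlation energy [28]: high when the
    response map has a single sharp peak, low when it is noisy."""
    fmax, fmin = response.max(), response.min()
    return (fmax - fmin) ** 2 / np.mean((response - fmin) ** 2)

def psr(response, exclude=5):
    """Peak-to-sidelobe ratio: peak height relative to the mean and
    standard deviation of the sidelobe, i.e., everything outside a
    small window around the peak (window size is illustrative)."""
    peak = response.max()
    py, px = np.unravel_index(response.argmax(), response.shape)
    mask = np.ones_like(response, dtype=bool)
    mask[max(0, py - exclude):py + exclude + 1,
         max(0, px - exclude):px + exclude + 1] = False
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-12)
```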

In summary, most modern object trackers update their appearance models without considering whether the estimated object location is accurate. Once the object is estimated inaccurately, severely occluded, or totally missing in the current frame, the object model will be updated improperly, and the impact will accumulate during the whole tracking process. A few studies have used APCE, PSR, or PNR to measure the confidence of the current tracking result. These rule-based indicators can be calculated from the response map easily and rapidly, but much of the information in the raw images is thrown away in the calculation; therefore, they are limited in evaluating tracking performance. Different from them, in this paper, we introduce a data-driven method to evaluate the performance of tracking results and use it as a guidance to update the object model online. For the reader's reference, Table 1 summarizes the main symbols used in the following and their descriptions.

3. Object Tracking with Online Updating Guided by IOU

3.1. Base Object Tracker

The features used in traditional discriminative correlation filtering-based object tracking methods are either hand-crafted features, such as HOG, LBP, and CN, or convolutional features trained independently on other visual tasks, such as image classification and object detection. This separation between feature learning and correlation tracking makes the achieved tracking performance suboptimal. To address this problem, Wang et al. [20] proposed DCFNet, an end-to-end lightweight network architecture that learns the convolutional features and performs the correlation tracking process simultaneously. Because of its efficiency and performance, we use it as the base object tracker in this work.

In DCFNet, $\varphi(x)$ denotes the convolutional feature of object region $x$, where $\varphi(\cdot)$ is the convolutional network used for feature extraction with parameter $\theta$, and $y$ is the correlation filter's ideal response, which is generated by a Gaussian function peaked at the object region's center. Given the feature extraction network, the desired filter $w_t$ at time $t$ can be obtained by minimizing the accumulated ridge loss as follows:

$$\epsilon_t = \sum_{s=1}^{t} \gamma_s \left( \left\| \sum_{l=1}^{D} w^l \star \varphi^l(x_s) - y \right\|^2 + \lambda \sum_{l=1}^{D} \left\| w^l \right\|^2 \right), \quad (1)$$

where the parameter $\gamma_s$ is the updating rate, which expresses the impact of object region $x_s$, $D$ is the channel number of the extracted feature, and $\lambda$ is the regularization coefficient. The closed-form solution of the optimization problem in equation (1) can be formulated in an incremental mode as follows:

$$\hat{w}_t^l = \frac{\hat{A}_t^l}{\hat{B}_t + \lambda}, \qquad \hat{A}_t^l = (1-\gamma_t)\,\hat{A}_{t-1}^l + \gamma_t\, \hat{y}^{*} \odot \hat{\varphi}^l(x_t), \qquad \hat{B}_t = (1-\gamma_t)\,\hat{B}_{t-1} + \gamma_t \sum_{k=1}^{D} \hat{\varphi}^k(x_t) \odot \left(\hat{\varphi}^k(x_t)\right)^{*}. \quad (2)$$

Here, the hat $\hat{\cdot}$ denotes the discrete Fourier transform $\mathcal{F}(\cdot)$, the superscript $*$ represents the complex conjugate of a complex number, and $\odot$ denotes the Hadamard product. In the test process, the feature of the search region $z$ is extracted by the feature extraction network and denoted as $\varphi(z)$. Finally, the target's location is estimated by searching for the maximum value of the correlation response map $g$ as follows:

$$g = \mathcal{F}^{-1}\left( \sum_{l=1}^{D} \left(\hat{w}_t^l\right)^{*} \odot \hat{\varphi}^l(z) \right). \quad (3)$$
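To make the above concrete, the following NumPy sketch implements the incremental filter update of equation (2) and the response computation of equation (3); feature extraction is abstracted as a given array, and all variable names and shapes are our own illustrative choices, not the original implementation.

```python
import numpy as np

def dcf_update(x_feat, y_hat, state, gamma, lam=1e-4):
    """Incremental closed-form DCF update of equation (2).
    x_feat: (D, H, W) feature of the object region at time t.
    y_hat:  (H, W) DFT of the ideal Gaussian response.
    state:  (A, B) accumulated numerator/denominator, or None."""
    x_hat = np.fft.fft2(x_feat, axes=(-2, -1))
    num = np.conj(y_hat) * x_hat                  # per-channel numerator
    den = np.sum(x_hat * np.conj(x_hat), axis=0)  # summed over channels
    if state is None:
        A, B = num, den
    else:
        A = (1 - gamma) * state[0] + gamma * num
        B = (1 - gamma) * state[1] + gamma * den
    w_hat = A / (B + lam)                         # the filter of eq. (2)
    return w_hat, (A, B)

def dcf_response(w_hat, z_feat):
    """Correlation response of equation (3) on search-region feature."""
    z_hat = np.fft.fft2(z_feat, axes=(-2, -1))
    g = np.fft.ifft2(np.sum(np.conj(w_hat) * z_hat, axis=0)).real
    return g  # target at np.unravel_index(g.argmax(), g.shape)
```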

The feature extraction network can be trained offline by stochastic gradient descent to minimize the objective function on a dataset consisting of a large number of image pairs $(x_i, z_i)$:

$$L(\theta) = \sum_{i} \left\| g(x_i, z_i; \theta) - \tilde{y}_i \right\|^2, \quad (4)$$

where $\tilde{y}_i$ is the desired response map of the search region $z_i$. Compared with traditional discriminative correlation filter-based tracking methods, in which the features and the filters are learned independently, DCFNet can be learned in an end-to-end fashion and achieves higher accuracy. Furthermore, because a lightweight network architecture is adopted, it strikes a balance between speed and accuracy in tracking and operates in real time. This is why we choose it as the base object tracker.
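As an illustration of equation (4), the following PyTorch-flavored sketch computes the loss for one batch of image pairs by reusing the closed-form filter above; the tiny backbone is a stand-in under our assumptions, not the actual lightweight DCFNet architecture, whose exact layer configuration we do not reproduce here.

```python
import torch
import torch.nn as nn

class TinyFeatureNet(nn.Module):
    """Stand-in for the lightweight feature extractor phi."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, img):
        return self.conv(img)

def dcf_train_loss(phi, x, z, y, y_shift, lam=1e-4):
    """Objective of equation (4) on one batch of pairs (x, z):
    solve the filter on x in closed form, correlate it with z, and
    regress the response toward the desired Gaussian label y_shift."""
    xf = torch.fft.fft2(phi(x))          # (N, D, H, W) object feature
    zf = torch.fft.fft2(phi(z))          # (N, D, H, W) search feature
    yf = torch.fft.fft2(y)               # (N, 1, H, W) label DFT
    wf = torch.conj(yf) * xf / (
        (xf * torch.conj(xf)).sum(1, keepdim=True) + lam)
    g = torch.fft.ifft2((torch.conj(wf) * zf).sum(1)).real
    return nn.functional.mse_loss(g, y_shift)
```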

3.2. IOU Prediction

IOU is defined as the ratio of the intersection area between the candidate object region and the ground truth region to their union area. It evaluates the accuracy of the candidate region relative to the ground truth region and is useful in many visual tasks. The prediction of IOU was first implemented by IOU-Net [31] in object detection, in which each IOU-Net is trained independently for a specific object class and is therefore not suitable for other classes of objects. Such class-specific IOU predictors are of little use for generic visual tracking because the object's class is generally unknown and arbitrary. To predict the candidate region's IOU for all sorts of objects in visual tracking, Danelljan et al. [32] proposed a new IOU predictor that can predict an arbitrary object's IOU given only a single reference image, using a modulation-based network architecture, as shown in Figure 1.
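For completeness, a minimal implementation of this definition for axis-aligned bounding boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes,
    each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```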

As shown in Figure 1, the IOU predictor network has two branches, both of which take specific convolution layers of ResNet-18 as the backbone. The reference branch accepts the convolutional feature of the reference image $x_0$ and the object's bounding box annotation $B_0$ as inputs and outputs a modulation vector $c(x_0, B_0)$. The test branch takes the convolutional feature of the current test image $x$ and the estimated bounding box $B$ as inputs and outputs a feature representation $z(x, B)$. Then, the feature representation of the estimated object region is modulated by the vector $c(x_0, B_0)$ via a channel-wise multiplication. Finally, the modulated representation is fed to the IOU prediction module $g$, which consists of three fully connected layers. The predicted IOU of the estimated bounding box $B$ in the current test image is given by

$$\mathrm{IOU}(B) = g\left( c(x_0, B_0) \cdot z(x, B) \right). \quad (5)$$
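The following PyTorch sketch illustrates the modulation mechanism of equation (5); the feature dimensions are illustrative, and the region pooling of [32] that produces $c(x_0, B_0)$ and $z(x, B)$ from the backbone features is abstracted away.

```python
import torch
import torch.nn as nn

class IoUPredictorHead(nn.Module):
    """Sketch of the modulation mechanism of [32]: the reference-branch
    vector gates the test-branch representation channel-wise before
    three fully connected layers regress the IOU."""
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, c, z):
        # c: (N, feat_dim) modulation vector from the reference branch
        # z: (N, feat_dim) representation of the estimated test region
        return self.fc(c * z).squeeze(-1)  # predicted IOU, eq. (5)
```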

3.3. Online Updating of the Base Object Tracker with the Guidance of IOU

Like many discriminative correlation filtering-based object trackers, DCFNet has an incremental model updating mechanism, as shown in equation (2). In the updating process, the parameter $\gamma_t$, which denotes the impact of the current tracking result, remains unchanged from the start of tracking:

$$\gamma_t = \gamma, \quad \forall t \ge 1. \quad (6)$$

The assumption behind equation (6) is that the estimated object regions in different frames are of equal importance. Obviously, this does not hold in many cases. For example, if the object is occluded at time $t$, the estimated object region may be unreliable, and its importance should be reduced to avoid model drift. On the contrary, if the object's appearance has changed little in recent frames, the estimated object region is more reliable, and its importance should be increased to update the model more adequately. Therefore, it is necessary to estimate the reliability of the estimated object region and take it as a guidance to adjust the updating rate $\gamma_t$ adaptively.

In fact, the evaluation of tracking performance has received attention and has been used for model updating in visual tracking. In most existing methods, the reliability of the tracking result is expressed as a statistical index, such as APCE, PSR, or PNR. These statistical indexes are defined manually and calculated from an intermediate response map, so the original information contained in the tracking results, such as color, texture, and intensity, is ignored in the evaluation. Different from them, we introduce the IOU to measure the reliability of the tracking result and use it as a guidance to update the object tracker online. As shown in Figure 2, the original DCFNet is supplemented with an IOU predictor to constitute a new tracker, in which the architectures of the two networks remain unchanged.

Because of the prediction error, it is hard and unnecessary to adjust the parameter $\gamma_t$ very precisely based on the predicted IOU. In our method, $\gamma_t$ in equation (2) is redefined as a piecewise function of the predicted IOU of the current frame, $\mathrm{IOU}_t$, as follows:

$$\gamma_t = \begin{cases} \gamma_1, & \mathrm{IOU}_t < T_1, \\ \gamma_2, & T_1 \le \mathrm{IOU}_t \le T_2, \\ \gamma_3, & \mathrm{IOU}_t > T_2, \end{cases} \quad (7)$$

where $T_1$ and $T_2$ are the IOU thresholds and $\gamma_1$, $\gamma_2$, and $\gamma_3$ are the model updating rates.
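A direct transcription of equation (7) follows; the rate values mirror the settings given later in Section 4.1, while the threshold values are placeholders, since their concrete values belong to the evaluated hyperparameter conditions.

```python
def updating_rate(pred_iou, t1=0.3, t2=0.7,
                  g1=0.005, g2=0.01, g3=0.015):
    """Piecewise updating rate of equation (7). The thresholds t1
    and t2 here are illustrative placeholders."""
    if pred_iou < t1:
        return g1   # unreliable result: update cautiously
    if pred_iou > t2:
        return g3   # reliable result: update more adequately
    return g2       # otherwise keep the baseline DCFNet rate
```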

Compared with manually defined statistical indexes, such as APCE, PSR, and PNR, the predicted IOU between the estimated object region and the ground truth location is learned from a large number of samples. It can adapt to different complex traffic scenarios, measure the reliability of the current tracking result more exactly, and is therefore more helpful for updating the object appearance model.

4. Experiments

4.1. Experiment Settings

To verify the effectiveness of the proposed object tracker, we first conduct extensive experiments on two challenging public datasets: OTB-2013 [33] with 50 sequences and its updated version OTB-2015 [34] with 100 sequences. Without loss of generality, these datasets contain both traffic and nontraffic scenes. The hardware environment includes an Intel E5-2687 3.0 GHz CPU, 128 GB RAM, and an NVIDIA 1080 Ti GPU. We implement our object tracker in PyTorch and compare it with 8 other modern object trackers: SRDCF [35], Staple [36], SiamFC [21], CFNet [37], the original DCFNet [20], and its 3 modified versions, which update their object models using APCE, PSR, and PNR, respectively.

For a fair comparison with the original DCFNet, the model updating rate $\gamma_2$ in equation (7) is set to 0.01, the same as the fixed rate in equation (2); meanwhile, $\gamma_1$ and $\gamma_3$ are set to 0.005 and 0.015, respectively. The relevant parameters used for model updating in DCFNet-APCE, DCFNet-PSR, and DCFNet-PNR are chosen according to references [28–30], respectively. It is worth mentioning that the proposed tracker is evaluated under 6 different conditions to verify its robustness to hyperparameter selection.

4.2. Experimental Results

The tracking performance of each object tracker is estimated by one-pass evaluation (OPE). Figure 3 shows the success plots of OPE for the proposed tracker under condition 1 and the other trackers on OTB-2013 and OTB-2015; the numbers in the legends indicate the average area under the curve (AUC) scores of all trackers. A more complete quantitative comparison between our tracker under all conditions and the other trackers is shown in Table 2.
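For reference, a sketch of the OPE success score under the standard OTB protocol: the success plot gives, for each overlap threshold, the fraction of frames whose IOU with the ground truth exceeds it, and the reported AUC score is the average of this curve.

```python
import numpy as np

def success_auc(per_frame_iou, thresholds=np.linspace(0, 1, 101)):
    """AUC of the OPE success plot: mean over overlap thresholds of
    the fraction of frames whose IOU exceeds the threshold."""
    ious = np.asarray(per_frame_iou)
    success = [(ious > th).mean() for th in thresholds]
    return float(np.mean(success))
```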

In addition to the above experiments on OTB-2013 and OTB-2015, a group of experiments on KITTI [38], a vision benchmark for autonomous driving, is conducted to prove the feasibility of the proposed method in traffic scenes. Partial experimental results are shown in Figure 4; for viewing convenience, the KITTI images are cropped to reduce the field of view.

4.3. Experimental Analysis

It can be found from Table 2 that the proposed method achieves the highest tracking accuracy under all conditions and therefore has a certain degree of robustness to hyperparameter selection. Taking the results under condition 1 as an example, the tracking accuracy of our method increases by 5% on OTB-2013 and 1% on OTB-2015 compared with that of the original DCFNet. This improvement verifies that the proposed dynamic update mechanism guided by IOU is more adaptable to the variation of object appearance than the fixed update mechanism used in the original DCFNet. Furthermore, our method also exceeds the DCFNet variants modified with APCE, PSR, and PNR, respectively. The reason is that these rule-based evaluation indicators are easily affected by the irregularity and noise of the response map. By contrast, as a data-driven evaluation indicator learned from a large number of videos, the IOU used in our method can evaluate the tracking results more realistically. However, the price is that the tracking speed decreases from 80 FPS to 30 FPS because of the IOU calculation.

In addition, the experimental results shown in Figure 4 demonstrate that our tracking method with online updating can track traffic participants well and thus support the operational efficiency and safety of car sharing services.

5. Conclusion and Future Work

Visual object trackers can acquire the trajectories of objects such as pedestrians and vehicles in traffic scenes and make car sharing services more secure and efficient. To improve tracking performance in complex traffic scenes, it is necessary to update the object model adaptively, and an accurate evaluation of the current tracking result benefits the updating of the object appearance model. Instead of using rule-based indicators such as APCE, PSR, and PNR, we introduced a data-driven IOU predictor, learned offline from a large number of image pairs, to evaluate the tracking result. Based on the predicted IOU, a dynamic updating mechanism for the object model was proposed: if the predicted IOU is high, a larger weight is assigned to the current tracking result, and a smaller weight otherwise. Finally, we integrated this dynamic updating mechanism into the DCFNet tracker. Experimental results showed that, compared with the original tracker, the proposed tracker's accuracy increased by 5% on OTB-2013 and 1% on OTB-2015. Moreover, our tracker also exceeded the modified DCFNet trackers that update their object models using APCE, PSR, and PNR, respectively. This verifies that, as a data-driven tracking performance evaluation index, the IOU can act as a more reliable guidance than rule-based indexes to update the object appearance model online and improve the accuracy of object tracking for car sharing services.

The limitation of our research is that, because of the additional computation introduced by IOU prediction, the tracking speed decreases from 80 FPS to 30 FPS. Future research may include backbone network sharing, network architecture search, and model compression of the IOU prediction network to improve the accuracy and speed of the IOU predictor.

Data Availability

The OTB-2013 and OTB-2015 datasets used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the High-Level Talents of Jinling Institute of Technology (No. JIT-B-202013), the International Science and Technology Cooperation Project of Jiangsu Province (No. BZ2020069), the Research Fund for the Doctoral Program of Jinling Institute of Technology (No. JIT-B-201617), and the Major Program of University Natural Science Research of Jiangsu Province (No. 16KJA520003).