Abstract

Tracking-by-detection methods have been widely studied with promising results. These methods usually train a classifier, or a pool of classifiers, in an online manner: previous tracking results are used to generate a new training set for the object appearance, and the current model is updated to predict the object location in subsequent frames. However, the updating process can easily cause drift under appearance variation and occlusion. Previous methods decide whether or not to update the classifier(s) using a fixed learning rate parameter in all scenarios. This parameter has a great influence on the tracker's performance and should be adjusted dynamically as the scene changes during tracking. In this paper, we propose a novel method to model the time-varying appearance of an object that takes the appearance variation and occlusion of local patches into consideration. In contrast with existing methods, the learning rate for updating the classifier ensembles is adjusted adaptively by estimating the appearance variation with sparse optical flow and the possible occlusion of the object between consecutive frames. Experiments and evaluations on challenging video sequences demonstrate that the proposed method is more robust against appearance variation and occlusion than state-of-the-art approaches.

1. Introduction

Visual object tracking, one of the cardinal problems in computer vision, has a wide range of applications including video surveillance, human-computer interaction, video retrieval, and autonomous navigation. Although numerous single-object tracking methods have been proposed, many of them achieve favorable performance only in simple environments with slow motion and slight occlusion. Hence, designing a robust algorithm for complex and dynamic scenes remains challenging due to distractions such as heavy occlusion, appearance variation, and cluttered background.

The purpose of object tracking is to estimate the states of a moving target in a video. A tracking system usually has three components: an appearance model, a motion model, and a model update scheme. The appearance model represents the object with proper features and verifies predictions against object representations. The motion model predicts the most likely state of the target. The model update scheme makes the tracker adapt to appearance variation and occlusion of the target object. Existing trackers can be classified as either generative or discriminative. For generative methods [1–6], tracking is cast as searching, among many neighborhood positions of the current location, for the position most similar to the target object, which is often represented by a set of templates. The fragment-based (Frag) tracker [1] addresses partial occlusion by modeling object appearance with histograms of local patches. The incremental visual tracker (IVT) [2] utilizes an incremental subspace model to adapt to appearance variation. The visual tracking decomposition (VTD) approach [3] extends the conventional particle filter framework with multiple motion and observation models to account for appearance variation. The ℓ1 tracker [4] first applies sparse representation to visual tracking, using designed trivial templates to handle occlusion and treating tracking as finding the image region with minimal reconstruction error via ℓ1 minimization. The distribution field (DFT) tracker [5] builds an image descriptor using distribution fields (DFs), a representation that allows smoothing the objective function without destroying information about pixel values. Li et al. [6] construct an appearance model using the 3D discrete cosine transform, propose an incremental 3D-DCT algorithm, and embed a discriminative criterion into a particle filtering framework for object state inference. Most of these methods use holistic representations to describe objects and hence do not handle occlusions or distracters well.

For discriminative methods [7–14], tracking is treated as a binary classification problem which aims to design an ensemble classifier that distinguishes the target object from the background; these methods utilize both target and background information. Avidan [7] combines a set of weak classifiers into a strong one and develops an ensemble tracking method. The OAB tracker [8] proposes an online boosting method that updates discriminative features to handle the drifting problem in object tracking. Bai et al. [9] propose the randomized ensemble tracker (RET), which extends the online boosting algorithm [8] and the ensemble tracker [7] by characterizing the ensemble weight vector as a random variable and evolving its distribution with recursive Bayesian estimation. Babenko et al. [10] introduce multiple instance learning into online object tracking, where samples are considered within positive and negative bags or sets. Kalal et al. [11] propose the P-N learning algorithm to exploit the underlying structure of positive and negative samples when learning classifiers for object tracking. Zhang et al. [12] propose a real-time compressive tracking algorithm that uses random projection to map high-dimensional data to a low-dimensional vector.
The structured output tracking (Struck) method [13] adopts a kernelized structured output support vector machine to avoid labeling ambiguity when updating the classifier during tracking. Wang et al. [14] incorporate online distance metric learning into visual tracking within a particle filter framework, so that the appearance variations of an object are effectively learned via an online metric learning mechanism. Furthermore, several algorithms have been proposed to exploit the advantages of both generative and discriminative models [15, 16].

In this paper, we mainly focus on the model update scheme because this component has a great impact on tracker performance. Existing strategies usually concentrate on updating the weights of classifiers and ignore updating the learning rate of the classifiers. To address this problem, motivated by [15, 17–19], we introduce appearance variation estimation and occlusion estimation to control the learning rate of the classifiers. We estimate appearance variation using sparse optical flow and possible occlusion using the reconstruction error of local patches. In summary, we propose a robust tracking method with an adaptive appearance model. During tracking, we exploit local patches to handle occlusion and appearance variation. The model is adaptively updated, with occlusion taken into consideration, to account for variations and alleviate drift.

The remainder of this paper is organized as follows. The work related to visual tracking and sparse representation is reviewed in Section 2. In Section 3, the proposed method with appearance variation and occlusion estimation is introduced. Experimental results and demonstrations are reported and analyzed in Section 4 and the conclusion is given in Section 5.

2. Related Work

In this section, we discuss related online tracking algorithms that handle appearance variation and occlusion of the target object. Although much progress has been made in visual tracking, designing an effective and robust tracker remains challenging due to distractions such as appearance variation, varying illumination, occlusion, and background clutter.

Tracking-by-detection [7, 9] has become increasingly popular due to its top performance. These methods treat tracking as a detection and classification problem, which avoids modeling object dynamics, especially when abrupt motion and occlusion occur. They train a classifier to distinguish the object from the background and use it to detect the object in each frame. One common approach is to use an ensemble classifier that linearly combines many weak classifiers with different associated weights, for example, [7, 9]: a strong classifier is constructed from a pool of weak classifiers using the initial frame; then, at each time step, the weights are updated by abandoning some bad weak classifiers and adding new weak classifiers trained on the next frame.

Appearance variation and occlusion are two major problems for visual tracking. Several algorithms employ holistic or local representation schemes to handle them. The IVT tracker [2] presents an adaptive appearance model that accounts for appearance variation and limited deformable motion, but it is less effective under heavy occlusion because of its holistic appearance model. The ensemble tracker [7] formulates tracking as a binary classification problem; although it distinguishes the target from the background, it is rather limited in handling heavy occlusion. The Frag tracker [1] addresses partial occlusion with a representation based on histograms of local patches; however, the template combining votes of matching local patches is not updated, so it cannot be expected to handle large scale variation and shape deformation. The MIL tracker [10] is able to reduce drift, but it cannot handle large shape deformation. The ℓ1 tracker [4] does not exploit appearance information from the background and is thus ineffective under heavy occlusion.

As mentioned above, handling appearance variation and occlusion are two major problems and active research topics in object tracking. Motivated by [15, 17–19], we introduce methods to estimate appearance variation and possible occlusion during model update. An optical flow algorithm computes the motion of the moving target from one frame to the next using the intensity values of neighboring pixels, so we can use it to estimate the rate of appearance variation between consecutive frames. Motivated by the widely successful applications of sparse representation in many tasks, we design a new sparsity-based occlusion estimation method.

3. The Proposed Method

In this section, we introduce our tracker's appearance variation estimation based on sparse optical flow and its sparsity-based occlusion estimation. We address the problem that the learning rate of the classifier cannot be dynamically adapted to scene changes during tracking. Our tracker is a variant of RET [9].

3.1. The RET Tracker

We give a brief introduction to the RET tracking method. The tracker puts emphasis on estimating the state of the classifier rather than the state of the object. It characterizes the ensemble weight vector that combines weak classifiers as a random variable and evolves its distribution with recursive Bayesian estimation. The object bounding box is divided into local patches of fixed size. For each patch, it extracts a histogram of oriented gradients (HOG) descriptor. At each time step $t$, the method starts with the pool of weak classifiers $\{h_1, \dots, h_M\}$, a distribution over the weight vector $\mathbf{w}$, and input data $x_t$. It samples the distribution to get $K$ instantiations $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(K)}$ of the weight vector and combines them with the outputs of the weak classifiers to yield $K$ ensembles of weak classifiers. These ensembles can be interpreted as instantiations of the randomized classifier and are used to compute an approximation of the expected output of the randomized classifier. This approximation is taken as the output of the strong classifier for input data $x_t$.

The classification method of the tracker is described as follows. Given a weight vector $\mathbf{w}$ and input data $x$, we obtain an ensemble binary classifier over the pool by thresholding the linear combination of the outputs of all weak classifiers:

$$H(\mathbf{w}, x) = \mathbb{1}\!\left[ \sum_{i=1}^{M} w_i\, h_i(x) > \theta \right],$$

where $w_i$ is the component of the weight vector $\mathbf{w}$ associated with weak classifier $h_i$, $M$ is the dimensionality of the weight vector $\mathbf{w}$, and $\theta$ is a model parameter.

Let $\{D_1, D_2, \dots\}$ denote the series of sequentially arriving datasets. At time step $t$, given input data $x_t$, the probabilistic label value is computed:

$$\hat{p}(x_t) = \mathbb{E}_{\mathbf{w} \sim \operatorname{Dir}(\alpha \mathbf{m})}\!\left[ H(\mathbf{w}, x_t) \right] \approx \frac{1}{K} \sum_{k=1}^{K} H\!\left(\mathbf{w}^{(k)}, x_t\right),$$

where $\mathbf{m}$ is the base distribution, which is the expectation of the vector $\mathbf{w}$, and $\alpha$ is the concentration parameter; $\mathbf{w}$ follows the Dirichlet distribution $\operatorname{Dir}(\alpha \mathbf{m})$.

The final strong classifier for input data $x_t$ is obtained by voting and thresholding:

$$H^{*}(x_t) = \mathbb{1}\!\left[ \hat{p}(x_t) > \theta_v \right],$$

where $\theta_v$ is a voting threshold.
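To make the voting scheme concrete, the following Python sketch approximates the expected output of the randomized classifier by sampling weight vectors from the Dirichlet prior, under the notation above. It is a minimal illustration: the function name, the assumption of weak-classifier outputs in $\{-1, +1\}$, and the default thresholds are ours, not RET's exact implementation.

```python
import numpy as np

def randomized_ensemble_vote(h_outputs, m, alpha, theta=0.0, n_samples=100,
                             rng=None):
    """Approximate the expected output of the randomized classifier.

    h_outputs : (M,) array of weak-classifier outputs h_i(x) in {-1, +1}
    m         : (M,) base distribution (expectation of the weight vector w)
    alpha     : concentration parameter of the Dirichlet prior
    theta     : decision threshold of each sampled ensemble
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample K instantiations of the weight vector w ~ Dir(alpha * m).
    W = rng.dirichlet(alpha * m, size=n_samples)      # shape (K, M)
    # Each sampled w yields one binary ensemble decision.
    votes = (W @ h_outputs > theta).astype(float)     # shape (K,)
    # The mean vote approximates E[H(w, x)]; threshold it for the label.
    p_hat = votes.mean()
    return p_hat, int(p_hat > 0.5)
```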

3.2. Appearance Variation Estimation Based on Sparse Optical Flow

In visual tracking, appearance variation is an important factor affecting performance. Most existing methods address it with feature descriptors while ignoring the estimation of its rate and degree. The methods proposed in [17, 20] have shown that appearance variation can be detected by observing changes in the induced optical flow. The magnitude of the induced optical flow vectors depends on the distance to the object as well as on the velocity difference between the object and the video capture device. The sampling rate of a given video sequence is constant, so if the velocity of the capture device is constant, changes in the optical flow vector magnitude reflect changes in the distance to the object, and the rate can be computed. If the velocity difference between the object and the capture device is small, changes in the optical flow vector magnitude reflect the degree of appearance variation. We adopt sparse optical flow because traditional dense optical flow algorithms cannot satisfy real-time requirements due to their large computational cost. Many detectors, such as Harris corners, Canny edges, SIFT, and SURF, can be used to extract points of interest for the sparse optical flow calculation.

Motivated by the demonstrated success of optical flow [17, 21], we propose a method based on sparse optical flow to estimate the appearance variation of the object between consecutive frames. The workflow of our method is illustrated in Figure 1. For each frame, we first employ the Canny edge detector to extract suitable and accurate key points in real time. Second, we calculate the optical flow using image pyramids and the Lucas-Kanade algorithm, estimating image velocities with subpixel precision to cope with possibly large displacements. Third, we form an optical flow descriptor composed of the positions of the key points and the magnitudes of the corresponding flow vectors. Finally, we obtain the estimation result via the Euclidean distance.

Sparse Optical Flow Descriptors. Given the frame $F_t$, let $D_t$ represent the description of the optical flow vectors within the frame. For this frame, the sparse optical flow descriptor is given as follows:

$$D_t = \left\{ (x_i, y_i, m_i) \right\}_{i=1}^{N_t},$$

where $(x_i, y_i)$ represents the position of the $i$th corner, $m_i$ is the magnitude of the flow vector at that corner, and $N_t$ denotes the number of corners in the frame $F_t$.

Distance Measure. Having obtained the descriptors, we can estimate the appearance variation of the object by computing the similarity between consecutive descriptors. Let $F_{t-1}$ and $F_t$ be two consecutive frames and $D_{t-1}$, $D_t$ their sparse optical flow descriptors, respectively. The similarity is computed using the Euclidean distance:

$$S_t = \left\| D_t - D_{t-1} \right\|_2,$$

where a larger value indicates faster and heavier appearance variation, and vice versa. Since the time between frames is fixed, the rate of appearance variation can be estimated from the displacement of the object; we therefore estimate it by thresholding the similarity:

$$v_t = \begin{cases} 1, & S_t > \tau_1, \\ 0, & \text{otherwise}, \end{cases}$$

where $\tau_1$ is a predefined threshold.
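The following sketch illustrates the descriptor and distance computation with OpenCV's pyramidal Lucas-Kanade implementation. It is a simplified stand-in: Shi-Tomasi corners (cv2.goodFeaturesToTrack) replace the Canny-based key-point step of Figure 1, the distance is taken over the flow magnitudes of the first min(N_{t-1}, N_t) points since the paper does not specify how descriptors of different lengths are aligned, and applying the fixed threshold presumes suitably normalized magnitudes.

```python
import cv2
import numpy as np

def flow_descriptor(prev_gray, curr_gray, max_corners=200):
    """Sparse optical flow descriptor: rows of (x_i, y_i, m_i)."""
    # Shi-Tomasi corners stand in for the Canny-based key-point step.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.empty((0, 3))
    # Pyramidal Lucas-Kanade estimates velocities with sub-pixel precision.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    ok = status.ravel() == 1
    p0, p1 = pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)
    mags = np.linalg.norm(p1 - p0, axis=1)           # flow magnitudes m_i
    return np.hstack([p0, mags[:, None]])

def variation_flag(desc_prev, desc_curr, tau1=0.15):
    """Binary indicator v_t: Euclidean distance over flow magnitudes vs tau1."""
    n = min(len(desc_prev), len(desc_curr))          # crude length alignment
    if n == 0:
        return 0
    s = np.linalg.norm(desc_curr[:n, 2] - desc_prev[:n, 2])
    return int(s > tau1)
```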

3.3. Sparsity-Based Occlusion Estimation

In order to estimate the possible occlusion, motivated by successful applications of sparse representation [15, 19, 22–24] and its effectiveness in handling occlusion for object tracking, we develop a new sparsity-based occlusion estimation method for object tracking.

In our method, each image patch is sparsely represented by its neighboring patches. If a patch is occluded, its true neighboring patches cannot be found, which causes a large reconstruction error. The reconstruction error can therefore be used to evaluate the occlusion state of each image patch: the larger the reconstruction error of a patch, the greater the possibility that it is occluded. If the error of a patch exceeds a threshold, we regard it as an occluded patch. In this way, we can estimate the occlusion between two consecutive frames. The diagram of our sparsity-based occlusion estimation method is shown in Figure 2.

We use overlapped sliding windows on the images to obtain patches of the same size, and each patch is converted to a vector. For a better description of the algorithm, we use Figure 2 to explain the process. Consider the patch centered at the $i$th row and $j$th column of image $I_t$ and its vectorized form $\mathbf{y}_{ij}$. The sparse coefficient vector of each patch is computed by

$$\boldsymbol{\alpha}_{ij} = \arg\min_{\boldsymbol{\alpha}} \left\| \mathbf{y}_{ij} - \mathbf{D}_{ij}\, \boldsymbol{\alpha} \right\|_2^2 + \lambda_s \left\| \boldsymbol{\alpha} \right\|_1,$$

where $\lambda_s$ is a constant and $\mathbf{D}_{ij}$ is the dictionary made up of specially selected patches from image $I_{t-1}$, which can be denoted as $\mathbf{D}_{ij} = \left[ \mathbf{y}'_{pq} \right]_{|p-i| \le r,\, |q-j| \le c}$, where $\mathbf{y}'_{pq}$ represents the patch centered at the $p$th row and $q$th column of image $I_{t-1}$, $c$ is a constraint parameter in the horizontal direction, and $r$ is a constant.

The reconstruction error can be used to estimate the occlusion. The reconstruction error of the patch with respect to image $I_{t-1}$ is denoted as $e_{ij}$ and can be computed as follows:

$$e_{ij} = \left\| \mathbf{y}_{ij} - \mathbf{D}_{ij}\, \boldsymbol{\alpha}_{ij} \right\|_2^2.$$

Then we obtain the occlusion vector, each element of which is an indicator of occlusion of the corresponding patch:

$$o_{ij} = \begin{cases} 1, & e_{ij} > \tau_2, \\ 0, & \text{otherwise}, \end{cases}$$

where $\tau_2$ is a predefined threshold which determines whether or not the patch is occluded.

From the above equation, the occlusion value of image $I_t$ with respect to image $I_{t-1}$ can be obtained:

$$O_t = \frac{1}{W_r H_r} \sum_{i=1}^{H_r} \sum_{j=1}^{W_r} o_{ij},$$

where $W_r$ and $H_r$ denote the ratio of the width of the image to the width of a patch and the ratio of the height of the image to the height of a patch (i.e., the number of patches per row and per column), respectively.
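A minimal sketch of the per-patch estimation is given below, with scikit-learn's Lasso standing in for the ℓ1 solver. The neighborhood radius, the regularization weight, and the assumption that patches are intensity-normalized (so that a fixed threshold τ2 is meaningful) are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

def occlusion_estimate(curr_patches, prev_patches, radius=1, lam=0.01,
                       tau2=0.20):
    """Per-patch occlusion indicators o_ij and the overall value O_t.

    curr_patches, prev_patches: (Hr, Wr, d) arrays of vectorized patches
    from images I_t and I_{t-1}; radius sets the dictionary neighborhood.
    """
    Hr, Wr, _ = curr_patches.shape
    o = np.zeros((Hr, Wr), dtype=int)
    for i in range(Hr):
        for j in range(Wr):
            # Dictionary D_ij: spatial neighbors of (i, j) in the previous frame.
            rows = range(max(0, i - radius), min(Hr, i + radius + 1))
            cols = range(max(0, j - radius), min(Wr, j + radius + 1))
            D = np.stack([prev_patches[r, c] for r in rows for c in cols],
                         axis=1)
            y = curr_patches[i, j]
            # l1-regularized sparse coding of the patch over its neighbors.
            coder = Lasso(alpha=lam, fit_intercept=False,
                          max_iter=2000).fit(D, y)
            e = np.sum((y - D @ coder.coef_) ** 2)   # reconstruction error e_ij
            o[i, j] = int(e > tau2)                  # occlusion indicator
    return o, o.mean()                               # O_t as in the equation above
```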

3.4. Model Update

In tracking, the object appearance often changes significantly due to factors such as appearance variation and occlusion. Hence it is necessary and important to update the classifier over time. The RET tracker updates both the Dirichlet distribution of weight vectors and the pool of weak classifiers after the classification stage at each time step: it updates the Dirichlet parameters $\alpha$ and $\mathbf{m}$ in a Bayesian manner and updates the pool of weak classifiers using a fixed learning rate parameter. We improve this scheme with an adaptive classifier update that takes into consideration not only the appearance variation of the object but also possible occlusion.

Once a new set of positive and negative samples is identified, a tracker should decide whether or not to use it to update the classifiers. RET updates the pool of classifiers by the following equation:

$$\boldsymbol{\beta}_t = (1 - \lambda)\, \boldsymbol{\beta}_{t-1} + \lambda\, \hat{\boldsymbol{\beta}}_t,$$

where $\boldsymbol{\beta}$ denotes the classifier coefficients and $\lambda$ is a learning rate parameter. RET uses a fixed learning rate, which means that the appearance model is updated without adaptation to specific frames; once the tracker loses the object, the whole model is contaminated in the remaining frames. We tackle this problem by adaptively updating the learning rate parameter $\lambda$.

It is apparent that the models of local patches with fast appearance variation should be updated to learn the target appearance, while the models of local patches with heavy occlusion should not be updated, to avoid introducing errors. The learning rate for the pool of classifiers is therefore set according to the estimates of appearance variation and occlusion. Our classifier update method is defined as follows: for the classifier of patch $(i, j)$ at time $t$,

$$\lambda_{ij} = \begin{cases} 0, & o_{ij} = 1, \\ \lambda_h, & o_{ij} = 0 \text{ and } v_t = 1, \\ \lambda_l, & o_{ij} = 0 \text{ and } v_t = 0, \end{cases}$$

where $\lambda_h > \lambda_l$ are the learning rates under fast and slow appearance variation, respectively, so that occluded patches are frozen and visible patches learn according to the estimated variation.
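The rule can be sketched as follows; the concrete rates lam_h and lam_l below are illustrative placeholders, not the paper's values.

```python
def adaptive_learning_rate(v_t, o_ij, lam_h=0.85, lam_l=0.4):
    """Per-patch learning rate from the variation flag v_t and occlusion flag o_ij.

    lam_h and lam_l are illustrative values, not the paper's.
    """
    if o_ij:                          # occluded patch: freeze the model
        return 0.0
    return lam_h if v_t else lam_l    # learn faster under fast variation

def update_classifier(coef_prev, coef_new, lam):
    """Blend old and newly trained coefficients with the adaptive rate."""
    return (1.0 - lam) * coef_prev + lam * coef_new
```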

We update the model at each time step. Our adaptive classifier update thus keeps the initial model while incorporating new models that learn the target appearance under occlusion and other changes, making our tracker more robust to appearance variation and occlusion.

4. Experiments

To evaluate the performance of our proposed tracker, we test it on 19 publicly available video sequences. The sequences come from the VTD dataset, the MIL dataset, and the visual tracker benchmark [25, 26], and cover the most challenging situations in visual tracking: heavy occlusion, complicated motion, large appearance variation, and so forth. For comparison, we run 11 leading algorithms with the same initial position of the target on the same video sequences. These algorithms are Tracking-Learning-Detection (TLD) [11], visual tracking decomposition (VTD) [3], ℓ1 tracking (ℓ1) [4], compressive tracking (CT) [12], distribution field tracking (DFT) [5], multiple instance learning (MIL) [10], incremental visual tracking (IVT) [2], fragments-based tracking (Frag) [1], Randomized Ensemble Tracking (RET) [9], Sparsity-based Collaborative Model (SCM) [15], and Online AdaBoost (OAB) [8]. Here, MIL, OAB, CT, TLD, and RET are discriminative trackers, while IVT, DFT, Frag, VTD, and ℓ1 are generative trackers; OAB and RET in particular are our baselines. By comparing with these different kinds of methods, we demonstrate the effectiveness and robustness of our method. The proposed algorithm is implemented in MATLAB 2011b on a PC with an Intel(R) Core(TM) i5-4590 processor and 4 GB RAM. For fairness, we use the publicly available source or binary codes and the tracking results provided by the authors.

Note that we fix the parameters of our tracker across all sequences to demonstrate its robustness and effectiveness. We search for the target of interest with a standard sliding-window method, and the object bounding box is divided into a regular grid of small patches. For each patch, we extract a standard histogram of oriented gradients (HOG) descriptor, which yields a 100-dimensional feature vector. Each weak classifier, corresponding to a local patch, is a standard linear SVM trained with its own buffer of 49 positive and 50 negative examples. The sample buffers and the weak classifiers are initialized with the ground truth bounding box in the first frame. During tracking, whenever a new example is added to the buffer, the weak classifier is retrained. The combination of HOG and SVM performs well in computer vision [24]. The threshold $\tau_1$ in Section 3.2 is set to 0.15, and the threshold $\tau_2$ in Section 3.3 is set to 0.20, according to our experiments.
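As a sketch of this setup, one weak classifier per patch can be implemented as a linear SVM over HOG features with fixed-size sample buffers. The HOG parameters and class interface below are illustrative; only the buffer sizes (49 positive, 50 negative) and the retrain-on-insert behavior follow the description above.

```python
from collections import deque
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

class PatchWeakClassifier:
    """A weak classifier for one local patch: linear SVM over HOG features."""

    def __init__(self):
        self.pos = deque(maxlen=49)   # buffer of positive examples
        self.neg = deque(maxlen=50)   # buffer of negative examples
        self.svm = LinearSVC()

    def _features(self, patch):
        # Illustrative HOG parameters, not the paper's exact configuration.
        return hog(patch, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))

    def add_example(self, patch, label):
        feat = self._features(patch)
        (self.pos if label == 1 else self.neg).append(feat)
        # Retrain whenever a new example enters the buffer, as described above.
        if self.pos and self.neg:
            X = np.vstack([*self.pos, *self.neg])
            y = np.hstack([np.ones(len(self.pos)), -np.ones(len(self.neg))])
            self.svm.fit(X, y)

    def predict(self, patch):
        feat = self._features(patch)
        return int(self.svm.predict(feat[None])[0])   # +1 or -1
```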

4.1. Quantitative Evaluation

To quantitatively evaluate the performance of each tracker, we use three widely accepted evaluation metrics: the successful tracking rate (STR), the average center location error (ACLE), and the average overlap rate (AOR). The successful tracking rate is the ratio of the number of successfully tracked frames to the total number of frames in the sequence; a frame is labeled as successfully tracked if the overlap between the predicted and ground truth bounding boxes exceeds 0.5. The center location error is the Euclidean distance between the center of the tracking result and that of the ground truth in each frame. The overlap rate is based on the Pascal VOC criterion: given the tracked bounding box $R_T$ and the ground truth bounding box $R_G$, the overlap rate is computed by

$$\text{score} = \frac{\operatorname{area}(R_T \cap R_G)}{\operatorname{area}(R_T \cup R_G)}.$$

To evaluate the tracking performance of the algorithms, we compute the average center location error and the average overlap rate across all frames of each video sequence, as done in most tracking literature. For fairness, we evaluate our proposed algorithm by averaging over five runs. Due to space limitations, we list the results of the 11 leading algorithms. Tables 1 and 2 report the quantitative comparison results.
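Both bounding-box metrics are standard and can be computed as follows, with boxes given as (x, y, w, h):

```python
def overlap_rate(box_t, box_g):
    """Pascal VOC overlap: area(RT ∩ RG) / area(RT ∪ RG)."""
    xa = max(box_t[0], box_g[0])
    ya = max(box_t[1], box_g[1])
    xb = min(box_t[0] + box_t[2], box_g[0] + box_g[2])
    yb = min(box_t[1] + box_t[3], box_g[1] + box_g[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(overlaps, thr=0.5):
    """STR: fraction of frames whose overlap exceeds the 0.5 threshold."""
    return sum(o > thr for o in overlaps) / len(overlaps)
```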

Tables 1 and 2 show the AOR, STR, and ACLE obtained by the tracking algorithms on the 19 video sequences. Our proposed algorithm achieves the best overall performance on 12 sequences and the second best on 2 sequences according to the STR metric. Compared with RET, our algorithm achieves better overall performance on 17 sequences, the exceptions being the coke and basketball sequences, which involve rotation and fast motion. In the girl, singer1, singer2, and shaking sequences, which involve heavy occlusion and appearance variation, the proposed algorithm outperforms RET because the adaptive classifier update combines the state of occlusion and appearance variation at each time step. Compared with the other methods, our algorithm ranks among the top three on almost all sequences and is more effective and robust than the other state-of-the-art algorithms, especially under significant appearance variation and heavy occlusion.

4.2. Qualitative Evaluation

We also plot some tracking results of 11 trackers on 12 video sequences for qualitative comparison as shown in Figure 3. The results are discussed based on the main challenging factors such as occlusion and appearance variation.

Occlusion. In the david, faceocc1, faceocc2, tiger1, skating1, liquor, and singer1 sequences, occlusion is the main challenge. In Figure 3(b), except for TLD and our tracker, the other algorithms fail to track the target at frames 1372 and 1478 after it is occluded. At frame 1584, our method and RET locate the liquor bottle, while the others cannot relocate it. In the tiger1 sequence, when the tiger is occluded by the distractor, all the other trackers fail to locate the object, but our proposed method handles the problem well and tracks the target accurately, as shown in Figure 3(k). In the other sequences with occlusion, our tracker achieves at least the second best performance throughout. Once the target is partially or heavily occluded, our method is able to relocate it because the adaptive classifier update takes occlusion into consideration.

Scale Variation. The shaking, singer1, and skating1 sequences contain significant scale change. In the skating1 sequence, a player is occluded by other players at frame 206, and CT, ℓ1, and IVT fail to track the player. All the other approaches drift at frames 289 and 339, but our tracker, VTD, and RET successfully locate the target, with our tracker locating it better than VTD at frame 339, as shown in Figure 3(i). In Figure 3(g), CT and Frag fail to track the singer at frame 145, while the others lock onto the target. At frames 220 and 324, the scale of the singer changes significantly; most trackers lose the singer, but TLD, RET, and our tracker keep tracking the object. Our tracker achieves the best and second best performance in the singer1 and skating1 sequences, respectively.

Shape Change. In the shaking, david, singer2, and tiger1 sequences, the objects undergo heavy shape variation, especially in shaking and singer2. In Figure 3(a), at frame 48 all the other methods start to drift, but our proposed approach locates the singer well. At frame 195, only Frag and our tracker still track the target well. Under heavier shape change, Frag fails to track the singer at frame 284, while our tracker still tracks the target object well. In the shaking sequence, due to heavy shape variation caused by sharp illumination change, OAB, CT, ℓ1, and Frag lose the target at frame 60. At frame 119, IVT, DFT, and TLD also lose the object, but SCM, MIL, VTD, and our tracker locate it accurately. Under heavier shape change, only MIL, SCM, and our tracker can track the target, as shown in Figure 3(l).

As discussed under scale variation and shape change, our tracker handles appearance variation better than the other methods.

5. Conclusion

Based on RET, we propose a novel adaptive classifier update method. Instead of the traditional model update, our method takes appearance variation and possible occlusion into consideration at each time step: we estimate appearance variation using sparse optical flow, estimate possible occlusion using a sparsity-based method, and combine the two estimates to adaptively update the classifier and model. Extensive experimental results and evaluations against several state-of-the-art methods on challenging video sequences demonstrate the effectiveness and robustness of the proposed algorithm. In the future, we are interested in extracting additional kinds of features and fusing them. We will also explore integrating a strong motion prediction model to address the sensitivity of tracking-by-detection methods to gating parameters.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported in part by the Natural Science Foundation of China (nos. 61272195, 61472055, and U1401252), the Program for New Century Excellent Talents in University of China (NCET-11-1085), the Chongqing Outstanding Youth Fund (cstc2014jcyjjq40001), and the Chongqing Research Program of Application Foundation and Advanced Technology (cstc2012jjA40036).