Special Issue: Machine Learning in Intelligent Video and Automated Monitoring
Part-Based Visual Tracking via Online Weighted P-N Learning
We propose a novel part-based tracking algorithm using online weighted P-N learning. The weighted P-N learning method takes the weight of each sample into account during classification, which improves the performance of the classifier. We apply weighted P-N learning to track a part-based target model instead of the whole target. To do so, the object is segmented into fragments, and some of them are selected as local feature blocks (LFBs). Weighted P-N learning is then employed to train a classifier for each LFB, and each LFB is tracked through its corresponding classifier. The object is then located according to the tracking results of the LFBs. During tracking, to handle occlusion and pose change, we use a substitution strategy to dynamically update the set of LFBs, which makes our tracker robust. Experimental results demonstrate that the proposed method outperforms state-of-the-art trackers.
1. Introduction
Object tracking is one of the most important components of many applications in computer vision, such as human-computer interaction, surveillance, and robotics. However, robust visual tracking is still a challenging problem, which is affected by partial or full occlusion and by variation in illumination, scale, and pose. The key to object tracking is to construct an effective appearance model. Many tracking algorithms have been proposed recently, but designing a robust appearance model remains a major challenge because it is affected by both extrinsic factors (e.g., illumination variation, background clutter, and partial or full occlusion) and intrinsic factors (e.g., scale and pose variation). To handle these problems, researchers have presented a wide range of appearance models based on different visual representations and statistical modeling techniques. In general, these appearance models can be categorized into two types: appearance models based on visual representation, such as global representations [2–6] and local representations [7–11], and appearance models based on statistical modeling, such as generative models [12–16] and discriminative models [5, 6, 17–20].
In this paper, we propose a part-based visual tracking algorithm with online weighted P-N learning. We first propose weighted P-N learning, which assigns weights (a property weight and a classification weight) to each sample in the training set; this decreases false classifications and improves the discriminative power of the classifier. Then, we segment the object into fragments and select some of them as local feature blocks to represent the object. Finally, we train a classifier for each LFB with weighted P-N learning and track each LFB independently within the framework of Lucas-Kanade optical flow. During tracking, a real-time validity check is applied to each LFB. If an LFB is invalid, we use a replacement strategy to update the local feature block set, which helps the tracker continue tracking successfully.
Contributions. The contributions of this paper are as follows.
(i) A part-based visual tracking algorithm with online weighted P-N learning is proposed. The object is represented by LFBs, which are tracked individually. When occlusion or distortion happens, a strategy is adopted to replace each invalid LFB and keep the LFB set effective.
(ii) We define the weights (a property weight and a classification weight) of each sample in the training process of P-N learning.
(iii) An online weighted P-N learning method is presented that assigns these weights to each sample in the training set, which improves the discriminative power of the classifier by decreasing classification errors and thus increases the accuracy of the tracker.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces weighted P-N learning. The proposed tracking method is presented in detail in Section 4. Experimental results are shown in Section 5. Section 6 concludes the paper.
2. Related Work
Recently, many trackers based on local feature representation have been proposed. Adam et al. present a fragment-based tracking approach, and Wang et al. further embed the fragment-based method into the mean shift tracking framework. This method estimates the target from the voting map of each part, obtained by comparing its histogram with the template’s. Nevertheless, a static template with equal importance assigned to each fragment lowers the performance of the tracker. To overcome this shortcoming, Jia et al. propose a fragment-based tracking method using online multiple kernel learning (MKL), in which each patch is assigned a weight based on the importance learned by MKL. However, this strategy may still suffer from drift. In particular, occlusion, which makes some patches invalid, leads to errors in computing the voting map and even to tracking failure. Wang et al. introduce a tracking method based on superpixels. It only computes the probability of each superpixel belonging to the target, so it is prone to drift in backgrounds with similar colors, and its tracking result shrinks to the unoccluded part of the object when occlusion happens.
Another type of tracking method is based on a discriminative appearance model. Tang et al. present a tracking method based on semisupervised support vector machines. This tracker employs a small number of labeled samples for semisupervised learning and develops a classifier to label the unlabeled data. Babenko et al. propose a multiple instance learning (MIL) method for visual tracking. This approach addresses the problem that slight inaccuracies in the tracker lead to incorrectly labeled training samples, and it can alleviate drift to some extent. However, the MIL tracker may select less important positives because it does not consider the importance of each sample in the learning process. Zhang and Song therefore suggest a weighted multiple instance learning (WMIL) tracking method, which assigns a weight to each sample based on its importance and improves the robustness of the tracker. Kalal et al. propose a method called P-N learning, which learns from positive and negative samples to construct a classifier. The discriminative properties of the classifier are improved by two categories of constraints, termed P-constraints and N-constraints. However, false classifications in P-N learning degrade the classifier to some degree.
Another work similar to ours also utilizes blocks to represent the object. However, its blocks are easily invalidated when the target appearance changes, which undermines its robustness to nonrigid distortion and occlusion. In our work, we employ LFBs to represent the target and use a dynamic updating mechanism for the local feature block set, which keeps the LFBs in the set valid when occlusion or deformation occurs. Hence, our tracker is more robust and effective.
3. Weighted P-N Learning
P-N learning is a semisupervised online learning algorithm proposed by Kalal et al. [17, 24–26]. Let $x$ be a sample in feature space $\mathcal{X}$, and let $y$ be a label in label space $\mathcal{Y} = \{-1, 1\}$. A set of samples $X$ and the corresponding set of labels $Y$ are denoted by $(X, Y)$, which is termed a labeled set. The aim of P-N learning is to develop a binary classifier $f: \mathcal{X} \to \mathcal{Y}$ based on an a priori labeled set $(X_l, Y_l)$ and to improve its discriminative performance using unlabeled data $X_u$. The flowchart of the P-N learning approach is shown in Figure 1.
3.1. Classifier Bootstrapping
The binary classifier $f$ is a function parameterized by $\Theta$. As in supervised learning, P-N learning estimates the parameter $\Theta$ from a training sample set $(X_t, Y_t)$. Note, however, that the training set is iteratively expanded by adding samples from the unlabeled data that have been screened by the constraints. Initially, the classifier and its parameter $\Theta^0$ are obtained by training on the labeled samples. The process then proceeds iteratively. In iteration $k$, all unlabeled samples are labeled by the classifier from iteration $k-1$; namely,
$$y_u^k = f(x_u \mid \Theta^{k-1}) \quad \text{for all } x_u \in X_u.$$
The constraints are then used to revise the classification results, and the samples with corrected labels are added to the training set. The iteration ends with retraining the classifier on the updated training set.
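The bootstrapping loop described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the one-dimensional threshold classifier (`train_threshold`) and the oracle-like constraint function (`toy_constraints`) are hypothetical stand-ins, not the fern classifier or the structural constraints used in the paper.

```python
def train_threshold(training):
    """Toy 1-D classifier (assumption): threshold halfway between class means."""
    positives = [x for x, y in training if y == 1]
    negatives = [x for x, y in training if y == -1]
    threshold = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2.0
    return lambda x: 1 if x > threshold else -1


def toy_constraints(predictions):
    """Hypothetical constraints: samples above 1.0 must be positive, others negative.

    Returns only the samples whose classifier label contradicts the constraints,
    paired with the corrected label.
    """
    corrected = []
    for x, y in predictions:
        required = 1 if x > 1.0 else -1
        if y != required:
            corrected.append((x, required))
    return corrected


def pn_bootstrap(labeled, unlabeled, train, constraints, iterations):
    """Train, label the unlabeled data, correct labels with constraints, retrain."""
    training = list(labeled)
    classifier = train(training)
    for _ in range(iterations):
        predictions = [(x, classifier(x)) for x in unlabeled]
        training.extend(constraints(predictions))  # add corrected samples
        classifier = train(training)               # retrain on the expanded set
    return classifier, training


classifier, training = pn_bootstrap(
    labeled=[(4.0, 1), (0.0, -1)],
    unlabeled=[1.5, 1.8, 0.5, 3.0],
    train=train_threshold,
    constraints=toy_constraints,
    iterations=3,
)
```

In this toy run the first iteration corrects two samples that the initial classifier mislabeled, after which the retrained classifier agrees with the constraints and the training set stops growing.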
During training, the classifier may assign wrong labels to unlabeled data. Any sample may be screened by the constraints many times and hence may repeatedly enter the training set with a mistaken label. This can significantly degrade the discriminative performance of the classifier and therefore lower the tracking accuracy. To further improve the accuracy and robustness of the classifier, we propose a weighted P-N learning method that assigns weights to each sample in the training set. Each sample in the training set has two weights, termed the P-weight and the N-weight. The P-weight represents the probability of the sample being positive, and the N-weight represents the probability of it being negative. Up to iteration $k$, a sample $x$ from the training set has been labeled positive a certain number of times and negative a certain number of times; its positive and negative property weights are determined from these two counts. In addition, the probabilities of the sample being positive or negative, as obtained by the classifier, are defined as its classification weights (see Section 3.3). The P-weight and N-weight of the sample are obtained by combining the property weights with the classification weights, and the sample is finally determined to be positive or negative by comparing its P-weight with its N-weight.
Figure 2 shows tracking results with weighted P-N learning. In Figure 2, the left and middle images show the ground truth and the tracking results with weighted P-N learning, and the right images show the tracking results with the original P-N learning.
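A minimal sketch of how such a weighted decision could work is given below. The exact formulas are assumptions made for illustration: the property weights are taken to be the normalized positive and negative label counts, the classification weights are taken to be the classifier posterior and its complement, and the two are combined by multiplication. The function names are hypothetical.

```python
def property_weights(pos_count, neg_count):
    """Assumed form: normalized counts of positive and negative labelings."""
    total = pos_count + neg_count
    if total == 0:
        return 0.5, 0.5  # no history yet: no preference (assumption)
    return pos_count / total, neg_count / total


def weighted_label(pos_count, neg_count, posterior_pos):
    """Combine property and classification weights, then pick the larger weight.

    posterior_pos is the classifier's probability that the sample is positive;
    combining by product is an illustrative assumption.
    """
    w_pos, w_neg = property_weights(pos_count, neg_count)
    p_weight = w_pos * posterior_pos
    n_weight = w_neg * (1.0 - posterior_pos)
    return 1 if p_weight >= n_weight else -1


# A sample labeled positive 3 times and negative once, with a neutral posterior.
label_a = weighted_label(3, 1, 0.5)
# A sample labeled negative 3 times and positive once, with a mildly positive posterior.
label_b = weighted_label(1, 3, 0.6)
```

Under these assumptions, a sample that has mostly been labeled negative stays negative even when a single classifier evaluation leans positive, which is the kind of label noise the weighting is meant to damp.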
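In P-N learning, a constraint can be an arbitrary function; in particular, two categories of constraints are considered, which we term P-constraints and N-constraints. P-constraints identify samples that the classifier labeled negative but that the constraints require to be positive. In iteration $k$, P-constraints add $n^+(k)$ samples to the training set. Similarly, N-constraints identify samples that were classified as positive but that the constraints require to be negative. In iteration $k$, N-constraints add $n^-(k)$ samples to the training set.
In iteration $k$, the error of the classifier is represented by the number of false positives $\alpha(k)$ and the number of false negatives $\beta(k)$. Let $n_c^+(k)$ be the number of samples whose label is correctly changed to positive in iteration $k$ by P-constraints, and let $n_f^+(k)$ be the number of samples whose label is incorrectly changed to positive in iteration $k$. Hence, P-constraints change $n^+(k) = n_c^+(k) + n_f^+(k)$ samples to positive. Similarly, N-constraints change $n^-(k) = n_c^-(k) + n_f^-(k)$ samples to negative, where $n_c^-(k)$ and $n_f^-(k)$ are the correct and false assignments. The errors of the classifier then evolve as
$$\alpha(k+1) = \alpha(k) - n_c^-(k) + n_f^+(k), \tag{5}$$
$$\beta(k+1) = \beta(k) - n_c^+(k) + n_f^-(k). \tag{6}$$
Equation (5) shows that false positives decrease if $n_c^-(k) > n_f^+(k)$. Similarly, equation (6) shows that false negatives decrease if $n_c^+(k) > n_f^-(k)$. To analyze the convergence of the learning process, a model is needed that relates the performance of the P-N constraints to $n_c^+(k)$, $n_f^+(k)$, $n_c^-(k)$, and $n_f^-(k)$.
The performance of the P-N constraints is represented by four indexes, P-precision $P^+$, P-recall $R^+$, N-precision $P^-$, and N-recall $R^-$, defined as
$$P^+(k) = \frac{n_c^+(k)}{n_c^+(k) + n_f^+(k)}, \quad R^+(k) = \frac{n_c^+(k)}{\beta(k)}, \quad P^-(k) = \frac{n_c^-(k)}{n_c^-(k) + n_f^-(k)}, \quad R^-(k) = \frac{n_c^-(k)}{\alpha(k)}. \tag{7}$$
From formulation (7), it follows that
$$n_c^+(k) = R^+(k)\,\beta(k), \quad n_f^+(k) = \frac{1 - P^+(k)}{P^+(k)}\,R^+(k)\,\beta(k), \quad n_c^-(k) = R^-(k)\,\alpha(k), \quad n_f^-(k) = \frac{1 - P^-(k)}{P^-(k)}\,R^-(k)\,\alpha(k). \tag{8}$$
By combining formulations (5), (6), and (8), we obtain
$$\alpha(k+1) = (1 - R^-)\,\alpha(k) + \frac{1 - P^+}{P^+}\,R^+\,\beta(k), \qquad \beta(k+1) = \frac{1 - P^-}{P^-}\,R^-\,\alpha(k) + (1 - R^+)\,\beta(k). \tag{9}$$
Defining the state vector and the transition matrix as
$$\mathbf{x}(k) = \begin{bmatrix} \alpha(k) \\ \beta(k) \end{bmatrix}, \qquad \mathbf{M} = \begin{bmatrix} 1 - R^- & \dfrac{1 - P^+}{P^+}\,R^+ \\ \dfrac{1 - P^-}{P^-}\,R^- & 1 - R^+ \end{bmatrix}, \tag{10}$$
formulation (9) can be rewritten as
$$\mathbf{x}(k+1) = \mathbf{M}\,\mathbf{x}(k). \tag{11}$$
Formulation (11) is a recursive equation that corresponds to a discrete dynamical system. Based on the theory of dynamical systems, the state vector $\mathbf{x}$ converges to zero if the eigenvalues $\lambda_1$ and $\lambda_2$ of the transition matrix $\mathbf{M}$ satisfy $\lambda_1 < 1$ and $\lambda_2 < 1$. Accordingly, the performance of the classifier improves continually as long as both eigenvalues of the transition matrix are smaller than one.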
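The convergence condition can be checked numerically. The sketch below assumes the transition matrix of the original P-N learning analysis by Kalal et al., written in terms of P-precision, P-recall, N-precision, and N-recall, with all four held constant over iterations; the function names and the numeric values are illustrative only.

```python
import cmath


def transition_matrix(p_precision, p_recall, n_precision, n_recall):
    """2x2 matrix mapping (false positives, false negatives) at k to k + 1."""
    return [
        [1.0 - n_recall, (1.0 - p_precision) / p_precision * p_recall],
        [(1.0 - n_precision) / n_precision * n_recall, 1.0 - p_recall],
    ]


def eigenvalues_2x2(m):
    """Closed-form eigenvalues of a 2x2 matrix via trace and determinant."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


def is_stable(m):
    """Errors shrink to zero when both eigenvalues have magnitude below one."""
    return all(abs(value) < 1.0 for value in eigenvalues_2x2(m))


def simulate_errors(m, alpha, beta, steps):
    """Iterate the error recursion for a number of steps."""
    for _ in range(steps):
        alpha, beta = (
            m[0][0] * alpha + m[0][1] * beta,
            m[1][0] * alpha + m[1][1] * beta,
        )
    return alpha, beta


good = transition_matrix(0.9, 0.6, 0.9, 0.6)  # precise constraints
bad = transition_matrix(0.3, 0.9, 0.3, 0.9)   # imprecise constraints
final_alpha, final_beta = simulate_errors(good, 100.0, 100.0, 20)
```

With precise constraints both eigenvalues lie below one and the simulated errors decay toward zero; with imprecise constraints one eigenvalue exceeds one and the errors would grow.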
3.3. Object Detecting
The previous subsections described the weighted P-N learning method. In this subsection, a classifier is developed to detect the object. As in the original P-N learning work, we use a scanning-window strategy to detect the object.
In this paper, the randomized forest classifier is adopted. For each input subwindow, the classifier consists of an ensemble of ferns. Each fern $i$ evaluates the input patch, resulting in a feature vector $x_i$, which is used to obtain the posterior probability $P(y = 1 \mid x_i)$. The input patch is labeled as the object if the average posterior $\bar{P}$ exceeds a threshold $\theta$ and as background otherwise, where $\bar{P}$ denotes the average of all posteriors and $\theta$ is set to 0.4 in all experiments. The detection process is illustrated in Figure 3. The feature vector is represented by 2-bit Binary Patterns because of their invariance to illumination and their efficient multiscale implementation using integral images. The posteriors represent the parameters of the classifier and are estimated incrementally through the entire learning process. Each leaf node of a fern records the number of positive samples $\#p$ and negative samples $\#n$ that fell into it during the iterations. The posteriors are then estimated by
$$P(y = 1 \mid x_i) = \frac{\#p}{\#p + \#n}.$$
The classifier is initialized in the first frame: the posteriors are initialized to zero and then updated with 500 positive samples produced by affine warping of the selected patch. The classifier is then evaluated on all patches. Detections far from the selected patch are treated as negative samples and are used to update the posteriors.
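The detection rule can be illustrated with a short Python sketch. It assumes that each fern has already mapped the patch to one leaf and that we hold that leaf's positive and negative counts; the function names are hypothetical, and returning a zero posterior for an empty leaf is an assumption that mirrors initializing posteriors to zero.

```python
def leaf_posterior(num_pos, num_neg):
    """Posterior of the object class in one leaf: positives over all samples."""
    total = num_pos + num_neg
    return num_pos / total if total else 0.0  # empty leaf: zero (assumption)


def classify_patch(leaf_counts, threshold=0.4):
    """Average the fern posteriors and compare against the threshold.

    leaf_counts holds one (num_pos, num_neg) pair per fern, taken from the
    leaf that the patch falls into. Returns (average posterior, is_object).
    """
    posteriors = [leaf_posterior(p, n) for p, n in leaf_counts]
    average = sum(posteriors) / len(posteriors)
    return average, average > threshold


object_score, object_flag = classify_patch([(8, 2), (3, 7), (5, 5)])
background_score, background_flag = classify_patch([(1, 9), (0, 0), (2, 8)])
```

The first patch averages posteriors of 0.8, 0.3, and 0.5 and is accepted at the 0.4 threshold, while the second averages 0.1, 0.0, and 0.2 and is rejected.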
4. Proposed Tracking Method
In this paper, the object is represented by independent local feature blocks, so the tracking task is transformed into tracking each local feature block. We train a classifier for each LFB with online weighted P-N learning and then track each LFB independently within the framework of LK optical flow. During tracking, a real-time validity check is applied to each LFB. If an LFB is invalid, it is replaced with an unused block, which makes our tracker robust. Figure 4 illustrates the principle of tracking.
4.1. Set of Local Feature Blocks
The object is represented by LFBs and therefore first needs to be segmented into fragments. For simplicity, uniform segmentation is adopted in this paper, as shown in Figure 5.
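Uniform segmentation amounts to cutting the object's bounding box into a regular grid. The sketch below is a minimal illustration; the grid size (`rows`, `cols`) and the top-left box convention are assumptions, since the text does not state how many blocks are used.

```python
def segment_uniform(x, y, width, height, rows, cols):
    """Split a bounding box into rows * cols equal blocks.

    (x, y) is the top-left corner of the object box (assumed convention).
    Returns blocks as (x, y, width, height) tuples in row-major order.
    """
    block_w = width / cols
    block_h = height / rows
    return [
        (x + c * block_w, y + r * block_h, block_w, block_h)
        for r in range(rows)
        for c in range(cols)
    ]


# Example: a 40 x 60 object box cut into a 3 x 2 grid of 20 x 20 blocks.
blocks = segment_uniform(0, 0, 40, 60, rows=3, cols=2)
```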
After segmentation, we select a subset of the blocks as LFBs. Each block obtained by segmentation is a candidate LFB, so the candidate set contains as many candidates as there are blocks. For each candidate LFB, we compute its 2-bit Binary Patterns feature vector. A scanning-window method is then used to compute the similarity likelihood between the feature vector of each input patch and that of the candidate, and we record the highest similarity between the candidate and all input patches. Finally, the local feature block set is formed from a fixed number of candidate local feature blocks with the smallest highest-similarity values.
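The selection step can be sketched as follows: for every candidate block, take the highest similarity it reaches against the scanned patches, then keep the candidates whose peak similarity is smallest, that is, the most distinctive ones. The data layout, the function name, and the assumption that a block's own location is excluded from the scan are illustrative choices.

```python
def select_lfbs(similarities, count):
    """Pick the `count` candidates with the smallest peak similarity.

    similarities maps a candidate block id to the list of similarity values
    between that block and each scanned input patch (the block's own location
    is assumed to be excluded from the scan).
    """
    peak = {block: max(values) for block, values in similarities.items()}
    return sorted(peak, key=peak.get)[:count]


scores = {
    "a": [0.2, 0.9],  # resembles some other patch strongly
    "b": [0.1, 0.3],  # distinctive
    "c": [0.5, 0.4],
    "d": [0.7, 0.8],
}
selected = select_lfbs(scores, count=2)
```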
In frame $t$, the object is represented by the coordinates of its center position and by its size (width and height). Each LFB is represented in the same way, by the coordinates of its center location and by its size. The offset of an LFB relative to the object consists of the offsets of the center coordinates and the ratios of the sizes between the object and the LFB. This offset is given by formulation (14).
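As a concrete illustration, the offset can be computed as below. The sign and direction conventions are assumptions: the center offset is taken as object center minus LFB center, and the size ratio as object size divided by LFB size. The `Box` type and `lfb_offset` function are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Center coordinates and size of an object or of a local feature block."""
    cx: float
    cy: float
    w: float
    h: float


def lfb_offset(obj, lfb):
    """Offset of an LFB relative to the object (assumed conventions).

    Returns (dx, dy, rw, rh): center offsets (object minus LFB) and
    size ratios (object over LFB).
    """
    return (obj.cx - lfb.cx, obj.cy - lfb.cy, obj.w / lfb.w, obj.h / lfb.h)


example_offset = lfb_offset(Box(100, 80, 40, 60), Box(90, 65, 20, 20))
```

Any consistent convention works, as long as the same one is used later when the object is recovered from a tracked LFB.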
4.3. Object Tracking
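The tracked target is specified in the initial (first) frame and segmented into fragments as described in Section 4.1. Then, we select the local feature blocks and compute the representation of the object, the representation of each LFB, and the offset of each LFB relative to the object.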
Assume the current frame is $t$. Each LFB is associated with a classifier trained by weighted P-N learning, and we track each LFB with it. For the $i$th LFB, we obtain its tracking result in frame $t$. By combining the tracking result of each LFB with its offset, we obtain a corresponding candidate object via formulation (15); that is, each LFB determines one candidate object. The object is finally located by averaging the candidate objects over all LFBs, as in formulation (16). The entire process is illustrated in Figure 6. The offset of each LFB relative to the object then needs to be adjusted: we divide the new object region into fragments and compute the new offset of each LFB from the prior LFBs based on formulation (14). After this, the classifier of each LFB is retrained via weighted P-N learning.
Representation by LFBs has significant advantages. First, compared with the entire object, a local feature block is easier to distinguish from the background, which improves the accuracy and stability of tracking. Second, the object is located by averaging the tracking results of all local feature blocks, so positive and negative errors partly cancel out, which reduces the tracking error and improves the robustness of the proposed algorithm.
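The two steps, recovering a candidate object from each tracked LFB and averaging the candidates, can be sketched as follows. The offset convention is an assumption (center offsets are added to the LFB center, and size ratios multiply the LFB size), and the function names are hypothetical.

```python
def candidate_object(lfb, offset):
    """Candidate object implied by one tracked LFB and its stored offset.

    lfb is (cx, cy, w, h); offset is (dx, dy, rw, rh) with the assumed
    convention object = LFB center + (dx, dy), LFB size * (rw, rh).
    """
    cx, cy, w, h = lfb
    dx, dy, rw, rh = offset
    return (cx + dx, cy + dy, w * rw, h * rh)


def locate_object(candidates):
    """Average the candidate objects component-wise."""
    count = len(candidates)
    return tuple(sum(c[i] for c in candidates) / count for i in range(4))


tracked = [(95, 70, 20, 20), (120, 100, 10, 30)]
offsets = [(10, 15, 2.0, 3.0), (-11, -11, 4.0, 2.0)]
candidates = [candidate_object(b, o) for b, o in zip(tracked, offsets)]
located = locate_object(candidates)
```

Here the two LFBs vote for centers (105, 85) and (109, 89); averaging gives (107, 87), so individual tracking errors of opposite sign partly cancel.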
4.4. Updating Set of Local Feature Blocks
During tracking, updating is essential for adapting to complex environmental variation. In this paper, the learning procedure is online, so the main problem is how to handle local feature blocks that become invalid. A replacement strategy is adopted to solve this problem. During tracking, we apply a real-time validity check (see Section 3.3) to all local feature blocks. If an LFB fails this check, it is considered invalid. An invalid local feature block is replaced with an appropriate block selected from outside the LFB set.
Let the unused block set consist of the blocks that are not currently in the LFB set. Suppose that, in the current frame $t$, an LFB from the LFB set is invalid and needs to be replaced. We first segment the object into blocks and obtain the unused block set. For each block in the unused set, we compute a similarity likelihood between the feature vector of the block in frame $t$ and the feature vector of the same block recorded when it was last used; if the block has never been used, the likelihood is set to one. The replacement block is then selected from the unused set according to formulation (17). The whole update process is illustrated in Figure 7.
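A minimal sketch of the replacement step is shown below. Two points are assumptions rather than details from the text: the similarity likelihood is modeled as the fraction of matching feature entries, and the replacement is taken to be the unused block with the highest likelihood. All names are hypothetical.

```python
def similarity(current, previous):
    """Assumed similarity: fraction of positions where two feature vectors agree."""
    if len(current) != len(previous):
        raise ValueError("feature vectors must have the same length")
    matches = sum(1 for a, b in zip(current, previous) if a == b)
    return matches / len(current)


def pick_replacement(unused, current_features, last_features):
    """Choose the unused block with the highest similarity likelihood.

    last_features holds the feature vector from the last time a block was
    used; blocks that were never used get a likelihood of one.
    """
    best = None
    for block in unused:
        previous = last_features.get(block)
        score = 1.0 if previous is None else similarity(current_features[block], previous)
        if best is None or score > best[1]:
            best = (block, score)
    return best


current = {"u1": [0, 1, 2, 3], "u2": [1, 1, 2, 0], "u3": [3, 3, 0, 0]}
last = {"u1": [0, 1, 0, 0], "u2": [1, 1, 2, 3]}
choice_used = pick_replacement(["u1", "u2"], current, last)
choice_fresh = pick_replacement(["u1", "u2", "u3"], current, last)
```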
So far, we have introduced the overall procedure of the proposed tracking algorithm as shown in Algorithm 1.
5. Experimental Results
To evaluate the performance of our tracking algorithm, we test our tracker on thirteen challenging image sequences. These sequences cover most of the challenging situations in visual tracking, as shown in Table 1. For comparison, we run six state-of-the-art tracking algorithms with the same initial object position: the ℓ1, FG, IVT, MIL, TLD, and CT trackers. Some representative results are shown in this section.
5.1. Quantitative Comparison
Figure 8 shows the center location error of each tracker on the thirteen test sequences. Overall, the proposed tracker achieves lower center location error than the compared state-of-the-art algorithms.
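For reference, the center location error is the Euclidean distance, in pixels, between the predicted object center and the ground-truth center in each frame. A minimal sketch follows; the function names and the sample coordinates are illustrative and are not taken from the experiments.

```python
import math


def center_location_errors(predicted, ground_truth):
    """Per-frame Euclidean distance between predicted and true centers."""
    if len(predicted) != len(ground_truth):
        raise ValueError("sequences must have the same number of frames")
    return [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]


def mean_center_location_error(predicted, ground_truth):
    """Average center location error over a whole sequence."""
    errors = center_location_errors(predicted, ground_truth)
    return sum(errors) / len(errors)


frame_errors = center_location_errors([(10, 10), (13, 14)], [(10, 10), (10, 10)])
sequence_error = mean_center_location_error([(10, 10), (13, 14)], [(10, 10), (10, 10)])
```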
5.2. Qualitative Comparison
Heavy Occlusion. Occlusion is one of the most common and critical issues in visual tracking. We test four image sequences (Woman, Subway, PersonFloor, and Occlusion1) characterized by severe occlusion or long-term partial occlusion. Figure 9(a) demonstrates the robustness of the proposed tracking method in handling occlusion. In the proposed algorithm, the object is represented by local feature blocks. When occlusion happens, the other tracking algorithms cannot track the object well because they tend to absorb background into the object model. In contrast, our tracker replaces the invalid local feature block with a new unused block when occlusion occurs, which keeps the local feature block set effective so that tracking can continue.
Figure 9: (a) Woman, Subway, Occlusion1, and PersonFloor with heavy occlusion; (b) OneLSR, OSOW2cor, Juice, and Cup with scale variation; (c) Jumping, Lemming, and Deer with fast motion and motion blur; (d) DavidIndoor and DavidOutdoor with illumination changes.
Scale Variation. Figure 9(b) presents the tracking results on four image sequences (OneLSR, OSOW2cor, Juice, and Cup) with large scale variation and, in some cases, slight rotation. Our tracker follows the object throughout the whole sequences, which can be attributed to the discriminative classifier based on weighted P-N learning. We also observe that local feature blocks represent the object better, which makes the tracker focus on the stable parts of the object.
Fast Motion and Motion Blur. Figure 9(c) shows experimental results on three challenging sequences (Deer, Lemming, and Jumping). Because the target undergoes fast and abrupt motion, blur is likely to occur, which causes drift. The proposed approach performs better than the other algorithms on these sequences. When motion blur occurs, our tracker can still rely on the object’s local features, which clearly demonstrates the advantage of representing the object with local feature blocks. By combining improved P-N learning with local feature blocks, we obtain a discriminative classifier for the stable parts of the object, which can be used to locate the object. We then track each local feature block and determine the object from the tracking results of the local feature blocks.
Illumination Variation. Illumination is a critical factor in visual tracking. Two typical image sequences (DavidIndoor and DavidOutdoor) are used to test our tracker, as shown in Figure 9(d). When illumination varies, some local regions of the target are relatively insensitive to the change. Our tracker exploits these insensitive regions to track local areas of the object and then locates the entire object from the local tracking information.
6. Conclusion
In this paper, we propose a part-based visual tracking algorithm with online weighted P-N learning. The online weighted P-N learning method assigns weights (a property weight and a classification weight) to each sample in the training set, which decreases classification errors and improves the discriminative power of the classifier. First, the target is segmented into fragments, and some of them are chosen as local feature blocks to represent the object. We then train a classifier for each LFB with weighted P-N learning and track each LFB independently within the framework of LK optical flow. In addition, a substitution strategy is adopted to dynamically update the set of LFBs, which makes tracking more robust. Experimental results demonstrate that our algorithm outperforms state-of-the-art trackers. However, our algorithm fails to track the object accurately in some scenes. If the tracked target is nonrigid and undergoes extremely heavy deformation, or if it is fully occluded for a long time, the performance of the proposed tracker drops.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
Jinhai Xiang is supported by the Fundamental Research Funds for the Central Universities under Grant no. 2013QC024. Jun Xu is supported by the National Natural Science Foundation of China under Grant no. 11204099 and self-determined research funds of CCNU from the colleges’ basic research and operation of MOE.
References
Y. Wu, J. Lim, and M. H. Yang, “Online object tracking: a benchmark,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '13), pp. 2411–2418, 2013.
X. Mei and H. Ling, “Robust visual tracking using ℓ1 minimization,” in Proceedings of the 12th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 1436–1443, 2009.
K. Zhang, L. Zhang, and M. H. Yang, “Real-time compressive tracking,” in Proceedings of the European Conference on Computer Vision (ECCV '12), pp. 864–877, Firenze, Italy, October 2012.
R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel, “Part-based visual tracking with online latent structural learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '13), pp. 2363–2370, Portland, Ore, USA, 2013.
Q. Yu, B. T. Dinh, and G. Medioni, “Online tracking and reacquisition using co-trained generative and discriminative trackers,” in Proceedings of the European Conference on Computer Vision (ECCV '08), pp. 678–691, 2008.
H. Grabner, C. Leistner, and H. Bischof, “Semisupervised on-line boosting for robust tracking,” in Proceedings of the European Conference on Computer Vision (ECCV '08), pp. 234–247, 2008.
B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI '81), vol. 2, pp. 674–679, Vancouver, Canada, 1981.
K. Zhou, J. C. Doyle, and K. Glover, Robust and Optimal Control, Prentice Hall, Upper Saddle River, NJ, USA, 1996.
M. Everingham, L. Van Gool, C. Williams et al., “The PASCAL Visual Object Classes Challenge 2010 (VOC2010) results,” 2010.