Abstract
Object tracking based on low-rank sparse learning often drifts when the target undergoes severe occlusion or fast motion. In this paper, we propose a novel tracking algorithm based on reverse low-rank sparse learning and fractional-order variation regularization. Firstly, we use a convex low-rank constraint to enforce appearance similarity among the candidate particles and thereby prune the irrelevant particles. Secondly, fractional-order variation is introduced to constrain the sparse coefficient difference in the bounded variation space, which allows differences between consecutive frames and thus adapts to fast object motion. Meanwhile, fractional-order regularization can restrain the effect of severe occlusion by taking information from more adjacent frames into account. Thirdly, we employ an inverse sparse representation to model the relationship between the target candidates and the target template, which reduces the computational complexity of online tracking. Finally, an online updating scheme based on alternating iteration is proposed for tracking computation. Experiments on benchmark sequences show that our algorithm outperforms several state-of-the-art methods and adapts especially well to fast motion and severe occlusion.
1. Introduction
Visual object tracking is an important technique in computer vision with many applications, such as robotics, medical image analysis, human-computer interaction, and traffic control. The goal of tracking is to predict the motion state of a moving object in a video stream from its initial state. Much progress has been made in this area, but many challenges remain, caused by partial or full occlusion, fast motion, illumination and scale variation, deformation, background clutter, etc.
A low-rank constraint [1–4] on the candidate particles can reflect the subspace structure of the object appearance. This subspace representation is robust to global appearance changes (e.g., illumination variations and pose changes). Furthermore, for robustness to local appearance changes (e.g., deformation and partial occlusions), sparse representation [5–8] models the image observation as a linear combination of dictionary templates, which can measure the importance of each target candidate. Therefore, the low-rank constraint and the sparse representation can be learned jointly for effective tracking [9, 10]. Zhong et al. [11] develop a sparse collaborative model for object tracking, which exploits a sparse discriminative classifier and a sparse generative model to describe drastic appearance changes. Zhang et al. [12] learn the sparse representation and the low-rank constraint in the particle filter framework and exploit temporal consistency simultaneously. Wang et al. [13] propose a tracking algorithm based on inverse sparse representation with a locally weighted distance metric. Sui and Zhang [14] exploit a low-rank constraint to describe the global feature of all the patches and capture the sparsity structure to reflect the local relationship between neighboring patches. Sui et al. [15] formulate spatial-temporal locality under a discriminative dictionary learning structure for object tracking. Dash and Patra [16] propose an effective tracking framework that uses regularized robust sparse coding to represent the multi-feature templates of the candidate objects. These methods can successfully deal with target appearance changes caused by lighting variations and partial occlusions. Nevertheless, these formulations are not effective in handling fast motion.
To solve this problem, we introduce reverse low-rank sparse learning with fractional-order variation regularization for visual object tracking. In contrast to the existing low-rank sparse trackers, we add fractional-order variation regularization to the representation model. Fractional-order variation has been widely used in static image analysis. We generalize it to dynamic video tracking for the following two reasons: (1) the variation method can model the tracking problem in the bounded variation space, which allows differences among a few frames and thus adapts to fast object motion; (2) the fractional differential is a global operator, which can take information from more adjacent frames into account and thus helps to overcome severe occlusion.
The main contributions of this work are fourfold: (1) the low-rank constraint is exploited to prune the irrelevant particles; (2) fractional-order variation regularization is introduced to learn the jump information generated by fast motion and complex occlusion; (3) an inverse sparse representation formulation is built to reduce the computational complexity for real-time tracking; and (4) an alternating iteration strategy is presented for online tracking optimization.
2. Problem Formulation
In this work, we formulate the target states within the particle filter framework [13]. Particle filtering is built upon the Bayesian inference rule, which can be used to predict the posterior probability density function of the state variables in a dynamic system. Object tracking, as a classical dynamic variable estimation problem, is well suited to this framework. Based on this idea, the a posteriori probability of the target state can be inferred recursively as follows:
$$p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}, \quad (1)$$
where $x_t$ is the motion state variable at time $t$, $y_t$ is the observed image, $p(x_t \mid x_{t-1})$ denotes the state transition model, and $p(y_t \mid x_t)$ denotes the observation model. Thus, the target state can be found by maximizing a probability estimation model as
$$\hat{x}_t = \arg\max_{x_t^i} p(x_t^i \mid y_{1:t}), \quad (2)$$
where $x_t^i$ is the ith candidate at time t.
2.1. State Transition Model
The state transition model describes the change of the target state between two successive frames. We measure the transition of the target state by an affine motion formulation with six parameters, which correspond to the x and y coordinate translations, rotation angle, scale, aspect ratio, and skew, respectively. To sample a group of candidate particles, we model the transition of the target state by a Gaussian distribution:
$$p(x_t \mid x_{t-1}) = \mathcal{N}(x_t; x_{t-1}, \Sigma), \quad (3)$$
where $x_t$ denotes the target state at time $t$ and $\Sigma$ denotes a diagonal covariance matrix, whose elements are the variances of the affine motion parameters.
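The sampling step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `sample_particles`, the ordering of the six affine parameters, and all numeric values are hypothetical and are not taken from the paper. Note that the sketch is parameterized by standard deviations, that is, the square roots of the diagonal variances.

```python
import numpy as np

def sample_particles(prev_state, sigmas, num_particles, rng=None):
    """Draw candidate states from a diagonal Gaussian centered on the previous state."""
    rng = np.random.default_rng() if rng is None else rng
    prev_state = np.asarray(prev_state, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    # Independent noise per affine parameter (diagonal covariance).
    noise = rng.standard_normal((num_particles, prev_state.size)) * sigmas
    return prev_state + noise

# Hypothetical state layout: [x, y, rotation, scale, aspect ratio, skew].
prev_state = [160.0, 120.0, 0.0, 1.0, 1.0, 0.0]
# Illustrative standard deviations only; the paper's values are not given here.
sigmas = [4.0, 4.0, 0.01, 0.01, 0.002, 0.001]
particles = sample_particles(prev_state, sigmas, 300, np.random.default_rng(0))
```

Each row of `particles` is one candidate state; larger translation deviations would widen the search region, which matters for fast motion at the cost of needing more particles.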
2.2. Reverse Low-Rank Sparse Representation Model with Fractional-Order Variation Regularization
In this section, we use both reverse low-rank sparse learning and fractional-order variation regularization to formulate object tracking. Firstly, we replace the holistic appearance representation with a local, patch-based one in order to deal with partial occlusion. Here, local patches are sampled sequentially in a non-overlapping manner from the candidate particles, as shown in Figure 1.
Secondly, we use a generative model based on statistical processing to select the optimal target candidate. Existing tracking methods based on low-rank sparse optimization tend to drift when the target undergoes complex occlusion and fast motion. To address this, we build a reverse low-rank sparse learning formulation with fractional-order variation regularization for object tracking, referred to as model (4).
At time t, the target template is the intensity vector of the observed target; its initial value is drawn manually in the first frame, and its current value is updated dynamically during tracking as described in Section 3.2. The dictionary used for the sparse representation of the target template has the candidate particles as its columns, where each candidate is the local patch vector of a candidate region in the current frame sampled by the state transition model. The unknown of the model is the sparse coding. Three adjustment parameters weight the regularization terms, which involve the matrix nuclear norm and the fractional-order gradient operator. The definition of the fractional-order gradient operator involves an integer constant and the gamma function.
In model (4), the first three terms have already been used in existing trackers; they represent the reconstruction error, the low-rank constraint, and the sparse representation, respectively. The last term, which is our novel contribution, represents the fractional-order variation regularization.
In this optimization, we use the low-rank constraint to enforce appearance similarity among the candidate particles. This global restriction helps to capture the structure of the object observations and to prune the uncorrelated candidate particles. Since matrix rank minimization is an NP-hard problem, we instead minimize the nuclear norm, which is a convex envelope of the rank function.
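To make the nuclear-norm relaxation concrete, the sketch below computes the nuclear norm (the sum of singular values) and applies singular value thresholding, the standard proximal operator of the nuclear norm. The function names are hypothetical, and the paper does not state that its solver is implemented this way; the code only illustrates how shrinking singular values lowers the rank.

```python
import numpy as np

def nuclear_norm(matrix):
    """Sum of the singular values of a matrix."""
    return float(np.linalg.svd(matrix, compute_uv=False).sum())

def singular_value_threshold(matrix, tau):
    """Proximal operator of tau * nuclear norm: shrink every singular value by tau."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    # Scale the columns of u by the shrunk singular values, then recombine.
    return (u * s_shrunk) @ vt

example = np.diag([3.0, 1.0])
shrunk = singular_value_threshold(example, 2.0)
```

In this small example the singular value below the threshold is set to zero, so `shrunk` has rank 1 while `example` has rank 2.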
To realize robust tracking under fast motion and severe occlusion, we introduce fractional-order variation regularization into the representation model. The variation method can model the variable selection problem in the bounded variation space. Functions in this space are allowed to have jump discontinuities; that is, discontinuous features can be retained. The appearance variation can therefore be described effectively when the target undergoes fast motion. However, total variation regularization can only relate the information between two adjacent frames. Unlike this local processing, the fractional differential in equation (6) can involve more information from the preceding frames, which helps to acquire more target feature information and to deal with severe occlusion. Therefore, we employ fractional-order variation regularization in the fractional-order bounded variation space to enhance the robustness of object tracking.
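The contrast between a first-order difference and a fractional-order one can be illustrated numerically. The sketch below assumes a truncated Grünwald-Letnikov form, whose coefficients equal $(-1)^k \Gamma(\alpha+1) / (\Gamma(k+1)\Gamma(\alpha-k+1))$; the paper's mention of the gamma function suggests this form, but the exact definition in equation (6) is not reproduced here, so treat the choice as an assumption. The names `gl_weights` and `fractional_difference` are hypothetical.

```python
def gl_weights(alpha, num_terms):
    """Truncated Grunwald-Letnikov coefficients (-1)^k * C(alpha, k), k = 0..num_terms-1."""
    weights = [1.0]
    for k in range(1, num_terms):
        # Recurrence equivalent to the gamma-function expression.
        weights.append(weights[-1] * (k - 1 - alpha) / k)
    return weights

def fractional_difference(history, alpha, num_terms):
    """Weighted sum over the most recent entries; history[-1] is the current frame."""
    weights = gl_weights(alpha, min(num_terms, len(history)))
    return sum(w * history[-1 - k] for k, w in enumerate(weights))

first_order = gl_weights(1.0, 4)   # only the current and previous frame get nonzero weight
half_order = gl_weights(0.5, 4)    # every earlier frame keeps a nonzero weight
```

With order 1 the weights reduce to the usual difference between two adjacent frames, whereas a non-integer order spreads nonzero weights over all retained frames, which is the sense in which the fractional differential is a global operator.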
The candidate observation can be represented as a linear combination of target templates, and only a few templates are required to represent the candidate image observation reliably. Optimization problem (4) penalizes the representation matrix via the $L_1$ norm, which retains the useful information and removes the redundant information so that the optimal solution is sparse. We employ sparse representation to model the relationship between target candidates and target templates, which helps to deal with occlusion because the residual error at the occluded location is sparse. Currently, most sparse representation models use the target templates to represent the candidate particles, which requires solving a large number of minimization problems. To reduce the computational cost, we instead use a linear combination of the candidate particles to represent the target template. Because the number of templates is smaller than the number of candidates, the computational efficiency of tracking is improved.
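A minimal sketch of the inverse idea follows: a single template is coded over a dictionary whose columns are the candidates, so only one $L_1$-regularized problem is solved rather than one per candidate. The solver here is plain ISTA (proximal gradient descent), swapped in for brevity; it is not the LARS solver the paper uses, and the data, dimensions, and regularization weight are synthetic assumptions.

```python
import numpy as np

def soft_threshold(values, tau):
    """Proximal operator of tau * L1 norm."""
    return np.sign(values) * np.maximum(np.abs(values) - tau, 0.0)

def inverse_sparse_code(candidates, template, lam, num_iters=500):
    """Approximately minimize 0.5*||template - candidates @ z||^2 + lam*||z||_1 by ISTA."""
    lipschitz = np.linalg.norm(candidates, 2) ** 2  # squared largest singular value
    z = np.zeros(candidates.shape[1])
    for _ in range(num_iters):
        gradient = candidates.T @ (candidates @ z - template)
        z = soft_threshold(z - gradient / lipschitz, lam / lipschitz)
    return z

rng = np.random.default_rng(0)
candidates = rng.standard_normal((64, 20))        # 20 synthetic candidates, 64 features each
candidates /= np.linalg.norm(candidates, axis=0)  # unit-norm columns
template = candidates[:, 7] + 0.01 * rng.standard_normal(64)  # template resembles candidate 7
coefficients = inverse_sparse_code(candidates, template, lam=0.05)
best_candidate = int(np.argmax(coefficients))
```

The candidate with the largest coefficient is the one that best explains the template, which is the quantity the observation model in Section 2.3 relies on.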
2.3. Observation Model
The observation model measures the probability of the observed image given the motion state, which describes the similarity between the target template and the candidate particle. The candidate with the maximal probability in equation (2) is then regarded as the tracking result. In this paper, we use the sparse coding coefficient in model (4) to estimate this similarity. Candidates with larger sparse coding coefficients have a high probability of being the target, whereas candidates with smaller coefficients are less likely to be the target. We therefore define the observation model of the mth candidate in terms of its sparse coding coefficient. In each frame, we crop out the optimal candidate as the tracking result.
3. Numerical Implementation
3.1. Alternating Iterative Algorithm
To solve the optimization problem in (4) for online tracking, we present an alternating iterative algorithm based on three update steps as follows:
Step 1: acquire the low-rank matrix. We solve this subproblem with the FISTA algorithm, whose step size is determined by the Lipschitz constant of the smooth term. The algorithm consists of (1) an initialization and (2) an iteration, and the terminal condition is set by the duality gap.
Step 2: introduce the fractional-order variation regularization. We solve this model with an adaptive primal-dual algorithm [17], which consists of (1) an initialization, (2) an iteration, and (3) a termination condition. The termination condition is based on the primal-dual gap, which is evaluated over the dual space and vanishes only at a saddle point.
Step 3: update the coding by the inverse sparse representation. This is a traditional linear regression problem, whose solution can be calculated by the LARS algorithm. We use the SPAMS optimization toolbox to carry out this numerical calculation.
This three-step iteration updates one variable at a time with the other variables fixed. Finally, the representation coefficient in (4) is obtained.
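As an illustration of Step 1, the sketch below runs FISTA on a nuclear-norm-regularized least-squares problem. The exact subproblem of the paper is not reproduced here, so the objective $\frac{1}{2}\|T - DZ\|_F^2 + \lambda\|Z\|_*$, the function names, and the fixed iteration count (used instead of a duality-gap stopping rule) are all assumptions made for the example.

```python
import numpy as np

def svt(matrix, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def fista_nuclear(dictionary, target, lam, num_iters=200):
    """FISTA for min_Z 0.5*||target - dictionary @ Z||_F^2 + lam*||Z||_* (assumed form)."""
    lipschitz = np.linalg.norm(dictionary, 2) ** 2  # Lipschitz constant of the gradient
    z = np.zeros((dictionary.shape[1], target.shape[1]))
    y, t = z.copy(), 1.0
    for _ in range(num_iters):
        gradient = dictionary.T @ (dictionary @ y - target)
        z_next = svt(y - gradient / lipschitz, lam / lipschitz)  # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)         # momentum extrapolation
        z, t = z_next, t_next
    return z

# With an identity dictionary the minimizer is the thresholded target itself.
low_rank = fista_nuclear(np.eye(2), np.diag([3.0, 1.0]), lam=2.0)
```

The momentum step is what distinguishes FISTA from plain proximal gradient descent; in a full tracker this routine would alternate with the primal-dual update of Step 2 and the sparse coding update of Step 3.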
3.2. Template Update
In model (4), a fixed target template is insufficient to account for appearance changes among successive frames. In this work, we address this issue with a dynamic update scheme.
The updated target template is defined as the weighted sum of the current target template and the tracking result, and a weight balances the contributions of these two terms. The scheme also uses a threshold, which is determined empirically by measuring the dissimilarity. The weight and the threshold are set to fixed values.
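One possible reading of this update can be sketched as follows. Several details are assumptions rather than the paper's definition: the update is applied patch by patch, the dissimilarity is a mean absolute difference, a patch whose dissimilarity exceeds the threshold is treated as occluded and left unchanged, and the weight and threshold values are placeholders.

```python
import numpy as np

def update_template(template, result, weight, threshold):
    """Patch-wise template update; each column is one vectorized patch (assumed layout)."""
    template = np.asarray(template, dtype=float)
    result = np.asarray(result, dtype=float)
    # Assumed dissimilarity measure: mean absolute difference per patch.
    dissimilarity = np.mean(np.abs(template - result), axis=0)
    blended = weight * template + (1.0 - weight) * result
    occluded = dissimilarity > threshold  # assumed rule: skip patches that changed too much
    return np.where(occluded, template, blended)

old_template = np.zeros((4, 2))
tracking_result = np.column_stack([np.full(4, 0.1), np.full(4, 5.0)])
new_template = update_template(old_template, tracking_result, weight=0.8, threshold=1.0)
```

In this toy case the first patch is blended with the tracking result, while the second patch, which differs strongly from the template, is kept as it was.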
This update mechanism can overcome the target appearance change due to partial occlusion. We can retain the unoccluded patches in the target template and prune the occluded ones.
The details of the numerical implementation are shown in Algorithm 1. In our reverse sparse learning framework, the computational cost of the coding update in Step 3 depends on the number of target templates and the number of image features, whereas in the traditional sparse learning framework it depends on the number of target candidates. Because the number of templates is smaller than the number of candidates, the computational complexity of tracking is reduced linearly. Our algorithm runs at an average of about 7 frames per second on the video sequences.

4. Experimental Results
In this section, we assess the proposed tracking algorithm through qualitative and quantitative experiments. The experiments are conducted in MATLAB on a set of benchmark sequences (faceocc1, faceocc2, girl, boy, deer, jumping, singer1, car4, david, cardark). These sequences are categorized by their main challenging factors, including occlusion, fast motion, illumination and scale variation, deformation, and background clutter. For each sequence, the initial value of the affine parameters is obtained from the manually drawn bounding box in the first frame, and the affine parameters then vary accordingly during tracking. We sample 300 candidate particles and normalize the target templates to a fixed size. The three weight coefficients of model (4) are set to fixed values.
We carry out comparative studies with 7 state-of-the-art trackers: SCM [11], IST [13], LLR [14], DDL [15], CNT [18], MCPF [19], and VITAL [20]. In [11, 13–15], object tracking is modeled by low-rank and sparse representation. In [18], a convolutional neural network is incorporated into object tracking without training. In [19], object tracking is described in a multitask correlation particle filter framework. In [20], a tracking-by-detection framework is realized via adversarial learning. These trackers were chosen mainly because deep networks [21], correlation filters, and adversarial learning have attracted much attention in complicated visual tracking tasks.
4.1. Qualitative Results
Figures 2–6 qualitatively compare the tracking results of the 8 trackers on the 10 benchmark sequences. In the following, we analyze the results according to the main challenging factor in each sequence.
Occlusion: in the faceocc1 sequence, the target face undergoes frequent occlusion, which causes serious appearance changes. Figure 2(a) shows some representative tracking results for this sequence; all the trackers complete the tracking successfully. In the faceocc2 sequence, the target face not only suffers from heavy partial occlusion but also undergoes rotation. As shown in Figure 2(b), the trackers overcome the effect of occlusion to different degrees. When the face is heavily occluded by a magazine (e.g., #181 and #726), all the trackers still achieve favorable results. But when the face undergoes severe occlusion and in-plane rotation simultaneously around #481, most sparse trackers still locate the target well, whereas the CNT tracker deviates from it. In the girl sequence, the target face undergoes heavy occlusion together with out-of-plane and in-plane rotation, as shown in Figure 2(c). When a man occludes the target girl around #500, the IST tracker drifts away from the girl and tracks the man instead. The MCPF tracker fails to locate the target accurately after #428 because of occlusion and scale variation. The VITAL tracker loses the object around #428 and #457 but eventually recovers it. The DDL tracker starts to drift around #428. The SCM tracker fails to track the object because of rotation, whereas our tracker tracks the girl reliably throughout the sequence.
Fast motion: Figure 3 presents some tracking results on sequences in which the target undergoes fast motion and motion blur. The ground truth indicates that the motion in these sequences is larger than 20 pixels. It is hard to locate the object, and describing the appearance changes caused by motion blur is rather challenging. Our tracker achieves robust tracking on these sequences, but not all of the other trackers obtain promising results under these conditions. The boy sequence contains scenes with fast motion and motion blur, as well as out-of-plane and in-plane rotation. The DDL and LLR trackers cannot keep track of the object and drift to other areas around #360, #490, and #602. The IST tracker outperforms the other trackers but still makes some errors (e.g., #117). In the jumping sequence, the DDL and IST trackers cannot detect the target around #124, #180, #248, and #310, and the LLR tracker fails around #180, #248, and #310. In the deer sequence, the deer head undergoes fast motion, background clutter, and rotation. The DDL and LLR trackers lose the object from the start, and the IST tracker drifts around #32 and #48.
Illumination and scale variation: Figure 4 shows some tracking results on sequences with severe illumination and scale variation. In the singer1 sequence, the stage light changes frequently. In the car4 sequence, the car passes under an overpass and undergoes drastic illumination and scale changes. Most trackers overcome these influences and locate the object region thanks to the low-rank constraint. The CNT tracker uses normalized local image features to overcome this challenge. The MCPF tracker employs a particle sampling strategy to deal with large-scale variation. The VITAL tracker handles scale variation by acquiring discriminative features based on the weight mask.
Deformation: in the david sequence, a moving face experiences strong non-rigid deformation due to pose variation and out-of-plane and in-plane rotations. We show some representative tracking results in Figure 5. Our tracker tracks the target effectively in all the frames. This is attributed to the low-rank and reverse sparse characteristics of the tracking framework, which can learn a robust discriminative subspace. The IST and VITAL trackers also perform well with stable tracking results, while the DDL and LLR trackers fail at different times. The CNT tracker deviates from the target in certain frames (e.g., #375 and #460). The MCPF tracker cannot locate the object effectively under scale variation (e.g., #460).
Background clutter: the cardark sequence includes scenes with background clutter and illumination variation. The car and the surrounding scene have similar color and texture, as shown in Figure 6. Overall, most trackers perform well, whereas the LLR tracker drifts away from the car when regions of similar color or texture come close to it, such as around #60. The MCPF tracker cannot locate the car effectively under scale variation (e.g., #284 and #351).
4.2. Quantitative Results
4.2.1. Central-Pixel Error Comparison
This subsection quantitatively compares the central-pixel error (CPE) of the 8 trackers on the 10 sequences, as shown in Table 1. CPE records the Euclidean distance between the manually labeled ground truth and the central location of the tracked bounding box. The smaller the error, the more accurate the tracking result. In Table 1, the smallest and the second smallest errors for each sequence are marked in bold font, and the last row presents the average performance of the trackers. The results show that our tracker achieves the best or second best performance in terms of CPE. The CNT and VITAL trackers also perform relatively well. Among these trackers, SCM, IST, LLR, and DDL are the most closely related to ours. However, our tracker outperforms the SCM tracker on the deformation and occlusion sequences and outperforms the IST, LLR, and DDL trackers on the fast motion sequences. Furthermore, compared with the CNT tracker, which models tracking in a convolutional neural network framework, our tracker is more effective under occlusion and deformation. Compared with the MCPF tracker, which models tracking in a multitask correlation particle filter framework, our tracker is more effective under deformation and background clutter. Compared with the VITAL tracker, which models tracking via adversarial learning, our tracker is more effective under occlusion. These results indicate the robustness of our tracker to occlusion, illumination and scale variation, fast motion, deformation, and background clutter.
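The CPE metric is simple enough to state in code. The sketch below assumes boxes are given as `(x, y, width, height)` with `(x, y)` the top-left corner; that box layout and the function names are assumptions for illustration, not details taken from the paper.

```python
import math

def center_of(box):
    """Center of an axis-aligned box given as (x, y, width, height)."""
    x, y, width, height = box
    return (x + width / 2.0, y + height / 2.0)

def central_pixel_error(tracked_box, truth_box):
    """Euclidean distance between the tracked center and the ground-truth center."""
    (cx, cy), (gx, gy) = center_of(tracked_box), center_of(truth_box)
    return math.hypot(cx - gx, cy - gy)

def mean_cpe(tracked_boxes, truth_boxes):
    """Average CPE over a sequence of frames."""
    errors = [central_pixel_error(t, g) for t, g in zip(tracked_boxes, truth_boxes)]
    return sum(errors) / len(errors)

example_error = central_pixel_error((0, 0, 10, 10), (3, 4, 10, 10))
```

Averaging the per-frame errors over a sequence gives one entry of a table such as Table 1; because CPE ignores box size, it does not by itself capture scale errors.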
4.2.2. Influence of Fractional-Order Variation
This subsection compares the influence of fractional-order variation with that of first-order variation on the tracking results. Figure 7 plots the CPE against the frame number for different differential orders. The experimental sequences are selected according to their main challenging factors. In most sequences, the result obtained with fractional-order regularization is similar to the one obtained with first-order regularization. Under complex occlusion, however, fractional-order regularization has an obvious advantage. In the faceocc2 sequence, especially from #576 to #819, the object face undergoes heavy appearance changes because it is occluded by a magazine and a hat. The tracking performance based on fractional-order regularization is much better than that based on first-order regularization. Similarly, in the girl sequence, especially from #90 to #110, where the object face is gradually occluded from locally to globally, the fractional-order operator also performs better, with a smaller error. This suggests that fractional-order regularization benefits from taking information from more neighboring frames into account, mainly because the fractional differential is a global operation. Theoretically, the number of its expansion terms should be very large, but we truncate the expansion to a limited number of terms in our tracker because fractional-order computation is more time-consuming. The fractional order used for Figures 2–6 is selected according to the average CPE.
5. Conclusion
In this paper, we proposed a novel object tracking method based on reverse low-rank sparse learning and fractional-order variation regularization. Our tracker comprises the following technical elements. We used a low-rank constraint to prune the uncorrelated candidate particles. We introduced fractional-order variation regularization to retain discontinuous features and to cope with fast motion. This regularization also relates feature information from adjacent frames to suppress the effect of occlusion. Furthermore, we built an inverse sparse representation to reduce the computational cost of tracking, and we presented an alternating iteration strategy for online tracking optimization. Qualitative and quantitative evaluations on benchmark sequences demonstrated the robustness of our tracking algorithm, especially under complex occlusion and fast motion. In the future, we will extend our tracker to deep learning to enhance its discriminative ability.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant 61703285 and the Liaoning Natural Science Foundation of China under Grant 2019MS237.