Robust Object Tracking via Reverse Low-Rank Sparse Learning and Fractional-Order Variation Regularization
Object trackers based on low-rank sparse learning tend to drift when the target undergoes severe occlusion or fast motion. In this paper, we propose a novel tracking algorithm based on reverse low-rank sparse learning and fractional-order variation regularization. Firstly, we use a convex low-rank constraint to enforce appearance similarity among the candidate particles and thereby prune irrelevant particles. Secondly, fractional-order variation is introduced to constrain the difference of the sparse coefficients in the bounded variation space; this permits differences between consecutive frames and thus adapts to fast object motion. Meanwhile, fractional-order regularization mitigates severe occlusion by taking information from more adjacent frames into account. Thirdly, we employ an inverse sparse representation to model the relationship between the target candidates and the target template, which reduces the computational complexity of online tracking. Finally, an online updating scheme based on alternating iteration is proposed for the tracking computation. Experiments on benchmark sequences show that our algorithm outperforms several state-of-the-art methods and adapts better to fast motion and severe occlusion.
1. Introduction
Visual object tracking is an important technique in computer vision with many applications, such as robotics, medical image analysis, human-computer interaction, and traffic control. The goal of tracking is to predict the motion state of a moving object in a video stream given its initial state. Much progress has been made in this area, but many challenges remain, caused by partial or full occlusion, fast motion, illumination and scale variation, deformation, background clutter, etc.
A low-rank constraint [1–4] on the candidate particles can reflect the subspace structure of the object appearance. This subspace representation is robust to global appearance changes (e.g., illumination variations and pose changes). Furthermore, to be robust to local appearance changes (e.g., deformation and partial occlusions), sparse representation [5–8] models the image observation by a linear combination of dictionary templates, which can measure the importance of each target candidate. Therefore, the low-rank constraint and the sparse representation can be learned jointly for effective tracking [9, 10]. Zhong et al. develop a sparse collaborative model for object tracking, which exploits a sparse discriminative classifier and a sparse generative model to describe drastic appearance changes. Zhang et al. learn the sparse representation and the low-rank constraint in the particle filter framework and exploit temporal consistency simultaneously. Wang et al. propose an inverse sparse representation based tracking algorithm with a locally weighted distance metric. Sui and Zhang exploit a low-rank constraint to describe the global feature of all the patches and capture the sparsity structure to reflect the local relationship between neighboring patches. Sui et al. formulate spatial-temporal locality under a discriminative dictionary learning structure for object tracking. Dash and Patra propose an effective tracking framework that uses regularized robust sparse coding to represent the multifeature templates of the candidate objects. These methods can successfully deal with target appearance changes caused by lighting variations and partial occlusions. Nevertheless, these formulations are not effective in handling fast motion.
To solve this problem, we introduce reverse low-rank sparse learning with fractional-order variation regularization for visual object tracking. In contrast to existing low-rank sparse trackers, we add a fractional-order variation regularization term to the representation model. Fractional-order variation has been widely used in static image analysis. We generalize it to dynamic video tracking for the following two reasons: (1) the variation method models the tracking problem in the bounded variation space, which permits differences among a few frames and thus adapts to fast object motion; (2) the fractional differential is a global operator, which takes information from more adjacent frames into account and helps overcome severe occlusion.
The main contributions of this work are four-fold: (1) the low-rank constraint is exploited to prune the irrelevant particles; (2) fractional-order variation regularization is introduced to learn the jump information generated by fast motion and complex occlusion; (3) an inverse sparse representation formulation is built to reduce the computational complexity for real-time tracking; and (4) an alternating iteration strategy is presented for online tracking optimization.
2. Problem Formulation
In this work, we formulate the target states within the particle filter framework. Particle filtering is built upon the Bayesian inference rule and can be used to predict the posterior probability density function of the state variables of a dynamic system. Object tracking, as a classical dynamic variable estimation problem, is well suited to this framework. Based on this idea, the posterior probability of the target state can be inferred recursively as follows:
$$p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}, \qquad (1)$$
where $x_t$ is the motion state variable at time $t$, $y_t$ is the observed image, $p(x_t \mid x_{t-1})$ denotes the state transition model, and $p(y_t \mid x_t)$ denotes the observation model. Thus, the target state can be found by maximizing a probability estimation model as
$$\hat{x}_t = \arg\max_{x_t^i} p(x_t^i \mid y_{1:t}), \qquad (2)$$
where $x_t^i$ is the i-th candidate at time t.
2.1. State Transition Model
The state transition model describes the change of the target state between two successive frames. We measure the transition of the target state by an affine motion formulation described as $x_t = (l_x, l_y, \theta, s, \alpha, \phi)$, whose parameters correspond to the $x$ and $y$ coordinate translations, rotation angle, scale, aspect ratio, and skew, respectively. To sample a group of candidate particles, we model the transition of the target state by a Gaussian distribution:
$$p(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, x_{t-1}, \Sigma), \qquad (3)$$
where $\Sigma$ denotes a diagonal covariance matrix whose elements are the variances of the affine motion parameters.
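The sampling step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the MATLAB implementation used in the experiments: the function name `sample_particles`, the ordering of the six affine parameters, and the variance values are assumptions made for the example.

```python
import numpy as np

def sample_particles(prev_state, variances, n_particles, rng):
    """Draw candidate affine states around the previous target state.

    prev_state : (6,) array, assumed order (x, y, rotation, scale, aspect, skew)
    variances  : (6,) array, the diagonal of the Gaussian covariance matrix
    """
    prev_state = np.asarray(prev_state, dtype=float)
    std = np.sqrt(np.asarray(variances, dtype=float))
    # A diagonal covariance means each affine parameter is perturbed independently.
    noise = rng.standard_normal((n_particles, prev_state.size)) * std
    return prev_state + noise

rng = np.random.default_rng(0)
prev = np.array([120.0, 80.0, 0.0, 1.0, 1.0, 0.0])
var = np.array([16.0, 16.0, 1e-4, 1e-4, 1e-6, 1e-6])  # illustrative values only
particles = sample_particles(prev, var, 300, rng)      # one row per candidate
```

Each row of `particles` is one candidate state; a larger variance on the translation entries widens the search region, which matters when the target moves quickly.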
2.2. Reverse Low-Rank Sparse Representation Model with Fractional-Order Variation Regularization
In this section, we use both reverse low-rank sparse learning and fractional-order variation regularization to formulate object tracking. Firstly, to deal with partial occlusion, we employ a local patch-based appearance representation instead of a holistic one. Local patches are sampled sequentially in a nonoverlapping manner from the candidate particles, as shown in Figure 1.
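Nonoverlapping patch sampling can be illustrated as follows. The sketch assumes a grayscale candidate region whose height and width are multiples of the patch size; the 32 x 32 region and 8 x 8 patch sizes, and the name `extract_patches`, are chosen only for the example.

```python
import numpy as np

def extract_patches(region, patch_h, patch_w):
    """Split a 2-D region into nonoverlapping patches, one flattened patch per column."""
    region = np.asarray(region)
    h, w = region.shape
    if h % patch_h or w % patch_w:
        raise ValueError("region size must be a multiple of the patch size")
    rows, cols = h // patch_h, w // patch_w
    # After the swap, blocks[r, c] is the patch at grid position (r, c).
    blocks = region.reshape(rows, patch_h, cols, patch_w).swapaxes(1, 2)
    return blocks.reshape(rows * cols, patch_h * patch_w).T

region = np.arange(32 * 32, dtype=float).reshape(32, 32)
patches = extract_patches(region, 8, 8)  # 16 patches, each with 64 intensities
```

Patches are ordered row by row over the grid, so column `r * cols + c` of the result corresponds to grid cell `(r, c)`.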
Secondly, we use a generative model based on statistical processing to select the optimal target candidate. The existing low-rank sparse optimization-based tracking methods usually make the drift phenomenon occur when the target faces complex occlusion and fast motion. Here, we build a reverse low-rank sparse learning formulation with fractional-order variation regularization for object tracking as follows:where
At time t, denotes the target template reshaped by the intensity vector of the observed target, whose initial value is drawn manually in the first frame and the current value is updated dynamically during tracking as shown in Section 3.2. is a dictionary used for the sparse representation of target template, whose columns are formed by candidates particles . is the local patch vector of the candidate region in the current frame sampled by the state transition model. denotes the sparse coding. , , and are the adjustment parameters. denotes the matrix nuclear norm. is the fractional-order gradient operator. is an integer constant, , and is the gamma function.
In model (4), the first three terms have already been used in existing trackers; they represent the reconstruction error, the low-rank constraint, and the sparse representation, respectively. The last term is our contribution and represents the fractional-order variation regularization.
In this optimization, we use the low-rank constraint to enforce appearance similarity among the candidate particles. This global restriction helps capture the structure of the object observations and prune the uncorrelated candidate particles. Since matrix rank minimization is an NP-hard problem, we instead minimize the nuclear norm, which is a convex envelope of the rank function.
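To make the convex relaxation concrete, the sketch below computes the nuclear norm and its proximal operator (singular value thresholding), which is the standard building block for nuclear-norm-regularized problems. The function names and the small diagonal test matrix are illustrative assumptions, not part of the tracker itself.

```python
import numpy as np

def nuclear_norm(Z):
    """Nuclear norm: the sum of the singular values of Z."""
    return np.linalg.svd(Z, compute_uv=False).sum()

def singular_value_threshold(Z, tau):
    """Proximal operator of tau * nuclear norm: shrink every singular value by tau."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt

Z = np.diag([3.0, 1.0, 0.2])
Z_low = singular_value_threshold(Z, 0.5)  # the smallest singular value is removed
```

Shrinking the singular values removes weak directions first, which is why minimizing the nuclear norm tends to produce low-rank solutions.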
To achieve robust tracking under fast motion and severe occlusion, we introduce fractional-order variation regularization into the representation model. The variation method models the variable selection problem in the bounded variation space. Functions in this space are allowed to have jump discontinuities; that is, discontinuous features can be retained. The appearance variation can therefore be described effectively when the target undergoes fast motion. However, total variation regularization relates only the information between two adjacent frames. Unlike this local processing, the fractional differential in equation (6) involves information from more preceding frames, which helps acquire more target feature information and deal with severe occlusion. Therefore, we employ fractional-order variation regularization in the fractional-order bounded variation space to enhance the robustness of object tracking.
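The difference between a first-order and a fractional-order difference along the frame axis can be illustrated with the truncated Grünwald-Letnikov form, whose weights are $(-1)^k \binom{\alpha}{k} = (-1)^k \Gamma(\alpha+1) / (\Gamma(k+1)\Gamma(\alpha-k+1))$. The sketch below is an assumption about how such an operator may be discretized; the function names, the zero padding before the first frame, and the truncation length `K` are choices made for the example.

```python
import numpy as np

def gl_coefficients(alpha, K):
    """First K Grunwald-Letnikov weights (-1)^k * binom(alpha, k), computed recursively."""
    c = np.empty(K)
    c[0] = 1.0
    for k in range(1, K):
        c[k] = c[k - 1] * (1.0 - (alpha + 1.0) / k)
    return c

def fractional_difference(Z, alpha, K):
    """Truncated fractional difference along the frame axis (axis 0).

    Z[t] holds the coefficients of frame t; frames before the first are treated as zero.
    """
    Z = np.asarray(Z, dtype=float)
    c = gl_coefficients(alpha, K)
    out = np.zeros_like(Z)
    for k in range(min(K, Z.shape[0])):
        # Frame t receives weight c[k] times frame t - k.
        out[k:] += c[k] * Z[: Z.shape[0] - k]
    return out

w_first = gl_coefficients(1.0, 4)  # only two nonzero weights: adjacent frames
w_frac = gl_coefficients(0.5, 4)   # every weight is nonzero: longer memory
```

With `alpha = 1` the operator reduces to the ordinary difference between two adjacent frames, whereas a non-integer `alpha` assigns a nonzero weight to every earlier frame within the truncation window.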
A candidate observation can be represented as a linear combination of target templates, and only a few templates are required to represent the candidate image observation reliably. Optimization problem (4) penalizes the representation matrix via the L1 norm, which retains the useful information and removes the redundant information so that the optimal solution is sparse. We employ sparse representation to model the relationship between target candidates and target templates, which helps deal with occlusion because the residual error at the occluded locations is sparse. Currently, most sparse representation models use the target templates to represent the candidate particles, which requires solving a large number of minimization problems. To reduce the computational cost, we instead use a linear combination of the candidate particles to represent the target template. Because the number of templates is smaller than the number of candidates, fewer minimization problems have to be solved, and the computational efficiency of tracking is improved.
2.3. Observation Model
The observation model measures the probability of the observed image at the motion state , which can describe the similarity between the target template and the candidate particle. Then, the candidate with maximal probability in equation (2) can be regarded as the tracking result. In this paper, we use the sparse coding coefficient in model (4) to estimate this similarity. The candidates with larger sparse coding coefficient have high probability to be the target, whereas the candidates with smaller coefficient are less likely to be the target. We define the observation model aswhere the superscript denotes the m-th candidate. In each frame, we crop out the optimal candidate as the tracking result.
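As a rough illustration of selecting the tracking result from the sparse coding coefficients, the sketch below normalizes the coefficients into scores and picks the largest. The normalization and the clipping of negative coefficients to zero are assumptions made for the example and are not necessarily the form of the observation model defined above; `select_candidate` is a hypothetical helper.

```python
import numpy as np

def select_candidate(coeffs):
    """Return (index of best candidate, normalized scores) from coding coefficients."""
    c = np.maximum(np.asarray(coeffs, dtype=float), 0.0)  # assumption: ignore negatives
    total = c.sum()
    # Fall back to uniform scores when every coefficient is zero.
    scores = c / total if total > 0 else np.full(c.shape, 1.0 / c.size)
    return int(np.argmax(scores)), scores

best, scores = select_candidate([0.0, 0.1, 0.7, 0.2])
```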
3. Numerical Implementation
3.1. Alternating Iterative Algorithm
To solve the optimization problem in (4) for online tracking, we present an alternating iterative algorithm based on three update steps as follows:
Step 1: acquire the low-rank matrix by We solve this problem with the FISTA algorithm. Define , , and where is the Lipschitz constant for the function . The details of the FISTA algorithm can be summarized as follows:
(1) Initialization: and .
(2) Iteration: where . The terminal condition is set by the duality gaps.
Step 2: introduce the fractional-order variation regularization by We solve this model by an adaptive primal-dual algorithm formulated as follows:
(1) Initialization: , , .
(2) Iteration:
(3) Termination condition: where is the dual space. is termed as the primal-dual gap, which vanishes only if is the saddle point.
Step 3: update the coding by the inverse sparse representation: This is a traditional linear regression problem. The solution of this model can be calculated by the LARS algorithm. We utilize the SPAMS optimization toolbox to realize this numerical calculation.
This three-step iteration updates one variable at a time with the other variables fixed. Finally, the representation coefficient in (4) can be acquired.
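The structure of a FISTA iteration (gradient step at an extrapolated point, proximal step, momentum update) can be sketched on a small L1-regularized least-squares problem in which a template vector is coded over a dictionary of candidate columns. This is a generic illustration under stated assumptions, not the solver used in the paper: the paper applies FISTA to the low-rank subproblem, where the proximal step would be singular value thresholding instead of soft thresholding, and solves the sparse coding step with LARS from the SPAMS toolbox. The names `fista_lasso` and `soft_threshold`, the fixed iteration count, and the synthetic data are all hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_lasso(D, t, lam, n_iter=200):
    """Minimize 0.5 * ||t - D z||^2 + lam * ||z||_1 with FISTA."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth term's gradient
    z = np.zeros(D.shape[1])
    y, s = z.copy(), 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - t)                        # gradient at the extrapolated point
        z_next = soft_threshold(y - grad / L, lam / L)  # proximal (shrinkage) step
        s_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * s * s))
        y = z_next + ((s - 1.0) / s_next) * (z_next - z)  # momentum extrapolation
        z, s = z_next, s_next
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 10))             # columns play the role of candidates
t = D[:, 2] + 0.01 * rng.standard_normal(50)  # template close to candidate 2
z = fista_lasso(D, t, lam=0.1)
best = int(np.argmax(np.abs(z)))
```

A fixed iteration count is used here for simplicity; a duality-gap stopping rule, as mentioned in Step 1, would replace the fixed loop in practice.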
3.2. Template Update
In model (4), a fixed target template is insufficient to account for appearance change among successive frames. In this work, we address this issue by a dynamic update scheme as
The target template is defined as the weighted sum of the target template and the tracking result . The contributions of these two terms can be balanced by the weight . The threshold is determined empirically by measuring the dissimilarity. We set and .
This update mechanism can overcome the target appearance change due to partial occlusion. We can retain the unoccluded patches in the target template and prune the occluded ones.
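A patch-wise version of such a weighted update can be sketched as follows. Everything specific here is an assumption made for illustration: the dissimilarity measure (one minus cosine similarity per patch), the rule that a patch is left unchanged when its dissimilarity exceeds the threshold, and the numerical values of the weight and threshold. The sketch only shows how occluded patches can be kept out of the template while unoccluded ones are blended in.

```python
import numpy as np

def update_template(template, result, weight, threshold):
    """Blend template and tracking result patch by patch (one patch per column)."""
    template = np.asarray(template, dtype=float)
    result = np.asarray(result, dtype=float)
    # Assumed dissimilarity: 1 - cosine similarity between corresponding patches.
    num = np.sum(template * result, axis=0)
    den = np.linalg.norm(template, axis=0) * np.linalg.norm(result, axis=0)
    dissim = 1.0 - num / np.maximum(den, 1e-12)
    blended = weight * template + (1.0 - weight) * result
    keep_old = dissim > threshold  # patch looks occluded: do not absorb it
    return np.where(keep_old, template, blended)

template = np.ones((4, 2))
result = np.column_stack([2.0 * np.ones(4), np.array([1.0, -1.0, 1.0, -1.0])])
updated = update_template(template, result, weight=0.8, threshold=0.3)
```

In this toy example the first patch is similar to the template and is blended, while the second patch is dissimilar and the template patch is retained.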
The details of the numerical implementation are shown in Algorithm 1. In our reverse sparse learning framework, the computational cost of updating in step 3 is , where is the number of target templates and is the number of image feature, whereas, in the traditional sparse learning framework, is the number of target candidates. Because the templates’ number is smaller than that of the candidates, the computational complexity for tracking processing can be reduced linearly. In our algorithm, the average frame rate of the video sequences is about 7 frames per second.
4. Experimental Results
In this section, we assess the proposed tracking algorithm by qualitative and quantitative experiments. The experiments are conducted on a set of benchmark sequences (faceocc1, faceocc2, girl, boy, deer, jumping, singer1, car4, david, cardark) with MATLAB. These sequences are categorized by their main challenging factors including occlusion, fast motion, illumination and scale variation, deformation, and background clutters. For each sequence, the initial value of the affine parameter can be acquired from the bounding box in the first frame, which is drawn manually, and then the affine parameter varies accordingly in the tracking process. We sample 300 candidate particles and regularize the target templates size to . Furthermore, we set weight coefficients , , , respectively.
Comparative studies are carried out with 7 state-of-the-art trackers: SCM, IST, LLR, DDL, CNT, MCPF, and VITAL. In [11, 13–15], object tracking is modeled by low-rank and sparse representation. CNT incorporates a convolutional neural network into object tracking without training. MCPF describes object tracking in a multitask correlation particle filter framework. VITAL realizes tracking by detection via adversarial learning. The last three trackers are included mainly because deep networks, correlation filters, and adversarial learning have attracted much attention in complicated visual tracking tasks.
4.1. Qualitative Results
Figures 2–6 qualitatively compare the tracking results of the 8 trackers on the 10 benchmark sequences. In the following, we analyze the results according to the main challenging factor in each sequence.
Occlusion: in the faceocc1 sequence, the target face undergoes frequent occlusion, which causes serious appearance changes. Figure 2(a) shows some representative tracking results for this sequence. The trackers complete the tracking successfully. In the faceocc2 sequence, the target face not only suffers from heavy partial occlusion but also undergoes rotation. As shown in Figure 2(b), the trackers overcome the effect of occlusion to different degrees. When the face is heavily occluded by a magazine (e.g., #181 and #726), all the trackers still achieve favorable results. But when the face undergoes severe occlusion and in-plane rotation simultaneously around #481, most sparse trackers detect the target well, whereas the CNT tracker deviates from the target. In the girl sequence, the target face involves heavy occlusion together with out-of-plane and in-plane rotation, as shown in Figure 2(c). When a man occludes the target girl around #500, the IST tracker drifts away from the girl and tracks the man instead. The MCPF tracker fails to locate the target accurately after #428 because of occlusion and scale variation. The VITAL tracker loses the object around #428 and #457 but eventually recovers it. The DDL tracker starts to drift around #428. The SCM tracker fails to track the object as a result of rotation, while our tracker tracks the girl reliably throughout the entire sequence.
Fast motion: Figure 3 presents some tracking results on sequences whose target suffers from fast motion and motion blur. The ground truth indicates that the motion in these sequences is larger than 20 pixels. It is hard to locate the object, and it is rather challenging to describe the appearance changes caused by motion blur. Our tracker achieves robust tracking in these sequences, but not all of the other trackers obtain promising results under these conditions. The boy sequence contains scenes with fast motion and motion blur, as well as out-of-plane and in-plane rotation. The DDL and LLR trackers cannot keep track of the object and drift to other areas around #360, #490, and #602. The IST tracker outperforms the other trackers but still makes some errors (e.g., #117). In the jumping sequence, the DDL and IST trackers cannot detect the target around #124, #180, #248, and #310, and the LLR tracker fails around #180, #248, and #310. In the deer sequence, the deer head undergoes fast motion, background clutter, and rotation. The DDL and LLR trackers lose the object from the start, and the IST tracker drifts around #32 and #48.
Illumination and scale variation: Figure 4 shows some tracking results on sequences with severe illumination and scale variation. In the singer1 sequence, the stage light changes frequently. In the car4 sequence, the car passes under an overpass and undergoes drastic illumination and scale changes. Most trackers overcome these influences and locate the object region thanks to the low-rank constraint. The CNT tracker uses normalized local image features to overcome this challenge. The MCPF tracker employs a particle sampling strategy to deal with large scale variation. The VITAL tracker handles scale variation by acquiring discriminative features based on the weight mask.
Deformation: in the david sequence, a moving face experiences strong nonrigid deformation due to pose variation and out-of-plane and in-plane rotations. We show some representative tracking results in Figure 5. Our tracker tracks the target effectively in all the frames. This is attributed to the low-rank and reverse sparse characteristics of the tracking framework, which can learn a robust discriminative subspace. The IST and VITAL trackers also perform well with stable tracking results, while the DDL and LLR trackers fail at different times. The CNT tracker deviates from the target in certain frames (e.g., #375 and #460). The MCPF tracker cannot locate the object effectively because of scale variation (e.g., #460).
Background clutter: the cardark sequence includes scenes with background clutter and illumination variation. The car and the surrounding scene have similar color and texture, as shown in Figure 6. Overall, most trackers perform well, whereas the LLR tracker drifts away from the car when regions of similar color or texture come close to the car, such as around #60. The MCPF tracker cannot locate the car effectively because of scale variation (e.g., #284 and #351).
4.2. Quantitative Results
4.2.1. Central-Pixel Error Comparison
This subsection quantitatively compares the central-pixel error (CPE) of the 8 trackers on the 10 sequences, as shown in Table 1. CPE records the Euclidean distance between the manually labeled ground truth and the central location of the tracked bounding box. The smaller the error, the more accurate the tracking result. In Table 1, the smallest and the second smallest errors are marked in bold font for each sequence, and the last row presents the average performance of the trackers. The results show that our tracker achieves the best or second best performance in terms of CPE. The CNT and VITAL trackers also perform relatively well. Among these trackers, SCM, IST, LLR, and DDL are the most closely related to ours. Our tracker outperforms the SCM tracker on the deformation and occlusion sequences and outperforms the IST, LLR, and DDL trackers on the fast motion sequences. Furthermore, compared with the CNT tracker, which models tracking in a convolutional neural network framework, our tracker is more effective under occlusion and deformation. Compared with the MCPF tracker, which models tracking in a multitask correlation particle filter framework, our tracker is more effective under deformation and background clutter. Compared with the VITAL tracker, which models tracking via adversarial learning, our tracker is more effective under occlusion. These results indicate the robustness of our tracker to occlusion, illumination and scale variation, fast motion, deformation, and background clutter.
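The CPE computation can be sketched as follows. The box layout `(x, y, w, h)` with `(x, y)` at the top-left corner and the function name `center_pixel_error` are assumptions made for the example; benchmark annotations may use a different convention.

```python
import numpy as np

def center_pixel_error(pred_boxes, gt_boxes):
    """Per-frame Euclidean distance between box centers; boxes are (x, y, w, h)."""
    pred = np.asarray(pred_boxes, dtype=float)
    gt = np.asarray(gt_boxes, dtype=float)
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0  # top-left corner plus half the size
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_c - gt_c, axis=1)

pred = [[10, 10, 20, 20], [13, 14, 20, 20]]
gt = [[10, 10, 20, 20], [10, 10, 20, 20]]
cpe = center_pixel_error(pred, gt)
mean_cpe = cpe.mean()  # average error over the sequence
```

Averaging the per-frame errors over a sequence gives a single figure per tracker and sequence, which is the kind of value reported in Table 1.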
4.2.2. Influence of Fractional-Order Variation
This subsection compares the influence of fractional-order variation with that of first-order variation on the tracking results. Figure 7 plots the evolution curves of CPE versus frame number for different differential orders. The experimental sequences are selected according to their main challenging factors. In most sequences, the result obtained with fractional-order regularization is similar to that obtained with first-order regularization. Under complex occlusion, however, fractional-order regularization has an obvious advantage. In the faceocc2 sequence, especially from #576 to #819, the object face undergoes heavy appearance changes when it is occluded by a magazine and a hat, and the tracking performance with fractional-order regularization is much better than that with first-order regularization. Similarly, in the girl sequence, especially from #90 to #110, the object face is gradually occluded from local to global, and the fractional-order operator again yields a smaller error. This implies that fractional-order regularization is beneficial because it takes information from more neighboring frames into account, which is mainly because the fractional differential is a global operation. Theoretically, the number of its expansion terms should be very large, but we take for our tracking because the fractional-order computation will cost more time. Based on the average CPE, we set in Figures 2–6.
5. Conclusions
In this paper, we proposed a novel object tracking method based on reverse low-rank sparse learning and fractional-order variation regularization. Our tracker combines several effective technical elements. We used a low-rank constraint to prune the uncorrelated candidate particles. We introduced fractional-order variation regularization to retain discontinuous features and handle fast motion. Meanwhile, this regularization relates feature information from adjacent frames to suppress the effect of occlusion. Furthermore, we built an inverse sparse representation to reduce the computational cost of tracking, and we gave an alternating iteration strategy for online tracking optimization. Qualitative and quantitative evaluations on benchmark sequences have demonstrated the robustness of our tracking algorithm, especially under complex occlusion and fast motion. In the future, we will extend our tracker to deep learning to enhance its discriminative ability.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant 61703285 and the Liaoning Natural Science Foundation of China under Grant 2019-MS-237.