Abstract

To benefit from the development of compressive sensing, we cast tracking as a sparse approximation problem in a multifeature-based particle filter framework. In this framework, the target template is composed of multiple features extracted from visible and infrared frames; in addition, occlusion, interruption, and noise are addressed through a set of trivial templates. With this model, a sparse representation is obtained via a compressive sensing approach without nonnegativity constraints; the residual between the sparse representation and the compressed sensing observation is then used to measure the likelihood that weights the particles. Afterwards, the target template is adaptively updated according to the Bhattacharyya coefficient. Experimental results demonstrate that the proposed tracker is more robust than four comparison algorithms.

1. Introduction

Visual tracking is an essential task in computer vision, with applications in many fields such as vehicle tracking, medical imaging, robotics, and surveillance. Much effort has been devoted to developing robust visual tracking algorithms that can cope with occlusion, illumination changes, viewpoint variation, and noise interference [1-3].

With the help of the sparse representation technique, [4] proposed the L1 tracker to perform robust visual tracking, casting tracking as a sparse approximation problem under a particle filter framework. This approach performs better in dealing with occlusions, pose changes, and illumination changes than the mean shift tracker, the covariance tracker, and the appearance-adaptive particle filter. However, it is computationally demanding because it needs to solve an $\ell_1$-norm-related minimization problem many times. Many subsequent efforts have aimed to accelerate the tracking process and improve its robustness: [5] developed a real-time compressive sensing tracking (RTCST) algorithm to reduce the computational complexity, while [6] proposed an accelerated proximal gradient approach instead of the interior point method for acceleration. Both algorithms achieve higher accuracy and robustness than the standard L1 tracker in many complex scenarios but are still insufficient in accommodating extreme illumination variations. A main cause of this insufficiency may be that both algorithms directly use the cropped target image to generate the target templates, so the templates cannot reflect environmental changes; moreover, tracking becomes unstable when the target is similar to the background. To improve robustness and accuracy, our approach employs a multifeature-based method, which leads us to treat the tracking target as a sparse representation in the linear span of a multifeature space.

Multifeature methods are widely used in image fusion and face recognition, and many attempts have been made to extend them to tracking problems [7, 8]. A mixture of infrared and visible features is a typical way of modeling the tracked object. Reference [9] uses mixed visible and infrared features and combines a mean shift method with a level set evolution algorithm to track visual objects; [10] employs the intensity cue and edge cue of the infrared target as the feature template and applies a particle filter framework to track pedestrians. All of these algorithms can cope with complex environmental conditions such as illumination changes, shadows, and occlusion, but need improvement under more severe conditions. Building on these previous works, we use some of the features mentioned above, together with a modified compressed sensing tracking method, to obtain a more robust tracking result.

The remainder of the paper is organized as follows. We briefly review some related works in Section 2. Section 3 details the proposed tracker, including a sparse target template representation model, a compressive sensing tracking strategy, and an adaptive template update scheme. Experimental results and conclusions are given in Sections 4 and 5.

2. Related Works

2.1. Particle Filtering

The particle filter is a sequential Monte Carlo sampling method for Bayesian filtering; it provides a convenient framework for estimating the posterior distribution of the state variables of a dynamic system that is nonlinear and non-Gaussian. Assume that $x_t$ is the state variable at time $t$, which can be defined as affine motion parameters or any other parameters reflecting the properties of the system. The predicting distribution of $x_t$ given all available observations $y_{1:t-1} = \{y_1, \dots, y_{t-1}\}$ up to time $t-1$ can be computed as
$$p(x_t \mid y_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}.$$
When the observation $y_t$ becomes available at time $t$, the state vector is updated using the Bayes rule
$$p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1});$$
in the particle filter, this posterior is approximated by a weighted sample set $\{(x_t^i, w_t^i)\}_{i=1}^{N}$. Because it is generally impossible to sample from the state posterior directly, an importance distribution $q(x_t \mid x_{1:t-1}, y_{1:t})$ is used for updating the weights of the samples; a typical choice of importance function is $q(x_t \mid x_{1:t-1}, y_{1:t}) = p(x_t \mid x_{t-1})$, in which case the weights become the observation likelihood $p(y_t \mid x_t)$. The samples are then resampled according to their importance weights to generate a set of equally weighted particles, which avoids degeneracy.
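The predict-update-resample cycle above can be sketched on a toy one-dimensional model; the dynamics, noise levels, and the mean-based state estimate here are illustrative choices, not the tracker's actual motion model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D model: x_t = x_{t-1} + N(0, 0.5^2),  y_t = x_t + N(0, 0.3^2).
N = 500
particles = rng.normal(0.0, 1.0, N)     # samples from the prior p(x_0)
weights = np.full(N, 1.0 / N)

def step(particles, weights, y, sigma_q=0.5, sigma_r=0.3):
    # Predict: with q(x_t | x_{1:t-1}, y_{1:t}) = p(x_t | x_{t-1}),
    # propagate each particle through the state-transition distribution.
    particles = particles + rng.normal(0.0, sigma_q, particles.size)
    # Update: the weights become the observation likelihood p(y_t | x_t^i).
    weights = weights * np.exp(-0.5 * ((y - particles) / sigma_r) ** 2)
    weights = weights / weights.sum()
    # Resample to a set of equally weighted particles to avoid degeneracy.
    idx = rng.choice(particles.size, size=particles.size, p=weights)
    return particles[idx], np.full(particles.size, 1.0 / particles.size)

particles, weights = step(particles, weights, y=1.0)
estimate = float(particles.mean())      # MMSE-style estimate of the state
```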

2.2. Compressive Sensing

Assume that a signal $x \in \mathbb{R}^N$ can be sparsely expressed in a basis $\Psi = [\psi_1, \dots, \psi_N]$ as $x = \Psi s$; the signal is $K$-sparse if $\|s\|_0 \le K$, in which $\|\cdot\|_0$ counts the nonzero elements of the vector $s$. The compressive sensing theory demonstrates that $x$ can be recovered with overwhelming probability from only a few measurements
$$y = \Phi x = \Phi \Psi s,$$
where the measurement matrix $\Phi \in \mathbb{R}^{M \times N}$ ($M \ll N$) is incoherent with $\Psi$ and satisfies the RIP (restricted isometry property) [11].

Many studies indicate that a Gaussian measurement matrix stably preserves the salient information of any $K$-sparse or compressible signal under dimensionality reduction. With a small constant $c$, an i.i.d. Gaussian matrix $\Phi$ can be shown to have the RIP with high probability if $M \ge cK \log(N/K)$; besides, $\Phi$ is universal: $\Phi \Psi$ will also be i.i.d. Gaussian and thus have the RIP regardless of the choice of $\Psi$ [12].

Since directly obtaining $s$ (which equals recovering $x$ in the $\Psi$-domain) via
$$\min_s \|s\|_0 \quad \text{s.t.} \quad y = \Phi \Psi s$$
is an NP-complete problem, it is commonly relaxed to the $\ell_1$ optimization
$$\min_s \|s\|_1 \quad \text{s.t.} \quad y = \Phi \Psi s.$$
When noise is considered, the optimization is modified to
$$\min_s \|s\|_1 \quad \text{s.t.} \quad \|y - \Phi \Psi s\|_2 \le \varepsilon,$$
where $\varepsilon$ is a prespecified threshold.
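A minimal numerical illustration of the recovery step, using a greedy orthogonal matching pursuit solver in place of a convex $\ell_1$ solver; the signal length, sparsity, and measurement count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

N, M, K = 256, 64, 5                       # signal length, measurements, sparsity
s_true = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
s_true[support] = rng.normal(0.0, 1.0, K)  # a K-sparse signal

Phi = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))  # i.i.d. Gaussian measurement matrix
y = Phi @ s_true                                  # compressed measurements (M << N)

def omp(A, y, k):
    """Greedy recovery: pick the atom most correlated with the residual,
    re-fit the selected atoms by least squares, repeat k times."""
    residual, idx = y.copy(), []
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
        residual = y - A[:, idx] @ coef
    s = np.zeros(A.shape[1])
    s[idx] = coef
    return s

s_hat = omp(Phi, y, K)   # with M = 64 >> K log(N/K), recovery succeeds w.h.p.
```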

2.3. L1 Tracker and Real-Time Compressive Sensing Tracking

The L1 tracker considers the tracking target as a sparse representation in the linear span of the target template set, corrupted by noise. Given the target template set $T = [t_1, \dots, t_n]$ and the trivial templates $I$ (the identity matrix), a tracking result $y$ can be modeled as
$$y = Ta + e^{+} - e^{-} = [T, I, -I]\,c, \qquad c = [a;\, e^{+};\, e^{-}] \ge 0,$$
where $a$ is the target coefficient vector and $e^{+}$, $e^{-}$ are called the positive and negative trivial coefficient vectors. With this definition, at time $t$, the coefficient vector of each candidate target $y_i$ is obtained by solving the $\ell_1$-regularized least squares problem
$$\min_{c_i} \|B c_i - y_i\|_2^2 + \lambda \|c_i\|_1 \quad \text{s.t.} \quad c_i \ge 0, \qquad i = 1, \dots, P, \qquad (10)$$
where $B = [T, I, -I]$ and $P$ is the number of particles; the residuals of the particles are then imported into the particle filter to estimate the tracking result [4].

Based on the model above, to reduce the computational load, [6] modifies (10) to the following minimization model:
$$\min_{c} \tfrac{1}{2}\|B c - y\|_2^2 + \lambda \|a\|_1 + \tfrac{\mu}{2}\|e\|_2^2, \qquad (11)$$
where $B = [T, I]$, $c = [a;\, e]$, $a \ge 0$, and $\mu$ is a parameter controlling the energy in the trivial templates; an accelerated proximal gradient approach is then proposed for solving problem (11). Meanwhile, RTCST uses compressive sensing to deal with the extremely high dimension of the feature space in the L1 tracker. Given the measurement matrix $\Phi$, we get the measurement vector $\tilde{y} = \Phi y$; a sparse coefficient vector can then be recovered from
$$\min_{c} \|c\|_1 \quad \text{s.t.} \quad \|\Phi B c - \tilde{y}\|_2 \le \varepsilon.$$

3. Multifeature Based Compressive Sensing Tracking

3.1. Target Template Initialization

Previous studies indicate that multifeature methods achieve better robustness than methods employing a single feature [13, 14]; moreover, good feature descriptors are helpful for stable tracking. This motivates us to employ a multifeature method to form the target template set. Let $f_k$ be the $k$th feature extracted from a target image whose central coordinate is $c$; an observation of the target can then be characterized by a vector $t = [f_1; f_2; \dots; f_K]$. In the first frame, after manually selecting the target, the features of the target image are calculated to form a template $t_1$; for initialization, we then perturb the center of the target image to generate candidate image centers, crop candidates of the same size as the target image at those centers, and extract the features of each candidate to form the remaining templates $t_2, \dots, t_n$, finally initializing the template set as $T = [t_1, t_2, \dots, t_n]$. Here we use a combination of visible and infrared features to improve the robustness of the tracking algorithm: the main visible features are the RGB histogram and local binary patterns (LBP), and the main infrared feature is the intensity histogram.
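The initialization procedure can be sketched as follows; for brevity this sketch uses raw unit-normalized pixel intensities as the feature vector in place of the RGB/intensity histograms and LBP described below, and the perturbation scale and template count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def init_templates(frame, center, size, n_templates=10, shift=2.0):
    """Build T = [t1, ..., tn]: t1 from the selected target image,
    the rest from crops around slightly perturbed centers."""
    h, w = size
    def crop_features(cy, cx):
        cy, cx = int(round(cy)), int(round(cx))
        patch = frame[cy - h // 2: cy - h // 2 + h, cx - w // 2: cx - w // 2 + w]
        v = patch.astype(float).ravel()
        return v / np.linalg.norm(v)          # unit-norm feature vector
    templates = [crop_features(*center)]      # t1: the manually selected target
    for _ in range(n_templates - 1):          # t2..tn: perturbed candidates
        dy, dx = rng.normal(0.0, shift, 2)
        templates.append(crop_features(center[0] + dy, center[1] + dx))
    return np.column_stack(templates)

frame = rng.random((120, 160))                # stand-in for one gray frame
T = init_templates(frame, center=(60, 80), size=(24, 16))
```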

3.1.1. RGB and Intensity Histogram

A histogram is an estimate of the probability distribution of a continuous variable; here we employ an alternative histogram extraction approach in which pixels further away from the region center are assigned smaller weights. The weighting function (also called the kernel function) is
$$k(r) = \begin{cases} 1 - r^2, & r < 1, \\ 0, & \text{otherwise}, \end{cases}$$
where $r$ is the normalized distance from the region center. The histogram at location $c$ can then be described as
$$q_u(c) = C \sum_{i} k\!\left(\left\|\frac{c - x_i}{h}\right\|^2\right) \delta\big[b(x_i) - u\big],$$
where $C$ is a normalization factor, $x_i$ denotes the current coordinates of the pixels in the candidate target, and $h$ is the bandwidth of the kernel function. $\delta$ is the Kronecker delta function, $u$ is the histogram bin, and $b(x_i)$ associates the pixel at $x_i$ with a histogram bin according to its RGB or gray intensity.
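A sketch of the kernel-weighted histogram for a single gray-scale patch, assuming an Epanechnikov-style profile $k(r) = 1 - r^2$ and a bandwidth equal to the patch half-size:

```python
import numpy as np

def weighted_histogram(patch, n_bins=16):
    """Kernel-weighted intensity histogram: pixels far from the region
    center contribute less (kernel profile k(r) = 1 - r^2 for r < 1)."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Squared distance from the center, normalized by the bandwidth (half-size).
    r2 = ((ys - cy) / (h / 2.0)) ** 2 + ((xs - cx) / (w / 2.0)) ** 2
    k = np.where(r2 < 1.0, 1.0 - r2, 0.0)                        # kernel weights
    bins = np.minimum((patch * n_bins).astype(int), n_bins - 1)  # b(x_i)
    hist = np.bincount(bins.ravel(), weights=k.ravel(), minlength=n_bins)
    return hist / hist.sum()                                     # normalization C

patch = np.random.default_rng(3).random((32, 32))  # gray values in [0, 1)
q = weighted_histogram(patch)
```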

3.1.2. LBP

LBP (local binary pattern) is a feature characterizing the texture spectrum of an image; it has proved to be a powerful feature for texture classification. The original LBP operator first divides the examined window into cells and then compares each pixel in a cell to each of its 8 neighbors; for a center pixel with gray value $g_c$ and neighbor gray values $g_p$ ($p = 0, \dots, 7$), the LBP value is calculated as
$$\mathrm{LBP} = \sum_{p=0}^{7} s(g_p - g_c)\, 2^p, \qquad s(z) = \begin{cases} 1, & z \ge 0, \\ 0, & z < 0. \end{cases}$$
A histogram is then computed over the cell to generate the feature vector for the window.
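A direct implementation of the 8-neighbour LBP operator, followed by the per-cell histogram (here one cell covers the whole window):

```python
import numpy as np

def lbp_image(gray):
    """8-neighbour LBP: threshold each neighbour against the centre pixel
    and pack the eight comparison bits into one code per interior pixel."""
    c = gray[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]   # clockwise neighbours
    code = np.zeros_like(c, dtype=np.uint8)
    for p, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy: gray.shape[0] - 1 + dy,
                         1 + dx: gray.shape[1] - 1 + dx]
        code |= (neighbour >= c).astype(np.uint8) << p   # s(g_p - g_c) * 2^p
    return code

gray = np.random.default_rng(4).integers(0, 256, (10, 10)).astype(np.uint8)
codes = lbp_image(gray)
hist = np.bincount(codes.ravel(), minlength=256)   # LBP histogram feature
```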

3.2. Compressive Sensing Tracking Based on Multifeatures

As in [5, 6], we apply affine image warping to model the object motion between two consecutive frames; the state variable $x_t$ combines the affine deformation parameters with the 2D translation parameters, and the state transition distribution $p(x_t \mid x_{t-1})$ is modeled as a Gaussian distribution.

In the L1 tracker and RTCST, the dictionary is composed of the target template set $T$, the positive trivial templates $I$, and the negative trivial templates $-I$; correspondingly, the coefficient vector is divided into the target coefficient vector $a$, the positive trivial coefficient vector $e^{+}$, and the negative trivial coefficient vector $e^{-}$ to enforce nonnegativity constraints. The main reason for taking nonnegativity constraints into account is that they help filter out clutter that is similar to the target templates but with reversed intensity patterns; furthermore, the target image in the current frame normally appears more like that in the previous frame. Since multiple features already have the potential of filtering out clutter and occlusion, the nonnegativity constraints can be relaxed; besides, an extra occlusion handling method becomes less necessary. For example, when tracking a pedestrian, an occlusion may occur in the visible image but not in the related infrared image; in other words, the infrared features help filter out the occlusions. Once the target template set $T$ is available after initialization or update, we model the observation as
$$y = [T, I]\begin{bmatrix} a \\ e \end{bmatrix} = Bc,$$
where $I$ is the trivial template representing noise, $e$ is the trivial coefficient vector, and $a$ is the target coefficient vector.

Under a certain threshold, adding more weakly correlated features increases robustness and stability but incurs more computational cost. To deal with this problem, we adopt the compressed sensing theory. Choosing a random Gaussian matrix $\Phi$ as the measurement matrix, we take the measurements $\tilde{y} = \Phi y$; then, by using orthogonal matching pursuit (OMP), we recover the coefficient vector $c$ from
$$\min_{c} \|c\|_0 \quad \text{s.t.} \quad \|\Phi B c - \tilde{y}\|_2 \le \varepsilon. \qquad (17)$$
The main flow of the multifeature based compressive sensing tracking (MFCST) algorithm is presented in Algorithm 1.

Input:
 (i) Current visible frame and infrared frame
 (ii) Particles
 (iii) Measurement matrix
 (iv) Template set
 (v) Preset parameter
(1) Normalize each column of the template set;
(2) Generate particles according to state transition distribution;
(3) for  i = 1 to P  do
(4)    Obtain the target candidate and the relevant candidate center according to particle i;
(5)    Calculate the RGB histogram and LBP of the candidate in the visible frame and the intensity histogram of the candidate in the infrared frame;
(6)    Stack the features into an observation y_i;
(7)    Get the measurements by multiplying y_i with the measurement matrix;
(8)    Solve (17) to get the coefficient vector c_i;
(9)    Calculate the residual r_i;
(10) Obtain the observation likelihood from r_i;
(11) end
(12) Resample the particles according to their importance weights;
(13) Use Maximum A Posteriori (MAP) or Minimum Mean Square Error (MMSE) estimation to
estimate the state of the current frame, which characterizes the current tracking target;
(14) Get the target observation corresponding to the current tracking result;
(15) Recalculate the coefficient vector for this observation by solving (17);
(16) Update the target template set using Algorithm 2;
Output:
 (i) Tracked target
 (ii) Updated target dynamic state
 (iii) Updated target template set
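Steps (7)-(10) of Algorithm 1 can be sketched as follows. The Gaussian form of the likelihood, exp(-r^2 / 2*sigma^2), and all dimensions are illustrative assumptions, since the text only states that the likelihood is obtained from the residual:

```python
import numpy as np

rng = np.random.default_rng(5)

d, m, n_t = 400, 80, 10
T = rng.normal(0.0, 1.0, (d, n_t))
T /= np.linalg.norm(T, axis=0)                   # step (1): normalize columns
B = np.hstack([T, np.eye(d)])                    # dictionary [T, I] with trivial templates
Phi = rng.normal(0.0, 1.0 / np.sqrt(m), (m, d))  # Gaussian measurement matrix

def particle_likelihood(y, sigma=0.1, k=12):
    """Compress the observation, recover a sparse coefficient vector with
    OMP, and map the reconstruction residual to a likelihood weight."""
    y_m = Phi @ y                                # step (7): compressed measurements
    A = Phi @ B
    residual, idx = y_m.copy(), []
    for _ in range(k):                           # step (8): OMP recovery
        idx.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, idx], y_m, rcond=None)
        residual = y_m - A[:, idx] @ coef
    r = np.linalg.norm(residual)                 # step (9): residual
    return float(np.exp(-r**2 / (2 * sigma**2)))  # step (10): assumed likelihood form

good = particle_likelihood(T[:, 0])              # candidate equal to a template
bad = particle_likelihood(rng.normal(0.0, 1.0, d))  # unrelated candidate
```

A candidate that matches a template yields a near-zero residual and a likelihood near 1, while an unrelated candidate is weighted close to 0.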

3.3. Template Update

Intuitively, as time goes by, a static template will fail to capture the appearance variations of the target caused by illumination or pose changes; on the other hand, updating the template too frequently accumulates errors and drifts the tracker away from the target. We tackle this problem by adaptively updating the target template. Notice that the coefficient vector $a$ can be seen as a sparse representation of the observation $y$ in the linear span of $T$, which means that the bigger a coefficient $a_m$ is, the more important the corresponding template $t_m$ is. The Bhattacharyya coefficient is applied to measure the similarity between the target observation $y$ and the template $t_m$ that has the maximum coefficient $a_m$, defined as
$$\rho(y, t_m) = \sum_{u} \sqrt{\hat{y}_u \, \hat{t}_{m,u}},$$
where $\hat{y}$ and $\hat{t}_m$ denote the normalized histograms; a typical value of $\rho$ lies in $(0, 1]$, and a bigger $\rho$ means that $y$ is more similar to $t_m$. We adopt two thresholds $\tau_1$ and $\tau_2$ to guide the updating process: $\rho < \tau_1$ indicates that $y$ is not so similar to $t_m$, which reflects appearance variations during tracking; moreover, when $\rho$ decreases below the lower threshold $\tau_2$, we consider that a strong interference has occurred in the tracking process. The template update scheme is summarized in Algorithm 2.

Input:
 (i) Newly tracked result and the corresponding observation y
 (ii) Target coefficient vector a
 (iii) Template set T = [t_1, ..., t_n]
 (iv) Bhattacharyya thresholds τ1, τ2
(1) m ← arg max_i a_i;
(2) if  τ2 < ρ(y, t_m) < τ1  then
(3)   k ← arg min_i a_i;
(4)   t_k ← y;
(5) end
(6) Normalize every column of T;
Output: Updated template set T
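The update scheme can be sketched as below; the threshold values are illustrative, and replacing the template with the smallest coefficient when an appearance change is detected is an assumption of this sketch:

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two histograms normalized to sum to 1."""
    return float(np.sum(np.sqrt(p * q)))

def update_templates(T, y, a, tau1=0.90, tau2=0.40):
    """Compare the new observation with the template holding the maximum
    coefficient; update when the similarity signals an appearance change
    (tau2 < rho < tau1) and skip when rho <= tau2 (strong interference)."""
    m = int(np.argmax(a))                          # most important template
    rho = bhattacharyya(y / y.sum(), T[:, m] / T[:, m].sum())
    if tau2 < rho < tau1:
        T = T.copy()
        T[:, int(np.argmin(a))] = y                # replace the least-used template
    return T / np.linalg.norm(T, axis=0), rho      # re-normalize every column

T0 = np.column_stack(([0.5, 0.3, 0.2], [0.2, 0.3, 0.5]))
y_new = np.array([0.45, 0.35, 0.20])               # close to the first template
T1, rho = update_templates(T0, y_new, a=np.array([0.9, 0.1]))
```

Here rho is close to 1, so no update is triggered and only the column normalization is applied.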

4. Experiments

We test our algorithm on real-world visible and infrared video sequences from dataset 03, the OSU Color-Thermal Database, of the OTCBVS Benchmark Dataset [15]; three scenarios from two videos are involved in our experiments. To evaluate the performance of the proposed tracking framework, we compare the tracking results of our proposed multifeature based compressive sensing tracking (MFCST) with the adaptive multicue particle filter (AMC-PF) [10], the tracking method based on infrared and visible dual-channel video (IVDT) [9], the L1 tracker [4], and real-time compressive sensing tracking (RTCST) [5]. All targets are manually marked in the first tracking frame without careful selection.

The test sequences used in our first and second experiments, S1 and S2, are based on location 2 of the data set. In S1, two pedestrians cross paths, which causes a partial occlusion. In the visible frames, the whole scene is well illuminated and the two pedestrians are distinct both from each other and from the background; in the corresponding infrared frames, however, the two pedestrians look similar to each other. Some tracking results are given in Figure 1, where the frame indexes are labeled in the upper left corner of the images. It can be observed that although all trackers follow the target pedestrian, AMC-PF, RTCST, and MFCST do so more precisely than the L1 tracker and IVDT.

The second experiment, S2, is based on a scene in which the target pedestrian passes by a parked car. The target pedestrian has a similar appearance to the front part and tire of the car in both the visible and the corresponding infrared frames. The results in Figure 2 indicate that AMC-PF drifts to the car when the target occludes the tire. This is because AMC-PF uses the intensity cue and edge cue of the infrared image, and the appearances of the pedestrian and the front part of the car in the infrared frames are easily confused.

The third experiment, S3, considers a more complex scene; we use a video sequence from location 1 of the OSU Color-Thermal Database and test the 5 trackers mentioned above from frame 220 to frame 429. Since it is hard to distinguish the target pedestrian from the background, we manually labeled the tracking target in the first infrared frame (frame 220) at the beginning of experiment S3, without careful selection. The whole video sequence can be described in 5 clips. In the first clip, which covers the starting frame 220 to frame 250, the target walking in the shadow is confused with the background in the visible frames but is distinct in the infrared frames. As the target walks straight ahead in clip 2, from frame 250 to 322, the tracked pedestrian is partially occluded by the first street lamp, which we denote as lamp 1; then, as the target walks out of the shadow, the illumination conditions change drastically. A heavy occlusion appears in clip 3, from frame 322 to 374: when the target passes by the second street lamp, lamp 2 completely occludes the pedestrian in both the visible and the infrared images. After that, the target is partially occluded by pedestrian 2, who wears a red coat and is obviously different from the target pedestrian in the visible frames; several frames later, the target walks past pedestrian 2 and encounters the third street lamp; these sequences are described in clip 4, from frame 374 to 390. At last, in clip 5, from frame 390 to 427, the target walks into a square and finishes the whole tracking process. The performance of the 5 methods is shown in Figure 3, from which we can see that the L1 tracker and RTCST failed to track the target in clip 1; the main reason may be that these two algorithms only use the pixel information of the visible images, and it is very difficult to distinguish the target from the background in these visible frames. AMC-PF drifts to street lamp 1 from clip 2 onwards, where the illumination changes severely and lamp 1 occludes the target in both the visible and infrared frames.
The dual-channel method IVDT succeeds in clips 1 and 2 but begins to lose the target from clip 3. Meanwhile, our proposed MFCST accurately tracks the target pedestrian throughout the whole sequence, which indicates that our approach is more robust in dealing with heavy occlusions and drastic illumination changes. In addition, compared with the multifeature tracking methods AMC-PF and IVDT, our approach takes fewer measurements; furthermore, compared with the L1 tracker and RTCST, our algorithm achieves better accuracy.

5. Conclusion

We model the tracking target as a sparse representation in the linear span of a multifeature space. To generate the dictionary, we use the RGB histogram and local binary patterns (LBP) of the visible frames together with the intensity histogram of the infrared frames to create the target template set, which is then combined with the trivial templates. A compressive sensing method is employed under the particle filter framework to reduce the high dimension of the feature space and solve for the coefficients of the sparse representation. For further robustness, we introduce an adaptive template update scheme for this system. The proposed tracker achieves robust and stable performance in dealing with occlusions and illumination variations.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by National Natural Science Foundation (NNSF) of China under Grant 61203266, Doctoral Foundation of the Ministry of Education of China under Grant 20113219110027, and National Defense Pre-Research Foundation of China under Grant 40405020201.