Abstract

Recent years have seen growing interest in tracking-by-detection methods for visual object tracking because of their excellent tracking performance. However, most existing methods fix the target scale, which makes the trackers unreliable when handling large scale variations in complex scenes. In this paper, we decompose tracking into target translation and scale prediction. We adopt a scale estimation approach based on the tracking-by-detection framework, develop a new model update scheme, and present a robust correlation tracking algorithm with discriminative correlation filters. The approach works by learning translation and scale correlation filters. We obtain the target translation and scale by finding the maximum output responses of the learned correlation filters and then update the target models online. Extensive experimental results on 12 challenging benchmark sequences show that the proposed tracking approach reduces the average center location error (CLE) by 6.8 pixels and significantly improves the performance by 17.5% in average success rate (SR) and by 5.4% in average distance precision (DP) compared to the second best of five excellent existing tracking algorithms, and that it is robust to appearance variations introduced by scale variations, pose variations, illumination changes, partial occlusion, fast motion, rotation, and background clutter.

1. Introduction

Visual tracking, as a fundamental step in exploring videos, is important in many computer vision based applications, such as face recognition, human behavior analysis, robotics, intelligent surveillance, intelligent transportation systems, and human-computer interaction. The objective of visual tracking is to estimate the locations of a target in a video sequence [1–3]. During the tracking process, the state of the target is estimated over time by associating its representation in the current frame with those in previous frames. Although research on visual tracking algorithms has lasted for decades, visual tracking remains a difficult problem because of factors such as pose variation, illumination changes, partial occlusion, fast motion, scale variation, and background clutter.

In general, current tracking algorithms can be classified as either generative or discriminative approaches. Generative approaches [4–7] focus on learning an appearance model and formulate tracking as finding the target observation most similar to the learned appearance or with the minimal reconstruction error. The models are based on templates or subspace representations. However, these generative models do not take background information into consideration and therefore discard useful information that can help discriminate the object from the background. Different from generative trackers, discriminative methods [8–13] address tracking as a classification problem that differentiates the tracked target from the background, employing both target and background information. For example, Avidan [14] proposes a strong classifier based on a set of weak classifiers to perform ensemble tracking. Kalal et al. [15] propose a P-N learning algorithm to learn tracking classifiers from positive and negative samples. These methods are also termed tracking-by-detection [16–18], in which a binary classifier separates the target from the background in consecutive frames. In recent years, tracking-by-detection methods have been shown to provide excellent tracking performance.

Most current tracking algorithms are confined to estimating only the target location, which implies poor tracking performance on sequences with large scale changes. Several methods [19–21] that use Scale Invariant Feature Transform (SIFT) features can adapt to object scale variations, but they run at low frame rates and are therefore unsuitable for real-time applications. Tu et al. [19] propose a vehicle tracking approach combining blob based tracking and SIFT based tracking, which is robust to changes in vehicle size. Jiang et al. [20] present a novel object tracking algorithm based on the particle filter and SIFT. Wei et al. [21] propose a SIFT based mean shift algorithm, which can be used for continuous vehicle tracking in complex situations. In this paper, we present an adaptive scale tracking approach using discriminative correlation filters, which can estimate the target scale accurately. One main contribution of this work is to decompose the tracking task into translation and scale estimation, both of which make use of kernelized correlation filters. In addition, we adopt a new online update scheme based on the MOSSE [22] tracker, which takes all previous frames into consideration when computing the current models. Experimental results on challenging video sequences demonstrate the superior robustness and stability of the proposed method against state-of-the-art methods.

The rest of this paper is organized as follows. A brief summary of the most related work is first given in Section 2. The tracking algorithm with kernelized correlation filters is introduced in Section 3. Section 4 describes our proposed approach. Following this, the experimental results are presented with comparisons to state-of-the-art methods on challenging sequences in Section 5. Finally, we conclude this paper in Section 6.

2. Related Work

Visual object tracking has been studied extensively and has many applications. In this section, we introduce the approaches most closely related to our work.

Correlation filters have been used in many applications such as object detection and recognition [23]. Since correlation is readily computed in the Fourier domain as an element-wise multiplication, correlation filters have recently attracted considerable attention in visual tracking due to their computational efficiency. In recent years, researchers have begun to bring correlation filters into tracking-by-detection methods, with great success. Bolme et al. [22] propose learning a minimum output sum of squared error (MOSSE) filter for visual tracking on gray-scale images, where the learned filter encodes the target appearance and is updated at every frame. Henriques et al. [24] propose the circulant structure of tracking-by-detection with kernels (CSK) method, which uses correlation filters in a kernel space. They propose the first kernelized correlation filter, but the CSK method builds only on single-channel features. Generalizations of linear correlation filters to multiple channels have also been proposed [25–27], which allow the use of more modern features such as histograms of oriented gradients (HOG). Henriques et al. [28] propose a kernelized correlation filter (KCF) tracking algorithm, which further improves CSK by using HOG features. Danelljan et al. [1] propose an adaptive color attributes tracking method, which exploits the color attributes of a target and learns an adaptive correlation filter by mapping multichannel features into a Gaussian kernel space. However, the above methods do not consider target scale prediction. Recently, Wu et al. [29] performed a comprehensive evaluation of online tracking algorithms, in which the CSK tracker provides competitive performance with the highest speed among the ten top trackers. Due to its excellent performance, we base our approach on the CSK tracker.

3. Kernelized Correlation Filters Based Tracking

For correlation filter based trackers, correlation can be computed in the Fourier domain through the Fast Fourier Transform (FFT), and the correlation response can be transformed back into the spatial domain with the inverse FFT. The CSK tracking method exploits a dense sampling strategy and shows that the process of taking subwindows in a frame induces a circulant structure. The CSK tracker learns a regularized least squares (RLS) classifier of the target appearance from a single image patch, derives the kernelized correlation filter using circulant matrices and the kernel trick, and localizes the target in a new frame by finding the maximum response of the correlation filter. In this section, we briefly describe the CSK tracker.

3.1. Circulant Matrices

Assume $C(u)$ is an $n \times n$ circulant matrix; then it can be obtained from a vector $u = (u_1, u_2, \ldots, u_n)$:
$$C(u) = \begin{pmatrix} u_1 & u_n & \cdots & u_2 \\ u_2 & u_1 & \cdots & u_3 \\ \vdots & \vdots & \ddots & \vdots \\ u_n & u_{n-1} & \cdots & u_1 \end{pmatrix}. \tag{1}$$

The first column is the transposition of the vector $u$, the second column is the transposition of the vector $u$ cyclically shifted one element to the right, and so on. For an $n \times 1$ vector $v$, the product of $C(u)$ and $v$ represents the convolution of the vectors $u$ and $v$ [30]; it can be expressed in the Fourier domain as follows:
$$C(u)v = \mathcal{F}^{-1}\big(\mathcal{F}(u) \odot \mathcal{F}(v)\big), \tag{2}$$
where $\mathcal{F}^{-1}$ and $\mathcal{F}$ denote the inverse Fourier transform and the Fourier transform, respectively, and $\odot$ denotes element-wise multiplication.
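
For concreteness, the following numpy sketch (our own illustration, not part of the original implementation; the function name and shift convention are assumptions matching (1)) verifies the identity in (2) on a random vector pair:

```python
import numpy as np

# Sketch: verify C(u)v = F^{-1}(F(u) ⊙ F(v)) from (2).
def circulant(u):
    n = len(u)
    # Column j is u cyclically shifted down by j elements,
    # so the first column is u itself, as in (1).
    return np.stack([np.roll(u, j) for j in range(n)], axis=1)

rng = np.random.default_rng(0)
u, v = rng.standard_normal(8), rng.standard_normal(8)
direct = circulant(u) @ v
via_fft = np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)))
assert np.allclose(direct, via_fft)
```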

3.2. The Regularized Least Squares Classification

It has been shown that, in many practical problems, the RLS classifier offers classification performance equivalent to the support vector machine (SVM) while being easier to implement [31]. The approach uses a single gray-scale image patch centred on the target to train the classifier. The classifier has the form $f(x) = \langle w, x \rangle$, and it is trained by minimizing the cost function in (3) over the samples $x_i$:
$$\min_{w} \sum_{i} \big(f(x_i) - y_i\big)^2 + \lambda \|w\|^2, \tag{3}$$
where $y_i$ is the desired output for the sample $x_i$ and $\lambda$ is a regularization parameter.

Mapping the inputs $x$ to a feature space $\varphi(x)$ with the kernel trick, the kernel is $\kappa(x, x') = \langle \varphi(x), \varphi(x') \rangle$. Then we can express the solution $w$ as a linear combination of the inputs [32]:
$$w = \sum_{i} \alpha_i \varphi(x_i), \tag{4}$$
where $\alpha_i$ is the $i$th coefficient.

Then the RLS with kernels has the simple closed-form solution [31]:
$$\alpha = (K + \lambda I)^{-1} y, \tag{5}$$
where $K$ is the kernel matrix with elements $K_{ij} = \kappa(x_i, x_j)$, $I$ is the unit matrix, $y$ is the desired output vector with elements $y_i$, and $\alpha$ is the transformed classifier coefficient vector with elements $\alpha_i$.
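
As an illustration (our own sketch, not the paper's code; the kernel helper and the label construction are assumptions), the direct solve of (5) for the cyclic shifts of a small 1-D signal looks as follows:

```python
import numpy as np

# Sketch of (5): kernel RLS solved directly, at O(n^3) cost.
# Training samples are all cyclic shifts of a base signal x;
# labels form a Gaussian peaked at the unshifted sample.
def gaussian_kernel(a, b, sigma=0.2):
    return np.exp(-np.sum((a - b) ** 2) / (sigma ** 2 * a.size))

n = 32
x = np.random.default_rng(1).standard_normal(n)
X = np.stack([np.roll(x, i) for i in range(n)])   # dense cyclic shifts
shift = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-shift ** 2 / (2 * 2.0 ** 2))          # Gaussian labels

K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
lam = 0.01
alpha = np.linalg.solve(K + lam * np.eye(n), y)   # (K + λI)α = y
```

The circulant structure described next reduces this cubic-cost solve to a few FFTs.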

3.3. Fast Target Location Estimation

It has been proved that the kernel matrix $K$ is circulant if the kernel $\kappa$ is unitarily invariant [24]. We can then obtain (6) from (5) according to the property of circulant matrices:
$$\alpha = \mathcal{F}^{-1}\left(\frac{\mathcal{F}(y)}{\mathcal{F}(k^{xx}) + \lambda}\right), \tag{6}$$
where $k^{xx}$ is the vector with elements $k_i^{xx} = \kappa(x_i, x)$ and $x_i$ denotes the $i$th cyclic shift of the base patch $x$.

We perform target location detection on an image patch of interest $z$ in a new frame. The response of the RLS classifier is $\hat{y} = f(z)$, and it can be computed in the Fourier domain as follows:
$$\hat{y} = \mathcal{F}^{-1}\big(\mathcal{F}(k^{\bar{x}z}) \odot \mathcal{F}(\alpha)\big), \tag{7}$$
where $k^{\bar{x}z}$ is the vector with elements $k_i^{\bar{x}z} = \kappa(z_i, \bar{x})$, $\bar{x}$ represents the target model learned from the previous frame, and $z_i$ is the $i$th cyclic shift of the image patch $z$.

The position of the target in the new frame is obtained by finding the position that maximizes $\hat{y}$, that is, the position that maximizes the response of the filter. Then $\alpha$ and $\bar{x}$ are updated as follows:
$$\alpha_t = (1 - \gamma)\,\alpha_{t-1} + \gamma\,\alpha'_t, \tag{8}$$
$$\bar{x}_t = (1 - \gamma)\,\bar{x}_{t-1} + \gamma\,x_t, \tag{9}$$
where $\gamma$ is the learning rate; $\alpha_t$ and $\alpha_{t-1}$ denote the updated coefficients at frame $t$ and frame $t-1$, respectively; $\bar{x}_t$ and $\bar{x}_{t-1}$ denote the updated target models at frame $t$ and frame $t-1$, respectively; and $\alpha'_t$ and $x_t$ denote the coefficients and target appearance computed from frame $t$ alone. For more details, we refer to [24].
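
Putting (6)–(9) together, a minimal numpy sketch of the CSK core might look as follows (our own illustration on 1-D signals; the variable names, the FFT-based dense Gaussian kernel, and the omission of the cosine window are assumptions):

```python
import numpy as np

def dense_gauss_kernel(x1, x2, sigma=0.2):
    # k^{x1 x2}: Gaussian kernel between x2 and every cyclic shift of x1,
    # evaluated all at once via FFT-based circular cross-correlation.
    c = np.real(np.fft.ifft(np.fft.fft(x1) * np.conj(np.fft.fft(x2))))
    d = (x1 @ x1 + x2 @ x2 - 2.0 * c) / x1.size
    return np.exp(-np.maximum(d, 0) / sigma ** 2)

def train(x, y, lam=0.01):
    # Eq. (6): classifier coefficients, computed with a few FFTs.
    k = dense_gauss_kernel(x, x)
    return np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k) + lam)))

def detect(alpha, x_model, z):
    # Eq. (7): filter response over all cyclic shifts of the new patch z;
    # the argmax of the response gives the translation.
    k = dense_gauss_kernel(x_model, z)
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(alpha)))

def interpolate(old, new, gamma=0.075):
    # Eqs. (8)-(9): linear interpolation update with learning rate gamma.
    return (1.0 - gamma) * old + gamma * new
```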

4. The Proposed Visual Tracking Algorithm

In this section, we present the adaptive scale tracking method based on kernelized correlation filters in detail. Recently, Danelljan et al. [8] proposed a scale estimation method based on the MOSSE filter. Inspired by it, we propose a robust correlation tracking approach based on the CSK tracker. Since the scale changes very little between two consecutive frames in visual tracking, we first detect the target position using the position kernelized correlation filter and then estimate the target scale using the scale kernelized correlation filter, which is learned from samples collected around the detected target. In the following subsections, we introduce a new online update scheme and a scale prediction strategy.

4.1. Online Update Scheme

Since the appearance of the target often changes significantly during tracking, it is necessary to update the target model to adapt to these changes. In the CSK tracker, the model consists of the transformed classifier coefficients and the learned target model, but they are computed considering only the current appearance. This limits performance because not all previous frames are considered when computing the current model. In contrast, the MOSSE tracker [22] employs a robust update scheme that considers all previous frames when computing the current model and performs well. Here we adopt the same idea to update the models in our approach and take all the extracted target appearances from the first frame to the current frame into consideration. Therefore, the cost function in (3) is modified as
$$\min_{w} \sum_{j=1}^{t} \left( \sum_{i} \big(f(x_i^j) - y_i^j\big)^2 + \lambda \|w\|^2 \right), \tag{10}$$
where $x_i^j$ denotes the $i$th cyclic shift of the target appearance $x^j$ extracted at frame $j$.

Then the coefficients for frame $t$ can be computed in the Fourier domain as follows:
$$A_t = \frac{A_t^N}{A_t^D} = \frac{\sum_{j=1}^{t} \mathcal{F}(y) \odot \overline{\mathcal{F}(k^{x_j})}}{\sum_{j=1}^{t} \mathcal{F}(k^{x_j}) \odot \overline{\mathcal{F}(k^{x_j})} + \lambda}, \tag{11}$$
where $A_t = \mathcal{F}(\alpha_t)$, the bar denotes complex conjugation, and $k^{x_j}$ is the vector with elements $k_i^{x_j} = \kappa(x_i^j, x^j)$.

Then (7) is expressed as
$$\hat{y} = \mathcal{F}^{-1}\big(A_{t-1} \odot \mathcal{F}(k^{\bar{x}_{t-1} z})\big). \tag{12}$$

The target appearance $\bar{x}_t$ is updated using (9). Here we update the numerator $A_t^N$ and the denominator $A_t^D$ of $A_t$ in (11) separately as
$$A_t^N = (1 - \gamma)\,A_{t-1}^N + \gamma\,\mathcal{F}(y) \odot \overline{\mathcal{F}(k^{x_t})}, \tag{13}$$
$$A_t^D = (1 - \gamma)\,A_{t-1}^D + \gamma\left(\mathcal{F}(k^{x_t}) \odot \overline{\mathcal{F}(k^{x_t})} + \lambda\right). \tag{14}$$
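
As an illustration (our own sketch with assumed names, reusing dense_gauss_kernel from the sketch in Section 3.3), one step of the update in (13), (14), and (9) could be written as:

```python
import numpy as np

def update_model(A_num, A_den, x_model, x_t, y, gamma=0.075, lam=0.01):
    # One step of the new update scheme: running averages of the numerator
    # and denominator of A_t, so every past frame contributes to the model.
    k_f = np.fft.fft(dense_gauss_kernel(x_t, x_t))                   # F(k^{x_t})
    A_num = (1 - gamma) * A_num + gamma * np.fft.fft(y) * np.conj(k_f)   # (13)
    A_den = (1 - gamma) * A_den + gamma * (k_f * np.conj(k_f) + lam)     # (14)
    x_model = (1 - gamma) * x_model + gamma * x_t                        # (9)
    return A_num, A_den, x_model      # detection then uses A = A_num / A_den
```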

4.2. The Target Scale Prediction Strategy

To predict the target scale variation, we learn another kernelized correlation filter, training a second classifier on multiscale image patches around the most reliable tracked targets. During tracking, we construct a target pyramid around the tracked target to estimate its scale. We resize the patches to the size of the initial target by bilinear interpolation before extracting features. The training samples for learning the scale filter are computed by extracting HOG features from the resized patches centred on the tracked target. The extracted features are then multiplied by a Hamming window to reduce the boundary effects of the FFT, as described in [22]. Assume the initial target size in the current frame is $P \times R$ and the size of the scale filter is $S$; then we extract the sample $x^s$ from the image patches of size $a^n P \times a^n R$ centred on the target, where $n \in \left\{ \left\lfloor -\frac{S-1}{2} \right\rfloor, \ldots, \left\lfloor \frac{S-1}{2} \right\rfloor \right\}$ and $a$ is the scale factor. The process of extracting features is shown in Figure 1. We compute the coefficients $A_t^s$ by (15) and the response $\hat{y}^s$ for a new frame by (16), update $A_t^s$ using (13) and (14), and update the scale model $\bar{x}_t^s$ using (9). The target scale in the new frame is obtained by finding the scale that maximizes $\hat{y}^s$:
$$A_t^s = \frac{\sum_{j=1}^{t} \mathcal{F}(y^s) \odot \overline{\mathcal{F}(k^{x_j^s})}}{\sum_{j=1}^{t} \mathcal{F}(k^{x_j^s}) \odot \overline{\mathcal{F}(k^{x_j^s})} + \lambda}, \tag{15}$$
where $k^{x_j^s}$ is the vector with elements $k_i^{x_j^s} = \kappa(x_i^{s,j}, \bar{x}_j^s)$, $\bar{x}_j^s$ is the scale model learned at frame $j$, and $y^s$ is the desired output of the scale filter at frame $j$;
$$\hat{y}^s = \mathcal{F}^{-1}\big(A_{t-1}^s \odot \mathcal{F}(k^{\bar{x}_{t-1}^s z^s})\big), \tag{16}$$
where $k^{\bar{x}_{t-1}^s z^s}$ is the vector with elements $k_i = \kappa(z_i^s, \bar{x}_{t-1}^s)$, $\bar{x}_{t-1}^s$ is the scale model learned from frame $t-1$, and $z^s$ is the sample extracted from the new frame.
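
To make the sampling step concrete, the following sketch (our own illustration; the helper name, the use of OpenCV for resizing, and the placeholder defaults for S and a are assumptions, since the paper's exact settings appear in Section 5.1) builds the scale sample pyramid; HOG extraction is omitted for brevity:

```python
import numpy as np
import cv2  # used here only for bilinear resizing (an assumption)

def scale_samples(frame, centre, P, R, S=33, a=1.02):
    # Crop S patches of size (a^n P) x (a^n R) around the target centre,
    # resize each to the initial target size P x R, and weight the feature
    # rows by a Hamming window over the scale dimension, as in [22].
    cx, cy = centre
    half = (S - 1) // 2
    window = np.hamming(S)
    rows = []
    for i, n in enumerate(range(-half, half + 1)):
        h, w = int(round(a ** n * P)), int(round(a ** n * R))
        x0, y0 = max(cx - w // 2, 0), max(cy - h // 2, 0)
        patch = frame[y0:y0 + h, x0:x0 + w]
        patch = cv2.resize(patch, (R, P))        # bilinear interpolation
        rows.append(window[i] * patch.ravel())   # HOG features omitted here
    return np.stack(rows)                        # one feature row per scale
```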

4.3. Implementation

The overall procedure of our approach is summarized in Algorithm 1. In our approach, we use the Gaussian kernel function $\kappa(x, x') = \exp\left(-\|x - x'\|^2 / \sigma^2\right)$ in both translation and scale detection, where $\sigma$ is the kernel standard deviation. In tracking-by-detection methods, the closer a sample is to the currently tracked target centre, the larger the probability that it is a positive sample. Since the square loss of RLS with kernels allows for continuous values, we need not limit ourselves to binary labels; the line between classification and regression is essentially blurred. For the continuous training outputs, we choose Gaussian functions, which are known to minimize ringing in the Fourier domain [33]. Therefore, the desired outputs $y$ and $y^s$ are both Gaussian functions, expressed in
$$y(m, n) = \exp\left(-\frac{(m - m_0)^2 + (n - n_0)^2}{2\sigma_1^2}\right), \qquad y^s(n) = \exp\left(-\frac{(n - n_0^s)^2}{2\sigma_2^2}\right), \tag{17}$$
where $(m, n)$ represents a target location, $(m_0, n_0)$ represents the coordinate of the tracked target centre, $n$ in $y^s$ indexes a target scale with elements $a^n$ ($n$ an integer), $n_0^s$ is the centre scale of the target, and $\sigma_1$ and $\sigma_2$ are the standard deviations.
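
A small sketch of these label constructions (our own illustration, with assumed parameter names) follows:

```python
import numpy as np

def translation_labels(P, R, sigma1):
    # 2-D Gaussian of (17), peaked at the tracked target centre.
    m, n = np.mgrid[0:P, 0:R]
    m0, n0 = P // 2, R // 2
    return np.exp(-((m - m0) ** 2 + (n - n0) ** 2) / (2 * sigma1 ** 2))

def scale_labels(S, sigma2):
    # 1-D Gaussian of (17) over scale indices; n = 0 is the current scale.
    n = np.arange(S) - (S - 1) / 2
    return np.exp(-(n ** 2) / (2 * sigma2 ** 2))
```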

Input: The $t$th frame $f_t$ of the video sequence, the initial target position $p_0$ and scale $s_0$.
Output: Detected target position $p_t$ and scale $s_t$.
Repeat:
Crop out the search region in frame $f_t$ according to $p_{t-1}$ and $s_{t-1}$, and extract the sample $z$;
//Position Detection:
(1) Compute the response $\hat{y}$ with $z$, $A_{t-1}$, and $\bar{x}_{t-1}$ using (12);
(2) Find the target position $p_t$ which maximizes $\hat{y}$.
//Scale Prediction:
(3) Extract a sample $z^s$ from $f_t$ at $p_t$ and $s_{t-1}$;
(4) Compute the response $\hat{y}^s$ with $z^s$, $A_{t-1}^s$, and $\bar{x}_{t-1}^s$ using (16);
(5) Find the target scale $s_t$ which maximizes $\hat{y}^s$.
//Model Online Update:
(6) Extract samples $x_t$ and $x_t^s$ from $f_t$ at $p_t$ and $s_t$;
(7) Update $A_t$ using (13), (14) and update $\bar{x}_t$ using (9);
(8) Update $A_t^s$ using (13), (14) and update $\bar{x}_t^s$ using (9).
Until the End of the Video Sequence

5. Experimental Results

To verify the effectiveness of the proposed method, we test the tracking algorithm on 12 challenging video sequences from [29], which have been widely used in many recent tracking papers and are summarized in Table 1. We provide both quantitative and attribute-based comparisons with 5 state-of-the-art trackers. The tracking results for the 12 video sequences using the 6 tracking algorithms are shown in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/238971.

5.1. Experiment Environment and Parameters

All our experiments are performed using MATLAB 2010a on a 3.4 GHz Intel Core i3-2130 PC with 2 GB RAM. For fair evaluation, all parameters are fixed for all video sequences. For a target of size $P \times R$ and a scale filter of size $S$, the standard deviations $\sigma_1$ and $\sigma_2$ of the desired outputs in (17) are fixed relative to $P \times R$ and $S$, respectively. The standard deviation $\sigma$ of the Gaussian kernel is 0.2. The learning rate $\gamma$ is 0.075. The regularization parameter $\lambda$ is 0.01. The size $S$ of the scale filter and the scale factor $a$ are likewise fixed across all sequences.

5.2. Performance Evaluation

To evaluate the overall performance of the proposed method, three evaluation metrics are used: centre location error (CLE), success rate (SR), and distance precision (DP). The CLE is defined as the Euclidean distance between the manually labeled ground-truth centre and the detected centre location of the target; we use the average CLE over all frames of a sequence to evaluate the overall performance on that sequence. SR is computed by (18). DP is defined as the relative number of frames in a sequence whose CLE is smaller than a fixed threshold, which is set to 20 pixels in our experiments:
$$\text{score} = \frac{\operatorname{area}(B_T \cap B_G)}{\operatorname{area}(B_T \cup B_G)}, \qquad \text{SR} = \frac{n_{0.5}}{N}, \tag{18}$$
where score is the overlap score, $B_T$ is the tracked bounding box, $B_G$ is the ground-truth bounding box, $\operatorname{area}(\cdot)$ represents the region area, $\cap$ and $\cup$, respectively, represent the intersection and union of two regions, $n_{0.5}$ is the number of successfully tracked frames whose overlap score is larger than 0.5, and $N$ is the total number of frames in the sequence.
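
As a concrete reference (our own sketch; the box format and array shapes are assumptions), the three metrics can be computed as follows:

```python
import numpy as np

def overlap_score(a, b):
    # Overlap score of (18); boxes are (x, y, w, h).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union

def evaluate(pred, gt):
    # pred, gt: (N, 4) arrays of per-frame boxes. Returns (CLE, SR, DP).
    centres = lambda B: B[:, :2] + B[:, 2:] / 2.0
    cle = np.linalg.norm(centres(pred) - centres(gt), axis=1)
    scores = np.array([overlap_score(p, g) for p, g in zip(pred, gt)])
    return cle.mean(), float((scores > 0.5).mean()), float((cle < 20).mean())
```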

5.3. Comparison with Original Update Scheme

To show the effect of the changed update scheme on tracking, we compute the average CLE, average SR, and average DP over the 12 sequences for the CSK, the CSK with the new update scheme, the CSK with scale prediction, and our tracker, which combines the new update scheme and scale prediction. Table 2 shows the comparison results, with the best results in bold. From the table, we can see that the new update scheme improves the tracker's performance over the original update scheme: the CSK with the new update scheme reduces the average CLE by 13.2 pixels and improves the average SR by 1.4% and the average DP by 7.7% compared to the CSK. Our tracker reduces the average CLE by 1.5 pixels and improves the average SR by 3.5% and the average DP by 2.6% compared to the CSK with scale prediction. Our tracker achieves the best performance in terms of average CLE, average SR, and average DP.

5.4. Comparison with CSK Tracker

From Tables 3–5, we can see that our tracker reduces the average CLE from 40.2 pixels to 8.1 pixels and improves the performance by 29.1% in average SR and 31.1% in average DP compared to the CSK. Our approach outperforms the CSK in terms of average CLE, average SR, and average DP.

For a clearer illustration, we analyze the Girl sequence as an example. Figures 2 and 3, respectively, show partial tracking results and the plots of the three evaluation metrics. Figure 2 shows the tracking results on the Girl sequence, which contains scale variation, pose variation, rotation, and partial occlusion. When the girl rotates at frame #110, the CSK tracker begins to drift. When the target size becomes smaller at frame #156, our tracked box shrinks accordingly and our tracker continues to track the girl accurately. The tracking error of the CSK tracker accumulates as the target appearance varies: CSK drifts badly at frame #436 and fails to track the girl at frame #472. In contrast, our tracker tracks the girl successfully throughout. Figure 3 also shows that our approach is better than CSK.

5.5. Comparison with State-of-the-Art Trackers

Since it is impractical to compare against all existing tracking algorithms, we compare the proposed algorithm with 5 state-of-the-art trackers: the MOSSE tracker [22], the Compressive Tracker (CT) [17], the Weighted Multiple Instance Learning Tracker (WMILT) [34], the KCF tracker with HOG features [28], and the CSK tracker [24]. For a fair comparison, we use the same parameters as the authors suggested in their papers and change only the target location and size used in the first frame.

5.5.1. Quantitative Analysis

We compute the median CLE, SR, and DP to evaluate the performance of the 6 tracking methods on the 12 challenging video sequences. The results are shown in Tables 3–5, with the best results in bold. The three tables show that our tracker achieves the best or second best performance on most sequences in terms of CLE, SR, and DP. Compared to the second best of the 6 trackers, our tracker reduces the average CLE by 6.8 pixels and improves the performance by 17.5% in average SR and by 5.4% in average DP. To describe the tracking results in detail, Figures 4–6 give the centre location error, overlap score, and distance precision plots over the 12 sequences for these trackers. From the figures, we can see that our tracker generally maintains a smaller centre location error, a higher overlap score, and a higher distance precision. This analysis implies that our approach produces more accurate and stable results than the other 5 trackers.

5.5.2. Qualitative Analysis

Scale, Illumination, and Pose Variation. Figures 7(a), 7(b), 7(c), and 7(d), respectively, illustrate the results on the Car4, Singer1, Trellis, and David sequences, which contain scale and illumination variations as well as pose changes. In the Car4 sequence, the vehicle undergoes drastic illumination and scale changes, especially when it passes beneath a bridge (see frame #230), as well as background clutter. Only our approach and KCF are robust to these factors and perform well on this sequence. The HOG features are robust to illumination changes, but background information accumulates in the tracked box of KCF because of the target scale variation, and KCF drifts badly at frame #641. In contrast, our tracker accurately detects the target position and scale throughout, since it predicts the object scale in time. CT and WMILT use discriminative classifiers learned with Haar-like features, MOSSE uses an adaptive correlation filter, and CSK brings kernelized correlation filters into tracking, but all of them perform poorly in this case. For the Singer1 sequence, all trackers except ours fail to handle the large scale, illumination, and pose variations occurring at the same time. Despite these challenges, our approach is able to track the target accurately. For the David indoor sequence shown in Figure 7(d), the person walks towards the moving camera, resulting in significant appearance variations due to illumination and scale changes. CT, KCF, and our approach successfully track the target in most frames of the David sequence. In the Trellis sequence, however, the target undergoes abrupt pose variation, and only KCF and our tracker perform well, with our tracker achieving a smaller CLE and a higher SR.

Scale, Pose Variation, Occlusion, and Rotation. Figures 7(e) and 7(f), respectively, show the results on the CarScale and Girl sequences, which contain scale variation and partial occlusion. In the CarScale sequence, the car moves from far to near and is occluded by trees. Both KCF and our tracker can complete the whole tracking task for this sequence, but the SR of our tracker is higher. In Figure 7(f), the girl also undergoes in-plane rotation and pose variation (see frames #141 and #180), which make tracking more difficult. Only our tracker is able to track the target successfully in most frames of this sequence.

Background Clutter, Illumination, Pose Variation, and Occlusion. The targets in the Skating1 and CarDark sequences undergo background clutter, illumination changes, and pose changes. For the Skating1 sequence in Figure 7(g), the target also undergoes partial occlusion (see frame #163). Only KCF and our tracker perform well during the tracking process, and our approach performs better in terms of CLE and SR. For the CarDark sequence in Figure 7(h), MOSSE, CSK, and our tracker provide promising results compared to the other trackers.

Scale, Pose Variation, Occlusion, and Abrupt Motion. Figure 7(i) shows the Dog1 sequence with scale and pose variation. MOSSE and KCF as well as our approach perform well on this sequence, and our tracker achieves the best performance in terms of SR. For the Tiger1 sequence shown in Figure 7(j), the object undergoes abrupt motion, pose variation, and partial occlusion; only WMILT and our tracker can adapt to these factors. Partial occlusion occurs at times in the Woman and Faceocc1 sequences (Figures 7(k) and 7(l)). The Woman sequence involves nonrigid deformation and heavy occlusion at the same time; all trackers except KCF and ours fail to track the object successfully. In the Faceocc1 sequence, however, only MOSSE, CSK, and our approach perform well.

5.6. Discussion

From the above qualitative and quantitative analyses, our tracker outperforms the other trackers in most cases. The reason is that our tracker not only predicts the target location but also estimates the target scale accurately at the same time. As for computational complexity, the most time-consuming part of our tracker is computing the HOG feature vectors of all candidate samples. Our tracker is implemented in MATLAB and runs at about 15 frames per second (FPS) on an Intel Core i3-2130 3.4 GHz CPU with 2 GB RAM. Although our tracker performs well in the above experiments, drifts are observed when the initial target is very small (e.g., the Freeman3 and Freeman4 sequences) and when the target moves erratically throughout (e.g., the Goat sequence shown in Figure 8(c)). Figures 8(a) and 8(b), respectively, show the tracking results of our tracker on the Freeman3 and Freeman4 sequences, where the initial targets are very small (12 × 13 pixels in Figure 8(a), 15 × 16 pixels in Figure 8(b)). Our tracker cannot estimate the increasing scale in these two sequences because the HOG features perform poorly at low resolutions. In Figure 8(c), the goat moves erratically throughout (see frames #5, #54, and #98), and our tracker drifts away because of the error accumulated by the online update during the continuous unstable motion.

6. Conclusion

Based on the kernelized correlation filter and the tracking-by-detection framework, we have developed a robust visual correlation tracking algorithm with improved tracking performance. Our tracker estimates target translation and scale variations effectively and efficiently by learning kernelized correlation filters. By accurately estimating the target scale during tracking, our tracker obtains more useful information from the target and reduces interference from the background. The translation is estimated by modeling the temporal context correlation, and the scale is estimated by searching the tracked target appearance pyramid. In addition, we have developed an update scheme that takes all previous frames into consideration when computing the current model. Experimental results on challenging sequences clearly show that our approach outperforms state-of-the-art tracking algorithms in terms of efficiency, accuracy, and robustness.

Conflict of Interests

The authors declare that they have no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors sincerely thank the reviewers for their valuable advice and comments. This work is supported by the National High Technology Research and Development Program of China (Grant no. 2014AA7031010B).

Supplementary Materials

The Supplementary Material contains the tracking results of the 6 trackers on the 12 challenging video sequences. The 6 trackers are our tracker, MOSSE, WMILT, CT, KCF, and CSK. The video sequences are the Car4, CarScale, Dog1, Girl, Trellis, Singer1, David, Woman, Tiger1, Skating1, CarDark, and Faceocc1 sequences. The targets in these sequences undergo appearance variations introduced by scale variations, pose variations, illumination changes, partial occlusion, fast motion, rotation, and background clutter. The plots of our tracker, MOSSE, WMILT, CT, KCF, and CSK are represented by red dash-dot, cyan dashed, blue dashed, yellow dashed, white dashed, and green solid boxes, respectively.