Abstract

Object tracking is a vital topic in computer vision. Although tracking algorithms have developed considerably in recent years, their robustness and accuracy still need to be improved. In this paper, to overcome the poor representation ability of a single feature in complex image sequences, we put forward a multifeature integration framework that combines gray features, Histogram of Oriented Gradients (HOG), color-naming (CN), and Illumination Invariant Features (IIF), which effectively improves the robustness of object tracking. In addition, we propose a model updating strategy and introduce skewness to measure the confidence of the tracking result. Unlike previous tracking algorithms, we compare the skewness values of two adjacent frames to decide how to update the target appearance model with a dynamic learning rate. This further improves the robustness of our tracker and effectively prevents the target drift caused by occlusion and deformation. Extensive experiments on a large-scale benchmark containing 50 image sequences show that our tracker outperforms most existing trackers and runs at an average speed of over 43 fps.

1. Introduction

It is difficult to accurately estimate the location of a target in a video because of factors such as occlusion, deformation, illumination variation, background clutter, and scale variation. Although object tracking has been successfully used in robotics, video surveillance, human-computer interaction, automation, etc., an effective and robust tracker is still needed.

Most existing tracking methods fall into two categories: generative methods and discriminative methods. Generative trackers first construct a target appearance model, then match it against candidate target regions, and take the candidate region most similar to the target as the tracking result. Generative algorithms include sparse representation [1–3], density estimation [4, 5], and incremental subspace learning [6]. In contrast, discriminative trackers use sample data to learn a binary classifier that discriminates the tracked target from its background. Discriminative trackers include multiple instance learning (MIL) [7], compressive tracking (CT) [8], tracking-learning-detection (TLD) [9], support vector machines (SVMs) [10–12], and online AdaBoost (OAB) [13, 14]. The MIL tracker represents the image patch with a set of generalized Haar-like features, each consisting of two to four rectangles, and trains a classifier with multiple instance learning to achieve superior results. The CT tracker, based on compressed sensing, improves tracking efficiency by reducing the Haar-like features with a random measurement matrix that satisfies the restricted isometry property (RIP); a simple naive Bayes classifier then classifies the features after dimensionality reduction. In recent years, discriminative trackers based on correlation filters have attracted much attention in visual tracking because of their outstanding computational efficiency. Bolme et al. [15] first introduced correlation filters into visual tracking and learned the filter by minimizing the output sum of squared error on grayscale images. Henriques et al. [16] proposed the circulant structure with kernels (CSK) method, which achieves an impressive speed on the tracking benchmark [17], but CSK only uses gray features, which are less effective in representing the target appearance. Tracking performance was later improved further by the kernelized correlation filters (KCF) [18] tracker. Discriminative scale space tracking (DSST) [19] enhances tracking accuracy by using multichannel HOG features instead of low-dimensional gray features. Danelljan et al. [20] exploit color attributes (CN) for tracking and extend the input color features from a single channel to multiple channels.

However, all of the above-mentioned trackers use only a single feature, which limits the power of the target representation when the object appearance undergoes challenges such as occlusion and illumination changes. As a result, ideal tracking results are often difficult to obtain. To overcome this limitation, the scale adaptive multiple features (SAMF) [21] tracker integrates HOG and CN features in a correlation filter to improve tracking accuracy. Lan et al. propose a discriminative feature learning method in [22, 23], which exploits the representation and discriminative abilities of multiple features by separating out contaminated features. The sum of template and pixel-wise learners (Staple) [24] tracker combines the response maps of a HOG template and a global color histogram, both of which are learned independently at the previously estimated translation, to enhance tracking performance. Convolutional neural networks (CNNs) have found broad application in pattern classification [25] and text processing [26] because of their powerful feature representation ability. Several CNN-based tracking approaches have been proposed, such as a deep compact image representation for tracking (DLT) [27], hierarchical convolutional features for tracking (HCF) [28], hedged deep tracking (HDT) [29], and spatial and semantic convolutional features for tracking (DSCF) [30]. They all extract rich features from CNNs to predict the target position precisely and have shown excellent performance. Although these algorithms based on feature fusion or CNN features are satisfactory in constrained environments, they do not address a vital problem of the model update mechanism: with a constant learning rate, the tracker is prone to drift when predictions are inaccurate. To address drift, the TLD tracker [9] combines tracking and learning with detection, a mechanism that performs well in the presence of occlusion and deformation. The long-term correlation tracking (LCT) [31] method copes with significant occlusion by using an online detector to redetect the target when wrong tracking results appear. Sun et al. [32] present the mixed classifier decision compressive tracking (MDCT) method, which locates the target and updates the models with different learning rates to improve tracking accuracy.

In this paper, we address two problems of the DSST tracker: it cannot describe the target well, and its model updating strategy uses a constant learning rate and therefore cannot update the filters adaptively. We propose a fast object tracker based on integrated multiple features and a dynamic learning rate. We integrate gray features, HOG, CN, and IIF [33] to improve the target description ability of the algorithm while preserving tracking performance under complex circumstances. To address the constant learning rate, we apply a skewness criterion. Skewness reflects the confidence of the tracking result via the fluctuation of the response map. By comparing the skewness values of two adjacent frames, our approach adaptively chooses a learning rate to update the model during tracking. To validate our approach, we perform extensive experiments on a popular benchmark dataset [17] with 50 image sequences and compare our approach with 12 existing algorithms using precision and success rate. Experimental results show that our tracker performs favorably against existing trackers in accuracy and robustness, while maintaining a high average speed that exceeds 40 frames per second.

The rest of the paper is organized as follows. We first introduce the DSST tracker in Section 2 and then describe our approach in Section 3. Section 4 presents the experimental results on the benchmark dataset. Conclusions are given in Section 5.

2. The DSST Tracker

The DSST tracker [19] has obtained impressive results on the tracking benchmark and contains several ideas relevant to our work. The algorithm learns separate correlation filters for translation and scale estimation. For translation estimation, the DSST tracker trains an optimal correlation filter on high-dimensional HOG features and then employs the filter to determine the target location in the next frame. Given the estimated translation, multiscale filters that use HOG features are applied to obtain an accurate target size. We briefly describe the main ideas of the DSST tracker in the following.

2.1. Translation Estimation

In the DSST tracker, we crop an image patch $f$ where the target is located and extract a $d$-dimensional feature map from the patch to train the translation filter. We let $f^l$ denote the $l$-th feature dimension of $f$, $l \in \{1, \ldots, d\}$. Each feature dimension $f^l$ has a corresponding filter $h^l$, and together these form the optimal correlation filter $h$, which is obtained by minimizing the cost function

$$\varepsilon = \left\| \sum_{l=1}^{d} h^l \star f^l - g \right\|^2 + \lambda \sum_{l=1}^{d} \left\| h^l \right\|^2, \tag{1}$$

where $g$ is a 2-dimensional Gaussian function whose peak lies at the target center of the image patch $f$, $\lambda$ denotes a regularization parameter, and $\star$ is circular correlation. The minimization problem in (1) can be solved by transforming (1) to the Fourier domain using Parseval's formula. The solution to (1) is given by

$$H^l = \frac{\overline{G} F^l}{\sum_{k=1}^{d} \overline{F^k} F^k + \lambda}, \tag{2}$$

where $F$ and $G$ denote the Discrete Fourier Transform (DFT) of $f$ and $g$, respectively, and the bar indicates complex conjugation.
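For intuition, the single-channel case ($d = 1$) of this minimization can be worked out one frequency at a time. The following is a sketch under the assumption that circular correlation $h \star f$ has DFT $\overline{H} F$; here $H_u$, $F_u$, and $G_u$ denote the values of the DFTs of the filter, the patch, and the desired output at frequency $u$, a notation introduced only for this illustration.

```latex
\begin{aligned}
\varepsilon &\propto \sum_{u} \left( \left| \overline{H_u} F_u - G_u \right|^2 + \lambda \left| H_u \right|^2 \right), \\
0 &= \overline{F_u} \left( \overline{H_u} F_u - G_u \right) + \lambda \overline{H_u}
\quad \Longrightarrow \quad
\overline{H_u} = \frac{G_u \overline{F_u}}{\overline{F_u} F_u + \lambda}, \\
H_u &= \frac{\overline{G_u} F_u}{\overline{F_u} F_u + \lambda}.
\end{aligned}
```

Each frequency contributes an independent quadratic term, so setting its derivative to zero yields a closed-form filter, which has the form of the solution to (1) with a single feature channel.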

In (2), the correlation filter is computed from a single training sample. In practice, an optimal filter is found by minimizing the output error over all training patches, but this requires solving a linear system of equations, which is computationally expensive. For efficiency, $A_{t-1}^l$ and $B_{t-1}$ are defined as the numerator and denominator of the filter $H_{t-1}^l$ in the (t-1)-th frame, respectively. In the t-th frame, the numerator and denominator of $H_t^l$ in (2) are updated separately as follows:

$$A_t^l = (1 - \eta) A_{t-1}^l + \eta \overline{G_t} F_t^l, \tag{3}$$

$$B_t = (1 - \eta) B_{t-1} + \eta \sum_{k=1}^{d} \overline{F_t^k} F_t^k, \tag{4}$$

where $\eta$ is the learning rate. Given an image patch $z$ cropped from a new frame, $d$-dimensional feature maps $z^l$ are extracted from $z$. The correlation scores $y$ are computed by

$$y = \mathcal{F}^{-1} \left\{ \frac{\sum_{l=1}^{d} \overline{A^l} Z^l}{B + \lambda} \right\}, \tag{5}$$

where $\mathcal{F}^{-1}$ denotes the inverse DFT operator and $Z^l$ denotes the DFT of $z^l$. The new target position is found at the maximal response value of $y$.
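The training, update, and detection steps of (2)-(5) can be sketched in a few lines of Python. The sketch below is a simplified one-dimensional, single-channel version that uses a naive DFT so that it stays self-contained; the function names, the Gaussian width, and the value of the regularization parameter are illustrative assumptions and are not the settings of the DSST tracker or of our MATLAB implementation.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * u * k / n) for k in range(n))
            for u in range(n)]


def idft(X):
    """Naive inverse DFT, returning complex values."""
    n = len(X)
    return [sum(X[u] * cmath.exp(2j * math.pi * u * k / n) for u in range(n)) / n
            for k in range(n)]


def gaussian_label(n, center, sigma=1.0):
    """Desired output g: a Gaussian peaked at `center` (circular distance)."""
    out = []
    for k in range(n):
        dist = min(abs(k - center), n - abs(k - center))
        out.append(math.exp(-0.5 * (dist / sigma) ** 2))
    return out


def train(f, g):
    """Numerator A = conj(G) * F and denominator B = conj(F) * F, as in Eq. (2)."""
    F, G = dft(f), dft(g)
    A = [Gu.conjugate() * Fu for Gu, Fu in zip(G, F)]
    B = [(Fu.conjugate() * Fu).real for Fu in F]
    return A, B


def update(A_prev, B_prev, f, g, eta):
    """Running-average update of numerator and denominator, Eqs. (3) and (4)."""
    A_new, B_new = train(f, g)
    A = [(1 - eta) * a0 + eta * a1 for a0, a1 in zip(A_prev, A_new)]
    B = [(1 - eta) * b0 + eta * b1 for b0, b1 in zip(B_prev, B_new)]
    return A, B


def detect(A, B, z, lam=0.01):
    """Correlation scores y = IDFT(conj(A) * Z / (B + lambda)), Eq. (5)."""
    Z = dft(z)
    Y = [a.conjugate() * Zu / (b + lam) for a, b, Zu in zip(A, B, Z)]
    return [v.real for v in idft(Y)]
```

In this toy setting, training on a patch with the label peaked at index 8 and then detecting on the same patch circularly shifted by three samples moves the response peak to index 11, which is how the displacement of the target is read off the response.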

2.2. Scale Estimation

In actual tracking, the scale of the target often changes. To handle changes in target size, the DSST tracker proposes a novel approach to predict the target scale. After determining the position of the target, we construct a scale pyramid by multiscale sampling around the target. Let $P \times R$ denote the target size in the t-th frame. For each $n \in \{\lfloor -\frac{S-1}{2} \rfloor, \ldots, \lfloor \frac{S-1}{2} \rfloor\}$, an image patch $I_n$ of size $a^n P \times a^n R$ centered at the estimated target location of the t-th frame is cropped. Here, $a$ denotes the scale factor and $S$ is the number of scales. The image patch set consists of all of these patches $I_n$. We extract d-dimensional HOG features from the image patch set to train the scale filter. As in translation estimation, (3) and (4) are used to update the scale filter, but the desired correlation output is a 1-dimensional Gaussian function. We obtain the response scores between the scale filter and the image patches by (5); the optimal target scale is the one with the maximum response score.
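The scale sampling described above can be illustrated with a short Python sketch. It assumes the exponent range used by DSST, from the floor of -(S-1)/2 to the floor of (S-1)/2, and uses 33 scales with a scale factor of 1.02 as example defaults; the function names are hypothetical and the defaults are assumptions for illustration rather than values stated in this paper.

```python
import math


def scale_exponents(num_scales):
    """Exponents n from floor(-(S-1)/2) to floor((S-1)/2), inclusive."""
    lo = math.floor(-(num_scales - 1) / 2)
    hi = math.floor((num_scales - 1) / 2)
    return list(range(lo, hi + 1))


def pyramid_sizes(width, height, num_scales=33, scale_factor=1.02):
    """Patch sizes (a^n * width, a^n * height) for every exponent n."""
    return [(scale_factor ** n * width, scale_factor ** n * height)
            for n in scale_exponents(num_scales)]


def best_scale(responses, num_scales=33, scale_factor=1.02):
    """Relative scale change a^n at the maximal scale-filter response."""
    exps = scale_exponents(num_scales)
    n = exps[max(range(len(responses)), key=responses.__getitem__)]
    return scale_factor ** n
```

The middle entry of the pyramid corresponds to the exponent zero, that is, to the unchanged target size, and the estimated size for the next frame is the current size multiplied by the value returned by `best_scale`.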

3. Our Approach

The DSST tracker uses only HOG features for tracking, which reflect only part of the characteristics of the target and can weaken the robustness of tracking. Moreover, the DSST tracker updates the filters with a fixed learning rate. Because the target appearance changes dynamically during tracking, the DSST tracker cannot ensure that the target model is updated with a reasonable learning rate. Therefore, in this paper, we improve the DSST algorithm through feature integration and a model updating strategy with a dynamic learning rate. The flowchart of our algorithm is shown in Figure 1. As in DSST, our tracking task is composed of two parts: translation estimation and scale estimation. However, our algorithm fuses gray features, HOG, CN, and IIF [33] for translation estimation. We transform the multichannel fused features extracted from the image patch of the current frame into the Fourier domain and then use (5) to find the maximum response, which gives the new target location. For scale estimation, our method follows the same procedure as DSST, which uses HOG features to train the scale filter. For the model update of the translation filter, we adopt a new model updating strategy with a dynamic learning rate, which helps our algorithm achieve a significant performance gain. For the model update of the scale filter, we follow the method in DSST.

We introduce the integration of multiple features for our tracking in Section 3.1. The novel model updating strategy is investigated in Section 3.2.

3.1. The Integration of Multiple Features Based on DSST

Integrated features provide a richer representation of the target than a single feature. In this paper, we integrate gray features, HOG, CN, and IIF within the DSST tracker.

HOG features are commonly used in computer vision and capture the edge and gradient information of the target well. HOG divides the image patch into small connected regions called cells, and a histogram of gradient directions or edge orientations is collected over the pixels of each cell. Compared with other features, HOG has the advantage of being largely invariant to geometric and illumination changes. CN features are based on the color names of the English language and assign one of 11 color labels, which represent the color names used in the real world. An RGB color image is mapped to the color-naming space by the mapping method in [34]. We incorporate CN into our integration scheme because it is robust to scale variation and rotation. IIF are obtained by transforming the image into the CIE Lab color space and then performing a nonparametric local rank transformation [35] on the image brightness channel. IIF enhance the ability of the tracker to cope with intense illumination changes. The gray features contain only the brightness channel and are the simplest features. Figure 2 visualizes the gray features, HOG, CN, and IIF on the Bolt sequence.

The four types of features are complementary to each other. In our work, we extract one-channel gray features, 31-channel HOG features, ten-channel CN features, and one-channel IIF from the image patch. In total, 43 feature channels represent the target appearance. Note that the four feature maps differ in size, so they must be normalized to a fixed size. Afterwards, we concatenate the normalized features, which significantly enhances the performance of our proposed tracker.
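The size normalization and channel concatenation can be sketched as follows. This is a minimal illustration that represents each feature channel as a list of rows and uses nearest-neighbour resampling; the resampling method and the function names are assumptions made for the example, since the text only requires that the feature maps be brought to a common size before concatenation.

```python
def resize_nearest(channel, out_h, out_w):
    """Nearest-neighbour resize of one 2-D channel given as a list of rows."""
    in_h, in_w = len(channel), len(channel[0])
    return [[channel[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def integrate_features(feature_sets, out_h, out_w):
    """Resize every channel of every feature set and stack them in order.

    `feature_sets` is a list such as [gray, hog, cn, iif], where each entry
    is a list of 2-D channels that may have its own spatial size.
    """
    stacked = []
    for channels in feature_sets:
        for channel in channels:
            stacked.append(resize_nearest(channel, out_h, out_w))
    return stacked
```

With one gray channel, 31 HOG channels, ten CN channels, and one IIF channel as input, the stacked result has 43 channels of identical size, which can then be passed to the correlation filter as a single multichannel feature map.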

3.2. Model Updating Strategy

Conventional algorithms use a constant learning rate to update the model. However, the target appearance changes constantly during tracking because of deformation, occlusion, scale variation, and other factors, and a constant learning rate cannot cope with these disturbances effectively. The learning rate controls how strongly the target template is updated. A higher learning rate prevents the model from being updated insufficiently when the target appearance changes, but it increases the probability of adding negative samples. In contrast, a lower learning rate avoids learning too much background information, but the model then adapts poorly to deformation of the target appearance. Therefore, it is important to design a reasonable mechanism for updating the learning rate dynamically.

In object tracking, the position of the maximum value in the response map is regarded as the target location, and the nontarget responses are generally much smaller than the maximum response. In practice, however, the target is often disturbed by factors such as complex background, occlusion, and illumination variation, which may bring some nontarget responses close to the target response. As a result, the tracked target may be difficult to distinguish. The degree of fluctuation of the response map reflects the quality of the tracking result to a certain extent. To measure this fluctuation, we introduce a new criterion called skewness. Skewness measures the direction and degree of deviation of a data distribution and reflects how asymmetric the distribution is. The relationship between the skewness values of two sequential frames can be used to decide whether the learning rate needs to be updated. The skewness value of the response map in the t-th frame is defined as

$$s_t = \frac{\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( y_t(i,j) - \mu_t \right)^3}{\left( \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( y_t(i,j) - \mu_t \right)^2 \right)^{3/2}}, \tag{6}$$

where $y_t$ is the response map obtained by (5) in the t-th frame and $\mu_t$ is its mean. $M$ denotes the width of the response map and $N$ denotes its height. The larger the skewness, the more the target response exceeds the nontarget responses, and thus the more reliable the tracking result in the current frame. Conversely, a smaller skewness indicates that the difference between the target and nontarget responses is not significant and that the tracking result in the current frame is disturbed. In these cases, we should choose an appropriate learning rate to update the target appearance.
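The behaviour of the skewness criterion can be checked on small synthetic response maps. The sketch below assumes the standard moment-based definition of skewness (third central moment divided by the 3/2 power of the second central moment) computed over all entries of the response map; the function name and the toy maps are illustrative only.

```python
def skewness(response):
    """Moment-based skewness of a 2-D response map given as a list of rows."""
    values = [v for row in response for v in row]
    count = len(values)
    mean = sum(values) / count
    second = sum((v - mean) ** 2 for v in values) / count
    third = sum((v - mean) ** 3 for v in values) / count
    if second == 0:
        return 0.0  # a constant map has no asymmetry
    return third / second ** 1.5


# A clean response: one dominant peak on a flat background.
clean = [[0.0] * 5 for _ in range(5)]
clean[2][2] = 1.0

# A disturbed response: a second, nearly equal peak (for example a distractor).
disturbed = [row[:] for row in clean]
disturbed[0][4] = 0.9
```

The clean map yields a larger skewness than the disturbed one, which matches the interpretation above: a second strong response makes the distribution of response values less one-sided and lowers the confidence of the tracking result.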

Figure 3(b) shows the skewness of each frame on the Woman sequence. As can be seen from Figure 3(a), the target is partly occluded by the car in the 145th frame and disturbed by a lamppost in the 340th frame. Consequently, the skewness drops to low points in Figure 3(b). When the occlusion disappears in the 434th frame, the skewness recovers to a higher value.

In our approach, the learning rate is chosen among three cases according to (7), with the two new learning rates defined by (8) and (9). The original learning rate is set to 0.025 in this paper. The first new learning rate gradually decreases as its frame count increases, and the second gradually increases as its frame count increases. More details are given in Remarks 1 and 2. The frame count in (8) is the number of frames in which the skewness of the (t-1)-th frame minus the skewness of the t-th frame exceeds the threshold. Similarly, the frame count in (9) is the number of frames in which the present skewness minus the previous skewness is greater than the threshold. The threshold is applied to the skewness difference of two adjacent frames, and whether a new learning rate is applied to the t-th frame depends on this difference. The updating strategy for the learning rate is summarized in Table 1. If condition 1 is satisfied, the tracking result in the t-th frame is less reliable than that in the previous frame, and we reduce the learning rate in response to the quick changes in the appearance of the target. If condition 2 is satisfied, the tracking result in the t-th frame is reliable, and we increase the learning rate to adapt to the rapid changes in target appearance. When condition 3 is satisfied, the appearance of the target changes slowly, and we apply the original learning rate to the t-th frame.

Remark 1. With the iteration of (8), the new learning rate becomes smaller, and it can be used as a learning rate because its value lies in the interval (0,1).

Proof. Given the original learning rate, let the frame count be the number of frames that meet condition 1 in Table 1. The new learning rate is updated according to (8), which yields the two inequalities (10) and (11). Together, (10) and (11) bound the new learning rate, and (10) shows that it becomes smaller.

Remark 2. With the iteration of (9), the new learning rate becomes larger, and it can be used as a learning rate because its value lies in the interval (0,1).

Proof. Given the original learning rate, let the frame count be the number of frames that meet condition 2 in Table 1. The new learning rate is updated according to (9), which yields the two inequalities (12) and (13). Together, (12) and (13) bound the new learning rate, and (13) shows that it becomes larger.
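The three-way decision between the conditions can be sketched in Python as follows. The sketch only encodes the comparison of the skewness difference with the threshold; the rules that shrink or grow the learning rate are passed in by the caller, and the two rules in the usage part are hypothetical placeholders, not the update formulas (8) and (9). Keeping cumulative counters across frames is also an assumption of this example.

```python
def choose_learning_rate(prev_skew, cur_skew, eta, threshold,
                         decrease, increase, counters):
    """Pick the learning rate for the current frame.

    Condition 1: skewness dropped by more than `threshold` -> decreased rate.
    Condition 2: skewness rose by more than `threshold`    -> increased rate.
    Condition 3: otherwise                                 -> original rate.
    `counters` holds the number of frames that met conditions 1 and 2 so far.
    """
    drop = prev_skew - cur_skew
    if drop > threshold:
        counters["condition1"] += 1
        return decrease(eta, counters["condition1"])
    if -drop > threshold:
        counters["condition2"] += 1
        return increase(eta, counters["condition2"])
    return eta


# Hypothetical placeholder rules, for illustration only.
def shrink(eta, n):
    return eta / (1 + n)


def grow(eta, n):
    return eta * (2 - 1 / (1 + n))


counters = {"condition1": 0, "condition2": 0}
rate_after_drop = choose_learning_rate(5.0, 2.0, 0.025, 2.3, shrink, grow, counters)
rate_after_rise = choose_learning_rate(2.0, 5.0, 0.025, 2.3, shrink, grow, counters)
rate_steady = choose_learning_rate(3.0, 3.5, 0.025, 2.3, shrink, grow, counters)
```

Passing the decrease and increase rules as arguments keeps the case selection separate from the exact form of the new learning rates, so the same skeleton can be used with any rules that stay within the interval (0,1).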

Algorithm 1 presents an overall procedure of our proposed approach.

Input: Initial target position and scale
Output: Estimated target position and scale
For t=2: n
Translation estimation
 1: Crop out the translation sample from the input image at the previous target
 position and extract the four types of features.
 2: Compute the translation correlation filters using Eq. (2).
 3: Estimate the new position through the maximum response of Eq. (5).
Scale estimation
 4: Construct the scale pyramid centered at the estimated position.
 5: Compute the scale correlation filter using Eq. (2).
 6: Estimate the optimal scale through the maximum response of Eq. (5).
Model update
 7: Calculate the skewness with Eq. (6), and update the learning rate according to the relation of
 skewness between two adjacent frames.
 8: Update the translation filter using new learning rate in Eqs. (3) and (4).
 9: Update the scale filter using the original learning rate in Eqs. (3) and (4).
END

4. Experimental Results

In this section, we first discuss the implementation details of the experiments. Second, we run several trackers with different feature settings based on DSST to validate the effectiveness of our integrated features. Third, we investigate the most suitable threshold for the learning rate update and validate the effectiveness of skewness. Finally, we evaluate the proposed algorithm on a benchmark dataset containing 50 sequences in comparison with 12 state-of-the-art algorithms.

4.1. Implementation Details

We perform the experiments in MATLAB R2015b on an Intel(R) Core(TM) i7-6700K 4.00 GHz CPU. The regularization parameter and the number of scales are set the same as in DSST. The original learning rate is set to 0.025. The threshold for the learning rate update is set to 2.3. To reduce the boundary effect, the extracted features are multiplied by a Hann window. We use the OTB-2013 dataset, which contains 50 sequences, and adopt one-pass evaluation (OPE), reporting two measures: the precision plot and the success plot. The precision plot shows the ratio of frames in which the distance between the predicted target location and the ground truth does not exceed a given threshold. We rank tracking algorithms in the precision plot by center location error with a threshold of 20 pixels. The success plot shows the ratio of frames in which the overlap rate between the predicted and ground-truth bounding boxes exceeds a given overlap threshold. For the success plot, we rank tracking algorithms by the area under the curve (AUC). In addition, we use average speed to measure the efficiency of the algorithms.
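The two measures can be made concrete with a short Python sketch. It assumes bounding boxes in (x, y, width, height) format, intersection over union as the overlap rate, and an AUC approximated by averaging the success rate over 21 evenly spaced overlap thresholds between 0 and 1; these conventions and the function names are assumptions for illustration, and the benchmark toolkit remains the reference implementation.

```python
import math


def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return math.hypot(ax - bx, ay - by)


def overlap(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def precision(predicted, truth, threshold=20.0):
    """Ratio of frames whose center location error is within `threshold` pixels."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(predicted, truth))
    return hits / len(truth)


def success_auc(predicted, truth, steps=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, truth)]
    rates = []
    for i in range(steps):
        t = i / (steps - 1)
        rates.append(sum(o > t for o in overlaps) / len(overlaps))
    return sum(rates) / len(rates)
```

A prediction that is offset from the ground truth by 30 pixels counts as a miss in the precision score at the 20-pixel threshold, while it may still contribute to the success plot at low overlap thresholds if the boxes intersect.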

4.2. The Multiple Feature Comparison

We implement several variants of our tracker to verify the validity of our approach. Figure 4 presents the tracking results with different features. As can be seen from Figure 4, our algorithm using gray features, HOG, CN, and IIF achieves the best precision and performs as well as the variant with three types of features (HOG, gray features, and CN) in success rate. The variant with three types of features (HOG, gray features, and CN) outperforms the one with two types of features (HOG and gray features). The tracker with only HOG features performs worst among the compared trackers. These results indicate that the integration of multiple features is effective and robust.

4.3. The Detailed Analysis of Skewness
4.3.1. The Threshold Analysis

The threshold value has a significant influence on the tracking result, so it is important to investigate the optimal threshold for the learning rate update. The skewness differences between two sequential frames on the Basketball and Woman sequences are shown in Figure 5. By analyzing these differences, we approximately estimate the range of the threshold and define a set of candidate thresholds spaced 0.1 apart. We found that tracking performance gradually improves as the threshold increases in steps of 0.1. The experimental result is best when the threshold reaches 2.3 or 2.4, and the performance of the tracker begins to decrease when the threshold increases further. As shown in Figure 6, the tracker has the best robustness and accuracy when the threshold is 2.3 or 2.4, and we employ 2.3 as the threshold value for our algorithm.

4.3.2. The Effectiveness Analysis of Skewness

To validate the effectiveness of the proposed skewness criterion, we incorporate our contributions progressively. We first add the integrated features proposed in this paper to the baseline DSST and refer to this variant as multifeatures in Figure 7. We then incorporate skewness into multifeatures, which gives our full tracker and obtains the best results in precision and success rate. Figure 7 presents these results. When the model update threshold is 2.3, the precision and success rate of our tracker reach 79.9% and 58.4%, respectively. Compared with multifeatures, the precision and success rate increase by 2 and 0.8 percentage points, respectively. This illustrates that our model updating strategy, which uses skewness to decide the update of the learning rate, is effective. Compared with the baseline DSST, our tracker gains 5.9 percentage points in precision and 3 percentage points in success rate.

4.4. Comparisons with State-of-the-Art Trackers

We compare the proposed tracker with 12 existing state-of-the-art trackers: MIL [7], TGPR [36], Struck [10], CSK [16], KCF [18], DSST [19], ASLA [37], SCM [38], TLD [9], CT [8], DLT [27], and HDT [29]. Of these, DLT and HDT are based on deep learning. Quantitative, attribute-based, efficiency, and qualitative evaluations are presented in this section.

4.4.1. Quantitative Evaluation

Figure 8 shows the one-pass evaluation (OPE) results on 50 sequences. Our tracker performs well in both precision and success rate and ranks just behind HDT. HDT incorporates deep features into the correlation filter framework, which enhances its target representation ability. The performance of our approach falls behind that of HDT, but our tracker is far faster. DLT is also based on deep learning, but its performance in the precision and success plots is inferior to that of our tracker. The baseline DSST ranks fifth in precision and third in success rate. Overall, our tracker is better than most existing trackers.

4.4.2. Attribute-Based Evaluation

We compare our tracker with existing trackers on the 50 sequences, which are annotated with 11 challenging attributes. Figures 9 and 10 illustrate the attribute-based evaluations of all trackers in precision and success plots. Our algorithm outperforms most existing trackers across the challenging attributes, which shows that our tracker is effective. In the success plot, our tracker is superior to HDT in the presence of illumination variation. This can be attributed to the fact that our fusion framework contains IIF, which are robust to severe illumination changes. Our algorithm outperforms the other compared algorithms in occlusion, motion blur, and fast motion because our proposed model updating strategy plays an important role in these challenging sequences.

4.4.3. Efficiency Evaluation

We evaluate the operating efficiency of our proposed tracker in comparison with 12 existing trackers on the 50 benchmark sequences. Table 2 shows the average speeds of the 13 trackers. Among these trackers, CSK achieves the best result with an average speed of 269.45 fps, and the KCF tracker has the second highest speed. Our proposed algorithm performs well with an average speed of 43.798 frames per second. The DSST tracker runs at an average speed of 25.919 fps as reported in [19], while it obtains an average speed of 63.683 fps on our experimental platform. HDT is the slowest tracker because extracting features from deep neural networks is time-consuming. Our integrated features and model updating strategy have a slight effect on tracking speed. However, our tracker is still faster than most of the compared trackers.

4.4.4. Qualitative Evaluation

Figure 11 presents a qualitative comparison of the proposed tracker with five existing trackers (Struck [10], TGPR [36], KCF [18], DSST [19], and CSK [16]) on five challenging sequences. The results of the different algorithms are drawn as solid rectangles in different colors. Five frames containing common tracking problems are selected from each video sequence. The Sylvester sequence contains illumination variation and out-of-plane rotation. The first row of Figure 11 shows partial tracking results on the Sylvester sequence. The target undergoes rotation in the 935th frame, and our algorithm adapts to the rotation better than DSST, which we mainly attribute to the CN features. Although TGPR and Struck keep pace with the target in the following frames, their tracking results are inferior to those of our tracker. The CarScale sequence, shown in the second row, presents scale variation. DSST and our tracker track better on this sequence, but they still cannot completely cover the target. The drift occurs because the target moves quickly and is heavily occluded by the trees. The other four trackers suffer from heavy scale drift because they have no adaptive scale estimation. The Bolt sequence comprises occlusion and deformation. Our tracker and DSST work well, while the TGPR and Struck trackers lose the target because of the deformation and track the wrong target thereafter. The Soccer sequence comprises occlusion, motion blur, fast motion, and illumination variation. Although both our tracker and the DSST tracker follow the target, the DSST tracker is less accurate than ours. The reason is that our model updating strategy resists interference in the presence of motion blur. The TGPR tracker begins to drift in the 38th frame, fails to track the target in the 127th frame, and is unstable during the tracking process. The Shaking sequence exhibits illumination variation and a complex background. Our tracker and DSST keep up with the target, but the KCF and Struck trackers perform poorly in all frames of the Shaking sequence. From the above analysis, our tracker is in general superior to the five compared state-of-the-art trackers.

5. Conclusion

In this paper, we put forward a simple and fast object tracker based on DSST. Our method extracts powerful features, including gray features, HOG, CN, and IIF, to learn correlation filters for estimating the target position, and the scale is estimated by constructing a feature pyramid. To prevent model drift, we further introduce skewness to measure the confidence of the tracking results and update the learning rate by comparing the skewness values of two adjacent frames. Our tracker achieves excellent performance through the cooperation between the integrated features and the dynamic learning rate. Comparative experiments on a popular tracking benchmark dataset demonstrate the advantages of our proposed tracking algorithm over 12 existing state-of-the-art algorithms.

Data Availability

We are grateful to the Computer Vision Lab, Hanyang University, Seoul, Korea, which provides the dataset publicly (Visual Tracker Benchmark, http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html).

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The research work was funded by the National Natural Science Foundation of China grants nos. 61402053, 61772454, and 61811530332, the Scientific Research Fund of Hunan Provincial Education Department Grant no. 16A008, the Industry-University Cooperation and Collaborative Education Project of Department of Higher Education of Ministry of Education Grant no. 201702137008, the Postgraduate Scientific Research Innovation Fund of Hunan Province Grant no. CX2018B565, the Undergraduate Inquiry Learning and Innovative Experimental Fund of CSUST Grant no. 2018-6-119, and the Postgraduate Training Innovation Base Construction Project of Hunan Province Grant no. 2017-451-30.