Mathematical Problems in Engineering

Volume 2017 (2017), Article ID 3276103, 10 pages

https://doi.org/10.1155/2017/3276103

## Robust Visual Tracking Using the Bidirectional Scale Estimation

^{1}Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 201203, China

^{2}Key Laboratory of Intelligent Information Processing in Universities of Shandong, Shandong Institute of Business and Technology, Yantai 264005, China

Correspondence should be addressed to An Zhiyong; azytyut@163.com

Received 21 August 2016; Accepted 18 December 2016; Published 19 January 2017

Academic Editor: Francisco Valero

Copyright © 2017 An Zhiyong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Object tracking with robust scale estimation is a challenging task in computer vision. This paper presents a novel tracking algorithm that learns translation and scale filters with a complementary scheme. The translation filter is constructed using ridge regression and multidimensional features. A robust scale filter is constructed by bidirectional scale estimation, which combines a forward scale and a backward scale. First, we learn the scale filter using the forward tracking information, and the forward and backward scales are then estimated using their respective scale filters. Second, a conservative strategy is adopted to reconcile the forward and backward scales. Finally, the scale filter is updated based on the final scale estimate. Updating the scale filter in this way is effective because a stable scale estimate improves the performance of the scale filter. To demonstrate the effectiveness of our tracker, experiments are performed on 32 sequences with significant scale variation and on a benchmark dataset with 50 challenging videos. Our results show that the proposed tracker outperforms several state-of-the-art trackers in terms of robustness and accuracy.

#### 1. Introduction

Visual tracking has attracted significant attention in computer vision, with applications such as activity analysis, video surveillance, and automatic control systems. Despite significant progress in recent years, it remains difficult because of confounding factors in complicated scenes, such as scale variation, partial occlusion, background clutter, deformation, and fast motion.

In recent years, many object tracking algorithms have been proposed. Among them, the tracking-by-detection algorithms have achieved excellent performance by learning a discriminative classifier. Bolme et al. [1] presented a minimum output sum of squared error (MOSSE) filter that operates on gray images and is initialized from a single frame. The MOSSE tracker is robust to variations in lighting, pose, and nonrigid deformation while running at several hundred frames per second. It is efficient because it relies on the fast Fourier transform (FFT). Recently, many new tracking algorithms [2–5] based on the MOSSE filter have been proposed.

Unlike the tracking-by-detection algorithms, the TLD tracking framework [6] decomposes the tracking task into three subtasks: tracking, learning, and detection. The TLD tracker uses a P-N learning algorithm to improve classification performance in long-term tracking. In recent years, sparse representation has also been successfully applied to visual tracking. Zhang et al. [7] proposed a multitask tracking framework based on particle filters and multitask sparse representation. Wang et al. [8] proposed the WLCS tracker, based on particle filters and a weighted local cosine similarity that measures the similarity between the target template and the candidates. Guo et al. [9] proposed the max confidence boosting tracker, which allows ambiguity in the tracking and effectively alleviates the drift problem. Zhang and Song [10] proposed the weighted multiple instance learning tracker, which incorporates sample importance into the tracking framework. Wang et al. [11] proposed structure constrained grouping based on a Bayesian inference framework, which casts visual tracking as a foreground superpixel grouping problem. However, those algorithms cannot handle scale changes well in object tracking.

In this paper, we decompose the tracking task into two subtasks, a translation filter and a scale filter, combined in a complementary scheme. The translation filter relies on a temporal context regression model. For the scale filter, our key idea is to use both the forward and the backward tracking information. The information from the last frame to the current frame is defined as the forward tracking information, while the information from the current frame to the last frame is regarded as the backward tracking information. The target scale is therefore obtained by bidirectional scale estimation, which includes the forward scale and the backward scale. The proposed approach achieves state-of-the-art performance on both the scale variation dataset and the benchmark dataset.

#### 2. Related Work

Our tracking algorithm is based on the correlation filter model. The classic correlation filter algorithm is the MOSSE tracker, which takes randomly affine-transformed versions of the ground truth as its training set when initializing the correlation filter. However, in the MOSSE tracker the correlation filter is only used to detect the position of the target.

Based on the correlation filter, Henriques et al. [2] proposed the CSK tracker, which uses the theory of circulant matrices to learn a kernelized least-squares classifier. The CSK tracker runs at hundreds of frames per second. Building on CSK, Henriques et al. [3] proposed the kernelized correlation filter (KCF), which uses multichannel HOG features. However, the CSK and KCF trackers achieve low accuracy on sequences with scale variation because they are limited to target translation. Li and Zhu [12] proposed the SAMF tracker, which extends KCF to handle scale changes by sampling with several predefined scale perturbations. Because it samples only predefined scale variations, the SAMF tracker is not flexible enough to deal with fast and abrupt scale changes.

Zhang et al. [4] proposed a kernelized correlation filter to predict the target scale variation, but it often produces inaccurate scale estimates in complicated environments. Danelljan et al. [5] proposed the discriminative scale space tracker (DSST), which learns a translation filter and a scale filter for the tracking target. The scale filter in DSST only uses the forward tracking information to search for a reasonable scale and neglects the backward information, which often makes the scale estimate unstable in complex scenes.

#### 3. Our Approach

Here we describe our approach. Section 3.1 presents the translation tracking filter. Section 3.2 describes the scale tracking filter, which uses bidirectional scale estimation.

##### 3.1. Translation Tracking Filter

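The MOSSE tracker uses only image intensity features to design the translation filter. Since the HOG feature is a strong local descriptor that has been widely used in visual tracking, we use both HOG and intensity features to build the translation filter. The filter is built using ridge regression, as in DSST and MOSSE. Given features $f$ with channels $f^{l}$, $l = 1, \ldots, m$, the translation tracking filter $h$ is obtained by minimizing the sum of squared errors with a regularization term:

$$\varepsilon = \Big\| \sum_{l=1}^{m} h^{l} \star f^{l} - g \Big\|^{2} + \lambda \sum_{l=1}^{m} \big\| h^{l} \big\|^{2}, \tag{1}$$

where $\star$ denotes circular correlation, $\lambda$ is the regularization parameter, and $g$ is the desired correlation output. Here $g$ is designed as a 2D Gaussian-shaped peak centered on the target in the training image. In this paper, the upper case variables $F$, $G$ and the filter $H$ denote the Fourier transforms of their lower case counterparts. The solution to (1) in the Fourier domain is [5]

$$H^{l} = \frac{\overline{G} \odot F^{l}}{\sum_{k=1}^{m} \overline{F^{k}} \odot F^{k} + \lambda}, \tag{2}$$

where $\odot$ is the element-wise product (the division is also element-wise) and the bar represents complex conjugation. To obtain a robust approximation, the translation filter is updated through its numerator $A_{t}^{l}$ and denominator $B_{t}$:

$$A_{t}^{l} = (1 - \eta)\, A_{t-1}^{l} + \eta\, \overline{G} \odot F_{t}^{l}, \tag{3}$$

$$B_{t} = (1 - \eta)\, B_{t-1} + \eta \sum_{k=1}^{m} \overline{F_{t}^{k}} \odot F_{t}^{k}, \tag{4}$$

where $t$ is the index of the frame and $\eta$ is the learning rate. Given the features $z$ of an image patch in the new frame, with Fourier transform $Z$, the confidence score is calculated as

$$y = \mathcal{F}^{-1} \left\{ \frac{\sum_{l=1}^{m} \overline{A_{t-1}^{l}} \odot Z^{l}}{B_{t-1} + \lambda} \right\}. \tag{5}$$

The new target location is then estimated by searching for the location of the maximum confidence score.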
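The numerator/denominator form of (2) to (5) can be illustrated with a minimal one-dimensional sketch. It is not the authors' Matlab implementation: the signal is 1-D rather than a 2-D image patch, the two toy channels stand in for HOG and intensity features, the naive `dft`/`idft` helpers replace an FFT library, and the value `lam=0.01` is an assumption (only the learning rate 0.025 comes from the text). All function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for a toy signal)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * u * k / n) for k in range(n))
            for u in range(n)]


def idft(X):
    """Inverse of dft(); returns complex samples."""
    n = len(X)
    return [sum(X[u] * cmath.exp(2j * math.pi * u * k / n) for u in range(n)) / n
            for k in range(n)]


def gaussian_peak(n, center, sigma=1.0):
    """Desired output g: a Gaussian centered on the target, wrapped circularly."""
    peak = []
    for k in range(n):
        dist = min(abs(k - center), n - abs(k - center))
        peak.append(math.exp(-0.5 * (dist / sigma) ** 2))
    return peak


def train(channels, g):
    """Numerator A^l = conj(G) * F^l per channel and shared denominator B."""
    G = dft(g)
    F = [dft(f) for f in channels]
    A = [[Gu.conjugate() * Fu for Gu, Fu in zip(G, Fl)] for Fl in F]
    B = [sum(abs(Fl[u]) ** 2 for Fl in F) for u in range(len(g))]
    return A, B


def update(A_prev, B_prev, A_new, B_new, eta=0.025):
    """Running average of numerator and denominator with learning rate eta."""
    A = [[(1 - eta) * p + eta * q for p, q in zip(row_p, row_q)]
         for row_p, row_q in zip(A_prev, A_new)]
    B = [(1 - eta) * p + eta * q for p, q in zip(B_prev, B_new)]
    return A, B


def detect(A, B, channels, lam=0.01):
    """Return (index of the maximum confidence score, score vector)."""
    Z = [dft(z) for z in channels]
    Y = [sum(Al[u].conjugate() * Zl[u] for Al, Zl in zip(A, Z)) / (B[u] + lam)
         for u in range(len(B))]
    y = [v.real for v in idft(Y)]
    return max(range(len(y)), key=y.__getitem__), y


# Toy usage: a target at index 5 moves 3 samples to the right.
n = 16
spike = [1.0 if k == 5 else 0.0 for k in range(n)]                # channel 1
bump = [math.exp(-0.5 * ((k - 5) / 2.0) ** 2) for k in range(n)]  # channel 2
A, B = train([spike, bump], gaussian_peak(n, center=5))
moved = [ch[-3:] + ch[:-3] for ch in (spike, bump)]               # circular shift
position, scores = detect(A, B, moved)                            # position == 8
```

Keeping the numerator and denominator separate, rather than storing the filter itself, is what lets the running average in `update` blend frames before the division by the regularized denominator.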

##### 3.2. Scale Tracking Filter

In this section, a 1-dimensional correlation filter is constructed to estimate the target scale. The purpose of the scale tracking filter is to locate the appropriate scale for the target in the current frame. Let $P \times R$ be the target size and $N$ the number of scales in the set $S = \{ a^{n} \mid n = \lfloor -\tfrac{N-1}{2} \rfloor, \ldots, \lfloor \tfrac{N-1}{2} \rfloor \}$, where $a$ denotes the scale factor. For each $s \in S$, we extract an image patch of size $sP \times sR$ centered around the estimated location. HOG features are then computed on these patches to construct the scale pyramid.
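As a small illustration of the scale set described above, the sketch below enumerates the scale factors and the corresponding patch sizes. The step value 1.02 is an assumption borrowed from common DSST-style settings; the paper itself uses a self-adaptive scale factor. The function names are hypothetical.

```python
def scale_factors(num_scales=33, step=1.02):
    """Return step**n for n = -(N-1)/2, ..., (N-1)/2 (N is assumed odd)."""
    half = (num_scales - 1) // 2
    return [step ** n for n in range(-half, half + 1)]


def patch_sizes(width, height, factors):
    """Patch size (in whole pixels) sampled at each scale level."""
    return [(max(1, round(width * s)), max(1, round(height * s)))
            for s in factors]


factors = scale_factors()                  # 33 levels, middle level is 1.0
sizes = patch_sizes(100, 50, factors)      # sizes[16] is the unscaled target
```

In a DSST-style tracker each of these patches would be resized to a common model size before feature extraction, so that every level yields a descriptor of the same length.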

Here the training features at each scale level are set to an $m$-dimensional feature descriptor. Given the training samples $f_{s}$, the scale filter $h_{s}$ is obtained by minimizing the sum of squared errors with a regularization term:

$$\varepsilon_{s} = \Big\| \sum_{l=1}^{m} h_{s}^{l} \star f_{s}^{l} - g_{s} \Big\|^{2} + \lambda \sum_{l=1}^{m} \big\| h_{s}^{l} \big\|^{2}, \tag{6}$$

where $g_{s}$ is a 1-dimensional Gaussian-shaped peak centered around the current scale. The solution to (6) in the Fourier domain is

$$H_{s}^{l} = \frac{\overline{G_{s}} \odot F_{s}^{l}}{\sum_{k=1}^{m} \overline{F_{s}^{k}} \odot F_{s}^{k} + \lambda}, \tag{7}$$

where the upper case variables $F_{s}$, $G_{s}$ and the filter $H_{s}$ denote the Fourier transforms of their lower case counterparts. Assume that the scale filter in the last frame is expressed by the numerator $C_{t-1}^{l}$ and the denominator $D_{t-1}$. Given the features $z_{s,t}$ of an image patch in the current frame, with Fourier transform $Z_{s,t}$, the maximum scale confidence score is calculated as

$$y_{f} = \max \mathcal{F}^{-1} \left\{ \frac{\sum_{l=1}^{m} \overline{C_{t-1}^{l}} \odot Z_{s,t}^{l}}{D_{t-1} + \lambda} \right\}, \tag{8}$$

where $t$ is the index of the frame. The new target scale is estimated by searching for the scale with the maximum confidence score. Let $s_{f} = c\, s_{t-1}$ denote the scale corresponding to $y_{f}$. Here $s_{t-1}$ denotes the target scale in the last frame, and $c$ denotes the relevant factor between the new target scale in the current frame and the target scale in the last frame. The scale $s_{f}$ is called the forward scale of the new target, since it uses the last frame's filter to estimate the target scale in the current frame.
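The last step of this paragraph, going from a vector of scale confidence scores to the forward scale, can be sketched as follows. The scores are assumed to have been computed already, one per pyramid level; the helper name and the toy numbers are hypothetical.

```python
def forward_scale(scores, factors, prev_scale):
    """Pick the pyramid level with the highest confidence score.

    Returns the relevant factor between the frames and the resulting
    forward scale (factor times the scale of the last frame).
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    factor = factors[best]
    return factor, factor * prev_scale


# Toy usage with five levels around the last-frame scale 1.5.
levels = [1.02 ** n for n in range(-2, 3)]
toy_scores = [0.1, 0.3, 0.2, 0.9, 0.4]
factor, s_forward = forward_scale(toy_scores, levels, prev_scale=1.5)
```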

To improve the performance, we also compute a backward scale $s_{b}$ for the same target. Based on the new target location and the forward scale $s_{f}$, we obtain a new scale filter with numerator $\hat{C}_{t}^{l}$ and denominator $\hat{D}_{t}$. Note that this scale filter only uses the new target information in the current frame. Given the features $z_{s,t-1}$ of an image patch in the last frame, with Fourier transform $Z_{s,t-1}$, the maximum confidence score is calculated as

$$y_{b} = \max \mathcal{F}^{-1} \left\{ \frac{\sum_{l=1}^{m} \overline{\hat{C}_{t}^{l}} \odot Z_{s,t-1}^{l}}{\hat{D}_{t} + \lambda} \right\}. \tag{9}$$

Let $d$ denote the backward relevant factor, that is, the factor of the scale corresponding to $y_{b}$ in the last frame. The new target scale implied by $d$ is taken as the backward scale estimate $s_{b}$, since it uses the current frame's filter to estimate the scale.

In most cases, the backward scale $s_{b}$ is equal to the forward scale $s_{f}$. In practice, however, both estimates contain errors. Here the trend of scale change is classified into two types, increase and decrease: when the scale value is greater than 1, the trend is regarded as an increase, and when it is less than 1, as a decrease. In some situations the trend of the forward scale is inconsistent with the trend of the backward scale. The main reason is that the forward scale filter (numerator $C_{t-1}^{l}$, denominator $D_{t-1}$) includes the history information, while the backward scale filter only uses the current scale information. The scale filter with the history information can therefore produce inaccurate trend estimates. Thus, we take the backward scale as the final scale when the two trends are inconsistent. This conservative strategy integrates the two scales as

$$s_{t} = \begin{cases} s_{f}, & \text{if the trends of } s_{f} \text{ and } s_{b} \text{ are identical}, \\ s_{b}, & \text{otherwise}. \end{cases} \tag{10}$$

When the forward scale's trend is identical to the backward scale's trend, we take the forward scale as the final scale, since the forward filter includes abundant scale information. The scale filter is then updated through its numerator and denominator:

$$C_{t}^{l} = (1 - \eta)\, C_{t-1}^{l} + \eta\, \overline{G_{s}} \odot F_{s,t}^{l}, \tag{11}$$

$$D_{t} = (1 - \eta)\, D_{t-1} + \eta \sum_{k=1}^{m} \overline{F_{s,t}^{k}} \odot F_{s,t}^{k}, \tag{12}$$

where $\eta$ is the learning rate, $G_{s}$ is the Fourier transform of the desired 1-dimensional output, and $F_{s,t}^{l}$ is the Fourier transform of the $l$-th channel of the scale features in frame $t$. Note that the features in (11) and (12) are extracted at the final scale $s_{t}$. This update is effective because an accurate scale greatly improves the performance of the scale filter. We present an outline of our method in Algorithm 1.
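The conservative strategy in (10) can be sketched in a few lines. Two points are assumptions made for this illustration only: the trend is measured as the ratio of an estimate to the last-frame scale, and a ratio of exactly 1 is treated as a third "unchanged" case, which the text does not define. The function names are hypothetical.

```python
def trend(ratio):
    """+1 for an increase (ratio > 1), -1 for a decrease, 0 if unchanged."""
    if ratio > 1.0:
        return 1
    if ratio < 1.0:
        return -1
    return 0


def fuse_scales(prev_scale, forward, backward):
    """Keep the forward scale when both trends agree, else the backward one."""
    same_trend = trend(forward / prev_scale) == trend(backward / prev_scale)
    return forward if same_trend else backward


agree = fuse_scales(1.0, 1.04, 1.02)       # both increase: forward scale kept
disagree = fuse_scales(1.0, 1.04, 0.98)    # trends differ: backward scale kept
```

The strategy is conservative in the sense that the history-laden forward filter is trusted only when the filter trained on the current frame alone confirms the direction of the change.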

*Algorithm 1 (proposed tracking algorithm).*

*Input*. Image $I_{t}$. Previous target position $p_{t-1}$ and scale $s_{t-1}$. Translation model: the numerator $A_{t-1}$ and denominator $B_{t-1}$. Scale model: the numerator $C_{t-1}$ and denominator $D_{t-1}$.

*Output*. Estimated target position $p_{t}$ and scale $s_{t}$. Updated translation model ($A_{t}$, $B_{t}$) and scale model ($C_{t}$, $D_{t}$).

*Translation Estimation*.

(1) Extract the target features from $I_{t}$ at the previous target position $p_{t-1}$ and scale $s_{t-1}$.

(2) Compute the translation filter using (2).

(3) Set $p_{t}$ to the target position that maximizes the confidence score in (5).

*Scale Estimation*.

(4) Extract the scale features from $I_{t}$ at the target position $p_{t}$ and scale $s_{t-1}$.

(5) Compute the forward scale $s_{f}$ using the scale filter ($C_{t-1}$, $D_{t-1}$).

(6) Compute the scale filter of the current frame and the backward scale $s_{b}$.

(7) Integrate the forward scale $s_{f}$ and the backward scale $s_{b}$ into the final scale $s_{t}$ using (10).

*Model Update*.

(8) Extract the target features from $I_{t}$ at the target position $p_{t}$ and scale $s_{t}$.

(9) Update the translation model: the numerator $A_{t}$ by (3) and the denominator $B_{t}$ by (4).

(10) Update the scale model: the numerator $C_{t}$ by (11) and the denominator $D_{t}$ by (12).

#### 4. Experiments

We name our proposed tracker BSET (Bidirectional Scale Estimation Tracker). All trackers in this paper are implemented in Matlab 2013 on an Intel I5-3210 2.50 GHz CPU with 4 GB RAM. The regularization parameter $\lambda$ is fixed, and the learning rate $\eta$ in the update formulas is set to 0.025. The number of scales is set to 33, with a self-adaptive scale factor that is chosen according to the target size. This strategy adjusts the scale factor to the size of the target and overcomes the problem of slow scale growth for small targets. When the target is large, the scale factor should be set to a small value, since this follows scale changes more responsively. To assess the overall performance of BSET, we adopt a large benchmark dataset (OTB-50) [13] that contains 50 videos with many challenging attributes such as scale variation, in-plane rotation, low resolution, and background clutter.

We adopt the 32 sequences annotated with "scale variation" as the scale variation dataset. Two criteria, the center location error (CLE) and the overlap precision (OP), are applied to this dataset. The CLE is defined as the average Euclidean distance between the ground truth and the estimated center location of the target. The OP [5] is defined as the percentage of frames in which the bounding box overlap surpasses a given threshold. We compare our method with nine state-of-the-art trackers: DSST [5], KCF [3], Struck [14], SCM [15], TLD [6], MTT [7], LOT [16], DFT [17], and CXT [18], all of which have been shown to provide excellent performance in the literature. In addition, we provide two kinds of plots, the precision plot and the success plot [13], to evaluate all trackers; in these plots the trackers are ranked using the area under the curve (AUC).
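The two criteria can be computed per sequence as in the sketch below. It assumes boxes given as `(x, y, width, height)` tuples and the usual intersection-over-union as the bounding box overlap; the function names are hypothetical and this is not the benchmark's own evaluation code.

```python
import math


def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2))


def overlap(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_cle(pred, truth):
    """Center location error averaged over all frames of a sequence."""
    return sum(center_error(p, t) for p, t in zip(pred, truth)) / len(truth)


def overlap_precision(pred, truth, threshold=0.5):
    """Fraction of frames whose overlap exceeds the threshold."""
    hits = sum(1 for p, t in zip(pred, truth) if overlap(p, t) > threshold)
    return hits / len(truth)


# Toy usage: two frames, the second prediction is shifted by 5 pixels.
truth = [(0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (5, 0, 10, 10)]
cle = average_cle(pred, truth)             # 2.5 pixels
op = overlap_precision(pred, truth)        # 0.5 at threshold 0.5
```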

##### 4.1. Experiment 1: Robust Scale Estimation

The scale variation dataset includes 32 sequences, which also pose challenges such as illumination variation, motion blur, background clutter, and occlusion. Table 1 shows the per-video overlap precision at a threshold of 0.5, compared with 9 state-of-the-art trackers. From Table 1, we find that the BSET algorithm provides better or similar performance on 18 of the 32 sequences. The BSET algorithm performs well, with an average OP of 0.768, which outperforms the DSST algorithm by 10.9%. Table 2 shows the per-video CLE, compared with the same 9 trackers. From Table 2, we find that the BSET algorithm provides better or similar performance on 12 of the 32 sequences. The SCM tracker performs well, with an average CLE of 31.8 pixels, but the BSET tracker achieves a lower average CLE of 19.7 pixels.