Abstract

This paper proposes a new visual tracking method that constructs a robust appearance model of the target with convolutional sparse coding. First, our method uses convolutional sparse coding to divide the interest region of the target into a smooth image and four detail images with different fitting degrees. Second, we compute the initial target region by tracking the smooth image with kernelized correlation filtering, and we define an appearance model that describes the details of the target based on the initial target region and the combination of the four detail images. Third, we propose a matching method based on the overlap rate and the Euclidean distance between candidates and the appearance model to compute the tracking result from the detail images. Finally, the two tracking results, computed separately from the smooth image and the detail images, are combined to produce the final target rectangle. Experiments on videos from Tracking Benchmark 2015 demonstrate that our method produces better results than most existing visual tracking methods.

1. Introduction

Visual tracking is a hot topic in computer vision and graphics. Changes in the background and the object introduce many tracking challenges, such as deformation, occlusion, and rotation, and producing accurate tracking results under such conditions remains far from solved. Many tracking methods have been proposed recently. They fall into two categories: generative tracking methods and discriminative tracking methods. Generative methods usually describe and identify the target with the maximum likelihood or posterior probability. Discriminative methods often train a classification model to separate the target from the background.

The generative tracking methods take the candidate most similar to the target object as the tracking result. For example, Black and Jepson [1] proposed a subspace-based method to calculate the affine transformation between the current frame and the image reconstructed from the feature vectors of the target. Later, Ross et al. [2] improved it by updating the basis of the feature space online. Mei and Ling [3] represented the target sparsely over the target template and a subspace of positive and negative trivial templates through regularized least squares, which performed well under illumination variations and occlusions. Subsequently, many tracking methods [4, 5] have been proposed to improve the tracking results by optimizing the algorithm. Although the generative tracking methods have made great progress, they are limited by how accurately they can separate the background and the target, especially for cluttered backgrounds and large deformations.

The discriminative methods often learn to distinguish the target from the background based on cues from previous frames. For example, the tracker based on the support vector machine (SVM) [6] distinguishes the target from the background by learning from positive and negative samples. Following the SVM, Hare et al. [7] proposed the structured support vector machine (SSVM) tracker Struck to further enhance the discriminating ability under deformation and occlusion. Later, Ning et al. [8] proposed the dual linear structured support vector machine (DLSSVM) based on the SSVM to efficiently exploit high-dimensional features of the target and candidates.

The recent discriminative trackers are often based on correlation filtering [9–15]. For example, Bolme et al. [9] proposed the minimum output sum of squared error (MOSSE) tracking algorithm, which first applied correlation filtering to visual tracking. MOSSE performs the convolution between the target template and the interest region of the target in the Fourier domain. Based on MOSSE, Henriques et al. [10] introduced the circulant matrix and the kernel trick, convolving the dense samples formed by cyclic shifts of the target template in the Fourier domain, and proposed the circulant structure with kernels (CSK) tracking algorithm. Based on CSK, Henriques et al. [11] introduced the histogram of oriented gradients (HOG) feature, extending a single channel to multiple channels to improve the tracking results without increasing the time cost. Bertinetto et al. [12] integrated HOG and color features to further improve tracking accuracy. Danelljan et al. [13] used single-layer convolutional deep features from a CNN to replace the HOG features in spatially regularized correlation filters to deal with tracking challenges. Danelljan et al. [14] improved the speed and stability of the continuous convolution operators by reducing the model parameters and adopting a sparse update strategy. Valmadre et al. [15] used end-to-end learning to treat the correlation filter as a layer in a CNN to reduce tracking drift and failure.

Recently, tracking methods with deep features [16–19] have become very popular for their good performance in describing the target and the background. Li et al. [16] proposed a method to learn target-aware features and integrate them with a Siamese matching network. Wang et al. [17] proposed a SiamFC-based tracker using "rough matching" and "fine matching": they enhanced tracking robustness through training in rough matching and improved discrimination through a distance-learning network in fine matching. Du et al. [18] proposed a tracker that detects target corners. It first uses a Siamese network to roughly distinguish the foreground from the background to get the interest region of the target; then, the relationship between the target template and the interest region is used to highlight the corner regions and enhance their features to produce a more accurate bounding box. Guo et al. [19] proposed a fully convolutional Siamese network for tracking. Chen et al. [20] proposed a Siamese box adaptive network named SiamBAN, which views tracking as parallel classification and regression problems and classifies objects and regresses their bounding boxes in a fully convolutional network. Danelljan et al. [21] proposed a probabilistic regression model for tracking. Yang et al. [22] defined a tracking model with an offline recurrent neural optimizer that updates the tracking model in a meta-learning setting. Li et al. [23] improved tracking by integrating alignment data with deep features and used a ConditionNet to bridge the gap between the preconditioning and learning processes.

The above tracking methods are mostly defined by establishing an appearance model of the target. Therefore, once the appearance model becomes unreliable or inaccurate, it is very difficult to correct it to improve performance on the following frames. In particular, for an appearance model updated online, if the update is too aggressive, the surrounding background is easily absorbed as target information, leading to overfitting; if the update is too weak, it leads to underfitting and tracking drift. This paper divides the target into a smooth image and four detail images based on convolutional sparse coding (CSC) and establishes separate target appearance models for them. The proposed tracker combines the tracking results of the two parts to cope with these challenges and improve tracking performance.

2. Our Tracking Framework

This paper first extracts the interest region by expanding the target rectangle by 2.5 times. Then, we divide the interest region into a smooth image and four detail images with different fitting degrees by the CSC. For the smooth image, the model is initialized and tracked with the kernelized correlation filter (KCF) [11]. For the detail images, we first combine the four detail images with different fitting degrees and construct the appearance model of the combined image to represent the target details. Then, we evaluate the candidates by measuring the overlap rate and the Euclidean distance between each candidate and the appearance model. This evaluation describes how well a candidate matches the appearance model, and the best-matched candidate is the tracking result based on the detail images. Finally, the tracking results from the details and the smooth image are combined to produce the final tracking result. To adapt to changes of the target, we update the appearance model with the tracked result frame by frame. The flowchart of our method is shown in Figure 1. It includes the target model initialization phase, target tracking, and model update.
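As a concrete illustration of the first step, the following Python sketch crops the interest region by expanding the previous target rectangle 2.5 times. The (cx, cy, w, h) box convention is an assumption; the paper does not specify a parameterization.

```python
import numpy as np

def extract_roi(frame, box, scale=2.5):
    """Crop the interest region by expanding the target rectangle `scale`
    times, as in Section 2. The (cx, cy, w, h) box convention is an
    assumption; the paper does not specify one."""
    cx, cy, w, h = box
    rw, rh = int(round(w * scale)), int(round(h * scale))
    H, W = frame.shape[:2]
    x0 = max(0, int(round(cx - rw / 2.0)))
    y0 = max(0, int(round(cy - rh / 2.0)))
    x1, y1 = min(W, x0 + rw), min(H, y0 + rh)
    return frame[y0:y1, x0:x1]
```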

2.1. Initialize the Appearance Model

As shown in Figure 1, we first extract the interest region based on the target rectangle of the last frame. Then, we divide the interest region into a smooth image and four detail images with different fitting degrees by the CSC. For the smooth image, we initialize its appearance model based on the KCF method. For the detail images, we combine the four detail images with different fitting degrees and establish an appearance model to describe the target details.

2.2. Tracking Target Object

After producing the smooth image and detail images of the target, we track the two parts separately with different approaches. As described in the top row of Figure 1, for the smooth image, we use the KCF method to construct its appearance model and compute the response value matrix; the candidate with the largest response value is selected as the tracking result based on the smooth image. Similarly, as shown in the bottom row of Figure 1, for the detail images, we first combine the four detail images and construct the appearance model. Then, we compute the overlap rate between each sample and the appearance model and retain the samples whose rates exceed a threshold. Next, we compute the Euclidean distance between each retained sample and the appearance model and select the sample with the minimum distance as the tracking result based on the detail images. Finally, we fuse the two tracked results from the smooth image and the detail images to get the final tracking result.

2.3. Update the Appearance Model

The model update includes the appearance model updates of the smooth image and the detail images. For the smooth image, an appearance model is first established according to the tracked result and then combined with the old one to form the updated appearance model. For the detail images, four new detail models are extracted according to the new tracked result and combined with different fitting degrees to replace the appearance model of the target details.

3. Target Tracking Based on the CSC

In this section, we first describe how to divide an interest region into a smooth image and four detail images based on the CSC and how to establish the appearance models of the smooth image and the detail images, respectively, as shown in Figure 2 (Section 3.1). Second, we describe how to track the smooth image and the detail images in Section 3.2. Finally, we describe the details of the appearance model update in Section 3.3.

3.1. Initialize the Appearance Model

Following the method proposed in [24], we divide the interest region of the target into a smooth part and detail parts using filters, as shown in Figure 2. The smooth part contains the cues about the color and shape features of the target. The detail parts describe the features of the image edges and texture structure. Equation (1) describes the separation of the smooth part and detail parts:

$$I = I_s + \sum_{k=1}^{K} f_k \ast z_k, \qquad (1)$$

where $I$ represents the original image, $I_s$ describes the smooth part, and $\sum_k f_k \ast z_k$ describes the detail part; $I_s$ is obtained with a low-pass filter, $f_k$ is a filter, and $z_k$ is the corresponding characteristic diagram (feature map).
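A minimal Python sketch of the decomposition in Equation (1) follows. A Gaussian low-pass filter stands in for the unspecified low-pass filter of the paper, and the filters and feature maps are assumed to have been produced by a CSC solver, which is not shown here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.signal import fftconvolve

def decompose(image, filters, feature_maps, sigma=2.0):
    """Split `image` into a smooth part and a detail part as in Equation (1).
    `filters[k]` (small 2-D kernels) and `feature_maps[k]` (image-sized
    2-D arrays) are assumed outputs of a CSC solver."""
    image = np.asarray(image, dtype=float)
    smooth = gaussian_filter(image, sigma=sigma)   # smooth part I_s
    detail = np.zeros_like(image)                  # sum of f_k * z_k
    for f_k, z_k in zip(filters, feature_maps):
        detail += fftconvolve(z_k, f_k, mode="same")
    return smooth, detail
```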

As shown in Figure 2, the green rectangle describes the target region, and the red rectangle shows the interest region of the target. As shown in the third column, the interest region is divided into a smooth image and four detail images with different fitting degrees based on the CSC.

3.1.1. Initialize the Target Appearance Model of the Smooth Image

We construct the appearance model of the smooth image based on the KCF method [11]. Therefore, the first step in initializing the tracking target is to construct a circulant matrix from the extracted target features. We then diagonalize the circulant matrix with the discrete Fourier transform to obtain the diagonal cyclic feature matrix, as shown in Equation (2):

$$X = F \operatorname{diag}(\hat{x}) F^{H}, \qquad (2)$$

where $F$ is the constant Fourier matrix and satisfies $F^{H}F = I$, and $\hat{x}$ is the vector generated by the Fourier transform of the base sample $x$. We solve the least squares problem in the Fourier domain based on $X$ to train the target detector $\hat{\alpha}$, as described in Equation (3):

$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}, \qquad (3)$$

where $\hat{k}^{xx}$ is the Fourier transform of the first row of the kernel matrix computed with a Gaussian kernel, $\hat{y}$ is the Fourier transform of the regression target, and $\lambda$ is the regularization parameter.
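The following sketch shows the standard KCF training step that Equation (3) describes, with the Gaussian kernel correlation evaluated in the Fourier domain. The kernel width sigma and the regularization lam are assumed values, and y is the Gaussian-shaped regression target.

```python
import numpy as np

def gaussian_kernel_correlation(x, y, sigma=0.5):
    """Gaussian kernel correlation k^{xy} over all cyclic shifts, evaluated
    in the Fourier domain as in the standard KCF formulation [11]."""
    cross = np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(y))))
    d2 = np.sum(x ** 2) + np.sum(y ** 2) - 2.0 * cross
    return np.exp(-np.maximum(d2, 0.0) / (sigma ** 2 * x.size))

def train_detector(x, y, lam=1e-4):
    """Ridge regression of Equation (3) in the Fourier domain:
    alpha_hat = y_hat / (k_hat^{xx} + lambda)."""
    k_xx = gaussian_kernel_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k_xx) + lam)
```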

3.1.2. Initialize the Appearance Model for the Detail Images

As shown in Figure 3, this paper constructs the appearance model for the details of the target based on the detail images. We employ 400 filters to implement the CSC, which yields 400 detail feature maps of the tracked target, $\{f_k \ast z_k\}_{k=1}^{400}$. To prevent underfitting or overfitting, we accumulate every 100 feature maps to form one detail image. We thus obtain four detail images by

$$D_j = \sum_{k=100(j-1)+1}^{100j} f_k \ast z_k, \quad j = 1, \dots, 4,$$

so four detail models with different fitting degrees are constructed to describe the details of the target. Finally, we combine $D_1$, $D_2$, $D_3$, and $D_4$ to get the final appearance model of the detail images, denoted $M_d$, as shown in Equation (4):

$$M_d = \sum_{j=1}^{4} \omega_j D_j, \qquad (4)$$

where $\sum_{j=1}^{4} \omega_j = 1$.
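A sketch of this grouping and combination follows, assuming the 400 feature maps are stacked in an array of shape (400, H, W) and that the four detail images are combined with equal weights; the paper does not report the weight values.

```python
import numpy as np

def build_detail_model(detail_maps, weights=(0.25, 0.25, 0.25, 0.25)):
    """Group 400 CSC feature maps into four detail images D_1..D_4 and
    combine them into the detail model M_d of Equation (4).
    `detail_maps` has shape (400, H, W); equal weights are an assumption."""
    detail_maps = np.asarray(detail_maps, dtype=float)
    groups = detail_maps.reshape(4, 100, *detail_maps.shape[1:])
    D = groups.sum(axis=1)                         # four detail images
    w = np.asarray(weights, dtype=float).reshape(4, 1, 1)
    return (w * D).sum(axis=0)                     # detail model M_d
```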

3.2. Tracking Target Based on CSC

This section introduces our tracking method based on the CSC in detail. For each frame of a video, we first use the CSC to divide the interest region into the smooth image and four detail images. Second, we use the KCF to track the target region based on the smooth image. Then, we predict the target region based on the appearance model of the image details. Finally, we compute the final tracking result by combining the target regions from both the smooth image and the detail images.

3.2.1. Tracking Target Based on Smooth Image

We expand the target region from the last frame by 2.5 times to form the sampling area. We extract the features of the sampling area to form a feature matrix $z$. Then, we calculate the kernel correlation between the cyclic feature matrix of the learned template $x$ and $z$ using the kernel function. That is, $K^{z} = C(k^{xz})$, where $k^{xz}$ is the row vector whose elements $k^{xz}_i = \kappa(z, P^{i-1}x)$ are obtained through the kernel function operation, $C(\cdot)$ is the function that constructs a circulant matrix from a vector, and $P$ is the cyclic shift operator. After that, we multiply the target detector $\hat{\alpha}$ elementwise with $\hat{k}^{xz}$ to get the response value matrix of each position in the sampling area by Equation (5):

$$f(z) = \mathcal{F}^{-1}\!\left(\hat{k}^{xz} \odot \hat{\alpha}\right). \qquad (5)$$

Positions with larger response values are more likely to contain the target. The position of the maximum response value is taken as the target center based on the smooth image.
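The detection step of Equation (5) can be sketched as follows; it reuses gaussian_kernel_correlation() from the training sketch in Section 3.1.1 and returns the position of the maximum response.

```python
import numpy as np

def detect(alpha_hat, model_x, z):
    """Response map of Equation (5): f(z) = IFFT(k_hat^{xz} .* alpha_hat).
    `model_x` is the learned template and `z` the features of the current
    sampling area (same-sized 2-D arrays)."""
    k_xz = gaussian_kernel_correlation(z, model_x)
    response = np.real(np.fft.ifft2(np.fft.fft2(k_xz) * alpha_hat))
    # The position of the maximum response is the predicted target center.
    return np.unravel_index(np.argmax(response), response.shape)
```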

3.2.2. Tracking Target Based on Detail Images

To predict the target based on the detail images, we first randomly select 400 samples of the same size as the target in the interest region. Then, we calculate the overlap rate (OR) between each sample and the appearance model of the detail images by Equation (6):

$$OR = \frac{W_o \times H_o}{W_m \times H_m + W_s \times H_s - W_o \times H_o}. \qquad (6)$$

Here, $W_o$ and $H_o$ are the width and height of the overlapping part, $W_m$ and $H_m$ are the width and height of the target model, and $W_s$ and $H_s$ are the width and height of the sample. Writing $(x^{tl}, y^{tl})$ and $(x^{br}, y^{br})$ for the coordinates of the top-left and bottom-right corners of the target model (subscript $m$) and the sample (subscript $s$), the overlapping part is $W_o = \max(0, \min(x^{br}_m, x^{br}_s) - \max(x^{tl}_m, x^{tl}_s))$ and $H_o = \max(0, \min(y^{br}_m, y^{br}_s) - \max(y^{tl}_m, y^{tl}_s))$. When the overlap rate is greater than the set threshold, the sample is retained; otherwise, it is rejected directly.
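A direct implementation of Equation (6), assuming boxes given by their top-left and bottom-right corners:

```python
def overlap_rate(model_box, sample_box):
    """Overlap rate of Equation (6), with boxes as (x_tl, y_tl, x_br, y_br)."""
    wo = max(0.0, min(model_box[2], sample_box[2]) - max(model_box[0], sample_box[0]))
    ho = max(0.0, min(model_box[3], sample_box[3]) - max(model_box[1], sample_box[1]))
    inter = wo * ho                                 # W_o * H_o
    area_m = (model_box[2] - model_box[0]) * (model_box[3] - model_box[1])
    area_s = (sample_box[2] - sample_box[0]) * (sample_box[3] - sample_box[1])
    return inter / (area_m + area_s - inter)
```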

For the retained samples, we calculate the Euclidean distance between them and the appearance model of the detail images. First, for each sample, we extract four detail images with different fitting degrees, each accumulated from 100 detail feature maps. Then, we combine the four detail images to obtain the detail description of the sample, denoted by $S_d$. Finally, we calculate the Euclidean distance $d$ between $S_d$ and the appearance model $M_d$ of the detail part by Equation (7):

$$d = \lVert M_d - S_d \rVert_2 = \sqrt{\sum_i \left(M_d(i) - S_d(i)\right)^2}. \qquad (7)$$

The sample with the minimum distance is taken as the final tracked result based on the detail images.
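Putting the two criteria together, the following sketch keeps the samples whose overlap rate exceeds a threshold and returns the one closest to the detail model in Euclidean distance. The threshold value 0.5 is an assumption, since the paper does not report it; overlap_rate() is reused from the sketch above.

```python
import numpy as np

def best_detail_match(M_d, samples, boxes, model_box, or_threshold=0.5):
    """Select the tracked result from the detail images: filter samples by
    overlap rate, then minimize the Euclidean distance of Equation (7).
    `samples[i]` is the combined detail description S_d of candidate i."""
    best_i, best_d = None, np.inf
    for i, (S_d, box) in enumerate(zip(samples, boxes)):
        if overlap_rate(model_box, box) <= or_threshold:
            continue                     # rejected directly (Section 3.2.2)
        d = np.linalg.norm(M_d - S_d)    # Equation (7)
        if d < best_d:
            best_i, best_d = i, d
    return best_i
```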

3.2.3. Computing the Final Tracking Result

We use $P_s$ to describe the center of the tracked result based on the smooth image and $P_d$ to describe the center of the tracked result based on the detail images. The final tracked center $P$ is computed by combining $P_s$ and $P_d$ with Equation (8):

$$P = \mu P_s + \nu P_d, \qquad (8)$$

where $\mu + \nu = 1$. In our experiments, we set both $\mu$ and $\nu$ to 0.5. The size of the tracked result is fused in the same way as the center point.
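As a worked example, the fusion of Equation (8) is a simple weighted average of the two predicted centers:

```python
import numpy as np

def fuse_centers(P_s, P_d, mu=0.5, nu=0.5):
    """Weighted fusion of Equation (8): P = mu * P_s + nu * P_d, with
    mu + nu = 1 (both set to 0.5 in the paper)."""
    return mu * np.asarray(P_s, dtype=float) + nu * np.asarray(P_d, dtype=float)
```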

3.3. Update the Appearance Model of Target

After obtaining the tracking result on the current frame, we need to update the appearance model of the target to adapt to its changes. This is achieved by separately updating the appearance models of the smooth image and the detail images.

3.3.1. Update the Appearance Model of the Smooth Image

The update of the model for the smooth image is achieved by updating the cyclic feature matrix $\hat{x}$ and the target detector $\hat{\alpha}$. After getting the center position of the target, the matrix $\hat{x}'$ and the detector $\hat{\alpha}'$ on frame $t$ are obtained. Then, we combine them with the cyclic feature matrix and the target detector from frame $t-1$ to get the updated $\hat{x}^{t}$ and $\hat{\alpha}^{t}$. The update is defined by:

$$\hat{x}^{t} = (1-\eta)\,\hat{x}^{t-1} + \eta\,\hat{x}', \qquad \hat{\alpha}^{t} = (1-\eta)\,\hat{\alpha}^{t-1} + \eta\,\hat{\alpha}',$$

where $\hat{x}^{t}$ and $\hat{\alpha}^{t}$ are the updated cyclic feature matrix and target detector on frame $t$, and $\eta$ is the learning parameter. $\hat{x}^{t-1}$ and $\hat{\alpha}^{t-1}$ are the cyclic feature matrix and target detector from the last frame, which greatly preserve the stable target features from previous frames.
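The update rule above is linear interpolation between the old and new models; a sketch follows, where the learning rate value 0.02 is an assumption, since the paper does not report the value of η.

```python
def update_smooth_model(x_hat_old, alpha_hat_old, x_hat_new, alpha_hat_new,
                        eta=0.02):
    """Linear interpolation update of the smooth-image model.
    The learning rate 0.02 is an assumed value."""
    x_hat = (1.0 - eta) * x_hat_old + eta * x_hat_new
    alpha_hat = (1.0 - eta) * alpha_hat_old + eta * alpha_hat_new
    return x_hat, alpha_hat
```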

3.3.2. Update the Appearance Model of the Detail Images

The update of the appearance model of the detail images is achieved by updating the four detail models with different fitting degrees. First, we extract four new detail images with different fitting degrees from the interest region of the tracked result. Then, we use the initialization method proposed in Section 3.1 to construct a new appearance model of the detail images for the current tracked result. Finally, we update the appearance model of the detail images by replacing the present model with the new one.

4. Results

Our method is implemented in MATLAB 2014b on a PC with Windows 7, an Intel i7-6700 3.4 GHz processor, 12 GB of video memory, and 12 GB of RAM. We used 61 video sequences from the Visual Tracking Benchmark 2015 for the experiments. They include several challenges such as complex background, illumination variation, and rotation. We conduct quantitative and qualitative evaluations against 13 well-known trackers: DLSSVM [8], KCF [11], Staple [12], ECO [24], CNN-SVM [25], CFNet [15], SINT [26], Struck [7], DeepSRDCF [14], RPT [27], TLD [28], VTD [29], and CT [30]. The results show that our method is more effective for in-plane rotation, complex background, and illumination variations.

4.1. Quantitative Evaluation

We evaluate the tracking results using the center position error and the success rate. The center accuracy is obtained from the distance between the centers of the tracked result and the ground-truth target region, and the success rate is computed from their overlap rate. Compared with six trackers, our method achieves better center accuracy and success rate, as shown in Figure 4. Tables 1 and 2 separately report the evaluations for different challenges. For complex backgrounds, our center accuracy and success rate rank first with 0.781 and 0.605. For in-plane rotation, our center accuracy is 0.745, and our success rate ranks first with 0.555. For illumination variations, our center accuracy is 0.712 and ranks third, only 0.014 and 0.002 below the first-place KCF and the second-place DLSSVM, respectively; our success rate ranks first with 0.523. More details about the comparisons can be found in Tables 1 and 2.
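For reference, the two metrics can be computed as in the following sketch, using the common benchmark defaults of 20 pixels for the precision threshold and 0.5 IoU for success; these are assumed values, and the curves in Figure 4 sweep such thresholds. It reuses overlap_rate() from Section 3.2.2.

```python
import numpy as np

def precision_and_success(pred_centers, gt_centers, pred_boxes, gt_boxes,
                          dist_thresh=20.0, iou_thresh=0.5):
    """Center-accuracy and success-rate scores at single thresholds."""
    dists = np.linalg.norm(np.asarray(pred_centers, dtype=float)
                           - np.asarray(gt_centers, dtype=float), axis=1)
    precision = float(np.mean(dists <= dist_thresh))   # center position error
    ious = np.array([overlap_rate(g, p) for g, p in zip(gt_boxes, pred_boxes)])
    success = float(np.mean(ious >= iou_thresh))       # overlap-based success
    return precision, success
```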

4.2. Qualitative Evaluation

This section qualitatively evaluates our method under occlusion, illumination variation, out-of-plane rotation, and other challenges. We compare the proposed method with nine well-known algorithms.

4.2.1. Occlusion

Figure 5 shows the comparisons on target occlusion using the videos Coke and Jogging.1. The target is occluded by the surroundings, such as the leaves in Coke from frame 39 and the pole in Jogging.1 from frame 75. Many present trackers, such as Struck [7] and DLSSVM [8], easily lose the target. When the target is occluded completely, the background is taken as the target, which finally leads to tracking drift or failure. We construct appearance models of both the smooth image and the detail images to describe the target, which greatly improves our performance under occlusion.

4.2.2. Illumination Variation

Figure 6 shows the comparisons among several trackers under illumination variation. We use the videos Human8 (top row in Figure 6) and Man (bottom row in Figure 6) as examples. In Human8, the illumination undergoes great changes. When the person passes through the shadow, trackers such as CFNet [15] and Struck [7] fail to identify the blurred target. With the proposed appearance model update scheme, our method performs much better than the present trackers by efficiently adapting to the changes of the target appearance.

4.2.3. Out-of-Plane Rotation (OPR)

Figure 7 shows the tracking results of various trackers under the OPR challenge. We use the videos Liquor (top row in Figure 7) and SUV (bottom row in Figure 7) as examples. From frame 386 to frame 401 in Liquor, the target bottle rotates out of the plane several times. From frame 28 to frame 47 in SUV, the fast movement of the car makes the camera unable to keep up, so part of the target moves out of the image. As shown in frame 386 on the top row and frame 47 on the bottom row, our method accurately detects the target region, whereas some present trackers, such as ECO [24], exhibit tracking drift. The main reason is that our method uses rich detail information of the target to effectively separate the target from its surroundings.

5. Conclusion

This paper defines a new visual tracking method based on convolutional sparse coding. First, it extracts an interest region of the target in the current frame. Then, by convolutional sparse coding, we divide the interest region into a smooth image and four detail images with different fitting degrees. For the smooth image, we initialize its appearance model and compute the initial tracking result with the kernelized correlation filter. For the detail images, we first extract the four detail images and combine them to initialize the appearance model for the target details. We randomly sample 400 candidates in the interest region and calculate the overlap rate and Euclidean distance between each candidate and the appearance model of the details to determine the tracking result based on the target details. By combining the tracking results of the smooth image and the detail images, we get the final tracking result. By introducing the appearance models of both the smooth image and the detail images, the proposed method performs favorably in dealing with the tracking drift and failure introduced by the deformation and occlusion of the target. We conducted quantitative and qualitative evaluations against well-known trackers on Tracking Benchmark 2015. The experiments demonstrate that our method produces better results for many challenges such as illumination variations, occlusion, and complex background.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of China (61772209), the Science and Technology Planning Project of Guangdong Province (2019A050510034, 2019B020219001), the Production Project of the Ministry of Education of China (201901240030), and the College Students' Innovation Special Projects of China under Grants 201910564037 and 202010564026.