Abstract

The correlation filter method is effective in visual tracking tasks, but it suffers from the boundary effect and filter degradation in complex situations, which can result in suboptimal performance. To address these problems, this study proposes an object tracking method based on a discriminative correlation filter that combines adaptive background perception with a spatial dynamic constraint. In this method, an adaptive background-awareness strategy filters out interfering background information during filter training, improving the discriminability between the object and the background. In addition, a spatial regularization term is introduced, and the dynamic difference between the actual filter and a predefined spatial constraint template is used to optimize filter learning, enhancing the model's ability to capture spatial information. Experiments on the OTB100, VOT2018, and TrackingNet standard datasets demonstrate that our method achieves favorable accuracy and success rates. Compared with currently popular correlation filter methods, the proposed method maintains stable tracking performance under scene scale variation, complex backgrounds, motion blur, and fast motion.

1. Introduction

As one of the popular research fields of computer vision, visual tracking aims to obtain cues such as the scale and position of a target of interest in every frame of a video; tracking models and algorithms are designed to recover the object's motion trajectory and thereby achieve object tracking. With the rapid development of computer vision technology, object tracking algorithms have made remarkable progress and have been widely used in traffic monitoring, video surveillance, security check systems, and other application fields [1]. It should be noted that in real applications, object tracking methods often face multiple adverse factors, such as illumination variation, background clutter, and occlusion. Designing a tracker that remains robust in complex background environments therefore remains an open challenge.

Currently, mainstream object tracking methods can be coarsely divided into deep learning (DL)-based [2] methods and correlation filter (CF)-based [3] methods. The former achieves tracking by stacking multilayer neural networks, training on large-scale datasets, and extracting deep features rich in semantic information to train and optimize tracking models. Representative DL trackers include C-COT [4], ECO [5], MDNet [6], CREST [7], CFCF [8], and DeepSRDCF [9]. With the increasing complexity of deep neural network models, the large computational overhead and model size limit the application of DL-based tracking methods on mobile devices, such as drones, unmanned boats, and smart cars. In contrast, CF-based tracking methods are relatively simple and interpretable. The concept of the CF originated in the field of signal processing, where it describes the correlation between signals. In visual tracking, a CF-based method trains a filter model by minimizing the mean square error between an ideal filter template and real image samples; the trained filter is then applied to the search area of subsequent frames, and the maximum of the output response is taken as the object position in the current frame. CF-based methods use the fast Fourier transform (FFT) to convert the convolution calculation into a less complex dot product operation, which greatly improves tracking efficiency, as illustrated by the sketch below. CF-related research continues to receive much attention from scholars.
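To make the efficiency argument concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) showing that circular correlation of a search patch with a filter template collapses to an element-wise product in the Fourier domain, so evaluating every cyclic shift costs two FFTs and one inverse FFT:

```python
# Minimal sketch of the FFT trick behind CF trackers: circular correlation
# in the spatial domain becomes an element-wise product in the Fourier
# domain, O(N log N) instead of O(N^2). All names are illustrative.
import numpy as np

def correlate_fft(patch: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """Dense circular cross-correlation of `patch` with `filt` via the FFT."""
    P = np.fft.fft2(patch)
    H = np.fft.fft2(filt)
    # conj(H) * P gives correlation (not convolution) in the Fourier domain
    return np.real(np.fft.ifft2(np.conj(H) * P))

rng = np.random.default_rng(0)
template = rng.standard_normal((64, 64))
shifted = np.roll(template, (5, 9), axis=(0, 1))   # simulate object motion
resp = correlate_fft(shifted, template)
print(np.unravel_index(resp.argmax(), resp.shape))  # peak at (5, 9)
```

The response peak directly recovers the cyclic shift between the two patches, which is exactly how a CF tracker reads off the object's translation.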

Generally, CF-based methods can be further divided into generative correlation filters (GCF) and discriminative correlation filters (DCF). A GCF learns online a generative model that characterizes the appearance of the object and then predicts the object location by searching for the region whose features are most similar to the model. Representative methods include optical flow estimation [10, 11], the Kalman filter [12], MeanShift [13], the kernel particle filter [14], etc. A DCF regards the tracking task as a classification problem: it distinguishes the object from the background region by training filters, and the resulting object position is the tracking result. Representative methods include MOSSE [15], CSK [16], KCF [17], SRDCF [18], BACF [19], CACF [20], and ASRCF [21]. In contrast to the DCF, which offers better robustness and higher computational efficiency in complex scenes, the GCF often ignores background information and models only the object region, so its performance is often poor in diverse object scenes. As a result, this work focuses on the DCF method.

It is well known that the DCF often suffers from boundary effects [22]: cyclic sampling does not remove boundary samples, which makes the tracker overfit and reduces the discriminative power of the model. In response, the background-aware correlation filter (BACF) introduces a cropping matrix to remove boundary samples from the cyclic sample set. Although BACF can effectively suppress background interference outside the object frame, it ignores the background information immediately surrounding the object inside the search region, so the model is still affected by background interference to some extent, which limits the discriminability of the filter. Subsequently, the spatially regularized correlation filter (SRDCF) constructs a spatial regularization term to improve the filter's spatial representation of boundary-region features. Unfortunately, the fixed spatial regularization term makes it difficult for the filter to adapt to the image content. When interference such as scale variation or scene variation occurs in the background, background information is easily included in the tracking frame, which degrades the filter model's ability to discriminate the object, causing the tracked object to drift and the tracking performance to drop substantially.

Motivated by the above discussion, we propose a method based on adaptive background awareness and spatial dynamic constraint correlation filters (BSCF) in this work. The major contributions are summarized as follows:
(i) A new adaptive background perception strategy is introduced, which makes full use of the background information in the area surrounding the object. This strategy penalizes nonobject areas more accurately and greatly improves the classification ability and discriminability of the filter model.
(ii) A spatial dynamic constraint is added to the filter to enhance its adaptive learning ability, so the filter is constrained in a more targeted way, effectively resolving the problem of boundary distortion and further improving robustness to object tracking in complex scenes.
(iii) Experimental results on the OTB100, VOT2018, and TrackingNet benchmark datasets validate the performance advantages of BSCF in object motion scenes; especially under scene scale variation and complex background conditions, the proposed method outperforms popular methods.

The remaining sections of this study are arranged as follows: Section 2 summarizes the traditional CF method, the spatial regularization CF method, and the background-aware CF method. Section 3 describes the method in detail. Section 4 presents the experimental analysis and the comparison of various mainstream tracking methods; the benchmark scoring dataset and evaluation metrics are briefly described. Section 5 summarizes the full text.

2. Preliminaries

2.1. Traditional CF Methods

Henriques et al. proposed a circulant structure of tracking-by-detection with kernels (CSK) [16], which uses the dense sampling induced by the circulant matrix: translations around the object provide abundant negative samples, enlarging the training set and improving the training of the CSK filter model. This resolved the sample-scarcity problem of the original sparse sampling and achieved accurate, stable real-time tracking.

The objective function of the traditional CF method is written as

$$\min_{\mathbf{w}}\ \left\| X\mathbf{w}-\mathbf{y} \right\|_2^2+\lambda\left\|\mathbf{w}\right\|_2^2, \quad (1)$$

where $\mathbf{w}$ represents the filter, $X$ is the circulant matrix obtained by cyclically sampling the initial sample $\mathbf{x}$, $\mathbf{y}$ is the ideal Gaussian response used as the vectorized regression target for the image, and $\lambda$ is the regularization coefficient. The regularization term in equation (1) penalizes overdeformed samples produced by cyclic sampling and prevents the filter from overfitting.

The above objective function has a closed-form solution, which gives the globally optimal filter coefficients:

$$\hat{\mathbf{w}}=\frac{\hat{\mathbf{x}}^{*}\odot\hat{\mathbf{y}}}{\hat{\mathbf{x}}^{*}\odot\hat{\mathbf{x}}+\lambda}, \quad (2)$$

where $\hat{\mathbf{x}}$, $\hat{\mathbf{w}}$, and $\hat{\mathbf{y}}$ are the Fourier transforms of the initial sample $\mathbf{x}$, the filter $\mathbf{w}$, and the regression target $\mathbf{y}$, respectively; $*$ represents the conjugate operation, and $\odot$ denotes the element-wise (dot) product of vectors.
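As a concrete single-channel illustration, the following NumPy sketch implements the ridge regression of equation (1) through the closed-form solution of equation (2); the Gaussian label construction and the regularization value are our own assumptions rather than settings from the paper:

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Ideal response y: a Gaussian peaked at the patch center, circularly
    shifted so the peak sits at (0, 0), the usual CF training convention."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    return np.roll(g, (-(h // 2), -(w // 2)), axis=(0, 1))

def train_cf(x, y, lam=1e-4):
    """Closed-form filter of equation (2), computed in the Fourier domain."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)   # \hat{w}

def detect(w_hat, z):
    """Response map on a new search patch z; the peak is the predicted shift."""
    return np.real(np.fft.ifft2(w_hat * np.fft.fft2(z)))
```

Training and detection each cost only a few FFTs, which is the source of the speed advantage noted in Section 1.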

2.2. The Spatial Regularization CF Method

To solve the boundary effect problem caused by cyclic sampling in traditional correlation filtering methods, Danelljan et al. proposed the spatially regularized correlation filter (SRDCF) [18]. The method imposes a spatial regularization constraint with a negative Gaussian shape distribution on the filter template, penalizing background regions far from the object; this reduces the filter response in boundary regions and makes the filter focus on the central region of the object. SRDCF effectively increases the proportion of true object samples and mitigates the boundary effect. The objective function of the spatially regularized filter is

$$\min_{\mathbf{w}}\ \left\|\sum_{d=1}^{D}\mathbf{x}^{d}\ast\mathbf{w}^{d}-\mathbf{y}\right\|_2^2+\sum_{d=1}^{D}\left\|\tilde{\mathbf{w}}\odot\mathbf{w}^{d}\right\|_2^2, \quad (3)$$

where $\mathbf{x}$ is the image sample data, $\mathbf{w}$ is the filter template, the superscript $d$ indexes the spatial feature channel, and $\tilde{\mathbf{w}}$ is the spatial regularization penalty matrix with a negative Gaussian shape distribution, which adjusts the regularization weights of the object samples and the boundary samples. By penalizing the boundary samples, the filter pays more attention to the object information.
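The paper does not give the exact shape of the penalty matrix, but a negative-Gaussian weight map of the kind SRDCF uses can be sketched as follows (the weight range and width are illustrative assumptions):

```python
import numpy as np

def spatial_penalty(shape, w_min=0.1, w_max=10.0, sigma_frac=0.25):
    """Negative-Gaussian spatial penalty: small near the object center,
    large toward the patch boundary, so boundary filter coefficients are
    suppressed. Parameter values are illustrative, not from the paper."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = ((ys - h / 2) / (sigma_frac * h)) ** 2 + ((xs - w / 2) / (sigma_frac * w)) ** 2
    return w_max - (w_max - w_min) * np.exp(-d2 / 2.0)
```

Multiplying the filter element-wise by this map inside the regularizer penalizes energy placed far from the object center, which is the mechanism behind equation (3).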

It should be emphasized that adding the spatial regularization term destroys the closed-form solution structure of the correlation filtering framework, so equation (3) is usually solved iteratively with Gauss–Seidel optimization, which greatly reduces tracking speed. Furthermore, SRDCF uses a fixed spatial weight matrix: the constraint term cannot adapt to the object over time, which limits the robustness of the filter model.

2.3. The Background-Aware CF Method

To solve the boundary effect problem inherent in traditional methods and to reduce the large time consumption of spatially regularized object tracking algorithms, Mueller et al. proposed the context-aware correlation filter (CACF), which adds a constraint on contextual background information to the CF framework [20]. CACF takes the object as a positive sample and the surrounding context blocks as negative samples, using both for filter training to improve the number and quality of samples. The objective function can be expressed as

$$\min_{\mathbf{w}}\ \left\|X_{0}\mathbf{w}-\mathbf{y}\right\|_2^2+\lambda_{1}\left\|\mathbf{w}\right\|_2^2+\lambda_{2}\sum_{i=1}^{k}\left\|X_{i}\mathbf{w}\right\|_2^2, \quad (4)$$

where $X_{i}$ ($i=1,\dots,k$) is the circulant matrix of the $i$-th context block around the object sample, and $\lambda_{2}$ is a regularization coefficient that adjusts the response strength of the filter to the object versus the background, ensuring that the response over the object area is much higher than over the background area; this realizes the penalty on context information. Contextual background information thus assists filter training and substantially improves tracking. However, CACF only collects context information from the object's neighborhood, ignores spatial constraints on the filter, and does not consider object boundary samples, which limits the accuracy of the filter model.
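Under the single-channel reading of equation (4), the Fourier-domain closed form has the shape below; this is our reconstruction under stated assumptions (with illustrative parameter values), not the CACF authors' code:

```python
import numpy as np

def train_cacf(x0, contexts, y, lam1=1e-4, lam2=25.0):
    """Closed-form CACF filter: the object patch x0 regresses to the label y
    while each context patch regresses to zero, so context energy enlarges
    the denominator and is suppressed in the response."""
    X0, Y = np.fft.fft2(x0), np.fft.fft2(y)
    denom = np.conj(X0) * X0 + lam1
    for xc in contexts:                      # context patches as negatives
        Xc = np.fft.fft2(xc)
        denom = denom + lam2 * np.conj(Xc) * Xc
    return np.conj(X0) * Y / denom           # \hat{w}
```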

3. Proposed Method

This section first describes the BSCF tracking model, focusing on filter model construction. Then, the sampling strategy of the adaptive background perception is expounded, and the model optimization solution process is introduced in detail. Finally, the specific implementation steps of the tracking algorithm are given.

3.1. BSCF Model

To improve the spatial adaptive learning ability of the filter and enhance the model's robustness against complex backgrounds, we propose the BSCF model. The objective function is

$$\min_{\mathbf{w}}\ \left\|X_{0}\mathbf{w}-\mathbf{y}\right\|_2^2+\lambda_{1}\sum_{i=1}^{3}\left\|X_{i}\mathbf{w}\right\|_2^2+\lambda_{2}\left\|\mathbf{w}-\mathbf{w}_{r}\right\|_2^2, \quad (5)$$

where $\lambda_{1}$ and $\lambda_{2}$ are regularization parameters and $X_{i}$ is the circulant matrix of the $i$-th maximum-background-response sample. $\mathbf{w}_{r}$ represents the filter spatial reference template, initialized to a Gaussian shape distribution; the template changes with the object's movement, allowing the filter to learn more accurate spatial weights. The second term is the adaptive background sensing term, which collects negative samples of background interference information around the object and adds them to filter model training to upgrade the filter's ability to distinguish the object from the background. The third term is the spatial dynamic constraint term: it introduces the filter's spatial prior information as a predefined spatial template and dynamically updates the filter according to changes in that template, improving robustness. Figure 1 illustrates the BSCF model framework.

Specifically, the proposed BSCF filter adopts an adaptive background-aware sampling strategy to obtain background sample information around the object; the samples are drawn from regions with higher filter response values around the object. We denote the response of the object sample and the filter in the current frame as the matrix $R$ and the set of local maxima of the response map as $\mathcal{P}$. The latter is calculated as

$$\mathcal{P}=\left\{R(m,n)\ \middle|\ R(m,n)\geq R(m',n'),\ \forall (m',n')\in\mathcal{N}(m,n)\right\}, \quad (6)$$

where $\mathcal{N}(m,n)$ denotes the neighborhood of position $(m,n)$.

The background-aware negative samples are taken at the three second-largest values $P_{1}$, $P_{2}$, and $P_{3}$ in $\mathcal{P}$, that is, the local maxima closest in magnitude to the maximum response value at the object center. Figure 2 shows the schematic diagram of background sampling. If these background peaks, whose responses lie next to the object's, are not explicitly distinguished, the filter easily mistakes them for object samples. Therefore, object samples and background-aware samples are separated by an adaptive weighting strategy and then used together for filter training.
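A possible implementation of this peak-mining step, using SciPy's maximum filter to locate local maxima (the function and parameter choices are our assumptions; the paper does not specify them), is:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def background_peaks(response: np.ndarray, k: int = 3, size: int = 5):
    """Return the global response peak and the k next-largest local maxima;
    the latter index the background patches used as negative samples."""
    is_peak = response == maximum_filter(response, size=size)
    coords = np.argwhere(is_peak)
    # sort peak locations by response value, strongest first
    order = np.argsort(response[tuple(coords.T)])[::-1]
    coords = coords[order]
    return coords[0], coords[1:1 + k]
```

The patches cropped at `coords[1:1 + k]` play the role of $X_{1}$, $X_{2}$, and $X_{3}$ in equation (5).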

In contrast to traditional filters, the proposed BSCF introduces background sample information into the filter training process. The comparison with the traditional filter training results is shown in Figure 3.

3.2. Optimization

The proposed BSCF objective function is convex and admits a globally optimal closed-form solution. We want the filter response to be high at the object sample and near zero at the background samples. Accordingly, the object sample circulant matrix $X_{0}$ and the background-aware sample circulant matrices $X_{i}$ in equation (5) can be merged into a single circulant matrix $B$ representing the sample characteristics, and the regression target in equation (5) is correspondingly stacked into $\bar{\mathbf{y}}$: to maintain the filter's constraints on the object and background samples, the object samples regress to the ideal target $\mathbf{y}$ and the background negative samples regress to 0. Equation (5) is then rewritten as

$$\min_{\mathbf{w}}\ \left\|B\mathbf{w}-\bar{\mathbf{y}}\right\|_2^2+\lambda_{2}\left\|\mathbf{w}-\mathbf{w}_{r}\right\|_2^2, \quad (7)$$

where $B=\left[X_{0}^{T},\sqrt{\lambda_{1}}X_{1}^{T},\sqrt{\lambda_{1}}X_{2}^{T},\sqrt{\lambda_{1}}X_{3}^{T}\right]^{T}$ and $\bar{\mathbf{y}}=\left[\mathbf{y}^{T},\mathbf{0}^{T},\mathbf{0}^{T},\mathbf{0}^{T}\right]^{T}$. Setting the partial derivative of equation (7) with respect to $\mathbf{w}$ to 0 gives the closed-form solution of the filter:

$$\mathbf{w}=\left(B^{T}B+\lambda_{2}I\right)^{-1}\left(B^{T}\bar{\mathbf{y}}+\lambda_{2}\mathbf{w}_{r}\right), \quad (8)$$

where $B^{T}$ is the transpose of $B$. A circulant matrix can be expressed in the diagonalized form of the Fourier transform matrix; therefore, the circulant matrices $X_{i}$ and $X_{i}^{T}$ in equation (8) are expressed as

$$X_{i}=F\,\mathrm{diag}\left(\hat{\mathbf{x}}_{i}\right)F^{H},\qquad X_{i}^{T}=F\,\mathrm{diag}\left(\hat{\mathbf{x}}_{i}^{*}\right)F^{H}, \quad (9)$$

where $\mathbf{x}_{i}$ denotes the row vector that generates the circulant matrix, $F$ is the Fourier transform matrix, $\hat{\mathbf{x}}_{i}$ is the Fourier transform of $\mathbf{x}_{i}$, $\hat{\mathbf{x}}_{i}^{*}$ is its conjugate, and $F^{H}$ is the conjugate transpose of $F$. Substituting equation (9) into equation (8) simplifies it to

$$\mathbf{w}=\left(F\,\mathrm{diag}\left(\hat{\mathbf{x}}_{0}^{*}\odot\hat{\mathbf{x}}_{0}+\lambda_{1}\sum_{i=1}^{3}\hat{\mathbf{x}}_{i}^{*}\odot\hat{\mathbf{x}}_{i}+\lambda_{2}\right)F^{H}\right)^{-1}\left(F\,\mathrm{diag}\left(\hat{\mathbf{x}}_{0}^{*}\right)F^{H}\mathbf{y}+\lambda_{2}\mathbf{w}_{r}\right), \quad (10)$$

where the identity matrix $I$ is a circulant matrix whose Fourier transform is the real constant 1.

Since inverting a circulant matrix is equivalent to inverting its eigenvalues, equation (10) is further rewritten as

$$\mathbf{w}=F\,\mathrm{diag}\left(\frac{1}{\hat{\mathbf{x}}_{0}^{*}\odot\hat{\mathbf{x}}_{0}+\lambda_{1}\sum_{i=1}^{3}\hat{\mathbf{x}}_{i}^{*}\odot\hat{\mathbf{x}}_{i}+\lambda_{2}}\right)F^{H}\left(F\,\mathrm{diag}\left(\hat{\mathbf{x}}_{0}^{*}\right)F^{H}\mathbf{y}+\lambda_{2}\mathbf{w}_{r}\right). \quad (11)$$

Using the diagonalization property of the circulant matrix, $F^{H}F=I$, equation (11) simplifies to

$$\mathbf{w}=F\,\mathrm{diag}\left(\frac{\hat{\mathbf{x}}_{0}^{*}}{\hat{\mathbf{x}}_{0}^{*}\odot\hat{\mathbf{x}}_{0}+\lambda_{1}\sum_{i=1}^{3}\hat{\mathbf{x}}_{i}^{*}\odot\hat{\mathbf{x}}_{i}+\lambda_{2}}\right)F^{H}\mathbf{y}+F\,\mathrm{diag}\left(\frac{\lambda_{2}}{\hat{\mathbf{x}}_{0}^{*}\odot\hat{\mathbf{x}}_{0}+\lambda_{1}\sum_{i=1}^{3}\hat{\mathbf{x}}_{i}^{*}\odot\hat{\mathbf{x}}_{i}+\lambda_{2}}\right)F^{H}\mathbf{w}_{r}. \quad (12)$$

Furthermore, by the convolution theorem for circulant matrices, transforming equation (12) into the Fourier domain gives the optimal solution of the filter:

$$\hat{\mathbf{w}}=\frac{\hat{\mathbf{x}}_{0}^{*}\odot\hat{\mathbf{y}}+\lambda_{2}\hat{\mathbf{w}}_{r}}{\hat{\mathbf{x}}_{0}^{*}\odot\hat{\mathbf{x}}_{0}+\lambda_{1}\sum_{i=1}^{3}\hat{\mathbf{x}}_{i}^{*}\odot\hat{\mathbf{x}}_{i}+\lambda_{2}}. \quad (13)$$
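Equation (13) translates directly into a few lines of NumPy; the sketch below assumes the single-channel formulation reconstructed above, with illustrative default parameters:

```python
import numpy as np

def train_bscf(x0, backgrounds, y, w_r, lam1=1.0, lam2=10.0):
    """Fourier-domain closed form of equation (13): the object patch x0
    regresses to y, the mined background patches regress to zero, and the
    filter is pulled toward the spatial reference template w_r."""
    X0, Y = np.fft.fft2(x0), np.fft.fft2(y)
    Wr = np.fft.fft2(w_r)
    denom = np.conj(X0) * X0 + lam2
    for xb in backgrounds:                 # the three background-peak patches
        Xb = np.fft.fft2(xb)
        denom = denom + lam1 * np.conj(Xb) * Xb
    return (np.conj(X0) * Y + lam2 * Wr) / denom
```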

The specific implementation steps of the BSCF tracker are given in Algorithm 1.

Input: Image sample $\mathbf{x}$, filter model $\mathbf{w}$, filter spatial reference template $\mathbf{w}_{r}$, object box position and scale information.
Output: The scale and position information of the target in each frame, the updated filter model $\mathbf{w}$, and the updated appearance model.
(1) Initialization: set the filter spatial reference template $\mathbf{w}_{r}$ to a Gaussian shape distribution.
(2) repeat
(3): Extract the image sample of frame $i$ to form the sample matrix $X_{0}$;
(4): Extract the background information samples by using equation (6);
(5): Form the background information samples into the matrices $X_{1}$, $X_{2}$, and $X_{3}$;
(6): Solve the filter model of the current frame according to equation (5);
(7): Obtain the peak of the filter response, which indicates the object location;
(8): Obtain the position and scale information of the tracked object;
(9): According to equation (13), compute the filter model parameter $\hat{\mathbf{w}}$;
(10): Update the filter model $\mathbf{w}$ and the filter spatial template $\mathbf{w}_{r}$;
(11) until the end of the video sequence (frame $n$).
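To show the control flow end to end, here is a bare-bones, single-channel CF tracking loop following the structure of Algorithm 1 (train on the current frame, detect on the next, interpolate the model). The background-peak mining and spatial reference template from the sketches above are omitted for brevity, and every name and value here is an illustrative assumption:

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    return np.roll(g, (-(h // 2), -(w // 2)), axis=(0, 1))

def crop(frame, center, size):
    """Fixed-size crop around `center`, clamped to the frame borders."""
    h, w = size
    r = int(np.clip(center[0] - h // 2, 0, frame.shape[0] - h))
    c = int(np.clip(center[1] - w // 2, 0, frame.shape[1] - w))
    return frame[r:r + h, c:c + w]

def track(frames, init_center, size=(64, 64), lam=1e-4, eta=0.015):
    """Yields the estimated object center for each grayscale frame."""
    Y = np.fft.fft2(gaussian_label(size))
    center, num, den = init_center, None, None
    for frame in frames:
        Z = np.fft.fft2(crop(frame, center, size))
        if num is not None:
            resp = np.real(np.fft.ifft2(num / den * Z))
            dy, dx = np.unravel_index(resp.argmax(), resp.shape)
            # displacements beyond half the window wrap around
            dy = dy - size[0] if dy > size[0] // 2 else dy
            dx = dx - size[1] if dx > size[1] // 2 else dx
            center = (center[0] + dy, center[1] + dx)
            Z = np.fft.fft2(crop(frame, center, size))  # re-crop at new center
        # linear model update with learning rate eta (numerator/denominator)
        new_num, new_den = np.conj(Z) * Y, np.conj(Z) * Z + lam
        num = new_num if num is None else (1 - eta) * num + eta * new_num
        den = new_den if den is None else (1 - eta) * den + eta * new_den
        yield center
```

Swapping the simple update for the `train_bscf` closed form of equation (13), plus the peak mining of equation (6), would yield the full BSCF loop.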

4. Experiments

4.1. Evaluation Data and Settings

To verify the effectiveness and feasibility of the proposed BSCF, we evaluate it against state-of-the-art methods on three standard tracking benchmarks: OTB100 [23], VOT2018 [24], and TrackingNet [25]. The proposed BSCF tracker is implemented in MATLAB 2017b. An Intel i9-9900X 4.50 GHz processor is used for filter training and testing, and an NVIDIA RTX 2080 Ti GPU is used for acceleration. We set the three regularization coefficients to 1, 0.0001, and 10, respectively, and use the standard filter update rule with a learning rate of 0.015. The search box size is set to padding = 4. All parameters remain fixed in all experiments below.

The specific attribute information of the benchmark datasets used in this study is as follows: the OTB100 dataset consists of 100 sequences that can comprehensively evaluate the overall performance of a method. It covers a series of sequence attributes representing challenging aspects of visual tracking, such as occlusion (OCC), illumination variation (IV), scale variation (SV), deformation (DEF), fast motion (FM), background clutters (BC), and motion blur (MB). The VOT2018 dataset consists of 60 sequences designed to evaluate the short-term tracking performance of a single object in complex scenes. The test sequences cover influencing factors such as camera motion, illumination variation, scale variation, and motion variation; it is an authoritative benchmark at present. The TrackingNet dataset is the first large-scale dataset dedicated to visual object tracking, with more than 30,000 videos covering 27 object categories; the numbers of videos and labels are larger than those of other tracking datasets. The test sequences include a series of influencing factors, such as background interference, camera motion, fast motion, occlusion, and scale variation. The broad and diverse range of objects in TrackingNet ensures the validity of the assessment.

4.2. Experimental Results and Analysis

The experimental results and analysis are presented for three datasets: OTB100, VOT2018, and TrackingNet.

4.2.1. Evaluation on OTB100

The trackers are compared and analyzed on the OTB100 benchmark dataset using two standard metrics under the one-pass evaluation (OPE) protocol as the quantitative evaluation criteria: precision plots and success plots.

The precision plots show the percentage of video frames in which the distance between the center of the tracker's predicted object (bounding box) and the center of the manually labeled object (ground truth) is less than a given threshold; the threshold is set to 20 pixels in the experiments. The center location error is computed as

$$d=\sqrt{\sum_{i}\left(c_{i}^{pr}-c_{i}^{gt}\right)^{2}}, \quad (14)$$

where $i$ indexes the coordinate dimension, $c^{gt}$ is the center point of the ground truth, and $c^{pr}$ is the center point of the object position predicted by the tracker.

The success plots show the percentage of successful frames among all frames, where a frame is successful if the overlap rate between the tracker's predicted box and the ground-truth box is greater than a set threshold (0.5 here). The overlap is computed as

$$S=\frac{\left|A_{gt}\cap A_{pr}\right|}{\left|A_{gt}\cup A_{pr}\right|}, \quad (15)$$

where $A_{gt}$ is the region of the ground truth, $A_{pr}$ is the region of the bounding box predicted by the tracking method, and $\cap$ and $\cup$ respectively denote the intersection and union of the two regions.
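For completeness, the two metrics translate into a few lines of Python (boxes in (x, y, w, h) format; an illustrative sketch, not the OTB toolkit code):

```python
import numpy as np

def center_error(pred_center, gt_center):
    """Center location error of equation (14)."""
    return float(np.linalg.norm(np.asarray(pred_center) - np.asarray(gt_center)))

def iou(box_a, box_b):
    """Overlap ratio of equation (15) for (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def precision_at(errors, thr=20.0):
    """Fraction of frames whose center error is below `thr` pixels."""
    return float(np.mean(np.asarray(errors) <= thr))

def success_at(overlaps, thr=0.5):
    """Fraction of frames whose overlap exceeds `thr`."""
    return float(np.mean(np.asarray(overlaps) >= thr))
```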

The proposed BSCF is compared with nine mainstream CF methods, including ASRCF [21], ECO [5], C-COT [4], TADT [26], DeepSRDCF [9], BACF [19], SRDCF [18], STAPLE [27], and KCF [17]. The precision and the success rate are used as the quantitative evaluation metrics for the experiments. Figures 4–6 show the experimental results.

Figure 4 illustrates the precision and success plot results on the OTB100 dataset. Our BSCF method performs strongly, with a precision of 91.4% and the best success rate score of 69.5%. Compared with the baseline methods SRDCF and BACF, precision improves by 15.8% and 12%, and the success rate improves by 16.2% and 13%, respectively. Compared with the similar improved method ASRCF, the proposed BSCF improves the success rate. Overall, the proposed BSCF achieves performance comparable with that of the existing outstanding ASRCF tracker.

Furthermore, to assess the tracking results of BSCF in different visual scenes, we evaluate the proposed tracker on 8 attributes and report the success plots on the OTB100 benchmark, as shown in Figure 5. These scene attributes are scale variation (SV), background clutters (BC), illumination variation (IV), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), and out-of-plane rotation (OPR). The results show that our BSCF tracker achieves the highest success rate and handles these challenges well, especially scale variation and background clutters.

Taken together, Figures 4 and 5 show that BSCF delivers excellent tracking performance. This confirms that adding the adaptive background perception term and the spatial dynamic constraint to the traditional CF framework lets the tracker learn from more sample information during training: the model remains insensitive to background interference under complex scenes and scene scale variation, and the quantity and quality of object samples are greatly improved. The success rate and precision of the proposed BSCF therefore improve, and it performs well across different complex scenes while using only traditional handcrafted features, without any deep features.

Figure 6 shows qualitative tracking results on different scene sequences of the OTB100 dataset. We select four challenging video sequences for qualitative analysis of scenes with scale variation, background clutters, motion blur, and fast motion.

In the Bird1 and Couple video sequences, most tracking methods achieve stable tracking while the object undergoes no scene scale variation or deformation. During the strong scene variation from the 186th to the 368th frame of the Bird1 sequence, all comparison methods deviate to different degrees; BACF, ECO, and C-COT even lose the object early on, which greatly affects their subsequent real-time tracking, while the proposed BSCF maintains good tracking performance. In the Couple sequence, the proposed BSCF also maintains high tracking accuracy through the substantial variation of the pedestrian movement scene.

We next consider sequences that add fast motion, motion blur, and occlusion to scale variation and deformation; the performance of each tracking method is shown in Figures 6(c) and 6(d), taking the DragonBaby and Matrix video sequences as examples. When fast motion and motion blur occur in frame 42 of the DragonBaby sequence, all trackers except ECO, KCF, and the proposed BSCF lose the object. From the 83rd to the 107th frame, when the object undergoes fast motion and scale variation, ECO and BSCF track stably throughout, while C-COT and KCF cannot adapt to the variation of the object's appearance. In the Matrix sequence, when the scene changes at frame 78, the object moves rapidly and even its appearance changes; only the proposed BSCF accurately stays focused on the object. Even though the motion blur at frame 82 is more severe, and other methods tend to drift or recover only briefly after deviating, BSCF always remains sensitive enough to capture the object.

Overall, BSCF adapts better to scale variation of the object scene and to other background variation caused by changing motion conditions.

4.2.2. Evaluation on VOT2018

The VOT2018 benchmark adopts the expected average overlap (EAO), robustness (R), and accuracy (A) as evaluation metrics. The EAO is obtained by averaging the per-frame overlap between the predicted box and the ground-truth object box over a sequence and then averaging across videos; the larger the EAO value, the better the tracking performance. Robustness evaluates the stability of the tracker and is measured by the number of times the tracking method loses the object; the larger the robustness value, the less stable the tracker. Accuracy is measured by the intersection-over-union between the predicted box and the ground-truth box; the larger the accuracy value, the better the tracking effect.

Table 1 lists the evaluation results of the proposed BSCF and 10 other mainstream tracking methods on VOT2018: ATOM [28], BACF [19], C-COT [4], ECO [5], KCF [17], MDNet [6], SiamRPN++ [29], SRDCF [18], STAPLE [27], and STRCF [30].

Our BSCF achieves the best accuracy, outperforming the other algorithms, both CF-based and DL-based. The proposed BSCF scores 0.397 on the EAO indicator, surpassing the mainstream CF-based methods in general and ranking second only to the DL-based SiamRPN++ and ATOM trackers. In terms of robustness, the proposed BSCF, with a score of 0.03, is second only to the best tracker and essentially achieves robust tracking. Compared with traditional CF tracking methods, the proposed BSCF enhances robustness while preserving accuracy and algorithmic stability, achieving an effective improvement over CF-based methods.

4.2.3. Evaluation on TrackingNet

The TrackingNet benchmark uses success, precision, and normalized precision as evaluation metrics. The first two indicators follow the definitions of the OTB100 dataset, and the normalized precision better avoids the influence of image scale and object box size on the precision measurement. Table 2 compares the experimental results of BSCF with those of 9 top-performing trackers on TrackingNet: ATOM [28], BACF [19], DiMP [31], ECO [5], KCF [17], MAML [32], SRDCF [18], SiamRPN++ [29], and Staple_CA [20].

Compared with the CF methods, the proposed BSCF shows a significant gain on TrackingNet. Compared with tracking methods built on deep backbone networks, the proposed BSCF tracker still has room for improvement: although it is on par with SiamRPN++, DiMP, and ATOM, a certain gap remains with respect to MAML.

5. Conclusion

In this work, we develop an adaptive background-aware and spatially dynamically constrained CF tracking model. Adaptive background-aware constraints and dynamic spatial constraints are introduced into the original tracking framework to screen negative samples of background information around the object; this avoids filter degradation caused by dynamic changes in the object and effectively improves the background awareness and interference discriminability of the tracker. The method is evaluated on three challenging object tracking benchmarks, where the proposed BSCF exhibits competitive performance compared with current mainstream CF-based and DL-based trackers. The accuracy and success rate improve greatly under eight tracking challenges, including scale variation, background clutters, illumination variation, and motion blur. The proposed BSCF can better meet the object tracking requirements of complex scenes and achieve efficient tracking.

Data Availability

The data used to support the findings of this study are available from the author upon request. For details, please contact via e-mail: [email protected].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported in part by Zhejiang Provincial Science and Technology Program in China under grant no. 2022C01083, in part by the Natural Science Foundation of Jiangsu Province under grant no. BK20211539, in part by the National Natural Science Foundation of China under grant nos. U20B2065, 61972206, 62011540407, and J2124006, in part by the 15th Six Talent Peaks Project in Jiangsu Province under grant no. RJFW-015, in part by the Qing Lan Project, and in part by the PAPD Fund.