Abstract

Structured output tracking is a visual target tracking algorithm with excellent comprehensive performance in recent years. However, its classifier produces erroneous information, resulting in target loss or tracking failure, when the target is occluded or its scale changes during tracking. In this work, a real-time structured output tracker with scale adaption is proposed: (1) target position prediction is added to the tracking process to improve real-time performance; (2) an adaptive target scale discrimination scheme is introduced into the structured SVM to improve overall tracking accuracy; and (3) the Kalman filter is used to solve the occlusion problem of continuous tracking. Extensive evaluations on the OTB-2015 benchmark dataset with 100 sequences show that the proposed tracking algorithm runs at a highly efficient 84 fps and performs favorably against other tracking algorithms.

1. Introduction

Visual target tracking is an important task in computer vision and has been widely used in various fields, such as intelligent transportation systems, military, and medical applications [13]. Many scholars have studied target tracking extensively and have achieved great progress. However, tracking still faces many challenges, such as target scale changes, occlusion, and illumination changes [4, 5]. Therefore, ensuring accuracy and improving real-time performance during tracking are of great theoretical and practical significance [6, 7].

The Struck tracker is an algorithm based on a discriminative classifier, and it adapts well to various complex backgrounds in visual target tracking [8, 9]. The Struck tracker constructs a structured output support vector machine (SVM) classifier through an online learning method. In the tracking process, the algorithm first samples the local area around the target position in the previous frame and then takes the maximizer of the classifier discriminant function as the current tracking result [10]. When updating the classifier, Struck dispenses with an explicit sample-labeling step. As a result, in the case of occlusion and scale variation, the Struck tracker cannot track the target correctly [11, 12]. This paper proposes a structured output SVM tracker with real-time and occlusion detection capabilities. This work includes the following. (1) A target position prediction process is introduced. Sparse samples are used to compute a rough target position from the discriminator, which reduces the amount of computation needed to determine the target position and improves the real-time performance. (2) A multiscale-sampling adaptive scale tracking strategy is proposed. A scale discriminator computes the scale change of the target, so the scale can be adjusted adaptively. (3) The discriminant function of the SVM classifier is used to implement occlusion detection. The SVM classifier stops updating when a certain degree of occlusion is detected, and the Kalman filter is applied to predict the target position when the target has sufficient motion information, so the target can still be tracked successfully.

2. Related Work

With the development of target tracking, many visual target tracking algorithms have been investigated [13]. The mainstream target tracking algorithms include traditional tracking methods, correlation filtering-based tracking methods, and tracking methods based on deep learning [14, 15].

2.1. Traditional Tracking Methods

Caulfield and Dawson-Howe [16] proposed the FragTrack method, which compares template histograms with sub-block histograms to obtain the possible locations of the target. Wang and Li [17] suggested a mean shift algorithm that uses color features to obtain the probability density map of the overall image. Wu et al. [18] introduced a multiple-instance learning method to reduce target drift and improve tracking accuracy. Babenko et al. [19] provided a compressive tracking algorithm to deal with occlusions and image noise. Chunxiao et al. [20] developed a tracking-learning-detection method combining traditional tracking algorithms with detection algorithms to handle target occlusions. Hare et al. [21] presented a structured output SVM tracker that is robust to occlusions of the tracking target.

2.2. Correlation Filtering-Based Tracking Methods

Bolme et al. [22] used a minimum output sum of squared error (MOSSE) filter, which improved tracking accuracy over many traditional algorithms. Henriques et al. [23] developed a circulant structure kernel tracker that builds a circulant matrix to solve the ridge regression problem. Henriques et al. [24] proposed the kernel correlation filter, which achieves excellent tracking accuracy but cannot handle illumination and scale changes. Zhang and Zheng [25] presented a spatiotemporal context tracking algorithm that achieves excellent real-time performance. Li and Zhu [26] suggested a scale adaptive multifeature tracking algorithm to handle scale changes of the tracking target.

2.3. Tracking Methods Based on Deep Learning

Held et al. [27] presented the GOTURN tracking method, applying an end-to-end deep learning model to target tracking for the first time. Nam and Han [28] developed the MDNet tracking algorithm, which updates the model online to adapt to changing targets and scenarios. Danelljan et al. [29] proposed the C-COT tracking algorithm, which uses a deep neural network to extract target features and handles target scale changes. Danelljan et al. [30] proposed the efficient convolution operators (ECO) tracking algorithm, which addresses the excessive size of the C-COT model and improves real-time performance.

Although these methods are somewhat effective in dealing with complex backgrounds, they have two shortcomings: (1) they cannot judge the current occlusion state, so error information is introduced into the classifier when the target is occluded; and (2) their performance degrades significantly when the target scale changes. This work uses the discriminant function value of the SVM classifier to achieve occlusion detection by combining the framework of the Struck tracker with the motion information of the target. The classifier stops updating when partial occlusion is detected, and the Kalman filter is used to predict the target position when the target has sufficient motion information. To address changes in target scale, a multiscale sampling strategy is proposed, and the optimal scale is computed with a scale discriminator.

3. Proposed Method

First, the Struck tracker is reviewed; then the real-time structured output tracker with scale adaption and the full details of the presented algorithm are given.

3.1. The Struck Tracker

During the tracking process, let $p_t$ be the estimated bounding box at time $t$. The Struck algorithm estimates the target displacement $y_t \in \mathcal{Y}$, where $\mathcal{Y}$ is the search space, given as $\mathcal{Y} = \{(\Delta u, \Delta v) \mid \Delta u^2 + \Delta v^2 < r^2\}$ ($r$ is the search radius, and $\Delta u$ and $\Delta v$ are the two-dimensional space coordinates). The current frame target position, expressed as $p_t = p_{t-1} \circ y_t$, can be obtained by shifting the previous frame position by $y_t$.
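As an illustration, the circular search space above can be enumerated on a pixel grid; this is a minimal sketch (the grid step parameter and the function name are ours, not the paper's):

```python
def search_space(radius, step=1):
    """Enumerate displacements (du, dv) on a grid inside a circle of
    the given radius, i.e. Y = {(du, dv) : du^2 + dv^2 <= r^2}."""
    offsets = []
    for du in range(-radius, radius + 1, step):
        for dv in range(-radius, radius + 1, step):
            if du * du + dv * dv <= radius * radius:
                offsets.append((du, dv))
    return offsets
```

Each offset shifts the previous bounding box to produce one candidate sample; a larger step gives the sparser sampling used later for rough position prediction.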

The prediction function is constructed to estimate target transformations between frames, and the output space is the space of all transformations instead of the binary labels. We introduce a discriminant function $F$ to predict the target position between frames as follows:

$$F(x, y) = \langle w, \Phi(x, y) \rangle, \qquad (1)$$

which evaluates the similarity between sample $x$ and transformation $y$ ($w$ is the coefficient vector, and $\Phi$ is the mapping from the input space to the feature space induced by the kernel function). Given the sample set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, the optimal hyperplane is found by the optimization objective function

$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.}\ \ \forall i: \xi_i \ge 0;\ \ \forall i,\ \forall y \ne y_i:\ \big\langle w,\ \Phi(x_i, y_i) - \Phi(x_i, y) \big\rangle \ge \Delta(y_i, y) - \xi_i, \qquad (2)$$

where $\Delta(y_i, y)$ is the loss function and $C$ is the regularization parameter. This loss function should decrease towards 0 as $y$ and $y_i$ become more similar and is 0 only on the condition that $y = y_i$. We now choose the loss function based on the overlap of target bounding boxes as

$$\Delta(y, \bar{y}) = 1 - s_{p_t}(y, \bar{y}), \qquad (3)$$

where $s_{p_t}(y, \bar{y})$ is the overlap (intersection over union) between the target bounding boxes $p_t \circ y$ and $p_t \circ \bar{y}$. Using the standard Lagrangian duality techniques, the solution of equation (2) can be converted into its equivalent dual form:

$$\max_{\alpha}\ \sum_{i,\, y \ne y_i} \Delta(y, y_i)\,\alpha_i^{y} - \frac{1}{2}\sum_{i,\, y \ne y_i}\sum_{j,\, \bar{y} \ne y_j} \alpha_i^{y}\alpha_j^{\bar{y}} \big\langle \Phi(x_i, y_i) - \Phi(x_i, y),\ \Phi(x_j, y_j) - \Phi(x_j, \bar{y}) \big\rangle \quad \text{s.t.}\ \ \alpha_i^{y} \ge 0,\ \ \sum_{y \ne y_i} \alpha_i^{y} \le C, \qquad (4)$$

replacing the parameters $\alpha_i^{y}$ in equation (4) as follows:

$$\beta_i^{y} = \begin{cases} -\alpha_i^{y}, & y \ne y_i, \\ \sum_{\bar{y} \ne y_i} \alpha_i^{\bar{y}}, & y = y_i. \end{cases} \qquad (5)$$

According to equation (5), equation (4) can be written as

$$\max_{\beta}\ -\sum_{i,\, y} \Delta(y, y_i)\,\beta_i^{y} - \frac{1}{2}\sum_{i,\, y}\sum_{j,\, \bar{y}} \beta_i^{y}\beta_j^{\bar{y}} \big\langle \Phi(x_i, y), \Phi(x_j, \bar{y}) \big\rangle \quad \text{s.t.}\ \ \beta_i^{y} \le \delta(y, y_i)\,C,\ \ \sum_{y} \beta_i^{y} = 0, \qquad (6)$$

where $\delta(y, y_i) = 1$ if $y = y_i$ and $\delta(y, y_i) = 0$ otherwise.

Then, the discriminant function can be simplified as

$$F(x, y) = \sum_{i,\, \bar{y}} \beta_i^{\bar{y}} \big\langle \Phi(x_i, \bar{y}), \Phi(x, y) \big\rangle \qquad (7)$$

or, in terms of the kernel function $k$,

$$F(x, y) = \sum_{i,\, \bar{y}} \beta_i^{\bar{y}}\, k\big((x_i, \bar{y}), (x, y)\big). \qquad (8)$$

We refer to the samples with $\beta_i^{y} \ne 0$ as support vectors. For a given support pattern $x_i$, only the support vector $(x_i, y_i)$ has $\beta_i^{y_i} > 0$, and any other support vector $(x_i, y)$ with $y \ne y_i$ has $\beta_i^{y} < 0$. We refer to these as positive and negative support vectors, respectively. The selection of support vectors is controlled by the following gradient:

$$g_i(y) = -\Delta(y, y_i) - F(x_i, y). \qquad (9)$$

It can be seen from equation (9) that the gradient calculation involves the overlap between the sample and the target bounding box; the algorithm updates the coefficients $\beta$ and the gradients incrementally in each frame to implement the learning and update of the classifier.
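The gradient-driven support vector selection can be sketched as follows; the `loss` and `score` callables are stand-ins for the bounding-box loss and the discriminant function (the toy one-dimensional labels in the usage are illustrative), and an SMO-style step processes the labels with the largest and smallest gradient:

```python
def gradient(loss, score, y_candidates, y_i):
    """g_i(y) = -loss(y, y_i) - score(y): the gradient used to decide
    which label becomes a positive or negative support vector."""
    return {y: -loss(y, y_i) - score(y) for y in y_candidates}

def smo_pair(grads):
    """An SMO-style step processes the highest-gradient label (candidate
    positive support vector) and the lowest (candidate negative one)."""
    y_plus = max(grads, key=grads.get)
    y_minus = min(grads, key=grads.get)
    return y_plus, y_minus
```

With a zero score and an absolute-difference toy loss, the true label gets the highest gradient and the most dissimilar label the lowest, matching the positive/negative support vector roles described above.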

3.2. The Real-Time Structured Output Tracker with Scale Adaption

The output of the Struck tracker is only the target position, and the accuracy of the target position decreases when the scale changes, so we propose a real-time structured output model with scale adaption. The model takes as input the target bounding box of the previous frame, the candidate target position set $Y$ in the current frame, and the candidate target scale set $S$ in the current frame, and it outputs the exact position $y^*$ and the target scale $s^*$ in the current frame; the model can be represented using the following decision function:

$$(y^*, s^*) = \arg\max_{y \in Y,\, s \in S} F(x, y, s), \qquad (10)$$

where $F$ is the discriminant function. The discriminant function in the model is

$$F(x, y, s) = \langle w, \Phi(x, y, s) \rangle, \qquad (11)$$

where $w$ is the coefficient vector, $\Phi(x, y, s)$ is the structured feature function, and $\langle \cdot, \cdot \rangle$ is the inner product operation. Combining equations (10) and (11), the adaptive scale tracking model is

$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.}\ \ \forall i: \xi_i \ge 0;\ \ \forall i,\ \forall (y, s) \ne (y_i, s_i):\ \big\langle w,\ \Phi(x_i, y_i, s_i) - \Phi(x_i, y, s) \big\rangle \ge \Delta\big((y_i, s_i), (y, s)\big) - \xi_i, \qquad (12)$$

where $n$ is the sampling number, $\xi_i$ is the relaxation amount (slack variable), and $\Delta$ is the loss function. The loss function can be written as

$$\Delta\big((y_i, s_i), (y, s)\big) = 1 - s_{p}\big(x_i^{(y_i, s_i)},\ x_i^{(y, s)}\big), \qquad (13)$$

where $x_i^{(y_i, s_i)}$ is the sample block when the target position and scale are $(y_i, s_i)$, $x_i^{(y, s)}$ is the sample block when the target position and scale are $(y, s)$, and $s_p$ is their overlap.

We can transform the structured SVM problem of equation (12) into a dual problem:

$$\max_{\alpha}\ \sum_{i,\,(y,s) \ne (y_i,s_i)} \Delta\big((y_i,s_i),(y,s)\big)\,\alpha_i^{(y,s)} - \frac{1}{2}\sum_{i,\,(y,s)}\sum_{j,\,(\bar{y},\bar{s})} \alpha_i^{(y,s)}\alpha_j^{(\bar{y},\bar{s})} \big\langle \delta\Phi_i(y,s),\ \delta\Phi_j(\bar{y},\bar{s}) \big\rangle \quad \text{s.t.}\ \ \alpha_i^{(y,s)} \ge 0,\ \ \sum_{(y,s) \ne (y_i,s_i)} \alpha_i^{(y,s)} \le C, \qquad (14)$$

where $\delta\Phi_i(y,s) = \Phi(x_i,y_i,s_i) - \Phi(x_i,y,s)$. The discriminant function is

$$F(x,y,s) = \sum_{i,\,(\bar{y},\bar{s})} \alpha_i^{(\bar{y},\bar{s})} \big\langle \delta\Phi_i(\bar{y},\bar{s}),\ \Phi(x,y,s) \big\rangle. \qquad (15)$$

When $(y, s) \ne (y_i, s_i)$, set $\beta_i^{(y,s)} = -\alpha_i^{(y,s)}$, or set $\beta_i^{(y_i,s_i)} = \sum_{(\bar{y},\bar{s}) \ne (y_i,s_i)} \alpha_i^{(\bar{y},\bar{s})}$ when $(y, s) = (y_i, s_i)$, and equation (14) can be converted as

$$\max_{\beta}\ -\sum_{i,\,(y,s)} \Delta\big((y_i,s_i),(y,s)\big)\,\beta_i^{(y,s)} - \frac{1}{2}\sum_{i,\,(y,s)}\sum_{j,\,(\bar{y},\bar{s})} \beta_i^{(y,s)}\beta_j^{(\bar{y},\bar{s})} \big\langle \Phi(x_i,y,s),\ \Phi(x_j,\bar{y},\bar{s}) \big\rangle \quad \text{s.t.}\ \ \beta_i^{(y,s)} \le \delta\big((y,s),(y_i,s_i)\big)\,C,\ \ \sum_{(y,s)} \beta_i^{(y,s)} = 0, \qquad (16)$$

where $\delta\big((y,s),(y_i,s_i)\big) = 1$ when $y = y_i$ and $s = s_i$, else $\delta\big((y,s),(y_i,s_i)\big) = 0$. We can get the discriminant function as follows:

$$F(x,y,s) = \sum_{i,\,(\bar{y},\bar{s})} \beta_i^{(\bar{y},\bar{s})} \big\langle \Phi(x_i,\bar{y},\bar{s}),\ \Phi(x,y,s) \big\rangle. \qquad (17)$$

By selecting the proper kernel function $k$, the discriminant function is

$$F(x,y,s) = \sum_{i,\,(\bar{y},\bar{s})} \beta_i^{(\bar{y},\bar{s})}\, k\big((x_i,\bar{y},\bar{s}),\ (x,y,s)\big). \qquad (18)$$

Equation (18) is the final form of the discriminant function in the adaptive scale tracking model.
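The overlap-based loss used throughout (equations (3) and (13)) can be sketched directly; boxes are (x, y, w, h) tuples, and the function names are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def structured_loss(y, y_bar):
    """Loss of equation (3): 1 - overlap; it is 0 iff the boxes coincide
    and grows towards 1 as the boxes become disjoint."""
    return 1.0 - iou(y, y_bar)
```

This is the quantity the structured SVM constraints push to be smaller than the margin between the true label's score and any competing label's score.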

The tracking process of the algorithm is divided into two parts: target position prediction, and determination of the target position and scale.

In the process of target position prediction, the previous frame target position is evenly sampled according to the sparse sampling method:

$$Y_t^{\text{sparse}} = \big\{\, p_{t-1} \circ (\Delta u, \Delta v) \;\big|\; \Delta u = m d,\ \Delta v = n d,\ m, n \in \mathbb{Z},\ \Delta u^2 + \Delta v^2 \le r^2 \,\big\}, \qquad (19)$$

where $d$ is the sparse sampling step and $r$ is the search radius.

The structured feature vector of each corresponding position is obtained by extracting the features of the samples:

$$V_t = \big\{\, \Phi(x_t, y) \;\big|\; y \in Y_t^{\text{sparse}} \,\big\}. \qquad (20)$$

The feature vector that maximizes the decision function is selected according to the discriminant function of equation (18) and the decision function of equation (10); its corresponding sample gives the target position prediction result $\hat{y}_t$, the rough target position.

The process of determining the target position and scale includes four steps:

(1) We set up a set of scales $S = \{s_1, s_2, \ldots, s_k\}$ whose elements satisfy $s_{j+1} - s_j = \tau$ (where $\tau$ is the step size).

(2) Taking the rough estimated target location $\hat{y}_t$ as the center, the search area is further reduced, and we sample over the scale set in the reduced search area and get the sampling results:

$$X_t = \big\{\, x_t^{(y, s_j)} \;\big|\; y \in Y_t^{\text{dense}},\ s_j \in S \,\big\}, \qquad (21)$$

where $Y_t^{\text{dense}}$ is the set of densely sampled positions around the rough position.

(3) Perform feature extraction on each sample and obtain the structured feature vectors

$$V_t' = \big\{\, \Phi(x, y, s_j) \;\big|\; x \in X_t \,\big\}. \qquad (22)$$

(4) Select the feature vector that maximizes the decision function according to equations (18) and (22); its corresponding sample gives the exact position $y_t^*$ and the target scale $s_t^*$, which constitute the tracking result.
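The two-stage search (sparse sampling for a rough position, then dense multiscale sampling around it) can be sketched as follows; `score` stands in for the decision function, and all offsets, scales, and names are illustrative:

```python
def two_stage_search(score, prev_box, sparse_offsets, dense_offsets, scales):
    """Stage 1: sparse sampling around the previous box gives a rough
    position. Stage 2: dense sampling at several scales around the rough
    position gives the final position and scale."""
    x, y, w, h = prev_box
    # Stage 1: rough position from sparse samples at the previous scale.
    rough = max(((x + du, y + dv, w, h) for du, dv in sparse_offsets),
                key=score)
    rx, ry, _, _ = rough
    # Stage 2: dense samples at each candidate scale around the rough position.
    candidates = [(rx + du, ry + dv, w * s, h * s)
                  for du, dv in dense_offsets
                  for s in scales]
    return max(candidates, key=score)
```

Because the sparse stage touches far fewer candidates than an exhaustive dense search over the whole radius, the classifier is evaluated much less often per frame, which is where the real-time gain comes from.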

The performance of the Struck tracker decreases significantly when the target is partially or completely occluded. We therefore perform active occlusion detection during the tracking of each frame.

Let $F_t(p)$ be the value of the classifier discriminant function at the search position $p$ in frame $t$ (where $t$ is the frame number of the tracking target):

$$F_t(p) = \sum_{i,\,(\bar{y},\bar{s})} \beta_i^{(\bar{y},\bar{s})}\, k\big((x_i,\bar{y},\bar{s}),\ (x_t, p, s_t)\big), \qquad (23)$$

that is, equation (18) evaluated at position $p$ with the current scale $s_t$.

For the current frame, the classifier discriminant function reaches its maximal value $v_t$ at the target position $p_t$; then

$$v_t = F_t(p_t) = \max_{p} F_t(p). \qquad (24)$$

The change rate of $v_t$ relative to its recent history can be defined as

$$\lambda_t = \frac{v_t}{\bar{v}_t}, \qquad (25)$$

where $\bar{v}_t$ is the average of the discriminant values over the recent history frames.

The queue $Q$ (of fixed length $L$) is constructed to store the historical values $v$ of the last $L$ frames, and $\bar{v}_t$ is the average value of the elements in queue $Q$. After the target is tracked in each frame, the current frame value $v_t$ is added to the queue to update the occlusion detection.
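The history queue and the change-rate test can be sketched together; this assumes the change rate is the ratio of the current discriminant value to the queue average, and the default history length and threshold below are illustrative:

```python
from collections import deque

class OcclusionDetector:
    """Keeps the classifier's maximal discriminant value over the last
    `history` frames; a drop of the current value relative to the running
    average flags occlusion, and the history is then frozen."""

    def __init__(self, history=10, threshold=0.5):
        self.values = deque(maxlen=history)  # queue Q of historical values
        self.threshold = threshold

    def occluded(self, v_t):
        if not self.values:              # no history yet: assume visible
            self.values.append(v_t)
            return False
        avg = sum(self.values) / len(self.values)
        rate = v_t / avg                 # change rate of the discriminant value
        if rate >= self.threshold:       # visible: keep updating the history
            self.values.append(v_t)
            return False
        return True                      # occluded: freeze queue and classifier
```

Freezing the queue on an occluded frame matters: otherwise the depressed discriminant values would drag the average down and the detector would quickly stop noticing the occlusion.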

The threshold $\theta$ of occlusion detection is introduced to distinguish whether the target is occluded or not. When $\lambda_t \ge \theta$, the algorithm continues to track the target. When $\lambda_t < \theta$, the target is judged to be occluded, and the algorithm stops updating both the elements of queue $Q$ and the classifier, ensuring that the classifier does not introduce erroneous information.
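While the classifier is frozen, the Kalman filter carries the target position forward from its motion history. A minimal constant-velocity sketch for one coordinate (one filter per axis; the paper does not specify its noise settings, so `q` and `r` below are illustrative):

```python
class KalmanCV1D:
    """Minimal constant-velocity Kalman filter for one coordinate.
    During occlusion only predict() is called, so the position is
    extrapolated from the estimated velocity."""

    def __init__(self, pos, q=1e-3, r=1e-1):
        self.x = [pos, 0.0]                  # state: position, velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]    # state covariance
        self.q, self.r = q, r                # process / measurement noise

    def predict(self, dt=1.0):
        px, v = self.x
        self.x = [px + v * dt, v]            # x = F x, F = [[1, dt], [0, 1]]
        p00, p01 = self.P[0]
        p10, p11 = self.P[1]
        # P = F P F^T + Q (process noise added on the diagonal)
        self.P = [[p00 + dt * (p10 + p01) + dt * dt * p11 + self.q,
                   p01 + dt * p11],
                  [p10 + dt * p11, p11 + self.q]]
        return self.x[0]

    def update(self, z):
        # Measurement H = [1, 0]: we observe position only.
        s = self.P[0][0] + self.r            # innovation covariance
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s
        y = z - self.x[0]                    # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        p00, p01 = self.P[0]
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [self.P[1][0] - k1 * p00, self.P[1][1] - k1 * p01]]
        return self.x[0]
```

In normal frames the tracker's output position feeds `update()`, maintaining the velocity estimate; during detected occlusion only `predict()` runs, so the search window follows the extrapolated trajectory until the target reappears.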

3.3. Steps of the Suggested Method

The steps of the proposed method are summarized in Algorithm 1.

Input: the t-th video frame; the target position set of the previous frame; the target scale set of the previous frame; the occlusion threshold θ
Output: the target prediction position set of the current frame; the structured feature vector of the rough position; the target rough position; the structured feature vector of the accurate position; the target scale of the current frame; the accurate position of the target
Repeat for each frame:
 Target rough position prediction process
  (1) Obtain the sparse sampling result of the target prediction positions from equation (19)
  (2) Get the structured feature vectors according to equation (20) by performing feature extraction on the samples
  (3) Take the structured feature vector with the maximal decision function as the target rough position according to equations (10) and (18)
 Determination of the target accurate position and scale
  (4) Set up the set of k scales and obtain scale samples
  (5) Densely sample each scale of the set, taking the estimated rough position as the center, and get k sets of sampling results from equation (21)
  (6) Perform feature extraction on the samples to obtain k groups of structured feature vectors according to equation (22)
  (7) Obtain the target scale and the accurate position by finding the feature vector that maximizes the decision function
 Target occlusion judgment process
  (8) Set the occlusion threshold θ
  (9) Calculate the change rate of the discriminant function according to equations (24) and (25)
  (10) If the change rate is below θ: the tracking target is occluded, and the classifier and the queue elements stop updating
  (11) Else: the tracking target is not occluded, and the tracker continues to track the target
Until the end of all frames
End
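The control flow of Algorithm 1 can be sketched as a per-frame loop; each stage is injected as a callable so the structure stays visible (all stage names are ours, and the stages themselves are stand-ins):

```python
def track(frames, init_box, rough_search, fine_search, occluded,
          kalman, update_classifier):
    """One pass of the per-frame loop: rough position, then accurate
    position and scale, then the occlusion branch."""
    box = init_box
    results = []
    for frame in frames:
        rough = rough_search(frame, box)        # steps (1)-(3)
        box, scale = fine_search(frame, rough)  # steps (4)-(7)
        if occluded(frame, box):                # steps (8)-(10)
            box = kalman.predict()              # extrapolate through occlusion
        else:                                   # step (11)
            kalman.update(box)                  # refresh motion information
            update_classifier(frame, box)       # online SVM update
        results.append(box)
    return results
```

The key property is that on occluded frames neither the classifier nor the Kalman filter is updated with the (unreliable) tracker output; only the prediction side of the filter runs.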

4. Experiments

The tracking performance of six trackers, including the structured output tracking with kernels (Struck), the kernel correlation filter (KCF), the background-aware correlation filter (BACF), compressive tracking (CT), tracking-learning-detection (TLD), and the OURS tracker, is tested on the OTB-2015 benchmark dataset. The OTB-2015 dataset comprises sequences with 11 different complex backgrounds, namely, scale variation (SV), deformation (DEF), illumination variation (IV), out of view (OV), background clutter (BC), occlusion (OCC), motion blur (MB), fast motion (FM), out-of-plane rotation (OPR), in-plane rotation (IPR), and low resolution (LR). The overlap success rate and the center position error are used as evaluation criteria for the trackers [31, 32]. The overlap success rate reflects the degree of overlap between the tracked target bounding box and the ground-truth bounding box [33, 34]. The center position error indicates the offset between the output target center position and the labeled center position [35, 36].
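The two evaluation criteria can be sketched as follows; boxes are (x, y, w, h) tuples, the overlap values are assumed precomputed per frame, and the function names are illustrative:

```python
def center_error(box, gt):
    """Euclidean distance between the predicted and ground-truth box
    centers (boxes as (x, y, w, h))."""
    cx, cy = box[0] + box[2] / 2, box[1] + box[3] / 2
    gx, gy = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
    return ((cx - gx) ** 2 + (cy - gy) ** 2) ** 0.5

def success_rate(ious, threshold=0.5):
    """Fraction of frames whose overlap with the ground truth exceeds
    the threshold; sweeping the threshold yields the success plot."""
    return sum(1 for v in ious if v > threshold) / len(ious)
```

A precision plot is obtained analogously by thresholding the per-frame center errors instead of the overlaps.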

4.1. Parameter Setup

Experiments are implemented in MATLAB on an Intel i7-8565U 1.8 GHz CPU with 8 GB RAM. The regularization parameter $C$ is set to 100, the support vector budget is set to 100, the original search range is three times the target size of the previous frame, the size of the history queue $Q$ is 10, and the target bounding box size change factor is 0.1; the number of scales $k$, the occlusion detection threshold $\theta$, and the search radius $r$ are fixed throughout the experiments.

4.2. Experiments on OTB-2015 Benchmark Dataset

The tracking performance of the different trackers is tested on the OTB-2015 benchmark dataset. The tracking results are shown in Figure 1, which shows that the OURS tracker outperforms the comparison trackers. The OURS tracker and the Struck tracker are also tested under various complex backgrounds; the selected sequence information is shown in Table 1, and their average success rate and average precision plots under various complex backgrounds are shown in Figure 2. Figure 2 shows that the OURS tracker performs better than the Struck tracker on the OTB-2015 benchmark dataset.

The success rates of the various trackers on the selected sequences are demonstrated in Table 2. Table 2 shows that the tracking performance of the OURS tracker is better than those of the other comparison algorithms.

The frames per second of the various trackers on the selected sequences are shown in Table 3. Table 3 shows that the running speed of the OURS tracker is higher than those of the other comparison algorithms.

4.3. The Success Rate on More Challenging Video Sequences

In addition, the tracking success rate plots under 11 challenging video sequences are shown in Figure 3. We can see from Figure 3 that the OURS tracker shows better tracking performance than the other trackers under the 11 challenging video sequences. Therefore, the OURS tracker handles challenging video sequences better than the other comparison trackers.

4.4. Experiments on Various Complex Video Sequences

In the experiment on various complex video sequences, the comparison algorithms including OURS, Struck, KCF, and TLD are selected to test various video sequences, and their corresponding performance is discussed. The video sequences are Faceocc1, Coke, Faceocc2, Deer, Man, and Dog. The sequence information is shown in Table 1, and the test results are shown in Figures 4–9 (the original images of Figures 4–9 are from https://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html).

In the video sequence Faceocc1, the challenge during tracking is partial occlusion. The TLD tracker and the Struck tracker begin to drift at the 44th frame. When the target emerges from partial occlusion, the comparison trackers lose it at the 69th frame, but the OURS tracker can deal with the problem of partial occlusion.

In the video sequence Coke, the target is affected by complete occlusion. The target is completely occluded at the 256th frame. The Struck, KCF, and TLD trackers fail at the 276th frame, but the OURS tracker can deal with the full occlusion problem.

In the video sequence Faceocc2, partial occlusion affects the target during the tracking process. The target is partially occluded at the 261st frame. The comparison trackers lose the target when it emerges from partial occlusion at the 294th frame, but the OURS tracker can solve the partial occlusion problem.

In the video sequence Deer, the tracking target undergoes fast motion. The Struck tracker and the TLD tracker lose the target at the 19th frame. The KCF tracker loses the target at the 60th frame, but the OURS tracker can solve the problem of fast motion.

In the video sequence Man, illumination variation affects the target during the tracking process. The TLD tracker and the Struck tracker begin to drift at the 35th frame. All comparison trackers lose the target when the background illumination changes gradually, but the OURS tracker can solve the illumination variation problem.

In the video sequence Dog, the tracking target changes scale. The comparison trackers begin to drift at the 911th frame and lose the target at the 1034th frame, but the OURS tracker can adapt to changes in target scale and track the target correctly.

5. Conclusion

In this work, a real-time structured output tracker with scale adaption is proposed: (1) a target position prediction process, which improves real-time tracking performance, is added to the tracking process; (2) multiscale sampling is used to obtain samples of different scales, and the best scale is obtained with a discriminator to improve tracking accuracy; and (3) an occlusion judgment mechanism is suggested to determine whether to update the classifier, and Kalman filtering is applied to solve the problem of continuous tracking under occlusion.

The tracking performance of the OURS tracker is better than those of the other trackers in the different research cases due to the following advantages: the OURS tracker uses a multiscale sampling strategy to estimate the scale of the target during tracking, and it uses the Kalman filter to solve the tracking problem under target occlusion. The experimental results show that the proposed tracker performs excellently when processing various complex backgrounds on the OTB-2015 dataset, and it also achieves excellent success rate and tracking accuracy under different challenging complex backgrounds.

In the future, our research will focus on applying the proposed algorithm to multitarget tracking, given its successful application to single target tracking.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61671222) and the Postgraduate Research and Practice Innovation Program of Jiangsu Province (no. KYCX19_1693).