#### Abstract

The representation of the target object is an important factor in building a robust visual object tracking algorithm. Complementary learners, which represent the target object with both a color histogram and a correlation filter, are attractive because the advantages of each representation can compensate for the drawbacks of the other. Even so, a tracking algorithm that uses complementary learners can still fail because of a distractor. In this study, we show that, to handle the distractor, it must first be detected from the responses of the color-histogram- and correlation-filter-based representations. Then, the target location is determined either from the merged responses of both representations or from the response of the correlation filter alone, depending on the result of the distractor detection process. Experiments were performed on the widely used VOT2014 and VOT2015 benchmark datasets. The results show that the proposed method performs favorably compared with several state-of-the-art visual tracking algorithms.

#### 1. Introduction

Given the initial state (e.g., position and other information) of a target object in the first frame, the goal of visual tracking is to predict the states of the target in subsequent frames. Visual tracking has an important role in several computer vision applications, such as motion analysis, visual surveillance, human-computer interaction, and robot navigation. Although this problem has been studied for several decades and considerable progress has been made, it still presents challenges, in particular the development of a robust algorithm that can cope with occlusions, camera motion, illumination changes, motion changes, and size changes.

An important factor in creating a robust visual tracking algorithm is the representation of the target object. Several decades ago, researchers used a color histogram [1] to represent the target object. A generative approach combined with an optimization method, such as the Lucas-Kanade algorithm, the Kalman filter [2], or the particle filter [3], was usually applied. The Lucas-Kanade algorithm uses a differential method to handle optical flow. Unfortunately, this computation is expensive, and the method has many disadvantages when addressing challenging problems in visual tracking. The Kalman filter is also limited because it assumes that both the system and observation model equations are linear and that the state follows a Gaussian distribution. These assumptions are not realistic in many real conditions. The particle filter was proposed to overcome the limitations of the Kalman filter. Although it significantly improves the results and can handle nonlinear problems, it involves a trade-off among accuracy, the number of particles, and computation time [4]. Furthermore, the generative approach learns only an appearance model. It does not take information from the background into consideration, although such information is very valuable for developing a more robust visual tracking algorithm. Moreover, although the color histogram-based representation is robust to deformation, it has a drawback when illumination changes occur, and it is also sensitive to motion blur.

Later, the discriminative approach was proposed to improve on the performance of the generative approach. The main difference between the two is the use of a classifier to determine the location of the target object. The generative approach does not need a classifier; its output is the candidate nearest to the target according to a one-by-one distance comparison, which makes the computation expensive. The discriminative approach, in contrast, uses a classifier to determine the output and takes information from the background into consideration. Therefore, positive and negative samples are used to represent the target object and the background, respectively. For example, Grabner et al. proposed an online feature selection method for visual tracking that uses an AdaBoost algorithm and can be trained online [5]. Although it operates quickly, online learning is problematic because each update of the tracker may introduce an error, which can finally lead to tracking failure (drifting). Semisupervised online boosting alleviates the drifting problem in tracking applications [6]. Another method for visual tracking, called multiple instance learning (MIL), was proposed by Babenko et al. to replace traditional supervised learning [7]. This method groups positive and negative samples into a positive and a negative bag, respectively, and then uses a boosting classifier to determine the output. It operates faster and more accurately than traditional supervised learning. Kalal et al. proposed a tracking-learning-detection framework [8]; unfortunately, this framework needs a large amount of memory. These methods can be termed tracking-by-detection methods.

Recently, correlation filters have been used because they are computationally efficient: the correlation is carried out in the Fourier domain. They also produce good results even when only a limited amount of training data is available. For these reasons, researchers introduced the correlation filter into tracking-by-detection methods for visual tracking. An example is the minimum output sum of squared error (MOSSE) tracker introduced by Bolme et al. [9], which trains the correlation filter using only grayscale samples. Recent studies have improved on this method by using multidimensional features such as histogram of oriented gradients (HOG) features [10–13]. Although the correlation filter provides efficient computation, all the circular shifts should be learned during the process. To resolve this issue, Danelljan et al. proposed the spatially regularized discriminative correlation filter (SRDCF) [14]. Although it achieves excellent results, this method requires more computation time than the original correlation filter. Moreover, although the correlation filter is robust to challenging problems such as illumination changes and motion blur, it has a drawback when deformation occurs.

To let the color histogram-based and correlation filter-based representations compensate for each other's drawbacks, a representation of the target object based on complementary learners was proposed [10, 11, 15]. In this study, we adopt complementary learners and propose an object-aware method based on them. The two representations are computed in parallel and produce a color histogram response and a correlation filter response, respectively. Since the tracking algorithm can still fail because of a distractor, a method for handling the distractor is proposed to minimize tracking failures. First, distractor detection is performed by calculating the distance between the location of the maximum of the color histogram response and the location of the maximum of the correlation filter response. Then, the location of the target object is determined from either the maximum of the correlation filter response or the maximum of the merged responses of the color histogram and the correlation filter; the choice depends on the result of the distractor detection process. We demonstrate the proposed method on the widely used VOT2014 and VOT2015 benchmarks. According to the results of our experiments, the proposed method performs favorably compared with state-of-the-art visual tracking algorithms.

The rest of this paper is organized as follows. We describe our object-aware method based on complementary learners in Section 2. The distractor detection method is explained in Section 3. The proposed method is detailed in Section 4. In Section 5, the experimental results with comparisons to the state-of-the-art methods are presented. Finally, conclusions are presented in Section 6.

#### 2. Object-Aware Method Based on Complementary Learners

One important factor in building a robust visual tracking algorithm is the model used to represent the target object. Color histogram-based object representation has been used widely. Unfortunately, this representation is not robust when the color of the distractor is similar to that of the tracked object. In addition, it has a drawback when illumination changes occur and is sensitive to motion blur. Recently, a correlation filter has been used for representing the object. Although it is robust to challenges such as motion changes and illumination changes, it has a drawback when deformation occurs. These observations inspired complementary learners, in which the correlation filter and the color histogram are combined to represent the target object in visual tracking [10, 11, 15]. The two representations should be computed in parallel to produce their responses before the distractor is analyzed based on these responses.

Given frame $t$, we use the color histogram of the object, $H_O^{t-1}$, and the color histogram of the background, $H_B^{t-1}$, from the previous frame to obtain the response of the color histogram, $r_{\mathrm{hist}}$. First, each pixel at location $x$ in the search area $S_t$ of the target object is assigned its bin index $b_x$. Then, following Bayes' theorem, we calculate the likelihood that the pixel belongs to the object by using

$$P(x \in O \mid O, B, b_x) \approx \frac{H_O^{t-1}(b_x)}{H_O^{t-1}(b_x) + H_B^{t-1}(b_x)}, \tag{1}$$

where $O$ and $B$ are the rectangle areas of the object and the background, respectively. Finally, the response of the color histogram, $r_{\mathrm{hist}}$, is obtained by applying an integral image to $P$.
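The color histogram response described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the joint RGB binning, the neutral value of 0.5 for bins seen in neither region, and the use of a box average as the response are all assumptions made for the example.

```python
import numpy as np

def bin_indices(image, n_bins=32):
    """Map each RGB pixel (uint8) to a single joint-histogram bin index."""
    q = (image.astype(np.int64) * n_bins) // 256      # per-channel bin, 0..n_bins-1
    return (q[..., 0] * n_bins + q[..., 1]) * n_bins + q[..., 2]

def histogram(bins, mask, n_bins=32):
    """Count bin occurrences over the pixels selected by a boolean mask."""
    return np.bincount(bins[mask], minlength=n_bins ** 3).astype(np.float64)

def pixel_likelihood(bins, hist_obj, hist_bg, eps=1e-12):
    """Per-pixel object likelihood H_O(b_x) / (H_O(b_x) + H_B(b_x))."""
    ho, hb = hist_obj[bins], hist_bg[bins]
    denom = ho + hb
    # Assumption: bins never seen in either region get a neutral 0.5.
    return np.where(denom > 0, ho / np.maximum(denom, eps), 0.5)

def histogram_response(likelihood, box_h, box_w):
    """Mean likelihood inside every box_h x box_w window, via an integral image."""
    ii = np.pad(likelihood, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    sums = (ii[box_h:, box_w:] - ii[:-box_h, box_w:]
            - ii[box_h:, :-box_w] + ii[:-box_h, :-box_w])
    return sums / (box_h * box_w)
```

The integral image makes each window sum cost four lookups regardless of the window size, which is why it is attractive for dense response maps.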

On the other hand, as in [10–14], HOG features are used as multidimensional features. They produce a $d$-dimensional feature map representation of an image. Based on this representation, the optimal correlation filter $h$, which consists of one filter $h^l$ per feature dimension, is obtained by minimizing

$$\varepsilon = \left\| \sum_{l=1}^{d} h^l \star f^l - g \right\|^2 + \lambda \sum_{l=1}^{d} \left\| h^l \right\|^2, \tag{2}$$

where $f$, $g$, $\star$, and $\lambda$ are the rectangle patch of the feature map that represents the target, the desired correlation output, the circular correlation, and the parameter that controls the effect of the regularization term, respectively. Further, the correlation filter operates in the Fourier domain, and, therefore, we can use the discrete Fourier transform (DFT), which produces complex values. To solve (2) in this complex form, we follow the method presented in [16] and obtain

$$H^l = \frac{\bar{G} \odot F^l}{\sum_{k=1}^{d} \bar{F}^k \odot F^k + \lambda}, \quad l = 1, \ldots, d, \tag{3}$$

where $\bar{G}$ is the complex conjugate of the DFT of $g$, $F^l$ is the DFT of $f^l$, $\bar{F}^k$ is the complex conjugate of the DFT of $f^k$, $F^k$ is the DFT of $f^k$, $\odot$ represents element-wise multiplication, and $H^l$ is the result in the Fourier domain.
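As a rough illustration of the closed-form solution in the Fourier domain, the sketch below computes a multichannel filter from a single training sample and applies it to a new sample. The function names and the array layout `(height, width, channels)` are assumptions for the example; random arrays stand in for HOG feature maps.

```python
import numpy as np

def closed_form_filter(f, g, lam=0.01):
    """H^l = conj(G) F^l / (sum_k conj(F^k) F^k + lam), for f of shape (h, w, d)."""
    F = np.fft.fft2(f, axes=(0, 1))
    G = np.fft.fft2(g)
    energy = np.sum(np.conj(F) * F, axis=2).real   # summed over channels, real and >= 0
    return np.conj(G)[..., None] * F / (energy + lam)[..., None]

def correlate(H, z):
    """Response of filter H on features z: inverse DFT of sum_l conj(H^l) Z^l."""
    Z = np.fft.fft2(z, axes=(0, 1))
    return np.fft.ifft2(np.sum(np.conj(H) * Z, axis=2)).real
```

If the desired output `g` peaks at index `(0, 0)`, the response to a circularly shifted copy of the training sample peaks at the shift, which is how the filter localizes the target.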

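A visual tracking algorithm must be computationally inexpensive, because online learning, which is effective for handling appearance changes in the target object, as shown in [5–8], has to be performed while tracking. Further, based on (3), a $d \times d$ linear system of equations per pixel needs to be solved, which requires expensive computation. Thus, instead of performing this expensive computation, a robust approximation adopted from [16] is used, in which the numerator $N_t^l$ and the denominator $D_t$ of the filter are updated online at frame $t$:

$$N_t^l = (1 - \eta)\, N_{t-1}^l + \eta\, \bar{G} \odot F_t^l, \qquad D_t = (1 - \eta)\, D_{t-1} + \eta \sum_{k=1}^{d} \bar{F}_t^k \odot F_t^k, \tag{4}$$

where $F_t^l$ is the DFT of the $l$th dimension of the feature map at frame $t$, $\eta$ is a learning rate parameter, $N_{t-1}^l$ is the numerator at frame $t-1$, and $D_{t-1}$ is the denominator at frame $t-1$. Moreover, the response of the correlation filter, $r_{\mathrm{cf}}$, can be calculated using the inverse DFT:

$$r_{\mathrm{cf}} = \mathcal{F}^{-1}\left\{ \frac{\sum_{l=1}^{d} \bar{N}_{t-1}^l \odot Z_t^l}{D_{t-1} + \lambda} \right\}, \tag{5}$$

where $Z_t$ is the DFT of the feature map of the sample $z_t$ after it has been multiplied by a Hann window and $\bar{N}_{t-1}^l$ is the complex conjugate of $N_{t-1}^l$.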
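The online update and the response computation can be sketched as follows. This is a hedged illustration in the spirit of [16], not the authors' code: the function names are hypothetical, the first-frame initialization is an assumption, and windowing the training sample as well as the test sample is an assumption (the text only states that the test sample is windowed).

```python
import numpy as np

def windowed_dft(x):
    """Multiply an (h, w, d) feature map by a 2-D Hann window and take its DFT."""
    h, w = x.shape[:2]
    win = np.outer(np.hanning(h), np.hanning(w))[..., None]
    return np.fft.fft2(x * win, axes=(0, 1))

def update_model(num_prev, den_prev, f, g, eta=0.01):
    """Running average of numerator conj(G) F^l and denominator sum_k conj(F^k) F^k."""
    F = windowed_dft(f)                      # assumption: training sample is windowed too
    G = np.fft.fft2(g)
    num_new = np.conj(G)[..., None] * F
    den_new = np.sum(np.conj(F) * F, axis=2).real
    if num_prev is None:                     # assumption: first frame just initializes
        return num_new, den_new
    return ((1 - eta) * num_prev + eta * num_new,
            (1 - eta) * den_prev + eta * den_new)

def detect(num, den, z, lam=0.01):
    """Response map: inverse DFT of sum_l conj(N^l) Z^l / (D + lam)."""
    Z = windowed_dft(z)
    return np.fft.ifft2(np.sum(np.conj(num) * Z, axis=2) / (den + lam)).real
```

Because only the numerator and denominator are stored, each update costs one DFT per feature channel and no linear system has to be solved.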

#### 3. Distractor Detection

Visual tracking algorithms usually fail because of the distractor, in particular when the distractor has a representation similar to that of the target object. To overcome this problem, Kalal et al. [8] proposed a learning method assisted by positive and negative constraints to distinguish a target object from the background. In addition, they used optical flow as the motion model. Unfortunately, this approach needs a large amount of memory. Recently, Possegger et al. [17] proposed foreground and background modeling based on the color histogram. Unfortunately, the drawbacks of color histogram features still affect their approach and make it less robust than trackers based on a shape (HOG) correlation filter. In this section, we describe our proposed distractor detection method. Given the responses from the color histogram, $r_{\mathrm{hist}}$, and the correlation filter, $r_{\mathrm{cf}}$, the maximum value of each response can be determined. The maximum value of $r_{\mathrm{hist}}$ is represented by $m_{\mathrm{hist}}$ and that of $r_{\mathrm{cf}}$ by $m_{\mathrm{cf}}$. Because these responses take a two-dimensional form, these maximum values have coordinate information indicating their respective positions, $p_{\mathrm{hist}}$ and $p_{\mathrm{cf}}$. Distractor detection can be achieved by using the Euclidean distance between position $p_{\mathrm{hist}}$ and position $p_{\mathrm{cf}}$:

$$d_t = \left\| p_{\mathrm{hist}} - p_{\mathrm{cf}} \right\|_2, \qquad \delta_t = \begin{cases} 1, & d_t > \tau, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

where $d_t$ and $\tau$ represent the distance at frame $t$ and the distance threshold, respectively, and $\delta_t = 1$ indicates that a distractor appears and $\delta_t = 0$ that no distractor appears. The distractor detection procedure is illustrated in Figure 1. Moreover, compared with [8], our proposed distractor detection method does not need a large amount of memory.
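The detection rule can be illustrated with a short sketch. It assumes that both response maps are sampled on the same pixel grid, that a strict inequality is used, and that the threshold is 20 pixels; the function names and these choices are assumptions for the example.

```python
import numpy as np

def peak_position(response):
    """(row, col) of the maximum of a 2-D response map."""
    return np.array(np.unravel_index(np.argmax(response), response.shape))

def distractor_detected(resp_hist, resp_cf, threshold=20.0):
    """Return (flag, distance); flag is 1 when the two peaks are farther apart
    than the threshold, and 0 otherwise."""
    diff = peak_position(resp_hist) - peak_position(resp_cf)
    distance = float(np.linalg.norm(diff))
    return int(distance > threshold), distance
```

The rule only needs the two peak coordinates, so its memory footprint is constant regardless of the sequence length.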

#### 4. Proposed Method

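In this section, our proposed method for visual tracking is described. Given frame $t-1$, the rectangle area of the object $O$, and that of the background $B$, we calculate the parameters of each representation before we proceed to frame $t$, since the proposed method uses both a color histogram and a correlation filter for representing the target object. First, for the color histogram-based representation, the parameters $H_O^{t-1}$ and $H_B^{t-1}$ are calculated from the pixels in the observation area and the number of bins that are needed. For the correlation filter-based representation, the numerator and denominator parameters for translation estimation, $N_{\mathrm{trans}}$ and $D_{\mathrm{trans}}$, and the numerator and denominator parameters for scale estimation, $N_{\mathrm{scale}}$ and $D_{\mathrm{scale}}$, should be determined. Parameters $N_{\mathrm{trans}}^l$ and $D_{\mathrm{trans}}$ are calculated by $\bar{G}_{\mathrm{trans}} \odot F_{\mathrm{trans}}^l$ and $\sum_{k=1}^{d} \bar{F}_{\mathrm{trans}}^k \odot F_{\mathrm{trans}}^k$, respectively, where $\bar{G}_{\mathrm{trans}}$, $F_{\mathrm{trans}}^l$, $\bar{F}_{\mathrm{trans}}^k$, and $F_{\mathrm{trans}}^k$ are the complex conjugate of the DFT of $g_{\mathrm{trans}}$, the DFT of $f_{\mathrm{trans}}^l$, the complex conjugate of the DFT of $f_{\mathrm{trans}}^k$, and the DFT of $f_{\mathrm{trans}}^k$, respectively. Similarly, parameters $N_{\mathrm{scale}}^l$ and $D_{\mathrm{scale}}$ are calculated by $\bar{G}_{\mathrm{scale}} \odot F_{\mathrm{scale}}^l$ and $\sum_{k=1}^{d} \bar{F}_{\mathrm{scale}}^k \odot F_{\mathrm{scale}}^k$, respectively, where $\bar{G}_{\mathrm{scale}}$, $F_{\mathrm{scale}}^l$, $\bar{F}_{\mathrm{scale}}^k$, and $F_{\mathrm{scale}}^k$ are the complex conjugate of the DFT of $g_{\mathrm{scale}}$, the DFT of $f_{\mathrm{scale}}^l$, the complex conjugate of the DFT of $f_{\mathrm{scale}}^k$, and the DFT of $f_{\mathrm{scale}}^k$, respectively.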

After these parameters for frame $t-1$ have been calculated, the search for the target object in frame $t$ can proceed. To search for the target object in frame $t$, the response from the color histogram, $r_{\mathrm{hist}}$, and the response from the correlation filter, $r_{\mathrm{cf}}$, are needed. Given the search area $S_t$ of the target object at frame $t$, where $x \in S_t$, to obtain $r_{\mathrm{hist}}$ we use $H_O^{t-1}$ and $H_B^{t-1}$ in (1), where $b_x$ is the bin index of the pixel at $x$. The results of this step are then processed with an integral image in order to obtain $r_{\mathrm{hist}}$. On the other hand, translation estimation is used to estimate the location of the target object with the correlation filter-based representation. Given frame $t$, the translation sample $z_{\mathrm{trans}}$ is extracted at the previous position $p_{t-1}$ using the scale estimate from the previous frame, $s_{t-1}$. After $z_{\mathrm{trans}}$ is extracted, the parameters $N_{\mathrm{trans}}$ and $D_{\mathrm{trans}}$ are used together with $Z_{\mathrm{trans}}$, the DFT of the windowed feature map of $z_{\mathrm{trans}}$, to obtain $r_{\mathrm{cf}}$ by applying (5). Figure 2 shows the framework of the proposed method.

When the responses $r_{\mathrm{hist}}$ and $r_{\mathrm{cf}}$ have been obtained, the distractor must be detected prior to the final location estimation of the target object in order to minimize tracking failure due to the distractor. To detect the distractor, we use (6). The final location estimate of the target object is obtained by maximizing the score $r_{\mathrm{final}}$, where

$$r_{\mathrm{final}} = \begin{cases} r_{\mathrm{cf}}, & \delta_t = 1, \\ \alpha_{\mathrm{hist}}\, r_{\mathrm{hist}} + \alpha_{\mathrm{cf}}\, r_{\mathrm{cf}}, & \delta_t = 0, \end{cases} \tag{7}$$

where $\delta_t$ is the result of the distractor detection in (6) (1 when a distractor appears and 0 otherwise) and $\alpha_{\mathrm{hist}}$ and $\alpha_{\mathrm{cf}}$ are the coefficients related to $r_{\mathrm{hist}}$ and $r_{\mathrm{cf}}$, respectively. According to (7), when the distractor appears, only the response from the correlation filter is used to obtain the final location estimate. This is because the color histogram-based representation is less discriminative than the correlation filter-based representation: it is often inadequate for discriminating the target object from the background, it is sensitive to motion blur, and it cannot handle illumination variation well. This choice is also supported by the benchmark results of the VOT2014 dataset [18, 19] and the VOT2015 dataset [15, 20]. The DSST tracker [16], SAMF tracker [21], and KCF tracker [22] occupied the top three ranks in the benchmark results of the VOT2014 dataset. These trackers are based on a shape (HOG) correlation filter. Furthermore, trackers based on a shape (HOG) correlation filter consistently lead color-based trackers in the accuracy-robustness rank on the VOT2015 benchmark dataset.
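The selection rule can be sketched as a small function. The names are hypothetical, both responses are assumed to be resampled to the same grid, and the default weights of 0.3 for the color histogram and 0.7 for the correlation filter follow the values reported in Section 5.

```python
import numpy as np

def final_score(resp_hist, resp_cf, distractor, alpha_hist=0.3, alpha_cf=0.7):
    """Use the correlation filter response alone when a distractor is flagged,
    otherwise a weighted sum of both responses."""
    if distractor:
        return resp_cf
    return alpha_hist * resp_hist + alpha_cf * resp_cf

def locate(score):
    """(row, col) of the maximum score, taken as the target location."""
    idx = np.unravel_index(np.argmax(score), score.shape)
    return tuple(int(i) for i in idx)
```

For example, if the color histogram peaks strongly at one location while the correlation filter peaks elsewhere, the merged score can follow the histogram peak, whereas flagging a distractor forces the estimate back to the correlation filter peak.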

Scale changes of the target object can also cause tracking failure. For this reason, scale estimation is required, for which a correlation filter can be used, as shown in [16]. The process is almost the same as for translation estimation. The scale sample $z_{\mathrm{scale}}$ is extracted at the estimated position $p_t$, considering the scale estimate from the previous frame, $s_{t-1}$. After $z_{\mathrm{scale}}$ is extracted, the parameters $N_{\mathrm{scale}}$ and $D_{\mathrm{scale}}$ are used together with $Z_{\mathrm{scale}}$, the DFT of the windowed feature map of $z_{\mathrm{scale}}$, to obtain $r_{\mathrm{scale}}$ by applying (5). The scale estimate at frame $t$ is calculated by maximizing the score $r_{\mathrm{scale}}$. The scale $s_t$ that has the maximum score is the output of the proposed method.

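Since appearance changes always occur and influence the target object, they can also cause tracking failure. Certain parameters need to be updated to handle this problem. Six parameters should be updated: the parameters $H_O$ and $H_B$ for the color histogram-based representation and the parameters $N_{\mathrm{trans}}$, $D_{\mathrm{trans}}$, $N_{\mathrm{scale}}$, and $D_{\mathrm{scale}}$ for the correlation filter-based representation. The parameters $H_O^t$ and $H_B^t$ can be obtained as

$$H_O^t = (1 - \eta_{\mathrm{hist}})\, H_O^{t-1} + \eta_{\mathrm{hist}}\, H_O, \qquad H_B^t = (1 - \eta_{\mathrm{hist}})\, H_B^{t-1} + \eta_{\mathrm{hist}}\, H_B, \tag{8}$$

where $H_O$ is the color histogram for the target object, $H_B$ is the color histogram for the background, both computed from frame $t$, and $\eta_{\mathrm{hist}}$ is a coefficient related to the color histogram-based representation. On the other hand, to update the parameters in the correlation filter-based representation, the samples $f_{\mathrm{trans}}$ and $f_{\mathrm{scale}}$ should be extracted from frame $t$ at $p_t$ and $s_t$. After the samples have been extracted, the updates of parameters $N_{\mathrm{trans}}$ and $D_{\mathrm{trans}}$ are determined by using (4) with $f_{\mathrm{trans}}$. Parameters $N_{\mathrm{scale}}$ and $D_{\mathrm{scale}}$ are also updated by using (4) with $f_{\mathrm{scale}}$.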
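A running-average update of the color histograms might look like the sketch below. The linear-interpolation form and the function name are assumptions for illustration; the default rate of 0.01 is the histogram update value reported in Section 5.

```python
import numpy as np

def update_histogram(hist_prev, hist_new, rate=0.01):
    """Blend the stored histogram with the one computed from the current frame."""
    hist_prev = np.asarray(hist_prev, dtype=float)
    hist_new = np.asarray(hist_new, dtype=float)
    return (1.0 - rate) * hist_prev + rate * hist_new
```

A small rate keeps the model stable against brief appearance changes, at the cost of adapting slowly when the target's colors change permanently.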

#### 5. Experimental Results and Discussions

In this section, a comprehensive evaluation of the proposed method is presented. The proposed method is evaluated on two recently published and widely used benchmarks: VOT2014 [18, 19] and VOT2015 [15, 20]. The method was implemented in MATLAB 2016A, and the experiments were performed on an Intel(R) Core(TM) i5 2.60 GHz CPU with 8 GB RAM. For the color histogram-based representation, 32 bins were used for each channel of a red-green-blue (RGB) image. The value of the parameter for updating the color histogram was 0.01. Further, for the correlation filter-based representation, we used a HOG cell size of 8 × 8. The values of parameters , , and were 0.01, 0.01, and 20, respectively. When a distractor did not appear, the final score was constructed from the merged responses of the color histogram and the correlation filter, so a coefficient was required for each response. According to the results of our experiments, the coefficients for the color histogram response and the correlation filter response were set to 0.3 and 0.7, respectively.

The VOT2014 benchmark dataset includes 25 sequences that represent several challenging problems in visual tracking: camera motion, illumination change, motion change, occlusion, and size change. For this benchmark dataset, two performance measures were used: accuracy and robustness. Accuracy was determined as the average per-frame overlap between the bounding box output of the system and the ground truth using the area under curve (AUC) criterion. Robustness was expressed as the number of failures over the sequence, where a failure is the condition that the AUC is equal to zero. The proposed method, which uses compLementary learners for the rePresentation Model of the target object and for detecTing the distractor (LPMT), was evaluated on this benchmark dataset. To justify its design choices, LPMT is compared with three variants: the proposed method without distractor detection, the proposed method using only shape (HOG) features, and the proposed method using only color histogram features. Figure 3 shows the results of these comparisons.
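The two measures can be illustrated with a simplified sketch. It assumes that overlap is computed as intersection over union of axis-aligned boxes given as `(x, y, width, height)`, that accuracy is averaged over frames with nonzero overlap, and that no reinitialization happens after a failure; the actual VOT protocol is described in [18, 19] and differs in these details.

```python
def overlap(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over tracked frames and the number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures
```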

Using the VOT2014 benchmark dataset, the proposed method is also compared with the state-of-the-art visual tracking algorithms: SIR_PF [23], SAMF [21], ThunderStruck [24], DynMS [23], IVT [25], ABS - [23], LT_FLO [26], IPRT [23], PTp [23], NCC [27], qwsEDFT [28], Matrioska [29], ACAT [23], KCF [22], FRT [30], EDFT [31], BDF [32], FoT [33], FSDT [23], MIL [7], aStruck [23], MatFlow [23], HMMTxD [23], OGT [34], CT [35], IIVTv2 [23], IMPNCC [23], ACT [36], DGT [37], VTDMG [23], Struck [24], DSST [16], and CMT [38]. Based on the accuracy parameter and the robustness parameter, the accuracy-robustness (AR) rank plot is used to determine the comparative rank of the methods.

Figure 4 shows the AR rank plots of LPMT and the state-of-the-art methods for the challenges of camera motion, illumination change, motion change, occlusion, and size change. For each challenge, LPMT performs well: it is always ranked in the top five among all the trackers. In particular, LPMT outperforms the other state-of-the-art algorithms in the occlusion challenge, where most trackers fail, especially when the occlusion is coupled with the disruption caused by an object similar to the target object. This indicates that the proposed method meets these challenges effectively. In this figure, neutral means that no challenge exists in the sequence frame.

Figure 5 shows the AR rank plots of LPMT and the state-of-the-art trackers on the VOT2014 benchmark dataset for all the challenges combined, together with the average expected overlap rank. Because LPMT was always ranked in the top five in the AR rank plot for each challenge, it also ranked in the top five for the challenges overall. Based on the average expected overlap, LPMT was ranked fourth, with an average expected overlap of almost 0.3. DSST [16], which uses HOG and grayscale features, achieved the top rank with an average expected overlap of 0.3. For detailed information about the VOT2014 benchmark dataset and its performance parameters, please refer to [18, 19].

The VOT2015 benchmark dataset contains 60 sequences that represent more challenging problems than those in the VOT2014 dataset. As for the VOT2014 benchmark dataset, the accuracy and robustness performance parameters were used and are represented by the AR rank plot. To justify the design choices of LPMT on this benchmark dataset, the proposed method is compared with three variants: the proposed method without distractor detection, the proposed method using only shape (HOG) features, and the proposed method using only color histogram features. Figure 6 shows the results of these comparisons.

On the VOT2015 benchmark dataset, all four variants (the proposed tracker, the proposed tracker without distractor detection, the proposed tracker using only shape (HOG) features, and the proposed tracker using only color histogram features) achieve an accuracy rank of 1.00. Their baseline mean robustness ranks are 1.00, 1.33, 2.83, and 3.33, respectively. These results indicate that color histogram features are less robust, and hence less discriminative, than shape (HOG) features. They also indicate that distractor detection improves the robustness of the proposed tracker compared with the same tracker without distractor detection.

Using the VOT2015 benchmark dataset, the proposed LPMT method was compared with the state-of-the-art visual tracking algorithms: ACT [31], CT [35], ggt [15], L1APG [39], mkcf_plus [15], RobStruck [15], STC [40], amt [15], DAT [17], HMMTxD [15], LGT [41], muster [42], s3Tracker [15], sumshift [43], AOGTracker [15], DFT [15], HT [15], loft_lite [15], mvcft [15], samf [21], TGPR [44], ASMS [45], DSST [16], IVT [25], LT_FLO [26], ncc [27], SCBT [46], tric [15], dtracker [15], kcf_mtsa [15], matflow [15], OAB [15], sKCF [15], zhang [15], bdf [15], fct [15], KCF2 [15], MCT [15], OACF [11], sme [15], cmil [15], fot [33], kcfdp [15], MEEM [47], PKLTF [15], SODLT [48], CMT [38], FragTrack [15], kcfv2 [15], MIL [7], rajssc [15], and srat [15].

Figure 7 shows the AR rank plots of LPMT and the state-of-the-art methods for the challenges of camera motion, illumination change, motion change, occlusion, and size change. For each challenge, LPMT performs well and is always ranked in the top two. This indicates that the proposed method addresses these problems effectively, even though they are more challenging than those in the VOT2014 benchmark dataset and the number of sequences is greater. In contrast, DSST achieved a rank considerably below that of LPMT in this experiment.

Figure 8 shows the AR rank plots of LPMT and the state-of-the-art trackers on the VOT2015 benchmark dataset for all the challenges combined, together with the average expected overlap rank. Because LPMT was always ranked in the top two in the AR rank plot for each challenge, it outperforms the other state-of-the-art trackers for the challenges overall. Based on the average expected overlap, LPMT achieves the first rank, with an average expected overlap of 0.25. DSST [16], which achieved the top rank on the VOT2014 benchmark dataset, was ranked thirtieth. The second rank is achieved by the Rajssc tracker, which is based on a correlation filter. For detailed information about the VOT2015 benchmark dataset and its performance parameters, please refer to [15, 20].

#### 6. Conclusions

This paper presented a method that uses complementary learners, which consist of the response of the color histogram and the response of the correlation filter, for representing the target object. To overcome a distractor that has a representation similar to that of the target object, the proposed method also detects the distractor based on the response of the color histogram and correlation filter. Based on evaluations on the VOT2014 and VOT2015 benchmark datasets, the proposed method yields a favorable performance as compared to several state-of-the-art visual tracking algorithms.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work was supported by BK21PLUS, Creative Human Resource Development Program for IT Convergence, and was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MEST) (no. 2010-0024110).