Abstract

Histogram of oriented gradients (HOG) is a feature descriptor typically used for object detection. For object tracking, this feature has certain drawbacks when the target object is influenced by a change in motion or size. In this paper, the use of convolutional shallow features is proposed to improve the performance of HOG feature-based object tracking. Because the proposed method is based on a correlation filter, the response maps of the individual features are summed to obtain the final response map. The location of the target object is then predicted from the maximum value of the optimized final response map. Further, a model update is used to handle changes in the appearance of the target object during tracking. The proposed method is evaluated using the Visual Object Tracking 2015 (VOT2015) benchmark dataset and its protocols, and the results are reported in terms of accuracy-robustness (AR) rank. In a comparison with several state-of-the-art tracking algorithms, the proposed method achieved the highest rank in terms of accuracy and a third rank for robustness. In addition, the proposed method significantly improves the robustness of HOG-based features.

1. Introduction

Visual tracking is a fundamental problem in computer vision research. Given an initial state, the task of visual tracking is to estimate the trajectory of the target object in an image sequence. Visual/object tracking can be applied in several types of applications, including surveillance systems [1], human-computer interaction (HCI) systems, unmanned aerial vehicle (UAV) systems, robotics, and three-dimensional (3D) reconstruction [2]. However, it is difficult to implement a visual tracking algorithm that performs well in terms of both accuracy and robustness. Several problems such as changes in illumination, motion, and size, as well as occlusions and camera motion, may cause tracking failures. For this reason, visual tracking has become a significant research topic in the area of computer vision.

Over the past several years, visual tracking based on online learning has made excellent progress. The key idea of this type of tracking is how to exploit a boosting classifier. Because the framework uses a boosting classifier, it can be categorized as a discriminative approach; it therefore has to run quickly and requires features that can be computed rapidly. For example, the Adaboost classifier is implemented in the object tracking algorithm in [3], in which the authors used Haar-like features. A boosting classifier has also been used in multiple instance learning for object tracking [4], where the authors again used Haar-like features to represent the target object. A face tracking system based on this work was proposed in [5]. Recently, Wang et al. [6] proposed patch-based multiple instance learning, in which an object is divided into many blocks. Unfortunately, their method has not been evaluated extensively on a benchmark dataset. Furthermore, a discriminative approach using a boosting classifier is limited by the region used to search for the target object.

To address this limitation, generative approaches have been proposed. In [7], the authors proposed the use of fuzzy coding histogram features with a point representation for handling tracking failures. By adopting compressive sensing, new features representing the target object may be obtained. Such features are called sparse coefficient vectors, which are usually combined with a particle filter for motion estimation. Several examples using a sparse coefficient vector combined with a particle filter can be found in [8–10]. Such features have shown good results with regard to occlusions, although several other challenging problems in visual tracking remain. Further, because these methods are based on a particle filter, issues related to the computation time still remain.

To address such issues through a generative approach, a correlation filter has been proposed. This method works based on a Fourier transform, and to improve the computation time, a fast Fourier transform (FFT) is used. Correlation-filter-based visual tracking was initially proposed by Bolme et al. [11]. They proposed a way to implement a correlation filter during visual tracking and handle changes in appearance adaptively. To do so, they used a simple linear classification to solve the problem, which is a limitation of their work. Further, a correlation filter that operates efficiently was proposed by Henriques et al. [12]. The efficiency of this filter was achieved through computations using a circulant matrix combined with a kernel. To represent the target object, histogram of oriented gradients (HOG) features were used. In [13], a HOG feature-based correlation filter is also applied. In their method, they focused on how to handle the problem of a change in size during visual tracking. Other features for correlation-filter-based visual tracking, such as adaptive color features, were used in [14], and recently a fusion between color histogram features and HOG features with distractor handling was proposed in [15]. Unfortunately, these works have a limitation in that they use only a one-dimensional (1D) feature map.

For this reason, a multiscale feature-map-based correlation filter was proposed in [16]. In their method, they used color features for simplicity. Unfortunately, although multiscale feature maps have been successfully implemented, their performance still needs to be improved. In this research, we propose a way to improve a HOG-feature-based visual object tracking algorithm using convolutional shallow (CS) features. Because we use both HOG and CS features, the problem is how to integrate the two given their different resolutions. Further, to handle changes in the size of the target object, the scale is estimated after the location of the target object has been estimated. After the scale estimation, several parameters of the proposed method need to be updated to handle changes in the appearance of the target object during tracking. Furthermore, extensive experiments were conducted using the Visual Object Tracking 2015 (VOT2015) benchmark dataset. In addition, we conducted a comparison among the proposed method, the proposed method using only HOG features, and the proposed method using only CS features. The purpose of this comparison is to show that the proposed method has advantages over the use of a single type of feature for challenging problems in visual object tracking.

The rest of this paper is organized as follows. Section 2 discusses the proposed method. Parameter updates are described in Section 3. Further, Section 4 discusses the experiment results. Finally, Section 5 provides some concluding remarks.

2. Proposed Method

In this section, our proposed method is described. Deep learning has developed rapidly in recent years, particularly in the area of computer vision research. One of the methods used in deep learning is the convolutional neural network (CNN). The architecture of a CNN usually consists of several types of layers, including convolutional layers, normalization layers, and pooling layers. Moreover, the convolutional layers are stacked from a shallow layer to the deepest layer. Although the deepest layer provides the best results for image classification, in this research we used a shallow layer because it is better suited to object tracking: the output of a shallow layer of a pretrained CNN requires only a small number of operations compared with that of the deepest layer. At the same time, the information from a shallow layer can still represent the input, which makes it useful for object tracking. For detailed information regarding the architecture of the pretrained CNN used in the present research, refer to [17]. The framework of the proposed method is shown in Figure 1.

Starting with frame , the search area for the target object is defined based on an expansion of the result from frame . From this area, feature extraction is conducted and for this step, we use two features: HOG and CS features. For this reason, we define and for smooth results of the HOG and CS features extraction, respectively. In addition, a cosine window is used for the smoothing process.
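The cosine-window smoothing mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `apply_cosine_window` and the (height, width, channels) layout are assumptions, and a 2-D Hann window is assumed as the cosine window.

```python
import numpy as np

def apply_cosine_window(features):
    """Taper an (H, W, C) feature map with a 2-D Hann (cosine) window.

    The window falls to zero at the borders of the search area, which
    suppresses the boundary discontinuities seen by a correlation filter.
    The (H, W, C) layout is an assumption made for this sketch.
    """
    height, width = features.shape[:2]
    # Outer product of two 1-D Hann windows gives a separable 2-D window.
    window = np.outer(np.hanning(height), np.hanning(width))
    # Broadcast the same window over every feature channel.
    return features * window[:, :, None]
```

The same function can be applied to the HOG and the CS feature maps independently, since each map gets a window that matches its own resolution.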

The next step is the interpolation process. The purposes of the interpolation process are to estimate the output more accurately and to achieve an integration of multiresolution feature maps. Further, the interpolation model can be defined as follows.where is the number of feature channels, and are related to the size of the feature map, and , , and represent interpolation functions. Because two features are used, we also have two interpolation models, where is the interpolation model for and is the interpolation model for .

After the interpolation models are obtained, the next step is obtaining a response map for each feature. Because the proposed method is based on a correlation filter, each response map can be obtained through a convolution between the interpolation model and the correlation filter. This computation can be expressed as follows.where , , , , and are an inverse Fourier transform, interpolation model in the Fourier domain, correlation filter in the Fourier domain, complex conjugate value, and elementwise multiplication, respectively. The variable , and as can be seen in (2), for a fast computation, a Fourier transform is used; thus, elementwise multiplication can also be applied. For this reason, (1) can be transformed into the Fourier domain as follows.where .
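The Fourier-domain computation of a response map described above can be sketched as follows. This is a simplified discrete illustration under stated assumptions: it correlates a multi-channel feature map directly with a filter of the same size and omits the interpolation model, and the names `correlation_response`, `features`, and `filt` are hypothetical.

```python
import numpy as np

def correlation_response(features, filt):
    """Circular cross-correlation of an (H, W, C) feature map with a filter.

    Both inputs are transformed with a 2-D FFT, the complex conjugate of the
    filter spectrum is multiplied elementwise with the feature spectrum, the
    channels are summed, and the result is transformed back.
    """
    features_f = np.fft.fft2(features, axes=(0, 1))
    filt_f = np.fft.fft2(filt, axes=(0, 1))
    # Elementwise product in the Fourier domain replaces spatial correlation.
    response_f = np.sum(np.conj(filt_f) * features_f, axis=2)
    return np.real(np.fft.ifft2(response_f))
```

If the feature map is a circularly shifted copy of the filter, the peak of the returned map sits at that shift, which is the property the tracker relies on when it searches for the maximum response.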

Remembering that we use two features, based on (2), we have two types of correlation filters, where is the correlation filter from the CS features, and is the correlation filter from the HOG features. For this reason, two response maps and can be obtained. The final response map can then be calculated usingwhere is the response map from the CS features, and is the response map from the HOG features. After is obtained, we can estimate the location of the target object by finding the maximum value of . As shown in Figure 2, there are three types of response maps: a response map from the proposed method using only CS features, a response map from the proposed method using only HOG features, and a response map from the proposed method. The response map is not sharper than the response map . This shape may provide an incorrect decision when we estimate the location of the target object because the maximum value of the response map has a small difference with the second-maximum value. However, when we combine the response map with the response map , the shape of is sharper than . This shape makes the location estimation of the target object more robust than .
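The fusion and localization step described above can be sketched as follows. This is a minimal illustration that assumes both response maps have already been brought onto a common grid; the function names `fuse_and_locate` and `peak_margin` are hypothetical, and the margin between the two largest values is used here only as a simple stand-in for the sharpness argument.

```python
import numpy as np

def fuse_and_locate(response_cs, response_hog):
    """Sum two same-sized response maps and return the map and its peak."""
    if response_cs.shape != response_hog.shape:
        raise ValueError("response maps must share one resolution")
    final = response_cs + response_hog
    row, col = np.unravel_index(np.argmax(final), final.shape)
    return final, (int(row), int(col))

def peak_margin(response):
    """Difference between the largest and second-largest response values."""
    top_two = np.partition(response.ravel(), -2)[-2:]
    return float(top_two[1] - top_two[0])
```

A map with two nearly equal peaks has a small margin, so a little noise can flip the decision; adding a second map that supports only one of the peaks widens the margin and makes the estimated location more stable.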

Further, given the location estimation of the target object, we can conduct a scale estimation of the target object because, during tracking, the scale of the target object may change, and in order to handle this problem, a scale estimation of the target object is required. Therefore, we should estimate the scale of the target object using (5) as follows.where is the numerator in the Fourier domain, is the feature sample based on the scale factor in the Fourier domain, is the denominator, and is the weight parameter controlling the regularization term in (5). The selected scale can be obtained by calculating the maximum value from .
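The scale estimation described above can be sketched as a one-dimensional correlation filter over the scale axis. This sketch assumes the formulation used by DSST-style scale filters, which matches the numerator, denominator, and regularization terms named in the text but is not confirmed to be identical to (5); the function names, the array layout (one row per scale), and the default `lam` value are assumptions.

```python
import numpy as np

def train_scale_model(sample, target):
    """Build numerator and denominator from an (S, D) scale sample.

    sample holds one D-dimensional feature vector per scale; target is the
    desired 1-D output (for example a Gaussian peaked at the current scale).
    """
    sample_f = np.fft.fft(sample, axis=0)
    target_f = np.fft.fft(target)
    numerator = np.conj(target_f)[:, None] * sample_f
    denominator = np.sum(np.abs(sample_f) ** 2, axis=1)
    return numerator, denominator

def estimate_scale(numerator, denominator, sample, lam=0.01):
    """Return the index of the scale with the maximum filter response."""
    sample_f = np.fft.fft(sample, axis=0)
    # lam is the regularization weight; 0.01 is a placeholder value.
    response_f = np.sum(np.conj(numerator) * sample_f, axis=1) / (denominator + lam)
    response = np.real(np.fft.ifft(response_f))
    return int(np.argmax(response)), response
```

Evaluating the filter on the sample it was trained on reproduces the desired output almost exactly, so the peak stays at the current scale; a sample shifted along the scale axis moves the peak by the same amount.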

3. Parameter Update

During tracking, the target object usually changes its appearance, which should be handled to make the tracking algorithm more robust. Because the proposed method is based on a correlation filter, the correlation filter needs to learn the desired output to handle the changes in appearance of the target object. This learning process can be achieved by solving a minimization problem as follows, where is the weight parameter used to control the sample pairs, is the number of sample pairs, is the desired output, is the weight parameter used to control the regularization term, and is the convolution operator. Further, because it is used to control sample pairs, parameter should be updated for each frame. To update this parameter, we can use the following equation, where is the learning rate, and is the index from the previous frame.
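One common way to update per-sample weights with a learning rate is exponential forgetting, sketched below. This is an assumption for illustration only and is not confirmed to be the update equation used in this paper; the function name `update_sample_weights` is hypothetical.

```python
def update_sample_weights(prev_weights, learning_rate):
    """Append a weight for the newest sample and decay the older ones.

    Older weights are scaled by (1 - learning_rate) and the new sample
    receives learning_rate, so the weights always sum to one.
    """
    if not prev_weights:
        # The first sample carries all of the weight.
        return [1.0]
    decayed = [(1.0 - learning_rate) * w for w in prev_weights]
    return decayed + [learning_rate]
```

With this scheme a larger learning rate makes the filter adapt faster to appearance changes, at the cost of forgetting earlier samples sooner.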

For the scale estimation, two parameters, the denominator and the numerator , need to be updated. Further, the numerator can be obtained using (8) as follows, where is a weight parameter, , and is the numerator from the initial frame. Furthermore, the denominator can be obtained using (9) as follows, where is the denominator from the initial frame. In addition, quantities in the Fourier domain are denoted with an overline.
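The numerator and denominator updates can be sketched as running averages in the Fourier domain. This follows the update commonly used with DSST-style scale filters and is an assumption; it is not confirmed to match (8) and (9) term by term. The function name `update_scale_model` and the `rate` parameter are hypothetical.

```python
import numpy as np

def update_scale_model(num_prev, den_prev, sample_f, target_f, rate):
    """Blend the previous scale model with the model from the new frame.

    sample_f is the (S, D) scale sample in the Fourier domain and target_f
    is the (S,) desired output in the Fourier domain.
    """
    num_new = np.conj(target_f)[:, None] * sample_f
    den_new = np.sum(np.abs(sample_f) ** 2, axis=1)
    # Linear interpolation: rate = 0 keeps the old model, rate = 1 replaces it.
    numerator = (1.0 - rate) * num_prev + rate * num_new
    denominator = (1.0 - rate) * den_prev + rate * den_new
    return numerator, denominator
```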

4. Experimental Results

In this section, the experimental results are described to validate the proposed method. Using the VOT2015 benchmark dataset, the proposed method was compared with 55 state-of-the-art tracking algorithms. These 55 state-of-the-art tracking algorithms are as follows: ACT [18], amt [19], AOGTracker [19], ASMS [20], baseline [19], bdf [19], cmil [19], CMT [19], CT [21], DAT [22], DFT [19], DSST [13], dtracker [19], fct [19], fot [23], FragTrack [19], ggt [19], HMMTxD [19], HT [19], IVT [24], kcf_mtsa [19], KCF2 [19], kcfdp [19], kcfv2 [19], L1APG [9], LGT [25], loft_lite [19], LT_FLO [26], LPMT [15], matflow [19], MCT [19], MEEM [27], MIL [4], mkcf_plus [19], muster [28], mvcft [19], ncc [29], OAB [19], OACF [19], PKLTF [19], rajssc [19], RobStruck [19], s3Tracker [19], samf [30], SCBT [31], sKCF [19], sme [19], SODLT [32], srat [19], STC [33], struck [34], sumshift [35], TGPR [36], tric [19], and zhang [19]. Further, to prove the advantages of the proposed method, a comparison among the proposed method, the proposed method using only CS features, and the proposed method using only HOG features was also conducted.

The VOT2015 benchmark dataset consists of 60 videos that have certain problems including camera motion, changes in illumination, motion changes, occlusions, and size changes, as well as videos under normal conditions (empty). In addition, to evaluate the performance in terms of the accuracy and robustness of the tracking algorithm, each video uses its protocols based on the area under curve (AUC). For more details regarding these protocols, refer to [37]. Further, the parameters used by the proposed method, , , , , and , have values of 1, 0.01, 0.001, 0.002, and 0.008, respectively. Parameter is equal to 15, where each scale has a difference of 0.02. Parameter is equal to zero at the initial frame. Furthermore, we implemented the proposed method by using MATLAB on a 3.3 GHz i5-4590 with 4 GB of RAM.
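The set of 15 candidate scales can be sketched as follows. The text states only that adjacent scales differ by 0.02; this sketch assumes geometric spacing with a base of 1.02 centered on the current scale, as is common for scale filters, and an additive spacing would be an equally plausible reading. The function name `scale_factors` is hypothetical.

```python
def scale_factors(num_scales=15, step=0.02):
    """Return scale factors centered on 1.0 (assumed geometric spacing)."""
    half = (num_scales - 1) // 2
    # Exponents run from -half to +half, so the middle factor is exactly 1.
    return [(1.0 + step) ** k for k in range(-half, half + 1)]
```

Each factor multiplies the current target size to produce one candidate size, and the scale estimation step then selects the candidate with the maximum response.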

The convolutional layer is one of the layer types that make up a convolutional neural network (CNN). As described in [38, 39], each neuron in a convolutional layer computes a dot product between its weights and a local region of the input (i.e., an image), and the results of these computations are arranged in a volume. For example, if we use the filter which has size and the number of the filters is three, then the results of these computations in volume become . These results form the output of the first convolutional layer. The second convolutional layer is computed in the same way; it differs from the first in its input, its filter size, and its number of filters. The input of the second convolutional layer can be either the output of the first convolutional layer alone or that output after it has passed through the normalization and pooling layers. Further, the filters in the second convolutional layer are smaller than those in the first. If the output of the first convolutional layer is used as the features, these are called convolutional shallow (CS) features. In this research, we used CS features from the pretrained CNN proposed in [17].
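The relation between filter size, filter count, and output volume can be sketched with the standard output-size formula for a convolutional layer. The input size, filter size, stride, and padding in the usage note are illustrative values only and are not taken from the network in [17]; the function name `conv_output_shape` is hypothetical.

```python
def conv_output_shape(in_height, in_width, filter_size, num_filters,
                      stride=1, padding=0):
    """Output volume (height, width, depth) of one convolutional layer.

    Each spatial side follows (input - filter + 2 * padding) / stride + 1,
    and the depth equals the number of filters.
    """
    def side(length):
        span = length - filter_size + 2 * padding
        if span < 0 or span % stride != 0:
            raise ValueError("filter, stride, and padding do not tile the input")
        return span // stride + 1

    return side(in_height), side(in_width), num_filters
```

For instance, a 32 x 32 input convolved with three 5 x 5 filters at stride 1 and no padding gives a 28 x 28 x 3 volume, so the depth of the CS feature map is set by the number of filters in the first layer.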

Several illustrations of the comparison results among the proposed method, the proposed method using only CS features, and the proposed method using only HOG features are provided in Figure 3. In addition, as shown in Figure 4, for the camera motion problem, the proposed method achieved the highest rank in terms of accuracy and second rank in terms of robustness, whereas the proposed method using only HOG features achieved the highest rank in accuracy but ranked 43rd in robustness. The proposed method using only CS features achieves the same accuracy and robustness ranks as the proposed method. Further, the accuracy and robustness ranks of the DFT tracker are also the same as those of the proposed method. This tracker uses an image descriptor based on distributing fields, and the approach maintains the pixel value information when the objective function is smoothed. Top ranks for both accuracy and robustness were achieved by the rajssc tracker, which is based on a correlation filter and uses a block circulant-structure combined with a Gaussian space response for representing the target object.

For the empty label, that is, normal conditions, the proposed method achieved the highest rank for both accuracy and robustness. Meanwhile, the proposed method using only HOG features ranked first in accuracy and 45th in robustness. The proposed method using only CS features ranked first in accuracy and second in robustness. These results indicate that combining CS and HOG features can make the tracking algorithm more robust than using CS or HOG features alone. The struck and samf trackers achieved the same rank as the proposed method, where the struck tracker is based on a kernelized structured output support vector machine, and the samf tracker is based on a kernelized correlation filter that efficiently utilizes a scale adaptive method.

Further, for changes in illumination, as shown in Figure 4, both the proposed method and the proposed method using only CS features rank first in both accuracy and robustness. Meanwhile, the proposed method using only HOG features ranks first in accuracy and ninth in robustness. These results indicate that CS features are more useful than HOG features for the problem of changes in illumination. Combining CS and HOG features has no significant influence on this particular problem. By comparison, the OACF tracker, which is based on a correlation filter combined with a red-green-blue (RGB) histogram and also uses an adaptive scaling method, ranks first in accuracy and fourth in robustness, the same ranks as the rajssc tracker.

For motion changes, the proposed method achieved the highest rank for both accuracy and robustness. The proposed method using only CS features also achieved the highest rank in accuracy but ranked second in robustness. Further, the proposed method using only HOG features also ranked first in accuracy but only 44th in robustness. Based on this, the proposed method outperforms the variants using only CS features or only HOG features. Furthermore, these results show that adding CS features can significantly improve robustness compared with using only HOG features. The rajssc tracker has the same rank as the proposed method. For the s3Tracker tracking algorithm, which is based on an RGB histogram to represent the target object and also uses an aspect ratio selection. Moreover, the LPMT tracker, which is based on a correlation filter with distractor handling, ranked second in accuracy and third in robustness.

The next problem addressed is occlusions, where the target object is fully or partially occluded. For this problem, the proposed method achieved the highest rank in terms of accuracy. Unfortunately, it ranked only 16th in robustness, which is much lower than the variant using only CS features, which achieved the highest rank in accuracy and second rank in robustness. The proposed method using only HOG features ranks first in terms of accuracy and 27th in robustness. Based on this evidence, for the occlusion problem, the proposed method using only CS features is more robust than both the proposed method and the proposed method using only HOG features. This is because when the target is occluded and the correlation filter is updated, the response from the CS features is more similar to the target object than the others. Top ranks in both accuracy and robustness were achieved by the rajssc and sme trackers, the latter of which is a tracking algorithm that operates based on a score function for selecting the candidate from multiple experts.

The final problem defined in the VOT2015 benchmark dataset is a change in size. For this problem, the proposed method shows better results than the others, achieving the highest rank in both accuracy and robustness. Meanwhile, the proposed method using only CS features achieved the highest rank in accuracy and ranked 11th in robustness. Further, the proposed method using only HOG features achieved the highest rank in accuracy and ranked 50th in robustness. These results show that a combination of CS and HOG features may increase the robustness significantly. Other state-of-the-art tracking algorithms, s3Tracker and muster, achieved second and third ranks for accuracy, respectively, and the highest rank for robustness. These results are shown in Figure 4.

Finally, once the AR ranks have been obtained for camera motion, changes in illumination, motion changes, occlusions, size changes, and normal conditions (empty), the results can be summarized in a ranking plot for the baseline experiment (pooled). This ranking plot is obtained by concatenating the results from all sequences and creating a single rank list. As shown in Figure 5, the proposed method achieved the highest rank in accuracy and second rank in robustness. Meanwhile, the proposed method using only CS features achieved the highest rank in accuracy and ranked seventh in robustness. The proposed method using only HOG features achieved the highest rank for accuracy but ranked only 45th in robustness. This reinforces the idea that combining CS and HOG features makes the tracking algorithm more robust and is very useful when developing a tracking algorithm. By comparison, the sme and sumshift trackers both achieved the highest rank for accuracy and ranked fourth in robustness. Finally, the proposed method ran at about 15 fps.

5. Conclusion

This paper described how to improve the performance of a HOG feature-based visual object tracking algorithm. The proposed method combines the response maps of the HOG and CS features. The CS features are computed by applying a shallow layer of a pretrained CNN to the input. In addition, to handle the differences in resolution, an interpolation approach is used. Further, extensive experiments were conducted using the VOT2015 benchmark dataset, which consists of 60 different videos. The results indicated that the proposed method significantly improves the robustness of a HOG feature-based approach. In addition, based on a comparison with many other state-of-the-art tracking algorithms, the proposed method achieved the highest rank in terms of accuracy and a third rank for robustness.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by BK21PLUS, Creative Human Resource Development Program for IT Convergence.