Abstract

In recent years, visual object tracking has become a very active research field, divided mainly into correlation filter-based tracking and deep learning-based (e.g., deep convolutional neural network and Siamese neural network) tracking. Target tracking algorithms based on deep learning require a large amount of computation and are usually deployed on expensive graphics cards. However, for the abundant monitoring devices in the Internet of Things, it is difficult to capture all the moving targets in each device in real time, so it is necessary to perform hierarchical processing: use correlation filter-based tracking in insensitive areas to alleviate the local computing pressure, and in sensitive areas upload the video stream to a cloud computing platform with faster computing speed to run algorithms based on deep features. In this paper, we focus on correlation filter-based tracking. Among correlation filter-based trackers, the discriminative scale space tracker (DSST) is one of the most popular and typical ones and has been successfully applied in many fields. However, some improvements to DSST still need further study. One is that the algorithm deliberately does not consider target rotation. The other is that extracting histogram of oriented gradient (HOG) features from too many patches centered at the target position, in order to ensure the scale estimation accuracy, imposes a very heavy computational load. To address these two problems, we introduce an alterable patch number for target scale tracking and space searching for target rotation tracking into the standard DSST tracking method and propose a visual object multimodality tracker based on correlation filters (MTCF) to simultaneously cope with in-plane translation, scale, and rotation of the tracked target and to obtain the target's position, scale, and attitude angle at the same time. Finally, experiments on the Visual Tracker Benchmark data set show the effectiveness of the proposed algorithms in multimodality tracking.

1. Introduction

Visual object tracking (VOT), a subfield of computer vision, is the process of continuously estimating the target state through a video image sequence. In recent years, VOT has become a very active research domain due to its extensive applications in many fields such as intelligent surveillance [1], automatic driving [2], and traffic flow monitoring [3], to name a few.

In fields such as security monitoring and control, the traditional network architecture struggles with network delay and security reliability, and thus edge computing technology was born. Tasks with different attributes can be passed to different levels for processing. Zhan [4] shows that the first few feature extraction layers can run on an edge device while the others run in the cloud. Gao [5, 6] divides tasks into different levels according to the business applications, using edge devices at one level.

As Figure 1 shows, for nonsensitive areas, video streams with lower resolutions can be processed on the local device; in medium-sensitivity areas, ordinary-resolution video streams can be processed on the edge device; and in high-risk areas, high-resolution video streams can be processed on the core cloud server, thereby reducing network bandwidth and improving the overall operating efficiency of the system. This article mainly explores the processing of video streams on edge clouds, and the tracking algorithm used is based on correlation filtering.

A number of robust tracking algorithms have been proposed and developed to deal with the problems resulting from occlusion, illumination variation, background clutter, motion blur, and so on [7–14]. These algorithms are divided into the deep learning-based category (DLC) and the correlation filter- (CF-) based category [9].

For DLC, since the papers by Geoffrey Hinton et al. [15, 16] were published, deep learning has become especially popular in the context of deep neural networks and has achieved impressive success in many applications, especially in feature extraction for computer vision. Inspired by such success, various deep learning-based trackers [13, 14, 17–22] have been proposed and developed to cope with the problems encountered in tracking. Although most trackers based on deep neural networks have demonstrated the potential to significantly improve tracking performance, as testified by world VOT competitions [17], there are still some obvious limits. For example, there are few or even no training data available for the tracker because the prior information of the tracked object, or the object bounding box, is usually available only in the first frame. Even if offline pretraining is employed to learn target features and construct a feature set of many targets, it is still quite possible to track a particular object whose features are not contained in the feature set. Nowadays, zero-shot and one-shot learning, as well as the Siamese region proposal network, may be the most effective measures to cope with these problems [23–27].

Correlation filter-based tracking is also a solution. From its beginning with the minimum output sum of squared error (MOSSE) method [10] to the discriminative scale space tracker (DSST) method [12], many improvements have continuously been made, giving CF-based tracking some highlighted properties such as low computational load, robustness to target appearance variations, and high tracking accuracy. However, some improvements to CF-based tracking still need further study. One is that the algorithms deliberately do not consider target rotation. The other is that the real-time property of DSST cannot always be ensured, because extracting histogram of oriented gradient (HOG) features from so many patches centered at the target position, in order to ensure scale estimation accuracy, imposes a very heavy computational load. This inspired an idea: on the premise of ensuring the tracking accuracy, appropriately decrease the number of patches to save time for introducing target rotation into DSST, forming a multimodality tracking. The tracker should then simultaneously cope with in-plane translation, scale, and rotation of the tracked target, which leads us to propose visual object multimodality tracking based on correlation filters (MTCF) to solve these two problems and, at the same time, to obtain the target's position, scale, and attitude angle simultaneously.

In this paper, we design a correlation filter-based tracker that aims to track the target accurately and robustly at a speed of at least 25 frames per second while also tracking the rotation of the target.

2. Related Work

In this section, centering on CF-based tracking, we briefly review some relevant research works that have contributed to CF-based tracking, to highlight our motivations.

The MOSSE method is taken as the earliest real-time CF-based tracker [28]; it is an improved version of the average of synthetic exact filters (ASEF) [29] trained offline to detect objects. The MOSSE tracker is strongly robust to target appearance and environment changes and can achieve a very fast tracking speed. This is because the correlation of images in the time domain is transformed into a multiplication of images in the frequency domain, which greatly reduces the computational complexity and load. However, the MOSSE method uses only grayscale samples to train the CF and mainly focuses on translation without considering scale and rotation.

Based on MOSSE, the circulant structure kernel (CSK) method [30] constructs a circulant matrix of training data by cyclically shifting the target window to maintain dense sampling around the target, rather than random sampling. In addition, the CSK method maps the ridge regression of a linear space to a nonlinear space through a kernel function and simplifies the calculation by solving a dual problem in the nonlinear space to avoid an inverse matrix operation, which reduces the computational complexity and improves the tracking speed. The kernelized correlation filter (KCF) method [11] is an improved version of CSK. It introduces multichannel HOG features into CSK to enhance the feature representation ability and significantly improve the tracking performance. Nevertheless, the KCF method has a major imperfection: it is not robust to scale variation of the target. In addition, for KCF-based tracking, the authenticity of negative samples decreases as the cyclic displacement increases, so the tracker is trained on a portion of unreal samples. To address this issue, Danelljan et al. [18] introduce a spatially regularized term in the objective function of the KCF-based tracker to penalize the filter coefficients near the margins of the bounding box. Based on [18], Dai et al. [28] propose a novel adaptive spatially regularized CF that makes the tracker learn more reliable filter coefficients by fully exploiting the diversity information of different objects in different frames during tracking. However, just as the standard KCF-based trackers do, these two trackers are still not robust to scale variation of the target.

The DSST tracker [12, 31] addresses the scale adaptation problem using a multiscale searching strategy. It divides tracking into translation prediction and scale prediction. Firstly, translation prediction is performed by applying a standard translation filter to the current frame to get the position of the target. Secondly, the target size is estimated by employing a trained scale filter at the target location obtained from the translation filter. The translation filter and the scale filter are two independent filters, and both are based on MOSSE. Although the DSST tracker has improved tracking performance and is robust to target scale variation, there are some obvious limitations to be further perfected. One is that DSST deliberately does not consider target rotation, which has a strong negative impact on tracking performance. The other is that sampling so many patches centered on the target location costs a lot of operation time that is not necessary for guaranteeing the tracking accuracy, which harms the tracking speed.

Besides the tracking method, the features of the tracked target are also a key component of a tracker and have a strong influence on the tracking performance. Generally speaking, the richer the features are, the better the performance of the tracker is. The simplest feature is the intensity matrix of the search image, which is used in MOSSE [10]. SIFT features [32] and HOG features [33] were used in object tracking afterwards. In recent years, deep features [34] have been widely used in object tracking. In this paper, HOG features combined with grayscale features, rather than deep features, are adopted because our focus is on CF-based tracking. We do not adopt SIFT because SIFT is scale-invariant and we need to explicitly capture the size change of the object.

Summarizing the analysis stated above, we propose MTCF to alleviate the imperfections of the relevant CF-based trackers. Aiming to track the target accurately and robustly at a speed of at least 25 frames per second for practical visual object tracking, MTCF consists of 4 tasks. Firstly, based on the standard CF-based translation tracker, determine the target location in the current frame. Secondly, based on DSST, sample several patches (with an alterable number of patches) with different resolutions, centered at the tracked target location determined by the translation CF, figure out the feasible scales for the patches, and seek an optimal decision policy to find the final scale among the feasible scales. Thirdly, based on the standard CF-based translation tracker, design a rotation tracker using space searching. Lastly, integrate the previous 3 tasks to form MTCF.

3. Methodology of the Tracking Design

3.1. Variable Symbols Used in This Paper

In this paper, $f$ denotes the feature of one image patch cropped with a specific bounding box, $h$ denotes the correlation filter, and $g$ denotes the response map of the correlation. In this way, $f_t^{\mathrm{trans}}$ denotes the feature of frame $t$ used to correlate with the translation filter $h^{\mathrm{trans}}$, and we get the translation response map $g_t^{\mathrm{trans}}$.

And $s_t$ denotes the scale of the target that the tracker obtained after frame $t$, and $\theta_t$ denotes the rotation angle of the target after tracking frame $t$.

In terms of the convolution theorem, the correlation in the spatial domain can be transformed into an element-wise multiplication in the frequency domain, which dramatically reduces the correlation computation load. Thus, for computational efficiency, the correlation is computed with the Fast Fourier Transform (FFT) in the frequency domain. So, let uppercase variables be the Fourier transforms of their lowercase counterparts, i.e., $F$, $H$, and $G$ corresponding to $f$, $h$, and $g$, respectively.
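To make this concrete, the following minimal Python/numpy sketch (illustrative only; our actual implementation is in MATLAB, and all function names here are ours) verifies that circular cross-correlation computed directly in the spatial domain equals the inverse FFT of the element-wise product $F \odot \overline{H}$:

```python
import numpy as np

def correlate_spatial(f, h):
    """Circular cross-correlation computed directly: g[d] = sum_m f[m] * h[m - d]."""
    rows, cols = f.shape
    g = np.zeros((rows, cols))
    for dy in range(rows):
        for dx in range(cols):
            g[dy, dx] = np.sum(f * np.roll(np.roll(h, dy, axis=0), dx, axis=1))
    return g

def correlate_fft(f, h):
    """Same correlation via the convolution theorem: O(N log N) instead of O(N^2) per shift."""
    return np.real(np.fft.ifft2(np.fft.fft2(f) * np.conj(np.fft.fft2(h))))

f, h = np.random.rand(16, 16), np.random.rand(16, 16)
assert np.allclose(correlate_spatial(f, h), correlate_fft(f, h))
```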

3.2. Standard Translation Tracker Based on Correlation Filter

As Figure 2 shows, given a video sequence, draw a rectangular bounding box (the red one, of nearly the same size as the target) around the target in the first frame and extract a feature map $f_1$ from the chosen region (the green rectangle, two times the size of the red one). Then train a correlation filter $h_1$ that correlates with $f_1$ to produce an ideal response $g$. In the following frames, use the correlation filter to correlate with the feature map $f_t$ extracted from the chosen region and get a response map as follows:

$$g_t = f_t \star h_{t-1},$$

where $\star$ represents the convolution operation.

In the normal tracking process, there should be one peak in the response map, and the peak position is considered the center of the target (in this sense, tracking executes). The key to tracking is to find a robust feature extractor and to maintain the correlation filter, with an appropriate updating strategy, against a variety of adverse effects such as target appearance transformation, occlusion, and so on.
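As an illustration of this step, here is a minimal numpy sketch (names are ours; `H_conj` stands for the stored filter, assumed to be kept conjugated in the frequency domain) that locates the response peak:

```python
import numpy as np

def locate_peak(feature, H_conj):
    """Correlate the current feature map with the stored filter and return the
    row/column of the response peak, taken as the new target center."""
    response = np.real(np.fft.ifft2(np.fft.fft2(feature) * H_conj))
    return np.unravel_index(np.argmax(response), response.shape)
```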

3.3. Scale Tracker Based on Correlation Filter

Differently from the original DSST, the number of scales $S$ (or the number of image patches) in this paper is an optional positive integer determined by the trade-off between tracking speed and tracking accuracy (i.e., a smaller $S$ is selected if tracking speed takes priority over tracking accuracy, and vice versa). Let $w \times h$ be the shape of the target, and construct $S$ image patches $J_n$ centered on the target position with different scales to form an image patch set

$$\left\{ J_n \;\middle|\; a^n w \times a^n h,\; n = -\left\lfloor \tfrac{S-1}{2} \right\rfloor, \ldots, \left\lfloor \tfrac{S-1}{2} \right\rfloor \right\},$$

where $a$ is the scale step. Resize each $J_n$ into the same shape to form a bounding box set. As Figure 3 shows, instead of extracting one feature map from a bounding box with a fixed scale, the tracker extracts a feature map for each patch from the bounding box set (the number of feature maps is 33 in Figure 3). Each feature map is concatenated into a vector, and all these vectors are combined into a feature map $f^{\mathrm{sca}}$. We design a scale correlation filter $h^{\mathrm{sca}}$ to correlate with the feature map $f^{\mathrm{sca}}$, and the scale where the maximum response takes place is the predicted scale matching the current scale of the target.
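For illustration, a short sketch of how such a patch size set can be generated (the helper name is ours; the exact cropping and resizing code is not shown in the paper):

```python
import numpy as np

def scale_patch_sizes(w, h, S=33, a=1.02):
    """Sizes a^n * (w, h) for n in {-(S-1)/2, ..., (S-1)/2}, DSST-style.
    A smaller S trades scale resolution for speed."""
    ns = np.arange(S) - (S - 1) / 2.0
    return [(a ** n * w, a ** n * h) for n in ns]

# With S = 33 and a = 1.02, the patch sizes span roughly 0.73x to 1.37x the target.
```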

3.4. Rotation Tracker Based on Correlation Filter

The target may rotate during tracking, so we use a rotated bounding box centered on the target to crop every frame. As shown in Figure 4, let $p$ be the center of the target and $\theta_{t-1}$ the current angle the target has rotated, and let $B_t^{\theta}$ denote the bounding box with rotation $\theta$ in frame $t$. Using rotation around the target center, we construct a set of bounding boxes

$$\left\{ B_t^{\theta} \;\middle|\; \theta = \theta_{t-1} + k\,\Delta\theta,\; k = -K, \ldots, K \right\}$$

with the same size as $B_{t-1}$; here $K\,\Delta\theta$ is the given maximum rotation angular displacement and $\Delta\theta$ is the rotation step.

For each bounding box $B_t^{\theta}$ from the set, extract the feature map $f_t^{\theta}$ and correlate it with the rotation correlation filter $h^{\mathrm{rot}}$ to get a maximum response value, where

$$v(\theta) = \max\big( f_t^{\theta} \star h^{\mathrm{rot}} \big).$$

Compare those values and find the largest one to get the predicted rotation angle. Let $\theta_t$ be the tracking result of frame $t$, as follows:

$$\theta_t = \arg\max_{\theta}\, v(\theta).$$
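A minimal sketch of this angular search (the helpers `crop_rotated` and `extract_features`, and all other names, are assumed for illustration and are not part of the paper's code):

```python
import numpy as np

def search_rotation(frame, center, size, theta_prev, step, max_disp,
                    H_rot_conj, crop_rotated, extract_features):
    """Rotation space search: evaluate the rotation filter on patches cropped
    at candidate angles around theta_prev and return the best angle."""
    k_max = int(max_disp // step)
    best_theta, best_value = theta_prev, -np.inf
    for k in range(-k_max, k_max + 1):
        theta = theta_prev + k * step
        patch = crop_rotated(frame, center, size, theta)   # rotated crop
        f = extract_features(patch)                        # e.g., FHOG map
        response = np.real(np.fft.ifft2(np.fft.fft2(f) * H_rot_conj))
        if response.max() > best_value:
            best_theta, best_value = theta, response.max()
    return best_theta
```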

In addition to the method used here, we also envisioned a "1-dimensional correlation rotation tracking" in the Supplementary Materials. However, testing shows that this method requires too much computation and is not suitable for use at edge nodes.

3.5. Multimodality Tracking Based on Correlation Filter

Integrate the translation, scale, and rotation tracking stated in the previous sections to form MTCF, whose iteration procedure at frame $t$ is briefly outlined below with the known parameters obtained in frame $t-1$, including the target position $p_{t-1}$, translation filter $h_{t-1}^{\mathrm{trans}}$, scale filter $h_{t-1}^{\mathrm{sca}}$, scale $s_{t-1}$, rotation filter $h_{t-1}^{\mathrm{rot}}$, and rotation angle $\theta_{t-1}$.

3.5.1. Translation Estimation

(1) Construct the bounding box $B_t^{\mathrm{trans}}$ with the scale $s_{t-1}$, centered at $p_{t-1}$, in frame $t$.
(2) Extract the feature map $f_t^{\mathrm{trans}}$ from $B_t^{\mathrm{trans}}$.
(3) Calculate the correlation map $g_t^{\mathrm{trans}}$ using $F_t^{\mathrm{trans}}$ and $H_{t-1}^{\mathrm{trans}}$.
(4) Obtain the new target position $p_t$ corresponding to the position where the largest correlation value of $g_t^{\mathrm{trans}}$ takes place.

3.5.2. Scale Estimation

(1) Construct $S$ image patches of different scales centered on the target position $p_t$ in frame $t$.
(2) Extract feature map patches from the image patches, concatenate each feature map into a vector, and then combine those vectors to form a feature matrix $f_t^{\mathrm{sca}}$.
(3) Calculate the correlation map $g_t^{\mathrm{sca}}$ using $F_t^{\mathrm{sca}}$ and $H_{t-1}^{\mathrm{sca}}$.
(4) Update the target scale $s_t$ with the optimal scale corresponding to the position where the largest scale correlation value is located.

3.5.3. Rotation Estimation

(1) Construct image patches from the bounding box set centered on the target position $p_t$ with rotation angles $\theta_{t-1} + k\,\Delta\theta$.
(2) Extract a feature map $f_t^{\theta}$ for every patch.
(3) For every feature map $f_t^{\theta}$, correlate it with the rotation filter $h_{t-1}^{\mathrm{rot}}$ and get the maximum response value $v(\theta)$.
(4) Update the target rotation angle $\theta_t$ with the optimal $\theta$ corresponding to the best $v(\theta)$.

3.5.4. Model Update

(1) Construct the bounding box centered on the target position $p_t$ with scale $s_t$ and rotation angle $\theta_t$.
(2) Extract $f_t^{\mathrm{trans}}$, $f_t^{\mathrm{sca}}$, and $f_t^{\mathrm{rot}}$.
(3) Update the translation model.
(4) Update the scale model.
(5) Update the rotation model.

3.5.5. Keep Tracking

Output the tracking results of frame $t$ and proceed to track the next frame.

4. MTCF: The Entire Model

4.1. Translation Tracking Procedure

The simplest correlation-based tracking focuses only on translation of the target. In the first frame, we label a rectangular region centered on the target so that the tracker can extract a feature map of the target appearance. The feature map must maintain a spatial mapping because the tracker uses the position where the maximum response occurs to predict the new target position.

The simplest feature map is the gray intensity matrix of the chosen region of the original frame. Many researchers use a 2-dimensional Hanning window (see Figure 5) to preprocess the primitive intensity matrix. After being processed by the Hanning window, the intensity matrix focuses on the central region of the target and weakens the background information near the bounding box edge. Because in the first frame we draw the bounding box tightly around the target, the tracker may lose some features and behave unstably.
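For reference, the 2-dimensional Hanning window is simply the outer product of two 1-dimensional Hann windows; a short numpy sketch (names are ours):

```python
import numpy as np

def hann2d(rows, cols):
    """2D Hanning window: emphasizes the patch center, suppresses the edges."""
    return np.outer(np.hanning(rows), np.hanning(cols))

patch = np.random.rand(64, 64)           # stand-in for a cropped gray patch
windowed = patch * hann2d(*patch.shape)  # preprocessed input for the filter
```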

To address this issue, the simple way is to expand the search region. Define a parameter $bb$ to determine how many times larger than the target the search window is. As a rule, a greater $bb$ will contribute to extracting more features of the target and making the tracker stable, but much more time will be spent on extracting features from the large search region.

This is clearly demonstrated by "S1" presented in the Supplementary Materials. In this paper, we adopt a trade-off policy when selecting $bb$.

From now on, we will use $B_t^{\mathrm{trans}}$ to represent the translation search window.

From $B_1^{\mathrm{trans}}$ we get $f_1^{\mathrm{trans}}$, and because we need to train the translation correlation filter $h^{\mathrm{trans}}$, an initial ideal response $g$ is required. In prior papers, most researchers take a Gauss-shaped response map as the initialization, as follows:

$$g(x, y) = \exp\!\left( -\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2} \right),$$

where $(x_0, y_0)$ is the target center, and Figure 6 shows an example of $g$.
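A short sketch of constructing such a Gauss-shaped ideal response ($\sigma$ is a design parameter; the value below is illustrative, not the paper's setting):

```python
import numpy as np

def gaussian_response(rows, cols, sigma=2.0):
    """Gauss-shaped ideal response g with its peak at the patch center."""
    y = np.arange(rows) - rows // 2
    x = np.arange(cols) - cols // 2
    Y, X = np.meshgrid(y, x, indexing="ij")
    return np.exp(-(X ** 2 + Y ** 2) / (2 * sigma ** 2))
```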

Though the intensity feature is computationally cheap, it is unstable because it takes advantage of only a little information in the frame. Recently, many deep features (for example, convolutional neural network features) have been introduced to object tracking and behave well in accuracy and robustness. However, they are too computationally expensive, so in this paper we focus on the FHOG feature [36].

We use an FHOG feature extractor to get the feature map $f$. In the translation step, 27 dimensions of FHOG and 1 dimension of intensity feature are taken into account. According to DSST [12], discriminative correlation filters for multidimensional features are applied as follows.

Minimize the cost function

$$\varepsilon = \left\| \sum_{l=1}^{d} h^l \star f^l - g \right\|^2 + \lambda \sum_{l=1}^{d} \left\| h^l \right\|^2 .$$

Here, $g$ is the ideal response of the correlation between the feature map and the filter, and the parameter $\lambda$ weights the regularization term. In the FFT domain, the solution [12] can be written as

$$H^l = \frac{\overline{G} \odot F^l}{\sum_{k=1}^{d} \overline{F^k} \odot F^k + \lambda},$$

where $\overline{\,\cdot\,}$ indicates complex conjugation, $\odot$ denotes element-wise multiplication, and $d$ is the dimension number.

The translation filter can thus be solved as below:

$$H^{\mathrm{trans},\,l} = \frac{\overline{G} \odot F^{\mathrm{trans},\,l}}{\sum_{k=1}^{d} \overline{F^{\mathrm{trans},\,k}} \odot F^{\mathrm{trans},\,k} + \lambda}. \tag{9}$$
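A minimal numpy sketch of this closed-form solution (names are ours; `features` stacks the $d$ feature channels):

```python
import numpy as np

def solve_filter(features, g, lam=1e-2):
    """Closed-form multichannel CF: H^l = conj(G) * F^l / (sum_k conj(F^k) * F^k + lam).
    `features` has shape (d, rows, cols); returns H with the same shape."""
    F = np.fft.fft2(features, axes=(-2, -1))
    G = np.fft.fft2(g)
    denom = np.sum(np.conj(F) * F, axis=0).real + lam
    return np.conj(G)[None] * F / denom[None]
```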

Equation (9) is employed in offline learning to obtain the correlation filter. In practical tracking, the tracker (for example, MOSSE, KCF, or DSST) takes the target position $p_{t-1}$ in frame $t-1$ as the center of the bounding box $B_t$ in frame $t$, extracts the feature map $f_t$ from $B_t$, and then calculates the correlation map

$$g_t = \mathcal{F}^{-1}\!\left( \sum_{l=1}^{d} \overline{H_{t-1}^{l}} \odot F_t^{l} \right) \tag{10}$$

to determine the new target position $p_t$ corresponding to the element with the maximum value in $g_t$; here, $\mathcal{F}^{-1}$ indicates the inverse Fourier transform.

Afterwards, reconstruct the bounding box centered on the new target position, from which the feature map $f_t$ is extracted, and then update the translation CF to get $h_t^{\mathrm{trans}}$. Lastly, an iterative formula for equation (9) is presented as equations (11) and (12) according to [10, 12]:

$$A_t^{l} = (1 - \eta)\, A_{t-1}^{l} + \eta\, \overline{G} \odot F_t^{l}, \tag{11}$$

$$B_t = (1 - \eta)\, B_{t-1} + \eta \sum_{k=1}^{d} \overline{F_t^{k}} \odot F_t^{k}, \qquad H_t^{l} = \frac{A_t^{l}}{B_t + \lambda}, \tag{12}$$

where $\eta$ is the learning rate.
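A sketch of the corresponding running update, assuming the filter is kept as a numerator $A$ and a denominator $B$ in the MOSSE/DSST style (names are ours):

```python
import numpy as np

def update_model(A, B, features, G, eta=0.015):
    """Linear-interpolation update of the filter's numerator A and
    denominator B with learning rate eta."""
    F = np.fft.fft2(features, axes=(-2, -1))
    A_new = (1 - eta) * A + eta * np.conj(G)[None] * F
    B_new = (1 - eta) * B + eta * np.sum(np.conj(F) * F, axis=0).real
    return A_new, B_new  # filter at frame t: H^l = A^l / (B + lam)
```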

4.2. Scale Tracking Procedure

As for the scale tracking procedure, two methods are commonly used: one is called "exhaustive scale tracking," and the other is "1-dimensional correlation filter scale tracking." In this paper, we use the "1-dimensional correlation" method.

In the previous frame, we got the target position $p_{t-1}$ and scale $s_{t-1}$.

Let $w \times h$ be the shape of the target, construct $S$ image patches centered on the target position in terms of the method presented in Section 3, and resize them to form a bounding box set. The FHOG extractor is applied to extract a feature map for each patch from the bounding box set. Each feature map is concatenated into a vector, and all of these vectors are combined into an integrated vector $f^{\mathrm{sca}}$. Estimating the target scale can then be solved by learning a separate 1-dimensional correlation filter $h^{\mathrm{sca}}$ to correlate with $f^{\mathrm{sca}}$. The initial ideal response $g^{\mathrm{sca}}$ is a Gauss-shaped peak, as Figure 7 shows.

The scale with the largest correlation response value is taken as the optimal scale $s_t$.
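A minimal sketch of this 1-dimensional scale estimation step (shapes and names are ours; the scale filter is assumed to be stored conjugated in the frequency domain):

```python
import numpy as np

def best_scale(scale_feature, H_scale_conj, scales):
    """1-D scale estimation: correlate the concatenated multiscale feature
    (shape (d, S): d feature dims, S scales) with the 1-D scale filter and
    return the scale factor with the largest response."""
    F = np.fft.fft(scale_feature, axis=-1)
    response = np.real(np.fft.ifft(np.sum(H_scale_conj * F, axis=0)))
    return scales[int(np.argmax(response))]
```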

Afterwards, extract the feature map $f_t^{\mathrm{sca}}$ from the patches centered on the new target position with the final scale $s_t$, and then update the scale correlation filter to get $h_t^{\mathrm{sca}}$ using equations (13)–(17).

In this process, the parameter "spatial bin size" is set to 4 to save time in the subsequent process, and all FHOG dimensions are used; the length of the scale feature vector is thus determined by the patch size, the bin size, and the number of FHOG dimensions.

Treat each scale's feature vector as one channel of a multidimensional feature, so that the $S$ vectors turn into a 1-dimensional feature over scales.

Construct different groups containing different numbers of patches. The number of patches varies from small to large (e.g., from 10 to 55), and all patches are centered at the tracked target location determined by the translation CF in the current frame. Run the basic DSST [12] on the visual track data set [35], and calculate, for each group, the tracking speed and the tracking accuracy, characterized by the Euclidean distance between the tracking window center and the ground truth center. The experiment results are shown in Figure 8.

From Figure 8, it can be seen that the tracking accuracy and speed differ with different numbers of patches. A larger number of patches corresponds to a lower tracking speed, and vice versa. Thus, on the premise of ensuring the tracking accuracy, we appropriately select the number of patches in order to save time for the introduction of target rotation into DSST to form a multimodality tracking. Experiments also show that most of the time is spent in the feature extraction module.

4.3. Rotation Space Tracking Procedure

Set the initial target attitude angle in the first frame, and construct a set of bounding boxes as described in the previous section in each successive frame. The FHOG extractor is applied to extract a feature map $f_t^{\theta}$ for each patch from the bounding box set. Estimating the target rotation can be solved by learning a separate rotation correlation filter. Train a single rotation filter $h^{\mathrm{rot}}$ as the similarity function to compute the maximum correlation response $v(\theta)$ for each feature map $f_t^{\theta}$. Therefore, the best tracking angle is calculated by using the following equation:

$$\theta_t = \arg\max_{\theta}\, v(\theta).$$

Afterwards, extract the feature map $f_t^{\mathrm{rot}}$ from the bounding box centered on the new target position with the final rotation angle $\theta_t$, and then update the rotation correlation filter to get $h_t^{\mathrm{rot}}$ using the following equations:

$$A_t^{\mathrm{rot}} = (1 - \eta)\, A_{t-1}^{\mathrm{rot}} + \eta\, \overline{G} \odot F_t^{\mathrm{rot}}, \qquad B_t^{\mathrm{rot}} = (1 - \eta)\, B_{t-1}^{\mathrm{rot}} + \eta \sum_{k=1}^{d} \overline{F_t^{\mathrm{rot},\,k}} \odot F_t^{\mathrm{rot},\,k}.$$

We take Figure 9 as an example to demonstrate our rotation angle search, with the parameter values given there. Construct a set of bounding boxes with 5 patches, and train the rotation correlation filter using the samples in the first frame. Figure 10 shows the correlation response for each patch, where one angle corresponds to the highest response. Thus, we can conclude that this angle is the best predicted rotation angle in Figure 9, which demonstrates the effectiveness of our proposed rotation search method.

In this process, how to set the maximum angular displacement and the rotation step $\Delta\theta$ is very important for obtaining good tracking performance, including tracking speed and tracking accuracy. A greater maximum angular displacement and a smaller rotation step will contribute to good tracking accuracy, but much more time will be spent on extracting features of the tracked target, which has a negative influence on the tracking speed. This is clearly demonstrated by "S2" presented in the Supplementary Materials. As a rule, these two parameters are fixed by experiments according to the requirements of the tracking tasks.

In this paper, we also adopt such a policy.

5. Experiment

5.1. Experiment Setup

In this paper, our method is implemented in MATLAB R2019a on a Windows 10 system. The experiments are conducted on a PC with an Intel Xeon® 2.4 GHz CPU and 63.9 GB RAM. The data set is selected from the visual track data set [35]. Our experiment is divided into 3 groups with different parameters. Together, the groups are used to verify our proposed approach: on the premise of ensuring the tracking accuracy, appropriately decrease the number of patches in order to save time for the introduction of target rotation into DSST to form a multimodality tracking; to verify the effectiveness of our proposed rotation tracking algorithm; and to demonstrate the whole tracking performance of our proposed visual object multimodality tracking algorithm based on correlation filters.

In the experiment of each group, the visual track data set is selected to have target translation, scale, and rotation simultaneously. The number of scales $S$, the scale factor $a$, and the learning rate $\eta$ are kept unchanged in each group and are fixed as (33, 1.02, 0.015) and (27, 1.0247, 0.015), respectively, which means that the maximum and minimum scale fields of the two groups are the same, as Figure 11 shows. We test the influence of different sizes of the searching window in the Supplementary Materials.

5.2. Experiment of the First Group

In this group experiment, values are selected for the maximum angular displacement and the rotation step. As a result, the tracking speed is 31 fps, and the experiment results are shown in Figures 12 and 13, consisting of some typical tracking frames.

From Figure 11, it can be seen that appropriately decreasing the number of patches can fully save the time needed for the introduction of target rotation into DSST to form a multimodality tracking, on the premise of ensuring the tracking accuracy, and that our proposed rotation tracking algorithm works well.

5.3. Rotation Tracking Performance Test

In this group experiment, the maximum angular displacement is selected to be 12, and the rotation step is selected to be 4. As a result, the tracking speed is 29 fps, and the experiment results are shown in Figure 13, consisting of some typical tracking frames.

In this group experiment, the tracking speed is 29 fps, which is lower than that in the first group experiment, because the rotation step is selected to be 4, which means the number of rotated bounding boxes is increased, resulting in much more time being spent on extracting their feature maps. But our proposed visual tracker still works well in tracking the target with translation, scale, and rotation, as shown in Figure 13. From this perspective, we can say that the rotation step can be appropriately decreased if tracking accuracy is preferred, and vice versa.

5.4. Multimodality Tracking Performance Test

In both of the two groups of experiments, our proposed MTCF algorithm is performed on the visual track data set [35] to demonstrate the multimodality tracking performance; the tracking results are shown in Figure 14.

From Figure 14, it can be seen that our proposed MTCF has good multimodality tracking performance, which can enable us to obtain the position, scale, and attitude angle of the tracked target simultaneously.

The generalization ability of this algorithm remains at the same level as DSST and depends heavily on the HOG feature extraction algorithm.

6. Conclusion and Future Work

In this paper, on the premise of ensuring the tracking accuracy, we introduce an alterable patch number for target scale tracking and space searching for target rotation tracking into the standard DSST tracking method, and we propose the multimodality tracker MTCF to simultaneously cope with in-plane translation, scale, and rotation of the tracked target and to obtain the target's position, scale, and attitude angle at the same time. Experimental results demonstrate that the proposed multimodal target tracking algorithm MTCF (1) can reach an approving tracking speed that largely exceeds the 25 fps needed for practical visual object tracking, by appropriately decreasing the number of patches for target scale tracking, and (2) can obtain good tracking performance for translation, scale, and rotation simultaneously. In the future, our work will focus on the distributed hardware and software implementation of the proposed multimodal comprehensive tracking algorithm.

For terminal devices not equipped with GPU units, low-resolution video is used to reduce the computational pressure of extracting target features. Edge devices with certain computing capabilities are responsible for the main target tracking tasks. Finally, for a few critical and high-risk areas, the network bandwidth saved by the above two levels is used to upload video to the central cloud processor for computation, achieving hierarchical, coordinated governance.

Data Availability

All the source codes and related pictures will be uploaded to GitHub and will be available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61772575, National Key R&D Program of China under Grant 2017YFB1402101, and Independent Research Projects of Minzu University of China.

Supplementary Materials

S1: influence of different sizes of the searching window and analysis of the results. S2: 1-dimensional correlation rotation tracking. (Supplementary Materials)