Abstract

This paper studies moving object tracking in satellite videos. In satellite videos, the object size in the images may be small, the object may be partially occluded, and the image may contain regions of densely distributed similar objects. To handle these problems, this paper puts forward a kernelized correlation filter based on the color-name feature and Kalman prediction. The original image is mapped into the color-name feature space so that the tracker can process the image with multichannel color features. The Kalman filter is used to predict the position of the moving object during tracking, and the detection area is determined according to the predicted position. The Kalman filter is then updated with the detection results to improve the tracking accuracy. The proposed algorithm is tested on Jilin-1 datasets. Experimental results show that, compared with seven other tracking algorithms, the proposed algorithm is more robust in several complex situations such as rapid target motion and similar object interference. It is also shown that the proposed algorithm avoids tracking failure when the moving object is partially occluded.

1. Introduction

With the development of remote sensing technologies, earth observation satellites have been extensively applied in many fields [1–7]. However, traditional earth observation satellites can only take a single image of a certain area at a time. Valuable dynamic information about the target area is hard to obtain from static medium- and high-resolution optical remote sensing data alone, which may limit the reconnaissance capability of earth observation satellites in emergencies [8].

On the other hand, video satellites can overcome this limitation of traditional earth observation satellites and obtain images with high temporal resolution [9–11]. A video satellite stares at a fixed target area on the ground and acquires a continuous sequence of images, from which dynamic object information can be obtained directly. Continuous monitoring by high-resolution satellites over a certain time range has thus been realized [12]. Owing to these advantages, video satellites are used to observe and track the states of moving objects on the ground [13] and have wide application potential in fields such as real-time vehicle monitoring [8, 14, 15], rapid response to natural disaster emergencies [16], and major engineering monitoring [17]. In recent years, representative video satellites include SkySat-1 and SkySat-2 in the United States [18, 19], the TUBSAT series from the Technical University of Berlin [20], and Jilin-1 [7, 21] and Tiantuo-2 (TT-2) [22] in China.

Moving object tracking (MOT) in ordinary videos has long been a research hotspot in computer vision. MOT is widely used in automatic surveillance, autonomous driving, human-computer interaction, and so on [23, 24]. The task of MOT is to predict the size and position of the object in subsequent frames based on its size and position in the initial frame. Much research has been devoted to this problem, and numerous algorithms have been developed for accurate tracking in ordinary videos [25–30]. The commonly used visual object tracking methods can be divided into generative methods [25–27] and discriminative methods [28–30].

On the one hand, the generative methods establish an object model describing the real-world object and search in the next frame for the position with the highest similarity to the object template [31]. Many breakthroughs have been made, and several generative methods have been utilized in MOT, including mean shift, sparse coding, dictionary learning, particle filters, and sliding windows [32–37]. On the other hand, the discriminative methods regard object tracking as a detection problem, also known as tracking by detection (see [38, 39] and the references therein). The discriminative methods generally train a classifier in the first frame to separate the object from its surrounding background. In particular, to cope with complex and changing backgrounds, the discriminative methods build an online discriminative classifier model to distinguish objects from cluttered backgrounds, providing more effective features and avoiding unwanted model drift. Recently, some representative machine learning techniques have been adopted into the discriminative methods, such as Boosting, Support Vector Machines (SVM), Multiple Instance Learning (MIL), Random Forests, Semisupervised Learning, and Structured Output Support Vector Machines (SOSVM) [40–42]. A theoretical framework of dense sampling in tracking-by-detection is presented in [43]. In [44], a tracking-learning-detection (TLD) algorithm is proposed, where learning and detection are introduced into long-term tracking of objects in videos to enhance the tracking accuracy. Hare et al. [45] propose an adaptive tracking-by-detection method called STRUCK, which exploits a structured output SVM with Gaussian kernels to accurately locate the objects. In [46], a cooperative-model tracking algorithm based on sparse representation is presented, which is suitable for scenarios with object occlusion and blur.

Apart from the machine-learning-based discriminative methods, the kernelized correlation filter (KCF) can accelerate the calculation and improve the tracking accuracy simultaneously [43, 47, 48]. In particular, KCF-based methods can efficiently handle objects in changing environments [47]. A scale-adaptive tracker based on separate discriminative correlation filters is proposed in [49], where the computational cost is reduced. Galoogahi et al. [50] present a background-aware correlation filter (BACF) for real-time visual tracking, where handcrafted features are introduced to effectively describe the changing background. In [48], a tracker combining the KCF with Normalized Cross-Correlation (NCC) template matching is proposed for long-term target tracking from UAVs to improve the tracking performance.

In addition, note that the moving objects in satellite videos include vehicles, airplanes, rockets, and ships. Compared with moving target tracking in ordinary videos, real-time object tracking in satellite videos must overcome three main challenges [51, 52]. (1) Compared with ordinary ground-level videos, the objects in satellite videos are small and may lose their effective features (see Figure 1). For instance, the length of a car is about 4–5 m in real life, while it occupies only about four pixels in a satellite video; such small objects are difficult to track reliably. (2) Since remote sensing images cover a relatively large field of view, the contrast between the background and the objects is low (see Figure 1(a)). (3) The background of a satellite video sequence may be blurred and cluttered (see Figure 1(b)). Besides, there may be more than one moving target in the image sequence, and the targets exhibit high similarity, serious mutual interference, and low resolution, which can result in partial or complete occlusion between moving targets (see Figure 1(c)).

Due to the above challenges and the increasing demand for MOT in the field of remote sensing, several moving object detection/tracking algorithms have been carefully designed for MOT in satellite videos [51–55]. Lei and Guo [53] propose a road-masking and Gaussian-mixture method to achieve detection and tracking of multiple objects in remote sensing satellite videos, which improves the reconnaissance capability of the remote sensing satellite for dynamic small moving targets. In [51], a fusion tracker is introduced, where the kernel correlation filter and the three-frame-difference method are combined for satellite videos to improve the tracking performance. Li and Man [52] put forward an optical-flow-based detection algorithm with video attention saliency for moving ships in satellite videos, where the Gabor filter is utilized to extract texture features and the registration of sea-scene images can be avoided. Liu et al. [54] present a kernelized correlation filter in which multifeature fusion and motion trajectory compensation are employed for satellite videos to mitigate tracking drift. In [55], an object tracking algorithm is proposed for high-resolution multispectral satellite images with multiangular observation capability, where a novel regional operator is constructed and the tracking capability is verified on WorldView-2 satellite images.

However, the above trackers [51, 52, 54, 55] generally rely on raw pixel or gradient information (such as HSV or HOG features) and make limited use of color information. In fact, many satellite videos contain discriminative color-name (CN) information. Compared with other color features, CN features show better discriminative capability in MOT [56, 57]. In addition, undesirable environmental factors, such as shadows, similar backgrounds, and other interferences, make MOT even more complicated.

On this foundation, aiming at MOT in satellite videos, this paper proposes a kernelized correlation filter based on color-name features and Kalman prediction (CNK-KCF). The images are mapped into the CN feature space so that the tracker can process the image with multichannel color features. Moreover, the KCF accelerates the computation in the Fourier domain based on the circulant matrix structure and the kernel trick. Meanwhile, the Kalman filter corrects and updates the predicted position of the moving object and, together with the kernelized correlation filter based on color-name features (CN-KCF), improves the tracking accuracy. The proposed algorithm is tested on Jilin-1 datasets. The experimental results show that the proposed method is robust to environmental factors such as partial occlusion, background similarity, and rapid motion.

In summary, the contributions of this paper are twofold. (1) For MOT in satellite videos, a framework named CNK-KCF is carefully designed based on the CN feature and the Kalman filter. The proposed CNK-KCF algorithm is stable in complex situations such as rapid target motion, occlusion, and similar object interference, and it solves the problem of tracking failure when a moving object is partially occluded. (2) In the experimental section, the CNK-KCF algorithm is compared with seven other algorithms, and it is shown that CNK-KCF achieves better tracking precision and success rate for the airplane, rocket, and ship targets in Jilin-1 satellite videos. The performance of each type of tracker in satellite videos is analyzed in detail.

The rest of this paper is organized as follows. Section 2 presents the design of the CNK-KCF for MOT in satellite videos. Section 3 introduces the experimental results and their analysis. Finally, Section 4 concludes this article.

2. Materials and Methods

In this section, to solve the problem of MOT in satellite videos, the CNK-KCF is developed. As shown in Figure 2, the CNK-KCF mainly consists of three parts: (1) the KCF [58, 59], (2) the CN feature, and (3) Kalman filter (KF) prediction. In the rest of this section, each part is described in a separate subsection.

2.1. Kernelized Correlation Filter (KCF)

The KCF has a relatively low computational cost and relatively high tracking accuracy, especially under rapid deformation, and it is also robust to illumination changes. Hence, the KCF is suitable for MOT in satellite videos. The procedure of the KCF tracking algorithm is as follows. Firstly, the tracking object is selected in the initial frame of the satellite video to construct the tracking area. Then, according to circulant matrix theory, the tracking area is cyclically shifted, and the kernel function is applied to calculate the similarity between candidate regions around the target location and the tracked object. Finally, the area with the largest output response is selected as the new target position, and the classifier is trained in the Fourier domain to reduce the calculation time.

In the KCF, the regression function
$$ f(z) = w^{T} z \qquad (1) $$
is trained to obtain the weight coefficients $w$, where $w$ is an $n$-dimensional vector. Correspondingly, the cost function to be minimized is
$$ \min_{w}\ \sum_{i}\bigl(f(x_{i}) - y_{i}\bigr)^{2} + \lambda \|w\|^{2}, \qquad (2) $$
where $X$ (with rows $x_{i}^{T}$) is the data matrix, $y$ is the desired output, and $\lambda$ is the regularization parameter to prevent overfitting. Based on [64, 65], the extreme value of Equation (2) can be obtained as
$$ w = \bigl(X^{T}X + \lambda I\bigr)^{-1} X^{T} y, \qquad (3) $$
where $I$ is an identity matrix. In Equation (3), calculating the inverse matrix $\bigl(X^{T}X + \lambda I\bigr)^{-1}$ is very time-consuming. Therefore, the calculation is performed in the Fourier domain, and Equation (3) can be rewritten in the complex field as
$$ w = \bigl(X^{H}X + \lambda I\bigr)^{-1} X^{H} y, \qquad (4) $$
where $X^{H}$ represents the Hermitian transpose of $X$. Obviously, if $X$ is a real matrix, Equation (4) is equivalent to Equation (3).

Besides, in order to accelerate the calculation for MOT, the cyclic shift is introduced. The cyclic shift operator is a permutation matrix, which can be used to simulate the one-dimensional translation of a vector $x = [x_{1}, x_{2}, \ldots, x_{n}]^{T}$. The permutation matrix can be written as
$$ P = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \qquad (5) $$
so the cyclic shift of $x$ can be presented as
$$ Px = [x_{n}, x_{1}, x_{2}, \ldots, x_{n-1}]^{T}. \qquad (6) $$

Since the product $Px$ shifts $x$ by one element, larger translations are achieved by chaining shifts through the matrix power $P^{u}x$. The same signal is obtained periodically every $n$ translations due to the cyclic property of the circulant matrix, which means that the set of all shifted signals can be represented as
$$ \bigl\{P^{u}x \mid u = 0, 1, \ldots, n-1\bigr\}. \qquad (7) $$

Correspondingly, the data matrix $X$ becomes the circulant matrix generated by $x$ and is denoted as
$$ X = C(x) = \begin{bmatrix} x_{1} & x_{2} & x_{3} & \cdots & x_{n} \\ x_{n} & x_{1} & x_{2} & \cdots & x_{n-1} \\ x_{n-1} & x_{n} & x_{1} & \cdots & x_{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{2} & x_{3} & x_{4} & \cdots & x_{1} \end{bmatrix}. \qquad (8) $$

The cyclic displacement of a one-dimensional vector is illustrated in Figure 3. All circulant matrices can be diagonalized by the discrete Fourier transform (DFT), regardless of the generating vector $x$. Therefore, $X$ can be diagonalized as
$$ X = F\,\mathrm{diag}(\hat{x})\,F^{H}, \qquad (9) $$

where the constant matrix $F$ is the DFT matrix, which does not depend on $x$, the Hermitian transpose of $F$ is represented as $F^{H}$, and $\mathrm{diag}(\hat{x})$ is a diagonal matrix. Accordingly, the vector $\hat{x}$ is the DFT of $x$ and is defined as
$$ \hat{x} = \mathcal{F}(x) = \sqrt{n}\,Fx. \qquad (10) $$

In the following, the DFT of a vector will be denoted by the hat symbol (^). Due to the diagonalization property of the matrix $X$, the matrix $X^{H}$ can be obtained as
$$ X^{H} = F\,\mathrm{diag}(\hat{x}^{*})\,F^{H}, \qquad (11) $$
where $\hat{x}^{*}$ is the complex conjugate of $\hat{x}$ and $*$ denotes the conjugate operator. In addition, $X^{H}X$ can be regarded as a noncentered covariance matrix. From (9) and (11), it follows that
$$ X^{H}X = F\,\mathrm{diag}(\hat{x}^{*})\,F^{H}F\,\mathrm{diag}(\hat{x})\,F^{H}. \qquad (12) $$

Because diagonal matrices are symmetric, taking the Hermitian transpose only leaves the complex conjugation $\hat{x}^{*}$, and the factor $F^{H}F = I$ can be eliminated. Moreover, operations on diagonal matrices are performed element-wise, so Equation (12) can be rewritten as
$$ X^{H}X = F\,\mathrm{diag}\bigl(\hat{x}^{*} \odot \hat{x}\bigr)\,F^{H}, \qquad (13) $$
where $\odot$ is the element-wise product of two vectors. It can be seen from (13) that the original complex matrix operations are transformed into simple vector element-wise products based on the diagonalization property of the circulant matrix. Based on (4) and (13), the DFT of $w$ is obtained as
$$ \hat{w} = \frac{\hat{x}^{*} \odot \hat{y}}{\hat{x}^{*} \odot \hat{x} + \lambda}, \qquad (14) $$
where the fraction denotes element-wise division. $w$ can then be recovered in the spatial domain by the inverse DFT.
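As a concrete illustration of Equation (14), the following minimal numpy sketch trains the linear filter with purely element-wise operations in the Fourier domain. It assumes a single-channel 2-D patch x and a Gaussian-shaped desired response y of the same size; it is a simplified illustration, not the authors' MATLAB implementation.

```python
import numpy as np

def train_linear_filter(x, y, lam=1e-4):
    """Closed-form ridge regression over all cyclic shifts of x (Equation (14)):
    w_hat = (conj(x_hat) * y_hat) / (conj(x_hat) * x_hat + lambda)."""
    xf = np.fft.fft2(x)                       # DFT of the base sample patch
    yf = np.fft.fft2(y)                       # DFT of the desired response
    wf = np.conj(xf) * yf / (np.conj(xf) * xf + lam)
    return np.real(np.fft.ifft2(wf))          # filter recovered by the inverse DFT
```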

In addition, a nonlinear mapping function $\varphi(\cdot)$ is employed to make the mapped samples linearly separable in the new space. Therefore, the regression function can be transformed into
$$ f(z) = w^{T}\varphi(z). \qquad (15) $$

Furthermore, the kernel trick can be used to map the input of the linear problem to a nonlinear feature space $\varphi(x)$. Correspondingly, the solution $w$ is rewritten as
$$ w = \sum_{i}\alpha_{i}\varphi(x_{i}), \qquad (16) $$
where $\alpha = [\alpha_{1}, \alpha_{2}, \ldots, \alpha_{n}]^{T}$ is a column vector. Therefore, the parameters of the solution change from $w$ to $\alpha$. Meanwhile, Equation (2) can be rewritten as
$$ \min_{\alpha}\ \sum_{i}\Bigl(\sum_{j}\alpha_{j}\varphi^{T}(x_{j})\varphi(x_{i}) - y_{i}\Bigr)^{2} + \lambda\Bigl\|\sum_{j}\alpha_{j}\varphi(x_{j})\Bigr\|^{2}. \qquad (17) $$

Besides, the dot products can be defined as $\varphi^{T}(x)\varphi(x') = \kappa(x, x')$, where $\kappa$ is the Gaussian kernel expressed as
$$ \kappa(x, x') = \exp\Bigl(-\frac{\|x - x'\|^{2}}{\sigma^{2}}\Bigr), \qquad (18) $$
where $\varphi$ is the mapping function into the high-dimensional space. Furthermore, the dot products between all pairs of samples are stored in the kernel matrix $K$, which is defined as
$$ K_{ij} = \kappa(x_{i}, x_{j}). \qquad (19) $$

The complexity of the regression function increases with the sample size, so $f(z)$ can be rewritten based on (16) as
$$ f(z) = w^{T}\varphi(z) = \sum_{i=1}^{n}\alpha_{i}\,\kappa(z, x_{i}). \qquad (20) $$

Meanwhile, Equation (17) can be rewritten in matrix form as
$$ \min_{\alpha}\ \|K\alpha - y\|^{2} + \lambda\,\alpha^{T}K\alpha. \qquad (21) $$

The optimal solution of (21) is given by
$$ \alpha = (K + \lambda I)^{-1}y. \qquad (22) $$

The kernel matrix $K$ is circulant for any permutation matrix $M$ if the kernel function satisfies the condition
$$ \kappa(x, x') = \kappa(Mx, Mx'). \qquad (23) $$
Kernels that treat all dimensions of the data equally, such as the Gaussian kernel in (18), fulfill this condition.

In addition, if the kernel makes $K$ circulant, Equation (22) can be diagonalized as
$$ \alpha = \bigl(F\,\mathrm{diag}(\hat{k}^{xx})\,F^{H} + \lambda I\bigr)^{-1}y, \qquad (24) $$
where $k^{xx}$ is the first row of the kernel matrix, defined element-wise as
$$ k_{i}^{xx} = \kappa\bigl(x, P^{\,i-1}x\bigr). \qquad (25) $$

Furthermore, $K$ is a circulant matrix and can be expressed as
$$ K = C\bigl(k^{xx}\bigr), \qquad (26) $$
so it further follows that
$$ \hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}. \qquad (27) $$
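To tie the derivation together, the following numpy sketch implements the Gaussian kernel correlation of Equation (18), the training step of Equation (27), and the detection step that picks the maximum response as the new target position, as described at the beginning of this section. It assumes single-channel patches and is a minimal illustration rather than the authors' implementation.

```python
import numpy as np

def gaussian_correlation(xf, yf, sigma):
    """Gaussian kernel correlation of two patches given their 2-D DFTs."""
    n = xf.size
    c = np.real(np.fft.ifft2(xf * np.conj(yf)))   # cross-correlation via the convolution theorem
    xx = np.real(np.vdot(xf, xf)) / n             # squared norm of x (Parseval)
    yy = np.real(np.vdot(yf, yf)) / n             # squared norm of y
    d = np.maximum(xx + yy - 2.0 * c, 0.0) / n    # squared distances to all cyclic shifts
    return np.exp(-d / sigma ** 2)

def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat^{xx} + lambda), Equation (27)."""
    xf = np.fft.fft2(x)
    kf = np.fft.fft2(gaussian_correlation(xf, xf, sigma))
    return np.fft.fft2(y) / (kf + lam), xf

def detect(alphaf, xf, z, sigma):
    """Response map over all cyclic shifts of the candidate patch z;
    the new target position lies at the maximum of the response."""
    zf = np.fft.fft2(z)
    kzf = np.fft.fft2(gaussian_correlation(zf, xf, sigma))
    return np.real(np.fft.ifft2(alphaf * kzf))
```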

2.2. KCF Tracking Algorithm Based on Color-Name (CN) Feature

The traditional KCF algorithm relies on raw pixel values and the histogram of oriented gradients (HOG) feature; the raw pixel feature converts the image to grayscale and treats the pixel gray value as the image feature. However, the generation of the HOG feature descriptor is lengthy, which results in slow speed and poor real-time performance for MOT in satellite videos. In addition, the HOG descriptor is relatively sensitive to noise, so it is difficult to handle occlusion based only on gradient characteristics. Compared with the raw pixel feature and the HOG feature, the CN feature has better stability. As a result, the proposed algorithm adopts the CN statistical feature to extract the image feature.

The CN feature space is a special color space based on a latent probability model. The CN space contains 11 color channels: yellow, red, black, blue, gray, pink, white, brown, green, orange, and purple. In mathematical form, for a color image $I$, let $I(m, n)$ denote the RGB pixel value at position $(m, n)$; the image is mapped into the CN space so that $I(m, n)$ is converted into an 11-dimensional (11D) probability feature vector $c(m, n)$. Specifically, each RGB value is represented by 11 color-name probabilities summing to 1, so as to realize a low-dimensional extraction of the color information. The mapping model can be expressed as a Gaussian-weighted aggregation over a local region:
$$ c(m, n) = \frac{1}{Z}\sum_{(u, v)\in R(m, n, r)} g_{\sigma}(u - m, v - n)\,p\bigl(\mathrm{cn} \mid I(u, v)\bigr), \qquad (28) $$
where $R(m, n, r)$ denotes the region with $(m, n)$ as the center and $r$ as the radius, $g_{\sigma}$ is a Gaussian function with standard deviation $\sigma$, $p(\mathrm{cn} \mid I(u, v))$ is the color-name probability of the pixel at $(u, v)$, and $Z$ is a normalization constant. However, in the tracking process, since not all of the useful object information is carried by the 11D color attributes, the 11D color attributes are first reduced to 10D. Subsequently, PCA is employed to reduce the 10D features to 2D, which reduces the computation and accelerates the algorithm.
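A minimal sketch of the RGB-to-CN mapping is given below. It assumes the common look-up-table formulation in which each RGB value, quantized to 32 levels per channel, indexes a row of 11 color-name probabilities; the table W2C here is a random placeholder, whereas in practice the published color-naming mapping would be loaded.

```python
import numpy as np

# Placeholder for the 32768 x 11 color-naming table: each row holds the
# probabilities of the 11 color names for one quantized RGB bin.
W2C = np.random.dirichlet(np.ones(11), size=32 ** 3)

def rgb_to_cn(image):
    """Map an H x W x 3 uint8 image to an H x W x 11 color-name probability map."""
    r = image[..., 0].astype(np.int64) // 8       # quantize each channel to 32 levels
    g = image[..., 1].astype(np.int64) // 8
    b = image[..., 2].astype(np.int64) // 8
    index = r + 32 * g + 32 * 32 * b              # one table index per pixel
    return W2C[index]                             # look up the 11-D probability vectors
```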

Suppose that $x_{t}$ is the CN feature extracted from the target region in the $t$-th frame and $\hat{x}_{t}$ is the Fourier transform of $x_{t}$. To reduce dimensions, the dimension-reduction operation is given as
$$ \tilde{x}_{t} = B_{t}^{T}x_{t}, \qquad (29) $$

where $\tilde{x}_{t}$ is the CN feature after the dimension-reduction operation in the $t$-th frame and $B_{t}$ is the dimension-reduction matrix. To obtain the dimension-reduction matrix of Equation (29), the reconstructed minimum-cost function is taken as the decision function:
$$ \varepsilon_{t}^{\mathrm{total}} = \alpha_{t}\,\varepsilon_{t}^{\mathrm{data}} + \sum_{j=1}^{t-1}\alpha_{j}\,\varepsilon_{j}^{\mathrm{smooth}}, \qquad (30) $$

where the $\alpha_{j}$ ($j = 1, \ldots, t$) are weight coefficients used in the solution of the dimension-reduction matrix of Equation (29). Meanwhile, since relying on the current frame alone gives poor discriminant performance, the smoothness term $\varepsilon_{j}^{\mathrm{smooth}}$ is introduced in (30) to increase the robustness of $B_{t}$, where the first $t-1$ frames are added to the training. The forms of $\varepsilon_{t}^{\mathrm{data}}$ and $\varepsilon_{j}^{\mathrm{smooth}}$ are
$$ \varepsilon_{t}^{\mathrm{data}} = \frac{1}{MN}\sum_{m, n}\bigl\|x_{t}(m, n) - B_{t}B_{t}^{T}x_{t}(m, n)\bigr\|^{2}, \qquad \text{(31a)} $$
$$ \varepsilon_{j}^{\mathrm{smooth}} = \sum_{k=1}^{s}\mu_{j}^{(k)}\bigl\|b_{j}^{(k)} - B_{t}B_{t}^{T}b_{j}^{(k)}\bigr\|^{2}. \qquad \text{(31b)} $$

In Equations (31a) and (31b), $x_{t}(m, n)$ represents the CN feature at position $(m, n)$ of the $M \times N$ target region in the $t$-th frame, $b_{j}^{(k)}$ is the $k$-th basis vector (column) of the earlier dimension-reduction matrix $B_{j}$, $s$ is the total number of basis vectors, and $\mu_{j}^{(k)}$ is a nonnegative weight. Accordingly, based on Equations (31a) and (31b), Equation (30) can be rewritten as
$$ \varepsilon_{t}^{\mathrm{total}} = \frac{\alpha_{t}}{MN}\sum_{m, n}\bigl\|x_{t}(m, n) - B_{t}B_{t}^{T}x_{t}(m, n)\bigr\|^{2} + \sum_{j=1}^{t-1}\alpha_{j}\sum_{k=1}^{s}\mu_{j}^{(k)}\bigl\|b_{j}^{(k)} - B_{t}B_{t}^{T}b_{j}^{(k)}\bigr\|^{2}. \qquad (32) $$

Subsequently, with the dimension-reduction matrix $B_{t}$ obtained by minimizing (32), the projected appearance model and the filter coefficients are updated by linear interpolation between their previous values and the values computed in the current frame, where $\gamma$ is the learning rate parameter that controls the interpolation.
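As a simplified stand-in for the adaptive dimension reduction above, the following sketch projects the color-name feature map onto its top principal components; it keeps only the data term of Equation (30) and ignores the smoothness term, so it is an approximation of the described scheme rather than a faithful reimplementation.

```python
import numpy as np

def pca_projection(cn_map, out_dim=2):
    """Project an H x W x D color-name feature map onto its out_dim principal components."""
    h, w, d = cn_map.shape
    x = cn_map.reshape(-1, d)
    x = x - x.mean(axis=0)                        # zero-mean per channel
    cov = x.T @ x / x.shape[0]                    # D x D covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)          # eigenvalues in ascending order
    basis = eigvec[:, ::-1][:, :out_dim]          # top out_dim components as columns
    return (x @ basis).reshape(h, w, out_dim), basis
```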

2.3. Motion Estimation by Kalman Filter

The Kalman filter is initialized before tracking, and the initial state vector containing the manually labeled real coordinates of the target center and the velocity components along the coordinate axes is obtained. The state equation and observation equation of the Kalman filtering algorithm are, respectively,
$$ s_{k} = A\,s_{k-1} + w_{k-1}, \qquad z_{k} = H\,s_{k} + v_{k}, $$
where $s_{k}$ is the state vector of the system at time $k$, $z_{k}$ is the observation vector at time $k$, $A$ is the state transition matrix of the constant-velocity model defined as
$$ A = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, $$
and $H$ is the observation matrix. The initial state is $s_{0} = [x_{0}, y_{0}, v_{x,0}, v_{y,0}]^{T}$, where $(x_{0}, y_{0})$ is the initial coordinate of the target center point and $(v_{x,0}, v_{y,0})$ are the velocity components along the $x$-axis and $y$-axis. Both the process noise $w_{k}$ and the observation noise $v_{k}$ are zero-mean white noise sequences, and they are uncorrelated with each other.
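A minimal sketch of the constant-velocity state-space model is shown below; the numerical values chosen for the noise covariances Q and R are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

dt = 1.0  # frame interval

# State s = [x, y, vx, vy]^T, observation z = [x, y]^T (constant-velocity model).
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)   # state transition matrix
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)    # observation matrix (position only)
Q = 1e-2 * np.eye(4)                         # process noise covariance (assumed value)
R = 1e-1 * np.eye(2)                         # observation noise covariance (assumed value)
```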

The current state and covariance are used to predict the velocity and position of the target in the next frame according to the recursive estimation principle. The prediction equations are
$$ \hat{s}_{k}^{-} = A\,\hat{s}_{k-1}, \qquad P_{k}^{-} = A\,P_{k-1}A^{T} + Q, $$
where $\hat{s}_{k}^{-}$ is the state prediction vector, $P_{k}^{-}$ is the predicted covariance matrix, $Q$ is the covariance matrix of the process noise $w_{k}$, and $\Delta t$ is the time interval, usually taken as 1.

After the predicted coordinates are obtained, the target sampling area is expanded to 3.5 times the size of the target image, and the CN features of this region are extracted. Samples are constructed using the circulant matrix, and the Fourier transform is performed. The Gaussian kernel correlation is then obtained based on the properties of the circulant matrix and Equation (18).
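A small sketch of this sampling step is given below; the function name and the (row, column) conventions are illustrative, and the 3.5x padding factor follows the description above.

```python
import numpy as np

def sample_window(frame, center, target_size, padding=3.5):
    """Crop a search window centered at the KF-predicted position, enlarged to
    `padding` times the target size and clipped at the image borders."""
    h, w = frame.shape[:2]
    win_h = int(round(target_size[0] * padding))
    win_w = int(round(target_size[1] * padding))
    cy, cx = int(round(center[0])), int(round(center[1]))
    y0, x0 = max(0, cy - win_h // 2), max(0, cx - win_w // 2)
    y1, x1 = min(h, y0 + win_h), min(w, x0 + win_w)
    return frame[y0:y1, x0:x1]
```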

Combining the new observation with the prior estimate obtained in the prediction step, the a posteriori estimate is obtained through feedback. The correction equations are given as
$$ K_{k} = P_{k}^{-}H^{T}\bigl(HP_{k}^{-}H^{T} + R\bigr)^{-1}, \qquad \hat{s}_{k} = \hat{s}_{k}^{-} + K_{k}\bigl(z_{k} - H\hat{s}_{k}^{-}\bigr), \qquad P_{k} = \bigl(I - K_{k}H\bigr)P_{k}^{-}, $$
where $R$ is the covariance matrix of the observation noise vector $v_{k}$ and $K_{k}$ is the Kalman gain matrix.
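Using the matrices A, H, Q, and R from the earlier sketch, the prediction and correction steps can be written as follows; here z is the target center returned by the CN-KCF detection in the current frame. This is a minimal illustration of the standard Kalman recursions described above.

```python
import numpy as np

def kf_predict(s, P):
    """Time update: propagate the state and covariance to the next frame."""
    s_pred = A @ s
    P_pred = A @ P @ A.T + Q
    return s_pred, P_pred

def kf_correct(s_pred, P_pred, z):
    """Measurement update with the detected target center z = [x, y]."""
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)   # Kalman gain
    s_new = s_pred + K @ (z - H @ s_pred)
    P_new = (np.eye(len(s_pred)) - K @ H) @ P_pred
    return s_new, P_new
```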

In theory, the trajectory of a moving target can be regarded as a smooth curve over a short time. In practice, however, the trajectory obtained by the KCF alone exhibits a certain jitter, and shadows, similar backgrounds, or other interference can easily lead to MOT failure. Both problems can be alleviated by using the KF to correct the tracking results. Moreover, because the KF predicts the likely position in the next frame from the current frame, the search area can be reduced accordingly. Correspondingly, the tracking accuracy is improved, and the tracker becomes robust to environmental factors such as partial occlusion, background similarity, and rapid motion.

3. Results and Discussion

3.1. Datasets and Compared Algorithms

In this paper, the experimental datasets come from the Jilin No. 1 (Jilin-1) satellite constellation developed by the China Changchun Satellite Technology Co., Ltd. Six videos, each showing a different scene, are used in our experiments. The moving objects are airplanes in the first four videos, and the rocket and the ship in the other two videos. Each video contains only one tracked object. The videos capture parts of an aircraft in flight, a ship underway, and a rocket launch. Each object is manually labeled with a directional bounding box to describe its position. Figure 4 gives an overview of the moving objects in the datasets. In addition, the sizes of the targets in the satellite videos are listed in Table 1. Notice that the moving objects are small, which may result in tracking failure.

Besides, there are many problems for MOT in satellite videos that affect the tracking performance. To analyze the MOT performance of the proposed method, seven trackers are chosen for comparison with the proposed CNK-KCF algorithm: the KCF, the kernelized correlation filter with Kalman filter (K-KCF), the CN-KCF, the circulant structure of tracking-by-detection with kernels (CSK), STRUCK, MeanShift, and CamShift. Tables 2–7 compare the above eight trackers in detail.

3.2. Details on the Setting of Parameters

The KCF, K-KCF, CN-KCF, CSK, STRUCK, MeanShift, CamShift, and CNK-KCF are implemented in MATLAB R2018b on a machine with an NVIDIA GeForce RTX 2080 Ti GPU. For all tested video datasets, the basic parameters of the algorithms are shown in Table 8, where $\sigma$ is the standard deviation of the Gaussian kernel, $\lambda$ is the regularization coefficient, and $\gamma$ is the learning factor. Besides, the parameters of each algorithm are kept the same across all video sequences.

3.3. Evaluation Metrics

In this paper, two common evaluation criteria are used, namely, the precision plot and the success plot [60]. The horizontal axis of the precision plot is the center location error (CLE) threshold. For one frame, the CLE is the Euclidean distance between the tracked and ground-truth centers, described as
$$ \mathrm{CLE} = \sqrt{\bigl(x_{t} - x_{g}\bigr)^{2} + \bigl(y_{t} - y_{g}\bigr)^{2}}, $$
where $(x_{t}, y_{t})$ are the tracked target center coordinates and $(x_{g}, y_{g})$ are the manually marked real coordinates. When the CLE is less than a prescribed threshold, the tracking in that frame is regarded as correct; the smaller the CLE, the more accurate the MOT. The vertical axis of the precision plot is the percentage of frames whose CLE is below the threshold.
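A minimal sketch of the precision-plot computation is given below; tracked_centers and gt_centers are assumed to be N x 2 arrays of per-frame center coordinates, and the 0-50 pixel threshold range is an illustrative choice.

```python
import numpy as np

def precision_curve(tracked_centers, gt_centers, thresholds=np.arange(0, 51)):
    """Fraction of frames whose center location error falls below each threshold."""
    diff = np.asarray(tracked_centers, dtype=float) - np.asarray(gt_centers, dtype=float)
    cle = np.linalg.norm(diff, axis=1)            # per-frame Euclidean distance
    return np.array([(cle <= t).mean() for t in thresholds])
```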

On the other hand, the horizontal axis of the success plot is the overlap threshold of the bounding box. The mathematical expression of the overlap rate is
$$ S = \frac{\bigl|B_{t} \cap B_{g}\bigr|}{\bigl|B_{t} \cup B_{g}\bigr|}, $$
where $B_{t}$ is the predicted tracking box obtained from the algorithm, $B_{g}$ is the real target box marked manually, and $S$ is the ratio of the overlapping area of $B_{t}$ and $B_{g}$ to their total (union) area. $|\cdot|$ represents the number of pixels in a region. The vertical axis of the success plot is the proportion of frames whose overlap exceeds the threshold. In this experiment, the area under the success rate curve (AUC) is used as the performance evaluation criterion of the algorithm: the larger the value, the better the tracking performance.
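Similarly, the success plot and its AUC can be computed as sketched below, with boxes given in [x, y, w, h] form; this is an illustrative implementation, not the evaluation code used in the paper.

```python
import numpy as np

def iou(box_a, box_b):
    """Overlap ratio of two boxes given as [x, y, w, h]."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 101)):
    """Success-plot values over overlap thresholds and their mean (the AUC)."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    success = np.array([(overlaps > t).mean() for t in thresholds])
    return success, success.mean()
```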

3.4. Experimental Analysis on Moving Airplane Tracking

The moving objects in the first four real satellite videos of this experiment are airplanes, recorded as airplane-1, airplane-2, airplane-3, and airplane-4. Note that the objects are small while the imaged scenes are large. First, the results of tracking airplane-1 with CNK-KCF are shown in Figure 5. Note that the airplanes at the airport are not completely occluded. For airplane-1, whose shape is clear, CNK-KCF performs best, with a precision of 0.957 and a success rate of 0.917. However, the other seven trackers do not reach a similar level of success rate and precision (in fact, Tables 2–7 show that CamShift fails to track in the satellite videos). Figure 6 shows the success rate and precision plots of the eight trackers for airplane-1.

For the airplane-2 video, Figure 7 shows the results of tracking airplane-2. Because the environment around the moving target is relatively complex, the K-KCF and STRUCK perform worse than the proposed CNK-KCF, with precisions of 0.284 and 0.260 and success rates of 0.272 and 0.224, respectively, followed by the KCF. In contrast, the proposed CNK-KCF achieves a precision of 0.902 and a success rate of 0.967. Figure 8 shows the success rate and precision plots of the eight trackers for airplane-2.

For the airplane-3 video, Figure 9 shows the results of tracking airplane-3. In Table 4, the KCF, CN-KCF, K-KCF, CSK, STRUCK, and CNK-KCF have similar precision. However, the success rates of CN-KCF, KCF, K-KCF, and CSK are below 0.9, lower than that of the proposed CNK-KCF. Figure 10 shows the success rate and precision plots of the eight trackers for airplane-3.

For the airplane-4 video, Figure 11 shows the results of tracking airplane-4. In Table 5, the proposed CNK-KCF outperforms the other seven trackers. The CN-KCF and CSK produce slightly weaker results, followed by STRUCK. The KCF and K-KCF give relatively weak results, with precisions of only 0.691 and 0.848 and success rates of 0.506 and 0.632, respectively. Overall, the experimental results on airplane MOT show that the designed CNK-KCF achieves a better tracking success rate and precision than the other seven algorithms. Figure 12 shows the success rate and precision plots of the eight trackers for airplane-4.

3.5. Experimental Analysis on Moving Rocket Tracking

The results of moving rocket tracking are shown in Figure 13. Because the rocket over the sea is partly occluded in the satellite video, occlusion detection has to be considered in the MOT. Besides, the rocket is small in the satellite video, so its texture features and shape are not clear. In Table 6, CNK-KCF performs best with a precision of 0.980 and a success rate of 0.941, followed closely by CN-KCF. KCF, K-KCF, STRUCK, and CSK show weaker performance in terms of precision and success rate. Besides, MeanShift fails to track the rocket. Figure 14 shows the success rate and precision plots of the eight trackers for the rocket.

3.6. Experimental Analysis on Moving Ship Tracking

The results of moving ship tracking are shown in Figure 15. CNK-KCF performs best with a precision of 0.979 and a success rate of 0.954. KCF and K-KCF show weaker performance in terms of precision and success rate. The performance of MeanShift and CamShift is also weak, with precisions of 0.458 and 0.279 and success rates of 0.426 and 0.251, respectively. Figure 16 shows the success rate and precision plots of the eight trackers for the ship.

In summary, compared with the other seven algorithms, the proposed CNK-KCF achieves better tracking performance.

4. Conclusions

In this paper, an effective tracker called CNK-KCF is proposed based on the correlation filter framework for MOT in satellite videos. Based on the CN feature, the proposed tracker processes the videos with multichannel color features. The Kalman filter is also utilized to improve the tracking success rate and accuracy.

The proposed algorithm is tested extensively on Jilin-1 datasets and compared with seven other tracking algorithms. The experimental results show that the proposed method is robust in several complex situations such as rapid target motion, occlusion, and similar object interference, and that it solves the problem of tracking failure when a moving object is partially occluded. In future work, the tracker will be combined with the controller rather than working separately.

Data Availability

The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (62133001 and 61520106010) and the National Basic Research Program of China (973 Program) (2012CB821200 and 2012CB821201).