Abstract

Aiming at the problem of limited payload and endurance of micro-UAVs, a target tracking algorithm based on monocular vision is proposed. Since monocular vision cannot directly measure the distance between the UAV and the target, triangulation and triangle similarity are used to calculate the distance information. Then, a target tracking method based on the Kalman filter and KCF is designed: the tracking result of KCF is corrected by the Kalman filter to solve the problem of target occlusion. Finally, the position of the target in the world coordinate system is calculated through the coordinate transformation matrix and used to control the UAV to track the moving target. In order to verify the feasibility of the algorithm, target size estimation and target tracking experiments are carried out. The experimental results show that the proposed algorithm can track a moving target effectively under short-term occlusion.

1. Introduction

In recent years, with the rapid development of vision technology, communication technology, and flight control technology, unmanned aerial vehicles (UAVs) have been widely used in real-time monitoring, investigation, traffic control, and civil photography [1-5]. According to dimension, UAVs can be classified into micro-, small-, and large-UAVs. Due to its small size, light weight, good mobility, and strong concealment, the micro unmanned aerial vehicle (MAV) has unique advantages in target tracking [6]. However, the MAV is limited in payload and endurance, making it impossible to carry a large computer or bulky detection sensors. How to design accurate and robust target tracking algorithms for the MAV platform is therefore a hot research problem.

Compared with other detection sensors, cameras built around optical sensors provide more intuitive feedback on environmental information and key points. Moreover, the camera is low-cost and lightweight, so it has great potential in the field of target tracking. Such cameras can be classified into monocular cameras, binocular cameras, and depth cameras according to the sensors they carry, and all three types have been used in target tracking. Aiming at the problem of inaccurate depth images caused by UAV jitter, Tayyab Naseer's team at the Technical University of Munich proposed carrying a depth camera, a monocular camera, and other sensors simultaneously on the UAV, using the monocular camera and label-based positioning to help the depth camera obtain accurate depth images for human motion tracking [7]. However, that system is currently only suitable for indoor environments and small-scale movements. Liu et al. proposed tracking the target with a UAV equipped with a three-axis gimbal, which filters the noise caused by UAV jitter and expands the field of view [8]. However, due to its large size, the three-axis gimbal cannot be carried on a MAV.

Target tracking algorithms can be divided into generative methods and discriminative methods. Generative methods focus only on the target's features, ignore background information, and match detected images against an established target model. Discriminative methods train a classifier to find the optimal region in the next frame. Because generative methods assume that the target's features remain constant for a period of time, they cannot track targets in complex situations, whereas discriminative methods based on correlation filters and deep learning can adapt to complex application scenarios.

In [9], the researchers tracked the target with a correlation filtering algorithm and presented the minimum output sum of squared error (MOSSE) filter. The algorithm runs at more than 600 frames per second and is robust to illumination variation and target shape change, which improves tracking robustness. Henriques et al. presented the kernelized correlation filter (KCF), which replaces the gray-level features of the original filtering method with histogram of oriented gradients (HOG) features [10]. Furthermore, the nonlinear classification problem is mapped to a high-dimensional space to make it linearly separable, and the computational complexity is reduced by applying kernel functions and the diagonalization property of the circulant matrix. To mitigate the boundary effect, Danelljan et al. presented the spatially regularized discriminative correlation filter (SRDCF) algorithm [11]. In [12], the researchers used real shifts to generate negative samples, trained the filter on real samples, and expanded the search area to improve tracking; however, the algorithm easily loses the target when its appearance changes greatly. To further improve correlation filter tracking, many algorithms extract deep features to represent the target [13, 14]. Although the tracking effect is improved, correlation filter algorithms based on deep features are slow and ill-suited to the computing resources of the UAV platform. Aiming at the background noise generated during UAV flight, Huang et al. presented the aberrance repressed correlation filter (ARCF), and experiments show that ARCF performs well on most UAV data sets [15]. However, it still struggles with tracking failures caused by target occlusion and size change.

With the rise of deep neural networks, they have received extensive attention in the field of target tracking. Convolutional neural networks have strong target representation ability thanks to learned deep features, which have gradually replaced traditional hand-crafted features; they have been introduced into target tracking and have made great progress [16-18]. The Siamese instance search tracker (SINT) creatively uses a Siamese neural network to measure the similarity between template and search images, providing a new idea for target tracking [19]. To address the poor real-time performance of deep learning trackers, Bertinetto et al. proposed the fully convolutional Siamese network (SiamFC) algorithm [20]. Due to the complex network structure of deep learning trackers, speed and accuracy cannot both be achieved to a certain extent. In [21], the researchers presented the Siamese region proposal network (SiamRPN) tracking algorithm; due to the limited data set, however, the training quality of the SiamRPN network is not high. Aiming at the tracking accuracy of SiamRPN, Yu et al. presented the distractor-aware Siamese region proposal network (DaSiamRPN), which improves the anti-interference and discrimination ability of tracking and achieves a tracking speed of 160 frames per second [22]. Although deep learning trackers have made great progress, the lack of training samples makes it difficult to train high-quality networks for different tracking scenarios. In addition, deep neural networks place very high demands on computing hardware, which also limits their application on the MAV platform.

In summary, MAV target tracking mainly faces the following challenges: (1) limited by the structural characteristics of the MAV, ensuring tracking accuracy while keeping algorithm complexity low is a key problem to be resolved; (2) during flight, airframe jitter may cause camera shake, target blur, and other problems.

In addition, there may be short-term obstacles between the UAV and the target, which lead to target drift and loss during tracking. It is therefore difficult to achieve stable and robust tracking from the UAV. This paper proposes a MAV target tracking algorithm based on monocular vision to solve the abovementioned problems. Firstly, aiming at the problem that a monocular camera cannot measure the depth between the UAV and the tracked target, a triangulation-based initialization method is proposed to measure the target size; the triangle similarity method is then applied to estimate the depth between the target and the camera, overcoming the two-dimensional limitation of the monocular camera. Secondly, to address the deficiencies of the KCF algorithm, a target tracking algorithm fusing the Kalman filter and KCF is proposed: the tracking results of KCF are corrected by the Kalman filter to improve tracking accuracy and robustness. Finally, the position of the target in the world coordinate system is calculated by the coordinate transformation matrix and used as the expected position input to control the UAV to track the moving target.

2. System Architecture

In order to perform the tracking task, the UAV carries a monocular camera for image acquisition. Since the optical flow sensor can measure the horizontal velocity of the UAV, it is usually used for fixed-point flight indoors and can also be used in conjunction with GPS outdoors. In addition, an Nvidia Jetson Nano serves as the onboard computer; its quad-core ARM A57 CPU and 4 GB of RAM fully meet the experimental requirements, and its compact size of 100 mm × 80 mm × 29 mm suits the airframe of the UAV. For flight control, the UAV uses a Holybro Pixhawk 4 as the attitude control unit; its PX4 firmware supports Offboard mode and can execute high-level control instructions. The UAV target tracking system is shown in Figure 1.

Concerning software, the robot operating system (ROS) is installed on the onboard computer to establish communication among multiple nodes, tasks, and processes. The software mainly includes the following modules: (1) the target tracking module fusing KCF and the Kalman filter, (2) target position calculation, (3) position control, (4) sensor data collection, and (5) the MAVROS package. The UAV acquires images of the tracked target through the monocular camera, and the fused KCF and Kalman filter track the dynamic target. The three-dimensional motion information of the target is obtained by the position solution and sent to the flight controller as the expected input of the position controller to perform the tracking task. Meanwhile, QGroundControl (QGC) and a remote desktop monitor the flight attitude and mission commands of the UAV in real time. The software architecture is shown in Figure 2.
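As an illustration of this architecture, a minimal ROS node for the position-control module might look as follows. This is a sketch: the MAVROS setpoint topic is the standard one, while the node name, the tracker topic, and the 1.5 m standoff distance are assumptions for illustration, not values from the paper.

```python
#!/usr/bin/env python
# Sketch of a position-control node: subscribe to the estimated target
# position and republish it as a MAVROS position setpoint.
import rospy
from geometry_msgs.msg import PoseStamped

class TargetFollower:
    def __init__(self):
        rospy.init_node("target_follower")
        # MAVROS consumes local position setpoints on this topic in Offboard mode.
        self.setpoint_pub = rospy.Publisher(
            "/mavros/setpoint_position/local", PoseStamped, queue_size=10)
        # The tracking module publishes the target position here
        # (topic name is hypothetical).
        rospy.Subscriber("/tracker/target_position", PoseStamped, self.target_cb)

    def target_cb(self, msg):
        # Keep an illustrative 1.5 m standoff behind the target.
        setpoint = PoseStamped()
        setpoint.header.stamp = rospy.Time.now()
        setpoint.header.frame_id = "map"
        setpoint.pose.position.x = msg.pose.position.x - 1.5
        setpoint.pose.position.y = msg.pose.position.y
        setpoint.pose.position.z = msg.pose.position.z
        self.setpoint_pub.publish(setpoint)

if __name__ == "__main__":
    TargetFollower()
    rospy.spin()
```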

3. State Estimation of the Target

The prerequisite for target tracking is estimating the motion state of the target. A discriminative target tracker generates the 2D motion information of the target in the image, and a Kalman filter is then established to fuse this 2D motion information and obtain the final tracking result.

3.1. The KCF Target Tracking Algorithm

The KCF (kernelized correlation filter) algorithm is a discriminative target tracking algorithm based on online learning. Training samples are generated from the initial frame by cyclic shifts of the circulant matrix, a classifier is trained by ridge regression, and the region with the largest response is taken as the target area. Although the KCF algorithm generates numerous virtual samples through the circulant matrix during tracking, which would normally require plenty of matrix inversions when training the classifier, the algorithm exploits the fact that a circulant matrix can be diagonalized and applies the discrete Fourier matrix to diagonalize the sample set. Since operations on a diagonal matrix involve only the nonzero elements on the diagonal, CPU and memory usage are greatly reduced. In addition, the KCF algorithm introduces the Gaussian kernel function to map the nonlinear problem into a high-dimensional space where it becomes linear, which greatly improves the calculation speed and meets the MAV's demands for fast response and light weight during tracking. The algorithm procedure is shown in Figure 3.
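As a concrete reference point, OpenCV ships a KCF implementation; the following sketch shows how such a tracker is typically initialized and run on a video stream. It assumes opencv-contrib-python is installed (in recent versions the tracker lives under cv2.legacy); the camera index and window handling are illustrative.

```python
# Sketch: running OpenCV's KCF tracker on a live video stream.
import cv2

cap = cv2.VideoCapture(0)                   # camera index is an assumption
ok, frame = cap.read()
bbox = cv2.selectROI("select target", frame, False)  # manual target selection

tracker = cv2.legacy.TrackerKCF_create()
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, bbox = tracker.update(frame)     # bbox = (x, y, w, h)
    if found:
        x, y, w, h = map(int, bbox)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) == 27:                # Esc to quit
        break
```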

To obtain more training samples, a training sample set is generated by the circulant matrix. An n × 1 vector x is used as the basic sample and is cyclically shifted n times by the permutation matrix L. The training sample set of the current frame is formulated as follows:

$$X = C(x) = \left[x, Lx, L^2x, \ldots, L^{n-1}x\right]^T$$

The permutation matrix L is defined as follows:

$$L = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}$$

In order to improve the calculation speed, the discrete Fourier matrix is used to diagonalize the sample set as follows:

$$X = F\,\mathrm{diag}(\hat{x})\,F^H,$$

where $\hat{x}$ is the discrete Fourier transform of the basic sample x, $\mathrm{diag}(\hat{x})$ is the diagonal matrix, F is the Fourier matrix, and $F^H$ represents the complex conjugate transpose of F.
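The computational payoff of this diagonalization can be checked numerically: multiplying by a circulant matrix is exactly a circular convolution, which the FFT evaluates in O(n log n) instead of O(n²). A small NumPy sketch:

```python
# Circulant matrix-vector product equals circular convolution via FFT.
import numpy as np
from scipy.linalg import circulant

n = 8
c = np.random.randn(n)   # basic sample
v = np.random.randn(n)

direct = circulant(c) @ v                                   # O(n^2)
via_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(v)).real   # O(n log n)
assert np.allclose(direct, via_fft)
```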

The classifier is created with the ridge regression model $f(z) = \omega^T z$, where z is the candidate sample. The goal is to minimize the squared error over the training samples $x_i$ and regression targets $y_i$:

$$\min_{\omega} \sum_i \left(f(x_i) - y_i\right)^2 + \lambda \lVert \omega \rVert^2,$$

where ω is the weight coefficient of the classifier and λ is the regularization coefficient. The regularization term controls overfitting and improves the generalization ability of the classifier.

Setting the partial derivative of the loss with respect to ω to zero gives

$$\omega = \left(X^T X + \lambda E\right)^{-1} X^T y,$$

where E is the unit matrix and y is the column vector composed of the regression labels $y_i$ of the samples. Converting equation (5) into the complex field, it can be written as follows:

$$\omega = \left(X^H X + \lambda E\right)^{-1} X^H y.$$

Using the diagonalization property of the circulant matrix, equation (6) can be represented in the frequency domain as follows:

$$\hat{\omega} = \frac{\hat{x}^* \odot \hat{y}}{\hat{x}^* \odot \hat{x} + \lambda},$$

where $\hat{\omega}$, $\hat{x}$, and $\hat{y}$ represent the Fourier transforms of ω, x, and y, respectively, $\odot$ denotes element-wise multiplication, and $\hat{x}^*$ represents the complex conjugate of $\hat{x}$.

As target tracking is a nonlinear problem, the sample x can be mapped to a high-dimensional space through the mapping function φ(x) to make the problem linearly separable. The weight coefficient of the classifier can be expressed as follows:

$$\omega = \sum_i \alpha_i \varphi(x_i),$$

where $\alpha_i$ is the linear combination coefficient, and the kernel function k is defined as follows:

$$k(x, x') = \varphi(x)^T \varphi(x').$$

The n × n kernel matrix K composed of the kernel functions between samples is expressed as follows:

$$K_{ij} = k(x_i, x_j).$$

Then, the ridge regression function can be expressed as follows:

$$f(z) = \omega^T \varphi(z) = \sum_{i=1}^{n} \alpha_i k(z, x_i).$$

The expression of α can be derived as follows:

$$\alpha = \left(K + \lambda E\right)^{-1} y,$$

where α is the coefficient vector composed of the $\alpha_i$. The Fourier transform of equation (12) can be expressed as follows:

$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda},$$

where $\hat{\alpha}$ is the Fourier transform of α and $\hat{k}^{xx}$ is the Fourier transform of the first row $k^{xx}$ of the matrix K.

After training the classifier on the numerous samples obtained by the circulant matrix, the target can be detected and located. First of all, the kernel matrix $K^z$ between the sample x and the candidate sample z is calculated to match the position results:

$$K^z = C\left(k^{xz}\right),$$

where $C(k^{xz})$ represents the circulant matrix generated from the kernel correlation vector $k^{xz}$.

The regression function of the candidate sample is as follows:

$$f(z) = \left(K^z\right)^T \alpha.$$

Equation (15) converted into the frequency domain can be expressed as follows:

$$\hat{f}(z) = \hat{k}^{xz} \odot \hat{\alpha}.$$

In particular, the Gaussian kernel is selected as the kernel function; the Gaussian kernel correlation can be obtained as follows:

$$k^{xz} = \exp\left(-\frac{1}{\sigma^2}\left(\lVert x \rVert^2 + \lVert z \rVert^2 - 2\,\mathcal{F}^{-1}\left(\hat{x}^* \odot \hat{z}\right)\right)\right).$$

By working in the Fourier domain, the matrix inversion is avoided and the time complexity of the algorithm is reduced from O(n²) to O(n log n), which enables fast detection and reduces the dependence on computer performance.
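To make the pipeline concrete, the following NumPy sketch implements the training and detection equations above on 1-D signals (real KCF operates on 2-D HOG feature maps); σ, λ, and the test signals are illustrative values, not the paper's settings.

```python
# Minimal KCF core: Gaussian kernel correlation, frequency-domain
# training, and detection (1-D signals for clarity).
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    # Kernel correlation k^{xz}, computed via the frequency domain.
    c = np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(z)).real
    d = (x @ x + z @ z - 2.0 * c) / x.size
    return np.exp(-np.maximum(d, 0) / sigma**2)

def train(x, y, lam=1e-4):
    # alpha_hat = y_hat / (k_hat^{xx} + lambda)
    k = gaussian_correlation(x, x)
    return np.fft.fft(y) / (np.fft.fft(k) + lam)

def detect(alpha_hat, x, z):
    # Response map f = IFFT(k_hat^{xz} . alpha_hat); its peak gives the shift.
    k = gaussian_correlation(x, z)
    return np.fft.ifft(np.fft.fft(k) * alpha_hat).real

# Usage: recover a 5-sample cyclic shift of the template.
x = np.sin(np.linspace(0, 6, 64))
y = np.exp(-np.arange(64.0) ** 2 / 4.0)     # regression label peaked at index 0
alpha_hat = train(x, y)
z = np.roll(x, 5)
print(np.argmax(detect(alpha_hat, x, z)))   # -> 5
```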

3.2. Design of Target Tracking Algorithm Based on Kalman Filter

In the previous section, a good balance between speed and accuracy was achieved by using the KCF tracker to obtain the target motion state while the camera is stationary. However, UAV target tracking is a dynamic process, and the position estimate from the previous section is not robust enough for it. During tracking, the target cannot be guaranteed to remain within the camera's field of view, and it may occasionally be partially or fully occluded, leading to target loss. Although a complete loss caused by long-term occlusion may not be recoverable, the proposed method can deal with small-scale occlusion over a short time. Accordingly, this section applies a Kalman filter to establish a linear motion model of the target and fuses it with the KCF tracking results, treating camera jitter as Gaussian noise. From the inputs and outputs of the model, the optimal estimate of the target's motion state predicts the target position at the next moment, improving tracking accuracy and robustness.

The Kalman filter is widely applied to the state estimation of target motion [23-25]. Because the measurements of target motion are noisy, the Kalman filter can effectively remove the noise using the target's motion information and obtain the optimal estimate of the target position.

Firstly, because of the camera's high sampling frequency, the time interval between adjacent frames is very short; the motion of the target between two frames can thus be regarded as uniform, with the acceleration of the target obeying a Gaussian distribution. The state space vector of the system can be expressed as follows:

$$x_k = \begin{bmatrix} x_{ik} & y_{ik} & \dot{x}_{ik} & \dot{y}_{ik} \end{bmatrix}^T, \quad u_k = \begin{bmatrix} \ddot{x}_{ik} & \ddot{y}_{ik} \end{bmatrix}^T,$$

where $x_k$ and $u_k$ are the state vector and control vector of the system at time k, respectively; $x_{ik}$ and $y_{ik}$ represent the position of the target at time k in I; $\dot{x}_{ik}$ and $\dot{y}_{ik}$ represent the velocity of the target at time k in I; and $\ddot{x}_{ik}$ and $\ddot{y}_{ik}$ represent the acceleration of the target at time k in I.

The motion state equation of the system is as follows:

$$x_k = A_k x_{k-1} + B_k u_k + w_k,$$

where $A_k$ is the state transition matrix of the system at time k, $x_{k-1}$ is the state vector of the system at time k-1, $B_k$ is the control input matrix of the system at time k, $u_k$ is the control vector of the system at time k, and $w_k$ is the process noise of the system at time k.

Assuming that the motion of the tracked target is uniform over the frame interval T, the specific forms of A and B are as follows:

$$A = \begin{bmatrix} 1 & 0 & T & 0 \\ 0 & 1 & 0 & T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad B = \begin{bmatrix} T^2/2 & 0 \\ 0 & T^2/2 \\ T & 0 \\ 0 & T \end{bmatrix}.$$

The KCF tracking result serves as the observation of the Kalman filter. The observation equation can be written as follows:

$$z_k = H_k x_k + v_k,$$

where $z_k$ is the target tracking result at time k, $H_k$ is the state observation matrix, and $v_k$ is the measurement noise at time k.

The specific form of H is as follows:

$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.$$

During estimation, the Kalman filter alternates between two stages, prediction and iterative update. The specific processes are as follows.

(1) Prediction stage. From the motion state equation,

$$\hat{x}_k^- = A_k \hat{x}_{k-1} + B_k u_k,$$
$$P_k^- = A_k P_{k-1} A_k^T + Q,$$

where $\hat{x}_k^-$ is the prior state estimate of the target at time k, $\hat{x}_{k-1}$ is the posterior state estimate of the target at time k-1, $P_k^-$ is the prior estimate covariance matrix, $P_{k-1}$ is the optimal estimate covariance matrix, and Q is the process noise covariance matrix.

(2) Iterative update stage.

$$K_k = P_k^- H_k^T \left(H_k P_k^- H_k^T + R\right)^{-1},$$
$$\hat{x}_k = \hat{x}_k^- + K_k\left(z_k - H_k \hat{x}_k^-\right),$$
$$P_k = \left(E - K_k H_k\right) P_k^-,$$

where $K_k$ is the Kalman gain matrix, R is the measurement noise covariance matrix, and E is the unit matrix.
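The following NumPy sketch implements the two stages with the A, B, and H defined above; the frame interval and the noise covariances Q and R are illustrative values, since the paper does not report its tuning.

```python
# Constant-velocity Kalman filter for the pixel-plane target state.
import numpy as np

dt = 1.0 / 30.0                       # assumed 30 fps camera
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
B = np.array([[dt**2 / 2, 0],
              [0, dt**2 / 2],
              [dt, 0],
              [0, dt]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-4                  # process noise (illustrative)
R = np.eye(2) * 1e-2                  # KCF measurement noise (illustrative)

x_est = np.zeros(4)                   # [x_i, y_i, vx, vy]
P = np.eye(4)

def kalman_step(z=None, u=np.zeros(2)):
    """One filter cycle; z is the KCF pixel measurement (None if occluded)."""
    global x_est, P
    # Prediction stage
    x_pred = A @ x_est + B @ u
    P_pred = A @ P @ A.T + Q
    if z is None:
        # No reliable measurement: keep the prior prediction.
        x_est, P = x_pred, P_pred
        return x_est
    # Iterative update stage
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_est = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(4) - K @ H) @ P_pred
    return x_est
```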

In summary, the tracking process based on the KCF and Kalman filter is shown in Figure 4. Firstly, the KCF tracker and the Kalman filter are initialized, and the predicted target state at the current moment is calculated from the optimal state estimate at the previous moment. Then, the predicted covariance at the current time is calculated from the optimal estimated covariance matrix at the previous time and the process noise. In the update stage, the KCF algorithm tracks the selected target; after the tracking result $z_k$ is obtained, the prediction is corrected through the Kalman gain. Finally, the optimal estimate of the current target state is obtained.
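Continuing the sketch above, the fusion logic of Figure 4 can be outlined as follows. Treating a failed KCF update as the occlusion signal follows OpenCV's tracker API and is an assumption about the paper's rejection criterion, which is not stated explicitly.

```python
# Fused tracking step: KCF measurement corrected by the Kalman filter,
# falling back to the motion-model prediction when KCF fails.
import numpy as np

def track_frame(frame, tracker):
    # KCF produces the measurement z_k; `found` is False when the
    # response is unreliable (e.g. the target is occluded).
    found, bbox = tracker.update(frame)
    if found:
        cx = bbox[0] + bbox[2] / 2.0    # bounding-box centre
        cy = bbox[1] + bbox[3] / 2.0
        return kalman_step(np.array([cx, cy]))  # correct with Kalman gain
    # Occlusion suspected: propagate the motion model alone.
    return kalman_step(None)
```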

4. Three-Dimensional Position Solution

After obtaining the target's planar motion coordinates in the two-dimensional image in Section 3, the coordinates are converted into three-dimensional space by the following method so that the UAV can track dynamically.

As shown in Figure 5, the world coordinate system, body coordinate system, camera coordinate system, image coordinate system, and pixel coordinate system are defined, and the relative motion between the UAV and the target is described. Among them, W = {o_w, x_w, y_w, z_w} is the world coordinate system, B = {o_b, x_b, y_b, z_b} is the body coordinate system, C = {o_c, x_c, y_c, z_c} is the camera coordinate system, I = {o_i, x_i, y_i} is the image coordinate system, and G = {o_g, u, v} is the pixel coordinate system, whose unit is the pixel. The pixel coordinate system takes the upper-left corner of the image as its origin, with the u axis pointing right and the v axis pointing down.

Suppose the coordinate of the target point M in W is $(x_w, y_w, z_w)$, the coordinate of its projection m in I is $(x_i, y_i)$, and the coordinate of the origin $o_i$ of I in G is $(u_0, v_0)$. Then, the relationship between G and I can be expressed as follows:

$$u = \frac{x_i}{dx} + u_0, \quad v = \frac{y_i}{dy} + v_0,$$

where dx and dy are the physical dimensions of a unit pixel along the $x_i$ axis and the $y_i$ axis, respectively.

Let the coordinate of the target point M in C be $(x_c, y_c, z_c)$. According to the projection transformation, the relationship between I and C can be expressed as follows:

$$x_i = f\,\frac{x_c}{z_c}, \quad y_i = f\,\frac{y_c}{z_c},$$

where f is the focal length of the camera, determined by the internal parameters of the camera.

Combining equation (18) with equation (19), the relationship between G and C can be written as follows:

$$z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = M_1 \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix},$$

where $f_x = f/dx$ and $f_y = f/dy$ represent the horizontal pixel focal length and vertical pixel focal length, respectively, and $M_1$ is the camera internal parameter matrix.

Then, the coordinate of M in W can be expressed as follows:

$$\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} = T_{wb}\, T_{bc} \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix},$$

where $T_{bc}$ is the transformation matrix from C to B, $T_{wb}$ contains the rotation matrix $R_{wb} = (r_{ij})$ from B to W, and $r_{ij}$ is determined by the attitude angles of the UAV. With roll φ, pitch θ, and yaw ψ, the specific form of $R_{wb}$ is as follows:

$$R_{wb} = \begin{bmatrix} \cos\psi\cos\theta & \cos\psi\sin\theta\sin\phi - \sin\psi\cos\phi & \cos\psi\sin\theta\cos\phi + \sin\psi\sin\phi \\ \sin\psi\cos\theta & \sin\psi\sin\theta\sin\phi + \cos\psi\cos\phi & \sin\psi\sin\theta\cos\phi - \cos\psi\sin\phi \\ -\sin\theta & \cos\theta\sin\phi & \cos\theta\cos\phi \end{bmatrix}.$$

Combining equation (21) with equation (20), the relationship between G and W can be written as follows:

$$z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = M_1 M_2 \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix},$$

where $M_2$ is the 3 × 4 external parameter matrix, i.e., the top three rows of $(T_{wb} T_{bc})^{-1}$, which transforms coordinates from W to C.
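Putting the intrinsic and extrinsic transforms together, a position-solution routine might look as follows once the depth $z_c$ is available (it is estimated in the remainder of this section). This is a sketch: the intrinsic values are placeholders, and $T_{bc}$ and $T_{wb}$ would come from the camera mounting calibration and the flight controller's attitude estimate.

```python
# Back-project a pixel to camera coordinates, then chain the extrinsic
# transforms to obtain the target position in the world frame.
import numpy as np

M1 = np.array([[500.0,   0.0, 320.0],   # fx, 0, u0 (placeholder intrinsics)
               [  0.0, 500.0, 240.0],   # 0, fy, v0
               [  0.0,   0.0,   1.0]])

def pixel_to_world(u, v, z_c, T_bc, T_wb):
    """T_bc: camera->body, T_wb: body->world, both 4x4 homogeneous."""
    # [x_c, y_c, z_c]^T = z_c * M1^{-1} [u, v, 1]^T
    p_c = z_c * (np.linalg.inv(M1) @ np.array([u, v, 1.0]))
    p_c_h = np.append(p_c, 1.0)          # homogeneous coordinates
    return (T_wb @ T_bc @ p_c_h)[:3]     # target coordinates in W
```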

After the coordinates of the target point M are determined in the image sequence, its position in W can be calculated. However, as the monocular camera cannot obtain the depth information $z_c$, the similar-triangle method is used to estimate the depth of the target. The premise of this estimation is knowing the actual height of the target, so the height of the target is first measured by triangulation. The triangulation method is shown in Figure 6.

For images I1 and I2, with the left image as reference, the camera optical centre moves horizontally from $o_{c1}$ to $o_{c2}$. During the movement, it is assumed that the camera does not rotate and that the displacements along the $z_c$ and $y_c$ axes are negligible. Suppose I1 contains the feature point $m_1$ with coordinate $(x_{c1}, y_{c1}, z_{c1})$ in C, and I2 contains the corresponding feature point $m_2$ with coordinate $(x_{c2}, y_{c2}, z_{c2})$ in C. According to the definition of epipolar geometry [26], the coordinate relationship can be expressed as follows:

$$z_{c1}\, m_1' = z_{c2}\, m_2' + t_{12},$$

where $m_1'$ and $m_2'$ are respectively the normalized coordinates of $m_1$ and $m_2$ in C, and $t_{12}$ is the translation vector from $o_{c1}$ to $o_{c2}$, whose value is known. Left-multiplying both sides by $m_2'^{\wedge}$, where $\wedge$ represents the cross-product (skew-symmetric) operator, eliminates the $z_{c2}$ term and yields the following relationship:

$$z_{c1}\, m_2'^{\wedge} m_1' = m_2'^{\wedge} t_{12}.$$

From the right-hand side of the equation, $z_{c1}$, the depth of the target in I1, can be calculated. The actual height of the target is then obtained from similar triangles, as shown in Figure 7.

Assuming that $H_m$ is the actual height of the target and $h_m$ is the height of the target in the image, $H_m$ can be expressed as follows:

$$H_m = \frac{h_m\, z_{c1}}{f}.$$

After estimating $H_m$ from the first two frames, the depth $z_c$ of the target in subsequent frames follows from the same similarity relationship:

$$z_c = \frac{f\, H_m}{h_m}.$$
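The two-step depth pipeline can be sketched directly from the equations above. The pixel focal length is a placeholder value, and a scalar least-squares solve is used for $z_{c1}$ since the cross-product equation is a 3-vector constraint on a single unknown.

```python
# Depth estimation: (i) triangulate z_c1 from a pure-translation image
# pair and recover the target's metric height H_m, (ii) reuse H_m with
# similar triangles in later frames.
import numpy as np

f = 500.0   # pixel focal length (placeholder)

def depth_from_translation(m1, m2, t12):
    """m1, m2: normalized image coordinates (3-vectors with z = 1);
    t12: known translation between the two optical centres."""
    # z_c1 * (m2 x m1) = m2 x t12  ->  least-squares for the scalar z_c1
    lhs = np.cross(m2, m1)
    rhs = np.cross(m2, t12)
    return float(lhs @ rhs) / float(lhs @ lhs)

def target_height(h_m_pixels, z_c1):
    return h_m_pixels * z_c1 / f        # H_m = h_m * z_c1 / f

def depth_from_height(H_m, h_m_pixels):
    return f * H_m / h_m_pixels         # z_c = f * H_m / h_m
```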

5. Experiment and Analysis

The flight experiment was carried out in an open outdoor environment. During the experiment, both the UAV and the target are in motion, which increases the difficulty of pose estimation. In addition, during the occlusion experiment, the flight parameters of the UAV are set to prevent large-scale manoeuvring. The flight parameters are shown in Table 1.

First, the ground station is used to check the sensor data of the UAV after power-on. Then, the UAV is switched to fixed-point mode with a 2.4 GHz remote controller, unlocked, and commanded to hover at a fixed point after climbing to a certain height. After the tracking target is selected, its size is first measured and estimated to provide a reference for the subsequent depth estimation needed for the target's three-dimensional position. In this paper, the sizes of three different types of targets are estimated. The matching results are shown in Figure 8, and the estimation results in Table 2.

It can be seen from Table 2 that the proposed method effectively estimates the size of different types of targets. The estimation errors are within 100 mm, which is acceptable for depth estimation. To verify the depth estimation algorithm proposed in this paper, targets at different distances are selected for depth estimation.

Table 3 shows the estimated distances of the Person, Car, and UAV targets at different distances. The estimation errors of the algorithm are within 0.2 m and do not grow appreciably as the distance increases. The target tracking experiment can then be carried out.

As tracking runs in real time on the onboard computer, the tracking system sends control instructions to the flight control system through serial communication. Limited by the processing speed of the onboard computer, the remote controller is used to switch the flight control system into Offboard mode, and the tracking algorithm starts automatically once the target is selected. The first-person tracking view of the UAV is shown in Figure 9, where the green border is the KCF tracking result and the yellow border is the Kalman prediction.

The tracking results under target occlusion are shown in Figure 10. It can be seen that even when the tracked target is partially or completely occluded and the KCF tracking result drifts, the algorithm proposed in this paper can still track the target effectively.

When the tracked target is occluded, using the KCF algorithm alone results in significant position estimation errors, whereas fusing the KCF algorithm with the Kalman filter keeps the errors within the allowable range. The experimental results are shown in Figure 11.

In Figure 11, the tracked target is occluded at 120 s and 220 s. It can be clearly seen that the proposed algorithm improves tracking during occlusion and effectively reduces the position estimation errors: the errors on the x-axis and y-axis drop from about 0.8 m to about 0.3 m, and the error on the z-axis from about 0.2 m to 0.1 m.

To further evaluate the system, the dynamic position of the target and the estimated results are compared, as shown in Figure 12. The system can effectively estimate the position of the target in three-dimensional space for most of the time. Despite jitter and occasional drift, the proposed algorithm can still relocate the target in a short time.

The errors between the target position and the estimated position on the x-, y-, and z-axes are shown in Figure 13. For most of the time, the estimation errors on the x-axis are kept within 0.6 m, and those on the y-axis and z-axis within 0.2 m. The RMSE (root mean square error) and MAE (mean absolute error) are further calculated, and the results are shown in Table 4. The experimental results show that the proposed algorithm can track the target effectively.

Compared with the 3D target pose estimation system in [27], the proposed system is robust enough for real-time dynamic position estimation. In addition, to analyze the effect of the distance between the UAV and the target on the accuracy of position estimation, several target trajectory estimation experiments were performed. As shown in Table 5, the performance of the proposed method does not deteriorate significantly as the distance between the UAV and the tracked object increases.

6. Conclusion

The payload and endurance of the MAV are limited, making it impossible to carry a large onboard computer running complex visual tracking algorithms. Aiming at these problems, this paper proposes a MAV target tracking algorithm based on monocular vision. The main contributions are as follows: (1) for measuring the distance between the MAV and the target, a triangulation algorithm is designed for the monocular camera to estimate the target's size; based on this, triangle similarity measures the distance between the MAV and the target; (2) to address target occlusion, a target tracking algorithm based on KCF and the Kalman filter is proposed; fusing the KCF tracking results with the Kalman filter solves the short-term occlusion problem and improves the anti-interference ability during tracking; (3) the proposed target tracking algorithm is evaluated through numerous experiments in a real environment, and the results demonstrate its feasibility and robustness.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Liaoning Provincial Education Department Project (Grant no. LJKMZ20220614).