#### Abstract

Unmanned aerial vehicles (UAVs) play a pivotal role in the field of security owing to their flexibility, efficiency, and low cost. Detecting, tracking, and positioning vehicle targets from the perspective of a UAV can effectively improve the efficiency of urban intelligent traffic monitoring. In this work, a deep-learning-based method for automatic vehicle detection and tracking in urban environments is designed by fusing the target detection network YOLO v4 with the detection-based multitarget tracking algorithm DeepSORT. To address the problem of positioning a vehicle target from a UAV, the state equation and measurement equation of the system are constructed, and a particle filter based on the interacting multiple model (IMM) is employed to estimate the state of the maneuvering target in the nonlinear system. Simulation results show that the proposed algorithm can detect and track vehicles automatically in urban environments. In addition, the IMM-based particle filter significantly improves the accuracy with which the UAV positions maneuvering targets, which gives the method good engineering application value.

#### 1. Introduction

With the rapid increase in the number of motor vehicles in urban environments, traffic congestion, accidents, and other problems occur frequently. As traffic conditions become increasingly constrained, intelligent transportation methods must be used to improve the efficiency and stability of the system. Traffic monitoring is the basis for the study of various traffic problems. Traditional traffic monitoring technologies such as induction coils, geomagnetic sensors, and roadside cameras have a small detection range, low accuracy, and poor mobility, which severely restricts the development of intelligent transportation systems [1–5]. In recent years, unmanned aerial vehicle (UAV) technology has developed rapidly. A UAV is a highly flexible aircraft that flies without an onboard pilot. At present, UAVs are being increasingly used in areas such as postdisaster rescue, aerial photography, daily monitoring, and military observation.

The use of UAVs to detect, track, and locate vehicle targets is of great significance for the construction of smart cities. UAVs can be deployed in diverse collection scenarios and can collect global traffic information for complex road sections, intersections, or multiple roads. In addition, they can capture high-definition video from directly above the road and obtain key traffic parameters that cannot be extracted by conventional monitoring methods. They can adapt to diverse collection needs and continuously monitor key areas. Therefore, UAVs have significant advantages in traffic information collection and monitoring.

The dynamic and complex nature of UAV videos poses severe challenges for video detection technology. Traditional video-based vehicle detection follows two main approaches. The first uses the texture of the target vehicle as the detection feature, but it is easily affected by lighting. The second is a texture-free method based on gradient calculations, which does not perform well in complex environments or under occlusion. Traditional vehicle classification algorithms, such as support vector machines and other classifiers that use gradient histograms of the captured images as features, have weak discriminative ability, provide unsatisfactory results, and are difficult to apply to complex and changeable road traffic. Therefore, it is difficult to achieve automatic and precise extraction of vehicle information using traditional video detection technology. In recent years, target detection methods based on deep neural networks have made significant breakthroughs in robustness and detection accuracy. The rapid development of convolutional neural networks (CNNs) in visual inspection has laid a strong technical foundation for the precise processing of vehicles in traffic videos captured by UAVs.

In the application scenario of a UAV aerial video, the performance of the tracking algorithm is affected by many factors such as the changes in the illumination, scale, and occlusion [6]. Unlike fixed cameras, tracking tasks in aerial videos are hampered by the low sampling rate and resolution and image jitter, which can easily lead to drift. When flying at a high altitude, the very small size of the ground objects is also a major challenge for the tracking task. In general, there are two types of video-based target tracking models, namely, the generative model and the discriminative model. The theoretical idea in the generative tracking model is that given a certain video sequence, for the target that needs to be tracked in the video, a model is built according to the tracking algorithm. Following this, the response area that is most similar to the target in the subsequent video sequence is found and is used as the target area. In this way, the tracking task continues further. The commonly used generative tracking algorithms include the optical flow method [7], the particle filter method [8], the mean-shift [9] algorithm, the continuously adaptive mean shift (CAMshift) [10] algorithm, and so on. A generative algorithm focuses mainly on tracking the characteristic information of the target itself and conducting an in-depth search on the target characteristics. However, such algorithms often ignore the influence of other factors on tracking performance. For example, severe scale changes, background information interference, or occlusion of the target can easily lead to a situation where the target cannot be tracked. The difference between the discriminative algorithm and the generative algorithm is that the former considers the influence of the background information on the target tracking task and then distinguishes the background information from the target information. 
In other words, to build a discriminative tracking model, it is necessary to distinguish the target from the background in a given video sequence. After the model is established, the subsequent video frames are searched, and each candidate region is classified as either target or background. Common discriminative tracking algorithms include correlation filtering methods and deep learning methods [11–13]. Owing to the success of deep CNNs in visual recognition tasks, a large number of studies have applied CNNs to tracking algorithms [14, 15]. These studies show that CNN-based trackers are more accurate than tracking algorithms based on handcrafted features.

For static targets, the position and attitude of the UAV relative to the target at the moment of positioning, together with the angle and ranging information of the photoelectric platform, can be substituted directly into the positioning solution model to quickly calculate the three-dimensional coordinates of the target. However, for maneuvering targets, both the platform and the target are always moving, and tracking is subject to interference from various factors. To achieve high-accuracy target positioning under such circumstances, it is necessary to study how the UAV estimates the state of a moving target, that is, how to use the obtained observation data to estimate parameters such as the position and speed of the target and thereby its current state. For nonlinear system estimation, the most commonly used filtering algorithms include the extended Kalman filter (EKF) [16], the unscented Kalman filter (UKF) [17], and the particle filter (PF) [18, 19]. The EKF and UKF algorithms are modified forms of the linear Kalman filter (KF), and thus, both inherit its restriction that the system state must follow a Gaussian distribution. The PF algorithm is suitable for nonlinear/non-Gaussian systems and can provide good filtering performance. However, when the tracked target is highly maneuverable, the tracking performance of the PF is poor, and thus, it is necessary to study the dynamic model used by the PF. Magill et al. [20] first proposed a multiple model (MM) algorithm that uses multiple filters to correspond to multiple target motion models. Based on the MM algorithm, Blom et al. [21] studied the interaction between multiple models in detail and proposed the interacting multiple model (IMM) algorithm with the Markov transition probability. In 2003, Boers et al. [22] proposed an IMM-based PF algorithm, namely, IMM-PF, which has a superior tracking effect for highly maneuvering targets.

In this work, a vehicle detection and tracking algorithm for UAVs is proposed based on mainstream deep learning image processing algorithms, and a vehicle target location estimation model is designed. In particular, based on the “you only look once version 4” (YOLO v4) [23] algorithm, a vehicle detection model with superior robustness and generalization performance is proposed. This model is combined with the detection-based multitarget tracking (tracking-by-detection) algorithm DeepSORT [24] to realize real-time vehicle tracking. Finally, the IMM-PF algorithm is used to achieve high-precision positioning of vehicle targets.

#### 2. System Structure

The system structure of the automatic vehicle detection and tracking method is shown in Figure 1. A UAV uses a camera to monitor the flight area, and the acquired aerial video is transmitted back to the ground station via a data link. At the ground station, vehicle target detection is performed on the downloaded aerial video. After a vehicle target is detected, the moving target is tracked in the subsequent video frames. To obtain the geodetic coordinates of the target, the pixel coordinates of the target are extracted, and the latitude and longitude of the target are then estimated by combining the measurement data of the position, attitude, and camera pointing angle of the UAV. In this way, fully automatic, vision-based detection, tracking, and positioning of vehicle targets by the UAV is realized.

#### 3. Algorithm

##### 3.1. YOLO v4 Target Detection Network

At the time of this work, YOLO v4 was the latest detection network in the YOLO series. It builds on YOLO v3 by integrating a number of advanced techniques, and it is therefore adopted as the vehicle detection network in this work. Innovations at the input of YOLO v4 include mosaic data augmentation, cross mini-batch normalization (CmBN), and self-adversarial training (SAT). Innovations in the backbone network include CSPDarknet53, the Mish activation function, and DropBlock. The neck of a target detection network consists of a few layers inserted between the backbone and the final output layer; in YOLO v4, these are the spatial pyramid pooling (SPP) module and the feature pyramid network (FPN) + PAN structure. The anchor box mechanism of the prediction part of the output layer is the same as in YOLO v3. The main improvements are the CIoU_Loss loss function used during training and the replacement of the nonmaximum suppression (NMS) used to filter the predicted boxes with DIoU_NMS. YOLO v4 uses CSPDarknet53, which combines CSPNet with Darknet-53, as the backbone network for feature extraction. Compared to a design based on the residual neural network (ResNet), CSPDarknet53 achieves higher target detection accuracy, although the classification performance of ResNet is better. However, with the help of the Mish activation function and other techniques, the classification accuracy of CSPDarknet53 can be improved.
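The DIoU_NMS step mentioned above can be illustrated with a short sketch. It is not the YOLO v4 implementation: the function names, the `(x1, y1, x2, y2)` box layout, and the threshold of 0.5 are assumptions chosen for illustration. The sketch uses the usual DIoU definition, IoU minus the squared distance between box centers divided by the squared diagonal of the smallest enclosing box, and suppresses a box when its DIoU with an already kept, higher-scoring box exceeds the threshold.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def diou(a, b):
    """DIoU = IoU - (squared center distance) / (squared enclosing-box diagonal)."""
    acx, acy = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bcx, bcy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    center_dist2 = (acx - bcx) ** 2 + (acy - bcy) ** 2
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    penalty = center_dist2 / diag2 if diag2 > 0 else 0.0
    return iou(a, b) - penalty


def diou_nms(boxes, scores, threshold=0.5):
    """Return indices of kept boxes, highest score first (threshold is hypothetical)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep the box only if it is not too close (in DIoU terms) to any kept box.
        if all(diou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep
```

Because the center-distance penalty lowers the score of boxes whose centers are far apart, two heavily overlapping boxes that belong to neighboring vehicles are less likely to suppress each other than under plain IoU-based NMS.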

To detect targets of different sizes, a hierarchical structure is required so that the detection head can operate on feature maps at different spatial resolutions. To enrich the information fed into the head, the bottom-up and top-down data streams are added element-wise or concatenated before they enter the head. Compared to the FPN used in YOLO v3, SPP greatly increases the receptive field and separates the most significant context features with hardly any reduction in network operating speed. In addition, YOLO v4 uses the path aggregation network (PANet) to aggregate parameters from different backbone levels for different detector levels. Therefore, YOLO v4 uses modified versions of SPP, PAN, and the spatial attention module (SAM) in place of the plain FPN, retaining the rich spatial information from the bottom-up data stream and the rich semantic information from the top-down data stream.

In addition, YOLO v4 makes judicious use of the bag of freebies and bag of specials methods for tuning. Compared to YOLO v3, the average precision (AP) and frames per second (FPS) of YOLO v4 increase by 10% and 12%, respectively.

##### 3.2. DeepSORT Vehicle Tracking Model

DeepSORT is an improved version of the SORT algorithm. SORT performs KF prediction in the image space, uses the Hungarian algorithm to associate data frame by frame, and uses the overlap rate of the bounding boxes as the association metric, which gives good performance at high frame rates. The way DeepSORT handles the tracking problem mainly comprises trajectory processing and state estimation, information association, and cascade matching.

###### 3.2.1. Trajectory Processing and State Estimation

An eight-dimensional state $(u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h})$ is used to represent a trajectory at a given moment, where $(u, v)$ are the center coordinates of the predicted bounding box, $h$ is the height of the predicted bounding box of a vehicle, and $\gamma$ is the aspect ratio of the bounding box. The remaining four variables are the velocities of these parameters in image coordinates. A counter is kept for each tracker of a target. If the tracking and detection results match each other, the counter is reset to 0. If the tracker cannot find a matching result for a preset period of time, it is deleted from the list. When a detection result that cannot be matched to any tracker in the current list appears in a frame, a new tracker is created for that detection. If the predicted position of the new tracker matches a detection result in three consecutive frames, a new target is considered to have appeared. Otherwise, a “false alarm” is considered to have occurred, and the tracker is deleted from the tracker list.
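The tracker lifecycle described above can be sketched as a small state machine. This is an illustration under stated assumptions rather than the DeepSORT source code: the class name `Track`, the state labels, and `MAX_AGE = 30` are hypothetical, and only the rule of three consecutive matches comes from the text.

```python
from dataclasses import dataclass

N_INIT = 3    # consecutive matches needed to confirm a new target (from the text)
MAX_AGE = 30  # hypothetical: frames without a match before a confirmed track is deleted


@dataclass
class Track:
    track_id: int
    hits: int = 0                # consecutive matches since creation
    time_since_update: int = 0   # the per-tracker counter described above
    state: str = "tentative"

    def mark_matched(self):
        """A detection was associated with this track in the current frame."""
        self.hits += 1
        self.time_since_update = 0
        if self.state == "tentative" and self.hits >= N_INIT:
            self.state = "confirmed"

    def mark_missed(self):
        """No detection was associated with this track in the current frame."""
        self.time_since_update += 1
        # A tentative track that misses a frame is treated as a false alarm.
        if self.state == "tentative" or self.time_since_update > MAX_AGE:
            self.state = "deleted"
```

A tracker would call `mark_matched` or `mark_missed` on every track once per frame and drop the tracks whose state is `"deleted"`.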

###### 3.2.2. Information Association

The Mahalanobis distance between the detection box and the box predicted by the tracker is used to describe the degree of association of the target motion information:

$$d^{(1)}(i, j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i), \tag{1}$$

where $d_j$ represents the position of the $j$th detection box, $y_i$ represents the position of the target predicted by the $i$th tracker, and $S_i$ represents the covariance matrix between the detected position and the average tracking position. To account for the continuity of movement, detections are gated by this Mahalanobis distance, with the 0.95 quantile of the $\chi^2$ distribution used as the threshold. Because the Mahalanobis distance association becomes unreliable when the camera is in motion, association based on the target appearance information is also introduced, and its process is as follows:

For each detection box, $d_j$, a feature vector, $r_j$, is computed using a CNN, and the constraint $\lVert r_j \rVert = 1$ is imposed on it. A gallery, $\mathcal{R}_i$, is maintained for each tracked target in order to store the feature vectors of the last 100 frames successfully associated with that target. Following this, the minimum cosine distance between the latest 100 successfully associated features of the $i$th tracker and the feature vector of the $j$th detection result of the current frame is calculated. If the distance is less than a certain threshold, the association is successful. The distance is calculated as follows:

$$d^{(2)}(i, j) = \min \left\{ 1 - r_j^{T} r_k^{(i)} \;\middle|\; r_k^{(i)} \in \mathcal{R}_i \right\}. \tag{2}$$

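The DeepSORT algorithm adopts a fused metric that considers the association of motion information and of object appearance at the same time. The two distances are linearly weighted to calculate the final degree of matching between a detection and a track using the following expression:

$$c_{i,j} = \lambda\, d^{(1)}(i, j) + (1 - \lambda)\, d^{(2)}(i, j), \tag{3}$$

where $d^{(1)}(i, j)$ is the Mahalanobis distance between the $i$th track and the $j$th detection, $d^{(2)}(i, j)$ is their minimum cosine distance, and $\lambda$ is the weighting coefficient.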
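The three association quantities in this subsection, the gated Mahalanobis distance, the minimum cosine distance over a 100-frame gallery, and their weighted combination, can be sketched in plain Python as follows. The function names, the default weight `lam = 0.5`, and the appearance threshold `0.2` are hypothetical. The gate value 9.4877 is the 0.95 quantile of the chi-square distribution with 4 degrees of freedom, which assumes a four-dimensional measurement (box center, aspect ratio, and height).

```python
import math
from collections import deque

CHI2_95_4DOF = 9.4877  # 0.95 chi-square quantile, assuming a 4-D measurement


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def squared_mahalanobis(detection, predicted, covariance):
    """(d - y)^T S^-1 (d - y) for one detection and one track prediction."""
    diff = [d - y for d, y in zip(detection, predicted)]
    return sum(di * xi for di, xi in zip(diff, solve(covariance, diff)))


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def min_cosine_distance(gallery, feature):
    """Smallest cosine distance between a unit feature and the stored unit features."""
    return min(1.0 - sum(a * b for a, b in zip(g, feature)) for g in gallery)


def fused_cost(d_motion, d_appearance, lam=0.5, max_appearance=0.2):
    """Weighted cost; pairs that fail either gate get an infinite cost."""
    if d_motion > CHI2_95_4DOF or d_appearance > max_appearance:
        return math.inf
    return lam * d_motion + (1.0 - lam) * d_appearance


# A bounded gallery keeps only the most recent 100 appearance features of a track.
gallery = deque(maxlen=100)
gallery.append(l2_normalize([1.0, 0.0]))
```

The resulting cost matrix would then be passed to an assignment solver such as the Hungarian algorithm, which is omitted here to keep the sketch short.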

###### 3.2.3. Cascade Matching

When a target is occluded for a long time, the uncertainty of the filter prediction increases greatly, and the observability of the state space decreases greatly. If two trackers then compete for the same detection result, the trajectory that has been occluded longer has not had its position updated for a long time, so the uncertainty of its predicted position is larger. In other words, its covariance is larger, and its Mahalanobis distance to the detection is smaller. Therefore, the detection result is more likely to be associated with the trajectory that has been occluded longer. This undesirable effect often destroys the continuity of tracking. The core idea of cascade matching is to match trajectories in order of increasing time since they were last matched, so that the most recently seen targets are given the greatest priority. Its specific process is shown in Figure 2.
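The priority rule of cascade matching can be sketched as follows. This is a simplified illustration: within each age level it uses a greedy lowest-cost assignment as a stand-in for the Hungarian algorithm, and the function name, argument layout, and `max_age = 30` are assumptions.

```python
import math


def matching_cascade(frames_since_match, cost, max_age=30):
    """Assign detections to tracks, most recently matched tracks first.

    frames_since_match[i]: frames since track i was last matched (1 = last frame).
    cost[i][j]: gated association cost of track i and detection j
                (math.inf where the pair is not admissible).
    Returns (list of (track, detection) pairs, sorted unmatched detections).
    """
    unmatched = set(range(len(cost[0]))) if cost else set()
    matches = []
    for age in range(1, max_age + 1):
        level = [i for i, a in enumerate(frames_since_match) if a == age]
        # Greedy stand-in for the Hungarian algorithm: cheapest pairs first.
        candidates = sorted(
            (cost[i][j], i, j)
            for i in level
            for j in unmatched
            if not math.isinf(cost[i][j])
        )
        taken = set()
        for _, i, j in candidates:
            if i not in taken and j in unmatched:
                matches.append((i, j))
                taken.add(i)
                unmatched.discard(j)
    return matches, sorted(unmatched)
```

In the first test case below, a long-occluded track has the lower cost to the only detection, yet the detection goes to the track seen in the previous frame, which is the behavior the cascade is designed to enforce.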

The method developed in this work first uses YOLO v4 to detect the vehicle targets. Then, the tracking-by-detection DeepSORT algorithm writes the detected bounding boxes of the moving vehicles into the tracking queue for trajectory processing and state estimation. Finally, real-time tracking is achieved through information association and cascade matching. The overall flow of the algorithm is shown in Figure 3.

##### 3.3. Vehicle Positioning Algorithm

###### 3.3.1. Problem Description

During the detection and tracking of ground vehicles by the UAV, each vehicle is surrounded by a detection box and marked with an ID. The center of the detection box is taken as the target point, and its pixel coordinates are $(u, v)$. The target line-of-sight vector, $\boldsymbol{l}$, is defined as the vector from the optical center of the camera to the target point. As a result, $\boldsymbol{l}$ effectively reflects the relative position between the target point and the UAV. The relationship between the parameters is shown in Figure 4.

The geographic coordinate system of the UAV is defined with the center of the GPS receiver as the origin, $O$, the $X$-axis pointing toward true north, the $Y$-axis pointing toward true east, and the $X$-, $Y$-, and $Z$-axes forming a right-handed coordinate system. The line-of-sight angle is defined as $(\alpha, \beta)$, where $\alpha$, called the line-of-sight height angle, is the angle between the target line-of-sight vector and the $Z$-axis, and $\beta$, called the line-of-sight direction angle, is the angle between the projection of the target line-of-sight vector on the $XOY$ plane and the $X$-axis. During the flight of the UAV, the attitude angle of the UAV, the pointing of the camera, and the position of the target jointly determine the values of $\alpha$ and $\beta$.

To calculate the target line-of-sight angle, three coordinate systems are defined, namely, the camera coordinate system (abbreviated as the $c$ coordinate system, with the optical center of the camera as the origin), the inertial measurement unit (IMU) platform coordinate system (abbreviated as the $b$ coordinate system, with the IMU measurement center as the origin), and the UAV geographic coordinate system (abbreviated as the $g$ coordinate system, with the center of the GPS receiver as the origin). The relationship between the spatial positions of the three coordinate systems is shown in Figure 5.

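Let $P_c = (x_c, y_c, f)^{T}$ be the coordinates of the target image point in the camera coordinate system, where $f$ is the focal length of the camera. Denoting the UAV yaw angle by $\psi$, the pitch angle by $\theta$, the roll angle by $\phi$, the azimuth angle of the camera by $\alpha_c$, and the elevation angle of the camera by $\beta_c$, the coordinates $P_g = (x_g, y_g, z_g)^{T}$ of the image point in the geographic coordinate system of the UAV can be obtained as

$$P_g = R_b^g(\psi, \theta, \phi)\, R_c^b(\alpha_c, \beta_c)\, P_c, \tag{4}$$

where $R_c^b$ is the rotation matrix from the camera coordinate system to the IMU coordinate system and is determined by the azimuth angle, $\alpha_c$, and the elevation angle, $\beta_c$, of the camera, and $R_b^g$ is the rotation matrix from the IMU coordinate system to the geographic coordinate system and is determined by the yaw angle, $\psi$, the pitch angle, $\theta$, and the roll angle, $\phi$, of the UAV.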

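After obtaining the line-of-sight vector $(x_g, y_g, z_g)^{T}$ in the geographic coordinate system according to Equation (4), the line-of-sight angle $(\alpha, \beta)$, that is, the angle from the $Z$-axis and the direction angle in the horizontal plane, can be calculated using the following equations:

$$\alpha = \arctan\frac{\sqrt{x_g^2 + y_g^2}}{z_g}, \qquad \beta = \arctan\frac{y_g}{x_g}. \tag{5}$$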

In the process of discovering, tracking, and locating the target, the position coordinates of the UAV are $(x_u, y_u, z_u)$, and the target coordinates are $(x_t, y_t, z_t)$. Selecting the state variable $X$ to represent the estimate of the target position, the discrete state equation of the system can be written as

$$X(k+1) = \Phi\, X(k) + \Gamma\, W(k), \tag{6}$$

where $\Phi$ is the state transition matrix, $\Gamma$ is the system noise matrix, and $W(k)$ is the system noise.

The measurement equation of the system is

$$Z(k) = h\big(X(k)\big) + V(k), \tag{7}$$

where $V(k)$ is the random noise in the measurement and its covariance matrix is $R$. The system noise and the measurement noise, $V(k)$, are uncorrelated zero-mean Gaussian white noise.

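Using the triangular relationship between the UAV at $(x_u, y_u, z_u)$ and the vehicle target point at $(x_t, y_t, z_t)$, with the line-of-sight height angle $\alpha$ measured from the $Z$-axis and the line-of-sight direction angle $\beta$, we get

$$x_t = x_u + r \sin\alpha \cos\beta, \qquad y_t = y_u + r \sin\alpha \sin\beta, \qquad z_t = z_u + r \cos\alpha, \tag{8}$$

where $r$ is the distance between the UAV and the target. In an urban environment, because the terrain is flat, this value can be calculated from the relative height of the UAV above the ground, $H$, and $\alpha$ as $r = H / \cos\alpha$.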
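The geometry of this subsection can be checked numerically with a short sketch. It assumes a north-east-down frame (X north, Y east, and Z downward, which is what a right-handed system with those two axes implies), a flat ground plane, and a line-of-sight vector already expressed in the geographic frame. The function names are hypothetical, and the camera-to-IMU and IMU-to-geographic rotations are not modeled here.

```python
import math


def line_of_sight_angles(x, y, z):
    """Angles of a line-of-sight vector given in a north-east-down frame.

    Returns (height_angle, direction_angle) in radians: the angle from the
    Z-axis, and the angle of the horizontal projection from the X-axis (north).
    """
    height_angle = math.atan2(math.hypot(x, y), z)
    direction_angle = math.atan2(y, x)
    return height_angle, direction_angle


def locate_target(uav_pos, height_angle, direction_angle, relative_height):
    """Flat-ground target position from the UAV position and line-of-sight angles."""
    # Slant range from the height of the UAV above the ground.
    slant_range = relative_height / math.cos(height_angle)
    xu, yu, zu = uav_pos
    return (
        xu + slant_range * math.sin(height_angle) * math.cos(direction_angle),
        yu + slant_range * math.sin(height_angle) * math.sin(direction_angle),
        zu + slant_range * math.cos(height_angle),
    )
```

For example, a UAV 120 m above flat ground whose line of sight points 30 m north and 40 m east for every 120 m of descent sees a target 30 m north and 40 m east of its own ground position, at a slant range of 130 m.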

###### 3.3.2. IMM-PF Algorithm

In practice, the state of a vehicle changes dynamically, which is difficult to describe with a single motion model. Moreover, the UAV target positioning task involves three major systems: the aircraft, the camera, and the global positioning system (GPS)/inertial navigation system (INS). The GPS has errors in measuring the latitude and longitude of the aircraft, and the INS has errors in measuring the attitude of the aircraft. In addition, the sight axis of the camera jitters. For moving targets, the PF algorithm can be used because it can handle filtering in nonlinear and non-Gaussian systems. In this work, the advantages of the IMM and PF algorithms are combined, and the IMM-PF algorithm is employed to achieve target positioning.

For multiple models, the state transition equation and observation equation are

$$x_k^{j} = F_k^{j}\, x_{k-1}^{j} + w_{k-1}^{j}, \tag{9}$$

$$z_k^{j} = H_k^{j}\, x_k^{j} + v_k^{j}. \tag{10}$$

In the above equations, $x_k^{j}$ represents the target state vector of model $j$ at time $k$, and $z_k^{j}$ represents the corresponding state observation variable. The state transition matrix, $F_k^{j}$, the observation matrix, $H_k^{j}$, the process noise, $w_{k-1}^{j}$, and the observation noise, $v_k^{j}$, are all related to model $j$. The probability densities of $w_{k-1}^{j}$ and $v_k^{j}$ are denoted by $p_w^{j}(\cdot)$ and $p_v^{j}(\cdot)$, respectively.

Assuming that there is a finite number of system models, $M$, and that the probability of model $j$ at time $k$ is $\mu_k^{j}$, the transitions between the models can be represented by a Markov chain as follows:

$$p_{ij} = P\{ m_k = j \mid m_{k-1} = i \}, \qquad i, j = 1, \ldots, M,$$

where $m_k$ denotes the model in effect at time $k$ and $p_{ij}$ represents the probability that model $i$ at time $k-1$ transfers to model $j$ at time $k$; it is assumed to remain unchanged during the tracking process.

Assuming that the initial value of the state, $x_0$, the initial model probabilities, $\mu_0^{j}$, and the observed value, $z_k$, at each time are known, the posterior probability density, $p(x_k \mid z_{1:k})$, of the state at time $k$ is estimated, and subsequently, the estimated value, $\hat{x}_k$, of the system state is obtained.

Using the IMM algorithm as the basic framework, the IMM-PF algorithm uses the PF as the model-matched filter. The IMM algorithm is divided into four steps: input interaction, model-matched filtering, model probability update, and estimated output. Recursive Bayesian filtering is used to derive the IMM-PF recursion from time $k-1$ to time $k$.

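*(1) Input Interaction*. First, the interaction probability of the system model at time $k-1$ is calculated using the following expression:

$$\mu_{k-1}^{i|j} = \frac{p_{ij}\, \mu_{k-1}^{i}}{\bar{c}_j},$$

where $p_{ij}$ is the Markov transition probability from model $i$ to model $j$ and $\mu_{k-1}^{i}$ is the probability of model $i$ at time $k-1$. The normalization factor is given by

$$\bar{c}_j = \sum_{i=1}^{M} p_{ij}\, \mu_{k-1}^{i},$$

where $M$ is the number of models. The particles $x_{k-1}^{j,(n)}$, $n = 1, \ldots, N$, in each model $j$ interact with the state estimates $\hat{x}_{k-1}^{i}$ of the other models $i$:

$$x_{k-1}^{0j,(n)} = \mu_{k-1}^{j|j}\, x_{k-1}^{j,(n)} + \sum_{i \neq j} \mu_{k-1}^{i|j}\, \hat{x}_{k-1}^{i}.$$

*(2) Interactive Model Matched Filtering*. The particle state at time $k$ is predicted by the state transition equation (Equation (9)), with state transition matrix $F_k^{j}$ and process noise sample $w_{k-1}^{j,(n)}$, as

$$x_{k|k-1}^{j,(n)} = F_k^{j}\, x_{k-1}^{0j,(n)} + w_{k-1}^{j,(n)}.$$

The observed value of the particle state at time $k$ is predicted by the observation equation (Equation (10)), with observation matrix $H_k^{j}$, as

$$z_{k|k-1}^{j,(n)} = H_k^{j}\, x_{k|k-1}^{j,(n)}.$$

The particle weight is obtained from the system state observation value, $z_k$, and the observation noise probability density, $p_v^{j}(\cdot)$, as

$$\omega_k^{j,(n)} = p_v^{j}\!\left(z_k - z_{k|k-1}^{j,(n)}\right).$$

Weight normalization is expressed as

$$\tilde{\omega}_k^{j,(n)} = \frac{\omega_k^{j,(n)}}{\sum_{l=1}^{N} \omega_k^{j,(l)}}.$$

The particle set $\{x_{k|k-1}^{j,(n)}\}$ is resampled according to the normalized weights $\tilde{\omega}_k^{j,(n)}$ to obtain a new particle set $\{x_k^{j,(n)}\}$, and the particle weights are set to $1/N$. Then, the estimated state of model $j$ at time $k$ is

$$\hat{x}_k^{j} = \frac{1}{N} \sum_{n=1}^{N} x_k^{j,(n)}.$$

*(3) Model Probability Update*. The residual error of the particle observations is calculated by

$$r_k^{j,(n)} = z_k - z_{k|k-1}^{j,(n)}.$$

The mean of the particle observations is calculated by

$$\bar{z}_k^{j} = \frac{1}{N} \sum_{n=1}^{N} z_{k|k-1}^{j,(n)}.$$

The residual covariance is

$$S_k^{j} = \frac{1}{N} \sum_{n=1}^{N} r_k^{j,(n)} \left(r_k^{j,(n)}\right)^{T}.$$

The likelihood function of model $j$ is expressed by

$$\Lambda_k^{j} = \frac{1}{\sqrt{\left|2\pi S_k^{j}\right|}} \exp\!\left[-\frac{1}{2} \left(z_k - \bar{z}_k^{j}\right)^{T} \left(S_k^{j}\right)^{-1} \left(z_k - \bar{z}_k^{j}\right)\right].$$

Model probability is updated using

$$\mu_k^{j} = \frac{\Lambda_k^{j}\, \bar{c}_j}{c},$$

where

$$c = \sum_{j=1}^{M} \Lambda_k^{j}\, \bar{c}_j.$$

*(4) Estimated Output*. The estimated state of the target is calculated using the following expression:

$$\hat{x}_k = \sum_{j=1}^{M} \mu_k^{j}\, \hat{x}_k^{j}.$$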

The interaction, filtering, estimation, and resampling of the IMM-PF algorithm are based on particles.
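The four steps can be illustrated with a compact sketch of one IMM-PF recursion. It is a simplified illustration under stated assumptions, not the implementation used in this work: the target moves in one dimension with state (position, velocity), only the position is observed, the models differ only in the standard deviation of a random acceleration, and resampling is plain multinomial resampling via `random.Random.choices`. All function and parameter names are hypothetical.

```python
import math
import random


def imm_pf_step(particles, mu, trans, accel_std, z, dt, meas_std, rng):
    """One IMM-PF recursion for a 1-D target with state (position, velocity).

    particles[j]: equally weighted (pos, vel) particles of model j
    mu[j]:        probability of model j at the previous step
    trans[i][j]:  Markov transition probability from model i to model j
    accel_std[j]: acceleration noise std of model j (hypothetical models)
    z:            position measurement at the current step
    """
    M, N = len(mu), len(particles[0])

    # (1) Input interaction: mixing probabilities and previous model estimates.
    c_bar = [sum(trans[i][j] * mu[i] for i in range(M)) for j in range(M)]
    mix = [[trans[i][j] * mu[i] / c_bar[j] for j in range(M)] for i in range(M)]
    means = [tuple(sum(p[d] for p in particles[i]) / N for d in range(2))
             for i in range(M)]

    new_particles, estimates, likelihoods = [], [], []
    for j in range(M):
        # Particles of model j interact with the estimates of the other models.
        others = [sum(mix[i][j] * means[i][d] for i in range(M) if i != j)
                  for d in range(2)]
        mixed = [(mix[j][j] * p + others[0], mix[j][j] * v + others[1])
                 for p, v in particles[j]]

        # (2) Model-matched particle filtering: predict, weight, resample.
        predicted = []
        for pos, vel in mixed:
            a = rng.gauss(0.0, accel_std[j])
            predicted.append((pos + vel * dt + 0.5 * a * dt * dt, vel + a * dt))
        residuals = [z - p[0] for p in predicted]
        weights = [math.exp(-0.5 * (r / meas_std) ** 2) for r in residuals]
        if sum(weights) == 0.0:  # every particle is far from z: fall back to uniform
            weights = [1.0] * N
        resampled = rng.choices(predicted, weights=weights, k=N)
        new_particles.append(resampled)
        estimates.append(tuple(sum(p[d] for p in resampled) / N for d in range(2)))

        # (3) Gaussian model likelihood built from the particle residuals.
        z_bar = sum(p[0] for p in predicted) / N
        s = sum(r * r for r in residuals) / N
        likelihoods.append(
            math.exp(-0.5 * (z - z_bar) ** 2 / s) / math.sqrt(2.0 * math.pi * s)
        )

    c = sum(likelihoods[j] * c_bar[j] for j in range(M))
    mu_new = [likelihoods[j] * c_bar[j] / c for j in range(M)]

    # (4) Estimated output: probability-weighted sum of the model estimates.
    fused = tuple(sum(mu_new[j] * estimates[j][d] for j in range(M)) for d in range(2))
    return new_particles, mu_new, fused


# Usage: track a target moving at a constant 10 m/s with noise-free measurements.
rng = random.Random(0)
particles = [[(rng.gauss(0.0, 1.0), rng.gauss(10.0, 1.0)) for _ in range(200)]
             for _ in range(2)]
mu = [0.5, 0.5]
trans = [[0.95, 0.05], [0.05, 0.95]]
for k in range(1, 11):
    particles, mu, est = imm_pf_step(
        particles, mu, trans, [0.1, 3.0], 10.0 * k, 1.0, 1.0, rng
    )
```

A practical implementation would use the motion models of the application (for example, constant velocity, constant turn, and constant acceleration), a vector-valued measurement, and a lower-variance scheme such as systematic resampling.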

#### 4. Simulation Tests and Analysis

##### 4.1. Object Detection and Tracking Tests

The dataset used in this study consists of aerial images of road traffic in an urban environment selected from the VisUAV multiobject aerial photography dataset. Most of the labeled vehicle objects in the dataset are cars, buses, trucks, and vans. A total of 15741 images are used as the training dataset for the YOLO v4 network. Subsequently, VisUAV2019-MOT is used as the benchmark dataset to test the algorithm of this study. VisUAV2019-MOT consists of video sequences acquired by UAVs, covering different shooting perspectives as well as different weather and lighting conditions. On average, each frame contains multiple detection boxes, and each sequence contains multiple objects. The resolution of the video sequence is .

In this work, the model is trained and tested on a platform with an Intel Core i7-8700K CPU, 32 GB of RAM, and a GeForce RTX 2080 GPU with 8 GB of memory. The operating system is Ubuntu 16.04. The supporting environment includes Python 3.5.2.

Figure 6 presents sample images of scenes from the VisUAV2019-MOT dataset. The VisUAV2019-MOT dataset contains several complex scenes, such as highway entrances, pedestrian streets, roads, and T-junctions. The scenes have high traffic flow and a large number of vehicle objects with changing motion characteristics. Furthermore, because the footage is shot from a moving UAV, it provides a demanding test of the performance of the algorithm.

In this work, the following evaluation criteria are used to analyze the advantages and disadvantages of multiobject tracking algorithms for different cases.

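Multiple object tracking accuracy (MOTA) is an intuitive measure of how well the tracker detects objects and maintains trajectories, independent of the accuracy of the estimated object positions. The larger its value, the better the performance. It is calculated as follows:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t},$$

where $\mathrm{FN}_t$ is the number of false negatives, $\mathrm{FP}_t$ is the number of false positives, $\mathrm{IDSW}_t$ is the number of ID switches, and $\mathrm{GT}_t$ is the number of all objects in frame $t$.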

Multiple object tracking precision (MOTP) indicates the positioning accuracy. The larger the value, the better. It is calculated as follows:

$$\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t},$$

where $d_{t,i}$ is the metric distance of the $i$th match in frame $t$ (i.e., the IoU value of the bounding box) and $c_t$ denotes the number of successful matches for frame $t$.

Mostly tracked (MT) denotes the number of successful tracking results that match the true value at least 80% of the time.

Mostly lost (ML) represents the number of successful tracking results that match the true value in less than 20% of the time.

ID switch indicates the number of times the assigned ID has jumped.

FM (fragmentation) indicates the number of times in which the tracking was interrupted (i.e., the number of times the tagged object was not matched).

FP (false positive) is the number of false alarms, referring to the trajectory of false predictions.

FN (false negative) is the number of missed detections and undetected tracking objects.
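The two summary metrics, MOTA and MOTP, can be computed from per-frame counts as in the following sketch. The function names and the list-based inputs are assumptions made for illustration; matching tracker output to ground truth, which produces these counts, is not shown.

```python
def mota(fn, fp, idsw, gt):
    """MOTA from per-frame lists of false negatives, false positives,
    ID switches, and ground-truth object counts."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)


def motp(match_scores, match_counts):
    """MOTP from per-frame lists of per-match overlap scores and the
    number of successful matches in each frame."""
    total = sum(sum(frame) for frame in match_scores)
    return total / sum(match_counts)
```

With two frames of five objects each, one missed detection, and one false alarm, `mota([1, 0], [0, 1], [0, 0], [5, 5])` evaluates to 1 - 2/10.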

Based on the above eight metrics, the trained YOLO v4 vehicle detection model is used as a detector in this section. Further, video sequences with different viewing angles and lighting are selected from the VisUAV2019-MOT dataset to test the tracking algorithm. The evaluation results are presented in Table 1.

From Table 1, the MOTP values of the proposed algorithm are relatively high and remain above 78% for all sequences. The positioning accuracy for video sequence uav0000306_00230_v reaches 84.4%, which further demonstrates the satisfactory performance of the trained detector. In addition, the lowest number of ID switches, obtained for video sequence uav0000077_00720_v, is only 46. The numbers of false alarms and missed detections vary widely among video sequences because the sequences differ in shooting background and number of vehicles.

Figure 7 shows the tracking results for a video sequence of a road intersection. The figure shows that the proposed algorithm gives satisfactory results in a complex environment. The algorithm not only accurately detects multiple vehicles of different types in each frame but also maintains a consistent identity for each vehicle throughout tracking.

**Figure 7.** Tracking results for a video sequence of a road intersection: (a) frame 180, (b) frame 190, (c) frame 205, and (d) frame 220.

##### 4.2. Simulation of the Object Positioning Algorithm

In this work, a maneuvering object is simulated for positioning to verify the feasibility and effectiveness of the algorithm. The object alternately performs constant velocity motion, constant turn motion, and constant acceleration motion. The system process noise and observation noise are both Gaussian white noises. The sampling period is s, and the simulation time is 40 s. Figure 8 shows the trajectory of the maneuvering object.

The CV-EKF and IMM-EKF algorithms are used for comparison with the IMM-PF algorithm used in this paper. A total of 50 Monte-Carlo simulation experiments are performed, where the number of particles used in the PF algorithm is .

The position estimates of the three algorithms are shown in Figure 9, and a comparison of the RMSE curves of the positions estimated by the three algorithms is shown in Figure 10. The figures show that IMM-PF has significantly better tracking accuracy for maneuvering objects than the CV-EKF and IMM-EKF algorithms. The EKF based on the single CV model can hardly track the object when it turns, resulting in a larger distance error; the IMM-EKF can track the turning object thanks to the IMM, but its distance error is larger than that of the IMM-PF. The IMM-PF filtering algorithm handles nonlinear motion better, which allows it to maintain stable tracking of the object under nonlinear motion. The simulation experiments show that the IMM-PF filtering algorithm has a smaller RMSE than the IMM-EKF algorithm and thus performs better for nonlinear positioning.
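The RMSE curves compared above are typically obtained by averaging the squared position error over the Monte Carlo runs at each time step. A minimal sketch follows, assuming two-dimensional positions; the function name and the data layout are hypothetical.

```python
import math


def rmse_per_step(true_track, runs):
    """Position RMSE at each time step over Monte Carlo runs.

    true_track[k] = (x, y) true position at step k
    runs[m][k]    = (x, y) position estimated in run m at step k
    """
    curve = []
    for k, (tx, ty) in enumerate(true_track):
        squared = [(run[k][0] - tx) ** 2 + (run[k][1] - ty) ** 2 for run in runs]
        curve.append(math.sqrt(sum(squared) / len(runs)))
    return curve
```

Calling the function once per filter on the same true trajectory gives curves that can be compared directly, as in Figure 10.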

To illustrate the effect of the IMM filter, the model probabilities are plotted as a function of time, as shown in Figure 11. The three models are initialized with the same probability, and the filter quickly converges to the CV model as it is updated. After 40 s of motion, the CV model no longer holds true, and the probability of the CT model becomes very high. In the final phase of the motion, the CA model obtains the highest probability. The switching between multiple motion models verifies the effectiveness of the IMM filter.

#### 5. Conclusion

This paper investigated related technology to address the need for automatic UAV-based vehicle detection, tracking, and positioning. The YOLO v4 object detection algorithm was used as the basis to train a vehicle detection network from the UAV perspective. At the object tracking stage, the DeepSORT algorithm was adopted. The combination of YOLO v4 and DeepSORT algorithms effectively improves the accuracy and robustness of multivehicle detection and tracking in complex urban scenes. The particle filtering and IMM algorithms were combined and applied to the UAV for positioning of maneuvering objects, which can improve the target positioning accuracy effectively.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.