Abstract
Aiming at the 1vs1 confrontation problem in a complex environment where obstacles are randomly distributed, the DDPG (deep deterministic policy gradient) algorithm is used to design a maneuver decision-making method for UAVs. Traditional methods generally assume that all obstacles are known globally. In this paper, a UAV airborne lidar detection model is designed, which can effectively solve the obstacle avoidance problem when facing a large number of unknown obstacles. On the basis of the designed model, the idea of transfer learning is used to transfer the strategy trained by one UAV in a simple task to a new, similar task, and that strategy is then used to train the strategy of the other UAV. This method improves the intelligence of the UAVs on both sides alternately and progressively. The simulation results show that the transfer learning method can speed up the training process and improve the training effect.
1. Introduction
On the battlefield, UAVs can play a role in reconnaissance, detection, target tracking, attack interception, damage assessment, and other tasks [1]. UAVs can also be used to intercept enemy UAVs [2]. How both sides should maneuver to achieve their respective task objectives has attracted the attention and research interest of military experts and many scholars.
At present, many experts have proposed different algorithms to solve maneuver decision-making problems in different situations. Among traditional methods, the main algorithms are the differential game method [3], the expert system method [4], and guidance laws [5]. These methods have shown good results on simple tasks, but they cannot be applied to complex battlefields where the environment is unknown and analytical solutions are difficult to obtain. Therefore, scholars have tried to apply intelligent algorithms to UAV attack and defense confrontation problems, including bionic modeling [6], fuzzy cybernetics [7], and swarm intelligence algorithms [8].
Deep reinforcement learning, an artificial intelligence technology that combines neural networks and reinforcement learning, is a new type of decision-making method with good application prospects for research on UAV countermeasures. For the scenario of UAV swarms chasing enemy targets, the DDPG algorithm has been used to train UAVs to pursue targets [9]. For the confrontation problem involving multiple UAVs, a cooperative decision-making method based on multiagent reinforcement learning has been proposed [10]. An MPPO algorithm has been proposed to solve the confrontation problem of a large-scale UAV swarm [11]. A hierarchical framework based on reinforcement learning and two kinds of motion planning strategies has been presented for pursuit-evasion games in the presence of obstacles [12]. Liu and Wang proposed an adversarial decision generation method based on the generative adversarial network for confrontation between UAVs in a barrier-free environment [13]. Wen and Shi proposed an intelligent decision-making method for multicoupled tasks of cluster UAV confrontation in complex environments [14]. Wang and Guo improved the reward function of the cluster UAV confrontation model and optimized the reward calculation method [15]. These works have verified the feasibility of applying deep reinforcement learning to the UAV confrontation problem. However, most of the current research is carried out under the condition that the scene information is completely known, and the designed strategies are suitable only for specific confrontation scenarios. If the scene becomes more complicated, these approaches may become ineffective.
In this paper, to solve the problem of obstacle avoidance when facing a large number of unknown obstacles, a UAV airborne lidar detection model is designed, and a 1vs1 maneuver decision-making method based on the DDPG algorithm is proposed. To obtain a better training effect, three training methods are designed following the idea of transfer learning. The scenarios corresponding to these three training methods are interrelated: the task difficulty increases gradually, and the strategy of the other UAV is fixed while one UAV is trained, so that the confrontation environment of the agent remains relatively stable. The relevant experience gained during the interaction between the UAV and the environment is transferred into new training scenarios to improve the intelligence of the UAVs on both sides alternately and progressively. The experimental comparison between the transfer and nontransfer methods shows that transfer reinforcement learning enables the two UAVs to each acquire their own intelligent strategy in the 1vs1 confrontation game. It also shows that the method can speed up the training process and improve the confrontation effect.
2. Problem Description and Modeling
2.1. 1vs1 Confrontation Problem
The 1vs1 confrontation scenario can be described as follows: one blue UAV and one red UAV, called the attack UAV and the defense UAV, operate in a limited planar area. The purpose of the attack UAV is to break through the interception of the defense UAV and reach the target area (light red area in the figure) from its initial position (blue flag). The purpose of the defense UAV is to intercept and destroy the attack UAV starting from its initial position (red flag). As shown in Figure 1, this paper assumes that circular obstacles (black areas) are randomly distributed in the environment. A UAV can obtain the positions of obstacles only when they are within the detection range of its airborne radar.

In Figure 1, the blue and red icons represent the attack UAV and the defense UAV, respectively. Each UAV is characterized by its position coordinates, its heading angle, and its radar detection radius. The target area is described by the position of its center point and its effective radius, and the kth obstacle by the position of its center point. For the convenience of research, the confrontation environment is bounded by a battlefield boundary, and neither UAV can move out of the boundary.
It is assumed that the defense UAV can obtain the position and heading of the attack UAV in real time through the ground surveillance radar, and that both sides carry lidar to detect obstacles and the boundary in their local environment. It is also assumed that the attack UAV knows the position of the ground target area in advance.
2.2. Kinematics Model of UAVs
It is assumed that the UAVs fly in a two-dimensional plane. The kinematics equations of UAV $i$ are

$$\dot{x}_i = v_i \cos\psi_i,\quad \dot{y}_i = v_i \sin\psi_i,\quad \dot{v}_i = a_i,\quad \dot{\psi}_i = \omega_i,$$

where $(x_i, y_i)$ is the position, $\psi_i$ is the heading angle, $v_i$ represents the speed of the UAV, and $a_i$ and $\omega_i$ represent its acceleration and angular velocity, respectively. The states and controls are subject to the constraints

$$0 \le x_i \le x_{\max},\quad 0 \le y_i \le y_{\max},\quad 0 \le v_i \le v_{\max},\quad |a_i| \le a_{\max},\quad |\omega_i| \le \omega_{\max},$$

where $x_{\max}$ and $y_{\max}$ represent the boundary of the area, $v_{\max}$ represents the upper limit of the UAV speed, $a_{\max}$ represents the maximum value of the UAV acceleration, and $\omega_{\max}$ represents the maximum value of the UAV angular velocity.
The current state of UAV $i$ is $s_i = (x_i, y_i, v_i, \psi_i)$, and the state changes under the action of the acceleration $a_i$ and angular velocity $\omega_i$. With a simulation step $\Delta t$, the state at the next moment is determined by the state transition equation

$$x_i(t+1) = x_i(t) + v_i(t)\cos\psi_i(t)\,\Delta t,\qquad y_i(t+1) = y_i(t) + v_i(t)\sin\psi_i(t)\,\Delta t,$$
$$v_i(t+1) = v_i(t) + a_i(t)\,\Delta t,\qquad \psi_i(t+1) = \psi_i(t) + \omega_i(t)\,\Delta t.$$
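As an illustration of this state transition, the following is a minimal Python sketch assuming a simple Euler discretization with time step `dt`; the boundary and limit values used as defaults are placeholders, not the parameters of Table 3.

```python
import math

def step_uav(state, action, dt=1.0,
             x_max=1000.0, y_max=1000.0,          # battlefield boundary (illustrative values)
             v_max=10.0, a_max=2.0, w_max=0.5):   # speed / acceleration / turn-rate limits
    """One Euler step of the 2D UAV kinematics with saturation constraints."""
    x, y, v, psi = state
    a, w = action
    # Enforce the control limits before integration.
    a = max(-a_max, min(a_max, a))
    w = max(-w_max, min(w_max, w))
    # Integrate position, speed, and heading.
    x_next = x + v * math.cos(psi) * dt
    y_next = y + v * math.sin(psi) * dt
    v_next = max(0.0, min(v_max, v + a * dt))
    psi_next = (psi + w * dt + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi]
    # Keep the UAV inside the battlefield boundary.
    x_next = max(0.0, min(x_max, x_next))
    y_next = max(0.0, min(y_max, y_next))
    return (x_next, y_next, v_next, psi_next)
```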
2.3. Radar Detection Model
It is assumed that both UAVs are equipped with lidar to detect the circular obstacles and the enemy in the environment. As shown in Figures 2 and 3, the detection area of the UAVs is discretized into a set of state variables. In the figures, the UAV radar detection radius, the detection angle range, the radii of the circular obstacles (obstacles of several different radius sizes are used), and the positions of the obstacle centers (for the total number of obstacles) are marked.


As shown in Figures 2 and 3, in order to better represent the detection state of the radar, the detection angle range of the UAV radar is discretized into l (l = 7) directions at equal intervals. In the figures, these are represented by 7 rays, indexed n = 1, …, l. The length of a blue ray is the maximum detection radius of the UAV radar, and the length of a red ray is the relative distance between the UAV and the obstacle or boundary detected in the corresponding direction. The detection state variable in each direction is the ratio of this measured length to the maximum detection radius of the UAV radar. The closer the ratio is to 1, the farther the UAV is from the obstacle or boundary in that direction; otherwise, the UAV is closer to the obstacle or boundary in that direction.
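A minimal sketch of how the normalized detection state could be computed, assuming circular obstacles given as `(cx, cy, r)` tuples, `l = 7` rays spread over a detection angle range `fov`, and simple ray marching; the sensing geometry and parameter values here are assumptions, not the paper's implementation.

```python
import math

def lidar_state(x, y, psi, obstacles, r_detect=100.0,
                fov=math.pi, n_rays=7, n_samples=50,
                x_max=1000.0, y_max=1000.0):
    """Return the 7 normalized ray lengths (measured distance / r_detect) in [0, 1]."""
    state = []
    for k in range(n_rays):
        # Rays are spread symmetrically around the heading psi.
        ang = psi - fov / 2 + k * fov / (n_rays - 1)
        dist = r_detect
        for s in range(1, n_samples + 1):
            d = r_detect * s / n_samples
            px, py = x + d * math.cos(ang), y + d * math.sin(ang)
            hit_boundary = not (0.0 <= px <= x_max and 0.0 <= py <= y_max)
            hit_obstacle = any(math.hypot(px - cx, py - cy) <= r
                               for cx, cy, r in obstacles)
            if hit_boundary or hit_obstacle:
                dist = d
                break
        state.append(dist / r_detect)  # closer to 1 means the direction is clear
    return state
```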
3. 1vs1 Confrontation Maneuver Decision-Making Method Based on Reinforcement Learning
In this paper, the reinforcement learning algorithm of DDPG is used to study the 1vs1 confrontation scenarios. Before using this algorithm, it is necessary to define the state space, action space, and reward function.
3.1. State Space
The position, speed, and heading of the attack UAV are characterized by the state variables defined in Section 2.2. The discretization number of the radar detection range is set to 7, so the detection state is characterized by the 7 normalized ray lengths. The attack UAV usually knows the position of the target area in advance. To simplify the input state dimension of the UAV, the position of the target is combined with the radar detection state. As shown in Figure 4, the directions corresponding to the maximum value of the detection state are determined first (there may be several such directions, such as the four blue ray directions in Figure 4), and then the direction with the smallest angle to the UAV-target line of sight is selected as the optimal heading of the attack UAV (the green ray direction in the figure). The detection state component of this direction is set to 2, which indicates that the attack UAV should move in this direction as much as possible.
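A minimal sketch of this direction-selection rule, operating on the normalized detection state of the lidar model above; the ray geometry and helper names are assumptions, while the marker value 2 follows the description above.

```python
import math

def mark_optimal_direction(detect_state, psi, uav_pos, target_pos, fov=math.pi):
    """Among the clearest rays, pick the one closest to the target bearing
    and mark it with the value 2 in the detection state."""
    n_rays = len(detect_state)
    # Bearing from the UAV to the target area (line of sight).
    los = math.atan2(target_pos[1] - uav_pos[1], target_pos[0] - uav_pos[0])

    def ray_angle(k):
        return psi - fov / 2 + k * fov / (n_rays - 1)

    def ang_diff(a, b):
        return abs((a - b + math.pi) % (2 * math.pi) - math.pi)

    best = max(detect_state)
    candidates = [k for k, d in enumerate(detect_state) if d == best]
    n_star = min(candidates, key=lambda k: ang_diff(ray_angle(k), los))

    marked = list(detect_state)
    marked[n_star] = 2.0  # encourage the attack UAV to move along this ray
    return marked, n_star
```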

In summary, the state of the attack UAV includes the UAV's own position, speed, heading angle, the radar's detection state, and the target's direction. Therefore, the state contains 10-dimensional data in total.
For the defense UAV, the state is defined similarly to that of the attack UAV.
3.2. Action Space
It is assumed that the attack UAV has stronger maneuverability. The control inputs of both UAVs are the acceleration and angular velocity, so the action space of UAV $i$ is

$$A_i = \{(a_i, \omega_i) \mid |a_i| \le a_{\max},\ |\omega_i| \le \omega_{\max}\},$$

where the limits of the attack UAV are larger than those of the defense UAV.
3.3. The Reward Function
Reinforcement learning mostly uses sparse rewards in the field of AI games and has achieved good results [16]. However, sparse rewards alone cannot make the UAVs learn efficiently at the beginning of the confrontation task.
Therefore, the reward function in this experiment combines a guided reward with a sparse reward. The guided reward is a weighted sum of four terms: the variation of the relative distance between the UAV and the target from time t − 1 to time t, the cumulative deviation of the UAV radar detection state variables from 1, a reward on the current speed of the UAV, and the deviation of the current heading of the UAV from the optimal heading defined in Section 3.1.
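Since the exact expression of the guided reward is not reproduced here, the following Python sketch only illustrates one plausible weighted-sum form of the four terms described above, using the coefficient values reported in Section 5.1; the individual term definitions (signs, scaling, and all names) are assumptions, not the paper's formula (8).

```python
import math

def guided_reward(d_prev, d_curr, detect_state, v, psi, psi_opt,
                  v_max=10.0, c1=0.3, c2=0.2, c3=0.2, c4=0.3):
    """Weighted sum of the four guided-reward terms (illustrative form only)."""
    r_dist = d_prev - d_curr                              # positive when closing on the target
    r_detect = -sum(1.0 - d for d in detect_state)        # penalize nearby obstacles/boundary
    r_speed = v / v_max                                   # encourage flying fast
    r_heading = -abs((psi - psi_opt + math.pi) % (2 * math.pi) - math.pi)  # heading deviation
    return c1 * r_dist + c2 * r_detect + c3 * r_speed + c4 * r_heading
```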
The sparse reward consists of terminal rewards and penalties: a penalty when the UAV collides with the boundary; a penalty when the Euclidean distance between the UAV and an obstacle center is smaller than the obstacle radius (collision with the obstacle); a reward for the attack UAV when its distance to the target center is smaller than the effective radius of the target area (reaching the target); and a punishment for the attack UAV when its distance to the defense UAV is smaller than the attack distance of the defense UAV (being destroyed). The success signal of the defense UAV is that the attack UAV is destroyed.
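A minimal sketch of the corresponding sparse reward and termination check for the attack UAV, assuming the ±10 values reported in Section 5.1 and simple Euclidean-distance tests; the parameter names and default values (`r_target`, `r_attack`, the boundary lengths) are illustrative assumptions.

```python
import math

def sparse_reward_attack(pos, obstacles, target_pos, defender_pos,
                         r_target=20.0, r_attack=10.0,
                         x_max=1000.0, y_max=1000.0):
    """Return (reward, done) for the attack UAV at its current position."""
    x, y = pos
    if not (0.0 <= x <= x_max and 0.0 <= y <= y_max):
        return -10.0, True                      # collided with the boundary
    for cx, cy, r in obstacles:
        if math.hypot(x - cx, y - cy) <= r:
            return -10.0, True                  # collided with an obstacle
    if math.hypot(x - defender_pos[0], y - defender_pos[1]) <= r_attack:
        return -10.0, True                      # destroyed by the defense UAV
    if math.hypot(x - target_pos[0], y - target_pos[1]) <= r_target:
        return 10.0, True                       # reached the target area
    return 0.0, False                           # no sparse reward this step
```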
3.4. The DDPG Algorithm
The DDPG algorithm is a classic reinforcement learning algorithm based on the actor-critic framework [17]. It is a deterministic policy gradient algorithm that borrows the experience replay mechanism and the dual (target) network structure of the DQN algorithm, and it realizes a direct mapping from the continuous state space to continuous actions through the actor network. The network architecture of DDPG is shown in Figure 5.

As shown in Figure 5, the algorithm mainly includes the interactive environment, the experience pool, and the network module. Before the UAV interacts with the environment, it is necessary to determine the number of layers and nodes of the networks, initialize the evaluated network parameters randomly, and copy the evaluated network parameters to the corresponding target networks for the first time. In each interaction step, the state fed back by the environment is taken as the input of the actor evaluated network, and the action value of the UAV is obtained from the actor network. Gaussian noise is added to this output to increase the exploration of the action space. Due to the limitation of the UAV's angular velocity, the executed action combines Gaussian noise with the motion constraints:

$$a_t = \mathrm{clip}\big(\mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t\big),\qquad \mathcal{N}_t \sim N(0, \sigma^2),$$

where $\mathrm{clip}(\cdot)$ is the limitation function of the UAV action, $\mathcal{N}_t$ is Gaussian noise, and $\sigma^2$ is the variance of the action noise. The next state of the UAV is determined by the state transition formula (3), and the corresponding reward is obtained from the reward function. The resulting training sample $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience pool. Once the number of samples reaches the requirement for starting training, the network parameters are trained by random sampling: a minibatch of samples is drawn from the experience pool, and the back propagation algorithm is used to update the evaluated network parameters.
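A sketch of the exploration and experience-storage step described above, assuming Gaussian noise of standard deviation `sigma`, clipping to the control limits, and a plain deque-based replay buffer; the actor is assumed to return a length-2 action (queried without gradient tracking). None of this is the paper's own implementation.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity experience pool with uniform random sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def noisy_action(actor, state, sigma, a_max=2.0, w_max=0.5):
    """Actor output plus Gaussian exploration noise, clipped to the control limits."""
    action = np.asarray(actor(state), dtype=float)  # deterministic policy output (a, w)
    noise = np.random.normal(0.0, sigma, size=2)    # exploration noise, N(0, sigma^2)
    a = np.clip(action[0] + noise[0], -a_max, a_max)
    w = np.clip(action[1] + noise[1], -w_max, w_max)
    return np.array([a, w])
```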
The loss function of the critic evaluated network is calculated as

$$L(\theta^{Q}) = \frac{1}{N}\sum_{j=1}^{N}\big(y_j - Q(s_j, a_j \mid \theta^{Q})\big)^2,$$

where $\theta^{Q}$ represents the parameters of the critic evaluated network and $Q(s_j, a_j \mid \theta^{Q})$ represents the evaluation value of the critic evaluated network for the current state and the action performed. The target value $y_j$ is defined as

$$y_j = r_j + \gamma\,Q'\big(s_{j+1}, \mu'(s_{j+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big),$$

where $r_j$ represents the reward after the UAV performs action $a_j$, $\gamma$ represents the attenuation coefficient of the reward, and $Q'(\cdot \mid \theta^{Q'})$ represents the evaluation value of the critic target network.
The parameters of the critic evaluated network are updated as

$$\theta^{Q} \leftarrow \theta^{Q} - \alpha_{Q}\,\nabla_{\theta^{Q}} L(\theta^{Q}),$$

where $\alpha_{Q}$ is the learning rate of the critic evaluated network and the gradient is calculated as

$$\nabla_{\theta^{Q}} L(\theta^{Q}) = \frac{2}{N}\sum_{j=1}^{N}\big(Q(s_j, a_j \mid \theta^{Q}) - y_j\big)\,\nabla_{\theta^{Q}} Q(s_j, a_j \mid \theta^{Q}).$$
The parameters of the actor evaluated network are updated as

$$\theta^{\mu} \leftarrow \theta^{\mu} + \alpha_{\mu}\,\nabla_{\theta^{\mu}} J,$$

where $\alpha_{\mu}$ is the learning rate of the actor evaluated network and the policy gradient is calculated as

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{j=1}^{N}\nabla_{a} Q(s_j, a \mid \theta^{Q})\big|_{a=\mu(s_j \mid \theta^{\mu})}\,\nabla_{\theta^{\mu}}\mu(s_j \mid \theta^{\mu}).$$
The parameters of the actor target network and the critic target network are updated through a soft update method. Such a slow updating process makes training more stable. The update process is

$$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'},\qquad \theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'},$$

where $\tau$ represents the soft update coefficient.
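The three updates above can be combined into a single PyTorch training step. The following is a hedged sketch assuming `critic(s, a)` and `actor(s)` modules such as those described in Section 5.1, with illustrative values for the discount factor `gamma` and soft update coefficient `tau` (the paper's values are given in Table 4).

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma=0.95, tau=0.01):
    s, a, r, s_next, done = batch  # float tensors sampled from the replay buffer

    # Critic update: minimize the TD error against the target networks.
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * (1.0 - done) * critic_target(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the critic's value of the actor's own actions.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of both target networks.
    for tgt, src in ((actor_target, actor), (critic_target, critic)):
        for p_tgt, p_src in zip(tgt.parameters(), src.parameters()):
            p_tgt.data.copy_(tau * p_src.data + (1.0 - tau) * p_tgt.data)
```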
4. Confrontation Maneuver Decision-Making Method Based on Transfer Reinforcement Learning
4.1. Transfer Learning
It is common that the strategies trained by deep reinforcement learning can only be applied to specific environments. As the complexity of the task increases, it becomes more difficult to apply these strategies to new scenarios. Transfer learning is an approach that makes full use of the knowledge and experience gained in previous related tasks and applies them to new tasks [18], and it gives models strong generalization ability. This idea is also reflected in daily learning: people use their mother tongue to learn foreign languages; people who are familiar with C++ can quickly learn other programming languages; a solid mathematical foundation is helpful for learning professional courses. All of these continue learning on the basis of previous knowledge to solve new problems. Different scenarios or tasks in transfer learning are generally called domains. The domains in which experience and knowledge have already been learned are called source domains, and the domains to be learned are called target domains. The definition of transfer learning is as follows.
Given a source domain and a source domain task, the knowledge learned in the source domain is used to learn in the target domain and complete the task of the target domain.
The idea of transfer learning can also be applied to reinforcement learning. In this paper, the parameter transfer method of transfer learning is used to deal with the 1vs1 confrontation scenario. The core idea of this method is that the agent first learns in a simple task, and once the learned strategy is good enough, the difficulty of the agent's task is gradually increased. The strategies suitable for the simple tasks are transferred to more complex tasks to continue learning. This process can effectively reduce the difficulty of exploring complex tasks and avoid the problems caused by sparse rewards.
4.2. Confrontation Maneuver Decision-Making Method Based on Transfer Learning
Aiming at the 1vs1 confrontation model established in Section 2, this paper first lets the UAV learn in a simple environment and then gradually transfers the learned experience to more difficult mission scenarios. In the learning process, when one side's strategy is trained, the other side uses the strategy trained in the previous scenario and keeps it fixed. After the training is completed, the newly trained strategy is in turn used to train the other side. This alternating training method improves the strategies of the UAVs on both sides progressively. The specific training process is shown in Table 1, and a sketch of the transfer mechanism is given below.
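A hedged sketch of the parameter-transfer mechanism behind Table 1, using PyTorch `state_dict` saving and loading; the network builder, file names, and the omission of the training loops are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

def make_actor(in_dim=10, out_dim=2):
    """Small actor network matching the layer sizes used in Section 5.1."""
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, 64), nn.ReLU(),
                         nn.Linear(64, out_dim), nn.Tanh())

# Step 1: the attack UAV is trained alone to avoid obstacles (training loop omitted),
# and its avoidance policy is saved for later transfer.
attack_actor = make_actor()
torch.save(attack_actor.state_dict(), "attack_step1.pt")

# Step 2: the defense UAV starts from the transferred step-1 avoidance weights,
# while the attack UAV's step-1 policy is loaded and frozen as its opponent.
defense_actor = make_actor()
defense_actor.load_state_dict(torch.load("attack_step1.pt"))
opponent = make_actor()
opponent.load_state_dict(torch.load("attack_step1.pt"))
opponent.eval()  # the opponent strategy stays fixed during this stage
torch.save(defense_actor.state_dict(), "defense_step2.pt")  # after training

# Step 3: the attack UAV resumes from its own step-1 weights (the transfer case)
# and trains against the fixed step-2 defense policy.
attack_actor.load_state_dict(torch.load("attack_step1.pt"))
frozen_defender = make_actor()
frozen_defender.load_state_dict(torch.load("defense_step2.pt"))
frozen_defender.eval()
```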
The pseudocode of the strategy training algorithm for DDPG-based 1vs1 confrontation is shown in Table 2.
5. Simulation Experiment
5.1. Experimental Environment and Parameter Settings
The experimental software environment consists of PyCharm 2020.1 and Anaconda3, and the experimental program is written in Python. The settings of the confrontation scenario are shown in Figure 1. This paper uses Tkinter, the standard Python GUI library, to build the two-dimensional environment, and the neural networks are constructed with PyTorch 1.8.1.
The specific parameters of the experimental environment are listed in Table 3. The obstacles are distributed randomly in each episode and are limited to a specific area.
The simulation step is 1 s. The PyTorch module is used to build the neural networks, which are all fully connected feedforward neural networks with three weight layers. The numbers of neurons in the layers of the actor network are [10, 128, 64, 2], and those of the critic network are [12, 128, 64, 1]. The activation function of the hidden layers is the ReLU function. To ensure that the action output by the actor network is reasonable, the output layer uses the tanh function and its output is multiplied by the maximum action limit value. The network parameters are optimized with the AdamOptimizer module. To reduce the burden on the neural networks and speed up training, the state inputs of both UAVs are normalized in advance: the position coordinates are divided by the maximum boundary length, and the heading angle is limited to [−π, π] and divided by π.
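A PyTorch sketch consistent with the layer sizes listed above, with the tanh output of the actor scaled by the action limits; the exact module layout and the limit values used here are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """10-dimensional state -> action (acceleration, angular velocity)."""
    def __init__(self, a_max=2.0, w_max=0.5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(10, 128), nn.ReLU(),
                                 nn.Linear(128, 64), nn.ReLU(),
                                 nn.Linear(64, 2), nn.Tanh())
        # Scale the tanh output to the physical control limits.
        self.register_buffer("limits", torch.tensor([a_max, w_max]))

    def forward(self, s):
        return self.net(s) * self.limits

class Critic(nn.Module):
    """(State, action) pair (10 + 2 = 12 inputs) -> scalar Q value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(12, 128), nn.ReLU(),
                                 nn.Linear(128, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```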
The algorithm training parameter settings are shown in Table 4.
In addition, there are two conditions for episode termination in this experiment. One is that the number of time steps in which the UAV interacts with the environment reaches the maximum number of time steps per episode. The other is that a UAV collides with an obstacle or the boundary or successfully achieves its required target. For the sparse rewards, if the UAV collides with an obstacle or the boundary, the corresponding penalty is set to −10; if the UAV completes the required task, the reward is set to 10. For the guided reward, the four reward coefficients in formula (8) are set to 0.3, 0.2, 0.2, and 0.3.
5.2. Training Result Analysis
The purpose of the reinforcement learning algorithm is to train the agent's strategy to maximize its expected cumulative reward. The training results are generally evaluated by the average episode reward, plotted as a curve showing how the reward obtained by the agent changes with the number of training episodes. The faster the reward value rises and the higher and more stably it converges, the better the training effect. This paper uses the average reward of the last 100 episodes as the reported average reward value; if fewer than 100 episodes have elapsed since the beginning of training, only the average of the available episodes is used.
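For reference, this averaging rule can be written as a short helper (an illustrative sketch, not the paper's code):

```python
def smoothed_rewards(episode_rewards, window=100):
    """Average each episode's reward over up to the previous `window` episodes,
    as used for the training curves."""
    smoothed = []
    for i in range(len(episode_rewards)):
        start = max(0, i - window + 1)
        recent = episode_rewards[start:i + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed
```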
According to the training steps in Table 2, the strategies trained by the UAVs in the simple task scenarios of step 1 and step 2 are reused in the scenario of step 3. In step 3, the task difficulty increases, and the transfer and nontransfer methods are used for comparative analysis. The transfer method is initialized with the network parameters obtained from the 1500 episodes trained previously. The details are as follows:
As shown in Figure 6(a), the attack UAV has prior information of its starting position and goal position in the environment of step 1, and it is trained to avoid obstacles and boundaries. After 1500 episodes of training, the reward curve of the attack UAV is shown in Figure 6(b).

[Figure 6: (a) step 1 training scenario; (b) reward curve of the attack UAV]
The abscissa of Figure 6(b) represents the number of training episodes, and the ordinate represents the average reward of the most recent 100 episodes. It can be seen from the figure that the UAV is not clear about what it should do at the beginning; it only interacts with the environment exploratively, and the data from these interactions are extremely useful. After the experience pool is filled (at about episode 520), the algorithm begins to train, the reward curve starts to rise gradually, and it shows a trend of convergence after 720 episodes with good stability.
As shown in Figure 7(a), in step 2, the defense UAV uses the strategy trained by the attack UAV in step 1 to avoid obstacles and boundaries, and on this basis, the defense UAV is trained to intercept the attack UAV. If the distance from the attack UAV to the target location (yellow) becomes less than the distance from the defense UAV to the target location, the defense UAV can no longer complete the interception and strike mission because the maneuverability of the attack UAV is better than that of the defense UAV. In this case, the episode is terminated early, the attack UAV is judged to have completed its task successfully, and the defense UAV is judged to have failed.

[Figure 7: (a) step 2 training scenario; (b) reward curve of the defense UAV]
The abscissa of Figure 7(b) represents the number of training episodes, and the ordinate represents the average reward of the most recent 100 episodes. It can be seen from the figure that after the experience pool is filled (at approximately episode 580), the training curve begins to rise gradually and converges around episode 850 with good stability.
In step 3, the defense UAV uses the defensive strategy trained in step 2. It is assumed that the attack UAV can detect the defense UAV with its airborne lidar and treats the defense UAV as an obstacle to avoid. The attack UAV is then trained with the transfer method (starting from its strategy trained in step 1) and with the nontransfer method, respectively. The training results are shown in Figure 8. Similarly, if the distance between the attack UAV and the target position (yellow) becomes less than the distance between the defense UAV and the target position, the episode is terminated in advance, the attack of the attack UAV is judged to be successful, and the defense of the defense UAV is judged to have failed.

The abscissa of Figure 8 represents the number of training episodes, and the ordinate represents the average reward of the most recent 100 episodes. It can be seen from the figure that both the transfer and nontransfer methods converge within a certain period of time. In contrast, the transfer method has a better episode reward at the early stage of training and a higher reward value after convergence.
5.3. Experiment Result Analysis
In this paper, the training results obtained after 1500 episodes are tested with 10,000 Monte Carlo runs. The parameters of the trained actor evaluated networks are loaded into the UAVs.
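A hedged sketch of such a Monte Carlo evaluation loop, assuming an environment object with a gym-style `reset`/`step` interface that reports task success through its `info` dictionary; these interface details are assumptions, not the paper's test harness.

```python
import torch

def evaluate(env, actor, runs=10_000):
    """Roll out the deterministic (noise-free) policy and report the success rate."""
    actor.eval()
    successes = 0
    for _ in range(runs):
        state, done = env.reset(), False            # obstacles re-randomized each episode
        info = {}
        while not done:
            with torch.no_grad():
                action = actor(torch.as_tensor(state, dtype=torch.float32))
            state, reward, done, info = env.step(action.numpy())
        successes += int(info.get("success", False))
    return successes / runs
```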
Three different scenarios are tested. The test results are shown in Figures 9–11 (each small circle represents the current position of the UAV at every 5 s time step). The test result data are shown in Figure 12.


[Figures 9–11: test trajectories for the three scenarios; panels (a) and (b) show success and failure cases, respectively]

The test results of step 1 show that the attack UAV trained by the presented method can avoid obstacles successfully. The final strategy can achieve stable convergence, and the success rate of avoiding obstacles and reaching the designated area is 99.29%.
The test results of step 2 show both the success and failure cases of the defense UAV. As shown in Figure 12, both UAVs can avoid obstacles successfully. The defense success rate of the defense UAV is 55.54%. In most of the defense failures, the two UAVs evade the same obstacle from different sides, so the defense UAV cannot intercept effectively.
The test results of step 3 show both the success and failure cases of the attack UAV. Compared with the nontransfer method (86.05%), the transfer reinforcement learning method proposed in this paper increases the offensive success rate (87.56%). Moreover, the results of both sides are greatly improved compared with step 2 (43.55%).
In addition, 1000 Monte Carlo experiments are conducted between attackers and defenders trained by the traditional MADDPG algorithm and those trained by the DDPG algorithm based on transfer learning. The experiment results are shown in Figure 13.

As shown in Figure 13, on the attack side, the winning rate of the transfer learning algorithm is 94.2%, which is significantly higher than MADDPG's winning rate of 45.2%; on the defense side, the winning rate of the transfer learning algorithm is 54.8%, which is also significantly higher than MADDPG's 6.8%. These results demonstrate the effectiveness and superiority of the algorithm proposed in this paper.
6. Conclusion
In this paper, reinforcement learning is applied to the UAV confrontation problem, and a 1vs1 confrontation method is designed based on the DDPG algorithm. Based on the model, transfer learning is introduced to train the UAVs. The results show that the proposed method can make training converge faster and can increase the offensive success rate.
Due to its limited maneuverability, the task success rate of a single defense UAV is not high. Therefore, future work will study the maneuver decision-making of multiple defense UAVs against a single attack UAV on the basis of the method proposed in this paper. In the longer term, optimizing the framework of the algorithm, making the environment more complex, and adding more UAVs to the scenario are promising directions.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This research was funded in part by the Aeronautical Science Foundation of China under Grant number 2020Z023053001.