Abstract

With the increasing complexity of UAV application scenarios, a single UAV can no longer meet mission requirements, and many complex tasks require the cooperation of multiple UAVs. How to coordinate UAV resources therefore becomes the key to mission completion. In this paper, a task model including multiple UAVs and unknown obstacles is constructed, and the model is transformed into a Markov decision process (MDP). In addition, considering the influence of the UAVs' strategies on one another, a multi-agent reinforcement learning algorithm, MA-SAC (Multi-Agent Soft Actor-Critic), based on the SAC algorithm and the centralized-training, decentralized-execution framework, is proposed to solve the MDP. Simulation results show that the algorithm can effectively deal with the task allocation problem of multiple UAVs in this scenario, and its performance is better than that of other multi-agent reinforcement learning algorithms.

1. Introduction

Unmanned aerial vehicles (UAVs) offer strong mobility, low safety risk, no need for an onboard pilot, and reusability. UAVs were first used in military fields [1], such as reconnaissance, target strike, airborne early warning, and electronic jamming. In recent years, UAV technology has developed rapidly, UAVs have become smaller, and their cost keeps falling. Therefore, UAVs are increasingly used in civil fields such as sensing [2], cargo transportation, communication relay [3], fire monitoring, and aerial mapping.

With increasingly complex application scenarios, such as the combination with the Internet of Vehicles [4], a single UAV cannot effectively complete complex and diverse tasks. Having multiple UAVs perform tasks collaboratively is important not only to meet the requirements of complicated scenarios but also to complete the tasks with less time and resource consumption.

Task planning is the core of cooperative multi-UAV execution, and task allocation is the basis of task planning. Task allocation means that, in a complex task environment with several UAVs, the energy consumption, load, nature, role, and other constraints of the UAVs are fully considered, the UAVs and the various resources are coordinated, and one or more ordered tasks are assigned to each UAV, so as to minimize time and cost and ensure the efficient and successful completion of the tasks to the greatest extent.

The task allocation problem is generally approximated as a path planning problem [5], that is, how to generate a collision-free path from the starting site to the destination that ensures the safety of the vehicle [6]. In the multi-UAV environment, however, not only collisions between UAVs and obstacles but also collisions among the UAVs themselves must be considered. At the same time, as the number of UAVs increases, the variability of the environment also increases. In addition, the action decisions of the UAVs can be regarded as simultaneous, and no UAV can know the current decisions of the other UAVs, so it is more difficult to avoid collisions between UAVs.

Fortunately, reinforcement learning (RL) techniques have emerged to help solve the problem of real-time decision-making in complex and changing environments. RL allows a UAV to learn a strategy that maximizes returns or achieves a specific purpose through constant interaction with the environment.

In this paper, a UAV task allocation model including UAV collision and communication energy consumption is presented, and an MA-SAC algorithm is proposed to assign tasks and plan paths for the UAVs.

The specific works of this paper are as follows:
(i) A multi-UAV task assignment model based on collision and communication energy consumption is proposed
(ii) Based on this assignment model, the dynamic process of task assignment is transformed into an MDP
(iii) A multi-agent reinforcement learning algorithm, MA-SAC, is proposed to solve the MDP

The rest of this article is organized as follows. Section 2 describes the related work. In Section 3, the multi-UAV task assignment model is presented. Section 4 introduces the task assignment algorithm proposed in this paper. In Section 5, simulation is performed and the results are analyzed. Finally, the works of this paper are summarized in Section 6.

2. Related Work

In the past few years, many researchers have studied both multi-UAV task allocation models and the algorithms that solve them. They not only make the models closer to the increasingly complex real environment but also look for high-performance algorithms. This section introduces related work from these two aspects.

2.1. Task Allocation Model

In various scenarios, different task allocation models need to be established according to the problems that the UAVs must solve. In [7], the problem is modeled as a traveling salesman problem (TSP), which minimizes the total flight time and total range of all UAVs by considering the flight capability of the UAVs. Jia et al. [8] construct a heterogeneous UAV cooperative multitask allocation scenario by considering kinematic constraints, resource constraints, time constraints, and a vehicle path model. Song et al. [9] describe the UAV logistics problem as a mixed integer linear programming problem considering UAV flight time, load, and other constraints. In addition, the multi-UAV task allocation problem is often described as a multidimensional multiple-choice knapsack problem (MMKP) [10, 11], a dynamic network flow optimization (DNFO) problem [12], or a cooperative multiple task assignment problem (CMTAP) [13, 14].

2.2. Task Assignment Algorithm

Task assignment algorithms mainly fall into three categories: optimization algorithms, heuristic algorithms, and reinforcement learning algorithms.

Optimization methods include the Hungarian algorithm [15, 16], the branch-and-bound method [17], and other commonly used integer linear programming methods. These algorithms are only applicable to scenarios with simple tasks and a small number of UAVs: their computation grows exponentially as the number of UAVs increases, and they cannot generate an accurate trajectory for UAVs in complex environments. Heuristic algorithms were proposed as an alternative to optimization algorithms and include GA [18], ACO, and PSO, which simulate animal behavior in nature. These algorithms are generally combined with other algorithms to solve task assignment problems. In [18], GA is combined with a clustering algorithm to solve the task allocation and path planning problems of multiple UAVs. In [19], the authors proposed two improved heuristic algorithms to solve TSP problems: an IGA algorithm obtained by improving the coding rules of the genetic algorithm and a PSO-ACO algorithm combining PSO and ACO. In [20], the authors improve the swarm-gap algorithm and put forward three algorithms, allocation loop (AL), sorting and allocation loop (SAL), and limit and allocation loop (LAL), which solve the task allocation problem of a UAV team in a military operation. However, heuristic algorithms tend to fall into local optima, and their real-time performance degrades as the complexity of the environment increases. Therefore, many researchers began to study the application of reinforcement learning to task assignment.

Reinforcement learning is a class of algorithms in which an agent learns the optimal strategy through trial and error in the environment, and it has been widely used in UAV mission assignment scenarios over the past few years. In [21], a transaction-inspired multi-agent reinforcement learning algorithm was proposed to solve the path planning and coordination problems of UAV clusters. In [22], the authors proposed a MADOL algorithm that enables multiple UAVs to solve the BSN allocation problem in an ambiguous boundary scenario. The literature [23] developed a multi-agent reinforcement learning framework that solves the dynamic resource allocation problem of a UAV communication network in an uncertain environment and achieves a balance between performance gain and UAV overhead. In [24], the authors proposed a multi-agent reinforcement learning algorithm, compound-action actor-critic (CA2C), which solves the problem of UAVs performing sensing tasks through cooperative sensing and transmission. In [25], the authors proposed an FTA algorithm by combining the DQN algorithm with prioritized experience replay, which effectively solves the problem of UAV task allocation in an uncertain environment. In [11], the authors proposed a DDQN-PER algorithm to solve the task assignment problem of MCS. However, these single-agent algorithms regard the agents in the environment as independent and cannot train a good agent cooperation model. The MADDPG algorithm [26] adopts centralized training with distributed deployment, which handles cooperation and competition among multiple agents well. In [27], the authors proposed an MADDPG-based approach, trained the MADDPG model offline, and then solved the resource allocation problem in the UAV-assisted vehicle network online. However, DDPG learns a deterministic policy, which may fall into a local optimum because of greedy behavior. The SAC algorithm [28] introduces entropy and requires not only maximum reward but also maximum entropy, which enhances the agents' ability to explore the space. Based on the idea of centralized training and decentralized execution, this paper applies the SAC algorithm to the cooperative multi-UAV task assignment environment and proposes an MA-SAC algorithm.

3. Task Assignment Model

A multi-UAV system should not only complete each task but also pay attention to its own safety and energy consumption. Figure 1 shows the task allocation framework of the multi-UAV system. In this paper, the distances from the UAVs to the mission positions, collisions of the UAVs, and the communication between the UAVs and the base station are considered together to establish the task assignment model; the specific modeling is as follows.

3.1. The Distance between the UAV and the Mission

This paper considers how to assign multiple UAVs to multiple task points and plan a safe path for each of them, so as to reduce the total cost while completing the tasks quickly and safely. In this paper, the UAV cluster is represented by $U = \{u_1, u_2, \ldots, u_N\}$. The position and track data of each UAV can be obtained from the GPS device carried by the UAV itself, and these data are transmitted to the MEC layer for computation. For each UAV $u_i$, $(x_i, y_i)$ is used to represent its current position.

The set of tasks to be completed is represented by $T = \{t_1, t_2, \ldots, t_M\}$. For each task $t_j$, $(x_j^t, y_j^t)$ is used to represent the task position.

The distance between UAV $u_i$ and the location of mission $t_j$ can be calculated using the following formula:
$$d_{i,j} = \sqrt{(x_i - x_j^t)^2 + (y_i - y_j^t)^2}. \tag{1}$$
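As a minimal illustration, the distance in (1) can be computed directly from the planar coordinates; the function and variable names below are illustrative, not taken from the paper.

```python
import numpy as np

def uav_task_distance(uav_pos, task_pos):
    """Euclidean distance between a UAV position (x_i, y_i) and a task
    position (x_j, y_j), as in the distance formula above."""
    return float(np.linalg.norm(np.asarray(uav_pos, dtype=float) -
                                np.asarray(task_pos, dtype=float)))
```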

3.2. UAV Collision

In order to simulate the real environment, some obstacles are added to the environment to block the routes of the UAVs. At the same time, collisions between a UAV and the other UAVs are considered. As shown in the figure, there is a certain safety buffer area between a UAV and the obstacles.

The distance between UAVs $u_i$ and $u_k$ can be calculated using the following formula:
$$d_{i,k}^{u} = \sqrt{(x_i - x_k)^2 + (y_i - y_k)^2}. \tag{2}$$

Once the distance between two UAVs or between a UAV and an obstacle is less than the safety distance, the UAVs are considered to be at risk of collision.
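A simple safety check consistent with this description might look as follows; it is a sketch that treats obstacles as points and uses a single safety distance d_safe, both of which are simplifying assumptions rather than details from the paper.

```python
import numpy as np

def collision_risk(uav_positions, obstacle_positions, d_safe):
    """Return True if any UAV-UAV or UAV-obstacle distance is below the
    safety buffer d_safe (illustrative names, point obstacles assumed)."""
    uavs = [np.asarray(p, dtype=float) for p in uav_positions]
    obstacles = [np.asarray(p, dtype=float) for p in obstacle_positions]
    for i in range(len(uavs)):
        for j in range(i + 1, len(uavs)):
            if np.linalg.norm(uavs[i] - uavs[j]) < d_safe:
                return True          # two UAVs are too close to each other
        for obs in obstacles:
            if np.linalg.norm(uavs[i] - obs) < d_safe:
                return True          # a UAV is too close to an obstacle
    return False
```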

3.3. UAV Communication

In order to grasp the status of the UAVs in real time, the communication between the UAVs and the base station needs to be considered; the position of the base station is represented by $(x_b, y_b)$. In this paper, the UAV's altitude above the ground is $H$, and the straight-line distance between UAV $u_i$ and the base station can be calculated by the following formula:
$$d_{i}^{b} = \sqrt{(x_i - x_b)^2 + (y_i - y_b)^2 + H^2}. \tag{3}$$

Transmitting the data collected by the UAV sensors consumes the energy of the sensor node [29]. In order to study the energy loss of UAV transmission, we consider the path loss of the UAV's communication with the base station. In the Friis free-space model [30], the relationship between the transmitted signal power and the received signal power can be calculated by the following formula:
$$P_r = \frac{P_t G_t G_r \lambda^2}{(4\pi)^2 d^2 L}, \tag{4}$$
where $P_r$ is the received signal power, $P_t$ is the transmitted signal power, $G_t$ is the transmitting antenna gain, $G_r$ is the receiving antenna gain, $\lambda$ is the signal wavelength, $L$ is the system loss factor unrelated to propagation, and $d$ is the propagation distance. In this paper, $d$ is the distance $d_i^b$ between the UAV and the base station in each time slot.

In order to ensure normal communication, the power of the attenuated UAV signal must be greater than the required receiving power of the base station, denoted $P_r^{\min}$. Therefore, the signal transmitting power of the UAV in each time slot must satisfy
$$P_t \ge \frac{P_r^{\min}\,(4\pi)^2 \left(d_i^b\right)^2 L}{G_t G_r \lambda^2}. \tag{5}$$
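The two relations above can be written as small helpers, assuming all quantities are in linear (not dB) units; the parameter names are illustrative.

```python
import math

def friis_received_power(p_t, g_t, g_r, wavelength, d, loss=1.0):
    """Friis free-space model (4): received power at distance d."""
    return p_t * g_t * g_r * wavelength ** 2 / ((4 * math.pi * d) ** 2 * loss)

def min_transmit_power(p_r_min, g_t, g_r, wavelength, d, loss=1.0):
    """Smallest transmit power in (5) whose attenuated signal still meets the
    base station's required receive power p_r_min."""
    return p_r_min * (4 * math.pi * d) ** 2 * loss / (g_t * g_r * wavelength ** 2)
```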

The communication energy consumption of each UAV $u_i$ to complete the task can be expressed as
$$E_i = \sum_{t \in T_i} P_t(t)\,\tau, \tag{6}$$
where $T_i$ is the set of time slots for the UAV to complete the task. In this paper, a time slot is approximated to one step of the simulation. $\tau$ is the duration of each time slot; in this model, $\tau$ is set to 1.

The total communication energy consumption of the UAV cluster can be calculated by the formula
$$E = \sum_{i=1}^{N} E_i. \tag{7}$$
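Putting (5)-(7) together, the cluster's communication energy can be sketched as below, assuming each UAV transmits at the minimum admissible power in every time slot; the data layout (one list of per-slot distances per UAV) is an assumption made for illustration.

```python
import math

def total_communication_energy(per_uav_slot_distances, p_r_min, g_t, g_r,
                               wavelength, loss=1.0, slot_duration=1.0):
    """Sum over UAVs and time slots of (minimum Friis transmit power) times
    (slot duration), following equations (5)-(7); names are illustrative."""
    total = 0.0
    for slot_distances in per_uav_slot_distances:   # one UAV at a time
        for d in slot_distances:                    # one time slot at a time
            p_t = p_r_min * (4 * math.pi * d) ** 2 * loss / (g_t * g_r * wavelength ** 2)
            total += p_t * slot_duration
    return total
```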

4. Task Assignment Algorithm

In this section, we consider the application of reinforcement learning to multi-UAV task allocation, apply the soft actor-critic (SAC) algorithm to a multi-agent environment, and propose an MA-SAC algorithm. Such algorithms are usually used to solve problems described as Markov decision processes (MDPs), so this section introduces the MDP of this model, the SAC algorithm, and the MA-SAC algorithm in turn.

4.1. Markov Decision Process

An MDP is usually composed of a state space, an action space, and a reward function. Therefore, the MDP of the model can be described as follows.

4.1.1. State

In this process, the state space is composed of the position and speed of the UAV, the distance between the UAV and the destination, and the collision risk of the UAV.

4.1.2. Action

The action space is the set of actions available to all UAVs in different states. In this model, the action space of a UAV is expressed as <front, back, left, right, hover>.

4.1.3. Reward

In this model, when multiple UAVs face multiple tasks, this paper aims to reasonably allocate the task targets and plan a path for each UAV, so that every task can be completed safely and quickly with the minimum total energy consumption. Therefore, for UAV $u_i$, the reward can be described as
$$r_i = r_{\mathrm{task}} + r_{\mathrm{col}} + r_{\mathrm{dis}}. \tag{8}$$

The task assignment problem can be described as
$$\max_{x}\; \sum_{i=1}^{N} r_i \tag{9}$$
$$\text{s.t.}\;\; \sum_{i=1}^{N} x_{i,j} = 1, \quad \forall j, \tag{10}$$
$$\sum_{j=1}^{M} x_{i,j} = 1, \quad \forall i, \tag{11}$$
where $r_{\mathrm{task}}$ is the reward for completing the task and its value is constant, $r_{\mathrm{col}}$ is the collision reward, and $r_{\mathrm{dis}}$ is the distance reward. In order to guide the UAV to the mission point, the distance reward can be expressed as $r_{\mathrm{dis}} = -d_{i,j}$. $x_{i,j} = 0$ indicates that mission $t_j$ is not carried out by UAV $u_i$, and $x_{i,j} = 1$ indicates that UAV $u_i$ performs mission $t_j$. Formula (10) means that only one UAV can be assigned to perform each task, and formula (11) means that each UAV can only perform one task.
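A per-step reward consistent with this description could be sketched as follows; the constants and the linear distance-shaping term are illustrative assumptions, not values from the paper.

```python
def uav_step_reward(dist_to_task, task_completed, collided,
                    r_task=10.0, r_collision=-10.0, w_dist=0.1):
    """Illustrative reward combining the three terms in (8): a constant bonus
    for completing the task, a collision penalty, and a distance reward that
    guides the UAV toward its mission point (negative distance)."""
    reward = -w_dist * dist_to_task      # distance reward r_dis
    if task_completed:
        reward += r_task                 # task-completion reward r_task
    if collided:
        reward += r_collision            # collision reward r_col
    return reward
```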

4.2. SAC Algorithm

The SAC algorithm is an off-policy reinforcement learning algorithm. The algorithm used in this paper is based on the SAC algorithm proposed in [31], which improves the critic networks of the first version of SAC [32]: it removes the value network and uses two Q networks. Therefore, the SAC algorithm has one actor network, two critic networks, and two target-critic networks. The actor network gives the corresponding action according to the change of state, and the critic networks calculate the Q value to evaluate the action. In order to alleviate the overestimation problem, the SAC algorithm adopts a pair of independent critic networks and takes the smaller of the two values when updating. In order to stabilize the training of the Q networks, the SAC algorithm introduces a pair of target-critic networks whose update frequency is lower than that of the critic networks.

In order to prevent the strategy from stagnating because of greed, the random exploration ability of the algorithm must be increased, so SAC introduces entropy regularization. The more uniform the strategy distribution, the greater its entropy and the stronger the random exploration ability of the algorithm. Therefore, the objective of the SAC algorithm requires not only the maximum final reward but also the maximum entropy. Its objective function can be expressed as
$$\pi^{*} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t)\sim \rho_{\pi}}\big[r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\big], \tag{12}$$
where $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy of the strategy, $r(s_t, a_t)$ is the reward at time $t$, $\alpha$ is the temperature coefficient that weights the entropy term, and $\pi^{*}$ is the optimal strategy.
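A minimal stochastic actor matching the discrete action set of Section 4.1.2 could be a categorical policy whose entropy supplies the exploration bonus in (12); the network sizes and the interface below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Categorical policy over the five actions <front, back, left, right,
    hover>; its entropy is the H(pi) term in the SAC objective."""
    def __init__(self, obs_dim, n_actions=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        dist = torch.distributions.Categorical(logits=self.net(obs))
        action = dist.sample()
        # log-probability and entropy are what the SAC-style updates need
        return action, dist.log_prob(action), dist.entropy()
```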

4.3. MA-SAC Algorithm

Figure 2 shows the MA-SAC algorithm that we propose by adapting the SAC algorithm to the multi-UAV task allocation model. The MA-SAC algorithm is based on the actor-critic framework. In this multi-UAV environment, each UAV has an actor network, a target-actor network, two critic networks, and two target-critic networks, all of which are fully connected neural networks.

In the multi-UAV environment, a UAV is not only an agent itself but also part of the environment of the other UAVs. Therefore, for the critic network of each UAV, not only the environmental state but also the actions of the other UAVs are fed into the critic network to calculate the Q value, so that the other UAVs are treated as part of the overall environment. Like DDPG and other algorithms, SAC introduces the experience replay mechanism to reduce the correlation between data. Therefore, the whole training process is divided into two parts: experience collection and network training. In the experience collection phase, the agents perform the actions generated at each step and then store the tuples consisting of state, action, reward, and next state into the replay buffer.

When the data in the replay buffer reach the threshold, the network training stage begins. At each step, a batch of data is sampled from the replay buffer to update the parameters of the actor networks and critic networks. The actor network is trained by the policy gradient. For each UAV $u_i$, the actor network update target is as follows:
$$J(\theta_i) = \mathbb{E}\big[\alpha \log \pi_i(a_i \mid o_i) - Q_i(x, a_1, \ldots, a_N)\big], \tag{13}$$
where $\pi_i$ represents the policy network of agent $i$, $\theta_i$ represents the parameters of the policy network $\pi_i$, and $x$ represents the current status of all agents.
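For the discrete action set used here, one way to realize this update is the expected-Q form of the soft policy loss, sketched below; the critic interface (a list of observations and a list of actions) and the assumption that the policy network returns logits are illustrative, not taken from the paper.

```python
import torch

def actor_loss(agent_i, obs_all, acts_all, policy_logits_i, critic_i,
               alpha, n_actions=5):
    """Soft policy loss for agent i: weight each candidate action by the
    current policy and trade the centralised Q value against the entropy
    bonus, keeping the other agents' actions fixed (from the batch)."""
    log_pi = torch.log_softmax(policy_logits_i(obs_all[agent_i]), dim=-1)
    pi = log_pi.exp()
    q_per_action = []
    for a in range(n_actions):                   # evaluate Q_i(x, a_-i, a)
        acts = list(acts_all)
        acts[agent_i] = torch.full_like(acts_all[agent_i], a)
        q_per_action.append(critic_i(obs_all, acts).squeeze(-1))
    q = torch.stack(q_per_action, dim=-1)        # shape [batch, n_actions]
    return (pi * (alpha * log_pi - q)).sum(dim=-1).mean()
```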

The critic networks are updated by minimizing a loss function. The loss function is the mean square error, which can be calculated by the formula
$$L(\phi_i) = \mathbb{E}\Big[\big(Q_i(x, a_1, \ldots, a_N) - y_i\big)^2\Big], \quad y_i = r_i + \gamma\Big(\min_{k=1,2} Q_i^{\prime k}(x^{\prime}, a_1^{\prime}, \ldots, a_N^{\prime}) - \alpha \log \pi_i(a_i^{\prime} \mid o_i^{\prime})\Big), \tag{14}$$
where $x^{\prime}$ represents the next status of all agents, $a_i^{\prime}$ represents the next action of agent $i$ sampled from its policy, and $o_i^{\prime}$ represents the next state of agent $i$.
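A corresponding critic update, sketched with the same illustrative interfaces (policies that return action, log-probability, and entropy; centralised critics that take all observations and all actions), could look like this:

```python
import torch
import torch.nn.functional as F

def critic_loss(batch, agent_i, critics_i, target_critics_i, target_policies,
                alpha, gamma):
    """Both Q networks of agent i regress toward the soft Bellman target (14),
    built from all agents' next actions and the smaller target-critic value."""
    obs, acts, rewards, next_obs, dones = batch        # per-agent tensors
    with torch.no_grad():
        next_acts, next_logp_i = [], None
        for j, policy_j in enumerate(target_policies):
            a_j, logp_j, _ = policy_j(next_obs[j])
            next_acts.append(a_j)
            if j == agent_i:
                next_logp_i = logp_j
        q1_t = target_critics_i[0](next_obs, next_acts)
        q2_t = target_critics_i[1](next_obs, next_acts)
        soft_v = torch.min(q1_t, q2_t) - alpha * next_logp_i
        y = rewards[agent_i] + gamma * (1.0 - dones[agent_i]) * soft_v
    q1 = critics_i[0](obs, acts)
    q2 = critics_i[1](obs, acts)
    return F.mse_loss(q1, y) + F.mse_loss(q2, y)
```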

To ensure the stability of training, the parameters of the actor networks and critic networks are copied to the corresponding target networks in each iteration. The algorithm adopts the soft update method, so at each step a fraction of the actor and critic network parameters is blended into the corresponding target networks, which can be calculated by the formulas
$$\phi_i^{\prime} \leftarrow \tau \phi_i + (1-\tau)\,\phi_i^{\prime}, \tag{15}$$
$$\theta_i^{\prime} \leftarrow \tau \theta_i + (1-\tau)\,\theta_i^{\prime}, \tag{16}$$
where $\phi_i^{\prime}$ is the parameter of the target-critic network, $\phi_i$ is the parameter of the critic network, $\theta_i^{\prime}$ and $\theta_i$ are the corresponding target-actor and actor parameters, and $\tau$ is the update ratio.
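The soft update itself is a short Polyak average over parameters, for example:

```python
def soft_update(target_net, online_net, tau):
    """Move each target-network parameter a fraction tau toward the
    corresponding online-network parameter, as in (15) and (16)."""
    for tgt, src in zip(target_net.parameters(), online_net.parameters()):
        tgt.data.mul_(1.0 - tau).add_(tau * src.data)
```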

The pseudocode of the MA-SAC algorithm is demonstrated in Algorithm 1, and the meanings of the parameters are shown in Table 1.

(1)Initialize environment
(2)Initialize critic network and actor network
(3)Initialize max episodes, replay buffer, batch size
(4)for episode [1, episodes] do
(5) Reset environment
(6) Get current state s_i for each agent i
(7)for step [1, steps] do
(8)  Select action a_i for each agent i
(9)  Get all agents' next states and rewards
(10)  Store <s, a, r, s'> to replay buffer D
(11)  if |D| > batch size then
(12)   Sample batch B from replay buffer D
(13)   for agent i, where i = 1:N do
(14)   Update the critic network
(15)   Update the actor network
(16)   Update the target network according to formulas (15), (16)
(17)   end for
(18)  end if
(19)end for
(20)end for
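For readability, Algorithm 1 can be mirrored by a short Python skeleton; env, agents, and buffer are assumed helper objects standing in for the environment, the per-UAV MA-SAC agents, and the replay buffer, so this is an outline rather than the paper's code.

```python
for episode in range(max_episodes):
    obs = env.reset()                                            # step (5)
    for step in range(max_steps):
        actions = [agent.act(o) for agent, o in zip(agents, obs)]   # step (8)
        next_obs, rewards, dones = env.step(actions)                # step (9)
        buffer.add(obs, actions, rewards, next_obs, dones)          # step (10)
        obs = next_obs
        if len(buffer) > batch_size:                                # step (11)
            batch = buffer.sample(batch_size)                       # step (12)
            for agent in agents:
                agent.update_critics(batch)     # minimise the MSE loss (14)
                agent.update_actor(batch)       # soft policy loss (13)
                agent.update_targets(tau)       # soft updates (15), (16)
```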

5. Experimental Results and Analysis

In this section, the performance of the MA-SAC algorithm in the multi-UAV task assignment environment is studied. We use the PyTorch deep learning framework to simulate this scenario and compare the algorithm with the MADDPG algorithm. Table 2 shows the relevant hyperparameters used in the simulations.

In this experiment, we constructed an environment in which multiple UAVs cooperate to complete tasks. The environment consists of three UAVs, three mission positions, one obstacle, and a base station that communicates with the UAVs. Firstly, the MADDPG algorithm proposed in [26] is selected for the comparison of convergence performance. Figure 3 shows the convergence process of the MA-SAC algorithm and the MADDPG algorithm during training in this environment. In this experiment, we ran 50,000 training episodes and averaged the rewards every 1,000 episodes. Comparing the two algorithms, the proposed MA-SAC algorithm finally converges to around 300, while the MADDPG algorithm converges to around 220. The convergence speed of the two algorithms is similar in this scenario, but the convergence result of the MA-SAC algorithm is better than that of MADDPG, because the training goal of the MA-SAC algorithm is not only to maximize the reward of the UAV but also to maximize the entropy of the UAV strategy, which increases the UAV's ability to explore the space and thereby improves the performance of the algorithm.

To verify the effectiveness of the algorithm in this scenario, we conducted 500 episodes of tests on the MA-SAC algorithm in this environment and compared it with other multiagent reinforcement learning algorithms. As shown in Table 3, the task completion rate of the MA-SAC algorithm reaches 95.16%, which is a great improvement compared with that of the COMA and VDN algorithms, and the task completion rate is also increased by 2.4% compared with the MADDPG algorithm.

Figure 4 shows the dynamic assignment process of the UAVs in the task area before training. At this time, none of the three UAVs has learned any strategy, so they explore the environment randomly. It can be seen from the UAV routes during the task assignment process that the UAVs do not yet have clear mission targets and move randomly in space; UAV 2 even collides with an obstacle.

Figure 5 shows the multi-UAV task assignment process after 20,000 episodes of training with the proposed MA-SAC algorithm. It can be seen that, although the UAVs have learned to approach the mission points at this stage, there is no coordination between them: UAV 2 and UAV 3 both fly to the same mission location, so not all missions are completed.

Figure 6 shows the task assignment process of the UAVs when training reaches 50,000 episodes. At this point, the trained model can already solve the task assignment problem in this environment well. When assigning tasks, the UAVs not only consider their own distances to the tasks but also take into account the strategies of the other UAVs and cooperate with each other to complete all tasks in the mission area. At the same time, the UAVs have also learned to stay away from obstacles to reduce their own risk while completing tasks. It can be seen that UAV 2 is relatively close to the obstacle at the beginning, so there is a possibility of collision; to ensure its own safety, it first flies away from the obstacle and then flies to the mission location after reaching the safe area.

6. Conclusions

In this paper, a multi-UAV cooperative task assignment model in a complex environment is constructed by considering UAV distance, collision, and communication. Meanwhile, we propose an MA-SAC algorithm to solve the model by combining the SAC deep reinforcement learning algorithm with the multi-agent framework of centralized training and decentralized execution. Simulation results show that the MA-SAC algorithm is superior to the MADDPG algorithm in convergence results in the multi-UAV task allocation environment. In terms of task completion rate, the model trained by the MA-SAC algorithm also achieves better results.

In future work, more complex factors will be considered in the environment, such as a communication model that is closer to real scenes and weather changes. We will also study larger-scale dynamic task allocation of UAVs. Since this paper only studies the cooperative UAV scenario, UAV task allocation in adversarial scenarios will be studied in the future.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by National Natural Science Foundation of China (11974058 and 61901050); Beijing Nova Program (Z201100006820125) from Beijing Municipal Science and Technology Commission; Beijing Natural Science Foundation (Z210004); and State Key Laboratory of Information Photonics and Optical Communications (IPOC2021ZT01), BUPT, China.