Unmanned aerial vehicle (UAV) technology, with its flexible deployment, has enabled the development of Internet of Things (IoT) applications. However, it is difficult to guarantee the freshness of information delivery for an energy-limited UAV. Thus, we study trajectory design in a multiple-UAV communication system, in which massive ground devices send their individual information to mobile UAV base stations under information freshness requirements. First, an energy-efficiency (EE) maximization problem is formulated under rest energy, safety distance, and age of information (AoI) constraints. However, this optimization problem is difficult to solve because of the nonconvex objective function and the unknown dynamic environment. Second, a trajectory design based on the deep Q-network method is proposed, in which a state space that accounts for energy efficiency, rest energy, and AoI and an efficient reward function related to EE performance are constructed. Furthermore, to avoid dependency among the training data for the neural network, experience replay and random batch sampling are adopted. Finally, we validate the system performance of the proposed scheme. Simulation results show that the proposed scheme achieves better EE performance than the benchmark scheme.

1. Introduction

With the explosive growth of global mobile devices and connections in future wireless networks, Cisco forecasts that global mobile data traffic will reach 77 exabytes per month by 2022 [1], which is almost twice the data traffic in 2020. To meet the needs of high data traffic volume and massive connections, the sixth generation (6G) wireless communication system adopts several promising technologies to improve the communication rate, widen coverage, support access for massive numbers of devices, and strengthen intelligence and security [2, 3]. UAV communication, one of these promising technologies, offers flexible deployment, controllable maneuverability, and low cost, and it has become a topic of interest in industry and academia for driving the development of Internet of Things (IoT) applications [4–7].

Because the UAV’s trajectory directly affects the rate performance and energy consumption, trajectory design is vital in various communication scenarios. Although several studies have investigated trajectory design [8–10] for single-UAV communication systems under different settings, multiple UAVs may serve a specific area to provide better communication rate and coverage performance, which can increase the interference level of the UAV communication network. In [11], the authors analyzed the influence of the UAVs’ positions on the rate performance and obtained the optimal positions in the two-UAV interference channel. The authors in [12] designed the trajectories of multiple UAVs based on the successive convex approximation (SCA) method considering the backhaul. However, exploring energy-efficient trajectory design is vital for the energy-limited UAV to enhance its sustainable communication capability. In [13], a comprehensive energy consumption model including the communication energy and propulsion energy was proposed for rotary-wing UAVs, and an SCA-based technique was employed to optimize the UAV trajectory. Furthermore, a joint design of backscatter device scheduling, trajectory, and transmit power was proposed in [14] to maximize energy-efficiency (EE) performance. In addition, the authors in [15] optimized a constructed trajectory to minimize the propulsion energy of a fixed-wing UAV. The above studies focus on UAV communication that provides high-performance communication links and enables information delivery.

With the increase in real-time and computation-intensive applications, a new metric is required to satisfy the demand for information freshness beyond the scope of the delay metric. To characterize the freshness of information delivery accurately, age of information (AoI) was proposed in [16]; it describes the timeliness of information updates since their original generation, from the perspective of the receiver. In general, AoI is defined as the time gap between the observed time and the most recent update time. Because it takes both the generation time and the information update into consideration, it differs from conventional performance metrics such as delay. To meet the demands of various communication scenarios, the AoI metric has been introduced to optimize generation policies and user scheduling [17–20]. To guarantee the information freshness of the UAV communication system, the authors in [21] proposed dynamic programming-based path planning for updating the collected data so as to minimize the AoI value. Recently, learning methods have been proposed to make decisions in various dynamic scenarios [22–24]. Under the UAV’s energy constraint, an AoI optimization scheme based on reinforcement learning (RL) was proposed in [23] by optimizing the UAV’s trajectory. In [24], the authors designed a joint trajectory and packet scheduling scheme based on a deep reinforcement learning (DRL) approach to minimize the weighted AoI performance of a single-UAV system. However, how to design a trajectory that not only improves the energy efficiency but also guarantees timely information updates in the UAV communication system remains an open problem.

In this paper, we consider the UAVs’ trajectory design in a multiple-UAV-enabled communication system to maximize energy-efficiency performance, in which each ground device sends its individual information to the corresponding UAV. First, we formulate an energy-efficient trajectory design optimization problem in multiple-UAV communication systems under practical constraints, such as rest energy, safety distance, and the AoI metric. However, this optimization problem is difficult to solve because of the nonconvex objective function and the unknown spaces for the UAV’s trajectory. Second, a deep Q-network (DQN) method is proposed to optimize the UAV’s trajectory and reduce the computational complexity. We design the state space considering the UAV’s position, energy efficiency, rest energy, and AoI and construct an efficient reward function related to the objective function and the constraints. Furthermore, to avoid dependency among the training data for the neural network, experience replay and random batch sampling are adopted, which can guarantee the stability of the proposed scheme in the dynamic environment. Finally, we verify the system performance of the proposed scheme. Simulation results show that the proposed scheme outperforms the benchmark scheme.

The rest of the paper is organized as follows. In Section 2, the system model and the energy-efficiency optimization problem in the multiple-UAV communication system are presented. In Section 3, we present the proposed DQN scheme for solving the optimization problem in detail. The simulation results of the proposed scheme are shown in Section 4. Finally, we conclude the work in Section 5.

2. System Model

We consider a UAV communication system consisting of single-antenna UAV base stations denoted as and single-antenna IoT devices denoted as (e.g., smart grid, agricultural, safety, or geographic information). All the UAVs working as aerial base stations serve the ground devices, as depicted in Figure 1. The UAVs’ service range is within a specific radius, and the flight height is fixed as .

For simplicity, and denote the three-dimensional (3D) position of the -th UAV at time and the -th device’s position, respectively. Thus, the 3D distance between the -th device and the -th UAV is written as

In order to model the channel of the UAV communication link more practically, we adopt the probabilistic line-of-sight (LoS) channel model proposed in [25]. The probability of an LoS link between the -th device and the -th UAV, which is related to the device’s elevation angle, can be expressed as

where and are the channel parameters related to the environment of the communication link and denotes the elevation angle between the -th device and the -th UAV at time . Thus, the probability of a non-line-of-sight (NLoS) link between the -th device and the -th UAV is . Therefore, the average channel power gain between the -th device and the -th UAV is defined as follows [25]:
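The geometry and LoS probability described in this paragraph can be illustrated with a minimal Python sketch. It assumes the ground device sits at zero altitude and that the LoS probability takes the sigmoid form commonly used with probabilistic LoS models, 1/(1 + a·exp(−b(θ − a))), with the elevation angle θ in degrees. The function names and the default values a = 9.61 and b = 0.16 are illustrative assumptions, not parameters taken from this paper.

```python
import math


def distance_3d(uav_xy, device_xy, height):
    """3D distance between a UAV at altitude `height` (> 0) and a ground device at altitude 0."""
    dx = uav_xy[0] - device_xy[0]
    dy = uav_xy[1] - device_xy[1]
    return math.sqrt(dx * dx + dy * dy + height * height)


def elevation_angle_deg(uav_xy, device_xy, height):
    """Elevation angle (degrees) of the UAV as seen from the device."""
    return math.degrees(math.asin(height / distance_3d(uav_xy, device_xy, height)))


def los_probability(theta_deg, a=9.61, b=0.16):
    """Assumed sigmoid LoS probability; a and b are environment-dependent constants."""
    return 1.0 / (1.0 + a * math.exp(-b * (theta_deg - a)))


def nlos_probability(theta_deg, a=9.61, b=0.16):
    """NLoS probability is the complement of the LoS probability."""
    return 1.0 - los_probability(theta_deg, a, b)


if __name__ == "__main__":
    theta = elevation_angle_deg((30.0, 40.0), (0.0, 0.0), 100.0)
    print(theta, los_probability(theta), nlos_probability(theta))
```

In this sketch, a UAV hovering closer to directly overhead sees a larger elevation angle and therefore a higher LoS probability, which is the qualitative behavior the trajectory design exploits.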

In this work, we adopt time-division multiple access to serve the ground devices, in which the interference originates from the devices using the same time resource. Thus, the signal-to-interference-plus-noise ratio (SINR) of the -th user at the -th UAV at the time is expressed as

where and are the transmit power of the -th ground device and the received noise power at each UAV, respectively. According to (4), the communication rate between the -th device and the -th UAV is written as

where denotes the transmission bandwidth. Since the UAV working as the aerial BS serves IoT devices with small data packets, the UAV requires only a comparably short time to receive the data, and we neglect the hover energy consumption. Thus, the energy consumption of UAV communication mainly consists of the communication energy consumption and the propulsion energy consumption for flight . We assume that each UAV has a constant receiving power , and the energy consumed by the communication link from the starting time to the time is expressed as
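As a hedged illustration of the quantities in this paragraph, the sketch below computes an SINR, the corresponding Shannon rate B·log2(1 + SINR), and the communication energy under a constant receiving power. The function names, argument layout, and the treatment of interference as a plain sum of received powers are assumptions made for illustration; the paper's exact expressions are not reproduced here.

```python
import math


def sinr(tx_power, channel_gain, interferer_powers, interferer_gains, noise_power):
    """Received signal power over (co-slot interference + noise)."""
    interference = sum(p * g for p, g in zip(interferer_powers, interferer_gains))
    return tx_power * channel_gain / (interference + noise_power)


def rate_bps(bandwidth_hz, sinr_value):
    """Shannon rate in bit/s for a given bandwidth and SINR."""
    return bandwidth_hz * math.log2(1.0 + sinr_value)


def communication_energy(rx_power_w, elapsed_s):
    """Energy spent on reception with constant receiving power over elapsed_s seconds."""
    return rx_power_w * elapsed_s


if __name__ == "__main__":
    gamma = sinr(3.0, 2.0, [1.0], [1.0], 1.0)  # 6 / (1 + 1) = 3
    print(gamma, rate_bps(1e6, gamma), communication_energy(0.5, 10.0))
```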

The energy consumption to support the UAV flight is the propulsion consumption. In general, the required power consumption can be modelled as follows [13]:

where and denote the flight speed of the -th UAV and the flight acceleration of the UAV, respectively. The parameters and depend on the weight, the wing length of the UAV, and the air density in the flight environment, and is the acceleration of gravity. Thus, the propulsion energy is , and the energy consumption at the time is written as

Since the UAV has limited energy for flight and communication, it is necessary to retain enough rest energy for safe flight and return.

Let denote the maximal energy of the -th UAV; thus, the rest energy of the -th UAV is defined as

In order to meet the essential safety for UAV flight, the rest energy of UAV should not be less than the minimum rest energy with denoting the coefficient of rest energy.

The energy efficiency for UAV to serve IoT devices at the time is expressed as

AoI describes the age of the received packet at the destination to characterize the freshness of information collected by UAVs and becomes a new metric for the future communication system [16]. For example, the device sends different packets at the times and the UAV receives these packets at the times . At time , the immediate vicinity time for receiving the packets between UAV and the device is . Thus, AoI of UAV is defined as the gap between the observed time and the maximal received time, which is written as

Note that a smaller AoI indicates fresher information, while a larger AoI indicates less fresh information. For simplicity, , i.e., the AoI of the system at the time is set to zero. The average AoI is a long-term metric that measures the overall freshness of information over the duration , which is expressed as
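The AoI definition above (observed time minus the most recent reception time, starting from zero) can be sketched in a few lines of Python. The discrete-time averaging with a fixed step is an assumption made for illustration; the function names are hypothetical.

```python
def aoi(t, receive_times, start_time=0.0):
    """AoI at time t: gap between t and the latest reception at or before t.

    If nothing has been received yet, the age is measured from start_time,
    which matches the convention that the AoI is zero at the starting time.
    """
    latest = max((r for r in receive_times if r <= t), default=start_time)
    return t - latest


def average_aoi(receive_times, horizon, step=1.0):
    """Average AoI sampled at t = step, 2*step, ..., horizon (assumed discretization)."""
    n = int(round(horizon / step))
    samples = [aoi(k * step, receive_times) for k in range(1, n + 1)]
    return sum(samples) / len(samples)


if __name__ == "__main__":
    receptions = [2.0, 5.0]
    print([aoi(t, receptions) for t in (1.0, 2.0, 4.0, 5.0)])
    print(average_aoi(receptions, horizon=6.0))
```

More frequent receptions reset the age more often, so the average drops, which is why the trajectory affects the AoI constraint.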

The target is to maximize the energy efficiency of the UAV system by optimizing the trajectory under the AoI and energy constraints. Therefore, the optimization problem is expressed as follows:

Constraints (13) and (14) originate from the safety requirements that the UAVs’ distance and rest energy should both stay above the minimum thresholds and to avoid flight conflicts, while constraint (15) comes from the demand for data freshness with the threshold . The objective function is nonconvex, and searching for the service sequence of the devices under the AoI constraints while designing the UAV trajectory with unknown spaces in the system incurs heavy computational complexity; thus, it is challenging to handle problem (P1). To solve the nonconvex problem, a deep reinforcement learning method, which combines a deep neural network with reinforcement learning, is adopted to design the UAVs’ trajectories intelligently.

3. Proposed Solution Based on Deep Reinforcement Learning

RL is a promising machine learning approach because it can make decisions by choosing a beneficial action from the action space based on its past experiences in dynamic environments, in which the agent interacts with the environment and updates the rewards for the current state [26]. There are three fundamental parts in each RL algorithm: the state of the environment, the action of the agent, and the reward from the environment.

3.1. State, Action, and Reward Function

In this work, UAVs are regarded as the agents to decide the trajectory to satisfy the requirements of energy and distance, and the state space of the environment for UAV trajectory design can be defined as a five-dimensional state space including the UAV’s position, energy efficiency, rest energy of UAV, and AoI of current state, respectively, i.e., the state of the -th UAV is expressed as follows:

In the initial state, each UAV is equipped with the maximal energy and has not built a communication link with any device. Thus, the initial state , where the AoI at the first time is set to zero for all UAVs. The rest energy, energy efficiency, and AoI in each state can be updated according to (9)–(11), respectively. If the rest energy is smaller than the minimal threshold, the UAV decides to return to the initial position to guarantee its safety and stops updating the state.

At any state of the environment, the UAV can select the flying direction to serve the ground devices or reduce the interference in the UAV communication system. For simplicity, the action space is set as , with uniform directions in . If the , the setting is typical with four orthogonal directions {left, right, frontward, backward}. After selecting the action in the action space , the UAV transitions from the current state to the next state , e.g., the next position of the -th UAV in the flying time duration :
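A minimal sketch of such a discretized action space and the resulting position update follows. It assumes that the directions are spread uniformly over a full turn and that the UAV flies at constant speed for one time slot at a fixed height; the function names and argument names are hypothetical.

```python
import math


def action_directions(num_actions):
    """Uniformly spaced heading angles (radians) in [0, 2*pi)."""
    return [2.0 * math.pi * k / num_actions for k in range(num_actions)]


def next_position(position, heading_rad, speed, duration):
    """Horizontal position after flying `duration` seconds at `speed` along the heading."""
    x, y = position
    step = speed * duration
    return (x + step * math.cos(heading_rad), y + step * math.sin(heading_rad))


if __name__ == "__main__":
    headings = action_directions(4)  # four orthogonal directions
    for h in headings:
        print(next_position((0.0, 0.0), h, speed=10.0, duration=2.0))
```

With four actions, the headings correspond to the four orthogonal directions mentioned above; a larger number of actions gives a finer quantization of the continuous heading at the cost of a larger output layer.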

Based on the new UAV position , the communication rate is obtained, while the energy efficiency and AoI values can also be updated to show the performance of the new state. Since the reward function is significant for obtaining an optimal policy, the system adopts the energy efficiency to construct the reward function, i.e.,

where denotes the normalized parameter for energy efficiency (since the value of energy efficiency is comparatively high, the normalized parameter can balance the reward between negative and positive reward; in general, it is almost the order of magnitude of , which is based on the reward value in the specific communication system). It is observed that the UAV obtains a positive reward when the action satisfies the corresponding constraints, while it obtains no reward or a penalty if the UAV violates the constraints. Since the reward function is monotonically increasing with respect to (w.r.t.) the objective function in P1, the UAV makes decisions toward energy-efficiency maximization, and the system can obtain the optimal solution.
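One way to read this reward design is sketched below: the reward is the energy efficiency divided by a normalization constant when all constraints hold, and a fixed penalty otherwise. The penalty value, the argument names, and the single-branch structure are assumptions for illustration; the paper's exact reward expression may differ.

```python
def reward_value(energy_efficiency, normalizer,
                 rest_energy, min_rest_energy,
                 min_uav_distance, safety_distance,
                 average_aoi, aoi_threshold,
                 penalty=-1.0):
    """Normalized EE if all constraints are met, otherwise a fixed penalty (assumed)."""
    feasible = (
        rest_energy >= min_rest_energy          # enough energy left for a safe return
        and min_uav_distance >= safety_distance  # no flight conflict between UAVs
        and average_aoi <= aoi_threshold         # information is fresh enough
    )
    return energy_efficiency / normalizer if feasible else penalty


if __name__ == "__main__":
    print(reward_value(2.0e4, 1.0e4, 500.0, 100.0, 30.0, 10.0, 3.0, 5.0))
    print(reward_value(2.0e4, 1.0e4, 50.0, 100.0, 30.0, 10.0, 3.0, 5.0))
```

Because the feasible branch is a positive scaling of the energy efficiency, a higher energy efficiency always yields a higher reward, which mirrors the monotonicity argument above.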

Since the size of the state space grows nonlinearly with the sizes of the UAV’s position, action space, rest energy, and AoI, obtaining the optimal action by exhaustive search incurs intensive computational complexity, which is impractical for the energy-limited UAV system. Moreover, the Q-learning method based on a Q-table requires a large memory to store the Q-table, so it is not suitable for this optimization problem.

3.2. Control Policy and Deep Reinforcement Learning Algorithm

Recently, the deep RL algorithm has become a promising technique for resource allocation and performance optimization in wireless communication systems. However, the continuous action space in the UAV communication system needs to be quantized into a discrete one, and a state-action function needs to be formulated to characterize the influence of the selected action on the performance in a specific state. The deep RL method is applied to solve the problem because it can use the neural network to learn the policy and cope with the high dimensionality of the state space instead of storing the Q-values. The detailed design framework based on DRL in the UAV communication system is shown in Figure 2. The Q-function originating from Q-learning is adopted to maximize the long-term cumulative reward. Given the control policy for the -th UAV, the Q-function is defined as

where is the discount factor. If the discount factor , the Q-function is only related to the current reward, i.e., the selected action only maximizes the current reward without considering the future reward. Thus, the optimal action to maximize the objective function in P1 can be written as follows:

To obtain the optimal control policy , the Q-function is updated based on the Bellman equation:

where is the learning rate. According to (21), the UAV can update the Q-function and learn the control policy based on the stored Q-values by selecting the action that maximizes the reward. However, one difficulty arises during learning: how to select the action when only limited state-action values are available. At the start of learning, the UAV has only partial Q-values and cannot choose an appropriate action. Thus, the UAV should explore the environment sufficiently to obtain the Q-values of all state-action pairs. To tackle this issue, an -greedy strategy is applied to explore the environment with the probability , which is written as
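For intuition, the Bellman-style update and the ε-greedy selection rule can be sketched in tabular form. This is the textbook Q-learning update Q(s, a) ← Q(s, a) + α[r + γ·max Q(s′, a′) − Q(s, a)], used here as an assumed stand-in for the paper's expressions; the dictionary-based table and the function names are illustrative only.

```python
import random


def q_update(q_table, state, action, reward, next_state, actions, alpha, gamma):
    """One tabular Q-learning step; unseen state-action pairs default to 0."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_table[(state, action)]


def epsilon_greedy(q_table, state, actions, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the highest-valued action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))


if __name__ == "__main__":
    q = {}
    actions = [0, 1, 2, 3]
    q_update(q, "s0", 2, reward=1.0, next_state="s1",
             actions=actions, alpha=0.5, gamma=0.9)
    print(q, epsilon_greedy(q, "s0", actions, epsilon=0.1))
```

Setting the discount factor to zero in `q_update` removes the look-ahead term, so only the immediate reward drives the update.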

At each state, the UAV can take a random action with the probability to explore the environment. As the number of explorations increases, the probability can decrease to guarantee the system performance, with the UAV selecting the optimal action. Since the unknown state space for the UAV’s trajectory may lead to a large memory size and a slow convergence rate, a deep neural network is an effective method to extract features from the existing data sets intelligently and reduce the computational complexity by predicting the output in parallel. According to the framework in Figure 2, the tuple consisting of the state, action, reward, and next state works as the input of the deep neural network to output the Q-value as in the estimate and target neural networks, where denotes the parameters of the neural network during the -th training. The target neural network is replaced by a replica of the estimate neural network every steps to keep the two neural networks as close as possible and guarantee stability. Therefore, it is important to optimize the parameters of the neural network based on a suitable loss function to obtain the optimal Q-function. The loss function based on the error is defined as follows:

Based on the loss function and the training data set, some optimizers can be used to obtain the optimal parameters of neural network, such as gradient descent algorithm and Adam algorithm.
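The role of the two networks in the loss can be illustrated with a small framework-free sketch. It assumes the usual DQN mean squared error between the target r + γ·max Q_target(s′, a′) and the estimate Q_estimate(s, a); the two networks are represented by plain callables that map a state to a list of Q-values, which is a simplification made for illustration.

```python
def td_target(reward, next_q_values, gamma):
    """Target value built from the target network's Q-values at the next state."""
    return reward + gamma * max(next_q_values)


def dqn_loss(batch, q_estimate, q_target, gamma):
    """Mean squared error between TD targets and estimated Q-values over a batch.

    batch: iterable of (state, action, reward, next_state) tuples.
    q_estimate, q_target: callables returning a list of Q-values per action.
    """
    squared_errors = []
    for state, action, reward, next_state in batch:
        y = td_target(reward, q_target(next_state), gamma)
        squared_errors.append((y - q_estimate(state)[action]) ** 2)
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    estimate = lambda s: [1.0, 2.0]   # stand-in for the estimate network
    target = lambda s: [0.5, 1.5]     # stand-in for the target network
    print(dqn_loss([((0,), 1, 1.0, (1,))], estimate, target, gamma=0.5))
```

An optimizer such as gradient descent or Adam would then adjust only the estimate network's parameters to reduce this loss, with the target network held fixed between periodic copies.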

The training data are vital for training the deep neural network. However, two challenges exist. First, the UAV communication system is time-varying, and the objective function is related to the UAV’s trajectory and AoI, so obtaining a sufficient amount of training data in the dynamic environment is crucial for optimizing the neural network. Second, empirical evidence has demonstrated that independent training data can enhance the stability and improve the convergence of the neural network, so obtaining independent training data is another challenge. To address these challenges, the experience replay and random sampling methods are adopted. For the fixed experience replay memory with the size , the training data will be updated every steps to replace the history data, which automatically refreshes the memory with new training data. To avoid dependency among the stored data used to train the neural network, the random sampling method forms the batch by choosing experiences from the replay memory randomly, which smooths the changes between the history data and the new observations. The proposed DQN-based trajectory design scheme for energy-efficiency maximization is shown in Algorithm 1.
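A fixed-size replay memory with uniform random batch sampling can be sketched with the Python standard library. The class and method names are hypothetical, and the sketch assumes the oldest experience is overwritten first once the memory is full.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity experience store; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a batch uniformly at random, without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


if __name__ == "__main__":
    memory = ReplayMemory(capacity=400)
    for step in range(500):
        memory.store(step, step % 4, 0.0, step + 1)
    if len(memory) >= 100:  # start training only once enough data is stored
        print(memory.sample(batch_size=8))
```

Sampling at random rather than taking the most recent transitions breaks the temporal correlation between consecutive states along a trajectory, which is the independence property discussed above.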

(1) Initialize learning parameters , memory size, batch size, the maximal episode , observed time , and  = 0;
(2) for each episode up to the maximal episode do
(3) Initialize the environment and state .
(4) for each time step within the observed time do
(5) If random() , select an action using (20). Otherwise, select a random action from the action set .
(6) Execute the action , compute EE, rest energy, and AoI, and obtain the next position to form the next state . According to (18), compute the reward. Store into the experience-replay memory.
(7) If  == 0, duplicate the estimate neural network to the target neural network.
(8) Train the neural network based on the loss function in (23) to optimize the parameter , .
(9) end for
(10) end for

4. Simulation Results

In this section, we consider three UAVs flying in the square area with to verify the proposed scheme. All ground devices are uniformly distributed in this area to transmit independent information to the UAVs during the observed time . For performance comparison, DQN without experience replay is adopted as the benchmark scheme. Unless otherwise stated, the simulation parameters are shown in Table 1.

In order to train the deep neural network, three fully connected hidden layers with 100 hidden nodes each are adopted. The size of the experience-replay memory is 400, and is randomly selected from the memory to construct the batch. The gradient descent method is used to optimize the parameters of the neural network. In the experience-replay memory, new training data always overwrite the oldest history data. To guarantee that enough data are available to form a batch, the training starts after 100 steps. Other parameters of the deep neural network are shown in Table 2. The simulations are run on an Intel i7 CPU with Python 3.7 and TensorFlow 1.14 to train the UAV’s deep neural network. All results are averaged over 500 episodes.
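To give a sense of the model size implied by this architecture, the following sketch counts the trainable parameters of a fully connected network. The input width of 5 (the five-dimensional state) and the output width of 4 (one Q-value per direction in the four-action example) are assumptions; only the three hidden layers of 100 nodes come from the text.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


if __name__ == "__main__":
    # Assumed layout: 5 state inputs, three hidden layers of 100 nodes, 4 action outputs.
    print(mlp_parameter_count([5, 100, 100, 100, 4]))  # 21204
```

Under these assumptions the network has roughly twenty thousand parameters, which is small enough to train on a CPU.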

Figure 3 shows the effect of the flight speed on the AoI performance. It is observed that the AoI value decreases as the flight speed increases, i.e., the information freshness improves with the UAV’s flight speed, because the frequency of information updates increases with the speed in the limited flying environment. Compared with the benchmark scheme, the AoI value decreases by 3.5% and 9.3% with  m/s and  m/s. However, since the propulsion power increases cubically w.r.t. the flight speed, it is unwise to rely only on increasing the UAV’s speed to maximize EE performance under the AoI constraint.

Figure 4 demonstrates the influence of the learning rate on the EE performance. We can find that the optimal EE performance of the proposed scheme can be achieved when the learning rate equals 0.4. Since the learning rate directly affects the convergence of the proposed scheme, the proposed scheme has a slow convergence rate to obtain the optimal EE performance. The proposed scheme can achieve 22.2 Kbit/J at least compared with 21.7 Kbit/J when the learning rate becomes large.

Figure 5 shows the stability of the proposed scheme with the training number. It is noted that the loss fluctuates when the number of training steps is small, and it converges to a small value as the number of training steps becomes large, which is a result of the increasing number of training batches. It is also found that the UAVs may have different convergence rates. There are two reasons as follows:
(1) UAVs have different initialized actions, which result in different estimate and target neural networks
(2) Each UAV has an independent channel power gain related to the distance between the UAV and the served ground users, which forms a different state space and reward value

5. Conclusion

In this paper, we considered a UAV communication system serving IoT ground devices, which send fresh information to the corresponding UAVs. Taking the rest energy and AoI into account, a DQN-based trajectory design was proposed to maximize the energy-efficiency performance. The state space related to the rest energy and AoI and the reward function related to EE performance were constructed. With experience replay and random batch sampling, the simulation results show that the proposed DQN scheme achieves better performance than the benchmark scheme. In future work, accounting for the hover energy consumption when serving massive IoT devices is an important direction, and it is also interesting to design collaborative UAV communication in which UAVs exchange information for neural network optimization.

Data Availability

The data used to support the findings of this study are included within this article.

Conflicts of Interest

The authors declare there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was supported in part by the National Key R&D Program of China (2019YFB1705100), Foundation of Southwest University of Science and Technology (18zx7142 and 19xn0086), the Sichuan Science and Technology Program (2019JDTD0019), and China Scholarship Council Program (202008515123).