Abstract

Task offloading in the space-air-ground integrated network (SAGIN) is envisioned as a challenging issue. In this paper, we investigate a space/aerial-assisted edge computing network architecture in which ground users decide whether or not to offload their tasks to edge servers mounted on unmanned aerial vehicles (UAVs) and satellites. We formulate an NP-hard, non-convex optimization problem that minimizes the computation cost, defined as the weighted sum of energy consumption and completion delay, subject to computation capacity and energy availability constraints. By casting the problem as a Markov decision process (MDP), we propose a multiagent deep reinforcement learning (MADRL)-based scheme to obtain the optimal task offloading policies under dynamic computation requests and stochastic, time-varying channel conditions, while ensuring the quality-of-service requirements. Finally, simulation results demonstrate that the task offloading scheme learned by the proposed algorithm substantially reduces the average cost compared with three benchmark schemes.

1. Introduction

The ongoing development of fifth generation (5G) and beyond-5G technology is envisioned to build an interconnected world open to everyone. The increasing number of ultradense heterogeneous Internet of Things (IoT) devices and the continuous growth of application requirements have raised higher demands for data transmission rate and network coverage [1, 2]. Compared with a fixed terrestrial network, the advantages of versatility, maneuverable deployment, and seamless coverage make the space-air-ground integrated network (SAGIN) an emerging hot research topic [3, 4]. Meanwhile, mobile edge computing (MEC) is a promising approach to improve the quality of service (QoS) and network performance [5].

Therefore, the MEC technology of the terrestrial network is introduced into SAGIN to provide efficient and flexible computing services by utilizing multilevel, heterogeneous computing resources at the edge of the network. In particular, when cellular base stations are damaged by natural disasters or in special scenarios (e.g., mountainous areas, polar regions, and oceans), unmanned aerial vehicles (UAVs) and a low earth orbit (LEO) satellite constellation can act as aerial relays or stations, and ground users (GUs) can offload computation tasks to them for fast processing [6]. In general, cooperative communication by multiple UAVs is a possible way to reduce the offloading delay and extend the UAVs' service lifetime [7]. However, employing a multi-UAV architecture introduces additional challenges in minimizing the offloading delay.

Recently, task offloading has been studied extensively; it is generally modeled as a mixed-integer programming problem [8] and solved with heuristic algorithms [9, 10] or convex relaxation [11, 12]. However, these optimization methods require a large number of iterations to reach a satisfactory local optimum, which makes them unsuitable for real-time offloading decisions when environmental conditions change rapidly and significantly [13, 14]. Meanwhile, deep reinforcement learning (DRL) has been widely used as an effective approach to optimize various problems, including the offloading policy, and can help overcome prohibitive computational requirements [15].

Research on task offloading in space/aerial-assisted edge computing is still at an early stage. In the SAGIN, various single-agent DRL-based task offloading schemes have been proposed to maximize the network utility or minimize the computation cost [16]. Considering the limited capacity of the MEC server and the channel conditions of UAVs, reference [17] proposed a computation offloading scheme based on the deep Q-learning network (DQN) to solve the dynamic scheduling problem, and reference [18] adopted a risk-aware reinforcement learning algorithm with an actor-critic architecture to minimize the weighted sum of delay and energy consumption. Furthermore, reference [19] proposed a joint resource allocation and task scheduling method based on a distributed reinforcement learning algorithm to achieve the optimal partial offloading policy. Reference [20] adopted a deep deterministic policy gradient (DDPG)-based computation offloading scheme to handle a high-dimensional state space and a continuous action space. Multiagent reinforcement learning (MARL) has been applied to different problems such as path planning [21], dynamic resource allocation [22], and channel access [23]. Compared with single-agent reinforcement learning methods, distributed multiagent systems can achieve better performance. However, task offloading that exploits the cooperation of the space, aerial, and ground layers in a multi-UAV, multiuser environment is still missing from the above studies. None of the above references take full advantage of a possible collaborative framework; they only use multiple parallel deep neural networks, with each agent of the system making decisions independently. In this paper, a MARL-based method is proposed to solve the cooperative task offloading problem in the space/aerial-assisted edge network. The multiple agents achieve the offloading optimization collaboratively in order to reduce the cost of computation tasks. In particular, the main contributions of this work are as follows:
(1) Different from traditional UAV-enabled MEC task offloading schemes, we design a space/aerial-assisted edge network for dynamic task offloading in a cooperative multi-UAV environment.
(2) We consider the problem of computation offloading under the SAGIN architecture with joint communication and computing (C2) service. We formulate the above-mentioned problem as a Markov decision process to minimize the computational cost. We assume each agent shares information with the other agents and makes a decision according to its current strategy and local real-time observations to select which component of the system executes the task.
(3) We propose a multiagent deep deterministic policy gradient (MADDPG)-based task offloading approach. Unlike DRL algorithms such as Q-learning and DQN, which restrict the agent actions to a low-dimensional finite discrete space, the agents in MADDPG can search for the best action in an independent continuous action space and maximize the long-term reward to reduce the computation cost by finding the optimal strategy. Furthermore, MADDPG can be executed in a decentralized manner once the networks have been trained centrally.

The remainder of this paper is organized as follows. In Section 2, the space/aerial-assisted edge network architecture and the task offloading models are introduced. Section 3 describes the problem formulation and the MADDPG-based solution. The simulation results and analysis are presented in Section 4. Finally, this work is concluded in Section 5.

2. System Model and Problem Formulation

2.1. Network Architecture

As shown in Figure 1, a remote region without cellular coverage is considered; therefore, we provide network access, edge computing, and caching through the aerial segment. We consider a space/aerial-assisted edge computing framework, which consists of ground users (GUs), UAVs, and a low earth orbit (LEO) satellite constellation. Figure 1 depicts multiple UAVs and multiple computational nodes (the satellite with a remote cloud server, and each UAV as an aerial MEC server) providing services to the GUs, and let be the set of GUs. The SAGIN components that tasks can be offloaded to are denoted by , where indexes and 0 denote the UAVs and the LEO satellite constellation, respectively. We consider a discrete time-slotted system with equal slot duration . Furthermore, we assume that the overall system has tasks, denoted by the set . The main parameters of this paper are listed in Table 1.
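For readability, one possible way to write the set notation used above is the following sketch; the symbol names are illustrative assumptions, since the original symbols are not reproduced here:

\mathcal{K}=\{1,\ldots,K\} \;\;\text{(GUs)},\qquad
\mathcal{M}=\{0,1,\ldots,M\} \;\;\text{(offloading targets, with $1,\ldots,M$ the UAVs and $0$ the LEO constellation)},\qquad
\mathcal{T}=\{1,\ldots,T\} \;\;\text{(time slots of duration $\tau$)}.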

The GU can either execute the computation task locally or offload it to an edge server in one of two ways (to a UAV or to the satellite). Each GU determines whether or not to offload its computing task to the edge server , and let denote the task offloading decision of task of GU . Specifically, means that the GU offloads the task to the edge server , and means that the GU processes its task locally. Constraint (1) captures the binary nature of the offloading decision:
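For illustration, constraint (1) would take a form such as the following, using an assumed decision variable $a_{k,m}(t)$ for GU $k$ and SAGIN component $m$:

a_{k,m}(t)\in\{0,1\},\qquad \sum_{m\in\mathcal{M}} a_{k,m}(t)\le 1,\qquad \forall k\in\mathcal{K},\; t\in\mathcal{T},

where $\sum_{m\in\mathcal{M}} a_{k,m}(t)=0$ corresponds to local execution.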

2.2. Computation Model

Without loss of generality, a tuple is adopted to model the computing tasks from the GU devices, where (in bits) represents the size of the computation task and (in CPU cycles per bit) indicates the computing workload, i.e., how many CPU cycles are required to process one bit of input data [17]. The delay and energy consumption of downloading the computing results from the edge server back to the GUs are ignored, because task uploading dominates the offloading policy in the considered scenario [9, 24]. In the following, we consider the computation overhead in terms of completion delay and energy consumption for edge computing and local execution.

2.2.1. Edge Computing Model

The computing capabilities (in CPU cycles per second, i.e., the clock frequency of the CPU chip) of the edge servers mounted on the UAVs and the satellite are denoted by and , respectively. Consequently, the computational delay of the SAGIN components can be calculated as the following equation:
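With assumed symbols ($d_k(t)$ the task size in bits, $c_k$ the workload in cycles per bit, and $f_m$ the CPU frequency of component $m$), this computational delay would typically be written as:

t^{\mathrm{edge}}_{k,m}(t)=\frac{d_k(t)\,c_k}{f_m},\qquad m\in\mathcal{M}.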

2.2.2. Local Computing Model

Owing to the limited computing capability of the GUs, we assume that the remaining tasks wait to be processed in a queue. The delay of local processing is the sum of the computation execution time and the queuing time. The local execution time of GU is given by
while the queuing time is calculated as
where denotes the unaccomplished computation task at the beginning of the time slot, is the maximum length of the computing queue, denotes the floor function, and is the computing capability of GU . The local execution energy consumption is given by
where denotes the effective switched capacitance of the chip architecture [25]. Clearly, the local CPU frequency can be adjusted to achieve the optimal trade-off between computation time and energy consumption by using dynamic voltage and frequency scaling (DVFS) [8].
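A simplified sketch of these three quantities, with assumed symbols ($f_k$ the GU's CPU frequency, $q_k(t)$ the unaccomplished workload in bits, $\kappa$ the effective switched capacitance) and ignoring the floor-function queue bound used in the original, is:

t^{\mathrm{exe}}_{k}(t)=\frac{d_k(t)\,c_k}{f_k},\qquad
t^{\mathrm{que}}_{k}(t)=\frac{q_k(t)\,c_k}{f_k},\qquad
e^{\mathrm{loc}}_{k}(t)=\kappa\, f_k^{2}\, d_k(t)\, c_k.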

2.3. Communication Model

Since the UAVs and the satellite use different frequency bands to communicate, we suppose that there is no interference between the UAVs and the satellite in this work [26]. Meanwhile, we neglect the propagation delay from the GU devices to the UAVs because we assume that the UAVs are sufficiently close to the GU devices [27]. The aerial-to-ground communication channel depends on the altitude, the angle of elevation, and the type of propagation environment [28]. Based on reference [16], the average path loss of the aerial-to-ground channel can be defined as
where represents the line-of-sight (LoS) connection probability between GU and UAV , and , , , denote the UAV flying altitude, the horizontal distance between the UAV and the GU, and the additive losses incurred on top of the free space path loss for line-of-sight and non-line-of-sight links [29], respectively. We set the altitude of the UAV to 10 m. denotes the carrier frequency, and denotes the velocity of light. According to [19], the values of are (0.1, 2.1) in remote areas. Adopting the Weibull-based channel model [30], we generate the channel gain when , which can be given by
where and are antenna gains, denotes the rain attenuation, and denotes the distance between the GU and the satellite. Consequently, the data rate, denoted by , is calculated by
where indicates the transmission power, denotes the channel bandwidth of the aerial-ground link and the ground-satellite link, indexes and 0 denote the UAV swarm and the LEO satellite constellation, respectively, and and represent the noise power. In line with the above, we can define the transmission delay for task offloading over aerial-assisted computing as
where denotes the propagation delay between the LEO satellite and the GUs, which cannot be ignored, while the communication-related energy consumption can be defined as
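As an illustrative sketch (symbols assumed: $B_m$ the bandwidth, $p_k$ the transmit power, $g_{k,m}(t)$ the channel gain, $\sigma_m^{2}$ the noise power, and $t^{\mathrm{prop}}$ the GU-satellite propagation delay), the data rate, transmission delay, and transmission energy would typically take the form:

r_{k,m}(t)=B_m\log_2\!\Bigl(1+\frac{p_k\,g_{k,m}(t)}{\sigma_m^{2}}\Bigr),\qquad
t^{\mathrm{tr}}_{k,m}(t)=\frac{d_k(t)}{r_{k,m}(t)}+\mathbb{1}\{m=0\}\,t^{\mathrm{prop}},\qquad
e^{\mathrm{tr}}_{k,m}(t)=p_k\,\frac{d_k(t)}{r_{k,m}(t)}.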

2.4. Problem Formulation

In line with the computation model and the communication model, the computation cost can be defined as the weighted sum of the completion delay and the energy consumption for completing all tasks of GU . Generally, the completion delay is the sum of the local execution time, the queuing time, the computational delay of all SAGIN components, and the transmission delay, which is defined as
while the energy consumption is given by

Consequently, the computational cost of GU in the space/aerial-assisted network can be calculated as
where and denote the weights for the energy consumption and the completion delay, respectively, which represent the trade-off between delay and energy consumption. We can adjust the weights to meet different user demands with this form of computational cost. Notably, separate weights can further be used for the local execution model and the edge computing model to increase the diversity among these cases. We formulate the optimization problem in our scenario to minimize the computational cost. Therefore, the optimization problem is given by
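A sketch of the weighted cost and the resulting problem, with assumed weights $\omega_e$ and $\omega_t$ and assumed per-GU energy and delay terms $E_k(t)$ and $T_k(t)$, is:

\mathrm{cost}_k(t)=\omega_e\,E_k(t)+\omega_t\,T_k(t),
\qquad
\min_{\{a_{k,m}(t)\}}\;\sum_{t\in\mathcal{T}}\sum_{k\in\mathcal{K}}\mathrm{cost}_k(t)
\quad\text{s.t.}\quad a_{k,m}(t)\in\{0,1\},\;\sum_{m\in\mathcal{M}}a_{k,m}(t)\le 1,\;\text{capacity and energy constraints}.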

3. MARL for Task Offloading

3.1. MARL Framework

Since the above optimization problem in (14) is non-convex and NP-hard, we adopt a multiagent reinforcement learning approach to obtain a feasible solution. In this section, we model the formulated optimization problem as a Markov decision process (MDP) [31], where the purpose of action selection is to maximize the reward function. In the space/aerial-assisted network environment, each GU acts as an agent, chooses an action, and then receives a reward at time slot . The state space, action space, and reward function are described as follows.

3.1.1. State Space

The state consists of the channel vectors, the task size randomly generated in time slot , the unaccomplished task queue, and the remaining energy. These quantities change over time because of the individual and joint actions in the system, so we define the state in our scenario as
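A compact way to write this state (symbols assumed: channel vector $\mathbf{g}_k(t)$, task size $d_k(t)$, queue backlog $q_k(t)$, and residual energy $e^{\mathrm{res}}_k(t)$) is:

s_k(t)=\bigl\{\mathbf{g}_k(t),\;d_k(t),\;q_k(t),\;e^{\mathrm{res}}_k(t)\bigr\}.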

3.1.2. Action Space

Based on the current state and the other agents' experience, each agent selects an action to schedule its computation tasks. Formally, we define the vector as the binary offloading decision, where indicates whether GU offloads its task to the MEC server or not. The constraints of problem (14) restrict the offloading decision to be binary, and each computation task is offloaded to at most one node at time slot t.

3.1.3. Reward Function

In line with reinforcement learning, each agent selects its own action under decentralized execution to maximize the global reward. The agent's choice is based on the reward function, which specifies the goal of the algorithm. Since the objective is the long-term weighted sum of the delay and energy consumption of all tasks, we define the reward function that minimizes the computation cost as follows:
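One natural choice consistent with this objective (and with the cost sketched in Section 2.4) is the negative computation cost; this is an assumed form for illustration, not a transcription of the original equation:

r_k(t)=-\,\mathrm{cost}_k(t)=-\bigl(\omega_e\,E_k(t)+\omega_t\,T_k(t)\bigr).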

Let denote the stationary policy, and a value function is defined to determine the value of the reward, which is given by
where denotes the discounting factor and the value refers to the cumulative utility. The overall reward of all agents at time slot can be calculated as

The main objective is to minimize the computation cost in the space/aerial-assisted network. We denote the group of GUs' optimal strategies as . We maximize the long-term reward as
where can be expressed as
where and denote the set of all possible strategies taken by GU and the other agents' strategies, respectively. Therefore, the MADRL algorithm can obtain the optimal policy upon convergence.
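For illustration, with an assumed discount factor $\gamma$ and per-agent policy $\pi_k$, the value function, the overall reward, and the optimal strategy would take a form such as:

V^{\pi_k}(s)=\mathbb{E}\Bigl[\sum_{t=0}^{\infty}\gamma^{t}\,r_k(t)\,\Big|\,s(0)=s\Bigr],\qquad
R(t)=\sum_{k\in\mathcal{K}}r_k(t),\qquad
\pi_k^{*}=\arg\max_{\pi_k\in\Pi_k}V^{(\pi_k,\;\pi_{-k})}(s).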

3.2. MADDPG-Based Task Offloading Scheme

In this section, the MADDPG [32]-based task offloading scheme is proposed to derive a near-optimal decision by optimizing the continuous variable . MADDPG not only retains the main advantage of DDPG, i.e., the ability to handle continuous action spaces, but also, by extending DDPG to the multiagent domain, overcomes the shortcomings of Q-learning and policy gradient algorithms, which are not well suited to multiagent environments.

As shown in Figure 2, the MADDPG framework follows centralized training with distributed execution. Each GU acts as agent with an actor and a critic: the actor chooses an appropriate action according to the local observation , and the critic evaluates the action based on the global observation . During the training procedure, each critic collects the policies of the other agents, denoted as . In our MADDPG architecture, the actor is trained to generate a deterministic policy, the critic is trained to evaluate the actor, and an experience replay buffer is used to break the correlation between consecutive samples; it stores the experience tuples , from which random minibatches of size are sampled.
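To make the per-agent architecture concrete, the following is a minimal sketch in TensorFlow 2 (the framework used in Section 4), reusing the hidden-layer sizes reported there; the number of agents, action dimension, and activation choices are illustrative assumptions, not the authors' exact implementation:

import tensorflow as tf

def mlp(output_dim, output_activation=None):
    # Two hidden layers of 256 and 128 ReLU units with L2 regularization,
    # matching the settings reported in Section 4.1.
    reg = tf.keras.regularizers.l2(1e-3)
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(output_dim, activation=output_activation),
    ])

N_AGENTS, ACT_DIM = 4, 3   # illustrative sizes only

# One actor (local observation -> action) and one centralized critic
# (all observations + all actions -> Q value) per ground user, plus target copies.
actors         = [mlp(ACT_DIM, "sigmoid") for _ in range(N_AGENTS)]
critics        = [mlp(1) for _ in range(N_AGENTS)]
target_actors  = [mlp(ACT_DIM, "sigmoid") for _ in range(N_AGENTS)]
target_critics = [mlp(1) for _ in range(N_AGENTS)]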

In the MADDPG algorithm, each agent selects an optimal action as the output of the actor network . The actor network is updated in the form of a policy gradient, which is given by
and the critic network is updated according to the following expression:
where is the prediction of the next action produced by the target actor network. After the gradient update of agent , the parameters of both the target actor and target critic networks are soft updated as follows:
where is the forgetting factor.
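For reference, in the standard MADDPG formulation of [32] these updates read as follows (the notation here is ours and is stated as an assumption about the omitted equations):

\nabla_{\theta_k}J(\mu_k)=\mathbb{E}\Bigl[\nabla_{\theta_k}\mu_k(o_k)\,\nabla_{a_k}Q_k^{\mu}(x,a_1,\ldots,a_K)\big|_{a_k=\mu_k(o_k)}\Bigr],\qquad
L(\phi_k)=\mathbb{E}\Bigl[\bigl(y-Q_k^{\mu}(x,a_1,\ldots,a_K)\bigr)^{2}\Bigr],

y=r_k+\gamma\,Q_k^{\mu'}(x',a_1',\ldots,a_K')\big|_{a_j'=\mu_j'(o_j')},\qquad
\theta_k'\leftarrow\tau\,\theta_k+(1-\tau)\,\theta_k'.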

The details of the MADDPG-based task offloading scheme are shown in Algorithm 1. First, we initialize four DNNs for each GU, i.e., the critic network, the actor network, and their two target networks (lines 2-3). At the beginning of each episode, each GU obtains its observation state (line 7). Without loss of generality, we divide each episode into time slots. In each time slot, each agent first selects an action according to the current policy, with noise added for exploration (lines 8-9). Afterward, all agents execute their actions, and each agent receives the corresponding reward and the next state (lines 10-11). Then, the experience tuple generated by this iteration is stored in the replay buffer for parameter updates (line 12). Finally, given a sampled minibatch of transitions from the replay buffer, each agent updates the parameters of the critic network by minimizing the loss, updates the parameters of the actor network by gradient ascent, and updates the parameters of the target networks using (23) (lines 15-17).

(1)Initialization:
(2)Randomly initialize critic network and actor with weights and
(3)Initialize target network and with weights and
(4)Empty replay buffer
(5)for episode do
(6)   Initialize a Gaussian noise with mean = 0;
(7)   Receive initial observation state ;
(8)   for time slot do
(9)    Select action according to the current policy and exploration noise
(10)    Execute action and observe the reward , and the next state
(11)    Collect the global state , and the action ;
(12)    Store transition in ;
(13)    Sample a random mini-batch of transitions from ;
(14)    Set ;
(15)    Update the critic network by minimizing the loss
    
(16)    Update the actor policy by using the sampled policy gradient
    ;
(17)    Update the target networks for each agent :
     and ;
(18)   end
(19)end
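The following TensorFlow 2 sketch illustrates lines 13-17 of Algorithm 1 for a single agent. It assumes the actor/critic definitions sketched above, per-agent observation and action tensors, and assumed values for the discount and soft-update factors; it is an illustrative implementation of the standard MADDPG update, not the authors' exact code:

import tensorflow as tf

GAMMA, TAU = 0.95, 0.01  # assumed discount factor and soft-update (forgetting) factor

def soft_update(target, source, tau=TAU):
    # theta' <- tau * theta + (1 - tau) * theta'   (Algorithm 1, line 17)
    for t_var, s_var in zip(target.variables, source.variables):
        t_var.assign(tau * s_var + (1.0 - tau) * t_var)

def update_agent(i, obs, acts, rews, next_obs,
                 actors, critics, target_actors, target_critics,
                 actor_opt, critic_opt):
    """One MADDPG update for agent i on a sampled minibatch.

    obs, acts, next_obs: lists with one [batch, dim] tensor per agent;
    rews: list of [batch, 1] reward tensors.
    """
    # Target joint action a' = (mu'_1(o'_1), ..., mu'_N(o'_N)) and TD target y (line 14)
    next_acts = [mu(o2) for mu, o2 in zip(target_actors, next_obs)]
    y = rews[i] + GAMMA * target_critics[i](tf.concat(next_obs + next_acts, axis=1))

    # Critic update: minimize the TD loss (line 15)
    with tf.GradientTape() as tape:
        q = critics[i](tf.concat(obs + acts, axis=1))
        critic_loss = tf.reduce_mean(tf.square(tf.stop_gradient(y) - q))
    c_grads = tape.gradient(critic_loss, critics[i].trainable_variables)
    critic_opt.apply_gradients(zip(c_grads, critics[i].trainable_variables))

    # Actor update: ascend the sampled policy gradient (line 16)
    with tf.GradientTape() as tape:
        new_acts = list(acts)
        new_acts[i] = actors[i](obs[i])
        actor_loss = -tf.reduce_mean(critics[i](tf.concat(obs + new_acts, axis=1)))
    a_grads = tape.gradient(actor_loss, actors[i].trainable_variables)
    actor_opt.apply_gradients(zip(a_grads, actors[i].trainable_variables))

    # Soft update of the target networks (line 17)
    soft_update(target_actors[i], actors[i])
    soft_update(target_critics[i], critics[i])
    return critic_loss, actor_loss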
3.2.1. Analysis of Complexity

The deep neural networks of the actor-critic framework can be represented as matrix multiplications. Let and denote the dimension of the output and the number of hidden layers, respectively. The computational complexity of the interactions among the agents can then be derived, and the complexity of each actor can be expressed as . In our proposed algorithm, the training complexity is affected by the per-agent cost , the number of training episodes , and the batch size . Therefore, the computational cost of the critic networks and of the training procedure can be estimated as and , respectively.
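As a rough sketch of what such expressions look like (symbols assumed: $n_l$ the width of layer $l$, $L$ the number of layers, $K$ agents, $E$ episodes, $T$ slots per episode, and $S$ the minibatch size), one could write:

C_{\mathrm{actor}}=\mathcal{O}\Bigl(\sum_{l=1}^{L}n_{l-1}\,n_{l}\Bigr),\qquad
C_{\mathrm{train}}=\mathcal{O}\bigl(E\,T\,K\,S\,(C_{\mathrm{actor}}+C_{\mathrm{critic}})\bigr).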

3.2.2. Analysis of Convergence

In the proposed algorithm, the gradient method is adopted to approximate the optimum by updating the weights of the target networks. The parameters and converge to particular values after a sufficient number of iterations. Therefore, the convergence of our proposed algorithm can be guaranteed. Furthermore, the convergence can be observed through the simulations.

4. Simulation Results

4.1. Simulation Settings

In this section, simulations are carried out to verify the proposed model and algorithm. Specifically, we begin by elaborating on the simulation settings. Afterwards, we present an evaluation of the experimental results. The simulation environment is implemented in Python 3.6 with TensorFlow 2.0 on a personal computer with an AMD R7-4800H CPU. The ReLU function is used as the activation function after each fully connected layer, and L2 regularization is used to reduce DNN overfitting. The numbers of neurons in the two hidden layers are 256 and 128, and the number of episodes and the learning rate are set to 2000 and 0.001, respectively. Other important constant parameters are listed in Table 2.
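The hyperparameters stated above can be summarized as a configuration sketch; entries not given in the paper are marked as assumptions:

# Hyperparameters from Section 4.1; entries marked "assumed" are not specified in the paper.
CONFIG = {
    "hidden_units": (256, 128),   # two fully connected layers
    "activation": "relu",
    "l2_regularization": True,
    "episodes": 2000,
    "learning_rate": 1e-3,
    "optimizer": "adam",          # assumed
}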

4.2. Convergence Analysis

To evaluate the performance of our proposed scheme, we compare the convergence of three algorithms and the mean computation cost of the system. We adopt two benchmark schemes: DDPG [31] and DQN [18]. We first conducted a series of experiments to determine the optimal values of the hyperparameters used in the algorithm; the selection is based on the performance of the algorithm under different learning rates and discount factors. The convergence performance of the algorithm under different learning rates is shown in Figure 3. We can observe that the convergence speed decreases when the learning rate is too small, while if the learning rate is too large, the algorithm generally fails to converge.

As shown in Figure 4, we show the convergence of our proposed algorithm in terms of the average reward over episodes, where the weight index of time delay is and the weight index of energy consumption is , and the results are averaged over ten numerical simulations, demonstrating the effectiveness of the neural networks. The average reward values of MADDPG, DDPG, and DQN increase continuously until convergence. The MADDPG algorithm becomes stable earlier than the DDPG and DQN algorithms, while the other two schemes become stable only after more than 400 training episodes. Moreover, the average reward at final convergence of the MADDPG algorithm is higher than that of the other two algorithms. Based on the simulation results, we conclude that the MADDPG algorithm outperforms the benchmark schemes in minimizing the long-term cost, since it reduces resource wastage by learning a cooperative policy and maximizing the global reward.

4.3. Average Cost

In this section, we discuss the average cost with respect to the number of GU devices and the size of the offloading tasks. Figure 5 demonstrates the performance of the proposed scheme and the three baselines in the space/aerial-assisted network in minimizing the average computation cost. We provide one additional baseline, a greedy algorithm [33], in which each GU first makes its best effort to offload computation tasks and then processes the remaining tasks locally. From Figure 5, we can observe that the average computation cost increases as the number of GU devices increases. The DQN, DDPG, and MADDPG algorithms all use deep reinforcement learning to automatically generate offloading strategies. According to the simulation results, the MADDPG algorithm reduces the computation cost by 29.066% compared to the DDPG algorithm, 36.392% compared to the DQN scheme, and 51.602% compared to the greedy scheme. Therefore, the proposed MADDPG scheme keeps the computation cost lower than the benchmark schemes, proving the validity of the cooperative policy.

Figure 6 demonstrates the average cost under different sizes of offloading tasks. As expected, the average cost increases as the offloading data size increases, because each agent needs to offload a larger amount of data, which increases the computational cost of the system. The proposed MADDPG scheme obtains the best reward and a lower computation cost than the benchmark schemes. The strategies generated by the DQN and DDPG algorithms perform only moderately across the various scenarios, mainly because the training of these two algorithms is ill-suited to a multiagent environment. In contrast, MADDPG can effectively learn stable strategies through centralized training and decentralized execution. Therefore, we conclude that the MADDPG algorithm outperforms the three comparison algorithms across different scenarios.

5. Conclusion

In this paper, an efficient task offloading scheme is proposed for the space/aerial-assisted edge computing system in SAGIN. Firstly, we elaborated the SAGIN architecture. Then, we expressed task offloading as a nonlinear optimization problem with the goal of minimizing the weighted sum of energy consumption and delay. On this basis, we proposed an algorithm based on MADDPG to solve this problem. Finally, the simulation results show that the computation cost can be significantly reduced by offloading tasks to the edge servers on the UAVs or the satellite, and the convergence and effectiveness of the proposed scheme in the simplified scenario are also demonstrated by comparison with three benchmark schemes.

In the future, we will further consider the mobility management of satellites and UAVs. In addition, the task offloading scheme of SAGIN in areas with rich computing resources is also worth further study.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61761014 and by the Guilin University of Electronic Technology Ministry of Education Key Laboratory of Cognitive Radio and Information Processing (CRKL190109).