Abstract

In the Internet of Vehicles (IoV), the limited computing capacity of vehicles can hardly process computation-intensive tasks locally. Such tasks can be offloaded to multi-access edge computing (MEC) servers for processing, where MEC provides the required computing capacity to nearby vehicles. In this paper, we consider a scenario with both cooperation and competition between vehicles, in which the offloading decision of any vehicle affects the decisions of the others and the computing resource allocation strategy of the MEC server changes dynamically. Therefore, we propose a joint optimization scheme for computation offloading decisions and computing resource allocation based on decentralized multiagent deep reinforcement learning. The proposed scheme learns the optimal actions to minimize the total weighted cost, which is designed as the vehicles' satisfaction based on the type of stochastically arriving tasks and the dynamic interaction between the MEC server and the vehicles within different RSU coverage areas. The numerical results show that the proposed algorithm, based on decentralized multiagent deep deterministic policy gradient (DDPG) and named De-DDPG, can autonomously learn the optimal computation offloading and resource allocation policy without a priori knowledge and outperforms three baseline algorithms in terms of reward.

1. Introduction

With the development of wireless communication technology and the rapid growth in the number of vehicles, the Internet of Vehicles (IoV) has become one of the most important applications of the Internet of Things (IoT) [1, 2]. However, due to the limited computing resources of vehicles, many tasks cannot be executed locally within the required delay [3]. To solve this problem, offloading IoV tasks to a mobile edge computing (MEC) server has been proposed as a feasible solution [4]. MEC is deployed in close proximity to the mobile vehicles and supplies sufficient computation resources for the offloaded tasks [5, 6].

In recent years, MEC computation offloading in IoV has been widely studied [7, 8]. Some researchers have developed optimization schemes for computation offloading under certain constraints, such as reducing the delay, computation resource overhead, and energy consumption [9–11]. Moreover, computation congestion, which affects the performance of the MEC server, and the load balance of computation resources among MEC servers have been considered in the computation offloading problem [12]. Although the MEC server provides vehicles with resources far beyond their own, its resources may also become insufficient when massive numbers of vehicles access the MEC server simultaneously. Therefore, rational resource allocation to optimize various performance objectives is also a significant issue in edge computing offloading [13–15]. Because of the high-speed mobility of vehicles and the randomness of tasks, as well as the cooperation and competition between vehicles in IoV, the computation resource allocation policy of the MEC server under different offloading decisions of vehicles has been discussed by many researchers. By jointly optimizing resource allocation and the offloading strategy in IoV, the overall cost of computation resources, energy, and delay is minimized in [16–18]. However, these methods require a large number of iterations to obtain a satisfactory local optimum, which is not suitable for application scenarios where the environment changes rapidly and decisions need to be made in real time. Moreover, this type of optimization problem is usually nonconvex and NP-hard.

Deep reinforcement learning (DRL), the combination of deep learning (DL) and reinforcement learning (RL), can tackle such nonconvex optimization problems and has been widely used as an effective approach to different issues, including offloading decision-making and resource allocation [14, 19–23]. Previous works have made many efforts to optimize task offloading in IoV. For example, a deep Q-network (DQN) is adopted in a multivehicle offloading system to obtain optimized offloading decisions that maximize the QoS of a digital twin-empowered IoV system [23]. A similar work proposes a multiagent DQN-based computation offloading scheme that accounts for environment uncertainty so that the vehicles can make offloading decisions achieving an optimal long-term reward [24]. A dynamic task offloading scheme based on Q-learning is implemented to minimize the delay, energy consumption, and total overhead in an IoV system [25]. A URLLC-aware task offloading algorithm based on deep Q-learning is studied in [26] to maximize the throughput of vehicles under the given constraints. Jointly considering task priority, vehicles' service availability, and computation resource sharing incentives, an offloading policy based on soft actor-critic (SAC) maximizes both the expected reward and the policy entropy of the offloading tasks in a dynamic vehicular environment [27]. Moreover, DQN-based joint computation offloading and task migration optimization is applied to minimize the total system cost in a 5G vehicle-aware MEC network [28]. A two-stage scheme is designed for joint optimization, where DQN is used in the first step to obtain the offloading strategy and deep deterministic policy gradient (DDPG) is used to generate the transmit power of the vehicles [29]. None of the above studies considers the joint optimization of the offloading strategy and computation resource allocation when multiple agents interact in a dynamic IoV environment.

Different from the existing works, we propose a decentralized multiagent deep reinforcement learning-based method to solve the joint optimization of computation offloading decisions and resource allocation for the MEC server in IoV. The objective of our work is to minimize the weighted cost of the multiagent system. In summary, our main contributions are as follows:
(1) We propose an MEC-supported IoV scenario for dynamic task offloading decisions and computation resource allocation in an environment where multiple RSUs cover multiple vehicles. In this cooperative scenario, because of the mobility of the vehicles and the stochastic arrival of tasks, the computation offloading decisions and the resources allocated to the RSUs and vehicles change in different time slots.
(2) Based on the proposed model, we jointly consider offloading decision-making and computation resource allocation to minimize the weighted cost, which is related to the end-to-end delay and the computation resource cost. Moreover, we formulate the problem as a Markov decision process (MDP) and design the state, action, and reward functions.
(3) In order to effectively solve the above problem with continuous variables and meet the requirement of convergence, a joint optimization scheme based on decentralized multiagent DDPG (De-DDPG) is proposed. The simulation results verify the convergence of the proposed algorithm and show that it outperforms three baseline algorithms.

The remainder of this paper is organized as follows: In Section 2, an MEC framework with multiple RSUs and vehicles is introduced, and we construct the network model, communication model, and computation model. Section 3 describes the problem statement of the joint optimization. The solution based on decentralized multiagent DDPG (De-DDPG) is proposed in Section 4. In Section 5, the simulation results and analysis are presented. Finally, we conclude this paper in Section 6.

2. System Model

2.1. Network Model

A three-layer Internet of Vehicles (IoV) architecture is considered in this paper (see Figure 1), which consists of an MEC server, roadside units (RSUs), and vehicles on a multilane road of length L.

The MEC server is connected to the RSUs via fiber-optic links for receiving and transmitting the computation tasks. We assume that the total computing resource of the MEC server is denoted as . The RSUs, denoted by , are located along the road with the same coverage range . Therefore, we divide the road into segments, and all vehicles are randomly and independently distributed in the segments with arrival rate . The RSU is responsible for forwarding messages between the MEC server and the vehicles. A set of vehicles periodically sends messages to the RSU within its communication range, which is denoted as . Vehicles have the same local computing capacity, which is determined by the onboard unit (OBU) [30]. Each vehicle-i sends not only task messages but also its driving characteristics , where and represent its 1-D position and speed, respectively. Here, we assume that the distances between vehicles follow an exponential distribution and the speeds of the vehicles follow a truncated Gaussian distribution, which better matches the actual situation on the road [31, 32]. In addition, we assume that each vehicle only processes one computation task within the current time period. The computation task of each vehicle is denoted as , where is the required computation capacity to complete the task, and are the data sizes of the input and output for computing, respectively, and is the maximum tolerable delay for task completion. Each vehicle needs to execute a computation task within the tolerable time period, and the task can be either processed locally or offloaded to the MEC server. We define the binary offloading strategy of the vehicles as , where and mean that vehicle-i executes the computation task locally or offloads it to the MEC server, respectively.
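To make the model concrete, the following minimal Python sketch captures the per-vehicle quantities described above; all field names are illustrative assumptions, since the paper's symbols are omitted in the extracted text.

```python
from dataclasses import dataclass

@dataclass
class Task:
    cpu_cycles: float      # required computation capacity to complete the task
    data_in: float         # input data size to be uploaded (Mbits)
    data_out: float        # output data size to be downloaded (Mbits)
    max_delay: float       # maximum tolerable delay for task completion

@dataclass
class Vehicle:
    position: float        # 1-D position along the road of length L
    speed: float           # speed, drawn from a truncated Gaussian distribution
    local_capacity: float  # OBU computing capacity (CPU cycle frequency)
    task: Task             # one task per vehicle in the current time period
    offload: int = 0       # 0: process locally, 1: offload to the MEC server
```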

Moreover, when the vehicle leaves the coverage of the RSU, the vehicle is disconnected from the RSU and can no longer transmit data to the MEC server through it. The time available to the vehicle before leaving the communication range of RSU-j, i.e., the sojourn time, can be given as follows, where represents the vehicle's equivalent speed.
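Since the original expression is not reproduced in the extracted text, the following is only a plausible reconstruction: assuming each RSU covers a segment of length $d$, so that RSU-$j$ spans $[(j-1)d,\, jd]$, a vehicle at position $x_i$ moving with equivalent speed $\bar{v}_i$ has the sojourn time

$$ T_{i,j}^{\mathrm{soj}} = \frac{j\,d - x_i}{\bar{v}_i}. $$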

2.2. Communication Model

When a vehicle decides to offload its task to the MEC server, it transmits the data to the MEC server through the RSUs. Generally, the propagation time of the fiber-optic transmission between the RSUs and the MEC server can be ignored [12]. In this work, we consider V2I communication between the vehicle and the RSU based on IEEE 802.11p [33]. According to [34], the uplink and downlink transmission rates of the wireless communication between vehicle-i and its associated RSU-j are expressed as follows, where is the number of vehicles that decide to offload tasks to the MEC server via RSU-j, represents the probability that vehicle-i connects to RSU-j in a random time slot, and is the duration of a time slot. RTS stands for the request-to-send interval, AIFS denotes the arbitration interframe spacing interval, and expresses the propagation delay. is defined as the successful transmission period between vehicle-i and RSU-j, which is written as follows, where is specific to the MAC protocol and equals . represents the packet header overhead. SIFS, , and CTS stand for the short interframe space interval, the acknowledgment interval, and the CTS interval, respectively. denotes the bandwidth of RSU-j, is the transmission power of vehicle-i, and stands for the channel gain between vehicle-i and RSU-j.

The uplink/downlink transmission time in this situation is calculated as

The two-way transmission time between the vehicle and the RSU is then given by
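The corresponding equations are omitted in the extracted text; under the usual model, with uplink/downlink rates $R_{i,j}^{\mathrm{up}}$ and $R_{i,j}^{\mathrm{down}}$ and input/output sizes $D_i^{\mathrm{in}}$ and $D_i^{\mathrm{out}}$ (symbols assumed for illustration), the transmission times would read

$$ T_{i,j}^{\mathrm{up}} = \frac{D_i^{\mathrm{in}}}{R_{i,j}^{\mathrm{up}}}, \qquad T_{i,j}^{\mathrm{down}} = \frac{D_i^{\mathrm{out}}}{R_{i,j}^{\mathrm{down}}}, \qquad T_{i,j}^{\mathrm{trans}} = T_{i,j}^{\mathrm{up}} + T_{i,j}^{\mathrm{down}}. $$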

2.3. Computation Model

The processing time is considered under two situations: the task is processed locally, and the task is offloaded to the MEC server for computing.

2.3.1. Local Processing Model

When vehicle-i processes its computation task locally (), the processing time depends only on its own computing capacity. The local execution time is formulated as follows, where denotes the vehicle's computation capacity, which is related to the vehicle's CPU cycle frequency.
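As an illustrative form (the symbols are assumptions), with $c_i$ the required CPU cycles of the task and $f_i^{\mathrm{loc}}$ the vehicle's CPU cycle frequency, the local execution time is typically

$$ T_i^{\mathrm{loc}} = \frac{c_i}{f_i^{\mathrm{loc}}}. $$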

2.3.2. MEC Processing Model

When the task is offloaded to the MEC server (), the end-to-end delay of vehicle-i includes the task execution time and the transmission time. The execution time of vehicle-i's task offloaded to the MEC server is given as follows, where denotes the computation capacity assigned by the MEC server to vehicle-i via its connected RSU-j, and denotes the CPU cycle frequency allocated to RSU-j by the MEC server. The end-to-end delay between vehicle-i and the MEC server is obtained by
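Analogously, with $f_{i,j}^{\mathrm{mec}}$ the MEC computation capacity assigned to vehicle-i via RSU-j (symbol assumed for illustration), the MEC execution time and the end-to-end delay would take the form

$$ T_i^{\mathrm{mec}} = \frac{c_i}{f_{i,j}^{\mathrm{mec}}}, \qquad T_i^{\mathrm{e2e}} = T_{i,j}^{\mathrm{trans}} + T_i^{\mathrm{mec}}. $$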

The main notations and descriptions are described in Table 1.

3. Problem Statement

In this section, the optimization problem is formulated by jointly considering the offloading decision and resource allocation, with the aim of load balancing and system cost minimization. First, we define the cost function as follows.

The cost function is designed to quantify the satisfaction level of the vehicle's offloading decision; it is inversely related to the satisfaction and is determined by the delay sensitivity and the cost of computation resources. The logarithmic function is known to achieve proportional fairness in many studies [35], which helps achieve load balance, so a logarithmic function is used to represent the cost function in this paper. The processing delay of a task is generally considered to be inversely proportional to satisfaction; that is, the shorter the task processing delay, the higher the satisfaction. In addition, if the task is completed within the maximum tolerable delay, the satisfaction of the vehicle should be non-negative. However, once the completion time of the task exceeds its maximum tolerable delay, the processing result of the task loses its value, because tasks in IoV are extremely delay-sensitive. Here, a penalty mechanism is introduced. Another metric in the cost function is the computation resource cost. The vehicle must pay for its own computation resources when it processes the task locally. Furthermore, when the task is offloaded to the MEC server, the vehicle pays the corresponding cost for the computation resources allocated by the MEC server, which also reduces its satisfaction. Therefore, the cost function for vehicle-i to process the task locally is given as follows, where and represent the weights of the delay and the computation resource cost, respectively. The weighted function provides a flexible scheme for different applications' specific requirements by adjusting the weight parameters. ensures that is non-negative. is the unit cost of the computing resource, and represents the penalty for a task that is not completed within its maximum tolerable delay.
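The paper's exact expression is not reproduced in this text; one form consistent with the verbal description (logarithmic satisfaction term, weighted resource cost, and a penalty when the deadline is missed; all symbols are illustrative) is

$$
C_i^{\mathrm{loc}} =
\begin{cases}
-\lambda_t \log\!\big(1 + T_i^{\max} - T_i^{\mathrm{loc}}\big) + \lambda_c\, p\, f_i^{\mathrm{loc}}, & T_i^{\mathrm{loc}} \le T_i^{\max},\\
\varphi + \lambda_c\, p\, f_i^{\mathrm{loc}}, & T_i^{\mathrm{loc}} > T_i^{\max},
\end{cases}
$$

where $\lambda_t$ and $\lambda_c$ weight the delay and resource terms, $p$ is the unit resource cost, the constant 1 inside the logarithm keeps the satisfaction non-negative within the deadline, and $\varphi$ is the penalty.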

Similarly, the cost function of vehicle-i when the task is offloaded to the MEC server for processing can be expressed as

Because the vehicle is disconnected from the MEC server once it leaves the coverage of the RSU, regardless of whether the task has been processed, is used to depict the task's actual tolerable delay.

Combining equations (9) and (10), the cost function of vehicle-i can be expressed as

This work aims to minimize the system cost by jointly determining the offloading decisions of vehicles and the computation resource allocation of the MEC server. The optimization problem is formulated as

Constraint C1 ensures that the available local computation resource is non-negative and does not exceed that of the MEC server. C2 constrains the available computation resource assigned by the MEC server to each vehicle-i within the coverage of RSU-j. Constraint C3 ensures that the sum of the computation resources allocated to all offloading tasks through RSU-j does not exceed the total computation resource of the MEC server. C4 imposes the binary offloading decision constraint on each vehicle's task.
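As a sketch of how the formulation typically reads (symbols assumed; the constraint bounds follow the verbal descriptions of C1–C4 above), the joint problem can be written as

$$
\begin{aligned}
\min_{\{a_i\},\,\{f_{i,j}^{\mathrm{mec}}\}} \quad & \sum_{i} C_i \\
\text{s.t.}\quad
& \mathrm{C1:}\ 0 \le f_i^{\mathrm{loc}} \le F^{\mathrm{mec}},\\
& \mathrm{C2:}\ 0 \le f_{i,j}^{\mathrm{mec}} \le F_j^{\mathrm{mec}}, \quad \forall i \in \mathcal{N}_j,\\
& \mathrm{C3:}\ \textstyle\sum_{i \in \mathcal{N}_j} a_i\, f_{i,j}^{\mathrm{mec}} \le F^{\mathrm{mec}},\\
& \mathrm{C4:}\ a_i \in \{0,1\},
\end{aligned}
$$

where $F^{\mathrm{mec}}$ is the total MEC computation resource, $F_j^{\mathrm{mec}}$ is the resource assigned to RSU-j, and $\mathcal{N}_j$ is the set of vehicles served by RSU-j.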

Since the cost function in the above problem involves the end-to-end delay, it depends on the parameters of the stochastically arriving tasks , the computing resource allocated to RSU-j, and the relative position of the vehicle to the RSU based on the vehicle's driving characteristics . Therefore, the computational complexity grows additively over all tasks and vehicle characteristics and also depends on the number of generated tasks. In this optimization problem, the offloading decisions and the allocated computation resources are the two main decision variables, which make the problem a mixed-integer nonlinear programming problem that is generally nonconvex and NP-hard [36]. We adopt a multiagent deep reinforcement learning approach to feasibly solve the problem of jointly optimizing the computation offloading decisions and the computation resource allocation.

4. DRL for Computation Offloading and Resource Allocation

4.1. Scheme Design

We assume that the state is determined by the arriving tasks and the vehicles' characteristics, which are updated in each step, and that the state of the next time slot depends only on the state of the current time slot. Therefore, the formulated problem can be modeled as a Markov decision process (MDP). An MDP is an iterative process in which agents observe states from the state space of the environment, select an action from the action space, obtain an immediate reward, and then transit to another state; it can be represented as a tuple , where is the state space, is the action space, is the transition probability space, is the reward space, and is the discount factor. The MDP policy depends entirely on the current state. The state space is designed to accommodate the proposed IoV environment, and each vehicle acts as an agent. First, we define the state space, action space, and reward space as follows.

4.1.1. State Space

The state at time slot t corresponds to the required computation capacity to complete the task , the input data size of the task , the output data size of the task , the position of the vehicle , the speed of the vehicle , and the computing resource allocated to RSU-j. Thus, the state can be described as

4.1.2. Action Space

The action is the joint decision for computation offloading and resource allocation. The vehicle needs to decide whether to process the task locally or offload it to the MEC server. If the task is offloaded to the MEC server, the computation resource is allocated to the vehicle by the MEC server via the linked RSU. Therefore, the action is composed of the binary offloading decision and the computation resource allocated to vehicle-i, depicted as
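A minimal sketch of how one agent's state and action (Sections 4.1.1 and 4.1.2) could be assembled is given below; variable names are assumptions, and the thresholding step is only one possible way to map a continuous actor output to the binary offloading decision.

```python
import numpy as np

def build_state(cpu_cycles, data_in, data_out, position, speed, rsu_capacity):
    """State of one vehicle agent at time slot t: task descriptors,
    vehicle kinematics, and the resource allocated to its serving RSU."""
    return np.array([cpu_cycles, data_in, data_out, position, speed,
                     rsu_capacity], dtype=np.float32)

def build_action(raw_offload, raw_allocation, rsu_capacity):
    """Action: binary offloading decision plus the MEC resource allocated
    to the vehicle. DDPG outputs continuous values, so the offloading
    decision is obtained by thresholding a value in [0, 1]."""
    offload = 1 if raw_offload >= 0.5 else 0
    allocation = float(np.clip(raw_allocation, 0.0, rsu_capacity))
    return offload, allocation
```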

4.1.3. Reward Space

We assume that all vehicles, which have the same functionality, share the same reward function. Each agent selects its action based on the reward so as to obtain the maximum global reward. The long-term weighted sum of the cost functions of all tasks is considered as the objective, and we define the following function to maximize the reward (i.e., minimize the cost) over the whole time period T.

The average rewards of all agents in time slot t can be calculated as

Minimizing the weighted cost function of the proposed model amounts to maximizing the average cumulative reward. The expectation of future rewards can be used to measure whether the selected action is appropriate. The reward is the return of the action selected based on the state in time slot t. Therefore, the cumulative reward, generally expressed as the weighted expectation, is maximized to select the optimal actions, formulated as follows, where is the discount factor, is the policy, and is the optimal policy, which corresponds to the agents' optimal actions.
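In standard notation, with discount factor $\gamma$ and per-slot reward $r_t$ (symbols assumed for illustration), the objective described above is

$$ \pi^{*} = \arg\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_t\right]. $$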

4.2. Optimal Scheme Based on Decentralized DDPG

After formulating the MDP, we propose an optimization strategy based on decentralized multiagent DDPG (De-DDPG) in this subsection, in which each agent is initialized with four deep neural networks (DNNs): the critic network, the actor network, and two copies of the actor and critic networks serving as the target networks (see Figure 2) [37]. Each agent's state, action, and reward are obtained and used to train the DNNs during the training procedure. After training, each agent selects its next-step strategy with its own actor network according to its local observation of the environment.

As shown in Algorithm 1, the De-DDPG algorithm can be divided into three parts: initialization, interaction, and update. At the beginning of the algorithm, the four networks of each agent and the replay buffer are initialized, where is the critic network, is the actor network, is the target critic network, and is the target actor network. In addition, the replay buffer can be large because the proposed De-DDPG is an off-policy algorithm, which allows it to benefit from learning across a set of uncorrelated transitions [38]. In the interaction procedure, for each episode, noise sampled from a random noise process is added to the actor policy to form an exploration policy. The reason for introducing random noise is to address the insufficient exploration of the environment by the output actions of deterministic policy algorithms. The Ornstein–Uhlenbeck process is used to generate temporally correlated exploration for exploration efficiency. Then, the actions interact with the environment and obtain the corresponding rewards and next states. According to the observations, the transitions are stored in the replay buffer . For each update, a random mini-batch of transitions is sampled from the replay buffer, and each agent's critic network, actor network, and two target networks are updated in turn. This loop is repeated for each episode until the algorithm ends. In the update process, the critic network is updated by minimizing the loss in Algorithm 1, in which each agent approximates the other agents' policies. Here, is the prediction of the next action by the target actor network in formula (17). The actor network is updated using the sampled policy gradient in Algorithm 1, which is an unbiased Monte Carlo estimate of the policy gradient expectation computed over the mini-batch of transitions. After training on the mini-batch and updating the weights of the critic and actor networks ( and ), the weights of the two target networks of each agent ( and ) are soft-updated as a running average, as shown in Algorithm 1. A minimal Python sketch of this per-agent update step is given after the pseudocode.

Randomly initialize the critic network $Q(s, a|\theta^{Q})$ and the actor $\mu(s|\theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$
Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
Initialize replay buffer $D$
for episode $= 1, \dots, M$ do
 Initialize a random process $\mathcal{N}$ for action exploration
 Receive initial observation state $s_{1}$
  for $t = 1, \dots, T$ do
   Select action $a_{t} = \mu(s_{t}|\theta^{\mu}) + \mathcal{N}_{t}$ according to the current policy and exploration noise
   Execute action $a_{t}$ and observe reward $r_{t}$ and the next state $s_{t+1}$
   Store all transitions $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in $D$
   Sample a random mini-batch of $N$ transitions $(s_{i}, a_{i}, r_{i}, s_{i+1})$ from $D$
   Set
    $y_{i} = r_{i} + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$
   Update the critic network by minimizing the loss
    $L = \frac{1}{N}\sum_{i}\big(y_{i} - Q(s_{i}, a_{i}|\theta^{Q})\big)^{2}$
   Update the actor policy by using the sampled policy gradient
    $\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i}\nabla_{a} Q(s, a|\theta^{Q})\big|_{s=s_{i}, a=\mu(s_{i})}\,\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})\big|_{s=s_{i}}$
   Update the target networks for each agent:
    $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$
    $\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$
  end for
end for
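The per-agent update in Algorithm 1 follows the standard DDPG recipe. The sketch below shows the critic loss, the sampled policy gradient step, and the soft target update; PyTorch is used here only for brevity (the paper's implementation is in TensorFlow 1.13), and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Simple fully connected network used for both actor and critic."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

def soft_update(target, source, tau=0.001):
    """Running-average (soft) update of the target network weights."""
    for t_param, param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(tau * param.data + (1.0 - tau) * t_param.data)

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.9, tau=0.001):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer
    with torch.no_grad():
        a_next = target_actor(s_next)
        y = r + gamma * target_critic(torch.cat([s_next, a_next], dim=1))
    # Critic: minimize the mean-squared TD error.
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # Actor: ascend the sampled deterministic policy gradient.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # Soft-update both target networks.
    soft_update(target_critic, critic, tau)
    soft_update(target_actor, actor, tau)
```

Here, actor and critic (and their target copies) would each be an MLP instance with torch.optim.Adam optimizers, matching the four-network-per-agent structure described above.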

5. Numerical Results

This section presents a comprehensive numerical analysis covering the simulation setup, the comparison algorithms, and the simulation results.

5.1. Simulation Setup

We implement and evaluate the proposed De-DDPG with TensorFlow 1.13.1. A personal computer with an RTX 2070 GPU and 8 GB of video memory is used to train and test De-DDPG.

There are 20 vehicles driving on the road, 4 RSUs are located at stationary positions along the roadside, and an MEC server directly connects with the RSUs. That is to say, 2 groups of RSUs are set to serve all vehicles; each group includes a main RSU and a secondary RSU, the latter serving to prevent service interruption when the main RSU becomes abnormal due to accidents (such as power failure or communication blockage). Therefore, , . The required computation resource is set as (G cycles/Mbit). The size of the arriving tasks follows a uniform distribution (in Mbits). The arrival probability of tasks is . The maximum tolerable delay of the computation task is set to time slots. One time slot is set to ms. The uplink transmission rate of vehicle-i is between and Mbits per time slot. The bandwidth of RSU is . The transmission power of vehicle-i is . The maximum computation resource allocated to RSU-j is gigacycles. The computation resource of each vehicle is gigacycles.

For each vehicle-i (), the neural networks (two actor networks and two critic networks) of De-DDPG are composed of four layers, i.e., an input layer, two fully connected hidden layers, and an output layer. Table 2 lists the parameters and values.

5.2. Simulation Comparison

To verify the performance of the proposed De-DDPG, three benchmark algorithms are used: centralized deep deterministic policy gradient (Ce-DDPG), all tasks offloaded to the RSU (A-RSU), and all tasks executed by the local processor (A-LP), which are described as follows:
(1) Ce-DDPG: on the MEC side, a centralized controller captures global information such as the tasks generated by the vehicles and the computation and communication resources of the RSUs. That is to say, there is only one agent interacting with the MEC environment. In order to improve the convergence of Ce-DDPG, the computation resources allocated by all RSUs are identical. The structure of the neural network is the same as that of each subnetwork of De-DDPG.
(2) A-RSU: all tasks from the vehicles are offloaded to the RSU of the corresponding coverage region. Furthermore, the computation resources allocated to each vehicle are identical.
(3) A-LP: all tasks are executed by the local processor of the vehicle.

5.3. Simulation Results

In this section, we analyze the simulation results in detail from two aspects: the convergence of the proposed De-DDPG and its advantages compared with the three baseline algorithms.

In terms of De-DDPG's convergence, the average cumulative reward of De-DDPG over one period (all time slots) in each episode is used as the evaluation metric.
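Concretely, with $N$ agents and $T$ time slots per episode (symbols assumed for illustration), the plotted metric can be expressed as

$$ \bar{R} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} r_{i,t}. $$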

Figure 3 shows the convergence of the proposed De-DDPG with different critic learning rates . The choice of learning rate obviously affects the convergence quality and speed of De-DDPG. From Figure 3, it can be observed that De-DDPG does not converge when . However, as shown by the blue curve for , although De-DDPG eventually converges, the convergence speed is too slow, which degrades its performance. Therefore, we set , for which De-DDPG is more stable. For the actor's learning rate, we set .

is the total computation resource preallocated to RSU-j by the MEC server. To ensure the fairness of computation resource allocation and the robustness of the proposed De-DDPG, we periodically update the computation resources allocated to each RSU by adding a fluctuation with a given volatility rate. As shown in Figure 4, when the volatility rate is set to , the convergence and performance of De-DDPG differ. When , the performance of De-DDPG is better than that of the other two curves, but its stability is slightly worse. As the volatility rate increases, the performance of De-DDPG decreases; however, when the volatility rate is set to , the stability and convergence of De-DDPG are optimal.

Figure 5 reveals the convergence of De-DDPG with different control parameters . From formula (10), the greater is, the greater the cost is (and the smaller the reward is). Although the average cumulative reward of De-DDPG is the worst when , its stability is much better than that of the curves for and . From the curves in Figure 5, as the control parameter increases, the performance of De-DDPG declines less and less. However, when , the convergence and training speed of De-DDPG are obviously improved. Therefore, we set the control parameter of the computation resource cost to in this paper.

Furthermore, Figures 6–8 verify the performance and advantages of the proposed De-DDPG compared with the baselines Ce-DDPG, A-RSU, and A-LP. The average cumulative reward over all episodes (1000 episodes) is used to show the performance of the four algorithms.

Figure 6 illustrates the comparison of the four algorithms with different numbers of arriving vehicles. When the number of vehicles within the coverage region of each RSU is 6 or 8, the computation resources of each RSU are sufficient to meet the computational demands of all generated tasks, so the performance of De-DDPG, Ce-DDPG, and A-RSU does not differ significantly. For A-LP, regardless of the number of vehicles, its local processing ability remains unchanged, and the computation resources of the RSUs have no effect on its performance. However, because the computation ability of the vehicles' local processors is insufficient for the arriving tasks, a large number of penalties are incurred in each episode due to incomplete tasks, which degrades the average cumulative reward of A-LP. When the number of vehicles within the coverage region of each RSU is greater than 10, the computation resources of the RSUs become insufficient to complete all tasks generated by the vehicles, and the performance of all four algorithms degrades significantly. From the curves, the performance of the proposed De-DDPG is better than that of the other three algorithms.

The uplink transmission rate of the wireless communication between vehicle-i and its associated RSU-j also influences the performance of the four algorithms. Figure 7 shows the performance of the four algorithms with different uplink transmission rates . We can see that A-LP is not influenced by the transmission rate because it executes all tasks with its own local processor. For De-DDPG, Ce-DDPG, and A-RSU, different uplink transmission rates mean different transmission delays, which affect the average cumulative rewards of all three algorithms. As shown in Figure 7, the performance of De-DDPG is obviously better than that of Ce-DDPG and A-RSU due to the number of agents participating in training, the offloading decisions, and the ratio of resource allocation.

In Figure 8, we compare the four algorithms with different trade-off coefficients for the latency cost. From formula (10), is the weighting parameter between the delay cost and the computation resource cost. For A-LP, since the computation resource cost is fixed, its performance decreases as the trade-off coefficient increases. The other three algorithms consider the trade-off between the delay cost and the computation resource cost, so the performance of De-DDPG, Ce-DDPG, and A-RSU varies with . As shown in Figure 8, in terms of the average cumulative reward, De-DDPG outperforms Ce-DDPG, A-RSU, and A-LP regardless of the value of .

6. Conclusions

We propose a computation offloading and resource allocation scheme based on DRL for MEC-assisted multiagent IoV with a stochastic task arrival model. To minimize the total weighted cost of the proposed model, we adopt a decentralized multiagent DDPG-based approach (De-DDPG) to solve the nonconvex joint optimization problem. The simulation results demonstrate that our approach has a stable learning capacity and effectively learns the optimal offloading and resource allocation policy to obtain the maximum reward (minimum cost). Compared with the three baseline algorithms, our algorithm achieves better performance under various parameter configurations. In this paper, a binary offloading decision is used and task priority is not considered. We will improve these two points in future work, for example, by considering partial offloading and task prioritization in this joint optimization problem.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61572229 and 6171101066).