With the growth of Internet of Vehicles (IoV) traffic, the contradiction between the large number of computing tasks and the limited computing resources has become increasingly prominent. Although many existing studies address this problem, they mainly pursue different optimization goals for edge offloading in static scenarios. Since realistic scenarios are complicated and generally time-varying, such static-scene studies are incomplete. In this paper, we consider collaborative computation offloading in a time-varying edge-cloud network and formulate an optimization problem, subject to both delay constraints and resource constraints, that aims to minimize the long-term system cost. Since the feasible set of the problem is nonconvex and its complexity is very large, we propose a Q-learning-based approach to solve it. In addition, to avoid the curse of dimensionality, we further propose a DQN-based approach. Finally, simulation results show that our proposed approaches outperform typical algorithms: under the same conditions, our Q-learning-based method reduces the system cost by about 49% and 42% compared with two typical algorithms, and our DQN-based scheme reduces the system cost by 62% and 57%.

1. Introduction

With the rapid development of the Internet of Vehicles (IoV), vehicles generate an increasing number of computation-intensive tasks, such as online interactive applications, route planning, and traffic flow prediction. However, since these applications require a certain amount of computing resources and have Quality of Service (QoS) requirements, the basic on-board equipment in IoV, which suffers from hardware and other limitations, cannot meet the needs of vehicles [1].

To relieve the pressure of tight resources in IoV, Mobile Cloud Computing (MCC) is often regarded as a promising solution. In MCC networks, cloud servers (CS) have a large volume of computation and communication resources and can provide offloading services to multiple users at the same time [2]. However, since the CS is usually deployed far away from vehicles, MCC can lead to significant transmission delays and system costs [3].

Mobile edge computing (MEC) is believed to be a reliable paradigm because it improves the QoS of vehicles by reducing latency and energy costs [4]. To handle latency-sensitive tasks while satisfying their demand for resources, MEC sinks computing services to the network edge, where network resources can also be fully utilized [5, 6]. In IoV, MEC servers are deployed on roadside units (RSUs), and vehicles within their coverage area can receive offloading services. However, the computing resources in MEC are still limited, so some computation tasks may fail when the task load in the network is too large [7]. Most previous studies have considered the optimal allocation of computing resources and offloading mode selection separately. Moreover, in terms of computation offloading, some studies offload computing tasks from vehicles to edge servers or the CS without considering resource optimization or the combination of both. Therefore, a combined solution is necessary when solving practical problems.

Moreover, with the development of machine learning, reinforcement learning (RL) is considered an effective method for finding optimal computation offloading strategies in time-varying scenarios. Compared with other optimization methods, an agent in RL can find an optimal policy by observing the current environment, selecting actions, and obtaining future rewards through constant interaction with the environment [8]. Therefore, it is of great interest to design an efficient RL-based computation offloading and resource allocation scheme.

In this paper, we propose an efficient offloading scheme for the edge-cloud network that jointly optimizes the offloading decision and the resource allocation, ultimately achieving the optimization goal of minimizing the system cost. In the considered scenario, the CS makes offloading decisions for each vehicle based on the current system state. Distinct from existing works, we consider both cooperation between the CS and the RSUs and cooperation among the RSUs. We formulate the optimization problem as a mixed-integer nonlinear programming (MINLP) problem. To solve it, we first propose a Q-learning-based approach. However, since the Q-learning method may suffer from the curse of dimensionality when the state and action spaces become too large [9], we further propose a DQN-based approach to compensate for this drawback. The main contributions of this article are as follows:
(i) We construct an edge-cloud network for time-varying IoV, in which the CS and the RSUs can process computing tasks for vehicles cooperatively.
(ii) We study the cooperative offloading problem in the proposed model and formulate it as an MINLP problem, aiming to minimize the system cost.
(iii) After defining the state space, action space, and reward function, we approximate the optimization process as a Markov decision process (MDP). Based on the MDP, we propose a Q-learning-based and a DQN-based method to solve the optimization problem, respectively.
(iv) Numerical results demonstrate that our proposed schemes significantly reduce the system cost compared with other typical algorithms.

The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the system model in detail, including the computation model and the communication model. In Section 4, we describe the optimization problem and formulate it as an MINLP problem. In Section 5, a Q-learning-based method and a DQN-based method are proposed to solve the optimization problem. Simulation results and analysis are given in Section 6. Finally, Section 7 concludes this paper.

2. Related Work

In recent years, research on computation offloading in MEC and MCC scenarios has become increasingly popular due to the needs of practical scenarios. Specifically, in [10], Mao et al. perform a joint optimization of power and computation offloading in a MEC scenario using NOMA to achieve the optimization goal of minimizing system energy consumption. In [11], Ning et al. consider a MEC heuristic offloading scheme based on partial offloading with the optimization goal of reducing the system delay. In [12], Kuang et al. use a hierarchical approach to obtain suboptimal solutions for optimizing the offloading mode and power allocation in the MEC scenario. In [13], the authors offload tasks to nearby vehicles as well as edge devices and in this way solve the problem of computation offloading and probabilistic caching. In [14], Bi et al. consider the problem of service delivery and deployment in a single-user MEC system with the optimization goal of minimizing the overall system latency.

Considering the different characteristics of MEC and MCC, some studies have also considered combining both. In [15, 16], the authors study the system architecture of an edge-cloud system. In [17], Lin et al. propose a directional charging scheme and an improved energy transfer model for the MEC system. In [18], Wang et al. consider the server allocation problem for edge computing system deployment, where each edge cloud is modeled as an M/M/c queue. In [19, 20], the authors study the computation, communication, and storage resource problems in both IoV and MEC networks. Wang et al. in [21] consider using D2D technology in a MEC system to collect larger and better-quality sensing data.

For the problems in MEC and MCC scenarios, some existing studies have used RL or DRL as solutions. In [22], Wang et al. transform the edge caching problem into a Markov decision process and propose a distributed cache replacement strategy based on Q-learning to address the optimization problem. In [23], Su et al. propose a Q-learning-based spectrum access scheme to optimize spectrum usage and maximize the transmission rate. In [24], Dinh et al. propose a Q-learning-based scheme to solve the optimization problem in a multiuser, multi-edge-node computation offloading scenario. He et al. use a dueling DQN approach in [25] to solve a joint optimization problem in connected vehicle networks, considering not only network gains but also caching and computation gains in the proposed framework. In [26], Wang et al. investigate the best strategy for resource allocation in ICWNs by maximizing spectrum efficiency and system capacity across the network and propose a DQN-based task offloading scheme for MEC networks in urban cities. In [27], Zhou et al. use a DDQN-based approach to solve the energy minimization problem while efficiently approximating the value function.

Unlike these existing studies, this paper mainly considers the problem of MEC and MCC collaboration in an IoV environment. Through the collaboration of edge-side and cloud-side servers, we aim to minimize the system cost. To solve the joint optimization problem of offloading decision and resource allocation, we propose two algorithms, based on the Q-learning method and the DQN method, respectively.

3. System Model

In this section, we first present an edge-cloud network including mobile vehicles, RSUs equipped with MEC servers, and cloud servers (CS). Next, we give precise definitions of the model components.

3.1. Model Architecture

The edge-cloud network we consider is shown in Figure 1. The set of vehicles is denoted by $\mathcal{V} = \{1, 2, \dots, V\}$, and the set of MEC servers deployed at roadside units (RSUs) is denoted by $\mathcal{M} = \{1, 2, \dots, M\}$. In particular, time in this system is divided into time slots $t \in \{1, 2, \dots, T\}$, where $T$ is the finite time horizon. The computing task of vehicle $v$ in time slot $t$ is defined as $W_v^t = (c_v^t, d_v^t, \tau_v^t)$, where $c_v^t$ represents the total number of CPU cycles required to process the task, $d_v^t$ represents the size of the computing task, and $\tau_v^t$ denotes the maximum delay tolerance of the task. Typically, MEC servers are deployed at the edge of the network and provide services to vehicles through cellular networks. The CS, on the other hand, is deployed far away from the vehicles and provides computing services through the core network. To guarantee the reliability of data transmission and provide offloading services for vehicles, the RSUs and the CS are connected via the core network. In general, the amount of computing resources and bandwidth is much higher on the CS than on the MEC servers, but the offloading service is more costly on the CS. The descriptions of the main symbols in this paper can be found in Table 1.

In the considered scenario, multiple vehicles drive on the road within the coverage of the CS and the RSUs. We denote by $l_{v,m}^t \in \{0, 1\}$ the link between vehicle $v$ and RSU $m$, where $l_{v,m}^t = 1$ means that in time slot $t$ vehicle $v$ is in the coverage area of RSU $m$ and is associated with it. In time slot $t$, vehicle $v$ needs the task offloading service. After its motion information and task information are obtained, the vehicle's offloading request is sent to its associated RSU. If the associated RSU does not have enough resources to fulfill the requirement, the task information is further sent to the other RSUs in this area for cooperative offloading. If all RSUs in this area fail to meet the task's resource demand, the computing task is offloaded to the CS. We define the offloading decision of vehicle $v$ as the integer variable $x_v^t \in \{0, 1\}$, where $x_v^t = 0$ means that the computing task of vehicle $v$ is offloaded to the RSUs in time slot $t$, and $x_v^t = 1$ means that the computing task is offloaded to the CS. Based on the current resource status of the RSUs, the offloading decision of the vehicle and the resource allocation are made dynamically by the control center in the CS. The task offloading process described above is shown in Figure 2.
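The fallback logic above (associated RSU, then cooperative RSUs, then the CS) can be sketched as follows. The function name and the dictionary-based resource representation are illustrative assumptions, not the paper's actual implementation:

```python
def make_offload_decision(demand_cpu, associated_rsu, rsu_free_cpu):
    """Sketch of the control centre's fallback logic:
    associated RSU -> any cooperative RSU -> cloud server (CS)."""
    # (1) Try the RSU the vehicle is associated with.
    if rsu_free_cpu[associated_rsu] >= demand_cpu:
        return ("rsu", associated_rsu)
    # (2) Try the other RSUs in the area (cooperative offloading).
    for m, free in rsu_free_cpu.items():
        if m != associated_rsu and free >= demand_cpu:
            return ("rsu", m)
    # (3) Fall back to the cloud server.
    return ("cs", None)
```

For example, a vehicle associated with an overloaded RSU would be redirected to a neighbouring RSU with spare capacity, and only to the CS when no RSU can serve it.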

3.2. Communication Model

Each vehicle can connect to only one RSU at a time within the RSU's coverage range. We assign bandwidth $b_{v,m}^t$ to the link between vehicle $v$ and RSU $m$. As shown in eq. (1), we calculate the data transmission rate according to Shannon's formula
$$r_{v,m}^t = b_{v,m}^t \log_2 \left(1 + \frac{g_{v,m} p_v}{\sigma^2}\right), \tag{1}$$
where $g_{v,m}$ denotes the channel gain, $p_v$ denotes the transmission power, and $\sigma^2$ denotes the power of the white noise.
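Eq. (1) can be computed directly; the helper below is a minimal sketch, with argument names chosen for clarity rather than taken from the paper:

```python
import math

def transmission_rate(bandwidth_hz, channel_gain, tx_power_w, noise_power_w):
    """Data rate of the vehicle-RSU link via Shannon's formula (eq. (1)):
    r = b * log2(1 + g * p / sigma^2)."""
    snr = channel_gain * tx_power_w / noise_power_w
    return bandwidth_hz * math.log2(1.0 + snr)
```

For instance, a 1 MHz link with an SNR of 3 (i.e., log2(4) = 2 bits per symbol) yields a rate of 2 Mbit/s.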

3.3. Computation Model
3.3.1. Computing Model for CS

For $x_v^t = 1$, vehicle $v$ decides to offload the computing task to the CS. Although cloud servers have a huge volume of computing and communication resources, the amount of resources available in each time slot is still limited considering the maintenance cost of the resources and other factors. The offloading process in this case is divided into the following parts: (i) task transmission between vehicle $v$ and its associated RSU, (ii) uploading the task to the CS, and (iii) task processing on the CS. According to eq. (1), the data transmission rate between vehicle $v$ and its associated RSU is $r_{v,m}^t$, and the data transmission rate between the CS and the RSUs is denoted by $r_{m,c}$. To sum up, the transmission times of the first two processes are
$$t_{v,m}^{\mathrm{tr}} = \frac{d_v^t}{r_{v,m}^t}, \qquad t_{m,c}^{\mathrm{tr}} = \frac{d_v^t}{r_{m,c}},$$
where $d_v^t$ denotes the task size of the vehicle. We define $f_{c,v}^t$ as the computational capacity assigned by the CS to vehicle $v$. Thus, the computation time of this process is
$$t_{c}^{\mathrm{comp}} = \frac{c_v^t}{f_{c,v}^t},$$
where $c_v^t$ denotes the required CPU cycles of the task. Then, we define the total execution time when offloading the task to the CS as $T_{c,v}^t$. Thus, we have
$$T_{c,v}^t = t_{v,m}^{\mathrm{tr}} + t_{m,c}^{\mathrm{tr}} + t_{c}^{\mathrm{comp}}.$$

Based on the above discussion, the system cost when tasks are offloaded to the CS is formulated as
$$C_{c,v}^t = p_v\, t_{v,m}^{\mathrm{tr}} + p_{m,c}\, t_{m,c}^{\mathrm{tr}} + p_c\, t_{c}^{\mathrm{comp}},$$
where $p_v$ denotes the transmission power between the vehicle and the RSUs, $p_{m,c}$ denotes the transmission power between the RSUs and the CS, and $p_c$ denotes the execution power of the CS.
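The CS-offloading delay and cost above are sums of per-phase terms, which can be sketched as follows (argument names are illustrative; units are assumed consistent, e.g., bits, cycles, bits/s, cycles/s, and watts):

```python
def cs_offload_time(d_bits, c_cycles, r_vm, r_mc, f_c):
    """Total delay when offloading to the CS: vehicle->RSU upload,
    RSU->CS upload, and execution on the CS."""
    return d_bits / r_vm + d_bits / r_mc + c_cycles / f_c

def cs_offload_cost(d_bits, c_cycles, r_vm, r_mc, f_c, p_v, p_mc, p_c):
    """System cost as (power) x (duration) for each of the three phases."""
    return (p_v * d_bits / r_vm
            + p_mc * d_bits / r_mc
            + p_c * c_cycles / f_c)
```

With unit powers, the cost reduces to the total delay, which makes the structure of the cost model easy to check.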

3.3.2. Computation Model for Associated Offloading

For $x_v^t = 0$ and $l_{v,m}^t = 1$, the vehicle's associated RSU $m$ has enough resources to process its computing task, so vehicle $v$ decides to offload the computing task to its associated RSU. The task processing in this case can be divided into two parts: (i) the transmission from vehicle $v$ to its associated RSU and (ii) task processing on the associated RSU. Similar to the above, we have
$$t_{v,m}^{\mathrm{tr}} = \frac{d_v^t}{r_{v,m}^t}, \qquad t_{m}^{\mathrm{comp}} = \frac{c_v^t}{f_{m,v}^t},$$
where $f_{m,v}^t$ denotes the computational capacity assigned by RSU $m$. The total execution time can be obtained as
$$T_{m,v}^t = t_{v,m}^{\mathrm{tr}} + t_{m}^{\mathrm{comp}}.$$

Based on the above discussion, the system cost for associated offloading is formulated as
$$C_{m,v}^t = p_v\, t_{v,m}^{\mathrm{tr}} + p_r\, t_{m}^{\mathrm{comp}},$$
where $p_r$ denotes the execution power of the RSUs.

3.3.3. Computation Model for Cooperative RSUs

For $x_v^t = 0$, when the associated RSU cannot meet the task's requirements in terms of resources, vehicle $v$ decides to offload the computing task to a cooperative RSU $m'$. The task processing at this point consists of three processes: (i) the transmission from vehicle $v$ to its associated RSU $m$, (ii) the transmission of the task between the RSUs, and (iii) task processing on the target RSU $m'$. We define the computational capacity assigned by the target RSU $m'$ to vehicle $v$ as $f_{m',v}^t$; thus, we have
$$t_{m,m'}^{\mathrm{tr}} = \frac{d_v^t}{r_{m,m'}}, \qquad t_{m'}^{\mathrm{comp}} = \frac{c_v^t}{f_{m',v}^t},$$
where $r_{m,m'}$ denotes the transmission rate between the RSUs.

The total task execution time in this case is expressed as
$$T_{m',v}^t = t_{v,m}^{\mathrm{tr}} + t_{m,m'}^{\mathrm{tr}} + t_{m'}^{\mathrm{comp}}.$$

Combining the above discussion, the system cost when tasks are further offloaded to cooperative RSUs can be formulated as
$$C_{m',v}^t = p_v\, t_{v,m}^{\mathrm{tr}} + p_{m,m'}\, t_{m,m'}^{\mathrm{tr}} + p_r\, t_{m'}^{\mathrm{comp}},$$
where $p_{m,m'}$ denotes the transmission power between the RSUs.
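The associated and cooperative RSU cases differ only by the extra RSU-to-RSU hop, so both delay and cost models can be sketched with a single helper. Names and units are illustrative assumptions:

```python
def rsu_offload_time(d_bits, c_cycles, r_vm, f_m, r_mm=None):
    """Associated offloading when r_mm is None; cooperative offloading
    (one extra RSU-to-RSU transmission) otherwise."""
    t = d_bits / r_vm + c_cycles / f_m
    if r_mm is not None:
        t += d_bits / r_mm
    return t

def rsu_offload_cost(d_bits, c_cycles, r_vm, f_m, p_v, p_r,
                     r_mm=None, p_mm=0.0):
    """Power x duration, mirroring the cost terms of the two RSU cases."""
    cost = p_v * d_bits / r_vm + p_r * c_cycles / f_m
    if r_mm is not None:
        cost += p_mm * d_bits / r_mm
    return cost
```

The optional `r_mm` argument makes the extra transmission term of the cooperative case explicit while keeping the shared structure of the two models visible.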

4. Problem Formulation

In this section, we first calculate the total system cost based on the previous section. Then, we formulate an optimization problem, aiming to minimize the long-term system cost.

Based on the previous section, we define the total system cost in time slot $t$ as
$$C^t = \sum_{v \in \mathcal{V}} \left[ x_v^t\, C_{c,v}^t + \left(1 - x_v^t\right) \left( z_v^t\, C_{m,v}^t + \left(1 - z_v^t\right) C_{m',v}^t \right) \right],$$
where $z_v^t \in \{0, 1\}$ indicates whether the task of vehicle $v$ is processed on its associated RSU ($z_v^t = 1$) or on a cooperative RSU ($z_v^t = 0$).

By jointly optimizing computation offloading and resource allocation in the proposed system, we formulate the optimization problem with minimizing the long-term system cost as the objective, which can be indicated as follows:
$$\min_{\{x_v^t,\, f_{m,v}^t,\, b_{v,m}^t\}} \; \sum_{t=1}^{T} C^t$$
$$\text{s.t.} \quad x_v^t \in \{0, 1\}, \quad \forall v \in \mathcal{V},\ \forall t, \tag{18}$$
$$T_v^t \le \tau_v^t, \quad \forall v \in \mathcal{V},\ \forall t, \tag{19}$$
$$\sum_{v \in \mathcal{V}} f_{m,v}^t \le F_m^t, \quad \forall m \in \mathcal{M},\ \forall t, \tag{20}$$
$$\sum_{v \in \mathcal{V}} b_{v,m}^t \le B_m^t, \quad \forall m \in \mathcal{M},\ \forall t, \tag{21}$$
where $T_v^t$ denotes the total execution time of vehicle $v$'s task under the chosen offloading mode, and $F_m^t$ and $B_m^t$ denote the currently available computation resources and bandwidth of RSU $m$, respectively.

The meanings of the constraints are explained as follows:
(i) Constraint (18) indicates that the decision variable is a Boolean value.
(ii) Constraint (19) guarantees that each task is completed within its maximum tolerable delay.
(iii) Constraint (20) guarantees that the computation resources allocated by each RSU do not exceed its currently available computation resources.
(iv) Constraint (21) guarantees that the bandwidth allocated by each RSU does not exceed its currently available bandwidth.
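Constraints (19)-(21) amount to a per-slot feasibility check on a candidate allocation, which can be sketched as below. The data layout (per-vehicle lists, per-RSU dictionaries) is an illustrative assumption:

```python
def slot_feasible(task_times, deadlines, cpu_alloc, cpu_avail,
                  bw_alloc, bw_avail):
    """Check constraints (19)-(21) for one time slot.
    task_times/deadlines are per-vehicle; *_alloc/*_avail are per-RSU totals."""
    if any(t > tau for t, tau in zip(task_times, deadlines)):
        return False          # (19): some task misses its deadline
    if any(cpu_alloc[m] > cpu_avail[m] for m in cpu_avail):
        return False          # (20): CPU over-allocated at some RSU
    if any(bw_alloc[m] > bw_avail[m] for m in bw_avail):
        return False          # (21): bandwidth over-allocated at some RSU
    return True
```

A search procedure (or the RL agent's environment) can use such a check to reject infeasible actions before computing their cost.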

According to the previous discussion, $x_v^t$ represents the Boolean variable for the offloading decision, while $f_{m,v}^t$ and $b_{v,m}^t$ represent the computational resources and the bandwidth allocated to the task, respectively. Moreover, there are nonlinear terms in the optimization problem. Therefore, the optimization problem is a typical mixed-integer nonlinear programming (MINLP) problem [28], which is NP-hard and cannot be solved in polynomial time.

5. Problem Transformation and Solution

In this section, we first describe the optimization problem as a Markov decision process (MDP). Then, we define the state space, action space, and reward function of the problem so that it can be solved with the Q-learning method and the DQN method.

5.1. State, Action, and Reward Definitions

(i) State Space. The state space indicates the current state of the environment in the system. In the considered scenario, the system state of the available resources at time slot $t$ is determined by $B^t = (B_1^t, \dots, B_M^t)$ and $F^t = (F_1^t, \dots, F_M^t)$, which represent the available bandwidth and the available computation resources, respectively. Moreover, in order to compare states and determine whether the system has reached the optimal state, we need the system cost in each time slot. Hence, the state vector can be obtained as
$$s^t = \left\{ B^t, F^t, C^t \right\}.$$
(ii) Action Space. In our scenario, the agent needs to perform multiple actions, including making offloading decisions and deciding how many resources to allocate in each time slot. Therefore, the action vector consists of the offloading decision vector $x^t$, the computation resource allocation vector $f^t$, and the bandwidth allocation vector $b^t$. Hence, the action vector in the current time slot can be obtained as
$$a^t = \left\{ x^t, f^t, b^t \right\}.$$
(iii) Reward Function. The optimization objective in this paper is to minimize the system cost, which is the opposite of maximizing the system reward. Therefore, we define the reward the agent obtains at state $s^t$ when performing action $a^t$ as $R(s^t, a^t) = -C^t$, where $C^t$ is the system cost of the current state.
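The state, action, and reward definitions above can be sketched as plain vector constructors; the helper names are illustrative:

```python
import numpy as np

def build_state(avail_bw, avail_cpu, last_cost):
    """s^t = (B^t, F^t, C^t): per-RSU available bandwidth and CPU,
    plus the most recent system cost."""
    return np.concatenate([avail_bw, avail_cpu, [last_cost]])

def build_action(offload_decisions, cpu_alloc, bw_alloc):
    """a^t = (x^t, f^t, b^t)."""
    return np.concatenate([offload_decisions, cpu_alloc, bw_alloc])

def reward(system_cost):
    """R(s^t, a^t) = -C^t: minimising cost equals maximising reward."""
    return -system_cost
```

With M RSUs the state vector has 2M + 1 entries, which fixes the input dimension of the DQN introduced later.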

5.2. Markov Decision Process

In this step, we transform the optimization problem into an MDP, in which the agent performs adaptive learning and decision making through iterative interactions with the unknown environment. The specific steps are as follows: first, the agent observes the current system state $s^t$. The agent then performs an action $a^t$ based on the current policy in each time slot. As a mapping from the current system state to an action, the policy can be obtained as $\pi: \mathcal{S} \rightarrow \mathcal{A}$, where $\mathcal{A}$ denotes the set of actions and $\mathcal{S}$ denotes the set of states. The probability of the agent moving to the next state $s^{t+1}$ is $P(s^{t+1} \mid s^t, a^t)$, and the reward can be obtained as $R(s^t, a^t)$.

To summarize the above discussion, a state value function can be defined to indicate the long-term effect of the current state. Hence, the state value function under policy $\pi$ can be expressed as
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s^t, a^t) \,\middle|\, s^0 = s\right],$$
where $s^0$ denotes the initial state, and $\gamma \in [0, 1)$ denotes the discount factor indicating the importance of future rewards.

Finally, combined with the optimization objective, the agent needs to obtain the optimal strategy in the current state to maximize the cumulative reward. Therefore, the optimization problem can be translated into finding the optimal state value function
$$V^{*}(s) = \max_{\pi} V^{\pi}(s).$$

Thus, the optimal action for state $s$ can be obtained as
$$a^{*}(s) = \arg\max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]. \tag{25}$$

5.3. Q-Learning-Based Solution

As an efficient value-based, model-free iterative learning algorithm, Q-learning enables the agent to continuously approximate the optimal $Q$-value by learning the optimal action in the corresponding environment at each time slot. Specifically, the Q-learning agent needs to obtain the results of the state value function for each policy and update the two-dimensional Q-table with the corresponding $Q$-values. Thus, the agent can obtain the optimal strategy for each state based on the magnitude of the $Q$-values.

Specifically, in this paper we use the Q-learning method to solve the optimization problem. The optimal action value can be defined as $Q^{*}(s, a)$, and the optimal state value function can be obtained as
$$V^{*}(s) = \max_{a \in \mathcal{A}} Q^{*}(s, a).$$

Therefore, the cumulative reward after performing action $a$ can be obtained as
$$Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s').$$

Summarizing the two formulas above, the expected reward can be obtained as
$$Q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^{*}(s', a').$$

The iterative formula for the optimal $Q$-value can be obtained by updating the state-action function as
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \tag{29}$$
where $\alpha \in (0, 1]$ denotes the learning rate.

Combining the above discussion, our proposed algorithm is shown in Algorithm 1. In order to trade off exploration against exploitation, we use the $\epsilon$-greedy strategy to choose actions [29].

Input: state space $\mathcal{S}$, action space $\mathcal{A}$, learning rate $\alpha$, discount factor $\gamma$
Output: the $Q$-values $Q(s, a)$ for every state-action pair
1: Initialize $Q(s, a)$ arbitrarily for $s \in \mathcal{S}$, $a \in \mathcal{A}$
2: for each episode do
3:  for each step of episode do
4:   In the current state $s$, generate a random probability $p$
5:   if $p < \epsilon$ then
6:    randomly select an action $a$
7:   else
8:    select $a = \arg\max_{a'} Q(s, a')$
9:   end if
10:   Execute action $a$, observe the reward $R$ and the next state $s'$
11:   Update $Q(s, a)$ according to eq. (29)
12:   Update state $s \leftarrow s'$
13:  end for
14: end for
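Algorithm 1 can be sketched as a generic tabular routine. Here `env_step` is a stand-in for one interaction with the edge-cloud environment (a hypothetical interface, not the paper's simulator), and the hyperparameter defaults are illustrative:

```python
import random
from collections import defaultdict

def run_q_learning(env_step, states, actions, episodes=200,
                   alpha=0.1, gamma=0.9, eps=0.1, steps=20, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy (Algorithm 1 sketch)."""
    rng = random.Random(seed)
    Q = defaultdict(float)                  # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s = states[0]                       # reset to an initial state
        for _ in range(steps):
            if rng.random() < eps:
                a = rng.choice(actions)                      # explore
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])  # exploit
            s_next, r = env_step(s, a)
            # eq. (29): TD update towards r + gamma * max_a' Q(s', a')
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

On a toy environment where one action always pays a higher reward, the learned Q-table ranks that action above the other, which is the behaviour the $\epsilon$-greedy policy then exploits.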
5.4. DQN-Based Solution

In time-varying scenarios, the number of vehicles and the sizes of the tasks are stochastic in nature. This can produce a huge state-action space; although the above Q-learning-based solution can obtain the best policy by updating the Q-table, it may suffer from the curse of dimensionality in realistic scenarios. If we stuck to the Q-learning-based solution, finding the corresponding $Q$-value in a huge Q-table would be costly in both time and memory.

To avoid this drawback of the Q-learning method, we further use a DQN-based approach to solve the optimization problem. DQN is essentially an improvement of Q-learning. As a value function approximation method, DQN replaces the Q-table with a deep neural network (DNN) in order to handle large state spaces, i.e., the curse of dimensionality. As a nonlinear approximator for the optimization problem, the DNN in DQN can capture the complex interaction between states and actions [30]. After taking the state as the input of the DQN network, we obtain the $Q$-values of the actions as the output. By doing so, we can estimate the $Q$-value as $Q(s, a; \theta) \approx Q^{*}(s, a)$, where $\theta$ denotes the weights of the DQN. Therefore, as in eq. (25), the optimal action in this method can be obtained as
$$a^{*}(s) = \arg\max_{a \in \mathcal{A}} Q(s, a; \theta).$$

In practical applications, DQN mainly needs to solve two obvious problems: low sample utilization and unstable training targets. To deal with these two problems, DQN uses the following two key techniques:
(i) Experience Replay. An experience pool, a dataset consisting of the agent's recent experiences, is constructed to remove correlations between training samples.
(ii) Freezing the Q-Target Network. The parameters of the target network are fixed for a period of time (or a fixed number of steps) to stabilize the learning target.
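The two techniques above can be sketched with a fixed-length buffer and a periodic weight copy; the class and function names are illustrative, and the weights are represented as plain dictionaries for simplicity:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-length experience pool for experience replay."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)   # old experiences fall out
    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))
    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between
        # consecutive transitions
        return random.sample(self.buf, batch_size)
    def __len__(self):
        return len(self.buf)

def maybe_sync_target(step, sync_every, online_weights, target_weights):
    """Freeze the target network: copy the online weights only every
    `sync_every` steps to stabilise the learning target."""
    if step % sync_every == 0:
        target_weights.clear()
        target_weights.update(online_weights)
```

Between synchronisations the target weights stay frozen even as the online weights change, which is exactly what keeps the target value $y$ stable.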

Next, we describe the specific execution steps of DQN. Figure 3 shows the network structure of DQN and the difference between DQN and Q-learning. The network first outputs a predicted $Q$-value $Q(s, a; \theta)$, then selects the next action based on this $Q$-value and passes it into the environment for interaction, and then obtains a new state and continues training. At the same time, the result of each interaction with the environment is stored in a fixed-length experience pool. A target Q-network with the same structure and parameters is copied from the Q-network every certain number of steps to stabilize the output target, and the target Q-network samples data from the experience pool to output a stable target value $y$, which can be obtained as
$$y = R + \gamma \max_{a'} \hat{Q}(s', a'; \theta^{-}),$$
where $\hat{Q}$ denotes the target network with frozen weights $\theta^{-}$.

And the update of the $Q$-value function can be obtained as
$$Q(s, a; \theta) \leftarrow Q(s, a; \theta) + \alpha \left[ y - Q(s, a; \theta) \right].$$

DQN approximates the value function using a deep neural network. The value function here corresponds to a set of parameters, which in a neural network are the weights of each layer, denoted by $\theta$. At this point, updating the value function actually means updating the parameters $\theta$; once the neural network is determined, $\theta$ specifies the value function. The parameters $\theta$ are updated by gradient descent, which can be expressed as
$$\theta_{t+1} = \theta_{t} + \alpha \left[ y - Q(s, a; \theta_{t}) \right] \nabla_{\theta_{t}} Q(s, a; \theta_{t}).$$

In each training iteration, we train the DQN network by minimizing the loss function. In the previous Q-learning-based method, we updated the Q-table iteratively using the reward and the current Q-table at each step. Here, we use the calculated target value $y$ as the label and take the mean squared difference between the approximate and target values as the loss function, which can be obtained as
$$L(\theta) = \mathbb{E}\left[ \left( y - Q(s, a; \theta) \right)^{2} \right].$$
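The target computation and the loss above can be sketched on a minibatch with NumPy; `next_q_target` stands for the target network's outputs on the next states (one row per sample, one column per action), which is an assumed layout:

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma=0.9):
    """y_j = r_j for terminal steps, otherwise
    y_j = r_j + gamma * max_a' Q_target(s', a'; theta^-)."""
    return rewards + gamma * next_q_target.max(axis=1) * (1.0 - dones)

def dqn_loss(q_pred, targets):
    """Mean squared error between predicted Q-values and targets."""
    return float(np.mean((targets - q_pred) ** 2))
```

Masking with `1.0 - dones` implements the terminal-state case of the target definition without branching over the batch.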

Summarizing the above discussion of DQN, the specific algorithm steps are shown in Algorithm 2. As in the Q-learning-based method, we use the $\epsilon$-greedy strategy to select actions.

1: Initialize replay memory $\mathcal{D}$ to capacity $N$
2: Initialize the action-value function $Q$ with random weights $\theta$
3: Initialize the target action-value function $\hat{Q}$ with weights $\theta^{-} = \theta$
4: for episode $= 1, \dots, M$ do
5:  Initialize sequence $s_1$ and preprocessed sequence $\phi_1 = \phi(s_1)$
6:  for $t = 1, 2, \dots, T$ do
7:   With probability $\epsilon$ select a random action $a_t$
8:   Otherwise select $a_t = \arg\max_{a} Q(\phi(s_t), a; \theta)$
9:   Execute action $a_t$, observe the reward $R_t$ and the next state $s_{t+1}$
10:   Set $s_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$
11:   Store experience $(\phi_t, a_t, R_t, \phi_{t+1})$ in $\mathcal{D}$
12:   Sample a random minibatch of experiences $(\phi_j, a_j, R_j, \phi_{j+1})$ from $\mathcal{D}$
13:   Set $y_j = R_j$ if the episode terminates at step $j + 1$
14:   Otherwise set $y_j = R_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^{-})$
15:   Perform a gradient descent step on $\left( y_j - Q(\phi_j, a_j; \theta) \right)^{2}$ with respect to the network parameters $\theta$
16:   Every $C$ steps reset $\hat{Q} = Q$
17:  end for
18: end for

6. Performance Evaluation

In this section, we evaluate the numerical results of the proposed joint computation offloading and resource allocation algorithms in a dynamic edge-cloud network and compare them with other typical schemes.

6.1. Simulation Settings

In the simulation experiments, we consider a dynamic scenario in which several vehicles drive in the area and several RSUs are distributed along the roadside. Similar to the experimental setup in [31], for each vehicle, the required CPU cycles of the computing task are randomly selected from the set {0.4, 0.6, 0.7, 0.8, 0.3, 0.2, 0.8, 0.9, 0.4, 0.5, 0.2, 0.3, 0.8, 0.9, 0.4} Gcycles.
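The task generation can be sketched as below. The CPU demands come from the set above; the size and deadline ranges are illustrative assumptions, since the paper does not list them here:

```python
import random

# CPU demands (Gcycles) drawn from the set listed above; the size and
# deadline ranges below are assumed values for illustration only.
CPU_CYCLES_G = [0.4, 0.6, 0.7, 0.8, 0.3, 0.2, 0.8, 0.9,
                0.4, 0.5, 0.2, 0.3, 0.8, 0.9, 0.4]

def sample_task(rng):
    """Draw one task W = (c, d, tau) for a vehicle."""
    c = rng.choice(CPU_CYCLES_G) * 1e9    # required CPU cycles
    d = rng.uniform(0.5, 2.0) * 1e6       # task size in bits (assumed)
    tau = rng.uniform(0.5, 1.5)           # max tolerable delay in s (assumed)
    return c, d, tau
```

Passing an explicit `random.Random` instance keeps the task sequence reproducible across simulation runs.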

6.2. Simulation Results

In the following simulation experiments, our proposed schemes are compared with the random offloading and resource allocation (RORA) scheme and the greedy offloading and resource allocation (GORA) scheme [32].

Figure 4 reveals the convergence performance of our two proposed algorithms. In terms of general trends, the reward values of both curves increase and then converge after a period of time. However, the number of training episodes needed to reach convergence and the final converged reward value differ because of the differences between the two methods. As can be seen from the figure, the Q-learning method reaches convergence after about 280 training episodes, while the DQN method reaches convergence after about 250 training episodes. In addition, the DQN-based approach achieves a higher reward value at final convergence.

Figures 5 and 6 show how the total system cost and the average delay are affected by changes in the number of vehicles, respectively. We plot the curves with the independent variable varying in the range of 10-50. According to the optimization objective, the number of vehicles has a direct effect on the system cost and the average delay, so the general trend of all curves is that the system cost and the average delay are positively related to the number of vehicles. As joint computation offloading and resource allocation schemes, our proposed Q-learning-based and DQN-based methods always perform better than the other two schemes in the figures, although their system costs also keep increasing. This is because our proposed schemes make the offloading decisions of the RSUs and the CS cooperatively, which increases the utilization efficiency of network resources and greatly reduces the system cost. For example, when the number of vehicles is equal to 30, our proposed Q-learning-based method reduces the system cost by about 16% and 5% compared to the classical solutions RORA and GORA, respectively. In the same case, our proposed DQN-based scheme reduces the system cost by 18% and 7.5% compared with the two classical schemes. When we take the average delay as the dependent variable, the performance improvement of our proposed Q-learning-based method relative to RORA and GORA reaches 18% and 4%, and likewise, the improvement of our proposed DQN-based method relative to RORA and GORA reaches 19.9% and 9%.

Figures 7 and 8 show how the total system cost and the average delay are affected by changes in the computation capacity of the RSUs, respectively. Compared with the two previous figures, the curves in Figures 7 and 8 change more dramatically. In general, the system cost and the average delay decrease as the computation capacity of the RSUs increases, although the performance varies between schemes. Clearly, our two proposed solutions have performance advantages. Specifically, when the computation capacity of the RSUs is equal to 30 GHz, the performance improvement of our proposed Q-learning-based method compared to RORA and GORA reaches 49% and 42%, and that of our proposed DQN-based method reaches 62% and 57%.

In addition, to reflect the impact of the two key techniques, experience replay and freezing the Q-target network, on the DQN method, we added two experiments in which one of the techniques was removed, under the same experimental conditions; the results are shown in Figure 9. Since these two techniques mainly help DQN by eliminating data correlation and speeding up convergence, the performance of the scheme after removing either of them is similar to that of the Q-learning method. From the evaluations, the two key techniques have a clearly additive effect on the performance of the DQN method; without them, the performance of DQN is close to that of Q-learning.

In summary, we have compared our proposed schemes with RORA and GORA. The experimental results clearly show that the proposed schemes minimize the system cost and have a clear advantage over the other two schemes. Between the two proposed schemes, the DQN-based method performs better than the Q-learning-based method because of the advantages of deep neural networks.

7. Conclusion

In this paper, we propose a joint computation offloading and resource allocation scheme for the edge-cloud network. The optimization problem aims to minimize the system cost, including the computation cost and the radio cost, while taking the capacity constraints and the latency constraints into account. We formulate the optimization problem as a mixed-integer nonlinear programming problem. A Q-learning-based method for computation offloading and resource allocation is then developed to enable tractable analysis. To avoid the curse of dimensionality caused by the two-dimensional table structure of Q-learning, we further propose a DQN-based algorithm to solve the optimization problem. Through a series of comparative experiments, it is clear that the proposed schemes perform well in minimizing the system cost.

Data Availability

The processed data required to reproduce these findings cannot be shared at this time as the data also forms part of an ongoing study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant no. U1703261). The corresponding author is Shouzhi Xu.