Abstract

The number of computing platforms is becoming critical as delay-sensitive, computationally complex applications with enormous power consumption proliferate. These computing platforms attach to multiple small base stations and macro base stations and, if appropriately allocated, can optimize system performance. The arrival rate of computing tasks is often stochastic under the time-varying wireless channel conditions of the mobile edge computing Internet of things (MEC IoT) network, making it challenging to implement an optimal offloading scheme. The user needs to choose the best computing platforms and base stations to minimize the task completion time and consume less power. In addition, the reliability of our system, in terms of the computing resources (power, CPU cycles) each computing platform consumes to process the user's task efficiently, needs to be ascertained. This paper implements a scheme for offloading computational tasks to a high-performance processor through a small base station and to a collaborative edge device through a macro base station, considering the system's maximum available processing capacity as one of the optimization constraints. We minimized the latency and energy consumption, which constitute the system's total cost, by optimizing the choice of computing platform, the choice of base station, and the resource allocation (computing, communication, power). We model the problem as a Markov decision process and solve it with deep reinforcement learning (DRL) using the actor-critic architecture. Simulation results showed that our proposed scheme achieves significant long-term rewards in latency and energy costs compared with random search, greedy search, deep Q-learning, and schemes that implement either purely local or purely edge computation.

1. Introduction

The wide spread of advanced applications such as interactive online gaming, face recognition, autonomous driving, real-time object recognition, virtual reality (VR), and augmented reality (AR) poses a significant computational challenge for Internet of things (IoT) devices. This problem is due to the limited processing, memory, and energy resources of these IoT devices [1–3]. The IoT device's limited computational ability drastically reduces the user's quality of experience (QoE) and the device's performance. One way to address this problem is to employ mobile cloud computing (MCC) technology to assist the IoT device with powerful computing and storage capability. However, relying on MCC for the computational offloading service cannot support delay-sensitive applications: the distance from users introduces latency, and deep fading causes data loss and unreliable wireless connections. Moreover, MCC's geographically centralized position causes congestion when there is explosive growth in users' computational demand [4, 5].

To deal with the drawbacks of MCC, mobile edge computing (MEC), located at the edge of the network, is used to provide computing services [6, 7]. A MEC server provides more computational capability than the IoT devices, but its computational capacity is less than MCC's because of its distributed structure [6]. MEC enhances users' QoE due to its proximity to IoT devices; it relieves traffic congestion at the core network thanks to the distributed structure of MEC servers and supports delay-sensitive applications because transmission latency is significantly reduced [6, 8]. Nevertheless, an efficient MEC offloading scheme should still consider whether offloading the IoT devices' computational tasks is required, given the limited resources: the user's instantaneous available processing capacity (APC) should always be greater than the user's required processing capacity (RPC) before deciding to offload [9]. It is important to note that the stochastic nature and frequency of MEC tasks mean that the instantaneous computational demand and input data cannot be carried over to the next scheduling interval; hence, tasks are broken into units that can be efficiently managed within the scheduling interval. Moreover, offloading from IoT devices to MEC servers uses a dynamic wireless channel, so a proper offloading scheme must also consider the channel's variability [10–14]. It is usually best to select offloading devices based on the strength of their radio links and the critical need for additional computation facilities. Furthermore, power constraints are the major factor limiting IoT devices' computation capacity, causing them to frequently query the limited MEC servers for computational resources. Hence, to further improve the quality of service (QoS) for IoT devices and reserve the MEC for critical requirements, renewable energy harvesting (EH) can be deployed to extend the battery lifetime of IoT devices [15–18]. The IoT device can capture ambient renewable energy such as solar radiation, radio-frequency (RF) signals, wind power, and human kinetic motion. These renewable energy sources provide the greener energy required by the IoT central processing unit (CPU) and the radio transceiver through an EH module [13, 19, 20].

In [21], the computation of tasks offloaded from IoT devices was implemented using satellites and gateways in satellite communication: a task can be handled by the satellite, or the satellite can serve as a medium for transferring the task to the gateway, where it is computed. When this idea is adapted to terrestrial communication, a computing task from IoT devices can be handled by the high-performance processor connected to the small base station, and the small base station can also serve as a medium of transfer to a collaborative edge device connected to the macro base station, where the computation of the task is completed. Reference [21] does not consider the effect of user computation on overall energy consumption and latency reduction, nor does it consider efficiently allocating the available power (CPU and RF modules) to improve the system's available processing capacity. In summary, we considered both the user computation and the available power in our overall energy and latency optimization, in addition to the choice of computing platform, the selection of the base station, and the resource allocation considered in [21].

The offloading scheme should adapt to the practical scenario in which IoT devices' tasks are not predictable. Furthermore, due to the stochastic nature of MEC tasks, the computation demand and input data cannot be carried over to the next scheduling interval but must be processed immediately when the task enters the network. To achieve this, the user's available processing capacity (APC) at the instant of task arrival should be greater than the user's required processing capacity (RPC) [9].

Moreover, the APC is a function of the execution latency [9]; improving the latter indeed improves the APC. In [9], the resource allocation policy ensures that the user's APC exceeds the RPC for computing tasks. In a practical scenario, if the computing capacity of the MEC is constant, then for this condition always to hold there arise cases where a user's task may not gain access to the system: there may be insufficient power to allocate properly between the CPU and RF modules to make the APC exceed the RPC, and the computing capacity of the MEC is finite. As a result, delay-sensitive computing tasks may miss their deadlines. Therefore, to support the proposition that the APC is always greater than the RPC, the computing capacity of the MEC can be enhanced by forming D-D connections with all the unengaged devices within the proximity of the MEC, which leads to a better quality of experience (QoE) from the user's perspective. Also, to utilize the power resources efficiently, the exact amount of power needed by the CPU and RF modules of the chosen computing platform to maximize the APC needs to be ascertained, as power is a limiting resource. In addition, since many of these computing devices are attached to either small base stations or macro base stations, the particular base station a user should associate with, out of the many available, needs to be determined so as to reduce the cost in energy consumption and delay. A resource allocation policy that considers user association, the offloading decision, the computing platform, and power as a limiting resource, with the sole aim of maximizing the APC, will not only promote users' QoE but also adapt the scheme to the practical scenario of unpredictable, stochastic arrivals of computing tasks into the network. Solving these problems with conventional mathematical optimization is complex and results in a high implementation cost and severe delays, as the optimization must run at every time slot of the resource allocation policy. Deep reinforcement learning, by contrast, leads to a less complicated and lower-cost implementation.

Motivated by the observations above, this paper focuses on MEC IoT networks with multiple users, multiple high-performance processors, and multiple collaborative edge devices. The high-performance processor provides access to the collaborative edge device for tasks offloaded from IoT nodes/users that are not handled by the high-performance processor itself. The IoT devices are power limited, and their computing and communication resources are scarce. Therefore, the choice of computing platform, the choice of base station, and the resource allocation for task offloading should be investigated jointly to minimize the MEC IoT system's total cost due to latency and energy consumption.

It is quite challenging to solve the formulated optimization problem using standard techniques due to its nonconvexity. The primary contributions of this paper are as follows:
(i) We demonstrated how the total cost can be optimized considering three-tier computing platforms with multiple users, high-performance processors, and collaborative edge devices. Unlike existing methods, we considered joint user association, offloading decision, and computation and communication resources to optimize the system's service latency and energy consumption under a time-varying channel state and optimal APC.
(ii) We increased our system's computing capacity by making it possible for the small base station's high-performance processor to assist in processing tasks offloaded from IoT nodes within the small base station's coverage. Similarly, the collaborative edge device replaces the conventional MEC server seen in other schemes: its task edge device can intelligently form D-D connections with all the unengaged devices (resource edge devices) within its vicinity to complete task execution for the user. Note that tasks handled at the collaborative edge device must pass through the high-performance processor.
(iii) We formulated the problem of minimizing latency and energy consumption while optimizing the choice of computing platform for execution, the base stations to associate with, and the resource allocation. We intelligently determine the power the computing task requires to maximize the APC. We adopted the actor-critic architecture of deep reinforcement learning proposed in [19, 22] to carry out our investigation.
(iv) We compared our method's performance with other well-known benchmark algorithms (random search, greedy search, deep Q-learning) together with purely local and purely edge computation schemes and show that our scheme provides a significant reduction in the total cost due to latency and energy consumption.

1.1. Related Work

The study conducted by the European Telecommunications Standards Institute (ETSI) first gave the motivation, definition, protocol architecture, and challenging issues of MEC [23, 24]. The computational task offloading mechanism in edge computing largely determines the MEC system's all-around performance. The power constraints of IoT devices led to energy-efficient computation offloading, as seen in [23, 25, 26]. In [25], an optimization method based on actual data measurements was suggested to save users' energy consumption by jointly formulating the scheduling and computation problem. Reference [26] proposed the system cost, which considers delay and task failure, as a performance metric for dynamic computation offloading, deploying Lyapunov optimization for the energy harvesting process. The energy usage and task delay of a MEC method do not depend only on processing the task and transmitting it; they can be further improved by optimizing radio resources jointly with computation offloading [13, 27].

Reference [13] investigated resource allocation for the multiuser MEC offloading problem under TDMA and OFDMA scenarios. Likewise, in [27], joint optimization of computation task scheduling and radio resource allocation was carried out for multi-access-assisted computation offloading. In all these works, the MEC maintained a static position, unlike [23], in which flying UAVs serve as the mobile edge servers. Also, [28] proposed a mobile edge computing framework aided by a UAV-mounted cloudlet, jointly formulating the UAV trajectory and bit allocation problems to reduce mobile energy consumption through a nonconvex optimization solution. The role of edge computing in enhancing the industrial IoT was examined in [21, 29, 30]. The nondominated sorting genetic algorithm, as proposed by [31], was used to study the trade-off between energy consumption and delay for a user-MEC system and further proved that the algorithm can solve problems of similar kinds. A partial and binary offloading scheme for a three-tier computational platform was proposed in [32], which improves the node's energy efficiency by optimizing both computing and communication resources. Information asymmetry and information uncertainty were exploited in [33] to improve system service in a vehicular fog computing network. In [34], under a user-MEC system, the system performance was enhanced by jointly considering three indices: admission control, power control, and resource allocation. However, the growing numbers of IoT devices and of small or macro base stations hosting edge computing devices in the above schemes must be considered so that users are associated appropriately for computational purposes. This consideration is necessary because the number of base stations (small or macro) is generally small compared with the number of IoT nodes.

Reference [35] carried out joint computation and user association for multitasks MEC systems to minimize the overall energy consumption.

Some works focus on the supply of renewable energy and EH. References [19, 36] suggested a security disjoint routing-based verified message (SDRVM) scheme using energy data from Denver's National Solar Radiation Database; here, solar energy collection and battery storage are included in the energy consumption model. Reference [37] described the communication between EH wireless devices (WDs) and energy-transmitting devices by suggesting a wireless powered communication (WPC) model along with a radio-frequency (RF) energy receiver model. Reference [38] optimized capacity using the number of computed bits and the causality of energy harvesting as constraints. By proposing a learning-based computation offloading scheme for IoT devices with EH, [1] maximized system utility.

Works addressing data transmission delay and energy consumption in wireless networks include the following. Reference [39] improved delay and energy consumption efficiency by changing forwarder nodes and duty cycles, using a packet aggregation routing scheme (AFNDCAR) to monitor node residual energy. In [14], the alternating direction method of multipliers (ADMM) was applied to optimize computation through the joint optimization of local computing or offloading to a MEC server and the allocation of transmission time. To reduce the average offloading delay, [40] follows an adaptive learning-driven task scheme based on the multiarmed bandit algorithm. Deep learning (DL) and DRL are used to optimize computation offloading and resource distribution, or transmission delay and energy consumption. An example is [41], which minimized the long-term device expense by formulating a joint offloading and edge server provisioning problem based on RL; using a postdecision state (PDS) learning algorithm, the unique structure of state transitions in the MEC system was also exploited.

The actor-critic RL approach was used in [22] to jointly optimize caching, computing, and communication. A two-layered RL algorithm was used in [42] to find the trade-off between latency and energy consumption for a resource-constrained IoT device offloading computing tasks to the MEC server. Reference [43] used a three-layer neural network in the design of a DRL-based offloading model to learn the optimal offloading strategy under varying data transmission rates. A DRL-based joint optimization scheme for computing task scheduling and wireless resource allocation in a vehicular network was proposed in [44] by modelling the communication and edge computing systems. Likewise, in [45–48], DRL was adopted for efficient task computation offloading to the edge server to minimize both latency and energy consumption.

2. System Model

A typical MEC IoT network with multiple users, multiple high-performance processors, and numerous collaborative edge devices, forming the three computing platforms, is presented in Figure 1.

We design a MEC system consisting of three parts, each equipped with a computing platform for task execution:
(i) IoT nodes layer: this includes smartphones, visual terminals, etc. Let $\mathcal{N} = \{1, 2, \dots, N\}$ denote the collection of IoT nodes.
(ii) Service equipment layer: this comprises the small base stations and high-performance processors. Let $\mathcal{M} = \{1, 2, \dots, M\}$ denote the collection of high-performance processors. Since each high-performance processor is attached to one small base station, the system contains $M$ high-performance processors and $M$ small base stations.
(iii) MEC layer: this consists of the macro base stations and collaborative edge devices. Let $\mathcal{K} = \{1, 2, \dots, K\}$ denote the collection of collaborative edge devices. Since each collaborative edge device is connected to one macro base station, there are $K$ collaborative edge devices and $K$ macro base stations in the MEC model.

This paper uses the small base station and high-performance processor interchangeably. Similarly, the macro base station and collaborative edge device refer to the same thing as we have earlier mentioned that the high-performance processor is hosted on a small base station and the collaborative edge device is hosted on a macro base station.

We note that the IoT devices and small base stations have energy harvesting capability and tap renewable energy from an external source. In contrast, the task edge device, a subset of the collaborative edge device, is powered by the conventional power grid. The three computing platforms' computing capabilities can thus be ordered as IoT node < high-performance processor < collaborative edge device. Data flow from the IoT nodes to the collaborative edge device through the high-performance processor; a task scheduled to be processed at the collaborative edge device passes through the high-performance processor but is not handled there. We considered a system with multiple IoT nodes, service equipment layers, and MEC layers. The user needs to associate with the best small base station, macro base station, and computing platform to reduce the cost due to latency and energy consumption. Table 1 lists the key notations used in this work.

2.1. Collaborative Edge Device

A collaborative edge device is a set comprising a task edge device and all the resource edge devices that can form a D-D connection with it; the connection between the task edge device and the resource edge devices is wired. Let us denote a task edge device by $e_k$ and the set of resource edge devices that can establish a connection with it by $\mathcal{R}_k$. We then have $c_k = \{e_k\} \cup \mathcal{R}_k$, where $c_k$ is collaborative edge device $k \in \mathcal{K}$.

2.2. Communication Model

Our MEC system adopts OFDMA in the two transmission segments, with $S_1$ and $S_2$ orthogonal subcarriers represented by the sets $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively. Let the subcarrier bandwidths of the small base station and macro base station be denoted $B^{s}$ and $B^{m}$, respectively. The uplink transmission rate for the first transmission segment, from user $n$ to small base station $m$, is given by

$$r_{n,m}(t) = \sum_{s \in \mathcal{S}_1} x_{n,s}\, B^{s} \log_2\!\left(1 + \frac{p_{n,s}\, |h_{n,s}|^2}{N_0 B^{s}}\right). \qquad (1)$$

Similarly, the transmission rate for the second segment, from small base station $m$ to collaborative edge device $k$, is given by

$$r_{m,k}(t) = \sum_{s \in \mathcal{S}_2} x_{m,s}\, B^{m} \log_2\!\left(1 + \frac{p_{m,s}\, |h_{m,s}|^2}{N_0 B^{m}}\right), \qquad (2)$$

where $p_{n,s}$ and $p_{m,s}$ represent the power allocated to subcarriers $s \in \mathcal{S}_1$ and $s \in \mathcal{S}_2$, respectively, and $h_{n,s}$ and $h_{m,s}$ denote their respective channel gains. $N_0$ is the power spectral density of the additive white Gaussian noise (AWGN). $x_{n,s}$ ($x_{m,s}$) is an indicator variable that is 1 when subcarrier $s$ is assigned and 0 when no subcarrier is assigned. To avoid interference, the indicator variables should satisfy

$$\sum_{n \in \mathcal{N}} x_{n,s} \le 1 \quad \forall s \in \mathcal{S}_1, \qquad \sum_{m \in \mathcal{M}} x_{m,s} \le 1 \quad \forall s \in \mathcal{S}_2.$$
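As an illustration of (1) and (2), the following minimal sketch computes a per-link OFDMA rate under the assumed Shannon-capacity form; the function and parameter names are ours, not the paper's notation.

```python
import numpy as np

def ofdma_rate(x, p, h, bandwidth, n0):
    """Per-link OFDMA rate as in (1)/(2): sum over assigned subcarriers
    of B * log2(1 + p|h|^2 / (N0 * B)).

    x: (S,) 0/1 subcarrier-assignment indicators
    p: (S,) power allocated to each subcarrier [W]
    h: (S,) complex channel gains
    bandwidth: subcarrier bandwidth B [Hz]
    n0: noise power spectral density N0 [W/Hz]
    """
    snr = p * np.abs(h) ** 2 / (n0 * bandwidth)
    return float(np.sum(x * bandwidth * np.log2(1.0 + snr)))

# Example: 8 subcarriers, 3 of them assigned to this user.
rng = np.random.default_rng(0)
S = 8
x = np.zeros(S); x[:3] = 1
p = np.full(S, 0.1)                                              # 100 mW each
h = (rng.normal(size=S) + 1j * rng.normal(size=S)) / np.sqrt(2)  # Rayleigh fading
print(f"r = {ofdma_rate(x, p, h, 15e3, 4e-21):.2e} bit/s")
```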

2.3. Computation Model

We assume that the IoT nodes, high-performance processors, and collaborative edge devices can change their CPU computational frequency to adapt their power consumption and execution latency using the dynamic voltage scaling (DVS) technique [9]. We can model their computational power as

$$p^{z} = \kappa (f^{z})^{3}, \quad z \in \{u, h, c\}, \qquad (3)$$

where $p^{u}$, $p^{h}$, and $p^{c}$ denote the input powers of the IoT device, high-performance processor, and collaborative edge device, respectively; $f^{u}$, $f^{h}$, and $f^{c}$ denote their CPU computational frequencies measured in Hz; and $\kappa$ denotes the effective capacitance coefficient, which depends on the chip architecture. We set the same $\kappa$ for all the computing platforms and impose frequency upper bounds $f^{u,\max}$, $f^{h,\max}$, and $f^{c,\max}$, which implies that the computational power satisfies $0 \le p^{z} \le \kappa (f^{z,\max})^{3}$. The power available for the first transmission segment is constrained by the available power of the IoT node covered by the small base station:

$$p_n^{\mathrm{cpu}}(t) + p_n^{\mathrm{rf}}(t) \le P_n. \qquad (4)$$

Similarly, the second transmission segment is constrained by the available power of the high-performance processor:

$$p_m^{\mathrm{cpu}}(t) + p_m^{\mathrm{rf}}(t) \le P_m, \qquad (5)$$

where $p_n^{\mathrm{cpu}}$ and $p_m^{\mathrm{cpu}}$ are the powers drawn by the CPU modules of the IoT node and the high-performance processor, respectively, $p_n^{\mathrm{rf}}$ and $p_m^{\mathrm{rf}}$ represent the powers drawn by their RF modules, and $P_n$ and $P_m$ are the available powers present in IoT node $n$ and high-performance processor $m$, respectively. Consistent with other schemes, we ignore the power received by the RF modules, as its contribution is negligible.
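The following sketch illustrates the cubic DVS model (3) and the power split implied by (4); the capacitance value and function names are assumptions for illustration only.

```python
def dvs_cpu_power(freq_hz, kappa=1e-27):
    """Cubic DVS model (3): CPU power = kappa * f^3 (kappa is illustrative)."""
    return kappa * freq_hz ** 3

def split_power_budget(p_available, freq_hz, kappa=1e-27):
    """Split a node's available power between CPU and RF modules per (4):
    whatever the CPU does not consume at frequency f is left for the radio.
    Returns (p_cpu, p_rf); raises if the CPU alone exceeds the budget."""
    p_cpu = dvs_cpu_power(freq_hz, kappa)
    if p_cpu > p_available:
        raise ValueError("CPU frequency infeasible for the available power")
    return p_cpu, p_available - p_cpu

p_cpu, p_rf = split_power_budget(p_available=0.5, freq_hz=5e8)  # 0.5 W, 0.5 GHz
print(f"CPU: {p_cpu:.3f} W, RF budget: {p_rf:.3f} W")
```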

2.4. Latency Model

For every task $i$, we can compute the latency on any of the three computing platforms. This latency includes waiting, transmission, processing, and propagation components. We examine the three possibilities in evaluating the system's latency; when a task is processed at the IoT node, only the processing latency applies.

2.4.1. Task Handled at the User

For processing task $i$ at the user, we express the latency as

$$T_i^{u}(t) = \frac{d_i(t)}{f_i^{u}(t)}, \qquad (9)$$

where $d_i(t)$ is the number of bits in task $i$ at time slot $t$ and $f_i^{u}(t)$ is the computing capacity allocated to task $i$ at time slot $t$ by the IoT node.

2.4.2. Task Handled at the High-Performance Processor

If task $i$ is handled at the high-performance processor $m$, we express the latency as

$$T_i^{h}(t) = T_i^{w}(t) + \frac{d_i(t)}{r_{n,m}(t)} + \frac{d_i(t)}{f_i^{h}(t)} + \frac{l_{n,m}}{c}. \qquad (10)$$

We use $\tau$ and $T_i^{w}(t)$ to denote the length of a time slot and the waiting latency, respectively. $r_{n,m}(t)$ denotes the link's allocated communication capacity from user $n$ to high-performance processor $m$, and $l_{n,m}/c$ denotes the propagation latency for task $i$ processed by high-performance processor $m$, where $l_{n,m}$ is the distance between IoT node $n$ and high-performance processor $m$ and $c$ is the speed of light. We ignore the transmission delay from the small base station back to the IoT node because of the smaller number of bits on the return link, but not its propagation delay.

2.4.3. Task Handled at the Collaborative Edge Device

If task $i$ is processed at the collaborative edge device $k$, it is offloaded through the small base station to the collaborative edge device without being processed at the high-performance processor. Thus, we express the latency as

$$T_i^{c}(t) = T_i^{w}(t) + \frac{d_i(t)}{r_{n,m}(t)} + \frac{d_i(t)}{r_{m,k}(t)} + \frac{d_i(t)}{f_i^{c}(t)} + \frac{l_{n,m} + l_{m,k}}{c}, \qquad (11)$$

where $r_{m,k}(t)$ is the communication capacity allocated to the second transmission segment from the small base station to the collaborative edge device, $f_i^{c}(t)$ is the computing capacity assigned to the task by the collaborative edge device, and $l_{m,k}$ is the distance between high-performance processor $m$ and collaborative edge device $k$. As before, the transmission latency from the collaborative edge device back to the high-performance processor and from the high-performance processor back to the user is omitted.

From equations (9)–(11), the cost due to latency can be defined as

$$T_i(t) = \alpha_i^{u}(t)\, T_i^{u}(t) + \alpha_i^{h}(t)\, T_i^{h}(t) + \alpha_i^{c}(t)\, T_i^{c}(t). \qquad (12)$$

In (12), $\alpha_i^{u}(t), \alpha_i^{h}(t), \alpha_i^{c}(t) \in \{0, 1\}$ are the indicators showing where task $i$ is offloaded for processing. If $\alpha_i^{u}(t) = 1$, the task is handled at the IoT node; if $\alpha_i^{h}(t) = 1$, the task goes to the high-performance processor for processing; and if $\alpha_i^{c}(t) = 1$, the task goes to the collaborative edge device for processing. In addition, the indicators must satisfy $\alpha_i^{u}(t) + \alpha_i^{h}(t) + \alpha_i^{c}(t) = 1$. Considering the latency of all tasks, the total cost of the system due to latency is expressed as

$$C_T(t) = \sum_{i \in \Phi(t)} w_i\, T_i(t). \qquad (13)$$

In (13), we express the cost at time slot $t$ due to latency for our user-MEC system as the weighted sum of the latencies of all scheduled tasks. $\Phi(t)$ comprises all the scheduled new tasks to be associated with the small base stations at time slot $t$, and $w_i$ is the weight of each task. With reference to (9)–(11), $C_T(t)$ is related to the choice of computing platform, the choice of base station, and the resource allocation (computing, communication) at each time slot.
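To make the latency bookkeeping of (9)–(13) concrete, the sketch below (with illustrative names and units) evaluates the weighted latency cost for a mix of local and offloaded tasks.

```python
from dataclasses import dataclass

C_LIGHT = 3e8  # speed of light [m/s]

@dataclass
class Task:
    bits: float        # d_i(t), task size [bit]
    weight: float      # w_i, task weight
    platform: str      # "user", "hpp", or "ced" (the alpha indicators)

def task_latency(task, f_alloc, r1=None, r2=None, wait=0.0,
                 dist_nm=0.0, dist_mk=0.0):
    """Latency of one task per (9)-(11): processing everywhere, plus
    waiting/transmission/propagation terms for the offloaded cases."""
    t = task.bits / f_alloc                          # processing latency
    if task.platform in ("hpp", "ced"):
        t += wait + task.bits / r1 + dist_nm / C_LIGHT
    if task.platform == "ced":
        t += task.bits / r2 + dist_mk / C_LIGHT
    return t

def latency_cost(tasks_with_alloc):
    """Total latency cost (13): weighted sum over all scheduled tasks."""
    return sum(t.weight * task_latency(t, **kw) for t, kw in tasks_with_alloc)

tasks = [
    (Task(1e6, 1.0, "user"), dict(f_alloc=1e6)),
    (Task(2e6, 1.0, "ced"),  dict(f_alloc=5e7, r1=2e7, r2=5e7,
                                  wait=0.01, dist_nm=100, dist_mk=1000)),
]
print(f"C_T = {latency_cost(tasks):.3f} s")
```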

3. Available Processing Capacity

The APC of a user is given by the maximum available computation it can obtain in the time interval between $t$ and $t + \Delta t$. From [9], at time instant $t$, the user's APC is expressed as

$$A_n(t) = \lim_{\Delta t \to 0} \frac{C_n(t, t + \Delta t)}{\Delta t}, \qquad (14)$$

where $C_n(t, t + \Delta t)$ denotes the maximum available computation obtained within the time interval $t$ to $t + \Delta t$. It was also proved in [9] that the APC can be expressed further in terms of the allocated communication and computing capacities.

For any unpredictable task to be completed without extra delay, a sufficient condition is that the user's APC exceeds its required processing capacity (RPC). The RPC is the total computation demand per second of all tasks present at the user. Let us denote the APC of the segment from the user to the high-performance processor by $A_{n,m}^{(1)}(t)$ and that from the high-performance processor to the collaborative edge device by $A_{m,k}^{(2)}(t)$. The total APC is then given by

$$A_n(t) = A_{n,m}^{(1)}(t) + A_{m,k}^{(2)}(t).$$

This implies that, for all unpredictable tasks arriving at the user, $A_n(t) \ge \rho_n(t)$, where $\rho_n(t)$ is the required processing capacity at the user.

The APC of the segment from the user to the high-performance processor is limited by both the allocated link rate and the processor's computing capacity:

$$A_{n,m}^{(1)}(t) = \min\{\, r_{n,m}(t),\; f_m^{h}(t) \,\}.$$

Similarly, the APC of the segment from the high-performance processor to the collaborative edge device is given by

$$A_{m,k}^{(2)}(t) = \min\{\, r_{m,k}(t),\; f_k^{c}(t) \,\},$$

where the available powers that constrain $f_m^{h}(t)$ and $f_k^{c}(t)$ were given in (4) and (5), respectively, and the expressions for $r_{n,m}(t)$ and $r_{m,k}(t)$ were given in (1) and (2), respectively.
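Assuming the min-of-bottlenecks form above, a minimal admission check against the RPC condition can be sketched as follows (all names are illustrative).

```python
def total_apc(r1, f_hpp, r2, f_ced):
    """Total APC: each segment is limited by the slower of its link rate
    and the computing capacity behind it (a sketch of the min form above)."""
    return min(r1, f_hpp) + min(r2, f_ced)

def admit_task(rpc, r1, f_hpp, r2, f_ced):
    """RPC constraint: schedule the task only if APC >= RPC."""
    return total_apc(r1, f_hpp, r2, f_ced) >= rpc

# A task demanding 60 Mbit/s of processing against the two segments.
print(admit_task(rpc=6e7, r1=2e7, f_hpp=5e7, r2=5e7, f_ced=4e7))  # True
```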

We note the following constraints:
(i) RPC constraint: the resource allocation targets making the total APC of the system satisfy the RPC constraint, i.e., $A_n(t) \ge \rho_n(t)$ for every user $n$.
(ii) Server capacity constraint: let $F_m^{h}$ and $F_k^{c}$ be the total processing capacities of the high-performance processor and the collaborative edge device, respectively. We have that

$$\sum_{i \in \Phi_m(t)} f_i^{h}(t) \le F_m^{h}, \qquad \sum_{i \in \Phi_k(t)} f_i^{c}(t) \le F_k^{c}.$$

3.1. Energy Cost Model

We considered the power consumed at the small base station and the IoT device, as they are the two power-limited devices in our network. We omit the power consumed at the macro base station because its static offset power supply is stable: the energy consumption changes only slightly, since the macro base station receives its power from the conventional power grid. Therefore, the total energy cost due to the collaborative edge device hosted on this macro base station is not affected.

According to [49], we can model the power consumed at small base station $m$ as the sum of its static baseline power and its dynamically consumed power:

$$P_m(t) = P_m^{0} + P_m^{\mathrm{dyn}}(t),$$

where $P_m^{0}$ is the static power offset (baseband processor, cooling system) and $P_m^{\mathrm{dyn}}(t)$ denotes the total power consumed due to wireless transmission by small base station $m$. At time slot $t$, we express the components of the total consumed power of the small base station as

$$P_m^{\mathrm{dyn}}(t) = p_n^{\mathrm{loc}}(t) + p_n^{\mathrm{tx1}}(t) + \beta_m(t)\, p_m^{\mathrm{tx2}}(t),$$

where $p_n^{\mathrm{loc}}(t)$ is the power consumed at the IoT device for local execution at time slot $t$ under the coverage of small base station $m$.

$p_n^{\mathrm{tx1}}(t)$ denotes the transmit power of the first transmission segment at time slot $t$.

$p_m^{\mathrm{tx2}}(t)$ denotes the transmit power of the second transmission segment, between high-performance processor $m$ and collaborative edge device $k$.

$\beta_m(t)$ denotes the ratio of offloading between the high-performance processor $m$ and the collaborative edge device $k$.
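A minimal sketch of this energy model, with illustrative parameter values, is given below; the slot energy is simply the modeled power integrated over the slot length.

```python
def sbs_power(p_static, p_local, p_tx1, p_tx2, offload_ratio):
    """Power drawn at one small base station per the model above:
    static offset plus local execution, first-segment transmission, and
    the share of second-segment transmission that is actually offloaded."""
    return p_static + p_local + p_tx1 + offload_ratio * p_tx2

def energy_cost(slot_len, p_static, p_local, p_tx1, p_tx2, offload_ratio):
    """Energy cost of one time slot: power integrated over the slot length."""
    return slot_len * sbs_power(p_static, p_local, p_tx1, p_tx2, offload_ratio)

# A 10 ms slot: 6 W static, 0.2 W local CPU, 0.1 W uplink, 1 W backhaul,
# 40% of the traffic forwarded on to the collaborative edge device.
print(f"E = {energy_cost(0.01, 6.0, 0.2, 0.1, 1.0, 0.4):.4f} J")
```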

4. Problem Formulation

In light of the total cost due to latency and the energy consumption model of the previous section, we formulate our optimization problem to reduce the total cost due to latency and energy consumption at time slot $t$. Let $C(t) = w_T C_T(t) + w_E C_E(t)$, where $C_E(t)$ is the total energy cost. We formulate the problem as follows:

$$\min_{\{\alpha,\, x,\, p,\, f\}} \; C(t) = w_T C_T(t) + w_E C_E(t) \qquad (23a)$$

subject to

$$\sum_{i \in \Phi_m(t)} r_{n,m}^{i}(t) \le R_m^{\max} \quad \forall m, \qquad (23b)$$
$$\sum_{i \in \Phi_k(t)} r_{m,k}^{i}(t) \le R_k^{\max} \quad \forall k, \qquad (23c)$$
$$\sum_{i \in \Phi_m^{h}(t)} f_i^{h}(t) \le F_m^{\max} \quad \forall m, \qquad (23d)$$
$$\sum_{i \in \Phi_k(t)} f_i^{c}(t) \le F_k^{\max} \quad \forall k, \qquad (23e)$$
$$p_n^{\mathrm{cpu}}(t) + p_n^{\mathrm{rf}}(t) \le P_n \quad \forall n, \qquad (23f)$$
$$p_m^{\mathrm{cpu}}(t) + p_m^{\mathrm{rf}}(t) \le P_m \quad \forall m, \qquad (23g)$$
$$\sum_{n \in \mathcal{N}_m} p_n^{\mathrm{rf}}(t) \le P_m^{\mathrm{rf},\max} \quad \forall m, \qquad (23h)$$
$$\sum_{m \in \mathcal{M}_k} p_m^{\mathrm{rf}}(t) \le P_k^{\mathrm{rf},\max} \quad \forall k, \qquad (23i)$$
$$A_n(t) \ge \rho_n(t) \quad \forall n, \qquad (23j)$$
$$p_{(\cdot)}(t) \ge 0, \qquad (23k)$$
$$p_{(\cdot)}^{\mathrm{cpu}}(t) \le \kappa (f^{\max})^{3}, \qquad (23l)$$

where $\mathcal{N}_m$ denotes the set of users connected to small base station $m$ and $\mathcal{M}_k$ the set of small base stations connected to macro base station $k$.

Here, $\Phi_m(t)$ denotes the tasks associated with high-performance processor (small base station) $m$ at time slot $t$; it consists of the ongoing tasks and the new tasks scheduled to be processed at the high-performance processor at time slot $t$. $\Phi_k(t)$ represents all the tasks to be processed at collaborative edge device $k$ at time slot $t$, and $\Phi_k(t) \subseteq \Phi_m(t)$ because these tasks are offloaded to the collaborative edge device through the small base station (high-performance processor) at time slot $t$. $\Phi_m^{h}(t)$ consists of all the tasks scheduled to be processed at the high-performance processor itself at time slot $t$; $\Phi_m^{h}(t) \subseteq \Phi_m(t)$ because some of the new tasks are not processed at the high-performance processor but merely pass through it to the collaborative edge device, where they are processed. $R_m^{\max}$ and $R_k^{\max}$ are the highest communication capacities available for the transmission segments from the IoT devices/users to high-performance processor $m$ and from the high-performance processor to collaborative edge device $k$, respectively. Similarly, $F_m^{\max}$ and $F_k^{\max}$ are the highest computing capacities available at high-performance processor $m$ and collaborative edge device $k$, respectively. We also point out that any resources deployed for the ongoing tasks can no longer be utilized for the new tasks scheduled at time slot $t$. $w_T$ and $w_E$ represent the weight due to latency and the weight due to energy consumption, respectively, and must satisfy $w_T + w_E = 1$. Constraint (23b) signifies that the sum of all the communication resources (RF modules) allocated to the set of users connected to small base station $m$ should not exceed the maximum communication capacity the high-performance processor can offer. Similarly, constraint (23c) indicates that the sum of all the communication resources from the high-performance processors to the macro base station should not exceed the maximum communication capacity present at the collaborative edge device. Constraint (23d) signifies that the total computation offloaded to the small base station should not exceed the high-performance processor's processing capacity, and constraint (23e) notes that the total computation offloaded to the macro base station should not exceed the collaborative edge device's processing capacity. Constraints (23f) and (23g) require that the sum of the computation and communication powers not exceed the respective available powers. Constraints (23h) and (23i) bound the sums of the RF-module powers of all the users offloading to the small base station and, analogously, of all the small base stations offloading to the macro base station. Constraint (23j) makes the user's total APC exceed the required processing capacity $\rho_n(t)$ of user $n$. Constraints (23k) and (23l) indicate that power allocations must be positive and that the computational power has a maximum limit.

We note from (23) that the total cost due to latency and energy consumption depends on the choice of computing platform, the choice of base station users associate with, and the resource allocation at each time slot $t$. Moreover, these three coupled decisions at each time slot depend on the state left by the previous time slot. Hence, we formulate the problem as a dynamic programming problem over the metrics considered. Since it involves many variables, we deploy an actor-critic architecture based on deep reinforcement learning to solve it.

4.1. Latency and Energy Optimization Based on Deep Reinforcement Learning

The optimization problem formulated in this paper is a mixed-integer problem of a nonconvex nature. It is hard to solve due to its combination of discrete and continuous variables: the CPU-module and RF-module powers that make up the available power of each computing platform and the resource allocation variables are continuous, whereas the user association and offloading decision variables are discrete. To reduce the complexity of finding a solution, we use an actor-critic DRL-based algorithm to solve the joint user association, offloading decision, and resource allocation problem, which involves large state and action spaces. We represent the MDP by the tuple $(H, A, P, R)$, where $H$ is the system state space, $A$ represents the system action space, and $P$ denotes the probability of transitioning between states under a given action.

4.1.1. State (H)

The system state is expressed for all time slots, since all the parameters we want to optimize are defined therein. Hence, we represent the state at time slot $t$ by the following components:
(i) the collection of ongoing tasks connected to each high-performance processor (small base station) at time slot $t$;
(ii) the collection of ongoing tasks being computed at each high-performance processor at time slot $t$;
(iii) the collection of ongoing tasks being computed at each collaborative edge device at time slot $t$;
(iv) the collection of communication resources engaged at time slot $t$ from the users to the high-performance processors;
(v) the collection of communication resources engaged at time slot $t$ from the high-performance processors to the collaborative edge devices;
(vi) the collection of tasks not yet scheduled (served) at time slot $t$;
(vii) the set of computing resources allocated at time slot $t$ by the high-performance processors to the ongoing tasks;
(viii) the collection of computing resources allocated at time slot $t$ by the collaborative edge devices to the ongoing tasks;
(ix) the matrix of locations of all IoT nodes, high-performance processors, and collaborative edge devices;
(x) a vector of SINRs of the transmission segment between the users and the high-performance processors, and of the segment between the high-performance processors and the task edge devices;
(xi) the computational requirement of each task $i$;
(xii) the number $M$ of high-performance processors; and
(xiii) the number of task edge devices.
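For illustration, the components above can be grouped into a single observation container; the field names below are ours, not the paper's notation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SystemState:
    """One observation H(t) mirroring the state components listed above."""
    sbs_tasks: Dict[int, List[int]]        # ongoing tasks per small base station
    hpp_tasks: Dict[int, List[int]]        # tasks computing at each HPP
    ced_tasks: Dict[int, List[int]]        # tasks computing at each CED
    comm_seg1: Dict[int, float]            # engaged user->HPP capacity [bit/s]
    comm_seg2: Dict[int, float]            # engaged HPP->CED capacity [bit/s]
    unserved: List[int]                    # task ids not yet scheduled
    hpp_compute: Dict[int, float]          # computing capacity in use per HPP
    ced_compute: Dict[int, float]          # computing capacity in use per CED
    positions: Dict[str, Tuple[float, float]]  # node locations
    sinr: List[float]                      # SINRs of both segments
    task_demand: Dict[int, float]          # computational requirement per task
    num_hpp: int = 0
    num_ted: int = 0                       # number of task edge devices

    def to_vector(self) -> List[float]:
        """Flatten a few numeric fields for the DNN input layer (sketch)."""
        return (list(self.comm_seg1.values()) + list(self.comm_seg2.values())
                + self.sinr + [float(len(self.unserved))])
```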

4.1.2. Action (A)

The unserved tasks need to be associated with a small base station and a collaborative edge server at time slot $t$. Also, computing and communication resources need to be assigned according to the chosen computing platform and base station.

We define the action space at time slot $t$ for the choice of computing platform, the choice of base station a user can associate with, and the resource allocation (computing and communication) of the system as follows:
(i) whether an IoT device is capable of completing a task at time slot $t$;
(ii) whether task $i$ should be considered for processing at time slot $t$;
(iii) the small base station that associates with task $i$ at time slot $t$;
(iv) the computing platform on which task $i$ should be processed;
(v) the collaborative edge device used to handle task $i$;
(vi) the number of communication and computing resources allocated to the users by high-performance processor $m$ at time slot $t$; and
(vii) the number of communication and computing resources allocated by collaborative edge device $k$ to tasks offloaded from the high-performance processor at time slot $t$.
The indicator variables in (iii)–(v) are obtained from the corresponding association and offloading decision variables. Under a particular action $A(t)$, we obtain the choice of computing platform, the base stations with which users can associate, and the appropriate computing and communication resources.
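A corresponding per-task action container might look as follows (again with illustrative names only).

```python
from dataclasses import dataclass
from enum import Enum

class Platform(Enum):
    USER = 0   # local execution on the IoT node
    HPP = 1    # high-performance processor at a small base station
    CED = 2    # collaborative edge device at a macro base station

@dataclass
class Action:
    """One action A(t) for a single unserved task, mirroring the list above."""
    schedule: bool          # consider the task for processing this slot?
    sbs_id: int             # small base station to associate with
    platform: Platform      # where the task is processed
    ced_id: int             # collaborative edge device (if platform == CED)
    comm_seg1: float        # communication capacity on the first segment
    comm_seg2: float        # communication capacity on the second segment
    compute: float          # computing capacity on the chosen platform

a = Action(True, sbs_id=2, platform=Platform.CED, ced_id=0,
           comm_seg1=2e7, comm_seg2=5e7, compute=4e7)
```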

4.1.3. Transition Probability (P)

A model-free DRL framework is adopted here for two reasons. First, it is hard to obtain the MDP transition probability from one state to another under an action because some states contain continuous variables. Second, the problem involves large state and action spaces.

4.1.4. Reward (R)

We define the reward at time slot $t$ under state $H(t)$ and action $A(t)$, which minimizes the system's weighted cost due to latency and energy consumption, as

$$R(H(t), A(t)) = -\bigl(w_T C_T(t) + w_E C_E(t)\bigr). \qquad (26)$$

From (26), the state $H(t)$ and action $A(t)$ affect the reward $R(t)$. A reward negatively correlated with the weighted sum of the latency and energy costs is fed back for every newly scheduled task at each time slot $t$; for tasks that have not yet been scheduled, only the waiting latency contributes to $C_T(t)$, and $C_E(t)$ is zero.
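A direct sketch of (26):

```python
def reward(latency_cost, energy_cost, w_latency=0.5, w_energy=0.5):
    """Reward (26): the negative weighted sum of latency and energy costs,
    so that maximizing reward minimizes the total system cost."""
    assert abs(w_latency + w_energy - 1.0) < 1e-9  # weights must sum to 1
    return -(w_latency * latency_cost + w_energy * energy_cost)

# Scheduled task: both cost terms count.
print(reward(latency_cost=1.19, energy_cost=0.067))
# Unscheduled task: only waiting latency contributes; the energy term is zero.
print(reward(latency_cost=0.25, energy_cost=0.0))
```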

4.1.5. Policy and Value Function

We use a stochastic policy $\pi_{\theta_a}(a \mid s)$ to optimize the long-term performance of the action-selection strategy. We equally define the expected return of a trajectory that begins at time $t$ under state $s$ and action $a$ as

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R(t + k) \,\middle|\, s(t) = s,\; a(t) = a\right],$$

where $\gamma \in [0, 1)$ is a discount factor.

By selecting the greedy action, the policy that follows from the estimated Q-values for every state-action pair $(s, a)$ can be derived as

$$\pi(s) = \arg\max_{a} Q(s, a).$$
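The two definitions above amount to the following small routines:

```python
def discounted_return(rewards, gamma=0.99):
    """Expected return of a trajectory: sum_k gamma^k * r_{t+k}."""
    g = 0.0
    for r in reversed(rewards):   # fold from the end of the trajectory
        g = r + gamma * g
    return g

def greedy_action(q_values):
    """Greedy policy: pick the action with the largest estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(discounted_return([-1.2, -0.9, -0.5]))   # three slot rewards
print(greedy_action([-3.1, -1.4, -2.2]))       # -> action 1
```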

4.1.6. Evaluating Value Function

The value function can be approximated by $Q(s, a; \theta)$ using a fully connected DNN with several hidden layers, parameterized by a collection of weights $\theta$. To realize this, the two units of the DNN input layer introduce the system state $H(t)$ and action $A(t)$ to the hidden layers. The output of a neuron in layer $l$, which uses ReLU as the nonlinear activation function, is

$$y_l = \max\{0,\; \mathbf{w}_l^{\top} \mathbf{x}_l + b_l\},$$

where $y_l$ denotes the output value, $\mathbf{x}_l$ denotes the inputs of layer $l$, $\mathbf{w}_l$ denotes the weights associated with the neuron inputs, and $b_l$ is a bias. The estimated Q-value is provided by the output layer of the DNN. By repeatedly reducing the loss function, the DNN learns the best-fitting weights as follows [22]:

$$L(\theta) = \mathbb{E}\bigl[(y^{\mathrm{tar}} - Q(s, a; \theta))^{2}\bigr],$$

where $\theta$ represents the parameters of the neural network and the target value is $y^{\mathrm{tar}} = R + \gamma \max_{a'} Q(s', a'; \theta)$. The difference between the target value and the estimated value gives the error, also known as the temporal-difference (TD) error:

$$\delta = y^{\mathrm{tar}} - Q(s, a; \theta).$$
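The paper does not give an implementation, but one plausible PyTorch realization of such a critic and its TD loss is sketched below; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Fully connected Q-network: (state, action) -> scalar Q-value,
    with ReLU hidden layers as described above."""
    def __init__(self, state_dim, action_dim, hidden=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

critic = Critic(state_dim=16, action_dim=4)
s, a = torch.randn(32, 16), torch.randn(32, 4)
q = critic(s, a)                               # (32, 1) estimated Q-values
# TD target and loss for a batch (r and q_next would come from the environment
# and a target network in practice):
r, q_next = torch.randn(32, 1), torch.randn(32, 1)
target = r + 0.99 * q_next
loss = nn.functional.mse_loss(q, target.detach())
loss.backward()                                # gradients for the update
```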

4.1.7. The Actor-Critic Architecture of the Deep RL Method for Finding a Solution

The actor-critic algorithm shown in Algorithm 1 combines the concepts of policy-based and value-based methods by estimating the actor and critic parameter collections simultaneously [22, 50]. The actor-critic framework shown in Figure 2 comprises an actor that maps states to actions and a critic that maps state-action pairs to the expected long-term cumulative reward [22]. DNNs are used in both the actor and critic networks because of the accurate predictions they provide for the value function and policy [22]. The actor uses a parameterized stochastic policy to generate an action after observing the current state of the environment. The critic then evaluates the action performed by the actor through the TD error (or loss-function signal) obtained by estimating the Q-value after observing the reward and the next state of the environment. The critic's output guides the learning in both networks: the actor increases its action probabilities if the critic's outcome shows good performance and decreases them if it is poor. Similarly, the critic network parameters are updated to improve the predicted Q-value using the gradient descent method. The actor and critic DNNs are trained using an experience buffer.

(1) Require: data from the users/IoT nodes, data from the high-performance processors, data from the task edge devices, the positions of the nodes, the choice of computing platform, and the computing and communication resources
(2) Network initialization: initialize the parameters of the actor and critic networks
(3) for episode = 1 to the maximum number of episodes do
(4)   Renew the environmental situation of the proposed user-collaborative edge device model
(5)   Reset the state
(6)   Reset the accumulated reward to 0
(7)   for step = 1 to the maximum number of steps do
(8)     Choose an action in the simulation environment in line with the current policy
(9)     Obtain the reward and the next state
(10)    Cache the transition in the replay buffer as experience for training the actor and critic networks
(11)    Arbitrarily extract a minibatch of tuples from the buffer to train the primary networks of the actor and critic
(12)    Update the critic network parameters by gradient descent on the TD loss
(13)    Update the actor network parameters along the (natural) policy gradient
(14)    Update the two target networks' parameters every step by the soft update $\theta^{-} \leftarrow \zeta \theta + (1 - \zeta)\theta^{-}$, where $\zeta = 0.001$
(15)  end for
(16) end for
4.1.8. Critic DNN

One of the problems with this architecture is that it does not converge easily. To ease this convergence issue and improve the stability of the algorithm, we employ the fixed-target-network technique [51] and experience replay [52] to remove the problem of nonstationary targets and to break the temporal correlations arising from the different training episodes [22]. The replay buffer $D$ stores experiences in the form of tuples $(s, a, r, s')$ and provides minibatches that are sampled to update the DNN parameters $\theta$.

Using the fixed target network and experience replay techniques, the loss function can be expressed as

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\Bigl[\bigl(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\bigr)^{2}\Bigr],$$

where $\theta^{-}$ and $D$ are the parameters of the target network and the experience replay buffer, respectively. Differentiating the loss function with respect to the parameters $\theta$, the gradient of the loss updates the parameters of the critic DNN as follows:

$$\theta \leftarrow \theta - \eta_c \nabla_{\theta} L(\theta),$$

where $r + \gamma \max_{a'} Q(s', a'; \theta^{-})$ is the target value and $\eta_c$ denotes the learning rate of the critic. We use the average value over the minibatch to update the parameters; the target values are obtained from the critic network output and the immediate rewards.
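A minimal sketch of the experience replay buffer and the soft target-network update used in Algorithm 1 follows (the update rate ζ = 0.001 is taken from the listing; the plain-list parameter representation is purely illustrative).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay: store (s, a, r, s') tuples and sample
    uncorrelated minibatches for training, as described above."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        return random.sample(self.buf, batch_size)

def soft_update(target_params, primary_params, zeta=0.001):
    """Fixed-target trick: move each target weight a small step zeta toward
    the corresponding primary-network weight."""
    return [(1 - zeta) * t + zeta * p
            for t, p in zip(target_params, primary_params)]
```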

4.1.9. Actor DNN

The agent observes that most of the training samples with good rewards result from taking the right actions and then tries to increase the probabilities of selecting those actions; similarly, when poor actions yield negative rewards, it does not increase the probability of favouring them. We use a separate parameterized DNN to represent the actor network, whose initial policy $\pi_{\theta_a}$ is given by an arbitrary collection of parameters $\theta_a$. One unit of the actor's DNN input layer passes the current state to the hidden layers, and the actor's DNN output layer then decides the action for the current system state. Experience replay is also used in the actor network to cache the minibatch samples used to train the DNN. The policy objective function iteratively improves the policy to maximize the long-term average reward:

$$J(\theta_a) = \mathbb{E}_{s \sim \mu^{\pi}}\bigl[Q^{\pi}(s, \pi_{\theta_a}(s))\bigr],$$

where $\mu^{\pi}$ denotes the state distribution. Partially differentiating the objective function with respect to $\theta_a$ gives the gradient

$$\nabla_{\theta_a} J(\theta_a) = \mathbb{E}\bigl[\nabla_{\theta_a} \log \pi_{\theta_a}(a \mid s)\, Q^{\pi}(s, a)\bigr].$$

When the approximated Q-value $Q(s, a; \theta)$ is used in place of $Q^{\pi}(s, a)$, the partial differentiation of the objective function with respect to $\theta_a$ gives the approximated gradient

$$\nabla_{\theta_a} J(\theta_a) \approx \mathbb{E}\bigl[\nabla_{\theta_a} \log \pi_{\theta_a}(a \mid s)\, Q(s, a; \theta)\bigr].$$

The natural policy-gradient method is adopted because standard methods seldom converge to the local maximum and suffer from high variance. The natural policy-gradient process searches for the steepest ascent direction with respect to the Fisher information metric (FIM) given in [53], which is expressed as

$$F(\theta_a) = \mathbb{E}\bigl[\nabla_{\theta_a} \log \pi_{\theta_a}(a \mid s)\, \nabla_{\theta_a} \log \pi_{\theta_a}(a \mid s)^{\top}\bigr].$$

We obtain the natural gradient of the policy by transforming the standard gradient with the inverse FIM:

$$\tilde{\nabla}_{\theta_a} J(\theta_a) = F(\theta_a)^{-1} \nabla_{\theta_a} J(\theta_a).$$

The parameter $\theta_a$ is updated along the natural gradient with learning rate $\eta_a$:

$$\theta_a \leftarrow \theta_a + \eta_a \tilde{\nabla}_{\theta_a} J(\theta_a).$$
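A small numerical sketch of one natural-gradient step, estimating the policy gradient and the FIM from sampled score functions, is given below; all names and the damping term are illustrative additions, not the paper's specification.

```python
import numpy as np

def natural_gradient_step(theta, grad_log_probs, q_values, lr=1e-4, damping=1e-3):
    """One natural policy-gradient update: estimate the vanilla gradient and
    the Fisher information matrix from sampled score functions, then step
    along F^{-1} grad J."""
    glp = np.asarray(grad_log_probs)          # (T, dim) score functions
    q = np.asarray(q_values)                  # (T,) critic estimates
    grad_j = (glp * q[:, None]).mean(axis=0)  # policy-gradient estimate
    fim = (glp[:, :, None] * glp[:, None, :]).mean(axis=0)
    fim += damping * np.eye(len(theta))       # regularize before solving
    nat_grad = np.linalg.solve(fim, grad_j)   # F^{-1} grad J, no explicit inverse
    return theta + lr * nat_grad

theta = np.zeros(3)
glp = np.random.default_rng(1).normal(size=(64, 3))
q = -np.abs(np.random.default_rng(2).normal(size=64))
theta = natural_gradient_step(theta, glp, q)
```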

5. Numerical Results

This section evaluates, through computer simulations, the performance of our proposed joint optimization of computing and communication resources, choice of computing platform, and selection of the base stations users can associate with, in minimizing the total cost due to latency and energy consumption of our MEC system.

5.1. Simulation Setup

We considered a framework of $N$ IoT devices, $M$ high-performance processors, and $K$ collaborative edge devices. We assumed that each IoT device is at an equal distance from the small base station and from the macro base station. At a particular time slot $t$, each small base station covers one IoT device; similarly, each macro base station covers one high-performance processor per time slot. The maximum allocated power of the IoT devices, the maximum transmit power from an IoT device to the high-performance processor, the maximum transmit power from the small base station to the collaborative edge device, the maximum power at the collaborative edge device, each small base station's static power, and the maximum communication and computing capacities and bandwidths of the high-performance processor and the collaborative edge device are listed with the other scenario parameters in Table 2.

We used a fully connected DNN with two hidden layers of 300 neurons each and ReLU activation. Three hundred neurons per hidden layer maintain a trade-off between computational complexity and accuracy of the value-function approximation, as too many neurons lead to higher complexity. We generated two separate target networks for the actor and critic to regularize the learning algorithm and increase stability; the target networks' parameters are replaced periodically with the current estimates of their primary networks' parameters. We use an experience replay buffer of size 10000 for training the DNN, which returns minibatches of 64 experiences when needed. We set the number of episodes and the maximum number of steps per episode to 1000. The learning rates of the actor and critic are 0.0001 and 0.001, respectively. The other scenario and actor-critic DRL parameters are shown in Tables 2 and 3, respectively.
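As a concrete illustration of this setup, the fragment below builds actor and critic networks with the stated 300-neuron hidden layers and learning rates; the state and action dimensions are placeholders, not values from the paper.

```python
import torch.nn as nn
import torch.optim as optim

def make_mlp(in_dim, out_dim, hidden=300):
    """Two hidden layers of 300 ReLU units, as in the simulation setup."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

state_dim, action_dim = 16, 4                          # illustrative sizes
actor = make_mlp(state_dim, action_dim)
critic = make_mlp(state_dim + action_dim, 1)
actor_opt = optim.Adam(actor.parameters(), lr=1e-4)    # actor lr = 0.0001
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)  # critic lr = 0.001
```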

5.1.1. Convergence of the Proposed Scheme

In this subsection, we show the training results of the proposed scheme. The convergence of the estimated Q-value $Q(s, a; \theta)$ during training is shown in Figure 3.

Here we see that $Q(s, a; \theta)$ converges, which implies that the approximated Q-function is close to the target action-value function. The complexity due to the large action space in our work is reduced because our proposed scheme handles the large action space during the training process, and the offloading and resource allocation decisions are then obtained from the trained Q-network with no further iterations. Similarly, the total cost defined in (23) converges almost synchronously with the Q-value, as shown in Figure 4.

We can conclude that the algorithm not only converged but did so quickly, making it suitable for a dynamic environment. Our model's convergence analysis proceeded as follows. The gradient-sharing operation facilitated exploration in our approach, which naturally produced diverse policies among agents. This policy diversification opens up a larger search field without jeopardizing long-term convergence to the best policy. Likewise, a small weight on the entropy loss term preserves long-term convergence to the optimal policy without sacrificing exploration. Furthermore, the diverse policies produced by the gradient-sharing operation extended the algorithm's search space, increasing the chances of finding the rewarding path. Long-term convergence is difficult to achieve if the sharing process is not regulated; we achieved this regularization by selecting an appropriate learning rate and clipping gradients, in line with [54]. The algorithm's initial exploration was more thorough because of the diverse policies, but it converged to the optimal policy in the long run due to the lower bias. Having observed convergence, we then ran the other performance tests, which measured average service latency and average service energy usage.

5.1.2. Performance Analysis

In the MEC system, we compare our scheme’s performance with three benchmark algorithms, which are random search, greedy search, and deep Q-learning.

We averaged over 5000 tests for every point in the simulation results. From Figure 5, we found that our proposed algorithm outperformed the other benchmark algorithms in the system's total cost. As the number of users increased from 1 to 7, the total cost of every scheme rose, with our proposed scheme remaining the lowest throughout; overall, it improved the total cost by 42.03%, 24.64%, and 13.04% compared with random search, greedy search, and deep Q-learning, respectively. The rise in total cost results from sharing the limited resources among all the participating users in a time slot. However, our scheme can intelligently push more of the task computation to the users when there is competition for the computing and communication resources at the edge servers. Also, the users' renewable energy enables them to complete tasks when there is no appropriate edge device to associate with.

Figure 6 shows the dependence of the total cost on the weights of latency and energy consumption. The plots show a linear relationship between the total cost and the weight. For a delay-sensitive task, a higher latency weight is required to prioritize delay over energy consumption for execution on any of the three computing platforms. Our scheme outperformed the others because it associates the task with the best small base station or macro base station for faster computation.

In Figure 7, we see that increasing the number of small base stations decreases the system's total cost for all the schemes. As the number of small base stations varies, our method shows better performance because it associates each computing task with the appropriate small base station for processing. Our scheme accounts for the energy expended at the small base station and intelligently allocates the computing and communication resources to minimize energy consumption, which results in an overall reduction in total cost. We considered the small base station's energy consumption because the IoT nodes within its coverage receive their power from it for local computation and for offloading to the high-performance processor. The available power at the small base station also provides the transmit power to the collaborative edge device.

Similarly, the plot of total cost against the collaborative edge device's computing capability in Figure 8 shows that our scheme outperformed the other methods as that capability increased. Our collaborative edge device consists of one task edge device and several resource edge devices, and its capability is varied by the number of resource edge devices connected to it; the edge device's computing capability is not fixed as in the other schemes. Our system's task edge device adequately uses all the devices not otherwise engaged for computation. These resource edge devices increase its computing ability and complement the processing of tasks not handled at the high-performance processor.

We also see that deep Q-learning and our proposed scheme performed best in total cost when the edge servers' computing capabilities were varied. The total cost reduction is higher with these two schemes because their offloading rate is higher than that of greedy search and random search. However, our proposed scheme still achieved the lowest cost because of the high-performance processor, which assists with task computation before offloading to the collaborative edge device. The small base station boosts the user's transmit power for the onward journey to the collaborative edge device, contributing to our scheme's increased offloading rate.

The average delay of the proposed scheme, local computation, and edge computation as the number of users varies from 2 to 20 is shown in Figure 9. Local computation showed the weakest performance due to the low processing power of the IoT nodes: slow computation increases the average delay as tasks queue while waiting to be scheduled. Unlike local computation, edge computation and the proposed scheme showed a smaller average delay as the number of users increased, owing to the higher computing power of the MEC, which completes tasks at a faster rate. However, our scheme showed the best average delay because the capacity of the MEC is further improved, enabling it to handle more complex applications by forming D-D connections with all the unengaged devices within its vicinity.

6. Conclusion

This paper has shown an effective use of the three computing platforms for processing computationally intensive tasks. Our proposed algorithm chooses the best computing platform and optimally allocates computing and communication resources to process tasks. The high-performance processor between the user and the collaborative edge device increases our system's available processing capacity, adapting it to the practical scenario where task arrivals are random and unpredictable. Our approach ensures that the appropriate computing resources are attained to increase the available processing capacity. It also ensures that our system can reliably handle complex and unpredictable computing tasks arriving at the network by keeping the available processing capacity (APC) always greater than the required processing capacity (RPC). We employ actor-critic deep reinforcement learning to solve the joint problem of allocating resources and deciding on the best computing platform. Simulation results showed a minimized cost of latency and energy consumption when adopting our proposed scheme. In future work, the priority of a user's task in accessing the network should be considered, and the maximal APC obtained in each scheduling interval should account for high-demand applications and user satisfaction.

Data Availability

The dataset used to support this article’s conclusion is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.