The increasing scale of tasks in cloud networks makes load balancing, and the improvement of its associated parameters, a challenging problem. In this paper, we propose a hybrid scheduling policy that combines the Particle Swarm Optimization (PSO) algorithm with the actor-critic algorithm, named Hybrid Particle Swarm Optimization Actor Critic (HPSOAC), to address this issue. The policy helps each agent improve through individual learning as well as through exchanging information with other agents. Experiments are carried out using a Python simulator with TensorFlow. The results show that, compared with Deep Q-network (DQN) and Q-learning based on Modified Particle Swarm Optimization (QMPSO), the proposed scheduling policy reduces energy consumption by 5.16% and 10.86% and makespan time by 7.13% and 10.04%, respectively, and achieves marginally better resource utilization.

1. Introduction

Cloud networking is now a popular technology that delivers various services to clients over the internet. As a result, the number of client requests keeps increasing, while the number of servers, or physical machines (PMs), in a datacenter is limited. To handle a large number of requests, virtualization is used: one server is logically divided into a number of virtual machines (VMs) that together handle all incoming user requests [1]. Because the cloud network is dynamic, loads fluctuate over time, which makes it difficult to allocate a suitable VM. The result is a load balancing problem among VMs that directly affects the PMs. This problem leads to high cost, lower profit for the organization, and degraded system performance. Each PM consumes electric power to do its work, so a server with more workload consumes more energy [2]. To overcome this problem, a better scheduling algorithm is required, one that balances the load among the VMs, executes all incoming tasks in less time, and consumes less energy in the datacenter. Our objective is therefore to transfer extra tasks to underloaded machines so that task execution completes in less time, and to manage server load so that energy consumption is minimized while resource utilization is maximized.

The basic framework of the cloud network is shown in Figure 1. It contains two layers: the task layer and the datacenter layer. These two layers store the information needed to achieve our objective. In the task layer, the information about each incoming task, such as expected finishing time (EFT), task transfer speed, task length (), and file size (), is stored in the task queue and used to find the best VMs in the datacenter. Similarly, the datacenter layer contains the required information about both servers and VMs, such as storage unit, processing unit, data transfer capacity, and processing speed. This data is forwarded to the scheduler, which applies the scheduling policy to accomplish the objective and obtain an optimum result.

The arrival of an enormous number of jobs at the datacenter creates a complex and dynamic environment. If the environment cannot be adequately predicted, a reasonable and effective scheduling algorithm cannot be successfully applied. Because upcoming activities are unpredictable and the environment is dynamic, it is very difficult to represent the environment correctly in practice. There are no patterns that can be used to predict how tasks will arrive at the datacenter, so the number and size of the tasks that will be assigned in the future are unclear. As a result, the algorithm has to schedule tasks immediately, with no prior experience or information. Task execution time is affected by many factors, for example network bandwidth, speed, the processing performance of multiple machines, capacity, and the load on the resources required to support task execution. The demand for datacenter services or jobs fluctuates depending on the time period, environmental conditions, and other factors. When a server is used for an extended period of time, it consumes more energy. Energy consumption is influenced by the server’s status and can have a direct impact on service costs. Based on changing demand, the scheduling algorithm must automatically optimize execution time, resource use, and energy consumption.

Almost all publications have attempted to reduce makespan, increase resource utilization, reduce energy consumption, and balance the load of all VMs inside the datacenter. However, the load balancing problem still arises: if a machine is heavily loaded, it takes more time to transfer the extra tasks, and if a machine is unused for a long time, it still consumes energy, which degrades system performance. This paper adopts similar objectives and addresses them with the help of machine learning and a swarm optimization algorithm. Various machine learning approaches, such as reinforcement learning (RL), deep reinforcement learning (DRL), and actor-critic (AC), are used in the field of cloud computing. In these methods, an agent selects a suitable action to perform in the current state of the environment and receives a reward. This learning interaction proceeds until the agent receives the maximum reward from the environment. In cloud networking, these methods are used to choose actions based on the cloud state [3]. However, for high-dimensional problems, RL methods such as Q-learning are limited by the size of the state space and are hard to apply [4, 5]. DRL strategies such as Deep Q-network (DQN) may not handle problems that are continuous in nature [6, 7].

To address these issues in RL and DRL, we develop in this paper a hybrid scheduling policy based on the actor-critic learning method and the PSO population-based method, known as the HPSOAC (Hybrid Particle Swarm Optimization Actor Critic) scheduling policy. This policy finds better solutions when the problem is continuous and dynamic in nature. In the proposed method, a number of agents learn at the same time using two learning modes: individual learning and group learning, in which agents pass their data to each other to achieve the goal. The individual best of each agent is obtained from the current state information, and the global best is found across all agents. After obtaining these results, each agent updates its actor-critic parameter values using the PSO update equations. In the proposed method, each VM is treated as an agent, and the servers within the datacenter represent the environment. The state of a server is either active (running or idle) or sleep. An operation within the server is an action. The reward reflects whether the objectives are achieved. To train the agents, we retrieve important data from the environment, such as capacity, bandwidth, RAM, CPU, network, and storage.

Cloud requests have stochastic properties, which cause load balancing problems and increase the cost of services. Proper decisions must be taken in the cloud environment (datacenter) to allocate suitable resources to incoming requests. AC can train an agent to learn the optimal decision through interaction with the environment without requiring any prior knowledge. AC acquires knowledge of the environment to choose the best actions based on the environment's parameters; here, the action should handle extra tasks by migrating them to an underloaded VM. These actions are chosen to maximize the reward of each agent. PSO is chosen as the metaheuristic optimization technique to enhance system performance and reduce search time while improving the fitness value. Based on the fitness value, agents exchange information among themselves to find a suitable VM for each incoming task. The main contributions of our proposed work are as follows: (1) We develop an effective scheduling policy to solve the load balancing problem and improve its parameters. (2) This scheduling policy is a combination of PSO and AC. (3) If a VM is heavily loaded, we take an appropriate action to transfer the extra tasks to an underloaded VM, which solves the load balancing problem.

To improve the makespan and energy consumption of the average job, the DQN algorithm proposed in [3] is compared with three different algorithms: random, round robin, and Modified PSO (MoPSO). During training, the authors used the DQN method with a weight parameter of 0.9. Two types of experiments were carried out to obtain the results for energy consumption and makespan time. In comparison to its competitors, the DQN algorithm reduces energy consumption by 2.6% to 11.7% and makespan time by 8.4% to 21.1%. In [10], the proposed Modified PSO and Q-learning algorithm (QMPSO) is compared with two existing algorithms: Q-learning and Modified PSO. The algorithm’s performance was evaluated in terms of makespan, energy consumption, and throughput, which are indicators of the external services’ effectiveness. The fitness function and the weight of the fitness value are used to achieve these goals. The fitness value’s weight ranges from 0.5 to 1.0. Three types of experiments were carried out to determine the system’s energy usage, makespan time, and throughput. In comparison to its competitors, the QMPSO algorithm saves 3.5% to 7.4% energy and 5.3% to 18.6% makespan time while increasing throughput by 60%. In [14], the proposed hybrid scheduling algorithm is compared with the Cloud Workflow Scheduling Algorithm (CWSA), Dynamic Power Saving Resource Allocation (DPRA), Ant Colony Optimization and Particle Swarm Optimization (ACOPSO), and a heuristic approach. The algorithm’s performance was evaluated in terms of makespan time, power consumption, and resource utilization. The hybrid algorithm reduces makespan by 8.3% compared with CWSA and the heuristic approach. Similarly, it reduces power consumption by 10% compared with the DPRA approach. Finally, it increases resource utilization by 10% compared with the ACOPSO and CWSA approaches. In [33], the proposed improved weighted round robin (IWRR) algorithm is compared with two heuristic-based algorithms: round robin (RR) and weighted round robin (WRR). The IWRR algorithm reduces makespan time by 5% and 10% compared with the WRR and RR algorithms, respectively. The first three methods (all except IWRR) share two objectives, namely makespan time and energy consumption. However, none of them defines a suitable action for handling extra tasks that can increase the reward in a dynamic environment. Therefore, in this paper, we define a suitable action that handles extra tasks and gives the maximum reward compared with the above methods. A comparison of the various scheduling policies is presented in Table 1.

The rest of the paper is organized as follows: related work, objective formulation, actor-critic method, HPSOAC scheduling policy, pseudocode, experimental analysis, and conclusion.

2. Related Work

An enormous number of scheduling policies have already been developed. They are based on static, dynamic, or hybrid policies and focus on a single objective, two objectives, or multiple objectives. These works concentrate on maintaining load balance between machines and on improving load balancing parameters such as makespan time, energy consumption, and resource usage. For this purpose, researchers have proposed various machine learning and population-based optimization scheduling strategies for the cloud environment. To improve user quality of service, the authors in [3] proposed a technique based on the Deep Q-network (DQN) algorithm that reduces energy consumption and makespan time by adjusting the proportion of the reward assigned to each optimization objective. This algorithm solves difficult multiobjective optimization problems and provides a sensible and efficient resource allocation and task scheduling strategy. It modifies the reward weights to trade off energy consumption against task makespan. The authors in [4] divided the complicated cloud scheduling problem into two subproblems: job scheduling and resource allocation. Heterogeneous Distributed Deep Learning (HDDL) is used to handle job scheduling, while Deep Q-network (DQN) is employed to solve resource allocation. This method is commonly used to increase energy efficiency. Furthermore, the suggested framework has strong scalability and low computation time, as well as the potential to reach a near-optimal global solution by attaining local optimization at each level. The authors in [6] presented an advantage actor-critic-based reinforcement learning (RL) paradigm for resource allocation in cloud datacenters. First, depending on the critic’s evaluations (evaluating actions), the actor parameterizes the policy (allocating resources) and determines continuous actions (scheduling jobs).
The policy is then updated using gradient ascent, and the advantage function greatly reduces the variance of the policy gradient. Furthermore, in terms of job latency, the suggested method outperforms previous resource allocation algorithms and converges faster than the typical policy gradient method. The authors in [8] described a machine learning-based adaptive resource management technique that allows a cloud storage system to manage itself and deliver higher performance in the face of a wide range of workloads and resource bottlenecks. A stochastic policy gradient-based reinforcement learning technique was also used to detect performance problems in cloud storage and take the required actions, such as load balancing and data movement, to improve storage performance. The authors in [9] designed a Deep Deterministic Policy Gradient (DDPG) task scheduling approach, based on deep reinforcement learning (DRL), to reduce response time and maintain load balance. Without any prior knowledge, the proposed algorithm can learn directly from its experience and make appropriate scheduling decisions for VMs as online task requests arrive. The agent’s state is determined by the information about VMs and tasks received by the scheduler. The agent adaptively assigns each task to an appropriate VM using the defined states, and the task response time serves as the reward for the agent’s action. To improve performance, maintain load balance, and increase throughput, the authors in [10] developed a hybrid technique that combines Modified PSO and the Q-learning algorithm, known as QMPSO. This hybridization adjusts the velocity of MPSO using the gbest and pbest values, which are based on the best action obtained by improved Q-learning. The authors in [11] proposed Q-learning-based foresighted task scheduling to reduce response time and makespan while increasing resource efficiency.
With the help of the Q-learning algorithm, the scheduler allocates all incoming tasks to suitable VMs in a first-come-first-served manner. Before allocating a task, the scheduler collects all the necessary information about the task and the VM, such as task length, arrival time, and VM capacity. The authors in [12] proposed the QL-HEFT task scheduling approach, which combines Q-learning and the Heterogeneous Earliest Finish Time (HEFT) algorithm to reduce the makespan. The algorithm uses the upward rank value of HEFT as the immediate reward in Q-learning. It then sorts the original task order using the converged Q-table to arrive at the best result. The entire process employs the Earliest Finish Time (EFT) allocation approach, which assigns the best processor to each task in the most efficient order. The authors in [13] introduced the advantage actor-critic (A2C) algorithm, which is based on deep reinforcement learning and is used to improve job scheduling performance and datacenter resource usage. In A2C, the actor network is used to pick actions, whereas the critic network is used to evaluate them. Without any prior information about upcoming jobs, the proposed technique reduces average task completion time and improves resource use efficiency. The authors in [14] developed an approach based on hybrid Ant Colony Optimization (ACO) and deep reinforcement learning (DRL) to limit task execution time and improve resource usage. In a cloud environment, DRL is used to increase the utilization of idle resources by dividing them into an action space and a state space. DRL-based resource management reduces resource consumption in large-scale cloud environments with a large number of servers receiving a significant number of requests per day. Once the criterion is met, ACO is used to map jobs to relevant virtual machines.
[15] developed energy-efficient resource allocation method which is the combination of reinforcement learning (RL) and fuzzy logic to reduce energy consumption and increase CPU utilization. At the datacenter level, reinforcement learning (RL) is used to recommend the best allocation policy based on Power Usage Effectiveness (PUE), Datacenter Infrastructure Efficiency (DCiE), and CPU utilization. Fuzzy logic, on the other hand, is used in the allocation process that leads to a green datacenter. [16] presented the Asynchronous Particle Swarm Optimization (APSO) algorithm, which is an asynchronous multithreaded parallel PSO algorithm. APSO is a basic, easy-to-implement algorithm that solves the problem of the classic PSO algorithm falling into a local minimum. Then, using APSO in an asynchronous reinforcement learning algorithm called Backward Q-learning, based on State Action Reward State Action (SARSA), a novel asynchronous reinforcement learning method called Backward Q-learning and Asynchronous Particle Swarm Optimization (APSO-BQSA) is suggested. APSO-BQSA uses the APSO method to optimize the parameters while updating them with the improved BQSA. The proposed approach outperforms the existing reinforcement learning algorithm in terms of convergence speed and performance, and it may be used to solve a variety of situations. [17] proposed an algorithm to combine policy gradient (PG) with Particle Swarm Optimization parameter exploration (PG-PSOPE) algorithm to avoid the problem in continuous nature and improve convergence rate. PG is a solution that involves parameterizing the policy and updating the parameters via gradient estimation. The gradients of policy can be unbiasedly assessed using the normal PG by sampling various interactive trajectories. To circumvent this issue, the mapping between PG and PSO was introduced initially. 
Following that, guidelines for exploring and updating are devised, including the mutation probability, to prevent local optima and excessive variance caused by gradient estimation. As a result, the suggested PG-PSOPE explores the parameter space rather than the action space. In [18], Deep Q-network (DQN) method is proposed, which follows deep reinforcement learning (DRL) approach to reduce energy consumption and average response time while increasing the success rate. The DRL approach is utilized in this case to schedule jobs in the queue on a first-come, first-served (FCFS) basis. The reward is derived using the reward function, and the scheduling is based on the values evaluated by a deep neural network. A job will be given to the best VM from the cloud resource in terms of meeting job criteria and consuming the least amount of energy. Multiple jobs are scheduled to different VMs in the cloud resource, and the proposed technique is designed for online learning for a public cloud. Each job is self-contained and must be scheduled to execute in a virtual machine (VM) within a server, with each server containing exactly one virtual machine. Furthermore, during the job execution, each VM only has one instance. If a job is not at the first in the queue, it has to wait for all the jobs before it to be executed.

2.1. Objective Formulation

In this part, we describe the objectives of the cloud framework: task makespan time, resource utilization, and energy consumption. Based on each objective, we define the reward function that drives the optimization of the system.

2.2. Makespan Time

Makespan time is the maximum time needed to finish executing the tasks on a particular VM. A long execution time indicates poor load balance, while a short execution time indicates better load balance. Suppose datacenter contains and number of tasks and VMs, respectively, such as and . The number of tasks is assumed to be greater than the number of VMs, i.e., , and task length and VM speed are measured in millions of instructions (MI) and millions of instructions per second (MIPS), respectively. To accomplish our objective, we compute the processing rate of each VM using Equation (1), which depends on the properties of the VM, such as MIPS, CPU, and memory.

The expected finishing time (EFT) of task on virtual machine can be expressed as in the following equation:

When task is allocated on , the allocation itself takes some time, called the task allocation time (). It is determined by the task file size relative to the bandwidth of the VM (), as represented in the following equation:

The total finishing time () of all incoming tasks executed on the VMs is calculated as the sum of the expected finishing time () and the task allocation time (), as represented in Equation (4), where is the decision variable represented in Equation (5).

Makespan time (MST) is reduced when all tasks complete their execution efficiently. It is determined as the maximum of [19], and it can be calculated by the following equation:

Thus, our first objective function can be formulated as in the following equation:
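As a minimal sketch of the makespan computation described in this subsection, the following Python code assumes that EFT is task length divided by VM processing rate, that allocation time is file size divided by VM bandwidth, and that the decision variable is encoded as a list mapping each task to one VM. The function names, parameter names, and units (MB and MB/s) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def vm_finish_times(
    task_lengths_mi: Sequence[float],
    file_sizes_mb: Sequence[float],
    vm_rates_mips: Sequence[float],
    vm_bandwidths_mb_per_s: Sequence[float],
    assignment: Sequence[int],
) -> List[float]:
    """Return the total finishing time accumulated on each VM.

    assignment[i] is the index of the VM that executes task i
    (an assumed encoding of the decision variable).
    """
    finish = [0.0] * len(vm_rates_mips)
    for task, vm in enumerate(assignment):
        # Expected finishing time: instructions divided by processing rate.
        eft = task_lengths_mi[task] / vm_rates_mips[vm]
        # Task allocation time: file size divided by VM bandwidth.
        alloc = file_sizes_mb[task] / vm_bandwidths_mb_per_s[vm]
        finish[vm] += eft + alloc
    return finish


def makespan(finish_times: Sequence[float]) -> float:
    """Makespan is the largest per-VM finishing time."""
    return max(finish_times)


# Three tasks on two VMs; tasks 0 and 2 go to VM 0, task 1 goes to VM 1.
times = vm_finish_times(
    task_lengths_mi=[1000, 2000, 3000],
    file_sizes_mb=[10, 20, 30],
    vm_rates_mips=[1000, 500],
    vm_bandwidths_mb_per_s=[10, 10],
    assignment=[0, 1, 0],
)
print(times, makespan(times))  # [8.0, 6.0] 8.0
```

Minimizing the returned makespan over all possible assignments corresponds to the first objective.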

2.3. Resource Utilization

The next optimization objective is to increase resource utilization, which is determined by dividing the total finishing time by the makespan time, as given in Equation (8). A high utilization rate means that the cloud provider can obtain the greatest benefit.

Therefore, our second objective function can be formulated as in the following equation:
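The utilization measure can be sketched as follows. This is an illustrative reading, not necessarily the exact form of Equation (8): it assumes the total finishing time is summed over VMs and additionally normalizes by the number of VMs so that the result lies between 0 and 1. The function name is hypothetical.

```python
from typing import Sequence


def resource_utilization(vm_finish_times: Sequence[float]) -> float:
    """Average fraction of the makespan during which the VMs are busy.

    Assumption: normalizing by the number of VMs keeps the value in [0, 1].
    """
    if not vm_finish_times:
        return 0.0
    span = max(vm_finish_times)
    if span == 0:
        return 0.0
    return sum(vm_finish_times) / (span * len(vm_finish_times))


print(resource_utilization([8.0, 6.0]))  # 0.875
```

A perfectly balanced assignment, where every VM finishes at the same time, gives a utilization of 1.0 under this definition.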

2.4. Energy Consumption

Excessive demand for cloud services has increased energy consumption in datacenters. High energy consumption makes datacenters more expensive to operate and leads to carbon emissions. The energy received by a datacenter is mainly distributed among functioning servers (PMs), cooling systems, and organizational components. Among these three, PMs are the greatest energy consumers [2]. In datacenters, VMs play an important role in reducing energy consumption. If we reduce the energy used by VMs, the energy used by the associated server is reduced as well. The third optimization target is therefore to reduce the energy consumption of VMs in the datacenter. Each VM has two states: active (running or idle) and sleep.

Let the datacenter have a number of servers, each server a number of VMs, and each VM a number of resources such as CPU, MIPS, memory, and bandwidth. We take physical servers in datacenter , expressed as . Then, the total number of servers in datacenter is , and each server contains the total number of VMs, such as . Among all VMs, some are in the active state and the rest are in the sleep state. Let be the total number of active VMs, which includes all VMs in running mode and idle mode. The total number of running VMs on the server at time is . Then, the number of VMs in idle mode is . The number of VMs in sleep mode within server is . The utilization of total resources among each VM is . The total number of active servers () in a datacenter can be expressed as in the following equation:

In datacenter, running server as well as idle server still consume energy. Then, total energy consumption in active state () is represented in Equation (11). Decision variable is shown in Equation (12).

To reduce energy consumption in the datacenter, we should turn off idle servers that are no longer in use while still maintaining QoS. Whenever extra load arrives, we activate sleeping servers [20, 21]. To accomplish our objective, we follow the two steps described below.

Step 1. Sleep inactive servers: In each time slot, if there is an idle or inactive server with no load assigned, we put it into sleep mode. This depends on the following equation: where which means no task is allocated to VM at time period and which means VM is allocated to server at time period . is the number of incoming tasks or load. If the utilization of all available VMs is less than the total server utilization for the time period and this condition persists for a longer time period, we transfer such a server into sleep mode, as represented in the following equation:

Step 2. Rollback and wake-up: Sometimes the number of active servers is insufficient to handle the load, so we recall inactive servers and revert their previous transition to ensure that the additional load is handled. The checking condition is represented in the following equation:

If Equation (15) is satisfied, then we should wake up the sleep server. It is represented in the following equation:

For handling such large amount of task, we wake up the sleep server into active server and it consumes some energy, which is denoted as . As per [21], sleep state server may consume approximate energy as compared to active state server. So, overall energy consumption () by the datacenter is represented in the following equation:

Finally, our third objective function can be formulated as in the following equation:
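The two steps above can be sketched as a simple per-time-slot rule. The code below is an illustration under stated assumptions rather than the exact conditions of the equations in this subsection: a server is put to sleep after a hypothetical number of consecutive idle slots (`idle_threshold`), sleeping servers are woken while the spare capacity of active servers cannot absorb the incoming load, and energy is modelled with placeholder parameters (`p_active`, `sleep_fraction`, `wake_cost`) whose values are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    capacity: float        # load the server can handle (hypothetical unit)
    load: float = 0.0      # load currently assigned
    asleep: bool = False
    idle_slots: int = 0    # consecutive slots spent active with no load


def step_power_states(servers: List[Server], incoming_load: float,
                      idle_threshold: int = 3) -> int:
    """Apply the sleep and wake-up rules for one time slot.

    Returns the number of servers woken up in this slot.
    """
    # Step 1: servers that stayed idle long enough go to sleep.
    for s in servers:
        if s.asleep:
            continue
        if s.load == 0:
            s.idle_slots += 1
            if s.idle_slots >= idle_threshold:
                s.asleep = True
        else:
            s.idle_slots = 0

    # Step 2: wake sleeping servers while active spare capacity is too small.
    wakeups = 0
    spare = sum(s.capacity - s.load for s in servers if not s.asleep)
    for s in servers:
        if spare >= incoming_load:
            break
        if s.asleep:
            s.asleep = False
            s.idle_slots = 0
            spare += s.capacity - s.load
            wakeups += 1
    return wakeups


def slot_energy(servers: List[Server], p_active: float,
                sleep_fraction: float, wake_cost: float, wakeups: int) -> float:
    """Energy for one slot: active servers, sleeping servers, and wake-ups."""
    active = sum(1 for s in servers if not s.asleep)
    sleeping = len(servers) - active
    return (active * p_active
            + sleeping * sleep_fraction * p_active
            + wakeups * wake_cost)
```

For example, with one loaded server, one server that has been idle for two slots, and one sleeping server, a call with a small `incoming_load` puts the idle server to sleep, while a later call with a load above the spare capacity wakes it again and adds `wake_cost` to that slot's energy.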

2.5. Reward Function

The performance of cloud computing depends on the server, which is assigned user requests to provide the required services, consumes energy to run, and should fully utilize all its resources. Because tasks in cloud computing are continuous in nature, the load balancing problem is formulated as a Markov decision process (MDP). In this process, the VM is the agent and the server is the environment, and the VM takes an action by interacting with the server at each cycle. An MDP can be represented as a four-tuple , whose elements are described as follows [22]:

representing as state space: denotes the state of the server at time period .

representing as action space: is the action through which VM takes at time period .

is defined as the transition function. means when we take an action on state , then we are reaching at next state .

represents reward function. At the time period , VM locates at current server state and takes an action to reach next state while receiving a reward which is considered in the following equation:

In RL, two functions help an agent obtain the maximum reward: the policy and the value function. In policy-based methods, the policy is set as and the agent finds the best policy () to take an action on state that yields the highest future rewards. To find the best policy, however, we need a value function, which indicates whether the policy is good enough to obtain the maximum reward. There are two kinds of value function: state value function and state-action value function [23, 24]. The state value function () can be defined as in the following equation:

where is the discount factor, , and the current state is . This value function gives the expected discounted reward under a policy; it combines the current reward with the discounted future reward, a relation known as the Bellman equation [25]. The optimal state value function is represented in the following equation:

As with the state value function, the state-action value function gives the value of selecting an action on the current state while following a policy . The state-action value is represented in the following equation:

is similar to except that by following the policy , an initial action is taken to reach the successive next state. Optimal state-action value function is shown in the following equation:

To handle the continuous nature of the problem while obtaining the highest reward, our goal is to find the optimal policy, which is based on a probability distribution function [6, 26] and represented in the following equation:

where denotes the stationary distribution function under policy . Like the state-action value function (), an advantage function () represents relative state-action values, which indicate whether the selected action is valuable or not. Often, it is simpler to identify which action yields a higher reward than the others [27]. The advantage function can be represented in the following equation:

where shows the state value function or baseline function. An action is performed using an advantage function in value-based RL algorithms [28, 29]. Similarly in our methodology, we take an advantage value to take an action on each iteration.
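For reference, the value functions discussed in this subsection can be written in standard textbook notation as below. The symbols ($\pi$ for the policy, $\gamma$ for the discount factor, $r_t$, $s_t$, $a_t$ for the reward, state, and action at time $t$) and the range of $\gamma$ are assumptions made for this sketch and may differ from the notation of the numbered equations.

```latex
\begin{align}
V^{\pi}(s) &= \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty}\gamma^{k}\, r_{t+k+1} \,\Big|\, s_t = s\Big],
  \qquad \gamma \in [0, 1), \\
V^{*}(s) &= \max_{\pi} V^{\pi}(s), \\
Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty}\gamma^{k}\, r_{t+k+1} \,\Big|\, s_t = s,\ a_t = a\Big], \\
Q^{*}(s,a) &= \max_{\pi} Q^{\pi}(s,a), \\
A^{\pi}(s,a) &= Q^{\pi}(s,a) - V^{\pi}(s).
\end{align}
```

In this form, a positive advantage $A^{\pi}(s,a)$ means that action $a$ is better than the average action taken by the policy in state $s$, with $V^{\pi}(s)$ acting as the baseline.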

2.6. Actor-Critic Method

Actor-critic is an advanced class of RL algorithms that uses both value-based (e.g., Q-learning) and policy-based (e.g., policy gradient) methods. In the AC method, each agent consists of two parts: the actor, which computes the optimal policy using the policy gradient method, and the critic, which computes the value function and the TD error, which indicates whether the reward is better or worse than expected. The actor selects actions according to the policy to obtain rewards, while the critic evaluates whether the current policy is good enough to obtain the maximum reward from the environment.

2.7. Actor Part

The policy gradient (PG) method [26] is generally adopted in the actor part to update the policy so as to improve the objective function in Equation (24). The policy is first constructed using the parameter vector , which is indicated by . Then the gradient of the policy with respect to in terms of the objective function in Equation (24) is represented in the following equation:
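In standard notation, the policy gradient and the resulting actor update are commonly written as below. This is the usual textbook form of the policy gradient theorem, given here as a sketch; the symbols $\theta$ (policy parameter vector), $d^{\pi_\theta}$ (stationary state distribution), and $\beta$ (actor learning rate) are assumed names and may not match those of the numbered equations.

```latex
\begin{align}
\nabla_{\theta} J(\theta)
  &= \sum_{s} d^{\pi_\theta}(s) \sum_{a} \nabla_{\theta}\pi_{\theta}(a \mid s)\, Q^{\pi_\theta}(s,a) \\
  &= \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_{\theta}}
     \big[\nabla_{\theta}\log \pi_{\theta}(a \mid s)\, Q^{\pi_\theta}(s,a)\big], \\
\theta &\leftarrow \theta + \beta\, \nabla_{\theta} J(\theta).
\end{align}
```

Replacing $Q^{\pi_\theta}(s,a)$ by the advantage $Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$ leaves the expected gradient unchanged, because the baseline depends only on the state, while typically reducing its variance.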

2.8. Critic Part

The objective of the critic part is to find the approximate value function, which the actor can use to improve the policy [25]. The value function is parameterized by and represented in the following equation: where is the basis function vector when the agent chooses the action at the state and the parameter vector of weights is represented as .

To calculate the approximate value function, the critic first computes the temporal difference (TD) error of the Bellman equation [25, 26]. The TD error measures the difference between the approximated value function and the actual value function at the current state-action pair. It is represented in the following equation:

where is state-action value function and transition function is . By using TD error, the critic parameter can be updated by the gradient-descent update rule [30] which is represented in the following equation:

where is learning rate of critic value function and the feature of value function is represented as . After optimizing the parameter in Equation (29), actor updates its value function in Equation (27).
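The critic step can be sketched as follows, assuming a linear value function (a weight vector multiplied by a feature vector) and a sample-based TD error computed from one observed transition instead of an expectation over the transition function. The names `weights`, `features`, `gamma`, and `alpha` are hypothetical stand-ins for the critic parameter vector, basis function vector, discount factor, and critic learning rate.

```python
from typing import List, Sequence


def q_value(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear approximation: inner product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def td_error(reward: float, gamma: float, weights: Sequence[float],
             features: Sequence[float], next_features: Sequence[float]) -> float:
    """Sampled TD error: r + gamma * Q(s', a') - Q(s, a)."""
    return (reward
            + gamma * q_value(weights, next_features)
            - q_value(weights, features))


def critic_update(weights: Sequence[float], features: Sequence[float],
                  delta: float, alpha: float) -> List[float]:
    """Move the weights along the features, scaled by the TD error."""
    return [w + alpha * delta * f for w, f in zip(weights, features)]


w = [1.0, 2.0]
delta = td_error(reward=1.0, gamma=0.5, weights=w,
                 features=[1.0, 0.0], next_features=[0.0, 1.0])
print(delta)                                          # 1.0
print(critic_update(w, [1.0, 0.0], delta, alpha=0.5)) # [1.5, 2.0]
```

A positive TD error raises the estimated value of the visited state-action pair, and a negative one lowers it.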

2.9. HPSOAC Scheduling Policy for Parameter Updating

This section presents the basic framework of the proposed HPSOAC scheduling policy. In this method, each agent obtains its individual reward by taking an appropriate action in each state of the environment. After obtaining their individual bests, the agents seek the global best reward by exchanging information with their neighbours using the PSO algorithm. Thus, the proposed method selects actions that optimize both the policy and the value function to obtain the personal best reward for each VM. Each VM then finds the global best by sharing its information with the others. Finally, all agents update their parameters {,} at each iteration. Figure 2 shows the proposed HPSOAC scheduling policy. To improve the convergence speed of the search for the global best value in the actor-critic algorithm, we combine PSO with the actor-critic method to explore the policy parameters.

PSO is a swarm-based optimization computation procedure through which numbers of particles with various positions and velocities are built to discover solutions [31, 32]. Similarly, all agents in HPSOAC algorithm have different weights of load and different velocities of task transfer rate for getting the optimum result. The velocities of the task transfer rate determine the weight of the load. All learning parameters are updated as follows:

2.10. Action Selection

In cloud computing, all incoming tasks should be allocated to suitable VMs so that they execute in minimum time. Allocation depends on the capacity of the VM, which is represented in Equation (30). The load of the VM is calculated in Equation (31). After the tasks are allocated, some VMs may be overloaded or underloaded. If the load of a VM exceeds its capacity, then it is overloaded. To balance the load among the VMs, the HPSOAC scheduling policy finds the best VM, i.e., either an underloaded or an ideal VM, and then transfers the extra task to it by taking the selected action. The task transfer speed can be calculated by using Equation (32). Thus, action selection depends on the current load of the VM and the task transferred, chosen to give the maximum reward. The selected action is then calculated by using Equation (33).

The quantities in Equation (32) are the extra task transferred between VM b and VM c, the required bandwidth between VM b and VM c in MIPS, and a term indicating that a given task has been taken from VM b and assigned to VM c.

In Equation (33), two coefficients define the weight of the VM load and the weight of the task transfer speed, respectively, with their values constrained as stated there. Thus, action selection depends on this set of weights of an agent for obtaining the maximum reward, as described in Section 2.13 (Update Velocity and Position).
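The selection logic described in this subsection can be sketched as follows. The overload test (load greater than capacity) comes from the text. Everything else is an assumption made for illustration: the scoring rule that combines free capacity and transfer speed, the names `w_load` and `w_speed`, the dictionary layout, and all numeric values. It is not a reconstruction of Equation (33).

```python
def is_overloaded(load, capacity):
    """A VM is overloaded when its load exceeds its capacity."""
    return load > capacity


def select_target(vms, source, w_load, w_speed):
    """Pick the non-overloaded VM with the highest weighted score.

    vms maps a VM name to a dict with 'load', 'capacity' and 'speed',
    where 'speed' is a hypothetical normalized task transfer speed
    from the source VM to that VM.
    """
    best_name, best_score = None, float("-inf")
    for name, vm in vms.items():
        if name == source or is_overloaded(vm["load"], vm["capacity"]):
            continue  # never transfer to the source or to an overloaded VM
        headroom = (vm["capacity"] - vm["load"]) / vm["capacity"]
        score = w_load * headroom + w_speed * vm["speed"]
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical VMs: vm_a is overloaded and needs to shed a task.
vms = {
    "vm_a": {"load": 120.0, "capacity": 100.0, "speed": 0.0},
    "vm_b": {"load": 40.0, "capacity": 100.0, "speed": 0.5},
    "vm_c": {"load": 90.0, "capacity": 100.0, "speed": 0.9},
}
target = select_target(vms, source="vm_a", w_load=0.6, w_speed=0.4)
```

With these illustrative weights, the lightly loaded `vm_b` is preferred over the faster but nearly full `vm_c`; shifting weight toward `w_speed` would change that trade-off.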

2.11. Critic Parameter Update

Because rewards in the RL approach are delayed, both the current reward and the aggregate reward are affected. Equation (29) alone is therefore not sufficient to produce an optimum result, and the eligibility trace technique is used to accelerate the convergence rate [26]. In this technique, only eligible states and actions take part in speeding up learning. The updated eligibility trace vector is represented in Equation (34).

Equation (34) includes the trace-decay rate. With this eligibility trace, the update of the critic parameter vector is expressed in Equation (35).
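A minimal sketch of an eligibility-trace critic update is given below. It assumes the common accumulating-trace form, in which the old trace is decayed by the product of the discount factor and the trace-decay rate before the current feature vector is added. The names `z`, `gamma`, `lam`, and `alpha_c`, and the numbers, are illustrative assumptions; the exact form of Equations (34) and (35) may differ.

```python
def update_trace(z, phi, gamma, lam):
    """Decay the old trace by gamma * lam and add the current features."""
    return [gamma * lam * zi + fi for zi, fi in zip(z, phi)]


def critic_update_with_trace(w, z, delta, alpha_c):
    """Update every weight in proportion to its eligibility and the TD error."""
    return [wi + alpha_c * delta * zi for wi, zi in zip(w, z)]


z = [0.0, 0.0]
z = update_trace(z, [1.0, 0.0], gamma=0.5, lam=0.5)  # first visited feature
z = update_trace(z, [0.0, 1.0], gamma=0.5, lam=0.5)  # second visited feature
w = critic_update_with_trace([0.0, 0.0], z, delta=2.0, alpha_c=0.5)
```

The earlier feature still receives part of the update through its decayed trace, which is how credit from a delayed reward reaches previously visited states and actions.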

2.12. Actor Parameter Update

Some error and bias remain in the approximate policy gradient method, which prevents it from giving an optimum result. This error and bias can be removed with the help of a compatible approximation function [6, 23, 26]. If the state-action value function satisfies the conditions of a compatible approximation function, then the error and bias are reduced. The conditions are shown in Equations (36) and (37).

Condition 1. The value function approximator is compatible with the policy.

Condition 2. The optimal parameter minimizes the mean-squared error, so the expected value of the gradient of this error is zero, which is represented in Equation (38).

The approximated state-action value function can then be substituted directly into Equation (26), and the policy gradient with the compatible approximation function becomes Equation (39), which reduces bias and error. After the error in Equation (37) is reduced, the actor updates its state-value function as in Equation (40). Policy gradient methods still face a high-variance problem. To solve this issue by improving the compatible approximation function in the critic, we use the advantage function of Equation (25). The advantage function replaces the state-action value function in the policy gradient, and the updated policy gradient is represented in Equation (41). With the help of the parameter of the state-action value function, the actor searches for the best policy parameter to increase the expected reward under the given policy [25]. Finally, the policy parameter update that maximizes the reward is represented in Equation (42), which includes the learning rate of the actor.
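The advantage-weighted actor update can be illustrated with a small sketch. As a simplification, it swaps in a softmax policy over a discrete set of actions, whereas the paper works in a continuous state-action space. The update moves the policy parameters along the gradient of the log-policy, scaled by the advantage and the actor learning rate. The names `theta`, `alpha_a`, and `advantage` are illustrative assumptions.

```python
import math


def softmax(theta):
    """Action probabilities from action preferences theta."""
    m = max(theta)  # subtract the maximum for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_policy(theta, action):
    """Gradient of log pi(action) w.r.t. theta for a softmax policy."""
    probs = softmax(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def actor_update(theta, action, advantage, alpha_a):
    """Gradient ascent step: theta += alpha_a * advantage * grad log pi."""
    grad = grad_log_policy(theta, action)
    return [t + alpha_a * advantage * g for t, g in zip(theta, grad)]


theta = [0.0, 0.0]
theta = actor_update(theta, action=0, advantage=2.0, alpha_a=0.5)
```

A positive advantage raises the preference for the chosen action and lowers the others, so the action becomes more likely on the next visit; a negative advantage does the opposite.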

2.13. Update Velocity and Position

To maintain the load balance and achieve the objective function, each agent searches for the optimum result within the environment. Each agent can adjust its load depending on its own arrangement as well as that of its neighbours. If an agent is heavily loaded, it looks at its neighbours and transfers the extra task to the next suitable agent. Therefore, the velocity of task transfer depends on the weight of the load. Suppose there are a number of VMs, or agents, in an environment and a number of iterations, called episodes. Each VM has a load weight and a task transfer speed, or velocity. At each iteration, an agent communicates with the environment and updates its load weight and velocity according to its neighbours' best value. The updated load and velocity of each agent are represented in Equations (43) and (44).
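The velocity and position update can be sketched with the standard PSO rule, in which the new velocity combines the old velocity (scaled by an inertia weight) with random pulls toward the particle's own best and the swarm's best position. Whether Equations (43) and (44) take exactly this form is an assumption, and the names `w`, `c1`, `c2`, `pbest`, and `gbest`, together with the numeric values, are illustrative only.

```python
import random


def pso_step(position, velocity, pbest, gbest, w, c1, c2, rng=random):
    """Return the new (position, velocity) of one particle (agent).

    position: current load weight of the agent
    velocity: current task transfer velocity of the agent
    """
    r1, r2 = rng.random(), rng.random()
    new_velocity = (
        w * velocity
        + c1 * r1 * (pbest - position)   # pull toward the individual best
        + c2 * r2 * (gbest - position)   # pull toward the global best
    )
    return position + new_velocity, new_velocity


# Hypothetical starting values for one agent.
rng = random.Random(0)
load_weight, transfer_velocity = 0.2, 0.0
load_weight, transfer_velocity = pso_step(
    load_weight, transfer_velocity, pbest=0.5, gbest=0.8,
    w=0.7, c1=1.5, c2=1.5, rng=rng,
)
```

Passing the random source as an argument keeps the step reproducible in experiments, which helps when comparing runs across scheduling policies.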

2.14. Pseudocode of HPSOAC Algorithm

The pseudocode of our proposed algorithm is given in Pseudocode 1, and its terms and their meanings are shown in Table 2.

Table 2: Terms and meaning.

Term                              Meaning
Particle or Agent                 VM
Position                          Load of VM allocated on PM
Velocity                          Speed of task transfer
Pbest                             Individual VM performance
Gbest                             Optimal result
Iteration or Episode              Time period
Information of Server and VM      MIPS, Memory, CPU and Bandwidth
Information of Task               Length and File size

Pseudocode 1: HPSOAC Algorithm
Input: Server set, VM set, and task set, together with γ, 𝜆, and the remaining learning parameters.
Output: Reduced makespan time, reduced energy consumption, increased resource utilization, and balanced VM load.
For i = 1 to m
    For j = 1 to n
        Allocate the incoming task to a VM
        // For makespan
        Calculate the processing rate of the VM, the expected finishing time, and the task allocation time by applying Eq. (1), (2), and (3)
        Find the finishing time of each task on the VM by using Eq. (4)
        Calculate the makespan time by using Eq. (6)
        // For resource utilization
        Compute the resource utilization by using Eq. (8)
    End of for
End of for
// For energy consumption
Initialize the active and sleep VMs and the utilization of both VM and server
For j = 1 to n
    For s = 1 to x
        Compute the active server by using Eq. (10)
        Compute the energy consumption of the active server by using Eq. (11)
        If (the condition for sleep mode holds)
            Put the server into sleep mode
        Else
            Wake up the sleeping server
        End of if
        Compute the total energy consumption by using Eq. (17)
    End of for
End of for
// Applying HPSOAC
// Initialize AC parameters
Initialize j, e, and the remaining AC parameters
// Initialize PSO parameters
Initialize w, the remaining PSO parameters, the population size, and the number of iterations
For j = 1 to n
    Initialize the individual best, current position, and velocity of agent j
End of for
Initialize the individual best and the target position
For e = 1 to the number of episodes
    For j = 1 to n
        Observe the environment state
        For t = 1 to the number of time steps
            Agent takes an action according to Eq. (33)
            Receive the current reward and perceive the next state
            Compute the value function by using Eq. (40)
            Compute the TD error by using Eq. (28)
            Update the eligibility trace in Eq. (34)
            Update the critic parameters in Eq. (35)
            Compute the advantage function in Eq. (25)
            Update the policy gradient by using Eq. (41)
            Update the policy parameters in Eq. (42)
        End of for
        // Compute fitness value
        Calculate the cumulative reward as the fitness value of the agent
        // Compare current fitness with individual best
        If (current fitness is better than Pbest)
            Pbest ← current fitness
        End if
        // Find the global best value from all individual bests
        If (Pbest is better than Gbest)
            Gbest ← Pbest
        Else
            Gbest ← Optimal solution
        End if
        Update episode
        Update weight and velocity according to Eq. (43) and (44)
    End of for
End of for

2.15. Experimental Analysis

Based on simulation results, this section evaluates the performance of our proposed HPSOAC scheduling policy against other existing policies with respect to different load balancing parameters: energy consumption, makespan time, and resource utilization. All simulation experiments were conducted in a Python environment with TensorFlow. The details of our experiment are given below.

2.16. Parameters Used for Simulation

This section provides the information necessary for carrying out the simulation. For the test, we take continuous and independent tasks to obtain the maximum reward from the environment. Tasks are distributed among a number of VMs, and these VMs are assigned to a number of servers in a datacenter. Here we take 10000 to 50000 tasks that are distributed among 50 to 500 virtual machines in a datacenter. These tasks have independent lengths and file sizes. Our proposed scheduling policy aims to distribute all tasks equally among the available virtual machines, which reduces the makespan time and energy consumption while increasing resource utilization. If a virtual machine is heavily loaded, its extra tasks should be transferred easily, so that all the machines become balanced. We take these objectives as our reward rate. Initially, the reward of our proposed scheduling policy increases rapidly; once it reaches iteration 600, it fluctuates slightly and then remains nearly unchanged until the last iteration, 1000. We therefore use a total of 1000 iterations in our experiment, because the reward percentage no longer changes after that point. All the properties of the tasks, servers, and VMs are shown in Table 3. Table 4 shows the properties of the PSO and AC approaches.

3. Result and Analysis

This section compares the performance of the HPSOAC scheduling policy against two other optimization policies, namely Deep Q-network (DQN) and Q-learning based on Modified Particle Swarm Optimization (QMPSO). All simulation results are shown in Figures 3 to 8.

Figure 3 shows the total reward percentage per iteration under the different scheduling policies in our environment, where the number of servers is 50 and tasks enter the environment with different lengths and speeds. The figure shows that our proposed HPSOAC scheduling policy achieves the best reward with a faster convergence speed than the other two policies. To reach the optimum result, both the DQN and QMPSO policies require more learning steps. Compared with QMPSO, DQN has a better convergence speed, but it is still not sufficient. At the start, the reward percentage of all algorithms is very similar, but from iteration 20 until the last iteration, our proposed scheduling algorithm gives a better reward than its competitors. When it reaches iteration 600, its reward percentage is about 80%, and it remains almost the same until the final iteration, 1000. For the DQN and QMPSO algorithms, in contrast, the reward percentage is about 60% and 39%, respectively.

Figure 4 shows the energy consumed by the different scheduling policies when different numbers of tasks are run on a fixed number of VMs. The figure shows that when the number of tasks increases, the energy consumption of DQN and QMPSO grows faster than that of HPSOAC. The energy consumption of the HPSOAC scheduling policy increases only marginally when the number of tasks increases from 4000 to 5000. Finally, at the end of the iterations, our proposed algorithm consumes approximately 5.16% and 10.86% less energy than DQN and QMPSO, respectively. Table 5 shows the dataset for Figure 4.

Figure 5 shows the energy consumed by different servers in a datacenter, where each server is virtualized into 10 VMs. In this figure, we plot the total energy consumption obtained from the simulation of our proposed scheduling policy against the number of servers. We found that a smaller number of servers can consume more energy than a greater number of servers. Table 6 shows the dataset for Figure 5.

Table 7 presents the dataset for the total energy consumed by our proposed scheduling policy, divided between the two states of the server, i.e., the active state and the sleep state. Figure 6 shows the energy consumption in these states. The figure shows that, of the total energy consumed under the HPSOAC scheduling policy, the majority is used by servers in the active state rather than in the sleep state.

The results displayed in Figure 7 are obtained by the various algorithms with 1000 to 5000 tasks allocated to the available VMs. The figure shows that our proposed HPSOAC scheduling policy takes less makespan time than the other two policies. Compared with the DQN and QMPSO policies, the reduction is approximately 3.02% and 5.04% at the lowest makespan time and 7.13% and 10.04% at the highest makespan time, respectively. The dataset for Figure 7 is displayed in Table 8.

The total utilization of resources depends on the number of active servers running in the datacenter, and the utilization results are shown in Figure 8. The figure makes clear that our proposed HPSOAC scheduling policy increases resource utilization compared with DQN and QMPSO. In this experiment, we take different sets of VMs, from 50 to 500, executed under maximum load. From 50 to 300 VMs, the utilization of our proposed approach is only marginally higher, but as the number of VMs increases further under maximum load, our proposed scheduling policy shows better resource utilization than the other two techniques.
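To make the compared metrics concrete, the following sketch computes makespan, resource utilization, and a percentage reduction from per-VM finishing times. It assumes commonly used definitions: makespan as the latest finishing time over all VMs, utilization as the mean VM busy time divided by the makespan, and reduction as the relative difference from a baseline. These may not match Equations (6) and (8) exactly, and all numbers are hypothetical rather than taken from the paper's tables.

```python
def makespan(finish_times):
    """Makespan: the latest finishing time over all VMs."""
    return max(finish_times)


def resource_utilization(finish_times):
    """Mean VM busy time divided by the makespan (a value in (0, 1])."""
    return sum(finish_times) / (makespan(finish_times) * len(finish_times))


def reduction_percent(baseline, proposed):
    """Relative reduction of a metric with respect to a baseline, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical per-VM finishing times (arbitrary time units).
finish_times = [8.0, 10.0, 6.0]
span = makespan(finish_times)
utilization = resource_utilization(finish_times)
saving = reduction_percent(baseline=100.0, proposed=90.0)
```

Under these definitions, a better balanced schedule lowers the makespan and raises utilization at the same time, because less VM time is spent idle while waiting for the slowest machine.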

4. Conclusion

Balancing load in a cloud environment is a challenging issue because of its dynamic nature. This paper proposes a hybrid scheduling policy that combines the PSO algorithm and the actor-critic algorithm, named Hybrid Particle Swarm Optimization Actor Critic (HPSOAC). This scheduling policy is proposed to improve various load balancing parameters, such as makespan, resource utilization, and energy consumption, in a continuous state-action space. Our simulation experiment is carried out in a Python simulator with TensorFlow. Experimental results show that the HPSOAC scheduling policy reduces energy consumption by 5.16% and 10.86%, reduces makespan time by 7.13% and 10.04%, and has marginally better resource utilization compared with the DQN and QMPSO algorithms, respectively.

In future, work could focus on improving resource allocation and resource management in cloud datacenters. To achieve this, a hybrid algorithm could be proposed that combines two different machine learning algorithms. Such an algorithm would provide real-time analytics in the complex and dynamic cloud network.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.


Acknowledgments

This work was supported by Dongseo University, “Dongseo Cluster Project” Research Fund of 2022 (DSU-20220006).