Abstract

The new generation of Internet of Things (NG-IoT) brings a wide range of challenging problems, and cloud computing is an important foundation for the development of the IoT. In this article, we focus on the task scheduling problem of IoT systems in a cloud computing environment, with the goal of minimizing task runtime. Task scheduling is well known to be a hard problem; nevertheless, over the last decade researchers have designed many state-of-the-art algorithms for it. In this work, we propose an efficient reinforcement learning (RL) algorithm for task scheduling in IoT systems (EATS), which incorporates combinatorial optimization so that the algorithm has a stable lower bound. We process a batch of tasks at a time, make task-selection decisions through reinforcement learning, and solve each selected batch with a combinatorial optimization method. Experimental results show that the proposed algorithm performs well in different environments.

1. Introduction

The application of NG-IoT in many fields is becoming more and more popular. However, the enormous number of Internet of Things (IoT) systems generates a large volume of data, and how to process these data efficiently is becoming increasingly important. The task scheduling problem in IoT systems refers to allocating tasks generated in IoT systems to virtual machines so that the total time required is minimized. It is well known that the task scheduling problem is NP-hard, and how to deal with it efficiently has always been challenging. Artificial intelligence (AI) algorithms play an important role in various IoT problems. Recently, Zhou et al. [1] proposed an accelerating artificial intelligence method for IoT (AAIoT) for the first time. Christou et al. [2] introduce an industrial IoT model based on machine learning (ML) and discuss its explainability.

We know that many problems in the IoT are difficult. Of course, many researchers treat these problems as optimization problems and solve them well. Fu et al. [3] propose an optimization method for resource management in terrestrial satellite systems. Liu and Zhang [4] proposed a joint optimization algorithm based on Lagrangian dual optimization with the aim of maximizing the transmission rate of the IoT. With the development of the IoT and cloud computing, these issues are becoming increasingly important.

IoT resources are very limited, and how to use them efficiently has always been an important problem. We study the task scheduling problem in IoT systems, which is a classical NP-hard problem. Many heuristic and traditional algorithms are widely used for this class of problems, but most of the time they can only solve small-scale instances. For large-scale problems these algorithms do not perform well, and many researchers instead solve this class of problems with deep reinforcement learning. The task scheduling problem naturally fits the Markov decision process: Li et al. [5] solve the task division and scheduling problem through deep reinforcement learning, and their method shows excellent performance. Chen et al. [6] proposed a deep reinforcement learning- (DRL-) based approach for dynamic task offloading in mobile edge computing (MEC). Both works point out the flaws of traditional algorithms and solve the problem efficiently with reinforcement learning.

Zhang and Zhou [7] summarize scheduling algorithms in different environments and divide them into two types: static task scheduling and dynamic task scheduling. Uniform scheduling after accumulating a batch of tasks is known as static scheduling, while scheduling arrangements that change as new tasks come in are called dynamic scheduling. In this article, we concentrate on the problem of scheduling tasks in large-scale static IoT systems. To better solve this problem, we propose an efficient reinforcement learning algorithm for task scheduling (EATS) in IoT systems. We use reinforcement learning to guide which tasks are computed each time and solve for the optimal assignment of each selected batch with a combinatorial optimization algorithm. Our proposed algorithm has a stable lower bound and excels at solving task scheduling problems in large-scale IoT systems.

The algorithm proposed in this article is based on reinforcement learning and combines a combinatorial optimization algorithm to solve task scheduling in large-scale IoT systems. This strategy of combining reinforcement learning with other algorithms often yields outstanding results. Using auxiliary algorithms to assist reinforcement learning is also a currently popular approach to ensure stable lower bounds on difficult problems [8]. With the rapid increase in the number of IoT systems, the scale of the task scheduling problem will be enormous. It is therefore crucial to design an efficient scheduling algorithm for this problem.

The remaining sections of this article are organized as follows. In Section 2, we model the task scheduling problem in IoT systems and present the problem formulation. In Section 3, we propose a novel reinforcement learning algorithm, EATS, which combines a combinatorial optimization algorithm to excel at task scheduling in large-scale IoT systems, and we give a proof of a lower bound on the algorithm. In Section 4, we compare EATS with other algorithms in different settings. Section 5 reviews related work. Section 6 summarizes our work and describes future work.

2. System Model and Problem Formulation

In this article, we study the task scheduling problem of IoT systems in a cloud computing environment. Table 1 shows our notations and definitions. To better solve this problem, the following describes it in more detail and presents our system model.

2.1. Task Model

The IoT system generates tasks every second, tasks keep arriving over time, and these tasks are handled by the virtual machines. When the number of tasks is large, we select tasks from the currently unprocessed task set for priority processing. The size of this set is denoted as . The set of tasks we choose to process each time is denoted by . Each task is either processed as soon as it is generated or queued while waiting to be processed. In our model, all tasks in the system are processed and no tasks are discarded.

Each task records the time at which it is generated in the IoT system. The type of the -th task is , its maximum response time is , and its time to be processed by the virtual machine is . The time-out amount of a task is formulated as
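A plausible form of this quantity, written here with illustrative symbols that are not necessarily the paper's notation (g_i for the moment task i is generated, r_i for its maximum response time, and f_i for the moment it finishes processing), is

\[ o_i = \max\{0,\; f_i - g_i - r_i\}, \]

so a task that completes within its maximum response time contributes no time-out.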

In this article, our goal is to minimize the sum of time-outs over all tasks in the system. The sum of the time-outs of all tasks generated in the IoT system is denoted by . The smaller this sum, the more efficient the designed algorithm. It is defined as
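With the same illustrative notation as above, the total time-out is presumably the sum of the per-task time-outs:

\[ T_{\mathrm{total}} = \sum_{i} o_i. \]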

Tasks have multiple attributes. In this paper, we concentrate on the static task scheduling algorithms in IoT systems where we know in advance when a task is generated.

2.2. Virtual Machine Model

In our system model, all tasks are processed by virtual machines. We have virtual machines in total, and each virtual machine has threads. At any given time, each thread of a virtual machine can only process one task. denotes the number of idle threads on virtual machine . Since there are many types of tasks, the time required by each virtual machine to process the -th type of task is . If the -th idle thread of a virtual machine starts to process the -th type of task at moment , the moment at which it becomes idle again is formulated as
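A plausible form of this relation, writing c_{m,p} for the time virtual machine m needs to process a task of type p (illustrative symbols, not necessarily the paper's notation), is

\[ t' = t + c_{m,p}, \]

i.e., a thread of machine m that starts a type-p task at moment t becomes idle again at t'.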

For ease of calculation, all virtual machines in this paper have the same number of threads and the same overall efficiency, but each virtual machine handles different types of tasks with different efficiency. Tasks cannot be assigned to working threads in a virtual machine; only currently idle threads can process tasks.

2.3. Problem Formulation

An efficient scheduling algorithm results in a smaller total time-out. We process a batch of tasks at a time, rather than focusing on one task at a time. At each step we process a batch of tasks, and the remaining tasks are put into the queue. The set of tasks that have not yet been processed is put into the queue at the first opportunity. denotes the length of the queue, is the maximum number of tasks we can handle at time , and represents the number of tasks generated in the IoT system at time . The maximum number of tasks we can handle at this moment should satisfy the following constraints:
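One plausible reading of these constraints, with illustrative symbols (n_t for the maximum number of tasks we can handle at time t, L_t for the current queue length, A_t for the number of newly generated tasks, and I_t for the number of currently idle threads; these symbols are assumptions, not the paper's notation), is

\[ n_t \le L_t + A_t, \qquad n_t \le I_t. \]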

We will select a certain number of tasks to process at this moment, and this number should satisfy the following constraint:
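Again in illustrative notation, writing k_t for the number of tasks actually selected at time t, the constraint is presumably

\[ 0 \le k_t \le n_t. \]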

Tasks are generated at each moment and a batch of tasks is processed, so the queue length and the task set change. Thus, the queue length changes according to the following equality:
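A plausible form of the queue-length update, with the same illustrative symbols as above, is

\[ L_{t+1} = L_t + A_t - k_t. \]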

In our system model, all types of tasks are placed in a queue, which simplifies the model without affecting the calculation results.

If we need to process a very large number of tasks, one queue may not be able to hold all of them. In this case, we create a new queue and process the tasks in the new queue after all the tasks in the previous queue have been processed.

New tasks generated by the IoT system at each moment first enter the waiting queue and are added to the unprocessed task set at the same time.

Since the queue is sometimes very long, we can fix the number of tasks selected each time, but there are still many possibilities for which tasks to select. How to choose the batch size and the batch itself is therefore worth considering.
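In the illustrative notation above, selecting k_t tasks out of a queue of length L_t admits

\[ \binom{L_t}{k_t} \]

possible choices, which is why exhaustively trying every batch is infeasible when the queue is long.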

In our preliminary experiments, we found that the choice of which batch of tasks to compute each time had a huge impact on the results. Of course, choosing which batch of tasks to compute each time is also a challenging problem, and in the next section, we will propose a novel reinforcement learning algorithm that learns from failure to better guide task selection.

3. Algorithm Design

The task scheduling problem in IoT systems is challenging. Because it is NP-hard, traditional algorithms cannot solve it optimally in polynomial time. Reinforcement learning is a good way to solve this kind of problem, but it takes a long time to train before it gives a good solution. We propose a reinforcement learning algorithm with a lower bound, which can give an excellent solution even after limited training.

We define a subproblem as scheduling a batch of tasks. We design a novel reinforcement learning algorithm that combines combinatorial optimization algorithms. Reinforcement learning selects the subproblems we process each time, and combinatorial optimization algorithms can obtain optimal solutions to the subproblems.

3.1. Problem Transformation

We match each task generated in the IoT system with a thread in a virtual machine, indicating that the task is processed by that thread. Since we do not need to match working threads with tasks, we only consider idle threads. Algorithm 1 shows our matching process.

Input: The task set selected each time
Output: Bipartite graph of tasks and idle threads
1: for each task in the selected task set do
2:  for each thread in every virtual machine do
3:   if the thread is currently idle then
4:    add an edge between the task and the thread
5:    set the edge weight to the moment at which the task would be completed on this thread
6:   end if
7:  end for
8: end for
9: erase the selected tasks from the unprocessed task set
10: erase the selected tasks from the waiting queue
11: the length of the queue is also modified accordingly

In order to better show the relationship between tasks and idle threads, we model it as a bipartite graph [9], as shown in Figure 1.

In the weighted bipartite graph, the set on the left is the currently selected batch of tasks, and the set on the right is the set of currently idle threads in the virtual machines. An edge refers to the assignment of a task to a thread for processing, and the weight of this edge is the completion moment after the task is assigned to that thread [10].

The weighted bipartite graph not only clearly represents the relationship between tasks and idle threads but also makes the computation more convenient.

In Algorithm 1, we determine the selected task set each time with the reinforcement learning algorithm (Algorithm 2), and we connect the tasks in it to all currently idle threads in the virtual machines (line 3 and line 4) with edges whose weights are the moments at which the tasks would be completed (line 5). Finally, we modify the corresponding sets (lines 9-11). Each time the above process is executed, a batch of tasks is processed. As tasks arrive, we keep repeating this process until all tasks have been processed.
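To make the matching step concrete, the following Python sketch builds the task-to-idle-thread edges described above. The Task and Thread structures, the proc_time table, and all names are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_type: int        # type index of the task
    arrival_time: float   # moment the task was generated

@dataclass
class Thread:
    vm_id: int            # which virtual machine this thread belongs to
    busy_until: float     # moment at which the thread becomes idle

def build_bipartite_graph(selected_tasks, threads, proc_time, now):
    """Connect every selected task to every currently idle thread (as in Algorithm 1).

    proc_time[vm_id][task_type] is the processing time of that VM for that task type.
    Returns a list of edges (task_index, thread_index, completion_moment)."""
    edges = []
    for i, task in enumerate(selected_tasks):
        for j, thread in enumerate(threads):
            if thread.busy_until <= now:                  # only idle threads are candidates
                finish = now + proc_time[thread.vm_id][task.task_type]
                edges.append((i, j, finish))              # edge weight = completion moment
    return edges
```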

3.2. Novel Reinforcement Learning Algorithm

There are many possibilities for the choice of the selected task set, and we cannot try all of them in polynomial time. We guide this selection through a reinforcement learning algorithm.

Q-learning [11] is a value-based reinforcement learning algorithm. Its Q-value is the expected return obtained by taking a given action in a given state at a certain moment. The method constructs a Q-table indexed by states and actions, and the stored value represents the maximum benefit we can obtain by selecting that action in that state. The agent receives the corresponding reward from the environment according to its action, and the algorithm chooses the optimal decision according to the Q-value.

The agent, environment, reward, and action abstract the problem into a Markov decision process (MDP), and each completed task sequence is regarded as a state. A policy specifies which action to take in each state, the transition probability gives the probability of selecting an action in a state and moving to the next state, and the reward function gives the reward obtained by taking an action in a state and transferring to the next state. Our goal is to find a policy that processes all tasks and obtains the maximum reward. A discount factor and a horizon are also specified. Our goal is the following formula:
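In the standard MDP form, writing γ for the discount factor, H for the horizon, r_t for the reward at step t, and π for the policy (symbols assumed here for illustration, not taken from the paper), the objective is

\[ \max_{\pi} \; \mathbb{E}\!\left[ \sum_{t=0}^{H} \gamma^{t} r_t \right]. \]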

The Q-learning algorithm has advantages in offline learning; it uses the Bellman equation to handle the decision problem of the MDP. The state value function is used to evaluate how good the current state is. The value of each state is determined not only by the current state but also by the subsequent states, which makes actions in the current state more forward-looking. Q-learning is designed for discrete problems, so it is naturally suitable for scheduling problems. For scheduling problems, however, the action space corresponding to each state is very large, and we discuss how to deal with this later.
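For reference, the standard Q-learning update derived from the Bellman equation, with learning rate α and discount factor γ, is

\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right). \]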

Having briefly introduced the Q-learning algorithm, we now combine it with a combinatorial optimization algorithm. When a task set to be computed is given, we model it together with the idle threads in the virtual machines as a bipartite graph, where V denotes the number of vertices and E denotes the number of edges. The Hopcroft-Karp matching algorithm [12] can compute the cost of this graph, with time complexity O(E√V). The cost is given by the following formula:
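A plausible form of this cost (presumably the formula (8) referenced below), writing w_{ij} for the weight (completion moment) of the edge between task i and idle thread j and M for an assignment that covers every task in the selected set S (illustrative symbols), is

\[ \mathrm{cost}(S) = \min_{M} \sum_{(i,j) \in M} w_{ij}. \]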

We determine the selected task set through reinforcement learning and find the assignment that minimizes its total cost with the Hopcroft-Karp (HK) algorithm. In our solution, we model the tasks and the idle threads in the virtual machines as a weighted bipartite graph and further transform the problem into a minimum-cost maximum-flow problem, which is solved by a combinatorial optimization algorithm.
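As an illustration of this per-batch step, the sketch below solves the assignment of a selected batch to idle threads as a minimum-cost assignment. The paper uses an HK-based routine; SciPy's linear_sum_assignment (Hungarian method) is substituted here purely for demonstration, and it consumes the edge list produced by the earlier build_bipartite_graph sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def solve_batch(edges, n_tasks, n_threads, forbidden=1e9):
    """Pick the task-to-thread assignment with minimum total completion time."""
    # Dense cost matrix; pairs without an edge get a prohibitively large cost.
    cost = np.full((n_tasks, n_threads), forbidden)
    for i, j, w in edges:
        cost[i, j] = w
    rows, cols = linear_sum_assignment(cost)          # optimal assignment
    assignment = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < forbidden]
    total_cost = sum(cost[i, j] for i, j in assignment)
    return assignment, total_cost
```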

However, the optimization goal of the HK algorithm is formula (8), which is not completely equivalent to the goal of minimizing the total time-out. We therefore further guide this process with reinforcement learning; more specifically, our reward function is related to the average time-out obtained by the HK algorithm each time. By combining with a combinatorial optimization algorithm, our algorithm has a stable lower bound. In more detail, our algorithm outperforms a simple greedy algorithm regardless of which tasks are chosen by reinforcement learning, because we always solve for the minimum cost of a batch of tasks, whereas a greedy algorithm only considers the single task that is currently arriving.
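This claimed relationship can be written, with illustrative symbols c_i for the cost of the i-th selected batch (n batches in total) and g_j for the cost that the shortest-task-first greedy algorithm incurs on the j-th of the N tasks (these symbols are assumptions, not the paper's notation), as

\[ \sum_{i=1}^{n} c_i \;\le\; \sum_{j=1}^{N} g_j. \]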

Proof. Let one quantity be the cost of the batch selected the -th time, with all tasks completed after the last selection, and let the other be the cost of processing the -th task under the greedy algorithm that schedules the shortest task first. In the above formula, the latter summed over all tasks represents the cost for the shortest-task-first greedy algorithm to complete all tasks, and the former summed over all selections represents the total cost when the reinforcement learning algorithm selects a batch each time until all tasks are completed.

We next discuss how to guide each chosen task through reinforcement learning. Algorithm 2 demonstrates this process.

Input: All tasks
Output: Final solution and timeout
1: Episode = 0;
2: while Episode is less than the maximum number of episodes do
3:  for each batch selection step do
4:   if a random number is less than the greedy parameter then
5:    choose the task set with the best Q-value found so far;
6:   else
7:    choose a task set at random;
8:   end if
9:   Take the chosen task set as the current action;
10:   Build a bipartite graph of the chosen tasks and the idle threads;
11:   Calculate cost by HK algorithm;
12:   Compute the reward as a constant minus the average time-out;
13:   Update the Q-table with the learning parameters;
14:   if all tasks have been processed then
15:    break;
16:   end if
17:  end for
18:  Episode = Episode + 1;
19: end while
20: Return final solution and cost

We use the Q-table to judge the quality of each action and to guide the choice of the next action. We set the greedy parameter, which represents the probability of choosing the best action found in the past (line 4 and line 5); we set it to 0.2, since the higher the value, the more likely the search is to fall into a local optimum. The learning parameters control how strongly newly received rewards are incorporated (line 11 and line 12). It is worth noting that we want the total time-out of the tasks to be small, so when shaping the reward we set each reward to a constant minus the average time-out cost, so that reinforcement learning guides us toward options with less time-out.

We incorporate the combinatorial optimization algorithm in lines 9 and 10 of Algorithm 2, which gives us a stable lower bound. In the past, reinforcement learning algorithms were more inclined to choose which virtual machine each task should be assigned to; instead, we choose which batch of tasks to compute each time through reinforcement learning and then use the combinatorial optimization algorithm to compute the minimum cost.
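A minimal Python sketch of this selection loop is given below. It follows the description of Algorithm 2: with probability equal to the greedy parameter (0.2 in the text) it exploits the best batch recorded in the Q-table, otherwise it explores, and the reward is a constant minus the average time-out returned by the combinatorial step. The candidate enumeration, the constants other than 0.2, and all function names are assumptions for illustration.

```python
import random
from collections import defaultdict

GREEDY, ALPHA, GAMMA, REWARD_BASE = 0.2, 0.1, 0.9, 100.0  # only GREEDY = 0.2 comes from the text

Q = defaultdict(float)  # Q-table over (state, batch) pairs; a batch is a tuple of task ids

def choose_batch(state, candidates):
    # With probability GREEDY pick the best batch seen so far, otherwise explore at random.
    if random.random() < GREEDY and any((state, c) in Q for c in candidates):
        return max(candidates, key=lambda c: Q[(state, c)])
    return random.choice(candidates)

def update_q(state, batch, avg_timeout, next_state, next_candidates):
    # Reward = constant minus average time-out, so smaller time-outs yield larger rewards.
    reward = REWARD_BASE - avg_timeout
    best_next = max((Q[(next_state, c)] for c in next_candidates), default=0.0)
    Q[(state, batch)] += ALPHA * (reward + GAMMA * best_next - Q[(state, batch)])
```

In each episode, choose_batch would be called once per batch, the chosen batch would be handed to the assignment step described above, and update_q would then be called with the resulting average time-out.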

In reinforcement learning, the choice of which batch of tasks to process each time is considered an action, and the smaller the average time-out per task, the better the action is shown to be. However, earlier actions affect later decisions, because each decision directly affects the state of the threads in each virtual machine. We view the decision process as a tree, where each path in the tree represents a different sequence of choices. The tree is shown in Figure 2.

The root node is the moment when no tasks have been processed yet, and each time we process a batch of tasks, it corresponds to selecting a node at the next level of the tree. When we reach a leaf node, all the tasks have been processed. The whole process can thus be seen as a path from the root node all the way to a leaf node; reinforcement learning guides us to choose one path at a time, and the path we choose constitutes one action.

Each episode finds a new path from the root node of the tree to a leaf node, and along any previously taken path the state of the virtual machines is consistent, because choosing a path that has been taken before means choosing the same tasks each time as before. This greatly reduces the range explored by our agent. This is necessary in scheduling problems: we sacrifice a small amount of exploration space and gain great efficiency. In the next section, we show the outstanding performance of EATS on the dataset.

4. Experiments

In this section, we conduct experiments to evaluate EATS in IoT systems on various performance metrics and analyze the results.

4.1. Experiment Setup

All the algorithms are run on Ubuntu 16.04.5 Linux with an Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40 GHz and 256 GB RAM. To verify the effectiveness of EATS, experiments were set up to test it in different environments. In the system, 6000 tasks arrive within 1000 seconds. We vary the difficulty of the scheduling problem by adjusting the number of threads per virtual machine and the efficiency of each virtual machine, with more threads and higher efficiency representing a less difficult scheduling problem.

To test the effectiveness of the proposed algorithm EATS in IoT systems, we constructed different scenarios and chose two baseline algorithms for this article:

(1) The MIN-MIN algorithm [13], which is a classical algorithm used in scheduling problems. This algorithm focuses on the best solution for the current task at each moment, and when a new task arrives, it assigns it to the thread in the virtual machine that can complete it first. There is also a similar algorithm, MAX-MIN [14], which prioritises the tasks that take the most time to complete. In our dataset, the MIN-MIN algorithm completely dominates the MAX-MIN algorithm, so we do not show it here.

(2) The random algorithm (RDA). Although RDA does not have amazing ideas, it can still give good solutions in some cases. In most cases, RDA is not competitive with state-of-the-art algorithms, but it is a good baseline that accurately reflects the difficulty of the dataset. In our work, the amount of time-out generated by the random strategy and the proportion of time-out tasks are obtained by averaging the results of three runs.

4.2. Experiment Results

For the task scheduling problem in IoT systems, we evaluate the results of the different algorithms from two perspectives: (1) the time-out amount (the lower the time-out, the better the performance of the algorithm) and (2) the proportion of time-out tasks (the lower the value, the better the performance, where the proportion of time-out tasks is the number of tasks that have timed out so far divided by the number of tasks processed so far).

We compare the experimental results of our proposed EATS with those of the baseline algorithms in different scenarios. In Figures 3 and 4, we can see how the time-out amount and the proportion of time-out tasks change as the number of tasks increases for EATS and its competitors. In our experiments, the time-out amount is measured in seconds. As observed from the results in Figure 3, our proposed algorithm and MIN-MIN totally dominate RDA. Although RDA is not competitive with the other algorithms, it gives a clear picture of the difficulty of the task set. Both EATS and the MIN-MIN algorithm perform very well when the number of tasks is small. However, the performance of the MIN-MIN algorithm degrades as the number of tasks in the IoT system continues to increase. When 500 tasks had been generated, EATS started to incur a time-out amount.

As can be seen in Figures 3 and 4, the performance gap between the different algorithms becomes more pronounced as the number of tasks increases. The poor performance of RDA in this set of environments also indicates that the set of tasks to be scheduled is difficult to handle. The harder cases tend to be closer to the real world, while also revealing the performance differences between algorithms.

When processing the full task set, Figure 3 shows that the MIN-MIN algorithm has almost twice the time-out amount of our algorithm, but Figure 4 shows that the proportion of timed-out tasks for MIN-MIN does not reach twice that of our algorithm. This indicates that our algorithm is forward-looking enough that it may choose to let a current task time out while ensuring that the total time-out is as small as possible. This foresight is important in large-scale task scheduling problems; when the number of tasks grows, focusing only on the task at hand leads to poor performance.

To show the effect of the proposed EATS in different environments, we test 2000 tasks generated in IoT systems under the condition that the virtual machine efficiency is 1/2 of normal, and we adjust the number of threads in each virtual machine. As shown in Figures 5 and 6, when the number of threads in a virtual machine increases, the problem becomes simpler. As the RDA algorithm has a much higher time-out amount in this scenario, we do not show its time-outs in Figure 5. Even though this set of scenarios only handles 2000 tasks, it is extremely difficult. We can see that the lower the number of threads in the virtual machine, the more difficult the problem becomes. When the problem becomes difficult, the performance gap between our algorithm and the others also becomes larger, which shows that our proposed algorithm performs well in different scenarios.

Not only do changes in the number of threads affect the difficulty of the scheduling problem, but changes in the efficiency of the virtual machines can also make the problem harder. To better demonstrate the practicality of our proposed algorithm, we tested the relationship between virtual machine efficiency and the amount and proportion of task time-outs, as shown in Figures 7 and 8.

As we can see, the problem becomes more difficult when the virtual machines become less efficient. When the virtual machine efficiency drops to 1/5 of the original, the proportion of timed-out tasks exceeds half. However, the impact does not seem to be as large as that of reducing the number of threads. This is because having fewer threads not only makes scheduling more difficult but also limits the number of tasks that can be processed simultaneously, whereas lower virtual machine efficiency does not directly affect the algorithm's decisions.

The experimental findings show that, in various environments, our proposed algorithm EATS performs better than the comparison algorithms. Additionally, the relationship between the time-out amount and the proportion of time-out tasks demonstrates that EATS is adequately forward-looking: it is willing to give up the optimal choice for the current task in order to guarantee the least total time-out over all tasks.

Different from other algorithms, we focus on the scheduling scheme of a batch of tasks at a time; some tasks in a batch may be allowed to time out, but the overall impact is positive. Moreover, reinforcement learning guides this process, which makes this advantage even more obvious.

Although the RDA algorithm is not competitive with the state-of-the-art algorithms, we demonstrate the difficulty of the dataset through the performance of RDA. Our proposed EATS also has outstanding results in difficult scenarios, which shows that our proposed algorithm is efficient and practical.

5. Related Work

With the rapid development of IoT systems, they are widely used in different scenarios. Recently, Chen et al. [15] proposed a game-theoretic approach for QoS-aware computation offloading of IoT in LEO satellite edge computing, which can better deal with complex scenarios. Moreover, Boursianis et al. [16] supported an intelligent irrigation system for precision agriculture through an Internet of Things platform. This article also focuses on issues in IoT systems, which are often deployed in cloud computing environments. With the development of IoT technology, many problems in the Internet of Vehicles can also be better solved [17, 18].

Many areas deal with scheduling and resource allocation. For the virtual machine resource allocation problem, Li et al. [19] used a reinforcement learning method to approximate the optimal allocation strategy based on the feedback state and reward. Li et al. [20] decomposed the transformed problem into subproblems in the resource allocation problem of IoT devices in smart buildings and solved them with stochastic optimization techniques. In addition, many improved methods for task scheduling have been proposed in recent years [21].

Cloud computing is an important platform for supporting IoT applications, and edge computing is no less important than task scheduling in the cloud. In our previous work, there has been a lot of research on edge computing. Chen et al. [22] studied the edge caching of IoT services; they used non-orthogonal multiple access (NOMA) technology to improve the efficiency of resource transmission and reformulated the optimization problem as a non-cooperative game model. Wu et al. [23] used edge computing to drive target detection and image enhancement. Chen et al. [24] focused on the offloading problem in edge-cloud systems and proposed game-based decentralized task offloading (GDTO) to obtain the offloading strategy and analyze the upper bound on the convergence time.

Many problems in cloud computing and the IoT are difficult, and it is crucial to dynamically adapt the decisions we make. An approach to data security in the IoT is proposed by Cai et al. and Cai and Zheng [25, 26]. In this article, we also dynamically select the tasks to be computed each time by means of reinforcement learning.

Recently, Wan et al. [27] proposed edge computing-based preprocessing methods that can effectively reduce the demand on the cloud. You et al. [28] focus on the joint task scheduling problem in mobile edge computing: they divide the problem into multiple subproblems and define an optimization problem that minimizes the overall energy consumption of all UAVs, using the successive convex approximation technique and the branch-and-bound method to obtain high-quality solutions to the subproblems. Similarly, EATS also focuses on the optimal solution of subproblems: it selects the subproblem for each computation by means of a reinforcement learning algorithm and processes the subproblem with a combinatorial optimization algorithm. We devise an efficient algorithm for the setting where task arrival times are known in advance.

Finally, we need to model the problem, which we do by modelling the tasks and idle virtual machines in the IoT system as a bipartite graph. There are many ways of modelling different problems [29]. Chen et al. [30] proposed a flood prediction model using BiGRU with an attention mechanism based on an IoT system. In the Industrial Internet of Things (IIoT), Huang et al. [31] built a Markov queuing model that captures the dynamics of IoT devices and edge servers and designed intelligent computing methods. A differentially private framework for predicting traffic flow was proposed by Cai et al. [32]. In the future, we will consider additional modelling approaches to better deal with problems in IoT systems.

6. Conclusions

In this article, we focus on the problem of task scheduling in large-scale IoT systems. We propose a novel reinforcement learning algorithm, EATS, which incorporates a combinatorial optimization algorithm, and we prove its lower bound. The experimental results show that the EATS algorithm proposed in this paper has outstanding performance in different environments. In future work, we will further focus on the performance of reinforcement learning algorithms in other IoT applications.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was partly supported by the National Natural Science Foundation of China (61902029), Project of Cultivation for Young Top-Notch Talents of Beijing Municipal Institutions (BPHR202203225), R&D Program of Beijing Municipal Education Commission (No. KM202011232015), Project for Acceleration of University Classification Development (Nos. 5112211036, 5112211037, and 5112211038), and BISTU College Students Innovation and Entrepreneurship Training Program (No. 5112210832).