Abstract
Deep Neural Network (DNN) models have achieved excellent performance in many inference tasks and are widely used in intelligent applications. However, DNN models often require substantial computational resources to complete inference tasks, which hinders their deployment on resource-constrained edge devices. To extend the application scenarios of DNN models, edge-cloud collaborative inference methods, represented by model partition, have attracted much research attention in recent years. In scenarios with multiple edge devices, edge-cloud collaborative inference requires partial migration of tasks, but traditional scheduling methods only migrate tasks at the task level. In this paper, we propose two task scheduling methods that solve the problem of partial migration of tasks in multi-edge scenarios. The first scheduling method is based on the optimal cutting of a single DNN: the cutting positions of all the models are the same, regardless of external factors. This method is suitable for chain and directed acyclic graph (DAG) type DNNs. The second scheduling method takes external factors such as congestion and queuing delay at the cloud side into consideration and dynamically selects the cutting position of each DNN to optimize the overall delay; it is applicable to chain DNN models. The experimental results show that, compared with the baseline method, our proposed scheduling methods can reduce the delay by up to 6.48x.
1. Introduction
With the rapid development of Deep Neural Networks (DNNs), edge intelligence has been widely applied in Internet of Things (IoT) scenarios in recent years [1, 2]. The growing proliferation of IoT devices with high-quality sensors will result in massive data streaming to the edge or the cloud. Deployed edge devices generally have constraints on energy and computational capacity [3]. Thus, processing data and performing inference only at the edge causes high delays that cannot meet the requirements of most applications. Cloud servers have high computational capacity; however, transferring all data to the cloud incurs a round-trip network delay that is intolerable under bad network conditions. Moreover, transferring all data to the cloud carries the risk of privacy leakage [4]. Edge-cloud collaborative inference, which assigns tasks between the edge and the cloud on the basis of application constraints, has therefore become a research focus.
In the edge computing environment, there are usually multiple edge devices collecting data at the same time, but the number of corresponding cloud servers is very small [5]. If a lot of tasks are generated at the edge at the same time, how to better schedule these tasks to reduce the overall delay is a problem we need to study.
Edge-cloud collaborative inference can adaptively cut the DNN according to the network bandwidth without changing the original model parameters of the DNN [6], so as to minimize the delay or maximize the throughput. For a deployment with a single edge device, determining the cutting position from the network bandwidth gives the optimal solution. However, in a real edge environment, there are often multiple edge devices generating tasks. Because the number of edge tasks is large while the number of tasks that the cloud can process at the same time is limited, how to schedule these tasks is very important. In this paper, we design a system that can perform task scheduling in multi-edge scenarios.
For a single-edge system, we determine how the DNN is cut based on the current network status and the fixed configuration parameters of the system (such as the execution time of the DNN on the edge and the cloud), and find the optimal cutting point with the goal of minimizing delay or maximizing throughput. However, in a multi-edge scenario, the cutting point that minimizes the delay of a single task is not necessarily optimal for the entire system. Because the processing capacity of the cloud is limited, when the number of tasks generated at the edge is much greater than that capacity, many tasks will be blocked. When an edge device completes its part of the inference task, it transmits the intermediate result to the cloud; if the cloud cannot process it in time, the task is blocked. Although the execution time of each task at the edge is optimal, the waiting time in the cloud increases the processing time of the entire system.
In this scenario, minimizing the delay of each DNN may not be the optimal solution for the entire system. If cloud task congestion is detected, the cutting position of the DNN can be adjusted so that work that would otherwise wait in the cloud is executed at the edge. Although the delay of a single DNN is then no longer minimal, the overall delay of the system becomes smaller. To our knowledge, most current scheduling schemes for DNN models schedule the entire DNN model. Because of the dependencies and complexity of the scheduling problem in edge-cloud collaborative systems, few studies focus on its scheduling strategy. In [7], the authors studied the scheduling system in the scenario of multiple IoT devices; they assumed that each type of IoT device has different computing capabilities and designed an online scheduling (Online) system. Online can decide whether a DNN is deployed locally or in the cloud, and it can also adjust the scheduling sequence. The authors compared Online with first-come, first-served (FIFO) and low-bandwidth first deployment (LBF) strategies. Compared with the other two methods, Online can improve the overall quality of service. In [8], the authors formulated a migration scheduling problem for DNN tasks in the edge environment, gave the formal definition and evaluation criteria of the problem, and proposed a greedy algorithm and a genetic algorithm for the migration. The resulting solutions are approximately optimal and can effectively solve the migration and deployment problems of DNNs.
In this paper, we design two task scheduling strategies for edge-cloud collaborative inference. The first strategy finds the optimal cutting position of the DNN model for every single task and is suitable for DNN models with DAG and chain topologies. The second strategy decides the cutting position on the basis of network and cloud congestion conditions to approach a global optimum for all tasks and is suitable for DNN models with chain topology.
The remainder of this paper is organized as follows. In Section 2, we conduct a brief literature review of related studies. In Section 3, we define the problem to be addressed. In Section 4, we introduce our approach in detail. Section 5 shows the results of the experiments. Section 6 discusses the results, and we conclude in Section 7.
2. Related Work
Much work has been done to accelerate the inference of DNN models in IoT scenarios. In order to reduce the delay of inference at edge devices, the main research interests are divided into three aspects: model compression [9], distributed model deployment [10], and computation offloading [11].
Compression of DNN models uses techniques such as pruning [12], quantization [13], and knowledge distillation [14] to reduce the number of computation operations in model inference without significant accuracy reduction [15]. Although model compression speeds up inference on edge devices significantly, the compression itself needs a lot of computation, and inference with a compressed model may require specific hardware. For application environments that contain edge devices with heterogeneous architectures, this method is not well suited. There are multiple studies on processing DNN models effectively on resource-constrained devices [16, 17]. To run DNN inference effectively on resource-constrained devices, researchers carefully design the hardware architecture [18], and hardware-acceleration methods are usually combined with model compression methods, which can significantly improve the inference efficiency of specific models [19]. However, these methods are less general across applications than other methods. Edge devices in real application scenarios are generally heterogeneous in architecture, and hardware acceleration requires specific computation units or ICs to execute the model effectively. Distributed deployment of DNN models can make full use of the computing resources of devices [10] and is suitable for large-scale applications. However, in edge intelligence environments, the number of devices may vary dynamically, and distributed deployment cannot handle this well. Moreover, distributed deployment on multiple edge devices may cause network congestion, especially under bad network conditions.
Edge-cloud collaborative inference has unique advantages over the first two technologies: unlike lightweight models and distributed network deployment, it does not change the original model. Edge-cloud collaborative inference is highly scalable and can be combined with the other two technologies. For edge-cloud collaborative inference, much previous work aims to reduce the overall inference delay [8, 20–23], to satisfy resource constraints [24–27] (bandwidth, power consumption, etc.), or to protect privacy [28]. When applying traditional artificial intelligence technology to edge computing, which is usually resource-constrained, researchers mainly follow three kinds of ideas: sample-dependent DNN model selection [29], lightweight DNN architecture design [30] or DNN model compression [15], and edge-cloud collaborative inference by cutting DNN models and scheduling tasks between edge and cloud [25].
In [29], the authors apply an autoencoder to compress the data transmitted to cloud platforms. Kang et al. [6] first proposed Neurosurgeon, a method that partitions the DNN model to execute on end devices and cloud platforms simultaneously to improve the efficiency of model inference. It cuts a DNN into two parts that execute on a mobile device and the cloud platform, respectively, to accelerate inference. However, Neurosurgeon can only partition chain-like DNNs; it cannot handle DAG-structured DNNs, which limits its application. Moreover, Neurosurgeon does not predict execution time accurately enough because it relies on linear regression, which leads to a non-optimal partition of the model. Teerapittayanon et al. proposed DDNN [31] to reduce the inference latency of a DNN model over distributed computing hierarchies consisting of the cloud, the edge (fog), and end devices. However, DDNN is designed for BranchyNet [32] and is hard to extend to other types of DNN models. In [33], the authors presented a DNN as an encoding pipeline that encodes the feature space and transmits it to the cloud, which improves the energy efficiency and throughput of model inference. Edgent [21] exploits two knobs, DNN partitioning and DNN right-sizing, to find the optimal cutting point in a dynamic network environment. Hu et al. [22] proposed DADS, a partition scheme that optimally cuts DNNs with DAG topology under different network conditions. However, it fails to reduce the overall delay because of the high time complexity of the algorithm, so it cannot guarantee real-time operation. In [34], the authors studied the mobile Web AR scenario for edge-cloud collaboration and proposed a fine-grained adaptive DNN partition mechanism. In [35], the authors studied the edge-cloud inference of RNN models; their method decides whether to offload to the cloud depending on network condition and input size. Wang et al. [36] proposed a task scheduling algorithm, based on the catastrophic genetic algorithm (CGA), for tasks that need to be transferred to the cloud, in order to satisfy the latency constraint. In [37], a novel DNN architecture was designed for edge-cloud collaborative inference. However, this method is difficult to apply to currently running IoT applications. In [38], AutoSplit was proposed as an industry solution for DNN splitting in edge-cloud collaborative inference.
3. Problem Definition
First, we consider the chain topology DNN. For a given chain DNN model with $n$ layers, we construct it as a chain $G = (V, E)$. Each vertex $v_i \in V$ corresponds to one layer in the model, and each edge $(v_i, v_{i+1}) \in E$ corresponds to two adjacent layers of the model with data transmission between them. $T^{e}_{i,j}$ and $T^{c}_{i,j}$ represent the delays of executing the $i$th through the $j$th layers at the edge and the cloud, respectively. For a single DNN under the edge-cloud collaborative inference strategy, if the cut is made at the $i$th layer of the DNN, layers $1$ to $i$ are deployed on the edge, and layers $i+1$ to $n$ are deployed in the cloud. The output of vertex $v_i$, of size $D_i$, is transmitted to the cloud through the network. When the bandwidth is $B$, the total inference delay is $T(i) = T^{e}_{1,i} + D_i / B + T^{c}_{i+1,n}$. The best cutting layer of a single DNN is the point where the total inference delay is minimized. The optimization objective of a single DNN is

$$i^{*} = \arg\min_{0 \le i \le n} \left( T^{e}_{1,i} + \frac{D_i}{B} + T^{c}_{i+1,n} \right). \tag{1}$$

For a chain DNN, we apply an iterative algorithm that enumerates the inference delay obtained when the DNN is cut at each layer and takes the minimum value. In the above equation, $i = n$ means that the entire DNN is deployed at the edge (nothing is transmitted to the cloud), and $i = 0$ means that the entire DNN is deployed in the cloud (the raw input, of size $D_0$, is transmitted). When multiple edge devices generate tasks at the same time, each DNN has an edge inference delay $T_e$, a transmission delay $T_t$, and a cloud inference delay $T_c$. Since the cloud is not always able to process the tasks sent from the edge in time, the waiting delay of each task also needs to be considered. The goal of optimization therefore changes from minimizing the delay of a single DNN to minimizing the overall delay of the system.
As shown in Figure 1, there are $m$ edge servers and one cloud server in the multi-edge scenario. Assume that the edge servers have a total of $N$ DNN inference tasks, and all tasks use the same type of DNN for inference.
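The enumeration of cutting points for a chain DNN can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `best_cut`, the per-layer profile lists, and the example numbers are all assumptions made here, and the sketch assumes that the edge-only case uploads nothing and the cloud-only case uploads the raw input.

```python
from typing import Sequence, Tuple


def best_cut(
    edge_s: Sequence[float],    # per-layer execution time on the edge (seconds)
    cloud_s: Sequence[float],   # per-layer execution time in the cloud (seconds)
    out_bits: Sequence[float],  # output size of each layer (bits)
    input_bits: float,          # size of the raw input (bits)
    bandwidth_bps: float,       # uplink bandwidth (bits per second)
) -> Tuple[int, float]:
    """Return (cut index i, total delay); layers 1..i run on the edge."""
    n = len(edge_s)
    best_i, best_delay = 0, float("inf")
    for i in range(n + 1):
        edge = sum(edge_s[:i])      # layers 1..i on the edge
        cloud = sum(cloud_s[i:])    # layers i+1..n in the cloud
        if i == n:
            transfer = 0.0          # edge-only: nothing is uploaded (assumption)
        else:
            sent = input_bits if i == 0 else out_bits[i - 1]
            transfer = sent / bandwidth_bps
        delay = edge + transfer + cloud
        if delay < best_delay:
            best_i, best_delay = i, delay
    return best_i, best_delay


# Hypothetical 3-layer profile, for illustration only.
edge = [1.0, 1.0, 1.0]
cloud = [0.1, 0.1, 0.1]
sizes = [8.0, 0.5, 4.0]
cut, delay = best_cut(edge, cloud, sizes, input_bits=10.0, bandwidth_bps=1.0)
print(cut, delay)  # the cut lands after layer 2, where the output is smallest
```

With a very high bandwidth the same profile is best served cloud-only (cut 0), and with a very low bandwidth edge-only (cut 3), which matches the two boundary cases of the objective.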
4. Method Design
In this section, we design two scheduling strategies to solve the problem of partial migration of tasks in multi-edge scenarios. The first strategy, named Single Cutting, treats every DNN in the same way; it is based on finding the optimal cutting position of a single DNN. The second strategy, named Scheduling with Queuing, takes other external factors into account; it adjusts the cutting position dynamically according to the network and cloud queuing conditions.
4.1. Single Cutting
For each DNN inference task, we schedule using the strategy that minimizes the delay of a single DNN. Since all tasks use the same type of DNN, the optimal cutting point is the same for every task when the bandwidth is constant. We use an iterative algorithm to find this optimal cutting point together with the corresponding edge processing time, data transmission time, and cloud processing time.
We use FIFO to schedule all tasks. The scheduler keeps running as long as the task set still contains tasks. Each edge device maintains a variable that records the earliest time at which it can process its next task. In each poll, if an edge device still has an unprocessed task, we update this variable to the maximum of the arrival time of the current task and the variable's previous value and then add the edge processing time of the current task.
After updating the time at which every edge device can process its next task, we select the edge device whose task can be processed earliest. Next, we check whether the task queue in the cloud is full. If the task queue is not full, we transfer the intermediate results of the current task to the cloud task queue. Otherwise, the current task enters a blocking state; that is, it waits until the cloud has processed the task at the head of the queue and is then transferred to the cloud for processing. After the cloud finishes processing a task, it updates the processing time of the cloud and wakes up the blocked task to start the next poll. The whole process is shown in Algorithm 1.
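The polling procedure described above can be approximated with a small simulation. The sketch below is an assumption-laden illustration rather than Algorithm 1 itself: the function name `simulate_single_cutting` is hypothetical, the cloud is assumed to process one task at a time in FIFO order, a blocked task is assumed not to stall its edge device, and contention on a shared uplink is ignored.

```python
from typing import List


def simulate_single_cutting(
    arrivals: List[List[float]],  # arrivals[e] = sorted task arrival times on edge e
    t_edge: float,                # edge processing time at the fixed cut
    t_trans: float,               # transmission time at the fixed cut
    t_cloud: float,               # cloud processing time at the fixed cut
    queue_size: int,              # capacity of the cloud task queue
) -> float:
    """Return the time at which the last task finishes in the cloud."""
    edge_free = [0.0] * len(arrivals)  # earliest time each edge can start its next task
    next_idx = [0] * len(arrivals)     # index of the next pending task on each edge
    cloud_done: List[float] = []       # cloud completion times, in FIFO order
    makespan = 0.0
    while True:
        # Edge completion time of the next pending task on every edge.
        candidates = [
            (max(arrivals[e][next_idx[e]], edge_free[e]) + t_edge, e)
            for e in range(len(arrivals))
            if next_idx[e] < len(arrivals[e])
        ]
        if not candidates:
            break
        ready, e = min(candidates)     # edge whose task is ready earliest
        edge_free[e] = ready
        next_idx[e] += 1
        send = ready
        if len(cloud_done) >= queue_size:
            # Queue is full until the task queue_size positions back has finished.
            send = max(send, cloud_done[-queue_size])
        arrive = send + t_trans
        start = max(arrive, cloud_done[-1] if cloud_done else 0.0)
        done = start + t_cloud
        cloud_done.append(done)
        makespan = max(makespan, done)
    return makespan


# Hypothetical workload: two edges, three tasks arriving at time 0.
tasks = [[0.0, 0.0], [0.0]]
print(simulate_single_cutting(tasks, 1.0, 1.0, 2.0, queue_size=1))   # 10.0
print(simulate_single_cutting(tasks, 1.0, 1.0, 2.0, queue_size=20))  # 8.0
```

In this toy workload a queue of size 1 forces the second and third tasks to block before uploading, which lengthens the total time compared with a queue that never fills.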

4.2. Scheduling with Queuing
Scheduling strategy 1 uses the optimal division of a single DNN for each task, but this is not necessarily the optimal solution globally: regardless of whether the cloud is currently congested, each DNN is cut at the point that is optimal under non-congested conditions. This may leave the edge end idle while tasks wait in the cloud, which makes the overall time longer.
Suppose the current DNN inference task to be processed is $k$, and the start time at which the edge end can process it is $s_k$. Let $p^{*}$ be the optimal cutting point of a single DNN, with edge processing time $T_e$, data transmission time $T_t$, and cloud processing time $T_c$. If the task is divided at $p^{*}$, the time at which it reaches the cloud is $a_k = s_k + T_e + T_t$, and the time at which the cloud finishes executing the previous task is $f_{k-1}$. If $a_k \ge f_{k-1}$, the task does not wait when it reaches the cloud, and cutting at $p^{*}$ is the optimal solution for the task. If $a_k < f_{k-1}$, the task waits for $w_k = f_{k-1} - a_k$ after it arrives in the cloud. In this case, a new cutting point $p'$ can be selected after $p^{*}$ to divide the task. At the new cutting point, the edge end processing time is $T'_e$, the data transmission time is $T'_t$, the cloud processing time is $T'_c$, and the waiting time is $w'_k = \max(0, f_{k-1} - (s_k + T'_e + T'_t))$. If $T'_e + T'_t + w'_k + T'_c < T_e + T_t + w_k + T_c$, then for this task the new cutting point is better than $p^{*}$: the waiting time is shorter, and because the new cutting point is after the optimal cutting point, fewer layers are executed in the cloud.
However, at the new cutting point the edge execution time $T'_e$ is longer than $T_e$, which may make the next task on the same edge device wait too long. This may not be conducive to reducing the overall delay, so we need to take into account the waiting delay of the next task on the edge device. Assume that the next task that needs to be inferred at the current edge is $k+1$, with arrival time $A_{k+1}$. At the original cutting point, the waiting time of task $k+1$ is $\max(0, s_k + T_e - A_{k+1})$, so the overall waiting time of the system is

$$W = w_k + \max\left(0,\; s_k + T_e - A_{k+1}\right). \tag{2}$$

Here $A_{k+1}$ represents the arrival time of the next task. Similarly, the overall waiting delay of the system at the new cutting point is

$$W' = w'_k + \max\left(0,\; s_k + T'_e - A_{k+1}\right). \tag{3}$$

It is necessary to ensure that the overall delay of Equation (3) is less than that of Equation (2), that is, $W' < W$. Based on the above analysis, we design scheduling strategy 2, and Algorithm 2 describes the process of scheduling strategy 2.
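The cut-adjustment step can be illustrated with a short sketch. It is a simplified reading of the analysis above, not Algorithm 2: the names `total_wait` and `choose_cut`, the per-cut profile tuples, and the example numbers are assumptions introduced here. The sketch only compares the overall waiting time (the current task's wait in the cloud plus the next task's wait at the edge) for cutting points at or after the single-DNN optimum, and moves the cut only when that waiting time strictly decreases.

```python
from typing import Optional, Sequence, Tuple

# (edge time, transmission time, cloud time) for one cutting point
Profile = Tuple[float, float, float]


def total_wait(
    profile: Profile,
    start: float,                    # time the edge can start the current task
    prev_cloud_done: float,          # time the cloud finishes the previous task
    next_arrival: Optional[float],   # arrival time of the next task on this edge
) -> float:
    """Cloud wait of the current task plus edge wait of the next task."""
    t_edge, t_trans, _ = profile
    cloud_wait = max(0.0, prev_cloud_done - (start + t_edge + t_trans))
    edge_wait = 0.0
    if next_arrival is not None:
        edge_wait = max(0.0, start + t_edge - next_arrival)
    return cloud_wait + edge_wait


def choose_cut(
    profiles: Sequence[Profile],     # profiles[p] for cutting point p = 0..n
    best_single: int,                # optimal cutting point of a single DNN
    start: float,
    prev_cloud_done: float,
    next_arrival: Optional[float] = None,
) -> int:
    """Pick a cut at or after best_single that lowers the overall waiting time."""
    chosen = best_single
    best = total_wait(profiles[best_single], start, prev_cloud_done, next_arrival)
    for p in range(best_single + 1, len(profiles)):
        wait = total_wait(profiles[p], start, prev_cloud_done, next_arrival)
        if wait < best:              # move the cut only on strict improvement
            chosen, best = p, wait
    return chosen


# Hypothetical profiles for cutting points 0..3; cut 1 is the single-DNN optimum.
profiles = [(0.0, 4.0, 1.0), (1.0, 1.0, 2.0), (2.0, 1.5, 1.0), (5.0, 0.0, 0.0)]
print(choose_cut(profiles, 1, start=0.0, prev_cloud_done=3.5, next_arrival=10.0))  # 2
print(choose_cut(profiles, 1, start=0.0, prev_cloud_done=2.5, next_arrival=0.0))   # 1
```

In the first call the cloud is busy until time 3.5, so running one more layer on the edge removes the wait; in the second call the next task is already waiting at the edge, so a later cut would only shift the waiting from the cloud to the edge and the original cut is kept.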
4.3. DAG-Type DNN Scheduling Method
In the above sections, we took the chain DNN as an example and proposed two scheduling algorithms. The first method is that for each DNN, we cut at the optimal cutting point of a single DNN and then schedule. The second method is to consider the waiting delay of the DNN in the cloud and find the cutting point that can make the overall delay smaller on the basis of the optimal cutting point of a single DNN to cut and then perform scheduling.
In Single Cutting, the cutting point of each DNN is fixed, so for DAG-type DNNs, we can replace the cutting-point search in Single Cutting with the QDMP algorithm. The Scheduling with Queuing method needs to enumerate the cutting points to find one that makes the waiting delay smaller and thus shortens the overall delay of the system. For the chain model, we can find the optimal cutting point by enumerating the cutting points sequentially. For the DAG model, however, the optimal cut often corresponds to a set of vertices and edges, and enumerating all such sets is an NP-hard problem. We cannot find a better set of cut edges by enumerating vertices sequentially. Therefore, in this article, we only discuss Single Cutting on DAG-type DNNs.

5. Experimental Evaluation
5.1. Experimental Environment Settings
We use multiple edge devices and a single cloud for scheduling simulation. For the edge, we use 5 Raspberry Pi 3B platforms, each equipped with a 4-core ARM processor and 1 GB of RAM. For the cloud, we use a laboratory server equipped with an 8-core Intel Core i7-9700 CPU and an NVIDIA RTX 2080 Ti GPU. For the dataset, we use a self-collected video dataset to evaluate our proposed scheduling algorithms. Each edge device samples the video, extracting 5 to 15 frames per second, and uses the DNN model for inference. In the experiment, we consider six different DNN models: three chain DNN models and three DAG-type DNN models. The three chain models are AlexNet, TinyYOLO, and DarkNet19. The three DAG models are AlexNetParallel, ResNet18, and GoogLeNet. We evaluate the edge-only method, the cloud-only method, the Single Cutting method, and the Scheduling with Queuing method.
5.2. Inference Delay under 3G and 4G Networks
We first compare the inference delays of different methods in different network states. We use 3G and 4G as the transmission technologies, with theoretical maximum uplink bandwidths of 1.1 Mbps and 5.85 Mbps, respectively. We use 5 Raspberry Pi 3B devices; each one takes a video stream as input and samples 30 frames from the video, and the cloud cache queue size is 20. In Table 1, we list the total inference delay of the system under different scheduling methods using different DNNs. For chain DNNs, we use both scheduling methods. For DAG-type DNNs, we use Single Cutting.
We further evaluated the speedup of different scheduling methods with the “edge-only” delay as a baseline. SC and SQ represent the inference delay of the Single Cutting strategy and of Scheduling with Queuing, respectively, and the speedup is defined as $T_{\text{edge}} / T_{\text{method}}$, where $T_{\text{edge}}$ represents the inference delay of the edge-only method and $T_{\text{method}}$ represents the inference delay of the compared method. “Edge-only” is used as the baseline, so its speedup ratio is 1. The edge-only method executes the whole task on edge devices, while the cloud-only method executes the whole task on the cloud server.
Figure 2 compares the methods on chain DNNs. Figure 2(a) shows the speedup ratios of the four methods in the case of 3G. Both Single Cutting and Scheduling with Queuing surpass the edge-only and cloud-only methods. In the case of 3G, the efficiency of Single Cutting and Scheduling with Queuing is the same. Compared with the edge-only and cloud-only methods, they achieve speedups of 2.03 to 3.30 times and 2.22 to 6.48 times, respectively. Figure 2(b) shows the speedup ratios of the four methods in the case of 4G. Compared with the edge-only and cloud-only methods, Single Cutting and Scheduling with Queuing achieve speedups of 1.9 to 5.48 times and 1.05 to 1.57 times, respectively. On both AlexNet and DarkNet19, Scheduling with Queuing achieves the best speedup ratio. On TinyYOLO, Single Cutting and Scheduling with Queuing have the same efficiency.
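As a small worked illustration of the speedup metric, the ratio of the edge-only delay to the delay of the compared method can be computed as below. The function name `speedup` and the delay values are hypothetical and are not taken from Table 1.

```python
def speedup(edge_only_delay: float, method_delay: float) -> float:
    """Speedup of a method relative to the edge-only baseline."""
    if method_delay <= 0:
        raise ValueError("method_delay must be positive")
    return edge_only_delay / method_delay


# Hypothetical total delays in seconds, for illustration only.
edge_only = 120.0
print(speedup(edge_only, edge_only))  # 1.0 (the baseline itself)
print(speedup(edge_only, 40.0))       # 3.0 (a method three times faster)
```

A value above 1 means the method finishes sooner than edge-only inference, and a value below 1 means it is slower.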
Figure 2: Speedup of the compared methods on chain DNNs: (a) 3G network; (b) 4G network.
Figure 3 compares the methods on DAG-type DNNs. Figure 3(a) shows the speedup ratios of the three methods in the case of 3G. The cloud-only method performs the worst due to network bandwidth limitations. Compared with the edge-only and cloud-only methods, Single Cutting achieves speedups of 1.11 to 2.72 times and 2.07 to 4.12 times, respectively. Figure 3(b) shows the speedup ratios of the three methods in the case of 4G. The cloud-only method surpasses the edge-only method on the three models. On AlexNetParallel and GoogLeNet, the efficiency of Single Cutting is the same as that of the cloud-only method. On ResNet18, the efficiency of Single Cutting is 1.34 times higher than that of the cloud-only method. Compared with the edge-only method, Single Cutting achieves a speedup of 1.11 to 2.72 times.
Figure 3: Speedup of the compared methods on DAG-type DNNs: (a) 3G network; (b) 4G network.
5.3. The Influence of the Number of Edge Devices on the Inference Delay
In this section, we compare the performance of different methods with different numbers of edge devices. As before, we use “edge-only” as the baseline to evaluate the speedup ratio of different methods. We use AlexNet and ResNet18 to verify the effectiveness of our scheduling algorithms under different numbers of edge devices.
As shown in Figure 4, we deployed AlexNet on the edge and compared the speedup ratios of the four methods on the chain DNN with 1 to 5 edge devices. Single Cutting and Scheduling with Queuing are better than the edge-only and cloud-only methods for every number of edge devices. Compared with the edge-only method, Single Cutting achieves a speedup of 2.88 to 3.27 times, and Scheduling with Queuing achieves a speedup of 3.12 to 3.75 times.
As shown in Figure 5, we deployed ResNet18 on the edge and compared the speedup ratios of the three methods on the DAG-type DNN with 1 to 5 edge devices. Compared with the edge-only method, Single Cutting (Algorithm 1) achieves a speedup of 2.94 to 3.20 times. Compared with the cloud-only method, it achieves a speedup of 1.25 to 1.43 times.
5.4. The Influence of Network Bandwidth on Inference Delay
We compared the impact of network bandwidth on the performance of different methods. We use “edge-only” as the baseline to evaluate the speedup of different methods. During the experiment, we increased the network bandwidth from 0 Mbps to 12 Mbps. As before, we used AlexNet and ResNet18 to verify the influence of network bandwidth on our scheduling algorithms.
As shown in Figure 6, we deployed AlexNet at the edge and measured its inference delay with the different methods under different network conditions. When the bandwidth is 0 Mbps, the cloud-only method performs the worst, and the two scheduling algorithms have the same efficiency as the edge-only method. When the bandwidth is 1 Mbps, Single Cutting and Scheduling with Queuing have the same efficiency, which is better than that of the two baseline methods. When the bandwidth is between 2 Mbps and 7 Mbps, the efficiency of Scheduling with Queuing is higher than that of Single Cutting and the edge-only method. When the bandwidth is greater than 7 Mbps, Single Cutting, Scheduling with Queuing, and the edge-only method have the same efficiency.
As shown in Figure 7, we deployed ResNet18 at the edge and measured its inference delay with the different methods under different network conditions. When the bandwidth is 0 Mbps, the cloud-only method performs the worst, and the scheduling algorithm has the same efficiency as the edge-only method. When the bandwidth is between 1 Mbps and 8 Mbps, Single Cutting (Algorithm 1) performs best, at most 2.85 times faster than the edge-only method and at most 1.93 times faster than the cloud-only method. When the bandwidth is greater than 8 Mbps, Single Cutting has the same efficiency as the edge-only method.
6. Discussion
In Section 5, we carried out the experiments using five Raspberry Pi platforms as edge devices. According to the experimental results, the proposed scheduling algorithms can handle scenarios with more than 5 edge devices. As the number of edge devices increases, the number of tasks becomes very large while the network resources remain constrained; the first scheduling algorithm, Single Cutting, may then be unable to find the optimal cutting point and fail to accelerate inference, whereas the second scheduling algorithm, which considers the network delay and queuing delay, can still find a good cutting point to reduce the overall delay. The impact of the cloud server is reflected in the queuing delay in the second scheduling algorithm: the larger the capacity of the cloud server, the smaller the queuing delay during scheduling. Moreover, different capacities of edge devices and cloud servers lead to different DNN model partitions in the proposed algorithms, while the overall delay stays low.
7. Conclusion
In this paper, we propose two DNN scheduling algorithms for edge-cloud collaborative inference systems. The difference from previous work is that, in the case of multiple edges, we consider partial migration of tasks instead of whole-task migration. The first scheduling algorithm performs scheduling based on the optimal decision for a single DNN, and every task is migrated at the same split point. Combined with the QDMP algorithm, it can be applied to the scheduling of both chain DNNs and DAG-type DNNs and therefore has a wide range of applicability. The second scheduling algorithm searches for a partition point that makes the overall delay smaller, based on factors such as the waiting time of tasks in the cloud and the execution interval between tasks. This scheduling algorithm can be used for chain DNN scheduling.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Authors’ Contributions
Shigeng Zhang and Yue Zhang are the corresponding authors of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grants 61772559, 61901529, and 61902434 and the Natural Science Foundation of Hunan under Grants 2020JJ5776 and 2019JJ50826.