Abstract

Mobile edge computing (MEC) is widely regarded as a promising and powerful paradigm for bringing enterprise applications closer to data sources such as IoT devices or local edge servers. It is capable of energizing novel mobile applications, especially ultra-latency-sensitive ones, by providing powerful local computing capabilities and lower end-to-end delays. Nevertheless, various challenges are yet to be carefully addressed, especially the reliability-guaranteed scheduling of multitask business processes, e.g., workflows, upon distributed edge resources and servers. In this paper, we propose a novel edge-environment-based multi-workflow scheduling method, which incorporates a reliability estimation model for edge workflows and a coevolutionary algorithm for yielding scheduling decisions. The proposed approach aims at maximizing the reliability, in terms of success rates, of services deployed upon edge infrastructures while minimizing the service invocation cost for users. We conduct simulative experimental case studies based on multiple well-known scientific workflow templates and a well-known dataset of edge resource locations. The results clearly suggest that our proposed approach outperforms traditional ones in terms of workflow success rate and monetary cost.

1. Introduction

Edge computing is an evolving computing paradigm offering a more efficient alternative: data is processed and analyzed closer to the point where it is created. It enables a computation-as-a-service model and provides proximity-based, mobility-aware provisioning of virtualized resources on demand [1, 2]. Edge service providers are equipped with computational facilities, which allow them to provide the capacity required by commercial and noncommercial users. Recently, the edge computing paradigm has become an increasingly popular means of supporting and enabling business process and scientific workflow execution [3–5]. A workflow is a set of dependent or independent tasks modeled as a directed acyclic graph (DAG) [6–8], in which the nodes indicate the tasks and a directed arc represents the interdependency between the corresponding tasks. Workflow scheduling involves mapping workflow tasks to computational resources for execution, and the resulting optimization problem is well acknowledged to be NP-hard.

Recently, as novel bioinspired and genetic algorithms become increasingly versatile and powerful, a great deal of research effort has been devoted to applying them to the edge-environment-oriented workflow scheduling problem [9–11]. However, it remains a great challenge to develop efficient scheduling algorithms with good scheduling performance, a low service-level-agreement (SLA) violation rate, and high user-perceived quality of service.

In this paper, we propose a novel edge-environment-based multi-workflow scheduling approach that leverages a multi-workflow reliability estimation model and a preference-inspired coevolutionary algorithm, PICEA-g, for yielding scheduling decisions. We show through simulative studies that our proposed method clearly outperforms traditional ones in terms of multiple metrics.

2. Literature Review

2.1. Related Work

It is widely acknowledged that arranging multitask business processes or workflows upon distributed nodes or computing resources under Quality of Service (QoS) constraints, e.g., reliability, is an NP-hard problem [12, 13]. It is therefore extremely time-consuming to yield optimal schedules through traversal-based algorithms. Fortunately, heuristic and metaheuristic strategies with polynomial complexity are capable of producing approximate or near-optimal solutions at the cost of acceptable optimality loss.

For example, Wang et al. [14] proposed a look-ahead genetic algorithm (LAGA), which utilized reliability-based reputation scores for optimizing the makespan and the reliability of a workflow application. Wen et al. [15] aimed at solving the problem of deploying workflow applications over federated clouds while meeting the reliability, security, and cost requirements. Wu et al. [16] proposed a soft error-aware and energy-efficient task scheduling method for workflow applications in DVFS-enabled cloud infrastructures under reliability and completion time constraints. Cao et al. [17] proposed a soft error-aware VM selection and the task scheduling approach to minimize the execution cost of cloud workflows under makespan, reliability, and memory constraints while considering soft errors in cloud data centers. Garg et al. [18] proposed a new scheduling algorithm called the reliability and energy-efficient workflow scheduling algorithm, which jointly optimized lifetime reliability of application and energy consumption and guaranteed the user-specified QoS constraint. Nik et al. [19] proposed a scheduling approach, which included four algorithms for minimizing the workflow execution cost while also meeting the user-specified deadline and reliability.

To minimize the overall error probability in a multiserver mobile edge computing (MEC) network, where the wireless data transmission/offloading was carried by finite blocklength (FBL) codes, Zhu et al. [20] characterized the FBL reliability of the transmission phase, investigated the extreme event of queue length violation in the computation phase by applying extreme value theory, and provided an optimal framework for deciding time allocation and server selection. Peng et al. [8] proposed a novel method to evaluate resource reliability in the mobile edge computing environment and addressed the workflow scheduling problem by using a Krill-based algorithm. Kouloumpris et al. [21] considered an architecture consisting of an edge node, an intermediate node (hub), and the cloud infrastructure, and then used a mathematical-programming-based framework to derive an application-reliability-optimal task allocation based on multiple operational constraints. Wang et al. [22] developed a reinforcement-learning-based approach to multi-workflow scheduling. However, they considered the centralized cloud environment as the underlying infrastructure and thus ignored the overhead for inter-edge-node data transmission. For a similar optimization objective, Wang et al. [23] and Saeedi et al. [24] employed an immune-based PSO algorithm for scheduling workflows over centralized clouds.

3. Models and Systems

3.1. System Architecture

An edge computing system usually consists of an edge computing agent (ECA) and multiple edge servers. The edge computing agent manages all resources and each edge server owns several virtual machines (VMs), each of which can usually handle a workflow task that a user offloads at a time. An edge server usually has limited capacity for storage and computation. Due to the requirement of signal strength and channel stability, as illustrated in Figure 1, it is usually believed that an edge server can cover a limited circular range and thus users can only offload their tasks to the reachable edge servers in terms of such coverage ranges.

As can be seen in Figure 2, instead of considering monolithic task configurations, we consider that user requests can be structured and that process-like requests can be expressed as workflows with different constructs. A workflow refers to a directed acyclic graph (DAG) $G = (T, E)$, where $T = \{t_1, t_2, \dots, t_n\}$ denotes the task set and $E$ is the set of edges between tasks. An edge $e_{ij} = (t_i, t_j) \in E$ is a precedence constraint, indicating that $t_i$ is a precedent task of $t_j$.
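The DAG model above can be sketched in code. The following is a minimal illustration (the `Workflow` class and its method names are ours, not part of the paper's system), including the topological ordering that a scheduler relies on to respect precedence constraints:

```python
from collections import defaultdict

class Workflow:
    """A workflow as a DAG G = (T, E): tasks plus precedence edges."""
    def __init__(self):
        self.tasks = set()
        self.succ = defaultdict(set)   # task -> set of successor tasks
        self.pred = defaultdict(set)   # task -> set of predecessor tasks

    def add_edge(self, ti, tj):
        """Edge (ti, tj): ti must finish before tj starts."""
        self.tasks.update((ti, tj))
        self.succ[ti].add(tj)
        self.pred[tj].add(ti)

    def topological_order(self):
        """Kahn's algorithm; raises if the graph contains a cycle."""
        indeg = {t: len(self.pred[t]) for t in self.tasks}
        ready = [t for t in self.tasks if indeg[t] == 0]
        order = []
        while ready:
            t = ready.pop()
            order.append(t)
            for s in self.succ[t]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
        if len(order) != len(self.tasks):
            raise ValueError("precedence constraints contain a cycle")
        return order
```

Any valid topological order places every task after all of its predecessors, which is exactly the property the initialization and decoding steps in Section 4 depend on.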

The notations used in this paper are shown in Table 1.

3.2. Problem Formulation

In engineering, reliability is the probability that a system or component performs its required functions under the stated conditions and with dependable outcomes. Guaranteeing the reliability of computing systems and applications is challenging because faults are hard to avoid: hardware failures, software bugs, transient faults, devices operating at high temperatures, and so on. The reliability issue of edge-environment-based multi-workflows is further complicated by the fact that structured and process-based task flows are more susceptible to various types of faults, especially transmission errors and faults occurring when wireless communications between edge nodes and users are required.

As shown in Figure 3, the reliability of a workflow is usually structure dependent:

$$R_{seq} = \prod_{i=1}^{n_1} r_i, \qquad R_{par} = \prod_{i=1}^{n_2} r_i, \qquad R_{sel} = \sum_{i=1}^{n_3} p_i \, r_i,$$

where $n_1$ denotes the number of tasks in a sequential routing, $n_2$ is the number of tasks succeeded by a split point in a parallel routing, and $n_3$ is that of a selective routing (with $p_i$ the probability that the $i$-th branch is selected). For a task $t_i$ executed on the edge server $s_p$, its reliability $r_i$ can be estimated as its success rate of execution, i.e., the probability that its time-to-failure exceeds its completion time:

$$r_i = P(TTF_p > CT_i) = e^{-\lambda_p CT_i},$$

where

$$\lambda_p = \frac{1}{MTTF_p}$$

is the failure rate of $s_p$ under an exponentially distributed time-to-failure.
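Under these assumptions (a product form for sequential and parallel routings, probability-weighted branches for selective routings, and an exponentially distributed time-to-failure), the estimation can be sketched as follows; the function names are illustrative only:

```python
import math

def routing_reliability(rates, routing, probs=None):
    """Structure-dependent reliability of a routing, given per-task success rates.

    sequential / parallel: every task must succeed -> product of rates.
    selective: exactly one branch is taken -> probability-weighted sum.
    """
    if routing in ("sequential", "parallel"):
        r = 1.0
        for x in rates:
            r *= x
        return r
    if routing == "selective":
        return sum(p * x for p, x in zip(probs, rates))
    raise ValueError("unknown routing: %s" % routing)

def task_success_rate(completion_time, mttf):
    """P(TTF > completion time) assuming an exponential TTF with the given MTTF."""
    return math.exp(-completion_time / mttf)
```

Note that a longer completion time or a smaller MTTF both shrink the task's success rate, which is what couples the scheduling decision to reliability.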

To estimate the monetary cost of workflows, we first estimate the cost for renting server $s_p$:

$$C_p^{rent} = CT_p \cdot c_p,$$

where $CT_p$ is the completion time of the task execution queue on $s_p$ and $c_p$ is the charge per unit time for renting server $s_p$.

The transmission time for task $t_i$ can be estimated as $T_i^{tr} = T_i^{up} + T_i^{down} + T_i^{bh}$, which is composed of three parts [25]: $T_i^{up}$ indicates the uplink communication time, $T_i^{down}$ the downlink time, and $T_i^{bh}$ the backhaul link time. According to [26, 27], $T_i^{bh}$ can be considered infinitesimal, and the downlink time can usually be treated as a constant $c$. Therefore, $T_i^{tr}$ can be expressed as

$$T_i^{tr} = \frac{D_i}{\alpha_i B_p} + c,$$

where $\alpha_i \in (0, 1]$ is decided by the distance between the task (user) and the server; as the distance increases, the bit error rate increases and the average transmission speed decreases [27]. $B_p$ indicates the average bandwidth of the server, and $D_i$ is the data size of task $t_i$. Denoting the transmission price per unit time of the server by $tc_p$, the transmission fee can be estimated as

$$C_i^{tr} = T_i^{tr} \cdot tc_p.$$
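The cost model can be illustrated with a short sketch; the parameter names (`alpha` for the distance-dependent speed factor, `downlink_const` for the constant downlink time) are our own labels for the quantities described above:

```python
def rental_cost(queue_completion_time, unit_price):
    """Cost for renting a server: completion time of its task queue
    multiplied by the per-unit-time charge."""
    return queue_completion_time * unit_price

def transmission_time(data_size, bandwidth, alpha, downlink_const=0.0):
    """Uplink time plus a constant downlink time; the backhaul term is
    treated as negligible. alpha in (0, 1] discounts the effective
    transmission speed as the user-server distance grows."""
    return data_size / (alpha * bandwidth) + downlink_const

def transmission_fee(data_size, bandwidth, alpha, unit_trans_price,
                     downlink_const=0.0):
    """Transmission fee: transmission time times the per-unit-time price."""
    return transmission_time(data_size, bandwidth, alpha, downlink_const) \
        * unit_trans_price
```

Halving `alpha` (i.e., doubling the effective distance penalty) doubles the uplink portion of the time and therefore the fee, matching the bit-error-rate argument above.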

Based on the described system configuration, the problem that we are interested in is thus, for given proximity constraints of server-user communications and a given deadline, how to schedule workflows with higher reliability and lower cost. The resulting formulation is thus

$$\max \; R, \qquad \min \; C = \sum_{p} C_p^{rent} + \sum_{i} C_i^{tr},$$

subject to the deadline constraint on workflow completion time and the constraint that every task is offloaded only to an edge server whose coverage range reaches its user.

4. PICEA-g for Multi-Workflow Scheduling

4.1. Preference-Inspired Coevolutionary Algorithms Using Goal Vectors

It has long been known that preference-based approaches are useful for generating trade-off surfaces in the objective subspaces of interest to the decision maker. Wang et al. [28] offered one realization of such an approach, named the preference-inspired coevolutionary algorithm using goal vectors (PICEA-g), which has been shown to outperform four other best-in-class multiobjective evolutionary algorithms, namely, NSGA-II, ε-MOEA, HypE, and MOEA/D.

PICEA-g is a coevolutionary approach in which the usual population of candidate solutions evolves together with a set of goal vectors during the search. In this algorithm, the optimality of candidate solutions is decided by a Pareto-dominance model. To be specific, a family of goal vectors and a population of candidate solutions coevolve during the search process. A candidate solution gains fitness by meeting goal vectors in the objective space, but the fitness contribution must be shared with the other solutions satisfying those goal vectors. A goal vector only gains fitness by being satisfied by a candidate solution, and its fitness decreases as more solutions in the population satisfy it. Ultimately, the population of candidate solutions and the goal vectors coevolve toward the Pareto optimal front. The fitness $F_s$ of a candidate solution $s$ and the fitness $F_g$ of a preference $g$ can be calculated by (9)–(11) as follows:

$$F_s = \sum_{g \in G_s} \frac{1}{n_g},$$

where $G_s$ is the set of goal vectors satisfied by $s$ and $n_g$ denotes the number of solutions that satisfy preference $g$. In this formulation, when $s$ fails to satisfy any $g$, the fitness $F_s$ is defined as 0. And,

$$F_g = \frac{1}{1 + \gamma},$$

where

$$\gamma = \min\left(1, \; \frac{n_g - 1}{2N - 1}\right),$$

where $N$ is the population size of candidate solutions.
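A minimal sketch of this shared-fitness computation follows: a solution's fitness is the sum of shared rewards 1/n_g over the goals it satisfies, while a goal's fitness decays with the number of solutions meeting it. The identifiers and the exact normalization of the decay term are our reading of [28]:

```python
def picea_fitness(satisfies, n_solutions):
    """Fitness sharing in the PICEA-g style.

    satisfies: dict mapping each solution id to the set of goal ids it meets.
    n_solutions: population size N.
    Returns (solution fitness dict, goal fitness dict).
    """
    # n_g: how many solutions satisfy each goal
    n_g = {}
    for goals in satisfies.values():
        for g in goals:
            n_g[g] = n_g.get(g, 0) + 1

    # A solution shares each satisfied goal's unit reward; no goals -> 0.
    f_s = {s: sum(1.0 / n_g[g] for g in goals) if goals else 0.0
           for s, goals in satisfies.items()}

    # A goal met by many solutions is worth less (gamma clamped to [0, 1]).
    f_g = {}
    for g, n in n_g.items():
        gamma = min(1.0, max(0.0, (n - 1) / (2.0 * n_solutions - 1)))
        f_g[g] = 1.0 / (1.0 + gamma)
    return f_s, f_g
```

For instance, a goal satisfied by a single solution keeps its full fitness of 1, while a goal satisfied by the whole population approaches the minimum of 0.5.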

An elitist framework is usually used for implementing the above model, as shown in Figure 4. As can be seen, a population of candidate solutions and a set of preferences, denoted by $S$ and $G$, respectively, are evolved for a fixed number of generations, $maxGen$. In each generation $t$, genetic variation operators are applied to the parents $S(t)$ to produce offspring $S_c(t)$. Meanwhile, new goal vectors $G_c(t)$ are randomly regenerated based on the predefined bounds. Then, $S(t)$ and $S_c(t)$, and $G(t)$ and $G_c(t)$, are pooled, respectively, whereafter the combined populations are sorted according to fitness. Finally, truncation selection is applied to select the best candidate solutions and goal vectors as the new populations $S(t+1)$ and $G(t+1)$.

4.2. Encoding

For a workflow application, a chromosome is a data structure in which a scheduling solution is encoded. We use a two-dimensional string to represent a scheduling solution. One dimension of the string represents the index of resources, which depicts the task-resource mapping, while the other dimension denotes the order between tasks. As illustrated in Figure 5, the solution contains tasks from three workflows, which are assigned to virtual machines on two edge servers, and each VM executes its assigned tasks in the processing sequence given by the ordering dimension. The decoding scheme can be described as the reverse of encoding.
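The two-dimensional encoding and its decoding can be illustrated as follows; the task and VM labels are hypothetical, not those of Figure 5:

```python
# Two-dimensional encoding of a schedule: one dimension maps each task to a
# resource (VM) index, the other gives the task's position in that VM's queue.
tasks    = ["w1_t1", "w1_t2", "w2_t1", "w2_t2", "w3_t1"]
resource = [0, 1, 0, 1, 0]    # task -> VM index (task-resource mapping)
order    = [0, 0, 1, 1, 2]    # task -> position in its VM's execution queue

def decode(tasks, resource, order):
    """Reverse of encoding: recover each VM's ordered task queue."""
    queues = {}
    for t, r, o in zip(tasks, resource, order):
        queues.setdefault(r, []).append((o, t))
    # Sort each queue by the ordering dimension and drop the positions.
    return {r: [t for _, t in sorted(q)] for r, q in queues.items()}
```

Decoding the example above yields VM 0 executing three tasks and VM 1 executing two, in the positions dictated by the ordering dimension.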

4.3. Initialization

Two constraints are applied here to generate uniformly feasible chromosomes, improving the quality of the initial population while accelerating the convergence rate: the topological constraint and the proximity constraint. Based on these constraints, the initial population is generated as follows:

(1) Each workflow is converted into a task list by topological sort.

(2) For each task, a VM is selected as its computing resource only from the edge servers whose coverage range reaches the task's user (the proximity constraint); the task is then assigned to that VM.

(3) The above steps are repeated until all workflow tasks are assigned, which yields one chromosome.

When the population size reaches the defined value, the initialization process stops.
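A sketch of this constrained initialization, assuming each workflow has already been topologically sorted into a task list and that `reachable_vms` (a name of ours) encodes the proximity constraint:

```python
import random

def initialize_population(sorted_task_lists, reachable_vms, pop_size):
    """Generate feasible chromosomes under the topological and proximity
    constraints.

    sorted_task_lists: one topologically sorted task list per workflow.
    reachable_vms: task -> list of VMs on edge servers covering its user.
    Returns pop_size chromosomes, each a list of (task, vm) pairs whose
    order respects task precedence.
    """
    population = []
    for _ in range(pop_size):
        chromosome = []
        for tasks in sorted_task_lists:
            for task in tasks:
                vm = random.choice(reachable_vms[task])  # proximity constraint
                chromosome.append((task, vm))
        population.append(chromosome)
    return population
```

Because tasks are emitted in topological order and only reachable VMs are drawn, every chromosome produced this way is feasible by construction, so no repair step is needed.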

The initial goal vectors are randomly generated as objective vectors in the objective space within predefined bounds. In practice, the bounds are estimated via preliminary single-objective optimizations.

4.4. Population Update

The iterative update of population consists of discrete steps described below, until the termination condition is satisfied.

4.4.1. Genetic Variation

The genetic variation changes the workflow task allocation information to maintain diversity in the population. In our proposed genetic variation operation, a solution is mutated intelligently based on a resource priority heuristic: Dongarra et al. [29] have shown that the resource with the minimal product of key performance indicators (failure rate and unitary execution time) should have a higher priority to be selected in scheduling. Hence, we rank the servers accordingly.

Then, we let $pri_p$ indicate the priority of the server $s_p$. The genetic variation operation randomly selects one task in the solution and reassigns it to an available server with a higher priority. In the example shown in Figure 6(b), the selected task is originally scheduled on a server whose priority is 3; the genetic variation thus reassigns it to a server with a higher priority of 4.

According to the precedence constraint, the reassigned task is inserted into the execution queue at a position behind its predecessors, as shown in Figure 6(c).
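The priority-guided mutation can be sketched as below; the mapping names (`server_of_vm`, `priority`, `reachable_vms`) are illustrative, and the queue-reinsertion step of Figure 6(c) is simplified to an in-place replacement within a precedence-respecting chromosome:

```python
import random

def mutate(chromosome, server_of_vm, priority, reachable_vms):
    """Reassign one randomly chosen task to a reachable VM whose server has
    a strictly higher priority, if one exists; otherwise return the
    chromosome unchanged.

    chromosome: list of (task, vm) pairs in precedence-respecting order.
    """
    i = random.randrange(len(chromosome))
    task, vm = chromosome[i]
    current_priority = priority[server_of_vm[vm]]
    better = [v for v in reachable_vms[task]
              if priority[server_of_vm[v]] > current_priority]
    if not better:
        return chromosome          # no higher-priority reachable server
    child = list(chromosome)       # leave the parent intact
    child[i] = (task, random.choice(better))
    return child
```

Because the candidate VMs are filtered through `reachable_vms`, the offspring never violates the proximity constraint, and keeping the list order intact preserves task precedence.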

Simultaneously, new preference sets are randomly regenerated based on the initial bounds.

4.4.2. Fitness Calculation

Fitness calculation is based on the distribution of objective value vectors and goal vectors in the objective space. Assume that there are two candidate solutions $s_1$ and $s_2$, their offspring $s_3$ and $s_4$, two existing preferences $g_1$ and $g_2$, and two new preferences $g_3$ and $g_4$ (i.e., $N = 2$), as shown in Figure 7.

The process to calculate the fitness of a candidate solution $s$ and the fitness of a preference $g$ is shown in Table 2.

4.4.3. Truncation Selection

Truncation selection aims to select the best $N$ candidate solutions from the union population according to their fitness. However, some solutions with higher fitness may be Pareto-dominated. Therefore, we identify all nondominated solutions before the selection. If the number of nondominated solutions does not exceed the population size $N$, we assign the maximum fitness to all the nondominated solutions. However, if more than $N$ nondominated solutions are found, we disregard the dominated solutions prior to applying truncation selection (implicitly, their fitness is set to zero).
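A sketch of this nondomination-aware truncation for the two objectives (maximize success rate, minimize cost); the tie-handling details are our own simplification:

```python
def dominates(a, b):
    """a dominates b for objectives (maximize reliability, minimize cost)."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def truncation_select(population, fitness, objectives, n):
    """Keep nondominated solutions ahead of dominated ones, then fill up to
    n by descending fitness.

    objectives: solution -> (reliability, cost) tuple.
    fitness: solution -> shared fitness value.
    """
    nondom = [s for s in population
              if not any(dominates(objectives[o], objectives[s])
                         for o in population if o != s)]
    if len(nondom) >= n:
        # More nondominated solutions than slots: rank them by fitness.
        return sorted(nondom, key=lambda s: fitness[s], reverse=True)[:n]
    # Otherwise fill the remaining slots with the fittest dominated ones.
    rest = sorted((s for s in population if s not in nondom),
                  key=lambda s: fitness[s], reverse=True)
    return nondom + rest[: n - len(nondom)]
```

Note that a dominated solution can never displace a nondominated one here, which is the point of identifying the nondominated set before selecting by fitness.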

4.4.4. Termination Conditions

This phase determines when the algorithm stops and thus which solutions are finally reported. In this article, the termination condition is examined in two stages: (1) as soon as the maximum iteration criterion is met, the proposed algorithm terminates, and (2) a threshold value $T$ for terminating the algorithm, set to 0.9 in our study, is checked in every generation: after calculating the fitness of the populations, if the fitness value is less than $T$, the algorithm continues; otherwise, it terminates. Whenever the algorithm ends, a set of Pareto optimal solutions is presented to the user, from which the final solution is chosen as the best trade-off over all objectives, including reliability and cost.

Algorithm 1 presents all the operations of the PICEA-g algorithm.

Objective functions: $f(s)$ = (reliability, cost);
Algorithm-related parameters: population size $N$, maximum generation $maxGen$, fitness threshold $T$;
Generate initial population $S$ and initial goal vectors $G$;
$t \leftarrow 0$;
while $t < maxGen$ do
 Generate new population $S_c$ from $S$ by genetic variation;
 Merge $S$ and $S_c$ into $\bar{S}$;
 Find the Pareto nondominated set $P$ from $\bar{S}$;
 Generate new goal vectors $G_c$;
 Merge $G$ and $G_c$ into $\bar{G}$;
 Evaluate the fitness of $\bar{S}$ and $\bar{G}$;
 if $|P| \le N$ then
  Set the fitness of the solutions in $P$ to the maximum value;
  Update $S$ by truncation selection from $\bar{S}$;
 else
  Update $S$ by truncation selection from $P$;
 end
 Update $G$ by truncation selection from $\bar{G}$;
 if the best fitness $\ge T$ then
  Break;
 end
 $t \leftarrow t + 1$;
end
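Algorithm 1 can be condensed into the following skeleton, where the operator arguments stand in for the components described in Sections 4.2–4.4 (a sketch, not the authors' implementation; the early-stop test uses the best solution fitness against the threshold $T$):

```python
def picea_g(init_solutions, init_goals, evaluate, vary, regen_goals,
            select_s, select_g, max_gen, threshold):
    """Skeleton of the elitist coevolutionary loop.

    evaluate(S, G) -> (solution fitness dict, goal fitness dict)
    vary(S)        -> offspring solutions (genetic variation)
    regen_goals()  -> freshly sampled goal vectors within the bounds
    select_s/select_g -> truncation selection on the pooled populations
    """
    S, G = init_solutions, init_goals
    for _ in range(max_gen):
        Sc = vary(S)                      # genetic variation -> offspring
        Gc = regen_goals()                # new random goal vectors
        S_all, G_all = S + Sc, G + Gc     # pool parents with offspring
        fS, fG = evaluate(S_all, G_all)   # shared-fitness calculation
        S = select_s(S_all, fS)           # truncate solutions
        G = select_g(G_all, fG)           # truncate goal vectors
        if max(fS.values()) >= threshold: # early stop once fitness passes T
            break
    return S, G
```

Plugging in the fitness, initialization, mutation, and truncation sketches from the previous subsections yields a complete, if simplified, optimizer.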

5. Performance Evaluation

To evaluate the effectiveness and correctness of our proposed method, we conduct extensive simulative experiments and show that our proposed method outperforms traditional ones. We initially intended to employ a real-world edge-workflow-scheduling environment to test our algorithms. However, such an edge environment for executing real-world scientific workflows is yet to come. Consequently, we rely on simulations and simulative datasets for model validation and comparison purposes.

These simulative experiments are based on three well-known workflow templates [30], namely, CyberShake, LIGO, and SIPHT, as shown in Figure 8.

We consider that all edge servers fall into three different types of resource configurations and charging plans, i.e., tp1, tp2, and tp3, as shown in Table 3. We collected historical time-to-failure (TTF) records of the three types, as illustrated in Figure 9, as the input reliability data for the edge servers. The MTTF of each type of edge server can then be estimated by a Monte Carlo method [31].

We assume as well that edge servers and users are located according to the EUA dataset [2] as shown in Figure 10.

We compare our proposed method with three existing approaches, namely, NSGA-II [32], MOEA/D [33], and SPEA-II [34]. Figure 11 shows the solutions obtained by the abovementioned approaches for different workflow cases, where the x and y axes represent the resulting success rate and cost, respectively. Figure 12 shows the comparison of Pareto optimal solutions of the different methods with varying numbers of edge servers.

As can be seen from Figures 11 and 12, (1) our method achieves better Pareto optimal fronts than its peers, regardless of the workflow case or the number of edge servers, and (2) our method acquires more feasible solutions than its peers, owing to the fact that multiple goal vectors help to guide the solution population toward the Pareto front.

6. Conclusion

In this paper, we address the problem of reliability-guaranteed multi-workflow scheduling in the edge computing environment. We develop a reliability-driven scheduling strategy based on the PICEA-g algorithm. Extensive simulations based on several well-known workflow templates and a real-world edge-server-location dataset clearly indicate that our proposed method outperforms its counterparts in terms of different performance metrics.

Data Availability

The EUA dataset used to support the findings of this study is available at https://github.com/swinedge/eua-dataset.

Disclosure

Zhenxing Wang and Wanbo Zheng are co-first authors.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Zhenxing Wang and Wanbo Zheng contributed equally to this work.

Acknowledgments

This work was in part supported by the Chongqing Research Program of Technology Innovation and Application under Grants cstc2019jscx-msxm0652 and cstc2019jscx-fxyd0385; Key Research and Development Plan of Jiangxi Province (No. 20181ACE50029); Science and Technology Program of Sichuan Province under Grant 2020JDRC0067/2020YFG0326; and the Talent Program of Xihua University under Grant Z202047.