Abstract

Scheduling large-scale, deadline-aware scientific applications (usually referred to as workflows) is a difficult task. This research provides a virtual machine (VM) placement and scheduling approach for effectively scheduling workflow tasks in the cloud environment while maintaining dependency and deadline constraints. The aim of the proposed model is to reduce the application's energy consumption and total execution time while respecting these constraints. An energy-efficient VM placement (EVMP) algorithm is presented to select the VM for each task and to dynamically deploy/undeploy VMs on the hosts based on the tasks' requirements. Simulation results demonstrate that the proposed approach outperforms the existing PESVMC (power-efficient scheduling and VM consolidation) algorithm.

1. Introduction

Large-scale complex scientific applications (workflows) are executed and analyzed in multidisciplinary areas of research such as astronomy and physics [1]. A workflow contains a large number of mutually dependent tasks, which are executed according to their dependency constraints [2]. Because of these constraints, a child task can start its execution only when its parent task finishes. A directed acyclic graph (DAG) is used to represent a workflow. Workflows often have disparate resource requirements (such as storage and CPU) and constraints (such as dependency) that need to be accounted for during execution. For example, the scientific workflow Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) [3] is a resource-intensive workflow with a good degree of scalability [4]. The strict demands on the computing infrastructure make the execution of scientific applications difficult and costly [5]. Cloud computing provides virtualized resources as a service on an on-demand, pay-per-use basis [6, 7]. Characteristics such as elasticity and flexibility make the cloud a major trend for computation and storage services and motivate the execution of scientific applications in the cloud environment [8].

Scientific workflows consist of distinct tasks with complex dependencies. Resource provisioning and the order in which workflow tasks are executed are challenging problems. Inefficient resource utilization during workflow execution increases the number of provisioned but unused resources, and these unused resources consume energy without performing any useful operation [9]. Resource utilization can be increased by efficient resource provisioning: an energy-efficient scheduling algorithm can manage the resources required by the tasks while the workflow executes. Numerous workflow scheduling algorithms have been proposed in the literature. These algorithms focus on reducing makespan and cost with limited resources. Selecting and designing a competent and effective workflow scheduling algorithm is also a challenging task [10]. An energy-aware scheduling algorithm must provision, from the offered resources, resources that are efficient enough to complete the workflow tasks within their deadline constraints while decreasing energy consumption. Technologies used to minimize energy consumption include Dynamic Power Management (DPM) [11, 12], Dynamic Voltage and Frequency Scaling (DVFS) [12-15], resource consolidation with migration techniques [6, 16], virtualization [6], and green policies [17]. Energy consumption has also been minimized by reducing the computational power of the resources, but this reduction increases the workflow makespan.

A combination of software-based and hardware-based techniques is necessary to reduce energy consumption. In this paper, the EVMP algorithm is proposed to schedule a scientific workflow on virtual machines (VMs). The EVMP algorithm integrates both hardware and software policies to minimize energy consumption. Virtualization technology is exploited to create VMs on a server. The DVFS technique is used to save energy when the server or a core of the CPU is idle. Dynamic provisioning of heterogeneous resources is considered to model the Infrastructure-as-a-Service (IaaS) cloud service. An energy model is presented to monitor and calculate the energy consumption. During the scheduling of tasks, server overloading is also prevented by monitoring the server status [18].

1.1. Paper Outline

The next section presents the cloud model, workflow model, task model, and energy model used to execute the workflow. In Section 3, the workflow task scheduling algorithm is presented. The experimentation and performance metrics are presented in Section 4. Section 5 presents the simulation results and discussion. The conclusion of the paper, along with future directions, is presented in the last section.

2. System Model

This section describes the cloud model, workflow model, task model, and energy model.

2.1. Cloud Model

In this paper, large, heterogeneous hosts (physical servers) are deployed. Each host is characterized by its Central Processing Unit (CPU) capacity, number of processing elements, Random Access Memory (RAM) capacity, network bandwidth capacity, and storage capacity. The CPU capacity of a host is divided equally among its processing elements. The capacities of CPU, RAM, network bandwidth, and storage are measured in million instructions per second (MIPS) [6], megabytes (MB), gigabits per second (Gbps), and gigabytes (GB), respectively [19]. VMs are used to execute the workflow, and more than one VM can be deployed on a host. VMs are dynamically deployed on and undeployed from the hosts according to the workflow demands. To execute the workflow, a fraction of the host's resources (CPU, processing elements, RAM, network bandwidth, and storage) is allocated to each VM. Hosts are also switched on and off dynamically. Based on their utilization, hosts are classified into three categories: underloaded, overloaded, and normal. If the resource utilization is below the lower threshold value, the host is underloaded; the scheduler then tries to migrate the VMs deployed on it and switch the host off, which helps to minimize energy consumption. If the resource utilization is above the upper threshold value, the host is overloaded; some of its VMs are migrated to other hosts because overloaded hosts consume more energy. Otherwise, the host is in the normal category.
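The host categories described above can be illustrated with a short sketch. The threshold values (0.2 and 0.8) and the names `Host` and `classify_host` are hypothetical, since this section does not state the thresholds, and utilization is simplified here to the share of CPU capacity allocated to VMs.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the paper does not give concrete values here.
LOWER_THRESHOLD = 0.2
UPPER_THRESHOLD = 0.8


@dataclass
class Host:
    cpu_capacity_mips: float  # total CPU capacity of the host
    allocated_mips: float     # CPU capacity currently allocated to VMs

    @property
    def utilization(self) -> float:
        return self.allocated_mips / self.cpu_capacity_mips


def classify_host(host: Host,
                  lower: float = LOWER_THRESHOLD,
                  upper: float = UPPER_THRESHOLD) -> str:
    """Return 'underloaded', 'overloaded', or 'normal' for a host."""
    if host.utilization < lower:
        return "underloaded"  # candidate for VM migration and switch-off
    if host.utilization > upper:
        return "overloaded"   # some VMs should be migrated away
    return "normal"


print(classify_host(Host(cpu_capacity_mips=1000, allocated_mips=100)))
```

In a scheduler, the "underloaded" result would trigger consolidation and the "overloaded" result would trigger migration of some VMs, as described in the text.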

2.2. Workflow Model

A workflow is described as a set of interdependent computational tasks [20]. Many scientific workflows exist in the literature [21], such as LIGO, Montage, Cybershake, Epigenomics, and Pan-STARRS. In this paper, the Pan-STARRS scientific workflow is considered for task execution. The Pan-STARRS project continuously monitors the entire sky to detect moving or variable objects, using the PS1 telescope. Johns Hopkins University and Microsoft manage the generated astronomy data using two types of workflows, PSLoad and PSMerge. The PSLoad workflow collects the data from the telescope and stores it in the database. The PSMerge workflow updates the database. The PSLoad and PSMerge workflows are described in Figures 1 and 2, respectively. Table 1 describes the detailed characteristics of the workflows.

2.3. Task Model

A workflow task is an activity that is carried out as part of the workflow description [20]. Each task requires resources for its complete execution and is subject to a set of constraints. A task is modeled by its length/size in million instructions (MI), number of processing elements, deadline in seconds, data transfer file size in MB, list of child tasks, and list of parent tasks. Based on the task resource requirements and constraints (such as deadline, length, and dependency), VMs are dynamically deployed. The execution time, transfer time, start time, finish time, and makespan are defined as follows:

(1) Execution time: the task's execution time is measured in seconds and is determined by the task's length and the processing capacity of the VM that executes it, i.e., the task length in MI divided by the VM capacity in MIPS.

(2) Transfer time: if the child and parent tasks are not executed on the same VM, the output data of the parent task is transferred to the child task before the child can execute. The transfer time from the parent task to the child task is determined by the size of the output data, the VM deployed for the parent task, the VM about to execute the child task, and the communication startup time.

(3) Start time: the start time of a task on a VM at a host depends on how the task is placed:

If the task is an entry task, its start time is given by the ready time of the VM at the host.

If the task is not an entry task and the same VM is used to execute the child task and its parent task, the start time of the child task is determined by the finish time of the parent task on that VM.

If the task is not an entry task and is allocated to a VM other than the one on which its parent executed, the start time also accounts for the transfer time of the parent's output data.

If the task is not an entry task and a new VM is deployed for its execution, the start time also includes the creation time of the VM at the host. If a VM is migrated to another host so that the new VM can be placed on the host, the start time also includes the migration time of the migrated VM. If a new host is activated, the start time also includes the start time of the host.

(4) Finish time: the finish time of a task on a VM at a host is its start time plus its execution time.

(5) Makespan: the workflow makespan is the total time taken to complete the execution of the workflow, measured from the submission time of the workflow.
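The timing quantities of the task model can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's exact equations, whose symbols are not reproduced here: the transfer time is assumed to be the communication startup time plus data size divided by bandwidth, and VM creation, migration, and host start-up are folded into a single hypothetical `provisioning_delay_s` parameter. All function and parameter names are introduced for this example.

```python
def execution_time(length_mi, vm_mips):
    """Task length (MI) divided by VM capacity (MIPS), in seconds."""
    return length_mi / vm_mips


def transfer_time(data_mb, bandwidth_mb_per_s, startup_s=0.0, same_vm=False):
    """Assumed form: startup time plus size over bandwidth; zero on the same VM."""
    if same_vm:
        return 0.0
    return startup_s + data_mb / bandwidth_mb_per_s


def start_time(vm_ready_s, parents=(), provisioning_delay_s=0.0):
    """parents holds (parent_finish_s, transfer_s) pairs; empty for an entry task."""
    data_ready_s = max((finish + transfer for finish, transfer in parents),
                       default=0.0)
    return max(vm_ready_s, data_ready_s) + provisioning_delay_s


def finish_time(start_s, exec_s):
    return start_s + exec_s


def makespan(finish_times, submission_s=0.0):
    """Latest finish time minus the workflow submission time."""
    return max(finish_times) - submission_s


# Entry task of 10,000 MI on a 1,000 MIPS VM, then a child of 5,000 MI
# on a different 2,500 MIPS VM that needs 40 MB of the parent's output.
parent_finish = finish_time(start_time(vm_ready_s=0.0),
                            execution_time(10_000, 1_000))
tt = transfer_time(data_mb=40, bandwidth_mb_per_s=20)
child_start = start_time(vm_ready_s=0.0, parents=[(parent_finish, tt)])
child_finish = finish_time(child_start, execution_time(5_000, 2_500))
print(makespan([parent_finish, child_finish]))  # 14.0
```

Placing the child on the parent's VM would remove the 2 s transfer from this example, which is the effect the scheduler exploits when it prefers the parent VM.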

2.4. Energy Model

CPUs, network interfaces, memory, and storage devices are the most energy-intensive components of host servers. The CPU consumes approximately 37% to 43% of total server energy [22, 23], and network devices consume approximately 33% of total data center energy [24]. In the proposed work, the energy consumption of the CPU [6] and data transfer between VMs [24] are taken into account, and the total energy consumption is calculated in five different scenarios. These scenarios are defined as follows.

Scenario 1. This scenario covers the energy consumed while a task executes on a VM on a host. It is calculated from the energy consumption rate of the VM on that host and the execution time of the task. The energy consumed to execute the whole workflow is obtained by summing this quantity over all tasks, using a mapping indicator for each task, VM, and host: the indicator is "1" if the task is scheduled on that VM at that host for execution and "0" otherwise.

Scenario 2. This scenario applies when a server/host is active but no VM is running on it. Energy consumption is reduced by switching the host to a low voltage and frequency for some time (up to a threshold duration). The energy consumption of an idle host is calculated from its energy consumption rate in idle mode and its idle time.

Scenario 3. This scenario applies when a server/host is partially idle, for example, when an idle VM is deployed on it. The VM is left idle up to the threshold period. The energy consumption of the partially idle host is calculated from the idle time of each VM on the host, together with the times at which the VM is deployed on and un-deployed from the host.

Scenario 4. This scenario accounts for the energy of the unused resources of the servers/hosts. This energy is minimized by applying core-level DVFS. According to [25], energy usage is reduced by about 50% when the voltage is lowered to 70% of its peak value. Frequency scaling itself takes very little time, on the order of nanoseconds [6], so the scaling time is neglected in the calculations. The energy consumption of the unused resources of the hosts is calculated over the time intervals during which the number of VMs on a host differs from that in the preceding interval.

Total computational energy consumption is the sum of the energy consumption in the above four scenarios, as shown in Equation (16).

Scenario 5. This scenario covers the energy consumed during data transfer from one VM to another when parent and child tasks are not executed on the same VM. The energy consumption to transfer data is calculated from the energy consumption rate of the network bandwidth and the transfer time of the data from one VM to the other. The total energy consumption of a data center during workflow execution is calculated by combining Equations (16) and (17), i.e., the computational and the data transfer energy consumption.
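The way the five scenarios combine can be sketched as follows. This is a simplified illustration, assuming each contribution is an energy consumption rate (in watts) multiplied by a duration (in seconds); the function names and the representation of each scenario as a list of (rate, duration) pairs are assumptions made for this example, not the paper's notation.

```python
def energy(rate_w, duration_s):
    """Energy in joules for a constant consumption rate over a duration."""
    return rate_w * duration_s


def total_energy(execution, idle_hosts, idle_vms, unused, transfers):
    """Each argument is a list of (rate_w, duration_s) pairs.

    execution  -> Scenario 1 (tasks executing on VMs)
    idle_hosts -> Scenario 2 (active hosts with no VM running)
    idle_vms   -> Scenario 3 (partially idle hosts)
    unused     -> Scenario 4 (unused host resources under DVFS)
    transfers  -> Scenario 5 (data transfer between VMs)
    """
    computational = sum(
        energy(rate, duration)
        for scenario in (execution, idle_hosts, idle_vms, unused)
        for rate, duration in scenario
    )
    transfer = sum(energy(rate, duration) for rate, duration in transfers)
    return computational + transfer


# One task at 100 W for 10 s, one idle host at 50 W for 4 s,
# and one 2 s data transfer at 2.3 W.
print(total_energy([(100, 10)], [(50, 4)], [], [], [(2.3, 2)]))
```

Keeping the computational and transfer parts separate mirrors the text, where the first four scenarios are summed before the data transfer energy is added.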

3. Energy Efficient VM Placement (EVMP) Algorithm

This section describes the proposed algorithm, which executes the workflow in an energy-efficient manner and within the deadline constraint, as shown in Figure 3. The workflow is scheduled according to the following steps:
(Step 1) On the arrival of a new workflow, it is analyzed to obtain the type of the workflow, the number of tasks, and the dependencies between them. The tasks are then stored in the task pool queue, and the parent tasks of each task are checked. If a task is an entry task, a new host is activated, a new VM is created on it based on the task requirement, and the task is allocated to that VM for execution. After that, the start time, execution time, and finish time of the task and the ready time of the VM are updated.
(Step 2) When a task executes successfully, its child tasks are checked. If a child task is ready for execution, it is moved from the task pool queue to the ready queue.
(Step 3) For each task in the ready queue, the relationship of the task with its parent tasks is checked. If the task can be executed on the same VM on which its parent task(s) executed without violating its deadline, the task is allocated to that VM.
(Step 4) If step 3 is not possible, the already deployed VMs are sorted by their energy consumption rate. If a VM fulfills the task requirement without violating the deadline, the task is allocated to that VM.
(Step 5) If step 4 is not possible, a new VM is created based on the task requirement, and the task is allocated to it. There are three cases for deploying the new VM. In the first case, the new VM is deployed on an already active host. If this is not possible, the scheduler tries to migrate a VM from one host to another and deploy the new VM on that host. If this is also not possible, a new host is activated and the VM is deployed on it.
(Step 6) The system status, such as energy consumption, makespan, and resource utilization, is updated.

These scheduling steps are used to execute the workflow and are described in Algorithm 1.

1. add all the workflow tasks to the task pool queue;
2. initialize the ready queue to null;
3. for all tasks in the task pool queue
4. if immediate parents of the task are executed or the task is an entry task then
5. add the task to the ready queue and remove it from the task pool queue;
6. end if
7. end for
8. if the ready queue is not empty then
9. schedule tasks by EVMP algorithm;
10. end if

Algorithm 1 is used to obtain the tasks that are ready for execution. Initially, all the tasks are stored in the task pool queue, and the ready task queue is set to null (see lines 1 and 2). If all the immediate parents of a task have finished their execution, or the task is an entry task, then that task is ready for execution. The task is stored in the ready queue and removed from the task pool queue (see lines 4-7). If there is any task in the ready queue, the EVMP algorithm is used to schedule the tasks for execution (see lines 8-10). This algorithm is called automatically on the arrival of a new workflow or on the completion of any task within the workflow.
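The ready-task selection of Algorithm 1 can be sketched as below. The data structures are assumptions made for illustration: the task pool is a list of task names, `parents` maps each task to its immediate parents, and `finished` is the set of tasks that have completed; the name `collect_ready_tasks` is hypothetical.

```python
def collect_ready_tasks(task_pool, parents, finished):
    """Move tasks whose immediate parents have all finished out of the pool.

    An entry task has no parents, so it is ready immediately.
    Returns the list of ready tasks and removes them from task_pool.
    """
    ready = [task for task in task_pool
             if all(p in finished for p in parents.get(task, ()))]
    for task in ready:
        task_pool.remove(task)
    return ready


# Small diamond-shaped DAG: A -> B, A -> C, (B, C) -> D.
parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
pool = ["A", "B", "C", "D"]

print(collect_ready_tasks(pool, parents, finished=set()))   # ['A']
print(collect_ready_tasks(pool, parents, finished={"A"}))   # ['B', 'C']
```

Calling the function again after each task completes reproduces the behavior described above, where the algorithm is triggered on workflow arrival and on every task completion.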

1. for all tasks in the ready queue
2. initialize the first tag to null;
3. initialize the second tag to false;
4. if the task is an entry task then
5. select the VM type to execute the task within its deadline;
6. start a new host and add it to the active host list;
7. deploy the VM on the newly created host;
8.  ;
9. schedule the task to the VM;
10. remove the task from the ready queue;
11. update ready time of the VM;
12. end if
13. if the task is not an entry task then
14. ;
15. for all parent VMs of the task
16. if the parent VM can execute the task without violating the deadline and is not overloaded then
17. schedule the task on the parent VM;
18. remove the task from the ready queue;
19. ;
20. ;
21.  Eq. (2));
22. end if
23. end for
24. for each task remaining in the ready queue
25. call alreadyDeployedVM();
26. end for
27. end if
28. end for

Algorithm 2 is used to schedule the tasks. Initially, the two tags are set to null and false, respectively (see lines 2 and 3). If the task is an entry task, the VM type that can fulfill the task requirement is selected. After that, a new host is started and added to the list of active hosts. The VM is deployed on the new host, and the task is scheduled on the newly deployed VM. The ready time of the VM is also updated (see lines 4-12). If the task is not an entry task, the scheduler first tries to execute the task on the same VM on which its parent executed. If this is possible, the task is scheduled on the parent VM, and the ready time and the transfer time are updated (see lines 13-23). If this step is not possible, the alreadyDeployedVM() function is called (see lines 24-26).

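1. initialize findFlag to false;
2. sort the deployed VMs based on energy consumption rate in increasing order;
3. for all deployed VMs
4. if the VM can execute the task without violating its deadline then
5. compute the start time of the task using Eq. (5);
6. schedule the task on the VM after the start time and remove it from the ready queue;
7. update the ready time and transfer time;
8. set findFlag to true; break;
9. end if
10. end for
11. if findFlag is false then
12. call scaleUp();
13. end if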

Algorithm 3 reuses the already deployed VMs for workflow execution to save VM creation time as well as energy. In this function, the deployed VMs are first sorted according to their energy consumption rate (see line 2). If any VM can execute the task without violating the deadline, the task is scheduled on that VM, and the system parameters, such as the ready time and the transfer time, are updated (see lines 3-10). If this step is not possible, the scaleUp() function is called (see lines 11-13). The scale-down function is adopted from [18] to shut down VMs and hosts in order to save energy.
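The selection logic of Algorithm 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `VM` class, the function name `pick_deployed_vm`, and the treatment of the deadline as an absolute time are assumptions, and the earliest start time is passed in rather than derived from the paper's equations.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    mips: float            # processing capacity
    energy_rate_w: float   # energy consumption rate
    ready_time_s: float = 0.0


def pick_deployed_vm(task_length_mi, deadline_s, earliest_start_s, vms):
    """Try deployed VMs in increasing order of energy consumption rate.

    Returns the first VM that finishes the task by the deadline and updates
    its ready time, or None so that the caller can fall back to scaleUp().
    """
    for vm in sorted(vms, key=lambda v: v.energy_rate_w):
        start = max(vm.ready_time_s, earliest_start_s)
        finish = start + task_length_mi / vm.mips
        if finish <= deadline_s:
            vm.ready_time_s = finish
            return vm
    return None


vms = [VM("fast", mips=2000, energy_rate_w=30),
       VM("slow", mips=500, energy_rate_w=10)]
chosen = pick_deployed_vm(task_length_mi=4000, deadline_s=5.0,
                          earliest_start_s=0.0, vms=vms)
print(chosen.name)  # fast
```

In the example the lower-energy VM is tried first but would need 8 s, so the faster VM is selected; with a looser deadline the lower-energy VM would win.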

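1. select the VM type to execute the task within its deadline;
2. sort all the hosts in the active host list in decreasing order as per their utilization level;
3. mark the new VM as not deployed;
4. for all hosts in the active host list
5. if host utilization does not exceed the upper threshold limit after VM allocation then
6. deploy the new VM on the host; break;
7. end if
8. end for
9. if the new VM is not yet deployed then
10. select the host which has minimum utilization level;
11. select the VM from the selected host which has minimum CPU capacity;
12. if the selected VM can be migrated to another host except the selected host then
13. migrate the selected VM;
14. end if
15. if utilization of the selected host does not exceed the upper threshold limit after allocation then
16. deploy the new VM on the selected host;
17. mark the new VM as deployed;
18. end if
19. end if
20. if the new VM is not yet deployed then
21. start a new host and add it to the active host list;
22. deploy the new VM on the newly started host;
23. mark the new VM as deployed;
24. end if
25. allocate the task to the new VM and remove the task from the ready queue;
26. update the ready time and transfer time;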

Algorithm 4 is used to add new resources for workflow execution. When the already deployed VMs are unable to complete the workflow tasks, the scheduler calls this algorithm to deploy a new VM. This function is adapted from [26] with some variations. First, a VM type that can fulfill the task requirement is selected (see line 1). The new VM may be placed on an already active host without migration, depending on the host resources (see lines 5-8). If this is not possible, the new VM may be deployed on an already active host with live VM migration (see lines 9-19). If migration is not possible, a new host is started and the new VM is deployed on it (see lines 20-24). The task is allocated to the new VM and removed from the ready queue (see line 25). The VM ready time is updated and, if the task has a parent, so is the data transfer time from the parent task to the child task (see line 26).
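The three placement cases of the scale-up step can be sketched as below. This is a simplified illustration under several assumptions: utilization is reduced to CPU share only, the upper threshold of 0.8 is a hypothetical value, each host stores just the MIPS of its deployed VMs, and the names `Host`, `fits`, and `scale_up` are introduced for this example.

```python
from dataclasses import dataclass, field

UPPER_THRESHOLD = 0.8  # hypothetical value; not stated in this section


@dataclass
class Host:
    name: str
    capacity_mips: float
    vms: list = field(default_factory=list)  # MIPS of each deployed VM

    def utilization(self, extra_mips=0):
        return (sum(self.vms) + extra_mips) / self.capacity_mips


def fits(host, vm_mips):
    return host.utilization(vm_mips) <= UPPER_THRESHOLD


def scale_up(vm_mips, active_hosts, new_host_capacity_mips):
    """Place a new VM and return the host that receives it."""
    # Case 1: an already active host, most utilized first.
    for host in sorted(active_hosts, key=lambda h: h.utilization(), reverse=True):
        if fits(host, vm_mips):
            host.vms.append(vm_mips)
            return host
    # Case 2: migrate the smallest VM away from the least utilized host.
    if active_hosts:
        source = min(active_hosts, key=lambda h: h.utilization())
        if source.vms:
            smallest = min(source.vms)
            target = next((h for h in active_hosts
                           if h is not source and fits(h, smallest)), None)
            if target is not None:
                source.vms.remove(smallest)
                target.vms.append(smallest)
            if fits(source, vm_mips):
                source.vms.append(vm_mips)
                return source
    # Case 3: start a new host and deploy the VM on it.
    new_host = Host(f"host-{len(active_hosts) + 1}", new_host_capacity_mips,
                    [vm_mips])
    active_hosts.append(new_host)
    return new_host


hosts = [Host("h1", 1000, [500]), Host("h2", 1000, [700])]
print(scale_up(200, hosts, new_host_capacity_mips=1000).name)  # h1
```

Trying the most utilized hosts first packs VMs tightly, so fewer hosts stay active, which is the consolidation effect the algorithm relies on to save energy.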

4. Experimentation and Performance Metrics

In this section, the workflow model, simulation parameters, and performance metrics used in the proposed model are presented.

4.1. Considered Workflow Model

The Pan-STARRS scientific workflow [21] is considered for task execution, in the form of its two workflows, PSLoad and PSMerge, which are described in Figures 1 and 2, respectively. Table 1 describes the detailed characteristics of the workflows.

4.2. Simulation Parameters

The CloudSim framework [27] is used to simulate the cloud environment and to evaluate the proposed scheduling model. The detailed simulation parameters are as follows:
(i) Two types of hosts are deployed: HP ProLiant ML110 G4 and HP ProLiant ML110 G5 [28]
(ii) The energy consumption rates of these two types of hosts are 117 W and 135 W, respectively [28]
(iii) The energy consumption rate to transfer 1 GB of data is 2.3 W [29]
(iv) Four types of VMs [19] are deployed to execute the scientific workflow, with varying RAM capacity (in MB) and CPU speed (in MIPS): VM Type 1, 500 MIPS with 613 MB RAM; VM Type 2, 1000 MIPS with 1740 MB RAM; VM Type 3, 2000 MIPS with 1740 MB RAM; and VM Type 4, 2500 MIPS with 870 MB RAM
(v) VMs are started as per the workflow requirements, and the average VM start-up time is 96.9 s [30]
(vi) The average bandwidth between VMs is set to 20 MBPS, which is the approximate bandwidth offered by Amazon Web Services [31]
(vii) The real-world Pan-STARRS scientific workflow is considered. Each scientific workflow is divided into three groups based on the number of tasks, as defined in Table 1 [21]
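As an illustration of how the VM types listed above could be used when a VM type must be selected so that a task meets its deadline, the following sketch picks the slowest type that still finishes in time. Choosing the slowest feasible type is an assumption made for this example, as is the function name `select_vm_type`; the MIPS values are taken from the parameter list.

```python
# CPU speed in MIPS for each VM type, from the simulation parameters.
VM_TYPES = {"Type 1": 500, "Type 2": 1000, "Type 3": 2000, "Type 4": 2500}


def select_vm_type(length_mi, deadline_s, start_s=0.0):
    """Return the slowest VM type that completes the task by its deadline.

    Returns None if even the fastest type would miss the deadline.
    """
    for name, mips in sorted(VM_TYPES.items(), key=lambda item: item[1]):
        if start_s + length_mi / mips <= deadline_s:
            return name
    return None


print(select_vm_type(length_mi=3000, deadline_s=2.0))  # Type 3
```

For a 3000 MI task with a 2 s deadline, Types 1 and 2 would need 6 s and 3 s, so Type 3 (1.5 s) is the first type that qualifies.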

4.3. Performance Metrics
4.3.1. Average Resource Utilization (ARU)

ARU is defined as the ratio of the computing resources assigned to accomplish the scientific workflow tasks to the total computing resources available on the servers. It is calculated over the hosts, taking into account the active time of each host.

4.3.2. Total Energy Consumption

Total energy consumption (TEC) is the total energy consumed by the servers to execute a scientific workflow. TEC is computed using Equation (19).

4.3.3. Makespan or Total Execution Time

Makespan is the time taken to execute the scientific workflow from the start task to the end task. It is computed using Equation (11).

5. Results and Discussion

The proposed EVMP algorithm is compared with the existing PESVMC algorithm [32] to establish its enhanced performance. In the PESVMC algorithm, the workflow tasks are allocated to the VM that consumes the least energy. The deadline of a task is not considered when assigning it to a VM. Tasks are selected according to their parent-child relationship, but this relationship is not considered during VM allocation. As a result, both the execution time and the data transfer time increase, which affects makespan as well as energy consumption. The performance of the EVMP algorithm is evaluated based on the ARU, total energy consumption, and workflow makespan.

5.1. Performance Impact on Resource Utilization

The ARU of EVMP and PESVMC is observed for the PSLoad and PSMerge scientific workflows with varying numbers of workflow tasks. The experimental results in terms of average resource utilization are shown in Figure 4. The results show that EVMP performs better than PESVMC in terms of resource utilization because of its dynamic nature. In the proposed algorithm, new VMs are created only when the currently deployed VMs are not sufficient to complete the tasks within the deadline, so resources are properly utilized. The VM migration policy is also used to consolidate the resources, which further increases resource utilization. On average, resource utilization is increased by 8.6% in comparison to the existing algorithm.

5.2. Performance Impact on Total Energy Consumption

The total energy consumption of EVMP and PESVMC is observed for the PSLoad and PSMerge scientific workflows with varying numbers of workflow tasks. The experimental results in terms of total energy consumption (measured in Kilowatt (kW)) are shown in Figure 5. In the existing algorithm, all the resources are active, which consumes more energy without doing any useful work. In contrast, the EVMP algorithm deploys resources according to the needs of the workflow tasks, which reduces energy consumption. During the scheduling of workflow tasks, the existing algorithm does not consider the parent-child relationship, which leads to high data transfer energy consumption. The proposed algorithm considers the parent-child relationship when scheduling tasks to VMs, which helps to reduce the data transfer energy consumption and the workflow makespan. On average, the EVMP algorithm reduces energy consumption by 42.3% in comparison to the PESVMC algorithm.

5.3. Performance Impact on Makespan

The makespan of EVMP and PESVMC is observed for the PSLoad and PSMerge scientific workflows with varying numbers of workflow tasks. The experimental results in terms of makespan (measured in seconds (s)) are shown in Figure 6. The results show that EVMP performs better in terms of makespan. On average, makespan is reduced by 98% in comparison to the PESVMC algorithm. This is because the PESVMC algorithm considers limited resources, which severely restricts the parallel execution of tasks. In addition, the existing algorithm does not consider the parent-child relationship when scheduling tasks to VMs, which also affects the makespan of the workflow. Hence, the makespan of PESVMC increases significantly for large workflow datasets.

6. Conclusion

This paper presents an energy-aware VM placement model for dependent scientific workflows in the cloud, which achieves the scheduling objectives and energy efficiency and improves system performance for real-world scientific workflows. The proposed EVMP algorithm reduces energy consumption by applying DVFS (a hardware technique) to VMs/hosts and computing resources that are idle, and software techniques to VMs and hosts that remain idle beyond the preestablished threshold time. The data transfer energy consumption is minimized by scheduling tasks on or around the parent VM (where the parent task executed), which also reduces the execution delay by decreasing the transfer time and VM creation time. The EVMP algorithm is implemented on the CloudSim framework, and the real-world Pan-STARRS scientific workflows are used to evaluate its performance. In comparison to the PESVMC algorithm, the EVMP algorithm increases resource utilization by 8.6%, decreases energy consumption by 42.3%, and reduces makespan by 98%. In the future, the proposed EVMP algorithm will also be implemented on a public cloud platform, along with the evaluation of additional performance metrics for security and fault tolerance.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this study.