Mathematical Problems in Engineering


Research Article | Open Access

Volume 2017 | Article ID 2724531 | https://doi.org/10.1155/2017/2724531

Lei Yang, Yu Dai, Bin Zhang, "Optimized Speculative Execution to Improve Performance of MapReduce Jobs on Virtualized Computing Environment", Mathematical Problems in Engineering, vol. 2017, Article ID 2724531, 11 pages, 2017. https://doi.org/10.1155/2017/2724531

Optimized Speculative Execution to Improve Performance of MapReduce Jobs on Virtualized Computing Environment

Academic Editor: Mauro Gaggero
Received 26 Apr 2017
Revised 19 Sep 2017
Accepted 08 Oct 2017
Published 07 Dec 2017

Abstract

Virtualization has become increasingly important in cloud computing because it supports efficient and flexible resource provisioning. However, performance interference among virtual machines may reduce the efficiency of that provisioning. In a virtualized environment where multiple MapReduce applications are deployed, performance interference can also slow down the Map and Reduce tasks and thereby degrade the performance of the MapReduce jobs. To ensure the performance of MapReduce jobs, this paper proposes a framework for scheduling them that takes the performance interference among virtual machines into account. The core of the framework is to identify the straggler tasks in a job and to back up these tasks so that the backup overtakes the original task and reduces the overall response time of the job. To identify the straggler tasks, the paper uses a method for predicting the performance interference degree, and it presents a method for scheduling the backup tasks. A set of experiments is conducted to verify the effectiveness of the framework. The experiments show that, in a virtual cluster, the proposed framework performs better than the current speculative execution framework.

1. Introduction

MapReduce [1, 2] has been widely adopted as a platform for massive data analysis: many companies use it to process large bodies of data in order to correlate, mine, and extract valuable features. With the spread of virtualization techniques, virtual clusters provide a much more flexible mechanism for different applications to share common computing resources, and many MapReduce jobs are now deployed in virtual clusters. However, modern virtualization techniques, for example, Xen [3], do not provide perfect performance isolation. Virtual machines may therefore compete for limited resources, which results in performance interference among them. How to ensure the performance of a MapReduce job in a virtual cluster thus becomes a key issue.

Previous works on the performance of MapReduce jobs have reported performance degradation in virtual clusters [4–7]. Other researchers have found that performance interference [8–10] is one of the important factors causing such degradation. A set of works in the field of task scheduling were then conducted [11–13] to ensure the performance of MapReduce applications. However, most of them focus only on I/O intensive applications and try to find a uniform performance interference model to predict the performance degradation for different types of applications. In fact, a uniform model may not always work well across different applications.

In this paper, we present an optimized speculative execution framework for MapReduce jobs which aims to improve the performance of the jobs in the virtual clusters. The contribution of the paper is as follows.

In order to predict the performance degradation, a method for predicting the performance interference degree is proposed. In this method, a linear regression model relates the performance interference degree to the system workloads, and a particle swarm optimization (PSO) algorithm is used to find the coefficients of the model.

In order to find the stragglers, a method for computing the remaining time of a task is presented that takes the performance interference degree into account.

In order to back up the stragglers, a scheduling algorithm is proposed that assigns the tasks to slots with a global optimization.

The rest of the paper is organized as follows. Section 2 introduces the current works related to MapReduce scheduling in virtual clusters. Section 3 gives an overview of our speculative execution framework. Sections 4 and 5 show how to predict the performance interference degree, identify the stragglers, and schedule the tasks. Section 6 presents the experimental results that verify our methods. Finally, the paper is summarized in Section 7.

2. Related Works

Many works have analyzed performance in virtual clusters. Reference [14] presents a method for predicting interthread cache conflicts based on the hardware activity vector. Reference [15] presents a method to characterize application performance in order to predict the overheads caused by virtualization. Reference [16] uses an artificial neural network to predict application performance. References [17, 18] analyze network I/O contention in the cloud environment. Performance interference among CPU-intensive applications has been discussed in [11]. Reference [12] considers the performance interference of disk I/O intensive applications and proposes a model for predicting such interference. Reference [8] analyzes the factors related to performance interference and presents a method for estimating it. Reference [19] targets the problem of application scheduling in data centers with the consideration of heterogeneity and interference. Although some of these works have noticed performance interference and its effect on the performance of MapReduce applications, they focus only on I/O intensive applications and try to find a uniform performance interference model to evaluate the performance degradation for different types of applications. In fact, a uniform model may not always work well across different applications, because their resource usage patterns can be very different. Besides, the method proposed in [19] develops several microbenchmarks to derive interference sensitivity scores and uses collaborative filtering to infer the sensitivity score of a newly arrived application, which requires the application to run against at least 2 microbenchmarks for 1 minute to obtain its profile.
Because the method relies on the microbenchmarks for analyzing the interference degree, the diversity of the microbenchmarks affects the accuracy of the analysis. Moreover, if the number of distinct microbenchmarks is large, the score matrix for the collaborative filtering may be very sparse, because a new application cannot run against many microbenchmarks for 1 minute each before its interference sensitivity score is inferred. Collaborative filtering may not work well on such a sparse matrix; although some methods have been proposed to solve this problem, their effect is limited. Meanwhile, in MapReduce job scheduling, the QoS depends not only on the interference but also on data locality. Running a MapReduce job against the microbenchmarks may therefore not reflect its actual performance or yield its actual profile: the job may need to read its data files remotely while it runs against the microbenchmarks, and its runtime in that situation may differ from the runtime when it need not read input data remotely. In this sense, the method proposed in [19] may not be applicable to MapReduce job scheduling.

Many researchers have worked on task scheduling in MapReduce. Reference [20] proposes a capacity scheduler to guarantee that the capacity of the cluster is shared fairly among different users. To ensure data locality, [21] proposes a delay scheduler: if the head-of-line job cannot launch a local task, the scheduler delays it and looks at the subsequent job, and when a job has been delayed for more than the maximum delay time, the scheduler assigns the job’s nonlocal map tasks. Reference [22] uses a linear regression method to model the relation between I/O intensive applications. Reference [23] uses node status prediction to improve the data locality rate. Reference [24] uses a matchmaking algorithm for scheduling that considers not only data locality but also cluster utilization. Reference [25] introduces the Quincy scheduler to achieve data locality. Several recent proposals, such as resource-aware adaptive scheduling [26] and cost-effective resource provisioning [27], have introduced resource-aware job schedulers to the MapReduce framework. Reference [28] addresses the problem of task assignment with the consideration of data locality in cloud computing. Reference [29] focuses on scheduling with the consideration of data locality to minimize the cost of accessing remote files. Reference [30] proposes a scheduling algorithm that makes the jobs meet their SLAs. Reference [31] solves the problem of job scheduling with the consideration of fairness as well as data locality. Reference [19] proposes a method for application scheduling with the consideration of interference and presents a greedy algorithm for finding the optimal assignments. However, this method handles only a single application, whereas our problem requires finding optimal assignments in each time interval for a set of tasks.
As stated above, most of the current works assume perfect performance isolation among virtual machines and, based on this assumption, seldom consider performance interference. Some works do consider it; for example, in [22], the scheduler optimizes the assignment with the consideration of only one task or only one slot, which makes it hard to achieve the global optimization of minimizing the performance interference. For example, when two slots are free simultaneously and the first job in the wait queue has an acceptable interference degree with both nodes, one needs to determine which slot will serve the job. Current works do not address this issue, although the decision needs to be made with a global optimization.

As for the performance of MapReduce in heterogeneous environments, [32] presents the LATE method to improve the performance of MapReduce applications through speculative execution. Reference [33] optimizes speculative execution by taking the computing power into account when estimating the remaining time. Reference [34] proposes a scheduling method especially for the heterogeneous environment; according to the historical execution progress of a task, this algorithm dynamically estimates the execution time to determine whether to start a backup task for a task with a low progress rate. However, when estimating the remaining time to identify the stragglers, the above literature does not consider the performance interference among virtualized computing resources. Besides, when assigning the backup task to a slot, current works do not consider the performance interference, which may turn the backup into a straggler again. Finally, current works wait for a straggler to appear instead of predicting it so that the backup decision can be made early, which may also limit the effectiveness of the method.

To address the limitations of the above works, this paper proposes an optimized speculative execution framework for MapReduce jobs on virtualized computing resources. The framework takes the interference into account: it employs an interference prediction and, according to the prediction, computes the remaining time of each task to predict the stragglers and assigns each backup task to an appropriate node.

3. Framework Overview

Figure 1 shows the optimized speculative execution framework for MapReduce jobs. The framework targets MapReduce applications running in a virtual cluster. The cluster contains a set of physical servers, and we assume that each physical server has the same virtualized environment. Each physical server can allocate its resources to multiple virtual machines, and each virtual machine can host an application. The virtual cluster serves the Hadoop framework, which has one master node and multiple slave nodes. The master node is deployed on a dedicated physical host, and each slave node is deployed on a VM. The master node has 4 major components: the Straggler Identification Module, the Backup Module, the Heart Beat Receiver, and Performance Interference Modeling & Prediction. The Straggler Identification Module computes the remaining time of each task in order to identify the stragglers; the Backup Module assigns the straggler tasks to slots; the Heart Beat Receiver collects the running states of the servers and the tasks by receiving heart beat information from the slave nodes; and Performance Interference Modeling & Prediction trains or retrains the performance interference model used for prediction.

Sections 4 and 5 discuss the major components of our framework.

4. Methods for Predicting Performance Interference

4.1. Modeling the Performance Interference

In a virtual cluster, an application deployed on a virtual machine (VM) consumes the resources of this VM. Because the VMs consolidated on the same physical host contend for limited shared resources, the resource usage of one VM may affect the others’ access to those resources and thus degrade the performance of the applications on them. To mitigate such degradation, an important issue is to predict the extent to which an application’s performance is affected by the contention for shared resources. When the prediction shows a severe degradation, we can place the application on another VM. In the following, for simplicity, the “foreground VM” signifies the VM that serves the application app to be deployed, while the other VMs consolidated with the foreground VM are called the “background VMs.”

As stated above, contention for shared resources may cause performance interference among the VMs consolidated on the same physical server, so the resource usage pattern of a background VM may affect the performance of the foreground VM. As the resource usage of the background VM varies, the performance of the foreground VM varies as well; that is to say, the extent to which the foreground VM’s performance is affected by the background VM differs. The term “performance interference degree” signifies this extent.

Definition 1 (performance interference degree). We use (1) to define the performance interference degree. System-level workloads reflect the resource usage pattern of a VM; those considered in this paper are shown in Table 1. The definition involves the workload of the foreground VM and the workload of the background VM. The performance of the application on the foreground VM may include response time and throughput, and it is measured for a given workload of the background VM. The reference case is the workload of the background VM when no application has been deployed on it.


Table 1
System-level workload    Meaning
cpuutil                  Average CPU utilization
memutil                  Average memory utilization
rps                      Average number of read operations per second
wps                      Average number of write operations per second
await                    Average waiting time of the I/O operations
svctm                    Average time spent for the request in the disk device

Since the contention of the shared resource can cause the performance degradation, the interference degree of the foreground VM will have a relation with the resource usage pattern of the background VM. We also do some experiments to show this relation as Tables 2 and 3 show.


Table 2
App         cpuutil   memutil   rps       wps       await     svctm (s)   Response time (s)
Bzip2       0.91      0.007     278.099   0.999     0.775     0.345       84.265
cat         0.001     0.044     335.158   433.111   137.643   1.58        27.801
Super PI    0.98      0.001     0.245     1.547     12.222    9.39        99.107
Iozone      0.264     0.036     370.25    392.95    151.99    9.93        108.724
Ccrypt      0.762     0.421     172.77    168.68    5.162     2.13        216.611
Gzip        0.912     0.053     295.72    11.83     8.44      1.18        218.95


Table 3 (response time; rows: foreground application, columns: background application)
            Bzip2     cat       Super PI   Iozone    Ccrypt    Gzip
Bzip2       93.887    148.28    89.787     146.988   142.82    144.168
cat         30.996    58.892    40.762     40.762    45.104    50.026
Super PI    101.141   99.981    100.892    121.99    105.21    103.54
Iozone      175.33    205.27    109.756    198.73    110.26    113.64
Ccrypt      245.03    296.55    233.50     311.73    257.91    254.03
Gzip        230.75    393.99    227.09     403.74    245.10    240.02

Tables 2 and 3 show that the response time of the foreground VM differs when the background VM serves different types of applications. Serving a different type of application changes the resource usage pattern of the background VM, which in turn changes the performance of the foreground VM. For example, as the system-level workloads of the background VM vary, the application cat has different response times. We therefore use the system-level workloads to model the interference degree, as (2) shows, where the model has seven coefficients.

By using (2), the interference degree can be computed once the coefficients are known, so we need to estimate the coefficients. Suppose that estimates of the seven coefficients have been obtained. Then, according to (2), the model for estimating the performance interference degree can be written as follows:

Then, when the background VM’s workloads are fed into the above equation, we can estimate the performance interference degree. To estimate the coefficients, we need to compute the error between the predicted interference degree and the actual one according to the observed data record.
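To make the estimated model concrete, the following minimal Python sketch evaluates a linear interference model on one background workload record. It assumes one intercept plus one coefficient per workload in Table 1, which would account for the seven coefficients mentioned above. The names `WORKLOAD_KEYS` and `predict_interference` and all numeric values are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of a linear interference model. Names and numbers are
# illustrative assumptions, not the paper's notation or fitted values.
WORKLOAD_KEYS = ("cpuutil", "memutil", "rps", "wps", "await", "svctm")


def predict_interference(coeffs, workload):
    """Return intercept + sum(coefficient * workload value).

    coeffs:   7 numbers, the intercept followed by one coefficient per key
              in WORKLOAD_KEYS (treating the first as an intercept is an
              assumption).
    workload: dict mapping each key in WORKLOAD_KEYS to a measured value
              of the background VM.
    """
    if len(coeffs) != len(WORKLOAD_KEYS) + 1:
        raise ValueError("expected 7 coefficients")
    intercept, weights = coeffs[0], coeffs[1:]
    return intercept + sum(w * workload[k] for w, k in zip(weights, WORKLOAD_KEYS))


if __name__ == "__main__":
    # Made-up background workload and coefficients, for illustration only.
    background = {"cpuutil": 0.5, "memutil": 0.25, "rps": 100.0,
                  "wps": 50.0, "await": 10.0, "svctm": 2.0}
    coeffs = [1.0, 0.5, 0.0, 0.001, 0.002, 0.0, 0.0]
    print(predict_interference(coeffs, background))
```

The prediction error for one observed record is then the difference between this value and the measured interference degree.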

Then, the problem of finding the coefficients can be stated as follows: given the set of observed data records, find the combination of coefficients that minimizes the overall error, as shown in (4).

The above problem is to find the optimal combination of the coefficients that minimizes the error between the predicted interference degree and the actual one. To solve the problem efficiently, this paper uses a particle swarm optimization (PSO) algorithm.

When using PSO to solve such a problem, the first task is to define the particle. For this problem, a particle in the swarm is defined as a candidate combination of the coefficients, so each component of the particle signifies its location in one direction. The swarm contains a fixed number of particles. Each particle updates its location in each direction with a speed, which it computes according to the best location pBest that the particle has experienced and the best location that the swarm has experienced. The best location is the one closest to the optimal solution, where closeness is expressed by the fitness function. For our problem, the fitness function should evaluate how close the swarm is to the optimal solution. Then, according to formula (4), the fitness function of a swarm can be defined as follows:

Then, we use formula (6) to update the speed of the particle in each direction and formula (7) to compute the new location of the particle in the same direction. Both formulas are indexed by the iteration number: the speed and the location in a given direction at the next iteration are computed from their values at the current iteration. The update uses two functions that each return a random number between 0 and 1, two constants, and a weight that is computed by formula (8) according to [35]. In our experiment, the size of the swarm is 30 and the iteration number is 1000. In formula (8), the weight is determined by the maximum and minimum weights, the current iteration number, and the maximum iteration number.
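For reference, the textbook form of PSO updates with a linearly decreasing inertia weight, which matches the description above, is written out below. The symbols ($v_{id}^{t}$, $x_{id}^{t}$, $c_1$, $c_2$, $gBest_d$, $w_{\max}$, $w_{\min}$, $iter$, $iter_{\max}$) are assumed notation chosen for this illustration and are not copied from the paper.

```latex
\begin{align}
v_{id}^{t+1} &= w\, v_{id}^{t}
  + c_1\, \mathrm{rand}_1()\,\bigl(pBest_{id} - x_{id}^{t}\bigr)
  + c_2\, \mathrm{rand}_2()\,\bigl(gBest_{d} - x_{id}^{t}\bigr), \\
x_{id}^{t+1} &= x_{id}^{t} + v_{id}^{t+1}, \\
w &= w_{\max} - \frac{(w_{\max} - w_{\min})\, iter}{iter_{\max}}.
\end{align}
```

Here $i$ indexes the particle, $d$ the direction (one coefficient of the model), and $t$ the iteration.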

Then, the PSO algorithm can find the optimal combination of the coefficients of each attribute. Algorithm 1 presents the detailed algorithm.

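Algorithm 1: PSO for estimating the coefficients.
Procedure PSO
  for each particle do
    Initialize the particle by giving velocity and position;
  end for
  Initialize pBest and gBest;
  Do
    for each particle do
      compute the fitness and update pBest and gBest;
      Update the speed and location of the particle by pBest and gBest;
    end for
  While maximum iterations is not reached
End procedure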
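The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fitness is taken to be the sum of squared prediction errors, the first coefficient is treated as an intercept, and the values `c1 = c2 = 2.0`, `w_max = 0.9`, `w_min = 0.4`, and the velocity clamp `v_max` are common defaults chosen here. Only the swarm size of 30 and the 1000 iterations come from the text. The function names are hypothetical.

```python
import random


def squared_error(coeffs, records):
    """Sum of squared errors of a linear model over (features, target) pairs.

    coeffs[0] is treated as an intercept (an assumption of this sketch).
    """
    total = 0.0
    for features, target in records:
        predicted = coeffs[0] + sum(c * f for c, f in zip(coeffs[1:], features))
        diff = predicted - target
        total += diff * diff
    return total


def pso_fit(records, dim, swarm_size=30, max_iter=1000,
            c1=2.0, c2=2.0, w_max=0.9, w_min=0.4, v_max=1.0, seed=0):
    """Search for `dim` coefficients that minimize squared_error with PSO."""
    rng = random.Random(seed)
    # Random initial positions, zero initial velocities.
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    pbest_fit = [squared_error(p, records) for p in pos]
    best = min(range(swarm_size), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]

    for it in range(max_iter):
        # Inertia weight decreases linearly from w_max toward w_min.
        w = w_max - (w_max - w_min) * it / max_iter
        for i in range(swarm_size):
            for d in range(dim):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Velocity clamping is an addition of this sketch.
                vel[i][d] = max(-v_max, min(v_max, v))
                pos[i][d] += vel[i][d]
            fit = squared_error(pos[i], records)
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit
```

For the interference model, each record would pair the six system-level workloads of the background VM with an observed interference degree, and `dim` would be 7.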

The method which uses regression model for estimating the performance interference degree can work well when there are historical data for training the coefficients. However, as for the problem of MapReduce job scheduling, such historical data may not always be available. This is because the new arriving jobs may not have the historical data about the running status together with the consolidated VM in the same physical host. Then, in this case the historical data for training may not be available. For this situation, we will discuss the corresponding method in the following.

4.2. Inferring the Performance Interference Degree

For two applications, if their resource usage patterns are similar, with the same background VM, their extents of the performance degradation may be similar. Then, when one of the applications is new and little historical data can be used for training its performance interference degree model, we can predict its performance interference by looking at another one’s model. Based on this idea, we will discuss our method in the following.

Suppose that the trained performance interference degree models are kept and stored, so that all the models form a set. Each item in the set carries a workload, which is called the workload pattern. If we do not have enough historical data to train the performance interference model of an application, we can use an available and appropriate model from the set for prediction.

Consider the workload pattern of the virtual machine to be predicted. Finding an appropriate equation in the set means finding the equation whose workload pattern is the most similar to that workload pattern.

Then, in the following, we will show how to compute the similarity degree.

For comparing the similarities, we use the Euclidean distance. For two VMs, the similarity degree between their workload patterns can be computed as follows:

Then, we can use (9) to find the workload patterns that are similar to the workload pattern of the VM to be predicted. In this paper, two workload patterns are considered similar if their similarity exceeds a predefined threshold. For a given workload pattern, several stored workload patterns may satisfy the threshold requirement. In that case, we use the following equation to generate a combined equation, with which we can estimate the performance interference degree for a VM that has no historical data for training the model. The combined equation takes the workload of the VM whose performance is to be predicted. The workload patterns that satisfy the threshold requirement form a set; the combined equation uses the interference model corresponding to each workload pattern in this set, together with the similarity degree between that workload pattern and the workload pattern of the VM to be predicted.
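The inference step can be sketched as follows. The sketch rests on two assumptions that the text does not confirm: the similarity is derived from the Euclidean distance as 1 / (1 + distance), and the combined equation is a similarity-weighted average of the matching models. The function names and the default threshold are hypothetical.

```python
import math


def similarity(pattern_a, pattern_b):
    """Similarity in (0, 1], computed as 1 / (1 + Euclidean distance).

    The mapping from distance to similarity is an assumption of this sketch.
    """
    return 1.0 / (1.0 + math.dist(pattern_a, pattern_b))


def combined_prediction(target_pattern, stored_models, background, threshold=0.5):
    """Predict an interference degree from stored models of similar workloads.

    stored_models: list of (workload_pattern, model) pairs, where
                   model(background) returns a predicted interference degree.
    Returns None when no stored pattern is similar enough.
    """
    matches = []
    for pattern, model in stored_models:
        s = similarity(target_pattern, pattern)
        if s >= threshold:
            matches.append((s, model))
    if not matches:
        return None
    total = sum(s for s, _ in matches)
    # Similarity-weighted average of the matching models (assumed form).
    return sum(s * model(background) for s, model in matches) / total
```

With this weighting, a stored model whose workload pattern is closer to the target contributes more to the combined estimate.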

Then, by using the above methods, the performance interference model can be generated. By using the model, we can estimate the performance interference degree of an application. For a MapReduce job, it may contain a set of tasks. The resource usage patterns of these tasks are always similar [36]. And there are also many research works for predicting the resource demand of the MapReduce jobs. Then, using this information, the performance interference degree between the tasks to be assigned (no matter whether the corresponding job is newly submitted or runs for a while) and the VMs on the candidate physical host can be predicted.

5. Methods for Identifying Straggler and Backing-Up in Virtualized Environment

In our framework, the task trackers will send the heart beat information which includes the resource status of the VMs. Taking the task profile, the status of VMs, and the physical host as inputs, the module of Performance Interference Modeling & Prediction will return a value to evaluate the interference. Then, in every interval, the Straggler Identification Module will predict the remaining time of each running task in the next time interval according to the heart beat information from the slave node and the performance interference degree provided by the Performance Interference Modeling & Prediction. The backup module will back up a new task for the straggler by assigning a new slot to it.

In speculative execution, the task that will finish farthest into the future is backed up, since its backup has the greatest opportunity to overtake the original task and reduce the overall response time of the job. The core of identifying a straggler is therefore to estimate whether a task has a bad progress rate; that is to say, compared with the other tasks in the job, it has a longer remaining time. In the following, we introduce how to estimate the remaining time of a task in order to identify the stragglers.

Consider a job that contains a set of tasks; we now show how to find the straggler tasks in it. The job has a number of allocated map slots and a number of allocated reduce slots, as well as a number of map tasks and a number of reduce tasks still to be executed. The overall remaining time of the job is the sum of the remaining times of the map phase and the reduce phase, and the remaining time of either phase depends on its slowest task. The remaining time of the job can then be computed as (13).

In (13), the remaining time is built from the following quantities: the predicted completion time of the currently running map task, computed as (11); the predicted completion time of the currently running reduce task, computed as (12); the execution time of each map task and of each reduce task; the maximum and average completion times of all the map tasks that have been executed completely; and the maximum and average completion times of all the reduce tasks that have been executed completely. Formulas (11) and (12) use a function that returns the slot on which a task is deployed, the predicted performance interference degree between that slot and the other slots consolidated on the same physical server in the next time interval, and the average performance interference degree between that slot and those slots over the period from the beginning of the execution to the current time.

Then, based on (13), the remaining time of the job can be predicted. If there exists a running task whose predicted completion time makes the remaining time bigger than the required one, this task will be the straggler.
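A minimal sketch of this identification step is given below. Since formulas (11) to (13) are not reproduced here, the estimate is an assumed stand-in: the remaining time is extrapolated from the progress observed so far and then scaled by the ratio of the predicted interference degree to the average interference degree. The class `RunningTask`, its fields, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    task_id: str
    elapsed: float                 # seconds since the task started
    progress: float                # fraction of work done, in (0, 1]
    avg_interference: float        # average interference degree so far (> 0)
    predicted_interference: float  # predicted degree for the next interval


def estimate_remaining(task):
    """Estimate the remaining time of a task (assumed form, see lead-in).

    The time per unit of progress observed so far is extrapolated to the
    remaining work, then scaled by predicted / average interference.
    """
    base = task.elapsed * (1.0 - task.progress) / task.progress
    return base * task.predicted_interference / task.avg_interference


def find_stragglers(tasks, required_remaining):
    """Return the ids of tasks whose estimate exceeds the required time."""
    return [t.task_id for t in tasks if estimate_remaining(t) > required_remaining]
```

Under this assumed form, two tasks with identical progress are ranked differently when one of them is expected to suffer more interference in the next interval, which is the effect the framework relies on to flag stragglers early.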

After the stragglers are identified, a backup task needs to be initiated for each straggler by assigning a slot to it. Since, in every time interval, the Straggler Identification Module predicts the stragglers for the next time interval, there may be a set of straggler tasks to back up. This can be seen as a problem of scheduling a set of tasks in a virtualized computing environment. Performance interference is an important factor in the execution of the tasks: if a task is scheduled to a slot that has a high interference degree with the others, the task may become a new straggler in the future, which may result in bad performance of the job. The performance interference degree therefore needs to be considered when backing up the stragglers. Previous works [37] schedule a task to a slot if the predicted interference degree is not higher than a predefined threshold; otherwise, the task waits for an available node with the required interference degree, or it is assigned to a slot once it has waited for a long time. In these works, the scheduler optimizes the assignment with the consideration of only one task or only one slot, which makes it hard to achieve the global optimization of minimizing the performance interference. For example, when two slots are free simultaneously and the first task in the wait queue has an acceptable interference degree with both nodes, the choice of slot for that task affects the subsequent assignment plan. That is to say, a decision with a global optimization needs to be made.

This paper presents a scheduling strategy with a global optimization, described in Algorithm 2. In each interval, the backup module collects the status of the tasks running in the slots and estimates which slots will be free in the next interval by computing the remaining time of each task. The backup module then assigns a set of tasks to the set of slots that will be free in the next interval, with the global objective of minimizing the performance interference degree of each task. Finding the optimal solution to this problem is NP-complete, so we propose a greedy algorithm that solves it more efficiently. First, the algorithm places a task on the slot with the least interference degree. Then, for the remaining slots to be free in the next interval, it repeats the first step until all the slots are assigned a task.

Algorithm 2: Greedy assignment of backup tasks.
Input: the set SL of slots to be free in the next interval; the queue Q of tasks to be assigned.
Output: assignment plan AP.
Begin
 While Q is not empty do
 Begin
  i = index of the task at the head of Q;
  min = +infinity;
  For each slot slotj in SL do
  Begin
   If slotj.capacity >= Q.element[i].demand then
   begin
    PID = GetPID(Q.element[i], slotj.Background);
    If min > PID then
    begin
     min = PID;
     AP_candidate[i] = slotj;
    end
   end
  End
  If min < threshold then
  begin
   AP[i] = AP_candidate[i];
   remove AP[i] from SL;
  end
  remove Q.element[i] from Q;
 end
 Return AP;
End
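The greedy strategy can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code: tasks are processed in queue order, an assigned slot is removed from the free set, and a task with no acceptable slot is simply left unassigned for this interval. The function name `greedy_assign` and the callback `get_pid` (standing in for the interference prediction) are hypothetical.

```python
import math


def greedy_assign(tasks, slots, get_pid, threshold):
    """Assign each queued task to the free slot with the least interference.

    tasks:     list of (task_id, demand) in queue order
    slots:     dict mapping slot_id to capacity, for slots free next interval
    get_pid:   function (task_id, slot_id) -> predicted interference degree
    threshold: a slot is accepted only if its degree is below this value
    Returns a dict mapping task_id to slot_id for the tasks that were placed.
    """
    free = dict(slots)  # copy so the caller's dict is not modified
    plan = {}
    for task_id, demand in tasks:
        best_slot, best_pid = None, math.inf
        for slot_id, capacity in free.items():
            if capacity >= demand:
                pid = get_pid(task_id, slot_id)
                if pid < best_pid:
                    best_slot, best_pid = slot_id, pid
        if best_slot is not None and best_pid < threshold:
            plan[task_id] = best_slot
            del free[best_slot]  # each slot serves at most one task
    return plan
```

Removing an assigned slot from the free set is what makes an early choice affect later tasks, which is why the order of the queue matters for the resulting plan.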

6. Simulation Results

We evaluate our framework in a 24-node virtual cluster. The cluster has 6 physical servers, one of which hosts the master node. Each server has 4 GB of memory, a 250 GB disk, and an i3 CPU. On each physical server, 4 virtual machines are deployed. Each VM is created using the Xen hypervisor and has 4 VCPUs and 1 GB of memory. We configured each virtual machine with 1 slot, which can be a map slot or a reduce slot. In the whole virtual cluster, we allocate 16 map slots and 8 reduce slots.

We evaluate the framework using 10 MapReduce applications, listed in Table 4. These applications are widely used in previous research works [21, 32, 38, 39] to evaluate the performance of the MapReduce framework. To verify the effectiveness of our work, the experiments compare our scheduler with the main competitors that also consider the performance interference in scheduling.


Table 4
Name         Major resource used   Introduction
TeraSort     I/O                   Sort the input data into a total order
TeraGen      I/O                   Generate and write data into system
Grep         I/O                   Extract matching regular expression
WordCount    I/O                   Count words in the input file
PiEst        CPU                   Estimate Pi
Bayes        CPU                   Construct Bayes classifiers
Matrix       CPU                   Matrix add and multiplication
gzip         mixed                 Compress text files
Bzip2        mixed                 Compress text files
povray       mixed                 A frame rendering tool for 3-D graphics

In this section, we evaluate whether our method is effective in estimating the interference degree. We compare it with the model discussed in previous work [12], which uses a uniform model for all applications. The experiment compares the predicted and actual performance interference degrees. Figure 2 shows the prediction error for each type of job using the different models.

From Figure 2, we can see that the current method has an average error rate of 29%, while our method achieves an average error rate of 15%. This is because our method can train the model even when no historical data about performance interference are available, while the current method relies on a uniform model for all types of applications, which sacrifices prediction accuracy.

In the following part, experiments are done to show whether our method is effective in predicting the remaining time in every time interval.

From Figure 3, we can see that the current method leads to an average prediction error of 20%, which is higher than that of our method. This is because our method considers the performance interference in the estimation of the remaining time, while the current method in [32] only takes an average progress rate for the estimation.
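The difference between the two estimators can be sketched as follows. The first function follows the average-progress-rate estimate used in [32], where the remaining time is (1 - progress) divided by the progress rate. The second function is only a hypothetical illustration of how an interference prediction could enter the estimate: it assumes the interference degree can be expressed as a slowdown factor (1.0 meaning no interference) and rescales the baseline by the ratio of predicted to observed slowdown. It is not the formula used in this paper.

```python
def progress_rate_remaining_time(progress: float, elapsed_s: float) -> float:
    """Baseline estimate: remaining = (1 - progress) / (progress / elapsed)."""
    if not 0.0 < progress <= 1.0:
        raise ValueError("progress must be in (0, 1]")
    if elapsed_s <= 0.0:
        raise ValueError("elapsed time must be positive")
    rate = progress / elapsed_s
    return (1.0 - progress) / rate


def interference_adjusted_remaining_time(
    progress: float,
    elapsed_s: float,
    observed_slowdown: float,
    predicted_slowdown: float,
) -> float:
    """Hypothetical variant: rescale the baseline by the expected change in slowdown.

    Assumption: slowdown factors are >= 1.0 and the task's speed is inversely
    proportional to the slowdown it experiences.
    """
    if observed_slowdown < 1.0 or predicted_slowdown < 1.0:
        raise ValueError("slowdown factors must be >= 1.0")
    base = progress_rate_remaining_time(progress, elapsed_s)
    return base * predicted_slowdown / observed_slowdown


# A task that is 40% done after 20 s; interference is predicted to grow.
print(progress_rate_remaining_time(0.4, 20.0))
print(interference_adjusted_remaining_time(0.4, 20.0, 1.0, 1.5))
```

When the predicted slowdown equals the observed one, the two estimates coincide, so the adjustment only matters when the background workload is expected to change.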

In the following, the experiments show the effectiveness of our method in speculative execution. The performance of the backup module is also affected by data locality. To isolate the effect of performance interference, we conduct the experiment in an intranet environment in which data does not need to be read remotely, which minimizes the effect of data locality as much as possible. We select the Matrix and TeraGen applications, which need no input, and the TeraSort and Gzip applications, which need to read data. We set the numbers of map tasks in the Matrix, TeraGen, TeraSort, and Gzip jobs to 15, 10, 10, and 5, respectively. Every 15 seconds, a batch of jobs containing 3 Matrix jobs, 3 TeraGen jobs, 5 TeraSort jobs, and 2 Gzip jobs is submitted to the virtual cluster. The average normalized completion time is used for evaluation.

In our method, we model the relation between the performance interference degree and the background workload. Therefore, the experiment shows the effectiveness of our scheduler under different states of the background workload. We adjust the background workload by letting different jobs run on the virtualized slave nodes, which varies the CPU, memory, and other system load and thus simulates variations of the background workload. Figures 4 and 5 show the results when using different schedulers on the master node.
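To make the size of the submitted workload concrete, the sketch below encodes the batch described above and computes how many map tasks each batch contributes relative to the 16 map slots of the cluster. It also shows one plausible way to compute an average normalized completion time. The normalization baseline (for example, the completion time of the same job on an otherwise idle cluster) and all names in the code are assumptions, since the baseline is not defined in this section.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JobSpec:
    name: str
    jobs_per_batch: int
    map_tasks_per_job: int


# Values taken from the experimental setup described above.
BATCH: List[JobSpec] = [
    JobSpec("Matrix", 3, 15),
    JobSpec("TeraGen", 3, 10),
    JobSpec("TeraSort", 5, 10),
    JobSpec("Gzip", 2, 5),
]
MAP_SLOTS = 16
BATCH_INTERVAL_S = 15


def map_tasks_per_batch(batch: List[JobSpec]) -> int:
    """Total number of map tasks submitted with one batch."""
    return sum(j.jobs_per_batch * j.map_tasks_per_job for j in batch)


def average_normalized_completion_time(
    measured_s: Dict[str, float], baseline_s: Dict[str, float]
) -> float:
    """Mean of measured / baseline per job (the baseline choice is an assumption)."""
    ratios = [measured_s[name] / baseline_s[name] for name in measured_s]
    return sum(ratios) / len(ratios)


total = map_tasks_per_batch(BATCH)    # 3*15 + 3*10 + 5*10 + 2*5 = 135
waves = math.ceil(total / MAP_SLOTS)  # 135 tasks on 16 slots need 9 waves
print(f"{total} map tasks every {BATCH_INTERVAL_S} s, at least {waves} waves on {MAP_SLOTS} slots")
```

Because one batch already needs several waves of map tasks and a new batch arrives every 15 seconds, the cluster stays saturated, which is the regime in which stragglers delay job completion the most.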

From Figures 4 and 5, when the background workload is heavy, for example, with high CPU and memory utilization, all the applications suffer severe performance degradation when using the FairScheduler [37] and CapacityScheduler [20]. Even with a light background workload, speculative execution performs better than the FairScheduler and CapacityScheduler. The reason is that speculative execution can identify the stragglers and speed up the application. Besides, our speculative execution outperforms the current speculative execution. This is because ours finds the stragglers by prediction, while the current one finds them by waiting for the degradation to occur. In addition, the backing-up module in our framework also considers the performance interference when assigning slots, which may reduce the future risk of degradation caused by the performance interference. However, we also notice that when the background workload is light, the performance of the different schedulers does not differ much. This is because, with a light background workload, the applications suffer little performance degradation from the interference among virtualized slave nodes. In reality, however, maintaining a light background workload is usually not easy, especially considering the cost of the hardware and the system utilization.
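The two ideas discussed above, finding stragglers by prediction and placing backups on slots with low predicted interference, can be sketched in a simplified form. The threshold rule (predicted remaining time above 1.5 times the median), the slowdown-factor representation of interference, and all names are hypothetical choices for illustration; the framework's actual algorithm is described in the earlier sections of the paper.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Task:
    task_id: str
    predicted_remaining_s: float


@dataclass
class Slot:
    slot_id: str
    predicted_slowdown: float  # assumed >= 1.0; 1.0 means no interference


def find_stragglers(tasks: List[Task], factor: float = 1.5) -> List[Task]:
    """Flag tasks whose predicted remaining time is well above the median."""
    if not tasks:
        return []
    typical = median(t.predicted_remaining_s for t in tasks)
    return [t for t in tasks if t.predicted_remaining_s > factor * typical]


def choose_backup_slot(
    task: Task, free_slots: List[Slot], fresh_run_time_s: float
) -> Optional[Slot]:
    """Pick the free slot with the least predicted interference.

    A backup is only worthwhile if a fresh run on that slot is predicted to
    finish before the original task does.
    """
    if not free_slots:
        return None
    best = min(free_slots, key=lambda s: s.predicted_slowdown)
    if fresh_run_time_s * best.predicted_slowdown < task.predicted_remaining_s:
        return best
    return None


tasks = [Task("m1", 10.0), Task("m2", 12.0), Task("m3", 11.0), Task("m4", 40.0)]
slots = [Slot("vm3", 2.0), Slot("vm7", 1.2)]
for straggler in find_stragglers(tasks):
    slot = choose_backup_slot(straggler, slots, fresh_run_time_s=20.0)
    print(straggler.task_id, "->", slot.slot_id if slot else "no backup")
```

Because the decision uses predicted remaining times, a backup can be launched before the slowdown has fully shown up in the observed progress, which is the advantage over waiting for the degradation.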

7. Conclusions

This paper presents an optimized speculative execution framework for MapReduce jobs which aims to improve the performance of the jobs on the virtual cluster. Firstly, we analyze the factors related to the performance degradation in the virtual cluster and present a method for modeling how the factors affect the degradation. Secondly, we develop an algorithm that works with the performance interference prediction to identify the stragglers and assign the tasks.

In this work, when predicting the remaining time of a MapReduce job, only the performance interference factor is considered. In fact, other factors, such as the fault ratio of the physical server, can also affect the accuracy of estimating the remaining time. In future work, we will take such factors into account to improve our method for predicting the remaining time of MapReduce jobs.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Key Technology R&D Program of the Ministry of Science and Technology (2015BAH09F02 and 2015BAH47F03), National Natural Science Foundation of China (60903008 and 61073062), and the Fundamental Research Funds for the Central Universities (N130417002 and N130404011).

References

  1. J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” in Proceedings of the Symposium on Operating Systems Design and Implementation, pp. 137–150, New York, NY, USA, 2004.
  2. B. R. Chang, N. T. Nguyen, B. Vo, and H. Hsu, “Advanced Cloud Computing and Novel Applications,” Mathematical Problems in Engineering, vol. 2015, pp. 1-2, 2015.
  3. “Xen Virtual Machine Monitor,” http://www.xen.org.
  4. S. Ibrahim, H. Jin, L. Lu, L. Qi, S. Wu, and X. Shi, “Evaluating MapReduce on Virtual Machines: The Hadoop Case,” in Proceedings of the International Conference on Cloud Computing, vol. 1-4 of Lecture Notes in Computer Science, pp. 519–528, Springer, Berlin, Germany, 2009.
  5. B. He S, M. Yang, and Z. Guo Y, “Wave Computing in the Cloud,” in Proceedings of the Usenix Workshop on Hot Topics in Operating Systems, Monte Verita, Switzerland, 2009.
  6. S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, and L. Qi, “LEEN: Locality/fairness-aware key partitioning for MapReduce in the cloud,” in Proceedings of the 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010, pp. 17–24, USA, December 2010.
  7. Z. Peng, D. Cui, J. Zuo, and W. Lin, “Research on Cloud Computing Resources Provisioning Based on Reinforcement Learning,” Mathematical Problems in Engineering, vol. 2015, Article ID 916418, 2015.
  8. Y. Koh, R. Knauerhase, P. Brett, M. Bowman, Z. Wen, and C. Pu, “An analysis of performance interference effects in virtual environments,” in Proceedings of the ISPASS 2007: IEEE International Symposium on Performance Analysis of Systems and Software, pp. 200–209, USA, April 2007.
  9. S. Ibrahim, H. Jin, L. Lu, B. He, and S. Wu, “Adaptive disk I/O scheduling for MapReduce in virtualized environment,” in Proceedings of the 40th International Conference on Parallel Processing, ICPP 2011, pp. 335–344, Taiwan, September 2011.
  10. X. Zhang, E. Tune, R. Hagmann, R. Jnagal, V. Gokhale, and J. Wilkes, “CPI2: CPU performance isolation for shared compute clusters,” in Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys 2013, pp. 379–391, Czech Republic, April 2013.
  11. R. Nathuji, A. Kansal, and A. Ghaffarkhah, “Q-clouds: Managing performance interference effects for QoS-aware clouds,” in Proceedings of the 5th ACM EuroSys Conference on Computer Systems, EuroSys 2010, pp. 237–250, France, April 2010.
  12. R. C. Chiang and H. H. Huang, “TRACON: Interference-aware scheduling for data-intensive applications in virtualized environments,” in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC11, USA, November 2011.
  13. P. Lama and X. Zhou, “NINEPIN: Non-invasive and energy efficient performance isolation in virtualized servers,” in Proceedings of the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012, USA, June 2012.
  14. A. Settle, J. Kihm, A. Janiszewski, and D. Connors, “Architectural support for enhanced SMT job scheduling,” in Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques, PACT 2004, pp. 63–73, Antibes Juan-les-Pins, France, 2004.
  15. T. Wood, L. Cherkasova, K. Ozonat, and P. Shenoy, “Profiling and Modeling Resource Usage of Virtualized Applications,” in Middleware 2008, vol. 5346 of Lecture Notes in Computer Science, pp. 366–387, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
  16. S. Kundu, R. Rangaswami, K. Dutta, and M. Zhao, “Application performance modeling in a virtualized environment,” in Proceedings of the 2010 IEEE 16th International Symposium on High Performance Computer Architecture (HPCA), pp. 1–10, Bangalore, India, January 2010.
  17. Y. Mei, L. Liu, X. Pu, and S. Sivathanu, “Performance measurements and analysis of network I/O applications in virtualized cloud,” in Proceedings of the IEEE 3rd International Conference on Cloud Computing, pp. 59–66, Miami, Fla, USA, July 2010.
  18. X. Pu, L. Liu, Y. Mei, S. Sivathanu, Y. Koh, and C. Pu, “Understanding performance interference of I/O workload in virtualized cloud environments,” in Proceedings of the 3rd IEEE International Conference on Cloud Computing, CLOUD 2010, pp. 51–58, USA, July 2010.
  19. C. Delimitrou and C. Kozyrakis, “Paragon: QoS-Aware scheduling for heterogeneous datacenters,” ACM SIGPLAN Notices, vol. 48, no. 4, pp. 77–88, 2013.
  20. Yahoo! inc, Capacity scheduler, 2011, http://developer.yahoo.com/blogs/hadoop/posts/2011/02/capacity-scheduler/.
  21. M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling,” in Proceedings of the 5th ACM EuroSys Conference on Computer Systems (EuroSys '10), pp. 265–278, April 2010.
  22. X. Bu, J. Rao, and C. Xu, “Interference and locality-aware task scheduling for MapReduce applications in virtual clusters,” in Proceedings of the 22nd international symposium, p. 227, New York, New York, USA, June 2013.
  23. X. Zhang, Z. Zhong, S. Feng, B. Tu, and J. Fan, “Improving Data Locality of MapReduce by scheduling in homogeneous computing environments,” in Proceedings of the 9th IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2011, pp. 120–126, Republic of Korea, May 2011.
  24. C. He, Y. Lu, and D. Swanson, “Matchmaking: A new MapReduce scheduling technique,” in Proceedings of the 2011 3rd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2011, pp. 40–47, Greece, December 2011.
  25. M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, “Quincy: Fair scheduling for distributed computing clusters,” in Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles, SOSP'09, pp. 261–276, USA, October 2009.
  26. J. Polo, C. Castillo, D. Carrera et al., “Resource-Aware Adaptive Scheduling for MapReduce Clusters,” in Middleware 2011, vol. 7049 of Lecture Notes in Computer Science, pp. 187–207, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
  27. B. Palanisamy, A. Singh, and L. Liu, “Cost-Effective Resource Provisioning for MapReduce in a Cloud,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 5, pp. 1265–1279, 2015.
  28. X. Fu, Y. Cang, X. Zhu, and S. Deng, “Scheduling method of data-intensive applications in cloud computing environments,” Mathematical Problems in Engineering, vol. 2015, Article ID 605439, 2015.
  29. X. Ma, X. Fan, J. Liu, H. Jiang, and K. Peng, “VLocality: Revisiting Data Locality for MapReduce in Virtualized Clouds,” IEEE Network, vol. 31, no. 1, pp. 28–35, 2017.
  30. N. Lim, S. Majumdar, and P. Ashwood-Smith, “MRCP-RM: A Technique for Resource Allocation and Scheduling of MapReduce Jobs with Deadlines,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 5, pp. 1375–1389, 2017.
  31. S. Tang, B.-S. Lee, and B. He, “DynamicMR: a dynamic slot allocation optimization framework for mapreduce clusters,” IEEE Transactions on Cloud Computing, vol. 2, no. 3, pp. 333–347, 2014.
  32. M. Zaharia, A. Konwinski, and A. Joseph, “Improving mapreduce performance in heterogeneous environments,” in Proceedings of the Usenix Symposium on Operating Systems Design and Implementation, pp. 29–42, San Diego, Ca, USA, 2008.
  33. H. Jung and H. Nakazato, “Dynamic scheduling for speculative execution to improve MapReduce performance in heterogeneous environment,” in Proceedings of the 2014 IEEE 34th International Conference on Distributed Computing Systems Workshops, ICDCSW 2014, pp. 119–124, Spain, July 2014.
  34. K. Kc and K. Anyanwu, “Scheduling hadoop jobs to meet deadlines,” in Proceedings of the 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010, pp. 388–392, USA, December 2010.
  35. Y. Shi and R. C. Eberhart, “Fuzzy adaptive particle swarm optimization,” in Proceedings of the Congress on Evolutionary Computation, vol. 1, pp. 101–106, IEEE, Seoul, Republic of Korea, 2001.
  36. A. Ganapathi, Y. Chen, A. Fox, R. Katz, and D. Patterson, “Statistics-driven workload modeling for the cloud,” in Proceedings of the 2010 IEEE 26th International Conference on Data Engineering Workshops, ICDEW 2010, pp. 87–92, USA, March 2010.
  37. A. Beloglazov and R. Buyya, “Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in Cloud data centers,” Concurrency and Computation: Practice and Experience, vol. 24, no. 13, pp. 1397–1420, 2012.
  38. B. Palanisamy, A. Singh, L. Liu, and B. Jain, “Purlieus: Locality-aware resource allocation for mapreduce in a cloud,” in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC11, USA, November 2011.
  39. G. Ananthanarayanan, S. Agarwal, S. Kandula et al., “Scarlett: Coping with skewed content popularity in MapReduce clusters,” in Proceedings of the 6th ACM EuroSys Conference on Computer Systems, EuroSys 2011, pp. 287–300, Austria, April 2011.

Copyright © 2017 Lei Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

