Abstract
A cloud system usually consists of many server clusters handling various applications. To satisfy increasing demands, especially for front-end web applications, the computing capacity of a cloud system is often provisioned for the peak demand. Such provisioning causes resource underutilization during off-peak hours. VaryOn/VaryOff (VOVO) schemes concentrate workloads on some servers instead of distributing them across all servers in a cluster, which reduces idle energy waste. Recent VOVO schemes adopt queueing theory to model the arrival process and the service process when determining the number of powered-on servers. For the arrival process, a Poisson process can be safely assumed in web services because of the large number of independent sources. On the other hand, heavy-tailed service-time distributions are observed in real web systems. However, there are no exact solutions for the performance of M/heavy-tailed/m queues. Therefore, this paper presents two queueing-based sizing approximations, for Poisson and non-Poisson governed service processes. The simulation results of the proposed approximations are analyzed and evaluated by comparing them with the simulated system running at full capacity. This relative measurement indicates that a Pareto-distributed service process may be adequately modeled by memoryless queues when VOVO schemes are adopted.
1. Introduction
Internet requests are not uniformly distributed over time, and a huge number of requests arrive during peak hours. Cloud service providers tend to install surplus server nodes to handle the bursty load. Clearly, these servers waste a lot of energy during off-peak periods. Dynamically adjusting the number of active servers, that is, a VaryOn/VaryOff (VOVO) scheme, improves the energy efficiency of server clusters. However, overly shrinking the number of powered-on servers may degrade service quality. Therefore, finding the right number of active servers to balance energy consumption and operational performance is a primary issue for an applicable VOVO scheme.
VOVO schemes date back to the early 2000s [1, 2]. The basic idea of these early VOVO schemes is to dynamically size a cluster according to CPU utilization or resource usage. This resource provisioning problem in a cluster is analogous to the staff sizing problem in a telephone call center. In a call center, the customers are callers, the servers are telephone agents, and the tele-queues consist of callers awaiting service by an agent. The well-known Erlang C model [3] has been widely applied to this problem. Many recent VOVO studies [4–9] adopt queueing analysis to manage the resource usage of clusters.
Most available analytic solutions in queueing theory rely on independence assumptions and Poisson processes [10]. Internet traffic patterns are well known to possess extreme variability and bursty structure [11]. Heavy-tailed service-time distributions are observed in real web systems [12, 13], a characteristic associated with self-similar processes [14]. The Pareto distribution is a popular model of self-similar processes [15]. However, queueing models with Pareto-distributed service times are very difficult to analyze [16]. Although heavy-tailed service processes in web systems are widely documented, memoryless queues are still used to evaluate system performance in many studies [17–21]. On the other hand, studies [22–25] that adopt general or Pareto distributions need approximations for these analytically intractable distributions to obtain the performance measures.
The Poisson arrival process is particularly appropriate if the arrivals come from a large number of independent sources [10], such as the users of web services. However, exploring the difference between queues whose service times are governed by Poisson processes and those governed by non-Poisson processes remains a challenging research topic, since many queueing models remain analytically intractable [26]. To understand this performance difference, a series of simulations is conducted in this study. Compared with mathematical analysis and numerical methods, simulation is more time- and memory-consuming, but it is sometimes the only way to obtain reasonably accurate results [27].
This paper presents approximations of VOVO cluster sizing for systems modeled by M/M/m and M/G/m queues. Randomly generated workload traces with Pareto-distributed service times and exponentially distributed interarrival times are simulated using the approximations. Two distinct types of real web access logs are simulated as well. A relative performance evaluation method is proposed and used to gauge the simulation results. Through this evaluation, the performance difference between modeling service times with Poisson-governed and non-Poisson-governed queues is quantified. The result suggests that the M/M/m-based sizing approach may be adequate when a queueing-based VOVO scheme is adopted in a cluster.
This paper is organized as follows. Section 2 shows the approximation methods for cluster sizing. Section 3 details the simulation setup and the evaluation metric. Section 4 presents the simulation process and discusses the results. Section 5 concludes this paper.
2. Approximation for Queueing-Based Cluster Sizing
Queueing analysis of a system mainly aims at obtaining performance measures, which are probabilistic properties of random variables such as the number of customers in the system, the number of waiting customers, the utilization of the servers, the response time of a request, the waiting time of a customer, and the idle and busy times of a server. These measures depend heavily on the assumed distributions of interarrival times and service times, as well as on the number of servers and the service discipline. Queueing analysis can therefore be naturally applied to the performance measures of server clusters. Server clusters have been widely adopted in many cloud data centers to meet increasing user demand [28]. Although heterogeneity is common in multifunctional cloud data centers, the server closets or blade systems that form the basic computing units usually consist of homogeneous nodes. Therefore, this work focuses on single-queue homogeneous systems.
The symbols and definitions used in this paper to describe the performance measures of queueing systems are listed in the Symbols and Definitions section.
In classical queueing analysis, if requests are handled by a single-queue homogeneous server system with the First-Come First-Serve (FCFS) discipline, exponentially distributed service times, and arrivals governed by a Poisson process, the system can be modeled as an M/M/m system. The traffic intensity must be less than 1 for the system to be stable. Many performance measures of a stable M/M/m system have been thoroughly studied and are given in (1) to (7). The derivations and proofs of these equations can be found in many textbooks, for example, [29, p. 412].
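The M/M/m measures in (1) to (7) are standard textbook results. As a minimal sketch (the function names are our own and this is not the authors' code), the Erlang C probability of waiting and the resulting mean response time can be computed as follows, with lam the arrival rate, mu the per-server service rate, and m the number of servers:

```python
def erlang_c(m: int, lam: float, mu: float) -> float:
    """Erlang C: probability that an arriving job has to wait in an M/M/m queue."""
    a = lam / mu              # offered load in Erlangs
    rho = a / m               # traffic intensity per server
    if rho >= 1.0:
        raise ValueError("unstable: traffic intensity must be below 1")
    term = 1.0                # a^k / k!, starting at k = 0
    partial = 1.0             # sum of a^k / k! for k = 0..m-1
    for k in range(1, m):
        term *= a / k
        partial += term
    tail = term * (a / m) / (1.0 - rho)   # a^m / (m! (1 - rho))
    return tail / (partial + tail)


def mmm_response_time(m: int, lam: float, mu: float) -> float:
    """Mean response time of an M/M/m FCFS queue: mean wait plus mean service time."""
    mean_wait = erlang_c(m, lam, mu) / (m * mu - lam)
    return mean_wait + 1.0 / mu


if __name__ == "__main__":
    # Mean response time for 1..5 servers at a fixed arrival rate (mu = 1).
    for m in range(1, 6):
        print(m, mmm_response_time(m, 0.9, 1.0))
```

Building the sum term by term avoids computing large factorials directly, which keeps the sketch numerically stable for moderately large m.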
2.1. Approximation for Sizing M/M/m-Modeled Clusters
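In a homogeneous system, the mean service rate of each server is identical. From (5), the mean response time of an M/M/m system can therefore be considered as a function of the arrival rate for a given number of servers. For a given service rate, consider the arrival rate at which an M/M/m system just maintains a targeted response time; graphically, this is the intersection of the response-time curve with the target. The response-time curves for several numbers of servers, together with a targeted response time, are shown in Figure 1.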
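For a single server, this arrival rate can be easily obtained from (5). For two or more servers, the mean response time can be represented based on (7), and the state probability it requires follows from (1) and (2). Therefore, to obtain this arrival rate for m servers, equation (13), which sets the mean response time equal to the targeted response time, has to be solved. For two servers, the arrival rate can also be easily obtained by solving (13) with m = 2.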
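For one and two servers, the textbook mean response times are 1/(μ - λ) for M/M/1 and 1/(μ(1 - ρ²)) with ρ = λ/(2μ) for M/M/2, and both can be inverted in closed form. The sketch below does this under the assumption that these textbook expressions are what the omitted equations reduce to; the function names are hypothetical.

```python
import math


def lam_target_one_server(mu: float, target: float) -> float:
    """Arrival rate at which the M/M/1 mean response time 1/(mu - lam) equals target."""
    if mu * target <= 1.0:
        raise ValueError("target must exceed the mean service time 1/mu")
    return mu - 1.0 / target


def lam_target_two_servers(mu: float, target: float) -> float:
    """Arrival rate at which the M/M/2 mean response time 1/(mu (1 - rho^2)) equals
    target, where rho = lam / (2 mu)."""
    if mu * target <= 1.0:
        raise ValueError("target must exceed the mean service time 1/mu")
    return 2.0 * mu * math.sqrt(1.0 - 1.0 / (mu * target))
```

Both functions require the target to exceed the mean service time 1/μ, since no arrival rate can bring the mean response time below the service time itself.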
It is difficult to obtain a closed-form expression for this arrival rate in terms of the number of servers, the service rate, and the targeted response time when m > 2. Therefore, an approximation is proposed for m > 2. Assume that the approximation is applicable to systems with up to a given maximum number of servers. Each response-time curve is shifted by an offset value to yield a shifted curve. Figure 2 shows the original and shifted curves together, with emphasis on the intersections between the targeted response time and these curves.
By observing Figure 2, the distances between consecutive intersection points approximately form an exponential decay series. Let this series be approximated by an exponential decay function with an initial quantity and an exponential decay constant, so that each element of the series can be expressed in terms of these two parameters. The initial quantity can be obtained from (10). From (17), (16), (10), and (14), the decay constant is determined, and it can be expressed explicitly by rearranging (18). The arrival rate at which m servers just maintain the targeted response time can then be represented in closed form and approximated for a positive integer number of servers. Consequently, with an anticipated arrival rate and the measured service rate, the number of servers that maintain the targeted mean response time can be approximated by (22).
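Because the closed-form approximation (22) is not reproduced here, the following sketch instead provides a numerical baseline one could compare it against: bisection for the arrival rate at which an M/M/m queue exactly meets the target (the intersections in Figures 1 and 2), and a direct search for the smallest adequate number of servers. This is a swapped-in numerical technique, not the authors' approximation, and all names are hypothetical.

```python
def erlang_c(m, lam, mu):
    """Erlang C probability of waiting in an M/M/m queue (requires lam < m * mu)."""
    a, term, partial = lam / mu, 1.0, 1.0
    for k in range(1, m):
        term *= a / k
        partial += term
    tail = term * (a / m) / (1.0 - a / m)
    return tail / (partial + tail)


def mmm_response_time(m, lam, mu):
    """Mean response time (waiting plus service) of an M/M/m FCFS queue."""
    return erlang_c(m, lam, mu) / (m * mu - lam) + 1.0 / mu


def lam_target(m, mu, target, rel_tol=1e-10):
    """Arrival rate at which the M/M/m mean response time equals target (bisection)."""
    if mu * target <= 1.0:
        raise ValueError("target must exceed the mean service time 1/mu")
    lo, hi = 0.0, m * mu            # response time increases monotonically on [0, m*mu)
    while hi - lo > rel_tol * m * mu:
        mid = 0.5 * (lo + hi)
        if mmm_response_time(m, mid, mu) <= target:
            lo = mid
        else:
            hi = mid
    return lo


def min_servers_exact(lam, mu, target):
    """Smallest m whose M/M/m mean response time meets target."""
    if mu * target <= 1.0:
        raise ValueError("target must exceed the mean service time 1/mu")
    m = int(lam // mu) + 1          # fewest servers with utilization below 1
    while mmm_response_time(m, lam, mu) > target:
        m += 1
    return m


if __name__ == "__main__":
    mu, target = 1.0, 2.0
    rates = [lam_target(m, mu, target) for m in range(1, 8)]
    gaps = [b - a for a, b in zip(rates, rates[1:])]
    print("target-meeting arrival rates:", [round(r, 4) for r in rates])
    print("gaps between consecutive rates:", [round(g, 4) for g in gaps])
```

Tabulating the gaps this way is one hypothetical means of checking, for a given service rate and target, how closely the exponential-decay assumption behind (22) holds.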
2.2. Approximation for Sizing M/G/m-Modeled Clusters
Internet workload characterization has found that service times in real web systems follow a heavy-tailed rather than an exponential distribution [12, 14, 30]. In other words, a single-queue server cluster for Internet services should be modeled as an M/G/m queue. There are no exact formulas for the mean response time of an M/G/m system, but numerous approximations can be used. Kingman's Exponential Law of Congestion is a popular approximation that is calculated using the coefficient of variation of service times and the known solutions of M/M/m queues. Using Kingman's approximation, the mean response time of an M/G/m system can be expressed in terms of that of the corresponding M/M/m system. Viewed as a function of the arrival rate, as in (8), the M/G/m mean response time rises more steeply than its M/M/m counterpart; nevertheless, the correlation observed in Figure 2 and the aforementioned approximation remain valid.
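One commonly used form of this kind of approximation (known in the multi-server setting as the Lee-Longton or Allen-Cunneen approximation) scales the M/M/m mean waiting time by (1 + C_s²)/2, where C_s is the coefficient of variation of service times. The sketch below assumes this form; whether it matches the paper's omitted equations term for term is an assumption, and the function names are hypothetical.

```python
def erlang_c(m, lam, mu):
    """Erlang C probability of waiting in an M/M/m queue (requires lam < m * mu)."""
    a, term, partial = lam / mu, 1.0, 1.0
    for k in range(1, m):
        term *= a / k
        partial += term
    tail = term * (a / m) / (1.0 - a / m)
    return tail / (partial + tail)


def mgm_response_time_approx(m, lam, mu, cv):
    """Approximate M/G/m mean response time: M/M/m mean wait scaled by (1 + cv^2) / 2."""
    mmm_wait = erlang_c(m, lam, mu) / (m * mu - lam)
    return (1.0 + cv * cv) / 2.0 * mmm_wait + 1.0 / mu


def min_servers_mgm(lam, mu, cv, target):
    """Smallest m whose approximate M/G/m mean response time meets target."""
    if mu * target <= 1.0:
        raise ValueError("target must exceed the mean service time 1/mu")
    m = int(lam // mu) + 1          # fewest servers with utilization below 1
    while mgm_response_time_approx(m, lam, mu, cv) > target:
        m += 1
    return m
```

With C_s = 1 the approximation reduces exactly to the M/M/m result, and with C_s = 0 and one server it gives the exact M/D/1 waiting time.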
Let the M/G/m model have variables corresponding to those previously defined for the M/M/m model. The mean response time of a single-server (M/G/1) system can be obtained from the Pollaczek-Khintchine formula. Supposing that the targeted response time is unchanged, the arrival rate at which the M/G/1 system just meets it follows directly.
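For a single server, the Pollaczek-Khintchine mean-value formula gives E[R] = 1/μ + ρ(1 + C_s²)/(2μ(1 - ρ)) with ρ = λ/μ. Because the waiting term is a linear-fractional function of ρ, it can be inverted in closed form. The following sketch (hypothetical names) computes the arrival rate at which an M/G/1 queue just meets a target T:

```python
def mg1_response_time(lam, mu, cv):
    """Pollaczek-Khintchine mean response time of an M/G/1 FCFS queue."""
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("unstable: utilization must be below 1")
    return 1.0 / mu + rho * (1.0 + cv * cv) / (2.0 * mu * (1.0 - rho))


def mg1_lam_target(mu, cv, target):
    """Arrival rate at which the M/G/1 mean response time equals target."""
    if mu * target <= 1.0:
        raise ValueError("target must exceed the mean service time 1/mu")
    a = (1.0 + cv * cv) / (2.0 * mu)   # mean wait = a * rho / (1 - rho)
    b = target - 1.0 / mu              # mean waiting time the target allows
    return mu * b / (a + b)            # solve a * rho / (1 - rho) = b for rho, times mu
```

With C_s = 1, the result reduces to μ - 1/T, the M/M/1 value. A larger C_s lowers the admissible arrival rate.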
Following the same process as from (14) to (20), the corresponding equations for the M/G/m model can be derived. With an anticipated arrival rate and the measured service rate, the number of servers expected to maintain the required mean response time can then be approximated by (29).
3. Simulation Setup and Evaluation Metric
A cluster managed by a VOVO scheme periodically adjusts the number of active servers that provide the required services. In general, such a cluster has several key functional components, including the following:
(1) Job queue: the job queue holds the waiting requests. Each request enters the tail of the queue and waits for service in FCFS order. In this work, all jobs share a common queue.
(2) Workload distributor: the workload distributor retrieves a job from the head of the job queue and dispatches it to an available node.
(3) Cluster sizing unit: this unit decides the number of active servers. The decision may be based on predefined thresholds of certain resources, for example, CPU utilization, job throughput, and energy usage. In this work, the decision is calculated using (22) or (29) from the given arrival rate, the mean service rate, and the targeted response time.
(4) On/off controller: the on/off controller periodically activates or deactivates server nodes according to the number given by the sizing unit.
(5) Managed servers: the cluster consists of a group of identical computer nodes, which may be commodity servers. Each server node processes the assigned jobs and reports its working status to the workload distributor.
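To illustrate how the sizing unit and the on/off controller interact, the following sketch assigns one server count to each sizing interval of a trace. It assumes that the anticipated arrival rate for an interval is the rate observed in that interval (possible in simulation because traces are prepared in advance), and it uses an exact M/M/m search in place of (22). Both choices, the default 300-second interval, and all names are hypothetical.

```python
import math


def erlang_c(m, lam, mu):
    """Erlang C probability of waiting in an M/M/m queue (requires lam < m * mu)."""
    a, term, partial = lam / mu, 1.0, 1.0
    for k in range(1, m):
        term *= a / k
        partial += term
    tail = term * (a / m) / (1.0 - a / m)
    return tail / (partial + tail)


def mmm_response_time(m, lam, mu):
    return erlang_c(m, lam, mu) / (m * mu - lam) + 1.0 / mu


def size_cluster(lam, mu, target, n_max):
    """Sizing unit: smallest server count meeting target, clamped to [1, n_max]."""
    if lam <= 0.0:
        return 1
    m = math.floor(lam / mu) + 1          # fewest servers with utilization below 1
    while m < n_max and mmm_response_time(m, lam, mu) > target:
        m += 1
    return min(m, n_max)


def plan_intervals(arrival_times, mu, target, n_max, interval=300.0):
    """On/off controller: one server count per sizing interval of a prepared trace."""
    if not arrival_times:
        return []
    counts = [0] * (int(max(arrival_times) // interval) + 1)
    for t in arrival_times:
        counts[int(t // interval)] += 1
    # Assumption: the anticipated rate of an interval is its observed rate.
    return [size_cluster(c / interval, mu, target, n_max) for c in counts]
```

Clamping to the physical cluster size mirrors the fact that the on/off controller cannot activate more nodes than exist, even when the demand calls for them.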
3.1. The Design of Simulation Program
A simulation program for the VOVO-managed system is developed to investigate the performance of the proposed sizing methods. The program is written in C++.
In a real VOVO-managed system, every incoming job is queued, and an event notification is issued to the workload distributor upon the arrival of the job. If nodes are available, the workload distributor dispatches the queued jobs to them. When a node completes its assigned job, it also sends an event notification to inform the distributor of its availability. Instructions for node activation and deactivation are periodically issued by the on/off controller. If a deactivation command is issued to a busy node, the node completes the job it is processing before turning itself off. However, simulating the system as a time-based, event-driven process would be extremely time-consuming. Since the input workload traces are prepared in advance, this work adopts a sequential process that significantly reduces the simulation time. The simulation process is shown in Algorithm 1.
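Algorithm 1 is not reproduced in this text. As a minimal sketch of why a sequential process is fast, the following FCFS simulation for a fixed number of identical servers processes a pre-sorted trace in a single pass, assigning each job to the server that becomes free earliest (tracked in a min-heap of finish times). It deliberately omits periodic on/off switching and pending-off nodes, and its names are hypothetical.

```python
import heapq


def simulate_fcfs(arrivals, services, m):
    """Response times of a single-queue FCFS system with m identical servers.

    arrivals must be sorted in non-decreasing order; services[i] belongs to arrivals[i].
    """
    free_at = [0.0] * m                     # time each server next becomes free (valid heap)
    responses = []
    for t, s in zip(arrivals, services):
        earliest = heapq.heappop(free_at)   # server that frees up first
        start = max(t, earliest)            # wait in the queue if every server is busy
        finish = start + s
        heapq.heappush(free_at, finish)
        responses.append(finish - t)
    return responses
```

Each job costs O(log m) heap operations, so a trace of n jobs is simulated in O(n log m) time without an event calendar.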

3.2. Randomly Generated Traces
A set of randomly generated traces and two real-world traces are simulated in this work. The Pareto distribution is the most widely used heavy-tailed service-time distribution [31]. The Poisson distribution is appropriate if the arrivals come from a large number of independent sources, such as web requests [10, 32]. Therefore, the randomly generated traces have Pareto-distributed service times with tail indexes from 0.1 to 4.0 in steps of 0.1, and exponentially distributed interarrival times with traffic intensities from 0.05 to 0.95 in steps of 0.05.
A randomly generated trace with tail index α and traffic intensity ρ is a series of pairs, each consisting of an arrival time and a service time. Each unique combination of α and ρ is randomly generated 10 times; that is, there are 10 different traces for each combination. Each trace covers 36,000 time units, and all traces are generated with the same mean service time. Therefore, 40 × 19 × 10 = 7,600 randomly generated traces are simulated in this study. The generating functions for Pareto-distributed and exponentially distributed values can be found in many textbooks, for example, [10, p. 509]. The coefficient of variation, the ratio of the standard deviation to the mean, is often used to measure the relative variation in data. For Pareto-distributed values with tail index α, the coefficient of variation can be calculated as [10]
$$C_s = \frac{1}{\sqrt{\alpha(\alpha - 2)}}, \qquad \alpha > 2,$$
and it is infinite for α ≤ 2. The coefficient of variation of exponentially distributed values is 1. The coefficients of variation of the service times and arrival intervals of the generated traces are shown in Figures 3(a) and 3(b), respectively.
Figure 3: (a) CV of service times versus tail index; (b) CV of arrival intervals versus tail index.
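The generation procedure can be sketched with Python's `random.Random.paretovariate` (Pareto with scale 1) for service times and `expovariate` for interarrival times. Rescaling service times to a common sample mean (needed because the Pareto mean is infinite for tail indexes of 1 or less), the 10-server default, and all names are assumptions, not the authors' generator.

```python
import math
import random


def pareto_cv(alpha):
    """Theoretical coefficient of variation of a Pareto distribution; infinite for alpha <= 2."""
    return math.inf if alpha <= 2.0 else 1.0 / math.sqrt(alpha * (alpha - 2.0))


def make_trace(alpha, rho, servers=10, mean_service=1.0, horizon=36_000.0, seed=0):
    """List of (arrival_time, service_time) pairs covering [0, horizon)."""
    rng = random.Random(seed)
    lam = rho * servers / mean_service      # arrival rate that yields traffic intensity rho
    arrivals = []
    t = rng.expovariate(lam)
    while t < horizon:
        arrivals.append(t)
        t += rng.expovariate(lam)
    if not arrivals:
        return []
    raw = [rng.paretovariate(alpha) for _ in arrivals]
    scale = mean_service * len(raw) / sum(raw)   # assumption: common sample mean
    return [(a, s * scale) for a, s in zip(arrivals, raw)]


TAIL_INDEXES = [k / 10 for k in range(1, 41)]    # 0.1, 0.2, ..., 4.0
INTENSITIES = [k / 20 for k in range(1, 20)]     # 0.05, 0.10, ..., 0.95
```

For example, `make_trace(1.5, 0.5)` produces one trace with tail index 1.5 and traffic intensity 0.5 for a 10-server cluster; varying `seed` gives the 10 replicates per combination.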
3.3. Real-World Traces
This simulation adopts two real-world workload traces: a publicly available trace and a trace acquired from a university campus. The service time of a request is assumed to be proportional to the size of its responded page.
The publicly available trace was recorded at the 1998 World Cup web site [30]. This workload trace is one of the few logs providing server activation records. It is known for having a heavy-tailed page-size distribution with a tail index of 1.37 [30]. Each request recorded in the log contains an arrival time, a responded page size, and a server identification. The 1998 World Cup log was collected from 05:30:17 May 1, 1998, through 05:59:55 July 27, 1998, a total of 87 days. The log exhibits the following characteristics: 1,352,804,107 requests, 33 hosting servers, an average of 4,040.684 bytes per response, 108.71 requests per second per server (the peak service rate) [30], and an average service time of 0.0092 seconds per request with a standard deviation of 0.084 seconds.
The second workload trace is acquired from a university with a student population of 4,219, including 3,531 undergraduates. This web access log was collected from 12:03:59 September 19, 2014, through 00:01:39 October 21, 2014, a total of 31 days. The log comes from a site hosting a student information system that provides course information, a handout/homework system, a message system, an email system, and other campus information. The log exhibits the following characteristics: 7,054,170 requests, 8 hosting servers, an average of 5,991.64 bytes per response, 74.74 requests per second per server (the peak service rate), an average service time of 0.0134 seconds per request with a standard deviation of 0.227 seconds, and a tail index of 0.154 for the service time distribution.
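The text states only that service time is proportional to page size. One hypothetical mapping rescales sizes linearly to a chosen mean service time. The reported statistics are also consistent with a mean service time equal to the reciprocal of the peak per-server service rate, and they imply large coefficients of variation, as the following check shows:

```python
def service_times_from_sizes(sizes, mean_service):
    """Service time proportional to response size, rescaled to a target mean (assumed mapping)."""
    mean_size = sum(sizes) / len(sizes)
    return [mean_service * size / mean_size for size in sizes]


# Summary statistics reported in the text for the two logs.
LOGS = {
    "World Cup 1998": {"peak_rate": 108.71, "mean_s": 0.0092, "std_s": 0.084},
    "Campus 2014": {"peak_rate": 74.74, "mean_s": 0.0134, "std_s": 0.227},
}

for name, stats in LOGS.items():
    implied_mean = 1.0 / stats["peak_rate"]   # mean service time implied by the peak rate
    cv = stats["std_s"] / stats["mean_s"]      # coefficient of variation of service times
    print(f"{name}: 1/peak rate = {implied_mean:.4f} s, CV = {cv:.1f}")
# World Cup 1998: 1/peak rate = 0.0092 s, CV = 9.1
# Campus 2014: 1/peak rate = 0.0134 s, CV = 16.9
```

Both coefficients of variation are far above 1, the value for exponentially distributed service times.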
The hourly traffic patterns of the 1998 World Cup log and the 2014 campus log are shown in Figures 4(a) and 4(b), respectively. The two logs represent two distinct service patterns: an occasional service pattern (the 1998 World Cup) and a regular service pattern (the student information system). The World Cup log shows a growth-decay pattern, whereas the campus log shows a recurring pattern that follows daily working hours. Note that a school break and a scheduled maintenance occurred during the recorded period.
Figure 4: (a) Traffic pattern of the World Cup log; (b) traffic pattern of the campus log.
For the World Cup log, the simulated cluster consists of 33 servers, based on the information given in the log. For the campus log, the simulated cluster consists of 8 servers, and for the randomly generated traces, it consists of 10 servers. The on/off controller periodically sizes the simulated cluster with the interval set at 300 seconds, which is long enough to compensate for machine boot-up delays and short enough to reflect demand changes [1, 2, 33].
3.4. Evaluation Metric
Three simulation scenarios are performed: all-on, M/M/m, and M/G/m. In the all-on scenario, all servers in a cluster are always powered on; this scenario is expected to consume the most energy but to provide the best service quality. The M/M/m scenario uses (22) to approximate the number of servers. The M/G/m scenario is similar to the M/M/m scenario except that (29) is used for the sizing approximation. Nielsen's response time limits for usability [34] are adopted by setting the targeted response time at 1 second and the failure threshold at 10 seconds.
The objective of a VOVO scheme is to reduce energy consumption while maintaining a reasonable service quality. To gauge the performance of an approach, measures relative to the all-on scenario are adopted instead of absolute measurements, since the all-on scenario must have the shortest response time and the highest energy consumption. The factors considered for a scenario are as follows:
(1) Satisfaction: the portion of responses conforming to the targeted response time.
(2) Acceptance: the portion of responses that are admissible, that is, under the failure threshold.
(3) Energy: the average number of activated servers, since all servers are identical and have the same power profile.
The satisfaction, acceptance, and energy of each scenario are measured relative to those of the all-on scenario.
Weighting coefficients are assigned to satisfaction, acceptance, and energy, respectively, and the relative performance of a scenario is defined in (33) by combining the three relative measurements with these weights.
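Satisfaction, acceptance, and energy follow directly from the definitions above. Equations (32) and (33) are not reproduced in this text, so the ratio form and the equal weights below are a hypothetical illustration. They are chosen so that all-on scores exactly 1 and lower values are better, consistent with the statement in Section 4.1 that the optimal solution minimizes (33).

```python
def scenario_measures(response_times, active_servers, target=1.0, failure=10.0):
    """Satisfaction, acceptance, and energy of one scenario."""
    n = len(response_times)
    satisfaction = sum(1 for r in response_times if r <= target) / n
    acceptance = sum(1 for r in response_times if r < failure) / n
    energy = sum(active_servers) / len(active_servers)   # average number of active servers
    return satisfaction, acceptance, energy


def relative_performance(scenario, all_on, weights=(1 / 3, 1 / 3, 1 / 3)):
    """HYPOTHETICAL form of (33): each ratio equals 1 for all-on, and lower is better."""
    s, a, e = scenario
    s0, a0, e0 = all_on
    if s == 0 or a == 0:
        return float("inf")                 # no satisfied or accepted responses at all
    w_s, w_a, w_e = weights
    return w_s * (s0 / s) + w_a * (a0 / a) + w_e * (e / e0)
```

Under this assumed form, a scenario beats all-on (scores below 1) when its energy savings outweigh its loss of satisfied and accepted responses.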
4. Simulation Results and Analysis
4.1. Simulation Results
With this relative measurement, that is, (33), the optimal solution produces the minimal relative performance value. The simulation results of the randomly generated traces are summarized by the relative performance of the simulated scenarios to all-on under the chosen weighting coefficients. To make the results easy to comprehend, the relative performances of the M/M/m and M/G/m scenarios are visualized using gray levels. Figure 5 shows the relative performance of these two scenarios. It is very difficult to visually differentiate Figures 5(a) and 5(b). Using the averaged values shown in Figure 6, it can be found that the M/G/m scenario performs slightly better than the M/M/m scenario. On average (Figure 6), the M/M/m and M/G/m scenarios outperform all-on in most cases, except when the tail index is between 0.4 and 0.9. Furthermore, the averaged relative performances shown in Figure 6(a) are clearly correlated with the coefficient of variation of service times shown in Figure 3(a). This simulation result indicates that both M/M/m and M/G/m yield worse performance than all-on for highly variable access patterns, which may imply that these approaches undersize the cluster when service times vary greatly.
Figure 5: (a) Relative performance of the M/M/m scenario; (b) relative performance of the M/G/m scenario.
Figure 6: (a) Relative performances versus tail index; (b) relative performances versus traffic intensity.
In Figures 6(a) and 6(b), the curves of the M/M/m and M/G/m scenarios are indistinguishable at these scales. In fact, their relative performances are not identical. Figure 7(a) shows the ratios of the relative performance of M/G/m to that of M/M/m. In some regions between tail indexes 0.3 and 1.3, the ratios differ from 1, that is, the two scenarios perform differently. In Figure 7(b), the average ratio is always less than or equal to 1, which means that M/G/m-based sizing is more effective than M/M/m-based sizing. However, Figure 7 also shows that the difference is very small, under 1% on average. Given the fluctuating nature of web traffic, M/M/m-based sizing may be adequate in practice.
Figure 7: (a) Relative performance ratios of M/G/m to M/M/m; (b) average of relative performance ratios.
To examine the above findings, two real-world traces are simulated under the previously mentioned scenarios, that is, all-on, M/M/m, and M/G/m. Figure 8 shows the cumulative distribution of the response times of the simulated real-world traces. As shown in Figure 8(a), all requests in the all-on scenario can be served within 1 second, but only approximately 80% of requests meet this targeted response time in the M/M/m and M/G/m scenarios, whose curves are indistinguishable. In Figure 8(b), more than 99.96% of requests in the all-on scenario can be served within 1 second, and more than 97% of requests meet this target in the M/M/m and M/G/m scenarios. Again, the curves of M/M/m and M/G/m are indistinguishable.
Figure 8: (a) CDF of response times (World Cup); (b) CDF of response times (campus).
Based on the relative performance, Table 1 shows that the M/M/m and M/G/m scenarios are very similar for both traces. As expected, all-on always has the shortest mean response time but the highest energy consumption. The proposed queueing-based sizing approaches, that is, M/M/m and M/G/m, can significantly reduce energy consumption while maintaining a reasonable service quality.
4.2. Analysis and Comparison
Energy consumption and service quality are two major performance measures of server machines for a cloud service provider. The above results are based entirely on simulation. To evaluate the proposed strategy on a real system, a 6-hour log is extracted from the World Cup trace and fed to a cluster of 33 computers. In addition to the 33-node cluster, an external computer hosts the other key functional components described in Section 3, and a network switch connects all nodes and the external computer through 1000BASE-T Ethernet. The extracted log contains 22,821,177 access records, from 17:20:00 GMT to 23:19:59 GMT on June 29, 1998. Each node of the cluster is equipped with a 1.66 GHz Intel Atom N280 processor (single core with Hyper-Threading) and 1 GB of memory. All nodes run Linux 2.6 as the operating system with Apache 2.2 installed. The average power demand is 20.83 W when an idle node waits for a request with all its parts turned on. The measured peak power of a node is 26.33 W. The node profile of the test cluster is shown in Table 2.
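The setup numbers above permit a quick back-of-the-envelope check. The sketch assumes that the World Cup peak per-server rate of 108.71 requests per second applies to the test nodes and that every powered-on node draws between the measured idle and peak power. It uses the 6-hour average arrival rate, which hides the peaks that interval-by-interval sizing must cover; the variable names are hypothetical.

```python
import math

RECORDS = 22_821_177            # access records in the extracted 6-hour log
DURATION_S = 6 * 3600           # 17:20:00 to 23:19:59 GMT, June 29, 1998
MU = 108.71                     # peak per-server service rate of the World Cup log (req/s)
NODES = 33
IDLE_W, PEAK_W = 20.83, 26.33   # measured per-node idle and peak power (W)

lam = RECORDS / DURATION_S                          # average arrival rate (req/s)
offered_load = lam / MU                             # average number of busy servers
min_stable_servers = math.floor(offered_load) + 1   # fewest servers with utilization < 1
all_on_power_range = (NODES * IDLE_W, NODES * PEAK_W)            # all-on power bounds (W)
all_on_energy_kwh = tuple(p * 6 / 1000 for p in all_on_power_range)  # over 6 hours

print(f"average rate {lam:.1f} req/s, offered load {offered_load:.2f} servers")
print(f"all-on power between {all_on_power_range[0]:.2f} and {all_on_power_range[1]:.2f} W")
```

On average, roughly 10 of the 33 nodes carry the load, which indicates why sizing the cluster to the demand can save substantial energy relative to all-on.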
In the evaluation, the on/off controller periodically sizes the cluster with the interval set at 300 seconds. The interval energy data of the cluster, excluding the external computer and the network switch, is measured and recorded by a digital multimeter (DMM). The evaluation result, shown in Figure 9, is consistent with the simulation results. As shown in Figure 9(a), with all nodes turned on, that is, in the all-on scenario, all requests are responded to within 1 second, while only approximately 92% of requests are responded to within 1 second in either the M/M/m or the M/G/m scenario. On the other hand, both the M/M/m and M/G/m scenarios consume much less energy than the all-on scenario, as shown in Figure 9(b). As in the simulation results, the curves of M/M/m and M/G/m are very close to each other in both Figures 9(a) and 9(b).
Figure 9: (a) Response times; (b) energy usage.
VOVO strategies have been studied for more than a decade. Many VOVO approaches [33, 35–37], which dynamically size a cluster according to a preset threshold of CPU utilization or resource usage, were developed based on the designs proposed by Chase et al. [1] or Pinheiro et al. [2]. To compare the proposed queueing-based approach with the threshold-based approaches, Pinheiro's approach [2] is simulated and denoted as the vovo scenario. In vovo, the service demand is smoothed and estimated using the cumulative moving average. vovo periodically activates one more node of the cluster when the estimated utilization rate exceeds a predefined threshold and deactivates one node otherwise. The World Cup trace is also used in the simulation of vovo. Since vovo uses a CPU utilization threshold instead of the response time as the controlling factor, three threshold values, 0.7, 0.8, and 0.9, are simulated to obtain comparable results.
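The threshold rule of the vovo scenario can be sketched as follows. The demand input (the number of busy-server equivalents observed per period), the utilization estimate (smoothed demand divided by the number of active servers), the clamping bounds, and all names are assumptions; the text describes the rule only in prose.

```python
class ThresholdVovo:
    """Threshold-based VOVO controller: the demand is smoothed by a cumulative moving
    average; one node is added above the threshold and one is removed otherwise."""

    def __init__(self, threshold, n_max, n_start, n_min=1):
        self.threshold = threshold
        self.n_min, self.n_max = n_min, n_max
        self.active = n_start
        self.avg = 0.0          # cumulative moving average of the demand
        self.samples = 0

    def update(self, demand):
        """Feed one period's demand (busy-server equivalents) and return the new node count."""
        self.samples += 1
        self.avg += (demand - self.avg) / self.samples
        utilization = self.avg / self.active
        if utilization > self.threshold:
            self.active = min(self.active + 1, self.n_max)
        else:
            self.active = max(self.active - 1, self.n_min)
        return self.active
```

For example, with a threshold of 0.8 and 10 active nodes, a demand of 9.5 raises the count to 11. A subsequent demand of 5.5 brings the moving average to 7.5 and the utilization to about 0.68, so the count drops back to 10.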
The simulation results of vovo are evaluated with the metric proposed in Section 3.4 and compared with all-on and M/M/m, as shown in Table 3. From this comparison, the CPU utilization threshold has to be less than 0.8 for vovo to obtain a result comparable with M/M/m. Although vovo outperforms M/M/m with the threshold set at 0.7, it requires more nodes and therefore consumes more energy than M/M/m. Finding a reasonable threshold value for vovo may require several simulation runs or other lengthy procedures. In contrast, the proposed approach requires only the anticipated arrival rate, the service rate, and the desired response time to approximate the required number of servers, that is, (22).
5. Conclusion
This paper proposes two queueing-based sizing methods to periodically adjust the number of servers in a cluster. The proposed methods aim at achieving a fair energy-delay trade-off for server clusters. The proposed approximation formulas, that is, (22) and (29), are simple closed-form expressions, which may be implemented in a network switch for real-time processing.
The simulation results show that the schemes using the proposed approximation formulas reduce energy consumption considerably while maintaining comparable service performance when service times fluctuate mildly. However, the proposed methods tend to underestimate the number of required servers for service processes with high variability, that is, with tail indexes between 0.3 and 1.3. A similar observation has also been documented in [5].
The relative measurements of M/M/m and M/G/m are almost indistinguishable, except that M/G/m is very slightly better than M/M/m for service processes with high variability. Although Internet workload characterization has found that service times follow a heavy-tailed distribution, periodically resizing the cluster can alleviate the problem of long jobs blocking short jobs in the waiting queue. Once a deactivation command is issued to a busy node, the node becomes a pending-off node that has to complete its unfinished job before turning itself off. If a long job is being handled by such a pending-off node, the queued jobs can be quickly assigned to nodes newly activated in the next period without waiting for that long job to finish. Therefore, sizing the cluster based on the M/M/m model or the M/G/m model makes little difference. Based on the simulation results, the simpler M/M/m model may be adequate and preferable for sizing clusters that adopt queueing-based VOVO schemes.
Server clusters are widely adopted in cloud data centers [28]. To support various kinds of services, including user-end applications and back-end activities, heterogeneity has become common in multifunctional cloud data centers. It is common for a data center to have different groups of servers with different computational capacities. Since the basic computing units grouped for a specific function usually consist of the same type of machines, the proposed approach is built on the assumption of homogeneous nodes. Therefore, the proposed approach is particularly pertinent to the computing units forming the underlying base of cloud data centers. Nevertheless, extending this work to heterogeneous environments is an immediate future direction of this study. The multi-tier system is an obvious case of server heterogeneity and is widely adopted in many enterprise systems. Many approaches have been proposed to address the applicability of queueing models to multi-tier systems, such as Multi-tier Internet Applications [25], Heterogeneous Multi-tier Web Clusters [38], Layered Queueing Networks (LQN) [39, 40], and Power-Saving Server Farms [41]. Job dispatching [42, 43] and scheduling [44, 45] also arise as important issues in heterogeneous environments. Considering these related developments and integrating the proposed approach with existing work may be a practical way to extend this study to heterogeneous environments.
Symbols and Definitions
:  The job arrival rate of a queueing system 
:  The mean service rate of a server in a queueing system 
:  The mean service time of a server in a queueing system 
:  The standard deviation of the service times in a queueing system 
:  The number of servers in a queueing system 
:  The traffic intensity, 
:  A system state, which is the same as the number of jobs in the system 
:  The probability of a state 
:  The coefficient of variation of service times in a queueing system, 
:  The number of jobs in system 
:  The number of jobs in system 
:  The number of busy servers in system 
:  The mean value of 
:  The mean value of 
:  The mean value of 
:  The response time of a job in system 
:  The response time of a job in system 
:  The response time of a job in system 
:  The response time of a job in system 
:  The waiting time of a job in system 
:  The waiting time of a job in system 
:  The mean value of 
:  The mean value of 
:  The mean value of 
:  The mean value of 
:  The mean value of 
:  The mean value of. 
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This study is funded by the Ministry of Science and Technology (Taiwan) under Grant no. NSC 101-2632-E-036-001-MY3 for the project A Study of Applications and Examinations on the Smart Meter Enabled Electricity Grid.