Research Article | Open Access
Zhihui Du, Rong Ge, Victor W. Lee, Richard Vuduc, David A. Bader, Ligang He, "Modeling the Power Variability of Core Speed Scaling on Homogeneous Multicore Systems", Scientific Programming, vol. 2017, Article ID 8686971, 13 pages, 2017. https://doi.org/10.1155/2017/8686971
Modeling the Power Variability of Core Speed Scaling on Homogeneous Multicore Systems
We describe a family of power models that can capture the nonuniform power effects of speed scaling among homogeneous cores on multicore processors. These models depart from traditional ones, which assume that individual cores contribute to power consumption as independent entities. In our approach, we remove this independence assumption and employ statistical variables of core speed (average speed and the dispersion of the core speeds) to capture the comprehensive heterogeneous impact of subtle interactions among the underlying hardware. We systematically explore the model family, deriving basic and refined models that give progressively better fits, and analyze them in detail. The proposed methodology provides an easy way to build power models to reflect the realistic workings of current multicore processors more accurately. Moreover, unlike the existing lower-level power models that require knowledge of microarchitectural details of the CPU cores and the last level cache to capture core interdependency, ours are easier to use and scalable to emerging and future multicore architectures with more cores. These attributes make the models particularly useful to system users or algorithm designers who need a quick way to estimate power consumption. We evaluate the family of models on contemporary x86 multicore processors using the SPEC2006 benchmarks. Our best model yields an average predicted error as low as 5%.
We consider the problem of how to model the power of a modern multicore processor as a function of the speed of its cores. On its surface, the problem seems simple as it is natural to assume that cores are independent of one another: the classic power model posits that the total processor power is the sum over that of independent cores. However, we find that in practice such modeling methods do not adequately capture what happens on real multicore systems in which there may be interactions among cores.
By way of motivation, let us consider the following classic model and then compare what it predicts to what happens in an actual experiment. In the classic single-core model, the power, $P$, consumed by a core is expressed as the following function of its operating frequency ("speed"), $f$:
$$P = c f^{\alpha}, \tag{1}$$
where $c$ is a workload-dependent factor and $\alpha$ is a hardware technology-dependent parameter. For simplicity, (1) omits a term for constant (or static) power, but our argument and methods hold with or without the term. This model appears in a variety of papers on the power-aware scheduling problem [1, 2], in particular when the system provides dynamic voltage and frequency scaling (DVFS) [3, 4].
A widely adopted approach to multicore power modeling extends the method for single-core power modeling. It sums the power consumed by individual cores [5–8]. As a result, the power consumption of an $n$-core processor, denoted by $P_{\text{total}}$, is calculated by
$$P_{\text{total}} = \sum_{i=1}^{n} c f_i^{\alpha}, \tag{2}$$
where $f_i$ is the frequency of core $i$. Critically, this approach assumes independence: the power of an individual core does not depend on what is happening on other cores on the same chip. Consider an environment consisting of multiple homogeneous cores, where all cores execute the same workload. In this setting, one may derive two predictions from (2). First, all cores contribute to the total power consumption independently. Second, scaling any core from one speed to another causes the same change in the total power consumption, regardless of the speed of the other cores. In other words, the cores have uniform power effects with speed scaling. For example, suppose a multicore processor has 16 cores with their frequencies set as $f_1, f_2, \ldots, f_{16}$. If $f_i = f_j$, then changing the frequency of core $i$ from $f_i$ to $f'$ causes a total power change of $c\,(f'^{\alpha} - f_i^{\alpha})$, which will have the same value as if we change the frequency of core $j$ from $f_j$ to $f'$.
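The uniform-effect prediction of the classic summation model can be checked numerically. The following is a minimal Python sketch, not part of the original study: the function names, the values of the workload factor and exponent (1.0 and 3.0), and the 16-core frequency setting are all hypothetical choices made for illustration.

```python
def classic_total_power(freqs, c=1.0, alpha=3.0):
    """Classic model: total power is the sum of c * f**alpha over all cores.

    c and alpha are illustrative placeholders, not fitted values.
    """
    return sum(c * f ** alpha for f in freqs)


def power_change(freqs, core, new_freq, c=1.0, alpha=3.0):
    """Predicted change in total power when a single core is rescaled."""
    scaled = list(freqs)
    scaled[core] = new_freq
    return classic_total_power(scaled, c, alpha) - classic_total_power(freqs, c, alpha)


# Hypothetical 16-core setting in GHz: cores 0 and 1 run at the same speed.
freqs = [1.6, 1.6] + [2.0] * 7 + [2.4] * 7

# Scaling either core to 2.8 GHz gives the same predicted change,
# independent of what the other 15 cores are doing.
delta_core0 = power_change(freqs, 0, 2.8)
delta_core1 = power_change(freqs, 1, 2.8)
```

Under this model the two deltas agree up to floating-point rounding, which is exactly the uniformity that the measurements discussed next call into question.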
However, the observations made in our experiments contradict these predictions. Figure 1 shows how the total processor power varies with a sequence of frequency scaling on a representative homogeneous multicore processor. In our experiments, all cores execute the same workload. The experimental results may be summarized as follows.
(i) The effect on power from speed-scaling a core depends on the states of the other cores. The resulting change in total power depends on whether the scaling updates the maximum speed among the cores. This observation contradicts the first prediction derived from (2).
(ii) The scaling that updates the maximum speed among the cores leads to a significantly larger change in total power than others. That is, the same increase in speed among the cores may have nonuniform power effects. This observation contradicts the second prediction derived from (2).
Thus, we may conclude that power models should account for interdependency and variability among the cores to estimate the power consumption of a multicore processor more accurately. Unfortunately, only a few studies [9–12] have investigated this issue. In general, these studies decompose a processor into its architectural components and use performance counters to infer the power consumption of each component. The effect of core interdependency on power consumption is explicitly captured through shared resources and differentiated behaviors of cores. Because they rely on hardware performance events, the models are detailed and complex. Furthermore, they have only been developed for dual- or quad-core processors. This approach is problematic when applied to emerging and future processors that may have eight or more cores.
Multicore processors that integrate a dozen or more DVFS-capable cores are commonplace today and manycore processors are pervasive. The goal of this study is to propose a family of practical power models that are accurate and easy to use and, at the same time, can be scaled to emerging and future multicore technologies. Our power models use two statistical parameters of the core speeds: the average speed and the dispersion of speeds. The former captures the holistic impact of multicore speeds, while the latter captures the core dependencies. The evaluation shows that, by reflecting the interdependence among cores, our models are more accurate than the traditional models while maintaining a similar level of simplicity. Our models are at the system level and eliminate the need to model individual architectural components with hardware performance events.
We explore this family of models systematically, to show how one can "derive" a suitable power model for multicore processors by experiments. We carry out the experiments using SPEC2006 on contemporary multicore processors and ultimately obtain a "basic power model" with an average relative error of 3% (in absolute value) for most benchmarks. These results help bolster the practical case for using our approach. For those applications in which the basic power model is not as accurate, we find that an improved piecewise model, which partitions the maximum frequency among the cores into a small number of segments, best expresses the overall power consumption of a multicore processor.
We evaluated our approach systematically on current generations of Intel and AMD processors. To instantiate the model for a given application and processor, one needs to only run the applications on the processor a few times, each with a different setting of core speeds. Once fitted, the power models can be used to predict the power consumption at any settings of core speeds. Further, if in the future the processor architectures evolve, the proposed family of models can still be applied, since the models take a general form with the statistical values of core speeds as input. In principle, one needs to only rerun the designed experiments to determine the new values of the coefficients in the model.
The model properties and results presented in this paper may enable future researchers to use more appropriate analytical frameworks to tackle a variety of power- and energy-aware algorithms and application design problems, including both classical scheduling algorithms under DVFS and emerging scheduling problems such as the problem of how to assign work to cores and set core speeds to satisfy a power bound.
The main contributions of this work are as follows.
(i) The presented family of models accurately captures the nonuniform power effect of frequency scaling on multicore processors. Such models are much needed for power-aware, multicore-based HPC systems.
(ii) By using only a couple of high-level variables, the models are easy to use and can be applied to emerging and future processors with more cores.
(iii) The models are the first to use statistical measurements as model variables, in contrast to the commonly adopted complex approach that models individual cores and other microarchitectural components with hardware performance events.
(iv) The models in the family have different forms with different numbers of variables. Users are free to choose the one that best suits their needs, for example, by balancing accuracy against complexity.
2. A Family of Multicore Power Models
The discussions of Figure 1 suggest that it may not be correct to model the power consumption of a multicore processor by modeling the power consumed by each individual core and then adding them together. Therefore, we propose a family of new models for estimating the power consumption of multicore processors. These models use statistical measures of core speeds, such as means and dispersions, as model variables.
Note that we focus on homogeneous multicore processors. Such an environment is common in parallel computing programmed by MPI and OpenMP, which are the dominant parallel programming paradigms for solving scientific and engineering problems. We leave the research on heterogeneous architectures to our future work.
2.1. The Model Family
The general form of the model family is as follows. Let $\bar{f}$ denote the average frequency of the cores in a multicore processor and $D$ denote the dispersion of speeds among the cores. Below, we will consider several possible forms of $D$. Assuming that power consumption correlates with $\bar{f}$ and $D$, we posit a general model in which power is a function of $\bar{f}$ and $D$, with coefficients and exponents as the parameters to be estimated. In this general model, the average frequency is simply calculated by $\bar{f} = \frac{1}{n}\sum_{i=1}^{n} f_i$, where $n$ is the number of cores and $f_1, \ldots, f_n$ are their frequencies.
For $D$, a natural choice is the standard deviation among frequencies, denoted by $\sigma$. However, we also consider several more possibilities. Let $f_{\max}$ denote the maximum frequency setting of any core and $f_{\min}$ be the minimum frequency. Thus, in addition to $\sigma$, we consider the following three measures of speed dispersion:
(i) the difference between the maximum frequency and the average frequency, namely, $f_{\max} - \bar{f}$;
(ii) the difference between the average frequency and the minimum frequency, namely, $\bar{f} - f_{\min}$;
(iii) the difference between the maximum and minimum frequency, namely, $f_{\max} - f_{\min}$.
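To make the statistical variables concrete, the following minimal Python sketch computes the average frequency and the four dispersion measures described above for one frequency setting. The function name and dictionary keys are hypothetical, and the use of the population (rather than sample) standard deviation is an assumption, since the text does not specify which is meant.

```python
from statistics import fmean, pstdev


def speed_statistics(freqs):
    """Return the average speed and four dispersion measures of a setting.

    freqs: per-core frequencies (e.g., in GHz) of one multicore processor.
    """
    avg = fmean(freqs)
    return {
        "avg": avg,
        # Assumption: population standard deviation over the cores.
        "std": pstdev(freqs),
        "max_minus_avg": max(freqs) - avg,
        "avg_minus_min": avg - min(freqs),
        "max_minus_min": max(freqs) - min(freqs),
    }


# Hypothetical quad-core setting with an average of 1.6 GHz.
stats = speed_statistics([1.2, 1.6, 1.6, 2.0])
```

Whatever the number of cores, a setting is summarized by the average and one of these dispersion values, which is what keeps the model inputs small as core counts grow.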
In the proposed model family, instead of considering many individual core speeds, we only employ two statistical parameters to capture the typical speed distribution of all cores in a processor.
2.2. Candidate Models
Beyond the models of the family, we consider two additional classic power models for comparison. One assumes a polynomial relation between power and the frequency of each individual core, and the other assumes a linear relationship. Note that fitting the models that contain exponents requires nonlinear regression methods, whereas simple linear regression is sufficient to fit the models that are linear in their variables.
2.3. Building the Power Models
The purpose of this work is to propose a methodology for system users or algorithm designers to build accurate and simple power models for current and even future multicore processors. In this subsection, we present the methodology for building our power models.
The following procedure is used to determine which of the candidate models in Section 2.2 can best represent the power consumption of multicore processors.
In general, the procedure involves designing different frequency settings, running benchmark application(s) on the given modern multicore processor, and recording the power consumption and the corresponding frequency settings. More details of the procedure are described below.
2.3.1. Frequency Settings
We performed an exhaustive (or nearly exhaustive) test in training to understand the relationship between frequency and power. In model setup runs, however, we only need to run the experiments with a small number of frequency settings chosen by the following frequency sampling method, whose principle is that a small number of frequencies can still represent the full spectrum of possible frequencies. If a multicore processor has $n$ homogeneous cores and each core can be set at $m$ different frequencies independently, the total number of frequency settings is $m^n$, which grows exponentially with the number of cores. For a large $m$, that is, when a core has many different frequency levels, we select the minimum and maximum frequencies and two or three additional frequencies in between to cover the whole speed range. For a large $n$, that is, when there are many cores in a multicore processor, we divide the cores into smaller groups, and all cores in a group are configured with the same frequency setting.
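The sampling idea can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the even spacing used to pick the intermediate levels, the frequency levels, and the group size are assumptions, not the procedure used in the experiments.

```python
from itertools import product


def sample_levels(levels, extra=2):
    """Keep the minimum, the maximum, and `extra` roughly evenly spaced levels."""
    levels = sorted(levels)
    if len(levels) <= extra + 2:
        return levels
    step = (len(levels) - 1) / (extra + 1)
    picked = {round(i * step) for i in range(extra + 2)}
    return [levels[i] for i in sorted(picked)]


def grouped_settings(levels, n_cores, group_size):
    """Yield per-core settings in which all cores of a group share one frequency."""
    n_groups = -(-n_cores // group_size)  # ceiling division
    for combo in product(levels, repeat=n_groups):
        setting = []
        for freq in combo:
            setting.extend([freq] * group_size)
        yield tuple(setting[:n_cores])


# Hypothetical processor: 16 cores, 8 frequency levels (GHz).
levels = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6]
full_count = len(levels) ** 16            # every core set independently
sampled = sample_levels(levels, extra=2)  # min, max, and two levels in between
settings = list(grouped_settings(sampled, n_cores=16, group_size=4))
```

With these hypothetical numbers, sampling four levels and grouping the cores four at a time reduces the experiment count from 8^16 to 4^4 = 256 settings.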
2.3.2. Monitoring Power Consumption
The tool for monitoring power consumption in the experiments can be a hardware power meter or a software power measurement package. Examples of software power measurement packages are Intel's Running Average Power Limit (RAPL) interface and likwid-powermeter. The accuracy of the RAPL-based power measurement tool is adequate for high-level power prediction.
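As an illustration of software-based measurement, RAPL exposes cumulative energy counters, and average power follows from the difference of two readings divided by the elapsed time. The sketch below assumes the Linux powercap sysfs files (for example `/sys/class/powercap/intel-rapl:0/energy_uj` and `max_energy_range_uj`), which is one common way to read RAPL; the text does not state how the counters were read, and the helper names are hypothetical.

```python
def read_counter_uj(path):
    """Read one integer value (microjoules) from a powercap sysfs file.

    Assumption: the platform exposes RAPL through the Linux powercap
    interface, e.g. /sys/class/powercap/intel-rapl:0/energy_uj.
    """
    with open(path) as handle:
        return int(handle.read().strip())


def average_power_watts(energy_start_uj, energy_end_uj, elapsed_s, max_range_uj):
    """Average power between two cumulative energy readings.

    Handles a single wraparound of the counter, which restarts from zero
    after reaching max_range_uj.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_uj = energy_end_uj - energy_start_uj
    if delta_uj < 0:  # the counter wrapped around once
        delta_uj += max_range_uj
    return delta_uj / 1e6 / elapsed_s


# Hypothetical readings taken 2 s apart: 30 J consumed, i.e., 15 W on average.
watts = average_power_watts(1_000_000, 31_000_000, 2.0, 262_143_328_850)
```

The sampling interval should be short enough that the counter wraps at most once between readings; otherwise the computed power is an underestimate.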
2.3.3. Regression Analysis
Once the data are measured, we fit all candidate models to them using standard statistical parameter estimation procedures. Fits are specific to a processor, and we report on fit quality both for individual benchmarks and for mixed workloads (see Sections 4.2 and 4.3). The models that contain exponents require nonlinear regression methods, because both their coefficients and the values of their exponents must be determined. The linear models may be fitted by standard linear regression procedures, because only the values of their coefficients need to be determined.
2.3.4. Models Screening
Finally, after fitting each candidate model, we analyze the parameter values and the fitting quality of each model and identify which model best captures the relation between power consumption and core frequencies. Note that we only need to run an application on a multicore processor with a limited number of frequency settings to obtain the experimental data. Once we have established the power model, we can use the model to predict the power consumption under any frequency setting of the multicore processor.
3. Model Analysis and Refinement
In this section, we propose the basic model based on the method in the last section. The analysis shows that the basic model can be used for different optimization purposes. We also show the weakness of the basic model for some cases and how we improve it with the refined model.
3.1. The “Basic Model” and What It Implies
We have conducted extensive experiments on x86 multicore processors (see the experimental results in Section 4). After comparing the results obtained by our candidate models with those by the classic multicore power model, we find that the linear model, combined with a single dispersion measure $D$, typically exhibits the best fit. Hereafter, we will refer to it as the basic model; that is,
$$P = a\bar{f} + bD + c, \tag{6}$$
where $\bar{f}$ is the average frequency of the cores and $a$, $b$, and $c$ are the coefficients to be fitted.
Observe that the basic model is linear in $\bar{f}$ and $D$. Although dynamic power is generally nonlinear in frequency, the relation we observed on current processors appears to be approximately linear.
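Because the basic model is linear in its two variables, ordinary least squares suffices to fit it. The following minimal Python sketch assumes the three-coefficient form (power = a × average frequency + b × dispersion + c) suggested by the text; the function names and the synthetic, noise-free data are hypothetical, and a statistics package would normally replace the hand-written solver.

```python
def solve_3x3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * 3
    for r in (2, 1, 0):  # back substitution
        tail = sum(m[r][k] * x[k] for k in range(r + 1, 3))
        x[r] = (m[r][3] - tail) / m[r][r]
    return x


def fit_basic_model(samples):
    """Least-squares fit of power = a*avg + b*disp + c.

    samples: list of (avg_freq, dispersion, measured_power) tuples.
    Returns the coefficients (a, b, c) via the normal equations.
    """
    rows = [(avg, disp, 1.0) for avg, disp, _ in samples]
    powers = [power for _, _, power in samples]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * p for r, p in zip(rows, powers)) for i in range(3)]
    return tuple(solve_3x3(xtx, xty))


def predict_basic(coeffs, avg_freq, dispersion):
    """Predict power for any frequency setting from fitted coefficients."""
    a, b, c = coeffs
    return a * avg_freq + b * dispersion + c


# Synthetic measurements generated from hypothetical coefficients (10, 5, 20).
points = [(1.6, 0.0), (2.0, 0.4), (2.4, 0.2), (2.8, 0.6), (3.2, 0.0), (2.0, 0.0)]
samples = [(f, d, 10.0 * f + 5.0 * d + 20.0) for f, d in points]
coeffs = fit_basic_model(samples)
```

Once the coefficients are fitted from a handful of runs, `predict_basic` can be evaluated for any other setting, and a positive dispersion coefficient means that, at equal average frequency, the more dispersed setting is predicted to draw more power.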
The basic model suggests that two different frequency settings may deliver the same throughput or performance for a given application but cause significantly different power consumption. For example, consider the following two different frequency distributions on four cores, which both have an average of 1.6 GHz: , , and . These have values of 0 GHz and 0.4 GHz, respectively. The classic multicore power model such as will predict that the same amount of power will be consumed under these two frequency distributions. However, using (6), we can predict that the distribution with greater values of will cause more power consumption.
Among all frequency distributions, those with the minimum define a theoretical Pareto frontier and will consequently consume the least amount of power. For example, consider Figure 2. This figure shows the measured power of benchmark 410.bwaves running on an Intel Core i7-2600K (a quad-core Sandy Bridge processor). The red line is the Pareto frontier obtained by the basic model. Each of the blue dots is the measured power when the application is running with a particular average frequency. It can be observed from this figure that, with the same average frequency, different frequency distributions make a huge difference with power consumption. In this figure, the optimal frequency distribution saves up to of the power, compared with other frequency distributions. For a given power budget, the optimal frequency distribution can outperform naïve ones by up to . Assume “” in this figure corresponds to an initial frequency distribution with an average frequency and power consumption. Then, the basic model indicates that we can save power by following the vertical line down to , or improve performance by following the horizontal line rightward to , or balance both improvements by reaching point .
3.2. Model Refinements
The basic model can be refined in certain contexts. For some of the benchmarks, such as 458.sjeng of SPEC2006 (see the experimental results in Table 2), the prediction of the basic model is not very accurate. Digging deeper, Figure 3 plots the power consumption of 458.sjeng as a function of the average frequency $\bar{f}$ and the dispersion $D$; observe that the power surface consists of multiple piecewise planes. Similarly, the contour lines of the measured power surface, shown in Figure 3(b), reveal that the distance between the parallel contour lines is uneven. Again, this observation confirms the piecewise planar nature of the power surface.
(a) The measured power surface is piecewise planar in $\bar{f}$ and $D$
(b) The contour lines of the measured power surface in Figure 3(a) are parallel lines, but the distances between them are not equal
These observations further suggest that we might be able to extend our basic model to be piecewise linear. More formally, let be the interval of all possible frequencies. is the low bound of possible frequencies and is the up bound of possible frequencies. Consider a -way partition of this interval into segments (each segment corresponds to a part of our refined piecewise model) such thatThen, a piecewise linear power model can take the following form:where , , and are the coefficients of frequency segment . For the SPEC2006 benchmarks, we have observed that is sufficient to capture any piecewise linear behavior. The coefficient indicates the line between different pieces when they are projected onto the - plane.
In practice, it is not straightforward to determine the exact values of and in (8). The motivating example in Figure 1 shows that a significant power change occurs when the maximum speed among the cores changes. So, we can replace with to simplify the process of determining the values of and . Experimental results show that this is an effective way to establish the improved piecewise power model.
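A piecewise model of this kind can be sketched as a lookup followed by a linear evaluation. The Python sketch below is an illustration under stated assumptions: each segment is taken to be linear in the average frequency and the dispersion with its own three coefficients, the segment is selected by the maximum core frequency as suggested above, and the breakpoints and coefficient values are hypothetical rather than fitted.

```python
from bisect import bisect_right


def piecewise_power(avg_freq, dispersion, max_freq, breakpoints, coeffs):
    """Evaluate a piecewise linear power model.

    breakpoints: sorted interior boundaries on the maximum core frequency.
    coeffs: one (a, b, c) tuple per segment, so len(breakpoints) + 1 tuples.
    A maximum frequency equal to a boundary falls into the upper segment.
    """
    if len(coeffs) != len(breakpoints) + 1:
        raise ValueError("need exactly one coefficient tuple per segment")
    a, b, c = coeffs[bisect_right(breakpoints, max_freq)]
    return a * avg_freq + b * dispersion + c


# Hypothetical three-segment model on a 1.6 to 3.4 GHz frequency range.
breakpoints = [2.0, 2.8]
coeffs = [(8.0, 2.0, 20.0), (10.0, 4.0, 18.0), (14.0, 6.0, 12.0)]

low = piecewise_power(1.6, 0.0, 1.6, breakpoints, coeffs)
high = piecewise_power(2.4, 0.6, 3.0, breakpoints, coeffs)
```

Fitting then amounts to splitting the measurements by segment and running an ordinary linear regression within each one.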
4. Model Evaluation
We employ 28 benchmarks of SPEC2006 to evaluate the proposed basic model on several different modern multicore processors. The extensive experimental results show that our basic model is accurate for most cases. The refined model can further improve the accuracy of the basic model for some special workloads.
4.1. Experimental Setup
We chose the computation-intensive benchmarks from SPEC2006. The SPEC benchmarks are used because they represent general-purpose computing. In the future, we plan to include a wider range of workloads whose power is sensitive to speed. Of the 29 benchmarks in this suite, we omitted 400.perlbench due to its long execution time. In the experiments, we assigned a benchmark application to run on each core. We considered two assignments: uniform assignment, where the same benchmark is assigned to all cores, and mixed assignment, where different benchmarks are assigned to run on different cores.
4.1.2. Multicore Processors
We carried out our experiments on different generations of Intel x86 microarchitectures and one AMD Opteron architecture. In Table 1, denotes the number of processors and denotes the number of cores on each processor.
4.1.3. Speed Scaling and Core Affinity
We used the Linux user-level cpufreq interface to set the frequencies of the cores. (To set core $i$ to the frequency Fre, we use the cpufreq interface on the following command line: echo Fre > /sys/devices/system/cpu/cpu$i$/cpufreq/scaling_setspeed.) We used the Linux command taskset to bind a process to a physical core. (To bind the launched process, BenchName, to core $i$ and run it $r$ times, the following command can be used: taskset -c $i$ runspec --config=My.cfg --action onlyrun --size=test --noreportable --iterations=$r$ BenchName.)
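When many frequency settings must be tested, the two commands above are usually generated by a script. The following minimal Python sketch only builds the command strings and does not execute them; the helper names are hypothetical, and it assumes that frequencies are given in kHz and that the userspace cpufreq governor is active, which is what the `scaling_setspeed` file requires on Linux.

```python
import shlex


def setspeed_path(core):
    """sysfs file that sets the speed of one core (userspace governor)."""
    return f"/sys/devices/system/cpu/cpu{core}/cpufreq/scaling_setspeed"


def setspeed_command(core, freq_khz):
    """Shell command that sets a core to the given frequency in kHz."""
    return f"echo {int(freq_khz)} > {setspeed_path(core)}"


def runspec_command(core, iterations, bench, config="My.cfg"):
    """Shell command that pins a SPEC2006 run to one core with taskset."""
    args = [
        "taskset", "-c", str(core),
        "runspec", f"--config={config}", "--action", "onlyrun",
        "--size=test", "--noreportable", f"--iterations={iterations}",
        bench,
    ]
    return shlex.join(args)


# Hypothetical example: core 2 at 1.6 GHz, three iterations of 410.bwaves.
freq_cmd = setspeed_command(2, 1_600_000)
run_cmd = runspec_command(2, 3, "410.bwaves")
```

Writing to `scaling_setspeed` typically requires root privileges, so the generated frequency commands would be run from a privileged shell.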
4.1.4. Power Measurement
If a multicore system has its own power monitoring tools, we use them directly. For all quad-core Intel processors in Table 1, a clamp ammeter was used to measure the power. For the AMD Opteron processor, the PowerPack tool was installed to measure the power. For the platforms that do not provide a power measurement method, such as the machines with dual octa-core Sandy Bridge processors and dual 14-core Haswell processors, we used Intel's Running Average Power Limit (RAPL) interface to obtain the power (PKG Power).
4.2. Model Accuracy
Table 2 shows the results of different candidate models for the benchmark 410.bwaves, on the quad-core Ivy Bridge platform. Note that we recorded and analyzed a full set of experimental data covering all benchmarks and platforms and the results for other benchmarks show similar trends.
We assess model accuracy using a variety of criteria. In Table 2, “” and “” refer to the fraction of predictions whose relative error, , is no more than 5% and 3%, respectively. The larger the value is, the more accurate the power model is. “Max.%” and “Avg.%” are the maximum and average values of all relative errors. “Max. err.” and “Min. err.” mean the maximum and minimum values of all errors. “Avg. abs. err.” means the average of the absolute value of residual. The smaller the value is, the more accurate the power model is. According to Table 2, models through all achieve very high prediction accuracy with variables and . But model is the simplest one.
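The accuracy criteria above can be computed directly from paired measured and predicted power values. The following Python sketch is illustrative: the dictionary keys are hypothetical names, and it assumes that an error is the predicted value minus the measured value and that the relative error is the absolute error divided by the measured value, neither of which is spelled out in the text.

```python
def accuracy_metrics(measured, predicted):
    """Summarize prediction accuracy for paired power values (e.g., in watts).

    Assumptions: error = predicted - measured, and
    relative error = |error| / measured.
    """
    errors = [p - m for m, p in zip(measured, predicted)]
    rel = [abs(e) / m for e, m in zip(errors, measured)]
    n = len(errors)
    return {
        "within_5pct": sum(r <= 0.05 for r in rel) / n,  # fraction of predictions
        "within_3pct": sum(r <= 0.03 for r in rel) / n,
        "max_rel_pct": 100.0 * max(rel),
        "avg_rel_pct": 100.0 * sum(rel) / n,
        "max_err": max(errors),
        "min_err": min(errors),
        "avg_abs_err": sum(abs(e) for e in errors) / n,
    }


# Hypothetical measurements and predictions.
measured = [100.0, 50.0, 80.0, 40.0]
predicted = [102.0, 47.0, 80.0, 41.6]
metrics = accuracy_metrics(measured, predicted)
```

Reporting both the fractions within a threshold and the maximum error is useful because a model can have a low average error while still missing badly on a few frequency settings.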
Furthermore, the experimental results show that the average relative error of is as low as 0.004% and replacing with any other dispersion variable leads to higher prediction errors.
We have also tested the effectiveness of the models using mixed workloads. We generated these mixed workloads using two methods: (i) using four different benchmarks and (ii) using two different benchmarks. (For instance, “two different benchmarks” on a quad-core processor means that one benchmark runs on two cores and another different benchmark runs on the remaining two cores.) The results are similar to those in Table 2. These results suggest that the effectiveness of is not just tied to a particular workload. Section 4.3 explores uniform versus mixed workloads in more detail.
The experimental results in Table 2 show that our basic model is accurate. Figure 4 shows the average relative error of the basic model for running the 28 SPEC2006 benchmarks on the seven multicore processors. As can be seen from the figure, the basic model achieves a very low average relative error (less than 2%) for most benchmarks on six of the seven multicore processors. On the remaining one, the Haswell-EP, the average relative error is somewhat higher but still below 5% for most benchmarks.
For the few benchmarks whose average relative errors are greater than 5% (but are all less than 13%), we will employ the refined piecewise model (see Section 3.2) to improve the prediction accuracy.
We compare the prediction accuracy of the basic model and the piecewise model in Table 3. Overall, the piecewise approach improves prediction accuracy. For example, the results of benchmark 458.sjeng show that the piecewise model reduces the maximum relative error to 0.3% from the original 50.4% of the basic model. They also show that average relative error decreases from 0.094 to 0.001 and the improvement is about 9x on average.