Modeling the Power Variability of Core Speed Scaling on Homogeneous Multicore Systems
We describe a family of power models that can capture the nonuniform power effects of speed scaling among homogeneous cores on multicore processors. These models depart from traditional ones, which assume that individual cores contribute to power consumption as independent entities. In our approach, we remove this independence assumption and employ statistical variables of core speed (the average speed and the dispersion of core speeds) to capture the comprehensive heterogeneous impact of subtle interactions among the underlying hardware. We systematically explore the model family, deriving basic and refined models that give progressively better fits, and analyze them in detail. The proposed methodology provides an easy way to build power models that reflect the realistic workings of current multicore processors more accurately. Moreover, unlike existing lower-level power models, which require knowledge of microarchitectural details of the CPU cores and the last-level cache to capture core interdependency, ours are easier to use and scale to emerging and future multicore architectures with more cores. These attributes make the models particularly useful to system users or algorithm designers who need a quick way to estimate power consumption. We evaluate the family of models on contemporary x86 multicore processors using the SPEC2006 benchmarks. Our best model yields an average prediction error as low as 5%.
1. Introduction

We consider the problem of how to model the power of a modern multicore processor as a function of the speed of its cores. On its surface, the problem seems simple, as it is natural to assume that the cores are independent of one another: the classic power model posits that the total processor power is the sum of the powers of independent cores. However, we find that in practice such modeling methods do not adequately capture what happens on real multicore systems, in which there may be interactions among cores.
By way of motivation, let us consider the following classic model and then compare what it predicts to what happens in an actual experiment. In the classic single-core model, the power, $P$, consumed by a core is expressed as the following function of its operating frequency ("speed"), $f$:
$$P = c f^{\alpha}, \tag{1}$$
where $c$ is a workload-dependent factor and $\alpha$ is a hardware technology-dependent parameter. For simplicity, (1) omits a term for constant (or static) power, but our argument and methods hold with or without that term. This model appears in a variety of papers on the power-aware scheduling problem [1, 2], in particular when the system provides dynamic voltage and frequency scaling (DVFS) [3, 4].
A widely adopted approach to multicore power modeling extends the method for single-core power modeling: it sums the power consumed by individual cores [5–8]. As a result, the power consumption of an $n$-core processor, denoted by $P_n$, is calculated by
$$P_n = \sum_{i=1}^{n} c_i f_i^{\alpha}. \tag{2}$$
Critically, this approach assumes independence: the power of an individual core does not depend on what is happening on other cores on the same chip. Consider an environment consisting of multiple homogeneous cores, where all cores execute the same workload. In this setting, one may derive two predictions from (2). First, all cores contribute to the total power consumption independently. Second, scaling any core from one speed to another causes the same change in the total power consumption, regardless of the speeds of the other cores. In other words, the cores have uniform power effects under speed scaling. For example, suppose a multicore processor has 16 cores with their frequencies set as $f_1, f_2, \ldots, f_{16}$. If $f_i = f_j$ for two cores $i$ and $j$, then changing the frequency of core $i$ from $f_i$ to $f_i + \Delta f$ causes a total power change of $\Delta P = c_i \bigl((f_i + \Delta f)^{\alpha} - f_i^{\alpha}\bigr)$, which will have the same value as if we change the frequency of core $j$ from $f_j$ to $f_j + \Delta f$.
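To make the uniformity prediction concrete, here is a minimal sketch of the classic additive model (2) in Python; the coefficient and exponent values are illustrative placeholders, not measured ones:

```python
# Hypothetical sketch of the classic additive model: total power is the sum
# of independent per-core terms c * f_i**alpha.  Coefficients are illustrative.

def classic_power(freqs, c=1.0, alpha=3.0):
    """Total power under the classic model: P_n = sum_i c * f_i**alpha."""
    return sum(c * f**alpha for f in freqs)

freqs = [1.2] * 16                       # 16 homogeneous cores at 1.2 GHz

# Scale core 0 from 1.2 to 2.0 GHz ...
a = freqs.copy()
a[0] = 2.0
delta_a = classic_power(a) - classic_power(freqs)

# ... then, from that state, scale core 1 from 1.2 to 2.0 GHz as well.
b = a.copy()
b[1] = 2.0
delta_b = classic_power(b) - classic_power(a)

# The model predicts identical power changes, regardless of the other cores.
assert abs(delta_a - delta_b) < 1e-12
```

Under this model, the change in total power from scaling one core never depends on the frequencies of the other cores, which is exactly the prediction our measurements contradict.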
However, the observations made in our experiments contradict these predictions. Figure 1 shows how the total processor power varies with a sequence of frequency scalings on a representative homogeneous multicore processor. In our experiments, all cores execute the same workload. The experimental results may be summarized as follows.
(i) The effect on power from speed-scaling a core depends on the states of the other cores. The resulting change in total power depends on whether the scaling updates the maximum speed among the cores. This observation contradicts the first prediction derived from (2).
(ii) A scaling that updates the maximum speed among the cores leads to a significantly larger change in total power than other scalings. That is, the same increase in speed may have nonuniform power effects across cores. This observation contradicts the second prediction derived from (2).
Thus, we may conclude that power models should account for interdependency and variability among the cores to estimate the power consumption of a multicore processor more accurately. Unfortunately, only a few studies [9–12] have investigated this issue. In general, these studies decompose a processor into its architectural components and use performance counters to infer the power consumption of each component. The effect of core interdependency on power consumption is explicitly captured through shared resources and differentiated behaviors of cores. Due to the use of hardware performance events, the models are detailed and complex. Furthermore, they have only been developed for dual- or quad-core processors. This approach is problematic when applied to emerging and future processors with eight or more cores.
Multicore processors that integrate a dozen or more DVFS-capable cores are commonplace today, and manycore processors are becoming pervasive. The goal of this study is to propose a family of practical power models that are accurate and easy to use and, at the same time, can scale to emerging and future multicore technologies. Our power models use two statistical parameters of the core speeds: the average speed and the dispersion of speeds. The former accurately captures the holistic impact of multicore speeds, while the latter captures the core dependencies. The evaluation shows that our models are not only more accurate than the traditional models, by reflecting the interdependence among cores, but also maintain a similar level of simplicity. Our models operate at the system level and eliminate the need to model individual architectural components with hardware performance events.
We explore this family of models systematically, to show how one can "derive" a suitable power model for multicore processors by experiment. We carry out the experiments using SPEC2006 on contemporary multicore processors and ultimately obtain a "basic power model" with an average relative error of 3% (in absolute value) for most benchmarks. These results help bolster the practical case for our approach. For those applications in which the basic power model is less accurate, we find that an improved piecewise model, which partitions the maximum frequency among the cores into a small number of segments, best expresses the overall power consumption of a multicore processor.
We evaluated our approach systematically on current generations of Intel and AMD processors. To instantiate the model for a given application and processor, one only needs to run the application on the processor a few times, each with a different setting of core speeds. Once fitted, the power models can be used to predict the power consumption at any setting of core speeds. Further, if processor architectures evolve in the future, the proposed family of models can still be applied, since the models take a general form with the statistical values of core speeds as input. In principle, one only needs to rerun the designed experiments to determine the new values of the coefficients in the model.
The model properties and results presented in this paper may enable future researchers to use more appropriate analytical frameworks to tackle a variety of power- and energy-aware algorithm and application design problems, including both classical scheduling algorithms under DVFS and emerging scheduling problems, such as how to assign work to cores and set core speeds to satisfy a power bound.
The main contributions of this work are as follows.
(i) The presented family of models accurately captures the nonuniform power effects of frequency scaling on multicore processors. Such models are much needed for power-aware, multicore-based HPC systems.
(ii) By using only a couple of high-level variables, the models are easy to use and can be applied to emerging and future processors with more cores.
(iii) The models are the first to use statistical measures as model variables, in contrast to the commonly adopted, more complex approach that models individual cores and other microarchitectural components with hardware performance events.
(iv) The models in the family have different forms with different numbers of variables. Users are at liberty to choose the one that best suits their needs, for example, balancing accuracy against complexity.
2. A Family of Multicore Power Models
The discussion of Figure 1 suggests that it may not be correct to model the power consumption of a multicore processor by modeling the power consumed by each individual core and then adding them together. Therefore, we propose a family of new models for estimating the power consumption of multicore processors. These models use statistical measures of core speeds, such as means and dispersions, as model variables.
Note that we focus on homogeneous multicore processors. Such an environment is common in parallel computing with MPI and OpenMP, the dominant parallel programming paradigms for solving scientific and engineering problems. We leave research on heterogeneous architectures to future work.
2.1. The Model Family
The general form of the model family is as follows. Let $\bar{f}$ denote the average frequency of the cores in a multicore processor and $f_d$ denote the dispersion of speeds among the cores. Below, we will consider several possible forms of $f_d$. Assuming that power consumption correlates with $\bar{f}$ and $f_d$, we posit a general model of the form
$$P = c_0 + c_1 \bar{f}^{\alpha} + c_2 f_d^{\beta}, \tag{3}$$
where $c_0$, $c_1$, $c_2$, $\alpha$, and $\beta$ are the parameters to be estimated. In this general model, the average frequency is simply calculated by $\bar{f} = \frac{1}{n} \sum_{i=1}^{n} f_i$, where $n$ is the number of cores and $f_1, \ldots, f_n$ are their frequencies.
For $f_d$, a natural choice is the standard deviation among the frequencies, denoted by $\sigma_f$. However, we also consider several more possibilities. Let $f_{\max}$ denote the maximum frequency setting of any core and $f_{\min}$ the minimum. Thus, in addition to $\sigma_f$, we consider the following three measures of speed dispersion:
(i) $d_1$: the difference between the maximum frequency and the average frequency, namely, $d_1 = f_{\max} - \bar{f}$.
(ii) $d_2$: the difference between the average frequency and the minimum frequency, namely, $d_2 = \bar{f} - f_{\min}$.
(iii) $d_3$: the difference between the maximum and minimum frequency, namely, $d_3 = f_{\max} - f_{\min}$.
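As a quick sketch, the average speed and the four dispersion measures can be computed directly from the list of per-core frequencies; the identifier names below are ours, chosen for illustration:

```python
import statistics

def dispersion_measures(freqs):
    """Return the four candidate dispersion measures of core speeds.

    sigma : population standard deviation of the frequencies
    d1    : f_max - f_avg
    d2    : f_avg - f_min
    d3    : f_max - f_min
    """
    f_avg = sum(freqs) / len(freqs)
    return {
        "sigma": statistics.pstdev(freqs),
        "d1": max(freqs) - f_avg,
        "d2": f_avg - min(freqs),
        "d3": max(freqs) - min(freqs),
    }

# Four cores with mean 1.6 GHz: sigma = 0.4, d1 = 0.4, d2 = 0.4, d3 = 0.8
m = dispersion_measures([1.2, 1.2, 2.0, 2.0])
```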
In the proposed model family, instead of considering many individual core speeds, we employ only two statistical parameters to capture the typical speed distribution of all cores in a processor.
2.2. Candidate Models
From the general form of (3), we consider several specific cases as candidate models for fitting, denoted as $M_1$ through $M_5$ below:
$$\begin{aligned}
M_1:\quad & P = c_0 + c_1 \bar{f}^{\alpha} + c_2 f_d^{\beta}, \\
M_2:\quad & P = c_0 + c_1 \bar{f}^{\alpha} + c_2 f_d, \\
M_3:\quad & P = c_0 + c_1 \bar{f} + c_2 f_d^{\beta}, \\
M_4:\quad & P = c_1 \bar{f}^{\alpha} + c_2 f_d^{\beta}, \\
M_5:\quad & P = c_0 + c_1 \bar{f} + c_2 f_d.
\end{aligned} \tag{4}$$
Note that $M_1$ is the same as (3). The other cases simplify the general form by fixing an exponent at 1 or omitting the constant term.
Beyond $M_1$ through $M_5$, we consider two additional classic power models for comparison. One assumes a polynomial relation between power and the frequency of each individual core ($M_6$), and the other assumes a linear relationship ($M_7$):
$$M_6:\quad P = c_0 + \sum_{i=1}^{n} c_i f_i^{\alpha}, \qquad M_7:\quad P = c_0 + \sum_{i=1}^{n} c_i f_i. \tag{5}$$
Note that fitting $M_1$, $M_2$, $M_3$, $M_4$, and $M_6$ requires nonlinear regression methods, whereas simple linear regression is sufficient to fit $M_5$ and $M_7$.
2.3. Building the Power Models
The purpose of this work is to propose a methodology by which system users or algorithm designers can build accurate yet simple power models for current and even future multicore processors. In this subsection, we present the methodology for building our power models.
The following procedure determines which of the candidate models in Section 2.2 best represents the power consumption of multicore processors.
In general, the procedure involves designing different frequency settings, running benchmark application(s) on the given modern multicore processor, and recording the power consumption and the corresponding frequency settings. More details of the procedure are described below.
2.3.1. Frequency Settings
We performed an exhaustive (or nearly exhaustive) test during training to understand the relationship between frequency and power. In model setup runs, however, we only need to run the experiments with a small number of frequency settings, using the following frequency sampling method; its principle is that a small number of frequencies can still represent the full spectrum of possible frequencies. If a multicore processor has $n$ homogeneous cores and each core can be set at $m$ different frequencies independently, the total number of frequency settings is $m^n$; for instance, even a quad-core processor with ten frequency levels admits $10^4$ settings. For a large $m$, that is, when a core has many different frequency levels, we select the minimum and maximum frequencies and 2–3 additional frequencies in between to cover the whole speed range. For a large $n$, that is, when there are many cores in a multicore processor, we divide the cores into smaller groups, and all cores in a group are configured with the same frequency.
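The sampling procedure above can be sketched as follows; the frequency levels and group count are illustrative, and the even-spacing rule is one reasonable choice rather than the paper's exact selection:

```python
import itertools

def sample_frequencies(levels, extra=3):
    """Pick f_min, f_max, and `extra` roughly evenly spaced levels in between."""
    levels = sorted(levels)
    if len(levels) <= extra + 2:
        return levels
    step = (len(levels) - 1) / (extra + 1)
    idx = {round(i * step) for i in range(extra + 2)}
    return [levels[i] for i in sorted(idx)]

def group_settings(sampled, n_groups):
    """Assign one sampled frequency per core group (all cores in a group share
    a frequency), instead of enumerating all m**n per-core settings."""
    return list(itertools.product(sampled, repeat=n_groups))

levels = [0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6]   # GHz
sampled = sample_frequencies(levels)      # a handful of the 10 levels
settings = group_settings(sampled, 2)     # 2 groups: 25 runs, not 10**n
```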
2.3.2. Monitoring Power Consumption
The tool for monitoring power consumption in the experiments can be a hardware power meter or a software power measurement package. Exemplar software packages include Intel's Running Average Power Limit (RAPL) interface and likwid-powermeter. The accuracy of RAPL-based power measurement is adequate for high-level power prediction.
2.3.3. Regression Analysis
Once the data are measured, we fit the candidate models, $M_1$ through $M_7$, to them using standard statistical parameter estimation procedures. Fits are specific to a processor, and we report on fit quality both for individual benchmarks and for mixed workloads (see Sections 4.2 and 4.3). Models $M_1$ through $M_4$ and $M_6$ require nonlinear regression methods, whereas $M_5$ and $M_7$ may be fitted by standard linear regression procedures. Additionally, models $M_1$ through $M_4$ and $M_6$ require determining both the coefficients (i.e., $c_0$–$c_2$) and the values of the exponents (i.e., $\alpha$ and $\beta$), whereas in $M_5$ and $M_7$, only the values of the coefficients (i.e., $c_0$, $c_1$, and $c_2$ in $M_5$ and $c_0$ and $c_i$ in $M_7$) need to be determined.
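As a sketch of the linear case, the candidate that is linear in the average frequency and the dispersion can be fitted with ordinary least squares; the data here are synthetic, generated from known coefficients purely to show the mechanics:

```python
import numpy as np

# Synthetic "measurements": power generated from known coefficients
# P = c0 + c1 * f_avg + c2 * sigma_f  (illustrative values, not measured).
rng = np.random.default_rng(0)
f_avg = rng.uniform(1.0, 3.0, size=40)     # average core frequency (GHz)
sigma = rng.uniform(0.0, 0.5, size=40)     # dispersion of core frequencies
power = 10.0 + 8.0 * f_avg + 5.0 * sigma   # "measured" power (watts)

# The model is linear in its coefficients, so ordinary least squares suffices.
X = np.column_stack([np.ones_like(f_avg), f_avg, sigma])
coef, *_ = np.linalg.lstsq(X, power, rcond=None)   # recovers [c0, c1, c2]

pred = X @ coef
rel_err = np.abs(pred - power) / power             # per-sample relative error
```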
2.3.4. Model Screening
Finally, after fitting each candidate model, we analyze the parameter values and the fitting quality of each model and identify which model best captures the relation between power consumption and core frequencies. Note that we only need to run an application on a multicore processor with a limited number of frequency settings to obtain the experimental data. Once we have established the power model, we can use the model to predict the power consumption under any frequency setting of the multicore processor.
3. Model Analysis and Refinement
In this section, we propose the basic model, derived using the method of the previous section. Our analysis shows that the basic model can be used for different optimization purposes. We also show where the basic model falls short and how we improve it with a refined model.
3.1. The “Basic Model” and What It Implies
We have conducted extensive experiments on x86 multicore processors (see the experimental results in Section 4). After comparing the results obtained by our candidate models with those of the classic multicore power model, we find that $M_5$, combined with the dispersion measure $\sigma_f$, typically exhibits the best fit. Hereafter, we will refer to $M_5$ as the basic model; that is,
$$P = c_0 + c_1 \bar{f} + c_2 \sigma_f. \tag{6}$$
Observe that the basic model is linear in $\bar{f}$ and $\sigma_f$. Although dynamic power is generally a nonlinear function of frequency, the relationship we observe on current processors is approximately linear; Section 5.2 discusses why.
The basic model suggests that two different frequency settings may deliver the same throughput or performance for a given application but cause significantly different power consumption. For example, consider the following two frequency distributions on four cores, both of which have an average of 1.6 GHz: $(1.6, 1.6, 1.6, 1.6)$ and $(1.2, 1.2, 2.0, 2.0)$ GHz. These have $\sigma_f$ values of 0 GHz and 0.4 GHz, respectively. A classic multicore power model such as $M_7$ will predict that the same amount of power is consumed under these two frequency distributions. However, using (6), we can predict that the distribution with the greater value of $\sigma_f$ will consume more power.
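Plugging these two distributions into the basic model makes the difference concrete; the coefficients below are illustrative placeholders rather than fitted values:

```python
import statistics

def basic_model(freqs, c0, c1, c2):
    """Basic model: P = c0 + c1 * f_avg + c2 * sigma_f.
    The coefficients are illustrative placeholders, not fitted values."""
    f_avg = sum(freqs) / len(freqs)
    return c0 + c1 * f_avg + c2 * statistics.pstdev(freqs)

uniform = [1.6, 1.6, 1.6, 1.6]    # sigma_f = 0.0 GHz
spread  = [1.2, 1.2, 2.0, 2.0]    # sigma_f = 0.4 GHz, same 1.6 GHz mean

p_uniform = basic_model(uniform, c0=10.0, c1=8.0, c2=5.0)   # 22.8 W
p_spread  = basic_model(spread,  c0=10.0, c1=8.0, c2=5.0)   # 24.8 W
# Same average speed, but the spread setting is predicted to draw more power.
```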
Among all frequency distributions with a given average frequency, those with the minimum $\sigma_f$ define a theoretical Pareto frontier and will consequently consume the least amount of power. For example, consider Figure 2, which shows the measured power of benchmark 410.bwaves running on an Intel Core i7-2600K (a quad-core Sandy Bridge processor). The red line is the Pareto frontier obtained by the basic model. Each blue dot is the measured power when the application runs with a particular average frequency. It can be observed from this figure that, at the same average frequency, different frequency distributions make a large difference in power consumption: the optimal frequency distribution saves a substantial fraction of the power compared with other distributions, and, for a given power budget, the optimal frequency distribution can substantially outperform naive ones. Assume a point $A$ in this figure corresponds to an initial frequency distribution with some average frequency and power consumption. Then, the basic model indicates that we can save power by following the vertical line down to the frontier, improve performance by following the horizontal line rightward to the frontier, or balance both improvements by reaching an intermediate point on the frontier.
3.2. Model Refinements
The basic model can be refined in certain contexts. For some benchmarks, such as 458.sjeng of SPEC2006 (see the experimental results in Table 2), the prediction of the basic model is not very accurate. Digging deeper, Figure 3 plots the power consumption of 458.sjeng as a function of $\bar{f}$ and $\sigma_f$; observe that the power surface consists of multiple piecewise planes. Similarly, the contour lines of the measured power surface, shown in Figure 3(b), reveal that the distance between the parallel contour lines is uneven. Again, this observation confirms the piecewise planar nature of the power surface.
(a) The measured power surface is piecewise planar in $\bar{f}$ and $\sigma_f$
(b) The contour lines of the measured power surface in Figure 3(a) are parallel lines, but the distances between them are not equal
These observations further suggest that we might be able to extend our basic model to be piecewise linear. More formally, let $[F_{\min}, F_{\max}]$ be the interval of all possible frequencies, where $F_{\min}$ is the lower bound and $F_{\max}$ is the upper bound of possible frequencies. Consider a $k$-way partition of this interval into segments (each segment corresponds to one piece of our refined model) such that
$$F_{\min} = \phi_0 < \phi_1 < \cdots < \phi_k = F_{\max}. \tag{7}$$
Then, a piecewise linear power model can take the following form:
$$P = c_0^{(j)} + c_1^{(j)} \bar{f} + c_2^{(j)} \sigma_f \quad \text{for frequencies in segment } (\phi_{j-1}, \phi_j], \tag{8}$$
where $c_0^{(j)}$, $c_1^{(j)}$, and $c_2^{(j)}$ are the coefficients of frequency segment $j$. For the SPEC2006 benchmarks, we have observed that $k = 2$ is sufficient to capture any piecewise linear behavior. The segment boundary indicates the line between different pieces when they are projected onto the $\bar{f}$–$\sigma_f$ plane.
In practice, it is not straightforward to determine the exact values of $k$ and the boundaries $\phi_j$ in (8). The motivating example in Figure 1 shows that a significant power change occurs when the maximum speed among the cores changes. So, we can use the maximum frequency $f_{\max}$ as the segmentation variable to simplify the process of determining the values of $k$ and $\phi_j$. Experimental results show that this is an effective way to establish the improved piecewise power model.
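A sketch of the refined model's prediction step, selecting per-segment coefficients by the maximum core frequency; the boundary and coefficient values are illustrative, not fitted:

```python
import statistics

def piecewise_model(freqs, boundaries, coefs):
    """Refined piecewise model: pick per-segment coefficients (c0, c1, c2)
    according to which segment the maximum core frequency falls in.
    Boundaries and coefficients below are illustrative, not fitted values."""
    f_avg = sum(freqs) / len(freqs)
    sigma = statistics.pstdev(freqs)
    j = sum(max(freqs) > b for b in boundaries)   # segment index of f_max
    c0, c1, c2 = coefs[j]
    return c0 + c1 * f_avg + c2 * sigma

# Two segments (k = 2) split at 2.0 GHz, as observed to suffice for SPEC2006.
boundaries = [2.0]
coefs = [(10.0, 8.0, 5.0),    # segment with f_max <= 2.0 GHz
         (14.0, 9.0, 6.0)]    # segment with f_max  > 2.0 GHz

low  = piecewise_model([1.6, 1.6, 1.8, 1.8], boundaries, coefs)
high = piecewise_model([1.6, 1.6, 1.8, 2.6], boundaries, coefs)
# Crossing the boundary switches coefficient sets, producing the power jump
# seen when a scaling updates the maximum speed among the cores.
```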
4. Model Evaluation
We employ 28 benchmarks of SPEC2006 to evaluate the proposed basic model on several modern multicore processors. Extensive experimental results show that the basic model is accurate in most cases, and the refined model further improves accuracy for some special workloads.
4.1. Experimental Setup
4.1.1. Benchmark Applications
We chose the computation-intensive benchmarks from SPEC2006. The SPEC suite is used because it represents general-purpose computing; in the future, we plan to include additional workloads whose power is sensitive to speed. Of the 29 benchmarks in this suite, we omitted 400.perlbench due to its long execution times. In the experiments, we assigned a benchmark application to run on each core. We considered two assignments: uniform assignment, where the same benchmark is assigned to all cores, and mixed assignment, where different benchmarks are assigned to different cores.
4.1.2. Multicore Processors
We carried out our experiments on different generations of Intel x86 microarchitectures and one AMD Opteron architecture. In Table 1, $p$ denotes the number of processors and $n$ denotes the number of cores on each processor.
4.1.3. Speed Scaling and Core Affinity
We used the Linux user-level cpufreq interface to set the frequencies of the cores. (To set core i to frequency Fre, we use the cpufreq interface with the following command line: echo Fre > /sys/devices/system/cpu/cpu<i>/cpufreq/scaling_setspeed.) We used the Linux command taskset to bind a process to a physical core. (To bind the launched benchmark, BenchName, to core i and run it N times, the following command can be used: taskset -c <i> runspec --config=My.cfg --action onlyrun --size=test --noreportable --iterations=<N> BenchName.)
4.1.4. Power Measurement
If a multicore system has power monitoring tools, we use them directly. For the quad-core Intel processors in Table 1, a clamp ammeter was used to measure the power. For the AMD Opteron processor, the PowerPack tool was installed to obtain the power. For the platforms that do not provide a power measurement method, such as the machine with dual octa-core Sandy Bridge processors and the one with dual 14-core Haswell processors, we used Intel's Running Average Power Limit (RAPL) interface to obtain the power (PKG power).
4.2. Model Accuracy
Table 2 shows the results of the different candidate models for the benchmark 410.bwaves on the quad-core Ivy Bridge platform. Note that we recorded and analyzed a full set of experimental data covering all benchmarks and platforms; the results for the other benchmarks show similar trends.
We assess model accuracy using a variety of criteria. In Table 2, "$\leq 5\%$" and "$\leq 3\%$" refer to the fraction of predictions whose relative error, $|P_{\text{predicted}} - P_{\text{measured}}| / P_{\text{measured}}$, is no more than 5% and 3%, respectively; the larger the value, the more accurate the power model. "Max.%" and "Avg.%" are the maximum and average values of all relative errors. "Max. err." and "Min. err." are the maximum and minimum values of all errors. "Avg. abs. err." is the average of the absolute values of the residuals; the smaller the value, the more accurate the power model. According to Table 2, models $M_1$ through $M_5$ all achieve very high prediction accuracy with variables $\bar{f}$ and $f_d$, but model $M_5$ is the simplest one.
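The accuracy criteria above can be computed as follows; the measured and predicted values here are hypothetical, used only to exercise the definitions:

```python
def error_metrics(measured, predicted):
    """Accuracy criteria of the evaluation (sketch with hypothetical data)."""
    rel = [abs(p - m) / m for m, p in zip(measured, predicted)]
    res = [p - m for m, p in zip(measured, predicted)]
    return {
        "within_5pct": sum(r <= 0.05 for r in rel) / len(rel),
        "within_3pct": sum(r <= 0.03 for r in rel) / len(rel),
        "max_rel": max(rel),                                  # "Max.%"
        "avg_rel": sum(rel) / len(rel),                       # "Avg.%"
        "max_err": max(res),                                  # "Max. err."
        "min_err": min(res),                                  # "Min. err."
        "avg_abs_err": sum(abs(e) for e in res) / len(res),   # "Avg. abs. err."
    }

# Hypothetical measured vs. predicted power values (watts).
m = error_metrics([20.0, 25.0, 30.0, 40.0], [20.4, 24.9, 31.8, 40.0])
```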
Furthermore, the experimental results show that the average relative error of $M_5$ is as low as 0.004%, and replacing $\sigma_f$ with any other dispersion variable leads to higher prediction errors.
We have also tested the effectiveness of the models using mixed workloads. We generated these mixed workloads using two methods: (i) using four different benchmarks and (ii) using two different benchmarks. (For instance, "two different benchmarks" on a quad-core processor means that one benchmark runs on two cores and another, different benchmark runs on the remaining two cores.) The results are similar to those in Table 2. These results suggest that the effectiveness of $M_5$ is not tied to a particular workload. Section 4.3 explores uniform versus mixed workloads in more detail.
The experimental results in Table 2 show that our basic model, $M_5$, is accurate. Figure 4 shows the average relative error of the basic model when running the 28 SPEC2006 benchmarks on the seven multicore processors. As can be seen from the figure, except for the Haswell-EP, the basic model achieves a very low average relative error (less than 2%) for most benchmarks on the other six multicore processors, while for the Haswell-EP, the average relative error is slightly higher (but still less than 5%) for most benchmarks.
For the few benchmarks whose average relative errors are greater than 5% (but are all less than 13%), we will employ the refined piecewise model (see Section 3.2) to improve the prediction accuracy.
We compare the prediction accuracy of the basic model and the piecewise model in Table 3. Overall, the piecewise approach improves prediction accuracy. For example, the results for benchmark 458.sjeng show that the piecewise model reduces the maximum relative error from the basic model's 50.4% to 0.3%. They also show that the average relative error decreases from 0.094 to 0.001, and the improvement across the refined benchmarks is about 9x on average.
4.3. Uniform versus Mixed Workloads
We consider two benchmarking scenarios: one in which we run the same benchmark on all cores (“uniform” case) and the other in which we run different benchmarks on different cores (“mixed” case).
First, consider the uniform case, for the specific example of the benchmark 410.bwaves running on a quad-core Ivy Bridge processor. The model predictions match the actual measurements very well for various core speeds, as shown in Figure 5(a). In addition, Figure 5(b) shows that the maximum absolute error is less than 0.25 watts and that the maximum relative prediction error is small, with the vast majority of predictions falling within a few percent of the measurements and an even lower average relative error. Though not shown, the test results for the 28 SPEC2006 benchmarks show a similar level of model accuracy.
(a) Model prediction versus actual power measurement
(b) The residuals distribution of our power model
(c) A nearly perfect power plane, indicating that power increases linearly with $\bar{f}$ and $\sigma_f$
(d) Parallel straight contour lines on the power plane
We also find strong linear relationships among power, $\bar{f}$, and $\sigma_f$ in Figures 5(c) and 5(d). Figure 5(c) shows a flat surface (a plane) where CPU power increases linearly with $\bar{f}$ and $\sigma_f$. These relationships are easier to see in Figure 5(d), which is a flattened, contoured version of the same data; the straight, parallel contour lines again reflect linear relationships. These observations confirm that the basic model, $M_5$, should be expected to work well.
We also consider mixed workloads, in which each core runs a different application. The model fits under mixed workloads show a similar level of accuracy for $M_5$. For example, when running the set of benchmarks 410.bwaves, 433.milc, 437.leslie3d, and 444.namd, one per core, on the quad-core Ivy Bridge, the maximum absolute residual is less than 0.31 watts, and the relative prediction errors remain comparably small, both in the maximum and on average. Other mixed workloads with two and four different benchmarks exhibit similar degrees of accuracy.
5. Discussion

The models proposed in Section 2 raise some natural questions, including why the power effect of frequency scaling on one core depends on the states of the other cores, and why power models that are linear in frequency can accurately capture the power effect of frequency scaling in practice.
5.1. DVFS Interdependency for Multicore Processors
Figure 1 reveals that the same speed scaling from one source frequency to a target frequency may result in different changes in the total processor power. A scaling that updates the maximum frequency among the cores leads to more significant changes in total power than other scalings. Two main reasons explain these differences.
5.1.1. Power of Uncore Devices
The cores on the same processor share uncore devices, which include the last-level cache, memory controller, and interconnection links. Uncore device power increases from two main sources. First, when the devices receive more requests from the cores, they consume more power to respond. Second, uncore devices on modern processors are equipped with power-aware technologies and can transition among multiple sleep states and performance states. A higher core frequency can trigger the uncore devices to transition from sleep states to active states, or from low-performance states to high-performance states [18, 19]. Such power-state transitions lead to a more significant power increase than the additional request activity of the first source.
Uncore device power partly explains the different power effects between scalings. A scaling that increases the highest speed among the cores not only causes more uncore activity but also transitions uncore devices to higher power states. Consequently, it leads to a larger increase in whole-processor power. In contrast, other scalings only cause uncore activity without updating the uncore performance states and thus increase the uncore device power by a smaller amount.
5.1.2. DVFS on Chip Multiprocessing Cores
The mechanism implementing DVFS is the other reason for the nonuniform power effect of speed scaling on multicore processors. DVFS transitions the processor cores among different performance states, where a performance state of a core corresponds to a (frequency, voltage) pair. The tuning of voltages and frequencies for chip-multiprocessing cores is implemented by one of three hardware mechanisms [20–22]: (i) a single clock domain and a single voltage domain shared by all cores, (ii) multiple clock domains and a single shared voltage domain, and (iii) multiple clock domains and multiple voltage domains, that is, individual per-core DVFS.
Different mechanisms produce different dependencies between the cores. With mechanism (i), the shared supply voltage must match the highest frequency among the cores in order for DVFS to work properly. Consequently, if a scaling updates the maximum frequency among the cores, it causes a large jump or drop in processor power due to the tuned-up or tuned-down frequency and voltage; other scalings barely change processor power. Mechanism (ii) offers finer power control than mechanism (i), as each core can individually scale its frequency; because the voltage domain is still shared, mechanism (ii) is effectively Dynamic Frequency Scaling (DFS). Mechanism (iii) deploys an individual clock and voltage domain for each core and independently controls per-core frequency and voltage. Table 4 summarizes the interdependencies of the power effects of DVFS scaling for these three mechanisms. Note that only mechanism (iii) supports per-core DVFS.
Technology has been shifting from mechanism (i) to mechanism (iii) [20–22]. Mechanism (i) was mostly adopted by earlier generations of Intel processors, such as the Xeon Nehalem and Sandy Bridge architectures, to limit platform and packaging cost. To improve the granularity of DVFS control, AMD processors, as shown in Figure 1, adopt mechanism (ii) to change the frequencies of individual cores. More recently, per-core DVFS using mechanism (iii) [21, 23] became available on Intel Haswell processors to improve DVFS effectiveness for multithreaded workloads with heterogeneous behavior.
The challenge that users face in designing DVFS scheduling is that, whether or not the underlying architecture supports per-core DVFS, operating systems and kernel interfaces, including cpufreq and the Intel P-State driver, give users the impression that it does. Such a discrepancy between user perception and actual hardware capability can lead to poor DVFS scheduling decisions and application performance degradation. To make better DVFS scheduling decisions, users would otherwise have to identify the architectural DVFS mechanism and carefully select a model specific to that mechanism. Our models resolve this issue, as they are applicable to all types of DVFS mechanisms across generations of modern processors, relieving users of the burden of characterizing the underlying architecture and its DVFS mechanism.
5.2. Cubic Power Model versus Linear Power Model
It has been widely accepted that the dynamic power is a cubic function of frequency for DVFS-capable processors [1–4]; that is, $P_{dynamic} \propto f^{3}$.
This cubic function is derived from two relations. First, the dynamic power of CMOS devices is a function of frequency and the transistor's supply voltage: $P_{dynamic} = aCV^{2}f$, where $C$ is the capacitance being switched per clock cycle, $V$ is the transistor's supply voltage, $a$ is the activity factor indicating the average number of switching events undergone by the transistors in the chip, and $f$ is the frequency.
Second, frequency depends on supply voltage in the following relation: $f \propto (V - V_{th})^{\alpha}/V$. Here, $V_{th}$ is the threshold voltage and $\alpha$ is a technology-dependent constant accounting for velocity saturation. For 1000 nm technology and older, $\alpha$'s value could be 2 [25, 26] and the supply voltage $V$ is much larger than the threshold voltage $V_{th}$. Consequently, frequency is considered to be proportional to supply voltage, and power is considered proportional to the cube of frequency.
The $P \propto f^{3}$ relation becomes inaccurate due to technology evolution in two aspects. First, to reduce dynamic power consumption effectively, the supply voltage has been lowered over the years and is now only slightly larger than the threshold voltage [27–29]. As a result, the supply voltage of DVFS processors spans a small range, and scaling within this range produces only small variations in dynamic power. Second, $\alpha$ shrinks over successive technology generations: it is approximately 1.3 in 45 nm technology and could be even smaller in newer generations. Consequently, reducing the voltage by a small percentage reduces the operating frequency by a larger percentage. Thus, the power effect of voltage scaling is overshadowed by that of frequency scaling, and power is effectively governed by frequency as a linear function, as captured by our models.
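The two relations can be checked numerically. The sketch below uses toy constants (not device parameters) to show that, with $\alpha = 1.3$ and a supply voltage only slightly above threshold, power grows far more slowly with frequency than the cubic model predicts.

```python
def freq(v, v_th, alpha, k=1.0):
    """Alpha-power law: f = k * (V - V_th)**alpha / V."""
    return k * (v - v_th) ** alpha / v

def dyn_power(v, f, a=1.0, c=1.0):
    """CMOS dynamic power: P = a * C * V**2 * f."""
    return a * c * v * v * f

# Modern-process regime (assumed numbers): V barely above V_th, alpha = 1.3.
v_lo, v_hi, v_th, alpha = 0.70, 0.75, 0.60, 1.3
f_lo, f_hi = freq(v_lo, v_th, alpha), freq(v_hi, v_th, alpha)
p_lo, p_hi = dyn_power(v_lo, f_lo), dyn_power(v_hi, f_hi)
f_ratio, p_ratio = f_hi / f_lo, p_hi / p_lo
# Here p_ratio stays well below f_ratio**3: power no longer scales cubically.
```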
6. Related Work
As power becomes a critical constraint at all levels of HPC systems, from chip and node to data center, extensive research has been conducted to measure, model, and manage power in computer components and systems. In this section, we briefly present related work on power measurement and architecture-level power modeling and then discuss the most closely related work in system-level power modeling.
Direct power measurement is a fundamental approach to quantitative power evaluation and provides the ultimate reference for analytical power modeling. Limited by the availability of power measurement tools, earlier work usually instruments computer circuits with external meters to measure the power consumption of individual components and, in turn, of the entire system. For example, PowerPack is built with NI data acquisition devices, which are instrumented into the DC power lines to measure the power of computer components including CPU and memory. Similarly, PowerInsight and PowerMon deliver the same functions with custom pluggable cards in smaller form factors. More recently, to meet the increasing demand for power monitoring and measurement, commodity processors, including those of Intel and AMD, have begun to provide embedded power meters and interfaces [15, 33, 34]. Such embedded meters provide accurate power measurements that are greatly helpful to system and software designers. Nevertheless, direct power measurement is limited to physical devices and components; it cannot separate the power of individual cores on multicore processors to support power management with thread concurrency scaling, which is effective and most needed for future architectures.
Analytical modeling, in contrast to physical measurement, can be performed on both hardware and software units at different granularities. Microarchitecture-level power models are commonly used to investigate and evaluate new power-saving and power-aware hardware and architectures. Such models correlate power with the parameters and usage of architectural components including register files, function units, clocks, and caches [35–37]. Representative models include Wattch for single-core architectures and McPAT for chip multiprocessors. Models at this level of detail are complex, and their use is largely limited to designers of HPC components and building blocks.
System-level power modeling, the research class into which our work falls, is an essential approach for runtime frequency schedulers to achieve power reduction and energy saving on HPC systems. Most previous studies investigate single-core architectures and systems and can be grouped into two basic categories [1, 2, 10, 38]. Models in the first category [1, 2] describe power as a basic polynomial function of CPU frequency in the form of (1); the polynomial degree varies with the power-aware technology and is set to 3 for DVFS-capable processors and otherwise greater than or equal to 1. Models in the other category [10, 38–41] correlate hardware performance events with power and leverage the performance monitoring counters available on hardware to collect event data. In general, the techniques in this category require extensive profiling and large volumes of experimental data for model training.
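As a concrete example of a category-one model, the sketch below fits P = c0 + c1 * f**n by ordinary least squares, with n = 3 for a DVFS-capable processor. The data is synthetic, generated from a known cubic, solely to illustrate the fitting procedure.

```python
def fit_poly_power(freqs, powers, n=3):
    """Closed-form least-squares fit of P = c0 + c1 * f**n."""
    xs = [f ** n for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(powers) / len(powers)
    c1 = (sum((x - mx) * (y - my) for x, y in zip(xs, powers))
          / sum((x - mx) ** 2 for x in xs))
    c0 = my - c1 * mx
    return c0, c1

freqs = [1.2, 1.6, 2.0, 2.4, 2.8]              # GHz settings (synthetic)
powers = [10.0 + 2.0 * f ** 3 for f in freqs]  # ideal cubic data
c0, c1 = fit_poly_power(freqs, powers)         # recovers c0 = 10, c1 = 2
```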
As multicore processors become the building blocks of HPC systems, researchers have attempted to understand their power consumption. A widely adopted approach assumes that the cores are independent and that the total power consumption of a multicore processor is the sum of the power of the individual cores, each of which is estimated by the traditional system-level power models for single cores [5–8, 42]. Nevertheless, as our work and that of Basmadjian and de Meer show, simply extending single-core power models without capturing the core interdependency results in inaccurate power estimation.
Little work has been done to capture the heterogeneous power effects of core interdependency in multicore processors, and all of it requires microarchitectural decomposition and event accounting. Basmadjian and de Meer decomposed a processor into its architectural components, including on-chip cores, off-chip caches, and interconnections, and modeled the power of each component with the power model in (1); in their work, the off-chip caches and interconnections capture the power interdependency between cores. Bertran et al. decomposed a processor at a finer granularity, into function units and the front end, and derived the power of each component from its measurable performance events using performance monitoring counters. This work reflects the power effect of core interdependency by adjusting the model coefficients obtained for single-core processors. Shen et al. similarly used measurable hardware performance events on microarchitectural components to estimate power; in particular, they paid special attention to chip maintenance power and shared it evenly among active cores.
Our models are distinct from prior efforts in system-level multicore power modeling. They achieve accuracy by capturing the interdependency between cores on multicore processors, yet remain practical and easy to use because they take only the average frequency and the frequency dispersion as model variables. In contrast, existing simple models such as (2) may provide inaccurate power estimates and lead to wrong scheduling decisions, while detailed models such as [9, 12] do not scale to future architectures that contain a large number of cores. Simple and easy-to-use power models are critical for power optimization and management in future applications and system software. We believe that our models provide a viable solution and can promote research in energy optimization for traditional and emerging software.
7. Conclusions and Future Work
This work shows that simply extending the traditional single-core power model may not faithfully capture the real power behavior of modern multicore processors. The reason is that the traditional model assumes that individual cores contribute to power consumption independently; we show that this assumption does not hold. Our proposed alternative uses aggregate statistical measures, mean frequency and dispersion, to express the interaction among cores. Compared to existing approaches that explicitly investigate the resources shared among cores and use microarchitectural events to capture the heterogeneous power effects of individual core speed scaling, our models are much simpler and scale to emerging and future multicore technologies. Our experiments validate the effectiveness and accuracy of the proposed models.
From our work, we draw several additional high-level conclusions. First, the power consumption of a multicore processor can be accurately predicted by a simple linear model of the average core speed and the speed variation. The linear model indicates that, for a given average speed, greater speed variation leads to higher power consumption.
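A minimal sketch of a model of this form is shown below. The coefficients are placeholders chosen for illustration, not values fitted from our measurements; a real deployment would estimate them by regression against measured power.

```python
from statistics import mean, pstdev

# Placeholder coefficients (assumed, not fitted): idle power, the
# mean-speed term, and the speed-dispersion term.
C0, C1, C2 = 20.0, 15.0, 4.0

def predict_power(core_freqs_ghz):
    """P = C0 + C1 * mean(f) + C2 * dispersion(f)."""
    return C0 + C1 * mean(core_freqs_ghz) + C2 * pstdev(core_freqs_ghz)

uniform = predict_power([2.0, 2.0, 2.0, 2.0])  # no speed variation
skewed = predict_power([2.6, 2.6, 1.4, 1.4])   # same mean, higher spread
```

With the same average speed, the skewed configuration is predicted to draw more power, reflecting the positive dispersion term.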
Second, using our method, one can build the power model that is suitable for an underlying multicore processor without needing to know many hardware details.
Third, our power models can be used to analyze and quantify the power characteristics inherent in applications and hardware architectures. For a new multicore processor, one only needs to run the experiments according to the methodology presented in this paper to determine the best model and the values of its parameters from the experimental data; the method requires running an application on the target processor only a small number of times.
Looking forward, evaluating not only the core but also the uncore hardware effects (such as cache noise) may further improve the model. To further reduce the number of runs needed to derive the model parameters, future work might combine the modeling approach proposed in this paper with the general modeling approach developed in our prior work [44, 45], thereby yielding power models that are both accurate and generic.
Any opinions, findings and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the NSF.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The authors would like to thank Intel in Beijing for providing the Haswell-EP platform for the experiments. This research is supported in part by the National Key Research and Development Program of China (nos. 2016YFB1000602 and 2017YFB0701501), National Natural Science Foundation of China (nos. 61440057, 61272087, 61363019, 61073008, and 11690023), and MOE Research Center for Online Education Foundation (no. 2016ZD302). Parts of this work are also supported by the U.S. National Science Foundation (NSF) (Awards nos. 1339745, 1422935, and 1551511) and CAREER (Award no. 0953100).
References
N. Bansal, T. Kimbrel, and K. Pruhs, "Speed scaling to manage energy and temperature," Journal of the ACM, vol. 54, no. 1, article 3, 2007.
F. Yao, A. Demers, and S. Shenker, "A scheduling model for reduced CPU energy," in Proceedings of the 36th IEEE Annual Symposium on Foundations of Computer Science, pp. 374–382, IEEE, October 1995.
T. D. Burd and R. W. Brodersen, "Energy efficient CMOS microprocessor design," in Proceedings of the 28th Hawaii International Conference on System Sciences, vol. 1, pp. 288–297, January 1995.
M. Horowitz, T. Indermaur, and R. Gonzalez, "Low-power digital design," in Proceedings of the IEEE Symposium on Low Power Electronics, pp. 8–11, October 1994.
S. Cho and R. G. Melhem, "Corollaries to Amdahl's law for energy," IEEE Computer Architecture Letters, vol. 7, no. 1, pp. 25–28, 2008.
M. Ghasemazar, H. Goudarzi, and M. Pedram, "Robust optimization of a chip multiprocessor's performance under power and thermal constraints," in Proceedings of the IEEE 30th International Conference on Computer Design (ICCD '12), pp. 108–114, IEEE, Washington, DC, USA, October 2012.
K. Meng, R. Joseph, R. P. Dick, and L. Shang, "Multi-optimization power management for chip multiprocessors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 177–186, ACM, Ontario, Canada, October 2008.
L. Yu, F. Teng, and F. Magoulès, "Node scaling analysis for power-aware real-time tasks scheduling," IEEE Transactions on Computers, vol. 65, no. 8, pp. 2510–2521, 2016.
R. Basmadjian and H. de Meer, "Evaluating and modeling power consumption of multi-core processors," in Proceedings of the 3rd International Conference on Future Energy Systems: Where Energy, Computing and Communication Meet, Madrid, Spain, May 2012.
R. Bertran, M. Gonzelez, X. Martorell, N. Navarro, and E. Ayguade, "A systematic methodology to generate decomposable and responsive power models for CMPs," IEEE Transactions on Computers, vol. 62, no. 7, pp. 1289–1302, 2013.
J. C. McCullough, Y. Agarwal, J. Chandrashekar et al., "Evaluating the effectiveness of model-based power characterization," in Proceedings of the USENIX Annual Technical Conference, vol. 20, 2011.
K. Shen, A. Shriraman, S. Dwarkadas, X. Zhang, and Z. Chen, "Power containers: An OS facility for fine-grained power and energy management on multicore servers," ACM SIGPLAN Notices, vol. 48, no. 4, pp. 65–76, 2013.
J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
T. Patki, D. K. Lowenthal, B. Rountree, M. Schulz, and B. R. de Supinski, "Exploring hardware overprovisioning in power-constrained, high performance computing," in Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), pp. 173–182, ACM, Eugene, Ore, USA, June 2013.
H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le, "RAPL: Memory power estimation and capping," in Proceedings of the 16th ACM/IEEE International Symposium on Low-Power Electronics and Design, pp. 189–194, IEEE, August 2010.
J. Treibig, "Likwid: Linux tools to support programmers in developing high performance multi-threaded programs," 2012, http://code.google.com/p/likwid.
R. Ge, X. Feng, S. Song, H.-C. Chang, D. Li, and K. W. Cameron, "PowerPack: Energy profiling and analysis of high-performance systems and applications," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 5, pp. 658–671, 2010.
V. Gupta, P. Brett, D. Koufaty et al., "The forgotten 'Uncore': on the energy-efficiency of heterogeneous cores," in Proceedings of the USENIX Annual Technical Conference (USENIX ATC '12), pp. 367–372, 2012.
H.-Y. Cheng, J. Zhan, J. Zhao, Y. Xie, J. Sampson, and M. J. Irwin, "Core vs. uncore: The heart of darkness," in Proceedings of the 52nd ACM/EDAC/IEEE Design Automation Conference (DAC '15), pp. 1–5, IEEE, June 2015.
U. R. Karpuzcu, A. Sinkar, N. S. Kim, and J. Torrellas, "EnergySmart: Toward energy-efficient manycores for Near-Threshold Computing," in Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, pp. 542–553, IEEE, February 2013.
E. Rotem, R. Ginosar, A. Mendelson, and U. Weiser, "Multiple clock and voltage domains for chip multi processors," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 459–468, ACM, December 2009.
A. A. Sinkar, H. Wang, and N. S. Kim, "Workload-aware voltage regulator optimization for power efficient multi-core processors," in Proceedings of the 15th Design, Automation and Test in Europe Conference and Exhibition, pp. 1134–1137, IEEE, March 2012.
W. Kim, M. S. Gupta, G.-Y. Wei, and D. Brooks, "System level analysis of fast, per-core DVFS using on-chip switching regulators," in Proceedings of the IEEE 14th International Symposium on High Performance Computer Architecture, pp. 123–134, IEEE, February 2008.
T. Mudge, "Power: a first-class architectural design constraint," Computer, vol. 34, no. 4, pp. 52–58, 2001.
R. Gonzalez, B. M. Gordon, and M. A. Horowitz, "Supply and threshold voltage scaling for low power CMOS," IEEE Journal of Solid-State Circuits, vol. 32, no. 8, pp. 1210–1216, 1997.
J. Burr and A. Peterson, "Ultra low power CMOS technology," in Proceedings of the 3rd NASA Symposium on VLSI Design, vol. 1, 1991.
H. Iwai, "Roadmap for 22 nm and beyond," Microelectronic Engineering, vol. 86, no. 7, pp. 1520–1528, 2009.
Hewlett-Packard, Intel, Microsoft, Phoenix, and Toshiba, "Advanced Configuration and Power Interface Specification," 2004.
N. S. Kim, T. Austin, D. Blaauw et al., "Leakage current: Moore's law meets static power," Computer, vol. 36, no. 12, pp. 68–75, 2003.
H. Esmaeilzadeh, T. Cao, Y. Xi, S. M. Blackburn, and K. S. McKinley, "Looking back on the language and hardware revolutions: measured power, performance, and scaling," ACM SIGARCH Computer Architecture News, vol. 39, pp. 319–332, 2011.
J. H. Laros, P. Pokorny, and D. DeBonis, "PowerInsight—a commodity power measurement capability," in Proceedings of the International Green Computing Conference (IGCC '13), pp. 1–6, IEEE, June 2013.
D. Bedard, M. Y. Lim, R. Fowler, and A. Porterfield, "PowerMon: Fine-grained and integrated power monitoring for commodity computer systems," in Proceedings of the IEEE SoutheastCon 2010 Conference: Energizing Our Future, pp. 479–484, IEEE, Concord, NC, USA, March 2010.
J. Demmel and A. Gearhart, "Instrumenting linear algebra energy consumption via on-chip energy counters," Tech. Rep. UCB/EECS-2012-168, University of California, Berkeley, Calif, USA, 2012.
E. Rotem, A. Naveh, A. Ananthakrishnan, E. Weissmann, and D. Rajwan, "Power-management architecture of the intel microarchitecture code-named Sandy Bridge," IEEE Micro, vol. 32, no. 2, pp. 20–27, 2012.
D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: a framework for architectural-level power analysis and optimizations," ACM SIGARCH Computer Architecture News, vol. 28, no. 2, pp. 83–94, 2000.
P. Landman, "High-level power estimation," in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 29–35, IEEE, August 1996.
S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 469–480, IEEE, December 2009.
W. L. Bircher and L. K. John, "Complete system power estimation using processor performance events," IEEE Transactions on Computers, vol. 61, no. 4, pp. 563–577, 2012.
R. Joseph and M. Martonosi, "Run-time power estimation in high performance microprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 135–140, ACM, 2001.
C. Möbius, W. Dargie, and A. Schill, "Power consumption estimation models for processors, virtual machines, and servers," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1600–1614, 2014.
K. Singh, M. Bhadauria, and S. A. McKee, "Real time power estimation and thread scheduling via performance counters," ACM SIGARCH Computer Architecture News, vol. 37, no. 2, p. 46, 2009.
S. Albers, F. Müller, and S. Schmelzer, "Speed scaling on parallel processors," Algorithmica, vol. 68, no. 2, pp. 404–425, 2014.
H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, "Looking back and looking forward: Power, performance, and upheaval," Communications of the ACM, vol. 55, no. 7, pp. 105–114, 2012.
J. W. Choi, D. Bedard, R. Fowler, and R. Vuduc, "A roofline model of energy," in Proceedings of the IEEE International Symposium on Parallel & Distributed Processing, pp. 661–672, Boston, Mass, USA, May 2013, https://smartech.gatech.edu/xmlui/handle/1853/45737.
K. Czechowski and R. Vuduc, "A theoretical framework for algorithm-architecture co-design," in Proceedings of the 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2013), pp. 791–802, Boston, Mass, USA, May 2013.