#### Abstract

Performance profiling is necessary for system analysis and is widely supported by hardware performance counters (HPC). HPC uses registers to count the number of events in a time interval and uses system interruption to read the counts from the registers into a recording file. The profiled result only approximates the actual running states because the profiling technique captures the states by sampling. The actual running states are not known beforehand, which makes validating the profiling results complex. Some experiment-based analyses compared the results of benchmarks running on different systems to improve confidence in the profiling technique, but they have not explained why the sampling technique can represent the actual running states. We use probability theory to prove that the expectation value of the profiled events is an unbiased estimation of the actual states and that its variance is small. To know the actual running states, we design a simulation that generates the running states and produces the profiled results. We choose the parameters of the simulation by referring to applications running in production data centers. Comparing the actual running states with the profiled results shows that they are similar, which validates our probability analysis and improves our confidence in profiling accuracy.

#### 1. Introduction

In data centers, performance is critical to improve the quality of service [1] and save costs [2]. Multiple tasks controlled by the operating system share computation resources at the same time to improve user experience and resource utilization. The whole system’s performance representing the combination of multiple tasks is not enough for analysis of each task’s performance. Many applications need to know the running states of specific tasks. These applications include anomaly detection on data centers [3, 4], compiler optimization using method stacks [5, 6], and hot spots detections [7, 8].

Modern processors have hardware support for monitoring system performance. Hardware Performance Counters (HPC) [9] are register-based counters that count the number of events in a time interval. With the help of interruption, HPC can output the counted number to a recording file. To profile each task’s performance, only one extra piece of information is needed: the instruction address. The instruction address indicates which task the processor is working for at the moment of interruption. The profiling technique then treats all events counted in the last time interval as caused by the task that the instruction address indicates. Using the instantaneous instruction address to represent the running states of a whole time interval is not accurate, but it is already the common way to profile the performance of tasks.

Many widely used profiling tools have adopted this approximation method, such as PAPI [10, 11], perf [12], and VTune [13]. Profiling accuracy has attracted research attention. Experiment-based evaluations compare the profiling results across multiple system architectures to improve confidence in the profiling technique [14, 15]. A CPU simulator gives detailed information, which makes the comparison more direct [16]. But these studies used simple benchmarks to check the accuracy. This kind of validation cannot be extended to other workloads, since the mechanism lacks proof and analysis: the studies have not explained why the sampling technique can represent the actual running states.

In this paper, we show the mechanism of the profiling technique with the help of probability theory. We model the profiling process with two main elements: the running granularity of a task and the sampling interval. We classify all possible conditions into three classes according to the ratio between the running granularity and the sampling interval. For a constant ratio, the sampling process is a kind of Bernoulli experiment [17]. We prove that, whatever the ratio is, the expectation value is an unbiased estimate of the actual value, and that a mixture of ratio values keeps the unbiased property. The variance is related to the number of samples; it is small and usually smaller than 0.25.

We further use simulation experiments to validate our proof. The simulation implements both the generation of the actual running states and the sampling process. The settings of the simulation follow the characteristics of the workloads running in live production data centers. We simulate single tasks and mixed tasks with multiple running granularities and under multiple resource utilization levels. All of these experiments show that the expectation value is an unbiased estimation. We also take the variance into consideration; its effect does not influence the unbiased property.

We organize our paper as follows. Section 2 introduces the background of the profiling technique. We propose our analysis model in Section 3. Section 4 designs the simulation model. Section 5 shows the simulation results including the simulation’s prerequisite. Section 6 reviews the related work. And we conclude in Section 7.

#### 2. Profiling Technique

In this section, we introduce the profiling technique used for recording the running states of clusters. For example, Figure 1 is a profiled result of the Windows operating system. The green part at the bottom shows the value changes of CPU utilization in this 60-second observing window. The blue line shows the rate of current operating frequency to the highest frequency. These lines link the samples of every second. And each sample represents the averaged CPU utilization of the last 1-second time interval.

##### 2.1. HPC Profiles

Hardware performance counters consist of two components: event detectors and event counters [18]. Users can configure performance event detectors to detect performance events such as cache misses, cycles consumed, or branch mispredictions. Often, event detectors have an event mask field that allows further qualification of the event. The qualification can depend on the processor’s privilege mode; for example, HPC can collect the events that occurred in the kernel when running in administrator mode.

The event counter increases by one each time the event happens, until a system interruption occurs. It then outputs its accumulated value, and the value is either reset to zero or kept, depending on whether the counter is in accumulative counting mode. The conditions that cause this kind of system interruption fall into two types:

(i) Time-based sampling interrupts the tasks’ execution at regular time intervals and records the program counters. This approach is often used to show how the profiled events relate to the time dimension.

(ii) Event-based sampling interrupts after a specific number of performance events, that is, when the number of events that have happened reaches a threshold. Users can specify the threshold.

The hardware performance counter method has distinct advantages. First, it profiles the system at the hardware level without any intrusion into applications, so applications and operating systems remain largely unmodified. Second, every modern processor supports performance counters, so this method is a general solution. Third, this method profiles the system on the fly while the applications execute, which saves the effort of reproducing the workloads, since some executions are prohibitively complex to simulate or reproduce.

Though hardware performance counters reveal lots of information from the system view, this information mainly exposes the system’s overall states, not a specific task’s behaviours.

##### 2.2. Task-Oriented Profiles

To profile the running states of tasks, only one extra piece of information needs to be added: the instruction address, which comes from the context information [19, 20]. We set periodic events to trigger the sampling, for example, every 100 cache misses. At the sampling moment, the recorded sample contains the content of the performance counters and the instruction address. The profiling technique then uses this sample to represent the running of the last sampling interval.

Figure 2 illustrates the profiling method. The upper bar is the actual running state, and the lower bar is the state captured by profiling. The profiled state is different from the actual running state. The upper blue block shows the actual running of task XX. The upper orange blocks show the actual running of task YY. The vertical lines represent the sampling moments that are triggered by the event threshold or time limit. The first sample regards all events of the last sampling interval as caused by task YY. The second sample treats the events of the last sampling interval as caused by task XX, as shown in the lower bar, though YY was running for most of the second sampling interval. The profiling result shown in this figure is that task XX and task YY both consumed 2 units of the resource.
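The attribution rule described above can be sketched in a few lines of Python. The timeline below is hypothetical (it is not the one drawn in Figure 2), and the function name `profile` and the choice to sample at the last unit of each interval are assumptions made for illustration.

```python
from collections import Counter

def profile(timeline, interval):
    """Charge each whole sampling interval to the task seen at its sampling instant.

    timeline: sequence of task labels, one label per resource unit.
    interval: number of resource units between two sampling instants.
    Returns the number of intervals attributed to each task.
    """
    profiled = Counter()
    # The sampling instant is assumed to be the last unit of each interval.
    for t in range(interval - 1, len(timeline), interval):
        profiled[timeline[t]] += 1
    return profiled

# Hypothetical timeline of 16 units, split into 4 intervals of 4 units.
timeline = list("YYYY" "YYYX" "XXXX" "XYYY")
actual = Counter(timeline)               # units actually used by each task
profiled = profile(timeline, interval=4)  # intervals attributed to each task

print("actual units:      ", dict(actual))
print("profiled intervals:", dict(profiled))
```

In this toy case both tasks are charged 2 intervals, although task Y actually ran for 10 of the 16 units; a single short observation can therefore be off, which is why the later analysis looks at the expectation over many samples.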

Our paper’s task is an abstract presentation that can represent the threads, processes, programs, or applications according to the analysis granularity. If we intend to profile the performance of an application, then the task means the application. However, an application can be divided into multiple threads. All these threads are regarded as running for a single application—a single task.

##### 2.3. Challenges

Checking the accuracy of the profiling method faces many challenges that come from the complex environment and from the limitations of profiling techniques. In the following, we list the challenges from two aspects.

First, we do not have the standard answers to validate the profiling results. In the data center, there are tens of thousands of applications running on thousands of computers [21]. These applications include online services that have high requirements on response time and off-line services that require high throughput. Sometimes the clusters can reach extremely high pressure, for example, the double 11 shopping festival. The complex environment makes every next moment different from the last one. We do not know the true proportions of the running applications or the true load of queries from users. Many production scenes appeared only once, which means these conditions could hardly be repeated.

Second, the profiling technique would unavoidably introduce overhead to the running system [22]. With the increasing sampling frequency, the overhead would increase, making it impossible to increase the sampling density too much to get detailed enough running states. And it is also impossible for current profiling techniques to separate the profiling workload from the original workload.

The experiments on benchmarks only prove some events’ correctness under certain workloads, and these experiment-based researches have not covered all scenarios. An explanation of why profiling can be trusted would improve our confidence when profiling the system that has not been covered.

#### 3. Analysis Model

##### 3.1. Application Scenario

A representative scene using task-oriented profiling is the hot spot detection. Taking the hot methods detection as an example, it targets finding out the top hot methods that consumed the most resources (like CPU cycles) for further performance optimizations. Not every method can catch enough attention to be optimized further since there are too many methods running on a live environment to be optimized one by one. Thus for profiling how many CPU cycles are consumed by a method, we can set a sampling-based method to profile the running of methods.

For example, we set a sampling interval of 0.1 seconds, which means that every 0.1 seconds the system is interrupted and the current instruction address is recorded. Suppose this instruction address indicates that the running method is “Sort()” and the number of cycles consumed in this sampling interval is 200 million. Then we count the “Sort()” method as having consumed 200 million cycles. This interruption is repeated to get an overview of the CPU cycles consumed by the methods.
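As a minimal sketch of this bookkeeping, the following Python snippet accumulates the cycles of each sample under the method found at the sampled instruction address and ranks the methods. The sample records, the method names other than “Sort()”, and the function name `hot_methods` are hypothetical.

```python
from collections import defaultdict

def hot_methods(samples, top=3):
    """Rank methods by the cycles attributed to them.

    samples: iterable of (method_name, cycles_in_interval) pairs, one per sample.
    """
    totals = defaultdict(int)
    for method, cycles in samples:
        # All cycles of the interval are charged to the sampled method.
        totals[method] += cycles
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Hypothetical samples taken every 0.1 seconds.
samples = [
    ("Sort()", 200_000_000),
    ("Parse()", 150_000_000),
    ("Sort()", 180_000_000),
    ("Hash()", 90_000_000),
]
ranking = hot_methods(samples, top=2)
print(ranking)
```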

##### 3.2. Model Components

We model the profiling process as two major elements to help us do further analysis. The main elements that need to be considered include the following.

(i) Running granularity: The averaged scheduling time of a task running continually until being switched out. Running granularity would be influenced by many factors like the property of this task, our observing level, system environment, etc. The length of a color block shown in Figure 3 is called the running granularity. The running granularity does not need to be a constant value.

(ii) Sampling interval: The distance between the last sampling to current sampling. Figure 3 shows an example. If the interruption is event-based rather than time-based, and the number of events is not proportional to time, the sampling interval’s length would look nonuniform from the time dimension. But from the corresponding event dimension, it is still of uniform intervals.

In the following analysis, the base event is to denote the event dimension that causes the sampling interruption. For example, if it is time-based sampling, then the base event is time, and if it is CPU cycles based sampling (e.g., interrupt system every 250 million cycles), then the base event is CPU cycles. The number of base events that happened in a sampling interval is a constant value without variance. We call the constant number of events that happened in a sampling interval a unit of events.

About the nonbase events, the numbers of these events collected by samples may not be as steady as the base event. The number of nonbase events in a sampling interval would be different from the other sampling intervals. Their estimation variance would be a little higher than base events. We include the considerations on nonbase events by introducing variance to the constant unit of events when doing experiment. To avoid introducing extra variable considerations into probability model, we first model our analysis focusing on base event, which can be further extended into nonbase events by adding extra considerations on the variance of the unit of events.

##### 3.3. Three Classes of Conditions

The accuracy of estimations on the base event is mainly influenced by the task switches, which are reflected by the ratio between the running granularity and the sampling interval. When the sampling interval becomes smaller while the running granularity stays the same, the sampling accuracy increases and the error bound becomes smaller. In the extreme condition where the sampling interval equals one clock cycle, the profiled result reflects the real running state accurately without any approximation.

We use the ratio between the running granularity and the sampling interval to define all possible conditions. We define a variable $R$ to denote this ratio as

$$R = \frac{\text{running granularity}}{\text{sampling interval}}.$$

According to the value of *R*, there are three kinds of conditions, as shown in Figure 4: (a) $R = 1$, (b) $R < 1$, and (c) $R > 1$.

(i) Figure 4(a) represents the condition that the sampling interval and the running granularity are the same.

(ii) Figure 4(b) represents the condition that the running granularity is smaller than the sampling interval, which means the tasks may already have been switched more than once within one sampling interval.

(iii) Figure 4(c) represents the condition that the running granularity is larger than the sampling interval.

We use these three kinds of conditions to help with our further analysis.

With a specific constant *R* value, the sampling process is a kind of Bernoulli experiment whose results follow a binomial distribution. Here, a Bernoulli experiment means that a running task either is captured by a sample or is not, and that this trial is repeated independently. We first use cases with representative *R* values to illustrate how to calculate the expectation value of the profiling distribution, then summarize them in a general representation, and finally show the corresponding variance calculation.

(1) For the first condition, $R = 1$, the sampling interval and the running granularity are of the same length. No matter where the sampling starts, one of two adjacent samples captures this task and regards it as having caused one unit of base events, where one unit of base events means the constant number of base events that happen in a sampling interval. This is an accurate estimation without errors. Additionally, when the $R$ value is an integer, like 2 or 3, the sampling result behaves as in the $R = 1$ condition and gives an accurate estimation.

(2) For the second condition, $R < 1$, we denote the time at which the $i$-th sample is captured as

$$t_i = t_0 + iT,$$

where the profiling starts at $t_0$ with a period of $T$, so an interruption occurs every $T$ to get a sample. We make one assumption about $t_0$: the time at which profiling starts is randomly chosen, that is, $t_0$ is independent of $T$ and of other factors. We assume that a proportion $r$ ($0 < r < 1$, e.g., 30%) of the sampling interval is working for a task. The probability that the task is captured by a sampling point equals its running proportion $r$ in this sampling interval, since the sampling point is independent of the running of this task. The probability that this task is missed (not captured by the sampling point) is $1 - r$. This process is repeated, and we get a set of profiled samples. When we denote a unit of events as $u$, the expectation value of the events attributed to this task in an interval can be deduced as

$$E = r \cdot u + (1 - r) \cdot 0 = ru.$$

The expectation value therefore equals the real running proportion.

(3) For the third condition, $R > 1$, we first analyze the case in which the running granularity of the task is between 1 and 2 times the sampling interval, denoted as $1 + r$ times the sampling interval length with $0 < r < 1$. There are two possible conditions. One is shown in Figure 4(c): the task appears in three intervals and is captured by two samples. The other is shown in Figure 5: the task appears in two intervals and is captured by only one sample. The probability of the first condition, captured by two samples, is $r$, and the probability of the second condition, captured by a single sample, is the remaining probability $1 - r$. The expectation value $E$ can be combined from these two probabilities as

$$E = 2u \cdot r + u \cdot (1 - r) = (1 + r)u,$$

where $u$ represents a unit of events. This condition gives an unbiased estimation too.
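The three conditions can be checked numerically with a small Monte Carlo sketch. It assumes, as in the text, that the first sampling instant is uniformly random within one period; the names `captures` and `mean_captures`, the symbols `t0` and `T`, and the trial count are illustrative choices rather than part of the original model description.

```python
import math
import random

def captures(run_start, run_length, t0, T):
    """Count sampling instants t0 + k*T that fall in [run_start, run_start + run_length)."""
    first_k = math.ceil((run_start - t0) / T)                    # first instant at or after the run start
    end_k = math.ceil((run_start + run_length - t0) / T)         # first instant at or after the run end
    return end_k - first_k

def mean_captures(R, T=1.0, trials=100_000, seed=0):
    """Average number of samples that capture a run lasting R sampling intervals."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        t0 = rng.uniform(0.0, T)  # random phase of the sampling grid
        total += captures(0.0, R * T, t0, T)
    return total / trials

for R in (1.0, 0.3, 1.3):
    print(f"R = {R}: mean captures = {mean_captures(R):.3f}")
```

With one unit of events charged per capturing sample, the printed means should sit close to the corresponding *R* values, which is the unbiasedness argued in cases (1) to (3).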

##### 3.4. Piecewise Binomial Distribution

We generalize this deduction as follows. The running granularity is $R$ times the sampling interval. All possible sampling conditions fall into two classes: the run is captured by $\lfloor R \rfloor + 1$ samples or by $\lfloor R \rfloor$ samples, with probabilities $\{R\}$ and $1 - \{R\}$, respectively. The expectation is

$$E = (\lfloor R \rfloor + 1)\,u\,\{R\} + \lfloor R \rfloor\,u\,(1 - \{R\}) = (\lfloor R \rfloor + \{R\})\,u = Ru,$$

where $u$ is a unit of events, $\lfloor \cdot \rfloor$ is a function that returns the integer part of its argument, and $\{\cdot\}$ is an operation that returns the fractional part. For example, $\lfloor 3.14 \rfloor$ equals 3 and $\{3.14\}$ equals 0.14. This expectation value shows that the profiling method gives an unbiased expectation value of the running proportion.

The sampling result distribution can be regarded as a binomial distribution under a specific constant *R* value. According to the binomial distribution, the variance (*V*) of the sampling distribution under a specific $R$ value, measured in squared units of events, is

$$V = \{R\}\,(1 - \{R\}),$$

where $\{R\}$ is the fractional part of $R$.

This expectation value and variance describe the distribution from which the samples are drawn. It is an ideal model, and a sufficient number of samples approaches this distribution according to Chebyshev’s theorem [23].
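The piecewise expectation and its variance are easy to evaluate directly. The sketch below assumes the reading used in this section, namely that a run of *R* sampling intervals is captured by either the integer part of *R* plus one samples or the integer part of *R* samples, and it sets one unit of events to 1; the function name `capture_distribution` is hypothetical.

```python
import math

def capture_distribution(R):
    """Return (expectation, variance) of the number of capturing samples for ratio R.

    The unit of events is taken as 1, so the expectation is in units of events.
    """
    whole = math.floor(R)   # integer part of R
    frac = R - whole        # fractional part of R
    expectation = (whole + 1) * frac + whole * (1 - frac)
    variance = frac * (1 - frac)
    return expectation, variance

for R in (0.3, 1.0, 1.5, 3.14):
    e, v = capture_distribution(R)
    print(f"R = {R}: E = {e:.4f}, V = {v:.4f}")
```

The expectation always returns *R* itself, and the variance peaks at 0.25 when the fractional part is 0.5 and vanishes for integer *R*.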

In actual running conditions, the running granularity changes, which makes the *R* value vary. We can regard the varying *R* value as a mixture of components with different *R* values. Whatever the mixture proportions of these components are, the expectations of the components are all equal to the *r* value, so the combined expectation is also equal to *r*. This can be denoted as

$$E = \sum_i p_i E_i = \sum_i p_i\, r = r,$$

where $p_i$ represents the proportion of the $i$-th component, with $\sum_i p_i = 1$, and $E_i$ represents the expectation value of the $i$-th component. The property that the expectation value is unbiased therefore still holds when considering varying *R* values.

##### 3.5. The Number of Samples

The number of samples influences how closely the profiled result approximates the distribution. When the number of samples is $n$, the mean value of these samples stays equal to the expectation value, and the variance of the sample mean is related to $n$:

$$V_n = \frac{\{R\}\,(1 - \{R\})}{n},$$

where $\{R\}$ is the fractional part of $R$.

The maximum variance is 0.25, which appears when $n = 1$ and $\{R\} = 0.5$. The small variance improves our confidence in the current profiling method. Even in the worst condition, the method still performs well: the probability of misestimating the mean value by more than one unit of events is less than 0.0228.
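One way to arrive at a probability of this size is a normal approximation, which is an assumption made here and is not stated in the text: with the worst-case variance of 0.25, the standard deviation is 0.5, so one unit of events is two standard deviations away from the mean. The helper names below are hypothetical.

```python
import math

def worst_case_sd(n):
    """Standard deviation of the sample mean for the worst-case variance 0.25."""
    return math.sqrt(0.25 / n)

def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

sd = worst_case_sd(1)     # worst case: a single sample
p = upper_tail(1.0 / sd)  # probability of overshooting by more than one unit
print(f"sd = {sd}, one-sided tail probability = {p:.5f}")
```

Under this approximation the one-sided tail probability is about 0.02275, and it shrinks quickly as `n` grows because the standard deviation falls with the square root of `n`.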

#### 4. Simulation Model Design

We introduce how to implement our simulation process. Two parts need to be modelled: the actual running states and the profiling process.

The running state generation reflects the condition in which multiple tasks share the computing resources. To keep the model simple, we do not divide the computing resource into more specific types. We treat the resource as something that can be occupied by only one task at any moment, which is a simplification of the true system. A true system with multiple resources can be modelled by duplicating this simplified model. For example, a multiprocessor CPU runs multiple processes simultaneously, so at one moment the CPU resource of the system is shared by different tasks. This condition can be simulated by repeating the simplified model, where each running model represents one processor that runs a single process at one moment. This simplification keeps the basic property of the system, and, if necessary, the model can be rebuilt into a complex system. To simulate the running states, we regard the smallest resource amount allocated to run tasks as a resource unit (e.g., a CPU cycle). Several parameters need to be specified, including the work amount of the tasks (the number of resource units needed), their corresponding running granularity, and their scheduling and sharing behaviour.

Regarding the sampling method, the sampling interval is the key setting. By changing the sampling interval and running granularity, we simulate different value conditions. The sampling interval would keep stable roughly with little variance. We add random noise into the sampling interval to approach the real profiling scene. The start time of the sampling can be generated randomly. We get the profiled samples based on the same true running condition, repeating the sampling with different randomly generated start time and different noise introduced. The statistics of profiled samples about the proportion of tasks are supposed to approach real task running proportion values.
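A minimal sketch of such a simulation for a single task is given below. The block-based state generator (each block of `granularity` units is busy with probability `utilization`), the integer jitter on the sampling interval, and the reduced sizes are all assumptions chosen to keep the example short; they are not the exact settings of the experiments.

```python
import random

def generate_states(total_units, utilization, granularity, rng):
    """Build a 0/1 list: 1 means the unit works for the task, 0 means idle.

    Assumed scheme: the timeline is cut into blocks of `granularity` units and
    each block is busy with probability `utilization`.
    """
    states = []
    while len(states) < total_units:
        busy = 1 if rng.random() < utilization else 0
        states.extend([busy] * granularity)
    return states[:total_units]

def profile(states, interval, rng, jitter=0):
    """Estimate the busy proportion by sampling with a random start time.

    `jitter` adds a small random integer noise to every sampling interval and
    must be smaller than `interval`.
    """
    t = rng.randrange(interval)  # randomly generated start time
    hits = samples = 0
    while t < len(states):
        hits += states[t]
        samples += 1
        t += interval + (rng.randint(-jitter, jitter) if jitter else 0)
    return hits / samples

rng = random.Random(42)
states = generate_states(200_000, utilization=0.8, granularity=30, rng=rng)
actual = sum(states) / len(states)
estimates = [profile(states, 100, rng, jitter=2) for _ in range(100)]
mean_estimate = sum(estimates) / len(estimates)
print(f"actual = {actual:.4f}, mean of profiled estimates = {mean_estimate:.4f}")
```

Repeating `profile` on the same `states` with different random starts mirrors the repeated sampling described above, and the mean of the estimates can be compared with the actual proportion.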

#### 5. Experimental Evaluation

This section shows the simulation results on various kinds of single task and mixed tasks, including the variance introduced to simulate nonbase events. We first introduce the simulation components and the experimental results are followed to prove the effectiveness of our model.

##### 5.1. Simulation Components

Our experiments are conducted on real running environments and the analysis data is collected by the system profiling tool. We call it a simulation experiment since the workload that is running on the system is simulated without actual functions and is under control. There are three components that need to be clarified, including data collection method, the control of workloads’ running granularity, and the method to introduce the nonbase event variance.

###### 5.1.1. Data Collection Method

We use the “perf” to profile the system—the Linux kernel already contains this profiling tool. An application or a program serving for user requests consists of multiple methods, and these methods belong to their corresponding modules. The “perf” script would offer an automatic parsing function to map Instruction Pointer (IP) value to the corresponding method and module, record the hardware events, and index each sample with the sampling timestamp, CPU number, and event name. The dimensions collected by the “perf record” script related to our model and their meanings are shown in Table 1. We profiled the running states of these physical machines from CPU view—each recorded sample represents the state of a CPU in a sampling interval.

###### 5.1.2. Running Granularity Scale

We regard the continuing samples with the same module name as the condition that this module has not been switched out—its running granularity value is calculated as the sum of the continuing sampling interval. The distribution of sampling interval from the time dimension is shown in Figure 6. The distribution of running granularity deduced from the sampling results is shown in Figure 7. When the running granularity is smaller than the sampling interval, we treat it equal to the sampling interval, making running granularity here a little larger than its actual value. The running granularity is not always smaller than the sampling interval. It is about several times of sampling interval. Thus the *R* value scale that we experiment with, like 1 to 10, is good enough to cover the conditions rather than a thousand or million scale.
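The rule for deducing running granularity from consecutive samples can be sketched as follows. The module names and interval lengths are hypothetical, and interval lengths are given as integer milliseconds only to keep the sums exact.

```python
from itertools import groupby

def running_granularities(samples):
    """Merge consecutive samples of the same module into runs.

    samples: time-ordered list of (module_name, interval_length) for one CPU.
    Returns a list of (module_name, running_granularity) pairs, where the
    granularity is the sum of the consecutive sampling intervals.
    """
    runs = []
    for module, group in groupby(samples, key=lambda s: s[0]):
        runs.append((module, sum(length for _, length in group)))
    return runs

# Hypothetical samples: (module, sampling interval length in milliseconds).
samples = [
    ("libc", 10), ("libc", 11),
    ("kernel", 9),
    ("libc", 10),
    ("jvm", 10), ("jvm", 12), ("jvm", 8),
]
runs = running_granularities(samples)
print(runs)
```

A run that covers only one sample gets the length of that single interval, which matches the treatment of granularities smaller than the sampling interval described above.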

###### 5.1.3. Nonbase Event Variance

To show that the variance of a nonbase event is small when profiling and that our model of this variance is reasonable, we analyze the variance of the “cycle” event when the base event is “time.” The sampling intervals in the collected data are not all the same; thus, we scale the cycle event value by dividing the number of cycles by the length of the sampling interval. For example, the distribution of the profiled cycle event in one physical machine that ran 231 modules in 5 seconds is shown in Figure 8. The cycles consumed in a sampling interval tend to be either fully utilized or idle, but this characteristic does not make the variance large for each module. For each module, we select its corresponding cycle event samples and calculate their variance. Of the 213 modules that appeared in our 5-second observing window, not every module has a large number of samples. Thus we keep the 111 modules whose number of samples is larger than 10. The variances of these 111 modules are shown in Figure 9. The mean value of the variances is 0.3171. The variance that we introduce to simulate nonbase events, between 0 and 2, is therefore in a reasonable range.
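The per-module filtering and variance calculation can be sketched as below. The record layout, the function name `module_variances`, and the use of the population variance are assumptions for illustration; the text does not state which variance estimator was used.

```python
from collections import defaultdict
from statistics import pvariance

def module_variances(samples, min_samples=10):
    """Variance of the scaled cycle event for each sufficiently sampled module.

    samples: iterable of (module_name, cycles, interval_length).
    Cycles are scaled by the interval length; modules are kept only when they
    have more than `min_samples` samples.
    """
    scaled = defaultdict(list)
    for module, cycles, length in samples:
        scaled[module].append(cycles / length)
    return {m: pvariance(v) for m, v in scaled.items() if len(v) > min_samples}

# Hypothetical data: module "a" has 12 samples, module "b" only 5.
samples = [("a", 2.0 if i % 2 == 0 else 6.0, 2.0) for i in range(12)]
samples += [("b", 1.0, 1.0)] * 5
result = module_variances(samples)
print(result)
```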

##### 5.2. Unbiased Estimation

In this section, we show that, under any *R* condition, the expected value of profiled samples is unbiased, and their variance is small. We first use multiple types of tasks respectively with different utilization levels and keep the sampling interval the same—which means different *R* values. And we also show the estimation of mixed tasks with different running granularity also has good performance.

###### 5.2.1. Single Task

We set 1 million units to simulate running states, and each unit can work for a task or in idle. We set the utilization of units as 80%, 50%, and 30% to combine with the running granularity as 30, 50, 100, 150, 200, and 280 separately to cover a total of 18 types of running states. Then we profile these 18 types of running states by setting every 100 units to trigger interruption to get a sample—the sampling interval is 100, and repeat profiling on each running state 1000 times and get 1000 profiling results of each running state.

We do not introduce any extra variance first and randomly generate the running states for each unit according to the set running granularity. Taking the 80% utilization load level as an example as shown in Table 2, the mean value of utilization for these 1 million units approach to the set utilization value. But within each sampling interval, this task’s utilization is not always 80%, as shown in Figure 10—we take 30 running granularity as an example to plot out the actual running states for each sampling interval.

The profiling process is conducted on these 18 types of running states. We find that the profiling results have little variance and are unbiased with respect to the expectation value; the detailed result for 80% utilization is shown in Table 2. The distribution of the mean values over 1000 profiling runs, with a running granularity of 30 and 80% utilization, is shown in Figure 11; its minimum value is 0.7869 and its maximum value is 0.8092. The profiling results of the other conditions, with 50% and 30% utilization, are shown in Table 3; their expectation values approach the actual utilization values, and their variance values are small too.

One may suspect that the low variance of these 1000 profiling results comes from the large number of samples collected by each profiling process, which reaches 10 thousand. A large number of samples does guarantee a low variance. However, the variance is still small even when the number of samples is reduced to 1. For example, the variance for 80% utilization with a running granularity of 30 equals 0.004. Thus, without extra variance introduced, the estimation is unbiased and its variance is small.

###### 5.2.2. Mixed Tasks

Except for a single task running on a system, it is more usual that multiple types of tasks (with different running granularity) share a system simultaneously. We simulate this condition by mixing tasks with a specific proportion. We show the result to make the system with 80% utilization composed by 30% task 1 with 30 running granularity, 20% task 2 with 50 running granularity, 20% task 3 with 150 running granularity, and 10% task 4 with 280 running granularity. We observe on 1 million units, and the sampling granularity is still 100. The generated running states show that each task’s mean value is unbiased, as shown in Table 4 with small variance. We can find the mixed running still keeps the unbiased property.
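A sketch of the mixed-task setting is shown below. The proportions and granularities follow the text, while the generation scheme (choosing the next run with weight proportion divided by granularity, so that the share of units matches the target proportion), the idle run length of 50, and the function names are assumptions made for illustration.

```python
import random
from collections import Counter

# name: (target proportion of units, running granularity); idle length is assumed.
TASKS = {
    "task1": (0.30, 30),
    "task2": (0.20, 50),
    "task3": (0.20, 150),
    "task4": (0.10, 280),
    "idle": (0.20, 50),
}

def generate_mixed(total_units, tasks, rng):
    """Concatenate runs of randomly chosen tasks into one timeline."""
    names = list(tasks)
    # Weight by proportion / granularity so long runs are chosen less often.
    weights = [tasks[n][0] / tasks[n][1] for n in names]
    timeline = []
    while len(timeline) < total_units:
        name = rng.choices(names, weights=weights)[0]
        timeline.extend([name] * tasks[name][1])
    return timeline[:total_units]

def shares(labels):
    """Proportion of each label in a sequence."""
    counts = Counter(labels)
    return {name: c / len(labels) for name, c in counts.items()}

rng = random.Random(7)
timeline = generate_mixed(1_000_000, TASKS, rng)
actual = shares(timeline)
start = rng.randrange(100)                  # random start of sampling
profiled = shares(timeline[start::100])     # one sample every 100 units

for name in TASKS:
    print(f"{name}: actual = {actual[name]:.4f}, profiled = {profiled[name]:.4f}")
```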

##### 5.3. Variance Introduced to Simulate Nonbase Events

The sampling interval and unit of events are constant values in the former experiments—having no variance. But the real condition would always have some variations. Thus we explore the impact of the variance of sampling interval and unit of events in this section. Some nonbase events would not be as accurate as base events. Thus we introduce variance to sampling interval or unit of events to simulate the performance of nonbase events. The inaccuracy of the sampling interval would influence the specific number of the units of events. Thus the variance introduced to sampling interval or unit of events has the same effect to simulate.

We introduce noise to a unit of events: the number of events counted by a sample varies by a noise value generated randomly from a normal distribution. We denote the normal distribution by the function *N*. We profile the same running states of mixed tasks as in the previous section and introduce noise to the unit of events, whose value was fixed at 100 before because the sampling interval is set to 100. The noise for each sample is drawn from the normal distributions *N*(0,0.5), *N*(0,1), and *N*(0,2), respectively, and the unit of events profiled for each sample is calculated by

$$u' = u + \varepsilon,$$

where $u$ is the constant unit of events and $\varepsilon$ is the noise value drawn from *N*.
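The noise injection can be sketched as follows. Two assumptions are made here and should be read as such: the noise is additive, and the second parameter of *N* is interpreted as a variance, so the standard deviation passed to `random.gauss` is its square root. The function name `noisy_units` is hypothetical.

```python
import math
import random
import statistics

def noisy_units(n_samples, unit=100.0, variance=1.0, rng=None):
    """Return the unit of events recorded by each sample after adding noise.

    The noise is drawn from a zero-mean normal distribution whose variance is
    `variance` (assumed interpretation of the second parameter of N).
    """
    rng = rng or random.Random()
    sigma = math.sqrt(variance)  # random.gauss expects a standard deviation
    return [unit + rng.gauss(0.0, sigma) for _ in range(n_samples)]

rng = random.Random(1)
units = noisy_units(10_000, unit=100.0, variance=2.0, rng=rng)
print(f"mean = {statistics.fmean(units):.3f}, variance = {statistics.pvariance(units):.3f}")
```

Because the noise has zero mean, the average unit of events stays near 100, which is in line with the observation that the injected noise widens the spread without shifting the mean.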

The mean and standard deviation (SD) value of 1000 times profiling—each profiling gets about 10 thousand samples—with noise introduced following three normal distributions, respectively, are shown in Table 5.

We also reduce the number of samples collected by each profiling process. The results when the number of samples is reduced to 1 thousand are shown in Table 6. The estimation is still unbiased; the variance is influenced by the injected noise, but not enough to influence the mean value. The differences for task 1 and task 4 when *n* = 500 and the noise follows the *N*(0,1) distribution are caused by different actual running states: the real proportions of task 1 and task 4 are 0.3432 and 0.5560. This means that they are still unbiased estimations of the real proportions.

#### 6. Related Work

Except for the hardware-based performance profiling, there are another two representative methods. One is an intrusive method [19, 24, 25] that needs to modify the application source code to add instrumentation code for collecting data. This method requires the authority to access the source code, rebuild the application source code, and redeploy this version into the system. These requirements are impractical. Moreover, these intrusive methods can disturb the application’s behaviour, bringing other questions about the collected data’s validity.

Another one is the simulator-based method [26, 27] using the processor simulator that models the real processor’s architecture. It collects processor performance data by using the simulator to execute the application. This method can yield detailed data on a processor like the pipeline stalls and cache line behaviours. However, not every processor would have its corresponding simulator that is provided by its manufacturers. The simulator would also be tens of times slower than running on real processors, making performance profiling costly.

The task granularity profiling is useful in code profiling and hot execution path detection [28]. Identifying program hot spots can support runtime optimization [29, 30]. The application anomaly [4] or stragglers detection [31, 32] also needs the information from the task-level.

For more accuracy to count the events into specific tasks, there are instruction-oriented profiling techniques [33]. The profiling interruption is triggered by an instruction related dimension. A detailed record of interesting events and pipeline stage latencies in the out-of-order processor is collected. Trace-based profiling [34, 35] has a similar design to follow a running pipeline to record the running states. But they are not useful in improving the confidence in the hardware counter accuracy.

Many pieces of literature analyze accuracy from the probability theory. Chen [36] proposed the ProbPP method for analyzing the probabilities on the execution paths of the multithreaded programs. Yan and Ling [37] used the probability model on the memory level parallel analysis to estimate the maximum number of cache misses. But they did not use the probability model to prove the accuracy of hardware-based profiling technique.

#### 7. Conclusion

In this paper, we analyze the mechanism of the hardware-based profiling technique using probability theory and design an analysis model to simulate the profiling process. The settings of the simulated model follow the characteristics of workloads running in a live environment. The simulation results validate our probability deduction and show that the expectation value is unbiased and the variance is small. We expect that this work can improve confidence in profiling accuracy and broaden the relevant research directions.

#### Data Availability

The simulated running state data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.