About this Journal Submit a Manuscript Table of Contents
International Journal of Reconfigurable Computing
Volume 2012 (2012), Article ID 439727, 12 pages
http://dx.doi.org/10.1155/2012/439727
Research Article

Modeling and Implementation of a Power Estimation Methodology for SystemC

1Institute for Information Processing Technology, KIT, 7602 Karlsruhe, Germany
2Department of Informatics, Federal University of Paraiba (UFPB), 58051-900 João Pessoa, PB, Brazil

Received 19 March 2012; Revised 12 May 2012; Accepted 18 June 2012

Academic Editor: Massimo Conti

Copyright © 2012 Matthias Kuehnle et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This work describes a methodology to model power consumption of logic modules. A detailed mathematical model is presented and incorporated in a tool for translation of models written in VHDL to SystemC. The functionality for implicit power monitoring and estimation is inserted at module translation. The translation further implements an approach to wrap RTL to TLM interfaces so that the translated module can be connected to a system-level simulator. The power analysis is based on a statistical model of the underlying HW structure and an analysis of input data. The flexibility of the C++ syntax is exploited, to integrate the power evaluation technique. The accuracy and speed-up of the approach are illustrated and compared to a conventional power analysis flow using PPR simulation, based on Xilinx technology.

1. Introduction

The need for more abstract system on chip development techniques is evident due to rising system complexity. Consequently, accurate system evaluation in less time will increase the productivity. According to the Semiconductor roadmap, especially the consideration of energy consumption is becoming more important and is also a limiting factor for many applications [1, 2]. Modeling strategies are driven by system and software engineers on the one hand and hardware engineers on the other hand. The first group develops RTL models written in hardware description languages (HDLs) since they are the basis for synthesis tools. The second group uses transaction level models (TLMS), most commonly written in SystemC [3] since these models enable fast system simulation. SystemC is a library based on the object-oriented programming language C++. A TLM specification extends SystemC to separate communication from computation. TLM improves modeling and simulation speed. The simulation speed depends on the level of abstraction [4]. Also, modeling at different abstraction levels is possible. This increases the flexibility of SystemC.

A remaining problem is the trade-off between accuracy and simulation speed and with that, the link and synchronization between the two layers. Translation tools are solving this problem to some extent. They inherit some limitations in the translation of syntax constructs that do not have direct counterparts. The presented tool extends this feature list. The main goal of this work, however, is the integration of a power analysis methodology into the translation process. The power estimation methodology estimates switching activities in DSP units such as adders or multipliers according to actual input data. The approach is based on a statistical methodology. It implements the measurement functionality implicitly into the SystemC model by defining overloaded operators, in the sense of object-oriented programming. These can be differently characterized based on technology parameters. The operators can be automatically integrated at system translation. DSP units are considered, since a power analysis of such systems shows that the major part of the dynamic power dissipation is consumed in the data-processing part of the architecture. In addition, the power dissipation is highly data dependent. Therefore a fast but reasonably accurate estimation of the dynamic power dissipation of such data-driven kernels is of high interest and the analysis using representative input data is essential. From the hardware point of view, the additional flexibility inherent in reconfigurable architecture and easy-to-use implementation flows make reconfigurable computing, especially FPGA architectures, attractive for power-aware computing solutions. Since CPUs are not sufficiently energy efficient (power consumption up to 200 W) and, on the other hand, ASICS are unaffordable for a low market volume, reconfigurable computing is considered an alternative especially for data-driven applications. Their benefit has been shown, for example, in [5, 6].

The described situation in the system development landscape motivates the development of a strategy for abstract, hence faster data-dependent power analysis for LUT-(lookup-table-) based systems in this work. The results were presented in [7] and are compared here to a standard power estimation flow. Beside it a detailed mathematical modeling strategy for power consumption estimation is presented. To ease and accelerate the development cycle, the power estimation method is embedded in a translation tool, so that the process of power estimation is transparent for the user and can be validated.

The remainder of this work is structured as follows. Section 2 summarizes related work. Section 6 embeds the strategy in an overall tool flow. Section 3 discusses the strategy to evaluate power dissipation for LUT-based hardware based on a statistical model to estimate toggle rates. This is related to a hardware mapping analysis of representing macroblocks (adders and multipliers) for data processing. Section 7 describes the implementation of the technology into the VHDL to SystemC converter. Section discusses the results. Section 9 concludes and gives an outline on future extensions.

2. Related Works

Code translation tools can help guaranteeing the consistency of translated and original models across languages and speeding up development time, since tool-supported translation takes seconds instead of hours to days for manual translation. In the system development cycle, top-down approaches [810] are used in HLS (high-level synthesis) tools; bottom-up approaches [11, 12] are used for IP reuse and module abstraction to achieve faster simulation models. In many hardware engineering problems, optimization is necessary on RTL Level. However, readability and efficiency of the translated code are two major problems of the HLS code translators. Because of that a bottom-up strategy is followed in this work. The authors in [13] further illustrates the effects of IP reuse on design time, hence motivates the bottom up approach.

The presented methodology differs from the existing solutions since it targets, beside correct code translation, the automatic integration of further functionality: bus interface wrapping and a power estimation methodology. The power estimation methodology estimates switching activities in DSP units such as adders and multipliers according to actual input data. The approach is based on probability theory. With that it differs from approaches that need to run a conventional time-consuming power estimation flow (e.g., XPA from Xilinx) or other estimators that are based on synthesis results without considering input data (e.g., XPE from Xilinx).

The approach presented in [14] is also implemented for SystemC. It enables the usage of different power models. In comparison with this work, it uses a strategy based on logging the execution of special modules and signals extended from regular SystemC ones, instead of a probability model, as presented here. In [15] a SystemC class library is proposed to the calculate the energy consumption of hardware described with SystemC TLM, and the power model was based on experimental results performed in laboratory, while approach of the presented work is based on the translation of modules from VHDL to SystemC RTL and on probability models. Further works concern the extension of the approach also to TLM.

The work [16] presents an architectural level framework for power analysis, based on parameterized power models of common structures of microprocessors, but does not consider any probability model, as presented in this work. In [17] a methodology is presented for simulation and verification of low-power systems using SystemC. It is based on disabling modules during execution to simulate the power switch-off used by the technique of power gating. The technique for disabling SystemC modules at simulation time is detailed in [18]. At the same time, these related works are not specific to reconfigurable architectures, in contrast to our work, which considers specific characteristics of FPGAs.

3. Power Evaluation Strategy

This project uses an activity-based power estimation and calculation based on the switching frequency of the inputs, outputs, and the internal signals of the individual sub-modules. The proposed power estimation and calculation is tailored for signal processing units where the major part of the power consumption is produced by multiply-accumulate instructions, which is computed in hardware by ripple carry adders and field multipliers based on ripple carry adders. To guarantee that the synthesized hardware for computing MAC is realized on the given adder and multiplier structures, a hardware model on register transfer level of the processing unit is needed. Both implementation and calculation are optimized for area consumption on FPGAs with little slice count. The accuracy of the estimation is based on two areas.

First there is the realization on hardware which is done by the synthesis tools for a specific LUT-based FPGA. Later in this section synthesis results for given FPGAs are examined for getting the fan-out parameters which are needed for a exact calculation of the power dissipation of each element. Further general rules for optimal synthesis of ripple carry adders and field multipliers are formulated.

Secondly, input data is analyzed for predicting the switching frequency on the expected synthesized hardware structure. This means that the probability distribution of the input data has to be computed. The probability distribution of the output data is derived from that. The input data should be taken from the original implementation. This means, for example, in case of an audio decoder, a corresponding audio stream should be used. For better comprehension an complete workflow for power estimation is shown.

The first step to be done is the analysis of the input data. For example, the input data is uniformly distributed on the complete accepted range with the boundaries 0 and u, where u is a natural number. This assumption is suitable if the distribution of the input data is unknown. By the way this concept was first formulated by Gauss and is known as the Gaussian indifference principle.

Next the binary coding of the numbers between 0 and 𝑢 is analyzed. The value of the MSB is distributed as followed:(i)in the range between 0 and 𝑢/2, the value of the MSB is 0,(ii)in the range between 𝑢/2 and 𝑢, the value of the MSB is 1.

With the condition of the uniform distribution, all numbers in range are selected with the same frequency. This means, for the probability that the MSB bit is set to 1,𝑝(MSB)=𝑢/2𝑢=12.(1)

The analysis of the next less significant bit is made the same way. The value of the (MSB-1) bit is the follows:(i)in the range between 0 and 𝑢/4, value of (MSB-1) is 0,(ii)in the range between 𝑢/4 and 𝑢/2, value of (MSB-1) is 1,(iii)in the range between 𝑢/2 and 3𝑢/4, value of (MSB-1) is 0,(iv)in the range between 3𝑢/4 and u, value of (MSB-1) is 1,

The calculation of the probability that (MSB-1) bit is set to 1: 𝑝(MSB1)=2(𝑢/4)𝑢=12.(2)

Based on the calculations for the MSB and the (MSB-1) bit, a prediction can be made.

For every step you make from MSB towards LSB, the number of intervals in which the bit is set to 1 is doubled but the length of the intervals is halved. In consequence this means that the accumulated length of the intervals in which the bit is set to 1 is equal for all bits. Because of this fact the probability that a bit on the input is set to 1 is calculated for all bits: 1𝑝=2.(3)

After the computation of the probabilities on the input vector, the probabilities for the output vectors of the ripple carry adder and the field multiplier can be calculated. First the probability of the occurrence of carry bits of the ripple carry adder is determined.

For the possibility that a carry-on location 𝑛 is generated, all input vectors of the adder between 0 and 2𝑛1 are relevant. Figure 1 shows the addition of two summands in this range.

439727.fig.001
Figure 1: Carry generation in the adder.

Each cell in this figure corresponds to the sum of the row and column number. To calculate the the probability that the carry is set, it is suitable to count the number of cell in which a carry is generated and divide this count by the total number of cells. This figure has 22𝑛 entries of which 22𝑛12𝑛1 generate the carry. The division leads to𝑝𝑐,𝑛=22𝑛12𝑛122𝑛=122𝑛1.(4)

Additionally the following statement can be made: because the generation 𝑛th carry is only possible if the upper bound of the accepted input values is 2𝑛, all the probabilities for all carry bits on locations higher than 𝑙𝑑(𝑢) can be set to zero.

After calculating the probability of the carry bits, the probability of the sum bits of the adder can follow. The boolean equation of the sum on location 𝑛 is 𝑠𝑛=𝑖1𝑛𝑖𝑛2𝑛𝑖1𝑛𝑖2𝑛𝑐𝑛𝑖1𝑛𝑖𝑛2𝑛𝑖1𝑛𝑖2𝑛𝑐𝑛.(5)

Transferring this equation to the probability domain leads to 𝑝𝑠,𝑛=𝑝𝑖1,𝑛𝑝𝑖2,𝑛+𝑝𝑖1,𝑛𝑝𝑖2,𝑛𝑝𝑐,𝑛+𝑝𝑖1,𝑛𝑝𝑖2,𝑛+𝑝𝑖1,𝑛𝑝𝑖2,𝑛𝑝𝑐,𝑛.(6)

Setting 𝑝𝑖1,𝑛=0.5, 𝑝𝑖2,𝑛=0.5 (uniform distribution) and 𝑝𝑐,𝑛 equals 𝑝𝑏,𝑛=0.5.(7)

Like the statement for carry bits with location near the MSB, a likewise statement can be made for the sum bits: because the generation 𝑛th sum bit is only possible if the upper bound of the accepted input values is 2𝑛, all the probabilities for all carry bits on locations higher than 𝑙𝑑(𝑢) can be set to zero. Similar to these calculations, likewise calculations for other probability are possible. The following paragraph describes the approach for calculating the probability of sum and carry bits with calculated nonequal probabilities for each bit of the adder input. The given input probabilities are named 𝑝𝑖1,𝑛 and 𝑝𝑖2,𝑛. In this project a 32-bit adder was analyzed so the valid indices are in the range from 0 to 31. With this new assumption, the proceeding is as follows.

First the boolean equation of the carry bit is annotated: 𝑐𝑛=𝑖1𝑛𝑖2𝑛𝑖1𝑛𝑖2𝑛𝑐𝑛1𝑖1𝑛𝑖2𝑛.(8)

Based on this equation, a transformation to probability domain is made: 𝑝𝑐,𝑛=𝑝𝑖1,𝑛𝑝𝑖2,𝑛+𝑝𝑖1,𝑛𝑝𝑖2,𝑛𝑝𝑐,𝑛1+𝑝𝑖1,𝑛𝑝𝑖2,𝑛.(9)

With the extra knowledge that the carry bit in the least significant adder is never set (this means 𝑝𝑐,𝑛 is zero), all the carry probabilities can be calculated in a recursive manner. With the formula for calculating the 𝑝𝑠,𝑛 probabilities the examination of the ripple carry adder with nonequal distributed input vectors is completed.

Based on these observations a derivation for the calculation of the signal probabilities of the field multiplier is possible.

The idea is to connect the calculations for the ripple carry adder with individual input probabilities. According to Figure 2, the input probability 𝑝𝑖1𝑛 of one adder is the same value like the 𝑝𝑠(𝑛+1) probability of the previous level. With the extra assumption of uniform distributed input vectors of the adder and the knowledge that a AND-gate is equal to a multiplication of its input probabilities, the 𝑝𝑖2,𝑛 probability equals the multiplication of the probabilities of the inputs of the multiplier; that is, 0.50.5=0.25. If the distribution is restricted to the range 0 to u, all 𝑝𝑖2,𝑛 with a index and level higher than 𝑙𝑑(𝑢) have to be set to zero.

439727.fig.002
Figure 2: Structure of a field multiplier.

Also these calculations can be performed on other input probability distributions. After the calculation of the probabilities of occurrence, the transition to the switch frequency is made. The normalized statistical switching frequency is given by𝑓norm,stat=𝑝(1𝑝).(10)

That definition of the frequency is equal to the probability of the occurrence of a positive transition of the signal. By means of this frequency definition, the dynamic power dissipation of CMOS circuits is described as follows: 𝑃dyn,stat=𝐶𝑈2𝑓norm,stat𝑓basefanout,(11) where 𝑓base defines the bit rate on the observed signal. The parameter 𝐶 for the input capacity and the supply voltage 𝑈 can be extracted from the data sheet while the fan-out parameter which describes the number of inputs to be driven is dependent on the circuit synthesis on the FPGA. For getting the fanout parameters of the individual signals, the synthesis result of Xilinx ISE is analyzed. The following pictures show the elementary cells of the adder for different FPGAs.

Figure 3 shows the synthesis result of the basic cell of a ripple carry adder for a LUT-FPGA with three to four inputs and one output, Figure 4 shows the synthesis result of the basic cell of a ripple carry adder for a LUT-FPGA with five to six inputs and one output, Figure 5 shows the synthesis result of the basic cell of a field multiplier for a LUT-FPGA with four to five inputs, and Figure 6 shows the synthesis result for a LUT-FPGA with six to seven inputs.

439727.fig.003
Figure 3: Optimized elementary cell for three-to-four-input LUT.
439727.fig.004
Figure 4: Optimized elementary cell for five-to-six-input LUT.
439727.fig.005
Figure 5: Optimized elementary cell for four-to-five-input LUT.
439727.fig.006
Figure 6: Optimized elementary cell for six-to-seven-input LUT.

Out of the synthesis result, it is obvious that the maximal size of a elementary cell is limited by the LUT fanin which calculates the carry for the next elementary cell. The elementary cell synthesized for the adder equals a radix-oor((𝑛1)/2) adder which is shown in Figure 7.

439727.fig.007
Figure 7: The high radix adder mapped to LUTs.

Similar to the high-radix adder elementary cell, the elementary cell of the multiplier consists of a high-radix adder and set of AND-Gates which are controlling the optional addition of summand no. 2 (shown in Figure 8).

439727.fig.008
Figure 8: The high-radix adder with gate mapped to LUTs.

With help of this circuit diagrams, it is possible to get the fan-out parameters for the power dissipation formula. The overall workflow can be seen in Figure 9 and is described as follows.

439727.fig.009
Figure 9: Setup of a new power estimation.

4. Improving Probability Analysis by Carry-Computation

Another possibility for the quality of the probabilities of signal occurrence is the replacement of signal by transfer of a specifically calculated value. It raises the question of which carry bit achieved through the replacement of an exact signal calculation has the best effect. The first consideration is the simplified assumption that all bits are equally distributed, then, for the probability of occurrence for each carry bit of the formula:𝑝𝑁1(𝑐,𝑛)=212𝑛+1.(12)

In case that the 𝑛1th bit is calculated exactly for the signal, the maximum difference applies to Diff𝑐,𝑛1=𝑁𝑛1𝑛=0||𝑝𝑁𝑐,𝑛/𝑝𝑁𝑐,𝑛||,Diff𝑐,𝑛1=𝑁𝑛1𝑛=012𝑝𝑁𝑐,𝑛,Diff𝑐,𝑛1=𝑁𝑛1𝑛=012𝑛,Diff𝑐,𝑛1=1212𝑁𝑛1.(13)

This maximum deviation occurs with the probability of 𝑝𝑁𝑐𝑛𝑎. Hence, for the probable deviation, 𝑝Diff1𝑐,𝑛=Diff𝑐,𝑛1𝑝𝑁𝑐,𝑛1,𝑝Diff1𝑐,𝑛=1212𝑛+11212𝑁𝑛1.(14)

To get the maximum deviation probability, the first derivative of 𝑝𝑐diff,𝑛1 should be used:𝑝Diff1𝑐,𝑛1=ln(2)2𝑛11212𝑁𝑛11ln(2)2𝑁𝑛11212𝑛1+10.(15)

It can be seen from the equation that it must apply for a zero:𝑁𝑛1!=𝑛1𝑛+1,1=𝑁12.(16)

With an additional graphic, it can be seen that it is found at the point where only the maximum of the function is located. Now it is possible to find the point in the equation that determines the maximum:𝑝maxDiff𝑐,(𝑁1)/2;Diff𝑐,(𝑁1)/2𝑝𝑁𝑐,(𝑁1)/20.25.(17)

At this point it becomes clear why the exact calculation of the carry is not worthwhile: the probability of occurrence of a input signal bit is already, by assumption, 0.5 and can be determined without calculation; the calculation of a carry must be made in each run of the adder and helps only to avoid errors in average of 0.25. The distribution of the most probable estimation error for different length adders is presented in Figure 10. The calculation of a carry bit, however, is costly. The 𝑛1th carry is calculated as follows: Carry𝑛1[]+[]=(Summand1mod(1𝑖)Summand2mod(1𝑖))÷(1𝑖),(18)

439727.fig.0010
Figure 10: Distribution of the most probable estimation error for n-bit adder.

the total of(i)3 identical left-shift operations(ii)1 division(iii)2 modulo (division with remainder).

5. Interface: Power Loss Estimation

In order to generate as little code overhead, a basic C data type was created for unsigned integers unsigned int for the equivalent class sc_int_power, which is equipped with the same operator overloads like the original data type. These include all the arithmetic operators that are required for signal processing (addition, subtraction, multiplication, division), the assignment operator, and the operator for streaming output with cout. In addition, a cast operator for unsigned int casting in a safe situation in which only the numeric value of the sc_int_power object must be such as that with an array indexing. The real power estimation takes place only in the multiplication and addition operators, because these functions are the main focus of this work. The functions occurring in this energy consumption model are stored by the class static member power.

To take advantage of the presented class sc_int_power in a SystemC project, it should be compared to a standard implementation requiring the following additional code fragments:(1) include the header file power.h;(2) call the static function set_calc to indicate the accuracy of calculations;(3) call the static function set_mu set parameters for the estimation;(4) call the static function set_lut to adjust the power estimation of the target FPGA.

Anything else due to the replication of the functionality of unsigned int in sc_int_power must be ignored. It is also possible to define a simple design already in the VHDL source code with pragmas, such as the type integer in the types sc_int_power by the modified V2SC during translation. Instead of the function calls in an equivalent SystemC project boundary bonds are defined by the above pragmas as follows:(1)set powerestimation_lut number;(2)set powerestimation_mu number;(3)set powerestimation_calc 0 or CALC_ALL;(4)powerestimation on: all of the following integers in sc_int_power translated;(5)powerestimation off: all of the following integers in int translated.

These pragmas are recognized when translated from VHDL to SystemC and automatically converted into the corresponding SystemC functions. By this procedure, additional user settings are generated into the source code.

5.1. The Class prob_bit

The class prob_bit implements the functions to estimate the amount of time for the inputs, outputs, and internal signals of a 32-bit adder to be high and the estimation of the switching frequency. To fill the individual estimated, six arrays are implemented:(1)input contains the estimation of bit = “1” for input bits;(2)input_rising_edges contain the estimation for the switching frequency of the input;(3)output contain the estimation of bit = “1” for the sum bit;(4)output_rising_edges contain the estimation for the switching frequency of the sum bits;(5)signal contains the estimation of bit = “1” for the carry bit;(6)signal_rising_edges contains the estimation for the switching frequency of the carry bit.

To store a new estimate for bit = “1” in each field, the functions add_input_prob, add_output_prob and add_signal_prob provided that again over the functions get_input_prob, get_output_prob, and get_signal_prob can be retrieved. To simplify these getter and setter methods, the class has the static methods in1, in2, carry and out, which return the corresponding array index in the fields. For example, an estimate probability of bit = “1” for the 3rd bit of the adder adder1 is set to 0.5 with the following call:

adder1.add_input_prob(prob_bit::in2(3),0.5).

The estimated switching frequencies on the power dissipation of the adder can have the class of functions input_weight_function, signal_weight_function, and output_weight_function, corresponding fan-ins for the individual inputs, outputs, and signal. By multiplying the switching frequencies associated with the switching frequency, we have the following relationship: 𝑃𝑣=𝐶LUT𝑉2(𝑓fanin).(19)

Thus, all weighted switching frequencies must be multiplied only by the same factor in order to infer the loss of performance. The actual value for each weighting function depends on the specific type of LUT-optimized design.

The weighting factors provided with the switching frequencies can be accessed through the functions get_input_power, get_output_power, and get_signal_power. The parameters of these functions can also have the static function in1, in2 be carry and out to select the desired bits.

These functions only refer to individual bits of each type. A new group of functions was defined to access the accumulated probabilities of occurrence of bit = “1” and the weighted switching frequencies. The functions are named following the same pattern as the getter and setter methods:(1)get_weighted_input_sum accumulates all the probabilities of occurrence of the inputs;(2)get_weighted_input _power_sum accumulates all the switching frequency of the inputs;(3)get_weighted_signal_sum accumulates all the probabilities of occurrence of the signals;(4)get_weighted_signal_power_sum accumulates all the switching frequency of the signals;(5)get_weighted_output_sum accumulates all the probabilities of occurrence of the outputs;(6)get_weighted_output_power_sum accumulates all the switching frequency of the outputs.

It is also possible to accumulate over the functions get_weighted_all_sum and get_weighted_all_power_sum all weighted probabilities of occurrence or all the weighted switching frequencies. For further support of statistics on different prob_bit objects, two other functions were created for each port and signal to return the probability of occurrence and the switching frequency, as follows(1)print_input_all_prob returns the probability of occurrence of all inputs;(2)print_input_all_power outputs the switching frequency of all inputs;(3)print_signal_all_prob returns the probability of occurrence of all signals;(4)print_signal_all_power outputs the switching frequency of all signals;(5)print_output_all_prob returns the probability of occurrence of all outputs;(6)print_output_all_power returns the frequency switching of all outputs.

The class uses a custom constructor with three parameters for the transfer of the number of input and output signals, which also allows the formation of any combinational logic result.

5.2. The Class exact_bit

The class exact_bit implements functions for the accurate determination of probability of occurrence and switching frequency of 32-bit adders and is analogue of the class prob_bit which estimates these parameters. In order to not duplicate code and to keep the naming of the methods most consistently, the class exact_bit inherits from the parent class prob_bit and can largely use the methods of the parent class. The difference from the parent class makes itself felt in the determination of the exact switching frequency. During the estimation in the class prob_bit, an estimation of the switching frequency can be made directly from occurrence probability, this is not an exact determination because positive edges can be made only in the context of the last state of the adder. For this reason, the class attribute last_state holds a pointer to an associated exact_bit object that stores the last state of the input and output signals. With this additional information, it is now the exact determination of the number of positive edges occurred and therefore the exact determination of the switching frequency possible.

The determination of the accumulated amount of time in which one remains on high bit (equivalent to the probability of occurrence) is also possible without the context of the last condition; therefore the implementation of the class prob_bit is used. All other functions such as statistical functions to accumulate or print functions can also be used without modification, as there are the fields’ input, input_rising_edges for the stored values to access from the direct calculation.

5.3. Addition Class

The class addition provides the methods for estimation and exact determination of the probabilities of occurrence and switching frequencies of individual bits in a prepared 32-bit ripple-carry adder, which are used in the class sc_int_power. Addition has the functions power_add_approx2, power_add_approx2_complex, and power_add_exact2, where the first two estimate the switching frequency and the last ones accurately determine the switching frequency of the individual inputs, broadcast, and outputs.

The two estimators power_add_approx2 and power_add_approx2_complex differ in the specification of the probability of occurrence of the input bits. For the case when the adders are used as a single unit, it can be assumed as a multiplier, according to the principle of indifference when all input vectors are equally distributed within an interval. The function power_add_approx2 can be used to get the average of the input vectors and a statistic value that the prob_bit object returns. There is the case when the adder is part of a multiplier, the perception of the equal distribution is no longer made, and the function power_add_approx2_complex is used. This function takes an input statistic in the form of a prob_bit counter object and returns the output statistics again in a prob_bit object.

The function for accurate calculation of the switching frequency power_add_exact2 was implemented, two terms for the associated switching frequencies, whilst the output statistics are in an exact_bit object are used as parameters. The exact determination of the required associated switching frequency is associated to exact_bit object with the latest state of the adder stored by the class in the private attribute last_state.

5.4. Multiplication Class

The class Multiplication provides methods for accurate determination and estimation of the probability of occurrence and the switching frequency in a 32-bit multiplier, which are required in the parent class sc_int_power. The class has the functions power_multiply_approx2 for the estimated and the power_multiply_exact2 for accurately determined switching frequency. The parameters of both functions are analogous to the methods of the class Addition. The estimator gives the mean value for the factors, while the function is used to the exact determination of the two factors to be multiplied. Because of the considered carry of the multiplier, the class Addition takes into account the wiring structure and take intermediate AND gates completely.

5.5. Random Generator for gcc

First, to test the estimation of the classes Addition and Multiplication, the built-in random generator of C was used with the function rand from stdlib library. In the first test the estimations had similar variations (about 10%) to the exact values calculated as in the subsequent tests with the Mersenne-Twister random number generation. Due to the fact that the C standard requires no algorithm for random number generation, the author does not know which algorithm is used in the current implementation by gcc.

5.6. Random Generator LFSR

In a second test the classes Addition and Multiplication were used for a linear feedback shift register. In the present implementation, the first 32 CRC polynomials were stored; that is, it can pseudo-randomize sequences to generate numbers in intervals up to 232. Tests, however, showed that the generated pseudo-random numbers are distributed very unevenly, most obviously in direct comparison with the C-random number generator or the Mersenne-Twister. Due this statistical property, this random number generator was considered unsuitable and was not included in further tests.

5.7. Mersenne-Twister Random Number Generator

The GNU suite delivers a Mersenne-Twister random number generator. This type of random generator works on an extremely long interval of 2199371 and its most important feature is the uniform distribution of all output bits. Due to these excellent statistical properties, the Mersenne-Twister random number generator has been selected as a reference for testing the power loss estimation. In a comparison to integrated random in C, the Mersenne-Twister cuts marginally better due to the fact that in the random number generator rand of C, the least significant bits are not equally distributed.

5.8. Random NDIST Based on gcc

To test the implementation of the power estimation with normally distributed random variables, the C-function NDIST was used, which approximates to function rand, but with uniformly distributed random numbers. With the aid of the central limit theorem, which states that the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed, six different random numbers are added and displayed as a normally distributed random number. The actual number of six random numbers to be accumulated was determined experimentally. The chart of Figure 11 shows the theoretical distribution of the generated numbers in this six-accumulated random variable.

439727.fig.0011
Figure 11: Normal distribution.
5.9. Generating a Large Random Number Set

The problematics when comparing different implementations of the estimation with the classes Addition and Multiplication in comparison with the VHDL implementation are to regenerate the random numbers used in each run. For this reason a common set of uniform distribution of 21000 random numbers 0–1024 was generated and from this set all the relevant statistical characteristics of was calculated the probability of occurrence of individual bits. But the numbers from 0 to 1, which are the basis for determining the switching frequency, change. In addition to this, statistical characteristics were raised to what extension they can differ from the requirement to assess the validity of the estimation result.

6. Module Translation and Integration

An integrated system simulation can be accomplished in homogeneous or heterogeneous environments. The adoption of homogeneous system simulations has several advantages: (i) it achieves faster simulation since synchronization tasks between different simulation kernels can be omitted; (ii) the usage of SystemC instead of for example, VHDL and the possibility to simulate models on higher abstraction levels further increment the simulation speed and additionally; (iii) since SystemC models are inevitable for system simulations and RTL models are necessary for synthesis flows, a module translation helps in keeping synchronism among the different module implementations and reduces error sources. On the other hand, this strategy requires a module translator. The VHDL to SystemC translator (V2SC) that was used in this work as basis has been developed by [19].

It was designed for SystemC 1.0 and basically supports constructs that can be directly translated from VHDL to SystemC. In this work, the translator has been extended to be compatible with the actual SystemC version 2.2 and to translate synthesisable VHDL constructs. Table 1 presents the features provided by V2SC and Table 2 shows a selection of extensions included as a contribution of this work. Further convertible VHDL syntax elements are listed in [20]. The converter has been verified against a set of VHDL modules, among them a IMDCT (for calculating the inverse modified discrete cosine transform) module including an AHB master and a APB slave interface with a complexity of >5000 lines of code, a FFT module (for calculating the fast Fourier transform) with about 250 lines of code, and a GCD module (for calculating the greatest common divisor) with about 100 lines of code.

tab1
Table 1: V2SC features.
tab2
Table 2: V2SC + wrapper extras.

The extensions are implemented with the compiler building tools flex and bison in the same way as the original converter. Furthermore the macroprocessor m4 and the text manipulation tool sed were used. The whole extension is build as a wrapper around V2SC (Figure 12) and is divided in two parts. A preconvert filter simplifies syntax elements by converting them to V2SC compatible constructs, or the filter masks constructs such that they can pass the V2SC without modification, if the V2CS cannot handle them. The masked blocks are converted by the postconverter to SystemC constructs.

439727.fig.0012
Figure 12: Combination of wrapper and V2SC.

The converted IPs own a pin accurate RTL interface whereas the system simulator provides TLM interfaces. To decrease the design time, a protocol conversion has been integrated in the translation tool. The modified V2SC enables connecting VHDL RTL IPs to a SystemC TLM system by implementing a library that contains design patterns for pin accurate to TLM functional units according to the interface specifications. Appropriate modules are exemplary implemented for AMBA, AHB, and APB. The resulting translated modules can be directly connected to a AMBA-TLM 2.0 base system (see Figure 13).

439727.fig.0013
Figure 13: Interface conversion for generating a heterogeneous accuracy simulation model.

7. Implementation of Power Estimation Features

The extension contains beside the compatibility pack, as described in Section 6, a tool for the automated integration of the previously discussed power estimation and calculation. This section explains a strategy, with which the conventional flow may be omitted if a certain decrease of accuracy can be accepted. Generally, the extension of functional parts with monitoring capabilities is an efficient way for automated system analysis. Since accurate power analysis flows are one the most time-consuming steps in the development process, their integration in a SystemC environment, with the techniques discussed in Section 3, are expected to speed up the evaluation process. Conventional accurate power analysis tools need mainly two kinds of input information. On the one hand a model of the placed and routed design allows to calculate capacitive loads of each net by evaluating fan-out numbers and the characteristics of the primitives (LUTs for FPGA technology) and with that the energy consumption per net per activity. On the other hand dynamic activities are collected during post-place and route simulation in value Change Dumps (VCD). Especially the accurate simulation with enabled signal tracing is a very time-consuming step.

The power evaluation extension of the modified V2SC can be enabled by the flag -powerestimation of preconvert. After setting this flag, preconvert converts the VHDL-type integer in the new-user-defined type sc_int_power and not to the standard SystemC-type int. sc_int_power is a new type which is constructed in this project to do all power estimation and calculation. It acts a replacement for int and its behaviour is similar to it. To get this functionality, all common arithmetic operators (+,−,*,/) are overloaded while the addition and the multiplication operator definitions contain the power estimation and calculating algorithms. Additionally all arithmetic operators are overloaded for mixed-type operation with int. Another feature is the overloaded cast operator of int, which is especially for indexing arrays. With a view to the use of sc_int_power as a complete SystemC data type, the streaming operator , the test of equality ==, the test of inequality !=, the assignment operator =, and the function sc_trace are overloaded. Thus in this manner defined type can be used as a template class in sc_signal<>, for example. To get a high prediction quality of the power estimation, the algorithms have to be parameterised. This is done by the static class methods set_mu (defines the average value of the input vectors) and the set_lut (for optimizing the design to architecture). The static class method set_calc defines a switch between estimation (value is zero) and calculation (value is equal CALC_ALL constant). The result of the calculation respectively estimation can be retrieved from the static variable power. For improvement of the manageability of the power analyser tool the parametrisation of the algorithms is also possible in VHDL code by special comments called pragmas. These pragmas consist of the prefix “powerestimation” followed by the parametrisation function (i.e., set_calc, set_mu or set_lut) and the value to set. The following example shows the proceeding.

The original VHDL code:(1)signal a,b,c: integer; (2)powerestimation set_lut 5(3)powerestimation set_calc 0(4)powerestimation set_mu 10.0(5)(6)c<=a+b.

Is translated in SystemC code:(1)#includepower.h(2)··(3)sc_int_power a,b,c; (4)c=a+b;(5)//the powerestimation result(6)//is stored in sc_int_power::power (7)return 0;

The whole workflow is like that presented following that Figure 14.

439727.fig.0014
Figure 14: Power estimation workflow.

8. Results

8.1. Accuracy of Toggle Estimation

The accuracy of the power estimation algorithms in sc_int_power to the also implemented exact calculation is about 13% for a average of more than 100 single additions or multiplications. Figure 15 shows the estimated toggle counts on the left and the exact calculated toggle count on the right of the partial sums (summands no. 1 of each stage) for a data set of 1000 two factors namely summands which are equally distributed in the range between 0 and 1024.

fig15
Figure 15: Toggle distribution estimation.

In comparison to the Xilinx XPower tool (see Figure 16), an almost similar accuracy was reached. On the following picture is the count of toggles for the partial sum bits visualized.

439727.fig.0016
Figure 16: Reference from XPA tool.
8.2. Speed Improvement

The integration of the proposed power estimation slows the calculation in contrast to the calculations on the standard type int about the factor two, but the also implemented exact power calculation slows the additions and multiplications about the factor 32, namely, 1024 (see Table 3). But the estimation is only slightly worse than the exact determination of the power dissipation; however the estimation is about the factor 16, namely, 512 faster. In sum this means that the proposed estimation is a good tradeoff between accuracy and simulation speed.

tab3
Table 3: Simulation speed comparison.

9. Conclusion and Future Work

This work proposes a methodology for switching activity estimation, taking into account the underlying hardware structure. The methodology has been exercised for MAC units. Different FPGAs have been used to show the portability of the method to other technologies. The loss of accuracy of 13% in the case of the MAC unit compared to post-place and route simulation results comes along with a simulation speed up of a factor up to 1024. The transparent implementation of that methodology into a VHDL to SystemC converter further accelerates and eases the development process. The general approach of the methodology can be also applied to other regular computation structures. With the implementation of further computational units (e.g., dividers or other adder architectures) and the support for other FPGA architectures, an analysis of complex data paths and a faster evaluation of design alternatives are envisioned.

References

  1. N. Dhanwada, I. C. Lin, and V. Narayanan, “A power estimation methodology for SystemC transaction level models,” in Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and Systems Synthesis (CODES+ISSS '05), pp. 142–147, September 2005. View at Scopus
  2. N. Dhanwada, R. A. Bergamaschi, W. W. Dungan et al., “Transaction-level modeling for architectural and power analysis of PowerPC and CoreConnect-based systems,” Design Automation for Embedded Systems, vol. 10, no. 2-3, pp. 105–125, 2005. View at Publisher · View at Google Scholar · View at Scopus
  3. M. Lang, “System C for Embedded System Design,” Seminar, 2006, http://dl.acm.org/citation.cfm?id=339657.
  4. S. Boukhechem and E. B. Bourennane, “TLM platform based on systemC for STARSoC design space exploration,” in Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS '08), pp. 354–361, June 2008. View at Publisher · View at Google Scholar · View at Scopus
  5. S. Hauck and A. Dehon, Reconfigurable Computing the Theory and Practice of FPA-Based Computing, Elsevier, 2008.
  6. N. Voros, A. Rosti, and M. Hübner, XDynamic System Reconfiguration in Heterogeneous Platforms, Elsevier, 2009.
  7. M. Kuehnle, A. Wagner, and J. Becker, “A statistical power estimation methodology embedded in a SystemC code translator,” in Proceedings of the 24th Symposium on Integrated Circuits and Systems Design (SBCCI '11), pp. 79–84, IEEE Computer Society, 2011.
  8. Forte, “Cynthesizer,” 2011, http://www.forteds.com/products/cynthesizer.asp.
  9. M. Graphics, “Catapultc,” 2011, http://www.mentor.com/esl/catapult/overview.
  10. Cadence, “C-to-silicon compiler,” 2011, http://www.cadence.com/products/ sd/silicon compiler/pages/default.aspx.
  11. N. Bombieri, “Hif suite 2.0: hdl translating and manipulation tools,” 2009, http://hifsuite.edalab.it/.
  12. University of Cincinnati, “Savant,” 2011, http://www.cliftonlabs.com/savant/.
  13. M. e. a. Bocchi, “A system level IP integration methodology for fast SOC design,” in Proceedings International Symposium on System-on-Chip, pp. 127–130, 2003.
  14. G. B. Vece and M. Conti, “Power estimation in embedded systems within a SystemC-based design context: the PKtool environment,” in Proceedings of the 7th Workshop on Intelligent Solutions in Embedded Systems (WISES '09), pp. 179–184, June 2009. View at Scopus
  15. M. Giammarini, M. Conti, and S. Orcioni, “System-level energy estimation with Powersim,” in Proceedings of the 18th IEEE International Conference on Electronics, Circuits and Systems (ICECS '11), pp. 723–726, December 2011.
  16. D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: a framework for architectural-level power analysis and optimizations,” ACM SIGARCH Computer Architecture News, vol. 28, no. 2, pp. 83–94, 2000, http://portal.acm.org/citation.cfm?id=339657.
  17. G. S. Silveira, A. V. Brito, and E. U. K. Melcher, “Functional verification of power gate design in systemc RTL,” in Proceedings of the 22nd Symposium on Integrated Circuits and Systems Design (SBCCI '09), I. S. Silva, R. P. Ribas, and C. Plett, Eds., ACM, September 2009. View at Publisher · View at Google Scholar · View at Scopus
  18. A. V. Brito, M. Kühnle, M. Hübner, J. Becker, and E. U. K. Melcher, “Modelling and simulation of dynamic and partially reconfigurable systems using System C,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI ’07), pp. 35–40, 2007. View at Publisher · View at Google Scholar · View at Scopus
  19. U. Tuebingen, 2001, http://www-ti.informatik.uni-tuebingen.de/systemc/.
  20. A. Wagner, Diplomarbeit Randbedingungen der HW-Modellierung auf RTL-und Systemebene, 2011.