Abstract

With hundreds of processing elements (PEs) forecast per chip, future embedded systems will be able to handle multiple applications with very diverse running constraints. Such systems will integrate distributed decision capabilities. In order to control power and temperature, dynamic voltage and frequency scaling (DVFS) is applied at the PE level. At the system level, this implies dynamically managing the voltage/frequency couple of each tile to obtain a global optimization. This paper introduces a scalable multiobjective approach based on game theory, which adjusts the frequency of each PE at run time. It aims at reducing the tile temperature while maintaining the synchronization between application tasks. Results show that the proposed run-time algorithm requires an average of 20 calculation cycles to find the solution for a 100-processor platform and reaches performance equivalent to that of an offline method. Temperature reductions of about 23% were achieved on a demonstrative test case.

1. Introduction

“MP-SoC is not just coming: it has arrived” [1]. Intel's Tera Scale computing research program [2] is one illustration of today's reality: a prototype is composed of 80 processors interconnected with a network-on-a-chip (NoC) [3, 4]. In the near future, an increasing number of reconfigurable processing elements on a single chip is forecast, leading to new challenges for system-level designers. On the technological side, the variability between dies requires chip-by-chip optimization. Offline methods are no longer feasible when hundreds of processors are integrated in advanced technologies. Thus, on-chip distributed techniques are required to adjust the parameters of each chip. On the application side, MP-SoC platforms will support several applications, including some that are unknown at design time. Run-time adaptability is therefore required to optimize application parameters.

In this paper, we consider MP-SoC platforms integrating hundreds of reconfigurable processing elements (PEs), as shown in Figure 1, each of which includes processors, memories, and peripherals. It is assumed that each PE embeds the components required to locally adjust its parameters. For instance, dynamic voltage/frequency scaling (DVFS) tunes the voltage/frequency couple of each tile. In this context, an important issue is how to manage the tradeoff between the performance achieved and the die temperature, which is an indirect indicator of the power consumption. It is a multiobjective optimization problem [5]: how to manage power and temperature for hot-spot reduction [6] while controlling performance through task synchronization for a given application with functional dependencies [7].

The main contribution of this paper is the use of game theory [8] as a model to dynamically optimize MP-SoC platforms in a distributed way. A run-time game-theoretic method that chooses the frequency of each PE in an MP-SoC platform, in order to optimize the circuit temperature while taking the task synchronization into account, was presented in [9]. In this article, this technique is reviewed, a statistical analysis of the algorithm convergence and scalability is carried out, and a demonstrative test case is presented.

1.1. Problem Formulation

Consider an MP-SoC architecture composed of several reconfigurable processing elements (PEs). It is assumed that future MP-SoC will integrate a large number of PEs leading to the choice of a distributed control architecture. Each PE integrates devices such as processors, memories, peripherals, and dynamic voltage/frequency scaling (DVFS). The choice of distributed DVFS is justified in [10, 11] while in [12, 13], a whole demonstrator is presented.

It is assumed that synchronous PEs are connected by an asynchronous 2D-mesh NoC as in [14]. The interconnect system offers the required bandwidth and latency for the targeted applications, as well as the ability of individual frequency selection.

Consider an application composed of tasks and connections as illustrated by the example described in Figure 2(a). It is assumed that the application is supported by the MP-SoC and that it is mapped on the platform by a given mechanism. At the application level, the functional dependencies between tasks require adjusting the frequency of each PE in order to guarantee the task synchronization [7].

At the physical level, the frequency choice is influenced by the temperature metric. In this paper, we consider a first-order approximation in which the temperature of a given PE is affected by its neighbors [6, 15], as shown in Figure 2(b). The two constraints meet when the application is mapped on the MP-SoC (Figure 2(c)). A tradeoff between the synchronization and the temperature metrics has to be found. This is the multiobjective optimization problem solved in this paper.

1.2. Paper Organization

The paper is organized as follows. Section 2 presents some related works regarding dynamic and static optimizations for embedded reconfigurable systems. The basic formalisms, definitions, and notations of game theory are presented in Section 3. In Section 4, we present the multiobjective optimization notation and we formulate our task synchronization and temperature models. Based on discussed models and using the preliminary presented theory, a game-theoretic optimization algorithm is proposed. Simulations were performed on several scenarios to study the feasibility of this approach in the MP-SoC context. Section 5 analyzes the dynamic behavior and its optimization quality. Finally, a demonstrative test case is evaluated in Section 6.

2. Related Works

Several optimization methods are used to get the best tradeoff between system metrics for a given architecture. For instance, designers try to obtain the best performance-to-power-consumption ratio (MIPS/mW). Optimization may be applied at different stages, either at design time (static optimization) or at run time (dynamic optimization), or both, as described in the following paragraphs.

Several works propose some static optimization techniques for given metrics. In [16], the authors developed a new framework based on integer linear programming (ILP) solvers and constraint programming to solve at design time the task allocation/scheduling problem. In this domain, there are also several significant contributions such as [17, 18] for workload optimization and [19] for energy savings.

More recently, there has been a growing interest in run-time optimization. When dynamic methods are considered, most of the proposed approaches are global techniques: the optimization decisions are taken considering the whole system. In [20], heuristics for optimal task placement are discussed for an NoC-based heterogeneous MP-SoC. There are several approaches where tasks are moved in order to balance the computation workload and dissipate the power homogeneously, for example, in [6, 21]. The slow reaction time, the requirement of unused tiles, and the memory or the large number of data transfers between tiles can limit the use of these techniques for certain applications. Moreover, the implementation of these techniques is limited to functionally homogeneous arrays.

In [22, 23], the authors propose a design-time Pareto exploration and characterization combined with run-time management [24]. In [15], a convex optimization method is used in a 2-phase algorithm that allows frequency assignment on an MP-SoC while controlling hot-spots. These approaches become prohibitive because the applications must be completely characterized at design time for an efficient implementation. In [25, 26], voltage and frequency are chosen at run time by centralized mechanisms. These solutions do not scale with the number of PEs and are therefore not well suited to future multiprocessor platforms [1].

Our contribution is based on the use of game theory for dynamic optimization. Game theory has been widely used in other domains such as economics, sociology, and biology. In the VLSI context, game theory was first used for circuit synthesis in [27]. To the best of our knowledge, it has never been applied to the run-time optimization of embedded systems.

3. Game Theory

The objective of this section is to introduce the notations used in game theory and the common solution known as the Nash equilibrium (NE) [28]. Game theory was introduced by Von Neumann and Morgenstern [8]; it can be viewed as a branch of applied mathematics that studies interactions among rational individuals or decision makers.

3.1. Noncooperative Games

A game is a scenario with several players interacting by actions and consequences [29]. Basically, the players, or decision makers, individually choose how they act, resulting in consequences or outcomes. Each decision maker tries to maximize its own outcome according to its preferences, leading to a global optimization. Since each decision is made without requiring the cooperation of the other players, the game is called noncooperative.

Mathematically, such a game in normal form (the normal form is the way in which the game is described) with $n$ players is described as
$$\mathcal{G} = \{N, S_i, u_i\}, \quad i \in N,$$
where $N = \{1, \dots, n\}$ is the set of players; $S_i$ is the set of actions for player $i$; and $u_i$ is a function describing its outcomes. The discrete set of actions or strategies of player $i$ is defined as
$$S_i = \{s_{i1}, s_{i2}, \dots, s_{im_i}\},$$
where $m_i$ is the number of possible actions for this player. The outcome of player $i$ is represented by a score: the higher the score, the nearer the optimal point. Because of dependencies with other players, the score or utility is a function of the choices of the current player and the choices of the other players:
$$u_i = u_i(s_i, s_{-i}),$$
where $s_{-i}$ denotes the strategies chosen by all players other than $i$.

For MP-SoC platforms, the choice of the game type to be applied is driven by the complexity, for low-cost implementation, and by the number of steps of the game, for the run-time feature. For these reasons, the noncooperative normal-form simultaneous repeated game model has been chosen.

3.2. The Nash Equilibrium

Nash proved that each $n$-player noncooperative game has at least one equilibrium point, known as the Nash equilibrium (NE). It can be defined by pure strategies or by mixed strategies. In pure strategies, the solutions are obtained by allowing only one action per player. In mixed strategies, on the other hand, the solution is chosen in a set of actions, each played with a given probability [28]. For pure strategies, an equilibrium point is defined as follows.

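Definition 1. For a given game $\mathcal{G}$, a solution $s^* = (s_1^*, \dots, s_n^*)$ is a Nash equilibrium if, for all $i \in N$ and all $s_i \in S_i$,
$$u_i(s_i^*, s_{-i}^*) \ge u_i(s_i, s_{-i}^*),$$
where $s_i^*$ is the Nash equilibrium strategy for player $i$.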

Note that in an NE, players cannot improve their outcomes by unilaterally changing their strategies, indicating then that it is at least a local maximum.
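As an illustration of Definition 1, the following minimal Python sketch enumerates the pure-strategy Nash equilibria of a small game by testing every unilateral deviation. The function names and the two-player coordination game are hypothetical and are chosen only to show that an NE may be a local rather than a global maximum.

```python
from itertools import product

def is_pure_nash(utilities, actions, profile):
    """Return True if no player can improve its utility by deviating alone."""
    for i, u_i in enumerate(utilities):
        current = u_i(profile)
        for alt in actions[i]:
            deviated = profile[:i] + (alt,) + profile[i + 1:]
            if u_i(deviated) > current:
                return False
    return True

def pure_nash_equilibria(utilities, actions):
    """Brute-force search over all joint action profiles."""
    return [p for p in product(*actions) if is_pure_nash(utilities, actions, p)]

def coordination_utility(profile):
    """Hypothetical payoff: matching on action 1 pays 2, on action 0 pays 1."""
    if profile == (1, 1):
        return 2
    if profile == (0, 0):
        return 1
    return 0

actions = [(0, 1), (0, 1)]
utilities = [coordination_utility, coordination_utility]
equilibria = pure_nash_equilibria(utilities, actions)
```

In this toy game both `(0, 0)` and `(1, 1)` are equilibria, although only `(1, 1)` gives the highest score: from `(0, 0)` neither player gains by changing its action alone.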

4. Game-theoretic Multiobjective Optimization

In this section, a new algorithm based on game theory is proposed to solve the multiobjective optimization issue at run time. The mathematical description of multiobjective optimization problems is formulated. Then, the system-level metrics are discussed and finally, the game-theoretic model and the distributed optimization algorithm are proposed.

4.1. Multiobjective Optimization

The multiobjective optimization problem is defined as how to find a vector $x^*$ that optimizes a vector function $F(x)$ whose elements represent $k$ normalized objective functions $\phi_j(x)$. These functions are mathematical descriptions of performance criteria, usually in conflict. That is,
$$\text{optimize } F(x) = [\phi_1(x), \phi_2(x), \dots, \phi_k(x)],$$
where $F(x)$ produces solutions in a $k$-dimension space. The problem is converted into a single-objective optimization problem thanks to the linear combination of the metrics [5]:
$$F(x) = \sum_{j=1}^{k} \omega_j \phi_j(x), \tag{7}$$
where $\sum_{j=1}^{k} \omega_j = 1$ and $\omega_j \ge 0$ for $j = 1, \dots, k$. The weights $\omega_j$ represent the importance of each metric in the combination $F(x)$.

This principle, combined with game theory, can be used to optimize the task synchronization while minimizing the temperature of the blocks through the run-time selection of the PE frequency. Thus, (7) can be written for each PE $i$ of the MP-SoC as
$$F_i(f) = \omega_{S,i}\, M_{S,i}(f) + \omega_{T,i}\, M_{T,i}(f), \tag{8}$$
where $f$ is the frequency of each PE, $M_{S,i}$ represents the synchronization function, and $M_{T,i}$ is the temperature metric for each PE. Finally, $\omega_{S,i}$ and $\omega_{T,i}$ are the relative importance of each metric in the optimization process.
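The weighted combination of two normalized metrics can be sketched in a few lines of Python. The function name `weighted_sum`, the tolerance, and the example metric values below are illustrative assumptions rather than values from this work.

```python
def weighted_sum(objectives, weights, tol=1e-9):
    """Combine normalized objective values into one score using convex weights."""
    if len(objectives) != len(weights):
        raise ValueError("one weight per objective is required")
    if any(w < 0.0 for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * o for w, o in zip(weights, objectives))

# Hypothetical normalized metrics for one PE (0 is best, more negative is worse).
sync_value, temp_value = -0.25, -0.5
score = weighted_sum([sync_value, temp_value], [0.75, 0.25])
```

With a synchronization weight of 0.75, the score is dominated by the synchronization term; swapping the two weights would make the temperature term dominate instead.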

4.2. System-Level Metrics
4.2.1. Task Synchronization Model

The application synchronization problem is defined as the choice of the best working frequency for each processor. In [30], the authors try to equalize the input and output rates of application nodes in order to maintain the processor rate. Based on this principle, we propose a simple task synchronization model based on the load assigned to each PE and its frequency.

Assume that block $i$ supports the task $t_i$ and transfers the generated data to block $j$, which is running the task $t_j$. They are synchronized when they are working at the same mean performance, that is, the data produced by block $i$ are entirely consumed by block $j$. Otherwise, if block $i$ is faster than block $j$, some data produced by the first block are not immediately consumed by the second one. On the contrary, if block $i$ is slower than block $j$, the second one will have undesirable idle cycles. In both cases, they are not synchronized.

Following this principle, we define the synchronism between two blocks by the following equation:
$$\mathrm{sync}_{ij} = -\left| \frac{\kappa_i f_i}{L_i} - \frac{\kappa_j f_j}{L_j} \right|, \tag{9}$$
where $f_i$ and $f_j$ are the frequencies of blocks $i$ and $j$, and $L_i$ and $L_j$ are their respective computation loads. Parameters $\kappa_i$ and $\kappa_j$ are considered to normalize heterogeneous PEs. In the rest of the paper, they are assumed to be 1 for a homogeneous MP-SoC. The result of this equation is zero when the blocks are synchronized, and a negative value otherwise.

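Equation (9) can now be used to build the first metric for each PE as follows:
$$M_{S,i} = \frac{1}{c_i} \sum_{j=1}^{n} v_{ij}\, \mathrm{sync}_{ij}, \tag{10}$$
where $c_i$ is the number of connections and $V_i$ is the connectivity vector of task $i$ for the application:
$$V_i = [v_{i1}, v_{i2}, \dots, v_{in}],$$
where $v_{ij} = 1$ if task $i$ is connected to task $j$ and $v_{ij} = 0$ otherwise.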

Finally, the problem of global task synchronization results in the choice of the frequency that maximizes $M_{S,i}$ for each PE.
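The synchronization model can be sketched in Python as follows. The exact expressions are assumptions: the pairwise term is taken as the negated absolute difference of the normalized rates (zero when synchronized, negative otherwise), and the per-PE metric as its average over the connected tasks. All names and example values are hypothetical.

```python
def pair_sync(f_i, load_i, f_j, load_j, k_i=1.0, k_j=1.0):
    """Assumed pairwise synchronism: 0 when both rates match, negative otherwise."""
    return -abs(k_i * f_i / load_i - k_j * f_j / load_j)

def sync_metric(i, freqs, loads, connectivity):
    """Average pairwise synchronism of PE i over the tasks it is connected to."""
    links = [j for j, connected in enumerate(connectivity[i]) if connected]
    if not links:
        return 0.0  # an isolated task has nothing to synchronize with
    total = sum(pair_sync(freqs[i], loads[i], freqs[j], loads[j]) for j in links)
    return total / len(links)

# Hypothetical 3-task chain 0 - 1 - 2 on a homogeneous platform.
connectivity = [[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]]
loads = [1.0, 2.0, 1.0]
freqs = [100.0, 200.0, 150.0]  # MHz
metrics = [sync_metric(i, freqs, loads, connectivity) for i in range(3)]
```

Here tasks 0 and 1 run at the same normalized rate (100 and 200/2), so only the link between tasks 1 and 2 is penalized.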

4.2.2. Temperature Model

Considering a fixed voltage, the temperature of a block depends on the power consumption and the effect induced by other blocks. In a simplified model, we neglect the static consumption. We consider the dynamic power consumption of the PE as $P = \alpha f V^2$, with $f$ the clock frequency, $V$ the supply voltage, and $\alpha$ a given circuit constant.

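Following [31], the transfer thermal resistance $R_{ij}$ of $PE_i$ with respect to $PE_j$ is
$$R_{ij} = \frac{\Delta T_{ij}}{P_j},$$
where $\Delta T_{ij}$ is the temperature rise of $PE_i$ due to the power $P_j$ dissipated by $PE_j$.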

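Algorithm 1: UnilaterallyMax, the unilateral maximization run by player $i$.
Require: MyStgy, OtherStgies
Ensure: NewStgy
NewStgy $\leftarrow$ MyStgy
for all Stgy $\in S_i$ do
    if $u_i$(Stgy, OtherStgies) $>$ $u_i$(NewStgy, OtherStgies) then
        NewStgy $\leftarrow$ Stgy
    end if
end for

Algorithm 2: Simulation of the simultaneous game.
for all GameCycle do
    for all $i \in N$ do
        NewStgy$_i$ $\leftarrow$ UnilaterallyMax(player-$i$)
    end for
end for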

The temperature of each PE in an $n$-processor array is calculated by
$$T_i = \sum_{j=1}^{n} R_{ij} P_j, \tag{13}$$
where $P_1, \dots, P_n$ are the power consumptions of the $n$ PEs. For simplification purposes, it is assumed in this work that the PEs are only affected by the power dissipated by their nearest neighbors. In a regular 2-dimension mesh array, this means the PEs located at the north, south, east, and west positions.

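Finally, expression (13) is reformulated for each PE in order to express the reduction of temperature as an optimization metric:
$$M_{T,i} = -\mathbf{R}_i \cdot \mathbf{P}, \tag{14}$$
with $\mathbf{R}_i$ a vector containing the transfer thermal resistances of the PEs affecting the temperature of $PE_i$, and $\mathbf{P}$ the vector of their power consumptions.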
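A minimal Python sketch of this nearest-neighbor temperature model is given below. It assumes row-major PE numbering on the mesh, one self thermal resistance and one neighbor thermal resistance shared by all PEs, and normalized units; the function names and numeric values are hypothetical.

```python
def dynamic_power(freq, vdd=1.0, alpha=1.0):
    """Dynamic power alpha * f * V^2 (static consumption neglected)."""
    return alpha * freq * vdd ** 2

def mesh_neighbors(idx, rows, cols):
    """Indices of the north, south, west, and east neighbors of PE idx."""
    r, c = divmod(idx, cols)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [rr * cols + cc for rr, cc in candidates
            if 0 <= rr < rows and 0 <= cc < cols]

def temperature(idx, freqs, rows, cols, r_self, r_neighbor):
    """Temperature rise of PE idx from its own power and its nearest neighbors."""
    own = r_self * dynamic_power(freqs[idx])
    coupled = sum(r_neighbor * dynamic_power(freqs[j])
                  for j in mesh_neighbors(idx, rows, cols))
    return own + coupled

def temperature_metric(idx, freqs, rows, cols, r_self, r_neighbor):
    """Negated temperature, so that a cooler PE scores higher."""
    return -temperature(idx, freqs, rows, cols, r_self, r_neighbor)

freqs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # normalized frequencies on a 2 x 3 mesh
t4 = temperature(4, freqs, rows=2, cols=3, r_self=2.0, r_neighbor=0.5)
```

PE 4 sits in the middle of the bottom row, so it is heated by its own power and by PEs 1, 3, and 5; a corner PE such as PE 0 has only two neighbors.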

4.3. Game-Theoretic Method

The tradeoff problem discussed in Section 1.1 is modeled as a game (as explained in Section 3) with multiple objectives (Section 4.1). The components of such a game are identified and an algorithm is proposed to solve the formulated problem.

4.3.1. Game Model

As stated in Section 3.1, a game is composed of players ($N$), a set of actions per player ($S_i$), and the outcomes ($u_i$). For the given tradeoff problem, the PEs are considered as the player set. The tradeoff variable is the clock frequency of each PE. Thus, the set of actions of each player is the set of possible frequencies $f_i$. Setting the frequency step makes it possible to deduce the strategy space $S_i$. For example, with a 5 MHz step within the bounds of 100 to 300 MHz, the strategy space is as follows:
$$S_i = \{100, 105, 110, \dots, 295, 300\}\ \text{MHz}.$$

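The outcome or utility function is described by the linearized version of the multiobjective optimization problem, that is, by using (8):
$$u_i(f_i, f_{-i}) = \omega_{S,i}\, M_{S,i}(f_i, f_{-i}) + \omega_{T,i}\, M_{T,i}(f_i, f_{-i}), \tag{16}$$
where $M_{S,i}$ and $M_{T,i}$ are, respectively, the objective functions of (10) and (14), and $\omega_{S,i}$ and $\omega_{T,i}$ are the synchronization and temperature weights of the optimization for block $i$.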
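The example strategy space of this section (100 to 300 MHz in 5 MHz steps) can be enumerated with the short Python sketch below. The helper name `strategy_space` and the 6-PE platform used to size the joint space are illustrative assumptions.

```python
def strategy_space(f_min_mhz, f_max_mhz, step_mhz):
    """Enumerate the discrete frequencies (in MHz) available to one PE."""
    if step_mhz <= 0 or f_max_mhz < f_min_mhz:
        raise ValueError("expected a positive step and f_min <= f_max")
    return list(range(f_min_mhz, f_max_mhz + 1, step_mhz))

space = strategy_space(100, 300, 5)

# Size of the joint search space for a hypothetical 6-PE platform.
joint_profiles = len(space) ** 6
```

Each player only has to scan its own 41 frequencies, whereas the number of joint frequency assignments grows exponentially with the number of PEs, which is one motivation for a distributed scheme.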

4.3.2. Distributed Algorithm

The method used to select the choice is defined as unilateral maximization: each player chooses the action maximizing its own utility function. The choices of other players are considered as given parameters. The maximization can be performed by running Algorithm 1 for each PE, where $u_i$ is the utility function of PE $i$, MyStgy is its last chosen strategy, and OtherStgies are the strategies chosen by the other players in the previous cycle. Note that this code implements the utility maximization by comparing the outcomes of all possible solutions per player.

The code is embedded in each PE and is executed simultaneously at run time, allowing parallel distributed optimization. The period between two executions is defined as a game cycle. It is assumed that all PEs finish the current execution of the maximization process before the next game cycle starts.

In order to analyze the global behavior, the parallel launching of Algorithm 1 is simulated by Algorithm 2. All PEs launch the unilateral maximization at the same time, each taking into account the last known choices of the other players. Note that once a new application is mapped or the application load changes, the game-theoretic algorithm adjusts the PE frequencies.
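The following Python sketch mirrors Algorithms 1 and 2 on a deliberately tiny example. The utility used here is an assumption: two connected PEs with a synchronization-only score (negated absolute difference of frequency over load). The names `play`, `make_sync_utility`, and `LOADS`, as well as the cycle limit, are hypothetical.

```python
def unilaterally_max(utility, my_stgy, other_stgies, strategies):
    """Algorithm 1: keep the strategy with the highest utility, others fixed."""
    new_stgy = my_stgy
    for stgy in strategies:
        if utility(stgy, other_stgies) > utility(new_stgy, other_stgies):
            new_stgy = stgy
    return new_stgy

def play(utilities, strategies, initial, max_cycles=50):
    """Algorithm 2: every player updates from the profile of the previous cycle."""
    profile = list(initial)
    for cycle in range(max_cycles):
        updated = [unilaterally_max(utilities[i], profile[i], profile, strategies[i])
                   for i in range(len(profile))]
        if updated == profile:
            return profile, cycle  # number of cycles in which a choice changed
        profile = updated
    return profile, None  # no stable profile within the cycle limit

LOADS = [1.0, 2.0]  # hypothetical computation loads of the two PEs

def make_sync_utility(i):
    """Assumed synchronization-only utility of PE i in a 2-PE game."""
    j = 1 - i
    def utility(stgy, profile):
        return -abs(stgy / LOADS[i] - profile[j] / LOADS[j])
    return utility

space = list(range(100, 301, 5))  # MHz
utilities = [make_sync_utility(0), make_sync_utility(1)]
stable, cycles = play(utilities, [space, space], (100, 100))
_, stuck = play(utilities, [space, space], (200, 200))
```

Starting from (100, 100) MHz, the second PE moves to 200 MHz and the profile then stays fixed. Starting from (200, 200) MHz, the simultaneous updates alternate between two profiles and never settle, which is the kind of oscillation that a simultaneous game can exhibit.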

5. Results

The results presented in this section include the characterization of the algorithm behavior. Two aspects are analyzed. On the one hand, since the objective of this work is to provide a run-time optimization mechanism, the dynamic response of the algorithm is of major interest: the convergence speed of the system is studied. On the other hand, in order to characterize the quality of the solution, it has to be compared with an existing optimization method.

In this section, these two aspects are analyzed from a statistical point of view. Firstly, the exploration methodology is discussed. Secondly, the convergence results are presented regarding the system scalability. Then, an exploration of the impact of the objective weights on the dynamic response is presented. This section ends with a comparison of the optimization quality of the game-theoretic solution with an offline method.

5.1. Methodology

The game complexity is defined by the number of PEs needed by the application (i.e., the number of players), the application connectivity (i.e., the interaction between players), and the multiobjective function (i.e., the utility function). In order to study the scalability of our technique, the application size is explored in a range of 4 to 100 processors. For each evaluated case, the maximum application connectivity is explored from 2 links per task to full connectivity ($n - 1$, where $n$ is the number of tasks). The third aspect, the utility function, is explored by changing the objective weights $\omega_S$ between 0 and 1 and $\omega_T$ between 1 and 0.

A first analysis of the convergence speed is done by fixing the objective weights $\omega_S$ and $\omega_T$ and then repeating several different scenarios. In order to obtain statistical results, each parameter combination (i.e., size and application connectivity) is simulated 1000 times. Each time, a new application graph is defined. The simulation procedure is implemented by Algorithm 3.

Algorithm 3: Simulation procedure.
for Size = 4 to 100 do
    for Connectivity = 2 to Size − 1 do
        for Scenario = 1 to 1000 do
            Generate random application
            Run reference optimization analysis
            Save results
        end for
    end for
end for
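The "Generate random application" step is not detailed in the text. One possible way to draw a random task graph with a bounded number of links per task is sketched below in Python; the load range, the greedy edge selection, and the function name are assumptions made for illustration only.

```python
import random

def random_application(n_tasks, max_links, rng):
    """Return (loads, connectivity) for a random task graph with bounded degree."""
    loads = [rng.uniform(0.5, 2.0) for _ in range(n_tasks)]  # assumed load range
    connectivity = [[0] * n_tasks for _ in range(n_tasks)]
    degree = [0] * n_tasks
    pairs = [(i, j) for i in range(n_tasks) for j in range(i + 1, n_tasks)]
    rng.shuffle(pairs)
    for i, j in pairs:
        # Add an undirected link only while both tasks are below the limit.
        if degree[i] < max_links and degree[j] < max_links:
            connectivity[i][j] = connectivity[j][i] = 1
            degree[i] += 1
            degree[j] += 1
    return loads, connectivity

loads, conn = random_application(n_tasks=9, max_links=2, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps each scenario reproducible, which is convenient when the same application has to be replayed with a reference optimizer.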

In a second convergence analysis, the impact of the objective weights is explored. Several random 25-task applications are evaluated for each pair ($\omega_S$, $\omega_T$).

5.2. Convergence

The dynamic response is analyzed by measuring the convergence speed as the number of game cycles that the algorithm takes to reach a solution. Statistically, the convergence speed follows a Gaussian distribution whose mean and standard deviation depend on the game parameters (number of processors and actions). For example, Figure 3 shows the distribution obtained over 98000 simulations of a 100-processor platform with random applications, with connectivity ranging from 2 links per PE to full connectivity. This curve indicates that a 100-processor MP-SoC typically finds a solution in around 18 game cycles and usually in less than 40 cycles, regardless of the application.

The analysis of the mean and the standard deviation of the convergence speed is repeated for the simulation results in a range of 4 to 100 processors. Figure 4 shows the results for different sizes. The graph shows 3 regions corresponding to the concentration of 68%, 95%, and 99.7% of the cases, that is, the first, second, and third standard deviations, respectively. For instance, 99.7% of applications converge in less than 32 game cycles in a 36-processor system. Note that the number of game cycles grows slowly with the platform size, showing that the algorithm scales well with the number of processors.
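The three regions correspond to the usual 68/95/99.7 rule of a Gaussian distribution. As a small illustration, the Python sketch below computes the upper limits of these regions from a list of measured convergence times; the sample values are made up and the helper name is hypothetical.

```python
import statistics

def sigma_bounds(cycles):
    """Upper limits mean + k * std (k = 1, 2, 3) of a list of convergence times."""
    mean = statistics.mean(cycles)
    std = statistics.pstdev(cycles)  # population standard deviation
    return {k: mean + k * std for k in (1, 2, 3)}

samples = [16, 20, 16, 20]  # made-up convergence times, in game cycles
bounds = sigma_bounds(samples)
```

For real simulation data, `bounds[3]` would play the role of the 99.7% limit quoted above for a given platform size.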

A game cycle is composed of two phases. The first one consists of the communication of the decisions of all PEs. The second phase is the maximization of the utility function. A generic 8051 microcontroller has been used to illustrate the duration of a game cycle. It takes around 400 clock cycles to process the algorithm, that is, 800 nanoseconds at a 500 MHz frequency. On the other hand, the communication latency in a 2D-mesh NoC is estimated as follows. The longest path in an $n$-processor system is $2\sqrt{n} - 1$ hops. It is assumed that there is no conflict in the interconnect. Considering that in the asynchronous NoC the equivalent node latency is around one cycle of the given 500 MHz clock, the maximum estimated communication delay is 38 nanoseconds per game cycle for a 100-processor MP-SoC. Table 1 summarizes the speed results measured in number of game cycles and in the equivalent estimated times.
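The timing figures above can be checked with a few lines of Python. The hop count used here, $2\sqrt{n} - 1$ for a square mesh, is an assumption chosen because it reproduces the 38 ns delay quoted for 100 processors; the constant and function names are illustrative.

```python
import math

CLOCK_HZ = 500e6          # clock frequency quoted in the text
ALGO_CLOCK_CYCLES = 400   # cycles needed by the 8051 to run the maximization
PERIOD_NS = 1e9 / CLOCK_HZ

def compute_time_ns():
    """Duration of the utility maximization phase."""
    return ALGO_CLOCK_CYCLES * PERIOD_NS

def comm_time_ns(n_processors):
    """Worst-case communication delay, one clock period per node on the path."""
    side = math.isqrt(n_processors)  # assumes a square mesh
    return (2 * side - 1) * PERIOD_NS

def game_cycle_ns(n_processors):
    """One game cycle: communication phase followed by maximization phase."""
    return comm_time_ns(n_processors) + compute_time_ns()

total_us = 20 * game_cycle_ns(100) / 1000.0  # 20 game cycles, in microseconds
```

Under these assumptions, the computation phase dominates the game cycle, so 20 game cycles on a 100-processor platform take on the order of 16 to 17 microseconds.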

From the statistical simulations, it is observed that 94% of the evaluated scenarios converge to a solution. The other 6% of cases do not converge to a unique solution but oscillate between two or more frequencies for each PE. It is assumed that in these conflicting cases an external mechanism such as task migration [32, 33] is used to solve the problem.

5.3. Impact of Weights

In order to study the impact of the objective weights on the convergence speed, 50 000 simulations are performed over a 25-processor array where each PE drives its frequency between 100 and 200 MHz with a step of 10 MHz. Applications are defined with random loads and connections as in the previous experiment. For each simulation, a new application is randomly generated. The results are shown in Figure 5. The $z$-axis represents the number of scenarios found for each convergence speed on the $x$-axis and for a given metric weight. The $y$-axis explores the synchronization weight $\omega_S$ from 0 to 1, corresponding to $\omega_T$ from 1 down to 0.

The results show that convergence slows down as $\omega_S$ increases, meaning that it depends on the metric complexity. For instance, for the trivial case of $\omega_S = 0$, all the frequencies are driven to the minimum value in the first game cycle. This is because only the temperature is improved, without taking care of the task synchronization. On the other hand, the slowest convergence observed in Figure 5 averages 10 game cycles.

5.4. Optimization Quality

In the game-theoretic model, each player tries to optimize its outcome by maximizing the utility $u_i$. From a global point of view, the outcome is described by
$$U(f) = [u_1(f), u_2(f), \dots, u_n(f)].$$
The global optimization problem is then formulated as
$$\max_{f} \min_{i \in N} u_i(f) \quad \text{subject to} \quad f_i^{\min} \le f_i \le f_i^{\max},$$
where $f_i^{\min}$ and $f_i^{\max}$ are the lower and upper bounds of the strategy space of player $i$, that is, the minimum and maximum frequencies. This formulation is known as the minimax problem [34].

Using the Matlab minimax solver, worst and best bounds are found for each simulated scenario. The game-theoretic solution is then positioned between these two references, which characterizes the quality of the solution found. A total of 10 000 simulations of the procedure used in Section 5.2 are analyzed, calculating the optimization percentage achieved in each case. The results are presented as a distribution curve in Figure 6. The distribution has an average of 89% and a peak at 93%. The results are concentrated between 58% and 98%. Note that these results are obtained in a few game cycles, for instance, less than 40 for a 100-processor MP-SoC, as explained in Section 5.2. In contrast, the Matlab minimax algorithm takes between a few seconds and a few minutes to calculate each solution on a current desktop PC.
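How a solution is positioned between the two references is not spelled out in the text. A natural reading, assumed here, is a linear interpolation between the worst and best bounds, as in the following Python sketch; the function name and the numeric values are hypothetical.

```python
def optimization_percentage(value, worst, best):
    """Position `value` between the worst (0%) and best (100%) reference bounds."""
    if best == worst:
        raise ValueError("the two reference bounds must differ")
    return 100.0 * (value - worst) / (best - worst)

# Made-up global outcomes for one scenario (higher is better).
pct = optimization_percentage(value=-1.0, worst=-10.0, best=0.0)
```

Averaging this percentage over many scenarios gives a distribution comparable in spirit to the one reported in Figure 6.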

6. Test Case

For clarity of the demonstration, a very simple test case composed of a 6-task application (Figure 7) mapped on the PE array of Figure 2(c) has been chosen. Each PE is able to adjust its own frequency between 100 and 300 MHz with a step of 5 MHz. The task synchronization is modeled as in Section 4.2.1, while the temperatures of the PEs are calculated as in Section 4.2.2. The transfer thermal resistances are set to arbitrary values (the real values depend on the technology used). The two metrics are combined as in (16) by the weights $\omega_S$ for the task synchronism and $\omega_T$ for the PE temperature.

Four configurations were evaluated. The first one describes a system which is only interested in optimizing the task synchronization, that is, $\omega_S = 1$ and $\omega_T = 0$. In the context of this work, this configuration is used as the reference to calculate the temperature reduction achieved by the other configurations. The second one expresses a scenario where the synchronization represents 75% of the optimization importance, while only 25% is for the temperature minimization ($\omega_S = 0.75$ and $\omega_T = 0.25$). The third configuration defines a case with equal interest for each metric ($\omega_S = 0.5$ and $\omega_T = 0.5$), while the last case gives only 25% of importance to the synchronization ($\omega_S = 0.25$ and $\omega_T = 0.75$). For simplicity, these values are arbitrarily chosen to be the same for all PEs. Nevertheless, PEs may have different constraints. For example, a central PE may have more interest in temperature reduction than a border one, to avoid hot-spots.

In order to measure the quality of the found solution, the same optimization issue is modeled as the minimax problem presented in Section 5.4 and solved using Matlab. The results are compared to the game-theoretic solution.

Figure 8 shows the evolution of the game-theoretic algorithm for the second configuration ($\omega_S = 0.75$ and $\omega_T = 0.25$). Each graph of the figure shows in solid lines the evolution of the chosen frequency. In this example, processor 1 takes only two game cycles to reach a stable solution, while processor 2 needs 3 cycles, and processor 3 needs 4 game cycles. Processor 4 chooses the lowest frequency from the beginning; finally, processors 5 and 6 are the slowest, needing 17 and 16 game cycles, respectively, to reach the solution. In this example, the game reaches the NE in 17 cycles. In dashed lines, Figure 8 shows the optimal solutions found with Matlab. After a few game cycles, the game-theoretic solution converges to an NE close to the Matlab solution.

Tables 2, 3, and 4 summarize the results of the four evaluated configurations. Table 2 lists the frequencies found by the game-theoretic algorithm and by Matlab. Note that in all cases, the solutions found by the game-theoretic algorithm are close to those found with Matlab. The convergence speed of the game-theoretic solution for each configuration is also highlighted. Table 3 presents the resulting temperature of each PE, the average temperature of the entire system, and the gain achieved by configurations 2, 3, and 4 compared to configuration 1. These results show up to 23% of average temperature gain, depending on the chosen weights. Note that these reductions are obtained in a few game cycles, making this technique able to manage the parameters at run time.

In addition, Table 4 lists the improvement percentage of task synchronization. The task synchronization of each PE has been calculated by using expression (10) for a nominal case where all PEs work at 200 MHz. The synchronization improvements for the game-theoretic and Matlab optimizations are calculated for each configuration with respect to the nominal case. The results are listed in Table 4, showing for configuration 1 ($\omega_S = 1$, $\omega_T = 0$) an average improvement of 96.72% for the game-theoretic optimization and 97.69% for the Matlab one compared to the nominal frequency set. Note that for configurations 1, 2, and 3, Matlab obtains better synchronizations at higher temperatures than the game-theoretic algorithm. On the contrary, for configuration 4, Matlab obtains worse synchronizations at lower temperatures compared to our algorithm.

Finally, the maximum, average, and minimum PE temperatures for the game-theoretic algorithm are represented in Figure 9. The results show a more uniform temperature distribution when $\omega_T$ rises. In configuration 1, the difference between the maximum and minimum temperatures amounts to 44% of the average temperature, while in configuration 4 it is only 28% of the average temperature. The run-time game-theoretic method has thus reduced not only the average temperature but also the peaks or hot spots.

7. Conclusion

In this paper, we have presented a novel run-time technique based on game theory. We have discussed the optimization of multiple objectives on embedded reconfigurable systems. We have proposed an algorithm that optimizes the temperature profile while maintaining the task synchronization. Compared to other approaches, our technique assumes a completely distributed multiprocessor system able to take decisions at run time.

The results have shown that our method scales with the number of processors without excessive convergence times. For a 100-processor platform, our technique has required an average of 20 calculation cycles to reach the solution, that is, about 16 µs when using the 8051 microcontroller at 500 MHz. The few calculation cycles needed to converge make this technique able to optimize metrics at run time on massively parallel embedded systems.

We have measured that the achieved optimization is about 89% on average compared to a global offline method. The evaluated test case has shown that our algorithm can achieve reductions of up to 23% in the temperature profile.