Journal of Optimization

Volume 2017, Article ID 8624021, 11 pages

https://doi.org/10.1155/2017/8624021

## Power and Execution Time Optimization through Hardware Software Partitioning Algorithm for Core Based Embedded System

^{1}Laboratory of Electronic and Microelectronic, Faculty of Sciences at Monastir, University of Monastir, 5000 Monastir, Tunisia

^{2}Networked Objects Control & Communication Systems Laboratory, National Engineering School of Sousse, BP 264, Sousse Erriadh, 4023 Sousse, Tunisia

Correspondence should be addressed to Siwar Ben Haj Hassine; siwar.haj.hassine@gmail.com

Received 18 August 2016; Revised 8 January 2017; Accepted 24 January 2017; Published 19 February 2017

Academic Editor: Manlio Gaudioso

Copyright © 2017 Siwar Ben Haj Hassine et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Shortening a product's time to market and improving development efficiency have become vital concerns in embedded system design. Hardware/software (HW/SW) partitioning has therefore become one of the mainstream technologies of embedded system development, since it affects overall system performance. Given today's demand for high efficiency accompanied by high speed, our algorithm is designed to meet these unprecedented requirements. In this paper, we describe a HW/SW partitioning algorithm that aims to find the best tradeoff between the power and the latency of a system while taking the dark silicon problem into consideration. The algorithm has been tested and has shown its efficiency compared with well-known heuristic algorithms, namely, Simulated Annealing, Tabu Search, and Genetic algorithms.

#### 1. Introduction

The exponential rise of embedded systems, along with the persistent quest for higher performance, has made efficient embedded circuits a necessity. Embedded systems have become a leading technology worldwide, since they have penetrated human life to a very large extent. They also play a vital role in industrial and military applications, which require faster and better-performing systems. Unfortunately, most current technologies have only managed to increase system capacity, and thus processing speed, at the cost of a considerable simultaneous increase in power. Excessive power consumption may damage integrated circuits through overheating, limit the degree of transistor integration on a chip, cause signal integrity problems, shorten battery life in portable devices, and require expensive cooling and packaging systems. Moreover, the strong dependence of leakage power consumption on threshold voltage has limited further threshold and supply voltage scaling. Thus, power consumption is rising with technology scaling to the point where chips can no longer be cooled cost-effectively, given the physical limitations imposed by cooling and packaging technologies. This gives rise to the dark silicon problem [1–3]. The concept of dark silicon rests on the constraint that a significant fraction of the transistors on a chip cannot be powered on at nominal voltage for a given thermal design power (TDP) budget and must be power-gated or simply remain dark. The TDP is the maximum amount of power that can be supplied to a chip while keeping the chip temperature below the thermally safe limit. If the TDP is exceeded, the chip temperature rises beyond the cooling capacity, which causes the chip to throttle.

Previous studies [1, 2] have predicted that 50% to 80% of the chip area will be dark for GPU and CPU based systems. To overcome such difficulties, designers have stepped up their efforts to produce systems that consume less power. In this context, some research groups have focused on creating new hardware architectures [4], while others have focused on extending battery life [5]. Yet such solutions require substantial resources that many research groups lack. Consequently, other methods have emerged to obtain systems that consume less power, such as hardware/software partitioning [6, 7].

Traditionally, partitioning was carried out manually, which requires detailed knowledge of circuit operation from designers. Such manual approaches were limited to small designs with a small number of constituent blocks [8, 9]. Since digital systems have become much more sophisticated, automatic HW/SW partitioning has become a necessity. Many research groups have opted for HW/SW partitioning to increase system performance, as in [10, 11]; most of these approaches aim to meet performance constraints while keeping the system cost (area) as low as possible. Unfortunately, none of them took power consumption and execution time into consideration. Hence, we present in this paper an algorithm that searches for a HW/SW partitioning of a data flow graph offering a tradeoff between power and latency, taking the dark silicon problem into account.

The rest of the paper is organized as follows: Section 2 reviews the related literature; Section 3 defines the problem; the proposed partitioning algorithm is presented in Section 4, followed by an illustrative example; the numerical experiments and their discussion are presented in Section 5; finally, the article ends with a conclusion that summarizes the present findings and outlines future research on this theme.

#### 2. Related Work

Recently, a new technology that combines logic elements and memory with an intellectual property (IP) processor core has emerged to meet the growing need for better-performing systems. This technology, called System on Programmable Chip (SoPC), enables and facilitates HW/SW partitioning.

As is generally acknowledged, embedded systems consist of a programmable software part (SW) and an application-specific hardware part (HW). The software part is much easier to develop and modify and consumes less power than the hardware part, but it takes longer to produce the final result. The hardware, in contrast, costs more and consumes more power but provides better performance because it processes faster. For that reason, the purpose of HW/SW partitioning is to design a balanced system that satisfies all system constraints [12]. Most formulations of the HW/SW partitioning problem have been proven NP-hard [13, 14]. Many exact algorithms have been proposed, such as Branch-and-Bound [15], dynamic programming [16], and integer linear programming [17]. However, these exact algorithms tend to be quite slow for larger inputs. Hence, for larger partitioning problems, heuristic algorithms have been the basis of most research, such as the Genetic algorithm (GA) [18], Tabu Search [19, 20], Simulated Annealing [21], Particle Swarm Optimization [22, 23], the Ant algorithm [24, 25], the shuffled frog leaping algorithm [26], and the greedy algorithm [27]. Other designers have combined two heuristic algorithms to solve HW/SW partitioning problems, as in [28], where the authors used a hybrid of the Genetic algorithm (GA) and Tabu Search, while others [29] combined Discrete Particle Swarm Optimization (DPSO) and Branch-and-Bound (B&B) for the same purpose. Besides, the authors of [30] proposed a new heuristic solution based on HW/SW partitioning that aims to reduce the execution time of the overall circuit. Moreover, the authors of [31] introduced IVA-HD, a programmable, true multistandard, full HD video coding engine that adopts HW/SW partitioning to meet the low power and area requirements of the OMAP 4 processor.

With the same goal of power optimization, [32] proposed an approach based on mapping clusters of instructions to a core that yields a high utilization rate of resources and thus minimizes power consumption. This method produced a system that consumes less power at the cost of additional hardware overhead. The shortcoming of the works mentioned above is that they either optimize one parameter at the expense of another important constraint or focus on optimizing a single constraint, such as power or execution time. Also, none of them mentions the dark silicon problem. Dark silicon has become a critical issue for designers, since it can decrease reliability in the nano era [33–35] and lead to soft errors, aging, and even process variations [36, 37]. Recent works have explored the dark silicon problem by applying a very low voltage in order to power on more cores [38] and by proposing new accelerator architectures [39, 40]. The majority of works have handled the dark silicon problem at a low level of codesign, which requires good knowledge of the target circuit and lengthens the product's time to market. Other designers have proposed new architectures that exploit architectural heterogeneity [41–43]. However, such solutions require substantial resources that many research groups lack. In the literature, only a few works have combined high-level synthesis (HLS) with the dark silicon problem, owing to its complexity [44, 45]. The dark silicon problem generally appears in multiprocessor systems-on-chip (MPSoC). However, given the rapid rise of SoCs, it must be taken into consideration even for single-core embedded systems [46]. Motivated by this fact and by the shortcomings of prior work, we set out to develop a new algorithm that aims to produce a faster system that consumes less power without affecting system reliability.

#### 3. Problem’s Definitions

We consider applications that can be modeled using a data flow graph (DFG). A DFG, which is used to create a preliminary overview of the system, is denoted as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of vertices or nodes, which are interconnected by the edges $E$. The edges of the graph represent the dependencies between the components of the system. In general, a node of the graph can represent a basic block [47], a short sequence of instructions [48], a procedure or a function [49], and so on. In this paper, we use four different types of nodes:

(i) A start or an end node.

(ii) A node that includes simple code.

(iii) A node that contains the beginning of a control-construct.

(iv) A node that contains the end of a control-construct.
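As a minimal sketch of how such a graph could be held in memory, the Python snippet below stores the node types and the dependency edges. The class name `DataFlowGraph`, the `NodeType` labels, and the toy graph are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    # The four node kinds listed above; the enum names are illustrative.
    START_END = "start_or_end"
    SIMPLE = "simple_code"
    CONSTRUCT_BEGIN = "control_construct_begin"
    CONSTRUCT_END = "control_construct_end"


@dataclass
class DataFlowGraph:
    # node id -> node type
    nodes: dict = field(default_factory=dict)
    # node id -> list of successor ids (the dependency edges)
    edges: dict = field(default_factory=dict)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[src].append(dst)


# A toy graph (hypothetical): start -> addition -> if ... end if -> end
dfg = DataFlowGraph()
for node_id, node_type in [
    ("start", NodeType.START_END),
    ("add", NodeType.SIMPLE),
    ("if", NodeType.CONSTRUCT_BEGIN),
    ("end_if", NodeType.CONSTRUCT_END),
    ("end", NodeType.START_END),
]:
    dfg.add_node(node_id, node_type)
for src, dst in [("start", "add"), ("add", "if"), ("if", "end_if"), ("end_if", "end")]:
    dfg.add_edge(src, dst)
```

An adjacency list keyed by node id keeps successor lookups cheap, which is what a later traversal of the dependencies would need.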

##### 3.1. Partitions’ Types

Graph partitioning cuts the graph $G$ into possible partitions $P = \{p_1, p_2, \ldots, p_m\}$, where $P$ is the set of all possible partitions, $p_i$ is a possible partition, and $m$ is the number of possible partitions.

There exist two kinds of partitions:

(i) A control-construct partition that includes a whole construct such as **if** to **end if**, **case** to **end case**, and so on.

(ii) A mix partition that could contain either two or more control-constructs or one or more control-constructs combined with a simple node (one that contains a simple construct such as an addition operation).

##### 3.2. Node’s Links

To facilitate the search for control-construct partitions, we use a link parameter. If a node is the beginning or the end of a control-construct, its link value equals 1; for all other node types, the link value equals 0. The link can be defined as follows:

$$\mathrm{link}(v_i) = \begin{cases} 1, & \text{if } v_i \text{ is the beginning or the end of a control-construct} \\ 0, & \text{otherwise} \end{cases}$$
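To illustrate how the link value can drive the search for control-construct partitions, the following Python sketch pairs each construct beginning with its matching end using a stack. It assumes the nodes are available in program order and that constructs are properly nested. The function names and the string labels for node kinds are hypothetical, and this is not a reproduction of the paper's Algorithm 2.

```python
def link(kind):
    """Return 1 for the beginning or end of a control-construct, else 0."""
    return 1 if kind in ("begin", "end") else 0


def control_construct_partitions(nodes):
    """Pair each construct beginning with its matching end.

    `nodes` is a list of (node_id, kind) in program order, with kind one of
    "start_end", "simple", "begin", "end". Returns a list of partitions, each
    the list of node ids from a beginning to its matching end, inclusive.
    """
    partitions = []
    open_begins = []  # stack of positions of unmatched beginnings
    for pos, (_, kind) in enumerate(nodes):
        if link(kind) == 0:
            continue  # simple and start/end nodes never delimit a construct
        if kind == "begin":
            open_begins.append(pos)
        else:
            if not open_begins:
                raise ValueError("construct end without a matching beginning")
            first = open_begins.pop()
            partitions.append([node_id for node_id, _ in nodes[first:pos + 1]])
    if open_begins:
        raise ValueError("construct beginning without a matching end")
    return partitions


nodes = [
    ("start", "start_end"),
    ("if", "begin"),
    ("a=b+c", "simple"),
    ("end_if", "end"),
    ("case", "begin"),
    ("end_case", "end"),
    ("end", "start_end"),
]
parts = control_construct_partitions(nodes)
# parts == [["if", "a=b+c", "end_if"], ["case", "end_case"]]
```

Because only nodes with link value 1 are inspected, simple nodes inside a construct are carried along without any extra bookkeeping.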

##### 3.3. Related Statements

When a task is realized in hardware or in software, its execution time and power consumption take different values. We define the functions $L_{HW}(p)$, $L_{SW}(p)$, $W_{HW}(p)$, and $W_{SW}(p)$ to represent the hardware latency, the software latency, the hardware power, and the software power, respectively, of a given partition $p$. Obtaining exact values of execution time and power consumption is a challenging problem, but it is beyond the scope of this article; we focus instead on the algorithmic issues in partitioning.

Given a path through the DFG and a hardware/software partitioning of all the nodes on that path, the completion time of the path under the partitioning is the sum of all the latencies incurred along it, taking into consideration the parallel execution of some tasks. The system completion time is defined as the completion time of a critical path Cp in the DFG. The hardware latency and the software latency corresponding to a target partition are computed from the latencies of its nodes.

For a given $G$, we define a vector $x = (x_1, x_2, \ldots, x_n)$ that indicates whether each task is realized in hardware or in software: for a node $v_i$, $x_i$ equals 0 if the task is executed in software and 1 if it is realized in hardware.
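The completion time of the critical path can be sketched in Python as follows. The sketch assumes that independent nodes may run in parallel, so a node starts once its slowest predecessor has finished. The function name, the dictionary-based inputs, and the numeric latencies are assumptions made for illustration, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def completion_time(preds, hw_latency, sw_latency, x):
    """Completion time of the critical path of a DFG under assignment x.

    preds: node -> iterable of predecessor nodes (the dependencies)
    hw_latency, sw_latency: node -> latency in each implementation
    x: node -> 1 if realized in hardware, 0 if realized in software
    """
    finish = {}
    for node in TopologicalSorter(preds).static_order():
        latency = hw_latency[node] if x[node] == 1 else sw_latency[node]
        # Independent predecessors run in parallel, so a node starts when
        # the slowest of them has finished.
        start = max((finish[p] for p in preds.get(node, ())), default=0)
        finish[node] = start + latency
    return max(finish.values(), default=0)


# Hypothetical diamond-shaped graph: a feeds b and c, which both feed d.
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
sw_latency = {"a": 2, "b": 10, "c": 4, "d": 3}
hw_latency = {"a": 1, "b": 3, "c": 2, "d": 1}

all_software = {n: 0 for n in preds}
b_in_hardware = dict(all_software, b=1)

t_sw = completion_time(preds, hw_latency, sw_latency, all_software)   # 15
t_mix = completion_time(preds, hw_latency, sw_latency, b_in_hardware)  # 9
```

Moving `b` to hardware shortens the longest path from 15 to 9 time units, after which the branch through `c` becomes the critical one.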

The power consumption of the system with respect to a given partitioning is calculated as the sum of the power consumption of all tasks, each node contributing its software or its hardware power according to how it is realized. To recapitulate, we define the hardware/software partitioning problem as follows: given a DFG and a thermal design power, find a partitioning that offers the best tradeoff between the total power and the execution time of the system.
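The power summation can be sketched in Python as below. Treating the thermal design power as an upper bound on total power is one plausible reading of the problem statement rather than something the text spells out. The function names and the power values are hypothetical.

```python
def total_power(hw_power, sw_power, x):
    """Sum the power of every node in the implementation chosen by x."""
    return sum(hw_power[n] if x[n] == 1 else sw_power[n] for n in x)


def within_tdp(hw_power, sw_power, x, tdp):
    """True when the partitioning keeps total power at or under the budget."""
    return total_power(hw_power, sw_power, x) <= tdp


# Hypothetical per-node power figures, in arbitrary units.
hw_power = {"a": 5.0, "b": 8.0, "c": 6.0}
sw_power = {"a": 1.0, "b": 2.0, "c": 1.5}
x = {"a": 0, "b": 1, "c": 0}  # only b is realized in hardware

power = total_power(hw_power, sw_power, x)  # 1.0 + 8.0 + 1.5 = 10.5
```

With a budget of 12 units this assignment is feasible, whereas moving every node to hardware (19 units) would not be.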

#### 4. Proposed Algorithm

Our algorithm partitions the graph in order to find the best compromise between power and execution time. As is generally acknowledged, software consumes less power than hardware but takes longer to respond, whereas hardware solves the timing problem at the cost of higher power consumption. The approach starts with a system implemented entirely in software, which consumes the least power but is too slow. Whenever a partition of the system migrates to hardware, the system consumes more power and becomes faster. As mentioned previously, our algorithm handles two different kinds of partitions. It first searches for all control-construct partitions (Algorithm 2) and then builds the mix partitions. After that, it forms all possible combinations of the generated partitions (Algorithm 1).
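The overall idea, starting from an all-software system and examining combinations of partitions migrated to hardware, can be sketched with a brute-force search in Python. This is not the paper's Algorithm 1. It makes several assumptions of its own: partitions execute one after another, so latencies simply add up; the tradeoff is scored with the product of latency and power; and candidates above the thermal design power are discarded. All names and numbers are hypothetical.

```python
from itertools import combinations


def best_partitioning(partitions, tdp):
    """Try every set of partitions moved to hardware; keep the best feasible one.

    partitions: name -> (sw_latency, hw_latency, sw_power, hw_power)
    Returns (hardware_set, latency, power) minimizing latency * power
    among the candidates whose power does not exceed tdp.
    """
    names = sorted(partitions)
    best = None
    # k = 0 is the all-software starting point; larger k migrates more to hardware.
    for k in range(len(names) + 1):
        for in_hw in combinations(names, k):
            latency = sum(
                partitions[n][1] if n in in_hw else partitions[n][0] for n in names
            )
            power = sum(
                partitions[n][3] if n in in_hw else partitions[n][2] for n in names
            )
            if power > tdp:
                continue  # would exceed the thermal design power budget
            if best is None or latency * power < best[1] * best[2]:
                best = (set(in_hw), latency, power)
    return best


partitions = {
    "p1": (20, 2, 1, 3),
    "p2": (6, 3, 1, 5),
    "p3": (4, 1, 1, 3),
}
result = best_partitioning(partitions, tdp=8)
# result == ({"p1"}, 12, 5): migrating only p1 gives the lowest latency * power
```

The search visits all 2^m subsets of m partitions, so it is only practical for small examples. That cost is the usual reason heuristic methods are preferred for larger partitioning problems.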