International Journal of Reconfigurable Computing

Volume 2010, Article ID 980762, 20 pages

http://dx.doi.org/10.1155/2010/980762

## New Three-Level Resource Management Enhancing Quality of Offline Hardware Task Placement on FPGA

^{1}University of Nice Sophia-Antipolis/LEAT-CNRS, 250 rue Albert Einstein, bât 4. 06560, Sophia Antipolis - Cedex, France

^{2}Research Unit ReDCAD, National Engineering School of Sfax, B.P. 1173-3038 Sfax, Tunisia

Received 14 November 2009; Revised 6 March 2010; Accepted 24 April 2010

Academic Editor: Christophe Bobda

Copyright © 2010 Ikbel Belaid et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Currently, reconfigurable hardware devices feature a high density of heterogeneous resources to enable multitasking and offer flexibility in application needs. These concepts raise the need for efficient management of hardware tasks and hardware resources. The scheduling of hardware tasks is highly dependent on placement. Placement focuses on allocation of hardware resources required by the scheduled hardware tasks. In this paper, we propose novel three-level resource management that investigates enhancement of placement quality by reducing task rejection, configuration overheads, and by optimizing resource utilization. Improving placement quality will produce significant enhancement of performance for scheduling and overall execution time of the application in FPGA. Hence, the placement problem is formulated into a constrained optimization problem and resolved with powerful solvers using the Branch and Bound method. The obtained results of an application of heterogeneous hardware tasks show an average resource utilization of 36% of the available resources on the reconfigurable region and an overall overhead of 11% of total application running time, and we have eliminated the issue of task rejection. Compared to static implementation, the gain in resource utilization within the reconfigurable region achieves up to 43%.

#### 1. Introduction

Scheduling and placement are strongly linked. The scheduler decides which of the ready tasks should be executed next and calls the placer to find a feasible location. The scheduler decision should be taken in accordance with the ability of placer to allocate free resources required by the elected task. Field-Programmable Gate Array (FPGA) is the most widely used reconfigurable hardware device. Today’s FPGA devices provide several million reconfigurable heterogeneous resources. The development of dynamic partial reconfiguration in the FPGAs allows reconfiguring only the necessary part of the FPGA when required without interfering with any other parts running on the same FPGA. While this technique can increase device utilization and performance of scheduling and application, it also leads to high configuration overhead, fragmentation, and complex allocation situations of hardware tasks [1]. Frequently, the existing methods of placement face these issues. Consequently, the quality of placement and performance of the scheduling degrades while the overall response time increases. So, there exists a serious need to define an efficient method that helps manage the area of resources.

In general, the placement of hardware tasks consists of two main functions: (i) *partitioning,* which handles the free space in the device and identifies any potential sites enabling execution of hardware tasks, and (ii) *fitting,* which selects the feasible placement solution. In this paper, under the FOSFOR project [2], we address the aspect of placement and introduce a new three-level offline resource management that tackles all the above-mentioned issues. FOSFOR (Flexible Operating System For Reconfigurable platform) is a French national program (ANR) targeting the most evolved technologies. Its main objective is to design a real-time operating system distributed on hardware and software execution units which offers the required flexibility to application tasks through the mechanisms of dynamic reconfiguration and homogeneous Hw/Sw OS services. The main concern of our method is enhancing placement quality by targeting the optimized use of the FPGA’s resources and taking into account the physical and functional features of hardware tasks. During the design of the three-level resource management, we rely on the physical architecture of the target technology and on the advantages of dynamic partial reconfiguration. The contribution of this paper is the development of (i) an offline flow for hardware task classification, which is an enhancement of our previously proposed work in [3] and which creates task classes, and (ii) a formulation of the task placement as a constrained optimization problem and its resolution by powerful solvers using the Branch and Bound method by considering two independent sublevels: the first sublevel ensures the fitting of the obtained task classes on physical blocs partitioned on the target technology, which improves resource efficiency up to 36%, and the second sublevel performs the mapping of tasks to the obtained classes by optimizing the overall overhead up to 11% of total running time and by minimizing the number of task relocations.

The remainder of this paper is organized as follows. Section 2 reviews related work of placement. Section 3 details our three-level resource management. Section 4 focuses on the formulation of hardware task placement as a constrained optimization problem and its resolution by the Branch and Bound method. The obtained results are depicted in Section 5. In Section 6, we summarize our work, make some concluding remarks, and present future works.

#### 2. Related Work

Current strategies dealing with task placement are divided into two categories: offline placement and online placement.

##### 2.1. Online Methods for Hardware Task Placement

The main reference is [4], in which Bazargan et al. propose an online scenario for hardware task placement. For fast on-line placement, Bazargan et al. introduce two partitioning techniques. The first technique, denoted *Keeping All Maximal Empty Rectangles (KAMER)*, searches all the Maximal Empty Rectangles (MERs) after each task insertion/deletion operation. The MERs are defined as the empty rectangles which are not contained in another empty rectangle and are not necessarily disjoint. The second partitioning technique, called *Keeping Nonoverlapping Empty Rectangles*, keeps all the nonoverlapping holes and is invoked after each split/merge operation. For both techniques, the fitting in the on-line placement is conducted by the *First Fit*, *Best Fit (BF)*, and *Bottom Left* bin-packing algorithms [5].

Fast two-dimensional on-line placement is presented in [6] as an extension of Bazargan’s *Keeping Nonoverlapping Empty Rectangles* partitioning. In [6], Walder et al. propose placement methods that rely on efficient partitioning algorithms, which enhance the quality of Bazargan’s partitioning by 70%, and on a hash matrix data structure that finds a feasible placement in constant time. Based on a nonpreemptive system without precedence constraints, Walder’s partitioner delays the split decision instead of using Bazargan’s heuristics. Reference [6] details four enhancements of Bazargan’s partitioner. The main one is the On-The-Fly (OTF) partitioner, which resizes empty rectangles only if the newly arrived task overlaps them. After each task insertion or deletion, Walder’s partitioner updates the data in the hash matrix. If a newly arrived task fits into more than one empty rectangle determined by the hash matrix, a fitting strategy is used to choose a rectangle [6]. Reference [6] implements four types of fitting: *BF*, *Worst Fit*, *Best Fit with Exact Fit,* and *Worst Fit with Exact Fit*.

Ahmadinia et al. present in [7] a new method of on-line placement by managing the occupied space instead of free space because of the difficulty of managing empty space and the huge increase of empty rectangles. In [7], at the arrival of a new task, the space manager starts by delimiting the *Impossible Placement Region (IPR)* relative to placed modules and to device. Thereafter, the *Nearest Possible Position* fitter selects the optimal point that gives the optimal communication cost, which is not included in the *IPR*.

In [8], Marconi et al. extend Bazargan’s placement by means of an *Intelligent Merging* (*IM*) algorithm. IM dynamically combines three techniques of managing free resources: *Merging Only if Needed*, *Partially Merging,* and *Direct Combine*. IM accelerates Bazargan’s partitioner by a factor of 3 and improves placement quality by increasing the rate of accepted tasks.

Handa et al. introduce the staircase method in [9]. The staircase method handles free space during the first subfunction of the on-line placement. This method is considered efficient as it tries to cover the faults of the *KAMER* method, especially for task rejection.

Some approximate metaheuristics are adopted to resolve the hardware task placement, such as [10], which employs on-line task rearrangement using a genetic algorithm approach. When a newly arrived task cannot be placed immediately, the proposed approach tries to rearrange a subset of tasks executing on the FPGA to allow the processing of the pending task sooner. The approach is based on the First-Fit strategy and on genetic algorithms. By allowing the rotation of tasks and by using an input buffer to save the data of suspended tasks, the approach combines two genetic algorithms to resolve two subproblems. The first subproblem identifies a feasible rearrangement and the second one consists of scheduling the moves of executing tasks to attain the feasible rearrangement.

Reference [11] copes with task placement problem and adopts interconnection-based FPGA as support for run-time reallocation of hardware tasks. It applies matador task concurrency management methodology for scheduling hardware tasks on identical tiles by minimizing run-time reconfiguration. This goal is reached by two new techniques implied in the scheduler named configuration reuse and configuration prefetch. Reduction in configuration overhead decreases significantly the execution time and energy consumption.

##### 2.2. Offline Methods for Hardware Task Placement

In the offline scenario for hardware task placement, [4] defines 3D templates depicting the tasks in time and space dimensions and uses slow heuristics (simulated annealing, greedy search, and KAMER-BF) to perform a high-quality placement in terms of resource utilization and task rejection.

Reference [12] models the problem of resource allocation as a 0-1 integer linear programming problem which aims necessarily at minimizing the resources area which is reconfigured at runtime. Reference [12] considers an application with known sequential execution trace. Hence, the huge configuration latency is tackled by reducing the overlapped areas between tasks.

By considering the placement of hardware tasks as rectangular items on hardware device as rectangular unit, several approaches for resolving the two-dimensional packing problem are proposed. For example, in [13], the offline approximate heuristics: Next-Fit Decreasing Height, First-Fit Decreasing Height, and Best-Fit Decreasing height are presented as strip-packing approaches based on packing items by levels.

In addition, Lodi et al. in [14] propose different offline approaches to resolve hardware task placement as a 2D bin-packing problem. For instance, the Floor-Ceiling algorithm considers alternate directions for packing tasks: either from left to right, with their bottom edges touching the level floor, or from right to left, with their top edges touching the level ceiling. The Knapsack packing algorithm is also proposed in [15]; it initializes each level with the tallest unpacked item and completes it by packing tasks according to the associated Knapsack problem that maximizes the total area within the level. For both latter algorithms, the second phase of 2D bin packing is achieved by the Best-Fit Decreasing algorithm.

Baker et al. define in [16] the Bottom-left (BL) offline algorithm. BL packs each hardware task in the bottom left position.

Martello and Vigo propose in [17] an enumerative offline approach to exact solution for 2D bin-packing. Their algorithm is based on a two-level branching scheme: the outer branch-decision tree that assigns tasks to the bins without specifying their position and the inner branch-decision tree that enumerates all possible patterns.

By optimizing the total execution time and the resource utilization, the method of placement in [18] consists of two phases: the first phase is the recursive bi-partitioning by means of slicing tree that defines the relative position of each hardware task towards the other hardware task placement and finds the appropriate room in the reconfigurable device for each hardware task according to task’s resources and intertask communication. The second phase uses the obtained room topology to achieve the sizing that computes the possible sizes for each room.

Reference [19] minimizes task rejection and presents offline algorithms for 3D floorplanning of hardware tasks on reconfigurable functional unit such as: KAMER-BF Decreasing, Simulated Annealing, Low-temperature Annealing, and Zero-temperature Annealing. The 3D placement models tasks as 3D boxes having a base corresponding to the spatial dimensions of tasks and a height corresponding to their time-span.

In [20], Fekete et al. propose an offline approach that treats placement as a bin-packing problem, through a graph-theoretical characterization of the packing of a set of items into a single bin. Tasks are presented as three-dimensional boxes and the feasible packing is decided by the orthogonal packing problem within a given container. Their approach considers packing classes, precedence constraints, and the edge orientation to solve the packing problem. Similarly, in [21], Teich et al. define the task placement as a more-dimensional packing problem. Tasks are modeled as 3D polytopes with two spatial dimensions and the time of computation. Based on packing classes as well as on a fixed scheduling, they search for a feasible placement on a fixed-size chip to accommodate the set of tasks. The resolution is performed by the Branch and Bound technique, which yields optimal dynamic hardware reconfiguration.

The major shortcoming of all the above-proposed methods of placement is that they are applicable only in homogeneous devices. In fact, these methods assume that the relocation of tasks is allowed and enable the allocation of resources whenever sufficient free space is available. Furthermore, the placement disregards the routing constraints as it does not address the issue of intertask communication and I/O routing. Moreover, tasks are nonpreemptive and almost-identical. Unfortunately, the algorithms for 2D packing focus only on the objective of minimizing resource waste and do not satisfy all other goals. All the existing strategies of placement provide a nonguarantee system as they suffer from task rejection and fragmentation. We believe the issue of task rejection is caused by the constructive way in which the placement is performed throughout all existent strategies of placement. The issue of fragmentation may lead to undesirable situations where a new task cannot be placed although there would be sufficient free space.

As we have full knowledge about the set of hardware tasks and the features of the reconfigurable device, in this paper, we present a realistic three-level resource management solution as a new strategy to perform offline placement of hardware tasks in FPGA. This new strategy aims at enhancing placement quality by trimming the previously mentioned issues. Our proposed method is technology-dependent and in accordance with generic placement as the second level partitions the available resources in FPGA according to the task classes provided by the first level. Nevertheless, the third level ensures the subfunction of fitting. The task model is preemptive and preemption points are predefined. Our resource management allows the relocation of tasks and results in strict positions for each hardware task by respecting its preemption points and types of resources.

#### 3. Three-Level Resource Management

We use Xilinx’s Virtex FPGA as a reference for the hardware reconfigurable device to lead our hardware resource management study. We offer a definition of a few terms which are used throughout the paper: *NT* is the number of tasks, *NR* the number of Reconfigurable Physical Blocs, *NZ* the number of Reconfigurable Zones, and *NP* the number of resource types in the chosen technology. In the beginning, we should start by introducing the hardware task models. We have defined three models.

(i) *The functional model*: this contains the functional features of hardware tasks *T*_{i} as the worst case execution time (*C*_{i}), the period (*P*_{i}), and preemption points *l* (*Preemp*_{i,l}). The number of preemption points of *T*_{i} is denoted by *NbrPreemp*_{i}. This number also includes the first point of execution of *T*_{i}. Preemption points are specified by the designer.

(ii) *The behavioral model*: This includes the finite state machine controlling each task and which handles a set of *NbrReg* registers of 32 bits to conduct the context switch. The behavioral model defines the functional overhead (*Context*_{i}) that is needed to preempt or resume the execution of tasks. This functional overhead is fixed for all the hardware tasks as they have similar register banks with *NbrReg* registers. The functional overhead is computed as two times (save and load) the access to a bus having 32 bits of width and functioning at 80 MHz. In addition, we have considered the worst case, when tasks need the *NbrReg* 32-bit registers to perform context switch. Hence, this functional overhead represents sequential access of *NbrReg* registers associated for a given task to save and load its context through a 32-bit bus (*Context*_{i} = 2 × *NbrReg* / 80 MHz). In our preemptive modeling, we do not use the classical method of readback and load bitstream since it takes a significant latency, complicates the preemption, and requires a large space memory as a new readback bitstream must be saved at each preemption. Thus, we resort to save the state of finite state machine with an acceptable amount of data by keeping always the same bitstream for each task. Preemption points of hardware tasks are fixed in a way to reduce the data dependency that could exist between two states. In fact, we must avoid keeping a preemption point between two states processing the same data because we need to save these data into an external memory which might increase the overhead at run-time. Otherwise, it is recommended to put a preemption point when the task is in a blocked state waiting for receiving external resource to allow the ready tasks to be executed in the RZ. As tasks are periodic, a preemption point could be inserted after the last state before restarting the task to avoid any data dependency. In the finite state machine, the longest execution time between two states must be considered in order to deduct the worst case execution time.

(iii) *The RB-model*: tasks are presented as a set of reconfigurable resources called Reconfigurable Blocs (RB). The RBs are closely shaped to the reconfiguration granularity in the chosen technology. The determination of the RB-model of hardware tasks is well-detailed in our work in [3]. Each type of RB is characterized by specified cost *RBCost*_{k} which is defined according to three parameters: the number of the RB type in the device, its power consumption and the importance of its functionality. The more consistent these parameters are, the higher the cost of RB type. The RB-model of each hardware task is described by (1).

The functional model and the RB-model are the basic models for resource management. However, the behavioral model is employed during scheduling. The FPGA provides reconfigurable resources organized according to column-based technology [22]. The management of hardware resources on FPGA consists of three levels.
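The three task models can be summarized in a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `HardwareTask`, its field names, and the example values (16 registers, three preemption points, the RB counts) are assumptions made for illustration. Only the formula *Context*_{i} = 2 × *NbrReg* / 80 MHz comes from the text.

```python
from dataclasses import dataclass, field

BUS_FREQ_HZ = 80e6  # clock of the 32-bit bus quoted in the text


def context_overhead_s(nbr_reg: int, bus_freq_hz: float = BUS_FREQ_HZ) -> float:
    """Context_i = 2 x NbrReg / f: one bus access per register, for save and for load."""
    return 2 * nbr_reg / bus_freq_hz


@dataclass
class HardwareTask:
    """Hypothetical container for the functional model and the RB-model."""
    name: str
    wcet: float                 # C_i, worst case execution time (seconds)
    period: float               # P_i (seconds)
    preempt_points: list        # Preemp_{i,l}; includes the first point of execution
    rb_model: dict = field(default_factory=dict)  # RB type -> number of RBs

    @property
    def nbr_preemp(self) -> int:
        """NbrPreemp_i, the number of preemption points of the task."""
        return len(self.preempt_points)


# Hypothetical task with 16 context registers and three preemption points.
t1 = HardwareTask("T1", wcet=2e-3, period=10e-3,
                  preempt_points=[0.0, 0.8e-3, 1.5e-3],
                  rb_model={"RB1": 2, "RB2": 1})
ctx = context_overhead_s(16)  # 32 bus accesses at 80 MHz
```

With these assumed values the context switch costs 0.4 µs, which illustrates why saving the state machine registers is much cheaper than a bitstream readback.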

##### 3.1. Level 1: Offline Flow of Hardware Task Classification

Level 1 takes a set of tasks as input and provides the types and instances of Reconfigurable Zones (RZ). The RZs are abstractions of task classes and are defined according to the types of resources needed by the tasks. As the RZs model the classes of hardware tasks, they are described by their RB-model given by (2).

Level 1 consists of three steps:

*Search of RZ types:* Step 1 gathers the tasks sharing the same types of RBs under the same type of RZ. Step 1 is essentially based on the RB-model of hardware tasks and is achieved by Algorithm 1, whose worst-case complexity is O(*NT* × *NP* × *NZ*).

Step 1 scans the RB-model of each hardware task and checks whether there exists in the list of RZ types *List-RZ* an already inserted type of RZ that closely matches the required types of RBs in the task (line 6). In this case, step 1 updates the number of RBs within this type of RZ by the maximum between the number of RBs in the task and that in the RZ (line 9). If the required types of RBs in the task do not match any type of RZ included in the *List-RZ*, the algorithm of the search of RZ types decides the creation of a new type of RZ as required by the task (line 13) and inserts it in *List-RZ* (line 14). At the end of step 1, we obtain the possible types of RZs. The number of RZ types is limited by the number of tasks. As shown in Figure 1, step 1 groups *T*_{1} and *T*_{3} in the same type of RZ (*RZ*_{1}) as both need *RB*_{1} and *RB*_{2} and adjusts the number of each RB type within *RZ*_{1} by the maximum number of RBs between *T*_{1} and *T*_{3}. Similarly, *RZ*_{2} is created by *T*_{2} and *T*_{4}, and *T*_{5} defines the third type of RZ (*RZ*_{3}).

*Classification of hardware tasks:* Step 2 starts by computing cost *D* between hardware tasks and RZ types resulting from step 1. Based on the RB-models of hardware tasks (*T*_{i}) and RZs (*RZ*_{j}), cost *D* is computed as follows according to two cases.

We define by

*Case 1.* *RZ*_{j} contains a sufficient number of each type of RB (*RB*_{k}) required by *T*_{i}. In this case, cost *D* is equal to the sum of differences in the number of each RB type between *T*_{i} and *RZ*_{j}, weighted by *RBCost*_{k} (see (4)).

*Case 2.* The number of *RB*_{k} required by *T*_{i} exceeds the number of *RB*_{k} in *RZ*_{j}, or *T*_{i} needs an *RB*_{k} which is not included in *RZ*_{j}. In this case, the cost *D* between *T*_{i} and *RZ*_{j} is infinite (see (5)).

Figure 2 illustrates the computing of costs between the five tasks and the three RZ types described in Figure 1.
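The search of RZ types and the two cases of cost *D* can be sketched as follows. This is an illustrative reading of the description above, not a transcription of Algorithm 1: the function names, the dictionary representation of RB-models, and all RB counts and costs are assumptions (the values of Figure 1 are not reproduced here). "Matching" is taken to mean that the task and the RZ type use exactly the same set of RB types.

```python
import math


def search_rz_types(tasks):
    """Group tasks that need the same RB types; keep the max count per RB type."""
    rz_types = []
    for rb_model in tasks.values():
        needed = {k for k, n in rb_model.items() if n > 0}
        for rz in rz_types:
            if set(rz) == needed:              # an existing RZ type matches
                for k in needed:
                    rz[k] = max(rz[k], rb_model[k])
                break
        else:                                  # no match: create a new RZ type
            rz_types.append({k: rb_model[k] for k in sorted(needed)})
    return rz_types


def cost_d(task_rb, rz_rb, rb_cost):
    """Case 2: infinite if the RZ lacks RBs; Case 1: weighted sum of surplus RBs."""
    for k, n in task_rb.items():
        if n > rz_rb.get(k, 0):
            return math.inf
    return sum((n_rz - task_rb.get(k, 0)) * rb_cost[k] for k, n_rz in rz_rb.items())


# Hypothetical RB-models for five tasks and hypothetical costs per RB type.
tasks = {
    "T1": {"RB1": 2, "RB2": 1},
    "T2": {"RB1": 4},
    "T3": {"RB1": 1, "RB2": 3},
    "T4": {"RB1": 2},
    "T5": {"RB2": 2, "RB3": 1},
}
rb_cost = {"RB1": 1, "RB2": 2, "RB3": 3}

rz_types = search_rz_types(tasks)
# Each task goes to the RZ type with the lowest cost D.
assignment = {
    name: min(range(len(rz_types)), key=lambda j: cost_d(rb, rz_types[j], rb_cost))
    for name, rb in tasks.items()
}
```

With these assumed inputs, T1 and T3 share one RZ type, T2 and T4 share a second, and T5 defines a third, which mirrors the grouping described for Figure 1.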

As shown in Table 1, step 2 assigns each task to the RZ giving the lowest cost *D*, as described by the third column. For example, *T*_{1} is assigned to *RZ*_{1} since *D*(*T*_{1}, *RZ*_{2}) and *D*(*T*_{1}, *RZ*_{3}) are superior to *D*(*T*_{1}, *RZ*_{1}). Then, by using (6), step 2 computes the workload of each RZ according to this assignment and by using the functional models of hardware tasks.

The overhead *Overhead*_{j,i} is the sum of *Config*_{j} corresponding to each *RZ*_{j} and *Context*_{i} (save and load) common for all tasks *T*_{i}. *Config*_{j} corresponds to the configuration overhead to place *RZ*_{j} on the target technology. We compute this configuration overhead by making the floorplan of each *RZ*_{j} on the chosen device and by conducting the whole partial reconfiguration flow up to the creation of the partial bitstream. According to the configuration frequency and the configuration port, *Config*_{j} is determined by (7).

For example, the workload of *RZ*_{1} resulting from *T*_{1} and *T*_{3} is 66%. The last column in Table 1 gives costs *D* of the other tasks not assigned to the RZ.

We notice an overload in *RZ*_{2} caused by the workloads of execution of *T*_{2} and *T*_{4} as well as by their overheads. This overload is resolved during step 3.

*Decision of increasing the number of RZs:* Step 3 takes place only when an overload within some RZs is detected in step 2 and is achieved by Algorithm 2. Step 3 aims to lighten the overload in RZs by conducting the migration of task execution sections to nonoverloaded RZs before resorting to the solution of increasing the number of overloaded RZs. Hence, for each task, we search all the possible combinations of task execution sections. For each overloaded RZ, Algorithm 2 searches all the nonoverloaded RZs that could accept at least one of its assigned tasks; that is, the RZs giving a finite cost *D* with at least one of these tasks. Then, Algorithm 2 checks task by task the possibility of migration of an execution section or a combination of execution sections of the current task in order to reduce the overload of its RZ by respecting the workload of the nonoverloaded receiving RZ. In the worst case, the complexity of Algorithm 2 is O(*M* × *N* × *NTM* × *TS*), where *M* denotes the number of overloaded RZs, *N* is the number of nonoverloaded RZs, *NTM* is the maximum number of tasks assigned to an overloaded RZ, and *TS* is the maximum number of execution section combinations for a task assigned to an overloaded RZ.
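The workload check that triggers step 3 can be sketched as below. Equation (6) is not reproduced in this text, so the workload formula used here is an assumption: each task assigned to an RZ contributes (*C*_{i} + *Overhead*_{j,i}) / *P*_{i}, with *Overhead*_{j,i} = *Config*_{j} + *Context*_{i}. The function names and the numeric values are hypothetical.

```python
import math


def overhead(config_j: float, context_i: float) -> float:
    """Overhead_{j,i} = Config_j + Context_i, as stated in the text."""
    return config_j + context_i


def rz_workload(assigned, config_j: float) -> float:
    """Assumed form of the workload: sum of (C_i + Overhead_{j,i}) / P_i.

    assigned: list of (C_i, P_i, Context_i) tuples for the tasks mapped to RZ_j.
    """
    return sum((c + overhead(config_j, ctx)) / p for c, p, ctx in assigned)


def is_overloaded(workload: float, capacity: float = 1.0) -> bool:
    """An RZ is overloaded when its workload exceeds 100%."""
    return workload > capacity


def max_instances(workload: float) -> int:
    """Upper bound on RZ instances if migration cannot absorb the overload."""
    return math.ceil(workload)


# Hypothetical values in arbitrary time units: (C_i, P_i, Context_i).
rz_a = [(20.0, 100.0, 1.0), (40.0, 100.0, 1.0)]
rz_b = [(50.0, 100.0, 1.0), (70.0, 100.0, 1.0)]
w_a = rz_workload(rz_a, config_j=2.0)   # about 0.66, not overloaded
w_b = rz_workload(rz_b, config_j=2.0)   # about 1.26, overloaded
```

Under these assumptions the second RZ would need at most two instances unless some execution sections can be migrated to a nonoverloaded RZ first.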

Step 3 groups the workloads of overloaded RZs in *L1* (line 7) and the workloads of nonoverloaded RZs in *L2* (line 7). Step 3 goes through the RZs in *L1* to resolve their overloads independently (line 11). Step 3 uses nonoverloaded RZs in *L2* to lighten the workloads of RZs in *L1* (line 15). This step searches the nonoverloaded RZ in *L2* that gives finite cost *D* with at least one task assigned to the overloaded RZ during step 2 (line 19). Once step 3 finds the set of tasks that could be executed in the nonoverloaded RZ, it balances the workloads between both RZs by respecting the tasks’ preemption points (line 22–line 37). If the overload persists in the RZ of *L1*, the algorithm decides to add other instances of this RZ, up to ⌈workload of RZ⌉ (line 42). When the processed nonoverloaded RZs do not affect the added number of instances of the overloaded *RZ*_{m}, step 3 reinitializes their workloads to their values before dealing with the overload of *RZ*_{m} (line 43).

Without any loss of generality, our proposed strategy of resource management includes the main functions of generic placement: partitioning and fitting, which are fulfilled by the two following levels.

##### 3.2. Level 2: Partitioning of Reconfigurable Physical Blocs on the Target Technology

Level 2 takes the types of RZs provided by level 1 as inputs and searches all the possible locations for them on the target device. These locations, called Reconfigurable Physical Blocs (RPB), are partitioned on the specified Reconfigurable Regions (RR) delimited in the target device. The RPBs are depicted by their RB-model as presented in (8).

The RPBs must contain all the types of RBs required by the RZ type. The number of RBs in RPBs is greater than or equal to the number of RBs in RZs. Figure 3 shows an example of RPBs partitioned in the RR, which are associated with an RZ requiring two RBs of one type and one RB of another type. The RPBs are presented by the five dotted rectangles.
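A brute-force view of this partitioning can be sketched as follows. The device model is a deliberate simplification and an assumption of this example: the reconfigurable region is a grid in which every column holds a single RB type, and each cell is one RB. The column names `CLB` and `BRAM`, the rectangle encoding `(x, y, w, h)`, and the function names are hypothetical.

```python
from collections import Counter


def rpb_rb_model(columns, x, w, h):
    """RB-model of a w-by-h rectangle whose leftmost column is x."""
    counts = Counter()
    for col_type in columns[x:x + w]:
        counts[col_type] += h          # one RB per row in each covered column
    return counts


def enumerate_rpbs(columns, height, rz_rb):
    """List every rectangle (x, y, w, h) holding at least the RBs the RZ needs."""
    rpbs = []
    for x in range(len(columns)):
        for w in range(1, len(columns) - x + 1):
            for h in range(1, height + 1):
                model = rpb_rb_model(columns, x, w, h)
                if all(model[k] >= n for k, n in rz_rb.items()):
                    # The RB-model does not depend on the row offset y.
                    for y in range(height - h + 1):
                        rpbs.append((x, y, w, h))
    return rpbs


# Hypothetical 4-column, 2-row region and an RZ needing two CLBs and one BRAM.
columns = ["CLB", "CLB", "BRAM", "CLB"]
rz_rb = {"CLB": 2, "BRAM": 1}
candidates = enumerate_rpbs(columns, height=2, rz_rb=rz_rb)
```

Even this tiny region yields eleven candidate RPBs, which hints at how quickly exhaustive enumeration grows on a real device.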

##### 3.3. Level 3: Two-level Fitting

Level 3 consists of two independent sublevels. The first sublevel ensures the fitting of RZs on the most suitable nonoverlapped RPBs in terms of resource efficiency. The second sublevel performs the mapping of tasks to RZs according to their preemption points by respecting the workload of each RZ and guaranteeing the total execution of each task. Such mapping essentially promotes the solution giving the lowest overhead and lowest cost *D*. The second fitting sublevel provides an execution unit for each task; consequently, there is no longer the issue of task rejection. The mapping of tasks to RZs is strongly based on dynamic partial reconfiguration. This latter concept enables multitasking as well as execution of several hardware tasks on the same RZs. In fact, dynamic partial reconfiguration allows the reconfiguration of subareas on the FPGA during runtime without affecting other running tasks.

#### 4. Resolution of the Hardware Task Placement Problem

Previously, in our work presented in [23], we proposed a straightforward method to resolve partitioning and fitting. Nevertheless, with the rapid growth of resources in recent technologies and with the increasing complexity of applications, this exhaustive search is not efficient. In fact, the problem of partitioning/fitting is NP-complete, the search space is immense and the temporal complexity of the execution of our proposed algorithm in [23] is exponential. In this paper, we formulate the partitioning/fitting problem as a constrained optimization problem. Our work is based on the smart nonexhaustive complete method called Branch and Bound [24] which employs efficient techniques for scanning search space and extracting the optimal solution.
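To illustrate the principle of Branch and Bound on a problem of this flavor, the sketch below assigns whole tasks to RZs so as to minimize the total cost *D* without overloading any RZ. It is a toy model, not the formulation solved in this paper: tasks are not split at preemption points, the objective has a single criterion, and the names `branch_and_bound`, `cost`, and `load` as well as the data are assumptions.

```python
import math


def branch_and_bound(cost, load, capacity=1.0):
    """cost[i][j]: cost D of task i on RZ j (math.inf if infeasible).
    load[i]: workload contributed by task i. Returns (best_cost, assignment)."""
    n_tasks, n_rz = len(cost), len(cost[0])
    # Optimistic bound: every remaining task gets its cheapest RZ.
    suffix = [0.0] * (n_tasks + 1)
    for i in range(n_tasks - 1, -1, -1):
        suffix[i] = suffix[i + 1] + min(cost[i])
    best = [math.inf, None]

    def explore(i, used, partial, acc):
        if acc + suffix[i] >= best[0]:
            return                      # bound: this branch cannot improve
        if i == n_tasks:
            best[0], best[1] = acc, list(partial)
            return
        # Branch on the RZ chosen for task i, cheapest first.
        for j in sorted(range(n_rz), key=lambda j: cost[i][j]):
            if math.isinf(cost[i][j]) or used[j] + load[i] > capacity + 1e-9:
                continue                # infeasible mapping or overload
            used[j] += load[i]
            partial.append(j)
            explore(i + 1, used, partial, acc + cost[i][j])
            partial.pop()
            used[j] -= load[i]

    explore(0, [0.0] * n_rz, [], 0.0)
    return best[0], best[1]


# Hypothetical instance: three tasks, two RZs.
cost = [[0, 3], [1, 0], [2, math.inf]]
load = [0.6, 0.5, 0.5]
best_cost, best_assignment = branch_and_bound(cost, load)
```

In this instance the cheapest RZ of the first task must be given up, because the third task can only run on the first RZ and the two do not fit together.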

##### 4.1. Formulation of Hardware Task Placement as a Constrained Optimization Problem

The problem of partitioning/fitting is modeled as a constrained combinatorial optimization problem, as it uses a discrete solution set, chooses the best solution out of all possible combinations, and aims at the optimization of a multicriteria function. The partitioning/fitting problem is a mixed integer problem, as it uses both natural and binary variables. This problem is described by the quadruplet (Constants, Variables, Constraints, and Objective Function).

###### 4.1.1. Constants

*NT*: number of tasks constituting the application; *NZ*: number of RZs resulting from level 1 (offline flow of hardware task classification); *NP*: number of RB types existing in the target technology; *D*(*T*_{i}, *RZ*_{j}): the cost *D* between *T*_{i} and *RZ*_{j}; *RBCost*_{k}: the cost of each RB type.

*Device Features*

*Device_Width*: the width of the device; *Device_Height*: the height of the device; *Device_RB*: the RB-model of the device.

*Task Features*

*C*_{i}: the Worst Case Execution Time (WCET) of *T*_{i}; *P*_{i}: the period of *T*_{i}; *Preemp*_{i,l}: the preemption point *l* of *T*_{i}; *NbrPreemp*_{i}: the number of preemption points in *T*_{i}; *Context*_{i}: the functional overhead for preempting and resuming *T*_{i}; *Overhead*_{j,i}: the full overhead for execution of task *T*_{i} on *RZ*_{j}; *T*_{i}*_RB*: the RB-model for *T*_{i}.

*RZ Features*

*RZ*_{j}*_RB*: the RB-model for *RZ*_{j}; *Config*_{j}: the configuration overhead for *RZ*_{j} in the target technology.

###### 4.1.2. Variables

*Features for Each RPB*

Each RPB is described by the abscissa and the ordinate of its upper left vertex, the abscissa of its upper right vertex, the ordinate of its bottom left vertex, and the RB-model of the RPB constructed by these coordinates.

*The Task Preemption Points*

Under the task functional model, it is assumed that the preemption points *l* of each task are known. A Boolean variable controls whether the mapping of preemption point *l* of *T*_{i} is performed on *RZ*_{j}; it is equal to 1 when preemption point *l* of *T*_{i} is mapped to *RZ*_{j}. *PreempTask*_{j,i,l}: each preemption point assigned to *RZ*_{j} is taken from the predefined preemption points of each task (see (9)). The sum of preemption points of *T*_{i} performed within *RZ*_{j} is expressed by (10). The full occupation rate of *T*_{i} in *RZ*_{j}, resulting from the mapping of preemption points of *T*_{i} to RZs, is computed as expressed in (11).

*AverageLoad*: the average of RZ workloads obtained after task mapping, calculated by (12).

###### 4.1.3. Constraints

*Heterogeneity Constraint*

The RZs must be fitted on RPBs containing a sufficient number of their required types of RBs. This constraint must be respected during partitioning and fitting of RZs. The heterogeneity constraint is formulated by (13).

*Nonoverlapping between RPBs*

As expressed by (14), this constraint restricts the fitting of RZs on nonoverlapped RPBs

*Nonoverload in RZs*

As mentioned in (15), the non-overload in RZs must be respected during mapping of tasks

*Infeasibility of Mapping for Preemption Points*

This constraint prohibits the mapping of preemption points of tasks to RZs giving infinite cost *D* (see (16)).

*Uniqueness of Preemption Points*

This constraint claims that each preemption point *l *of must exist on unique and guarantees the achievement of task execution as well as the elimination of task rejection (see (17))

*Domains of RPB Coordinates*

Equation (18) defines the allowed domain of values that can be assigned to RPB coordinates during partitioning
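The heterogeneity and nonoverlapping constraints can be illustrated with a minimal Python sketch. The `RPB` field names, the representation of the device as a grid of RB types, and the convention that row indices grow downward are assumptions made for illustration; this is not the formulation used in the equations of this section.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RPB:
    # Hypothetical field names: inclusive column/row bounds of the rectangle.
    x_left: int    # abscissa of the upper left vertex
    y_top: int     # ordinate of the upper left vertex
    x_right: int   # abscissa of the upper right vertex
    y_bottom: int  # ordinate of the bottom left vertex (rows grow downward)


def rb_counts(rpb: RPB, device: List[List[str]]) -> Dict[str, int]:
    """Count the RBs of each type inside the rectangle; device[row][col] is an RB type."""
    counts: Dict[str, int] = {}
    for row in range(rpb.y_top, rpb.y_bottom + 1):
        for col in range(rpb.x_left, rpb.x_right + 1):
            rb_type = device[row][col]
            counts[rb_type] = counts.get(rb_type, 0) + 1
    return counts


def satisfies_heterogeneity(rpb: RPB, device: List[List[str]],
                            required: Dict[str, int]) -> bool:
    """True if the RPB holds at least the required number of RBs of every type."""
    available = rb_counts(rpb, device)
    return all(available.get(rb_type, 0) >= n for rb_type, n in required.items())


def overlaps(a: RPB, b: RPB) -> bool:
    """True if the two rectangles share at least one RB position."""
    separated = (a.x_right < b.x_left or b.x_right < a.x_left
                 or a.y_bottom < b.y_top or b.y_bottom < a.y_top)
    return not separated
```

A candidate fitting would be kept only if `satisfies_heterogeneity` holds for each RZ and `overlaps` is false for every pair of selected RPBs.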

###### 4.1.4. Minimization Objective Function (F)

As with all optimization problems, we have defined a minimization objective function *F* that helps in selecting the optimal solution. As described by (19), *F* promotes a solution giving the best values for the two objective subfunctions *MappingFunction* and *PlaceFunction*.

*PlaceFunction* focuses on the subproblem of fitting RZs on the most suitable RPBs after partitioning the target device by respecting the predefined constraints. In (20), *PlaceFunction* evaluates the efficiency of resources after fitting RZs on the selected RPBs. *PlaceFunction* promotes the fitting of RZs on the RPBs that strictly contain the number and type of RBs required by RZs.

*MappingFunction* focuses on the subproblem of fitting hardware tasks on RZs by respecting the predefined preemption points. *MappingFunction* targets the full exploitation of RZs and the balance of their workloads. It also aims at mapping tasks to the RZs providing the lowest cost to optimize resource utilization. Moreover, *MappingFunction* promotes the solution that minimizes the overall overhead and the number of task relocations. Thus, *MappingFunction* is expressed by four subfunctions targeting these goals.

*Map1* focuses on the RZ workloads by means of (22). Its first expression evaluates whether the RZs are fully exploited. While minimizing this expression, the RZ workloads approach 100% of their exploitation. Its second expression computes the variance of the RZ workloads towards the obtained average workload. Minimization of this second expression ensures load balancing between RZs

Where

*Map2* computes the overhead in (24) resulting from the mapping of task preemption points to the RZs. *Map2* takes into account all the possible preemption points, even when two successive preemption points of one task are mapped to the same RZ, for obtaining the worst case overhead. In fact, the scheduler could preempt a task on these successive preemption points in the same RZ in favor of a higher priority task. Minimizing *Map2* promotes the solutions that map the tasks to the RZs providing the lowest configuration overhead

By minimizing *Map3* in (25), we promote a solution that maps tasks with a high occupation rate to the RZs giving the lowest cost *D*. The benefit of this minimization is an optimized use of the resources available in the technology. Indeed, the *D* costs between tasks and RZs reveal the rate of resource waste. Moreover, these *D* costs weight each resource according to three parameters: its frequency on the technology, the importance of its functionality, and its power consumption. Our objective is to minimize the utilization of costly resources whenever possible. Hence, the more we promote mapping of tasks with high occupation rates to the RZs giving the lowest cost *D*, the more we optimize resource utilization.

*Map4* computes in (26) the total number of task relocations obtained after such mapping. Although the migration of tasks between RZs solves the conflicts between tasks on the same RZ, minimizing the number of migrations optimizes the number of preemptions and overhead
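The two quantities that *Map1* is described as minimizing can be sketched as follows. This is a minimal illustration under assumptions: workloads are given as percentages, the first term is taken as the total distance of the workloads from 100%, and the second as the population variance around the average load. The function names are hypothetical and the exact form and weighting used in (22) are not reproduced here.

```python
from typing import List, Tuple


def average_load(workloads: List[float]) -> float:
    """Average of the RZ workloads (in percent)."""
    return sum(workloads) / len(workloads)


def unused_capacity(workloads: List[float]) -> float:
    """Total distance of the RZ workloads from full (100%) exploitation."""
    return sum(100.0 - w for w in workloads)


def load_variance(workloads: List[float]) -> float:
    """Variance of the RZ workloads around their average (load balancing term)."""
    avg = average_load(workloads)
    return sum((w - avg) ** 2 for w in workloads) / len(workloads)


def map1_terms(workloads: List[float]) -> Tuple[float, float]:
    """Return both terms separately; how they are combined is left open."""
    return unused_capacity(workloads), load_variance(workloads)
```

For instance, the workloads `[90, 90, 90]` and `[100, 100, 70]` leave the same total capacity unused, but only the first set is balanced, so the second term separates them.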

##### 4.2. Branch and Bound Method for Solving the Hardware Task Placement

Our work relies on the global optimization method called Branch and Bound for partitioning the resource space and for fitting RZs and tasks. The Branch and Bound method enumerates the possible solutions in an intelligent way by relying on the features of the specified problem. This technique eliminates the partial solutions that cannot lead to the optimal solution. It relies on a predefined bound function, here our objective function *F*, to decide which solutions are excluded and which are kept as potential solutions. The performance of the Branch and Bound method depends on the quality of this function. Branch and Bound is suited to mixed integer linear and nonlinear programming.

Initially, only one subset of solutions exists, namely the root subset that contains all the solutions of the problem. The unexplored subsets are represented as nodes in a dynamically generated search tree, and each iteration of Branch and Bound processes one such node. An iteration has three components: the selection of the node to process, branching, and the calculation of the bound function for the ramified nodes. For each node, the bound function calculation considers the best fitting for the remaining RZs or tasks without constraints, whereas only nodes respecting the predefined constraints are ramified after node selection. Our strategy for selecting the next node is based on the value of the bound function *F* of the node. We start with the node giving the best bound function *F* at the current level and use a *Depth First Search* (*DFS*) strategy. This strategy explores, at each level, the node giving the best *F* until a complete solution is obtained, which gives the last best *F*. Then, the process is repeated for the next best node on the last visited level if its *F* does not exceed the last best *F*. In this case, we branch this node, compute the bound functions of its ramified nodes, and compare them to the last best *F*. If *F* of these partial solutions is greater than or equal to the last best *F*, they are rejected. If not, they are kept and the best partial solution is selected to be processed by Branch and Bound. Once a new complete solution is found, the method checks whether its *F* improves on the last best *F*; if so, the last best *F* is updated. The process is repeated for all the next best nodes in each level as described above. Algorithm 3 describes the Branch and Bound method applied to our problem.
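The depth-first, best-bound-first search described above can be sketched on a toy problem. This is a minimal illustration, not Algorithm 3: the cost matrix, the function name `branch_and_bound`, and the single constraint that two items may not share a slot (a stand-in for the nonoverlapping constraint) are assumptions. As in the text, the bound of a partial solution adds the best constraint-free choice for each remaining item, and a node is discarded when its bound is greater than or equal to the last best value.

```python
import math
from typing import List, Optional, Set, Tuple


def branch_and_bound(cost: List[List[float]]) -> Tuple[float, Optional[List[int]]]:
    """Assign each item (row) to a distinct slot (column) with minimal total cost."""
    n_items = len(cost)
    # Cheapest slot per item, ignoring constraints: used for the optimistic bound.
    cheapest = [min(row) for row in cost]
    # suffix[i] is the relaxed cost of the items i, i+1, ... not yet placed.
    suffix = [0.0] * (n_items + 1)
    for i in range(n_items - 1, -1, -1):
        suffix[i] = suffix[i + 1] + cheapest[i]

    best_value = math.inf
    best_assignment: Optional[List[int]] = None

    def explore(item: int, used: Set[int], partial: float, chosen: List[int]) -> None:
        nonlocal best_value, best_assignment
        if item == n_items:
            # Complete solution: keep it only if it improves the last best value.
            if partial < best_value:
                best_value = partial
                best_assignment = chosen.copy()
            return
        # Branch only on slots that respect the constraint, then rank by bound.
        children = []
        for slot, c in enumerate(cost[item]):
            if slot in used or math.isinf(c):
                continue
            children.append((partial + c + suffix[item + 1], slot, c))
        children.sort()
        for bound, slot, c in children:
            if bound >= best_value:
                break  # children are sorted, so the remaining ones are rejected too
            used.add(slot)
            chosen.append(slot)
            explore(item + 1, used, partial + c, chosen)
            chosen.pop()
            used.remove(slot)

    explore(0, set(), 0.0, [])
    return best_value, best_assignment


value, assignment = branch_and_bound([[2, 1, 9],
                                      [9, 2, 9],
                                      [3, 9, 4]])
```

In this example the node with the best initial bound (item 0 on slot 1) does not lead to the optimum; the search first reaches complete solutions through it, then returns to the next best node on the first level and improves the last best value.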

Partitioning resource space and fitting RZs can be resolved independently from task mapping. In the subproblem of partitioning resource space and fitting RZs, *PlaceFunction* is employed as the bound function *F* in Algorithm 3, while in the task mapping subproblem, *F* corresponds to *MappingFunction*. To fit RZs on their suitable RPBs, the two subproblems of partitioning the RPBs on the target device and fitting RZs on the selected RPBs are performed in parallel. In fact, the resolution starts by partitioning the RPBs corresponding to the RZ root (for example *RZ _{1}*) with respect to the heterogeneity constraint. This partitioning is done randomly by assigning all potential values to the RPB coordinates and the RZ root is fitted onto the RPB that gives the best *PlaceFunction*. After selection of the best RPB for the RZ root, that is, the selected node, this node is branched into all the possible RPBs for *RZ _{2}* by respecting the predefined constraints and the *PlaceFunction* of each ramified node is computed. The first selected node for the *RZ _{2}* level must satisfy the heterogeneity constraint as well as the nonoverlapping constraint. The subset of solutions ramified from this first selected node in *RZ _{3}* level is explored by searching the RPBs of *RZ _{3}*. The resolution is carried out similarly until the achievement of exploration branched from the first selected node which gives the last best *PlaceFunction*. The process is repeated for the next best nodes and recursively, the remainder RZs are fitted on RPBs as described for the RZ root by respecting the predefined constraints. During exploration, if a partial solution gives *PlaceFunction* worse than the last best *PlaceFunction*, this partial solution is rejected and its branching is stopped. If not, this partial partitioning/fitting of RZs is kept as a potential partial solution. During this repetitive process, all possible combinations of RPBs respecting the constraints are tested; when the process is achieved, the optimal solution is extracted.

Figure 4 illustrates the running of the Branch and Bound method on the partitioning/fitting of four RZs. The nodes with an mark present the subsets of solutions that do not contain the optimal solution and with a *√* mark present potential solutions. In the *RZroot level* and * level*, the best node is selected, branched on partial solutions and the *PlaceFunction* of its ramified nodes are computed. As can be noticed, there are two processed best fittings in the * level*. The method starts by the first best node that is, *PlaceFunction3p* and processes this selected node. It is branched into two nodes by the RPBs of . After computing the *PlaceFunction* for these ramified nodes, the is selected and a complete solution is obtained. The last best *PlaceFunction* is *PlaceFunction41*. is rejected as it does not optimize *PlaceFunction41*. The next best node in the last visited level not yet processed is . The node is kept and branched into two RPBs on the *RZ _{4} level*. *PlaceFunction43* optimizes the last best *PlaceFunction* and the solution is complete, thus the optimal solution is obtained by *PlaceFunction43*. The next iteration processes the next best node in the last visited level; partial solution *RPB _{2}-RZ_{3}*. This partial solution is rejected as *PlaceFunction32* exceeds the last best *PlaceFunction*. The other nodes are also not processed as their partial bound function exceeds the last best *PlaceFunction*. Finally, the optimal solution of partitioning/fitting of RZs is obtained by fitting *RZroot* () on *RPB _{i}-RZroot*, on , on , and on .

The resolution of mapping starts by randomly assigning the preemption points of task root () to RZs giving finite cost *D* and computing the bound function of each node. The selected node is the node that gives the most optimal mapping of preemption points according to *MappingFunction*. This node is ramified for new processing for the next task. This ramification must respect the total execution of tasks as well as the non-overload on RZs. Recursively, as task root, the remainder tasks are mapped by satisfying the two previous constraints and computing the *MappingFunction* of their ramified nodes. When a complete solution is obtained, it represents the last best *MappingFunction*. Similarly to partitioning/fitting of RZs, the mapping is resolved in a recursive manner by computing the *MappingFunction* of each partial mapping branched from the selected best nodes and keeping only the ones optimizing the last best *MappingFunction*. The mapping resolution is finished when all the potential mappings of preemption points for all tasks to RZs respecting mapping constraints are checked.

Figure 5 describes the progress of mapping three tasks to four RZs. The set of RZs giving finite cost *D* is provided for each task. In *Troot level*, the resolution selects the first best node, that is, *MappingFunction13*; after branching it and computing the *MappingFunction* of the partial solutions, the next iteration is released on * level*. In * level*, *MappingFunction23* is the best partial solution; its node is ramified on * level* and the last best *MappingFunction* is obtained by *MappingFunction31*. The process is repeated for the next best nodes in the last visited level. The nodes *MappingFunction21* and *MappingFunction22* are not processed since they exceed the last best *MappingFunction*. The method processes the next best node in the last visited level which is *MappingFunction11*. The branching of this node does not optimize the last best *MappingFunction*. Thus, its ramified nodes are not explored in * level*. Finally, the optimal solution of mapping hardware tasks to RZs is obtained by fitting the first preemption point P1 of on and its remaining preemption points P2 and P3 on and by fitting all the preemption points of and on .

As is well known, the worst-case complexity of the classical Branch and Bound method is exponential [25]. In fact, in the worst case, the complexity of the first subproblem, partitioning/fitting RZs, is equal to O((4 × *B* + 8 × + (*NZ* + 1) × )^{NZ-1} + 4 × *B*), where *B* denotes a possible combination of RPB coordinates for a given RZ as presented by a node in Figure 4. Thus, is a possible value assignment for the decision variables *X _{j}*, *WRPB _{j}*, *Y _{j}*, *HRPB _{j}*. The values domain of *B* is equal to , which is equal to (*Device_Width*)^{2} × (*Device_Height*)^{2}. 4 × *B* depicts the selection of a node, its removal from the set live, and the two following tests: the test of final and optimal solution (line 8) and the test of failed solution (line 11). 8 × is the branching operation by respecting the constraints (line 15) and (*NZ* + 1) × is the bound computation for the branched nodes (line 17), by assigning each already processed RZ to its RPB and the remaining ones to the most suitable RPBs in terms of resource efficiency without constraints, and the insertion of these branched nodes into the set live (line 18). Consequently, the complexity of the subproblem in the worst case is ~ O(((*Device_Width*)^{4} × (*Device_Height*)^{4})^{NZ}).

In the worst case, the complexity of the second subproblem, task fitting, is O((4 × *B* + (2 + *NT* + 6 × *NT* × *NZ*))^{NT-1} + 4 × *B*), where *B* denotes a possible fitting of preemption points on RZs *RZ _{j}* for a given task as associated to a node in Figure 5. Thus, *B* is a possible value assignment for the decision variables *PreempUnicity _{j,i,l}*. The values domain of *B* for a given task *T _{i}* is which is equal to: and we consider the maximum number of preemption points is *MaxPreemp*. After the branching of the selected node, we compute the bound of its branched candidates by considering that the remaining tasks which are not yet processed are mapped with 100% to their optimal RZs in terms of distance and configuration overhead. Consequently, in the worst case, the complexity of the second subproblem by using the Branch and Bound search is . Thus, in the worst case, the search by the Branch and Bound algorithm grows exponentially with *NT* and *NZ*.

This complexity is expected as the placement of hardware tasks on heterogeneous reconfigurable devices is an NP-complete problem. Currently, many enhancements of the Branch and Bound algorithm are being investigated, as in [26], which could reduce the exponential complexity of some problems to logarithmic complexity. Despite this possible exponential complexity, the Branch and Bound method greatly reduces the search space compared to an exhaustive exact method: the search tree branches are discarded when the current bound function exceeds the last best bound function. Branch and Bound ensures complete resolution of the placement problem with a performance better than or equal to that of a complete exhaustive method in terms of resolution speed and storage space. With this method, we ensure a full combination of RZ fittings as well as task mappings, which solves the problem of task rejection encountered in previous placement works. Unlike approximate methods such as metaheuristics [27], this method is complete and affords an optimal solution. In fact, within these metaheuristics, the computation in each generation is rather complicated and time consuming, and the number of generations must be increased to ensure a more favorable solution. The quality of the solution obtained by metaheuristics is conditioned by the choice of the initial generation.

#### 5. Application and Results

To investigate the influence of our proposed method for hardware task placement, we implemented an application composed of several hardware tasks taken from OpenCores [28]. We chose hardware tasks that are frequently used in recent embedded systems performing video and audio applications. Our application features hardware tasks of varied sizes and heterogeneous resources. The hardware tasks are guided by an application manager, the microcontroller. We do not consider the control overhead between the microcontroller and the other hardware tasks. During the design of this application, we synthesized each hardware task with the Xilinx ISE 11.3 tool to obtain its resources. We defined the configuration overheads of the obtained RZs by performing the partial reconfiguration flow with the Xilinx PlanAhead 11.3 tool.

##### 5.1. Application

Figure 6 shows the hardware tasks included in the application. The core of the application is the microcontroller T48. The microcontroller guides the other hardware tasks either for speeding up the computing, such as FIR and MULTF, or for performing complicated processing, such as JPEG compression. Consequently, there are three categories of hardware tasks.

- (i) Application manager.
  - (a) Microcontroller T48: Hardware task configuration and data flow synchronization.
- (ii) Fine-grained hardware tasks (for speeding up microcontroller).
  - (a) FIR: Performs 1000 FIR (Finite Impulse Response) filters. Each FIR features 3 taps, 8-bit input data, and 48-bit coefficients.
  - (b) MULTF: Performs 1000 floating point multiplications between two vectors of 8 bits. The exponent precision and mantissa precision are 3 and 4, respectively. These multiplications are pipelined with a latency of 8 clock cycles.
- (iii) Coarse-grained hardware tasks.
  - (a) MDCT: Computes the Modified Discrete Cosine Transform.
  - (b) AES (Advanced Encryption Standard): Performs encryption of 128-bit blocks with a 256-bit key.
  - (c) DDS (Direct Digital Synthesizer): Creates sinusoidal waves programmable with frequency and phase on-time.
  - (d) JPEG: Performs hardware compression of 24 frames per second.
  - (e) VGA: Drives VGA monitors with an 800x600 resolution; it can display one picture on the screen, either of characters, color waveforms, or a color grid.

The features of hardware tasks and their instances are presented in Table 2. These hardware tasks are characterized by considering the resource area in Virtex 5 technology [29]. Virtex 5 contains four main types of resources: CLBL, CLBM, BRAM, and DSP. The RBs are vertical stacks composed of the same type of resources and match the reconfiguration granularity. Hence, *RB _{1}* is 20 CLBMs, *RB _{2}* is 20 CLBLs, *RB _{3}* is 4 BRAMs, and *RB _{4}* is 8 DSPs. We have assigned 20, 80, 192, and 340 as *RBCost*, respectively, for , , , and . Configuration overhead is determined as described in (7) by considering that each task defines an RZ. After synthesizing hardware tasks with the ISE tool, they are modeled under their RB-model reported in [3]. The partial reconfiguration flow provided by the PlanAhead tool enables the floorplanning of hardware tasks on the chosen device to create their bitstreams independently for estimating configuration overheads. The estimation of configuration overheads considers the best case fitting of each task, as we believe the subproblem of partitioning/fitting RZs is solved efficiently and independently from the subproblem of task mapping. We rely on parallel 8-bit-wide configuration ports and use 100 MHz as the configuration clock frequency. Preemption points are determined arbitrarily according to the granularity of hardware tasks and their WCET. For all tasks, we consider the first preemption point is equal to 0 s.
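To give an order of magnitude for what an 8-bit configuration port clocked at 100 MHz implies, the following sketch estimates the transfer time of a partial bitstream. It assumes one full port word is transferred per clock cycle with no protocol overhead, and the bitstream size is a hypothetical input; it is a back-of-the-envelope bound, not the overhead model of (7).

```python
def config_time_seconds(bitstream_bytes: int,
                        port_width_bits: int = 8,
                        clock_hz: float = 100e6) -> float:
    """Lower bound on the time needed to push a bitstream through the port.

    Assumption: one port word per clock cycle, no setup or protocol overhead.
    """
    bytes_per_cycle = port_width_bits / 8
    cycles = bitstream_bytes / bytes_per_cycle
    return cycles / clock_hz


# Hypothetical 100 kB partial bitstream on the 8-bit, 100 MHz port.
t = config_time_seconds(100_000)
```

Under these assumptions the port sustains at most 100 MB/s, so the configuration overhead grows linearly with the size of the region being reconfigured.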

##### 5.2. Results

We applied the three levels of our resource management on the application and obtained the following results.

###### 5.2.1. Level 1 Results

Step creates 6 types of RZ depicted in Table 3. is created by MDCT and VGA tasks, by AES, by DDS, by T48, by JPEG, and MULTF and by FIRs. If the RBs of one type of RZ are not constructed from the same task (i.e. there exist some RBs created by other tasks), the configuration overhead corresponding to this RZ must be recomputed as described in (7) before performing Step . In our application, the RBs of all RZs are created by one task. Hence, the RZ configuration overhead is provided by the predefined features of hardware tasks in Table 2.

During Step , the *D* costs between tasks and RZs are computed; in Table 3, the column specific for each task and its instances shows the values of obtained *D* costs*. *Thereafter, step calculates the workloads of obtained RZs by assigning to each RZ the tasks giving lowest cost *D *as mentioned by the bold numbers. WorkLoad values are presented in the first column of Table 3. For example, the workload of is computed by assigning the hardware tasks AES, MULTF, and VGA. We detected an overload in and . The overloads on these RZs are the result of the assigned hardware task execution time as well as the overheads, especially the configuration overheads of the RZs. These overloads are dealt with in step .

To resolve these overloads, step adds two other RZs having the same type of *RZ _{2}* since the other RZs cannot totally resolve its overload. In fact, the only possible migration is performed by on its second 1650 s preemption point which loads off up to 299%. Whereas, the overload of could be resolved by performing the migration of two tasks among , , and on on their second preemption points that is, 120 s, since is the least loaded. Consequently, the final number of RZs is equal to 8: , (, , ), , , , and .

###### 5.2.2. Level 2 and Level 3 Results

Among all the available solvers [30], our work is based on the AIMMS environment [31], which relies on powerful solvers. The AIMMS environment resolved the two subproblems independently: partitioning/fitting of RZs and fitting of tasks on RZs, both based on the Branch and Bound method. For the subproblem of partitioning/fitting of RZs, we used the Mixed Integer Programming model, and for task fitting we employed the Mixed Integer Nonlinear Programming model. The resolution ran on a 2 GHz CPU with 2 GB of RAM; at its end, each RZ is fitted on its most suitable RPB and each preemption point of each task is mapped to a unique RZ. The resolution respects the consistency of constraints and extracts the optimal solution.

The subproblem of RZ partitioning/fitting was resolved after 2 hours and 30 minutes. Table 4 shows the four coordinates of the most suitable RPBs for RZ fitting on the initial RR limited on the Virtex 5 FX200 device ( RBs) as depicted in Figure 7. The initial RR is defined by the designer and by taking into account the set of resources dedicated only for the static part.

Table 5 shows comparison between the RBs of the obtained RPBs and their RZs. We observe high resource efficiency as the number of RBs within the RPBs is nearly equal to that of the RZs. Consequently, the internal fragmentation within the RPBs is considerably reduced. The differences of RBs are given by the last column () of Table 5. For all RZs, except and , their corresponding RPBs have one RB in excess. The RPB of contains 3 extra and the RPB of strictly contains the required number of each RB type. The nonnull is explained by the rectangular shape of the RPBs, thus the number of RBs included in the RPB could exceed the required RBs in the RZ. The nonnull is also due to the heterogeneity on the device; the partitioning could book some RBs which are not used by the RZ.

Figure 7 depicts the floorplanning of RPBs conducted on the Virtex 5 FX200 device. The obtained results show an average resource utilization of 36% of the available resources on the initial RR. This average is computed according to the number and the cost of each RB type. Optimizing resource utilization minimizes the FPGA area that is reconfigured at runtime. We also created a static design by floorplanning each instance of each hardware task on its RPB without using dynamic partial reconfiguration. The resource utilization of this static design is 63% of the available resources on the initial RR. Therefore, the configuration overhead saved by a static design is paid for by resource waste: compared to the static design, our results employing dynamic partial reconfiguration achieve a gain of 43% in resource utilization. The RPBs are closely packed on the initial RR, which avoids resource waste and external fragmentation in the device. For this reason, the initial RR could be resized in order to dedicate sufficient space for the remaining static part, as depicted in Figure 7 by the final RR.
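The reported gain can be checked against the two utilization figures. Assuming the gain is the relative reduction in resource utilization with respect to the static design (the text does not state the formula explicitly), the numbers are consistent:

$$\text{gain} = \frac{63\% - 36\%}{63\%} = \frac{27}{63} \approx 0.43$$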

Once the final RPB floorplanning is conducted, the final configuration overheads corresponding to RZs are determined. Thus, these new values are used to resolve the subproblem of task mapping to RZs.

The subproblem of task fitting on RZs was solved within 9 seconds. Figure 8 shows the results of preemption point mapping of hardware tasks in the application. and start on , they are preempted after 85% of their WCET and continue their execution on . is mapped first to , it is stopped on on its second preemption point that is, after 33% of execution and restarts on up to 54%. At this third preemption point, migrates to where it completes its execution. , , , , and are totally mapped to . , , , *, *and are totally mapped, respectively to , , , , and .

As shown in Figure 9, task *T _{6}* starts its execution on *RZ _{8}* then is preempted on its third preemption point that is, after 42% of execution. resumes its execution on till 72% of its execution. Hence, it is preempted again on on the fourth preemption point and restarts on where it is achieved.

This resolution of mapping hardware tasks to RZs discards the problem of task rejection as it guarantees an execution unit for each task to achieve its execution by respecting its predefined preemption points.

The bar chart on Figure 10 presents obtained RZ workloads after task mapping resolution. Except having a workload of 44%, the remainder RZs have balanced workloads which are closest to the average workload equal to 89%.

The bar chart in Figure 11 depicts the task occupation rates on the RZs as a result of mapping. This bar chart shows the number of migrations that must be performed to avoid the RZ overloads and ensure total execution of tasks. We obtained 6 migrations. realizes two migrations. In fact, is mapped to , , and . The mapping of combines the two objectives expressed by *Map2,* which is the minimization of configuration overhead and *Map4 *which optimizes utilization of costly resources. Similarly is mapped to , , and . and sustain both, migration from to .

If we consider independent tasks, within each RZ respecting 100% as workload bound, execution sections mapped to this RZ could be scheduled by Earliest Deadline First (EDF) algorithm as they respect the monoprocessor sufficient schedulability condition ().
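The schedulability check mentioned above can be sketched per RZ as a utilization test. The sketch assumes independent periodic execution sections whose deadlines equal their periods and whose execution times already include the configuration and functional overheads; the function names and the `(execution_time, period)` tuple layout are hypothetical.

```python
from typing import List, Tuple


def utilization(sections: List[Tuple[float, float]]) -> float:
    """Sum of execution_time / period over the sections mapped to one RZ."""
    return sum(exec_time / period for exec_time, period in sections)


def edf_schedulable(sections: List[Tuple[float, float]]) -> bool:
    """EDF utilization test on a single execution unit: total load at most 100%."""
    return utilization(sections) <= 1.0
```

An RZ whose workload stays within the 100% bound passes this test, which is why the non-overload constraint on RZs matters for the later scheduling step.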

After RZ fitting on their RPBs and according to the obtained task mapping, the final RPB configuration overheads are determined. In the worst case, by counting all the possible preemption points, the total overhead including configuration overheads and functional overheads of tasks is 72959 s. The overall overhead is 11% of total running time.

Table 6 gives some comparisons with previous work in hardware task placement in terms of task rejection, resource utilization, configuration overhead, and the complexity of the performed technique. To the best of our knowledge, several placement algorithms have been proposed for each goal. These algorithms can be classified as metaheuristics or online heuristics. Nevertheless, each of them targets only one goal and does not take the other goals into account. In contrast to [11, 32], which optimize the configuration overhead only for two or three tasks during scheduling (18%, 8%), our multiobjective placement computes the worst-case configuration overhead before scheduling (11%) and targets an application of 14 tasks. Compared to [10] (sets of 3,000 homogeneous tasks) and [1] (10,000 homogeneous tasks), which are applied to homogeneous devices, we have efficiently reduced the resource utilization (36%) for an application of 14 heterogeneous tasks while taking into account the heterogeneity of recent reconfigurable devices. The heterogeneous resources in FPGA could fit the RZs on large RPBs giving a significant resource waste. Compared to the offline simulated annealing in [4], performed for 100 tasks and producing 13% of task rejection, our three-level offline placement totally eliminates task rejection for an application of 14 tasks.

#### 6. Conclusion and Future Works

In this paper, we propose novel three-level resource management that investigates enhancing placement quality by reducing task rejection, resource waste, and configuration overheads. Our work is based on the optimization of resource utilization relying on the features of heterogeneous reconfigurable devices and on constrained optimization problems. We reported on resolution showing an improvement of placement quality compared to previous works. In fact, the overall overhead is 11% of total running time, the average resource utilization is 36% of the available resources on the reconfigurable region and we enhanced resource utilization by 43% compared to static design. In addition, the non-overload in some RZs improves the possibility of mapping other tasks by respecting their deadlines without performing another resource allocation. Unlike the works achieved in placement that deal with independent tasks, we eliminated the issue of task rejection as we pack tasks on RZs and employ dynamic partial reconfiguration.

Our results differ from those of other works on hardware task placement since the experimental conditions are not the same. In fact, we used recent FPGA technology that provides the highest configuration frequency and wide data ports, which reduce configuration overhead. The improvement of resource utilization is explained by the optimality of our solution. The elimination of task rejection is due to the use of dynamic partial reconfiguration to map several tasks to the same RZ instead of dealing with each task placement independently. We have exploited powerful solvers that ensure the search runs to completion. Consequently, the optimal solution provides the lowest rates of resource utilization and configuration overhead.

To conclude, it is recommended to take advantage of the obtained results for purposes of establishing the rules of hardware task scheduling for real-time applications. By attempting to follow the obtained partitioning/fitting, we guarantee the highest placement quality equal to or better than that obtained in offline three-level resource management.

Further work includes defining an efficient online scheduling guided by the obtained offline results. Moreover, we aim to improve our offline task mapping by adding precedence constraints as well as deadline and periodicity constraints to achieve an offline mapping/scheduling of hardware tasks. Consequently, we would ensure a full offline placement/scheduling of hardware tasks on hardware reconfigurable devices. The dependency between tasks should be investigated, especially by considering intertask communication together with the overall overhead presented in this paper. Intertask communication will be an important criterion in deciding the optimal RZ fitting. Intertask communication relies on the management of an efficient communication network such as FAT-Tree [33] as well as on the management of a shared memory.

#### Acknowledgments

This paper was supported by AIMMS technical support and Xilinx tools. It is sponsored by the national agency of research in France and the world-ranking “Secured Communicating Solutions” (SCS) cluster, which brings together industrial, research, and higher education players involved in the microelectronics, telecommunications, software, and multimedia sectors.

#### References

- A. A. El Farag, H. M. El-Boghdadi, and S. I. Shaheen, “Improving utilization of reconfigurable resources using two dimensional compaction,” in *Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '07)*, pp. 135–140, Nice, France, April 2007.
- http://www.polytech.unice.fr/~fmuller/fosfor/, FOSFOR. 2008.
- I. Belaid, F. Muller, and M. Benjemaa, “Off-line placement of hardware tasks on FPGA,” in *Proceedings of the 19th International Conference on Field Programmable Logic and Application (FPL '09)*, pp. 591–595, Prague, Czech Republic, September 2009.
- K. Bazargan, R. Kastner, and M. Sarrafzadeh, “Fast template placement for reconfigurable computing systems,” *IEEE Design and Test, Special Issue on Reconfigurable Computing*, vol. 17, no. 1, pp. 68–83, 2000.
- E. G. Coffman Jr., M. R. Garey, and D. S. Johnson, *Approximation Algorithms for Bin Packing: A Survey*, Chapter 2, PWS Publishing Company, Boston, Mass, USA, 1997.
- H. Walder, C. Steiger, and M. Platzner, “Fast online task placement on FPGAs: free space partitioning and 2D-hashing,” in *Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS '03)*, p. 178, Nice, France, April 2003.
- A. Ahmadinia, C. Bobda, M. Bednara, and J. Teich, “A new approach for on-line placement on reconfigurable devices,” in *Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS '04)*, vol. 4, p. 134, Santa Fe, NM, USA, April 2004.
- T. Marconi, Y. Lu, K. Bertels, and G. Gaydadjiev, “Intelligent merging online task placement algorithm for partial reconfigurable systems,” in *Proceedings of the Design Automation Test Europe (DATE '08)*, pp. 1346–1351, Munich, Germany, March 2008.
- M. Handa and R. Vemuri, “An efficient algorithm for finding empty space for online FPGA placement,” in *Proceedings of the 41st Design Automation Conference (DAC '04)*, pp. 960–965, San Diego, Calif, USA, June 2004.
- H. ElGindy, M. Middendorf, H. Schmeck, and B. Schmidt, “Task rearrangement on partially reconfigurable FPGAs with restricted buffer,” in
*Proceedings of the Field Programmable Logic and Applications*, vol. 1896, pp. 379–388, Vienna, Austria, August 2000. - J. Resano, D. Mozos, D. Verkest, S. Vernalde, and F. Catthoor, “Run-time minimization of reconfiguration overhead in dynamically reconfigurable systems,” in
*Proceedings of the International Conference on Field Programmable Logic and Application*, vol. 2778 of*Lecture Notes in Computer Science*, pp. 585–594, Lisbon, Portugal, September 2003. View at Scopus - E. M. Panainte, K. Bertels, and S. Vassiliadis, “FPGA-area allocation for partial run-time reconfiguration,” in
*Proceedings of the Design Automation Test Europe (DATE '05)*, pp. 100–105, Munich, Germany, March 2005. - A. Lodi, S. Martello, and M. Monaci, “Two-dimensional packing problems: a survey,”
*European Journal of Operational Research*, vol. 141, no. 2, pp. 241–252, 2002. View at Publisher · View at Google Scholar · View at Scopus - A. Lodi, S. Martello, and D. Vigo, “Neighborhood search algorithm for the guillotine non-oriented two-dimensional bin packing problem,” in
*Proceedings of the Meta-heuristics : Advances and Trends in Local Search Paradigms for Optimization*, pp. 125–139, Sophia Antipolis, France, July 1997. - A. Lodi, S. Martello, and D. Vigo, “Heuristic and metaheuristic approaches for a class of two-dimensional bin packing problems,”
*INFORMS Journal on Computing*, vol. 11, no. 4, pp. 345–357, 1999. View at Google Scholar · View at Scopus - B. S. Baker, E. G. Coffman Jr., and R. L. Rivest, “Orthogonal packings in two dimensions,”
*SIAM Journal on Computing*, pp. 846–855, 1980. View at Google Scholar · View at Scopus - S. Martello and D. Vigo, “Exact solution of the two-dimensional finite bin packing problem,”
*Management Science*, vol. 44, no. 3, pp. 388–399, 1998. View at Google Scholar · View at Scopus - K. Danne and S. Stuehmeier, “Off-line placement of tasks onto reconfigurable hardware considering geometrical task variants,” in
*From Specification to Embedded Systems Application*, vol. 184 of*International Federation for Information Processing*, Springer, New York, NY, USA, 2005. View at Google Scholar - K. Bazargan, R. Kastner, and M. Sarrafzadeh, “3-D floorplanning: simulated annealing and greedy placement methods for reconfigurable computing systems,”
*Design Automation for Embedded Systems*, vol. 5, no. 3, pp. 329–338, 2000. View at Google Scholar · View at Scopus - S. P. Fekete, E. Kohler, and J. Teich, “Optimal FPGA module placement with temporal precedence constraints,” in
*Proceedings of the Conference Design Automation and Test in Europe*, pp. 658–665, Munich, Germany, 2001. - J. Teich, S. P. Fekete, and J. Schepers, “Optimization of dynamic hardware reconfigurations,”
*The Journal of Supercomputing*, vol. 19, no. 1, pp. 57–75, 2001. View at Publisher · View at Google Scholar · View at Scopus - F. Rivoallon and A. Cosoroaba,
*Achieving Higher System Performance with the Virtex 5 Family of FPGAs*, Xilinx White Paper, San Jose, Calif, USA, 2006. - I. Belaid, F. Muller, and M. Benjemaa, “Off-line placement of reconfigurable zones and off-line mapping of hardware tasks on FPGA,” in
*Proceedings of the Design and Architectures for Signal and Image Processing (DASIP '09)*, Sophia Antipolis, France, September 2009. - J. Clausen,
*Branch and Bound Algorithms-Principles and Examples*, University of Copenhagen, Copenhagen, Denmark, 1999. - G. Pataki, M. Tural, and E. B. Wong, “Basis reduction and the complexity of branch-and-bound,” in
*Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms*, pp. 1254–1261, Austin, Tex, USA, January 2010. View at Scopus - S. M. Azam, M. ur-Rehman, A. K. Bhatti, and N. Daudpota, “Parallel branch and bound model using logarithmic sampling (PBLS) for symmetric traveling salesman problem,” in
*Proceedings of the World Academy of Science, Engineering and Technology*, vol. 6, pp. 66–69, June 2005. - J.-K. Hao, P. Galinier, and M. Habib, “Métaheuristiques pour l'optimisation combinatoire et l'affectation sous contraintes,”
*Revue d'Intelligence Artificielle*, vol. 13, no. 2, pp. 283–324, 1999. View at Google Scholar · View at Scopus - http://opencores.org/.
- “Virtex-5 FPGA Configuration User Guide,” Xilinx white paper, August 2009.
- A. Neumaier, O. Shcherbina, W. Huyer, and T. Vinkó, “A comparison of complete global optimization solvers,”
*Mathematical Programming*, vol. 103, no. 2, pp. 335–356, 2005. View at Publisher · View at Google Scholar · View at Scopus - http://www.aimms.com/.
- J. A. Clemente, C. González, J. Resano, and D. Mozos, “A hardware task-graph scheduler for reconfigurable multi-tasking systems,” in
*Proceedings of the International Conference on Reconfigurable Computing and FPGAs*, pp. 79–84, Cancun, Mexico, December 2008. View at Publisher · View at Google Scholar · View at Scopus - L. Devaux, D. Chillet, S. Pillement, and D. demigny, “Flexible communication support for dynamically reconfigurable fpgas,” in
*Proceedings of the 5th Southern Conference on Programmable Logic (SPL '09)*, pp. 65–70, São Paulo, Brazil, April 2009. View at Publisher · View at Google Scholar · View at Scopus