Research Article  Open Access
Ikbel Belaid, Fabrice Muller, Maher Benjemaa, "New Three-Level Resource Management Enhancing Quality of Offline Hardware Task Placement on FPGA", International Journal of Reconfigurable Computing, vol. 2010, Article ID 980762, 20 pages, 2010. https://doi.org/10.1155/2010/980762
New Three-Level Resource Management Enhancing Quality of Offline Hardware Task Placement on FPGA
Abstract
Currently, reconfigurable hardware devices feature a high density of heterogeneous resources to enable multitasking and to offer flexibility to applications. These capabilities raise the need for efficient management of hardware tasks and hardware resources. The scheduling of hardware tasks depends strongly on placement, which allocates the hardware resources required by the scheduled tasks. In this paper, we propose a novel three-level resource management approach that enhances placement quality by reducing task rejection and configuration overheads and by optimizing resource utilization. Improving placement quality yields significant gains in scheduling performance and in the overall execution time of the application on the FPGA. The placement problem is formulated as a constrained optimization problem and resolved with powerful solvers using the Branch and Bound method. For an application of heterogeneous hardware tasks, the obtained results show an average utilization of 36% of the resources available in the reconfigurable region and an overall overhead of 11% of the total application running time, with task rejection eliminated. Compared to a static implementation, the gain in resource utilization within the reconfigurable region reaches up to 43%.
1. Introduction
Scheduling and placement are strongly linked. The scheduler decides which of the ready tasks should be executed next and calls the placer to find a feasible location. The scheduling decision must be consistent with the placer's ability to allocate the free resources required by the elected task. The Field-Programmable Gate Array (FPGA) is the most widely used reconfigurable hardware device. Today's FPGA devices provide several million reconfigurable heterogeneous resources. Dynamic partial reconfiguration allows reconfiguring only the necessary part of the FPGA, when required, without interfering with the other parts running on the same FPGA. While this technique can increase device utilization and improve scheduling and application performance, it also leads to high configuration overhead, fragmentation, and complex allocation situations for hardware tasks [1]. Existing placement methods frequently face these issues; consequently, placement quality and scheduling performance degrade while the overall response time increases. There is therefore a serious need for an efficient method of managing the resource area.
In general, the placement of hardware tasks consists of two main functions: (i) partitioning, which handles the free space in the device and identifies potential sites for executing hardware tasks, and (ii) fitting, which selects the feasible placement solution. In this paper, within the FOSFOR project [2], we address the placement problem and introduce a new three-level offline resource management approach that tackles all the above-mentioned issues. The main concern of our method is to enhance placement quality by targeting the optimized use of FPGA resources while taking into account the physical and functional features of hardware tasks. (FOSFOR (Flexible Operating System For Reconfigurable platform) is a French national program (ANR) targeting the most advanced technologies. Its main objective is to design a real-time operating system, distributed over hardware and software execution units, that offers the required flexibility to application tasks through dynamic reconfiguration mechanisms and homogeneous Hw/Sw OS services.) In designing the three-level resource management, we rely on the physical architecture of the target technology and on the advantages of dynamic partial reconfiguration. The contributions of this paper are (i) an offline flow for hardware task classification, which enhances our previously proposed work in [3] and creates task classes, and (ii) a formulation of task placement as a constrained optimization problem and its resolution by powerful solvers using the Branch and Bound method, organized in two independent sublevels: the first sublevel fits the obtained task classes onto physical blocs partitioned on the target technology, improving resource efficiency up to 36%, while the second sublevel maps tasks to the obtained classes, optimizing the overall overhead down to 11% of the total running time and minimizing the number of task relocations.
The remainder of this paper is organized as follows. Section 2 reviews related work on placement. Section 3 details our three-level resource management. Section 4 focuses on the formulation of hardware task placement as a constrained optimization problem and its resolution by the Branch and Bound method. The obtained results are presented in Section 5. In Section 6, we summarize our work, draw some conclusions, and outline future work.
2. Related Work
Current strategies dealing with task placement are divided into two categories: offline placement and online placement.
2.1. Online Methods for Hardware Task Placement
The main reference is [4], in which Bazargan et al. suggest an online scenario for hardware task placement. For fast online placement, Bazargan et al. introduce two partitioning techniques. The first, Keeping All Maximal Empty Rectangles (KAMER), searches all the Maximal Empty Rectangles (MER) after each task insertion/deletion operation; the MERs are the empty rectangles not contained in any other empty rectangle, and they are not necessarily disjoint. The second, Keeping Nonoverlapping Empty Rectangles, keeps all the nonoverlapping holes and is invoked after each split/merge operation. For both partitioning techniques, fitting in the online placement is conducted by the First Fit, Best Fit (BF), and Bottom Left bin-packing algorithms [5].
Fast two-dimensional online placement is presented in [6] as an extension of Bazargan's Keeping Nonoverlapping Empty Rectangles partitioning. In [6], Walder et al. propose placement methods that rely on efficient partitioning algorithms, improving the quality of Bazargan's partitioning by 70%, and on a hash matrix data structure that finds a feasible placement in constant time. Assuming a nonpreemptive system without precedence constraints, Walder's partitioner delays the split decision instead of using Bazargan's split heuristics. Reference [6] details four enhancements of Bazargan's partitioner, the main one being the On-The-Fly (OTF) partitioner, which resizes empty rectangles only if a newly arrived task overlaps them. After each task insertion or deletion, Walder's partitioner updates the data in the hash matrix. If a newly arrived task fits into more than one empty rectangle determined by the hash matrix, a fitting strategy chooses a rectangle [6]. Reference [6] implements four types of fitting: BF, Worst Fit, Best Fit with Exact Fit, and Worst Fit with Exact Fit.
Ahmadinia et al. present in [7] a new online placement method that manages the occupied space instead of the free space, owing to the difficulty of managing empty space and the rapid growth of the number of empty rectangles. In [7], when a new task arrives, the space manager starts by delimiting the Impossible Placement Region (IPR) relative to the placed modules and to the device. Thereafter, the Nearest Possible Position fitter selects the point outside the IPR that gives the optimal communication cost.
In [8], Marconi et al. extend Bazargan's placement with an Intelligent Merging (IM) algorithm. IM dynamically combines three techniques for managing free resources: Merging Only if Needed, Partially Merging, and Direct Combine. IM accelerates Bazargan's partitioner by a factor of three and improves placement quality by increasing the rate of accepted tasks.
Handa et al. introduce the staircase method in [9]. The staircase method handles free space during the first subfunction of online placement. It is considered efficient because it addresses the shortcomings of the KAMER method, especially task rejection.
Some approximate metaheuristics have been adopted to solve hardware task placement, such as [10], which employs online task rearrangement based on a genetic algorithm approach. When a newly arrived task cannot be placed immediately, the proposed approach tries to rearrange a subset of the tasks executing on the FPGA so that the pending task can be processed sooner. The approach builds on the First-Fit strategy and combines two genetic algorithms, allowing task rotation and using an input buffer to save the data of suspended tasks. The first genetic algorithm identifies a feasible rearrangement, and the second schedules the moves of the executing tasks to reach that rearrangement.
Reference [11] copes with the task placement problem and adopts an interconnection-based FPGA as support for runtime reallocation of hardware tasks. It applies the Matador task concurrency management methodology for scheduling hardware tasks on identical tiles while minimizing runtime reconfiguration. This goal is reached by two techniques employed in the scheduler, named configuration reuse and configuration prefetch. The reduction in configuration overhead significantly decreases execution time and energy consumption.
2.2. Offline Methods for Hardware Task Placement
In the offline scenario for hardware task placement, [4] defines 3D templates depicting tasks in the time and space dimensions and uses slow heuristics, namely simulated annealing, greedy search, and KAMER-BF, to perform a high-quality placement in terms of resource utilization and task rejection.
Reference [12] models the problem of resource allocation as a 0-1 integer linear programming problem that aims at minimizing the resource area reconfigured at runtime. Reference [12] considers an application with a known sequential execution trace; the large configuration latency is tackled by reducing the overlapped areas between tasks.
By treating hardware tasks as rectangular items to place on a rectangular hardware device, several approaches solve placement as a two-dimensional packing problem. For example, in [13], the offline approximate heuristics Next-Fit Decreasing Height, First-Fit Decreasing Height, and Best-Fit Decreasing Height are presented as strip-packing approaches that pack items by levels.
In addition, Lodi et al. [14] propose different offline approaches that treat hardware task placement as a 2D bin-packing problem, for instance the Floor-Ceiling algorithm, which packs tasks in alternate directions: from left to right when their bottom edges touch the level floor, and from right to left when their top edges touch the level ceiling. The Knapsack packing algorithm proposed in [15] initializes each level with the tallest unpacked item and completes it by packing tasks according to an associated Knapsack problem that maximizes the total area within the level. For both algorithms, the second phase of 2D bin packing is achieved by the Best-Fit Decreasing algorithm.
Baker et al. define in [16] the offline Bottom-Left (BL) algorithm, which packs each hardware task at the bottom-most, leftmost feasible position.
Martello and Vigo propose in [17] an enumerative offline approach to the exact solution of 2D bin-packing. Their algorithm is based on a two-level branching scheme: an outer branch-decision tree that assigns tasks to bins without specifying their positions, and an inner branch-decision tree that enumerates all possible patterns.
Optimizing the total execution time and resource utilization, the placement method in [18] consists of two phases. The first phase is a recursive bipartitioning, by means of a slicing tree, that defines the relative position of each hardware task with respect to the others and finds an appropriate room in the reconfigurable device for each hardware task according to its resources and inter-task communication. The second phase uses the obtained room topology to perform sizing, which computes the possible sizes of each room.
Reference [19] minimizes task rejection and presents offline algorithms for 3D floorplanning of hardware tasks on a reconfigurable functional unit, such as KAMER-BF Decreasing, Simulated Annealing, Low-temperature Annealing, and Zero-temperature Annealing. The 3D placement models tasks as 3D boxes whose base corresponds to the spatial dimensions of the task and whose height corresponds to its timespan.
In [20], Fekete et al. propose an offline bin-packing approach based on a graph-theoretical characterization of the packing of a set of items into a single bin. Tasks are represented as three-dimensional boxes, and the feasibility of a packing is decided as an orthogonal packing problem within a given container. Their approach considers packing classes, precedence constraints, and edge orientations to solve the packing problem. Similarly, in [21], Teich et al. define task placement as a multidimensional packing problem. Tasks are modeled as 3D polytopes with two spatial dimensions and the computation time. Based on packing classes and a fixed schedule, they search for a feasible placement on a fixed-size chip accommodating the set of tasks. The resolution is performed by a Branch and Bound technique to optimize dynamic hardware reconfiguration.
The major shortcoming of all the above methods is that they are applicable only to homogeneous devices. Indeed, these methods assume that task relocation is allowed and that resources can be allocated wherever sufficient free space is available. Furthermore, they disregard routing constraints, as they do not address inter-task communication and I/O routing. Moreover, tasks are assumed nonpreemptive and almost identical. The 2D packing algorithms focus only on minimizing resource waste and do not satisfy the other goals. All the existing placement strategies provide no guarantees, as they suffer from task rejection and fragmentation. We believe the issue of task rejection is caused by the constructive way in which placement is performed in all existing strategies. Fragmentation may lead to undesirable situations in which a new task cannot be placed even though sufficient free space exists.
Since we have full knowledge of the set of hardware tasks and of the features of the reconfigurable device, we present in this paper a realistic three-level resource management solution as a new strategy for offline placement of hardware tasks on FPGA. This strategy aims at enhancing placement quality by mitigating the previously mentioned issues. Our method is technology-dependent and consistent with generic placement: the second level partitions the available resources in the FPGA according to the task classes provided by the first level, while the third level performs the fitting subfunction. The task model is preemptive, with predefined preemption points. Our resource management allows task relocation and produces strict positions for each hardware task, respecting its preemption points and resource types.
3. Three-Level Resource Management
We use Xilinx’s Virtex FPGA as the reference reconfigurable hardware device for our resource management study. We first define a few terms used throughout the paper: NT is the number of tasks, NR the number of Reconfigurable Physical Blocs, NZ the number of Reconfigurable Zones, and NP the number of resource types in the chosen technology. We begin by introducing the hardware task models; we have defined three.

(i) The functional model: this contains the functional features of a hardware task T_{i}, namely its worst-case execution time (C_{i}), its period (P_{i}), and its preemption points l (Preemp_{i,l}). The number of preemption points of T_{i} is denoted by NbrPreemp_{i}; this number also counts the first point of execution of T_{i}. Preemption points are specified by the designer.

(ii) The behavioral model: this includes the finite state machine controlling each task, which handles a bank of NbrReg 32-bit registers to conduct the context switch. The behavioral model defines the functional overhead (Context_{i}) needed to preempt or resume the execution of a task. This functional overhead is the same for all hardware tasks, as they have identical register banks of NbrReg registers. It is computed as two accesses (save and load) to a 32-bit-wide bus running at 80 MHz. We consider the worst case, in which a task needs all NbrReg 32-bit registers to perform its context switch. Hence, the functional overhead represents the sequential access to the NbrReg registers associated with a task, to save and load its context through the 32-bit bus (Context_{i} = 2 × NbrReg / 80 MHz). In our preemptive modeling, we do not use the classical method of bitstream readback and reload, since it incurs significant latency, complicates preemption, and requires a large memory space, as a new readback bitstream must be saved at each preemption. We therefore save the state of the finite state machine, an acceptable amount of data, and always keep the same bitstream for each task. Preemption points of hardware tasks are fixed so as to reduce the data dependency that could exist between two states. In fact, we must avoid placing a preemption point between two states processing the same data, because these data would have to be saved to an external memory, which might increase the runtime overhead. Conversely, it is recommended to put a preemption point where the task is blocked waiting for an external resource, to allow ready tasks to execute in the RZ. As tasks are periodic, a preemption point can also be inserted after the last state, before the task restarts, to avoid any data dependency. In the finite state machine, the longest execution time between two states must be considered in order to deduce the worst-case execution time.

(iii) The RB-model: tasks are represented as a set of reconfigurable resources called Reconfigurable Blocs (RB). The RBs closely match the reconfiguration granularity of the chosen technology. The determination of the RB-model of hardware tasks is detailed in our previous work [3]. Each RB type k is characterized by a cost RBCost_{k}, defined according to three parameters: the number of RBs of that type in the device, its power consumption, and the importance of its functionality. The more significant these parameters are, the higher the cost of the RB type. The RB-model of each hardware task is described by (1).
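The context-switch overhead of the behavioral model can be sketched numerically. The bus width and the 80 MHz frequency come from the text above; the register count used in the example is illustrative, not a value from the paper.

```python
# Sketch of the functional overhead Context_i = 2 * NbrReg / f_bus:
# NbrReg sequential 32-bit transfers, done twice (save + load),
# over a 32-bit bus running at 80 MHz (one transfer per bus cycle).

BUS_FREQ_HZ = 80e6  # 32-bit bus at 80 MHz (from the behavioral model)

def context_overhead(nbr_reg: int, bus_freq_hz: float = BUS_FREQ_HZ) -> float:
    """Worst-case context-switch overhead in seconds (save + load)."""
    return 2 * nbr_reg / bus_freq_hz

# e.g. a 16-register bank costs 2 * 16 / 80e6 = 0.4 microseconds
print(context_overhead(16))
```

Since the overhead is linear in NbrReg, doubling the register bank doubles Context_i.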
The functional model and the RB-model are the basic models for resource management, whereas the behavioral model is employed during scheduling. The FPGA provides reconfigurable resources organized in a column-based technology [22]. The management of hardware resources on the FPGA consists of three levels.
3.1. Level 1: Offline Flow of Hardware Task Classification
Level 1 takes a set of tasks as input and provides the types and instances of Reconfigurable Zones (RZ). The RZs are abstractions of task classes and are defined according to the types of resources needed by the tasks. As the RZs model the classes of hardware tasks, they are described by their RB-model given by
Level 1 consists of three steps:
Search of RZ types: Step 1 gathers the tasks sharing the same types of RBs under the same type of RZ. This step is essentially based on the RB-model of the hardware tasks and is achieved by Algorithm 1, whose worst-case complexity is O(NT × NP × NZ).

Step 1 scans the RB-model of each hardware task and checks whether the list of RZ types, ListRZ, already contains a type of RZ that closely matches the RB types required by the task (line 6). In this case, Step 1 updates the number of RBs within this RZ type to the maximum between the number of RBs in the task and that in the RZ (line 9). If the RB types required by the task do not match any RZ type in ListRZ, the algorithm creates a new RZ type as required by the task (line 13) and inserts it into ListRZ (line 14). At the end of Step 1, we obtain the possible types of RZs; the number of RZ types is bounded by the number of tasks. As shown in Figure 1, Step 1 groups T_{1} and T_{3} into the same RZ type (RZ_{1}), as both need RB_{1} and RB_{2}, and sets the number of each RB type within RZ_{1} to the maximum number of RBs between T_{1} and T_{3}. Similarly, RZ_{2} is created from T_{2} and T_{4}, and T_{5} defines the third RZ type (RZ_{3}).
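The grouping step above can be sketched as follows. This is a compact illustration of the idea behind Algorithm 1, not the paper's pseudocode; representing RB-models as dictionaries mapping an RB type to a count is our own convention, and the task values are illustrative.

```python
# Sketch of Step 1 (search of RZ types): tasks requiring the same set of
# RB types are grouped under one RZ type, which keeps the maximum count
# of each RB type over its tasks.

def search_rz_types(task_rb_models):
    list_rz = []  # each RZ type is a dict {rb_type: count}
    for rb_model in task_rb_models:
        for rz in list_rz:
            if set(rz) == set(rb_model):       # same RB types required
                for k, n in rb_model.items():  # keep the max per RB type
                    rz[k] = max(rz[k], n)
                break
        else:
            list_rz.append(dict(rb_model))     # create a new RZ type
    return list_rz

# T1, T3 share {RB1, RB2}; T2, T4 share {RB1, RB3}; T5 stands alone.
tasks = [{"RB1": 2, "RB2": 1},  # T1
         {"RB1": 1, "RB3": 2},  # T2
         {"RB1": 3, "RB2": 2},  # T3
         {"RB1": 2, "RB3": 1},  # T4
         {"RB2": 1, "RB4": 1}]  # T5
print(search_rz_types(tasks))  # three RZ types, as in the Figure 1 scenario
```

The scan is O(NT × NP × NZ), matching the stated worst-case complexity: each task compares its NP resource types against every RZ type created so far.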
Classification of hardware tasks: Step 2 starts by computing the cost D between hardware tasks and the RZ types resulting from Step 1. Based on the RB-models of the hardware tasks (T_{i}) and the RZs (RZ_{j}), cost D is computed according to two cases.
We define by
Case 1. RZ_{j} contains a sufficient number of each RB type required by T_{i}. In this case, cost D equals the sum of the differences in the number of each RB type between T_{i} and RZ_{j}, weighted by RBCost_{k} (see (4)).
Case 2. The number of RBs required by T_{i} exceeds the number of RBs in RZ_{j}, or T_{i} needs an RB type which is not included in RZ_{j}. In this case, the cost D between T_{i} and RZ_{j} is infinite (see (5)). Figure 2 illustrates the computation of the costs D between the five tasks and the RZ types described in Figure 1.
As shown in Table 1, Step 2 assigns each task to the RZ giving the lowest cost D, as described in the third column. For example, T_{1} is assigned to RZ_{1} since D (T_{1}, RZ_{2}) and D (T_{1}, RZ_{3}) are greater than D (T_{1}, RZ_{1}). Then, using (6), Step 2 computes the workload of each RZ according to this assignment and to the functional models of the hardware tasks
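The two cases of cost D and the lowest-cost assignment can be sketched as below. The RBCost values and RB-models are illustrative; the paper's equations (4) and (5) are paraphrased, with the surplus of the RZ over the task weighted by each RB type's cost.

```python
import math

# Sketch of cost D between a task and an RZ type.
# Case 1: the RZ has every RB type the task needs, in sufficient number;
#         D weighs the surplus RBs (resource waste) by their cost.
# Case 2: any shortage or missing RB type makes D infinite.

RB_COST = {"RB1": 1.0, "RB2": 2.0, "RB3": 1.5}  # illustrative RBCost_k

def cost_d(task_rb, rz_rb, rb_cost=RB_COST):
    for k, n in task_rb.items():
        if rz_rb.get(k, 0) < n:          # Case 2: infeasible
            return math.inf
    # Case 1: weighted surplus over all RB types of the RZ
    return sum((rz_rb[k] - task_rb.get(k, 0)) * rb_cost[k] for k in rz_rb)

def assign(task_rb, rz_types):
    """Step 2: assign the task to the RZ type with the lowest cost D."""
    return min(range(len(rz_types)), key=lambda j: cost_d(task_rb, rz_types[j]))

rz_types = [{"RB1": 3, "RB2": 2}, {"RB1": 2, "RB3": 2}]
t1 = {"RB1": 2, "RB2": 1}
print(cost_d(t1, rz_types[0]))  # (3-2)*1.0 + (2-1)*2.0 = 3.0
print(cost_d(t1, rz_types[1]))  # RB2 missing -> inf
print(assign(t1, rz_types))     # T1 goes to the first RZ type
```

An infinite D thus simultaneously encodes the infeasibility used later by the mapping constraint (16).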

The overhead Overhead_{j,i} is the sum of Config_{j}, specific to each RZ_{j}, and Context_{i} (save and load), common to all tasks. Config_{j} is the configuration overhead needed to place RZ_{j} on the target technology. We compute this configuration overhead by floorplanning each RZ_{j} on the chosen device and running the whole partial reconfiguration flow up to the creation of the partial bitstream. According to the configuration frequency and the configuration port, Config_{j} is determined by
For example, the workload of RZ_{1} resulting from T_{1} and T_{3} is 66%. The last column in Table 1 gives the costs D of the other tasks, not assigned to the RZ.
We notice an overload in RZ_{2}, caused by the execution workloads of T_{2} and T_{4} as well as by their overheads. This overload is resolved during Step 3.
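The workload computation can be sketched as a utilization-style sum. The paper's equations (6) and (7) are not reproduced here; this assumes the usual form (C_i + Overhead_{j,i}) / P_i summed over the tasks assigned to RZ_j, with Config_j modeled as bitstream size over configuration throughput (port width times configuration frequency). All figures are illustrative.

```python
# Hedged sketch of the RZ workload with configuration and context overheads.

def config_overhead(bitstream_bytes, port_width_bytes=4, cfg_freq_hz=100e6):
    """Config_j in seconds, assuming a 32-bit configuration port at 100 MHz."""
    return bitstream_bytes / (port_width_bytes * cfg_freq_hz)

def rz_workload(tasks, config_j, context):
    """tasks: list of (C_i, P_i) in seconds for the tasks mapped to RZ_j."""
    return sum((c + config_j + context) / p for c, p in tasks)

cfg = config_overhead(400_000)          # 1 ms for a 400 kB partial bitstream
load = rz_workload([(2e-3, 10e-3), (1e-3, 10e-3)], cfg, context=0.4e-6)
print(f"{load:.0%}")                    # utilization of this RZ
```

A workload above 100% flags the RZ as overloaded, triggering Step 3.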
Decision of increasing the number of RZs: Step 3 takes place only when an overload is detected within some RZs in Step 2 and is achieved by Algorithm 2. Step 3 aims to relieve the overload in RZs by migrating task execution sections to nonoverloaded RZs before resorting to increasing the number of overloaded RZs. Hence, for each task, we search all the possible combinations of task execution sections. For each overloaded RZ, Algorithm 2 searches all the nonoverloaded RZs that could accept at least one of its assigned tasks, that is, RZs giving a finite cost D with at least one of these tasks. Then, Algorithm 2 checks, task by task, whether an execution section or a combination of execution sections of the current task can migrate so as to reduce the overload of its RZ while respecting the workload of the nonoverloaded receiving RZ. In the worst case, the complexity of Algorithm 2 is O(M × N × NTM × TS), where M denotes the number of overloaded RZs, N the number of nonoverloaded RZs, NTM the maximum number of tasks assigned to an overloaded RZ, and TS the maximum number of execution section combinations for a task assigned to an overloaded RZ.

Step 3 groups the workloads of overloaded RZs in L1 (line 7) and the workloads of nonoverloaded RZs in L2 (line 7). Step 3 goes through the RZs in L1 to resolve their overloads independently (line 11), using the nonoverloaded RZs in L2 to lighten the workloads of the RZs in L1 (line 15). This step searches for the nonoverloaded RZ in L2 that gives a finite cost D with at least one task assigned to the overloaded RZ during Step 2 (line 19). Once Step 3 finds the set of tasks that could be executed in the nonoverloaded RZ, it balances the workloads between both RZs while respecting the tasks' preemption points (lines 22–37). If the overload persists in the RZ of L1, the algorithm decides to add further instances of this RZ, up to ⌈workload of RZ⌉ (line 42). When the processed nonoverloaded RZs do not affect the number of added instances of the overloaded RZ_{m}, Step 3 reinitializes their workloads to the values they had before dealing with the overload of RZ_{m} (line 43).
Without any loss of generality, our proposed strategy of resource management includes the main functions of generic placement: partitioning and fitting, which are fulfilled by the two following levels.
3.2. Level 2: Partitioning of Reconfigurable Physical Blocs on the Target Technology
Level 2 takes the types of RZs provided by level 1 as inputs and searches all the possible locations for them on the target device. These locations, called Reconfigurable Physical Blocs (RPB), are partitioned on the specified Reconfigurable Regions (RR) delimited in the target device. The RPBs are depicted by their RB-model as presented in
The RPBs must contain all the RB types required by the RZ type, and the number of RBs in an RPB must be greater than or equal to the number of RBs in the RZ. Figure 3 shows an example of RPBs partitioned on the RR and associated with an RZ requiring two RBs of one type and one RB of another. The RPBs are represented by the five dotted rectangles.
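The covering condition above can be sketched on a simplified column-based device model. Real Virtex floorplanning involves more constraints (frame boundaries, heights, clock regions); this sketch only shows the test that an RPB's resources cover an RZ's RB-model, with the column layout being an illustrative assumption.

```python
from collections import Counter

# Hedged sketch of level 2 partitioning: the reconfigurable region is
# abstracted as a left-to-right sequence of resource columns, and an RPB
# is a contiguous span of columns whose RB counts cover the RZ's RB-model.

def find_rpbs(columns, rz_rb):
    """columns: list of RB types, one per column; returns (start, end) spans."""
    rpbs = []
    for start in range(len(columns)):
        counts = Counter()
        for end in range(start, len(columns)):
            counts[columns[end]] += 1
            if all(counts[k] >= n for k, n in rz_rb.items()):
                rpbs.append((start, end))  # minimal span starting at `start`
                break                      # wider spans only add surplus
    return rpbs

region = ["RB1", "RB1", "RB2", "RB1", "RB2", "RB1"]
print(find_rpbs(region, {"RB1": 2, "RB2": 1}))  # candidate RPB column spans
```

Level 3 then selects among such candidates the nonoverlapping RPBs with the least surplus.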
3.3. Level 3: Two-Level Fitting
Level 3 consists of two independent sublevels. The first sublevel fits the RZs onto the most suitable nonoverlapped RPBs in terms of resource efficiency. The second sublevel maps tasks to RZs according to their preemption points, respecting the workload of each RZ and guaranteeing the complete execution of each task. This mapping essentially favors the solution giving the lowest overhead and the lowest cost D. The second fitting sublevel provides an execution unit for each task; consequently, the issue of task rejection no longer arises. The mapping of tasks to RZs relies heavily on dynamic partial reconfiguration, which enables multitasking and the execution of several hardware tasks on the same RZs: subareas of the FPGA can be reconfigured at runtime without affecting other running tasks.
4. Resolution of the Hardware Task Placement Problem
In our previous work [23], we proposed a straightforward method to resolve partitioning and fitting. However, with the rapid growth of resources in recent technologies and the increasing complexity of applications, this exhaustive search is not efficient: the partitioning/fitting problem is NP-complete, the search space is immense, and the time complexity of the algorithm proposed in [23] is exponential. In this paper, we formulate the partitioning/fitting problem as a constrained optimization problem. Our work is based on the non-exhaustive, complete method called Branch and Bound [24], which employs efficient techniques for scanning the search space and extracting the optimal solution.
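To make the method concrete, here is a toy Branch and Bound for a simplified version of the fitting subproblem: assign each task to one RZ, minimizing the total cost D while keeping every RZ workload at or below 100%. This is a generic illustration of the technique cited above, not the paper's actual solver model.

```python
import math

def branch_and_bound(costs, loads, capacity=1.0):
    """costs[i][j]: cost D of task i on RZ j (math.inf if infeasible).
    loads[i]: workload task i adds to its RZ. Returns (best_cost, mapping)."""
    n, m = len(costs), len(costs[0])
    best = [math.inf, None]

    def recurse(i, used, total, mapping):
        if total >= best[0]:
            return                       # bound: prune dominated branches
        if i == n:
            best[0], best[1] = total, mapping[:]
            return
        for j in range(m):               # branch on the RZ of task i
            if used[j] + loads[i] <= capacity and costs[i][j] < math.inf:
                used[j] += loads[i]
                mapping.append(j)
                recurse(i + 1, used, total + costs[i][j], mapping)
                mapping.pop()
                used[j] -= loads[i]

    recurse(0, [0.0] * m, 0.0, [])
    return best[0], best[1]

costs = [[3.0, math.inf], [1.0, 2.0], [math.inf, 1.5]]
print(branch_and_bound(costs, loads=[0.6, 0.5, 0.4]))
```

The bound (pruning any partial assignment already costlier than the incumbent) is what distinguishes Branch and Bound from exhaustive enumeration while keeping it complete.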
4.1. Formulation of Hardware Task Placement as a Constrained Optimization Problem
The partitioning/fitting problem is modeled as a constrained combinatorial optimization problem: it uses a discrete solution set, chooses the best solution out of all possible combinations, and aims at optimizing a multicriteria function. It is a mixed-integer problem, as it uses both integer and binary variables. The problem is described by the quadruplet (Constants, Variables, Constraints, Objective Function).
4.1.1. Constants
NT: number of tasks constituting the application, NZ: number of RZs resulting from level 1 (Offline flow of hardware task classification), NP: number of RB types existing in the target technology, D (T_{i}, RZ_{j}): the cost D between T_{i} and RZ_{j}, RBCost_{k}: the cost of each RB type.
Device Features
Device_Width: the width of the device, Device_Height: the height of the device, Device_RB: the RBmodel of the device.
Task Features
C_{i}: the Worst Case Execution Time (WCET) of T_{i}; P_{i}: the period of T_{i}; Preemp_{i,l}: the preemption point l of T_{i}; NbrPreemp_{i}: the number of preemption points of T_{i}; Context_{i}: the functional overhead for preempting and resuming T_{i}; Overhead_{j,i}: the full overhead for executing T_{i} on RZ_{j}; T_{i}_RB: the RB-model of T_{i}.
RZ Features
RZ_{j}_RB: the RB-model of RZ_{j}; Config_{j}: the configuration overhead of RZ_{j} in the target technology.
4.1.2. Variables
Features for Each RPB
the abscissa of the upper left vertex of the RPB; the ordinate of the upper left vertex; the abscissa of the upper right vertex; the ordinate of the bottom left vertex; the RB-model of the RPB constructed from the above coordinates.
The Task Preemption Points
Under the task functional model, the preemption points l of each task are assumed known. A Boolean variable controls whether preemption point l of T_{i} is mapped to RZ_{j}; it equals 1 when the point is mapped to RZ_{j}. PreempTask_{j,i,l}: each preemption point assigned to RZ_{j} is taken from the predefined preemption points of each task (see (9)). The sum of the preemption points of T_{i} performed within RZ_{j} is expressed by
The full occupation rate of T_{i} in RZ_{j} resulting from the mapping of preemption points of tasks to RZs. The occupation rate is computed as expressed in
AverageLoad: The average of RZ workloads obtained after task mapping is calculated by
4.1.3. Constraints
Heterogeneity Constraint
The RZs must be fitted on RPBs containing a sufficient number of their required types of RBs. This constraint must be respected during partitioning and fitting of RZs. The heterogeneity constraint is formulated by
Nonoverlapping between RPBs
As expressed by (14), this constraint restricts the fitting of RZs to nonoverlapped RPBs
Nonoverload in RZs
As stated in (15), the nonoverload constraint in RZs must be respected during the mapping of tasks
Infeasibility of Mapping for Preemption Points
This constraint prohibits the mapping of preemption points of tasks to RZs giving infinite cost D (see (16)).
Uniqueness of Preemption Points
This constraint requires that each preemption point l of T_{i} be mapped to a unique RZ; it guarantees the completion of task execution as well as the elimination of task rejection (see (17))
Domains of RPB Coordinates
Equation (18) defines the allowed domain of values that can be assigned to RPB coordinates during partitioning
4.1.4. Minimization Objective Function (F)
As with all optimization problems, we define a minimization objective function F that selects the optimal solution. As described by (19), F favors the solution giving the best values of the two objective subfunctions, MappingFunction and PlaceFunction
PlaceFunction focuses on the subproblem of fitting RZs on the most suitable RPBs after partitioning the target device by respecting the predefined constraints. In (20), PlaceFunction evaluates the efficiency of resources after fitting RZs on the selected RPBs. PlaceFunction promotes the fitting of RZs on the RPBs that strictly contain the number and type of RBs required by RZs
MappingFunction focuses on the subproblem of fitting hardware tasks on RZs by respecting the predefined preemption points. MappingFunction targets the full exploitation of RZs and the balance of their workloads. It also aims at mapping tasks to the RZs providing the lowest cost D in order to optimize resource utilization. Moreover, MappingFunction promotes the solution that minimizes the overall overhead and the number of task relocations. Thus, MappingFunction is expressed by four subfunctions targeting these goals
Map1 focuses on the RZ workloads by means of (22). Its first expression evaluates whether the RZs are fully exploited: by minimizing this expression, the RZ workloads approach 100% exploitation. Its second expression computes the variance of the RZ workloads around the obtained average workload; minimizing this second expression ensures load balancing between RZs
Where
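A minimal sketch of the two terms of Map1 follows, assuming equal weighting between the two expressions (the paper's equation (22) may weight them differently):

```python
def map1(loads):
    """Sketch of Map1: the first term penalizes under-exploited RZs (it is
    minimal when every workload reaches 100%); the second term is the
    variance of the workloads around their average, enforcing balance."""
    avg = sum(loads) / len(loads)
    under_use = sum((1.0 - l) ** 2 for l in loads)
    variance = sum((l - avg) ** 2 for l in loads) / len(loads)
    return under_use + variance
```

Note that for the same total load, balanced workloads yield a strictly smaller Map1 than unbalanced ones, which is exactly the behavior the minimization promotes.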
Map2 computes the overhead in (24) resulting from the mapping of task preemption points to the RZs. Map2 takes into account all the possible preemption points, even when two successive preemption points of one task are mapped to the same RZ, for obtaining the worst case overhead. In fact, the scheduler could preempt a task on these successive preemption points in the same RZ in favor of a higher priority task. Minimizing Map2 promotes the solutions that map the tasks to the RZs providing the lowest configuration overhead
By minimizing Map3 in (25), we promote a solution that maps tasks with a high occupation rate to the RZs giving the lowest cost D. The benefit of this minimization is optimizing the use of the available resources in the technology. Indeed, these D costs between tasks and RZs reveal the rate of resource waste. Moreover, these D costs consider the weight of each resource in terms of three parameters: its frequency of occurrence on the technology, the importance of its functionality, and its power consumption. Our objective is to minimize the utilization of these costly resources whenever possible. Hence, the more we promote the mapping of tasks with high occupation rates to the RZs giving the lowest cost D, the more we optimize resource utilization
Map4 computes in (26) the total number of task relocations obtained after such mapping. Although the migration of tasks between RZs solves the conflicts between tasks on the same RZ, minimizing the number of migrations optimizes the number of preemptions and overhead
4.2. Branch and Bound Method for Solving the Hardware Task Placement
Our work is based on the global optimization method called Branch and Bound, applied both to partitioning the resource space and to fitting RZs and tasks. The Branch and Bound method enumerates all possible solutions in an intelligent way by relying on the features of the specified problem, eliminating every partial solution that cannot lead to the optimal one. The method relies on a predefined bound function, which is our objective function F, to decide which solutions are excluded and which are kept as potential solutions. The performance of the Branch and Bound method depends on the quality of this function. Branch and Bound is suited to mixed integer linear and nonlinear programming.
Initially, only one subset of solutions exists, namely the root subset that contains all the solutions of the problem. The unexplored subsets are represented as nodes in a dynamically generated search tree, and each iteration of Branch and Bound processes one such node. An iteration has three components: selection of the node to process, branching, and computation of the bound function of the ramified nodes. For each node, the bound function computation considers the best fitting for the remaining RZs or tasks without constraints, whereas only nodes respecting the predefined constraints are ramified after node selection. Our strategy for selecting the next node is based on the value of the bound function F of the node: we start with the node giving the best F at the current level and follow a Depth First Search (DFS). This strategy explores, at each level, the node giving the best F until a complete solution is obtained, which yields the last best F. The process is then repeated for the next best node on the last visited level if its F does not exceed the last best F. In that case, we branch this node, compute the bound functions of its ramified nodes, and compare them to the last best F: partial solutions whose F is greater than or equal to the last best F are rejected; otherwise they are kept and the best partial solution is selected to be processed next. Once a new complete solution is found, the method checks whether its F improves on the last best F and, when it does, the last best F is updated. The process is repeated for all the next best nodes at each level as described above. Algorithm 3 describes the Branch and Bound method applied to our problem.
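The loop just described can be sketched compactly as a best-first Branch and Bound with an admissible (lower) bound. The hooks `branch`, `bound`, and `is_complete` are problem-specific (the partitioning/fitting subproblem would use PlaceFunction as the bound and task mapping would use MappingFunction); all names here are illustrative, not the paper's Algorithm 3.

```python
# Compact sketch of a best-first Branch and Bound loop with bound pruning.
import heapq

def branch_and_bound(root, branch, bound, is_complete):
    best_value = float("inf")
    best_solution = None
    live = [(bound(root), root)]          # the set "live", ordered by bound F
    while live:
        f, node = heapq.heappop(live)     # select the node with the best F
        if f >= best_value:               # cannot improve the last best F
            continue
        if is_complete(node):             # new complete solution found
            best_value, best_solution = f, node
            continue
        for child in branch(node):        # only constraint-respecting children
            cf = bound(child)
            if cf < best_value:           # keep only promising partial solutions
                heapq.heappush(live, (cf, child))
    return best_solution, best_value
```

Because children whose bound already meets or exceeds the last best F are never inserted into the live set, whole subtrees are discarded without enumeration, which is the pruning behavior described in the text.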

Partitioning the resource space and fitting RZs can be resolved independently from task mapping. In the subproblem of partitioning the resource space and fitting RZs, PlaceFunction is employed as the bound function F in Algorithm 3, while in the task mapping subproblem F corresponds to MappingFunction. To fit RZs on their suitable RPBs, the two subproblems of partitioning the RPBs on the target device and fitting RZs on the selected RPBs are performed in parallel. The resolution starts by partitioning the RPBs corresponding to the RZ root (for example RZ_{1}) with respect to the heterogeneity constraint. This partitioning is done randomly by assigning all potential values to the RPB coordinates, and the RZ root is fitted onto the RPB that gives the best PlaceFunction. After selection of the best RPB for the RZ root, that is, the selected node, this node is branched into all the possible RPBs for RZ_{2} by respecting the predefined constraints, and the PlaceFunction of each ramified node is computed. The first selected node at the RZ_{2} level must satisfy the heterogeneity constraint as well as the nonoverlapping constraint. The subset of solutions ramified from this first selected node at the RZ_{2} level is explored by searching the RPBs of RZ_{3}. The resolution is carried out similarly until the exploration branched from the first selected node is complete, which yields the last best PlaceFunction. The process is repeated for the next best nodes and, recursively, the remaining RZs are fitted on RPBs as described for the RZ root by respecting the predefined constraints. During exploration, if a partial solution gives a PlaceFunction worse than the last best PlaceFunction, this partial solution is rejected and its branching is stopped; otherwise this partial partitioning/fitting of RZs is kept as a potential partial solution.
During this repetitive process, all possible combinations of RPBs respecting the constraints are tested; when the process completes, the optimal solution is extracted.
Figure 4 illustrates the running of the Branch and Bound method on the partitioning/fitting of four RZs. The nodes marked with ✗ represent the subsets of solutions that do not contain the optimal solution, and those marked with √ represent potential solutions. At the RZ root level and the following level, the best node is selected, branched into partial solutions, and the PlaceFunction of its ramified nodes is computed. As can be noticed, two best fittings are processed at the RZ_{3} level. The method starts with the first best node, processes it, and branches it into two nodes corresponding to the RPBs of RZ_{4}. After computing the PlaceFunction of these ramified nodes, one is selected and a complete solution is obtained; the last best PlaceFunction is PlaceFunction41. The other node is rejected as it does not improve on PlaceFunction41. The next best node in the last visited level not yet processed is then kept and branched into two RPBs at the RZ_{4} level. PlaceFunction43 improves on the last best PlaceFunction and the solution is complete, thus the optimal solution is given by PlaceFunction43. The next iteration processes the next best node in the last visited level, the partial solution RPB_{2}RZ_{3}; it is rejected as PlaceFunction32 exceeds the last best PlaceFunction. The other nodes are also not processed, as their partial bound functions exceed the last best PlaceFunction. Finally, the optimal solution of the partitioning/fitting of RZs fits the RZ root on RPB_{i}RZroot and the remaining RZs on their selected RPBs.
The resolution of mapping starts by randomly assigning the preemption points of the task root to RZs giving finite cost D and computing the bound function of each node. The selected node is the one that gives the most optimal mapping of preemption points according to MappingFunction. This node is ramified for new processing of the next task; this ramification must respect the total execution of tasks as well as the nonoverload on RZs. Recursively, as for the task root, the remaining tasks are mapped by satisfying the two previous constraints and computing the MappingFunction of their ramified nodes. When a complete solution is obtained, it represents the last best MappingFunction. Similarly to the partitioning/fitting of RZs, the mapping is resolved recursively by computing the MappingFunction of each partial mapping branched from the selected best nodes and keeping only the ones improving on the last best MappingFunction. The mapping resolution is finished when all the potential mappings of preemption points of all tasks to RZs respecting the mapping constraints have been checked.
Figure 5 describes the progress of mapping three tasks to four RZs. The set of RZs giving finite cost D is provided for each task. At the Troot level, the resolution selects the first best node, that is, MappingFunction13; after branching it and computing the MappingFunction of the partial solutions, the next iteration is released on the second task level. At that level, MappingFunction23 is the best partial solution; its node is ramified at the third task level, and the last best MappingFunction is obtained by MappingFunction31. The process is repeated for the next best nodes in the last visited level. The nodes MappingFunction21 and MappingFunction22 are not processed since they exceed the last best MappingFunction. The method then processes the next best node in the last visited level, which is MappingFunction11. The branching of this node does not improve on the last best MappingFunction, so its ramified nodes are not explored at the lower level. Finally, the optimal solution maps the first preemption point P1 of the task root to one RZ and its remaining preemption points P2 and P3 to another, and maps all the preemption points of the two other tasks to their selected RZs.
As is well known, in the worst case the complexity of the classical Branch and Bound method is exponential [25]. For the first subproblem, partitioning/fitting RZs, the worst-case complexity is O((4·B + 8·B + (NZ+1)·B)^{NZ-1} + 4·B), where B denotes a possible combination of RPB coordinates for a given RZ, as represented by a node in Figure 4. Thus B is a possible value assignment for the decision variables X_j, WRPB_j, Y_j, HRPB_j, and the value domain of B is (Device_Width)^2 · (Device_Height)^2. The term 4·B depicts the selection of a node, its removal from the set live, and the two following tests: the test of a final and optimal solution (line 8) and the test of a failed solution (line 11). The term 8·B is the branching operation respecting the constraints (line 15), and (NZ+1)·B is the bound computation for the branched nodes (line 17), obtained by assigning each already processed RZ to its RPB and the remaining ones to the most suitable RPBs in terms of resource efficiency without constraints, plus the insertion of these branched nodes into the set live (line 18). Consequently, the worst-case complexity of this subproblem is ~O(((Device_Width)^4 · (Device_Height)^4)^{NZ}). For the second subproblem, task fitting, the worst-case complexity is O((4·B + (2 + NT + 6·NT·NZ)·B)^{NT-1} + 4·B), where B denotes a possible fitting of the preemption points of a given task on the RZs, as associated to a node in Figure 5. Thus B is a possible value assignment for the decision variables PreempUnicity_{j,i,l}; the value domain of B for a given task T_i depends on the number of RZs and on the maximum number of preemption points, MaxPreemp. After the branching of the selected node, we compute the bound of each branched candidate by considering that the remaining tasks not yet processed are mapped at 100% to their optimal RZs in terms of distance and configuration overhead.
Consequently, in the worst case the complexity of the second subproblem using the Branch and Bound search is also exponential; the search thus grows exponentially with NT and NZ. This complexity is expected, as the placement of hardware tasks on heterogeneous reconfigurable devices is an NP-complete problem. Many enhancements of the Branch and Bound algorithm are being investigated, as in [26], which could reduce the exponential complexity of some problems to logarithmic complexity. Despite this possible exponential complexity, compared to an exhaustive exact method the Branch and Bound method drastically prunes the search space: search tree branches are discarded when the current bound function exceeds the last best bound function. Branch and Bound ensures a complete resolution of the placement problem with performance better than or equal to that of a fully exhaustive method in terms of resolution speed and storage space. With this smart method, we ensure a full combination of RZ fittings as well as task mappings, which solves the problem of task rejection confronted in previous placement works. Unlike approximate methods such as metaheuristics [27], this method is complete and affords an optimal solution. Within metaheuristics, the computation in each generation is rather complicated and time consuming, the number of generations must be increased to reach a more favorable solution, and the quality of the obtained solution is conditioned by the choice of the initial generation.
5. Application and Results
To investigate the influence of our proposed method for hardware task placement, we implemented an application composed of several hardware tasks taken from OpenCores [28]. We chose examples of hardware tasks that are frequently used in recent embedded systems performing video and audio applications. Our application features hardware tasks of varied sizes and heterogeneous resources. The hardware tasks are guided by an application manager, the microcontroller; we do not consider the control overhead between the microcontroller and the other hardware tasks. During the design of this application, we synthesized the resources of each hardware task with the Xilinx ISE 11.3 tool, and we defined the configuration overheads of the obtained RZs by performing the partial reconfiguration flow with the Xilinx PlanAhead 11.3 tool.
5.1. Application
Figure 6 shows the hardware tasks included in the application. The core of the application is the microcontroller T48. The microcontroller guides the other hardware tasks either for speeding up the computing, such as FIR and MULTF, or for performing complicated processing, such as JPEG compression. Consequently, there are three categories of hardware tasks.
(i) Application manager.
(a) Microcontroller T48: hardware task configuration and data flow synchronization.
(ii) Fine-grained hardware tasks (for speeding up the microcontroller).
(a) FIR: performs 1000 FIR (Finite Impulse Response) filters. Each FIR features 3 taps, 8-bit input data, and 48-bit coefficients.
(b) MULTF: performs 1000 floating-point multiplications between two vectors of 8 bits. The exponent precision and mantissa precision are 3 and 4, respectively. These multiplications are pipelined with a latency of 8 clock cycles.
(iii) Coarse-grained hardware tasks.
(a) MDCT: computes the Modified Discrete Cosine Transform.
(b) AES (Advanced Encryption Standard): performs encryption of 128-bit blocks with a 256-bit key.
(c) DDS (Direct Digital Synthesizer): creates sinusoidal waves with runtime-programmable frequency and phase.
(d) JPEG: performs hardware compression of 24 frames per second.
(e) VGA: drives VGA monitors with an 800x600 resolution; it can display one picture on the screen, either of chars, color waveforms, or a color grid.
The features of the hardware tasks and their instances are presented in Table 2. These hardware tasks are characterized by considering the resource area in the Virtex 5 technology [29]. Virtex 5 contains four main types of resources: CLBL, CLBM, BRAM, and DSP. The RBs are vertical stacks composed of the same type of resources and match the reconfiguration granularity. Hence, RB_{1} is 20 CLBMs, RB_{2} is 20 CLBLs, RB_{3} is 4 BRAMs, and RB_{4} is 8 DSPs. We have assigned 20, 80, 192, and 340 as RBCost, respectively, for RB_{1}, RB_{2}, RB_{3}, and RB_{4}. Configuration overhead is determined as described in (7) by considering that each task defines an RZ. After synthesizing the hardware tasks with the ISE tool, they are modeled under their RB model reported in [3]. The partial reconfiguration flow provided by the PlanAhead tool enables the floorplanning of hardware tasks on the chosen device to create their bitstreams independently for estimating configuration overheads. The estimation of configuration overheads considers the best-case fitting of each task, as we assume the subproblem of partitioning/fitting RZs is solved efficiently and independently from the subproblem of task mapping. We rely on parallel 8-bit-wide configuration ports and use 100 MHz as the configuration clock frequency. Preemption points are determined arbitrarily according to the granularity of the hardware tasks and their WCET. For all tasks, we consider the first preemption point equal to 0 µs.
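Under the stated assumptions (an 8-bit-wide configuration port clocked at 100 MHz, transferring one byte per clock cycle), the configuration overhead of a partial bitstream can be estimated as sketched below. The bitstream size used in the example is illustrative, not taken from the paper.

```python
# Hedged sketch of a configuration-overhead estimate in the spirit of (7):
# time to load a partial bitstream through an 8-bit port at 100 MHz.

def config_overhead_us(bitstream_bytes, port_width_bits=8, clock_hz=100e6):
    cycles = bitstream_bytes * 8 / port_width_bits   # one byte per cycle here
    return cycles / clock_hz * 1e6                   # microseconds

# e.g. an illustrative 500 KB partial bitstream:
# config_overhead_us(500 * 1024) → 5120.0 µs
```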

5.2. Results
We applied the three levels of our resource management on the application and obtained the following results.
5.2.1. Level 1 Results
Step 1 creates the 6 types of RZ depicted in Table 3: RZ_{1} is created by the MDCT and VGA tasks, RZ_{2} by AES, RZ_{3} by DDS, RZ_{4} by T48, RZ_{5} by JPEG and MULTF, and RZ_{6} by the FIRs. If the RBs of one type of RZ are not constructed from the same task (i.e., some RBs are created by other tasks), the configuration overhead corresponding to this RZ must be recomputed as described in (7) before performing the next step. In our application, the RBs of all RZs are created by one task; hence, the RZ configuration overhead is provided by the predefined features of the hardware tasks in Table 2.

In the next step, the D costs between tasks and RZs are computed; in Table 3, the column specific to each task and its instances shows the values of the obtained D costs. Thereafter, the following step calculates the workloads of the obtained RZs by assigning to each RZ the tasks giving the lowest cost D, as indicated by the bold numbers; the workload values are presented in the first column of Table 3. For example, the workload of RZ_{2} is computed by assigning the hardware tasks AES, MULTF, and VGA. We detected an overload in two RZs, among them RZ_{2}. The overloads on these RZs result from the execution times of the assigned hardware tasks as well as from the overheads, especially the configuration overheads of the RZs. These overloads are dealt with in the final step.
To resolve these overloads, the final step adds two other RZs of the same type as RZ_{2}, since the other RZs cannot totally resolve its overload; the only possible migration, performed on the second preemption point at 1650 µs, cannot fully relieve an overload reaching 299%. The overload of the other RZ could be resolved by migrating two of its tasks on their second preemption points, that is, at 120 µs, to the least loaded RZ. Consequently, the final number of RZs is equal to 8: RZ_{1}, three instances of RZ_{2}, RZ_{3}, RZ_{4}, RZ_{5}, and RZ_{6}.
5.2.2. Level 2 and Level 3 Results
Among all the available solvers [30], our work is based on the AIMMS environment [31], which relies on powerful solvers. The AIMMS environment independently resolved the two subproblems, partitioning/fitting of RZs and fitting of tasks on RZs, based on the Branch and Bound method. For the subproblem of partitioning/fitting of RZs we used a Mixed Integer Programming model, and for task fitting we employed a Mixed Integer Nonlinear Programming model. At the end of the resolution, on a 2 GHz CPU with 2 GB of RAM, each RZ is fitted on its most suitable RPB and each preemption point of each task is mapped to a unique RZ. The resolution respects the consistency of the constraints and extracts the optimal solution.
The subproblem of RZ partitioning/fitting was resolved after 2 hours and 30 minutes. Table 4 shows the four coordinates of the most suitable RPBs for RZ fitting on the initial RR delimited on the Virtex 5 FX200 device, as depicted in Figure 7. The initial RR is defined by the designer, taking into account the set of resources dedicated only to the static part.

Table 5 shows a comparison between the RBs of the obtained RPBs and those of their RZs. We observe high resource efficiency, as the number of RBs within the RPBs is nearly equal to that of the RZs; consequently, the internal fragmentation within the RPBs is considerably reduced. The differences in RBs are given in the last column of Table 5. For all RZs except two, the corresponding RPBs have one RB in excess; one RPB contains 3 extra RBs, and one strictly contains the required number of each RB type. The non-null differences are explained by the rectangular shape of the RPBs, so the number of RBs included in an RPB can exceed the RBs required by the RZ; they are also due to the heterogeneity of the device, as the partitioning may reserve some RBs which are not used by the RZ.

Figure 7 depicts the floorplanning of the RPBs conducted on the Virtex 5 FX200 device. The obtained results show an average resource utilization of 36% of the available resources on the initial RR. This average is computed according to the number and the cost of each RB type. The optimization of resource utilization minimizes the area of the FPGA that is reconfigured at runtime. We also created a static design by floorplanning each instance of each hardware task on its own RPB without using dynamic partial reconfiguration; the resource utilization of this static design is 63% of the available resources on the initial RR. Therefore, the gain in configuration overhead of a static design is paid for by resource waste, which is 43% compared to our results employing dynamic partial reconfiguration. The RPBs are closely packed on the initial RR, which avoids resource waste and external fragmentation in the device. For this reason, the initial RR could be resized in order to dedicate sufficient space to the remaining static part, as depicted in Figure 7 by the final RR.
Once the final RPB floorplanning is conducted, the final configuration overheads corresponding to RZs are determined. Thus, these new values are used to resolve the subproblem of task mapping to RZs.
The subproblem of task fitting on RZs was solved within 9 seconds. Figure 8 shows the results of the preemption point mapping of the hardware tasks in the application. Two task instances start on one RZ, are preempted after 85% of their WCET, and continue their execution on another RZ. A third task is first mapped to one RZ, stopped at its second preemption point, that is, after 33% of execution, and restarts on a second RZ up to 54%; at this third preemption point it migrates to a third RZ, where it completes its execution. The remaining tasks and instances are each totally mapped to a single RZ.
As shown in Figure 9, task T_{6} starts its execution on RZ_{8}, then is preempted at its third preemption point, that is, after 42% of execution. T_{6} resumes its execution on another RZ until 72% of its execution, is then preempted again at the fourth preemption point, and restarts on a further RZ, where it completes.
This resolution of mapping hardware tasks to RZs discards the problem of task rejection as it guarantees an execution unit for each task to achieve its execution by respecting its predefined preemption points.
The bar chart in Figure 10 presents the RZ workloads obtained after the task mapping resolution. Except for one RZ having a workload of 44%, the remaining RZs have balanced workloads close to the average workload of 89%.
The bar chart in Figure 11 depicts the task occupation rates on the RZs resulting from the mapping. This bar chart also shows the number of migrations that must be performed to avoid RZ overloads and ensure the total execution of the tasks. We obtained 6 migrations. One task realizes two migrations, being mapped successively to three RZs; its mapping combines the two objectives expressed by Map2, the minimization of configuration overhead, and Map3, which optimizes the utilization of costly resources. Similarly, another task is mapped successively to three RZs, and two further tasks each sustain one migration between two RZs.
If we consider independent tasks, then within each RZ respecting the workload bound of 100%, the execution sections mapped to this RZ can be scheduled by the Earliest Deadline First (EDF) algorithm, as they respect the monoprocessor sufficient schedulability condition (∑_i U_i ≤ 1).
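The monoprocessor EDF sufficient test mentioned above is a one-line utilization check; the sketch below applies it per RZ, with illustrative names (execution time C and period T per mapped section):

```python
# Sketch of the EDF sufficient schedulability test for one RZ:
# the execution sections mapped to the RZ are schedulable if the
# sum of their utilizations C/T does not exceed 1.

def edf_schedulable(sections):
    """sections: list of (execution_time, period) pairs mapped to one RZ."""
    return sum(c / t for c, t in sections) <= 1.0
```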
After fitting the RZs on their RPBs and according to the obtained task mapping, the final RPB configuration overheads are determined. In the worst case, counting all the possible preemption points, the total overhead, including the configuration overheads and the functional overheads of the tasks, is 72959 µs. The overall overhead is 11% of the total running time.
Table 6 gives some comparisons with previous work on hardware task placement in terms of task rejection, resource utilization, configuration overhead, and the complexity of the performed technique. To the best of our knowledge, several placement algorithms have been proposed, classifiable as metaheuristics or online heuristics; nevertheless, each targets only one goal and does not take the satisfaction of the other goals into account. In contrast to [11, 32], our multiobjective placement computes the worst-case configuration overhead before scheduling (11%) and targets an application of 14 tasks, whereas [11, 32] optimize the configuration overhead only for two or three tasks during scheduling (18%, 8%). Compared to [10] (sets of 3,000 homogeneous tasks) and [1] (10,000 homogeneous tasks) applied to homogeneous devices, we have efficiently reduced the resource utilization (36%) for an application of 14 heterogeneous tasks while taking into account the heterogeneity of recent reconfigurable devices; on heterogeneous FPGAs, fitting the RZs on large RPBs would otherwise give significant resource waste. Compared to the offline simulated annealing in [4], performed for 100 tasks and producing 13% task rejection, our three-level offline placement totally discards task rejection for an application of 14 tasks.
