International Journal of Reconfigurable Computing

Special Issue

Selected Papers from the International Workshop on Reconfigurable Communication-centric Systems on Chips (ReCoSoC' 2010)

Research Article | Open Access

Volume 2011 | Article ID 591983 | 28 pages | https://doi.org/10.1155/2011/591983

Static Scheduling of Periodic Hardware Tasks with Precedence and Deadline Constraints on Reconfigurable Hardware Devices

Academic Editor: Michael Hübner
Received 27 Aug 2010
Revised 12 Jan 2011
Accepted 10 Feb 2011
Published 12 May 2011

Abstract

Task graph scheduling for reconfigurable hardware devices can be defined as finding a schedule for a set of periodic tasks with precedence, dependence, and deadline constraints, together with their optimal allocation on the available heterogeneous hardware resources. This paper proposes a new methodology comprising three main stages. By combining these three stages with dynamic partial reconfiguration and mixed integer programming, the methodology achieves pipelined scheduling and efficient placement, which enables parallel computing of the task graph on the reconfigurable device while optimizing placement and scheduling quality. Experiments on an application of heterogeneous hardware tasks demonstrate an improvement of resource utilization of 12.45% of the available reconfigurable resources, corresponding to a resource gain of 17.3% compared to a static design. The configuration overhead is reduced to 2% of the total running time. Due to pipelined scheduling, the task graph spanning is reduced by 4% compared to sequential execution of the graph.

1. Introduction

An important trend in real-time applications implemented in reconfigurable computing systems is the use of reconfigurable hardware devices to increase performance and to guarantee temporal constraints. These reconfigurable devices provide a high density of heterogeneous resources in order to satisfy application requirements and especially to enable parallel computing. Furthermore, the devices support run-time partial reconfiguration, which allows a portion of the available resources to be reconfigured without interrupting the remaining parts running in the same device. Consequently, this concept increases resource utilization and application performance.

Periodic, partially ordered activities represent the major computational demand in real-time systems such as real-time control and digital signal processing. This category of repetitive computation is described by directed acyclic graphs (DAGs). Implementing these DAGs in reconfigurable hardware devices consists in scheduling tasks on a limited number of nonidentical units shaped on the area of reconfigurable resources, while respecting the four constraints described as follows. (1) The periodicity constraint: each task is repeated periodically according to its ready times in the graph. Thus, if task $T_i$ has a period $P_i$, then for all $k$, $s_i^{k+1} - s_i^{k} = P_i$, where $T_i^{k}$ and $T_i^{k+1}$ are the $k$th and the $(k+1)$th repetitions of task $T_i$, and $s_i^{k}$ and $s_i^{k+1}$ are their start times. (2) The precedence constraint: to maintain the correctness of task precedences, in each iteration, a task can be executed only if all its predecessors in the graph have finished their executions. Therefore, each task $T_j$ must start execution after the completion of its predecessors, defined by the subset $\mathrm{pred}(T_j)$; thus for all $k$, $s_j^{k} \geq s_i^{k} + C_i$ for all $T_i \in \mathrm{pred}(T_j)$, where $s_i^{k}$ and $s_j^{k}$ are the start times of task $T_i$ and task $T_j$, respectively, during their $k$th iteration, and $C_i$ is the execution time of task $T_i$. (3) The dependence constraint: the execution of each task in the DAG is launched when all the data resulting from all its predecessors are available. This constraint guides the choice of task periods as detailed in Section 3. (4) The deadline constraint: as this paper focuses on hard real-time systems, each task in the DAG must finish its execution before its hard deadline. Thus, within iteration $k$, if task $T_i$ has an execution time $C_i$ and an absolute deadline $d_i^{k}$, then $s_i^{k} + C_i \leq d_i^{k}$.
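
The periodicity, precedence, and deadline constraints can be checked mechanically for a candidate schedule. The following Python sketch is illustrative only: the `Task` fields, the `check_schedule` function, and the sample values are hypothetical and are not taken from the paper. It assumes strict periodicity (consecutive start times exactly one period apart), nonpreemptive execution, a task and its predecessors having the same number of iterations, and absolute deadlines measured from time 0. The dependence constraint is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Task:
    wcet: int          # execution time
    period: int        # period
    deadline: int      # relative deadline (assumed to count from each period start)
    preds: tuple = ()  # names of predecessor tasks


def check_schedule(tasks, start, iterations):
    """Return the violated constraints as (kind, task name, iteration) tuples.

    start[name][k] is the start time of iteration k of task `name`.
    """
    violations = []
    for name, task in tasks.items():
        for k in range(iterations):
            s = start[name][k]
            # Periodicity: consecutive iterations start one period apart.
            if k + 1 < iterations and start[name][k + 1] - s != task.period:
                violations.append(("periodicity", name, k))
            # Precedence: start only after each predecessor of iteration k ends.
            for pred in task.preds:
                if s < start[pred][k] + tasks[pred].wcet:
                    violations.append(("precedence", name, k))
            # Deadline: finish no later than the absolute deadline of iteration k.
            if s + task.wcet > k * task.period + task.deadline:
                violations.append(("deadline", name, k))
    return violations


# Hypothetical two-task graph: "b" depends on "a".
tasks = {"a": Task(2, 10, 10), "b": Task(3, 10, 10, preds=("a",))}
good = check_schedule(tasks, {"a": [0, 10], "b": [2, 12]}, iterations=2)
bad = check_schedule(tasks, {"a": [0, 10], "b": [1, 12]}, iterations=2)
print(good)  # no violations
print(bad)   # "b" starts too early in iteration 0
```

In the second call, task `b` starts before `a` has finished, which also breaks the one-period spacing between its first two iterations.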

Figure 1 illustrates an example of the targeted task graph. As can be seen in Figure 1, the tasks are repeated according to their fixed periods. Each task with a precedence link launches its execution only when its predecessors have completed their executions and only when it is required. For example, the third iterations of the two tasks of period 8 do not need a third execution of their predecessor, because the predecessor repeats less often (its period is equal to 12). At each repetition, to enable the task execution, the dotted lines ensure the data transfer between interdependent tasks. The issue of data dependence is detailed later in the paper. Finally, at each iteration, the real-time tasks must respect their hard deadlines.

As shown in Figure 2, this paper proposes a new methodology comprising three main stages to achieve the scheduling of these DAGs with the predefined constraints on reconfigurable devices.

Task Clustering
This stage is technology dependent. It targets the partitioning of tasks requiring the same types of resources into the same cluster.

Mapping/Scheduling of Tasks in Clusters
This stage starts by performing spatial and temporal analyses mentioned in Figure 2 by DAG validity, Ready Times, and a set of heuristics. Subsequently, based on a predefined preemption model, it deals with simultaneous resolution of mapping tasks to the obtained clusters and global scheduling of tasks in clusters respecting the periodicity, precedence, dependence, and deadline constraints. This stage aims at optimizing scheduling quality.

Cluster Placement on the Reconfigurable Device
This stage is also technology dependent. It involves searching for the most suitable physical location partitioned on the reconfigurable device for each cluster obtained at the second stage. This stage aims at optimizing placement quality.

The resolution of these three stages results in static scheduling of tasks in the DAGs into a limited number of reconfigurable units partitioned on the device, respecting the periodicity, precedence, dependence, and deadline constraints. This is a fundamental problem in parallel computation, equivalent to determining static multiprocessor scheduling for DAGs in a software context. As is well known, static multiprocessor task graph scheduling is a combinatorial optimization problem, and it is formulated in this paper through mixed integer programming and solved by means of powerful solvers.

The paper details the spatial and temporal analyses required to check the feasibility of task graph scheduling and aims at determining the optimal solution in terms of schedule length, waiting time, parallel efficiency, resource efficiency, and configuration overhead. Schedulability analysis is not the focus of the present paper. However, before dealing with DAG scheduling on the reconfigurable device, the correctness of the precedences and dependences between tasks within the graph and the accuracy of real-time functioning are analyzed, and a set of heuristics is applied to provide the number of reconfigurable physical units needed to ensure the existence of a valid DAG schedule. The analyses are expressed as constraints that ensure the validity of the chosen task graph.

The paper is organized as follows: Section 2 presents related works of the DAG scheduling problem. Section 3 details the methodology we propose to achieve the placement and scheduling of DAGs on reconfigurable devices. Section 4 describes an illustration of our proposed methodology on a given DAG and evaluates the obtained enhancements by metric measuring of placement and scheduling quality. The conclusion and future works are presented in Section 5.

2. Related Works

Static multiprocessor scheduling techniques using task graphs have matured over the last years, and many powerful scheduling strategies have emerged. As this problem is known to be NP-hard [1], the main research efforts in this area focus on heuristic methods and few of them propose analytic resolutions. We have studied static and dynamic multiprocessor scheduling using DAGs in both the software and hardware contexts.

In [2], Clemente et al. implement a static hardware scheduler employing efficient techniques which greatly reduce reconfiguration latencies and schedule length. Taking into account that configuration latency drastically reduces the efficiency of hardware multitasking systems, they introduce a new hardware scheduler that communicates directly with the reconfigurable units and uses three optimization techniques (prefetch, reuse, and replace) while guaranteeing the precedence constraints. The prefetch technique manages in advance the reconfigurations and replacements required to improve task reuse. Reference [2] presents three algorithms for the replace technique, that is, the Least Recently Used, Longest Forward Distance, and Look Forward + Critical algorithms. The paper focuses especially on the Look Forward + Critical algorithm to schedule graph tasks on several reconfigurable units. In order to maximize the task reuse that reduces reconfiguration latency, this algorithm classifies tasks from the most critical ones in terms of required reconfiguration to the least critical ones and tries to replace the latter tasks, so that their reconfiguration does not generate any overhead during task graph execution. This advantage is ensured by the prefetch technique, which reconfigures a given task during the execution of its predecessors.

Static multiprocessor scheduling is a very difficult problem, but genetic algorithms have successfully been applied to the search for acceptable solutions. In [3], the authors investigate scheduling for cyclic task graphs using genetic algorithms by transforming the cyclic graph into several alternate DAGs. To create an efficient schedule, the paper considers both the intracycle dependencies and the dependencies across different cycles. After unfolding the cyclic task graph for two cycles by incorporating the intercycle dependencies, the paper presents an algorithm investigating all the subgraphs extracted from a two-cycle task graph. Based on measurements such as the height and the width of the task graph, connection degree, degree of simultaneousness, and independent parts in the graph, the method evaluates the resulting subgraphs to select the configuration that best suits a chosen application and the available hardware configuration. Suitable allocation to processors is obtained by the earliest start time heuristic where tasks are assigned to the processor that offers the earliest start time. The employed genetic algorithm tries to optimize the schedule length which is expressed by the finishing time on all processors.

In [4], Abdeddaim et al. describe a method based on a timed automaton model for solving the problem of scheduling partially ordered tasks on parallel identical machines. The proposed method formulates for each task in the graph a 3-state automaton consisting of the waiting, active, and finish states. By searching the tasks related by a partial order in the graph, the possible disjoint chains in the graph are extracted. The automaton of every chain is constructed using the individual task-specific automata. The global automaton is then composed of the chain-specific automata and ensures that the transitions do not violate the precedence and resource constraints. Thus, optimal scheduling consists in finding the shortest path in the timed automaton. The proposed methodology is also extended to include two additional features in the task-specific automaton, namely the release and the deadline times.

Integer linear programming (ILP) formulation is exploited in some works on static multiprocessor scheduling using task graphs. The authors of paper [5] propose an exact ILP formulation that performs task graph scheduling on dynamically and partially reconfigurable architectures and minimizes schedule length. In addition, this work proposes a dynamic reconfiguration-aware heuristic scheduler called NAPOLEON, which adopts an ALAP (as late as possible) order of tasks and exploits configuration prefetching, module reuse, and antifragmentation techniques. Both methods are extended to Hw/Sw codesign. The ILP formulation is based on nonpreemptive tasks, allows the execution of tasks on software processors or on the FPGA, and respects the FPGA physical constraints as well as the precedence and temporal constraints in the graph. Both methods provide a solution for the complete scheduling of the DAG and determine for each task its Sw or Hw execution unit, its reconfiguration start time, its position on the FPGA, and its execution start time.

Another exact ILP formulation for performing task mapping and scheduling on multicore architectures is presented in [6]. The technique of these authors incorporates loop level task partitioning, task transformations by using loop fusion and loop splitting, and it aims at reducing system execution time. The paper focuses on an ILP-based approach for task partitioning, task mapping, and pipelined scheduling while taking data communication between processors into consideration for DSP applications on the multicore platform. The authors in [6] divide the problem into two parts. The first assigns and schedules tasks on processors by including the task merging on the same batch and the replication of batches to several processors. The second step conducts a mapping of data to memory architecture by minimizing memory access latencies.

In [7], Sandnes and Sinnen consider the scheduling of iterative computing that can be represented by cyclic task graphs. In order to avoid costly classic graph unfolding and to shorten the makespan during scheduling, the authors propose a new strategy for transforming cyclic task graphs into acyclic task graphs; an efficient scheduling algorithm from the literature, named Critical Path/Most Immediate Successor First (CP-MISF) and proposed by Kasahara and Narita in 1985, is then applied to the transformed graph. The strategy is based on a decyclification step involving three parts: (1) a decyclification algorithm that transforms the cyclic graph into an acyclic graph based on a given start node and a depth first search (DFS) strategy, (2) a search for the start node that yields the shortest critical path in the transformed graphs, assuming that the critical path in the graph is a good estimator of its schedule length, and (3) a quantification of the acyclic graph quality in terms of makespan. In addition, based on an adjacency matrix representing the graph dependencies and simplifying the unfolding formulation, the paper presents a new intuitive graph unfolding formulation which decomposes the adjacency matrix into two matrices, one for intraiteration dependences and another for interiteration dependences. The unfolded graph is then scheduled using a genetic algorithm approach.

In [1], Djordjević and Tošić propose a new compile-time single-pass scheduling technique, called chaining, for mapping task graphs onto fully connected multiprocessor architectures while taking communication delays into account. The proposed technique consists in a generalized list scheduling with no preconditions concerning the order in which tasks are selected for scheduling. The main idea is to build a heuristic providing a trade-off between maximizing parallelism on processors, minimizing communication overheads, and minimizing overall execution time of the task graph. The chaining technique uses nonpreemptive tasks and constructs a scheduled task graph incrementally by scheduling one task at each step. The intermediate partially scheduled task graphs are obtained by selecting a nonscheduled task at each step and by placing it on the most appropriate precedence edge. The policy for selecting the tasks to be scheduled is based on a Task Selection First heuristic, and the selection of the most suitable valid edge where the task will be placed is guided by the critical path and edge width criteria. The tasks encompassed within the same chain are scheduled on the same processor.

In [8], the authors aim at improving the performance of hardware tasks on the FPGA. Intertask communication and data dependences between tasks are analyzed in order to reduce configuration overhead, to minimize communication latency, and to shorten the overall execution of tasks. The work builds on earlier work in reconfigurable computing that addresses resource efficiency and presents three algorithms. Reduced Data Movement Scheduling (RDMS) is the most efficient dynamic algorithm for reducing configuration and communication overheads and provides the optimal performance for scheduling tasks in DAGs on the FPGA. RDMS uses the total reconfiguration of the FPGA and tries to minimize the number of reconfigurations by grouping communicating tasks in the same configuration. By conducting a width search, RDMS schedules tasks while respecting their data dependences. RDMS is based on a dynamic programming algorithm and ensures that each configuration includes the combination of tasks that exploits the hardware resources to the maximum and that encompasses the highest possible number of task dependences.

In [9], Fekete et al. consider the optimal placement of hardware tasks in space and in time on the FPGA. Tasks are presented as three-dimensional boxes in space and time. The authors integrate intertask communication expressing data dependence and use a graph-theoretical characterization of the feasible packing determined by means of a decision of an orthogonal packing problem with precedence constraints. By searching the transitive orientations and by performing projections, the authors of paper [9] transform the 3D boxes representing tasks into 3×1D arrangements and then verify whether the three obtained arrangements referred to as packing classes satisfy the conditions of the feasible packing and determine the optimal spatial and temporal packing. This work enhances the makespan of the graph and optimizes the used reconfigurable space on the FPGA.

The major contribution in [10] is the development of a multitasking microarchitecture to perform a dynamic task scheduling algorithm on reconfigurable hardware for nondeterministic applications with intertask dependences which are not known until runtime. The task system is modeled as a modified directed acyclic graph which contains directed data edges and directed control edges labeled with scalar values indicating the probability of occurrence of the corresponding sink task in multiple task graph iterations [10]. Based on dynamic priority assignment for nonpreemptive tasks, the dynamic scheduler assigns each task to a software or hardware processing element, schedules the contexts (bitstreams) and the data, and the fetch and the prefetch reconfigurations, and activates task execution. In order to minimize reconfiguration overhead, the dynamic scheduler uses the configuration prefetching technique to prefetch the task bitstream ahead of time or exploits the previous context configured in the logic cell. In addition, it aims to minimize application execution time by employing a local optimization technique.

In [11], Kohler defines a new heuristic to schedule DAGs on a system of independent identical processors. The author describes a simple critical path priority method which is shown to be optimal or near optimal for most randomly generated computation graphs when compared to the Branch and Bound method. This heuristic aims to minimize the finishing time of the computation graph. Critical path scheduling is based on a list containing a permutation of the tasks. Any time a processor is idle, it instantaneously scans the list from the beginning and begins to execute the first free task which may validly be executed because all its predecessors have been completed [11]. The construction of the list is based on the critical path of tasks, which is defined by the longest path from a given task to a terminal node. The paper also presents the exact Branch and Bound method used to obtain optimal scheduling, and the results obtained are compared to the critical path heuristic to prove the high quality of the latter method.
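
The critical path list scheduling described above can be sketched as follows. This is an illustrative Python reading of the textual description, not the implementation of [11]; the function names and the four-task example graph are hypothetical, and tasks are assumed to be nonpreemptive with known execution times.

```python
def critical_path_lengths(wcet, succs):
    """Longest path (sum of execution times) from each task to a terminal node."""
    memo = {}

    def length(task):
        if task not in memo:
            tail = max((length(s) for s in succs.get(task, ())), default=0)
            memo[task] = wcet[task] + tail
        return memo[task]

    for task in wcet:
        length(task)
    return memo


def list_schedule(wcet, succs, n_procs):
    """Schedule a DAG on identical processors using a critical path priority list."""
    levels = critical_path_lengths(wcet, succs)
    remaining = sorted(wcet, key=lambda t: -levels[t])  # the priority list
    preds = {t: set() for t in wcet}
    for task, children in succs.items():
        for child in children:
            preds[child].add(task)
    proc_free = [0] * n_procs  # time at which each processor becomes idle
    start, finish = {}, {}
    time = 0
    while remaining:
        for proc in range(n_procs):
            if proc_free[proc] > time:
                continue  # processor is busy
            # Scan the list from the beginning for the first executable task.
            for task in remaining:
                if all(finish.get(p, float("inf")) <= time for p in preds[task]):
                    start[task] = (proc, time)
                    finish[task] = time + wcet[task]
                    proc_free[proc] = finish[task]
                    remaining.remove(task)
                    break
        if remaining:
            # Advance to the next instant at which a processor becomes idle.
            time = min(f for f in proc_free if f > time)
    return start, max(finish.values())


# Hypothetical graph: a -> c and b -> d, on two processors.
wcet = {"a": 2, "b": 3, "c": 1, "d": 2}
succs = {"a": ["c"], "b": ["d"]}
start, makespan = list_schedule(wcet, succs, n_procs=2)
print(start, makespan)
```

In this example, `b` heads the list because it lies on the longest path (`b` then `d`, length 5), so it is started first and the resulting makespan equals the critical path length.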

Table 1 provides a summary of the optimization parameters and employed techniques described in the cited works.


Reference | Makespan/Speedup/Parallel efficiency | Resource efficiency | Configuration overhead | Techniques
[2] | x | | x | Prefetch/replace/reuse of reconfigurations
[3] | x | | | Graph unfolding/genetic algorithm/earliest start time heuristic
[4] | x | | | Timed automaton model/shortest path
[5] | x | | x | (1) ILP/reuse; (2) NAPOLEON/ALAP/prefetch/reuse/antifragmentation
[6] | x | | | ILP/loop level task partitioning/task transformation (loop fusion + loop splitting)/task mapping and scheduling (task merging on batches + batch replication)/data mapping
[7] | x | | | (1) Graph decyclification (DFS + shortest critical path)/CP-MISF; (2) Graph unfolding/genetic algorithm
[1] | x | | | Chaining/task selection first heuristic/critical path/edge width/communication latencies
[8] | x | x | x | Dynamic programming algorithm/FPGA total configuration
[9] | x | x | | Orthogonal packing problem/packing classes
[10] | x | | x | Prefetch/local optimization/reuse
[11] | x | | | Critical path heuristic/Branch and Bound

The major common drawback of most of the described techniques is that they do not address real-time constraints. Furthermore, as shown in Table 1, most of them seek to optimize the makespan of the graph and neglect reconfiguration overhead and resource inefficiency, or do not optimize the three parameters simultaneously. The works described in [8, 9] that conduct scheduling of DAGs on FPGA devices are based on successive total configurations of the device. Their resource efficiency consists only in packing the maximum number of tasks of the DAG on the FPGA, in order to exploit the reconfigurable resources efficiently and to perform the minimum number of total configurations. These works therefore do not consider the internal fragmentation caused by task placement on the FPGA, which is what resource efficiency represents in our work.

In the context of hardware task scheduling, the works described above do not consider the placement of scheduled tasks. Either the task may be placed at any position in the device (in which case device heterogeneity is not taken into account), or the task is fixed to a unique reconfigurable unit, which reduces application flexibility. Contrary to these works, our strategy may be generalized to all types of devices, that is, both homogeneous and heterogeneous devices; the placement problem is considered an important stage that is highly interlinked with the scheduling of task graphs on the reconfigurable device. With our strategy, the task may be executed on several reconfigurable units according to its resources and according to the analyses conducted during the clustering stage.

Moreover, some of the described works do not exploit the relevant concept of run-time partial reconfiguration afforded by recent reconfigurable devices and employ the total configuration of FPGAs.

Based on these observations, our challenge is to utilize the benefits of the run-time partial reconfiguration concept for recent heterogeneous devices. The concept opens up the possibility of developing a hardware multitasking system by dividing the reconfigurable area into smaller reconfigurable units and by customizing them as required by the running application.

Considering multitasking, scheduling of task graphs on reconfigurable hardware devices is similar to heterogeneous multiprocessor scheduling in the software context. With full knowledge of the characteristics of DAG tasks and technology features, our methodology targets constrained applications and endeavors to provide pipelined scheduling in multi-reconfigurable-unit system while optimizing schedule length, waiting time, parallel efficiency, resource efficiency, and configuration overhead.

3. Proposed Methodology for Placement and Scheduling of DAGs on Reconfigurable Devices

Our methodology can be viewed as two separate subproblems: (i) the mapping and scheduling of hardware tasks on predefined clusters while satisfying periodicity, precedence, dependence, and deadline constraints and (ii) the placement of the obtained clusters on the reconfigurable device, taking into account its heterogeneity. Our resource and task management is essentially based on features of hardware tasks and reconfigurable hardware devices. Xilinx's most recent Virtex FPGA was used as the reference reconfigurable hardware device to perform the placement and scheduling of the DAGs. Virtex SRAM-based FPGAs are characterized by a column-based architecture, a high density of heterogeneous resources, and several parallel configuration ports functioning at a high speed.

3.1. Terminology and Definitions

Throughout the paper, $N_T$ refers to the number of tasks in the graph, $N_Z$ the number of clusters, and $N_P$ the number of resource types in the chosen technology. The directed acyclic task graph is denoted by the pair $G = (V, E)$. $V$ is the set of nodes representing tasks in the DAG, and $E$ is the set of edges linking the dependent tasks, $E \subseteq V \times V$.

As shown in Figure 3, on each edge $(T_i, T_j)$, the outgoing value from the source node $T_i$ to the sink node $T_j$ depicts the amount of data that $T_i$ must produce at each repetition for $T_j$. The incoming value represents the amount of data that must be consumed by the sink node $T_j$ at each execution iteration after completion of the repetition of its predecessor $T_i$.

Each task in the graph has three models as follows.

3.1.1. Functional Model

Each hardware task $T_i$ is represented by a set of parameters that are fixed at compile time and kept static throughout the DAG execution. $T_i$ is characterized by its worst-case execution time $C_i$, its period $P_i$, which is equal to its relative deadline $D_i$, and a set of preemption points. The preemption points are instants taken throughout the worst-case execution time, as shown in Figure 4. The number of preemption points of task $T_i$ also includes the first point of execution of the task. The set of preemption points is determined by the designer according to the known states in the behavioral model and according to possible data dependences between these states. The predefinition of preemption points gives rise to the execution sections within the hardware task. Our methodology is based on preemptive modeling in order to create a reactive system, to increase flexibility towards application needs, and consequently to improve compliance with real-time constraints.
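
The relation between preemption points and execution sections can be illustrated with a short Python sketch. The function name and the sample values are hypothetical; the sketch assumes that preemption points are given as time offsets within the worst-case execution time and that the first point (offset 0) is the start of execution.

```python
def execution_sections(wcet, preemption_points):
    """Split [0, wcet] into execution sections delimited by preemption points."""
    points = sorted(set(preemption_points))
    if not points or points[0] != 0:
        raise ValueError("the first preemption point must be the start of execution")
    if points[-1] >= wcet:
        raise ValueError("preemption points must lie before the end of execution")
    bounds = points + [wcet]
    # Each section runs from one preemption point to the next (or to the end).
    return list(zip(bounds, bounds[1:]))


sections = execution_sections(wcet=10, preemption_points=[0, 4, 7])
print(sections)  # [(0, 4), (4, 7), (7, 10)]
```

Under these assumptions, a task with three preemption points has three execution sections, and the scheduler may only suspend the task at the boundaries between them.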

3.1.2. Behavioral Model

This includes the finite state machine controlling each task, which handles a set of registers or a small memory bank used for context switching during preemption. The latency required to preempt and to resume the execution of tasks is disregarded because, even in the worst case, the time to access the system bus and memory is negligible. With the preemptive model, we do not use the classical method of reading back and loading a modified bitstream, since the latency of this method is significant, it complicates preemption, and it requires a large memory space: a new readback bitstream must be saved at each preemption. With our model, the number of preemptions within tasks is limited by specifying the possible preemption points according to task states, outside of which preemption is not allowed. Thus, when a task needs to be preempted or resumed, we save or load the current state of the finite state machine with an acceptable amount of data, maintaining the same bitstream for each task within a given reconfigurable unit. Preemption points of hardware tasks are set in a way that reduces the data dependences that could exist between two states. In fact, a preemption point between two states processing the same data must be avoided, because these data would need to be saved into an external memory, which might increase the preemption overhead at runtime. Conversely, it is recommended to insert a preemption point when the task is in a blocked state, waiting to receive an external resource, so that the ready tasks can be executed in the reconfigurable unit. In the finite state machine, the longest execution time between two states must be considered in order to deduce the worst-case execution time.

3.1.3. RB-Model

At the physical level, tasks are represented as a set of reconfigurable resources called reconfigurable blocks (RBs). RBs correspond to the physical resources in the reconfigurable hardware device required to achieve execution of the hardware task, and they define the RB model of the task as expressed by (1). Determination of the RB model of hardware tasks is detailed in our work described in [12]. The number of RB types is equal to the number of resource types in the chosen technology. The RBs are the smallest reconfigurable units in the hardware device. They are determined according to the available reconfigurable resources in the device, and they closely match its reconfiguration granularity. Each type of RB, $RB_k$, is characterized by a specified cost, $c_k$, defined according to its frequency in the device, its power consumption, and the importance of its functionality,

$$T_i = \{ X_{i,k}\, RB_k \} \quad \text{for each RB type } k, \tag{1}$$

where $X_{i,k}$ is the number of RBs of type $RB_k$ required by task $T_i$.

The reconfigurable device is also characterized by its RB model as shown in our work described in [12] to enable the placement of hardware task clusters at a later stage.

The three main stages of the methodology used for static scheduling of DAGs on multi-reconfigurable-unit system with predefined constraints are described below.

3.2. Hardware Task Clustering

This stage comprises two steps that are performed consecutively: (i) reconfigurable zone type search and (ii) cost computing. Bearing in mind that the concept of run-time partial reconfiguration had to be used, our main objective at this first stage was to partition tasks constituting the graph into cluster types determined according to their required RB types in order to enhance resource utilization.

3.2.1. Reconfigurable Zones Types Search

This step takes as input the RB model of each task in the DAG. By performing Algorithm 1, it groups tasks sharing the same types of RBs under the same type of cluster, taking the maximum number of RBs of each type over these tasks. In our methodology, the obtained types of clusters are denoted as reconfigurable zones (RZs). The number of RZ types cannot exceed the number of tasks, since each task creates at most one new type. RZs are thus virtual units customized by Algorithm 1 to model the classes of hardware tasks in terms of RB types. RZs separate hardware tasks from their execution units on the reconfigurable device. In the last stage of our proposed methodology, RZs will be placed on their suitable reconfigurable units, respecting the heterogeneity of the device and optimizing resource efficiency as well as configuration overhead. After the completion of Algorithm 1, each RZ is represented by its RB model as expressed by (2),

$$RZ_j = \{ X_{j,k}\, RB_k \} \quad \text{for each RB type } k, \tag{2}$$

where $X_{j,k}$ is the number of RBs of type $RB_k$ in $RZ_j$.

Algorithm 1 processes the tasks of the DAG as follows. It scans the RB model of each hardware task and checks whether an already inserted type of RZ that closely matches the required types of RBs in the task exists in the RZ types list, list-RZ (line 6). Should this be the case, Algorithm 1 updates the number of RBs within this type of RZ to the maximum of the number of RBs in the task and that in the RZ (line 9).

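(1)  RZ-type ← 0 // number of RZ types
(2)  list-RZ ← ∅ // list of RZ types
(3)  k: natural // index of RB types
(4)  for  all  tasks $T_i$  do
(5)   // $X_{i,k}$: number of $RB_k$ required by $T_i$; $X_{j,k}$: number of $RB_k$ in $RZ_j$
(6)   if  there exists $RZ_j$ in list-RZ such that, for all $k$, ($X_{i,k} > 0$ and $X_{j,k} > 0$) or ($X_{i,k} = 0$ and $X_{j,k} = 0$)  then
(7)    // this test checks whether the task matches with an RZ type that already exists in list-RZ
(8)    for  all  $RB_k$  do
(9)     $X_{j,k}$ ← max($X_{i,k}$, $X_{j,k}$) // update RB number of $RZ_j$
(10)    end  for
(11)   else
(12)    Increment RZ-type
(13)    $RZ_{\text{RZ-type}}$ ← RB model of $T_i$ // new type of RZ, $RZ_{\text{RZ-type}}$
(14)    Insert (list-RZ, $RZ_{\text{RZ-type}}$)
(15)   end  if
(16)  end  for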

If the required types of RBs in the task do not match any type of RZ included in the list-RZ, the algorithm of the search of RZ types decides on the creation of a new type of RZ as required by the task (lines 12, 13) and inserts it in list-RZ (line 14).
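
The RZ types search can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the authors' implementation: it assumes that a task "matches" an RZ type when both use exactly the same set of RB types, and the function name, task names, and RB counts are hypothetical.

```python
def search_rz_types(task_models):
    """Group tasks into RZ types according to the RB types they require.

    task_models maps a task name to a tuple of RB counts, one per RB type.
    Returns the list of RZ models and the RZ index assigned to each task.
    """
    list_rz = []     # RB model of each RZ type found so far
    assignment = {}  # task name -> index of its RZ type
    for name, model in task_models.items():
        used = tuple(x > 0 for x in model)
        for j, rz in enumerate(list_rz):
            if tuple(x > 0 for x in rz) == used:
                # Matching RZ type: keep the maximum count of each RB type.
                list_rz[j] = [max(a, b) for a, b in zip(rz, model)]
                assignment[name] = j
                break
        else:
            # No matching RZ type: create a new one from the task's RB model.
            list_rz.append(list(model))
            assignment[name] = len(list_rz) - 1
    return list_rz, assignment


# Hypothetical RB models for five tasks and four RB types (RB1..RB4).
task_models = {
    "t1": (2, 1, 0, 0),
    "t2": (1, 3, 0, 0),
    "t3": (0, 2, 1, 0),
    "t4": (0, 1, 4, 0),
    "t5": (0, 0, 0, 2),
}
list_rz, assignment = search_rz_types(task_models)
print(list_rz)
print(assignment)
```

With these sample values, the five tasks yield three RZ types, and each RZ holds the per-type maximum of the RB counts of the tasks grouped in it.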

Figure 5 shows an example of the execution of Algorithm 1, the RZ types search, for a DAG comprising five tasks in a technology including four types of RBs (RB1, RB2, RB3, and RB4). and are grouped in the same type of RZ (RZ1) as both need RB1 and RB2, and the number of each RB type within RZ1 is set to the maximum number of RBs between and . Similarly, RZ2 is created by and , and defines the third type of RZ (RZ3).

After searching for the set of RZs, the configuration overhead for each obtained RZ is computed and denoted by . corresponds to the configuration overhead to fit in the target technology. This configuration overhead is computed by floorplanning each on the chosen device and by conducting the whole partial reconfiguration flow up to the creation of the partial bitstream. is determined by (3) according to configuration frequency and configuration port,

3.2.2. Cost Computing

This step commences by computing cost between tasks and each RZ type resulting from the first step. Cost represents the differences in RBs between tasks and RZs; consequently, it expresses resource wastage when a task is mapped to an RZ. Based on RB models of task and RZ , cost is computed according to two cases as follows. Firstly, we define by

Case 1. For all , , contains a sufficient number of each type of RB () required by , and cost is equal to the sum of the differences in the numbers of each RB type between and weighted by as expressed in

Case 2. There exists , , the number of RBs required by exceeds the number of RBs in or needs which is not included in . In this case, cost is infinite,
Cost is exploited during the stage of task mapping and scheduling on RZs and during the RZ placement stage to optimize the utilization of costly resources on the device. The execution of a given task in an RZ is allowed only when the cost between them is finite. Figure 6 illustrates the computing of costs between the five tasks and RZ3 described in Figure 5.
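The two cases can be sketched in Python as follows. This is a minimal illustration under the assumption that the weighted sum runs over every RB type present in the RZ; the function name `cost`, the RB counts, and the per-type RB costs are hypothetical.

```python
import math


def cost(task_rb, rz_rb, rb_cost):
    """Resource wastage when a task is mapped to an RZ.

    task_rb, rz_rb: dicts {rb_type: count}; rb_cost: dict {rb_type: weight}.
    """
    # Case 2: the task needs an RB type the RZ lacks, or more RBs than it has.
    for rb, needed in task_rb.items():
        if needed > rz_rb.get(rb, 0):
            return math.inf
    # Case 1: weighted sum of the unused RBs of each type in the RZ.
    return sum(rb_cost[rb] * (available - task_rb.get(rb, 0))
               for rb, available in rz_rb.items())


# Hypothetical RZ model and per-type RB costs.
rz1 = {"RB1": 4, "RB2": 3}
rb_cost = {"RB1": 1, "RB2": 5, "RB3": 8, "RB4": 8}

c1 = cost({"RB1": 4, "RB2": 1}, rz1, rb_cost)  # two RB2 left unused
c2 = cost({"RB1": 2, "RB2": 3}, rz1, rb_cost)  # two RB1 left unused
c3 = cost({"RB1": 5, "RB3": 2}, rz1, rb_cost)  # needs RB3 and too many RB1
```

A task is only a mapping candidate for an RZ when this value is finite, which is how the sketch mirrors the rule stated above.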

3.3. Mapping and Scheduling of Hardware Tasks on RZs

It is well known that task mapping and scheduling are highly interdependent, so the two issues need to be addressed together for both to be efficient. In order to analyze the scheduling of a given DAG of periodic tasks, it is sufficient to study its behavior over a time interval equal to the least common multiple of all the task periods, called the hyperperiod (HP). Consequently, the possible iterations of execution of each task during the HP may be determined according to its period , which is equal to . To resolve the subproblem of mapping and scheduling of hardware tasks on RZs, our methodology conducts three steps of spatial and temporal analyses. The first one checks the correctness of task precedences, dependences, and real-time functioning in the DAG by means of three essential constraints; a DAG that passes this check is validated for the following analyses. The second analysis determines the lists of ready times of each task for each possible iteration during the hyperperiod. These lists take into account the periodicity, precedence, dependence, and deadline constraints. The third analysis takes as input the lists of ready times, searches at each iteration the possible execution intervals for each task, and thereby detects possible conflicts due to overlapping between execution intervals of parallel tasks on the same RZ. It then resolves the detected overloads within RZs either by migrating execution sections of some tasks, respecting their predefined preemption points, or by increasing the number of overloaded RZs.
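As a small illustration of the hyperperiod, the following Python sketch computes the least common multiple of the task periods and, assuming each task executes once per period, the number of iterations of each task within the HP. The task names and period values are hypothetical.

```python
from functools import reduce
from math import gcd


def hyperperiod(periods):
    """Least common multiple of all task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def iterations_per_task(periods):
    """Return the HP and, for each task, its number of iterations in the HP."""
    hp = hyperperiod(periods.values())
    return hp, {task: hp // p for task, p in periods.items()}


# Hypothetical periods for three tasks.
hp, n_iter = iterations_per_task({"T1": 8, "T2": 12, "T3": 24})
```

Here the HP is 24 time units, so T1 runs three times, T2 twice, and T3 once.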

3.3.1. Checking of Precedence, Dependence, and Real-Time Rightness in DAG

The first temporal analysis does not take into account spatial constraints and considers that there is an available RZ for each task. It also considers the periodicity of tasks. The main objective of this analysis is only to validate the correctness of task precedence and dependences and real-time constraints in the studied DAG. It is conducted by means of three essential constraints.

(a) Dependence Checking
Beyond precedence, we consider that tasks related by a precedence edge are also dependent. This means that the execution of a given task requires the data resulting from the execution of all its predecessors. This dependence is expressed by (7). Equation (7) focuses on periods and the amount of data exchanged between dependent tasks. It guarantees, when and , that each data item produced by task in its repetition is consumed by its successor during its iterations of execution or . When and , at each iteration of execution of , it ensures that there are sufficient data for to be executed. Equation (7) eliminates the problem of data wastage and ensures that the data produced by all the iterations of each task are consumed by all the iterations of its successors. In this work, we focus on the case where , for all .

(b) Precedence Checking
As we work in the case of , for all , and considering the periodicity constraint, this constraint requires that each iteration of execution of a given task may include only the iterations or of its successors to ensure the correctness of the precedence constraint. During repetition of a given task , this constraint prohibits the coexistence of the and repetitions of its successors. In this way, during the HP, each th execution of any task is preceded by the th execution iterations of all its predecessors. To guarantee these rules, the constraint expressed by (8) was developed to guide the selection of execution times and periods for tasks in the DAG. In (8), depicts the possible number of execution iterations of during the HP. and are the ready times for tasks and without considering the spatial constraints. They are determined by searching the critical path corresponding to the task in the graph by using the ASAP (as soon as possible) technique. For each task having a set of predecessors , its ready time is computed as follows
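An ASAP computation of this kind can be illustrated with a minimal Python sketch. It assumes that a task becomes ready as soon as every predecessor, started at its own ready time, has completed its worst-case execution time, and it ignores communication delays; both assumptions, as well as the names and values used, are hypothetical and may differ from the paper's exact expression.

```python
def asap_ready_times(wcet, preds):
    """ASAP ready time of each task in a DAG.

    wcet: dict {task: worst-case execution time}
    preds: dict {task: list of predecessor tasks}
    Assumption: ready(t) = max over predecessors p of ready(p) + wcet(p),
    and 0 for tasks without predecessors.
    """
    ready = {}

    def visit(task):
        if task not in ready:
            ready[task] = max(
                (visit(p) + wcet[p] for p in preds.get(task, ())),
                default=0,
            )
        return ready[task]

    for task in wcet:
        visit(task)
    return ready


# Hypothetical DAG: T1 and T2 feed T3, which feeds T4.
wcet = {"T1": 2, "T2": 3, "T3": 1, "T4": 2}
preds = {"T3": ["T1", "T2"], "T4": ["T3"]}
ready = asap_ready_times(wcet, preds)
```

The ready time of T3 is set by T2, its slowest predecessor, which is the critical-path behaviour the ASAP technique relies on.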

(c) Real-Time Checking
Considering the periodicity constraint, this constraint analyses the respect of real-time constraints in the best case of spatial conditions in terms of number of RZs. By respecting the precedence and dependence constraints, (10) checks at each iteration whether each task may complete its execution before its strict absolute deadline. If the absolute deadline of the task for its last iteration exceeds the HP, the deadline then turns into the HP. As we work in the case , for all , the second expression of (10) satisfies the deadline constraint for the remaining repetitions of task when its predecessors achieve all their execution iterations
Satisfying the three previous constraints validates the selection of periods and execution times for periodic tasks in the graph for scheduling with precedence and dependence constraints under strict real-time constraints on an unlimited number of reconfigurable units. The following temporal and spatial analyses then extract the number of RZs that satisfies these constraints. When the previous constraints are not satisfied, the DAG is considered invalid; it is then either rejected, or the features of its tasks are modified and evaluated again until it respects the constraints expressed by (7), (8), and (10).
Consequently, our methodology starts from the temporal analysis to construct a suitable physical architecture allowing a feasible schedule. As a final step at this stage, after fixing the physical architecture of the multi-reconfigurable-unit system, analytic resolution of mapping/scheduling provides the optimal scheduling of DAGs on the target technology.

3.3.2. Determination of Lists of Ready Times

The temporal analysis does not consider the physical constraints and searches all the ready times in each execution iteration for all tasks in the graph by respecting the precedence, dependence, periodicity, and real-time constraints. This analysis yields, by means of (11), the lists during which task may start execution, where denotes the th iteration of task within the HP. Outside this list, task might not respect the predefined constraints, which would lead to unfeasible scheduling.

For each task , is the lower bound time to start task execution at its iteration in order to respect its precedence, dependence, and periodicity constraints. is the upper bound time from which the task can no longer start execution if its strict deadline and the data dependency and precedence of its successors are to be respected. To compute the of tasks, we start computing from the top of the DAG. For the calculation, we start from the bottom of the DAG, and for both, we proceed using the breadth-first search strategy

and are the start times of the th repetition of according to its first ready times ( and ). Hence, they are determined by incrementing and by . For example, if we have a task with a period = 8, a hyperperiod , and we consider and , then , , , , , .
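The incrementing of the first-iteration bounds can be sketched in Python. The sketch assumes that the bounds of each later iteration are obtained by adding one period per elapsed iteration; the period of 8 comes from the example above, while the hyperperiod of 24 and the first-iteration bounds of 1 and 3 are assumed values chosen only for illustration.

```python
def iteration_bounds(lower_first, upper_first, period, hp):
    """Lower and upper start-time bounds of every iteration within the HP.

    Assumption: the bounds of the first iteration are shifted by one
    period for each subsequent iteration.
    """
    n_iter = hp // period
    return [(lower_first + k * period, upper_first + k * period)
            for k in range(n_iter)]


# Period 8 as in the example; HP = 24 and bounds (1, 3) are assumed values.
bounds = iteration_bounds(lower_first=1, upper_first=3, period=8, hp=24)
```

With these values the task has three iterations, each with its own pair of start-time bounds.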

3.3.3. Determination of Task Execution Intervals and the Number of RZs

This temporal and spatial analysis considers the RZ types resulting from the task clustering stage and searches the possible parallelism between tasks to study execution conflicts on the RZs. When an overload is detected in some RZs, the analysis starts by solving this problem through a migration mechanism; if migration does not produce a solution, it increments the number of overloaded RZs as required. This analysis searches first, by means of Algorithm 2, the execution intervals of each task for each possible iteration during the HP, then, using Algorithm 3, it deals with the overlapping execution intervals on the RZs to search the possible overloads, and finally, it uses Algorithm 4 to try to solve the found overloads by the migration mechanism; when this latter mechanism fails, additional RZs are inserted in the architecture of the multi-reconfigurable-unit system to solve the persisting overloads.

Algorithm 2: Search of execution intervals.
(1) // Tasks
(2) // Natural, iterations of execution of task
(3) // the list of ready times for task during th iteration
(4) // the ready times for task during th iteration
(5) // the ready times for task from the current
(6) // the set of execution intervals for task during th iteration
(7) for  all  tasks   do
(8)  for  all  execution iterations of task   do
(9)   
(10)   if     then
(11)     = ,
    
(12)   else
(13)    for  all     do
(14)      = First
(15)     for  all     do
(16)      
       ( , if is chosen as upper
      bound for this interval then it must respect: +
       , if is chosen then it must respect:
(17)     end  for
(18)      = next ( )
(19)    end  for
(20)   end  if
(21)  end  for
(22) end  for

Algorithm 3: Search of overlapping between execution intervals of tasks in the same RZ.
(1) // Tasks
(2) , , , // Natural, iterations of execution of tasks
(3) // The set of execution intervals for task during iteration
(4) // All the possible combinations of sets of distinct tasks that give
    overlapping execution intervals on a given RZ, during
(5) // Combination of overlapping execution intervals depicted by ,
(6) // The remaining execution time within task
(7) // The remaining period within task
(8) // The period of conflict for a combination of overlapping execution intervals
(9) for  all  iteration   do
(10)  for  all  RZ  do
(11)    = /
(12)    = one combination of overlapping execution intervals extracted from sets in .
(13)   for  all   in   do
(14)    In , search the latest tasks in starting execution
(15)    In , search all the remaining tasks that are ready or start execution before and determine
     their remaining execution times: and their remaining periods:
(16)     )
(17)     = )
(18)    if  ( , = , , , ,
      = + and , = + *
      , , and and are the
      iterations from which the execution intervals are taken respectively for tasks and )  then
(19)      = / + /
(20)    else
(21)     
(22)      = / + /
(23)    end  if
(24)   end  for
(25)  end  for
(26) end  for

Algorithm 4: Task migration or addition of RZs.
(1) // Combination of overlapping execution intervals
(2) // Boolean controlling after migration whether RZ is no longer overloaded by
(3) // Load of the RZ corresponding to
(4) // The set of RZs helpful for partial migration
(5) // The set of tasks that might perform partial migration
(6) for  all   giving overloaded RZ  do
(7)    False
(8)   while  ( = False) and ( )  do
(9)    // Total migration of tasks
(10)   Select the interval from that gives the heaviest occupation rate and discard it from .
(11)   Check whether the iteration, corresponding to the execution interval of the selected task, is studied on another
       non-overloaded RZ and update the load of the overloaded RZ after the elimination of the selected task.
(12)    if     then
(13)      True
(14)    end  if
(15)   end  while
(16)  if     then
(17)   Reinitialize and with its tasks
(18)   
(19)   
(20)   // Partial migration of tasks
(21)   Omit the tasks from the overloaded RZ, corresponding to , that are also acceptable by other RZs ( ) and
   subtract their occupation rates from the loads of these RZs. These latter RZs with the overloaded RZ corresponding to
    are included in the set . The omitted tasks are included in
(22)   while     do
(23)    In set, start by the task that gives the best trade-off between least number of RZs in
    where it could migrate and heaviest occupation rate in the overloaded RZ.
(24)    During , within the selected task, choose the biggest execution sections that could be placed in RZs from the
     set without overloading them.
(25)    Update the load of RZs receiving execution sections from the selected task
(26)    if  Some execution sections of the selected task are not placed  then
(27)     Reinitialize the loads of RZs in to values before processing
(28)     Increment the number of RZ corresponding to up to , go to 6
(29)    else
(30)     Discard the selected task from .
(31)    end  if
(32)   end  while
(33)   // All the execution sections of tasks are placed
(34)   
(35)  end  if
(36) end  for

(a) Search of Execution Intervals
This step uses the functional model of tasks and determines their execution intervals by means of Algorithm 2 of worst-case temporal complexity equal to . An execution interval for a given task at a given iteration is an interval during which the task can be executed while satisfying all the predefined constraints.
For each task in the DAG, Algorithm 2 produces the set of its possible execution intervals expressed by at each possible iteration during the HP. When this algorithm processes the first iteration of task , the set of its possible execution intervals is determined directly (line 11) considering the precalculated ready times in , and the periodicity and dependence constraints. For the next iterations, considering each possible for the purpose of respecting the periodicity constraint (line 13), at each iteration , Algorithm 2 searches all the corresponding execution intervals starting with each possible (line 14, line 15), and considering the dependence and the deadline constraints, and the HP to determine the upper bound for each execution interval (line 16). In line 18, in order to guarantee the periodicity and dependence constraints, progressing from a to another in list , Algorithm 2 must shift the list by omitting its first element, since ready times in each iteration are also linked to the ready times of the first iteration.

(b) Search of Overlapping between Execution Intervals of Tasks in the Same RZ
For the first iteration, the parallel tasks on a given RZ are extracted directly from the DAG; hence, there is no parallel execution between a given task and its successors. In the next iterations , however, searching parallel tasks must respect some rules. For parallel efficiency, a given task could overlap with its successors during their iterations on the same RZ. Moreover, this step must also consider the execution conflicts on the same RZ between a given task in its iteration and other tasks that are in the iterations when there are no dependence constraints between them. This step prohibits simultaneous execution of several iterations of the same task.
The step defines for each task the possible RZs allowing its execution in terms of types and number of RBs based on computed cost and then searches the execution conflicts on the RZs using Algorithm 3. Algorithm 3 has a worst-case temporal complexity equal to .
During each iteration , for each RZ, in order to find the parallel tasks with a finite cost with the RZ, Algorithm 3 searches all the possible combinations of sets of execution intervals of all the tasks that can be executed on the current RZ and that produce overlapping execution intervals respecting the rules described above (line 11). Then, from the resulting sets of execution intervals, Algorithm 3 extracts the execution intervals causing the conflicts in the current RZ, which results in Comb (line 12). Comb may contain two or more tasks. Each obtained Comb is processed individually to study the load of the RZ (line 13–line 24). For each Comb, we start the study only at the time at which all the tasks coexist, to check for possible conflict and its consequence on the current RZ. Thus, Algorithm 3 searches the latest tasks in Comb (line 14), and for tasks that either have already started their executions and are still running after arriving or have been ready before arriving, it searches their remaining execution times and periods (line 15–line 17) by promoting the tasks with the earliest deadlines and using the ASAP technique, especially when there are more than two tasks within the Comb. Finally, Algorithm 3 computes the load of the current RZ for this current Comb according to two cases. In the first case (Case 1, line 18-line 19), the remaining tasks with their new execution time values and periods are considered as virtual new tasks, and if their execution intervals in Comb correspond to their total periods, Algorithm 3 computes the load of the RZ as mentioned in line 19 by considering the occupation rate of each virtual new task and each latest task in Comb on the current RZ.
In the second case (Case 2), the longest period of time that could be shared by the tasks in Comb is determined in line 21 and the load of the RZ is studied during (line 22).
This step deals with all possible cases of execution conflicts on all the RZs. At the end of this step, we obtain at each iteration during the HP, the loads of each RZ produced by each combination of tasks giving overlapping execution intervals. Consequently, we can detect the possible overloads in the RZs . Some combinations might be included within other ones. When overloads are detected on some RZs, the next step resolves the problem either by performing migration of tasks respecting their preemption points or by incrementing the number of overloaded RZs until the overloads are covered.
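The load computation of the first case can be sketched in Python as a sum of occupation rates. The sketch assumes that the occupation rate of a task is its (remaining) execution time divided by its (remaining) period and that an RZ is overloaded when its load exceeds 1; the function names and numeric values are hypothetical.

```python
from fractions import Fraction


def rz_load(latest, remaining):
    """Load of an RZ for one combination of overlapping intervals.

    latest: list of (execution_time, period) for the latest tasks in Comb
    remaining: list of (remaining_execution_time, remaining_period) for
        tasks already running or ready, treated as virtual new tasks
    """
    return (sum(Fraction(c, p) for c, p in latest)
            + sum(Fraction(c, p) for c, p in remaining))


def is_overloaded(load):
    # Assumption: an RZ cannot be occupied for more than 100% of the time.
    return load > 1


light = rz_load(latest=[(2, 8)], remaining=[(3, 6)])
heavy = rz_load(latest=[(4, 8)], remaining=[(3, 6), (1, 4)])
```

Exact fractions are used so that a load of exactly 1 is not misclassified by floating-point rounding.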

(c) Task Migration or Addition of RZs
Migration of tasks causing an overload on a given RZ during a combination of simultaneous executions might be total or partial. Total migration consists in moving the entire execution of one or more tasks to other RZs. Partial migration consists in reallocating some execution sections within tasks, predefined by their preemption points, to nonoverloaded RZs. The migration is performed as explained by Algorithm 4. The worst-case temporal complexity of Algorithm 4 is . Algorithm 4 searches the combinations producing overloaded RZs obtained by Algorithm 3 (line 6). Algorithm 4 starts with total migration (line 8–line 15) to avoid the preemption of tasks and the resulting difficulties related to context switches. During each Comb causing overload, the algorithm extracts the execution interval of the task that provides the largest occupation rate in the current Comb (line 10). In the case of equality between tasks, the task producing the fewest RZs for total migration during the Comb should be selected. The algorithm then determines the execution iteration of the task corresponding to this extracted interval and checks whether the iteration is also studied in other nonoverloaded RZs. Should this be the case, total migration of the task is allowed, and the task is eliminated from the overloaded RZ (line 11). If there are several RZs accepting total migration of the selected task, the RZ which is least required by tasks in the Comb is chosen. In cases of equality, the least loaded RZ is kept. After each total migration of a task, Algorithm 4 updates the load of the RZ corresponding to Comb and checks whether it is no longer overloaded (line 12–line 14). When total migration fails to resolve the overload in the RZ, partial migration takes place (line 16–line 35). Partial migration reinitializes the load of the current RZ corresponding to Comb.
It searches the tasks in Comb that give finite with other RZs and omits their occupation rates from the combinations on these latter RZs and from the current overloaded RZ considering the iteration of each task in Comb and includes the omitted tasks in the set Task-migration. The RZs accepting the tasks of Comb and the RZ corresponding to Comb are inserted within the set RZ-migration (line 21). Algorithm 4 attributes weights to tasks within Task-migration according to their occupation rates in Comb and according to the numbers of RZs in RZ-migration producing finite with them. The task yielding the best trade-off between the two parameters is selected (line 23). Within the selected task, partial migration tries to reallocate its predefined execution sections in RZs from RZ-migration without causing an overload (line 24). Partial migration promotes the biggest execution sections respecting this rule. It starts by placing the selected execution section on the RZ which is least required by the other tasks in Task-migration waiting for partial migration. In cases of equality, it starts with the least loaded RZ in RZ-migration. Algorithm 4 pursues the processing of these partial migrations until the set Task-migration is empty. If partial migration does not successfully map all the execution sections of a given task in Task-migration (line 26), Algorithm 4 reinitializes the loads of the RZs to their initial values before processing the Comb (line 27), stops its processing for the current Comb, and increments the number of the corresponding RZ to (line 28). When a given overloaded Comb is resolved by partial migration, Algorithm 4 takes into account the update of loads of all altered RZs (line 25) to be considered for the next Comb. Although migration might resolve many overloads for several combinations, it is still very difficult to perform as it is exhaustive and depends on the initial choices of tasks and RZs. 
One could consider the best case of studied migrations to minimize the number of added RZs. After each increment of RZ, the step must consider the added RZs when dealing with the overloads of the remaining Comb not yet processed, to avoid an excessive number of unusable RZs. As the proposed migration algorithm is not exact, it might lead to an excessive number of RZs. This problem is covered during the resolution of mapping/scheduling, which also optimizes the number of used RZs for the purpose of resource efficiency. However, an excess of RZs is useful, as it prevents the analytic resolution from being infeasible and consequently guarantees the feasibility of scheduling of the DAG.
At the end of this step, the number of RZs required to perform the scheduling of the chosen DAG on the FPGA is obtained. The resulting RZs constitute the target multi-reconfigurable-unit system where the scheduling of DAG will be conducted. The next step focuses essentially on determining the optimal valid scheduling that respects the predefined constraints.
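The total-migration phase can be illustrated with a simplified greedy sketch in Python. It is not Algorithm 4 itself: it assumes an overload threshold of 1, always picks the least loaded candidate RZ, and omits partial migration and the addition of RZs. The function name, the candidate lists (standing in for RZs with a finite cost for the task), and all values are hypothetical.

```python
from fractions import Fraction


def total_migration(comb, loads, overloaded, candidates):
    """Try to relieve one overloaded RZ by moving whole tasks elsewhere.

    comb: dict {task: occupation rate on the overloaded RZ}
    loads: dict {rz: current load}
    candidates: dict {task: other RZs able to execute the task}
    Returns (resolved, moves, new_loads).
    """
    loads = dict(loads)       # work on a copy
    remaining = dict(comb)
    moves = {}
    while loads[overloaded] > 1 and remaining:
        # Select the task with the heaviest occupation rate.
        task = max(remaining, key=remaining.get)
        rate = remaining.pop(task)
        # Keep only target RZs that would not become overloaded.
        fitting = [rz for rz in candidates.get(task, [])
                   if loads[rz] + rate <= 1]
        if fitting:
            target = min(fitting, key=loads.get)  # least loaded RZ
            loads[target] += rate
            loads[overloaded] -= rate
            moves[task] = target
    return loads[overloaded] <= 1, moves, loads


loads = {"RZ1": Fraction(5, 4), "RZ2": Fraction(1, 4)}
comb = {"T1": Fraction(1, 2), "T2": Fraction(1, 2), "T3": Fraction(1, 4)}
candidates = {"T1": [], "T2": ["RZ2"], "T3": ["RZ2"]}
resolved, moves, new_loads = total_migration(comb, loads, "RZ1", candidates)
```

In this example T1 cannot move anywhere, so the sketch falls through to T2, whose migration to RZ2 brings both RZs to a load of 3/4.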

3.3.4. Mapping and Scheduling Resolution

In the last step of the current stage, we concentrate on the resolution of mapping and scheduling tasks in the DAG on the resulting multi-reconfigurable-unit system. Mapping and scheduling are highly interlinked. It is well known that static multiprocessor scheduling of DAGs is performed by means of two actions: (i) assignment of an execution order expressed by temporal scheduling and (ii) assignment of processors expressed by mapping, for a set of tasks characterized by precedence and real-time constraints. With our methodology, based on a preemptive model, mapping consists in assigning each task to the most suitable RZs in terms of utilization of costly resources. Mapping is considered as spatial scheduling to a limited number of heterogeneous RZs. The scheduling searches the optimal scenario for task execution on RZs during the HP. At each execution iteration for a given task, it assigns for its execution sections specific times to launch their executions on the corresponding RZs. This scheduling is valid only when it satisfies predefined temporal constraints, and it should optimize the makespan of the graph, parallel efficiency, waiting time, and schedule response time. The proposed resolution leads to global static pipelined scheduling on a heterogeneous multi-reconfigurable-unit system because it is constructed at compile time, and it allows overlapping between execution iterations of distinct tasks on distinct RZs. Moreover, the problem of mapping/scheduling is a combinatorial optimization problem as it uses a discrete solution set and chooses the best combination of feasible assignments by optimizing a multiobjective function.

In this paper, resolution of mapping/scheduling is performed by a mixed integer nonlinear programming solver, as it is well adapted to this kind of problem. The mapping/scheduling problem is modeled by the quadruplet (constants, variables, constraints, and objective function).

Constants

NT: Number of tasks constituting the DAG
NZ: Number of RZs resulting from the task migration or addition of RZs analysis
NP: Number of RB types existing in the target technology
, : The references of iterations of executions during the HP
: The references of RZs
, : The references of tasks
: The references of RB types
, : The references of preemption points in tasks
: The references of times values,
: The cost between task and RZ
: The cost of each RB type
HP: The hyperperiod in the DAG
: The worst case execution time of task
: The period of task which is equal to its relative deadline
: The number of possible preemption points within task
: The set of possible preemption points of task . The first preemption point for all tasks is equal to 0
: The execution section within task provided by the predefined preemption point
: The number of execution iterations of task during the HP
: The set of possible times assigned to preemption points during the HP. It is equal to
: Binary constant takes 1 when task has a precedence constraint with task in the DAG
: Binary constant takes 1 when task has data dependence constraints with tasks in the DAG
: The configuration overhead of
Com: The maximum value of time for transmitting a data of unit length between two dependent tasks
: The amount of data sent by for execution.

Variables

Binary variable takes 1 when the preemption point of task is mapped to at time at iteration . This variable ensures the link between mapping and scheduling. In our resolution, the mapping/scheduling problem is solved when binary values are assigned to all these variables.

This variable represents the time value assigned to preemption point of task on at iteration during the HP. This variable is not defined when gives infinite with task . It is obtained as expressed by

This variable provides the time value assigned to preemption point of task at iteration during the HP and is calculated by means of