Abstract
An automated design approach for multiprocessor systems on FPGAs is presented which customizes architectures for parallel programs by simultaneously solving the problems of task mapping, resource allocation, and scheduling. The latter considers effects of fixed-priority preemptive scheduling in order to guarantee real-time requirements, hence covering a broad spectrum of embedded applications. Because the problem is inherently one of combinatorial optimization, the design space is modeled using linear equations that capture high-level design parameters. A comparison of two methods for solving resulting problem instances is then given. The intent is to study how well recent advances in propositional satisfiability (SAT) and thus Answer Set Programming (ASP) can be exploited to automate the design of flexible multiprocessor systems. Integer Linear Programming (ILP) is taken as a baseline, where architectures for IEEE 802.11g and WCDMA baseband signal processing are synthesized. ASP-based synthesis took a few seconds in the solver, three orders of magnitude faster than ILP-based synthesis, thereby showing great potential for solving difficult instances of the automated synthesis problem.
1. Introduction
In order to build flexible systems that can be adapted to applications, researchers have explored FPGA-based multiprocessor systems in an attempt to exploit both high-level parallelism in applications and the flexibility of reconfigurable devices, targeting both single [1, 2] and multi-FPGA platforms [3, 4].
The process of implementing such systems is a very complex undertaking, consisting of phases such as the design of constituent IP blocks (e.g., processors, memories, and buses), task mapping and architecture determination (high-level synthesis), low-level system integration, and finally, FPGA synthesis and placement and routing. The focus of this work is on task mapping and high-level synthesis, and it builds upon a design platform that targets system integration and synthesis [1].
An automated architecture synthesis methodology based on combinatorial optimization is used to simplify the design process. This methodology addresses the problem of determining application-specific optimum system architectures as well as mapping and scheduling corresponding parallel programs.
Often, an optimum solution requires the sharing of processor resources between tasks, necessitating the use of a task scheduler, whose impact on the overall solution must be considered. In cases where cooperative schedules suffice, that is, for applications without strict timing requirements, the resulting analysis during optimization is straightforward. This is because the overhead is easy to compute. In that case, the overall optimization objective is simply to minimize the overall execution time of the application, the makespan, or alternatively, to explore the area-throughput-power trade-off.
However, applications which impose deadline guarantees for periodic tasks may require preemptive schedulers. In such situations, one must determine how often task switching actually takes place, depending on task priorities and the schedule. The reason is that the schedule has a direct impact on the overall execution time, and hence on the optimum task mapping and resource allocation. Moreover, the optimization process must take into account schedulability constraints, because some task mappings may not guarantee deadlines.
The method presented in this paper considers the effect of fixed-priority scheduling during architecture synthesis. This covers a broad spectrum of embedded applications, with and without real-time requirements. Experimental results for parallel implementations of IEEE 802.11g and WCDMA signal processing algorithms provide a proof of concept.
Because the structure of targeted parallel programs as well as task deadlines are known a priori, it is possible to synthesize multiprocessor architectures that are optimum for the programs. It can be experimentally shown that such customized architectures are superior to domain-specific ones. That fact is important because embedded applications exhibit a wide diversity with respect to the complexity of their algorithms, the rate at which the algorithms need to operate, as well as their memory and inter-task communication patterns. The consequence is that it is virtually impossible to find a good architecture that can meet the requirements of a wide range of algorithms. Customized architectures are therefore vital.
Since it is desirable to customize a system for a target embedded parallel application, automated tools for design space exploration are required to cope with the complexity. Whereas a skilled engineer can effectively utilize workbench-based tools [5, 6] to design a feasible architecture, the sheer number of design parameters renders a disciplined exploration infeasible. Further, it has been shown in [7] that often no consistent trend with respect to design objectives can be observed when design parameters are systematically changed: the value of the objective function can increase or decrease by more than two orders of magnitude in an apparently random manner when moving between adjacent sets of parameters in the design space. Moreover, results obtained can be counterintuitive. Consequently, even experienced designers cannot effectively execute a guided exploration based on their expertise, because it is not easy to predict the outcome of parameter variation, so that superior design points can be easily missed. Those results underline the need for an automated approach.
To enable an automatic exploration, design parameters and the objective must be mathematically modeled. Since this is inherently a combinatorial optimization problem, it is natural to model the problem as such and solve it using Integer Linear Programming (ILP). However, the large number of design parameters that needs to be considered at the system level leads to a huge number of variables and constraints, thereby posing a serious challenge for ILP solvers that manifests itself in very long synthesis time. But to be useful, this automated synthesis approach must be fast in order to facilitate a systematic exploration.
On the other hand, recent advances in propositional satisfiability (SAT) methods [8] have spurred significant improvements in methods for Answer Set Programming (ASP) [9, 10]. ASP is a form of declarative programming oriented towards difficult search problems. Given the success that has been reported for solving such problems, it is interesting to study the effectiveness of these methods for speeding up the automated synthesis problem. Therefore, this paper compares ILP-based and ASP-based high-level synthesis both in terms of synthesis runtime and the quality of synthesized architectures. The study uses parallel implementations of baseband signal processing chains for IEEE 802.11g and WCDMA wireless standards.
The rest of this paper is organized as follows. Summaries on related work and on our design flow are given in Sections 2 and 3, respectively. The ILP model for optimization is presented in Section 4, followed by a model of the problem in ASP semantics in Section 5. Finally, a comparison of the two methods and concluding remarks are given in Sections 6 and 7, respectively.
2. Related Work
Mathematical modeling and tool automation for synthesizing multiprocessor systems is an area that has been extensively discussed in the literature using techniques ranging from combinatorial optimization, through dynamic programming, simulated annealing, and evolutionary algorithms, to application-specific heuristics.
The vast majority of related work in this area has the drawback that the design space is preconstrained by fixing the architecture first, followed by task mapping and scheduling [11–13]. Since no customization is possible, resulting solutions are not optimum because the optimum application-specific system is imposed by the nature of the application, as dictated by operations performed within its tasks, as well as by its inter-task communication pattern. Often preconstraining is employed to overcome the complexity of the design space.
Some approaches attempt to reduce the design complexity by eliminating design dimensions [14, 15] thereby limiting the optimality of architectures. Advanced related works [16–18] recognize and address this aspect and attempt to solve the subproblems simultaneously. However, these approaches separately consider subproblems, effectively preconstraining the design space, albeit to a much smaller degree. Other approaches such as [19] randomly search for a feasible solution and may not lead to an optimum solution.
Scheduling during or after mapping too has been extensively treated in the literature [19, 20]. There also have been efforts to map tasks in a way that optimizes for power [21], reduces chip temperature at run time [22], or minimizes interprocessor communications [23]. Dynamic mapping techniques have also been introduced with objectives such as temperature management [24] and performance optimization through adaptive mapping [25]. These approaches however consider scheduling on fixed architectures.
In contrast, in order to synthesize application-specific optimum architectures, it is important to simultaneously (i) select processors, (ii) map and schedule tasks to them, and (iii) select one or several networks for communications, such that design constraints and objectives are met. This avoids the problem of preconstraining the design space, leading to globally optimum architectures.
The contribution of the method presented in this paper is a comprehensive mathematical model that can be used for automated design space exploration without limiting the design space as well as a comparison of candidate approaches to tackle the problem.
3. The Design Flow
Figure 1 depicts the flow. The input to the flow is a parallel program, and optionally information on task periods. The application is simulated and analyzed to obtain inter-task data traffic and task precedence information. This information is used to specify an instance of an ILP problem or an ASP program. Similarly to other related work in this area, the other input to the design flow is information on available processing elements and communication networks, as well as their costs and constraints. In our approach, the design space is not preconstrained, and the problem dimensions are not ranked. This ensures the optimality of found solutions. For real-time systems, it is often sufficient to meet timing constraints so that the interest is not to find the fastest solution. In such situations, the flow can be used to find the smallest system instead.
The solution generated by the ILP/ASP solver is used to generate an abstract description of the system, which is passed to further toolchains described in [1] to generate the configuration bit stream. Because post-synthesis results could deviate from initial cost models used, new cost models can optionally be extracted after placement and routing to start a new iteration.
4. ILP Model
The ILP model used for automated synthesis in this work consists of two major parts. The first part covers constraints that establish the system functionality, without regard to deadlines [7]. The second part covers scheduling and the optimization objective and is the focus of this paper.
The following notation is used. is a task, is a processor, and is a Binary Decision Variable (BDV). means that task is mapped on processor , otherwise.
The objective function for a terminating parallel program, or for one period of a nonterminating program, is expressed as
where is the execution time of task on processor , and is the cost in time of using communication resources as described in [7]. is the cost in time of task switching which depends on several factors as described in subsequent sections
4.1. The Processor Architecture
This is the actual cost of switching context, which depends on the memory and on the microarchitecture, as well as on the mechanism for context switching (i.e., under software or with hardware support). The coefficient captures this cost and can be reliably precomputed for processors of interest. Whether this cost is actually incurred depends on task mapping as discussed in Section 4.4.
4.2. The Kernel/OS
The kernel or real-time OS introduces a control overhead due to scheduling (polling, moving tasks between run and delay queues, etc.). In this formulation, it is assumed that the kernel/OS is already selected and is fixed for each of the processors in the design space (i.e., OS selection is not a part of the optimization problem, so that the associated cost is coupled to the selected processor). This is however not a limitation because, when desired, instances of the same processor running different operating systems or microkernels can be specified in the ILP problem to extend the design space.
The overhead is caused by the clock interrupt handler interfering with the execution of application tasks because of its higher priority. This increases the number of task switches and the response time of application tasks. The coefficient in (2) captures the latter cost, the clock-handler time. The analysis in [26] describes how the clock-handler time can be estimated for fixed-priority schedulers. Because the overhead is kernel/OS-specific, and may depend on task mapping, the computation/estimation of in the problem formulator (Figure 1) is implemented in an extensible way to support new kernel/OS models.
4.3. The Schedule/Task Switching
The schedule determines how often task switching takes place as captured by the coefficient in (2). is the number of task switches incurred for the duration of the application, or for one period, when a particular group of tasks with the index is mapped on a processor . One can distinguish between three major scheduler categories: cooperative, fixed-priority preemptive, and deadline-driven schedulers.
In simple cooperative schedulers (e.g., cyclic executives), there is no preemption, so that . The overhead incurred when a task begins and ends is already captured by in (1) as part of function-call overhead.
Preemptive schedulers often require task priorities to make decisions by letting higher-priority tasks run first. This can improve performance if critical tasks are assigned higher priorities. Priorities can be fixed or dynamic depending on whether priorities can change at runtime. The exception to this distinction is measures against priority inversion. Even though such measures do change priorities dynamically, the changes are temporary modifications of otherwise fixed priorities.
Deadline-driven schedulers change task priorities dynamically and have the advantage that deadlines can be guaranteed at higher CPU utilization compared to fixed-priority scheduling. In either case, is a function of the number of context switches, and its usage in the ILP model is the same. This section discusses how the worst-case number of context switches can be estimated for fixed-priority schedulers, which are more typical in real-time embedded systems.
For these schedulers, is equal to the number of interferences due to higher-priority tasks and is obtained from Rate Monotonic Analysis (RMA) [27] within the problem formulator. The RMA in the formulator currently supports tasks with single deadlines, and which have fixed durations and nonvarying periods. However, the implementation is easily extensible to support flexible RMA models. Such models can be adapted to applications with arbitrary, multiple, or internal deadlines [28]. Future extensions will affect the computation of only. The ILP model for synthesis remains unaffected.
RMA is conducted for all possible task groups and mappings. The output of RMA, the response , is used to estimate . Algorithm 1 shows how the parameter is computed.
The first line computes a scheduling table for a group of tasks, if the group would be mapped on a processor . The rows in the scheduling table contain the priorities of the tasks in the group, together with their deadlines, periods, and execution times on the processor. The priorities are computed according to [27].
The table is initially filled in arbitrary order with task information, and the priorities are initially zero. The rows are then sorted in two passes according to periods and deadlines.
Prior to sorting, deadlines are relaxed according to Algorithm 2 in order to avoid pessimistic schedulability analysis. The analysis assumes that all tasks are released at the same time. If the response of a task is then greater than the deadline as shown in Algorithm 1, the schedule is declared infeasible. However, if there is a precedence relationship, not all tasks are released at the same time. In particular, if there is an edge between and , then cannot start until has finished. Therefore, the deadline needs to be relaxed to to reflect the fact that there is an offset from the release time of its parent task.

Relaxation proceeds by selecting the most critical task. If the subgraph of the application graph is circular, it is not immediately obvious which task is most critical because of circular producer-consumer relationships. Therefore, the algorithm selects the task in with the shortest deadline. Because a critical task is eliminated from at the end of each iteration, this selection has the effect of introducing cuts in which removes circular paths. Otherwise, if has no cycles, the most critical task in is the one that does not consume data from other tasks in the subset. Before a task is removed from , its deadline is relaxed by adding the deadline of its already removed parent, if the task had one. If the task had multiple parents, then the largest of its parents' deadlines is selected according to lines 8–12. This relaxation does not impose any limitation to the type of application graph that can be handled by the synthesis flow.
After relaxation, sorting begins. The first pass sorts tasks according to periods in ascending order. If the tasks have different periods, and is at least partially connected, then a critical assumption is made that if there is a node in with a period less than that of any of its parents, then the edge between the node and the parent represents a weak precedence, meaning that the corresponding task can execute without receiving data from its parent. An example would be a task that infrequently obtains new parameters from another task for its internal computations. Otherwise, the application graph is faulty, and the resulting schedule is meaningless. Partial connectedness in this context means that contains at least one nontrivial connected subgraph.
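The relaxation step can be sketched in Python as follows. This is an illustrative reading of the description above, not a reproduction of Algorithm 2. The function name relax_deadlines, the dictionary and edge-set inputs, the tie-breaking by shortest deadline among tasks without pending parents, and the choice to add the parent's already relaxed deadline are all assumptions made for the example.

```python
def relax_deadlines(deadlines, edges):
    """Relax task deadlines along precedence edges (hypothetical helper).

    deadlines: dict mapping task name -> deadline
    edges:     set of (parent, child) precedence pairs within the group
    """
    remaining = set(deadlines)
    relaxed = {}
    while remaining:
        # Tasks that do not consume data from any task still in the subset.
        sources = [
            t for t in remaining
            if not any(p in remaining for (p, c) in edges if c == t)
        ]
        # If every remaining task waits on another one (a cycle), fall back
        # to the shortest deadline overall, which cuts the cycle.
        candidates = sources if sources else list(remaining)
        task = min(candidates, key=lambda t: (deadlines[t], t))
        # Assumption: the offset added is the largest relaxed deadline
        # among parents that were already removed.
        parent_deadlines = [
            relaxed[p] for (p, c) in edges if c == task and p in relaxed
        ]
        relaxed[task] = deadlines[task] + max(parent_deadlines, default=0)
        remaining.remove(task)
    return relaxed
```

For a chain of three tasks with deadlines 10, 20, and 30, this sketch yields relaxed deadlines 10, 30, and 60, since each child inherits the offset accumulated by its parent.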
The second pass sorts the table again according to deadlines, but the sorting is done only within rows containing the same period. Since deadlines have been previously relaxed, no distinction with respect to precedence relationship between tasks needs to be made; if tasks and have no direct or indirect precedence, then must finish before if , because needs to finish earlier; if there is a precedence relationship, then because of the relaxation step, and must appear before . Indirect precedence in this context means that there is a path from to via one or more intermediate tasks. Therefore, because second sorting is only done within rows, tasks with shorter periods appear before those with longer periods regardless of whether or not latter tasks have shorter deadlines.
The sorting is topological and is thus not unique. Moreover, if the group represents a nonconnected graph, then the result after sorting is a partial order. The final order after sorting reflects the priorities in descending order, which are assigned by a simple enumeration.
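A minimal sketch of the two-pass sort and the subsequent enumeration is given below. The Row record, the function name assign_priorities, and the convention that a larger number denotes a higher priority are assumptions for illustration; the deadlines are taken to be the already relaxed ones.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Row:
    task: str
    period: int
    deadline: int      # relaxed deadline
    exec_time: int
    priority: int = 0  # zero until priorities are assigned


def assign_priorities(rows):
    """Order the scheduling table and enumerate priorities (hypothetical)."""
    # Pass 1: ascending periods.
    by_period = sorted(rows, key=lambda r: r.period)
    # Pass 2: ascending deadlines, but only among rows sharing one period.
    ordered = []
    for _, same_period in groupby(by_period, key=lambda r: r.period):
        ordered.extend(sorted(same_period, key=lambda r: r.deadline))
    # Enumeration: the first row receives the highest priority value.
    for position, row in enumerate(ordered):
        row.priority = len(ordered) - position
    return ordered
```

Because the second pass never moves a row across a period boundary, a task with a short period stays ahead of a task with a longer period even when the latter has the shorter deadline.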
With the table in place, the algorithm proceeds to compute the response time of each task in the group according to the scheduling table. The response is computed recursively according to [27] as
This model of response time differs slightly from that of Liu and Layland [27] in that no bound on task blocking time due to safeguarding against priority inversion is included. This is because the programming model used here is message passing: tasks do not share protected data, so semaphore-based synchronization for variables or memory locations is not required.
The analysis then concludes by comparing the response time against the deadline in lines 4–7 of Algorithm 1. The scheduling feasibility parameter in (2) is set to if the response time is larger. Finally, the number of task switches is estimated in line from the response time of the lowest-priority task and the period of the highest-priority task. This worst-case estimate is conservative because it assumes that all tasks in the group are always ready when released, so that the lowest-priority task experiences maximum interference. The use of the parameter is explained in the following subsection.
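To illustrate the recursive computation, the following Python sketch iterates the commonly used fixed-point form of response-time analysis for fixed-priority scheduling without a blocking term. The exact expression used in the formulator is not reproduced here, so the recurrence, the function names, and the list-based inputs are assumptions. Tasks are taken to be ordered by descending priority.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def response_time(exec_times, periods, index, limit):
    """Worst-case response time of task `index`, or None if it exceeds `limit`.

    exec_times, periods: lists ordered by descending priority (assumed layout)
    limit:               bound at which the iteration gives up, e.g. the deadline
    """
    own = exec_times[index]
    current = own
    while current <= limit:
        # Interference from every higher-priority task released in `current`.
        interference = sum(
            ceil_div(current, periods[j]) * exec_times[j] for j in range(index)
        )
        following = own + interference
        if following == current:
            return current  # fixed point reached
        current = following
    return None
```

With execution times 1, 2, 3 and periods 4, 6, 12, the lowest-priority task converges to a response time of 10, which is within its period of 12.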
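The concluding steps of the analysis might look as follows in Python. The source states only that the switch count is derived from the response time of the lowest-priority task and the period of the highest-priority task. The ceiling of their ratio used here is an assumed estimate, and the function name analyze_group and its inputs are hypothetical.

```python
def analyze_group(response_times, deadlines, periods):
    """Return (feasible, estimated_switches) for one task group.

    All lists are ordered by descending priority. A response time of None
    marks a task whose analysis did not converge within its deadline.
    """
    feasible = all(
        r is not None and r <= d for r, d in zip(response_times, deadlines)
    )
    if not feasible:
        return False, None
    lowest_priority_response = response_times[-1]
    highest_priority_period = periods[0]
    # Assumed estimate: number of releases of the highest-priority task
    # that fall within the response time of the lowest-priority task.
    switches = -(-lowest_priority_response // highest_priority_period)
    return True, switches
```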
4.4. Task Mapping
Task mapping influences the switching costs in two ways: (i) by selecting the processor, the switching mechanism, and thus the cost, is determined and (ii) by grouping tasks on one processor, the optimum schedule that can be applied, and thus the number of task switches, is determined. Consequently, scheduling and task mapping influence each other during optimization.
To include this cross-effect during optimization, two strategies can be followed: (i) integrate scheduling in an ILP solver so that a schedule is computed prior to cost calculation for a candidate mapping or (ii) precompute optimum schedules for all possible mappings, and integrate the schedules in the ILP formulation. In this work, we opted for the latter strategy as discussed in the previous subsection. This is because, by precomputing the schedules, infeasible mappings can be eliminated to reduce the size of the ILP instance. For this purpose, a coefficient is used in the formulation in (2). This coefficient is computed in the ILP formulator during RMA. Its value is if there is a feasible schedule for a group of tasks with the index on processor , and otherwise. We next describe how is used to enforce feasibility constraints in the formulation.
Let be the power set of the task set . Let be an element in the power set excluding the empty set, with . Let be a set, so that each element contains one or more tasks that will be mapped on the same processor. The solution to the combinatorial optimization problem consists of the set . Each element in is associated with a task switching overhead as dictated by its schedule.
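The candidate groups can be enumerated with a few lines of Python. The function name task_groups and the use of frozensets are illustrative choices, not part of the original formulation.

```python
from itertools import combinations


def task_groups(tasks):
    """All non-empty subsets of the task set, smallest groups first."""
    return [
        frozenset(group)
        for size in range(1, len(tasks) + 1)
        for group in combinations(tasks, size)
    ]
```

For n tasks this produces 2^n - 1 groups, which indicates how quickly the number of group variables grows and why eliminating infeasible groups before solving is worthwhile.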
Now since is not known at formulation time, a variable is introduced for each element in the power set . If , then . If , then is not an element of . Therefore, we insist that
so that if and only if the mapping is feasible, then constitutes a degree of freedom during synthesis. We next describe how the decision variables and are linked through ILP constraints.
Recalling that implies that a task is mapped on a processor , it follows for any group , if and only if for all . This results in a logical constraint
To transform the logical constraint into a linear form, two steps were applied. First, we specified that
These two constraints ensure that when a schedule is not feasible, then at least one is not mapped on . This implies that other groups which are either proper subsets of , or which are not supersets of , can be mapped on , provided that they have a feasible schedule. The second step then is a set of inequalities that satisfy the specification in (6)
With (4), (7), and (8), feasible mappings are guaranteed. The last step is to capture the switching cost of the groups in the objective. A contribution of a group to the switching cost is given by . However, this contribution cannot be directly used in the objective function by taking the sum of all contributions from all groups. This is because, as it can be observed from (7) and (8), if a group is mapped on a processor , then the value of for all groups which are subsets of is also one. Consequently, taking the sum of contributions directly would erroneously include the switching cost of subgroups.
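The link between a group variable and its mapping variables is a conjunction, and one common way to linearize a conjunction of binary variables can be checked by brute force as sketched below. The two inequalities used here are the textbook form and are an assumption; they are not claimed to be identical to (7) and (8). The names satisfies_linear and linearization_is_exact are hypothetical.

```python
from itertools import product


def satisfies_linear(y, xs):
    """Textbook linearization of y = AND(xs) for binary variables."""
    upper = all(y <= x for x in xs)          # y can be 1 only if every x is 1
    lower = y >= sum(xs) - (len(xs) - 1)     # y must be 1 if every x is 1
    return upper and lower


def linearization_is_exact(n):
    """Check every 0/1 assignment of n mapping variables."""
    for xs in product((0, 1), repeat=n):
        allowed = [y for y in (0, 1) if satisfies_linear(y, xs)]
        if allowed != [int(all(xs))]:
            return False
    return True
```

The check confirms that, for each assignment of the mapping variables, the inequalities leave exactly one admissible value for the group variable, namely the logical conjunction.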
To prevent this incorrect inclusion of switching costs, we need to specify that the switching contribution of a group should be counted only when for all of its supersets. This leads to nonlinear terms in the objective function of the form , where and are indices of supersets for a group with an index .
To break the nonlinearity, a BDV is introduced for each . Its value is only when the group is mapped, and for all supersets. To model this property, we note from (8) that the left-hand side of the inequality for a group with is greater than the left-hand side of the same inequality for a group , if . Therefore, the following relationship holds:
The first sum in the above relationship is the total number of tasks that have been mapped on . The second sum is the size of a group, which is the same as the left-hand side of (8). The difference of the two sums is zero in two cases: (i) if nothing is mapped on and (ii) if a mapped group has no superset for which for that specific mapping. The sum is greater than zero, if for , there is a superset with . This is because there is at least one decision variable with value in the first sum, which is not present in the second. The largest value that the difference of the two sums can have is , so that the upper limit in (9) is .
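The effect can be made concrete for a single processor with a small Python sketch. Given the set of tasks mapped on the processor, it marks every group whose tasks are all mapped and then uses the difference of the two sums to single out the one group whose switching cost should be counted. The function names and data layout are assumptions for illustration.

```python
from itertools import combinations


def nonempty_groups(tasks):
    """All non-empty subsets of the task list."""
    return [
        frozenset(c)
        for k in range(1, len(tasks) + 1)
        for c in combinations(tasks, k)
    ]


def counted_groups(tasks, mapped):
    """Return (group indicators, groups whose switching cost is counted)."""
    mapped = frozenset(mapped)
    groups = nonempty_groups(tasks)
    # A group indicator is 1 when every task of the group is on the processor.
    indicator = {g: int(g <= mapped) for g in groups}
    counted = []
    for g in groups:
        # Difference of the two sums: tasks on the processor minus group size.
        surplus = len(mapped) - len(g)
        if indicator[g] == 1 and surplus == 0:
            counted.append(g)
    return indicator, counted
```

With three tasks of which two are mapped on the processor, three group indicators are set, but only the group containing both mapped tasks has a zero surplus and contributes to the objective.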
We next exploit this relationship by specifying that
4.5. Processor-External Factors
Processor-external factors such as interrupts and data availability have a direct runtime impact on the schedule. The foregoing formulation has the limitation that it is based on worst-case assumptions in rate monotonic analysis. In particular, it is assumed that tasks in a group with no precedence relationship can become ready at the same time.
With respect to data availability, the worst-case assumptions can be relaxed by taking into account in RMA when data can actually arrive depending on source-task-to-destination-task mapping, and on the selected communication network.
A possible relaxing solution is to compute offsets between release times of tasks with indirect precedence. For example, if there are three tasks such that the first sends data to the second, and the second to the third, and the mapping is such that the first and third are mapped on one processor, and the second on another, then there is an offset between release times of the first and third tasks. This offset is equal to the time needed for data to be sent to the second task, plus the response time of the second task, plus the time for the resulting data to be sent from the second task to the third task. A suitable ILP formulation that will not significantly increase the problem size needs to be found.
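The offset in the three-task example is a plain sum, as the following sketch shows. The function and parameter names are hypothetical, and all quantities are assumed to be expressed in the same time unit.

```python
def release_offset(comm_first_to_second, response_second, comm_second_to_third):
    """Offset between the release times of the first and third tasks.

    comm_first_to_second: time to send data from the first to the second task
    response_second:      response time of the second task on its processor
    comm_second_to_third: time to send the result from the second to the third
    """
    return comm_first_to_second + response_second + comm_second_to_third
```

For example, with transfer times of 40 and 25 time units and a response time of 300 time units for the second task, the third task cannot be released earlier than 365 time units after the first.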
5. Answer Set Programming
Answer Set Programming (ASP) is different from procedural programming in that a problem is described using a formal language, and a solver finds a solution. A problem is presented as a logic program consisting of a set of atoms and rules [29]. An atom is a Boolean proposition about the problem universe; whereas rules specify relationships between the atoms. A solution to a program is called a stable model and tells which atoms are true [29]. This is similar to SAT problems if rules and stable models are perceived to be clauses and satisfying assignments, respectively. ILP models can be encoded into equivalent ASP programs. We opted to use the ASP solver clasp [9] whose grounding tool natively supports the encoding of linear constraints [30]. The native support eliminates the need of having to translate constraints into clauses, a procedure that can lead to a huge number of clauses [31]. The rest of this section describes how the ILP model from Section 4 is coded into an equivalent ASP program. The same notation is used so that variables from Section 4 now stand for atoms.
Linear inequalities are coded into rules whose general form is
where the syntax denotes the weight of a variable in a linear (in)equality, and the syntax is a general form for constraining such that a coded constraint represents an equality if , a less-than inequality if is not specified (is absent), and a greater-than inequality if is not specified [30]. Since weights in the ILP model are generally out of whereas , rounding is required. We therefore round and up, and is rounded down. Consequently, more restrictive constraints result, which can theoretically exclude a solution that otherwise does not violate the original problem constraints. This is the reason that we compare the quality of generated solutions in Section 6.
The general form (11) is used throughout for constraints, with a few exceptions in which constraints can be directly expressed as clauses, and thus directly as ASP rules. This has the advantage of eliminating auxiliary variables and associated constraints such that the overall problem size becomes smaller. An example is (5); the link between the auxiliary variable for a group of tasks and mapping decision variables is already in conjunctive form so that the constraint can be specified directly as a rule
thereby dropping (7) and (8). However, (12) represents a logical implication, whereas (5) is a logical equality. Without further measures, a stable model can potentially have as true when one or several of the atoms are false. We therefore additionally add the rule
as an integrity constraint [30] such that is not derived if any of its associated atoms is not derived.
Similarly, while not directly obvious, (10) stands for logical conjunctions that can be represented by the rule
where is the th superset of the group . The implication is that is derived when the corresponding group is derived, but none of the atoms for associated supersets is derived.
Using the same syntax for specifying weights, the objective function has the form [30]
but in this case the weights are not directly rounded. Rather, the weights which represent costs in time are converted into processor cycles so that (15) matches (1) as closely as possible. In order to avoid large numbers which can overflow the computation of the objective function, these weights are expressed in terms of cycles that would have been spent on the slowest processor, normalized by the number of cycles on the same processor that would have been consumed for the smallest weight in the objective.
6. Comparison of Synthesis Results
This section compares synthesis runtime as well as quality of results for ASP-based synthesis against the ILP-based flow. For this purpose, two parallel programs have been implemented using the Message Passing Interface (MPI) standard [32]. Only a subset of the standard that can be efficiently implemented in embedded systems has been used.
The first application implements a signal processing chain for IEEE 802.11g WLAN standard between the antenna and the channel decoder. The algorithms are described in [33–36]. The second application implements the functionality for the same portion for WCDMA. The algorithms are described in [37–39]. Tables 1 and 2 show the parallel tasks and their deadlines in nanoseconds. In this implementation, the deadlines are equal to the periods. The latter were obtained from the standards. The three last columns in Table 1 show approximated execution time in nanoseconds of the tasks on three different processors with loosely coupled accelerators. The execution times were obtained by executing the tasks on given processors. Similarly, the two last columns in Table 2 show the approximated time for WCDMA tasks on two other processors.
The differences in execution times in the two tables are caused by different types of accelerators attached to the processors. For example, all cores have a butterfly accelerator based on the architecture described in [40]; therefore, the execution time for FFT tasks is the same for all processors (the difference for FFT tasks between WLAN and WCDMA is due to the different number of FFT points). On the other hand, only P2 and P3 have a CORDIC accelerator; therefore, the execution time for the carrier-phase offset estimation and compensation task is only 156 nanoseconds for these cores, versus 909 nanoseconds for P1.
Tables 3 and 4 show the communication traffic between tasks for the two applications. These were obtained from MPI simulations for 27 OFDM symbols for WLAN, and for one slot for WCDMA. The third and fourth columns show the amount and number of data transfers, respectively.
In order to also compare the synthesis runtime with and without fixed-priority preemptive scheduling, each run was performed in two configurations: with a basic cyclic executive on all processors, and with a preemptive kernel on all processors.
For ILP, the same solver settings were used for all cases (node autoordering, most feasible basis crash, automatic branch and bound branching, and presolving of rows and columns) [41].
No options were specified for ASP. However, we determined that splitting the objective into its three constituent parts for execution, communication, and scheduling time reduces the ASP solver time by up to two orders of magnitude. This circumstance was exploited subsequently since the speedup is not accompanied by any penalty in the quality of the solution. Splitting the objective function is a feature in clasp/clingo that was conceived to avoid possible overflows when computing the value of the objective function because of the integrality of weights [30].
Tables 5 and 6 show the parameters of the processors and networks used, respectively. The number of processors and networks used was 6 in each case. Whereas the number of resources could be selected to reflect the number of nodes and edges in the application graph, it is advantageous to start with a smaller number to speed up the synthesis. If resources are exhausted, the number should be increased in another run to avoid preconstraining the design space.
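The strategy of starting with few resources and enlarging the pool only when it is exhausted can be sketched as a simple loop. The `solve` callable and the shape of its result are hypothetical placeholders for one synthesis run, and the test for exhaustion (no solution, or a solution that uses every available resource) is an interpretation for illustration only.

```python
def synthesize_with_growing_pool(solve, start, limit, step=1):
    """Rerun synthesis with a larger resource pool until it is not exhausted.

    solve(n) is a hypothetical stand-in for one synthesis run with n
    available resources. It returns None when no solution exists, or a
    dict whose "used" entry is the number of resources allocated.
    """
    n = start
    while n <= limit:
        result = solve(n)
        # A solution that leaves resources unused was not limited by the pool.
        if result is not None and result["used"] < n:
            return n, result
        n += step  # pool exhausted: enlarge it and run again
    return None  # limit reached without finding an unconstrained solution


# Toy stand-in: an application that needs exactly 8 resources.
def toy_solve(n):
    return {"used": 8} if n >= 8 else None


print(synthesize_with_growing_pool(toy_solve, start=6, limit=16, step=2))
```

Small pools keep the problem instances small, so the early runs are cheap; the final run confirms that the chosen pool size did not restrict the solution.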
Table 7 summarizes the results, which were obtained on a machine with a T5500 processor and 2 GB of memory. The columns “No. cons.” show the number of constraints and the number of rules for ILP and ASP modes, respectively. Similarly, the columns “No. var.” show the number of decision variables and atoms for the two modes. These numbers give a measure of the complexity of the problem instances. The number of variables for ILP mode is much smaller than the number of atoms for ASP mode because the ILP solver, lp_solve, has a presolve option that can reduce the size of the problem by eliminating redundant constraints. This option was exploited because presolving tends to reduce the solver time.
The columns “Form.” and “Solver” show the synthesis runtime spent formulating and solving the problem, respectively. The formulator time is rather large; most of this time is spent reading text files generated from MPI simulations. The sizes of the files in these experiments were 4.8 GB and 6.4 GB for WCDMA and WLAN, respectively. They contain, among other things, time stamps for each data packet transmitted between tasks. This large time is not a limitation for automated exploration; much shorter times can be achieved by using compressed binary files and/or cache files that capture only the relevant information during automated explorations.
The solver times for ASP mode are dramatically shorter, by up to three orders of magnitude. Given that the ASP solver time is on the order of a few seconds (versus up to 8 hours for ILP mode), synthesis using ASP is a promising approach for automatically exploring a large number of design alternatives, as this flexible multiprocessor synthesis problem requires.
As previously mentioned, we additionally need to compare the quality of results because of a potential for overconstraining in ASP mode, particularly in light of the fact that ASP mode is much faster. The columns “Obj.” show the value of the objective function after optimization. Since both methods are heuristic, it is interesting to know how far integer solutions are from the corresponding relaxed solutions, which is a measure of how good a solution is in case the solver times out. A timeout occurred once, for WCDMA in ILP synthesis mode under cyclic scheduling. The columns “Gap” give this measure, which indicates that the solution was quite good even in the timeout case.
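For orientation, one common way to express the distance between an integer solution and its relaxed counterpart is the relative gap sketched below. This definition is an assumption for illustration; the values in the “Gap” columns may have been computed differently by the solvers.

```python
def relative_gap(integer_obj, relaxed_obj):
    """Relative distance of an integer objective from the relaxed objective.

    Assumed definition (not confirmed by the paper):
    |integer - relaxed| / |relaxed|.
    """
    if relaxed_obj == 0:
        raise ValueError("relaxed objective must be non-zero")
    return abs(integer_obj - relaxed_obj) / abs(relaxed_obj)


# Hypothetical objective values, for illustration only.
print(f"{relative_gap(105.0, 100.0):.1%}")
```

A small gap at timeout indicates that the incumbent integer solution is already close to the bound given by the relaxation, even though optimality has not been proven.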
Comparing the value of the objectives for the two modes, the differences before rounding are insignificant with the exception of the timeout case where ASP mode found a better solution. The impact of more restrictive constraints for ASP mode due to rounding as discussed in Section 5 was not apparent in these experiments. While these particular results are suggestive, experiments with a much larger set of parallel programs are still required to characterize the potential impact.
The impact of scheduling constraints can be seen by comparing the values of the objective functions under the two scheduling modes. Better values were obtained under cyclic scheduling because no deadlines were imposed when mapping tasks, so that only the execution times needed to be considered. Consequently, the solvers attempt to group tasks such that expensive inter-task communications are minimized. However, these apparently faster architectures are not usable in practice because they offer no deadline guarantees.
Figure 2 shows synthesized architectures under the two modes for preemptive scheduling. These architectures are very similar for WLAN, and the same architecture was obtained for WCDMA. This result emphasizes the potential of ASP-based synthesis, since the quality of results was not traded against solver runtime. Synthesized architectures are not necessarily unique because there may exist several optimum solutions to a combinatorial problem. Thus, for the WLAN case, where the two architectures are similar but not the same, it is quite possible that either architecture could have been obtained through either the ILP or the ASP synthesis mode. This is because all resources are allocated through a nonconstraining combinatorial optimization process according to (1) based on resource parameters and characteristics of the parallel program, and not on the synthesis method used.
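To illustrate what a deadline guarantee under fixed-priority preemptive scheduling involves, the sketch below applies the classical response-time analysis to the tasks mapped on one processor, with deadlines equal to periods as in this implementation. It is a textbook test given for orientation only; the schedulability constraints actually used in the formulation may differ, and the task values are hypothetical.

```python
def response_times(tasks):
    """Classical response-time analysis for one processor.

    tasks: list of (exec_time, period) integer pairs, highest priority
    first, with deadline equal to period. Returns the worst-case response
    time of each task, or None if some task can miss its deadline.
    """
    results = []
    for i, (c_i, t_i) in enumerate(tasks):
        r = c_i
        while True:
            # Interference from all higher-priority tasks released within r.
            interference = sum(-(-r // t_j) * c_j for c_j, t_j in tasks[:i])
            r_next = c_i + interference
            if r_next > t_i:
                return None  # deadline (= period) can be missed
            if r_next == r:
                break  # fixed point reached
            r = r_next
        results.append(r)
    return results


# Hypothetical task set (time units are arbitrary).
print(response_times([(1, 4), (2, 6), (3, 12)]))
print(response_times([(3, 4), (3, 6)]))
```

When the test fails for a group of tasks, those tasks cannot share a processor, which is why mappings under preemptive scheduling tend to spread tasks out and pay more for communication than mappings under cyclic scheduling.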
Figure 2: Synthesized architectures under preemptive scheduling. (a) WLAN, ILP. (b) WLAN, ASP. (c) WCDMA, ILP and ASP.
Finally, two discussion points are in order for these architectures. First, as previously mentioned, our proposed design flow does not preconstrain the design space. As a result, multiple communication resources are allocated between any pair of processors in this experiment. This allocation minimizes the cost of expensive inter-task communications, which are necessary under preemptive scheduling because schedulability constraints prevent certain tasks from being mapped on the same processor. Thus, the generated architecture description (Figure 1) includes not only the netlist, but also information on which communication libraries a task should use to communicate with any other task. This information is used by configuration tools to automatically bind appropriate low-level communication libraries for different networks, links, and buses.
Second, the physical implementations of message passing interfaces make use of FIFOs to queue messages. This means that, once a task has initiated a data transfer by calling the appropriate function, the task is free to do further processing and to initiate or wait for data via another communication resource. The result is that communication latencies can be hidden through overlapping. However, this effect is not accounted for in the objective function because temporal information is not used. Consequently, the actual cost in total computation time can be smaller than what the objective function indicates.
Accounting for temporal information is not a feasible prospect because a far greater number of variables would need to be considered to model and capture all possible moments in which communications can be initiated.
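The difference between the cost seen by the objective function and the cost after overlapping can be illustrated with a deliberately idealized model. The two functions below are assumptions for illustration: the first adds computation and communication times as an objective without temporal information would, and the second gives the best case in which every transfer runs on its own FIFO-backed resource fully in parallel with the computation.

```python
def summed_cost(compute_ns, transfers_ns):
    """Cost when communication is simply added to computation time."""
    return compute_ns + sum(transfers_ns)


def overlapped_cost(compute_ns, transfers_ns):
    """Idealized best case: transfers on separate resources overlap fully
    with the computation and with each other (an assumed lower bound)."""
    return max([compute_ns, *transfers_ns])


# Hypothetical times in nanoseconds.
compute, transfers = 500, [200, 300]
print(summed_cost(compute, transfers), overlapped_cost(compute, transfers))
```

Real executions fall between these two values, since data dependencies force some transfers to complete before computation can continue.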
7. Summary and Conclusion
In this paper, a method for automated architecture synthesis for FPGA multiprocessor systems has been presented. The method takes into account fixed-priority preemptive scheduling to cover a broad spectrum of embedded application requirements. Combinatorial optimization is used during synthesis. A case study, in which architectures for IEEE 802.11g and WCDMA baseband signal processing algorithms are synthesized, demonstrates the feasibility of the automated synthesis by showing that problems of the sizes encountered in the embedded domain can be solved. Synthesis based on ILP and ASP methods has been compared. ASP mode has a far greater potential for solving difficult synthesis problems. Solver times in this mode were on the order of a few seconds, which is up to three orders of magnitude faster than ILP-based synthesis, without sacrificing the quality of results.