EURASIP Journal on Embedded Systems
Volume 2008 (2008), Article ID 594129, 21 pages
doi:10.1155/2008/594129
Research Article

Compilation and Worst-Case Reaction Time Analysis for Multithreaded Esterel Processing

Department of Computer Science, University of Kiel, 24118 Kiel, Germany

Received 15 September 2007; Accepted 18 April 2008

Academic Editor: Michael Mendler

Copyright © 2008 Marian Boldt et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The recently proposed reactive processing architectures are characterized by instruction set architectures (ISAs) that directly support reactive control flow including concurrency and preemption. These architectures provide efficient execution platforms for reactive synchronous programs; however, they do require novel compiler technologies, notably with respect to the handling of concurrency. Another key quality of the reactive architectures is that they have very predictable timing properties, which make it feasible to analyze their worst-case reaction time (WCRT). We present an approach to compile programs written in the synchronous language Esterel onto a reactive processing architecture that handles concurrency via priority-based multithreading. Building on this compilation approach, we also present a procedure for statically determining tight, safe upper bounds on the WCRT. Experimental results indicate the practicality of this approach, with WCRT estimates accurate to within 22% on average.

1. Introduction

The programming language Esterel [1] has been designed for developing control-dominated reactive software or hardware systems. It belongs to the family of synchronous languages [2], which have a formal semantics that abstracts away run-time uncertainties, and allow abstract, well-defined, and executable descriptions of the application at the system level. Hence these languages are particularly suited to the design of safety-critical real-time systems. To express reactive behavior, Esterel offers numerous powerful control flow primitives, in particular concurrency and various preemption operators. Concurrent threads can communicate back and forth instantaneously, with a tight semantics that guarantees deterministic behavior. This is valuable for the designer, but also poses implementation challenges.

Besides being compiled to C and executed as software, or being compiled to VHDL and synthesized to hardware, Esterel can be executed on a reactive processor [3]. These processors directly support reactive control flow, such as preemption and concurrency, in their instruction set architecture (ISA). One approach to handle concurrency is multithreading, as implemented in the Kiel Esterel processor (KEP). The KEP uses a priority-based scheduler, which makes threads responsible for managing their own priorities. This scheme keeps the scheduler very lightweight. In the KEP, scheduling and context switching do not cost extra instruction cycles; only changing a thread's priority costs an instruction. One challenge for the compiler is to compute these priorities in a way that on the one hand preserves the execution semantics of Esterel and on the other hand does not lead to too many priority changes, since this would decrease the execution speed. We have developed a priority assignment algorithm that makes use of a special concurrent control flow graph and has a complexity that is linear in the size of that graph, which in practice tends to be linear in the size of the program.

Apart from efficiency concerns, which may have been the primary driver towards reactive processing architectures, one of their advantages is their timing predictability. To leverage this, we have augmented our compiler with a timing analysis capability. As we here are investigating the timing behavior for reactive systems, we are specifically concerned with computing the maximal time it takes to compute a single reaction. We refer to this time, which is the time from given input events to generated output events, as worst-case reaction time (WCRT). The WCRT determines the maximal rate for the interaction with the environment.
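The relation between the WCRT and the maximal interaction rate can be made concrete with a small calculation. The following is a minimal sketch; the function name, the clock frequency, and the number of clock cycles per instruction cycle are illustrative assumptions and not figures taken from the KEP.

```python
def max_reaction_rate_hz(wcrt_instr_cycles: int,
                         clock_hz: float,
                         clocks_per_instr_cycle: int) -> float:
    """Upper bound on the number of reactions per second that can be guaranteed.

    All three parameters are hypothetical inputs: the WCRT in instruction
    cycles, the processor clock, and the clock cycles per instruction cycle.
    """
    if wcrt_instr_cycles <= 0 or clocks_per_instr_cycle <= 0 or clock_hz <= 0:
        raise ValueError("all parameters must be positive")
    # Longest possible duration of one reaction, in seconds.
    worst_case_seconds = wcrt_instr_cycles * clocks_per_instr_cycle / clock_hz
    return 1.0 / worst_case_seconds


# Example with made-up numbers: a WCRT of 100 instruction cycles on a
# 40 MHz clock with 4 clock cycles per instruction cycle.
rate = max_reaction_rate_hz(100, 40e6, 4)
```

With these assumed numbers, the environment may be sampled at most about 100,000 times per second without a reaction overrunning its tick.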

There are two main factors that facilitate the WCRT analysis in the reactive processing context. These are on the one hand the synchronous execution model of Esterel, and on the other hand the direct implementation of this execution model on a reactive processor. Furthermore, these processors are not designed to optimize (average) performance for general purpose computations, and hence do not have a hierarchy of caches, pipelines, branch predictors, and so forth. This leads to a simpler design and execution behavior and further facilitates WCRT analysis. In addition, there are reactive processors, such as the KEP, that allow the reaction length to be fixed to a predetermined number of clock cycles, irrespective of the number of instructions required to compute a specific reaction, in order to minimize jitter.

We here present a WCRT analysis of complete Esterel programs including concurrency and preemption. The analysis computes the WCRT in terms of KEP instruction cycles, which roughly match the number of executed Esterel statements. As part of the WCRT analysis, we also present an approach to calculate potential instantaneous paths, which may be used in compiler analysis and optimizations that go beyond WCRT analysis.

Thus this paper is concerned with both the compilation and the timing analysis of Esterel programs executed on multithreaded reactive processors. Previous reports presented earlier results in both fields [4, 5]. This paper extends and updates these reports, and represents the first comprehensive description of these two closely interrelated areas. Further details can be found in the theses of the first author [6, 7].

In the following section, we consider related work. In Section 3, we give an introduction to the synchronous model of computation for Esterel and the KEP. We outline the generation of a concurrent KEP assembler graph (CKAG), an intermediate graph representation of an Esterel program, which we use for our analysis. Section 4 explains the compilation and Section 5 presents the algorithm for the WCRT analysis. Section 6 presents experimental results that compare the WCRT estimates with values obtained from exhaustive simulation. The paper concludes in Section 7.

2. Related Work

In the past, various techniques have been developed to synthesize Esterel into software; see Potop-Butucaru et al. [8] for an overview. The compiler presented here belongs to the family of simulation-based approaches, which try to emulate the control logic of the original Esterel program directly, and generally achieve compact and yet fairly efficient code. These approaches first translate an Esterel program into some specific graph formalism that represents computations and dependencies, and then generate code that schedules computations accordingly. The EC/Synopsys compiler first constructs a concurrent control flow graph (CCFG), which it then sequentializes [9]. Threads are statically interleaved according to signal dependencies, with the potential drawback of superfluous context switches; furthermore, code sections may be duplicated if they are reachable from different control points. The SAXO-RT compiler [10] divides the Esterel program into basic blocks, which schedule each other within the current and subsequent logical tick. An advantage relative to the Synopsys compiler is that the SAXO-RT compiler does not perform unnecessary context switches and largely avoids code duplications; however, the scheduler it employs has an overhead proportional to the total number of basic blocks present in the program. The grc2c compiler [11] is based on the graph code (GRC) format, which preserves the state-structure of the given program and uses static analysis techniques to determine redundancies in the activation patterns. A variant of the GRC has also been used in the Columbia Esterel compiler (CEC) [12], which again follows SAXO-RT's approach of dividing the Esterel program into atomically executed basic blocks. However, their scheduler does not traverse a score board that keeps track of all basic blocks, but instead uses a compact encoding based on linked lists, which has an overhead proportional to just the number of blocks actually executed.

In summary, there is currently not a single Esterel compiler that produces the best code on all benchmarks, and there is certainly still room for improvements. For example, the simulation-based approaches presented so far restrict themselves to interleaved single-pass thread execution, which in the case of repeated computations (“schizophrenia” [13]) requires code replications.

We differ from these approaches in that we do not want to compile Esterel to C, but instead want to map it to a concurrent reactive processing ISA. Initial reactive ISAs did not consider full concurrency [14, 15] and will not be discussed further here. Since then, two alternatives have been proposed that do include concurrency, namely multiprocessing and multithreading.

The multiprocessing approach is represented by the EMPEROR [16], which uses a cyclic executive to implement concurrency, and allows the arbitrary mapping of threads onto processing nodes. This approach has the potential for execution speed-ups relative to single-processor implementations. However, their execution model potentially requires replicating parts of the control logic at each processor. The EMPEROR Esterel compiler 2 (EEC2) [16] is based on a variant of the GRC, and appears to be competitive even for sequential executions on a traditional processor. However, their synchronization mechanism, which is based on a three-valued signal logic, does not seem able to take compile-time scheduling knowledge into account, and instead repeatedly cycles through all threads until all signal values have been determined.

The multithreading approach has been introduced by the Kiel Esterel processor family and has subsequently been adapted by the STARPro architecture [17], a successor of the EMPEROR. The compilation for this type of architecture is a subject of this paper. In some sense, compilation onto KEP assembler is relatively simple, due to the similarities between Esterel and the KEP assembler. However, we do have to compute priorities for the scheduling mechanism of the KEP, and cannot hard-code the scheduling mechanism into the generated code directly. Incidentally, it is this dynamic, hardware-supported scheduling that contributes to the efficiency of the reactive processing approach.

It has also been proposed to run Esterel programs on a virtual machine (BAL [18]), which allows a very compact byte code representation. In a way, this execution platform can be considered as an intermediate form between traditional software synthesis and reactive processing; it is software running on traditional processors, but uses a more abstract instruction set. The proposal by Plummer et al. also uses a multithreaded concurrency model, as in the KEP platform considered here. However, they do not assume the existence of a run-time scheduler, but instead hand control explicitly over between threads. Thus their scheduling problem is related to ours, but does not involve the need to compute priorities as we have to do here. Instead, they have to insert explicit points for context switches. The main difference between the two approaches is that the KEP only switches to active threads, while the BAL switches to statically defined control points. One could, however, envision a virtual machine that has an ISA that adopts our multithreading model (a straightforward, albeit inefficient VM would be a KEP simulator), and for which the approach presented here could be applied.

One of the byproducts of our compilation approach is dead code elimination (DCE), see also Section 4.3. Our approach here is rather conservative, considering only static reachability. A more aggressive approach to DCE, based on an extension of Esterel with a noninstantaneous jump instruction, has been presented by Tardieu and Edwards [19]. Their approach, as well as other work that performs reachability analysis as part of constructiveness analysis [20], is more involved than our approach in that they perform a (more or less conservative) analysis of the reachable state space.

Regarding timing analysis, there exist numerous approaches to classical worst-case execution time (WCET) analysis. For surveys see, for example, Puschner and Burns [21] or Wilhelm et al. [22]. These approaches usually consider (subsets of) general purpose languages, such as C, and take information on the processor designs and caches into account. It has long been established that to perform an exact WCET analysis with traditional programming languages on traditional processors is difficult, and in general not possible for Turing-complete languages. Therefore WCET analyses typically impose fairly strong restrictions on the analyzed code, such as a priori known upper bounds on loop iteration counts, and even then control flow analysis is often overly conservative [23, 24]. Furthermore, even for a linear sequence of instructions, typical modern architectures make it difficult to predict how much time exactly the execution of these instructions consumes, due to pipelining, out-of-order execution, argument-dependent execution times (e.g., particularly fast multiply-by-zero), and caching of instructions and/or data [25]. Finally, if external interrupts are possible or if an operating system is used, it becomes even more difficult to predict how long it really takes for an embedded system to react to its environment. Despite the advances already made in the field of WCET analysis, it appears that most practitioners today still resort to extensive testing plus adding a safety margin to validate timing characteristics. To summarize, performing conservative yet tight WCET analysis appears by no means trivial and is still an active research area.

Whether WCRT can be formulated as a classical WCET problem or not depends on the implementation approach. If the implementation is based on sequentialization such that there exist two dedicated points of control at the beginning and the end of each reaction, respectively, then WCRT can be formulated as a WCET problem; this is the case, for example, if one “automaton function” is synthesized, which is called during each reaction. If, however, the implementation builds on a concurrent model of execution, where each thread maintains its own state-of-control across reactions, then WCRT requires not only determining the maximal length of predefined instruction sequences, as in WCET, but one also has to analyze the possible control point pairs that delimit these sequences. Thus, WCRT is more elementary than WCET in the sense that it considers single reactions, instead of whole programs, and at the same time WCRT is more general than WCET in that it is not limited to predefined control boundaries.

One step to make the timing analysis of reactive applications more feasible is to choose a programming language that provides direct, predictable support for reactive control flow patterns. We argue that synchronous languages, such as Esterel, are generally very suitable candidates for this, even though there has been little systematic treatment of this aspect of synchronous languages so far. One argument is that synchronous languages naturally provide a timing granularity at the application level, the logical ticks that correspond to system reactions, and impose clear restrictions on what programs may do within these ticks. For example, Esterel has the rule that there cannot be instantaneous loops: within a loop body, each statically feasible path must contain at least one tick-delimiting instruction, and the compiler must be able to verify this. Another argument is that synchronous languages directly express reactive control flow, including concurrency, thus lowering the need for an operating system with unpredictable timing.
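The rule against instantaneous loops lends itself to a simple static check. The following is a minimal sketch over a toy statement tree; the `Stmt` class and its statement kinds are hypothetical simplifications introduced for illustration (traps, aborts, and signal-dependent infeasible paths are ignored), and this is not the check performed by any particular Esterel compiler.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stmt:
    # Toy statement kinds: "emit", "pause", "seq", "branch", "par", "loop".
    kind: str
    children: List["Stmt"] = field(default_factory=list)


def may_terminate_instantly(s: Stmt) -> bool:
    """True if some statically feasible path through s contains no pause."""
    if s.kind == "pause":
        return False          # tick-delimiting statement
    if s.kind == "emit":
        return True           # transient statement
    if s.kind in ("seq", "par"):
        # A sequence or a parallel is instantaneous only if all parts are.
        return all(may_terminate_instantly(c) for c in s.children)
    if s.kind == "branch":
        # Conservatively treat every branch as feasible.
        return any(may_terminate_instantly(c) for c in s.children)
    if s.kind == "loop":
        return may_terminate_instantly(s.children[0])
    raise ValueError(f"unknown statement kind: {s.kind}")


def find_instantaneous_loops(s: Stmt) -> List[Stmt]:
    """Collect all loops whose body may execute without a tick boundary."""
    found = []
    if s.kind == "loop" and may_terminate_instantly(s.children[0]):
        found.append(s)
    for c in s.children:
        found.extend(find_instantaneous_loops(c))
    return found


# loop (emit; pause) is fine; loop (branch: pause | emit) is rejected.
ok_loop = Stmt("loop", [Stmt("seq", [Stmt("emit"), Stmt("pause")])])
bad_loop = Stmt("loop", [Stmt("branch", [Stmt("pause"), Stmt("emit")])])
```

Because the check treats every branch as feasible, it may reject programs whose offending path can never actually be taken; it errs on the safe side.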

Logothetis et al. [26, 27] have employed model checking to perform a precise WCET analysis for the synchronous language Quartz, which is closely related to Esterel. However, their problem formulation was different from the WCRT analysis problem we are addressing. They were interested in computing the number of ticks required to perform a certain computation, such as a primality test, which we would actually consider to be a transformational system rather than a reactive system [28]. We here instead are interested in how long it may take to compute a single tick, which can be considered an orthogonal issue.

Ringler [29] considers the WCET analysis of C code generated from Esterel. However, his approach is only feasible for the generation of circuit code [13], which scales well for large applications, but tends to be slower than the simulation-based approach.

Li et al. [15] compute a WCRT of sequential Esterel programs directly on the source code. However, they did not address concurrency, and their source-level approach could not consider compiler optimizations. We perform the analysis on an intermediate level after the compilation, as a last step before the generation of assembler code. This also allows a finer analysis and decreases the time needed for the analysis.

One important problem that must be solved when performing WCRT analysis for Esterel is to determine whether a code segment is reachable instantaneously, or delayed, or both. This is related to the well-studied property of surface and depth of an Esterel program, that is, to determine whether a statement is instantaneously reachable or not, which is also important for schizophrenic Esterel programs [13]. This was addressed in detail by Tardieu and de Simone [30]. They also point out that an exact analysis of instantaneous reachability has NP complexity. We, however, are not only interested in whether a statement can be instantaneous, but also in whether it can be noninstantaneous.

3. Esterel, the Kiel Esterel Processor and the Concurrent KEP Assembler Graph

Next we give a short overview of Esterel and the KEP. We also introduce the CKAG, a graph-representation of Esterel, which is used both for the compilation and the WCRT analysis.

3.1. Esterel

The execution of an Esterel program is divided into logical instants, or ticks, and communication within or across threads occurs via signals. At each tick, a signal is either present (emitted) or absent (not emitted). Esterel statements are either transient, in which case they do not consume logical time, or delayed, in which case execution is finished for the current tick. By default, statements are transient, and these include for example , , , or the preemption operators. Delayed statements include , (nonimmediate) , and . Esterel's parallel operator, , groups statements in concurrently executed threads. The parallel terminates when all its branches have terminated.

Esterel offers two types of preemption constructs. An abortion kills its body when an abortion trigger occurs. We distinguish strong abortion, which kills its body immediately (at the beginning of a tick), and weak abortion, which lets its body receive control for a last time (abortion at the end of the tick). A suspension freezes the state of a body in the instant when the trigger event occurs.

Esterel also offers an exception handling mechanism via the statements. An exception is declared with a scope, and is thrown (raised) with an statement. An statement causes control flow to move to the end of the scope of the corresponding declaration. This is similar to a statement, however, there are further rules when traps are nested or when the trap scope includes concurrent threads. If one thread raises an exception and the corresponding trap scope includes concurrent threads, then the concurrent threads are weakly aborted; if concurrent threads execute multiple instructions in the same tick, the outermost trap takes priority.

3.1.1. Examples

As an example of a simple, nonconcurrent program consider the module shown in Figure 1(a). As the sample execution trace illustrates, the module emits signal in every instant, until it is aborted by the presence of the input signal . As this is a weak abortion, the abortion body gets to execute (emit ) one last time when it is aborted, followed by an emission of .

Figure 1: A sequential Esterel example. The body of the KEP assembler program (without interface declaration and initialization of the ) is annotated with line numbers , which are also used in the CKAG and in the trace to identify instructions. The trace shows for each tick the input and output signals that are present and the reaction time (), in instruction cycles.

The program shown in Figure 2(a) introduces concurrency: a thread that emits and then terminates, and a concurrent thread that emits , pauses for an instant, emits , and then terminates are executed in an infinite loop. During each loop iteration, the parallel terminates when both threads have terminated, after which the subsequent loop iteration is started instantaneously, that is, within the same tick.

Figure 2: A concurrent example program.

A slightly more involved example is the program [9, 10], shown in Figure 3(a). This program implements the following behavior: whenever the signal is present, (re-)start two concurrent threads. The first thread first awaits a signal ; it then continuously emits until is present, in which case it emits one last time (weak abortion of the ), emits , and terminates. The second thread tests every other tick for the presence of , in which case it emits .

Figure 3: The example [9].

3.1.2. Statement Dismantling

At the Esterel level, one distinguishes kernel statements and derived statements. The derived statements are basically syntactic sugar, built up from the kernel statements. In principle, any set of Esterel statements from which the remaining statements can be constructed can be considered a valid set of kernel statements, and the accepted set of Esterel kernel statements has evolved over time. For example, the statement used to be considered a kernel statement, but is now considered to be derived from and . We here adopt the definition of which statements are kernel statements from the v5 standard [31]. The process of expanding derived statements into equivalent, more primitive statements—which may or may not be kernel statements—is also called dismantling. The Esterel program (Figure 3(b)) is a dismantled version of . It is instructive to compare this program to the original, undismantled version.

3.2. The Kiel Esterel Processor

The instruction set architecture (ISA) of the KEP is very similar to the Esterel language. Part of the KEP instruction set is shown in Table 1; a complete description can be found elsewhere [32]. The KEP instruction set includes all kernel statements (see Section 3.1.2), and in addition some frequently used derived statements. The KEP ISA also includes valued signals, which cannot be reduced to kernel statements. The only parts of Esterel v5 that are not part of the KEP ISA are combined-signal handling and external-task handling, as they both seem to be used only rarely in practice. However, adding these capabilities to the KEP ISA seems relatively straightforward.

Table 1: Overview of the KEP instruction set architecture, the relation of its instructions to Esterel, and the number of processor cycles for the execution of each instruction.

Due to this direct mapping from Esterel to the KEP ISA, most Esterel statements can be executed in just one instruction cycle. For more complicated statements, well-known translations into kernel statements exist, allowing the KEP to execute arbitrary Esterel programs. The KEP assembler programs corresponding to and and sample traces are shown in Figures 1(c)-1(d) and 2(c)-2(d), respectively, and the KEP assembler program for is shown in Figure 3(c). Note that is executed for at least two consecutive ticks, and consumes an instruction cycle at each tick.

The KEP provides a configurable number of units, which detect whether a signal triggering a preemption is present and whether the program counter (PC) is in the corresponding preemption body [33]. Therefore, no additional instruction cycles are needed to test for preemption during each tick. Only upon entering a preemption scope are two cycles needed to initialize the , as for example the instruction in (Figure 1(c)). To aid readability, we here use the convention of subscripting KEP instructions with the line number where they occur.

To implement concurrency, the KEP employs a multithreaded architecture, where each thread has an independent program counter (PC) and threads are scheduled according to their statuses, thread ids, and dynamically changing priorities: among all active threads, the thread with the highest priority is scheduled. If there is more than one thread with this priority, the highest thread id wins. The scheduler is very lightweight. In the KEP, scheduling and context switching do not cost extra instruction cycles; only changing a thread's priority costs an instruction. The priority-based execution scheme on the one hand enforces an ordering among threads that obeys the constraints given by Esterel's semantics, and on the other hand avoids unnecessary context switches. If a thread lowers its priority during execution but still has the highest priority, it simply keeps executing.
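The selection rule described above can be sketched in a few lines. This is a minimal software illustration and not the KEP hardware logic; the `Thread` class and the `pick_next` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Thread:
    tid: int            # thread id
    priority: int       # current, dynamically changing priority
    active: bool = True


def pick_next(threads: Iterable[Thread]) -> Optional[Thread]:
    """Among the active threads, pick the highest priority; ties go to the highest id."""
    candidates = [t for t in threads if t.active]
    if not candidates:
        return None
    return max(candidates, key=lambda t: (t.priority, t.tid))


threads = [Thread(0, 1), Thread(1, 3), Thread(2, 4), Thread(3, 5, active=False)]

first = pick_next(threads)      # thread 2: highest priority among active threads
threads[2].priority = 3         # thread 2 lowers its priority to 3
second = pick_next(threads)     # still thread 2: it ties with thread 1 and has the higher id
threads[2].priority = 2         # lowering further hands control to thread 1
third = pick_next(threads)
```

The second call illustrates the last point of the paragraph: a thread that lowers its priority keeps running as long as no other active thread outranks it.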

A concurrent Esterel statement with concurrent threads joined by the -operator is translated into KEP assembler as follows. First, threads are forked by a series of instructions that consist of instructions and one instruction. Each instruction creates one thread, by assigning a nonnegative priority, a start address, and the thread id. The end address of the thread is either given implicitly by the start address specified in a subsequent instruction, or, if there are no more threads to be created, it is specified in a instruction. The code block for the last thread is followed by a instruction, which waits for the termination of all forked threads and concludes the concurrent statement. The example in Figure 2(c) illustrates this: instruction constitutes thread 1, thread 2 spans , and the remaining instructions belong to the main thread, which implicitly has id 0.

The priority of a thread is assigned when the thread is created (with the aforementioned instruction), and can be changed subsequently by executing a priority setting instruction (). A thread keeps its priority across delay instructions; that is, at the start of a tick it resumes execution with the priority it had at the end of the previous tick. This mechanism allows an arbitrary interleaving of thread execution for communicating among threads within the same logical tick. Therefore, a thread may be executed partially, then control may jump to another thread, and later return to the first thread, all within the same tick.

When a concurrent statement terminates, through regular termination of all concurrent threads or via an exception/abort, the priorities associated with the terminated threads also disappear, and the priority of the main thread is restored to the priority upon entering the concurrent statement.

The KEP contains a TickManager, which monitors how many instructions are executed in the current logical tick. To minimize jitter, a maximum number of instructions for each logical tick can be specified, via the “special” valued signal . If the current tick needs fewer instructions, the start of the next tick is delayed, making the maximum number of instructions the exact number of instructions. If the tick needs more instructions, an error-output is set. Hence a tight, but conservative upper bound on the maximal number of instructions for one tick, as computed by the WCRT analysis presented in Section 5, is of direct value for the KEP. See Li et al. [15] for details on the TickManager and the relation between the maximum number of instructions per logical tick and the physical timing constraints from the environment perspective.

Note that the KEP compiler by default computes a value for the WCRT and adds a corresponding assembler instruction that specifies a value for . However, the KEP does not require such a specification of . If is left unspecified, the processor “runs freely” and starts the next logical tick as soon as the current tick is finished. This lowers, on average, the reaction time, at the price of a possible jitter.
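The behavior of the TickManager, with and without a specified maximum, can be summarized in a small function. This is a minimal sketch under stated assumptions: the function name and signature are hypothetical, and the tick length reported in the overrun case is an assumption, since the text only states that an error output is set.

```python
from typing import Optional, Tuple


def tick_length(executed: int, max_instr: Optional[int]) -> Tuple[int, bool]:
    """Return (instruction cycles occupied by the tick, error flag).

    executed  -- instructions actually needed for this reaction
    max_instr -- configured maximum per tick, or None if left unspecified
    """
    if max_instr is None:
        # "Runs freely": the next tick starts as soon as this one is done.
        return executed, False
    if executed <= max_instr:
        # The start of the next tick is delayed, so every tick has the same length.
        return max_instr, False
    # Overrun: the error output is set. Reporting the executed count as the
    # tick length is an assumption made for this sketch.
    return executed, True


# A trace of three reactions against a bound of 10 instruction cycles.
trace = [tick_length(n, 10) for n in (4, 10, 12)]
```

With a safe bound from the WCRT analysis, the third case cannot occur, and all ticks take exactly the same number of cycles.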

3.3. The Concurrent KEP Assembler Graph

The CKAG is a directed graph composed of various types of nodes and edges to match KEP program behavior. It is used during compilation from Esterel to KEP assembler, for example for priority assignment, dead code elimination, further optimizations, and the WCRT analysis. The CKAG is generated from the Esterel program via a simple structural translation. The only nontrivial aspect is the determination of noninstantaneous paths, which is needed for certain edge types. For convenience, we label nodes with KEP instructions; however, we could alternatively have used Esterel instructions as well.

The CKAG distinguishes the following sets of nodes, see also Figure 4:

L: label nodes (ellipses);
T: transient nodes (rectangles), which include , , and so forth;
D: delay nodes (octagons), which correspond to delayed KEP instructions (, , , );
F: fork nodes (triangles), corresponding to ;
J: join nodes (inverted triangles), corresponding to ;
N: set of all nodes, with .

Figure 4: Nodes and edges of a concurrent KEP assembler graph.

We also define A: the abort nodes, which denote abortion scopes and correspond to and ; note that .

For each fork node (), we define n.join: the statement corresponding to (), and n.sub: the transitive closure of nodes in threads spawned by .

For abort nodes (), we define n.end: the end of the abort scope opened by , and n.scope: the nodes within 's abort scope.

A nontrivial task when defining the CKAG structure is to properly distinguish the different types of possible control flow, in particular with respect to their timing properties (instantaneous or delayed). We define the following types of successors for each :

: the control successors. These are the nodes that follow sequentially after , considering normal control flow without any abortions. For , includes the nodes corresponding to the beginnings of the forked threads. The successors are inserted statically, based on the syntax of the Esterel program; based on the actual behavior, some of these can be removed. If is the last node of a concurrent thread, includes the node for the corresponding , unless 's thread is instantaneous and has a (provably) noninstantaneous sibling thread. Furthermore, the control successors exclude those reached via a preemption (, ), unless is an immediate strong abortion node, in which case .

: the weak abort successors. If , this is the set of nodes to which control can be transferred immediately, that is, when entering at the end of a tick, via a weak abort; if exits a , then contains the end of the scope; otherwise it is . If and for some abort node , it is in case of a , or in case of a if there can (possibly) be a delay between and .

: the strong abort successors. If , these are the nodes to which control can be transferred after a delay, that is, when restarting at the beginning of a tick, via a strong abort; otherwise it is . If and for some strong abort node , it is . Note that this is not a delayed abort in the sense that an abort signal in one tick triggers the preemption in the next tick. Instead, this means that first a delay has to elapse, and the abort signal must be present at the next tick (relative to the tick when is entered) for the preemption to take place.

: the exit successors. These are the nodes that can be reached by raising an exception.

: the flow successors. This is the set .

For , we also define two kinds of fork abort successors. These serve to ensure a correct priority assignment to parent threads in case there is an abort out of a concurrent statement.

: the weak fork abort successors. This is the union of for all where there exists an instantaneous path from to . : the strong fork abort successors. This is the set .

In the graphical representation, control successors are shown by solid lines, all other successors by dashed lines, annotated with the kind of successor.

The CKAG is built from the Esterel source by recursively traversing its abstract syntax tree (AST), which is generated by the Columbia Esterel Compiler (CEC) [34]. Visiting an Esterel statement results in creating the corresponding CKAG node. A node typically contains exactly one statement, except label nodes containing just address labels and fork nodes containing one statement for each child thread initialization and a statement. When a delay node is created, additional preemption edges are added according to the abortion/exception context.

Note that some of the successor sets defined above cannot be determined precisely by the compiler, but have to be (conservatively) approximated instead. This applies in particular to those successor types that depend on the existence of an instantaneous path. Here it may be the case that for some pair of nodes there does not exist such an instantaneous path, but that the compiler is not able to determine that. In such cases, the compiler conservatively assumes that there may be such an instantaneous path. This is a common limitation of Esterel compilers, and compilers differ in their analysis capabilities here—see also Section 4.1.

4. The KEP Compiler

A central problem for compiling Esterel onto the KEP is the need to manage thread priorities during their creation and their further execution. In the KEP setting, this is not merely a question of efficiency or of meeting given deadlines, but a question of correct execution. Specifically, we have to schedule threads in such a fashion that all signal dependencies are obeyed. Such dependencies arise whenever a signal is possibly emitted and tested in the same tick; we must ensure that all potential emitters for a signal have executed before that signal is tested.

A consequence of Esterel's synchronous model of execution is that there may be dependency cycles, which involve concurrent threads communicating back and forth within one tick. Such dependency cycles must be broken, for example, by a delay node, because otherwise it would not be possible for the compiler to devise a valid execution schedule that obeys all ordering (causality) constraints. In the example (Figure 3(a)), there is one dependency cycle, from the instruction in the first parallel thread to the in the second parallel to the back to the , which is weakly aborted whenever is present. The dependency cycle is broken in the dismantled version, as there the has been separated into signal emission () and a delay (), enclosed in a loop. The broken dependency cycle can also be observed in the CKAG, shown in Figure 5. Referring to nodes by the corresponding line numbers (the “L” part of the node labels) in the KEP assembler code (Figure 3(c)), the cycle is L14 → L23 → L24 → L17 → L18 → L14; it is broken by the delay in L17.

Figure 5: The CKAG for the example from Figure 3(a). Dotted lines indicate dependencies (L14 → L23 and L24 → L17); the tail label “i” indicates that these are immediate dependencies (see Section 4.1). For the sake of compactness, label nodes have been incorporated into their (unique) successor nodes.

The priority assigned during the creation of a thread and by a particular instruction is fixed. Due to the nonlinear control flow, it is still possible that a given statement may be executed with varying priorities. In principle, the architecture would therefore allow a fully dynamic scheduling. However, we here assume that the given Esterel program can be executed with a statically determined schedule, which requires that there be no cyclic signal dependencies. This is a common restriction, imposed for example by the Esterel v7 [35] and the CEC compilers; see also Section 3.3. Note that there are also Esterel programs that are causally correct (constructive [1]), yet cannot be executed with a static schedule and hence cannot be directly translated into KEP assembler using the approach presented here. However, these programs can be transformed into equivalent, acyclic Esterel programs [36], which can then be translated into KEP assembler. Hence, the actual run-time schedule of a concurrent program running on the KEP is static in the following sense: if two statements that depend on each other, such as the emission of a certain signal and a test for the presence of that signal, are executed in the same logical tick, they are always executed in the same order relative to each other, and the priority of each statement is known in advance. However, the run-time schedule is dynamic in the sense that, due to the nonlinear control flow and the independent advancement of each program counter, it cannot in general be determined in advance which code fragments are executed at each tick. This means that the thread interleaving cannot be implemented with simple jump instructions. Instead, a run-time scheduling mechanism is needed that manages the interleaving according to the priority and actual program counter of each active thread.

To obtain a more general understanding of how the priority mechanism influences the order of execution, recall that at the start of each tick, all enabled threads are activated and are subsequently scheduled according to their priorities. Furthermore, each thread is assigned a priority upon its creation. Once a thread is created, its priority remains the same unless it changes its own priority with a priority-setting instruction, in which case it keeps that new priority until it executes yet another such instruction, and so on. Neither the scheduler nor other threads can change a thread's priority. Note also that a priority-setting instruction is considered instantaneous. The only noninstantaneous instructions, which delimit the logical ticks and are also referred to as delayed instructions, are the basic delay instruction and the instructions derived from it. This mechanism has a couple of implications.

(i) At the start of a tick, a thread is resumed with the priority corresponding to the last priority-setting instruction it executed during the preceding ticks, or with the priority it has been created with if it has not executed any such instruction. In particular, if we must set the priority of a thread to ensure that at the beginning of a tick the thread is resumed with a certain priority, it is not sufficient to execute a priority-setting instruction at the beginning of that tick; instead, we must already have executed that instruction in the preceding tick.

(ii) A thread is executed only if no other active thread has a higher priority. Once a thread is executing, it continues until a delayed statement is reached, or until its priority is lower than that of another active thread or equal to that of another thread with higher id. While a thread is executing, it is not possible for other inactive threads to become active; furthermore, while a thread is executing, it is not possible for other threads to change their priority. Hence, the only way for a thread's priority to become lower than that of other active threads is to execute a priority-setting instruction that lowers its priority below that of other active threads.
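The scheduling rule described above can be illustrated with a small simulation. The following Python sketch is not the KEP hardware scheduler; it is a toy model under stated assumptions: the `Thread` class, the `pick_next` and `run_tick` helpers, and the three pseudo-operations `"do"`, `"prio"`, and `"pause"` are hypothetical names introduced only for this illustration. It selects, before every instruction, the active thread with the highest priority and breaks ties in favor of the higher thread id.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    tid: int             # thread id; the higher id wins a priority tie
    prio: int            # current priority; only the thread itself changes it
    active: bool = True  # False once the thread reached a delayed statement


def pick_next(threads):
    """Return the active thread with the highest (priority, id), or None."""
    candidates = [t for t in threads if t.active]
    if not candidates:
        return None
    return max(candidates, key=lambda t: (t.prio, t.tid))


def run_tick(threads, programs):
    """Execute one logical tick and return the order of executed actions."""
    pc = {t.tid: 0 for t in threads}
    trace = []
    while (t := pick_next(threads)) is not None:
        op, arg = programs[t.tid][pc[t.tid]]
        pc[t.tid] += 1
        if op == "prio":      # thread changes its own priority (instantaneous)
            t.prio = arg
        elif op == "pause":   # delayed statement: thread is done for this tick
            t.active = False
        else:                 # any other instantaneous action
            trace.append((t.tid, arg))
    return trace


# Thread 2 starts with the higher priority, then lowers it so that the
# emitter in thread 1 runs before thread 2 tests the signal.
threads = [Thread(tid=1, prio=1), Thread(tid=2, prio=2)]
programs = {
    1: [("do", "emit A"), ("pause", None)],
    2: [("do", "x"), ("prio", 0), ("do", "test A"), ("pause", None)],
}
trace = run_tick(threads, programs)
```

In this model thread 2 keeps priority 0 after the tick ends, which mirrors implication (i): a thread is resumed with the priority it set last.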

4.1. Annotating the CKAG with Dependencies

In order to compute the thread priorities, we annotate the CKAG with additional information about already known priorities and dependencies. For all nodes n, we define n.prio: the priority that the thread executing n should be running with. For delay and fork nodes n, we also define n.prionext: the priority that the thread executing n should be resumed with in the subsequent tick.

We annotate each node n with the set of nodes that read a signal which is emitted by n. It turns out that, analogously to the distinction between n.prio and n.prionext, we must distinguish between dependencies that affect the current tick and the next tick: the dependency sinks with respect to n at the current tick (the immediate dependencies), and the dependency sinks with respect to n at the next tick (the delayed dependencies).

We here assume that the Esterel program given to our compiler has already been established to be causal (constructive), using one of the established constructiveness analysis procedures [20], as for example implemented in the Esterel v5 compiler. We therefore consider only dependencies that cross thread boundaries, as dependencies within a thread do not affect the scheduling. In other words, we assume that intrathread dependencies are already covered by control dependencies; if that were not the case, the program would not be causal and should be rejected. Should we not want to rely on a separate constructiveness analysis, we would have to consider intrathread dependencies as well.

In general, dependencies are immediate, meaning that they involve statements that are entered at the same tick. The exception is a dependency between the emission of a strong abort trigger signal and a corresponding delay node within the abort scope, as strong aborts affect control flow at the beginning of a tick and not at the end of a tick. In this case, the trigger signal (say, S) is not tested when the delay node (say, N) is entered, as the entering of N marks the end of a tick, and hence control would not even reach N if S was present. However, S is tested when N is restarted at the beginning of the next tick.

As already mentioned, we assume that the given program does not have cycles. However, what exactly constitutes a cycle in an Esterel program is not obvious, and to our knowledge there is no commonly accepted definition of cyclicity at the language level. The Esterel compilers that require acyclic programs differ in the programs they accept as “acyclic.” For example, the CEC accepts some programs that the v5 compiler rejects and vice versa [36], and a full discussion of this issue goes beyond the scope of this paper. Effectively, a program is considered cyclic if it is not (statically) schedulable—and compilers differ in their scheduling abilities. We here consider a program cyclic if the priority assignment algorithm presented in the next section fails. This results in the following definition, based on the CKAG.

Definition 1 (Program Cycle). An Esterel program is cyclic if the corresponding CKAG contains a path from a node to itself, where for each node and its successors along that path, and , the following holds:(1)

Note that some of the sets that this definition uses are conservatively approximated by the compiler, as already mentioned in Section 3.3. In other words, our compiler may detect spurious cycles and therefore reject a program even if it is causal. As we consider dependencies only if they cross thread boundaries, it appears that we can schedule more programs than other compilers typically can, and we did not encounter a scheduling problem with any of the tested programs. However, a systematic investigation of this issue is still open.

4.2. Computing Thread Priorities

The task of the priority algorithm is to compute a priority assignment that respects the Esterel semantics as well as the execution model of the KEP. The algorithm computes for each reachable node n in the CKAG the priority n.prio and, for delay and fork nodes, n.prionext. According to the Esterel semantics and the observations made in Section 3.3, a correct priority assignment must fulfill the following constraints, which apply to arbitrary nodes in the CKAG.

Constraint 1 (Dependencies). A thread executing a dependency source node must have a higher priority than the corresponding sink. Hence, for , it must be , and for , it must be .

Constraint 2 (Intratick Priority). Within a logical tick, a thread's priority cannot increase. Hence, for and , or and , or and , it must be .

Constraint 3 (Intertick Priority for Delay Nodes). To ensure that a thread resumes computation from some delay node with the correct priority, must hold for all .

Constraint 4 (Intertick Priority for Fork Nodes). To ensure that a main thread that has executed a fork node resumes computation—after termination of the forked threads—with the correct priority, must hold. Furthermore, must hold for all .

One could imagine an iterative approach for priority assignment, where all nodes are initially assigned a low priority and priorities are iteratively increased until all constraints are met. However, this would probably not be very efficient, and it would be difficult to validate its correctness and its termination. As it turns out, there is a better alternative. We can order the computations of all priorities such that when a specific priority value is computed, all the priorities that this value may depend on have already been computed. The algorithm shown in Figure 6 achieves this by performing recursive calls that traverse the CKAG in a specific manner.

Figure 6: Algorithm to compute priorities.

The algorithm starts in , which, after some initializations, in line 8 calls for all nodes that must yet be processed. This set of nodes, given by (for “Visited”), initially just contains the root of the CKAG. After has been computed for all reachable nodes in the CKAG, a loop computes for reachable delay/fork nodes that have not been computed yet.

first checks whether it has already computed . If not, it then checks for a recursive call to itself (lines 3/4, see also Lemma 1). The remainder of computes and, in case of delay and fork nodes, adds nodes to the list. Similarly computes .
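The general idea of ordering the priority computations by recursion, with a check for a recursive call to the same node, can be sketched as follows. This Python fragment is not the algorithm of Figure 6: it is a simplified illustration that ignores the next-tick priorities of delay and fork nodes and models only two kinds of constraints. The function name `assign_priorities`, the edge maps `ctrl` and `dep`, and the choice of 1 as the lowest priority are assumptions made for this sketch. Here `ctrl` is assumed to contain only instantaneous control edges, so edges leaving a delay node are left out.

```python
def assign_priorities(nodes, ctrl, dep):
    """Assign a priority to every node of a toy graph.

    ctrl[n]: instantaneous control successors of n (priority must not
             increase along these edges within a tick).
    dep[n]:  dependency sinks of n (n must run with a strictly higher
             priority than each sink).
    """
    prio = {}
    in_progress = set()  # nodes whose priority is currently being computed

    def get_prio(n):
        if n in prio:                 # already computed
            return prio[n]
        if n in in_progress:          # recursive call to itself: a cycle
            raise ValueError(f"cycle detected at node {n!r}")
        in_progress.add(n)
        p = 1                         # assumed lowest priority
        for m in dep.get(n, ()):      # source strictly above its sinks
            p = max(p, get_prio(m) + 1)
        for m in ctrl.get(n, ()):     # never below an instantaneous successor
            p = max(p, get_prio(m))
        in_progress.discard(n)
        prio[n] = p                   # each priority is assigned exactly once
        return p

    for n in nodes:
        get_prio(n)
    return prio


# Two threads: a2 emits a signal that b2 tests in the same tick.
nodes = ["a1", "a2", "b1", "b2"]
ctrl = {"a1": ["a2"], "b1": ["b2"]}
dep = {"a2": ["b2"]}
prios = assign_priorities(nodes, ctrl, dep)
```

Because every value is computed only after all values it depends on, and is never changed afterwards, each node and each edge is handled a bounded number of times in this sketch.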

Lemma 1 (Termination). For a valid, acyclic Esterel program, and terminate. Furthermore, they do not generate a “ ” error message.

Proof (Sketch). produces an error (line 4) if it has not computed yet (checked in line 2) but has already been called (line 3) earlier in the call chain. This means that it has called itself via one of the calls to or (via ). An inspection of the calling pattern yields that an acyclic program in the sense of Definition 1 cannot yield a cycle in the recursive call chain. Since the number of nodes is finite, both algorithms terminate.

Lemma 2 (Fulfillment of Constraints). For a valid, acyclic Esterel program, the priority assignment algorithm computes an assignment that fulfills Constraints 1–4.

Proof (Sketch). First observe that—apart from the initialization in —each is assigned only once. Hence, when returns the maximum of priorities for a given set of nodes, these priorities do not change any more. Therefore, the fulfillment of Constraint 1 can be deduced directly from . Similarly for Constraint 2. Analogously ensures that Constraints 3 and 4 are met.

Lemma 3 (Linearity). For a CKAG with N nodes and E edges, the computational complexity of the priority assignment algorithm is O(N + E).

Proof (Sketch). Apart from the initialization phase, which has cost , the cost of the algorithm is dominated by the recursive calls to . The total number of calls is bounded by . With an amortization argument, where the costs of each call are attributed to the callee, it is easy to see that the overall cost of the calls is .

Note also that while the size of the CKAG may be quadratic in the size of the corresponding Esterel program in the worst case, it is in practice (for a bounded abort nesting depth) linear in the size of the program, resulting in an algorithm complexity linear in the program size as well; see also the discussion in Section 6.2.

After priorities have been computed for each reachable node in the CKAG, we must generate code that ensures that each thread is executed with the computed priority. This task is relatively straightforward; Figure 7 shows the algorithm.

Figure 7: Algorithm to annotate code with priority settings according to CKAG node priorities.

Another issue is the computation of thread ids, as these are also considered in scheduling decisions in case there are multiple threads of highest priority. This property is exploited by the scheduling scheme presented here, to avoid needless cycles. The compiler assigns increasing ids to threads during a depth-first traversal of the thread hierarchy; this is required in certain cases to ensure proper termination of concurrent threads [4].

4.3. Optimizations

Prior to running the priority/scheduling algorithm discussed before, the compiler tries to eliminate as many dependencies as possible. It does so using two mechanisms. The first is a careful assignment of thread ids, as these are also used for scheduling decisions if there are multiple threads that have the highest priority (see Section 3.2). By considering dependencies between different threads, simple dependencies can be resolved without any explicit priority changes. The second mechanism is to determine whether two nodes connected via a dependency are executable within the same instant. This is in general a difficult problem to analyze. We here only consider the special case where two nodes share some (least common) fork node, and one node has only instantaneous paths from that fork node, and the other node has only noninstantaneous paths. In this case, the dependency can be safely removed.

To preserve the signal-dependencies in the execution, additional priority assignments ( statements) might have to be introduced by the compiler. To assure schedulability, the program is completely dismantled, that is, transformed into kernel statements. In this dismantled graph the priority assignments are inserted. A subsequent “undismantling” step before the computation of the WCRT detects specific patterns in the CKAG and collapses them to more complex instructions, such as or , which are also part of the KEP instruction set.

The KEP compiler performs a statement dismantling (see Section 3.1.2) as a preprocessing step. This facilitates code selection and also helps to eliminate spurious dependency cycles, and to hence increase the set of schedulable (accepted) programs, as already discussed in Section 4. After assigning priorities, the compiler tries again to “undismantle” compound statements whenever this is possible. This becomes apparent in the example; the (Figure 3(c)) is the undismantled equivalent of the lines 7–9 in (Figure 3(b)).

The compiler suppresses statements for the main thread, because the main thread never runs concurrently to other threads. In the example, this avoids a statement at label .

Furthermore, the compiler performs dead code elimination, also using the traversal results of the priority assignment algorithm. In the example, it determines that execution never reaches the infinite loop in line 32 of , because the second parallel thread never terminates normally, and therefore does not generate code for it.

However, there is still the potential for further optimizations, in particular regarding the priority assignment. In the program, one could for example hoist the out of the enclosing loop, and avoid this statement altogether by just starting thread with priority 2 and never changing it again. Even more effective would be to start with priority 3, which would allow to undismantle into a single .

5. Worst-Case Reaction Time Analysis

Given a KEP program, we define its WCRT as the maximum number of KEP cycles executable in one instant. Thus WCRT analysis requires finding the longest instantaneous path in the CKAG, where the length metric is the number of required KEP instruction cycles. We abstract from signal relationships and might therefore consider infeasible executions; as a result, the computed WCRT can be pessimistic. We first present, in Section 5.1, a restricted form of the WCRT algorithm that does not handle concurrency yet. The general algorithm requires an analysis of instant reachability between fork and join nodes, which is discussed in Section 5.2, followed by the presentation of the general WCRT algorithm in Section 5.3.

5.1. Sequential WCRT Algorithm

First we present a WCRT analysis of sequential CKAGs (no fork and join nodes). Consider again the example in Figure 1(a).

The longest possible execution occurs when the signal becomes present, as is the case in Tick 3 of the example trace shown in Figure 1(d). Since the abortion triggered by is weak, the abort body is still executed in this instant, which takes four instructions: , , the , and again. Then it is detected that the body has finished its execution for this instant, the abortion takes place, and and are executed. Hence the longest possible path takes six instruction cycles.

The sequential WCRT is computed via a depth-first search (DFS) traversal of the CKAG; see the algorithm in Figure 8. For each node n, a value n.inst is computed, which gives the WCRT from this node onward within the instant in which execution reaches the node. For a transient node, the WCRT is simply the maximum over all children plus its own execution time.

Figure 8: WCRT algorithm, restricted to sequential programs. The nodes of a CKAG are given by (see Section 3.3), g.root indicates the first KEP statement. () returns the number of instruction cycles to execute , see third column in Table 1.

For noninstantaneous delay nodes, we distinguish two cases within a tick: control can reach a delay node d, meaning that the thread executing d has already executed some other instructions in that tick, or control can start in d, meaning that d must have been reached in some preceding tick. In the first case, the WCRT from d onward within an instant is expressed by the variable d.inst already introduced. For the second case, an additional value d.next stores the WCRT from d onward within an instant; “next” here expresses that in the traversal done to analyze the overall WCRT, the value should not be included in the current tick, but in a next tick. Having these two values ensures that the algorithm terminates in the case of noninstantaneous loops: to compute d.next we might need the value d.inst.

For a delay node, we also have to take abortions into account. The handlers (i.e., their continuations, typically the end of an associated abort/trap scope) of weak abortions and exceptions are instantaneously reachable, so their WCRTs are added to the d.inst value. In contrast, the handlers of strong abortions cannot be executed in the same instant the delay node is reached, because according to the Esterel semantics an abortion body is not executed at all when the abortion takes place. On the KEP, when a strong abort takes place, the delay nodes where the control of the (still active) threads in the abortion body resides are executed once, and then control moves to the abortion handler. In other words, control cannot move from a delay node d to a (strong) abortion handler when control reaches d, but only when it starts in d. Therefore, the WCRT of the handler of a strong abortion is added to d.next, and not to d.inst.

We do not need to take a weak abortion into account for d.next, because it cannot contribute to a longest path. An abortion in an instant when a delay node is reached will always lead to a higher WCRT than an execution in a subsequent instant where a thread starts executing in the delay node.

The resulting WCRT for the whole program is computed as the maximum over all WCRTs of nodes where the execution may start. These are the start node and all delay nodes. To take into account that execution might start simultaneously in different concurrent threads, we also have to consider the next value of join nodes.
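The sequential computation described above can be sketched in a few lines of Python. This is not the algorithm of Figure 8 itself, only an illustration under stated assumptions: the dictionary layout (`kind`, `cycles`, `suc`, `weak`, `strong`), the function name `wcrt`, and the example graph are hypothetical, every instruction is assumed to cost one cycle, and a delay node is assumed to cost its cycle both when it is reached and when execution starts in it.

```python
from functools import lru_cache

# Hypothetical graph: a loop "pause; emit; goto pause" inside a weak abort
# whose handler executes "emit2; halt".
GRAPH = {
    "wabort": {"kind": "transient", "cycles": 1, "suc": ["pause"]},
    "pause": {"kind": "delay", "cycles": 1, "suc": ["emit"], "weak": ["emit2"]},
    "emit": {"kind": "transient", "cycles": 1, "suc": ["goto"]},
    "goto": {"kind": "transient", "cycles": 1, "suc": ["pause"]},
    "emit2": {"kind": "transient", "cycles": 1, "suc": ["halt"]},
    "halt": {"kind": "delay", "cycles": 1, "suc": []},
}


def wcrt(graph, root):
    """Longest instantaneous path, counted in instruction cycles."""

    @lru_cache(maxsize=None)
    def inst(n):
        # WCRT from n onward in the instant in which n is reached.
        node = graph[n]
        if node["kind"] == "delay":
            # The tick ends here, except for weak abort/exception handlers.
            followers = node.get("weak", [])
        else:
            followers = node.get("suc", [])
        return node["cycles"] + max((inst(m) for m in followers), default=0)

    def nxt(d):
        # WCRT from delay node d onward when the tick starts in d.
        node = graph[d]
        followers = node.get("suc", []) + node.get("strong", [])
        return node["cycles"] + max((inst(m) for m in followers), default=0)

    delays = [n for n, v in graph.items() if v["kind"] == "delay"]
    return max([inst(root)] + [nxt(d) for d in delays])


result = wcrt(GRAPH, "wabort")
```

The recursion terminates on the noninstantaneous loop because `inst` of a delay node never follows its control successors; only `nxt` does, and `nxt` is not recursive.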

Consider again the example in Figure 1. Each node in the CKAG is annotated with a label “” or, for a delay node, a label “.” In the following, we will refer to specific CKAG nodes with their corresponding KEP assembler line numbers . It is . The sequential WCRT computation starts initializing the and values of all nodes to (line 2 in , Figure 8). Then is called, which computes . The call to computes and returns , hence . Next, in line 4 of , we call , which computes . The call to computes and returns . Hence , which corresponds to the longest path triggered by the presence of signal , as we have seen earlier. The WCRT analysis therefore inserts an “, #6” instruction before the body of the KEP assembler program to initialize the accordingly, as can be seen in Figure 1(c).

5.2. Instantaneous Statement Reachability for Concurrent Esterel Programs

It is important for the WCRT analysis whether a join and its corresponding fork can be executed within the same instant. The algorithm for instantaneous statement reachability computes for a source and a target node whether the target is reachable instantaneously from the source. Source and target have to be in sequence to each other, that is, not concurrent, to get correct results.

In simple cases like or the sequential control flow successor is executed in the same instant respectively next instant, but in general the behavior is more complicated. The parallel, for example, will terminate instantaneously if all subthreads are instantaneous or an will be reached instantaneously; it is noninstantaneous if at least one subthread is not instantaneous.

The complete algorithm is presented in detail elsewhere [6]. The basic idea is to compute for each node three potential reachability properties: instantaneous, noninstantaneous, and exit-instantaneous. Note that a node might be both (potentially) instantaneous and (potentially) noninstantaneous, depending on the signal context. Computation begins by setting the instantaneous predicate of the source node to true and the properties of all other nodes to false. When any property is changed, the new value is propagated to the node's successors. Once one of the properties has been set to true, it is never set to false again. Hence the algorithm is monotonic and will terminate. Its complexity is determined by the number of property changes, which is bounded by three per node, so the complexity is .
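The monotonic propagation scheme can be illustrated with a reduced worklist sketch in Python. It is not the complete algorithm of [6]: it covers only transient and delay nodes, tracks only the instantaneous and noninstantaneous properties, and leaves out fork, join, and exit handling. The graph layout and the function name `reachability` are assumptions made for this illustration.

```python
from collections import deque


def reachability(graph, source):
    """Propagate 'inst' and 'noninst' reachability from source.

    graph[n] = {"delay": bool, "suc": [successor names]}
    """
    props = {n: {"inst": False, "noninst": False} for n in graph}
    props[source]["inst"] = True
    work = deque([source])
    while work:
        n = work.popleft()
        node = graph[n]
        for m in node["suc"]:
            if node["delay"]:
                # Whatever follows a delay is reached in a later tick.
                new = {"noninst": props[n]["inst"] or props[n]["noninst"]}
            else:
                # A transient node passes its own properties on unchanged.
                new = dict(props[n])
            changed = False
            for key, value in new.items():
                if value and not props[m][key]:
                    props[m][key] = True   # properties only go false -> true
                    changed = True
            if changed:
                work.append(m)
    return props


# "a" branches: one path goes through the delay "b", the other skips it.
g = {
    "a": {"delay": False, "suc": ["b", "c"]},
    "b": {"delay": True, "suc": ["c"]},
    "c": {"delay": False, "suc": []},
}
props = reachability(g, "a")
```

In this example node "c" ends up both potentially instantaneous and potentially noninstantaneous, which is the situation the text describes as depending on the signal context.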

The most complicated computation is the instantaneous property of a join node, because several conditions have to be fulfilled for it to be instantaneous:

(i) For each thread, there has to be a (potentially) instantaneous path to the join node.

(ii) The predecessor of the join node must not be an , because nodes are no real control flow predecessors.

At the Esterel level, an exception () causes control to jump directly to the corresponding exception handler (at the end of the corresponding trap scope); this jump may also cross thread boundaries, in which case all threads that contain the jump, up to the thread that contains the target of the jump, and all their sibling threads terminate. To reflect this at the KEP level, an instruction does not jump directly to the exception handler, but first executes the instructions on the way, to give them the opportunity to terminate threads correctly. If a is executed this way, the statements that are instantaneously reachable from it are not executed, but control instead moves on to the exception handler, or to another intermediate . To express this, we use the third property besides instantaneous and noninstantaneous: exit-instantaneous.

Roughly speaking, the instantaneous property is propagated via a universal quantifier (all threads must satisfy it), whereas the noninstantaneous and exit-instantaneous properties are propagated via an existential quantifier (one thread suffices).

Most other nodes simply propagate their own properties to their successors. The delay node propagates in addition its noninstantaneous predicate to its delayed successors and exit nodes propagate exit-instantaneous reachability, when they themselves are reachable instantaneously.

5.3. General WCRT Algorithm

The general algorithm, which can also handle concurrency, is shown in Figure 9. It emerges from the sequential algorithm that has been described in Section 5.1 by enhancing it with the ability to compute the WCRT of fork and join nodes. Note that the instantaneous of a join node is needed only by a fork node, all other transient nodes and delay nodes do not use this value for their WCRT. The WCRT of the join node has to be accounted for just once in the instantaneous WCRT of its corresponding fork node, which allows the use of a DFS-like algorithm.

Figure 9: General WCRT algorithm.

The instantaneous WCRT of a fork node is simply the sum of the instantaneously reachable statements of its subthreads, plus the statement for each subthread and the additional statement.

The join nodes, like delay nodes, also have a next value. When a fork-join pair could be noninstantaneous, we have to compute a WCRT for the next instants analogously to the delay nodes. Its computation first requires the computation of all subthread next WCRTs. Note that in case of nested concurrency these values can again result from a join node. But at the innermost level of concurrency the WCRT values all stem from delay nodes, which will be computed before the join next values. The delay next WCRT values are computed the same way as in the sequential case, except that only successors within the same thread are considered. We call successors in a different thread interthread-successors; their WCRT values are handled by the corresponding join node. The join next value is the maximum of all interthread-successor WCRT values and the sum of the maximum next value for every thread.

If the parallel does not terminate instantaneously, all directly reachable states are reachable in the next instant. Therefore we have to add the execution time for all statements that are instantaneously reachable from the join node.
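The arithmetic for fork and join nodes described in this section can be written down compactly. The following Python sketch is only one reading of the text, not the algorithm of Figure 9: the function names, the one-cycle cost assumed for each thread initialization statement and for the additional fork statement, and the way the join's instantaneous part is added to the per-thread sum are all assumptions made for illustration.

```python
def fork_inst(thread_inst, init_cost=1, extra_cost=1):
    """Instantaneous WCRT of a fork node.

    thread_inst: instantaneous WCRT of each subthread.
    init_cost:   assumed cycles for the per-thread initialization statement.
    extra_cost:  assumed cycles for the additional fork statement.
    """
    return sum(thread_inst) + init_cost * len(thread_inst) + extra_cost


def join_next(thread_next, interthread=(), join_inst=0):
    """Next-instant WCRT of a join node.

    thread_next: for each subthread, the next WCRTs of its delay (or
                 nested join) nodes.
    interthread: WCRTs of interthread-successors.
    join_inst:   cycles of the statements instantaneously reachable from
                 the join, added because the threads may all terminate in
                 the instant under consideration.
    """
    per_thread_sum = sum(max(values, default=0) for values in thread_next)
    return max([per_thread_sum + join_inst, *interthread])


# Hypothetical numbers for a fork with two subthreads.
f = fork_inst([3, 2])
j = join_next([[4, 2], [3]], interthread=[5], join_inst=2)
```

With these made-up inputs the fork costs 3 + 2 cycles for the thread bodies plus 3 cycles of fork overhead, and the join next value is dominated by the per-thread sum 4 + 3 plus the 2 cycles after the join.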

The whole algorithm computes first the next WCRT for all delay and join nodes; it computes recursively all needed values. Thereafter the instantaneous WCRT for all remaining nodes is computed. The result is simply the maximum over all computed values.

Consider the example in Figure 2(a). First we note that the fork/join pair is always noninstantaneous, due to the statement. We compute . From the fork node , the and statements, the instantaneous parts of both threads and the are executed, hence . It turns out that the WCRT of the program is . Note that the statement is executed twice.

A known difficulty when compiling Esterel programs is that due to the nesting of exceptions and concurrency, statements might be executed multiple times in one instant. This problem, also known as reincarnation, is handled correctly by our algorithm. Since we compute nested joins from inside to outside, the same statement may affect both the instantaneous and noninstantaneous WCRT, which are added up in the next join. This exactly matches the possible control flow in case of reincarnation. Even when a statement is executed multiple times in an instant, we compute a correct upper bound for the WCRT.

Regarding the complexity of the algorithm, we observe that for each node its WCRT's and are computed at most once, and for all fork nodes a fork-join reachability analysis is additionally made, which has itself . So we get altogether a complexity of = = .

5.4. Unreachable Paths

Signal information is not taken into account in the algorithms described above. This can lead to a conservative (too high) WCRT, because the analysis may consider unreachable paths that can never be executed. In Figure 10(a) we see an unreachable path that unnecessarily increases the WCRT: it requires a signal to be both present and absent in the same instant, which is inconsistent. Note that there is no dead code in the graph; there are only two possible paths once the signal predicates along each path are considered.

Figure 10: Unreachable path examples.

Figure 10(b) shows an unreachable parallel path that leads to a too high WCRT of the fork node, because the subpaths cannot be executed at the same time. Furthermore, the parallel is declared as possibly instantaneous, even though it is not. Therefore, all statements which are instantaneously reachable from the join node are also added.

Another unreachable parallel path is shown in Figure 10(c). This path is unreachable not because of signal information but because of instantaneous behavior: the maximal paths of the two threads are never executed in the same instant. In other words, the system is never in a configuration (collection of states) such that both code segments become activated together. Instead of taking the maximum next WCRT of each thread and summing these up, it would be more exact to sum, over all threads, the next WCRTs that can be executed in the same instant, and then to take the maximum of these sums. To do so, we would have to extend the reachability algorithm with the ability to determine how many ticks after one statement another statement could be executed. However, in this case the possible tick counts can become arbitrarily high for each node, so we would get a higher complexity and a termination problem. Our analysis is conservative in simply assuming that all concurrent paths may occur in the same instant, and that all can be executed in the same instant as the join.
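The difference between the conservative bound and the more exact alternative mentioned above can be made concrete with a small numerical example. The thread names, delay node names, cycle counts, and the set of jointly reachable configurations in the following Python sketch are invented for illustration and are not taken from Figure 10.

```python
# Hypothetical next WCRTs (in cycles) per delay node of two threads.
thread_next = {
    "T1": {"d1": 5, "d2": 1},
    "T2": {"d3": 1, "d4": 6},
}

# Assumed set of configurations that can actually occur together:
# T1 waits in d1 only while T2 waits in d3, and d2 only with d4.
reachable = [("d1", "d3"), ("d2", "d4")]

# Conservative analysis: per-thread maximum, then sum over the threads.
conservative = sum(max(values.values()) for values in thread_next.values())

# More exact alternative: sum per reachable configuration, then maximum.
exact = max(
    thread_next["T1"][a] + thread_next["T2"][b] for a, b in reachable
)
```

Here the conservative bound is 5 + 6 = 11 cycles, while no reachable configuration exceeds 1 + 6 = 7 cycles; the conservative value is safe but pessimistic, which is the trade-off the text accepts to keep the analysis simple and terminating.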

6. Experimental Results

To evaluate the compilation and WCRT analysis approach presented here, we have implemented a compiler for the KEP based on the CEC infrastructure [34]. We will discuss in turn our validation approach and the quantitative results for the compiler, specifically the priority assignment scheme, and for the WCRT estimation.

6.1. Validation

To validate the correctness of the compilation scheme, as well as of the KEP itself, we have collected a fairly substantial validation suite, currently containing some 500 Esterel programs. These include all benchmarks made available to us, such as the Estbench [37], and other programs written to test specific situations and corner cases. An automated regression procedure compiles each program into KEP assembler, downloads it into the KEP, provides an input trace for the program, and records the output at each step. This output is compared to the results obtained from running the same program on a workstation, using Esterel Studio.

For each program, any differences in the output traces between the KEP results and the workstation/Esterel Studio results are recorded. Furthermore, the average-case reaction time (ACRT) and WCRT for each program are measured. For these measurements, the KEP operates in “freely running” mode, that is, the tick length is left unspecified (see Section 3.2); the default would be to set the tick length according to the (conservatively) estimated WCRT, in which case the measured ACRT and WCRT values would both be equal to the estimated WCRT. At this point, the full benchmark suite runs through without any differences in output, and the analyzed WCRT is always safe, that is, never lower than the measured WCRT.
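The two checks described here, trace equality and safety of the estimate, can be sketched as below. This is not the actual regression tooling; the representation of a trace as a sequence of signal sets per tick and all function names are assumptions for illustration.

```python
from itertools import zip_longest
from typing import FrozenSet, Optional, Sequence

Tick = FrozenSet[str]  # assumed: the set of output signals emitted in one logical tick


def first_divergence(kep: Sequence[Tick], reference: Sequence[Tick]) -> Optional[int]:
    """Index of the first tick whose outputs differ, or None if the traces agree."""
    for index, (got, expected) in enumerate(zip_longest(kep, reference)):
        if got != expected:  # also catches traces of different length
            return index
    return None


def wcrt_is_safe(estimated_wcrt: int, measured_per_tick: Sequence[int]) -> bool:
    """The estimate is safe if no measured reaction time exceeds it (non-empty input assumed)."""
    return estimated_wcrt >= max(measured_per_tick)


# Hypothetical traces and measurements (instruction cycles per tick).
kep_trace = [frozenset({"A"}), frozenset(), frozenset({"B"})]
ref_trace = [frozenset({"A"}), frozenset(), frozenset({"B"})]
cycles = [5, 9, 12]
```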

Esterel Studio is also used to generate the input trace, using the “full transition coverage” mode. Note that the traces obtained this way still did not cover all possible paths. However, at this point we consider it very probable that a compilation approach that handles all transition coverage traces correctly would also handle the remaining paths. We also feel that this level of validation probably already exceeds the current state of the practice.

6.2. Compilation and Priority Assignment

As the emphasis here is more on the compilation approach and less on the underlying execution platform, we refrain from comparing execution times and code sizes on the KEP versus traditional, nonreactive platforms; such a comparison can be found elsewhere [4]. Instead, we are primarily interested in static code characteristics, and in particular in how well the priority assignment algorithm works. Table 2 summarizes the experimental results for a selection of programs taken from the Estbench.

Table 2: Experimental results for the compiler and priority assignment. For each benchmark it lists the lines of code (LoC) for the source code, the lines of generated KEP assembler, the number of dependencies, the maximal nesting depth of abort scopes, the maximal degree of concurrency, the number of generated priority-setting statements, the maximum priority of any thread, and the times for computing the priorities and for the overall compilation.

We note first that the generated code is very compact, and that the KEP assembler line count is comparable to that of the Esterel source. This primarily reflects the KEP ISA, which provides instructions that directly implement most of the Esterel statements. Furthermore, the relationship between source code and KEP assembler size (and CKAG size) seems fairly linear. We note that the connection between program size and number of (interthread) dependencies is rather loose; among the benchmarks, a smaller program can contain more than twice as many dependencies as a larger one. Next, we see that the maximal abort nesting depth tends to be limited; only in one case did it exceed three. The degree of concurrency again varied widely; not too surprisingly, it also influenced the required number of priority-setting statements (which potentially induce context switches). However, overall the number of generated priority-setting statements seems acceptable compared to overall code size, and there were cases where none were needed at all, despite several interthread dependencies. This reflects that the thread id assignment mechanism (see Section 4.3) is already fairly efficient in resolving dependencies. Similarly, the assigned priorities tended to be low in general; for none of the benchmarks did they exceed three. Finally, the priority assignment algorithm and the overall compilation are quite fast, generally in the millisecond range.

6.3. Accuracy of WCRT Analysis

As mentioned before, the WCRT analysis is implemented in the KEP compiler, and is used to automatically insert an instruction at the beginning of the program that sets the tick length, such that the reaction time is constant and as short as possible, without the TickManager ever raising a timing violation. As discussed in Section 6.1, we measured the maximal reaction times and compared them to the computed values. Figure 11 provides a qualitative comparison of estimated and measured WCRT and measured ACRT; more details are given in Table 3. We have never underestimated the WCRT, and our results are on average 22% too high, which we consider fairly tight compared to other reported WCET results [22]. For each program, the lines of code, the computed WCRT, and the measured WCRT with the resulting difference are given. We also give the average WCRT analysis time on a standard PC (AMD Athlon XP, 2.2 GHz, 512 KB Cache, 1 GB Main Memory); as the table indicates, the analysis takes only a couple of milliseconds.

Table 3: Detailed comparison of WCRT/ACRT times. The table lists the estimated and the measured WCRT, both in instruction cycles, together with their ratio, which indicates by how much our analysis overestimates the WCRT. It also lists the measured average-case reaction time (ACRT) and its ratio to the measured WCRT. Test cases and ticks are the number of different scenarios and logical ticks that were executed, respectively.

Figure 11: Estimated and measured worst- and average-case reaction times.

The table also compares the ACRT with the WCRT. The ACRT is on average about two thirds of the WCRT, which is relatively high compared to traditional architectures. In other words, the worst case on the KEP is not much worse than the average case, and padding the tick length according to the WCRT does not waste too many resources. By the same token, designing for worst-case performance, as typically must be done for hard real-time systems, does not cause too much overhead compared to a typical average-case performance design. Finally, the table also lists the number of scenarios generated by Esterel Studio and the accumulated logical tick count for the test traces.
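As a worked illustration of how such ratios are formed, the sketch below computes the relative overestimation of the WCRT and the ACRT-to-WCRT ratio from raw values. The numbers are hypothetical and are not taken from the benchmarks; the function names are likewise assumptions.

```python
from statistics import mean
from typing import List, Tuple

# Hypothetical data: (estimated WCRT, measured WCRT, measured ACRT), in instruction cycles.
RESULTS: List[Tuple[int, int, float]] = [(12, 10, 7.0), (30, 24, 15.0), (55, 50, 36.0)]


def overestimation(estimated: int, measured: int) -> float:
    """Relative amount by which the estimate exceeds the measured WCRT."""
    return (estimated - measured) / measured


def acrt_ratio(acrt: float, measured_wcrt: int) -> float:
    """Fraction of the measured WCRT that is used in the average case."""
    return acrt / measured_wcrt


mean_overestimation = mean(overestimation(e, m) for e, m, _ in RESULTS)
mean_acrt_ratio = mean(acrt_ratio(a, m) for _, m, a in RESULTS)
# An estimate is safe as long as it is never below the measured WCRT.
all_safe = all(e >= m for e, m, _ in RESULTS)
```

With the tick length padded to the WCRT, one minus the ACRT ratio gives the average share of each tick spent idle.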

7. Conclusions and Further Work

We have presented a compiler for the KEP, and its integrated WCRT analysis. Since the KEP ISA is very similar to Esterel, the compilation of most constructs is straightforward. However, the computation of priorities for concurrent threads is not trivial. The thread scheduling problem is related to the problem of generating statically scheduled code for sequential processors, for which Edwards has shown that finding efficient schedules is NP-hard [9]. We encounter the same complexity, even though our performance metrics for an optimal schedule are a little different. The classical scheduling problem tries to minimize the number of context switches. On the KEP, context switches are free, because no state variables must be stored and resumed. However, to ensure that a program meets its dependency-implied scheduling constraints, threads must manage their priorities accordingly, and it is this priority switching that contributes to code size and costs an extra instruction at run time. Minimizing priority switches is related to classical constraint-based optimization problems as well as to compiler optimization problems such as loop-invariant code motion.

We have also presented the WCRT analysis of Esterel programs. The restricted nature of Esterel and its sound mathematical semantics allow formal analysis of Esterel programs and make the computation of a WCRT for Esterel programs achievable. Our analysis is incorporated in the compiler and uses its internal graph representation, the concurrent KEP assembler graph (CKAG). In a first step we compute whether concurrent threads terminate instantaneously; thereafter we are able to compute for each statement how many instructions are maximally executable from it in one logical tick. The maximal value over all nodes gives us the WCRT of the program. The analysis considers concurrency and the multiple forms of preemption that Esterel offers. The asymptotic complexity of the WCRT analysis algorithm is quadratic in the size of the program; however, experimental results indicate that the overhead of WCRT analysis as part of compilation is negligible. We have implemented this analysis in the KEP compiler, and use it to automatically compute an initialization value for the KEP's TickManager. This makes it possible to achieve a high and constant response frequency to the environment, and can also be used to detect hardware errors by detecting timing overruns.
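The core idea, computing for each node the maximal number of instructions executable from it within one logical tick and taking the maximum, can be sketched for the purely sequential case as follows. This is a deliberately simplified illustration and not the CKAG-based algorithm: it ignores concurrency and preemption, assumes that every cycle passes through a delay node, and assumes that a delay node is counted both in the tick that reaches it and in the tick that resumes from it. The graph, the costs, and all names are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List, Set

# Hypothetical sequential program graph; "p" is a delay node (for example a pause).
SUCC: Dict[str, List[str]] = {
    "n0": ["n1"], "n1": ["n2", "n3"], "n2": ["p"], "n3": ["p"], "p": ["n1"],
}
COST: Dict[str, int] = {"n0": 3, "n1": 1, "n2": 4, "n3": 2, "p": 1}
DELAY: Set[str] = {"p"}


@lru_cache(maxsize=None)
def max_in_tick(node: str) -> int:
    """Maximal instruction count when control reaches node within a running tick."""
    if node in DELAY:
        return COST[node]  # control stops here until the next tick
    return COST[node] + max((max_in_tick(s) for s in SUCC[node]), default=0)


def resume(node: str) -> int:
    """Maximal instruction count of a tick that starts at the delay node."""
    return COST[node] + max((max_in_tick(s) for s in SUCC[node]), default=0)


def wcrt(start: str) -> int:
    """Maximum over all points at which a tick can begin."""
    return max([max_in_tick(start)] + [resume(d) for d in DELAY])
```

Memoization ensures that each node is evaluated at most once, which mirrors the observation that per-node values need not be recomputed.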

Our analysis is safe, that is, conservative in that it never underestimates the WCRT, and it does not require any user annotations to the program. In our benchmarks, it overestimates the WCRT on average by about 22%. This is already competitive with the state of the art in general WCET analysis, and we expect this to be acceptable in most cases. However, there is still significant room for improvement. So far, we are not taking any signal status into account; therefore, our analysis includes some unreachable paths. Considering all signals would lead to an exponential growth of the complexity, but some local knowledge should be enough to rule out most unreachable paths of this kind. Also, a finer-grained analysis of which parts of parallel threads can be executed in the same instant could lead to better results. However, it is not obvious how to do this efficiently.

Our analysis is influenced by the KEP in two ways: the exact number of instructions for each statement and the way parallelism is handled. At least for nonparallel programs our approach should be of value for other compilation methods for Esterel as well, for example, simulation-based code generation. A virtual machine with similar support for concurrency could also benefit from our approach. We would also like to generalize our approach to handle different ways to implement concurrency. A WCRT analysis directly on the Esterel level gives information on the longest possible execution path. Together with a known translation to C, this WCRT information could be combined with a traditional WCET analysis, which takes caches and other hardware details into account.

To conclude, while we think that the approaches for compilation and WCRT analysis presented here are another step towards making reactive processing attractive, there are still numerous paths to be investigated, including the application of these results to classical software synthesis. A further issue, which we have not investigated here at all, is to formalize the semantics of reactive ISAs. This would help to deepen the understanding of reactive processing platforms, and could open the door towards formal correctness proofs down to the execution platform. As the ISA provided by the KEP allows the execution of programs that are not constructive in the classical sense (such as signal emissions after the signals are tested), and yet have a well-defined outcome (i.e., are deterministic), we also envision that this could ultimately lead towards new, interesting synchronous models of computation.

References

  1. G. Berry, “The foundations of Esterel,” in Proof, Language and Interaction: Essays in Honour of Robin Milner, G. Plotkin, C. Stirling, and M. Tofte, Eds., MIT Press, Cambridge, Mass, USA, 2000.
  2. A. Benveniste, P. Caspi, S. A. Edwards, N. Halbwachs, P. Le Guernic, and R. de Simone, “The synchronous languages 12 years later,” Proceedings of the IEEE, vol. 91, no. 1, pp. 64–83, 2003.
  3. R. von Hanxleden, X. Li, P. Roop, Z. Salcic, and L. H. Yoong, “Reactive processing for reactive systems,” ERCIM News, no. 66, pp. 28–29, October 2006.
  4. X. Li, M. Boldt, and R. von Hanxleden, “Mapping Esterel onto a multi-threaded embedded processor,” in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '06), pp. 303–314, San Jose, Calif, USA, October 2006.
  5. M. Boldt, C. Traulsen, and R. von Hanxleden, “Worst case reaction time analysis of concurrent reactive programs,” in Proceedings of the Workshop on Model-Driven High-level Programming of Embedded Systems (SLA++P07), Braga, Portugal, March 2007.
  6. M. Boldt, Worst-case reaction time analysis for the KEP3, Study thesis, Department of Computer Science, Christian-Albrechts-Universität zu Kiel, Kiel, Germany, May 2007, http://rtsys.informatik.uni-kiel.de/~biblio/downloads/theses/mabo-st.pdf.
  7. M. Boldt, A compiler for the Kiel Esterel Processor, Diploma thesis, Department of Computer Science, Christian-Albrechts-Universität zu Kiel, Kiel, Germany, December 2007, http://rtsys.informatik.uni-kiel.de/~biblio/downloads/theses/mabo-dt.pdf.
  8. D. Potop-Butucaru, S. A. Edwards, and G. Berry, Compiling Esterel, Springer, New York, NY, USA, 2007.
  9. S. A. Edwards, “An Esterel compiler for large control-dominated systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 21, no. 2, pp. 169–183, 2002.
  10. E. Closse, M. Poize, J. Pulou, P. Venier, and D. Weil, “SAXO-RT: interpreting Esterel semantic on a sequential execution structure,” Electronic Notes in Theoretical Computer Science, vol. 65, no. 5, pp. 80–94, 2002.
  11. D. Potop-Butucaru and R. de Simone, “Optimization for faster execution of Esterel programs,” in Formal Methods and Models for System Design: A System Level Perspective, pp. 285–315, Kluwer Academic Publishers, Norwell, Mass, USA, 2004.
  12. S. A. Edwards and J. Zeng, “Code generation in the Columbia Esterel compiler,” EURASIP Journal on Embedded Systems, vol. 2007, Article ID 52651, 31 pages, 2007.
  13. G. Berry, “The constructive semantics of pure Esterel,” draft book, 1999, ftp://ftp-sop.inria.fr/esterel/pub/papers/constructiveness3.ps.
  14. P. S. Roop, Z. Salcic, and M. W. S. Dayaratne, “Towards direct execution of Esterel programs on reactive processors,” in Proceedings of the 4th ACM International Conference on Embedded Software (EMSOFT '04), pp. 240–248, Pisa, Italy, September 2004.
  15. X. Li, J. Lukoschus, M. Boldt, M. Harder, and R. von Hanxleden, “An Esterel processor with full preemption support and its worst reaction time analysis,” in Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '05), pp. 225–236, ACM Press, San Francisco, Calif, USA, September 2005.
  16. L. H. Yoong, P. Roop, Z. Salcic, and F. Gruian, “Compiling Esterel for distributed execution,” in Proceedings of the International Workshop on Synchronous Languages, Applications, and Programming (SLAP '06), Vienna, Austria, March 2006.
  17. S. Yuan, S. Andalam, L. H. Yoong, P. S. Roop, and Z. Salcic, “Starpro—a new multithreaded direct execution platform for esterel,” in Proceedings of the Model Driven High-Level Programming of Embedded Systems Workshop (ETAPS '08), Budapest, Hungary, April 2008.
  18. B. Plummer, M. Khajanchi, and S. A. Edwards, “An Esterel virtual machine for embedded systems,” in Proceedings of International Workshop on Synchronous Languages, Applications, and Programming (SLAP '06), Vienna, Austria, March 2006.
  19. O. Tardieu and S. A. Edwards, “Approximate reachability for dead code elimination in Esterel,” in Proceedings of the 3rd International Symposium on Automated Technology for Verification and Analysis (ATVA '05), pp. 323–337, Taipei, Taiwan, October 2005.
  20. T. R. Shiple, G. Berry, and H. Touati, “Constructive analysis of cyclic circuits,” in Proceedings of the International Design and Test Conference (ITDC '96), pp. 328–333, Paris, France, March 1996.
  21. P. Puschner and A. Burns, “A review of worst-case execution-time analysis,” Real-Time Systems, vol. 18, no. 2-3, pp. 115–128, 2000.
  22. R. Wilhelm, J. Engblom, A. Ermedahl, et al., “The determination of worst-case execution times-overview of the methods and survey of tools,” ACM Transactions on Embedded Computing Systems (TECS), vol. 7, no. 3, 2008.
  23. S. Malik, M. Martonosi, and Y.-T. S. Li, “Static timing analysis of embedded software,” in Proceedings of the 34th Annual Conference on Design Automation (DAC '97), pp. 147–152, ACM Press, Anaheim, Calif, USA, June 1997.
  24. A. Burns and S. Edgar, “Predicting computation time for advanced processor architectures,” in Proceedings of the 12th Euromicro Conference on Real-Time Systems (Euromicro-RTS '00), pp. 89–96, Stockholm, Sweden, June 2000.
  25. C. Berg, J. Engblom, and R. Wilhelm, “Requirements for and design of a processor with predictable timing,” in Perspectives Workshop: Design of Systems with Predictable Behaviour, L. Thiele and R. Wilhelm, Eds., vol. 03471 of Dagstuhl Seminar Proceedings, Internationales Begegnungs-und Forschungszentrum für Informatik, Schloss Dagstuhl, Germany, 2004.
  26. G. Logothetis and K. Schneider, “Exact high level WCET analysis of synchronous programs by symbolic state space exploration,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE '03), pp. 196–203, IEEE Computer Society, Munich, Germany, March 2003.
  27. G. Logothetis, K. Schneider, and C. Metzler, “Exact low-level runtime analysis of synchronous programs for formal verification of real-time systems,” in Forum on Design Languages, Kluwer Academic Publishers, Frankfurt, Germany, 2003.
  28. D. Harel and A. Pnueli, “On the development of reactive systems,” in Logics and Models of Concurrent Systems, pp. 477–498, Springer, New York, NY, USA, 1985.
  29. T. Ringler, “Static worst-case execution time analysis of synchronous programs,” in Proceedings of the 5th Ada-Europe International Conference on Reliable Software Technologies (Ada-Europe '00), pp. 56–68, Potsdam, Germany, June 2000.
  30. O. Tardieu and R. de Simone, “Instantaneous termination in pure Esterel,” in Proceedings of the 10th International Symposium on Static Analysis Symposium (SAC '03), p. 1073, San Diego, Calif, USA, June 2003.
  31. G. Berry, “The Esterel v5 Language Primer, Version v5_91,” Centre de Mathématiques Appliquées Ecole des Mines and INRIA, 06565 Sophia-Antipolis, 2000, ftp://ftp-sop.inria.fr/esterel/pub/papers/primer.pdf.
  32. X. Li, The Kiel Esterel processor: a multi-threaded reactive processor, Ph.D. thesis, Christian-Albrechts-Universität zu Kiel, Faculty of Engineering, Kiel, Germany, July 2007, http://eldiss.uni-kiel.de/macau/receive/dissertation_diss_00002198.
  33. X. Li and R. von Hanxleden, “A concurrent reactive Esterel processor based on multi-threading,” in Proceedings of the 21st ACM Symposium on Applied Computing (SAC '06), vol. 1, pp. 912–917, Dijon, France, April 2006.
  34. S. A. Edwards, “CEC: the Columbia Esterel compiler,” 2006, http://www1.cs.columbia.edu/~sedwards/cec/.
  35. Esterel Technologies, Company homepage, http://www.esterel-technologies.com/.
  36. J. Lukoschus and R. von Hanxleden, “Removing cycles in Esterel programs,” EURASIP Journal on Embedded Systems, vol. 2007, Article ID 48979, 23 pages, 2007.
  37. Estbench Esterel Benchmark Suite, 2007, http://www1.cs.columbia.edu/~sedwards/software/estbench-1.0.tar.gz.