Abstract

Industrial communication networks like the Controller Area Network (CAN) are often required to operate reliably in harsh environments which expose the communication network to random errors. Probabilistic schedulability analysis can employ rich stochastic error models to capture random error behaviors, but this is most often at the expense of increased analysis complexity. In this paper, an efficient method (of time complexity ) to bound the message deadline failure probabilities for an industrial CAN network consisting of periodic/sporadic message transmissions is proposed. The paper develops bounds for Deadline Minus Jitter Monotonic (DMJM) and Earliest Deadline First (EDF) message scheduling techniques. Both random errors and random bursts of errors can be included in the model. Stochastic simulations and a case study considering DMJM and EDF scheduling of an automotive benchmark message set provide validation of the technique and highlight its application.

1. Introduction

Real-time industrial communication networks such as Controller Area Network (CAN) are often required to operate in harsh environments, where they may be subject to environmental hazards such as electromagnetic interference (EMI) and other forms of mechanical/electrical stresses. Exposure to hazards such as this can induce random errors into a system, which, if left uncorrected, may result in system failures [1, 2]. This paper is concerned with probabilistic schedulability analysis of the communications in real-time industrial CAN networks which are scheduled by a priority-driven algorithm in the presence of transient and/or intermittent errors. CAN is a multimaster, differential serial bus using NRZ encoding at the physical layer and is often employed for distributed real-time control applications. The full CAN protocol description may be found in [3]. Although in its native form CAN supports fixed priority scheduling, simple protocols operating at the node level have previously been developed to enable distributed Earliest Deadline First (EDF) scheduling with CAN [4, 5]. As such, the paper focuses upon both fixed priority (the Deadline Minus Jitter Monotonic (DMJM) algorithm) and dynamic priority (EDF) scheduling methods, the latter of which can be enforced at the CAN application layer, due to their known optimality in a networked environment under a wide variety of operating configurations [6, 7]. Specifically, the paper is concerned with the fast calculation of tight bounds on the probability that a deadline will be missed when error arrivals cause messages to be aborted and subsequently rescheduled for transmission after the transmission of an error message.

Such retransmission is a form of redundancy which requires some temporal “slack capacity” in the message schedule; how much slack is required to be allocated depends upon many factors including the level of criticality in the service the system provides, the message set parameters and scheduling algorithm, and also the nature of the error detection and correction mechanisms employed by the system. If insufficient slack is employed by a system to tolerate the effects of the errors it experiences, then aborted messages will not be processed or delivered correctly before their deadlines and system failures may occur. Since error arrivals are random in nature, then no 100% guarantees of timeliness can be given, and probabilistic guarantees are instead sought. In this respect, the principal contribution of the paper is the derivation of an efficient means to obtain tight bounds on the probability of deadline failure when errors occur randomly (with geometrically distributed interarrivals) and possibly in random bursts (with geometrically distributed burst interarrival and length), and aborted messages are immediately requeued for transmission.

Although previous work related to probabilistic schedulability analysis for wired networks such as CAN and also wireless networks has employed rich stochastic error models to capture these random behaviors, to date this has (mostly) been at the expense of increased analysis complexity. In this paper, some recent results on probabilistic real-time error models and schedulability analysis in [1, 8] are extended, and an efficient method to bound the deadline failure probabilities for a set of periodic/sporadic messages transmitted in a real-time industrial communication network is proposed. The procedure first carries out an analysis of the available slack in the DMJM or EDF schedule and uses this information, in conjunction with knowledge of the environmental error characteristics, to determine the probability that the slack will be exceeded by the extra load induced by errors. The analysis may be performed in time. Stochastic simulations and an example related to scheduling a benchmark set of messages on an automotive CAN network are used to illustrate the technique.

The motivation for this bounding method arises principally due to the need for efficient dependability-aware online admission or Quality of Service (QoS) controls in flexible networks (e.g., for automotive applications). In addition, motivation arises from the need for techniques which can provide designers with methods whereby the impacts of different design options (such as the choice of scheduling algorithm and configuration of message parameters and network bandwidth) upon system reliability may be quickly explored at early stages of a design. The remainder of the paper is organized as follows. Section 2 presents a brief summary of related work to contextualize the paper. Sections 3 and 4 present the network and error models, respectively. The proposed technique is outlined in general terms in Section 5 and is applied to DMJM and EDF scheduling in Section 6. Section 7 presents stochastic simulations to validate the proposals, and Section 8 describes a detailed example based upon a benchmark set of messages for an automotive network. Section 9 concludes the paper.

Although this paper is principally concerned with industrial CAN networks, related work on CPU and communications networks in general is included due to similarities in both the error and task models which are employed. As argued in previous works [1, 8], error models employed in real-time schedulability analysis (for both CPU and wired/wireless network message scheduling) typically assume either (i) a pseudoperiodic arrival of errors or (ii) the fact that some fixed number of errors will be experienced over a known time interval (e.g., the major cycle of the system operation) or (iii) error arrivals being purely random, typically as a result of a (possibly compound or Markovian) binomial or Poisson process.

Although approach (i) is relatively straightforward to incorporate into an existing schedulability analysis, in many cases it does not effectively capture either randomness (with exceptions, e.g., [9]) or bursty characteristics. Approach (ii), on the other hand, seems well suited to bursty characteristics: the work of [10] presents an exact analysis for CPU task scheduling with EDF in the case where the errors arrive during the system’s major cycle. However the assumption that not more than one burst of errors will arrive over this known time period implies a deterministic model, which seems inappropriate since errors are due to random noise and interference. In addition, if errors arrive in some time interval of length proportional to the smallest relative deadline of any task as per [10], then it again seems unjustified to assume that only errors will arrive in some proportionally much larger time interval (e.g., the major cycle). In [11], a similar method to [10] is employed to analyze the timing properties of wireless channels with bounded retransmissions; however the number of retransmissions can be specified independently for each wireless message and the length of time is made proportional to the message relative deadline. Although this seems a much improved model to use, there is no attempt made to link the retransmission bounds to environmental error models and/or the required reliability of message delivery. Approach (iii) is the one most generally taken for the analysis of distributed systems such as CAN, and the error models employed in these works are typically much richer than those employed for CPU schedulability analysis (although, as noted, the scope of these models is not restricted to the networked environment) [1, 1215].

Marques et al. analyze the use of servers for the retransmission of messages lost due to errors in time triggered CAN networks [16]. The obtained results indicated that the choice of server and rescheduling policy has a significant influence upon the results, and different choices are appropriate depending upon the metric of interest. The choice of a deadline-based retransmission server was found to minimize the number of lost transmissions (deadline failures). Gujarati and Brandenburg [17] describe a procedure to bound the Failures in Time (FIT) rate for a CAN network experiencing network interference and node failures. In [16, 17], error arrivals were assumed to follow a Poisson distribution.

A recent overview of response-time analysis techniques for CAN is provided in [18], for both the deterministic and probabilistic cases. Although these previous works have examined probabilistic (bursty/sporadic/intermittent) errors and also deterministic errors within combined reliability and schedulability analysis frameworks, as noted in [1, 8, 18], one complication of some of these methods and of the use of probabilistic models in general is the potential complexity of implementation. Typically, an iterative procedure is needed to produce either a probability distribution of response times or a breakdown reliability beyond which point one or more tasks or messages will miss a deadline. This complexity generally makes the use of probabilistic methods impractical where efficiency is required. Some work to help address this issue was presented in [1, 8], where a rich error model was employed to capture the effects of random errors and bursts of errors and simple closed-form expressions were developed to bound the number of error arrivals in a given interval of time given knowledge of the environmental conditions and desired reliability of the system.

A second drawback in almost all previous analysis is an implicit assumption that each and every error arrival impacts the schedule by inducing a retransmission of a worst-case frame [18]. This is pessimistic, especially for CAN networks which have early error detection and signaling. The pessimism is compounded for bursts of errors, in which multiple errors arrive with only potentially very small temporal separation; it is almost impossible for each error to induce a worst-case frame length. For bursts of errors, a more likely scenario seems that the first error in the burst sequence induces a frame retransmission, where subsequent errors in the burst sequence delay its starting time until the burst subsides. This is not taken into account in [1] or [8], or in any of the methods described in [18]. In the current work, efforts are made to improve the analysis in an attempt to address this issue, whilst retaining most aspects of the simplicity of application.

3. CAN Network Model

In the analysis that follows, it is taken that time is discrete (one time unit will typically, but not necessarily, correspond to one network bit-time) and is indexed by a nonnegative integer variable . It is assumed that the system consists of a number of distributed nodes, which share a communications channel to exchange real-time messages. A standard model for a shared real-time communication system is adopted to represent the CAN network in that the system to be implemented can be represented by a set of messages . Each message is represented by 4-tuple:

in which is the message period/interarrival, is the worst-case transmission time of any instance of the message, is the message relative deadline, and is the jitter (the worst-case time that may elapse between an (external or internal) event occurring initiating a message transmission in a distributed node and it being released to the network for transmission). Jitter is typically induced by variation in the latencies of distributed node event handlers. For CAN messages, either standard (11 bits) or extended (29 bits) identifiers can be used; as is well-known (e.g., [19]), for a message with Data Length Code , the total number of bits for a message with an 11-bit identifier is given by  DLC. For 29-bit identifiers,  DLC. For CAN networks, the worst-case length of an error frame bits.

The utilization of an individual message is given by and represents the fraction of time the network will be occupied processing the frames generated from the message over its lifetime. Successive frame arrivals from sporadic messages are invoked by both internal and external events (typically hardware or software interrupts on their host node) and are always separated by at least units of time; frame arrivals from periodic messages are always separated by exactly time units on their host node and are invoked by a logical timer, which may possibly be synchronized to a distributed clock. It is assumed that worst-case clock synchronization errors between any host nodes can be incorporated into these jitters and the synchronization protocol messages modeled as regular network traffic.

When a frame of message arrives (becomes ready) at some time , its absolute deadline is set at time and the scheduling procedure must allocate units of network time to process the job in the interval [, ); otherwise a deadline miss will occur. Both fixed priority (DMJM) and dynamic priority (EDF) scheduling procedures are considered in this paper, covering a wide variety of wired and wireless industrial network protocols and protocol extensions. It must be cautioned at this point that fixed priority and EDF scheduling, aside from a limited number of specific cases, are seldom supported directly at the MAC level. A drawback is that their use inevitably requires overlay protocols leading to the introduction of low-level overheads; however, the many recognized benefits (such as enabling higher network utilization and facilitating easier temporal analysis) oftentimes outweigh this drawback. CAN in its native form supports fixed priority scheduling, and hence DMJM is supported by design. Descriptions of overlay protocols enabling priority-based scheduling in Flexible Time Triggered (FTT) master-slave systems for both switched Ethernet and wireless networks can be found in [20, 21]. Extensions to enable EDF scheduling in wireless and wired networks using IEEE 802.15.4, Bluetooth, and CAN are discussed in [4, 5, 11, 22], respectively. Prospective system designers considering the use of any overlay protocol should be sure that the benefits outweigh the overheads for their particular application. Consideration of such analysis is outside the scope of the current paper.

Although the techniques described in this paper can be applied (in principle) to a message set with arbitrary deadlines, the main focus is on constrained deadline messages, in which . According to previous work, it is known that the worst-case arrival pattern of the message frames in terms of network load is the one in which all frames experience their worst-case jitter simultaneously, such that they are aligned in a synchronous release pattern, and thereafter the frames arrive at their maximum allowed rates [7, 23]. As most real-time networks inherently support nonpreemptive frame transmissions, worst-case blocking due to priority inversion is also required to be accounted for in the timing and schedulability analysis. Henceforth, for convenience it is assumed that the messages are sorted in order of nondecreasing relative deadline minus jitter, that is, for any two messages and , if then , and that the total network utilization is not greater than unity. It is also assumed that, under EDF scheduling, deadline ties are broken by lowest message index.

4. Error Model

To incorporate the effects of frame errors and retransmissions into a timing analysis, it is necessary to first state some assumptions regarding the effects of errors. As mentioned in the previous section, in almost all previous works it has been assumed that all errors manifest themselves in such a way that the last bit of the longest valid message frame from the (sub)set of messages currently under analysis is repeatedly corrupted by errors, forcing retransmission of the entire message. In this paper, this basic assumption is relaxed since the basic operation of the CAN protocol is intended to support early detection and signaling of many types of errors and early abandonment of corrupt transmissions, more so than similar serial protocols [19]. At this point it is worth recalling from the CAN protocol definition two important points related to error detection: (i) the probability of an undetected error is vanishingly small (of the order , where is the message error rate) and (ii) in addition to CRC failures, instantaneous bit errors, bit stuffing errors, and form errors may be detected and signaled at any point in a frame transmission by any node (whether transmitter or receiver) [3].

Although in reality the probability of undetected errors is higher than specified in the CAN protocol due to interactions between bit stuffing and the CRC [24] and the fact that inconsistent omissions can occur, let us proceed under the assumption that for detected errors there is a uniform probability for their detection and signaling along the length of a CAN message. The justification for this is as follows. Due to transmitter bit error monitoring, detected errors affecting any set of nodes which includes the transmitting node (including all global errors) are immediately detected and signaled. Local errors affecting a set of nodes which does not include the transmitting node, but only one or more receivers, typically result in form or bit stuffing violations which are immediately signaled as errors by the effected receivers. Detailed simulations and experiments have indicated that well over 99.9% of detected CAN errors will be signaled by these three primary mechanisms [19, 24]. The remaining proportion of detected local errors will principally manifest as CRC failures, signaled only after the CRC check. Thus, although this assumption of uniform probability for detecting and signaling errors along the length of a CAN message is clearly not 100% accurate, it would seem to be close enough for analysis purposes. Let us initially proceed as such.

4.1. Random Errors and Bursts of Errors

As previously mentioned, research has shown that many errors in CAN communication links are not just single independent events but are likely to occur in isolated transient bursts [19, 24, 25]. In order to develop a technique to effectively capture such behaviors, an error model that is rich enough to capture this bursty nature yet simple enough to lend itself to tractable analysis is required. A common way to model bursty behavior is to use a simple two-state discrete Markov model [1, 8, 26, 27], such as is shown in Figure 1.

The model has two states and B, representing “Good” and “Burst” states, respectively. Transitions between the two states and have associated with them static probabilities and . The probability of remaining in a given state is then given by and . Each state may also have associated with it a probability of bit error occurrence (denoted and , resp.). The model parameters and can be interpreted as follows: the reciprocal of defines the expected (mean) gap between error bursts , and the reciprocal of defines the expected (mean) duration of error bursts , both variables having a geometric distribution. The expected interarrival time of error bursts is then given as (). The optional model parameters and can be interpreted as follows: when the system is in the “Good” state, the reciprocal of defines the expected (mean) interarrival time of errors, and when the system is in the “Burst” state, the reciprocal of defines the expected (mean) interarrival time of errors, both variables again having a geometric distribution. Typically, it would be the case that and . Let the state of the link at step be denoted by , and let the state of the Markov model at step be encoded as the probability that the link is in the error state; that is, . Applying the normal rules for Markov model state transitions, then depends only upon the previous state and the transition probabilities and and can be recursively computed as follows:

Assuming that the initial state of the link is known, the transient and steady-state evolution of the link state can be calculated using (2). The steady-state solution to the Markov chain is obtained by first setting into (2). Solving for , we obtain that . Given any starting state , after a transient period the model converges upon π which represents the fraction of time the link can be expected to be in the “Bad” state when observed for long periods. π also represents the probability that when the link is observed at some random sample time we find that ; following the acquisition of explicit knowledge about the link state at step t (such as by making an observation), the state transiently moves back to the steady-state π with coefficient according to the relationship . The special case in which errors do not arrive in bursts but only have a constant bit error rate of is covered by setting , after which we may see that .

For simplicity, in this paper it is assumed that and were set to 1 and 0, respectively, under the assumption that the probability of a full CAN frame transmission occurring whilst the link is in the burst state is effectively negligible (e.g., if is 0.5, a realistic assumption for a network such as CAN, the probability of successfully sending a 60-bit frame whilst in a burst state is of the order 10−20). In such circumstances π represents the BER of the CAN network. Although other studies have suggested that errors and bursts of errors in CAN networks follow exponential distributions (and can be considered as a Poisson process which is possibly nonhomogeneous and/or generalized), the model above should be at least as descriptive. This is because (i) if multiple errors arrive during the same logical bit-time, as may be predicted with a Poisson process, then only one of these errors will be effectively “counted” since time is discrete in a CAN network and (ii) the geometric distribution is the discrete equivalent to the continuous exponential distribution. Maximum likelihood estimates of the model parameters and from observed data traces are known and have simple closed forms [26, 27].

4.2. Error Statistics

As will be described in the next section, a means to determine the mean and variance of the additional load on the network which could be introduced by random errors and bursts of errors is required. Let the additional fault load over a time duration of length be denoted by the random variable , which is the sum of the additional loads introduced due to errors at each individual time step, which are denoted as . Then we may examine the expected additional load at every time step and also its variance; the mean and variance of may then be formed accordingly. From discussions in the previous section, although the BER of the link is π, the impact of an individual error depends to a certain extent upon the arrivals of previous errors. At each time step , assuming an error arrives, either of the following (mutually exclusive) events could occur:(1)The link was in the “Good” state at step , with probability (), but transits to the “Bad” state at step with probability ; the resulting error forces the current transmission to abort and an error frame to commence transmission (denoted as a “type 1” error).(2)The link was in the “Bad” state at step , with probability π, and remains in the “Bad” state at step with probability ; the resulting error delays by one time unit the (on-going) transmission attempt of the error frame following on from the last “type 1 error” described above (denoted as a “type 2” error).

Let us denote the probability of a type 1 error occurring by and the probability of a type two error occurring as . Note that ; that is, the BER consists of a proportion of each of the two error types. Let the longest frame transmitted during the considered interval of length be . Since the assumption is that an error during message transmission is immediately detected and the transmission aborted, a type 1 error leads to an increase in the error load by a value uniformly drawn from the interval , with mean value , plus an error frame of length . A type 2 error, since the link was previously in the error state, simply leads to a unit increase in error load. Denoting expectation of a variable as , the mean load introduced at each time step can be written in closed form:

An expression for the variance of can also be written using the “mean of square minus square of the mean” rule:

To simplify the above expression further, well-known formulae for the sum of the first natural numbers and their squares can be applied:

Giving the following simple closed-form expression for the variance,

Using (3) and (6), the mean and variance of the additional load due to errors can be formed through summation. Observe that there is some weak dependence between the random variables (in the sense that if the value of is revealed, the probabilities associated with the values which can assume are influenced). This is present in all probabilistic timing analysis for CAN, due to the serialization of message transmission and the error detection mechanisms employed. The main dependence in the current context is such that a type 1 error cannot immediately be followed by another type 1 error; hence if , . This manifests itself as an exponentially decaying negative correlation between a variable at step and its successors at steps > . However, as in previous works on CAN, let us drop this correlation and proceed with the assumption of statistical independence. This is a safe assumption since we now allow the situation in which both and can be > 1 simultaneously, which cannot occur in practice due to the serialization of CAN message transmission. Although this is pessimistic it facilitates a simpler analysis in the sequel. Briefly returning again to the point related to nonuniform probability for detecting and signaling errors along the length of a CAN message before leaving this section, please refer to Appendix A for details of the modification necessary to (3) and (6) should this require to be modeled.

5. A Simple Deadline Failure Bound

5.1. Probability Inequality

In this paper, the following bound for a sum of independent random variables is to be employed.

Theorem 1. Let be independent random variables. Let each of the be bounded such that , with some nonzero (positive) constant. Denote the sum of the variables along with its mean and variance as follows: Then given some real , the upper tail of the distribution function of is bounded by where

Proof. Consider Equation 8b in Bennett (1962) [28].

The expression above is presented in a slightly different form to that found in [28]; however the difference is purely cosmetic. In the current context, represents the length of an interval under consideration, represents the additional fault load which may be observed in this interval, and each of the represents the additional error loads which could be introduced at each time step . The bound represents the worst-case length of a message that may be affected by a type 1 error plus the worst-case length of an error frame (). With this result recalled, the proposed method may now be outlined.

5.2. Proposed Method

The method takes as input a set of messages , each with parameters as given by (1), environmental error characteristics and , and the worst-case length of error frame and produces a set of deadline failure probabilities . The method can be stated in three basic steps which are as follows:(1)For each message , carry out a sensitivity analysis to find the critical slack that any generated instance of the message has between its release time and its deadline such that the deadline is not missed, and determine which is the worst-case length of frame that could be transmitted during the busy period plus the length of error frame . If for any message , then set and stop as the message is not deterministically schedulable.(2)For each message , compute the quantities:with the mean and variance of computed according to (3) and (6), using in both of these expressions. If for any message set and exit as an error load which is less than the mean (expected) load which is sufficient to cause a deadline miss.(3)For each message calculate ) using (8b), and bound the deadline failure probability using the inequality of (8a).

The rationale for the method outlined above is that, with knowledge of the most critically loaded intervals of the schedule, one may calculate the critical error load that may be tolerated during that interval to (just) maintain schedulability; this is performed using sensitivity analysis (step (1)). Given knowledge of the length of the considered intervals and the environmental error characteristics, the mean and variance of the error load may be determined (step (2)); this allows the application of the inequality given in (8a) in step (3) to determine the probability the error load exceeds the critical error load. The values as calculated in this way can easily be linked to higher-level reliability measures in that knowledge of the number of deadlines of each message occurring per unit operation time (e.g., an hour) can be used to calculate the probability that no deadlines will be missed per hour (i.e., the network reliability). In step (1) of the above, both and are required; their determination is the subject of the next section.

6. Application to Fixed and Dynamic Priority Scheduling

As mentioned in Introduction, the paper focuses upon the DMJM and EDF scheduling methods due to their known optimality as fixed priority assignment rule and dynamic priority scheduling algorithm, respectively (albeit the former is for constrained deadlines only), on a single processor or communications network [6, 7]. In order to develop a means to solve step (1) in the proposed method, it is necessary to make some further assumptions. These assumptions are such that all errors are detected and that all messages which fail due to an error are immediately rescheduled using their original priority or deadline. Although the worst-case error load tolerable for a given message can be calculated exactly using sensitivity analysis combined with an exact test like response-time analysis [18, 23] or the processor demand approach [7], both would require an application of an exact schedulability analysis. Since the exact analysis is known to be coNP-complete in both cases [29, 30], simple bounds which may be easily manipulated algebraically are instead sought. As such, sufficient and easily computable schedulability conditions will instead be employed.

For both DMJM and EDF, the technique proceeds as illustrated in Figure 2: assuming that a busy period of length occurs, existing sufficient schedulability conditions will be adapted to upper-bound the worst-case workload induced by jitter and priority inversion (blocking) plus the transmission of valid frames which can occur. Assuming that this workload does not exceed the length of the interval (in which case the network cannot be deemed to be deterministically schedulable with the utilized condition), the remaining slack time is assumed to be assigned for error handling and message retransmission purposes.

6.1. Fixed Priority Scheduling

For fixed priority scheduling, the following sufficient condition can be adapted from the relationships obtained by Davis and Burns [23] to verify schedulability under the DMJM priority assignment: where the blocking term represents the worst-case length of time a message of priority may be blocked by a lower priority frame. For nonpreemptive frame transmission under fixed priority scheduling, the blocking term is given by the largest transmission time of any message instance with priority less than or equal to :

Since interference from messages being retransmitted due to errors appears in (10) in a similar fashion to nonpreemptive blocking (i.e., a lengthening of the busy period), the slack term can be added into the left hand side of (10). Now, setting and asand assuming an equality in (10) allow one to solve for the slack terms . Given the fixed priority ordering of message transmissions the retransmission bound lengths are also easily obtained, giving

6.2. Dynamic Priority Scheduling

For dynamic priority scheduling, the following sufficient condition can be used to verify schedulability under the EDF priority assignment: where the blocking term again represents the worst-case length of time a message of priority may be blocked by a lower priority frame; for nonpreemptive frame transmission under EDF scheduling with jitter, the blocking term is given by the largest transmission time of any message instance that has a deadline minus jitter greater than the deadline minus jitter of message [7]:

The sufficient condition in (14) above was motivated by the work of Devi [31]; however it includes jitter terms: the proof of correctness of (14) is included as Appendix A in the current paper. Now, using (14) allows proceeding along similar lines to the DMJM case. Again, since interference from messages being retransmitted due to errors appears in (14) in a similar fashion to nonpreemptive blocking, the slack term can be added into the left hand side of (10). Again setting and asand assuming an equality in (14) allow one to solve for the slack terms . Given the deadline-based priority ordering the retransmission bound lengths are also easily obtained, giving

The methods proposed above allow the completion of step (1) in the proposed method for both forms of scheduling with a time complexity and space complexity as per the original schedulability conditions in [23, 31]. Since steps (2) and (3) both require constant time per message to complete, the overall method is bounded in time and space complexity and , respectively.

7. Stochastic Simulation Studies

In order to evaluate the accuracy and predictive abilities of the probability computations detailed in previous sections, a series of stochastic simulations was carried out to compare calculated values to statistical results. This was deemed important since the analysis in previous work [8] resulted in unacceptable pessimism; hence statistical evaluation can help to validate the improved analysis and assumptions made. The simulations were carried out to assess the measured message deadline failure probability for a single CAN message in the presence of both static bit errors and bursts of bit errors. A message length of 135 bits was assumed (DLC of 8 with 11-bit identifier), with a 31-bit error message length. The message relative deadline was taken to be 500 bits, with the CAN network speed assumed to be 1 Mbps. The stochastic simulator was written in C++.

A total of four simulation experiments were performed. In each experiment, which lasted for 5 simulated hours, the empirical BER, mean, and variance were measured alongside the empirical deadline failure probability for a given set of error characteristics. In each case, the theoretical BER was set to 0.001 and the mean burst length set to either 1, 10, 20, or 30 bits. The first case corresponds to a static BER, whilst the latter three correspond to bursty environments. The chosen BER is high, corresponding to a very noisy environment; this was done to allow meaningful statistical results to be observed without incurring inordinately large simulation times. For each experiment configuration, the mean and variance in the error load per bit were calculated using (3) and (6), with the deadline failure probability obtained using (8a) and (8b). The slack was taken to be 365 bits. The empirical and analytical results are as shown in Table 1. For comparative purposes, the deadline failure probability obtained by application of the methods reported in [8] is also displayed.

Observing the data reported in the table, one immediately sees a very close correspondence between the measured (empirical) values of BER, mean load, and its variance as compared to the calculated values. This gives some validation in the assumptions made in Section 5 and gives confidence in the use of (3) and (6) to capture the effects of errors in the CAN network. In addition, one sees that although the empirical BER in each case is virtually identical to its theoretical value of 0.001, the mean load and its variance are both reducing with an increase in the length of bursts. This captures the notion that not all errors have the same impact upon the schedule and that “type 1” errors are worse that “type 2” errors from a timing perspective.

Focusing attention now upon the use of (8a) and (8b) to bound the deadline failure rate, it can be observed that although some inaccuracy is necessarily present, it seems effective at capturing the relative order of magnitude of the empirically observed deadline failure probability. In particular, it captures the behavior that, for a constant BER, the deadline failure probability is reducing for increasing mean burst length and this reduction can be quite pronounced. In addition, the pessimism in the bound reduces as the failure probability decreases (as is typical for exponential inequalities). In contrast the previous method described in [8] fails to capture this behavior and the level of pessimism is significantly increased; in fact the method fails to give meaningful results when bursty behavior is considered and asserts only that the failure probability is > 0.5. This is due to the assumption that each error impacts upon the fault load in an identical way, such that only two errors may be tolerated before deadline failure occurs. When the mean burst length is either 10, 20, or 30 this exceeds the number of tolerable errors and the method fails. The simulation shows that the clustering of errors in bursts is, in fact, beneficial when the BER remains constant as errors tend to cluster together and their effects are less pronounced. This behavior is clearly captured by the methods proposed in this paper.

8. Application to SAE Benchmark Message Set

In this section, an extended example based upon a benchmark set of CAN messages is provided to further illustrate the features of the proposed technique. The well-known “SAE benchmark set” of messages under consideration is as described in [32] and consists of over 50 typical automotive signals packed into 17 periodic/sporadic messages. In this paper, we assume that messages are fixed priority (DMJM) scheduled using CAN in its native form. Assuming a bit rate of 330 Kbps and extended 29-bit (cf. standard 11 bits) message identifiers, the messages and their parameters are as shown in Table 2. The total utilization of this CAN network is 44.5% in the absence of errors. The messages have been ordered according to nondecreasing deadline minus jitter.

To illustrate the full use of the methods proposed in this paper, some realistic assumptions were first made upon the BER and burst distribution. The BER was selected to have a value 1 × 10−6, with two distributions of bursts considered: a static configuration with errors arriving independently and a burst configuration with errors arriving in bursts of mean length 5. This configuration is not dissimilar with typical error rates and burst characteristics reported in the literature for CAN networks and employed in previous studies [12, 25]. An error frame of  bits was employed. Bounds on were then determined for fixed priority (DMJM) scheduling, applying the techniques developed in Sections 4, 5, and 6. The values obtained are shown in Table 3. Also indicated are the results obtained from application of the method described in [8]. Entries in the table given as “” indicate that numerically the event occurs with such a low probability that it could not be represented with double-precision floating point accuracy.

From this table, some interesting observations can be made. In all cases, the new bound described in this paper gave deadline failure bounds smaller than the previous method described in [8]. In the case of static errors, both bounds predict extremely low deadline failure probability for all messages, that is, an indication that such events should be extremely rare. The reduced failure rate in the new method is due to the relaxation of the assumption that all errors strike the last bit of the longest frame transmitted in the considered interval.

Considering next the burst case, one observes drastic differences between the two methods. In particular, the previous work [8] predicts that, as the burst length increases for a given BER, the probability of deadline misses increases significantly; in fact, the method fails to produce results for messages 5, 6, and 10. In these cases, the critical number of errors required to cause a deadline miss exceeds the mean number of errors arriving in the () interval. This is not as predicted by the new method, which predicts that the failure probability should be decreasing for fixed BER; this is again due to the relaxation of the assumption that all errors strike the last bit of the longest frame transmitted in the considered interval and the classification of errors into those of type 1 and type 2. Note that these assumptions have been validated using stochastic simulations in the previous section. In particular, consider the highest priority message (Msg 1), which has length 90 bits and a deadline slack of 1,387 bits. Under error bursts, the deadline failure probability is bounded to be 0.0118 using the previous method: a comparatively commonplace event occurring once every 4.24 seconds for the given message period. However in the previous section, recall that under simulation a longer (135 bits) CAN message with shorter deadline slack (365 bits) was found to miss its deadline with probability 0.000008529 in a far more aggressive environment with longer burst durations. As such, the quantification by the new method where a deadline miss remains to be an extremely rare event even in a burst environment seems much more reasonable, although further detailed simulations are required to confirm this. Finally, note that the calculations necessary to obtain these results were embedded in a simple spreadsheet, giving a good indication of the ease of application of the method and that it is a good candidate for use in the stated area of motivation in Introduction.

9. Conclusions

In this paper, a simple and efficient method to bound the deadline failure probabilities for a real-time industrial CAN network transmitting periodic/sporadic messages which may experience release jitters under fixed or dynamic priority scheduling has been proposed. As demonstrated through stochastic simulation, indications are such that the procedure is effective at capturing the effects of errors and burst distributions on the message schedule and produces results which are more accurate and useful than previous methods. A representative example consisting of a benchmark message set for a CAN network has helped to illustrate key elements of the proposed technique and highlight its advantages over a recent similar method. In future work, more detailed simulation studies will be carried out to explore the accuracy of the proposed bound. In addition, extensions of this bounding technique will be used, in conjunction with appropriate environmental error monitoring techniques, to move towards efficient dependability-aware QoS controls for flexible real-time CAN networks.

Appendix

A.

Consider the case where a specified proportion of type 1 errors will manifest only as CRC failures signaled by dropping the current transmission and commencing an error frame after the CRC check. Let the number of lost bits following a CRC failure be given by . Incorporating this information into (3) gives a modified expression for the mean load introduced at each time-step:

An expression for the variance of can once again be written using the “mean of square minus square of the mean” rule and proceeding as before:

The above expressions can also be used in situations in which every type 1 error is taken to lead to the same (constant) error load (as in previous analysis for CAN, this could be the worst-case length of the longest message). This is achieved by setting and to the required number of bits.

B.

Proof of (14). The expression is proved by contradiction, using a similar approach to [31]: if the messages are not schedulable under EDF, it will be shown that the condition of (14) is false. Assume that a deadline miss occurs in the schedule and the condition of (14) holds. Let the first deadline miss in the schedule occur at time , and let the first time the network was idle (or occupied processing a frame with deadline > ) be time . Without loss of generality, assume that and set . As it was assumed that the messages are sorted in order of nondecreasing deadline minus jitter, for convenience let the highest-indexed message satisfying the relationship be given by . Observe that, under EDF scheduling, no message with index > i can contribute to the network demand in the interval [0, t) and that there is no idle time in this interval [7]. Assuming that each message with index is activated at time to maximize the network loading, then the network demand can then be calculated using [7]:Since the deadline miss occurs at time , the demand in the interval [0, t) must exceed the length of the interval and we can writeSince we can writewhere is as defined in (15). Collecting terms in a more convenient manner for computation gives rise to (14):Thus by checking that for each message relative deadline (14) holds, a deadline miss will not occur as claimed.

Competing Interests

The author declares that there are no competing interests regarding the publication of this paper.

Acknowledgments

Preliminary versions of some of the ideas presented in this paper were presented at the 19th IEEE International Conference on Emerging Technology & Factory Automation [8].