Journal of Computer Networks and Communications

Volume 2016, Article ID 5196092, 12 pages

http://dx.doi.org/10.1155/2016/5196092

## Bounds on Worst-Case Deadline Failure Probabilities in Controller Area Networks

Electronics & Control Group, Teesside University, Middlesbrough TS1 3BA, UK

Received 6 November 2015; Revised 2 March 2016; Accepted 28 March 2016

Academic Editor: Tin-Yu Wu

Copyright © 2016 Michael Short. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Industrial communication networks like the Controller Area Network (CAN) are often required to operate reliably in harsh environments which expose the communication network to random errors. Probabilistic schedulability analysis can employ rich stochastic error models to capture random error behaviors, but this is most often at the expense of increased analysis complexity. In this paper, an efficient method (of time complexity $O(n \log n)$) to bound the message deadline failure probabilities for an industrial CAN network consisting of $n$ periodic/sporadic message transmissions is proposed. The paper develops bounds for Deadline Minus Jitter Monotonic (DMJM) and Earliest Deadline First (EDF) message scheduling techniques. Both random errors and random bursts of errors can be included in the model. Stochastic simulations and a case study considering DMJM and EDF scheduling of an automotive benchmark message set provide validation of the technique and highlight its application.

#### 1. Introduction

Real-time industrial communication networks such as the Controller Area Network (CAN) are often required to operate in harsh environments, where they may be subject to environmental hazards such as electromagnetic interference (EMI) and other forms of mechanical/electrical stresses. Exposure to hazards such as these can induce random errors into a system, which, if left uncorrected, may result in system failures [1, 2]. This paper is concerned with probabilistic schedulability analysis of the communications in real-time industrial CAN networks which are scheduled by a priority-driven algorithm in the presence of transient and/or intermittent errors. CAN is a multimaster, differential serial bus using NRZ encoding at the physical layer and is often employed for distributed real-time control applications. The full CAN protocol description may be found in [3]. Although in its native form CAN supports fixed priority scheduling, simple protocols operating at the node level have previously been developed to enable distributed Earliest Deadline First (EDF) scheduling with CAN [4, 5]. As such, the paper focuses upon both fixed priority (the Deadline Minus Jitter Monotonic (DMJM) algorithm) and dynamic priority (EDF) scheduling methods, the latter of which can be enforced at the CAN application layer, due to their known optimality in a networked environment under a wide variety of operating configurations [6, 7]. Specifically, the paper is concerned with the fast calculation of tight bounds on the probability that a deadline will be missed when error arrivals cause messages to be aborted and subsequently rescheduled for transmission after the transmission of an error message.

Such retransmission is a form of redundancy which requires some temporal “slack capacity” in the message schedule; how much slack must be allocated depends upon many factors including the level of criticality in the service the system provides, the message set parameters and scheduling algorithm, and also the nature of the error detection and correction mechanisms employed by the system. If a system has insufficient slack to tolerate the effects of the errors it experiences, then aborted messages will not be processed or delivered correctly before their deadlines and system failures may occur. Since error arrivals are random in nature, no absolute guarantees of timeliness can be given, and probabilistic guarantees are instead sought. In this respect, the principal contribution of the paper is the derivation of an efficient means to obtain tight bounds on the probability of deadline failure when errors occur randomly (with geometrically distributed interarrivals) and possibly in random bursts (with geometrically distributed burst interarrival and length), and aborted messages are immediately requeued for transmission.

Although previous work related to probabilistic schedulability analysis for wired networks such as CAN and also wireless networks has employed rich stochastic error models to capture these random behaviors, to date this has (mostly) been at the expense of increased analysis complexity. In this paper, some recent results on probabilistic real-time error models and schedulability analysis in [1, 8] are extended, and an efficient method to bound the deadline failure probabilities for a set of periodic/sporadic messages transmitted in a real-time industrial communication network is proposed. The procedure first carries out an analysis of the available slack in the DMJM or EDF schedule and uses this information, in conjunction with knowledge of the environmental error characteristics, to determine the probability that the slack will be exceeded by the extra load induced by errors. The analysis may be performed in $O(n \log n)$ time. Stochastic simulations and an example related to scheduling a benchmark set of messages on an automotive CAN network are used to illustrate the technique.

The motivation for this bounding method arises principally due to the need for efficient dependability-aware online admission or Quality of Service (QoS) controls in flexible networks (e.g., for automotive applications). In addition, motivation arises from the need for techniques which can provide designers with methods whereby the impacts of different design options (such as the choice of scheduling algorithm and configuration of message parameters and network bandwidth) upon system reliability may be quickly explored at early stages of a design. The remainder of the paper is organized as follows. Section 2 presents a brief summary of related work to contextualize the paper. Sections 3 and 4 present the network and error models, respectively. The proposed technique is outlined in general terms in Section 5 and is applied to DMJM and EDF scheduling in Section 6. Section 7 presents stochastic simulations to validate the proposals, and Section 8 describes a detailed example based upon a benchmark set of messages for an automotive network. Section 9 concludes the paper.

#### 2. Related Work

Although this paper is principally concerned with industrial CAN networks, related work on CPU and communications networks in general is included due to similarities in both the error and task models which are employed. As argued in previous works [1, 8], error models employed in real-time schedulability analysis (for both CPU and wired/wireless network message scheduling) typically assume one of the following: (i) a pseudoperiodic arrival of errors; (ii) some fixed number of errors experienced over a known time interval (e.g., the major cycle of the system operation); or (iii) purely random error arrivals, typically as a result of a (possibly compound or Markovian) binomial or Poisson process.

Although approach (i) is relatively straightforward to incorporate into an existing schedulability analysis, in many cases it does not effectively capture either randomness (with exceptions, e.g., [9]) or bursty characteristics. Approach (ii), on the other hand, seems well suited to bursty characteristics: the work of [10] presents an exact analysis for CPU task scheduling with EDF in the case where a fixed number of errors arrive during the system’s major cycle. However, the assumption that not more than one burst of errors will arrive over this known time period implies a deterministic model, which seems inappropriate since errors are due to random noise and interference. In addition, if a given number of errors can arrive in some time interval of length proportional to the smallest relative deadline of any task as per [10], then it again seems unjustified to assume that only the same number of errors will arrive in some proportionally much larger time interval (e.g., the major cycle). In [11], a similar method to [10] is employed to analyze the timing properties of wireless channels with bounded retransmissions; here, however, the number of retransmissions can be specified independently for each wireless message and the length of the time interval is made proportional to the message relative deadline. Although this seems a much improved model to use, no attempt is made to link the retransmission bounds to environmental error models and/or the required reliability of message delivery. Approach (iii) is the one most generally taken for the analysis of distributed systems such as CAN, and the error models employed in these works are typically much richer than those employed for CPU schedulability analysis (although, as noted, the scope of these models is not restricted to the networked environment) [1, 12–15].

Marques et al. analyze the use of servers for the retransmission of messages lost due to errors in time triggered CAN networks [16]. The obtained results indicated that the choice of server and rescheduling policy has a significant influence upon the results, and different choices are appropriate depending upon the metric of interest. The choice of a deadline-based retransmission server was found to minimize the number of lost transmissions (deadline failures). Gujarati and Brandenburg [17] describe a procedure to bound the Failures in Time (FIT) rate for a CAN network experiencing network interference and node failures. In [16, 17], error arrivals were assumed to follow a Poisson distribution.

A recent overview of response-time analysis techniques for CAN is provided in [18], for both the deterministic and probabilistic cases. Although these previous works have examined probabilistic (bursty/sporadic/intermittent) errors and also deterministic errors within combined reliability and schedulability analysis frameworks, as noted in [1, 8, 18], one complication of some of these methods and of the use of probabilistic models in general is the potential complexity of implementation. Typically, an iterative procedure is needed to produce either a probability distribution of response times or a breakdown reliability beyond which point one or more tasks or messages will miss a deadline. This complexity generally makes the use of probabilistic methods impractical where efficiency is required. Some work to help address this issue was presented in [1, 8], where a rich error model was employed to capture the effects of random errors and bursts of errors and simple closed-form expressions were developed to bound the number of error arrivals in a given interval of time given knowledge of the environmental conditions and desired reliability of the system.

A second drawback in almost all previous analyses is an implicit assumption that each and every error arrival impacts the schedule by inducing a retransmission of a worst-case frame [18]. This is pessimistic, especially for CAN networks which have early error detection and signaling. The pessimism is compounded for bursts of errors, in which multiple errors arrive with potentially very small temporal separation; it is almost impossible for each such error to induce the retransmission of a worst-case length frame. For bursts of errors, a more likely scenario is that the first error in the burst sequence induces a frame retransmission, and subsequent errors in the burst sequence delay its starting time until the burst subsides. This is not taken into account in [1] or [8], or in any of the methods described in [18]. In the current work, efforts are made to improve the analysis in an attempt to address this issue, whilst retaining most aspects of the simplicity of application.

#### 3. CAN Network Model

In the analysis that follows, it is taken that time is discrete (one time unit will typically, but not necessarily, correspond to one network bit-time) and is indexed by a nonnegative integer variable $t$. It is assumed that the system consists of a number of distributed nodes, which share a communications channel to exchange real-time messages. A standard model for a shared real-time communication system is adopted to represent the CAN network in that the system to be implemented can be represented by a set of messages, indexed by $i$. Each message is represented by a 4-tuple:

$$(T_i, C_i, D_i, J_i),$$

in which $T_i$ is the message period/interarrival, $C_i$ is the worst-case transmission time of any instance of the message, $D_i$ is the message relative deadline, and $J_i$ is the jitter (the worst-case time that may elapse between an (external or internal) event occurring initiating a message transmission in a distributed node and it being released to the network for transmission). Jitter is typically induced by variation in the latencies of distributed node event handlers. For CAN messages, either standard (11 bits) or extended (29 bits) identifiers can be used; as is well-known (e.g., [19]), for a message with Data Length Code DLC, the total number of bits for a message with an 11-bit identifier is given by $C_i = 55 + 10 \cdot \mathrm{DLC}$. For 29-bit identifiers, $C_i = 80 + 10 \cdot \mathrm{DLC}$. For CAN networks, the worst-case length of an error frame is 31 bits.
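The frame lengths quoted above can be checked against the bit-stuffing bound commonly used in CAN response-time analysis. The sketch below is illustrative only: the function name `can_frame_bits` is hypothetical, and the underlying expression (34 or 54 stuffable overhead bits, 13 unstuffed bits including the interframe space, and at most one stuff bit per four stuffable bits after the first) is assumed from the general CAN literature rather than taken from this paper.

```python
def can_frame_bits(dlc: int, extended: bool = False) -> int:
    """Worst-case CAN data frame length in bits (hypothetical helper).

    Assumes the widely used bound
        g + 8*dlc + 13 + floor((g + 8*dlc - 1) / 4)
    where g = 34 for 11-bit identifiers and g = 54 for 29-bit identifiers.
    The 13 bits are not subject to stuffing and include the 3-bit
    interframe space.
    """
    if not 0 <= dlc <= 8:
        raise ValueError("DLC must be between 0 and 8 data bytes")
    g = 54 if extended else 34           # stuffable overhead bits
    stuffable = g + 8 * dlc              # overhead plus data bits
    max_stuff_bits = (stuffable - 1) // 4
    return stuffable + 13 + max_stuff_bits


if __name__ == "__main__":
    for dlc in range(9):
        # The bound collapses to a linear expression in the DLC.
        assert can_frame_bits(dlc) == 55 + 10 * dlc
        assert can_frame_bits(dlc, extended=True) == 80 + 10 * dlc
    print(can_frame_bits(8), can_frame_bits(8, extended=True))
```

Because each extra data byte adds eight data bits and at most two stuff bits, the bound is linear in the DLC, which is why a closed form of the kind `55 + 10 * dlc` is usually quoted instead of the floor expression.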

The utilization of an individual message is given by $U_i = C_i / T_i$ and represents the fraction of time the network will be occupied processing the frames generated from the message over its lifetime. Successive frame arrivals from sporadic messages are invoked by both internal and external events (typically hardware or software interrupts on their host node) and are always separated by at least $T_i$ units of time; frame arrivals from periodic messages are always separated by exactly $T_i$ time units on their host node and are invoked by a logical timer, which may possibly be synchronized to a distributed clock. It is assumed that worst-case clock synchronization errors between any host nodes can be incorporated into these jitters and the synchronization protocol messages modeled as regular network traffic.
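As a small illustration of the utilization definition, the following sketch computes per-message and total utilization with exact rational arithmetic. The `Message` class, its field names, and the three example messages are assumptions made for illustration and are not part of the paper's benchmark set.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Iterable


@dataclass(frozen=True)
class Message:
    """Hypothetical container for the message 4-tuple, in bit-times."""
    T: int  # period / minimum interarrival time
    C: int  # worst-case transmission time
    D: int  # relative deadline
    J: int  # release jitter


def utilization(m: Message) -> Fraction:
    """Fraction of network time occupied by one message: C / T."""
    return Fraction(m.C, m.T)


def total_utilization(messages: Iterable[Message]) -> Fraction:
    """Sum of the individual utilizations."""
    return sum((utilization(m) for m in messages), Fraction(0))


# Illustrative values only: three 8-byte standard frames (135 bits each).
example = [
    Message(T=1000, C=135, D=1000, J=0),
    Message(T=2000, C=135, D=2000, J=0),
    Message(T=5000, C=135, D=5000, J=0),
]

if __name__ == "__main__":
    u = total_utilization(example)
    print(u, float(u))
```

Using `Fraction` avoids floating-point rounding when the total is compared against unity, which matters for message sets that are loaded close to the limit.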

When a frame of message $i$ arrives (becomes ready) at some time $t$, its *absolute* deadline is set at time $t + D_i$ and the scheduling procedure must allocate $C_i$ units of network time to process the job in the interval $[t, t + D_i)$; otherwise a deadline miss will occur. Both fixed priority (DMJM) and dynamic priority (EDF) scheduling procedures are considered in this paper, covering a wide variety of wired and wireless industrial network protocols and protocol extensions. It must be cautioned at this point that fixed priority and EDF scheduling, aside from a limited number of specific cases, are seldom supported directly at the MAC level. A drawback is that their use inevitably requires overlay protocols leading to the introduction of low-level overheads; however, the many recognized benefits (such as enabling higher network utilization and facilitating easier temporal analysis) oftentimes outweigh this drawback. CAN in its native form supports fixed priority scheduling, and hence DMJM is supported by design. Descriptions of overlay protocols enabling priority-based scheduling in Flexible Time Triggered (FTT) master-slave systems for both switched Ethernet and wireless networks can be found in [20, 21]. Extensions to enable EDF scheduling in wireless and wired networks using IEEE 802.15.4, Bluetooth, and CAN are discussed in [4, 5, 11, 22], respectively. Prospective system designers considering the use of any overlay protocol should be sure that the benefits outweigh the overheads for their particular application. Consideration of such analysis is outside the scope of the current paper.

Although the techniques described in this paper can be applied (in principle) to a message set with arbitrary deadlines, the main focus is on constrained deadline messages, in which $D_i \le T_i$. According to previous work, it is known that the worst-case arrival pattern of the message frames in terms of network load is the one in which all frames experience their worst-case jitter simultaneously, such that they are aligned in a synchronous release pattern, and thereafter the frames arrive at their maximum allowed rates [7, 23]. As most real-time networks inherently support nonpreemptive frame transmissions, worst-case blocking due to priority inversion is also required to be accounted for in the timing and schedulability analysis. Henceforth, for convenience it is assumed that the messages are sorted in order of nondecreasing relative deadline minus jitter, that is, for any two messages $i$ and $j$, if $i < j$ then $D_i - J_i \le D_j - J_j$, and that the total network utilization is not greater than unity. It is also assumed that, under EDF scheduling, deadline ties are broken by lowest message index.
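The ordering and tie-breaking conventions stated above can be sketched as follows. This is a minimal illustration, not the paper's analysis: the `Message` tuple, the helper names, and the sample parameter values are all hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Message(NamedTuple):
    """Hypothetical message record; times are in bit-times."""
    name: str
    T: int  # period / minimum interarrival
    C: int  # worst-case transmission time
    D: int  # relative deadline
    J: int  # release jitter


def dmjm_order(messages: Sequence[Message]) -> List[Message]:
    """Sort by nondecreasing deadline minus jitter (stable for ties)."""
    return sorted(messages, key=lambda m: m.D - m.J)


def is_constrained(messages: Sequence[Message]) -> bool:
    """True if every message has a relative deadline no larger than its period."""
    return all(m.D <= m.T for m in messages)


def edf_pick(ready: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Choose the next frame from (absolute_deadline, message_index) pairs.

    Tuple comparison selects the earliest absolute deadline first and
    breaks deadline ties by the lowest message index.
    """
    return min(ready)


# Illustrative values only.
msgs = [
    Message("a", T=1000, C=135, D=900, J=50),    # D - J = 850
    Message("b", T=500, C=95, D=400, J=20),      # D - J = 380
    Message("c", T=2000, C=135, D=1000, J=150),  # D - J = 850
]

if __name__ == "__main__":
    print([m.name for m in dmjm_order(msgs)])
    print(is_constrained(msgs))
    print(edf_pick([(700, 2), (500, 3), (500, 1)]))
```

Python's `sorted` is stable, so messages with equal deadline minus jitter keep their original relative order, which gives a deterministic index assignment.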

#### 4. Error Model

To incorporate the effects of frame errors and retransmissions into a timing analysis, it is necessary to first state some assumptions regarding the effects of errors. As mentioned in Section 2, in almost all previous works it has been assumed that all errors manifest themselves in such a way that the last bit of the longest valid message frame from the (sub)set of messages currently under analysis is repeatedly corrupted by errors, forcing retransmission of the entire message. In this paper, this assumption is relaxed, since the CAN protocol is designed to support early detection and signaling of many types of errors and early abandonment of corrupt transmissions, more so than similar serial protocols [19]. At this point it is worth recalling from the CAN protocol definition two important points related to error detection: (i) the probability of an undetected error is vanishingly small (of the order of $4.7 \times 10^{-11}$ times the message error rate) and (ii) in addition to CRC failures, instantaneous bit errors, bit stuffing errors, and form errors may be detected and signaled at any point in a frame transmission by any node (whether transmitter or receiver) [3].

Although in reality the probability of undetected errors is higher than specified in the CAN protocol due to interactions between bit stuffing and the CRC [24] and the fact that inconsistent omissions can occur, let us proceed under the assumption that for detected errors there is a uniform probability for their detection and signaling along the length of a CAN message. The justification for this is as follows. Due to transmitter bit error monitoring, detected errors affecting any set of nodes which includes the transmitting node (including all global errors) are immediately detected and signaled. Local errors affecting a set of nodes which does not include the transmitting node, but only one or more receivers, typically result in form or bit stuffing violations which are immediately signaled as errors by the affected receivers. Detailed simulations and experiments have indicated that well over 99.9% of detected CAN errors will be signaled by these three primary mechanisms [19, 24]. The remaining proportion of detected local errors will principally manifest as CRC failures, signaled only after the CRC check. Thus, although this assumption of uniform probability for detecting and signaling errors along the length of a CAN message is clearly not 100% accurate, it would seem to be close enough for analysis purposes. Let us initially proceed as such.
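To give a feel for how much pessimism the uniform-detection assumption removes, the sketch below compares the network time lost to a single error under the traditional last-bit assumption with the expected loss when the detection point is uniformly distributed over the bits of the frame. This is only an illustration of the assumption, not the bound derived in this paper; the function names, the 135-bit frame, and the 31-bit error frame are assumed values.

```python
from fractions import Fraction


def worst_case_loss(frame_bits: int, error_frame_bits: int) -> int:
    """Bits lost if the error always corrupts the last bit of the frame."""
    return frame_bits + error_frame_bits


def expected_loss_uniform(frame_bits: int, error_frame_bits: int) -> Fraction:
    """Expected bits lost if the error is signaled at a bit position drawn
    uniformly from 1..frame_bits, followed by a full error frame.

    The mean of a discrete uniform position on 1..n is (n + 1) / 2.
    """
    return Fraction(frame_bits + 1, 2) + error_frame_bits


if __name__ == "__main__":
    C = 135  # assumed: worst-case 8-byte frame with an 11-bit identifier
    E = 31   # assumed: worst-case error frame length in bits
    print(worst_case_loss(C, E))        # 166
    print(expected_loss_uniform(C, E))  # 99
```

Under these assumed values the expected loss per error is roughly 60% of the worst-case figure, which indicates why relaxing the last-bit assumption can tighten a probabilistic bound.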

##### 4.1. Random Errors and Bursts of Errors

As previously mentioned, research has shown that many errors in CAN communication links are not just single independent events but are likely to occur in isolated transient bursts [19, 24, 25]. In order to develop a technique to effectively capture such behaviors, an error model is required that is rich enough to capture this bursty nature yet simple enough to lend itself to tractable analysis. A common way to model bursty behavior is to use a simple two-state discrete Markov model [1, 8, 26, 27], such as that shown in Figure 1.
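A generic two-state discrete Markov error process of this kind can be simulated as in the sketch below. It is a minimal illustration under stated assumptions: the state names, the parameter names (`p_enter`, `p_exit`, `p_err_normal`, `p_err_burst`), and the example values are hypothetical and are not taken from Figure 1 or from the paper's parameterization.

```python
import random
from typing import List


def simulate_errors(n_steps: int,
                    p_enter: float,
                    p_exit: float,
                    p_err_normal: float,
                    p_err_burst: float,
                    seed: int = 0) -> List[bool]:
    """Simulate a two-state (normal/burst) discrete-time error process.

    At each time unit the state is updated first, then an error occurs
    with a state-dependent probability. Returns one boolean per time unit.
    All parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    in_burst = False
    errors = []
    for _ in range(n_steps):
        if in_burst:
            # Leave the burst state with probability p_exit.
            if rng.random() < p_exit:
                in_burst = False
        else:
            # Enter the burst state with probability p_enter.
            if rng.random() < p_enter:
                in_burst = True
        p_err = p_err_burst if in_burst else p_err_normal
        errors.append(rng.random() < p_err)
    return errors


if __name__ == "__main__":
    trace = simulate_errors(100_000, p_enter=1e-4, p_exit=0.05,
                            p_err_normal=1e-5, p_err_burst=0.5)
    print(sum(trace), "errors in", len(trace), "bit-times")
```

With constant transition probabilities, the time spent in each state is geometrically distributed (mean burst length `1 / p_exit` and mean gap between bursts `1 / p_enter` in this sketch), which matches the geometric burst interarrival and length described in the Introduction.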