Department of Electrical and Computer Engineering, University of Alberta, Edmonton T6G 2V4, AB, Canada
Department of Computer Science and Engineering, Aalborg University Esbjerg, Niels Bohrs Vej 8, Esbjerg 6700, Denmark
Abstract
This paper proposes a reliability monitoring scheme for active fault tolerant control
systems using a stochastic modeling method. The reliability index is defined based on
system dynamical responses and a safety region; the plant and controller are assumed to
have a multiple regime model structure, and a semi-Markov model is built for reliability
evaluation based on the safety behavior of each regime model estimated by using Monte
Carlo simulation. Moreover, the history data of fault detection and isolation decisions are
used to update the FDI transition characteristics and the reliability model. The method provides an
up-to-date reliability index, as demonstrated on an aircraft model.
1. Introduction
To meet the high reliability requirements of safety-critical processes, major progress
has been made in fault tolerant control systems (FTCSs). FTCSs that employ
fault detection and isolation (FDI) schemes and reconfigurable controllers to
accommodate fault effects are known as active FTCSs. Most work on
reconfigurable controller design is performed under the assumption of perfect
FDI. However, imperfect FDI results are inevitable owing to
disturbances or modeling uncertainties and may compromise the designated reliability
requirements. Therefore, it is necessary to validate the design of FTCSs from a
reliability perspective.
The reliability of FTCSs has been investigated using
various methods. The key problem is to set up appropriate reliability models
with control objectives and safety requirements incorporated. As fault
occurrences and system failures are rare events, dynamic models are usually not
suitable for reliability analysis. For example, Wu used serial-parallel block diagrams
and Markov models for evaluation, and defined a coverage concept to
relate reliability and control actions [1]. Walker proposed Markov and
semi-Markov models to describe the transitions of fault and FDI modes, but
control actions were not considered [2]. In previous work, we considered static
model-based control objectives and built a semi-Markov model from imperfect FDI
and hard-deadline concepts [3, 4]. However, in many practical systems, the
safety and reliability of operation are often assessed based on dynamic system
responses. For instance, reliability in structural control is defined as the
probability of system outputs outcrossing safety boundaries and evaluated by
using Gaussian approximation [5]. In addition, an online reliability monitoring
scheme using updated information may aid maintenance scheduling, provide
early warnings, and avoid emergency overhauls. How to evaluate reliability when it
is defined based on system trajectories and how to implement an online monitoring
scheme are the main motivations of this paper.
The objectives of this paper are threefold. First,
a steady-state test (SST) is proposed to reduce false alarms in FDI
decisions, and the stochastic modeling of such an FDI scheme is studied so that
the transition characteristics of FDI modes can be described. The second
objective is to develop a reliability evaluation scheme for FTCSs based on
system dynamic responses and a safety boundary. Finally, online monitoring
features are considered, such as estimation of FDI transition parameters based
on history data and timely update of the reliability index to reflect up-to-date
system behavior.
The remainder of this paper is organized as follows:
the assumptions and system structure are given in Section 2; FDI scheme,
modeling, and parameter estimation are discussed in Section 3; the
determination of outcrossing failure rates and hard-deadlines is discussed in
Section 4; the reliability model construction is discussed in Section 5,
followed by a demonstration example on an F-14 aircraft model in Section 6.
2. Assumptions and System Structure
Assumption 1.
The considered plant is assumed to have finite fault modes, and dynamics under each fault mode
can be effectively represented by a linear system model.
Fault modes are represented by a set
with
integers;
represents the
set of dynamical plant models under various fault modes;
denotes a set
of reconfigurable controllers in a switching structure.
is designed for
fault mode
based on
,
. However, true fault modes are usually not directly
known, so an FDI scheme is used to generate estimates of fault modes, which may
deviate from true fault modes with error probabilities.
Assumption 2.
FDI scheme is assumed to generate a fault estimate
based on a batch of measurements and calculations for every fixed period
.
This assumption reflects the cyclic nature of FDI schemes such
as statistical tests and interacting multiple model (IMM) Kalman filters [6]. FDI modes are represented by a discrete-time
stochastic process
, where
, the set of nonnegative integers. The time duration
between consecutive discrete indices is equal to FDI detection period
.
is put in use
when
,
. Corresponding to
, a discrete-time stochastic process
denotes true
fault mode. In reliability engineering, constant failure rates are usually
assumed for the main part of component life cycle. In such a case,
can be
described as a Markov chain [7], and its transition probabilities are denoted as
,
.
Remark 1.
The semi-Markov process can be used as a general FDI
model. It can describe any type of sojourn time distribution; in contrast, the
Markov process model accepts exponential sojourn time distributions only. More
discussions can be found in [4].
Assumption 3.
System performance is assumed to be represented by a
vector signal
. Safety region, denoted as
, is assumed to be a fixed region in the space of
bounded by its
safety threshold. Failure is assumed to occur when
leaves the
safety region for the first time.
This assumption intends to define an appropriate
reliability index based on system dynamical response. It is common in control
systems to use a signal
to represent
performance, and
is usually to
be kept at small values against influences from exogenous disturbances,
modeling uncertainties, and dynamical characteristic changes caused by faults.
Safety region
is assumed to
be fixed and known a priori. The scenario that
exceeds
represents loss
of control and system failure. More discussion of this assumption can be
found in [8].
Definition 1.
For a time interval from 0 to
, the reliability function
is defined as
the following probability:
(1)
Mean time to
failure (MTTF) is defined as the expected time of satisfactory operation:
(2)
Remark 2.
Unlike
repairs, which rely on human intervention while system operation is stopped,
control actions are executed automatically and can be regarded as internal
actions of FTCSs. Therefore, MTTF represents the mean operational time without
human intervention before failure.
Compared with
and
,
is typically a
fast changing function determined by both continuous and discrete dynamics. As
shown in Figure 1,
and
are two regime
modes and determine the transitions among regime models. When
and
are fixed,
evolves
according to plant model
and controller
. As a result of these hybrid dynamics, directly
evaluating
and MTTF is a
difficult problem. Therefore, a discrete-time semi-Markov chain
is constructed
for reliability evaluation purpose. The main idea is that the hybrid system is
decomposed into various regime models; each regime model is then evaluated for
related safety characteristics, and
is constructed
to integrate these characteristics with transition parameters of regime modes
and to solve its transition probabilities for reliability evaluation. The
structure and main components of reliability monitoring scheme are illustrated
in Figure 2.
Figure 1: Transitions among regime models.
Figure 2: System structure.
Semi-Markov reliability model
is the kernel
component for calculating MTTF. It is constructed based on the following
parameters: (1) the transition rates of
, called plant failure rates, (2) the estimates of
from FDI and
confirmation test, called confirmed fault modes, (3) the parameters of
estimated from
history data, called FDI transition characteristics, (4) the probability of
crossing safety
boundary during an FDI cycle
when
, called failure outcrossing rates, (5) the average
number of periods before crossing safety boundary when
, called hard deadlines. Among these parameters, the
second and third ones can be updated online.
3. FDI Scheme and Its Characterization
3.1. Steady-State Tests
It is well
known that false alarm and missed detection rates are two conflicting quality
criteria of FDI. One is usually improved at the cost of degrading the other.
What is worse, general rules for adjusting FDI to improve these two criteria
simultaneously are often not known. For example, in a scheme based on IMM
Kalman filters, it is not clear how to determine the Markov interaction parameters.
Considering that most false alarms last only a short time, an SST strategy is
adopted for postprocessing FDI decisions.
SST requires that, when the FDI decision changes, the new
decision is accepted only after it has stayed the same for a minimum number of
detection cycles. Let
denote the
required number of consistent cycles for FDI mode
,
. The effectiveness of this SST strategy relies on the
distribution of false alarm durations. For example, if a nonnegative discrete
random variable
denotes the
false alarm duration when system fault mode
,
can be taken as
-quantile of
,
, meaning
(3)
which implies
that false alarm probability can be reduced by ratio
when accepting
FDI decisions after
. The weakness of this method is additional detection
time delay of
when fault
occurs. However, faults are rare, so this delay is seldom incurred. Compared
with the improvement gained on the far more frequent transitions of FDI modes,
this weakness is acceptable.
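The SST postprocessing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `steady_state_test` and `quantile_cycles`, the integer mode labels, and the empirical-quantile rule for choosing the required number of cycles are all hypothetical.

```python
import math
from typing import List, Sequence


def steady_state_test(raw_modes: Sequence[int],
                      required_cycles: Sequence[int],
                      initial_mode: int = 0) -> List[int]:
    """Post-process raw FDI decisions with a steady-state test (SST).

    A new mode is accepted only after the raw FDI decision has stayed at
    that mode for required_cycles[mode] consecutive detection cycles.
    """
    accepted = initial_mode      # mode currently used for reconfiguration
    candidate = initial_mode     # mode waiting to be accepted
    run_length = 0               # consecutive cycles spent at the candidate
    output = []
    for mode in raw_modes:
        if mode == accepted:
            candidate, run_length = accepted, 0   # short alarm is dropped
        elif mode == candidate:
            run_length += 1
        else:
            candidate, run_length = mode, 1
        if candidate != accepted and run_length >= required_cycles[candidate]:
            accepted, run_length = candidate, 0
        output.append(accepted)
    return output


def quantile_cycles(false_alarm_durations: Sequence[int], alpha: float) -> int:
    """Smallest observed duration d with at least a fraction alpha of samples <= d.

    Assumption: the required number of consistent cycles is chosen as an
    empirical quantile of recorded false alarm durations.
    """
    ordered = sorted(false_alarm_durations)
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]


# A one-cycle false alarm (index 2) is rejected; a persistent change is
# accepted after three consistent cycles.
raw = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0]
print(steady_state_test(raw, required_cycles=[3, 3]))
# -> [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```

The example output also shows the cost noted in the text: the genuine change at index 5 is only accepted at index 7, that is, after the additional SST delay.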
Detection decisions from SST are represented by
and used for
controller reconfigurations. In Figure 2, the confirmation test is an SST with a
large test period that further reduces false alarm probability to a negligible
level. It generates confirmed fault modes, which are used with FDI trajectories
for updating transition parameters of
and reliability
index.
3.2. Stochastic Models
A sample path of
is given in
Figure 3. Let
and
denote the FDI
mode and cycle index, respectively, after the
th transition
of
,
. For example, in Figure 3,
and
.
and
together determine
FDI trajectory, and
, where
is the
discrete-time counting process of the number of jumps in
.
is called a
discrete-time Markov renewal process if
(4)
holds for fixed
,
,
.
is then called
the associated discrete-time semi-Markov chain of
. It can be shown that
is a Markov
chain, and its transition probability matrix is denoted by
.
Figure 3: A sample path of the FDI mode process.
Given
, let
if
and
,
.
is the sojourn
time of
between its
transition to state
at
and the
consecutive transition to
at
. If the transition destination state is not
specified, let
denote the
sojourn time at state
.
As shown in Figure 3,
is the sum of
two variables: a constant
for SST period
and a random sojourn time
. Let
and
denote the
discrete distribution functions of
and
respectively,
which have the following relations:
(5)
This semi-Markov
description provides a general model on FDI mode transitions, but it involves a
large number of parameters. The transition characteristics of
are jointly
determined by
and
(or
). If
contains
fault modes,
there are
transition
probability matrices
and
distribution
functions
. If each
follows
geometric distribution, the description of
may degenerate
to a hypothetical Markov model
.
All Markov chains can be considered as a special type of semi-Markov
chains. If
can be modeled
as a Markov chain with transition probability matrix denoted by
for
, the following relations hold:
(6)
(7)
(8)
It is obvious
that
is a geometric
distribution. In fact, this is an essential property of Markov chains, as shown
in the following lemma.
Lemma 1.
A discrete-time semi-Markov chain degenerates to a
Markov chain if and only if the sojourn time at each state (when the subsequent
state is not specified) follows a geometric distribution.
The proof is given in the appendix. When
is nonzero, the
sojourn time of
does not follow
geometric distribution owing to this deterministic constant, and Lemma 1 cannot
be directly applied. However, as
is known, a hypothetical
process
can be
constructed by setting
to zero; if
the sojourn time of
is
geometrically distributed, it can be described as a Markov chain; the original
sojourn time of
can be
recovered by adding
to that of
. This method may greatly reduce the number of
parameters for characterizing FDI results.
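To make the construction concrete, the following Python sketch writes out the sojourn time distribution of a state whose sojourn is a constant SST period plus a geometric part. It is an illustration only: the names `geometric_pmf`, `sst_sojourn_pmf`, `mean_sojourn`, `p_stay`, and `sst_cycles` are hypothetical, and the geometric part is assumed to take values 1, 2, ... cycles (one common convention; the paper's own convention may differ).

```python
def geometric_pmf(n: int, p_stay: float) -> float:
    """P(sojourn = n cycles), n >= 1, for a Markov state with
    self-transition probability p_stay."""
    if n < 1:
        return 0.0
    return (p_stay ** (n - 1)) * (1.0 - p_stay)


def sst_sojourn_pmf(n: int, p_stay: float, sst_cycles: int) -> float:
    """P(sojourn = n) when a constant SST period of sst_cycles is added
    to a geometrically distributed sojourn of the hypothetical chain."""
    return geometric_pmf(n - sst_cycles, p_stay)


def mean_sojourn(p_stay: float, sst_cycles: int = 0) -> float:
    """Mean sojourn time: SST constant plus the geometric mean 1/(1 - p_stay)."""
    return sst_cycles + 1.0 / (1.0 - p_stay)


# With the constant removed (sst_cycles = 0) the distribution is geometric,
# so the hypothetical process can be treated as a Markov chain; the original
# sojourn time is recovered by shifting the distribution by sst_cycles.
print(mean_sojourn(0.9), mean_sojourn(0.9, sst_cycles=3))
```

Only one parameter per state (`p_stay`) plus the known SST constant is needed here, which is the parameter reduction the paragraph above refers to.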
3.3. Transition Parameter Estimation
FDI transition
parameters can be estimated in an offline test of the FDI scheme, when both fault modes and
FDI detection results are known. The estimation can also be carried out online
using FDI history data and confirmed fault modes.
When
is modeled as a
semi-Markov chain,
and
(or
) are
parameters to be estimated.
can be
estimated from the transition history of
. For example, when
is kept as a
constant
, if there are
transitions
from
to
among all
transitions
leaving
, the
th element of
can be
estimated as
.
The estimation of sojourn time distribution
can be
completed in two steps: the histogram of sojourn times is first examined to
select a standard distribution, so that the nonparametric estimation problem is converted
to a parametric one;
is then
obtained by estimating unknown parameters in distribution functions.
If
follows
geometric distribution for all
,
can be
described as a hypothetical Markov chain
under the
hypothesis that
. As a result, transition probability
from
to
and sojourn
time
at
have the
following relation:
(9)
Therefore,
, and
can be
estimated by
(10)
where
denote
sojourn time
samples at state
,
.
can be
estimated based on the transition frequency from state
to
:
(11)
where
is a
normalization coefficient and
represents the
number of FDI transitions from
to
.
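The estimation procedure for the hypothetical Markov chain can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the self-transition probability is obtained from the mean of the sojourn time samples (geometric sojourn times with the SST constant already removed) and that the remaining probability mass is split in proportion to the observed jump counts. The function names and the mode sequence are hypothetical.

```python
from collections import defaultdict
from typing import List, Sequence


def sojourns_and_jumps(modes: Sequence[int]):
    """Split a mode sequence into completed sojourns and count jumps i -> j.

    The last run is discarded because its sojourn time is not yet complete.
    """
    sojourns = defaultdict(list)
    jumps = defaultdict(lambda: defaultdict(int))
    run_start = 0
    for k in range(1, len(modes)):
        if modes[k] != modes[k - 1]:
            sojourns[modes[k - 1]].append(k - run_start)
            jumps[modes[k - 1]][modes[k]] += 1
            run_start = k
    return sojourns, jumps


def estimate_markov_matrix(modes: Sequence[int], n_modes: int) -> List[List[float]]:
    """Estimate a Markov transition matrix from FDI history data."""
    sojourns, jumps = sojourns_and_jumps(modes)
    P = [[0.0] * n_modes for _ in range(n_modes)]
    for i in range(n_modes):
        samples = sojourns.get(i, [])
        if not samples:
            P[i][i] = 1.0  # placeholder: no departure from state i observed
            continue
        # Geometric sojourn: mean = 1 / (1 - p_ii), so p_ii = 1 - n / sum(samples).
        p_stay = 1.0 - len(samples) / sum(samples)
        P[i][i] = p_stay
        total_jumps = sum(jumps[i].values())
        for j, count in jumps[i].items():
            # Normalize jump frequencies so that the row sums to one.
            P[i][j] = (1.0 - p_stay) * count / total_jumps
    return P


history = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0]
for row in estimate_markov_matrix(history, n_modes=3):
    print(row)
```

With so few samples the estimates are crude; in particular, a state that was always left after one cycle receives a self-transition estimate of exactly zero, which reflects limited data rather than a true zero probability.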
4. Outcrossing Failure Rates and Hard-Deadlines
Owing to FDI
delays or incorrect decisions, controller
may be used for
its designated regime model
(namely,
matched cases) and other model
,
(namely,
mismatched cases). Matched cases usually account for most of the operating time,
while mismatched cases typically arise only temporarily.
Definition 2.
The outcrossing
failure rate in matched cases is defined as
(12)
Monte Carlo simulation can be used for estimating
: each simulation run uses a randomly generated
uncertain plant model and disturbance input; the simulation time
at which the system fails is called a sample time-to-failure. With a large number of
time-to-failure samples obtained,
can be
estimated as the ratio between
and sample mean
of time-to-failure.
Mismatched cases are usually temporary, being
caused by FDI false alarms or delays, and the system may return to matched cases if
does not
diverge to unsafe region. So, it is important to find out the average tolerable
time before system failure. This time limit is called hard-deadline, denoted by
for
and
. It can also be estimated by sample mean of
time-to-failure using Monte Carlo simulations.
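The Monte Carlo procedure can be illustrated with a toy example. The sketch below replaces the closed-loop regime model with a hypothetical scalar first-order system driven by Gaussian noise, so the numbers it produces have no relation to the aircraft example; only the estimation steps (sample times-to-failure, their mean, and the ratio to the FDI period) follow the text. All names and parameter values are assumptions.

```python
import random
from statistics import mean


def sample_time_to_failure(rng, a, noise_std, threshold, dt, max_steps=100_000):
    """Simulate z[k+1] = a*z[k] + w[k] until |z| first leaves the safety region.

    Returns the time (in seconds) of the first outcrossing. Runs that never
    fail within max_steps are censored at the horizon, which biases the
    estimate low; the parameters below make this very unlikely.
    """
    z = 0.0
    for step in range(1, max_steps + 1):
        z = a * z + rng.gauss(0.0, noise_std)
        if abs(z) > threshold:
            return step * dt
    return max_steps * dt


def estimate_rate_and_deadline(n_runs, fdi_period, seed=0, **model):
    """Return (outcrossing rate per FDI cycle, mean cycles to failure)."""
    rng = random.Random(seed)
    samples = [sample_time_to_failure(rng, dt=fdi_period, **model)
               for _ in range(n_runs)]
    mean_ttf = mean(samples)
    # Rate: ratio between the FDI period and the mean time-to-failure.
    # Deadline: mean time-to-failure expressed in FDI cycles.
    return fdi_period / mean_ttf, mean_ttf / fdi_period


rate, deadline = estimate_rate_and_deadline(
    200, fdi_period=0.1, a=0.9, noise_std=1.0, threshold=3.0)
print(f"rate per cycle: {rate:.4f}, mean cycles to failure: {deadline:.1f}")
```

In this sketch the same sample mean yields both quantities; in the scheme described above the rate is used for matched cases and the hard-deadline for mismatched cases, each estimated from its own regime model.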
5. Reliability Model Construction
The states of
semi-Markov chain
for reliability
evaluation are classified into two groups: one unique failure state, denoted by
, and multiple functional states, defined as state
combinations of
and
, denoted as
,
. For example, if two types of faults are considered
in the plant,
includes states
of fault-free, fault type 1, fault type 2, and both fault 1 and fault 2,
represented by
, and
contains 17
states.
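The state count in this example can be checked by enumerating the combinations directly. The labels below are hypothetical; only the structure (four fault modes, four FDI modes, one failure state) comes from the text.

```python
from itertools import product

# Hypothetical labels for the four fault modes of the two-fault example.
fault_modes = ["fault-free", "fault 1", "fault 2", "faults 1 and 2"]
fdi_modes = list(fault_modes)   # FDI estimates range over the same set
FAILURE = "failure"             # the unique absorbing failure state

# Functional states are (fault mode, FDI mode) combinations.
functional_states = list(product(fault_modes, fdi_modes))
states = functional_states + [FAILURE]

matched = [s for s in functional_states if s[0] == s[1]]
mismatched = [s for s in functional_states if s[0] != s[1]]

print(len(states), len(matched), len(mismatched))  # -> 17 4 12
```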
The semi-Markov kernel of
is denoted as
, representing the one-time transition probability in
cycles. It is
determined by the following parameters: (1) transition characteristics of fault
and FDI modes, (2) outcrossing failure rate in state
denoted by
, (3) hard-deadline in state
denoted by
, (4) FDI SST period denoted by
for FDI mode
.
Let us begin with the case that FDI mode can be
described as a hypothetical Markov chain
with transition
probability denoted by
. The calculation of
is classified
into the following cases.
Case 1.
The transitions from functional states to themselves are not defined and the
corresponding elements are assigned as zeros:
(13)
Case 2.
Failure state
is absorbing:
(14)
Case 3.
Initial states are matched states
:
(15)
where
,
,
,
.
The derivation of these equations is based on Markov
transition probabilities and the decomposition of each event.
For example,
(16)
Considering the
SST of FDI, if
,
(17)
If
,
(18)
can be obtained by combining these two probabilities with
.
Case 4.
Mismatched states,
,
. When
, the transition probability of
to any other
state is zero because of SST period. When
, the probability of
transiting to
any other state is zero except to
. The above reasoning is based on two facts: FDI
rarely jumps to other false modes when the current mode is incorrect, and the mean
time to fault occurrence is orders of magnitude longer than a short false FDI
detection period. Therefore, when
,
(19)
When
,
jumps to
at the earliest
time
only:
(20)
In the general cases,
is modeled as a
semi-Markov chain, and the competition probability method discussed in [4] can be utilized.
Definition 3.
Given
and
, the combinational mode is denoted as
,
. Suppose
and the next
combinational mode after the consequent transition of
or/and
at
is
, where
or/and
,
. The probability of this event is called the
competition probability, denoted by
.
The calculation formulas of
were derived in [4, Section 3]
and are omitted here for brevity. As the states of
are mainly
defined as the state combinations of
and
, the calculation of the semi-Markov kernel of
is simplified
when
is available,
as shown in the following listed formulas:
(21)
Although these
formulas appear simpler, both the parameter estimation and the competition
probability calculations impose a much heavier computational burden than the first case,
in which the FDI decision is modeled as a hypothetical Markov chain. Once
is constructed,
calculation of the reliability function and MTTF is straightforward using
available formulas [9].
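As a simplified illustration of this last step, the sketch below computes the reliability function and MTTF for an ordinary discrete-time Markov chain with one absorbing failure state. This is a deliberate simplification, not the semi-Markov formulas of [9]: every transition is assumed to take exactly one cycle, and the two-state matrix is a hypothetical example.

```python
from typing import List


def reliability_curve(P: List[List[float]], failure_state: int,
                      start_state: int, horizon: int) -> List[float]:
    """R(n) = probability of not having reached the failure state by cycle n."""
    n_states = len(P)
    dist = [0.0] * n_states
    dist[start_state] = 1.0
    curve = [1.0 - dist[failure_state]]
    for _ in range(horizon):
        # One-step propagation of the state distribution: dist <- dist * P.
        dist = [sum(dist[i] * P[i][j] for i in range(n_states))
                for j in range(n_states)]
        curve.append(1.0 - dist[failure_state])
    return curve


def mttf_cycles(P: List[List[float]], failure_state: int,
                start_state: int, horizon: int) -> float:
    """MTTF in cycles, approximated by summing R(n) up to the horizon."""
    return sum(reliability_curve(P, failure_state, start_state, horizon))


# Hypothetical example: one functional state failing with probability 0.01
# per cycle, and an absorbing failure state.
P = [[0.99, 0.01],
     [0.00, 1.00]]
print(mttf_cycles(P, failure_state=1, start_state=0, horizon=5000))
```

For this two-state example the sum approaches 1/0.01 = 100 cycles, which gives a quick sanity check. The horizon must be long enough for R(n) to decay to a negligible value, otherwise the MTTF is underestimated.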
6. Demonstration on an F-14 Aircraft Model
6.1. Model Description
A control
problem of F-14 aircraft was presented in [10], and also used as a demonstration
example in MATLAB Robust Control Toolbox.¹ This problem considers the design of a
lateral-directional axis controller during powered approach to a carrier
landing with two command inputs from the pilot: lateral stick and rudder pedal.
At an angle-of-attack of 10.5 degrees and airspeed of 140 knots, the nominal
linearized F-14 model has four states: lateral velocity, yaw rate, roll rate,
and roll angle, denoted by
,
,
, and
, respectively,
two control inputs: differential stabilizer deflection and rudder deflection,
denoted by
and
respectively,
and four outputs: roll rate, yaw rate, lateral acceleration, and side-slip
angle, denoted by
,
,
, and
respectively.
The system dynamics equations are omitted here; they can be loaded in MATLAB 7.1
using the command “load F14nominal”. An additional disturbance input is added to
represent wind gust effects.
The control objective is to have desired handling quality (HQ) responses from
lateral stick to roll rate
and from rudder pedal to side-slip angle
. Under fault-free modes, the HQ models are
and
; when fault occurs, HQ models degrade to
and
, respectively.
The system block diagram is shown in Figure 4, where
F-
represents the
nominal linearized F-14 model, and
and
the actuator
models.
and
represent the
weighted model matching errors. Actuator energy is described by
, and noise is added to the measured output after
antialiasing filters.
Figure 4: Control design diagram for F-14 lateral axis (Courtesy of The MathWorks, Inc.).
The considered fault occurs in two actuators. Under
fault-free mode, their transfer functions are
(22)
Two types of actuator faults are considered here; each
has a mean occurrence time
of FDI periods
or its failure rate is
. Under fault type 1, the transfer function of
becomes
(23)
Under fault type 2, the transfer function of
becomes
(24)
These fault modes are described as the change of
actuator gains and time constants. The set of fault modes is denoted by
, representing fault-free, fault type 1, type 2, and
simultaneous occurrence of both.
6.2. Performance Characterization of Controller and FDI
Four
controllers are
designed, one for each fault mode, to achieve nominal HQ control objectives under
fault-free mode and degraded HQ performance under fault modes. Typical output
trajectories under fault-free mode are shown in Figure 5, where the curves
labeled with “Real” represent the measured outputs, “Ideal” the outputs
under nominal HQ performance, and “Degraded” the outputs under degraded HQ
performance. The absolute minimal matching errors between the real responses
and the expected outputs under ideal HQ performance are shown in Figure 6,
which are assumed to represent system safety behavior. When these matching
errors exceed the safety limit of 30% of the expected output, the aircraft is
considered to have failed.
Figure 5: Output trajectories.
Figure 6: The trajectories of matching errors.
An IMM FDI is constructed to detect fault occurrences.
To reduce false alarms, a steady-state test strategy is applied on FDI
decisions with
for any FDI
mode
. A typical FDI trajectory is shown in Figure 7. It is
clear that the steady FDI mode is free of false alarms in the time period
shown. However, detection delays are introduced when faults occur at 20 and 50
seconds.
Figure 7: FDI trajectory.
To represent FDI detection characteristics, a batch of
fault and FDI history data is collected for statistical estimation. First,
histograms of FDI delays are generated to check their distribution type. When
there is no fault, the histogram of FDI sojourn time at fault-free mode is
shown in Figure 8. It clearly resembles a geometric distribution. Equations (10)-(11)
are then used to estimate Markov transition probabilities, and those under
fault-free mode are obtained as
(25)
Note that
and
represent the
transition probabilities of FDI from a false alarm state. Estimated based on
the given history data, these values imply that the FDI leaves false alarm
state in one transition cycle. But there may exist estimation error, and the
true value of
may be close to
but not exactly zero.
Figure 8: Histogram of FDI sojourn time.
As a result of FDI false alarms, missed detections,
and detection delays, controllers may be engaged under fault modes for
which they were not designed. So, it is necessary to evaluate system behavior
under all possible combinations of FDI and fault modes. Here, Monte Carlo
simulations are adopted with the following settings: (1) command stick inputs
are square waves with frequency as a random variable ranging from 0.2 to 2 Hertz, (2) wind gust disturbances and sensor measurement noises are assumed to
be Gaussian processes, (3) actuator saturation effects limit control inputs to
20 and 30, respectively, (4) system failure is assumed to occur when model
matching errors go over 30% of stick commands. For example, with fault mode 2
occurred and
engaged, mean
time to system failure is 57 403 seconds when controller
is used, and 6
seconds when
is used.
Considering the sampling period to be 0.1 second for IMM FDI, the outcrossing
failure rate and hard-deadline are
,
.
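As a rough check of the orders of magnitude involved, the two quantities can be recomputed from the figures quoted above using the definitions of Section 4. This is a back-of-the-envelope sketch, assuming that the longer mean time (57 403 seconds) belongs to the matched case and the shorter one (6 seconds) to the mismatched case; the variable names are hypothetical and the results are not necessarily the exact values reported by the authors.

```python
fdi_period = 0.1            # seconds, IMM FDI sampling period
mttf_matched = 57403.0      # seconds, mean time to failure (assumed matched case)
mttf_mismatched = 6.0       # seconds, mean time to failure (assumed mismatched case)

# Outcrossing failure rate per FDI cycle: FDI period / mean time-to-failure.
outcrossing_rate = fdi_period / mttf_matched

# Hard-deadline in FDI cycles: mean time-to-failure / FDI period.
hard_deadline_cycles = round(mttf_mismatched / fdi_period)

print(f"{outcrossing_rate:.3e} per cycle, {hard_deadline_cycles} cycles")
```

Under these assumptions the rate is on the order of 1.7e-6 per cycle and the hard-deadline is 60 cycles.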
6.3. Reliability Evaluation
The semi-Markov reliability model can be constructed based
on fault transition rates, FDI transition parameters, outcrossing failure rates,
and hard-deadlines. The predicted reliability function and MTTF can thereby be
calculated. By using MTTF as an objective, an optimization is performed on
. It is found that MTTF will be improved from 27 727
to 32 605 seconds if
is reduced from
6 to 1. A comparison of reliability functions before and after this
optimization is shown in Figure 9, which shows that the reliability index
is improved.
Figure 9: Reliability functions comparison.
Comparisons on the transition probabilities between
these two SST periods are shown in Figure 10, in which each subfigure gives the
transition probability curves from
to other
states. For example, the subfigure at the first row and second column shows
that the transition probabilities to
are increased
from 0 to about 0.008. This is a natural result of increased false alarms when
reducing
. In fact, when
, new Markov transition parameters
become
(26)
Figure 10: Comparison of transition probabilities.
Compared with
, the element on the first row and second column is
increased from 0 to 0.0017, confirming the increase in false alarms. On the other
hand, detection delays are reduced from approximately 6 to 1, and the system spends
less time under mismatched fault and FDI cases. Overall, MTTF is improved.
This
evaluation procedure can be completed in an online manner. Estimated FDI
transition parameters
and current
mode of
provided by
the confirmation test on FDI can be used to provide an updated MTTF based on the most
recent information.
7. Conclusions
A reliability
monitoring scheme for FTCSs is reported in this paper. The scheme contains two
postprocessing strategies on FDI results that provide an estimated fault mode for
control reconfiguration and a confirmed mode for updating reliability. The stochastic
transitions of the FDI mode are represented by a semi-Markov chain with parameters
estimated from history data. Under geometric sojourn time distributions, the FDI
mode can be described by an equivalent hypothetical Markov chain that
simplifies its model and reliability analysis. Safe and satisfactory
operation of the system is defined by system trajectories and safety boundaries;
the probability of violating this safety criterion under fixed fault and FDI
modes is estimated using Monte Carlo simulations. The overall reliability
evaluation is obtained through a semi-Markov model constructed by integrating
FDI transition characteristics and failure probabilities under each regime
model. The scheme provides timely monitoring of the reliability index of
FTCSs and was demonstrated on an F-14 aircraft model.
¹ MATLAB and Robust Control Toolbox are trademarks of The MathWorks, Inc.
Appendix
Proof of Lemma 1.
The “only if” part is trivial as shown in (8). Let
denote a
semi-Markov chain; the associated Markov renewal processes are denoted as
and
, and the sojourn time distribution
when subsequent
state is not specified follows a geometric distribution:
(A.1)
If
,
(A.2)
otherwise,
, and we have
(A.3)
In the above
derivations, the memoryless property of geometric distributions has been used:
(A.4)
The Markov
property of
is proved, so
is a Markov
chain.
References
- G. J. Balas, A. K. Packard, J. Renfrow, C. Mullaney, and R. T. M'Closkey, “Control of the F-14 aircraft lateral-directional axis during powered approach,” Journal of Guidance, Control, and Dynamics, vol. 21, no. 6, pp. 899–908, 1998.
- V. Barbu, M. Boussemart, and N. Limnios, “Discrete-time semi-Markov model for reliability and survival analysis,” Communications in Statistics: Theory and Methods, vol. 33, no. 11, pp. 2833–2868, 2004.
- R. V. Field Jr. and L. A. Bergman, “Reliability-based approach to linear covariance control design,” Journal of Engineering Mechanics, vol. 124, no. 2, pp. 193–199, 1998.
- W. Kuo and M. Zuo, Optimal Reliability Modeling, John Wiley & Sons, Hoboken, NJ, USA, 2002.
- H. Li, Q. Zhao, and Z. Yang, “Reliability modeling of fault tolerant control systems,” to appear in International Journal of Applied Mathematics and Computer Science.
- H. Li and Q. Zhao, “Reliability evaluation of fault tolerant control with a semi-Markov fault detection and isolation model,” Proceedings of the Institution of Mechanical Engineers Part I, vol. 220, no. 5, pp. 329–338, 2006.
- J. Song and A. Der Kiureghian, “Joint first-passage probability and reliability of systems under stochastic excitation,” Journal of Engineering Mechanics, vol. 132, no. 1, pp. 65–77, 2006.
- B. Walker, “Fault tolerant control system reliability and performance prediction using semi-Markov models,” in Proceedings of Safeprocess, pp. 1053–1064, Kingston Upon Hull, UK, 1997.
- N. E. Wu, “Coverage in fault-tolerant control,” Automatica, vol. 40, no. 4, pp. 537–548, 2004.
- Y. Zhang and X. R. Li, “Detection and diagnosis of sensor and actuator failures using IMM estimator,” IEEE Transactions on Aerospace and Electronic Systems, vol. 34, no. 4, pp. 1293–1313, 1998.