Research Article | Open Access
System Reliability Assessment Based on Failure Propagation Processes
One or several component failures may cause related components to malfunction and ultimately reduce system reliability. Based on this, we focus on assessing the system reliability of complex electromechanical systems (CEMSs) from a fault-propagation view. First, a failure propagation model that takes failure data into consideration is proposed for system reliability evaluation, based on network theory and improved polychromatic sets. From the node point of view, a system effectiveness index is constructed to investigate the variation in efficiency of the holistic network. Subsequently, from the system's perspective, a system reliability measurement is provided and estimated by combining the system effectiveness index with the failure propagation model. Finally, the proposed method is applied to the bogie system of a high-speed train to assess system reliability and to illustrate the effectiveness of the method.
A complex electromechanical system (CEMS) is defined as a set of interconnected components that work together to complete a predetermined mission (Wang et al., 2017). Typical CEMSs include high-speed trains, aircraft, nuclear equipment, and so on. A CEMS generally has a higher reliability demand than a simple system in order to ensure safety, owing to its high complexity and maintenance costs. However, with traditional methods of reliability analysis it is usually difficult to assess the reliability of the holistic system in practical operation, for reasons such as the nonlinear coupling among components, the complexity of the fault propagation mechanism, and the diversity of influencing factors. Hence, a novel approach to system reliability assessment is urgently needed to ensure the safe operation of CEMSs.
1.1. Literature Review
Research on the complexity of CEMSs mainly addresses complex structure and complex multifunction. Correspondingly, system reliability is considered from two aspects: function and topology.
In terms of function, reliability, defined as the ability of a product to perform a specified function in a designated environment for a minimum number of events or a minimum length of time, has long been a vital topic in systems engineering. Based on this definition, there has been a steady move over the last few decades towards the systematic use of reliability theory and historical failure data to evaluate and further improve system reliability. These methods include, but are not limited to, fault tree analysis (FTA), reliability block diagrams (RBD), binary decision diagrams (BDD), dynamic fault trees (DFT), Markov models, Petri nets, and Bayesian methods (e.g., [5–17]). However, inherent limitations of these approaches hinder their application to CEMSs. For instance, some methods for modeling system reliability rely on the assumptions that a component has only two states (i.e., functioning and malfunctioning) and that failures are independent. Numerous industrial experiences have shown that these assumptions are unrealistic and may lead to unacceptable analysis errors. Furthermore, these methods take into account neither the specific physical structure of the entire system nor the impact of the failure propagation mechanism among components.
In the meantime, the development of network theory, mostly over the last decade, has provided an increasingly challenging reliability framework for characterizing CEMSs. A network can be regarded as an abstract representation of system structure, in which the components are described as nodes and the interactions among the components are represented as edges. System reliability evaluation then amounts to the assessment of network reliability. Network reliability is concerned with the ability of a network to carry out a desired operation such as “communication.” Based on this definition, network reliability measures can be categorized as follows:
(i) Terminal reliability. It is defined as the probability of achieving connectivity from the input nodes to the output nodes and usually includes two-terminal reliability, K-terminal reliability, and all-terminal reliability. Unfortunately, combinatorial explosion is the main problem when this measure is applied to a CEMS.
(ii) Percolation reliability. It addresses questions of practical interest from a system view, such as “how many failed nodes will break down the whole network.” Percolation reliability is constructed according to a percolation process, and the critical threshold of percolation is used as the network failure criterion. It attempts to overcome the combinatorial explosion problem. However, it disregards the coupling relationships among nodes and the failure propagation mechanism, even though node breakdowns are not independent.
(iii) Efficiency reliability. It reveals how fault tolerant the system is; that is, it shows how efficient the communication among nodes remains when some of the nodes are faulty [24, 25]. The global efficiency, reliability efficiency, and improved reliability efficiency are the more common efficiency reliability indicators.
The biggest advantage of efficiency reliability is that the connectivity of the network is taken into account comprehensively. However, the influences of failure propagation among nodes and of node properties on system reliability are still not considered.
As mentioned above, each type of measure has its own strengths and weaknesses, which need to be carefully considered (see Table 1) when it is applied to actual systems, especially the network of a CEMS. Specifically, the reasons are as follows:
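To make the efficiency-based measure concrete, the following minimal Python sketch computes the standard global efficiency of a directed network, i.e., the average of the reciprocal shortest-path hop counts over all ordered node pairs, before and after removing failed nodes. The toy graph, the function names, and the choice to keep normalizing by the original number of nodes after a failure are illustrative assumptions, not taken from the cited works.

```python
from collections import deque


def shortest_hops(adj, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def global_efficiency(adj, nodes):
    """Average of 1/d(i, j) over ordered pairs; unreachable pairs contribute 0."""
    n = len(nodes)
    if n < 2:
        return 0.0
    total = 0.0
    for i in nodes:
        dist = shortest_hops(adj, i)
        total += sum(1.0 / d for j, d in dist.items() if j != i)
    return total / (n * (n - 1))


def efficiency_after_failure(adj, nodes, failed):
    """Efficiency once the failed nodes and their edges are removed.

    Assumption: the result is still normalized by the original node count,
    so that values before and after the failure are directly comparable.
    """
    pruned = {
        u: [v for v in adj.get(u, ()) if v not in failed]
        for u in nodes
        if u not in failed
    }
    return global_efficiency(pruned, nodes)


# Hypothetical three-node network: a -> b, a -> c, b -> c.
toy_adj = {"a": ["b", "c"], "b": ["c"], "c": []}
toy_nodes = ["a", "b", "c"]
baseline = global_efficiency(toy_adj, toy_nodes)
degraded = efficiency_after_failure(toy_adj, toy_nodes, {"b"})
```

The drop from `baseline` to `degraded` is the kind of quantity an efficiency reliability indicator tracks. Note that the sketch treats every node failure as independent, which is exactly the limitation discussed above.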
√ denotes that this factor is considered in estimating reliability and × represents that this factor is not taken into consideration.
First, the properties of nodes and edges, such as failure rate, reliability, and degree centrality (DC), are ignored. Unlike in traditional network systems, both the nodes and the edges in the network of a CEMS represent components and have their own attributes. Moreover, these attributes have a critical impact on system reliability. That is to say, system reliability is determined by the properties of components and their emergent behaviors. The properties of nodes and edges are therefore necessary for system reliability estimation.
Secondly, failure propagation caused by the coupling relationships among nodes is not considered. These relationships may cause a failure to propagate from one failed node to others, which decreases system reliability. In fact, the failure of a single node or of very few nodes can trigger failure propagation that disables almost the whole network. Unfortunately, most studies focus on the independent failure of one or several nodes, and failure propagation is, more often than not, ignored when system reliability is evaluated.
Thirdly, the edges serve as the medium that makes failure propagation possible. Moreover, the attributes of edges have a great effect on the strength and depth of failure spread. The approaches detailed above explore the connectivity reliability of networks but miss the influence of failure spread.
From the above analysis, it can be seen that failure propagation is an indispensable part of system reliability estimation. The problem of failure propagation in networks is not a new one. Numerous methodologies and models have been developed to describe, predict, and prevent failures or faults. They include classical probability models (Luo et al., 2009), Markovian models (Weber and Jouffe, 2006), Poisson models (Ren and Dobson, 2008), Bayesian models (Marquez et al., 2010), and Monte Carlo models (Lehmann and Bernasconi, 2010). However, these models or methods have very limited applications in actual systems, especially CEMSs. Typically, with progress in structure and integration, systems have become more and more complex, and the assumption of independent failures has proved unrealistic and has led to unacceptable analysis errors (Liu and An, 2014).
Subsequently, with the development of network theory, several failure propagation models based on the clustering property of small-world networks were proposed. The most common approach in these models has been to focus on the so-called most probable propagation path. However, in an actual system a failure may spread from one failed node along multiple paths simultaneously. Multiple nodes may also fail at the same time, triggering several paths. Moreover, if a node fails, the failure will (1) gradually spread to various other nodes owing to the complexity of the propagation mechanism, but (2) not spread to all other nodes, owing to redundant structure. The propagation distances of the individual paths also differ. In addition, the previously proposed approaches focus mainly on the propagation path in the topological sense and omit the effects of functional attributes. This is obviously not entirely satisfactory for the network of a CEMS. Therefore, it is vital to find all probable failure paths and their occurrence probabilities for the analysis of system reliability.
The remainder of this paper is organized as follows. Section 2 introduces brief definitions and notations for network construction and polychromatic sets, together with their improved forms. In Section 3, the failure propagation model is proposed. Based on this, Section 4 defines the function-path length and then provides the system reliability model. Section 5 presents our computational results for the bogie system based on the proposed method. Conclusions and future research are discussed in Section 7.
In this paper, we propose a new method to evaluate system reliability from the fault propagation perspective. Compared with existing methods, the proposed method makes the following central contributions:
(i) The influence of failure propagation is considered in system reliability estimation. The description of failure propagation in the proposed method complies well with the actual process of system failure, and system failure reflects the changes in reliability.
(ii) Both the topology and the function of the system are analyzed comprehensively. By contrast, traditional reliability analysis ignores the influence of topology, and terminal reliability misses the effect of function.
(iii) System reliability is estimated from a system view. The proposed method explores system reliability according to failure propagation paths and system effectiveness, both of which are global variables.
2.1. Improved Network Representation
Network theory is a basic premise of research on system reliability: it is a tool that reflects real information about system topology and structure, and it provides a natural framework for the mathematical representation of system topology. In most research, a CEMS is reduced to a set of nodes connected through directed edges (Wang et al., 2017). Previous studies define a CEMS as a directed network that consists of a set of nodes/vertices and a set of edges/links that connect some of the nodes. Figure 1 shows the network of the suspension system of a bogie. Each component is a single node, whereas an inherent coupling relationship between two components (i.e., at least one physical connection routed directly from one component to the other) is represented by a directed link. Through a cooperation project with China XXX Railway Vehicles Co. Ltd. (under the National High Technology Research and Development Program, 863 Program, No. 2012AA112001), the physical connections are divided into three classes: mechanical, electrical, and information connections. The direction of the edges of each type is fixed (Wang et al., 2017). Table 2 shows the direction of the different edges.
Unfortunately, the properties of edges and nodes are not embodied in the existing network model. These properties are indispensable for completely reflecting the structure and function of the whole system. For the CEMS, the properties of nodes and edges are selected in view of the 863 Program and the professional experience of field experts (see Figure 2).
Therefore, the improved network model is proposed as follows: where is the set of nodes and is the set of edges. shows the node-node adjacency matrix representation of components and connections in the network, where elements represent directed edges with Boolean magnitude as set out. is the number of nodes in the network. is the set of nodes’ properties and mathematical representation of these measures that belonged to are shown in Table 3.
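As an illustration only, a network model of this kind can be held in a small data structure that stores the nodes, the typed directed edges, and the node properties, and derives the Boolean adjacency matrix from them. The sketch below is in Python; the class name, the example edges, the failure-rate value, and the degree-centrality convention (in-degree plus out-degree divided by n - 1) are assumptions made for the example and may differ from the definitions in Table 3.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentNetwork:
    """Hypothetical container for a directed component network with properties."""

    nodes: list                                      # component names
    edges: dict = field(default_factory=dict)        # (src, dst) -> connection type
    node_props: dict = field(default_factory=dict)   # node -> {property: value}

    def adjacency(self):
        """Boolean node-node adjacency matrix (1 marks a directed edge)."""
        index = {name: k for k, name in enumerate(self.nodes)}
        n = len(self.nodes)
        matrix = [[0] * n for _ in range(n)]
        for src, dst in self.edges:
            matrix[index[src]][index[dst]] = 1
        return matrix

    def degree_centrality(self):
        """(in-degree + out-degree) / (n - 1) per node; one common convention."""
        a = self.adjacency()
        n = len(self.nodes)
        result = {}
        for k, name in enumerate(self.nodes):
            out_deg = sum(a[k])
            in_deg = sum(row[k] for row in a)
            result[name] = (in_deg + out_deg) / (n - 1)
        return result


# Component names appear in the case study; the edges and values are made up.
net = ComponentNetwork(
    nodes=["bogie frame", "brake caliper", "brake lining", "gearbox"],
    edges={
        ("bogie frame", "brake caliper"): "mechanical",
        ("brake caliper", "brake lining"): "mechanical",
        ("bogie frame", "gearbox"): "mechanical",
    },
    node_props={"gearbox": {"failure_rate": 1e-6}},  # hypothetical value
)
dc = net.degree_centrality()
```

Keeping the connection type on each edge makes it straightforward to apply the fixed edge directions of Table 2 when the network is built.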
2.2. Improved Polychromatic Sets
The polychromatic set is a recently established system theory (Chaudhry et al., 2000; Li and Da, 2003). Its key idea is to use a standardized mathematical model to simulate different objects. The theory has a significant advantage in set operations and has been considered a contribution to the theoretical development of systems theory. In a conventional set, the elements carry only their names, even though the elements themselves may differ. Names alone cannot represent all other characteristics of each element. In a polychromatic set, by contrast, not only the elements but also the set as a whole can be pigmented with different colors to represent the research object as well as the properties of its elements. Li et al. (2003, 2006) provide a more detailed description. Only the important definitions are presented here for the sake of completeness.
Assume that the composition of a polychromatic set is . The color set of every element is where corresponds to every element , and denotes the th individual color of element .
The color set of the whole set is defined as where corresponds to the entirety of , and represents the th unified color of the entirety of .
The relationship between each element and unified color can be represented using the following Boolean matrix, in which , if and
Let the element be the node, and the color of each element represent the attribute of node . We can use a polychromatic set to describe the properties of components and their relationships. It is important to note, however, that the value of is 0 or 1 in polychromatic set theory. Obviously, the values of attributes in the CEMS, such as DC, CC, BC, and the probability of failure, are not integers. Hence, we extend the definition of and then improve (4) as follows: where is the relationship between the element color and unified color , and represents the value of individual color and its probability value.
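The difference between the Boolean relationship matrix and its real-valued extension can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the node labels, and all attribute values are hypothetical, and the extended entry is taken to be simply the attribute value itself, which is only one reading of the extension described above.

```python
def boolean_color_matrix(elements, unified_colors, element_colors):
    """Classical form: entry is 1 if the element carries the color, else 0."""
    return [
        [1 if color in element_colors.get(e, {}) else 0 for color in unified_colors]
        for e in elements
    ]


def valued_color_matrix(elements, unified_colors, element_colors):
    """Extended form (assumed): entry holds the attribute value, 0.0 if absent."""
    return [
        [float(element_colors.get(e, {}).get(color, 0.0)) for color in unified_colors]
        for e in elements
    ]


# Hypothetical nodes and attribute values, for illustration only.
elements = ["n1", "n2"]
unified_colors = ["DC", "failure_probability"]
element_colors = {
    "n1": {"DC": 0.6},
    "n2": {"DC": 0.2, "failure_probability": 0.01},
}
boolean_form = boolean_color_matrix(elements, unified_colors, element_colors)
valued_form = valued_color_matrix(elements, unified_colors, element_colors)
```

The Boolean form records only whether a node has an attribute, while the valued form keeps the non-integer magnitudes that the reliability model needs.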
2.3. Basic Assumptions of the Models
Reliability evaluation of the CEMS under various operating conditions is a complicated issue. In order to deal with this complexity, the models proposed in this paper are built on the following assumptions:
(i) System failure is caused by node malfunction.
(ii) Edges can help a failure spread but cannot cause the failure.
(iii) A failed node cannot fail again before maintenance.
(iv) The different failure modes of the same component are independent.
3. Failure Propagation Model
In this section, the failure propagation model is proposed to obtain all possible propagation paths and their occurrence probability. All these are an extremely important foundation of system reliability assessment.
3.1. Correlation Matrix of Failure Modes
The failure modes of components, to some extent, reveal the degree of component failure. Serious failure mode of the component will increase the fault pervasion intensity (Shu et al., 2016). Indeed, there is a correlation between different failure modes of different components. Through communicating with experts and consulting the relevant literature, the correlations of failure modes for different components are listed in Table 5.
We can derive the correlation matrix of failure modes among different nodes as follows, where is the correlation matrix of failure mode between two nodes and . where is the possibility of the th failure mode of node , which is caused by the th failure mode of node . And the value of is shown in Table 5. denotes the th failure mode of node .
3.2. Failure Propagation Model
In the previous study, the fault pervasion intensity is defined to describe the process of failure propagation from a single node in a traditional network according to the grade-diffusing process. where is the fault pervasion intensity from node to in the th step. and are the weight of the propagation probability and DC, respectively. The propagation probability from node to , which is directly caused by the th failure mode of node , is . If there is no connection between nodes, is 0. represents the set of nodes, which fail in the th step of failure propagation. is the DC of the th node. is the cluster coefficient.
However, (9) cannot be applied directly to the CEMS. Unlike in traditional networks, the fault pervasion intensity relates not only to the fault propagation probability of edges and the failure probability of nodes but also to the comprehensive importance and failure modes of nodes. This is a consequence of the following two facts: (1) The failure of critical components has a great effect on the inherent topology of the system and on the normal functional realization of the whole system, and it can, to some extent, increase the risk of failure propagation. (2) Through exploratory failure data analysis, we find that the different failure modes of a component represent its degree of performance degradation. A severe failure mode will increase the degree or intensity of failure propagation. Therefore, we improve the calculation formula of fault pervasion intensity in (9) as follows: where represents the failure probability of node in the th step of propagation. is the comprehensive importance (CI) measure (Wang et al., 2017). is the probability of the most likely failure modes of node in the th step of failure propagation. and are the weights.
However, (10) still describes the failure propagation process of a single node. For the CEMS, propagation paths have diversity and complexity due to randomness and uncertainty. In other words, there is a possibility that multiple nodes simultaneously fail to cause multiple propagation paths. Therefore, the failure propagation model for the system level is proposed.
First, we define two kinds of operators:
(1) Corresponding multiplication operator .
If and is -dimensional column vector, then .
(2) Compact multiplication operator .
If and is -dimensional row vector, then .
According to (6) and (10), the failure propagation model, after the -steps fault pervasion, is where where denotes the set of failure paths after the th step of failure propagation. is the state of nodes in the th path after the th step of failure propagation. represents the state of failure nodes in the th path in the th step of failure propagation. is the set of failure nodes in the th path in the th step of failure propagation. is the comprehensive importance measure of failure nodes in the th path in the th step of failure propagation. is the most likely failure modes in the th path in the th step of failure propagation. denotes the number of failure nodes in the th path after the th step of failure propagation. is the th failure mode of node in the th step of failure propagation.
From the energy point of view, energy accumulates constantly within a component, and the energy density increases continuously before the component fails. A fault occurs if the accumulated energy exceeds the maximum capacity of the component. Hence, the following constraints have to be satisfied for (11):
(1) The fault pervasion intensity between components decreases by orders of magnitude as the propagation path length increases. If the fault pervasion intensity is lower than 10^-8, the node is in a secure state; in other words, the failure does not spread further.
(2) If , then the fault propagation stops.
From (11), , which is the set of nodes in the th path, and , which is the occurrence probability of the th propagation path, play an important role for system reliability assessment. In fact, is the th failure propagation path.
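The effect of the stopping constraints on the set of propagation paths can be illustrated with a simplified Python sketch. It is not the model in (11): here the intensity of a path is assumed to be the product of hypothetical per-edge propagation intensities, whereas the paper's intensity also involves node failure probability, comprehensive importance, and failure modes. The function name, the toy network, and all numerical values are assumptions for the example; only the 10^-8 threshold and the rule that a failed node does not fail again come from the text.

```python
THRESHOLD = 1e-8  # below this intensity a node is treated as secure


def propagation_paths(edge_intensity, sources, threshold=THRESHOLD):
    """Enumerate maximal failure propagation paths by depth-first search.

    edge_intensity maps node -> {successor: intensity in (0, 1]}.
    Returns a list of (path, probability) pairs; a source whose failure
    does not spread to any neighbor yields no path.
    """
    results = []

    def extend(path, prob):
        extended = False
        for nxt, p in edge_intensity.get(path[-1], {}).items():
            if nxt in path:            # a failed node cannot fail again
                continue
            new_prob = prob * p
            if new_prob < threshold:   # too weak: the failure stops here
                continue
            extended = True
            extend(path + [nxt], new_prob)
        if not extended and len(path) > 1:
            results.append((path, prob))

    for source in sources:
        extend([source], 1.0)
    return results


# Hypothetical intensities: a -> c is below the threshold and is pruned.
toy_intensity = {
    "a": {"b": 0.1, "c": 1e-9},
    "b": {"d": 0.5, "a": 0.9},
    "d": {},
}
paths = propagation_paths(toy_intensity, ["a"])
```

Passing several nodes in `sources` gives the paths triggered by simultaneous initial failures, each explored independently in this simplified version.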
4. System Reliability Evaluation
In this section, we illustrate how to calculate theoretically the system reliability from failure propagation mechanism point of view. First, system effectiveness measure is proposed to analyze reliability for a node failure based on the function-path length. Then, system reliability is provided in view of the system effectiveness measure and network theory.
4.1. The Function-Path Length
From the view of the network's topology, the topology-path length is the number of constituent edges on a path between two vertices (the so-called path length in the previous literature). In essence, it indicates the physical distance between two generic nodes. However, the network of a CEMS differs from general complex networks such as small-world, random, and scale-free networks. Its nodes and edges correspond to components of an actual system and therefore have multiple properties, both topological and functional. Moreover, the path length should be able to characterize the distance of failure propagation paths. The traditional definition of path length is thus ill-suited to reliability analysis of the CEMS network. Therefore, the function-path length is proposed through a combination of data-based functional properties and network-based topological attributes.
The function-path length is defined as the distance of failure propagation between two nodes. It relates to the topology-path length and to the properties of the nodes and edges (see Figure 2) on the path. Figure 3 shows the basic idea of the calculation of the function-path length. The whole process consists of three stages: (1) The measures of the same type for the nodes or edges on the path are fused by means of a fuzzy integral. (2) The measures that belong to the same property are then integrated. (3) All properties are aggregated, and finally the function-path length is obtained.
Mathematically, the function-path length between nodes and is defined as where is the topology-path length. is the integrated value of all topological properties of nodes in this path, where represents the th measure of the th node in this path, is the weight of all measures, which belong to topological properties of nodes, and. is the integrated value of all functional properties of nodes in this path, where represents the th measure of the th node in this path, is the weight of all measures belong to functional properties of nodes and. is the integrated value of all functional properties of edges in this path, where is the th measure of the edges in this path, and is the weight of all measures belong to functional properties of edges.
Correspondingly, the shortest function-path length is where is the number of the function-path between node and .
4.2. System Reliability Measurement
Most previous studies have dealt with the efficiency measure by using topology-path length. There is no doubt it is not applicable to the CEMS. For this reason, we improve global efficiency and construct system effectiveness (SE) measure based on the function-path length as follows: where is the shortest function-path length.
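The relationship between the shortest function-path length and an efficiency-style measure can be sketched as follows in Python. The sketch assumes that SE keeps the form of global efficiency, i.e., the mean reciprocal of the shortest function-path length over ordered node pairs, and it replaces the fuzzy-integral-based length of (13) with a user-supplied `length_fn`; both choices, along with the function names, node weights, and toy network, are assumptions for illustration.

```python
def simple_paths(adj, src, dst, path=None):
    """Yield every simple directed path from src to dst as a list of nodes."""
    path = [src] if path is None else path
    if src == dst:
        yield path
        return
    for nxt in adj.get(src, ()):
        if nxt not in path:
            yield from simple_paths(adj, nxt, dst, path + [nxt])


def shortest_function_path(adj, src, dst, length_fn):
    """Minimum of length_fn over all simple paths; None if dst is unreachable."""
    lengths = [length_fn(p) for p in simple_paths(adj, src, dst)]
    return min(lengths) if lengths else None


def system_effectiveness(adj, nodes, length_fn):
    """Mean reciprocal shortest function-path length over ordered node pairs."""
    n = len(nodes)
    if n < 2:
        return 0.0
    total = 0.0
    for i in nodes:
        for j in nodes:
            if i == j:
                continue
            d = shortest_function_path(adj, i, j, length_fn)
            if d:  # unreachable (None) or zero-length pairs contribute nothing
                total += 1.0 / d
    return total / (n * (n - 1))


# Hypothetical stand-in for the function-path length: each node entered
# along the path adds a made-up weight.
node_weight = {"a": 1.0, "b": 2.0, "c": 1.0}


def toy_length(path):
    return sum(node_weight[v] for v in path[1:])


toy_adj = {"a": ["b", "c"], "b": ["c"], "c": []}
se_value = system_effectiveness(toy_adj, ["a", "b", "c"], toy_length)
```

Exhaustive path enumeration is exponential in the worst case, so this form is only practical for small networks; it is used here because a length built from fused properties need not be additive over edges, which rules out a plain shortest-path algorithm.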
Due to the complexity and uncertainty of failure propagation, the existence of multiple paths is possible. Obviously, SE measure is not suitable for the CEMS with complicated propagation mechanism. For example, the possibility and relationship of multiple propagation paths are ignored. Hence, a novel system reliability measurement is defined as where is SE measure if node faulted and caused the th failure path. is obtained from (11). is the occurrence probability of the th failure path, which is caused by failure node . is the set of failure nodes in initial state. is the weight of each failure path.
5. Case Study
Throughout the world, high-speed railway offers a fast and comfortable transportation mode with a high carrying capacity. The high-speed train (HST) system, as an essential component of high-speed railway, is the main carrier for transporting passengers from one place to another. To illustrate the method described in Sections 3 and 4, we present a case study of the bogie system. The bogie system, a critical component of the HST system, plays a fundamental role in both improving passenger comfort and maintaining system safety. Figure 4 shows the bogie system of China Railway High-speed X (CRHX), which is a type of HST system. It has been under investigation for many years with the aim of increasing the reliability and safety of the HST system. In particular, understanding its reliability is important as a basis for improving design and finding cost-effective ways to protect system safety.
5.1. Data Analysis
Bogie system consists of the interacting elements, giving rise to the emergence of organization without any external organizing principle being applied. These components, including bogie frame, brake caliper, brake lining, and gearbox (see Table 6), usually interact through the mechanical, electrical, and information connections between them.
In terms of components as well as their connections, bogie system is modeled as a directed network that consists of 33 nodes and a series of edges connecting some of the components as shown in Figure 5. The mathematical expression of the network for the bogie system is as below:
The nodes in Figure 5 are in one-to-one correspondence with the components in Table 6. In addition, the directions of the edges, namely mechanical, electrical, and information connections (Wang et al., 2017), are fixed as listed in Table 2.
Based on (17), the topological properties of nodes, such as DC, BC, and CC, can be readily computed. Figures 6(a)–6(c) plot the DC, BC, and CC, respectively. The results show that node , on average, is the most critical component in topology. This is not surprising given its “core status.” Indeed, about 60.6 percent of components are directly installed on the bogie frame (node ) in order to support the train. The importance of node is perhaps self-evident from the topological point of view. However, an interesting observation from the failure data is that the topologically critical nodes, such as the bogie frame (node ), achieve high reliability. These components are not especially prone to failure, but once they fail, the consequences are disastrous.
Furthermore, Figure 6(d) shows the comprehensive importance (CI) of all nodes for comparison. One striking result is that, by the CI assessment, the most influential component is node rather than node . The reason is that the CI measure considers the effects on node importance comprehensively, whereas the topological properties concern node importance in topology only. Obviously, the CI measure is more applicable to the HST system, since human factors and uncertainty can be effectively reduced. Therefore, we select the CI measure for use in system reliability evaluation.
The properties of nodes and edges include topological and functional attributes: topological properties (see Figure 6) can be derived from the network model in (17), and functional attributes can be collected from historical failure data. Functional properties are the data basis for the analysis of system reliability. Through a project (863 Program, number 2012AA112001), the historical failure databases of the bogie system of CRHX during 2011–2015 were provided; they are essential for investigating system reliability. Each failure data record contains the failure ID number, the vehicle ID number, the section of failure, the failure mode, the date of failure, the environment of failure, and so on. We preprocess the data by removing irrelevant items. The preprocessed failure data of the components in Table 6 are presented in Table 7.
Furthermore, it is worth noting that edges also correspond to components in the network of bogie system. Hence, edges’ functional properties can be calculated through historical failure data, and they also have great influence on system reliability. Table 9 lists the functional properties of edges within 120 million kilometers based on equations in Table 4.
5.2. System Reliability of Bogie System
5.2.1. Failure Propagation Model
As revealed from (11), both and are the weights of the influence factors of failure propagation. To make the model and the corresponding analysis simple, we here assume . And the critical nodes (i.e., , , and ) and noncritical nodes (such as , , and ) are selected as a fault source for the expression of failure propagation process, respectively.
Table 10 illustrates all possible failure propagation paths and their probabilities if the node fails. An interesting observation is that node , which is a topologically critical node, does not cause failure propagation. As noted earlier, node (bogie frame) is a critical skeleton component; once it breaks down, serious consequences may result for the whole bogie system. Therefore, node is usually given higher reliability in the design and manufacturing phases and hardly ever malfunctions. Another interesting observation, as presented in Table 10, is that the paths caused by critical nodes are shorter than those caused by noncritical nodes. Besides, the longer the path, the smaller the probability of the failure path. These results are consistent with the observations from historical failure data. The reasons include the inherent redundancy devices and warning devices for critical nodes, as well as improved designs that prevent further failure propagation.
As a graphical illustration, Figure 7 presents the failure propagation paths of the nodes in Table 10. The red nodes represent the fault sources, and the blue nodes are the nodes that fail as a result of failure propagation from other nodes. Edges of different colors describe the different propagation paths. We can see from Figure 8 that the topology-path length of failure propagation is short, usually lower than 3. Figure 8 also demonstrates that a single failed node does not cause the failure of all other nodes in the network. In other words, failure propagation has limits.
5.2.2. System Reliability
Note that the function-path length is an important quantity for observing system reliability. To illustrate, take a concrete example of the path (i.e., ). According to (13), we first need to determine the type of integral. In general, fuzzy integrals include the Choquet integral (Marichal, 2000), the Sugeno integral (Klement et al., 2010), and the Weber integral (Tomaschitz, 2014). The choice is important because the integral determines how the weights of the various properties or measures and their relationships are described. Here the Choquet integral is selected to integrate the multiple properties or measures, for the following reasons: (1) The Sugeno integral considers only the most critical factors and ignores all others. (2) The Weber integral gives the infimum of information fusion. (3) The Choquet integral takes all factors into consideration and also gives a definite value.
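For readers unfamiliar with these integrals, the following Python sketch implements the standard discrete Choquet and Sugeno integrals with respect to a fuzzy measure. The scores and the fuzzy-measure values are hypothetical and are not the weights used in the case study; the function names are likewise assumptions for the example.

```python
def choquet_integral(values, measure):
    """Discrete Choquet integral of values (dict: criterion -> score >= 0).

    measure maps a frozenset of criteria to its fuzzy measure in [0, 1]; it
    must be defined for every coalition of the form "all criteria scoring at
    least as much as the current one".
    """
    order = sorted(values, key=values.get)      # criteria by ascending score
    total, previous = 0.0, 0.0
    for k, criterion in enumerate(order):
        coalition = frozenset(order[k:])        # criteria scoring >= current
        total += (values[criterion] - previous) * measure[coalition]
        previous = values[criterion]
    return total


def sugeno_integral(values, measure):
    """Discrete Sugeno integral: max over k of min(score_k, measure of coalition_k)."""
    order = sorted(values, key=values.get)
    return max(
        min(values[c], measure[frozenset(order[k:])]) for k, c in enumerate(order)
    )


# Hypothetical normalized scores for three topological measures of one node.
scores = {"DC": 0.2, "BC": 0.5, "CC": 0.9}
# Hypothetical fuzzy measure, given only on the coalitions the sweep visits.
mu = {
    frozenset({"DC", "BC", "CC"}): 1.0,
    frozenset({"BC", "CC"}): 0.7,
    frozenset({"CC"}): 0.4,
}
choquet_value = choquet_integral(scores, mu)   # 0.2*1.0 + 0.3*0.7 + 0.4*0.4
sugeno_value = sugeno_integral(scores, mu)
```

Every score contributes to the Choquet result, whereas the Sugeno result is decided by a single max-min term, which matches the contrast drawn in reasons (1) and (3) above.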
Based on (13), the weights, such as , , , and , can be obtained by Labreuche and Grabisch (2013). Therefore, the function-path length is as below and Figure 9 explains the basic ideas of the calculation of function-path length. where
Similarly, and are also calculated as follows:
Finally, according to (14), the shortest function-path length is obtained as a compact expression.
According to (16), the results of system reliability are reported in Table 11 if node or malfunctions. It can be seen from Table 11 that, as expected, system reliability can be obtained whether a single node or several nodes fail. It can also be seen that the system reliability is lower if more than one node fails.
6.1. Analysis of Parameters
6.1.1. The Parameters in Failure Propagation Model
In order to verify the effectiveness of the proposed failure propagation model, we discuss the effect of the weight on fault pervasion intensity. Figure 9 shows the relationship between the number of steps of failure propagation and the parameter . An important observation from Figure 9 is that the higher the weight, the fewer the steps of failure propagation. In addition, the influence of the weights on the failure propagation of critical nodes is no more significant than that on noncritical nodes. These results further indicate that the impact of critical nodes on system reliability cannot be ignored.
To further illustrate the effectiveness of this model, previous methods, namely the signed directed graph-fault graph (SDG-FG) method (Hu et al., 2015) and the improved fuzzy fault Petri net-based (IFFPN) method (Wang et al., 2013), are compared with the proposed failure propagation model in Table 12. With the SDG-FG method, the failure propagation path with the highest risk is found with the ant colony algorithm. As Table 12 shows, the proposed method obtains all possible failure propagation paths and their probabilities, whereas the IFFPN-based method derives only one path for each failure node and the SDG-FG model obtains only the highest-risk path for the whole network. Unlike a general network, the bogie system, as a complex electromechanical system, has complex topology and function and is also affected by complex operating environments. Hence, the analysis of multiple paths helps maintenance personnel to find the faulty component quickly and to reduce economic losses according to actual conditions. Furthermore, the results of the proposed model coincide well with the paths derived from failure data, which again supports the effectiveness and feasibility of the proposed method.