Abstract

A system with more than two states is called a multistate system (MSS), and such systems have become common among complex industrial products and systems. Fault-tolerant technology often plays a very important role in improving the reliability of an MSS. However, the existence of imperfect coverage failure (ICF) in a work-sharing group (WSG) decreases the reliability of the MSS. A method is proposed to assess the reliability and sensitivity of an MSS with ICF. The components in a WSG can cooperate so as to improve overall efficiency by increasing performance levels. Using the technique of the universal generating function (UGF), a component’s UGF expression with ICF can be incorporated in two steps. During the computation of the system’s UGF, an algorithm based on matrix (ABM) is developed to reduce the computational complexity. Consequently, indices of reliability can be easily calculated based on the UGF expression of an MSS. Sensitivity analysis can help engineers judge which WSG should be eliminated first under various resource limitations. Examples illustrate and validate this method.

1. Introduction

The fault-tolerant system is a high-reliability system designed by incorporating redundant components for critical elements in order to prevent the overall system from failing even when some of its individual elements fail. Fault-tolerant systems are often used in life-critical applications such as flight control, nuclear plant monitoring, and space missions and in mission-critical computer monitoring systems and data storage systems [1]. In addition to redundancy, the implementation of fault tolerance also requires automatic recovery and reconfiguration mechanisms. That is to say, even if sufficient redundancy exists, if the system cannot adequately detect, locate, and recover from internal faults and/or errors that have occurred, the entire system or one of its subsystems can fail [2]. The degree of fault tolerance is determined by the proportion of faults from which a system automatically recovers, and these faults are said to be covered by the recovery strategy [3]. Therefore, the reliability analysis of such systems must take into account the process by which faults and errors are detected and recovered from, as well as the complex system structure.

The problem of determining a system’s fault tolerance and its trend has been intensively explored, and special methods and analysis techniques for evaluating system reliability have been put forward [4]. A new, simple, and efficient approach has been presented for incorporating imperfect fault coverage into a combinational model [5]. The optimal design of systems subject to imperfect fault coverage has been formulated for G: (k/n) structures [6]. Furthermore, the optimal reliability of systems subject to imperfect fault coverage has been generalized to more system structures, including parallel-series, series-parallel, and parallel structures, and to the case of non-s-identical components [7]. According to the type of fault-tolerant technique, an appropriate modeling method for multifault coverage has been suggested for evaluating reliability indices in fault-tolerant systems, and it is easy to use for hierarchical structures of fault-level coverage groups [8]. An approach based on a binary decision diagram has been proposed to analyze the reliability and sensitivity of a benchmark network [9]. Based on the total probability theorem and a divide-and-conquer strategy, a new combinational approach to handling functional dependency has been put forward for the reliability analysis of imperfect coverage systems [10]. A long-neglected issue is that an initially relevant component can become irrelevant after the failures of other components, and this issue has been modeled through the coverage of irrelevant components in systems with imperfect fault coverage [11].

MSS reliability theory and related concepts have been investigated from many aspects. In an MSS, the system and its elements can function at a range of different performance levels, e.g., from perfect operation to complete failure. A universal moment generating function has been extended to incorporate common cause failure into MSS reliability estimation [12]. A new second-order reliability method without parabolic approximation of the fitted quadratic surface has been presented to improve the accuracy of reliability analysis [13]. Reliability modeling of an MSS with preventive maintenance and customer demand has been proposed to improve the reliability of the MSS [14]. Estimating the system state and the degraded sensor state with a Kalman filtering approach provides the basis for calculating reliability when making a dynamic maintenance decision [15]. A life cycle cost reliability model with copula has been established for systems with multiple dependent degradation processes and environmental influence [16]. Simulation is often used to evaluate an MSS because direct evaluation by analytical methods is computationally complex [17]. Similarly, an efficient simulation method based on the survival signature has been proposed for system reliability analysis [18]. A complex framework based on the integrated direct partial logic derivative (DPLD), whose computational complexity correlates with the number of system components and does not depend on the structural complexity of the MSS, has been developed for qualitative and quantitative analysis focusing on component criticality [19]. With critical system states calculated by DPLD, a multivalued logic method has been used for the reliability analysis of MSS [20]. The application of the structure function has been presented in time-dependent reliability analysis, and DPLDs can also be used to find formulae for computing the most commonly known time-dependent importance measures [21].
Besides the reliability approaches mentioned above, it is also critical to determine the impact of each component on MSS performance. Because of the nonbinary states of an MSS and its components, as well as the dependences among different states of the same component, the performability analysis of an MSS becomes more difficult, and the multivalued decision diagram has been presented as an efficient algorithm for analyzing the MSS [22]. Degraded performance is a common phenomenon in industrial products, and reliability is a vital quality of the MSS for providing the required performance level. An integrated routing risk model has been constructed, and its risk control performance has been demonstrated by a simulated algorithm [23]. Other risk assessment models and risky multicriteria decision-making steps have been developed [24, 25]. Focusing on performance analysis and optimization, a wireless sensor network framework has been constructed [26, 27]. For industrial cyberphysical systems, safety control and performance monitoring have also been studied to answer how urgent they are and what degrees of fault tolerance and fault recovery are needed [28–30]. Although many researchers have devoted effort to different aspects of system reliability and performance, the computational complexity of an MSS remains a notorious issue because of the curse of “dimension explosion.”

An MSS can also be subject to imperfect coverage failures that lead to the failure of the whole system or its subsystems. Based on the ordered binary decision diagram, an efficient algorithm has been proposed for the reliability evaluation of an MSS with a combinational performance requirement subject to imperfect fault coverage [31]. Considering the importance of the system-component state of an MSS, Griffith’s importance measures and reliability have been evaluated by combining conditional probability methods to find solutions for the multistate imperfect fault coverage model [32]. An optimal structure of an MSS with uncovered failure can attain maximum reliability through a proper balance between two types of task parallelization: parallel task execution with work sharing and redundant task execution [33]. The MSS with three different types of imperfect fault coverage, namely, element-level coverage, fault-level coverage, and performance-dependent coverage, has been modeled for evaluating the reliability of the system [34]. Subsequently, the same two types of parallelization of an MSS with multifault coverage have been studied to obtain the optimal trade-off under various settings of the fault coverage factor [35]. Stochastic multivalue models have been proposed to predict the reliability of a multistate phase-mission system with three different imperfect fault coverage conditions, and the efficiency of this model has been compared with the universal generating function (UGF) technique [36]. Within a subsystem of an MSS, when the effectiveness of the recovery mechanism depends on the entire performance level of the subsystem, the MSS performance distribution can be obtained by a recursive procedure based on the UGF [37]. A modification of the generalized reliability block diagram method has been suggested for the reliability assessment of an MSS with imperfect multifault coverage [38].
However, the computational time of these models and methods is proportional to the number of components and, hence, their computational complexity is high. The UGF is a tool for efficiently dealing with the performance distribution of a complex MSS [39], and many reliability assessment studies of MSS have adopted it as the primary analysis tool [40–43].

The novelty of this paper is a revised approach, based on the UGF technique, for assessing the reliability of an MSS with imperfect coverage failures (ICFs). Here, the ICF is modeled by incorporating a work-sharing group (WSG) scheme, which resembles the parallel structure, although the two are different types of data-transmitting schemes. During the assessment, an algorithm based on matrix (ABM) is developed to significantly reduce the computational complexity. This approach is compared with other methods in terms of time consumption. Furthermore, engineers can use the sensitivity analysis to decide which work group with ICF should be eliminated first under limited conditions.

The remainder of this paper is organized as follows. Section 2 describes the imperfect coverage failure model. In Section 3, the UGF technique is revised to incorporate the ICF in order to perform the reliability and sensitivity analysis for the MSS with ICF. In Section 4, this approach is illustrated by several applications. Section 5 includes the authors’ conclusions and suggests future research dealing with the reliability of MSS.

2. Modeling of Imperfect Coverage and MSS

2.1. Imperfect Coverage Model

Many models have been developed for the ICF, such as element-level coverage models and fault-level coverage models. Here, the imperfect coverage failure model with single point failure can be structured as shown in Figure 1 for reliability analysis [10, 44]. A single point failure, also referred to as a single fault, occurs at one component, and its coverage probability depends solely on the properties of the failed component. The entry point to the model signifies the occurrence of a single point failure. There are three types of exits, R, C, and S, that signify the different possible outcomes, each with its own exit probability. Exit R indicates that the offending failure is transient and can be handled without discarding the component. Exit C implies the occurrence of a permanent covered failure in the component, which needs to be discarded for the normal operation of the system. Exit S signifies an uncovered failure that, because the component is not discarded, leads to system failure. These three outcomes form a partition of the event space, so the three exit probabilities sum to one. Multipoint failures, in which the system fails because of concurrent faults on multiple components, will be modeled and studied in the future.

In order to improve the reliability of a fault-tolerant system with imperfect coverage failures, a scheme of a work-sharing group (WSG) is introduced [33]. A WSG usually consists of several identical components that are connected in parallel and form a set. Every component fulfills a different subtask so that its WSG can finish the entire task. For a multichannel data transmission system from A to B, as shown in Figure 2, the components C1 and C2 make up a WSG to share the work under the reconfiguration of a data exchange management system (DEMS). Here, the DEMS is assumed not to fail because it can be implemented as built-in software or in other forms. The WSG is in parallel with component C3 so that they can undertake the data transmission task together. Because of the DEMS, the same task that C3 transmits as data package DP3 is divided into two sections, DP1 and DP2, according to the performance of components C1 and C2. Under this assumption, if one component in the WSG, such as C1, fails, the DEMS can discover the failure and will not assign data packages to that component again. This type of failure is defined as a covered failure. The DEMS then reassigns the data transmission task to C2. The WSG does not fail completely, but its performance decreases as a consequence. However, if C1 fails but the DEMS cannot discover the failure and continues to assign data transmission to C1, C1 will be in the state of exit S, namely, single point failure. A failure occurring in this situation is defined as an uncovered failure, and the WSG cannot finish the task of data transmission.

The sole purpose of a parallel structure is to improve the reliability of the system. However, if there is no DEMS, the data packages will not be divided, and the efficiency of the system cannot be improved to transmit the data packages. Introducing the WSG into the system will increase efficiency, but due to the change in the system structure, reliability may be affected.

2.2. Modeling of MSS

For the purpose of modeling an MSS, the characteristics of its elements must first be defined. Generally speaking, any element in an MSS can have several states, each corresponding to a performance level, and these levels can be collected in a set with one entry per state. The current performance level of an element at any instant in time is a discrete random variable that takes its value from this set. The probabilities of the different states or performance levels of an element can be collected in a second set of the same size, in which each entry is the probability that the element is in the corresponding state.

Furthermore, the states of one element form a complete set of mutually exclusive events. That is to say, the element is always in one and only one of its states, so that the state probabilities of each element sum to one.

The performance level distribution of an element is determined completely by the collection of pairs of performance level and corresponding probability over all of its states.

The system elements have certain performance levels corresponding to their respective states at one instant in time. The performance level of the MSS is completely determined by the performance levels of its elements. Therefore, the states of an MSS are determined completely by the states of its components. Now, suppose that the MSS has a finite number of different states, each with a corresponding performance level. In this case, the MSS performance level at a certain moment is a random variable that takes values from the set of these system performance levels.

The probability mass function (PMF) of the MSS performance levels can be obtained as follows:

Using the Cartesian product operation, we can define the space of all possible combinations of performance levels for all system components as

The MSS structure function is naturally introduced as

The structure function maps the space of the components’ performance levels into the space of the performance levels of the MSS.

From the above analysis, we can see that the model of an MSS includes two parts: PMF of performance levels for all system components and the structure function of the system. These can be rewritten as follows:

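For an MSS whose performance level is defined as task completion time, its reliability can be expressed as the probability that the system satisfies the maximum allowed completion time. From (5), one can obtain this reliability as the sum of the probabilities of all system states whose completion time does not exceed the maximum allowed value.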

Another important measure is the conditional expected performance. This expresses the system’s expected performance under the condition that the MSS is in an acceptable state. Given the system reliability, this measure can be calculated as the probability-weighted sum of the performance levels of the acceptable states divided by the reliability.
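The two indices above can be illustrated with a minimal Python sketch. It assumes that the system performance distribution is stored as a dictionary mapping completion time to probability, with an infinite time standing for failure; the names `reliability`, `conditional_expected_performance`, `system_pmf`, and `theta`, as well as the numerical values, are hypothetical and serve only as an illustration.

```python
import math


def reliability(pmf, theta):
    """Probability that the completion time does not exceed theta."""
    return sum(p for g, p in pmf.items() if g <= theta)


def conditional_expected_performance(pmf, theta):
    """Expected completion time, given that the system is in an acceptable state."""
    r = reliability(pmf, theta)
    if r == 0:
        raise ValueError("no acceptable state has positive probability")
    return sum(g * p for g, p in pmf.items() if g <= theta) / r


# Hypothetical system PMF: completion time in seconds -> probability.
system_pmf = {20.0: 0.5, 30.0: 0.3, math.inf: 0.2}
theta = 30.0  # hypothetical maximum allowed completion time

print(reliability(system_pmf, theta))
print(conditional_expected_performance(system_pmf, theta))
```

States with infinite completion time never satisfy the threshold, so they lower the reliability but do not enter the conditional expectation.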

3. UGF Technique and Reliability Evaluation

3.1. UGF Technique

The UGF, also called the u-function or universal z-transform [45], has been proven to be an effective method for solving high-dimensional combinatorial problems. The UGF of a multistate component, associated with its performance level PMF, can be defined as a polynomial whose exponents are the performance levels and whose coefficients are the corresponding state probabilities:

The essential property of the UGF is that the entire UGF of an MSS whose components are connected in series or in parallel can be obtained by simple algebraic operations on the individual UGFs of its multistate components. To represent the PMF of the stochastic variable given by the structure function of the component performances, the composition operator is defined in the following equation:

Note that the resulting polynomial represents all possible mutually exclusive combinations of the states of the individual independent components. The composition function is determined according to the physical nature of the interaction between the performances of the components.

Indeed, the derivation of the system UGF for various types of systems is usually a difficult computational task. As shown in [45], representing it in a recursive form is beneficial from the two perspectives of computational simplicity and derivation clarity. In particular, when an MSS has a complex configuration, the entire system can be represented as the composition of subsystems corresponding to subsets of multistate components. This property can be defined by

The configuration of any MSS can always be represented as a composition of independent subsystems containing only components connected in parallel or in series. For any components connected in parallel or in series in the MSS, the composition operator can be applied recursively in order to obtain the UGF of the intermediate pure parallel or pure series structures.

Consider one type of MSS, a task processing computer system whose performance level is defined as task completion time. For components connected in series, the system’s total completion time is the sum of the completion times of all its components. If two independent components work in series, the total completion time is the sum of their individual completion times, so the composition function should calculate the sum of the corresponding parameters. The performance of the pair of components in this case is defined as

For components connected in parallel, the total completion time is determined by the component with the shortest completion time. The composition function should therefore return the minimum of all parameters, and the UGF in this case takes the following form:

For components connected in parallel within a WSG, the task processing can be divided in proportion to their processing speeds. That is to say, components share the task according to their performance levels. Because the performance level is a completion time, the composition function should operate on its inverse, and the UGF for this pair of components can be determined by the function
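The three composition functions can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the exact form of (15)–(17): completion times are positive numbers, a failed component has an infinite completion time, and work sharing is read as adding processing speeds (the inverses of the completion times). The function names are hypothetical.

```python
import math


def series_time(t1, t2):
    """Series connection: the completion times add up."""
    return t1 + t2


def parallel_time(t1, t2):
    """Plain parallel connection: the faster component decides."""
    return min(t1, t2)


def work_sharing_time(t1, t2):
    """Work sharing: processing speeds (1 / time) add up.

    Assumes t1 and t2 are positive; an infinite time contributes zero speed.
    """
    speed = 1.0 / t1 + 1.0 / t2
    return math.inf if speed == 0.0 else 1.0 / speed


print(series_time(15.0, 20.0))
print(parallel_time(15.0, math.inf))
print(work_sharing_time(15.0, 15.0))
print(work_sharing_time(15.0, math.inf))
```

Under this reading, two identical components that share a task finish in half the time of one component, and a group with one failed member falls back to the completion time of the surviving member.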

3.2. Reliability Analysis of MSS with ICF

For the MSS with ICF, the single point failure of a component can usually be modeled as state 0 with an assigned performance level [31]. The assigned performance level can coincide with the component performance in the permanent coverage. The specific performance level is related to the system property and is fixed here for convenience. According to (8), the UGF of the component can be expanded into two terms, where the first term corresponds to the single point failure state and the second term represents the UGF of the component except for its single point failure state. If the component is not subject to single point failure, the probability of state 0 is zero.

For the WSG with single point failure, its UGF can be obtained in three steps:
(1) The UGF of every component in the WSG is expressed according to (18).
(2) The UGF of the WSG excluding the single point failure states is obtained by composing the components’ second terms from (18) with the work-sharing function in (17).
(3) The UGF of the WSG is calculated as follows:

Once the UGF of the WSG with ICF is obtained, the WSG can be treated as an ordinary component and composed with the other components according to the structure of the system. The UGF of the entire system can then be expressed easily. By applying (9) and (11), the reliability indices can be calculated.
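The three steps can be sketched in Python under one plausible reading of the procedure, which is an assumption: the WSG is taken to fail, with infinite completion time, whenever at least one member has an uncovered failure, and otherwise the remaining states of the members are composed with the work-sharing function. Each component is given as a pair of its uncovered-failure probability and the distribution of its other states; the names `compose`, `work_sharing_time`, and `wsg_ugf` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def compose(u1, u2, phi):
    """Combine two UGFs (dict: performance -> probability) with function phi."""
    out = defaultdict(float)
    for (g1, p1), (g2, p2) in product(u1.items(), u2.items()):
        out[phi(g1, g2)] += p1 * p2
    return dict(out)


def work_sharing_time(t1, t2):
    """Assumed work-sharing rule: processing speeds (1 / time) add up."""
    speed = 1.0 / t1 + 1.0 / t2
    return math.inf if speed == 0.0 else 1.0 / speed


def wsg_ugf(components, failure_level=math.inf):
    """components: list of (p_uncovered, pmf_without_uncovered_state)."""
    # Step 2: compose the states that exclude the uncovered failure.
    covered = None
    for _, pmf in components:
        covered = pmf if covered is None else compose(covered, pmf, work_sharing_time)
    # Step 3: any uncovered failure brings the whole group down (assumption).
    p_no_uncovered = math.prod(1.0 - p0 for p0, _ in components)
    result = defaultdict(float, covered)
    result[failure_level] += 1.0 - p_no_uncovered
    return dict(result)


# Step 1: each component works in 15 s w.p. 0.7, has a covered failure
# w.p. 0.15 (infinite time), and an uncovered failure w.p. 0.15.
c = (0.15, {15.0: 0.7, math.inf: 0.15})
for level, prob in sorted(wsg_ugf([c, c]).items()):
    print(level, round(prob, 4))
```

Because the per-component distributions omit the uncovered state, their composed probabilities sum to the probability that no uncovered failure occurs, and the remaining mass is assigned to the failure level.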

During the calculation, the computational effort of determining the combined UGF of two components is high if it is carried out manually. This is why the algorithm based on matrix (ABM) is developed. The seven steps of the ABM, along with their substeps, are as follows:
Step 1: the UGFs of two components are expressed as matrices A and B, respectively. Both A and B have two rows. The first row contains the performance levels, and the second row contains the corresponding probabilities. The number of columns in A and B may be different.
Step 2: define a matrix C with two rows. The number of columns of C is equal to the number of columns of A multiplied by the number of columns of B. Matrix C stores the primitive values obtained by combining matrices A and B according to specific calculation rules.
Step 3: by applying the dual-iteration method, the elements of matrix C are obtained by the following rules:
(i) Define the first column of A and the first column of B as the outer and inner iteration variables, respectively. Let k represent the column index of matrix C, with k = 1 at the beginning.
(ii) The value in the first row of column k of C is calculated from the corresponding elements of A and B using the composition function for the series, parallel, or work-sharing connection, based on (15)–(17).
(iii) The probability in the second row of column k of C is obtained by multiplying the corresponding probabilities of A and B. In the first iteration, C (2, k) is equal to the product of A (2, 1) and B (2, 1).
(iv) Assign k + 1 to k, viz., k = k + 1.
(v) Move the inner iteration variable to the next column of B and repeat the above three steps until the last column of B.
(vi) Move the outer iteration variable to the next column of A and repeat the above four steps until the last column of A.
Step 4: sort the values in the first row of C, i.e., the performance levels, in ascending order. The elements of the second row of C are reordered correspondingly.
Step 5: extract the unique elements from the first row of matrix C to form the row vector E.
Step 6: define a result matrix D whose first row is equal to E. The second row of D is constructed by the following iteration steps:
(i) Define the first column of matrix D as the iteration variable.
(ii) Starting from C (1, 1), locate the entries in the first row of C that have the same value as D (1, 1).
(iii) Add the corresponding probabilities at the locations determined in the previous step to D (2, 1).
(iv) Move the iteration variable to the next column and repeat the above two steps until the last column of D.
Step 7: the result of combining matrices A and B is stored in matrix D, which contains only the performance levels and their probabilities. Matrix D can then be rewritten as a UGF expression.
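The steps of the ABM can be sketched in Python as follows. This is a minimal sketch that follows the described steps with plain nested lists and zero-based indices; the function name `abm` and the example matrices are hypothetical, and the composition function `phi` is passed in by the caller.

```python
import math


def abm(A, B, phi):
    """Combine two 2-row matrices (row 0: levels, row 1: probabilities)."""
    # Steps 2-3: fill C column by column with an outer loop over A
    # and an inner loop over B.
    C = [[], []]
    for ga, pa in zip(A[0], A[1]):
        for gb, pb in zip(B[0], B[1]):
            C[0].append(phi(ga, gb))
            C[1].append(pa * pb)
    # Step 4: sort the columns of C by performance level.
    order = sorted(range(len(C[0])), key=lambda k: C[0][k])
    levels = [C[0][k] for k in order]
    probs = [C[1][k] for k in order]
    # Step 5: extract the unique levels into E (levels are already sorted).
    E = []
    for g in levels:
        if not E or g != E[-1]:
            E.append(g)
    # Step 6: accumulate the probabilities of equal levels into D.
    D = [E, [0.0] * len(E)]
    col = 0
    for g, p in zip(levels, probs):
        while D[0][col] != g:
            col += 1
        D[1][col] += p
    # Step 7: D holds the combined performance distribution.
    return D


# Hypothetical example: two components in series (completion times add up).
A = [[15.0, math.inf], [0.7, 0.3]]
B = [[20.0, math.inf], [0.75, 0.25]]
D = abm(A, B, lambda a, b: a + b)
print(D[0])
print([round(p, 4) for p in D[1]])
```

Sorting before merging means that equal performance levels end up adjacent, so a single pass is enough to accumulate their probabilities.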

The ABM is used repeatedly in the following examples.

4. Illustrative Examples

4.1. Reliability Evaluation

Consider a task processing system with the structure shown in Figure 2, in which components C1 and C2 comprise the WSG. C3 is connected with the WSG in parallel, and C4 is then connected in series. These components are statistically independent. The data package can be transmitted through C3 and C4 in 15 seconds and 20 seconds with probabilities of 0.7 and 0.75, respectively. The performance levels of C1 and C2 are the same: each has a completion time of 15 seconds with a probability of 0.7. Their detected and undetected failures have equal probabilities, namely, 0.15. These parameters are listed in Table 1.

In this case, WSG is affected by the ICF. According to the steps above, the UGF of these components’ performance distribution can be represented as follows.

Based on the above analysis, the UGF of every component can be represented in the form of (18). Because the WSG consists of C1 and C2, which finish the data transmission task by sharing, (17) is applied, such that

A component’s single point failure also leads to its lowest performance level, namely, an infinite task completion time. According to (19),

Based on the reliability block diagram, because the WSG is connected with C3 in parallel, (16) is applied first. Component C4 is connected in series, so (15) is applied next. The UGF of the system can be expressed as follows:

According to (9) and (11), the probability that the system can transmit data packages within the maximum allowed completion time and the conditional expected performance are

If there is no WSG in this system, that is to say, if components C1, C2, and C3 are connected in parallel, then the UGF of C1 and C2 is expressed as

The UGF of the system can be calculated by

Similarly, the reliability and conditional expected performance can be calculated as

From the comparison of (24) and (27), it can be seen that the reliability of the system with ICF is decreased. However, its expected task completion time is less than that of the system without WSG. In other words, the system with WSG has higher efficiency to complete the transmission of data packages. However, because of the existence of ICF, its reliability has declined.
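The qualitative comparison can be checked with a short Python sketch that uses the parameters of Table 1 as described in the text. Several points are assumptions and not taken from the paper: the maximum allowed completion time `theta` is set to 35 seconds, an uncovered failure in the WSG is taken to give the WSG an infinite completion time, and in the system without WSG every failure of C1 or C2 is treated as an ordinary local failure with probability 0.3. The resulting numbers are therefore illustrative and are not claimed to reproduce (24) and (27).

```python
import math
from collections import defaultdict
from functools import reduce
from itertools import product

INF = math.inf


def compose(u1, u2, phi):
    """Combine two UGFs (dict: completion time -> probability) with phi."""
    out = defaultdict(float)
    for (g1, p1), (g2, p2) in product(u1.items(), u2.items()):
        out[phi(g1, g2)] += p1 * p2
    return dict(out)


def share(t1, t2):
    """Assumed work-sharing rule: processing speeds (1 / time) add up."""
    speed = 1.0 / t1 + 1.0 / t2
    return INF if speed == 0.0 else 1.0 / speed


def add(t1, t2):
    """Series connection: completion times add up."""
    return t1 + t2


def reliability(u, theta):
    return sum(p for g, p in u.items() if g <= theta)


def expected_time(u, theta):
    """Conditional expected completion time over the acceptable states."""
    return sum(g * p for g, p in u.items() if g <= theta) / reliability(u, theta)


c3 = {15.0: 0.7, INF: 0.3}
c4 = {20.0: 0.75, INF: 0.25}

# System with WSG: C1 and C2 share the work; uncovered failure w.p. 0.15 each.
p_uncovered = 0.15
covered = {15.0: 0.7, INF: 0.15}  # states other than the uncovered failure
wsg = defaultdict(float, compose(covered, covered, share))
wsg[INF] += 1.0 - (1.0 - p_uncovered) ** 2
with_wsg = compose(compose(dict(wsg), c3, min), c4, add)

# System without WSG: C1, C2, and C3 in plain parallel (assumed failure 0.3).
plain = {15.0: 0.7, INF: 0.3}
parallel = reduce(lambda u, v: compose(u, v, min), [plain, plain, c3])
without_wsg = compose(parallel, c4, add)

theta = 35.0  # hypothetical maximum allowed completion time
for name, u in (("with WSG", with_wsg), ("without WSG", without_wsg)):
    print(name, round(reliability(u, theta), 5), round(expected_time(u, theta), 3))
```

Under these assumptions, the system with the WSG has the lower reliability but the shorter conditional expected completion time, which agrees with the trend described above.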

In order to compare the efficiency of the above method with the multivalued decision diagram (MDD) approach of [22], the system in Figure 2 and some parameters in Table 1 are again used for the reliability calculation. The other assumptions and parameters needed by the MDD are consistent with those in that reference. Here, the focus is the time consumed by the reliability calculation. In the same computing environment, the results calculated by the MDD are close to those in (24), but the two methods differ in computation time. The validation listed in Table 2 shows that the ABM is less time consuming than the MDD.

4.2. Sensitivity Analysis

Figure 3 shows a more complex system, composed of three subsystems. Within each subsystem, a WSG is configured to improve task processing efficiency.

Because the components in each subsystem have several states, the task processing times and other parameters of these components are listed in the form of five-element tuples, as shown in Table 3.

Based on the above method and these parameters, the system reliability satisfying the maximum allowed completion time can be calculated by (9). Consider four different cases:
A: the WSG in each subsystem is operating and works well
B: WSG1 is eliminated and components C12 and C13 work in parallel
C: WSG2 is eliminated and components C21 and C22 work in parallel
D: WSG3 is eliminated and components C32 and C33 work in parallel

The system reliability curves for these four cases are shown in Figure 4. Figure 4(a) shows the entire curves of the four cases. From Figure 4(b), which is a partial view over the interval [25, 55], it can be seen that case A has the lowest reliability of the four cases. The sequence of reliability improvement for the other three cases, from low to high, is D, B, and C.

For the purpose of quantitative analysis of the reliability improvements, the results are listed in Table 4. It can be seen that Case C improves the reliability the most. At the same time, its expected performance is the lowest of the four cases. Although Cases B and D have almost the same reduction in expected performance, Case B shows a higher improvement in reliability than Case D. It can also be seen that the reliability improvements from the quantitative analysis agree with those from the above figures.

By applying the above analysis, the reliability engineer can choose the optimal scheme according to the indices of reliability or expected performance. Moreover, in this example, the sensitivity analysis is aimed only at the single WSG. In fact, combination of several WSGs is also a suitable mechanism for further sensitivity analysis based on this approach.

5. Concluding Remarks

In this paper, a reliability and sensitivity analysis approach is proposed for an MSS with ICF. In order to improve transmission efficiency, the components within a WSG can share a task such as data transmission according to their processing times. During the task sharing, the system may fail to detect the failure of one component, in which case system reliability is reduced and the task transmission may even fail. Based on the UGF technique, the impact of ICF is incorporated in the UGF expression, and calculation methods for reliability and sensitivity analysis are suggested. During the computation, the ABM algorithm is developed in order to significantly reduce the computational complexity. Two examples in this paper illustrate the application of the suggested approach. A reliability engineer can easily apply this approach to decide the optimal scheme under limited resources. In the above analysis, only the series-parallel MSS is considered. More complex MSS topologies, such as bridge and G: (k/n) structures, should be the focus of future work. Other ICF models with different structures will also be explored in future research.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported financially in part by a grant from Fundamental Research Funds for the Central Universities (nos. 2020MS120 and 2018MS076).