A Demonstration of Modern Bayesian Methods for Assessing System Reliability with Multilevel Data and for Allocating Resources
Good estimates of the reliability of a system make use of test data and expert knowledge at all available levels. Furthermore, by integrating all these information sources, one can determine how best to allocate scarce testing resources to reduce uncertainty. Both of these goals are facilitated by modern Bayesian computational methods. We demonstrate these tools using examples that were previously solvable only through the use of ingenious approximations, and employ genetic algorithms to guide resource allocation.
1. Introduction
Assessing the reliability of systems represented by reliability block diagrams remains important; take, for example, U.S. military weapon systems and nuclear power plants. In making these assessments, information and data are often available at all levels of these systems, whether at the component, subsystem, or system level. For example, there may be data from component and subsystem tests as well as expensive full system tests. In this paper, we are concerned with assessing the reliability of a system by combining all available information and data at whatever level they are available; here we consider the case where we have success/failure test data.
Much of the reliability literature ([1–6]) predates the advances made in Bayesian computation in the 1990s and resorts to various approximations. However, today a fully Bayesian method using the framework in , which simultaneously combines all available multilevel data and information, can be implemented using Markov chain Monte Carlo (MCMC). In this paper, we employ such modern Bayesian methods as MCMC to make reliability assessments.
In the next section, we introduce the statistical model that combines all available multilevel data and briefly present MCMC for analyzing such data. Then, we illustrate this methodology by making reliability assessments for an air-to-air heat-seeking missile system and a low-pressure coolant injection system in a nuclear power plant first considered by [5, 6], respectively.
Once multilevel data and information can be analyzed, the question arises of what additional tests should be done when new funding becomes available. That is, what tests will reduce the system reliability uncertainty the most? In this paper, we show how a genetic algorithm using a preposterior-based criterion can address this resource allocation question. Reference  considered resource allocation for a two-component series system. In this paper, we illustrate resource allocation with a more complex series-parallel system.
2. A Model for Combining Multilevel Data
To combine multilevel data for system reliability assessment, we use the framework in . We introduce the framework's notation and models by considering the reliability block diagram of a series-parallel system given in Figure 1. First, components, subsystems, and the system are referred to as nodes. In this example, the system is node 0 which consists of two subsystems (nodes 1 and 2) in series. The first subsystem consists of two components in parallel (nodes 3 and 4) and the second subsystem consists of three components in series (nodes 5, 6, and 7).
We begin by considering the binomial data model when data are available at a node. At the $i$th node, there are $x_i$ successes in $n_i$ trials with probability of success (reliability) $\pi_i$. If node $i$ is a subsystem or the full system (i.e., not a component), then $\pi_i$ is expressed in terms of the component reliabilities. For the series-parallel system, the subsystem reliabilities are expressed as $\pi_1 = 1 - (1 - \pi_3)(1 - \pi_4)$ and $\pi_2 = \pi_5 \pi_6 \pi_7$ and the system reliability is expressed as $\pi_0 = \pi_1 \pi_2$. In general, let $C$ be the subset of nodes which are components, and let $\pi_C = \{\pi_i : i \in C\}$; then for $i \notin C$ and for some function $f_i$, $\pi_i = f_i(\pi_C)$.
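The mapping from component reliabilities to subsystem and system reliabilities for the system of Figure 1 can be sketched in a few lines. This is a minimal illustration, assuming independent components as is usual for reliability block diagrams; the function names and the numerical values are hypothetical and are not taken from Table 1.

```python
def series(*reliabilities):
    """Reliability of nodes in series: every node must work."""
    product = 1.0
    for r in reliabilities:
        product *= r
    return product


def parallel(*reliabilities):
    """Reliability of nodes in parallel: at least one node must work."""
    prob_all_fail = 1.0
    for r in reliabilities:
        prob_all_fail *= 1.0 - r
    return 1.0 - prob_all_fail


def node_reliabilities(pi_c):
    """Add nodes 0, 1, 2 to a dict of component reliabilities (nodes 3-7)."""
    pi = dict(pi_c)
    pi[1] = parallel(pi[3], pi[4])       # subsystem 1: components 3, 4 in parallel
    pi[2] = series(pi[5], pi[6], pi[7])  # subsystem 2: components 5, 6, 7 in series
    pi[0] = series(pi[1], pi[2])         # system: subsystems 1 and 2 in series
    return pi


# Hypothetical component reliabilities, for illustration only.
example = node_reliabilities({3: 0.9, 4: 0.8, 5: 0.99, 6: 0.98, 7: 0.97})
```

With these illustrative inputs the parallel subsystem is more reliable than either of its components, while the series subsystem is less reliable than any of its components.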
Next, we consider prior distributions for node reliabilities. For components, we use beta prior distributions in terms of an estimated reliability $p_i$ and a precision $\nu_i$ which acts like an effective sample size. That is, if the $i$th node is a component, then $\pi_i \sim \text{Beta}(\nu_i p_i, \nu_i (1 - p_i))$. If no information is available, the Jeffreys prior or a uniform prior can be used.
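To see why the precision behaves like an effective sample size, it can help to convert an estimated reliability and a precision into the usual beta shape parameters. The sketch below assumes the parameterization Beta(nu * p, nu * (1 - p)); the function names and the values p = 0.9 and nu = 5 are hypothetical.

```python
import random


def beta_parameters(p, nu):
    """Shape parameters (a, b) of a beta prior with mean p and precision nu."""
    if not 0.0 < p < 1.0 or nu <= 0.0:
        raise ValueError("need 0 < p < 1 and nu > 0")
    return nu * p, nu * (1.0 - p)


def beta_mean_variance(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, variance


# Hypothetical expert input: estimated reliability 0.9 worth about 5 observations.
a, b = beta_parameters(0.9, 5.0)
mean, variance = beta_mean_variance(a, b)  # variance equals p(1 - p) / (nu + 1)

# Draws from the prior use only the standard library.
rng = random.Random(0)
prior_draws = [rng.betavariate(a, b) for _ in range(1000)]
```

The prior mean equals the estimated reliability, and the variance shrinks as the precision grows, which is the sense in which the precision acts like a sample size.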
We also allow the possibility that information (expert knowledge) is available on the reliabilities of subsystems and/or the full system; we assume that this information is independent of the test data and any information used to build the prior distributions for the component reliabilities. (Frequently, we will not use any such information: in particular, expert opinion about upper-level nodes will often be based on the same information that led to the prior distributions for component reliabilities. This information should not be used twice, so a simple solution is to exclude the upper-level expert opinion.) Assume that the information takes the form of an estimated reliability $p_i$ and a precision $\nu_i$. We then express the information contribution, including the $x_i$ successful tests in $n_i$ trials, from the $i$th subsystem or system as a term proportional to
$$\pi_i^{x_i + \nu_i p_i} (1 - \pi_i)^{n_i - x_i + \nu_i (1 - p_i)}.$$
As discussed above, the subsystem or system reliability $\pi_i$ is expressed in terms of the component reliabilities as $\pi_i = f_i(\pi_C)$. In effect, we have treated this information as if it were derived from binomial data instead of as a beta distribution; the difference involves a change in the exponents of $\pi_i$ and $1 - \pi_i$ by one. One effect of this treatment is to ensure that the posterior distribution of $\pi_C$ is well defined. We can define $\delta_i$ to be the indicator that node $i$ is a component (i.e., $\delta_i = 1$ if node $i$ is a component, and 0, otherwise), in which case the information contribution from the $i$th node is
$$\pi_i^{x_i + \nu_i p_i - \delta_i} (1 - \pi_i)^{n_i - x_i + \nu_i (1 - p_i) - \delta_i}$$
regardless of whether node $i$ is a component. If no information at the $i$th node is available beyond binomial tests, then $\nu_i = 0$, although $\nu_i > 0$ should be used for components to ensure a proper prior. In the remainder of this paper, when we refer to the prior distribution, we mean the distribution that arises from combining the component beta distributions with the upper-level expert knowledge. This is in fact a posterior distribution if there is nonzero expert knowledge, and in this case the components no longer have independent “prior” distributions.
A variety of models might be employed for the $\nu_i$. The $\nu_i$ might be treated as constants when they are really thought to be effective sample sizes. On the other hand, they might be described by a prior distribution. This allows expert knowledge to be downweighted if it is inconsistent with the data. Now consider the data and prior information for the series-parallel system given in Table 1. Note that no precisions $\nu_i$ are provided so that a prior distribution needs to be specified. For illustration, we consider the same precision $\nu$ at every node, that is, $\nu_i = \nu$, and take the prior distribution for $\nu$ to have mean five. That is, we believe that the expert information on average is worth five Bernoulli observations.
To combine the data with the expert knowledge represented as above, we use Bayes' theorem
$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta),$$
where $\theta$ is the parameter vector (i.e., the component reliabilities and any other unknown parameters), $x$ is the data vector, $p(\theta)$ is the prior probability density function, and $p(x \mid \theta)$ is the data probability density function (i.e., the binomial probability mass function for binomial data), which, viewed as a function of the parameter vector given the data, is known as the likelihood. The result of combining the data with expert knowledge is $p(\theta \mid x)$, which is known as the posterior distribution. Since the 1990s, advances in Bayesian computing through Markov chain Monte Carlo (MCMC) have made it possible to sample from the posterior distribution $p(\theta \mid x)$. Next, we discuss how the Metropolis-Hastings algorithm  can be used to obtain draws or samples from the parameter posterior distribution.
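A fully Bayesian analysis of the model described above, which simultaneously combines all available multilevel data and information, is nontrivial. The posterior distribution is not analytically tractable: up to a normalizing constant, and viewed as a function of the reliabilities with the precisions held fixed, it is
$$\prod_{i} \pi_i^{x_i + \nu_i p_i - \delta_i} (1 - \pi_i)^{n_i - x_i + \nu_i (1 - p_i) - \delta_i},$$
where the product runs over all nodes $i$, with $x_i$ successes in $n_i$ trials, estimated reliability $p_i$, precision $\nu_i$, and $\delta_i = 1$ for components and 0 otherwise.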
This looks superficially like a beta distribution, but it is not so simple because of the functional relationships between the node reliabilities $\pi_i$; that is, the subsystem and system reliabilities satisfy $\pi_i = f_i(\pi_C)$, where $\pi_C$ denotes the component reliabilities. Consequently, a Bayesian analysis requires an implementation of an MCMC algorithm such as Metropolis-Hastings; see, for example, . We use a variable-at-a-time Metropolis-Hastings algorithm as follows. The algorithm loops through all the unknown parameters, namely the component reliabilities $\pi_i$ ($i \in C$) and the precision $\nu$, proposing changes to one parameter at a time and either accepting or rejecting changes according to the Metropolis-Hastings rule. We update the $\pi_i$ on the logit scale: suppose we are at the stage in one iteration of the algorithm where we are updating $\pi_i$ (for some $i \in C$). Propose a new value $\pi_i'$ according to $\text{logit}(\pi_i') \sim N(\text{logit}(\pi_i), s_i^2)$, where the $s_i$ are tunable constants. Accept the value with probability
$$\min\left\{1, \frac{p(\pi', \nu \mid x)\, \pi_i' (1 - \pi_i')}{p(\pi, \nu \mid x)\, \pi_i (1 - \pi_i)}\right\},$$
where $p(\cdot \mid x)$ denotes the posterior density and $\pi'$ is equal to $\pi$ except with its $i$th node reliability replaced by $\pi_i'$. If the move is accepted, change the current value of the parameter to be $\pi_i'$; otherwise its value continues to be $\pi_i$. After all the $\pi_i$ for $i \in C$ have been updated in this way, we update $\nu$ on the log scale; this proceeds similarly except that the proposed new values $\nu'$ satisfy $\log \nu' \sim N(\log \nu, s_\nu^2)$, so that these proposed new values are accepted with probability
$$\min\left\{1, \frac{p(\pi, \nu' \mid x)\, \nu'}{p(\pi, \nu \mid x)\, \nu}\right\}.$$
After a complete iteration (after attempts to move each of the $\pi_i$ for $i \in C$ and also $\nu$), record the current values of all the parameters; this is treated as one sample from the posterior distribution. In practice the first several iterations are discarded as part of a “burn-in” period. Choosing good values of the $s_i$ is not difficult: in particular, the YADAS software system [11–13] has a method to tune these automatically in the burn-in period. This method consists of running an experiment with a wide range of $s_i$'s, modeling the acceptance rates of the proposed moves using logistic regression with the proposal scale as a predictor, and choosing $s_i$ so that the logistic regression model predicts an acceptance rate close to a target value such as 0.35.
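The variable-at-a-time update on the logit scale can be sketched as follows for the system of Figure 1. This is a minimal illustration and not the YADAS implementation: it assumes the precisions are fixed constants (so only the component reliabilities are updated), uses one fixed proposal scale instead of the automatic tuning described above, and runs on hypothetical data and priors. All names (`log_post`, `mh_sample`, `DATA`, `PRIOR`) are introduced here for illustration.

```python
import math
import random

COMPONENTS = (3, 4, 5, 6, 7)
EPS = 1e-12


def clamp(r):
    """Keep a reliability strictly inside (0, 1) to avoid log(0)."""
    return min(max(r, EPS), 1.0 - EPS)


def all_nodes(pi_c):
    """Reliabilities of nodes 0-7 from component reliabilities (Figure 1)."""
    pi = dict(pi_c)
    pi[1] = 1.0 - (1.0 - pi[3]) * (1.0 - pi[4])
    pi[2] = pi[5] * pi[6] * pi[7]
    pi[0] = pi[1] * pi[2]
    return pi


def log_post(pi_c, data, prior):
    """Unnormalized log posterior; data[i] = (x, n), prior[i] = (p, nu)."""
    total = 0.0
    for i, r in all_nodes(pi_c).items():
        r = clamp(r)
        x, n = data.get(i, (0, 0))
        p, nu = prior.get(i, (0.5, 0.0))  # nu = 0 means no expert information
        delta = 1.0 if i in COMPONENTS else 0.0
        total += (x + nu * p - delta) * math.log(r)
        total += (n - x + nu * (1.0 - p) - delta) * math.log(1.0 - r)
    return total


def mh_sample(data, prior, n_iter=3000, burn_in=500, step=0.5, seed=1):
    """Return post-burn-in draws of the system reliability (node 0)."""
    rng = random.Random(seed)
    pi_c = {i: 0.5 for i in COMPONENTS}
    current = log_post(pi_c, data, prior)
    draws = []
    for it in range(n_iter):
        for i in COMPONENTS:
            old = pi_c[i]
            z = rng.gauss(math.log(old / (1.0 - old)), step)  # logit random walk
            new = clamp(1.0 / (1.0 + math.exp(-z)))
            pi_c[i] = new
            proposed = log_post(pi_c, data, prior)
            # The pi(1 - pi) factors are the Jacobian of the logit transform.
            log_ratio = (proposed + math.log(new * (1.0 - new))
                         - current - math.log(old * (1.0 - old)))
            if math.log(1.0 - rng.random()) < log_ratio:
                current = proposed   # accept the move
            else:
                pi_c[i] = old        # reject and restore
        if it >= burn_in:
            draws.append(all_nodes(pi_c)[0])
    return draws


# Hypothetical test data (successes, trials) and component priors (p, nu).
DATA = {0: (14, 16), 3: (8, 10), 4: (9, 10), 5: (19, 20), 6: (18, 20), 7: (20, 20)}
PRIOR = {i: (0.9, 5.0) for i in COMPONENTS}
draws = mh_sample(DATA, PRIOR)
```

Because node 1 has no entry in `DATA` or `PRIOR`, its reliability is informed only through the system data and the data on components 3 and 4, which is the mechanism by which higher-level information flows down to lower-level nodes.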
The same MCMC algorithm just described for making draws from the joint posterior distribution can be used for making draws from the joint prior distribution, which is obtained from the posterior by removing the test data (i.e., by setting the successes and trials to zero at every node), where the reliability $\pi_i$ for a subsystem or system is a function of the component reliabilities $\pi_C$. Draws for the subsystem and system reliabilities are obtained by evaluating the appropriate functions with the $\pi_C$ draws. The resulting prior distributions for the node reliabilities and the precision $\nu$ are displayed as dashed lines in Figures 2 and 3, respectively.
In assessing the system reliability for the series-parallel system of Figure 1, we combine the node data with the prior distributions using MCMC as just described, which results in the posterior distributions displayed as the solid lines in Figures 2 and 3. From these results, the 90% (central) credible interval for the system (node 0) reliability is calculated as (0.697, 0.861), whose length is 0.164. Note that even though there are no data for the first subsystem (node 1), the system data (node 0) and the component data (nodes 3 and 4) dramatically improve what we know about the first subsystem reliability. As shown in Figure 3, the addition of the data does not change the distribution of the precision $\nu$ much, except that $\nu$ is somewhat larger than indicated by the prior distribution; that is, the data essentially confirm the prior distribution of $\nu$.
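A central credible interval is read directly off the posterior draws as a pair of quantiles. The helper below is a small sketch using linear interpolation between order statistics; the function names are hypothetical, and the evenly spaced values stand in for real MCMC draws.

```python
def quantile(ordered, q):
    """Quantile of an already sorted list, by linear interpolation."""
    position = q * (len(ordered) - 1)
    k = int(position)
    upper = min(k + 1, len(ordered) - 1)
    return ordered[k] + (position - k) * (ordered[upper] - ordered[k])


def central_interval(draws, level=0.90):
    """Return (lower, upper, length) of the central credible interval."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    lower = quantile(ordered, tail)
    upper = quantile(ordered, 1.0 - tail)
    return lower, upper, upper - lower


# Stand-in "draws": 101 evenly spaced values on [0, 1].
lower, upper, length = central_interval([i / 100 for i in range(101)])

# The interval reported in the text, (0.697, 0.861), has length 0.164.
reported_length = round(0.861 - 0.697, 3)
```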
3. Reliability Assessments for Two Applications
3.1. Series System Example
Reference [5] considered the reliability of a certain air-to-air heat-seeking missile system consisting of five subsystems in series, each consisting of multiple components themselves combined in series, as depicted in Figure 4. The data and prior information that [5] used are presented in Table 2 as $x_i/n_i$ (successes/trials), estimated reliabilities $p_i$, and precisions $\nu_i$. Reference [5] did not provide details on how these data were obtained and how the prior information was arrived at.
To compare with [5], we treat the precisions $\nu_i$ as constants and then obtain the posterior node reliabilities using YADAS [11–13]. The posterior node reliabilities are displayed in Figure 5 as solid lines; the results from [5] are displayed as dashed lines. The median (0.50 quantile) and 90% credible intervals (0.05, 0.95 quantiles) for the system and subsystem posterior reliabilities from the fully Bayesian method and the method of [5] are given in Table 3.
Note that there is quite a difference for the subsystem 1 results. The difference in location is due to the fact that the approximations used in [5] do not use higher-level information (system data) to estimate lower-level parameters (such as subsystem 1 reliability). The expert judgment estimate of system reliability, $p_0$, is lower than the data and expert judgment at the lower levels would imply, and the fully Bayesian analysis needs to attribute this unreliability to one of the subsystems. Subsystem 1, and in particular component 19, has the sparsest information and is the natural target. For this reason, the fully Bayesian analysis is more useful than the approach of [5] in evaluating the usefulness of gathering more data at low levels. In practice one would review the information that led to the low system reliability estimate. The fully Bayesian analysis could be rerun with random $\nu_i$'s, and this would presumably allocate positive probability to the event that $p_0$ is an underestimate.
3.2. Complex Series-Parallel System Example
Reference [6] considered the reliability of a low-pressure coolant injection system, an important safety system in a nuclear-power boiling-water reactor. It consists of twin trains of pumps, valves, heat exchangers, and piping whose reliability block diagram is displayed in Figure 6. The data and prior information that [6] used are presented in Table 4 as $x_i/n_i$ (successes/trials), estimated reliabilities $p_i$, and precisions $\nu_i$.
Martz and Waller [6] based component prior distributions (i.e., for nodes 121, 122, 1111, 1112, 1121, 1122, 221, 222, 2111, 2112, 2121, 2122) on data from the Nuclear Regulatory Commission Accident Sequence Evaluation Program database  and some subsystem prior distributions (i.e., for nodes 12, 111, 112, 22, 211, 212) on composite IEEE Std. 500 reliability data (). See [6] for more details.
We treat the precisions $\nu_i$ as constants as in [6], and then obtain the posterior node reliabilities using YADAS [11–13]. The resulting posterior reliabilities for the subsystems and system are displayed in Figure 7. Also, the summaries of the posterior reliabilities for all nodes are given in Table 5. The results in Table 5 are similar to those given in [6] although somewhat smaller; for example, the (0.05, 0.5, 0.95) quantiles for nodes 0–2 from [6] are (0.999968, 0.9999940, 0.99999975), (0.9925, 0.9974, 0.99944), and (0.9926, 0.9974, 0.99948), respectively.
4. Resource Allocation
In Section 2, we showed how to analyze multilevel data to assess system reliability. In this section we address test design. When additional funding becomes available, the question arises of where tests should be done and how many should be taken to improve the system reliability assessment. We consider the optimal allocation of additional testing within a fixed budget, that is, the allocation that results in the least uncertainty about system reliability. We explore this by using the series-parallel system in Figure 1. We must determine how many tests should be performed at the system, subsystem, and component level (i.e., nodes 0–7) under a fixed budget for specified costs at each level (system, subsystem, component). In this paper, we use a genetic algorithm (GA) [16, 17] to do the optimization because it is simple to implement and generally provides good results. Other optimization methods like particle swarms  could easily be used instead.
Thus, we assume that there is a cost for collecting additional data with higher-level data being more costly than lower-level data. Consider the following costs as an example of the costs for testing at each node. Recall that node 0 is the system, nodes 1 and 2 are subsystems and nodes 3–7 are components:
We evaluate a candidate allocation (i.e., a specified number of tests for each of the eight nodes) using a preposterior-based criterion as follows. We take a draw from the current joint posterior distribution (based on the current data) of the node reliabilities and draw binomial data according to the candidate allocation. Then we combine these new data with the current data using the same prior distributions to obtain an updated posterior distribution of the node reliabilities; again we use MCMC to obtain draws from this updated posterior distribution. The length of the 90% central credible interval of the system reliability posterior distribution is taken as a measure of uncertainty. This is repeated $N_d$ times, each with a different draw from the current joint posterior distribution of the node reliabilities. The uncertainty criterion is then calculated as the 0.90 quantile of the resulting $N_d$ 90% credible interval lengths.
Briefly, we describe how a GA can be used to find a nearly optimal allocation. A GA operates on a “population” of candidate allocations, where a candidate allocation is a vector of node test sizes. The GA begins by constructing an initial population or generation of $M$ allocations by randomly generating allocations that do not exceed the given fixed budget. The uncertainty criterion for each of these allocations in the initial population is evaluated and the allocations are ranked from smallest to largest; that is, the best allocation has the smallest criterion in the initial population. The second (and subsequent) GA generations are then populated using two genetic operations: crossover and mutation [16, 17]. A crossover is achieved by randomly selecting two parent allocations from the initial (or current) generation without replacement with probabilities inversely proportional to their rank among the $M$ allocations in the initial (or current) generation. A new allocation is generated node by node from these two selected parent allocations by randomly picking one of the two parents each time and taking its node test size. The two parent allocations are then returned to the initial (or current) population before the next crossover is performed. In this way, an additional $M$ allocations are generated using the crossover operator. The generated allocations are checked to make sure they do not exceed the budget, so that new allocations are generated until there are $M$ such allocations. The uncertainty criterion is then evaluated for each of these new allocations. A mutation of each of the $M$ initial (or current) allocations is obtained node by node by first randomly deciding whether to change the node test size and, if so, randomly perturbing the current node test size. Using mutation, $M$ additional allocations which remain within the budget are generated and the uncertainty criterion for each is evaluated. At this point there are $3M$ allocations.
In the next generation, the current population consists of the $M$ best allocations from these $3M$ allocations, that is, those with the smallest uncertainty criterion. The GA is executed for $G$ generations. We implemented the GA for resource allocation in R , which generates the candidate allocations. An allocation is evaluated in R by repeatedly building YADAS [11–13] input data files, running the YADAS code using the reliability package (through the “system” call) to analyze the new and current data, and reading the resulting YADAS output files back into R to calculate the uncertainty criterion.
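The GA loop described above can be sketched compactly. The code below is an illustration in Python, not the R and YADAS implementation: the per-test costs, the mutation settings, and the default population size and number of generations are hypothetical, and `toy_criterion` is a cheap deterministic stand-in for the simulation-based uncertainty criterion so that the example runs quickly.

```python
import random

COSTS = (5, 2, 3, 1, 1, 1, 1, 1)  # hypothetical per-test costs for nodes 0-7
BUDGET = 1000


def cost(alloc):
    return sum(c * n for c, n in zip(COSTS, alloc))


def toy_criterion(alloc):
    """Stand-in for the preposterior criterion (smaller is better)."""
    return sum(1.0 / (1.0 + n) for n in alloc)


def random_allocation(rng):
    # Each node gets at most an equal share of the budget, so the
    # allocation is always feasible.
    return tuple(rng.randint(0, BUDGET // (len(COSTS) * c)) for c in COSTS)


def pick_parents(pop, rng):
    """Two distinct parents, weights inversely proportional to rank."""
    weights = [1.0 / (rank + 1) for rank in range(len(pop))]
    first = rng.choices(range(len(pop)), weights=weights)[0]
    weights[first] = 0.0
    second = rng.choices(range(len(pop)), weights=weights)[0]
    return pop[first], pop[second]


def crossover(pop, rng):
    while True:  # regenerate until the child is within budget
        a, b = pick_parents(pop, rng)
        child = tuple(rng.choice(pair) for pair in zip(a, b))
        if cost(child) <= BUDGET:
            return child


def mutate(alloc, rng, p_change=0.3, max_step=20):
    while True:  # regenerate until the mutant is within budget
        child = tuple(
            max(0, n + rng.randint(-max_step, max_step))
            if rng.random() < p_change else n
            for n in alloc)
        if cost(child) <= BUDGET:
            return child


def run_ga(pop_size=20, generations=50, seed=0):
    rng = random.Random(seed)
    pop = sorted((random_allocation(rng) for _ in range(pop_size)),
                 key=toy_criterion)
    history = [toy_criterion(pop[0])]
    for _ in range(generations):
        children = [crossover(pop, rng) for _ in range(pop_size)]
        mutants = [mutate(a, rng) for a in pop]
        # Keep the best pop_size of the 3 * pop_size allocations.
        pop = sorted(pop + children + mutants, key=toy_criterion)[:pop_size]
        history.append(toy_criterion(pop[0]))
    return pop[0], history


best, history = run_ga()
```

With a real simulation-based criterion, each evaluation is expensive and noisy, so one would store the criterion value with each allocation instead of recomputing it when sorting.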
In the implementation, there are a number of issues regarding the choice of the population size $M$, the number of generations $G$, the number of posterior draws $N_p$ for each generated data set, and the number of generated data sets $N_d$. As the population size $M$ and number of generations $G$ increase, more candidate allocations (i.e., $M + 2MG$) are entertained, but then more calculation is required. As the number of posterior draws $N_p$ for each generated data set and the number of generated data sets $N_d$ to analyze increase, the uncertainty criterion is better evaluated, but the calculation needed to evaluate a single candidate allocation can dramatically increase, let alone that for $M + 2MG$ candidate allocations. One has to realize that the nearly optimal allocation found by the GA may not be the optimal allocation if the difference between them is less than the variability of the evaluated uncertainty criterion, that is, within the simulation error of the uncertainty criterion.
One might ask if there are any general insights regarding resource allocation with assessment of system reliability in mind. If we consider testing at the same level, for components (or subsystems), the component (or subsystem) with the most uncertainty will require more testing than the others. If the subsystems are connected in series, but some subsystems have components connected in series whereas other subsystems have components connected in parallel, then in terms of component testing, the parallel configured subsystems will require less testing; this can be explained by examining the subsystem reliability expression, which shows that the reliability of series configured subsystems is of second order in their component reliabilities, whereas that for parallel configured subsystems is of first order. The allocation will also depend on the testing costs relative to the amount of uncertainty reduction that it provides. If we consider a series configured subsystem and the subsystem cost exceeds the sum of the component costs, then performing component tests will be recommended; if the subsystem cost is less than the sum of the component costs, then performing some subsystem tests may be recommended if they provide relatively more information. For complicated systems with many subsystems and components whose costs are all different, however, it will be difficult to choose an optimal allocation with these rules of thumb. The proposed methodology balances all these costs and information across the entire system in finding a nearly optimal allocation.
Next, we illustrate the GA for the resource allocation problem described above for the series-parallel system depicted in Figure 1 for a fixed budget of $1000. The length of the 90% credible interval of system reliability based on the existing data is 0.164. We use populations of size and generations, so that 2020 () candidate allocations were generated and evaluated. To evaluate the uncertainty criterion, we generated posterior draws per data analysis and generated data sets corresponding to posterior draws based on the existing data. For this situation, what allocation yields the most reduction in the uncertainty criterion for system reliability?
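As a consistency check on the count of 2020 candidate allocations, the bookkeeping of the GA can be written out. The labels $M$ (population size) and $G$ (number of generations) and the values $M = 20$ and $G = 50$ are an inference from the reported count and from the last generation reported, not values quoted from the text: an initial population of $M$ allocations is followed by $M$ crossover and $M$ mutation allocations in each of $G$ generations, so that

$$M + G\,(M + M) = M\,(1 + 2G), \qquad 20\,(1 + 2 \cdot 50) = 2020.$$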
Based on the proposed methodology described above, the GA produced the traces presented in Figures 8 and 9, which display the best uncertainty criterion and allocation found during each generation. The uncertainty criterion drops to 0.0804 for the initial population and decreases to 0.0725 by generation 50 with an allocation of test sizes (0, 0, 175, 0, 0, 208, 137, 128) for nodes 0–7. We evaluated this allocation with and and obtained uncertainty criterion values of 0.073358 and 0.073363, so we take the uncertainty criterion for this allocation as 0.0734. These results suggest that there are already enough data for node 1, the two-component parallel subsystem, and that the cost structure discourages additional system tests (i.e., the system cost equals the sum of the subsystem costs, which equals the sum of the component costs). Because the node 2 subsystem cost equals the sum of its component costs, we tried an allocation which proportionally reallocated the subsystem tests to its components (i.e., splitting them up by the proportions (208/473, 137/473, 128/473) found by the GA), giving the allocation (0, 0, 0, 0, 0, 439, 289, 270). Evaluating this allocation again with and gave uncertainty criterion values of 0.071439 and 0.071426, which we round to 0.0714. Consequently, there is some improvement from doing all component tests for the node 2 subsystem.
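The proportional reallocation can be checked with a few lines of arithmetic. The sketch assumes that one node 2 subsystem test costs as much as three component tests, which is consistent with the subsystem cost equaling the sum of its component costs when the three components cost the same; the function name and the `cost_ratio` argument are hypothetical.

```python
def reallocate(subsystem_tests, component_tests, cost_ratio):
    """Convert subsystem tests into component tests, split proportionally."""
    # Budget freed by dropping the subsystem tests, in component-test units.
    freed = subsystem_tests * cost_ratio
    total = sum(component_tests)
    return tuple(round(n + freed * n / total) for n in component_tests)


# GA allocation for node 2 and its components (nodes 5, 6, 7).
new_tests = reallocate(175, (208, 137, 128), cost_ratio=3)
# new_tests == (439, 289, 270), matching the allocation reported above
```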
For relatively complex systems, we have illustrated how to respond to the challenge of integrating all information available at the various levels of a system in order to estimate its reliability. Bayesian models have always been natural for doing this integration, and the computational tools have now caught up to make this practical. Moreover, because we are able to analyze such data, we can now consider the problem of allocating additional resources that best reduce the uncertainty in the system reliability assessment.
We have discussed the case of binomial test data only for systems represented by reliability block diagrams. Reference  showed how binomial data can be analyzed for problems using fault tree representations. Component and subsystem tests may generate continuous data such as lifetimes, and their distributions may depend on covariates such as different suppliers. Reference  presented an example of such an analysis. However, the problem of resource allocation for nonbinomial test data is a topic for future research.
Acknowledgments
The authors thank C. C. Essix for her encouragement of this work and Vivian Romero for her assistance in producing the reliability block diagram figures used in this paper. We also thank the referees for helpful comments that improved the presentation of this paper.
References
P. V. Z. Cole, “A Bayesian reliability assessment of complex systems for binomial sampling,” IEEE Transactions on Reliability, vol. 24, no. 2, pp. 114–117, 1975.
B. Natvig and H. Eide, “Bayesian estimation of system reliability,” Scandinavian Journal of Statistics, vol. 14, no. 4, pp. 319–327, 1987.
H. F. Martz, R. A. Waller, and E. T. Fickas, “Bayesian reliability analysis of series systems of binomial subsystems and components,” Technometrics, vol. 30, no. 2, pp. 143–154, 1988.
H. F. Martz and R. A. Waller, “Bayesian reliability analysis of complex series/parallel systems of binomial subsystems and components,” Technometrics, vol. 32, no. 4, pp. 407–416, 1990.
V. E. Johnson, T. L. Graves, M. S. Hamada, and C. S. Reese, “A hierarchical model for estimating the reliability of complex systems,” in Bayesian Statistics 7, J. M. Bernardo, M. J. Bayarri, J. Berger et al., Eds., pp. 199–213, Oxford University Press, London, UK, 2003.
S. Chib and E. Greenberg, “Understanding the Metropolis-Hastings algorithm,” The American Statistician, vol. 49, pp. 327–335, 1995.
T. L. Graves, “The YADAS reliability package,” Tech. Rep. LA-UR-06-7739, Los Alamos National Laboratory, Los Alamos, NM, USA, 2006, http://www.stat.lanl.gov/yadas/node1.html#download.
U.S. Nuclear Regulatory Commission, “Reactor risk reference document (vols. 1–3, draft),” Tech. Rep. NUREG-1150, 1987.
Institute of Electrical and Electronic Engineers, IEEE Guide to the Collection and Presentation of Electrical, Electronic, Sensing Component, and Mechanical Equipment Reliability Data for Nuclear-Power Generating Stations, Wiley-Interscience, New York, NY, USA, 1983.
D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, New York, NY, USA, 1989.
Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer, New York, NY, USA, 1992.
R. C. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” in Proceedings of the 6th International Symposium on Micro Machine and Human Science, pp. 39–43, IEEE Service Center, Nagoya, Japan, October 1995.
R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2004, http://www.R-project.org/.
M. Hamada, H. F. Martz, C. S. Reese, T. Graves, V. Johnson, and A. G. Wilson, “A fully Bayesian approach for combining multilevel failure information in fault tree quantification and optimal follow-on resource allocation,” Reliability Engineering and System Safety, vol. 86, no. 3, pp. 297–305, 2004.
T. L. Graves and M. S. Hamada, “Bayesian methods for assessing system reliability: models and computation,” in Modern Statistical and Mathematical Methods in Reliability, A. Wilson, N. Limnios, S. Keller-McNulty, and Y. Armijo, Eds., vol. 10 of Series on Quality, Reliability, and Engineering Statistics, pp. 41–54, World Scientific, Singapore, 2005.