Availability Allocation of Networked Systems Using Markov Model and Heuristics Algorithm
It is common practice to allocate the system availability goal to reliability and maintainability goals of components in the early design phase. However, the availability of a networked system is difficult to allocate because of its complex topology and multiple down states. To solve these problems, a practical availability allocation method is proposed. Network reliability algebraic methods are used to derive the availability expression of the networked topology on the system level, and a Markov model is introduced to determine that on the component level. A heuristic algorithm is proposed to obtain the reliability and maintainability allocation values of components. The principles applied in the AGREE reliability allocation method, proposed by the Advisory Group on Reliability of Electronic Equipment, and in the failure rate-based maintainability allocation method are retained in our allocation method. A series system is used to verify the new algorithm, and the result shows that the allocation based on the heuristic algorithm agrees closely with the traditional one. Moreover, our case study of a signaling system number 7 shows that the proposed allocation method is efficient for networked systems.
Availability is the probability that a system or a component is performing its required function at a given point in time or over a stated period of time when operated and maintained in a prescribed manner. If the system or component repair can be viewed as a renewal process, the steady-state availability exists. One type of steady-state availability, inherent availability, is a design parameter based solely on the failure distribution and the repair-time distribution, and is defined as follows:

$$A = \lim_{T \to \infty} A(T) = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}, \qquad (1)$$

where $T$ is the operating time, MTBF is the mean time between failures, and MTTR is the mean time to repair.
In the early design phase, the system availability goal is specified and should be allocated to reliability requirements (e.g., failure rate, MTBF) and maintainability requirements (e.g., repair rate, MTTR) of components for further design and verification. The reliability and maintainability allocation results provide meaningful inputs to design (i.e., establishment of the right input design criteria at the proper level) and criteria for verification.
As Messer stated, availability allocation is extended from reliability allocation. Bouissou and Brizec summarized more than 20 availability allocation methods and generalized them into two categories: one is optimal availability allocation, which aims at finding the minimum cost under an availability goal or the maximal system availability under cost constraints, and the other is based on weighing factors, which considers the system structure. However, these availability allocation methods are only suitable for simple structured systems. In recent years, researchers have made a great effort to improve the availability allocation methods. For example, Elegbede and Adjallah applied genetic algorithms to solve the NP-hard multiobjective optimal availability allocation problem for series-parallel systems; Chiang and Chen proposed a simulated annealing based multiobjective genetic algorithm (saMOGA) to solve the optimal availability allocation problem for series-parallel systems; Barabady and Kumar used the availability importance measure based on MTBF and MTTR to find optimal allocation results with the minimum cost based on a genetic algorithm for series, parallel, and series-parallel systems; Juang et al. proposed a genetic algorithm based on a knowledge-based interactive decision support system to improve the availability allocation for series-parallel systems; Liu studied the availability optimization problem for an n-stage standby system under different resource and design configuration constraints by applying a Tabu-genetic algorithm combination method; Xie et al. extended the optimal availability allocation to consider redundancy allocation and spare parts provisioning simultaneously for k-out-of-n: G systems. However, the systems mentioned above are simple structured ones.
Mayer considered the availability allocation problem for multipath networks, but the system availability was modeled using series-parallel relationships, and the networked structure was not included. Nowadays, networked systems are common in both the natural and the man-made world, for example, networked communication systems, networked control systems, and networked power systems. For these systems, the system availability goal cannot be allocated using the methods above because of the networked structure. To the best of our knowledge, availability allocation is still not well studied for networked systems.
Moreover, there are multiple down states for some complex components of networked systems. As Ali stated, several types of complex failures, for example, detection failure, coverage failure, diagnostic failure, and recovery failure, are common in digital switched systems. The availability of such components cannot be directly expressed by (1). The Markov model is widely used in complex system availability analysis. For example, Lazaroiu and Staicut applied the Markov model to derive the availability expression for telecommunication switching systems; Lai et al. used the Markov model for hardware/software systems to cover both hardware and software failures; Liu and Trivedi introduced the Markov model to derive the availability expression of telecommunications switching systems and combined it with the performance model. Further, Hu et al. applied the Markov model to the optimal allocation problem for series-parallel systems.
In this paper, we study availability allocation based on weighing factors for networked systems. Traditionally, the system inherent availability goal is broken down into reliability and maintainability goals on the system level, and those system goals are allocated to subsystems or components using a reliability allocation method and a maintainability allocation method, respectively. However, as mentioned earlier, the traditional availability allocation methods are not practical for networked systems because of their complex structures and multiple down states. To solve these problems, we propose an availability allocation method based on the Markov model and a heuristic algorithm, in which the principles of both the AGREE reliability allocation method and the failure rate-based maintainability allocation method are retained.
The remainder of the paper is organized as follows. Section 2 introduces the availability models for networked systems based on network reliability algebraic method and Markov model. Section 3 proposes our availability allocation method, including goals, assumptions, principles, and procedures. In Section 4, a series system is allocated to verify our heuristic algorithm compared to the traditional one. A case study of a signaling system number 7 (SS7) is presented in Section 5 to validate our availability allocation method on networked systems. Finally, concluding remarks are provided in Section 6.
2. Availability Models for Networked System
A simple structure of a networked system is illustrated in Figure 1.
Since availability is a probability, the network reliability algebraic methods summarized by Shier, for example, the inclusion-exclusion method, the sum of disjoint products method, and the factoring method, can be applied to compute the availability of a networked system from knowledge of the node and link availability. Furthermore, the availability of links and nodes can be modeled using reliability block diagrams (RBD) and expressed as a function of the availability of the components that make them up. Therefore, the availability of such a networked system is given by

$$A_s = f(A_1, A_2, \ldots, A_n), \qquad (2)$$

where $A_1, A_2, \ldots, A_n$ are the availability of the $n$ types of components.
According to Ali, several fault tolerance techniques are applied to the component design in the networked system, and some complex failures are introduced. For example, (1) detection failure occurs when a component fails to detect a failure when it is supposed to; (2) coverage failure occurs when a component fails during a switchover between active and standby modes; (3) diagnostic failure occurs when a component's diagnostics cannot correctly identify failed units; and (4) recovery failure occurs when a component's emergency recovery program cannot bring the component back to an operational mode.
For components with such complex failures, the availability cannot be calculated through (1). The Markov model is capable of solving this problem. After a state transition diagram is created for the component, its steady-state probabilities can be solved through the flow rate equations, and the component availability can be obtained by adding the probabilities of all the available states together. Therefore, in addition to reliability and maintainability parameters, there are other variables in the component availability expressions, for example, detection frequency, coverage probability, diagnostic frequency, and recovery rate. The component availability can be expressed as

$$A_i = g_i(\lambda_i, \mu_i, x_{i1}, x_{i2}, \ldots), \qquad (3)$$

where $\lambda_i$ and $\mu_i$ are the failure rate and repair rate of component $i$, and $x_{i1}, x_{i2}, \ldots$ represent the other variables in the availability expression of component $i$.
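As a minimal worked illustration of the flow rate approach, consider an assumed two-state component (state 0 up, state 1 down) with failure rate $\lambda$ and repair rate $\mu$. This simple model is an illustrative assumption and is not one of the multi-state models used later in the paper. Balancing the probability flow between the two states and normalizing gives:

```latex
\begin{align}
  \lambda P_0 &= \mu P_1, \qquad P_0 + P_1 = 1, \\
  P_0 &= \frac{\mu}{\lambda + \mu}, \qquad P_1 = \frac{\lambda}{\lambda + \mu}, \\
  A &= P_0 = \frac{\mu}{\lambda + \mu}
     = \frac{1/\lambda}{1/\lambda + 1/\mu}
     = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} .
\end{align}
```

With only one up state and one down state the result reduces to (1); additional down states add further terms to the denominator, which is why (1) no longer applies to components with complex failures.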
3. Availability Allocation Method
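The system inherent availability goal, $A_s^*$, needs to be allocated to reliability and maintainability requirements of components in a manner that will support the specific goal. In general, the following inequality must hold:

$$f(A_1, A_2, \ldots, A_n) \ge A_s^*, \qquad (4)$$

where $f(\cdot)$ is the system availability expression in (2).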
In this paper, we study the availability allocation problem based on the following assumptions.
(1) The nodes and links of the networked system have only two states, perfect functioning and complete failure.
(2) All nodes and links are independent physically and statistically.
(3) Upon completion of a maintenance function, a repaired unit is as good as a new one.
(4) All failure times and repair times of components in the lowest allocation level follow exponential distributions.
(5) The system maintainability goal is already specified, and the other variables in the component availability expression (see (3)) are also given.
(6) The operating time for all the components is the same.
The AGREE method and the failure rate-based method are two of the most widely used reliability and maintainability allocation methods. However, these two methods cannot be applied to networked systems directly because of their complex topology and multiple down states. The ideas behind these allocation methods, such as allocating reliability according to component importance and complexity and allocating maintainability according to component failure rate, can still be used as our allocation principles.
In the AGREE method, reliability allocation is applied to a series system made up of components with exponential lifetimes. It is realized by allocating the following failure rate to component $i$:

$$\lambda_i = \frac{n_i \left[-\ln R_s^*(T)\right]}{N \omega_i t_i},$$

where $R_s^*(T)$ is the system reliability goal at system operating time $T$, $n_i$ is the complexity number, for example, the number of modules within component $i$, $N$ is the total number of modules in the system, $\omega_i$ is the probability that the system will fail if component $i$ fails, and $t_i$ is the operating time of component $i$.
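The AGREE allocation described above can be sketched in a few lines of Python. This is a minimal sketch that assumes the standard approximate AGREE formula, in which the allocated failure rate is the complexity number times the negative logarithm of the reliability goal, divided by the total module count, the importance, and the operating time. The function name `agree_failure_rates`, the reliability goal of 0.95, the unit importance values, and the 100-hour operating times are illustrative assumptions; only the module counts 10, 30, 20, and 10 come from the series system example in this paper.

```python
import math


def agree_failure_rates(r_goal, modules, importance, op_times):
    """Allocate failure rates with the approximate AGREE formula.

    r_goal     -- system reliability goal at the system operating time
    modules    -- complexity number (module count) of each component
    importance -- probability that the system fails if the component fails
    op_times   -- operating time of each component
    """
    total_modules = sum(modules)
    log_term = -math.log(r_goal)
    return [
        n * log_term / (total_modules * w * t)
        for n, w, t in zip(modules, importance, op_times)
    ]


# Module counts from the series example; all other values are hypothetical.
modules = [10, 30, 20, 10]
importance = [1.0, 1.0, 1.0, 1.0]
op_times = [100.0, 100.0, 100.0, 100.0]

rates = agree_failure_rates(0.95, modules, importance, op_times)

# For a series system with unit importance, the allocated rates should
# reproduce the reliability goal exactly.
system_reliability = math.exp(-sum(l * t for l, t in zip(rates, op_times)))
```

In this sketch the component with 30 modules receives three times the failure rate of a component with 10 modules, which reflects the principle that less complex components are assigned higher reliability goals.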
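In the failure rate-based method, for a system whose repair follows a renewal process, the maintainability allocation is implemented by allocating the following repair rate to component type $i$:

$$\mu_i = \frac{K k_i \lambda_i}{\mathrm{MTTR}_s^* \sum_{j=1}^{K} k_j \lambda_j},$$

where $\mathrm{MTTR}_s^*$ is the system maintainability goal expressed as a mean time to repair, $\lambda_i$ is the failure rate of component type $i$, $K$ is the number of component types, and $k_i$ is the number of identical components of type $i$.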
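The following sketch illustrates a failure rate-based maintainability allocation. It assumes the common textbook form of the method, in which the repair time allocated to a component type is inversely proportional to its total failure frequency and the failure-weighted mean repair time equals the system goal. The function name `allocate_repair_rates`, the component counts, the failure rates, and the 0.5-hour goal are all hypothetical values chosen for illustration.

```python
def allocate_repair_rates(mttr_goal, counts, failure_rates):
    """Allocate repair rates inversely to allocated repair times.

    mttr_goal     -- system mean time to repair goal (hours), assumed given
    counts        -- number of identical components of each type
    failure_rates -- failure rate of each component type (per hour)
    """
    num_types = len(counts)
    total_frequency = sum(n * lam for n, lam in zip(counts, failure_rates))
    return [
        num_types * n * lam / (mttr_goal * total_frequency)
        for n, lam in zip(counts, failure_rates)
    ]


# Hypothetical inputs.
counts = [1, 2, 1]
failure_rates = [1e-4, 5e-4, 2e-4]
repair_rates = allocate_repair_rates(0.5, counts, failure_rates)

# Check: the failure-weighted mean repair time should equal the goal.
weights = [n * lam for n, lam in zip(counts, failure_rates)]
system_mttr = sum(w / mu for w, mu in zip(weights, repair_rates)) / sum(weights)
```

The component type with the highest failure frequency receives the highest repair rate, that is, the shortest allowed repair time.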
As the structure and failures of a networked system are complex, the availability goal cannot be allocated through reliability allocation and maintainability allocation separately. Moreover, (6) cannot be applied directly to nonseries systems. Generally, the AGREE method and the failure rate-based method set up four basic principles of our availability allocation:
(1) assign higher reliability goals for less complex components;
(2) assign higher reliability goals for more important components;
(3) assign higher reliability goals for components which operate longer;
(4) assign higher maintainability goals for components with higher failure frequency.
Let $\varepsilon$ be the availability allocation accuracy requirement, and allocate the system availability goal to its component reliability and maintainability requirements using the following procedures.
Step 1. Determine the system reliability expression using the network reliability algebraic method on the network level and RBD on the lower level as $R_s(t) = h(R_1(t), R_2(t), \ldots, R_n(t))$, where $R_s(t)$ is the system reliability at time $t$ and $R_i(t)$ is the reliability of identical component type $i$.
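Step 2. Obtain the system availability expression by combining the system-level expression with the Markov method as $A_s = f\big(g_1(\lambda_1, \mu_1, \ldots), \ldots, g_n(\lambda_n, \mu_n, \ldots)\big)$, where $f$ is the system-level availability function and $g_i$ is the Markov-based availability expression of component type $i$ with failure rate $\lambda_i$ and repair rate $\mu_i$.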
Step 3. Let the initial reliability importance of each component type be equal to 1; that is, $I_i^{(0)} = 1$ for $i = 1, 2, \ldots, n$.
Step 4. Calculate the failure rate coefficient $c_i$ for component type $i$, where $t_i$ is its longest operating time. To preserve the allocation principle in (8), let the allocated failure rate for component type $i$ be $\lambda_i = c_i x$, where $x$ is a positive variable to be solved.
Step 6. By substituting (14) and (15) into (11), the system availability becomes a function $A_s(x)$ of the single variable $x$. Solve the following optimization problem using the bisection search method:

$$\max\ x \quad \text{s.t.} \quad A_s(x) \ge A_s^*, \quad x > 0,$$

where $x$ is the decision variable and $A_s^*$ is the system availability goal, so that the maximum allowable failure rate is obtained under the constraint of the system availability goal.
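The bisection search in Step 6 can be sketched as follows. The sketch relies on the fact that system availability decreases monotonically as the common failure rate scale factor grows, so the largest feasible factor sits where the availability curve crosses the goal. For illustration only, it assumes a four-component series system with two-state components and fixed repair rates; in the paper the repair rates also depend on the allocated failure rates. The names `series_availability` and `max_scale_factor`, the coefficients, the repair rates, and the 0.99 goal are hypothetical.

```python
def series_availability(x, coeffs, repair_rates):
    """Availability of a series system whose failure rates are c_i * x."""
    availability = 1.0
    for c, mu in zip(coeffs, repair_rates):
        lam = c * x
        availability *= mu / (lam + mu)
    return availability


def max_scale_factor(avail_fn, goal, x_hi=1.0, tol=1e-12):
    """Largest x with avail_fn(x) >= goal, for a decreasing avail_fn.

    Assumes 0 < goal < 1 and avail_fn(0) == 1, so x = 0 is feasible.
    """
    x_lo = 0.0
    # Grow the upper bound until it violates the availability goal.
    while avail_fn(x_hi) >= goal:
        x_lo, x_hi = x_hi, 2.0 * x_hi
    # Bisect: x_lo always satisfies the goal, x_hi never does.
    while x_hi - x_lo > tol * x_hi:
        mid = 0.5 * (x_lo + x_hi)
        if avail_fn(mid) >= goal:
            x_lo = mid
        else:
            x_hi = mid
    return x_lo


# Hypothetical coefficients and repair rates (per hour).
coeffs = [1.0, 3.0, 2.0, 1.0]
repair_rates = [2.0, 2.0, 2.0, 2.0]
goal = 0.99

x_star = max_scale_factor(
    lambda x: series_availability(x, coeffs, repair_rates), goal
)
achieved = series_availability(x_star, coeffs, repair_rates)
```

Returning the lower bracket end guarantees that the reported scale factor never violates the availability goal, at the cost of being marginally conservative.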
Step 7. From the optimal $x$, compute the allocated $\lambda_i$ and $\mu_i$ for component type $i$ using (14) and (15). Then, calculate the allocated reliability as $R_i = e^{-\lambda_i t_i}$, compute the probability of completing a repair in less than $\tau$ hours as $M_i(\tau) = 1 - e^{-\mu_i \tau}$, and obtain the allocated availability $A_i$ as in (3).
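Step 9. Compare the allocation results $R_i^{(k)}$, $M_i^{(k)}$, and $A_i^{(k)}$ of the current iteration $k$ with the last allocation results $R_i^{(k-1)}$, $M_i^{(k-1)}$, and $A_i^{(k-1)}$. If any $|R_i^{(k)} - R_i^{(k-1)}| > \varepsilon$, $|M_i^{(k)} - M_i^{(k-1)}| > \varepsilon$, or $|A_i^{(k)} - A_i^{(k-1)}| > \varepsilon$, where $\varepsilon$ is the allocation accuracy requirement, then let $k = k + 1$ and go to Step 4; otherwise, stop and let ($R_i^{(k)}$, $M_i^{(k)}$, $A_i^{(k)}$) be the final allocation result.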
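Under the assumptions described in Section 2, the reliability and availability of component $i$ can be expressed as $R_i(t) = e^{-\lambda_i t}$ and $A_i = \mu_i / (\lambda_i + \mu_i)$, respectively, where $\lambda_i$ and $\mu_i$ are the failure rate and repair rate of component $i$. The system reliability and availability can be obtained from the RBD as $R_s(t) = \prod_{i=1}^{4} R_i(t)$ and $A_s = \prod_{i=1}^{4} A_i$.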
Suppose that the system availability goal, the system maintainability goal (in hours), and the allocation accuracy requirement are specified, and that the module numbers of components 1, 2, 3, and 4 are 10, 30, 20, and 10, respectively. Using the procedures in Section 3.3, the accuracy requirement was achieved after 4 iterations. The iteration process is illustrated in Table 1. The bold numbers indicate the allocation results that could not satisfy the accuracy requirement and needed more iterations. The data in the last 3 rows are the final allocation results.
The root mean square error (RMSE) between the allocation results in each iteration and the final results can be calculated as

$$\mathrm{RMSE}_k = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i^{(k)} - y_i^{(K)} \right)^2},$$

where $y$ represents the allocated reliability, maintainability, or availability in (20), $n$ is the number of component types, and $K$ is the number of iterations. RMSE decreases after each iteration, as Figure 3 illustrates. One can see that our new allocation algorithm has good convergence behavior.
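The RMSE convergence check described above can be sketched as follows. The function name `rmse` and the iteration history are hypothetical; the numbers are not taken from Table 1.

```python
import math


def rmse(current, final):
    """Root mean square error between one iteration and the final result."""
    squared = sum((c - f) ** 2 for c, f in zip(current, final))
    return math.sqrt(squared / len(final))


# Hypothetical allocated values for two component types over three iterations.
history = [
    [0.90, 0.80],
    [0.94, 0.86],
    [0.95, 0.88],
]

# Error of each iteration relative to the last (final) iteration.
errors = [rmse(step, history[-1]) for step in history]
```

A strictly decreasing `errors` list, ending at zero for the final iteration, corresponds to the convergence behavior reported for the algorithm.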
If the traditional availability allocation method is used, the system reliability goal is first obtained from the system availability goal and the system maintainability goal. Then, the reliability and maintainability goals are allocated to components using the AGREE reliability allocation method and the failure rate-based maintainability allocation method described in Section 3.2. The allocation results are illustrated in Table 2.
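One plausible way to derive such a system reliability goal is sketched below. It assumes that the system as a whole obeys (1), that the system failure time is exponentially distributed, and that the maintainability goal is a mean time to repair; the symbols $A_s^*$, $\mathrm{MTTR}_s^*$, $\mathrm{MTBF}_s^*$, and $T$ are notational assumptions introduced for this illustration.

```latex
\begin{align}
  A_s^* &= \frac{\mathrm{MTBF}_s^*}{\mathrm{MTBF}_s^* + \mathrm{MTTR}_s^*}
  \quad\Longrightarrow\quad
  \mathrm{MTBF}_s^* = \frac{A_s^*\,\mathrm{MTTR}_s^*}{1 - A_s^*}, \\
  R_s^*(T) &= \exp\!\left(-\frac{T}{\mathrm{MTBF}_s^*}\right)
            = \exp\!\left(-\frac{(1 - A_s^*)\,T}{A_s^*\,\mathrm{MTTR}_s^*}\right).
\end{align}
```

The resulting $R_s^*(T)$ is the kind of system-level reliability goal that the AGREE method takes as its input.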
By comparing the allocation results obtained from our method and the traditional method, one can see that the RMSE is small, and this error is mainly caused by the different importance calculation methods. In the AGREE method, the reliability importance of component $i$ is the probability that the system will fail given that component $i$ has failed, while the Birnbaum importance in our new method is the maximum loss in system reliability when component $i$ switches from the normal state to the failed state. This case shows that the new heuristic algorithm in our availability allocation method is suitable for series systems and that the allocation difference is very low.
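The Birnbaum importance mentioned above can be computed directly from a system reliability function as the difference between the system reliability with the component perfectly working and with the component failed. The sketch below uses a four-component series system; the function names and the component reliabilities are hypothetical.

```python
def birnbaum_importance(system_reliability, reliabilities, index):
    """Birnbaum importance: R_s(component up) - R_s(component down)."""
    working = list(reliabilities)
    working[index] = 1.0
    failed = list(reliabilities)
    failed[index] = 0.0
    return system_reliability(working) - system_reliability(failed)


def series_reliability(reliabilities):
    """Reliability of a series system of independent components."""
    product = 1.0
    for r in reliabilities:
        product *= r
    return product


# Hypothetical component reliabilities.
reliabilities = [0.9, 0.8, 0.95, 0.99]
importances = [
    birnbaum_importance(series_reliability, reliabilities, i)
    for i in range(len(reliabilities))
]
most_important = importances.index(max(importances))
```

In a series system the Birnbaum importance of a component equals the product of the other components' reliabilities, so the least reliable component (index 1 here) is the most important one. The same function works for any structure, including a networked one, as long as `system_reliability` is available as a function of component reliabilities.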
5. Case Study
In this section, an SS7 system is used to illustrate the effectiveness of our allocation method. The topology of the system is shown in Figure 4, where we have the following.
(i) Service switching point (SSP): it is an end-point used as a switch that originates, terminates, or tandems calls. It sends signaling messages to other SSPs to set up, manage, and release the voice circuits required, or sends a query message to a service control point to seek routing information.
(ii) Signaling transfer point (STP): it is a packet switch used to transfer traffic between signaling points based on routing information contained in the SS7 message.
(iii) Service control point (SCP): it is an end-point used as a specialized database. It can accept queries from an SSP and retrieve routing information to support services.
(iv) A link, access link, connects a signaling end point (e.g., an SCP or SSP) to an STP.
(v) B link, bridge link, connects one STP to another. Typically, a quad of B links interconnects primary STPs.
(vi) C link, cross link, connects STPs performing identical functions into a mated pair. A C link is used only when an STP has no other route available to a destination signaling point due to link failures.
The data transmission process works as follows. When a customer dials a telephone number, this number is forwarded to the SSP, and then the SSP recognizes it as a call requiring special handling and queries the SCP database through the STP. The response containing routing information is passed via the STP switching system back to the SSP. Finally, the virtual link is constructed, and the source and the destination are connected together through the route given by the SCP.
5.1. Availability Model
To successfully build a connection between the two telephones, at least one path needs to exist between the source telephone and one of the SCPs, and at least one path should exist between the two telephones. The RBD of the SS7 system is shown in Figure 5. One can see that it is a type of networked structure. It is assumed that links are perfect and that the system availability goal is allocated only to the components that make up the nodes.
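From Figure 5, we can find 8 minimal paths, and the analytic expressions of the SS7 system reliability and availability can be obtained using the inclusion-exclusion method as functions of $R_{\mathrm{tel}}$, $R_{\mathrm{SSP}}$, $R_{\mathrm{STP}}$, and $R_{\mathrm{SCP}}$, the reliability of the telephone, SSP, STP, and SCP, and of $A_{\mathrm{tel}}$, $A_{\mathrm{SSP}}$, $A_{\mathrm{STP}}$, and $A_{\mathrm{SCP}}$, the availability of the corresponding nodes.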
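The inclusion-exclusion method over minimal paths can be sketched as follows. The network below is a small hypothetical one (a source, two parallel relay nodes, and a destination) and is not the SS7 topology of Figure 5; the node names, the availabilities, and the function name `network_availability` are assumptions made for illustration. Nodes are assumed independent and links perfect.

```python
from itertools import combinations


def network_availability(min_paths, avail):
    """Probability that at least one minimal path is fully available.

    min_paths -- list of sets of node names, one set per minimal path
    avail     -- dict mapping node name to its availability
    """
    total = 0.0
    for size in range(1, len(min_paths) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(min_paths, size):
            # All paths in the subset work iff every node in their union works.
            nodes = set().union(*subset)
            term = 1.0
            for node in nodes:
                term *= avail[node]
            total += sign * term
    return total


# Hypothetical network: src -> (relay_a or relay_b) -> dst.
min_paths = [{"src", "relay_a", "dst"}, {"src", "relay_b", "dst"}]
avail = {"src": 0.99, "relay_a": 0.9, "relay_b": 0.9, "dst": 0.99}

a_network = network_availability(min_paths, avail)

# Closed form for this small network, used as a cross-check.
a_expected = 0.99 * 0.99 * (1.0 - (1.0 - 0.9) * (1.0 - 0.9))
```

With 8 minimal paths the same expansion has 255 terms before simplification, which is why an analytic expression is usually derived once and then reused during allocation.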
For the SS7 system, because of the multiple down states, the component availability cannot be modeled using RBD alone. Take the STP as an example. Its RBD is illustrated in Figure 6. One can see that it is a series system, and the STP reliability and availability can be calculated by $R_{\mathrm{STP}} = R_{sp} R_{ps} R_{pw}$ and $A_{\mathrm{STP}} = A_{sp} A_{ps} A_{pw}$, where $R_{sp}$, $R_{ps}$, $R_{pw}$ and $A_{sp}$, $A_{ps}$, $A_{pw}$ are the reliability and availability of the STP signal processor, packet switcher, and power supply, respectively.
The Markov models of the STP signal processor, packet switcher, and power supply are shown in Figure 7. The states in one circle are the available states, and the states in two circles are down states. The STP signal processor has diagnostic and recovery functions, and the packet switcher has a failure detection function. From these Markov models, the steady-state probabilities can be calculated from the flow rate equations, and the availability of each of the three components can be expressed in terms of the following parameters: the failure rates of the STP signal processor, packet switcher, and power supply; the repair rates of the three components; the recovery rate, recovery failure probability, diagnostic return rate, and diagnostic frequency of the STP signal processor; and the detection frequency and detection probability of the STP packet switcher.
Figure 7: Markov models of (a) the STP signal processor, (b) the STP packet switcher, and (c) the STP power supply.
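Solving the flow rate equations of a component Markov model can be sketched as follows. The three-state model used here is a hypothetical illustration and is not one of the models in Figure 7: state 0 is up, state 1 is a detected failure under repair, and state 2 is an undetected failure that is found at a fixed detection rate. The function name `steady_state` and all parameter values are assumptions.

```python
def steady_state(generator):
    """Solve pi * Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination.

    generator -- square list of lists; entry [i][j] is the transition rate
                 from state i to state j, and each row sums to zero.
    """
    n = len(generator)
    # One balance equation per state (columns of Q), augmented with zero.
    rows = [[generator[i][j] for i in range(n)] + [0.0] for j in range(n)]
    # Replace the last (redundant) balance equation with normalization.
    rows[-1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


# Hypothetical parameters (per hour).
lam = 0.001      # failure rate
mu = 0.5         # repair rate
coverage = 0.9   # probability that a failure is detected immediately
detect = 0.1     # rate at which undetected failures are found

generator = [
    [-lam, coverage * lam, (1.0 - coverage) * lam],
    [mu, -mu, 0.0],
    [0.0, detect, -detect],
]

probabilities = steady_state(generator)
availability = probabilities[0]  # state 0 is the only available state
```

For this model the flow rate equations can also be solved by hand, giving an availability of 1 / (1 + lam/mu + (1 - coverage) * lam / detect), which the numerical solution should match.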
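In this case, we assume that the availability of each of the other components is expressed as $A_i = \mu_i / (\lambda_i + \mu_i)$, where $\lambda_i$ and $\mu_i$ are the failure rate and repair rate of component $i$.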
Combining the above availability and reliability expressions gives the system-level expressions used for the allocation. The parameters are listed in Table 3.
5.2. Availability Allocation
Assume that the accuracy requirement is specified and that the module numbers of the phone, SSP, STP signal processor, STP packet switcher, STP power supply, and SCP are 50, 200, 80, 90, 30, and 200, respectively.
The system availability goal was allocated down to the reliability and maintainability requirements using the procedures in Section 3.3, and the accuracy requirement was achieved after 11 iterations. Table 4 shows the iteration process, and the bold numbers indicate the allocation results that needed more iterations. The final results were obtained in the 11th iteration. When the component importance shifts between two adjacent iterations, the mean of the two values can be used to accelerate the iteration process.
In this paper, an availability allocation method is proposed for networked systems. This method has three advantages: (1) a heuristic algorithm is proposed to solve the problem for the networked structure, whereas the traditional availability allocation methods can only be used for simple structures; (2) Birnbaum importance is applied to calculate the component importance, which is otherwise not easy to obtain for a networked structure; and (3) the Markov method is introduced into the availability modeling process in order to model components with multiple down states.
Our availability allocation method is suitable for networked systems which have analytic availability expression based on network reliability algebraic method and Markov model. The numerical results show that the allocation process is efficient, and the allocation results satisfy the specific availability goal of the networked system.
In this method, because the Markov model is applied to compute system availability, the failure times and repair times of the components in the lowest allocation level must follow exponential distributions. For components with nonexponential distributions, the system availability calculation requires more advanced models, such as the semi-Markov model. The related topics will be studied in our future research.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China (61304220) and the Beijing Natural Science Foundation (4143064).
C. E. Ebeling, An Introduction to Reliability and Maintainability Engineering, Waveland Press, 2nd edition, 2009.
G. H. Messer, “The allocation of availability parameters-repair times and failure rates,” Tech. Rep., Texas A&M University, 1970.
M. Bouissou and C. Brizec, “Application of two generic availability allocation methods on a real life example,” in Proceedings of the European Safety and Reliability Association Conference (ESREL '96), Crete, Greece, 1996.
G.-S. Liu, “Availability optimization for repairable n-stage standby system by applying tabu-ga combination method,” International Journal of Modeling and Optimization, vol. 3, pp. 245–250, 2013.
R. C. Mayer, “Calculating availability for a time-varying multi-path network,” in Proceedings of the 24th AIAA International Communications Satellite Systems Conference (ICSSC '06), pp. 1–7, San Diego, Calif, USA, June 2006.
S. R. Ali, Digital Switching Systems: System Reliability and Analysis, McGraw-Hill, New York, NY, USA, 1997.
Y. Liu and K. S. Trivedi, “Survivability quantification: the analytical modeling approach,” International Journal of Performability Engineering, vol. 2, no. 1, pp. 29–44, 2006.
L. Hu, D. Yue, and J. Li, “Availability analysis and design optimization for a repairable series-parallel system with failure dependencies,” International Journal of Innovative Computing, Information and Control, vol. 8, no. 10, pp. 6693–6705, 2012.
D. R. Shier, Network Reliability and Algebraic Structures, Clarendon Press, Oxford, UK, 1991.
D. Zhou, X. Jia, C. Lv, and Y. Li, “Maintainability allocation method based on time characteristics for complex equipment,” Eksploatacja i Niezawodnosc, vol. 15, no. 4, pp. 441–448, 2013.