Abstract

Service-oriented architecture (SOA) provides an elastic and automatic way to discover, publish, and compose individual services. SOA enables faster integration of existing software components from different parties, makes fault tolerance (FT) feasible, and is also one of the fundamentals of cloud computing. However, the unpredictable nature of SOA systems introduces new challenges for reliability evaluation, while reliability and dependability have become basic requirements of enterprise systems. This paper proposes an SOA system reliability model that incorporates three common fault-tolerance strategies. Sensitivity analysis of SOA at both coarse- and fine-grained levels is also studied, which can be used to efficiently identify the critical parts within the system. Two SOA system scenarios based on real industrial practices are studied. Experimental results show that the proposed SOA model can accurately depict the behavior of SOA systems. Additionally, the sensitivity analysis quantifies the effects of system structure as well as fault tolerance on the overall reliability. On the whole, the proposed reliability modeling and analysis framework may help the SOA system service provider to evaluate the overall system reliability effectively and to make smarter improvement plans by focusing resources on enhancing the reliability-sensitive parts of the system.

1. Introduction

Service-oriented architecture (SOA) has become a major distributed computing framework [1]. With characteristics such as standardized interfaces, a loosely coupled structure, cross-platform support, and elastic service discovery, deployment, and reuse capabilities, SOA opens a new door to faster integration of existing software components from different parties, especially in the scheme of Web services (WS). Legacy components may still live within the system via service adapters [2], which suits enterprises that prefer gentle and stable system upgrades.

It is noted that SOA also makes fault-tolerance (FT) techniques feasible for building reliable systems. Since it is difficult to build failure-free useful systems under limited development costs and the pressure of time to market, software fault tolerance [3], whose concepts originated from hardware reliability assurance, was proposed as an effective way to utilize redundancy to mask software failures and recover to normal operational states in a long running system. However, the extra costs of bringing out alternative software designs (redundancy) basically limit the applications of software fault tolerance to the fields that require ultrahigh reliabilities such as military, transportation, and aerospace. The emergence of service-oriented computing (SOC) helps lower the costs of making redundant software logic by reusing similar services published by different parties [4].

The unpredictability of the open distributed environments to which SOA systems are exposed, such as service fail-outs, service modification, linkage failures, and traffic congestion, threatens the dependability of the overall system. To ensure sufficient SOA system dependability, suitable reliability assessment and improvement methodologies are needed. Reliability modeling has been studied extensively in the field of software engineering, and many elegant solutions have emerged [5, 6], among which the component-based or architecture-based models [7, 8] appear to be most conceptually suited to be mapped to SOA systems. However, one of the major distinctions between traditional software system reliability models and SOA system reliability models is that the former usually assume reliable communication between components and high transparency of system operation information, which generally does not hold for SOA systems.

Based upon our previous research [9], in this paper, we further investigate the sensitivity analysis of each of the system’s subparts, which may be useful in making critical system maintenance plans for resource allocation. Note that the proposed model is an extended version of the work by Wang et al. [10], although substantial changes and extensions have been made for SOA systems, such as the incorporation of service-based concepts and more detailed considerations of fault-tolerance mechanisms. The major contributions of this paper are twofold. The first is an SOA system reliability model that considers unreliable services, unreliable communication links, and the internal mechanisms of three typical fault-tolerance strategies. Note that existing reliability modeling methods tend to ignore the unreliability of communication links and oversimplify the FT mechanisms, which may deviate from real situations. This paper is also among the earliest SOA reliability studies that adopt Markov-based analysis rather than the popular path-based analysis, and therefore the complexities of path analysis involving various branches and loops are avoided. The second is sensitivity analysis of SOA at two grain levels, which helps the system service provider to identify reliability-critical subparts and make smarter system improvement plans.

The remainder of the paper is organized as follows. Section 2 reviews related work on reliability modeling and the realization of FT in service-based systems. Section 3 describes the proposed system reliability model for SOA systems and presents the reliability sensitivity analysis. Section 4 presents the experimental results and analysis. Section 5 discusses factors that should receive attention when the results of this research are extended to other systems. Finally, Section 6 concludes the paper. Supplements such as the RcB reliability formulation, the sensitivity analysis derivation, and the manual path-based reliability computation are presented in the appendixes.

2. Related Work

SOA systems generally run in open and distributed environments, which introduce sources of failure not found in traditional software systems, such as interface changes, workflow inconsistency, time-outs, service-level agreement (SLA) constraints, and QoS constraints. Chan et al. [11] presented a fault taxonomy for service-oriented systems. In addition to the basic enabling protocols for SOA such as SOAP [12] and WSDL [13], protocols to enhance reliability on SOA systems have been proposed. WS-ReliableMessaging [14] allows SOAP messages to be delivered between services reliably. WS-Coordination [15] describes a framework enabling coordination of transactions and workflows in a heterogeneous environment. However, these protocols have not yet formed a complete SOA reliability solution, and sometimes it is inevitable to interact with services without the support of those protocols. As mentioned in the preceding section, fault tolerance is suitable for applications that require high reliability, and various FT implementations under service-oriented environments are available, such as [16–18], which makes it a viable option for SOA system engineers.

It is also noted that some models for evaluating the reliability of SOA systems have been proposed. Some focus on the reliability assessment of a single Web service. For example, Zheng et al. [19] designed a framework to retrieve the information of a large number of Web services worldwide, autogenerate testing modules, make testing invocations, and finally analyze the feedback from Web services to retrieve their reliability values. Other work, such as [20], evaluates the reliability of a composite service based on its structural information or, as in [4], considers basic FT behaviors on SOA systems. Existing models are helpful in understanding the QoS of service-oriented systems, though improvements in the SOA reliability models are still desired, including more detailed considerations of FT mechanisms as well as uncertainties of communication links. This paper covers three major fault-tolerance designs, namely, Recovery Block (RcB), N-version Programming (NVP), and Retry Block (RtB), without assuming the implementation details. A basic introduction to each design and considerations of its exceptional behavior are briefly presented in the following.

Recovery block (RcB) [3] is a backward recovery technique that uses the acceptance test (AT) to check the outputs of a module; the next alternative is invoked once the former alternative fails the AT. The executive in RcB is responsible for checkpoint establishment, checkpoint restoration, invocation of the alternatives, and successful return of RcB. RcB is vulnerable to failure under conditions such as when checkpoint establishment fails, checkpoint restoration or invocation of the next alternative fails, or the AT itself fails or no alternative passes the AT.

N-version programming (NVP) [3] is a forward recovery technique that uses the decision maker (DM) to vote on the outputs of all the involved alternative modules. The executive is responsible for input distribution and successful return of the NVP block. NVP is vulnerable to failure under conditions such as when input distribution to the alternatives fails, fewer than the required number of consistent and correct results successfully reach the DM, or the DM itself fails.
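The failure conditions of NVP can be illustrated with a short enumeration over the outcomes of the alternatives. The following Python sketch is only an illustration under stated assumptions: the DM is assumed to need a simple majority of correct results (the `quorum` default), link reliabilities are ignored, and the function name `nvp_reliability` as well as all numeric values are hypothetical rather than taken from this paper.

```python
from itertools import product
from typing import Optional, Sequence


def nvp_reliability(r_exec: float,
                    r_alts: Sequence[float],
                    r_dm: float,
                    quorum: Optional[int] = None) -> float:
    """Probability that an NVP block succeeds.

    Assumptions (not from the paper): the block succeeds when the
    executive works, at least `quorum` alternatives return correct
    results, and the DM works. Links are treated as perfectly reliable.
    """
    n = len(r_alts)
    if quorum is None:
        quorum = n // 2 + 1  # simple majority, an assumed voting rule

    enough_correct = 0.0
    # Enumerate every binary outcome configuration (1 = correct, 0 = failed).
    for outcome in product((0, 1), repeat=n):
        if sum(outcome) < quorum:
            continue  # the DM cannot find enough agreeing correct results
        prob = 1.0
        for ok, r in zip(outcome, r_alts):
            prob *= r if ok else (1.0 - r)
        enough_correct += prob

    return r_exec * enough_correct * r_dm


# Hypothetical 3-version block.
nvp_example = nvp_reliability(0.99, [0.9, 0.9, 0.9], 0.98)
print(f"NVP block reliability: {nvp_example:.4f}")
```

Enumerating all configurations costs time exponential in the number of alternatives, which is acceptable for the small redundancy degrees typical of NVP blocks.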

Retry block (RtB) [3] is a backward recovery technique that, like RcB, uses the AT to check the outputs of a module. In contrast to RcB, RtB retries the same module rather than using another module once the AT evaluation fails. The original design of RtB requires a data reexpression algorithm (DRA) to change the form of inputs before reusing the same module [3]. However, in Web applications or SOA systems, failures may be caused by temporarily busy services or network congestion (Heisenbugs), and it is not always possible to tailor the DRA for the application. Some designs therefore eliminate the DRA and use the same inputs on retry, and this simplified retry mechanism is widely supported in modern SOA execution environments. This paper considers the latter designs. RtB is vulnerable to failure under conditions such as when checkpoint establishment fails, checkpoint restoration fails, the AT itself fails, or the retry limit is exceeded before the AT passes.

It should be noted that although the main function is protected by the redundant designs/executions in the fault-tolerance blocks, the executive and the AT/DM are not protected and may be a potential reliability bottleneck. Several modified designs of RcB and NVP such as [21, 22] also introduce redundancy at these points, resulting in even more complex fault-tolerance design and higher costs. Therefore such improvements are not considered in this paper.

3. SOA System Reliability Modeling and Analysis

Before the introduction of the proposed model, several definitions of SOA systems and reliability modeling are listed here.

3.1. Framework of the Model

Without loss of generality, an SOA system may be viewed as a flow of services (called workflow) and be depicted by a BPMN diagram [23] as in Figure 1, which contains the single start event (the thin-lined circle), n abstract services (round rectangles), message transmission between services (arrows), branching points (diamonds), and the single end event (the thick-lined circle). Specifically, this research takes the viewpoint of on-demand business process rather than long running process, where at the request of a client, the system starts at the virtual start event, passes control to the next abstract service along the message path, and finally stops normally and transfers to the success state when the control reaches the virtual end point. The workflow may branch and converge at some points.

In the realization of the SOA system, an abstract service is often fulfilled by one or more physical services, which could be an atomic Web service or a composite Web service which invokes one or more atomic/composite services internally, and messages travel in the network under the protocols such as SOAP [12]. Services may also be protected by the FT techniques in the form of a composite Web service. It should be noted that in the real world, failures (or exceptions) may originate from any service, control point, or communication path. Unhandled failures interrupt the normal process flow and turn the system into a failure state.

A number of notations in this paper are defined as follows.

Reliability. This paper views the operation of SOA systems as an on-demand business process, and the reliability is the probability of successful executions. This kind of time-invariant reliability definition is used in traditional software reliability estimation [7, 24] and later widely adopted in existing research on SOA reliability modeling that also considers on-demand cases, such as [1, 25, 26].

System Reliability ($R$). It is the probability that a system execution finally reaches the success state, estimated by $R = 1 - F/N$, where $N$ is the total number of system executions and $F$ is the number of system executions that ended up in failure states.

Service Reliability ($R_i$). It is the probability of successful execution of physical service $i$, estimated by $R_i = 1 - F_i/N_i$, where $N_i$ is the total number of service invocations and $F_i$ is the number of service executions that threw unhandled exceptions.

Link Reliability ($R_{ij}$). It is the probability of successful message passing from physical service $i$ to physical service $j$, estimated by $R_{ij} = 1 - F_{ij}/N_{ij}$, where $N_{ij}$ is the total number of messages passed from service $i$ to service $j$ and $F_{ij}$ is the number of unsuccessful messages passed from service $i$ to service $j$.
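The estimators for system, service, and link reliability share the same form, namely one minus the observed failure fraction. A minimal Python sketch follows; the function name `estimate_reliability`, the service names, and the log counts are hypothetical and serve only as an illustration.

```python
from typing import Dict, Tuple


def estimate_reliability(total: int, failures: int) -> float:
    """Estimate reliability as 1 - failures / total from observed counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failures <= total:
        raise ValueError("failures must lie between 0 and total")
    return 1.0 - failures / total


# Hypothetical access-log counts: (total invocations, unhandled exceptions).
service_log: Dict[str, Tuple[int, int]] = {
    "search": (20000, 150),
    "payment": (8000, 96),
}
# Hypothetical link counts: (total messages, unsuccessful messages).
link_log: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("search", "payment"): (8000, 40),
}

service_reliability = {
    name: estimate_reliability(n, f) for name, (n, f) in service_log.items()
}
link_reliability = {
    pair: estimate_reliability(n, f) for pair, (n, f) in link_log.items()
}

for name, value in service_reliability.items():
    print(f"service {name}: {value:.4f}")
for (src, dst), value in link_reliability.items():
    print(f"link {src} -> {dst}: {value:.4f}")
```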

System Reliability Model. The SOA system reliability model is denoted by a 5-tuple [9, 10] consisting of the following elements:
(i) a finite set of states;
(ii) a state transition mapping, driven by a set of triggering events;
(iii) the virtual initial state with reliability 1, which transits to all the real initial states with their corresponding initiating probabilities; if there is exactly one initial state, the virtual initial state can be replaced by that state for simplicity;
(iv) the virtual success final state with reliability 1, which is reached from all the real final states with their corresponding terminating probabilities;
(v) a Markov transition matrix for the states, whose entries are the transition probabilities from one state to another.

To estimate the reliability of an SOA system, one may apply the following steps (Figure 2).
(1) Identify the workflow of the system and the physical services that realize each abstract service within the workflow. This may be obtained from process specifications, such as WS-BPEL [27] documents.
(2) Determine the reliability of each physical service and message link within the system. This may be estimated from the service access logs if they are available or be collected and estimated from user feedback.
(3) Determine the transition probabilities (or frequencies) between abstract services. Existing techniques for building operational profiles [28] may apply here.
(4) Identify the internal states and construct the Markov transition matrix from the information collected in Steps 1–3. This is explained further in Section 3.2.
(5) Derive the reliability estimation of the whole SOA system.
(6) Optionally, perform further analysis (such as sensitivities) and action based on the results of Step 5.

Note that, as reported by Zheng and Lyu [1], different users may actually experience different performance from the same service, since the service performance is influenced substantially by the communication links, and some SOA systems may provide different localized physical services depending on the users’ preferences or location. Therefore, in Steps 2 and 3, the information may be collected together (from the viewpoint of the system operator) or be classified by user group (from the viewpoint of specific user groups), depending on the purpose of the reliability analysis.

3.2. Construction of Markov Transition Matrix

In the system reliability model, the internal states are derived from abstract services. Each abstract service is mapped to a distinct macro-state. A macro-state consists of only one microstate unless its corresponding abstract service is guarded by fault tolerance, in which more than one tightly coupled microstate may be generated for that service. It should be noted that each microstate belongs to exactly one macro-state; each macro-state has exactly one entry micro-state and one exit micro-state and the two may overlap. Also remember that macro-state 1 is the initial state while the last macro-state is the final state as stated in Section 3.1. The transition matrix is obtained by In (2), , , and denote state reliability, state transition probability, and state message link reliability, respectively; the entry and exit micro-states of a macro-state are denoted by and ; and the inclusion relation of microstate to macro-state is denoted by .

In the first condition of (2), assuming that the corresponding physical services to micro-states and are services and respectively, then (It is the default case unless redefined later in this section.), , and . The method for retrieving the and values has been briefly explained earlier, and [9] has the interpretation of in detail under two major service composition schemes in SOA systems. The major problem left is to define for the macro-states whose corresponding abstract services are realized in more complicated way, especially in various FT blocks.

Generally, for nonfault-tolerant services, either atomic or composite, it is intuitive to define a macro-state with a single micro-state, which reduces to the first condition and it is therefore not required to define for them.

For any abstract service implemented in RcB, two approaches may be used. One way is to define micro-states separately for the executive, the AT, and each alternative. Assume that there are alternatives in total, and both the microstate numbers as well as the corresponding physical service numbers for (executive, alternatives from the first to last, AT) are (, , ). Then within the domain , is defined as follows:

Here a convenient substitution is used to simplify the formula. Further note that . The second approach is to equivalently define a single microstate with the state reliability computed by

The first approach is more intuitive while the second one results in more compact transition matrix. Refer to Appendix A for more on the RcB reliability formulation.
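To make the single-microstate view of RcB concrete, the following Python sketch sums the probabilities of the successful paths described in Appendix A. It is a sketch under stated assumptions and not necessarily identical to (4): link reliabilities are ignored, the executive is charged once for checkpoint establishment and once for each restoration, and the AT is charged once per evaluation. The names `rcb_reliability` and `rtb_reliability` and all numeric values are hypothetical. The RtB helper treats the retry limit as the total number of attempts, which is also an assumption.

```python
from typing import Sequence


def rcb_reliability(r_exec: float, r_alts: Sequence[float], r_at: float) -> float:
    """Sum the success-path probabilities of a recovery block.

    Path k: the checkpoint is established, alternatives 1..k-1 fail and
    are each detected by the AT and rolled back by the executive, and
    alternative k succeeds and passes the AT.
    """
    total = 0.0
    reach = r_exec  # probability that the current alternative is invoked
    for r_alt in r_alts:
        # The current alternative succeeds and the AT accepts its result.
        total += reach * r_alt * r_at
        # Otherwise: the alternative fails, the AT detects the failure, and
        # the executive restores the checkpoint and invokes the next one.
        reach *= (1.0 - r_alt) * r_at * r_exec
    return total


def rtb_reliability(r_exec: float, r_service: float, r_at: float,
                    attempts: int) -> float:
    """Retry block modeled as an RcB with `attempts` copies of one service."""
    return rcb_reliability(r_exec, [r_service] * attempts, r_at)


# Hypothetical 3-alternative recovery block and a 3-attempt retry block.
rcb_example = rcb_reliability(0.99, [0.90, 0.85, 0.80], 0.98)
rtb_example = rtb_reliability(0.99, 0.90, 0.98, attempts=3)
print(f"RcB: {rcb_example:.4f}, RtB: {rtb_example:.4f}")
```

Because the executive and the AT appear in every path, their reliabilities bound the reliability of the whole block no matter how many alternatives are added.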

For any abstract service implemented in NVP, assuming that there are alternatives in total and the corresponding physical service numbers for (executive, alternatives from the first to last, DM) are (, , ), then one can also define a single microstate with the state reliability computed by where is an indicator function such that for condition , and denotes a configuration set of binary values representing the outcomes of each alternative defined as follows: For any abstract service implemented in RtB, assuming that the retry limit is and the service numbers for (executive, the retry alternative, AT) are numbered (, , ), then its operation is equivalent to an RcB where there are duplicates of the alternative . One may create micro-states and define as in (3) or otherwise define a single microstate with the state reliability computed by

Finally, once the transition matrix $Q$ is completed, the system reliability is computed by $R = (-1)^{n+1} |E| / |I - Q|$ according to [7], where $I$ is the identity matrix of dimension $n$ and $E$ is the minor matrix obtained by eliminating the last row and the leading column of $I - Q$.
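The system reliability can also be obtained numerically as the entry in the first row and last column of the inverse of the identity matrix minus the transition matrix, which by the cofactor expansion equals the determinant ratio used in [7]. The Python sketch below solves the corresponding linear system with plain Gaussian elimination. The three-state workflow (state 1 always passes control to state 2, state 2 loops back to state 1 with probability 0.3 or proceeds to the virtual final state) and all numeric values are hypothetical, and the helper names are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def system_reliability(q: Matrix) -> float:
    """Return the (first row, last column) entry of inverse(I - Q)."""
    n = len(q)
    i_minus_q = [[(1.0 if i == j else 0.0) - q[i][j] for j in range(n)]
                 for i in range(n)]
    unit_last = [0.0] * (n - 1) + [1.0]
    # Solving (I - Q) x = e_n yields the last column of the inverse.
    return solve_linear(i_minus_q, unit_last)[0]


# Hypothetical entries: state reliability * branching probability
# (links assumed perfectly reliable here).
q_example: Matrix = [
    [0.0,        0.99 * 1.0, 0.0],
    [0.95 * 0.3, 0.0,        0.95 * 0.7],
    [0.0,        0.0,        0.0],  # virtual success final state
]
print(f"system reliability: {system_reliability(q_example):.4f}")
```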

3.3. Sensitivity Analysis

Based on the previous reliability modeling, this section tries to explore the system more deeply and identifies the critical services with respect to their impact on the overall reliability of an SOA system.

Here the sensitivity analysis is made at two levels [10]: a coarse-grained level (L1, denoted by sc) and a fine-grained level (L2, denoted by sf). L1 involves the sensitivity of each macro-state within the reliability model, while L2 involves the sensitivity of each service within the SOA system. Substantial modifications of the original sensitivity analysis have been made for SOA systems, and link reliabilities are also included in this study. For convenience of analysis, here it is assumed that each macro-state consists of exactly one microstate and no physical services are shared among different abstract services.

A number of notations are defined as follows:(i),(ii),(iii): cofactor of ,(iv): cofactor of .

In L1, (9) can be rewritten as where , , and . It is noted that none of , , or is a function of . In addition, where , , and for . It is also noted that none of , , or is a function of .

Then, and the sensitivity for microstate is is only defined over because macro-states and are virtual states. Related derivations of and are given in Appendix B.

Next in L2, the sensitivity analysis is made by different categories of the abstract services connected to each macro-state.

For a nonfault-tolerant service , since there is only one physical service associated with the macro-state, the state sensitivity is consistent with the service sensitivity. That is,

For a service implemented in RcB, assuming the same service number settings in Section 3.2 and recalling the state reliability for RcB in (4), then the sensitivity of the executive is the sensitivities of the alternatives are where and the sensitivity of the AT is

For a service implemented in NVP, note that (5) is not in a simple closed form. To simplify the discussion, here it is assumed that 3 alternatives in total are in the NVP block. Equation (5) then reduces to

The sensitivity of the executive is

The sensitivities of the alternatives are where and as well as can be derived similarly.

The sensitivity of the DM is

Equations for the sensitivities of the RtB services are similar to those for the RcB services.
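The analytical sensitivities can be cross-checked numerically. The Python sketch below approximates the partial derivative of the system reliability with respect to each state reliability by a central finite difference; this is a swapped-in numerical check, not the closed-form derivation of this section. It assumes that each transition-matrix entry is the product of state reliability, transition probability, and link reliability, and it reuses a hypothetical three-state workflow with a loop. All function names and numbers are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def build_q(r: List[float], p: Matrix, link: Matrix) -> Matrix:
    """Assumed entry form: state reliability * branching prob. * link rel."""
    n = len(r)
    return [[r[i] * p[i][j] * link[i][j] for j in range(n)] for i in range(n)]


def system_reliability(q: Matrix, tol: float = 1e-14,
                       max_iter: int = 10000) -> float:
    """Probability of reaching the last (virtual final) state from state 1.

    Fixed-point iteration of v_i = sum_j q_ij * v_j with v_last = 1.
    """
    n = len(q)
    v = [0.0] * (n - 1) + [1.0]
    for _ in range(max_iter):
        new = [sum(q[i][j] * v[j] for j in range(n)) for i in range(n - 1)]
        new.append(1.0)
        delta = max(abs(x - y) for x, y in zip(new, v))
        v = new
        if delta < tol:
            break
    return v[0]


def state_sensitivities(r: List[float], p: Matrix, link: Matrix,
                        h: float = 1e-6) -> List[float]:
    """Central-difference estimate of dR/dr_i for every non-virtual state."""
    result = []
    for i in range(len(r) - 1):  # the virtual final state is skipped
        up, down = list(r), list(r)
        up[i] += h
        down[i] -= h
        diff = (system_reliability(build_q(up, p, link))
                - system_reliability(build_q(down, p, link)))
        result.append(diff / (2.0 * h))
    return result


# Hypothetical workflow: 1 -> 2, then 2 -> 1 (0.3) or 2 -> final (0.7).
r_example = [0.99, 0.95, 1.0]
p_example = [[0.0, 1.0, 0.0], [0.3, 0.0, 0.7], [0.0, 0.0, 0.0]]
link_example = [[1.0] * 3 for _ in range(3)]

sens_example = state_sensitivities(r_example, p_example, link_example)
print([round(s, 4) for s in sens_example])
```

A finite difference only describes the local behavior around the current reliabilities, the same limitation that applies to any derivative-based sensitivity measure.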

4. Experiments and Discussion

In this paper, results and discussions for a vehicle rental system adapted from industrial practices [29] are presented. The workflow diagram (Figure 3) for the system has sequences, branching, and looping, which are common in real-world projects. According to the diagram, when a customer uses the service, a rental agreement is prepared in the system, and then, based on the customer’s preferences, the system searches for vehicle choices from the database. The customer may either accept a choice, call for more choices, or drop the service directly. Upon accepting the rental choice, the system then assists the user in contacting the insurance company and paying the security deposit through a third-party payment service. Finally, the system activates the rental agreement and the customer is guided to pick up the vehicle.

Two scenarios based upon this workflow specification are incorporated: scenario no. 1 is the fault-tolerance-free version where each abstract service is fulfilled by one physical (either internal or external) service, while scenario no. 2 uses RcB to ensure higher reliability for the payment operation. For each scenario, Table 1 displays the physical services involved and their reliabilities, Table 2 displays the transition (branching) probabilities between the abstract services, and Table 3 displays the link reliabilities between the physical services.

4.1. Reliability Modeling

Applying the Steps in Section 3.2, one can derive the transition matrix for scenario no. 1 and the transition matrix for scenario no. 2

Ten simulations of the system, each containing 100,000 system calls, were executed for each scenario. Table 4 compares the experimental results with the values from the proposed model. In Table 4, the 1st row is the reliability estimation by the proposed model, the 2nd row is from the manual probability computation as explained in Appendix C, the 3rd row contains the statistics of the simulation, and the last row is the relative error between the simulation results in the 3rd row and the theoretical values. It can be seen that the proposed SOA reliability model has identical values to those computed from probability theory, and the average simulation results are very close to the theoretical values. Comparing the two scenarios, it can also be seen that the introduction of RcB in the payment service (abstract service 5) improves the local service reliability by 11.36% (considering both the service invocation and the service execution, the payment service reliability is 0.874 in scenario no. 1 and 0.973 in scenario no. 2) and improves the overall system reliability by 5.2%.

It should also be clarified that the 0.00% simulation relative error in Table 4 is only the statistical value for a certain scenario at a certain precision level and should not be interpreted as 100% simulation accuracy.
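A simulation of this kind can be sketched as a random walk over the workflow in which every service execution and every message transfer is an independent Bernoulli trial. The Python sketch below is not the simulator used in the experiments; the three-state workflow, its reliabilities, the seed, and the number of runs are hypothetical, chosen so that the estimate can be compared with a closed-form value.

```python
import random
from typing import List

Matrix = List[List[float]]


def simulate_once(rng: random.Random, r: List[float],
                  p: Matrix, link: Matrix) -> bool:
    """Walk the workflow once; return True if the final state is reached."""
    state = 0
    last = len(r) - 1  # virtual success final state
    while state != last:
        if rng.random() >= r[state]:
            return False  # the service threw an unhandled exception
        # Pick the next state according to the branching probabilities.
        nxt = rng.choices(range(len(r)), weights=p[state])[0]
        if rng.random() >= link[state][nxt]:
            return False  # the message was lost on the link
        state = nxt
    return True


def simulate(runs: int, seed: int, r: List[float],
             p: Matrix, link: Matrix) -> float:
    """Fraction of successful executions over `runs` simulated requests."""
    rng = random.Random(seed)
    successes = sum(simulate_once(rng, r, p, link) for _ in range(runs))
    return successes / runs


# Hypothetical workflow: 1 -> 2, then 2 -> 1 (0.3) or 2 -> final (0.7).
r_sim = [0.99, 0.95, 1.0]
p_sim = [[0.0, 1.0, 0.0], [0.3, 0.0, 0.7], [0.0, 0.0, 0.0]]
link_sim = [[1.0] * 3 for _ in range(3)]

estimate = simulate(50000, seed=2024, r=r_sim, p=p_sim, link=link_sim)
analytic = (0.99 * 0.95 * 0.7) / (1.0 - 0.99 * 0.95 * 0.3)
print(f"simulated: {estimate:.4f}, analytic: {analytic:.4f}")
```

With a finite number of runs the estimate fluctuates around the analytic value, which is why a small reported relative error reflects the precision level rather than exact agreement.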

4.2. Sensitivity Analysis

Results of the sensitivity analysis for both scenarios are displayed in Tables 5 and 6, respectively. The sensitivity value of a state/service indicates its impact on the overall system reliability. It can also be observed from Figure 4 that abstract services with higher execution probabilities (such as abstract services 1 and 2) have higher sensitivity values. For abstract service 5 in scenario no. 2, it is also noted that the executive service and the AT service actually dominate the reliability of the RcB, as shown in the subgraph within Figure 4. Such a result is reasonable since both services are always on the execution paths for all the alternatives. On the other hand, the sensitivity values of the alternatives are significantly reduced (94% for physical service 6) in the RcB because their failures are largely masked by invoking the succeeding alternatives. Sensitivity analysis effectively quantifies the importance of each part within the system, which may be very useful for system engineers in determining reliability bottlenecks of the system and making further system improvement plans.

5. Threats to Validity

Users who adopt the proposed SOA reliability model and sensitivity analysis technique should be aware of the potential threats to validity. Factors limiting the generalizability of the results include the following.

Reliability Model. The proposed reliability model assumes independent failure occurrences between services. The Markovian property [30] also implies history independent system behavior. Possible bias should be taken into account when these assumptions do not hold in application.

Sensitivity Analysis. The derivative-based sensitivity analysis adopted in this paper is efficient for calculating sensitivity values at the current computing point. The major limitation of this technique lies in the narrow parameter input space, and higher-order sensitivity indices are also not explored [31]. Care must be taken when there are uncertainties in the model inputs or when there is strong interaction/dependency between the constituent services within the SOA system.

Subject Systems. The experiments are performed on a set of scenarios based on a real-world project presented in [29], and the settings of service reliability, link reliability, and transition probability are constructed in this research. To the best of our knowledge, no public reliability data source for SOA systems is available. Studies on SOA reliability modeling such as [4, 8, 20, 25] generated their own scenarios for validation either by simulation or by operating their own example systems. Therefore, care must be taken in applying the model to other subject systems or system configurations.

Performance Evaluation. The restrictions of the proposed model also apply in the experiments. Additionally, failures are generated binomially in the simulation. The evaluation results are valid only with respect to the conditions and system scales in the experiments.

6. Conclusions

This paper presents a Markov-based system reliability model for SOA systems. Starting from the workflow specification of an SOA system, we have shown how to map each part in the workflow to the model. The proposed reliability model also considers the internal mechanisms of three well-known fault-tolerance strategies such that the executive node, the AT/DM node, the alternatives, and the interactions between them are well reflected in the reliability model. Sensitivity analysis at two grain levels is also incorporated in this paper, which enables SOA system engineers to identify the reliability-critical blocks or internal services effectively and efficiently. Experimental results show that the proposed SOA model and methods give results very close to the theoretical and simulation values, and that the sensitivity analysis quantifies the effects of system structure as well as fault tolerance on the overall reliability.

Appendix

A. RcB Reliability Formulation

The reliability formulation may depend on the design variation. Since the implementation is not the focus of this paper, we assume the basic RcB scheme as described in Section 2, where an RcB with alternatives will successfully execute in one of the paths in Figure 5. Here, “AT passes” blocks in Figure 5 mean that the AT itself is functional and returns a true positive result, whereas “AT fails” blocks mean that the AT itself is still functional but returns a true negative or false result. Equation (4) can then be derived by summing the probabilities of all the above paths.

On the other hand, in (3), the first condition is for the checkpoint establishment, which is made by the executive (). The second one is for shifting the execution to the next alternative (), conditioning on the AT () successfully identifying the failure of the current alternative (), and successful checkpoint restoration by the executive (). The third one is for successful execution of the current alternative () and passes of AT (). The last one is for all remaining exceptional operations that cannot be guarded by the RcB.

An example is presented here to show the equivalence of (3) and (4). For a system with exactly one 3-RcB, where the executive, the three alternatives, and the AT are numbered from 1 to 5, respectively, by applying (3), the transition matrix is

By applying (4), the transition matrix is

It can be verified that and are equal.

B. Derivation of Sensitivity Analysis

Let be the transition matrix in the reliability modeling of an SOA system. Then,

The derivation of is similar, and therefore some of the steps are eliminated. For ,

C. Reliability Analysis by Probability Computation

The execution of the example system in Section 4 (Figure 3) can be broken down into a set of execution paths as follows (Figure 6).

Suppose that the physical services corresponding to abstract services 1–6 are numbered , respectively. For scenario no. 1, , , , , , and as in Table 1. For scenario no. 2, is replaced by the reliability of the RcB composite service consisting of physical services 5–9, whose value can be derived from (4).

Then, the following probabilities are computed:

In the previous equations, denotes transferring to abstract service upon completion of abstract service , and denotes not transferring to abstract service upon completion of abstract service .

Finally, the system probability is obtained by summing the probabilities of successful execution of each path as follows:

The manual path-based reliability computation reveals one potential advantage of our Markov-based method: it avoids the complexities of path analysis involving branches and loops, which may be tricky to automate in some cases.

Conflict of Interests

The authors have no financial relations with the commercial entities related to the technologies or standards covered in this paper.

Acknowledgments

The work described in this paper was supported by the National Science Council, Taiwan, under Grants NSC 101-2221-E-007-034-MY2 and NSC 101-2220-E-007-005. Further, the authors would like to thank the anonymous referees for their critical review and valuable comments.