Abstract
Reliability is a critical issue for component-based distributed computing systems, because some distributed software allows large numbers of potentially faulty components on an open network. Faults are inevitable in this large-scale, complex, distributed component setting, which may include many untrustworthy parts. How to provide highly reliable component-based distributed systems is a challenging and critical research problem. Generally, redundancy and replication are utilized to achieve fault tolerance. In this paper, we propose a CFI (critical fault iterative) redundancy technique, which makes efficient use of resources (e.g., computation and storage) to create fault-tolerant applications. When operating in an environment where the components' reliability is unknown, CFI redundancy is more efficient and adaptive than other techniques (e.g., K-Modular Redundancy and N-Version Programming). In the CFI redundancy strategy, the function invocation relationships and invocation frequencies are employed to rank the functions' importance and identify the most vulnerable functions, each implemented via functionally equivalent components. A tradeoff has to be made between efficiency and reliability. In this paper, a formal theoretical analysis and an experimental analysis are presented. Compared with existing methods, the reliability of component-based distributed systems can be greatly improved by tolerating faults in a small fraction of significant components.
1. Introduction
With technology scaling, Internet-based services such as cloud computing and volunteer computing have emerged, sharing resources (e.g., software, hardware platforms, and computation resources) to provide services on demand. Since Amazon proposed the Elastic Compute Cloud (EC2), cloud computing, which involves communication among multiple components over networks that are not fully reliable, has become one of the hottest research areas. As a typical cloud-based application, volunteer computing uses Internet-connected computers volunteered by their owners as the source of computing power and storage. It can support applications that are significantly more data-intensive or have larger memory or storage requirements. Compared with other types of high-performance computing (e.g., grid computing), volunteer computing has a high degree of diversity. The volunteered computers vary widely in terms of software and hardware type, speed, availability, reliability, and network connectivity, and the applications vary in their resource requirements and completion time constraints [1].
The reliability of cloud computing and volunteer computing is far from perfect in reality. Traditional reliability engineering uses fault forecasting, fault prevention, fault removal, and fault tolerance. However, how to build highly reliable and available component-based services is a challenging and urgently needed research problem in both academia and industry. How to make a tradeoff between efficient use of resources and system reliability should also be taken into account. Large numbers of redundant computing resources exist in cloud computing, especially in volunteer computing, which is based on unreliable volunteer resources. A well-known software fault tolerance technique called design diversity can be employed to tolerate faults in this setting. But when the reliability of each functionally equivalent component is low enough, traditional three-modular redundancy may not obtain consistent results in a single deployment. For instance, when the reliability of each functionally equivalent component is 0.55, the probability that three-modular redundancy gets two or three consistent correct results is $3(0.55)^{2}(0.45) + (0.55)^{3} \approx 0.575$. The correct result of three-modular redundancy in this setting may not meet the goal of high system reliability.
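The three-modular-redundancy figure for a component reliability of 0.55 can be checked with a short calculation. The following is a minimal sketch, assuming independent components with identical reliability and a binary correct/incorrect outcome; the function name `majority_reliability` is illustrative and does not come from the paper.

```python
from math import comb


def majority_reliability(r: float, k: int = 3) -> float:
    """Probability that at least (k + 1) // 2 of k independent components
    (each correct with probability r) return the correct result."""
    need = (k + 1) // 2
    return sum(comb(k, i) * r**i * (1 - r) ** (k - i) for i in range(need, k + 1))


# Three components, each with reliability 0.55.
print(round(majority_reliability(0.55), 5))  # 0.57475
```

A single three-way vote therefore succeeds only slightly more often than a single component does, which is the situation that motivates an iterative scheme.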
We present a CFI (critical fault iterative) redundancy technique in this paper, which uses redundant resources efficiently to gain high system reliability. We first construct a function ranking model based on a graph representation of the functions' invocation relationships and invocation frequencies. A function ranking algorithm is used to identify the Top-K significant functions via the invocation relationships and invocation frequencies. A new iterative redundancy technique is then proposed to enhance the system reliability; it does not require knowledge of the component reliability (assuming the reliability of components is greater than 0.5). In this paper, the concepts of function and component are interchangeable. However, when a function is executed via several functionally equivalent components, the reliability of the function differs from the reliability of an individual component. CFI, based on majority voting algorithms (such as TMR [2] and NVP [3, 4]), exploits the properties of distributed computation architectures to adapt more efficiently and to achieve the same level of system reliability at a lower cost factor. By using the function ranking algorithm, we observe that a function invoked frequently by other functions generally has a higher ranking score. On the other hand, functions invoked by functions with lower ranking scores get lower scores. CFI can be adapted to a dynamic environment by re-executing the function ranking algorithm. The key property of CFI redundancy is that resources can be assigned efficiently to the most vulnerable functions to improve the system reliability. The key advantage of CFI is that it is unnecessary to know the reliability of each component. To show the effectiveness of the proposed method, a theoretical analysis based on probability theory and an experimental analysis based on the Pajek simulation environment [5] are conducted.
The CFI method can be used by architecture designers and engineers of distributed computing or volunteer computing systems to design highly robust applications built from untrusted components.
The main contributions of this paper are summarized as follows.
(i) The paper introduces a novel iterative fault tolerance strategy called CFI that does not need to know the components' reliability. This extends redundancy techniques that require the components' reliability and extends the scenarios in which iterative redundancy can be applied.
(ii) We conduct function ranking inspired by the Google PageRank algorithm [6] and extend PageRank by adding invocation frequencies, in order to better identify significant functions for redundancy in complex component-based systems and to make appropriate cost and reliability tradeoffs.
(iii) A formal theoretical analysis based on probability theory and experiments are designed to compare the system reliability of different strategies.
(iv) Extensive experiments are designed to evaluate the implicit effects of the cost factor and of the percentage of significant functions given redundancy on system reliability.
The rest of this paper is organized as follows. The background and related works are introduced in Section 2. Section 3 presents a system model based on a ranking algorithm for identifying significant functions and an iterative redundancy algorithm for fault tolerance. Theoretical analysis and experimental results for the proposed strategy are given in Section 4. Section 5 presents implicit effects on system reliability. Section 6 concludes the paper.
2. Background and Related Works
Many Internet services, such as cloud computing, e-commerce, search engines, and volunteer computing, interact over unreliable networks. These systems utilize redundancy and replication to achieve high reliability. Distributed computation architecture (DCA) systems utilize highly parallel computing resources on dynamic networks; these resources are built from potentially faulty and untrusted components. Widely used DCA systems include the Hadoop project [7], which uses a distributed file system (DFS) to provide high-throughput access to application data and MapReduce for parallel processing of large data sets. BOINC (Berkeley Open Infrastructure for Network Computing) [8], a form of distributed computing in which the general public volunteers processing and storage resources to scientific research projects, is being used by a number of projects, including CAS@home, SETI@home, and Climateprediction.net [9]. Volunteer participants provide their idle computation resources to cure diseases, study global warming, discover pulsars, and do many other types of scientific research.
Oliner and Aiken [10] propose an online, scalable method for inferring the interactions among the components of large production systems, such as supercomputers, data center clusters, and complex control systems. This work uses the idea of computing correlations and delays between component signals. It converts raw logs into meaningful anomaly signals and then uses these signals to identify important relationships among components; this relationship information helps system administrators set early-warning alarms.
Automated vulnerability discovery (AVD) [11] presents a feedback-driven technique that automatically assesses the damage that a small number of malicious participant nodes can inflict on the performance of a large distributed system. The work focuses on the fact that the interface between correct and faulty nodes can help developers build high-assurance distributed systems. Smart redundancy for volunteer distributed computing, proposed by Brun et al. [12], demonstrates a redundancy strategy that ensures efficient replication of computation and data given finite processing and storage resources. However, a shortcoming of smart redundancy is that it targets a single computing task only.
Bondavalli et al. [13] propose progressive redundancy, a self-configuring optimistic programming technique aimed at component-based systems. It focuses on the problem of providing tolerance to both hardware and software faults in component-based hybrid fault tolerance architectures. But they only consider minimizing response time and typically allocate finite resources to each task.
The motivation of this work is the intuition that failures of critical components in a distributed computing system have a greater impact on system reliability; thus these critical components have higher fault tolerance requirements. By contrast, failures of noncritical components have less impact and call for less fault tolerance, especially when traditional three-modular redundancy may not obtain two or three consistent results in a single deployment.
3. Iterative Model and Fault Tolerance Strategy
The iterative redundancy strategy for vulnerability-driven fault tolerance consists of two steps. First, it identifies significant functions via the invocation relationships and invocation frequencies of interconnected functions, each accomplished by a single component or several functionally equivalent components. Then it uses an iterative strategy to tolerate the faults of unreliable components. The two steps are detailed below.
3.1. Function Ranking
The purpose of function ranking is to apply redundant execution by functionally equivalent components to the most significant functions (i.e., the most vulnerable functions for system reliability), in order to improve the system reliability and to balance system reliability against efficiency. The measure, based on the invocation relationships and frequencies between interconnected functions, comes from the intuition of PageRank [6] that web pages linked by large numbers of significant pages are also important. Since the failure of these significant functions has a heavier impact on the reliability of the whole system than the failure of other functions, these significant functions are the most vulnerable points for system reliability.
In a component-based distributed application, a weighted directed graph, called the Function Graph, can be modeled via invocation relationships and frequencies. A node $f_i$ in the graph represents a function accomplished by a single component or several functionally equivalent components. A directed link from $f_i$ to $f_j$ represents an invocation relationship between different functions, and a nonnegative weight value $w_{ij}$, where $0 \le w_{ij} \le 1$, represents the edge weight, which can be calculated by
$$w_{ij} = \frac{q_{ij}}{\sum_{k \in I(f_j)} q_{kj}}. \tag{1}$$
Here, $q_{ij}$ is the invocation frequency of the function pair $(f_i, f_j)$, $q_{ij} = 0$ represents that there is no invocation relationship between functions $f_i$ and $f_j$, and $I(f_j)$ is the set of incoming edges of $f_j$ (indexed by the invoking functions). By the definition of $w_{ij}$, a larger invocation ratio means that function $f_j$ is invoked more frequently by function $f_i$ than by the other functions in the set $I(f_j)$.
The weight of function $f_j$ is defined as the sum, over its incoming edges, of the edge weight multiplied by the weight of the invoking function $f_i$, such that
$$V(f_j) = \sum_{i \in I(f_j)} w_{ij} V(f_i). \tag{2}$$
The weights of all function nodes in the Function Graph sum to 1, such that $\sum_{i=1}^{n} V(f_i) = 1$.
Based on these definitions, the component-based ranking algorithm proceeds as follows.
(i) Randomly assign initial numerical ranking scores $V(f_i)$ to the nodes in the Function Graph, where $0 \le V(f_i) \le 1$.
(ii) Compute the ranking score for each function by the following:
$$V(f_j) = \frac{1-d}{n} + d \sum_{i \in I(f_j)} w_{ij} V(f_i), \tag{3}$$
where $n$ is the number of functions. The parameter $d$ is a damping factor that can be set between 0 and 1 and is employed to adjust the significance values derived from other functions. The resulting weight values $V(f_j)$ are affected by $d$, but the resulting ranking is insensitive to $d$. In the experiment of Section 4, when we set $d$ from 0.7 to 0.9, the result of function ranking is stable; thus we set $d$ to 0.85, which is similar to [6, 14]. From (3), the weight score of function $f_j$ is composed of the basic value $(1-d)/n$ and the weight scores of the functions that invoke $f_j$. Assume that $V$ is a vector of the functions' weights,
$$V = [V(f_1), V(f_2), \ldots, V(f_n)]^{T}, \tag{4}$$
and $W$ is a matrix of the invocation relationships,
$$W = [w_{ij}]_{n \times n}. \tag{5}$$
If $f_i$ has no function to invoke, we set $w_{ij}$ to $1/n$ in general. Therefore, the simultaneous equations can be rewritten in vector form
$$V = \frac{1-d}{n}\mathbf{1} + d\,W^{T}V, \tag{6}$$
where $W^{T}$ is the transposed matrix of $W$ and $\mathbf{1}$ is the all-ones vector. If we assume that the computing process is represented by a probabilistic state transition, the Function Graph can be seen as a Markov chain model. Therefore the weight of each function corresponds to the stationary state of the Markov chain.
(iii) Equation (6) can be solved by repeating the computation until all the ranking scores become stable. For the sake of simplicity, instead of repeating the computation of the Markov chain's stationary state, we solve it by computing the eigenvector with eigenvalue 1 in our experiments.
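The ranking procedure can be sketched as a fixed-point iteration. This is a minimal illustration, assuming the edge weights are invocation frequencies normalized over each function's incoming edges, as described above. The names `edge_weights` and `rank_functions`, the dictionary representation of the graph, and the example frequencies are all hypothetical. For simplicity the sketch omits the 1/n treatment of functions that invoke nothing and rescales the final scores so that they sum to 1.

```python
def edge_weights(freq):
    """freq[(i, j)] is how often function i invokes function j.
    Returns weights normalized over the incoming edges of each j."""
    incoming = {}
    for (_, j), q in freq.items():
        incoming[j] = incoming.get(j, 0.0) + q
    return {(i, j): q / incoming[j] for (i, j), q in freq.items() if q > 0}


def rank_functions(n, freq, d=0.85, tol=1e-12, max_iter=1000):
    """Iterate score(j) = (1 - d) / n + d * sum_i w_ij * score(i) until stable."""
    w = edge_weights(freq)
    v = [1.0 / n] * n
    for _ in range(max_iter):
        new_v = [(1.0 - d) / n] * n
        for (i, j), w_ij in w.items():
            new_v[j] += d * w_ij * v[i]
        delta = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    total = sum(v)
    return [x / total for x in v]  # rescaled so the scores sum to 1


# Hypothetical four-function system: f0 invokes f1 and f2, f1 invokes f2 and f3,
# and f2 invokes f3.
freq = {(0, 1): 3, (0, 2): 1, (1, 2): 2, (1, 3): 1, (2, 3): 4}
scores = rank_functions(4, freq)
ranking = sorted(range(4), key=lambda i: -scores[i])
print(ranking)  # [3, 2, 1, 0]
```

In this toy graph the function invoked by the two highest-scoring callers ends up first, and the function that nothing invokes ends up last with only the basic value.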
Figure 1 shows a function invocation graph with computed weights. Each node represents a function accomplished by a component, each edge weight represents the normalized invocation frequency from the invoking function to the invoked function, and the weights of the incoming edges of each node sum to 1. In this example, for the chosen damping factor, the resulting function significance ranking is given in Table 1: the function with the highest ranking score is invoked by two other functions, whereas the function with the lowest ranking score is invoked by only one. This ranking result agrees with the intuition that a function invoked by significant functions is also important for system reliability.
With the approach above, the Top-K most significant functions, whose weight scores are highest, have been identified. In the next subsection we use redundant execution of components to enhance the reliability of these functions and so obtain higher system dependability.
3.2. Critical Fault Iterative Strategy
In the function ranking step, the Top-K significant functions have been identified. To obtain high system reliability, functionally equivalent fault-tolerant components can be used. In this paper, the CFI redundancy strategy is proposed to improve the system reliability efficiently. For comparison, several well-known fault tolerance techniques are introduced first, and a formal analysis and a simulation-based empirical analysis of the system failure probability are presented.
3.2.1. Traditional Strategies
Primary Backup Replication (PBR). Primary backup replication and active replication are both well known in the area of distributed computing. Primary backup uses several replicas to improve the system reliability. One replica is assigned as the primary. It handles on-the-fly updates of the backups to limit the losses from primary replica failures, while keeping the cost of updating the replicas low. Active replication does not assign any replica as the primary, so it removes the centralized control of primary backup. All replicas receive the system's invocation and then reply with the result, so active replication incurs a high cost for keeping all replicas synchronized. Active replication costs more system resources than primary backup but minimizes the losses that occur when some replicas fail. Although their costs differ, primary backup and active replication obtain the same failure probability, which can be calculated by
$$f = \prod_{i=1}^{n} f_i,$$
where $n$ is the number of replicas and $f_i$ is the failure probability of the $i$th replica.
K-Modular Redundancy (KMR). K-Modular Redundancy (KMR) and N-Version Programming (NVP) are well-known fault tolerance strategies in software reliability engineering. These strategies perform $k$ functionally equivalent and independent executions in parallel and then take a majority vote to determine the final result. If there exists a consensus result whose number of votes is bigger than $k/2$, then this consensus result is taken to be the solution. The failure probability of KMR or NVP can be calculated by
$$f = \sum_{i=(k+1)/2}^{k} F(i),$$
where $k$ is the number of functionally equivalent and independent executions (assumed odd) and $F(i)$ is the probability that exactly $i$ executions fail. Supposing that the failure probability of each functionally equivalent component is $p$ and $k = 5$, when a component-based service distributes a job to these 5 components, the function's failure probability is $\sum_{i=3}^{5} \binom{5}{i} p^{i} (1-p)^{5-i}$. In other words, traditional KMR or NVP fault tolerance strategies obtain the system reliability at cost factor $k$, such that
$$R(r) = \sum_{i=0}^{(k-1)/2} \binom{k}{i} r^{k-i} (1-r)^{i},$$
where $r = 1 - p$ is the reliability of each component.
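The two traditional failure probabilities can be compared numerically. The sketch below assumes independent replicas and, for majority voting, an odd number of executions with identical per-component failure probability. The function names and the example failure probability of 0.1 are illustrative assumptions, not values taken from the paper.

```python
from math import comb, prod


def replication_failure(failure_probs):
    """Primary backup or active replication fails only if every replica fails."""
    return prod(failure_probs)


def kmr_failure(p, k):
    """Majority voting over k executions fails when at least (k + 1) // 2
    executions fail (k assumed odd, each failing independently with probability p)."""
    need = (k + 1) // 2
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


# Hypothetical setting: five components, each failing with probability 0.1.
print(replication_failure([0.1] * 5))
print(kmr_failure(0.1, 5))
```

Replication only needs one surviving replica, so its failure probability is much lower than that of a five-way vote; the vote in turn does not need to know which replica is trustworthy.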
Traditional redundancy strategies have different advantages and disadvantages. KMR and NVP must wait until all the redundant replicas have executed to determine the final result, while active replication takes the first replica response as the final result. The scenarios in which these redundancy strategies can be employed therefore differ. Active replication is employed in areas with strict response time constraints. Primary backup is widely used in commercial fault-tolerant systems.
3.2.2. Progressive Redundancy Strategy
Progressive redundancy is a step-by-step process designed for component-based distributed systems whose components are highly reliable and seldom return failed results. In this environment, traditional redundancy often reaches a consensus quickly, yet it still distributes jobs whose results cannot change the task's output. Progressive redundancy distributes as few jobs to functionally equivalent components as possible. Taking $k$-vote majority voting as an example, progressive redundancy first distributes only $(k+1)/2$ jobs. If all jobs completed by functionally equivalent components return the same result, that consensus result is regarded as the final result, because any additional computation is irrelevant. If some functionally equivalent components return disagreeing results, the server automatically distributes the minimum number of additional jobs needed to produce a consensus. This process is repeated until a consensus has been reached; the algorithm of the progressive redundancy strategy is shown in Algorithm 1.

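With majority voting, progressive redundancy produces a correct result when at most $(k-1)/2$ functionally equivalent components fail and return disagreeing results, so its reliability is
$$R(r) = \sum_{i=0}^{(k-1)/2} \binom{k}{i} r^{k-i} (1-r)^{i},$$
where $r$ represents the reliability of the functionally equivalent components and $k$ represents the cost factor of the underlying majority vote.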
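The progressive strategy described above can be sketched as a loop that keeps sending only as many jobs as the leading result still needs. This is an illustrative sketch, not the paper's Algorithm 1: the function name `progressive_redundancy`, the `run_job` callback (a hypothetical stand-in for dispatching one job to one functionally equivalent component), and the simulated component are assumptions.

```python
import random
from collections import Counter


def progressive_redundancy(run_job, k):
    """Return (consensus_result, jobs_used) for a k-vote progressive scheme.
    run_job() dispatches one job and returns that component's result."""
    consensus = (k + 1) // 2
    votes = Counter()
    jobs_used = 0
    to_send = consensus  # first wave: the bare minimum for a consensus
    while True:
        for _ in range(to_send):
            votes[run_job()] += 1
        jobs_used += to_send
        result, count = votes.most_common(1)[0]
        if count >= consensus:
            return result, jobs_used
        to_send = consensus - count  # only what the leading result still needs


# Simulated component that is correct with probability 0.9 (assumed value).
rng = random.Random(0)
result, used = progressive_redundancy(
    lambda: "correct" if rng.random() < 0.9 else "wrong", k=5
)
print(result, used)
```

With reliable components the first wave usually agrees, so most tasks finish with three jobs instead of the five that a traditional five-way vote would always spend.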
3.2.3. Critical Fault Iterative Redundancy
CFI redundancy assigns an appropriate number of components to each function according to the function ranking algorithm introduced in Section 3.1. It distributes the minimum number of functionally equivalent components needed to reach the desired system reliability. Since some components will fail, the results of functionally equivalent components may differ. If all the results agree, then the task assigned to these components is completed. If some of the components fail or return results that disagree with the majority, then the degree of confidence in the majority result decreases. For instance, suppose that the reliability of the functionally equivalent components is 0.75 and the desired reliability of the function accomplished by these components is 0.96. If the function server distributes the job to only one component, the probability that the result is correct is 0.75. But if the server distributes the job to 3 functionally equivalent components and all three return the same result, the degree of confidence that the consistent result is correct is $0.75^{3}/(0.75^{3} + 0.25^{3}) \approx 0.964$, so three is the minimum number of components with which the function can achieve the confidence threshold 0.96. However, if two of the three components return agreeing results and one returns a disagreeing result, the function server must obtain the agreeing result from at least two more components to achieve the confidence threshold 0.96. In this scenario, the number of independent components that should be allocated to the function to meet the required level of system reliability is determined by the CFI redundancy algorithm, as follows. The process is repeated until the gap between the majority result and the other results meets the system confidence threshold.
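The confidence values in this example can be reproduced with a small calculation. The sketch assumes binary results, independent components of equal reliability, and no prior preference between the two possible results; the function name `confidence` is illustrative.

```python
def confidence(r, a, b):
    """Confidence that the result reported by a components is correct when
    b components report the other result (component reliability r)."""
    agree_correct = r**a * (1 - r) ** b   # the a components are right
    agree_wrong = r**b * (1 - r) ** a     # the b components are right
    return agree_correct / (agree_correct + agree_wrong)


r = 0.75
print(confidence(r, 1, 0))  # one component alone
print(confidence(r, 3, 0))  # three agreeing components
print(confidence(r, 2, 1))  # two agree, one disagrees
print(confidence(r, 4, 1))  # two more agreeing results restore the gap of 3
```

The last two lines show why two additional agreeing results are needed after a 2-to-1 split: the confidence drops back to that of a single component and only returns above 0.96 once the gap is 3 again.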
By Bayes' theorem, we can draw the following conclusion. If the number of majority results ($a$) minus the number of other results ($b$) is constant (i.e., $a - b = d$, where $d$ is a constant), we get the same degree of confidence. For example, if a function is distributed to 10 functionally equivalent components, 8 of which return result A while the remaining 2 return other results, it gets the same confidence as when 108 components return result A and 102 components return other results. Suppose that $a + b$ functionally equivalent components, each with reliability $r$, are given jobs to complete a function; $a$ components return one result and $b$ components return the other result. If the result reported by the $a$ components is correct and the result reported by the $b$ components is wrong (e.g., the majority result is correct), the observation has probability proportional to $r^{a}(1-r)^{b}$; in the opposite case it has probability proportional to $r^{b}(1-r)^{a}$. Then, by Bayes' theorem, for all $a$, $b$ with $a - b = d$, the confidence that the majority result is correct is given as follows:
$$\frac{r^{a}(1-r)^{b}}{r^{a}(1-r)^{b} + r^{b}(1-r)^{a}} = \frac{r^{a-b}}{r^{a-b} + (1-r)^{a-b}} = \frac{r^{d}}{r^{d} + (1-r)^{d}}.$$
Corollary: whatever the reliability $r$ of the components is, if these components return result A $a$ times and other results $b$ times, the confidence that A is true depends only on $a - b$ and is independent of $a + b$. Let $X$ be a Bernoulli random variable with parameter $r$ that represents whether a component's result is correct, and let $d \ge 0$; then there exists a constant $c$ such that, for all $a$, $b$ with $a - b = d$, if out of $a + b$ samples of $X$ exactly $a$ agree on one result and the remaining $b$ agree on the other, the probability that the majority result is correct is the constant $c$. There are two possibilities: either the $a$ results are correct or the $b$ results are correct. If the $a$ results are correct, the probability of the observation is $\binom{a+b}{a} r^{a}(1-r)^{b}$. If the $b$ results are correct, the probability of the observation is $\binom{a+b}{a} r^{b}(1-r)^{a}$. Then
$$c = \frac{\binom{a+b}{a} r^{a}(1-r)^{b}}{\binom{a+b}{a} r^{a}(1-r)^{b} + \binom{a+b}{a} r^{b}(1-r)^{a}} = \frac{r^{d}}{r^{d} + (1-r)^{d}},$$
where $c$ depends only on $d$ (for a given $r$) and does not depend on $a + b$. Thus, we can conclude that the confidence is identical for all $a$, $b$ with $a - b = d$.
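The claim that only the gap between the two result counts matters can be verified with exact rational arithmetic. This is a small check under the same assumptions as the argument above (binary results, independent components of equal reliability); the function names are illustrative.

```python
from fractions import Fraction


def posterior(r, a, b):
    """Confidence in the result reported a times against b other reports.
    The binomial coefficient is the same in both cases and cancels."""
    right = r**a * (1 - r) ** b
    wrong = r**b * (1 - r) ** a
    return right / (right + wrong)


def gap_only(r, d):
    """Closed form that depends only on the gap d = a - b."""
    return r**d / (r**d + (1 - r) ** d)


r = Fraction(3, 4)
# 8 versus 2 and 108 versus 102 both have a gap of 6.
print(posterior(r, 8, 2), posterior(r, 108, 102), gap_only(r, 6))
```

Because `Fraction` avoids rounding, the three printed values are identical rather than merely close.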
We have now shown that the confidence in the result depends only on the difference between the number of majority results and the number of other results. For instance, if this difference $d$ is set to 3, a function is distributed to functionally equivalent components until 3 more components have reported one result than the other. We can then construct an automatic critical fault iterative (CFI) redundancy algorithm that meets the requirement of system reliability, shown in Algorithm 2.

Using Algorithm 2, we only need to determine the system reliability requirement factor (i.e., $d$); the system reliability is then $R(r) = r^{d}/(r^{d} + (1-r)^{d})$, where $r$ represents the reliability of the functionally equivalent components. The algorithm first distributes $d$ jobs to functionally equivalent components and computes the difference between the number of jobs reporting the majority result and the number reporting other results; it then iterates, automatically distributing the minimum number of additional jobs, until $d$ more jobs have reported one result than the others.
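The iterative procedure can be sketched as follows. This is an illustration of the described behavior rather than a transcription of Algorithm 2: the function name `iterative_redundancy`, the `run_job` callback (a hypothetical stand-in for dispatching one job to one component), and the simulated component are assumptions.

```python
import random
from collections import Counter


def iterative_redundancy(run_job, d):
    """Distribute jobs until one result leads all other results by d.
    Returns (accepted_result, jobs_used)."""
    votes = Counter()
    jobs_used = 0
    to_send = d  # first wave: d agreeing results would already suffice
    while True:
        for _ in range(to_send):
            votes[run_job()] += 1
        jobs_used += to_send
        result, count = votes.most_common(1)[0]
        lead = count - (jobs_used - count)  # majority minus all other results
        if lead >= d:
            return result, jobs_used
        to_send = d - lead  # minimum number of jobs that could close the gap


# Simulated component with an assumed reliability of 0.75; the loop itself
# never uses this value, only the requirement factor d.
rng = random.Random(1)
result, used = iterative_redundancy(
    lambda: "correct" if rng.random() < 0.75 else "wrong", d=3
)
print(result, used)
```

Note that the component reliability appears only in the simulated component, not in the loop, which mirrors the point that the strategy can be run without knowing it.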
To satisfy the system reliability requirement factor $d$, $a + b$ jobs are distributed to functionally equivalent components for execution, where $a$ components return the same result and $b$ components return the other result, with $a - b = d$. The cost factor of iterative redundancy, that is, the expected number of jobs distributed, is as follows:
$$C(r) = \frac{d}{2r - 1} \cdot \frac{r^{d} - (1-r)^{d}}{r^{d} + (1-r)^{d}},$$
where $r$ is the reliability of the functionally equivalent components. For a large requirement factor $d$, the cost factor can be approximated by $d/(2r - 1)$.
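The expected number of jobs can be estimated by treating the gap between the two results as a biased random walk that stops when it reaches +d or -d. The sketch below uses the standard gambler's-ruin expected duration for such a walk, which is an assumption about how jobs are counted; the function names are illustrative.

```python
def expected_jobs(r, d):
    """Expected number of jobs until one of two results leads by d, for
    independent components of reliability r > 0.5 (gambler's-ruin duration)."""
    q = 1 - r
    return d / (r - q) * (r**d - q**d) / (r**d + q**d)


def approx_jobs(r, d):
    """Large-d approximation d / (2r - 1)."""
    return d / (2 * r - 1)


for d in (1, 3, 5, 9):
    print(d, round(expected_jobs(0.75, d), 4), round(approx_jobs(0.75, d), 4))
```

For small requirement factors the approximation overestimates the cost somewhat, and the two values converge as the requirement factor grows.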
4. Experiment Results
In this section, we compare the improvement of system reliability based on the CFI strategy with traditional strategies and discuss the experiment results.
4.1. Experimental Framework
We use Pajek [5], a tool that can generate scale-free directed function graphs, to simulate a component-based distributed system. A scale-free graph is a graph whose degree distribution follows a power law [15]. Large self-organizing networks, such as the Internet, the World Wide Web, and social and biological networks, often exhibit power-law degree distributions. Four fault tolerance approaches are compared to study how CFI redundancy improves system reliability:
(i) NoR: no fault tolerance strategy is employed for the functions in the component-based system;
(ii) RandomR: functions are selected at random, and a fault tolerance strategy is employed to improve the reliability of these functions;
(iii) CFIR: the function ranking algorithm is used to identify the most vulnerable Top-K functions, and iterative redundancy is employed for them to improve the system reliability;
(iv) AllR: a fault tolerance strategy is used for all functions.
To simulate the invocation behavior and invocation relationships of the component-based distributed system, we perform random traces over the scale-free directed graph generated by Pajek. A node in the directed graph stands for a function accomplished by a single component or several functionally equivalent components, an edge stands for an invocation relationship, and the weight of the edge is used to simulate the invocation probability or invocation frequency. During the execution of the component-based system, an initial node is randomly selected, and a random trace starting from the selected function is performed. We regard the execution as failed if an invoked function fails; a failure probability is assigned to the functions provided by these functionally equivalent components. If a fault tolerance strategy is employed for the invoked functions, the reliability of these functions is improved. We conducted 100 traces for each generated scale-free directed graph. The four fault tolerance strategies (NoR, RandomR, CFIR, and AllR) were deployed for these traces, and the simulation results were averaged.
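The trace-based experiment can be sketched as a weighted random walk over the function graph. This is a simplified illustration under stated assumptions: the graph is a plain dictionary rather than Pajek output, a trace stops after a fixed number of invocations (`max_steps`, a hypothetical parameter because the trace length is not given here), and the per-function failure probabilities are supplied directly.

```python
import random


def simulate_failure_rate(graph, fail_prob, traces=100, max_steps=10, seed=0):
    """graph[f] maps each function invoked by f to its invocation weight.
    Returns the fraction of traces in which some invoked function failed."""
    rng = random.Random(seed)
    nodes = list(graph)
    failures = 0
    for _ in range(traces):
        node = rng.choice(nodes)  # random initial function
        for _ in range(max_steps):
            if rng.random() < fail_prob[node]:
                failures += 1  # the whole execution counts as failed
                break
            successors = graph[node]
            if not successors:
                break  # this function invokes nothing, so the trace ends
            targets, weights = zip(*successors.items())
            node = rng.choices(targets, weights=weights)[0]
    return failures / traces


# Hypothetical three-function system; function "c" is protected by redundancy.
graph = {"a": {"b": 3, "c": 1}, "b": {"c": 2}, "c": {}}
unprotected = {"a": 0.05, "b": 0.05, "c": 0.05}
protected = {"a": 0.05, "b": 0.05, "c": 0.001}
print(simulate_failure_rate(graph, unprotected, traces=1000))
print(simulate_failure_rate(graph, protected, traces=1000))
```

Comparing strategies then amounts to changing which functions receive the lower failure probability, for example the top-ranked functions versus a random selection.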
4.2. Reliability Comparison of Distributed Computing System
Different fault tolerance strategies yield different system failure probabilities. The results of the experiment are shown in Table 2. In the experiment, a scale-free directed function graph with 5000 nodes is generated by Pajek. In all the experiments we simulated, AllR gets the lowest system failure probability, while NoR gets the highest. These results are intuitive, since AllR employs redundancy for all the functions while NoR provides no fault tolerance for any function.
In the experiment, since the failure probability of a component is less than 1%, we simply set the function requirement factor to 3. In this setting, a function accomplished by functionally equivalent components obtains high reliability. Compared with NoR, RandomR does not noticeably improve the system reliability. This observation indicates that applying fault tolerance to functions that are not frequently invoked is of little use, because the failures of these nonsignificant functions have less impact on the system reliability.
The CFI redundancy strategy makes a tradeoff between system reliability and cost factor. AllR obtains better system reliability than CFI in all the simulated experiments, but at a higher cost. In all our experiments, the CFI fault tolerance strategy obtains better reliability than RandomR. Because the significant functions identified by the function ranking step are invoked more frequently, their failure has a greater impact on the component-based distributed system. Tolerating failures of these significant functions therefore achieves better system reliability than tolerating failures of randomly selected functions.
When the component failure probability increases from 1% to 5% and 10%, the system failure probabilities of all four strategies (NoR, RandomR, CFIR, and AllR) increase greatly. This is because, when the number of failed components increases greatly, tolerating only the failures of frequently invoked functions is not enough to provide a highly reliable system.
5. Implicit Effects on System Reliability
5.1. Implicit Effects of Cost Factor on Reliability
To study the impact of the cost factor on the failure probability of the component-based distributed system, the iterative redundancy method proposed in this paper, CFI redundancy (CFIR), is compared with traditional majority voting redundancy (MajorR). The cost factor is varied from 3 to 17 with a step of 2. The number of functions in this experiment, created by Pajek, is 1024. Table 3 shows that CFIR outperforms MajorR for all the cost factors regardless of the redundant percentage deployed (e.g., Top 1%, Top 5%, and Top 10%). As the cost factor increases from 3 to 17, the system failure probabilities of both redundancy methods decrease.
In the corollary of the system model, we have shown that it is unnecessary to know the reliability of each component to implement the CFI redundancy strategy (assuming the reliability of each component is bigger than 0.5). Therefore, the system architect only needs to specify how much improvement is required to enhance the system reliability. Figure 2 shows that if the reliability of each component is higher than 0.75, the iterative redundancy algorithm just needs to set the requirement factor to 4; the reliability of the function accomplished by these functionally equivalent components will then be higher than 0.95. The higher the component reliability, the smaller the cost factor needed to achieve high system reliability. Therefore, an architect who knows the component failure probability can choose the requirement factor more effectively.
In some real-time systems with strict time constraints, a traditional fault tolerance strategy such as three-modular redundancy can deploy a job to three components at once. With the CFI redundancy strategy, a job must first be deployed to several components, and the results must be awaited before deciding whether to deploy more jobs to functionally equivalent components. The response time depends on the requirement factor and the component failure probability, so CFI redundancy increases the response time for some jobs that need high reliability. In this case, more jobs can be deployed to functionally equivalent components at once to decrease the response time.
5.2. Implicit Effects of Top-K on Reliability
To study the impact of the redundant percentage on system reliability, we set different redundant percentages of components and compare CFIR with MajorR. The results are shown in Table 3. The trend of the system failure probabilities under different redundant percentages is shown in Figure 3.
Figure 3: System failure probabilities under different redundant percentages. Panels: (a) Top-K = 1%; (b) Top-K = 5%; (c) Top-K = 10%; (d) Top-K = 20%.
We can conclude that when the redundant percentage increases, the failure probabilities of CFIR and MajorR decrease. Under different component redundant percentage settings, the CFIR strategy consistently outperforms MajorR in the range from Top-K = 1% to Top-K = 10%. When the component failure probability is high, a larger cost factor and a larger component redundant percentage are needed to obtain higher system reliability.
5.3. Implicit Effects of Component Failure Probability on Reliability
We compare AllR, CFIR, RandomR, and NoR under component failure probabilities from 1% to 9%. The trend of the system failure probabilities for different redundant percentages under these component failure probability settings is shown in Figure 4. As the failure probability of the components increases from 1% to 9%, the system failure probability of all four methods (AllR, CFIR, RandomR, and NoR) becomes larger. CFIR outperforms RandomR in all the settings and makes more effective use of redundant components.
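Figure 4: System failure probabilities under different component failure probabilities. Panels: (a) Top-K = 1%; (b) Top-K = 5%; (c) Top-K = 10%; (d) Top-K = 50%.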
6. Conclusion
The paper proposes a CFI redundancy strategy that improves on existing techniques by using resources more efficiently, especially in environments where the failure probability of components is high. The CFI redundancy strategy includes two steps: function ranking and iterative redundancy. In function ranking, the significance of a function is determined by the functions that invoke it and by the weight scores of these invoking functions. In the iterative redundancy step, different cost factors are deployed to functions with different ranking scores, each accomplished by functionally equivalent components, in order to make a tradeoff between system reliability and cost. In future work, when computing the function ranking, we will consider the components' failure exposure probability and the effect of failure propagation, and we will take invocation latency, concurrency, throughput, and component failure correlations into account when computing the weights of invocation relationships.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors thank Kun Jiang for the contributions to our modeling framework and simulation experiments, as well as Ling Zhou for the input and suggestions. This work was supported by the Natural Science Foundation of China under Grant no. 60973122 and the 863 Hi-Tech Program in China under Grant no. 2011AA040502.