Abstract

In cloud computing, the efficacy of troubleshooting is determined by four measurement criteria: (I) priority, (II) fault probability, (III) risk, and (IV) the duration of the repair action. In this paper, we propose a new theoretical algorithm for constructing a fault-troubleshooting model by combining a Naïve-Bayes classifier (NBC) with a multivalued decision diagram (MDD) and an influence diagram (ID), which together provide an unambiguous model of the connections between the significant entities. First, the NBC establishes the fault probability based on a Naïve-Bayes probabilistic model for fault diagnosis. This consists of three steps: (I) determining the network parameters, which express the probabilistic dependency relationships among all nodes; (II) determining the structure of the network topology; and (III) propagating probabilities, that is, computing the probability that each node is faulty given the evidence. Second, the MDD combines the influence of the four measurements and computes a utility value that prioritizes the actions in each troubleshooting step, thereby determining which fault is selected for repair. We demonstrate the procedure using the failure of a host server to start up as a case study. The approach is highly efficient and enables low-risk fault troubleshooting in cloud computing.

1. Introduction

Cloud computing refers to flexible, self-service, network-accessible pools of computing resources that can be allocated on demand, allowing thousands of virtual machines (VMs) to be used even by small organizations. As the scale of cloud deployments continues to grow, scalable monitoring tools that support the unique requirements of cloud computing become an ever higher priority [1]. Endowing clouds with fault-troubleshooting abilities, so as to manage and reduce the existing and still-growing complexity of a utility cloud [2], is difficult but clearly worthwhile. Troubleshooting faults in the cloud remains relatively unexplored, and no universally accepted tool chain or system exists for the purpose. In the presented work, fault detection is defined as the problem of classifying the time instances during runtime at which the cloud is experiencing anomalies and faults [3]. A useful algorithm for this problem must provide the following: (a) high detection rates with few false alarms; (b) a technique that does not depend on supervision, because a priori knowledge of normal or anomalous behavior and circumstances is insufficient; (c) an autonomic methodology, so that personnel costs do not grow with cloud scale. Additionally, a cloud is typically used by multiple tenants, and multiple VMs may be colocated on the same host server [4]. Because resources such as CPU usage, memory usage, and network overhead are not virtualized, a complex dynamic exists between the VMs, which have to share these resources. The increased complexity is hugely problematic in terms of troubleshooting [5].

Current monitoring tools are service tools such as Nagios and Ganglia [7, 8], which are designed primarily to obtain low-level metrics including CPU, disk, memory, and I/O, but are not designed for service monitoring in dynamic cloud environments. Current apparatus are founded on threshold methods, which are the most frequently employed in commodity industry monitoring. First, upper and lower bounds are defined for all of the metrics. The threshold values come from performance knowledge together with predictions rooted in long-term analysis of historical data. Violation of a threshold limit by any observed metric triggers an anomaly alarm. However, this approach is unsuitable for environments with dynamic changes, is susceptible to high false-alarm rates, and is expected to perform poorly in large-scale cloud computing [6]. Therefore, in this paper we propose a new theoretical approach for constructing the steps of a fault-troubleshooting model. The model combines a Naïve-Bayes classifier (NBC) with a multivalued decision diagram (MDD) and an influence diagram (ID) to structure and manage fault troubleshooting. An NBC is a simple probabilistic classifier based on the application of a Bayesian network (BN) with independence assumptions between predictors [9, 10]. The NBC is a probability-based modeling technique that is well suited to knowledge-based diagnostic systems and ultimately allows modeling and reasoning under uncertainty. The NBC and the ID were developed in the artificial intelligence community as principled formalisms for representing and reasoning under uncertainty in intelligent systems [11]. This technique is therefore ideally suited to diagnosing real-world complications in real time, where uncertain and incomplete datasets exist [12]; it offers an appropriate mechanism for diagnosing complicated virtualization systems in the cloud.

The major advantage of using a BN is its ability to represent knowledge explicitly and hence make it understandable. A BN comprises two parts: qualitative knowledge through its network structure and quantitative knowledge through its parameters. Because expert knowledge from practitioners is mostly qualitative, it can be used directly for building the structure of a BN [13]. In addition, data mining algorithms can encode both qualitative and quantitative knowledge simultaneously in a BN. BNs can therefore bridge the gap between different types of knowledge and unify all available knowledge into a single form of representation [14]; moreover, an NBC is suitable for producing probability estimates rather than bare predictions. These estimates allow predictions to be ranked and their expected costs to be minimized. The NBC also provides a high degree of accuracy and speed when applied to large datasets [5]. An MDD is a generalization of the binary decision diagram (BDD), which has been used to model static multistate systems for reliability and availability analysis. MDDs are an effective mathematical method for representing multiple-valued logic (MVL) functions of large dimensions, and they have recently been adapted for the reliability analysis of fault-tolerant systems [15, 16]. The MDD can quickly categorize the degree of risk of consequences according to the symptoms [17]. In relation to prior work, the NBC and the MDD have been widely used in research projects and applications that require dependability, self-diagnosis, and monitoring abilities, especially fault-diagnosis and monitoring systems [18]. Zhai et al. [5] proposed a method for analyzing the multistate system (MSS) structure function with a BN, an effective approach for the analysis and estimation of high-dimensional MVL functions. Sharma et al. [6] developed CloudPD, the first end-to-end fault management system capable of detecting, diagnosing, classifying, and suggesting remediation actions for virtualized cloud-based anomalies. Wang et al. [19] proposed EbAT, a system for anomaly identification in data centers that analyzes system metric distributions rather than individual metric thresholds. Smith et al. [20] proposed an availability model that combines a high-level fault tree model with a number of lower-level Markov models of a blade server system (IBM BladeCenter). Xu et al. [21] proposed an approach based on BNs for error diagnosis. However, a comprehensive analysis indicates that these researchers use Bayes' theorem only for reasoning and for calculating the fault probability, while relying on threshold-based methods and other techniques for fault/anomaly detection, whose sensitivity must be tuned extensively to prevent an unacceptable number of false positives. Furthermore, these methods suffer from an extremely high number of false alarms because they monitor individual metrics rather than metric combinations, which would be more desirable.

Thus, in contrast to previous research and contributions, this paper addresses fault troubleshooting in cloud computing. The objective of this work is to monitor, classify, and analyze collections of metrics rather than individual metric thresholds, extending fault diagnosis into troubleshooting while still considering multiple decision measurements, including the priority, fault probability, risk or severity, and duration of the construction steps for fault detection and repair actions. The contribution of this paper is a new theoretical algorithm that constructs the steps of a fault-troubleshooting model by combining an NBC with an MDD and an ID to structure and manage fault troubleshooting for cloud anomaly detection. The approach takes multiple measurement criteria into account, namely, the priority, fault probability, risk, and duration of the repair, when making high-priority decisions about repair actions. The NBC is used to determine the fault probability based on a Naïve-Bayes probabilistic model for fault diagnosis, whereas the MDD and ID subsequently combine the impact of the four measures and compute the utility value, along with the priority of the troubleshooting steps, for each action to determine which fault is selected for repair. The repair-recovery decision is determined by the utility coefficient and the priority of each repair action. The case-study used in this work is the failure of a host server to start up. The theoretical proposition ensures that the most sensible action is carried out during the troubleshooting procedure and yields the most effective and cost-saving fault repair through three construction steps: (I) determining the network parameters, which indicate the probabilistic dependency relationships among all of the nodes; (II) determining the structure of the network topology; (III) assessing the probability of the fault being propagated, as shown in Figure 1.

The rest of the paper is organized as follows. Section 2 describes the fault-troubleshooting analysis based on the MDD, the determination of the action set, and the measurement criteria. Section 3 introduces the concepts of the NBC and ID models. Section 4 focuses on modeling uncertainty using the proposed approach and presents our ongoing work toward developing a framework based on the NBC and its ID extension, including the network topology and the parameter-determination phase. Section 5 explains probability propagation in the NBC model. Section 6 explains the troubleshooting-decision process. Section 7 describes the experimental setup for a single data center. Section 8 presents the method implementation, evaluation results, and a comparison with current troubleshooting methods. Finally, conclusions and future work are presented in Section 9.

2. Fault-Troubleshooting Analysis Based on MDD

A troubleshooting process aims to detect and fix faults with high efficiency and low risk. The success of each fix is determined by its fault probability, risk, duration, and priority. The MDD can rapidly categorize the degree of risk of faults according to the symptoms [11]. MDD analysis is a procedure covering everything from the statement of the problem to the execution of the final action. The most important notions for troubleshooting faults are given in Table 1.

2.1. Determination of the Action Set and Measurement Criteria

There are four categories of causes of start-up failure in a cloud system: the host server, the host operating system, the virtual machine monitor (VMM) [22], and the VMs, as shown in Figure 2. In this paper, the failure of the host server to start up is used as a case-study to illustrate the method. The fault-troubleshooting procedure is separated into two steps: the first step ascertains the fault group that needs to be troubleshot, and the second step determines which part should be troubleshot. A set of actions is defined for each step. In step one, a host-server start-up failure can result from six types of failure, as shown in Figure 2: CPU utilization (CPU-utilization fault), memory (memory-leak fault), I/O storage (throughput fault), network (bandwidth fault), and other factors such as power and cooling failures; thus, the action set in the first step comprises repairing the network, CPU, memory, storage, power, or cooling. If a repair action on the CPU utilization is selected in the first step, step two will select from a set of actions to repair host-CPU-usage, VM-CPU-usage, and VM-CPU-Ready-Time errors. As mentioned before, to make a troubleshooting decision, troubleshooters must examine the fault probability of each component as well as the risk, duration, and priority of repairing the component faults. Four evaluation measurement criteria are determined: (I) probability of component failure; (II) duration; (III) risk; and (IV) priority.

Fault probability indicates the likelihood of a faulty component causing the failure of the host server to start up. Duration is defined as the time in minutes to complete the repair action. Risk is defined as the degree of risk (normal, minor, or serious) of creating additional new faults and concerns the safety issues the troubleshooter must be aware of during the repair action. Priority is the service-priority level of the repair action. Of the four criteria, the fault probability is uncertain, and the NBC diagnostic model is used to determine it. The other three are fixed for a particular action and can be inferred from expert knowledge accumulated over time through investigation and time-consuming analysis.

2.2. Determining the Weights of the Measurements

The MDD assesses the effect of multiple measurements by computing a utility value in which each criterion has a specific weight. The ranking method and the pair-wise comparison method are commonly used to determine the measurement weights. In this paper, the ranking method orders the criteria according to the importance judged credible by the decision makers and determines the weights accordingly [23]. The following formula is used for computing the weights:

$$w_j = \frac{n - r_j + 1}{\sum_{k=1}^{n}\left(n - r_k + 1\right)},$$

where $w_j$ is the weight of measurement criterion $j$, $n$ is the number of decision criteria, and $r_j$ is the rank of criterion $j$ in the hierarchy of importance. Based on the domain expert's opinion, the order of importance of the measurements is as follows:
(1) Fault probability: first
(2) Severity/risk: second
(3) Priority: third
(4) Duration: fourth.

As seen in Table 2, sample weights for each measurement are calculated with the ranking method.
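To make the ranking step concrete, the following minimal Python sketch computes rank-sum weights for the four criteria, under the assumption that the ranking method used here is the standard rank-sum formula given above; the resulting values are illustrative, and the authoritative weights are those of Table 2.

    # Rank-sum weighting: weight_j = (n - rank_j + 1) / sum_k (n - rank_k + 1)
    def rank_sum_weights(ranks):
        """ranks: dict mapping criterion name -> rank (1 = most important)."""
        n = len(ranks)
        scores = {c: n - r + 1 for c, r in ranks.items()}
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}

    weights = rank_sum_weights({"fault probability": 1, "severity/risk": 2,
                                "priority": 3, "duration": 4})
    print(weights)  # fault probability: 0.4, severity/risk: 0.3, priority: 0.2, duration: 0.1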

3. Determining Fault Probability Using NBC and ID

3.1. The Naïve-Bayes Probabilistic Model

A Naïve-Bayes classifier [24] is a function that maps input feature vectors $x = (x_1, \dots, x_n) \in X$ to output class labels $C_k$, where $X$ is the feature space. Abstractly, the probability model for a Bayes classifier is the conditional model $P(C_k \mid x)$, the so-called posterior probability distribution [25]. Applying Bayes' rule, the posterior can be expressed as

$$P(C_k \mid x) = \frac{P(C_k)\,P(x \mid C_k)}{P(x)},$$

where $P(C_k \mid x)$ is the posterior probability of the class (target) given the predictor (attribute); $P(C_k)$ is the prior probability of the class; $P(x \mid C_k)$ is the likelihood of the predictor given the class; and $P(x)$ is the prior probability of the predictor (the evidence).

The simplest classification rule assigns an observed feature vector $x$ to the class with the maximum posterior probability:

$$\hat{C} = \arg\max_{C_k} P(C_k \mid x).$$

Because $P(x)$ is independent of $C_k$ and does not influence the $\arg\max$ operator, this can be written as

$$\hat{C} = \arg\max_{C_k} P(C_k)\,P(x \mid C_k).$$

This is known as the maximum a posteriori probability (MAP) decision rule [26].

The BN model established for fault diagnosis involves three aspects of work [27]:
(1) determining the network parameters, which indicate the probabilistic dependency relationships between all nodes;
(2) determining the structure of the network topology;
(3) probability propagation, that is, calculating the probability of each node given the evidence.

In this paper, the purpose of building an NBC model is to determine the most likely cause of a fault given a fault symptom, that is, to compute the posterior probability of each fault cause given the evidence [28]. The posterior probability is calculated from the joint probability. To simplify this calculation, the NBC makes conditional-independence assumptions: any node is independent of unlinked nodes and depends only on its parent node. Thus, for any node $x_i$ in the set of nodes $\{x_1, \dots, x_n\}$ with parent $\pi(x_i)$, $x_i$ is conditionally independent of all other nodes except $\pi(x_i)$. By definition, the conditional-independence terms are

$$P(x_i \mid x_1, \dots, x_{i-1}) = P(x_i \mid \pi(x_i)).$$

Figure 3 is an example of the NBC model arranged in three layers with four connected nodes. According to (2), the posterior conditional probability can be obtained from

$$P(C \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n \mid C)\,P(C)}{P(x_1, \dots, x_n)},$$

where the marginal probability $P(x_1, \dots, x_n)$ and the likelihood $P(x_1, \dots, x_n \mid C)$ involve the calculation of the joint probability. The joint probability is obtained from the chain rule together with the conditional-independence assumption:

$$P(C, x_1, \dots, x_n) = P(C)\prod_{i=1}^{n} P(x_i \mid C).$$
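As an illustration of this posterior computation, the following Python sketch applies the chain rule and marginalization to a two-attribute example; the prior and likelihood values are hypothetical placeholders for the tables learned from the training dataset.

    # Naive-Bayes posterior: P(C | E) = P(C) * prod_i P(E_i | C) / P(E)
    prior = {"yes": 0.2, "no": 0.8}                    # P(fault-state)
    likelihood = {                                     # P(attribute value | fault-state)
        "host-CPU-usage=serious": {"yes": 0.7, "no": 0.1},
        "VM-CPU-usage=minor":     {"yes": 0.5, "no": 0.4},
    }

    def posterior(evidence):
        joint = dict(prior)                            # chain rule: P(C) * prod P(E_i | C)
        for e in evidence:
            for c in joint:
                joint[c] *= likelihood[e][c]
        marginal = sum(joint.values())                 # P(E), summing the joint over C
        return {c: p / marginal for c, p in joint.items()}

    print(posterior(["host-CPU-usage=serious", "VM-CPU-usage=minor"]))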

3.2. The Influence Diagrams Model

Influence diagrams (IDs) [29] extend BNs for solving decision problems with two additional types of nodes, utility nodes and decision nodes. Nodes for the random variables of the BN are called chance nodes in the ID. A decision node defines the action alternatives considered by the user and is connected to those chance nodes whose probability distributions are directly affected by the decision. A utility node is a random variable whose value is the utility of the outcome [30]; like other random variables, it holds a table of utility values for all value configurations of its parent nodes [31]. In an ID, let $A = \{a_1, \dots, a_n\}$ be a set of mutually exclusive actions and $S = \{s_1, \dots, s_m\}$ the set of determining variables. A utility table $U(a_i, s_h)$ is needed to yield the utility of each configuration of action and determining variable so that the actions in $A$ can be assessed. The problem is solved by finding the action that maximizes the expected utility:

$$EU(a_i) = \sum_{h=1}^{m} U(a_i, s_h)\,P(s_h \mid a_i),$$

where $EU(a_i)$ is the expected utility of action $a_i$ and $P(s_h \mid a_i)$ is the conditional probability of the determining variable $s_h$ given that action $a_i$ is executed. This conditional probability is calculated from the conditional probability tables (CPTs) while traversing the BN of these variables.

Figure 4 represents an example of an ID of a CPU and a decision to detect a CPU-utilization fault. Prediction and CPU are chance nodes containing probabilistic information about the CPU and the prediction. Satisfaction is a utility (value) node, and CPU utilization is a decision node. The objective is to maximize the expected satisfaction by appropriately selecting the value of CPU utilization for each possible prediction. The satisfaction values for each combination of CPU utilization and CPU are also given. The ID is evaluated using the following procedure [31] (a code sketch follows the list):
(1) Set the evidence variables for the current state.
(2) For each possible value of the decision node, set the decision node to that value.
(3) Calculate the posterior probabilities of the parent nodes of the utility node using a standard probabilistic inference algorithm.
(4) Calculate the resulting utility for the action and return the action with the highest utility.
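The following Python sketch mirrors this four-step procedure for a single decision node with two alternatives; the probabilities and utility table are hypothetical stand-ins for the quantities attached to the nodes of Figure 4.

    # Evaluate an influence diagram: EU(a) = sum_s U(a, s) * P(s | a)
    p_state_given_action = {                   # step 3: posterior of the utility node's parent
        "repair-CPU":  {"fault": 0.74, "ok": 0.26},
        "do-nothing":  {"fault": 0.74, "ok": 0.26},
    }
    utility_table = {                          # U(action, state)
        ("repair-CPU", "fault"): 0.9, ("repair-CPU", "ok"): 0.4,
        ("do-nothing", "fault"): 0.0, ("do-nothing", "ok"): 1.0,
    }

    def expected_utility(action):
        return sum(p * utility_table[(action, state)]
                   for state, p in p_state_given_action[action].items())

    best_action = max(p_state_given_action, key=expected_utility)   # step 4
    print(best_action, round(expected_utility(best_action), 3))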

4. Determination of NBC Network Topology

In this section, we focus on modeling uncertainty using the proposed approach and present our ongoing work toward developing a framework based on NBC and ID models.

4.1. The Topology of the NBC Combined with an ID Model

As shown in Figure 5, start-up failure of the host server may have its roots in six groups of faults: CPU utilization [32], memory usage, I/O storage, network overhead, power failure, and cooling failure [20]. Each group has specific causes, as shown in Table 3.

Using GeNIe [33], the topology of the NBC combined with an ID model was created in accordance with the causes of the host server's start-up failure, as shown in Figure 5. The model has eight chance nodes (conditions) as root nodes pointing to the fault root causes (attribute values), shown as yellow circles; five intermediate nodes pointing to the fault categories (observed attributes), shown as green circles; a symptom node (prediction), shown as an orange circle; one node indicating the priority, drawn as a white circle; one decision node (Is fault?), drawn as a white rectangle; and the utility (decision) node, drawn as the final white shape.

As seen in Figure 3, each node in the graph is associated with a set of local (conditional) probabilities expressing the influence of the values of the node's predecessors (parents) on the probabilities of the values of the node itself (e.g., (2)). These probabilities form the quantitative knowledge input into the NBC network. An important property of the network is (conditional) independence: the lack of a directed link between two variables, such as the CPU-utilization node and the host-CPU-usage node, means that they are independent conditional on some subset of the other variables in the model. For example, if no path existed between the CPU-utilization node and host-CPU-usage, that conditioning subset would be empty and the two nodes would be independent.

4.2. Normal Distribution

Numerical data [34] need to be transformed into their categorical counterparts (binning) before constructing their frequency tables. Another option is to use the distribution of the numerical variables to obtain a good estimate of the frequency. A normal probability distribution is defined by two parameters, the mean $\mu$ and the standard deviation $\sigma$. Besides $\mu$ and $\sigma$, the normal probability density function (PDF) also depends on the constants $e$ and $\pi$ [35]. The normal distribution is especially important as a sampling distribution for estimation and hypothesis testing.

4.3. Determination of Network Parameters

Parameters were taken from large monitoring engines (metric-collection) to obtain the metrics of interest (numerical predictors or attributes). Additional details are provided in Section 8, for example, for the Virtual Machine Manager (VMM) [36] hypervisor (testing dataset) and historical data (training dataset). Sample parameters of CPU utilization (testing/training dataset) are provided in Table 4.

The metrics of the VMs and the host server were collected by running the VMs on the Xen hypervisor [37, 38], which was installed on the host server, in combination with preprocessing (reported in Table 3) using the Ganglia metrics software. Because these metrics exist in the form of numerical data, the numerical variables had to be transformed into their categorical counterparts (binning) before constructing their frequency table for use as input to the network topology structure [39]. Therefore, as shown in Figure 6, the preprocessing consists of four steps that translate continuous percentage utilization into interval probability values and generate monitoring vectors of events (-events) using the method proposed in [19]. We adopted this method with a dataset-filtering step that removes and processes outlier data and noise using an Extended Kalman Filter (EKF) [40, 41], and we devised a new algorithm for transforming numerical data into binned data, as shown in Algorithm 1.

Input: Metrics_values[n][num], mean[num], r, m
  // mean[j] is the mean of column j generated in step 1 and calculated from Eq. (10);
  // r and m are predetermined statistically in this experiment, where [0, r] is the
  //   value range and the data are placed into m equal-sized bins indexed from 0 to m;
  // n is the size of the look-back window (row index of the normalized-values table);
  // num is the number of component metrics (column index of the normalized-values table);
  // Normalized is the table of normalized values;
  // Bin is the data-binning table of bin indices;
  // Filtering-dataset is the function that removes and processes outlier and noise data.
Output: Data-binning table.
For i = 1, i <= n, i++ do
  For j = 1, j <= num, j++ do
    Normalized[i][j] = Metrics_values[i][j] / mean[j]
    If Normalized[i][j] >= r then
      Bin[i][j] = m
    Else
      Bin[i][j] = TRUNC(Normalized[i][j] / (r / m))
    End if
  End for
End for
Data_binning = Filtering-dataset(Bin)
End Function return (Data_binning).
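For reference, a minimal runnable Python version of this binning step is sketched below; the normalization-by-column-mean and capped-truncation rules follow the pseudocode above, the values of r and m are illustrative, and the EKF-based Filtering-dataset step is reduced to a pass-through placeholder.

    import numpy as np

    def filtering_dataset(bins):
        # Placeholder for the EKF-based outlier/noise filtering described in the text.
        return bins

    def data_binning(metrics_values, r=2.0, m=5):
        """metrics_values: (n x num) array; n = look-back window rows,
        num = component-metric columns; [0, r] = value range; m = number of bins."""
        metrics_values = np.asarray(metrics_values, dtype=float)
        col_mean = metrics_values.mean(axis=0)              # mean[] from step 1 (Eq. (10))
        normalized = metrics_values / col_mean              # Normalized[i][j]
        bins = np.trunc(normalized / (r / m)).astype(int)   # TRUNC(Normalized / (r / m))
        bins[normalized >= r] = m                           # cap out-of-range values at bin m
        return filtering_dataset(bins)

    # Example: a look-back window of 3 samples over 2 component metrics
    print(data_binning([[40.0, 10.0], [55.0, 12.0], [61.0, 30.0]]))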

Hence, the presented method keeps a buffer (a look-back window) of the metrics of the previously observed samples (e.g., a window-size of 3, a fixed value range, and #Bin = 5). The look-back window is used for multiple reasons [19]: (I) shifts in work patterns may render old history data useless or even misleading; (II) at exascale, it is impractical to maintain all history data; and (III) it can be implemented in high-speed RAM, which further increases detection performance. We use a window-size of 3, a sampling interval of 4 seconds, and an interval length of 12 minutes. The number of data points in the 12-minute interval is thus (12 × 60)/4 = 180. Once the collection of sample data is complete, the data are preprocessed and transformed into a series of bin numbers for every metric type, using (10) to perform the data binning at each time instance of the server and for its -events, with the results shown in Table 5.

In the above equations, $\mu$ is the population mean, $\sigma$ is the population standard deviation, $x$ is from the domain of measurement values, and $n$ is the number of components or services. A normal probability distribution is defined by the two parameters $\mu$ and $\sigma$. Besides $\mu$ and $\sigma$, the normal probability density function (PDF) depends on the constants $e$ and $\pi$:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}.$$

Because these attributes are numerical data, the numerical variables in Table 4 need to be transformed into their categorical counterparts (binning) before their frequency tables are constructed by (11) (e.g., a look-back window-size of 3 and five bins), as shown in Tables 5 and 6. The binning-value and decision-value are determined from the normalization value of each attribute, where 0.4 is a statistic suggested by the probability values. One can then classify and analyze the dataset probabilities (predictor) using the cumulative distribution function (CDF) [35], as shown in Table 8. For a continuous random variable, the probability mass of a bin is obtained from the CDF as

$$P(a \le X \le b) = F(b) - F(a) = \int_{a}^{b} f(t)\,dt,$$

where $a$ is the lower limit and $b$ is the upper limit of the bin.
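To illustrate the CDF step, the following sketch computes the probability mass of each bin from a fitted normal distribution; the mean, standard deviation, and bin edges are hypothetical examples rather than values taken from Table 4.

    from scipy.stats import norm

    mu, sigma = 52.0, 8.5                       # fitted to one metric column of the training data
    bin_edges = [0, 20, 40, 60, 80, 100]        # five equal-sized bins over the percentage range

    # P(a <= X <= b) = F(b) - F(a) for each bin [a, b]
    bin_probs = [norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma)
                 for a, b in zip(bin_edges[:-1], bin_edges[1:])]
    print([round(p, 4) for p in bin_probs])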

The metrics observed in the look-back window at each time instance serve as inputs for the preprocessing, as shown in Table 5, and the -events created from them serve as inputs for every component metric in our presented methodology, as shown in Table 6.

Figure 6 shows the probability table generated from the test dataset using the CDF, which is used as input to the NBC network model. For example, a look-back window of size 3 (metrics records) creates a dataset table with the following decision-value rules: if decision-value < 2, then the fault category is "normal" and the node fault-state is "no" (working); if decision-value = 2 or decision-value = 3, then the fault category is "minor" and the node fault-state is "no" (working); if decision-value > 3, then the fault category is "serious" and the node fault-state is "yes" (fault/anomaly), as shown in Table 7.

In the network in Figure 5, each node has two fault-states, that is, "no" (normal or working) and "yes" (fault or anomaly), and each fault category node has three states, that is, "normal," "minor," and "serious," as shown in Table 8. In addition, each node has a probability table acquired either from the monitoring engines and preprocessing (testing dataset) or from prior historical data (training dataset). The root nodes of the fault causes are expressed in terms of prior probabilities; for the other nodes, a conditional probability table expresses the probability-dependence relationships. The probability table shown in Table 9 is that of the "host-CPU-usage" root node.

Table 10 presents the conditional probability table (CPT) [42] of the CPU-utilization fault category node. In this case, we use the bucket elimination algorithm [43, 44] to calculate the probability of each fault category, such as CPU utilization, from the prior probabilities and the CPT according to

$$P(C) = \sum_{x_1}\sum_{x_2}\sum_{x_3} P(C \mid x_1, x_2, x_3)\,P(x_1)\,P(x_2)\,P(x_3),$$

where $C$ is the CPU utilization and $x_1$, $x_2$, and $x_3$ represent the VM-CPU-Ready-Time, host-CPU-usage, and VM-CPU-usage probability values, respectively.
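The following sketch performs the summation in the equation above by enumerating the parent configurations; the priors and CPT entries are hypothetical placeholders for the values in Tables 9 and 10.

    import itertools

    priors = {                                        # prior P(x_i = yes/no)
        "VM-CPU-Ready-Time": {"yes": 0.3, "no": 0.7},
        "host-CPU-usage":    {"yes": 0.6, "no": 0.4},
        "VM-CPU-usage":      {"yes": 0.4, "no": 0.6},
    }
    cpt_fault = {                                     # P(CPU-utilization = fault | x1, x2, x3)
        ("yes", "yes", "yes"): 0.95, ("yes", "yes", "no"): 0.85,
        ("yes", "no",  "yes"): 0.70, ("yes", "no",  "no"): 0.40,
        ("no",  "yes", "yes"): 0.80, ("no",  "yes", "no"): 0.60,
        ("no",  "no",  "yes"): 0.30, ("no",  "no",  "no"): 0.05,
    }

    parents = list(priors)
    p_fault = sum(
        cpt_fault[combo]
        * priors[parents[0]][combo[0]]
        * priors[parents[1]][combo[1]]
        * priors[parents[2]][combo[2]]
        for combo in itertools.product(["yes", "no"], repeat=3)
    )
    print(round(p_fault, 3))   # marginal probability of the CPU-utilization fault category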

The table can be created from subjective expert estimates or learned from case data. Sample numerical values for the host-CPU-utilization predictors are shown in Table 11.

Assume we have new evidence for prediction, for example, a host-CPU-usage value of 61.34. What is the predicted fault-state for this evidence? The solution is as follows: we normalize the numerical value 61.34 and evaluate the normal distribution function for each fault-state class, selecting the class with the maximum value. Result: the "no" class yields the maximum value, so fault-state = no.
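A sketch of this prediction step is shown below, assuming a Gaussian class-conditional density for the host-CPU-usage predictor; the per-class means, standard deviations, and priors are hypothetical stand-ins for the statistics derived from Table 11.

    from scipy.stats import norm

    class_params = {"no": (55.0, 10.0), "yes": (90.0, 6.0)}   # (mu, sigma) per fault-state
    class_prior  = {"no": 0.8, "yes": 0.2}

    evidence = 61.34
    scores = {c: class_prior[c] * norm.pdf(evidence, mu, sigma)
              for c, (mu, sigma) in class_params.items()}
    print(max(scores, key=scores.get))    # -> "no" for these example parameters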

5. Probability Propagation

The NBC can propagate probabilities for given evidence once the model structure has been built and the CPTs of all nodes have been established. The NBC model performs inference and calculates the probability of each node when the host server is set to "fault" and cannot start properly. The results of the probability propagation of the NBC model are shown in Figure 7. As illustrated in Figure 7, the fault probability of CPU utilization is the highest among the fault categories, at 74%. Within this category, the fault cause host-CPU-usage has the highest fault probability, 88%. Likewise, when new evidence such as a test result is input into the model, the model updates the probabilities of all other nodes.

6. Processing the Troubleshooting Decision

A troubleshooting decision involves choosing the repair action to be performed in every fault-troubleshooting step. The MDD analytical method is used to calculate the utility values and priority values of the candidate actions, which integrate the states of the measurement criteria. The action with the highest utility value is selected for service detection and repair. One of the MDD evaluation methods, namely, the MVL utility approach [45, 46], is adopted in this paper. Each criterion is assumed to be measurable on a ratio scale. The value of a criterion can be normalized to a scale of [0, 1], where 0 and 1 represent the "worst" and "best" effect for the measurement criterion, respectively. The criterion risk, for example, is measured by the probability of severity; an action with a severity value of 0.65 is more severe than an action with a severity value of 0.85. The criterion priority can also be normalized to [0, 1]; an action with a priority value of 0.25 has a higher service priority than one with a priority value of 0.5. As seen in Table 12, the measurements are given values mapped to the range [0, 1] in step one to enable a decision about which fault group needs fixing. For example, the "best" value for the severity of the five identified repair actions is 0.85 (fault-repair-others). Likewise, the "best" value for duration is 0.75 (fault-repair-CPU). The uncertain values of the measurement criteria are obtained from the NBC.

The MDD evaluates the influence of multiple criteria by calculating a utility value in which each criterion has its own weight. Each criterion is given a weight indicating how important that measurement criterion is, and the overall utility of an action is the weighted sum of its criterion values, as sketched below.
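The following sketch computes such a weighted-sum utility for two of the first-step repair actions, using the rank-sum weights from Section 2.2; the normalized criterion values are hypothetical, and the authoritative figures are those of Tables 12 and 13.

    weights = {"fault probability": 0.4, "severity/risk": 0.3,
               "priority": 0.2, "duration": 0.1}

    actions = {
        "fault-repair-CPU":    {"fault probability": 0.74, "severity/risk": 0.65,
                                "priority": 0.50, "duration": 0.75},
        "fault-repair-memory": {"fault probability": 0.30, "severity/risk": 0.70,
                                "priority": 0.40, "duration": 0.60},
    }

    def utility(values):
        return sum(weights[c] * values[c] for c in weights)

    for action, values in actions.items():
        print(action, round(utility(values), 2))
    # The action with the highest utility is selected for repair in this step.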

Table 13 lists the calculated utility values of the five actions in the first step of the decision-making process. The troubleshooters should proceed to the CPU fault in the first step because it has the highest utility value, 0.69, as shown in Table 13. Because the decision to repair the CPU fault was made in step one, the second troubleshooting step includes the set of actions repairing the host-CPU-usage, VM-CPU-usage, and VM-CPU-Ready-Time. The values for each action in this second step, normalized to [0, 1] based on the three criteria, are presented in Table 14.

Table 15 contains the utility values calculated in the second step of the decision-making procedure. As seen in Table 15, the utility value of 0.70 for the action of repairing host-CPU-usage is the highest. Thus, the troubleshooting decision is to repair the host-CPU-usage fault of the host server.

7. Experimental Setup

The architecture of the physical machine used as the experimental setup is shown in Figure 8. The testbed uses two virtual machines, VM1 and VM2, configured on a Xen hypervisor platform hosted on one Dell blade server with dual-core 3.2 GHz CPUs and 8 GB of RAM.

OpenStack is open-source software capable of controlling large pools of storage, computing, and networking resources throughout a data center, managed through a dashboard or via the OpenStack API [47]. We injected 40 anomalies into the OpenStack online service on the host server, resulting in global resource-consumption faults/anomalies, as presented in Figure 9. These 40 anomalies stem from extreme failure sources in online services [19, 48]. The following patterns of CPU utilization were observed when the testbed exhibited the expected behavior for the host server, that is, the case-study in this work. The metrics of both the VMs and the host were collected using Ganglia metrics and analyzed according to a fault-troubleshooting approach from a previous study, at a four-second sampling interval. During this period, we injected the anomalies into the testbed system; further implementation details are discussed in Section 8.

8. Method Implementation and Evaluation Results

We implement our approach at both the VM level and the host-server level. The host server is considered the direct parent of the VMs; thus, the local time series are calculated first for the VMs and then aggregated into the host server's global time series. As mentioned in Section 4, the first implementation uses the Ganglia metrics method to record and identify global resource-provisioning anomalies. Forty anomaly samples were injected into the testbed, leading to global resource-consumption faults/anomalies, including in the CPU utilization of the running host server. As shown in Figure 10, the anomalies are aggregated by integrating the measurements performed with the Graphite tool with the Ganglia metrics to obtain the CPU-utilization metrics (host-CPU-usage, VM-CPU-usage, and VM-CPU-Ready-Time). A fault/anomaly is identified by applying a threshold-based methodology and the troubleshooting approach to the online CPU-utilization observations of the VMs and the host server.

As seen in Figure 10, violations are indicated by red dots when the baseline threshold method is applied. Consecutive red dots are considered a single anomaly, as there may be a delay period. A positive alarm and its matching anomaly injection into the testbed may occur at different times because of the delay between the injection and its effects on the metrics being collected and analyzed; in the present paper, this issue remains to be addressed. To set the threshold values optimally, 5% and 90% are chosen as the lower and upper probability boundaries, respectively, based on representative values used in state-of-the-art deployments. As shown in Figure 11, a similar implementation is used for the host server to obtain CPU-utilization data, using these threshold values to identify faults/anomalies. The anomaly events are represented by dotted lines, and red dots indicate the position of an alarm.
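A minimal sketch of this baseline threshold detection is given below, using the 5% and 90% boundaries quoted above; the CPU-utilization series is an invented example, and consecutive violating samples are merged into a single anomaly as in Figure 10.

    LOWER, UPPER = 0.05, 0.90

    cpu_utilization = [0.42, 0.51, 0.95, 0.97, 0.93, 0.40, 0.02, 0.38]  # fraction of capacity

    violations = [i for i, v in enumerate(cpu_utilization) if v < LOWER or v > UPPER]

    # Merge consecutive violating samples into one anomaly event (the red dots of Figure 10).
    anomalies = []
    for i in violations:
        if anomalies and i == anomalies[-1][-1] + 1:
            anomalies[-1].append(i)
        else:
            anomalies.append([i])
    print(anomalies)    # -> [[2, 3, 4], [6]]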

Four statistical measures [6] are used to evaluate how effectively the construction steps of the troubleshooting procedure detect faults/anomalies in the testbed of our experimental setup described in Section 7, as shown in Table 16.
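As a sketch of how such measures are computed from the confusion counts of detected versus injected anomalies, the following assumes the four statistics are precision, recall, F1 score, and false-alarm rate; the counts are hypothetical, and the authoritative values are those of Table 16.

    tp, fp, fn, tn = 38, 1, 2, 320            # hypothetical confusion counts

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                   # fraction of injected anomalies detected
    f1 = 2 * precision * recall / (precision + recall)
    false_alarm_rate = fp / (fp + tn)         # fraction of normal samples falsely flagged

    print(f"precision={precision:.3f} recall={recall:.3f} "
          f"F1={f1:.3f} false-alarm rate={false_alarm_rate:.4f}")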

Overall, the results show a 94% improvement in the accuracy (F1) score, on average, for the theoretical approach compared with the threshold-based detection method, with an average false-alarm rate of 0.03%, as shown in Table 17.

9. Conclusion

Fault troubleshooting for cloud computing aims to diagnose and repair faults with the highest effectiveness and accuracy and with minimal risk. The effectiveness depends on multiple measurement criteria, including the fault probability, priority, risk, and duration of the repair action. In this paper, a new fault-troubleshooting decision algorithm is proposed based on an NBC and an MDD, which includes an influence diagram. The method ensures that the most sensible repair action is chosen in each troubleshooting step, thus enabling rapid, highly efficient, and low-risk troubleshooting. The practical aim of this approach is to provide a decision-theoretic methodology for modeling the construction steps of fault troubleshooting in cloud computing. The failure of a host server to start up is used as a case-study in this paper, although the proposed method is broadly applicable to troubleshooting cloud platforms. In conclusion, the proposed method comprises the following construction steps: (1) identification of the set of possible actions; (2) identification of the set of measurement criteria, which are attributes of the actions from which the available options are determined; (3) establishing the uncertain measurement criterion, namely, the fault probability, and the certain measurements, namely, duration, risk, and priority; (4) using the NBC and the MDD with an ID to build a model to infer the uncertain measurement criteria; (5) determining the values and weights of each criterion, normalizing the criterion values to a scale of [0, 1], and establishing the overall utility values of the actions using the weighted-sum method; (6) generating a decision using the acquired utility and priority values. A high utility value indicates a high service priority. Future work regarding the proposed approach includes the following: (I) evaluating scalability with multiple virtual machines; (II) evaluating scalability for large datasets based on the Hadoop MapReduce and Apache Spark platforms for analyzing large volumes of real-time data by using aggregation; (III) improving the correction and recovery of each fault with new reasoning algorithms, such as case-based reasoning (CBR).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Chinese High Tech R&D (863) Program Project “Cloud Computer Test and Evaluation System Development (2013AA01A215).”