Abstract

Fault tolerance in cloud computing is considered one of the most vital issues in delivering reliable services. Checkpoint/restart is one of the methods used to enhance the reliability of cloud services. However, many existing methods do not address virtual machine (VM) failure caused by the higher response time of a node, byzantine faults, and performance faults, and they also ignore optimization during the recovery phase. This paper proposes a checkpoint/restart mechanism to enhance the reliability of cloud services. Our work is threefold: (1) we design an algorithm to identify virtual machine failure due to several faults; (2) we design an algorithm to optimize the checkpoint interval time; (3) lastly, we use asynchronous checkpoint/restart with a log-based recovery mechanism to restart the failed tasks. The evaluation results obtained using a real-time dataset show that the proposed model reduces power consumption and improves performance, providing a better fault tolerance solution compared to the nonoptimization method.

1. Introduction

Cloud computing has emerged as a prominent paradigm over the past decade, and its use has grown substantially [1]. Not only small-scale users but also large-scale commercial businesses and scientific applications benefit from the cloud. With minimal effort, users can obtain services from the cloud, as it enables ubiquitous, on-demand access to a shared pool of computing resources such as software, hardware, and applications. The three main layers of cloud architecture are Software as a Service, Infrastructure as a Service, and Platform as a Service. Faults may occur at all three layers; software-based algorithms are therefore identified and applied to recover from them.

Fault tolerance is described as a system's capacity to continue performing its intended function in the presence of errors or faults [2, 3]. Even a well-designed system with the best components and services cannot be called dependable without fault tolerance capabilities [4]. Because a large number of delay-sensitive (real-time) applications must be run, reliability is a critical aspect of cloud computing. Furthermore, service dependability is critical to the cloud's wider acceptance. As a result, fault tolerance has received considerable research attention. There are various fault tolerance mechanisms, such as replication, checkpointing, self-healing, task migration, retry, safety-bag checks, reconfiguration, task resubmission, and masking [5–8], which tackle faults at various levels in either a reactive or a proactive fashion.

Cloud computing entails the dynamic allocation of resources and the use of data centers that are often geographically dispersed. The hypervisor, also known as the virtual machine monitor (VMM), is a high-level monitoring unit that splits a server's available resources into virtual machines (VMs) or virtual nodes (VNs) and monitors their performance and availability. One or more VMs are assigned to run the submitted application based on the user's request. The benefit of utilizing virtual machines is that they allow users to run applications on a variety of operating systems, IDEs, and software environments. In most cases, the virtual infrastructure management (VIM) module of cloud computing manages resource pooling, physical and virtual resource management, and other tasks [7].

A cluster is formed from a group of different hosts or servers. Here, we treat clusters as sets of servers for better generalization. Clusters allow cloud service providers to assign VMs to virtual clusters dynamically based on the SLA or user request. Such prior knowledge is necessary for cloud service providers to handle the dynamic allocation of virtual machines.

In this work, we propose an intelligent fault-tolerant mechanism that performs the following tasks: (a) detecting VM failure due to the higher response time of a node, byzantine faults, and performance faults; (b) optimizing the checkpoint interval time; and (c) using an asynchronous checkpoint/restart method to model cloud service execution. The fault tolerance procedure of our cloud model is illustrated in Figure 1. At the beginning, tasks are submitted by the users. The cloud supervisor forms virtual clusters of hosts, allocates tasks to virtual machines (VMs), and monitors the VMs and hosts. The virtual machines start executing the allotted tasks and checkpoint them at the optimized regular time interval derived from the optimization algorithm.

If a node's response time exceeds the response time defined in the QoS requirement, it is halted and all its tasks are restarted on another host. If a virtual machine fails, all the tasks running on that virtual machine are restarted on other virtual machines from their most recent consistent checkpoints. Byzantine faults are detected as described in Section 3.1.1. The node in which a byzantine fault is detected is halted, and another virtual machine is launched. A log-based recovery mechanism is implemented to optimize the restart process of the tasks. Note that there is overhead in identifying the different types of faults and in finding the most recent consistent checkpoint from which to restart the tasks.

The rest of the paper is organized as follows: Section 2 presents the literature survey, Section 3 discusses the proposed method, Section 4 evaluates the proposed method with the experimental setup and results, and Section 5 concludes the paper.

2. Literature Review

This section reviews related work on fault tolerance in cloud computing.

The authors in [9] proposed a fault-tolerant VM placement scheme, where fault tolerance is implemented using a VM replication technique. Here, based on VM requirements, different numbers of replicated copies are used. Each physical machine has its own constraints, and replicated copies of the same VM cannot be placed on the same physical machine. Integer linear programming is used to handle VM replica placement. In [10], a checkpointing/restart mechanism was proposed along with a replication scheme to increase the reliability of the system. The development of a fault-tolerant system assures the reliability and continuity of services. Checkpointing is most susceptible in the event of a higher failure rate, since the checkpoint file becomes inaccessible if the machine that stores it fails, rendering the failed job unrecoverable. Hence, a replica of the checkpoint file is maintained to improve reliability. A checkpoint- and replication-based fault tolerance technique was developed in [11]. The work focuses on the MapReduce framework in the cloud, where proactive fault tolerance is used to recover from faults.

Cloud service reliability enhancement through optimization of VM placement (VMP) was developed by Zhou et al. Three algorithms are used in this method. Based on the network topology, the first algorithm chooses an acceptable selection of VM-hosting servers from a potentially large collection of candidate host servers. With a K-fault-tolerance guarantee, the second algorithm develops an appropriate strategy for placing the primary and backup VMs on the selected host servers. Finally, to solve the task-to-VM reassignment optimization issue, which is formulated as finding the maximum-weight matching in bipartite graphs, a heuristic is utilized. In [13], an (m, n)-fault-tolerance virtual machine placement scheme for cloud data centers was proposed, where m represents the number of edge switches and n denotes the number of host servers. A K-fault-tolerant replication strategy was used to enhance the reliability of the application or service. The first step is to recast the issue as an integer linear programming problem and demonstrate that it is NP-hard. Second, the differential evolution (DE) technique is applied to address the integer linear programming problem. The authors in [14] proposed a unique execution time prediction model that takes into account execution events that other multilevel checkpointing models do not include. The relationship between system failure rates, checkpoint/restart overhead, and the time between consecutive checkpoints is complicated, and determining the ideal time between checkpoints is a difficult task. The work explains how the proposed model can be used to set checkpoint intervals and why these execution events are essential to consider.

In [16], a fault-tolerant cloud computing service based on checkpointing is proposed. The fault tolerance service employs semicoordinated checkpointing, which reduces the time spent in the coordination phase and thereby reduces the amount of energy consumed and the overhead. Results showed that the proposed approach also lowers the expense of a rollback. Bansal et al. [17] introduced WQR-FT, a fault-tolerant workqueue-with-replication method, which employs a group manager to guarantee the existence of a certain number of copies in the system. Checkpointing adds overhead, which might lengthen the execution time [18, 19]. The checkpointing method (protocol), checkpoint storage, or recovery process can contribute to this cost [20].

3. Proposed Work

Many existing methods do not address virtual machine (VM) failure that occurs due to multiple factors such as the higher response time of a node, byzantine faults, and performance faults, and they also ignore optimization during the recovery phase. The proposed approach for fault tolerance in the cloud data center involves three phases. Phase 1 focuses on detecting VM failure. Here, a few algorithms are proposed to detect VM failure due to higher response time, which occurs at the virtualization layer of the cloud, as well as byzantine faults. Phase 2 describes the proposed algorithm for the intelligent fault-tolerant mechanism in the cloud data center, which includes the checkpoint interval time calculation process. In phase 3, asynchronous checkpointing and the optimized recovery process using a log-based mechanism are discussed. Figure 2 shows the working principle of the proposed model.

3.1. Phase 1: Detection of Different Types of Faults
3.1.1. Byzantine Fault Detection Using Checksum Validation

To detect byzantine faults, we use the SHA-2 family of hash functions, which is widely applied in different fields. Specifically, SHA-256 produces a 256-bit hash value computed as eight 32-bit words. The SHA-256 checksum can also be used on cloud platforms.

In a cloud environment, when a node uses the TCP/IP protocol to connect to another node, it is expected to produce a checksum, so such nodes are readily equipped with SHA-256. Nodes that are connected to other nodes through the IP protocol are termed internodes.

In this work, every node in the cloud environment computes the checksum. The checksum of a particular data block is, for practical purposes, unique and does not clash with the checksum of another data block. As a result, when a node is provided with a message and fails to produce the expected checksum, the node can be identified as erroneous or compromised. Malicious nodes are discouraged from altering the checksum results because reconstructing the original data from the checksum or conducting collision analysis is generally prohibitive in time, space, and cost. Computing a SHA-256 checksum on arbitrary datasets is simple, easy, and feasible. Byzantine nodes frequently produce genuine-looking output that is incorrect owing to byzantine-fault-induced miscalculation.
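As a minimal illustration of this checksum validation (our own Python sketch, not the authors' implementation; the function names expected_checksum and is_byzantine are hypothetical), a supervisor can precompute the SHA-256 digest of a message block and flag any node whose reported value differs:

import hashlib

def expected_checksum(message: bytes) -> str:
    """Compute the reference SHA-256 checksum of a message block."""
    return hashlib.sha256(message).hexdigest()

def is_byzantine(message: bytes, reported_checksum: str) -> bool:
    """A node is flagged as byzantine if the checksum it reports
    does not match the supervisor's precomputed value."""
    return reported_checksum != expected_checksum(message)

# Usage: a correct node reproduces the checksum, a faulty one does not.
msg = b"512-bit standard message block"
print(is_byzantine(msg, expected_checksum(msg)))   # False (no fault)
print(is_byzantine(msg, "deadbeef" * 8))           # True (byzantine)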

3.1.2. Checksum Prerequisites

In the cloud environment, a cloud monitoring (supervisor) node is expected to send a message M to k internodes automatically and receive their checksums in return. The standard message block size for SHA-256 is 512 bits, and the resulting checksum is 256 bits. We consider a supervisor node that has a precomputed checksum C obtained in time T. Next, we compare the set of received checksums, Q, with the set of expected checksums, P.

To compare the checksums, if P and Q are sets and every element of Q is also an element of P, then Q ⊆ P; i.e., Q is a subset of P, and hence no checksum error is detected.

If Q is not a subset of P, then one or more elements of Q exhibit a processing error, hence the difference in the checksums.

If the set Q contains no element of P, then Q ∩ P is a null set, represented by ∅; i.e., Q ∩ P = ∅. This means that the entire set of observed checksums is incorrect, so the entire set of observed nodes is compromised. This may also indicate that the supervisor node itself is compromised.

If there exists a subset of checksums in Q produced by erroneous nodes, then Q is only partially contained in P (Q ⊄ P and Q ∩ P ≠ ∅), and the nodes whose checksums fall outside P are flagged as erroneous.

Before any application begins execution in the cloud, the supervisor node selects a message M, generates the checksum C, sends M to the k nodes automatically, and receives their checksums along with the time taken by each node.

If Q ⊆ P (i.e., no checksum error is detected), then we record the response time of each node, i.e., transit time + processing time, as the set R.
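A minimal sketch of this comparison step is shown below, assuming a hypothetical query_node callable per internode that returns the reported checksum; it illustrates the set comparison Q ⊆ P and the recording of response times R described above, not the exact protocol used by the authors.

import hashlib
import time

def collect_and_compare(message: bytes, nodes: dict):
    """Compare the checksums reported by k internodes (set Q) against the
    supervisor's expected set P = {C}, and record the response times (set R)
    of the nodes whose checksums are correct."""
    expected = {hashlib.sha256(message).hexdigest()}   # set P
    observed, response_times, faulty = set(), {}, []
    for name, query_node in nodes.items():             # query_node is hypothetical
        start = time.monotonic()
        checksum = query_node(message)                 # transit + processing time
        elapsed = time.monotonic() - start
        observed.add(checksum)
        if checksum in expected:
            response_times[name] = elapsed             # contributes to set R
        else:
            faulty.append(name)                        # byzantine suspect
    all_correct = observed.issubset(expected)          # Q ⊆ P
    return all_correct, response_times, faulty

# Example usage with stub nodes: one correct, one byzantine.
msg = b"message M"
good = lambda m: hashlib.sha256(m).hexdigest()
bad = lambda m: "0" * 64
print(collect_and_compare(msg, {"n1": good, "n2": bad}))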

3.1.3. Algorithms for Detection of Different Types of Faults

Cloud computing delivers services to users while maintaining the QoS specified in the SLA. Response time and QoS delay are among the QoS metrics associated with all cloud nodes. The supervisor node monitors the set of nodes against the SLA.

Input: N operating nodes
Output: faulty node or normal node
 for all operating nodes (N)
  if response_time of Nj ≥ response time in QoS
   then
    take the checkpoint
    call checksum_compare()
  else
   continue supervise
  end if
 end for

In Algorithm 1, if the response time of any node exceeds the QoS response time, the node is checkpointed and Algorithm 2 is called.

Input: Message M to all operating nodes N
 for each Nj in N
  if Cj ≠ C //byzantine fault
  then
   halt Nj
   start new node as Nj from recent
   consistent checkpoint
  else
  call delay_deflection_compare()
  end if
 end for

Algorithm 2 submits the message M to the operating nodes. If an operating node produces a checksum Cj that does not match C, this indicates a checksum error denoting a byzantine fault. If the fault is detected, the algorithm shuts down the node and starts a new virtual machine. If no fault is detected, then the set of currently observed response times S is compared with the recorded response times R.

Consider the set R of recorded response times and the partially ordered set S of currently observed response times; an element s of S is an upper bound of R if s ≥ r for each r in R. If at least one such element exists, the variation in the delay experienced is high or extreme, indicating a performance fault; hence, the node is shut down after its workload is transferred from the previous checkpoint, as depicted in Algorithm 3.

 for each operating node Nj
  Choose Ti in S
  Copy corresponding Ti in R
   if Ti in S < Ti in R
    no fault //minimal delay variation
    call checkpoint_optimization()
   else if Ti in S = Ti in R
    no fault //minimal delay variation
    call checkpoint_optimization()
   else if Ti in S is an upper bound of R
    shut down Nj
    start new node as Nj from recent
    consistent checkpoint
   else
    call checkpoint_optimization()
   end if
 end for
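For illustration, the following Python sketch captures the delay-deflection check of Algorithm 3 under the assumption that baseline holds the recorded response times R and observed holds the current times S; the helper name delay_deflection_compare mirrors the pseudocode, but the data layout is our own.

def delay_deflection_compare(observed: dict, baseline: dict):
    """Classify each node by comparing its current response time (set S)
    against the recorded baseline times (set R)."""
    to_shut_down, to_optimize = [], []
    r_values = list(baseline.values())
    for node, t_s in observed.items():
        t_r = baseline.get(node, max(r_values))
        if t_s <= t_r:
            # minimal delay variation: no fault, tune the checkpoint interval
            to_optimize.append(node)
        elif all(t_s >= r for r in r_values):
            # t_s is an upper bound of R: extreme variation, performance fault
            to_shut_down.append(node)
        else:
            # high but not extreme variation: keep the node, re-optimize
            to_optimize.append(node)
    return to_shut_down, to_optimize

down, ok = delay_deflection_compare({"n1": 0.8, "n2": 0.2}, {"n1": 0.3, "n2": 0.3})
print(down, ok)   # ['n1'] ['n2']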
3.1.4. State Transition for Checksum

Figure 3 shows the state transition diagram for the virtual node. A node, after receiving the message M from the supervisor, calculates the checksum. Here, the initial state of the node is 0. If the node fails to compute the expected checksum after receiving message M, there is an error and the node enters the byzantine state (i.e., state 1). From the byzantine state, it reaches state 2 with probability p = 1, where the node is shut down and a new VM is started. If no error is detected, it remains in the same state. Here, E indicates an error, and NE indicates no error.

3.1.5. State Transition with Delay Variation

We consider three delay variations (Δ): normal, high, and extreme. As shown in Figure 4, the initial state of the node is 0. If the delay variation is normal (N), the node remains in the same state; if the delay variation is high (H) or extreme (Ex), it denotes a byzantine or performance failure, so the node transitions to state 1. After this, the node transitions to state 2, where it is shut down.
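The two state machines of Figures 3 and 4 can be summarized as a small transition table; the encoding below is our own illustrative sketch and merges the checksum events with the delay-variation events.

# States: 0 = normal, 1 = fault detected (byzantine/performance), 2 = shut down
# Events: "NE" no error, "E" checksum error, "N" normal delay,
#         "H" high delay variation, "Ex" extreme delay variation
TRANSITIONS = {
    (0, "NE"): 0, (0, "N"): 0,                 # no error: stay in state 0
    (0, "E"): 1, (0, "H"): 1, (0, "Ex"): 1,    # fault detected
    (1, "any"): 2,                             # with p = 1: shut down, relaunch VM
}

def next_state(state: int, event: str) -> int:
    if state == 1:
        return TRANSITIONS[(1, "any")]
    return TRANSITIONS.get((state, event), state)

assert next_state(0, "NE") == 0
assert next_state(0, "Ex") == 1
assert next_state(1, "N") == 2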

3.1.6. Delay-Sensitive Server Scheduling (DSSS)

The DSSS algorithm's goal is to keep track of all the servers that make up the virtual cluster. It is a lightweight model and can be integrated into the cloud supervisor.

DSSS keeps track of the number of failed delay-sensitive tasks that exceed the QoS delays, as well as faults caused by VM failures, resource contention, and other factors. The count is then used to rank the servers after each state interval, with the server with the fewest fault counts at the top of the list. As a result, DSSS can help with dynamic job placement based on server performance. It may also be used to rate servers based on their prior performance and to keep track of the status of previous cluster deployments. Such knowledge of prior performance can aid the management model in selecting servers for forming clusters to execute sensitive applications in an appropriate and dynamic manner. The notations used in the DSSS algorithm are listed in Table 1.

Input: R, S
Output: LDSSS
 Divide R = {J1, J2, …, Jn}
 for all sj in S do
   if sj is assigned to Ji then
   if sj not in LDSSS then
    add sj to LDSSS
    sj.C = 0
   end if
   end if
 end for
 for each sj in LDSSS
  if VM = FT
   sj.C = sj.C + 1
  else if VM = FVM
   sj.C = sj.C + 1
  else //other fault types
   sj.C = sj.C + 1
  end if
 end for
 sort LDSSS(s, C)
  for j = 1 to n−1
   if sj.C < sj−1.C then
    swap (LDSSS[sj−1], LDSSS[sj])
   end if
  end for
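A compact Python sketch of the DSSS ranking idea follows; the fault log layout and field names are our own simplification (Table 1 is not reproduced here), so they should be read as assumptions rather than the exact DSSS data structures.

from collections import Counter

def rank_servers_dsss(servers, fault_log):
    """Rank the servers of a virtual cluster by their fault counts.

    servers   : list of server identifiers forming the cluster
    fault_log : list of (server_id, fault_type) tuples observed during the
                last state interval (task failures, VM failures, contention, ...)
    Returns the servers sorted so that the one with the fewest faults comes first.
    """
    counts = Counter(server for server, _ in fault_log if server in servers)
    # Servers with no recorded faults get a count of zero.
    return sorted(servers, key=lambda s: counts.get(s, 0))

# Usage: s2 has the fewest faults, so it tops the list for delay-sensitive jobs.
ranking = rank_servers_dsss(
    ["s1", "s2", "s3"],
    [("s1", "task_fail"), ("s1", "vm_fail"), ("s3", "contention")],
)
print(ranking)  # ['s2', 's3', 's1']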

After selecting a suitable server for processing the job, the next step is to apply an appropriate fault tolerance mechanism.

3.2. Phase 2: Checkpoint Interval Optimization

Checkpoint/restart optimization is a challenging task of keeping checkpoint intervals at their optimal value. It aims at finding the time interval at which checkpoints should be taken for the tasks. Let α represent the preset initial state-monitoring interval. The optimization algorithm works in the following way: if a node exhibits neither delay variation nor a checksum error, i.e., it stays in the same state (0) as shown in Figure 5, then the interval value is incremented. If a node exhibits high or extreme delay variation or a checksum error, the state interval is reset to its initial value.

 Set I = α
 Set s = 0
 for each node Nj in state 0
  if ΔC = {NE or N}
   increment I //extend the checkpoint interval
   Call delay_deflection_compare()
   Call checksum()
  else if ΔC = {Ex or H or E}
   I = α //reset the interval to its initial value
   shut down Nj
   start new node as Nj
  end if
 end for
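The adaptive interval rule can be sketched in Python as follows; since the paper does not state the exact increment, growing the interval by α on each healthy observation is an assumption of this sketch.

def update_checkpoint_interval(interval, alpha, events):
    """Adapt the checkpoint interval of one node based on observed events.

    interval : current checkpoint interval of the node
    alpha    : preset initial state-monitoring interval
    events   : set of observed events, e.g. {"NE", "N"} for no error / normal
               delay, or containing "E", "H", "Ex" for a checksum error,
               high, or extreme delay variation
    """
    faulty = events & {"E", "H", "Ex"}
    if faulty:
        return alpha, True          # reset the interval; node must be restarted
    return interval + alpha, False  # healthy node: lengthen the interval

interval = alpha = 5.0
interval, restart = update_checkpoint_interval(interval, alpha, {"NE", "N"})
print(interval, restart)   # 10.0 False
interval, restart = update_checkpoint_interval(interval, alpha, {"Ex"})
print(interval, restart)   # 5.0 True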
3.2.1. Proposed Algorithm

To execute the tasks generated by users, our proposed algorithm (the intelligent fault-tolerant mechanism, IFTM) uses the algorithms discussed in Sections 3.1, 3.2, and 3.3 of this paper. The proposed algorithm identifies VM failure due to higher response time, byzantine faults, and performance faults. It also calculates the optimal checkpoint interval and restarts the failed tasks using an asynchronous checkpoint/restart mechanism.

 for each Ji
  for each VMi
   do
    supervise (Checksum, Delay Variation)
    DSSS()
    Checkpoint_Interval_Optimization()
    Asynchronous_checkpoint()
    Checksum_compare()
    delay_deflection_compare()
     if (fault detected)
     recovery_algorithm()
    end if
  end for
 end for
3.3. Phase 3: Asynchronous Checkpointing and Recovery

Two types of VM fault-tolerant methods are often utilized. One is based on checkpointing and log-based rollback techniques. The other is based on the primary-backup paradigm, with incremental checkpoints as a feature [15].

In this work, fault tolerance is modeled using asynchronous checkpointing and log-based rollback. The applications, processes, or tasks executing concurrently on the allocated VMs are checkpointed independently. These checkpoints are taken without any synchronization among the processes, hence the lower runtime overhead during normal execution. If a VM failure is detected or some of the tasks fail, the recovery process is activated. The recovery process needs to iterate to find a consistent set of checkpoints, which is one of the limitations of this method. Figure 6 shows an example of checkpoints and globally consistent recovery points for different processes. The recovery algorithm must search for the most recent consistent set of checkpoints before it initiates recovery.

As shown in Figure 6, three processes, Pi, Pj, and Pz, take checkpoints at {{Ci, 0}, {Ci, 1}}, {{Cy, 0}, {Cy, 1}}, and {{Cz, 0}, {Cz, 1}}, respectively. When process Pi fails, it rolls back to its previous consistent checkpoint {Ci, 1}. The rollback of process Pi to {Ci, 1} creates an orphan message M7, which forces Pj to roll back to checkpoint {Cy, 1}. Since asynchronous checkpointing is subject to the domino effect during recovery, we use a log-based recovery mechanism to overcome that effect and to optimize recovery.

During checkpointing and recovery, a few assumptions are made. Communication channels are considered reliable, have infinite buffers, and deliver messages in FIFO order. The triplet (S, M, MSG_SENT) represents the state of a process: a process at state S receives the message M, moves to the state S1, and sends out the messages in MSG_SENT. Two types of log storage, volatile and stable, are used. After the execution of an event, the triplet is recorded without any synchronization with other processes. Local checkpoints consist of a set of records that are first stored in the volatile log and then moved to the stable log. During recovery, Algorithm 7 is used.
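As an informal illustration of this logging scheme, the sketch below records (S, M, MSG_SENT) triplets in a volatile log and flushes them to the stable log at each local checkpoint; the class and method names are hypothetical.

class ProcessLog:
    """Asynchronous checkpointing with log-based recovery support.

    Each executed event is logged as a triplet (state, message, msgs_sent)
    in a volatile log; taking a local checkpoint moves the records to the
    stable log, with no coordination with other processes.
    """
    def __init__(self):
        self.volatile_log = []
        self.stable_log = []

    def record_event(self, state, message, msgs_sent):
        self.volatile_log.append((state, message, tuple(msgs_sent)))

    def take_checkpoint(self):
        # Local checkpoint: persist the volatile records to stable storage.
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

    def latest_stable_event(self):
        # Used after a failure: the volatile log is assumed to be lost.
        return self.stable_log[-1] if self.stable_log else None

log = ProcessLog()
log.record_event("S0", "M1", ["M2"])
log.take_checkpoint()
log.record_event("S1", "M3", [])      # lost if the process fails now
print(log.latest_stable_event())      # ('S0', 'M1', ('M2',))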

Notations used in the algorithm are as follows:
RCab(CPa) indicates the number of messages received by process Pa from Pb, from the beginning of the computation to checkpoint CPa.
SDab(CPa) indicates the number of messages sent by process Pa to Pb, from the beginning of the computation to checkpoint CPa.
R is the number of processes recovered after failure.
K is the number of processes.
Oc denotes the orphan message count.

Here, a set of consistent checkpoints is selected from the set of checkpoints based on the number of messages sent and received.

3.4. Model Execution with Checkpoint Mechanism

The checkpoint procedure is regarded as deterministic, and the cost of a checkpoint is solely determined by the amount of work already completed. Let W be the workload and N be the number of checkpoints, with W1, W2, …, WN denoting the amounts of work between consecutive checkpoints; β represents the overhead factor and m denotes the number of virtual machines. Wq is the amount of work done between checkpoints q−1 and q. Let C(Fq) represent the checkpoint cost after the quantity of work Fq, where Fq = W1 + W2 + … + Wq and Wi denotes the quantity of work that must be completed prior to each checkpoint. R(Fq−1) represents the cost of the restart process before the qth checkpoint.
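To make the bookkeeping concrete, the short sketch below (our own illustration with an arbitrary cost model) computes the cumulative work Fq completed before each checkpoint and a hypothetical checkpoint cost C(Fq) proportional to Fq:

def cumulative_work(work_segments):
    """Given W1..WN (work between consecutive checkpoints), return
    F1..FN where Fq = W1 + W2 + ... + Wq."""
    totals, running = [], 0.0
    for w in work_segments:
        running += w
        totals.append(running)
    return totals

def checkpoint_cost(f_q, beta=0.05):
    """Hypothetical checkpoint cost model: proportional to the work
    completed so far, scaled by an overhead factor beta (an assumption)."""
    return beta * f_q

W = [10.0, 15.0, 5.0]               # W1, W2, W3
F = cumulative_work(W)              # [10.0, 25.0, 30.0]
costs = [checkpoint_cost(f) for f in F]
print(F, costs)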

 Process Pa accomplishes the following:
 part 1
  if Pa has recovered from a failure then
   CPa := latest event logged in stable storage
  else
   CPa := latest event that took place in Pa {can be in volatile storage or stable storage}
  end if
 part 2
 for i = 1 to K
  do
   for each neighbor process Pb do
    calculate SDab (CPa)
    send a ROLLBACK(a, SDab (CPa)) message to Pb
   end for
   for every ROLLBACK (b, Oc) message received from a neighbor Pb do
    if RCab (CPa) > Oc //indicates presence of an orphan message
     then
      find the latest event e such that RCab (e) = Oc
      CPa := e
    end if
   end for
  end for
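The core of the rollback step, detecting orphan messages by comparing the receive count RCab(CPa) with the sender's reported count Oc, can be sketched as follows; this is a simplified single-neighbor illustration, not the full distributed protocol.

def adjust_checkpoint(events, reported_send_count):
    """events: ordered list of logged events of Pa, each a dict holding the
    cumulative count of messages received from neighbor Pb ('rc_from_b').
    reported_send_count: Oc, the number of messages Pb reports having sent
    up to its own recovery point.

    If Pa has received more messages than Pb claims to have sent, the extra
    ones are orphan messages, so Pa rolls back to the latest event e with
    RCab(e) = Oc."""
    current = events[-1]
    if current["rc_from_b"] <= reported_send_count:
        return current                      # no orphan messages; keep CPa
    for event in reversed(events):
        if event["rc_from_b"] == reported_send_count:
            return event                    # new recovery point CPa := e
    return events[0]                        # fallback; the algorithm assumes e exists

log = [{"rc_from_b": 0}, {"rc_from_b": 2}, {"rc_from_b": 4}]
print(adjust_checkpoint(log, reported_send_count=2))   # {'rc_from_b': 2}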

It is assumed that no failure happens during the rollback recovery process. The total execution time can be represented as Ttotal = Tnf + Tf, where Tnf is the execution time of a process without failure and Tf is the execution time of a process with failure and recovery.

4. Simulation Results

Experimental setup, performance metrics, and experimental results are discussed in this section.

4.1. Experimental Setup

The CloudSim toolkit simulator is used to evaluate the proposed method. We use real workload traces (log files) from PlanetLab, which is part of the CoMon project, containing the CPU utilization of more than 1000 VMs running on hosts at more than 500 locations across the world. We use 4 types of VM instances: micro, small, medium, and large. 800 heterogeneous hosts of the HP ProLiant G4 and HP ProLiant G5 categories are used. The number of tasks generated is between 100 and 1000.

Faults are generated using the FaultGenerator class in CloudSim. A VM fault is induced by shutting down the VM, a resource contention fault is simulated by reducing the resource capacity, and a modified FaultGenerator class is used to simulate byzantine and performance faults.

4.2. Performance Metrics and Results

In the proposed method (intelligent fault-tolerant mechanism), best fit decreasing (BFD) is used as the VM placement technique. Here, the active hosts are categorized according to their power efficiency, and the most efficient ones are favored. In BFD, a host is considered better than another host if its power efficiency is greater and its fault count is lower. The proposed method is compared with a nonoptimized checkpointing technique that uses FCFS.

The following metrics are used to evaluate the performance of the proposed and other methods.

4.2.1. Power Consumption

This metric represents the total amount of energy utilized by all of the data center's physical machines (PMs). A linear power consumption model is used to calculate the energy consumption of the PMs; in this power model, the physical host's power consumption climbs linearly as CPU utilization rises. For the power model, we consider the following parameters.
Pkmax : maximum power consumed when host k is completely utilized.
Pkidle : idle power value of host k.
Uk : current CPU utilization of host k.
T : total number of hosts in the data center.

The power consumption Pk of host k can be expressed as
Pk = Pkidle + (Pkmax − Pkidle) × Uk.

Our goal is to reduce data center power usage, and accordingly we aim to minimize the total power consumption
Ptotal = ∑k=1,…,T Pk.
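The sketch below evaluates this linear power model for a small set of hosts; the per-host idle and maximum power figures are example values only, and the formula follows the reconstruction given above.

def host_power(p_idle, p_max, utilization):
    """Linear power model: idle power plus a share of the dynamic range
    proportional to the current CPU utilization (0.0 to 1.0)."""
    return p_idle + (p_max - p_idle) * utilization

def total_power(hosts):
    """Sum the power drawn by all hosts in the data center."""
    return sum(host_power(h["idle"], h["max"], h["util"]) for h in hosts)

# Example values only (watts); they do not come from the paper.
hosts = [
    {"idle": 86.0, "max": 117.0, "util": 0.40},   # HP ProLiant G4-like host
    {"idle": 93.7, "max": 135.0, "util": 0.75},   # HP ProLiant G5-like host
]
print(round(total_power(hosts), 1))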

Figure 7 shows the power consumption of both methods. The average power consumption of the proposed method is lower than that of the nonoptimization method for the datasets planetlab/20110303 to planetlab/20110420.

4.2.2. Makespan

Makespan is the total execution time required to process all the tasks. Since faults are simulated, a few tasks may fail and be restarted from the identified checkpoint, causing task completion to take more time than expected. Makespan is one of the key performance metrics for evaluating the algorithms/methods. Figure 8 shows the execution time of the proposed method and the nonoptimization method. As shown in the figure, the average execution time of the proposed method using the optimization technique is about 25% lower than that of the nonoptimization method.

Figures 9 and 10 show the standard deviation of the execution time of the proposed method and the nonoptimization method with VM selection. The standard deviation falls within the range of 0.005 to 0.012 seconds for the proposed method, while for the nonoptimization method it ranges from 0.009 to 0.021 seconds.

The comparison of the number of tasks completed by the proposed method and the nonoptimization method, with the number of tasks varied from 100 to 1000, is presented in Figure 11. The total number of tasks completed by the proposed method is higher than that of the nonoptimization method. Consequently, reliability is higher because fewer tasks fail in the proposed method. Reliability can be measured as the inverse of the failure probability, so a higher number of completed tasks signifies higher system reliability.

5. Conclusion

This paper aims at enhancing the reliability of cloud services through a fault-tolerant mechanism. The proposed approach is a three-phase process: phase 1 is the detection of virtual machine (VM) failure due to the higher response time of a node, byzantine faults, and performance faults. The checkpoint optimization algorithm in phase 2 finds a suitable time interval at which to take checkpoints periodically while executing the tasks. Finally, in the checkpoint and recovery phase, in case of failure, the backup and recovery algorithm finds the optimal global checkpoint from which to restart the failed tasks. The evaluation results using a real-time dataset show that the proposed method provides a better fault-tolerant solution, decreasing the execution time and energy consumption and increasing reliability compared to the nonoptimization method. Our future work includes developing fault tolerance mechanisms using other reactive techniques and testing on different workload traces [12].

Data Availability

No dataset is required.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Madhusudhan H. S., Satish Kumar T., and Punit Gupta developed the theory and proposed the model. Dr. S. M. F. D. Syed Mustapha and Rajan Prasad Tripathi worked on data preparation and model for the dataset and verified the analytical methods. All authors discussed the results, contributed to the final manuscript, and agreed to the submitted version. All authors confirm sole responsibility for the following: study conception and design, data collection, analysis and interpretation of results, and manuscript preparation.