Abstract

Sensitivity assessment of availability for data center networks (DCNs) is of paramount importance in the design and management of cloud computing based businesses. Previous work has presented a performance modeling and analysis of a fat-tree based DCN using queuing theory. In this paper, we present a comprehensive availability modeling and sensitivity analysis of a DCell-based DCN with server virtualization for business continuity using stochastic reward nets (SRN). We use SRN to capture the complex behaviors and dependencies of the system in detail. The models take into account (i) two DCell configurations, composed of two and three physical hosts in a DCell0 unit, respectively, (ii) failure modes and corresponding recovery behaviors of hosts, switches, and VMs, together with the VM live migration mechanism within and between DCell0s, and (iii) dependencies between subsystems (e.g., between a host and its VMs and between switches and VMs in the same DCell0). The constructed SRN models are analyzed in detail with regard to various metrics of interest to investigate the system's characteristics. A comprehensive sensitivity analysis of system availability is carried out with respect to the major impacting parameters in order to observe the system's complicated behaviors and to find the bottlenecks of system availability. The analysis results show the availability improvement, fault tolerance capability, and business continuity of DCNs that comply with the DCell network topology. This study provides a basis for the design and management of DCNs for business continuity.

1. Introduction

Cloud computing based businesses have been demanding a rapid escalation of IT infrastructures with efficient resource organization and a high level of continuity. To ensure business continuity, data centers have drastically evolved in size and architectural design to host a variety of cloud computing applications and services such as online social networking, e-commerce, scientific computing, and big data processing. Nevertheless, a data center becomes a central point of failure in the cloud infrastructure in that the failure of components (e.g., links, switches, and servers) may result in the failure of a whole set of connected components [1]. Internet enterprises may incur losses of millions of dollars per hour of service outage [2], since their business operations require constantly connected and online services. Avoiding such risks and improving the safety of DCNs is therefore essential, and designing a DCN for fault tolerance and business continuity is critical and a sharp focus of both academia and industry.

Recent work has attempted to design and organize a data center's resources as a network in which a large number of physical hosts are interconnected in a specific topology called a data center network (DCN), for instance, fat-tree [3], DCell [4], and BCube [5]. DCN topologies are thus the communication backbone of a data center [6]. The critical requirements in designing a DCN are scalability and efficiency to connect tens or even hundreds of thousands of physical hosts [3, 4]. From the perspective of system users, however, the metrics of interest for a DCN are the overall system availability and the continuity of their hosted services and applications [7]. In this context, DCell, proposed by Guo et al. [4], has emerged as an appropriate DCN architecture: it scales to millions of servers in data centers [8] by recursively constructing higher level DCells from a DCell0 as the fundamental building block. The DCell network architecture avoids any single point of failure and is thus able to tolerate different types of failures such as node failures, link failures, and network device failures. Furthermore, to enhance system availability and fault tolerance capability, one may employ server virtualization [7, 9–11] in a DCN. This approach creates virtual machines (VMs) on each physical host of the DCN. Along with the nature of the DCell topology, the VMs become the core elements of the network to deliver high availability and fault tolerance in that a VM can be migrated from one host to another [12, 13] and from one DCell0 to another DCell0 in the DCN [14] in order to avoid hardware failures and thus to assure business continuity for system users. The DCell-based DCN with server virtualization is the focus of this paper.

There are a number of papers presenting and describing DCN topologies [3–5]. Other work has been concerned with different aspects of DCNs, including fault tolerance characteristics [1, 15], structural robustness of DCN topologies [16], and connectivity of DCNs [6]. Nevertheless, none of these papers presented a quantitative assessment of system behaviors using stochastic models [17]. One previous work [18] attempted to model and analyze a simple configuration of a two-computer network with redundancy of network devices/links for fault tolerance. To the best of our knowledge, only a recent paper [19] delivered a thorough performance modeling and analysis of a fat-tree based DCN using queuing theory. Thus we find that modeling and analysis of a virtualized DCN using stochastic models are still a preliminary endeavor. This motivates us to model and analyze a virtualized DCell-based DCN using SRN.

We summarize the main contributions of our work as follows:
(i) Modeled a DCell-based DCN for business continuity under two configurations, consisting of two and three virtualized servers in a DCell0, respectively, in a complete manner using SRN.
(ii) Incorporated failure modes and recovery behaviors of hosts, switches, and VMs along with VM live migration within and between DCell0s for the sake of fault tolerance.
(iii) Captured the featured dependencies between components in the system architecture in detail: (a) between hosts and VMs and (b) between switches and VMs.
(iv) Performed detailed analyses of the constructed SRN models in terms of steady state availability, downtime cost, modeling complexity, and sensitivity with respect to major parameters.
Through modeling and analysis, we have found the following:
(i) A virtualized DCell-based DCN with a greater number of hosts in a DCell0 can enhance the level of continuity and availability. The DCN with three hosts in a DCell0 can deliver a level of availability above tier 4 of the HA standards for a data center [20].
(ii) In a virtualized DCN based on the DCell network topology, the recovery of the hardware (hosts) and software (VMs) subsystems has a major impact on system availability. As the size of the DCN increases, the recovery of the software subsystem plays a more important role than that of the hardware subsystem.
(iii) A bigger VM image size in a DCN causes a declining tendency of system availability. Nevertheless, in a more complex DCN, the influence of the VM image size is mitigated.
(iv) The cross-links between the pairs of hosts in different DCell0s in a DCN and their bandwidth are necessary elements to tolerate switch failures, to mitigate system downtime, to improve overall system availability, and thus to secure system operation for business continuity.
(v) The modeling and analysis of the virtualized DCNs in this paper help guide the design and management of a DCN:
(a) A thorough adoption of software fault tolerance is necessary in the design of a DCN.
(b) The effectiveness and readiness of the repair and maintenance services in a DCN need more attention and improvement.
(c) The trade-off between system availability and performance and the overall cost of networking [21] in a DCN is an important consideration in system design.

The rest of this paper is organized as follows. Related work is presented in Section 2. Section 3 introduces a virtualized DCN. Section 4 presents SRN models for the DCNs. The numerical analysis and discussion are presented in Section 5. Finally, Section 6 concludes the paper.

2. Related Work

The design of data center infrastructure is in critical demand in research and development, both in academia and in industry, to deliver cloud-based online applications and services with the highest availability. The server-network architecture within a data center thus plays an important role in enhancing the agility and reconfigurability of interconnecting different infrastructure resources. In that context, the topology of a DCN has a major impact on system performance, availability, and scalability in meeting changing application demands and service requirements [22]. Current DCNs adopt a three-layer network topology in which physical servers are interconnected into a rack by a top of rack (ToR) switch, the ToR switches are networked through end of rack (EoR) switches, and core switches connect these EoR switches together and to the external network providers [15]. This topology, however, confronts several inherent issues: (i) fault-dependency propagation, in which a failure of an upper-level switch causes the complete disconnection and unavailability of the dependent switches and servers connected to it, and (ii) the significant bandwidth required to maintain efficient connectivity. To avoid these issues, a number of network topologies have been proposed as alternatives, including (i) tree-based topologies, such as fat-tree [3, 23, 24] and Clos Network [25], and (ii) recursive topologies, such as DCell [4], FiConn [26], BCube [5], and Hyper-BCube [27]. In the work [15], Liu et al. conducted a detailed comparison between the DCNs and concluded that no single topology outperforms the others in all aspects and that there will always be trade-offs among cost, performance, and reliability. Among these network topologies, DCell stands out as a candidate that satisfies the requirements of robustness and connectivity [6], fault tolerance, and scalability in a data center, even though its aggregate bottleneck throughput is comparatively low [15]. In this paper, we focus on the DCell-based network topology in consideration of fault tolerance and network availability, which were not considered in the previous work [15]. A DCell [4] is recursively constructed from the most basic element, DCell0, as follows:
(i) A DCell0 consists of n physical servers, all connected to an n-port switch.
(ii) A DCell1 is composed of n + 1 DCell0s. Each server of a DCell0 in a DCell1 has two links: one connects to its switch; the other connects to a corresponding server in another DCell0, complying with a predetermined DCell routing algorithm. Consequently, every pair of DCell0s in a DCell1 has exactly one link between them.
(iii) A DCellk is a level-k DCell built recursively from a number of DCellk-1 units.
In this paper, we study two DCell-based DCNs at the level of DCell1 which consist of two and three servers in a DCell0, respectively. We will show, with availability as the measure of interest, that such DCell-based DCNs expose better ability to tolerate node and switch failures.
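To make the level-1 construction rule above concrete, the following minimal Python sketch enumerates the cross-links of a DCell1 by connecting server [i, j - 1] to server [j, i] for every pair of DCell0s with i < j, the rule given in the original DCell paper [4]; the function name and zero-based indexing are our own illustrative choices, not the authors' implementation.

```python
def dcell1_links(n):
    """Cross-links of a DCell1 built from (n + 1) DCell0s of n servers each:
    server [i, j - 1] connects to server [j, i] for every pair i < j."""
    cells = n + 1
    links = []
    for i in range(cells):
        for j in range(i + 1, cells):
            links.append(((i, j - 1), (j, i)))
    return links

# n = 2 (the DCN2 configuration studied later): 3 DCell0s, 3 cross-links
print(dcell1_links(2))  # [((0, 0), (1, 0)), ((0, 1), (2, 0)), ((1, 1), (2, 1))]
# n = 3 (the DCN3 configuration): 4 DCell0s, 6 cross-links
print(dcell1_links(3))
```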

High availability (HA) and business continuity (BC) are the key factors in designing enterprise computing systems for a cloud-based business to be successful [28]. Nevertheless, as computing systems with a high level of complexity and dependency have emerged, such as Infrastructure-as-a-Service (IaaS), the software defined data center (SDDC), and the software defined network (SDN), these systems are prone to a variety of failures. Worse, a failure of one component may cause a cascading failure or the unavailability of a group of other components. For instance, the failure of a switch connecting a number of physical servers causes the unavailability of that set of servers at the same time. To achieve the predetermined levels of HA and BC indicated in the service level agreement (SLA) [29, 30] between customers and the system owner, the system design has to tolerate any single point of failure in both hardware and software subsystems. The ANSI/TIA-942 standard [20] presents four tiers, each specifying basic requirements on availability and downtime minutes per year as follows: (i) tier 1 (basic): 99.671% availability and 1729.224 downtime minutes per year; (ii) tier 2 (redundant components): 99.74% availability and 1361.304 downtime minutes per year; (iii) tier 3 (concurrently maintainable): 99.982% availability and 94.608 downtime minutes per year; and (iv) tier 4 (fault tolerant): 99.995% availability and 26.28 downtime minutes per year. In order to achieve these levels of the HA standard, one may adopt server virtualization [11, 31–34] on nodes and apply VM live migration [12, 35, 36] as the means of fault tolerance for nodes and switches. Server virtualization creates and maintains multiple VMs on each physical server. With the VM live migration mechanism, a VM can not only be migrated from a failed host to another host in the same DCell0 as soon as the host's failure occurs but can also be migrated from a running host in one DCell0 to a host in another DCell0 if the former DCell0's switch fails. In this paper, we will show a comprehensive modeling and analysis of a virtualized DCell-based DCN for high availability and continuity. The analysis results shown in Section 5 reflect that a DCN complying with the DCell network topology, along with server virtualization and the VM live migration mechanism, can achieve HA whereas a DCN with a standalone DCell0 cannot. Furthermore, the overall system availability of the DCell-based DCN surpasses tier 4 (a highly available and fault tolerant system) of the ANSI/TIA-942 HA standard for a data center.
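The tier figures above follow from a simple conversion between availability and yearly downtime (525,600 minutes per year). The short Python sketch below is our own illustration of that conversion and of checking which tier a given availability satisfies; the helper names are not from the paper or the standard.

```python
# ANSI/TIA-942 tier availability requirements quoted above
TIERS = [(0.99995, "tier 4"), (0.99982, "tier 3"), (0.9974, "tier 2"), (0.99671, "tier 1")]
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability):
    """Expected yearly downtime (minutes) for a given steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def tier_of(availability):
    """Highest tier whose availability requirement the given value meets."""
    for threshold, label in TIERS:
        if availability >= threshold:
            return label
    return "below tier 1"

print(tier_of(0.99995), downtime_minutes_per_year(0.99995))  # tier 4, 26.28
```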

Sensitivity analysis [10, 18, 37, 38] is widely used to provide a selection basis and to help design system parameters by observing system characteristics and responses with respect to predetermined variables, in order to identify the most impactful factors as well as to detect bottlenecks in system availability. One may adopt two types of sensitivity analysis: (i) nonparametric sensitivity analysis [39], which studies the system responses upon component addition/removal or modifications of the system model, and (ii) parametric sensitivity analysis [40], which observes the system behaviors with respect to variations of given input parameters. Parametric sensitivity analysis has been adopted to assess system performance and reliability/availability under the effect of changes of given parameters in different systems. Nguyen et al. [10] presented a comprehensive sensitivity analysis of steady state availability for a virtualized servers system. Their thorough study of availability sensitivity with respect to the intervals of software rejuvenation on VMMs and VMs provides a design basis for improving the availability of a virtualized system by combining software rejuvenation at both the VM and VMM levels. In the work [18], Matos Jr. et al. applied parametric sensitivity analysis to a redundant computer network system with respect to the MTTF and MTTR of every network component to figure out the important and influential factors of network availability. In another work [38], Matos et al. implemented four different sensitivity analysis techniques to identify the parameters with the greatest impact on the availability of a mobile cloud computing system. Accordingly, a sensitivity analysis can be conducted to assess the importance of parameters in the following approaches:
(i) Repeatedly vary one selected parameter at a time while the others are kept constant and observe the system behaviors on the measures of interest with respect to the varying parameter. This approach studies the system responses over a broad value range of the parameters in consideration.
(ii) For differential sensitivity analysis, compute partial derivatives of the measure of interest with respect to each system parameter. This approach is useful when the input values of parameters are assigned in a continuous domain. The differential sensitivity of the system availability A with respect to a variable θ is defined as in (1), or as in (2) for a scaled sensitivity:
S_θ(A) = ∂A/∂θ,   (1)
SS_θ(A) = (θ/A) ∂A/∂θ.   (2)
(iii) Calculate the percentage difference in the variation of a parameter from its minimum to its maximum value. This technique is designed for integer-valued parameters, which are not properly evaluated by the differential sensitivity analysis approach.
(iv) For design of experiments (DOE) [41], simultaneously examine individual and interactive effects of factors on the output measures.
In this paper, we adopt approaches (i) and (ii) to study system behaviors and responses with respect to parameters at their default values and over a broad value domain. These analyses help (i) find the major factors impacting system availability, (ii) study system characteristics under variations of parameters, (iii) design system parameters accordingly, and (iv) guide the proactive adoption of different fault tolerance techniques to achieve optimized overall system availability.
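As an illustration of approaches (i) and (ii) above, the following minimal Python sketch sweeps one parameter at a time and approximates the scaled sensitivity SS_θ(A) = (θ/A) ∂A/∂θ by finite differences. The function availability(params) is a placeholder for whatever availability model is being analyzed (e.g., an SRN solved per parameter set), not an API of the paper or of SPNP.

```python
def scaled_sensitivity(availability, params, name, rel_step=1e-3):
    """Finite-difference approximation of SS_theta(A) = (theta / A) * dA/dtheta
    for the parameter `name`, with all other parameters held at their values in `params`."""
    theta = params[name]
    a0 = availability(params)
    bumped = dict(params, **{name: theta * (1.0 + rel_step)})
    dA_dtheta = (availability(bumped) - a0) / (theta * rel_step)
    return (theta / a0) * dA_dtheta

def one_at_a_time(availability, params, name, values):
    """Approach (i): vary one parameter over a range while the others stay fixed."""
    return [(v, availability(dict(params, **{name: v}))) for v in values]
```

A study of this kind would then rank the parameters by the magnitude |SS_θ(A)|, which is how the impact ordering in Section 5.2 can be read.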

There are a few works on sensitivity modeling and analysis of availability for DCNs. Matos Jr. et al. [18] modeled and analyzed the availability of a small-scale computer network using a continuous time Markov chain (CTMC). This very first work studied the impact of the failures of network devices (switches and routers) and network links on system availability. The study took into account the redundancy of either network devices or links as a measure to tolerate the aforementioned failures and to improve system availability. The contributions of this work suggest the approach of adopting stochastic models to analyze a DCN. However, as the system scale increases, CTMC models (where each state in the model is the combination of all states of the components in consideration) are likely to confront the largeness problem, or state-space explosion, as well as an intractable presentation of the model. Furthermore, unplanned redundancy of physical devices is a costly solution, especially for DCNs, as the number of machines increases vastly. The physical components need to be organized in a well-designed network topology for fault tolerance, high availability, and/or performance. Alshahrani and Peyravi, in a very first work [19] on modeling and analysis of DCNs, adopted queuing theory to model a typical DCN topology, fat-tree [3]. This work proposed a detailed analytical model to assess performance metrics of interest (e.g., throughput and delay) of a fat-tree based DCN. The work nevertheless did not take into consideration any type of failures, and the system architecture is composed only of network devices for simplification of the theoretical formulation. This preliminary work raises a need to conduct further studies on various attributes of dependability (e.g., availability, reliability, and performability [42]) of different DCN topologies. Several other works studied various essential issues of contemporary DCNs. Liu et al. [1, 15] studied fault tolerance characteristics of renowned DCN topologies. Among the different DCN topologies presented in the works [3–5, 23, 25, 27, 43], the DCell topology is pointed out as a typical topology with high scalability and fault tolerance capability [1, 4, 15], attracting increasing interest in practice. Several works [44–46] presented large-scale empirical studies of failures in typical data centers. These works have characterized a variety of failures in DCNs, such as failures of servers (e.g., hard disk, memory, and RAID controller failures) and failures of network devices (top-of-rack switch, aggregation switch, and router failures). Other works [7, 9, 11, 47–50] showed that the adoption of virtualization technology and VM migration in computing systems is of paramount importance to achieve high availability and to tolerate unexpected risks or failures.

Based on the above literature review, we find that the modeling and analysis of a DCN are still in their initial steps. Previous work attempted to model and analyze DCNs using either stochastic models or queuing theory without an adequate consideration of different system failures and the corresponding fault tolerance techniques. Moreover, the system architectures did not incorporate contemporary virtualization technology and VM migration techniques for high availability and efficient fault tolerance in a virtualized environment. Our focus in this paper is on the ability of DCell-based DCNs to tolerate any type of risk in the system in order to ensure the system's safety in terms of system operation and availability. We are more interested in risks, and in measures to tolerate them, for DCNs to achieve high availability (which is critically demanded in the design of networks in data centers) than in other attributes of a DCN's dependability. Therefore we choose to study the DCell network topology for DCNs, which allows us to enhance the system's fault tolerance capability. We use SRN, with its variety of modeling functionalities, in order to capture the dependency between upper and lower level components in the system architecture (for instance, between a physical server and its hosted VMs or between a switch and its connected servers). Furthermore, the SRN models are more tractable than the corresponding CTMCs and thus enable us to incorporate various behaviors (failures, VM migration, and interactions between submodels). We will use SRN to model typical DCell-based DCNs in the next sections.

3. A Virtualized Data Center Network

3.1. System Architectures

The system architectures of a DCell-based DCN are depicted in Figure 1. Figure 1(a) shows the architecture of a DCell-based DCN with two servers in a DCell0 (called DCN2 from now on), whereas Figure 1(b) depicts the architecture of the DCell-based DCN with three servers in a DCell0 (called DCN3). Both DCN2 and DCN3 comply with the DCell configuration and routing [4]. Accordingly, DCN2 consists of three DCell0s: DCell0[0], DCell0[1], and DCell0[2]. Each DCell0 comprises two servers (also called hosts) and a gigabit switch that connects clients to the servers. All hosts are virtualized to run a number of virtual machines (VMs). Within a DCell0, gigabit links for high speed data transactions connect the hosts and the corresponding switch. For easy understanding of the SRN models of the system (to be presented in the next sections), we apply a naming convention in which the characters representing a component are followed by specific indices. In particular, DCell0[0] consists of the switch S0, the hosts H00 and H01, and their respective virtual machines VM00 and VM01. The naming convention is applied in the same way to DCell0[1], which comprises S1, the hosts H10 and H11, and their virtual machines VM10 and VM11, and to DCell0[2], which is composed of S2, the hosts H20 and H21, and their virtual machines VM20 and VM21. Following the above descriptions, DCN3 is composed of four DCell0s, from DCell0[0] to DCell0[3]. Each DCell0 in DCN3 consists of one switch and three hosts, and each host in turn runs a number of VMs. The notation of components in DCN3 complies with the naming convention described for DCN2. The network routing is formed between hosts in different DCell0s. In particular, the internal network links in DCN2 are formed between the following pairs of hosts: (H00, H10); (H01, H20); and (H11, H21). In DCN3, the links are formed between the pairs (H00, H10); (H01, H20); (H02, H30); (H11, H21); (H12, H31); and (H22, H32), complying with the DCell routing tactics [4]. We will use these architectures to model and analyze the system availability in the next sections.

3.2. System Behaviors and Assumptions

(i) Operational States. A host and a switch can fail and recover according to the states of their hardware components, similar to the blade server described in [51]. A VM may go through a variety of complicated states as in [10, 44, 45]. However, capturing the different operational states of hosts, VMs, and switches completely in the models is beyond our focus and could lead to the largeness problem [52]. Hence, a two-state model (up and down states) is used to represent the basic operational states of the system's subsystems.

(ii) VM Live Migration. The VM live migration technique is employed to tolerate unexpected failures of hosts and switches. In a DCell0, if a host fails, all VMs running on the failed host are immediately migrated onto the remaining hosts with consideration of load balancing. Moreover, if a switch fails, all VMs operating on the hosts in the DCell0 of that failed switch are instantly migrated to hosts in the other remaining DCell0s (a simplified sketch of this migration policy is given at the end of this subsection). For instance, in the case of DCN2, if the host H00 fails, the VMs running on H00 are live-migrated onto the host H01. When the switch S0 goes down, the live migration processes are triggered instantly to migrate the VMs running on H00 and H01, respectively, onto the host H10 of DCell0[1] and onto the host H20 of DCell0[2]. The descriptions apply in the same way to the other DCell0s in DCN2 and DCN3. The above VM live migration mechanisms prevent the VMs from suffering unexpected downtime due to failures of hosts and switches; thus system availability is improved and business continuity is ensured. In order to reduce the complexity of the system models, we assume that the VM live migration processes do not confront unexpected failures, such as data loss or memory errors, during the migration period, as assumed in some previous work [35, 53].

(iii) Virtual Machine Monitor (VMM). The hypervisor, or VMM, is in charge of creating and maintaining the virtualization environment that operates the upper-level VMs. Thus, the operational states of the VMs depend on the operation of the underlying VMM. The detailed dependencies of a host, a VMM, and a VM are captured in a number of works [10, 44, 45]. Nevertheless, we do not take the VMM into account in the modeling, for simplification, and our focus is on the states of the VMs, since users' applications and services run on the VMs. We consider the up and down states of the VMM as part of the host's up and down states.

(iv) High Availability. Our systems are designed to deliver HA services to the user. In an HA system, availability is associated with the states in which more components are up. Thus, we assume that the repair and maintenance services in data centers are good enough to recover failed hardware components before the remaining components fail. In particular, for DCN2 we assume the following:
(i) If a host fails in a DCell0, the remaining host fails only after the recovery of the failed host.
(ii) If the number of failed hosts in DCN2 is larger than or equal to half of the total number of hosts, the remaining hosts can fail only after one of the failed hosts is recovered.
(iii) If two switches fail, the repair person is summoned and repairs the failed switches before the remaining switch fails.
(iv) A switch and a host are in operation and can fail only if there is at least one VM running on a host of the same DCell0.
For DCN3, we assume the following:
(i) If two hosts fail in a DCell0, the remaining host can fail only after either one or both of the failed hosts are recovered.
(ii) If the number of failed hosts in DCN3 is larger than or equal to half of the total number of hosts, the remaining hosts can fail only after one of the failed hosts is recovered.
(iii) If three out of four switches fail, the remaining switch can fail only after the recovery of one of the failed switches.
(iv) A switch and a host are in operation and can fail only if there is at least one VM running on a host of the same DCell0.
The purposes of these assumptions are to reduce the largeness of the model and to exclude cases with very low probability of occurrence in an HA system.

(v) Distributions. The time to occurrence of any event in an actual computing system may follow different types of probability distributions [54]. However, we can make appropriate distribution assumptions for every transition so that the analytical system model is closer to the practical system. In this paper, we choose to use the exponential distribution for simplification in modeling and analysis, a common choice in a large number of papers [10, 44, 45].

The above system behaviors and assumptions are all taken into consideration in the modeling of the DCNs to be carried out and described in detail in Section 4.
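The following minimal Python sketch illustrates the migration policy described in item (ii) above, under our own simplified data model (hosts grouped by DCell0 and the cross-links of Section 3.1 given as host pairs); the function and variable names are hypothetical and not taken from the paper.

```python
def target_on_host_failure(failed_host, hosts_in_cell, vm_count):
    """Host failure: send the failed host's VMs to the least-loaded
    surviving host in the same DCell0 (simple load balancing)."""
    survivors = [h for h in hosts_in_cell if h != failed_host]
    return min(survivors, key=lambda h: vm_count[h]) if survivors else None

def targets_on_switch_failure(failed_cell, cross_links, cell_of):
    """Switch failure: each host of the failed DCell0 sends its VMs over its
    cross-link to the peer host in another DCell0."""
    targets = {}
    for a, b in cross_links:
        if cell_of[a] == failed_cell and cell_of[b] != failed_cell:
            targets[a] = b
        elif cell_of[b] == failed_cell and cell_of[a] != failed_cell:
            targets[b] = a
    return targets

# Example with the DCN2 links listed in Section 3.1:
links = [("H00", "H10"), ("H01", "H20"), ("H11", "H21")]
cell = {"H00": 0, "H01": 0, "H10": 1, "H11": 1, "H20": 2, "H21": 2}
print(targets_on_switch_failure(0, links, cell))  # {'H00': 'H10', 'H01': 'H20'}
```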

4. Stochastic Reward Net Models

4.1. System Models

The SRN system models of DCN2 and DCN3 are depicted in Figures 2 and 3, respectively. The system models are composed of partial models, including host models, switch models, and a VM subsystem model, named in order from Figures 2(a) to 2(g) in the SRN system model for DCN2 and from Figures 3(a) to 3(q) in the SRN system model for DCN3. In consideration of system availability measures, we use a two-state model (up and down states) for hosts, switches, and VMs for the sake of modeling simplification. Our focus in modeling is on the dependency between components and on the fault tolerance behaviors for high availability in the case of any component's failure. We will describe the model of one specific component as an example that carries over to the other similar components; the model integration and dependencies are then presented subsequently. The transitions of tokens within the models are conducted stochastically by enabling/disabling the timed/immediate transitions based on the predefined behaviors. The combinations of all the tokens' locations in the models represent the system's respective states. We list all the states of every component and the possible locations of tokens in the models in Table 1. In order to capture the exact predefined system behaviors, we apply a set of guard functions [55–58] attached to the transitions in the models to control the transitions of tokens.
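A guard function is a Boolean predicate over the current marking that enables or disables its transition. The Python sketch below is our own abstraction of the idea, using hypothetical place names as dictionary keys rather than SPNP's CSPL syntax; it shows the kind of conditions used to enforce the HA assumptions of Section 3.2 and the migration dependencies described later.

```python
def guard_remaining_host_failure(marking):
    """HA assumption: the second host of a DCell0 may fail only while
    the first host is not already down (no token in its down place)."""
    return marking["H00_down"] == 0

def guard_vm_migration(marking):
    """Migration fires only while the source host is down, the destination
    host is up, and at least one VM token is waiting in the migration place."""
    return (marking["H00_down"] == 1
            and marking["H01_up"] == 1
            and marking["VM00_wait"] > 0)

# Example marking: H00 has failed, H01 is up, two VMs are waiting to migrate.
m = {"H00_down": 1, "H01_up": 1, "VM00_wait": 2}
print(guard_remaining_host_failure(m), guard_vm_migration(m))  # False True
```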

4.2. SRN Models of Hosts, VMs, and Switches

Figure 4 shows the two-state SRN models of a selected host, VM, and switch. In the assumptions, we mentioned that the characteristics and configurations of all hosts, VMs, and switches are assumed to be identical initially. Thus, we can use two-state SRN models to capture the up and down states of each component with regard to availability measures. We describe the modeling of the host H00, the VM00, and the switch S0 as examples that carry over to the modeling of the other identical hosts, VMs, and switches in both SRN system models of DCN2 and DCN3.

Figure 4(a) depicts the modeling of a host with repair actions. Initially, a host is considered to be in the running state, depicted by a token in its up place. A virtualized host in a DCN may undergo an expected failure or maintenance period after a specific time given by its MTTF. In this case, the failure transition fires, and the token is removed from the up place and deposited in the down place. As the host goes down, a repair person is summoned to recover the host. After the repair, the repair transition is enabled, the token is removed from the down place and deposited back in the up place, and the host returns to operation.

Figure 4(b) captures the behaviors of the VMs running on the host H00. Assume that a number of VMs are initially in the running state. As time goes by, a VM can fail with a given failure rate; the failure transition then fires, and one token is removed from the up place and deposited in the down place, so the VM goes down. Because the VMs in the up state compete to fail, the failure rate of the running VMs at any time depends on the number of running VMs, in other words, the number of tokens in the up place. Therefore, we apply marking dependence on the failure transition, represented by the marker "#". The VMs in the down state are repaired in sequence, by software or by a repair person, and a repaired VM restarts into a healthy state. This repair action is captured by firing the repair transition, which takes a token out of the down place and deposits it in the up place.

Figure 4(c) presents the failure and repair actions of a switch in the model. At the beginning, the switch is considered to be in a healthy state, depicted by a token in its up place. After some time it may fail: the failure transition fires, and the token is removed from the up place and deposited in the down place, so the switch has failed. After the failed switch is repaired, the repair transition is enabled and the token is moved back from the down place to the up place; the switch resumes normal operation.
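As a sanity check on the two-state component models above, the steady-state availability of a single component with exponentially distributed failure and repair times reduces to MTTF / (MTTF + MTTR). The short Python sketch below is our own illustration with hypothetical parameter values, not the defaults of Table 2.

```python
def two_state_availability(mttf_hours, mttr_hours):
    """Steady-state availability of an isolated up/down component with
    exponential failure (rate 1/MTTF) and repair (rate 1/MTTR) times."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical example: a host with an MTTF of 2000 h and an MTTR of 8 h
print(two_state_availability(2000.0, 8.0))  # ~0.99602
```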

4.3. SRN Models of a Standalone DCell0

Figures 5 and 6, respectively, depict the SRN models of the DCell0s comprising two and three hosts (hereinafter called DCN0 and DCN1), which are the basic units used to construct the DCell-based DCN2 and DCN3. DCN0 and DCN1 are in fact the DCell0[0] units of DCN2 and DCN3, respectively, taken out as examples for the modeling description. The SRN model of DCN0 in Figure 5 consists of the host models of the hosts H00 (Figure 5(a)) and H01 (Figure 5(c)), the switch model of the switch S0 (Figure 5(b)), and the VM models of VM00 and VM01 (Figure 5(d)). The modeling of these partial components can be referred to the description of the corresponding models in Figure 4. Here we describe the dependency of the VM model on the host and switch models. In particular, we apply VM live migration as a fault tolerance technique to avoid downtime of the VMs caused by their host's failures. Initially, all components are in the up state, depicted by tokens in their respective up places. At a certain time, the host H00 may fail, which is represented by a token in its down place; the failure and repair transitions of its hosted VMs are then disabled. All the VMs running on the host H00 are consequently triggered to undergo a live migration process. This behavior is captured by enabling an immediate transition that removes the tokens from the up place of VM00 and deposits them in an intermediate migration place. At this point, the VM migration process starts by enabling the timed migration transition: the tokens are removed from the intermediate place and deposited, one after another, in the up place of the VMs on the host H01. Thus, the VMs on the failed host H00 are all live-migrated onto the operational host H01 in the same DCell0. If the host H01 fails while the VM migration is in progress, the migration processes are interrupted and halted until the failed host H01 is recovered completely. The VMs' image files and related data are stored in the DCell0's storage, represented by the tokens remaining in the intermediate migration place; hence, the migration transition is disabled until the failed host H01 completes its recovery. Based on the above description, we can refer to the case of the host H01's failure. As soon as the host H01 fails (represented by a token in its down place) while the host H00 still runs (a token in its up place), the VMs running on the host H01 are live-migrated onto the host H00 through the same processes: the corresponding immediate transition fires, all the tokens are removed from the up place of VM01 and deposited in the intermediate migration place, and the migration of the VMs is carried out in sequence as long as the host H00 is operational. If the host H00 fails during the migration of VMs from the failed host H01, the migration transition is disabled and the VM migration processes stop until the host H00 is recovered. Furthermore, if both hosts H00 and H01 go down together, the running VMs' image files and related data are stored on a shared memory, which is captured by the VM tokens held in the intermediate migration places after being taken out of the respective up places; all the transitions in the VM subsystem model (Figure 5(d)) are then disabled to stop the VMs' operations completely. In addition, if the switch S0 fails (a token resides in its down place), the running VMs on the hosts H00 and H01 are live-migrated to the respective hosts in the other DCell0s according to the network routing presented in the system architecture in Figure 1(a). This behavior is presented in detail in the next section. The modeling description of the DCN1 model in Figure 6 follows the above description of the DCN0 model in Figure 5, with the corresponding changes of notation.
The DCN1 model consists of the host models of the hosts H00, H01, and H02; the switch model of the switch S0; and the VM models of VM00, VM01, and VM02, respectively hosted on the aforementioned hosts. The dependency and behaviors of the VM subsystem with respect to the operational states of the hosts and the switch are similar to those described in the DCN0 model. The VM live migration processes are conducted between any two of the three hosts. If a host fails, the running VMs on the failed host are live-migrated to the two remaining hosts, balancing the number of VMs on each host. If the switch goes down, the running VMs on each host are live-migrated to the corresponding hosts in the other DCell0s through the cross-links between DCell0s, according to the network routing shown in the DCN3 system architecture in Figure 1(b).

4.4. System Model Integration

The models of DCN2 in Figure 2 and of DCN3 in Figure 3 are composed of three DCN0s and four DCN1s, respectively, complying with the DCell-based network routing topologies of the system architectures in Figure 1. The modeling of every component and DCell0 unit follows the detailed descriptions of the partial component models in Figures 4, 5, and 6. In this section, we show the features of a DCell-based DCN upon system model integration. In a standalone DCell0, if the switch undergoes a downtime period because of an unexpected failure or planned maintenance, the communication between the computing machines in the DCell0 and the system users is disconnected. To avoid this adverse situation, in DCN2 and DCN3 the computing VMs are live-migrated to other DCell0s through the cross-links between hosts of different DCell0s. In particular, in DCN2, DCell0[0] connects to DCell0[1] via the link between the hosts H00 and H10 and to DCell0[2] via the link between the hosts H01 and H20. In turn, DCell0[1] connects to DCell0[2] via the link between the hosts H11 and H21. In DCN3, the above description applies similarly: a DCell0 connects to the three remaining DCell0s via links between pairs of specific hosts. We take the failure of the switch S0 of DCell0[0] in DCN2 as an example to describe the system behaviors and interactions between DCell0s upon switch failures. As the switch S0 fails, depicted by a token in its down place in Figure 2, all the running VMs on the hosts H00 and H01 (represented by the tokens residing in their up places) are live-migrated, respectively, to the host H10 in DCell0[1] and the host H20 in DCell0[2]. To capture these behaviors in the model, the corresponding immediate transitions are triggered to fire as soon as the token of S0 is removed from its up place and deposited in its down place. Subsequently, the tokens in the up places of VM00 and VM01 are removed and deposited in the respective intermediate migration places. At this point, the VMs' image files and related data stored in the local memory system of DCell0[0] are organized to be migrated from DCell0[0] to DCell0[1] and DCell0[2]. The corresponding timed transitions are enabled to start the migration processes. After the completion of the VM migration processes, the VMs previously hosted on H00 in the DCell0[0] with the failed switch S0 operate on the host H10 in DCell0[1], and the VMs previously hosted on H01 run on the host H20 in DCell0[2]. Under the same reasoning, we can describe the live migration of VMs from DCell0[1] to DCell0[0] and DCell0[2] upon the failure of the switch S1 and from DCell0[2] to DCell0[0] and DCell0[1] upon the failure of the switch S2. In the DCN3 system model, as soon as the switch S0 in DCell0[0] fails, the VMs running on the hosts H00, H01, and H02 (depicted by the tokens in their up places) are migrated to the hosts H10 in DCell0[1], H20 in DCell0[2], and H30 in DCell0[3], respectively (captured by depositing the tokens in the corresponding up places). Based on the above detailed description, the migration of VMs from the other DCell0s upon switch failures can be described accordingly.

5. Numerical Results

The SRN models of the DCNs are implemented in the Stochastic Petri Net Package (SPNP) [57]. SPNP provides two ways to implement SRN models: (i) the raw input language of SPNP, called CSPL (C-based SPN Language), an extension of the C programming language with a variety of application programming interfaces (APIs) for easier description of SRN models, and (ii) a Graphical User Interface (GUI) for intuitive specification of the SRN models, which is later converted into CSPL automatically by the software itself. The models are converted into a Markov Reward Model (MRM) and then solved using analytic-numeric methods with regard to the specific metrics of interest. We use the GUI to construct and verify the correctness of the SRN models, and the CSPL input to solve the models, generate the various numerical analysis results, and investigate the complexity of those analyses. Our metrics of interest include (i) steady state availability (SSA), (ii) downtime cost, and (iii) sensitivity of the SSA with respect to the major impacting parameters. The default values of the parameters used in the models are provided in Table 2, based on previous works [10, 50, 58, 59].

To investigate the capability of the DCNs to assure business continuity, we initially run one VM on each host of DCell0[0] in either DCN2 or DCN3, and no VMs are initialized on the other hosts. In general, the DCell-based DCNs can maintain business operations on the aforementioned VMs even in the case of switch failures by migrating the VMs onto the hosts of the other DCell0s. Thus the overall system availability is clearly improved. These features of the DCell-based DCNs are shown by the numerical analysis results in the next subsections.

5.1. Steady State Analysis

The steady state analyses are carried out along with the downtime and cost analyses for four case studies, (I) to (IV), as in Tables 3 and 4. In order to compute the measures of interest using SPNP, we define the requirements for our systems' availability as follows: (i) there is at least one VM running in a certain DCell0, and (ii) the switch in that DCell0 stays in an operational state. These requirements ensure that there is at least one connection between system users and running computing units. Based on the predetermined requirements, we define reward functions for the four DCNs that return 1 in the markings satisfying both conditions and 0 otherwise. The numerical results of the steady state analyses and the downtime cost analyses with default parameters are shown in Table 5. We assume that a minute of system downtime incurs a penalty of 16,000 USD on the system owner according to the SLA signed with customers [60]. The number of nines (computed from availability as nines = -log10(1 - SSA) [58]) is used to present the improvement and change of steady state availability in an intuitive way. The results show that the adoption of DCell-based architectures significantly improves the system availability and thus vastly decreases the downtime and the corresponding downtime cost. In particular, the DCN of a DCell0 with two hosts (DCN0 in Figure 5) has a steady state availability of about 2.55 nines; thus the downtime in a year is a huge 1450.4 minutes, and the system owner must bear 23,206,942 USD per year for this system's performance. If we adopt the DCell-based architecture in Figure 3 (DCN3), the system's steady state availability improves vastly, with the corresponding number of nines at about 5.19 (almost double that of DCN0); the system downtime drops to about 3.4 minutes per year, and the incurred cost is only 53,959 USD per year. These analysis results reflect the efficiency of the DCell-based DCN in terms of fault tolerance, achieving high availability and mitigating system downtime in comparison with a plain DCN without the DCell topology. In comparison, the DCNs with three hosts in a DCell0 (DCN3 and DCN1) gain relatively higher availability than the respective DCNs with two hosts in a DCell0 (DCN2 and DCN0). This is to say that an increase in the number of hosts (and thus of hosted VMs) can benefit the system owner by providing higher availability to customers.
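The nines, downtime, and cost figures above follow from the steady-state availability by simple conversions. The Python sketch below reproduces that arithmetic with our own helper name, using the 16,000 USD per downtime minute penalty assumed above.

```python
import math

MINUTES_PER_YEAR = 525_600
PENALTY_USD_PER_MINUTE = 16_000  # SLA penalty assumed in the text

def summarize(ssa):
    """Return (number of nines, yearly downtime in minutes, yearly downtime cost in USD)."""
    nines = -math.log10(1.0 - ssa)
    downtime_min = (1.0 - ssa) * MINUTES_PER_YEAR
    return nines, downtime_min, downtime_min * PENALTY_USD_PER_MINUTE

# An availability of about 0.99724 (consistent with the DCN0 figures quoted above)
# gives roughly 2.56 nines, about 1450 downtime minutes, and about 23.2 million USD per year.
print(summarize(0.99724))
```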

To observe the impact of the number of VMs on the steady state availability of the DCNs, we conduct the analyses with different numbers of VMs (from 1 to 6) until SPNP suffers unexpected memory computation errors (m.e.). Table 5 shows the analysis results of the steady state availabilities and their corresponding numbers of nines. In all cases, increasing the number of VMs yields slightly, but not significantly, higher system availability for the DCNs.

The adoption of the DCell network topology and the increase in the number of VMs in the DCNs do achieve significantly higher availability for the systems. Nevertheless, it is costly and time-consuming to model and analyze such complicated systems. Table 6 shows the complexity of the analyses using two measures: (i) the number of tangible markings and (ii) the number of marking-to-marking transitions. As clearly shown, the number of VMs has a major influence on the system complexity in modeling and analysis, especially for the systems adopting the DCell network topology (DCN2 and DCN3). For DCN0 and DCN1 (without the adoption of DCell), the system complexity increases from tens or hundreds to hundreds or thousands of markings and transitions as the number of VMs increases from 1 to 6. In the cases of DCN2 and DCN3, however, the system complexity grows from tens to tens of millions of markings and marking transitions as the number of VMs increases. This vast increase of the system complexity quickly causes memory errors in the computation. The DCN2 SRN model suffers unexpected memory errors once the number of marking transitions reaches tens of millions. The memory errors in the analysis of the DCN3 SRN model occur when the number of VMs is larger than 3, at which point the complexity could reach hundreds of millions of markings and transitions.

5.2. Sensitivity Analysis

The major purposes of the sensitivity analysis in this study are (i) to optimize the system design and (ii) to pinpoint the bottlenecks regarding availability, performance, and performability of the systems. Therefore, we conduct a variety of parametric sensitivity analyses of the DCN2 and DCN3 SRN models with respect to the major parameters in Table 2. The analysis results are shown in Table 7. We see that the parameters representing the recovery times of hosts and of VMs assume the greatest importance for the steady state availability of both DCN2 and DCN3, since they present the highest absolute sensitivity values; any change in these parameters has a major impact on system availability. The sensitivities with respect to these two parameters are negative, since the smaller the repair times of hosts and VMs, the higher the availability the DCN can achieve. This result reminds the system owner to improve the performance and readiness of the repair and maintenance services in a data center in order to reduce the recovery time of failed components. Nevertheless, comparing DCN2 and DCN3, the absolute value of the sensitivity with respect to the host recovery time is greater for DCN2 than for DCN3, whereas the absolute value of the sensitivity with respect to the VM recovery time is higher for DCN3 than for DCN2. These results imply that, in a DCN with more hosts in a DCell0 and more DCell0 units in the network (DCN3 compared with DCN2), the recovery of the software subsystem (VMs) plays a more important role than the recovery of the hardware subsystem (hosts). Thus, in a DCell-based DCN with a higher number of VMs and DCell0 units, a failure of a host does not cause as significant an impact on the operation of a VM as the failure of the VM itself, since the VMs have more chances to be migrated onto other hosts in other DCell0s. Therefore, a DCN system designer ought to consider the thorough adoption of software fault tolerance for the VMs in a DCN. In Table 7 we also see that the VM image size contributes a significant impact on the system availability. The negative sensitivity values with respect to the VM image size in both DCN2 and DCN3 indicate that a bigger VM image in the storage system causes a declining tendency of system availability, since the VM migration processes between hosts within or between DCell0 units take longer to complete. Furthermore, the sensitivity with respect to the network bandwidth between DCell0s has a higher value than that with respect to the bandwidth within a DCell0, and both are positive. This is to say that an increase in network speed leads to a corresponding increase of system availability, since the time to migrate VMs is reduced. Also, the link bandwidth of the pairs of hosts between DCell0s reveals a more important contribution to the system availability than that of the pairs of hosts within a DCell0. The reason is that the cross-links of the hosts between DCell0s are used to tolerate switch failures (which disconnect the communication between system users and the VMs in a DCell0), so that a VM is migrated from one DCell0 to another upon any failure of the switch in the DCell0. However, the cross-links could cause high complexity of the network routing, and the requirement of high speed links could lead to a large overall system cost. Thus the system designer has to be aware of the trade-offs between system availability and performance and the overall cost of networking.

Figure 7 shows the sensitivity analysis results with respect to the major impacting parameters in both DCN2 and DCN3. The analyses are carried out by varying the value of one parameter of interest while the other parameters remain constant.

Figures 7(a) and 7(b) show the analysis results with respect to the MTTFs of hosts, VMs, and switches. The graphs share several similarities: (i) in the early period ((0, 1000] hours) the system availability increases quickly as the MTTFs increase, and (ii) the system availability increases slowly and approaches a steady value as the MTTFs take greater values in the late period (over 1000 hours). Also, the switches in each DCell0 show a major impact on the system availability. If the MTTF of the switches takes a low value in the early period (the switches fail more frequently), the system availability is pulled down severely in comparison with the sensitivity results with respect to the MTTFs of hosts and VMs. The MTTFs of hosts and VMs contribute only a small impact on system availability in the early period (shown by the steep graphs with circle and star markers) and mostly do not cause a great impact on system availability in the late period (depicted by the approximately horizontal graphs with circle and star markers). This is to say that a DCN is particularly prone to switch failures. Since the switches are the key components connecting a number of physical hosts in the DCell0s, a failure of a switch severely causes a failure of the whole DCell0 (the system user cannot connect to the DCell0).

Figures 7(c) and 7(d) present the results of the availability sensitivity analysis with respect to the MTTRs of hosts, VMs, and switches. The figures clearly reflect the significant impact of the MTTR of the software subsystem (VMs) on the overall system availability. In both DCN2 and DCN3, as the MTTR of the VMs increases, the system availability slides down very quickly, as depicted by the graph marked with stars. This is because the users' applications run on the VMs, so the VMs' up and down states decisively influence the system availability. Moreover, in DCN2, with its smaller number of hosts, an increase in the MTTR of hosts can decrease the system availability, as shown by the graph with circle markers in Figure 7(c). In DCN3, with its larger number of hosts, the MTTR of hosts does not significantly impact the system availability, as shown in Figure 7(d). The reason is that if a host fails, the VMs running on that host can be migrated to other hosts in the same DCell0. In DCN2, with fewer hosts, the longer the repair of hosts takes, the less chance the system has to be available. In DCN3, with more hosts and under the assumption of a highly available system in which a host can be recovered before the last host fails, the MTTR of hosts mostly does not impact the system availability. Finally, the MTTR of switches does not affect the system availability, as depicted by the graphs with triangle markers in Figures 7(c) and 7(d), since, as soon as a switch fails, all VMs running in the DCell0 of that switch are migrated to the other DCell0s.

Figures 7(e) and 7(f) depict the availability sensitivity with respect to the network bandwidths within a DCell0 and between DCell0s. If the network speeds within or between DCell0s surpass a specific value of about 400 Mb/s, the system can achieve high availability. However, if the speeds are slower, the system availability is pulled down quickly. The figures also reflect the importance of the network bandwidth within a DCell0 compared to that between DCell0s. A low value of the network bandwidth within a DCell0 pulls down the system availability more severely (as depicted by the steep slope of the star-marked graph) than a low value of the bandwidth between DCell0s does. The reason is that the connections between hosts in a DCell0 are used to tolerate host failures, which occur more frequently, whereas the connections between hosts of different DCell0s are used to tolerate switch failures, which happen less frequently.

Figure 7(g) shows the availability sensitivity with respect to the VM image size. Under the default parameter values, the size of the VM image files affects the system availability negatively: as the size increases, the system availability slides down quickly. Furthermore, the VM image size has a greater influence on the system availability in the DCN with fewer hosts in a DCell0 (DCN2) than in the DCN with more hosts (DCN3). A bigger VM image pulls down the system availability more quickly in DCN2 than in DCN3, as depicted by the smaller slope of the star-marked graph (for DCN3) compared to that of the circle-marked graph (DCN2). This result implies that, in a DCell-based DCN with a higher number of physical hosts in a DCell0, the system has a better ability to tolerate hardware and software failures and is thus able to handle bigger VM image files.

5.3. Discussion

A practical DCell-based DCN system comprises tens to hundreds of thousands of hardware components (hosts, switches, links, etc.) and thus hosts an enormous number of VMs in a very complicated topology of networking and routing, as described in [1, 4]. An effort to model and analyze such a DCN system is critically important to help provide a guiding basis for the design and management of both hardware and software subsystems. We find this a fruitful topic for further work on system scalability. Nevertheless, the endeavor to build a complete and monolithic model capturing the whole system's behaviors confronts the largeness problem (also known as state-space explosion) in modeling. To deal with this issue, one may adopt different modeling techniques and methodologies such as state truncation [61], state aggregation [62], model decomposition [63, 64], state exploration [65, 66], and model composition [67, 68]. Other methodologies have also been adopted widely in the literature and are appropriate for dealing with the scalability and largeness problems of modeling a large DCN system, such as (i) hierarchical models, which partition a complex model into a hierarchy of submodels [69] or combine combinatorial models and state-space models [70–72], (ii) interacting models [22, 73, 74], which divide a large monolithic model into a number of smaller scale models with comprehensive interactions and dependencies, (iii) fixed-point iterative models [75], and (iv) discrete-event simulation [76]. Thus there is a broad future research avenue for scaling up the system configuration and resolving the largeness problem in modeling a DCN system. In this paper, our focus is on the system's capability of fault tolerance and business continuity through availability modeling and analysis. We have shown that the DCell-based DCN can deliver higher availability to assure business continuity even in the presence of severe component failures. Nevertheless, it is also necessary to observe and study the system from different perspectives, including, for instance, reliability [77], survivability [78], and performability [79]. These topics remain open for future work.

6. Conclusion

This paper has presented a comprehensive availability modeling and sensitivity analysis of a DCell-based DCN. Our work studied two typical DCell configurations of the DCN, comprising two and three hosts in a DCell0, respectively. Our focus is on the fault tolerance capability and business continuity of the DCNs; thus the VM live migration mechanisms are incorporated in detail to tolerate failures of switches and hosts. The modeling captured the distributed fault tolerant routing protocol designed into the system architectures. A variety of analyses were carried out thoroughly with respect to different measures of interest. The steady state availability analyses have shown that the DCell-based DCNs can assure HA and business continuity, tolerate hardware failures of switches and hosts, and vastly enhance the system's overall availability. Increasing the number of VMs in a DCN slightly improves the system availability but causes high complexity and largeness problems in modeling and analysis. Comprehensive sensitivity analyses of the system's steady state availability were also performed in order to observe the system's characteristics and behaviors upon changes of the major impacting parameters. The sensitivity analysis results have pointed out that (i) recovery actions of hosts and VMs are significantly important to mitigate system downtime, (ii) recovery actions of the software subsystem (VMs) in a DCell-based DCN have a major impact on system availability in comparison with those of the hardware subsystems (hosts and switches), and (iii) the network bandwidth of the links between DCell0s is a critical parameter for obtaining and maintaining high availability of the system. This study provides a guiding basis to help manage and operate a DCN in data centers in terms of (i) maintenance and repair readiness, (ii) awareness of software fault tolerance in DCNs, and (iii) a selection basis among network performance, availability, and cost to avoid potential risks as well as to tolerate faults.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2015-H8501-15-1011) supervised by the IITP (Institute for Information & Communications Technology Promotion). The authors would like to thank Professor Dr. Kishor Trivedi, Professor of Duke University, United States, for providing SPNP.