Network survivability—the ability to maintain operation when one or a few network components fail—is indispensable for present-day networks. In this paper, we characterize three main components in establishing network survivability for an existing network, namely, (1) determining network connectivity, (2) augmenting the network, and (3) finding disjoint paths. We present a concise overview of network survivability algorithms, where we focus on presenting a few polynomial-time algorithms that could be implemented by practitioners and give references to more involved algorithms.

1. Introduction

Given the present-day importance of communications systems and infrastructures in general, networks should be designed and operated in such a way that failures can be mitigated. Network nodes and/or links might for instance fail due to malicious attacks, natural disasters, unintentional cable cuts, planned maintenance, equipment malfunctioning, and so forth. Resilient, fault tolerant, survivable, reliable, robust, and dependable are different terms that have been used by the networking community to capture the ability of a communications system to maintain operation when confronted with network failures. Unfortunately, the terminology has overlapping meanings or contains ambiguities, as pointed out by Al-Kuwaiti et al. [1]. In this paper, we will use the term survivable networks to refer to networks that, when a component fails, may “survive” by finding alternative paths that circumvent the failed component. Three ingredients are needed to reach survivability.
(1) Network connectivity, that is, the network should be well connected (connectivity properties are discussed in Section 1.1).
(2) Network augmentation, that is, new links may need to be added to increase the connectivity of a network.
(3) Path protection, that is, a procedure to find alternative paths in case of failures.

These three ingredients will be explained in the following sections.

1.1. Network Connectivity

A network is often represented as a graph $G(\mathcal{N}, \mathcal{L})$, where $\mathcal{N}$ is the set of $N$ nodes (which for instance represent routers) and $\mathcal{L}$ is the set of $L$ links (which for instance represent optical fiber lines or radio channels). Links may be characterized by weights representing for instance their capacity, delay, length, cost, and/or failure probability. A graph is said to be connected if there exists a path between each pair of nodes in the graph, else the graph is said to be disconnected. In the context of survivability, the notion of connectivity may be further specified as $k$-connectivity, where at least $k$ disjoint paths exist between each pair of nodes. Depending on whether these paths are node or link disjoint, we may discriminate between node and link connectivity. The link connectivity $\lambda(G)$ of a graph $G$ is the smallest number of links whose removal disconnects $G$. Correspondingly, the node connectivity $\kappa(G)$ of a graph $G$ is the smallest number of nodes whose removal disconnects $G$.

In 1927, Menger provided a theorem [2]—in German—that could be interpreted as follows.

Theorem 1 (Menger's theorem). The maximum number of link/node-disjoint paths between A and B is equal to the minimum number of links/nodes that would separate A and B.

Menger's theorem clearly relates to the link/node connectivity of a graph, in the sense that a $k$-link/node-connected graph has at least $k$ link/node-disjoint paths between any pair of nodes in the graph. The minimum number of links/nodes separating two nodes or sets of nodes is referred to as a minimum cut. In order to assess the link/node connectivity of a network, we therefore need to find its minimum cut.

A somewhat less intuitive notion of connectivity stems from the spectrum of the Laplacian matrix of a graph and is denoted as algebraic connectivity. The algebraic connectivity was introduced by Fiedler in 1973 [3] and is defined as follows.

Definition 2 (algebraic connectivity). The algebraic connectivity $\alpha(G)$ equals the value of the second smallest eigenvalue of $Q$, where $Q$ is the Laplacian matrix $Q = \Delta - A$, with $A$ an $N \times N$ adjacency matrix with elements $a_{ij} = 1$ if there is a link between nodes $i$ and $j$, else $a_{ij} = 0$, and $\Delta$ an $N \times N$ diagonal matrix with $\Delta_{ii}$ the degree of node $i$.

The algebraic connectivity has many interesting properties that characterize how strongly a graph is connected (e.g., see [4, 5]). Moreover, the multiplicity of the smallest eigenvalue (of value 0) of the Laplacian $Q$ is equal to the number of components in the graph $G$. Hence, if the algebraic connectivity is larger than 0, the network is connected, else the algebraic connectivity is 0, and the network is disconnected. For noncomplete graphs, we have that $\alpha(G) \le \kappa(G) \le \lambda(G) \le \delta_{\min}(G)$, where $\delta_{\min}(G)$ is the minimum degree in the network. For ease of notation, when $G$ is not specified, we use $\alpha$, $\kappa$, $\lambda$, and $\delta_{\min}$.
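
As a small worked example (our own illustration, not taken from the cited references), consider a ring of four nodes. Its Laplacian and spectrum can be written out by hand, which shows how the algebraic connectivity compares with the node connectivity, the link connectivity, and the minimum degree in a simple case.

```latex
Q = \Delta - A =
\begin{pmatrix}
 2 & -1 &  0 & -1 \\
-1 &  2 & -1 &  0 \\
 0 & -1 &  2 & -1 \\
-1 &  0 & -1 &  2
\end{pmatrix},
\qquad
\text{eigenvalues of } Q:\; 0,\ 2,\ 2,\ 4 .
```

The second smallest eigenvalue is 2, so the algebraic connectivity of the ring is 2. Removing any two nodes or any two suitably chosen links disconnects the ring, and every node has degree 2, so in this example the algebraic connectivity, node connectivity, link connectivity, and minimum degree all equal 2.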

Connectivity properties may be less obvious when applied to multilayered networks [6, 7], like IP over WDM networks, where a 2-link-connected IP network operated on top of an optical WDM network, with multiple IP links sharing (e.g., groomed on) the same WDM link, could still be disconnected by a single-link failure at the optical layer.

In probabilistic networks, links and/or nodes $x$ are available with a certain probability $p_x$, which is often computed as $p_x = \mathrm{MTTF}_x / (\mathrm{MTTF}_x + \mathrm{MTTR}_x)$, with $\mathrm{MTTF}_x$ the mean time to failure of $x$ and $\mathrm{MTTR}_x$ the mean repair time of $x$. Often the term network availability is used to denote the probability that the network is connected (e.g., see [8]). When the node probabilities are all one and all the link probabilities are independent and of equal value $p$, then a reliability polynomial (a special case of the Tutte polynomial, e.g., see [9]) is a polynomial function in $p$ that gives the probability that the network remains connected after its links fail with probability $1 - p$.
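
The two quantities above can be illustrated with a minimal Python sketch. It assumes independent link failures, perfectly reliable nodes, and a graph small enough to enumerate all link subsets; the function names `availability`, `is_connected`, and `reliability` are our own and hypothetical. The enumeration is exponential in the number of links, so this is an illustration of the definition rather than a practical method.

```python
from itertools import combinations


def availability(mttf, mttr):
    """Availability of a component from its mean time to failure and mean repair time."""
    return mttf / (mttf + mttr)


def is_connected(nodes, links):
    """Return True if the undirected graph (nodes, links) is connected."""
    nodes = list(nodes)
    adj = {n: set() for n in nodes}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(nodes)


def reliability(nodes, links, p):
    """Probability that the graph stays connected when every link is
    independently available with probability p (brute-force enumeration)."""
    total = 0.0
    for k in range(len(links) + 1):
        for up in combinations(links, k):          # the links that survive
            if is_connected(nodes, up):
                total += p ** k * (1 - p) ** (len(links) - k)
    return total
```

For a triangle, the sum collapses to $3p^2(1-p) + p^3$, which the sketch reproduces numerically.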

1.2. Network Augmentation

The outcome of testing for network connectivity could be that the network is not sufficiently robust (connected). Possibly, rewiring the (overlay) network could improve its robustness properties [10]. However, this is more involved when applied to the physical network, and improving network performance or network robustness is therefore often established by adding new links and possibly also nodes to the network. Adding links or nodes can be costly (which could be reflected by link/node weights), and the new links/nodes should therefore be placed wisely, such that the desired network property is obtained with the smallest number of links/nodes or such that the addition of a fixed number of links/nodes maximizes the desired network property. This class of problems is referred to as (network) augmentation problems, and within this class the problems only differ in their objectives. For instance, $k$-connectivity is an important property in the context of network robustness, and reaching it through link additions might be one such objective. The alternative objective of algebraic connectivity augmentation leads to an NP-hard problem [11]. Similarly, adding a minimum number of links to make a graph chordal is also NP-hard [12] (a graph is chordal if each of its cycles of four or more nodes has a link connecting two nonadjacent nodes in the cycle).

1.3. Path Protection

Network protocols like OSPF are deployed in the Internet to obtain a correct view of the topology and, in case of changes (like the failure of a link), to converge the routing towards the new (perturbed) situation. Unfortunately, this process is not fast, and applications may still face unacceptable disruptions in performance. In conjunction with MPLS, an MPLS fast reroute mechanism can be used that, as the name suggests, provides the ability to switch over in subsecond time from a failed primary path to an alternate (backup) path. This fast reroute mechanism is specified in RFC 4090 [13], May 2005, and has already been implemented by several vendors. The concept has also been extended to pure IP networks and is referred to as IP fast reroute [14]. RFC 4090 defines RSVP-TE extensions to establish backup label-switched path (LSP) tunnels for local repair of LSP tunnels. The backup path can be configured to protect against either a link or a node failure. Since the backup paths are precomputed, no time is lost in computing backup paths or performing signalling in the event of a failure. The fast reroute mechanism as described in RFC 4090 assumes that MPLS (primary and backup) paths are computed and explicitly routed by the network operator. Hence, there is a strong need for efficient algorithms to compute disjoint paths.

Depending on whether backup paths are computed before or after a failure of the primary path, survivability techniques can be broadly classified into restoration or protection techniques.
(i) Protection scheme: protection is a proactive scheme, where backup paths are precomputed and reserved in advance. In 1:1 protection, traffic is rerouted along the backup path upon the failure of the primary path. In 1+1 protection, the data is duplicated and sent concurrently over the primary and backup paths.
(ii) Restoration scheme: restoration is a reactive mechanism that handles a failure after it occurs. Thus, the backup path is not known a priori. Instead, a backup path is computed only after the failure in the primary path is sensed.

In general, protection has a shorter recovery time since the backup path is precomputed, but it is less efficient in terms of capacity utilization and less flexible. Restoration, on the other hand, provides increased flexibility and efficient resource utilization, but it may take a longer time for recovery, and there is no guarantee that a backup path will be found. As a compromise between the two schemes, Banner and Orda [15] considered designing a low-capacity backup network (using spare capacity or by installing new resources) that is fully provisioned to reroute traffic on the primary network in case of a failure. The backup network itself is not used to transport “primary” traffic. Backup networks with specific topological features have also been addressed in the literature, for instance protection [16] and preconfigured [17] cycles or redundant trees [18].

Depending on how rerouting is done after a failure in the primary path, there are three categories of survivability techniques.
(i) Path-based protection/restoration: in path-based protection, a link- or node-disjoint backup path is precomputed and takes over when the primary path fails. In path-based restoration, a new path is computed between the source and destination nodes of the failed path. If such a backup path cannot be found, the request is blocked.
(ii) Link-based protection/restoration: in link-based protection, each link is preassigned a local route that is used when the link fails, and in link-based restoration, the objective is to compute a detour between the two ends of the failed link for all paths that are using the link. Since link-based protection/restoration requires signaling only between the two ends of the failed link, it has a smaller recovery time than path-based protection/restoration, which requires end-to-end signaling between the source and destination nodes.
(iii) Segment-based protection/restoration: the segment-based scheme (e.g., see [19]) is a compromise between path-based and link-based schemes. Thus, in segment-based protection, backup routes are precomputed for segments of the primary path. In segment-based restoration, a detour of the segment containing the failed link is computed following a failure.

Depending on whether sharing of resources is allowed among backup paths, protection schemes can be of two types:
(i) Dedicated protection: in this scheme, resources (e.g., links, wavelength channels, etc.) are not shared among backup paths and are exclusively reserved for a given path request.
(ii) Shared protection: in this scheme, backup paths may share resources as long as their primary paths do not share links. In $m:n$ protection, $m$ backup paths are used to protect $n$ primary paths. The shared scheme provides a better resource utilization; however, it is more complicated and requires more information, such as the shareability of each link.

In general, path protection requires less capacity than link protection, while shared protection requires less capacity than dedicated protection. However, path protection is more vulnerable to multiple link failures than link protection, and so is shared protection compared to dedicated protection.

1.4. Paper Outline and Objective

The remainder of this paper is structured as follows. In Section 2, we give an overview of several methods for determining the connectivity properties of a network. In case a network is found to be insufficiently connected from a survivability perspective, links may have to be added to increase the connectivity. In Section 3, we list key results in network connectivity augmentation. Once a network is designed to withstand some failures, proper path protection/restoration schemes should be in place that can quickly divert traffic to alternate routes in case of a failure. In Section 4, we survey work on finding disjoint paths in a network. We conclude in Section 5.

Throughout the paper, the objective is not to list and explain all the relevant algorithms. Rather, we aim to briefly explain some fundamental concepts and some polynomial-time algorithms that could easily be deployed by practitioners or which can be (and have been) used as building blocks for more advanced algorithms, and to provide pointers to further reading.

2. Determining Network Connectivity

In Section 1.1, we indicated that Menger's theorem implies that finding a minimum cut corresponds to finding the connectivity of a network. In this section, we will look further at finding cuts in a network.

Definition 3 (link (edge) cut). A link cut refers to a set of links whose removal separates the graph into two disjoint subgraphs, and where all links in the removed cut-set have an end-point in both subgraphs.

The two subgraphs need not be connected themselves.

Definition 4 (node (vertex) cut). A node cut refers to a set of nodes whose removal separates the graph into two disjoint subgraphs, and where all nodes in the removed cut-set have at least one adjacent link to both subgraphs.

Definition 5 (minimum link/node cut). A minimum cut is a cut whose cardinality is not larger than that of any other cut in the network.

Definitions for a cut also have a variant in which a source node and a terminating node need to be separated.

Definition 6 ($s$-$t$ cut). An $s$-$t$ cut refers to a cut that separates two nodes $s$ and $t$ in the graph such that both belong to different subgraphs.

Often, when referring to a cut, a link cut is meant. In the remainder of this paper, we will use the same convention and only specify the type of cut for node cuts.

Definition 7 (maximum cut). A maximum cut is a cut whose cardinality is not exceeded by that of any other cut in the network.

Definition 8 (sparsest cut). The sparsest cut (sometimes also referred to as the (Cheeger) isoperimetric constant) is a cut for which the ratio of the number of links in the cut-set divided by the number of nodes in the smaller subgraph is not larger than that of any other cut in the network.

Finding a maximum or sparsest cut is a hard problem (the maximum-cut problem is APX-hard [20] and the sparsest-cut problem is NP-hard [21, 22]), but fortunately a minimum cut, and consequently the network's connectivity, can be computed in polynomial time as will be indicated below. The algebraic connectivity can be used to approximate the sparsest cut [4, 21]. Dinh et al. [23] investigated the notion of pairwise connectivity (the number of connected pairs, which bears similarities to the sparsest-cut problem), and proved that finding the smallest set of nodes/links whose removal degrades the pairwise connectivity to a certain degree is NP-complete.
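
To make the three cut notions concrete, the following Python sketch enumerates every bipartition of a very small graph and reports the minimum cut, the maximum cut, and the sparsest-cut ratio. It is a brute-force illustration under our own assumptions (an undirected graph given as a link list, function names `cut_size`, `all_cuts`, and `cut_summary` are hypothetical), and its running time is exponential in the number of nodes.

```python
from itertools import combinations


def cut_size(links, side):
    """Number of links with exactly one end-point in `side`."""
    return sum((u in side) != (v in side) for u, v in links)


def all_cuts(nodes, links):
    """Yield (side, size) for every bipartition, each bipartition exactly once."""
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    for k in range(len(rest)):            # `side` contains `first` and never all nodes
        for extra in combinations(rest, k):
            side = {first, *extra}
            yield side, cut_size(links, side)


def cut_summary(nodes, links):
    """Return (minimum cut, maximum cut, sparsest-cut ratio)."""
    n = len(list(nodes))
    cuts = list(all_cuts(nodes, links))
    min_cut = min(size for _, size in cuts)
    max_cut = max(size for _, size in cuts)
    # ratio of cut links to the number of nodes in the smaller subgraph
    sparsest = min(size / min(len(side), n - len(side)) for side, size in cuts)
    return min_cut, max_cut, sparsest
```

For a 4-node ring, the minimum cut has 2 links, the maximum cut has 4 links (alternating nodes), and the sparsest cut splits the ring into two adjacent pairs with ratio 1.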

2.1. Determining Link Connectivity

In the celebrated paper from Ford and Fulkerson [24] (and independently by Elias et al. [25]) a maximum flow from a source $s$ to a terminal $t$ in a network, where the links have a given capacity, is shown to be equal to the minimum-weight $s$-$t$ link cut in that network, where the weight of the cut is the sum of the capacities of the links in the cut-set; this is the so-called max-flow min-cut theorem. By using a max-flow algorithm and setting the capacity of all links to 1, one can therefore compute the minimum $s$-$t$ link cut, or the minimum link cut when repeated over all possible $s$-$t$ pairs. It is not our goal to overview all maximum-flow algorithms (an excellent treatment of the subject is presented in the book by Ahuja et al. [26]), but we will present Dinitz' algorithm, which can be used to determine the minimum $s$-$t$ link cut in $O(\min\{N^{2/3}, L^{1/2}\} L)$ time. We will subsequently present the algorithm of Matula for determining the minimum link cut in $O(NL)$ time.

2.1.1. Dinitz' Algorithm

Dinitz' algorithm, published in 1970 by Yefim Dinitz, was the first maximum-flow algorithm to run in polynomial time (contrary to the pseudopolynomial running time of the Ford-Fulkerson algorithm [24]). The algorithm is sometimes referred to as Dinic's or Dinits' algorithm, and also different variants are known. A historical perspective of the different variants is presented by Dinitz himself in [27]. In order to describe Dinitz' algorithm, as presented in Algorithm 1, some definitions are given.

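Algorithm 1: Dinitz($G$, $s$, $t$).
(1) $f(u,v) \leftarrow 0$ for all $(u,v) \in \mathcal{L}$ and $G_f \leftarrow G$ /*Initialize to zero flow*/
(2) While true /*loop until the algorithm terminates in line 8*/
(3)   For all nodes $u \in \mathcal{N}$, compute in $G_f$ the hopcount $h(u)$ to $t$ /*by Breadth-First-Search [28] from $t$*/
(4)   Compute a blocking flow $f'$ in $G_f$, thereby skipping links $(u,v)$ for which $h(u) \neq h(v) + 1$
(5)   If $f'$ exists
(6)     Update $f \leftarrow f + f'$ and the residual graph $G_f$
(7)   else
(8)     return $f$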

Definition 9. The residual capacity $c_f$ of a link $(u,v) \in \mathcal{L}$ is interpreted in two directions as follows:
$$c_f(u,v) = c(u,v) - f(u,v), \qquad c_f(v,u) = f(u,v),$$
where the flow $f(u,v)$ over a link $(u,v)$ cannot exceed the capacity $c(u,v)$ of that link.

Definition 10. The residual graph $G_f$ of $G$ is the graph in which a directed link $(u,v)$ exists if $c_f(u,v) > 0$.

Definition 11. A blocking flow $f'$ is an $s$-$t$ flow such that any other $s$-$t$ flow would have to traverse a link already saturated by $f'$.

A blocking flow could be obtained by repeatedly finding (via Depth-First-Search [28]) an augmenting flow along an $s$-$t$ path (or pruning the path from the graph in unit-capacity networks). In unit-capacity networks, the algorithm runs in $O(\min\{N^{2/3}, L^{1/2}\} L)$ time, which therefore also is the time complexity to determine a minimum $s$-$t$ link cut with Dinitz' algorithm (for unit node capacities, a complexity of $O(N^{1/2} L)$ can be obtained [29]).
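
The phases of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than a tuned implementation: it assumes nodes are numbered 0 to n-1, treats every undirected link as a pair of opposite arcs with the same capacity, and uses recursion for the depth-first search. The class name `UnitFlowNetwork` and the helper `st_link_cut_size` are our own, hypothetical names.

```python
from collections import deque


class UnitFlowNetwork:
    """Residual network stored as paired arcs: arc i and arc i ^ 1 are opposite."""

    def __init__(self, n):
        self.adj = [[] for _ in range(n)]   # adj[u] = indices of arcs leaving u
        self.head = []                      # head[i] = node that arc i points to
        self.cap = []                       # cap[i] = residual capacity of arc i

    def add_link(self, u, v, capacity=1):
        """Undirected link: the capacity is available in both directions."""
        for a, b in ((u, v), (v, u)):
            self.adj[a].append(len(self.head))
            self.head.append(b)
            self.cap.append(capacity)

    def hopcounts(self, t):
        """Breadth-first search from t: h[u] = hops from u to t in the residual graph."""
        h = [-1] * len(self.adj)
        h[t] = 0
        queue = deque([t])
        while queue:
            v = queue.popleft()
            for i in self.adj[v]:
                u = self.head[i]
                if h[u] < 0 and self.cap[i ^ 1] > 0:   # residual arc u -> v exists
                    h[u] = h[v] + 1
                    queue.append(u)
        return h

    def push(self, u, t, limit, h, nxt):
        """Depth-first search for one path on which h decreases by 1 per hop."""
        if u == t:
            return limit
        while nxt[u] < len(self.adj[u]):
            i = self.adj[u][nxt[u]]
            v = self.head[i]
            if self.cap[i] > 0 and h[v] >= 0 and h[u] == h[v] + 1:
                pushed = self.push(v, t, min(limit, self.cap[i]), h, nxt)
                if pushed > 0:
                    self.cap[i] -= pushed
                    self.cap[i ^ 1] += pushed
                    return pushed
            nxt[u] += 1                     # this arc is useless in the current phase
        return 0

    def max_flow(self, s, t):
        flow = 0
        while True:
            h = self.hopcounts(t)
            if h[s] < 0:                    # t is unreachable: no blocking flow exists
                return flow
            nxt = [0] * len(self.adj)
            while True:                     # build a blocking flow path by path
                pushed = self.push(s, t, float("inf"), h, nxt)
                if pushed == 0:
                    break
                flow += pushed


def st_link_cut_size(n, links, s, t):
    """Size of a minimum s-t link cut in an undirected graph with unit capacities."""
    net = UnitFlowNetwork(n)
    for u, v in links:
        net.add_link(u, v)
    return net.max_flow(s, t)
```

By the max-flow min-cut theorem, the returned flow value equals the size of a minimum $s$-$t$ link cut when all capacities are 1.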

For further reference, in Table 1, we present some key achievements in computing minimum - link cuts.

2.1.2. Matula's Algorithm

In this section, we describe the algorithm from Matula [43] for determining the link connectivity of an undirected network. Matula's algorithm is based on the following lemma.

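Lemma 12. Let $G$ be a graph with a minimum cut of size $\lambda < \delta_{\min}$ that partitions the graph into two subgraphs $G_1$ and $G_2$, then any dominating set $\mathcal{D}$ of $G$ contains nodes of both $G_1$ and $G_2$ (a dominating set $\mathcal{D}$ is a subset of the nodes in $G$, such that every node in $G$ is either in $\mathcal{D}$ or adjacent to a node in $\mathcal{D}$).

Proof. For subgraph $G_i$, $i = 1, 2$, with $N_i$ nodes, it holds that the sum of the nodal degrees $d(n)$ in $G_i$ is bounded by
$$\delta_{\min} N_i \le \sum_{n \in G_i} d(n) \le N_i (N_i - 1) + \lambda. \tag{3}$$
The upper bound occurs if all nodes in $G_i$ are connected to each other and some of the nodes have a link that is part of the cut-set. The lower bound stems from each node having a degree larger than or equal to the minimum degree $\delta_{\min}$. From the bounds in (3), we can derive that
$$(N_i - 1)(N_i - \delta_{\min}) \ge \delta_{\min} - \lambda.$$
Since $\lambda < \delta_{\min}$ is assumed, $(N_i - 1)(N_i - \delta_{\min}) > 0$, and consequently both factors on the left-hand side cannot be smaller than 1. Hence, $N_i > \delta_{\min} > \lambda$, which means that, under the assumption that $\lambda < \delta_{\min}$, there is at least one node in $G_1$ that does not have a neighbor in $G_2$ (and vice versa). In other words, any dominating set of $G$ should contain nodes of both $G_1$ and $G_2$.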

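The algorithm of Matula (see Algorithm 2) starts with a node of minimum degree (e.g., node $u$ in $G_1$) and gradually builds a dominating set $\mathcal{D}$ by adding nodes that are not yet part of or adjacent to the growing set. Since at some point a node, for example $v$, from $G_2$ needs to be added, keeping track of the minimum cut between each newly added dominating node and $\mathcal{D}$ will result in finding the overall minimum cut. The algorithm is presented below.

Algorithm 2: Matula($G$).
(1) For a node $u$ of minimum degree set $\mathcal{D} \leftarrow \{u\}$, $\mathcal{T} \leftarrow \{u\} \cup \mathrm{adj}(u)$, $\mathcal{U} \leftarrow \mathcal{N} \setminus \mathcal{T}$, and $\lambda \leftarrow \delta_{\min}$ /*$\mathcal{T}$ holds the nodes in or adjacent to $\mathcal{D}$*/
(2) While $\mathcal{U} \neq \emptyset$
(3)   Choose $v \in \mathcal{U}$
(4)   $\lambda' \leftarrow$ the number of shortest augmenting paths from $v$ to $\mathcal{D}$
(5)   If $\lambda' < \lambda$ then $\lambda \leftarrow \lambda'$
(6)   Set $\mathcal{D} \leftarrow \mathcal{D} \cup \{v\}$, $\mathcal{T} \leftarrow \mathcal{T} \cup \{v\} \cup \mathrm{adj}(v)$, followed by $\mathcal{U} \leftarrow \mathcal{N} \setminus \mathcal{T}$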

In the algorithm of Matula, an augmenting path is a path in the residual network, where a residual network is the network that remains after pruning the links of a previous augmenting path. There are no 1-hop paths from $v$ to $\mathcal{D}$, because then $v$ would already be adjacent to $\mathcal{D}$ and would not be chosen. If $v$ has $k$ neighbors that are adjacent to $\mathcal{D}$, then there exist $k$ 2-hop paths from $v$ to $\mathcal{D}$, for which either the first hop from $v$ to such a neighbor or the second hop from that neighbor to $\mathcal{D}$ is part of the minimum cut. These paths form the first $k$ augmenting paths, after which at most $\lambda - k$ augmenting paths remain to be found. These remaining augmenting paths can be found in $O(L)$ time each and since there are at most $N$ such paths over the whole run, the complexity of the algorithm is bounded by $O(NL)$. Finally, if $\lambda = \delta_{\min}$, then the initialization guarantees that that value would be found.
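
The dominating-set idea behind Algorithm 2 can be sketched in Python as below. This is a simplified illustration under our own assumptions: the graph is simple and undirected, it is given as a dictionary from node to neighbor set, and augmenting paths are found by plain breadth-first search without the 2-hop shortcut, so the sketch does not attain the time bound discussed above. The names `adjacency`, `paths_to_set`, and `link_connectivity` are hypothetical.

```python
from collections import deque


def adjacency(links):
    """Build {node: set of neighbours} from a list of undirected links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def paths_to_set(adj, v, targets, limit):
    """Number of link-disjoint paths from v to the node set `targets`,
    capped at `limit`, found with BFS augmenting paths on unit capacities."""
    cap = {(a, b): 1 for a in adj for b in adj[a]}   # both directions of each link
    found = 0
    while found < limit:
        parent = {v: None}
        queue = deque([v])
        end = None
        while queue and end is None:
            a = queue.popleft()
            for b in adj[a]:
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    if b in targets:        # reached the (contracted) dominating set
                        end = b
                        break
                    queue.append(b)
        if end is None:
            break
        b = end
        while parent[b] is not None:        # push one unit of flow along the path
            a = parent[b]
            cap[(a, b)] -= 1
            cap[(b, a)] += 1
            b = a
        found += 1
    return found


def link_connectivity(adj):
    """Link connectivity of a simple undirected graph {node: neighbour set}."""
    u = min(adj, key=lambda n: len(adj[n]))
    lam = len(adj[u])                       # initialise with the minimum degree
    dom = {u}                               # growing dominating set
    covered = {u} | adj[u]                  # nodes in or adjacent to the set
    while len(covered) < len(adj):
        v = next(n for n in adj if n not in covered)
        lam = min(lam, paths_to_set(adj, v, dom, lam))
        dom.add(v)
        covered |= {v} | adj[v]
    return lam
```

Capping the search at the current value of the link connectivity is sufficient, because only a smaller value can change the result.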

For directed multigraphs, Shiloach [44] provided a theorem that is stronger than Menger's theorem, namely the following.

Theorem 13. Let $G$ be a directed $k$-link-connected multigraph, then for all $s_1, \ldots, s_k, t_1, \ldots, t_k \in \mathcal{N}$ (not necessarily distinct) there exist link-disjoint paths $P_i$ from $s_i$ to $t_i$ for $i = 1, \ldots, k$.

We refer to Mansour and Schieber [45] for an $O(NL)$-time algorithm for determining the link connectivity in directed networks.

For further reference, in Table 2 we present some key achievements in computing minimum link cuts.

2.2. Determining Node Connectivity

Maximum-flow algorithms can also be used to determine the node connectivity, as demonstrated by Dantzig and Fulkerson [46] (and also discussed in [47]), by transforming the undirected graph $G$ to a directed graph $G'$ as follows.

For every node $n \in \mathcal{N}$ place two nodes $n'$ and $n''$ in $G'$ and connect them via a directed link $(n', n'')$, using the convention that the link starts at $n'$ and ends at $n''$. For every undirected link $(u, v) \in \mathcal{L}$ place directed links $(u'', v')$ and $(v'', u')$ in $G'$. All links are assigned unit capacity.

The $s$-$t$ node connectivity in $G$ can be computed by finding a maximum flow from $s''$ to $t'$ in $G'$. This can be seen as follows. Assume that there are $k$ node-disjoint paths between $s$ and $t$, then there are also $k$ corresponding node-disjoint paths from $s''$ to $t'$ in $G'$. Since each link has unit capacity, there thus exists a flow of at least $k$. Since all flow entering a node $n'$ has to traverse the single unit-capacity link $(n', n'')$, at most one unit of flow can pass through a node, which corresponds to a node-disjoint path. Since there are only $k$ node-disjoint paths, the maximum flow in $G'$ is equal to $k$.
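
The node-splitting transformation can be sketched in Python as follows. The sketch labels the two copies of a node `n` as `(n, "in")` and `(n, "out")`, which are our own hypothetical names for the two nodes created per original node, and it uses a plain BFS augmenting-path maximum flow instead of Dinitz' algorithm to keep the example short.

```python
from collections import deque


def split_nodes(links):
    """Directed unit-capacity arcs of the transformed graph.
    Node n becomes (n, 'in') -> (n, 'out'); link (u, v) becomes two arcs."""
    arcs = set()
    for n in {n for link in links for n in link}:
        arcs.add(((n, "in"), (n, "out")))
    for u, v in links:
        arcs.add(((u, "out"), (v, "in")))
        arcs.add(((v, "out"), (u, "in")))
    return arcs


def max_flow_unit(arcs, s, t):
    """Maximum s-t flow with unit capacities via BFS augmenting paths."""
    cap, adj = {}, {}
    for a, b in arcs:
        cap[(a, b)] = cap.get((a, b), 0) + 1
        cap.setdefault((b, a), 0)           # residual arc in the opposite direction
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            a = queue.popleft()
            for b in adj.get(a, ()):
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    queue.append(b)
        if t not in parent:
            return flow
        b = t
        while parent[b] is not None:        # push one unit along the path found
            a = parent[b]
            cap[(a, b)] -= 1
            cap[(b, a)] += 1
            b = a
        flow += 1


def st_node_connectivity(links, s, t):
    """Number of node-disjoint s-t paths in the undirected graph `links`."""
    return max_flow_unit(split_nodes(links), (s, "out"), (t, "in"))
```

Two triangles that share a single node illustrate the difference with link connectivity: there are two link-disjoint paths between nodes in different triangles, but only one node-disjoint path, because every path passes through the shared node.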

By using Dinitz' algorithm, one may compute the $s$-$t$ node connectivity in $O(\sqrt{N} L)$ time, and the algorithm of Mansour and Schieber [45] can be used to determine the node connectivity of the whole network. We refer to Henzinger et al. [48] and Gabow [49] for more advanced algorithms to compute the node connectivity in directed and undirected graphs and to Yoshida and Ito [50] for a $k$-node-connectivity property testing algorithm (in property testing the objective is to decide, with high probability, whether the input graph is close to having a certain property; these algorithms typically run in sublinear time).

3. Network Connectivity Augmentation

In the previous section, we have provided an overview of several algorithms to determine the connectivity of a network. In this section, we will overview several network augmentation algorithms that can be deployed to increase the connectivity (or some other metric) of a network by adding links. Network augmentation problems seem closely related to network deletion problems (e.g., see [51]), where the objective is to remove links in order to reach a certain property. However, there may be significant differences in terms of complexity. For instance, finding a minimum-weight set of links to cut a -link-connected graph such that its connectivity is reduced to is solvable in polynomial time (as discussed in Section 2.1), while adding a minimum-weight set of links to increase a disconnected graph to -link-connectivity is NP-complete as shown in Section 3.1. When both link deletions and link additions are permitted, we speak of link modification problems, for example, see [52].

3.1. Link Connectivity Augmentation

In this section, we consider the following link augmentation problem.

Problem 1 (the link connectivity augmentation (LCA) problem). Given a graph $G(\mathcal{N}, \mathcal{L})$ consisting of $N$ nodes and $L$ links, link connectivity $\lambda$ and an integer $\delta$, the link connectivity augmentation problem is to add a minimum-weight set of links, such that the link connectivity of the graph $G$ is increased from $\lambda$ to $\lambda + \delta$.

We can discriminate several variants based on the graph (directed, simple, planar, etc.) or if link weights are used or not (i.e., in the unweighted case all links have weight 1). Let us start with the weighted link connectivity augmentation problem.

Theorem 14. The weighted LCA problem is NP-hard.

We will use the proof due to Frederickson and JáJá [53] to show that the 3-dimensional matching (3DM) problem is reducible to the weighted LCA problem. An earlier proof has been provided by Eswaran and Tarjan [54], but since it aims to augment a network without any links to one that is 2-connected and has $N$ links (a cycle), it has the characteristics of a design rather than an augmentation problem.

Problem 2 (3-dimensional matching (3DM)). Given a set $M \subseteq X \times Y \times Z$ of triplets, where $X$, $Y$, and $Z$ are disjoint sets of $q$ elements each, is there a matching subset $M' \subseteq M$ that contains all $3q$ elements, such that $|M'| = q$, and thus no two elements of $M'$ agree in any coordinate?

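Proof. For a 3DM instance $M \subseteq X \times Y \times Z$, with $X = \{x_1, \ldots, x_q\}$, $Y = \{y_1, \ldots, y_q\}$, $Z = \{z_1, \ldots, z_q\}$, and $M = \{(x_i, y_j, z_k)\}$, we create the graph $G(\mathcal{N}, \mathcal{L})$ of the corresponding instance of the weighted LCA problem as follows:
$$\mathcal{N} = \{r\} \cup \{x_i, y_i, z_i \mid i = 1, \ldots, q\} \cup \{a_{ijk}, \bar{a}_{ijk} \mid (x_i, y_j, z_k) \in M\},$$
$$\mathcal{L} = \{(r, x_i), (r, y_i), (r, z_i) \mid i = 1, \ldots, q\} \cup \{(x_i, a_{ijk}), (x_i, \bar{a}_{ijk}) \mid (x_i, y_j, z_k) \in M\}.$$
The graph as constructed above forms a tree and therefore is 1-connected. Links from the complement $\mathcal{L}^c$ of $\mathcal{L}$ can be used to augment the graph to 2-link connectivity. The weights of the links in $\mathcal{L}^c$ are 1 for $(y_j, a_{ijk})$, $(z_k, \bar{a}_{ijk})$, and $(a_{ijk}, \bar{a}_{ijk})$ with $(x_i, y_j, z_k) \in M$, and for the remaining links in $\mathcal{L}^c$, the weight is 2.
$M$ contains a matching $M'$ if and only if there is a set $\mathcal{L}' \subseteq \mathcal{L}^c$ of weight $|M| + q$ such that $G(\mathcal{N}, \mathcal{L} \cup \mathcal{L}')$ is 2-link connected. Assuming $M'$ exists, then adding links $(y_j, a_{ijk})$ and $(z_k, \bar{a}_{ijk})$ for each triple $(x_i, y_j, z_k) \in M'$ will establish the (2-connected) cycle $r$-$y_j$-$a_{ijk}$-$x_i$-$\bar{a}_{ijk}$-$z_k$-$r$. Since $|M'| = q$, the weight of these added links is $2q$. The remaining nodes that are not yet on a cycle are the nodes $a_{ijk}$ and $\bar{a}_{ijk}$ belonging to $M \setminus M'$. These nodes will be directly connected, thereby creating the cycle $x_i$-$a_{ijk}$-$\bar{a}_{ijk}$-$x_i$. In total, $|M| - q$ additional links will be added, leading to a total weight of links that have been added of $|M| + q$. Since the graph is a tree with $2(|M| + q)$ leaves and the minimum link weight is 1, a network augmentation solution of weight $|M| + q$ is indeed the lowest possible. It remains to demonstrate that an augmentation of weight $|M| + q$ will lead to a valid matching $M'$. Since, in a solution of weight $|M| + q$, each leaf will be connected by precisely one link from $\mathcal{L}^c$, a link $(y_j, a_{ijk})$ will prevent adding a link $(a_{ijk}, \bar{a}_{ijk})$, and therefore also link $(z_k, \bar{a}_{ijk})$ must be added. The corresponding triple $(x_i, y_j, z_k)$ was not augmented before and is, therefore, part of a valid matching. The remaining links $(a_{ijk}, \bar{a}_{ijk})$ do not contribute to the matching.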

Frederickson and JáJá also used the construction of this proof to prove that the node-connectivity and strong-connectivity variants of the weighted LCA problem are NP-hard (in a directed graph strong connectivity is used, which means that there is a directed path from each node to every other node in the graph). We remark that the unweighted simple graph preserving LCA problem was claimed to be NP-hard by Jordán (reproduced in [55]) by using a reduction to one of the problems treated by Frederickson and JáJá. However, Jordán appears to be using an unweighted problem of which only the weighted version is proved to be NP-hard in the referenced paper [53], and it is therefore not clear whether the unweighted problem is indeed NP-hard. For a fixed target connectivity, the unweighted simple graph preserving problem can be solved in polynomial time [55].

Eswaran and Tarjan [54] were the first to report on augmentation problems. They considered augmenting a network towards either 2-link connectivity, 2-node connectivity or strong connectivity, and provided for each unweighted problem variant an algorithm of complexity $O(N + L)$ (Raghavan [56] pointed out an error in the strong connectivity algorithm and provided a fix for it). Since most protection schemes only focus on protecting against one single failure at a time (by finding two disjoint paths as discussed in Section 4), we will first present the 2-link-connectivity augmentation algorithm of Eswaran and Tarjan [54].

3.1.1. Eswaran and Tarjan Algorithm

The algorithm of Eswaran and Tarjan as presented in Algorithm 6 makes use of preorder (Algorithm 4) and postorder (Algorithm 3) numbering of nodes in a tree (the label $l(n)$ of node $n$ denotes its number as a result of the ordering) and a procedure (Algorithm 5) to find 2-link-connected components.

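Algorithm 3: PostOrder($T$, $n$, $i$).
(1) For each child $c$ of $n$ in $T$ do PostOrder($T$, $c$, $i$)
(2) $l(n) \leftarrow i$, $i \leftarrow i + 1$

Algorithm 4: PreOrder($T$, $n$, $i$).
(1) $l(n) \leftarrow i$
(2) $i \leftarrow i + 1$
(3) For each child $c$ of $n$ in $T$ do PreOrder($T$, $c$, $i$)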

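Algorithm 5: Finding the 1-link-connected links of $G$.
(1) Find a directed spanning tree $T$ in $G$ rooted at a node $r$
(2) PostOrder($T$, $r$, 1)
(3) For $n \in \mathcal{N} \setminus \{r\}$
(4)   For all descendants $m$ of $n$ in $T$, including $n$ /*Only nodes “downstream”*/, determine the lowest label $l_{\min}(n)$ and the highest label $l_{\max}(n)$ among these descendants and the nodes joined to them by a link of $G$ other than the tree link between $n$ and its parent
(5)   if $l_{\min}(n) > l(n) - D(n)$ and $l_{\max}(n) \le l(n)$ /*$D(n)$ is the number of descendants in the tree including $n$*/
(6)     then the tree link between $n$ and its parent is 1-link-connected. /*Its removal cuts the graph*/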

Algorithm 6: Eswaran-Tarjan augmentation of $G$ to 2-link connectivity.
(1) Find the 2-link-connected components of $G$
(2) Condense $G$ into a tree $T$ for which each node represents one of the 2-link-connected components of $G$
(3) Number the nodes in $T$ in preorder, starting from an arbitrary non-leaf node $r$ /*PreOrder($T$, $r$, 1)*/
(4) For $i = 1, \ldots, \lceil k/2 \rceil$ choose links $(v_i, v_{i + \lfloor k/2 \rfloor})$, where $v_1, \ldots, v_k$ are the leaves of $T$ ordered in increasing node number
(5) Map the ends of each chosen link to an arbitrary node in the corresponding 2-link-connected component

We have assumed that the initial graph was connected. Eswaran and Tarjan's algorithm also allows starting from a disconnected graph, by first augmenting the forest of condensed 2-link-connected components to a tree.
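
The steps of Algorithm 6 can be sketched in Python as follows. The sketch assumes a connected, simple, undirected input graph given as a dictionary from node to neighbor set. It deliberately swaps in a depth-first low-link test to find the 1-link-connected links instead of the postorder test of Algorithm 5, and it uses an iterative preorder traversal; the function names are our own, hypothetical choices.

```python
def adjacency(links):
    """Build {node: set of neighbours} from a list of undirected links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def find_bridges(adj):
    """Links whose removal disconnects the graph (DFS low-link test)."""
    order, low, out = {}, {}, set()

    def dfs(n, parent):
        order[n] = low[n] = len(order)
        for m in adj[n]:
            if m not in order:
                dfs(m, n)
                low[n] = min(low[n], low[m])
                if low[m] > order[n]:       # nothing below m reaches n or above
                    out.add(frozenset((n, m)))
            elif m != parent:
                low[n] = min(low[n], order[m])

    dfs(next(iter(adj)), None)
    return out


def components_without(adj, removed):
    """Map each node to a representative of its 2-link-connected component."""
    comp = {}
    for start in adj:
        if start in comp:
            continue
        comp[start] = start
        stack = [start]
        while stack:
            n = stack.pop()
            for m in adj[n]:
                if m not in comp and frozenset((n, m)) not in removed:
                    comp[m] = start
                    stack.append(m)
    return comp


def augment_to_2_link_connected(adj):
    """Links to add so that the graph becomes 2-link-connected."""
    bridges = find_bridges(adj)
    comp = components_without(adj, bridges)
    tree = {c: set() for c in set(comp.values())}   # condensed tree
    for link in bridges:
        a, b = (comp[n] for n in link)
        tree[a].add(b)
        tree[b].add(a)
    if len(tree) == 1:
        return []                           # already 2-link-connected
    # preorder from a non-leaf node (any node if the tree is a single link)
    root = next((c for c in tree if len(tree[c]) > 1), next(iter(tree)))
    leaves, seen, stack = [], {root}, [root]
    while stack:
        c = stack.pop()
        if len(tree[c]) == 1:
            leaves.append(c)
        for d in tree[c]:
            if d not in seen:
                seen.add(d)
                stack.append(d)
    k = len(leaves)
    # pair leaf i with leaf i + floor(k/2); representatives are real nodes
    return [(leaves[i], leaves[i + k // 2]) for i in range((k + 1) // 2)]
```

The number of added links is half the number of leaves of the condensed tree, rounded up. When the condensed tree consists of a single link between two one-node components, the only possible added link is parallel to that link, which a simple-graph representation such as the one used here cannot store.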

3.1.2. Cactus Representation of All Minimum Cuts

The algorithm of Eswaran and Tarjan uses a tree representation of all the 2-link-connected components in $G$, which is subsequently used to find a proper augmentation. By using a so-called cactus representation of all minimum cuts in a network, a similar strategy could be deployed to augment a network to a connectivity larger than 2. A graph $G$ is defined to be a cactus graph if any two distinct simple cycles in $G$ have at most one node in common (or equivalently, any link of $G$ belongs to at most one cycle). In this section, we will present the cactus representation.

We will use the notation $(X, Y)$ to represent a set of links that connect nodes in $X$ to nodes in $Y$. The link-set $(X, \bar{X})$, with $\bar{X} = \mathcal{N} \setminus X$, refers to a cut-set of links whose removal separates the graph into two subgraphs of nodes $X$ and nodes $\bar{X}$. Dinitz et al. [58] have proposed a cactus structure $H(G)$ to represent all the minimum cuts of a graph $G$ (possibly with parallel links) and have shown that there can be at most $\binom{N}{2}$ such minimum cuts. The structure $H(G)$ possesses the following properties.
(1) $H(G)$ is a cactus graph, that is, any two distinct simple cycles of $H(G)$ have at most one node in common.
(2) Each proper cut in $H(G)$ is a minimum cut (a cut is called proper if the removal of the links in that cut partitions the graph in precisely two subgraphs. A minimum cut is always proper).
(3) For any link $l$ that is part of a cycle in $H(G)$ the weight $w(l) = \lambda/2$, else $w(l) = \lambda$.
(4) $\lambda(H(G)) = \lambda(G)$, where $\lambda(H(G))$ represents the minimum-weight link cut of $H(G)$.

A cactus graph without cycles is a tree, and if $\lambda$ is odd, then $H(G)$ is a tree. Cycles in the cactus graph reflect so-called crossing cuts in $G$.

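Definition 15. Two cuts $(X, \bar{X})$ and $(Y, \bar{Y})$, with $X, Y \subset \mathcal{N}$, are crossing cuts, if all four sets $X \cap Y$, $X \cap \bar{Y}$, $\bar{X} \cap Y$, and $\bar{X} \cap \bar{Y}$ are non-empty.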
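
The definition of crossing cuts translates directly into a small check. The following Python sketch is our own illustration; it represents a cut by one of its two node sets, and the function name `are_crossing` is hypothetical.

```python
def are_crossing(x, y, nodes):
    """True if the cuts (X, complement of X) and (Y, complement of Y) cross,
    that is, if all four pairwise intersections are non-empty."""
    x, y, nodes = set(x), set(y), set(nodes)
    x_bar, y_bar = nodes - x, nodes - y
    return all((x & y, x & y_bar, x_bar & y, x_bar & y_bar))
```

On a 4-node ring with nodes 1 to 4, the cuts defined by $\{1, 2\}$ and $\{2, 3\}$ cross, whereas the cuts defined by $\{1\}$ and $\{1, 4\}$ are nested and therefore do not cross.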

Karzanov and Timofeev [59] have outlined an algorithm to compute $H(G)$ that consists of two parts: (1) computing all minimum cuts and (2) constructing the corresponding cactus representation. However, Nagamochi and Kameda [60] reported that their cactus representation may not be unique. We assume that all minimum cuts are already known (e.g., by computing minimum $s$-$t$ cuts between all possible source-destination pairs, by the Gomory-Hu tree algorithm [61], or with Matula's algorithm as explained in [62]) and focus on explaining, following the description of Fleischer [63], how to build a unique cactus graph for the graph $G$.

Karzanov and Timofeev [59] observe that for a link $(u, v) \in \mathcal{L}$, any two minimum cuts $(X, \bar{X})$ and $(Y, \bar{Y})$ that separate $u$ and $v$ are nested, which means that $X \subset Y$ (or vice versa). If we assign the nodes of $G$ a preorder labelling $(n_1, \ldots, n_N)$, such that node $n_{i+1}$ is adjacent to a node in the set $\{n_1, \ldots, n_i\}$, and define $\mathcal{C}_i$ to be the set of minimum cuts that contain $\{n_1, \ldots, n_i\}$ but not $n_{i+1}$, then it follows that all cuts in $\mathcal{C}_i$ are noncrossing for each $i$. For instance, consider a 4-node ring $1$-$2$-$3$-$4$-$1$, where three minimum cuts separate nodes $1$ and $2$, namely, $(\{1\}, \{2, 3, 4\})$, $(\{1, 4\}, \{2, 3\})$, and $(\{1, 3, 4\}, \{2\})$. Clearly $\{1\} \subset \{1, 4\} \subset \{1, 3, 4\}$, which allows us to represent them as a path graph $\{1\}$-$\{4\}$-$\{3\}$-$\{2\}$. The three possibilities to cut this chain correspond to the three minimum cuts that separate $1$ and $2$ in the ring graph. For each $\mathcal{C}_i$ there is a corresponding path graph $P_i$. These path graphs are used to create a single cactus graph. We proceed to present the algorithm as described by Fleischer [63] (for an alternative description we refer to [64]), see Algorithm 7. We define $\phi$ to be the function that maps nodes of $G$ to nodes in $H(G)$, and we define $G_i$ to be the graph $G$ with nodes $\{n_1, \ldots, n_i\}$ contracted to a single node (and any resulting self-loops removed). Let $G_r$ be the smallest graph that has a minimum cut of value $\lambda$, where $r$ corresponds to the largest index of such a graph. $H(G_r)$ is a path graph. The algorithm builds $H(G_{i-1})$ from $H(G_i)$ until $H(G)$ is obtained.
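
The nesting property can be checked on a small example with a brute-force Python sketch. It enumerates all minimum cuts of a tiny graph and filters those that separate two adjacent nodes; the node labels 1 to 4 for the ring and the function name `min_cuts` are our own, hypothetical choices, and the enumeration is exponential in the number of nodes.

```python
from itertools import combinations


def min_cuts(nodes, links):
    """Return (cut value, list of node sets X containing nodes[0]) for all
    minimum link cuts (X, complement of X) of a small undirected graph."""
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best, sides = None, []
    for k in range(len(rest)):              # X contains `first` and never all nodes
        for extra in combinations(rest, k):
            side = frozenset((first, *extra))
            size = sum((u in side) != (v in side) for u, v in links)
            if best is None or size < best:
                best, sides = size, [side]
            elif size == best:
                sides.append(side)
    return best, sides


ring = [(1, 2), (2, 3), (3, 4), (4, 1)]
value, cuts = min_cuts([1, 2, 3, 4], ring)
# minimum cuts that separate the adjacent nodes 1 and 2, ordered by size
separating = sorted((x for x in cuts if 2 not in x), key=len)
# each set is a proper subset of the next one, so the cuts form a chain
nested = all(a < b for a, b in zip(separating, separating[1:]))
```

For the ring, every pair of links forms a minimum cut, so the number of minimum cuts reaches the maximum of $N(N-1)/2 = 6$, and the three cuts that separate nodes 1 and 2 form a nested chain.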

(1) Compute for
(2) For
(3)  Replace the node in that contains nodes with the Path .
    If , then remove and introduce new nodes with links
    for .
(4)  Connect path to . For any tree or cycle link in , let
    be the set of nodes in , or if is an empty node, the nodes in any nonempty
    node reachable from by some path of links disjoint from a cycle containing
    . Find the subset such that and connect to .
(5)  Label the nodes of . Let be the set of nodes mapped to in .
    Update by for all . All other mappings remain unchanged.
(6)  Remove all empty nodes of degree and all empty 2-way cut nodes by contracting
    an adjacent tree link (a node is an -way cut node if its removal separates the graph into
    connected parts). Replace all empty 3-way cut nodes with 3 cycles.

Figure 1 gives an example of the execution of the algorithm on a 4-node ring.
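The nesting property used in the 4-node ring example can be verified by brute force. The sketch below is only meant for very small graphs (it enumerates all node subsets) and assumes unweighted links; the helper names `cut_value`, `sides`, and `min_cuts_separating` are hypothetical.

```python
from itertools import combinations


def cut_value(links, side):
    """Number of links with exactly one endpoint in `side`."""
    return sum((u in side) != (v in side) for u, v in links)


def sides(nodes, inside, outside=()):
    """All node sets that contain `inside` and avoid `outside`."""
    free = [n for n in nodes if n not in inside and n not in outside]
    for size in range(len(free) + 1):
        for extra in combinations(free, size):
            yield frozenset(inside) | frozenset(extra)


def min_cuts_separating(nodes, links, a, b):
    """Global minimum cuts (given as the side containing a) that separate a from b."""
    # Every cut has one side containing nodes[0]; skip the full node set.
    smallest = min(cut_value(links, s)
                   for s in sides(nodes, (nodes[0],)) if len(s) < len(nodes))
    cuts = [s for s in sides(nodes, (a,), (b,)) if cut_value(links, s) == smallest]
    return sorted(cuts, key=len)


ring_nodes = [1, 2, 3, 4]
ring_links = [(1, 2), (2, 3), (3, 4), (4, 1)]
cuts = min_cuts_separating(ring_nodes, ring_links, 1, 2)
print([sorted(c) for c in cuts])                       # [[1], [1, 4], [1, 3, 4]]
print(all(a < b for a, b in zip(cuts, cuts[1:])))      # True: the cuts are nested
```

The three cuts form a chain of strictly growing node sets, which is why they can be laid out as a path graph.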

3.1.3. Naor-Gusfield-Martel Algorithm

Naor et al. [65] have proposed a polynomial-time algorithm to augment the link connectivity of a graph from to , by adding the smallest number of (possibly parallel) links. The authors first demonstrate how to augment the link connectivity by one in time, after which it is explained how executing this algorithm times could optimally augment the graph towards link connectivity (Cheng and Jordán [66] further discuss link connectivity augmentation by adding one link at a time). In practice, as a result of the costs in network augmentation, a network's connectivity is likely not augmented with . We will therefore only present the algorithm to augment the link connectivity by one, see Algorithm 9, and refer to [65] for the extended algorithm. The algorithm uses the cactus structure that was presented in the previous section to represent all the minimum cuts of a graph . The algorithm is similar in approach to the Eswaran-Tarjan algorithm, since a cactus representation of a 1-connected network is the tree representation used by Eswaran and Tarjan, and the algorithm connects “leafs” as Eswaran and Tarjan have done. Naor et al., however, use a different definition of leafs for cactus graphs.

Definition 16 (cactus leaf). A node in a cactus representation is a cactus leaf if it has degree 1 or is a cycle node of degree 2.

Similarly to a tree, if the cactus has $k$ leafs, then $\lceil k/2 \rceil$ links need to be added to increase the connectivity by 1.

The algorithm uses a Depth-First-Search-like procedure, see Algorithm 8, to label the nodes of the cactus graph.

(1) Assign different colors to the different simple cycles /* for example, by finding the articulation
    points [67] */
(2) DFS traversal that starts at an arbitrary node and obeys the following rule: if a node is visited
    for the first time via a cycle with some color, then traverse all other differently colored links
    adjacent to before traversing the adjacent link of the same color. Enumerate the cactus leafs
     in the order in which they are first encountered in the DFS traversal.

(2) Cactus-DFS
(3) Form the pairs , where is the set of nodes from
   that map to the leaf of .
(4) For each pair , add a link between a node from in
   and a node from in . If is odd, then connect a node in to a node in
   a different leaf .
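The leaf-pairing idea can be illustrated on the simplest case, a tree, whose cactus representation is the tree itself. The sketch below is not the full algorithm of Naor et al.: it handles no cycles and no node mapping, and it assumes that the leaf with index i in DFS order is paired with the leaf at offset ⌊k/2⌋, which is our reading of the pairing step. All function names are hypothetical, and `has_bridge` is a brute-force check included only to validate the result.

```python
def leaves_in_dfs_order(adj, root):
    """Leaves (degree-1 nodes) of a tree in the order a DFS from root meets them."""
    seen, stack, leaves = {root}, [root], []
    while stack:
        node = stack.pop()
        if len(adj[node]) == 1:
            leaves.append(node)
        for nxt in sorted(adj[node], reverse=True):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return leaves


def augment_tree_by_one(links, root):
    """Return new links that raise the link connectivity of a tree from 1 to 2."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    leaves = leaves_in_dfs_order(adj, root)
    half = len(leaves) // 2
    new_links = [(leaves[i], leaves[i + half]) for i in range(half)]
    if len(leaves) % 2 == 1:  # odd number of leaves: one extra link is needed
        new_links.append((leaves[-1], leaves[0]))
    return new_links


def has_bridge(links):
    """Brute force: True if removing some single link disconnects the multigraph."""
    nodes = {n for link in links for n in link}
    for skip in range(len(links)):
        adj = {n: set() for n in nodes}
        for i, (u, v) in enumerate(links):
            if i != skip:
                adj[u].add(v)
                adj[v].add(u)
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen != nodes:
            return True
    return False


tree = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6)]
extra = augment_tree_by_one(tree, root=2)
print(extra)                      # [(1, 5), (3, 6)]
print(has_bridge(tree + extra))   # False
```

With four leaves, two links suffice; a star with three leaves needs two links as well, in line with the number of leafs divided by two, rounded up.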

For further reference, in Table 3, we present some key achievements in augmenting link connectivity in unweighted graphs.

Splitting off a pair of links and refers to deleting those links and adding a new link . A pair of links is said to be splittable if the - min-cut values remain unaffected after splitting off the pair of links and is considered in the context of Mader's theorem.
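As a small illustration of the splitting-off operation itself (not of the test for whether a pair is splittable), the sketch below assumes an undirected multigraph stored as a `Counter` of two-node `frozenset` links, and a pair of links that share the node `s`. The names `split_off`, `u`, `s`, and `v` are hypothetical.

```python
from collections import Counter


def split_off(links, u, s, v):
    """Delete the links {u, s} and {s, v} and add the link {u, v} (assumes u != v)."""
    result = Counter(links)
    for link in (frozenset((u, s)), frozenset((s, v))):
        if result[link] == 0:
            raise ValueError(f"link {sorted(link)} is not in the graph")
        result[link] -= 1
    result[frozenset((u, v))] += 1
    return +result  # unary plus drops links whose multiplicity fell to zero


triangle = Counter({frozenset((1, 0)): 1, frozenset((0, 2)): 1, frozenset((1, 2)): 1})
print(split_off(triangle, 1, 0, 2))  # a single link {1, 2} with multiplicity 2
```

The operation lowers the degree of `s` by two and leaves the degrees of all other nodes unchanged.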

Theorem 17 (Mader [68, 69]). Let be a connected undirected graph where for some node the degree , and the removal of one of the adjacent links of does not disconnect the graph, then has a pair of splittable links.

Mader's theorem has been used by, for instance, Cai and Sun [75] and Frank [77] in developing network augmentation algorithms. The algorithms (as already outlined in 1976 by Plesník [84]) attach a new node to the graph with parallel links between and all other nodes in the graph and subsequently proceed to split off splittable links.

As indicated by Theorem 14, the weighted LCA problem is NP-complete for both undirected graphs and directed graphs. Frederickson and JáJá [53] provided an algorithm to make a weighted graph 2 connected. The algorithm is a 2-approximation algorithm if the starting graph is connected, else it is a 3-approximation algorithm. Khuller and Thurimella [85] proposed a 2-approximation algorithm for increasing the connectivity of a weighted undirected graph to that has a complexity of . Taoka et al. [86] compare via simulations several approximation and heuristic algorithms, including their own maximum-weight-matching-based algorithm.

Under specific conditions, the weighted LCA problem may be polynomially solvable, as shown by Frank [77] for the case that link weights are derived from node weights.

3.2. Node Connectivity Augmentation

In this section, we consider the following node augmentation problem.

Problem 3. The Node Connectivity Augmentation (NCA) problem. Given a graph consisting of nodes and links, node connectivity and an integer , the node connectivity augmentation problem is to add a minimum-weight set of links, such that the node connectivity of the graph is increased from to .

As for the LCA problem, the following theorem holds.

Theorem 18. The weighted NCA problem is NP-hard.

Proof. The proof of Theorem 14 also applies here.

The unweighted undirected NCA problem has received the most attention. The specific case of making a graph 2-node connected was treated by Eswaran and Tarjan [54] and by Rosenthal and Goldner [87] (a correction to the latter algorithm has been made by Hsu and Ramachandran [88]). Watanabe and Nakamura [89] and Jordán [90] solved the case for achieving 3-node-connectivity, while Hsu [91] developed an algorithm to upgrade a 3-node-connected graph to a 4-node-connected one. Increasing the connectivity of a $k$-node-connected graph (where $k$ can be any integer) by 1 was studied by many researchers [90, 92–97], since it was long unknown whether the problem was polynomially solvable. In 2010, Végh [98] provided a polynomial-time algorithm to increase the connectivity of any $k$-node-connected unweighted undirected graph by 1.

Augmenting the node connectivity of directed graphs has been treated by Frank and Jordán [99]. They found a min-max formula that gives the minimum number of new links required to make an unweighted digraph $k$-node connected. Frank and Végh [100] developed a polynomial-time algorithm to make a $k$-node-connected directed graph $(k+1)$-node connected.

As the weighted NCA problem is NP-hard, special cases have been considered [88–91, 98, 101]. Most of these articles discuss specific connectivity targets (the current and/or the target node connectivity have specific values) or specific topologies, like trees. Heuristic and approximation algorithms have also been proposed [85, 102–107].

4. Disjoint Paths

When a network is (made to be) robust, algorithms should be in place that can find link- or node-disjoint paths to protect against a link or node failure. There can be several objectives associated with finding link- or node-disjoint paths.

Problem 4. Given a graph , where and , a weight and a capacity associated with each link , a source node and a terminal node , and two bounds and , find a pair of disjoint paths from the source to the terminal that meets one of the following objectives.

Min-Sum Disjoint Paths Problem
The total weight of the pair of disjoint paths is minimized.

Min-Max Disjoint Paths Problem
The maximum path weight of the two disjoint paths is minimized.

Min-Min Disjoint Paths Problem
The smallest path weight of the two disjoint paths is minimized.

Bounded Disjoint Paths Problem
The weight of the primary path should be less than or equal to , and the weight of the backup path should be less than or equal to .

Widest Disjoint Paths Problem
The smallest capacity over all links in the two paths is maximized.
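To make the five objectives concrete, the following sketch evaluates them for a given pair of paths. It assumes directed links keyed by `(tail, head)` tuples with hypothetical weights and capacities; the function names and the example values are illustrative only, and the code scores a candidate pair rather than searching for an optimal one.

```python
def path_links(path):
    """Consecutive node pairs of a path given as a list of nodes."""
    return list(zip(path, path[1:]))


def path_weight(path, weight):
    return sum(weight[link] for link in path_links(path))


def objectives(p1, p2, weight, capacity):
    """Objective values of a path pair under the min-sum, min-max, min-min, and widest criteria."""
    w1, w2 = path_weight(p1, weight), path_weight(p2, weight)
    return {
        "min_sum": w1 + w2,
        "min_max": max(w1, w2),
        "min_min": min(w1, w2),
        "widest": min(capacity[link] for link in path_links(p1) + path_links(p2)),
    }


def is_bounded(primary, backup, weight, bound1, bound2):
    """Bounded objective: primary weight <= bound1 and backup weight <= bound2."""
    return path_weight(primary, weight) <= bound1 and path_weight(backup, weight) <= bound2


weight = {("s", "a"): 1, ("a", "t"): 2, ("s", "b"): 3, ("b", "t"): 4}
capacity = {("s", "a"): 10, ("a", "t"): 5, ("s", "b"): 8, ("b", "t"): 6}
print(objectives(["s", "a", "t"], ["s", "b", "t"], weight, capacity))
# {'min_sum': 10, 'min_max': 7, 'min_min': 3, 'widest': 5}
```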

The most common and simplest of these is the min-sum disjoint paths problem. If the two paths are used simultaneously for load-balancing purposes (or 1+1 protection), then the min-max objective is desirable. Unfortunately, the min-max disjoint paths problem is NP-hard [108]. If failures are expected to occur only sporadically (and in case of 1 : 1 protection), then it may be desirable to minimize the weight of the primary (shorter) path (min-min objective), which also leads to an NP-hard problem [109]. The min-max and min-min disjoint paths problems could be considered as extreme cases of the bounded disjoint paths problem, which was shown to be NP-hard [110] and later proven to be APX-hard by Bley [111] (the graph structure referred to as lobe that was used by Itai et al. [110] to prove NP-completeness has since often been used to prove that other disjoint paths problems are NP-complete, e.g., [112–114]). Finding widest disjoint paths can easily be done by pruning “low-capacity” links from the graph and finding disjoint paths. When the capacity requirements for the primary and backup paths are different, disjoint paths problems usually become NP-complete [115].

Beshir and Kuipers [116] investigated the min-sum disjoint paths problem with min-max, min-min, bounded, and widest, as secondary objectives in case multiple min-sum paths exist between and . From these variants, only the widest min-sum link-disjoint paths problem is not NP-hard.

Li et al. [112] studied the min-sum disjoint paths problem, where the link-weight functions are different for the primary and backup paths and showed that this problem is hard to approximate. Bhatia [117] demonstrated that the problem remains hard to approximate in the case that the weights for the links of the backup path are a fraction of the normal link weights (for the primary path).

Sherali et al. [118] investigated the time-dependent min-sum disjoint paths problem, where the link weights are time-dependent. They proved that the problem is NP-hard, even if only one link is time-dependent and all other links are static.

4.1. Min-Sum Disjoint Paths

Finding min-sum disjoint paths is equivalent to finding a minimum-cost flow in unit-capacity networks [26]: a minimum-cost flow of $k$ units will traverse $k$ disjoint paths. In fact, Suurballe's algorithm, which is most often cited as an algorithm to compute two disjoint paths, is an algorithm that uses augmenting paths, like in several max-flow algorithms. The original Suurballe algorithm as presented in [119] makes it possible to compute node (or link) disjoint paths between a single source-destination pair, by using shortest path computations. Later, this approach was used by Suurballe and Tarjan [120] to find two link (or node) disjoint paths from a source to all other nodes in the network (i.e., source-destination pairs), by using only two shortest-paths computations, that is, in time. Both papers focus on directed networks, but the algorithms can also be applied to undirected networks.

In directed networks, a link-disjoint paths algorithm can be used to compute node-disjoint paths, if we split each node into two nodes and , with a directed link , and the incoming links of connected to and the outgoing links of departing from .
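The node-splitting transformation for directed networks can be sketched as follows. The representation (a dictionary from `(tail, head)` to weight), the `"in"`/`"out"` labels, and the choice of weight 0 for the internal link are assumptions made for this illustration.

```python
def split_nodes(links):
    """Split every node n into (n, "in") and (n, "out"), joined by a zero-weight link."""
    nodes = {n for link in links for n in link}
    # Internal link of each node; its weight of 0 is an assumption of this sketch.
    transformed = {((n, "in"), (n, "out")): 0 for n in nodes}
    for (u, v), w in links.items():
        # Incoming links end at the "in" copy, outgoing links leave the "out" copy.
        transformed[((u, "out"), (v, "in"))] = w
    return transformed


links = {("s", "a"): 1, ("a", "t"): 2}
for link, w in sorted(split_nodes(links).items()):
    print(link, w)
```

Link-disjoint paths from `("s", "out")` to `("t", "in")` in the transformed graph can each use the internal link of an intermediate node at most once between them, so they correspond to node-disjoint paths in the original graph.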

In undirected networks, a link-disjoint paths algorithm can be used to compute node-disjoint paths by the transformation described in Section 2.2.

We will present the Suurballe-Tarjan algorithm, see Algorithm 10, for computing two link-disjoint paths between and every other node in the network.

(1) Compute the shortest paths tree rooted at
(2) Modify the weights of each link to
   /* is the length of the shortest path in from to */
(3) For
(4)   , ,
(7) While
(9)  DELETE  /* becomes a forest of subtrees*/
(10)    For each non-tree link in such that or and are in different subtrees
(11)      If
(12)        , 

Instead of finding an augmenting path for each source-destination pair, Suurballe and Tarjan have found a way to combine these augmenting flow computations into two Dijkstra-like shortest-paths computations. First, a shortest paths tree is computed in line 1, and based on the computed shortest path lengths, the link weights are modified in line 2. This link weight modification was also used by Suurballe and ensures that the modified weight of every link is nonnegative, with equality if the link is in the shortest paths tree. In Suurballe's original algorithm the direction of the links on the shortest path from the source to the destination was reversed, after which a shortest (augmenting) path in the newly modified graph was computed. In Suurballe-Tarjan's algorithm the links maintain their direction, but an additional parameter is used instead. The algorithm proceeds in a Dijkstra-like fashion. Lines 3–6 correspond to the initialization of the smallest length from to found so far, its corresponding predecessor list and . The algorithm repeatedly extracts a node of minimum length (in line 8) and removes that node from the tree (in line 9). A slightly different relaxation procedure is used (lines 10–12). Upon termination of the algorithm, the disjoint paths between the source and a destination can be retrieved via the lists and with Algorithm 11.

(2) While
(3)  mark
(5) For
(7) While
(8)  If is marked
(9)       unmark
(10)   Else
(11)      /* is parent of in */
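For a single source-destination pair, the augmenting-path idea behind Suurballe's algorithm can be sketched compactly. The code below is not the Suurballe-Tarjan algorithm listed above: it is a plain successive-shortest-path computation that reverses (and negates) the links of the first path and uses Bellman-Ford instead of the weight modification with Dijkstra. It assumes a directed graph with positive link weights and distinct source and destination; all names are hypothetical.

```python
import math


def shortest_path_arcs(nodes, arcs, s, t):
    """Bellman-Ford over arcs (tail, head, cost, link, forward); None if t is unreachable."""
    dist = {n: math.inf for n in nodes}
    dist[s] = 0
    pred = {}
    for _ in range(len(nodes) - 1):
        for arc in arcs:
            tail, head, cost = arc[0], arc[1], arc[2]
            if dist[tail] + cost < dist[head]:
                dist[head] = dist[tail] + cost
                pred[head] = arc
    if math.isinf(dist[t]):
        return None
    path, node = [], t
    while node != s:
        path.append(pred[node])
        node = pred[node][0]
    return path[::-1]


def min_sum_link_disjoint_paths(links, s, t):
    """Two link-disjoint s-t paths of minimum total weight, or None if none exist."""
    nodes = {n for link in links for n in link}
    used = set()  # links that currently carry one unit of flow
    for _ in range(2):
        # Residual graph: unused links keep their direction and weight,
        # used links are reversed and get a negated weight.
        arcs = [(v, u, -w, (u, v), False) if (u, v) in used
                else (u, v, w, (u, v), True)
                for (u, v), w in links.items()]
        path = shortest_path_arcs(nodes, arcs, s, t)
        if path is None:
            return None
        for tail, head, cost, link, forward in path:
            if forward:
                used.add(link)
            else:
                used.discard(link)  # the augmenting path cancels this link
    paths = []
    for _ in range(2):  # decompose the two units of flow into paths
        path, node = [s], s
        while node != t:
            link = next(l for l in used if l[0] == node)
            used.remove(link)
            node = link[1]
            path.append(node)
        paths.append(path)
    return paths


links = {("s", "a"): 1, ("a", "b"): 1, ("b", "t"): 1, ("s", "b"): 3, ("a", "t"): 3}
print(sorted(min_sum_link_disjoint_paths(links, "s", "t")))
# [['s', 'a', 't'], ['s', 'b', 't']]
```

In this example the shortest path s-a-b-t has weight 3, but removing its links leaves no second path; the augmenting step cancels the link (a, b) and yields two disjoint paths of total weight 8.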

Taft-Plotkin et al. [121] extended the approach of Suurballe and Tarjan in two ways: (1) they return maximally disjoint paths, and (2) they also take bandwidth into account. Their algorithm, called MADSWIP, computes maximum-bandwidth maximally disjoint paths and minimizes the total weight as a secondary objective. Consequently, by assigning all links equal bandwidth, the MADSWIP algorithm returns the min-sum maximally disjoint paths.

For a distributed disjoint paths algorithm, we refer to the work of Ogier et al. [122] and Sidhu et al. [123].

Roskind and Tarjan [124] presented an algorithm for finding link-disjoint spanning trees of minimum total weight. Xue et al. [125, 126] have considered quality of service and quality of protection issues in computing two disjoint trees (Quality of Protection (QoP), as used by Xue et al., refers to the number of link failures that can be survived; QoP is sometimes also used to refer to probabilistic survivability, as discussed in the following section, or to protection differentiation, as overviewed by Cholda et al. [127]). Ramasubramanian et al. [128] proposed a distributed algorithm for computing two disjoint trees. Guo et al. [129] considered finding two link-disjoint paths subject to multiple Quality-of-Service constraints.

4.2. Probabilistic Survivability

When two disjoint primary and backup paths are reserved for a connection, any failure on the primary path can be survived by using the backup path. The backup path therefore provides a 100% survivability guarantee against a single failure. When no backup paths are available, that is, unprotected paths are used, then the communication along a path will fail if there is a failure on that path. Banner and Orda [130] have introduced the term -survivable connection to denote a connection for which there is a probability that all its links are operational (the related notion of Quality of Protection (QoP), as defined by Gerstel and Sasaki [131], was argued to be difficult to apply to general networks). The previous two cases correspond to and , respectively. Banner and Orda proved that, under the single-link failure model, at most two paths are needed to establish a -survivable connection, if it exists. Based on this observation, they studied and proposed algorithms for several problem variants, namely establishing -survivable--bandwidth, most survivable, and widest -survivable connections for 1 : 1 and 1+1 protection architectures (the MADSWIP algorithm [121] can also be used to find the most survivable connection). The -survivable--bandwidth problem asks for a connection with survivability and bandwidth and solving it provides a foundation for solving the other problems. We will therefore discuss the solution proposed by Banner and Orda for the -survivable--bandwidth problem.

The approach by Banner and Orda to solve the -survivable--bandwidth problem is twofold. First, the graph is transformed, after which a minimum-cost flow is found on the resulting graph. The graph transformation is depicted in Figure 2 and slightly differs for the 1 : 1 and 1+1 cases. Clearly, if a link does not have sufficient spare capacity to accommodate the requested bandwidth , then it does not need to be considered further (Figure 2(a)). If, for 1 : 1 protection, , then there is sufficient bandwidth for both disjoint paths, since the backup path is only used after failure of the primary path. To allow for both paths to share that link, it is transformed to two links (Figure 2(b)). If the original link is only used by one path, then that link is protected, and hence the weight is assigned to the top link. If both paths have to use the original link, then the connection's survivability is affected by the failure probability of that link, which is why the weight is assigned to the lower link (the logarithm is used to transform a multiplicative metric to an additive metric). The same applies to the 1+1 case, with the exception that the concurrent transmission of data over both paths requires twice the requested bandwidth. For the remaining range for 1+1 protection (Figure 2(c)), it holds that only one of the paths can use that link, which is why there is no weight penalty.

In the transformed graph, a minimum-cost flow of units corresponds to two maximally disjoint paths of each bandwidth. The minimum-cost flow could, for instance, be found with the cycle-canceling algorithm of Goldberg and Tarjan [132], while the corresponding maximally disjoint paths could be returned via a flow-decomposition algorithm [26].

Luo et al. [133] studied the min-sum -survivable connection problem, where each link is characterized by a weight and a failure probability, and the problem is to find a connection of least weight and survivability ≥. Contrary to the min-sum maximally-disjoint paths problem, this problem is NP-hard, since it contains the NP-hard restricted shortest paths problem (e.g., see [134]). Luo et al. proposed an ILP and two approximation algorithms for this problem. Chakrabarti and Manimaran [135] studied the min-sum -survivable--bandwidth problem, for which they considered a segment-based protection strategy.

She et al. [114] have considered the problem of finding two link-disjoint paths, for which the probability that at least one of these paths will not fail is maximized. They refer to this problem as the maximum-reliability (max-rel) link-disjoint paths problem. The rationale behind this problem is to establish two disjoint paths that give 100% protection against a single-link failure, while reducing the failure probability of the connection as much as possible when multiple failures may occur. Assuming that the link-failure probabilities are independent, then the reliability of a connection (consisting of two link-disjoint paths and ) is defined as , with , for . The max-rel link-disjoint paths problem is proven to be NP-complete. She et al. [114] evaluated two simple heuristic algorithms that both transform the link probabilities to link weights , with , for , and . Based on these weights, one heuristic finds a shortest path, prunes its links from the graph, and finds a shortest path in the pruned graph. This is often referred to as an active-path-first (APF) approach. The second heuristic uses Suurballe's algorithm to find two link-disjoint paths. Contrary to the first heuristic, the second always returns link-disjoint paths if they exist.
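Under the independence assumption, the probability that at least one of two link-disjoint paths stays up can be computed directly from the link-failure probabilities. The sketch below also shows a logarithmic weight, -log(1 - p), as one possible way to turn the multiplicative success probability into an additive link weight; this particular transformation is an assumption of the sketch and is not necessarily the one used by She et al. All names are hypothetical.

```python
import math


def path_success(path_links, fail_prob):
    """Probability that no link of the path fails (independent failures)."""
    return math.prod(1 - fail_prob[link] for link in path_links)


def reliability(p1_links, p2_links, fail_prob):
    """Probability that at least one of two link-disjoint paths is operational."""
    return 1 - (1 - path_success(p1_links, fail_prob)) * (1 - path_success(p2_links, fail_prob))


def additive_weight(p):
    """Assumed transformation: summing these weights multiplies success probabilities."""
    return -math.log(1 - p)


fail_prob = {"a": 0.1, "b": 0.2, "c": 0.5}
print(round(reliability(["a", "b"], ["c"], fail_prob), 6))  # 0.86

# The additive weights of a path recover its success probability.
total = sum(additive_weight(fail_prob[link]) for link in ["a", "b"])
print(math.isclose(math.exp(-total), path_success(["a", "b"], fail_prob)))  # True
```

Here the first path succeeds with probability 0.9 × 0.8 = 0.72 and the second with probability 0.5, so both fail with probability 0.28 × 0.5 = 0.14.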

4.3. Multiple Failures

The single-link failure model has been most often considered in the literature, but multiple failures may occur as follows.

(i) Due to lengthy repair times of network equipment, there is a fairly long time span in which new failures could occur.

(ii) In case of terrorist attacks, several targeted parts of the network could be damaged. With Suurballe's algorithm, link/node-disjoint paths can be found to establish full protection against link/node failures.

(iii) In layered networks, for instance IP-over-WDM, one failure on the lowest-layer network may cause multiple failures on higher-layer networks. Similarly, the links of a (single-layered) network may share the same duct, in which case damage to the duct may damage all the links inside. These links are often said to belong to the same shared risk link group (SRLG) (the node variant, SRNG, also exists; when both nodes and links can belong to a shared risk group, the term Shared Risk Resource Group (SRRG) is used, e.g., see [136]). Finding two SRLG-disjoint paths (paths of which the links in one path may not share a risk link group with links from the other path) is an NP-complete problem [137]. In specific cases, the SRLG-disjoint paths problem is polynomially solvable, as discussed by Bhandari [138], Datta and Somani [139], and Luo and Wang [140]. In those cases, for instance, when the links in an SRLG share the same endpoint, a graph transformation can be made that reflects the shared risk groups, and on which a simple link-disjoint paths algorithm can be run. Lee et al. [141] have generalized the SRLG problem to include failure probabilities. In the deterministic SRLG scenario, when an SRLG fails (e.g., a cable in the physical network is cut) all higher-layer links that belong to that group fail. In the probabilistic SRLG (PSRLG) scenario, the links in that PSRLG fail with probability . If , for , then the problem of finding PSRLG-disjoint paths reduces to the NP-complete problem of finding SRLG-disjoint paths.

For SRLG types of problems, often an integer programming formulation is provided (e.g., [137, 142–144]) or an active-path-first (APF) approach is used as a heuristic. Hu [137] provided a basic ILP formulation to return a min-sum pair of SRLG-disjoint paths. Xu et al. [143] gave an ILP and an APF heuristic for the case of shared (backup paths) protection.

(iv) Natural disasters may affect all nodes and links within a certain geographical area. Work on multilink geographic failures has mostly focused on determining the geographic max-flow and min-cut values of a network under geographic failures of circular shape (e.g., Sen et al. [145], Agarwal et al. [146], and Neumayer et al. [147]). Trajanovski et al. [148] proved that the problem of finding two region-disjoint paths is NP-hard, and they proposed a heuristic for it.
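While finding SRLG-disjoint paths is hard in general, verifying that a given pair of paths is SRLG-disjoint is straightforward. The sketch below assumes a hypothetical mapping `groups_of` from each link to the set of shared risk groups it belongs to; the link and group labels are illustrative.

```python
def srlg_disjoint(path1_links, path2_links, groups_of):
    """True if no shared risk group contains links of both paths."""
    risks1 = set().union(*(groups_of[link] for link in path1_links))
    risks2 = set().union(*(groups_of[link] for link in path2_links))
    return risks1.isdisjoint(risks2)


groups_of = {
    ("a", "b"): {"duct1"},
    ("c", "d"): {"duct1"},
    ("e", "f"): {"duct2"},
}
print(srlg_disjoint([("a", "b")], [("c", "d")], groups_of))  # False: both use duct1
print(srlg_disjoint([("a", "b")], [("e", "f")], groups_of))  # True
```

The first pair is link-disjoint but not SRLG-disjoint, which illustrates why a plain link-disjoint paths algorithm does not suffice in this setting.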

5. Conclusion

We have provided an overview of algorithms for network survivability. We have considered how to verify that a network has certain connectivity properties, how to augment an existing network to reach a given connectivity, and, lastly, how to find alternative paths in case network failures occur. Our focus has been on algorithms for general networks, although much work has also been done for specific networks, such as optical networks, where additional constraints like wavelength continuity and signal impairments induce an increased complexity, for example, see our work [149–151]. We have not discussed how to design a survivable network from scratch. Typically network design problems are hard to solve and involve many constraints, but since they only need to be solved sporadically, longer computation times are permitted. Predominantly, integer programming is used to design a network, as we have done in [152].


The author would like to thank Professor Piet Van Mieghem for his constructive comments on an earlier version of this paper.