Abstract

Valiant network design was proposed, at least in part, to counter the difficulties in measuring network traffic matrices. However, in this paper we show that in a Valiant network design, the traffic matrix is in fact easy to measure, leading to a subtle paradox in the design strategy.

1. Introduction

In recent years the difficulties in measuring and predicting Internet traffic matrices have prompted a number of routing and network design strategies broadly termed “oblivious” [1–3]. They are oblivious in the sense that they guarantee performance under any possible traffic matrix. This appealing property has a cost: extra capacity is needed to ensure that performance is maintained under all possible inputs, though several papers have shown reasonable bounds on this additional cost.

In this paper we examine Valiant network design (sometimes called load balancing), a strategy extended from switch design to the design of a whole network [2, 3]. The basic principle is to build a completely connected network (a clique) and use load balancing to share all traffic across all two-hop paths. The remarkable property of this network is that, with only twice the capacity of an optimal network, it can carry any allowable traffic matrix without congestion!

The irony of Valiant network design is that it is predicated on the assumption that traffic matrices are hard to measure, and yet in this paper we show that such a design creates a network in which it is actually possible to measure the traffic matrix. However, this fact is of little use, because if we redesign the network based on this improved information, we then lose the ability to make ongoing measurements, leading to a paradoxical situation.

It is a classic case where “you cannot have your cake, and eat it too!” Where we have the capability to make good measurements (courtesy of Valiant design) we cannot make use of them, and where we do not have such a design, the measurements are much harder to obtain. As a result, we suggest an alternative, which takes advantage of the properties of Valiant network design in addition to the ability to measure traffic matrices.

We should note that there are other reasons for using Valiant network design, for instance resilience to network failures, or errors in traffic predictions, and these may outweigh the issue of difficulties in traffic matrix estimation. However, the problem of measuring traffic matrices has been found interesting in a number of contexts, and so here we examine the measurement aspect of a Valiant network.

2. Background

A Traffic Matrix (TM) describes the amount of traffic (the number of packets or, more commonly, bytes) transmitted from one point in a network to another during some time interval. It is thus naturally represented by $T_t(i,j)$, the traffic volume (in bytes or packets) from $i$ to $j$ during the time interval $[t, t+\Delta t)$. The locations $i$ and $j$ may be physical geographic locations, making $i$ and $j$ spatial variables, or logical variables such as a group of IP addresses, but in this paper we will associate locations with PoPs (Points of Presence). Often, for convenience, TMs are written as column vectors by stacking the columns of the matrix. This allows us to write a series of such matrices as a new matrix $X$, each column of which represents a single snapshot of a TM. In this paper we need only single snapshots, and so our notation will refer to TMs as column vectors $\mathbf{x}$.

TMs are the basic input into many network engineering problems. Of particular relevance here is the network design problem (the problem of determining where links will appear in the network, and what capacity they should have, along with the subsidiary problem of determining the routing of traffic in this network). However, TMs are not easy to measure directly due to problems with data collection, and the scale of data required [4].

On the other hand, SNMP (the Simple Network Management Protocol) data is easy to collect and almost ubiquitous. However, SNMP data only provides link load measurements, not TM measurements [5]. The link measurements $\mathbf{y}$ are related to the TM, which is written as a column vector $\mathbf{x}$, by the linear relationship

$$\mathbf{y} = A\mathbf{x}, \tag{1}$$

where $A$ is called the routing matrix [6]. If $A$ is invertible the solution to this system of equations is obvious, but in general $A$ is not even square. A network with $N$ nodes has $N(N-1)$ traffic demands, so the length of $\mathbf{x}$ is $O(N^2)$, but in a typical network design the number of links, and hence the length of $\mathbf{y}$, is $O(N)$. As $N$ becomes large, the system of equations above becomes underconstrained. In most real networks, the problem is highly underconstrained. The resulting problem of inferring the TM from link measurements is a classic underconstrained linear inverse problem. There are a number of good techniques for solving such problems (see, for instance, [5, 7]), but the ill-posed nature of the problem means that there are likely to be some errors in the estimates.
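To make the underconstrained nature of (1) concrete, the following minimal sketch builds the routing matrix for a hypothetical 4-node star network (the topology, the node numbering, and the traffic values are assumptions chosen only for illustration) and shows two different TMs that produce identical link loads.

```python
# Hypothetical star network: hub 0 and leaves 1..3 (illustrative only).
def route(p, q):
    """Links used by demand p -> q; leaf-to-leaf traffic goes via the hub."""
    if p == 0 or q == 0:
        return [(p, q)]
    return [(p, 0), (0, q)]

nodes = [0, 1, 2, 3]
links = [(i, 0) for i in nodes[1:]] + [(0, j) for j in nodes[1:]]
demands = [(p, q) for p in nodes for q in nodes if p != q]

# Routing matrix: A[l][r] = 1 if demand r crosses link l, else 0.
A = [[1 if link in route(*d) else 0 for d in demands] for link in links]

def link_loads(tm):
    """Compute y = A x for a TM given as {(p, q): volume}."""
    x = [tm.get(d, 0) for d in demands]
    return [sum(a * v for a, v in zip(row, x)) for row in A]

# Two different TMs: leaf-to-leaf traffic versus traffic to and from the hub.
tm_a = {(1, 2): 5, (2, 1): 5}
tm_b = {(1, 0): 5, (0, 2): 5, (2, 0): 5, (0, 1): 5}

print(len(links), "links,", len(demands), "demands")
print("same link loads:", link_loads(tm_a) == link_loads(tm_b))
```

With 6 link measurements and 12 unknown demands, the link loads alone cannot distinguish the two matrices, which is exactly the ambiguity that TM inference methods must resolve with additional assumptions.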

In response to these difficulties, an alternative set of ideas has developed: oblivious routing [1] and Valiant network design [2, 3], which seek to design a network and its routing such that it will work well for any arbitrary traffic matrix. That is, they try to design the network in the absence of standard input information. The cost is a loss of efficiency: the network must be overengineered by at least a factor of two in most cases.

In this paper we consider Valiant Network Design (VND), sometimes also called Valiant load balancing after its central idea. We will consider the simplest example of such a design, for clarity (though the concepts presented here extend to the more complicated case). We have $N$ PoPs which must be connected, but we do not know the TM. The only information we do possess is the total access capacity at each PoP. For simplicity, assume this capacity is $C$ for all PoPs. The access capacity determines the maximum amount of traffic that can come in or depart from a PoP. Hence it limits the traffic matrix, because the row and column sums of this matrix cannot exceed $C$, so in the absence of additional information, our job is to design the network which minimizes our cost subject to the constraints

$$\sum_{i=1}^{N} T(i,j) \le C, \qquad \sum_{j=1}^{N} T(i,j) \le C. \tag{2}$$

The basic principle of VND is that the network should be a clique (a completely connected network) and that traffic should be shared in even proportions across all two-hop paths. Figure 1 illustrates the network design for a 6-node network and shows one of the $N$ paths from $p$ to $q$ through node $i$.

The key result of VND is that almost all traffic goes on two-hop paths, so in order to carry a maximal traffic matrix the network requires approximately $2NC$ capacity, which, when shared amongst the links, results in a required link capacity of $2C/N$. (Note that traffic is evenly split across all $N$ possible intermediate nodes, including the end points, i.e., we include paths $p \to p \to q$ and $p \to q \to q$ in the set of load-balanced paths.) Capacity estimates exist for the more complicated case with unequal access capacities, as well as extensions of VND to networks requiring resilience to failures [2, 3], but these are not germane to the question under consideration here, that is, how much information can we obtain about the TM of a VND?
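The link capacity of $2C/N$ can be checked numerically. The sketch below is only an illustration: the function name `vlb_link_loads`, the choice $N = 6$, $C = 10$, and the test TM (each node sends its full access capacity to its successor) are assumptions, not taken from the paper. It spreads every demand evenly over all $N$ intermediate nodes and reports the resulting link loads.

```python
def vlb_link_loads(T):
    """Link loads when each demand T[p][q] is split evenly over N intermediates.

    The path p -> m -> q puts T[p][q] / N on links (p, m) and (m, q).
    A hop whose two endpoints coincide (m == p or m == q) uses no link.
    """
    N = len(T)
    load = {(i, j): 0.0 for i in range(N) for j in range(N) if i != j}
    for p in range(N):
        for q in range(N):
            if p == q:
                continue
            share = T[p][q] / N
            for m in range(N):
                if m != p:
                    load[(p, m)] += share   # first hop
                if m != q:
                    load[(m, q)] += share   # second hop
    return load

N, C = 6, 10.0
# Assumed test input: node i sends all of its access capacity C to node i + 1,
# so every row sum and every column sum equals C.
T = [[C if q == (p + 1) % N else 0.0 for q in range(N)] for p in range(N)]
loads = vlb_link_loads(T)
print("max link load:", max(loads.values()))
print("2C/N:         ", 2 * C / N)
```

For this input every row and column sum equals $C$, so each of the $N(N-1)$ links carries $2C/N$, the capacity quoted above.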

2.1. Valiant Network Design Routing Matrix

The important thing to notice in the above is that VND needs a completely connected network. This may be implemented as a VPN on top of some other physical network, but even in this case, we can obtain link traffic measurements with ease using SNMP. Note that in a completely connected network there are $N(N-1)$ links and $N(N-1)$ elements in the TM, so the routing matrix is square. We may hope that in this case the routing matrix is invertible, and if this were the case, then we could solve the TM measurement problem by the simple expedient of taking

$$\mathbf{x} = A^{-1}\mathbf{y}. \tag{3}$$

So we need to consider the routing matrix that results from VND. Formally, $A = \{A_{ir}\}$ is the matrix defined by

$$A_{ir} = \begin{cases} F_{ir}, & \text{if traffic for } r \text{ traverses link } i, \\ 0, & \text{otherwise,} \end{cases} \tag{4}$$

where $F_{ir}$ is the fraction of traffic from source/destination pair $r = (p,q)$ that traverses link $i$. A network with $N$ nodes and $L$ links will have an $L \times N(N-1)$ routing matrix (as the $i \to i$ TM elements are inconsequential here). In VND, $F_{ir}$ can only take the values 0, $1/N$, or $2/N$. As the properties of $A$ are not determined by the constant denominator $N$, we will instead look at the matrix $R = NA$, which has the values 0, 1, and 2.

We give a simple example for a 3-node network below, in which both the origin-destination pairs $(p,q)$ and the links $(i,j)$ are listed in the order

$$(1,2),\ (1,3),\ (2,1),\ (2,3),\ (3,1),\ (3,2). \tag{5}$$

To derive the matrix we separate it into two components, in terms of the traffic between origin/destination pair $(p,q)$:

(i) $R_1$ shows the routing of traffic on its first hop after entering the network at node $p$, and
(ii) $R_2$ shows the routing of traffic on its second hop before it reaches its destination $q$.

It is simple to derive $R_1$, as it specifies that traffic from node $p$ will be split evenly over all links $p \to m$, so $R_1$ has a simple block diagonal structure:

$$R_1 = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}. \tag{6}$$

For instance, the second column of $R_1$ says that $1/N$ of the traffic from $1 \to 3$ goes along each of the links $1 \to 2$ and $1 \to 3$. $R_2$ is just the dual of $R_1$, that is, traffic arriving at a node follows the same pattern as traffic departing a node, so the matrix would have the same block diagonal structure if the links and origin/destination pairs were ordered by destination. Permuted to give the same ordering as above we get

$$R_2 = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{7}$$

Then

$$R = R_1 + R_2. \tag{8}$$

Note that the “2” entries of $R$ lie along the diagonal and that $R$ is symmetric.
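The construction of $R_1$, $R_2$, and $R$ generalizes directly to any $N$. The following sketch (the helper names `od_pairs` and `build_R` are ours, introduced only for illustration) builds the three matrices with the ordering of (5) and prints $R$ for the 3-node example.

```python
def od_pairs(N):
    """Ordered pairs (p, q), p != q, in the ordering of (5); also indexes links."""
    return [(p, q) for p in range(1, N + 1) for q in range(1, N + 1) if p != q]

def build_R(N):
    """Return (R1, R2, R); rows index links (i, j), columns index pairs (p, q)."""
    idx = od_pairs(N)
    # First hop: traffic from p is spread over every link leaving p.
    R1 = [[1 if i == p else 0 for (p, q) in idx] for (i, j) in idx]
    # Second hop: traffic to q is spread over every link entering q.
    R2 = [[1 if j == q else 0 for (p, q) in idx] for (i, j) in idx]
    R = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(R1, R2)]
    return R1, R2, R

R1, R2, R = build_R(3)
for row in R:
    print(row)

n = len(R)
print("diagonal all 2:", all(R[k][k] == 2 for k in range(n)))
print("symmetric:     ", all(R[a][b] == R[b][a] for a in range(n) for b in range(n)))
```

The direct link $(p,q)$ receives one unit from each component, because it is the first hop of $p \to q \to q$ and the second hop of $p \to p \to q$, which is why the 2s sit on the diagonal.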

The question of interest is “is the matrix $R$ invertible?” In this simple example the answer is a resounding no. In fact, all of the examples we tried (up to $N = 30$) resulted in singular matrices. For $N$ nodes, the routing matrices were of size $N(N-1) \times N(N-1)$, but as shown in Figure 2(a), their rank was approximately $2N$, far short of full rank. The trend in rank suggests that the matrix will never be invertible for any $N$. So although $A$ is square, it is not invertible. The underconstrained nature of the problem remains.
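The rank deficiency can be reproduced for small $N$ with exact arithmetic. The sketch below is an independent check, not the computation behind Figure 2(a); it uses rational Gaussian elimination so that the rank does not depend on a floating-point tolerance. The observation behind it is that the column of $R$ for pair $(p,q)$ is the indicator of links leaving $p$ plus the indicator of links entering $q$, so all $N(N-1)$ columns lie in the span of only $2N$ vectors.

```python
from fractions import Fraction

def build_R(N):
    """R = R1 + R2: entry is (link leaves p) + (link enters q) for pair (p, q)."""
    idx = [(p, q) for p in range(N) for q in range(N) if p != q]
    return [[(i == p) + (j == q) for (p, q) in idx] for (i, j) in idx]

def rank(M):
    """Exact rank by Gaussian elimination over the rationals."""
    A = [[Fraction(v) for v in row] for row in M]
    r = 0
    for c in range(len(A[0])):
        pivot = next((k for k in range(r, len(A)) if A[k][c] != 0), None)
        if pivot is None:
            continue                      # no pivot in this column
        A[r], A[pivot] = A[pivot], A[r]
        for k in range(r + 1, len(A)):
            if A[k][c] != 0:
                f = A[k][c] / A[r][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[r])]
        r += 1
        if r == len(A):
            break
    return r

for N in range(3, 8):
    print("N =", N, " size =", N * (N - 1), " rank =", rank(build_R(N)))
```

For these sizes the rank is $2N - 1$, so the gap between the rank and the $N(N-1)$ unknowns widens quickly as $N$ grows.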

However, there is a fix.

3. Routing Jitter

As noted above, the routing matrix for the VND is not invertible. However, with a very small change, we can make it so. The change we introduce is to vary the traffic spread by a small amount that we will call routing jitter. Rather than spreading the traffic perfectly evenly, we introduce a random vector $\mathbf{r}$ of length $N-2$ with sum zero, whose entries are spread uniformly over the range $[-\epsilon/2, \epsilon/2]$. We keep the same amount of traffic on the direct (one-hop) path between two nodes, but use $\mathbf{r}$ to modify the proportions of traffic on each of the $(N-2)$ two-hop paths. The effect is to create a new matrix $S = R + E$, where $E$ collects the jitter terms, from which we can derive our new routing matrix $A = S/N$. The key result is that, for $N > 4$, this new $A$ will be invertible with high probability, and the TM estimation problem now has a unique solution. Note that the possibility that $A$ is close to singular can be easily avoided by testing for this condition prior to its use, and applying a different jitter if the matrix is close to singular.
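The effect of routing jitter can be illustrated for $N = 5$. The sketch below makes several assumptions that are not specified in the text: an independent jitter vector is drawn for each origin/destination pair, the zero-sum property is obtained by subtracting the mean from draws in $[-\epsilon/4, \epsilon/4]$ (which keeps every entry within $[-\epsilon/2, \epsilon/2]$ but is not exactly uniform), and exact rational arithmetic replaces floating-point inversion so that the invertibility test is unambiguous.

```python
import random
from fractions import Fraction

def od_pairs(N):
    """Ordered pairs (p, q), p != q; the same list indexes the links."""
    return [(p, q) for p in range(N) for q in range(N) if p != q]

def zero_sum_jitter(n, eps, rng):
    """n rationals summing to zero, each within [-eps/2, eps/2] (assumed recipe)."""
    draws = [eps * Fraction(rng.randint(-10**6, 10**6), 4 * 10**6) for _ in range(n)]
    mean = sum(draws) / n
    return [d - mean for d in draws]

def build_S(N, eps, seed=0):
    """S = R + E: 2 on the direct link, 1 + jitter on both links of each two-hop path."""
    rng = random.Random(seed)
    idx = od_pairs(N)
    row = {link: k for k, link in enumerate(idx)}
    S = [[Fraction(0)] * len(idx) for _ in idx]
    for col, (p, q) in enumerate(idx):
        S[row[(p, q)]][col] = Fraction(2)          # p->p->q and p->q->q, unjittered
        mids = [m for m in range(N) if m not in (p, q)]
        for m, e in zip(mids, zero_sum_jitter(len(mids), eps, rng)):
            S[row[(p, m)]][col] += 1 + e           # first hop of p->m->q
            S[row[(m, q)]][col] += 1 + e           # second hop of p->m->q
    return S

def solve(M, b):
    """Solve M x = b exactly by Gauss-Jordan; return None if M is singular."""
    n = len(M)
    A = [list(r) + [Fraction(v)] for r, v in zip(M, b)]
    for c in range(n):
        pivot = next((k for k in range(c, n) if A[k][c] != 0), None)
        if pivot is None:
            return None
        A[c], A[pivot] = A[pivot], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for k in range(n):
            if k != c and A[k][c] != 0:
                f = A[k][c]
                A[k] = [a - f * w for a, w in zip(A[k], A[c])]
    return [A[k][n] for k in range(n)]

N, eps = 5, Fraction(1, 20)
S = build_S(N, eps)
rng = random.Random(1)
x_true = [Fraction(rng.randint(0, 100)) for _ in od_pairs(N)]   # made-up TM
# Link loads y = A x with A = S / N, as link counters would report them.
y = [sum(s * v for s, v in zip(r, x_true)) / N for r in S]
x_hat = solve(S, [N * v for v in y])
print("jittered S invertible:  ", x_hat is not None)
print("TM recovered exactly:   ", x_hat == x_true)
print("unjittered R invertible:", solve(build_S(N, Fraction(0)), [0] * len(S)) is not None)
```

In practice the loads would be noisy floating-point measurements rather than exact rationals, so the conditioning of $A$ matters as well as its invertibility.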

Note that the even load balancing in the simple VND is an artifact of the simple example we have considered, with all nodes having equal capacity. In more realistic settings, VND load balancing is already uneven, so small additional changes to this routing, such as we perform above, are not a big problem, but they do have a cost. The total traffic on link $(i,j)$ can be calculated by adding the traffic with destination $k$ following path $i \to j \to k$ for some $k \neq i$, and the traffic with destination $j$ following path $m \to i \to j$ for some $m \neq j$. The traffic on link $(i,j)$ is therefore given by

$$y_{i,j} = \frac{1}{N} \sum_{k \neq i} T(i,k)\,(1 + \epsilon_{i,j,k}) + \frac{1}{N} \sum_{m \neq j} T(m,j)\,(1 + \epsilon_{m,i,j}), \tag{9}$$

where $\epsilon_{i,j,k}$ is the extra traffic from $i$ to $k$ steered onto intermediate node $j$. Note that by construction we limit $|\epsilon_{i,j,k}| < \epsilon/2$, so that we can write

$$y_{i,j} \le \frac{1 + \epsilon/2}{N} \left[ \sum_{k \neq i} T(i,k) + \sum_{m \neq j} T(m,j) \right] \le \frac{(2 + \epsilon)\,C}{N}, \tag{10}$$

using (2). The standard VND (without consideration for link/node failures) requires capacity $2C/N$, so the additional cost of our rerouting is (in the worst case) $\epsilon C/N$ capacity on each link. So clearly, we should aim to choose $\epsilon$ to be reasonably small.
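The bound in (10) can be checked by simulation. The sketch below is a simple illustration under stated assumptions: the function name `jittered_link_loads`, the values $N = 6$, $C = 10$, $\epsilon = 0.05$, the test TM, and the way the zero-sum jitter is drawn (mean-subtracted uniform draws from $[-\epsilon/4, \epsilon/4]$) are all our own choices.

```python
import random

def jittered_link_loads(T, eps, seed=0):
    """Link loads as in (9): direct share unchanged, two-hop shares jittered."""
    rng = random.Random(seed)
    N = len(T)
    load = {(i, j): 0.0 for i in range(N) for j in range(N) if i != j}
    for p in range(N):
        for q in range(N):
            if p == q:
                continue
            base = T[p][q] / N
            load[(p, q)] += 2 * base               # paths p->p->q and p->q->q
            mids = [m for m in range(N) if m not in (p, q)]
            draws = [rng.uniform(-eps / 4, eps / 4) for _ in mids]
            mean = sum(draws) / len(draws)
            for m, d in zip(mids, draws):
                share = base * (1 + d - mean)      # |d - mean| <= eps / 2
                load[(p, m)] += share
                load[(m, q)] += share
    return load

N, C, eps = 6, 10.0, 0.05
# Assumed test input: node i sends its full access capacity C to node i + 1.
T = [[C if q == (p + 1) % N else 0.0 for q in range(N)] for p in range(N)]
loads = jittered_link_loads(T, eps)
print("max link load:        ", max(loads.values()))
print("bound (2 + eps) C / N:", (2 + eps) * C / N)
```

Because the jitter sums to zero for each origin/destination pair, the total traffic carried by the network is unchanged; only its distribution over the links moves, by at most $\epsilon C/N$ per link.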

The invertibility of $A$ for all but pathological cases of $\mathbf{r}$ should be obvious, but it is not the only issue. Numerical matrix inversion can be highly inaccurate if the condition number of the matrix (the ratio of the largest to the smallest singular value) is too high. Figure 2(b) shows simulated condition numbers for $A$ for several values of $N$ and a range of values of $\epsilon$. We can see that the condition number increases as $\epsilon$ decreases: the smaller $\epsilon$ is, the closer to ill-conditioned the matrix becomes. However, we found that for moderately sized problems (say $N = 30$), $\epsilon < 10^{-6}$ posed no problem (for Matlab’s standard matrix inversion function), resulting in errors in the inverse on the order of $10^{-7}$. As $N$ increases, condition numbers appear to increase, so larger problems may be more difficult, but the magnitude of this effect is inconsequential compared to the following.

Real traffic consists of packets, and load balancing mechanisms can only divide traffic at this granularity. Also, in order to avoid reordering of packets in a flow, one often performs load balancing on a source/destination basis. This introduces additional granularity into the traffic flows, preventing perfect load balancing. Errors in the load balancing shares are, in effect, errors in the routing matrix $A$. We need our value of $\epsilon$ to be larger than the typical values of these errors in order to be able to obtain meaningful traffic estimates, so we suggest a value of the order of 0.01–0.05, requiring an additional 1%–5% capacity, which will in addition easily result in reasonably conditioned routing matrices.

4. Discussion

The above shows that minor modification of VND’s load balancing mechanism results in an identifiable TM estimation problem in the sense that the problem now has a unique solution, and in the absence of measurement errors, we can obtain the actual TM. This is ironic, considering that VND was at least in part predicated on the inability to measure this matrix.

However, we cannot just throw away the VND, because without it, we would no longer be able to make these measurements. So in the case that we have the measurements, we do not need them, and where we do need measurements, we cannot get them. This paradox is more annoying than intriguing.

In addition, VND also allows resilience to unexpected network demands, either due to temporary surges or attacks, or due to long-term errors in traffic predictions. Surely there is some happy middle ground?

The obvious solution is to continue to use a Valiant-like network design, that is, one which uses load balancing over a clique. However, we can use the fact that we can measure the matrix to improve the design. Valiant design has a cost: a VND needs roughly twice the capacity of an optimal network. If we instead steered a fraction $X$ of the traffic along the direct path between two nodes, then we could trade off flexibility with respect to unexpected changes in traffic against a reduced cost of the network design. The choice of $X$ allows us to interpolate between the two extreme cases:

(i) $X = 1$: we get direct routing, and given the input TM we can determine the minimum capacity network required.
(ii) $X = 2/N$: we get VND, with its resilience to unexpected traffic.

In either case, the TM is measurable.

The total capacity requirements for such a network consist of $NC$ times the direct component plus $2NC$ times the VND component, noting that in the simple version of VND $X = 2/N$. So, the total capacity requirement is

$$P = NC + \frac{1 - X}{1 - 2/N}\, NC, \tag{11}$$

for $X \in [2/N, 1]$. Of course, in reducing the capacity of the network, we lose some ability to deal with random variations in traffic matrices. The factor of 2 in capacity is the cost for being oblivious, so if we use the above methodology, we will no longer be able to carry any traffic matrix, but we will be able to carry the most likely traffic.
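As a quick check of (11), the sketch below evaluates the total capacity at the two extremes and at one intermediate point. The function name `total_capacity` and the values $N = 10$, $C = 1$ are illustrative assumptions.

```python
def total_capacity(N, C, X):
    """Total capacity P from (11); X is the fraction of traffic sent directly."""
    if N <= 2:
        raise ValueError("N must exceed 2 so that 1 - 2/N is positive")
    if not 2 / N <= X <= 1:
        raise ValueError("X must lie in [2/N, 1]")
    vnd_fraction = (1 - X) / (1 - 2 / N)   # weight of the VND component
    return N * C + vnd_fraction * N * C

N, C = 10, 1.0
for X in (2 / N, 0.6, 1.0):
    print("X =", X, " P =", total_capacity(N, C, X))
```

At $X = 2/N$ the formula gives $P = 2NC$ (pure VND) and at $X = 1$ it gives $P = NC$ (direct routing), with a linear interpolation in between.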

5. Conclusion

The conclusion of this paper is that there is an inherent paradox in the nature of Valiant network design. The choice to create a clique (as the underlying network structure) creates the possibility of making the traffic matrix problem identifiable. Hence, for a Valiant network design, we have (with a minor modification) enough information to measure the traffic matrix, and from this we could build some other design. Of course, if we actually change the network design (to a nonclique-based design), then we lose our measurement capability, but there is a possible alternative in choosing a design between the two possible extremes.

It should be noted that VND is also robust to prediction errors. Hence, VND can alleviate problems that may have occurred as the result of poor planning, not just because traffic matrices are hard to measure. VND can also be used to create networks that are highly resilient to node and link failures, and this is another reason we may wish to use this design methodology.