Mathematical Problems in Engineering
Volume 2016 (2016), Article ID 1863929, 11 pages
http://dx.doi.org/10.1155/2016/1863929
Research Article

Adaptive Loss Inference Using Unicast End-to-End Measurements

1 School of Information and Computer, Anhui Agriculture University, Hefei 230061, China
2 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

Received 8 July 2016; Revised 9 November 2016; Accepted 28 November 2016

Academic Editor: Mohammad D. Aliyu

Copyright © 2016 Yan Qiao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We address the problem of inferring link loss rates from unicast end-to-end measurements on the basis of network tomography. Because measurement probes incur additional traffic overhead, most tomography-based approaches collect measurements only on selected paths. However, all previous approaches select paths offline, which inevitably misses many potential identifiable links whose loss rates could be unbiasedly determined. Furthermore, if element failures exist, an appreciable number of the selected paths may become unavailable. In this paper, we propose an adaptive loss inference approach in which the paths are selected sequentially depending on the previous measurement results. In each round, we compute the loss rates of links that can be unbiasedly determined based on the current measurement results and remove them from the system. Meanwhile, we locate the most probable failures based on the current measurement outcomes to avoid selecting unavailable paths in subsequent rounds. In this way, all identifiable and potential identifiable links can be determined unbiasedly using only 20% of all available end-to-end measurements. Extensive simulations comparing the proposed approach with a classical approach confirm its promising performance.

1. Introduction

The robustness of communication networks is extremely important for both users and network service providers. However, as the network increases in size and diversity, it becomes extremely difficult to monitor the characteristics of the network interior, such as link loss rates and packet latency. The first reason is that general organizations only have administrative access to a small fraction of the network’s internal nodes, whereas commercial factors often prevent organizations from sharing internal performance data. The second reason is that the servers and routers in the network are generally operated by businesses, and those businesses may be unwilling or unable to cooperate in collecting the network traffic measurements for network management. Thus, monitoring the network interior has to rely on end-to-end measurements.

Network performance tomography (or network tomography) is proposed to acquire the characteristics of the network interior by efficiently probing only end-to-end paths [13], rather than by directly monitoring every network element. It formulates the problem of inferring link metrics from end-to-end measurement results as a large linear system. Link metrics can be calculated by solving the linear equations in the system. Because the end-to-end measurements inevitably impose additional traffic on the networks, it is important to appropriately select end-to-end paths such that the desired inference capability can be achieved with as few end-to-end measurements as possible.

Given all available paths, the state-of-the-art solutions in network tomography select a subset of available paths, determined by finding an arbitrary basis of the linear system [1, 4]. However, these methods typically assume a simple network model, in which all network elements are reliable. Tati et al. in [3] argue that failures of network elements are common events in modern networks and that the typical durations of link failures in IP networks are longer than the lengths of time windows for measurement collection. After one or more links break down, measurements on the selected paths that cover these link failures will be unavailable.

In addition, all existing tomography techniques select measurement paths offline without considering the real-time state of the current system. These approaches can be quite inefficient: they must repeatedly run an unnecessarily large set of measurement probes capable of determining the loss rates of all links, many of which might in fact never lose packets. In practice, a small number of end-to-end measurements that cover these good links is sufficient to determine their loss rates. In general, most links in communication networks are such good links. Thus, a more adaptive tomography technique is necessary.

To improve the performance of existing network tomography applications, in this paper we propose, for the first time, an efficient loss inference method (named ALIA) that performs the path selection and link metrics inference in an adaptive manner. More specifically, it first selects and measures the most informative end-to-end path, depending on the previous measurement results. Then, it infers the metrics of the links that can currently be unbiasedly determined. Finally, it removes all determined links from the system and returns to the first step. The path selection and metrics inference are performed repeatedly until no informative measurements remain. In this manner, all identifiable links and potential identifiable links can be unbiasedly determined using only a small set of end-to-end measurements. Moreover, link failures can be detected in time to avoid more unavailable measurements. In this paper, we focus specifically on inferring link loss rates, although our approach is also applicable to other link metrics. In summary, this paper makes the following contributions to the field of network tomography applications.

(1) We propose an adaptive loss inference engine motivated by our two observations. The selection of measurement paths is adaptive to the current system state, depending on the earlier measurement results. Early measurements can provide the following important information. On the one hand, good measurement outcomes can unbiasedly determine all links on these paths (see Section 4.1). On the other hand, failed measurements can be used to locate the link failures. Furthermore, because the links that have been determined can be removed, the system scale will be reduced in each subsequent round.

(2) We develop an efficient path selection method and an accurate fault localization method. In each selection round, we select the longest path that does not lie in the current system space, and we demonstrate that measuring this path can obtain the maximum information; once a new measurement fails, we add all links on the path into a suspected fault set. We also define a weight for each suspected link: if a link’s weight exceeds a certain threshold, then we consider it to be a real link failure. Hence, all paths that traverse the real failures will not be selected in the subsequent selection rounds.

(3) We confirm the benefits of our proposed method in comparison with the previous solution in realistic network scenarios through simulations. The results show that, in most cases, our new approach uses only half of the other solution’s measurement cost and computational time, but it unbiasedly determines even more links than all available end-to-end measurements. Moreover, the accuracy of our fault localization is , with less than false positives. All of the results strongly demonstrate that the proposed approach significantly improves the performance of network tomography applications.

The remainder of this paper is organized as follows. We first survey the related works in Section 2. Then, we present the definitions and formally describe our problem in Section 3. In Section 4, we present two observations to explain the main problems in the existing tomography techniques. Then, we present our adaptive loss inference-based approach in Section 5. Finally, we evaluate our new approach on realistic topologies in Section 6 prior to concluding the paper in Section 7.

2. Related Work

Network tomography has been widely used in, but is not limited to, inferring individual link characteristics [5], network topology inference [6], and estimating the complete set of end-to-end measurements from an incomplete set [7]. In this paper, we focus specifically on inferring link loss rates. Existing works on link loss inference primarily focus on two problems. The first problem is how to select a minimum set of paths to reduce the traffic overhead while maximizing their performance. The second problem is how to unbiasedly determine the loss rates of most links using the measurement outcomes of the selected paths.

Chen et al. [1] first proposed selecting the independent paths by finding a basis of the linear system through QR decomposition. Ma et al. in [2] proposed STPC (spanning tree-based path construction) to construct linearly independent monitor-to-monitor paths that can uniquely determine most links under an environment in which all network routers support the source routing policy. Zheng and Cao [5] selected a minimum path set that can identify all identifiable links and cover all unidentifiable links. Tati et al. [3] considered the presence of link failures in current networks and proposed RoMe for tolerating link failures by selecting the path set with the maximum expected rank. Zhao et al. [4] proposed , which can infer the loss rates of all identifiable links and minimal identifiable link sequences with the least bias.

All of the above methods select paths offline and infer the link metrics after the path selection stages. The adaptive path selection approach has already been used in the field of fault diagnosis in our previous works [8, 9]. However, the methods proposed in [8, 9] can only select the measurement paths capable of distinguishing the two states of network elements: normal or fault. In this paper, we aim to select the end-to-end measurements, which can unbiasedly determine the loss rates of most links.

In our recent work [10], we proposed a path selection method named APSA. It divided the path selection into two steps: covering path selection to select the max-coverage paths and solving path selection to select the most important paths using the graph construction and decomposition method. However, APSA only focuses on overcoming the problem of path selection without considering the link loss inference problem. Furthermore, it cannot be applied to the networks that present link failures.

3. Definitions and Problem Formulation

We consider Internet loss inference systems that consist of routers and communication links. Some routers in the network are directly connected to end hosts, which can send and receive probing packets. For example, Figure 1 is a network system with 10 links and 9 routers, 5 of which are connected to end hosts. Because the link between an end system and a router is quite short and stable, we only count the performance of the links among routers.

Figure 1: An example of a network system.

Let $G = (V, E)$ denote the network with a set of nodes (routers) $V$ and links $E$. The numbers of nodes and links are denoted by $n_v$ and $n_e$, respectively. We define a path as a sequence of links starting from a source host and ending at a destination host. All paths in the network form the path set $P$, which contains $n_p$ paths. Table 1 lists the available end-to-end path set in Figure 1.

Table 1: Paths and their transmission rates in Figure 1.

For a given network $G$ and a path set $P$, we define the routing matrix $R$ with dimension $n_p \times n_e$, where $n_p = |P|$ and $n_e = |E|$, as follows: each row of $R$ represents a path in the network, and the columns represent links; $R_{ij} = 1$ when path $P_i$ traverses link $e_j$, and $R_{ij} = 0$ otherwise. For example, the routing matrix of paths in Table 1 is as follows:

Let $\hat{\phi}_i$ be a random variable giving the fraction of probe packets that arrive correctly at the destination monitor of path $P_i$ in the current measurement. Let $\hat{\phi}_{e_j}$ be the fraction of packets from all paths passing through link $e_j$ that have not been lost at that link. For any path $P_i$, we define its transmission rate as $\phi_i = E[\hat{\phi}_i]$. Similarly, the transmission rate of link $e_j$ can be defined as $\phi_{e_j} = E[\hat{\phi}_{e_j}]$.

Given the routing matrix $R$, the relationship between the transmission rates of paths in $P$ and the transmission rates of links in $E$ can be formulated as follows:

$$\phi_i = \prod_{j} \phi_{e_j}^{R_{ij}}. \tag{2}$$

Taking the logarithms on both sides of (2), we can rewrite it as

$$\log \phi_i = \sum_{j} R_{ij} \log \phi_{e_j}. \tag{3}$$

Let $Y_i = \log \phi_i$ and $X_j = \log \phi_{e_j}$, which are grouped in vectors $Y$ and $X$, respectively. Equation (3) is equivalent to

$$Y = RX. \tag{4}$$

The above formulations are similar to those in the other tomography-based works, including our previous study [10].
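As a small numerical illustration of the product form (2) and its log-linear counterpart (4), the sketch below builds a toy routing matrix and checks that the logarithm of each path rate equals the corresponding row of the linear system. The three-link topology, the link rates, and the function names are hypothetical and are not taken from Figure 1 or Table 1; the sketch assumes the usual model in which a path's transmission rate is the product of the rates of the links it traverses.

```python
import math

# Hypothetical toy network (not the paper's Figure 1): 3 links, 3 paths.
R = [
    [1, 1, 0],  # path 0 traverses links 0 and 1
    [0, 1, 1],  # path 1 traverses links 1 and 2
    [1, 0, 1],  # path 2 traverses links 0 and 2
]
link_rate = [0.99, 0.95, 1.0]  # assumed link transmission rates


def path_rates(routing, rates):
    """Product form: a path's rate is the product of its links' rates."""
    return [
        math.prod(rate ** used for rate, used in zip(rates, row))
        for row in routing
    ]


def log_system(routing, rates):
    """Log-linear form: Y = R X with X = log(link rate)."""
    x = [math.log(rate) for rate in rates]
    return [sum(used * xj for used, xj in zip(row, x)) for row in routing]


phi = path_rates(R, link_rate)  # path transmission rates
Y = log_system(R, link_rate)    # right-hand side of the linear system
```

Taking logarithms turns the multiplicative relation into an ordinary linear system, which is why rank and null-space arguments apply to loss inference.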

To identify the loss rates (loss rate = 1 − transmission rate) of individual links, it is necessary to solve the linear equations (4). Normally, the number of rows in the routing matrix is considerably larger than the number of columns. Unfortunately, in most cases, the routing matrix is still column-rank deficient. Consequently, we cannot obtain a unique solution of (4) without additional information about the system. However, some of the links in (4) can be unbiasedly determined, which we call identifiable links. For example, in (1), and are identifiable links because and , and the remainder of the links are all unidentifiable links.

Normally, the number of end-to-end paths is on the order of . For relatively large networks, probing on all paths will cost considerable probing time as well as large traffic overhead. Therefore, it is necessary to carefully select the probing paths that are the most informative for the inference. In this paper, our goal is to select the fewest probing paths to unbiasedly determine the most links.

4. Observations

In this section, we present two observations to highlight the critical problems that prevent the former loss inference approaches from achieving the desired performance.

4.1. Observation 1: Good Paths

If one path has a loss rate of zero (or near zero), we define this path as a good path. Good paths indicate that all links on them do not lose any packets. Therefore, links that are classified as unidentifiable can also be unbiasedly determined if they are lying on good paths. Former methods that select an arbitrary basis of paths without considering the good paths will inevitably leave out an appreciable number of such good links that should be unbiasedly determined. For example, the rank of (1) is , which means that the number of paths in the basis is . If we select , , , , , , and as the basic path set, for which the measurement results are listed in Table 1, only and can be uniquely determined because they are identifiable links. However, if we measure one additional path, such as , and observe that the loss rate of is , then there are 4 more links that can be unbiasedly determined: , , , and . In some cases removing these good links (, , , and ) from the system may generate additional identifiable links. In the rest part of this paper, we define all links such as , , , and and any other additional identifiable links (if possible) that should be unbiasedly determined as potential identifiable links.

In fact, there is a considerable number of paths that present (or near ) loss rates in most networks, which means that an appreciable number of potential identifiable links can be missed by the previous loss inference methods. We draw the cumulative distribution of loss rates on paths under different fractions of lossy links in Figure 2. In this figure, we use the realistic topology from the Rocketfuel Project [11]. The detailed settings are provided in Section 6. As shown in Figure 2, more than of paths are good paths when the network has links that lose packets. This proportion is up to when there are only of links that lose packets. In the remainder of this paper, we consider a path to be good if the path loss rate is under .
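Observation 1 can be sketched in a few lines: every link that lies on at least one good path is immediately determined as a good link, whether or not it is algebraically identifiable. The routing matrix, the measured path loss rates, and the value of `good_threshold` below are all hypothetical; in particular, the threshold is an illustrative value and not the one used in the paper.

```python
def links_on_good_paths(routing, path_loss, good_threshold):
    """Return the indices of links that lie on at least one good path."""
    good_links = set()
    for row, loss in zip(routing, path_loss):
        if loss < good_threshold:  # the path counts as a good path
            good_links.update(j for j, used in enumerate(row) if used)
    return good_links


# Hypothetical example: 5 links, 3 measured paths.
routing = [
    [1, 1, 0, 0, 0],  # path 0: links 0, 1
    [0, 1, 1, 0, 0],  # path 1: links 1, 2 (lossy because of link 2)
    [0, 0, 0, 1, 1],  # path 2: links 3, 4
]
path_loss = [0.0, 0.08, 0.001]
determined = links_on_good_paths(routing, path_loss, good_threshold=0.01)
```

In this toy case, four of the five links are determined from two good paths alone, and only link 2 remains for the algebraic step.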

Figure 2: The cumulative distribution of loss rates on paths.

4.2. Observation 2: Failure Links

Existing works assume a simple network model, in which all network elements are reliable. However, failures of network elements are common events in modern networks due to maintenance procedures, hardware malfunctions, energy outages, or disasters [12]. The typical durations of link failures in IP networks are longer than the lengths of time windows for measurement collection in network tomography [13]. Hence, the link failures may prevent the collection of some measurements. For example, suppose that , , , , , , and are selected for inference and that the rank of the system is . If is now in a failure state, only paths , , , and can be successfully probed. Hence, the provided rank is reduced to , and none of the links can be unbiasedly determined.

We plot the average ranks provided by two arbitrary bases and by all paths as we increase the number of link failures in Figure 3. The results indicate that link failures significantly degrade the quality of the selected paths.

Figure 3: Rank of a basis under failures (adopted from [3]).

5. Adaptive Loss Inference Algorithm

In this section, we design an adaptive loss inference algorithm (named ALIA), which has the following main advantages. On the one hand, it can unbiasedly determine all identifiable and potential identifiable links using even fewer end-to-end measurements than the system basis. On the other hand, it can locate the link failures during the inference process to avoid probing on the unavailable paths.

5.1. Overview

The structure of our approach is outlined in Figure 4. We first select the path that is the most informative for determining the system links, and then we probe the path to obtain its transmission rate. If the rate is greater than zero, we know that none of the links on the path is in a failure state, and these links can be used to filter the suspected link failures. Subsequently, the new measurement outcome is used to infer the link loss rates, and all links that can currently be unbiasedly determined are removed from the system to reduce the system scale. Otherwise, if the rate is zero (we call such a path a failed path), at least one link on the path is in a failure state. Therefore, we temporarily add all links on the failed path to the suspected failure set, and then we check whether each suspected failure meets a certain condition. If so, we mark it as a real failure. In the next round, we select the most informative path in the reduced system that does not traverse the real failures. The above steps are repeated until there is no informative path remaining in the system.

Figure 4: The structure of ALIA.

5.2. Path Selection

The informativeness of a path is considered from two aspects. The first is the number of links that the path includes. The second is whether the path can increase the rank of the current system.

For the first aspect, we select the path that includes the most links over all candidate paths. Such a path can provide more information than others because once we observe that its transmission rate equals 1, we can determine all links on it as good links. Even if the rate is below 1 but nonzero, we can still use all links on the path to filter the suspected failures. Moreover, although relatively more links will be considered as suspected failures if the measurement on the path fails (i.e., the rate is zero), this problem can be postponed or even ignored because the probability of such a failure is typically very small.

For the second aspect, the rank of the system is proportional to the number of identifiable links [1]. Therefore, we select a path that does not lie in the current system space. Suppose that $Q$ is an orthonormal basis for the current path space; a path does not lie in the space if and only if its row vector differs from its orthogonal projection onto the span of $Q$ [4]. Adding this path to the current system will increase the rank of the system by 1. Note that because the determined links are removed in each round, the space of the system also changes round by round. Therefore, we need to compute the orthonormal basis of the current path space every round prior to the path selection.
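The two selection criteria can be combined as in the following sketch, which keeps only candidates whose row vector leaves a nonzero residual after projection onto an orthonormal basis of the already selected paths, and then returns the one that covers the most links. The basis is built with modified Gram-Schmidt, which is an assumption made here for self-containment; the function names and the tolerance `TOL` are hypothetical as well.

```python
import math

TOL = 1e-9  # assumed numerical tolerance for "zero residual"


def residual(vec, basis):
    """Subtract from vec its projection onto each orthonormal basis vector."""
    r = [float(x) for x in vec]
    for q in basis:
        coeff = sum(a * b for a, b in zip(r, q))
        r = [a - coeff * b for a, b in zip(r, q)]
    return r


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def orthonormal_basis(rows):
    """Modified Gram-Schmidt over the rows of the selected-path matrix."""
    basis = []
    for row in rows:
        r = residual(row, basis)
        length = norm(r)
        if length > TOL:  # skip rows already in the span
            basis.append([x / length for x in r])
    return basis


def select_path(candidates, selected_rows):
    """Index of the longest candidate outside the current path space, or None."""
    basis = orthonormal_basis(selected_rows)
    best = None
    for idx, row in enumerate(candidates):
        if norm(residual(row, basis)) <= TOL:
            continue  # lies in the current space: adds no rank
        if best is None or sum(row) > sum(candidates[best]):
            best = idx
    return best


# Hypothetical example with 4 links and one path already selected.
selected = [[1, 1, 0, 0]]
candidates = [[1, 1, 0, 0], [0, 0, 1, 0], [0, 1, 1, 1]]
choice = select_path(candidates, selected)
```

Because determined links are removed after every round, a real implementation would rebuild the basis from the reduced matrix each time, as the text notes.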

In our previous work [10], we also proposed a path selection algorithm named APSA, which divides the process of path selection into two steps: selecting the covering paths which can cover all links and selecting solving paths which can determine the most links using the graph construction and decomposition method. However, link loss rates cannot be inferred during the path selection in APSA, and also, APSA cannot handle the link failures in the network. Thus, the graph construction and decomposition method has not been adopted in this paper.

5.3. Loss Inference

The loss rates of links can be unbiasedly determined in two ways. The first is from the good paths whose transmission rates are 1. The other is through solving the system equations in (4).

For the first way, once we observe that the newly selected path has a loss rate of 0, all links on this path also have loss rates of 0. In such a case, we directly remove these links from the system. For the second way, we determine all identifiable links in the current system through Theorem 1 proposed in our previous work [9], as follows.

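Assume that $R$ is an $m$ by $n$ matrix and that the rank of $R$ is $k$ (here, let $k < n$ because if $k = n$, all links are identifiable). Let $N(R)$ denote the null space of $R$, that is, for any vector $x \in N(R)$, $Rx = 0$. $\{v_1, \ldots, v_{n-k}\}$ represents an arbitrary basis for $N(R)$, where $v_i \in \mathbb{R}^n$.

Theorem 1. Link $e_j$, which is represented by the $j$th column of $R$, can be uniquely identified if and only if for all $i$, $1 \le i \le n-k$, $v_{ij} = 0$.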

After determining all identifiable links, it is necessary to compute their loss rates. This is performed by finding an arbitrary solution through , where and is the pseudoinverse matrix of . The loss rate of identifiable link is equivalent to . We also adjust the value of in the system by when is removed from the corresponding path.
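The identification and solving steps can be illustrated with a small exact computation. Note that the sketch below swaps in Gauss-Jordan elimination over rationals in place of the null-space basis and pseudoinverse described above: a link is reported as identifiable when its unit vector appears as a row of the reduced matrix, which is equivalent to every null-space vector having a zero entry for that link. The function name, the toy matrix, and the link rates are hypothetical, and the unknowns are assumed to be the logarithms of link transmission rates as in (4).

```python
import math
from fractions import Fraction


def identify_links(R, Y):
    """Return {link index: log transmission rate} for uniquely determined links."""
    n_links = len(R[0])
    # Augmented matrix [R | Y] in exact rational arithmetic.
    rows = [[Fraction(a) for a in row] + [Fraction(y)] for row, y in zip(R, Y)]
    pivot_row_of = {}
    r = 0
    for c in range(n_links):
        p = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column
        rows[r], rows[p] = rows[p], rows[r]
        piv = rows[r][c]
        rows[r] = [a / piv for a in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivot_row_of[c] = r
        r += 1
    found = {}
    for c, i in pivot_row_of.items():
        # Identifiable iff the pivot row involves no other link.
        if all(rows[i][k] == 0 for k in range(n_links) if k != c):
            found[c] = float(rows[i][n_links])
    return found


# Hypothetical example: 3 links, 2 measured paths, rank 2.
R = [[1, 1, 1], [0, 1, 1]]
true_rates = [0.90, 0.95, 0.98]
X = [math.log(t) for t in true_rates]
Y = [sum(u * x for u, x in zip(row, X)) for row in R]

found = identify_links(R, Y)
loss = {j: 1.0 - math.exp(x) for j, x in found.items()}
```

Here only link 0 is identifiable, because the two paths differ in exactly that link; links 1 and 2 always appear together and cannot be separated.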

Note that some paths in the system may become good when the identifiable links have been removed. In such a case, we consider all links on these paths as good links and remove them from the system, and then we repeat the above steps until there is no link in the system that can be unbiasedly determined.

5.4. Fault Localization

In ALIA, link failures are located during the loss inference process. We define $S$ as a suspected failure set, which consists of link combinations. Suppose that $P_i$ is the newly selected path. If its transmission rate $\phi_i > 0$, then we remove the links on $P_i$ from $S$ and set the states of all links on $P_i$ to normal. Otherwise, if $\phi_i = 0$, we pick the links on $P_i$ whose states have not been set to normal and add them to the suspected failure set $S$. For each link $e_j$ in $S$, we define a weight $w(e_j)$, as follows:

$$w(e_j) = \frac{|\{P_i \in P_f : e_j \in P_i\}|}{|P_f|}. \tag{5}$$

Here, $P_f$ is the set of currently failed paths. The physical meaning of (5) is the fraction of all failed paths that currently cover link $e_j$. We define two thresholds, $T_n$ and $T_w$, as follows: the weights of suspected links will not be computed until the number of failed paths reaches the threshold $T_n$. For each link $e_j$ in $S$, if $w(e_j)$ exceeds $T_w$, we mark $e_j$ as a real failure and add it to the real failure set $F$. Once the real failure set is not empty, paths that include the real failures will not be selected in further selection rounds. If the real failure set is still empty after we have finished the path selection, we simply put all links in the suspected set $S$ into the real failure set $F$.
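A minimal sketch of the weighting rule follows, assuming that the weight in (5) is the number of failed paths covering a suspected link divided by the total number of failed paths, as its stated physical meaning suggests. The link labels, the parameter names `min_failed` and `weight_threshold`, their values, and the strict comparison against the weight threshold are all assumptions made for illustration.

```python
def failure_weights(failed_paths, suspected):
    """Weight of each suspected link: share of failed paths that cover it."""
    total = len(failed_paths)
    return {
        link: sum(link in path for path in failed_paths) / total
        for link in suspected
    }


def real_failures(failed_paths, suspected, min_failed, weight_threshold):
    """Links promoted from suspected to real failures, or an empty set."""
    if len(failed_paths) < min_failed:
        return set()  # too few failed paths to compute weights yet
    weights = failure_weights(failed_paths, suspected)
    return {link for link, w in weights.items() if w > weight_threshold}


# Hypothetical example: three failed paths that all share link "b".
failed_paths = [{"a", "b"}, {"b", "c"}, {"b", "d"}]
suspected = {"a", "b", "c", "d"}
weights = failure_weights(failed_paths, suspected)
confirmed = real_failures(failed_paths, suspected,
                          min_failed=3, weight_threshold=0.9)
```

Waiting for a minimum number of failed paths keeps a single failed probe from promoting every link on it to a real failure.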

In our previous work [10], loss inference and fault localization were not taken into account.

5.5. The Algorithm Details

The details of the algorithm are shown in Algorithm 1. It takes as input the candidate path matrix, the measurement module, and the two thresholds, and it outputs the links that can be unbiasedly determined, their transmission rates, and the real failures. It first initializes the real failure set as empty and then runs the while loop until there is no informative path left in the candidate matrix. In the while loop, it first selects the most informative path from the candidate matrix and then measures the transmission rate of this path. If the measurement does not fail, it uses the links on the path to filter the failures in the suspected failure set and then checks whether the transmission rate exceeds the threshold for a good path. If so, it records the transmission rates of the links on the path directly; otherwise, it adds the path and its transmission rate to the current path matrix and measurement vector, respectively, and then finds the identifiable links in the current system and computes their transmission rates. Finally, it adds the determined links and their transmission rates to the output and removes them from the system. If the measurement on the path fails, it adds the links on the path to the suspected failure set and adds the path to the failed path set. Once the number of paths in the failed path set exceeds the corresponding threshold, it checks each link in the suspected failure set using (5) to find the real failures and then adds them to the real failure set.

Algorithm 1: Adaptive loss inference algorithm.

The most time-consuming step in Algorithm 1 is to find out the identifiable links in current system (line ()) based on Theorem 1. The time complexity is the order of , where and are the dimensions of the current matrix . Hence, the total time complexity of Algorithm 1 is the order of , where is the number of finally selected paths. The dimension of changes in each round but is at most equal to , where is the number of network links. And the number of paths that ALIA finally selected is much smaller than the rank of candidate path matrix in most cases, according to our experimental results in Section 6.3. Therefore, the time complexity of Algorithm 1 is at most the order of .

5.6. A Working Example

In Table 1, suppose that is now in a failure state. We set and . ALIA finally selects 6 paths and unbiasedly determines links during the 6 rounds, as follows:

(i) In the first round, is selected because it covers the most links. Then, we measure the transmission rate and obtain through probing on the path. Subsequently, we record the transmission rates of , , , and as and remove them from the system. The candidate path matrix of the current system is shown in Figure 5(a).

(ii) We select path in the second round and obtain . Therefore, , , and are added to the suspected failure set .

(iii) In the third round, is selected, and it also has . Currently, .

(iv) In the fourth round, is selected, and . Now, . Because the number of failed paths reaches , we begin to compute the weights for each suspected failure as follows: , , , , and . Because , we consider as a real failure and add it to .

(v) In the fifth round, is selected, and . Hence, the matrix of selected paths and its orthonormal basis of the path space is shown in Figure 6(a).

(vi) In the sixth round, we select path and check that . is skipped because covers the real failure . Then, we obtain , and the selected paths’ matrix of the current system is shown as in Figure 6(b). Through the basis of the null space and Theorem 1, we know that and are identifiable links. We compute an arbitrary solution of the system through and obtain and .

Figure 5: The candidate path matrix in current system.
Figure 6: The path matrix of selected paths in current system.

After removing and from the system, the candidate path matrix is as shown in Figure 5(b), where there are no informative paths remaining. ALIA now returns the loss rates of all determined links , , , , , and , as well as the links failure .

6. Evaluation

6.1. Evaluation Setup

Topologies. We conduct our experiments on the realistic ISP topologies from the Rocketfuel Project [11]. We select the topologies of two autonomous systems with labels and , which are representatives for small and large topologies, respectively. The numbers of nodes and links in the topologies are presented in Table 2.

Table 2: The details of topologies.

Candidate Paths. We randomly select and nodes as the monitors that can both initiate and receive probes, respectively. The candidate paths are generated between each monitor pair, and all topologies in Table 2 adopt the shortest path routing policy. Links that cannot be covered by any paths are removed from the system. The numbers of candidate paths and the links covered by these paths are also listed in Table 2.

Link Loss. We allow each link to be congested with a probability . Because affects the selection result in our experiments, we vary to evaluate the performances of the two algorithms. We use two different loss rate models, LLRD1 and LLRD2 of [14] (which are also used in [13, 15–17]), for assigning loss rates to links. In the LLRD1 model, congested links have loss rates uniformly distributed in , and good links have loss rates in . In the LLRD2 model, the loss rate ranges for congested and good links are and , respectively. Because there is little difference between the two models, we only present our results for the LLRD1 model. After assigning each link a loss rate, the actual losses on each link follow a Gilbert process. The link in the Gilbert model fluctuates between good and congested states. Links do not drop any packets when in the good state, and they drop all packets when in the congested state.
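The Gilbert loss process described above can be simulated as a two-state Markov chain whose long-run share of time in the congested state equals the assigned loss rate. The sketch below is one possible parameterization: the probability `stay_bad` of remaining congested is a hypothetical parameter whose default is chosen for illustration only, and the good-to-congested probability is derived from it so that the stationary loss rate matches the target.

```python
import random


def gilbert_losses(loss_rate, n_packets, stay_bad=0.35, rng=None):
    """Simulate per-packet drops (True = dropped) under a Gilbert process."""
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    rng = rng or random.Random()
    p_bad_to_good = 1.0 - stay_bad
    # Chosen so that the stationary probability of the bad state is loss_rate.
    p_good_to_bad = loss_rate * p_bad_to_good / (1.0 - loss_rate)
    if p_good_to_bad > 1.0:
        raise ValueError("loss_rate too high for this stay_bad value")
    bad = rng.random() < loss_rate  # start from the stationary distribution
    dropped = []
    for _ in range(n_packets):
        dropped.append(bad)  # congested state drops every packet
        if bad:
            bad = rng.random() < stay_bad
        else:
            bad = rng.random() < p_good_to_bad
    return dropped


# Hypothetical usage: a link assigned a 10% loss rate, probed 10000 times.
sample = gilbert_losses(0.10, 10000, rng=random.Random(0))
measured_transmission = 1.0 - sum(sample) / len(sample)
```

Because losses arrive in bursts, the measured rate over a short probing window can deviate from the assigned rate more than it would under independent drops.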

6.2. Baseline and Metrics

We compared our new algorithm (ALIA) with the state-of-the-art path selection approach for network tomography called SelectPath [1]. Although several recent methods have been proposed to address the tomographic problems [3, 13, 15, 16, 18, 19], most of them are not comparable with our algorithm. For example, [13, 15, 16] do not select paths before they perform tomography, while selecting paths is a vital step in our algorithm; [18, 19] select monitoring paths to detect or locate the failures, but our algorithm aims to determine the loss rates of links. We choose SelectPath as the baseline not only because it works on problems similar to those of our algorithm but also because it is one of the most representative path selection algorithms and has been widely accepted in the research community.

SelectPath selects an arbitrary maximal set of linearly independent paths using QR decomposition with column pivoting [20]. Because there are also works that measure all available paths without path selection [13, 15, 16], we also present the performance of the entire candidate paths given their measurement results in our figures (marked as “All”).

The performances of the approaches are evaluated using the following three metrics: (i) probing cost, the number of selected paths; (ii) path quality, the number of links that can be unbiasedly determined from the selected paths given their loss rates; (iii) computing time, the period between the time they input the routing matrix and the time they return the selected paths and all determined links.

We also evaluate the performance of our algorithm on fault localization using two metrics: accuracy and false positive. Let be the link set of real failures and be the failures inferred by ALIA. is the set of links that are not in . Due to space limitations, we only present the results on the varied fraction of failures and ’s, and we fix the threshold to . All the figures present results averaged over 20 runs.

6.3. Results

6.3.1. The Number of Paths Selected

In Figure 7, we plot the number of paths selected by ALIA and SelectPath as we vary the fraction of lossy links. The probability of failure is fixed to , and the threshold is (the same as in Figures 8 and 9). In both topologies, ALIA selects considerably fewer paths than the SelectPath algorithm, particularly when there are fewer lossy links in the network. Moreover, the advantage of ALIA becomes even more obvious in the relatively large networks (). As the fraction of lossy links increases, the number of paths selected by ALIA increases because the number of good paths decreases and our algorithm requires more paths to determine the system. Because the SelectPath algorithm always selects an arbitrary basis of the system, the curve of SelectPath gently fluctuates.

Figure 7: The number of paths selected by the two algorithms.
Figure 8: The number of links determined by the two algorithms.
Figure 9: The computing times of the two algorithms.
6.3.2. The Number of Links Determined

Next, we evaluate the link identifiability. Figure 8 shows the number of links that can be unbiasedly determined by the two algorithms and all available paths. As shown in this figure, ALIA can determine the most links among the approaches. In other words, ALIA uses the fewest measurements to unbiasedly determine the most links. In the figure, all three scenarios take the good measurement results into account (i.e., links on the selected paths whose loss rates are will be considered as determined links). Thus, all of the curves decrease as the fraction of lossy links increases. The gap between ALIA and SelectPath indicates the potential identifiable links missed by SelectPath. ALIA performs even better than all available paths because it repeatedly removes the determined links from the system in each round, and additional identifiable links may emerge. For example, when we remove one lossy link, the lossy paths that include this link may become good, and all other links on such paths can also be determined.
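The last point can be illustrated with a small sketch. It assumes that the success rate of a path is the product of the success rates of its links, and it uses two simplified rules in place of the full inference system: every unknown link on a good path is good, and a path with a single unknown link determines that link directly. The function name `determine_links`, the tolerance `tol`, and the example topology are hypothetical.

```python
def determine_links(paths, path_success, known=None, tol=1e-9):
    """Iteratively determine link success rates from path measurements.

    paths:        dict path id -> set of link ids on that path
    path_success: dict path id -> measured end-to-end success rate
    known:        optional dict link id -> already determined success rate
    Simplified stand-in for the round-by-round inference described above.
    """
    determined = dict(known or {})
    changed = True
    while changed:
        changed = False
        for pid, links in paths.items():
            residual = path_success[pid]
            unknown = []
            for link in links:
                if link in determined:
                    # Remove the determined link from the path equation.
                    residual /= determined[link]
                else:
                    unknown.append(link)
            if not unknown:
                continue
            if residual >= 1.0 - tol:
                # The remaining path is good, so all its links are good.
                for link in unknown:
                    determined[link] = 1.0
                changed = True
            elif len(unknown) == 1:
                # One equation with one unknown determines the link.
                determined[unknown[0]] = residual
                changed = True
    return determined


# Hypothetical example: link "a" is lossy, links "b" and "c" are good.
example_paths = {"p1": {"a"}, "p2": {"a", "b", "c"}}
example_success = {"p1": 0.9, "p2": 0.9}
link_rates = determine_links(example_paths, example_success)
```

Taken alone, the lossy path p2 cannot determine links b and c. Once link a is determined from p1 and removed from p2, the remainder of p2 is good, and b and c are determined as well.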

6.3.3. Computational Times

Figure 9 compares the computing times of the two algorithms. In both figures, ALIA runs considerably faster than SelectPath, because the inference system is reduced round by round as ALIA removes the determined links in every round. However, as the fraction of lossy links increases, the curve of ALIA goes up. The first reason is that the links that can be determined from the good paths will be reduced as the number of lossy links increases. The other reason is that ALIA needs to select more paths to determine the system with more lossy links. Nevertheless, for most communication networks, the fraction of lossy links is generally less than . In other words, ALIA can reduce the computing time of SelectPath by more than in most cases.

6.3.4. Accuracy on Fault Localization

Figure 10 shows the performance of ALIA on fault localization averaged over all topologies. We first vary the fraction of link failures, and we plot the results in Figure 10(a). Here, the threshold and the fraction of lossy links are fixed to and , respectively. In the figure, the accuracy curve slightly decreases while the false positive curve slowly increases when the fraction of failures increases. Nevertheless, ALIA can locate more than of failures even when of the links are in failure states. In Figure 10(b), we vary the threshold and fix the probability of failure to . From the figure, we can observe a valley in each of the curves. This result occurs because as increases, fewer links in will be considered as real failures. Consequently, the accuracy and false positive curves both decrease. When increases to a certain value, there is no link that satisfies the real failure condition. In such a case, ALIA will place all links in the final into the set , leading to relatively high accuracy and false positive. This is also the reason why we assign to in most of our experiments.

Figure 10: Accuracy on fault localization.

7. Conclusion

In this paper, we present an adaptive loss inference approach (named ALIA) for network tomography applications. We first present two observations to argue that existing tomography-based approaches are far from perfect. On the one hand, selecting end-to-end measurements offline will inevitably miss an appreciable number of potential identifiable links. On the other hand, link failures will cause many measurements to be unavailable. Our proposed approach performs the path selection and loss inference round by round based on the earlier measurement results until there are no informative measurements remaining. In this way, both the overall probing cost and the computing time can be significantly reduced. In addition, ALIA can determine all potential identifiable links and locate the link failures with high accuracy. Through extensive simulations on realistic ISP topologies, the results strongly confirm the promising performance of ALIA.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (no. 61402013, no. 31671589) and the Open Foundation of State Key Laboratory of Networking and Switching Technology (SKLNST-2016-1-02).

References

  1. Y. Chen, D. Bindel, H. Song, and R. H. Katz, “An algebraic approach to practical and scalable overlay network monitoring,” in Proceedings of the ACM SIGCOMM 2004: Conference on Computer Communications, pp. 55–66, Portland, Ore, USA, September 2004.
  2. L. Ma, T. He, K. K. Leung, D. Towsley, and A. Swami, “Efficient identification of additive link metrics via network tomography,” in Proceedings of the IEEE 33rd International Conference on Distributed Computing Systems (ICDCS '13), pp. 581–590, July 2013.
  3. S. Tati, S. Silvestri, T. He, and T. La Porta, “Robust network tomography in the presence of failures,” in Proceedings of the IEEE 34th International Conference on Distributed Computing Systems (ICDCS '14), pp. 481–492, IEEE, Madrid, Spain, July 2014.
  4. Y. Zhao, Y. Chen, and D. Bindel, “Towards unbiased end-to-end network diagnosis,” IEEE/ACM Transactions on Networking, vol. 17, no. 6, pp. 1724–1737, 2009.
  5. Q. Zheng and G. Cao, “Minimizing probing cost and achieving identifiability in probe-based network link monitoring,” IEEE Transactions on Computers, vol. 62, no. 3, pp. 510–523, 2013.
  6. M. Coates, R. Castro, R. Nowak, M. Gadhiok, R. King, and Y. Tsang, “Maximum likelihood network topology identification from edge-based unicast measurements,” in Proceedings of the ACM (SIGMETRICS '02) International Conference on Measurement and Modeling of Computer Systems, pp. 11–20, New York, NY, USA, June 2002.
  7. Y. Zhang, M. Roughan, W. Willinger, and L. Qiu, “Spatio-temporal compressive sensing and internet traffic matrices,” in Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM '09), pp. 267–278, ACM, Barcelona, Spain, August 2009.
  8. L. Cheng, X. Qiu, L. Meng, Y. Qiao, and R. Boutaba, “Efficient active probing for fault diagnosis in large scale and noisy networks,” in Proceedings of the IEEE (INFOCOM '10), March 2010.
  9. Y. Qiao, X. Qiu, L. Meng, and R. Gu, “Efficient loss inference algorithm using unicast end-to-end measurements,” Journal of Network and Systems Management, vol. 21, no. 2, pp. 169–193, 2013.
  10. Y. Qiao, J. Jiao, Y. Rao, and H. Ma, “Adaptive path selection for link loss inference in network tomography applications,” PLoS ONE, vol. 11, no. 10, Article ID e0163706, 2016.
  11. “Rocketfuel project: internet topologies,” http://www.cs.washington.edu/research/networking/rocketfuel/.
  12. A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C.-N. Chuah, and C. Diot, “Characterization of failures in an IP backbone,” in Proceedings of the Conference on Computer Communications - 23rd Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE INFOCOM '04), pp. 2307–2317, Hong Kong, China, March 2004.
  13. H. X. Nguyen and P. Thiran, “Network loss inference with second order statistics of end-to-end flows,” in Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC '07), pp. 227–240, San Diego, Calif, USA, October 2007.
  14. V. N. Padmanabhan, L. Qiu, and H. J. Wang, “Server-based inference of internet performance,” in Proceedings of the IEEE International Conference on Computer Communications (INFOCOM '03), vol. 1, pp. 145–155, Orlando, Fla, USA, April 2003.
  15. H. X. Nguyen and P. Thiran, “The boolean solution to the congested IP link location problem: theory and practice,” in Proceedings of the 26th IEEE International Conference on Computer Communications (INFOCOM '07), pp. 2117–2125, IEEE, Anchorage, Alaska, USA, May 2007.
  16. D. Ghita, H. Nguyen, M. Kurant, K. Argyraki, and P. Thiran, “Netscope: practical network loss tomography,” in Proceedings of the IEEE International Conference on Computer Communications (INFOCOM '10), pp. 1–9, IEEE, San Diego, Calif, USA, March 2010.
  17. M. Malboubi, C. Vu, C.-N. Chuah, and P. Sharma, “Compressive sensing network inference with multiple-description fusion estimation,” in Proceedings of the IEEE Global Communications Conference (GLOBECOM '13), pp. 1557–1563, Atlanta, Ga, USA, December 2013.
  18. D. Jeswani, M. Natu, and R. K. Ghosh, “Adaptive monitoring: application of probing to adapt passive monitoring,” Journal of Network and Systems Management, vol. 23, no. 4, pp. 950–977, 2015.
  19. E. Cohen, A. Hassidim, H. Kaplan, Y. Mansour, D. Raz, and Y. Tzur, “Probe scheduling for efficient detection of silent failures,” Performance Evaluation, vol. 79, no. 3, pp. 73–89, 2014.
  20. G. H. Golub and C. F. Van Loan, “Matrix computations,” Mathematical Gazette, vol. 47, no. 5, pp. 392–396, 1983.