Research Article | Open Access
Sebastian Wandelt, Xing Shi, Xiaoqian Sun, "Approximation of Interactive Betweenness Centrality in Large Complex Networks", Complexity, vol. 2020, Article ID 4046027, 16 pages, 2020. https://doi.org/10.1155/2020/4046027
Approximation of Interactive Betweenness Centrality in Large Complex Networks
The analysis of real-world systems through the lens of complex networks often requires a node importance function. While many such views on importance exist, a frequently used global node importance measure is betweenness centrality, quantifying the number of times a node occurs on all shortest paths in a network. This centrality of nodes often significantly depends on the presence of other nodes in the network; once a node is missing, e.g., due to a failure, other nodes' centrality values can change dramatically. This observation is, for instance, important when dismantling a network: instead of removing the nodes in decreasing order of their static betweenness, recomputing the betweenness after each removal creates tremendously stronger attacks, as has been shown in recent research. This process is referred to as interactive betweenness centrality. Nevertheless, very few studies compute the interactive betweenness centrality, given its high computational costs: a worst-case runtime complexity of O(N^4) in the number of nodes in the network. In this study, we address the research questions of whether approximations of interactive betweenness centrality can be obtained at reduced computational cost and how much quality/accuracy must be traded for a significant reduction. At the heart of our interactive betweenness approximation framework, we use a set of established betweenness approximation techniques, which come with a wide range of parameter settings. Given that we are interested in the top-ranked node(s) for interactive dismantling, we tune these methods accordingly. Moreover, we explore the idea of batch removal, where groups of top-k ranked nodes are removed before recomputation of betweenness centrality values.
Our experiments on real-world and random networks show that specific variants of the approximate interactive betweenness framework allow for a speedup of two orders of magnitude, compared to the exact computation, while obtaining near-optimal results. This work contributes to the analysis of complex network phenomena, with a particular focus on obtaining scalable techniques.
Complex network theory provides powerful tools to understand the structures and dynamics of many complex systems. Essentially, these systems are modelled as nodes representing entities and links representing dependencies between entities. Much research effort has been spent on understanding different types of critical infrastructure systems, e.g., energy [1, 2], communication [3, 4], air transportation [5–8], railway, and social networks. The phenomena and processes analyzed on these networks vary by study, including resilience analysis [11, 12], delay/information spreading [13–15], growth pattern analysis, and many others. Nevertheless, at the heart of many analysis tasks is the problem of identifying node importance, i.e., a quantification of the relative value of a node in a network. Indeed, it is crucial to identify the extremely important nodes which maintain the structure and function of the network.
These node importance values vary for two reasons. First, importance can be measured from different perspectives, preferring local vs. global or topological vs. flow-like views. Depending on the chosen view, many different node centrality measures have been proposed, including degree centrality, closeness centrality, eigenvector centrality, Katz centrality, and betweenness centrality. Second, the importance of a node often depends significantly on the presence of other nodes in the network. For a pair of nodes with redundant function, e.g., regarding propagation, one node can become significantly more important in the absence of the other node. This effect is visualized in Figure 1. Initially, node 9 is not important in the network. However, once node 14 fails, the majority of flow in the network is routed via node 9, since all flows have to go through the remaining path on the right-hand side. Accordingly, a very small change in the network, here the failure of a single node, can change the node importance significantly.
Existing methods usually do not take this dependency of node importance values into account, mainly because of limited computational resources. For instance, computing exact betweenness centrality values of each node in a network has a worst-case time complexity cubic in the number of nodes, since essentially, all shortest paths between all node pairs have to be computed. Computing the interactive betweenness centrality requires recomputing the betweenness centrality after each node removal, increasing the worst-case time complexity to being quartic in the number of nodes in the network, i.e., O(N^4). Such a high computational complexity inhibits computations on even medium-sized networks, given that increasing the size of a network by a factor of 10 will increase the required computational resources by a factor of 10,000. While static betweenness centrality computations can be sped up significantly by parallelization, interactive betweenness centrality cannot be further accelerated in this way, given the dependency of choices between attack steps: the subnetwork at step i + 1 is only determined once the to-be-removed node at step i is fixed.
In this study, we aim to explore possibilities for computing an approximation of the interactive betweenness centrality for larger networks. To achieve this goal, we devise an estimation framework. We exploit betweenness approximation techniques for selecting outstandingly important nodes in a network. There are several widely used static betweenness approximation methods, which come with a whole range of parameters. Moreover, in order to avoid recomputing the approximate betweenness at each iteration, we select a number of outstanding nodes (not only one) on the fly. Experiments on random and real-world networks show that this strategy computes rankings very similar to those obtained by exact interactive betweenness computation. Moreover, experiments on network dismantling show that the results obtained by approximation of interactive betweenness are close to those of interactive betweenness, but at much lower runtime requirements. Our work contributes to the analysis of complex network phenomena, with a particular focus on obtaining scalable techniques.
2.1. The Overall Framework
We devised an interactive betweenness approximation framework consisting of a set of static betweenness approximation algorithms and different selections of k-batch removal (remove k nodes with high betweenness before recomputation). The exact interactive computation recomputes betweenness values after removal of the Top-1 node. However, the time complexity of static betweenness computation is O(N·E), where N is the number of nodes and E is the number of edges in the network, which is prohibitive for large networks and makes the interactive computation even more expensive. To reduce the computational costs, we exploit approximation methods, since such methods can trade off speed against the identification of high-betweenness nodes; note that identifying the highest-betweenness node is the core step of the interactive computation. Moreover, we also considered the selection of k (i.e., the choice of how many nodes to remove in each iteration): instead of removing a single node with the highest betweenness, batch removal reduces the number of iterations of the interactive computation. Based on the above ideas, our framework has two core parts:
(1) Static betweenness estimation: compute estimated betweenness values of all nodes in the current GCC (giant connected component) of the network.
(2) Selection of batch removal: obtain the Top-k ranked nodes, remove them from the network, and then go back to (1).
In part (1), the approximation algorithms estimate betweenness values of each node in the current GCC. The accuracy of the approximation affects the quality of the interactive computation: if the approximation method cannot identify the Top-k nodes correctly, errors propagate into subsequent iterations and often become worse with an increasing number of removals. Therefore, we need to select approximation methods with nice trade-offs between quality and runtime. In part (2), the selection of the parameter k is also worth considering. On the one hand, if we choose a small k, we obtain better quality. The most extreme choice is k = 1, that is, recomputing betweenness after each node's removal, which is extremely time consuming but yields exact results. On the other hand, if we set k very large, we reduce the runtime at the price of deteriorated quality. In the best case, the k value is chosen adaptively in each iteration, as there may be only one or many high-betweenness nodes in the current GCC. To sum up, our interactive betweenness approximation framework focuses on the selection of approximation methods together with the number and size of batch removals.
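The two-part loop above can be sketched in pure Python as follows. This is a minimal sketch, not the authors' implementation: the graph is an adjacency dict, and the betweenness estimator (part 1) and the batch-size rule (part 2) are passed in as functions so any of the methods discussed later can be plugged in.

```python
from collections import deque

def gcc(adj):
    """Return the node set of the giant connected component via BFS."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, q = {start}, deque([start])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    q.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

def interactive_dismantle(adj, estimate_betweenness, choose_k, stop_frac=0.1):
    """Alternate (1) betweenness estimation on the current GCC and
    (2) batch removal of the top-k nodes, until the GCC shrinks below
    stop_frac of the original size.  Mutates adj; returns the removal order."""
    n0, order = len(adj), []
    while True:
        comp = gcc(adj)
        if len(comp) <= stop_frac * n0:
            return order
        # restrict the graph to the current GCC
        sub = {u: [v for v in adj[u] if v in comp] for u in comp}
        bc = estimate_betweenness(sub)          # part (1)
        k = choose_k(bc)                        # part (2): batch size
        top = sorted(bc, key=bc.get, reverse=True)[:k]
        for u in top:
            order.append(u)
            for v in adj.pop(u):                # delete u and its edges
                adj[v].remove(u)
```

Any static estimator can serve as `estimate_betweenness`; with k fixed to 1 and an exact estimator, the loop degenerates to the exact interactive computation.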
2.2. Static Betweenness Approximation
The existing algorithms for betweenness approximation compute an estimation of the static betweenness of all nodes in the network. Since all the approximation methods are based on Brandes' algorithm, we revisit this algorithm first. For a node pair (s, t), Brandes defines the pair-dependency on a node v, denoted by δ_st(v), and the dependency of node s on node v, denoted by δ_s(v), as

δ_st(v) = σ_st(v)/σ_st and δ_s(v) = Σ_{t ≠ s,v} δ_st(v),

where σ_st is the number of shortest paths between s and t and σ_st(v) is the number of those paths passing through v. In addition, Brandes proved that δ_s(v) obeys the recursion

δ_s(v) = Σ_{w : v ∈ P_s(w)} (σ_sv/σ_sw) · (1 + δ_s(w)),

where P_s(w) represents all parents of w on the breadth-first search (BFS) from s. Based on these, the betweenness value of v can be computed as B(v) = Σ_{s ≠ v} δ_s(v). That is, given a network with N nodes and E edges, a single BFS from one source node s computes the dependency of s on each node in O(E) time. To obtain the betweenness of all nodes, each node of the network has to be set as a source node, requiring N iterations of BFS. In total, the computation of exact static betweenness of all nodes needs O(N·E) time, which is quite expensive for large networks. Besides, for dense networks with E = O(N^2) as the worst case, the time complexity is O(N^3).
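Brandes' two-phase scheme, a forward BFS counting shortest paths followed by a backward accumulation of the dependencies along BFS parents, can be sketched as follows. This is a minimal pure-Python sketch assuming an unweighted graph given as an adjacency dict; the returned values count ordered pairs, so for the usual undirected convention they should be halved.

```python
from collections import deque

def brandes_betweenness(adj):
    """Exact betweenness via Brandes' algorithm: one BFS per source s,
    then back-propagation of the dependencies delta_s(v) along BFS parents."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # forward phase: BFS counting shortest paths sigma
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        parents = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:
            u = q.popleft()
            order.append(u)
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    parents[v].append(u)
        # backward phase: accumulate dependencies in reverse BFS order
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in parents[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc  # ordered-pair counts; halve for undirected graphs
```

On the path 0-1-2, for instance, the middle node accumulates a dependency of 1 from each endpoint, so `bc[1]` is 2.0 before halving.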
To reduce the computational cost, approximation methods compute a subset of node dependencies or pair-dependencies instead of the set of all dependencies required by the exact computation. Different strategies for selecting the subset constitute several approximation methods. In general, there are three classifications:
(1) Pivots sampling: such methods conduct BFS from a subset of source nodes, called pivots, and compute the dependencies of the selected pivots on each node.
(2) Node pairs sampling: instead of considering node dependencies, such methods sample pairs of nodes and compute the pair-dependencies of the selected node pairs on each node.
(3) Bounded BFS: such methods change the stop condition of the BFS and only consider a subset of shortest paths.
Besides these three classifications, there are also some recent methods for betweenness approximation, including the sparse-modeling-based method, the MPI-based adaptive sampling method, and the GNN-based method. More details on the different static approximation methods and parameter settings are given in Appendix A.
2.3. Choosing the Size of Batch Removal
In this section, we describe in more detail how to determine k based on the current GCC in each iteration. If k is small, few nodes are removed from the current GCC, which requires more iterations and higher computational costs. On the contrary, if k is large, the computational costs are reduced but the quality decreases, since many of the removed nodes have already lost their importance. Therefore, we need to make a trade-off between quality and speed. Note that it is more reasonable to choose an adaptive k value based on the number of particularly central nodes in each iteration. Firstly, we need to roughly estimate the range of k for different networks. We selected k = 1 and conducted experiments of the interactive exact betweenness computation on 48 real-world networks of diverse sizes. We visualize the distribution of the number of nodes that need to be removed to obtain a 50% GCC reduction in Figure 2. As shown in Figure 2(a), some larger networks can be cut to 50% by removing only a few nodes (e.g., removing no more than 10 nodes yields a 50% reduction on a network with 10,000 nodes). Besides, the distribution in Figure 2(b) indicates that removing no more than 50 nodes causes a 50% GCC reduction on many networks. For fixed-size batch removal, we set k ∈ {1, 2, 4, 8, 16}.
Figure 3 shows the distribution of betweenness values under different attack strategies. We can see that in I4 (interactive 4th attack) and IR (interactive remainder), there are 2 nodes with high betweenness (e.g., nodes 2 and 3 with betweenness value 0.5 in IR), and these two nodes are of equal importance. We can remove both of them in one iteration to break up the GCC. In I3 (interactive 3rd attack), there is only one node (i.e., node 5) with a high betweenness value (0.5), that is, there is only one particularly central node. For such an outstandingly important node, it is reasonable to set k = 1 and remove only this single node from the GCC. Inspired by the example network, we, in addition, consider setting k to be the number of nodes with betweenness ≥ 0.5 and making it adaptive across iterations. Besides, we also consider setting k to be the number of nodes with betweenness ≥ (average + standard deviation of the betweenness values).
Besides, we can remove a certain percentage of nodes in each iteration; for the remaining experiments, we selected 1%, 5%, 10%, and 20%. To sum up, we determine the k value in each iteration based on the distribution of betweenness values of the nodes in the current GCC. Table 1 shows an overview of the k settings for batch removal.
b: betweenness value of one node. μ: the average betweenness value in the current GCC. σ: the standard deviation of betweenness values in the current GCC. d_adaptive: k = 2^⌊log2 d⌋, where d represents the number of nodes with betweenness ≥ 0.5 (e.g., if there are 9 nodes with betweenness ≥ 0.5, then k = 8).
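The adaptive settings can be sketched as follows. The thresholds are our reading of Table 1 and thus assumptions of this sketch: 'half_max' counts the nodes whose betweenness reaches half the maximum and rounds that count down to a power of two, and 'avg_std' counts the nodes above mean plus one standard deviation.

```python
import statistics

def choose_k_adaptive(bc_values, mode="half_max"):
    """Pick the batch size k from the betweenness distribution of the
    current GCC (a sketch of the adaptive settings; threshold names and
    exact thresholds are our assumptions, not the paper's code)."""
    vals = list(bc_values)
    if mode == "half_max":
        # count particularly central nodes, then round down to a power of two
        d = sum(1 for b in vals if b >= 0.5 * max(vals))
        k = 1
        while 2 * k <= d:
            k *= 2          # e.g., d = 9 central nodes -> k = 8
        return k
    # 'avg_std': nodes above mean + one (population) standard deviation
    mu = statistics.mean(vals)
    sd = statistics.pstdev(vals)
    return max(1, sum(1 for b in vals if b >= mu + sd))
```

Either rule can be handed to the framework loop as its `choose_k` argument.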
2.4. Measures for Comparison
Given an approximation algorithm and a certain k setting, the output of our framework is a ranking of nodes from higher interactive betweenness to lower interactive betweenness. To analyze the approximated ranking, we considered four aspects:
(1) Identification of important nodes: in many cases, one is most concerned about the top nodes with high betweenness. We used three measures: Top-1%-Hits, Top-5%-Hits, and Top-10%-Hits.
(2) Ranking sortedness: compared with the exact ranking, the sortedness of the approximated ranking can be described by the inversion number.
(3) Weighted coefficient: considering the importance of top-ranked nodes, we used Weightedtau to weight exchanges between top-ranked nodes.
(4) Destructiveness to the network: during the interactive computation, the size of the GCC keeps decreasing as nodes are removed. A good method identifies nodes with high betweenness, which have a great impact on network connectivity, resulting in a quick dismantling process and a fast GCC reduction. We considered the number of nodes that need to be removed to cut the GCC to 10%.
In total, we devise six measures to evaluate the accuracy compared to the standard ranking (i.e., the ranking of nodes from the exact computation with k = 1) as follows:
(1) Top-1%-Hits: the fraction of nodes correctly identified by the approximate method among the Top-1% nodes.
(2) Top-5%-Hits: the fraction of nodes correctly identified among the Top-5% nodes.
(3) Top-10%-Hits: the fraction of nodes correctly identified among the Top-10% nodes.
(4) Inversion: normalized inversion number of the estimated ranking with the exact ranking as the standard. After computing the inversion number I, we normalized it and mapped it to [0, 1]: inversion = 1 − I/(N(N − 1)/2), where N is the number of nodes and I is the exact inversion number.
(5) Weightedtau: a node with rank a is mapped to weight 1/(a + 1), and an exchange between two nodes with ranks a and b has weight 1/(a + 1) + 1/(b + 1). That is, top-ranked nodes have higher weights, which increases the impact of exchanges between important nodes.
(6) 10% GCC reduction (R10%): how many nodes the method requires to remove to dismantle the network until GCC ≤ 10% · N. The normalized value is mapped to [0, 1], where 1 means the method needs the minimum number of nodes to reach the 10% GCC reduction.
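Two of the simpler measures can be sketched directly. This is a sketch under our reading of the definitions above; in particular, the normalization direction of the inversion measure (1 = identical order) is our assumption, and a weighted rank correlation is available off the shelf as scipy.stats.weightedtau.

```python
def top_hits(approx_rank, exact_rank, frac):
    """Fraction of the exact top-frac nodes also present in the
    approximate top-frac (Top-x%-Hits)."""
    m = max(1, int(frac * len(exact_rank)))
    return len(set(approx_rank[:m]) & set(exact_rank[:m])) / m

def normalized_inversion(approx_rank, exact_rank):
    """Inversion measure: count node pairs ordered differently from the
    exact ranking and map to [0, 1], with 1 meaning identical order
    (normalization 1 - I / (N(N-1)/2) is our assumption)."""
    pos = {v: i for i, v in enumerate(exact_rank)}
    seq = [pos[v] for v in approx_rank]
    inv = sum(1 for i in range(len(seq)) for j in range(i + 1, len(seq))
              if seq[i] > seq[j])
    n = len(seq)
    return 1.0 - inv / (n * (n - 1) / 2)
```

The quadratic inversion count suffices for illustration; a merge-sort-based count would bring it down to O(N log N) for large rankings.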
We conducted all experiments on the same computer with four i7-6500U cores (2.50 GHz) and 16 GB RAM. We ran each approximate method independently and recorded the exact runtime.
Considering the six measures of accuracy, we normalized the runtime and plotted it against the normalized measures to see which method offers a nice trade-off. In order to analyze the results across different networks, we computed the average normalized runtime and measure values. To sum up, we use six measures to evaluate accuracy, and we also analyzed runtime and trade-offs. Besides, we use the naming scheme algorithm_parameter_k (e.g., RAND2_64_2 represents the RAND2 algorithm with number of pivots = 64 and k = 2).
3.1. Networks in this Study
First, we generated 9 ER (Erdős–Rényi) graphs, 9 BA (Barabási–Albert) graphs, and 27 WS (Watts–Strogatz small-world) graphs with different sizes and parameters. Table 2 provides an overview of our random networks and generator parameters. Figure 4 visualizes four selected random networks. On these random graphs, we performed a sensitivity analysis of Top-1-node identification for three selected methods in order to select reasonable parameters. In addition, we selected 48 real-world networks of different sizes and structures, covering a variety of domains, as obtained from http://networkrepository.com/networks.php:
(i) Social (4 networks): networks showing the social friendships between people. Nodes are persons and edges represent their connections.
(ii) Biological (5 networks): networks showing the interactions between elements in biological systems.
(iii) Brain (7 networks): networks representing functional connectivity in brains. We chose different brain networks of mouse, macaque, and fly.
(iv) Ecology (2 networks): networks showing the interactions between species.
(v) Economic (2 networks): networks representing interactions between interconnected economic agents.
(vi) Infrastructure (3 networks): networks consisting of interlinks between fundamental facilities.
(vii) Power (5 networks): networks showing the transmission of electric power.
(viii) Road (2 networks): networks representing road connectivity between intersections.
(ix) Technological (2 networks): networks consisting of the interlinks between technology systems.
(x) Web (5 networks): networks representing the hyperlinks between pages of the World Wide Web.
(xi) Email (2 networks): networks showing mail contacts between two addresses.
(xii) Retweet (7 networks): networks describing retweeting relationships on Twitter.
(xiii) Cheminformatics (2 networks): networks reflecting the chemical interactions of materials.
Table 3 shows an overview of our 48 real-world datasets, including network properties.
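For reproducibility, random test networks of this kind can be produced with standard generators; networkx's gnp_random_graph, barabasi_albert_graph, and watts_strogatz_graph would be the usual choice. As a minimal dependency-free stand-in, an ER generator in the adjacency-dict format used in the sketches above looks like:

```python
import random

def erdos_renyi(n, p, seed=None):
    """Generate one ER graph G(n, p) as an adjacency dict: each of the
    n(n-1)/2 possible edges is included independently with probability p."""
    rng = random.Random(seed)
    adj = {u: [] for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj
```

Seeding the generator makes the sampled test networks repeatable across runs.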
3.2. Sensitivity Analysis/Parameter Selection
In order to select reasonable parameters for the approximation methods, we evaluated the quality of Top-1-node identification for each selected method by computing the static betweenness with each method on our generated random networks. Figure 5 reports the fraction of networks on which each competitor correctly identifies the Top-1 node. RAND2: Figure 5 shows the results of identifying the Top-1 node on random networks for different numbers of sampled pivots. It can be seen that the quality (measured as the ratio of correctly identified nodes) increases with the number of pivots, and sampling with 512 pivots is best. RAND2_64 can be chosen as a trade-off: it correctly identifies the Top-1 node on over 70% of the WS networks and saves much time. RK: as Figure 5 indicates, it achieves the best quality for ϵ = 0.07; however, even then it only identified the Top-1 node on 60% of all random graphs. As RK with ϵ = 0.2 and 0.3 cannot identify the Top-1 node in all ER networks, we chose ϵ = 0.07 and 0.1. KPATH: the results of KPATH are shown in Figure 5. Its quality is the worst compared to RAND2 and RK. KPATH_0.2_4 and KPATH_0.2_8 are reasonable among all selected settings.
With these results, we selected RAND2_512, RAND2_64, RK_0.07_0.1, RK_0.10_0.1, KPATH_0.2_4, and KPATH_0.2_8 for further analysis in the remaining part of the study.
Since the computation on real-world networks is expensive, we first analyzed the results on the generated random graphs in order to select competitors for further experiments on real-world networks. Figure 6 presents the average measure values of the 66 competitors. We can see that RAND2_512_1 offers the highest accuracy in general.
Figure 7 presents the distribution of the 10% GCC reduction measure for the 66 competitors. On this measure, which is closely related to the dismantling problem, the quality of RAND2_64_1 is also good. Moreover, the accuracy of RAND2_64 is close to that of RAND2_512 for the different k. Table 4 shows the measure values and runtimes on a specific ER network. RAND2_64, RK_0.10_0.1, and KPATH_0.2_4 save runtime compared to RAND2_512, RK_0.07_0.1, and KPATH_0.2_8. Considering the prohibitive computational costs on larger networks, we selected RAND2_64, RK_0.10_0.1, and KPATH_0.2_4 for further analysis on the 48 real-world networks.
3.3.1. Real-World Networks
We ran experiments on the 48 real-world networks and computed the six measure values of accuracy. Figure 8 presents the distribution of the measure values. We can see that RAND2_64 with k = 1 is outstanding on all measures. Besides, for constant k values, the quality deteriorates as k increases. On the 10% GCC reduction measure, it is clear that the quality becomes worse from k = 1% to k = 20%. Besides, RAND2_64 and RK_0.10_0.1 with the 0.5 batch setting (removing the nodes with betweenness ≥ 0.5) also offer good accuracy. Compared to RAND2_64 and RK_0.10_0.1, the quality of KPATH_0.2_4 is not good. We computed the average measure values over the 48 real-world networks; the results are shown in Figure 9: RAND2_64_1, RK_0.10_0.1_1, RAND2_64_0.5, and RK_0.10_0.1_0.5 perform well.
The runtime of computing interactive betweenness depends on the size of the network, choices of k, and selected approximation algorithms. We analyzed the runtime regarding different k values with the same approximation method. Besides, we evaluated the runtime of different approximation methods with the same k setting.
3.4.1. Runtime regarding Different Approximation Methods
Figure 10 plots the runtime (in seconds) of RAND2_64, RK_0.10_0.1, and KPATH_0.2_4 with the same k setting (i.e., k = the number of nodes with betweenness ≥ 0.5) on different real-world networks, with the y-axis showing the runtime in seconds and the x-axis showing N log N, where N is the number of nodes in the network. Figure 10 shows that the runtime scales roughly as O(N log N) for these sparse real-world networks; note that for dense networks, the runtime would theoretically approach O(N^2). Moreover, the runtime of RAND2_64 is the highest, while KPATH_0.2_4 is the fastest among the three approximation methods, but it does not offer good quality.
3.4.2. Runtime regarding Different k Settings
Figure 11 shows the runtime of RK_0.10_0.1 with k from 1 to 16. We can see that the runtime increases as k decreases: with a smaller k, fewer nodes are removed in each iteration, resulting in a larger number of iterations and higher computational costs. Besides, doubling the k value saves about 50% of the runtime when k ≥ 2. When k = 1, the runtime reaches its upper bound and is no longer double that of k = 2.
To sum up, on sparse real-world networks, the actual runtime is around O(N log N) in our results and scales inversely with k for k ≥ 2.
In this section, we present the speedup of interactive betweenness approximation compared to the standard BETWI (exact interactive betweenness computation). Based on our experimental results for BETWI and the approximation methods, conducted on the same computer, we computed the speedups on ER networks with eight different sizes (N = 300, 400, 500, 600, 700, 800, 900, and 1000) but the same generator parameter. Similar to our analysis of runtime, our evaluation of the speedups of interactive betweenness approximation sheds light on two aspects: the speedups of different betweenness approximation algorithms and the speedups with increasing k. Figure 12(a) shows the speedups of RAND2_512, RAND2_64, RK_0.07_0.1, RK_0.10_0.1, KPATH_0.2_4, and KPATH_0.2_8 with the same k setting. We can see that the speedup increases as the network becomes larger. As a fast algorithm, KPATH offers great speedups compared to RK and RAND2. Figure 12(b) presents the speedups with different k settings. Removing one node from the GCC in each iteration induces low speedups, while doubling the k value approximately doubles the speedup.
From the results on quality and runtime, some competitors (e.g., RAND2_64_0.5) achieve high quality but need hours on the largest network. Several methods (e.g., KPATH_0.2_4 with k = 16) are quite fast, but their quality is poor. In this section, we focus on the trade-offs of the selected competitors.
3.6.1. Trade-Offs on Specific Networks
Figure 13 presents the trade-offs between quality (i.e., the values of the six accuracy measures) and speed (exact runtime). We used 3 colors to distinguish the 3 approximation methods and 3 markers to label 3 typical k settings: a fast one (k = 16), the slowest one (k = 1), and k = 4 as a trade-off. We can see that RK_0.10_0.1 with k = 4 achieves a nice trade-off on Top-1%-Hits, taking no more than 25% of the maximum runtime to reach high accuracy, while for the inversion measure, KPATH_0.2_4 with k = 1 is good. In addition, RAND2_64 with k = 4 also provides pleasing trade-offs on both Top-1%-Hits and Weightedtau.
3.6.2. Average Trade-Offs
As the runtimes on different networks deviate by orders of magnitude, to analyze the trade-offs on the 48 real-world networks, we normalized the runtime on each network (slowest = 1, fastest = 0) and then computed the average normalized runtime over the 48 networks. Besides, we further normalized the measure values into [0, 1] on each network and computed the average normalized measure values. Figure 14 shows the results. We used the same labels as in Figure 13 and added legends for the competitors with average normalized runtime ≤ 0.5 and average normalized measure values ≥ 0.6. We can see that setting k in {2, 4, 1%, 0.5} yields good trade-offs with specific approximation methods.
Betweenness centrality is a widely used measure of node importance, which counts the number of shortest paths in a network on which a node appears. However, if a node in the network is attacked or loses its functionality, the betweenness values of the other nodes change. That is, all betweenness values need to be recomputed in order to reflect the actual node importance. Recent research suggests that, for the network dismantling problem, interactively removing the node with the highest betweenness outperforms removing nodes based on a ranking obtained from a single betweenness computation. However, the interactive betweenness computation requires a static betweenness recomputation on the current GCC after each node removal, and it is significantly more expensive than the static approach (a factor of N in the worst case, O(N^4) vs. O(N^3)).
In this paper, we systematically investigated the approximation of interactive betweenness centrality. We proposed a framework for interactive betweenness estimation with k-batch removal. Our framework consists of a set of static betweenness approximation algorithms with various parameter settings for identifying top nodes with high betweenness, together with selections of how many nodes to remove in each iteration. In other words, we not only analyzed the performance of removing one top node but also evaluated the removal of a batch of nodes. As the computation of interactive betweenness is more expensive than the computation of static betweenness, we focused on choosing approximation methods with parameter settings and k values (the number of nodes to be removed in each iteration) which offer high quality and a nice trade-off between accuracy and speed. To ensure that our datasets cover different network structures, we generated 45 random networks, including ER, WS, and BA networks, and selected 48 real-world networks of distinct sizes from different fields. We devised six measures to evaluate accuracy, considering the identification of important nodes, the similarity of rankings, and the effects on GCC reduction.
To make preliminary selections of suitable parameter settings, we conducted a sensitivity analysis of the static betweenness approximation algorithms and evaluated the quality of Top-1-node identification on random networks. We selected six approximation methods: RAND2_64, RAND2_512, RK_0.07_0.1, RK_0.10_0.1, KPATH_0.2_4, and KPATH_0.2_8. As for the k settings, based on the results on 50% GCC reduction, we found that many networks can be dismantled with a small fraction of nodes, and we chose 11 different k settings (k ∈ {1, 2, 4, 8, 16, 1%, 5%, 10%, 20%, AS, 0.5}). We ran tests with 66 competitors (six approximation algorithms with 11 k settings) on random networks to further select competitors. Based on the results on random networks, we chose RAND2_64, RK_0.10_0.1, and KPATH_0.2_4 with the 11 k settings and conducted experiments on larger real-world networks. We found that RAND2_64_1, RAND2_64_0.5, RK_0.10_0.1_1, and RK_0.10_0.1_0.5 offer high accuracy. Besides, we analyzed the runtime regarding different approximation algorithms and k settings. Our analysis of the different approximation methods with the same k reveals that RAND2_64 is the slowest and KPATH_0.2_4 is the fastest competitor. Moreover, we also found that doubling the k value yields a 50% runtime reduction for k ≥ 2 and that the runtime reaches its upper bound with k = 1 (shown in Figure 11). Our analysis indicates that RAND2_64 and RK_0.10_0.1 with k = 2, 4, 1%, and 0.5 provide a nice trade-off between accuracy and speed.
In summary, we have proposed a novel framework for interactive betweenness approximation. We systematically evaluated the selections of approximation algorithms with various parameter settings and the choices of different batch removals from three aspects: accuracy, runtime, and the trade-offs between them. Our work contributes to the analysis of complex network phenomena, with a particular focus on obtaining scalable techniques. Future work could investigate the interactive approximate computation of other network centrality measures.
A. Static Betweenness Estimation Techniques
A.1. Pivots Sampling
Brandes and Pich introduced RAND1 for betweenness approximation. RAND1 samples a subset of source nodes uniformly at random and computes the estimated betweenness of all nodes by scaling the sampled dependencies up by N/c, where c is the number of sampled source nodes. Bader et al. proposed the GSIZE algorithm, which determines the number of sampled pivots based on the graph size. GSIZE utilizes an adaptive sampling technique introduced by Lipton and Naughton: given a node v, GSIZE keeps sampling pivots s until the accumulated dependency of v exceeds a constant multiple of the number of nodes. Geisberger et al. proposed RAND2, based on random sampling, to approximate the static betweenness values of all nodes. RAND2 modifies RAND1 by scaling contributions with a linear function of the distance to the source: it decreases the contribution of nodes close to the source nodes and thereby addresses the overestimation problem of RAND1.
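The pivot-sampling idea can be sketched as follows. This is our illustrative sketch, not the authors' code: for clarity it uses RAND1's uniform N/c scaling, whereas RAND2 would instead weight each contribution by a linear function of the distance to the pivot.

```python
import random
from collections import deque

def _dependencies(adj, s):
    """Single-source dependency accumulation (Brandes' inner loop)."""
    sigma, dist = {s: 1}, {s: 0}
    parents = {v: [] for v in adj}
    order, q = [], deque([s])
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                q.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
                parents[v].append(u)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):
        for v in parents[w]:
            delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
    return delta

def pivot_sample_betweenness(adj, c, seed=None):
    """RAND1-style estimate: accumulate dependencies from c random pivots
    and scale the sums by N/c.  (RAND2 replaces this uniform scaling with
    a distance-dependent linear one to curb overestimation.)"""
    rng = random.Random(seed)
    pivots = rng.sample(list(adj), c)
    bc = {v: 0.0 for v in adj}
    for s in pivots:
        for v, d in _dependencies(adj, s).items():
            if v != s:
                bc[v] += d
    n = len(adj)
    return {v: bc[v] * n / c for v in bc}
```

With c equal to the number of nodes, the estimate coincides with the exact (ordered-pair) betweenness; smaller c trades accuracy for c instead of N BFS traversals.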
A.2. Node Pairs Sampling
Bergamini and Meyerhenke proposed a fully dynamic algorithm (DA) for computing estimated betweenness. DA keeps track of the old shortest paths and substitutes them only when necessary. Riondato and Kornaropoulos proposed RK, which samples pairs of nodes instead of conducting BFS from sampled source nodes. RK is an (ϵ, δ)-approximation algorithm: given an allowed additive error ϵ, RK guarantees that the error of each node's estimate is less than ϵ with probability at least 1 − δ. RK determines the sample size by the VC dimension (Vapnik–Chervonenkis dimension), introduced in statistical learning theory, instead of by the network size. Riondato and Upfal presented an (ϵ, δ)-approximation method, ABRA. ABRA uses progressive sampling and sets the stop condition by utilizing Rademacher averages, as described by Shalev-Shwartz and Ben-David, and the pseudodimension introduced by Pollard in the statistical learning field.
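Under our reading of Riondato and Kornaropoulos' bound, the RK sample size can be computed as below; the constant c ≈ 0.5 and the rounding up are assumptions of this sketch.

```python
import math

def rk_sample_size(vd, eps, delta, c=0.5):
    """Number of node pairs RK samples, following the bound
    r = (c / eps^2) * (floor(log2(vd - 2)) + 1 + ln(1 / delta)),
    where vd is an (estimated) upper bound on the vertex-diameter and
    c ~ 0.5 is the universal constant from the VC-dimension argument."""
    return math.ceil(c / eps ** 2 * (math.floor(math.log2(vd - 2)) + 1
                                     + math.log(1 / delta)))
```

The quadratic dependence on 1/ϵ explains why raising ϵ from the default 0.01 to 0.07 or 0.1, as done in this study, shrinks the sample, and hence the runtime, by orders of magnitude.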
A.3. Bounded BFS
Everett and Borgatti found that the betweenness of a node in its EGO network is related to the exact betweenness of that node in the whole network. The EGO network of a node is composed of the node itself, all of its neighbors, and the edges that connect those nodes. Everett and Borgatti used neighbors up to distance 2 in their EGO approximation algorithm; in other words, EGO bounds the BFS to 2 hops from the source nodes. Pfeffer and Carley presented the KPATH method, which computes betweenness centrality values based on k-centrality measures, under the assumption that nodes distant from each other do not contribute to the betweenness values. Compared to EGO, the BFS of KPATH is bounded by k hops from the source nodes, and nodes at distance greater than k from the source node are not considered. Borassi and Natale introduced the adaptive algorithm KADABRA, which can approximate the betweenness of all nodes or compute just the Top-k nodes. KADABRA uses a balanced bidirectional BFS to sample shortest paths: instead of conducting a full BFS from s to t, KADABRA performs a BFS from s and a BFS from t at the same time, until the two searches touch each other.
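The bounded-BFS idea amounts to Brandes' accumulation with every search cut off after k hops; the sketch below (our illustration, not the authors' code) shows that a single `continue` implements the bound, with k = 2 recovering an EGO-like neighborhood:

```python
from collections import deque

def kpath_betweenness(adj, k, sources=None):
    """KPATH-style sketch: Brandes' accumulation with each BFS cut off
    after k hops, so only dependencies between nodes within distance k
    are counted. sources=None runs a BFS from every node."""
    nodes = list(adj)
    bc = {v: 0.0 for v in nodes}
    for s in (sources or nodes):
        dist = {s: 0}
        sigma = {v: 0 for v in nodes}; sigma[s] = 1
        preds = {v: [] for v in nodes}
        order, q = [], deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            if dist[v] == k:       # the bound: stop expanding past k hops
                continue
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

Choosing k at least the diameter reproduces the exact (ordered-pair) betweenness, while small k truncates each BFS to a constant-size neighborhood.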
As mentioned above, we divided the approximation algorithms into three classifications: pivots sampling, node pairs sampling, and bounded BFS. From each classification, we selected one method with a good trade-off between runtime and quality. Among the pivots sampling methods, we chose RAND2, as it offers outstanding accuracy with a good trade-off. From an experimental perspective, the results of [38, 39] both show that RAND2 outperforms the other methods on the tested networks. From a theoretical perspective, the linear scaling of RAND2 handles the overestimation problem of RAND1. Thus, RAND2 can serve as a representative of the methods based on pivots sampling. However, the performance of RAND2 is determined by the sample size: as RAND2 requires one BFS per sampled pivot, the time complexity of static RAND2 with s pivots is O(s(N + E)). On the one hand, if we sample too few pivots, we cannot identify the Top-1 node (the node with the highest betweenness) well. On the other hand, if we sample too many pivots, we perform redundant computations. Following the suggestion of Geisberger et al., we selected constant sample sizes. Among the node pairs sampling methods, we selected RK, proposed by Riondato and Kornaropoulos. The results on Top-1%-Hits in the benchmark provided by Alghamdi et al. indicate that RK is a better choice for identifying vital nodes. As an (ϵ, δ)-approximation method, RK's speed and quality are greatly affected by ϵ, which determines the sample size r = (c/ϵ²)(⌊log₂(VD − 2)⌋ + 1 + ln(1/δ)), where VD is the estimated vertex diameter of the network (the computation of the exact vertex diameter is quite expensive) and c is a universal constant. Since we focus on identifying the Top-1 node for interactive approximation, we can set ϵ higher than the default of 0.01. We evaluated the performance of RK for several such larger values of ϵ; as for δ, we set it to the default of 0.1. In addition, we chose KPATH, introduced by Pfeffer and Carley, as a typical representative of the methods based on bounded BFS. KPATH approximates static betweenness centrality values using k-centrality measures.
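The dependence of RK's sample size on ϵ can be made concrete with a small helper (our illustration; the constant c = 0.5 follows the bound of Har-Peled and Sharir cited above, and implementations may use a different value):

```python
import math

def rk_sample_size(eps, delta, vd_estimate, c=0.5):
    """Sample size r for RK given additive error eps, failure
    probability delta, and an estimated vertex diameter.
    c is an assumed universal constant (0.5 here)."""
    return math.ceil((c / eps ** 2) *
                     (math.floor(math.log2(vd_estimate - 2)) + 1 +
                      math.log(1 / delta)))

# Relaxing eps from the default 0.01 shrinks the sample size
# quadratically, e.g., for an estimated vertex diameter of 10:
r_default = rk_sample_size(0.01, 0.1, 10)
r_relaxed = rk_sample_size(0.05, 0.1, 10)
```

Since r scales as 1/ϵ², raising ϵ from 0.01 to 0.05 cuts the number of sampled pairs by a factor of 25, which is exactly the lever we exploit for interactive approximation.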
KPATH assumes that nodes distant from each other offer zero dependencies. KPATH stops each BFS after reaching k hops; therefore, only pair dependencies of nodes within distance k contribute to the betweenness values. KPATH determines the sample size by the parameter α: the number of samples grows with the number of nodes N in the network as a function of α. To distinguish the k in KPATH from the k of our k-batch removal, we denote the KPATH hop bound separately. Besides the hop bound, we set the α values to 0.0, 0.2, and 0.4 to make a comprehensive comparison. For the three selected methods with their parameter settings, we use the naming scheme method_parameter (e.g., KPATH_0.2_4 is the KPATH method with α = 0.2 and hop bound 4). Table 5 presents an overview of our selected methods.
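Interactive dismantling with k-batch removal can be sketched independently of the chosen estimator (function names and the degree-based stand-in estimator below are our illustration; any of the appendix's betweenness approximations plugs in the same way):

```python
def interactive_dismantle(adj, estimate_bc, batch_k=1):
    """Interactive dismantling with k-batch removal: re-estimate
    centrality, remove the top-k nodes, repeat until the graph is empty.
    estimate_bc maps an adjacency dict {node: set(neighbors)} to
    {node: score}."""
    adj = {v: set(ws) for v, ws in adj.items()}   # work on a copy
    removal_order = []
    while adj:
        scores = estimate_bc(adj)
        batch = sorted(scores, key=scores.get, reverse=True)[:batch_k]
        for v in batch:
            for w in adj[v]:
                if w in adj:                      # neighbor still present
                    adj[w].discard(v)
            del adj[v]
            removal_order.append(v)
    return removal_order

# Demo with node degree as a placeholder estimator on a 4-node path.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
order = interactive_dismantle(adj, lambda g: {v: len(g[v]) for v in g})
```

With batch_k = 1 the centrality is recomputed after every single removal (the classical interactive strategy); larger batches trade attack strength for fewer recomputations, which is the trade-off studied in this paper.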
Data Availability
All networks used in this study are available from the public repository http://networkrepository.com/networks.php.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This study was supported by the Research Fund from the National Natural Science Foundation of China (grant nos. 61861136005, 61851110763, and 71731001).
- R. Albert, I. Albert, and G. L. Nakarado, “Structural vulnerability of the North American power grid,” Physical Review E, vol. 69, no. 2, Article ID 025103, 2004.
- L. Cuadra, S. Salcedo-Sanz, J. Del Ser, S. Jiménez-Fernández, and Z. W. Geem, “A critical review of robustness in power grids using complex networks concepts,” Energies, vol. 8, pp. 9211–9265, 2015.
- S. H. Yook, H. Jeong, and A. L. Barabási, “Modeling the Internet’s large-scale topology,” Proceedings of the National Academy of Sciences, vol. 99, no. 21, pp. 13382–13386, 2002.
- R. Albert and A. L. Barabási, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, no. 1, pp. 47–97, 2002.
- M. Zanin and F. Lillo, “Modelling the air transport with complex networks: a short review,” The European Physical Journal Special Topics, vol. 215, no. 1, pp. 5–21, 2013.
- X. Sun, S. Wandelt, and F. Linke, “Temporal evolution analysis of the European air transportation system: air navigation route network and airport network,” Transportmetrica B: Transport Dynamics, vol. 3, no. 2, pp. 153–168, 2015.
- X. Sun and S. Wandelt, “Network similarity analysis of air navigation route systems,” Transportation Research Part E: Logistics and Transportation Review, vol. 70, pp. 416–434, 2014.
- T. Verma, N. A. Araújo, and H. J. Herrmann, “Revealing the structure of the world airline network,” Scientific Reports, vol. 4, no. 1, p. 5638, 2014.
- S. Wandelt, Z. Wang, and X. Sun, “Worldwide Railway Skeleton Network: extraction methodology and preliminary analysis,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 8, pp. 2206–2216, 2017.
- P. A. Duijn, V. Kashirin, and P. M. Sloot, “The relative ineffectiveness of criminal network disruption,” Scientific Reports, vol. 4, no. 1, p. 4238, 2014.
- A. Cardillo, M. Zanin, J. Gómez-Gardenes, M. Romance, A. J. G. del Amo, and S. Boccaletti, “Modeling the multi-layer nature of the European Air Transport Network: resilience and passengers re-scheduling under random failures,” The European Physical Journal Special Topics, vol. 215, no. 1, pp. 23–33, 2013.
- S. Wandelt, X. Sun, D. Feng, M. Zanin, and S. Havlin, “A comparative analysis of approaches to network-dismantling,” Scientific Reports, vol. 8, no. 1, 2018.
- R. Pastor-Satorras and A. Vespignani, “Epidemic dynamics and endemic states in complex networks,” Physical Review E, vol. 63, no. 6, Article ID 066117, 2001.
- A. V. Goltsev, S. N. Dorogovtsev, J. G. Oliveira, and J. F. Mendes, “Localization and spreading of diseases in complex networks,” Physical Review Letters, vol. 109, no. 12, Article ID 128702, 2012.
- M. Salehi, R. Sharma, M. Marzolla, M. Magnani, P. Siyari, and D. Montesi, “Spreading processes in multilayer networks,” IEEE Transactions on Network Science and Engineering, vol. 2, no. 2, pp. 65–83, 2015.
- G. Sabidussi, “The centrality index of a graph,” Psychometrika, vol. 31, no. 4, pp. 581–603, 1966.
- P. Bonacich, “Factoring and weighting approaches to status scores and clique identification,” The Journal of Mathematical Sociology, vol. 2, no. 1, pp. 113–120, 1972.
- L. Katz, “A new status index derived from sociometric analysis,” Psychometrika, vol. 18, no. 1, pp. 39–43, 1953.
- L. C. Freeman, “A set of measures of centrality based on betweenness,” Sociometry, vol. 40, no. 1, pp. 35–41, 1977.
- U. Brandes, “A faster algorithm for betweenness centrality,” The Journal of Mathematical Sociology, vol. 25, no. 2, pp. 163–177, 2001.
- R. Fan, K. Xu, and J. Zhao, “A GPU-based solution for fast calculation of the betweenness centrality in large weighted networks,” PeerJ Computer Science, vol. 3, p. e140, 2017.
- R. Matsuo, R. Nakamura, and H. Ohsaki, “A study on sparse-modeling based approach for betweenness centrality estimation,” in Proceedings of the 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), Tokyo, Japan, July 2018.
- A. van der Grinten and H. Meyerhenke, “Scaling betweenness approximation to billions of edges by MPI-based adaptive sampling,” 2019, https://arxiv.org/abs/1910.11039.
- S. K. Maurya, X. Liu, and T. Murata, “Approximations of betweenness centrality with graph neural networks,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, November 2019.
- U. Brandes and C. Pich, “Centrality Estimation in large networks,” International Journal of Bifurcation and Chaos, vol. 17, no. 7, pp. 2303–2318, 2007.
- D. A. Bader, S. Kintali, K. Madduri, and M. Mihail, “Approximating betweenness centrality,” in Algorithms and Models for the Web-Graph, Springer Berlin Heidelberg, Berlin, Heidelberg, Germany, 2007.
- R. J. Lipton and J. F. Naughton, “Estimating the size of generalized transitive closures,” in Proceedings of the 15th International Conference on Very Large Data Bases VLDB ’89, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1989, http://dl.acm.org/citation.cfm?id=88830.88847.
- R. Geisberger, P. Sanders, and D. Schultes, “Better approximation of betweenness centrality,” in Proceedings of the Meeting on Algorithm Engineering & Experiments, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008, http://dl.acm.org/citation.cfm?id=2791204.2791213.
- E. Bergamini and H. Meyerhenke, “Fully-dynamic approximation of betweenness centrality,” in Algorithms—ESA 2015, N. Bansal and I. Finocchi, Eds., Springer Berlin Heidelberg, Berlin, Heidelberg, Germany, 2015.
- M. Riondato and E. M. Kornaropoulos, “Fast approximation of betweenness centrality through sampling,” Data Mining and Knowledge Discovery, vol. 30, no. 2, pp. 438–475, 2016.
- V. N. Vapnik and A. Y. Chervonenkis, On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities, Springer International Publishing, Cham, Switzerland, 2015.
- M. Riondato and E. Upfal, “ABRA: approximating betweenness centrality in static and dynamic graphs with Rademacher averages,” ACM Transactions on Knowledge Discovery from Data, vol. 12, no. 5, pp. 1–38, 2018.
- S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, New York, NY, USA, 2014.
- D. Pollard, “Convergence of stochastic processes,” Economica, vol. 52, no. 208, p. 529, 1985.
- M. Everett and S. P. Borgatti, “Ego network betweenness,” Social Networks, vol. 27, no. 1, pp. 31–38, 2005.
- J. Pfeffer and K. M. Carley, “k-centralities: local approximations of global measures based on shortest paths,” in Proceedings of the 21st International Conference on World Wide Web, WWW ’12 Companion, ACM, New York, NY, USA, 2012.
- M. Borassi and E. Natale, “KADABRA is an ADaptive Algorithm for Betweenness via Random Approximation,” in 24th Annual European Symposium on Algorithms (ESA 2016), Leibniz International Proceedings in Informatics (LIPIcs), Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2016, http://drops.dagstuhl.de/opus/volltexte/2016/6371.
- Z. Alghamdi, F. Jamour, S. Skiadopoulos, and P. Kalnis, “A benchmark for betweenness centrality approximation algorithms on large graphs,” in Proceedings of the 29th International Conference on Scientific and Statistical Database Management (ACM), pp. 1–12, Chicago IL USA, June 2017.
- J. Matta, G. Ercal, and K. Sinha, “Comparing the speed and accuracy of approaches to betweenness centrality approximation,” Computational Social Networks, vol. 6, no. 1, 2019.
- S. Har-Peled and M. Sharir, “Relative (p, ε)-approximations in geometry,” Discrete & Computational Geometry, vol. 45, no. 3, pp. 462–496, 2011.
Copyright © 2020 Sebastian Wandelt et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.