Abstract

The backbone is a natural abstraction of a complex network, which can help people understand a networked system in a more simplified form. Traditional backbone extraction methods tend to include many outliers in the backbone. What is more, they often suffer from computational inefficiency: the exhaustive search of all nodes or edges is often prohibitively expensive. In this paper, we propose a backbone extraction heuristic with incomplete information (BEHwII) to find the backbone in a complex weighted network. First, a strict filtering rule is carefully designed to determine which edges are to be preserved or discarded. Second, we present a local search model that examines a subset of edges in an iterative way, relying only on local/incomplete knowledge rather than the global view of the network. Experimental results on four real-life networks demonstrate the advantage of BEHwII over the classic disparity filter method in terms of both effectiveness and efficiency.

1. Introduction

Complex networks have become an important approach for understanding systems of interacting objects [1]. Networked systems have permeated a wide spectrum of domains, ranging from biology and automatic control to computer science [2, 3]. As networked systems grow increasingly large, understanding and revealing the phenomena taking place in them poses considerable challenges. The backbone is a signature or an abstraction of the nature of a complex system and can greatly help in understanding it in a more simplified form [4]. For example, detecting the backbones of criminal networks can help investigators target suspects more precisely [5]. Also, urban planners examine the topologies of public transport systems by analyzing their backbones [6].

Recent years have witnessed an increasing interest in extracting backbones from large-scale weighted networks of various kinds [4, 7–9]. As many networks are evolving to large scales and their weight distributions span several orders of magnitude, extracting backbones from them has become a critical task for research and applications of various purposes. In general, the backbone should be thought of as a set of nodes and edges that interconnect various pieces of the network, providing paths for the exchange of information between different subnetworks [10]. Thus, a promising way to extract a backbone is to map the original network into a smaller network, in which the numbers of nodes and edges are small enough to be amenable to analysis and visualization.

In the literature, the existing methods can be roughly divided into two categories: one based on coarse graining and the other filter-based. The methods based on coarse graining [4, 7, 11–14] clump nodes sharing common attributes together into the same group/community and then consider the whole group as a single unit in the new network. However, there is often no clear statement on whether properties of the initial network are preserved in the network of clusters [15].

The filter-based methods [8, 9, 16–18] typically employ a bottom-up strategy to extract the backbone. They often start by defining a statistical property of a node or an edge, and this property is then used as a criterion to determine which nodes/edges are to be preserved or discarded. In this case, the observation scale is fixed and the representation that the network symbolizes is not changed. Instead, those elements (nodes and edges) that carry relevant information about the network structure are kept while the rest are discarded. However, the filter-based methods may include a multitude of outliers, which naturally should not be part of the backbone. What is more, they often suffer from computational inefficiency: the exhaustive search of all nodes or edges is often prohibitively expensive.

In this work, we design a novel filter-based method for extracting backbones from large-scale weighted networks. Unlike the exhaustive search adopted by existing methods, the proposed approach only needs incomplete information and invokes an iterative local search scheme to improve efficiency. Hence, this novel method is called backbone extraction heuristic with incomplete information (BEHwII). In particular, although the significance probability $\alpha_{ij}$ proposed in [8] is employed as the filtering criterion, BEHwII imposes an AND condition instead of an OR condition to strengthen the filtering rule, so that the case of extracting too many outliers into the backbone can be avoided. Our method is naturally a heuristic, since it does not examine all edges in the network. Instead, BEHwII greedily selects an optimal edge in each iteration and adds this edge to the backbone if the predefined filtering rule is satisfied. Extensive experiments on various real-world networks demonstrate the superiority of BEHwII over the global filtering method in terms of effectiveness and efficiency.

The remainder of this paper is organized as follows. In Section 2, we introduce preliminaries and motivation of this work. In Section 3, we discuss the local search mechanism and then present the algorithmic details of BEHwII. Experimental results will be given in Section 4. We present the related work in Section 5 and finally conclude this paper in Section 6.

2. Preliminaries and Motivation

Since the proposed method for backbone extraction is essentially a filter-based model, we begin by providing the preliminary knowledge about filter-based models. Then, we analyze some drawbacks of existing filter-based methods, which leads to a better understanding of the motivation of this paper.

The filter-based models typically employ a bottom-up strategy to extract the backbone. They often start by defining a statistical property of a node or an edge, and this property is then used as a criterion to determine which nodes/edges are to be preserved or discarded. As a result, the preserved nodes and their links, or the preserved edges and their endpoints, compose the backbone of the network. Therefore, the key step in filter-based methods is how to define a reasonable filtering property for nodes/edges. For instance, the $k$-core is a well-known filtering property that is used to construct a hierarchical topological filter in [16]. However, many simple filtering properties (e.g., the $k$-core) are not suitable for weighted networks. Meanwhile, real-world weighted networks usually exhibit strongly disordered, heavy-tailed distributions of weights [19]. That is, the probability distribution of the weight carried by any given link is broadly distributed, spanning several orders of magnitude. This feature poses nontrivial challenges in defining a filtering property for weighted networks, due in large part to the lack of a characteristic scale. Serrano et al. [8] addressed this challenge by introducing the disparity filter based on the following null hypothesis: the normalized weights that correspond to the connections of a certain node of degree $k$ are produced by a random assignment from a uniform distribution. Given a node $i$ and its associated link $e_{ij}$ with weight $w_{ij}$, the normalized weight is defined as
$$p_{ij} = \frac{w_{ij}}{\sum_{j} w_{ij}} = \frac{w_{ij}}{s_i}, \quad (1)$$
where $s_i$ is the strength of node $i$. Under the null hypothesis, a null model is then presented, in which $k-1$ points are distributed with uniform probability in the interval $[0, 1]$. As a result, $k$ subintervals are generated, whose lengths represent the expected values of the normalized weights according to the null hypothesis. The probability density function for one of these variables taking a particular value $x$ is
$$\rho(x)\,dx = (k-1)(1-x)^{k-2}\,dx. \quad (2)$$
Based on (2), given an edge $e_{ij}$, the probability $\alpha_{ij}$ that its normalized weight $p_{ij}$ is compatible with the null model can be defined as
$$\alpha_{ij} = 1 - (k_i-1)\int_{0}^{p_{ij}} (1-x)^{k_i-2}\,dx = (1-p_{ij})^{k_i-1}, \quad (3)$$
where $k_i$ is the degree of node $i$. Thus, $\alpha_{ij}$ is adopted as the filtering criterion in [8] for weighted networks. Given a significance level $\alpha$, the edges that carry weights which can be considered not compatible with a random distribution can be filtered out with a certain statistical significance. That is, edges with $\alpha_{ij} < \alpha$ should be kept, since they reject the null hypothesis.
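To make the criterion concrete, the following minimal Python sketch computes the normalized weights of (1) and the probabilities $\alpha_{ij}$ of (3) for every edge of a toy weighted graph stored as an adjacency dictionary. The graph, the function name, and the degree-one convention are illustrative assumptions, not the reference implementation of [8].

# Sketch of the disparity-filter criterion (Eqs. (1)-(3)); toy data only.

def disparity_alpha(adj):
    """adj: dict mapping node -> {neighbor: weight} (undirected)."""
    strength = {i: sum(nbrs.values()) for i, nbrs in adj.items()}
    alpha = {}
    for i, nbrs in adj.items():
        k = len(nbrs)
        for j, w in nbrs.items():
            p_ij = w / strength[i]                      # Eq. (1)
            # alpha_ij = (1 - p_ij)^(k_i - 1), Eq. (3); degree-one nodes
            # cannot reject the null model, so their alpha is set to 1.
            alpha[(i, j)] = (1.0 - p_ij) ** (k - 1) if k > 1 else 1.0
    return alpha

adj = {
    'a': {'b': 10.0, 'c': 1.0, 'd': 1.0},
    'b': {'a': 10.0, 'c': 2.0},
    'c': {'a': 1.0, 'b': 2.0},
    'd': {'a': 1.0},
}
alphas = disparity_alpha(adj)
print([e for e, a in alphas.items() if a < 0.3])   # edges significant at alpha = 0.3

In this toy example, the edge ('a', 'b') is kept because it carries most of the strength of both endpoints, whereas the weak edges attached to 'a' are not significant for it.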

The criterion $\alpha_{ij}$ gave birth to an effective filter-based method for backbone extraction [8]. However, two drawbacks have attracted our attention. One of the biggest limitations is that it may include a multitude of outliers, which naturally should not be part of the backbone. In what follows, we explore its cause and give a modified scheme.

For a node $i$ with degree $k$, the level of local heterogeneity in the weights can be calculated as
$$\Upsilon_i(k) = k \sum_{j} p_{ij}^{2}. \quad (4)$$
Thus, under perfect homogeneity, when all the links share the same amount of the strength of the node, $\Upsilon_i(k)$ equals 1 independently of $k$, while in the case of perfect heterogeneity, when just one of the links carries the whole strength of the node, $\Upsilon_i(k)$ is equal to $k$. With the predefined null model, the joint probability distribution for two consecutive subintervals can be defined as
$$\rho(x_m, x_{m+1}) = (k-1)(k-2)\,(1-x_m-x_{m+1})^{k-3}\,\theta(1-x_m-x_{m+1}), \quad (5)$$
where $\theta(\cdot)$ is the Heaviside step function; this distribution can be used to calculate the statistics of $\Upsilon_{\mathrm{null}}(k)$ for the null model. The average and the standard deviation are estimated to be
$$\langle \Upsilon_{\mathrm{null}}(k) \rangle = \frac{2k}{k+1}, \qquad \sigma_{\mathrm{null}}(k) = k\sqrt{\frac{4k+20}{(k+1)(k+2)(k+3)} - \frac{4}{(k+1)^{2}}}. \quad (6)$$
In real networks, the observed level of local heterogeneity $\Upsilon_i(k)$ can be compared against the null model expectations. Namely, the observed values are compatible with the null hypothesis when they lie between the perfect homogeneity value of 1 and $\langle \Upsilon_{\mathrm{null}}(k) \rangle + c\,\sigma_{\mathrm{null}}(k)$. Local heterogeneity will be recognized only if $\Upsilon_i(k)$ obeys
$$\Upsilon_i(k) > \langle \Upsilon_{\mathrm{null}}(k) \rangle + c\,\sigma_{\mathrm{null}}(k). \quad (7)$$
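As a sanity check on (4)-(7), the following Python sketch computes $\Upsilon_i(k)$ for an observed weight vector and estimates $\langle \Upsilon_{\mathrm{null}}(k) \rangle$ and $\sigma_{\mathrm{null}}(k)$ by Monte Carlo sampling of the null model ($k-1$ uniform points in $[0,1]$) rather than the closed-form expressions in (6); the sample data and function names are illustrative.

# Sketch: observed heterogeneity (Eq. (4)) versus the null model bound (Eq. (7)).
import random

def upsilon(weights):
    s = sum(weights)
    k = len(weights)
    return k * sum((w / s) ** 2 for w in weights)        # Eq. (4)

def null_upsilon_stats(k, samples=20000):
    """Monte Carlo estimate of the mean and std of Upsilon under the null model."""
    vals = []
    for _ in range(samples):
        cuts = sorted(random.random() for _ in range(k - 1))
        spacings = [b - a for a, b in zip([0.0] + cuts, cuts + [1.0])]
        vals.append(k * sum(x * x for x in spacings))
    mean = sum(vals) / samples
    var = sum((v - mean) ** 2 for v in vals) / samples
    return mean, var ** 0.5

weights = [10.0, 1.0, 1.0, 0.5]                          # observed weights, k = 4
mean, std = null_upsilon_stats(len(weights))
c = 2.0                                                  # confidence constant, as in the text
print(upsilon(weights) > mean + c * std)                 # True: heterogeneity detected

For k = 4, the Monte Carlo estimate of the mean is close to 2k/(k+1) = 1.6, consistent with (6).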

The parameter $c$ is a constant determining the confidence interval for the evaluation of the null hypothesis. The larger it is, the more restrictive the null model becomes and the more disordered the weights must be for local heterogeneity to be detected. A typical value of $c$, in analogy to Gaussian statistics, is 2. In Figure 1, we show the two regions (local heterogeneity and local compatibility) separated by the null model bound. Obviously, nodes with small degree are more likely to fall into the locally compatible region, which implies that such nodes should not be preserved in the backbone.

In [8], the multiscale backbone is obtained by preserving all the links that beat the significance level $\alpha$ for at least one of the two nodes at the ends of the link while discarding the rest. Notice that $\alpha_{ij}$ is not symmetrical; that is, $\alpha_{ij} \neq \alpha_{ji}$ in general. In the case of a node $i$ with a large degree connected to a node $j$ with a small degree, we might have $\alpha_{ij} < \alpha \leq \alpha_{ji}$. Then this link will be preserved, as $\alpha_{ij} < \alpha$ holds for one endpoint. However, as discussed above, node $j$ is likely to fall into the locally compatible region and should be kept away from the backbone. Considering that a power-law degree distribution, in which low-degree nodes are in the majority, is usually observed in real systems, the disparity filter in [8] may include a multitude of outliers. To avoid including many outliers in the backbone, one can impose an AND condition instead of an OR condition to strengthen the filtering rule, so that a connection is preserved only when its intensity is significant for both nodes involved.

Secondly, most of the existing filter-based methods [8, 9, 16, 17] suffer from computational inefficiency caused by the exhaustive search of all nodes or edges in a network. For example, the filtering method based on $\alpha_{ij}$ is heavily dependent on the number of links. As many social networking sites evolve to super-large scales, for example, containing millions or even billions of nodes and edges, such computation becomes prohibitively expensive.

According to the above analysis, this paper proposes a local method for extracting backbones from weighted networks. In particular, we try to answer the following two questions:
(i) Q1: how to carefully design a filtering criterion to avoid including many outliers in the backbone?
(ii) Q2: how to reduce the computational complexity of the backbone extraction algorithm?

3. Backbone Extraction Heuristic with Incomplete Information (BEHwII)

Let $G = (V, E, W)$ be a given weighted graph, where $V$ is the set of nodes ($|V| = n$), $E$ is the set of edges ($|E| = m$) that connect the nodes in $V$, and $W$ assigns a weight $w_{ij}$ to every edge $e_{ij} \in E$. Backbone extraction is formulated as finding a subgraph $G_B = (V_B, E_B, W_B)$ of graph $G$, that is, the backbone, where $V_B \subseteq V$, $E_B \subseteq E$, and $|E_B| \ll |E|$. This implies that the backbone should significantly reduce the number of edges while preserving the most essential connections.

In this section, we propose a backbone extraction heuristic with incomplete information (BEHwII for short). First, we introduce the basic idea of BEHwII, covering the local search mechanism. Second, we present algorithmic details including the complexity analysis for BEHwII.

3.1. Local Search Model

In this paper, we employ the filtering criterion $\alpha_{ij}$ proposed in [8]. However, one major drawback is that it is likely to include too many outliers in the backbone, as stated in Section 2. We argue that this drawback originates from the looseness of the filtering rule, that is, requiring $\alpha_{ij} < \alpha$ for only one endpoint of an edge. Therefore, BEHwII imposes an AND condition instead of an OR condition to strengthen the filtering rule, so that a connection is preserved only when its intensity is significant for both nodes involved. In BEHwII, an edge $e_{ij}$ is preserved in the backbone if
$$\max(\alpha_{ij}, \alpha_{ji}) < \alpha, \quad (8)$$
where $\alpha_{ij}$ is the probability derived by comparing the normalized weight $p_{ij}$ with the null model, as shown in (3). With this filtering rule, BEHwII aims to extract a certain percentage (the extraction goal) of edges satisfying (8) as the backbone.
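The difference between the strict rule (8) and the original rule of [8] comes down to how the two endpoint probabilities are combined; a minimal illustration (the helper names and sample values are ours):

# Strict (AND) rule of Eq. (8) versus the loose (OR) rule of the disparity filter [8].

def keep_edge_strict(alpha_ij, alpha_ji, alpha):
    # BEHwII: the edge must be significant for BOTH endpoints
    return max(alpha_ij, alpha_ji) < alpha

def keep_edge_loose(alpha_ij, alpha_ji, alpha):
    # Disparity filter: significant for at least ONE endpoint
    return min(alpha_ij, alpha_ji) < alpha

print(keep_edge_strict(0.02, 0.40, 0.05))   # False: rejected by the strict rule
print(keep_edge_loose(0.02, 0.40, 0.05))    # True: kept by the loose rule, a likely outlier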

A straightforward way for backbone extraction is to apply an exhaustive search, that is, to examine all of the edges one by one and add an edge to the backbone whenever (8) is satisfied. Obviously, this exhaustive search suffers from computational inefficiency, especially when the network becomes much larger. Here, we introduce a local search model to solve this problem. We divide the explored graph into three regions: the known local area $A$, the boundary area $B$, and a larger unknown area $U$, as illustrated in Figure 2. Initially, we randomly select a node $v_0$ as the start node and add it to $A$. Then, all neighbors of nodes in $A$ are added to $B$. The local search model selects an optimal edge, that is, the edge incident to $A$ with the minimum value of $\max(\alpha_{ij}, \alpha_{ji})$, and adds it to the backbone if it satisfies (8). Areas $A$ and $B$ are expanded accordingly. Another edge is then selected and checked, until a certain number of edges have been included in the backbone.

Remark 1. The local search model is essentially a streaming and iterative scheme [20]. An iterative process is invoked to examine each node along with its neighbors and performs a computation whose result is associated with the processed node. Such a scheme is a very promising technique for scaling existing methods. Moreover, the local search model is independent of the "global knowledge"; that is, it only needs to fetch part of the node adjacency lists into main memory. Due to the small-world effect, our model is shown to depend only slightly on the initial node selection; the corresponding experimental results will be given in Section 4.1.

3.2. Algorithmic Details

In this section, we introduce how to use BEHwII to extract the backbone starting from any randomly selected node. BEHwII initially places the randomly selected source node into the known local area ($A$) and adds its neighbors to $B$. Two data structures used in BEHwII are described as follows:
(i) Min-heap $H$, which stores the edge information, including $\alpha_{ij}$ and $\alpha_{ji}$, for the edges currently under consideration, so that every update operation takes $O(\log |H|)$ time;
(ii) List $E_B$, which stores the edges of the backbone, and every insert operation takes $O(1)$ time.

We roughly describe the BEHwII algorithm step by step as follows.

Step 1. Find the edge $e_{ij}$ with the minimal value of $\max(\alpha_{ij}, \alpha_{ji})$ in $H$ and add it to $E_B$ if it satisfies (8).

Step 2. If an endpoint of the considered edge $e_{ij}$ is not yet included in $A$ (say, node $j$), move it from $B$ to $A$; otherwise, delete edge $e_{ij}$ from $H$ and turn to Step 1.

Step 3. Delete edge $e_{ij}$ from $H$, move the newly exposed neighbors of node $j$ from $U$ to $B$, and insert the corresponding edges into $H$.

The above process continues until it has agglomerated a certain percentage of edges, or it has discovered the entire enclosing component, whichever happens first. Note that even if the edge with the minimal value of $\max(\alpha_{ij}, \alpha_{ji})$ in Step 1 does not satisfy (8), we still check its endpoints and add the corresponding incident edges to $H$. Here, such endpoints can be seen as excessive nodes used to continue the search process. See Algorithm 1 for more exact pseudocode.

procedure BEHwII($G$, $v_0$, $\alpha$, extraction goal)
   $A \leftarrow \{v_0\}$;  $B \leftarrow N(v_0)$;  $U \leftarrow V \setminus (A \cup B)$;
   $E_B \leftarrow \emptyset$;
   $H \leftarrow$ edges incident to $v_0$, keyed by $\max(\alpha_{ij}, \alpha_{ji})$;
   while $H \neq \emptyset$ do
      pop the edge $e_{ij}$ with the minimal $\max(\alpha_{ij}, \alpha_{ji})$ from $H$;
      if $\max(\alpha_{ij}, \alpha_{ji}) < \alpha$ then
         $E_B \leftarrow E_B \cup \{e_{ij}\}$;
      end if
      if $j \notin A$ then
         move $j$ from $B$ to $A$;
         move $N(j) \cap U$ from $U$ to $B$;
         insert the unexamined edges incident to $j$ into $H$;
      end if
      if the extraction goal is reached then
         break;
      end if
   end while
   return $E_B$;
end procedure

Computational Complexity. The main computational cost of the above algorithm originates from the number of examined edges, denoted by $m_e$. For each examined edge $e_{ij}$, BEHwII needs to calculate the value of $\max(\alpha_{ij}, \alpha_{ji})$ and update the min-heap $H$. Because $\alpha_{ij}$ depends on the degrees of nodes $i$ and $j$ and on the normalized weights $p_{ij}$ and $p_{ji}$ (which require the node strengths), it takes $O(\bar{d})$ time to calculate this value on each examined edge, where $\bar{d}$ is the average degree of the graph. The updating (inserting or deleting) cost of $H$ for each examined edge is $O(\log m_e)$. In general, the running time of the algorithm is $O(m_e(\bar{d} + \log m_e))$.
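For concreteness, the following Python sketch implements the local search of Algorithm 1 on an adjacency-dictionary graph using a binary heap; the graph format, the parameter names (alpha for the significance level, beta for the extraction goal), and simplifications such as precomputing the node strengths are our own assumptions rather than the authors' implementation.

# Hedged Python sketch of the BEHwII local search (Algorithm 1); toy input only.
import heapq

def behwii(adj, v0, alpha, beta):
    """adj: {node: {neighbor: weight}}; v0: start node; alpha: significance level;
    beta: target fraction of edges to keep. Returns the list of backbone edges."""
    strength = {i: sum(nbrs.values()) for i, nbrs in adj.items()}
    degree = {i: len(nbrs) for i, nbrs in adj.items()}
    target = beta * (sum(degree.values()) // 2)          # beta * |E|

    def a(i, j):                                         # alpha_ij, Eq. (3)
        p = adj[i][j] / strength[i]
        return (1.0 - p) ** (degree[i] - 1) if degree[i] > 1 else 1.0

    A = {v0}                                             # known local area
    H, seen, backbone = [], set(), []
    for j in adj[v0]:                                    # boundary edges of v0
        heapq.heappush(H, (max(a(v0, j), a(j, v0)), v0, j))
        seen.add(frozenset((v0, j)))

    while H and len(backbone) < target:
        key, i, j = heapq.heappop(H)                     # Step 1: minimal max-alpha
        if key < alpha:                                  # strict rule, Eq. (8)
            backbone.append((i, j))
        if j not in A:                                   # Steps 2-3: expand A and B
            A.add(j)
            for v in adj[j]:
                e = frozenset((j, v))
                if e not in seen:
                    seen.add(e)
                    heapq.heappush(H, (max(a(j, v), a(v, j)), j, v))
    return backbone

adj = {
    'a': {'b': 10.0, 'c': 1.0, 'd': 1.0},
    'b': {'a': 10.0, 'c': 2.0, 'e': 8.0},
    'c': {'a': 1.0, 'b': 2.0},
    'd': {'a': 1.0},
    'e': {'b': 8.0},
}
print(behwii(adj, 'a', alpha=0.5, beta=0.4))             # e.g. [('a', 'b')]

The heap key max(alpha_ij, alpha_ji) directly mirrors the strict rule (8), so the greedy pop order examines the most significant candidate edges first.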

4. Experimental Results

Four real-world undirected and weighted networks, Lesmis, USAir97, OClinks, and RTNN, are used in the experiments. Some characteristics of these networks are shown in Table 1, where $|V|$ and $|E|$ indicate the numbers of nodes and edges in the network, respectively, $\bar{d}$ indicates the average degree, and $\bar{w}$ indicates the average weight. Lesmis [21] is the network of coappearances of characters in Victor Hugo's novel Les Misérables, where nodes represent characters and edges connect any pair of characters that appear in the same chapter of the book. USAir97 [22] gathers the information of 2126 flights between 332 US airports, where the weight represents the normalized distance between two airports. OClinks [23] is a network created from an online community, where nodes represent students at the University of California and edges are established between two students if one or more messages have been sent from one to the other. RTNN [24] is also a coappearance network including all words/terms in online stories about the September 11 attacks, where each node represents a word and each tie means that the two words appear in the same story.

4.1. Comparison Results

In this subsection, we compare BEHwII with the disparity filter (DF for short) proposed by Serrano et al. [8] in terms of performance and scalability. BEHwII is a local-search-based algorithm, which can start from any randomly selected source node. To investigate the impact of the source node, we fix the significance level and take two different starting nodes, $v_h$ and $v_l$, where $v_h$ is a highly connected node and $v_l$ is a low-connected one. Both $v_h$ and $v_l$ are randomly selected from the original network. For convenience, we denote BEHwII starting from $v_h$ by BEHwII-h; then BEHwII-l represents BEHwII starting from $v_l$. For a given extraction goal (the percentage of edges kept in the backbone), the effectiveness of BEHwII-h, BEHwII-l, and DF can be validated by measuring the average weight and node betweenness of the extracted backbones, while their efficiency can be measured by the number of examined edges and the overall running time.
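For reproducibility, the two effectiveness measures are easy to compute from an extracted backbone; a minimal sketch, assuming the backbone is available as a weighted edge list and using networkx for betweenness (a tooling choice of ours, not stated in the paper):

# Sketch: effectiveness measures of Section 4.1.
import networkx as nx

def average_weight(backbone_edges):
    """backbone_edges: iterable of (u, v, w) tuples kept in the backbone."""
    weights = [w for _, _, w in backbone_edges]
    return sum(weights) / len(weights)

def average_node_betweenness(original_graph, backbone_nodes):
    """Average betweenness of the preserved nodes, computed on the original
    network (as done for the distributions in Section 4.2)."""
    bc = nx.betweenness_centrality(original_graph)
    return sum(bc[v] for v in backbone_nodes) / len(backbone_nodes)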

Effectiveness. Figure 3 shows the average weight of the extracted backbones when the original graphs are processed by BEHwII-h, BEHwII-l, and DF, respectively. Note that the only parameter for DF is the significance level $\alpha$; for a given network, the fraction of extracted edges is a monotonically increasing function of $\alpha$. For a convenient comparison, both DF and BEHwII use the same $\alpha$, which is gradually increased so that the number of extracted edges grows accordingly. Two observations are noteworthy from Figure 3. First, compared with DF, BEHwII-h shows slight improvements in terms of the average weight over the whole range of extracted fractions. BEHwII-l does not perform well when the fraction of extracted edges is too small. For instance, BEHwII-l obtains a backbone with an average weight lower than 10 on the Lesmis network, but, after using BEHwII-h and DF to extract backbones, the average weight increases significantly. Another important observation is that BEHwII-h and BEHwII-l trend consistently once the fraction of extracted edges grows to a certain level. As can be seen from Figures 3(a) and 3(b), when the fraction of edges grows to around 0.25, the backbones extracted by BEHwII-h and BEHwII-l have the same average weight. As BEHwII greedily adds the locally optimal edge to the backbone, even if it starts from a low-connected source node, it can reach several highly connected nodes within a limited number of steps. Therefore, BEHwII-l evolves into a BEHwII-h-like process after a certain percentage of edges have been discovered.

We then extensively explore the average node betweenness in the backbones extracted from Lesmis, USAir97, OClinks, and RTNN. Node betweenness centrality is the fraction of all shortest paths in the network that pass through a given node, which reflects the connectedness of the node. Figure 4 shows the average betweenness of the extracted nodes for different fractions of edges in the backbones. We can clearly see that both BEHwII-h and BEHwII-l outperform DF on all of the test graphs. This implies that the edges extracted by BEHwII always lie between two highly connected nodes. As for DF, the filtering rule is so loose that some outliers (nodes with degree equal to 1) are included in the backbones, which lowers the connectedness of the extracted backbone.

We then take a direct look at the extracted backbones. The Lesmis and USAir97 networks are used here as two examples, and we fix the significance level and the extraction goal for BEHwII-h. In the case of Lesmis, the extracted backbone obtained by BEHwII-h is shown in Figure 5(a). The source node is colored green, the nodes and edges colored blue are those kept in the backbone, the size of a node expresses its strength ($s_i$), and the thickness of an edge represents its weight. Interestingly, the backbone obtained by BEHwII-h preserves almost all high-connectivity nodes and essential connections. We then employ DF directly on this network and obtain the backbone shown in Figure 5(b). The clique-like pattern on the top is missed, and, what is more, two outliers (highlighted by dashed circles) are kept.

As for the USAir97 network, nodes are placed in the plane according to their actual coordinates on the earth. The backbone extracted by BEHwII-h, as shown in Figure 5(c), covers almost all the geographic regions of the USA. In addition, the hierarchy of the transportation system is fully highlighted, including not just the highest-flux connections but also small-weight edges that are statistically significant because they represent relevant signals at small scales. However, the backbone extracted by DF includes many small airports in Alaska and on the west coast of the USA (highlighted by dashed ellipses).

Efficiency. Figure 6 compares the efficiency of BEHwII and DF for a given extraction goal. The numbers of edges examined by BEHwII-h, BEHwII-l, and DF on the four test networks are shown in Figure 6(a). Apparently, BEHwII-h and BEHwII-l examine far fewer edges than DF does, as the latter examines all nodes and edges in the network. Figure 6(b) verifies our analysis in Section 3.2, that is, that the running time of BEHwII originates from the number of examined edges. It is interesting to find that the running times of BEHwII-h and BEHwII-l remain nearly constant on the relatively large dense graphs (e.g., OClinks and RTNN). This is because those two networks exhibit the "small-world" effect [23, 24], in which most nodes can be reached from each other within a small number of hops or steps. In this context, both BEHwII-h and BEHwII-l can rapidly reach the highly connected nodes; therefore, their overall running times are almost the same.

4.2. Inside BEHwII

Here, we take a further step to explore several factors that affect the performance of BEHwII. We select BEHwII starting from a highly connected source node, that is, BEHwII-h, for the experiments. Two inside factors have been investigated: the significance level $\alpha$ and the inside filtering rule.

The Significance Level $\alpha$. It is particularly interesting to analyze the behavior of the topological properties of the backbones extracted by BEHwII-h at increasing values of the significance level $\alpha$. Figures 7(a) and 7(b) show the evolution of the cumulative degree distribution $P(k)$ with different values of $\alpha$ for USAir97 and OClinks, respectively. The backbones extracted by BEHwII-h have cumulative degree distributions similar to those of the original networks. Smaller values of $\alpha$ have flat start-ups, indicating that the extracted backbones contain fewer low-degree nodes. The evolution of the weight distribution $P(w)$ with different values of $\alpha$ is shown in Figures 7(c) and 7(d), from which we observe that the original USAir97 and OClinks networks are both heavy tailed. Interestingly, almost all weight scales are kept during the search process until the filter becomes too restrictive, that is, when BEHwII-h applies a very small value of $\alpha$. A restrictive $\alpha$ cuts off the less significant edges, which may discard the region of small weights. Finally, we analyze the cumulative node betweenness centrality distributions of the extracted backbones. It is worth mentioning that the node betweenness centrality in the backbone is taken as that in the original network. Figures 7(e) and 7(f) give the evolution of the cumulative betweenness centrality distribution with different $\alpha$. For both test graphs, the distribution starts from a very low value if BEHwII-h applies a very small value of $\alpha$, which implies that low-connected nodes are not included in the backbones.
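The cumulative distributions examined here are straightforward to reproduce; a minimal sketch of the cumulative degree distribution $P(k)$ (the input format is an assumption):

# Sketch: cumulative degree distribution P(k) = fraction of nodes with degree >= k.
from collections import Counter

def cumulative_degree_distribution(degrees):
    """degrees: list of node degrees in the (backbone) graph."""
    n = len(degrees)
    counts = Counter(degrees)
    cum, remaining = {}, n
    for k in sorted(counts):
        cum[k] = remaining / n        # fraction of nodes with degree >= k
        remaining -= counts[k]
    return cum

print(cumulative_degree_distribution([1, 1, 2, 3, 3, 5]))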

Therefore, we can conclude that values of $\alpha$ in an intermediate range are optimal, in the sense that backbones extracted by BEHwII-h in this region contain a large proportion of highly connected nodes and essential connections and have stable stationary degree/weight distributions compared with the original network. It is important to stress that BEHwII-h also includes the connections with the largest weights present in the network. This is because the heavy tail of the distribution is mainly determined by the relevant large-scale weights. This is clearly illustrated in Figures 7(c) and 7(d).

The Inside Filtering Rule. We further explore the critical factor that contributes to the success of BEHwII-h. As discussed in Section 3.1, BEHwII uses a strict filtering rule to absorb edges. Here, we relax this inside filtering rule by imposing an OR condition instead of an AND condition, so that a connection is preserved whenever its intensity is significant for at least one of the nodes involved. In this loose variant (denoted by BEHwII-loose), an edge $e_{ij}$ is preserved in the backbone if $\min(\alpha_{ij}, \alpha_{ji}) < \alpha$. We visualize the backbones of Lesmis and USAir97 extracted by BEHwII-loose in Figure 8. For each test network, we fix the significance level and the extraction goal as before. In the case of the Lesmis network, six outliers (highlighted by dashed circles) are extracted by BEHwII-loose, and it also fails to discover many essential connections. Obviously, its performance is worse than that of BEHwII-h, as can be seen by comparing Figures 8(a) and 5(a). BEHwII-loose makes progress in the case of USAir97, as most regions of the USA are covered by the extracted backbone shown in Figure 8(b). However, it still includes many small airports in Alaska and on the west coast of the USA (highlighted by dashed ellipses), as DF does.

5. Related Work

In the literature, the existing backbone extraction methods fall into two categories: coarse-graining-based methods and filter-based methods. The methods based on coarse graining clump nodes sharing common attributes together into the same group/community and then consider the whole group as a single unit in the new network. Some methods along this line include the box-covering technique [4], the fractal skeleton [7], and traditional community detection techniques such as the Kernighan-Lin algorithm [11], latent space models [12], stochastic block models [13], and modularity optimization [14]. The differences between these methods ultimately come down to the precise definition of a community. However, there is often no clear statement on whether properties of the initial network are preserved in the network of groups.

The filter-based methods typically employ a bottom-up strategy to extract the backbone. They often start by defining a statistical property of a node or an edge, and this property is then used as a criterion to determine which nodes/edges are to be preserved or discarded. In this case, the observation scale is fixed and the representation that the network symbolizes is not changed. Instead, those elements (nodes and edges) that carry relevant information about the network structure are kept while the rest are discarded. An example of a well-known hierarchical topological filter is the $k$-core decomposition [16], with a filtering rule that acts on the connectivity of the nodes. In the case of weighted networks, two basic reduction techniques are the extraction of the minimum spanning tree [17] and the application of a global threshold [18] on the edge weights, so that only the edges that beat the threshold are preserved. However, real-world weighted networks usually exhibit strongly disordered, heavy-tailed weight distributions, which makes it nontrivial to define the filtering property. Serrano et al. [8] addressed this challenge by introducing the disparity filter based on the null hypothesis.

In summary, although backbone extraction methods based on coarse graining and filtering have been extensively studied, they all need knowledge of the entire network. Further study is still needed to find a good balance between performance and efficiency. Our work attempts to fill this void by conducting backbone extraction with the efficient BEHwII method.

6. Conclusion

In this work, we propose a backbone extraction heuristic with incomplete information (BEHwII) to find the backbone in a complex weighted network. First, a strict filtering rule is carefully designed to determine which edges are to be preserved or discarded. Second, we present a local search model that examines a subset of edges in an iterative way, relying only on local/incomplete knowledge rather than the global view of the network. Experimental results on four real-life networks demonstrate the advantage of BEHwII over the classic disparity filter method in terms of both effectiveness and efficiency.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China (NSFC) under Grants 61103229 and 71372188, the National Center for International Joint Research on E-Business Information Processing under Grant 2013B01035, the National Key Technologies R&D Program of China under Grant 2013BAH16F01, the National Soft Science Research Program under Grant 2013GXS4B081, the Industry Projects in Jiangsu S&T Pillar Program under Grant BE2012185, and the Key/Surface Projects of Natural Science Research in Jiangsu Provincial Colleges and Universities under Grants 12KJA520001, 14KJA520001, and 14KJB520015.