Abstract

Network sampling has become an indispensable foundation for large-scale network analysis, and its effectiveness determines to a large extent the reliability and practicability of the subsequent analysis results. In this paper, we propose a network sampling algorithm inspired by an epidemic spreading model named the contact process. The contact process is similar to the random walk process but differs from it in two key respects. First, at each time step, a randomly selected sampled node, rather than the latest sampled node, is responsible for recruiting a new node from its neighborhood. Second, the responsible node recruits one of its neighbor nodes with a probability inversely proportional to the degree of this neighbor node, instead of with equal probability. Experiments on nine indiscriminately selected real-world networks show that, compared with seven classical sampling methods, our proposed sampling algorithm has a significant advantage in preserving two basic network properties: the degree distributions and the clustering coefficient distributions of the original networks.

1. Introduction

In recent decades, the rapid development of storage technology has allowed online social network (OSN) providers to store almost all the information that users generate every day. The analysis of OSNs is receiving considerable research attention from both the academic and industrial communities. However, in some scenarios, access restrictions are imposed on the network, so that it is hard or infeasible to study the whole network. In other scenarios, the network is available but too large to be stored and analyzed within a reasonable amount of memory and time. To address these problems, network sampling techniques have emerged to help us study and analyze real-world networks effectively and efficiently. The concept of network sampling can be simply described as follows. Given a network $G = (V, E)$ and the sampling ratio $\phi$, where $0 < \phi < 1$, the primary goal of network sampling is to construct a representative subnetwork $G_s = (V_s, E_s)$ which preserves the most important properties of the original network, where $V_s \subseteq V$, $E_s \subseteq E$, and $|V_s| = \phi |V|$.

Nowadays, network sampling has become an indispensable foundation for large-scale network analysis, and its effectiveness determines to a large extent the reliability and practicability of the subsequent network analysis results. Network sampling also has a wide spectrum of other applications, e.g., surveying hidden populations in sociology, visualizing social graphs, scaling down the Internet AS graph, and graph sparsification [1].

A large number of sampling techniques have been proposed in the past few decades, designed for various purposes and for preserving different network properties [2]. These sampling techniques can be categorized into two groups: random selection and network exploration techniques. In the first group, nodes or links are recruited in the sample uniformly at random or proportional to some particular characteristic like degree or PageRank values [3]. In the second group, the sample network starts from a randomly selected seed node and is expanded following the local connections of previously sampled nodes.

Leskovec and Faloutsos [4] showed that, among the typical network sampling methods, random walk (RW) and forest fire (FF) sampling have the best overall performance. More recently, Blagus et al. [3] empirically compared 11 representative network sampling methods on 12 real-world networks and concluded that breadth-first search (BFS) and random walk with subgraph induction (RWI) show the best overall performance in preserving the degree and clustering coefficient distributions of the original networks. Next, we briefly review these sampling methods, which serve as the baselines against which our proposed algorithm is compared.

Random walk (RW), forest fire (FF), and breadth-first search (BFS) are somewhat similar to each other. Initially, the sampled node set $V_s$ and edge set $E_s$ are both empty. At the first step, a randomly selected node is added into $V_s$. At each following step, for every node $v$ added into $V_s$ at the previous step, $m$ randomly selected neighbor nodes of $v$, say $u_1, \dots, u_m$, are added into $V_s$, and the corresponding edges $(v, u_1), \dots, (v, u_m)$ are added into $E_s$. For RW, $m = 1$; for FF, $m$ follows a geometric distribution with mean value $p_f / (1 - p_f)$, where we set $p_f$ to be 0.7 as suggested in [4]; for BFS, $m = k_v$, where $k_v$ is the degree of node $v$. This step repeats until the sampling size is reached; that is, $|V_s| = \phi |V|$, where $\phi$ is the sampling ratio. Another key difference is that the random walk is memoryless, so a visited node may be visited again in the future, whereas forest fire and breadth-first search never include repeated nodes.
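The common frontier-expansion scheme shared by RW, FF, and BFS can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function name `explore_sample`, the adjacency-dictionary input, and the restart rule applied when a forest fire dies out are all assumptions, and the graph is assumed to be connected, without self-loops or isolated nodes.

```python
import random


def explore_sample(adj, target, method, p_f=0.7, rng=None):
    """Frontier-based exploration sampling on adj = {node: [neighbors]}.

    method is "RW" (one neighbor per step, revisits allowed), "FF"
    (a geometric number of unvisited neighbors) or "BFS" (all unvisited
    neighbors). Returns (sampled_nodes, sampled_edges).
    """
    rng = rng or random.Random()
    seed = rng.choice(list(adj))
    nodes, edges = {seed}, set()
    frontier = [seed]
    while len(nodes) < target:
        if not frontier:
            # Assumption: if the exploration dies out, restart it from
            # a randomly chosen node that is already in the sample.
            frontier = [rng.choice(list(nodes))]
        next_frontier = []
        for v in frontier:
            if method == "RW":
                picks = [rng.choice(adj[v])]  # memoryless: may revisit
            else:
                fresh = [u for u in adj[v] if u not in nodes]
                rng.shuffle(fresh)
                if method == "FF":
                    # Number of successes before the first failure:
                    # geometric with mean p_f / (1 - p_f).
                    m = 0
                    while rng.random() < p_f:
                        m += 1
                    picks = fresh[:m]
                else:  # "BFS": take every unvisited neighbor
                    picks = fresh
            for u in picks:
                edges.add(frozenset((v, u)))
                nodes.add(u)
                next_frontier.append(u)
                if len(nodes) >= target:
                    return nodes, edges
        frontier = next_frontier
    return nodes, edges
```

Only the rule for choosing `picks` differs between the three methods, which mirrors the description above.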

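Besides the abovementioned methods, Metropolis–Hastings random walk (MHRW) has been shown to be a well-performing sampling method in the literature [5, 6]. It achieves a uniform stationary distribution over the sampled nodes by using the following transition probability from the current node $v$ to a node $w$:

$$P_{v,w} = \begin{cases} \min\left(\dfrac{1}{k_v}, \dfrac{1}{k_w}\right), & \text{if } w \text{ is a neighbor of } v, \\ 1 - \sum_{u \neq v} P_{v,u}, & \text{if } w = v, \\ 0, & \text{otherwise,} \end{cases}$$

where $k_v$ denotes the degree of node $v$.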
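As an illustration, the standard MHRW rule for a uniform target distribution proposes a uniformly chosen neighbor $w$ of the current node $v$ and accepts the move with probability $\min(1, k_v / k_w)$; otherwise the walk stays at $v$. The sketch below assumes an adjacency dictionary and a connected graph; the names `mhrw_step` and `mhrw_sample` are hypothetical.

```python
import random


def mhrw_step(adj, v, rng):
    """One Metropolis-Hastings step from node v on adj = {node: [neighbors]}."""
    w = rng.choice(adj[v])  # uniform proposal among the neighbors of v
    # Accept with probability min(1, k_v / k_w); otherwise stay at v.
    if rng.random() < min(1.0, len(adj[v]) / len(adj[w])):
        return w
    return v


def mhrw_sample(adj, target, rng=None):
    """Walk until `target` distinct nodes have been visited."""
    rng = rng or random.Random()
    v = rng.choice(list(adj))
    nodes = {v}
    while len(nodes) < target:
        v = mhrw_step(adj, v, rng)
        nodes.add(v)
    return nodes
```

Because moves toward high-degree nodes are rejected more often, the walk does not over-visit hubs the way a simple random walk does.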

Blagus et al. [3] proposed analyzing the sampling methods with subgraph induction, where the final sample network is constructed from a generated sample and all the existing edges between any two nodes of this sample. They empirically show that RW, FF, and MHRW with subgraph induction, named RWI, FFI, and MHRWI, improve the performance of the corresponding methods without subgraph induction. Therefore, RWI, FFI, and MHRWI are also used as baseline algorithms in this paper.
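Subgraph induction itself is a simple post-processing step. A minimal sketch, assuming the graph is stored as an adjacency dictionary (the helper name `induce_subgraph` is hypothetical):

```python
def induce_subgraph(adj, sampled_nodes):
    """Return every original edge whose two endpoints were both sampled.

    adj is {node: iterable of neighbors}; edges are returned as
    frozensets so that (u, v) and (v, u) count as the same edge.
    """
    sampled = set(sampled_nodes)
    edges = set()
    for u in sampled:
        for v in adj[u]:
            if v in sampled and v != u:
                edges.add(frozenset((u, v)))
    return edges
```

For example, a walk that visits the three nodes of a triangle along only two of its edges still yields all three edges after induction.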

In this paper, we propose a network sampling algorithm inspired by an epidemic spreading model, the contact process; we therefore call it contact process sampling (CPS). It is similar to RW but differs from it in two key respects. First, at each time step, a randomly selected sampled node, rather than the latest sampled node, is responsible for recruiting a new node from its neighborhood. Second, a sampled node chooses one of its neighbor nodes with a probability inversely proportional to the degree of this neighbor node, instead of with equal probability.

The rest of this paper is organized as follows. Section 2 describes the contact process and introduces the CPS algorithm. Section 3 compares the sampling quality of CPS with that of the aforementioned well-performing sampling methods. Section 4 concludes the work and makes some remarks.

2. Proposed Model

The contact process, which was first proposed as a susceptible-infected-susceptible (SIS) model for epidemic spreading, has found wide applications in science and engineering [7]. A general contact process on a network is described as follows. Initially, a set of nodes are infected by a virus (or carry a piece of information), and the other nodes on the network are susceptible (not infected). At each time step, an infected node is chosen at random, say node $i$. With probability $p$, the virus on node $i$ dies, and node $i$ becomes susceptible again; with probability $1 - p$, the virus on node $i$ selects one neighbor of $i$ to contact, say $j$. If $j$ is already infected, nothing happens; if $j$ is susceptible, it gets infected.
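A single time step of this process can be sketched as below. The sketch assumes that the contacted neighbor is chosen uniformly at random, which is only one possible contact rule, and the function name `contact_process_step` and the adjacency-dictionary input are hypothetical.

```python
import random


def contact_process_step(adj, infected, p, rng):
    """Advance the contact process by one step, updating `infected` in place.

    adj is {node: [neighbors]}, infected is a set of nodes, and p is the
    probability that the virus on the chosen node dies.
    """
    if not infected:
        return  # the epidemic has died out
    i = rng.choice(list(infected))  # pick an infected node at random
    if rng.random() < p:
        infected.discard(i)  # node i becomes susceptible again
    else:
        # Assumption: uniform choice among neighbors of i.
        j = rng.choice(adj[i])
        infected.add(j)  # no effect if j is already infected
```

With `p = 0` the infected set can only grow, which is the regime that a sampling procedure would want.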

In such a contact process, the fraction of infected nodes on a given network in the ultimate steady state depends on two critical factors: the aforementioned death rate $p$ and the contact probability $W(k)$, which is the probability that an infected node chooses a neighbor node of degree $k$ to contact. Yang et al. [7] proved that, when $p$ is smaller than the threshold value and the contact probability takes the form $W(k) \sim k^{\beta}$, the fraction of infected nodes in the ultimate steady state is maximized when $\beta = -1$.

In this paper, we propose a network sampling algorithm named contact process sampling (CPS), which employs a process analogous to the contact process across the network. In order to obtain a sample network of the required size as quickly as possible, we eliminate the effect of the death rate from our CPS model by setting it to 0. To ensure the connectedness of the sample network, the CPS algorithm starts from only one node.

The CPS algorithm is presented in Algorithm 1. Initially, a randomly chosen node is recruited into the sample set. At each time step, a sampled node is chosen at random, say node $i$. Following the conclusion of Yang et al. [7], node $i$ chooses a neighbor node $j$ to recruit into the sample set with probability $k_j^{-1} / \sum_{l \in \Gamma_i} k_l^{-1}$, where $k_j$ is the degree of node $j$ and $\Gamma_i$ represents the set of $i$'s neighbor nodes. This recruitment step repeats until the sample set contains $\phi |V|$ distinct nodes, where $\phi$ is the sample ratio. Then, we construct the final sample network from these sampled nodes and all the links that connect any two of them in the original network.

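Input: an undirected and unweighted graph $G = (V, E)$; the sample ratio $\phi$, where $0 < \phi < 1$;
Output: a sample graph $G_s = (V_s, E_s)$, where $V_s \subseteq V$, $E_s \subseteq E$, and $|V_s| = \phi |V|$;
(1) Randomly select one node $v_0 \in V$, and let $V_s = \{v_0\}$ and $E_s = \emptyset$;
(2) while $|V_s| < \phi |V|$ do
(3)     Randomly select one node $i$ from $V_s$;
(4)     Select one node $j$ from the neighborhood of $i$, with probability inversely proportional to the degree of $j$;
(5)     $V_s = V_s \cup \{j\}$;
(6) end while
(7) for $(u, v) \in E$ do
(8)     if $u \in V_s$ and $v \in V_s$ then
(9)         $E_s = E_s \cup \{(u, v)\}$;
(10)     end if
(11) end for
(12) return $G_s = (V_s, E_s)$;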
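The CPS procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the graph is an adjacency dictionary of a connected graph with at least two nodes, the target size is obtained by rounding the product of the sample ratio and the number of nodes, and the name `cps_sample` is hypothetical.

```python
import random


def cps_sample(adj, ratio, rng=None):
    """Contact process sampling on adj = {node: [neighbors]}.

    Returns (sampled_nodes, induced_edges).
    """
    rng = rng or random.Random()
    # Assumption: round ratio * |V| to the nearest integer, at least 1.
    target = max(1, round(ratio * len(adj)))
    seed = rng.choice(list(adj))
    sampled = [seed]     # list, so a uniform choice is cheap
    in_sample = {seed}   # set, for fast membership tests
    while len(sampled) < target:
        i = rng.choice(sampled)  # any sampled node may recruit
        nbrs = adj[i]
        # Neighbor j is chosen with probability proportional to 1 / k_j.
        weights = [1.0 / len(adj[j]) for j in nbrs]
        j = rng.choices(nbrs, weights=weights, k=1)[0]
        if j not in in_sample:
            in_sample.add(j)
            sampled.append(j)
    # Subgraph induction: keep every original edge between sampled nodes.
    edges = {
        frozenset((u, v))
        for u in in_sample
        for v in adj[u]
        if v in in_sample
    }
    return in_sample, edges
```

Since every recruited node is a neighbor of an already sampled node, the induced sample is connected by construction.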

3. Performance Evaluation

3.1. Datasets

Nine indiscriminately selected real-world networks from KONECT [8] are employed to test the performance of sampling models. They are all undirected and unweighted networks, and their basic statistics are presented in Table 1. The fill of a network is the proportion of edges to the total number of possible edges. The global clustering coefficient is defined as the probability that two incident edges are completed by a third edge to form a triangle. Assortativity is defined as the Pearson correlation coefficient between the degrees of connected nodes. For other characteristics of these networks, the reader is referred to KONECT [8].
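Two of these statistics, the fill and the assortativity, can be computed directly from an adjacency structure. The sketch below assumes a simple undirected graph stored as an adjacency dictionary with at least two distinct degree values; the function names are hypothetical and are not taken from KONECT.

```python
def fill(adj):
    """Number of edges divided by the number of possible edges."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # each edge counted twice
    return m / (n * (n - 1) / 2)


def degree_assortativity(adj):
    """Pearson correlation between the degrees at the two ends of an edge.

    Every edge is counted in both directions, so both ends share the
    same mean and variance. Undefined (division by zero) when all
    nodes have the same degree.
    """
    xs, ys = [], []
    for u, nbrs in adj.items():
        for v in nbrs:
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    return cov / var
```

A star graph, in which a high-degree center is linked only to degree-one leaves, is a perfectly disassortative example.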

In this paper, we consider the sample ratio ranging from to of original networks (by step of in and in ) as suggested in [3]. For each network, we perform 30 realizations of each sampling technique and each sample ratio. For each run of the exploration techniques, the sample starts from a randomly selected new seed node.

3.2. Evaluation Measures

To evaluate the sampling algorithms, two well-known and widely used network statistics are used to measure the representativeness of the sampled network: the degree distribution (DD) as a global statistical property and the clustering coefficient distribution (CCD) as a local statistical property. The DD of a network refers to the probability distribution of the degrees of all nodes in the network [9] and is represented by $P(k)$, the fraction of nodes of degree $k$. The clustering coefficient of a node in a network is the proportion of pairs of that node's neighbors that are connected to each other, and the CCD of a network refers to the probability distribution of the clustering coefficients of all nodes in the network [10].
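Both statistics are straightforward to compute. A minimal sketch, assuming a simple undirected graph stored as an adjacency dictionary without duplicate neighbors (the function names are hypothetical):

```python
from collections import Counter


def degree_distribution(adj):
    """Return {k: fraction of nodes with degree k}."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: no neighbor pairs, so coefficient is 0
    links = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if nbrs[b] in adj[nbrs[a]]
    )
    return 2.0 * links / (k * (k - 1))
```

The CCD is then the empirical distribution of `local_clustering(adj, v)` over all nodes `v`.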

We compare the DD and CCD of the sample network and the original network by the Kolmogorov–Smirnov D-statistic (KSD). KSD measures the agreement between two cumulative distribution functions [11]: the original distribution $F$ and the estimated distribution $F'$. It is defined as $D = \max_x |F(x) - F'(x)|$, where $x$ ranges over the values of the random variable. Clearly, $D$ is a value between 0 and 1; the closer it is to zero, the higher the similarity between the two distributions. Note that KSD does not address the issue of scaling but rather compares the shape of the (normalized) distributions [4].
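The D-statistic between two empirical distributions can be computed as in the following sketch, which assumes two non-empty lists of observed values (for example, node degrees in the original and in the sample network); the name `ks_distance` is hypothetical.

```python
from bisect import bisect_right


def ks_distance(values_a, values_b):
    """Maximum gap between the empirical CDFs of two non-empty samples."""
    a, b = sorted(values_a), sorted(values_b)
    support = sorted(set(a) | set(b))
    # bisect_right(a, x) counts the observations in a that are <= x.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in support
    )
```

Identical samples give a distance of 0, and samples with disjoint, fully separated supports give a distance of 1.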

3.3. Algorithm Comparison

The comparison of sampling techniques based on degree distribution is shown in Figure 1. We can see that, in most datasets, the techniques without subgraph induction (RW, FF, and MHRW) perform significantly differently from the other methods. This group of techniques approximates the degree distribution of the original networks with a larger deviation than the others (except for PowerGrid and Douban). This observation reinforces the conclusion of Blagus et al. [3] that subgraph induction improves the performance of the corresponding techniques without it.

As for the performance of the techniques with subgraph induction, the nine datasets can be categorized into two groups. In the first group of datasets (PowerGrid, Amazon, WordNet, AstroPh, and Livemocha), the techniques with subgraph induction perform similarly to each other; our proposed CPS algorithm is the best in three datasets (WordNet, AstroPh, and Livemocha) and differs negligibly from the best ones in the other two datasets. In the second group of datasets (Gowalla, Brightkite, Douban, and Flickr), the techniques with subgraph induction perform very differently from each other. Our proposed CPS algorithm shows a significant advantage in three datasets (Gowalla, Brightkite, and Douban) and is the second best in Flickr. In general, our proposed CPS algorithm is the best-performing technique in preserving the degree distribution of the original networks.

The comparison of sampling techniques based on clustering coefficient distribution is shown in Figure 2. As with the degree distribution, the techniques with subgraph induction perform better than the corresponding techniques without subgraph induction in most datasets (except for PowerGrid and Douban). Unless otherwise specified, the three techniques without subgraph induction (RW, FF, and MHRW) are excluded from the following discussion.

The five techniques with subgraph induction have a similar declining shape of KSD plot in 3 datasets (PowerGrid, Amazon, and Flickr), where the CPS algorithm is comparable to the other techniques. In contrast, in the other 6 networks, the KSD plots of the RWI, FFI, and BFS algorithms begin to increase as the sampling ratio grows, except for BFS in WordNet and Gowalla. The KSD plots of the CPS algorithm, however, always decline in these 6 networks, and the KSD values are very small compared with those of the RWI, FFI, and BFS algorithms. The performance of MHRWI is intermediate between that of CPS and those of RWI, FFI, and BFS: its KSD values are closer to those of CPS, whereas the shape of its KSD plot resembles those of RWI, FFI, and BFS. In general, the CPS algorithm has the best overall performance in preserving the clustering coefficient distribution of the original networks.

To quantitatively demonstrate the superiority of CPS over the other methods, Tables 2 and 3 present the KSD values of DD and CCD produced by 8 sampling techniques on 9 datasets when the sampling ratio is as suggested in [3], where the best and second-best values for every dataset are highlighted in bold type. For DD, the CPS algorithm is the best in 6 out of 9 datasets, and for CCD, the CPS algorithm is the best in 5 out of 9 datasets. Other than CPS, no algorithm is ranked first in more than 2 datasets, whether for DD, for CCD, or for both. Recall that the nine real-world network datasets are indiscriminately selected from KONECT [8]. We conclude that the CPS algorithm has a significant advantage in preserving the degree distributions and clustering coefficient distributions of the original networks.

4. Concluding Remarks

In this paper, we proposed a network sampling strategy inspired by the contact process and empirically validated its superior performance in preserving two important structural properties of the original networks. Although it is somewhat similar to random walk sampling, two key differences from RW enable it to produce better sample networks than RW and several typical RW-variant sampling methods.

Much work remains to be done in the future. First, testing how well the CPS algorithm preserves other properties of the original network would be useful to show possible limits of its applicability. Second, one should also investigate the characteristics of some datasets, typically Douban, where the sampling methods without subgraph induction perform better than the ones with subgraph induction. Last but not least, the functional approximation of a sample network is worthy of exploration. For example, the comparison of the epidemic spreading process on sample and original networks may be the topic of our next work [12].

Data Availability

Readers can access all 9 datasets supporting the conclusions of this study from http://konect.cc/. The details are as follows: PowerGrid, http://konect.cc/networks/opsahl-powergrid/; Amazon, http://konect.cc/networks/com-amazon/; WordNet, http://konect.cc/networks/wordnet-words/; AstroPh, http://konect.cc/networks/ca-AstroPh/; Livemocha, http://konect.cc/networks/livemocha/; Gowalla, http://konect.cc/networks/loc-gowalla_edges/; Brightkite, http://konect.cc/networks/loc-brightkite_edges/; Douban, http://konect.cc/networks/douban/; and Flickr, http://konect.cc/networks/flickrEdges/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.