Abstract

Link prediction uses observed data to predict future or potential relations in complex networks. An underlying hypothesis is that two nodes have a high likelihood of connecting together if they share many common characteristics. The key issue is to develop different similarity-evaluating approaches. However, in this paper, by characterizing the differences of the similarity scores of existing and nonexisting links, we find an interesting phenomenon that two nodes with some particular low similarity scores also have a high probability to connect together. Thus, we put forward a new framework that utilizes an optimal one-variable function to adjust the similarity scores of two nodes. Theoretical analysis suggests that more links of low similarity scores (long-range links) could be predicted correctly by our method without losing accuracy. Experiments in real networks reveal that our framework not only enhances the precision significantly but also predicts more long-range links than state-of-the-art methods, which deepens our understanding of the structure of complex networks.

1. Introduction

Modern science and engineering techniques have increased our access to various kinds of data, including online social networks, scientific collaboration networks, and power grid networks [1–5]. Many interesting phenomena can be uncovered from these networks. For example, analyzing data from Facebook and Twitter helps users find lost friends merely by counting common friends [6, 7], and similar analyses drive recommendation systems in online stores [8, 9]. Restricted by instrument accuracy and other obstacles, we can only obtain a small fraction or a snapshot of the complete networks [10, 11], which prompts us to filter the information in complex networks [12–14]. Link prediction is a straightforward approach to recovering networks by predicting missing links and identifying spurious links [15–17]. Thus great efforts have been devoted to link prediction in recent years [16, 18]. Link prediction is used in different kinds of networks, including unipartite networks and bipartite networks, where unipartite networks consist of nodes of the same type (e.g., social networks and neural networks) and bipartite networks consist of nodes of two types (e.g., user-object purchasing networks and user-movie networks) [19, 20].

In classical link prediction approaches, similarity scores are first computed for each pair of disconnected nodes, and then the nonexisting links at the top of the score list are predicted as potential ones [16]. Consequently, the key issue is to find effective score-assigning methods, which are mainly divided into three categories [16, 21]: similarity-based algorithms, Bayesian algorithms, and maximum likelihood algorithms. First, similarity-based algorithms [22–24] suppose that similar nodes have a high probability of linking together. Similarities are evaluated by common neighbors, random walks, resource allocation, and some other local and global indices. Second, Bayesian algorithms [25–27] abstract the joint probability distribution from the observed networks and then utilize conditional probability to estimate the likelihood of a nonexisting link. Third, maximum likelihood algorithms [28, 29] presuppose that some underlying principles rule the structure of a network, with the detailed rules and specific parameters obtained by maximum likelihood estimation. Scores of nonexisting links are then derived from these principles. Most of these methods favor predicting links with high similarity scores and perform badly in the detection of long-range links with low similarities.

In the aforementioned methods, the basic hypothesis that two nodes with a high similarity score have a high likelihood of connecting has not been examined in depth. Recent works have demonstrated that long-range links exist extensively in complex networks and play an important role in routing, epidemic spreading, and other dynamics [30, 31]. However, in practice, the endpoints of a long-range link usually have weak interaction and low similarity [30], which prevents the detection of long-range links by traditional methods [32, 33]. Hence, it is of great importance to study the structural patterns underlying the networks.

Our study takes a different but complementary approach to the link prediction problem. By analyzing the score distributions of existing and nonexisting links separately, we find an interesting phenomenon: existing and nonexisting links follow different connecting patterns with respect to their similarity scores. Then, inspired by precision-recall curves [34–36], we propose a metric, named precision-to-noise ratio (PNR), to characterize the ability to distinguish potential links at different scores. PNR describes the local precision of a given set of links with the same score. Based on PNR, we put forward a novel framework that applies a one-variable function to adjust the scores of a given method. We argue that the framework finds the optimal transforming function, which exploits the full capacity of traditional link prediction methods and improves their performance both on precision and on the detection of long-range links. Experiments in six real-world networks demonstrate the effectiveness of our method.

The rest of the paper is organized as follows. In Section 2, we first briefly describe the link prediction problem and then introduce our proposed method. In Section 3, we compare the performance of our method with that of the classical methods. Finally, the conclusion is given in Section 4.

2. Materials and Methods

We give the link prediction formalism in Section 2.1 and the baseline methods in Section 2.2. Our proposed framework is introduced in Section 2.3.

2.1. Network Formation and Metrics

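Consider a network $G(V, E)$ with node set $V$, link set $E$, and adjacency matrix $A$, where $a_{ij} = 1$ if node $i$ connects to node $j$; otherwise, $a_{ij} = 0$. When evaluating the prediction performance, we usually divide the links randomly into a training set $E^T$ and a probe set $E^P$ (with probe fraction $p^H = |E^P|/|E|$), with $E^T \cup E^P = E$ and $E^T \cap E^P = \emptyset$. The goal is to accurately predict the links in the probe set using only the information in the training set.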

We first assign a score to each nonexisting link and then choose the $L$ links with the highest scores as potential ones. State-of-the-art similarity evaluation methods can be utilized to carry out link prediction, including common neighbors (CN), Jaccard index (JB), resource allocation index (RA), local path index (LP), and structural perturbation method (SPM) (see Section 2.2 and [38]).

There are two popular metrics to characterize the accuracy: area under the receiver operating characteristic curve (AUC) [39] and the precision [40, 41]. AUC can be interpreted as the probability that a randomly chosen missing link (i.e., a link in $E^P$) has a higher score than a randomly chosen nonexisting link. In practice, AUC is estimated from $n$ independent comparisons: each time, we randomly choose a real link and a nonexisting link and compare their scores. After $n$ comparisons, we record $n'$ times where the real link has the higher score and $n''$ times where the two links have the same score. The final AUC is calculated as

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}. \tag{1}$$

If all the scores are drawn from an independent and identical distribution, then AUC should be around 0.5. A higher AUC corresponds to a more accurate prediction.
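The sampling procedure behind (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `sampled_auc`, the default number of comparisons, and the fixed seed are assumptions made here.

```python
import random

def sampled_auc(probe_scores, nonexisting_scores, n=10000, seed=0):
    """Estimate AUC from n random (probe link, nonexisting link) comparisons."""
    rng = random.Random(seed)
    higher = 0  # n': comparisons where the probe link has the higher score
    equal = 0   # n'': comparisons where both links have the same score
    for _ in range(n):
        s_probe = rng.choice(probe_scores)
        s_non = rng.choice(nonexisting_scores)
        if s_probe > s_non:
            higher += 1
        elif s_probe == s_non:
            equal += 1
    return (higher + 0.5 * equal) / n
```

When every probe link outscores every nonexisting link the estimate is 1, and when all scores are identical it is 0.5, matching the random baseline described above.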

Another metric is precision, which characterizes the ratio of correctly predicted links in a given prediction list. That is to say, if the length of the prediction list is $L$, among which $L_r$ links are correctly predicted, then the precision is

$$P = \frac{L_r}{L}. \tag{2}$$

Clearly, higher precision means higher prediction accuracy. Intuitively, a more accurate method should achieve both higher AUC and higher precision. In the experiments, however, we will see that precision has little correlation with AUC and that improving the precision may not result in an improvement of AUC.
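As a small illustration of the precision metric, the following Python sketch ranks candidate links by score and counts how many of the top $L$ fall in the probe set. The function name and the dictionary-based data layout are assumptions for this example; ties are broken by insertion order, which the paper does not specify.

```python
def precision_at_L(scores, probe_links, L):
    """Fraction of the L highest-scored candidate links that are in the probe set.

    scores: dict mapping a node pair (i, j) to its similarity score, for links
            that do not exist in the training set.
    probe_links: set of node pairs held out as the probe set.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:L]
    hits = sum(1 for link in ranked if link in probe_links)  # L_r
    return hits / L
```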

2.2. Baseline Prediction Methods

There exists a large number of score-assigning approaches for the link prediction problem, and all of them could be introduced into our framework. Though we only investigate some state-of-the-art score-assigning approaches, the results and conclusions are also applicable to other score-assigning methods. The five score-assigning approaches [6, 16] are as follows.

(i) Common Neighbor (CN). This index supposes that if two nodes $i$ and $j$ have more common neighbors, they are more likely to connect. The neighborhood overlap of the two nodes is

$$s_{ij}^{\mathrm{CN}} = |\Gamma(i) \cap \Gamma(j)|, \tag{3}$$

where $\Gamma(i)$ is the neighbor set of node $i$ and $|\cdot|$ indicates the size of a set. The drawback of CN is that it favors large-degree nodes: even when the similarity of two large-degree nodes is low, they may still have many common neighbors.

(ii) Jaccard Coefficient (JB). Jaccard is a conventional similarity metric that aims to suppress the influence of large-degree nodes:

$$s_{ij}^{\mathrm{JB}} = \frac{|\Gamma(i) \cap \Gamma(j)|}{|\Gamma(i) \cup \Gamma(j)|}, \tag{4}$$

where $\Gamma(i)$ is the neighbor set of node $i$. Since the similarity is normalized by the size of the union of the two nodes' neighbor sets, two large-degree nodes can still have a low similarity even though they may have many common neighbors.

(iii) Resource Allocation (RA). This index is inspired by the resource allocation dynamics in complex networks. Given a pair of unconnected nodes $i$ and $j$, suppose that node $i$ needs to allocate some resource to $j$, using their common neighbors as transmitters. Each transmitter (common neighbor) starts with a single unit of resource and then distributes it equally among all its neighbors. The similarity between $i$ and $j$ can be calculated as the amount of resource received from their common neighbors:

$$s_{ij}^{\mathrm{RA}} = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{k_z}, \tag{5}$$

where $\Gamma(i)$ is the neighbor set of node $i$ and $k_z$ is the degree of node $z$. Compared with the Jaccard index, RA also suppresses the influence of large-degree nodes, but in a finer-grained way: different common neighbors contribute differently to the similarity. If two nodes both connect to a low-degree node, they are more likely to share common interests or characteristics. In contrast, many node pairs share high-degree neighbors, so high-degree common neighbors carry little weight when evaluating similarity. Based on the same idea, the Adamic-Adar (AA) index is obtained by using $\log k_z$ instead of $k_z$ in (5).

(iv) Local Path (LP). CN considers the intersection of neighborhoods, which amounts to counting the paths of length 2 (one intermediate node) between two nodes. LP takes a more general view of paths by also counting the paths of length 3 (two intermediate nodes):

$$S^{\mathrm{LP}} = A^2 + \epsilon A^3, \tag{6}$$

where $A$ is the adjacency matrix of the network and $\epsilon$ is a small positive number. LP supposes that paths of length 2 contribute more to the similarity than paths of length 3. LP consists of the low-order terms of the Katz index ($S^{\mathrm{Katz}} = \sum_{l=1}^{\infty} \beta^{l} A^{l}$), but has a much lower computational complexity.

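(v) Structural Perturbation Method (SPM). Lü et al. [6] suppose that the network structure remains consistent after a small random perturbation. In SPM, the training set $E^T$ is divided into a small perturbation set $\Delta E$ and the remaining set $E^R$ ($E^T = E^R \cup \Delta E$), with adjacency matrices $A^T$, $\Delta A$, and $A^R$, respectively. $A^T$ is assumed to have similar eigenvectors to $A^R$, but different eigenvalues. For the $k$th largest eigenvalues of $A^T$ and $A^R$, the difference is approximated to first order as

$$\Delta\lambda_k \approx \frac{x_k^{\top} \Delta A\, x_k}{x_k^{\top} x_k}, \tag{7}$$

where $x_k$ is the eigenvector of $A^R$ corresponding to its eigenvalue $\lambda_k$. The similarity matrix is

$$\tilde{A} = \sum_{k} (\lambda_k + \Delta\lambda_k)\, x_k x_k^{\top}. \tag{8}$$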

SPM first divides a network into a training set and a probe set and then further divides the training set into a perturbation set and the remaining set. For a given division into training and probe sets, we average (8) over 10 independent perturbations to obtain the similarity matrix.
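The four neighborhood-based indices (CN, JB, RA, AA) and LP can be sketched in plain Python on a neighbor-set representation of the training network. This is an illustrative sketch, not the authors' implementation: the function names, the dictionary layout, and the default `eps=0.01` are assumptions. SPM is left out because it requires an eigendecomposition.

```python
import math

def neighbor_sets(edges):
    """Build {node: set of neighbors} from an undirected edge list."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs

def cn(nbrs, i, j):
    """Common neighbors: size of the neighborhood overlap."""
    return len(nbrs[i] & nbrs[j])

def jaccard(nbrs, i, j):
    """Overlap normalized by the size of the union of the neighborhoods."""
    union = nbrs[i] | nbrs[j]
    return len(nbrs[i] & nbrs[j]) / len(union) if union else 0.0

def ra(nbrs, i, j):
    """Resource allocation: each common neighbor z contributes 1 / k_z."""
    return sum(1.0 / len(nbrs[z]) for z in nbrs[i] & nbrs[j])

def aa(nbrs, i, j):
    """Adamic-Adar: each common neighbor z contributes 1 / log(k_z).

    A common neighbor has degree >= 2, so the logarithm is positive.
    """
    return sum(1.0 / math.log(len(nbrs[z])) for z in nbrs[i] & nbrs[j])

def lp(nbrs, i, j, eps=0.01):
    """Local path: (A^2)_ij + eps * (A^3)_ij, counted without matrices."""
    walks2 = len(nbrs[i] & nbrs[j])                        # (A^2)_ij
    walks3 = sum(len(nbrs[u] & nbrs[j]) for u in nbrs[i])  # (A^3)_ij
    return walks2 + eps * walks3
```

For example, with edges `[(0, 2), (1, 2), (0, 3), (1, 3), (3, 4)]`, nodes 0 and 1 share neighbors 2 and 3, so CN gives 2 while RA gives 1/2 + 1/3, because node 3 has the larger degree and therefore contributes less.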

For similarity-evaluating methods beyond the five introduced above, please refer to [42, 43].

2.3. The Proposed Method

We start our framework by reinvestigating the definition of precision. Suppose that $s_{ij}$ is the similarity score of nodes $i$ and $j$ obtained by a prediction method based only on the training set $E^T$, $p_e(s)$ is the probability that a randomly chosen existing link in the training set has score $s$, and $p_n(s)$ is the probability that a randomly chosen nonexisting link in the training set has score $s$. Because the training and probe sets are divided randomly, links in the probe set should, with high confidence, follow the same similarity distribution as links in the training set, according to the law of large numbers [44, 45]. Thus we do not differentiate between the similarity distributions of existing links in the training and probe sets in the rest of the paper. The assumption is reasonable according to statistical theory if the sample size goes to infinity [44, 45]. Since classical methods only predict links with high scores, the estimated precision of a method is written as

$$P = C\,\frac{\int_{s_c}^{s_{\max}} p_e(s)\,ds}{\int_{s_c}^{s_{\max}} p_n(s)\,ds}, \qquad C = \frac{|E^P|}{|U| - |E^T|}, \tag{9}$$

where $|E^P|$ is the size of the probe set $E^P$, $C$ is a constant, and $U$ is the whole set of all possible links ($|U| = N(N-1)/2$ for a network of $N$ nodes). $s_{\max}$ is the maximum score and $s_c$ is the lowest predicted score. In real scenarios, the length $L$ of the prediction list is usually the size of the probe set [16], which requires $s_c$ to satisfy $(|U| - |E^T|)\int_{s_c}^{s_{\max}} p_n(s)\,ds = L$. If $p_e(s) = 0$ at $s \ge s_c$, the precision $P = 0$. Otherwise, $p_e(s) \gg p_n(s)$ gives rise to a high precision. Since only the links with the top-$L$ scores are predicted as potential links, precision can equivalently be calculated by (2) [6, 16], which is a much simpler formula than (9).

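Most previous link prediction methods only predict links with high similarity scores. We generalize (9) by considering links of different similarities. Supposing that links with scores $s \in S$ are predicted as potential links, the precision is as follows:

$$P = C\,\frac{\int_{s \in S} p_e(s)\,ds}{\int_{s \in S} p_n(s)\,ds}, \tag{10}$$

where $S \subseteq [0, s_{\max}]$, $p_e(s)$ and $p_n(s)$ are the score distributions of existing and nonexisting links in the training set, and $C$ is the constant in (9). To confine the length of the prediction list, a precondition requires that the number of nonexisting links with scores in $S$ equals $L$. Note that, in most previous works, $S = [s_c, s_{\max}]$ for some threshold score $s_c$, and (10) reduces to (9). Our generalized precision (10) considers links with both high and low scores.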

The main concern is to select an appropriate set $S$ in (10) to maximize the precision. We propose the precision-to-noise ratio (PNR) to determine $S$,

$$\mathrm{PNR}(s) = \frac{p_e(s)}{p_n(s)}, \tag{11}$$

where PNR measures the ability to distinguish real links among links with the same score. Note that a nonexisting link in the training set may be an existing link in the probe set. Given a nonexisting link in the training set with similarity $s$, the probability that it is an existing link in the probe set (i.e., the precision) is $C \cdot \mathrm{PNR}(s)$, where $C$ is a constant.
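The reason the local precision is proportional to PNR can be spelled out by counting links in a small score interval. The derivation below is a sketch under the assumption stated in this section, namely that probe links follow the same score distribution $p_e(s)$ as the existing training links; the symbols $U$ (all node pairs), $E^T$, and $E^P$ are notation assumed here.

```latex
% Assumed notation: U is the set of all node pairs, E^T the training set,
% E^P the probe set, p_e and p_n the score densities of existing and
% nonexisting links in the training set.
\begin{align*}
\#\{\text{probe links with score in } [s, s+ds]\}
  &\approx |E^P|\, p_e(s)\, ds, \\
\#\{\text{nonexisting training links with score in } [s, s+ds]\}
  &= (|U| - |E^T|)\, p_n(s)\, ds, \\
\Pr(\text{link is in } E^P \mid \text{score } s)
  &\approx \frac{|E^P|\, p_e(s)}{(|U| - |E^T|)\, p_n(s)}
   = C \cdot \mathrm{PNR}(s),
\qquad C = \frac{|E^P|}{|U| - |E^T|}.
\end{align*}
```

Because $C$ does not depend on $s$, ranking candidate links by $\mathrm{PNR}(s)$ is the same as ranking them by their estimated probability of being probe links.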

The central issue of our framework is to use PNR to determine the optimal score set $S$. First, we calculate the similarity scores of all links based only on the training set by a traditional method. Second, $p_e(s)$, $p_n(s)$, and PNR are computed. Third, we reassign the score of each link as $s'_{ij} = \mathrm{PNR}(s_{ij})$, where $s_{ij}$ is the original similarity score from the first step. Finally, we sort the links in descending order of $s'_{ij}$, and the links with the top-$L$ scores are predicted as potential links [16, 18]. The optimal score set $S$ corresponds to the original similarity scores whose reassigned scores rank in the top-$L$ score list.

Different kinds of similarity evaluations could be introduced into the framework. Taking the CN similarity method as an example, our framework is as follows:

(1) Divide the links of a network into a training set and a probe set randomly.

(2) Calculate the similarity scores of all existing and nonexisting links by the CN method according to the training set only.

(3) Calculate PNR. Divide the scores into uniform bins and count how many existing ($n_e$) and nonexisting ($n_n$) links fall in each bin (i.e., calculate the discrete $p_e(s)$ and $p_n(s)$). Then we obtain $\mathrm{PNR}(s) = p_e(s)/p_n(s)$. Note that if $p_n(s) = 0$, we define $\mathrm{PNR}(s) = 0$.

(4) Obtain the readjusted scores of the nonexisting links in the training set by $s'_{ij} = \mathrm{PNR}(s_{ij})$.

(5) Determine the prediction list by choosing the $L$ links with the highest scores $s'_{ij}$, and calculate the precision.
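Steps (3) to (5) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the default of 10 uniform bins, and the tie-breaking by insertion order are choices made here, and any similarity method can supply the input scores.

```python
from collections import Counter

def pnr_rescore(existing_scores, nonexisting_scores, n_bins=10):
    """Replace each candidate link's score by the PNR of its score bin.

    existing_scores: list of scores of the links in the training set.
    nonexisting_scores: dict mapping a node pair to its score, for pairs
                        that are not linked in the training set.
    """
    all_scores = list(existing_scores) + list(nonexisting_scores.values())
    lo, hi = min(all_scores), max(all_scores)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all scores are equal

    def bin_of(s):
        # The maximum score is clamped into the last bin.
        return min(int((s - lo) / width), n_bins - 1)

    n_e = Counter(bin_of(s) for s in existing_scores)              # existing links per bin
    n_n = Counter(bin_of(s) for s in nonexisting_scores.values())  # nonexisting links per bin
    pnr = {}
    for b in range(n_bins):
        p_e = n_e[b] / len(existing_scores)
        p_n = n_n[b] / len(nonexisting_scores)
        pnr[b] = p_e / p_n if n_n[b] > 0 else 0.0  # PNR is defined as 0 when p_n = 0
    return {link: pnr[bin_of(s)] for link, s in nonexisting_scores.items()}

def predict_top_L(rescored, L):
    """Return the L candidate links with the highest readjusted scores."""
    return sorted(rescored, key=rescored.get, reverse=True)[:L]
```

In a toy case where three of four existing links have score 1 but only one of four candidate links does, the low-score bin receives PNR 3 while the high-score bin receives 1/3, so the low-similarity candidate is ranked first. This is the situation the framework is designed to exploit.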

Figure 1 depicts the proposed framework based on the CN method. After obtaining the similarity scores of links (Figure 1(a)→1(b)), the traditional CN method directly predicts potential links according to the scores (Figure 1(b)→1(d)), while the proposed framework calculates PNR (Figure 1(b)→1(c)) and then predicts potential links according to the modified scores (Figure 1(c)→1(d)).

An important property of our framework is that if the transforming function $f$ is determined according to PNR, that is, $f(s) = \mathrm{PNR}(s)$, the precision exploits the full capacity of a given similarity-evaluating method. In other words, PNR is the optimal transforming function $f$: no matter how we transform the similarity by another one-variable function $g(s)$, the precision obtained with $g$ cannot exceed that obtained with PNR. For the proof of the optimal $f$, please see part I of the supplementary materials.
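The optimality claim can be checked numerically on a toy example. The sketch below is not the proof from the supplementary materials; it is a brute-force check on hypothetical bin counts, assuming that links inside one score bin are indistinguishable, so that a partially used bin is sampled uniformly. Any one-variable transform of the score amounts to choosing an order of the bins, so all bin orderings are enumerated.

```python
from itertools import permutations

def expected_precision(order, hits, counts, L):
    """Expected precision when bins are taken in `order` until L links are chosen."""
    taken, expected_hits = 0, 0.0
    for b in order:
        take = min(counts[b], L - taken)
        # A partially used bin contributes hits in proportion to the share taken.
        expected_hits += hits[b] * take / counts[b]
        taken += take
        if taken == L:
            break
    return expected_hits / L

# Hypothetical toy data: the bin index grows with the raw similarity score.
hits = [1, 6, 2, 3]       # probe links per bin (proportional to p_e)
counts = [10, 20, 4, 30]  # nonexisting training links per bin (proportional to p_n)
L = 12

raw_order = sorted(range(4), reverse=True)  # highest raw score first
pnr_order = sorted(range(4), key=lambda b: hits[b] / counts[b], reverse=True)
best = max(expected_precision(p, hits, counts, L) for p in permutations(range(4)))
```

Here `expected_precision(pnr_order, hits, counts, L)` equals `best`, the maximum over all 24 bin orderings, while the raw-score order reaches only about 0.1 because the highest-score bin is dominated by nonexisting links.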

3. Experimental Results

We first describe the six real networks in Section 3.1. The precision comparison between our method and the baseline methods is given in Section 3.2. Finally, the characteristics of the predicted links by different methods are investigated in Section 3.3.

3.1. Datasets

To verify the effectiveness of the proposed method, we measure the performance of our framework in six empirical networks from diverse disciplines and backgrounds: (1) email [46]: the Enron email communication network covers all the email communication within a dataset of around half a million emails; nodes of the network are email addresses, and if address $i$ sent at least one email to address $j$, the graph contains an undirected link between $i$ and $j$; (2) PDZBase [47]: an undirected network of protein-protein interactions from PDZBase; (3) Euroad [48]: the international E-road network, located mostly in Europe; the network is undirected, with nodes representing cities and links denoting E-roads between two cities; (4) neural [49]: a directed and weighted neural network of C. elegans; (5) USair [6]: a directed network of flights between US airports in 2010; each link represents a connection from one airport to another; (6) roundworm [49]: a metabolic network of C. elegans.

Different real networks contain directed or undirected, weighted or unweighted links. To simplify the problem, we treat all links as undirected and unweighted. Besides, only the giant connected component of each network is taken into account. This is because, for a pair of nodes located in two disconnected components, the similarity score is zero according to most prediction methods. Table 1 shows the basic statistics of these networks.
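The preprocessing described above can be sketched in plain Python. This is an illustrative sketch rather than the authors' pipeline: the function name is an assumption, edges are assumed to be tuples whose first two entries are comparable node labels, and any further entries (such as weights) are ignored.

```python
from collections import deque

def giant_component_edges(edges):
    """Return the undirected, unweighted links inside the largest connected component."""
    # Drop weights and direction: keep each node pair once and ignore self-loops.
    simple = {tuple(sorted(e[:2])) for e in edges if e[0] != e[1]}
    nbrs = {}
    for u, v in simple:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)

    seen, giant = set(), set()
    for start in nbrs:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in nbrs[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        if len(component) > len(giant):
            giant = component
    return sorted(e for e in simple if e[0] in giant)
```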

3.2. Precision Evaluation

In the experiments, we set $p^H = 0.1$, which means that the links of each network are randomly divided into a 90% training set and a 10% probe set. All results are averaged over 50 independent simulations.
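A random division of this kind can be sketched as below. The function name, the seed argument, and the rounding of the probe set size are assumptions for illustration; the text does not state whether the training network is required to stay connected, so no such constraint is enforced here.

```python
import random

def split_links(edges, probe_fraction=0.1, seed=0):
    """Randomly divide the links into a training set and a probe set."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_probe = round(probe_fraction * len(shuffled))
    return shuffled[n_probe:], shuffled[:n_probe]  # (training set, probe set)
```

Repeating the split with different seeds and averaging the resulting precision values mirrors the averaging over independent simulations described above.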

Figure 2 shows the AUC and precision of five different methods in the USair network. In Figure 2, the CN method achieves low AUC, yet high precision, whereas the RA method achieves an AUC similar to that of the CN, JB, and SPM methods, but much lower precision. Apart from the USair network, the deviation between AUC and precision also exists in other real-world networks (see FIG. S1 in the supplementary materials). The main reason is that AUC characterizes the score difference between existing and nonexisting links in the whole network, whereas precision only counts the links with the top-$L$ scores. Specifically, from the perspective of score distributions, $\mathrm{AUC} = \int_{0}^{s_{\max}} p_e(s) \left[ \int_{0}^{s} p_n(s')\,ds' \right] ds$ when ties are ignored. Compared with (10), the definitions of the two metrics are completely different, resulting in little correlation between them.

Figure 3 shows PNR and the score distributions of existing and nonexisting links for the USair network by the CN method. In Figure 3(a), the scores of existing and nonexisting links largely follow a power-law distribution. High scores sometimes correspond to low PNR (see Figure 3(b)). Conversely, some low scores achieve high PNR, indicating that a nonexisting link in the training set with such a score is likely to be an existing link in the probe set. A nonexisting link in the training set with a high score but a low PNR, in contrast, has a high probability of not being an existing link in the probe set. A similar phenomenon also exists in other networks (see FIG. S2 in the supplementary materials). Consequently, traditional methods, which are founded on the assumption that similar nodes have a high likelihood of forming links, face great challenges in precisely predicting links of low similarity.

Figure 6 shows the precision difference between the proposed PNR methods and the baseline methods. Our proposed method enhances precision remarkably compared with the original methods in most cases. Some fluctuation exists because of the limited size of the networks. Table 2 gives the maximal precision increase in the six networks. In Table 2, the two precision values are the maxima over the traditional methods and over the corresponding PNR methods, respectively. Our method outperforms state-of-the-art methods in the six networks. Besides, Figure 4 shows the influence of the probe set size on the precision performance. We find that our method outperforms the classical methods when the probe set is small, with the JB method as an exception over part of that range. Other networks have similar results (see FIG. S3 in the supplementary materials). According to the theoretical analysis (see the first part of the supplementary materials), however, our method should perform better than, or at least as well as, the classical methods for any probe set size. The discrepancy arises because we assume that the network structure is not influenced by the random division into training and probe sets, so that the training subnetwork has a structure similar to that of the original network. This assumption is reasonable when $p^H$ is small. If the probe set is large, the training set differs substantially from the entire network, which violates the assumption of our method. Therefore, our method performs well when the fraction of the probe set is small.

3.3. Characteristics of the Predicted Links

Long-range links play an important role in the dynamics of networks, and predicting them is of much significance [32, 50]. Figure 5 gives a comparison of the predicted links between JB and the corresponding PNR method in the USair network. In Figure 5, our method predicts more links between faraway nodes in different communities, while the original JB method only predicts links between close nodes. The community detection method in [37] is utilized in Figure 5. However, it is difficult to evaluate long-range links solely based on community divisions. Since long-range links usually have long distances and low similarities, we investigate the average distance and average similarity of the links predicted by our proposed framework.

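The distance $d_{ij}$ of a link $(i, j)$ is the shortest-path distance between nodes $i$ and $j$ based only on the training set. Since the endpoints of the predicted links do not connect directly, $d_{ij} \ge 2$. The average distance of the predicted links is

$$\langle d \rangle = \frac{1}{L} \sum_{(i,j)} d_{ij}, \tag{12}$$

where the sum runs over the $L$ predicted links. Analogously, the average similarity of the predicted links is

$$\langle s \rangle = \frac{1}{L} \sum_{(i,j)} s_{ij}, \tag{13}$$

where $s_{ij}$ is the similarity of nodes $i$ and $j$ in the training set.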
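Both averages can be computed with a breadth-first search on the training network. The sketch below is illustrative: the function names and the neighbor-set dictionary are assumptions, and unreachable pairs are given an infinite distance, a case that does not arise when only the giant component is used and the training network stays connected.

```python
from collections import deque

def shortest_distance(nbrs, source, target):
    """Shortest-path length between two nodes in the training network (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nb in nbrs.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return float("inf")  # target not reachable from source

def average_distance(nbrs, predicted):
    """Mean shortest-path distance over the predicted links."""
    return sum(shortest_distance(nbrs, i, j) for i, j in predicted) / len(predicted)

def average_similarity(scores, predicted):
    """Mean original similarity score over the predicted links."""
    return sum(scores[link] for link in predicted) / len(predicted)
```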

Figure 7 shows the difference between the average distances obtained by the PNR methods and by the corresponding original methods. Generally, the PNR method achieves a higher average distance than the corresponding original method in the six networks, especially for SPM in the Email network and LP in the USair network, whereas in many cases the PNR and original methods have the same average distance, $\langle d \rangle = 2$. This is because the distance between most unconnected node pairs is 2, revealing that the most commonly used methods tend to predict links that close triangles. In those cases, our method has little influence on the average distance. However, for some sparser networks, such as the neural and USair networks, the average distance is improved by our framework, especially for LP in the USair network. Previous works show that the two endpoints of a long-range link usually have a large distance or a low similarity. Since the PNR framework increases the average distance of the predicted links, it can be conjectured that more long-range links are predicted. Besides, combining Figures 6 and 7, we find that our framework predicts more long-range links correctly.

Furthermore, Figure 8 shows the difference between the average similarities obtained by the PNR methods and by the corresponding original methods. In Figure 8, the PNR method achieves a lower average similarity than the corresponding original method in the six networks, except for the RA method in the roundworm network. The reason is that PNR fluctuates considerably because of the limited size of the networks, which gives rise to the unusual behavior of RA in the roundworm network. As with the analysis of the average distance, this shows that PNR methods are beneficial to the prediction of long-range links, which agrees with the conclusion from Figure 7.

4. Conclusion

In summary, we systematically study the drawbacks of similarity-based link prediction methods and show that some link prediction methods achieve high AUC, yet low precision. Based on the differences of the similarity distributions of existing and nonexisting links, we propose a metric (PNR) to explain the problem of high AUC and low precision. Two nodes with some particular low scores also have a high likelihood of forming links between them. Furthermore, we prove that PNR is the optimal one-variable function to adjust the likelihood scores of links. Experiments in real networks demonstrate the effectiveness of PNR, and the precision is greatly enhanced. Additionally, the proposed framework could also reduce the average similarity and increase the average distance of the predicted links, which indicates that more missing long-range links can be detected correctly.

Though the proposed approach investigates link prediction in unipartite networks, it could also be generalized to bipartite and other kinds of networks. What is more, our method provides a novel way to explore the connecting patterns of real networks that may inspire other better score-assigning methods in the future.

Conflicts of Interest

The authors declare no competing financial interests.

Acknowledgments

The authors thank Dr. Alexandre Vidmer for his fruitful discussion and comments. This work is jointly supported by the National Natural Science Foundation of China (61703281, 11547040), the Ph.D. Start-Up Fund of Natural Science Foundation of Guangdong Province, China (2017A030310374 and 2016A030313036), the Science and Technology Innovation Commission of Shenzhen (JCYJ20160520162743717, JCYJ20150625101524056, JCYJ20140418095735561, JCYJ20150731160834611, JCYJ20150324140036842, and SGLH20131010163759789), Shenzhen Science and Technology Foundation (JCYJ20150529164656096, JCYJ20170302153955969), the Young Teachers Start-Up Fund of Natural Science Foundation of Shenzhen University, and Tencent Open Research Fund.

Supplementary Materials

In the supplementary materials, we prove that PNR is the optimal transforming function in Section 1. The deviation of AUC and precision in different networks is shown in Section 2. The PNR performances of different methods in different networks are shown in Section 3. In Section 3, we first plot the PNR by different methods in FIG. S2 and then show the influence of the probe set size on the precision in FIG. S3.

FIG. S1 (color online): AUC and precision of six real-world networks (see Table 2) by five different popular approaches. Results are the average of 50 independent simulations. In the experiments, $p^H = 0.1$ means that we utilize 90% of the existing edges as the training set to predict the other 10% of the edges (probe set).

FIG. S2 (color online): PNR for six networks by five different methods. (a) Email network. (b) PDZBase network. (c) Euroad network. (d) Neural network. (e) Roundworm network. (f) USair network. Results are the average of 50 independent simulations and are obtained only according to the training set. For different methods and different networks, scores are normalized to 0~1 with $s_{\mathrm{new}} = (s - s_{\min})/(s_{\max} - s_{\min})$.

FIG. S3 (color online): The precision difference $\Delta p$ as a function of probe set size $p^H = L/|E|$ in the four networks, where $\Delta p$ is the difference between the five classical and the corresponding PNR methods, $\Delta p = p_{\mathrm{PNR}} - p_{\mathrm{original}}$. $\Delta p > 0$ means that our method outperforms the original methods. In the panels, when $p^H > 0.85$, $\Delta p > 0$.