Complexity

Volume 2017 (2017), Article ID 8581365, 12 pages

https://doi.org/10.1155/2017/8581365

## Connecting Patterns Inspire Link Prediction in Complex Networks

Guangdong Province Key Laboratory of Popular High Performance Computers, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China

Correspondence should be addressed to Hao Liao

Received 9 August 2017; Revised 27 November 2017; Accepted 6 December 2017; Published 27 December 2017

Academic Editor: Diego Garlaschelli

Copyright © 2017 Ming-Yang Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Link prediction uses observed data to predict future or potential relations in complex networks. An underlying hypothesis is that two nodes have a high likelihood of connecting together if they share many common characteristics. The key issue is to develop different similarity-evaluating approaches. However, in this paper, by characterizing the differences of the similarity scores of existing and nonexisting links, we find an interesting phenomenon that two nodes with some particular low similarity scores also have a high probability to connect together. Thus, we put forward a new framework that utilizes an optimal one-variable function to adjust the similarity scores of two nodes. Theoretical analysis suggests that more links of low similarity scores (long-range links) could be predicted correctly by our method without losing accuracy. Experiments in real networks reveal that our framework not only enhances the precision significantly but also predicts more long-range links than state-of-the-art methods, which deepens our understanding of the structure of complex networks.

#### 1. Introduction

Modern science and engineering techniques have increased our access to various kinds of data, including online social networks, scientific collaboration networks, and power grid networks [1–5]. Many interesting phenomena can be uncovered from these networks. For example, analyzing data from Facebook and Twitter helps users find lost friends simply by counting common friends [6, 7], and similar analyses drive recommendation systems in online stores [8, 9]. Restricted by instrument accuracy and other obstacles, we can only obtain a small fraction or a snapshot of the complete networks [10, 11], which prompts us to filter the information in complex networks [12–14]. Link prediction is a straightforward approach to recovering networks by predicting missing links and identifying spurious links [15–17]. Great efforts have therefore been devoted to link prediction in recent years [16, 18]. Link prediction is used in different kinds of networks, including unipartite networks and bipartite networks, where unipartite networks consist of nodes of a single type (e.g., social networks and neural networks) and bipartite networks consist of nodes of two types (e.g., user-object purchasing networks and user-movie networks) [19, 20].

In classical link prediction approaches, similarity scores are first computed for pairs of disconnected nodes, and then the nonexisting links at the top of the score list are predicted as potential ones [16]. Consequently, the key issue is to find effective score-assigning methods, which fall mainly into three categories [16, 21]: similarity-based algorithms, Bayesian algorithms, and maximum likelihood algorithms. First, similarity-based algorithms [22–24] suppose that similar nodes have a high probability of linking together. Similarities are evaluated by common neighbors, random walks, resource allocation, and other local and global indices. Second, Bayesian algorithms [25–27] extract the joint probability distribution from the observed networks and then use conditional probability to estimate the likelihood of a nonexisting link. Third, maximum likelihood algorithms [28, 29] presuppose that some underlying principles govern the structure of a network, with the detailed rules and specific parameters obtained by maximum likelihood estimation. Scores of nonexisting links are then derived from these principles. Most of these methods favor predicting links with high similarity scores and perform poorly in detecting long-range links with low similarities.

In the aforementioned methods, the basic hypothesis that two nodes with a high similarity score have a high likelihood of connecting together has not been examined in depth. Recent works have demonstrated that long-range links exist extensively in complex networks and play an important role in routing, epidemic spreading, and other dynamics [30, 31]. In practice, however, the endpoints of a long-range link usually have weak interaction and low similarity [30], which prevents traditional methods from detecting long-range links [32, 33]. Hence, the structural patterns underlying the networks are of great importance to study.

Our study takes a different but complementary approach to the link prediction problem. By analyzing the score distributions of existing and nonexisting links separately, we find an interesting phenomenon: existing and nonexisting links follow different connecting patterns with respect to their similarity scores. Then, inspired by precision-recall curves [34–36], we propose a metric, named precision-to-noise ratio (PNR), to characterize the ability to distinguish potential links at different scores. PNR describes the local precision of a given set of links with the same score. Based on PNR, we put forward a novel framework that applies a one-variable function to adjust the scores of a given method. We argue that the framework finds the optimal transforming function, which exploits the full capacity of traditional link prediction methods and improves their performance both in precision and in the detection of long-range links. Experiments on six real-world networks demonstrate the effectiveness of our method.

The rest of the paper is organized as follows. In Section 2, we first briefly describe the link prediction problem and then introduce our proposed method. In Section 3, we compare the performance of our method with that of the classical methods. Finally, the conclusion is given.

#### 2. Materials and Methods

We give the link prediction formalism in Section 2.1 and the baseline methods in Section 2.2. Our proposed framework is introduced in Section 2.3.

##### 2.1. Network Formation and Metrics

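Consider a network $G = (V, E)$ of $N = |V|$ nodes with adjacency matrix $A = (a_{ij})$, where $a_{ij} = 1$ if node $i$ connects to node $j$ and $a_{ij} = 0$ otherwise. When evaluating the prediction performance, we usually divide the links $E$ randomly into a training set $E^T$ and a probe set $E^P$, with $E^T \cup E^P = E$ and $E^T \cap E^P = \emptyset$. The goal is to accurately predict the links in the probe set using only the information in the training set.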
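The random division into training and probe sets can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_links`, the `probe_fraction` parameter, and the toy link list are all assumptions made for the example.

```python
import random

def split_links(links, probe_fraction=0.1, seed=0):
    """Randomly split undirected links into a training set and a probe set.

    `probe_fraction` and `seed` are illustrative parameters, not values
    taken from the paper.
    """
    # Store each link as a sorted tuple so (i, j) and (j, i) count once.
    unique = sorted({tuple(sorted(link)) for link in links})
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_probe = int(round(probe_fraction * len(unique)))
    probe = set(unique[:n_probe])
    train = set(unique[n_probe:])
    return train, probe

# Hypothetical toy network with 10 links.
links = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (3, 4), (4, 5), (3, 5), (2, 4), (0, 3)]
train, probe = split_links(links, probe_fraction=0.2)
```

By construction the two sets are disjoint and their union is the full link set, matching the conditions stated above.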

We first assign a score to each nonexisting link and then choose the *L* links with the highest scores as potential ones. State-of-the-art similarity evaluation methods can be used to carry out link prediction, including common neighbors (CN), Jaccard index (JB), resource allocation index (RA), local path index (LP), and structural perturbation method (SPM) (see Section 2.2 and [38]).

There are two popular metrics to characterize the accuracy: area under the receiver operating characteristic curve (AUC) [39] and the precision [40, 41]. AUC can be interpreted as the probability that a randomly chosen missing link (i.e., a link in $E^P$) has a higher score than a randomly chosen nonexisting link. AUC is estimated from $n$ independent comparisons. In each comparison, we randomly choose a real link and a nonexisting link and compare their scores. After $n$ comparisons, we record $n'$ times where the real link has the higher score and $n''$ times where the two links have the same score. The final AUC is calculated as

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}. \tag{1}$$

If all the scores are drawn from an independent and identical distribution, then AUC should be around 0.5. A higher AUC corresponds to a more accurate prediction.
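The sampling procedure for AUC described above can be sketched in a few lines. The function name `auc_by_sampling` and its default arguments are assumptions for illustration; the score lists are hypothetical.

```python
import random

def auc_by_sampling(probe_scores, nonexisting_scores, n=10000, seed=0):
    """Estimate AUC from n random comparisons of probe vs. nonexisting links."""
    rng = random.Random(seed)
    n_higher = 0  # comparisons where the probe link scores higher (n')
    n_equal = 0   # comparisons where the two scores tie (n'')
    for _ in range(n):
        s_real = rng.choice(probe_scores)
        s_none = rng.choice(nonexisting_scores)
        if s_real > s_none:
            n_higher += 1
        elif s_real == s_none:
            n_equal += 1
    return (n_higher + 0.5 * n_equal) / n

# Hypothetical scores: probe links always outrank nonexisting links.
perfect = auc_by_sampling([3, 4], [1, 2])
# Hypothetical scores: every comparison is a tie.
tied = auc_by_sampling([1], [1])
```

When every comparison is a tie, the estimate is exactly 0.5, which agrees with the remark that uninformative scores give an AUC of around 0.5.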

Another metric is precision, which characterizes the ratio of correctly predicted links in a given prediction list. That is to say, if the length of the prediction list is $L$, among which $L_r$ links are correctly predicted potential links, then the precision is

$$\mathrm{Precision} = \frac{L_r}{L}. \tag{2}$$

Clearly, higher precision means higher prediction accuracy. Intuitively, one would expect AUC and precision to rise together. The experiments show, however, that precision has little correlation with AUC and that improving the precision may not improve AUC.
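A minimal sketch of the precision computation follows, assuming scores are held in a dictionary keyed by node pair. The name `precision_at_l` and the toy data are hypothetical; ties are broken by dictionary insertion order here, which is an arbitrary choice.

```python
def precision_at_l(scores, probe, l):
    """Fraction of the top-l scored nonexisting links that appear in the probe set.

    scores: dict mapping a link (i, j) to its similarity score.
    probe:  set of links withheld from training.
    """
    # sorted() is stable, so tied scores keep their insertion order.
    ranked = sorted(scores, key=lambda link: scores[link], reverse=True)
    top = ranked[:l]
    hits = sum(1 for link in top if link in probe)
    return hits / l

# Hypothetical scores for four nonexisting links, two of which are probe links.
scores = {(0, 1): 3, (0, 2): 2, (1, 2): 1, (2, 3): 0}
probe = {(0, 1), (1, 2)}
p = precision_at_l(scores, probe, 2)
```

Here the prediction list of length 2 contains one probe link, so the precision is 0.5.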

##### 2.2. Baseline Prediction Methods

A large number of score-assigning approaches exist for the link prediction problem, and all of them could be plugged into our framework. Though we investigate only some state-of-the-art score-assigning approaches, the results and conclusions also apply to other score-assigning methods. The five score-assigning approaches [6, 16] are as follows.

*(i) Common Neighbor (CN).* The metric supposes that if two nodes $i$ and $j$ have more common neighbors, they are more likely to connect together. The neighborhood overlap of the two nodes is as follows:

$$s_{ij}^{\mathrm{CN}} = |\Gamma(i) \cap \Gamma(j)|, \tag{3}$$

where $\Gamma(i)$ is the neighbor set of node $i$ and $|\cdot|$ indicates the size of a set. The drawback of CN is that it favors large-degree nodes: even when the similarity of two large-degree nodes is low, they still tend to have many common neighbors.

*(ii) Jaccard Coefficient (JB).* Jaccard is a conventional similarity metric that aims to suppress the influence of large-degree nodes:

$$s_{ij}^{\mathrm{JB}} = \frac{|\Gamma(i) \cap \Gamma(j)|}{|\Gamma(i) \cup \Gamma(j)|}. \tag{4}$$

Since the similarity is normalized by the size of the union of the two nodes' neighbor sets, two large-degree nodes can still receive a low similarity even though they may have many common neighbors.
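The CN and Jaccard indices can be sketched with plain neighbor sets. The helper names (`build_neighbors`, `common_neighbors`, `jaccard`) and the toy training links are assumptions for illustration only.

```python
def build_neighbors(links):
    """Map each node to the set of its neighbors in an undirected network."""
    nbrs = {}
    for i, j in links:
        nbrs.setdefault(i, set()).add(j)
        nbrs.setdefault(j, set()).add(i)
    return nbrs

def common_neighbors(nbrs, i, j):
    """CN index: size of the neighborhood intersection."""
    return len(nbrs.get(i, set()) & nbrs.get(j, set()))

def jaccard(nbrs, i, j):
    """Jaccard index: intersection size normalized by union size."""
    gi, gj = nbrs.get(i, set()), nbrs.get(j, set())
    union = gi | gj
    if not union:
        return 0.0  # both nodes isolated: define the similarity as 0
    return len(gi & gj) / len(union)

# Hypothetical training links: nodes 0 and 1 share neighbors 2 and 3.
nbrs = build_neighbors([(0, 2), (0, 3), (1, 2), (1, 3), (1, 4)])
cn_01 = common_neighbors(nbrs, 0, 1)
jb_01 = jaccard(nbrs, 0, 1)
```

In this toy network nodes 0 and 1 have two common neighbors out of three distinct neighbors, so CN gives 2 and Jaccard gives 2/3.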

*(iii) Resource Allocation (RA).* This index is inspired by the resource allocation dynamics in complex networks. Given a pair of unconnected nodes $i$ and $j$, suppose that node $i$ needs to allocate some resource to $j$, using common neighbors as transmitters. Each transmitter (common neighbor) starts with a single unit of resource and then distributes it equally among all its neighbors. The similarity between $i$ and $j$ can be calculated as the amount of resource received from their common neighbors:

$$s_{ij}^{\mathrm{RA}} = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{k_z}, \tag{5}$$

where $k_z$ is the degree of node $z$. Like the Jaccard index, RA suppresses the influence of large-degree nodes, but it does so per common neighbor, so that different neighbors contribute to the similarity differently. If two nodes both connect to the same low-degree nodes, they have a higher probability of sharing common interests or characteristics. Many node pairs, however, share high-degree neighbors, so high-degree common neighbors carry little weight when evaluating similarity. Based on the same idea, the Adamic-Adar (AA) index is obtained by using $\log k_z$ instead of $k_z$ in (5).
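A short sketch of the RA and AA indices, assuming the network is given as a dictionary of neighbor sets. The function names and the toy neighbor dictionary are hypothetical. Any common neighbor of two distinct nodes has degree at least 2, so the logarithm in AA is strictly positive.

```python
import math

def resource_allocation(nbrs, i, j):
    """RA index: each common neighbor z contributes 1 / k_z."""
    return sum(1.0 / len(nbrs[z]) for z in nbrs[i] & nbrs[j])

def adamic_adar(nbrs, i, j):
    """AA index: each common neighbor z contributes 1 / log(k_z)."""
    return sum(1.0 / math.log(len(nbrs[z])) for z in nbrs[i] & nbrs[j])

# Hypothetical network: common neighbors of 0 and 1 are node 2 (degree 2)
# and node 3 (degree 3).
nbrs = {
    0: {2, 3},
    1: {2, 3, 4},
    2: {0, 1},
    3: {0, 1, 5},
    4: {1},
    5: {3},
}
ra_01 = resource_allocation(nbrs, 0, 1)  # 1/2 + 1/3
aa_01 = adamic_adar(nbrs, 0, 1)          # 1/log(2) + 1/log(3)
```

The higher-degree common neighbor (node 3) contributes less than node 2 under both indices, and AA penalizes degree more gently than RA.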

*(iv) Local Path (LP).* CN considers the intersection of neighborhoods, which amounts to counting paths that pass through one intermediate node (paths of length 2). LP takes a more general view by also counting paths through two intermediate nodes (paths of length 3):

$$S^{\mathrm{LP}} = A^2 + \epsilon A^3, \tag{6}$$

where $A$ is the adjacency matrix of the network and $\epsilon$ is a small positive number. LP thus supposes that shorter paths contribute more to the similarity than longer ones. LP consists of the low-order terms of the Katz method ($S^{\mathrm{Katz}} = \sum_{l=1}^{\infty} \beta^l A^l$), but has much lower computational complexity.
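The LP index can be sketched with plain nested lists, assuming the common form $A^2 + \epsilon A^3$. The helper names, the value `eps=0.01`, and the toy path graph are assumptions for illustration. Note that powers of the adjacency matrix count walks, which may revisit nodes.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def local_path(adj, eps=0.01):
    """LP similarity matrix A^2 + eps * A^3 (eps is an illustrative value)."""
    a2 = matmul(adj, adj)
    a3 = matmul(a2, adj)
    n = len(adj)
    return [[a2[i][j] + eps * a3[i][j] for j in range(n)] for i in range(n)]

# Hypothetical path graph 0 - 1 - 2 - 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
s = local_path(adj)
```

In this path graph, nodes 0 and 2 are joined by one walk of length 2, while nodes 0 and 3 are joined only by one walk of length 3, so `s[0][2]` is 1 and `s[0][3]` equals `eps`. CN alone would assign the pair (0, 3) a score of zero.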

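*(v) Structural Perturbation Method (SPM).* Lü et al. [6] suppose that the structural features of a network remain consistent after a small random perturbation. In SPM, the training set $E^T$ is divided into a small perturbation set $\Delta E$ and the remaining set $E^R$ ($E^T = E^R \cup \Delta E$), with corresponding adjacency matrices $A^T = A^R + \Delta A$. $A^T$ has eigenvectors similar to those of $A^R$, but different eigenvalues. For the $k$th largest eigenvalues $\lambda_k$ of $A^R$ and $\lambda_k + \Delta\lambda_k$ of $A^T$,

$$\Delta\lambda_k \approx \frac{x_k^{\top} \Delta A \, x_k}{x_k^{\top} x_k}, \tag{7}$$

where $x_k$ is the eigenvector of $A^R$ corresponding to $\lambda_k$ and $\top$ denotes the transpose. The similarity matrix is

$$\tilde{A} = \sum_{k=1}^{N} (\lambda_k + \Delta\lambda_k)\, x_k x_k^{\top}. \tag{8}$$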

SPM first divides a network into a training set and a probe set and then further divides the training set into a perturbation set and the remaining set. For a given division into training and probe sets, we average (8) over 10 independent random perturbations to obtain the similarity matrix.

For similarity-evaluating methods beyond the five metrics introduced above, please refer to [42, 43].

##### 2.3. The Proposed Method

We start our framework by reinvestigating the definition of precision. Suppose that $s_{ij}$ is the similarity score of nodes $i$ and $j$ obtained by a prediction method based only on the training set $E^T$, that $p_1(s)$ is the probability density that a randomly chosen existing link in the training set has score $s$, and that $p_2(s)$ is the probability density that a randomly chosen nonexisting link in the training set has score $s$. Because the training and probe sets are divided at random, links in the probe set should, with high confidence, follow the same similarity distribution as links in the training set, according to the law of large numbers [44, 45]. We therefore do not distinguish between the similarity distributions of existing links in the training and probe sets in the rest of the paper. According to statistical theory, this assumption is reasonable when the sample size goes to infinity [44, 45]. Since classical methods only predict links with high scores, the estimated precision of a method is written as

$$\mathrm{Precision} = \frac{|E^P| \int_{s_c}^{s_{\max}} p_1(s)\, ds}{|U - E^T| \int_{s_c}^{s_{\max}} p_2(s)\, ds}, \tag{9}$$

where $|E^P|$ is the size of $E^P$, $s_c$ is a constant, and $U$ is the set of all possible links ($|U| = N(N-1)/2$ for a network of $N$ nodes). $s_{\max}$ is the maximum score. In real scenarios, the length of the prediction list is usually the size of the probe set [16], which requires $s_c$ to satisfy $|U - E^T| \int_{s_c}^{s_{\max}} p_2(s)\, ds = |E^P|$. If $p_1(s) = 0$ for all $s \geq s_c$, the precision is 0. Otherwise, $p_1(s) \gg p_2(s)$ on this range gives rise to a high precision. Since only the links with the top-*L* highest scores are predicted as potential links, precision can be calculated by (2) [6, 16]. Equation (2) is a much simpler formula for precision than (9).

Most previous link prediction methods only predict links with high similarity scores. We generalize (9) by considering links of different similarities. Supposing that links with scores $s \in \Theta$ are predicted as potential links, the precision is as follows:

$$\mathrm{Precision} = \frac{|E^P| \int_{\Theta} p_1(s)\, ds}{|U - E^T| \int_{\Theta} p_2(s)\, ds}, \tag{10}$$

where $\Theta \subseteq [0, s_{\max}]$. To confine the length of the prediction list, a precondition requires $|U - E^T| \int_{\Theta} p_2(s)\, ds = |E^P|$. Note that, in most previous works, $\Theta = [s_c, s_{\max}]$, and (10) reduces to (9). Our generalized precision (10) considers links with both high and low scores.

The main concern is to select an appropriate set $\Theta$ in (10) to maximize the precision. We propose the precision-to-noise ratio (PNR) to determine $\Theta$,

$$\mathrm{PNR}(s) = \frac{p_1(s)}{p_2(s)}, \tag{11}$$

where PNR measures the ability to distinguish real links among links with the same score. Note that a nonexisting link in the training set may be an existing link in the probe set. Given a nonexisting link in the training set with similarity $s$, the probability that it is an existing link in the probe set (i.e., the precision) is $c \cdot \mathrm{PNR}(s)$, where $c$ is a constant.
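A short derivation sketch may help show why the local precision is proportional to PNR. The notation is assumed for this illustration: $p_1(s)$ and $p_2(s)$ are the score densities of existing and nonexisting links in the training set, $E^P$ is the probe set, $E^T$ the training set, and $U$ the set of all node pairs. The sketch relies on the assumption stated earlier that probe links follow the same score distribution as training links, and the explicit value of the constant $c$ is inferred here rather than quoted from the paper.

```latex
\begin{align}
\#\{\text{probe links with score in } [s, s + ds]\}
  &\approx |E^P| \, p_1(s) \, ds, \\
\#\{\text{nonexisting training links with score in } [s, s + ds]\}
  &= |U - E^T| \, p_2(s) \, ds, \\
\Pr\big(\text{link} \in E^P \,\big|\, \text{score } s\big)
  &\approx \frac{|E^P| \, p_1(s)}{|U - E^T| \, p_2(s)}
   = c \cdot \mathrm{PNR}(s),
  \qquad c = \frac{|E^P|}{|U - E^T|}.
\end{align}
```

Because $c$ does not depend on $s$, ranking nonexisting links by $\mathrm{PNR}(s)$ is equivalent to ranking them by their estimated probability of belonging to the probe set.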

The central issue of our framework is to use PNR to determine the optimal score set $\Theta$. First, we calculate the similarity scores of all links by a traditional method, based only on the training set. Second, $p_1(s)$, $p_2(s)$, and PNR are computed. Third, we reassign the score of each link as $s'_{ij} = \mathrm{PNR}(s_{ij})$, where $s_{ij}$ is the original similarity score from the first step. Finally, we sort the links in descending order of $s'_{ij}$, and the links with the top-*L* scores are predicted as potential links [16, 18]. The optimal score set $\Theta$ corresponds to the original similarity scores whose reassigned scores rank in the top-*L* score list.

Different kinds of similarity evaluations could be introduced into the framework. Taking the CN similarity method as an example, our framework is as follows:

(1) Divide the links of a network into a training set and a probe set randomly.

(2) Calculate the similarity scores of all existing and nonexisting links by the CN method, based only on the training set.

(3) Calculate PNR. Divide the scores into uniform bins and count how many existing ($n_1$) and nonexisting ($n_2$) links fall in each bin (i.e., calculate the discrete $p_1(s)$ and $p_2(s)$). Then we obtain $\mathrm{PNR}(s) = p_1(s)/p_2(s)$. Note that if $p_2(s) = 0$, we define $\mathrm{PNR}(s) = 0$.

(4) Obtain the readjusted scores of the nonexisting links in the training set by $s'_{ij} = \mathrm{PNR}(s_{ij})$.

(5) Determine the prediction list by choosing the links with the highest scores $s'_{ij}$, and calculate the precision.
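Steps (2) to (5) above can be sketched as follows for the CN method. This is an illustrative reading of the procedure, not the authors' implementation: the function names `pnr_rescore` and `predict`, the number of bins, the handling of the bin edges, and the toy network are all assumptions.

```python
from itertools import combinations

def pnr_rescore(nodes, train_links, n_bins=5):
    """Rescore nonexisting links by the PNR of their CN-score bin."""
    nbrs = {v: set() for v in nodes}
    for i, j in train_links:
        nbrs[i].add(j)
        nbrs[j].add(i)
    train = {tuple(sorted(link)) for link in train_links}

    # Step (2): CN scores of all node pairs, existing and nonexisting.
    pairs = list(combinations(sorted(nodes), 2))
    scores = {(i, j): len(nbrs[i] & nbrs[j]) for i, j in pairs}

    # Step (3): uniform bins over [0, s_max]; the top edge goes in the last bin.
    s_max = max(scores.values())
    width = s_max / n_bins if s_max > 0 else 1.0

    def bin_of(s):
        return min(int(s / width), n_bins - 1)

    n1 = [0] * n_bins  # existing links per bin
    n2 = [0] * n_bins  # nonexisting links per bin
    for pair, s in scores.items():
        if pair in train:
            n1[bin_of(s)] += 1
        else:
            n2[bin_of(s)] += 1
    total1, total2 = sum(n1), sum(n2)
    pnr = []
    for b in range(n_bins):
        p1 = n1[b] / total1 if total1 else 0.0
        p2 = n2[b] / total2 if total2 else 0.0
        pnr.append(p1 / p2 if p2 > 0 else 0.0)  # PNR = 0 where p2 = 0

    # Step (4): readjusted scores for the nonexisting links only.
    rescored = {pair: pnr[bin_of(s)]
                for pair, s in scores.items() if pair not in train}
    return rescored, pnr

def predict(rescored, l):
    """Step (5): the l nonexisting links with the highest readjusted scores."""
    return sorted(rescored, key=rescored.get, reverse=True)[:l]

# Hypothetical training network: two triangles joined by the bridge (2, 3).
nodes = range(6)
train_links = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
rescored, pnr = pnr_rescore(nodes, train_links, n_bins=3)
top = predict(rescored, 2)
```

In this toy network the bridge (2, 3) is an existing link with a CN score of zero, so the lowest bin receives a nonzero PNR. This is the kind of low-similarity connecting pattern that the framework is designed to pick up, although a six-node example is far too small to say anything about real networks.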

Figure 1 depicts the proposed framework based on the CN method. After obtaining the similarity scores of links (Figure 1(a)→1(b)), the traditional CN method directly predicts potential links according to the scores (Figure 1(b)→1(d)), while the proposed framework first calculates PNR (Figure 1(b)→1(c)) and then predicts potential links according to the modified scores (Figure 1(c)→1(d)).