Abstract

Link prediction is a fundamental problem of data science, which usually calls for unfolding the mechanisms that govern the micro-dynamics of networks. In this regard, using features obtained from network embedding for predicting links has drawn widespread attention. Although methods based on edge features or node similarity have been proposed to solve the link prediction problem, many technical challenges still exist due to the unique structural properties of networks, especially when the networks are sparse. From the graph mining perspective, we first give empirical evidence of the inconsistency between heuristic and learned edge features. Then, we propose a novel link prediction framework, AdaSim, by introducing an Adaptive Similarity function using features obtained from network embedding based on random walks. The node feature representations are obtained by optimizing a graph-based objective function. Instead of generating edge features using binary operators, we perform link prediction solely leveraging the node features of the network. We define a flexible similarity function with one tunable parameter, which serves as a penalty of the original similarity measure. The optimal value is learned through supervised learning and thus is adaptive to data distribution. To evaluate the performance of our proposed algorithm, we conduct extensive experiments on eleven disparate networks of the real world. Experimental results show that AdaSim achieves better performance than state-of-the-art algorithms and is robust to different sparsities of the networks.

1. Introduction

Networks have recently emerged as an important tool for representing and analyzing many kinds of interacting systems ranging from biological to social science [1]. As technological innovation and data explosion gather pace, we humans are now moving into the era of big data, hence the reach of and participation in these networks is rapidly expanding. Studying these complex, interlocking networks can help us understand the operation mechanism of real-world systems. Therefore, in the past years, lots of work has been dedicated to studying evolution [2, 3], topologies [4, 5], and characteristics [6] of networks, attracting researchers from physics, sociology, and computer science.

In many circumstances, however, the current observations of network data are substantially incomplete [7]. For example, in protein-protein interaction and metabolic networks, whether two nodes have a link must be determined experimentally, which is very costly. As a result, the known links may represent fewer than 1% of the actual links [8]. Similarly, in social networks like Facebook, the observed network shows only part of the friendships among users, and there still exist user pairs who already know each other but are not connected through Facebook. It is therefore a challenging yet meaningful task to identify which pairs of nodes not connected in the current network are likely to be connected in the actual network, i.e., to predict missing links. Such knowledge is useful: in the biological domain, it gives invaluable guidance for carrying out targeted experiments, and in the social network domain, it can be used to recommend promising friendships, thus enhancing users’ loyalty to web services.

Approaches to the link prediction problem [9–14] can be roughly divided into two categories, i.e., unsupervised methods and supervised methods. Current research on unsupervised link prediction mainly focuses on defining a similarity metric for unconnected node pairs using information extracted from the network topology. The defined metrics represent different kinds of proximity between a pair of nodes and perform differently across networks, and no single metric dominates the others. Most of the metrics are easy to compute and interpret, but they are so invariant that they are fundamentally unable to cope with dynamics, interdependencies, and other properties in networks [15]. The link prediction problem can also be posed as a supervised binary classification task from a machine learning perspective [16]. Since then, research on supervised methods for link prediction has become prominent [15, 17–19], and the results provide confirmatory evidence that a supervised approach can enhance link prediction performance.

Choosing an appropriate feature set is crucial for any supervised machine learning task [20–22]. For link prediction, each sample in the dataset corresponds to a pair of nodes. A typical and intuitive solution is to use multiple topological similarities as features. However, all these features are handcrafted and require considerable human labor. Besides, they often rely on domain knowledge, which restricts their generalization across different fields.

An alternative is to learn the features automatically for the network. By treating networks as special documents consisting of a series of node sequences, the node features can be learned by solving an optimization problem [23]. After obtaining the features of nodes, the link prediction task is traditionally conducted using two approaches. The first is the similarity-based ranking method [24]; for example, cosine similarity is used to measure the similarity of pairs of nodes. For two unconnected nodes, the larger the similarity value, the higher their connection probability. The other is the edge-feature-based classification method [25, 26]. In this method, the edge features are generated by heuristic binary operators such as the Hadamard operator and the Average operator. A classifier is then trained on these features and used to decide whether a link will form between two unconnected nodes.

As the features learned through network embedding preserve the network’s local structure, the cosine similarity works well for strongly assortative networks but fails to capture the disassortativity of the network, i.e., nodes prefer to build connections on large scales than on small scales [7]. Thus, using cosine similarity for link prediction suffers from statistical performance drawbacks. Besides, the edge features obtained through binary operators will potentially lose node’s information, since the features of nodes are learned by solving an optimization problem but the edge features are not (see Figure 1 for a clear explanation and the details will be discussed in Section 3.3). Furthermore, the edge and node features have the same dimensionality, which is usually on the scale of several hundreds. This means that even for linear models such as logistic regression, it still needs to learn hundreds of parameters, which presents us with the question of feasibility especially when the data size is large. How to design a simple, general yet efficient link prediction method using the node features directly learned from network embedding still remains an open problem.

To solve the abovementioned issues, we propose a novel link prediction method, AdaSim (Adaptive Similarity function), for large-scale networks. The node feature representations are obtained by optimizing a graph-based objective function using stochastic gradient descent techniques. Instead of generating edge features using heuristic binary operators, we perform link prediction solely leveraging the node features of the network. Our essential contribution lies in defining a flexible node similarity function with only one tunable parameter, which serves as a penalty of the original similarity. The optimal value can be obtained through supervised learning and thus is adaptive to the data distribution, which gives AdaSim the ability to capture the various link formation mechanisms of different networks. Compared with the original cosine similarity, the proposed method generalizes well across various network datasets.

In summary, our main contributions are listed as follows:
(i) We propose AdaSim, a novel link prediction method, by introducing an adaptive similarity function using features learned from network embedding.
(ii) We show that AdaSim is flexible enough with only one tunable parameter. It is adjustable with respect to the network property. This flexibility endows AdaSim with the power of capturing the link formation mechanisms of different networks.
(iii) We demonstrate the effectiveness of AdaSim by conducting experiments on various disparate real-world networks. The results show that the proposed method can boost the performance of link prediction to different degrees. Besides, we find that AdaSim works particularly well for highly sparse networks.

The rest of the paper is structured as follows. Section 2 reviews some research works related to link prediction. The problem definition of link prediction and feature learning are described in Section 3, and some empirical findings on the datasets are also given in this section. Section 4 illustrates the proposed link prediction method AdaSim with detailed explanations of each component. The experimental results and analysis are presented in Section 5. Finally, Section 6 concludes the paper.

2. Related Work

Early works on link prediction mainly focus on exploring topological information derived from graphs. Liben-Nowell and Kleinberg [27] studied several topological features such as common neighbors, Adamic-Adar, PageRank and Katz and found that topological information is beneficial in predicting links compared with a random predictor. Subsequently, some topology-based predictors were proposed for link prediction, e.g., resource allocation [28], community-enhanced predictors [8] and clustering coefficient-based link prediction [29].

Al Hasan et al. were the first to model link prediction as a binary classification problem from a machine learning perspective [16]. Various similarity metrics between node pairs are extracted from the network and treated as features in a supervised learning setup; then, a classifier is built with these features as inputs to distinguish positive samples (links that form) and negative samples (links that do not form). Thereafter, the supervised classification approach has been prevalent in the link prediction domain. Lichtenwalter et al. proposed a high-performance link prediction framework called HPLP. Some new perspectives for link prediction, e.g., the generality of an algorithm, topological causes, and sampling approaches, were included in Ref. [15]. Later, a supervised random walk-based algorithm was proposed by Backstrom and Leskovec [17] to effectively incorporate the information from the network structure with rich node and edge attribute data.

In addition, the link prediction problem has also been extended to heterogeneous information networks [30–33]. Among these works, a core concept based on network schema was proposed, namely, the meta-path. Multiple information sources can be effectively fused into a single path, and different meta-paths have different physical meanings. Some similarity measures can be calculated using meta-paths; they are then treated as features of a classifier to discriminate positive and negative links.

All the works mentioned above on supervised link prediction use handcrafted features, which require expensive human labor and often rely on domain knowledge. To alleviate this, one can use the latent features learned automatically through representation learning [34]. For networks, the unsupervised feature learning methods typically use the spectral properties of various matrix representations of graphs, such as adjacency and Laplacian matrices. From the perspective of linear algebra, this kind of method can be regarded as a dimensionality reduction technique. Several works [35, 36] have aimed to acquire the node features of graphs in this way, but the eigendecomposition of a matrix is computationally costly, which makes these methods impractical for large networks.

Perozzi et al. [23] extended the skip-gram model to graphs and proposed a framework, DeepWalk, by representing a network as a special “document” consisting of a series of node sequences, which are generated by random walks. DeepWalk can learn features for nodes in the network, and the representation learning process is irrelevant to downstream tasks like node classification and link prediction. Later, Node2vec was proposed by Grover and Leskovec [26]. Compared with DeepWalk, Node2vec uses a biased random walk to control the sampling space of node sequences. The network properties such as homophily and structure equivalence can be captured by Node2vec. The link prediction was performed using edge features obtained through heuristic binary operators on node features. In [37], the authors proposed a deep model called SDNE to capture the highly nonlinear property of networks. The first-order and second-order proximity were jointly exploited to capture the local and global network structure, respectively. More recently, Wang et al. [38] proposed a novel Modularized Nonnegative Matrix Factorization (M-NMF) model to incorporate not only the local and global network structure but also the community information into network embedding. In order to model the diverse interacting roles of nodes when interacting with other nodes, Tu et al. [39] presented a Context-Aware Network Embedding (CANE) method by introducing a mutual attention mechanism. CANE can model the semantic relationships between nodes more precisely. In order to save computation time, in [24], link prediction was directly carried out using cosine similarity of node features instead of edge features.

The above works mainly focus on the network embedding techniques and ignore the typical characteristics of link formation. The main difference between existing work and our efforts is that we consider an adaptive similarity function whose parameter is learned from data, making our model flexible enough to capture the various link formation patterns of different networks. For example, a negative value of the tunable parameter can weaken the role of “structural equivalence” and enhance the score of dissimilar node pairs, thus capturing disassortativity in link formation.

3. Problem Statement and Feature Learning Framework

In this section, we first give the formal definition of the link prediction problem. Then the feature learning framework for networks is presented. Finally, we introduce the empirical findings on several network datasets when using node features for link prediction.

3.1. Problem Formulation

Given a network $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of links, no multiple links or self-links are allowed for any two nodes in the network. It is assumed that some of the links in the network are unobserved or missing at the present stage. The link prediction task aims to predict the likelihood of a link between two unconnected nodes using information intrinsic to the network.

Since we consider a supervised approach to link prediction, we first need to construct a labeled dataset $D = \{(x_i, y_i)\}$, where $x_i$ is the feature vector of the $i$-th sample and $y_i \in \{0, 1\}$ is the corresponding label. Specifically, $x_i = \phi(f(u), f(v))$, in which $f(u)$ and $f(v)$ denote the features of nodes $u$ and $v$, respectively. The node features are learned from network representation learning, and $\phi$ is a mapping function from node features to node pair features. For any node pair in $D$, $y_i = 1$ indicates that the pair belongs to the positive samples; otherwise it belongs to the negative samples. Positive samples are the edges, $E_p$, chosen randomly from the network $G$. We delete $E_p$ from $G$ and keep the obtained subnetwork ($G'$) fully connected. To generate negative samples, we sample an equal number of node pairs from $G$ that have no edge connecting them. The dataset $D$ is split into two parts: a training dataset $D_{train}$ and a test dataset $D_{test}$. A classification model can be learned with $D_{train}$; this model is then used to predict whether a pair of nodes in $D_{test}$ should have a link connecting them. Our algorithms are typical methods from the field of graph mining. Hence, in contrast to part of our previous papers [9, 10], we follow conventions in the field of artificial intelligence in which $E_p$ (positive samples) has 50% of the observed links, and scores are based on the combination of $E_p$ and the same number of nonobserved links (negative samples). For the highly sparse sexual contact network, which has only a small number of nodes, this set instead comprises all observed links.

3.2. Feature Learning of Network Embedding

For a given network $G = (V, E)$, a mapping function $f: V \to \mathbb{R}^d$ from nodes to feature vectors can be learned for link prediction. Here $d$ is a user-specified parameter that denotes the number of dimensions of the feature vectors, and $f$ can be viewed as a matrix of $|V| \times d$ parameters. The mapping function $f$ is learned through a series of document-like node sequences, using optimization techniques that originated in language modeling.

The purpose of language modeling is to evaluate the likelihood of a sentence appearing in a document. The model is built using a corpus $\mathcal{C}$. More formally, it aims to maximize $\Pr(w \mid \mathrm{Context}(w))$ over the whole training corpus, where $w$ is a word of the vocabulary and $\mathrm{Context}(w)$ is the context of $w$, which includes the words that appear on both the left side and the right side of $w$. Recent research on representation learning has paid a lot of attention to leveraging probabilistic neural networks to build a general representation of words, extending the scope of language modeling beyond its original goals. Each word is represented by a continuous and low-dimensional feature vector. The problem then is to maximize $\Pr(w \mid \Phi(\mathrm{Context}(w)))$, where $\Phi$ denotes the latent representation of a word.

The social representation of networks can be learned analogously through a series of node sequences generated by a specific sampling strategy $S$. Similar to the context of a word in language modeling, $N_S(u)$ is defined to be the neighborhood of node $u$ under sampling strategy $S$. The node representation of networks can be obtained by optimizing the following expression:

$$\max_{f} \sum_{u \in V} \log \Pr\bigl(N_S(u) \mid f(u)\bigr)$$

The learned representations can capture the shared similarities in local graph structure among nodes in the networks. Nodes that have similar neighborhoods will acquire similar representations.

3.3. Empirical Findings on Several Network Datasets

After learning the representations for the nodes in the network, there are two approaches to the link prediction task, i.e., node-similarity-based method and edge-feature-based method. The former is simple and scalable, and the latter is complex yet powerful. But both methods have their limitations in effectively characterizing the link formation patterns of node pairs. Since the node-similarity-based method was not involved in learning, it cannot be aware of the effects of global network property in link prediction. The edge feature-based method could not describe the node pair relationship very well at the feature level using a heuristic binary operator, as the information loss exists in the mapping procedure from node features to edge features. We show the empirical evidence for the limitations of these two kinds of methods in the following subsections.

3.3.1. Limitation of Heuristic Binary Operators

Figure 1(a) shows a toy network (the Krackhardt kite graph) with 10 nodes and 18 edges. Each node and edge is marked with a unique label. Given a specific sampling strategy $S$, we can obtain the node sequences and the corresponding edge sequences simultaneously after performing $S$ on the network. Hence, both node representations and edge representations of the network can be learned using optimization techniques. For a specific pair of nodes, the learned edge representation is called the “true” features and the edge representation generated by a binary operator is called the “heuristic” features. If a binary operator is good enough, it should accurately characterize the relationship of pairs of nodes, i.e., the correlation between heuristic features and true features should be as strong as possible. For the 18 edges in the toy network, five different kinds of heuristic binary operators [26] (see Table 1) are chosen to generate edge features (for the Division operator, only one operand order is shown, since the other gives very similar results), and their correlations with the true edge features are displayed in Figure 1(b).

On the basis of the evidence from Figure 1, we can tell that different operators have different results in representing features of pairs of nodes and no one can dominate the others. Some of the edges, e.g., edges 10 and 16, can be well characterized by the Hadamard operator, while others, for example, edges 12, 14, and 17, can be characterized by the Average operator. Furthermore, most values are less than 0.5, which means a weak correlation between the heuristic edge features and true edge features. This verifies our claim that edge features obtained through heuristic binary operators may cause the loss of information of node features.
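The comparison between heuristic and true edge features can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the operator definitions follow the forms commonly used in the node2vec literature (Average, Hadamard, Weighted-L1, Weighted-L2) plus an element-wise Division, and Table 1 may define them differently; Pearson correlation stands in for the correlation measure, which the text does not name; all function names are hypothetical.

```python
import math

# Heuristic binary operators: each maps two node vectors to one edge vector.
# The exact definitions are assumptions of this sketch.
def average(fu, fv):
    return [(a + b) / 2.0 for a, b in zip(fu, fv)]

def hadamard(fu, fv):
    return [a * b for a, b in zip(fu, fv)]

def weighted_l1(fu, fv):
    return [abs(a - b) for a, b in zip(fu, fv)]

def weighted_l2(fu, fv):
    return [(a - b) ** 2 for a, b in zip(fu, fv)]

def division(fu, fv, eps=1e-12):
    # eps guards against division by zero (a choice made for this sketch)
    return [a / (b if abs(b) > eps else eps) for a, b in zip(fu, fv)]

OPERATORS = {
    "average": average,
    "hadamard": hadamard,
    "weighted_l1": weighted_l1,
    "weighted_l2": weighted_l2,
    "division": division,
}

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)

def operator_correlations(fu, fv, true_edge):
    """Correlation of each heuristic edge vector with a learned edge vector."""
    return {name: pearson(op(fu, fv), true_edge) for name, op in OPERATORS.items()}
```

Calling `operator_correlations(f_u, f_v, f_edge)` for every edge of a small graph would give one correlation per operator and edge, which is the kind of quantity the discussion of Figure 1(b) refers to.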

3.3.2. Limitation of Similarity-Based Method

Given an unconnected node pair $(u, v)$, several metrics can be used to measure their similarity, for example, the number of common neighbors of $u$ and $v$, or the number of reachable paths from $u$ to $v$. Here we only consider cosine similarity, since we have the node pair’s feature vectors $f(u)$ and $f(v)$. The cosine similarity is used to characterize the link formation probability and is defined as

$$\mathrm{sim}(u, v) = \frac{f(u)^{\top} f(v)}{\lVert f(u) \rVert \, \lVert f(v) \rVert},$$

where $\top$ denotes the transpose and $\lVert \cdot \rVert$ is the $\ell_2$-norm of a vector. The cosine similarity measures the cosine of the angle between two $d$-dimensional vectors obtained from network representation learning. In fact, the idea of cosine similarity has been used for link prediction in several works [24, 40, 41]. But there are a few issues when directly using cosine similarity for link prediction. The first is that it does not consider the label information of node pairs; thus, it belongs to the category of unsupervised learning. However, many works have demonstrated that supervised learning approaches to link prediction can enhance the performance [15, 18, 19]. The other is that cosine similarity is too rigid to capture the different link formation mechanisms of different networks.
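As a minimal sketch of the unsupervised baseline discussed above, the following Python computes cosine similarity between node vectors and ranks candidate pairs by it. The function names and the dictionary layout of `embeddings` are assumptions made for illustration, and returning 0.0 for a zero vector is a convention of this sketch.

```python
import math

def cosine_similarity(fu, fv):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(fu, fv))
    norm_u = math.sqrt(sum(a * a for a in fu))
    norm_v = math.sqrt(sum(b * b for b in fv))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)

def rank_candidates(embeddings, candidate_pairs):
    """Sort unconnected node pairs by decreasing cosine similarity.

    embeddings: dict mapping node -> list of floats (hypothetical layout).
    candidate_pairs: iterable of (u, v) tuples.
    """
    scored = [
        (cosine_similarity(embeddings[u], embeddings[v]), u, v)
        for u, v in candidate_pairs
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(u, v, score) for score, u, v in scored]
```

No labels enter this computation, which is why the ranking cannot adapt to networks whose links form between distant nodes.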

In the phase of representation learning for networks, it is assumed that two nodes have similar representations if they have similar contexts in the node sequences sampled by strategy $S$. For networks, this indicates that if two nodes are structurally close to each other (for three nodes $a$, $b$, and $c$ in the graph, if the geodesic distance between $a$ and $b$ is 2 and that between $a$ and $c$ is 5, we say $b$ is closer to $a$ than $c$ is), then they have a high probability of occurring in the same sequence, which results in a high cosine similarity. But in real-world networks, whether two nodes will form a link is not simply determined by this kind of structural closeness. Two nodes far from each other in the network also have a high chance of building a relationship if they are structurally equivalent [26]. The rule “the closer the graph distance, the easier it is for two nodes to build a link” does not necessarily hold, especially when the network is sparse and disassortative.

As shown in Figure 2, different networks have different patterns in building new connections (the datasets are described in Section 5.1). These patterns are closely related to network properties such as clustering coefficient, graph density, and assortativity. In some networks with high assortativity, two unconnected nodes tend to be connected if they are structurally close, while in others they do not. More specifically, in the C. elegans dataset the link formation probability for two unconnected nodes decreases sharply as the geodesic distance increases, and 97.8% of new links span a geodesic distance less than 3. For the Gnutella dataset, as the geodesic distance increases, the link formation probability first increases and then decreases, and most of the new links (62.4%) are generated by node pairs at distance 5 or 6. For the Router dataset, the new links span a wide range of geodesic distances from 2 to 34, and almost half of the new links (48.67%) span a distance larger than 5; its distribution of link formation probabilities is more complex than those of the other two datasets.
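A distance profile of this kind can be approximated with a short Python sketch. It assumes that the distance of a new link is the geodesic distance between its endpoints in the network with the held-out links removed; the exact procedure behind Figure 2 may differ, and the function names are hypothetical.

```python
from collections import Counter, deque

def bfs_distance(adj, source, target):
    """Geodesic (shortest-path) distance, or None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

def distance_distribution(edges, held_out):
    """Fraction of held-out links at each geodesic distance in the residual graph."""
    held = {frozenset(e) for e in held_out}
    adj = {}
    for u, v in edges:
        if frozenset((u, v)) in held:
            continue  # held-out links are invisible when measuring distance
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = Counter(bfs_distance(adj, u, v) for u, v in held_out)
    total = sum(counts.values())
    return {dist: count / total for dist, count in counts.items()}
```

A distribution concentrated at distance 2 corresponds to the pattern described for C. elegans, while mass at larger distances corresponds to the patterns described for Gnutella and Router.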

The cosine similarity function assigns higher scores to pairs of nodes if they are close to each other and vice versa. It can capture link formation patterns in the case of Figure 2(a), i.e., the shorter the distance between two unconnected nodes, the higher the probability to be connected. But cosine similarity fails to capture the patterns in the cases of Figures 2(b) and 2(c), especially Figure 2(b), in which the link formation pattern follows a Gaussian distribution. For the pattern of Figure 2(b), nodes prefer to build connections with those that are relatively farther from them. When performing a link prediction task in cases like this, pairs of nodes with a relatively longer distance should be more similar than those with a shorter one.

Thus, we need to design a flexible similarity function for link prediction that captures the various patterns of link formation. Besides, the similarity function should be concise and scalable. This can be achieved by adjusting the similarity of node pairs and balancing link formation probabilities among different distances. Inspired by this, we propose a modified similarity function with a balance factor that controls the similarity of node pairs with different geodesic distances. As we have the labels of node pairs, the optimal value of this factor can be learned in a supervised way.

4. The Proposed Framework

In this work, we propose a novel link prediction framework, AdaSim, based on an adaptive similarity function using the features learned from network representation. The whole framework is illustrated in Figure 3. It can be divided into three parts: subgraph generation, feature representation, and similarity function learning. First, the positive and negative node pair indexes are obtained through random sampling, and the corresponding subgraph is generated via edge removal. Then we learn the representation of nodes in the network in an unsupervised way. Finally, a similarity function is defined and its optimal parameter is determined through supervised learning. The obtained similarity function with the optimal penalty can be directly used to solve the link prediction problem.

4.1. Subgraph Generation

Unlike other tasks such as link clustering or node classification, in which the complete structural information is available, a certain fraction of the links needs to be removed before performing network representation learning for link prediction [42–44]. To achieve this, one can iteratively select one link and determine whether it is removable. But this operation is inefficient and very time consuming, especially when the network is very sparse, since it needs to traverse almost all the nodes in the graph.

Instead, we propose a fast positive sampling method based on the minimum spanning tree (MST) in this paper. An MST is a subset of the edges in the original graph that connects all the nodes together. This means that all the edges except those that belong to the MST are removable, and their deletion will not break connectivity. Lines 1–4 in Algorithm 1 show the core of our approach. We first generate an MST of $G$, with edge set denoted as $E_{mst}$, using Kruskal’s algorithm. The positive samples $E_p$ are randomly selected from $E \setminus E_{mst}$. To generate negative samples $E_n$, we sample an equal number of node pairs from $G$ with no edge connecting them (lines 5–7). Then we delete all the edges in $E_p$ from $G$ and obtain the subgraph $G'$ (line 8).

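Algorithm 1
Input: network $G = (V, E)$, positive edge ratio $r$
Output: samples of node pairs $E_p$, $E_n$ and subgraph $G'$
(1) $E_{mst} \leftarrow$ Kruskal($G$)
(2) $E_c \leftarrow E \setminus E_{mst}$
(3) Shuffle($E_c$)
(4) $E_p \leftarrow$ edges taken from $E_c$ according to ratio $r$
(5) for $(u, v) \notin E$ and $|E_n| < |E_p|$ do
(6)   append $(u, v)$ to $E_n$
(7) end for
(8) $G' \leftarrow (V, E \setminus E_p)$
(9) return $E_p$, $E_n$, $G'$
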
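The sampling procedure of Algorithm 1 can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' implementation: the graph is treated as unweighted (so Kruskal's algorithm simply returns some spanning forest), the ratio is assumed to be a fraction of $|E|$, and `spanning_tree_edges` and `sample_pairs` are hypothetical names.

```python
import random

def spanning_tree_edges(nodes, edges):
    """Kruskal's algorithm with union-find on an unweighted graph."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = set()
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.add((u, v))
    return tree

def sample_pairs(nodes, edges, ratio, seed=0):
    """Return (positives, negatives, subgraph_edges).

    Assumes a simple undirected graph: no self-links, no multiple links.
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    edges = [tuple(e) for e in edges]

    # Only non-tree edges are removable, so the subgraph stays connected.
    tree = spanning_tree_edges(nodes, edges)
    removable = [e for e in edges if e not in tree]
    rng.shuffle(removable)
    n_pos = min(int(ratio * len(edges)), len(removable))
    positives = removable[:n_pos]

    # Negative samples: an equal number of node pairs with no edge.
    existing = {frozenset(e) for e in edges}
    max_non_edges = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    n_neg = min(n_pos, max_non_edges)
    negatives = set()
    while len(negatives) < n_neg:
        u, v = rng.sample(nodes, 2)
        pair = frozenset((u, v))
        if pair not in existing:
            negatives.add(pair)

    removed = set(positives)
    subgraph_edges = [e for e in edges if e not in removed]
    return positives, [tuple(p) for p in negatives], subgraph_edges
```

Because positives are drawn only from edges outside the spanning tree, no per-edge connectivity check is needed, which is the efficiency argument made in the text.
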
4.2. Feature Representation

Now we proceed to perform the feature learning task on the subgraph $G'$. This task consists of two core components, i.e., a node sequence sampling strategy and a language model.

4.2.1. Node Sequence Sampling

In terms of node sequence sampling, the most classical strategies are Breadth First Search (BFS) and Depth First Search (DFS) [26]. BFS starts at a specific node and explores its neighbors first before moving to the next level. In contrast, DFS starts at one node and explores as far as possible along each branch before backtracking. BFS and DFS represent two extreme sampling strategies with respect to the search space they explore, which has valuable implications for the learned representations. In fact, the neighborhood sampled by BFS reflects structural equivalence in the network, and the nodes sampled by DFS reflect a macro-view of the neighborhoods, which is essential in inferring communities based on homophily [26]. Although both are of paramount significance for producing interesting representations, neither can simultaneously reveal the complex properties of networks. We need a sampling strategy that can smoothly interpolate between DFS and BFS, a requirement that can be fulfilled by random walks on graphs.

A random walk of length $t$ on the subgraph $G'$ rooted at node $v_0$ is a stochastic process with random variables $X_0, X_1, \ldots, X_t$ such that $X_0 = v_0$ and $X_{i+1}$ is a node chosen uniformly at random from the neighbors of $X_i$. Random walks arise in a variety of models for large scale networks, such as computing node similarities [19, 45], learning to rank nodes [46, 47], and estimating network properties [48]. Besides, they are the foundation of a class of output-sensitive algorithms that employ them to calculate local information about community structure.

This connection is the reason that motivates us to use random walks as the node sequence sampling strategy for extracting network information.

Lines 1–6 in Algorithm 2 show the procedure of node sequence sampling. As shown in Figure 3, we can obtain a series of node sequences using random walks. For example, a random walk of a given length rooted at a chosen node of the toy network yields one node sequence; the other sequences are obtained similarly.
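The node sequence sampling step can be sketched in Python as below. This is a minimal illustration assuming uniform (unbiased) random walks and an adjacency-list dictionary; the names `random_walk` and `generate_walks` are hypothetical, and walk length is counted here as the number of nodes in the sequence.

```python
import random

def random_walk(adj, start, length, rng):
    """Uniform random walk containing `length` nodes, starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:
            break  # dead end: isolated node
        walk.append(rng.choice(neighbors))
    return walk

def generate_walks(adj, walks_per_node, walk_length, seed=0):
    """Run `walks_per_node` passes; each pass starts one walk from every node.

    adj: dict mapping node -> list of neighbor nodes.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit the nodes in a fresh random order each pass
        for node in nodes:
            walks.append(random_walk(adj, node, walk_length, rng))
    return walks
```

The resulting list of walks plays the role of the "document" of node sequences that is then passed to the language model.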

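Algorithm 2
Input: subgraph $G'$, window size $w$, feature size $d$, walks per node $\gamma$, walk length $t$
Output: node representation $f$
(1) for $i = 1$ to $\gamma$ do
(2)   $O \leftarrow$ Shuffle($V$)
(3)   for $v \in O$ do
(4)     walks = RandomWalk($G'$, $v$, $t$)
(5)   end for
(6) end for
(7) for walk $\in$ walks do
(8)   for $v \in$ walk do
(9)     compute the objective for the context of $v$ within window size $w$
(10)    update $f$ by a stochastic gradient descent step
(11)  end for
(12) end for
(13) return $f$
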
4.2.2. Language Model

In order to get the representations of networks, the objective is to solve

$$\min_{f} \; -\log \Pr\bigl(\{v_{i-w}, \ldots, v_{i+w}\} \setminus v_i \mid f(v_i)\bigr),$$

where $v_i$ is a node in a sampled sequence, $w$ is the context size, and $f$ is the mapping function from nodes to feature representations. For $\Pr(\cdot \mid f(v_i))$, we can use softmax, which is a log-linear classification model, to get the posterior distribution of nodes. However, softmax involves a summation over all the nodes, and doing such computation for every training instance is very expensive, making it impractical to scale up to large networks.

To solve this problem, an intuition is to limit the number of output vectors updated per training instance. Hierarchical softmax [49] was proposed to improve the learning efficiency in this way, and we adopt it in this work. In the end, we use stochastic gradient descent (SGD) techniques to optimize the objective function (lines 7–12 in Algorithm 2) to get the social representation of each node $v$ in the graph, i.e., $f(v)$. As illustrated in Figure 3, for each node in the toy network, we can get a $d$-dimensional representation associated with it.
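For reference, the cost argument can be made concrete with the softmax form commonly used in skip-gram-style models. The notation here is an assumption of this note rather than the authors' own: $f(v_i)$ is the representation of the center node and $f'(v)$ is a separate output (context) vector for node $v$.

```latex
\Pr\bigl(v_j \mid f(v_i)\bigr)
  = \frac{\exp\bigl(f'(v_j)^{\top} f(v_i)\bigr)}
         {\sum_{v \in V} \exp\bigl(f'(v)^{\top} f(v_i)\bigr)}
```

The denominator ranges over all $|V|$ nodes, so each update costs $O(|V|)$. Hierarchical softmax places the nodes at the leaves of a binary tree and factorizes the probability along the root-to-leaf path, which reduces the per-update cost to roughly $O(\log |V|)$ for a balanced tree.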

4.3. Similarity Function Learning

For node pair $(u, v)$, we use $f(u)$ and $f(v)$ as their features obtained from network representation learning. Considering the distribution bias of real links among different geodesic distances, we propose a novel similarity function which is defined as

We denote and for simplicity. Then, (7) can be rewritten as

A logistic function is applied for mapping the node pair similarity to a value in , which is a probability indicating that it belongs to the positive class or negative class. We use to denote this probability, which is represented as

In order to measure the closeness between the predicted value and the true label, we select cross-entropy loss as our objective, which is defined as

Stochastic gradient descent is used to find the optimal value of by minimizing ; its updating rule can be written as where
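The learning step can be sketched as plain SGD on a single scalar parameter under a logistic output and cross-entropy loss. The similarity used here, the dot product divided by the product of the norms raised to a power `p`, is an illustrative stand-in chosen so that `p = 1` recovers cosine similarity; it should not be read as the paper's exact similarity function. The learning rate, epoch count, and toy vectors are likewise assumptions.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def similarity(u, v, p):
    """Stand-in similarity (assumption): dot(u, v) / (|u||v|)^p; p = 1 is cosine."""
    norm_product = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm_product ** p, norm_product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_loss(pairs, labels, p):
    """Average cross-entropy between the logistic output and the 0/1 labels."""
    total = 0.0
    for (u, v), y in zip(pairs, labels):
        prob = sigmoid(similarity(u, v, p)[0])
        total -= y * math.log(prob) + (1 - y) * math.log(1 - prob)
    return total / len(pairs)

def learn_p(pairs, labels, p=1.0, lr=0.05, epochs=200, seed=0):
    """Plain SGD on the single scalar parameter p, starting from cosine (p = 1)."""
    rng = random.Random(seed)
    order = list(range(len(pairs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            u, v = pairs[i]
            s, norm_product = similarity(u, v, p)
            # Chain rule: dL/ds = sigmoid(s) - y and ds/dp = -s * ln(|u||v|).
            grad = (sigmoid(s) - labels[i]) * (-s * math.log(norm_product))
            p -= lr * grad
    return p

# Hypothetical toy data: one linked pair (label 1) and one unlinked pair (label 0).
pairs = [((2.0, 0.0), (2.0, 0.0)), ((2.0, 0.0), (0.5, 1.9))]
labels = [1, 0]
learned_p = learn_p(pairs, labels)
```

On this toy data the learned value moves away from 1, and `mean_loss(pairs, labels, learned_p)` is lower than the loss at `p = 1.0`, which is the sense in which the parameter adapts to the data.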

Algorithm 3 shows the core part of the parameter learning process. The training dataset and test dataset are first obtained in line 1 of Algorithm 3. Then the optimal value of is learned using SGD on the training dataset (line 2). The is used to measure the similarity of node pairs in , and their probability of being connected is computed in lines 3 to 6 of Algorithm 3. Finally, the evaluation results are obtained in line 7.

Input: node representation , pairs of nodes , train test split ratio
Output: Evaluation results
(1),
(2)
(3)fordo
(4)
(5)
(6)end for
(7)val = GetEvaluation()
(8)return val

5. Experiments

In this section, we first give a brief description of the datasets used in the experiment. Next, we introduce the baseline models and evaluation metrics for link prediction. Then, the experimental results are presented with a detailed analysis. As the AdaSim framework involves several parameters, lastly, we show how the different choices of these parameters affect the performance of link prediction.

5.1. Datasets

To comprehensively evaluate the performance of our proposed link prediction algorithm, we conduct experiments on eleven real-world datasets that are commonly used in the link prediction domain. These datasets come from various fields, and their details are as follows:
(i) C. elegans [50] is the neural network of the Caenorhabditis elegans worm. The nodes represent neurons and the edges denote synapses or gap junctions.
(ii) PB [51] is a network of hyperlinks between weblogs on United States politics.
(iii) Wiki-vote [52] is a social network that contains all the Wikipedia voting data from the inception of Wikipedia until January 2008. Nodes represent Wikipedia users, and a directed edge from one node to another indicates that the first user voted on the second.
(iv) Email-enron [52] is a communication network covering around half a million e-mails. Nodes are e-mail addresses, and two addresses are linked if at least one e-mail was sent from one to the other.
(v) Epinions [52] is a who-trusts-whom online social network. Members of the Epinions site can decide whether to trust each other. If one user trusts another, there is a link between them.
(vi) Slashdot [52] is a technology-related news website. The network contains friend/foe links between the users of Slashdot.
(vii) Sexual [53] is a well-known sexual contact network. This network is very sparse and has almost no closed triangles.
(viii) Roadnet [52] is the road network of California, a typical sparse and treelike network.
(ix) Power [50] is a traditional sparse network that represents the power grid of the western United States.
(x) Router is a router-level Internet network collected by the Rocketfuel Project [54].
(xi) p2p-Gnutella [52] is the Gnutella peer-to-peer file sharing network. Nodes represent hosts and edges represent connections among those hosts.

The basic topological information of these networks is listed in Table 2, including the number of nodes and edges, average degree, average clustering coefficient, diameter, and density of the network. We roughly divide the networks into dense and sparse groups based on the average degree and average clustering coefficient. To sum up, we conduct experiments on networks with various properties, i.e., sparse and dense, small and large. Thus, the datasets can comprehensively reflect the characteristics of the proposed method (different from other networks, has of the observed links for the highly sparse sexual contact network due to the small number of nodes).

5.2. Baseline Methods and Evaluation Metrics

In order to validate the performance of our proposed algorithm, we compare AdaSim against the following link prediction models.
(i) Common neighbors (CN): for node $x$, let $\Gamma(x)$ denote the set of neighbors of $x$. Two nodes, $x$ and $y$, have a high probability of being connected if they have many common neighbors [55, 56]. The simplest way to measure this neighborhood overlap is to count the common neighbors directly, i.e., $s^{\mathrm{CN}}_{xy} = |\Gamma(x) \cap \Gamma(y)|$.
(ii) Resource allocation (RA) [28]: for an unconnected node pair $x$ and $y$, it is assumed that $x$ can send some resources to $y$ through their common neighbors. The similarity between $x$ and $y$ is defined as the amount of resources received by $y$ from $x$, i.e., $s^{\mathrm{RA}}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}$, where $k_z$ is the degree of $z$.
(iii) Preferential attachment (PA) [57]: the preferential attachment mechanism is used to generate random scale-free networks, in which the probability that a new link connects to $x$ is proportional to $k_x$. Similarly, the probability that a new link connects $x$ and $y$ is proportional to $k_x k_y$. The PA similarity index is defined as $s^{\mathrm{PA}}_{xy} = k_x k_y$.
(iv) Salton Index (SI) [40]: SI is also known as cosine similarity and is defined as $s^{\mathrm{SI}}_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\sqrt{k_x k_y}}$.
(v) Clustering coefficient for link prediction (CCLP) [29]: this similarity index takes more local structural information into account. The local link information is conveyed by the clustering coefficient of the common neighbors, i.e., $s^{\mathrm{CCLP}}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{t_z}{k_z (k_z - 1)/2}$, where $t_z$ is the number of triangles passing through node $z$.
(vi) Heterogeneity Index (HEI) [10]: this method is based on network heterogeneity and is the state of the art for sparse and treelike networks. It is defined as $s^{\mathrm{HEI}}_{xy} = |k_x - k_y|^{\alpha}$, where $\alpha$ is a free heterogeneity exponent.
(vii) Node2vec [26]: this is a supervised link prediction method based on logistic regression. Its edge features are generated by applying heuristic binary operators to the node features learned from network embedding. Two parameters, $p$ and $q$, control the sampling of node sequences. Note that when $p = q = 1$, Node2vec is equivalent to DeepWalk [23].
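The four simplest neighborhood-based indices in the list above (CN, RA, PA, and SI) can be computed directly from adjacency sets. The sketch below uses their standard textbook definitions; the function names and the toy graph are assumptions made for illustration.

```python
import math

def common_neighbors(adj, x, y):
    """CN: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])

def resource_allocation(adj, x, y):
    """RA: each common neighbor z contributes 1 / degree(z)."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])

def preferential_attachment(adj, x, y):
    """PA: product of the two degrees."""
    return len(adj[x]) * len(adj[y])

def salton_index(adj, x, y):
    """SI: common neighbors normalized by the geometric mean of the degrees."""
    denom = math.sqrt(len(adj[x]) * len(adj[y]))
    return len(adj[x] & adj[y]) / denom if denom else 0.0

# Toy undirected graph as adjacency sets (hypothetical data).
toy_adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}

cn = common_neighbors(toy_adj, "a", "b")         # shared neighbors: c and d
ra = resource_allocation(toy_adj, "a", "b")      # 1/2 + 1/3
pa = preferential_attachment(toy_adj, "a", "b")  # 2 * 3
si = salton_index(toy_adj, "a", "b")             # 2 / sqrt(6)
```

Note that CN, RA, and SI all evaluate to zero for node pairs without common neighbors, which is why such indices struggle on sparse and treelike networks.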

Besides Node2vec, there are other approaches to unsupervised feature learning on graphs, such as spectral clustering [58] and LINE [59]. We exclude them in this work since they have already been shown to be inferior to Node2vec [26]. We also exclude other supervised methods, such as ensemble learning [15] and support vector machines [16]. These methods can achieve relatively better performance, but at the cost of high complexity, which is contrary to our original intention.

We adopt the area under the receiver operating characteristic curve (AUC) to quantitatively evaluate the performance of link prediction algorithms. The AUC value quantifies the probability that a randomly chosen missing link is given a higher score than a randomly chosen node pair without a link. A higher AUC means better performance.
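This probabilistic reading of the AUC translates directly into a pairwise comparison. The sketch below is a minimal illustration of that definition; counting ties as one half is a common convention that is assumed here, and the function name and scores are hypothetical.

```python
def auc_score(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical scores for three missing links and two nonexistent links.
example_auc = auc_score([0.9, 0.6, 0.4], [0.5, 0.4])  # 4.5 of 6 comparisons -> 0.75
```

A score of 0.5 corresponds to random guessing, and 1.0 means every missing link is ranked above every nonexistent link.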

5.3. Experimental Results

In order to obtain the following results, we set the parameters in line with the typical values in [26]. That is, , and the optimization is run for a single epoch. Fifty percent of the edges are removed and treated as positive examples. The negative node pairs which have no edge connecting them are randomly sampled from the network. For the two parameters, and , in Node2vec, they are selected through a grid search over . After the dataset is prepared, we use tenfold cross validation to evaluate the performance. For the sake of objectivity, the experiment is repeated ten times on each dataset and the average results are reported in Table 3.
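The data preparation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, edges are treated as undirected, and sampling as many negative pairs as positive ones is an assumption, since the text does not state the ratio.

```python
import random

def split_edges(edges, removal_ratio=0.5, seed=0):
    """Remove a fraction of edges; the removed edges become the positive examples."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    num_removed = int(len(shuffled) * removal_ratio)
    return shuffled[num_removed:], shuffled[:num_removed]  # observed, positives

def sample_negative_pairs(nodes, edges, count, seed=0):
    """Randomly sample distinct node pairs that have no edge in the original network."""
    rng = random.Random(seed)
    existing = {frozenset(edge) for edge in edges}
    node_list = list(nodes)
    max_pairs = len(node_list) * (len(node_list) - 1) // 2 - len(existing)
    if count > max_pairs:
        raise ValueError("not enough unconnected node pairs to sample from")
    negatives = set()
    while len(negatives) < count:
        pair = frozenset(rng.sample(node_list, 2))
        if pair not in existing:
            negatives.add(pair)
    return [tuple(sorted(pair)) for pair in negatives]

# Toy ring network with six nodes (hypothetical data).
toy_nodes = range(6)
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]
observed, positives = split_edges(toy_edges, removal_ratio=0.5)
negatives = sample_negative_pairs(toy_nodes, toy_edges, count=len(positives))
```

The embedding is then learned on the `observed` edges only, so the removed links are genuinely unseen at prediction time.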

A general observation we can draw from these results is that the proposed link prediction algorithm, AdaSim, obtains better performance than all the baseline methods on all datasets. More specifically, the unsupervised similarity-based link prediction methods achieve relatively lower AUC values than the supervised ones, since they do not leverage label information to boost model performance. However, the PA predictor achieves results competitive with AdaSim and even better than Node2vec on five out of eleven datasets. This is because preferential attachment is one of the key mechanisms in generating power-law scale-free networks. It reflects the mechanism of network evolution that involves the addition of new nodes and edges, and thus it can obtain better performance on link prediction problems. In contrast, similarity-based link prediction methods perform much worse when the network is sparse, since few or no closed triangular structures exist in such networks.

Among all the supervised link prediction methods, AdaSim outperforms both DeepWalk and Node2vec on all eleven networks, with gain ratios of different scales. Compared with Node2vec, the gain ratio in AUC varies from 0.75% to 43.04%.

To intuitively show the influence of penalty on link prediction performance, is set to specific values from to with fixed increment (here for demonstration), and the resulting AUC values on three datasets, i.e., C. elegans, Router, and Wiki-vote, are displayed in Figure 4. Notice that corresponds to the original cosine similarity measurement. It can be clearly seen from Figure 4 that the AUC results are considerably affected by the value of . Compared with the rigid cosine similarity, our proposed AdaSim can substantially improve the link prediction performance. This also verifies our empirical findings in Section 3.3 that different networks have different link formation patterns; thus, a flexible and adaptive similarity function for link prediction is needed to capture these various patterns.

We select three representative sparse networks, that is, Sexual (small), Power (medium), and p2p-Gnutella (large), and report the wall-clock time of CN, CCLP, HEI, Node2vec, and AdaSim in Table 4. We can observe from this table that the prediction time of all algorithms increases with network size. In addition, the learning-based algorithms usually take more time than the similarity-based ones, since vector-vector multiplication takes more time than simply calculating the neighborhood information of two nodes. Although our algorithm requires more time to predict, it also achieves considerable performance gains, as explained above.

5.4. Performance on Networks with Different Sparsities

Networks in the real world are often sparse; we know only very limited information about the interactions among the nodes. For example, 80% of the molecular interactions in yeast cells and 99.7% of those in human cells are still unknown [40]. A good link prediction method should perform robustly on networks with different sparsities.

We change the sparsity of the networks by randomly removing a certain percentage of links from the original network, and then follow the aforementioned experimental setup to report the results of different methods. The results on the Wiki-vote dataset are displayed in Figure 5. Only four baseline methods are shown in the figure, since CN and CCLP perform similarly to RA, and DeepWalk performs similarly to Node2vec.

It can be seen from the results that the AUC values decrease as the ratio of removed edges increases, since it becomes more and more challenging to characterize node similarity using information on network topology. The similarity-based methods perform well when the removed edge ratio is relatively small. Moreover, AdaSim performs consistently well and is robust to different sparsity conditions. Even when eighty percent of the edges are removed, AdaSim still maintains an AUC of around 0.95. Overall, AdaSim is not only robust to different network conditions but also achieves better performance than the baselines.

5.5. Parameter Sensitivity

Several parameters are involved in the AdaSim algorithm. In Figure 6, we examine how different choices of these parameters influence the performance of AdaSim on the Wiki-vote dataset. Except for the parameter being tested, all other parameters take their default values.

We measure the AUC as a function of the representation dimension , walk length and the number of walks per node . We observe that the dimension of the learned node representations has a limited effect on link prediction performance. As the dimensionality increases, the AUC values increase slightly and saturate once reaches 128. It can also be observed that larger and improve the performance; this is because more neighborhood information of the seed node is included in the representation learning process, so the node similarities can be captured more precisely.

6. Conclusion

In this work, we focus on the link prediction problem with features obtained from network embedding. As the edge features generated through heuristic binary operators are a lossy projection of the original node features, we have given quantitative evidence of the inconsistency between heuristic edge features and learned ones. Moreover, we have developed a novel link prediction framework, AdaSim, by introducing an adaptive similarity function to deal with the inflexibility of cosine similarity, especially for sparse or treelike networks. AdaSim first learns node representations of networks by solving a graph-based objective function, then adds a penalty parameter, , on the original similarity function. Finally, the optimal value of is learned through supervised learning. The proposed AdaSim is flexible; it is therefore adaptive to the data distribution and can capture the various link formation mechanisms of different networks. We conducted experiments on publicly available real-world network datasets and extensively compared AdaSim with seven well-established representative baseline methods. The results show that AdaSim achieves better performance than state-of-the-art algorithms on all datasets. It is also robust to the sparsity of the networks and obtains competitive performance even when a large fraction of edges is missing.

Data Availability

All datasets can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare no competing interests.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61901247 and 61803047), the Natural Science Foundation of Shandong Province (ZR2019BF032), the Major Project of the National Social Science Foundation of China (19ZDA149 and 19ZDA324), and the Fundamental Research Funds for the Central Universities (14370119 and 14390110).