Abstract

With the rapid growth of various complex networks, link prediction has become increasingly important because it can discover missing information and predict future interactions between nodes in a network. Recently, the CAR and CCLP indexes have been presented for link prediction by means of different triangle structure information. However, both indexes may lose the contributions of some shared neighbors. In this work, we propose a new index that remedies this weakness and thereby improves the accuracy of link prediction. The proposed index focuses on a new triangle structure, i.e., the triangle formed by one seed node, one common neighbor, and one other node. It emphasizes the importance of these triangles but does not ignore the contribution of any common neighbor. In addition, the proposed index adopts the theory of resource allocation by penalizing large-degree neighbors. Comparisons with CN, AA, RA, ADP, CAR, CAA, CRA, and CCLP on 12 real-world networks show that the proposed index outperforms the compared methods in terms of AUC and ranking score.

1. Introduction

As a fundamental research topic in complex network analysis, link prediction has a wide range of applications in both theory and practice, such as the analysis of network evolution [1, 2], recommender systems [3], and the detection of potential interactions between proteins in biological networks [4, 5]. The basic task of link prediction is to estimate the missing or latent links between unconnected nodes in a network [6, 7]. To date, a host of algorithms and models have been proposed for link prediction [6, 8, 9]. Reference [8] groups them into two categories: similarity-based approaches and learning-based approaches. A similarity-based approach computes similarity scores between unconnected nodes based on the known information. Then, the node pairs are ranked in descending order of similarity score, and the pairs at the top are considered most likely to have links. A learning-based approach formalizes the link prediction problem as a binary classification task [10] and uses machine learning methods to solve it. The key job in a learning-based approach is to construct the feature vectors of node pairs. In general, learning-based approaches are more complicated than similarity-based ones.

The hypothesis behind similarity-based approaches is that the more similar two nodes are, the more likely a link exists between them [8]. This idea is simple and intuitive. Thus, the study of this kind of approach has become the mainstream [6, 9]. The Common Neighbors (CN) index [11], as its name suggests, simply counts the number of common neighbors between two nodes. The Adamic-Adar (AA) [12] and Resource Allocation (RA) [13] indexes are two variants of the CN index; they penalize the contributions of large-degree common neighbors. These indexes are called local methods because they only use local structure information. Besides, some global and quasilocal methods have also been proposed, such as Katz [14], SimRank [15], Random Walks with Restart [16], Local Path [17], FriendLink [18], and Local Random Walk [19].

As complex networks keep growing in size, local methods remain good candidates because they run faster than global and quasilocal methods. Therefore, we focus in this study on local methods. Recently, Cannistraci et al. proposed the CAR index [20], which suggests that links between the common neighbors, i.e., local-community-links (LCLs), are more valuable than common neighbors in link prediction. In the CAR index, a local community is a triangle passing through two common neighbors and one seed node. In the example network shown in Figure 1(a), the two seed nodes have four common neighbors, and there is one LCL between these common neighbors (see Figure 1(b)). Thus, the CAR index assigns a similarity score of four to the seed node pair. However, if we remove this LCL, CAR will assign a zero similarity score to the pair, even though the seed nodes still have four common neighbors. In addition, the idea of LCL is also plugged into the AA, RA, and Jaccard indexes [20]. Later, Wu et al. proposed the CCLP index [24] based on the clustering coefficients of common neighbors. This index considers all triangles passing through a common neighbor. For the example network in Figure 1(a), there are triangles passing through three of the four common neighbors (see Figure 1(c)). Thus, the CCLP index accumulates the clustering coefficients of these three nodes when calculating the similarity between the seed nodes, but utterly neglects the contribution of the fourth common neighbor. In real-world networks, it is possible that no triangles pass through some or even all shared neighbors of a node pair. Thus, the CAR and CCLP indexes may assign a very low or even zero similarity score to the node pair, even if it has many common neighbors.

In this paper, we define a new type of triangle structure, called the TRA-triangle, which is formed by one seed node, one common neighbor, and one other node (see Figure 1(d)). Based on the TRA-triangle, a new similarity index, namely, the TRA index, is proposed for link prediction. This index suggests that the common neighbors that can form TRA-triangles with a seed node are more important than others. In addition, the proposed index also penalizes large-degree neighbors, as done in the RA index [13]. Although the TRA, CAR-based, and CCLP indexes are all based on triangle structures, the intuitions behind them are different. The CAR-based indexes hold that LCLs are more valuable than common neighbors. The CCLP index is inspired by the CAR index but employs all triangles passing through common neighbors, while the TRA index, which only uses the TRA-triangles, strikes a balance between CAR and CCLP. Furthermore, as mentioned above, the CAR-based and CCLP indexes lose the contribution of those common neighbors with no triangles passing through them, whereas the TRA index counts the contribution of all kinds of common neighbors. Therefore, the TRA index can achieve better prediction accuracy than the CAR-based indexes and the CCLP index. The accuracy of the TRA index is evaluated on 12 real-world networks from various fields. The experimental results show that our index consistently outperforms the CAR-based indexes and the CCLP index. For example, on HEP, which is a very sparse network, the improvements made by TRA over CAR and CCLP in terms of AUC are up to 26.9% and 4.2%, respectively.

The rest of the paper is structured as follows. In Section 2, we describe the link prediction problem and the evaluation metrics, list the compared methods and networks, and outline the Wilcoxon signed-ranks test. Section 3 introduces the proposed method. In Section 4, the experimental results and performance analysis of the proposed method are presented. Finally, Section 5 concludes this work.

2. Preliminaries

2.1. Problem Description and Metric

Consider an undirected and unweighted network $G(V, E)$, in which $V$ and $E$ are the node set and link set, respectively; in this study, multilinks and self-loops are not allowed. Let $n = |V|$ be the number of nodes in the network, and let $U$ be the universal possible link set, which contains $n(n-1)/2$ possible links. Then, the set of nonobserved links or nonexisting links is $U - E$. Supposing that there are some missing links in $U - E$, the task of link prediction is to find those links. A similarity-based approach assigns a similarity score to each node pair in $U - E$ and assumes that the higher the score of a node pair, the more likely there is a link between its two nodes.

To test the performance of a similarity index, we randomly divide the link set $E$ into two parts: training set $E^T$ and testing set $E^P$, such that $E^T \cup E^P = E$ and $E^T \cap E^P = \emptyset$. $E^T$ is supposed to be the observed information, and $E^P$ is used for testing. Two parameter-free metrics are employed to quantify the accuracy of link prediction algorithms: AUC [6] and ranking score [21, 22]. In this situation, the AUC score can be interpreted as the probability that a randomly selected missing link (i.e., a link in $E^P$) is given a higher score than a randomly selected nonexistent link (i.e., a link in $U - E$). In practice, if we perform $m$ independent comparisons, and there are $m'$ times that the missing link has the higher score and $m''$ times that the two links have the same score, the AUC value is computed as

$$\mathrm{AUC} = \frac{m' + 0.5\,m''}{m}. \tag{1}$$

Ranking score (RS) takes into consideration the ranks of the links in the testing set after the candidate links are sorted in descending order of their similarity scores. Let $H = U - E^T$ be the set of nonobserved links. Let $e$ be a missing link in $E^P$ and $r_e$ be its rank. The ranking score of $e$ is defined as $RS_e = r_e / |H|$, and the ranking score of the link prediction result is as follows:

$$RS = \frac{1}{|E^P|} \sum_{e \in E^P} RS_e = \frac{1}{|E^P|} \sum_{e \in E^P} \frac{r_e}{|H|}. \tag{2}$$

Note that a higher AUC value indicates better accuracy, whereas a smaller ranking score indicates better accuracy.
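The two metrics can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the dictionary `scores` (mapping each candidate link to its similarity score), and the default number of comparisons are assumptions made here, not part of the original study. The candidate list used for ranking is the testing links plus the nonexistent links.

```python
import random

def auc_score(scores, probe_links, nonexistent_links, comparisons=10000, seed=0):
    """Sampled AUC: (higher + 0.5 * same) / comparisons."""
    rng = random.Random(seed)
    higher = same = 0
    for _ in range(comparisons):
        s_missing = scores[rng.choice(probe_links)]
        s_nonexistent = scores[rng.choice(nonexistent_links)]
        if s_missing > s_nonexistent:
            higher += 1
        elif s_missing == s_nonexistent:
            same += 1
    return (higher + 0.5 * same) / comparisons

def ranking_score(scores, probe_links, nonexistent_links):
    """Mean of rank / |H| over the probe links, ranks taken in descending score order."""
    candidates = list(probe_links) + list(nonexistent_links)
    # sorted() is stable, so tied links keep their input order (an assumption
    # of this sketch; other tie-breaking rules are possible).
    ranked = sorted(candidates, key=lambda link: scores[link], reverse=True)
    rank = {link: position + 1 for position, link in enumerate(ranked)}
    return sum(rank[link] / len(candidates) for link in probe_links) / len(probe_links)
```

Sampling keeps the AUC estimate cheap on large networks, whereas the ranking score needs the full sorted candidate list.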

2.2. Local Similarity Indexes

To date, many similarity indexes have been proposed for link prediction [6, 8, 9]. Here, we list the local similarity indexes that are used in our experiments for the purpose of comparison.

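Common Neighbor (CN) index [11] defines the similarity between $x$ and $y$ as the number of their common neighbors, which is

$$CN(x, y) = |\Gamma(x) \cap \Gamma(y)|, \tag{3}$$

where $\Gamma(x)$ denotes the set of neighbors of node $x$.

Adamic-Adar (AA) index [12] is a variant of the CN index, which holds that small-degree neighbors contribute more than large-degree neighbors when computing similarity. Its definition is as follows:

$$AA(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z}, \tag{4}$$

where $k_z$ is the degree of node $z$.

Resource Allocation (RA) index [13] defines the similarity between $x$ and $y$ as the amount of resource that $y$ receives from $x$ through their common neighbors, which is

$$RA(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}. \tag{5}$$

Adaptive Degree Penalization (ADP) index [23] penalizes a common neighbor according to its degree and the average clustering coefficient of the network. Therefore, it can automatically adapt to the network. The definition of the ADP index is as follows:

$$ADP(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} k_z^{-\beta C}, \tag{6}$$

where $\beta$ is a constant and $C$ is the average clustering coefficient of the network. We set $\beta$ as suggested by the authors [23].

CAR index [20] suggests that two seed nodes are more likely to link together if there are links between their common neighbors, which is defined as

$$CAR(x, y) = CN(x, y) \cdot \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{\gamma(z)}{2}, \tag{7}$$

where $\gamma(z)$ is the number of links between $z$ and the other common neighbors of $x$ and $y$.

CAA and CRA indexes [20] are generated by plugging the idea of the CAR index into the AA and RA indexes, respectively, which are defined as

$$CAA(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{\gamma(z)}{\log k_z}, \tag{8}$$

$$CRA(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{\gamma(z)}{k_z}. \tag{9}$$

CCLP index [24] computes the similarity between $x$ and $y$ by employing the clustering coefficients of their common neighbors, which is

$$CCLP(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} CC_z, \tag{10}$$

where $CC_z$ denotes the clustering coefficient of node $z$, which is

$$CC_z = \frac{t_z}{k_z (k_z - 1)/2}, \tag{11}$$

in which $t_z$ is the number of triangles passing through node $z$.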
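The definitions above translate almost directly into code. The following is a minimal sketch, assuming the network is stored as a dictionary `adj` that maps each node to the set of its neighbors; the function names and the toy network are hypothetical and are not the network of Figure 1. ADP is left out because it needs the constant and the average clustering coefficient as extra inputs.

```python
import math

def shared(adj, x, y):
    """Common neighbors of x and y."""
    return adj[x] & adj[y]

def cn(adj, x, y):
    return len(shared(adj, x, y))

def aa(adj, x, y):
    # A common neighbor has degree >= 2, so the logarithm is positive.
    return sum(1.0 / math.log(len(adj[z])) for z in shared(adj, x, y))

def ra(adj, x, y):
    return sum(1.0 / len(adj[z]) for z in shared(adj, x, y))

def lcl_degree(adj, x, y, z):
    """Number of links between z and the other common neighbors of x and y."""
    return len(adj[z] & shared(adj, x, y))

def car(adj, x, y):
    lcl = sum(lcl_degree(adj, x, y, z) for z in shared(adj, x, y)) / 2
    return cn(adj, x, y) * lcl

def cra(adj, x, y):
    return sum(lcl_degree(adj, x, y, z) / len(adj[z]) for z in shared(adj, x, y))

def clustering(adj, z):
    k = len(adj[z])
    if k < 2:
        return 0.0
    # Each triangle through z is seen from both of its other two nodes.
    triangles = sum(len(adj[u] & adj[z]) for u in adj[z]) / 2
    return 2.0 * triangles / (k * (k - 1))

def cclp(adj, x, y):
    return sum(clustering(adj, z) for z in shared(adj, x, y))

# Hypothetical toy network: seeds x and y share four neighbors; a-b is the only LCL.
toy = {
    "x": {"a", "b", "c", "d"}, "y": {"a", "b", "c", "d"},
    "a": {"x", "y", "b"}, "b": {"x", "y", "a"},
    "c": {"x", "y"}, "d": {"x", "y"},
}
print(cn(toy, "x", "y"), car(toy, "x", "y"), cclp(toy, "x", "y"))
```

On this toy network, deleting the link between `a` and `b` drives both CAR and CCLP to zero while CN and RA remain positive, which is the weakness discussed in the Introduction.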

2.3. Networks

In this study, we use 12 real-world networks drawn from various fields to evaluate the effectiveness of link prediction methods.
(1) Advogato (ADV): a social network whose users are mainly free and open source software developers [25].
(2) C.elegans (CE): the neural network of a Caenorhabditis elegans worm [26].
(3) Dolphin: a social network of 62 dolphins in a community living off Doubtful Sound, New Zealand [27].
(4) Email: a network of email interchanges between members of a university [28].
(5) Foodweb (FW): a food web in Florida Bay during the rainy season [29].
(6) Hamster: a friendship network between users on hamsterster.com [30].
(7) HEP: the coauthorship network of scientists who posted preprints on the high-energy theory archive from 1995 to 1999 [31].
(8) Karate: the social network of a karate club at a US university [32].
(9) Political blogs (PB): a network of blogs about US politics [33].
(10) USAir: a network of the US air transportation system [6].
(11) Word: an adjacency network of common adjectives and nouns in the novel “David Copperfield” by Charles Dickens [34].
(12) Yeast: the protein-protein interaction network of budding yeast [35].

In this work, all the aforementioned networks are treated as undirected and unweighted networks, and only the giant component of each network is used. Table 1 lists the basic statistics of the giant components of these networks.
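The preprocessing described above (undirected, unweighted, no multilinks or self-loops, giant component only) can be sketched as follows. The helper names `build_undirected` and `giant_component` are assumptions of this sketch, and the input is taken to be a plain list of node pairs.

```python
from collections import deque

def build_undirected(edges):
    """Adjacency sets from an edge list, dropping self-loops and repeated links."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def giant_component(adj):
    """Subnetwork induced by the largest connected component (found by BFS)."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return {u: adj[u] & best for u in best}
```

Storing neighbors in sets removes direction and multiplicity automatically, so the only explicit check needed is the one for self-loops.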

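Given a network $G(V, E)$, let $x, y \in V$ be two seed nodes. $(x, y)$ is called a seed node pair with common neighbors if $x$ and $y$ have at least one common neighbor. $S$ denotes the set of seed node pairs with common neighbors; formally,

$$S = \{(x, y) \mid x, y \in V,\ x \neq y,\ \Gamma(x) \cap \Gamma(y) \neq \emptyset\}, \tag{12}$$

where $\Gamma(x)$ is the set of neighbors of node $x$. Let $x$, $y$ be two seed nodes, and let $z$ be one of their common neighbors. If $t_z = 0$, i.e., no triangle passes through $z$, we call $z$ a zero-triangle-neighbor; otherwise, $z$ is a triangle-neighbor. If $\gamma(z) > 0$, i.e., $z$ links to another common neighbor of $x$ and $y$, $z$ is called a CAR-triangle-neighbor, and if $T(x, y, z) > 0$ (see (18)), $z$ is called a TRA-triangle-neighbor. Let $\Lambda_{xy}$ be the set of triangle-neighbors of the pair, and let $\Lambda^{CAR}_{xy}$, $\Lambda^{TRA}_{xy}$ denote the sets of CAR- and TRA-triangle-neighbors, respectively. Clearly, $\Lambda^{CAR}_{xy} \subseteq \Lambda^{TRA}_{xy} \subseteq \Lambda_{xy}$. Let $S_1$ and $S_2$ be two subsets of $S$. For any pair in $S_1$, at least one of their shared neighbors is not a triangle-neighbor, and for any pair in $S_2$, all of their shared neighbors are not triangle-neighbors. More explicitly,

$$S_1 = \{(x, y) \in S \mid \exists z \in \Gamma(x) \cap \Gamma(y),\ z \notin \Lambda_{xy}\}, \qquad S_2 = \{(x, y) \in S \mid \Lambda_{xy} = \emptyset\}. \tag{13}$$

Similarly, we define $S^{CAR}_1$, $S^{CAR}_2$, $S^{TRA}_1$, and $S^{TRA}_2$, which are

$$S^{CAR}_1 = \{(x, y) \in S \mid \exists z \in \Gamma(x) \cap \Gamma(y),\ z \notin \Lambda^{CAR}_{xy}\}, \qquad S^{CAR}_2 = \{(x, y) \in S \mid \Lambda^{CAR}_{xy} = \emptyset\},$$

$$S^{TRA}_1 = \{(x, y) \in S \mid \exists z \in \Gamma(x) \cap \Gamma(y),\ z \notin \Lambda^{TRA}_{xy}\}, \qquad S^{TRA}_2 = \{(x, y) \in S \mid \Lambda^{TRA}_{xy} = \emptyset\}. \tag{14}$$

Correspondingly, the ratios of those subsets to $S$ are, respectively, defined as

$$r_i = \frac{|S_i|}{|S|}, \qquad r^{CAR}_i = \frac{|S^{CAR}_i|}{|S|}, \qquad r^{TRA}_i = \frac{|S^{TRA}_i|}{|S|}, \qquad i = 1, 2. \tag{15}$$

Table 2 lists these ratios over the 12 networks.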
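The classification of shared neighbors and the resulting ratios can be sketched as below. This is an illustration under stated assumptions: the network is a dictionary `adj` of neighbor sets, every unordered node pair with at least one common neighbor is counted as a seed node pair, a TRA-triangle-neighbor is taken to be a shared neighbor that has a further common neighbor with at least one seed node, and all function names and dictionary keys are hypothetical.

```python
from itertools import combinations

def triangles_through(adj, z):
    """Number of triangles passing through node z."""
    return sum(len(adj[u] & adj[z]) for u in adj[z]) // 2

def neighbor_kinds(adj, x, y):
    """Map each shared neighbor z to the flags (triangle, car, tra)."""
    common = adj[x] & adj[y]
    kinds = {}
    for z in common:
        is_triangle = triangles_through(adj, z) > 0
        is_car = len(adj[z] & common) > 0               # z links to another shared neighbor
        is_tra = len(adj[x] & adj[z]) + len(adj[y] & adj[z]) > 0
        kinds[z] = (is_triangle, is_car, is_tra)
    return kinds

def subset_ratios(adj):
    """Fraction of seed pairs with some / only shared neighbors lacking each property."""
    names = ("triangle", "car", "tra")
    counts = {prefix + name: 0 for name in names for prefix in ("some_non_", "all_non_")}
    total = 0
    for x, y in combinations(adj, 2):
        kinds = neighbor_kinds(adj, x, y)
        if not kinds:
            continue                                    # the pair has no common neighbor
        total += 1
        for index, name in enumerate(names):
            flags = [flag[index] for flag in kinds.values()]
            counts["some_non_" + name] += not all(flags)
            counts["all_non_" + name] += not any(flags)
    return {key: value / total for key, value in counts.items()} if total else {}
```

The `all_non_car` ratio is the share of seed pairs that a CAR-based index scores as zero, and `all_non_triangle` plays the same role for CCLP.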

2.4. Wilcoxon Signed-Ranks Test

The Wilcoxon signed-ranks test is a nonparametric statistical hypothesis test used to check whether two methods perform equally well over multiple networks [38, 39]. Let $d_i$ be the difference in performance scores of two link prediction methods on the $i$th network. The differences are ranked in accordance with their absolute values; in case of ties, average ranks are assigned. Let $R^+$ be the sum of ranks for the networks on which the second method outperformed the first, and $R^-$ the sum of ranks for the opposite. For a larger number of networks, the statistic

$$z = \frac{T - \frac{1}{4} N (N + 1)}{\sqrt{\frac{1}{24} N (N + 1)(2N + 1)}} \tag{16}$$

is distributed approximately normally [39]. In (16), $T = \min(R^+, R^-)$ and $N$ is the number of networks.

With $\alpha = 0.05$, if $z$ is smaller than $-1.96$, we reject the null hypothesis, which states that both methods perform equally well.
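The test statistic can be computed with the standard library alone. The sketch below is illustrative; the function names are assumptions, and zero differences are handled by splitting their ranks evenly between the two rank sums, which is a common convention that the text above does not spell out.

```python
import math

def average_ranks(values):
    """1-based ascending ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for position in range(i, j + 1):
            ranks[order[position]] = shared_rank
        i = j + 1
    return ranks

def wilcoxon_z(first, second):
    """z statistic built from min(R+, R-) over paired per-network scores."""
    diffs = [b - a for a, b in zip(first, second)]
    ranks = average_ranks([abs(d) for d in diffs])
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    zero = sum(r for r, d in zip(ranks, diffs) if d == 0)
    r_plus += zero / 2      # assumption: ranks of zero differences are split evenly
    r_minus += zero / 2
    n = len(diffs)
    t = min(r_plus, r_minus)
    return (t - n * (n + 1) / 4) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
```

Because the statistic uses the smaller of the two rank sums, it is never positive; a value below -1.96 signals a significant difference at the 0.05 level.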

3. The Proposed Index

The link prediction problem is closely related to the network evolution mechanism [2, 40]. A recently proposed triangle growth mechanism demonstrates that various key features observed in most real-world networks can be generated in simulated networks [41]. Therefore, triangle structure information plays an important role in link formation.

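In this work, we focus on a new triangle structure, namely, the TRA-triangle. A TRA-triangle passes through one seed node, one common neighbor, and one other node. In our opinion, the common neighbors that can form TRA-triangles are more important than others. Given two nodes $x$ and $z$, we denote the number of triangles passing through them as $t(x, z)$, which is

$$t(x, z) = |\Gamma(x) \cap \Gamma(z)|, \tag{17}$$

where $\Gamma(x)$ is the set of neighbors of node $x$.

For the example network in Figure 1(a), the triangles used for the two seed nodes are shown in Figure 1(d). A common neighbor that forms more triangles with one seed node than with the other is in closer contact with the former. Given seed nodes $x$ and $y$, let $z$ be one of their common neighbors. The function $T(x, y, z)$ sums up the number of TRA-triangles formed by $x$, $z$ and by $y$, $z$, which is

$$T(x, y, z) = t(x, z) + t(y, z). \tag{18}$$

In this paper, we propose a new similarity index by combining the aforementioned triangle structure and the idea of the RA index [13]. For convenience, we name our new method the TRA index. Its definition is

$$TRA(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1 + T(x, y, z)}{k_z}, \tag{19}$$

where $k_z$ is the degree of node $z$.

In (19), the numerator is $1 + T(x, y, z)$. Therefore, the TRA index does not miss the effect of any common neighbor. If all common neighbors are zero-triangle-neighbors, then $T(x, y, z) = 0$ for each of them and TRA degenerates to RA. For the example network in Figure 1(a), the TRA score of the seed nodes follows by applying (19) to each of their common neighbors.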
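A minimal sketch of the index follows, assuming that TRA sums $(1 + T(x, y, z))/k_z$ over the common neighbors $z$; this reading is an assumption chosen to be consistent with the statements that no common neighbor is missed and that TRA degenerates to RA when no triangles are present. The dictionary `adj` of neighbor sets, the function names, and the toy network (which is not the network of Figure 1) are hypothetical.

```python
def ra(adj, x, y):
    """Resource Allocation index, used here only for comparison."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])

def tra(adj, x, y):
    """TRA sketch: each common neighbor z contributes (1 + T(x, y, z)) / k_z."""
    score = 0.0
    for z in adj[x] & adj[y]:
        t_xz = len(adj[x] & adj[z])   # triangles through x and z
        t_yz = len(adj[y] & adj[z])   # triangles through y and z
        score += (1 + t_xz + t_yz) / len(adj[z])
    return score

# Hypothetical toy network: seeds x and y share a, b, c, d; a and b are linked.
toy = {
    "x": {"a", "b", "c", "d"}, "y": {"a", "b", "c", "d"},
    "a": {"x", "y", "b"}, "b": {"x", "y", "a"},
    "c": {"x", "y"}, "d": {"x", "y"},
}
print(tra(toy, "x", "y"), ra(toy, "x", "y"))
```

Here `a` and `b` each form one triangle with each seed node, so they contribute more than `c` and `d`, which contribute exactly their RA share. Once the link between `a` and `b` is removed, the two scores coincide.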

4. Experimental Results

Table 3 lists the prediction results of the different methods in terms of AUC on the 12 networks. The results are obtained by averaging over 50 independent realizations for each network, with the testing set containing 10% of the links. The highest AUC value for each network is highlighted in boldface. Clearly, the TRA index gets nine best results over the 12 networks. Meanwhile, the TRA index outperforms the CAR, CAA, CRA, and CCLP indexes on all networks. We can see from Table 2 that, on most of the networks, a varying share of the seed node pairs with common neighbors have at least one shared neighbor, or only shared neighbors, that are not triangle-neighbors. As stated in the Introduction, the CCLP index gives lower or zero similarity scores to those pairs. Furthermore, the two corresponding ratios for CAR-triangle-neighbors are very high on most of the networks. Particularly, on Dolphin, Email, Hamster, HEP, and Yeast, the ratio of pairs with no CAR-triangle-neighbor among their shared neighbors is greater than 0.8. This indicates that only a very small fraction of the seed node pairs with common neighbors on those networks can be assigned nonzero similarity scores by the CAR-based indexes. Although some seed node pairs have shared neighbors that are not TRA-triangle-neighbors, the TRA index can still assign reasonable similarity scores to them. Therefore, the results of the TRA index in Table 3 are better than those of the CAR, CAA, CRA, and CCLP indexes. Among the CN, AA, RA, and ADP indexes, the ADP index performs the best, since it penalizes common neighbors by automatically adapting to the network. On Dolphin, HEP, and USAir, the ADP index obtains the best accuracy, and the performance of our index is close to the best. In addition, the TRA index achieves much better AUC scores than the others on FW and Karate. This result suggests that TRA-triangles play an important role on these two networks. From Table 1, both networks are dense. Roughly speaking, TRA-triangle-neighbors between seed nodes are more likely to exist on dense networks than on sparse ones.
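The evaluation protocol (random split, scoring on the training network, averaging over independent realizations) can be sketched as follows. This is not the authors' experimental code: the function names, the default number of comparisons, and the use of RA as a stand-in scoring function are assumptions of this sketch, and any index with the same call signature can be passed as `score`.

```python
import random
from itertools import combinations

def split_links(edges, probe_fraction, rng):
    """Randomly split the link list into (training links, testing links)."""
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_probe = max(1, round(probe_fraction * len(shuffled)))
    return shuffled[n_probe:], shuffled[:n_probe]

def build_adj(nodes, edges):
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def ra(adj, x, y):
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])

def mean_auc(nodes, edges, score=ra, probe_fraction=0.1,
             realizations=50, comparisons=1000, seed=0):
    """Average sampled AUC over independent random splits."""
    rng = random.Random(seed)
    linked = {frozenset(edge) for edge in edges}
    nonexistent = [pair for pair in combinations(nodes, 2)
                   if frozenset(pair) not in linked]
    total = 0.0
    for _ in range(realizations):
        train, probe = split_links(edges, probe_fraction, rng)
        adj = build_adj(nodes, train)          # scores only see training links
        hits = 0.0
        for _ in range(comparisons):
            s_missing = score(adj, *rng.choice(probe))
            s_nonexistent = score(adj, *rng.choice(nonexistent))
            if s_missing > s_nonexistent:
                hits += 1.0
            elif s_missing == s_nonexistent:
                hits += 0.5
        total += hits / comparisons
    return total / realizations
```

Calling `mean_auc(nodes, edges, probe_fraction=0.2)` evaluates the same index with a larger testing set.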

To check whether the proposed index is significantly different from the compared methods, we applied the Wilcoxon signed-ranks test [39] to the results in Table 3. The pairwise test results are presented in Figure 2. From the statistical point of view, our index is significantly better than the others except the ADP index, because the ADP index is able to adapt to the structure of a network automatically. Although there is no statistical difference between our index and the ADP index according to the Wilcoxon signed-ranks test, our index performs better than the ADP index in terms of AUC.

Figure 3 shows how AUC changes on the 12 networks when the proportion of links assigned to the testing set increases from 10% to 20%. It is evident from Figure 3 that the AUC values of all indexes trend downward as this proportion grows, except on FW. The reason is that a larger testing set shrinks the training set and thus reduces the number of common neighbors between seed nodes. Consequently, link prediction becomes more difficult. The FW network, which has a high average degree, a small average shortest distance, and small degree heterogeneity, is a very dense network. Therefore, shrinking the training set has only a slight influence on accuracy on FW. In addition, we can observe from Figure 3 that the performance of all indexes on ADV, CE, Dolphin, Email, Hamster, HEP, Karate, Word, and Yeast is very similar. On these nine networks, the AUC values of the CAR-based indexes are obviously lower than those of the others. On FW, the results of the CAR-based indexes are better than those of the CN, AA, RA, and ADP indexes, because FW is a very dense network in which the ratio of CAR-triangle-neighbors is very high (see Table 2). On PB and USAir, the performance of the CAR-based indexes is not as bad as on the other nine networks. The reason is that both networks have high average degrees, small average shortest distances, and high ratios of CAR-triangle-neighbors.

Furthermore, we list in Table 4 the AUC values of the different methods on the 12 networks when the testing set contains 20% of the links. Our index outperforms the others on eight of the 12 networks, while the CCLP index achieves the highest value on CE.

Table 5 gives the results in terms of ranking score. These results are similar to those in Table 3. The TRA index achieves the best ranking score on all networks except Dolphin, HEP, and USAir. The pairwise Wilcoxon signed-ranks test results are shown in Figure 4. Similar to the test in Figure 2, the TRA index is significantly better than the compared methods except the ADP index. As described above, ADP has the adaptive capability and hence performs better than the other compared methods.

Figure 5 shows how the ranking score changes on the 12 networks when the proportion of links in the testing set increases from 10% to 20%. Clearly, all indexes yield higher ranking scores as the testing set grows. Recall that a higher ranking score means lower accuracy. As analyzed above, FW is very dense. Thus, the changes of AUC on FW are very slight (see Figure 3). However, the changes of ranking score on FW are more evident, especially for the CAA and CRA indexes. The reason is that the calculation of ranking score considers all missing links. In addition, as seen in Figure 5, the CAA and CRA indexes perform worse than the CAR index according to ranking score. From the definitions of these three indexes, we find that both the CAA and CRA indexes suffer more from zero-triangle-neighbors than the CAR index does.

Finally, the ranking scores of all methods on the 12 networks when the testing set contains 20% of the links are listed in Table 6. Our index outperforms all other indexes except on HEP and USAir in terms of ranking score. These results are consistent with those of AUC. In contrast with FW, the influence of TRA-triangles on HEP and USAir is small.

From the above results, we can conclude that the TRA index is superior to the CAR-based indexes and the CCLP index and performs better than the common-neighbor-based methods on most networks.

5. Conclusion and Discussion

Link prediction is an important research topic in complex network analysis and has a wide range of applications in various fields. Inspired by the triangle growth mechanism in network evolution [41], this paper proposed the TRA index for link prediction. When computing the similarity between two seed nodes, the proposed index not only counts the contributions of all common neighbors but also emphasizes the importance of the neighbors that can form TRA-triangles. To some extent, TRA-triangles reflect the close relationships between neighbors and seed nodes. In addition, the proposed index adopts the theory of resource allocation [13] because of its effectiveness.

The accuracy of the TRA index is experimentally evaluated over 12 real-world networks from various fields in terms of AUC and ranking score. The experimental results show that the proposed index performs far better than the CAR-based indexes. Meanwhile, our index outperforms the CCLP index because it counts the contribution of every common neighbor. Compared with the common-neighbor-based methods, the proposed index yields some improvement in accuracy on most networks. These results indicate that combining the information of TRA-triangles and the theory of resource allocation in a similarity index is a helpful idea for link prediction.

Several directions remain for future work on our index. One is to analyze the degree of influence of TRA-triangles on different networks and then to set the weight of TRA-triangles adaptively for each network. Another is to study the application of the TRA index to other topics, such as community detection and anomaly detection. In addition, for learning-based link prediction approaches, the TRA index can be used as a feature of a node pair.

Data Availability

The networks used in this study are available from http://deim.urv.cat/~alexandre.arenas/data/welcome.htm, http://www-personal.umich.edu/~mejn/netdata/, http://vlado.fmf.uni-lj.si/pub/networks/data/, http://noesis.ikor.org/datasets/link-prediction, and http://konect.uni-koblenz.de/networks/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61602225) and the Fundamental Research Funds for the Central Universities (no. lzujbky-2017-192).