A Method for Improving the Accuracy of Link Prediction Algorithms
Link prediction is a key tool for studying the structure and evolution mechanisms of complex networks. Recommending new friend relationships through accurate link prediction is one of the important factors in the evolution, development, and popularization of social networks. Scholars have proposed many link prediction algorithms based on local-information similarity and random walks. These algorithms help identify missing and false links in various networks. However, the prediction results differ significantly across networks with different structures, and the prediction accuracy is often low. This study proposes a method for improving the accuracy of link prediction. Before link prediction, the k-shell decomposition method is used to layer the network, and the nodes in the 1-shell, together with the nodes in the 2-shell that are not linked to the high-shell, are deleted. Experiments on four real network datasets verify the effectiveness of the proposed method.
Link prediction refers to predicting the possibility of a link between two nodes in the network that have not yet formed a link, or inferring the missing links in the network, from the known network topology, node attributes, and other information. Link prediction can be used in different fields because it predicts not only the missing links in the network but also the possible future links in the evolving network. In social networks, recommending links that do not yet exist as potential friendship relationships can help users find new friends. In battlefield communication networks, link prediction supplements network topology reconnaissance by predicting the hidden and not-yet-detected links of the enemy’s communication network, providing conditions for network confrontation. In biological networks, such as protein-protein interaction and metabolic networks, a large number of experiments are required to prove that two nodes are connected, and completing these experiments is costly and time-consuming. Therefore, predicting the unknown links in the network from the known ones is more reasonable and effective than checking all potential links without a solid basis. Link prediction is also often applied to personalized recommendation. For example, in e-commerce systems, predictions are first made from user clusters and purchase relationships, and personalized product recommendations are then provided to different users to improve the platform’s performance and order quantity. In summary, link prediction research has important theoretical and practical value. Nevertheless, the accuracy of link prediction algorithms can be further improved when they are applied to real networks, especially social networks. In this regard, this study proposes a method for improving the accuracy of link prediction algorithms.
Numerous link prediction algorithms exist. Scholars in various fields are particularly interested in similarity methods based on network structure, which have low complexity and good performance. Common neighbors (CN), the simplest similarity index, measures similarity by counting the common neighbors of two nodes. On the basis of CN, researchers proposed other local similarity indices, such as the Jaccard index, the Adamic-Adar index, and the resource allocation (RA) index, which predict better than CN. Barabási and Albert proposed the preferential attachment process based on degree. Lü et al. considered third-order paths on the basis of CN and proposed the local path index, which produces excellent results at the expense of some additional complexity. Liu et al. proposed the extended resource allocation (ERA) algorithm on the basis of RA. Li et al. introduced the potential information capacity (PIC) index. Some of these indices, such as RA, ERA, and PIC, model the resource transmission process in complex networks from different perspectives. Global indices that use the information of the whole network (e.g., the all-paths Katz index, average commute time, and the cosine similarity index Cos+) have also been proposed and perform better than local methods. However, their complexity is high, and applying them to large and complex networks is difficult. RA achieves excellent results at low complexity, close to or even better than those of global indices in some networks. Kovács et al. proposed the L3 algorithm, which produces satisfactory results in protein interaction networks. As the study of network evolution modes and link generation mechanisms advances, the accuracy of link prediction algorithms can be further improved.
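To make the local similarity indices above concrete, the following minimal Python sketch computes CN, the Jaccard index, and the Adamic-Adar index for a node pair. It is an illustration under stated assumptions rather than code from this study: the graph is assumed to be undirected and free of self-loops, and the function names and the dictionary-of-sets graph representation are choices made here for the example.

```python
import math

def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def common_neighbors(adj, x, y):
    """CN: number of neighbours shared by x and y."""
    return len(adj[x] & adj[y])

def jaccard(adj, x, y):
    """Jaccard index: shared neighbours divided by all neighbours of x or y."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

def adamic_adar(adj, x, y):
    """Adamic-Adar index: shared neighbours weighted by 1 / log(degree)."""
    # A common neighbour of two distinct nodes has degree >= 2, so log(k_z) > 0
    # (assuming no self-loops).
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])
```

All three scores depend only on the neighbourhoods of the two nodes, which is why such indices stay cheap on large networks.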
The main contributions of this study are as follows:
(i) We infer that, in the k-shell decomposition, if nodes in an outer shell share links with nodes in the high-shell, the possibility of links between the former and other nodes increases.
(ii) We propose a method for improving the accuracy of link prediction algorithms by deleting the outer nodes of the k-shell decomposition. This approach provides insights into the enhancement of link prediction accuracy.
(iii) We evaluate the proposed method on four different network datasets, compare state-of-the-art algorithms [8, 16] at different test set percentages, and verify that the proposed method improves link prediction accuracy.
Link prediction estimates the likelihood of a link between any two nodes in the network. However, some of these candidate links will never exist in the actual network, which reduces the accuracy of link prediction. Deleting nodes with a low probability of forming links before link prediction might therefore improve accuracy. On this basis, we design the following method:
First, the 1-shell is obtained in accordance with the k-shell method. All nodes with a degree of 1 and their corresponding edges are removed. If new nodes with a degree of 1 appear in the network, these nodes and their connected edges are also removed. This operation is repeated until no nodes with a degree of 1 remain in the network. The removed nodes and the edges between them constitute the 1-shell of the network. The 2-shell, 3-shell, and so on are obtained using the same procedure.
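The peeling procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments; the function names and the dictionary-of-sets representation are assumptions made here, and the graph is assumed to be undirected with no self-loops. Isolated nodes, which the text does not discuss, are assigned shell 0 in this sketch.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def k_shell_decomposition(adj):
    """Return {node: shell index} obtained by iterative degree-based peeling."""
    # Work on a copy so the caller's graph is left untouched.
    remaining = {u: set(vs) for u, vs in adj.items()}
    shell = {}
    while remaining:
        # The next shell index is the smallest degree still present.
        k = min(len(vs) for vs in remaining.values())
        peel = [u for u, vs in remaining.items() if len(vs) <= k]
        # Keep peeling: removing nodes may push other degrees down to k or below.
        while peel:
            for u in peel:
                shell[u] = k
                for v in remaining.pop(u):
                    if v in remaining:
                        remaining[v].discard(u)
            peel = [u for u, vs in remaining.items() if len(vs) <= k]
    return shell
```

For a triangle a-b-c with a chain e-d-a attached, node e is peeled first, which lowers the degree of d to 1, so d also falls into the 1-shell, while the triangle forms the 2-shell.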
A node in the 1-shell corresponds to a person who is not good at socializing in the social network. The possibility of this person making new friends is extremely low. Therefore, we delete the nodes that are in the 1-shell. If a node in the 2-shell is not connected to any node in a high-shell, it is also deleted. As shown in Figure 1, nodes 9 and 11 are in the 1-shell, so they are deleted. Nodes 12, 13, 14, and 15 are in the 2-shell. Since nodes 13 and 14 are not connected to x-shell (x ≥ 4) nodes, they are deleted. Figure 2 shows the resulting network. We then apply the link prediction algorithm. If the k-shell decomposition contains many shells or k is large, the nodes in the 1-shell and 2-shell can be deleted, and the nodes in the 3-shell that are not connected to x-shell (x > 4) or more outer shells can also be deleted.
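The deletion step can be sketched in Python as below. This is an illustrative sketch under assumptions, not the authors' code: the function names are hypothetical, and the cutoff that defines a "high-shell" is exposed as the parameter `high_shell_min`, set to 4 by default to match the x > 3 condition used for 2-shell nodes in the experiments. Isolated nodes (shell 0) are also dropped here, which is an assumption of this sketch.

```python
def k_shell(adj):
    """Return {node: shell index} by iterative degree-based peeling."""
    remaining = {u: set(vs) for u, vs in adj.items()}
    shell = {}
    while remaining:
        k = min(len(vs) for vs in remaining.values())
        peel = [u for u, vs in remaining.items() if len(vs) <= k]
        while peel:
            for u in peel:
                shell[u] = k
                for v in remaining.pop(u):
                    if v in remaining:
                        remaining[v].discard(u)
            peel = [u for u, vs in remaining.items() if len(vs) <= k]
    return shell

def prune_outer_shells(adj, high_shell_min=4):
    """Delete 1-shell nodes and 2-shell nodes with no neighbour in a high shell."""
    shell = k_shell(adj)
    keep = set()
    for u, s in shell.items():
        if s <= 1:
            continue  # 1-shell (and isolated) nodes are deleted
        if s == 2 and not any(shell[v] >= high_shell_min for v in adj[u]):
            continue  # 2-shell node without a link to a high shell
        keep.add(u)
    # Return the subgraph induced by the kept nodes.
    return {u: adj[u] & keep for u in keep}
```

Shell indices are computed once on the original network, so a 2-shell node is judged by the shells of its original neighbours rather than by the network that remains after earlier deletions. Link prediction is then run on the returned subgraph.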
3. Experiments and Results
We conducted experiments on four real network datasets from different fields. Because this study considers undirected and unweighted networks, the weights and directions of the edges are disregarded. The four networks are as follows:
(i) Lesmis: a network of the relationships among the characters of the novel “Les Misérables.” A node represents a character, and an edge between two nodes shows that these two characters appeared in the same chapter of the book.
(ii) Polbooks: a network of the purchases of United States political books.
(iii) Metabolic: the metabolic network of the roundworm Caenorhabditis elegans. Nodes are metabolites (e.g., proteins), and edges are interactions between them. Since a metabolite can interact with itself, the network contains loops. The interactions are undirected. There may be multiple interactions between any two metabolites.
(iv) Net-science: a network of coauthorships in the area of network science.
Table 1 summarizes the basic topology features of these networks, where $N$, $M$, $\langle k \rangle$, and $C$ represent the number of network nodes, number of network edges, average degree, and clustering coefficient, respectively.
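As a rough guide to how these features can be obtained from an edge list, the sketch below computes them in plain Python. It rests on assumptions that the text does not settle: the clustering coefficient is taken to be the average of the local clustering coefficients (nodes of degree below 2 contribute 0), self-loops and repeated edges are ignored, and the function name `network_summary` is hypothetical.

```python
def network_summary(edges):
    """Return (N, M, average degree, average local clustering coefficient)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops are ignored
        adj.setdefault(u, set()).add(v)  # sets collapse repeated edges
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    m = sum(len(vs) for vs in adj.values()) // 2
    avg_degree = 2 * m / n if n else 0.0
    total = 0.0
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # no neighbour pairs, contributes 0
        # Each link among the neighbours of u is counted twice in this sum.
        twice_links = sum(len(adj[a] & nbrs) for a in nbrs)
        total += twice_links / (k * (k - 1))
    clustering = total / n if n else 0.0
    return n, m, avg_degree, clustering
```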
Area under the curve (AUC) indicators are commonly used to evaluate the accuracy of link prediction. The network is divided into a training set and a test set. After the link score between every two nodes is calculated from the training set, one edge is randomly selected from the test set and one from the set of nonexistent edges, and their scores are compared. If the similarity score of the test-set edge is greater than that of the nonexistent edge, one point is added ($n'$ times in total). If the two similarity scores are equal, 0.5 point is added ($n''$ times in total). For $n$ independent comparisons, the AUC is expressed as

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}.$$
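The sampling-based AUC estimate described above can be sketched as follows. This is a minimal illustration and not the evaluation code of this study; the function name `auc_score`, the callback `score(x, y)`, the default of 10000 comparisons, and the `seed` parameter are all assumptions introduced for the example.

```python
import random

def auc_score(score, test_edges, nonexistent_edges, n=10000, seed=0):
    """Estimate AUC from n random comparisons of a test edge and a nonexistent edge.

    score(x, y) returns the similarity of a node pair computed on the training set.
    test_edges and nonexistent_edges are lists of (x, y) pairs.
    """
    rng = random.Random(seed)
    n_higher = 0  # comparisons where the test edge scores strictly higher
    n_equal = 0   # comparisons where both edges score the same
    for _ in range(n):
        s_test = score(*rng.choice(test_edges))
        s_none = score(*rng.choice(nonexistent_edges))
        if s_test > s_none:
            n_higher += 1
        elif s_test == s_none:
            n_equal += 1
    return (n_higher + 0.5 * n_equal) / n
```

A score that carries no information yields a value near 0.5, and a score that always ranks test edges above nonexistent ones yields 1.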
We use the classic link prediction algorithm RA (RA has the best overall performance over the other 10 algorithms) and the new algorithm L3 (L3 performs best on the protein network) in the experiment. For any two nodes x, y in the RA method, the possibility of link existence is calculated as follows:

$$s_{xy}^{\mathrm{RA}} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z},$$

where $\Gamma(x)$ is the set of neighbor nodes of node $x$ and $k_z$ is the degree of node $z$.
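The RA score sums the inverse degrees of the common neighbors of the two nodes. A minimal Python sketch follows, assuming the training network is stored as a dictionary that maps each node to the set of its neighbours; the function name and this representation are choices made for the example.

```python
def resource_allocation(adj, x, y):
    """RA score of the pair (x, y): sum of 1 / degree over common neighbours."""
    # A common neighbour has degree >= 1, so the division is always defined.
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])
```

A common neighbour with many links contributes little, because the resource it passes on is split among all of its neighbours.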
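For any two nodes x, y in the L3 method, the possibility of link existence is calculated as follows:

$$s_{xy}^{\mathrm{L3}} = \sum_{u,v} \frac{a_{xu}\, a_{uv}\, a_{vy}}{\sqrt{k_u k_v}}.$$

Here, $a_{xu} = 1$ if there is a connection between nodes $x$ and $u$, and $a_{xu} = 0$ otherwise; $k_u$ is the degree of node $u$.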
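The L3 score counts paths of length three between x and y and divides each path by the square root of the product of the degrees of its two intermediate nodes. The sketch below illustrates this under the same assumptions as before: an undirected graph without self-loops, stored as a dictionary of neighbour sets, with a function name chosen here for the example.

```python
import math

def l3_score(adj, x, y):
    """L3 score of (x, y): degree-normalised count of paths x - u - v - y."""
    total = 0.0
    for u in adj[x]:          # a_xu = 1
        for v in adj[u]:      # a_uv = 1
            if y in adj[v]:   # a_vy = 1
                total += 1.0 / math.sqrt(len(adj[u]) * len(adj[v]))
    return total
```

For a candidate pair that is not yet linked, every term found by the loops corresponds to a genuine three-step path, since walks that revisit x or y would require the link x-y to exist.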
We set the test set ratio in each network to 10%, 20%, 30%, 40%, and 50% in turn. We take the largest connected subgraph of the test set and perform 20 independent experiments on each divided dataset. The obtained AUC values are averaged. The AUC results of the two algorithms under the proposed method are shown in Figure 3. RA∗ and L3∗ denote the results of the RA and L3 algorithms after the proposed method has been applied to delete some nodes.
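The repeated random division into training and test sets can be sketched as follows. This is a hypothetical outline rather than the experimental code: `split_edges`, `average_over_runs`, and the `evaluate(train, test)` callback are names introduced here, the largest-connected-subgraph step is not reproduced, and seeding each run with its index is simply one way to make the 20 runs independent and repeatable.

```python
import random

def split_edges(edges, test_ratio, seed=None):
    """Randomly split an edge list into (training edges, test edges)."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(round(test_ratio * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

def average_over_runs(edges, test_ratio, evaluate, runs=20):
    """Average evaluate(train, test), e.g. an AUC value, over independent splits."""
    results = []
    for run in range(runs):
        train, test = split_edges(edges, test_ratio, seed=run)
        results.append(evaluate(train, test))
    return sum(results) / len(results)
```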
As shown in Figure 3, the AUC of L3∗ and RA∗ is higher than that of L3 and RA. That is, deleting the nodes in the 1-shell and the nodes in the 2-shell that are not linked to the x-shell (x > 3) in the above networks, and then applying the RA and L3 algorithms, improves the AUC values. This finding indicates that the proposed method can improve the accuracy of link prediction.
The present study still requires further improvements, including the following: (1) determining whether all nodes that are in 1-shell should be deleted, (2) finding new methods for deleting nodes without affecting network connectivity to improve the accuracy of link prediction, and (3) determining whether the method for deleting nodes is suitable for other link prediction algorithms.
Our proposed method of deleting nodes provides an idea for dealing with large networks. However, we also found that for some networks the k-shell decomposition yields only one or a few layers. Deleting the outermost layer could then remove a large number of nodes, which will affect the experimental results. Some other networks do not have 1-shell and 2-shell layers. These cases remain to be addressed.
This study proposes a method for improving the accuracy of link prediction. Before using the link prediction algorithm, the network is layered using k-shell, and the nodes in the 1-shell (outermost shell) and the 2-shell (secondary outer shell) that are not connected to those in the high-shell are deleted. The RA and L3 algorithms are used to conduct experiments on four real network datasets in different fields. The results show that the accuracy of the algorithms increases after deleting the abovementioned nodes, thereby verifying the effectiveness of the proposed method.
Data are available in the following links: http://www-personal.umich.edu/~mejn/netdata/, https://deim.urv.cat/~alexandre.arenas/data/welcome.htm, and http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work has been supported by the National Key Research and Development Program under No. 2018YFB2100100, National Natural Science Foundation of China under Grant No. 62066048, and Postdoctoral Science Foundation of China under No. 2020M673312.
P. Jaccard, “Etude comparative de la distribution florale dans une portion des Alpes et des Jura,” Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 547–579, 1901.
D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, ACM, New York, NY, USA, 1993.