Abstract

Link prediction in complex networks estimates the likelihood of a link between two nodes that are not yet connected, based on the known network structure and attributes. It can be applied in various fields, such as friend recommendation in social networks and the prediction of protein-protein interactions in biology. In social networks, however, link prediction raises concerns about privacy and security: with link prediction algorithms, criminals can predict the friends of an account user and may go on to discover private information such as addresses and bank accounts. It is therefore urgent to develop strategies that protect privacy by preventing identification through link prediction algorithms, using low-cost perturbations of the network structure such as deleting, changing, and adding edges. This article focuses on how network structural preference perturbation through deletion influences link prediction. Extensive experiments on several real networks show that edges between large-degree and small-degree nodes and edges between medium-degree nodes have the most significant impact on the quality of link prediction.

1. Introduction

Complex networks play an important role in modeling and analyzing complex systems such as social, biological, and information systems [1]. Generally, individuals, such as human beings, biological elements, and computers, are represented by nodes, while links represent the relations or interactions between nodes [2]. The theory of complex networks offers new insight into connections in the real world. Related research covers not only single-layer networks, with topics such as clustering [3], link prediction [4], and community discovery [5, 6], but also multilayer networks [7], with topics such as cascade [8, 9], communication [10–13], synchronization [14], and game [15–17].

Link prediction in a complex network covers both links that exist but are unknown and links that will be built in the future. Based on the structure and attributes of the observed network, it aims to estimate the likelihood of a link between two nodes that are not yet connected [18]. For a given undirected network, as shown in Figure 1, the solid lines signify known edges, while the dotted lines signify edges that are unknown or will be built in the future. The task is to accurately predict the edges denoted by the dotted lines from the known nodes and edges. As one of the research directions of data mining, link prediction can be applied in various fields. For instance, in social networks, link prediction can recommend new friends to users by predicting whether two strangers online are acquaintances offline. In biology, predicting protein-protein interactions can significantly reduce the cost of experiments. In addition, researchers have applied the ideas and methods of link prediction to node classification, such as determining the type of an article in an academic network [19].

Researchers have made considerable efforts to enhance the accuracy of link prediction. Many link prediction algorithms are based on the similarity between nodes, assuming that the more similar two nodes are, the more likely it is that a link exists between them. The indices describing the similarity between nodes can be roughly classified into local, global, and quasi-local indices. Local indices are the most commonly used among the similarity-based methods because they are simple and scale to very large networks. Other link prediction algorithms are based on maximum likelihood estimation; they perform better on networks with a distinct hierarchical structure, such as a grassland food chain [20]. Still other algorithms employ probabilistic models, which process not only the network structure but also the attributes of nodes. These algorithms achieve higher prediction accuracy at the price of greater computational complexity [21].

Link prediction has proven useful in biological networks. In social networks, however, it may raise concerns about privacy and security, because our data are valuable not only to enterprises and public entities but also to an increasing number of cybercriminals who conduct network analysis for malicious purposes. With a link prediction algorithm, cybercriminals can accurately predict the friends of a social account user and even identify the owner of that account from his or her relationships. If they dig further, criminals may find the name, age, address, bank account, and other private information of the person behind a social account.

Given the above, it is urgent to improve privacy protection. However, there is currently little in-depth research on how to prevent identification by link prediction algorithms by concealing, changing, or adding edges, that is, by perturbing the network structure at a small cost. Based on the perturbation of the adjacency matrix, Lu et al. [22] studied the influence of structural consistency on link predictability. Waniek et al. [23] studied how to conceal sensitive relations in the network through a reconnection strategy and proposed a heuristic method for doing so. Wu et al. [24] proposed an active learning algorithm and applied perturbation on the most symbolic links in the network to adjust the structural predictability of the graph.

Building on this previous research, this article focuses on how network structural preference perturbation through deletion influences link prediction. Extensive experiments on several real networks show that edges between large-degree and small-degree nodes and edges between medium-degree nodes have the most significant impact on the performance of link prediction. In real-world networks, nodes do not choose their connections uniformly; there is an obvious preference, which leads to a certain correlation between connected nodes. Based on this connection correlation, the concepts of homogeneity and heterogeneity have been put forward to distinguish the connection preferences of nodes. The heterogeneity of a complex network thus measures how far the connections among its nodes deviate from uniformity. If nodes tend to connect to similar nodes, they form a homogeneous network; if high-degree and low-degree nodes connect to each other with a certain probability, they form a heterogeneous network.

2. Model Demonstration

In this section, we first introduce the basic terminology used in this article and give formal definitions based on it. We then present the method of network structural preference perturbation. Finally, we give the pseudocode of this method.

2.1. Definition

A complex network: a given biological or social network can be modeled as a graph, $G = (V, E, A)$, in which $V$ denotes the set of nodes in the network, $E$ denotes the set of edges, and $A$ denotes the adjacency matrix of the network. When there is a link between nodes $i$ and $j$, the elements of matrix $A$ satisfy $a_{ij} = 1$; otherwise, they satisfy $a_{ij} = 0$. We can use source and target to describe the relationship between nodes, thus dividing networks into directed and undirected networks. In directed networks, a link, called an arc, is built from a source node to a target node. In undirected networks, a link, called an edge, makes no distinction between source and target nodes. This article mainly focuses on undirected networks; the approach can be extended to directed networks conveniently.
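As a small illustration of the definition above, the following Python sketch builds the adjacency matrix of an undirected network from an edge list. The function name `adjacency_matrix` and the convention that nodes are labeled 0 to n−1 are assumptions made for this example only.

```python
def adjacency_matrix(n, edges):
    """Build the symmetric 0/1 adjacency matrix of an undirected network."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1  # a link exists between i and j
        a[j][i] = 1  # undirected: no source/target distinction
    return a

# Usage: a 4-node path 0-1-2-3.
edges = [(0, 1), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
degrees = [sum(row) for row in A]  # row sums give the node degrees
```

Because the matrix is symmetric, every edge contributes to two row sums, so the degrees add up to twice the number of edges.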

Link prediction: in a given network $G$, $N$ denotes the number of nodes and $M$ denotes the number of edges. Therefore, the total number of node pairs is $N(N-1)/2$. If the universal set $U$ denotes the set of all pairs of nodes in a network, link prediction works as follows: any pair of nodes $(x, y)$ that is not yet linked in the network is assigned a score $s_{xy}$ by a chosen algorithm. Then, based on the scores assigned to the pairs of nodes, the $L$ pairs with the largest scores are chosen as the predicted edges.
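The scoring-and-ranking procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `predict_links`, the edge-list representation, and the common-neighbor score used in the usage example are assumptions made here.

```python
from itertools import combinations

def predict_links(nodes, edges, score, top_l):
    """Rank every unconnected node pair by `score` and return the top_l pairs."""
    # Store edges as frozensets so (x, y) and (y, x) denote the same edge.
    known = {frozenset(e) for e in edges}
    # Candidates: all N(N-1)/2 node pairs minus the known edges.
    candidates = [p for p in combinations(sorted(nodes), 2)
                  if frozenset(p) not in known]
    # Highest score first; the sort is stable, so ties keep their original order.
    ranked = sorted(candidates, key=lambda p: score(*p), reverse=True)
    return ranked[:top_l]

# Usage: score pairs by their number of common neighbors (illustrative choice).
nodes = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
neighbours = {n: set() for n in nodes}
for x, y in edges:
    neighbours[x].add(y)
    neighbours[y].add(x)

def common_neighbours(x, y):
    return len(neighbours[x] & neighbours[y])

top = predict_links(nodes, edges, common_neighbours, top_l=1)
```

Passing the score as a function keeps the ranking step independent of the particular similarity index.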

Network structural preference perturbation: in a given network $G$, network structural preference perturbation applies one or more of the operations of adding, changing, and deleting to the set of edges $E$ at a low cost. The aim of these operations is that, in the new network $G'$ that is eventually formed, the edges that are lost or yet to be built are barely predictable.

2.2. Method of Perturbation

The method of network structural preference perturbation consists of one or more of the operations of adding, changing, and deleting edges of a network. This article focuses on deletion and tries to identify the properties of the edges that most strongly influence the performance of link prediction. For a given network, the set of edges is divided into a train set and a test set. The perturbation is applied to the train set only. After the perturbation, edges are predicted in the training network by a link prediction algorithm, and the performance of link prediction is then calculated by comparing the predicted edges with those in the test set. In the following, the given network refers to the training network.
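The division into a train set and a test set can be sketched as below. The text does not state how the split is drawn, so the uniform random split, the function name `split_edges`, and the `test_ratio` value of 0.1 in the usage example are all assumptions for illustration.

```python
import random

def split_edges(edges, test_ratio, seed=None):
    """Randomly split an edge list into a train set and a test set."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(round(test_ratio * len(shuffled)))
    # The first n_test shuffled edges form the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]

# Usage: hold out one edge in ten (hypothetical ratio).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4),
         (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]
train, test = split_edges(edges, test_ratio=0.1, seed=0)
```

Perturbation would then be applied to `train` only, with `test` kept aside for evaluation.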

Once the training network has been separated from the original one, we apply perturbation to it. For any edge between nodes $i$ and $j$, its deletion value can be calculated through the following formula:

In this formula, $k_i$ and $k_j$ denote, respectively, the degrees of nodes $i$ and $j$, and $\alpha$ is an adjustable parameter that controls the preference of edge selection. Once $\alpha$ is fixed at a specific value, the weight of each edge can be calculated. For example, when $\alpha$ is slightly greater than 0, edges between large-degree and small-degree nodes and edges between medium-degree nodes have larger weights, which means they are more likely to be selected. Formula (1) ensures that the deletion values of all the edges sum to 1.
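The normalization property can be illustrated with a short Python sketch. The exact form of formula (1) is not reproduced here, so the sketch assumes, purely for illustration, that the unnormalized weight of an edge is a power of the product of its end-node degrees; `degree_product_weight` is a hypothetical stand-in and can be swapped for any other weight function. Only the normalization step and the role of the preference parameter are meant to carry over.

```python
from collections import Counter

def degree_product_weight(k_i, k_j, alpha):
    """Hypothetical unnormalized edge weight; a stand-in for formula (1)."""
    return (k_i * k_j) ** alpha

def deletion_values(edges, alpha, weight=degree_product_weight):
    """Return {edge: deletion value}, normalized so that the values sum to 1."""
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    raw = {(i, j): weight(degree[i], degree[j], alpha) for i, j in edges}
    total = sum(raw.values())
    return {e: w / total for e, w in raw.items()}

# Usage: a hub (node 0) with three neighbors plus one edge between two of them.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
values = deletion_values(edges, alpha=1.0)
uniform = deletion_values(edges, alpha=0.0)  # every edge equally likely
```

With the assumed weight, setting the preference parameter to 0 removes any degree preference, while other values shift the selection probability toward particular kinds of edges.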

After acquiring the deletion value of every edge, we set a proportion parameter $f$ in order to randomly pick out and delete the corresponding number of edges according to their deletion values. It is noteworthy that the degrees of the end nodes of an edge change after the edge has been deleted, and hence the deletion values of the remaining edges change as well. Ideally, the deletion values of the remaining edges should be recalculated every time an edge is deleted. However, the time complexity of the perturbation algorithm would then be enormous. To reduce the time complexity, a proportional interval parameter is introduced, so that the deletion values are recalculated only after a batch of edges of that size has been deleted.

The pseudocode of the proposed method is as follows (see Algorithm 1).

Input: adjacency matrix A of the original network, preference parameter α, deletion proportion f, and recalculation interval parameter
for each batch of deletions do
    A ← DELETE-EDGE(A)
end for
function DELETE-EDGE(A)
    A′ ← A;
    Initialize the value matrix P to zero;
    for each node i do
        for each node j do
            Calculate the value of each element of P using formula (1);
        end for
    end for
    for each edge to be deleted in the current batch do
        Randomly choose a position (i, j) based on the value matrix P;
        Delete the chosen edge by setting the elements of A′ at (i, j) and (j, i) to 0;
    end for
    return A′;
end function
Output: the adjacency matrix after the perturbation has been applied
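The batch-wise procedure can be sketched in Python as follows. This is a sketch under stated assumptions, not the original implementation: the degree-product weight in `edge_weights` is a hypothetical stand-in for formula (1), the number of deletions is taken as the deletion proportion times the number of edges (rounded), and the recalculation interval `recalc_every` is expressed as a number of edges.

```python
import random
from collections import Counter

def edge_weights(edges, alpha):
    """Unnormalized deletion weights; the degree-product form is an assumption."""
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return [(degree[i] * degree[j]) ** alpha for i, j in edges]

def perturb_by_deletion(edges, alpha, f, recalc_every, seed=None):
    """Delete round(f * M) edges, refreshing weights every `recalc_every` deletions."""
    if recalc_every < 1:
        raise ValueError("recalc_every must be at least 1")
    rng = random.Random(seed)
    remaining = list(edges)
    to_delete = int(round(f * len(remaining)))
    while to_delete > 0:
        # Weights are computed once per batch from the current degrees.
        weights = edge_weights(remaining, alpha)
        batch = min(recalc_every, to_delete)
        for _ in range(batch):
            # Weighted draw of one edge index; random.choices normalizes internally.
            idx = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
            # Remove the edge together with its (now stale) weight.
            remaining.pop(idx)
            weights.pop(idx)
        to_delete -= batch
    return remaining

# Usage: delete 30% of the edges, recalculating weights after every 2 deletions.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2),
         (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
perturbed = perturb_by_deletion(edges, alpha=0.5, f=0.3, recalc_every=2, seed=1)
```

Setting `recalc_every` to 1 corresponds to the ideal but expensive case of recalculating after every deletion; larger values trade accuracy of the weights for speed.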

3. Experiment

3.1. Experimental Setup

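We have experimented on four real networks, whose statistics are shown in Table 1. Four link prediction algorithms are used in the experiments: RA [29], AA [30], CN [31], and PA [32]. For a given node $x$, let $\Gamma(x)$ denote the set of its adjacent nodes and $k_x = |\Gamma(x)|$ its degree. The score $s_{xy}$ that each of the four algorithms assigns to a node pair $(x, y)$ can be expressed as follows:

Resource allocation (RA) is

$$s_{xy}^{\mathrm{RA}} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}.$$

Adamic-Adar (AA) index is

$$s_{xy}^{\mathrm{AA}} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z}.$$

Common neighbor (CN) is

$$s_{xy}^{\mathrm{CN}} = \left| \Gamma(x) \cap \Gamma(y) \right|.$$

Preferential attachment (PA) is

$$s_{xy}^{\mathrm{PA}} = k_x k_y.$$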
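The four indices can be computed from neighbor sets as in the following Python sketch, which uses their standard definitions. The function names and the dictionary-of-sets representation are assumptions made for this example.

```python
import math

def build_neighbours(edges):
    """Map each node to the set of its adjacent nodes."""
    neighbours = {}
    for x, y in edges:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    return neighbours

def cn(nb, x, y):
    """Common neighbor: number of shared neighbors."""
    return len(nb[x] & nb[y])

def aa(nb, x, y):
    """Adamic-Adar: shared neighbors weighted by 1 / log(degree).
    A shared neighbor has degree >= 2, so the logarithm is never zero."""
    return sum(1.0 / math.log(len(nb[z])) for z in nb[x] & nb[y])

def ra(nb, x, y):
    """Resource allocation: shared neighbors weighted by 1 / degree."""
    return sum(1.0 / len(nb[z]) for z in nb[x] & nb[y])

def pa(nb, x, y):
    """Preferential attachment: product of the two degrees."""
    return len(nb[x]) * len(nb[y])

# Usage: nodes 1 and 4 share the neighbors 2 (degree 2) and 3 (degree 3).
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]
nb = build_neighbours(edges)
scores = {"CN": cn(nb, 1, 4), "AA": aa(nb, 1, 4),
          "RA": ra(nb, 1, 4), "PA": pa(nb, 1, 4)}
```

RA and AA both discount high-degree shared neighbors, RA more strongly than AA, whereas PA ignores shared neighbors altogether.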

We choose precision as the measure of link prediction performance. For a given group of edges that have not been observed, precision is defined as the ratio of successfully predicted edges to the top $L$ predicted edges. Suppose that $L_r$ is the number of successfully predicted edges among the top $L$ predicted edges; then, precision can be expressed as follows:

$$\mathrm{Precision} = \frac{L_r}{L}.$$
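A minimal Python sketch of this measure is given below. The function name `precision` and the convention that an empty prediction list yields 0 are assumptions made for this example.

```python
def precision(predicted, test_edges):
    """Fraction of the predicted pairs that appear in the test set."""
    if not predicted:
        return 0.0
    # Compare as frozensets so that edge direction does not matter.
    test = {frozenset(e) for e in test_edges}
    hits = sum(1 for pair in predicted if frozenset(pair) in test)
    return hits / len(predicted)

# Usage: 2 of the 4 predicted pairs are in the test set.
predicted = [(1, 4), (2, 3), (5, 1), (2, 5)]
test_edges = [(4, 1), (2, 5), (3, 5)]
score = precision(predicted, test_edges)
```

A successful perturbation shows up as a lower value of this score for predictions made on the perturbed training network.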

For a given network, the edges are divided into a train set and a test set at a fixed ratio. To test the performance of prediction, perturbation is applied to the train set under different values of the preference parameter $\alpha$ and the deletion proportion $f$. Predictions are then made with the four link prediction algorithms, and the results are compared with the test set. Every experiment is repeated 100 times.

3.2. Experiment Result

First, the deletion proportion $f$ is set to 0.01, 0.1, 0.2, 0.3, and 0.4 in turn, with the interval parameter held fixed. The precision of the four algorithms on the four data sets is measured as the preference parameter $\alpha$ ranges from −20 to 20 in steps of 1. As shown in Figure 2, precision reaches its minimum when $\alpha$ is slightly larger than 0, which corresponds to the most effective network perturbation. The results shown in cases (a), (e), and (i) of Figure 3 suggest that edges between large-degree and small-degree nodes and edges between medium-degree nodes have the largest influence on the performance of link prediction.

Then, the preference parameter $\alpha$ is set to −20, −10, 0, 10, and 20 in turn, with the interval parameter held fixed. The precision of the four algorithms on the four data sets is measured as the deletion proportion $f$ ranges from 0.01 to 0.4 in steps of 0.01. As shown in Figure 4, in most cases precision decreases as $f$ increases for all four algorithms. The reason is that the larger $f$ is, the larger the proportion of deleted edges and the fewer edges remain in the network, which reduces the prediction quality of the different link prediction algorithms. However, there are cases in the metabolic and neural networks where precision increases as $f$ increases, which could result from the higher heterogeneity of node degree in these networks.

To better demonstrate the influence of network structural preference perturbation on link prediction, we have also measured the precision of the four algorithms on the four data sets as the preference parameter $\alpha$ ranges from −20 to 20 in steps of 1 and the deletion proportion $f$ ranges from 0.01 to 0.4 in steps of 0.01, with the interval parameter held fixed. The experimental results are shown in Figure 5, where the curves in the horizontal planes are isolines.

4. Conclusion

In this article, the influence of network structural preference perturbation through deletion on link prediction is analyzed. Using a criterion based on the degrees of the two nodes that an edge connects, we first calculate a perturbation value for every edge in a given network. Then, we apply perturbation through deletion to edges selected according to their perturbation values. This procedure is repeated until a certain proportion of edges has been perturbed. After that, we perform link prediction on the networks before and after perturbation, using four methods, namely RA, AA, CN, and PA, and compare how different types of connection and different deletion ratios influence the performance of link prediction.

Extensive experiments on several real networks indicate that the edges between large-degree and small-degree nodes and those between medium-degree nodes have the most significant influence on the performance of link prediction. By deleting specific links in the network, we can counter the threat that link prediction poses to privacy. These strategies can protect privacy in social networks and are also worth applying in other fields. For example, in the design of computer communication topologies, minimizing the connections between large-degree and small-degree nodes and between medium-degree nodes can resist topology estimation and thus better protect our own network; in the field of counter-terrorism, more attention should be paid to the connections between leader nodes and leaf nodes, which often indicate the vulnerability of a terrorist group's communication links.

Data Availability

The data can be obtained upon request to the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant no. 61903266), China Postdoctoral Science Foundation (Grant no. 2018M631073), China Postdoctoral Science Special Foundation (Grant no. 2019T120829), Fundamental Research Funds for the Central Universities, and Sichuan Science and Technology Program (No. 20YYJC4001).