Abstract

By mining the data published on social networks, one can discover hidden information of value, including private information about individuals and organizations. Protecting the privacy of individuals and organizations on social networks has therefore attracted growing research attention. Based on the practical privacy protection needs of edge-sensitive and vertex-sensitive attributes, we propose a new personalized -anonymity privacy-preserving technique that reduces the distortion introduced when social network data is anonymized. Experimental results for the personalized -anonymity algorithm show that -neighborhood attacks on the graph, background knowledge attacks, and homogeneity attacks can be prevented effectively by anonymizing vertexes and edges and by using an influence matrix based on background knowledge. Diversity of the vertex-sensitive attribute can be achieved, and personalized privacy protection requirements can be met by using such parameter as .

1. Introduction

Publishing social network data is very important for scientific research, commercial applications, government, and so on, but social network data includes private information and sensitive relations that can be leaked if the data is published directly. How to protect individual privacy while keeping the published data or graph useful has become a very important problem in social network data publishing. One of the most important principles is that individuals can decide whether their own private information is published; that is to say, individuals have different privacy protection needs.

Anonymity techniques for data publishing have long been used for relational data and have made great progress in the relational database area, including k-anonymity, l-diversity, generalization, and so forth [1, 2]. Can the anonymization techniques that apply to relational data also be applied to social networks? Social network data contains more information than relational data because network data contains vertexes (nodes), edges, relationships between nodes, and various metric features of the graph. Some researchers have sought to use these techniques for social network data publishing [3]. The structure and evaluation of social network methods were proposed in paper [4], and the categories of attack on social networks can be found in paper [5]. Graph modification [6], graph partitioning [7], graph isomorphism [8], clustering [9], attribute generalization [10], and so on have been applied to social network data publishing, and more and more anonymization techniques for social networks have since appeared in academic papers.

In practice, a graph structure, rather than the two-dimensional representation used in relational databases, is needed to represent the relations among network vertexes [11]. The degree of a vertex reflects its relationships with other vertexes, and a high degree means that the vertex is more closely connected to the others. In a large social network, only a small fraction of vertexes have high degrees, while most vertexes have low degrees. As a result, this small fraction of high-degree vertexes causes considerable data loss and computation cost when a unified anonymity method and the same privacy protection level are applied to all vertexes [12].

Personalized privacy protection for data tables was first proposed by Xiao and Tao in 2006 [13]. They used an individual guarding node to set the protection level of each person's own sensitive attribute; rather than setting the same anonymity level for all individuals, they anonymized each record according to its guarding node.

Since then, more and more researchers have paid attention to personalized anonymity in data publishing and have made modest progress. In research on social network privacy protection, however, because social network data is more complex than a traditional data table, most work has used a unified anonymity method and the same privacy protection level. For example, users can create basic information, Web albums, Web logs, lists of friends, and so on. On platforms such as Facebook, Twitter, WeChat, and VooV Meeting, users can decide whether this information can be accessed and viewed by others according to their own privacy level, which achieves the purpose of preserving privacy to some extent.

The data in a social network is more complex than a two-dimensional data table in a relational database. Privacy protection in social networks can be summarized as vertex protection, edge protection, and sensitive attribute protection. Vertex protection prevents an attacker from identifying a vertex in an anonymized published graph with high probability. Edge protection prevents an attacker from identifying an edge in an anonymized published graph with high probability. Attribute protection prevents an attacker from learning the sensitive attributes of vertexes or edges with high probability. The anonymity methods and techniques designed for traditional two-dimensional data tables cannot be applied to social networks directly, and users of real social networks such as Facebook, Twitter, and WeChat have personalized privacy protection requirements (vertex protection, edge protection, and sensitive attribute protection). Applying personalized privacy protection methods to social network data publishing therefore has high research value [14].

2. Problem Definition

2.1. Related Concepts

Definition 1. k-Anonymity. Let T be a table and QI be the quasi-identifier in T. T is said to satisfy k-anonymity if and only if each sequence of values in T[QI] occurs at least k times in T[QI] [15].

Table 1 is said to satisfy -anonymity, includes nation, birthday, gender, ZIP, the sensitive attribute is disease, . As can be seen from Table 1, , , .
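The k-anonymity condition of Definition 1 can be checked mechanically by counting how often each quasi-identifier combination occurs. The following is a minimal sketch, not the authors' implementation; the function name `satisfies_k_anonymity` and the sample records are hypothetical and are not taken from Table 1.

```python
from collections import Counter

def satisfies_k_anonymity(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier value combination occurs at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical, already generalized records (illustration only).
rows = [
    {"nation": "*", "birthday": "1980-1989", "gender": "F", "zip": "150**", "disease": "cold"},
    {"nation": "*", "birthday": "1980-1989", "gender": "F", "zip": "150**", "disease": "obesity"},
    {"nation": "*", "birthday": "1990-1999", "gender": "M", "zip": "318**", "disease": "diabetes"},
    {"nation": "*", "birthday": "1990-1999", "gender": "M", "zip": "318**", "disease": "cold"},
]
QI = ["nation", "birthday", "gender", "zip"]

is_2_anonymous = satisfies_k_anonymity(rows, QI, 2)  # each combination occurs twice
is_3_anonymous = satisfies_k_anonymity(rows, QI, 3)  # no combination occurs three times
```

The sensitive attribute (disease) is deliberately excluded from the grouping key, since k-anonymity constrains only the quasi-identifier.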

Definition 2. k-Degree anonymity. A social network graph is said to satisfy k-degree anonymity if every vertex (node) has at least k − 1 other vertexes with the same degree in the social network graph. The variable represents vertex amounts, and represents edge amounts between vertexes [16, 17].

k-Degree anonymity can prevent inference attacks by an adversary with background knowledge about vertex degrees. In Figure 1, the degree collection of the primal social network graph (a) is , so the anonymized social network graph (b) satisfies 2-degree anonymity.
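As an illustration of Definition 2, the sketch below checks k-degree anonymity from an edge list by counting how many vertexes share each degree. It is an assumption-laden example: the helper names and the two small graphs are hypothetical, and vertexes that appear in no edge are ignored.

```python
from collections import Counter

def degree_sequence(edges):
    """Map each vertex that appears in an undirected edge list to its degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)

def satisfies_k_degree_anonymity(edges, k):
    """True if every degree value is shared by at least k vertexes."""
    degree_frequency = Counter(degree_sequence(edges).values())
    return all(n >= k for n in degree_frequency.values())

# Hypothetical graphs (not the graphs of Figure 1).
cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]  # every vertex has degree 2
path = [(1, 2), (2, 3)]                   # degrees 1, 2, 1

cycle_is_4_degree_anonymous = satisfies_k_degree_anonymity(cycle, 4)
path_is_2_degree_anonymous = satisfies_k_degree_anonymity(path, 2)
```

In the path, only one vertex has degree 2, so an adversary who knows that degree can single it out; the check therefore fails for k = 2.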

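Definition 3. Graph isomorphism. For graphs G1 = (V1, E1) and G2 = (V2, E2) where |V1| = |V2| and |E1| = |E2|, if there is a bijection f between V1 and V2 such that (u, v) ∈ E1 if and only if (f(u), f(v)) ∈ E2, then G1 and G2 are isomorphic, represented as G1 ≅ G2. Here |V| represents the number of vertexes (nodes), and |E| represents the number of edges between vertexes.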

For example, when we delete the node information of (a) and (b) in Figure 1, (a) and (b) are isomorphic [18].
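For very small graphs, the bijection required by graph isomorphism can be searched for exhaustively. The following sketch tries every vertex mapping; it assumes simple undirected graphs without self-loops, takes factorial time, and is meant only to make the definition concrete. The function name and the example graphs are hypothetical.

```python
from itertools import permutations

def are_isomorphic(vertices_a, edges_a, vertices_b, edges_b):
    """Brute-force isomorphism test for small simple undirected graphs."""
    edge_set_a = {frozenset(e) for e in edges_a}
    edge_set_b = {frozenset(e) for e in edges_b}
    if len(vertices_a) != len(vertices_b) or len(edge_set_a) != len(edge_set_b):
        return False
    ordered_a = list(vertices_a)
    for candidate in permutations(vertices_b):
        mapping = dict(zip(ordered_a, candidate))
        # Edge counts are equal and the mapping is a bijection, so mapping
        # every edge of A onto an edge of B implies the edge sets coincide.
        if all(frozenset((mapping[u], mapping[v])) in edge_set_b for u, v in edges_a):
            return True
    return False

# Hypothetical graphs: a path, the same path relabeled, and a star.
path_vertices, path_edges = ["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")]
relabeled_vertices, relabeled_edges = [1, 2, 3, 4], [(2, 4), (4, 1), (1, 3)]
star_vertices, star_edges = [1, 2, 3, 4], [(1, 2), (1, 3), (1, 4)]

path_matches_relabeled = are_isomorphic(path_vertices, path_edges, relabeled_vertices, relabeled_edges)
path_matches_star = are_isomorphic(path_vertices, path_edges, star_vertices, star_edges)
```

The star has the same number of vertexes and edges as the path but a different structure, so no bijection preserves adjacency.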

Definition 4. k-Isomorphism. A graph G with subgraphs g1, g2, …, gk is k-isomorphic if it satisfies: (1) G = g1 ∪ g2 ∪ … ∪ gk; (2) gi ∩ gj = ∅ for i ≠ j; (3) gi and gj are isomorphic for i ≠ j.

Definition 5. k-Isomorphism vertex group. Given a k-isomorphism publishing graph , for a vertex in one subgraph there exist k − 1 vertexes in the other subgraphs that are isomorphic to it; the vertex set consisting of these k vertexes is a k-isomorphism vertex group, denoted as VCS. Each VCS includes k vertexes, and there are |V|/k such groups in the k-isomorphism graph.

Definition 6. k-Isomorphism edge group. Given a k-isomorphism publishing graph , for an edge in one subgraph there exist k − 1 edges in the other subgraphs that are isomorphic to it; the edge set consisting of these k edges is a k-isomorphism edge group, denoted as ECS. Each ECS includes k edges, and there are |E|/k such groups in the k-isomorphism graph.

Definition 7. Social network graph. Given a social network graph , the vertex set denotes the social individuals, and the edge set denotes the relationships among the social individuals. Each vertex and edge has its identity and attributes, which include:
(a) Identifier attribute (ID) of vertex
(b) Quasi-identifier attribute (QI) of vertex
(c) Sensitive attribute (SA) of vertex
(d) Quasi-identifier attribute (QI) of edge
(e) Sensitive attribute (SA) of edge
(f) Other attributes (OA)

Attribute (QI) of edge denotes by vector pair (), the total number of vertexes , denotes QI of numeric attribute, denotes QI of character attribute, , , , and denote the amount of QI, respectively. For example, Figure 1 shows a friendship social network in which each vertex is a customer and each edge denotes the relationship between two vertexes. Table 2 gives the primal data of each vertex in Figure 1(a). Table 3 is the edge table: Eid denotes the sequence number of an edge, Vid1 and Vid2 denote the sequence numbers of the vertexes in Figure 1(b), and the weighted relationship denotes the relationship between Vid1 and Vid2 in Figure 1(c). Table 4 is another relational data table of the vertexes in Figure 1(a).

2.2. Sensitive Degree of Friend Relationship (SA) of Vertex and Edge
2.2.1. Sensitive Degree of Friend Relationship (SA) of Vertex

We use the influence matrix to represent the level of influence of vertex-sensitive attributes [19, 20]. We can use the influence matrix to meet the requirements of personalized privacy protection of users.

: the influence degree of NO. sensitive attribute generated by NO. vertex.

: the weightiness of sensitive attribute value of NO. vertex.

Influence matrix is with rows, columns, represents vertex amount, represents QI attribute amount, so it can be described as

The , values come from experts or from experience. For example, the weight of QI in Table 4 can be divided into 5 grades, 1, 0.8, 0.4, 0.1, and 0, and the weight of in Table 4 can also be divided into 5 grades, 0.10, 0.60, 0.70, 0.80, and 0.90. The common cold is a general disease, so its disease weight can be set to 0.1. The common cold (influenza) may break out regionally, so we define the weight of ZIP as 0.8. The common cold may also be slightly related to gender, so we define the weight of gender as 0.1. Then, we define the disease weights of obesity, short breath, hypertension, diabetes, pneumonia, cancer, and AIDS as 0.12, 0.31, 0.5, 0.6, 0.7, 0.91, and 0.92, respectively. According to Table 4, the influence matrix is as follows.

2.2.2. Sensitive Degree of Friend Relationship (SA) of Edge

We describe the relationships of simple friend, good friend, and sweetheart (boyfriend or girlfriend) among the vertexes in the friend relationship graph of Figure 1. Graph () of Figure 1 is an example of a friend relationship graph: “1” represents a simple friend relationship between two vertexes, “2” represents a good friend relationship, “3” represents a sweetheart relationship, and “0” represents no relationship. Some relationships are more sensitive than others; for instance, people in a same-sex sweetheart relationship often do not want others to know about it. Different people therefore attach different degrees of sensitivity to their friend relationships, and personalized privacy protection must be provided according to the practical application.

2.3. -Anonymity Graph

In order to make it impossible for an attacker to infer the real relationship between targeted individuals and corresponding vertexes with a probability, -anonymity concept in data tables and the new concept of -anonymity are introduced.

Definition 8. -Anonymity of the vertex. Undirected graph the graph is as its anonymous publishing graph, if a vertex , there are at least vertexes in , which makes and , wherein, , thus, the vertex is -anonymity, and the vertex is -anonymity according to , is the weight of relationships (edge weight) of - neighborhood of vertex .
For example, in Figure 1, of vertex (sage), and of vertex (Maci), so vertex and vertex satisfy -anonymity.

Definition 9. -Anonymity of the graph. Undirected graph , the graph is as its anonymous publishing graph. If any vertex is -anonymity, thus, the graph is -anonymity, if any vertex is -anonymity, thus, the graph is -anonymity.

Definition 10. Individual information leakage. Suppose graph is the anonymized publishing graph of social network graph . When the relative sensitive coefficients and satisfy one of the following four conditions, individual information leakage exists. Otherwise, if the graph can ensure that none of the following circumstances will happen, the anonymized publishing graph is regarded as secure. If the graph can ensure that circumstances (1) and (2) will not happen, then the anonymized publishing graph is -secure [21].
(1) Vertex Leakage. The probability of ascertaining the corresponding relationship between the vertex in the graph and the target individual A in the primal graph is greater than
(2) Edge Leakage. The probability of ascertaining the corresponding relationship between the edge in the graph and the edge in the primal graph is greater than
(3) Leakage of Vertex Sensitive Information. The probability of ascertaining the sensitive information of target individual A in the primal graph is greater than
(4) Leakage of Edge Sensitive Information. The probability of ascertaining the sensitive information of the edge in the primal graph is greater than

2.4. Personalized -Anonymity
2.4.1. Personalized -Anonymity Model

Personalized -anonymity satisfies the following conditions:
(1) Personalized -anonymity satisfies -anonymity.
(2) in matrix , all vertexes in the k-isomorphism vertex group are to be published directly. Otherwise, condition (3) and condition (4) must be satisfied.
(3) are column vectors of influence matrix , , is the number of different sensitive attribute values.
(4) If in influence matrix , when is generated, under the precondition of anonymity, promote the generalization hierarchy or suppress directly [19]. denotes the sensitive degree between and in the influence matrix; if , it means that will influence ’s sensitivity.

Here, the threshold is the importance-degree parameter of the sensitive attribute in condition (2). If the sensitive attribute values of an equivalence class () are less than , that is to say, the sensitive attributes of the vertexes in this k-isomorphism vertex group cannot affect their privacy, all vertexes can be published directly. Otherwise, must satisfy condition (3) and condition (4). If , the number of different sensitive attribute values is greater than or equal to 2, which ensures sensitive attribute diversity.
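The threshold test described above can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the authors' model: it uses the disease weights listed in Section 2.2.1, treats the threshold and the diversity parameter l as plain function arguments with hypothetical example values, and omits the pairwise influence check of condition (4).

```python
# Disease weights as listed in Section 2.2.1.
DISEASE_WEIGHT = {
    "cold": 0.1, "obesity": 0.12, "short breath": 0.31, "hypertension": 0.5,
    "diabetes": 0.6, "pneumonia": 0.7, "cancer": 0.91, "AIDS": 0.92,
}

def group_action(diseases, threshold, l):
    """Decide how to treat the sensitive values of one vertex group.

    Returns "publish" if every weight is below the threshold, "diverse" if the
    group already holds at least l distinct sensitive values, and "generalize"
    otherwise. Condition (4) of the model is not covered by this sketch.
    """
    if all(DISEASE_WEIGHT[d] < threshold for d in diseases):
        return "publish"
    if len(set(diseases)) >= l:
        return "diverse"
    return "generalize"

# Hypothetical groups, threshold 0.5 and l = 2 chosen only for illustration.
low_sensitivity = group_action(["cold", "obesity", "cold"], threshold=0.5, l=2)
already_diverse = group_action(["cancer", "cancer", "cold"], threshold=0.5, l=2)
homogeneous = group_action(["AIDS", "AIDS", "AIDS"], threshold=0.5, l=2)
```

The third group is the homogeneity case: every member shares one highly sensitive value, so publishing it unchanged would reveal that value for all three vertexes.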

2.4.2. Personalized -Anonymity Example

An example based on Figure 2 is given to explain the definition and the process of personalized -anonymity.

Figure 1(a) is a subgraph of the social relationship network, and the isomorphic subgraphs of are found. The 3-isomorphism subgraphs are shown in Figure 2.

In Figure 2, (a) is the initial subgraph from Figure 1, and (b) and (c) are the isomorphic graphs corresponding to (a). In graph , the number of vertexes is 27, and the number of edges is 39. Therefore, nine 3-isomorphism vertex groups and thirteen 3-isomorphism edge groups are created and listed in Tables 5 and 6.
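The group counts follow from dividing the number of vertexes and edges by k: 27/3 = 9 vertex groups and 39/3 = 13 edge groups. A minimal sketch of forming such groups is given below, assuming the isomorphism between the k subgraphs is already known and expressed as aligned lists; the function name and the numeric ids are hypothetical.

```python
def build_isomorphism_groups(aligned_blocks):
    """Zip k aligned element lists into groups of size k.

    aligned_blocks[b][i] is the element of subgraph b that corresponds, under
    the (assumed known) isomorphism, to aligned_blocks[0][i].
    """
    size = len(aligned_blocks[0])
    if any(len(block) != size for block in aligned_blocks):
        raise ValueError("isomorphic subgraphs must have the same number of elements")
    return [tuple(block[i] for block in aligned_blocks) for i in range(size)]

k = 3
# Hypothetical ids: 27 vertexes as 3 blocks of 9, 39 edges as 3 blocks of 13.
vertex_blocks = [list(range(b * 9 + 1, b * 9 + 10)) for b in range(k)]
edge_blocks = [list(range(b * 13 + 1, b * 13 + 14)) for b in range(k)]

vertex_groups = build_isomorphism_groups(vertex_blocks)  # 9 groups of 3 vertexes
edge_groups = build_isomorphism_groups(edge_blocks)      # 13 groups of 3 edges
```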

Now, the nine 3-isomorphism vertex groups are generalized over their quasi-identifier attributes according to parameter . The isomorphism vertex groups VCS are changed into equivalence-class vertex groups QI. Age, gender, and ZIP are the quasi-identifier attributes, and disease is the sensitive attribute. The inheritance hierarchy tree of ZIP is shown in Figure 3, and the inheritance hierarchy tree of disease is shown in Figure 4 [21].

The , , and attributes in the isomorphism groups VCS1 and VCS2 are listed in Table 7. After generalization, the quasi-identifier attribute values gen(VCS) are created and shown in Table 8 [15].

3. Personalized -Anonymity Algorithm

The basic principle of the algorithm is as follows: the k-isomorphism graph is obtained; the vertex groups VCS of the k-isomorphism graph are generalized over their quasi-identifier attributes and sensitive attributes; and the edge groups ECS are generalized over their quasi-identifier attributes and sensitive attributes. Generalization is not always executed during this process, especially when the type differences do not affect l-diversity [22]. The input parameter α indicates the generalization type: when its value is 0, static generalization is used, and when its value is greater than 0, generalization is based on graph isomorphism. The input parameter β indicates the sensitive degree between nodes. When , we obtain a -anonymity graph, and -neighborhood attacks and structure attacks on the graph can be prevented [23, 24]. When , the input parameter β is the generalization threshold [19, 22]; background knowledge attacks and homogeneity attacks can be prevented effectively by anonymizing the vertex data in the social network, and diversity of the sensitive attribute can be achieved. The following is the personalized -anonymity algorithm (); the personalized -anonymity algorithm () has been given in another paper published by the author [23].

Inputs:
  Initial anonymous graph G = (V, E);
  Sensitivity parameters: k' (k' ≥ 2); l (2 ≤ l ≤ k'); m (l ≤ m ≤ |V|);
  Node attributes table: AS = {viS, viN(1), …, viN(s), viC(1), …, viC(t)};
  Edge attributes table: AS = {vjS, vjN(1), …, vjN(s), vjC(1), …, vjC(t)};
  All classified attribute inheritance trees HC;
  Input parameters α = 0 and β
Outputs:
  Anonymous graph Gp = {g1, g2, …, ge};
  All VCS and ECS groups and their attribute information
Steps:
  obtain anonymous graph Gp and the groups VCS and ECS;
  read α to determine the generalization type;
  compute the group counts: NVCS = |NVP|/k, NECS = |NEP|/k;
  for i = 1 to NVCS do // QI attributes generalization of vertex groups
    for j = 1 to s do // numeric QI attributes generalization
      gen(VCSi)[Nj] = [min{v1N(j), …, vkN(j)}, max{v1N(j), …, vkN(j)}]
    end for
    for j = 1 to t do // categorical QI attributes generalization
      gen(VCSi)[Cj] = {v1C(j), …, vkC(j)}
    end for
    while do
      if (sensitive attributes are classified) then
        for j = 1 to k do
          vjC is replaced by its parent node in the classified inheritance tree of the sensitive attribute
          if then exit while loop
        end for
      else
        for j = 1 to k do
          the interval of vjN is changed to its neighborhood;
          if then exit while loop
        end for
      end if
    end while
  end for
  for i = 1 to NECS do // QI attributes generalization of edge groups
    for j = 1 to p do
      gen(ECSi)[Nj] = [min{e1N(j), …, ekN(j)}, max{e1N(j), …, ekN(j)}]
    end for
    for j = 1 to q do
      gen(ECSi)[Cj] = {e1C(j), …, ekC(j)}
    end for
    while do
      if (sensitive attributes are classified) then
        for j = 1 to k do
          ejC is replaced by its parent node in the classified inheritance tree of the sensitive attribute
          if then exit while loop
        end for
      else
        for j = 1 to k do
          the interval of ejN is changed to its neighborhood;
          if then exit while loop
        end for
      end if
    end while
  end for
  anonymous graph Gp is published; all the VCS nodes, ECS edges, and their attribute information are published
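The quasi-identifier generalization steps of the algorithm (numeric attributes replaced by the [min, max] interval of the group, categorical attributes replaced by the set of values in the group) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names, the sample records, and the one-level disease hierarchy are hypothetical, and the loop that decides how far to climb the sensitive attribute tree is not reproduced.

```python
def generalize_group(group, numeric_attrs, categorical_attrs):
    """Generalize the quasi-identifier attributes of one isomorphism group."""
    generalized = {}
    for attr in numeric_attrs:
        values = [member[attr] for member in group]
        generalized[attr] = (min(values), max(values))  # interval [min, max]
    for attr in categorical_attrs:
        generalized[attr] = sorted({member[attr] for member in group})  # value set
    return generalized

def generalize_one_level(value, parent):
    """Replace a value by its parent in a hierarchy tree, or keep it at the root."""
    return parent.get(value, value)

# Hypothetical vertex group with k = 3 members.
group = [
    {"age": 25, "gender": "F", "zip": "150001"},
    {"age": 31, "gender": "M", "zip": "150001"},
    {"age": 28, "gender": "F", "zip": "150080"},
]
gen = generalize_group(group, numeric_attrs=["age"], categorical_attrs=["gender", "zip"])

# Hypothetical fragment of a disease hierarchy (child -> parent).
disease_parent = {"pneumonia": "respiratory disease", "short breath": "respiratory disease"}
climbed = generalize_one_level("pneumonia", disease_parent)
```

Publishing `gen` in place of the three individual records makes the members of the group indistinguishable on their quasi-identifier, at the cost of the precision lost in the interval and value sets.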

4. Experiments and Results

The experiments were run on a PC with an Intel(R) Core(TM) i5-4590 CPU @ 3.30 GHz and 8 GB of memory, running Microsoft Windows 7. The programs were coded and compiled in the VS 2019 IDE.

The vertex (node) data set in these experiments comes from the Adult census data set of the UC Irvine Machine Learning Repository [25, 26]. There are two experimental examples, with 300 and 1000 vertexes, respectively. Six attributes of these vertexes are considered in the experiments: age, occupation, race, gender, zip, and disease. Among these attributes, age is numeric, and the others are categorical. Disease is the sensitive attribute. The edge set in these experiments is created randomly by the Pajek software, and the numbers of nodes are, respectively, 5000, 10000, 15000, 20000, and 25000.

Information loss was compared between the algorithm proposed in this paper and the algorithm in paper [15], using the information loss measure from paper [15]. The algorithm in this paper is named the ACIM (anonymous composite improved model) algorithm, and the algorithm in paper [15] is named the ACM (anonymous composite model) algorithm. In the personalized -anonymity algorithm (), we preserve the usability and originality of the data according to parameter . When is less than the given threshold, all vertex data are published directly, which reduces the degree of data distortion [19].

When , a number of nodes and edges must be added to the initial graphs. The more the structures differ, the more nodes and edges must be added, and the larger the information loss.

Figure 5 shows the percentage of nodes added to construct the isomorphic graphs relative to all nodes of the graph, and Figure 6 shows the corresponding situation for edges. With the , , and , the increasing speed of nodes and edges slows down. These are additional redundant data.

In Figures 7–9, the loss of information is shown with , , and . The information loss degrees are increasing with the increasing of nodes, and . The reason is that the candidate set becomes larger as the data scale increases, so finding a similar neighborhood becomes easier [19].

When , the information loss results of attributes generalization are compared when value is 5, 10, 15, 20, and 25. Figures 10 and 11 show the comparison results.

Figure 10 shows that when the parameter value increases, the demand for privacy protection becomes higher, which leads to an obvious increase in information loss. In addition, the information loss of ACIM is lower than that of ACM. The reason is that ACIM does not generalize every group of the same type but first applies a threshold judgment. In Figure 11, the number of nodes is larger, and the generalization information loss is lower, which keeps the node information usable for users.

Figures 12 and 13 compare the generalization information loss for different parameter values. With a higher value, fewer of the vertex attributes need to be changed, so the information loss rate is lower. That is to say, parameter can keep the vertex information usable and meet personalized needs.

5. Conclusion

The authors study -anonymity technologies and introduce the application of -anonymity to relational databases and social networks. We propose a personalized -anonymity model for social networks and carry out a large number of experiments on the personalized -anonymity algorithm. Experimental results show that -neighborhood attacks on the graph, background knowledge attacks, and homogeneity attacks can be prevented effectively by anonymizing vertexes and edges and by using an influence matrix based on background knowledge. Diversity of the vertex-sensitive attribute can be achieved, and personalized privacy protection requirements can be met by using such parameter as .

Data Availability

Previously reported data were used to support this study and are available at Adult Data Set of the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/Adult.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank the anonymous reviewers for their helpful comments and suggestions. This work was supported in part by the Natural Science Foundation of Taizhou University (Grant no. TZXY2019QDJJ008) and the Natural Science Foundation of Heilongjiang Province of China (Grant no. JJ2019LH0048).