Research Article | Open Access
Privacy Protection Method for Multiple Sensitive Attributes Based on Strong Rule
At present, most studies on data publishing consider only a single sensitive attribute, and work on multiple sensitive attributes is still scarce. Moreover, almost all existing studies on multiple sensitive attributes ignore the inherent relationship between sensitive attributes, so an adversary can use background knowledge about this relationship to attack users' privacy. This paper presents an attack model based on association rules between sensitive attributes and, accordingly, a data publishing method for multiple sensitive attributes. Proof and analysis show that the new model prevents an adversary from using background knowledge about association rules to attack privacy while still releasing high-quality information. Finally, experiments verify these conclusions.
1. Introduction
Data publishing is widely used in information sharing and scientific research, and the core problem is how to preserve the availability of the data while protecting users' privacy. A data table usually contains three types of attribute: identifiers, which identify an individual uniquely, for example, the social security number (SSN); quasi-identifiers (QI), which cannot identify an individual uniquely but provide information about the individual, for example, the country and age attributes; and sensitive attributes (), which usually relate to users' privacy, for example, the disease attribute. The sensitive attributes need to be protected in the published table. A series of data publishing methods [1–11] have been presented to prevent an adversary from linking quasi-identifying attributes with publicly available datasets to reveal personal identity. Papers [1, 2] present the k-anonymity method, which partitions the table into equivalence groups (EG). Each equivalence group consists of at least k different records, and k-anonymity generalizes the quasi-identifier attributes of the records in the same equivalence group. However, k-anonymity faces the risk of sensitive attribute disclosure because of a lack of diversity. To solve this problem,  proposes l-diversity, which not only satisfies k-anonymity but also requires at least l different sensitive attribute values in each equivalence group. In addition, the privacy protection methods in [5–11] improve k-anonymity from different angles. Most of them, however, consider only a single sensitive attribute, so some privacy protection methods for multiple sensitive attributes have been presented. Papers [12, 13] attempt to use l-diversity directly for multiple sensitive attributes, which results in a lot of information loss. Paper  protects users' privacy by disturbing the order of sensitive attribute values within the same equivalence group.
However, this method needs to add fake sensitive attribute values to the EG, and it breaks the relationship between sensitive attributes, so useful relationships cannot be provided. The publication method in paper  can prevent an adversary from using nonmembership knowledge to attack the data table, but its strict grouping condition results in excessive information loss. According to the theory of paper , the publication methods of [12, 13, 15] cannot ensure good diversity and are vulnerable to background-join attacks, so paper  divides the raw data table into several projected tables, puts the sensitive attributes that have strong dependency into the same projected table, and finally makes each projected table satisfy -closeness. However, this method ignores association rules with high confidence, and an adversary can use knowledge of these rules to obtain users' private information. To avoid the suppression of records, paper  presents a new publication method that generalizes each sensitive attribute separately. Like other privacy protection methods for multiple sensitive attributes, however, paper  ignores the inherent relationship between sensitive attributes, so an adversary can use related background knowledge to attack privacy, and it is difficult for users to find valuable relationships in its released tables. To resolve this problem, this paper introduces association rules into the design of the privacy protection method and presents an improved data publishing model for multiple sensitive attributes based on the work of .
2. The Main Work of This Paper
Most existing research on privacy-preserving technology for multiple sensitive attributes does not take the inherent relationship between sensitive attributes into account, so an adversary can sometimes use related background knowledge to attack users' privacy, and some valuable relationships cannot be recovered from the released tables. To address this, we introduce association rules into the research on data publishing. The main contributions of this paper are as follows.
(1) It analyses the Rating data publishing model of paper , points out its weakness, and presents an attack method based on strong rules (Section 3).
(2) It takes the relationship between different sensitive attributes into account, presents a mixed data publishing model based on Rating, improves the Rating algorithm to make it more efficient, and analyses and proves the correctness of the algorithm and the security of the mixed data publishing model (Section 4).
(3) It proves in theory that the new model releases information of better quality than Rating (Section 5).
(4) Through experiments, it verifies that the new data publishing model provides better privacy and preserves valuable relationships between sensitive attributes in the released tables (Section 6).
3. The Analysis of Rating
This section will introduce Rating  model briefly and present an attack method which can use strong rules to get users’ privacy from Rating.
3.1. Description of Symbol
Table contains records, record = , are quasi-identifiers of , and are sensitive attributes of . represents the value of in quasi-attribute . Similarly, represents the value of in sensitive attribute .
Definition 1 (generalization). Assume is a sensitive attribute of table , , is a subset of in table , and generalization means using abstract value to replace specific value . For example, in Table 1, , we can use to replace in Table 3.
Definition 2 (-diversity). is the parameter input by users. After generalization, for each sensitive , if satisfies , record satisfies -diversity. Here represents the number of in , and represents the number of values in set . If all records of model satisfy -diversity, the model satisfies -diversity.
3.2. Review of Rating
Rating generalizes each sensitive attribute, respectively, and can improve the quality of released information. Assume Table 1 is the original data table, are sensitive attributes, and age and country are quasi-identifiers. , Rating generates ID table (IDT) first, then uses the values of ID table to generalize the original data table, gets attribute table (AT), and at last releases IDT and AT. Tables 2 and 3 are IDT and AT, respectively, and both of them are released tables of Table 1.
is a subset of sensitive attribute , it satisfies that , refers to the set of all the values in original table. After getting the IDT, we use the values of ID table to generalize original data table. If and , use to replace in original data table. For example, in Table 1, because , and use to replace in original table, so in Table 3. AT displays the name of in , and users can use the name to get the corresponding value in IDT. For convenient description, AT displays the value directly in this paper.
3.3. The Weakness of Rating
Rating uses a generalization strategy that suits multiple sensitive attributes and improves the quality of the released information. However, Rating ignores the relationship between different sensitive attributes, so users' privacy may sometimes be disclosed.
Compared with data publishing for a single sensitive attribute, publishing multiple sensitive attributes not only increases the number of sensitive attributes but also calls for more effective methods to reduce information loss, and it lets an adversary exploit the relationship between sensitive attributes to attack users' privacy. Therefore, association rules should be taken into account when designing a data publishing model for multiple sensitive attributes. An attack method is presented below.
Assuming the original table can basically reflect the real world, if Bob knows his neighbor Alice is in Table 3 and Alice is 23 years old and Chinese, Bob is sure that is Alice according to the quasi-identifier. Through Bob’s common sense of life or previous investigations, he knows that if event happens to someone, the probability of occurrence of is usually not less than 75%; namely, . Then through the ID table Bob knows in this table that there are four people whose attribute values are , so in these four people, there are at least three whose attribute values must be . That is to say, there are at least three people whose values are while their values are .
So Bob begins to analyze attribute table, and he finds out that there are 8 records whose value may be , and these 8 records are , but in these records, only 3 records’ values may be , and these three records are , respectively. Then Bob can be sure that , . And is Alice, so the privacy of Alice is disclosed.
The above is an example of an attack with association rules. Because Rating does not take the correlation between sensitive attributes into account, an adversary who has the corresponding background knowledge may be able to disclose users' privacy. The next section therefore presents an improved model.
4. The Data Publishing for Multiple Sensitive Attributes Based on Strong Rules
In this section, we introduce association rules and present a new data publishing model that resists attacks based on strong rules between sensitive attributes. The records are divided into two categories, and each category is processed by a different data publishing model. The new data publishing model for multiple sensitive attributes is therefore a mixed model.
4.1. Data Publishing Method Based on Sensitive Attributes Clustering (SAC)
The original data table is divided into two tables, table SAC and table IR, which are processed separately. This part introduces the division of the table and the processing of table SAC. We first introduce some definitions and parameters.
Definition 3 (association rules). Assume that when event happens, the probability of occurrence of event is ; namely, ⇒ . is an association rule.
Definition 4 (support degree). It represents the number of occurrences of an event. For example, in Table 1, the number of is 4, so the support degree of is 4. Namely, . especially represents the simultaneity number of and . In Table 1, there are 3 records satisfying , , so .
Definition 5 (confidence). Assume when happens, the probability of occurrence of event is , so the confidence of is . For example, in Table 1, there are 4 records whose values are , and three of them have value . So when appears, the probability of occurrence of is 3/4. Namely, . It is not difficult to prove that .
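The support degree and confidence defined above can be computed by simple counting. The following is a minimal Python sketch under stated assumptions: the records, attribute names, and values are hypothetical toy data (not the paper's Table 1), and the helper names `support` and `confidence` are illustrative only.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Item = Tuple[str, str]  # (sensitive attribute name, value)

# Hypothetical toy records, chosen so that one value occurs 4 times
# and co-occurs 3 times with a value of another attribute.
RECORDS: List[Record] = [
    {"S1": "flu", "S2": "aspirin"},
    {"S1": "flu", "S2": "aspirin"},
    {"S1": "flu", "S2": "aspirin"},
    {"S1": "flu", "S2": "codeine"},
    {"S1": "ulcer", "S2": "codeine"},
]


def support(records: List[Record], *items: Item) -> int:
    """Number of records that contain every given (attribute, value) item."""
    return sum(all(r.get(attr) == val for attr, val in items) for r in records)


def confidence(records: List[Record], antecedent: Item, consequent: Item) -> float:
    """Fraction of records containing the antecedent that also contain the consequent."""
    base = support(records, antecedent)
    if base == 0:
        raise ValueError("antecedent does not occur in the table")
    return support(records, antecedent, consequent) / base


flu, aspirin = ("S1", "flu"), ("S2", "aspirin")
print(support(RECORDS, flu))              # 4
print(support(RECORDS, flu, aspirin))     # 3
print(confidence(RECORDS, flu, aspirin))  # 0.75
```

In this toy table the rule from `flu` to `aspirin` has confidence 3/4, which mirrors the ratio used in the definition's own example.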
Usually a minimum support degree threshold (min_support) and a minimum confidence threshold (min_confidence) are set; an association rule that satisfies both thresholds is considered meaningful. In this paper, an adversary may use a sensitive attribute value to attack users' privacy even if it appears only once, so min_support is set to 1, and min_confidence is set by users.
Definition 6 (strong rule). Assuming association rule [confidence = ], ≥ min_confidence, so we call strong rule.
Usually the strong rule’s confidence is relatively higher, and adversary may use the strong rule to attack users’ privacy if adversary has the related background knowledge. On the other hand, we hope to preserve information of strong rule in the released data, because it is valuable. So we need to put the records containing strong rules into table SAC and process table SAC with consideration for strong rule. Record containing strong rules means that, for record , if , is a strong rule, so record contains strong rule. And for a value , if there is a strong rule that satisfies or , we call strong value.
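To make the notions of strong rule and strong value concrete, here is a small Python sketch. It is not the Apriori algorithm used later in the paper: it is a brute-force count restricted to rules with one value on each side, with min_support fixed at 1 as stated above. The data, the function names, and the reading of "strong value" as any value that appears on either side of a strong rule are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Set, Tuple

Item = Tuple[str, str]   # (sensitive attribute name, value)
Rule = Tuple[Item, Item]  # (antecedent, consequent)


def find_strong_rules(records: List[Dict[str, str]], sensitive: List[str],
                      min_confidence: float) -> Dict[Rule, float]:
    """Return every single-value rule x => y whose confidence reaches the threshold."""
    single: Counter = Counter()
    pair: Counter = Counter()
    for r in records:
        items = [(a, r[a]) for a in sensitive]
        single.update(items)
        # Ordered pairs of values taken from two different attributes of one record.
        pair.update(permutations(items, 2))
    return {
        (x, y): n / single[x]
        for (x, y), n in pair.items()
        if n / single[x] >= min_confidence
    }


def strong_values(rules: Dict[Rule, float]) -> Set[Item]:
    """Values that occur as antecedent or consequent of some strong rule."""
    return {item for rule in rules for item in rule}


# Hypothetical toy records (not the paper's Table 1).
RECORDS = [
    {"S1": "flu", "S2": "aspirin"},
    {"S1": "flu", "S2": "aspirin"},
    {"S1": "flu", "S2": "aspirin"},
    {"S1": "flu", "S2": "codeine"},
    {"S1": "ulcer", "S2": "codeine"},
]
RULES = find_strong_rules(RECORDS, ["S1", "S2"], min_confidence=0.75)
STRONG = strong_values(RULES)
for (x, y), conf in sorted(RULES.items()):
    print(x, "=>", y, conf)
```

Because min_support is 1, even a value that occurs once (here `ulcer`) yields a rule with confidence 1, which is why such values must be treated as potentially exploitable.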
4.1.1. Partition Table
Assuming the association rules in original data table are close to the situation of the real world, first use classic association rule mining algorithm-Apriori  to find out all the strong rules in table, according to min_confidence set by user (line 1). Then find a sensitive attribute that has the largest number of strong values, put all other sensitive attributes into , and if a record contains strong value in , add it to table SAC (line 2). At last delete the records contained by SAC in original table, and get table IR (line 3), so the original table is divided into two tables: SAC and IR. Obviously, the records which all contain strong association rules are in SAC, and all the strong values in IR belong to the same sensitive attribute, so there are no probability tilts caused by strong association rules in IR (please see Section 4.2.2).
Algorithm 7 (partition table). Input: original data table , min_confidence
Output: Table SAC, Table IR
(1) According to min_confidence, find all strong rules with Apriori.
(2) Find the records which all contain strong values in , and put them into table SAC.
(3) Table IR = -Table SAC.
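The partition step can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the set of strong values is taken as a precomputed input (step 1 is not repeated here), ties for the attribute with the most strong values are broken by attribute order, and all names and data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Item = Tuple[str, str]  # (sensitive attribute name, value)
Record = Dict[str, object]


def partition_table(records: List[Record], sensitive: List[str],
                    strong: Set[Item]) -> Tuple[str, List[Record], List[Record]]:
    """Split records into table SAC and table IR.

    The attribute with the most strong values is kept aside; a record goes to
    SAC if it holds a strong value in any of the other sensitive attributes.
    """
    counts = {a: sum(1 for attr, _ in strong if attr == a) for a in sensitive}
    anchor = max(sensitive, key=lambda a: counts[a])  # first attribute wins ties
    others = [a for a in sensitive if a != anchor]
    sac: List[Record] = []
    ir: List[Record] = []
    for r in records:
        if any((a, r[a]) in strong for a in others):
            sac.append(r)
        else:
            ir.append(r)
    return anchor, sac, ir


# Hypothetical toy input.
SENSITIVE = ["S1", "S2", "S3"]
STRONG = {("S1", "flu"), ("S1", "ulcer"), ("S2", "aspirin")}
RECORDS = [
    {"id": 0, "S1": "flu", "S2": "aspirin", "S3": "x"},
    {"id": 1, "S1": "flu", "S2": "codeine", "S3": "y"},
    {"id": 2, "S1": "cold", "S2": "aspirin", "S3": "z"},
    {"id": 3, "S1": "ulcer", "S2": "codeine", "S3": "x"},
]
ANCHOR, SAC, IR = partition_table(RECORDS, SENSITIVE, STRONG)
print(ANCHOR, [r["id"] for r in SAC], [r["id"] for r in IR])
```

After the split, every strong value that remains in IR belongs to the single kept-aside attribute, which is the property the text relies on when it argues that IR has no probability tilt.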
4.1.2. Partition Sensitive Attributes
After partitioning the table, we process table SAC. To preserve the information of strong rules, we cluster the sensitive attributes. First, we need to define the distance between sensitive attributes.
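Definition 8 (distance between sensitive attributes). Given two sensitive attributes S_i and S_j, the distance between them is defined as distance(S_i, S_j) = 0 if there are strong rules between S_i and S_j, and distance(S_i, S_j) = 1 otherwise. Here, if a value v_i of S_i and a value v_j of S_j are such that v_i ⇒ v_j or v_j ⇒ v_i is a strong rule, we say there are strong rules between S_i and S_j; otherwise, there are no strong rules between S_i and S_j.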
Definition 9 (distance between sensitive attribute and cluster). Assuming is a cluster, is a sensitive attribute, and the distance between and can be defined as
In this method, put the similar sensitive attributes into a same cluster, as long as the set of sensitive attributes is not empty, and generate new cluster constantly (line 1). For each new empty cluster , pick a sensitive attribute from sensitive attribute set orderly, put into , and is the first attribute of cluster (line 2). Find out all such sensitive attributes , and satisfies that . Add to , and delete in (line 3 to line 4). Similarly, generate other clusters.
After Algorithm 10, we get the set of clusters: Cluster_Set = , , and is a subset of sensitive attributes set = .
Assuming , , record , and the value of is .
Algorithm 10 (partition sensitive attributes). Input: Table SAC
Output: set of sensitive attributes’ cluster (Cluster_Set)
is the set of sensitive attributes, = .
(1) While the is not empty, repeat (2) to (5).
(2) Generate cluster , add the first sensitive attribute to , - .
(3) For each , do (4).
(4) If distance() = 0, add to , -.
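The clustering loop above can be sketched in Python. Two points are assumptions rather than facts from the source: the pairwise distance is taken to be 0 when strong rules link the two attributes and 1 otherwise, and the distance between an attribute and a cluster is taken to be the maximum pairwise distance to the cluster's members, so an attribute joins only if it is linked to every member. The function and variable names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple


def cluster_attributes(sensitive: List[str],
                       linked_pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Greedy clustering of sensitive attributes.

    linked_pairs lists the attribute pairs that have strong rules between them.
    """
    linked: Set[FrozenSet[str]] = {frozenset(p) for p in linked_pairs}

    def distance(a: str, b: str) -> int:
        return 0 if frozenset((a, b)) in linked else 1

    def cluster_distance(a: str, cluster: List[str]) -> int:
        # Assumption: the attribute must be linked to every member of the cluster.
        return max(distance(a, member) for member in cluster)

    remaining = list(sensitive)
    clusters: List[List[str]] = []
    while remaining:
        cluster = [remaining.pop(0)]      # first attribute seeds the new cluster
        for attr in list(remaining):      # iterate over a snapshot while removing
            if cluster_distance(attr, cluster) == 0:
                cluster.append(attr)
                remaining.remove(attr)
        clusters.append(cluster)
    return clusters


print(cluster_attributes(["S1", "S2", "S3"], [("S1", "S2")]))
# [['S1', 'S2'], ['S3']]
```

With strong rules only between `S1` and `S2`, the third attribute ends up in a cluster of its own, which matches the two-cluster outcome described in the paper's worked example.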
4.1.3. Partition Records
This part will divide the Table SAC into several groups and anonymize the records in the same group.
Algorithm 11 (partition records). Input: Table SAC, Table IR, and Cluster_Set,
Output: released Table SAC~
(1) While Table SAC is not empty, repeat (2) to (7).
(2) Generate a new group .
(3) Choose a record from Table SAC orderly, , SAC-.
(4) While , repeat (5) to (6).
(5) If , this satisfies , and SAC-.
(6) Else choose record from table IR orderly, satisfies , , and IR-.
(8) Randomly permute the cluster values in each group.
In the algorithm of partition records, while table SAC is not empty, generate group constantly (lines 1-2). For each empty group , choose a record from table SAC as ’s first record (line 3). Choosing , or , does not have the same sensitive attributes values with , and add to , until the number of records in is not less than (line 4–6). Within each group, sensitive attributes values are permutated randomly in each cluster to break the linking between different clusters (line 8). That is to say, adjust the position of cluster values randomly. Finally, release .
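The grouping and permutation described above can be sketched in Python. This is a simplified illustration, not the authors' code: the parameter `l` stands for the minimum group size, a group is simply closed early when no compatible record is left (the source does not say what happens in that case), and all records and names are hypothetical.

```python
import random
from typing import Dict, List, Optional

Record = Dict[str, object]


def conflicts(record: Record, group: List[Record], sensitive: List[str]) -> bool:
    """True if the record shares any sensitive attribute value with the group."""
    return any(record[a] == member[a] for member in group for a in sensitive)


def take_compatible(pool: List[Record], group: List[Record],
                    sensitive: List[str]) -> Optional[Record]:
    """Remove and return the first record in pool that does not conflict."""
    for i, candidate in enumerate(pool):
        if not conflicts(candidate, group, sensitive):
            return pool.pop(i)
    return None


def partition_records(sac: List[Record], ir: List[Record],
                      sensitive: List[str], l: int) -> List[List[Record]]:
    sac, ir = list(sac), list(ir)  # work on copies
    groups: List[List[Record]] = []
    while sac:
        group = [sac.pop(0)]
        while len(group) < l:
            nxt = take_compatible(sac, group, sensitive)
            if nxt is None:
                nxt = take_compatible(ir, group, sensitive)  # fall back to table IR
            if nxt is None:
                break  # assumption: close the group if nothing fits
            group.append(nxt)
        groups.append(group)
    return groups


def permute_clusters(group: List[Record], clusters: List[List[str]],
                     rng: random.Random) -> List[Record]:
    """Shuffle each cluster's values across the group, keeping a cluster's values together."""
    released = [dict(r) for r in group]
    for cluster in clusters:
        order = list(range(len(group)))
        rng.shuffle(order)
        for dst, src in enumerate(order):
            for attr in cluster:
                released[dst][attr] = group[src][attr]
    return released


SENSITIVE = ["S1", "S2", "S3"]
CLUSTERS = [["S1", "S2"], ["S3"]]
SAC = [
    {"id": 1, "S1": "flu", "S2": "aspirin", "S3": "x"},
    {"id": 2, "S1": "flu", "S2": "aspirin", "S3": "y"},
    {"id": 3, "S1": "ulcer", "S2": "antacid", "S3": "x"},
]
IR = [
    {"id": 4, "S1": "cold", "S2": "codeine", "S3": "z"},
    {"id": 5, "S1": "asthma", "S2": "inhaler", "S3": "w"},
]
GROUPS = partition_records(SAC, IR, SENSITIVE, l=2)
RELEASED = [permute_clusters(g, CLUSTERS, random.Random(0)) for g in GROUPS]
print([[r["id"] for r in g] for g in GROUPS])  # [[1, 4], [2, 3]]
```

Values inside one cluster move together, so links within a cluster survive the permutation, while each cluster is shuffled independently, which breaks links across clusters.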
Now take Table 1 as an example, there are three steps on this process. Here, assume that , min_confidence = 0.75.
(1) Partition table: both and have four strong values, and has no values, so . We first find out that there are 4 records containing strong values in , and they are , and , respectively. These 4 records make up table SAC (Table 4) and meanwhile, delete these 4 records in Table 1.
(2) Partition sensitive attributes: generate a new cluster , add to , and now there remain two sensitive attributes in sensitive attributes set. Because there is strong rule between and and , add to . But both and are 1, , cannot be added to . And the only one attribute in sensitive attributes set makes up a cluster alone. So clustering is over, and we get two clusters , .
(3) Partition records: according to the grouping condition, it cannot have the same sensitive attribute values in a group, and make up a group, similarly, and and , and , and make up groups, respectively (Table 5). After grouping, randomly permutate the cluster values in the same groups and release the table SAC~ (Table 6). For example, in group 1, permutate value , randomly. Here, each group has two records, according to the random principle; after disturbing order, may swap position with (), or both () and () remain in the original positions. Similarly, for , permutate value, , , randomly. Although through anonymity, the relationship between and is still preserved. On the other hand, linking between and has been broken. So this method preserves the links between sensitive attributes in the same clusters and breaks the links between sensitive attributes from different clusters.
Lemma 12. Assuming adversary has the background knowledge about strong rules, for , , , adversary is sure that contains with the probability:Here, represents the set of strong rules in table .
Proof. If , there are no records containing strong rules in table IR, so . If , for each group in , has at least records; if there is record containing in , according to the nature of random permutation, each record in contains with equality probability , so we can know that even though adversary can be clear of how many records contain through related background knowledge, is not more than .
In the example of Section 3.3, for some records, adversary can make sure that they contain strong rules with probability 100%. table can prevent adversary from using strong rules to attack users’ privacy.
4.2. Improved Rating (IR)
The table IR will be processed by improved Rating. For each sensitive attribute , Rating  hashes in by their values (each bucket corresponds to each value), and if has different values (), it can get sequence = , and assuming there are in , so the corresponding bucketi contains in it. Every time Rating chooses the buckets that have the largest size, gets a value from every one of these buckets to make up a SID, uses SID to replace corresponding sensitive attribute values in original table, and gets attribute table, and the SID makes up ID table. Every time Rating generates an SID, it needs to reselect buckets, because after the last generation of SID, the sizes of buckets have been updated. Paper  has not presented the algorithm of choosing largest buckets, so this paper will present a heuristic algorithm of choosing buckets.
4.2.1. Heuristic Algorithm for Choosing Buckets
This heuristic algorithm (Algorithm 13) is actually a stable sorting algorithm, sort sequence in descending order, and choose the first buckets from sequence. This algorithm will be called after updating the size of buckets. Let us introduce some parameters, and sequence refers to the bucket in the sequence: sequence.value: the attribute value in sequence, sequence.size: the size of sequence, sequence.size~: after updating size, the size of sequence, if , sequence.size~ = sequence.size-1; if , sequence.size~ = sequence.size, sequence.position: the position of sequence, sequence.position = , sequence.position~: after sorting, the new position of sequence.
Algorithm 13 (heuristic choosing largest buckets). Input: sequence, ,
Output: sequence satisfies stable descending order,
(1) If sequence.size~ < sequence.size~, do (2) to (5).
(2) For ; ; , repeat (3) to (5)
(3) If sequence.size~ < sequence.size~, do (4) to (5).
(4) Find sequence and sequence which satisfy sequence.size~ > sequence.size~ ≥ sequence. size~
(5) remove = to the position between sequence and sequence; meanwhile keep the relative position among , end loop.
Lemma 14. After updating size of buckets, if sequence.size~ ≥ sequence.size~, the position of all buckets in sequence does not need to be adjusted.
Proof. Using proof by contradiction, only the sizes of have been changed, so these buckets may need to adjust their position in the sequence. If sequence () needs to be removed to the back of sequence (), then sequence.size~ > sequence. and sequence.size~ ≥ sequence. > sequence.size~. Because sequence.size ≥ sequence.size, and after generation, in both the sizes of sequence and sequence minus 1, there is sequence.size~ ≥ sequence.size~, we can get the contradictory conclusion:
Lemma 15. If sequence.size~ > sequence.size~ ≥ sequence, in order to preserve stable descending order, there are sequence.position~ < sequence. < sequence. and sequence. position~ = sequence.
Proof. Obviously, according to the nature of stable descending sorting, the new position of sequence is between sequence and sequence. Here we will discuss the new position of sequence. In addition to the situation in Lemma 15, the new position of sequence has two possibilities, using the contradiction to proof, respectively.(1)After adjusting, the position of sequence will be sequence.position~ and sequence.position~ < sequence.position~. According to the nature of descending order, before generating SID, sequence.size ≥ sequence.size and sequence.position < sequence.position, after generating SID, in both the sizes of sequence and sequence minus 1, one has sequence.size~ ≥ sequence.size~. So sequence.position~< sequence.position~ contradicts the nature of stable descending order.(2)After adjusting, the position of sequence will be sequence.position~, sequence.position~ > sequence.position~, and existing set , and each sequence satisfies that sequence.position~ < sequence.position~ < sequence.position~. If , according to the nature of stable descending order, sequence.size~ ≥ sequence.size~ ≥ sequence.size~, sequence.size ≥ sequence.size ≥ sequence.size, sequence.position < sequence.position < sequence.position, and we can get the contradictory conclusions: . In another situation, if , one must have sequence.size~ ≥ sequence.size~ > sequence.size~ and sequence.size = sequence.size ≥ sequence.size; because the sizes of sequence and sequence have not changed, there is sequence.size~ ≥ sequence.size~, so we can get the contradictory conclusions: sequence.size~ ≥ sequence.size~.
Corollary 16. If sequence.size~ < sequence.size~, the new position of sequence will be sequence.position~; for each sequence, its new position is as follows: sequence.position~ = sequence.position~ + .
Proof. According to Lemma 15, sequence.position~ = sequence.position~ + 1; for each sequence ), there exists sequence.position~ = sequence.position~ + 1, so for each sequence (), it satisfies
Corollary 17. After updating size of bucket, if sequence.size~ > sequence.size~ ≥ sequence.size~, here or , sequence.size~ ≥ sequence.size~; to make sequence satisfy the stable descending order, one only needs to remove = sequence, to the position between sequence and sequence.
Proof. After updating size, only sizes of have changed, so only the positions of may need to be adjusted. According to the nature of stable descending order sequence].position~ = sequence.position~ + 1; according to Corollary 16, sequence.position~ = sequence.position~ + (), (), and sequence.position~ = sequence.position~ + 1. So removing sequence, to the position between sequence and sequence can make sequence satisfy stable descending order.
Here one analyzes the efficiency of this sorting algorithm. In the best situation, sequence.size~ ≥ sequence.size~, it only needs to compare sequence.size~ with sequence.size~ in this algorithm, and the time complexity is . The worst situation is that sequence.size~ < sequence.size~, the algorithm needs to compare times, and the time complexity is . The efficiency of this sorting algorithm is much better than most of existing sorting algorithms.
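The idea behind the heuristic can be illustrated with a short Python sketch. It is a reading of the procedure under assumptions, not the authors' code: buckets are modelled as `(value, size)` pairs already in stable descending order, the hypothetical parameter `m` is the number of leading buckets that each gave up one value, and the block that has become too small is moved in one step instead of re-sorting the whole sequence.

```python
from typing import List, Tuple

Bucket = Tuple[str, int]  # (attribute value, bucket size)


def take_and_reorder(sequence: List[Bucket], m: int) -> List[Bucket]:
    """Decrement the first m bucket sizes and restore stable descending order."""
    if not 1 <= m <= len(sequence):
        raise ValueError("m must be between 1 and the number of buckets")
    head = [(value, size - 1) for value, size in sequence[:m]]
    tail = sequence[m:]
    # Best case: a single comparison shows that the order is still valid.
    if not tail or head[-1][1] >= tail[0][1]:
        return head + tail
    # Otherwise the trailing buckets of head that are now smaller than tail[0]
    # form one contiguous block, all with the same size.
    start = len(head)
    while start > 0 and head[start - 1][1] < tail[0][1]:
        start -= 1
    block = head[start:]
    # Move the block behind every untouched bucket that is strictly larger.
    j = 0
    while j < len(tail) and tail[j][1] > block[0][1]:
        j += 1
    return head[:start] + tail[:j] + block + tail[j:]


SEQ = [("a", 3), ("b", 3), ("c", 3), ("d", 2), ("e", 1)]
print(take_and_reorder(SEQ, 2))
# [('c', 3), ('a', 2), ('b', 2), ('d', 2), ('e', 1)]
```

The block keeps its internal order and is placed ahead of untouched buckets of equal size, so the result coincides with what a stable descending sort of the updated sizes would produce.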
4.2.2. SID Creation
This part will introduce the algorithm of creating SID. The definition of dangerous bucket will be introduced as follows.
Definition 18 (dangerous bucket). Assuming sequence is in sensitive attribute , for each bucket , if , is a dangerous bucket in . Here, is size of the domain ().
The SID creation can be seen in Algorithm 19. For each sensitive attribute , one generates its sequence and removes the dangerous buckets from sequence to SDB. Every time one generates a new , for each bucket in SDB, removes one value from it to (line 6), for each bucket of sequence, sequence, …, sequence[-] which have largest size, removes a value from it to (line 7). When a is completed, call for Algorithm 13 to sort the sequence. In the step of processing residual values, for each value in a nonempty bucket, remove to a which does not contain value (line 9).
Algorithm 19 (SID creation). Input: , IR table
(1) For each sensitive attribute , do (2) to (9).
(2) Generate sequence.
(3) Find the dangerous buckets, and put them in SDB.
(4) Remove the dangerous buckets in sequence;
(5) When there are at least - buckets in sequence which are not empty, generate a new repeat (6) to (8).
(6) For each bucket in SDB, remove a value to .
(7) For each bucket from sequence, sequence, …, sequence[-, remove a value to ;
(8) Call for Algorithm 13 and use sequence and - as input; //processes the residual attribute values
(9) For each value in nonempty buckets, find a which contains no value , and remove to . If one cannot find this , value cannot be grouped.
Lemma 20. If bucket is a dangerous bucket in sensitive attribute , after completing a new , is still a dangerous bucket.
Proof. After completing a new , the frequency of .value is , and is the size of . Before generating the new , the frequency of .value was . If there is , Lemma 20 can be proved.
The left side of the equation is equal to , and the right side of the equation is equal to ; one only needs to prove thatBecause before generating the new , was dangerous bucket, and satisfied , has , so we can get
Corollary 21. If bucket is a dangerous bucket, after completing a new , is still one of the buckets which have the largest size.
Proof. Using proof by contradiction, if there is Set = bucket1, bucket2, …, , for each , it has larger size than . According to Lemma 20, after generating a new , the , so bucket satisfies , so there hasGet the contradictory conclusions: From the above certification, we can find out that the dangerous bucket is always one of the largest buckets, so each new should take one value of dangerous bucket, and it does not need to consider dangerous bucket for sorting sequence. This method further improves the efficiency of the algorithm.
Besides, in this improved Rating, the algorithm for AT&IDT Creation is the same as in Rating, uses to generalize corresponding value in IR table, after processing IR, gets AT, and then uses the set of to make up IDT. At last release AT and IDT with the previous SAC~ table.
Here we assume adversary masters strong rules and summarize the security of mixed model. For released table SAC~, because each group contains at least records without the same values and refers to Lemma 12, it is easy to know that SAC~ satisfy -diversity. For released AT, we will discuss a problem of probability tilts first. For record , after generalization if , satisfy which is a strong rule, there will be probability tilt between and obviously. And according to the method of partition table, all the strong values in AT belong to the same sensitive attribute, so the probability tilts will not happen in AT. Besides, each consists of at least different values, so AT also satisfies -diversity. And through the above analysis, we know that mixed model is able to satisfy -diversity.
5. Analysis and Proof of Information Availability
This section analyzes the information loss of the new model in terms of the availability of association rules and the quality of the published data table.
5.1. The Availability of Association Rules
In the Rating model, all relationships between different sensitive attributes are broken. The new model presented in this paper improves on this: all the strong rules are preserved in the released tables.
Lemma 22. If the association rule is as follows: confidence = , ≥ in the original data table, in the released data tables, the confidence of is still .
Proof. The released data tables contain SAC~ table, ID table, and the attribute table, and user can get the support degree of from SAC~ table and ID table, namely, support(), because in attribute table there is , one only needs to get support() from the SAC~ table. So the confidence of is support()/support().
Here, we can see that the mixed model preserves all the strong association rules, and users can obtain the confidence of strong association rules from the released tables. By contrast, the Rating model breaks all the relationships between sensitive attributes and thus causes unnecessary information loss.
5.2. The Quality of Published Data Table
This part uses the reconstruction error (RCE) [9, 17] to measure the quality of the published tables. Assume original table = ) gets a dimensional space ; for record in table , the probability density function (pdf) of is
Here, the is a dimensional variable in .
For record in the released table of Rating, the pdf of is
Assume the Cluster_Set = in the mixed model, the defines a dimensional space . In the released tables of the mixed model, if , the pdf of is
Here, is a dimensional variable in , represents the set of the possible values of , and represents the number of the possible values. For example, in Table 6, a user wants to reconstruct the pdf of ; in his view, the can be () or () with equality probability 1/2, and can be or with equality probability 1/2, so the pdf of is