Mathematical Problems in Engineering
Volume 2015 (2015), Article ID 464731, 14 pages
http://dx.doi.org/10.1155/2015/464731
Research Article

Privacy Protection Method for Multiple Sensitive Attributes Based on Strong Rule

College of Computer Science, Communication University of China, Beijing 100024, China

Received 14 April 2015; Revised 10 July 2015; Accepted 15 July 2015

Academic Editor: Nazrul Islam

Copyright © 2015 Tong Yi and Minyong Shi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

At present, most studies on data publishing consider only a single sensitive attribute, and work on multiple sensitive attributes is still scarce. Almost all existing studies on multiple sensitive attributes fail to take the inherent relationship between sensitive attributes into account, so an adversary can use background knowledge about this relationship to attack the privacy of users. This paper presents an attack model based on the association rules between sensitive attributes and, accordingly, presents a data publication method for multiple sensitive attributes. Proof and analysis show that the new model prevents an adversary from using background knowledge about association rules to attack privacy, while still producing high-quality released information. Finally, this paper verifies these conclusions with experiments.

1. Introduction

Data publishing is widely used in information sharing and scientific research, and the core problem is how to ensure both the availability of the data and the security of users’ privacy. Data tables usually contain three types of attribute: the identifier, which identifies an individual uniquely, for example, the social security number (SSN); the quasi-identifier (QI), which cannot identify an individual uniquely but provides information about the individual, for example, the country and age attributes; and the sensitive attribute, which is usually related to the privacy of users, for example, the disease attribute. The sensitive attribute needs to be protected in the published table. A series of data publishing methods [1–11] have been presented to prevent an adversary from linking quasi-identifying attributes with publicly available datasets to reveal personal identity. Papers [1, 2] present the k-anonymity method, which partitions the table into equivalence groups (EG). Each equivalence group consists of at least k different records, and k-anonymity generalizes the quasi-identifier attributes of the records in the same equivalence group. However, k-anonymity faces the risk of sensitive attribute disclosure because it lacks diversity. To solve this problem, [4] proposes l-diversity, which not only satisfies k-anonymity but also requires at least l different sensitive attribute values in each equivalence group. In addition, the privacy protection methods in [5–11] improve k-anonymity from different angles.

Most of these methods consider only a single sensitive attribute, so some privacy protection methods for multiple sensitive attributes have been presented. Papers [12, 13] attempt to apply l-diversity directly to multiple sensitive attributes, which results in a lot of information loss. Paper [14] protects users’ privacy by disturbing the order of sensitive attribute values within the same equivalence group, but this method needs to add fake sensitive attribute values to the EG, and it breaks the relationship between sensitive attributes, so useful relationships cannot be provided. The publication method in paper [15] can prevent an adversary from using nonmembership knowledge to attack the data table, but its strict grouping condition results in excessive information loss. According to the theory of paper [16], the publication methods of [12, 13, 15] cannot ensure good diversity and are vulnerable to background-join attacks, so paper [16] divides the raw data table into several projected tables, puts the sensitive attributes that have strong dependency into the same projected table, and finally makes each projected table satisfy t-closeness. But this method ignores association rules with high confidence, and an adversary can use knowledge of these rules to learn users’ private information. To avoid the suppression of records, paper [17] presents a new publication method, which generalizes each sensitive attribute separately. But like other privacy protection methods for multiple sensitive attributes, paper [17] ignores the inherent relationship between sensitive attributes, so an adversary can use the related background knowledge to attack privacy, and it is difficult for users to find valuable relationships in its released tables. To resolve this problem, this paper introduces association rules into the design of the privacy protection method and presents an improved data publishing model for multiple sensitive attributes based on the work of [17].

2. The Main Work of This Paper

Most existing research on privacy-preserving technology for multiple sensitive attributes has not taken the inherent relationship between sensitive attributes into account, so an adversary can sometimes use the related background knowledge to attack the privacy of users, and some valuable relationships cannot be provided by the released tables. To address this, we introduce association rules into the research on data publishing. The main contributions of this paper are as follows.
(1) It analyses the data publishing model Rating from paper [17], points out its weakness, and presents an attack method based on strong rules (Section 3).
(2) It takes the relationship between different sensitive attributes into account, presents a mixed data publishing model based on Rating, improves the Rating algorithm to make it more effective, and finally analyses and proves the correctness of the algorithm and the security of the mixed data publication model (Section 4).
(3) It proves in theory that the new model yields better quality of released information than Rating (Section 5).
(4) Through experiments, it verifies that the new data publishing model provides better privacy and preserves valuable relationships between sensitive attributes in the released tables (Section 6).

3. The Analysis of Rating

This section will introduce Rating [17] model briefly and present an attack method which can use strong rules to get users’ privacy from Rating.

3.1. Description of Symbol

Table contains records, record = , are quasi-identifiers of , and are sensitive attributes of . represents the value of in quasi-attribute . Similarly, represents the value of in sensitive attribute .

Definition 1 (generalization). Assume is a sensitive attribute of table , , is a subset of in table , and generalization means using abstract value to replace specific value . For example, in Table 1, , we can use to replace in Table 3.

Table 1: Original table.

Definition 2 (-diversity). is the parameter input by users. After generalization, for each sensitive , if satisfies , record satisfies -diversity. Here represents the number of in , and represents the number of values in set . If all records of model satisfy -diversity, the model satisfies -diversity.

3.2. Review of Rating

Rating generalizes each sensitive attribute, respectively, and can improve the quality of released information. Assume Table 1 is the original data table, are sensitive attributes, and age and country are quasi-identifiers. , Rating generates ID table (IDT) first, then uses the values of ID table to generalize the original data table, gets attribute table (AT), and at last releases IDT and AT. Tables 2 and 3 are IDT and AT, respectively, and both of them are released tables of Table 1.

Table 2: Rating published IDT for Table 1.
Table 3: Rating published AT for Table 1.

is a subset of sensitive attribute , it satisfies that , refers to the set of all the values in original table. After getting the IDT, we use the values of ID table to generalize original data table. If and , use to replace in original data table. For example, in Table 1, because , and use to replace in original table, so in Table 3. AT displays the name of in [17], and users can use the name to get the corresponding value in IDT. For convenient description, AT displays the value directly in this paper.

3.3. The Weakness of Rating

Rating adopts a generalization strategy that is suitable for multiple sensitive attributes and can improve the quality of released information. But Rating ignores the relationship between different sensitive attributes, so users’ privacy may sometimes be disclosed.

Compared with data publishing for a single sensitive attribute, multiple sensitive attributes mean not only a larger number of sensitive attributes but also a need for more effective methods to reduce information loss. In addition, an adversary may use the relationship between sensitive attributes to attack users’ privacy. So when designing a data publishing model for multiple sensitive attributes, association rules should be taken into account. An attack method is presented as follows.

Assuming the original table can basically reflect the real world, if Bob knows his neighbor Alice is in Table 3 and Alice is 23 years old and Chinese, Bob is sure that is Alice according to the quasi-identifier. Through Bob’s common sense of life or previous investigations, he knows that if event happens to someone, the probability of occurrence of is usually not less than 75%; namely, . Then through the ID table Bob knows in this table that there are four people whose attribute values are , so in these four people, there are at least three whose attribute values must be . That is to say, there are at least three people whose values are while their values are .

So Bob begins to analyze attribute table, and he finds out that there are 8 records whose value may be , and these 8 records are , but in these records, only 3 records’ values may be , and these three records are , respectively. Then Bob can be sure that , . And is Alice, so the privacy of Alice is disclosed.

The above is an example of an attack with association rules. Because Rating does not take the correlation between sensitive attributes into account, the privacy of users may be disclosed if an adversary has the corresponding background knowledge. The next section therefore presents an improved model.

4. The Data Publishing for Multiple Sensitive Attributes Based on Strong Rules

In this section, we introduce association rules and present a new data publishing model that resists attacks based on strong rules between sensitive attributes. The records are divided into two categories, and each category is processed by a different data publishing model. The new data publishing model for multiple sensitive attributes is therefore a mixed model.

4.1. Data Publishing Method Based on Sensitive Attributes Clustering (SAC)

The original data table is divided into two tables, table SAC and table IR, and the two tables are processed separately. This part introduces the division of the table and the processing of table SAC. We first introduce some definitions and parameters.

Definition 3 (association rules). Assume that when event happens, the probability of occurrence of event is ; namely, . is an association rule.

Definition 4 (support degree). It represents the number of occurrences of an event. For example, in Table 1, the number of is 4, so the support degree of is 4. Namely, . especially represents the simultaneity number of and . In Table 1, there are 3 records satisfying , , so .

Definition 5 (confidence). Assume when happens, the probability of occurrence of event is , so the confidence of is . For example, in Table 1, there are 4 records whose values are , and three of them have value . So when appears, the probability of occurrence of is 3/4. Namely, . It is not difficult to prove that .
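Definitions 4 and 5 can be illustrated with a minimal Python sketch. The attribute names `disease` and `drug`, the toy records, and the function names are hypothetical and are not taken from Table 1; the sketch only assumes the standard relation that the confidence of a rule is the joint support divided by the support of the antecedent.

```python
from typing import Dict, List

Record = Dict[str, str]


def support(records: List[Record], items: Dict[str, str]) -> int:
    """Count the records that match every (attribute, value) pair in items."""
    return sum(all(r.get(a) == v for a, v in items.items()) for r in records)


def confidence(records: List[Record],
               antecedent: Dict[str, str],
               consequent: Dict[str, str]) -> float:
    """Confidence of antecedent => consequent: joint support / antecedent support."""
    base = support(records, antecedent)
    joint = support(records, {**antecedent, **consequent})
    return joint / base if base else 0.0


# Hypothetical toy data: four records share one value, three of them share a second.
records = [
    {"disease": "flu", "drug": "aspirin"},
    {"disease": "flu", "drug": "aspirin"},
    {"disease": "flu", "drug": "aspirin"},
    {"disease": "flu", "drug": "ibuprofen"},
    {"disease": "asthma", "drug": "inhaler"},
]

print(support(records, {"disease": "flu"}))
print(confidence(records, {"disease": "flu"}, {"drug": "aspirin"}))
```

With these toy records the antecedent has support 4 and the joint event has support 3, so the confidence is 3/4, the same ratio as in the example of Definition 5.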

Usually we set a minimum support threshold (min_support) and a minimum confidence threshold (min_confidence); an association rule that satisfies both thresholds is considered meaningful. In this paper, as soon as a sensitive attribute value appears once, an adversary may use it to attack the privacy of users, so min_support is set to 1, and min_confidence is set by the user.

Definition 6 (strong rule). Assuming association rule [confidence = ], ≥ min_confidence, so we call strong rule.

Usually the strong rule’s confidence is relatively higher, and adversary may use the strong rule to attack users’ privacy if adversary has the related background knowledge. On the other hand, we hope to preserve information of strong rule in the released data, because it is valuable. So we need to put the records containing strong rules into table SAC and process table SAC with consideration for strong rule. Record containing strong rules means that, for record , if , is a strong rule, so record contains strong rule. And for a value , if there is a strong rule that satisfies or , we call strong value.
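As a rough illustration of strong rules and strong values, the following Python sketch enumerates only rules between one value of one sensitive attribute and one value of another. This is a simplification chosen for brevity, not the Apriori algorithm, and all names and data are hypothetical. It uses min_support = 1, as stated above.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]
Item = Tuple[str, str]            # (attribute name, attribute value)
Rule = Tuple[Item, Item, float]   # (antecedent, consequent, confidence)


def strong_rules(records: List[Record], sensitive: List[str],
                 min_confidence: float) -> List[Rule]:
    """Return pairwise rules x => y whose confidence reaches min_confidence."""
    single: Counter = Counter()
    pair: Counter = Counter()
    for r in records:
        for a in sensitive:
            single[(a, r[a])] += 1
        for a, b in permutations(sensitive, 2):
            pair[((a, r[a]), (b, r[b]))] += 1
    return [(x, y, n / single[x]) for (x, y), n in pair.items()
            if n / single[x] >= min_confidence]


def strong_values(rules: List[Rule]) -> Dict[str, Set[str]]:
    """A value is strong if it appears on either side of some strong rule."""
    out: Dict[str, Set[str]] = {}
    for x, y, _ in rules:
        for attr, val in (x, y):
            out.setdefault(attr, set()).add(val)
    return out


# Hypothetical toy data.
records = [
    {"disease": "flu", "drug": "aspirin"},
    {"disease": "flu", "drug": "aspirin"},
    {"disease": "flu", "drug": "aspirin"},
    {"disease": "flu", "drug": "ibuprofen"},
    {"disease": "asthma", "drug": "inhaler"},
]
rules = strong_rules(records, ["disease", "drug"], 0.75)
print(strong_values(rules))
```

Because min_support is 1, a value that occurs only once always produces a rule with confidence 1, so rare values tend to become strong values in this sketch.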

4.1.1. Partition Table

Assuming the association rules in original data table are close to the situation of the real world, first use classic association rule mining algorithm-Apriori [18] to find out all the strong rules in table, according to min_confidence set by user (line 1). Then find a sensitive attribute that has the largest number of strong values, put all other sensitive attributes into , and if a record contains strong value in , add it to table SAC (line 2). At last delete the records contained by SAC in original table, and get table IR (line 3), so the original table is divided into two tables: SAC and IR. Obviously, the records which all contain strong association rules are in SAC, and all the strong values in IR belong to the same sensitive attribute, so there are no probability tilts caused by strong association rules in IR (please see Section 4.2.2).

Algorithm 7 (partition table). Input: original data table , min_confidence
Output: Table SAC, Table IR
(1) According to min_confidence, find all strong rules with Apriori.
(2) Find the records which all contain strong values in , and put them into table SAC.
(3) Table IR = -Table SAC.
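The partition step can be sketched in Python as follows. The sketch assumes that the strong values of each sensitive attribute have already been computed (for example by a rule miner) and are passed in as a dictionary; the function name, the attribute names `s1` to `s3`, and the tie-breaking rule (first attribute with the maximum count) are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]


def partition_table(records: List[Record],
                    strong_vals: Dict[str, Set[str]]
                    ) -> Tuple[List[Record], List[Record]]:
    """Split records into (SAC, IR).

    The attribute with the most strong values is set aside; a record goes to
    SAC if it holds a strong value in any of the remaining attributes.
    """
    if not strong_vals:
        return [], list(records)
    # Attribute with the largest number of strong values (ties: first found).
    kept = max(strong_vals, key=lambda a: len(strong_vals[a]))
    others = [a for a in strong_vals if a != kept]
    sac: List[Record] = []
    ir: List[Record] = []
    for r in records:
        if any(r[a] in strong_vals[a] for a in others):
            sac.append(r)
        else:
            ir.append(r)
    return sac, ir


# Hypothetical toy data.
table = [
    {"s1": "a", "s2": "x", "s3": "p"},
    {"s1": "a", "s2": "y", "s3": "q"},
    {"s1": "c", "s2": "z", "s3": "p"},
]
sac, ir = partition_table(table, {"s1": {"a", "b"}, "s2": {"x"}})
print(len(sac), len(ir))
```

In this toy run, the only strong values left in IR belong to `s1`, which is the property the partition is meant to guarantee.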

4.1.2. Partition Sensitive Attributes

After partitioning the table, we process table SAC. To preserve the information carried by strong rules, we cluster the sensitive attributes. First, we need to define the distance between sensitive attributes.

Definition 8 (distance between sensitive attributes). Given two sensitive attributes , , the distance between the two sensitive attributes can be defined as
Here, if , , or is a strong rule, we say there are strong rules between and ; otherwise, there are no strong rules between and .

Definition 9 (distance between sensitive attribute and cluster). Assuming is a cluster, is a sensitive attribute, and the distance between and can be defined as

In this method, put the similar sensitive attributes into a same cluster, as long as the set of sensitive attributes is not empty, and generate new cluster constantly (line 1). For each new empty cluster , pick a sensitive attribute from sensitive attribute set orderly, put into , and is the first attribute of cluster (line 2). Find out all such sensitive attributes , and satisfies that . Add to , and delete in (line 3 to line 4). Similarly, generate other clusters.

After Algorithm 10, we get the set of clusters: Cluster_Set = , , and is a subset of sensitive attributes set = .

Assuming , , record , and the value of is .

Algorithm 10 (partition sensitive attributes). Input: Table SAC
Output: set of sensitive attributes’ cluster (Cluster_Set)
is the set of sensitive attributes, = .
(1) While the is not empty, repeat (2) to (5).
(2) Generate cluster , add the first sensitive attribute to , - .
(3) For each , do (4).
(4) If distance() = 0, add to , -.
(5) .
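The clustering loop of Algorithm 10 can be sketched as below. The formulas of Definitions 8 and 9 are not reproduced in this text, so the sketch rests on two assumptions: the distance between two attributes is 0 when there are strong rules between them and 1 otherwise (consistent with step (4) and with the worked example), and an attribute is at distance 0 from a cluster when it is at distance 0 from at least one member. The second assumption is only one possible reading. All names are hypothetical.

```python
from typing import FrozenSet, List, Set


def cluster_attributes(attributes: List[str],
                       linked: Set[FrozenSet[str]]) -> List[List[str]]:
    """Group sensitive attributes that are connected by strong rules.

    linked holds frozenset({a, b}) for every attribute pair with strong rules.
    """
    def dist_attr(a: str, b: str) -> int:
        # Assumed 0/1 distance: 0 if strong rules exist between a and b.
        return 0 if frozenset((a, b)) in linked else 1

    def dist_cluster(a: str, cluster: List[str]) -> int:
        # Assumed reading of the attribute-to-cluster distance (minimum).
        return min(dist_attr(a, c) for c in cluster)

    remaining = list(attributes)
    clusters: List[List[str]] = []
    while remaining:
        cluster = [remaining.pop(0)]          # first attribute seeds the cluster
        for a in list(remaining):
            if dist_cluster(a, cluster) == 0:
                cluster.append(a)
                remaining.remove(a)
        clusters.append(cluster)
    return clusters


print(cluster_attributes(["S1", "S2", "S3"], {frozenset(("S1", "S2"))}))
```

With strong rules only between the first two attributes, the sketch yields two clusters, one with two attributes and one with the remaining attribute alone.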

4.1.3. Partition Records

This part will divide the Table SAC into several groups and anonymize the records in the same group.

Algorithm 11 (partition records). Input: Table SAC, Table IR, and Cluster_Set,
Output: released Table SAC~
(1) While Table SAC is not empty, repeat (2) to (7).
(2) Generate a new group .
(3) Choose a record from Table SAC orderly, , SAC-.
(4) While , repeat (5) to (6).
(5) If , this satisfies , and SAC-.
(6) Else choose record from table IR orderly, satisfies ,  , and IR-.
(7)
(8) Permutate cluster values in each group randomly.

In the algorithm of partition records, while table SAC is not empty, generate group constantly (lines 1-2). For each empty group , choose a record from table SAC as ’s first record (line 3). Choosing , or , does not have the same sensitive attributes values with , and add to , until the number of records in is not less than (line 4–6). Within each group, sensitive attributes values are permutated randomly in each cluster to break the linking between different clusters (line 8). That is to say, adjust the position of cluster values randomly. Finally, release .
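The grouping and permutation steps can be sketched in Python as follows. This is a greedy illustration under stated assumptions, not the authors’ implementation: the group size parameter is called `l` here, a group is left short when no compatible record remains (a fallback the algorithm above does not specify), and the record layout and function names are hypothetical.

```python
import random
from typing import Dict, List, Optional

Record = Dict[str, object]


def shares_value(record: Record, group: List[Record], sensitive: List[str]) -> bool:
    """True if record repeats any sensitive value already present in group."""
    return any(record[a] == g[a] for g in group for a in sensitive)


def take_compatible(pool: List[Record], group: List[Record],
                    sensitive: List[str]) -> Optional[Record]:
    """Remove and return the first record in pool that shares no value with group."""
    for i, record in enumerate(pool):
        if not shares_value(record, group, sensitive):
            return pool.pop(i)
    return None


def partition_records(sac: List[Record], ir: List[Record],
                      clusters: List[List[str]], l: int,
                      rng: random.Random) -> List[List[Record]]:
    sac, ir = list(sac), list(ir)
    sensitive = [a for cluster in clusters for a in cluster]
    groups: List[List[Record]] = []
    while sac:
        group = [sac.pop(0)]
        while len(group) < l:
            record = take_compatible(sac, group, sensitive)
            if record is None:                 # fall back to table IR
                record = take_compatible(ir, group, sensitive)
            if record is None:                 # assumed fallback: short group
                break
            group.append(record)
        groups.append(group)
    released: List[List[Record]] = []
    for group in groups:
        out = [dict(r) for r in group]
        for cluster in clusters:
            # Shuffle whole cluster tuples so links inside a cluster survive.
            values = [tuple(r[a] for a in cluster) for r in group]
            rng.shuffle(values)
            for r, vals in zip(out, values):
                r.update(zip(cluster, vals))
        released.append(out)
    return released
```

Shuffling whole tuples per cluster keeps the values of attributes in the same cluster together while breaking the link between different clusters and between clusters and quasi-identifiers.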

Taking Table 1 as an example, the process has three steps. Here, assume that , min_confidence = 0.75.
(1) Partition table: both and have four strong values, and has no values, so . We first find out that there are 4 records containing strong values in , and they are , and , respectively. These 4 records make up table SAC (Table 4) and meanwhile, delete these 4 records in Table 1.
(2) Partition sensitive attributes: generate a new cluster , add to , and now there remain two sensitive attributes in sensitive attributes set. Because there is strong rule between and and , add to . But both and are 1, , cannot be added to . And the only one attribute in sensitive attributes set makes up a cluster alone. So clustering is over, and we get two clusters , .
(3) Partition records: according to the grouping condition, it cannot have the same sensitive attribute values in a group, and make up a group, similarly, and and , and , and make up groups, respectively (Table 5). After grouping, randomly permutate the cluster values in the same groups and release the table SAC~ (Table 6). For example, in group 1, permutate value , randomly. Here, each group has two records, according to the random principle; after disturbing order, may swap position with (), or both () and () remain in the original positions. Similarly, for , permutate value, , , randomly. Although through anonymity, the relationship between and is still preserved. On the other hand, linking between and has been broken. So this method preserves the links between sensitive attributes in the same clusters and breaks the links between sensitive attributes from different clusters.

Table 4: The SAC table.
Table 5: Partition records for Table 4.
Table 6: Mixed model published SAC~ for Table 1.

Lemma 12. Assuming adversary has the background knowledge about strong rules, for , , , adversary is sure that contains with the probability:
Here, represents the set of strong rules in table .

Proof. If , there are no records containing strong rules in table IR, so . If , for each group in , has at least records; if there is record containing in , according to the nature of random permutation, each record in contains with equality probability , so we can know that even though adversary can be clear of how many records contain through related background knowledge, is not more than .

In the example of Section 3.3, for some records, adversary can make sure that they contain strong rules with probability 100%. table can prevent adversary from using strong rules to attack users’ privacy.

4.2. Improved Rating (IR)

The table IR will be processed by improved Rating. For each sensitive attribute , Rating [17] hashes in by their values (each bucket corresponds to each value), and if has different values (), it can get sequence = , and assuming there are in , so the corresponding bucketi contains in it. Every time Rating chooses the buckets that have the largest size, gets a value from every one of these buckets to make up a SID, uses SID to replace corresponding sensitive attribute values in original table, and gets attribute table, and the SID makes up ID table. Every time Rating generates an SID, it needs to reselect buckets, because after the last generation of SID, the sizes of buckets have been updated. Paper [17] has not presented the algorithm of choosing largest buckets, so this paper will present a heuristic algorithm of choosing buckets.

4.2.1. Heuristic Algorithm for Choosing Buckets

This heuristic algorithm (Algorithm 13) is actually a stable sorting algorithm: it sorts the sequence in descending order and chooses the first buckets from the sequence. The algorithm is called after the sizes of the buckets have been updated. We first introduce some parameters, where sequence[] refers to the bucket in the sequence:
sequence[].value: the attribute value in sequence[],
sequence[].size: the size of sequence[],
sequence[].size~: after updating size, the size of sequence[], if , sequence[].size~ = sequence[].size-1; if , sequence[].size~ = sequence[].size,
sequence[].position: the position of sequence[], sequence[].position = ,
sequence[].position~: after sorting, the new position of sequence[].

Algorithm 13 (heuristic choosing largest buckets). Input: sequence, ,
Output: sequence satisfies stable descending order,
(1) If sequence[].size~ < sequence[].size~, do (2) to (5).
(2) For ; ; , repeat (3) to (5)
(3) If sequence[].size~ < sequence[].size~, do (4) to (5).
(4) Find sequence[] and sequence[] which satisfy sequence[].size~ > sequence[].size~ ≥ sequence[].size~
(5) remove = to the position between sequence[] and sequence[]; meanwhile keep the relative position among , end loop.
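The idea behind Algorithm 13 can be sketched in Python under the following assumptions: the sequence is already in stable descending order, exactly the first `m` buckets each lose one value, and buckets are represented as hypothetical `(value, size)` tuples. The function name and the parameter `m` are introduced here for illustration only.

```python
from typing import List, Tuple

Bucket = Tuple[str, int]  # (attribute value, bucket size)


def take_and_reorder(sequence: List[Bucket], m: int) -> List[Bucket]:
    """Decrement the first m bucket sizes and restore stable descending order."""
    seq = [(v, s - 1) if i < m else (v, s) for i, (v, s) in enumerate(sequence)]
    # Best case: the last updated bucket is still at least as large as the
    # next one, so nothing has to move.
    if m <= 0 or m >= len(seq) or seq[m - 1][1] >= seq[m][1]:
        return seq
    # Find the block of updated buckets that are now smaller than seq[m].
    j = m - 1
    while j > 0 and seq[j - 1][1] < seq[m][1]:
        j -= 1
    block, rest = seq[j:m], seq[:j] + seq[m:]
    # Move the block behind every bucket that is strictly larger; buckets of
    # equal size that were originally behind the block stay behind it.
    k = j
    while k < len(rest) and rest[k][1] > block[0][1]:
        k += 1
    return rest[:k] + block + rest[k:]


print(take_and_reorder([("a", 3), ("b", 3), ("c", 3), ("d", 2), ("e", 1)], 2))
```

The result matches what a general stable descending sort of the updated sizes would produce, but only the block of updated buckets is moved.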

If the sequence always satisfies the stable descending order, there are Lemmas 14 and 15 and Corollaries 16 and 17, and we will prove the correctness of algorithm.

Lemma 14. After updating size of buckets, if sequence.size~ ≥ sequence.size~, the position of all buckets in sequence does not need to be adjusted.

Proof. Using proof by contradiction, only the sizes of have been changed, so these buckets may need to adjust their position in the sequence. If sequence[] () needs to be removed to the back of sequence[] (), then sequence[].size~ > sequence[]. and sequence[].size~ ≥ sequence[]. > sequence[].size~. Because sequence[].size ≥ sequence[].size, and after generation, in both the sizes of sequence[] and sequence[] minus 1, there is sequence[].size~ ≥ sequence[].size~, we can get the contradictory conclusion:

Lemma 15. If sequence.size~ > sequence.size~ ≥ sequence, in order to preserve stable descending order, there are sequence.position~ < sequence. < sequence. and sequence. position~ = sequence.

Proof. Obviously, according to the nature of stable descending sorting, the new position of sequence[] is between sequence[] and sequence[]. Here we will discuss the new position of sequence[]. In addition to the situation in Lemma 15, the new position of sequence[] has two possibilities, and each is refuted by contradiction.
(1) After adjusting, the position of sequence[] will be sequence[].position~ and sequence[].position~ < sequence[].position~. According to the nature of descending order, before generating SID, sequence[].size ≥ sequence[].size and sequence[].position < sequence[].position, after generating SID, in both the sizes of sequence[] and sequence[] minus 1, one has sequence[].size~ ≥ sequence[].size~. So sequence[].position~< sequence[].position~ contradicts the nature of stable descending order.
(2) After adjusting, the position of sequence[] will be sequence[].position~, sequence[].position~ > sequence[].position~, and existing set , and each sequence[] satisfies that sequence[].position~ < sequence[].position~ < sequence[].position~. If , according to the nature of stable descending order, sequence[].size~ ≥ sequence[].size~ ≥ sequence[].size~, sequence[].size ≥ sequence[].size ≥ sequence[].size, sequence[].position < sequence[].position < sequence[].position, and we can get the contradictory conclusions: . In another situation, if , one must have sequence[].size~ ≥ sequence[].size~ > sequence[].size~ and sequence[].size = sequence[].size ≥ sequence[].size; because the sizes of sequence[] and sequence[] have not changed, there is sequence[].size~ ≥ sequence[].size~, so we can get the contradictory conclusions: sequence[].size~ ≥ sequence[].size~.

Corollary 16. If sequence.size~ < sequence.size~, the new position of sequence will be sequence.position~; for each sequence, its new position is as follows: sequence.position~ = sequence.position~ + .

Proof. According to Lemma 15, sequence[].position~ = sequence[].position~ + 1; for each sequence[] ), there exists sequence[].position~ = sequence[].position~ + 1, so for each sequence[] (), it satisfies

Corollary 17. After updating size of bucket, if sequence.size~ > sequence.size~ ≥ sequence.size~, here or , sequence.size~ ≥ sequence.size~; to make sequence satisfy the stable descending order, one only needs to remove = sequence, to the position between sequence and sequence.

Proof. After updating size, only sizes of have changed, so only the positions of may need to be adjusted. According to the nature of stable descending order, sequence[].position~ = sequence[].position~ + 1; according to Corollary 16, sequence[].position~ = sequence[].position~ + (), (), and sequence[].position~ = sequence[].position~ + 1. So removing sequence[], to the position between sequence[] and sequence[] can make sequence satisfy stable descending order.
Here we analyze the efficiency of this sorting algorithm. In the best case, sequence[].size~ ≥ sequence[].size~, the algorithm only needs to compare sequence[].size~ with sequence[].size~, and the time complexity is . In the worst case, sequence[].size~ < sequence[].size~, the algorithm needs to compare times, and the time complexity is . Because the sequence was already in stable descending order before the update, this is cheaper than re-sorting the whole sequence with a general-purpose sorting algorithm.

4.2.2. SID Creation

This part will introduce the algorithm of creating SID. The definition of dangerous bucket will be introduced as follows.

Definition 18 (dangerous bucket). Assuming sequence is in sensitive attribute , for each bucket , if , is a dangerous bucket in . Here, is size of the domain ().

The SID creation can be seen in Algorithm 19. For each sensitive attribute , one generates its sequence and removes the dangerous buckets from sequence to SDB. Every time one generates a new , for each bucket in SDB, removes one value from it to (line 6), for each bucket of sequence[], sequence[], …, sequence[-] which have largest size, removes a value from it to (line 7). When a is completed, call for Algorithm 13 to sort the sequence. In the step of processing residual values, for each value in a nonempty bucket, remove to a which does not contain value (line 9).

Algorithm 19 (SID creation). Input: , IR table
Output:
(1) For each sensitive attribute , do (2) to (9).
(2) Generate sequence.
(3) Find the dangerous buckets, and put them in SDB.
(4) Remove the dangerous buckets in sequence;
(5) When there are at least - buckets in sequence which are not empty, generate a new repeat (6) to (8).
(6) For each bucket in SDB, remove a value to .
(7) For each bucket from sequence[], sequence[], …, sequence[-], remove a value to ;
(8) Call for Algorithm 13 and use sequence and - as input;
// Process the residual attribute values.
(9) For each value in nonempty buckets, find a which contains no value , and remove to . If one cannot find this , value cannot be grouped.
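The core loop of SID creation can be sketched in Python for a single sensitive attribute. The sketch is deliberately simplified and rests on several assumptions: the diversity parameter is called `l`, dangerous buckets are omitted because the threshold of Definition 18 is not reproduced in this text, and Python’s built-in stable sort stands in for Algorithm 13. The function name and return shape are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def create_sids(values: List[str], l: int) -> Tuple[List[List[str]], List[str]]:
    """Build SIDs with at least l distinct values each; return (sids, ungrouped)."""
    # One bucket per distinct value, kept in stable descending order of size.
    buckets = [[v, n] for v, n in
               sorted(Counter(values).items(), key=lambda kv: -kv[1])]
    sids: List[List[str]] = []
    while sum(1 for _, n in buckets if n > 0) >= l:
        sid = []
        for bucket in buckets[:l]:        # take one value from the l largest buckets
            sid.append(bucket[0])
            bucket[1] -= 1
        sids.append(sid)
        buckets.sort(key=lambda b: -b[1])  # stand-in for Algorithm 13
    # Residual values: place each one in an SID that does not yet contain it.
    ungrouped: List[str] = []
    for value, n in buckets:
        for _ in range(n):
            target = next((s for s in sids if value not in s), None)
            if target is None:
                ungrouped.append(value)    # value cannot be grouped
            else:
                target.append(value)
    return sids, ungrouped


sids, left = create_sids(["a"] * 3 + ["b"] * 2 + ["c"] * 2 + ["d"], 2)
print(sids, left)
```

A second call such as `create_sids(["a", "a", "a", "b"], 2)` leaves two copies of the frequent value ungrouped, which suggests why very frequent values need the special handling that the dangerous-bucket mechanism provides.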

Lemma 20. If bucket is a dangerous bucket in sensitive attribute , after completing a new , is still a dangerous bucket.

Proof. After completing a new , the frequency of .value is , and is the size of . Before generating the new , the frequency of .value was . If there is , Lemma 20 can be proved.
The left side of the equation is equal to , and the right side of the equation is equal to ; one only needs to prove that
Because before generating the new , was dangerous bucket, and satisfied , has , so we can get

Corollary 21. If bucket is a dangerous bucket, after completing a new , is still one of the buckets which have the largest size.

Proof. Using proof by contradiction, if there is Set = bucket1, bucket2, …, , for each , it has larger size than . According to Lemma 20, after generating a new , the , so bucket satisfies , so there has
Get the contradictory conclusions:
From the above proof, we can see that the dangerous bucket is always one of the largest buckets, so each new should take one value of dangerous bucket, and it does not need to consider dangerous bucket for sorting sequence. This method further improves the efficiency of the algorithm.
Besides, in this improved Rating, the algorithm for AT&IDT Creation is the same as in Rating, uses to generalize corresponding value in IR table, after processing IR, gets AT, and then uses the set of to make up IDT. At last release AT and IDT with the previous SAC~ table.
Here we assume adversary masters strong rules and summarize the security of mixed model. For released table SAC~, because each group contains at least records without the same values and refers to Lemma 12, it is easy to know that SAC~ satisfy -diversity. For released AT, we will discuss a problem of probability tilts first. For record , after generalization if , satisfy which is a strong rule, there will be probability tilt between and obviously. And according to the method of partition table, all the strong values in AT belong to the same sensitive attribute, so the probability tilts will not happen in AT. Besides, each consists of at least different values, so AT also satisfies -diversity. And through the above analysis, we know that mixed model is able to satisfy -diversity.

5. Analysis and Proof of Information Availability

This section analyzes the information loss of the new model from availability of association rules and the quality of published data table.

5.1. The Availability of Association Rules

In the Rating model, all the relationships between different sensitive attributes are broken. The new model presented in this paper improves on this: all the strong rules are preserved in the released tables.

Lemma 22. If the association rule is as follows: confidence = , in the original data table, in the released data tables, the confidence of is still .

Proof. The released data tables contain SAC~ table, ID table, and the attribute table, and user can get the support degree of from SAC~ table and ID table, namely, support(), because in attribute table there is , one only needs to get support() from the SAC~ table. So the confidence of is support()/support().
Here, we can see that the mixed model preserves all the strong association rules, and users can obtain the confidence of strong association rules from the released tables. By contrast, the Rating model breaks all the relationships between sensitive attributes and causes unnecessary information loss.

5.2. The Quality of Published Data Table

This part uses the reconstruction error (RCE) [9, 17] to measure the quality of the published tables. Assume original table = ) gets a dimensional space ; for record in table , the probability density function (pdf) of is

Here, the is a dimensional variable in .

For record in the released table of Rating, the pdf of is

Assume the Cluster_Set = in the mixed model, the defines a dimensional space . In the released tables of the mixed model, if , the pdf of is

Here, is a dimensional variable in , represents the set of the possible values of , and represents the number of the possible values. For example, in Table 6, a user wants to reconstruct the pdf of ; in his view, the can be () or () with equality probability 1/2, and can be or with equality probability 1/2, so the pdf of is

If , pdf of is . So in the released tables of mixed model, the pdf of is

We can measure the distance between the released tables of the mixed model and the original table as follows:

Here, assume is a record in original table and is the form of in released tables of mixed model. Similarly, is the form of in released table of Rating, and the distance between released table of Rating model and original table is as follows:

The smaller the distance, the higher the quality of the released table. Taking all the records into account, the reconstruction errors (RCE) of the mixed model and Rating are, respectively,

Lemma 23. If the original table contains strong rules, the RCE of the mixed model is smaller than that of the Rating model:

Proof. To simplify the proof, assume both models have no remaining attribute value to process, which means and satisfy , . In order to prove the conclusion, we only need to prove

For each in AT table of mixed model, because the AT table satisfies the requirement of the Rating model, there exists . For each in SAC~ table, one has

Similarly, for in released table of Rating, one has

Because the original table has strong rules, , , The mixed model has lower RCE than Rating, which means the released tables of the mixed model are closer to the original table. The linking between sensitive attributes in the same cluster is preserved, so the mixed model has higher quality than Rating.
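
The intuition behind Lemma 23 can be sketched numerically. The example below assumes, in the spirit of the reconstruction error used in [9], that the distance for one record is the summed squared difference between the reconstructed pdf and a point mass at the record's true value. It also assumes that every candidate value is equally likely. The attribute values and the function names `uniform_pdf` and `squared_distance` are hypothetical, so this is an illustration under stated assumptions rather than the paper's exact formulas.

```python
from itertools import product
from typing import Dict, List, Tuple

Value = Tuple[str, str]

def uniform_pdf(candidates: List[Value]) -> Dict[Value, float]:
    """Spread the probability mass evenly over the candidate value pairs."""
    p = 1.0 / len(candidates)
    return {c: p for c in candidates}

def squared_distance(pdf: Dict[Value, float], true_value: Value) -> float:
    """Sum of squared differences to a point mass located at true_value."""
    points = set(pdf) | {true_value}
    return sum(
        (pdf.get(x, 0.0) - (1.0 if x == true_value else 0.0)) ** 2
        for x in points
    )

true_value = ("Bachelors", "Prof-specialty")

# Mixed-model style: the two attributes stay linked inside a cluster,
# so the user sees two candidate pairs with probability 1/2 each.
mixed_pdf = uniform_pdf([("Bachelors", "Prof-specialty"), ("HS-grad", "Sales")])

# Rating style: the attributes are reconstructed independently,
# so all 2 x 2 combinations are equally plausible.
rating_pdf = uniform_pdf(
    list(product(["Bachelors", "HS-grad"], ["Prof-specialty", "Sales"]))
)

err_mixed = squared_distance(mixed_pdf, true_value)
err_rating = squared_distance(rating_pdf, true_value)
```

Under these assumptions, keeping the link between the two attributes leaves fewer candidate combinations, and the per-record error of the linked reconstruction comes out smaller than that of the independent one.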

6. Experiment

The experiment uses the real dataset Adult (http://archive.ics.uci.edu/ml/datasets/Adult); after deleting the incomplete records, 30718 records remain. The experiment consists of four parts: (1) execution time, (2) additional information loss, (3) accuracy rate of mining strong association rules, and (4) probability of privacy disclosure. We choose education, occupation, age, and relationship as sensitive attributes and use three sets of sensitive attributes: {education, occupation}, {education, occupation, age}, and {education, occupation, age, relationship}. Unless stated otherwise, the experiments use the default parameters: the number of records is 30718, and the min_confidence is 80%.

6.1. Execution Time

This paper presents an improved version of the Rating algorithm. The improvement mainly concerns SID creation; the AT and IDT creation is the same as in Rating. This part therefore compares the improved SID creation algorithm with the old one. Here, the old SID creation uses the stable bubble sort algorithm when choosing the largest buckets. We set parameters and choose a certain number of records randomly. Figure 1 shows that the execution time of the improved SID creation is much shorter than that of the old one because of the heuristic search, so the improved SID creation is more suitable for large datasets.

Figure 1: The comparison of execution time, .

Figure 2 shows the effect of the number of sensitive attributes on execution time. Because age has many more distinct values than the other three sensitive attributes, bubble sort needs more time for comparisons, and the running time of SID creation increases rapidly after the age attribute is added. The running time of SID creation is therefore strongly influenced by the number of distinct values of the sensitive attributes.

Figure 2: The effect of sensitive attributes number on execution time, .
6.2. Additional Information Loss

This part compares the additional information loss (AIL) [12] of the mixed model with that of Rating. To make AIL more suitable for the mixed model and Rating, we slightly change its definition. Assume sensitive has SID in IDT, the additional information loss of is

Here, represents the number of values in .

And the additional information loss of table is:

Here, has sensitive attributes .

Figure 3 shows the AIL of the mixed model and Rating; when increases, the additional information loss of both models generally increases. The additional information loss of the mixed model is slightly higher than that of Rating but always stays below 0.03%, while the security of the mixed model is enhanced. Here, Rating uses the stable bubble sort algorithm in SID creation. This experiment also reveals an interesting phenomenon: if the sorting algorithm used in SID creation is unstable, the additional information loss is much higher than with a stable sorting algorithm. This phenomenon needs further study.

Figure 3: The comparison of additional information loss, .

Figure 4 shows the effect of the minimum confidence on additional information loss. When the minimum confidence decreases, the additional information loss increases: because more records are placed in the SAC table, fewer sensitive attribute values remain available in the IR table.

Figure 4: The effect of minimum confidence on additional information loss, .

Figure 5 shows the effect of the number of sensitive attributes on additional information loss. Because all the sensitive attributes in Rating are processed independently, the AIL of Rating is almost unaffected by the number of sensitive attributes. For the mixed model, when the number of sensitive attributes increases, more records are moved to the SAC table and fewer values remain for grouping in the IR table, so the AIL of the mixed model grows with the number of sensitive attributes but stays within an acceptable range.

Figure 5: The effect of sensitive attribute number on additional information loss, .
6.3. The Accuracy of Mining Strong Association Rules

Strong rules tend to be valuable in practice, so this experiment analyzes the ability of the publication models to provide strong rules. We use the method of Lemma 22 to mine strong association rules from the released tables of the mixed model; for the released tables of Rating and the raw data table, we use Apriori to calculate the confidence of strong association rules.

Figures 6 and 7 show the average confidence of strong rules in the three tables. If all the records in SAC and all values in IR can be grouped, users can accurately calculate the confidence of strong rules from the released tables of the mixed model, which verifies the conclusion of Lemma 22. When in Figure 7, because the sensitive attribute relationship only has 6 different values and is very close to 6, some records cannot be grouped in SAC and have to be deleted; the average confidence in the mixed model then deviates from that in the raw data table, but the difference is small. In contrast, the average confidence of strong rules in Rating deviates greatly from that in the raw data table: because Rating breaks all the relationships between sensitive attributes, it is difficult for users to calculate the confidence of strong rules from its released tables. Figures 8 and 9 show similar results. Because the mixed model considers association rules between sensitive attributes, it can provide more valuable relationships than Rating in the released tables.

Figure 6: The average confidence of strong rules with varying , .
Figure 7: The average confidence of strong rules with varying , .
Figure 8: The average confidence of strong rules with varying the number of records .
Figure 9: The average confidence of strong rules with varying the number of records, .
6.4. The Probability of Privacy Disclosure

Following the method of [15], this experiment analyzes the probability of disclosure. Assume the adversary has background knowledge about strong rules. Because the records containing no strong rules satisfy l-diversity according to [17] and the previous analysis, we study the safety of the records that contain strong rules in the mixed model and Rating and analyze the probability that these records can be identified as containing known strong rules from the released tables. In Figure 10, one axis represents the probability of disclosure and the other represents the number of records. The probability that records contain strong rules is 1/3 in the released tables of the mixed model; the mixed model can ensure a maximum of disclosure probability for records, and the conclusion of Lemma 12 is verified. On the other hand, because Rating does not consider the relationship between sensitive attributes, the disclosure probabilities for records in the released tables of Rating are more than 80%.

Figure 10: The probability of disclosure, .

Figure 11 shows a similar result: the disclosure probabilities for many records are more than in Rating, while the mixed model ensures a maximum probability of 1/2 for records. Here we discuss an extreme situation. Assume is a strong rule, and in the released tables of Rating, the number of records that actually contain is , and the number of records that may contain is ; obviously, we have . But if or has very low frequency in the raw data table, the may be very small or even equal . When , the adversary can be sure which records contain in the released tables of Rating. This is why the disclosure probabilities for several records of Rating are 100% in Figure 11. These experiments show that the mixed model can prevent an adversary from attacking the data table with related background knowledge and thus provides better protection for privacy.
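
The extreme situation above can be made concrete with a small sketch. It assumes, for illustration only, that each released record exposes one set of candidate values per sensitive attribute and that the adversary estimates the disclosure probability as the number of records that actually contain the rule divided by the number of records that may contain it. The record layout, this ratio, and the function names are assumptions, not the paper's definitions.

```python
from typing import List, Set, Tuple

# Hypothetical released record: candidate values for two sensitive attributes.
ReleasedRecord = Tuple[Set[str], Set[str]]

def may_contain(record: ReleasedRecord, a: str, b: str) -> bool:
    """A record may contain the rule if both values are among its candidates."""
    first, second = record
    return a in first and b in second

def disclosure_probability(released: List[ReleasedRecord],
                           actual: int, a: str, b: str) -> float:
    """Assumed estimate: records that actually contain the rule / records that may."""
    possible = sum(may_contain(r, a, b) for r in released)
    if possible == 0 or actual > possible:
        raise ValueError("actual count must be between 0 and the possible count")
    return actual / possible

released = [
    ({"Bachelors", "HS-grad"}, {"Prof-specialty", "Sales"}),
    ({"Bachelors", "Masters"}, {"Prof-specialty", "Clerical"}),
    ({"HS-grad", "Masters"}, {"Sales", "Clerical"}),
]

# One of the two possible records really contains the rule.
p_typical = disclosure_probability(released, 1, "Bachelors", "Prof-specialty")
# Every possible record really contains the rule: the adversary is certain.
p_extreme = disclosure_probability(released, 2, "Bachelors", "Prof-specialty")
```

When a value in the rule is rare, few records list it as a candidate, so the two counts coincide more easily and the estimate reaches 100%.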

Figure 11: The probability of disclosure, .
6.5. Summary of Experiment

Section 6.1 verifies the efficiency of the improved SID creation, and the results of Section 6.2 show that the additional information loss of the mixed model is acceptable. The analysis in Sections 6.3 and 6.4 shows that Rating cannot preserve strong rules in the released tables and that Rating is unsafe as long as the adversary has background knowledge about these strong rules. The mixed model, on the other hand, can proactively provide strong rules to users while still ensuring privacy.

7. Summary

Most existing privacy protection methods for multiple sensitive attributes do not take the inherent relationship between different sensitive attributes into account. In view of this, this paper presents an attack method that uses association rules to breach users' privacy and, accordingly, a protection model. Through theoretical and experimental analysis, we show that the new protection model provides better protection for privacy and preserves useful relationships in the released tables. In addition, to improve the efficiency of the algorithm, we present an improved SID creation method and show experimentally that it is more efficient.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This paper is supported by “Guangzhou Research Institute of Communication University of China Common Construction Project, Sunflower – the Aging Intelligent Community.”

References

  1. L. Sweeney, “k-anonymity: a model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557–570, 2002.
  2. L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 571–588, 2002.
  3. K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, “Incognito: efficient full-domain K-anonymity,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '05), pp. 49–60, June 2005.
  4. A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “L-diversity: privacy beyond k-anonymity,” in Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), pp. 24–26, April 2006.
  5. X. Xiao and Y. Tao, “Personalized privacy preservation,” in Proceedings of the 25th ACM SIGMOD International Conference on ACM Management of Data (SIGMOD '06), pp. 229–240, ACM, Chicago, Ill, USA, 2006.
  6. N. Koudas, D. Srivastava, T. Yu, and Q. Zhang, “Distribution based microdata anonymization,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 958–969, 2009.
  7. N. Li, T. Li, and S. Venkatasubramanian, “t-Closeness: privacy beyond k-anonymity and l-diversity,” in Proceedings of the 23rd International Conference on Data Engineering (ICDE '07), pp. 106–115, IEEE, Istanbul, Turkey, April 2007.
  8. X.-C. Yang, X.-Y. Liu, B. Wang, and G. Yu, “K-anonymization approaches for supporting multiple constraints,” Journal of Software, vol. 17, no. 5, pp. 1222–1231, 2006.
  9. X. Xiao and Y. Tao, “Anatomy: simple and effective privacy preservation,” in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06), pp. 139–150, Seoul, Republic of Korea, September 2006.
  10. N. Li, W. Qardaji, and D. Su, “Provably private data anonymization: or, k-anonymity meets differential privacy,” CERIAS TR 2010-24, Center for Education and Research Information Assurance and Security, Purdue University, West Lafayette, Ind, USA, 2010.
  11. G. Aggarwal, T. Feder, K. Kenthapadi et al., “Achieving anonymity via clustering,” in Proc. of the 25th ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems, S. Vansummeren, Ed., pp. 153–162, ACM, New York, NY, USA, 2006.
  12. X.-C. Yang, Y.-Z. Wang, B. Wang, and G. Yu, “Privacy preserving approaches for multiple sensitive attributes in data publishing,” Chinese Journal of Computers, vol. 31, no. 4, pp. 574–587, 2008.
  13. Y. Jing and W. Bo, “Personalized l-diversity algorithm for multiple sensitive attributes based on minimum selected degree first,” Journal of Computer Research and Development (China), vol. 49, no. 12, pp. 2603–2610, 2012.
  14. Y. Ye, Y. Liu, D. Lv, and J. Feng, “Decomposition: privacy preservation for multiple sensitive attributes,” in Database Systems for Advanced Applications, vol. 5463 of Lecture Notes in Computer Science, pp. 486–490, Springer, Berlin, Germany, 2009.
  15. A. Abdalaal, M. E. Nergiz, and Y. Saygin, “Privacy-preserving publishing of opinion polls,” Computers & Security, vol. 37, pp. 143–154, 2013.
  16. Y. Fang, M. Zaman Ashrafi, and S. Kiong Ng, “Privacy beyond single sensitive attribute,” in Database and Expert Systems Applications, pp. 187–201, Springer, 2011.
  17. J. Liu, J. Luo, and J. Z. Huang, “Rating: privacy preservation for multiple attributes with different sensitivity requirements,” in Proceedings of the 11th IEEE International Conference on Data Mining Workshops (ICDMW '11), pp. 666–673, Vancouver, Canada, December 2011.
  18. J. W. Han and M. Kamber, Data Mining Concepts and Techniques, China Machine Press, Beijing, China, 2011.