Wireless Communications and Mobile Computing

Special Issue

Recent Advances in Security and Privacy Issues for Internet of Things Applications


Research Article | Open Access

Volume 2020 |Article ID 8416823 | https://doi.org/10.1155/2020/8416823

Razaullah Khan, Xiaofeng Tao, Adeel Anjum, Haider Sajjad, Saif ur Rehman Malik, Abid Khan, Fatemeh Amiri, "Privacy Preserving for Multiple Sensitive Attributes against Fingerprint Correlation Attack Satisfying c-Diversity", Wireless Communications and Mobile Computing, vol. 2020, Article ID 8416823, 18 pages, 2020. https://doi.org/10.1155/2020/8416823

Privacy Preserving for Multiple Sensitive Attributes against Fingerprint Correlation Attack Satisfying c-Diversity

Academic Editor: Ghufran Ahmed
Received 06 Nov 2019
Accepted 19 Dec 2019
Published 28 Jan 2020

Abstract

Privacy preserving data publishing (PPDP) refers to the release of anonymized data for research and analysis. A considerable amount of research exists on publishing data with a single sensitive attribute, whereas practical PPDP scenarios with multiple sensitive attributes (MSAs) have not yet attracted much attention. A recently proposed technique, (p, k)-Angelization, provides a novel solution in this regard, using a one-to-one correspondence between the buckets in the generalized table (GT) and the sensitive table (ST). However, we have identified a possibility of privacy leakage through MSA correlation among linkable sensitive buckets, which we name the “fingerprint correlation attack.” To mitigate it, we propose an improved solution, the “(c, k)-anonymization” algorithm. The proposed solution thwarts the attack using some privacy measures and replaces the one-to-one correspondence with a one-to-many correspondence between the buckets in GT and ST, which further reduces the privacy risk while increasing utility in GT. We formally model and analyse the attack and the proposed solution. Experiments on real-world datasets show that the proposed solution outperforms its counterpart.

1. Introduction

Data generation and sharing have increased drastically in the ongoing decade, driven by the growing number of data sources that result from extensive research and the smart revolution (smart grids, cities, devices, etc.). Data researchers use the shared or published data for research and analysis, which may involve data mining, statistical data analysis, and policy making. In the context of health records, the data owners are the individuals to whom the data belong. The hospital that collects, manipulates, and shares the data is known as the data publisher. The data researchers may be a wide range of stakeholders (e.g., pharmaceutical companies, government agencies, and survey organizations). The collected data contain private information (e.g., name, contact number, and social security number), partial identifiers (e.g., age, gender, zipcode, and country), and confidential or sensitive information (e.g., disease) about the data owners. Disclosing such sensitive information to unauthorized parties is a privacy breach and is unlawful.

To ensure the privacy of such information, most of the existing algorithms [1–6] in the literature deal with a single sensitive attribute only. However, a dataset may in practice have multiple sensitive attributes (MSAs) [7–14]. For example, a hospital may publish data with more than one sensitive attribute, such as disease, symptom, and physician, as shown in Table 1. The sensitive nature of healthcare records urges researchers to handle such scenarios and to ensure that the privacy of an individual is not breached.


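Table 1. Original microdata T. Name is the personally identified attribute; Gender, Age, and Zipcode are the quasi-identifier attributes (QIDs); the remaining columns are the multiple sensitive attributes (MSAs).

| Name | Gender | Age | Zipcode | Cancer type | Cancer treatment | Physician | Diagnosis date | Symptom | Diagnostic method |
|---|---|---|---|---|---|---|---|---|---|
| p1 (Michael) | M | 34 | 34548 | Rectal | Surgery | Jack | 7/8/19 | Back pain | Chest X-ray |
| p2 (Lisa) | F | 21 | 34607 | Breast | Inhibitor therapy | Alan | 17/8/19 | Swelling | Ultrasound |
| p3 (Richard) | M | 26 | 37506 | Colon | Surgery | Daisy | 22/11/19 | Back pain | Blood test |
| p4 (Dave) | M | 31 | 34549 | Prostrate | Biologic therapy | Tom | 29/08/19 | Abdominal pain | Chest X-ray |
| p5 (Kate) | F | 35 | 33753 | Prostrate | Radiation | Frank | 9/09/19 | Testis swelling | Blood test |
| p6 (William) | M | 38 | 43674 | Liver | Ablation | Tom | 5/08/19 | Weight loss | CT scan |
| p7 (Robert) | M | 27 | 35064 | Rectal | Medication | Tom | 12/08/19 | Abdominal pain | CT scan |
| p8 (Olivia) | F | 32 | 44662 | Prostrate | Biologic therapy | Jack | 21/09/19 | Back pain | Blood test |
| p9 (Emily) | F | 22 | 34548 | Breast | Biologic therapy | Alan | 13/09/19 | Skin irritation | MRI test |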

In data publication, data utility is a major concern along with privacy, so that researchers can perform research and analysis. Therefore, data should be anonymized in such a way that research analysts can still extract useful information. Balancing privacy and utility in privacy preserving data publishing (PPDP) is an NP-hard problem [15–20]. The scenario in this paper is even more challenging, as we consider the dimensionality in quasi-identifiers (QIs) as well as more than one sensitive attribute, i.e., MSAs.

An adversary or attacker is a person who tries to breach data privacy using different types of background knowledge (bk) about the MSA dataset. The bk includes the fact that certain patterns of values in published data are more likely to be observed than others. For example, this knowledge can be fingerprint correlation (fcorr) knowledge, QI knowledge (qik) [10], or nonmembership knowledge (nmk) [21, 22]. The MSA values in a table that belong to a specific individual form a fingerprint. The fcorr between two k-anonymous [1] groups can increase an adversary's knowledge. The qik is the personally identifiable information (PII) [21] with which an adversary can uniquely identify an individual, and according to nmk, an individual cannot be linked to a specific sensitive value (SV). The (p, k)-Angelization [22] is a strong privacy algorithm for MSAs, where p represents the different sensitivity levels of categorical SAs and k implies the k-anonymous QIs. Tables 2 and 3 are produced by the (p, k)-Angelization algorithm from the original microdata in Table 1. The authors in [22] overcame the problem of the nmk attack, but privacy could still be breached with fcorr, which we name the fcorr attack. The fcorr attack is a comparatively strong privacy attack: if the adversary intends to disclose the privacy of almost every individual, the attack can iteratively breach the privacy of the whole dataset. The privacy breach scenario is explained in detail in Section 1.1.


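Table 2. Generalized table (GT) produced by (p, k)-Angelization.

| Gender | Age | Zipcode | Batch ID |
|---|---|---|---|
| Person | 22–26 | 34548–37506 | 1 |
| Person | 22–26 | 34548–37506 | 1 |
| Person | 31–38 | 34549–44662 | 2 |
| Person | 31–38 | 34549–44662 | 2 |
| Person | 31–38 | 34549–44662 | 2 |
| Person | 21–34 | 34548–34607 | 3 |
| Person | 21–34 | 34548–34607 | 3 |
| Person | 27–35 | 33753–35064 | 4 |
| Person | 27–35 | 33753–35064 | 4 |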


Table 3. Sensitive batch table (SBT) produced by (p, k)-Angelization.

| Physician | Cancer type | Cancer treatment | Diagnosis method | Symptom | Diagnosis date | Batch ID |
|---|---|---|---|---|---|---|
| Daisy, Alan | Colon, Breast | Surgery, Biologic therapy | Blood test, MRI test | Back pain, Skin irritation | 22/11/19, 13/09/19 | 1 |
| Tom, Jack | Prostrate, Liver | Ablation, Biologic therapy | CT scan, Chest X-ray, Blood test | Weight loss, Abdominal pain, Back pain | 5/08/19, 21/09/19, 29/08/19 | 2 |
| Alan, Jack | Rectal, Breast | Inhibitor therapy, Surgery | Chest X-ray, Ultrasound | Back pain, Swelling | 7/8/19, 17/8/19 | 3 |
| Tom, Frank | Prostrate, Rectal | Radiation, Medication | Blood test, CT scan | Testis swelling, Abdominal pain | 12/08/19, 9/09/19 | 4 |

1.1. Motivation

The (p, k)-Angelization [22] algorithm directly adopts the single-SA approach named angelization [23] to implement privacy for MSAs. This approach invalidates (p, k)-Angelization under the fcorr attack; privacy breach scenario I explains the invalidation of [22] in detail. The complexity, lack of utility, and privacy breaches in the SLOMS [24] and SLAMSA [25] techniques have already been demonstrated by (p, k)-Angelization. Although [22] is an efficient solution for utility improvement, an intruder can easily breach the privacy of a record using the bk and his own reasoning. Our work has been motivated by the following limitations in the (p, k)-Angelization algorithm:

(i) Privacy breach scenario I. Suppose an adversary (i.e., David) intends to identify the information of p2 (Lisa) in Table 1. Since they live in the same neighbourhood, David knows her age, gender, and Zipcode (21, F, and 34607). Using these QIs, David identifies her presence in group 3 of the generalized table (GT), i.e., Table 2, and through the batch ID he can access group 3 of the sensitive batch table (SBT), i.e., Table 3. For the (p, k)-Angelization, physician is a maximum weighted attribute (see Section 5.2). A maximum weighted attribute implies high dependency and therefore high privacy risk, so an attack on it can easily breach privacy, and the intruder starts the attack from the physician attribute. The attack is an iterative process that leads to the record identification of the target individual and can eventually identify the complete records in data table T. Since the (p, k)-Angelization blindly follows the angelization [23] mechanism, correlating the MSAs in different buckets may yield a single value for each SA. This is a column-wise vertical correlation between two SA fingerprint buckets (SAFBs) in SBT that have common physicians and other common SA values. The intruder intersects SAFB 3 with the groups that have common physicians and proceeds iteratively until p2 is identified: he intersects SAFB 3 with SAFB 2 because of the common physician Jack, SAFB 2 with SAFB 4 because of Tom, and then SAFB 3 with SAFB 1 because of Alan. Table 4 depicts the identified SVs and hence the disclosed individuals. Although the intruder was interested in identifying only p2, the privacy of p1 and p4 was also breached during the process, which implies that the process can iteratively breach the privacy of the individuals in the complete dataset.


Table 4. Sensitive values identified through the attack.

| Physician | Diagnosis method | Symptoms | Cancer treatment | Cancer type | Privacy breached |
|---|---|---|---|---|---|
| Jack | Chest X-ray (intersection between bucket 2 and 3) | Back pain (intersection between bucket 2 and 3) | | Rectal (lifted after Alan got the breast) | p1 |
| Alan | Ultrasound (after Jack got the chest X-ray between bucket 2 and 3); MRI test (from general knowledge of rays type treatment since in bucket 3 only ray-type methods exist) | Swelling (lifted symptom after Jack got the back pain from intersection between bucket 2 and 3); Back pain (temporary) | Surgery (wrong, but not very important) | Breast (intersection between bucket 1 and 3) | p2 |
| Tom | CT scan, blood test (intersection between bucket 2 and 4) | Abdominal pain (intersection between bucket 2 and 4) | | Prostrate (intersection between bucket 2 and 4) | p4 |
| Frank | | Testis swelling | | | |
| Daisy | | | | Colon (after Alan got the breast) | |

The intruder uses Table 3 (SBT), stores the values obtained at each step in Table 4, and finally identifies all the sensitive information related to p2. In Table 4, the values listed against each physician are obtained by intersecting two SAFBs in Table 3 that are linked through a common physician name. In Table 3, Jack is common to SAFB 3 and SAFB 2, so whatever value David gets from the intersection, he records against Jack in Table 4. First, chest X-ray is the common diagnostic method. The leftover value in SAFB 3, ultrasound, therefore belongs to Alan. In SAFB 2, the two remaining diagnosis values cannot both be assigned to Tom, as Tom may have only one value, so the intruder is not sure at this stage. In the symptom attribute, back pain is common and is stored against Jack. The other symptom value in SAFB 3, “swelling,” definitely belongs to Alan because no other physician or symptom remains in that bucket. Since intersecting cancer treatment and cancer type does not produce any value, the process moves on to SAFB 2 and SAFB 4, which are linked through Tom. For the diagnostic method, the intersection gives CT scan and blood test for Tom and no value for Frank. Although no value is obtained for Frank here, the intruder could refine this by intersecting Frank's bucket with other SAFBs that are not in the current sample dataset. In the symptom column, abdominal pain is assigned to Tom, and the leftover value in SAFB 4, testis swelling, is assigned to Frank. The weight loss and back pain symptoms in SAFB 2 cannot be assigned to either Tom or Jack because the intruder does not yet have enough information. For cancer treatment, there is no common value, while for cancer type, prostrate is assigned to Tom. The last intersection is between SAFB 1 and SAFB 3. Although there is no common value for the diagnostic method, the values in SAFB 3 are only ray-based methods, so an intelligent intruder can easily assign MRI test to Alan.
This may not be the exact value, but it can help to guess or identify the record. For the symptom, swelling is already assigned to Alan; the intersection yields back pain, which is already assigned to Jack, so assigning it to Alan does not fit the intruder's knowledge. For cancer treatment, the common value is surgery, and for cancer type, breast is the only common value. The leftover values are rectal for Jack in SAFB 3 and colon for Daisy in SAFB 1. At the end, as the intruder also knows that p2 is female, the attribute values {breast cancer, swelling, and ultrasound/MRI} easily identify p2. The weighted sensitive attribute values disclosed against the linkable SA thus identify the record of patient p2. Table 4 shows that, during the process, details about patients p1 and p4 are also identified. Some of the information regarding Frank and Daisy is incomplete or incorrect because no further intersection with any other group is possible in the current sample data. Executed iteratively, this process can also identify the MSA values of the remaining patients.

(ii) Need for bucketization. A closer analysis of the (p, k)-Angelization shows that not all the features of angelization [16] are well utilized. Tables 2 (GT) and 3 (SBT) produced by the (p, k)-angelization have a one-to-one correspondence (linking) through the bucket ID (BID). Because of this one-to-one correspondence, the two tables cannot be considered independent, whereas the purpose of angelization was to publish both tables independently. Applying SA diversity may affect the utility in GT. Similarly, increasing the dimensionality of QIs in GT also decreases the utility. After finding the presence of an individual in a bucket in GT, the adversary can easily move from GT to the exact group in SBT, where the fingerprint buckets may help in isolating the sensitive values. In effect, splitting the table into GT and SBT in the (p, k)-angelization serves no purpose.
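The intersection step of privacy breach scenario I can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `SBT` transcribes Table 3 (diagnosis dates omitted), and the helper names `linkable_pairs` and `correlate` are hypothetical, not taken from [22] or from the proposed algorithm.

```python
# SAFBs of Table 3 (SBT), keyed by batch ID. Diagnosis dates are left out for brevity.
SBT = {
    1: {"physician": {"Daisy", "Alan"},
        "cancer_type": {"Colon", "Breast"},
        "treatment": {"Surgery", "Biologic therapy"},
        "diagnosis_method": {"Blood test", "MRI test"},
        "symptom": {"Back pain", "Skin irritation"}},
    2: {"physician": {"Tom", "Jack"},
        "cancer_type": {"Prostrate", "Liver"},
        "treatment": {"Ablation", "Biologic therapy"},
        "diagnosis_method": {"CT scan", "Chest X-ray", "Blood test"},
        "symptom": {"Weight loss", "Abdominal pain", "Back pain"}},
    3: {"physician": {"Alan", "Jack"},
        "cancer_type": {"Rectal", "Breast"},
        "treatment": {"Inhibitor therapy", "Surgery"},
        "diagnosis_method": {"Chest X-ray", "Ultrasound"},
        "symptom": {"Back pain", "Swelling"}},
    4: {"physician": {"Tom", "Frank"},
        "cancer_type": {"Prostrate", "Rectal"},
        "treatment": {"Radiation", "Medication"},
        "diagnosis_method": {"Blood test", "CT scan"},
        "symptom": {"Testis swelling", "Abdominal pain"}},
}


def linkable_pairs(sbt, link_attr="physician"):
    """Pairs of batch IDs whose SAFBs share a value of the linking attribute."""
    ids = sorted(sbt)
    return [(i, j) for pos, i in enumerate(ids) for j in ids[pos + 1:]
            if sbt[i][link_attr] & sbt[j][link_attr]]


def correlate(sbt, i, j):
    """Column-wise intersection of two SAFBs; attributes with no common value are dropped."""
    common = {attr: sbt[i][attr] & sbt[j][attr] for attr in sbt[i]}
    return {attr: values for attr, values in common.items() if values}
```

For example, `correlate(SBT, 3, 2)` returns Jack, chest X-ray, and back pain, which are the values recorded against Jack in Table 4.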

In our proposed (c, k)-anonymization algorithm, the bucketization approach is adopted, which separates the QIs and SAs into two separate tables: generalized table (GT) and sensitive table (ST), independently. Both tables are respectively linked through BID.

GT consists of k-anonymous QI generalized buckets (GBs), from which an adversary cannot get additional information about an individual's privacy. The ST is a bucket table with the MSAs in bucketized form, named sensitive attribute fingerprint buckets (SAFBs). Anatomy [26] and Angel [23] are examples of bucketization for preserving privacy; however, they are applicable to a single sensitive attribute. In this work, we use bucketization for MSAs, which can prevent different types of adversarial attack, e.g., the fcorr attack. Tables 5 and 6 are the GT and ST produced by the proposed (c, k)-anonymization algorithm. In the proposed approach, better privacy is achieved with minimum utility loss. The publisher also need not always publish the data with all QI attributes: publishing the GT with only a few QI attributes, along with ST, is known as marginal publication, an idea introduced in [23]. Bucketization has minimum information loss because the GT and ST are published independently. Between these two tables, the connection is not between the buckets; instead, it is between the records in the generalized buckets and the sensitive buckets.


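Table 5. Generalized table (GT) produced by the proposed (c, k)-anonymization.

| Name | Gender | Age | Zipcode | Bucket ID |
|---|---|---|---|---|
| p2 (Lisa) | F | [21–27] (21) | [34548–37506] (34607) | 1 |
| p9 (Emily) | F | [21–27] (22) | [34548–37506] (34548) | 1 |
| p3 (Richard) | M | [21–27] (26) | [34548–37506] (37506) | 2 |
| p7 (Robert) | M | [21–27] (27) | [34548–37506] (35064) | 3 |
| p8 (Olivia) | | [32–38] (32) | [43674–44662] (44662) | 1 |
| p6 (William) | | [32–38] (38) | [43674–44662] (43674) | 2 |
| p4 (Dave) | | [31–35] (31) | [33753–34549] (34549) | 3 |
| p1 (Michael) | | [31–35] (34) | [33753–34549] (34548) | 1 |
| p5 (Kate) | | [31–35] (35) | [33753–34549] (33753) | 3 |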


Table 6. Sensitive table (ST) produced by the proposed (c, k)-anonymization.

| Physician | Cancer type | Cancer treatment | Diagnosis method | Symptom | Diagnosis date | Bucket ID |
|---|---|---|---|---|---|---|
| Jack, Alan | Rectal, breast, prostrate | Surgery, inhibitor therapy, biologic therapy | Chest X-ray, ultrasound, blood test, MRI test | Back pain, swelling, skin irritation | 7/8/19, 17/8/19, 21/09/19, 13/09/19 | 1 |
| Daisy, Tom | Colon, liver | Surgery, ablation | Blood test, CT scan | Back pain, weight loss | 22/11/19, 5/08/19 | 2 |
| Tom, Frank | Prostrate, rectal | Biologic therapy, radiation, medication | Chest X-ray, blood test, CT scan | Abdominal pain, testis swelling | 29/08/19, 9/09/19, 12/08/19 | 3 |
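How the sensitive table follows from the microdata once every record has a bucket ID can be illustrated with a short Python sketch. It assumes the bucket IDs listed in Table 5 are already given; the assignment procedure itself is not reproduced here, and the names `RECORDS`, `BID`, and `build_st` are hypothetical. Diagnosis dates are omitted for brevity.

```python
from collections import defaultdict

SA_NAMES = ("physician", "cancer_type", "treatment", "diagnosis_method", "symptom")

# Sensitive values of Table 1, in the order given by SA_NAMES.
RECORDS = {
    "p1": ("Jack", "Rectal", "Surgery", "Chest X-ray", "Back pain"),
    "p2": ("Alan", "Breast", "Inhibitor therapy", "Ultrasound", "Swelling"),
    "p3": ("Daisy", "Colon", "Surgery", "Blood test", "Back pain"),
    "p4": ("Tom", "Prostrate", "Biologic therapy", "Chest X-ray", "Abdominal pain"),
    "p5": ("Frank", "Prostrate", "Radiation", "Blood test", "Testis swelling"),
    "p6": ("Tom", "Liver", "Ablation", "CT scan", "Weight loss"),
    "p7": ("Tom", "Rectal", "Medication", "CT scan", "Abdominal pain"),
    "p8": ("Jack", "Prostrate", "Biologic therapy", "Blood test", "Back pain"),
    "p9": ("Alan", "Breast", "Biologic therapy", "MRI test", "Skin irritation"),
}

# Bucket ID of each record, as listed in Table 5.
BID = {"p2": 1, "p9": 1, "p3": 2, "p7": 3, "p8": 1, "p6": 2, "p4": 3, "p1": 1, "p5": 3}


def build_st(records, bid):
    """Collect, per bucket ID, the set of values of every sensitive attribute."""
    st = defaultdict(lambda: {name: set() for name in SA_NAMES})
    for pid, values in records.items():
        for name, value in zip(SA_NAMES, values):
            st[bid[pid]][name].add(value)
    return dict(st)


ST = build_st(RECORDS, BID)
```

With these inputs, `ST[1]["physician"]` contains Jack and Alan, matching the first row of Table 6.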

1.2. Contributions

We propose an efficient solution, (c, k)-anonymization, for privacy preservation in MSAs. In (p, k)-angelization [22], privacy can be breached under the fcorr attack (explained in Section 1.1). The tables published by the proposed (c, k)-anonymization are depicted in Tables 5 and 6. The “Name” attribute in Table 5 is not included when the data are published. The proposed approach also protects against adversaries with nmk and qik. The main contributions are as follows:
(i) We propose an improvement of (p, k)-angelization, named the (c, k)-anonymization algorithm, for MSA privacy. The proposed solution protects against the fcorr attack. To reduce the privacy risk, the real (i.e., one-to-one) linking between GT and ST is transformed into one-to-many (i.e., real and likely) linking.
(ii) We formally model and investigate the invalidation of (p, k)-angelization under the fcorr attack and the correctness of the proposed (c, k)-anonymization algorithm.
(iii) Experimental results show that our proposed approach provides better privacy and utility than its counterpart.

2. Related Work

In this section, we broadly categorize the data privacy models in order to define boundaries of the proposed work in the available literature.

2.1. Data Privacy Models and Methods

Privacy models can be categorized as (i) syntactic (i.e., partition based) or (ii) semantic (i.e., randomized). The syntactic approach achieves privacy at two levels: data clustering and the privacy framework. The k-anonymity [1] model, its extension l-diversity [3], and t-closeness [4] are examples of syntactic data privacy models, in which the final groups are called equivalence classes (ECs). In the semantic approach, random noise is added to the original values. ε-differential privacy [27] is an example of a semantic privacy model. Researchers have proposed both syntactic and semantic privacy models for different types of data, e.g., single sensitive attribute [3, 4, 6], MSAs [7–9], or 1 : m (i.e., one individual having many records) [28] microdata. To preserve privacy, the algorithms in these models follow different approaches, which can be categorized into (i) generalization [1–5, 13–15] (i.e., greedily converting more specialized values into less specialized values), (ii) anatomy [25, 26] (i.e., partitioning the QI and S attributes), and (iii) microaggregation [29, 30] (i.e., partitioning the dataset into clusters in which the QI values of records are replaced with the cluster mean). The work proposed in this paper considers syntactic data privacy, using generalization and anatomy for MSAs.
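The difference between generalization and microaggregation on a numeric QI can be made concrete with a small Python sketch. It is an illustration only: the function names are hypothetical, and the sample group is the pair of ages (26 and 22) of records p3 and p9 from Table 1.

```python
from statistics import mean


def generalize(values):
    """Generalization: replace each numeric QI value by the covering range of its group."""
    return f"{min(values)}-{max(values)}"


def microaggregate(values):
    """Microaggregation: replace each numeric QI value by the mean of its group."""
    return mean(values)


ages = [26, 22]  # ages of p3 and p9 in Table 1
generalized_age = generalize(ages)
aggregated_age = microaggregate(ages)
```

Generalization keeps the spread of the group as a range, whereas microaggregation publishes one representative number per cluster.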

2.2. Syntactic Anonymization Literature for Multiple Sensitive Attributes

A plethora of research contributions related to MSA privacy [7–14, 16–18, 22, 24, 25, 31–38] exists. A recent work, anatomization with slicing [25], is an effective technique for MSAs. Because it does not generalize the QI attributes, it enhances utility, but it publishes many tables, which makes the solution more complex. To prevent proximity breach, the authors in [7] adopted the multi-sensitive bucketization (MSB) technique using clustering. However, it is applicable to numerical data only. The (…, l) model [8] for a single sensitive attribute satisfies the privacy requirements for MSAs. The authors in [31] prevent the negative and positive disclosure of associations between MSAs. In [33], a rating of MSAs was proposed that fulfils the privacy requirements. However, the inherent relationship between the SAs can cause an association rule attack, in which an adversary uses related bk to breach privacy. The authors in [32] protected the data from the association attack and removed the weakness of the rating algorithm. In [37, 38], the authors perform vertical partitioning (i.e., anatomy) and implement decomposition and decomposition plus, respectively, to achieve l-diversity for MSAs. Decomposition plus [38] optimizes the noise value selection in [37] and keeps it closer to the original. The possibility of skewness and similarity attacks in [4, 39] was eliminated by the p+ sensitive t-closeness model [40], which combines the good features of the p-sensitive k-anonymity [39] and t-closeness [4] approaches.

ANGELMS (anatomy with generalization for MSA) [34] vertically partitions the dataset into the QIs table and several SAs tables satisfying the k-anonymity [1] and l-diversity [3] principles, but still it can be attacked with similarity, skewness, and sensitivity attacks. In [16, 18], the KC Slice for dynamic data publishing of MSAs integrates the features of KC-privacy and slicing techniques. The authors have presented the method for a single release, and no studies for multiple releases are available to prove the dynamic claim. In [35, 36], MSAs were handled for achieving privacy but the l-diversity [3] principle was directly adopted that caused huge information loss. The attack was prevented in [41] but still caused high information loss due to the grouping conditions over the data and vulnerability to the background join attack.

The proposed work categorizes the sensitivity of MSAs as top secret, secret, less secret, and nonsecret. c-diverse fingerprint buckets are created that contain records from different categories. The QI values of the created fingerprint buckets are bottom-up generalized through k-anonymity [1].

3. Preliminaries

Let table T (as shown in Table 1) be the private data that a publisher intends to publish. Each tuple in T represents an individual or record respondent i. The components of a tuple are the explicit identifier attributes EI (also called identifying attributes), the quasi-identifier attributes QI (also called partial identifiers), and the sensitive information attributes S. QIs are partial identifiers, or personally identifiable information (PII), that can identify an individual i if linked with external data, e.g., voting or census data. Data privacy is all about protecting the sensitive information, i.e., the confidential and private information belonging to an individual. In this work, we consider the challenging scenario of more than one sensitive attribute for a single individual, named multiple sensitive attributes (MSAs). The notations used in the paper are shown in Table 7.


Table 7. Notations.

| Symbol | Description |
|---|---|
| T | Original microdata |
| T’ | Anonymized form of T |
| EI | Explicit identifiers in T |
| QI | Quasi-identifiers in T |
| S | Sensitive information in T |
| GT | Generalized table |
| ST | Sensitive table |
| BID | Bucket identifier |
| ED | External dataset |
| GB | Generalized bucket |
| SAFB | Sensitive attribute fingerprint bucket |
| EC | Equivalence class |
| HLPN | High-level petri nets |
| CtgT | Category table for SAs |
| nmk | Nonmembership knowledge |
| fcorr | Fingerprint correlation knowledge |
| | Null or zero |
| | External factor |
| | Weight of dependent |
| | Maximum weighted attributes |
| | Minimum weighted attributes |
| C | Category level (diversity) of SAs in the category table |
| | Linkability or linking between SAFBs |
| | Linkability control factor |
| k | k-anonymous level |
| | Single maximum weighted attribute after … |

Definition 1 (MSA fingerprint [22]). The MSA values in table T that belong to a specific individual form a fingerprint, known as the MSA fingerprint.

Definition 2 (sensitive attribute fingerprint bucket (SAFB)). A sensitive attribute partitioning of the microdata T consists of a list of SAFBs that satisfies the following conditions:
(i) Each FB consists of two columns: BID and MSA values.
(ii) …, and for any …, … among linkable buckets through maximum weighted sensitive attributes (see Section 5).
(iii) Each SAFB fulfils c-diversity from the category table (Table 8). The subscript i of an SAFB is the BID of that bucket.


Table 8. Category table.

| Category ID | Physician | Cancer type | Cancer treatment | Diagnosis method | Symptom | Diagnosis date | Sensitivity level |
|---|---|---|---|---|---|---|---|
| One | Daisy, Frank | Colon, prostrate | Surgery, radiation | Blood test | Back pain, testis swelling | 22/11/19, 9/09/19 | Top secret |
| Two | Jack | Rectal, prostrate | Surgery, biologic therapy | Chest X-ray, blood test | Back pain | 7/8/19, 21/09/19 | Secret |
| Three | Alan | Breast | Inhibitor therapy, biologic therapy | Ultrasound, MRI test | Swelling, skin irritation | 17/8/19, 5/08/19, 12/08/19 | Less secret |
| Four | Tom | Prostrate, liver, rectal | Biologic therapy, ablation, surgery | Chest X-ray, CT-scan | Weight loss, abdominal pain | 29/08/19, 5/08/19, 12/08/19 | Nonsecret |
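A c-diversity check against the category table can be sketched as follows. The sketch rests on an assumption that is not stated explicitly in the text: the category of a record is looked up through its physician, the maximum weighted attribute. The physician sets are those of the three buckets in Table 6, and all names in the code are hypothetical.

```python
# Category of each physician according to Table 8 (assumed lookup key: physician).
CATEGORY_OF_PHYSICIAN = {"Daisy": 1, "Frank": 1, "Jack": 2, "Alan": 3, "Tom": 4}

# Physicians appearing in each sensitive bucket of Table 6, keyed by bucket ID.
ST_PHYSICIANS = {1: {"Jack", "Alan"}, 2: {"Daisy", "Tom"}, 3: {"Tom", "Frank"}}


def diversity(physicians):
    """Number of distinct categories represented in one bucket."""
    return len({CATEGORY_OF_PHYSICIAN[p] for p in physicians})


def is_c_diverse(st_physicians, c):
    """True if every bucket contains records from at least c categories."""
    return all(diversity(physicians) >= c for physicians in st_physicians.values())
```

Under this assumption, every bucket of Table 6 draws on two categories, so the table is 2-diverse but not 3-diverse.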

Definition 3 (generalized bucket (GB)). A generalized bucket partitioning from an EC of the microdata T consists of buckets $GB_i$ such that
(i) Each $GB_i$ is the set of tuples with only the QI attributes of T and the BID from the FB.
(ii) $\bigcup_i GB_i = T$, and for any $i \neq j$, $GB_i \cap GB_j = \emptyset$.
(iii) Each generalized bucket fulfils the k-anonymity principle.
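The k-anonymity condition on generalized buckets can be checked with a short Python sketch. It uses the generalized age and zipcode ranges of Table 5; gender is left out as a simplifying assumption because Table 5 does not list it for every record. The names `GT_ROWS` and `anonymity_level` are hypothetical.

```python
from collections import Counter

# Generalized (age range, zipcode range) of each record in Table 5.
GT_ROWS = (
    [("21-27", "34548-37506")] * 4
    + [("32-38", "43674-44662")] * 2
    + [("31-35", "33753-34549")] * 3
)


def anonymity_level(rows):
    """Size of the smallest group of records sharing the same generalized QI values."""
    return min(Counter(rows).values())


def is_k_anonymous(rows, k):
    """True if every generalized bucket holds at least k records."""
    return anonymity_level(rows) >= k
```

For Table 5 the smallest generalized bucket holds two records, so the table is 2-anonymous under this simplified view.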

3.1. Adversarial Model

In literature, an adversary is the attacker, who intends to breach the privacy and has different types of knowledge known as bk. Data correlation is an important type of adversary knowledge that breaches the privacy. The data correlation can be attribute correlation that exists among two or more attributes, e.g., [42], or row correlation between two or more rows, e.g., [43]. This paper is related to row correlation and more specifically FB correlation. This work focuses on reducing the threat exposed by FB correlation linked through the high-weighted SA value. Each FB contains few fingerprints that belong to k individuals in a specific GB inside GT. The adversary uniquely identifies an individual from the fingerprint correlation knowledge which has direct correspondence with the QI values.

Definition 4 (nonmembership knowledge [22]). If an adversary knows that an individual i in a GB cannot be linked to a specific SV in an FB, this is known as nmk.

Definition 5 (fingerprint correlation knowledge (fcorr)). The MSA values obtained from correlating (intersecting) two linkable FBs can be assigned to a specific individual.
Based on the available information, we consider that the adversary's bk consists of GT, ST, and ED, where
(i) GT is the generalized table
(ii) ST is the sensitive table
(iii) ED is any publicly available external dataset
The adversary applies this bk to the available anonymized data to perform an attack and breach an individual's privacy.

Definition 6 (fingerprint correlation attack). An adversary with known QI values and fcorr is able to perform the fcorr attack by deducing single SVs from the intersection of FBs that are linked via the maximum weighted sensitive attribute. The attack can be
(i) Partial attack: a few of the SAs from the fingerprints in two or more FBs may produce unique SVs
(ii) Full attack: all of the SAs from the fingerprints in two or more FBs may produce unique SVs
A full attack lets the adversary uniquely identify an individual without doubt, while a partial attack can identify a record respondent if the resulting sensitive information belongs to attributes above the minimum weighted attributes (see the algorithm in Section 5.1). The minimum weighted attributes do not contribute to the identification of an individual record.
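The partial/full distinction can be illustrated with a small classifier over two linkable FBs. This sketch interprets "unique SV" as an intersection that contains exactly one value, which is an assumption; the function name is hypothetical. `SAFB_2` and `SAFB_3` transcribe Table 3 (dates omitted), while `TOY_A` and `TOY_B` are made-up buckets used only to show a full attack.

```python
def classify_attack(fb_a, fb_b):
    """Return 'full', 'partial', or 'none' for two linkable fingerprint buckets."""
    singles = [attr for attr in fb_a if len(fb_a[attr] & fb_b[attr]) == 1]
    if len(singles) == len(fb_a):
        return "full"
    return "partial" if singles else "none"


SAFB_2 = {"physician": {"Tom", "Jack"},
          "cancer_type": {"Prostrate", "Liver"},
          "treatment": {"Ablation", "Biologic therapy"},
          "diagnosis_method": {"CT scan", "Chest X-ray", "Blood test"},
          "symptom": {"Weight loss", "Abdominal pain", "Back pain"}}
SAFB_3 = {"physician": {"Alan", "Jack"},
          "cancer_type": {"Rectal", "Breast"},
          "treatment": {"Inhibitor therapy", "Surgery"},
          "diagnosis_method": {"Chest X-ray", "Ultrasound"},
          "symptom": {"Back pain", "Swelling"}}

# Made-up buckets: every attribute shares exactly one value.
TOY_A = {"physician": {"A", "B"}, "symptom": {"s1", "s2"}}
TOY_B = {"physician": {"A", "C"}, "symptom": {"s1", "s3"}}
```

Under this interpretation, SAFB 2 and SAFB 3 of Table 3 yield a partial attack, because only physician, diagnosis method, and symptom intersect in a single value.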

Definition 7 (high-level petri-nets (HLPNs) [44]). A petri-net is used as a model to examine the control of information in a system. An HLPN formally analyses the system through its mathematical properties. An HLPN is a 7-tuple $N = (P, T, F, \varphi, R, L, M_0)$, where P is a set of places represented by circles, T is a set of transitions represented by rectangular boxes such that $P \cap T = \emptyset$, F is the flow relation such that $F \subseteq (P \times T) \cup (T \times P)$, L are the labels on F, $\varphi$ maps places to types, R represents the rules for transitions, and $M_0$ is the initial marking. In short, L, $\varphi$, and R represent the static semantics, whereas P, T, and F depict the dynamic structure.

4. Critical Review for (p, k)-Angelization with Attack Identification Using Formal Modelling and Analysis

Definition 8 ((p, k)-Angelization [22]). A pair of bucket partitioning and batch partitioning of table T produces two tables GT and SBT such that
(i) GT consists of QI attributes with batch ID (BID) belonging to table T. The QI values from t records are k-anonymously grouped and linked with SBT via BID.
(ii) SBT consists of (SA, BID), where BID is the batch index i and SA are the MSAs from table T.
(iii) Batch partitioning satisfies (p, k)-anonymity [16], where every bucket has records from p categories and has k tuples to prevent linking attacks, while the group partitioning satisfies (p, k)-anonymity [23].
The following reasons explain the invalidation of (p, k)-angelization [22]:
(i) The invalidation of the existing (p, k)-angelization is due to the fcorr that causes the fcorr attack (as shown in Table 4). Although the records from p categories are k-indistinguishable on the MSA fingerprint, they are uniquely distinguishable because of the unique SAFB values obtained from linkable buckets. So, Lemma 2 in [22] is incorrect. Its corrected form is given in Lemma 1 (Section 5).
(ii) Invalidation of theorem 2 in [15]: the adversary can correlate the sensitive information from qik. Scenario I explains the attack that extracts unique sensitive information using QI values.
Now, we formally model the (p, k)-angelization algorithm to check its invalidation with respect to the fcorr attack. The (p, k)-angelization algorithm is depicted with an HLPN and formally analysed through its mathematical properties. The purpose of using HLPNs is to depict (i) the interconnection of the model components and processes, (ii) a clear flow of data among the processes, and (iii) an in-depth insight into how the information is processed, in order to isolate the flaw in (p, k)-angelization. Figure 1 depicts the HLPN for (p, k)-angelization. The variable types and the mapping of data types on places are shown in Tables 9 and 10, respectively.
The adversarial model in Figure 1 comprises three entities: end user, trusted data sanitizer, and adversary. The initial transition, referred to as the input transition, contains the raw data (e.g., patients' EHRs) collected from a health organization. The trusted data sanitizer anonymizes the data using the (p, k)-angelization algorithm and produces the GT (Table 2) and SBT (Table 3). The produced tables are ready to be published and are exploited by the adversary through the fcorr attack shown in Table 4.
Rule 1 checks for the existence and number of dependent SAs with respect to another SA. Rule 2 counts the dependent attributes and selects the maximum weighted attributes. However, if more than one maximum exists in the weight set, then an external factor is added to one of them, based on some external facts. The weight set is sorted in descending order to select the maximum weight, as in rules 3 and 4. Based on the weight calculation and the MSAs in T, the category table is formed in rule 5.
The problem arises from rule 6 onward. The (p, k)-angelization [22] blindly follows the basic angelization [23] mechanism. According to rule 6, the data in table T, based on the category table (Table 8), use angelization to create GT and SBT. In [23], fcorr between two SA buckets does not exist because there is a single SA. When handling MSAs, the basic angelization is not applicable without proper measures. Rule 7 shows the fcorr attack that breaches the privacy of an individual. In rule 7, the known QI values in a specific GB in GT have exactly one corresponding FB in SBT, to which the fcorr attack is applied to disclose unique values from the MSA FBs.
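The prose description of rules 1 to 4 (count dependent SAs, break a tie at the top with an external factor, sort in descending order) can be sketched in Python. Everything concrete here is an assumption made for illustration: the dependency sets in `DEPENDENTS`, the value 0.5 of the external factor, and the function name `rank_by_weight` are hypothetical and are not taken from [22].

```python
def rank_by_weight(dependents, favoured, external_factor=0.5):
    """Weight each SA by its number of dependent SAs and sort in descending order.

    If several SAs share the maximum weight, the external factor is added to
    the favoured one (or to the first tied SA if the favoured one is not tied).
    """
    weights = {sa: float(len(deps)) for sa, deps in dependents.items()}
    top = max(weights.values())
    tied = [sa for sa, weight in weights.items() if weight == top]
    if len(tied) > 1:
        chosen = favoured if favoured in tied else tied[0]
        weights[chosen] += external_factor
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Hypothetical dependency sets, made up for this example only.
DEPENDENTS = {
    "physician": {"cancer_type", "treatment", "diagnosis_method"},
    "cancer_type": {"treatment", "symptom", "diagnosis_method"},
    "treatment": {"symptom"},
    "symptom": set(),
}

RANKED = rank_by_weight(DEPENDENTS, favoured="physician")
```

In this made-up input, physician and cancer type tie at three dependents, and the external factor makes physician the single maximum weighted attribute.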


Table 9. Variable types.

| Types | Descriptions |
|---|---|
| T | Place holding integer and string type data |
| PID | An integer identifying patient id |
| | A set consists of dependent sensitive attributes |
| | An integer defines final calculated weights of … |
| | Set containing sorted weight of SA in descending |
| | Maximum weight after adding external factor |
| C | Category Id for MSA |
| BID | Batch id used to connect GT and SBT |
| | Quasi-identifier for ith end user |
| | Multiple sensitive fingerprint values for ith end user |


PlacesMapping

() ()
()
() ()
(C)
(QI 
(MSF 
(QI 
(MSF 
(QI 

5. Proposed (c, k)-Anonymization for Multiple Sensitive Attributes

Although the (p, k)-angelization model is a state-of-the-art approach for MSAs, especially for the categorization of sensitive values, its ST still lacks privacy because it blindly applies the same angelization [23] approach to MSAs. It leads to fcorr attack, i.e., attack or attack. We name the improved form of (p, k)-angelization (c, k)-anonymization for MSAs; it is described as follows:

Definition 9 ((c, k)-anonymization). A pair of generalized bucket partitioning = {} and sensitive attribute fingerprint bucket partitioning = {} of table T produces two tables GT and ST such that
(i) GT consists of QI attributes buckets () with bucket id (BID) belonging to table T. The QI values belonging to t records are k-anonymously grouped and linked with corresponding sensitive buckets in ST via BID. And , and for any , .
(ii) ST consists of ( and BID), where BID is i () and FB are the MSAs in buckets from table T. And , for any , and among linkable () buckets through maximum weighted sensitive attributes ().
(iii) Generalized bucket partitioning satisfies definition 3 and sensitive bucket partitioning satisfies definition 2 based on the category table CtgT (Table 8). Every generalized bucket has k-tuples to prevent against linking attack. The sensitive partitioning has c-diverse records from c categories in CtgT such that
(a) for strict approach, fulfil equation (4)
(b) for relax approach, fulfil equation (5)

Lemma 1 (uncertainty in SAFBs). If for T having the MSAs dataset, the anonymized form T’ satisfies (c, k)-anonymization, then T’ satisfies (c, k)-diversity for MSA fingerprints.

Proof. Let sbp be the random sensitive bucket partitioning in T’. There must be at least t tuples from c categories in sbp such that or the , where are the minimum weighted SAs having the lowest dependency (see algorithm 5.1). So the c categorized records are indistinguishable on the MSA fingerprint by k records. Thus, sbp satisfies the definition of (c, k)-diversity and uncertainty in SAFBs is maximized.

5.1. Privacy Risk: Feasibility of the Proposed Work

The presence attack and fcorr attack are the two possible attacks that breach the privacy of GT and ST, respectively. The adversary can breach an individual’s privacy by linking the obtained sensitive information with the QI values and with the bk. Every QI record has a corresponding SA fingerprint in a specific FB, as depicted in Tables 2 and 3. The (p, k)-angelization has the one-to-one real linking through BID. The real linking has high privacy leakage and 100% chances of presence attack. Another type of linking can be the likely linking between GT and ST where all BID linking are not real. Relating the real and likely linking, the privacy risk for an individual is defined as , where r are the real linking and n are the total number of likely linking. The likely linking for a specific size of EC varies, and it depends on the QI values in GT that have linking with certain FBs. For preventing privacy leakage , where l represents the l-diversity [3]. Every FB has c-diverse fingerprints that correspond to at least k individuals. In our proposed (c, k)-anonymization, for c = 1, implies l = 2; for c = 2, implies ; for c = 3, implies , and so on.

Consider the privacy risk for the given Tables 5 and 6 processed by proposed (c, k)-anonymization. In GT Table 5, for c = 2, there are three different sizes of ECs that have varying l-diversity (ranging ) in ST Table 2. For EC size 4, and ; so, , which is a very low privacy risk. Similarly, for EC size 2, and ; so, . For the last EC size 3, and ; so, . Even in case when in an EC, the minimum distance QI values link to a single FB in ST, for example, when EC size is 2, then and so , and the probability of privacy disclosure is the diversity in the FB. In certain cases, the higher utility in QI values may further increase the likely linking which will more reduce the privacy risk. This proves that the proposed approach has very low privacy risk for the presence attack and data disclosure.
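The relation between real and likely linkings can be illustrated with a minimal sketch. It assumes that the privacy risk is the ratio r/n of real linkings to likely linkings; the function name `privacy_risk` and the sample counts are hypothetical and are not taken from Tables 5 and 6.

```python
from fractions import Fraction

def privacy_risk(real_links: int, likely_links: int) -> Fraction:
    """Assumed risk measure: real linkings r divided by likely linkings n."""
    if likely_links <= 0 or not 0 <= real_links <= likely_links:
        raise ValueError("expected n > 0 and 0 <= r <= n")
    return Fraction(real_links, likely_links)

# One-to-one correspondence: the only likely link is the real one.
one_to_one = privacy_risk(1, 1)

# One-to-many correspondence: one real link among four likely links
# (hypothetical count, for illustration only).
one_to_many = privacy_risk(1, 4)

print(one_to_one, one_to_many)
```

Under this assumption, adding likely linkings for the same number of real linkings can only lower the ratio, which is the effect the one-to-many correspondence aims for.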

5.2. (c, k)-Anonymization Algorithm

The objective of the proposed (c, k)-anonymization algorithm is to provide a sustainable privacy for MSAs. The algorithm takes a microdata table T (Table 1) as input and produces two anonymized tables, i.e., GT (Table 5) and ST (Table 6).

The proposed Algorithm 1 performs two major functions: categorizing the MSAs based on calculated weights and creating secure FBs for the whole dataset. For SA categorization, the algorithm calculates weights for all the MSAs in the dataset to determine the dependency of the SAs. The dependency shows the sensitivity level of the SAs. These weights are sorted in descending order to obtain the categorized MSAs. The weight calculation for the MSAs creates the category table (CtgT), which helps to create l-diverse (c-diverse with respect to category) FBs. The FBs must satisfy equation (4) or (5) in order to prevent the fcorr attack. Therefore, FBs that do not satisfy equation (4) or (5) are refined until they do. Refinement is needed because the input data may vary in nature and may contain SVs that cannot be grouped initially. The complete algorithm (Algorithm 1) and its working are explained in the following:

Input:
T: Microdata table = {ID, QI, S}
χ: External Factor
Output:
GData: Generalized Table-GT
SAFB: Sensitive Table-ST
  
  SAFB

Sensitive attributes weight calculation. In Algorithm 1, a calling function , shown in Function 1, calculates weights for all the MSAs from Table 1. There are six different types of SAs, and each has its own level of sensitivity or sensitivity weights. Let W = {} is the set of weights for each SA such that has weight , has weight , and so on. SA weights are the dependency on all other SAs. Similar to [22] the weight is calculated in the following equation:where m are the total number of attributes dependent on attribute . The dependency of an SA is determined by the total range of attributes identifying the SAs. The for loop in the beginning calculates the sensitivity for with all other SAs, i.e., . This determines the sensitivity level for all the SAs. To calculate the weights (second for loop), cardinality checking is performed of all dependent attributes. Calculated weights for the SAs that exist in microdata are shown in Table 11.

Input:
s: sensitive attribute (SA)
: weight for a SA
 Dep: SA dependency
: height of dependency (no. of dependent SAs)
Output: list of weights for all SA, i.e.,
  
  

Sensitive attributes | Identified by | Dependency | Weightage
--- | --- | --- | ---
s1: cancer type | s2, s3, s6 | 3 | 3
s2: cancer treatment | s1 | 1 | 1
s3: physician | s1, s2, s6 | 3 | 3
s4: diagnostic date | | 0 | 0
s5: symptoms | | 0 | 0
s6: diagnostic method | s1 | 1 | 1
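The weights in Table 11 can be reproduced with a short sketch. It assumes that the weight of an SA equals the number of distinct attributes that identify it, which matches the Dependency and Weightage columns of the table; the dictionary layout and the name `calculate_weights` are illustrative and are not the paper's Function 1.

```python
# "Identified by" column of Table 11, keyed by sensitive attribute.
identified_by = {
    "s1": ["s2", "s3", "s6"],  # cancer type
    "s2": ["s1"],              # cancer treatment
    "s3": ["s1", "s2", "s6"],  # physician
    "s4": [],                  # diagnostic date
    "s5": [],                  # symptoms
    "s6": ["s1"],              # diagnostic method
}

def calculate_weights(identified_by):
    """Assumed weight: number of distinct attributes identifying each SA."""
    return {sa: len(set(deps)) for sa, deps in identified_by.items()}

weights = calculate_weights(identified_by)
print(weights)
```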

Maximum weighted sensitive attributes selection. From the calculated dependencies through Function 1, maximum weighted attributes are selected. Maximum weights mean high dependency leads to high disclosure risk. So attributes set with needs maximum protection. Although there are very rare chances, the problem arises if there exists more than one SAs having the same . Then, an external factor is added to select only one maximum weighted attribute. The algorithm adds an external factor , i.e., to each attribute in set , and the weights are then sorted in the descending order to get one single maximum weighted SA () (can be seen in Algorithm 1).

Categorizing sensitive attributes. The attribute occurring in the first location of the set is selected as . The calculated weights for the MSAs, in descending order, are categorized through the categorize() function as top secret, secret, less secret, and nonsecret, in Table 8. From the weight calculations, although disease and physician have equal weights, we select physician as the maximum weighted SA () because of some other external factors . For example, physician information may be publicly available on the Internet.
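The tie between disease and physician can be resolved as in the following sketch. The weights are those of Table 11; the value 0.5 for the external factor, the per-attribute dictionary `external_factor`, and the name `select_max_weighted` are assumptions made only for illustration.

```python
def select_max_weighted(weights, external_factor):
    """Add the external factor, sort in descending order, return the top SA.

    `external_factor` maps an SA to an additive bonus; SAs without an
    entry receive no bonus (an assumption of this sketch).
    """
    adjusted = {sa: w + external_factor.get(sa, 0.0) for sa, w in weights.items()}
    ranked = sorted(adjusted, key=adjusted.get, reverse=True)
    return ranked[0], ranked

weights = {"s1": 3, "s2": 1, "s3": 3, "s4": 0, "s5": 0, "s6": 1}
external_factor = {"s3": 0.5}  # hypothetical bonus: physician data is public

s_mw, ranked = select_max_weighted(weights, external_factor)
print(s_mw, ranked)
```

Because only the tied attribute that external knowledge makes riskier receives the bonus, the ordering of all other attributes is left unchanged.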

Creating FBs. FBs consist of MSAs with the BIDs column at the time of anonymization. Function 2, function , shows the whole process for creating c-diverse FBs. FBs are created mainly based on (as shown in Table 8). Let and are the total number of records in the microdata table T. To create a 2-anonymous 2-diverse FB, i.e., k = 2, l = 2 (i.e., c = 1), for example, and are two different records in T such that and , where and . The union of these two different records from different sensitive categories creates an FB, as shown in the following equation:
where and . Records are selected from different categories to create an FB in order to implement the l-diversity principle, in the form of c-diversity in our case. Privacy in data is all about creating an EC that prevents an intruder from breaching any of its sensitive contents. Unlike [22], we focus on creating an FB that satisfies c-diversity and prevents any attack from the adversary, e.g., attack. The attack is prevented by taking two measures: (i) minimizing the linkability or linking between two FBs and (ii) uncorrelating the records or maximizing the uncertainty between the FBs (explained in Refine FB).

Input:
 = , all in the whole dataset T
r: source record to create FB, it can be or ,
: no. of records in the actual dataset T
: linkability control factor
k: k-anonymity level (minimum FB size)
: any category of SA in category Table 8
: any category of SA in category Table 8
Output: -anonymous records in each
 = {}
  
  
   =  //if e.g.  = 2, will select max 2 //records having same physician
   = distinct , //where and
   = 
  
  
  

Patients records (i.e., SVs) are high in number as compared to the physician attribute and both can normally be correlated or linked with one another. This provides a reason for an attack. Therefore, to minimize the , we use the linkability control factor , (). minimizes the repetition of the same value in different FBs. The table (Table 6) published by the proposed algorithm has , which means that a maximum of two records for a single physician can exist in an FB. This brings the existing values to minimum FBs and is reduced. So, the chances of the attack on possible FB are ultimately reduced (Table 3 has 3 FBs while Table 6 has only 1 FB). The high value for further reduces but increases information loss, so a balance should be maintained between and utility preservation.
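The effect of grouping records of the same physician into fewer FBs can be measured with a small sketch. The records, physician names, and the helper `buckets_per_value` are hypothetical; the sketch only counts in how many FBs each physician value occurs and does not reproduce Function 2.

```python
from collections import defaultdict

def buckets_per_value(fbs, attr):
    """Map each value of `attr` to the set of FB ids in which it occurs."""
    spread = defaultdict(set)
    for bid, records in fbs.items():
        for rec in records:
            spread[rec[attr]].add(bid)
    return dict(spread)

# Hypothetical layout without linkability control: John appears in two FBs.
scattered = {
    1: [{"physician": "John", "disease": "lung cancer"},
        {"physician": "Alice", "disease": "skin cancer"}],
    2: [{"physician": "John", "disease": "colon cancer"},
        {"physician": "Bob", "disease": "breast cancer"}],
}

# Hypothetical layout where up to two records of one physician share an FB.
grouped = {
    1: [{"physician": "John", "disease": "lung cancer"},
        {"physician": "John", "disease": "colon cancer"}],
    2: [{"physician": "Alice", "disease": "skin cancer"},
        {"physician": "Bob", "disease": "breast cancer"}],
}

print(len(buckets_per_value(scattered, "physician")["John"]))
print(len(buckets_per_value(grouped, "physician")["John"]))
```

A value that occurs in a single FB offers no second FB to intersect with, so there is nothing for a correlation across buckets to exploit.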

Refining FBs. FBs are refined through the function Refine , as depicted in Function 3. The attack between any two FBs can breach the privacy. The purpose is to completely avoid the correlation. A percentage of record that discloses from intersection between FBs can be associated to a specific individual. For example, any percentage of record obtained from equation (3) that results in a single value for each SA correlation is a privacy breach and is not acceptable, especially for high weighted attributes.where and are via . The high percentage disclosure infers a high intruder confidence for privacy breach and vice versa. The decreasing order of is the decrease in probability of privacy breach, among the n FBs. The measure to prevent against the attack is no or minimum data expose from intersection. The refining process in Function 3 for FBs works under two approaches, i.e., strict and relax. Strict approach is given in the following equation:where and FBs are through , i.e., , for example, they are through the physician and n are the total number of FBs where the same physician exists. The intersection in this case ensures that should be zero or have more than one SVs in common to create uncertainty for single SA.
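The strict condition described above can be checked as in the following sketch. It assumes that two FBs linked via the same physician pass the strict test when, for every other SA, the values they share are either none or more than one; the record contents and the names `common_values` and `satisfies_strict` are hypothetical.

```python
def common_values(fb_a, fb_b, attr):
    """Values of `attr` that occur in both FBs."""
    return {r[attr] for r in fb_a} & {r[attr] for r in fb_b}

def satisfies_strict(fb_a, fb_b, other_attrs):
    """Assumed strict test: no SA may intersect in exactly one value."""
    return all(len(common_values(fb_a, fb_b, a)) != 1 for a in other_attrs)

fb1 = [{"physician": "John", "disease": "lung cancer", "treatment": "chemotherapy"},
       {"physician": "Alice", "disease": "skin cancer", "treatment": "surgery"}]
fb2 = [{"physician": "John", "disease": "lung cancer", "treatment": "radiation"},
       {"physician": "Bob", "disease": "colon cancer", "treatment": "immunotherapy"}]
fb3 = [{"physician": "John", "disease": "breast cancer", "treatment": "radiation"},
       {"physician": "Bob", "disease": "colon cancer", "treatment": "immunotherapy"}]

other_attrs = ["disease", "treatment"]  # SAs other than the linking attribute
print(satisfies_strict(fb1, fb2, other_attrs))  # single shared disease
print(satisfies_strict(fb1, fb3, other_attrs))  # nothing shared
```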

Input:
 = {}: a set of FB that are linked via the same
k: minimum k-anonymity level
Output: set of refined FB = {}, linked via same .
 [ // Strict ECs or
OR ] // Relax ECs
  
  
   // where
   //, merge the FBs, dissolved
end while

In case of the worst dataset, there may be some of the records that do not fit in any FB via the strict approach. A relax strategy is adopted with no breach in privacy. The percentage of record exposure from is minimized to an acceptable value. In the worst case scenario, the proposed (c, k)-anonymization maintains where are the minimum weighted attributes that have no dependency. Relax strategy is given in the following equation:where and FBs are through , i.e.,, for example, they are through physician and n are the total number of FBs where the same physician exists. According to equation (5), only is acceptable which is a percentage of information leakage but not a privacy breach because of not dependent attributes and hence impossible for an intruder to link with other SA or QI of a specific patient record. The working of RefineFB() function for equation (5) is the same as it is for equation (4).
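The relaxed condition can be sketched in the same spirit. It assumes that a single shared value between two linked FBs is tolerated only for minimum weighted attributes, i.e., those with no dependency (weight 0 in Table 11); the records and the name `satisfies_relaxed` are hypothetical.

```python
def satisfies_relaxed(fb_a, fb_b, other_attrs, min_weight_attrs):
    """Assumed relaxed test: a single common value is acceptable only
    for attributes listed in `min_weight_attrs`."""
    for attr in other_attrs:
        common = {r[attr] for r in fb_a} & {r[attr] for r in fb_b}
        if len(common) == 1 and attr not in min_weight_attrs:
            return False
    return True

fb_a = [{"disease": "lung cancer", "symptoms": "cough"},
        {"disease": "skin cancer", "symptoms": "rash"}]
fb_b = [{"disease": "colon cancer", "symptoms": "cough"},
        {"disease": "breast cancer", "symptoms": "lump"}]
fb_c = [{"disease": "lung cancer", "symptoms": "fatigue"},
        {"disease": "breast cancer", "symptoms": "lump"}]

attrs = ["disease", "symptoms"]
min_weight_attrs = {"symptoms"}  # weight 0 in Table 11

print(satisfies_relaxed(fb_a, fb_b, attrs, min_weight_attrs))  # only "cough" shared
print(satisfies_relaxed(fb_a, fb_c, attrs, min_weight_attrs))  # single disease shared
```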

Generalization. Function 4 deals with QI attributes with BIDs that correspond to unique FBs in ST. Initially, the records are sorted by QIs. Then, every individual record is generalized to achieve k-anonymous EC. For generalizing the tuple , the t QI = {,} is generalized to =([ − ], [ − ],…, [ − ]), where and , are the close boundaries for . The Generalization() function in Function 4 shows the generalization process.
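The generalization step can be illustrated with a minimal sketch that sorts the records by their QI values, forms groups of at least k records, and replaces each numeric QI value by the closed range of its group. Forming groups by fixed-size chunking, the sample records, and the name `generalize` are assumptions of this sketch; BID assignment is omitted.

```python
def generalize(records, qi_attrs, k):
    """Replace numeric QI values by [min-max] ranges over groups of >= k records."""
    ordered = sorted(records, key=lambda r: tuple(r[a] for a in qi_attrs))
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    # Merge an undersized last group into its predecessor to keep k-anonymity.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    generalized = []
    for group in groups:
        bounds = {a: (min(r[a] for r in group), max(r[a] for r in group))
                  for a in qi_attrs}
        for _ in group:
            generalized.append({a: f"[{lo}-{hi}]" for a, (lo, hi) in bounds.items()})
    return generalized

records = [
    {"age": 41, "zip": 14250},
    {"age": 23, "zip": 14200},
    {"age": 35, "zip": 14248},
    {"age": 27, "zip": 14210},
    {"age": 52, "zip": 14260},
]
table = generalize(records, ["age", "zip"], k=2)
for row in table:
    print(row)
```

If the input holds fewer than k records, this sketch returns a single undersized group; a caller would need to handle that case separately.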

Input:
QI attributes from original microdata T