Abstract

Privacy preserving data publishing (PPDP) refers to releasing anonymized data for the purpose of research and analysis. A considerable body of research exists for publishing data with a single sensitive attribute, but practical scenarios in PPDP with multiple sensitive attributes (MSAs) have not yet attracted much attention from researchers. A recently proposed technique, (p, k)-Angelization, provided a novel solution in this regard, using a one-to-one correspondence between the buckets in the generalized table (GT) and the sensitive table (ST). However, we have identified a possibility of privacy leakage through MSA correlation among linkable sensitive buckets and named it the “fingerprint correlation (fcorr) attack.” To mitigate it, in this paper we propose an improved solution, the (c, k)-anonymization algorithm. The proposed solution thwarts the attack using additional privacy measures and replaces the one-to-one correspondence with a one-to-many correspondence between the buckets in GT and ST, which further reduces the privacy risk while increasing the utility in GT. We formally model and analyse the attack and the proposed solution. Experiments on real-world datasets show that the proposed solution outperforms its counterpart.

1. Introduction

Data generation and sharing have increased drastically in the ongoing decade, owing to the growing sources of data produced by intensive research and the smart revolution (smart grids, cities, devices, etc.). The shared/published data are utilized by data researchers for research and analysis, which may involve data mining, statistical data analysis, and policy making. In the context of health records, the data owners are the individuals to whom the data belong. The hospital that collects, manipulates, and shares those data is known as the data publisher. The data researchers may be a wide range of stakeholders (e.g., pharmaceuticals, government agencies, and survey organizations). The collected data contain private information (e.g., name, contact number, and social security number), partial identifiers (e.g., age, gender, zipcode, and country), and confidential or sensitive information (e.g., disease) about the data owners. Sharing such sensitive information is a privacy breach and legislatively wrong if disclosed to unauthorized parties.

To ensure the privacy of such information, most of the existing algorithms [1–6] in the literature deal with a single sensitive attribute only. However, a dataset may practically have multiple sensitive attributes (MSAs) [7–14]. For example, a hospital may publish data with more than one sensitive attribute, such as disease, symptom, and physician, as shown in Table 1. The sensitive nature of healthcare records urges researchers to handle such scenarios and to ensure that the privacy of an individual is not breached.

In data publication, along with privacy, data utility is also a major concern, so that researchers may perform research and analysis. Therefore, data should be anonymized in such a way that research analysts may still extract useful information. Balancing privacy and utility in privacy preserving data publishing (PPDP) is an NP-hard problem [15–20]. The scenario in this paper is even more challenging, as we consider the dimensionality of the quasi-identifiers (QIs) as well as more than one sensitive attribute, i.e., MSAs.

An adversary or attacker is a person who tries to breach data privacy using different types of background knowledge (bk) about the MSA dataset. The bk includes the fact that certain patterns of values in published data are more likely to be observed than others. For example, this knowledge can be fingerprint correlation (fcorr) knowledge, QI knowledge (qik) [10], or nonmembership knowledge (nmk) [21, 22]. The MSA values in a table that belong to a specific individual form a fingerprint. The fcorr between two k-anonymous [1] groups can increase an adversary's knowledge. The qik is the personally identifiable information (PII) [21] that allows an adversary to uniquely identify an individual, and according to nmk, an individual cannot be linked to a specific sensitive value (SV). The (p, k)-Angelization [22] is a strong privacy algorithm for MSAs, where p represents the different sensitivity levels of categorical SAs and k implies the k-anonymous QIs. Tables 2 and 3, produced by the (p, k)-Angelization algorithm, are obtained from the original microdata in Table 1. The authors in [22] overcame the nmk attack, but privacy can still be breached through what we name the fcorr attack. The fcorr attack is a comparatively strong privacy attack: if the adversary intends to disclose the privacy of almost every individual, the attack can iteratively breach the privacy of the whole dataset. The privacy breach scenario is explained in Section 1.1 in detail.

1.1. Motivation

The (p, k)-Angelization [22] algorithm directly adopts the single-SA approach named angelization [23] to implement privacy for MSAs. This approach invalidates the (p, k)-Angelization under the fcorr attack. Privacy breach scenario I explains this invalidation of [22] in detail. The complexity, lack of utility, and privacy breaches of the SLOMS [24] and SLAMSA [25] techniques have already been demonstrated by the (p, k)-Angelization. Although [22] is an efficient solution for utility improvement, an intruder can easily breach the privacy of a record using bk and inference. Our work has been motivated by the following limitations of the (p, k)-Angelization algorithm:

(i) Privacy breach scenario I. For example, an adversary (i.e., David) intends to identify p2's (Lisa's) information in Table 1. Since they both live in the same neighbourhood, her age, gender, and zipcode are known (21, F, and 34607). Using the QIs, David identifies her presence in group 3 of the generalized table (GT), i.e., Table 2, and through the batch ID, group 3 of the sensitive batch table (SBT), i.e., Table 3, can be accessed. For the (p, k)-Angelization, physician is the maximum weighted attribute (see Section 5.2). A maximum weighted attribute implies high dependency, which carries a high privacy risk; an attack on it can easily breach privacy. So the intruder starts the attack from the physician attribute. It is an iterative process that leads to record identification of the target individual and can identify the complete records in data table T. Since the (p, k)-Angelization blindly follows the angelization [23] mechanism, correlating the MSAs in different buckets may yield single SA values against each SA. This is a column-wise vertical correlation between two SA fingerprint buckets (SAFBs) in the SBT that share common physicians and other SA values. The intruder intersects SAFB 3 with the groups having common physicians and proceeds iteratively until p2 is identified: between SAFB 3 and SAFB 2 because of the common physician Jack, between SAFB 2 and SAFB 4 because of Tom, and then between SAFB 3 and SAFB 1 because of Alan. Table 4 depicts the identified SVs and hence the disclosed individuals. Although the intruder was interested in identifying only p2, the privacy of p1 and p4 was also breached during the process, which implies that this process can iteratively breach the privacy of the individuals in the complete dataset.

The intruder uses Table 3 (SBT), stores the values obtained at each step in Table 4, and finally identifies all the sensitive information related to p2. In Table 4, the values against each physician are those obtained by intersecting two SAFBs in Table 3 linked through a common physician's name. In Table 3, Jack is common between SAFB 3 and SAFB 2, so whatever value David obtains from the intersection, he records against Jack in Table 4. First, chest X-ray is common in the diagnostic method, and the leftover value ultrasound surely belongs to Alan. In group 3, the two remaining diagnosis values cannot both be assigned to Tom, as Tom may have only one value, so the intruder is not sure at this stage. In the symptoms attribute, back pain is common and is stored against Jack. Another symptom value, “swelling,” definitely belongs to Alan, as no other physician or symptom remains to pair it with. Since further intersection for cancer treatment and cancer type does not produce any value, the process moves to SAFB 2 and SAFB 4 because of Tom. For the diagnostic method, Tom had CT scan and blood test, and no value remains for Frank; although one leftover value exists for Frank, the intruder can refine it by intersecting Frank's buckets with other SAFBs outside the current sample dataset. In the symptom column, abdominal pain goes to Tom, and the leftover value in SAFB 4, testis swelling, goes to Frank. The weight loss and back pain symptoms in SAFB 2 cannot yet be assigned to either Tom or Jack because the intruder does not have enough information. For cancer treatment, there is no common value, while for the cancer type, prostate is assigned to Tom. The last intersection is between SAFB 1 and SAFB 3. Although there is no common value for the diagnostic method, the values in SAFB 3 are all related to rays, so an intelligent intruder can plausibly assign the MRI test to Alan. This may not be the exact value, but it helps to guess or identify the record. For symptoms, swelling is already assigned to Alan; taking back pain for Alan contradicts its earlier assignment to Jack, and the only two values in this cell do not fit the intruder's knowledge. For cancer treatment, the common value is surgery, and for the cancer type, breast is the only attribute value. The leftover values are rectal for Jack in SAFB 3 and colon for Daisy in SAFB 1. In the end, since the intruder also knows that p2 is a female, her attribute values {breast cancer, swelling, ultrasound/MRI} easily identify p2. The weighted sensitive attribute values disclosed against the linkable SA identify patient p2's record. Table 4 shows that, during the process, the details of patients p1 and p4 are also identified. Some of the information regarding Frank and Daisy is incomplete or incorrect because no further intersection with any other group is possible in the current sample data. Executed iteratively, this process can also identify the remaining patients' MSA values.

(ii) Need for bucketization. Deeply analysing the (p, k)-Angelization, we observe that the features of angelization [23] were not well utilized. Tables 2 (GT) and 3 (SBT), produced by the (p, k)-angelization, have a one-to-one correspondence/linking between the two tables through the bucket id (BID). Due to this one-to-one correspondence, the two tables are not independent, while the purpose of angelization was to publish both tables independently. Applying SA diversity may affect the utility in GT; similarly, increasing the dimensionality of the QIs in GT also decreases the utility. After finding the presence of an individual in a bucket in GT, the adversary can easily move from GT to the exact group in SBT, where the fingerprint buckets may help in isolating the sensitive values. In effect, splitting the table into GT and SBT in the (p, k)-angelization brings no benefit.

In our proposed (c, k)-anonymization algorithm, the bucketization approach is adopted, which separates the QIs and SAs into two independently published tables: the generalized table (GT) and the sensitive table (ST). The two tables are linked through the BID.

GT consists of k-anonymous generalized buckets (GBs) of QIs, from which an adversary cannot get additional information about an individual's privacy. ST is a bucket table with MSAs in bucketized form, named sensitive attribute fingerprint buckets (SAFBs). Anatomy [26] and Angel [23] are examples of bucketization for preserving privacy; however, they are applicable only to a single sensitive attribute. In this work, we use bucketization for MSAs, which can prevent different types of adversary attacks, e.g., the fcorr attack. Tables 5 and 6 are the GT and ST produced by the proposed (c, k)-anonymization algorithm. The proposed approach achieves better privacy with minimal utility loss. Moreover, the publisher need not always publish the data with all the QI attributes: publishing the GT with only a few QI attributes, along with the ST, is known as marginal publication, an idea introduced in [23]. Bucketization has minimal information loss because GT and ST are published independently. In these two tables, the connection is not between the buckets; instead, it is between the records in the generalized buckets and the sensitive buckets.
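As a minimal sketch of this splitting step (the field names and records below are illustrative assumptions, not the paper's implementation), the QI and SA columns of each record can be routed into GT and ST and tied together only through BID:

QI = ("age", "gender", "zipcode")
SA = ("disease", "physician")

records = [
    {"age": 21, "gender": "F", "zipcode": "34607",
     "disease": "breast cancer", "physician": "Jack"},
    {"age": 25, "gender": "M", "zipcode": "34608",
     "disease": "colon cancer", "physician": "Tom"},
]

GT, ST = [], []
for bid, rec in enumerate(records):  # one bucket per record, for brevity only
    GT.append({"BID": bid, **{a: rec[a] for a in QI}})
    ST.append({"BID": bid, **{a: rec[a] for a in SA}})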

1.2. Contributions

We propose an efficient solution, (c, k)-anonymization, for privacy preservation with MSAs. In (p, k)-angelization [22], privacy can be breached under the fcorr attack (explained in Section 1.1). The tables published by the proposed (c, k)-anonymization are depicted in Tables 5 and 6; the “Name” attribute in Table 5 is not published when releasing the data. The proposed approach also protects against the adversary's nmk and qik. The main contributions are as follows:
(i) We propose an improvement of (p, k)-angelization, named the (c, k)-anonymization algorithm, for MSA privacy. The proposed solution prevents the fcorr attack. To reduce the privacy risk, the real (i.e., one-to-one) linking between GT and ST is transformed into one-to-many (i.e., real and likely) linking.
(ii) We formally model and investigate the invalidation of (p, k)-angelization under the fcorr attack and the correctness of the proposed (c, k)-anonymization algorithm.
(iii) The experimental results prove that our proposed approach provides better privacy and utility as compared to its counterpart.

2. Related Work

In this section, we broadly categorize the data privacy models in order to position the proposed work within the available literature.

2.1. Data Privacy Models and Methods

Privacy models can be categorized as (i) syntactic (i.e., partition-based) or (ii) semantic (i.e., randomized). The syntactic approach achieves privacy at two levels: data clustering and the privacy framework. k-anonymity [1], its extension l-diversity [3], and t-closeness [4] are examples of syntactic data privacy models, in which the final set of groups are called equivalence classes (ECs). In the semantic approach, the original values are perturbed with random noise; ε-differential privacy [27] is an example of a semantic data model. Researchers have proposed both syntactic and semantic privacy models for different types of data, e.g., a single sensitive attribute [3, 4, 6], MSAs [7–9], or 1 : M (i.e., one individual having many records) [28] microdata. To preserve privacy, the algorithms in these models practice different approaches, which can be categorized into (i) generalization [1–5, 13–15] (i.e., greedily converting more specialized values to less specialized values), (ii) anatomy [25, 26] (i.e., partitioning the QI and SA attributes), and (iii) microaggregation [29, 30] (i.e., partitioning the dataset into clusters where the QI values of the records are replaced with the cluster mean). The work proposed in this paper considers syntactic data privacy, using generalization and anatomy for MSAs.

2.2. Syntactic Anonymization Literature for Multiple Sensitive Attributes

A plethora of research contributions related to MSA privacy exists [7–14, 16–18, 22, 24, 25, 31–38]. A recent work, anatomization with slicing [25], is an effective technique for MSAs: it does not generalize the QI attributes, which enhances utility, but it publishes many tables, which makes the solution more complex. To prevent proximity breaches, the authors in [7] adopted the multi-sensitive bucketization (MSB) technique using clustering; however, it is applicable to numerical data only. The model in [8], designed for a single sensitive attribute, satisfies the privacy requirements for MSAs. The authors in [31] prevent negative and positive disclosure of associations between MSAs. In [33], a rating of MSAs was proposed that fulfils the privacy requirements; however, the inherent relationship between the SAs can enable an association rule attack, and an adversary can use related bk to breach privacy. The authors in [32] protected the data from the association attack and removed the weakness of the rating algorithm. In [37, 38], the authors perform vertical partitioning (i.e., anatomy) and implement decomposition and decomposition plus, respectively, to achieve l-diversity for MSAs. Decomposition plus [38] optimizes the noise value selection of [37] and keeps it closer to the original. The possibility of skewness and similarity attacks in [4, 39] was eliminated by the p+-sensitive t-closeness model [40], which combines the good features of the p-sensitive k-anonymity [39] and t-closeness [4] approaches.

ANGELMS (anatomy with generalization for MSAs) [34] vertically partitions the dataset into a QI table and several SA tables satisfying the k-anonymity [1] and l-diversity [3] principles, but it can still be attacked with similarity, skewness, and sensitivity attacks. In [16, 18], the KC-Slice model for dynamic data publishing of MSAs integrates the features of the KC-privacy and slicing techniques; however, the authors presented the method for a single release, and no studies of multiple releases are available to support the dynamic claim. In [35, 36], MSAs were handled to achieve privacy, but the l-diversity [3] principle was directly adopted, which caused huge information loss. The attack considered in [41] was prevented, but at the cost of high information loss due to the grouping conditions over the data and vulnerability to the background join attack.

The proposed work categorizes the sensitivity of MSAs as top secret, secret, less secret, and nonsecret. c-diverse fingerprint buckets are created that contain records from different categories, and the QI values of the created fingerprint buckets are bottom-up generalized through k-anonymity [1].

3. Preliminaries

Let table T (as shown in Table 1) be the private data that a publisher intends to publish. Let there be n tuples in T, where each tuple t_i represents an individual record respondent i. The components of a tuple are the explicit identifier attributes (also called identifying attributes) EI = {ei_1, ..., ei_a}, the quasi-identifier attributes (also called partial identifiers) QI = {qi_1, ..., qi_q}, and the sensitive attributes S = {s_1, ..., s_r}. QIs are the partial identifiers or personally identifiable information (PII) that can identify an individual i if linked with external data, e.g., voting or census data. Data privacy is all about protecting the sensitive information, i.e., the confidential and private information belonging to an individual. In this work, we consider the challenging scenario of more than one sensitive attribute per individual, named multiple sensitive attributes (MSAs). The notations used in the paper are shown in Table 7.

Definition 1 (MSA fingerprint [22]). The MSA values in table T that belong to a specific individual form a fingerprint known as the MSA fingerprint.

Definition 2 (sensitive attribute fingerprint bucket (SAFB)). A sensitive attribute partitioning of the microdata T consists of a list of SAFBs {FB_1, FB_2, ..., FB_n} satisfying the following conditions:
(i) Each FB_i consists of two columns, BID and the MSA values.
(ii) ⋃_{i=1}^{n} FB_i = T[S], and for any i ≠ j, FB_i ∩ FB_j = ∅ among buckets linkable through the maximum weighted sensitive attribute (see Section 5).
(iii) Each SAFB FB_i fulfils c-diversity from the category table (Table 8). The subscript i of FB_i is the BID of bucket FB_i.

Definition 3 (generalized bucket (GB)). A generalized bucket partitioning from an EC of the microdata T consists of buckets {GB_1, GB_2, ..., GB_m} such that
(i) Each GB_i is the set of tuples containing only the QI attributes of T and the BID from the corresponding FB
(ii) ⋃_{i=1}^{m} GB_i = T[QI], and for any i ≠ j, GB_i ∩ GB_j = ∅
(iii) Each generalized bucket fulfils the k-anonymity principle
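A plain data-structure sketch of Definitions 2 and 3 follows (the field names are our own, introduced only for illustration):

from dataclasses import dataclass, field

@dataclass
class SAFB:          # sensitive attribute fingerprint bucket (Definition 2)
    bid: int         # the subscript i of FB_i, i.e., the bucket's BID
    fingerprints: list = field(default_factory=list)  # c-diverse MSA tuples

@dataclass
class GB:            # generalized bucket (Definition 3)
    bid: int         # BID received from the corresponding FB
    tuples: list = field(default_factory=list)  # k-anonymous QI tuples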

3.1. Adversarial Model

In the literature, an adversary is the attacker who intends to breach privacy and holds different types of knowledge, known as bk. Data correlation is an important type of adversary knowledge that breaches privacy. Data correlation can be attribute correlation, which exists among two or more attributes, e.g., [42], or row correlation between two or more rows, e.g., [43]. This paper addresses row correlation and, more specifically, FB correlation; the focus is on reducing the threat posed by FB correlation linked through a high-weighted SA value. Each FB contains a few fingerprints that belong to the k individuals in a specific GB inside GT. The adversary uniquely identifies an individual from the fingerprint correlation knowledge, which has a direct correspondence with the QI values.

Definition 4 (nonmembership knowledge [22]). If an adversary knows that an individual i in a GB cannot be linked to a specific SV in the corresponding FB, this knowledge is known as nmk.

Definition 5 (fingerprint correlation knowledge (fcorr)). The MSA values obtained by correlating two linkable FBs, i.e., FB_i ∩ FB_j, can be assigned to a specific individual.
Based on the available information, we consider that the adversary's bk consists of {GT, ST, ED}, where
(i) GT is the generalized table (QI attributes with BID)
(ii) ST is the sensitive table (MSA buckets with BID)
(iii) ED is any external dataset available publicly
The adversary applies the bk to the available anonymized data to perform an attack and breach an individual's privacy.

Definition 6 (fingerprint correlation (fcorr) attack). An adversary with known QI values and fcorr knowledge is able to perform the fcorr attack by deducing single SVs from the intersection of FBs that are linked via the maximum weighted SA. The attack can be
(i) Partial fcorr attack: a few of the SAs from the fingerprints in two or more FBs may produce unique SVs
(ii) Full fcorr attack: all of the SAs from the fingerprints in two or more FBs may produce unique SVs
For an adversary, the full fcorr attack undoubtedly identifies an individual uniquely, while the partial fcorr attack can identify a record respondent only if the resulting sensitive information belongs to attributes above the minimum weighted attributes (see Section 5.1); the minimum weighted attributes do not contribute to individual record identification.
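The core of the fcorr attack is a simple set intersection over linkable buckets. The following Python sketch illustrates only that intersection logic; the bucket contents and attribute names are hypothetical stand-ins for Table 3, not the actual data:

def fcorr_intersect(fb_a, fb_b, attribute):
    # SA values common to two fingerprint buckets linked via s_max
    return set(fb_a[attribute]) & set(fb_b[attribute])

# hypothetical SAFBs that share the physician "Jack"
safb3 = {"physician": {"Jack", "Alan"}, "symptom": {"back pain", "swelling"}}
safb2 = {"physician": {"Jack", "Tom"}, "symptom": {"back pain", "weight loss"}}

common = fcorr_intersect(safb3, safb2, "symptom")
if len(common) == 1:
    # a singleton intersection pins one SA value to the shared physician,
    # which is exactly the leakage the fcorr attack exploits
    print("symptom assigned to Jack:", common.pop())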

Definition 7 (high-level petri nets (HLPNs) [44]). A petri net is used as a model to examine the flow of information in a system, and an HLPN formally analyses the system through its mathematical properties. An HLPN is a 7-tuple N = (P, T, F, φ, R, L, M0), where P is a set of places represented by circles; T is a set of transitions represented by rectangular boxes such that P ∩ T = ∅; F is the flow relation such that F ⊆ (P × T) ∪ (T × P); L denotes the labels on F; φ maps places to data types; R represents the rules for transitions; and M0 is the initial marking. In short, L, φ, and R represent the static semantics, whereas P, T, and F depict the dynamic structure.

4. Critical Review of (p, k)-Angelization with fcorr Attack Identification Using Formal Modelling and Analysis

Definition 8 ((p, k)-Angelization [22]). A pair of bucket partitioning {GB_1, ..., GB_m} and batch partitioning {FB_1, ..., FB_n} of table T produces two tables GT and SBT such that
(i) GT consists of the QI attributes with batch id (BID) belonging to table T. The QI values belonging to the t records are k-anonymously grouped and linked with SBT via BID.
(ii) SBT consists of (SA, BID), where BID is i (1 ≤ i ≤ n) and SA are the MSAs from table T.
(iii) The batch partitioning satisfies the anonymity principle of [16], where every bucket has records from p categories and k tuples to prevent the linking attack, while the group partitioning satisfies (p, k)-anonymity [23].
The following reasons explain the invalidation of (p, k)-angelization [22]:
(i) The invalidation of the existing (p, k)-angelization is due to the fcorr knowledge that enables the fcorr attack (as shown in Table 4). Although the records from p categories are k-indistinguishable on the MSA fingerprints, they become uniquely distinguishable because of the unique SAFB values obtained from linkable buckets. So, Lemma 2 in [22] is incorrect; its corrected form is given in Lemma 1 (Section 5).
(ii) Invalidation of Theorem 2 in [15]: the adversary can correlate the sensitive information with qik. Scenario I explains the attack that extracts unique sensitive information using QI values.
Now, we formally model the (p, k)-angelization algorithm to check its invalidation with respect to the fcorr attack. The (p, k)-angelization algorithm is depicted with an HLPN and formally analysed through its mathematical properties. The purpose of using HLPNs is to depict (i) the interconnection of the model components and processes, (ii) a clear flow of data among the processes, and (iii) an in-depth insight into how the information is processed, in order to isolate the flaw in (p, k)-angelization. Figure 1 depicts the HLPN for (p, k)-angelization. The variable types and the mapping of data types onto places are shown in Tables 9 and 10, respectively. The adversarial model in Figure 1 comprises three entities: the end user, the trusted data sanitizer, and the adversary. The initial transition is referred to as the input transition and contains the raw data (e.g., patients' EHRs) collected from a health organization. The trusted data sanitizer anonymizes the data using the (p, k)-angelization algorithm and produces the GT (Table 2) and SBT (Table 3) tables. The produced tables are ready to be published and are exploited by the adversary through the fcorr attack in Table 4.
Rule 1 checks the existence of the number of dependent SAs with respect to another SA. Rule 2 counts the dependent attributes and selects the maximum weighted attributes; if more than one maximum exists in the weight set, an external factor is added to one of them based on external facts. The weight set is sorted in descending order to select the maximum weight, as in rules 3 and 4. Based on the weight calculation and the MSAs in T, the category table is formed in rule 5. The problem arises from rule 6 onward. The (p, k)-angelization [22] blindly follows the basic angelization [23] mechanism: according to rule 6, the data in table T, based on the category table (Table 8), are angelized to create GT and SBT. In [23], fcorr between two SA buckets does not exist because there is a single SA; when handling MSAs, the basic angelization is not applicable without proper measures. Rule 7 shows the fcorr attack that breaches the privacy of an individual: the known QI values in a specific GB in GT have exactly one corresponding FB in SBT, on which the fcorr attack discloses unique values from the MSA FBs.

5. Proposed (c, k)-Anonymization for Multiple Sensitive Attributes

Although the (p, k)-Angelization model is a state-of-the-art approach for MSAs, especially for the categorization of sensitive values, the ST still lacks privacy because the same angelization [23] approach is blindly reused for MSAs. This leads to the fcorr attack, i.e., the partial or the full fcorr attack. We name the improved form of (p, k)-angelization as (c, k)-anonymization for MSAs, described as follows:

Definition 9 ((c, k)-anonymization). A pair of generalized bucket partitioning {GB_1, ..., GB_m} and sensitive attribute fingerprint bucket partitioning {FB_1, ..., FB_n} of table T produces two tables GT and ST such that
(i) GT consists of QI attribute buckets (GBs) with bucket id (BID) belonging to table T. The QI values belonging to the t records are k-anonymously grouped and linked with the corresponding sensitive buckets in ST via BID, with ⋃_{i=1}^{m} GB_i = T[QI] and, for any i ≠ j, GB_i ∩ GB_j = ∅.
(ii) ST consists of (FB, BID), where BID is i (1 ≤ i ≤ n) and the FBs are the MSA buckets from table T, with ⋃_{i=1}^{n} FB_i = T[S] and, for any i ≠ j, FB_i ∩ FB_j = ∅ among buckets linkable through the maximum weighted sensitive attribute (s_max).
(iii) The generalized bucket partitioning satisfies Definition 3, and the sensitive bucket partitioning satisfies Definition 2 based on the category table CtgT (Table 8). Every generalized bucket has k tuples to prevent the linking attack. The sensitive partitioning has c-diverse records from c categories in CtgT such that
(a) for the strict approach, the FBs fulfil equation (4)
(b) for the relax approach, the FBs fulfil equation (5)

Lemma 1 (uncertainty in SAFBs). If the anonymized form T′ of an MSA dataset T satisfies (c, k)-anonymization, then T′ satisfies (c, k)-diversity for the MSA fingerprints.

Proof. Let sbp be a random sensitive bucket partitioning in T′. There must be at least k tuples from c categories in sbp such that FB_i ∩ FB_j = ∅ or FB_i ∩ FB_j ⊆ S_min, where S_min are the minimum weighted SAs having the lowest dependency (see Section 5.1). So the c categorized records are indistinguishable on the MSA fingerprint among k records. Thus, sbp satisfies the definition of (c, k)-diversity, and the uncertainty in the SAFBs is maximized.

5.1. Privacy Risk: Feasibility of the Proposed Work

The presence attack and the fcorr attack are the two possible attacks that breach the privacy of GT and ST, respectively. The adversary can breach an individual's privacy by linking the obtained sensitive information with the QI values and with the bk. Every QI record has a corresponding SA fingerprint in a specific FB, as depicted in Tables 2 and 3. The (p, k)-angelization has one-to-one real linking through BID; real linking has high privacy leakage and a 100% chance of the presence attack. Another type of linking is the likely linking between GT and ST, where not all BID linkings are real. Relating the real and likely linkings, the privacy risk for an individual is defined as PR = r/n, where r is the number of real linkings and n is the total number of likely linkings. The number of likely linkings for a specific EC size varies and depends on the QI values in GT that link with certain FBs. To prevent privacy leakage, PR should not exceed 1/l, where l represents the l-diversity [3]. Every FB has c-diverse fingerprints that correspond to at least k individuals. In our proposed (c, k)-anonymization, c = 1 implies l = 2; c = 2 implies l = 4; c = 3 implies l = 6, and so on.

Consider the privacy risk for Tables 5 and 6, produced by the proposed (c, k)-anonymization. In GT (Table 5), for c = 2, there are three different sizes of ECs that have varying l-diversity in ST (Table 6). For the EC of size 4, the real linkings r are far fewer than the likely linkings n, so PR = r/n is very low; similarly, the ECs of sizes 2 and 3 keep r/n small. Even in the case when the minimum-distance QI values in an EC link to a single FB in ST, for example, when the EC size is 2, then r = n and PR = 1, and the probability of privacy disclosure is still bounded by the diversity in the FB. In certain cases, higher utility in the QI values may further increase the likely linkings, which reduces the privacy risk even more. This shows that the proposed approach has a very low privacy risk for the presence attack and data disclosure.
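A worked sketch of the measure PR = r/n follows (the counts below are hypothetical; the real values depend on the published tables):

def privacy_risk(real_links, likely_links):
    # PR = r / n; the lower the ratio, the harder the presence attack
    return real_links / likely_links

# e.g., an EC whose 4 real linkings hide among 12 likely linkings
print(privacy_risk(4, 12))   # 0.333..., and a correct guess still faces
                             # the c-diversity inside the linked FB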

5.2. (c, k)-Anonymization Algorithm

The objective of the proposed (c, k)-anonymization algorithm is to provide sustainable privacy for MSAs. The algorithm takes a microdata table T (Table 1) as input and produces two anonymized tables, i.e., GT (Table 5) and ST (Table 6).

The proposed Algorithm 1 performs two major functions: categorizing the MSAs based on calculated weights and creating secure FBs for the whole dataset. For SA categorization, the algorithm calculates weights for all the MSAs in the dataset to determine the dependency of the SAs; the dependency shows the sensitivity level of each SA. These weights are sorted in descending order to obtain the categorized MSAs. The weight calculation for the MSAs creates the category table (CtgT), which helps to create l-diverse (c-diverse with respect to category) FBs. The FBs must satisfy equation (4) or (5) in order to prevent the fcorr attack; therefore, any FBs that do not conform to equation (4) or (5) are refined until they do. Refinement is needed because the input data may be of varying nature and may contain SVs that cannot be grouped initially. The complete algorithm (Algorithm 1) and its working are explained in the following:

Input:
 T: microdata table = {EI, QI, S}
 χ: external factor
Output:
 GData: generalized table (GT)
 SAFB: sensitive table (ST)
 W = weight calculation for all SAs in T // Function 1
 sort W in descending order, breaking ties with χ; s_max = first attribute of W
 CtgT = categorize(S, W) // category table, Table 8
 SAFB = createFB(T, CtgT, β, k) // Function 2
 SAFB = RefineFB(SAFB) // Function 3, enforce equation (4) or (5)
 GData = Generalization(QI, BID) // Function 4

Sensitive attributes weight calculation. In Algorithm 1, a calling function, shown in Function 1, calculates the weights for all the MSAs in Table 1. There are six different types of SAs, and each has its own sensitivity level or sensitivity weight. Let W = {w_1, w_2, ..., w_r} be the set of weights for the SAs such that s_1 has weight w_1, s_2 has weight w_2, and so on. An SA's weight reflects its dependency on all the other SAs. Similar to [22], the weight is calculated as in equation (1), i.e., the weight of s_i equals m, the total number of attributes dependent on attribute s_i. The dependency of an SA is determined by the total range of attributes identifying that SA. The for loop at the beginning calculates the sensitivity of s_i with respect to all other SAs; this determines the sensitivity level for all the SAs. To calculate the weights (second for loop), a cardinality check is performed over all dependent attributes. The calculated weights for the SAs in the microdata are shown in Table 11.

Input:
 s_i: sensitive attribute (SA)
 w_i: weight for an SA
 Dep: SA dependency
 h: height of dependency (no. of dependent SAs)
Output: list of weights for all SAs, i.e., W = {w_1, w_2, ..., w_r}
 for each s_i ∈ S do Dep(s_i) = set of SAs dependent on s_i // sensitivity of s_i w.r.t. all other SAs
 for each s_i ∈ S do w_i = |Dep(s_i)| // cardinality check of the dependent attributes
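A minimal Python sketch of this cardinality-based weight calculation follows, assuming a boolean dependency relation between SAs (the dependency matrix below is hypothetical):

def calculate_weights(sas, depends):
    # depends[b][a] is True when attribute b is dependent on attribute a;
    # the weight of a is the cardinality of its dependent-attribute set
    weights = {a: sum(1 for b in sas if b != a and depends[b][a]) for a in sas}
    return dict(sorted(weights.items(), key=lambda kv: -kv[1]))

sas = ["disease", "physician", "symptom"]
depends = {"disease":   {"disease": False, "physician": True,  "symptom": False},
           "physician": {"disease": True,  "physician": False, "symptom": False},
           "symptom":   {"disease": True,  "physician": True,  "symptom": False}}
print(calculate_weights(sas, depends))  # descending; disease and physician tie here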

Maximum weighted sensitive attribute selection. From the dependencies calculated through Function 1, the maximum weighted attributes are selected. A maximum weight means high dependency, which leads to a high disclosure risk, so the attribute set with the maximum weight needs maximum protection. Although the chances are rare, a problem arises if more than one SA has the same maximum weight. Then, an external factor χ is added to select only one maximum weighted attribute: the algorithm adds χ to an attribute in the tied set, and the weights are then sorted in descending order to obtain a single maximum weighted SA (s_max) (see Algorithm 1).

Categorizing sensitive attributes. The attribute occurring in the first position of the sorted set is selected as s_max. The descending order of calculated weights for the MSAs is categorized through the categorize() function as top secret, secret, less secret, and nonsecret in Table 8. Although disease and physician have equal weights in the weight calculation, we select physician as the maximum weighted SA (s_max) because of other external factors χ; for example, physician information may be publicly available on the Internet.

Creating FBs. FBs consist of the MSAs with a BID column at the time of anonymization. Function 2, createFB(), shows the whole process of creating c-diverse FBs, which are built mainly from the category table CtgT (as shown in Table 8). Let n be the total number of records in the microdata table T. To create a 2-anonymous 2-diverse FB, i.e., k = 2, l = 2 (i.e., c = 1), for example, let t_1 and t_2 be two different records in T such that t_1 ∈ Ctg_i and t_2 ∈ Ctg_j, where i ≠ j. The union of these two records from different sensitive categories creates an FB, as shown in equation (2): FB = t_1[S] ∪ t_2[S], where t_1 ∈ Ctg_i, t_2 ∈ Ctg_j, and i ≠ j. Selecting records from different categories to create an FB implements the l-diversity principle, in the form of c-diversity in our case. Privacy in data is all about creating an EC that prevents an intruder from breaching any of its sensitive contents. Unlike [22], we focus on creating an FB that satisfies c-diversity and prevents any attack from the adversary, e.g., the fcorr attack. The attack is prevented by taking two measures: (i) minimizing the linkability between two FBs and (ii) uncorrelating the records, i.e., maximizing the uncertainty between the FBs (explained in Refining FBs).

Input:
 V = {v_1, v_2, ..., v_d}: all s_max values in the whole dataset T
 r: source record to create an FB; it can be t_1 or t_2, r ∈ T
 n: no. of records in the actual dataset T
 β: linkability control factor
 k: k-anonymity level (minimum FB size)
 Ctg_i: any category of SA in category Table 8
 Ctg_j: any category of SA in category Table 8
Output: k-anonymous c-diverse records in each FB
 FB = {}
 while T ≠ ∅ do
  for each v ∈ V do
   R = select(T, v, β) // if e.g. β = 2, will select max 2 records having the same physician
   r = distinct(R) // where r = t_1 ∈ Ctg_i or t_2 ∈ Ctg_j, i ≠ j
   FB = FB ∪ r
   remove r from T
  end for
 end while
 return the created FBs

Patient records (i.e., SVs) are high in number compared to the physician attribute values, and the two can normally be correlated or linked with one another, which gives the fcorr attack its opening. Therefore, to minimize the linkability, we use the linkability control factor β (β ≥ 1). β minimizes the repetition of the same s_max value across different FBs. The table published by the proposed algorithm (Table 6) has β = 2, which means that a maximum of two records for a single physician can exist in an FB. This packs the existing values into the minimum number of FBs, and the linkability is reduced, so the chances of the fcorr attack on the possible FBs are ultimately reduced (Table 3 has 3 FBs, while Table 6 has only 1). A high value of β further reduces fcorr but increases information loss, so a balance should be maintained between β and utility preservation.
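The following sketch shows the effect of β in isolation: at most β records sharing the same s_max value (here, the physician) are packed into one bucket, so a given physician spans as few FBs as possible. The c-diversity check of the full createFB() is omitted; the names and the grouping policy are simplifying assumptions:

from collections import defaultdict

def pack_by_s_max(records, s_max="physician", beta=2, k=2):
    by_value = defaultdict(list)
    for rec in records:
        by_value[rec[s_max]].append(rec)
    buckets, current = [], []
    for recs in by_value.values():
        for i in range(0, len(recs), beta):
            current.extend(recs[i:i + beta])  # <= beta same-physician records
            if len(current) >= k:             # close a bucket of at least k
                buckets.append(current)
                current = []
    if current:                               # leftovers join the last bucket
        if buckets:
            buckets[-1].extend(current)
        else:
            buckets.append(current)
    return buckets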

Refining FBs. FBs are refined through the function RefineFB(), as depicted in Function 3. An fcorr attack between any two FBs can breach privacy, and the purpose of refinement is to avoid the correlation completely. A percentage of a record disclosed from the intersection between FBs can be associated with a specific individual: any percentage of a record obtained from equation (3) that results in a single value for an SA correlation is a privacy breach and is not acceptable, especially for highly weighted attributes. Equation (3) measures the disclosure over FB_i ∩ FB_j, where FB_i and FB_j are linked via s_max. A high percentage of disclosure implies high intruder confidence in a privacy breach and vice versa; a decreasing order of disclosure percentages corresponds to a decreasing probability of privacy breach among the n FBs. The measure that prevents the fcorr attack is no, or minimal, data exposure from the intersection. The refining process in Function 3 works under two approaches, i.e., strict and relax. The strict approach is given in equation (4): for every SA s, |FB_i[s] ∩ FB_j[s]| ≠ 1, where i ≠ j and the FBs are linked through s_max (for example, through the physician), and n is the total number of FBs in which the same physician exists. The intersection in this case must be empty or contain more than one SV in common, creating uncertainty about any single SA.

Input:
 F = {FB_1, FB_2, ..., FB_n}: a set of FBs that are linked via the same s_max
 k: minimum k-anonymity level
Output: set of refined FBs F′ = {FB′_1, ..., FB′_n}, linked via the same s_max
 while ∃ FB_i, FB_j ∈ F violating [equation (4) // strict ECs
 OR equation (5)] // relax ECs
 do
  select the violating pair FB_i, FB_j // where i ≠ j
  FB_i = FB_i ∪ FB_j // merge the FBs, FB_j is dissolved
 end while

In the case of a worst dataset, some records may not fit into any FB via the strict approach. A relax strategy is then adopted with no breach in privacy: the percentage of record exposure from the intersection is minimized to an acceptable value. In the worst-case scenario, the proposed (c, k)-anonymization maintains FB_i ∩ FB_j ⊆ S_min, where S_min are the minimum weighted attributes that have no dependency. The relax strategy is given in equation (5): FB_i ∩ FB_j ⊆ S_min, where i ≠ j and the FBs are linked through s_max (for example, through the physician), and n is the total number of FBs in which the same physician exists. According to equation (5), only an intersection within S_min is acceptable; this is a percentage of information leakage but not a privacy breach, because the attributes involved are not dependent and it is therefore impossible for an intruder to link them with the other SA or QI values of a specific patient record. The working of the RefineFB() function for equation (5) is the same as for equation (4).
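A sketch of the two checks that RefineFB() enforces between a pair of FBs linked through s_max, per our reading of equations (4) and (5), follows; S_MIN stands for the minimum weighted SAs and is a hypothetical example here:

S_MIN = {"diagnostic_method"}   # hypothetical minimum weighted SA

def strict_ok(fb_i, fb_j, sas):
    # equation (4): per SA, the intersection is empty or has > 1 values
    return all(len(set(fb_i[s]) & set(fb_j[s])) != 1 for s in sas)

def relax_ok(fb_i, fb_j, sas):
    # equation (5): singleton intersections tolerated only inside S_MIN
    return all(len(set(fb_i[s]) & set(fb_j[s])) != 1 or s in S_MIN
               for s in sas)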

Generalization. Function 4 deals with the QI attributes and the BIDs that correspond to unique FBs in ST. Initially, the records are sorted by their QIs; then every group of records is generalized to achieve a k-anonymous EC. For generalizing a tuple t, its QI = {qi_1, ..., qi_q} is generalized to t* = ([x_1 − y_1], [x_2 − y_2], ..., [x_q − y_q]), where x_j ≤ t.qi_j ≤ y_j and x_j, y_j are the closest boundaries for qi_j. The Generalization() function in Function 4 shows the generalization process.

Input:
 QI attributes from the original microdata T
Output:
 GData: generalized table (GT)
 N = (QI, BID) // BID is received from the SAFBs
 sort N w.r.t. the QIs
 while N ≠ ∅ do
  select the k closest records from N
  generalize their QI values into one GB
  link the GB with FBs in ST // using BID
  // each GB has mostly both real and likely linking
  GData = GData ∪ GB
 end while
end function

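A compact Python sketch of this step for numeric QIs follows (the record layout and helper names are illustrative): sort by QIs, then turn each run of k records into one interval-valued GB.

def generalize(records, qis=("age",), k=2):
    records = sorted(records, key=lambda r: tuple(r[q] for q in qis))
    gbs = []
    for i in range(0, len(records), k):
        group = records[i:i + k]
        gb = {q: (min(r[q] for r in group), max(r[q] for r in group))
              for q in qis}                      # the [x_j - y_j] boundaries
        gb["BIDs"] = [r["BID"] for r in group]   # real and likely linkings
        gbs.append(gb)
    return gbs
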
5.3. Formal Modelling and Analysis of the Proposed (c, k)-Anonymization Algorithm

In this section, we formally model the proposed (c, k)-anonymization algorithm to analyse it and formally validate it against the adversary's background knowledge, i.e., the fcorr attack. We use an HLPN (Definition 7) to model the proposed system; the HLPN provides a mathematical representation to analyse the behaviour of the proposed system. To represent a system in an HLPN, first the data types associated with the places P are defined, and then the set of rules for the HLPN is defined. Figure 2 represents the HLPN for the proposed (c, k)-anonymization algorithm. Tables 12 and 13 show the data types and the mapping of data types onto the places involved in the HLPN of the proposed algorithm.

The algorithm begins with the weight calculation, as in [22], to create the CtgT table (Table 8); hence, the transitions in rules 1–5 are the same for the proposed (c, k)-anonymization algorithm. The main goal is to nullify rule 7 and to create FBs that prevent the fcorr attack. In rule 8, the FBs are created: the function createFB() takes T and, based on the category id C, accommodates the categorized MSAs into different c-diverse FBs.

In the next step, the FBs created in rule 8 are evaluated to prevent the fcorr attack. Rule 9 takes equations (4) and (5) as input to verify the nonexistence of fcorr knowledge between different FBs. All FBs that satisfy either equation are stored at place SAFB via rule 10, while the other FBs are forwarded for refinement to satisfy the equations, as shown in rule 11.

The minimum requirement to prevent the fcorr attack is to satisfy at least equation (5). In rule 12, this requirement is fulfilled via the function RefineFB() for all the FALSE FBs from rule 9, and the results are stored at place SAFBr. Rule 13 combines all the secured FBs from places SAFB and SAFBr to create one ST that can resist any bk attack, e.g., the fcorr attack.

The generalization of the QIs in T, with respect to the sensitivity categorization and the BIDs obtained from ST (rule 13), is performed in rule 14. The GT and ST tables created via rules 14 and 13, respectively, can thwart the adversary's presence and fcorr attacks and are ready to publish. Rule 15 shows the adversary's zero gain after the attack.

In rule 15, the adversary's bk consists of QI, MSA, and PID. To apply the fcorr attack, the adversary compares the SVs in the bk with the MSA buckets of the published tables and with the corresponding QIs to find matching MSAs that belong to a specific PID, as in the original table T. However, the adversary fails to do so, and the union of the bk yields an empty set, which shows that the bk cannot identify a specific individual record.

6. Experimental Analysis

In this section, we evaluate our proposed (c, k)-anonymization algorithm and compare its performance with the (p, k)-Angelization. Both algorithms are implemented in Java on a machine running the Windows 10 operating system with 4 GB RAM and an Intel Core i5 2.39 GHz processor. The values plotted for the (p, k)-angelization algorithm were obtained from the algorithm's program code executed on the same machine. The dataset, obtained from the Cleveland Clinic Foundation heart disease repository, is available at https://archive.ics.uci.edu/ml/datasets/Heart+Disease. This dataset consists of 75 attributes. The experiments are performed on two QI attributes (age and gender) and 12 sensitive attributes, which are sufficient to evaluate the performance of the proposed algorithm. Table 14 shows the QIs and SAs used in the experiments, with attribute descriptions and the number of distinct values in each domain.

Different general purpose a posteriori measures of utility and privacy loss [9, 15, 18, 22] are available for generalization-based algorithms. In these approaches, the publisher does not know the recipient's analysis method and only evaluates the similarity between the original and anonymized data. Lower values of utility loss and privacy loss reflect the effectiveness of the developed algorithm. We measure the utility loss using the normalized certainty penalty (NCP) [22] and query accuracy [22], and the privacy loss by counting the vulnerable records. The execution times of both algorithms are also analysed, and a discussion is provided at the end.

6.1. Utility Loss

To measure utility loss, we analyse our algorithm using the following techniques.

6.1.1. Normalized Certainty Penalty

NCP measures the accuracy loss in the anonymized release. Let A_1, A_2, ..., A_q be the QI attributes. The information loss for a single numerical QI attribute A_i of a tuple t is NCP_{A_i}(t) = (z_i − y_i)/|A_i|, where [y_i, z_i] is the generalized interval containing the actual QI value t.A_i from T, and |A_i| is the domain range of A_i, i.e., max{t.A_i} − min{t.A_i}. The total weighted certainty penalty for a tuple is the sum over all its attributes, and the NCP of the whole table is obtained by adding the penalties of all tuples, as shown in the following equation: NCP(T*) = Σ_{t ∈ T*} Σ_{i=1}^{q} w_i · NCP_{A_i}(t), where NCP_{A_i}(t) represents the penalty for a tuple, the w_i are the weights associated with the attributes, and T* is the final anonymized release. Figure 3 shows the NCP percentage for varying values of k-anonymity, keeping the number of attributes fixed (e.g., MSA = 6), for the (p, k)-angelization and (c, k)-anonymization algorithms. The penalty, i.e., the NCP% value, for (p, k)-angelization increases continuously with increasing k-anonymity because the k groups have a one-to-one correspondence with the FBs in ST; high diversity in the FBs further affects the utility in GT, so splitting table T (Table 1) into GT (Table 2) and ST (Table 3) has no benefit at all, as the attributes in each table are still dependent on each other. The proposed (c, k)-anonymization, in contrast, has a one-to-many correspondence between ST and GT, where the GBs form closer k-anonymous groups for the same k-size class; the comparatively less generalized QIs have lower utility loss, approaching zero.
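The NCP computation can be sketched for numeric QIs as follows (the interval representation and weights are assumptions made for illustration):

def ncp(anonymized, domains, weights):
    # anonymized: per tuple, one (low, high) interval per QI attribute
    # domains: (min, max) per QI; weights: per-QI weight w_i
    total = 0.0
    for t in anonymized:
        for (lo, hi), (dmin, dmax), w in zip(t, domains, weights):
            total += w * (hi - lo) / (dmax - dmin)   # w_i * (z_i - y_i)/|A_i|
    return total

# e.g., two tuples generalized on age and zipcode
print(ncp([((20, 25), (34600, 34610)), ((20, 25), (34600, 34610))],
          domains=[(0, 100), (34000, 35000)], weights=[1.0, 1.0]))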

6.1.2. Query Accuracy or Precision of Data Analysis Queries

The purpose of anonymized data is to allow useful statistics to be extracted and to contribute to decision-making. Such utility of the anonymized release is measured through aggregate query answering. Consider the following type of aggregate query:

SELECT COUNT(*) FROM T WHERE pred(QI_1) AND ... AND pred(QI_q) AND pred(SA)

The published table answers the query with an estimated count. The domain size depends on the query selectivity s, which indicates the expected number of records selected. The selectivity s = n_Q/n, where n is the total number of records in the dataset and n_Q is the number of records returned by query Q. To measure the precision loss, the query error in equation (8) measures the relative error between the COUNT queries executed on the published and original datasets: query error = |count_T*(Q) − count_T(Q)|/count_T(Q).
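A sketch of the COUNT query and the relative query error of equation (8) follows (the predicate and tables are illustrative):

def count(table, predicate):
    return sum(1 for row in table if predicate(row))

def query_error(original, published, predicate):
    actual = count(original, predicate)
    estimate = count(published, predicate)
    return abs(estimate - actual) / actual   # relative COUNT error, eq. (8)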

Calculating the query error is a common metric to measure the utility of the anonymized release. We compare the utility of (p, k)-angelization and (c, k)-anonymization by generating 1000 random queries and averaging their query errors. Figure 4 depicts the query error for the two algorithms. In Figure 4(a), the comparative increase in the query error for (p, k)-angelization with varying k size is due to new record insertions that widen the generalization ranges, while the proposed (c, k)-anonymization shows a comparatively low error rate because of the comparative decrease in QI generalization. The selectivity graph in Figure 4(b) shows that, at high selectivity, more records are selected, so the difference used to compute the query error via equation (8) automatically decreases.

6.2. Privacy Loss

The identification of individual record respondents from an anonymized release is directly proportional to the privacy loss. Therefore, the privacy loss is measured by the number of single-record identifications obtained from the intersection of FBs, considered as vulnerable records. We analyse the privacy loss with varying values of k and MSA. Figure 5 shows the results obtained from the experiments for the (p, k)-angelization and (c, k)-anonymization algorithms. In Figure 5(a), for (p, k)-angelization, the number of vulnerable records increases with the k size because more records have a single SV obtained from the intersection among FBs. Similarly, in Figure 5(b), with an increasing number of MSAs, the chance of getting a single SV against each SA increases, and hence the number of vulnerable records increases for (p, k)-angelization. In both cases, i.e., Figures 5(a) and 5(b), the proposed (c, k)-anonymization yields no such single SV from the intersection of FBs because it satisfies equations (4) and (5); therefore, no vulnerable records exist for the proposed (c, k)-anonymization algorithm.

6.3. Execution Time Analysis

The execution time of the proposed (c, k)-anonymization is higher than that of the (p, k)-angelization because of the privacy requirement to satisfy equations (4) and (5). Although categorization and record selection are almost the same in both algorithms, satisfying equations (4) and (5) is time consuming and may require additional time to merge records between FBs. Therefore, the execution time increases at the cost of improved privacy. In Figure 6, as the number of MSAs increases, the execution time of the proposed (c, k)-anonymization algorithm increases comparatively because the privacy equations must be satisfied for all the available attributes; however, the execution time remains small and acceptable.

6.4. Discussion

The proposed algorithm has been analysed through the utility loss and privacy loss metrics. The main goal of the algorithm is to prevent attribute disclosures that exploit background knowledge, and the algorithm achieves this goal. Its first priority is to prevent the fcorr attack: for this purpose, the algorithm creates a classification strategy in the form of CtgT to obtain c-diverse records in each FB and to prevent attribute disclosures. Defensively, the algorithm focuses on reducing the linkability between FBs using β in order to reduce the number of possible attacks. In the next phase, each FB is forced to fulfil equation (4), or in the worst case equation (5), which completely avoids the risk of the fcorr attack. Partitioning the attributes based on weights and enforcing these conditions reduces the privacy loss, and creating the closest k-anonymous QI classes also results in low information loss. The evaluation parameters for privacy and utility prove that the proposed (c, k)-anonymization algorithm has minimal disclosure risk and information loss.

7. Conclusion

In this paper, privacy for the MSAs of healthcare data has been addressed. Rather than adopting a predefined methodology, as the (p, k)-angelization does, we proposed a novel algorithm, (c, k)-anonymization, for privacy and utility improvement with MSAs. The proposed algorithm consists of two major steps: first, it categorizes the MSAs based on calculated weights; second, FBs are created and refined iteratively to implement privacy. To categorize the MSAs, the weight calculation is done as in [22]. The privacy risk is reduced by implementing one-to-many linking, which disassociates the buckets in GT and ST; this not only reduces the probability of adversary attacks but also improves the utility in GT. The major step in the privacy implementation is to reduce the correlation between FBs by satisfying equations (4) and (5). Such measures remove the main cause of privacy breach, i.e., the correlation between SAs, and make the adversary unable to disclose the privacy of an intended individual. The experimental results show that, with respect to both utility and privacy, the proposed (c, k)-anonymization algorithm performs better than the (p, k)-Angelization algorithm.

Preserving the same privacy, this work can be extended to 1 : M microdata [28]. Another challenging future work is combining 1 : M with MSAs in a dynamic data publishing [6, 16] scenario; to the best of our knowledge, no literature exists for the latter scenario.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61932005, 61601051, and 61941105), Beijing Municipal Science and Technology Project (Z181100003218005), and 111 Project of China B16006.