An Edge Correlation Based Differentially Private Network Data Release Method

Lu, Junling; Cai, Zhipeng; Wang, Xiaoming; Zhang, Lichen; Duan, Zhuojun

doi:https://doi.org/10.1155/2017/8408253

Security and Communication Networks

On this page

Abstract Introduction Related Work Preliminaries Conclusion Conflicts of Interest Acknowledgments References Copyright Related Articles

Special Issue

Cross-Layer Approaches for Enhancing the Security and Privacy of Wireless Communications and Networking

View this Special Issue

Research Article | Open Access

Volume 2017 | Article ID 8408253 | https://doi.org/10.1155/2017/8408253

An Edge Correlation Based Differentially Private Network Data Release Method

Junling Lu,^1,2Zhipeng Cai ,³Xiaoming Wang,^1,2Lichen Zhang,^1,2and Zhuojun Duan³

Academic Editor: Houbing Song

Received16 Aug 2017

Accepted16 Oct 2017

Published13 Nov 2017

Abstract

Differential privacy (DP) provides a rigorous and provable privacy guarantee and assumes adversaries’ arbitrary background knowledge, which makes it distinct from prior work in privacy preserving. However, DP cannot achieve claimed privacy guarantees over datasets with correlated tuples. Aiming to protect whether two individuals have a close relationship in a correlated dataset corresponding to a weighted network, we propose a differentially private network data release method, based on edge correlation, to gain the tradeoff between privacy and utility. Specifically, we first extracted the Edge Profile (PF) of an edge from a graph, which is transformed from a raw correlated dataset. Then, edge correlation is defined based on the PFs of both edges via Jenson-Shannon Divergence (JS-Divergence). Secondly, we transform a raw weighted dataset into an indicated dataset by adopting a weight threshold, to satisfy specific real need and decrease query sensitivity. Furthermore, we propose -correlated edge differential privacy (CEDP), by combining the correlation analysis and the correlated parameter with traditional DP. Finally, we propose network data release (NDR) algorithm based on the -CEDP model and discuss its privacy and utility. Extensive experiments over real and synthetic network datasets show the proposed releasing method provides better utilities while maintaining privacy guarantee.

1. Introduction

Recently, social networking such as cooperation networks, online/mobile social networks, and software defined vehicular network [1] is becoming increasingly prevalent. Accompanied with the growth of the networks, mass of network data is released for analytical decisions or scientific researches. However, direct publication of these data, including sensitive information, leads to privacy leakage of individuals. For example, whether two individuals in a social network have a close relationship may be expected to be kept a secret. Therefore, privacy concerns have been raised in increasingly emerging technologies [2–9].

In general, a dataset corresponding to such a network, usually modeled as a graph, is considered as correlated data; that is, tuples in this dataset are dependent. Clearly, privacy preserving in such correlated settings is more difficult because an adversary can infer the relationship of two individuals from their associated friends. Accordingly, our concern is preventing, whether the relationship of two individuals appears in a network dataset, from being unveiled.

Differential privacy (DP), a privacy preserving model originated from statistical database, has currently drawn considerable attentions in research communities [10–17] due to (i) its rigorous and provable privacy guarantee and (ii) its assumption of adversaries’ arbitrary background knowledge. However, DP actually assumes that the tuples in databases are independent [18]. In other words, DP cannot provide claimed privacy guarantees over correlated (nonindependent) data [19]. Therefore, the application of DP over correlated data is a challenge, and how to achieve a differentially private correlated data release method deserves to be further explored.

The focus of our work is on hiding the affinity degree of two individuals in a correlated dataset corresponding to a weighted network, that is, protecting whether the affinity degree of two individuals exceeds a given weight threshold, in a differentially private manner. Toward this end, we first transform a weighted network dataset into a corresponding weighted graph and define the correlation of both edges via Jenson-Shannon Divergence (JS-Divergence). For satisfying specific query need in spite of some utility loss, we utilize Threshold Based Transformation (TBT) algorithm to transform a weighted dataset, by adopting a weight threshold, into an indicated dataset, which also decreases query sensitivity. Finally, we present the notion of -correlated edge differential privacy (CEDP), by combining the correlation analysis and the correlated parameter, that is, the maximal number of correlated tuples, with traditional DP, and design differentially private network data release (NDR) algorithm to obtain better utilities while maintaining DP guarantee. Experimental results over real and synthetic network datasets also show the advantages of the proposed method. The framework of our solution is shown in Figure 1.

The contributions of our work are as follows.

First, we extract the Edge Profile (PF) vectors of edges in a weighted graph corresponding to a network dataset and then define the correlation of both edges via JS-Divergence. The inferred correlation analysis is more reasonable for datasets corresponding to such networks, since the typical Pearson correlation coefficient assumes that sample data follows normal distribution; however, the degree and weight distributions in such networks are often not so.

Second, we propose the -CEDP model based on our result of correlation analysis and the introduction of the correlated parameter, which makes DP over correlated datasets applicable and flexible. Furthermore, the NDR algorithm, based on correlated sensitivity and Laplace mechanism, is proposed, which also satisfies -CEDP and achieves the tradeoff of privacy and utility.

Third, we utilize TBT algorithm to transform a raw weighted dataset into an indicated dataset; that is, a weight value is equal to 1 or 0, by adopting a weight threshold, to satisfy specific real need and decrease query sensitivity. Admittedly, some utility loss exists in such transformation. However, many queries in real world only need Boolean values indicated by one and zero instead of accurate numeric answers. Therefore, this solution provides a feasible way for decreasing query sensitivity while maintaining real query need.

The rest of this paper is organized as follows. Section 2 discusses related literature. Section 3 provides the preliminaries. In Section 4, correlation analysis of both edges in a weighted graph is presented, and the -CEDP model and sensitivity calculation are proposed. Furthermore, a differentially private NDR algorithm, including TBT algorithm, to obtain the tradeoff between privacy and utility over correlated data is proposed in Section 5. The extensive experiments are illustrated in Section 6. Finally, Section 7 concludes the paper.

Compared with previous works in privacy preserving, DP proposed by Dwork [20] provides a probabilistic formulation, which represents that adversaries learn little from both databases differing in one tuple even if adversaries know about all tuples except the target one. In other words, the inference abilities of adversaries about the presence or absence of a tuple are bounded regardless of adversaries’ knowledge; that is, the presence or absence of a tuple is probabilistically indistinguishable for adversaries.

Currently, DP has drawn much attention in privacy preserving work needed in many fields. Wang et al. [10] considered a unified privacy distortion framework, where the distortion is defined to be the expected Hamming distance between the input and output databases, and investigated the relation between three different notions of privacy: identifiability, differential privacy, and mutual-information privacy. To provide personalized recommendation in big data resulting from social networks and maintain user privacy, a cloud-assisted differentially private video recommendation system based on distributed online learning was proposed [11]. The work in [12] proposed a new privacy preserving smart metering scheme for smart grid, which supports data aggregation, differential privacy, fault tolerance, and range-based filtering simultaneously. To et al. [13] introduced a novel privacy-aware framework for spatial crowdsourcing, which enables the participation of workers without compromising their location privacy. Focusing on the privacy protection of sensitive information in body area networks, the authors in [14, 15] proposed different privacy preserving schemes, based on differential privacy model, via a tree structure and dynamic noise thresholds, respectively. The work in [16] proposed a novel differentially private frequent sequence mining algorithm by leveraging a sampling-based candidate pruning technique, which satisfies -differential privacy and can privately find frequent sequences with high accuracy. In order to protect users’ privacy in ridesharing services, a jointly differentially private scheduling protocol has been proposed [17], which aims to protect riders’ location information and minimize the total additional vehicle mileage in the ridesharing system.

However, existing works have found that DP provides weaker privacy guarantee over nonindependent data; that is, DP needs more noise added to the output query result to cancel out the impact of correlations among tuples on privacy guarantee. Undoubtedly, how to analyze correlations among tuples and apply them into DP are desired to be further explored. For example, Kifer and Machanavajjhala [19] first explicitly doubted the privacy guarantee of DP in correlated settings, for example, social networks, and then adopted the subsequently proposed privacy framework, that is, Pufferfish, to formalize and prove that DP assumes independence between tuples [18]. Inspired by the Pufferfish framework, Blowfish privacy [21] was proposed to achieve the tradeoff between privacy and utility using policies specifying secrets and constraints. Similarly, the authors in [22] proposed Bayesian DP to evaluate the level of private information leakage even when data is correlated and prior knowledge is incomplete. The work in [23] regarded the correlation among tuples as complete correlation and multiplied the query sensitivity with the number of correlated tuples in publishing correlated network data, which leaves room for fine-grained correlation analysis in the following work. Aiming to decrease the noise amount, Zhu et al. [24] depicted the correlation between tuples via Pearson correlation coefficient, including complete correlation, partial correlation, and independence. Liu et al. [25] inferred the dependence coefficient, distributed in interval , to evaluate the probabilistic correlation between two tuples in a more fine-grained manner, thus reducing the query sensitivity which results in less noise. Considering temporal correlations of a moving user’s locations, the work in [26] leveraged a hidden Markov model to establish a location set and proposed a variant of DP to protect location privacy. Wu et al. [27] proposed the definition of correlated differential privacy to evaluate the real privacy level of a single dataset influenced by the other datasets when multiple datasets are correlated. The work in [28] formalized the privacy preservation problem to an optimization problem by modeling the temporal correlations among contexts and further proposed an efficient context-aware privacy preserving algorithm. Cao et al. [29] modeled the temporal correlations using Markov model and investigated the privacy leakage of a traditional DP mechanism under temporal correlations in the context of continuous data release. The work in [30] quantified the location correlation between two users through the similarity measurement of two hidden Markov models and applied differential privacy via private candidate sets to achieve the multiuser location correlation protection.

As seen from the above discussions, correlation analysis plays an important role in privacy preserving mechanisms, which directly influences the tradeoff between privacy protection and service utility. Obviously, the more accurate the correlation analysis, the better the balance of both aspects. Therefore, we attribute the underestimated privacy guarantee of DP over correlated data to the lack of data knowledge, and our work starts from data correlation analysis.

In this paper, we focus on correlated datasets corresponding to weighted cooperation networks. Different from the existing methods of correlation analysis, for example, simple multiplication in [23], Pearson correlation coefficient in [24], and the maximal information coefficient in [31], we extract the PF vectors of edges in a weighted graph corresponding to a correlated dataset and then define the correlation of both edges via JS-Divergence, which is more accurate and reasonable. Specifically, the work in [23] assumes both tuples are completely correlated; however, our proposed correlation results lie in interval representing multiple correlation including complete correlation. In addition, the work in [24] assumes sample data follows normal distribution, while our method is not the case. Also, the maximal information coefficient proposed in [31] satisfies two heuristic properties including generality and equitability, and we will consider it in our future work.

3. Preliminaries

3.1. Differential Privacy

Differential privacy provides the privacy guarantee for an individual in the probabilistic sense [20]. It is defined as follows.

Definition 1 (-differential privacy). A randomized mechanism satisfies -differential privacy if, for any pair of databases and differing in only one tuple and for any output representing the possible output set of , where is the privacy budget depicting the probabilistic difference between the same outputs of over and .
Generally, DP is achieved via two mechanisms: Laplace mechanism [32] and exponential mechanism [33]. Both mechanisms include a concept of global sensitivity [20], which reveals DP’s preferable choice of protecting the extreme case.

Definition 2 (global sensitivity). For any query function , where is a dataset and is a -dimension real-valued vector, the global sensitivity of is defined as where and denote any pair of databases differing in only one tuple and denotes norm.
Laplace mechanism, used in this paper, is formally presented as follows.

Theorem 3 (Laplace mechanism). Given any query function , where is a dataset and is a -dimension real-valued vector, the global sensitivity of , and privacy budget , a randomized mechanism provides the -differential privacy, where denotes Laplace noise.

3.2. Weighted Adjacency Matrix

In this paper, we model a correlated dataset as a weighted undirected simple graph , where is the set of vertices and is the number of vertices, is the set of edges and , , , , , and is the set of weights where weight corresponds with edge . Then, the weighted adjacency matrix of can be denoted as where represents the affinity degree between two individuals. Obviously, the weighted adjacency matrix is symmetric.

Example 4. Suppose a raw weighted dataset is listed in Table 1. Then, the corresponding weighted adjacency matrix of can be denoted as

3.3. Correlation Metric

Motivated by the entropy in information theory, we adopt JS-Divergence, inferred from Kullback-Leibler Divergence (KL-Divergence) [23], to depict the difference of two distributions, which can be transformed to depict the correlation of two tuples in a correlated dataset.

Definition 5 (KL-Divergence). Suppose and are the probability distributions of random variables and ; then the KL-Divergence of and is defined as follows: Here = 0 is required. Based on KL-Divergence, we can obtain JS-Divergence as follows.

Definition 6 (JS-Divergence). Suppose and are the probability distributions of random variables and , and ; then the JS-Divergence of and is defined as follows:

4. Correlation Analysis of Weighted Edges

In this section, we first discuss how to define the correlation of both edges in a weighted graph corresponding to a network dataset and then introduce why and how we conduct dataset transformation based on a given weight threshold. Finally, we define the -CEDP model and calculate the correlated sensitivity for smaller added noise.

4.1. Correlation Definition

For achieving the correlation of tuples in a raw weighted dataset , we first obtain a weighted graph , whose weighted adjacency matrix is denoted by . Then, the correlation problem is changed to seeking the correlation of edges in . To this end, we first describe the PF vector of a weighted edge from the perspectives of relational strength and network structure and then define the correlation of both edges via JS-Divergence instead of Pearson correlation coefficient.

For a weighted edge , suppose represents the set of vertices connected with and represents the set of vertices connected with ; we extract the PF vector of , denoted by , from the perspectives of relational strength and network structure simultaneously. Specifically, we obtain from the global weights of all edges. In addition, we get and from the local weights of edge . On the other hand, similar to the representation of relational strength, and are constructed, by introducing the node degree , to depict the global active degree for both vertices of edge . Also, is adopted, via the set similarity, to depict the ratio of the number of common vertices connecting and to that of and . Meanwhile, and are used to depict the local active degree for both vertices of edge . Combining the above factors, we obtain as follows:

Similarly, for any other edge , according to (8), we define as follows:

Note that Pearson correlation coefficient assumes that sample data follows normal distribution. However, in social networks, the weight and degree distributions do not follow such distribution, which is also verified by our experiments shown in Figure 2, where geom and out are the abbreviations of geom.net [34–36] and out.moreno_lesmis_lesmis [37–39] for simplicity. Specifically, geom.net is the authors collaboration network in Computational Geometry based on the file geombib.bib, and the reduced simple network contains 7343 vertices and 11898 edges. Two authors are linked with an edge, iff they wrote a common work. The value of an edge is the number of common works. out.moreno_lesmis_lesmis is the characters cooccurrences network in Victor Hugo’s novel “Les Misérables,” and it contains 77 vertices and 254 edges. A node represents a character and an edge between two nodes shows that these two characters appeared in the same chapter of the book. The weight of each link indicates how often such a coappearance occurred.

(a) Degree distribution on geom

(b) Weight distribution on geom

(c) Degree distribution on out

(d) Weight distribution on out

Since our constructed PF vectors of edges do not satisfy the assumption of normal distribution, we adopt JS-Divergence, instead of Pearson correlation coefficient, to measure the CORrelation (COR) of any two edges , in a weighted graph. To this end, we normalize and as and , which are two probability distributions. Therefore, we have where + + + and where + + + .

Meanwhile, we consider the distance of both edges as follows.

Definition 7 (edge distance). Suppose and are two edges in graph , denotes the length of the shortest path between nodes and , and is the index of the smallest value in vector ; then the distance of and is defined as follows:Specifically, we first calculate the distances including , , , and and then determine the smallest one among these distances and its index , complete the calculation of the distance of another pair of nodes, and finally obtain the distance of two edges and .

Based on Definitions 6 and 7, we define the CORrelation (COR) of two probability distributions via JS-Divergence as follows.

Definition 8 (CORrelation). Suppose and are the probability distributions of random variables and ; then the CORrelation of and is defined as follows:According to (10)–(13) and Definition 6, we adopt the normalized PN vectors of edges , to measure their correlation as where denotes the element of vector and

Substituting (15) into (14), we obtain

In our opinion, the proposed correlation definition, extracted from two aspects of relational strength and network structure, is more reasonable. The rationale is (i) graph models, commonly abstracted from networks, reflect inherent dependent relations of individuals, which naturally form edge correlations and (ii) the weights of edges in weighted graphs describe the affinity degree of individuals’ relations, which also influence the variances of edge correlations.

Example 9. Take and in as an example to demonstrate the calculation of . According to (8) and (9), we have Furthermore, according to (10) and (11), we get Finally, according to (16), we obtain

4.2. Dataset Transformation

We consider some real world situations that do not need exact query answers. For example, people sometimes only want to learn about whether two individuals have an intimate relationship or not, rather than the specific number of communication or cooperation. So the privacy concern at this time is to avoid the leakage of close relationship, that is, yes or no. Therefore, the first thing we focus on is to transform a weighted dataset to an indicated dataset , based on a given weight threshold . In other words, we consider replacing query “Select SUM(weight) from where ” with query “Select COUNT() from where ”, which satisfies some specific situations and decreases the query sensitivity simultaneously. Note that this method aims to avoid the leakage of whether an edge satisfying the given threshold condition exists and not to avoid the weights of edges satisfying the one exposed. In our opinion, this solution is reasonable and suitable for achieving privacy protection via DP in spite of some utility loss.

To this end, we propose the TBT algorithm to modify raw weight values in as an indicated value; that is, if ; otherwise, , thus transforming to . The TBT algorithm is presented in Algorithm 1.

Input: Weighted dataset , weight threshold .
Output: Indicated dataset .
() for (Each tuple ) do
()if () then
() = 1;
()else
() = 0;
()end if
() end for
() return with indicated values.

4.3. Correlated Edge Differential Privacy

As discussed above, we only consider the situations: the query answers responding to a correlated weighted data are yes or no, which indicates whether two individuals have close relationship. That is, the privacy concern herein is to avoid the leakage of whether there is a close relationship between two individuals, in a weighted dataset whose at most tuples are correlated, where is the correlated parameter. To this end, we first define correlated neighboring databases as follows.

Definition 10 (correlated neighboring databases). Any pair of databases and are correlated neighboring databases, if the weight change of a tuple in results in the weight changes of at most other correlated tuples in based on the correlation of both tuples.

Note that the neighboring databases in Definition 10 are described by two parameters: the correlation aforementioned and the correlated parameter . Specifically, we have the following.

(i) Based on JS-Divergence, we have the following conclusion about the correlation .

Theorem 11. For any two edges and in the weighted graph corresponding to a network dataset , holds.

Proof. For the ease of exposition, we denote the last two items in the numerator of (16) as follows.Since denotes a probability distribution, we havewe consider two cases separately.
Case 1 (). Since when , we haveSubstituting (22) into (23), we haveSimilarly, we obtainCombining (21), (24), and (25), we haveCase 2 (). Since when , we haveSimilarly, we obtainCombining (21), (27), and (28), we haveCombining (26) with (29), then we haveNote that due to (12). Finally, according to (16) and (30), we have

Clearly, denotes is independent of ; that is, the corresponding tuples in a dataset are independent. denotes is fully dependent to ; that is, the corresponding tuples in a dataset are fully correlated. denotes is partially dependent on ; that is, the corresponding tuples in a dataset are partially correlated.

(ii) Similar to [23–25], we introduce the correlated parameter representing that there are at most correlated tuples in a dataset. In other words, a tuple is correlated with at most other tuples; that is, an edge in a graph is correlated with at most other edges. Obviously, represents the independent case of tuples in a dataset, represents the fully correlated case of tuples in a dataset, and represents the partially correlated case of tuples in a dataset. Therefore, the variance of increases the flexibility of Definition 10.

Furthermore, we define the -CEDP model as follows.

Definition 12 (-correlated edge differential privacy). A randomized mechanism satisfies -differential privacy if, for any neighboring databases and and for any output representing the possible output set of , where is the privacy budget depicting the probabilistic difference between the same outputs of over and and and are the correlation of two tuples and the correlated parameter representing the maximal number of correlated tuples, respectively.

4.4. Sensitivity Calculation

After transforming weighted dataset to indicated dataset , we add Laplace noise to query answers based on the -CEDP model. Laplace noise is determined by two factors: privacy budget and the global sensitivity of a query, and the latter refers to the maximal change of query result due to the modification of only one tuple. Here, for a query , assume the global sensitivity of , resulting from the change of tuple , in independent settings is . Clearly, . However, for dataset with tuples where at most tuples are correlated, the query sensitivity resulted from modifying tuple , called Edge Sensitivity denoted by , is more complex. Specifically, (i) if , that is, , denoting the independent case, , (ii) if and , denoting the fully correlated case, , and (iii) if and , denoting the partially correlated case, is defined as follows.

Since the change of a tuple only affects at most other correlated tuples, can be rewritten as

Finally, we have the correlated sensitivity denoted by CS, that is, the maximal in dataset , as follows:

Note that the CS is also suitable for the independent and fully correlated cases. Based on the CS, we can achieve -CEDP, which is shown as follows.

Theorem 13. Given any query function , where is a correlated dataset with the correlation definition and the correlated parameter and is a -dimension real-valued vector, the correlated sensitivity of , and privacy budget , a randomized mechanism , provides -CEDP, where denotes Laplace noise.

Proof. According to (35), the following holds:Finally, combining (37) with (38), we have

For indicated dataset with weight and the correlated parameter , we can easily infer the global sensitivity is equal to . Due to , we have . Therefore, CS is less than the global sensitivity. In other words, added noise via is less than that via the global sensitivity; hence the utility of the mechanism based on is better.

5. Network Data Release Method

Based on indicated dataset and the CS discussed in Section 4, we proposed a network data release method in special cases, which achieves the -CEDP model. Furthermore, the theoretical analysis of privacy and utility is elaborated.

5.1. NDR Algorithm

The goal of NDR algorithm is to achieve the tradeoff between privacy and utility under correlated settings. To this end, three phases are taken into account: (i) for achieving the correlation of two tuples in , we transform dataset into the corresponding graph and calculate the correlation of both edges via the JS-Divergence, (ii) based on a given weight threshold , we convert into via TBT algorithm. In other words, the sensitivity in independent settings is 1, irrelevant to the weights. Furthermore, we implement the calculation of CS, and (iii) combining the affordable privacy budget with CS, we calculate the added Laplace noise and finally obtain the noisy query result for query in query set . The NDR algorithm is presented in Algorithm 2.

Input: Original dataset , privacy budget , correlated parameter , threshold and query set .
Output: Noisy query result .
() Calculate the correlation of any two edges in according to Eq. (16);
() Call Algorithm TBT(, ), return ;
() Calculate the correlated sensitivity according to Eq. (35);
() for (Each ) do
();
() end for
() return Noisy query result as .

5.2. Utility Analysis

Clearly, NDR algorithm satisfies the -CEDP model. To conduct utility analysis, we adopt the -useful definition in [40] to depict the utility of NDR as follows.

Definition 14. A mechanism NDR is -useful for a query in all queries , if, with probability at least , for any query and dataset , NDR satisfies .

Based on Definition 14, we obtain the following utility analysis.

Theorem 15. For any query and dataset , a mechanism NDR satisfies -useful if NDR can obtain with at least probability when .

Proof. By Definition 14, we haveIf , then the following holds:According to (40) and (41), we obtain Therefore, mechanism NDR satisfies -useful.

6. Experiment

Generally, the goal of privacy preserving is to achieve maximal utilities while maintaining required privacy guarantees; that is, the tradeoff between privacy and utility is desired. In this section, we first present the better privacy guarantees and utilities of Algorithm NDR based on the definition of -useful and then further demonstrate its better utilities in terms of mean absolute error (MAE). Here the Baseline algorithm adopts the multiplication in [23] to handle with the correlated tuples in a network dataset. Considering the constraint of applying Pearson correlation coefficient, we do not adopt the method using Pearson correlation coefficient as comparison reference in the following experiments. To verify the advantages of Algorithm NDR concerning privacy and utility, we conduct NDR and Baseline algorithms on three datasets: geom, out explained in Section 4, and randomly generated dataset (rgd), which is a randomly generated weighted network containing 100 vertices and 1645 edges. The weight of each edge is uniformly distributed in interval . Such doing can also show the better adaption of the proposed correlation metric and algorithm over real world and synthetic datasets. Without loss of generality, threshold here is set as 0, and the selection of its value is to be investigated in future work.

6.1. Privacy and Utility

We analyzed privacy and utility of NDR and Baseline algorithms in terms of -useful when the correlated parameter is set to the size of the whole dataset. In terms of privacy, we evaluate the consumption of privacy budget under the same accuracy and the same possibility . Clearly, the smaller the consumed privacy budget, the better the performance of algorithm.

Figures 3(a), 3(c), and 3(e) present the variation of privacy budget, consumed by algorithms NDR and Baseline based on datasets geom, out, and rgd, with the increase of from 1 to 40 when equals 0.1 and 0.5, respectively. From Figures 3(a), 3(c), and 3(e), we can see that privacy budgets decrease in all cases with the increase of . The reason is that, with the relaxation of , larger noise can be allowed when stays fixed; therefore algorithm can consume smaller privacy budget. Meanwhile, we also see that privacy budgets decrease with the increase of from 0.1 to 0.5 when stays fixed. Because the possibility of satisfying accuracy requirement decreases with the increase of , which means that algorithms can have more chances to add larger noise; that is, algorithms can use smaller privacy budget. Such advantage of both algorithms especially in the case of is more obvious than that of other ones. In fact, when stays constant, the higher the accuracy presented by , the more the privacy budget needed by algorithms.

(a) Privacy on geom

(b) Utility on geom

(c) Privacy on out

(d) Utility on out

(e) Privacy on rgd

(f) Utility on rgd

On the other hand, Figures 3(b), 3(d), and 3(f) demonstrate the variation of of algorithms NDR and Baseline based on datasets geom, out, and rgd, with the increase of from 0 to 10000 when equals 0.1 and 1.0, respectively. We can see that decreases in all cases with the increase of ; that is, the possibility increases with the increase of . Note that this trend varies from dataset to dataset; for example, the possibility and accuracy of algorithm NDR over datasets geom and out are evidently different when . Clearly, when is determined, the possibility increases with the relaxation of . In addition, we find that algorithm NDR can have larger possibility, that is, smaller , than the Baseline algorithm to achieve the same accuracy under the same level of privacy budget . Also, when increases from 0.1 to 1.0, algorithms also have possibility to achieve the same accuracy , which is easily understood from (40).

6.2. Utility

We adopt MAE, that is, , to depict the performance of algorithms NDR and Baseline. Obviously, the smaller the MAE value, the better the utility. For each dataset, 10000 queries are randomly generated, and each query result ranges from 0 to the maximal number of tuples.

Figures 4(a), 4(c), and 4(e) show the variances of MAEs of NDR and Baseline algorithms, over datasets geom, out, and rgd, under various privacy budgets when the correlated parameter is 10. From Figures 4(a), 4(c), and 4(e), we can see that the MAEs of both algorithms decrease with the increase of privacy budget from 0.1 to 1. Because larger privacy budget leads to smaller noise added to raw data, the downtrends always hold. More importantly, algorithm NDR can obtain better accuracy, that is, smaller MAE, under various . Furthermore, the smaller the privacy budget , the more obvious such advantage. The reason is that algorithm NDR adopts the more reasonable correlation metric compared with the Baseline algorithm.

(a) Utility on geom (correlated parameter = 10)

(b) Utility on geom (privacy budget = 0.5)

(c) Utility on out (correlated parameter )

(d) Utility on out (privacy budget = 0.5)

(e) Utility on rgd (correlated parameter )

(f) Utility on rgd (privacy budget = 0.5)

Figures 4(b), 4(d), and 4(f) show the variances of MAEs of NDR and Baseline algorithms, over datasets geom, out, and rgd, under various correlated parameters when privacy budget is 0.5. In Figures 4(b), 4(d), and 4(f), we find that the MAEs of both algorithms increase with the increase of correlated parameter from 1 to 40. Undoubtedly, with the increase of the number of correlated tuples in a dataset, larger noise needs to be injected to eliminate the effect of tuple correlation, which necessarily results in the increase of MAE. In addition, we also note that algorithm NDR can obtain better accuracy, that is, smaller MAE, under various correlated parameters compared with the Baseline algorithm. Also, the larger the correlated parameter, the larger such advantage. All these advantages are due to the more reasonable correlation metric, which is proposed in Section 4 and adopted by algorithm NDR.

7. Conclusion

In this paper, we focus on adopting differential privacy model to avoid the leakage of close relationship between two individuals in a network. To this end, we first extract the PF vector from both aspects of node degree and edge weight to depict an edge in a network dataset and then design the correlation metric of two edges via JS-Divergence to avoid the constraint of adopting Pearson correlation coefficient. Next, we proposed the -CEDP model to deal with the correlated dataset by introducing two parameters including our correlation metric and the correlated parameter. Furthermore, we present the NDR algorithm based on the -CEDP and discuss its privacy and utility in terms of the definition of -useful. Extensive experiments on real and synthetic network datasets verify the advantages of our proposed privacy preserving model and algorithm concerning privacy and utility. Admittedly, the proposed solution is currently appropriate for weighted network datasets, and other datasets are out of the scope of this paper. In future work, we will discuss the impacts of choosing weight threshold on algorithm performances, explore more appropriate correlation metrics, and investigate privacy preserving algorithms in different applications.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is partly supported by the Fundamental Research Funds for the Central Universities of China under Grant no. GK201703061, the National Natural Science Foundation of China under Grants nos. 61402273 and 61373083, the National Science Foundation (NSF) under Grants nos. CNS-1252292, 1741277, and 1704287, and the Natural Science Basic Research Plan in Shaanxi Province of China under Grants nos. 2017JM6060 and 2017JM6103.

References

J. Liu, J. Wan, B. Zeng, Q. Wang, H. Song, and M. Qiu, “A scalable and quick-response software defined vehicular network assisted by mobile edge computing,” IEEE Communications Magazine, vol. 55, no. 7, pp. 94–100, 2017.
View at: Publisher Site | Google Scholar
P. Hu, H. Ning, T. Qiu, H. Song, Y. Wang, and X. Yao, “Security and privacy preservation scheme of face identification and resolution framework using fog computing in internet of things,” IEEE Internet of Things Journal, vol. 4, no. 5, pp. 1143–1155, 2017.
View at: Publisher Site | Google Scholar
Z. He, Z. Cai, and J. Yu, “Latent-data privacy preserving with customized data utility for social network data,” IEEE Transactions on Vehicular Technology, vol. PP, no. 99, pp. 1–1, 2017.
View at: Publisher Site | Google Scholar
X. Zheng, Z. Cai, J. Yu, C. Wang, and Y. Li, “Follow but no track: privacy preserved profile publishing in cyber-physical social systems,” IEEE Internet of Things Journal, vol. PP, no. 99, pp. 1–1, 2017.
View at: Publisher Site | Google Scholar
Z. Cai, Z. He, X. Guan, and Y. Li, “Collective data-sanitization for preventing sensitive information inference attacks in social networks,” IEEE Transactions on Dependable and Secure Computing, vol. 99, pp. 1–1, 2017.
View at: Publisher Site | Google Scholar
Y. Liang, Z. Cai, Q. Han, and Y. Li, “Location privacy leakage through sensory data,” Security and Communication Networks, vol. 2017, Article ID 7576307, 12 pages, 2017.
View at: Publisher Site | Google Scholar
X. Zheng, Z. Cai, J. Li, and H. Gao, “Location-privacy-aware review publication mechanism for local business service systems,” in Proceedings of the IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pp. 1–9, Atlanta, GA, USA, May 2017.
View at: Publisher Site | Google Scholar
L. Zhang, Z. Cai, and X. Wang, “FakeMask: a novel privacy preserving approach for smartphones,” IEEE Transactions on Network and Service Management, vol. 13, no. 2, pp. 335–348, 2016.
View at: Publisher Site | Google Scholar
Y. Wang, Z. Cai, G. Yin, Y. Gao, X. Tong, and G. Wu, “An incentive mechanism with privacy protection in mobile crowdsourcing systems,” Computer Networks, vol. 102, Supplement C, pp. 157–171, June 2016.
View at: Publisher Site | Google Scholar
W. Wang, L. Ying, and J. Zhang, “On the relation between identifiability, differential privacy, and mutual-information privacy,” Institute of Electrical and Electronics Engineers Transactions on Information Theory, vol. 62, no. 9, pp. 5018–5029, 2016.
View at: Publisher Site | Google Scholar | MathSciNet
P. Zhou, Y. Zhou, D. Wu, and H. Jin, “Differentially Private Online Learning for Cloud-Based Video Recommendation with Multimedia Big Data in Social Networks,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 1217–1229, 2016.
View at: Publisher Site | Google Scholar
J. Ni, K. Zhang, K. Alharbi, X. Lin, N. Zhang, and X. S. Shen, “Differentially private smart metering with fault tolerance and range-based filtering,” IEEE Transactions on Smart Grid, vol. 8, no. 5, pp. 2483–2493, 2017.
View at: Publisher Site | Google Scholar
H. To, G. Ghinita, L. Fan, and C. Shahabi, “Differentially private location protection for worker datasets in spatial crowdsourcing,” IEEE Transactions on Mobile Computing, vol. 16, no. 4, pp. 934–949, 2017.
View at: Publisher Site | Google Scholar
C. Lin, P. Wang, H. Song, Y. Zhou, Q. Liu, and G. Wu, “A differential privacy protection scheme for sensitive big data in body sensor networks,” Annals of Telecommunications-Annales des Télécommunications, vol. 71, no. 9-10, pp. 465–475, 2016.
View at: Publisher Site | Google Scholar
C. Lin, Z. Song, H. Song, Y. Zhou, Y. Wang, and G. Wu, “Differential privacy preserving in big data analytics for connected health,” Journal of Medical Systems, vol. 40, no. 4, article no. 97, pp. 1–9, 2016.
View at: Publisher Site | Google Scholar
S. Xu, S. Su, X. Cheng, K. Xiao, and L. Xiong, “Differentially private frequent sequence mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 11, pp. 2910–2926, 2016.
View at: Publisher Site | Google Scholar
W. Tong, J. Hua, and S. Zhong, “A jointly differentially private scheduling protocol for ridesharing services,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 10, pp. 2444–2456, 2017.
View at: Publisher Site | Google Scholar
D. Kifer and A. Machanavajjhala, “A rigorous and customizable framework for privacy,” in Proceedings of the 31st Symposium on Principles of Database Systems (PODS '12), pp. 77–88, Scottsdale, Arizona, USA, May 2012.
View at: Publisher Site | Google Scholar
D. Kifer and A. Machanavajjhala, “No free lunch in data privacy,” in Proceedings of the 2011 ACM SIGMOD and 30th PODS 2011 Conference on Management of Data (SIGMOD), pp. 193–204, Athens, Greece, June 2011.
View at: Publisher Site | Google Scholar
C. Dwork, “Differential privacy,” in Proceedings of the 33rd International Conference on Automata, Languages and Programming - Volume Part II (ICALP '06), pp. 1–12, Venice, Italy, 2006.
View at: Google Scholar
X. He, A. Machanavajjhala, and B. Ding, “Blowfish privacy: Tuning privacy-utility trade-offs using policies,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, (SIGMOD '14), pp. 1447–1458, Snowbird, Utah, USA, June 2014.
View at: Publisher Site | Google Scholar
B. Yang, I. Sato, and H. Nakagawa, “Bayesian differential privacy on correlated data,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, (SIGMOD '15), pp. 747–762, Melbourne, Victoria, Australia, June 2015.
View at: Publisher Site | Google Scholar
R. Chen, B. C. M. Fung, P. S. Yu, and B. C. Desai, “Correlated network data publication via differential privacy,” The VLDB Journal, vol. 23, no. 4, pp. 653–676, 2014.
View at: Publisher Site | Google Scholar
T. Zhu, P. Xiong, G. Li, and W. Zhou, “Correlated differential privacy: hiding information in Non-IID data set,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 2, article no. A2, pp. 229–242, 2015.
View at: Publisher Site | Google Scholar
C. Liu, S. Chakraborty, and P. Mittal, “Dependence makes you vulnerable: differential privacy under dependent tuples,” in Proceedings of the Network and Distributed System Security Symposium (NDSS '16), San Diego, Calif, USA.
View at: Publisher Site | Google Scholar
Y. Xiao and L. Xiong, “Protecting locations with differential privacy under temporal correlations,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, (CCS '15), pp. 1298–1309, Denver, Colorado, USA, October 2015.
View at: Publisher Site | Google Scholar
X. Wu, T. Wu, M. Khan, Q. Ni, and W. Dou, “Game theory based correlated privacy preserving analysis in big data,” IEEE Transactions on Big Data, vol. PP, no. 99, pp. 1–1, 2017.
View at: Publisher Site | Google Scholar
L. Zhang, Y. Li, L. Wang, J. Lu, P. Li, and X. Wang, “An efficient context-aware privacy preserving approach for smartphones,” Security & Communication Networks, vol. 2017, no. 2, pp. 1–11, 2017.
View at: Google Scholar
Y. Cao, M. Yoshikawa, Y. Xiao, and L. Xiong, “Quantifying differential privacy under temporal correlations,” in Proceeding of the IEEE 33rd International Conference on Data Engineering (ICDE '17), pp. 821–832, San Diego, CA, USA, 2017.
View at: Google Scholar
L. Ou, Z. Qin, Y. Liu, H. Yin, Y. Hu, and H. Chen, “Multi-user location correlation protection with differential privacy,” in Proceedings of the 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS '16), pp. 422–429, Wuhan, China, December 2016.
View at: Publisher Site | Google Scholar
D. N. Reshef, Y. A. Reshef, H. K. Finucane et al., “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
View at: Publisher Site | Google Scholar
C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Proceedings of the Third Conference on Theory of Cryptography (TCC '06), vol. 3876, pp. 265–284, New York, NY, USA, March 2006.
View at: Publisher Site | Google Scholar
F. McSherry and K. Talwar, “Mechanism design via differential privacy,” in Proceedings of the 48th Annual Symposium on Foundations of Computer Science (FOCS '07), pp. 94–103, Providence, RI, USA, October 2007.
View at: Publisher Site | Google Scholar
V. Batagelj and A. Mrvar, “Pajek datasets,” http://vlado.fmf.uni-lj.si/pub/networks/data/.
View at: Google Scholar
N. H. F. Beebe and H. F. Nelson, “Beebe’s bibliographies page,” http://www.math.utah.edu/~beebe/bibliographies.html.
View at: Google Scholar
B. Jones, “Computational geometry database,” http://jeffe.cs.illinois.edu/compgeom/biblios.html.
View at: Google Scholar
Les misérables network dataset – KONECT, 2016, http://konect.uni-koblenz.de/networks/moreno_lesmis.
D. E. Knuth, The stanford graph base: a platform for combinatorial computing, vol. 37, Addison-Wesley Reading, Boston, Mass, USA, 1993.
J. Kunegis, “KONECT — The koblenz network collection,” in Proceeding of the 22nd International Conference on World Wide Web Companion (WWW '13), pp. 1343–1350, Rio de Janeiro, Brazil, May 2013.
View at: Publisher Site | Google Scholar
A. Blum, K. Ligett, and A. Roth, “A learning theory approach to non-interactive database privacy,” in Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing (STOC '08), pp. 609–618, ACM, New York, NY, USA, 2008.
View at: Google Scholar | MathSciNet

Copyright

Copyright © 2017 Junling Lu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

PDF Download Citation

Download other formats

Order printed copies

Views

1027

Downloads

955

Citations

Security and Communication Networks

Cross-Layer Approaches for Enhancing the Security and Privacy of Wireless Communications and Networking

An Edge Correlation Based Differentially Private Network Data Release Method

Abstract

1. Introduction

2. Related Work

3. Preliminaries

3.1. Differential Privacy

3.2. Weighted Adjacency Matrix

3.3. Correlation Metric

4. Correlation Analysis of Weighted Edges

4.1. Correlation Definition

4.2. Dataset Transformation

4.3. Correlated Edge Differential Privacy

4.4. Sensitivity Calculation

5. Network Data Release Method

5.1. NDR Algorithm

5.2. Utility Analysis

6. Experiment

6.1. Privacy and Utility

6.2. Utility

7. Conclusion

Conflicts of Interest

Acknowledgments

References

Copyright