Abstract

Social networks can be analyzed to discover important social issues; however, the analysis may cause privacy disclosure. Edge weights play an important role in social graphs and are often associated with sensitive information (e.g., the price of a commercial trade). In this paper, we propose the MB-CI (Merging Barrels and Consistency Inference) strategy to protect weighted social graphs. By viewing the edge-weight sequence as an unattributed histogram, differential privacy for edge weights can be implemented based on the histogram. Considering that some edges in a social network share the same weight, we merge the barrels with the same count into one group to reduce the required noise. Moreover, k-indistinguishability between groups is proposed to keep differential privacy from being violated, because a simple merging operation may disclose information through the magnitude of the noise itself. To keep most of the shortest paths unchanged, we perform consistency inference according to the original order of the sequence as an important postprocessing step. Experimental results show that the proposed approach effectively improves the accuracy and utility of the released data.

1. Introduction

Social networks, such as Facebook and Twitter, have come to play an important role in people’s daily social interaction. Social network analysis attempts to discover important social issues, including disease transmission, emotional contagion, and occupational mobility. Due to the needs of scientific research and data sharing, social networks are expected to release data without leaking private information. Privacy can be guaranteed by disturbing or encrypting the original data, or by anonymizing it before releasing the data [1–3].

Privacy is a charged term, meaning different things to different people. In social networks, the edge weights may reflect the frequency of communication, the price of a commercial trade, the intimacy of a relationship, and so forth, all of which are associated with sensitive information. A typical example is an intelligence network, in which edge weights denote the contact frequencies between two institutions. Too-frequent communications may imply potential problems. Another example is a commercial trade network, in which edge weights indicate the transaction price between two companies. Most managers would be reluctant to reveal a commercial secret to their adversaries, due to fierce competition. Our goal is to protect the edge weights in social networks from leakage while preserving as much utility as possible.

Das et al. [4] considered edge-weight anonymization in social graphs. They built a linear programming (LP) model to preserve properties of the graph, for example, the shortest paths, k-nearest neighbors, and minimum spanning tree, which are expressible as linear functions of the edge weights. Liu et al. [5] considered perturbing the weights of some edges while trying to preserve the shortest-path lengths and exactly the same shortest paths between some pairs of nodes. They developed two privacy-preserving strategies: Gaussian randomization multiplication and a greedy perturbation algorithm based on graph theory. Costea et al. [6] analyzed how differential privacy can be used to protect the edge weights in graph structures. Nonetheless, simply adding Laplace noise to the edge weights distorts the data very significantly. Our approach is to disturb the edge weights via differential privacy for protection in a way that effectively improves the accuracy and utility of the released data.

Hay et al. [7] showed that it is possible to significantly improve the accuracy of a general class of histogram queries while satisfying differential privacy. The approach carefully chooses the queries to evaluate and then exploits the consistency constraints that should hold over the noisy output. After a postprocessing phase, the final output is differentially private and consistent, but, in addition, it is often much more accurate. The technique was used to very precisely estimate the degree sequence of a graph, which is an important instance of an unattributed histogram. Inspired by the above, we treat the edge-weight sequence as an unattributed histogram in the proposed approach, which is a key step in this paper. To better keep the shortest paths unchanged, we perform consistency inference according to the original order of the sequence.

Xu et al. [8, 9] proposed two algorithms, namely, NoiseFirst and StructureFirst, for computing differentially private histograms. The main difference lies in the relative order of the noise injection and the histogram construction. Going one step further, they extended both solutions to answer arbitrary range queries. StructureFirst constructs an optimal histogram based on the original data. Then, the algorithm randomly moves the boundaries between the barrels, which adds noise to the structure of the histogram. After all the boundaries are fixed, Laplace noise is added to the average counts. Thus, this method introduces two kinds of errors: construction error and noise error. For our specific application, the strategy is to merge all barrels with the same count into one group and then add Laplace noise to each count, so the proposed approach incurs only noise error. In the merging step, to prevent the magnitude of the noise itself from leaking information, inspired by the literature [10, 11], we propose the definition of k-indistinguishability between groups to guarantee differential privacy.

2. Background

In this section, we review the definition of differential privacy and its implementation mechanism. Then, we clarify the concepts of an unattributed histogram versus a conventional histogram.

2.1. Differential Privacy

Dalenius [12] proposed a desideratum for statistical databases: no one should learn anything about an individual while accessing the database. Nevertheless, this type of privacy, an absolute guarantee against disclosures, cannot be achieved because of auxiliary information. Differential privacy [13] sidesteps this problem by moving to a relative guarantee: the probability of any given disclosure changes by at most a small multiplicative factor. Note that a bad disclosure may still occur, but it will not be caused by the presence of an individual’s data in the database. Differential privacy can hide the influence of a single record; that is, the output probability of the same results will not change significantly, whether or not a record is in the data set. Hence, differential privacy makes no assumptions about the background knowledge of any potential adversary. However, we still face the challenge of making the tradeoff between protecting private information and maintaining data utility.

Differential privacy was presented in a series of Dwork’s papers [14–18] and its implementation mechanisms were presented in the literature [19, 20]. McSherry [21] pointed out that a differentially private algorithm for some complex privacy problem satisfies two composition properties. Recently, differential privacy has mainly been used in data publishing, including releasing histograms [7–9, 22–24] and graph data [22, 25–28], and also in data mining [29–31].

Definition 1. A randomized function K gives ε-differential privacy if, for all data sets D1 and D2 differing on at most one element, and all S ⊆ Range(K),

Pr[K(D1) ∈ S] ≤ exp(ε) · Pr[K(D2) ∈ S].

Here, ε is a small positive value with which one can balance the tradeoff between privacy and accuracy. Relatively speaking, a smaller ε means higher privacy and lower accuracy, and vice versa. Usually, ε is chosen by the user administering the privacy policy; therefore, selecting a reasonable ε is worth further study. Moreover, an algorithm that provides ε-differential privacy for neighboring databases differing on a single record also provides kε-differential privacy [14] for neighboring databases differing on at most k records.

To achieve differential privacy, a certain amount of random noise must be added to the answer of the query set. Intuitively, its magnitude should cover up the largest change that a single record could have on the output.

Definition 2. Let Q = (q1, q2, ..., qd) be a sequence of counting queries. The sensitivity of Q is denoted by ΔQ:

ΔQ = max over neighboring D1, D2 of ||Q(D1) − Q(D2)||1.

In particular, a simple counting query has ΔQ = 1. For example, consider a private personnel database with an attribute column indicating marital status. An analyst may query the number of married persons, q1, and the number of unmarried persons, q2; this query set has ΔQ = 1, because adding or removing one record changes exactly one output by a value of one. Furthermore, if he simultaneously queries the total number of people, q3, the query set has ΔQ = 2, because one change could affect two outputs, each by a value of one. Note that, in the second query set, there exists a constraint, q1 + q2 = q3, by which one can search for the closest consistent solution to boost the accuracy of the results.
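The sensitivities claimed in the example above can be checked mechanically. The following sketch (our own toy code; the table layout and query names q1, q2, q3 are illustrative, not from the paper) evaluates both query sets on a pair of neighboring databases and measures the L1 distance between the answer vectors:

```python
# Illustration of query-set sensitivity (Definition 2) on a toy personnel table.

def query_counts(db):
    """q1 = number of married persons, q2 = number of unmarried persons."""
    married = sum(1 for r in db if r == "married")
    return [married, len(db) - married]

def query_counts_with_total(db):
    """Same as above plus q3 = total number of people (so q1 + q2 = q3)."""
    q1, q2 = query_counts(db)
    return [q1, q2, q1 + q2]

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

db1 = ["married", "married", "unmarried"]
db2 = ["married", "married"]              # neighboring database: one record removed

# Removing one record changes exactly one of (q1, q2): sensitivity 1.
assert l1_distance(query_counts(db1), query_counts(db2)) == 1
# With the total q3 = q1 + q2, the same change affects two outputs: sensitivity 2.
assert l1_distance(query_counts_with_total(db1), query_counts_with_total(db2)) == 2
```

The constraint q1 + q2 = q3 is exactly the kind of consistency that postprocessing can exploit later.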

The Laplace mechanism [19], the most common noise-adding mechanism, disturbs the outputs by adding noise produced by a Laplace distribution to achieve differential privacy.

Proposition 3. Let Q be a query sequence of length d. The randomized algorithm A, which takes database D as input and outputs the following vector, satisfies ε-differential privacy:

A(D) = Q(D) + ⟨Lap(ΔQ/ε)⟩^d.

Here, ⟨Lap(ΔQ/ε)⟩^d denotes a d-length vector of i.i.d. (independent and identically distributed) samples from a Laplace distribution with scale ΔQ/ε. In other words, the magnitude of the noise is proportional to ΔQ and inversely proportional to ε. Proof of the proposition can be found in the literature [19].
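The Laplace mechanism is short enough to sketch directly. The code below is our own minimal version (function names are ours); it samples Laplace noise as the difference of two exponentials, a standard identity, and adds it to each query answer:

```python
import random

def laplace_sample(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two i.i.d. Exponential(1/scale) variables is
    Laplace-distributed with the given scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(answers, sensitivity, epsilon, rng=None):
    """Add i.i.d. Lap(sensitivity/epsilon) noise to each query answer."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [a + laplace_sample(scale, rng) for a in answers]
```

As the proposition states, the noise scale grows with the sensitivity and shrinks as the budget ε grows; with a very large ε the noisy answers are nearly exact.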

Sometimes we need to combine several differentially private mechanisms in complex privacy issues as in this paper, so we can take advantage of the combination properties [21] of differential privacy.

Proposition 4. Let each mechanism Ai provide εi-differential privacy. A sequence of mechanisms Ai(D) over the database D provides (Σi εi)-differential privacy.

This is the sequential composition theorem of differential privacy. Intuitively, the total budget ε can be split among a sequence of differentially private mechanisms, and the final output still provides ε-differential privacy.

2.2. Unattributed Histogram

A conventional histogram adopts a box-dividing technique, a popular form of data reduction, to approximate the data distribution. It divides a ranged attribute into disjoint subsets, or barrels, which usually are continuous intervals of the given attribute, and then computes counting queries for each specified range. In contrast, in an unattributed histogram, each barrel represents only a single attribute value, that is, a unit-length range. An important instance is the degree distribution of a network, in which each barrel is the degree of one node, and the histogram is simply the sorted degree sequence.

In the context of our application, first consider a communication database D, organized as a set of records, in which each record represents one communication between two addresses. Whenever a communication occurs, a record is added to the database. Next, this database is converted to a multigraph: if the same record appears t times, there are t edges between the two addresses in the graph. Finally, the multigraph is transformed into a weighted graph, in which the edge weight is the number of edges between any two vertices. Therefore, we view each edge weight as one barrel, and the edge-weight sequence as an unattributed histogram. Naturally, differential privacy for edge weights can be implemented based on this histogram.
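The record-to-weighted-graph conversion above amounts to counting repeated pairs. A minimal sketch (ours; the record format is an assumption) using a `Counter`:

```python
# Collapse a list of communication records into a weighted graph whose
# edge weight is the number of communications between a pair of addresses.
from collections import Counter

def to_weighted_graph(records):
    """records: iterable of (src, dst) pairs; returns {edge: weight}."""
    # Treat the graph as undirected: (a, b) and (b, a) are the same edge.
    return Counter(tuple(sorted(pair)) for pair in records)

log = [("a", "b"), ("b", "a"), ("a", "b"), ("b", "c")]
weights = to_weighted_graph(log)
assert weights[("a", "b")] == 3   # the record appeared 3 times -> weight 3
assert weights[("b", "c")] == 1
```

Sorting the values of such a dictionary yields exactly the unattributed histogram used in the rest of the paper.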

3. Methods

In this section, we detail the Lap strategy and the MB-CI (Merging Barrels and Consistency Inference) strategy, which are used to achieve differential privacy for edge weights, and we provide the algorithm of the MB-CI strategy. To evaluate and quantify the error of the added noise, a formula based on the common squared error is given, taking the expectation over the possible randomness.

Definition 5. For a primitive edge-weight sequence W = (w1, w2, ..., wn) and its noisy sequence W′ = (w′1, w′2, ..., w′n), the introduced error is

Error(W′) = (1/n) · Σ_{i=1}^{n} E[(w′i − wi)^2].

Here, n is the length of the sequence.

3.1. Lap Strategy

To achieve differential privacy for edge weights, the naive strategy, called Lap strategy, is to directly add Laplace noise without any processing.

Theorem 6. The edge-weight sequence has sensitivity ΔW = wmax − wmin, where wmax is the maximum of the edge weights and wmin is the minimum.

Proof. Given a graph and its neighbor graph differing on at most one edge weight, the edge-weight sequence has only one value changed, by at most wmax − wmin, with all other values kept the same. According to Definition 2, the edge-weight sequence has ΔW = wmax − wmin. For simplicity, ΔW is denoted by Δw.
On the basis of Proposition 3, the scale of the Laplace noise added is Δw/ε, so each edge weight should receive Lap(Δw/ε). The error of this strategy can be computed as follows:

Error(W′) = (1/n) · Σ_{i=1}^{n} E[Lap(Δw/ε)^2] = 2(Δw/ε)^2.
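The Lap strategy can be sketched in a few lines (our own code; helper names are ours, and the weights are assumed not all equal, so Δw > 0). The empirical mean squared error of its output converges to the 2(Δw/ε)^2 value derived above:

```python
# A sketch of the naive Lap strategy: every edge weight gets Lap(dw/epsilon)
# noise, where dw = max - min is the sensitivity from Theorem 6.
import random

def laplace_sample(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def lap_strategy(weights, epsilon, rng=None):
    rng = rng or random.Random()
    dw = max(weights) - min(weights)      # sensitivity of the weight sequence
    return [w + laplace_sample(dw / epsilon, rng) for w in weights]

def mean_squared_error(original, noisy):
    # Empirical counterpart of the error measure in Definition 5.
    return sum((o - v) ** 2 for o, v in zip(original, noisy)) / len(original)
```

For example, with a very large ε the added noise, and hence the measured error, is negligible.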

3.2. MB-CI Strategy

We propose a novel strategy that needs less noise to achieve differential privacy for edge weights and, for the sake of global utility, keeps most of the shortest paths unchanged. The proposed MB-CI strategy includes two key steps: merging barrels and consistency inference.

3.2.1. Merging Barrels

Consider that in a social network, especially a large one, some edges, or even many, share the same weight. Viewing the edge-weight sequence as an unattributed histogram, we merge barrels with the same count into one group to reduce the added noise. This does not introduce histogram construction error, as the value of each barrel does not change after merging. We can then add less Laplace noise to each merged barrel, while adding the same amount as the Lap strategy to the other ones.

Theorem 7. The noise added to every merged barrel is Lap(Δw/ε)/m, where m is the number of barrels merged into the group.

Proof. Given a graph and its neighbor graph differing on at most one edge weight, the change affects at most one group, by a value of Δw; since the m barrels in that group share one common count, each barrel is affected by at most Δw/m. Hence, the noise added to every merged barrel is Lap(Δw/ε)/m.
The unmerged barrels still need Lap(Δw/ε) added, as in the Lap strategy. The error of this approach can be calculated as follows. Given n weights merged into g groups, where the first group has n1 values, and so forth, the gth group has ng values, and n1 + n2 + ... + ng = n. Then

Error(W′) = (1/n) · Σ_{i=1}^{g} ni · E[(Lap(Δw/ε)/ni)^2] = (1/n) · Σ_{i=1}^{g} 2(Δw/ε)^2 / ni.

Moreover, differential privacy may be violated if we simply merge all barrels with the same count into one group, because the magnitude of the noise itself may disclose some information. Therefore, k-indistinguishability between groups is proposed to guarantee that merged groups require the same amount of noise; that is, the groups are indistinguishable from the standpoint of noise magnitude alone. In effect, we compromise in the merging step: we merge barrels only when k-indistinguishability between groups is guaranteed; otherwise, we do nothing.

Definition 8. The groups are said to satisfy k-indistinguishability for an integer k if the number of groups with the same amount of barrels is greater than or equal to k.
For example, consider the simple weighted graph shown in Figure 1, where the weights are limited to the range 1~25. Two pairs of edges share equal weights, and all the other weights are different. If we set k = 2, each pair can be merged into one group. Thus, there are two merged groups, and the noise added to each of the merged weights is Lap(Δw/ε)/2. If we set k = 3, there are no merged groups, and the noise added to each weight is Lap(Δw/ε). Suppose that one of the weights is 13 instead of 5 in Figure 1; then a third pair of equal weights appears, giving three groups of the same size. When k = 2 or 3, all three groups can be merged in both cases, and for larger k there will be no merged groups.
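The merging rule can be sketched as follows (our own code; function names are ours, and we treat singleton groups as never merged, since a group of one barrel has nothing to merge):

```python
# Barrels with equal weight form a group; a group may only be merged if at
# least k groups share its size (Definition 8), so that the magnitude of
# the noise itself reveals nothing.
from collections import Counter

def mergeable_groups(weights, k):
    """Return {weight: group_size} for groups satisfying k-indistinguishability."""
    group_size = Counter(weights)                 # barrels per distinct weight
    size_count = Counter(group_size.values())     # how many groups have each size
    return {w: s for w, s in group_size.items()
            if s > 1 and size_count[s] >= k}

# Figure-1-style toy data: two pairs of equal weights among distinct ones.
weights = [5, 5, 13, 13, 1, 7, 20]
assert set(mergeable_groups(weights, k=2)) == {5, 13}   # two groups of size 2
assert mergeable_groups(weights, k=3) == {}             # fewer than 3 such groups
```

With k = 2 the two pairs merge, each receiving noise scaled down by its group size; with k = 3 nothing merges, exactly as in the worked example above.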

3.2.2. Consistency Inference

Here, we perform consistency inference according to the original order of the sequence as an important postprocessing step. The disturbed sequence should satisfy the original order to maintain consistency, which also means the relative order of the edge weights does not change. Intuitively, the shortest paths will not easily be rerouted but will tend to remain unchanged. It is worth mentioning that this process is based only on the known order, without accessing the private database; hence, there is no privacy leakage.

As a matter of fact, this problem is an instance of isotonic regression, and the following min-max formula [32] is one of the solutions.

Proposition 9. Denote the noisy sequence by s̃ = (s̃1, ..., s̃n), and let M[i, j] be the mean of the elements s̃i, ..., s̃j. The minimum L2 solution s̄ is unique and given by s̄k = min_{j ≥ k} max_{i ≤ k} M[i, j].
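The min-max formula can be transcribed directly. The sketch below is our own O(n^3) illustrative version, not the paper's nonrecursive implementation; it returns the closest nondecreasing sequence in the least-squares sense:

```python
def isotonic_min_max(z):
    """Min-max formula: s[k] = min over j>=k of max over i<=k of mean(z[i..j])."""
    n = len(z)
    def mean(i, j):
        return sum(z[i:j + 1]) / (j - i + 1)
    return [min(max(mean(i, j) for i in range(k + 1)) for j in range(k, n))
            for k in range(n)]

# Out-of-order prefix [3, 1, 2] is pooled to its mean, 2.0.
s = isotonic_min_max([3.0, 1.0, 2.0, 5.0])
assert s == sorted(s)                    # result is nondecreasing
assert s == [2.0, 2.0, 2.0, 5.0]
```

In practice the same answer is computed in linear time by pool-adjacent-violators, which is closer in spirit to the nonrecursive loop in Algorithm 1.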

In the literature [22], Hay et al. provided a theoretical analysis of the error introduced by consistency inference; the derivation showed that it barely hurts accuracy, while the experimental results showed that it can in fact improve accuracy markedly.

3.2.3. Algorithm of the MB-CI Strategy

Algorithm 1 is the complete algorithm of the MB-CI strategy.

Input: Raw weighted-graph database , privacy budget , parameter
Output: Disturbed weighted-graph database
// and contain three column vectors , , , and , , , respectively.
// represents the starting points of edges. indicates the ends.
// and store the original edge weights and the disturbed ones.
(1) Scan once to compute three vectors , , :
, , .
(2)
(3)
(4) for to
(5)  if then
(6)   
(7)  else
(8)   
(9)  end if
(10)   end for
(11) if then
(12) for to
(13)  
(14) end for
(15) while
(16)  
(17)  while
(18)   if or then
(19)    
(20)   else
(21)    
(22)   end if
(23)  end while
(24)  
(25) end while
(26) for to
(27)  
(28) end for
(29) return

Algorithm MB-CI presents the entire process of the proposed strategy. Line 1 scans the database once to compute three vectors: the first stores the count of each distinct weight, which is also the number of barrels with that same count; the second stores, for each group, how many groups have that same size, which is used to decide whether to merge or not; the third points to the index of each weight in the original order. Line 2 allocates the privacy budget in the proportion 2 : 8 in the experiments; that is, ε1 = 0.2ε and ε2 = 0.8ε. To randomly choose groups to merge, Line 3 adds Laplace noise to the group-size counts according to Theorem 10.

Theorem 10. The vector of group-size counts has sensitivity ΔQ = 4.

Proof. Given a graph and its neighbor graph differing on at most one edge weight, the vector of weights has only one value changed, with all other values kept the same. The vector storing the count of each distinct weight then has two values changed, one increased by 1 and the other decreased by 1. In the same way, the vector storing the count of each group size has four values changed. According to Definition 2, this vector has ΔQ = 4.

Lines 4–10 add Laplace noise to every weight; for each one, we test whether its group satisfies k-indistinguishability. If it does, we merge the barrels, so the amount of noise added is Lap(Δw/ε2)/m. Otherwise, Lap(Δw/ε2) is added, which is equivalent to not merging. Line 11 mainly deals with negative weights, which are meaningless and should not exist. Specifically, if the minimum of the noisy weights is less than zero, we uniformly subtract that minimum from all the values, rather than simply resetting the negatives to zero. The purpose is not to forcibly change some weights but to ensure that all the weights remain relatively unchanged. In addition, adding one to the results makes the minimum nonzero; otherwise, an edge between two vertices would likely be canceled.

Lines 12–14 generate a vector that stores the noisy weights according to the corresponding indexes; obviously, it should satisfy the original order. Lines 15–25 adopt nonrecursive programming based on the idea of the min-max formula to perform consistency inference. If the current value does not meet the conditions (that is, it is smaller than the previous value or bigger than the next), we continue to merge backward and compute the mean. Otherwise, the mean is assigned to each element in this run. It is worth mentioning that, in some special situations, the last group may be out of order, so in the experiments we also perform consistency inference from back to front to readjust it. Lines 26–28 write the processed noisy weights back, yielding the disturbed database, which Line 29 returns as the output of the algorithm.
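An end-to-end sketch of the pipeline just described may help. The code below is our own simplified reconstruction: variable names are ours, the budget-splitting and noisy group selection of Lines 2 and 3 are omitted for brevity, and consistency inference is done with pool-adjacent-violators rather than the paper's nonrecursive min-max loop.

```python
# Simplified MB-CI sketch: merge k-indistinguishable groups, add scaled
# Laplace noise, shift away negatives, then enforce the original order.
import random
from collections import Counter

def laplace(scale, rng):
    # Difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def pava(z):
    """Pool-adjacent-violators: closest nondecreasing sequence to z."""
    blocks = []                                   # (sum, count) per block
    for v in z:
        s, c = v, 1
        while blocks and blocks[-1][0] / blocks[-1][1] > s / c:
            ps, pc = blocks.pop()
            s, c = s + ps, c + pc
        blocks.append((s, c))
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

def mb_ci(weights, epsilon, k, rng=None):
    rng = rng or random.Random()
    dw = max(weights) - min(weights)              # sensitivity (Theorem 6)
    scale = dw / epsilon
    group_size = Counter(weights)
    size_count = Counter(group_size.values())
    # Merging barrels: a group of m equal weights that satisfies
    # k-indistinguishability gets noise Lap(dw/eps)/m per barrel (Theorem 7).
    noisy = []
    for w in weights:
        m = group_size[w]
        if m > 1 and size_count[m] >= k:
            noisy.append(w + laplace(scale, rng) / m)
        else:
            noisy.append(w + laplace(scale, rng))
    # Negative weights: shift all values instead of clipping (Line 11).
    low = min(noisy)
    if low < 0:
        noisy = [v - low + 1 for v in noisy]
    # Consistency inference: the noisy values, read in the original sorted
    # order of the weights, are pooled into a nondecreasing sequence.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    seq = pava([noisy[i] for i in order])
    out = list(noisy)
    for pos, i in enumerate(order):
        out[i] = seq[pos]
    return out
```

With a generous budget the output stays close to the input while the relative order of the weights is preserved exactly, which is what keeps most shortest paths unchanged.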

Theorem 11. Algorithm 1 (MB-CI) guarantees ε-differential privacy.

Proof. In the algorithm, adding Laplace noise to the weights with budget ε2 guarantees differential privacy according to Theorems 6 and 7. Furthermore, randomly choosing groups to merge with budget ε1 guarantees differential privacy according to Theorem 10. The remaining lines do not incur any extra privacy cost. Therefore, since ε1 + ε2 = ε, the MB-CI algorithm as a whole guarantees ε-differential privacy according to Proposition 4.

4. Experiments

In this section, the proposed approach is evaluated from two aspects, accuracy and utility. We use the average relative error (ARE) to measure the loss of accuracy due to the added noise and KSP (Keeping Shortest Paths) to measure the proportion of unchanged shortest paths.

(1) WARE is the average relative error of all the edge weights. The smaller the value, the higher the accuracy.

(2) KSP is the proportion of unchanged shortest paths, that is, the number of unchanged shortest paths divided by the number of all reachable shortest paths. The greater the value, the more shortest paths remain unchanged and the better the utility.

(3) LARE is the average relative error of the lengths of all the unchanged shortest paths, not considering the shortest paths that have changed, as it does not make sense to compare the lengths of different paths.
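For concreteness, the WARE metric can be written as a one-liner (our own sketch; the function name is ours, and edge weights are assumed positive so the ratio is defined):

```python
def ware(original, noisy):
    """Average relative error over all edge weights; smaller is better."""
    return sum(abs(v - o) / o for o, v in zip(original, noisy)) / len(original)

# Two weights, each off by 10 percent, give a WARE of 0.1.
assert ware([10, 20], [11, 18]) == (0.1 + 0.1) / 2
```

KSP and LARE additionally require recomputing all-pairs shortest paths on the original and disturbed graphs, which we omit here.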

Three data sets, shown in Table 1, were used in the experiments. Two are synthetic data sets employing a BA (Barabási–Albert) model to generate scale-free networks. The first one starts with five fully connected vertices; with each new vertex, five edges are attached, until the network grows to 1,000 vertices. In the same way, we generated the second one with a total of 2,000 vertices. The third is a real data set: CA-GrQc, the collaboration network of the Arxiv General Relativity category, in which there is an edge if two authors have coauthored at least one paper. We randomly assigned a weight to each edge, ignoring its semantics. The experimental environment is an Intel® Core™ i7-6700 CPU @ 3.40 GHz with 24 GB memory, running the Windows 10 operating system; the algorithms were implemented in Matlab R2014a.

The MB-CI strategy is mainly composed of two steps: merging barrels, MB for short, and consistency inference, CI for short. To improve accuracy, MB reduces the added noise by merging barrels with the same count while guaranteeing k-indistinguishability between groups. To guarantee better utility and keep most of the shortest paths unchanged, CI preserves the relative order of the edge weights via consistency inference. To display the respective effects of the two steps, we separate them and compare each against the Lap strategy. The MB strategy merges barrels with the same count while guaranteeing k-indistinguishability between groups and then adds Laplace noise to each barrel. The Lap-CI strategy applies consistency inference on top of the Lap strategy, which adds Laplace noise directly to each barrel with no merging. In the experiments, we set ε between 1 and 50, relatively large values, to balance the tradeoff between privacy and data utility, as we set the weights to a relatively large range. Admittedly, large values of ε, beyond about 10, provide almost no privacy protection in practice; we use them to check the performance of the algorithm under conditions resembling real social networks.

In the experiments, we first evaluated the error for MB under different k, compared with Lap. The results are shown in Figures 2, 6, and 10; k uniformly takes three different values, 1, 5, and 10. The error decreases as ε increases, due to less noise. The error for Lap is maximal because there is no merging. When k = 1, all the barrels with the same count are merged unconditionally, so the number of merged barrels is the greatest and the error is minimal. A larger k means a stricter limiting condition, so more barrels with the same count may fail to merge; thus, for the other two values of k, the error lies between these extremes. The curves are not smooth, as shown in the figures, because the error depends on the proportion of nonmerged barrels. Next, we set k to 5 for MB, to test how much error was introduced by consistency inference. The results are shown in Figures 3, 7, and 11. It can be seen that, compared with Lap, Lap-CI effectively reduces the error; MB-CI adds some error compared with MB on the last data set. MB-CI may introduce extra error because it must process more data during consistency inference on the larger data set, while the error of MB is already very small.

The most important step is to evaluate the change in the shortest paths, which is a key measure of global utility. As shown in Figures 4, 8, and 12, as ε increases, more shortest paths remain unchanged. Obviously, compared with Lap, MB better protects the shortest paths. MB-CI has a slightly better effect than MB, as MB already keeps about 90% of the shortest paths unchanged when ε exceeds 20. Lap-CI has a much better effect than Lap when ε exceeds 20. As shown in Figures 5, 9, and 13, we also evaluated the error over all the unchanged shortest paths. The trends of these curves are consistent with the previous analysis. This suggests that consistency inference can further improve the proportion of unchanged shortest paths and effectively reduce the error; it is an essential step in our application. In conclusion, MB-CI achieves the better experimental performance.

5. Conclusions

In this paper, we proposed the MB-CI strategy, a novel approach for protecting the edge weights of social networks. The starting point was treating the edge-weight sequence as an unattributed histogram; we merged all barrels with the same count into one group, while guaranteeing k-indistinguishability between groups. Then, we added Laplace noise to every edge weight and performed consistency inference according to the original order of the sequence. We conducted experiments on both synthetic data sets and a real data set. The results showed that the MB-CI strategy improves the accuracy and utility of the released data, consistent with the theoretical analysis. That is, the approach is effective in reducing the error introduced by the added noise and keeps most of the shortest paths unchanged.

Note that the edge weights considered here are integers, not continuous values. Thus, generalizing the approach to real-valued weights is a subject for future study. Moreover, many real-world applications demand user-level privacy rather than record-level privacy. Therefore, we will further extend the method to provide stronger protection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by The Natural Science Foundation of China (no. 61370083, no. 61402126, and no. 61672179), Specialized Research Fund for the Doctoral Program (no. 20122304110012), Youth Science Fund of Heilongjiang Province (no. QC2016083), Postdoctoral Fellowship of Heilongjiang Province (no. LBH-Z14071), and Basic Research Business of Education Department of Heilongjiang Province (no. 135109314, no. 135109245).