Abstract
For data analysis with differential privacy, an analysis task usually requires multiple queries to complete, and the total budget needs to be divided into parts and allocated to each query. However, budget allocation in differential privacy currently lacks efficient and general strategies, and most research adopts an average or exclusive allocation method. In this paper, we propose two series strategies for budget allocation: the geometric series and the Taylor series. We show the different characteristics of the two series and provide a calculation method for selecting the key parameters. To better reflect a user’s preference for noise during the allocation, we explore the relationship between sensitivity and noise in detail and, based on this, propose an optimization for the series strategies. Finally, to prevent collusion attacks and improve security, we provide three ideas for protecting the budget sequence. Both the theoretical analysis and the experimental results show that our methods can support more queries and achieve higher utility. The series allocation strategies are therefore flexible enough to meet users’ needs and can be applied to differentially private algorithms to achieve high performance while maintaining security.
1. Introduction
Collecting an individual’s data is crucial for many applications [1], and privacy protection has always been one of the focuses of researchers. Classical privacy protection technologies such as anonymization algorithms and homomorphic encryption have already achieved remarkable results. There are also other technologies to achieve privacy protection; for example, Qi et al. [2, 3] proposed a faster privacy protection method with locality-sensitive hashing. However, most of these protection technologies do not take the adversary’s background knowledge into consideration. Differential privacy, as a privacy protection technology with a rigorous mathematical foundation, protects private data even when the adversary’s background knowledge is assumed to be maximal, and it has therefore received considerable attention.
Differential privacy works by incorporating noise into the statistical results obtained from the sensitive underlying data. This approach provides strong privacy guarantees by requiring the involvement of an individual in the data set to be indistinguishable in the released information; and one of the important factors affecting the performance of differential privacy is privacy budget.
The privacy budget $\varepsilon$ (Dwork also called it “privacy loss” in [4] in 2017), together with the sensitivity of a query, is one of the critical parameters that affect security and utility. For a specified query, the larger $\varepsilon$ is, the lower the security is and the less noise there is. Noise is commonly understood to be inversely related to utility: the larger the noise is, the lower the utility is. This means that if we want to achieve a high level of performance, the amount of noise cannot be too large; that is, a large privacy budget needs to be allocated. The above analysis applies only to a single query, but we tend to reason similarly when facing multiple queries. Thus, the average allocation method usually becomes the first choice for most differentially private algorithms owing to its least overall noise [5–10].
However, does the average allocation strategy still bring a high level of utility if there are correlations between the queries? For example, many data mining algorithms are iterative. A significant feature of iteration is that the result of one iteration (or execution) relies on the results of previous iterations; the most representative example is clustering. When applying differential privacy to these algorithms, the input of each iteration is the noisy output of the previous iteration, and the output is again a noisy result that satisfies differential privacy [11]. This raises two interesting questions: (1) Are there iterations that have a greater impact on the final result of an algorithm than the others? Taking clustering as an example, we know that the initial centroids affect the utility of the final clustering results. Does this mean that the first few iterations are more important and demand larger privacy budget terms than the other iterations? Or, in order to converge to a stable state of clusters, are the last few iterations more important, so that they require less noise? (2) Does the noise have a direction? Any numerical data can be regarded as points in space. Furthermore, since a noise value can be positive or negative, the noise added to each dimension of a point can be regarded as a component of a vector; that is, the overall noise has a direction. Once the direction is taken into account, the performance and utility of a differentially private algorithm depend not only on the amount of noise but also on the direction in which noise is added at each iteration. Both questions seem to imply that the average allocation strategy may not be ideal. In fact, Cormode et al. [12] have already shown by experiments that the average strategy in their tree-based decomposition does not work well compared with the nonaverage method, which indirectly confirms our doubts.
In addition, in the setting of random queries without considering correlation, although the average strategy may be superior in performance, there is a more serious problem, namely, the security problem. Since the budget for each query in the average allocation is equal, if the mean of the noise distribution is 0, then the true result can be recovered by repeating the exact same query, or semantically equivalent queries, and averaging the answers [13]; the average allocation therefore cannot defend against collusion attacks.
As such, we need to solve the problem of how to cut the pizza of the privacy budget.
On the issue of nonaverage allocation of the privacy budget, a simple method once given by Dwork is bisection allocation (the budget allocated to the first query is $\varepsilon/2$, and the budget for each subsequent query is $1/2$ of that of the previous query). However, bisection allocation is rarely applied in practice, because the budget is consumed too quickly and the noise soon becomes too large to support many iterations. On the other hand, people usually determine the number of iterations in advance, and they will naturally choose the average allocation once they have determined the number of iterations. Many researchers have discussed optimizing the privacy budget allocation for their work; however, most optimizations are exclusive or just the average allocation in disguise. In addition, the current nonaverage allocation methods seem to be limited to a fixed mindset in which the allocation of budgets should be monotonic, especially monotonically decreasing. To date, there is still no discussion of nonmonotonic allocation. Moreover, although Dwork once discussed the theoretical lower limit of the budget, each user’s tolerance for noise is different in practice. A discussion of how to construct a budget allocation strategy that meets the practical needs of a user is also lacking.
To solve these problems, we consider the portion of the overall budget assigned to a query as one term in a sequence; hence, the whole process of allocation can be treated as a series. We focus on using existing series techniques to allocate the privacy budget. The main contributions of this paper can be summarized as follows:
(i) We propose two series allocation strategies with different monotonic features. The geometric series provides a monotonic allocation of the budget, while the Taylor series provides a nonmonotonic allocation of the budget. We show how the parameter of a series affects the performance of the series allocation methods.
(ii) We give a simple calculation method to select the ratio and the initial parameter so as to meet the utility requirements of multiple queries in practice from the perspective of the interactive scheme.
(iii) To reflect a user’s preference for noise and to achieve higher utility, we discuss in detail the relationship between sensitivity and noise and propose the idea of sensitivity normalization. Then, we present the concept of an acceptable budget and an optimization approach for the series strategies on the basis of the acceptable budget.
(iv) To defend against collusion attacks and improve security, we provide three different ideas for protecting the allocation sequence.
The remainder of this paper is organized as follows. We introduce related work in Section 2 and give the preliminary results in Section 3. In Section 4, we propose two series strategies and discuss the method to select the key parameters. In Section 5, we explore the relationship between sensitivity and noise, and we also present an approach for optimization. In Section 6, we discuss the protection of the budget allocation sequence. In Section 7, we present the experimental analysis of series strategies. Some conclusions are given in Section 8.
2. Related Work
As a core parameter, the privacy budget determines the security level of differential privacy. Although there is a lack of a specialized discussion on the allocation of privacy budget, it has been briefly discussed in many differentially private algorithms by researchers owing to the importance of this topic. The existing allocation methods generally fall into the three following kinds of strategies.
The first kind is to apply the total privacy budget as a whole without allocation, which relies on the parallel composition theorem [4, 14]. This approach has been widely used in data publishing. For example, one of the methods proposed by Xu et al. [15] independently adds Laplace noise to every count, and Xiao et al. [16] added independent Laplace noise to each wavelet coefficient; both methods can be implemented with no need to allocate the privacy budget.
The second strategy is to allocate the budget uniformly, namely, the average strategy. This is the most commonly used strategy in the fields of data publishing and data analysis. It is the simplest approach to implement and can be very effective at times, and many studies have adopted it. For example, Chen et al. [17] and Qardaji et al. [18] implemented differentially private data publication approaches with a uniform budget allocation, as does another method proposed by Xu et al. [15]. Dwork [13] and Su et al. [6, 7] allocated the budget equally over a fixed number of iterations for $k$-means clustering. Friedman et al. [5], Jagannathan [8], and Rana et al. [19] all applied a uniform budget to construct decision trees. Bhaskar et al. [20] proposed two differentially private frequent pattern mining algorithms with uniform allocation. Using the average method, Shin et al. [9] divided the overall budget among the iterations of a recommendation system. A large number of other applications seem to use nonuniform allocations but essentially still utilize the average method. For example, Hua et al. [21] adopted a linear allocation strategy that divides $\varepsilon$ into two parts $\varepsilon_1$ and $\varepsilon_2$, and thus $\varepsilon = \varepsilon_1 + \varepsilon_2$. Similarly, Wu et al. [22] divided $\varepsilon$ into four different parts corresponding to the four original parameters in density estimation with the Gaussian mixture model (GMM). Both then divided some parts of the budget into even smaller values with the average method to conduct iterations. Li et al. [23] and Cheng et al. [10] also demonstrated similar approaches. However, as we already know, security issues might occur with the average strategy, and the average method is not always the most effective one, which was confirmed by Cormode et al. [12].
The last kind of allocation strategy, namely, the adaptive strategy, makes an optimization oriented only to some specific work. The representatives are as follows. For tree-based decomposition, Cormode et al. [12] adopted a geometric allocation method that minimizes the error according to the height of the tree, assigning to each level $i$ a budget $\varepsilon_i$ that depends on the tree height $h$. In addition, inspired by [12], Qardaji et al. [18] used a locally optimal allocation strategy that minimizes the average error variance by assigning the budget of each level according to a level weight. Fan et al. [24] proposed the idea of arithmetic progression allocation based on the work of Su et al. [6], but the biggest problem of this allocation is the convergence of the budget sequence. To satisfy the requirement for the total budget, the arithmetic progression allocation ends up very close to the average allocation even when the number of queries is not large. In short, all the methods mentioned above share one problem: they are too exclusive to use in general situations.
Although much work has been done with differential privacy, the research that focuses on the privacy budget allocation strategy is still limited. Previous studies in the literature [25] have explored how different users allocate and compete for privacy budgets but did not address the budget allocation for a query sequence. Existing approaches either have adopted uniform allocation or are exclusive and can only be used in some specific scenarios. Therefore, because of these limitations, our interest has been stimulated and we are motivated to carry out this work.
3. Preliminaries
Differential privacy was a result of a line of work that was presented by Dwork [11, 26] in 2006. A brief introduction of differential privacy is given as follows.
Definition 1 ($\varepsilon$-differential privacy). A (randomized) mechanism $M$ satisfies $\varepsilon$-differential privacy if, for any data sets $D$ and $D'$ that differ with respect to at most one element (one is a subset of the other; we consider them as neighbours) and for any subset of the output $S \subseteq \mathrm{Range}(M)$,
$$\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S].$$
Parameter $\varepsilon$ controls the privacy leakage of $M$ and is referred to as the privacy budget.
Definition 2 (global sensitivity). For any function $f$, the global sensitivity of $f$, denoted as $\Delta f$, is defined to be the maximum $L_1$ distance between the outputs on any two neighbouring data sets $D$ and $D'$:
$$\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1.$$
Any mechanism meeting Definition 1 can be considered as differentially private.
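To achieve differential privacy, the Laplace mechanism, which adds random noise to the true answer, has been widely used [11]. Let $\mathrm{Lap}(b)$ denote the Laplace distribution with mean 0 and scale parameter $b$. Then the probability density function is
$$p(x \mid b) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right).$$
Theorem 1 (Laplace mechanism). Given a data set $D$, for any function $f$ with sensitivity $\Delta f$, the random mechanism $M$ provides $\varepsilon$-differential privacy if
$$M(D) = f(D) + \mathrm{Lap}\left(\frac{\Delta f}{\varepsilon}\right).$$
We have $b = \Delta f / \varepsilon$.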
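As a rough illustration of Theorem 1, the following Python sketch perturbs a numeric answer with Laplace noise of scale $\Delta f / \varepsilon$. The names `laplace_noise` and `laplace_mechanism` are hypothetical and not taken from the paper. The sampler relies on the standard fact that the difference of two independent exponential variables with mean $b$ follows a Laplace distribution with scale $b$; it is a minimal sketch that ignores floating-point side channels.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Lap(scale) sample as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(true_answer: float, sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Return true_answer + Lap(sensitivity / epsilon)."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    print(laplace_mechanism(true_answer=100.0, sensitivity=1.0,
                            epsilon=0.5, rng=rng))
```

A smaller `epsilon` gives a larger scale and therefore noisier answers, which is the trade-off the budget allocation has to manage.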
There are two widely used properties in this paper: sequential composition and parallel composition [14]. The sequential composition is the basis for the establishment of this article.
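Theorem 2 (sequential composition). For a sequence of mechanisms $M_1, M_2, \ldots, M_n$, if each $M_i$ provides $\varepsilon_i$-differential privacy, then the composed mechanism over the data set $D$ provides $\left(\sum_{i=1}^{n} \varepsilon_i\right)$-differential privacy.
Theorem 3 (parallel composition). For a sequence of mechanisms $M_1, M_2, \ldots, M_n$, each $M_i$ provides $\varepsilon_i$-differential privacy; let $D_1, D_2, \ldots, D_n$ be arbitrary disjoint subsets of the data set $D$; then the composed mechanism, in which $M_i$ acts on $D_i$, provides $\left(\max_i \varepsilon_i\right)$-differential privacy.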
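The two composition rules can be read as simple budget accounting. The sketch below, with the hypothetical helper names `sequential_budget` and `parallel_budget`, computes the overall privacy cost of a list of per-mechanism budgets under each theorem.

```python
from typing import Sequence


def sequential_budget(epsilons: Sequence[float]) -> float:
    """Overall cost when every mechanism runs on the same data set (Theorem 2)."""
    return sum(epsilons)


def parallel_budget(epsilons: Sequence[float]) -> float:
    """Overall cost when the mechanisms run on disjoint subsets (Theorem 3)."""
    return max(epsilons)


if __name__ == "__main__":
    per_query = [0.1, 0.2, 0.3]
    print("sequential:", sequential_budget(per_query))
    print("parallel:  ", parallel_budget(per_query))
```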
We list notations that are frequently used throughout the paper in Table 1.
4. Privacy Budget Allocation via a Convergent Positive Series
In this section, we will introduce two privacy budget allocation strategies based on two convergent and positive series (the geometric series and the Taylor series). We show the role of core parameters and the method for selecting them.
Although these two series are well-known classics in the field of mathematics, they have not been widely used in budget allocation. To the best of our knowledge, this is the first time that the Taylor series has been applied to budget allocation. The geometric series is the easiest to understand and apply, but only a few existing allocations are geometric methods, and those methods lack generality and flexibility. In addition, such methods have not yet been summarized systematically. Given that, this is the first time these two series allocation strategies are presented in a clear and general form.
4.1. Privacy Budget Allocation and Convergent Positive Series
If the overall budget for a differentially private mechanism is $\varepsilon$, then the budget allocated to the $i$-th query can be expressed as follows:
$$\varepsilon_i = p_i \cdot \varepsilon,$$
where $p_i$ is the proportion of the budget term $\varepsilon_i$ in the overall budget $\varepsilon$ and $0 < p_i \le 1$.
The allocation of a privacy budget is like cutting a pizza; no matter how many people there are, it must be ensured that everyone can get a slice, no matter how small the slice is. This implies that the mathematical form of budget allocation needs to be discrete and positive, with a constant sum, and a convergent positive series meets exactly these requirements. Therefore, the overall budget allocated to a sequence of $n$ queries can be expressed as a budget sequence $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n\}$. Similarly, we have the proportion sequence $\{p_1, p_2, \ldots, p_n\}$, and the sum is $S_n = \sum_{i=1}^{n} p_i$. If $n \to \infty$, then $S_n$ can be expressed as a convergent infinite series $\sum_{i=1}^{\infty} p_i$, and there is $\sum_{i=1}^{\infty} p_i \le 1$ (in fact, we should obey the sublinear constraint proposed in the literature [27], but, for the analysis, it will not be considered in this section).
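Theorem 4. If a positive series $\sum_{i=1}^{\infty} a_i$ converges to a constant $c$, the overall privacy budget is $\varepsilon$, given a sequence of mechanisms $M_1, M_2, \ldots, M_n$, each $M_i$ provides $\varepsilon_i$-differential privacy, and $\varepsilon_i = (a_i / c) \cdot \varepsilon$; then the composed mechanism over the data set $D$ satisfies $\varepsilon$-differential privacy.
Proof. For any $n$, the sum of the sequence is $S_n = \sum_{i=1}^{n} a_i$. Since $a_i > 0$, it is easy to see that $S_n < S_{n+1}$, $S_n < c$, and we have $\sum_{i=1}^{n} \varepsilon_i = (S_n / c) \cdot \varepsilon < \varepsilon$. According to Theorem 2 (sequential composition), the composed mechanism provides $\left(\sum_{i=1}^{n} \varepsilon_i\right)$-differential privacy; that is, it satisfies $\varepsilon$-differential privacy.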
4.2. Geometric Series Strategy
Geometric series is the easiest series for people to think of and implement when allocating privacy budget.
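A geometric series can be expressed as $\sum_{i=1}^{\infty} a q^{i-1}$, where $a$ is the start term and $q$ is the ratio of two adjacent terms. According to D’Alembert’s test, the series converges if $0 < q < 1$, and it converges to 1 only if, in addition, $a = 1 - q$. The series is represented as follows:
$$\sum_{i=1}^{\infty} (1 - q)\, q^{i-1} = 1.$$
Hence, we have
$$\varepsilon_i = (1 - q)\, q^{i-1} \cdot \varepsilon.$$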
According to Theorem 4, differential privacy is satisfied when the geometric series is used for budget allocation. Although the geometric series strategy is simple, the value of $q$ significantly affects the steepness of the series curve. If $q$ is small, then the budget is allocated relatively quickly and the first term is large. If the overall budget is allocated too quickly, the remaining budget available for allocation will soon approach 0. However, if $q$ is large and close to 1, then the series curve will be quite flat and the first budget term will be relatively small. Figure 1 compares two such series; the series with the larger $q$ is much gentler than the one with the smaller $q$.
In fact, the bisection method and the Cormode method mentioned in the previous sections are essentially special cases of the geometric series. To be specific, the bisection method is the geometric series with $q = 1/2$, and the Cormode method is a geometric series whose ratio is fixed by its error-minimizing derivation. However, both are fixed-form methods and cannot be adapted to practical needs.
Thus, choosing a reasonable $q$ that supports more meaningful queries should be considered when allocating via the geometric series strategy.
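The geometric allocation can be sketched in a few lines of Python. The function names below are hypothetical, and the sketch assumes the start term $1 - q$ so that the infinite series sums to 1; it produces only the first `n` terms and does not redistribute the unused remainder.

```python
from typing import List


def geometric_proportions(n: int, q: float) -> List[float]:
    """First n proportions (1 - q) * q**(i - 1), for i = 1..n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    return [(1.0 - q) * q ** (i - 1) for i in range(1, n + 1)]


def geometric_budgets(epsilon: float, n: int, q: float) -> List[float]:
    """Budget terms epsilon_i = p_i * epsilon for the first n queries."""
    return [epsilon * p for p in geometric_proportions(n, q)]


if __name__ == "__main__":
    # q = 0.5 reproduces the bisection allocation: each term halves the last.
    print(geometric_budgets(epsilon=1.0, n=5, q=0.5))
    # A ratio close to 1 gives a much flatter sequence with a small first term.
    print(geometric_budgets(epsilon=1.0, n=5, q=0.9))
```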
4.3. Taylor Series Strategy
A monotonically decreasing budget allocation method is in line with people’s daily thoughts and behavior habits, but it may not be the most appropriate. Thus, sometimes we can change our way of thinking, that is, using a nonmonotonic way of allocation. Although it is hard to find a convergent series that is both nonmonotonic and positive, we found an unexpected surprise after analyzing the Taylor series carefully.
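A general definition of the Taylor series is as follows. If a function $f(x)$ has derivatives of all orders at a single point $x_0$, then $f(x)$ could be represented by a Taylor series, which is an infinite sum of terms calculated from the values of the function’s derivatives at $x_0$:
$$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k.$$
To ensure that each term is positive, we choose the exponential function $e^x$, and the Taylor series for $e^x$ at $x_0 = 0$ is as follows:
$$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}.$$
When $x > 0$ and $i = k + 1$, the equation above can be written as follows:
$$e^x = \sum_{i=1}^{\infty} \frac{x^{i-1}}{(i-1)!}.$$
Since $e^x > 0$, we can turn the equation into
$$\sum_{i=1}^{\infty} \frac{x^{i-1}}{(i-1)!\, e^x} = 1.$$
Let $p_i = \dfrac{x^{i-1}}{(i-1)!\, e^x}$. Then, we have the following:
$$\varepsilon_i = p_i \cdot \varepsilon, \qquad \sum_{i=1}^{\infty} \varepsilon_i = \varepsilon.$$
Since $\sum_{i=1}^{\infty} p_i = 1$, the Taylor series strategy still meets $\varepsilon$-differential privacy after an unlimited allocation of the privacy budget.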
What the Taylor series strategy needs to select is the first term $p_1 = e^{-x}$, namely, the value of $x$. We draw the curves of the Taylor series in Figure 2. Interestingly, when $p_1$ is small in the ordinary sense, as shown in Figure 2(a), it creates the illusion that no matter how small $p_1$ is, the budget is consumed extremely fast. However, when $p_1$ is extremely small, the Taylor series begins to show an approximately symmetrical waveform, as shown in Figure 2(b), and this provides us with a new way of allocation.
A direct comparison of the two series is shown in Figure 3, which makes it easier to see their respective allocation characteristics. The geometric series allocates the budget in a monotonically decreasing way, and the speed of the decrease can be adjusted through the parameter $q$. The Taylor series allocates the budget as a waveform; that is, the budget is consumed faster in the middle stage and slowly in the other stages, and the parameter that controls the waveform is $x$.
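The waveform behaviour can be reproduced numerically. The sketch below assumes that the terms are numbered from 1 and that the $i$-th proportion is $x^{i-1} e^{-x} / (i-1)!$, which is the term of the series of $e^x$ divided by $e^x$; the indexing convention and the function name `taylor_proportions` are assumptions made for illustration.

```python
import math
from typing import List


def taylor_proportions(n: int, x: float) -> List[float]:
    """First n terms x**(i - 1) * exp(-x) / (i - 1)!, for i = 1..n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if x <= 0:
        raise ValueError("x must be positive")
    props = [math.exp(-x)]                   # first term: e^{-x}
    for i in range(1, n):
        props.append(props[-1] * x / i)      # p_{i+1} = p_i * x / i
    return props


if __name__ == "__main__":
    wave = taylor_proportions(n=30, x=9.5)
    peak = wave.index(max(wave)) + 1         # 1-based position of the peak
    print("peak at term", peak)
    print("first, peak, last:", wave[0], max(wave), wave[-1])
```

With a large `x` the first term is tiny, the terms rise to a peak near the `x`-th query, and then fall again, which is the nonmonotonic shape described above.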
4.4. Determine the Key Parameters of the Series Strategies
Determining the key parameters $q$ and $x$ for the series strategies actually determines the consumption speed of the budget $\varepsilon$, in other words, the steepness of the series curve. The speed of budget consumption and the total amount of noise both affect the actual output utility. It can be shown by calculation that the total amount of noise added with the average method is the least; and although the series strategy is a kind of uneven allocation, it may also perform poorly with excessive imbalance. An unbalanced budget allocation brings an unbalanced addition of noise. As a result, very little noise is added to some queries, while others become meaningless due to excessive noise.
We note that, to satisfy the convergence, although both series curves can be extremely flat and nearly average in the limiting case, their unique advantages are also lost at the same time. Therefore, we need to make a tradeoff between average and nonaverage allocation strategies.
Since the expressions and features of two series are different, we solve the problem separately.
Geometric Series. For the geometric series, a reasonable $q$ ensures that the difference between the first budget term and the last budget term is neither too small nor too large. By analyzing Figure 1, we find that, at any term, there is a certain geometric series whose value is greater than that of all other geometric series. Based on this, we focus only on the last budget term $\varepsilon_n$, whose proportion can be written as a function of $q$:
$$p_n(q) = (1 - q)\, q^{n-1}.$$
Differentiating this function and setting the derivative to zero gives the extreme point
$$q = \frac{n - 1}{n} = 1 - \frac{1}{n}.$$
This equation allows us to calculate a relatively ideal $q$ value for $n$ queries.
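A quick numerical check of this extreme point: assuming the last proportion is $(1 - q)\,q^{n-1}$, its derivative vanishes at $q = 1 - 1/n$. The helper names below are hypothetical, and the grid search is only a sanity check against that closed form.

```python
def last_term_proportion(q: float, n: int) -> float:
    """Proportion of the n-th term of the geometric series with start term 1 - q."""
    return (1.0 - q) * q ** (n - 1)


def ideal_ratio(n: int) -> float:
    """Ratio that maximises the n-th term: q = 1 - 1/n."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 1.0 - 1.0 / n


if __name__ == "__main__":
    n = 50
    grid = [k / 1000 for k in range(1, 1000)]     # q = 0.001 ... 0.999
    best_on_grid = max(grid, key=lambda q: last_term_proportion(q, n))
    print("closed form:", ideal_ratio(n))
    print("grid search:", best_on_grid)
```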
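Taylor Series. Since the Taylor series can exhibit an approximately symmetric waveform, we focus on the budget term in the middle, that is, the $(n/2)$-th term, and place the peak of the waveform there so that the first $n$ terms cover a complete waveform. Writing $m = n/2$, the proportion of $\varepsilon_m$ for the Taylor series can be written as a function of $x$,
$$p_m(x) = \frac{x^{m-1}}{(m-1)!\, e^{x}},$$
and the extreme point can be derived as
$$x = m - 1 = \frac{n}{2} - 1.$$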
Taking 50 queries as an example, the curves of the two series with their ideal parameters are shown in Figure 4. It seems that the two series perform well. However, we still have some extra budget available.
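For the Taylor series, the peak position can also be checked numerically. The sketch assumes terms are numbered from 1 with $i$-th proportion $x^{i-1} e^{-x} / (i-1)!$, so that the $m$-th term is largest at $x = m - 1$; under a 0-based numbering the peak would sit at $x = m$ instead. The function name is hypothetical, and the term is evaluated in log space to avoid overflow in the factorial.

```python
import math


def taylor_term(i: int, x: float) -> float:
    """i-th proportion x**(i - 1) * exp(-x) / (i - 1)!, via logarithms."""
    # math.lgamma(i) equals log((i - 1)!) for a positive integer i.
    return math.exp((i - 1) * math.log(x) - x - math.lgamma(i))


if __name__ == "__main__":
    n = 50
    middle = n // 2                                # the 25th of 50 queries
    grid = [k / 10 for k in range(10, 600)]        # x = 1.0 ... 59.9
    best_x = max(grid, key=lambda x: taylor_term(middle, x))
    print("x maximising the middle term:", best_x)
    print("value of the middle term there:", taylor_term(middle, best_x))
```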
4.5. Calibration of the Series Strategies
Since any convergent series has infinitely many terms, the series strategy can certainly be assigned to an infinite number of queries. However, we only use the first $n$ terms for $n$ queries. This means that the overall budget has not been used up, so extra budget is available to magnify each term in the budget sequence. Assuming that the sum of the proportion sequence is $S_n$, the actual proportion of $\varepsilon_i$ in the total budget should be
$$p_i' = \frac{p_i}{S_n}.$$
The actual privacy budget for the $i$-th query should be $\varepsilon_i' = p_i' \cdot \varepsilon$. We refer to this process as calibration (the essence of calibration is scaling up; to distinguish it from the data scaling operation later in the article, we use the notion of calibration). Taking the geometric series as an example, the calibrated budget sequence is shown in Figure 5.
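Calibration amounts to renormalising the first $n$ proportions so that they use the whole budget. A minimal sketch, with hypothetical function names, is given below; it works for any positive proportion sequence, not only the two series discussed here.

```python
from typing import List, Sequence


def calibrate(proportions: Sequence[float]) -> List[float]:
    """Divide every proportion by the partial sum so the result sums to 1."""
    total = sum(proportions)
    if total <= 0:
        raise ValueError("proportions must have a positive sum")
    return [p / total for p in proportions]


def calibrated_budgets(epsilon: float, proportions: Sequence[float]) -> List[float]:
    """Budget terms after calibration: epsilon * p_i / S_n."""
    return [epsilon * p for p in calibrate(proportions)]


if __name__ == "__main__":
    q, n = 0.5, 3
    raw = [(1 - q) * q ** (i - 1) for i in range(1, n + 1)]   # 0.5, 0.25, 0.125
    print("raw sum:       ", sum(raw))
    print("calibrated sum:", sum(calibrated_budgets(1.0, raw)))
```

Calibration preserves the ratio between any two terms, so the shape of the allocation curve is unchanged while every query receives a strictly larger budget.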
The following explains why we solve for the ideal parameters before the calibration rather than after it.
First, there is no general closed-form formula for the partial sum of the Taylor series, so solving for $x$ after calibration is not realistic. Second, although there is a convenient formula for the sum of the first $n$ terms of the geometric series,
$$S_n = \frac{(1 - q)(1 - q^n)}{1 - q} = 1 - q^n,$$
solving the extreme point of the calibrated function $p_n(q) / S_n = (1 - q)\, q^{n-1} / (1 - q^n)$ is also complicated and difficult. Moreover, it can be judged empirically and intuitively that the extreme point of the $n$-th term of the geometric series after calibration is just $q = 1$, which is the same as the average allocation. Therefore, we do not have to seek the optimal solution; an approximately optimal solution is enough to meet our needs.
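The intuition about the calibrated geometric series can be made concrete with a short worked simplification. Here $p_n$ denotes the proportion of the last term and $S_n$ the sum of the first $n$ proportions (symbols chosen for illustration), and the start term is assumed to be $1 - q$:

```latex
\[
  \frac{p_n}{S_n}
  = \frac{(1-q)\,q^{\,n-1}}{1-q^{\,n}}
  = \frac{q^{\,n-1}}{1+q+\cdots+q^{\,n-1}}
  = \frac{1}{1+q^{-1}+\cdots+q^{-(n-1)}} .
\]
```

Every power $q^{-k}$ in the last denominator decreases as $q$ grows on $(0, 1)$, so the calibrated last term is strictly increasing in $q$ and approaches $1/n$ as $q \to 1$, which is exactly the share given by the average allocation.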
It should be noted that drawing the series curve and calculating the series terms can easily create an illusion; that is, that the same allocation effect could be achieved by taking any function and sampling $n$ of its outputs. This is not feasible. The biggest difference between an arbitrary function and our series strategy is that the sum of the sampled values of a function does not converge to a constant; moreover, whether the function is continuous or discrete, determining the interval from which the values are taken is also a problem. This means that the function method cannot predict the monotonicity of the budget allocation or the speed of consumption; that is, we cannot allocate effectively with the function method.
5. Optimization for the Privacy Budget Sequence
In this section, we present the optimization from two perspectives. One is to transform the obtained budget sequence $\{\varepsilon_i\}$, and the other is to improve the sequence to meet the user’s requirements for accuracy.
5.1. Flip the Sequence
Our series strategies are flexible and simple. The process of budget allocation is only related to the number of queries and is independent of the actual database and specific query function. Once the number of queries has been determined in advance, the whole budget sequence will be known, which means that we can transform the sequence for some specific applications. A good way to transform the sequence while still maintaining the allocation characteristic is to flip the sequence. The monotonic feature of an allocation sequence can be quickly adjusted by flipping.
This approach is simple for the geometric series; the sequence only needs to be flipped horizontally (namely, reverse). However, for the Taylor series, in addition to flipping the sequence vertically, a calibration is also required.
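The budget term for the $i$-th query after horizontal flipping is as follows:
$$\varepsilon_i^{F} = \varepsilon_{n-i+1}.$$
The proportion of $\varepsilon_i^{F}$ is as follows:
$$p_i^{F} = p_{n-i+1}.$$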
For the Taylor sequence, the budget term for the query after flipping vertically and its proportion are as follows:
The transformations of the two series after calibration for 50 queries are shown in Figure 6. For strategies that have been flipped, we add the prefix “Flip” to their names.
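The horizontal flip used for the geometric series is simply a reversal of the budget sequence. The sketch below covers only this case; the function name `flip_horizontal` is hypothetical, and the example sequence is a calibrated geometric allocation built inline under the assumption of start term $1 - q$.

```python
from typing import List, Sequence


def flip_horizontal(budgets: Sequence[float]) -> List[float]:
    """Reverse the sequence so that the i-th query receives term n - i + 1."""
    return list(reversed(budgets))


if __name__ == "__main__":
    epsilon, n, q = 1.0, 5, 0.8
    raw = [(1 - q) * q ** (i - 1) for i in range(1, n + 1)]
    total = sum(raw)
    decreasing = [epsilon * p / total for p in raw]    # calibrated, decreasing
    increasing = flip_horizontal(decreasing)           # the flipped sequence
    print(decreasing)
    print(increasing)
```

Reversal changes neither the set of budget terms nor their sum, so the flipped sequence consumes the same overall budget while giving the largest terms to the last queries.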
5.2. Acceptable Budget
Although series strategies and their transformations look good, they still do not reflect users’ willingness for accuracy. Since noise can be a direct reflection of accuracy, we want users to be able to adjust the allocation methods based on their tolerable upper bound of noise.
5.2.1. Sensitivity Normalization
A user can easily give an expected upper bound of noise if all queries are of the same type. However, when facing different query types, it is difficult for users to provide a uniform and tolerable upper noise limit, since different types of queries have different sensitivities, and the sensitivity is an important factor that affects the amount of noise. With the same budget, the higher the sensitivity of a query is, the larger the noise required is. Then, how can we set a uniform upper bound of noise for different queries without considering their sensitivities? That is, can we make the different sensitivities the same?
From the perspective of the mathematical definition, the query sensitivity should not depend on the data set itself, since the privacy guarantee of differential privacy should hold for all possible data sets. However, when differential privacy is applied to a realistic data set, the sensitivity of a query function is affected not only by its own mathematical properties but also by specific numerical bounds or semantic constraints. This means that the data with added noise also need to satisfy these constraints. For instance, data such as a score or an age should be between 0 and 100. Whether the value after adding noise falls above or below this interval, it is meaningless and does nothing to confuse the adversary. Since differential privacy itself does not limit the adversary’s background knowledge, we should assume that these data intervals are known to the adversary. In other words, for some specific types of data in a data set, we may have a clear border constraint. As a result, the sensitivity of a query function conducted on this data set may be affected, which also implies that the sensitivity is closely related to the range (domain) of the data. In fact, similar ideas have already been proposed by Dwork [6, 13].
It seems that such a phenomenon conflicts with the original definition of query sensitivity, but it does not. From another perspective, the same function acting on different data ranges can be regarded as different functions. Namely, two queries are the same query function if and only if they share both the functional form and the data range; otherwise, they are two different functions.
Therefore, we argue that one can scale the query sensitivity by scaling the value domain of the data set, except for the count query, since its sensitivity is always 1. We thus consider adjusting the sensitivity of different queries to 1. For example, given a sum query, assuming the original value domain of the data set is $[0, m]$, the sensitivity is $m$ (here readers need to notice the definition of the neighbouring data set used in Definition 1). However, if the domain is reduced by a factor of $m$ and converted to $[0, 1]$, then the sensitivity will be 1, which is the same as that of the count query [4]. The same operation has been adopted in [6, 13] to clamp the sensitivity of a function within an interval.
However, this does not mean that clamping the value domain of all the dimensions of a data set to $[0, 1]$ makes every statistical query satisfy $\Delta f = 1$; some queries, such as a max query, may need a different domain to reach this sensitivity. For complex practical applications, we cannot guarantee that unifying the domain for all dimensions constrains the sensitivity of all queries to 1. Therefore, we allow each query function to scale its range separately to satisfy $\Delta f = 1$ before execution. Then, the noise will only be related to the budget and be independent of the sensitivity. We define this operation as sensitivity normalization.
Definition 3 (sensitivity normalization). If the global sensitivity of a query function $f$ after scaling the value domain of the data set $D$ is $\Delta f = 1$, then this process is considered as sensitivity normalization.
That is, we need to meet the following equation:
$$\max_{D_s, D_s'} \lVert f(D_s) - f(D_s') \rVert_1 = 1, \qquad D_s = D / k,$$
where $k$ is the factor to scale the value domain of $D$ for the query function $f$ and $D_s$ is the data set after scaling. If the original sensitivity is $\Delta f$, then we have $k = \Delta f$, since $f(D / k) = f(D) / k$ (readers should be careful not to associate query functions with arbitrary mathematical functions).
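A small sketch may help to illustrate sensitivity normalization for a sum query. It assumes that every value lies in a known domain $[0, \text{upper}]$, so that adding or removing one record changes the sum by at most `upper`; the bound, the data, and the function names are all hypothetical.

```python
from typing import List, Sequence


def normalize_values(values: Sequence[float], upper: float) -> List[float]:
    """Scale values from [0, upper] to [0, 1] by dividing by upper."""
    if upper <= 0:
        raise ValueError("upper must be positive")
    if any(v < 0 or v > upper for v in values):
        raise ValueError("every value must lie in [0, upper]")
    return [v / upper for v in values]


def normalized_sum(values: Sequence[float], upper: float) -> float:
    """Sum query on the scaled data; its sensitivity is 1 instead of upper."""
    return sum(normalize_values(values, upper))


if __name__ == "__main__":
    ages = [20, 35, 50, 100]       # illustrative data with domain [0, 100]
    scaled = normalized_sum(ages, upper=100)
    print("normalized sum:", scaled)
    print("rescaled to the original domain:", scaled * 100)
```

Noise calibrated to sensitivity 1 is added to the normalized answer, and the noisy answer is multiplied by `upper` afterwards to return to the original scale.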
5.2.2. Acceptable Budget and the Upper Bound of Noise
Once the sensitivity of a query has been normalized, its corresponding noise differs from the original noise; we call it the normalized noise.
Although the corresponding noise has changed, the impact of noise on the true result has not changed. Assume that the true result before sensitivity normalization is $f(D)$; if the original sensitivity is $\Delta f$ and the privacy budget is $\varepsilon$, then the standard deviation of the Laplace noise is $\sigma = \sqrt{2}\,\Delta f/\varepsilon$.
The actual result after sensitivity normalization and the standard deviation of the normalized noise are as follows: $f(D') = f(D)/\Delta f$ and $\sigma' = \sqrt{2}/\varepsilon$, where $D'$ denotes the scaled data set.
The ratio of noise to the true result remains consistent before and after normalization; that is, $\sigma/f(D) = \sigma'/f(D')$. The principle is simple, since we only adjust the data domain proportionally without modifying the essence of the data. Therefore, no matter how much the data have been scaled, the ratio between the noise and its corresponding true result is unaffected.
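A small numeric check of this invariance, assuming the standard fact that Laplace noise with scale b has standard deviation √2·b and that the query result scales linearly with the data (as for a sum query). The numbers are illustrative only.

```python
import math


def laplace_std(sensitivity, epsilon):
    """Standard deviation of Laplace noise with scale sensitivity / epsilon."""
    return math.sqrt(2.0) * sensitivity / epsilon


# Illustrative values, not taken from the paper's experiments.
true_result, sensitivity, epsilon = 480.0, 8.0, 0.4

# Before normalization: original result and original sensitivity.
ratio_before = laplace_std(sensitivity, epsilon) / true_result

# After normalization: result divided by the scaling factor, sensitivity 1.
ratio_after = laplace_std(1.0, epsilon) / (true_result / sensitivity)
```

Both ratios agree up to floating-point rounding, because the scaling factor cancels.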
By normalizing the sensitivity, we can achieve a unified noise constraint on different query functions. In other words, we can obtain the lower bound of the budget from the perspective of the user.
Definition 4 (acceptable budget). Suppose that a user gives the upper bound of the normalized noise that he/she can tolerate and that a query has been sensitivity-normalized. If the standard deviation of the query’s normalized noise under a given privacy budget does not exceed this upper bound, then we consider that budget an acceptable budget.
Note that the noise drawn from the Laplace distribution is random, so we use the upper bound of noise provided by the user as the maximum standard deviation of the random noise when allocating the budget.
Lemma 1. If the minimum budget term in a budget sequence satisfies the condition in Definition 4, then all the budgets in this sequence are acceptable.
Proof. By Definition 4, a budget is acceptable when the standard deviation of its normalized noise does not exceed the user’s upper bound, and this standard deviation decreases as the budget increases. Since every term in the budget sequence is at least as large as the minimum term, any budget term in the sequence is acceptable. The proof is complete.
Lemma 1 indicates that if one needs to conduct queries with acceptable budget terms, then the minimum budget term in the budget sequence must itself be acceptable.
Lemma 2. Given the overall budget and the user’s upper bound of the normalized noise, to allocate budget for queries successfully, there must be .
Proof. Assume that there is . Then, the minimum budget term for queries is ; thus, , which is a contradiction to . Therefore, this assumption is incorrect, and the proof is complete.
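The checks behind Lemmas 1 and 2 can be sketched as follows. This assumes one reading of Definition 4: normalized Laplace noise under budget ε has standard deviation √2/ε, and a budget is acceptable when that value does not exceed the user's bound. The function names are hypothetical.

```python
import math


def min_acceptable_budget(noise_bound):
    """Smallest budget whose normalized noise std sqrt(2)/eps is <= noise_bound."""
    return math.sqrt(2.0) / noise_bound


def all_acceptable(budgets, noise_bound):
    """Lemma 1 (as read here): it suffices to check the minimum term."""
    return min(budgets) >= min_acceptable_budget(noise_bound)


def allocation_feasible(total_budget, n_queries, noise_bound):
    """Lemma 2 (as read here): the total must cover n minimum acceptable terms."""
    return total_budget >= n_queries * min_acceptable_budget(noise_bound)
```

For example, `allocation_feasible(1.0, 20, 30.0)` holds because 20·√2/30 is below 1, whereas a bound of 28 would fail.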
Lemma 3. Given an allocation budget sequence , the minimum term in is , , and the overall budget is . To make the noise bound meaningful and for effective optimization, there should be .
Proof. Since the noise that current budget sequence can endure is at most, we think that there is no need for a further improvement if one gives . According to Lemma 2, there should be . The proof is complete.
5.3. Optimization with an Acceptable Budget
How can we improve the obtained budget sequence so that it satisfies the requirement of an acceptable budget? To answer this question, we return to the average method, that is, we compound the obtained budget sequence with the average budget sequence. If the average budget is just enough to meet equation (27), then none of the other nonaverage allocation methods can be improved further. As long as the average budget is larger than the minimum budget that satisfies equation (27), we can adjust the budget sequence to satisfy the acceptable budget.
Let be the minimum term of the proportion sequence that has been calibrated. Then, to satisfy the acceptable budget by compounding with the average budget sequence, we must havewhere is a multiplicative factor. Thus, according to Lemma 1, we can make all the terms in the budget sequence satisfy the acceptable budget if we have
After compounding, the actual assigned proportion is as follows:
There is no need for calibration again because this optimization is performed based on the calibrated sequence. For those series methods that have been optimized to satisfy the acceptable budget, we add the prefix “AC” to their names. Taking a Taylor series method as an example, its optimization is called the ACTy and is shown in Figure 7.
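One way to picture the compounding step is the sketch below. It is an assumption rather than the paper's exact formula: the calibrated proportion sequence is mixed with the uniform sequence 1/n using the smallest weight `w` that lifts the minimum term to the required level. The function name and the weight `w` are hypothetical.

```python
def compound_with_average(proportions, total_budget, min_budget):
    """Mix a proportion sequence (summing to 1) with the uniform sequence.

    Returns new proportions whose smallest term, multiplied by total_budget,
    is at least min_budget. The mixing rule is an illustrative assumption.
    """
    n = len(proportions)
    target = min_budget / total_budget  # required minimum proportion
    if target > 1.0 / n + 1e-12:
        raise ValueError("even the average allocation cannot meet min_budget")
    p_min = min(proportions)
    gap = 1.0 / n - p_min
    if p_min >= target or gap <= 0.0:
        return list(proportions)  # already acceptable, or already uniform
    w = (target - p_min) / gap  # smallest mixing weight that suffices
    return [(1.0 - w) * p + w / n for p in proportions]
```

Because the result is a convex combination of two sequences that each sum to 1, the total budget is still fully used and the order of the terms is preserved.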
5.4. Useful Allocation Strategy
To better evaluate the usefulness of the queries, we borrow a well-known usefulness definition proposed by Blum et al. [28], which can be used to measure the utility of a released synthetic database in the noninteractive scheme. To make the definition more suitable for the scenarios and methods of our work, we make a slight modification and obtain the following definition.
Definition 5 (useful strategy). Given a sequence of queries and the total amount of privacy budget , if for each query , , then the normalized noise of with strategy is , which satisfies the following equation:Then, we consider the budget allocation strategy to be useful.
According to Definition 4, we can obtain the following lemma.
Lemma 4. Given the total amount of privacy budget , the number of queries is . If all the terms in budget sequence with allocation strategy satisfy the acceptable budget, then is useful, where is the minimum in the budget sequence.
Proof. If strategy satisfies the ideal query, then, for any , it must be subject toLet ; then, for any query in , the noise less than has the following probability:The proof is complete.
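The tail probability used in this kind of argument can be computed directly. The sketch assumes normalized Laplace noise with scale 1/ε, whose absolute value is exponentially distributed with rate ε. The function names and the `threshold` parameter are illustrative.

```python
import math


def prob_noise_below(threshold, epsilon):
    """P(|Lap(1/epsilon)| < threshold) = 1 - exp(-epsilon * threshold)."""
    return 1.0 - math.exp(-epsilon * threshold)


def worst_case_prob(budgets, threshold):
    """The smallest budget term gives the weakest guarantee in the sequence."""
    return prob_noise_below(threshold, min(budgets))
```

Since the probability grows with the budget, bounding it for the minimum term bounds it for every query in the sequence.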
6. Protection for the Budget Sequence
In this section, we will discuss the collusion attack for the nonaverage strategies and our countermeasures.
6.1. Collusion Attack for Nonaverage Budget Strategies
We already know that one of the problems of the average allocation is security, and the series strategies seem to be able to resist this risk. However, let us imagine the following situation of a collusion attack. There are users, each submits queries, and the query for everyone is the same. With the queries completed, they exchange information with each other, which will also lead to a breach of privacy. Since the series strategy is independent of specific queries and is only related to the number of queries, as long as the number of queries and the type of the series are known, the budget of the query can be predicted in advance. Thus, for these users, the same query has been conducted times, and the worst is almost times if all the queries are semantically the same. Therefore, we cannot say that the current series strategies must be safer than the average method. We need to protect the budget sequence.
6.2. Protect the Budget Sequence with a Random Arrangement
The simplest and most intuitive way is to randomize the budget sequence completely. Although this approach can guarantee the security, it also loses the characteristics and advantages of the series itself. So we are more willing to seek other approaches that could preserve these characteristics.
6.3. Protect the Budget Sequence with Probability
Although the budget sequence is a numeric type, if we only want to adjust the order of terms, then we can treat the terms as different options, and each option will be chosen with a certain probability. Similar to the sampling without replacement, we choose a budget term for a query according to the probability; we call this protection perturbation. There have been many comprehensive and mature results in the field of computer science regarding this topic. We only give a simple idea of determining the probability for each budget term in this article.
We score each budget term and give the probability based on the sum of all the scores. Assuming that budget terms have been chosen for the queries, we are about to conduct the query and choose a budget term for it; then, the remaining budget terms available for choosing are a total of . We represent the remaining terms as remaining sequence and the score for the term in as . If a budget term is sorted as in the original budget sequence and as in , then the score assigned to the term in could be
We believe that the smaller the number is, the higher the score should be, and the more is less than , the higher the score is. Then the probability for the term in is as follows:
The actual budget sequence after the perturbation is shown in Figure 8. We can see that the trend of the original series is still maintained.
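The probabilistic selection can be sketched as weighted sampling without replacement. The scoring rule below is a hypothetical stand-in for the paper's formula, chosen only to respect the two stated preferences: a smaller position j in the remaining sequence scores higher, and a larger gap between the original position i and j scores higher.

```python
import random


def perturb_order(budgets, rng):
    """Reorder budget terms by repeated score-weighted draws (illustrative)."""
    # Keep each term's original 1-based position i alongside its value.
    remaining = list(enumerate(budgets, start=1))
    result = []
    while remaining:
        # j is the 1-based position in the remaining sequence; i >= j always.
        scores = [(1 + i - j) / j for j, (i, _) in enumerate(remaining, start=1)]
        pick = rng.choices(range(len(remaining)), weights=scores, k=1)[0]
        result.append(remaining.pop(pick)[1])
    return result


rng = random.Random(7)
shuffled = perturb_order([0.05, 0.10, 0.20, 0.25, 0.40], rng)
```

The output is a permutation of the input, so the total budget is unchanged; only the order in which terms are handed out is randomized.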
Alternatively, we can construct a noisy budget sequence with noisy parameters.
6.4. Protect the Budget Sequence with the Laplace Mechanism
Since the construction of a budget sequence depends on the number of queries , we can apply differential privacy again to protect the budget sequence. Therefore, the third idea is to add the Laplace noise to the number according to Theorem 1 and make the budget sequence satisfy differential privacy. Noisy is calculated as follows:where since the number (of the query set) and the data set for queries are two disjoint sets, the privacy budget is still according to Theorem 3 (parallel composition).
It should be noted that the noise drawn from the Laplace distribution can be positive or negative. If the negative noise occurs, then will be smaller than actual , which may result in constructing a budget sequence that is too unbalanced and significantly affect the final results. Therefore, in the absence of the constraint, we believe that positive noise should be added.
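A minimal sketch of perturbing the number of queries, assuming that number has sensitivity 1. Taking the absolute value is one simple way to keep the noise positive; it is an illustrative choice, and its effect on the formal guarantee is not analysed here. The function name is hypothetical.

```python
import random


def noisy_query_count(n, epsilon, rng, positive_only=True):
    """Return n plus rounded Laplace(1/epsilon) noise, never less than 1."""
    # Difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    if positive_only:
        noise = abs(noise)
    return max(1, n + round(noise))


rng = random.Random(3)
n_noisy = noisy_query_count(20, epsilon=0.5, rng=rng)
```

The budget sequence would then be built from `n_noisy` instead of the true number of queries.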
The Taylor series curve with is as shown in Figure 9, and the difference from the original curve is obvious.
If we want to increase the difference between users, then we can assign different privacy budgets for different users.
7. Analysis and Experiment
In this section, we evaluate our strategies with two different sets of experiments. First, we compare the total amount of noise generated with different budget allocation strategies after multiple interactive queries. Second, to demonstrate that our series strategies have better adaptability and utility in practice than the average allocation, we apply our strategies to k-means clustering and compare the accuracy of the noisy clustering results. For convenience, we use AVG, Geo, and Ty to represent the average strategy, the geometric series strategy, and the Taylor series strategy, respectively, in the graphs and tables.
7.1. Analysis of the Total Amount of Noise for Multiple Interactive Queries
We implement this experiment with the data set “Document” [29, 30]. “Document” is a database that contains 353,160 records describing the frequency of 3,430 politically relevant news and reviews on professional vocabulary. There are 6,906 professional vocabularies, and the data set contains three attributes: “article ID,” “word ID,” and “wordcount.” The meaning of a record is counting the number of word IDs that appear in the article ID.
For comparison, we conduct sum queries with each allocation method and calculate the total amount of noise, where is 20, 50, and 100. Since the queries are of the same type and the first two attributes are indexes, we can scale the value domain of the “wordcount” to to normalize the sensitivity (though it is a sum query, the semantics of “wordcount” require the value to be greater than 0). We compare the mean square error of the overall noise. To avoid the extremums caused by random noise, we repeat rounds , and the mean square error of each round is denoted as . Then, we obtain the averaged mean square error . A large indicates substantial noise and low accuracy of the query results. Let be the true result of the query, and let be the noisy result with approach . Then, we have the following:
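The error measure described above can be sketched as follows, assuming the per-round mean square error is the mean of squared differences between true and noisy results over the queries of that round. The function names are illustrative.

```python
def mse(true_results, noisy_results):
    """Mean square error between true and noisy query results for one round."""
    if len(true_results) != len(noisy_results):
        raise ValueError("result lists must have the same length")
    squared = [(t - y) ** 2 for t, y in zip(true_results, noisy_results)]
    return sum(squared) / len(squared)


def averaged_mse(true_results, noisy_rounds):
    """Average the per-round mean square error over all repeated rounds."""
    return sum(mse(true_results, r) for r in noisy_rounds) / len(noisy_rounds)
```

Averaging over rounds smooths out the extremes caused by individual random draws.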
The total privacy budget in this experiment is set to 1 and is 30, 95, and 200, which correspond to values. We compare five conventional methods and the results are shown in Table 2.
Note that the Taylor series strategy without considering certainly injects much more noise into the queries at the early and late stages of the allocation than the other methods do. Therefore, by default we optimize the Taylor series with in the experiments and denote this method as “blTy.”
The results show that the total amount of noise generated by the geometric series strategy is very close to the average allocation, since a geometric series method can be relatively “balanced” by setting a large . The Taylor strategy is inferior to the geometric strategy, but it still can have the same order of magnitude of noise. In addition, the methods after optimization can have less noise than the original ones. These results are consistent with our previous theoretical analysis.
From the perspective of noise, the average allocation has the least total amount of noise, while our series strategies can ensure that the total noise is always close to the average method and provides an uneven allocation for the budget.
7.2. Differentially Private k-Means Clustering with Different Strategies
Clustering is a very typical representative of differential privacy in data mining applications. A brief description of one iteration in differentially private clustering is as follows [11] (Algorithm 1).

By analysis, we can find the following: (1) because the clustering algorithm itself requires multiple iterations and noise must be added to guarantee differential privacy, the clusters usually cannot reach a steady state (which we also call “converge” in this paper); (2) within an iteration, the privacy budget still needs to be consumed many times [31]. For these reasons, differentially private clustering has attracted the attention of many researchers.
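One noisy iteration can be sketched as follows. This is a generic noisy Lloyd-style step written under stated assumptions, not a reproduction of Algorithm 1. Points are assumed to lie in [0, 1] per dimension so that each count and each per-dimension sum has sensitivity 1. The even split of the iteration budget `epsilon_t` over one count and `dim` sums is hypothetical.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_kmeans_iteration(points, centroids, epsilon_t, rng):
    """Assign points to the nearest centroid, then update from noisy statistics."""
    k, dim = len(centroids), len(centroids[0])
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for p in points:
        nearest = min(
            range(k),
            key=lambda c: sum((p[d] - centroids[c][d]) ** 2 for d in range(dim)),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    eps_each = epsilon_t / (dim + 1)  # assumed split: 1 count + dim sums
    updated = []
    for c in range(k):
        noisy_count = counts[c] + laplace(1.0 / eps_each, rng)
        if noisy_count < 1.0:
            updated.append(list(centroids[c]))  # keep old centroid if unusable
            continue
        updated.append([
            min(1.0, max(0.0, (sums[c][d] + laplace(1.0 / eps_each, rng)) / noisy_count))
            for d in range(dim)
        ])
    return updated
```

A budget allocation strategy would supply a different `epsilon_t` for each iteration, which is where the series strategies enter.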
To compare the performances of different allocation strategies carefully, we apply k-means clustering to two data sets with different clustering features, and we use different metrics for the comparative analysis.
7.2.1. Data Sets and Evaluation Methodology
The first data set we use is “Unbalance” [29], which is a small and labelled data set. It has 3 large clusters of 2000 points and 5 small clusters of 100 points (8 explicit clusters and 6500 2D points). The second data set we use is Birch3 [29], which contains 100,000 random 2D points whose clusters are not explicit; we set for the clusters. The clustering task involves only two types of queries on the data sets: sum and count. Thus, we can scale all dimensions of the data set to in advance to satisfy the sensitivity normalization.
To evaluate the performance of clustering on the labelled data set, the commonly used method is the confusion matrix, which is used to calculate true positive (TP), true negative (TN), false positive (FP), and false negative (FN) for the clusters. To show our results more intuitively and comprehensively, we employ the score, which can be used for evaluating the clustering with labels. The definition of for cluster is shown below:
Because there are clusters after partition and they are of different sizes, we consider giving them weights according to their size to obtain the overall score:where is the given data set with clusters and represents the true clusters. To map each noisy cluster to its corresponding true cluster , we use the algorithm.
However, the score is not suitable for a random and unlabelled data set, since there might be more than one best partition after iterations. Therefore, we decide to compare the normalized intracluster variance (NIVC), as mentioned in [6]:
Here, is the centroid of and is a point of . The lower the NIVC is, the better the performance is.
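The metric can be computed as in the sketch below, which assumes the form used by Su et al. [6]: the squared distances from every point to its cluster centroid, summed over all clusters and divided by the total number of points. The function name is illustrative.

```python
def nicv(clusters, centroids):
    """Normalized intracluster variance over a partition of the data.

    clusters:  list of clusters, each a list of points (tuples of floats)
    centroids: list of centroids, one per cluster, in the same order
    """
    total_points = sum(len(cluster) for cluster in clusters)
    if total_points == 0:
        raise ValueError("no points to evaluate")
    squared_distance = sum(
        sum((x - c) ** 2 for x, c in zip(point, centroid))
        for cluster, centroid in zip(clusters, centroids)
        for point in cluster
    )
    return squared_distance / total_points
```

Lower values mean tighter clusters, which matches the reading that a lower score indicates better performance.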
7.2.2. Analysis of the Unbalance Data Set
Since the total number of iterations to achieve an optimal result of differential privacy clustering cannot be computed, we trace the clustering results of each method with different total number of iterations and repeat 100 times to compare the averaged score. Meanwhile, to decide a maximum total number of iterations for comparison, we use the method proposed by Su et al. [6], according to which the maximum number of iterations could be set as 14 (note that Su et al. used the average strategy).
The value of the privacy budget has always been a major problem in differential privacy [32]. Although some researchers have discussed the upper or lower limit of the privacy budget, it has not been widely approved yet. Taking the total number of iterations and queries into account, and with a lot of experiments, we found that a budget of about 0.3 can get good clustering results and reflect the gap between different allocation methods as well; therefore, we set the total budget as .
We divide our comparison into three groups, which correspond to different strategies, and all treat the average allocation curve as a benchmark.
(1) Comparison of the Performance of Geometric Strategy. The first comparison concerns the geometric strategy and includes 6 different geometric methods. Among them, the bisection method is a fixed series allocation method, the Cormode method is only used for decision trees, and the others are the general and flexible geometric methods that we propose. To make a fair comparison, we change the tree height in the Cormode method to the total number of clustering iterations , and is the number of the current iteration; namely, . Meanwhile, to ensure that all budgets are exhausted, we have calibrated all the methods.
According to Lemma 3, the boundary of is different with different numbers of iterations when optimizing the series strategies. The boundary of calculated for different numbers of iterations is shown in Table 3, and the actual values of that we set are roughly in the middle of the interval. The comparison of the score for the geometric strategy is shown in Figure 10. Affected by the probability, the curve fluctuates slightly, but it does not affect the overall variation trend.
Figure 10: Comparison of the score for the geometric strategy, panels (a) and (b).
From Figure 10, we can observe the following. First, although both the bisection method and the Cormode method are series strategies, their results are clearly worse than those of the series methods proposed in this paper (note the scale of the y-axis in the two subfigures). Second, among the methods designed according to our series strategy, the two monotonically increasing allocation methods, FlipGeo and ACFlipGeo, are significantly better than all the other methods; the monotonically decreasing allocation methods are better than the bisection and Cormode methods but inferior to the average allocation, especially when the total number of iterations is larger than 8. Third, the best performing method is FlipGeo rather than ACFlipGeo when the total number of iterations is larger than 10. Fourth, when the total number of iterations is less than 8, the monotonically decreasing method ACGeo is better than the average method. Finally, all methods except the bisection method reach their highest level of performance when the total number of iterations is around 5.
Based on these results, we can draw several conclusions. First, the series strategy proposed in this paper is not only flexible and variable but also guarantees the utility of an algorithm’s output; even the monotonically decreasing methods are only slightly worse than the average method. Second, in differentially private clustering, monotonically increasing allocation methods are more practical. This means that, in differentially private algorithms whose queries are associated with each other, the total amount of noise is not the only factor that determines the final output; the budget allocation strategy also has a significant impact. We can see that when the total number of iterations is less than 10, ACFlipGeo performs best, and when the total number of iterations is less than 6, the monotonically decreasing ACGeo is also better than the average method. All this shows that both the total amount of noise and the allocation strategy affect the output, which is also why FlipGeo, although it has larger noise than ACFlipGeo, gradually shows a clear advantage as the number of iterations increases. Last, the actual number of iterations required for clustering may be very small, and the choice of allocation strategy affects the number of iterations that can be supported.
These results stem from the monotonicity of the geometric allocation strategy itself and the convergence requirements of clustering. Although the clustering algorithm depends on the result of the previous iteration, the accuracy of the clustering must improve gradually for it to converge; otherwise, convergence will be difficult. Under this requirement, the two flipped allocation methods stand out.
(2) Comparison of the Performance of the Taylor Strategy. For the same reason mentioned in Section 7.1, we use the blTy instead of the original Taylor method in this group of comparisons. Since the FlipTy is very close to the average method, the ACFlipTy constrained by is much closer to the average method. Thus, a comparison of the method ACFlipTy and the average method is meaningless. As a result, this group of comparisons only includes four methods: AVG, blTy, ACblTy, and FlipTy. The boundary of for different numbers of iterations is shown in Table 4, and the comparison of the score for the Taylor methods is shown in Figure 11.
We can see from Figure 11 that the results are similar to those from the geometric series. Just as what we have concluded in the experiment of geometric series, allocating low budget terms to the later iterations is not conducive to the convergence of clustering. Thus, although both blTy and the ACblTy are nonmonotonic allocation methods, they perform poorly. This is due to the fact that their curves are similar to convex functions, and most of the budget terms in the allocation sequence are very low, especially at the early and late stages of the allocation, while for the FlipTy, though it is close to the average strategy, it has the ability to maintain a higher budget allocation at the early and late stages; thus its performance is the best among this group of methods. Although it is only slightly higher, it is very effective. At the same time, we also noticed that as the number of iterations increases, the performance of the method FlipTy gradually tends to the average method.
(3) Comparison of the Performance of the Strategy with Different Protections. In Figure 12, we take the FlipGeo as representative and compare its performance with two different protections. We can see that the performances of the two FlipGeo methods with different protections are still better than the average allocation; they all maintain a high level of accuracy. Relatively speaking, the protection with noisy is more stable than the protection with probability. It shows that the budget sequence determined by the series strategies in this paper could still obtain higher performance than the average method even if it has been disturbed or not been allocated according to the original sequence to some extent.
To sum up, with the total budget unchanged, both series strategies can achieve better performance than the average strategy regardless of whether the total number of iterations is small or large. Furthermore, we believe that the allocation strategy with the least noise is not necessarily the one that brings the best results. It is necessary to choose the appropriate allocation strategy according to the specific application.
7.2.3. Analysis of the Birch3 Data Set
Although our series strategies show obvious advantages on well-clustered data sets, we want to know whether they still perform well and support more iterations on complex, high-density, large data sets. We select three methods that performed well with the “Unbalance” data set for comparison in this experiment and conduct more iterations this time. As with the previous data set, we calculated the maximum number of iterations as 47 according to the method of Su et al. [6], and we separately compare the clustering results of different allocation methods after iterations. Considering the complexity of the data set and the instability of the results, we repeat each method 1000 times to avoid extreme cases and compare the averaged clustering results of each method at different total numbers of iterations. The boundaries and the actual setting values of are shown in Table 5; in this experiment, we set the budget to 0.5.
Recall the NIVC introduced in 7.2.1; the smaller the value is, the better the clustering result is. In addition, since each method repeats the experiment up to 1000 times, we also analyzed the standard deviation between these 1000 results.
It can be seen from Figures 13 and 14 that, even on a complex data set with more iterations, our methods are still better than the average strategy, which is consistent with the results of our previous set of experiments. We also observe something new.
First, as the total number of iterations increases, the stability of the final result is broadly in line with the budget terms allocated to the last few iterations, but it has nothing to do with the performance. We can see that although the standard deviation of 1000 results of the FlipTy is almost the same as the average method, the performance is significantly better and sometimes even better than the ACFlipGeo. The reason for the above phenomenon is not complicated. One of the reasons we believe is that, in addition to the last budget terms and the total amount of noise, the factors affecting the clustering should also include the direction of the noise. In the clustering algorithm, we usually regard the data that needs to be clustered as points in a space; it means that the noise we add to each centroid has direction. Although a low budget usually implies noise of large magnitude, if we take the noise direction into consideration, a large amount of noise can also have a positive effect, such as correcting the deviation caused by the previous noise. Therefore, the method FlipTy which allocates some low budget terms during the allocation process may not cause a negative impact.
Second, when the total number of iterations exceeds a certain number, the gaps between the four methods gradually narrow. The reason is simple. The gaps between the series strategies and the average strategy at first increase with the total number of iterations, which is the advantage of our series strategies. However, as the number of iterations increases further, any kind of strategy gradually tends toward the average, which is inevitable. Therefore, we find that even the geometric strategy approaches the average method gradually as the number of iterations increases. However, this does not mean that our series strategies are not as good as the average strategy, since the experiments on the two data sets have shown that, within the maximum total number of iterations, our series strategies are always better than the average strategy.
Third, the complexity of this data set is very close to the real one, and, in such a data set, we still find that the number of iterations to achieve the best performance is actually much smaller than we thought. This implies that although the differential privacy mechanism brings noise, it does not mean that the original algorithm needs to become more complicated. Of course, it should be emphasized that this is only for clustering.
Therefore, it can be ensured that, on the premise of satisfying the clustering performance requirements, our series strategies can be sufficiently better than the average allocation and can support more iterations meanwhile.
8. Conclusion
From the perspective of monotonicity and convergence, we studied two series strategies for privacy budget allocation in this paper: the geometric series strategy and the Taylor series strategy. To quickly find a series that allocates the privacy budget efficiently and satisfies practical needs, we provided a simple calculation method to determine the key parameters for the two series strategies: the ratio and the initial term . Our series strategies have a high level of flexibility and adaptability: the budget sequence can easily change its monotonicity to meet the application requirements, and a minimum acceptable budget can be set to adjust the consumption speed of the budget. To further improve security, we briefly explored protection for the budget sequence. Several sets of experiments have shown the effectiveness of our strategies and verified that the choice of budget allocation method clearly affects the utility of the final output, especially in data mining and machine learning. Taking clustering as an example, our series budget allocation methods can achieve higher accuracy than other existing methods and can also support many more iterations. We believe that the series strategy can become an effective and convenient tool for budget allocation.
Of course, we also find some interesting problems during the experiment. For example, at present, we add noise to the centroid of each cluster to obtain the next clustering results, but, in fact, adding noise to any one centroid affects all clusters partitioned more or less in the next iteration. Thus, does the existing method add too much unnecessary noise? In addition, even if the sensitivities of two query functions are both 1, due to the different ranges of results, the tolerance of the two query functions to noise is different. For example, for the count query with range and the max query with , the count function can tolerate more noise obviously. How to deal with it more reasonably will be our future work.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under grant nos. 61972209, 61572263, 61502251, 61502243, and 61602263 and in part by China Postdoctoral Science Foundation under Grant 2016M601859 and Grant 2015M581794.