Abstract

The privacy budget ε is an important parameter that measures the strength of differential privacy protection, and its allocation scheme has a great impact on the level of privacy achieved. This paper studies the selection of the parameter ε in several cases of differential privacy. Firstly, this paper proposes a differential privacy parameter configuration method based on the fault tolerance interval and analyzes the adversary's fault tolerance under different location parameters and scale parameters of the noise distribution. Secondly, this paper proposes an algorithm to optimize multi-query application scenarios, studies the location and scale parameters in detail, and proposes a differential privacy mechanism for multi-user query scenarios. Thirdly, this paper proposes differential privacy parameter selection methods for both single and repeated attacks and calculates the upper bound of the parameter ε based on the sensitivity Δf, the half-length d of the fault tolerance interval, and the adversary's success probability p. Finally, we carry out a variety of simulation experiments to verify our research scheme and give the corresponding analysis results.

1. Introduction

In recent years, with the rapid development of information technology, user data have experienced explosive growth. Personal information extracted by data mining and information collection has become a valuable resource for research and decision-making of various research institutions, organizations, and government departments [1]. The analysis and use of massive user data not only bring convenience to peopleʼs lives but also bring a great threat to user privacy protection [2].

More and more attention is being paid to protecting data privacy while making use of data. On the one hand, for published data, k-anonymity, l-diversity, and t-closeness protect sensitive information from attacks such as linking attacks, skewness attacks, and attacks based on underlying knowledge [3–7]. However, due to the lack of a strong attack model, these schemes are not robust against background knowledge attacks. The existing privacy protection models also lack effective and rigorous methods to prove and quantify the level of privacy protection; once the model parameters change, the quality of privacy protection can no longer be guaranteed. Differential privacy, in contrast, resists the above attacks and provides strong privacy protection, and it has therefore been widely adopted by scholars [8, 9].

1.1. Motivation

Privacy protection theory and technology must be able to withstand a variety of attacks. Moreover, with the rapid development of data analysis techniques such as data mining in recent years, attackers can extract information related to user privacy from massive data. Therefore, how to protect the privacy of user data while providing data of the highest possible availability during query, publishing, and sharing has become a research hotspot in privacy protection [10, 11].

At present, most of the proposed privacy protection schemes use anonymization, fuzzing, or data distortion (such as adding random noise) and rely on mathematical regression analysis, data distortion adjustment, and noise scale parameter adjustment to reduce the error caused by noise, so as to improve the availability of data [12–14]. However, these schemes share a shortcoming: when query users with different permissions and reputation levels query the same sensitive data, returning identical query results can cause the disclosure of private information.

Differential privacy has become a hot research topic with many practical applications in recent years. Compared with traditional privacy protection models, it has unique advantages. Firstly, the model assumes that the adversary has maximal background knowledge. Secondly, differential privacy has a solid mathematical foundation, a strict definition of privacy protection, and a reliable quantitative evaluation method. By using output perturbation to add random noise to the query output, whether a single record is in the dataset or not has little impact on the computed results. Even an adversary with maximal background knowledge cannot obtain accurate individual information by observing the computed results.

Research on differential privacy protection mainly focuses on improving privacy protection and data utility in differentially private data release, while only a small amount of rigorous mathematical reasoning has been devoted to the configuration of the privacy protection parameter in concrete practice. In practice, the dataset size, the sensitivity of the query function, and the privacy protection probability threshold should all be considered when configuring the privacy protection parameter.

Differential privacy rests on a sound mathematical basis and can quantitatively describe the risk of privacy disclosure [1]. These two features distinguish it from other methods. Even in the worst case, when an adversary knows all the sensitive data except one record, it ensures that the sensitive information is not disclosed, because the adversary cannot judge from the query output whether that record is in the dataset [2].

The research of differential privacy protection technology mainly considers three problems: (1) how to ensure that the designed privacy algorithm satisfies differential privacy, so that the data privacy is not leaked; (2) how to reduce the error probability to improve the availability of data; (3) in the face of different environments and attack modes, how to determine the value range of the parameter ε and give a credible and rigorous proof.

1.2. Contributions

Aiming at the above problems, to ensure the privacy and availability of sensitive data during data query, to prevent the leakage of real data information, and to reduce the probability that attackers obtain true results through differential attacks and probabilistic reasoning attacks, we study differential privacy parameter selection methods in various situations. The specific contributions are as follows:

(i) We propose a differential privacy parameter configuration method based on the fault tolerance interval, analyze the adversary's fault tolerance under different location parameters and scale parameters of the noise distribution, and study the influence of the user's query permission on the configuration of the privacy protection parameter.

(ii) We study the location and scale parameters in detail and propose a differential privacy mechanism for multi-user query scenarios.

(iii) For a single attack, we propose a differential privacy attack algorithm and calculate the upper bound of the parameter ε based on the sensitivity Δf, the half-length d of the fault tolerance interval, and the success probability p. Furthermore, we propose an attack model to analyze the security of differential privacy under repeated attacks: from the results of repeated attacks and the characteristics of the noise distribution function, we obtain the probability of the noise falling into the fault tolerance interval, deduce the probability of the adversary's successful attack by permutation and combination, and then obtain the selection range of the parameter ε.

(iv) We design several experiments, analyze the relationship between the adversary's fault tolerance and the privacy parameters, derive the configuration formula of the privacy parameter ε, and configure appropriate parameters without violating the privacy probability threshold.

This paper studies the selection of the parameter ε in three cases of differential privacy. The structure of this paper is as follows. In Section 2, we review the research progress on differential privacy parameters. In Section 3, we introduce the concepts and theory of differential privacy. In Section 4, we propose a fault-tolerance-based privacy parameter selection method and analyze the case of multiple scale parameters. In Section 5, we propose a differential privacy algorithm for multi-user queries. In Section 6, we study query attack models in differential privacy. In Section 7, we design relevant experiments and show the characteristics of the study through analysis and comparison. In Section 8, we summarize and propose future work.

2. Related Work

Recently, many achievements have been made in differential privacy research. At present, research on differential privacy protection technology combines database theory, clustering algorithms, statistics, and modern cryptography [1, 2]. It defines a very strict mathematical model and provides a rigorous, quantitative representation and proof of the privacy leakage risk [3–7, 15]. Based on the relevant contents, this paper divides the research work on differential privacy protection into two parts.

2.1. Research on the Basic Theory of Differential Privacy

How to reduce the noise added to a dataset while preserving differential privacy has been widely studied. Yi and Zhabin [16] proposed a data publishing algorithm based on the wavelet transform, which can effectively reduce the required noise scale and improve the accuracy of histogram counting queries. Park and Hon [10] studied the parameter ε in differential privacy protection and introduced a new attack index to capture the relationship between attack probability and privacy assurance. Yao [12] introduced the concept of α-mutual information security and showed that statistical security implies mutual information security. Du and Wang [13] proposed a query model and implemented differential privacy with Laplace noise. Tsou and Chen [17] quantified the disclosure risk and linked differential privacy with k-anonymity. Zhang and Liu [18] proposed a privacy-preserving decision tree classification model based on the differential privacy mechanism which, through the Laplace mechanism and the exponential mechanism, provides users with a secure data access interface and optimizes the search scheme to reduce the error rate.

Lin et al. [19] proposed an optimized differentially private online transaction scheme for online banking, which sets consumption boundaries with additional noise and selects different boundaries while satisfying the definition of differential privacy. Besides, they provided a theoretical analysis to prove that the scheme meets the differential privacy restriction. The choice of a privacy mechanism usually does not have a significant impact on performance but is critical to maintaining the usability of the results. Goryczka and Xiong [20] described and compared distributed data aggregation methods with respect to security and confidentiality, studied secure multiparty addition protocols, and proposed a new effective Laplace mechanism that ensures secure computation, minimal communication traffic, and high system reliability. Kang and Li [21] proposed a new framework based on the concept of differential privacy that purposefully adds noise to locally perturb training parameters, achieving a compromise between convergence performance and privacy protection level.

Li et al. [22] focused on linear query functions under the Laplace mechanism and proposed a method to determine the upper bound on the number of linear queries from the perspective of information theory. Huang and Zhou [23] proposed a differential privacy mechanism to optimize the number of queries in multi-user scenarios and analyzed the distortion of the data distribution and the absolute value of the noise in terms of utility. Ye and Alexander [15] studied the minimax estimation problem for discrete distributions under differential privacy constraints, characterizing the structure of the optimal privatization scheme at a given privacy level so as to minimize the expected estimation loss.

2.2. Application of Differential Privacy

Differential privacy has a wide range of applications. Cheng et al. [11] realized the private publishing of high-dimensional data and determined the optimal parameters by non-overlapping coverage. The studies in [14, 24] introduced differential privacy to protect data privacy and prevent the adversary from inferring important sensitive information. Due to the high complexity and dimensionality of the data, the study in [25] proposed a data partition technique and further used an interactive differential privacy strategy to resist privacy leakage. Based on noise estimation and the Laplace mechanism, the work in [26] studied the trade-off between privacy and utility, derived an optimal differential privacy mechanism, and effectively adapted it to the needs of personalized privacy protection.

Zhang et al. [27] formally studied the problem of privacy-preserving set-valued data publishing on a hybrid cloud, provided a complete system framework, designed a new data partition mechanism, and further built query analysis tools that automatically switch query structures to optimize hybrid cloud data queries while ensuring the confidentiality of the data. In a voting system, users can report their desired parameter values to a selector mechanism. Without limiting user preferences, the study in [28] struck a balance between protecting personal privacy and returning accurate results through control of the parameter ε.

Sun and Tay [29] constructed an optimization framework that combines local variance privacy and inferential privacy measures and proposed a two-stage local privacy mapping model that can achieve information privacy and local variance privacy within a predetermined budget. Cao and Yoshikawa [30] studied the potential privacy loss of a traditional differential privacy mechanism under time dependence, analyzed the privacy loss against adversaries that exploit time dependence, and designed a fast algorithm to quantify the temporal privacy leakage. Based on the differential privacy model, the study in [31] constructed a privacy protection method based on clustering and noise and proposed a privacy measurement algorithm based on adjacency degree, which can objectively evaluate the privacy protection strength of various schemes and prevent graph structure and degree attacks.

In cloud services, the study in [32] proposed a priority-ranking query information retrieval scheme to reduce the query overhead on the cloud: a higher-ranking query retrieves a higher percentage of matching files, and users can retrieve files on demand by selecting queries of different levels. Sun and Wang [33] proposed a weight calculation system based on the classification and regression tree method, which combines differential privacy with decision trees and uses a differentially private mini-batch gradient descent algorithm to track the privacy loss and prevent an adversary from invading personal privacy. Chamikara et al. [34] proposed a recognition protocol that uses differential privacy to perturb facial features and stores the data on a third-party server, which can effectively prevent attacks such as membership inference and model memorization attacks.

To determine a reasonable release time for dynamic positioning data, the study in [35] designed an adaptive sampling method based on a proportional-integral-derivative controller and proposed a heuristic quadtree partition method and a privacy budget allocation strategy to protect the differential privacy of published data, which improved the accuracy of statistical queries and the availability of the published data. There is often a trade-off between privacy and mining results. Xu and Jiang [36] described the interaction between users in a distributed classification scenario, constructed a Bayes classifier, and proposed an algorithm that allows users to change their privacy budget; users can add noise to meet different privacy standards. Yin and Xi [37] combined practicability with privacy to establish a multi-level location information tree model and used the exponential mechanism of differential privacy to add noise to the access frequency of the selected data.

3. Basic Concepts

This section introduces some concepts of differential privacy and the related theory.

Definition 1 (Adjacent datasets) [1]. Given two datasets D and D′ with the same attribute structure, if they differ in at most one record, the datasets D and D′ are called adjacent datasets.

Definition 2 (Differential privacy) [1]. A random algorithm M satisfies ε-differential privacy if and only if, for any two adjacent datasets D and D′ differing by at most one tuple, and any output set S ⊆ Range(M), the following condition is met:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S],

where ε is a user-specified constant and e is the base of the natural logarithm. When the parameter ε is small enough, it is difficult for an adversary to distinguish whether the query function acted on D or on D′ given the same output S.

Definition 3 (Global sensitivity) [1]. For a function f: D → R^k, the global sensitivity of the function f is expressed as follows:

Δf = max_{D, D′} ‖f(D) − f(D′)‖₁,

where D and D′ are adjacent datasets, k is the dimension of the function f, and ‖f(D) − f(D′)‖₁ is the 1-norm distance between f(D) and f(D′).

Definition 4 (Laplace mechanism) [1]. It adds independent noise to the true answer and uses Lap(b) to represent the noise from the Laplace distribution with scale parameter b.
For a function f over a dataset D, the mechanism M(D) = f(D) + Lap(Δf/ε) provides ε-differential privacy. For a query f on the database D, the random algorithm M returns f(D) + x to the user, adding noise x that satisfies the Laplace distribution to the query result. In probability theory and statistics, the probability density function of the variable x is expressed as follows:

p(x) = (1/2b) · exp(−|x − μ|/b).

This is the Laplace distribution Lap(μ, b), where μ is the location parameter, b is the scale parameter, and x is a sample value satisfying the Laplace distribution. With b = Δf/ε, notice that the larger the ε, the smaller the b. For the convenience of discussion, let μ = 0; the expectation and variance are then 0 and 2b², respectively. The implementation of an ε-differential privacy algorithm is relatively simple. In the Laplace distribution Lap(μ, b), the location parameter μ does not change the privacy guarantee against the adversary, while the scale parameter b directly affects the vulnerability to attack. When the parameter b is smaller, the sampled noise concentrates near the location parameter μ; on the contrary, when the parameter b is large enough, the sampled noise approaches a near-uniform spread over a wide interval, which makes inference very difficult for the adversary.
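To make the Laplace mechanism concrete, here is a minimal Python sketch; the function name and parameters are our own illustration, not notation from this paper:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, mu=0.0, rng=None):
    """Return a noisy answer satisfying epsilon-differential privacy.

    The noise is drawn from Lap(mu, b) with scale b = sensitivity / epsilon;
    shifting by the location parameter mu does not change the guarantee.
    """
    rng = rng or np.random.default_rng()
    b = sensitivity / epsilon                 # scale parameter of Definition 4
    noise = rng.laplace(loc=mu, scale=b)
    return true_answer + noise

# Example: a counting query (sensitivity 1) answered with a small budget.
noisy_count = laplace_mechanism(true_answer=128, sensitivity=1.0, epsilon=0.5)
```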

Definition 5 ((ε, δ)-differential privacy) (see [1, 38]). A mechanism M satisfies (ε, δ)-differential privacy if, for any adjacent datasets D and D′ and any output set S, it has the formula

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ,

where ε and δ are the privacy parameters and M is the private version of the algorithm f.

Theorem 1 (Sequential composition theorem [2]). Let mechanisms M₁, M₂, ..., Mₙ satisfy ε₁-differential privacy, ε₂-differential privacy, ..., εₙ-differential privacy, respectively. When they are applied to the same dataset, the published results satisfy ε-differential privacy with ε = Σᵢ εᵢ.

Theorem 2 (Parallel composition theorem [2]). Let a dataset D be divided into n disjoint units D₁, ..., Dₙ. If a mechanism Mᵢ applied to Dᵢ satisfies εᵢ-differential privacy for each i, then the combination satisfies (max₁≤ᵢ≤ₙ εᵢ)-differential privacy.

Theorem 3 (Convexity theorem [38]). Given two algorithms M₁ and M₂ that both satisfy ε-differential privacy, for any probability p ∈ [0, 1], let M_p be the mechanism that uses the algorithm M₁ with probability p and the algorithm M₂ with probability 1 − p; then the mechanism M_p satisfies ε-differential privacy.
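As a quick illustration of Theorems 1 and 2, the following sketch (with illustrative function names of our own) tracks the total budget consumed by queries on the same dataset versus queries on disjoint partitions:

```python
def sequential_budget(epsilons):
    # Theorem 1: queries on the SAME dataset compose additively.
    return sum(epsilons)

def parallel_budget(epsilons):
    # Theorem 2: queries on DISJOINT partitions cost only the maximum.
    return max(epsilons)

assert sequential_budget([0.1, 0.2, 0.3]) == 0.6
assert parallel_budget([0.1, 0.2, 0.3]) == 0.3
```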

4. Privacy Parameter Selection Based on Fault Tolerance

The query value seen by the adversary is generated from the real value, so the distribution of the noise directly affects the probability of the adversary obtaining the real information.

4.1. Privacy Fault Tolerance

For some query functions, if the noise falls in the interval (−d, d), the adversary can infer the true value with a large probability and then analyze whether a specific record is in the dataset or not. In this paper, (−d, d) is called the fault tolerance interval, and the corresponding probability T is called the fault tolerance.

According to the definition of the Laplace distribution, the probability that the random noise lies in the fault tolerance interval can be obtained by integrating the probability density function over (−d, d). Thus, the mathematical expression of the adversary's fault tolerance is obtained as follows:

T = ∫_{−d}^{d} (1/2b) · exp(−|x − μ|/b) dx.

Through this mathematical analysis, we can select appropriate privacy parameters and add noise that meets the requirements of differential privacy protection, so as to prevent the adversary's probabilistic reasoning attack.

4.2. Analysis of Privacy Parameter

When the adversary's fault tolerance satisfies the privacy probability threshold, an appropriate value of the scale parameter can be obtained. In this method, the privacy probability threshold p is determined by the privacy attribute, which means that the success probability of the adversary's probabilistic inference attack will not exceed the privacy protection threshold.

To meet the requirements of privacy protection, the scale parameter b must satisfy the formula

T = ∫_{−d}^{d} (1/2b) · exp(−|x − μ|/b) dx ≤ p.     (8)

The mathematical expression of the fault tolerance T takes different forms according to the position parameter μ; each case follows from the Laplace cumulative distribution function.

(1) When μ = 0, we can get the formula

T = 1 − e^(−d/b).

(2) When 0 < μ < d, by solving formula (8), we can get the formula

T = 1 − (1/2)e^(−(d−μ)/b) − (1/2)e^(−(d+μ)/b).

(3) When μ ≥ d, we can get the formula

T = (1/2)e^(−(μ−d)/b) − (1/2)e^(−(μ+d)/b).

(4) When −d < μ < 0, by solving the above inequality, we can obtain the formula

T = 1 − (1/2)e^(−(d−|μ|)/b) − (1/2)e^(−(d+|μ|)/b).

(5) When μ ≤ −d, formula (8) can be rewritten as follows:

T = (1/2)e^(−(|μ|−d)/b) − (1/2)e^(−(|μ|+d)/b).
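All five cases follow from the cumulative distribution function of the Laplace distribution; the sketch below (our own illustration, not from the paper) computes T for any location parameter and numerically confirms the μ = 0 case:

```python
import math

def laplace_cdf(x, mu, b):
    # Cumulative distribution function of Lap(mu, b).
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)

def fault_tolerance(mu, b, d):
    # T = Pr[noise in (-d, d)] for noise ~ Lap(mu, b).
    return laplace_cdf(d, mu, b) - laplace_cdf(-d, mu, b)

# Case (1), mu = 0, agrees with the closed form T = 1 - exp(-d/b).
assert abs(fault_tolerance(0.0, 2.0, 1.0) - (1 - math.exp(-0.5))) < 1e-12
```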

For the case μ = 0, substituting b = Δf/ε into T = 1 − e^(−d/b) ≤ p and solving for ε, the budget parameter ε configuration can be expressed as follows:

ε ≤ (Δf/d) · ln(1/(1 − p)).

The remaining cases are handled analogously from the corresponding expressions for T.

From the above analysis, we can deduce the selection range of the privacy parameter ε under different location parameters, scale parameters, and privacy probability thresholds.

In this paper, the value range of the query authority w is set as [0, 1]. To configure smaller privacy protection budget parameters for users with low query rights, an upper bound on the privacy budget parameter is set. Based on this, the configuration of the privacy parameter ε under different query permissions can be obtained by scaling the bound above by the authority, for example,

ε(w) ≤ w · (Δf/d) · ln(1/(1 − p)),

so that a lower authority yields a smaller budget.

Through this privacy parameter configuration method, the privacy protection probability threshold can be set, and an appropriate privacy parameter ε can be selected according to the query function and the fault tolerance, so as to achieve privacy protection while maximizing data utility.
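A minimal configuration sketch follows, assuming the μ = 0 case and linear scaling of the budget by the query authority w (both assumptions of this illustration, not statements of the paper):

```python
import math

def configure_epsilon(sensitivity, d, p_threshold, authority=1.0):
    """Upper bound on the budget epsilon (mu = 0 case).

    sensitivity: global sensitivity Delta f of the query function
    d:           half-length of the fault tolerance interval
    p_threshold: privacy probability threshold on the fault tolerance T
    authority:   query authority w in [0, 1]; lower w => smaller budget
    """
    eps_max = (sensitivity / d) * math.log(1.0 / (1.0 - p_threshold))
    return authority * eps_max

# A user with half the authority gets half the maximal budget.
print(configure_epsilon(sensitivity=1.0, d=0.5, p_threshold=0.9, authority=0.5))
```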

5. Differential Privacy of Multiuser Query

In this section, we continue to study the location and scale parameters and propose a differential privacy mechanism for multi-user queries.

Assume that the number of users is n and the number of queries for each user is m. The query set is Q = {q₁, q₂, ..., q_m}; the results for user u_i are perturbed with the scale parameter b = Δf/ε and the location parameter μ_i. The location parameter μ_i is randomly chosen from the interval [l₁, l₂].

According to Definition 3, the global sensitivity is Δf. Let a_j = q_j(D) + x be the noisy value of the query q_j on the database D, where q_j(D) is the real value of q_j and x is noise with location parameter μ_i and scale parameter b. Similarly, a′_j = q_j(D′) + x is the noisy answer of the query q_j on the adjacent database D′, and q_j(D′) is the real value of q_j on D′.

Theorem 4. For the database D and the query set Q, the mechanism M satisfies mε-differential privacy.

Proof. For the database D and the user's query q_j, the location parameter is μ_i, so a_j = q_j(D) + x, where the noise x meets the Laplace distribution Lap(μ_i, Δf/ε); from its density, we get

Pr[a_j = z | D] = (ε/2Δf) · exp(−ε|z − q_j(D) − μ_i|/Δf).

For the adjacent database D′, we likewise get

Pr[a_j = z | D′] = (ε/2Δf) · exp(−ε|z − q_j(D′) − μ_i|/Δf).

For the user's query q_j, by the triangle inequality, we get

Pr[a_j = z | D] / Pr[a_j = z | D′] ≤ exp(ε · |q_j(D) − q_j(D′)|/Δf) ≤ e^ε,

so each single query is ε-differentially private regardless of the location parameter μ_i. In Algorithm 1, there are some denotations: the database is denoted by D and its global sensitivity is Δf; for each query of a user, the privacy budget is ε. According to Theorem 1 (sequential composition) applied to the query set Q of size m, this mechanism is mε-differentially private.

Require: the number of users n
The number of queries for each user m
The query set Q = {q₁, q₂, ..., q_m}
The interval [l₁, l₂]
The database D and its global sensitivity Δf
The privacy budget ε
Ensure: the set of answers for the queries
(1) For each user u_i do
(2) Choose μ_i from [l₁, l₂] for user u_i
(3) Set the user's noise distribution Lap(μ_i, Δf/ε)
(4) For each query q_j do
(5) The answer a_ij = q_j(D) + x_ij, x_ij ~ Lap(μ_i, Δf/ε)
(6) End
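A Python sketch of Algorithm 1 follows; the helper names and the toy queries are our own illustration of the procedure above:

```python
import numpy as np

def multi_user_query(database, queries, n_users, sensitivity, epsilon,
                     loc_interval=(0.0, 1.0), rng=None):
    """Each user gets a private, randomly chosen location parameter mu_i;
    every answer is perturbed with Lap(mu_i, sensitivity / epsilon)."""
    rng = rng or np.random.default_rng()
    b = sensitivity / epsilon
    answers = {}
    for i in range(n_users):
        mu_i = rng.uniform(*loc_interval)       # step (2): choose mu_i from [l1, l2]
        answers[i] = [q(database) + rng.laplace(loc=mu_i, scale=b)
                      for q in queries]         # steps (4)-(5): noisy answers
    return answers

# Example: two counting queries over a toy record list.
db = [25, 31, 47, 52, 38]
queries = [lambda d: sum(1 for r in d if r > 30), lambda d: len(d)]
print(multi_user_query(db, queries, n_users=3, sensitivity=1.0, epsilon=0.5))
```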

6. Research of the Attack Model

In actual application scenarios, users often face different kinds of privacy attacks. This section is divided into two parts: single attack and repeated attack.

6.1. Single Attack

Assume that in the worst case there are only two potential inputs, t₀ and t₁; this section discusses how the adversary guesses the real value t according to the noisy output Z. An adversary puts forward a query about the attack object. The database owner computes the result t for the query and returns Z = t + x to the adversary after adding the noise x. The adversary needs to judge from the result Z whether the attack object is in the collection or not. Each noise value x satisfies the Laplace distribution, so it is impossible for the adversary to guess t exactly. Considering the characteristics of the query function, the adversary can only guess that t falls within a certain range. To describe the above phenomenon, note that the probability of x lying in the interval (−d, d) decreases as the scale parameter b increases, which reflects the difficulty of the adversary's task.

Lemma 1. If the Laplace distribution Lap(0, Δf/ε) is used to add noise x to t, then the probability of Z = t + x lying in the interval (t − d, t + d) is expressed as P = 1 − e^(−εd/Δf).

Proof. Based on Definition 3 and the scale parameter b = Δf/ε, the probability of Z falling in the interval (t − d, t + d) is equal to the probability of x lying in the interval (−d, d). Therefore, from the Laplace distribution function, according to the case μ = 0 in Section 4, the probability of x in the interval (−d, d) is expressed as P = 1 − e^(−d/b) = 1 − e^(−εd/Δf).

Lemma 2. The probability of an adversary's success in Algorithm 2 is p = (1 + P)/2 = 1 − (1/2)e^(−εd/Δf).

Input: Z
Output: present or absence
Laplace distribution Lap(0, Δf/ε), and d = |t₁ − t₀|/2
(1) Z ← t + x, x ~ Lap(0, Δf/ε)
(2) If |Z − t₁| ≤ |Z − t₀|
(3) Return present
(4) Else
(5) Return absence

Proof. Assume that t = t₀ or t = t₁; there are two intervals (t₀ − d, t₀ + d) and (t₁ − d, t₁ + d) for Z. From Lemma 1, if t = t₀, the probability of Z lying in (t₀ − d, t₀ + d) is P. If t = t₁, the probability of Z lying in (t₁ − d, t₁ + d) is the same.
Therefore, since the two intervals do not overlap, the probability of success is p = P + (1 − P)/2 = (1 + P)/2: if Z falls into the interval around the true value, the adversary answers correctly; otherwise, the adversary can only answer correctly with probability 1/2. Note: t = t₀ means that the attack object is not in the original data; t = t₁ means that the attack object is in the original data. For a common query, we can deduce the probability p = 1 − (1/2)e^(−εd/Δf).
With Lemma 2 and Algorithm 2, when the adversary's success probability is required not to exceed p, we can obtain the upper bound of the ε that meets the formula

ε ≤ (Δf/d) · ln(1/(2(1 − p))).     (20)

The upper bound of the parameter ε in formula (20) is independent of the dataset; it is related only to the query function (through Δf and d) and to the adversary's success probability p.
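A Monte Carlo sketch (our own illustration; the decision rule follows Algorithm 2 as reconstructed above) can be used to check the success probability 1 − (1/2)e^(−εd/Δf):

```python
import math
import numpy as np

def single_attack_success(t0, t1, sensitivity, epsilon, trials=200_000, seed=0):
    rng = np.random.default_rng(seed)
    b = sensitivity / epsilon
    d = abs(t1 - t0) / 2
    truth = rng.integers(0, 2, size=trials)             # 0 -> t0, 1 -> t1
    t = np.where(truth == 1, t1, t0)
    z = t + rng.laplace(loc=0.0, scale=b, size=trials)  # noisy answers
    guess = (np.abs(z - t1) <= np.abs(z - t0)).astype(int)
    empirical = (guess == truth).mean()
    analytic = 1 - 0.5 * math.exp(-epsilon * d / sensitivity)
    return empirical, analytic

# The two values should be close.
print(single_attack_success(t0=10, t1=11, sensitivity=1.0, epsilon=1.0))
```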

6.2. Repeated Attack

Although differential privacy is an advanced technology for protecting personal privacy, the Laplace mechanism has an obvious defect: if the adversary can issue the same query an unlimited number of times, he can infer the real query result by observing the point around which the query results concentrate. Therefore, it is necessary to study the limit on the number of queries.

According to the above sections, an adversary can obtain N results Z₁, Z₂, ..., Z_N after N attacks.

Lemma 3. If the adversary attacks N times and noise is added to t by the Laplace distribution each time, the probability that exactly k of the N results lie in (t − d, t + d) is expressed as follows:

Pr[k] = C(N, k) · P^k · (1 − P)^(N − k), where P = 1 − e^(−εd/Δf).

Proof. According to Lemma 1, in a single query the probability of Z lying in the interval (t − d, t + d) is expressed as P = 1 − e^(−εd/Δf). If there are k results in the interval (t − d, t + d), then, from the binomial distribution function, the probability of k hits in N repeated attacks is C(N, k)P^k(1 − P)^(N − k).
In Algorithm 3, q is the normal query and Z_i = t + x_i; the half-length of the fault-tolerant interval is d = |t₁ − t₀|/2. After making N attacks, the adversary can judge whether the attack object is in the result set.

Input: Z₁, Z₂, ..., Z_N
Output: present or absence
Laplace distribution Lap(0, Δf/ε), and d = |t₁ − t₀|/2
(1) k₁ ← 0, k₂ ← 0, i ← 1
(2) While i ≤ N
(3) Begin
(4) Z_i ← t + x_i, x_i ~ Lap(0, Δf/ε)
(5) If |Z_i − t₁| ≤ |Z_i − t₀|
(6) k₁ ← k₁ + 1
(7) Else
(8) k₂ ← k₂ + 1
(9) i ← i + 1
(10) End
(11) If k₁ ≥ k₂
(12) Return present
(13) Else
(14) Return absence
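A Python sketch of Algorithm 3 (our own illustration, using the per-answer decision rule of Algorithm 2 as the vote):

```python
import numpy as np

def repeated_attack(t, t0, t1, sensitivity, epsilon, n_attacks, rng=None):
    """Query N times and take a majority vote over the per-answer
    decision rule of Algorithm 2; t is the hidden true value."""
    rng = rng or np.random.default_rng()
    b = sensitivity / epsilon
    z = t + rng.laplace(loc=0.0, scale=b, size=n_attacks)     # N noisy answers
    votes_present = np.sum(np.abs(z - t1) <= np.abs(z - t0))  # votes for t1
    return "present" if votes_present >= n_attacks / 2 else "absence"

# Even with a small budget, enough repeated queries expose the true value.
print(repeated_attack(t=11, t0=10, t1=11, sensitivity=1.0,
                      epsilon=0.1, n_attacks=501))
```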

Lemma 4. According to Algorithm 3, the adversary performs N queries, and the probability of success is expressed as follows:

p = Σ_{k = ⌈N/2⌉}^{N} C(N, k) · p₁^k · (1 − p₁)^(N − k), where p₁ = 1 − (1/2)e^(−εd/Δf).     (23)

Proof. Let p₁ = 1 − (1/2)e^(−εd/Δf); assume that t = t₀ or t = t₁. Considering the two intervals (t₀ − d, t₀ + d) and (t₁ − d, t₁ + d), suppose that, after N attacks, k results fall on the side of the true value and N − k results fall toward the other candidate. According to Lemma 3, if t = t₀, the probability of exactly k such results is C(N, k)p₁^k(1 − p₁)^(N − k).
If t = t₁, then, by symmetry, the probability of exactly k results on the side of t₁ is the same.
Therefore, since Algorithm 3 decides by majority, the probability of a successful attack is expressed as p = Σ_{k ≥ ⌈N/2⌉} C(N, k)p₁^k(1 − p₁)^(N − k).
Here, t = t₀ indicates that the attack object is not in the original dataset, and t = t₁ indicates that the attack object is in the original dataset.
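Under the reconstructed formula of Lemma 4, the success probability can be evaluated directly; the sketch below shows how it approaches 1 as N grows, which motivates bounding the number of queries:

```python
import math

def repeated_success_probability(sensitivity, epsilon, d, n_attacks):
    # Per-query success probability from Lemma 2.
    p1 = 1 - 0.5 * math.exp(-epsilon * d / sensitivity)
    # Lemma 4: probability that a majority of the N votes is correct.
    return sum(math.comb(n_attacks, k) * p1**k * (1 - p1)**(n_attacks - k)
               for k in range(math.ceil(n_attacks / 2), n_attacks + 1))

for n in (1, 11, 101, 501):
    print(n, round(repeated_success_probability(1.0, 0.1, 0.5, n), 4))
```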

7. Experiment Simulation Analysis

The experimental environment is as follows: Intel Core i7-7500 CPU at 2.9 GHz, 8 GB memory, Windows 10 operating system, and MATLAB 2015b. The experiments use the UCI Adult machine learning dataset, which contains 48,842 records of US census data with 14 attributes. Here, we select the five attributes in Table 1: education, marital status, occupation, native country, and work class.

7.1. Fault-Tolerant Experiment

To present the problem more intuitively, the parameter ε is analyzed qualitatively and quantitatively according to the configuration method of the privacy parameter in Section 4.

In Figure 1, when the location parameter μ is outside the fault tolerance interval (μ ≤ −d or μ ≥ d), the adversary's fault tolerance over the interval is low; the adversary cannot effectively obtain the real information in the dataset. This is because, when the location parameter is large, the data distortion is serious and the data availability is low.

When the location parameter μ is in (−d, d), the adversary's fault tolerance is higher, which is meaningful for privacy protection analysis. Based on Figure 1, this paper analyzes the impact of different interval lengths on the adversary's fault tolerance when the location parameter is within the fault tolerance interval.

In Figure 2, the configuration of ε is related to the location parameter and the fault tolerance interval of the noise distribution. For the same fault tolerance interval, the adversary's fault tolerance is largest when the position parameter is 0. For the same location parameter, the larger the fault tolerance interval, the greater the fault tolerance level. The maximum privacy parameter value can thus be obtained without violating the privacy protection probability threshold p.

In Figure 3, the smaller the query authority, the smaller the upper limit of the privacy protection budget parameter. By limiting this upper bound, different ε values can be configured for query users with different ranges of query permissions.

7.2. Query Success Rate Experiment

The privacy budget ε is an important factor in measuring the strength of privacy protection, and different allocation schemes have a great impact on the error of the privacy protection algorithm. Next, we give the query probability under different conditions to verify the role of our parameters.

In Figure 4, for the interval (−d, d), with the increase of the parameter value, the probability of the noise falling in the given interval also increases.

Figure 5 shows the probability curves for different values of the privacy parameter over the interval. It can be seen from the figure that, within the considered range, the probability of the noise falling into the interval decreases as the parameter value increases.

Figure 6 shows the probability curves of the noise value falling in the interval under different privacy parameters. As can be seen from Figure 6, with the increase of the privacy parameter, the probability of falling into the given interval becomes smaller.

In Figure 7, under the same privacy budget ε, the probability of attack success increases with the number of attacks; as the number of attacks N increases, the success rate approaches 1. Furthermore, the selection range of the parameter ε can be deduced from formula (23).

8. Conclusion

This paper studies the selection of the parameter ε in several cases of differential privacy. Firstly, this paper proposes a differential privacy parameter configuration method based on the fault tolerance interval and analyzes the adversary's fault tolerance under different location parameters and scale parameters of the noise distribution. Secondly, this paper proposes an algorithm to optimize multi-query application scenarios and a differential privacy mechanism for multi-user query scenarios. Thirdly, this paper proposes differential privacy parameter selection methods for several attack models and calculates the upper bound of the parameter ε based on the sensitivity Δf, the half-length d of the fault tolerance interval, and the success probability p. Finally, we have carried out a variety of simulation experiments to verify our research scheme and given the corresponding analysis results.

Research on ε is not limited to choosing a proper privacy parameter value in the Laplace mechanism; it also includes choosing a reasonable ε in the exponential mechanism and calculating an ideal parameter value by the methods of probability and statistics.

Data Availability

The original data of this article are confidential, but the processed data (the data used to support the research in this article) are given in Section 7 of the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.