Abstract

Collecting and analyzing data can generate a wealth of knowledge, but it can also raise privacy concerns. Local differential privacy (LDP) is the latest privacy standard to address this issue and has been implemented on platforms such as Chrome, iOS, and macOS. In the LDP solution, users first perturb their own data on the user side and then upload the perturbed data to the server. This protects not only against background knowledge attacks but also against untrusted servers. However, existing multidimensional solutions ignore the personalized privacy needs of users. In this paper, we meet the personalized privacy needs of users while reducing the mean square error of the perturbed data. Specifically, we first designed a personalized privacy budget allocation within a certain range, which meets the personalized privacy needs of users. Then, we optimized the sampling dimension of the existing solution, which resulted in a smaller mean square error of the perturbed data. Finally, we proposed our solution for collecting multidimensional numerical data and estimating the mean. In addition, we conducted experiments on two real datasets. The results demonstrate that the mean square error of our solution is lower than that of existing solutions.

1. Introduction

With the development and advancement of information technology, a large amount of data is being generated every day in all industries. The vast majority of companies and organizations recognize the wealth of knowledge that can be generated by collecting and analyzing data. As a result, data collection and analysis have become widespread. However, data collection also raises serious privacy concerns. Large amounts of sensitive user data have been leaked, raising a number of public safety issues such as fraud and harassment. To address the privacy issue, Dwork et al. [1] proposed differential privacy (DP) as a standard for privacy protection in various domains.

Unfortunately, since DP focuses only on centralized datasets and assumes that the server is trusted, data is not protected on the client side and privacy issues still arise. In this context, local differential privacy (LDP) [2] is proposed as a variant of differential privacy. In the LDP solution, users first perturb their own data on the user side and then upload the perturbed data to the server. In this way, the server receives only the perturbed data of the user, and the real data of the user never leaves the user’s side. Therefore, there is no privacy problem due to the untrustworthiness of the server. Due to its strong security, LDP has been studied and applied by organizations such as Google (Chrome [3]), Apple (iOS and macOS [4]), and Microsoft (Windows Insiders [5]).

However, existing multidimensional LDP solutions focus only on improving the usability of aggregated datasets while ignoring users’ personalized privacy needs. In most solutions, users can only give their data to the perturbation mechanism to generate perturbed data and then upload it to the server. Specifically, the perturbation mechanism in a multidimensional LDP solution typically has two inputs, the user data and the privacy budget for each attribute. User data is the real data of the user, and the privacy budget for each attribute is usually equally distributed. This privacy budget allocation method meets the privacy needs of the server setting, i.e., each attribute is equally important. However, since users usually have different sensitivities to each attribute, it usually does not meet the true user privacy needs. As a result, attributes with high sensitivity have weak protection strength, while attributes with low sensitivity have poor usability. Therefore, it is important to design a multidimensional LDP solution that meets users’ personalized privacy needs.

This paper first discusses the existing optimal solution for multidimensional numerical data collection. To address its personalized privacy problem, we propose a new privacy standard, personalized local differential privacy (PLDP). Under the PLDP standard, the solution ensures overall security and availability while meeting the personalized privacy needs of users within certain ranges. To solve its sampling dimension problem, we assume that each attribute is uniformly distributed and obtain a better sampling dimension by minimizing the average variance. Based on the above work, we design a personalized multidimensional piecewise mechanism (PMPM) to collect multidimensional numerical data, which has a smaller mean square error than the existing optimal solution and can meet the personalized privacy needs of users. In addition, we compared our solution with existing solutions on two real datasets. The experimental results demonstrate that the mean square error of our scheme is lower than that of existing solutions. The contributions and innovations can be summarized as follows:
(1) We design a personalized privacy budget allocation within a certain range and further propose a new privacy standard, personalized local differential privacy. It meets the personalized privacy needs of users while ensuring the availability of perturbed data.
(2) We optimize the sampling dimension of the existing solution, which results in a smaller mean square error of the perturbed data.
(3) We propose a personalized multidimensional piecewise mechanism (PMPM) for collecting multidimensional numerical data and estimating the mean. It not only has a smaller mean square error but also meets the personalized privacy needs of users.
(4) We compare our solution with existing solutions on two real datasets. The results validate the superiority of our solution, which meets users’ personalized privacy needs while achieving a smaller mean square error.

2. Related Work

Differential privacy (DP) is a strict privacy standard designed to ensure that data can be shared without risk. It guarantees that even if an attacker knows all but one piece of data in the dataset, he still cannot infer information about that piece of data [6]. In a DP solution, the server must be trusted because it collects and perturbs the users’ real data. However, servers are not necessarily trusted in real life, which means that solutions may have privacy issues even if they satisfy DP. In this context, local differential privacy (LDP) [2] has been proposed as a variant of DP. In the LDP solution, the user first perturbs the real data on the client side and then uploads the perturbed data to the server. Only the data owner can access the original data, which provides stronger privacy protection for the user [7]. Even if the server is not trusted, there are no privacy issues.

Under LDP, collecting numerical data and estimating the mean is one of the basic goals of statistics. Duchi et al. [8–10] proposed an extreme value perturbation solution that randomly perturbs the user’s real data t ∈ [−1, 1] to one of two extreme values, −(e^ε + 1)/(e^ε − 1) or (e^ε + 1)/(e^ε − 1), with probabilities depending on t. The worst-case variance of this solution is ((e^ε + 1)/(e^ε − 1))², which is relatively small when ε is small but not competitive when ε is large, because no matter how large ε is, ((e^ε + 1)/(e^ε − 1))² remains greater than 1. Kairouz et al. [11] demonstrated that such extreme value perturbation solutions are not always optimal. Then, Wang et al. [12] proposed a distribution perturbation solution called the piecewise mechanism (PM), which randomly perturbs the user’s real data t ∈ [−1, 1] to a value in the output domain [−C, C], where C = (e^(ε/2) + 1)/(e^(ε/2) − 1). Specifically, for each input t, there are ℓ(t) = ((C + 1)/2)·t − (C − 1)/2 and r(t) = ℓ(t) + C − 1, where −C ≤ ℓ(t) < r(t) ≤ C. The solution perturbs the input t to [ℓ(t), r(t)] with high probability and to [−C, ℓ(t)) and (r(t), C] with low probability. Li et al. [13] proposed a method similar to PM, called the square wave mechanism, for estimating the distribution.

The solutions mentioned above are used to collect one-dimensional numerical data and estimate the mean. For multidimensional numerical data, the one-dimensional solution cannot simply be applied repeatedly to each attribute due to the curse of dimensionality. Duchi et al. [9] proposed a multidimensional extreme value perturbation solution that randomly perturbs the user’s real data to a vector in {−B, B}^d, where B is a constant related to d and ε. Based on Duchi’s solution, Nguyên et al. [14] proposed the Harmony solution, which randomly selects one attribute for perturbation and uploads it instead of all attributes. The Harmony solution has a smaller communication cost while achieving the same variance as Duchi’s solution. Wang et al. [15] adjusted the probabilities of Duchi’s solution so that it satisfies ε-LDP and its variance becomes smaller. Wang et al. [12] extended PM to multiple dimensions by random sampling, achieving a smaller variance than Duchi’s solution, especially when ε is large. In addition, Akter and Hashem [16] proposed personal local differential privacy and extended Duchi’s solution.

In summary, existing solutions for multidimensional numerical data collection and mean estimation perform well in terms of data availability but fall short in terms of users’ personalized privacy needs. Our goal is to design a solution that ensures overall usability and security while also meeting the personalized privacy needs of users.

3. Preliminaries

3.1. Differential Privacy

Differential privacy (DP) is a definition of privacy tailored to the problem of privacy-preserving data analysis [17]. In a DP solution, the users upload their real data to the server. For any analysis algorithm, the server perturbs the real output before releasing it, ensuring that an attacker cannot infer any real data about a user. DP is formalized in Definition 1.

Definition 1. (Differential privacy (DP) [18]). A randomized mechanism M satisfies ϵ-differential privacy if and only if for all datasets D1 and D2 differing on at most one element and all S ⊆ Range(M), it holds that Pr[M(D1) ∈ S] ≤ e^ϵ · Pr[M(D2) ∈ S].

3.2. Local Differential Privacy

Local differential privacy (LDP) is a variant of DP. Unlike DP, LDP ensures that an attacker cannot infer the user’s real data even if the server is untrustworthy. LDP is formalized in Definition 2.

Definition 2. (Local differential privacy (LDP) [19]). A randomized mechanism M satisfies ϵ-local differential privacy if and only if for any pair of inputs x and x′ and any possible output y, it holds that Pr[M(x) = y] ≤ e^ϵ · Pr[M(x′) = y].

Since LDP is based on differential privacy theory, it inherits the composability property of the latter [20], as shown in Theorem 1.

Theorem 1. (Sequential composition of LDP [21]). Suppose a randomized mechanism M consists of d independent randomized mechanisms M1, …, Md, where each Mi satisfies ϵi-LDP. Then M satisfies (ϵ1 + ϵ2 + ⋯ + ϵd)-LDP.
The proof appears in Appendix.
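To make Definition 2 concrete, the following sketch checks the LDP inequality for binary randomized response, the canonical ϵ-LDP mechanism. This example is illustrative only and is not part of any solution discussed in this paper; the function names are ours.

```python
import math

def rr_probs(eps):
    """Binary randomized response: report the true bit with probability
    e^eps / (e^eps + 1), and the flipped bit otherwise."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    return {(x, y): (p if x == y else 1 - p) for x in (0, 1) for y in (0, 1)}

def satisfies_ldp(probs, eps):
    """Check Pr[M(x) = y] <= e^eps * Pr[M(x') = y] for all inputs x, x' and outputs y."""
    return all(probs[(x, y)] <= math.exp(eps) * probs[(x2, y)] + 1e-12
               for x in (0, 1) for x2 in (0, 1) for y in (0, 1))
```

Randomized response calibrated to a budget ϵ satisfies ϵ-LDP, while the same mechanism calibrated to a larger budget violates the inequality for the smaller one.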

4. Collecting Multidimensional Numerical Data and Estimating Mean

In this section, we first discuss the existing optimal solution, MPM. After that, we point out the shortcomings of MPM and propose our solution, PMPM. Some notations used in this paper are listed in Table 1.

4.1. MPM: Existing Optimal Solution
4.1.1. One-Dimensional Numerical Data

Wang et al. [12] proposed a randomized mechanism called the piecewise mechanism (PM). It is mainly used to collect one-dimensional numerical data and estimate the mean. It takes a value t_i ∈ [−1, 1] as input and outputs a perturbed value t_i* ∈ [−C, C], where C = (e^(ε/2) + 1)/(e^(ε/2) − 1).

The probability density function of t_i* is a piecewise constant function, as shown below: pdf(t_i* = x | t_i) = p if x ∈ [ℓ(t_i), r(t_i)], and p/e^ε if x ∈ [−C, ℓ(t_i)) ∪ (r(t_i), C], where p = (e^ε − e^(ε/2))/(2e^(ε/2) + 2), ℓ(t_i) = ((C + 1)/2)·t_i − (C − 1)/2, and r(t_i) = ℓ(t_i) + C − 1.

The pseudocode is described in Algorithm 1.

Input: t_i ∈ [−1, 1]: the ith raw data; ε: the total privacy budget
Output: t_i* ∈ [−C, C]: the ith perturbed data
Sample x uniformly at random from [0, 1];
if x < e^(ε/2)/(e^(ε/2) + 1) then
 Sample t_i* uniformly at random from [ℓ(t_i), r(t_i)];
else
 Sample t_i* uniformly at random from [−C, ℓ(t_i)) ∪ (r(t_i), C];
return t_i*

Wang et al. [12] proved that PM satisfies ε-LDP. Moreover, given an input t_i, PM returns an output t_i* with E[t_i*] = t_i and Var[t_i*] = t_i²/(e^(ε/2) − 1) + (e^(ε/2) + 3)/(3(e^(ε/2) − 1)²).
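As a concrete illustration, PM can be sketched in Python as follows. This is our own rendering of the mechanism described above, assuming ℓ(t) = ((C + 1)/2)·t − (C − 1)/2 and r(t) = ℓ(t) + C − 1 as in [12]; the function name is ours.

```python
import math
import random

def piecewise_mechanism(t, eps, rng=random):
    """Perturb a value t in [-1, 1] under eps-LDP with the piecewise mechanism."""
    e_half = math.exp(eps / 2)
    C = (e_half + 1) / (e_half - 1)        # output domain is [-C, C]
    l = (C + 1) / 2 * t - (C - 1) / 2      # left end of the high-probability piece
    r = l + C - 1                          # right end of the high-probability piece
    if rng.random() < e_half / (e_half + 1):
        return rng.uniform(l, r)           # land close to the true value
    # otherwise sample uniformly from [-C, l) U (r, C]
    u = rng.uniform(0, (l + C) + (C - r))
    return -C + u if u < l + C else r + (u - (l + C))
```

Averaging many perturbed values recovers the true input, reflecting the unbiasedness stated above.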

4.1.2. Multidimensional Numerical Data

Then, Wang et al. [12] proposed a multidimensional piecewise mechanism (MPM) based on PM. It is mainly used to collect multidimensional numerical data and estimate the mean. The pseudocode is described in Algorithm 2.

Input: t_i ∈ [−1, 1]^d: the ith raw data; ε: the total privacy budget
Output: t_i*: the ith perturbed data
Let t_i* = ⟨0, 0, …, 0⟩;
Let k = max(1, min(d, ⌊ε/2.5⌋));
Sample k values uniformly without replacement from {1, 2, …, d};
for each sampled value j do
 Feed t_i[j] and ε/k as input to PM, and obtain a noisy value x_{i,j};
 t_i*[j] = (d/k) · x_{i,j};
return t_i*

Wang et al. [12] proved that MPM satisfies ε-LDP. Moreover, given an input t_i, MPM returns an output t_i* with E[t_i*[j]] = t_i[j] and Var[t_i*[j]] = (d·t_i[j]²)/(k(e^(ε/2k) − 1)) + (d(e^(ε/2k) + 3))/(3k(e^(ε/2k) − 1)²) + (d/k − 1)·t_i[j]².

By minimizing the worst-case variance of t_i*[j], they get k = max(1, min(d, ⌊ε/2.5⌋)).

In addition, let X_j = (1/n)·Σ_{i=1}^{n} t_i*[j] and m_j = (1/n)·Σ_{i=1}^{n} t_i[j]. Wang et al. [12] prove that X_j is an unbiased estimator of m_j, and with at least 1 − β probability, max_j |X_j − m_j| = O(√(d·log(d/β))/(ε·√n)), which is asymptotically optimal [9].

4.2. PMPM: Our Solution
4.2.1. System Model

Our solution mainly applies to data collection in crowdsourcing mode. An untrusted server wants to collect multidimensional data from each user. To protect privacy, the users perturb the data at each client. Then, the users upload the perturbed data to the server for use by third parties. Our goal is to personalize the perturbation while keeping the availability of the perturbed data as high as possible.

Specifically, the system model consists of a server and n users. Each user has data with d attributes. For each data, the user reduces its dimension from d to k by random sampling. For the k sampled attributes, the user allocates a privacy budget to each attribute and then perturbs each attribute according to the allocated privacy budget. After that, each user uploads the perturbed data to the server. The server aggregates the dataset, calculates the mean for each attribute, and publishes it. The system model of our solution is shown in Figure 1.

Example 1. For simplicity, we consider n is 4, d is 2, and k is 1. The simplified system model is shown in Figure 2.

4.2.2. Personalized Local Differential Privacy

In MPM, the privacy budget is evenly allocated to each sampled attribute. However, users have different sensitivities to each attribute, which we call personalized privacy needs. This results in weak privacy protection for some attributes and low availability for others. To meet personalized privacy needs, we propose a new concept called personalized local differential privacy (PLDP). The specific definition is shown in Definition 3.

Definition 3. (Personalized local differential privacy (PLDP)). A randomized mechanism M satisfies ϵ-personalized local differential privacy if and only if it allocates the total privacy budget ϵ to the d attributes according to the user’s needs within [ϵ/(τd), τϵ/d], and for any pair of inputs x and x′ and any possible output y, it holds that Pr[M(x) = y] ≤ e^ϵ · Pr[M(x′) = y], where τ (τ ≥ 1) is a personalization parameter.
The value of τ determines the range of privacy budget allocation. The larger τ is, the more personalized the randomized mechanism is. The smaller τ is, the less personalized the randomized mechanism is. When τ is 1, it is equivalent to allocating the privacy budget equally. We will introduce the specific settings of τ in combination with experiments in Section 5.3.
PLDP is an extended version of LDP. It adds a privacy budget allocation condition, which we call personalized privacy budget allocation. In the personalized privacy budget allocation, users can allocate the privacy budget to each attribute within a certain range according to their own personalized needs. Specifically, since τ, ϵ, and d are given, the users have a definite allocation range [ϵ/(τd), τϵ/d]. Then, users can allocate privacy budgets to each attribute at will within this range. It should be noted that the sum of the allocated privacy budgets must equal ϵ. Here, we give a simple allocation method. First, we divide attributes into important attributes and unimportant attributes. Then, we allocate the smallest privacy budget to each unimportant attribute. Finally, we evenly allocate the remaining privacy budget to the important attributes.

Example 2. For simplicity, we consider τ = 1.5, ϵ = 10, and d = 2 (two attributes: gender and age). Then, the range of the personalized privacy budget allocation is [10/3, 7.5]. Since we think age is important and gender is not, we allocate a privacy budget of 20/3 to age and 10/3 to gender.
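The simple allocation method above can be sketched as follows. The range [ϵ/(τd), τϵ/d], the even split over important attributes, and the helper name are our reading of the definition and of the method just described.

```python
def allocate_budget(eps, d, tau, important):
    """Give each unimportant attribute the minimum budget eps/(tau*d),
    then split the remaining budget evenly over the important attributes."""
    lo, hi = eps / (tau * d), tau * eps / d
    budgets = [lo] * d
    share = (eps - lo * (d - len(important))) / len(important)
    for j in important:
        budgets[j] = share
    # a valid personalized allocation stays inside [lo, hi] and sums to eps
    assert all(lo - 1e-9 <= b <= hi + 1e-9 for b in budgets)
    return budgets
```

For the setting of Example 2 (τ = 1.5, ϵ = 10, d = 2, age important), this yields 10/3 for gender and 20/3 for age.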
The relationship between LDP and PLDP can be summarized as follows. If a randomized mechanism M satisfies ϵ-PLDP, then it satisfies ϵ-LDP. If a randomized mechanism M satisfies ϵ-LDP and personalized privacy budget allocation, then it satisfies ϵ-PLDP.

4.2.3. Personalized Multidimensional Piecewise Mechanism

In MPM, the sampling dimension k is obtained by minimizing the worst-case variance. However, the worst-case scenario is usually rare. As a result, the variance of the overall data is not optimal. To solve this problem, we obtain the sampling dimension k by minimizing the average variance. Since the variance needs to be calculated, we first present our solution, the personalized multidimensional piecewise mechanism (PMPM). The pseudocode is described in Algorithm 3.

Input: t_i ∈ [−1, 1]^d: the ith raw data; ε: the total privacy budget; τ: the personalization parameter
Output: t_i*: the ith perturbed data
Let k be the sampling dimension obtained by minimizing the average variance (derived below);
Sample k values uniformly without replacement from {1, 2, …, d};
Let t_i* = ⟨0, 0, …, 0⟩;
for each sampled value j do
 Allocate a privacy budget to attribute j according to the user’s needs within the allowed range, obtaining ε_j;
 Feed t_i[j] and ε_j as input to PM, and obtain a noisy value x_{i,j};
 t_i*[j] = (d/k) · x_{i,j};
return t_i*
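A runnable sketch of Algorithm 3 is given below. It restates PM compactly, takes the sampling dimension k and the per-attribute budgets as arguments (the choice of k and the budget allocation are handled as described elsewhere in this section), and encodes unsampled attributes as 0, which the d/k scaling compensates for on the server side. The interface and names are ours, not the paper’s.

```python
import math
import random

def pm(t, eps, rng):
    """Compact piecewise mechanism: perturb t in [-1, 1] under eps-LDP."""
    eh = math.exp(eps / 2)
    C = (eh + 1) / (eh - 1)
    l = (C + 1) / 2 * t - (C - 1) / 2
    r = l + C - 1
    if rng.random() < eh / (eh + 1):
        return rng.uniform(l, r)
    u = rng.uniform(0, (l + C) + (C - r))
    return -C + u if u < l + C else r + (u - (l + C))

def pmpm(t_vec, k, budgets, rng=random):
    """PMPM sketch: sample k of the d attributes, perturb each sampled
    attribute with PM at its allocated budget, and scale by d/k."""
    d = len(t_vec)
    out = [0.0] * d                        # unsampled attributes report 0
    for j, eps_j in zip(rng.sample(range(d), k), budgets):
        out[j] = (d / k) * pm(t_vec[j], eps_j, rng)
    return out
```

Averaged over many users, each reported coordinate is an unbiased estimate of the corresponding true value.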

PMPM meets the personalized privacy needs of users while ensuring overall security and availability. Theorem 2 ensures the security of PMPM.

Theorem 2. PMPM satisfies ϵ-PLDP.
The proof appears in Appendix.
Theorem 3 ensures the unbiased nature of the perturbed data.

Theorem 3. Given an input t_i, PMPM returns an output t_i*, and for any sampled value j, E[t_i*[j]] = t_i[j].

The proof appears in Appendix.

Theorem 4 ensures the availability of the perturbed data.

Theorem 4. Given an input t_i, PMPM returns an output t_i*, and for any sampled value j, Var[t_i*[j]] = (d·t_i[j]²)/(k(e^(ε_j/2) − 1)) + (d(e^(ε_j/2) + 3))/(3k(e^(ε_j/2) − 1)²) + (d/k − 1)·t_i[j]².

The proof appears in Appendix.

By Theorem 4, the average variance of t_i*[j] depends on t_i[j] and hence on the distribution of each attribute. However, the distribution of each attribute is different. To obtain a definite value, we must assume that all attributes have the same distribution. Finally, in order to be as close to the real values as possible and to facilitate calculation, we assume that all attributes are uniformly distributed on [−1, 1], which gives E[t_i[j]²] = 1/3.

In addition, we assume that each attribute is sampled by the same number of users, so that each sampled attribute receives an expected privacy budget of ε/k.

Thus, substituting E[t_i[j]²] = 1/3 and ε_j = ε/k into Theorem 4, we obtain the average variance of t_i*[j] as a function of ε, d, and k.

We plot the average variance with respect to k (see Figure 3). It is observed that the average variance is roughly minimized at a fixed ratio of ε to k. Thus, to minimize the average variance of t_i*[j], we set the optimal value of k accordingly, rounding down to an integer and clipping it to [1, d].
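Under the uniform-distribution assumption (E[t²] = 1/3) and an even expected budget ε/k per sampled attribute, the average variance can be written down and minimized numerically. The closed form below is our own substitution of these assumptions into Theorem 4, so treat it as an illustrative sketch rather than the paper’s exact expression.

```python
import math

def avg_var(eps, d, k):
    """Average variance of a perturbed attribute, assuming uniformly
    distributed attributes (E[t^2] = 1/3) and a budget of eps/k per
    sampled attribute (our reading of Theorem 4)."""
    e = math.exp(eps / (2 * k))
    pm_var = (1 / 3) / (e - 1) + (e + 3) / (3 * (e - 1) ** 2)
    return (d / k) * pm_var + (d / k - 1) / 3

def optimal_k(eps, d):
    """Pick the sampling dimension that minimizes the average variance."""
    return min(range(1, d + 1), key=lambda k: avg_var(eps, d, k))
```

A numeric argmin like this reproduces the shape of Figure 3: the variance is minimized at an intermediate k rather than at either extreme.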

4.2.4. Estimating Mean

After the server aggregates the perturbed data of all users, it can estimate the mean of each attribute. Suppose there are n users and the jth attribute of the perturbed data of the ith user is represented as t_i*[j]. Then, the server averages the collected values of attribute j to estimate its mean. Note that when t_i*[j] does not exist (i.e., attribute j was not sampled by user i), it is skipped.
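The server-side estimate can be sketched as follows, under one consistent reading of the above: a user’s unsampled attributes contribute nothing to the sum, the denominator remains the total number of users n, and the d/k scaling already applied on the client keeps the estimator unbiased. The dictionary-based report format and the function name are our assumptions.

```python
def estimate_means(reports, d, n):
    """Estimate the mean of each of the d attributes from n users' reports.
    Each report maps a sampled attribute index to its perturbed, scaled value."""
    sums = [0.0] * d
    for rep in reports:
        for j, v in rep.items():
            sums[j] += v
    return [s / n for s in sums]
```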

Theorem 5 ensures the accuracy of the estimated mean, which is asymptotically optimal [9].

Theorem 5. For any j ∈ {1, …, d}, let X_j = (1/n)·Σ_{i=1}^{n} t_i*[j] and m_j = (1/n)·Σ_{i=1}^{n} t_i[j]. With at least 1 − β probability, max_j |X_j − m_j| = O(√(d·log(d/β))/(ε·√n)).

The proof appears in Appendix.

5. Evaluation

We implemented the proposed solution and compared it with two existing solutions (Duchi’s solution [9] and MPM solution [12]) on two real datasets.

5.1. Datasets

We conducted experiments on two real datasets with the parameters shown in Table 2. Census 2015 [22] and Census 2017 [22] include census data for all counties in 2015 and 2017, respectively. They contain statistics for each county’s citizens, such as “TotalPop,” “Men,” and “White.” To simplify the experiment, we removed the categorical attributes and the numerical attributes that lacked data.

5.2. Evaluation Metrics

As in much previous work, we evaluate the performance of a solution using the mean square error (MSE), defined as MSE = (1/d)·Σ_{j=1}^{d} (X_j − m_j)², where d is the number of attributes, X_j is the estimated mean of the jth attribute, and m_j is the true mean of the jth attribute.

In addition, since the estimated mean is a random variable, to make the evaluation more accurate, we repeated the MSE calculation 100 times and took the average as the final result.
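The metric and the repeated-trials averaging can be sketched as follows; the helper names are ours.

```python
def mse(est, true):
    """Mean square error over the d attributes."""
    return sum((x - m) ** 2 for x, m in zip(est, true)) / len(true)

def avg_mse(run_once, true, trials=100):
    """Average the MSE of a randomized estimator over repeated trials."""
    return sum(mse(run_once(), true) for _ in range(trials)) / trials
```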

5.3. Personalization Parameter

Different values of τ determine different privacy budget allocation ranges, and different privacy budgets yield different probability ratios in the differential privacy guarantee. By setting the value of τ, we control the maximum and minimum of the privacy budget allocation and can thus control the maximum difference in probability ratio between two attributes (i.e., the degree of personalization). To show this more intuitively, we set τ to 1.125, 1.25, 1.375, and 1.5 and observe the difference. Figure 4 shows the maximum difference in probability ratio between two attributes under the different personalization parameters τ.

Through experimentation, we can set the value of τ according to the degree of personalization we need. However, the value of τ should not be too large; otherwise, a too small privacy budget will lead to large errors.

5.4. Comparison with Existing Solutions under Different Privacy Budgets

When the privacy budget is greater than or equal to 7.143, k can be greater than 2, and a personalized privacy budget allocation exists. Therefore, to evaluate PMPM more accurately, we set privacy budgets ranging from 8 to 14.

We compare our solution with MPM [12] solution on Census 2015 and Census 2017 under different privacy budgets, and the results are shown in Figures 5 and 6.

The experimental results show that the MSE of our solution is always lower than that of the MPM solution when τ is taken as 1.125, 1.25, and 1.375. When τ is taken as 1.5, the MSE of our solution is sometimes higher than that of the MPM solution. In addition, both our solution and the MPM solution experience a rebound in MSE as the privacy budget increases.

Through our analysis, we believe that the reasons for the above-mentioned gaps are as follows:
(1) Our k value is better. In our solution, the value of k is obtained by minimizing the average variance, whereas in the MPM solution it is obtained by minimizing the worst-case variance. In practice, the worst-case scenario is rare. Therefore, our k value is better, which leads to a smaller overall mean square error.
(2) An increase in the value of τ leads to an increase in MSE. As τ increases, the range of the personalized privacy budget allocation expands, which allows smaller privacy budgets to appear. The smaller the privacy budget, the larger the error; hence our solution sometimes has a higher MSE.
(3) Rounding down. In determining the value of k, we round down, which biases the resulting k value. Our k value may be the same under different privacy budgets, and when it is too small, the MSE rebounds.

Then, we compare our solution with Duchi’s solution [9] on Census 2015 and Census 2017 under different privacy budgets, and the results are shown in Figures 7 and 8.

The experimental results show that the MSE of our solution is much lower than that of Duchi’s solution under different privacy budgets.

Through our analysis, we believe that the reason for the above-mentioned gap is as follows:
(1) Random sampling. Our solution reduces the dimensionality of high-dimensional data by random sampling, while Duchi’s solution does not process high-dimensional data. Due to the curse of dimensionality, the mean square error of our solution is much lower than that of Duchi’s solution.

5.5. Comparison with Existing Solutions under Different Dimensions

By removing some attributes, we set the two datasets to different dimensions. In addition, for better comparison, we set the privacy budget to 10.

We compare our solution with MPM [12] solution on Census 2015 and Census 2017 under different dimensions, and the results are shown in Figures 9 and 10.

The experimental results show that the MSE of our solution is lower than that of the MPM solution under different dimensions. In addition, a larger value of k does not necessarily result in a larger MSE for our solution.

Through our analysis, we believe that the reasons for the above-mentioned gaps are as follows:
(1) Our k value is better. Therefore, the MSE of our solution is lower than that of the MPM solution in different dimensions.
(2) Privacy budgets are randomly allocated. The larger the value of τ, the larger the range of the privacy budget allocation. However, since we randomly allocated the privacy budget within the range during the experiment, a large τ value does not necessarily yield a large MSE.

Then, we compared our solution with Duchi’s solution [9] on Census 2015 and Census 2017 under different dimensions, and the results are shown in Figures 11 and 12.

The experimental results show that the MSE of our solution is much lower than that of Duchi’s solution under different dimensions. In addition, the MSE of Duchi’s solution increases as the dimensionality increases.

Through our analysis, we believe that the reasons for the above-mentioned gaps are as follows:
(1) Random sampling. Our solution handles high-dimensional data by random sampling, so its mean square error is lower.
(2) Curse of dimensionality. Duchi’s solution does not effectively handle high-dimensional data, and thus its mean square error increases rapidly as the dimensionality increases.

6. Conclusions

In this paper, we first designed a personalized privacy budget allocation within a certain range and proposed a new privacy standard, personalized local differential privacy (PLDP). Then, we optimized the sampling dimension of the existing solution MPM by minimizing the average variance. Finally, we designed a personalized multidimensional piecewise mechanism (PMPM) based on the above research. In addition, we validated the superiority of our solution by comparing it with existing solutions on two real datasets.

Overall, our solution has a smaller mean square error while meeting the personalized privacy needs of users. However, our solution is only applicable to collecting multidimensional numerical data and estimating the mean; we have not yet extended it effectively to categorical data. On the other hand, to be closer to the true distribution, we obtain the value of k by assuming that all attributes are uniformly distributed. Better results could be achieved if the true distribution could be used.

Appendix

Proofs of Theorems

Theorem A.1. (Sequential composition of LDP [21]). Suppose a randomized mechanism M consists of d independent randomized mechanisms M1, …, Md, where each Mi satisfies ϵi-LDP. Then M satisfies (ϵ1 + ϵ2 + ⋯ + ϵd)-LDP.

Proof. For any pair of inputs x and x′ and any output y = (y1, …, yd), by Definition 2, we have Pr[M(x) = y] / Pr[M(x′) = y] = Π_{i=1}^{d} (Pr[Mi(x) = yi] / Pr[Mi(x′) = yi]) ≤ Π_{i=1}^{d} e^(ϵi) = e^(ϵ1 + ⋯ + ϵd). This completes the proof.

Theorem A.2. PMPM satisfies ϵ-PLDP.

Proof. From Algorithm 3, PMPM composes k instances of PM.
Since each instance of PM satisfies ε_j-LDP for its sampled attribute j, and the sum of the k privacy budgets is ϵ, then by Theorem 1, PMPM satisfies ϵ-LDP.
By Algorithm 3, PMPM satisfies the personalized privacy budget allocation.
By Definitions 2 and 3, PMPM satisfies ϵ-PLDP.
This completes the proof.

Theorem A.3. Given an input t_i, PMPM returns an output t_i*, and for any sampled value j, E[t_i*[j]] = t_i[j].

Proof. By Algorithm 3, t_i*[j] is equal to (d/k)·PM(t_i[j], ε_j) with probability k/d, and equal to 0 when attribute j is unsampled.
Since, in PM, E[PM(t_i[j], ε_j)] = t_i[j], we have E[t_i*[j]] = (k/d)·(d/k)·t_i[j] + (1 − k/d)·0 = t_i[j]. This completes the proof.

Theorem A.4. Given an input t_i, PMPM returns an output t_i*, and for any sampled value j, Var[t_i*[j]] = (d·t_i[j]²)/(k(e^(ε_j/2) − 1)) + (d(e^(ε_j/2) + 3))/(3k(e^(ε_j/2) − 1)²) + (d/k − 1)·t_i[j]².

Proof. By Algorithm 3, t_i*[j] is equal to (d/k)·PM(t_i[j], ε_j) with probability k/d, and equal to 0 when attribute j is unsampled.
Since E[PM(t_i[j], ε_j)²] = Var[PM(t_i[j], ε_j)] + t_i[j]², then by Theorem 3, we have Var[t_i*[j]] = E[t_i*[j]²] − t_i[j]² = (k/d)·(d/k)²·E[PM(t_i[j], ε_j)²] − t_i[j]² = (d/k)·Var[PM(t_i[j], ε_j)] + (d/k − 1)·t_i[j]². Substituting the variance of PM completes the proof.

Theorem A.5. For any j ∈ {1, …, d}, let X_j = (1/n)·Σ_{i=1}^{n} t_i*[j] and m_j = (1/n)·Σ_{i=1}^{n} t_i[j]. With at least 1 − β probability, max_j |X_j − m_j| = O(√(d·log(d/β))/(ε·√n)).

Proof. For any j, by Theorem 3, the random variable t_i*[j] − t_i[j] has zero mean. In addition, |t_i*[j] − t_i[j]| = O(d/ε). Then, by Bernstein’s inequality, Pr[|X_j − m_j| ≥ λ] ≤ 2·exp(−nλ² / (2σ² + (2/3)·O(d/ε)·λ)), where σ² = (1/n)·Σ_{i=1}^{n} Var[t_i*[j]]. By Theorem 4, σ² = O(d/ε²). Applying these bounds and taking the union bound over the d attributes, there exists λ = O(√(d·log(d/β))/(ε·√n)) such that max_j |X_j − m_j| < λ holds with probability at least 1 − β.
This completes the proof.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant nos. 62062020, 62002081, and 62002080, and by the Project Funded by China Postdoctoral Science Foundation under Grant no. 2019M663907XB.