Abstract

Local differential privacy (LDP) is a promising privacy-preserving technology from the users’ perspective, as users perturb their private information locally before reporting it to the aggregator. We study the problem of collecting heterogeneous data, that is, key-value pairs, under LDP, which arises widely in real-world applications. Although previous LDP work on key-value data collection achieves good utility on frequency estimation of keys and distribution estimation of values, it has three shortcomings: (1) existing work perturbs numerical values in a discrete manner that does not exploit the ordinal nature of the numerical domain and leads to poor accuracy, (2) it does not provide an improved privacy budget composition and consumes more privacy budget than necessary to achieve the given privacy level, and (3) the frequency estimation of keys is not the most accurate due to the lack of a consistency requirement. In this paper, we propose a novel mechanism to collect key-value data under LDP that leverages the numerical nature of the domain and yields better utility. Due to our correlated perturbation, the mechanism consumes less privacy budget than previous work while maintaining the same privacy level. We also adopt consistency as a postprocessing step, applied to the estimated key frequencies to further improve accuracy. Comprehensive experiments demonstrate that our approach consistently outperforms state-of-the-art mechanisms under the same LDP guarantee.

1. Introduction

Differential privacy (DP) [1] is the state-of-the-art technology for private data release, providing provable and measurable privacy protection regardless of the adversary’s background knowledge. Different from DP in the centralized setting, which protects data after data collection, local differential privacy (LDP) has been proposed to protect data during data collection. In LDP, the server is assumed to be untrusted. Each user locally obfuscates his/her personal data using the LDP mechanism before uploading. After receiving the perturbed data from all users, the server performs data analytics or answers queries. LDP enables collecting statistics from users under a privacy guarantee and has been widely deployed in practice. For example, Apple deploys an LDP mechanism in iOS to identify heavy hitters among emojis while preserving user privacy [2]; RAPPOR has been deployed in Google’s Chrome browser to collect and analyze users’ web browsing behavior under LDP [3].

Early LDP work mainly focuses on simple statistical queries such as frequency estimation on categorical data [4] and mean estimation [5, 6] on numerical data. Nowadays, LDP is also applied to hybrid data types or queries [7–9], for example, key-value data, which contains categorical and numerical data simultaneously and is widely used in practice. The following examples show potential applications of key-value data:
(i) Product rating analysis: online market platforms such as Amazon and eBay collect users’ ratings for the products they bought and show the ratings online as a reference for other buyers. These rating data are usually in the form of key-value pairs, where the key is the product and the value is the rating.
(ii) Software usage analysis: software developers and providers such as Microsoft need to collect the usage time of each application to analyze users’ preferences. These data are usually in the form of key-value pairs, where the key is the software identifier and the value is its usage time.

The works [7, 8] were the first to study the problem of collecting key-value data under LDP. They design mechanisms to support two estimation tasks: (1) frequency estimation of keys and (2) mean estimation of values. Recently, Ye et al. [9] extended this line of work and proposed PrivKVM, the first mechanism to estimate both the frequency of keys and the distribution of values. It discretizes the domain into many bins and reports which bin contains the private value using a categorical frequency oracle (e.g., GRR [10]) while considering the correlation between keys and values. However, it has three main limitations. First, it perturbs the numerical value in a discrete manner, which does not work well because it ignores the ordinal information of the numerical domain, that is, a perturbed report that is close to the true value also carries useful information for distribution estimation. Second, although the mechanism considers the correlation between keys and values, this does not lead to an improved budget composition. Third, it does not enforce consistency on the estimated key frequencies. That is, the estimates should be consistent with the basic properties of frequencies: (1) each frequency should be non-negative and (2) all estimated frequencies should sum to 1. Without enforcing this consistency requirement, the mechanism may not produce the most accurate key frequency estimates [11].

Motivated by this, in this paper, we study the problem of collecting key-value data under LDP and propose a mechanism that addresses the above three limitations: existing mechanisms (1) do not exploit the ordinal information of the numerical domain, (2) do not enforce consistency to achieve the best accuracy, and (3) do not provide an improved budget composition. Our mechanism aims to collect the two most fundamental statistics of key-value pairs: key frequency and value distribution. It consists of three steps: (1) padding and sampling, (2) perturbation, and (3) aggregating and estimating key frequency and value distribution.

In step 1, each user pads his key-value data with dummy pairs to a fixed length (making the sampling rate identical for all users) and samples one key-value pair. The reason for sampling is that each user may possess multiple key-value pairs; if all pairs were reported to the server, each pair would receive only a fraction of the privacy budget, resulting in large noise per pair and poor utility.

In step 2, we address the first and third limitations. Each user perturbs the sampled key-value pair in a correlated manner because there is an inherent correlation between keys and values, and reporting the value may also reveal information about the presence of the corresponding key [8]. Thus, we first perturb the key and then perturb the value according to the perturbation result of the key. If a possessed key remains possessed after perturbation, we report the value via an LDP mechanism that utilizes the ordered nature of the domain and directly perturbs the value in the numerical domain, which addresses the first limitation by leveraging the ordinal information of the numerical domain. However, a challenge arises when a non-possessed key is perturbed as possessed. By the LDP definition, the perturbed values of dummy keys and those of genuine keys must be indistinguishable, so the perturbed values of dummy keys would affect the distribution estimation. To address this problem, we generate fake values for these keys in a way that satisfies LDP. Previous work selects fake values uniformly at random from the discrete output domain. However, such fake value generation does not work for our mechanism: our output domain is a continuous numerical interval, and discrete fake values would violate the privacy guarantee since the probability ratio for outputs other than those discrete values is unbounded. Therefore, we design a new fake value generation method for our mechanism.

As a by-product of the fake value generation, we show that our correlated perturbation has a privacy amplification effect: it consumes less privacy budget overall than the sum of the budgets for key and value perturbation. This addresses the third limitation by providing a tighter budget composition, which achieves a better privacy-utility trade-off than the basic sequential composition (used in PrivKVM).

In step 3, we address the second limitation. The server collects the perturbed results from all users and estimates the key frequencies and the value distribution. For the frequency estimation, the server enforces consistency requirements to improve accuracy. Since the fake values generated in the perturbation step affect the distribution, removing their influence is another challenge. Consistency is not designed for this challenge: it only requires that the estimated frequencies sum to 1 and are non-negative, but it cannot detect and remove fake values. To address this problem, we design a method to statistically remove the fake values in the distribution estimation.

Our main contributions are summarized as follows:
(1) Novel LDP mechanism for key-value data collection: we propose a mechanism that supports frequency estimation and distribution estimation over key-value data. It takes advantage of the numerical nature of the data domain and achieves better accuracy than existing solutions.
(2) Improved privacy budget composition: we show that the privacy budget composition of our correlated perturbation mechanism has a tighter bound than sequential composition, which provides a privacy amplification effect and achieves a better privacy-utility trade-off.
(3) Consistency as postprocessing to improve accuracy: we enforce consistency as postprocessing for key frequency estimation in our mechanism, which further improves accuracy over existing LDP mechanisms for key-value data collection.
(4) Comprehensive evaluation: we implement the mechanism and evaluate it on real-world data sets. The results show that our mechanism outperforms existing LDP schemes. In particular, it significantly improves accuracy, reducing the error of current mechanisms by about an order of magnitude in most cases, especially when ε is small (large noise).

2. Preliminary

2.1. Local Differential Privacy

In the centralized setting of differential privacy, a trusted server or data aggregator holds all users’ personal data and is responsible for responding to queries while using DP mechanisms to protect user privacy. However, the assumption of a trusted server may not hold in practice. Local differential privacy addresses this problem. In the local setting, each user perturbs his personal data and then uploads the perturbed result to the server for data analysis. In this way, the server can be untrusted because it never accesses the original data.

Definition 1 (local differential privacy (LDP) [12]). A randomized algorithm A satisfies ε-LDP if and only if, for any two inputs x and x′, the probability ratio of outputting the same result is bounded by e^ε. Formally, for any output y,

Pr[A(x) = y] ≤ e^ε · Pr[A(x′) = y].

By Definition 1, given any output y, the adversary cannot infer with high confidence whether the original input is x or x′. Here, the confidence is controlled by the parameter ε (called the privacy budget). The smaller the ε, the closer the probability Pr[A(x) = y] is to the probability Pr[A(x′) = y]. That is, the mechanism provides stronger privacy protection, since the adversary has lower confidence in distinguishing whether the original input is x or x′.
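As a concrete illustration of Definition 1 (our own sketch, not part of the paper's mechanism), binary randomized response reports the true bit with probability e^ε/(e^ε + 1) and the flipped bit otherwise; its worst-case probability ratio is then exactly e^ε. Function names are hypothetical.

```python
import math
import random

def randomized_response(x, eps):
    """Binary randomized response: report the true bit x with probability
    e^eps / (e^eps + 1), otherwise report the flipped bit."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    return x if random.random() < p else 1 - x

def ratio_bound(eps):
    """Worst-case ratio Pr[y | x=0] / Pr[y | x=1] over all outputs y,
    which equals p / (1 - p) = e^eps for binary randomized response."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    return p / (1 - p)
```

For example, `ratio_bound(1.0)` evaluates to e, confirming that the mechanism satisfies 1-LDP.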
When multiple LDP mechanisms are combined together to generate a new mechanism, the sequential composition theorem guarantees the total privacy of the new mechanism.

Theorem 1 (sequential composition theorem [12]). If a sequence of mechanisms A₁, A₂, …, Aₘ satisfies ε₁-LDP, ε₂-LDP, …, εₘ-LDP, respectively, then their sequential composition satisfies (ε₁ + ε₂ + ⋯ + εₘ)-LDP.

By Theorem 1, given a total privacy budget ε and a computation task, we can split the task into multiple parts and allocate each part a portion εᵢ of the privacy budget with Σᵢ εᵢ = ε, so that the whole task achieves ε-LDP.

Theorem 2 (postprocessing [12]). If a mechanism A satisfies ε-LDP, then for any function f that cannot access the original data, the composition f(A(·)) also satisfies ε-LDP.

By Theorem 2, the postprocessing does not violate the privacy guarantee of LDP mechanisms. In this paper, we use the postprocessing method to further improve the utility of our mechanism.

2.2. Basic LDP Mechanisms
2.2.1. Unary Encoding (UE)

Unary encoding first encodes an input v into a binary vector in which only the v-th position is 1 and all others are 0 [10]; the length of the vector equals the domain size of the input. It then perturbs each bit independently as follows:

Pr[y[i] = 1] = p if x[i] = 1, and Pr[y[i] = 1] = q if x[i] = 0.

As shown in [10], unary encoding provides ε-LDP, where ε = ln(p(1 − q) / ((1 − p)q)).

After receiving the perturbed results from all users, the server or aggregator can estimate the frequency of users who possess the i-th item, that is, whose input value is i. Denote the estimated frequency of the i-th item by f̂ᵢ; the aggregator can compute it with the following unbiased estimator:

f̂ᵢ = ((1/n) Σⱼ 1{yⱼ[i] = 1} − q) / (p − q),

where 1{·} is the indicator function and yⱼ[i] denotes the i-th bit of the vector reported by user j.
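The encode/perturb/estimate pipeline above can be sketched as follows. This is a minimal illustration with hypothetical function names, parameterized by generic probabilities p and q.

```python
import random

def ue_encode(v, d):
    """Encode input v in {0, ..., d-1} as a length-d one-hot binary vector."""
    vec = [0] * d
    vec[v] = 1
    return vec

def ue_perturb(vec, p, q):
    """Flip each bit independently: keep a 1 with probability p,
    turn a 0 into a 1 with probability q."""
    return [1 if random.random() < (p if b == 1 else q) else 0 for b in vec]

def ue_estimate(reports, d, p, q):
    """Unbiased frequency estimator: f_i = (C_i/n - q) / (p - q),
    where C_i counts reports whose i-th bit is 1."""
    n = len(reports)
    return [(sum(r[i] for r in reports) / n - q) / (p - q) for i in range(d)]
```

With p = 1 and q = 0 (no noise), the estimator recovers the exact empirical frequencies, which is a convenient sanity check.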

2.2.2. Square Wave Mechanism (SW Mechanism)

The SW mechanism is designed for numerical distribution estimation under LDP [13]. The intuition behind this mechanism is to increase the probability that a noisy reported value carries useful information about the input. For a numerical domain, a noisy reported value that is different from but close to the true value also contains useful information for distribution estimation. Therefore, given an input v, the SW mechanism reports values closer to v with a higher probability than values farther away from v. Formally, the SW mechanism assumes the input domain is [0, 1] (any bounded value can be linearly transformed into this domain) and the output domain is [−b, 1 + b]; it then perturbs the input v with the following probability densities:

Pr[SW(v) = y] = p if |y − v| ≤ b, and q otherwise.

As shown in [13], the densities p and q are set to

p = e^ε / (2b·e^ε + 1), q = 1 / (2b·e^ε + 1),

which are derived by maximizing the difference between p and q while keeping the total probability equal to 1. The parameter b is set to

b = (ε·e^ε − e^ε + 1) / (2e^ε(e^ε − 1 − ε)),

which is obtained by maximizing the mutual information between the input and output of the SW mechanism.
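A minimal sketch of the SW perturbation described above, using the parameter settings from [13]. The function names are ours and the implementation is illustrative, not the authors' code; the outside-window sampling exploits the fact that the region outside [v − b, v + b] always has total length 1.

```python
import math
import random

def sw_params(eps):
    """Square Wave parameters: window half-width b and densities
    p (inside [v-b, v+b]) and q (elsewhere in [-b, 1+b])."""
    ee = math.exp(eps)
    b = (eps * ee - ee + 1) / (2 * ee * (ee - 1 - eps))
    q = 1.0 / (2 * b * ee + 1)   # density outside the window
    p = ee * q                   # density inside the window; p/q = e^eps
    return b, p, q

def sw_perturb(v, eps):
    """Report a value in [-b, 1+b]; values within b of v are e^eps
    times more likely than values farther away."""
    b, p, q = sw_params(eps)
    if random.random() < 2 * b * p:          # land inside the window
        return v - b + 2 * b * random.random()
    u = random.random()                      # outside mass has total length 1
    return u - b if u < v else u + b         # left of or right of the window
```

Note that 2b·p + q = 1 by construction, so the output density integrates to 1 over [−b, 1 + b].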

Since the output domain and the input domain are different, the server/aggregator uses the expectation-maximization (EM) algorithm to reconstruct the distribution after receiving the noisy reported values.
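The EM-based reconstruction can be sketched as follows, assuming the input domain [0, 1] is discretized into bins and the transition probabilities are computed from the SW densities; the bin counts, function names, and iteration budget are our own illustrative choices, not the settings in [13].

```python
import math

def sw_transition_matrix(eps, m_in=32, m_out=48):
    """Build M[j][i] = Pr[output falls in out-bin j | input is the center
    of in-bin i] for the SW mechanism with half-width b, densities p and q."""
    ee = math.exp(eps)
    b = (eps * ee - ee + 1) / (2 * ee * (ee - 1 - eps))
    q = 1.0 / (2 * b * ee + 1)
    p = ee * q
    out_w = (1 + 2 * b) / m_out               # width of each output bin
    M = [[0.0] * m_in for _ in range(m_out)]
    for i in range(m_in):
        v = (i + 0.5) / m_in                  # center of input bin i
        for j in range(m_out):
            lo = -b + j * out_w               # output bin j = [lo, lo + out_w)
            hi = lo + out_w
            near = max(0.0, min(hi, v + b) - max(lo, v - b))  # overlap with window
            M[j][i] = p * near + q * (out_w - near)
    return M

def em_estimate(M, counts, iters=200):
    """EM iterations: recover the input-bin distribution theta from
    observed output-bin counts."""
    m_in, m_out = len(M[0]), len(M)
    theta = [1.0 / m_in] * m_in               # start from the uniform distribution
    for _ in range(iters):
        new = [0.0] * m_in
        for j in range(m_out):
            if counts[j] == 0:
                continue
            denom = sum(M[j][i] * theta[i] for i in range(m_in))
            for i in range(m_in):
                new[i] += counts[j] * M[j][i] * theta[i] / denom
        total = sum(new)
        theta = [x / total for x in new]      # renormalize
    return theta
```

Each column of M sums to 1 (it is a conditional distribution over output bins), and each EM iteration keeps theta a valid probability vector.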

3. Key-Value Data Collection under LDP

3.1. Problem Statement
3.1.1. System Model

There are n users and one server in our system model, where each user possesses one or multiple key-value pairs ⟨k, v⟩. The domain of the key is assumed to be {1, 2, …, d}, while the domain of the value is assumed to be [0, 1] (any bounded value can be linearly transformed into this domain). Besides, we assume user u possesses the set of key-value pairs S_u. The goal of the server is to collect the key-value data from all users and then estimate (1) the frequency and (2) the value distribution of a given key. In other words, the server calculates the fraction of users who have a certain key and the value distribution of that key among those who have it.

3.1.2. Threat Model

We assume the server is untrusted, and a data breach might occur as a result of unauthorized data publishing or hacking. The adversary is considered to have access to all users’ output and to be aware of the perturbation algorithm in the mechanism locally established on the user side. Furthermore, we assume that all users will honestly follow the perturbation mechanism.

3.2. PrivKVM

To the best of our knowledge, PrivKVM [9] is the state-of-the-art LDP framework for key-value data collection that can support frequency estimation of key and distribution estimation of value. We first briefly describe the mechanism and then summarize the main differences between our work and PrivKVM.

3.2.1. Workflow of PrivKVM

PrivKVM collects the key-value data in two phases. In the first phase, each user samples one key-value pair uniformly at random from the full domain of the key and then perturbs it in a correlated manner. Specifically, it first perturbs the key and then perturbs the value according to the perturbed result of the key. There are four cases of perturbation:
(1) The sampled key is possessed by the user, and the key is perturbed as possessed. The user perturbs the value using a technique called GVPP, a categorical frequency oracle with boundaries. That is, the user first discretizes the numerical domain into many bins and then rounds the value to a boundary of the bin containing it with a specific probability. The user then reports which boundary holds the private value using a categorical frequency oracle, for example, GRR [10].
(2) The sampled key is possessed by the user, and the key is perturbed as non-possessed. The existing key disappears after perturbation, and the user simply sets the value to 0.
(3) The sampled key is not possessed by the user, and the key is perturbed as possessed. In this case, a “fake” key appears, and the user samples a mean uniformly at random from the current means of all bins as the value.
(4) The sampled key is not possessed by the user, and the key is perturbed as non-possessed. The user simply sets the value to 0.

After the perturbation, each user reports the obfuscated result, and the server estimates the statistical information: (1) the frequency of all keys, (2) the mean of the values, and (3) the distribution of the values. To obtain a more accurate mean estimation, the server leverages a virtual iteration technique and further calibrates the estimated mean. In the second phase, the server broadcasts to all users the heavy hitters (the keys with frequencies above a given threshold) and their corresponding means estimated in the first phase; users and the server then repeat the steps of the first phase to obtain the statistics, except that the statistics are computed over the set of heavy hitters instead of all keys. For the remaining keys, the server averages their statistics to reduce the noise effect, since they are non-heavy hitters and the number of samples is insufficient.

3.2.2. Limitations of PrivKVM

We summarize the limitations of PrivKVM as follows:
(1) PrivKVM estimates the value distribution using a categorical frequency oracle with boundaries. In this way, the server can estimate the count of values falling in each bin and obtain the density distribution over the domain. However, values in a numerical domain have a meaningful total order, and this method discards that information through discretization. Worse, it faces the challenge of finding the optimal number of bins. Binning causes two sources of error: (1) LDP noise and (2) bias due to grouping values together. More bins lead to a larger error due to LDP noise, and fewer bins result in a greater error because of bias. Unfortunately, finding the optimal number of bins is a non-trivial task, since the effect of the bin size depends on both ε and the actual data distribution, which is unknown to the server [13].
(2) PrivKVM does not consider the consistency problem in the frequency estimation of keys. That is, the estimated frequencies may not satisfy the basic requirements of frequencies: (1) every frequency should be non-negative and (2) all frequencies should sum to 1. Thus, the estimated frequencies are not the most accurate.
(3) Although PrivKVM perturbs the key and value in a correlated manner, this does not lead to an improved budget composition for LDP.

In what follows, we elaborate on our mechanism. Some important notations are summarized in Table 1.

4. Proposed Method

The overview of the proposed method is shown in Figure 1. The idea of our mechanism is as follows. Each user first samples one key-value pair from his personal data (the sampling protocol will be discussed in Section 4.1); then each user privately perturbs the sampled key-value pair by our LDP mechanism (the mechanism will be discussed in Section 4.2). After receiving the reported results, the server aggregates the perturbed data and estimates the key frequency and value distribution, which will be shown in Section 4.3.

4.1. Sampling Protocol

In this subsection, we explain why we need to sample before reporting perturbed data and elaborate on our sampling protocol.

4.1.1. Why Sampling Protocol

In practice, each user may have multiple key-value pairs. If a user perturbed all his key-value pairs, each pair would consume part of the privacy budget, and the LDP mechanism would have to split the total privacy budget among them. Thus, the noise added to each pair would be too large. To solve this problem, a promising method is to sample and submit one pair, which avoids splitting the privacy budget and improves utility.

4.1.2. Our Sampling Protocol

Sampling protocols are widely used in existing LDP mechanisms for key-value data perturbation [7–9]. However, they either do not support distribution estimation or do not work well for large domain sizes. In particular, PrivKVM [9] samples from the full domain in the first phase to identify heavy hitters, which performs poorly when the domain is very large and each user possesses only a small number of keys, since users rarely report information about the keys they actually possess. Therefore, we adopt the padding-and-sampling protocol [8, 14] for key-value data to support frequency and distribution estimation. The advantage of the padding-and-sampling protocol is that it samples from the set of keys a user possesses rather than from the full domain, and thus it handles large domains better.

Our sampling protocol works as follows. First, each user generates dummy key-value pairs whose keys come from the dummy domain {d + 1, …, d + ℓ} and whose values are zero. A user u with |S_u| < ℓ adds ℓ − |S_u| distinct random dummy key-value pairs to S_u to bring its length to ℓ. Without padding, it would be difficult to determine the probability that a pair is sampled, resulting in inaccurate estimation. After padding, the key domain of the padded data is {1, …, d′}, where d′ = d + ℓ. Each user then samples one pair from the padded data to perturb and upload. Although some pairs may remain unsampled, this only happens for infrequent pairs, and the useful information is still reported with high probability. The following example illustrates the sampling process, and the details are shown in Algorithm 1.

Input: The set of key-value pairs S, padding length ℓ.
Output: Sampled key-value pair ⟨k, v⟩, where k ∈ {1, …, d + ℓ} and v ∈ [0, 1].
(1) Generate ℓ dummy key-value pairs, whose keys are d + 1, …, d + ℓ and values are 0.
(2) if |S| < ℓ then
(3)  Add ℓ − |S| random distinct dummy pairs to S.
(4) end if
(5) Pick one key-value pair ⟨k, v⟩ uniformly at random from the padded S.
(6) return Sampled key-value pair ⟨k, v⟩.
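Algorithm 1 can be sketched in Python as follows; the dummy-key domain {d + 1, …, d + ℓ} follows the protocol description above, and the function name is ours.

```python
import random

def pad_and_sample(pairs, d, ell):
    """Pad a user's key-value set to length ell with dummy pairs whose keys
    come from the dummy domain {d+1, ..., d+ell}, then sample one pair."""
    padded = list(pairs)
    if len(padded) < ell:
        # add ell - |S| distinct dummy keys, each carrying value 0
        dummy_keys = random.sample(range(d + 1, d + ell + 1), ell - len(padded))
        padded += [(k, 0.0) for k in dummy_keys]
    return random.choice(padded)   # uniform sample from the padded set
```

Since every padded set has length at least ell, each genuine pair is sampled with probability at most 1/ell, which is the sampling rate the estimator later corrects for.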
4.1.3. Example

Suppose the padding length ℓ = 3 and the key domain is {1, …, d}. The domain of the dummy keys is then {d + 1, d + 2, d + 3}, and the domain of the padded key-value pairs becomes {1, …, d + 3}. For a user who possesses two key-value pairs, since he has two pairs, he first pads his data with one randomly chosen dummy pair. He then picks one pair uniformly at random from the padded result to perturb and upload.

We note that the previous mechanism PCKV [8] also adopts a padding-and-sampling protocol for key-value data collection under LDP. We emphasize that there is a difference between their protocol and ours. PCKV only supports mean estimation of values; thus, it discretizes each value into 1 or −1 with a particular probability to guarantee unbiasedness of the mean estimate. Our sampling protocol does not adopt this discretization step because we want to estimate the value distribution, and discretized values lose the numerical information and would distort the distribution estimation.

According to the literature [8, 14], the padding length ℓ causes two types of error: (1) variance between true values and estimated results and (2) bias between true values and estimated results. A smaller ℓ underestimates the key frequencies and results in a large bias, while a larger ℓ enlarges the noise in the estimation, leading to a large variance. Unfortunately, finding the optimal padding length that balances this trade-off is a non-trivial task and remains an open problem [8]. Thus, in this paper, we empirically set a suitable padding length in the experiments when comparing with other LDP mechanisms.

4.2. Perturbation Mechanism

In this subsection, we introduce our perturbation mechanism. By Algorithm 1, each user samples one key-value pair as the input of the perturbation mechanism. The basic idea of our perturbation mechanism is to perturb the value according to the perturbed result of the key. If a non-possessed key is perturbed as possessed or a possessed key is perturbed as non-possessed, we generate a fake value for the key to avoid influencing the distribution estimation of the value. Under this strategy, the mechanism provides a tighter privacy budget composition (see Theorem 3); that is, the total privacy budget of the combined perturbations (key perturbation and value perturbation) is smaller than under sequential composition. Based on this idea and two basic LDP mechanisms (UE and SW), we design an LDP mechanism for key-value data collection that supports numerical distribution estimation. The overall LDP perturbation is shown in Algorithm 2.

Input: The sampled key-value pair ⟨k, v⟩, privacy budgets ε₁ and ε₂.
Output: Perturbed result y.
(1) Encode ⟨k, v⟩ as vector x.
(2) Perturb each key bit x[i].k into y[i].k by (2).
(3) if x[i].k = 1 and y[i].k = 1 then
(4)  Perturb the value v into y[i].v by (3).
(5) end if
(6) if x[i].k = 0 and y[i].k = 1 then
(7)  Generate a fake value y[i].v.
(8) end if
(9) if y[i].k = 0 then
(10)  y[i].v ← 0.
(11) end if
(12) return y.

In the UE mechanism, the original input is encoded as a binary vector in which the bit at the input's position is 1 and all other bits are 0. Similarly, for key-value data, we encode the sampled key-value pair ⟨k, v⟩ as a vector x in which the k-th element (corresponding to the key k) is ⟨1, v⟩ and every other element is ⟨0, 0⟩. The perturbation then proceeds in two steps. Since each element of the vector has two components (key and value), for brevity we write x[i].k and x[i].v for the key and value of the i-th element of x, and y[i].k and y[i].v for their perturbed counterparts. First, we perturb each key bit as follows:

Pr[y[i].k = 1] = p if x[i].k = 1, and Pr[y[i].k = 1] = q if x[i].k = 0,

where p and q are the key perturbation probabilities determined by ε₁. Given the perturbation result of the key, we then perturb the value. The value perturbation is divided into three cases:
(i) If the key is perturbed from 1 to 1, the corresponding value v is perturbed into v′ by the SW mechanism, such that values closer to v are reported with higher probability.
(ii) If the key is perturbed from 0 to 1, a fake value drawn from the uniform distribution is assigned.
(iii) If the key is perturbed from 0 to 0 or from 1 to 0, the key is reported as non-possessed; thus, we set the perturbed value to 0.
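The correlated perturbation (Algorithm 2) can be sketched as follows. The OUE-style key probabilities (p = 1/2, q = 1/(e^ε₁ + 1)) follow Section 4.2.4; the fake-value generation shown here, the SW mechanism applied to a uniform draw, is one plausible reading of "drawn from the uniform distribution" and is an assumption, as are all function names.

```python
import math
import random

def sw_perturb(v, eps):
    """Square Wave: report within [v-b, v+b] with density p, elsewhere with q."""
    ee = math.exp(eps)
    b = (eps * ee - ee + 1) / (2 * ee * (ee - 1 - eps))
    q = 1.0 / (2 * b * ee + 1)
    p = ee * q
    if random.random() < 2 * b * p:          # land inside the window
        return v - b + 2 * b * random.random()
    u = random.random()                      # outside mass has total length 1
    return u - b if u < v else u + b

def perturb_pair(k, v, d, ell, eps1, eps2):
    """Correlated key-value perturbation (sketch of Algorithm 2).
    Returns a length-(d + ell) list of (key_bit, value) pairs."""
    p1 = 0.5                                 # OUE-style: keep a 1 with prob 1/2
    q1 = 1.0 / (math.exp(eps1) + 1)          # flip a 0 to 1 with prob 1/(e^eps1+1)
    out = []
    for i in range(1, d + ell + 1):
        bit = 1 if i == k else 0
        reported = 1 if random.random() < (p1 if bit else q1) else 0
        if bit and reported:                 # case (i): 1 -> 1, SW-perturb the value
            out.append((1, sw_perturb(v, eps2)))
        elif reported:                       # case (ii): 0 -> 1, assign a fake value
            out.append((1, sw_perturb(random.random(), eps2)))
        else:                                # case (iii): reported as non-possessed
            out.append((0, 0.0))
    return out
```

Because the fake values pass through the same SW channel as genuine values, an observer cannot distinguish a report triggered by a dummy key from one triggered by a genuine key.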

4.2.1. Privacy Analysis of Our Mechanism

In our mechanism, the key is perturbed by the UE mechanism with privacy budget ε₁ (see Section 2), and the value is perturbed by the SW mechanism with privacy budget ε₂. The key perturbation and value perturbation are correlated; that is, the value perturbation depends on the key and on the key's perturbation result. In general, correlated perturbation may leak less privacy than independent perturbation and can have a privacy amplification effect [8]. That is, the total privacy budget ε is less than the sum ε₁ + ε₂. Theorem 3 shows that our mechanism satisfies LDP and has a tighter budget composition than sequential composition.

Theorem 3. Denote the privacy budgets for key perturbation and value perturbation by ε₁ and ε₂, respectively; then our mechanism satisfies ε-LDP, where

Proof. For a key-value set , we denote the key-value pairs by for all , where means the key-value pair . Suppose the sampled key-value pair is ; then we have the perturbed value if the key is drawn from . For the vector , only the -th element is non-zero, and the others are zeros. The probability of outputting a vector is then as follows:
Denote the first term by and the second term by , that is,
Since the second term is the same for different inputs, it cancels out when we calculate the ratio of to ( and are two different key-value sets). Therefore, we first calculate the first term.
According to the perturbation mechanism, we can calculate the numerator as follows:
Based on this result, we obtain
Then, we discuss the upper and lower bounds. Since both the UE mechanism and the SW mechanism have a higher probability of maintaining the input value than of perturbing it into another value, we have
Because is greater than and , we have the following upper and lower bounds:
Then, the probability of outputting given a key-value set is
Similarly, we also have . Thus, the following inequality holds for two different key-value sets and :
The second equality holds because, in the SW mechanism, , and .
It is worth noting that the work [8] also proposed a tighter privacy budget composition for their mechanism. However, our tighter composition differs from that in [8]. Specifically, the composition theorems hold for different LDP problems: the improved composition in [8] holds for estimating key frequencies and value means under LDP, whereas our tighter composition holds for estimating key frequencies and value distributions. Moreover, the perturbations differ between our mechanism and [8]. The work [8] proposed two mechanisms: (1) PCKV-UE and (2) PCKV-GRR. PCKV-UE combines unary encoding (UE) with randomized response, and PCKV-GRR is based on GRR. The components of our mechanism, by contrast, are the UE and square wave (SW) mechanisms. As a result, the privacy budgets compose in different ways: in our mechanism, the budget composes as in equation (9). But, in PCKV-UE and PCKV-GRR, the budget is composed as and , respectively.
Figure 2 shows (1) the basic sequential composition, (2) the tighter composition of our mechanism, and (3) the tighter composition of PCKV (both PCKV-UE and PCKV-GRR). Note that the composition of PCKV-GRR depends on the padding length ℓ, and a larger ℓ yields a tighter budget composition; we therefore compare against PCKV-GRR under varying ℓ. The result shows that the composition of our mechanism is less tight than that of PCKV, even under the minimum ℓ. The intuition behind this result is that our mechanism estimates the value distribution under LDP, which requires more information about the data than PCKV, which only estimates the mean of the value. Thus, PCKV can bound the privacy loss at a tighter level.

4.2.2. Privacy Amplification

Figure 2 also shows the relationship between our composition and the basic sequential composition, which demonstrates the privacy amplification of our mechanism. Compared with sequential composition, where the total privacy budget is ε₁ + ε₂, our mechanism consumes less privacy budget because ε < ε₁ + ε₂. In other words, our mechanism has a privacy amplification effect.

4.2.3. No Privacy Amplification Effect from Sampling Protocol

In Theorem 3, the privacy guarantee is independent of the padding length ℓ, which means our mechanism obtains no privacy amplification from the sampling protocol. The main reason is that our mechanism outputs a vector containing multiple keys, and multiple positions in the vector can be 1. Therefore, even when the sampling protocol is used, the upper bound of the probability ratio in the worst case is independent of the protocol. Here, we give an example that considers only the key perturbation to make this point clearer. Suppose the key domain is , , and consider two key sets and . Note that the output domain is . Then the encoded vector of is or , and that of is or (depending on which key is sampled). Since the probability ratio is , in the worst case we need to maximize and minimize . To this end, we select the output vector because, in our mechanism, is output with the highest probability and with the smallest probability. Therefore, no matter which key is sampled, the probability of outputting is the same, that is, and . In other words, the sampling protocol provides no privacy benefit.

4.2.4. Privacy Budget Allocation

Since our mechanism consists of two steps that use ε₁ and ε₂, respectively, allocating the privacy budget between them is important. A basic and widely used approach is to express the estimation error as a function of (ε₁, ε₂) and then find the optimal ε₁ and ε₂ that minimize it [8]. However, calculating the error (or distance) between the estimated distribution and the true distribution as a function of (ε₁, ε₂) is a non-trivial task [13]. Thus, we use an empirical allocation method in this paper and leave finding the optimal privacy budget allocation for future work.

In particular, given the total privacy budget ε, we first fix ε2. Then, according to Theorem 3, we calculate the ε1 such that the total budget is ε. Specifically, we find the ε1 that makes the dominating term in Theorem 3 equal to ε; the other two terms are then no larger, so the total privacy budget ε is not violated. Given the privacy budget ε1, we set the perturbation probabilities for key perturbation as p = 1/2 and q = 1/(e^ε1 + 1).

We note that this perturbation probability is aligned with the optimized unary encoding (OUE) mechanism [10], which achieves the minimum error of frequency estimation under the same privacy budget.
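For concreteness, the OUE-style key perturbation with these probabilities can be sketched as follows. This is a minimal illustration assuming NumPy; the function name and interface are ours, not from the paper:

```python
import numpy as np

def oue_perturb(bit_vector, eps1, rng=None):
    """Perturb a multi-hot key vector with OUE probabilities.

    A 1-bit is kept as 1 with probability p = 1/2; a 0-bit is
    flipped to 1 with probability q = 1/(e^eps1 + 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    p = 0.5
    q = 1.0 / (np.exp(eps1) + 1.0)
    # Per-position probability of outputting a 1.
    probs = np.where(np.asarray(bit_vector) == 1, p, q)
    return (rng.random(len(bit_vector)) < probs).astype(int)
```

Note that q shrinks as ε1 grows, so under a large budget a 0-bit is rarely reported as 1.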

In our experiments, we observe that even under such suboptimal budget allocation, our mechanism is still better than other mechanisms that consider the optimal privacy budget allocation.

4.3. Aggregation and Estimation

In this subsection, we introduce how to aggregate the perturbed results and estimate the key frequencies and the value distribution. For frequency estimation, an unbiased estimator is proposed in [8, 14]. However, it does not take prior knowledge about the estimated frequencies into account, which reduces the utility. For numerical distribution estimation, the SW mechanism uses the EM algorithm to estimate the distribution. However, due to the fake values in our design (a user reports a uniformly drawn fake value when a key bit is perturbed from 0 to 1), directly applying the EM algorithm would not yield a useful estimation. We use postprocessing methods to address these problems. Note that postprocessing of the output of a DP mechanism does not affect its privacy guarantee [1].

4.3.1. Key Frequency Estimation

After the server receives the perturbed results from all users, it counts the number of 1's that support each key k, denoted as c_k. Then we first use the estimator in [8, 14] to obtain an unbiased frequency estimation of key k. Formally, f̂_k = ℓ · (c_k/n − q)/(p − q), where p and q are the perturbation probabilities and n is the number of users.

Theorem 4. If the padding length ℓ is sufficient to cover the key set of every user, the estimator f̂_k is unbiased, that is, E[f̂_k] = f_k, and the variance is

Proof. The random variable c_k is the sum of n independent random variables, each following a Bernoulli distribution. For users who input the key k for perturbation (accounting for f_k/ℓ of all users), the variable is drawn from Bernoulli(p); for users who do not input the key (accounting for 1 − f_k/ℓ of all users), it is drawn from Bernoulli(q). Taking the expectation and variance of this sum and substituting them into the estimator yields the claimed unbiasedness and variance.
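The unbiased estimator can be implemented in a few lines. This is a sketch assuming the padding-and-sampling protocol and the OUE probabilities above; names are illustrative:

```python
import numpy as np

def estimate_frequencies(counts, n, eps1, ell):
    """Unbiased key frequency estimates from supporting counts.

    counts[k] is the number of reports whose k-th bit is 1,
    n is the number of users, and ell is the padding length.
    """
    p = 0.5
    q = 1.0 / (np.exp(eps1) + 1.0)
    counts = np.asarray(counts, dtype=float)
    # Invert the expected report rate f_k/ell * p + (1 - f_k/ell) * q.
    return ell * (counts / n - q) / (p - q)
```

When the observed support rate equals q the estimate is 0, and when it equals p the estimate is ℓ, matching the two boundary cases of the proof.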

4.3.2. Improve the Utility with Postprocessing

The estimator is only unbiased in theory; the estimates may not be consistent. That is, the estimated frequencies of some keys may be negative, and the frequencies may not sum to 1. Such inconsistency may reduce the utility of LDP mechanisms [11]. Therefore, we further enforce the following consistency requirements on the estimated results to improve the utility:
(1) The estimated frequencies are non-negative.
(2) The sum of the estimated frequencies is 1.

To achieve consistency, given the estimated results f̂, we solve the following optimization problem and use the postprocessed results f̃ as the final estimation: minimize ‖f̃ − f̂‖2 subject to f̃_k ≥ 0 for all k and Σ_k f̃_k = 1.

Based on the KKT conditions [15], the postprocessed results can be solved in closed form: f̃_k = f̂_k + Δ for k ∈ K⁺ and f̃_k = 0 otherwise, where K⁺ is the set of keys whose postprocessed frequencies are non-negative and Δ = (1 − Σ_{k∈K⁺} f̂_k)/|K⁺|.

We further explain why we use the L2 norm as the objective function. The L2 norm is used because the noise introduced by UE is well approximated by Gaussian noise, and minimizing the L2 norm achieves the MLE under Gaussian noise [13]. Besides, when we enforce the consistency requirement on the estimated results, many postprocessed results can satisfy it. For example, suppose the estimated results are (0.6, 0.6, −0.2); then the postprocessed results (0.5, 0.5, 0) and (1, 0, 0) are both consistent. However, postprocessed results that are far from the estimated results lead to poor utility. This is because f̂ is a useful unbiased estimator for each key, and a large deviation from it results in a large error. Therefore, the postprocessed results should be not only consistent but also close to the estimated results.
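The consistency step can be sketched as an iterative projection in the style of Norm-Sub, which converges to the L2-optimal point satisfying both requirements. This is an illustrative implementation, not the paper's code:

```python
import numpy as np

def consistency_project(est):
    """Project frequency estimates onto {f >= 0, sum(f) = 1},
    minimizing the L2 distance to the input (Norm-Sub style)."""
    f = np.asarray(est, dtype=float).copy()
    while True:
        f[f < 0] = 0.0                    # enforce non-negativity
        pos = f > 0
        if not pos.any():                 # degenerate input: fall back to uniform
            return np.full_like(f, 1.0 / len(f))
        delta = (1.0 - f[pos].sum()) / pos.sum()
        f[pos] += delta                   # shift positive entries so they sum to 1
        if (f >= 0).all():                # sum equals 1 here by construction
            return f
```

Each pass zeroes newly negative entries and redistributes the remaining mass, so the loop terminates after at most one pass per zeroed key.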

Theorem 5 proves that postprocessing leads to positive bias for frequency estimation.

Theorem 5. Given the unbiased estimated results , the corresponding postprocessed results solved by (13) lead to positive biases.

Proof. For each key k, the bias is E[f̃_k] − f_k = E[f̃_k − f̂_k]. Since f̃_k ≥ f̂_k, we have E[f̃_k − f̂_k] ≥ 0.
In many application domains, the number of users is large, and the true frequencies of many keys are far from zero. Thus, few estimated results are negative. Therefore, even though the postprocessing introduces a positive bias, it is sufficiently small in practice.
4.3.3. Numerical Distribution Estimation
The server performs the distribution estimation in a discretized way, that is, over a histogram on the value domain. For a key k, the server first finds the reported results that support key k, that is, those whose k-th bit of the perturbed vector is 1. Then it discretizes the received values into buckets and constructs a histogram, where each bin corresponds to the count of values falling in it. Since the values of the reported results that support key k are perturbed either from true values or from fake values (the fake values would distort the distribution estimation), the server then statistically removes the fake values and reconstructs a histogram over the original value domain as the estimated value distribution. We denote the histogram constructed by the true values as H, the reconstructed histogram as H̃, and their i-th bins as h_i and h̃_i. Next, we introduce how to statistically remove the fake values and then elaborate on the method of reconstructing the histogram.
Since the fake values are generated by users who do not possess the key k but report it as possessed, we first calculate the number of such users. Given the estimated frequency of the key k, the count of such users is approximately the number of users who do not input the key multiplied by the flipping probability q. Since each fake value is drawn from the uniform distribution over the value domain, the count of fake values in each bin of the histogram is approximately this count divided by the number of bins. Thus, the server can statistically remove the fake values by subtracting this per-bin count from each bin of the histogram. Then the server divides each bin by the count of users who really possess the key k to obtain the frequency of values falling in each bin. Denoting this histogram by H' and its i-th bin by h'_i, the frequency of values falling in each bin can be used to estimate the corresponding bin probability. Leveraging these probabilities for all bins, the server can reconstruct the histogram and obtain the numerical distribution estimation.
Given the histogram of the frequency of values, the server uses the EM algorithm to reconstruct the histogram over the original value domain and obtain the estimated value distribution. The overall procedure is shown in Algorithm 3.

Input: Perturbed results, estimated frequency f̃_k, padding length ℓ, number of bins d.
Output: Reconstructed histogram H̃.
(1) Discretize the reported values that support key k into d bins.
(2) Subtract the estimated per-bin count of fake values from each bin.
(3) Divide each bin by the estimated count of users who truly possess key k to generate H'.
(4) Initialize the histogram estimate uniformly for all bins.
(5) while not converged do
(6)  E-step.
(7)  M-step.
(8) end while
(9) Reconstruct the histogram H̃.
(10) return the reconstructed histogram H̃.
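Steps (1)-(3) of Algorithm 3, which run before the EM loop, can be sketched as follows. The fake-user and true-user counts passed in are the server's estimates derived from the estimated key frequency; all names are illustrative rather than from the paper:

```python
import numpy as np

def remove_fake_values(bin_counts, n_fake, n_true):
    """Statistically remove uniformly drawn fake values from a histogram.

    bin_counts: raw counts of reported values per bin for key k.
    n_fake: estimated number of users who reported key k without holding it.
    n_true: estimated number of users who truly hold key k.
    """
    d = len(bin_counts)
    # Fake values are uniform over the domain: n_fake / d land in each bin.
    cleaned = np.asarray(bin_counts, dtype=float) - n_fake / d
    cleaned = np.clip(cleaned, 0.0, None)   # counts cannot be negative
    return cleaned / n_true                 # per-bin frequency of true holders
```

The returned per-bin frequencies are then fed to the EM loop of Algorithm 3.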

5. Experiments

5.1. Setup
5.1.1. Data Sets

Four real-world data sets are involved in our evaluation: E-commerce [16], Clothing [17], Amazon [18], and Movie [19]. We summarize the data sets' parameters in Table 2, where ℓ is the padding length. All rating values are linearly transformed into the same range.

5.1.2. Competitors

We compare our mechanism with three existing mechanisms: PrivKVM [9], PCKV [8], and KVUE [20]. PrivKVM is elaborated in Section 3.2, so we do not introduce it again here. PCKV contains two mechanisms, namely PCKV-UE and PCKV-GRR, which are based on optimized unary encoding (OUE) and generalized randomized response (GRR), respectively; we compare against both in this paper. KVUE is a mechanism proposed to improve the performance of PrivKV [7], the degraded (single-iteration) version of PrivKVM. It treats each key-value pair as a whole entity instead of treating the key and value separately and directly perturbs each entity.

Since only PrivKVM supports both frequency estimation of keys and distribution estimation of values, while the other mechanisms are designed only for frequency estimation and mean estimation of values, we compare with PrivKVM on both the frequency and distribution estimation tasks and compare with PCKV and KVUE only on frequency and mean estimation.

5.1.3. Evaluation Environments

All mechanisms are implemented using Python 3.6 and Numpy 1.14. All experiments are conducted on an Amax server. The operating system of the machine is Ubuntu 16.04; the CPU is Intel Xeon Silver 4214 2.2 GHz, 24 cores in total; and the memory is DDR4-2666, with a total of 128 GB.

5.2. Metric
5.2.1. Frequency

We evaluate the key frequency estimation by the mean squared error (MSE). Formally, we measure MSE = (1/|K|) Σ_{k∈K} (f̃_k − f_k)², where K is any subset of the key domain, and the default K is the whole key domain.

5.2.2. Distribution Distance

We evaluate the distribution estimation by the average Wasserstein distance (AW). Formally, we measure AW = (1/|K|) Σ_{k∈K} W(H_k, H̃_k), where K is any subset of the key domain (the whole key domain by default) and W(H_k, H̃_k) is the Wasserstein distance between the true value distribution of the key k and the estimated one. Formally, given the histogram H_k constructed from the true values of key k and the reconstructed histogram H̃_k, the Wasserstein distance is the minimum cost of transforming one distribution into the other; for one-dimensional histograms, it equals the sum of the absolute differences of their cumulative sums.
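For equally spaced one-dimensional histograms, the Wasserstein distance reduces to the area between the two cumulative distributions, which makes the metric cheap to compute. A minimal sketch with an illustrative function name:

```python
import numpy as np

def wasserstein_1d(hist_p, hist_q, bin_width=1.0):
    """1-D Wasserstein distance between two normalized histograms
    defined on the same equally spaced bins."""
    p = np.asarray(hist_p, dtype=float)
    q = np.asarray(hist_q, dtype=float)
    # Area between the two CDFs, scaled by the spacing between bins.
    return bin_width * np.abs(np.cumsum(p - q)).sum()
```

For example, two point masses one bin apart are at distance equal to one bin width, reflecting the cost of moving all the mass by one bin.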

5.2.3. Mean and Variance

Given the estimated value distribution, we can also calculate the mean and the variance of the values. We again use the MSE to evaluate the mean estimation and variance estimation. Formally, we measure the MSE analogously to frequency estimation, where K is any subset of the key domain (the whole key domain by default); the estimated and true means of the values of key k are compared, as are the estimated and true variances.
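The mean and variance used in these metrics can be computed directly from the reconstructed histogram. A sketch, where `bin_centers` denotes the representative value of each bin (an assumption of this illustration):

```python
import numpy as np

def hist_mean_var(hist, bin_centers):
    """Mean and variance of a value distribution given as a normalized histogram."""
    h = np.asarray(hist, dtype=float)
    c = np.asarray(bin_centers, dtype=float)
    mean = (h * c).sum()                 # E[X] over the discretized support
    var = (h * (c - mean) ** 2).sum()    # E[(X - E[X])^2]
    return mean, var
```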

All metrics measure the error between the estimated result and the true result, and the smaller the metric, the more accurate the estimation. All results are averaged over 50 repeats to make the experimental results stable.

5.3. Key Frequency

We first evaluate the existing LDP mechanisms on key frequency estimation. We analyze these methods on three tasks:
(1) Frequency of individual keys: we measure the MSE between the true and postprocessed frequencies of each key. In this task, K is the whole key domain.
(2) Frequency of the most frequent keys: we select the top-T keys and measure the MSE between their true frequencies and the postprocessed ones. Formally, denoting the top-T keys by the set K_T, we measure the MSE over K_T. In this task, K is the domain of the top-T keys.
(3) Frequency of subsets of keys: estimating the frequency of a subset of keys plays an important role in interactive data analysis (e.g., estimating which category of products is more popular). We uniformly sample a fraction r of keys from the key domain and measure the MSE between the sum of the true frequencies and the sum of the postprocessed frequencies over the sampled subset. We sample 100 times and average the MSE. Default values of T and r are used unless they are the parameters being varied.

5.3.1. Frequency of Individual Key

We first evaluate the performance when querying the frequency of individual keys, and the results are shown in Figure 3. We conclude that our method is better than all other methods (the MSE of our method is the smallest) on all data sets because we enforce consistency as postprocessing. Especially when the noise is large (i.e., ε is small), our method reduces the MSE of the state-of-the-art solution by about 2 orders of magnitude. This is because the estimated frequencies are prone to be inconsistent under large noise, and our postprocessing improves the accuracy significantly. This also happens in the other two tasks for a similar reason (see Figures 4 and 5). On the data sets E-commerce, Clothing, and Amazon, the MSE results of the other existing methods are very similar; this is because the numbers of users in the data sets Clothing and Amazon are large, which compensates for the impact of the large key domain. However, our method shows the smallest MSE on the data set Amazon among all four data sets. This is because the number of users on Amazon is the largest, which leads to a smaller bias and better accuracy (according to the analysis of our postprocessing in Theorem 5). On the data set Movies, although all methods do not perform as well as they do on the first three data sets (due to the large padding length leading to a large error), our method still performs best among all mechanisms due to the consistency requirement.

5.3.2. Frequency of Most Frequent Keys

The MSE results when querying the top-T frequent keys under varying ε and T values are shown in Figures 4 and 6. Overall, our mechanism significantly reduces the MSE of the other methods under all ε and T values in most cases. Similar to the results for individual keys, the MSE of our method is also clearly lower than that of the other solutions when the noise is large (ε ≤ 2). As ε grows, the decline in the MSE of our method becomes stable. Figure 6 shows that the MSE of our mechanism is significantly smaller than that of the other mechanisms on all data sets under all T values, which demonstrates that our method can cope with various queries for the top-T frequent keys.

5.3.3. Frequency of Subsets of Key

We show the results for frequency estimation of a subset of keys in Figures 5 and 7. Overall, our method outperforms the other mechanisms under all ε and r values. In Figure 5, the MSEs of all mechanisms decrease as ε grows, and the large gap between our method and the others indicates that our method performs much better than the existing methods. Moreover, it is worth noting that in Figure 7, the MSEs of the other mechanisms grow as r increases, whereas the MSE of our method is symmetric in r. This is because the individual estimation errors accumulate as r increases under the other mechanisms, but we enforce consistency on the estimated results so that all estimated frequencies sum to 1; thus, estimating the frequency of a subset with fraction r is equivalent to estimating the frequency of the complementary subset with fraction 1 − r.

5.4. Distribution

We evaluate existing LDP mechanisms on distribution estimation. Here, we evaluate it from three perspectives: (1) distribution distance, (2) mean, and (3) variance. We compare our mechanism with PrivKVM on all three tasks and compare with the other mechanisms only on mean estimation since they are designed only for mean estimation. We set the number of buckets in our experiments as suggested in [13], since it has been shown to perform best in most cases for distribution estimation.

5.4.1. Distribution Distance

We plot the AW results as a function of ε in Figure 8. It shows that our mechanism outperforms PrivKVM and achieves a reasonable distribution estimation on all data sets, with only a small AW even in the worst case. This is because PrivKVM perturbs numerical values in a discrete manner and does not exploit the ordinal information of the numerical domain. It is also worth noting that the AW results on the first three data sets (E-commerce, Clothing, and Amazon) are similar and lower than those on the data set Movies. This is because the padding length for Movies is the largest, which leads to a large error in frequency estimation (see Figure 3). Thus, the number of users who generate fake values may be estimated inaccurately when we statistically remove the fake values, which leads to a high AW for distribution estimation.

5.4.2. Mean

The evaluation of mean estimation is shown in Figure 9. Our method performs much better than all other mechanisms under all ε values. Specifically, when ε is relatively small, our mechanism significantly reduces the MSEs of all other solutions; when ε is larger than 4, the MSE of our mechanism is one to two orders of magnitude smaller than that of most other solutions. This is because our mechanism reports values close to the original value with a higher probability than values far away from it. In this way, the perturbed results carry more useful information about the original values and lead to more accurate results.

5.4.3. Variance

Figure 10 plots the MSE results for variance estimation as a function of ε. Due to its categorical frequency oracle, PrivKVM underperforms in our experiments. It is worth noting that the MSE on the data set Movies is the highest. This is again because the largest padding length of Movies leads to a large error in frequency estimation (see Figure 3) and results in an inaccurate estimate of the number of users who generate fake values when we statistically remove them.

6. Related Work

Differential privacy has become the de facto standard for privacy preservation. There are many LDP deployments in the real world: the Google Chrome extension [3], spelling prediction by Apple [2], and telemetry collection by Microsoft [21].

6.1. Frequency Oracle and Distribution Estimation

Estimating the frequency of values is a basic task in LDP. Several mechanisms [3, 10, 22, 23] have been proposed for this task, and they are often called frequency oracles. For example, RAPPOR [3] enables the estimation of the marginal frequencies of a set of strings. However, it needs a dictionary of the candidate strings, which can be very large or unknown in practice. To solve this problem, Fanti et al. [24] use the EM algorithm as a decoder for RAPPOR to enable learning without explicit dictionary knowledge. Based on RAPPOR, Ren et al. [25] propose a novel mechanism to estimate distributions of high-dimensional data. Instead of the EM algorithm, they use Lasso regression to estimate the distribution in one round. Combining the EM algorithm and Lasso regression, Ren et al. [26] further propose a solution that can generate synthetic data by leveraging the estimated distribution of the data under LDP. Although the above schemes also use the EM algorithm, there are two differences compared to the EM algorithm in our mechanism: (1) our EM algorithm can statistically remove the fake values, and (2) it takes the aggregated results as input and is thus more efficient.

When estimating the distribution of numerical data, a naïve approach is to bucketize the data and apply the categorical frequency oracles listed above. In [4], the authors achieve distribution estimation under LDP but with a strictly weaker privacy guarantee. There are also mechanisms that can handle numerical settings but focus on the specific task of mean estimation, that is, SR [5, 21] and PM [27]. The SW mechanism [13] is the state-of-the-art mechanism for distribution estimation tasks under LDP, which can recover the distribution instead of focusing on a specific task.

Different from existing LDP mechanisms that only focus on simple statistical queries (such as frequency and mean), our paper designs a new LDP mechanism for key-value data collection that considers both key frequency and value distribution simultaneously.

6.2. Postprocessing

For statistical tasks in differential privacy, one can utilize structural information to postprocess the results and improve the data accuracy. Following this idea, Hay et al. [28] utilize structural information and propose an efficient hierarchical method to minimize the difference between the noisy result and the processed result. Besides that, Lee et al. [29] consider the non-negativity constraint and propose to use the alternating direction method of multipliers (ADMM) to obtain a result that achieves maximum likelihood. Wang et al. [11] further improve the data accuracy by enforcing consistency, that is, requiring the frequencies to be non-negative and sum to one. Jia and Gong [30] use conditional expectation to estimate the true data given the LDP-protected results; this method shows satisfactory results when the data approximately follow a power-law distribution. The EM algorithm is used in [13] to improve the accuracy of histogram data when estimating numerical distributions.

In this paper, we adopt postprocessing for key frequency estimation to further improve the accuracy.

6.3. Key-Value Data Collection

Ye et al. [7] are the first to propose LDP mechanisms for collecting key-value data, called PrivKV, PrivKVM, and PrivKVM+. PrivKVM iteratively estimates the mean to guarantee unbiasedness. PrivKV is a simplified version of PrivKVM and can be regarded as PrivKVM with only one iteration. To balance unbiasedness and communication cost, they also propose an advanced version of PrivKVM called PrivKVM+. Sun et al. [20] propose another estimator for frequency and mean estimation under the PrivKV framework to achieve better accuracy; they also introduce conditional analysis of key-value data for more complex analysis tasks in machine learning. Gu et al. [8] propose the framework PCKV, which perturbs the key and value in a correlated manner and provides a tighter privacy budget composition. As a result, PCKV outperforms the above LDP mechanisms in both key frequency estimation and value mean estimation. To the best of our knowledge, PrivKVM [9] is the state-of-the-art mechanism that not only supports more statistical tasks but also achieves the best accuracy in most cases.

7. Discussion and Conclusion

In this paper, we propose a novel LDP mechanism for private key-value data collection. Due to the consideration of numerical information of the value domain, our mechanism outperforms existing schemes in most cases. The mechanism perturbs the key-value data in a correlated manner and results in the privacy amplification effect. We further improve the accuracy of the frequency estimation by consistency. Finally, we evaluate our mechanism on four real-world data sets and demonstrate our mechanism outperforms existing schemes.

Although our mechanism performs well in our experiments, it still has the following limitations:
(1) We do not consider the optimal padding length, which would lead to more accurate results.
(2) Our mechanism adopts only a suboptimal privacy budget allocation scheme rather than the optimal allocation scheme. Although our method still outperforms previous mechanisms, it does not achieve the minimum error.

In future work, we will study the optimal padding length that can further improve the privacy-utility trade-off and study the optimal privacy budget allocation.

Data Availability

The key-value data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was funded by NSFC (61932015 and 61732022), Shaanxi Innovation Team Project (2018TD-007), and Higher Education Discipline Innovation 111 Project (B16037).