Limiting Privacy Breaches in Average-Distance Query
Querying average distances is useful for real-world applications such as business decisions and medical diagnosis, as it can help a decision maker better understand the users' data in a database. However, privacy has been an increasing concern: people now suffer serious privacy leakage from various kinds of sources, especially service providers that insufficiently protect users' private data. In this paper, we discover a new type of attack on the average-distance query (AVGD query) with noisy results. The attack is general in that it can reveal private data of different dimensions. We theoretically analyze how different factors affect the accuracy of the attack and propose a privacy-preserving mechanism based on the analysis. We experiment on two real-life datasets to show the feasibility and severity of the attack. The results show that the severity of the attack is mainly influenced by factors including the noise magnitude, the number of queries, and the number of users in each query. We also validate the correctness of our theoretical analysis against the experimental results and confirm the effectiveness of the privacy-preserving mechanism.
Nowadays, a major concern in modern society is the leakage of private information, e.g., health condition and location information. Reports show that healthcare data breaches in 2018 resulted in the exposure of 13 million healthcare records. Breaches in the US healthcare field cost $6.2 billion each year. Meanwhile, the disclosure of personal location data can cause serious issues: by analyzing the semantic information (e.g., hospital and church) of users' locations, sensitive information such as home address, health status, and religious faith may be revealed [3, 4]. On the other hand, users' data are valuable to be queried for public usage. Therefore, privacy-preserving mechanisms are often used during query processing. However, it is still questionable whether existing mechanisms are sufficient for privacy.
We focus on the average-distance query (AVGD query) [6–8], which serves as a basic component of several applications such as business decisions and medical diagnosis. Given a database containing users' data, an AVGD query takes a query point and a set of users as inputs and returns the average distance between the query point and the data of those users. The query point is a value located in the same space as the users' data. To the best of our knowledge, privacy breaches in the AVGD query have not been studied yet, and user privacy needs attention during such queries. In this paper, we study privacy breaches in the AVGD query and propose a new kind of attack. As we find, an adversary can mount an attack by selecting the user sets, querying on specific query points, and revealing users' data by leveraging the query results.
Both of the following examples of the AVGD query, which involve query points of different dimensions, are exposed to the attack. Example 1: a company plans to deploy a new branch, whose location must be carefully chosen according to several factors, e.g., traveling time. In this case, a location-based service benefits the site-selection process by outputting the average distance between local users and a location input by the company. Here, the query point is the coordinate of the location. Example 2: medical data can be clustered to predict the likelihood of diseases. The average distance between the elements in a cluster and the cluster centroid, and the average distance between the elements in two clusters, are basic measurements in clustering approaches. We consider the following one-dimensional case: given the medical data (e.g., blood pressure) of a group of patients and a query point (e.g., the cluster centroid), a hospital queries the average distance between the data of each patient and the query point. Here, the query point is a real-valued medical datum, and the absolute-value distance is the distance measure. To perform the attack, an adversary selects two user sets, which differ in one target user. Then, the adversary chooses a set of query points and obtains the results queried on these query points and the two user sets. The adversary can set up equations based on the correlations between the results and recover the data of the target user. The detailed process of the attack is described in Section 3.
Nevertheless, most existing techniques for privacy-preserving query processing cannot prevent this kind of attack. K-anonymity-based approaches [9–11] are suitable for publishing or querying a single user's records, while the AVGD query returns aggregate results. Other methods such as data transformation [12, 13] and homomorphic encryption [8, 14] avoid transmitting user data in plaintext. They prevent the exposure of the original user data during query processing, but they do not consider the information leakage from the query results.
Noise-based output perturbation (e.g., differential privacy with Laplace noise) is a possible method to prevent the attack. However, as we demonstrate in this paper, the effectiveness of the noise may be weakened under the attack. Too little noise may not provide enough protection, while too much noise destroys the utility of the results. Thus, the amount of noise should be quantified to balance utility and privacy. We aim to decide the minimum amount of noise that meets the desired privacy requirement. Previous works [16–18] have studied attacks on queries with noisy outputs, but they all focus on sum queries or questions based on sum queries. They derived the lower bounds of noise (as a function of the number of queries) needed to prevent violations of privacy. To the best of our knowledge, we are the first to investigate attacks on AVGD queries. Instead of giving a lower bound directly, we formalize the process of the attack and obtain explicit expressions for the uncertainty of an adversary's estimation under different conditions, which serve as a guide on the amount of noise to be added. As we further find, the uncertainty is affected by several factors of the queries, e.g., the number of users in each query, the data value of the target user, and the number of queries. Finally, we propose a privacy-preserving mechanism based on the analysis.
We evaluate the proposed attack on two real-world datasets: a one-dimensional medical dataset and a two-dimensional location dataset. The results show the feasibility and severity of such attacks. We compare the experimental results with our theoretical analysis under different factors and show the correctness of the theoretical analysis, which guarantees the effectiveness of the corresponding privacy-preserving mechanism.
The contributions in this paper are as follows:
(i) We discover a new kind of attack on the AVGD query, in which the adversary can recover a target user's data by analyzing query results.
(ii) We perform a detailed theoretical analysis of the attack and propose a privacy-preserving mechanism based on the analysis.
(iii) We evaluate the attack on real-world datasets, showing the severity of the attack and the correctness of our theoretical analysis.
The rest of this paper is organized as follows. Section 2 reviews relevant work on the average-distance query and privacy-preserving query processing. Section 3 gives the formal definition of the AVGD query and the detailed process of the proposed attack and describes the goal of this paper. Section 4 theoretically analyzes the uncertainty of the adversary's estimation. Section 5 gives a privacy-preserving mechanism based on the theoretical analysis. Section 6 conducts the attack on real-world datasets and gives the experimental results. Section 7 concludes this work and discusses future directions.
2. Related Work
2.1. Average-Distance Query
The average distance serves as a basic metric in several real-life applications. Scellato et al. introduce the average distance as one of the metrics for the spatial properties of social networks. Armenatzoglou et al. use the average distance as a metric for ranking users in a geosocial network. Yang et al. seek to find a group of attendees for an impromptu activity such that the average spatial distance between each attendee and the rally point is minimized under a restriction on the social relationship. The average distance metric is also used for the problem of min-dist optimal location selection [8, 20, 21]: given a set of existing facilities and a set of users, a location for a new facility is found such that the average distance between each user and his/her nearest facility is minimized.
2.2. Privacy-Preserving Query Processing
Several approaches have been proposed for privacy-preserving queries. K-anonymity-based approaches mix the target user's data with other users' data such that an adversary cannot distinguish the target user from those users. L-diversity and t-closeness have been proposed to strengthen the definition of k-anonymity. The attack proposed in this paper leverages the outputs of AVGD queries, which are aggregate information, while k-anonymity-based approaches preserve privacy when publishing or querying a single user's data.
Approaches based on data transformation [12, 13] transform a user's data before sending queries to the server. The server answers queries based on the transformed data rather than the original data. Some approaches use cryptography [22–24], such as homomorphic encryption [8, 14], to encrypt the users' data and enable queries on encrypted data. These approaches avoid transmitting the original user data during query processing, thus protecting user privacy, but they do not consider the privacy leakage from the query results, which motivates the attack proposed in this paper.
Differential privacy is proposed to protect the privacy of individuals while releasing aggregate outputs. For two neighboring databases, differential privacy requires that query results on the two databases be indistinguishable up to an exponential factor. One way to achieve differential privacy is to add controlled noise (e.g., Laplace noise) to the query results. Applying differential privacy to the AVGD query is possible, but it is essentially a noise-based method, and the amount of noise should be carefully chosen under different factors, as shown in this paper.
Dinur and Nissim derived lower bounds on the noise needed in sum queries to prevent violations of privacy, under the assumptions of unbounded and bounded adversaries. Several works [17, 18, 25] further improved the study with more general databases and more efficient attacks. Blum et al. proposed a practical framework to preserve privacy when the number of queries is limited to be sublinear in the size of the database. All these works focus on questions based on sum queries.
3. Problem Formulation
In this section, we give the formal definition of the AVGD query. Then, we show the proposed attack on the AVGD query and describe the goal of this paper.
3.1. System Model of Average-Distance Query
The system model of the AVGD query is as follows. There is a server that provides an AVGD query service and a client that requests queries. The server has a database that contains the records of users. We assume that all these records are real valued. To run an AVGD query, the client needs to specify a set of users in the database. One way to specify the users is to use an identifier shared between the server and the client, e.g., the phone number. The client can also specify the users based on some of their attributes, e.g., patients who exhibit specific symptoms in a healthcare database.
The AVGD query is formalized as follows.
Average-distance query: (1) the client requests a query by choosing a user set U and a query point q; (2) the server answers the query by computing the average distance f(U, q) = (1/|U|) Σ_{u∈U} d(x_u, q). Here, x_u denotes the data of user u, and d(x_u, q) is the distance between x_u and q. The implementation of d(·, ·) depends on the application. For example, it can be the absolute-value distance in the one-dimensional case or the spatial distance when the data are two-dimensional location points.
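For concreteness, the query above can be sketched as follows. The function name avgd_query and the record types are our own illustration; absolute distance is used for one-dimensional records and Euclidean distance for two-dimensional locations:

```python
import math

def avgd_query(user_data, query_point):
    """Answer an AVGD query: the average distance between a query point
    and the records of the chosen users.  The distance measure is
    application-specific, as noted in the definition above."""
    def dist(record, q):
        if isinstance(record, (int, float)):
            return abs(record - q)      # one-dimensional case
        return math.dist(record, q)     # Euclidean distance for 2-D locations
    return sum(dist(r, query_point) for r in user_data) / len(user_data)
```

For example, avgd_query([70, 80, 90], 80) averages the absolute deviations of three diaBP readings from a query point of 80 mmHg.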
3.2. Adversary Model
The client is considered to be “semihonest.” The client will follow the protocol correctly to request queries. He/she will not break into the server system to get users’ information illegally. However, the client may get extra information by analyzing the query results. The server is considered to be honest and is aware of the exact data value of each user but has to hide users’ data from the client.
3.3. Attack on the Query
An adversary can make an attack as shown in Figure 1. He/she chooses two groups of users U1 and U2 that differ in one user t (the target user can be chosen based on attributes when the adversary knows enough partial information). Here, U2 = U1 ∪ {t} and |U2| = |U1| + 1. We assume that the adversary is not aware of the value of the target's data x_t. Then, the adversary chooses query points q_1, …, q_N. For each q_i, he/she gets the query results f(U1, q_i) and f(U2, q_i). The equation |U2| · f(U2, q_i) − |U1| · f(U1, q_i) = d(x_t, q_i) holds, where x_t denotes the data of t. Then, the adversary can solve equation (1) to recover x_t:
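To see why noiseless results are fatal, the following one-dimensional sketch carries out the attack exactly. The assumption that U2 adds the target t to U1, as well as all function names and data values, are ours:

```python
def avgd(data, q):
    """Exact AVGD result in the one-dimensional case."""
    return sum(abs(x - q) for x in data) / len(data)

def target_distance(r1, n1, r2, n2):
    """Given exact results r1 on U1 (n1 users) and r2 on U2 = U1 plus the
    target (n2 = n1 + 1 users), the total-distance difference isolates
    the target's distance to the query point."""
    return n2 * r2 - n1 * r1

def recover_value(q1, d1, q2, d2, tol=1e-9):
    """Intersect the candidate sets {q - d, q + d} from two query points
    to pin down the target's one-dimensional value."""
    c1 = {q1 - d1, q1 + d1}
    c2 = {q2 - d2, q2 + d2}
    return sorted(a for a in c1 if any(abs(a - b) < tol for b in c2))

others = [70.0, 90.0]   # data of the shared users (hypothetical values)
x_t = 85.0              # the target's value, unknown to the adversary
d1 = target_distance(avgd(others, 60), 2, avgd(others + [x_t], 60), 3)
d2 = target_distance(avgd(others, 100), 2, avgd(others + [x_t], 100), 3)
# recover_value(60, d1, 100, d2) recovers x_t up to floating-point error
```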
The goal of this paper is to quantify the effect of such attacks and design corresponding noise-based protection schemes. Specifically, in the protection scheme, we assume that the server adds noise to the query results. We consider two kinds of noise: multiplicative noise and additive noise. The multiplicative noise is represented as a random variable ε, and the noisy output is f(U, q)(1 + ε). We assume that ε obeys a Gaussian distribution, which has been used for statistical disclosure control. The additive noise is represented as a random variable η, and the noisy output is f(U, q) + η. We assume that η obeys a Laplace distribution, which is one of the means to provide differential privacy. Both kinds of noise are assumed to have a mean value of 0 to avoid bias. Note that we aim to present the attack under different kinds of noise. Enabling a differential privacy mechanism for the AVGD query is out of the scope of this paper, and we do not require the additive Laplace noise to satisfy the requirements of differential privacy. We analyze how the noise affects both the adversary's estimation of the user's data value and the utility of the query results.
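A minimal sketch of the two noise models. The multiplicative form result * (1 + e) is our reading of the zero-mean unbiasedness requirement, and NumPy parameterizes the Laplace distribution by its scale b, with variance 2 b^2:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def mult_noise(result, sigma):
    """Multiplicative Gaussian noise: result * (1 + e), e ~ N(0, sigma^2).
    The zero-mean e keeps the perturbed output unbiased."""
    return result * (1.0 + rng.normal(0.0, sigma))

def add_noise(result, sigma):
    """Additive Laplace noise with zero mean and standard deviation sigma.
    NumPy's scale parameter b relates to the variance by sigma^2 = 2 b^2."""
    return result + rng.laplace(0.0, sigma / np.sqrt(2.0))
```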
4. Uncertainty Analysis
To protect users' privacy, the server replies with query results perturbed by noise. This causes uncertainty in the adversary's estimation of the target user's data. In this section, we investigate the factors that influence the precision of the adversary's estimation. Specifically, we solve the problem in a general model under two kinds of noise, i.e., the multiplicative noise and the additive noise, and then refine the solutions in two practical cases.
4.1. The Multiplicative Noise
The adversary solves the following fitting problem to recover the data of the target user. Assume the adversary chooses two user sets and that . For a query point , the noisy results of the AVGD queries with and are and , respectively. We assume that . Let and the data of be , and the adversary needs to fit the unknown parameters and in the model of equation (2) with sets of and noisy and , where and are the noisy outputs queried on the i-th query point . and are the exact results queried on , and the noises added on them are and , respectively:
We simplify the model as follows. Let and . Let denote the parameters to be fit. We simplify the model in equation (2) as
Then, the problem is to fit in equation (3) with noisy and , where and . Here, . We write , where , , and the symbol denotes the Hadamard product.
Next, we derive the uncertainty of the estimation of the parameter in equation (3) caused by the noise. Sader et al. solved the problem of uncertainty analysis while fitting a function with noisy output data (). We extend their work and enable uncertainty analysis while both the input () and output data () contain noise. The method of least squares is used to solve such a fitting problem with the goal that the residual is minimized, where and is the estimated value of . Thus, satisfies the following equation: where is the gradient with respect to .
We expand , where is the solution in the noiseless condition; i.e., is the ideal value of the fitting parameter; denotes the deviation between the estimation and the ideal value. Then, we have where is the gradient with respect to .
As we assume as the solution without the effect of noise, we can get
The expectation and variance of the fitting parameters can be approximated in the form of integrals. We assume that the query points are uniformly picked from a query range . We set as the average area in segmented by the query points. and are noises added by the server and are assumed to follow the same distribution that and . Also, and . We approximate the sums in equations (8) and (9) with integrals and get the results for the expectation and variance of the m-th component, , of , under the multiplicative noise: where denotes the m-th component of vector and for matrix , . is denoted as
4.2. The Additive Noise
The adversary solves a similar fitting problem as in the case of the multiplicative noise, where the fitting model is the same as equation (2). The only difference is that the noisy results of the AVGD queries with and under the additive noises are and , respectively. We assume that and . The residual to be minimized in equation (4) changes into where and .
We omit the detailed derivation. and are assumed to follow the same Laplace distribution that and . Also, and . We give the expectation and variance of the uncertainty of the adversary’s estimation, , as follows, where is the average area in segmented by the query points and is defined in equation (12):
4.2.1. Case 1: One-Dimensional Average-Distance Query
We consider a query of the one-dimensional average distance; i.e., the queried records of users are one-dimensional real values. For instance, the queried data can be the diastolic blood pressure (diaBP) data of users from a medical database. In this case, given a user set U and a query point q, the query function is f(U, q) = (1/|U|) Σ_{u∈U} |x_u − q|, where |U| denotes the number of users in U and x_u denotes the data of user u. Assuming that the value of the target user is x_t, the model for the adversary to fit is
We compute the expectation and variance of the adversary’s estimation in the case of multiplicative and additive noises, as in equations (10), (11), (14), and (15). In both cases, we have and . We assume that the multiplicative noise obeys Gaussian distribution with an expected value of 0 and the variance of , such that and in equation (10), and in equation (11). The additive noise is assumed to obey Laplace distribution with zero mean and variance of , and then and in equation (14) and and in equation (15).
To simplify the results, we give an approximation of as follows. Assume the data in the database follow a Gaussian distribution , and the query range is from to . Then, we approximate the value of aswhere is the probability density function.
Then, the uncertainty of the adversary’s estimation is as follows. Let be the adversary’s estimation of and be the error of the adversary’s estimation. The adversary chooses a total of query points, and thus . According to equations (10), (11), (14), and (15), we can get the variance and the expected value of under the assumption of multiplicative noise and additive noise, respectively, aswhere denotes the relative value of in the query range and are the coefficients whose values only rely on .
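The one-dimensional fitting attack can be simulated end to end. The sketch below uses additive Laplace noise and replaces the nonlinear least-squares solver with a grid search over candidate values; all concrete numbers (50 users, 500 query points, noise standard deviation 0.05 mmHg) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

x_t = 85.0                                   # target's diaBP, unknown to the adversary
others = rng.uniform(60.0, 100.0, size=49)   # the 49 users shared by U1 and U2
qs = rng.uniform(20.0, 145.0, size=500)      # N = 500 query points in the query range
n1, n2 = 49, 50
sigma = 0.05                                 # additive-noise standard deviation (mmHg)
b = sigma / np.sqrt(2.0)                     # Laplace scale for that std

def avgd(data, q):
    return np.abs(data - q).mean()

# Noisy AVGD results on U1 (without the target) and U2 (with the target)
y1 = np.array([avgd(others, q) for q in qs]) + rng.laplace(0.0, b, size=500)
y2 = np.array([avgd(np.append(others, x_t), q) for q in qs]) + rng.laplace(0.0, b, size=500)

# Least-squares fit of the model n2*y2 - n1*y1 = |x_t - q| by grid search
z = n2 * y2 - n1 * y1
grid = np.linspace(20.0, 145.0, 5001)
sse = ((z[None, :] - np.abs(grid[:, None] - qs[None, :])) ** 2).sum(axis=1)
x_hat = grid[int(np.argmin(sse))]            # the adversary's estimate of x_t
```

Increasing the noise standard deviation or decreasing the number of query points visibly widens the estimation error, in line with the analysis above.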
4.2.2. Case 2: Two-Dimensional Average-Distance Query
We consider the query of the average distance between spatial locations in this case. Assume that the server has a database of users’ location data. The coordinates of locations are in the polar coordinate system. The client queries the average distance between a set of users and a specified query location , where and . The query function , where denotes the coordinate of user . We denote the location of the target user as . In this case, the model for the adversary to fit is
In this case, we have and in equations (10), (11), (14), and (15). The multiplicative noise added on the query results in equations (10) and (11) is assumed to obey the Gaussian distribution with the expected value of 0 and the variance of . Thus, we have , , , and . The additive noise in equations (14) and (15) is assumed to obey Laplace distribution with zero mean and variance of , and then and in equation (14) and and in equation (15).
We give an approximation of in this case as follows. We find in real datasets that users' check-in locations are concentrated at the center of the city and disperse towards the edge of the city. Figure 2 shows the distribution of the check-ins of users in New York from a real-world dataset. The distribution is shown as a heat map in which red areas indicate higher concentrations of check-ins than blue areas. The check-ins concentrate in the center of the city and decrease towards the outskirts because people are more active in the city center. Accordingly, we assume that the coordinates of users follow the distribution in which the radial coordinate and the angular coordinate of each user are uniformly selected from and , which concentrates points in the center. Then, can be represented as
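The assumed polar distribution (radial and angular coordinates each uniform) indeed concentrates points near the center: the inner disk of half the radius covers only a quarter of the area but receives about half of the points. A quick numerical check, with the radius set to the 28.5 km used later for the New York dataset:

```python
import numpy as np

rng = np.random.default_rng(seed=9)
R = 28_500.0                                  # city radius in meters
n = 100_000
r = rng.uniform(0.0, R, size=n)               # uniform radial coordinate
phi = rng.uniform(0.0, 2.0 * np.pi, size=n)   # uniform angular coordinate

# Under this distribution, the inner disk of radius R/2 (one quarter of
# the total area) holds about half of the sampled points:
inner_fraction = float((r < R / 2.0).mean())
```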
The uncertainty of the adversary’s estimation in this case is given as follows. Let the adversary’s estimation of be . and denote the errors of the adversary’s estimations of and . The adversary queries on a total of query points. According to equations (10) and (11) and letting , we can get the variances of and under the assumption of multiplicative and additive noises, respectively, aswhere denotes the relative location of in the query range and are the coefficients whose values are determined by . The expected values of and are both :
An obvious finding in equations (18), (19), and (23)–(26) is that the variances of , , and are in direct proportion to or and in inverse proportion to . In other words, the noise added to the query results makes it harder for adversaries to reveal the target user's data. Besides, as the number of returned query results increases, the risk of data leakage also increases, since more equations can be built to solve for the fitting parameters. Moreover, a smaller number of users in a query also leads to a higher accuracy of the malicious estimation of users' data.
5. Privacy-Preserving Mechanism
In this section, we give a privacy-preserving mechanism for AVGD queries. We define the privacy metric for the AVGD query. Then, we design a mechanism to decide the minimum amount of noise for a query given the desired privacy requirement.
5.1. Privacy Metric for AVGD Query
We define the privacy metric for the AVGD query as follows. Hoh et al. and Shokri et al. quantified location privacy as the expected distance between the user's location and the adversary's estimation. We follow this idea and quantify the privacy in the attack on AVGD queries as the expected error distance (eed), i.e., the expectation of the distance between the target user's data and the adversary's estimation. The definition of eed is as follows: where is the real data of the target user and is the value estimated by the adversary. Therefore, a smaller eed corresponds to a higher risk of privacy leakage. Then, we give the detailed description of eed in the one-dimensional case () and the two-dimensional case .
5.1.1. One-Dimensional Case
According to equation (8), the error of an adversary's estimation is the sum of random variables. In the case of multiplicative noise, the noise is assumed to obey a Gaussian distribution. Thus, the error is a sum of Gaussian noises, which obeys a Gaussian distribution as well. The noises are assumed to obey a Laplace distribution in the case of additive noise. However, the sum of Laplace noises may not obey a Laplace distribution. We empirically approximate the distribution of the error in the case of additive noise as a Gaussian distribution.
The expectations of the errors in the multiplicative noise case and the additive noise case are both 0, and the variances are described in equations (18) and (19), respectively. We omit the subscripts in and and simply write them as . According to equation (28), we have
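Under the Gaussian approximation of the estimation error, the one-dimensional eed has a simple closed form: for a zero-mean Gaussian error with standard deviation sigma, E[|error|] = sigma * sqrt(2/pi). A Monte Carlo sanity check (sigma = 2 is an arbitrary illustrative value):

```python
import math
import numpy as np

rng = np.random.default_rng(seed=3)
sigma = 2.0
errors = rng.normal(0.0, sigma, size=1_000_000)   # simulated estimation errors

eed_mc = np.abs(errors).mean()                    # empirical expected error distance
eed_closed = sigma * math.sqrt(2.0 / math.pi)     # closed form for Gaussian error
```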
5.1.2. Two-Dimensional Case
Denote the real coordinate of the target user as and the estimation of the adversary as . Let and denote the errors of the adversary’s estimations on and , respectively. Then, the distance between the real coordinate of the target user and the adversary’s estimation is
Approximating with , we get
in the two-dimensional case is
To calculate the expected value of , we use the following approximation. For a function on a random variable , using Taylor approximation , we can get
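This second-order approximation, E[g(X)] ≈ g(mu) + g''(mu) * sigma^2 / 2, can be checked numerically; g(x) = sqrt(x) below is our illustrative choice, echoing the square root in the error distance:

```python
import numpy as np

rng = np.random.default_rng(seed=5)
mu, sigma = 4.0, 0.5
x = rng.normal(mu, sigma, size=1_000_000)

mc = np.sqrt(x).mean()                     # Monte Carlo estimate of E[sqrt(X)]
# For g(x) = sqrt(x), the second derivative is g''(x) = -x**(-1.5) / 4
taylor = np.sqrt(mu) - mu ** -1.5 / 4.0 * sigma ** 2 / 2.0
```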
and are random variables that obey Gaussian distribution in the case of multiplicative noise and are approximated to obey Gaussian distribution in the case of additive noise. Both the expected values of and are , and the variances are and (subscripts are omitted) respectively, as described in equations (23) to (26). We have
5.2. Privacy-Preserving Algorithm
Given the privacy metric for the AVGD query, we design a mechanism for the privacy-preserving AVGD query, as shown in Algorithm 1. Since the precision of an adversary's estimation increases when he/she gets more results and when there are fewer users in each query, we assume that the server limits the number of queries on a specified user set that each client can request as and limits the minimum number of users in each query as . Each user in the database has the privacy requirement ; i.e., the eed of the adversary's estimation of the data of must be no less than . Assume that a client specifies a user set and a query point . Then, the server calibrates the noise magnitude such that the privacy requirement of each user in is satisfied and responds to the client with noisy results.
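The calibration step can be sketched by inverting the Gaussian eed expression: if the adversary's estimation error has standard deviation C * sigma, where the factor C collects the dependencies derived in Section 4 (number of queries, users per query, position of the target), then meeting an eed requirement theta needs sigma >= theta * sqrt(pi/2) / C. The functions below are an illustrative sketch under that assumption, not the exact Algorithm 1:

```python
import math

def min_noise_std(theta, amplification):
    """Smallest noise standard deviation that keeps the adversary's eed
    at or above theta, assuming eed = sqrt(2/pi) * amplification * sigma."""
    return theta * math.sqrt(math.pi / 2.0) / amplification

def respond(exact_result, theta, amplification, rng):
    """Perturb an exact AVGD result with calibrated additive Laplace
    noise (scale b relates to the std sigma by sigma^2 = 2 b^2)."""
    sigma = min_noise_std(theta, amplification)
    return exact_result + rng.laplace(0.0, sigma / math.sqrt(2.0))
```

A server would evaluate the amplification factor for each user in the chosen set and respond with the largest calibrated noise, so that every user's requirement holds.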
6. Experimental Evaluation

In this section, we perform the attack on real-world datasets and seek to answer the following questions: How do different factors influence the uncertainty of the adversary's estimation? How severe is the attack in real-life cases? How accurate is our theoretical analysis? How much utility is lost when privacy preservation is enabled?

6.1. Datasets
6.1.1. One-Dimensional Case
We use a medical dataset from the Framingham Heart Study  that contains health data of 4,240 users. We use the diastolic blood pressure (DiaBP) data of each user as the inputs in our experiments. The DiaBP data in this dataset range from 48 mmHg to 142.5 mmHg and obey Gaussian distribution with a mean value of 81.31 mmHg and a standard deviation of 10.98 mmHg.
6.1.2. Two-Dimensional Case
We use a location dataset collected from Foursquare . The dataset contains check-ins contributed by users in New York from 12 April 2012 to 16 February 2013 . Each check-in is associated with a user-id, a timestamp, and the GPS coordinate. The original dataset contains 1,083 unique users and 227,428 check-ins. In our settings, we only consider one check-in for each user in each query. For simplicity, we choose the latest check-in of each user as the target of the clients in our experiments. The radius of this dataset is 28.5 km.
6.2. Metrics

We aim to measure the severity of the attack and the correctness of our theoretical analysis. The severity of the attack is measured as and defined in equations (29) and (32) for the two cases. Smaller values of and mean higher severity. We measure the correctness of the standard deviations in equations (18), (19), and (23)–(26) and of and in equations (29) and (32). The correctness is measured as the error rate between the theoretical estimation and the experimental value, i.e., where and denote the theoretical estimation and the experimental value, respectively.
6.3. Experimental Setup and Statistical Result
The behavior of an adversary is simulated as follows. The dataset and a query range specified by the server are given. The adversary focuses on a target user (assumed to be within the query range). The adversary randomly chooses a user set within the query range containing users and a neighboring user set . Assume that the maximum number of queries that each client can request on a specified user set is . The adversary chooses query points within the query range. In the one-dimensional case, we set the maximum query range from 20 mmHg to 145 mmHg. In the two-dimensional case, the query range for the angular coordinate is from 0 to 2π, and the maximum query range for the radial coordinate is from 0 m to 28,500 m. For each query point , the adversary requests AVGD queries on and , respectively. The server computes the results and and adds multiplicative noises and (resp., additive noises and ) to the query results such that the noisy outputs are and (resp., and ), where and (resp., and ) are random variables drawn from a Gaussian distribution (resp., Laplace distribution). The adversary gets a total of noisy results and solves equations like equation (1) to recover the data of . We use the lsqnonlin function with the default trust-region-reflective algorithm in MATLAB R2018b to solve the nonlinear least-squares problems.
We study the influences on the adversary’s estimation caused by different factors. The influences include the uncertainty of the adversary’s estimation, eed of the adversary, and the error rate of the theoretical estimation. The factors we consider in our experiments are as follows: the magnitude (standard deviation) of the noise added on the query results ( and ), the number of users in the query requested by the adversary (), the limited number of queries on a specified user set for each client (), the relative position of the target user’s data in the query range (), and the size of the query range.
In our experiments, we observe the influences by simulating attacks with different values of the parameters. Under each configuration of parameters, we perform the simulation of the attack described above 2,000 times. The detailed settings of the parameters and the results are as follows. Note that Section 6.3 only demonstrates the results; the detailed analysis is given in Sections 6.4–6.6.
We first demonstrate the influence of the noise magnitude. In the experiments, we fix the query range to the maximum query range and set , , and (in the two-dimensional case, denotes the relative position of the radial coordinate of the target user; the angular coordinate does not affect the uncertainty of the adversary, as shown in equations (23)–(26), so we set it to 0). For the one-dimensional case, we change the multiplicative noise (resp., additive noise ) from 0.01 to 0.1 (resp., 0.1 mmHg to 1 mmHg). For the two-dimensional case, we change (resp., ) from 0.001 to 0.01 (resp., 10 m to 100 m). The results of the first experiment are shown in Figures 3 and 4.
Secondly, we study the influence of the number of users in each query. We fix , for the one-dimensional case and , for the two-dimensional case. For both cases, we fix the query range to the maximum query range, fix the parameters and , and change from 20 to 200. The results are shown in Figures 5 and 6.
We then show the effect of the number of queries on a specified user set. We set , for the one-dimensional case and , for the two-dimensional case, fix and , and fix the query range to the maximum query range for both cases. We choose from 200 to 2,000 for both cases. The results are shown in Figures 7 and 8.
We also study the effect of the relative position () of the target user's data in the query range. Note that in the two-dimensional case, we only consider the impact of the radial coordinate of the target user because the angular coordinate has no effect on the uncertainty of the estimation. The parameter (resp., ) is set to 0.01 (resp., 0.1 mmHg) for the one-dimensional case and 0.001 (resp., 10 m) for the two-dimensional case. We fix the query range to the maximum query range, fix and , and choose from 0.1 to 1 for both cases. The results are shown in Figures 9 and 10.
Finally, we demonstrate the influence of the size of the query range. We fix the multiplicative and additive noise magnitudes for both the one-dimensional and two-dimensional cases and fix the remaining parameters. We then change the size of the query range: specifically, we fix the central point of the query range and scale the range around that point. The results are shown in Figures 11 and 12.
We omit the results for the remaining factors, as changes in them show no obvious impact on the measured quantities in the experimental results.
6.4. Effects of Different Factors
6.4.1. The Noise Magnitude
As shown in Figures 3 and 4, the standard deviation of the adversary's estimation error and the adversary's eed are in direct proportion to the standard deviation of the noise. The precision of the adversary's estimation decreases when the server adds larger noise to the query results.
6.4.2. The Number of Users in Each Query
The standard deviation of the adversary's estimation error and the eed of the adversary's estimation are in direct proportion to the number of users in each query, as shown in Figures 5 and 6. The precision of the adversary's estimation decreases as the number of users in each query increases. The reason is as follows: for two user sets that differ only in the target user, the difference between their query results shrinks as the number of users grows, and thus it becomes more difficult for an adversary to recover the target user's data.
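A back-of-the-envelope calculation illustrates this dilution effect. In the simplified difference-of-averages view of the attack (our illustrative notation, not the paper's), recovering the target from averages over sets of size n and n − 1 multiplies the per-query noise by roughly n√2, so larger groups directly inflate the adversary's error:

```python
import math

def noise_amplification(n, noise_std):
    """Std of n*Z1 - (n-1)*Z2 for independent noise terms Z1, Z2 of
    std noise_std: the factor by which inverting two noisy averages
    over n and n-1 users inflates the per-query noise."""
    return noise_std * math.sqrt(n ** 2 + (n - 1) ** 2)

# a 10x larger group amplifies the per-query noise roughly 10x
print(noise_amplification(20, 1.0))   # ~27.6
print(noise_amplification(200, 1.0))  # ~282.1
```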
6.4.3. The Number of Queries
As shown in Figures 7 and 8, the standard deviation of the estimation error and the eed decrease as the number of queries increases. More queries generate more results that can be used for the fitting problem, and thus the adversary obtains a more precise estimation.
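The 1/√N flavor of this improvement can be seen in a toy version of the fitting step, assuming (for illustration only) that each query yields an independent noisy observation of the same underlying quantity:

```python
import random
import statistics

random.seed(7)

def fitted_estimate(num_queries, noise_std=1.0, truth=5.0):
    """Least-squares fit of a constant to noisy observations, which is
    simply their mean; stands in for the adversary's fitting problem."""
    samples = [truth + random.gauss(0.0, noise_std)
               for _ in range(num_queries)]
    return statistics.mean(samples)

# spread of the estimate shrinks roughly as 1/sqrt(num_queries)
spread_200 = statistics.pstdev(fitted_estimate(200) for _ in range(500))
spread_2000 = statistics.pstdev(fitted_estimate(2000) for _ in range(500))
```

With ten times as many queries, the spread drops by about √10, consistent with the trend in Figures 7 and 8.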
6.4.4. The Data of the Target User
As shown in Figures 9 and 10, the influence of the target user's relative position on the adversary's uncertainty shows no consistent pattern and depends on the specific case. Moreover, multiplicative and additive noises in the same case affect the results differently.
6.4.5. The Size of the Query Range
As shown in Figures 11 and 12, for both the one-dimensional and two-dimensional cases, the eed of the adversary's estimation increases under multiplicative noise as the query range gets larger, whereas under additive noise the eed does not change much across query ranges. The eed is closely related to the absolute noise magnitude: additive noise is itself an absolute noise and remains the same across query ranges, while multiplicative noise is a relative noise, so for a fixed multiplicative noise level, the absolute noise magnitude grows with larger query ranges and the eed rises accordingly.
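The distinction can be made concrete with a short simulation, assuming (as an illustration) that multiplicative noise scales the query result while additive noise perturbs it by a fixed absolute amount:

```python
import random
import statistics

random.seed(1)

def perturbation_std(true_result, sigma_mult=0.0, sigma_add=0.0,
                     trials=3000):
    """Empirical std of the noise actually added to a query result.
    Multiplicative noise scales with the result (which grows with the
    query range); additive noise is absolute and does not."""
    noisy = [true_result * (1.0 + random.gauss(0.0, sigma_mult))
             + random.gauss(0.0, sigma_add)
             for _ in range(trials)]
    return statistics.pstdev(noisy)

# same multiplicative level, 10x larger result -> ~10x absolute noise
small_range = perturbation_std(100.0, sigma_mult=0.01)   # ~1
large_range = perturbation_std(1000.0, sigma_mult=0.01)  # ~10
```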
6.5. Severity of the Attack
We demonstrate the severity of the attack based on the theoretical estimation and the results on real-world datasets. First of all, the severity of the attack depends on the sensitivity of the data in different cases, i.e., to what extent the information leakage threatens a user's privacy. For example, in the two-dimensional case, when the adversary's eed is large, it is difficult for the adversary to link the estimated location with a specific sensitive place (e.g., a hospital or a church). In the one-dimensional case, however, even a moderately accurate estimate may leak a health problem such as high blood pressure of the target user. Besides the sensitivity of different cases, the severity of the attack increases as the adversary's estimation becomes more accurate, which is mainly influenced by the following three factors.
6.5.1. The Noise Magnitude
When the server reduces the noise magnitude, the adversary's eed decreases and thus the severity rises. In Figure 3, the eed decreases from 5.6 mmHg to 0.45 mmHg (resp., from 1.25 mmHg to 0.13 mmHg) when the multiplicative noise changes from 0.1 to 0.01 (resp., the additive noise changes from 1 mmHg to 0.1 mmHg). In Figure 4, the eed decreases from 631 m to 71.8 m (resp., from 277 m to 28 m) when the multiplicative noise changes from 0.01 to 0.001 (resp., the additive noise changes from 100 m to 10 m). Moreover, the proposed attack can resist the noise to some extent: as shown in Figure 3, the adversary still achieves an eed of about 5.6 mmHg even under the largest multiplicative noise.
6.5.2. The Number of Users in Each Query
The adversary obtains a lower eed when the number of users in each query decreases. As shown in Figures 5 and 6, when the number of users decreases from 200 to 20 (in the multiplicative noise case), the eed decreases from 1.8 mmHg to 0.19 mmHg in the one-dimensional case and from 251 m to 29.5 m in the two-dimensional case. It is the server's responsibility to take measures to limit the minimum number of users in each query.
6.5.3. The Number of Queries
The more queries the adversary requests, the lower the eed. As shown in Figures 7 and 8, the eed decreases from 1.4 mmHg to 0.44 mmHg in the one-dimensional case and from 211 m to 71.7 m in the two-dimensional case when the number of queries increases from 200 to 2,000 (in the multiplicative noise case). The server has to limit the number of queries on a specified user set requested by each client to mitigate the risk of privacy breaches.
6.6. Correctness of the Theoretical Estimation
The error rate of our theoretical estimations increases as the noise magnitude grows, as shown in Figures 3 and 4. This comes from the small-noise assumptions in our theoretical analysis. In Figure 3, the error rate of the eed under multiplicative noise is less than 0.01 when the noise level is 0.01 for the one-dimensional case and increases to 0.06 when the level is 0.1. In the two-dimensional case, the error rate of the eed increases from 0.01 to 0.12 when the multiplicative noise changes from 0.001 to 0.01, as shown in Figure 4.
The magnitude of the noise determines the utility of the results. The server is unlikely to add very large noise to the results due to the utility concern, and thus the error rates of our theoretical estimations can be controlled. The correctness of the theoretical analysis ensures the effectiveness of the privacy-preserving mechanism in Algorithm 1.
It is worth noting that the error rate increases when the number of users in each query is small, as shown in Figure 5. This comes from the error in the approximation in equation (17), which relies on an assumption about the distribution of the target users. The accuracy of the approximation decreases when the user set is small.
As shown in Figure 11, the error rate is high when the query range is small. The reason is that when we narrow down the query range, the distribution of the queried data restricted to the query range deviates from our assumption in Section 6.1 (we assume the one-dimensional data obey a Gaussian distribution), which affects the theoretical results.
6.7. Utility Loss under the Privacy-Preserving Mechanism
We show the utility loss when the privacy-preserving mechanism adds noise to the query results to meet different privacy requirements. The precision of the query results determines the utility, and thus we use the standard deviation of the noise to measure the utility; i.e., the larger the standard deviation of the noise, the less utility the query results have. The privacy for the two cases is quantified as defined in equations (29) and (32). We give examples showing the minimum amount of noise needed to meet the desired privacy requirement in Figure 13. For the one-dimensional case, we fix the number of users in each query and the number of queries and vary the privacy requirement from 0 mmHg to 15 mmHg. For the two-dimensional case, we likewise fix these parameters and vary the privacy requirement from 0 m to 1,000 m. We show the corresponding minimum amount of noise when the corresponding parameter in equations (29) and (32) is set to 0.3, 0.6, and 0.9, respectively.
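Assuming the adversary's eed grows monotonically with the noise standard deviation, the minimum noise for a given requirement can be found numerically. The sketch below is generic: `eed_of_sigma` stands for whichever closed-form or simulated eed model applies, and the monotonicity assumption belongs to this illustration rather than to the paper's exact procedure.

```python
def min_noise_for_requirement(eed_of_sigma, required_eed,
                              lo=0.0, hi=1000.0, tol=1e-6):
    """Bisect for the smallest noise std whose induced adversary eed
    meets the privacy requirement; assumes eed_of_sigma is increasing
    and that hi is large enough to satisfy the requirement."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if eed_of_sigma(mid) >= required_eed:
            hi = mid  # requirement met; try smaller noise
        else:
            lo = mid  # not enough noise yet
    return hi
```

For a linear toy model eed(σ) = 2σ, a requirement of 3 mmHg yields σ ≈ 1.5.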
The standard deviation of the minimum noise is in direct proportion to the desired privacy requirement; i.e., a higher privacy requirement causes lower utility of the query results. In the one-dimensional case, the standard deviation of the minimum multiplicative noise (resp., additive noise) increases from 0.05 to 0.15 (resp., from 1.81 mmHg to 5.44 mmHg) when the privacy requirement increases from 5 mmHg to 15 mmHg. In the two-dimensional case, the standard deviation of the minimum multiplicative noise (resp., additive noise) increases from 0.008 to 0.014 (resp., from 173 m to 347 m) when the privacy requirement increases from 500 m to 1,000 m.
To decrease the magnitude of the noise and thus preserve the utility of the results, one possible way is to restrict the maximum number of queries that each client can request and the minimum number of users in each query.
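Such restrictions could be enforced with a simple server-side guard. The sketch below is illustrative (it is not the paper's Algorithm 1): it tracks, per client and per user set, how many queries have been answered, and rejects queries on groups below a minimum size.

```python
from collections import defaultdict

class QueryGuard:
    """Illustrative server-side guard: enforce a minimum number of
    users per query and a maximum number of queries per client on
    each distinct user set."""

    def __init__(self, min_users=20, max_queries=2000):
        self.min_users = min_users
        self.max_queries = max_queries
        self.counts = defaultdict(int)  # (client, user set) -> queries

    def allow(self, client_id, user_set):
        """Return True if the query may be answered, and record it."""
        if len(user_set) < self.min_users:
            return False  # group too small: averages leak too much
        key = (client_id, frozenset(user_set))
        if self.counts[key] >= self.max_queries:
            return False  # per-set query budget exhausted
        self.counts[key] += 1
        return True
```

A real deployment would also need to account for colluding clients and near-identical user sets, which this per-client, exact-set bookkeeping does not capture.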
7. Conclusion and Future Work
The AVGD query serves as a basic component of real-world applications such as business decision-making and medical diagnosis. The work in this paper is instructive for service providers that wish to enable AVGD queries while preserving the privacy of users' data. Specifically, we first propose an attack on the AVGD query: an attacker can recover a target user's data by querying with carefully selected user sets and query points. To understand the severity of such an attack and to design a privacy-preserving mechanism, we formalize the process of the attack as a fitting problem with noisy data and give a theoretical analysis of the uncertainty of the adversary. The results show that the severity of the attack is mainly related to factors such as the noise magnitude, the number of queries, and the number of users in each query. Other factors, such as the data of the target user and the size of the query range, show low correlations with the severity of the attack. Based on the theoretical analysis, we design an algorithm for the privacy-preserving AVGD query. Experiments on two real-life cases show the severity of the attack and the effectiveness of our theoretical analysis. We also evaluate the service quality when the privacy-preserving mechanism is enabled.
The attack proposed in this paper is somewhat idealized, and future work can focus on more practical situations. First, recall that the proposed attack is based on queries on two groups of users that differ in only one user. In future work, the attack could be extended to select more than two user sets and to be more flexible in choosing the sets. Besides, recall that an adversary obtains a more precise estimate of the private data by using more query results. How to limit the number of queries in practical applications should therefore be studied, as several attackers can collude to query the same user sets. However, this is a nontrivial task, as discussed in previous works.
Data Availability
All data included in this study are available upon request from the corresponding author.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The research was supported by the National Key R&D Program of China (2018YFB0803400 and 2018YFB2100300), the National Natural Science Foundation of China under Grant nos. 61972369, 61572453, 61520106007, and 61572454, and the Fundamental Research Funds for the Central Universities, no. WK2150110009.
References
HIPAA, “Largest healthcare data breaches of 2018,” 2018, https://www.hipaajournal.com/largest-healthcare-data-breaches-of-2018/.
Becker's Healthcare, “Healthcare breaches cost $6.2 billion annually,” 2016, https://www.beckershospitalreview.com/healthcare-information-technology/healthcare-breaches-cost-6-2b-annually.html.
Z. Cheng, J. Caverlee, K. Lee, and D. Z. Sui, “Exploring millions of footprints in location sharing services,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, Barcelona, Spain, July 2011.
Z. Huo, X. Meng, and R. Zhang, “Feel free to check-in: privacy alert against hidden location inference attacks in GeoSNs,” in International Conference on Database Systems for Advanced Applications, pp. 377–391, Springer, Berlin, Germany, 2013.
L. Willenborg and T. De Waal, Elements of Statistical Disclosure Control, vol. 155, Springer Science & Business Media, Berlin, Germany, 2015.
S. Scellato, A. Noulas, R. Lambiotte, and C. Mascolo, “Socio-spatial properties of online location-based social networks,” in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, Barcelona, Spain, July 2011.
J. H. Cheon, K. Han, A. Kim, M. Kim, and Y. Song, “Bootstrapping for approximate homomorphic encryption,” in Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 360–384, Springer, Darmstadt, Germany, May 2018.
C. Dwork, “Differential privacy,” in Encyclopedia of Cryptography and Security, pp. 338–340, Springer, Berlin, Germany, 2011.
I. Dinur and K. Nissim, “Revealing information while preserving privacy,” in Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 202–210, ACM, San Diego, CA, USA, June 2003.
C. Dwork and S. Yekhanin, “New efficient attacks on statistical disclosure control mechanisms,” in Proceedings of the Annual International Cryptology Conference, pp. 469–480, Springer, Santa Barbara, CA, USA, August 2008.
C. Dwork and K. Nissim, “Privacy-preserving datamining on vertically partitioned databases,” in Proceedings of the Annual International Cryptology Conference, pp. 528–544, Springer, Santa Barbara, CA, USA, August 2004.
D.-N. Yang, C.-Y. Shen, W.-C. Lee, and M.-S. Chen, “On socio-spatial group query for location-based social networks,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 949–957, ACM, Beijing, China, August 2012.
D. Zhang, Y. Du, T. Xia, and Y. Tao, “Progressive computation of the min-dist optimal-location query,” in Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 643–654, VLDB Endowment, Seoul, Korea, September 2006.
S. P. Kasiviswanathan, M. Rudelson, A. Smith, and J. Ullman, “The price of privately releasing contingency tables and the spectra of random matrices with correlated rows,” in Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pp. 775–784, ACM, Cambridge, MA, USA, June 2010.
A. Blum, C. Dwork, F. McSherry, and K. Nissim, “Practical privacy: the SuLQ framework,” in Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 128–138, ACM, Baltimore, MD, USA, June 2005.
R. Shokri, G. Theodorakopoulos, J.-Y. Le Boudec, and J.-P. Hubaux, “Quantifying location privacy,” in Proceedings of the 2011 IEEE Symposium on Security and Privacy, pp. 247–262, IEEE, Oakland, CA, USA, May 2011.