Abstract
With the prevailing use of smartphones and location-based services, vast amounts of trajectory data are collected and used for many applications. When trajectory data are sent to a third-party research institute for analytical applications, the privacy of users can be severely compromised. For example, the relationships between users can be revealed from the correlation between their trajectories. In this paper, we propose a method for releasing trajectory datasets without revealing the correlation between trajectories, called RDPT. In RDPT, we first quantify the trajectory correlation and convert the problem of protecting trajectory correlations into that of reducing the trajectory similarities between users while preserving the utility of the perturbed trajectories. Based on this insight, we model a multi-objective optimization problem and solve it with a particle swarm optimization algorithm modified to satisfy differential privacy. We then generate synthetic trajectories in which the correlations between trajectories are reduced. We conduct extensive experiments on three real trajectory datasets. The experimental results show that RDPT achieves almost equivalent data utility to, and better privacy than, the existing methods.
1. Introduction
With the extensive use of smart mobile devices furnished with location-based applications, such as WeChat and Twitter, vast amounts of trajectory data are being collected by location-based service (LBS) providers. Trajectory data are widely used in many analytical applications, such as intelligent transportation, smart cities, infectious disease surveillance and epidemiological investigations. When LBS providers want to know whether traffic congestion or certain events will occur in a certain area in the future, they often send the trajectories to a third-party research institute for offline analysis. However, the third-party research institute is not always trusted and is a potential adversary. Moreover, since trajectories are formed by users' temporal and spatial behavior as well as their social relationships [1], trajectories reflect the features of users' behaviors. For example, an adversary could infer the social relationships of a user by analyzing the correlation between users' trajectories. If two trajectories are correlated and have many check-in points that are close or even coincident, indicating that the users often visit nearby or identical locations, the attacker can infer with high probability that the two users are friends, because they often engage in activities together. From the social relationships, the attacker can even infer the health status, daily habits and occupations of the users with high probability [2]. Therefore, the privacy of users is seriously threatened.
There are many privacy-preserving methods for trajectories, and these approaches can be summarized into three categories: methods that do not consider correlation protection [3–5], mechanisms considering correlations within a single trajectory [6–8] and approaches considering the correlations between trajectories [9–11]. The first two categories do not take the correlation between trajectories into consideration, so trajectories perturbed by these approaches still face privacy leakage risks. Several researchers [9–11] have proposed privacy-preserving mechanisms that consider trajectory correlation. However, the approaches in the literature [9, 10] are restricted to scenarios of publishing two trajectories. For example, these methods work well when two passengers in DiDi want to hide their trajectory correlation, but real social applications involve more than two users, so the number of trajectories published offline is usually greater than two. Moreover, even if we can reduce the correlation between any two trajectories in a dataset, there is still a risk that the correlations amongst three or more real trajectories are not reduced, which would still reveal the privacy of those users. Under scenarios in which a large number of trajectories need to be released, the research in the literature [9, 10] is no longer effective. In addition, although Zhao et al. [11] proposed a location privacy-preserving mechanism considering trajectory correlation, their approach is derived from k-anonymity and cannot resist location inference attacks [12–14]. Therefore, even if the trajectories are perturbed by existing methods, the problem of privacy leakage caused by trajectory correlation persists.
To solve the problem of privacy leakage caused by trajectory correlation when LBS providers release a large number of trajectories offline, we propose a method for releasing differentially private trajectories, called RDPT. First, the continuous geographical space of the real trajectories is discretized via an adaptive grid partition method: the space is divided into top-level cells, and the original location trajectories are converted into cell trajectories. To preserve privacy when each top-level cell is divided into bottom-level cells, we add Laplacian noise [15] to the frequencies of locations in the top-level cell. Second, since the spatial distribution and visit frequencies of the top-level cells are important features for calculating the correlation of trajectories, we extract a cell visit probability vector from each user's cell trajectory. We then quantify the trajectory correlation and convert the problem of protecting it into two subproblems: reducing the similarity of each pair of obfuscated cell visit probability vectors, and retaining the similarity between each real cell visit probability vector and its perturbed counterpart to preserve the utility of the perturbed trajectories. Based on this insight, we model a multi-objective optimization problem to balance security and data utility. By solving the problem with a particle swarm optimization algorithm [16] modified to satisfy differential privacy (PSODP), we obtain a perturbed cell visit probability vector for each user's cell trajectory and reduce the correlations between the obfuscated trajectory of a user and the perturbed trajectories of other users. Finally, based on the perturbed cell visit probability vector, we generate a synthetic trajectory. Thus, the correlation between users' trajectories is reduced and the social relationships between users are protected.
To solve a multi-objective optimization problem, there are generally several kinds of methods, such as derivative-based methods and evolutionary algorithms (EAs). Derivative-based methods are useful for solving low-dimensional problems and are not suitable for our multi-objective optimization problem, which is complex and high-dimensional. Conversely, EAs are widely used optimization methods that can effectively deal with complex problems. The genetic algorithm (GA) [17], particle swarm optimization (PSO) algorithms [16, 18, 19] and differential evolution (DE) algorithms [20, 21] are widely used EAs. However, these algorithms cannot be used directly to solve our problem: whichever algorithm in the literature [16, 18–21] is selected, we need to add noise to it to satisfy differential privacy, because reading an original trajectory would lead to a potential privacy leakage. Furthermore, compared with GA and DE, PSO has a simpler principle and fewer parameters, and we need not perform crossover and mutation on the trajectories as in the genetic algorithm with differential privacy [17], ensuring its efficiency in different situations. Moreover, since users usually interact with other users in social networking, the trajectory of a user is shaped by the user's social ties in LBS applications, which resembles the idea of PSO. Therefore, we choose a classical PSO to solve our problem.
The difference between our PSODP and other multi-objective optimization algorithms is that we modify a classical PSO algorithm with the exponential mechanism of differential privacy in the selection procedure. When we select a local optimal particle, we read a real trajectory, so we need to add noise to preserve privacy. Hence, we select a particle as the local optimal particle according to probabilities over the local particles that are calculated from their utility scores. Usually, the local optimal particle is selected with higher probability, but because of randomness it may or may not be selected, and thus privacy is preserved. To the best of our knowledge, this is the first work on PSO with differential privacy in trajectory privacy preservation.
The main contributions of this paper are summarized as follows:
(1) We propose a method of releasing differentially private trajectory datasets, i.e., RDPT. We first model the problem of reducing the correlations between users' trajectories while retaining data utility as a multi-objective optimization problem. By solving the problem, we can synthesize trajectories for users.
(2) We adapt a classical PSO algorithm to solve the modeled multi-objective optimization problem. By modifying the selection procedure of local optimal particles in each iteration to satisfy differential privacy, we obtain the PSODP algorithm and prove the security of PSODP.
(3) We evaluate our method on three real datasets, with five different metrics and two specific implementations of quantifying the trajectory correlation. The experimental results demonstrate that our method achieves almost equivalent data utility to, and better privacy than, the existing methods.
The remainder of this paper is organized as follows. In Section 2 we review the related work in the literature. We present preliminaries in Section 3 and formally define the problem in Section 4. In Section 5 we give a detailed description of RDPT. We show the experimental results and detailed analysis in Section 6. Finally, we conclude this paper in Section 7.
2. Related Work
In this section we briefly review the three existing categories of privacy protection methods for trajectories. We then discuss evolutionary algorithms satisfying differential privacy.
Trajectory privacy protection methods without considering the correlation. These approaches can be divided into suppression methods [4], bounded perturbation [22], k-anonymity and its derived approaches [3, 23, 24], and differentially private methods [25–30]. Hasan et al. proposed a privacy architecture with a bounded perturbation technique to protect a user's trajectory from privacy breaches [22]. Huo et al. proposed a k-anonymity method called "YCWA" [23]. They first extracted the sensitive locations in the trajectory according to the time interval and location density. They then partitioned the geographic space of trajectories into discrete k-anonymity regions, and the sensitive locations in the trajectories were replaced with the corresponding k-anonymity regions. The YCWA method reduces the complexity and information loss of k-anonymity over complete trajectories and preserves the privacy of individuals. Zhang et al. proposed a dual-K mechanism (DKM) to protect users' trajectory privacy [24]. DKM first inserts multiple anonymizers between the user and the location service provider (LSP), and K query locations are sent to different anonymizers to achieve K-anonymity. They also combined dynamic pseudonym and location selection mechanisms to improve user trajectory privacy. Although k-anonymity has been widely applied, datasets published by these methods still face combination attacks and background knowledge attacks. Differential privacy has become a widely adopted privacy protection mechanism because of its strong privacy guarantee. In the literature [28], Ding et al. proposed a stream processing framework with differential privacy that contains two modules for trajectories.
One module can concurrently receive real-time queries from individuals and release newly sanitized trajectories, and the other module comprises three differentially private algorithms to facilitate publication of the distribution of location statistics. Wang et al. developed a privacy-preserving reference system that can extract privacy-demanding feature-based anchors, which are subsequently used to calibrate sequences from raw trajectories [29]. They provided a private trajectory data sanitization approach that scales to large spatial domains reflecting realistic trajectories. In the literature [30], the authors presented an algorithm for protecting sensitive place visits in privacy-preserving trajectory publishing. By generalizing sensitive places using sensitive zones and distorting the sub-trajectories within the sensitive zones based on differential privacy, their method not only prevents leakage of sensitive place visits but also preserves individual movement features. These privacy-preserving approaches for trajectories focus on the sensitive information of locations and do not consider the privacy leakage caused by the correlations between trajectories.
Trajectory privacy protection methods considering correlations within a single trajectory. The spatiotemporal correlation contained in trajectory data easily leads to privacy leakage [31]. Many researchers have proposed trajectory privacy-preserving methods considering the temporal and spatial correlation within a trajectory. He et al. proposed a differentially private approach based on the spatiotemporal correlation within a trajectory, called DPT [6]. They first constructed a hierarchical index system to capture users' mobility features. They then constructed a prefix tree to represent the spatial transfer features between adjacent locations of trajectories and added Laplacian noise to the visit frequencies of each node in the prefix tree. Finally, the perturbed trajectory is synthesized according to the obfuscated prefix tree to protect the spatial correlation contained within the trajectory. In the literature [7], the authors presented a differentially private method called TGM for publishing trajectories. In TGM, they partitioned the geographical space and constructed a prefix sequence graph to model the spatial transfer features between grids in trajectories; the trajectories were then iteratively synthesized using an exponential mechanism. Gursoy et al. [8] proposed a differentially private trajectory synthesis method. They extracted the Markov transition matrix, the trajectory length probability distribution and the journey probability distribution, and obfuscated the three features with the Laplacian mechanism. During trajectory synthesis, the trajectories were processed to resist Bayesian attacks, area (sub-trajectory) sniffing attacks and abnormal location leakage attacks. These methods only consider spatiotemporal correlations within a single trajectory and do not address the correlation between different trajectories, which can still cause serious privacy leakage.
Trajectory privacy protection methods with correlation between different trajectories. Ou et al. [9] proposed a trajectory publication mechanism based on a hidden Markov model (HMM) to protect the correlation between trajectories. Similarly, in [10] the authors proposed an n-body Laplace framework and, under this framework, presented two privacy protection methods for two types of data utilities. However, these methods are restricted to the scenario of releasing two trajectories and cannot provide efficient privacy protection when publishing a large number of trajectories offline. To mitigate the social relationship attacks caused by trajectory correlation, Zhao et al. designed an effective model to simultaneously deal with social relationship attacks and re-identification attacks while maintaining high data utility [11]. In their model, they proposed a sliding-window algorithm that is a variant of k-anonymity. It generates anonymized trajectories according to a social-aware distance, which concerns both the spatiotemporal distance and the social proximity, and it performs the anonymization over sub-trajectories within a fixed-length window instead of over entire trajectories. However, this approach only satisfies k-anonymity and cannot resist homogeneous attacks or location inference attacks [12–14].
We summarize the works in these three categories of privacy-preserving approaches in Table 1.
Evolutionary algorithms satisfying differential privacy. Evolutionary algorithms are widely used to solve multi-objective optimization problems, but there are few works applying them to privacy protection. In [17], Zhang et al. proposed PrivGene, a differentially private model fitting framework based on genetic algorithms. In PrivGene, the authors use an exponential mechanism to select parent individuals for crossover and mutation, thus enhancing the security of the selection process.
3. Preliminaries
In this section, we introduce the basic concepts, including differential privacy, global sensitivity, and particle swarm optimization.
Definition 1 (Trajectory). A trajectory T is a time-series sequence of (location, time) tuples, i.e., T = ⟨(l_1, t_1), (l_2, t_2), …, (l_n, t_n)⟩, where l_i is a location consisting of latitude and longitude, t_i is the moment when location l_i is generated, and n is the number of locations in trajectory T.
Definition 2 (Neighbor datasets). Suppose D and D′ are two datasets. D and D′ are neighbor datasets if and only if D = D′ ∪ {r} or D′ = D ∪ {r}, where r is a record in a dataset.
Definition 3 (Differential Privacy). Let M be a privacy protection mechanism, O be any output of M, and D and D′ be neighbor datasets. M satisfies ε-differential privacy if: Pr[M(D) = O] ≤ e^ε · Pr[M(D′) = O].
Definition 4 (Global sensitivity). Global sensitivity indicates the maximum difference between query results over neighboring datasets. Suppose f is a query function; the global sensitivity is defined as: Δf = max_{D, D′} ‖f(D) − f(D′)‖₁, where D and D′ are neighbor datasets. There are two widely used mechanisms for achieving differential privacy, i.e., the Laplace mechanism [15] and the exponential mechanism [32]. The Laplace mechanism is suitable for perturbing numerical query results, and the exponential mechanism is suitable for perturbing non-numerical query results. We use both mechanisms in RDPT.
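As an illustration of the Laplace mechanism described above, the following sketch perturbs a numerical query answer (e.g., a count with sensitivity 1) with noise of scale Δf/ε. The function names are ours for illustration, not identifiers from RDPT.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Perturb a numerical query answer to satisfy epsilon-differential privacy."""
    return true_answer + laplace_noise(sensitivity / epsilon)

# Example: a counting query (sensitivity 1) answered with budget epsilon = 0.5.
noisy_count = laplace_mechanism(42.0, sensitivity=1.0, epsilon=0.5)
```

The noise scale grows as the privacy budget ε shrinks, trading accuracy for privacy.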
There are two common properties in differential privacy. The first property is sequential composition, indicating that a sequence of computations where each provides differential privacy in isolation also provides differential privacy in sequence, but the privacy budget is accumulated. The second property is parallel composition, meaning that if the sequence of computations is performed on disjoint databases, then the privacy budget is not accumulated additively, but rather determined by the worst privacy guarantees of all computations. The following definitions formally describe these two properties.
Definition 5 (Sequential Composition). Suppose an algorithm A runs n randomized algorithms A_1, …, A_n, and each A_i satisfies ε_i-differential privacy. When publishing the output of running them over a dataset D in sequence, A(A_1(D), …, A_n(D)) satisfies (Σ_{i=1}^{n} ε_i)-differential privacy.
Definition 6 (Parallel Composition). Suppose an algorithm A runs n randomized algorithms A_1, …, A_n, and each A_i satisfies ε_i-differential privacy. When publishing the output of running each A_i over a disjoint dataset D_i, A(A_1(D_1), …, A_n(D_n)) satisfies (max_i ε_i)-differential privacy.
Particle Swarm Optimization (PSO). PSO is an evolutionary algorithm [16]. In PSO, each particle searches for the optimal solution as a local extremum relative to the objective, and the best individual extremum in the swarm is taken as the current global optimal solution. Each particle then adjusts its speed and position based on its own extremum and the global optimal solution. This process is iterated until PSO converges, and the current global optimal solution is the final solution of the given optimization problem.
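The iteration described above can be sketched as follows. This is a generic minimization variant with a linearly decreasing inertia weight (the PSOIW form used later in this paper); all parameter values are common defaults of our choosing, not values taken from the paper.

```python
import random

def pso_minimize(f, dim, n_particles=30, iters=200,
                 w_max=0.9, w_min=0.4, c1=2.0, c2=2.0, bounds=(-5.0, 5.0)):
    """Minimize f over a box via PSO with linearly decreasing inertia weight."""
    lo, hi = bounds
    xs = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]              # each particle's own best position
    pbest_f = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]  # global best position
    for t in range(iters):
        w = w_max - (w_max - w_min) * t / iters  # inertia decreases linearly
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] = min(hi, max(lo, xs[i][d] + vs[i][d]))
            fx = f(xs[i])
            if fx < pbest_f[i]:             # update personal best
                pbest[i], pbest_f[i] = xs[i][:], fx
                if fx < gbest_f:            # update global best
                    gbest, gbest_f = xs[i][:], fx
    return gbest, gbest_f
```

For example, minimizing the sphere function `lambda x: sum(v * v for v in x)` drives the swarm toward the origin.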
4. The problem
Attack hypothesis: Suppose a third-party research institute obtains a trajectory dataset for analytical applications. Since the third-party institute is not always trusted and is a potential adversary, if the data owners had not perturbed the trajectories before releasing the dataset offline, the adversary could extract the correlations from the trajectories, revealing the social ties among users or other private information. Therefore, in this paper, we suppose the adversary can obtain the following information.
(1) A published trajectory dataset, where each record is a perturbed trajectory.
(2) The user set corresponding to the published dataset, where every user has exactly one trajectory.
(3) The quantification of the trajectory correlation and the trajectory privacy-preserving method in this paper.
Our goal is as follows: given a real trajectory dataset, we perturb the dataset to reduce, to the greatest extent possible, the correlation between each trajectory and the other trajectories, while ensuring high data utility. Even if the adversary obtains the perturbed dataset, the quantification method of the trajectory correlation and the privacy-preserving approach, the adversary still cannot infer the social relationships between users. In short, we protect as much privacy of individuals as possible while maintaining high data utility.
5. Releasing correlated trajectory datasets
5.1. The overview
Before describing RDPT in detail, we provide a high-level description of our approach. Suppose that the overall privacy budget consumed by RDPT is ε. RDPT consists of the following three steps:
Step 1. We divide the geographical space of the original dataset into identical cells and obtain a grid via an adaptive grid partition method. Each trajectory is then converted from location mode into cell mode, i.e., a cell trajectory, and a dataset of cell trajectories is obtained. The cell visit probability vectors are then extracted from the cell trajectories, and we quantify the trajectory correlation using these vectors. For each top-level cell of the grid, to preserve privacy when dividing it into bottom-level cells, we add Laplacian noise to the density of locations in the cell over all trajectories. The density of locations in a top-level cell for a trajectory is calculated by normalizing the location visit frequencies after duplicated visits to locations are removed. The privacy budget consumed in this step is ε₁, and the bottom-level cells are used in Step 3 for synthesizing the final trajectories.
Step 2. According to the quantification method of the trajectory correlation and the cell visit probability vectors, we model a multi-objective optimization problem that aims to reduce the correlation between the current cell trajectory and the other perturbed cell trajectories while preserving the utility of the dataset as much as possible. We then solve the problem via a particle swarm optimization algorithm modified with an exponential mechanism and obtain a perturbed cell visit probability vector for the trajectory. The privacy budget consumed in this step is ε₂.
Step 3. According to the perturbed cell visit probability vector and the bottom-level cells (into which Laplacian noise was injected when the top-level cells were divided), we generate a synthetic trajectory in location mode for the original trajectory.
After all the trajectories in the dataset are processed one by one according to step 2 and step 3, we can obtain a perturbed dataset .
5.2. Adaptive Grid Partition and Quantification of Trajectory Correlation
5.2.1. Adaptive Grid Partition
It is difficult to generate location visit probability vectors of the same dimension directly from the original location trajectory dataset. Therefore, we divide the geographic space of the dataset into identical cells and obtain a grid space; we call each cell in it a top-level cell. Each user's trajectory is then converted into a cell trajectory, the geographic space is thereby discretized, and the original dataset is converted into a set of trajectories in cell mode. We have the following definition.
Definition 7 (Cell trajectory). For a partition over the geographic space, we have a grid space G = {c_1, c_2, …, c_M}. For an original trajectory in location mode T = ⟨l_1, l_2, …, l_n⟩, if l_1 is in cell c_{j_1}, l_2 is in cell c_{j_2}, …, and l_n is in cell c_{j_n}, then we have a trajectory in cell mode T^c = ⟨c_{j_1}, c_{j_2}, …, c_{j_n}⟩. We call T^c a cell trajectory.
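Definition 7 can be illustrated with a minimal sketch that maps a location-mode trajectory onto a uniform m × m grid; the bounding box and the (row, col) cell indexing are our assumptions for the example.

```python
def to_cell_trajectory(traj, lat_range, lon_range, m):
    """Map a location-mode trajectory [(lat, lon), ...] onto an m x m grid,
    returning the cell trajectory as a list of cell indices (row, col)."""
    lat0, lat1 = lat_range
    lon0, lon1 = lon_range
    cells = []
    for lat, lon in traj:
        # clamp to the grid so points on the upper boundary fall in the last cell
        row = min(int((lat - lat0) / (lat1 - lat0) * m), m - 1)
        col = min(int((lon - lon0) / (lon1 - lon0) * m), m - 1)
        cells.append((row, col))
    return cells
```

Each location keeps its position in the sequence, so the cell trajectory has the same length as the original trajectory.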
We improve the adaptive grid partition method in the literature [8]. Intuitively, a large number of locations in a top-level cell would lead to many bottom-level cells. However, in extreme cases, if all of the locations occur in one place (or several nearby places), then all locations would fall into exactly one bottom-level cell, leaving no locations in the other bottom-level cells; this results in many useless bottom-level cells and hurts the efficiency of trajectory synthesis in the last step. Therefore, both the diversity and the number of locations need to be considered when constructing the query function for bottom-level cell partitioning. Hence, after dividing the space into top-level cells, for each top-level cell we count the sum of the normalized numbers of locations in that cell over all cell trajectories: each cell trajectory contributes the number of distinct locations it has inside the top-level cell divided by its total number of locations, and the sum of these contributions over all cell trajectories is the density of locations in the top-level cell.
This density is used for the adaptive partition of each top-level cell. To enhance the security of adaptive grid partitioning, we add Laplacian noise to the density to obtain a stochastic partition into bottom-level cells. The key parameter for Laplacian noise is the global sensitivity, for which we have the following Theorem 1.
Theorem 1. The global sensitivity of the density query is 1.
Proof. Suppose that the neighboring datasets for the density query are D and D′ = D ∪ {T}, where T is a cell trajectory. The contribution of a single cell trajectory to the density of a top-level cell is its number of distinct locations inside the cell divided by its total number of locations, which lies in [0, 1]. Hence the query results on D and D′ differ by at most 1. According to Definition 4, the global sensitivity of the density query is 1.
Therefore, it suffices to add Laplacian noise calibrated to the privacy budget ε₁ to each density query answer to obtain the noisy answers. Each top-level cell is then further divided into bottom-level cells, where the number of bottom-level cells is proportional to the noisy density. The number of top-level cells is chosen as in the literature [8], which determines it from the number of trajectories in the dataset.
5.2.2. Quantifying Trajectory Correlation
Quantifying trajectory correlation is a fundamental problem. There are three categories of methods: extracting features from the original check-in data [10, 33, 34], machine learning [35, 36], and using statistical information of trajectories [31, 37]. The first category has limitations: the lengths of different trajectories differ, so the trajectories must be preprocessed via interpolation to be aligned, which introduces unnecessary errors in quantifying the trajectory correlation. In addition, these methods have strict requirements on the length of trajectories; for instance, a trajectory cannot be too long. For the second category, the time cost is too high for applications in which the trajectory correlation needs to be computed many times. Conversely, the methods based on statistical information of trajectories consume less time and are more efficient, with cost proportional to the trajectory length or to the number of top-level cells. For our problem, the frequencies of visits to the locations in the top-level cells are important features of a trajectory, and the sequence of cells describes the spatial distribution and the transitions amongst check-ins within the trajectory. We therefore use the cell visit probability vector of a trajectory as the statistical information describing its features; in addition, the adaptive grid partition lets us align the trajectories over the top-level cells. Consequently, we compute the trajectory correlation using the third category of methods, and the statistical information we use in RDPT is the cell visit probability vector of a trajectory.
Definition 8 (The cell visit probability vector). After partitioning the geographic space of a trajectory dataset into M identical cells, for a cell trajectory T^c we define a cell visit probability vector V, which is an M-dimensional vector. The k-th component of V is calculated as follows: for a cell c_k, if T^c has n_k locations within c_k and the number of locations in T^c (i.e., the length of T^c) is n, then the k-th component of V is n_k / n.
Definition 9 (Trajectory Correlation). The trajectory correlation is a measure of the correlation between two trajectories. Suppose V_i and V_j are the cell visit probability vectors of two trajectories T_i and T_j, respectively, and f_sim is a function computing vector similarity. We define the trajectory correlation as Cor(T_i, T_j) = f_sim(V_i, V_j). There are many common implementations for calculating the similarity between vectors, such as cosine similarity and the Pearson correlation coefficient; our method can therefore be applied with different specific implementations of trajectory correlation.
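The two definitions above can be instantiated with a minimal sketch that uses cosine similarity as the plug-in similarity function f_sim; for simplicity, cells are identified here by flat integer indices rather than (row, col) pairs.

```python
import math
from collections import Counter

def cell_visit_vector(cell_traj, num_cells):
    """Definition 8: the k-th component is (#locations in cell k) / (trajectory length)."""
    counts = Counter(cell_traj)
    n = len(cell_traj)
    return [counts.get(k, 0) / n for k in range(num_cells)]

def trajectory_correlation(v_a, v_b):
    """Definition 9 with cosine similarity as f_sim."""
    dot = sum(a * b for a, b in zip(v_a, v_b))
    na = math.sqrt(sum(a * a for a in v_a))
    nb = math.sqrt(sum(b * b for b in v_b))
    return dot / (na * nb) if na and nb else 0.0
```

Two trajectories visiting the same cells with the same frequencies have correlation 1, and trajectories over disjoint cells have correlation 0.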
5.3. Modeling a multiobjective optimization problem
Beginning with this subsection, we describe the second step of RDPT. We first model a multi-objective optimization problem. We then perturb the real cell visit probability vector of a user's cell trajectory and reduce the trajectory correlations between that vector and the perturbed cell visit probability vectors of the other users.
When we perturb the trajectories in the real dataset to obtain a perturbed dataset, we need to ensure both high data utility and security. Therefore, for the cell trajectory being perturbed, we should reach two goals: (1) the solution vector should have a high similarity with the real cell visit probability vector of the trajectory, to maintain data utility; and (2) the solution vector should have a low similarity with the perturbed cell visit probability vectors of the cell trajectories of the other users, to preserve privacy. We model these two goals as the two objective functions in formula (6), where the variable is the vector to be solved, the first objective involves the real cell visit probability vector, and the second involves the set of perturbed cell visit probability vectors of the other users. By solving the extrema of the two functions in formula (6), we obtain a solution that is the perturbed cell visit probability vector of the trajectory.
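Since formula (6) is not reproduced here, the following sketch only illustrates the shape of the two competing goals; the scalarization with a weight `lam` and the use of the worst-case (maximum) similarity to other users are our assumptions, not the paper's exact formulation.

```python
def utility_objective(candidate, real_vec, sim):
    # goal (1): stay similar to the user's real cell visit probability vector
    return sim(candidate, real_vec)

def privacy_objective(candidate, other_vecs, sim):
    # goal (2): stay dissimilar to the perturbed vectors of the other users
    return max(sim(candidate, v) for v in other_vecs) if other_vecs else 0.0

def scalarized_score(candidate, real_vec, other_vecs, sim, lam=1.0):
    """One plausible scalarization of the two goals: maximize utility
    minus lam times the worst-case correlation with other users."""
    return (utility_objective(candidate, real_vec, sim)
            - lam * privacy_objective(candidate, other_vecs, sim))
```

A candidate equal to the real vector scores high on utility, while one that mimics another user's perturbed vector is penalized by the privacy term.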
Since the solution is a cell visit probability vector, the sum of all its components equals 1; formula (6) therefore carries the constraint that the components of the solution vector sum to 1.
To preserve data utility, we restrict the lower and upper bounds of each component of the solution vector. If a cell visited by the real cell trajectory in some time slot is c, the actual scope of the user's activities lies within c and its eight adjacent cells. Therefore, if c is perturbed only within its 9 adjacent cells (including c itself) while solving the multi-objective optimization problem in formula (6), the data utility of the user's trajectory is not largely lost. For each such cell, we take the maximum count of locations among its 9 adjacent cells and divide it by the number of locations in the trajectory, which gives the corresponding component of the upper bound vector. After traversing each top-level cell in the trajectory, we obtain the full upper bound vector; the lower bound vector is the zero vector.
We choose the particle swarm optimization (PSO) algorithm to solve the problem in formula (6). Since a particle is the basic object in the iterative process of PSO, we need to embed the cell visit probability vector into a particle. We have the following definition of a particle.
Definition 10 (Particle). A particle is a four-tuple consisting of the cell visit probability vector to be solved, which serves as the position vector of the particle; the speed vector of the particle; the utility score function of the particle, which is determined by the objective functions in formula (6); and the degree of violation of the constraint.
The particle swarm optimization algorithm with a linearly decreasing inertia weight (PSOIW) in the literature [16] is a commonly used version. Each iteration of PSOIW has two important steps: select and update. In the select step, the new local optimal particles and the global optimal particle are chosen from all the particles in the swarm, the historically global optimal particles and the historically local optimal particles. In the update step, the velocity vector and position vector of each particle are modified. However, when calculating the objective function value of each particle in the select step, we need to read the real cell visit probability vector from the cell trajectory dataset, which would lead to a potential privacy leakage. Therefore, we modify the select step and present a particle swarm optimization algorithm with differential privacy to enhance the security of solving the multi-objective optimization problem in formula (6).
5.4. The particle swarm optimization algorithm with differential privacy
In this subsection, we describe the particle swarm optimization algorithm with differential privacy (PSODP) in detail.
5.4.1. The privacy budget for each round of iteration
In PSODP, the noise of differential privacy is added in the select step. Specifically, the original step select is changed into step em_select, which satisfies differential privacy by introducing an exponential mechanism into selecting new local optimal particles.
In PSODP, suppose the step em_select is executed a fixed number of times, and the privacy budget is divided into the same number of parts. When the number of iterations is small, the randomness of the particles is strong and the difference between each particle in the particle swarm and the final solution is large. At this stage, the particle swarm optimization algorithm can search the solution space thoroughly, so we do not need to allocate much privacy budget to determining the global and local optimal particles early. As the number of iterations increases, the particles in the particle swarm tend to become steady, and we need to allocate more privacy budget to reduce the randomness caused by differential privacy and avoid deteriorating the convergence of the particle swarm optimization algorithm. Therefore, we use the reciprocals of the triangular numbers, whose series elegantly provides this property and converges to 1. With this series, the total privacy budget consumed by the iterations is strictly less than the overall budget. We then divide the remaining privacy budget evenly among the iterations, and the privacy budget for each iteration is computed as follows:
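A small sketch of this allocation, under our reading of the scheme: the shares 1/(i(i+1)) over m rounds sum to m/(m+1), the leftover eps/(m+1) is spread evenly, and the larger shares are assigned to the later rounds so that the budget grows with the iteration count. The names eps and m are illustrative, not the paper's symbols:

```python
def budget_schedule(eps, m):
    """Split a total budget eps over m calls of em_select so that the
    per-round budgets sum exactly to eps and grow with the round index."""
    shares = [1.0 / (i * (i + 1)) for i in range(1, m + 1)]  # triangular reciprocals
    shares.reverse()            # assumption: later rounds get the larger shares
    leftover = eps * (1.0 - sum(shares))   # tail of the series beyond m terms
    return [eps * s + leftover / m for s in shares]
```

Because the series telescopes (1/(i(i+1)) = 1/i - 1/(i+1)), the m shares sum to m/(m+1), so the leftover is exactly eps/(m+1) and the full schedule sums to eps.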
5.4.2. The utility score function for the exponential mechanism
In the step em_select, the selection of local optimal particles is perturbed by an exponential mechanism. To select the local optimal particle with the highest probability using an exponential mechanism, each particle needs to be evaluated by the utility score function . The evaluation of a particle is determined by the objective functions. Therefore, according to formula (6), we perturb each particle with an exponential mechanism using the utility score function as follows.
In formula (10), is the utility score for cell visit probability vector , and and denote the values of the two objective functions when substituting into formula (6). The function denotes the rank for the objective function value of in all candidate particles. For example, denotes the ascending rank of the objective function value in that of all candidate cell visit probability vectors.
In the formula (10) for , the former part (before ) evaluates how well the perturbed cell visit probability vector preserves the real cell trajectory of user , i.e., the data utility, while the latter part (after ) evaluates how much the trajectory correlations between the perturbed cell visit probability vector and the perturbed cell visit probability vectors of other users are reduced, i.e., the data security. is a weight, and we define . Then we have: (1) is inversely proportional to the privacy budget for PSODP; when is small, focuses on security, and when is large, focuses on data utility; (2) the maximum value of will not exceed 1; and (3) the two square root operations over and prevent from being too small (i.e., too large) or too large (i.e., too small). Thus, balances the data utility and security to an appropriate extent.
According to the constraint condition in formula (7), when the constraint condition is satisfied, the trajectory correlation between the solution of user in equation (6) and the perturbed cell trajectories of other users can be reduced. Namely, the security for the output of PSODP would be increased.
To reduce the noise added to the particle swarm optimization algorithm, after the utility scores for the cell visit probability vectors of all candidate particles are calculated, the utility scores of all particles are normalized, so that the global sensitivity of the exponential mechanism is 1. We thus have the following theorem.
Theorem 2. After the utility scores in equation (10) for all candidate particles are normalized, the global sensitivity of queries based on is 1.
Proof. Suppose the set of cell visit probability vectors in candidate particles is . The cell trajectory datasets and are neighboring datasets, where is a cell trajectory. The sets of cell visit probability vectors for and are and , respectively. According to equation (10), we can calculate the utility scores for all cell visit probability vectors as . Utility scores are the weighted sum of . Then we have the denominator of the normalized utility scores:

According to the sets of cell visit probability vectors and , as well as formula (10) and , we have:

Therefore, according to Definition 4, the global sensitivity of the normalized utility score is 1.
According to Theorem 2, when we add noise to utility score with an exponential mechanism, the global sensitivity is 1.
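Putting Theorem 2 to work, the choice made inside em_select can be sketched as follows. The helper names are hypothetical; the normalization maps the scores into [0, 1] so the sensitivity is 1, and the selection probability then follows the standard exponential mechanism, proportional to exp(eps_i · u / 2):

```python
import math
import random

def em_probs(scores, eps_i):
    """Selection probabilities of the exponential mechanism over the
    candidate particles' utility scores, normalized so sensitivity = 1."""
    lo, hi = min(scores), max(scores)
    norm = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]
    weights = [math.exp(eps_i * u / 2.0) for u in norm]  # sensitivity is 1
    total = sum(weights)
    return [w / total for w in weights]

def em_select(scores, eps_i, rng=random):
    """Sample the index of the new local optimal particle."""
    probs = em_probs(scores, eps_i)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```

The particle with the highest utility score is the most likely to be chosen, but every candidate retains a nonzero probability, which is the source of the extra randomness discussed in Section 5.6.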
5.4.3. PSODP algorithm
We modify the particle swarm optimization algorithm to satisfy differential privacy and obtain the new algorithm PSODP, shown in Algorithm 1. The output of Algorithm 1 is the perturbed cell visit probability vector .

In Algorithm 1, the step em_select that adds noise for PSODP to enhance the security of Algorithm 1 is shown in Algorithm 2. After is perturbed in Algorithm 1, user is added into to update when we use Algorithm 1 to perturb the cell visit probability vector of the next user.

In Algorithm 2, an exponential mechanism is introduced when selecting the local optimal particles. In step 4, the formula of the exponential mechanism is used to calculate the selection probabilities for the particles. In step 5, we select a local optimal particle according to these probabilities, and thus the security is enhanced.
5.5. Trajectory Synthesis
After we obtain the perturbed cell visit probability , we synthesize the perturbed location trajectory for user .
We first obtain a toplevel cell set from the perturbed vector, in which the visit probability of each cell is greater than 0. Then, for each time slot, we randomly select a toplevel cell from this set, and within that cell we select the bottomlevel cell with the maximum density of checkins. We randomly generate a location in the bottomlevel cell as the perturbed location. The perturbed location trajectory is thus formed. This is the postprocessing step of RDPT; it does not consume privacy budget and preserves the privacy of individuals [39], because Laplacian noise has already been added when a toplevel cell is partitioned into bottomlevel cells.
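This post-processing can be sketched as below. The data layout is our illustration, not the paper's: each bottomlevel cell is represented as a (noisy_density, bounding_box) pair with box = (lat_min, lat_max, lon_min, lon_max):

```python
import random

def synthesize_trajectory(prob_vec, bottom_cells, n_slots, rng=random):
    """prob_vec: perturbed visit probability per top-level cell;
    bottom_cells[c]: list of (noisy_density, (lat_min, lat_max, lon_min, lon_max))."""
    candidates = [c for c, p in enumerate(prob_vec) if p > 0]  # cells with p > 0
    traj = []
    for _ in range(n_slots):
        c = rng.choice(candidates)          # random top-level cell per time slot
        _, box = max(bottom_cells[c])       # densest bottom-level cell wins
        lat = rng.uniform(box[0], box[1])   # uniform location inside that cell
        lon = rng.uniform(box[2], box[3])
        traj.append((lat, lon))
    return traj
```

Since the densities consulted here are the already-noised counts from the grid partitioning stage, this stage reads no raw data, which is why it consumes no additional privacy budget.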
In the following, we present Algorithm 3 to process all trajectories in a dataset. Algorithm 3 has two stages: the adaptive grid partition in lines 2 to 8, and calling PSODP in line 14 to perturb the trajectories in the original dataset one by one. The privacy budgets consumed in the two stages are and , respectively.

5.6. The Parameters and Convergence of RDPT
In Algorithm 3, several parameters influence the solution of the multiobjective optimization problem in formula (6): the population size of the particle swarm , the number of new global optimal particles selected in each iteration , and the parameter that controls the maximum number of iterations. According to the literature [40, 41], since there are two objective functions and one constraint in formulas (6) and (7), we select an appropriate value in the range for the population size . Usually we let . We empirically select the parameter that controls the maximum number of iterations so that PSODP converges for each trajectory, usually in [400, 1000]: when the privacy budget is larger, fewer iterations are needed, and when it is smaller, more iterations are needed (up to 1000). For different datasets, these parameters should be analyzed and empirically adjusted to obtain better results; we did not adjust the other parameters of PSODP.
It is difficult to prove the convergence of Algorithm 1 (the PSODP algorithm). In practice, as shown in Algorithm 1, when the difference in the objective function values between two iterations is below a threshold over several consecutive iterations and the number of iterations is large enough, we consider PSODP to have converged. The exponential mechanism interferes with the convergence of PSODP, but the general trend of PSODP is convergent. In fact, the randomness introduced by the exponential mechanism may let PSODP jump out of a local optimal solution and reach a better one. Even so, a few trajectories may fail to converge; in that case, we re-execute Algorithm 1 for the trajectory, and PSODP usually converges. In our experiments, Algorithm 1 converged for all trajectories in the dataset.
5.7. Privacy Analysis
The need for data privacy arises in two different scenarios. One is the data collection scenario, in which individuals regard data collectors as untrusted and send their checkins with local differential privacy (e.g., voluntarily on social network sites). The other is the data releasing scenario, in which datasets are released to a thirdparty research institute for analytical applications and differential privacy is applied over the centralized datasets. RDPT addresses the problem of privacy disclosure in the second scenario.
Next, we prove that RDPT satisfies differential privacy. We first prove that Algorithm 1 satisfies differential privacy.
Theorem 3. Algorithm 1 satisfies differential privacy.
Proof. We first analyze Algorithm 2, in which the key step is selecting a local optimal particle. When Algorithm 2 is called in each iteration, the exponential mechanism selects a new local optimal particle in the particle swarm, and the corresponding candidate particle set is disjoint from those of the other iteration calls. According to the parallel composition property of differential privacy in Definition 6, the privacy budget consumed by selecting a new local optimal particle for each particle is . Therefore, the process of selecting local optimal particles in Algorithm 2 satisfies differential privacy.
For the selection of global optimal particles, we only select global optimal particles from the union of the newly selected local optimal particles and the historically global optimal particles, so no further noise needs to be added. Therefore, Algorithm 2 satisfies differential privacy as a whole.
In Algorithm 1, Algorithm 2 is sequentially called at most times; hence, according to the sequential composition property in Definition 5, the privacy budget consumed in Algorithm 1 is no more than , and Algorithm 1 satisfies differential privacy.
Theorem 4. Algorithm 3 (RDPT) satisfies differential privacy.
Proof. Suppose the total privacy budget is . We split it into two parts in RDPT, denoted by and , where the former is for adaptive grid partitioning and the latter is for perturbing the cell visit probability vectors in Algorithm 3.
In line 28 in Algorithm 3, we add Laplacian noise to the result of formula (3) to ensure the privacy of dividing a toplevel cell into bottomlevel cells. As shown in Theorem 1, the sensitivity of formula (3) is 1, which means that the added Laplacian noise follows a distribution of . Therefore, the adaptive grid partitioning in Algorithm 3 satisfies differential privacy.
When we perturb a real trajectory in a dataset, Algorithm 1 is called in line 14 of Algorithm 3. Algorithm 1 only reads the real trajectory of one user and perturbs its cell visit probability vector. According to the sequential composition property in Definition 5 and Theorem 3, the process of perturbing one real trajectory with Algorithm 1 satisfies differential privacy. When the next real trajectory is processed, the real trajectory that Algorithm 1 reads is disjoint from the previous one. Therefore, although Algorithm 1 is called once for each trajectory, perturbing all the real trajectories in a dataset still satisfies differential privacy according to the parallel composition property in Definition 6.
As a result, Algorithm 3 (RDPT) satisfies differential privacy according to the sequential composition property in Definition 5.
RDPT not only satisfies differential privacy theoretically but also reduces trajectory correlation; thus, privacy is preserved.
6. Experiments and Analysis
Extensive experiments are conducted on three real datasets to verify the effectiveness of RDPT. By adjusting the total privacy budget , we compare RDPT with several methods over five metrics to verify that RDPT achieves almost the same data utility and better security. We also verify the stability of RDPT over two specific implementations for trajectory correlation.
6.1. Datasets
We use three real datasets in our experiments: Gowalla [42], Yonsei [10] and Geolife [8]. The three datasets denote different applications for releasing trajectory data offline.
Gowalla is a locationbased social networking website where users share their locations by checking in. The dataset GowallaNew York (GNY) in our experiments is a subset of the checkin records in the Gowalla dataset whose latitude and longitude coordinates are located in the city of New York. The dataset depicts the relationships and daily activities of users in a real social network. In data preprocessing, records whose latitude, longitude or timestamp (time slot) is null are deleted from the dataset. In addition, if the number of locations in a trajectory is less than 80, the trajectory is also deleted, because the trajectory correlation between two users is formed over a relatively long time. Thus, the experiments over this dataset can simulate a real scenario of releasing a trajectory dataset offline.
Yonsei is a dataset collected by Yonsei University, Seoul, Korea [10]. The trajectories of nine graduate students at Yonsei University were collected with the mobile location service application SmartDc over two months in 2011. The dataset YonseiSeoul (YSO) is the subset of locations in the Yonsei dataset whose latitude and longitude coordinates are located in Seoul. Different from the GNY dataset, all users in the Yonsei dataset have relationships, and the data are denser than in the GNY dataset. The Yonsei dataset also records the behavior of users and simulates a real scenario of releasing trajectories offline.
The Geolife dataset contains trajectories from 182 users over three years. Each location is represented by its GPS latitude and longitude coordinates and the date and time when the user visited it. We selected the trajectories of 33 users from October 23 to November 6, 2008, where all the locations are in Beijing. For each user, we connect the user's multiple trajectories into one trajectory, and thus obtain a dataset (GEO) for our experiments. The data are much denser than in the other two datasets.
The statistical information of the dataset after preprocessing is summarized in Table 2.
6.2. The methods to be compared
We use DPT [6], TGM [7], and AdaTrace [8] as three comparison methods. These methods consider the correlations within a trajectory, and they are privacypreserving approaches with differential privacy. Although the correlations within a trajectory in DPT, TGM and AdaTrace differ from the trajectory correlation between trajectories in RDPT, we still choose them for comparison because their aim is to generate synthetic trajectories that can provide efficient protection while publishing a large number of trajectories offline. As for the methods in the literature [9, 10], they only perturb two trajectories of the same length and require that the two trajectories are not too long; it is difficult to find a real dataset for experiments to compare with them, so we do not compare with these two methods. For the comparison approaches, we use the parameters recommended by the literature or the parameters that improve the experimental results.
6.3. The metrics
We select five metrics in our experiments to evaluate the data utility and privacy of RDPT. The metrics for evaluating the data utility are the JensenShannon divergence of the location visit probability vector [6 –8], the Kendall coefficient of the location visit probability [8], and the query error [8]. The ability to resist a Bayes inference attack [8] and the ability to protect the trajectory correlation are the two metrics for privacy.
The JensenShannon divergence of the location visit probability vector. This metric evaluates the degree to which the location visit frequency vectors are preserved between the real dataset and the perturbed dataset. The smaller the value is, the better the spatial distribution feature is preserved, and the smaller the utility loss of the dataset is. The equation for calculating is:

where and denote the global location visit probability vectors in the real dataset and the perturbed dataset, and is the KullbackLeibler divergence.
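The metric can be sketched directly from the standard definitions (a minimal implementation in nats; input vectors are assumed to be probability distributions of equal length):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-mass terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors:
    the average KL divergence of p and q to their midpoint m."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike raw KL, this quantity is symmetric and bounded by ln 2, which makes it a stable score even when some locations are visited in only one of the two datasets.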
The Kendall coefficient of the location visit frequencies. The Kendall coefficient evaluates, for any pair of locations in the datasets, whether the order of the visit frequencies of the two locations is changed between the real dataset and the perturbed dataset. The larger the value is, the more relative "hot" locations are preserved, and the higher the data utility is. To describe the metric, we first define "order preserving": for two locations, if the visit frequency of the first is larger than (or less than) that of the second in the real dataset, and the same holds in the perturbed dataset, then the two locations form a pair of "order preserving" locations. Suppose the numbers of "order preserving" and "order not preserving" pairs over the real dataset and perturbed dataset are counted, respectively. The equation for calculating is:

where is the number of locations in a dataset.
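Assuming the standard Kendall form of this coefficient (concordant pairs minus discordant pairs over the total number of pairs; the paper's exact expression did not survive extraction), a direct sketch is:

```python
def kendall_coefficient(freq_real, freq_pert):
    """Kendall coefficient over location pairs: +1 when every pair keeps
    its frequency order after perturbation, -1 when every pair is reversed."""
    n = len(freq_real)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (freq_real[i] - freq_real[j]) * (freq_pert[i] - freq_pert[j])
            if s > 0:
                concordant += 1      # "order preserving" pair
            elif s < 0:
                discordant += 1      # "order not preserving" pair
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Ties contribute to neither count, so the score stays in [-1, 1].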
The query error . We define a query as counting the number of trajectories passing through a specific region in a dataset. Suppose denotes the query result over a dataset ; the query error is calculated as follows:

where controls the impact of the query results in extreme cases. In our experiments, we let [8]. is a set of 500 regions selected uniformly at random to avoid the query error being influenced by accidental abnormal query results. The smaller is, the better the counts of trajectories traversing different regions of the dataset are preserved, and the more conducive the dataset is to commercial block planning and other application scenarios.
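A minimal sketch of this metric, assuming the usual form of the relative error with a sanity bound b that caps the influence of queries with tiny true counts (the names are illustrative):

```python
def query_error(real_counts, pert_counts, b):
    """Average relative error of region-count queries over a set of
    regions; b is the sanity bound for queries with small true counts."""
    errors = [abs(r - p) / max(r, b)
              for r, p in zip(real_counts, pert_counts)]
    return sum(errors) / len(errors)
```

Without the bound b, a region whose true count is zero or near zero would dominate the average, which is exactly the extreme case the text mentions.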
These three metrics are used in the three compared papers. The metric for the trip distribution in DPT [6], TGM [7] and AdaTrace [8] is the same as in our paper, where the trip distribution indicates the global location visit probability vectors. The and metrics are used in AdaTrace. Therefore, selecting these three metrics ensures the fairness of the comparison.
The ability to resist a Bayes inference attack . Suppose is a sensitive region; is a vector corresponding to the prior Markov transition matrix of over a real dataset , and the attacker knows this vector. is a vector corresponding to the posterior Markov transition matrix of over the perturbed dataset , which the attacker also knows. We evaluate the difference between and using the JensenShannon divergence as follows.
The smaller is, the smaller the difference between and is, the less privacy the attacker infers, and the higher the security is. To avoid being influenced by a specific sensitive area , we regard the maximum value over all regions in the sensitive region set as the final result.
The ability to protect the trajectory correlation . The metric evaluates the protection degree of trajectory correlation for a privacypreserving method. The larger the value is, the better the trajectory correlation is protected, and the lower the probability that an attacker infers the social relationship through trajectory correlation is. The metric is:
In the following, we compare RDPT with DPT [6], TGM [7], and AdaTrace [8] over the above five metrics, namely the JensenShannon divergence , the Kendall coefficient , the query error , the ability to resist a Bayes inference attack , and the ability to protect trajectory correlation , over the three datasets.
6.4. Experimental results and analysis
We implement RDPT in Java, and the experiments are conducted on a computer with a 3.60 GHz Intel(R) Core(TM) i5-8600K and 16 GB of memory. We adjust the parameters in our experiments to guarantee that Algorithm 1 converges. Each experiment is repeated several times, and the average values of the five metrics are reported. The parameters for our experiments are listed in Table 3.
In the following subsections, we compare the data utility as increases. We use the Pearson correlation coefficient and cosine similarity to calculate the trajectory correlation in RDPT. In the following results, RDPTPearson and RDPTCos denote the results with the trajectory correlation calculated by the Pearson correlation coefficient and cosine similarity, respectively.
6.4.1. Comparison of data utility
We will first compare RDPT with DPT, TGM, and AdaTrace over three datasets for three data utility metrics: , , and .
The experimental results. Figure 1, Figure 2 and Figure 3 show the results of RDPT, DPT, TGM and AdaTrace under different privacy budgets over the three datasets for the three metrics , and , respectively. From Figure 1, the results of for RDPT are better than those of the other three methods. In Figure 2, the results of for RDPT are better than those of the other three methods over the GNY and YSO datasets, but the result for RDPTPearson is lower than that of AdaTrace over the Geolife dataset when , and the difference in between RDPTPearson and AdaTrace is less than 0.04. Therefore, the results of over Geolife for RDPT are better than or almost equivalent to those of the other three methods.
In Figure 3(a), the results of for RDPT are slightly worse than those of DPT and very close to those of AdaTrace, but much better than those of TGM. From Figure 3(b), RDPTPearson and RDPTCos are better than TGM and AdaTrace, and slightly worse than DPT; however, the difference in between DPT and RDPT is about . In Figure 3(c), the results of RDPT are slightly worse than those of TGM and better than those of AdaTrace; moreover, the results of RDPTPearson are better than those of DPT, and the results of RDPTCos are almost equivalent to those of DPT. Overall, the values of for RDPT, DPT, and TGM are close.
From Figure 1 to Figure 3, RDPT preserves data utility almost equivalent to that of the comparison methods under the two different implementations of trajectory correlation.
Analysis of the results. RDPT perturbs the trajectories one by one and preserves the spatial distribution feature of locations for each trajectory. Although DPT, TGM, and AdaTrace all focus on features of the global spatial distribution, they replace the spatial distribution feature of each trajectory with the common spatial distribution feature of all trajectories. As the spatial distribution feature is generalized, their and metrics are worse than or equivalent to those of RDPT. Moreover, as shown in the multiobjective functions in formula (6), the first part is formulated to preserve the cell visit probability vectors of the trajectories; after solving it with PSODP, we obtain its extremum and achieve this purpose, thus preserving better and metrics than DPT, TGM, and AdaTrace. As for the metric, the postprocessing in step 3 introduces noise, so the counts of trajectories traversing different regions change slightly, and the metric is slightly worse than that of DPT or TGM. As a result, the data utility of RDPT is almost equivalent to that of DPT, TGM, and AdaTrace.
6.4.2. Comparison of privacy
The experimental results. Since the ability to protect trajectory correlation is influenced by the specific implementation of trajectory correlation, we show the results of DPT, TGM, and AdaTrace with the trajectory correlation computed by both the Pearson correlation coefficient and cosine similarity. The experimental results under different privacy budgets over the three datasets are shown in Figure 4 and Figure 5 for the two metrics, the ability to resist Bayes inference attacks and the ability to protect trajectory correlation , respectively. As shown in Figure 4, for , RDPTPearson and RDPTCos are better than DPT, TGM, and AdaTrace under different privacy budgets . The results indicate that RDPT preserves the features of the Markov transition matrix in sensitive regions better than the comparison methods, so RDPT can avoid the leakage of privacy in sensitive areas after the dataset is perturbed. For , the results of RDPT are better than those of DPT, TGM, and AdaTrace. In addition, the smaller is, the better the metric is.
Analysis of the results. The metric reflects the spatial feature of the Markov transition matrix of trajectories, and the results in Figure 1 and Figure 2 show that RDPT better preserves the location visit probability vector (i.e., the spatial feature); therefore, RDPT is also better than the compared methods on the privacy metric . For the metric , since RDPT focuses on preserving the spatial distribution feature of the trajectory of each user, the differences between the spatial distribution features of perturbed trajectories are large. DPT, TGM and AdaTrace instead preserve the global spatial distribution features of all users and generate perturbed trajectories by constructing mobility models based on these global features. Therefore, the spatial distribution features of perturbed trajectories in their perturbed datasets are similar, so the average trajectory correlation between perturbed and original trajectories is reduced less, and their performance is worse than that of RDPT. Moreover, as shown in the multiobjective functions in formula (6), the second part is formulated to protect the trajectory correlation between different trajectories; after solving the optimization problem via PSODP, we obtain its extremum and achieve this goal, which means the trajectory correlation is well protected. Consequently, RDPT naturally outperforms DPT, TGM and AdaTrace on . From Figure 5, the results of over the Geolife dataset for RDPT are not as good as those over GNY and YSO. The reason is that GNY and YSO come from real social networking, where some users have relationships and the trajectory correlations between users are therefore higher, while the users in the Geolife dataset usually have no social relationships, so the reduction in trajectory correlation is less obvious.
7. Conclusion
In this paper, we proposed a differentially private trajectory publication method, named RDPT, to protect the trajectory correlation. We designed a multiobjective optimization problem that aims to reduce the trajectory correlation between a given trajectory and other trajectories, integrated the optimization problem into RDPT, and solved it through a modified particle swarm optimization algorithm with differential privacy. The experimental results on three real datasets show that RDPT achieves data utility almost equivalent to that of existing methods under two different specific implementations of trajectory correlation. Moreover, RDPT achieves a better privacy guarantee than existing methods and is more suitable for preserving the privacy of long trajectories. How to improve our method for dense datasets such as Geolife, which contains overlong trajectories, is our future work.
Conflicts of Interest
The authors declare that they have no conflict of interest.
Acknowledgments
This research is supported by the National Science Foundation of China (project number 61772215) and a project supported by the Wuhan Science and Technology Bureau (project number 2018010401011274).
This work is an extension of the paper "CTP: Correlated Trajectory Publication with Differential Privacy" in the 2021 IEEE 6th International Conference on Computer and Communication Systems.