Abstract
Secure publishing of private trajectory data is a typical application scenario in the Internet of Things (IoT). Protecting users’ privacy while publishing data is a long-standing challenge. In recent years, the mainstream approach has been to combine the Markov model with a differential privacy (DP) mechanism to build a private trajectory generation model and to publish the generated synthetic trajectory data instead of the original data. However, the Markov model cannot effectively capture the long-term spatiotemporal correlation of trajectory data, and the DP noise lowers the availability of the synthetic data. To protect users’ privacy and improve the availability of synthetic trajectory data, we propose a trajectory generation model with differential privacy and deep learning (DTG). First, we design a private hierarchical adaptive grid method, which divides the geospatial region into subregions according to the density of positions in order to discretize the coordinates of the trajectory data. Second, a gated recurrent unit (GRU) network captures the temporal features of the trajectory sequence to achieve good availability, and we generate synthetic trajectory data by iteratively predicting the next position. Third, we adopt the optimizer perturbation method in gradient descent to protect the privacy of the model parameters. Finally, we experimentally compare DTG with state-of-the-art trajectory generation approaches on the real trajectory datasets T-Drive, Portotaxi, and Swedishtaxi. The results demonstrate that DTG performs better in generating synthetic trajectories under four error metrics.
1. Introduction
With the development of the Internet of Things (IoT), the surge in IoT devices has boosted the growth of trajectory data, and advanced positioning technology can record users’ trajectory data promptly and accurately. This large volume of accurate trajectory data contributes to the development of location-based services (LBS); reports indicate that the market value of location-based services was $44.47 billion in 2020 and will reach $155.13 billion by 2026 [1]. Providing users with better services based on trajectory information from the IoT has become a trend. A growing number of location-based services, such as navigation, car-hailing, and living services, make users’ lives more convenient [2, 3] by mining trajectory data with intelligent technologies. Furthermore, governments and organizations use human mobility data to guide social construction [4, 5]. Trajectory data therefore have important implications for our society.
Protecting users’ privacy is the basis of service in the use of trajectory data [6–8]. On the one hand, users’ trajectory data may be collected without approval, and the user's track would be recorded by various devices [9]. Even after the user turns off the track recording function, the trajectory information can still be recorded by the operating system or SIM card [10]. The collection and publication of the trajectory data are a direct threat to users’ privacy.
On the other hand, if the data owner does not apply an appropriate privacy protection mechanism when publishing the trajectory data, attackers have the opportunity to obtain the sensitive information of mobile users [11–13]. As shown in Figure 1, the IoT aggregates the real-time trajectory data collected by various devices at edge servers, which publish the data to third-party partners that are untrusted or honest but curious. The edge servers generally desensitize the aggregated data before publishing, but traditional anonymity-based privacy mechanisms are not strong enough. A malicious third party can use background knowledge about the published data to infer the data hidden in the anonymous area and obtain the user’s sensitive information. The sensitive information involves frequently visited sites such as users’ homes, offices, hospitals, clinics, entertainment venues, and religious places [9, 12]. Based on users’ trajectories, applications can serve precisely targeted ads for catering, medical, and other services. Some applications are paid promotion fees, so they put the interests of the payer ahead of those of users. These are common violations of users’ privacy; worse still, criminals may plan crimes based on the leaked trajectory information.
These problems have inspired trajectory generation technology. Researchers aim to extract mobility features from the trajectory dataset and construct models that generate synthetic trajectories based on these features. Recent studies have developed trajectory generation [16–21] based on differential privacy [14, 15] to provide strong privacy guarantees. The key challenge in developing a differential privacy-based generation model is to preserve the mobility features accurately when adding noise. The existing models involve three main tasks. The first is to divide the geospatial region of the trajectory dataset and transform continuous spatial coordinates into discrete region identifiers. The second is to build a model that can capture long-term spatiotemporal trajectory data. The third is to design a privacy protection mechanism that protects the spatial and temporal privacy of trajectory data. Traditional generation models, which mostly use the Markov model, struggle to extract and preserve the mobility patterns accurately, and the differential privacy mechanism distorts features of the trajectories such as position distribution, trajectory diameter, and frequent patterns. Researchers have therefore turned to generation models based on deep learning [22–24]. A deep learning model can learn the hidden patterns contained in the original dataset. In particular, a recurrent neural network (RNN) is well suited to modelling long-term time-dependent sequence data and captures the geographic features of trajectory data well. Nevertheless, most current research on trajectory generation based on deep learning [22, 23, 25–27] ignores the protection of mobility privacy. A deep learning model has a large number of parameters with a huge semantic space, which may encode features of the original trajectory data. Attackers could obtain information about the original data through model inference attacks.
To solve the above problems, this paper presents a trajectory generation model with differential privacy and deep learning (DTG). DTG preserves the spatial-temporal correlation of trajectories and provides strong privacy protection by combining differential privacy mechanisms with a deep learning model. First, we propose the private hierarchical adaptive grid model and index the coordinates of the trajectory data by the identifiers of grid cells to form the training data. The model divides the extremely dense regions of the geospatial space hierarchically, and an adaptive mechanism flexibly determines the size of the grid cells according to the density of the local region. The count of each grid cell is perturbed by the Laplace mechanism to protect spatial privacy. Second, we extract the features of the trajectories with a GRU. Subsequences of a trajectory are generated by a sliding window, and the GRU model is trained on them, taking all nodes of a subsequence except the last as the input and the last node as the output. Third, to protect the privacy of the temporal correlation, we apply the optimizer perturbation method in the training process. The optimizer perturbation method adds random noise drawn from a Gaussian distribution to satisfy differential privacy.
The major contributions of this paper are as follows:
(i) For the first time, we introduce the private hierarchical adaptive grid model. It is a density-aware grid model that reduces the number of empty cells and fully divides the dense regions so that no cell holds an excessive amount of data.
(ii) We develop a trajectory generation model that combines a differential privacy mechanism with the GRU model. The GRU model ensures good availability of the generated trajectory data, and the differential privacy mechanism prevents attackers from accessing the real model parameters.
(iii) We conduct experiments on DTG over real-life datasets and demonstrate that our solution outperforms the state-of-the-art techniques [6, 20, 22] in terms of the point distribution error metric, frequent pattern error metric, region query error metric, and diameter error metric.
The remainder of this paper is organized as follows. Section 2 discusses the state of the art in trajectory synthesis, and Section 3 introduces the background and main theorems. Section 4 presents the core components of DTG. Section 5 describes the evaluation metrics and the experiment setup used to demonstrate the superiority of our method on real datasets. Finally, some research emphases are put forward in Section 6.
2. Related Work
We classify related work into two categories and discuss each category.
2.1. Trajectory Generation with Differential Privacy
Existing trajectory generation schemes are mainly divided into stochastic methods [28–32] and simulation-based methods [33–37]. A simulation-based method generates trajectory data by simulating the human mobility pattern in various road networks. A stochastic modelling method generates random variables that follow a particular probability distribution to fit the real data. With the development of differential privacy technology, using a stochastic model with perturbation based on differential privacy to protect the privacy of trajectory datasets [6, 16, 18, 20, 21] has become the mainstream method.
Chen et al. [16] use the variable-length n-gram method to process sequential data, counting the sequential transition probabilities and generating synthetic data. In the process, they add noise with differential privacy methods and design an exploration tree to improve data availability. He et al. [18] protect the privacy of the trajectory generation process by adding Laplace noise to the prefix tree. In addition, they propose hierarchical reference systems and direction-weighted sampling to improve the availability of the generated data. AdaTrace [20] uses a low-order Markov model to generate the synthetic trajectory data and adds Laplace noise to the grid, the length histogram, the start and end distribution, the Markov model, and so on. TGM [21] models the encoded data with a graphical generative model: it specifies the starting point, calculates the direction of advance, and either transfers to an adjacent grid cell or stays at the current position, thus generating features of arbitrary length and retaining the stop points.
These methods provide strong privacy protection, but the availability of the generated trajectory data is inadequate.
2.2. Trajectory Generation with Deep Learning
Deep learning has brought new developments to trajectory synthesis, having significant advantages in modelling sequential data compared to the Markov model [38].
Kulkarni et al. [22] use four evaluation methods to measure seven kinds of generative models, including Char-RNN [39], RNN-LSTM [40], RHN (recurrent highway networks) [41], PSMM (pointer sentinel mixture model) [42], SGAN [43], RGAN [44], and Copulas [45]. The results show that it is feasible to use RNNs to generate trajectory data. Huang et al. [46] use a variational autoencoder (VAE) based on LSTM to obtain a well-constructed latent space that captures the salient features of the training data, showing excellent performance. The TrajGANs [23] framework applies generative adversarial nets (GAN) to trajectory generation, and both the generator and the discriminator of TrajGANs are RNNs. The generator produces a synthetic trajectory from a random vector, and the discriminator judges the authenticity of the trajectory. Ouyang et al. [25] propose a trajectory generation technique based on GAN whose generator and discriminator use convolutional neural networks. This method treats the trajectory as a sequence of stays and extracts three features, namely, the geographical location, start time, and duration, for trajectory generation. Song et al. [26] use a four-layer convolutional neural network within the GAN framework to generate trajectories that are represented by a 512 × 512 matrix. MoveSim [27] uses a model-free generative adversarial framework to generate synthetic trajectory data. The generator uses a self-attention-based sequential modelling network to model human mobility, and the discriminator distinguishes the generated trajectory sequences with a mobility regularity-aware loss.
The above deep learning methods ignore the privacy protection of the generation model and the training data.
This suggests combining differential privacy and deep learning in trajectory generation to improve both the availability and the privacy of the generated synthetic trajectory data.
3. Preliminaries
This section introduces the preliminaries of DTG. We review the trajectory dataset and the basic terminology of differential privacy.
3.1. Trajectory Dataset and Notation
Let $T = (p_1, p_2, \ldots, p_n)$ be a trajectory with $p_1$ as the start point and $p_n$ as the endpoint of the trajectory. $T$ is a time-ordered sequence composed of $n$ sequential points. A trajectory dataset $D$ contains $|D|$ trajectories. A spatial-temporal point $p_i$ is a pair $(l_i, t_i)$, where $l_i$ indicates the geographic location and $t_i$ is the timestamp indicating the time when $l_i$ is visited. $l_i$ is a pair of space coordinates, usually represented by longitude and latitude, as $l_i = (x_i, y_i)$.
To simplify the problem of trajectory generation, we preprocess the original dataset before trajectory generation. In the dataset $D$, the intervals between adjacent points are the same. Hence, after data preprocessing, we simplify $p_i$ to $l_i$ to represent a point. The detailed process is described in Section 4. A trajectory is then denoted by $T = (l_1, l_2, \ldots, l_n)$, and the dataset is denoted by $D = \{T_1, T_2, \ldots, T_{|D|}\}$. We define the two-dimensional space $\Omega$ as the geospatial region of $D$; $\Omega$ is bounded by the extreme space coordinates of the points in $D$.
3.2. Grid Method
The grid method discretizes the continuous two-dimensional (latitude and longitude) coordinates. It divides the geospatial region into multiple disjoint subregions, as shown in Figure 2, and we use the identifiers of the subregions, instead of the latitude and longitude coordinates, to index the points.
Definition 1. (grid model). Let $G$ be a grid model. For the geospatial region $\Omega$ of the dataset $D$, $G$ contains $m$ independent regions, represented as $G = \{g_1, g_2, \ldots, g_m\}$, and it satisfies $g_i \cap g_j = \emptyset$ for $i \neq j$, and $\bigcup_{i=1}^{m} g_i = \Omega$. Each $g_i$ has a unique identifier $c_i$.
If the coordinates $l$ of a point lie in the region $g_i$, we take $c_i$ for $l$. Therefore, the grid model maps spatial coordinates to the identifiers of grid units.
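The mapping from coordinates to cell identifiers can be sketched for the simplest case of a uniform grid. This is only an illustration: the function name `cell_id`, the row-major numbering of cells, and the `(min_lon, min_lat, max_lon, max_lat)` layout of `bounds` are assumptions made here, and the hierarchical adaptive grid of Section 4 is not uniform.

```python
def cell_id(lon, lat, bounds, rows, cols):
    """Map a coordinate to the identifier of the uniform grid cell containing it.

    bounds is (min_lon, min_lat, max_lon, max_lat); identifiers are assigned
    row by row starting from 0 (an illustrative convention).
    """
    min_lon, min_lat, max_lon, max_lat = bounds
    if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
        raise ValueError("point lies outside the geospatial region")
    # Clamp so that points on the upper boundary fall into the last row/column.
    col = min(int((lon - min_lon) / (max_lon - min_lon) * cols), cols - 1)
    row = min(int((lat - min_lat) / (max_lat - min_lat) * rows), rows - 1)
    return row * cols + col
```

Applying `cell_id` to every point of a trajectory turns a coordinate sequence into a sequence of cell identifiers.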
3.3. Differential Privacy
The standard privacy definitions used in our paper are derived from the work of Dwork [47]. Differential privacy [15, 23] is a robust standard for database privacy protection. Dwork proves that it is impossible to provide absolute privacy protection in the presence of background knowledge [48] and further proposes the concept of differential privacy based on indistinguishability. Differential privacy [47] requires that the output of any computation is insensitive to a change in a single record; namely, what an adversary learns from a database containing a record is the same as what it learns from a database that does not contain this record. Therefore, the adversary cannot violate the privacy of any record in the database and, consequently, cannot violate the privacy of the whole database.
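Definition 2. (neighboring datasets). If two datasets $D$ and $D'$ differ in only one record, $D$ and $D'$ are neighboring datasets.
Definition 3. (differential privacy [47]). Let $D$ and $D'$ be neighboring datasets. A randomized algorithm $\mathcal{A}$ is $\varepsilon$-differentially private ($\varepsilon$-DP) if, for every set of outputs $S \subseteq \mathrm{Range}(\mathcal{A})$, equation (1) holds. $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially private ($(\varepsilon, \delta)$-DP) if, for every set of outputs $S \subseteq \mathrm{Range}(\mathcal{A})$, equation (2) holds.
$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{A}(D') \in S] \tag{1}$$
$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{A}(D') \in S] + \delta \tag{2}$$
Definition 4. (sensitivity). Let $f$ be a function on the dataset $D$ whose output is a vector of numbers of fixed dimension. The sensitivity of $f$ is defined as
$$\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_p \tag{3}$$
where $D$ and $D'$ in (3) are neighboring datasets and $\lVert \cdot \rVert_p$ represents the $\ell_p$ norm. The Laplace mechanism uses the $\ell_1$ norm, and the Gaussian mechanism uses the $\ell_2$ norm.
The most popular $\varepsilon$-DP algorithm is the Laplace mechanism [47], and the most popular $(\varepsilon, \delta)$-DP algorithm is the Gaussian mechanism [49]. They perturb the returned values by adding random noise calibrated to the sensitivity.
Definition 5. (Laplace mechanism [47]). $\mathrm{Lap}(b)$ denotes a random variable from the Laplace distribution with mean 0 and scale parameter $b$. For a function $f$ with sensitivity $\Delta f$, the randomized function $\mathcal{A}(D) = f(D) + \mathrm{Lap}(b)$ satisfies $\varepsilon$-DP when $b \ge \Delta f / \varepsilon$.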
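The Laplace mechanism of Definition 5 can be illustrated with a short sketch that uses only the Python standard library. The function names are chosen here for illustration. Laplace noise is sampled as the difference of two independent exponential variables, which is a standard way to draw from the Laplace distribution, and the scale is set to the sensitivity divided by the privacy budget.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one sample from the Laplace distribution with mean 0 and the given scale."""
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution with mean 0 and scale `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(value, sensitivity, epsilon, rng=random):
    """Perturb a numeric query result with noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return value + laplace_noise(sensitivity / epsilon, rng)
```

For a counting query, where adding or removing one record changes the count by at most 1, a noisy count would be obtained with `laplace_mechanism(count, 1.0, epsilon)`.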
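The Laplace mechanism provides strict $\varepsilon$-differential privacy, whereas some work prioritizes the availability of data and does not require such a strict guarantee. We use the Laplace mechanism in the grid model and the Gaussian mechanism, which provides the relaxed $(\varepsilon, \delta)$-differential privacy constraint, in the deep learning model.
Definition 6. (Gaussian mechanism [49]). Let $\mathcal{N}(0, \sigma^2)$ denote a random variable from the Gaussian distribution with mean 0 and variance $\sigma^2$. For a function $f$ with sensitivity $\Delta f$, the randomized function $\mathcal{A}(D) = f(D) + \mathcal{N}(0, \sigma^2)$ satisfies $(\varepsilon, \delta)$-DP when $\sigma \ge \sqrt{2 \ln(1.25/\delta)}\, \Delta f / \varepsilon$ for $\varepsilon \in (0, 1)$.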
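As a small worked illustration of Definition 6, the sketch below computes the noise standard deviation from the classical bound, with sigma equal to the square root of 2 ln(1.25/delta) times the sensitivity divided by epsilon, and adds independent Gaussian noise to each coordinate of a vector. The function names are illustrative assumptions; the bound itself is the standard one for a privacy budget epsilon between 0 and 1.

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    """Smallest sigma allowed by the classical Gaussian-mechanism bound."""
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("the bound assumes 0 < epsilon < 1 and 0 < delta < 1")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon

def gaussian_mechanism(vector, sensitivity, epsilon, delta, rng=random):
    """Add independent N(0, sigma^2) noise to every coordinate of `vector`."""
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in vector]
```

For example, with an L2 sensitivity of 1, epsilon = 0.5, and delta = 1e-5, the required standard deviation is roughly 9.69.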
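We will use the serial composition theorem when designing the privacy protection mechanism of trajectory synthesis.
Theorem 1 (serial composition theorem). For mechanisms $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_k$ applied in sequence to the same dataset, where $\mathcal{A}_i$ satisfies $(\varepsilon_i, \delta_i)$-differential privacy for $1 \le i \le k$, the whole process satisfies differential privacy, namely, $\left(\sum_{i=1}^{k} \varepsilon_i, \sum_{i=1}^{k} \delta_i\right)$-differential privacy.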
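The bookkeeping implied by the serial composition theorem can be sketched in a few lines: an overall budget is split across the steps of a serial mechanism, and the per-step budgets add back up to the total. The helper names and the even split are illustrative assumptions.

```python
def split_budget(epsilon, delta, steps):
    """Split an overall (epsilon, delta) budget evenly across `steps` mechanisms."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [(epsilon / steps, delta / steps) for _ in range(steps)]

def compose(budgets):
    """Serial composition: the epsilons and the deltas each add up."""
    return (sum(e for e, _ in budgets), sum(d for _, d in budgets))
```

Splitting a budget of (1.0, 1e-5) over 10 steps gives each step (0.1, 1e-6), and composing the 10 steps recovers the overall budget up to floating-point rounding.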
3.4. Problem Statement
A trajectory generation model is an algorithm that generates a set of synthetic trajectories $D_{syn}$ describing the movements of a given population. Each generated trajectory should be a time-ordered sequence composed of spatiotemporal points, and attackers should not be able to violate the trajectory privacy of users. The availability of the synthetic trajectory dataset is evaluated with respect to the point distribution error metric, frequent pattern error metric, region query error metric, and diameter error metric.
4. Synthetic Trajectory Generation
Figure 3 illustrates the system architecture of DTG. It mainly includes data preprocessing, GRU model, gradient descent algorithm with differential privacy mechanism, and trajectory generation algorithm.
4.1. Data Preprocessing
Usually, what we obtain is the raw trajectory dataset $D_{raw}$, which is composed of spatiotemporal points. We first transform $D_{raw}$ into the original dataset, denoted by $D$.
4.1.1. Transformation of the Trajectory Dataset
Each record $r_i$ of $D_{raw}$ contains the geographical location and time, together with user identification, transportation, and other information. We divide $D_{raw}$ into subsets, denoted by $D_u$, by user identification, and in each subset, the time of $r_i$ is earlier than the time of $r_{i+1}$. We require that the user’s location be collected at the same sampling rate, so the sampling interval between adjacent locations in a trajectory is the same. $D_u$ contains the movement records of the user over a certain period, and we need to divide these records into multiple trajectories. Given the sampling interval $\Delta t$, if the interval between adjacent nodes is greater than $\Delta t$, we divide the records into subtrajectories. For $r_i$ and $r_{i+1}$, if $t_{i+1} - t_i$ exceeds $\Delta t$, we take $r_i$ as the last point of the previous trajectory and $r_{i+1}$ as the first point of the next trajectory. In this way, $D_{raw}$ is transformed into $D$. $T$ denotes a trajectory and consists of the sequence of locations $l_i$, where $1 \le i \le |T|$. The time intervals between adjacent points in each trajectory of $D$ are the same.
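The splitting rule described above can be sketched as follows. The record layout `(timestamp, lon, lat)` and the function name are assumptions for illustration, and the sketch covers only the splitting step for one user; it assumes the records are already sorted by time and sampled at the common rate.

```python
def split_by_gap(records, interval):
    """Split one user's time-ordered records into trajectories.

    records: list of (timestamp, lon, lat) tuples sorted by timestamp.
    A new trajectory starts whenever the gap between two adjacent records
    is greater than `interval` (same unit as the timestamps).
    """
    trajectories, current = [], []
    previous_time = None
    for timestamp, lon, lat in records:
        if previous_time is not None and timestamp - previous_time > interval:
            trajectories.append(current)  # close the previous trajectory
            current = []
        current.append((lon, lat))
        previous_time = timestamp
    if current:
        trajectories.append(current)
    return trajectories
```

With a 300-second interval, records at times 0, 300, 600, 2000, and 2300 would yield two trajectories of lengths 3 and 2.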
4.1.2. Discretization of the Trajectory Dataset
The discretization process of DTG is as follows. DTG partitions the geographic region with the private hierarchical adaptive grid method. It divides the region according to the given parameters, including the count threshold of position points $\tau$, the upper limit of the hierarchy $H$, the privacy budget $\varepsilon_g$, and the partition velocity parameter $\beta$, which is used to reduce the speed of partition. The hierarchy of the grid model of the target geographic space shall not exceed $H$, and the partition of a grid unit stops if the number of location points in it is lower than $\tau$. At each level of the hierarchy, the partition proceeds as follows: first, the points in each grid cell are counted, and each grid cell is divided into subgrid cells of equal size; then, noise is added to the counts with the differential privacy mechanism, and the noisy counts are taken as the final counts of the grid cells.
Once DTG has discretized $\Omega$, the grid model $G$ is built, and we index the points of $D$ by the identifiers of the grid cells. As shown in Figure 4, the curves represent the users’ movement trajectories, and the points on the curves are the sampling points. For example, the point $l_i^j$ stands for the $i$-th location of the $j$-th trajectory. DTG indexes the locations of the trajectories by the identifiers of the grid cells: it transforms the location sequence $(l_1, l_2, \ldots, l_n)$ into the cell identifier sequence $(c_1, c_2, \ldots, c_n)$, where $c_i$ represents the identifier of the grid cell containing $l_i$. In this way, DTG discretizes the trajectory dataset $D$ into the trajectory sequence dataset $D_c$ by mapping spatial coordinates to grid identifiers.
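To make the idea of a density-aware hierarchical partition concrete, the following is a simplified sketch and not the exact DTG procedure. It makes several assumptions that are not taken from the text: every split is a fixed 2 × 2 split, the partition velocity parameter is omitted, the grid privacy budget is divided evenly over the levels, and a cell is split further only while its Laplace-perturbed count reaches the threshold and the depth limit has not been reached.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two exponential variables gives Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def build_grid(points, bounds, threshold, max_depth, epsilon, rng, depth=1):
    """Return the leaf cells as a list of (bounds, noisy_count) pairs."""
    min_x, min_y, max_x, max_y = bounds
    # Assumption: each level gets epsilon / max_depth, and a count query has
    # sensitivity 1, so the Laplace scale is max_depth / epsilon.
    noisy_count = len(points) + laplace_noise(max_depth / epsilon, rng)
    if depth >= max_depth or noisy_count < threshold:
        return [(bounds, noisy_count)]
    mid_x, mid_y = (min_x + max_x) / 2.0, (min_y + max_y) / 2.0
    quadrants = [
        (min_x, min_y, mid_x, mid_y), (mid_x, min_y, max_x, mid_y),
        (min_x, mid_y, mid_x, max_y), (mid_x, mid_y, max_x, max_y),
    ]
    buckets = [[], [], [], []]
    for x, y in points:
        # Index 0..3 matches the order of `quadrants` above.
        buckets[(x >= mid_x) + 2 * (y >= mid_y)].append((x, y))
    leaves = []
    for quadrant, bucket in zip(quadrants, buckets):
        leaves.extend(build_grid(bucket, quadrant, threshold, max_depth,
                                 epsilon, rng, depth + 1))
    return leaves
```

Dense regions end up covered by many small cells and sparse regions by a few large ones, which is the behaviour the private hierarchical adaptive grid aims for.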
4.2. GRU Model
An RNN is well suited to modelling sequential data. It passes the output and state of the current moment to the next moment as input, so this serial structure preserves the relationship between moments. Because a plain RNN has difficulty preserving long-term dependence and suffers from vanishing and exploding gradients, researchers have proposed many excellent variants of the RNN, such as LSTM (long short-term memory) and GRU. These models handle long-term dependence by adding memory units and mitigate gradient problems with gating units. Compared with LSTM, GRU has fewer parameters and trains faster. Hence, GRU is a better choice for our mechanism.
4.2.1. The GRU Unit
The input and output structure of GRU is shown in Figure 5.
There is a current input $x_t$ and a hidden state $h_{t-1}$, containing the relevant information, passed down from the previous time steps. The GRU model obtains the output $y_t$ and the hidden state $h_t$ of the current time step based on $x_t$ and $h_{t-1}$, and passes the hidden state $h_t$ to the next time step.
The GRU unit structure, shown in Figure 6, consists of two gates, the update gate and the reset gate. These are two vectors that determine what information should be passed to the next steps and finally to the output, and they allow the GRU to retain useful information from long ago. We now introduce the mathematics of the GRU unit.
The update gate $z_t$ helps the GRU model to determine how much of the past information is passed along, and we calculate the update gate at time step $t$ by equation (4). In equation (4), $x_t$ is multiplied by $W_z$, the weight matrix from the input layer to the update gate, and the hidden state $h_{t-1}$ from time step $t-1$ is multiplied by $U_z$, the weight matrix from $h_{t-1}$ to the update gate. Both results and the offset vector $b_z$ of the update gate are added together, and the sigmoid function $\sigma$ maps the sum to a value between 0 and 1.
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \tag{4}$$
The reset gate $r_t$ helps the GRU model to determine how much of the past information to forget. We calculate the reset gate by equation (5), where $W_r$ is the weight matrix from the input layer to the reset gate, $U_r$ is the weight matrix from the hidden state to the reset gate, and $b_r$ is the offset vector of the reset gate.
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \tag{5}$$
The candidate hidden state $\tilde{h}_t$ retains the relevant information from the past through the reset gate $r_t$. We calculate $\tilde{h}_t$ by (6), where $U_h$ is the connection weight matrix between the hidden states, $W_h$ is the weight matrix from the input layer to the hidden state, $b_h$ is the offset vector of the hidden unit, and $\odot$ denotes the Hadamard product.
$$\tilde{h}_t = \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right) \tag{6}$$
The current hidden state $h_t$ contains the information of the current unit, which also includes the past information. The formula of $h_t$ is as follows:
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \tag{7}$$
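The gate computations of a GRU unit can be made concrete with a single forward step written in plain Python. This is a sketch under stated assumptions: the parameter names (`W_z`, `U_z`, `b_z`, and so on) and the dictionary layout are illustrative, and the final interpolation uses the convention in which the update gate weights the previous hidden state; some references swap the roles of the gate and its complement.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def gru_step(x, h_prev, p):
    """One GRU time step; p maps parameter names to matrices and vectors."""
    # Update gate: how much of the previous hidden state is kept.
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev), p["b_z"])]
    # Reset gate: how much of the previous hidden state enters the candidate.
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev), p["b_r"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # Hadamard product
    # Candidate hidden state.
    h_tilde = [math.tanh(a + b + c) for a, b, c in
               zip(matvec(p["W_h"], x), matvec(p["U_h"], gated), p["b_h"])]
    # Interpolate between the previous state and the candidate.
    return [zi * hp + (1.0 - zi) * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]
```

With all weights and offsets set to zero, both gates equal 0.5 and the candidate is zero, so the new hidden state is exactly half of the previous one, which is a convenient sanity check.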
4.2.2. The Training Set and the Generation Method
DTG generates highly available and realistic synthetic trajectories with the GRU model, which is suited to modelling sequential data and can capture the internal associations of the trajectories. DTG generates trajectories by iteratively predicting the next position. For a trajectory $T = (c_1, c_2, \ldots, c_i)$, the probability that the next position is $c_{i+1}$ is $P(c_{i+1} \mid c_1, c_2, \ldots, c_i)$. The context $c_1, c_2, \ldots, c_i$ constrains the choice of $c_{i+1}$, and we introduce the Markov assumption to simplify the problem as
$$P(c_{i+1} \mid c_1, c_2, \ldots, c_i) \approx P(c_{i+1} \mid c_i) \tag{8}$$
We use the n-gram model, in which the current location is determined by the $n-1$ locations before it, to solve the problems of oversized spatial dimension and data sparsity, as shown in the following equation:
$$P(c_{i+1} \mid c_1, c_2, \ldots, c_i) \approx P(c_{i+1} \mid c_{i-n+2}, \ldots, c_i) \tag{9}$$
Based on the n-gram model, DTG segments the trajectory sequences into subsequences with a sliding window of width $n$. For example, if $n = 3$, a trajectory sequence $(c_1, c_2, c_3, c_4)$ is divided into 2 subsequences: $(c_1, c_2, c_3)$ and $(c_2, c_3, c_4)$. We take the first $n-1$ nodes of each subsequence as the input data and the last node as the output data, which gives the input-output pairs $((c_1, c_2), c_3)$ and $((c_2, c_3), c_4)$. In this way, DTG transforms the dataset $D_c$ into the training set $S = \{(X, y)\}$, where $X$ is an input of length $n-1$ and $y$ is an output of length 1.
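The sliding-window construction of input-output pairs can be sketched directly. The function name and the tuple representation of inputs are illustrative choices; `n` is the window width, so each input has `n - 1` nodes and each output is the single node that follows.

```python
def make_training_pairs(sequence, n):
    """Slide a window of width n over a cell-identifier sequence.

    Each window yields (first n-1 nodes, last node) as one training pair.
    Sequences shorter than n yield no pairs.
    """
    if n < 2:
        raise ValueError("window width must be at least 2")
    return [(tuple(sequence[i:i + n - 1]), sequence[i + n - 1])
            for i in range(len(sequence) - n + 1)]
```

A sequence of four cells with a window of width 3 produces two pairs; applying the function to every trajectory and concatenating the results gives the training set.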
The GRU model is trained on the training set $S$. In the generation process, the GRU takes the last $n-1$ nodes of the given sequence as input, and DTG appends the output of the GRU to the end of the given sequence. DTG repeats this process iteratively to generate synthetic trajectories.
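The iterative generation loop can be sketched independently of the model. In the sketch below, `predict_next` is a hypothetical stand-in for the trained GRU: any callable that maps a context tuple to the next cell identifier. The names `generate`, `context_len`, and `length` are likewise assumptions made for illustration.

```python
def generate(seed, predict_next, context_len, length):
    """Extend `seed` to `length` nodes by repeatedly predicting the next node.

    predict_next: callable taking the last `context_len` nodes as a tuple
    and returning the next node (a stand-in for the trained model).
    """
    sequence = list(seed)
    if len(sequence) < context_len:
        raise ValueError("seed must contain at least context_len nodes")
    while len(sequence) < length:
        context = tuple(sequence[-context_len:])
        sequence.append(predict_next(context))  # append the prediction and repeat
    return sequence
```

For instance, with a toy predictor that returns the last cell identifier plus one, `generate([1, 2], lambda ctx: ctx[-1] + 1, 2, 5)` extends the seed to five nodes.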
4.3. Differentially Private Gradient Descent Algorithm
This section describes the algorithm by which DTG combines the differential privacy mechanism with the gradient descent algorithm. The form of empirical risk minimization (ERM) in machine learning is shown in equation (10).
$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i, y_i) \tag{10}$$
Our target is to make the training procedure satisfy differential privacy, so that the output model parameter $\theta$ has a similar distribution for any neighboring datasets $D$ and $D'$. To protect privacy, a differential privacy mechanism could be applied at any step of the machine learning process, from processing the input data to generating the output data. We choose to add noise during gradient descent for better availability. This method is called optimization perturbation.
Algorithm 1 shows the process of minimizing the loss function by adjusting the parameter $\theta$. During training, DTG calculates the gradient of the loss function for each pair of input and output data, truncates the gradient, and adds noise to the truncated gradient with the Gaussian mechanism. Finally, DTG updates the parameters.

We adopt the $(\varepsilon, \delta)$-DP mechanism, which provides looser privacy protection than $\varepsilon$-DP, to improve the availability of the generated data. Algorithm 1 shows the training process of one epoch, and we choose to divide $\varepsilon$ and $\delta$ equally between the parameters of each epoch, sample, and layer of the model.
DTG sets the threshold as $C$ and truncates the gradients. If the $\ell_2$ norm of a gradient $g$ is equal to or greater than $C$, the gradient is replaced by $g \cdot C / \lVert g \rVert_2$, and if the norm is less than $C$, the gradient is preserved. The global sensitivity of the function is $C$.
DTG adopts the Gaussian mechanism to add noise. As required by the Gaussian mechanism, when $\sigma \ge \sqrt{2 \ln(1.25/\delta)}\, \Delta f / \varepsilon$, the noise drawn from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ satisfies $(\varepsilon, \delta)$-differential privacy. Therefore, the noise added to the gradient provides privacy protection and meets the differential privacy requirements. DTG allocates the privacy budgets $\varepsilon$ and $\delta$ to the epochs equally, and the per-epoch budgets $\varepsilon_i$ and $\delta_i$ satisfy $\varepsilon = \sum_i \varepsilon_i$ and $\delta = \sum_i \delta_i$. DTG adds Gaussian noise $\mathcal{N}(0, \sigma^2)$, where $\sigma = \sqrt{2 \ln(1.25/\delta_i)}\, C / \varepsilon_i$, to the gradient in each epoch. According to the composition theorem of differential privacy, the GRU satisfies $(\varepsilon, \delta)$-differential privacy.
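The truncate-then-perturb update described in this section can be sketched as follows. This is a minimal illustration in the style of differentially private gradient descent, not a reproduction of Algorithm 1: the function names, the averaging over examples, and the treatment of parameters as a flat list are assumptions. The noise standard deviation is computed from the per-step budget with the classical Gaussian-mechanism bound, using the clipping threshold as the sensitivity.

```python
import math
import random

def clip(gradient, clip_norm):
    """Scale the gradient so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clip_norm:
        return list(gradient)
    return [g * clip_norm / norm for g in gradient]

def noise_std(clip_norm, epsilon, delta):
    """Noise scale from the classical bound, with sensitivity = clip_norm."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * clip_norm / epsilon

def dp_step(params, per_example_grads, lr, clip_norm, std, rng):
    """One noisy gradient-descent update on a flat parameter list."""
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # Gaussian noise is added to the sum of the truncated gradients.
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    count = len(per_example_grads)
    return [p - lr * g / count for p, g in zip(params, noisy)]
```

A training loop would call `dp_step` once per batch, with `std = noise_std(C, eps_step, delta_step)` computed from the budget allocated to that step.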
4.4. Privacy Analysis
If we adopt the differential privacy mechanism at some step in the process of DTG from data input to output and ensure that no later step accesses the data from before that step, the whole process is guaranteed, in theory, to satisfy differential privacy.
We take $(\varepsilon, \delta)$ as the overall privacy budget and allocate it to the calculations in the optimization process of the neural network, so that the privacy budget of each epoch is $(\varepsilon_i, \delta_i)$. In the training process, adding noise in an epoch consumes the allocated privacy budget $(\varepsilon_i, \delta_i)$, and the overall perturbation consumes the total privacy budget $(\varepsilon, \delta)$, where $\varepsilon = \sum_i \varepsilon_i$ and $\delta = \sum_i \delta_i$. Because the calculation of each round is based on the same input dataset, this process satisfies the conditions of the serial composition theorem. Based on the serial composition theorem, the whole process also satisfies $(\varepsilon, \delta)$-differential privacy.
The subsequent operation on the processed data does not affect the sequential combination principle of differential privacy.
5. Experiment Evaluation
We evaluate the availability of the generated data in terms of geographic and semantic characteristics and select peer trajectory generation tools as competitors.
5.1. Experiment Setup
We experiment on the real datasets T-Drive [50, 51], Portotaxi [3], and Swedishtaxi [52]. The T-Drive trajectory dataset records the trajectories of 10,357 taxis in Beijing, China, during one week. The Portotaxi dataset describes the trajectories of all 442 taxis in Porto, Portugal, during a whole year (July 1, 2013, to June 30, 2014). The Swedishtaxi dataset contains the trajectory data of Swedish taxis during October and November 2018. We sample the T-Drive trajectory data at a sampling interval of 300 seconds and obtain a total of 7,881 trajectories containing 90,590 positions. We randomly select 3,962 trajectories from the Portotaxi dataset, with 191,868 positions in total, and 10,000 trajectories from the Swedishtaxi dataset. These are used as the experiment datasets for model training. Considering the average length of the trajectories, the length of the subtrajectories should not exceed half of the average length, and it should not be so short that the model cannot capture long-term dependence. Therefore, the length of the subtrajectories is set to 5 for the T-Drive dataset and to 20 and 6 for the Portotaxi and Swedishtaxi datasets, respectively.
To verify the effectiveness of the proposed method, we compare it with the current mainstream private trajectory generation models DP-Star [6], AdaTrace [20], and RNN [22]. Because DP-Star and AdaTrace differ little in actual performance, they are combined into a differentially private Markov (DPMarkov) model for the experiments. In addition, to further demonstrate the performance of the proposed method, this section takes the differentially private RNN model (DPRNN) as the performance baseline. To eliminate the influence of the grid method on the experiment results, all three models are trained and generate data based on the private trajectory grid method proposed in this paper.
In the experiment, we compare the performance of the mechanisms under different privacy budgets $\varepsilon$ and $\delta$. Undersized privacy budgets add massive noise to the gradient, so that the model cannot converge, while oversized privacy budgets increase the possibility of privacy disclosure. Therefore, the privacy budget $\varepsilon$ ranges between 0.01 and 10 so that the various methods can be compared effectively. The availability of the generated data is quantified by its geographic and semantic similarity to the original data, including the point distribution error, diameter error, region query error, and frequent pattern error. The experiment is run on the Google Colab platform, configured with an Intel(R) Xeon(R) 2.20 GHz processor and 12991 MB of memory, under Linux version 5.4.144. The simulation experiment is implemented with PyTorch 1.8.1 and Python 3.7.10.
5.2. Utility Metrics
We generate the synthetic dataset $D_{syn}$ from the original trajectory sequence dataset $D$. Because the generated data differ from the original data, results computed by a data user on the synthetic data will deviate from those computed on the real data. The smaller this error, the higher the availability of the generated data is considered to be. Since data availability is difficult to define directly, we approximate it by the similarity between the generated data and the original data. We use four evaluation metrics to quantify this similarity, and in this experiment, we calculate the similarity between the original dataset and the trajectory dataset generated by each method. The metrics are the point distribution error metric, frequent pattern error metric, region query error metric, and diameter error metric.
5.2.1. The Point Distribution Error Metric
The point distribution error metric directly measures the similarity of the point distributions of two trajectory data sets in the same geographic region. Its advantage is that it reflects the spatial density of locations in the geographic space; its disadvantage is that it cannot reflect the context between locations within a trajectory. The point distribution error is defined as the Jensen-Shannon divergence (JSD) between the point distributions of the original and synthetic data sets. JSD measures the similarity of two probability distributions. It is a symmetric variant of KL divergence whose value ranges from 0 to 1. We calculate the probability distribution of points for each data set as the ratio of the number of locations in each region to the total number of locations and then measure the difference between the two distributions by JSD.
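The computation described above can be sketched as follows. This is a minimal illustration, assuming trajectories are already discretized into lists of grid-cell identifiers; the function names and the toy data are hypothetical, and base-2 logarithms are used so that the divergence stays within 0 and 1.

```python
import math
from collections import Counter


def point_distribution(trajectories, cells):
    """Share of all visited locations that fall in each grid cell."""
    counts = Counter(cell for traj in trajectories for cell in traj)
    total = sum(counts.values())
    return [counts[c] / total for c in cells]


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logarithms (value in [0, 1])."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy data: each trajectory is a sequence of grid-cell identifiers.
real = [[0, 1, 1, 2], [1, 2, 3]]
synthetic = [[0, 1, 2, 2], [1, 3, 3]]
cells = sorted({c for traj in real + synthetic for c in traj})

error = jsd(point_distribution(real, cells), point_distribution(synthetic, cells))
print(round(error, 4))
```

Identical distributions give an error of 0, and distributions with disjoint support give an error of 1.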
5.2.2. The Diameter Error Metric
The diameter of a trajectory is a good evaluation metric with practical value [18] because it reflects the user's range of movement. The diameter of a trajectory is defined as the maximum distance between any two points in the trajectory. We use an equal-width histogram to count the distribution of diameters. In this paper, the diameter histogram has 50 equal-width intervals, and the width of each interval is the range of the diameters (maximum minus minimum) divided by 50. Then, we use JSD to measure the distance between the real diameter distribution and the synthetic diameter distribution as the diameter error metric.
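A minimal sketch of the diameter error follows. It assumes planar (x, y) coordinates with Euclidean distance, and it assumes that both histograms share bin edges spanning the joint minimum and maximum diameter; neither detail is stated in the text, and all function names are hypothetical.

```python
import math
from itertools import combinations


def diameter(trajectory):
    """Maximum Euclidean distance between any two points of a trajectory."""
    return max(
        (math.hypot(a[0] - b[0], a[1] - b[1]) for a, b in combinations(trajectory, 2)),
        default=0.0,
    )


def histogram(values, low, high, bins=50):
    """Normalized equal-width histogram of `values` over [low, high]."""
    width = (high - low) / bins or 1.0  # guard against a zero range
    counts = [0] * bins
    for v in values:
        counts[min(int((v - low) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logarithms."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def diameter_error(real, synthetic, bins=50):
    """JSD between the diameter histograms of two trajectory datasets."""
    d_real = [diameter(t) for t in real]
    d_syn = [diameter(t) for t in synthetic]
    low, high = min(d_real + d_syn), max(d_real + d_syn)
    return jsd(histogram(d_real, low, high, bins), histogram(d_syn, low, high, bins))


real = [[(0, 0), (3, 4), (1, 1)], [(0, 0), (1, 0)]]
synthetic = [[(0, 0), (2, 2)], [(0, 0), (6, 8)]]
print(diameter_error(real, synthetic))
```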
5.2.3. The Region Query Error Metric
Region query refers to counting the number of trajectories in a query region. Region query error (RQE) measures the relative error between the number of trajectories of the generated dataset passing through a region and that of the real data. The definition of the region query error metric is shown in equation (11). For a query area, a trajectory is considered to pass through the area if any of its positions lies within it. A counting function gives the number of trajectories that pass through the area, and the same counting is applied to the real dataset. This section divides the geospatial region of the dataset into 625 (25 × 25) query regions and takes the mean of the region query errors over all query regions as the region query error of the generated dataset.
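The region query procedure can be sketched as below. Because the exact form of the relative error is not reproduced here, the normalization by the real count and the `sanity_bound` that avoids division by zero are assumptions, as are the function names and the rectangular half-open regions.

```python
def count_passing(trajectories, region):
    """Number of trajectories with at least one point inside the region.

    The region is (x_min, y_min, x_max, y_max); upper edges are excluded.
    """
    x_min, y_min, x_max, y_max = region
    return sum(
        any(x_min <= x < x_max and y_min <= y < y_max for x, y in traj)
        for traj in trajectories
    )


def region_query_error(real, synthetic, bounds, side=25, sanity_bound=1):
    """Mean relative count error over a side x side grid of query regions."""
    x0, y0, x1, y1 = bounds
    dx, dy = (x1 - x0) / side, (y1 - y0) / side
    errors = []
    for i in range(side):
        for j in range(side):
            region = (x0 + i * dx, y0 + j * dy, x0 + (i + 1) * dx, y0 + (j + 1) * dy)
            q_real = count_passing(real, region)
            q_syn = count_passing(synthetic, region)
            # sanity_bound (an assumption) keeps the denominator positive.
            errors.append(abs(q_real - q_syn) / max(q_real, sanity_bound))
    return sum(errors) / len(errors)


real = [[(0.5, 0.5)]]
synthetic = [[(1.5, 0.5)]]
print(region_query_error(real, synthetic, bounds=(0, 0, 2, 2), side=2))
```

With `side=25` the function evaluates the 625 query regions used in this section.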
5.2.4. The Frequent Pattern Error Metric
We use the frequent pattern error metric to measure the difference between datasets [6] because most services, such as urban traffic and advertising, rely on users' frequent patterns. The evaluation selects the top-k most frequent trajectory segments in the data set, counts the percentage of each segment in the whole data set, and then compares the resulting distributions of the real and synthetic data sets. We define the trajectory pattern as an ordered sequence of cells; for example, . In the following experiment, the sequence lengths of the frequent patterns for the TDrive, Portotaxi, and Swedishtaxi datasets are 4, 19, and 6, respectively.
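A minimal sketch of this evaluation is given below. The text does not state which distance is used to compare the two pattern distributions, so the mean absolute difference of pattern shares used here is an assumption, as are the function names; patterns are taken to be contiguous segments of a fixed length.

```python
from collections import Counter


def pattern_counts(trajectories, length):
    """Count every contiguous segment of `length` cells across all trajectories."""
    counts = Counter()
    for traj in trajectories:
        for i in range(len(traj) - length + 1):
            counts[tuple(traj[i:i + length])] += 1
    return counts


def frequent_pattern_error(real, synthetic, length, k):
    """Mean absolute difference in share for the top-k patterns of the real data."""
    real_counts = pattern_counts(real, length)
    syn_counts = pattern_counts(synthetic, length)
    total_real = sum(real_counts.values()) or 1
    total_syn = sum(syn_counts.values()) or 1
    top = [pattern for pattern, _ in real_counts.most_common(k)]
    diffs = [
        abs(real_counts[p] / total_real - syn_counts[p] / total_syn) for p in top
    ]
    return sum(diffs) / len(diffs)


# Toy data: trajectories as sequences of grid-cell identifiers.
real = [[1, 2, 3], [1, 2, 3], [2, 3, 4]]
synthetic = [[1, 2, 3]]
print(frequent_pattern_error(real, synthetic, length=2, k=2))
```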
5.3. Comparison with Existing Generators
As shown in Figure 7, we first compare the loss and accuracy of DTG over 50 epochs and 100 epochs when the privacy budget is 10. When DTG does not use the differential privacy mechanism, its accuracy reaches a stable maximum after 50 epochs. Each epoch costs in 100 epochs and in 50 epochs. As the number of epochs increases, the amount of noise added in each round also increases. The final values of the loss function after 50 epochs and 100 epochs are similar, so simply increasing the number of epochs does not improve the performance of the model. Therefore, in subsequent experiments, the number of epochs for DTG and DPRNN is set to 50.
Figure 8 shows the accuracy of different models in predicting the next location. The input of the model is a trajectory sequence, and the output is the next location of the sequence. The length of the input sequence is 4 for the TDrive dataset, 19 for the Portotaxi dataset, and 6 for the Swedishtaxi dataset. Because the first-order Markov model performs better than higher-order Markov models, we adopt the first-order Markov model in the experiment; it predicts the next location from the last location of the sequence. As can be seen from Figure 8, on the TDrive, Portotaxi, and Swedishtaxi datasets, increasing the privacy budget does not significantly improve the accuracy of DPMarkov, whereas the accuracy of DTG and DPRNN improves significantly. When the privacy budgets are 0.01 and 0.1, the accuracy of DTG and DPRNN is lower than that of DPMarkov. DTG gradually surpasses DPMarkov when the privacy budget is greater than or equal to 0.5, and the accuracy of DPRNN gradually exceeds that of DPMarkov when the privacy budget is greater than or equal to 0.1. In addition, the accuracy of DTG is higher than that of DPRNN under different privacy budgets. It can be concluded that the Markov model has a weak ability to extract the features of trajectory data, but its accuracy is not significantly reduced by more differential privacy noise; DTG and DPRNN have strong modelling ability for trajectory data but are sensitive to noise. Excessive noise significantly degrades the performance of the deep learning models, even below that of the Markov model. When the privacy budget is large, the performance of DTG is much higher than that of the Markov model.
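The way a first-order Markov model predicts the next location from only the last location of the input sequence can be sketched as follows. This is a non-private illustration: it omits the differential privacy noise of the DPMarkov baseline, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_first_order(trajectories):
    """Count transitions between consecutive grid cells."""
    transitions = defaultdict(Counter)
    for traj in trajectories:
        for current, nxt in zip(traj, traj[1:]):
            transitions[current][nxt] += 1
    return transitions


def predict_next(transitions, sequence):
    """Predict the most frequent successor of the last cell in the sequence."""
    last = sequence[-1]
    if last not in transitions:
        return None  # the last cell was never seen with a successor
    return transitions[last].most_common(1)[0][0]


train = [[1, 2, 3], [1, 2, 4], [5, 2, 3]]
model = fit_first_order(train)
# Only the final cell (2) matters; the earlier cells (9, 1) are ignored.
print(predict_next(model, [9, 1, 2]))
```

Because the prediction depends on a single preceding cell, the model cannot exploit longer context, which is consistent with its limited accuracy gains at large privacy budgets.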
We compare how the loss and accuracy evolve with the number of epochs under different privacy budgets to verify the impact of different noise levels on the deep learning models. In Figure 9, DTG and DPRNN cannot converge when , and the convergence speed is slow when . The models converge normally when . In addition, under different privacy budgets, DTG converges faster than DPRNN. In Figure 10, the models have good accuracy in predicting the next location of the sequence when , and DTG is better than DPRNN under the same privacy budget.
Figures 11–14 show the similarity between the generated and original trajectory data under different privacy budgets, measured by four metrics: point distribution error, diameter error, region query error, and frequent pattern error. The data generated by DPMarkov are less similar to the original data than those of DTG and DPRNN on all four metrics, and the similarity of DPMarkov’s generated trajectories changes little across privacy budgets.
As the privacy budget increases, the four error metrics of DTG and DPRNN gradually decrease, which indicates that the generated trajectory data become more similar to the original data. When , the similarity of the data generated by DTG is significantly improved compared with that when . Combined with the model performance in Figures 9 and 10, this shows that when the privacy budget is small, the error between the generated data and the original trajectories is large. In addition, except when in the TDrive dataset, the quality of the data generated by DTG is better than that of DPRNN.
In conclusion, by comparing with DPMarkov and DPRNN, experiments show that DTG performs better under the same privacy budgets in most cases. DTG has the advantage of extracting the temporal and spatial relationship of trajectory data.
6. Conclusion
This paper presents DTG, a trajectory generation model that achieves high privacy and high availability by combining differential privacy and deep learning. In DTG, we first normalize the original trajectory dataset by sampling at a fixed interval. Second, we transform the spatial coordinates into grid identifiers by partitioning the geospatial region with the private hierarchical adaptive grid model. This model is density-aware and protects the privacy of the dataset’s spatial correlation with the Laplace mechanism. Third, DTG adopts the differentially private gradient descent algorithm to protect the privacy of the dataset’s temporal correlation by perturbing gradients. We experimentally compare DTG with the state-of-the-art approaches in trajectory generation on four metrics, and DTG performs better in generating synthetic trajectories.
In the field of trajectory synthesis, we plan to extend this work in the following directions: (i) more suitable deep learning models for trajectory generation that better capture human mobility patterns, especially stopping points; (ii) more efficient privacy budget allocation mechanisms to improve the availability of the generated data; and (iii) an enhanced grid method to address data sparseness.
Data Availability
The experiment data used to support the findings of this study are included in the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NSFC), Nos. 61941114, 61872836, 62001055, and 61802025.