Abstract

With the rapid development of mobile Internet and communication technology, location-based services (LBS) are widely used in daily life. LBS servers store large amounts of user location data, and these data constitute user trajectories. If the trajectory information on a server is leaked, users’ privacy is seriously endangered. Trajectory k-anonymity is one of the most important methods for protecting trajectory privacy. However, current trajectory k-anonymity methods pay little attention to the semantics of stop points when selecting dummy trajectories. As a result, an attacker can still exclude dummy trajectories from the k-anonymity set and infer the real trajectory by combining background knowledge with the semantic information of stop points. To address this problem, this paper decomposes the real trajectory into a set of location pairs formed from its start-end points and stop points, and uses similar location pairs from the history trajectory set to generate dummy trajectories. First, the start-end points and stop points are extracted from the real trajectory and assigned semantics. Then, based on semantic, temporal, and geographical attributes, eligible location pairs are selected from the history trajectory set to construct an equivalence class. Finally, dummy trajectories are generated from the location pairs in the equivalence class to form a k-anonymity set. We evaluate our method on a real dataset. The results show that it achieves effective data availability and stronger privacy protection than the compared methods.

1. Introduction

In recent years, with the rapid development of 5G technology and the Internet of Things (IoT), the smart city is gradually becoming a reality. As an important cornerstone of the smart city, location-based services (LBS) are used in more and more areas, such as check-in, road conditions, and social networking [1, 2]. When a user requests an LBS service, the user submits the current location to the service provider, which stores the user’s location sequence on its server as a trajectory. Some service providers regularly release trajectory data to governments and research institutions for analysis and mining [3–5]. For example, the U.S. government updates and optimizes transportation facilities based on users’ GPS trajectory data [6], and the Chinese government constructs epidemic prevention maps based on COVID-19 patients’ trajectories.

However, the servers that store trajectory data are not absolutely secure. If a service provider is attacked, the trajectory data may be leaked without any protection. By analysing the spatial-temporal information in a user’s trajectory and combining it with background knowledge, an attacker can deduce the user’s hobbies, mobility patterns, health status, work and home addresses, and other personal information, which can cause economic losses and even threaten the user’s personal safety. Accordingly, how to better protect trajectory privacy has attracted considerable attention from researchers worldwide.

The existing trajectory privacy protection methods can be divided into three categories: trajectory k-anonymity [7–14], suppression [15, 16], and differential privacy [17–21]. The suppression method assumes that the attacker has some specific background knowledge and protects trajectory privacy by suppressing sensitive information in the trajectory. However, suppression requires the sensitive information to be determined in advance; if it is not set properly, data availability can be seriously damaged. Differential privacy ensures unconditional privacy, i.e., individual information cannot be obtained by analysing specified statistical data. However, differential privacy can only protect a limited amount of information. Compared with these two methods, trajectory k-anonymity generates dummy trajectories to transform the 1 : 1 relationship between user and trajectory into a relationship between multiple trajectories and one user. It is the main method for trajectory privacy protection because of its simple implementation and flexible application scenarios.

How to choose the dummy trajectories is the key issue in trajectory k-anonymity. For availability, the dummy trajectories in the k-anonymity set should retain as much usable information as possible. For privacy, the dummy trajectories and the real trajectory must be indistinguishable. Based on these two demands, many trajectory k-anonymity methods have been proposed [12, 14, 22, 23]. These methods calculate similarity over all location points in the trajectory, which leads to a large computational cost and low data availability after privacy processing. In fact, not every location point in a trajectory is necessary for privacy protection [24]. It is the stop points in the user’s trajectory that really reveal the user’s privacy. Following this idea, many methods calculate the similarity between trajectories based on their stop points, which reduces the amount of computation while maintaining a level of privacy [13]. However, the above methods do not consider the impact of semantics on trajectory privacy protection. In a trajectory, stop points combined with semantic attributes reveal the user’s identity and action patterns. By analysing the semantic information of stop points, an attacker can easily identify certain dummy trajectories in the k-anonymity set or even obtain the user’s real trajectory directly.

As shown in Figure 1, suppose Tom, an employee of a company, leaves the company at 18:00 on Wednesday to watch a movie at the cinema and then returns to his home. Tom’s trajectory is represented by . According to the trajectory k-anonymity method, we generate the dummy trajectories and to protect Tom’s trajectory. The three trajectories are highly similar in trajectory shape, geographic location, and overall direction. However, by analysing the trajectories’ stop points, an attacker can still find differences between the three trajectories. By extracting stop points from the three trajectories, the attacker can get the semantic trajectories of the three trajectories as: ; . Compared with and , has only one stop point, and the hospital’s working hours are from 8:00 to 18:00, so the attacker speculates that is likely to be a dummy trajectory. In addition, although both and have three stop points, the attacker knows from background knowledge that Tom is an employee of a company. So, the attacker can infer that is more consistent with Tom’s action pattern and then determine that is the real trajectory.

To solve the problem that low trajectory utilization and neglected stop point semantics cause trajectory k-anonymity to fail, this paper proposes a trajectory privacy protection method based on location pair reorganization (DSTPP). Specifically, we decompose the real trajectory into a set of location pairs consisting of start-end points and stop points. For each location pair, we select eligible location pairs from the history trajectory set to construct a candidate location set. Finally, we use the location pairs in the candidate location set to generate dummy trajectories that satisfy the similarity measures; these dummy trajectories form a k-anonymity set.

The main contributions of this paper are as follows:
(i) We design a candidate location set generation method. For each location pair in the real trajectory, eligible location pairs are selected from the history trajectory dataset according to the defined location pair similarity and added to the candidate location. Location pairs in the candidate location have high spatial-temporal and semantic similarity and can be used to generate dummy trajectories that match user action patterns.
(ii) We design a dummy trajectory generation method that conforms to user action patterns. According to the similarity measures, each dummy trajectory is similar to the real trajectory in terms of geography, semantics, and direction.
(iii) We evaluate the privacy and availability of DSTPP against two similar methods [13, 25] on a real dataset [26–28]. The experimental results show that the trajectory k-anonymity set constructed by DSTPP meets the privacy protection requirements and has high data availability.

The rest of this paper is structured as follows. Related work is reviewed in Section 2. Section 3 describes the relevant concepts and similarity measures. Section 4 elaborates the trajectory privacy protection method based on location pair reorganization. In Section 5, we compare our method with existing solutions in terms of availability and privacy. Section 6 concludes the paper.

2. Related Work

Typical privacy-preserving methods include suppression, generalization, and perturbation. Among them, k-anonymity techniques based on generalization are widely used in trajectory privacy protection. The k-anonymity model, proposed by Sweeney [29] in 2002, is the first complete model of privacy protection. This model prevents an attacker from uniquely identifying a specific user in the dataset, making it impossible to obtain further accurate information about that user. Gruteser and Grunwald [30] first applied k-anonymity techniques to LBS services. To protect user privacy, they replace the user’s exact location point with a location region that contains k location points, so the probability of a user being identified is reduced. However, this method cannot resist the privacy leakage caused by trajectory data leakage, nor can it resist background knowledge attacks. For this reason, the trajectory k-anonymity method was developed. This type of method constructs a k-anonymity set that includes k−1 dummy trajectories and the real trajectory. The k trajectories in the anonymous set are indistinguishable, which reduces the identification probability of the real trajectory to 1/k. According to the way dummy trajectories are generated, the existing trajectory k-anonymity methods can be divided into two categories: the local method and the integral method.

For a trajectory, the user really cares about certain specific geographic locations, not all locations. Based on this idea, the local method was proposed. The local method protects only the trajectory’s sensitive locations by k-anonymity, not the whole trajectory. Pan et al. [31] considered the user’s movement direction and velocity when generating the generalized region, ensuring the privacy of the user trajectory while improving the service quality. Zhang et al. [9] proposed a double-K mechanism to protect the user’s sensitive locations. They send the user’s sensitive location and fake locations to anonymizers, and each anonymizer then performs k-anonymization for one location. This method provides a higher degree of privacy, but it has a high computational cost and lower trajectory availability. To address the high computational cost, Zhou and Wang [11] combined fog computing and k-anonymity to reduce the computational consumption of generating k-anonymity sets. Zhao et al. [10] regard the start-end points of the user’s trajectory as sensitive locations. Based on the user’s behaviour, they generate a secure start-end candidate point set for constructing dummy trajectories. Ye et al. [32] propose protecting location points within sensitive areas; they build a cloaking region for each sensitive area, which contains other POIs similar to the sensitive place, and randomly select one to replace the sensitive place. However, the local method requires sensitive locations to be preset. If the sensitive locations are set improperly, data availability can be seriously affected. In addition, user privacy can also be leaked through location points in nonsensitive areas.

To overcome the defects of the local method, the integral method was proposed. The integral method selects dummy trajectories that are similar to the real trajectory to form a k-anonymity set. Xu et al. [12] evaluated trajectory similarity based on four features: angle, velocity, time, and space. They then selected historical trajectories that were similar to the real trajectory to form a k-anonymity set. Wang et al. [22] propose constituting the k-anonymous set by exchanging the locations of neighbour nodes on the k-core subnet of the relationship network. But this method ignores the privacy needs of different location points, which leads to insufficient availability. According to the spatial and temporal characteristics of trajectory data, Li et al. [23] propose a data partitioning method to store and process trajectories, which reduces the amount of computation. Liu et al. [25] generated dummy locations for each location point in the real trajectory and randomly generated dummy trajectories based on these dummy locations. The dummy trajectories generated by this method contain some unreachable positions, so the data availability and privacy are insufficient. Dai et al. [33] proposed simplifying the real trajectory into a trajectory consisting only of stop points. A k-anonymity set containing semantically similar location points is then constructed for each stop point, and dummy trajectories are generated randomly from these k-anonymity sets. Because this method only considers stop points, the generated dummy trajectories may pass through unreachable locations. Moreover, it does not consider whether the direction of a dummy trajectory is similar to that of the real trajectory. Xu et al. [13] evaluated trajectory similarity based on the number of stop points and the average velocity, and then selected historical trajectories that are similar to the real trajectory to form a k-anonymity set.

The integral method selects either history trajectories or fake trajectories as dummy trajectories. Although a fake trajectory can satisfy the similarity requirement, it may pass through or end at unreachable locations, so availability and privacy are insufficient. If history trajectories are used, there may not be enough trajectories; moreover, even when a whole trajectory does not satisfy the similarity requirement, some of its segments could still be utilized, but the integral method cannot make use of these segments.

In conclusion, existing dummy trajectory selection methods suffer from low trajectory utilization and pay little attention to the semantics of stop points, so an attacker with background knowledge can still exclude dummy trajectories from the k-anonymity set. To solve this problem, this paper constructs dummy trajectories from location pairs. The location pairs consist of stop points in the history trajectory set. Because the history trajectories are taken from real users, the dummy trajectories do not pass through or reach unreachable locations. At the same time, we define semantic, geographical, and directional similarities, which ensure the similarity between a dummy trajectory and the real trajectory. This method can effectively resist background knowledge attacks and improve data availability.

3. Preliminaries

Table 1 describes the system parameters used in this paper.

3.1. Related Concepts

Definition 1. GPS trajectory: the GPS trajectory can be represented as a polyline in three-dimensional space (two-dimensional coordinates and time dimension), denoted as , where indicates that the position of the trajectory at time , and is the number of sampling points of real trajectory .

The location points in real trajectory can be divided into two categories: moving point and stop point. Stop point is a location where the user stays (speed is 0 and lasts for a period of time) or visits (speed is not 0, but repeatedly wanders around a location). Moving point is a location that the user simply passes through. The specific description of stop point is given by Definition 2.

Definition 2. Stop point: a stop point is a geographic area where user stays within a distance threshold longer than a time threshold . In the real trajectory , the subtrajectory is denoted as , where, , , and . Then, can be combined into a stop point . Denote as: =, where: represent the latitude and longitude of stop point , respectively. and denote the longitude and latitude of location point .

denotes the time of entering the stop point , denotes the time to leave the stop point , and denotes the stay time at stop point . The values are calculated using the first position point , the last position point , and the time difference between and in the , respectively. According to Definition 1 and Definition 2, trajectory can be composed of stop points and moving points between stop points.

Definition 3. Semantic category: the semantic category of location point can be represented by points of interest (POI). Each location point has an independent POI, and the semantic category can be obtained through Chinese POI standard [34].

According to the Chinese POI standard, the POI semantic category is divided into three levels: major category, middle category, and minor category. Each level is represented by a 2-digit code. The codes for the major category, the middle category, and the minor category are arranged in order to form a fixed 6-digit semantic code (, , and ), as shown in Figure 2.

Definition 4. Semantic location point: the semantic location point can be represented as , where denotes the center coordinates of semantic location point; denotes the semantic category of semantic location point, consisting of the triplet ; and represents the temporal characteristics of semantic location point, which consists of the triplet . The three attributes indicate visit time, leave time, and stay time, respectively.

Definition 5. Semantic trajectory: a semantic trajectory is an ordered list of start-end points and a set of semantic points:

Definition 6. Semantic distance : refers to the semantic difference between two semantic location points. It is expressed by semantic encoding difference of two semantic location points:

According to the specification:

When the semantic distance , it means that two semantics differ only in minor category. The two points have a high degree of similarity.

When the semantic distance , it means that two semantics differ only in middle category. The two points have a low degree of similarity.

When the semantic distance , it means that two semantics differ at the major category level. The two points have no similarity at all.
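The three-level code and the three distance cases above can be illustrated with a short sketch. The numeric distance values are not recoverable from the text here, so the sketch assumes a scale of 0 (identical), 1 (minor category differs), 2 (middle category differs), and 3 (major category differs). The function names and the sample codes are hypothetical.

```python
def split_code(code: str) -> tuple[str, str, str]:
    """Split a 6-digit POI code into its (major, middle, minor) 2-digit parts."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a 6-digit code, got {code!r}")
    return code[0:2], code[2:4], code[4:6]


def semantic_distance(code_a: str, code_b: str) -> int:
    """Assumed scale: 0 = same code, 1 = minor differs, 2 = middle differs, 3 = major differs."""
    major_a, middle_a, minor_a = split_code(code_a)
    major_b, middle_b, minor_b = split_code(code_b)
    if major_a != major_b:
        return 3  # different major category: no similarity
    if middle_a != middle_b:
        return 2  # same major, different middle category: low similarity
    if minor_a != minor_b:
        return 1  # only the minor category differs: high similarity
    return 0


# Hypothetical codes, used only to exercise the three cases.
print(semantic_distance("050101", "050102"))
print(semantic_distance("050101", "050201"))
print(semantic_distance("050101", "060101"))
```

Comparing the levels from major to minor means the coarsest difference decides the result, which matches the ordering of the three cases in the text.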

Definition 7. Geographical distance : refers to the geographical difference between two semantic location points, which can be calculated by the Euclidean distance:

Definition 8. Location pair similarity [14]: suppose and are the location pairs in real semantic trajectory and semantic trajectory equivalence class , respectively. Two location pairs satisfy the location pair similarity if they satisfy the following conditions:
(1)
(2)
(3)
where , , and are the semantic distance threshold, geographic distance threshold, and time threshold, respectively. Conditions (1) and (2) ensure that the location pairs are semantically and geographically similar. Condition (3) ensures that the two location pairs are similar in velocity and time period by bounding the location time difference.
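A minimal sketch of the three-condition check in Definition 8 follows. The exact inequalities are not recoverable from the text, so the sketch assumes that each endpoint of the candidate pair must be within the semantic, geographic, and time thresholds of the corresponding endpoint of the real pair. `SemPoint`, `pairs_similar`, the 0 to 3 semantic scale, and all sample values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SemPoint:
    x: float        # planar x coordinate in metres (assumed already projected)
    y: float        # planar y coordinate in metres
    code: str       # 6-digit POI semantic code
    t_visit: float  # visit time in seconds since midnight


def semantic_distance(a: str, b: str) -> int:
    """Assumed scale: 3/2/1 when the first difference is at major/middle/minor level."""
    for level, start in ((3, 0), (2, 2), (1, 4)):
        if a[start:start + 2] != b[start:start + 2]:
            return level
    return 0


def geo_distance(p: SemPoint, q: SemPoint) -> float:
    """Euclidean distance between two points, as in Definition 7."""
    return math.hypot(p.x - q.x, p.y - q.y)


def pairs_similar(real_pair, cand_pair, sem_max, geo_max, time_max) -> bool:
    """Check the three conditions endpoint by endpoint (assumed interpretation)."""
    for p, q in zip(real_pair, cand_pair):
        if semantic_distance(p.code, q.code) > sem_max:
            return False
        if geo_distance(p, q) > geo_max:
            return False
        if abs(p.t_visit - q.t_visit) > time_max:
            return False
    return True


office = SemPoint(0, 0, "170100", 64800)
cinema = SemPoint(1000, 0, "080601", 66600)
other_office = SemPoint(30, 40, "170100", 65000)
other_cinema = SemPoint(1000, 100, "080602", 66000)
print(pairs_similar((office, cinema), (other_office, other_cinema), 1, 200, 900))
```

Returning early on the first violated condition keeps the check cheap, which matters because it is evaluated for every location pair in the equivalence class.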

3.2. Similarity Measure

In the k-anonymity set, we want the generated dummy trajectories to be semantically and geographically indistinguishable from the real trajectory. In this subsection, geographic similarity and semantic similarity are proposed to evaluate the differences between a dummy trajectory and the real trajectory.

We first give two theorems and utilize them to prove that the difference between two probability distributions is the expectation of the difference between the features of the corresponding positions.

Theorem 9. Suppose is a random variable with probability distribution . is a set of random variables, the corresponding probability distribution is . For each , there is a probability distribution difference with , the probability distribution difference between and is denoted by . The probability of the probability distribution difference between and is . Therefore, the probability distribution difference between and is the expectation of each difference, i.e.,

Proof. and are functions of variables and . When the value of remains unchanged, the results of two functions depend only on . If is also deterministic, then both and are constants. That is, there is a one-to-one correspondence between and . Therefore, can be regarded as the probability of in all probability distributions difference. Accordingly, is the expectation of probability distribution difference and its probability .

Theorem 10. Suppose and are two sets of random variables with probability distributions and , where and . For and , the probability distributions difference of and is . The probability of probability distributions difference between and is . The probability distribution difference between and is the expectation of all , i.e.,

Proof. For , according to Theorem 9, the difference between and is . Then, . However, for , if , then and are 0. That means . Therefore, the probability distribution difference of and is the expectation of all .

For and real trajectory , semantic distance, geographic distance, and its corresponding probability distribution (the probability of distance difference) depend only on . Therefore, according to Theorem 10, we define geographical similarity and semantic similarity between and as the expectation of distance difference.

Definition 11. Visit probability: represents the probability of a user visiting location , which is calculated as follows: where represents the number of people visiting location and denotes the total number of people.

The larger the value of , the more people visit location , and the higher the probability that location will be visited.
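As an illustration of Definition 11, the visit probability of each location can be computed from a visit log as the number of distinct visitors divided by the total number of people. The log format, the function name, and the sample data below are assumptions made for this sketch.

```python
from collections import defaultdict


def visit_probabilities(visits, total_users):
    """visits: iterable of (user_id, location_id); returns {location_id: probability}.

    Each user is counted once per location, so repeated visits by the same
    person do not inflate the probability (an assumption of this sketch).
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    visitors = defaultdict(set)
    for user_id, location_id in visits:
        visitors[location_id].add(user_id)
    return {loc: len(users) / total_users for loc, users in visitors.items()}


log = [("u1", "cinema"), ("u2", "cinema"), ("u1", "cinema"), ("u3", "home")]
probs = visit_probabilities(log, total_users=4)
print(probs)
```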

Definition 12. Geographical similarity [14]: for and real trajectory , denotes the geographic distance between and , and the probability distribution of geographic distance is denoted by . The geographical similarity between and can be defined as the expectation of all :

The attacker’s goal is to infer the user’s real trajectory from the k-anonymity set. To achieve this goal, the attacker usually assumes a location (called the hypothetical location) to be the real location and evaluates the probability that this location is the true location using background knowledge. In the candidate location , we assume that is the hypothetical location. Without considering background knowledge, the probability that is the true location is denoted by . In this paper, there is an equal probability that any location in is assumed to be the true location, so, . For , the background knowledge that is the hypothetical location is attacker believes is the true location , i.e., the joint probability . So, the probability that is assumed to be is . The higher the probability, the higher the probability that is , which means that attacker believes that and are more similar. Therefore, we use this probability to calculate the probability distribution of geographic distance :

According to the above formula, the geographical similarity between and is calculated as follows: where is a constant used to normalize the geographical similarity value to lie in the range [0,1]. is the sum of the maximum location distance difference in each candidate location and user location :

Definition 13. Semantic similarity: for and real trajectory , denotes the semantic distance between and , and the probability distribution of the semantic distance is denoted by . The semantic similarity between and can be defined as the expectation of all :

Similar to Definition 12, for Equation (13), we still calculate by :

According to the above formula, the semantic similarity between and is calculated as follows: where is a constant used to normalize the semantic similarity value to lie in the range [0,1]. is the sum of the maximum semantic distance difference in each candidate location and user location :

Attacker can also analyse the trajectory’s movement direction in the published trajectory set. If a trajectory differs from other trajectories, attacker deduces that this trajectory is likely to be a dummy trajectory. Therefore, it is necessary to ensure the movement direction similarity between dummy trajectory and real trajectory . We use the least squares method to fit the slope of the overall trajectory movement direction and determine whether dummy trajectory direction is similar to real trajectory direction by the slope ratio. The slope is calculated as follows: where ; .

Definition 14. Directional similarity: for and real trajectory , according to Equation (17), calculate the slope of two trajectories and . The directional similarity between and is calculated as follows: The larger the value of , the more similar the overall direction of the two trajectories.
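The direction check can be sketched as follows. The slope uses the standard least squares formula for a line fitted through the trajectory points. The ratio in Definition 14 is not recoverable from the text, so the sketch assumes the ratio of the smaller to the larger absolute slope, with 0 for slopes of opposite sign; `fitted_slope` and `direction_ratio` are hypothetical names.

```python
def fitted_slope(points):
    """Least squares slope of y on x for a list of (x, y) points."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    denom = n * sum_xx - sum_x * sum_x
    if denom == 0:
        raise ValueError("slope undefined: all x values are identical")
    return (n * sum_xy - sum_x * sum_y) / denom


def direction_ratio(slope_a, slope_b):
    """Assumed similarity in [0, 1]: smaller |slope| over larger |slope|."""
    if slope_a == 0 and slope_b == 0:
        return 1.0  # both trajectories are horizontal
    if slope_a * slope_b <= 0:
        return 0.0  # opposite directions, or only one slope is zero
    low, high = sorted((abs(slope_a), abs(slope_b)))
    return low / high


real = [(0, 0), (1, 2), (2, 4)]
dummy = [(0, 1), (1, 5), (2, 9)]
print(direction_ratio(fitted_slope(real), fitted_slope(dummy)))
```

A slope-based comparison cannot represent vertical trajectories, which is why the sketch raises an error in that case rather than returning a value.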

4. Scheme Description

4.1. Scheme Framework

This scheme is designed to protect the user’s trajectory privacy by preventing an attacker from identifying the user’s real trajectory in the k-anonymity set. We construct dummy trajectories based on stop point location pairs to form the k-anonymity set, thus protecting user privacy and security. The scheme is divided into three steps:

(1) Trajectory preprocessing stage. Based on the start time and end time of real trajectory, the trajectories with similar time periods are selected from the history trajectory dataset . The trajectory equivalence class consists of these historical trajectories. According to Definition 2 and Definition 3, we extract stop points in real trajectory and the equivalent class , then assign semantics to the start-end points of and all stop points to generate the corresponding semantic trajectory and semantic trajectory equivalent class .

(2) Selection of candidate location stage. For each location point in semantic trajectory , we select the location points that satisfy Definition 8 from STC and then add them to candidate location to form the candidate location set .

(3) Dummy trajectory generation stage. We generate dummy semantic trajectory based on candidate location set . If meets the similarity measures, the moving points between stop points in are also added to the trajectory FS to generate the corresponding dummy trajectory . This step is repeated until enough dummy trajectories are generated to form a k-anonymity set.

The overall process is shown in Figure 3. Each step is described separately in the following subsections.

4.2. Trajectory Preprocessing

This subsection preprocesses real trajectory. It can be divided into the following two steps: (1) construct a trajectory equivalence class for and add trajectories with similar time periods to ; (2) extract stop points in real trajectory and the equivalent class and assign semantic to them, and construct the corresponding semantic trajectory and semantic trajectory equivalent class .

4.2.1. Construct Trajectory Equivalence Class

In a k-anonymity set, if the time period of a trajectory differs from that of the other trajectories, an attacker may infer that this trajectory is a dummy trajectory. Therefore, we need to ensure that the trajectories in the equivalent class are similar to real trajectory in terms of time period. According to the set time offset , if a trajectory in history trajectory dataset satisfies the following conditions:

Then, is added to the equivalent class , where and denote the start time and end time of trajectory , respectively.

The specific steps are shown in Algorithm 1.

Input: Real trajectory ; time offset ; history trajectory dataset
Output: Trajectory equivalent class
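1. initialize the trajectory equivalent class as an empty set
2. for each trajectory in the history trajectory dataset do
3.  if the trajectory meets the condition of Formula (19) then
4.   add the trajectory to the equivalent class
5. return the trajectory equivalent class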

Algorithm 1 traverses the trajectories in the trajectory dataset by a for-loop and calculates whether they satisfy Equation (19). The time cost of Algorithm 1 is , i.e., the algorithm time complexity is .
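The single pass of Algorithm 1 can be sketched as below. Formula (19) is not recoverable from the text, so the sketch assumes that a historical trajectory qualifies when both its start time and its end time lie within the time offset of the real trajectory's start and end times. The `Trajectory` type and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    traj_id: str
    t_start: float  # start time in seconds since midnight
    t_end: float    # end time in seconds since midnight


def build_equivalence_class(real, history, offset):
    """Keep the historical trajectories whose time period is close to the real one.

    Assumed form of Formula (19): both endpoints within `offset` seconds.
    """
    return [
        t for t in history
        if abs(t.t_start - real.t_start) <= offset
        and abs(t.t_end - real.t_end) <= offset
    ]


real = Trajectory("real", 64800, 72000)
history = [
    Trajectory("a", 64500, 72300),  # both endpoints within 600 s
    Trajectory("b", 60000, 72000),  # starts too early
    Trajectory("c", 65000, 80000),  # ends too late
]
selected = build_equivalence_class(real, history, offset=600)
print([t.traj_id for t in selected])
```

Each historical trajectory is inspected once, which is consistent with the linear time cost stated for Algorithm 1.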

4.2.2. Construct Semantic Trajectory Equivalence Class

Stop points contain richer information than moving points, and an attacker can identify the dummy trajectories in a k-anonymity set by analysing stop points. The dummy trajectories that constitute the k-anonymity set have to be semantically similar to the real trajectory . So, in this subsection, we extract stop points in real trajectory and equivalence class and assign semantics to the start-end points of and all stop points to construct the corresponding semantic trajectory and semantic trajectory equivalence class .

When the distance between a segment of consecutive location points and is less than the distance threshold and the interval time is greater than the time threshold , the location points from to are aggregated into a stop point. If the Euclidean distance between and is greater than the distance threshold , then and cannot be aggregated into a stop point. Next, the interval time between and is calculated to determine whether it is greater than the time threshold , if the interval time is greater than the time threshold , then to is aggregated as a stop point. If the interval time is less than the time threshold , the location points to are all moving points and repeat this operation from .

For example, in the subtrajectory , , , and . We can aggregate the location points to as a stop point. After acquiring stop points, we assign semantic of the closest POI to stop point based on the PAT. The step is repeated, and the semantic trajectory is finally generated.

The specific steps are shown in Algorithm 2.

Input: Real trajectory ; time threshold ; distance threshold ; point-of-interest set
   
Output: Semantic trajectory ; Semantic trajectory equivalence class
1.
2.
3.
4.
5.
6.whiledo
7.
8.whiledo
9. ifthen
10.      ifthen
11.         
12.         
13.         
14.    
15.   
16. fordo
17.   whiledo
18.      
19.      whiledo
20.         ifthen
21.         ifthen
22.            
23.            
24.            
25.         
26.   
27.  
28. return

In Algorithm 2, lines 2-5 assign semantics to the start-end points of real trajectory and add them to semantic trajectory . Lines 6-15 extract stop points in real trajectory based on the time threshold and distance threshold , assign semantics to them, and add them to . Lines 16-27 traverse the trajectories in the equivalence class , convert each trajectory to a semantic trajectory, and add it to semantic trajectory equivalence class .

Algorithm 2 transforms real trajectory into semantic trajectory by two layers of while-loop. In the worst case, the time consumption is , and the time complexity is . Second, the trajectories in are traversed by a for-loop, and then, the equivalence class is transformed into semantic trajectory equivalence class by two layers of while-loop. In the worst case, the time consumption is , and the time complexity is . Therefore, the total time consumption of Algorithm 2 is , and the total time complexity is .
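The stop point extraction described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 2: a window grows while points stay within the distance threshold of its first point, the window becomes a stop point if its time span exceeds the time threshold, the stop point coordinates are the mean of the window, and the scan resumes from the point that broke the distance threshold. Function names and sample data are hypothetical.

```python
import math


def extract_stop_points(points, dist_max, time_min):
    """points: list of (x, y, t) tuples sorted by t; returns a list of stop-point dicts."""
    stops = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        # Grow the window while points stay within dist_max of the anchor point i.
        while j < n and math.hypot(points[j][0] - points[i][0],
                                   points[j][1] - points[i][1]) <= dist_max:
            j += 1
        window = points[i:j]
        t_in, t_out = window[0][2], window[-1][2]
        if t_out - t_in > time_min:
            stops.append({
                "x": sum(p[0] for p in window) / len(window),
                "y": sum(p[1] for p in window) / len(window),
                "t_in": t_in,
                "t_out": t_out,
                "stay": t_out - t_in,
            })
        i = j  # resume from the first point outside the window (assumption)
    return stops


def nearest_poi_code(stop, pois):
    """pois: list of (x, y, code); returns the semantic code of the closest POI."""
    return min(pois, key=lambda poi: math.hypot(poi[0] - stop["x"], poi[1] - stop["y"]))[2]


track = [(0, 0, 0), (1, 0, 100), (0, 1, 200), (50, 50, 300), (100, 100, 400)]
stops = extract_stop_points(track, dist_max=5, time_min=150)
print(stops)
print(nearest_poi_code(stops[0], [(0, 0, "080601"), (100, 100, "170100")]))
```

Because the outer index always advances by at least one point, the scan terminates, and the nested loops match the two-level while-loop structure that the complexity analysis refers to.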

4.3. Construct Candidate Location Set

The purpose of this subsection is to select the location pairs used to construct the dummy trajectories. We traverse semantic trajectory SEM and compute the similarity of location pairs in according to Definition 8. If there is a location pair that satisfies the conditions, this location pair is added to the candidate location and . After all location pairs have been traversed, the candidate locations together compose the candidate location set .

The specific steps are shown in Algorithm 3.

Input: Semantic trajectory ; semantic trajectory equivalence class ; location
   pair similarity threshold
Output: Candidate location set
1.
2.fordo
3.
4.fordo
5.      fordo
6.         if <> and <> meet the conditions of Definition 8 then
7.           
8.           exit
9.
10. return

For the location pair in the semantic trajectory , Algorithm 3 traverses each trajectory in . If there exists position pair similar to in this trajectory, this location pair will be added to and . Then, traverse the next trajectory until all trajectories have been traversed. Algorithm 3 finally returns the candidate location set .

Algorithm 3 constructs the candidate location set by a three-level for-loop. The first level for-loop traverses the location pairs in with a time consumption of . The second level for-loop traverses the trajectory in with a time consumption of . The third-level for-loop traverses the location pairs in the trajectory with a time consumption of . Therefore, the total time consumption of Algorithm 3 is , and the time complexity is .
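The traversal described for Algorithm 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: semantic trajectories are modelled as plain lists of hashable points, a location pair is taken to be two consecutive points, and the `is_similar` callback is a hypothetical stand-in for the test of Definition 8, which is not reproduced in this section.

```python
from typing import Callable, Hashable, List, Sequence, Set, Tuple

Point = Hashable
Pair = Tuple[Point, Point]


def location_pairs(trajectory: Sequence[Point]) -> List[Pair]:
    """Consecutive points of a semantic trajectory, taken as location pairs."""
    return list(zip(trajectory, trajectory[1:]))


def build_candidate_sets(
    sem_traj: Sequence[Point],
    equivalence_class: Sequence[Sequence[Point]],
    is_similar: Callable[[Pair, Pair], bool],  # assumed stand-in for Definition 8
) -> List[Set[Point]]:
    # One candidate location (a set of points) per semantic location point.
    candidates: List[Set[Point]] = [set() for _ in sem_traj]
    for i, real_pair in enumerate(location_pairs(sem_traj)):
        for traj in equivalence_class:
            for hist_pair in location_pairs(traj):
                if is_similar(real_pair, hist_pair):
                    # Both endpoints of the similar pair become candidates.
                    candidates[i].add(hist_pair[0])
                    candidates[i + 1].add(hist_pair[1])
                    break  # the "exit" step: move on to the next trajectory
    return candidates
```

The three nested loops mirror the three-level for-loop of the complexity analysis above, and the `break` corresponds to moving on to the next history trajectory once a similar pair has been found.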

4.4. Construct Location Pair Graph

In the previous subsection, we picked a candidate location for each semantic location point in . The next goal is to generate dummy trajectories from these candidate locations to form a k-anonymity set. If location points are randomly selected from each candidate location and combined into a dummy trajectory, the generated dummy trajectory may pass through unreachable locations. Specifically, as shown in Figure 4, the solid lines and dotted lines represent the location pairs that exist in and the location pairs that do not exist in , respectively. A dummy trajectory consisting of location pairs and will pass through unreachable locations, so an attacker can easily identify it as a dummy trajectory.

This subsection constructs the location pair graph . Here, consists of all semantic location points in , and each semantic location point represents a node in . is a location in the candidate location , and is a directed edge connecting and . is the weight of edge , and its value is a two-tuple consisting of geographical similarity and semantic similarity. Algorithm 4 describes the graph generation process as follows.

Input: Semantic trajectory ; location candidate set ; semantic trajectory equivalence class
Output: Location pair graph
1.
2. for do
3.   for do
4.     for do
5.       if then
6.
7.
8. return

Algorithm 4 first traverses each candidate location and picks locations from and , respectively, to form the location pair . Next, it determines whether the location pair exists in . If it does, and are added to the graph .

Algorithm 4 constructs the location pair graph by a three-level for-loop. The first-level for-loop traverses the candidate locations of , and the time consumption is . The second-level for-loop traverses the location points in candidate location , and the time consumption is . The third-level for-loop traverses the location points in the candidate location , and the time consumption is . The total time consumption of Algorithm 4 is , and the total time complexity is .
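The graph construction can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function name `build_pair_graph`, the list-of-sets layout of the candidate locations, and the `weight` callback (assumed to return the geographical and semantic similarity as a two-tuple) are assumptions introduced for the example.

```python
from typing import Callable, Dict, Hashable, Sequence, Set, Tuple

Point = Hashable
Edge = Tuple[Point, Point]
Weight = Tuple[float, float]  # (geographical similarity, semantic similarity)


def build_pair_graph(
    candidates: Sequence[Set[Point]],
    equivalence_class: Sequence[Sequence[Point]],
    weight: Callable[[Point, Point], Weight],  # hypothetical similarity helper
) -> Tuple[Set[Point], Dict[Edge, Weight]]:
    # Location pairs that actually occur in some history trajectory.
    observed: Set[Edge] = set()
    for traj in equivalence_class:
        observed.update(zip(traj, traj[1:]))

    # Every candidate point is a node of the graph.
    nodes: Set[Point] = set().union(*candidates) if candidates else set()

    # A directed edge links consecutive candidate locations only when the
    # pair is observed, so no edge crosses an unreachable location.
    edges: Dict[Edge, Weight] = {}
    for current, following in zip(candidates, candidates[1:]):
        for u in current:
            for v in following:
                if (u, v) in observed:
                    edges[(u, v)] = weight(u, v)
    return nodes, edges
```

Precomputing the set of observed pairs keeps the existence test inside the innermost loop at constant expected cost; this is a design choice of the sketch and is not claimed by the complexity analysis above.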

4.5. Construct k-Anonymity Set

The goal of this subsection is to generate dummy trajectories to form k-anonymity sets. These dummy trajectories are semantically, geographically, and directionally similar to the real trajectory . To achieve this goal, a dummy semantic trajectory is first generated. If satisfies the requirements on semantic, geographic, and directional similarity, a dummy trajectory is generated based on . The dummy trajectory consists of each semantic location pair in and the moving points between the corresponding location pairs.

Algorithm 5 describes the generation process of dummy trajectories. First, the directed edges that do not satisfy the threshold requirements are removed from the location pair graph . Then, dummy trajectories are generated iteratively, up to trajectories. Lines 4-14 show the detailed steps of dummy trajectory generation. First, a dummy semantic trajectory is randomly generated (line 5). If the generated trajectory happens to be the user's semantic trajectory, a new semantic trajectory is generated (lines 6-7). Otherwise, the algorithm determines whether satisfies the similarity requirements (line 9). If they are satisfied, it iterates over the location pairs in , finds each location pair in , and adds the stop points and moving points contained in the location pair to the dummy trajectory (lines 10-13). Finally, the eligible dummy trajectories are added to . The specific algorithm is as follows.

Input: Semantic trajectory ; location pair graph ; semantic similarity threshold ; geographical similarity threshold ; directional similarity threshold ; semantic trajectory equivalence class
Output: Dummy trajectory set
1.
2. Calculate the slope of
3. Remove all which do not satisfy Def(10) and Def(11) from G
4. while do
5.   Randomly generate a fake semantic trajectory
6.   if then
7.
8.   else
9.     if then
10.       for
11.         for do
12.           if then
13.             all points between and join
14.
15. return

Algorithm 5 generates dummy trajectories with a while-loop, and the time consumption is . In the dummy trajectory generation process, each dummy trajectory is constructed by a two-level for-loop. In the worst case, the time consumption is . Therefore, the total time consumption of Algorithm 5 is , and the total time complexity is .
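The generate-and-test loop of Algorithm 5 can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the authors' implementation: `edges` is assumed to be already pruned by the threshold step, `is_similar` is a hypothetical stand-in for the similarity requirements, the expansion of each location pair into its stop points and moving points is omitted, and the `max_attempts` bound and the duplicate check are additions that keep the sketch from looping forever.

```python
import random
from typing import Callable, Hashable, List, Optional, Sequence, Set, Tuple

Point = Hashable
Edge = Tuple[Point, Point]


def generate_dummies(
    sem_traj: Sequence[Point],
    candidates: Sequence[Set[Point]],
    edges: Set[Edge],
    is_similar: Callable[[Sequence[Point], Sequence[Point]], bool],
    k: int,
    max_attempts: int = 10_000,
    rng: Optional[random.Random] = None,
) -> List[List[Point]]:
    rng = rng or random.Random()
    real = list(sem_traj)
    # Sort each layer once so that a seeded rng gives reproducible walks.
    layers = [sorted(layer, key=repr) for layer in candidates]
    dummies: List[List[Point]] = []
    if not layers or any(not layer for layer in layers):
        return dummies
    for _ in range(max_attempts):
        if len(dummies) >= k - 1:
            break
        # Random walk through the candidate layers along existing edges only.
        path = [rng.choice(layers[0])]
        for layer in layers[1:]:
            reachable = [v for v in layer if (path[-1], v) in edges]
            if not reachable:
                break
            path.append(rng.choice(reachable))
        if len(path) < len(layers):
            continue  # dead end: no edge leads into the next layer
        if path == real or path in dummies:
            continue  # regenerate when the real (or a repeated) trajectory appears
        if is_similar(real, path):
            dummies.append(path)
    return dummies
```

Because every step of the walk follows an edge of the location pair graph, each accepted dummy semantic trajectory is composed of location pairs that exist in the history trajectories.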

According to the above analysis, the total time complexity of this scheme is .

5. Experiment Analysis

5.1. Data Set and Experimental Environment

In this paper, we use the GeoLife dataset [26–28] to evaluate the performance of the trajectory privacy protection method based on location pair reorganization. The dataset contains 17,621 trajectories collected from 182 volunteers over 5 years. Most of the trajectories in the GeoLife dataset were recorded in Beijing, China; therefore, this paper extracts the trajectory data of the Beijing area for experimental analysis. We select 5 attributes in GeoLife to compose the history trajectory dataset : user ID, longitude, latitude, date, and time. We generate the trajectory k-anonymity set on the historical trajectory dataset .

In this paper, the point of interest set (PAT) includes a total of 211,615 POIs within the 6th Ring Road of Beijing. The semantic categories adopt the three-level classification of Chinese POI standard, including 15 major categories, 51 middle categories, and 145 minor categories.

The experiments were run on an Intel i7-8750H 2.20 GHz processor with 16.00 GB of memory under Microsoft Windows 10, and all algorithms were implemented in PyCharm 2020.

5.2. Experimental Parameter and Evaluation Indicator

To verify the performance of the DSTPP algorithm, we randomly select 10 user trajectories from the trajectory dataset for the experiment. Each user’s semantic trajectory contains at least 4 semantic location points. The experimental parameters are set as shown in Table 2.

In this paper, we verify the performance of DSTPP by comparing it with the Random algorithm [25] and the MTPPA algorithm [13] in terms of both privacy and availability. The Random algorithm randomly selects locations from the candidate locations to generate dummy trajectories. The MTPPA algorithm selects history trajectories to form the k-anonymity set. Under the same conditions, we run the 3 algorithms on 10 user trajectories. To ensure the accuracy of the experimental results, each group of experiments was run 100 times, and the average of the 100 results was taken as the final result.

5.3. Privacy Analysis

In this section, we use identification probability (IP) to evaluate the privacy of the k-anonymity set. In the k-anonymity set, there are trajectories that are similar to . The identification probability is . The larger the value of , the more difficult it is for an attacker to identify the true trajectory in the k-anonymity set and the better the privacy. The identification probability is calculated by the following equation. where is used to calculate the number of trajectories that satisfy the similarity measure in the trajectory k-anonymity set.

We evaluate privacy by the average identification probability of the 10 trajectories. Figure 5 shows the average identification probability of the three methods under different values of k. As shown in Figure 5, the average identification probability of all methods tends to decrease as k increases. This is because a larger k means more trajectories in the k-anonymity set and hence more trajectories that are similar to the real trajectory. Under the same value of k, the Random algorithm has the highest average identification probability, for three reasons: (1) the dummy trajectory passes through some unreachable locations; (2) the dummy trajectory does not consider geographic and semantic attributes, so an attacker can easily filter dummy trajectories out of the k-anonymity set by analysing the stop points; and (3) the dummy trajectory may differ significantly from the real trajectory in overall direction.

The average identification probability of the MTPPA algorithm is lower than that of the Random algorithm. This is because MTPPA selects history trajectories to form the k-anonymity set, so there are no unreachable locations. Meanwhile, dummy trajectories are selected based on the number of stop points, which ensures the semantic similarity of the trajectories in the k-anonymity set to a certain degree.

Compared with the above two methods, DSTPP has the lowest average identification probability. This is because DSTPP constructs dummy trajectories from real location pairs, so no unreachable locations appear. In addition, the dummy trajectories and the real trajectory are similar in overall direction, semantic properties, and geographical properties.

5.4. Availability Analysis

After the service provider publishes the k-anonymity set, researchers can conduct statistical analysis and other studies by mining the trajectories in the anonymity set. Therefore, the availability of the k-anonymity set needs to be evaluated. According to the research needs, availability can be divided into two categories: trajectory availability and data availability.

5.4.1. Trajectory Availability Analysis

Trajectory availability concerns needs such as traffic optimization and logistics management, where researchers want to use trajectory data to evaluate traffic flow. This requires the dummy trajectory to have a high degree of shape similarity and geographic similarity to the real trajectory.

In this subsection, we evaluate the trajectory availability of k-anonymity sets by trajectory difference (TD). The trajectory difference is calculated by the following equation. where is the distance between two trajectories. distance is calculated by the following equation.

and are two trajectories for comparison, with representing a specific moment and representing and ’s location point at moment , respectively. denotes the Euclidean distance between two location points.
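A distance of this kind can be sketched in Python as follows. The equation itself is not reproduced in this section, so the aggregation used here (the mean of the pointwise Euclidean distances over the moments that both trajectories share) is an assumption, as are the function name `trajectory_distance` and the timestamp-to-coordinates layout; coordinates are treated as planar for simplicity.

```python
import math
from typing import Dict, Tuple

# A trajectory maps a moment t to the (x, y) location recorded at t.
Trajectory = Dict[float, Tuple[float, float]]


def trajectory_distance(p: Trajectory, q: Trajectory) -> float:
    """Mean Euclidean distance between p and q over their shared moments.

    Averaging over shared moments is an assumption of this sketch.
    """
    shared = sorted(p.keys() & q.keys())
    if not shared:
        raise ValueError("trajectories share no moments")
    return sum(math.dist(p[t], q[t]) for t in shared) / len(shared)
```

For example, two trajectories that coincide at one shared moment and lie 5 units apart at another have a distance of 2.5 under this sketch.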

We use the average trajectory difference of the 10 trajectories to evaluate the trajectory availability of the k-anonymity set. For the k-anonymity set, larger TD values represent weaker trajectory availability. Figure 6 shows the average trajectory difference of the three methods under different values of k. As shown in Figure 6, the trajectory difference of all three methods increases as k increases. This is because a larger k requires more dummy trajectories, so the overall quality of the dummy trajectories decreases. Under the same value of k, DSTPP has the lowest trajectory difference, because it considers the overall direction and geographic location, so the generated dummy trajectories are more similar to the real trajectory. The average trajectory difference of Random is higher than that of DSTPP, because Random generates dummy trajectories randomly from fake location points: the generated dummy trajectories have some geographical similarity, but the overall direction is not considered. The MTPPA algorithm has the largest average trajectory difference, because it does not consider trajectory shape similarity or overall direction at all when selecting dummy trajectories.

5.4.2. Data Availability Analysis

Data availability concerns needs such as interest recommendation and location prediction [26, 35], where researchers want to analyse the semantic properties of trajectories. This requires that the k-anonymity set contain enough available data and that the data be accurate.

In this paper, the available data rate (AD) is used to evaluate the amount of available data in the anonymity set, which is calculated as follows. where denotes all the location points in the k-anonymity set and denotes the reachable location points in the k-anonymity set.
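Read as the share of reachable location points among all location points in the anonymity set, the available data rate can be sketched in Python as follows. The ratio form is inferred from the symbol descriptions above, and the `is_reachable` predicate is a hypothetical helper, since the reachability test is not specified in this section.

```python
from typing import Callable, Hashable, Sequence

Point = Hashable


def available_data_rate(
    anonymity_set: Sequence[Sequence[Point]],
    is_reachable: Callable[[Point], bool],  # assumed reachability test
) -> float:
    """Fraction of location points in the anonymity set that are reachable."""
    points = [pt for trajectory in anonymity_set for pt in trajectory]
    if not points:
        raise ValueError("anonymity set contains no location points")
    reachable = sum(1 for pt in points if is_reachable(pt))
    return reachable / len(points)
```

Under this reading, a set built only from history trajectories contains no unreachable points and therefore has an available data rate of 1.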

We use information loss (IL) to evaluate the accuracy of the anonymity set. Less information loss indicates more accurate data and better data availability. Information loss depends on the size of the anonymity zone; the calculation is similar to that of Xu et al. [13]. The calculation formula is as follows. where denotes the size of the anonymity set of the th semantic location point, denotes the size of the trajectory k-anonymity set, and denotes the number of semantic location points.

We evaluate the data availability of the k-anonymity set by the average available data rate and the average information loss over the 10 trajectories. Figure 7 illustrates the effect of the value of k on data availability.

Figure 7(a) compares the available data rate of the three methods. Random has the lowest available data rate, because its dummy trajectories contain or pass through some unreachable locations that are not available for data analysis. The available data rate of both MTPPA and DSTPP is 1, because the dummy trajectories of MTPPA are taken from the history trajectory set and the dummy trajectories of DSTPP are built from history trajectories as well. Therefore, no unreachable locations are present or passed through.

Figure 7(b) compares the data accuracy of the three methods. As k increases, the information loss of all three methods increases, because the quality of the dummy trajectories decreases as the k-anonymity set grows. For a fixed k, Random has the greatest information loss, because the dummy trajectories it generates do not consider semantic similarity, so there is too much noise in the results of each query. The information loss of MTPPA is in the middle: MTPPA considers the number of stop points of the trajectories, so the result of each query contains a large amount of semantic information that can be used for data analysis. DSTPP has the least information loss, because the dummy trajectories it generates match the real trajectory not only in the number of stop points but also in the semantics of the stop points. The query results therefore contain rich semantic information for researchers to analyse.

6. Conclusions

With the rapid development of information technology and 5G technology, human society has entered the era of big data. Analysing users' daily trajectories is of great help in optimizing national resource scheduling and improving public facilities. However, user trajectories contain sensitive information, which raises the question of how to ensure the availability of trajectories while protecting user privacy and security. To address this problem, this paper proposes a trajectory privacy protection method based on location pair reorganization. The real trajectory is decomposed into location pairs consisting of start-end points and stop points. Then, dummy trajectories are generated by selecting eligible location pairs from the historical trajectory set to form a trajectory k-anonymity set. Finally, this paper evaluates privacy and availability on a real dataset. Compared with the MTPPA algorithm and the Random algorithm, our method reduces information loss and improves privacy protection. Overall, DSTPP performs better than the MTPPA algorithm and the Random algorithm.

Data Availability

The trajectory data used to support the findings of this study can be downloaded from https://www.microsoft.com/en-us/download/details.aspx?id=52367 and the detailed instructions can be found in https://www.microsoft.com/en-us/research/publication/geolife-gps-trajectory-dataset-user-guide/.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the Natural Science Foundation of Hebei Province (F2019201361).