Abstract
With the development of wireless networks, location-based services (e.g., place-of-interest recommendation) play a crucial role in daily life. However, the acquired data are noisy and massive, which makes them difficult to mine with artificial intelligence algorithms. One of the fundamental problems of trajectory knowledge discovery is trajectory segmentation. Reasonable segmentation can reduce computing resources and improve storage effectiveness. In this work, we propose an unsupervised algorithm for trajectory segmentation based on multiple motion features (TSMF). The proposed algorithm consists of two steps: segmentation and mergence. The segmentation part uses the Pearson coefficient to measure the similarity of adjacent trajectory points and extracts the segmentation points from a global perspective. The merging part optimizes the minimum description length (MDL) value by merging local subtrajectories, which avoids excessive segmentation and improves the accuracy of trajectory segmentation. To demonstrate the effectiveness of the proposed algorithm, experiments are conducted on two real datasets. A comparison with state-of-the-art algorithms indicates that the proposed method achieves the highest harmonic mean of purity and coverage.
1. Introduction
With the rapid development of location technology (such as GPS, the BeiDou system, and AIS), it is becoming easier to obtain trajectory data of moving objects, including time, location, speed, acceleration, and heading. The analysis of trajectory data can provide valuable information for location-based applications, such as traffic pattern detection [1], fishing detection [2, 3], animal migration behavior detection [4–6], human behavior pattern recognition [7], and hurricane trajectory prediction.
The preprocessing step of trajectory data mining includes noise cleaning, segmentation, stop point detection, compression, and map matching [8]. Trajectory segmentation is one of the most basic of these tasks: it partitions the trajectory into disjoint parts such that the motion features within each part are uniform and two adjacent parts represent different motion modes. Segmentation reduces computational complexity and allows us to mine richer knowledge than can be learned from the entire trajectory. Furthermore, accurate segmentation methods can provide higher-quality features for further analysis of the behavior of moving objects.
The trajectory segmentation algorithms proposed in recent years can be classified as supervised [9–14], unsupervised [15–27], and semisupervised [28].
The aforementioned trajectory segmentation algorithms can solve most of the problems in the preprocessing stage of trajectory data mining, but the following challenges remain:
(1) Most supervised trajectory segmentation algorithms, such as SPD [11], Warped K-means [10], and WS-II [9], require labeled data or prior information such as a time threshold, a speed threshold, or the number of trajectory segments.
(2) Semisupervised trajectory segmentation algorithms (e.g., RGRASP-SemTS) use a combination of labeled and unlabeled data to segment. However, the majority of trajectory datasets do not contain labeled data.
(3) Unsupervised trajectory segmentation algorithms do not require labeled data. However, existing unsupervised segmentation algorithms rely on greedy algorithms with high time complexity, which makes them unsuitable for large trajectory datasets.
To overcome these challenges, we propose an unsupervised algorithm for trajectory segmentation based on multiple motion features (TSMF). The algorithm includes two steps: segmentation and mergence. First, to maximize the homogeneity of the subtrajectories, the segmentation part uses the Pearson coefficient to measure the similarity of trajectory points. Second, to avoid local oversegmentation, the mergence part merges subtrajectories by minimizing a cost function. Finally, we verify the proposed algorithm on two trajectory datasets from two different domains.
The main contributions of this article are as follows:
(1) We propose a segmentation method based on the Pearson coefficient. First, the Pearson coefficient is employed to measure the similarity of two trajectory points according to their speed, acceleration, differential position, angle, and other movement features. Then, the trajectory is segmented from a global perspective.
(2) To address local oversegmentation of trajectories, we propose a merging method, which merges subtrajectories by minimizing the value of a cost function.
(3) By fusing the segmentation and merging methods, we obtain an unsupervised algorithm for trajectory segmentation based on multiple motion features (TSMF).
(4) The time complexity of our proposed algorithm is , which is suitable for the segmentation of large trajectory datasets.
The rest of this article is organized as follows: Section 2 reviews related work. Section 3 introduces the proposed trajectory segmentation algorithm. In Section 4, we verify the feasibility of our algorithm on two real datasets. Finally, Section 5 gives our conclusions and future work.
2. Literature Review
In the past few years, many papers related to trajectory segmentation have been published. In this section, we summarize the main trajectory segmentation methods.
Supervised trajectory segmentation algorithms require labeled data or heuristic rules, such as a time threshold, speed threshold, density threshold, or angle threshold, to segment a trajectory. Mohammad et al. proposed a segmentation algorithm named WS-II [9], which requires labeled data. However, the majority of trajectory datasets do not contain such information. Zheng et al. proposed a stay point detection (SPD) algorithm to segment trajectories [11]. SPD supposes that there is a stay point between two adjacent motion modes and uses a distance threshold and a time threshold to find the stay points, which are then used to segment the trajectory. SPD was verified on the Geolife dataset. Mirge and Verma define a distance threshold and an angle threshold to segment trajectories [13]. Although these two algorithms can quickly find the stay points and segment the trajectory, they require heuristic rules. In practical applications, it is difficult to obtain these rules in advance, and the threshold values greatly affect the accuracy of trajectory segmentation. Leiva and Vidal proposed a trajectory segmentation algorithm named Warped K-means [10], which is based on K-means [29] and adds time constraints to it. It reaches 97% accuracy on real datasets. However, the number of trajectory segments, which it requires as input, is generally unknown.
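The threshold-based idea behind stay point detection can be sketched as follows. This is a minimal illustration of the general technique, not the exact procedure of [11]: it assumes planar coordinates and time-sorted `(t, x, y)` tuples, and the function and parameter names are hypothetical.

```python
import math

def detect_stay_points(points, dist_threshold, time_threshold):
    """Return (start, end) index pairs of candidate stay regions.

    points: list of (t, x, y) tuples sorted by time, planar coordinates (assumed).
    A region starting at index i is a stay if every point in it lies within
    dist_threshold of points[i] and it spans at least time_threshold.
    """
    stays = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        # Extend the window while points stay close to the anchor point i.
        while j < n and math.dist(points[i][1:], points[j][1:]) <= dist_threshold:
            j += 1
        # points[i:j] all lie within dist_threshold of points[i].
        if points[j - 1][0] - points[i][0] >= time_threshold:
            stays.append((i, j - 1))
            i = j  # continue after the stay region
        else:
            i += 1
    return stays
```

The sketch makes the dependence on the two thresholds explicit: changing either one changes which regions count as stays, which is why their values affect segmentation accuracy.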
Unsupervised segmentation algorithms are mainly clustering-based, cost-function-based, or interpolation-based. They are described in detail below.
Clustering-based segmentation algorithms mainly improve an existing clustering algorithm to make it more suitable for trajectory segmentation. A plethora of such algorithms have been proposed. CB-SMoT [27], proposed by Andrey, is an extension of the DBSCAN algorithm [30]. The algorithm uses speed characteristics to discover the stop points and move points of the trajectory. To better process spatiotemporal trajectory data, it replaces the distance threshold in DBSCAN with a time threshold. Chen et al. improved the DBSCAN algorithm and proposed a segmentation algorithm named T-DBSCAN [18], which uses the important spatiotemporal characteristics of the trajectory to segment it. Both algorithms achieve high accuracy on their experimental datasets. However, since CB-SMoT and T-DBSCAN are both based on DBSCAN, they share its weakness: they cannot reliably detect stop points in sparse trajectories.
Cost-function-based approaches segment the trajectory by minimizing a cost function. One example is GRASP-UTS [23], proposed by Amilcar et al. in 2015. This algorithm first randomly selects the segmentation points, called landmarks. Then, it uses an adaptive greedy algorithm to optimize the landmarks and calculates the cost function. Finally, when the cost function reaches its minimum, the trajectory is segmented at the landmarks. GRASP-UTS was tested on two real datasets from different domains and achieves high accuracy. However, because the algorithm uses an adaptive greedy algorithm, its time complexity is very high, which makes it unsuitable for large datasets.
Interpolation-based trajectory segmentation algorithms use different interpolation methods, such as linear interpolation and kinematic interpolation, to generate error signals for segmentation. Examples include OWS [15] and SWS [19]. Mohammad et al. proposed the Octal Window Segmentation (OWS) algorithm in 2019, and SWS is an improvement of OWS. The intuition behind the two algorithms is that when a moving object changes from one behavior to another, the change can be captured directly from its geographic location. Mohammad et al. compare the real position of the moving object with the estimated one to generate an error signal. By evaluating this error signal, they predict whether the behavior of the moving object has changed and use this information to segment the trajectory. These two algorithms outperform the benchmark algorithms in segmentation accuracy. However, part of the data is required to optimize the parameters, and different trajectory datasets require different interpolation methods.
Semisupervised segmentation algorithms include RGRASP-SemTS [28], proposed by Amilcar et al. It uses the minimum description length (MDL) principle to measure homogeneity inside segments and segments trajectories by combining a limited user labeling phase with a small number of input parameters and no predefined segmenting criteria. However, when the data are large-scale, it is difficult to create even a partially labeled trajectory dataset.
This study proposes an efficient and accurate trajectory segmentation and merging algorithm based on multiple motion features (TSMF) to overcome the limitations of the aforementioned algorithms. It is mainly composed of a segmentation method and a trajectory merging method. The TSMF algorithm divides the trajectory from both the global and the local perspective to ensure the accuracy of segmentation.
3. Methodology
This section details the novel unsupervised algorithm for trajectory segmentation based on multiple motion features (TSMF). In Section 3.1, we present the relevant definitions. Figure 1 shows the overview of TSMF, which includes two core steps: segmentation and mergence. The first step of TSMF segments the raw trajectory using the Pearson coefficient, as detailed in Section 3.2. The second step merges the oversegmented subtrajectories, as described in Section 3.3. Finally, the details of TSMF are introduced in Section 3.4.
3.1. Definitions
3.1.1. Raw Trajectory
A raw trajectory is composed of a series of multidimensional spatiotemporal data points. It is denoted as $T = \{p_1, p_2, \ldots, p_n\}$ with $p_i = (x_i, y_i, t_i, f_i)$, where $x_i$ and $y_i$ represent the position coordinates at time $t_i$, and $f_i$ denotes the movement characteristics of the trajectory point at time $t_i$, such as speed, angle, and acceleration.
3.1.2. Subtrajectory
A subtrajectory is a sequence of consecutive trajectory points in the raw trajectory; for example, the subtrajectory running from the $i$-th to the $j$-th point can be denoted as $\{p_i, p_{i+1}, \ldots, p_j\}$.
3.1.3. Trajectory Segmentation
According to the feature similarity of trajectory points, a trajectory segmentation algorithm efficiently and accurately finds a set of segmentation points in the raw trajectory, such as $\{sp_1, sp_2, \ldots, sp_{m-1}\}$. These segmentation points partition the raw trajectory into several disjoint parts, for example, $\{S_1, S_2, \ldots, S_m\}$, where $m$ is the number of subtrajectories.
3.2. Segmentation Method
The intuition behind the segmentation method is that when the motion features of two adjacent trajectory points (such as longitude, latitude, velocity, angle, acceleration, and heading) vary significantly, the later point is where the motion state changes, that is, a segmentation point. Therefore, the core of the segmentation method is to determine the segmentation points.
To extract the segmentation points accurately, it is necessary to define an index that measures the similarity of multiple motion features between two adjacent trajectory points. Since the Pearson coefficient is sensitive to variation, it is employed to calculate the similarity of adjoining trajectory points, extract the points where the motion features change, and save them to the segmentation point sequence.
The Pearson coefficient is a statistical indicator that reflects the degree of linear correlation between two variables. It can be calculated through Equation (1):

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \tag{1}$$

where $X$ and $Y$ are the feature vectors of two adjacent trajectory points (the features include longitude, latitude, speed, average speed, acceleration, and angle), $\operatorname{cov}(X, Y)$ is the covariance between $X$ and $Y$, $\mu$ represents the mean value, $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, and $E[\cdot]$ denotes the expected value. The value of $\rho_{X,Y}$ lies in $[-1, 1]$. When it equals 0, $X$ and $Y$ are linearly uncorrelated; when it equals 1 (e.g., $[1, 2, 3, 4, 5, 6]$ and $[1, 2, 3, 4, 5, 6]$), they are perfectly positively correlated; when it equals $-1$ (e.g., $[1, 2, 3, 4, 5, 6]$ and $[-1, -2, -3, -4, -5, -6]$), they are perfectly negatively correlated. Generally, trajectory data reflect the motion history of moving objects and the sampling interval is usually very short, so the features of adjacent points in the same motion state are usually nearly the same, that is, the Pearson coefficient is close to 1. In contrast, the acceleration, speed, average speed, and angle of trajectory points at which the motion state changes vary markedly, which drives the Pearson coefficient toward $-1$. For example, we calculate the Pearson coefficient of two sets of adjacent trajectory points, and the result is shown in Table 1. When the features of adjacent trajectory points show no obvious variation, the value is close to 1; otherwise, it is close to $-1$.
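As an illustration of the similarity measure described above, the following sketch computes the Pearson coefficient between the feature vectors of each pair of adjacent trajectory points. The function names and the row-per-point array layout are assumptions made for this example, not part of the original method description.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length feature vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    denom = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    if denom == 0.0:
        # A constant vector has zero variance, so the coefficient is undefined.
        return float("nan")
    return float((dx * dy).sum() / denom)

def adjacent_pearson(features):
    """Coefficient for every pair of adjacent points.

    features: array-like of shape (n_points, n_features), one row per
    trajectory point (hypothetical layout). Returns n_points - 1 values.
    """
    features = np.asarray(features, dtype=float)
    return np.array([pearson(features[i], features[i + 1])
                     for i in range(len(features) - 1)])
```

For two vectors, `np.corrcoef(x, y)[0, 1]` computes the same quantity; the explicit version is shown only to mirror the covariance and standard deviation terms of the definition.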
Figure 2 shows how the value of the Pearson coefficient changes along a trajectory; it contains many mutation points. Between mutation points, the Pearson coefficient is close to 1 and remains nearly unchanged. We can also observe multiple mutation points within a short time. However, the motion state of a moving object does not change repeatedly within a short time, which means that some of these mutation points are outliers. Therefore, the purpose of the segmentation method is to extract the mutation points and remove the outlier points.
The pseudocode of the segmentation method is detailed in Algorithm 1. The proposed segmentation method first takes a raw trajectory from the database. Then, it calculates the Pearson coefficient of each pair of adjacent trajectory points and saves the results into an array. Finally, two hyperparameters, the Pearson coefficient threshold and the time-interval threshold, are defined to extract the points where the motion state changes. The segmentation method looks for the point with the minimum Pearson coefficient from a global perspective. When this coefficient is less than the Pearson coefficient threshold and the time interval between the point and its adjacent segmentation points is greater than the time-interval threshold, the point is added to the segmentation point sequence and removed from the array. Otherwise, the point is treated as an outlier and removed. The procedure repeats this step until the minimum Pearson coefficient in the array is greater than the Pearson coefficient threshold.
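The global extraction loop described above can be sketched as follows. This is a hedged reading of the prose, not a reproduction of Algorithm 1: it assumes that `rho[i]` is the coefficient between points `i` and `i + 1`, that the later point of a pair is the segmentation candidate, and that a candidate is kept only if it lies more than `min_gap` away in time from every already accepted segmentation point. All names are hypothetical.

```python
import numpy as np

def extract_segmentation_points(rho, timestamps, rho_threshold, min_gap):
    """Pick segmentation points by repeatedly taking the global minimum of rho.

    rho:           Pearson coefficients of adjacent point pairs (length n - 1).
    timestamps:    sampling times of the n trajectory points.
    rho_threshold: pairs with rho below this value are candidates.
    min_gap:       minimum time distance to accepted segmentation points;
                   closer candidates are discarded as outliers (assumption).
    """
    rho = np.asarray(rho, dtype=float)
    # Visiting indices in ascending order of rho is equivalent to repeatedly
    # extracting and removing the current global minimum.
    order = np.argsort(rho, kind="stable")
    accepted = []
    for i in order:
        i = int(i)
        if not rho[i] < rho_threshold:
            break  # remaining values are above the threshold (or NaN)
        t = timestamps[i + 1]
        if all(abs(t - timestamps[j]) > min_gap for j in accepted):
            accepted.append(i + 1)
        # otherwise the candidate is dropped as an outlier
    return sorted(accepted)
```

Because the strongest changes are processed first, a weaker mutation point that falls close in time to an already accepted one is the one that gets discarded.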

3.3. Merging Method
The segmentation method based on the Pearson coefficient achieves high homogeneity within the subtrajectories. However, in practical applications, the collected trajectories contain some outlier points, which drive the Pearson coefficient toward $-1$. Although the segmentation method uses the time threshold to remove outlier points, an outlier point may still be mistakenly added to the segmentation point sequence when its time interval is greater than the time threshold. This may cause the raw trajectory to be oversegmented. For example, a raw trajectory containing 122 subtrajectories is finally partitioned into 187 segments, which is oversegmented. In the mergence part, the minimum description length (MDL) principle is used to construct a cost function, and subtrajectories are merged by optimizing this cost function from a local perspective, which improves the accuracy of the final segmentation.
The MDL principle was proposed by Rissanen [31] and later used and detailed by Grünwald et al. [32]. According to Grünwald et al. [32], the MDL cost consists of $L(H)$ and $L(D|H)$. Here, $H$ denotes the hypothesis and $D$ the data. $L(H)$ is the length, in bits, of the description of the hypothesis, and $L(D|H)$ is the length, in bits, of the description of the data when encoded with the help of the hypothesis. The best hypothesis to explain $D$ is the one that minimizes the sum of $L(H)$ and $L(D|H)$.
In the problem of trajectory segmentation, a hypothesis corresponds to a subtrajectory, so finding the optimal subtrajectory means finding the best hypothesis. Given a subtrajectory, we formulate its cost function by Equation (2), which can be used to measure homogeneity. In Equation (2), the cost is built from $L(H)$ and $L(D|H)$, which are computed from the perpendicular distance and the angle distance between two line segments. These two distances are defined in Equations (3) and (4), which are taken from [17]. Figure 3 illustrates the formulation of the cost function, $L(H)$ and $L(D|H)$, for a subtrajectory that contains 5 trajectory points.
Based on the aforementioned theory, the merging method is detailed in Algorithm 2. First, the procedure uses the segmentation point sequence to segment the raw trajectory into a sequence of subtrajectories. Then, it merges two adjacent subtrajectories into a candidate subtrajectory and calculates the cost function of the two original subtrajectories and of the candidate. When the three cost values satisfy Equation (5), the two subtrajectories are considered oversegmented and are merged from the local perspective. The procedure repeats this step until the last subtrajectory.
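One common way to instantiate an MDL cost with perpendicular and angle distances is sketched below. It is an assumption-laden illustration rather than the paper's Equations (2) to (4): the hypothesis is taken to be the straight segment from the first to the last point, the distances follow the usual line-segment definitions (Lehmer mean of the endpoint offsets, and segment length times the sine of the enclosed angle), coordinates are assumed planar, and distances below 1 are clamped so the logarithm stays non-negative. All function names are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def _norm(v):
    return math.hypot(v[0], v[1])

def perpendicular_distance(s, e, a, b):
    """Lehmer mean of the offsets of a and b from the line through s and e."""
    d = _sub(e, s)
    length = _norm(d)
    if length == 0.0:
        return 0.0
    def offset(p):
        w = _sub(p, s)
        return abs(d[0] * w[1] - d[1] * w[0]) / length
    l1, l2 = offset(a), offset(b)
    total = l1 + l2
    return 0.0 if total == 0.0 else (l1 * l1 + l2 * l2) / total

def angle_distance(s, e, a, b):
    """Length of segment (a, b) scaled by the sine of its angle to (s, e)."""
    d, v = _sub(e, s), _sub(b, a)
    len_d, len_v = _norm(d), _norm(v)
    if len_d == 0.0 or len_v == 0.0:
        return 0.0
    dot = d[0] * v[0] + d[1] * v[1]
    if dot < 0.0:
        return len_v  # angle above 90 degrees: use the full length
    cross = d[0] * v[1] - d[1] * v[0]
    return abs(cross) / len_d  # equals len_v * sin(theta)

def mdl_cost(points):
    """L(H) + L(D|H) for a non-empty list of planar (x, y) points."""
    s, e = points[0], points[-1]
    l_h = math.log2(max(_norm(_sub(e, s)), 1.0))  # clamp is an assumption
    l_dh = 0.0
    for a, b in zip(points, points[1:]):
        l_dh += math.log2(max(perpendicular_distance(s, e, a, b), 1.0))
        l_dh += math.log2(max(angle_distance(s, e, a, b), 1.0))
    return l_h + l_dh
```

Under these assumptions a straight subtrajectory pays only for describing its endpoints, while a zigzag of the same extent pays an additional $L(D|H)$ term, which is the behavior a homogeneity measure needs.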
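The local merging pass can be sketched as a single left-to-right sweep. Since the exact form of the merge condition is not reproduced in the text, the sketch takes both the cost function and the merge test as caller-supplied functions; the names `merge_subtrajectories`, `cost`, and `should_merge` are hypothetical, and comparing the already merged segment with the next one is an assumption about how the sweep proceeds.

```python
def merge_subtrajectories(segments, cost, should_merge):
    """Greedily merge adjacent subtrajectories in one left-to-right pass.

    segments:     list of subtrajectories, each a list of points.
    cost:         function mapping a subtrajectory to its cost value.
    should_merge: function (cost_left, cost_right, cost_merged) -> bool that
                  encodes the merge condition (left open here on purpose).
    """
    if not segments:
        return []
    merged = [list(segments[0])]
    for nxt in segments[1:]:
        current = merged[-1]
        candidate = current + list(nxt)
        if should_merge(cost(current), cost(nxt), cost(candidate)):
            merged[-1] = candidate  # the two parts were oversegmented
        else:
            merged.append(list(nxt))
    return merged
```

A usage example with a toy cost (value range of 1-D points) and an illustrative rule that merges when the merged cost is not much larger than the sum of the parts:

```python
spread = lambda seg: max(seg) - min(seg)
rule = lambda left, right, both: both <= left + right + 1
merge_subtrajectories([[1, 2], [3, 4], [50, 51]], spread, rule)
```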

3.4. The TSMF Algorithm
The segmentation part and the mergence part, described in Sections 3.2 and 3.3, are the two phases of TSMF (global segmentation and local optimization). Algorithm 3 shows the pseudocode of TSMF. The algorithm receives the following inputs: the raw trajectory, a time threshold, and the Pearson coefficient threshold. The output is the set of subtrajectories.

4. Experimental Evaluation
To evaluate the effectiveness of the proposed algorithm, we test it on two real datasets. This section first details the datasets (Section 4.1) and the evaluation metrics (Section 4.2). Then, the parameter settings and experimental results are introduced in Sections 4.3 and 4.4, and a comparative analysis with other algorithms is presented in Section 4.5.
4.1. Trajectory Datasets
The first dataset covers vessels performing fishing activities on the coast of Brazil. It contains 5190 trajectory points and 122 segments. Our purpose is to partition the trajectories into fishing and not-fishing parts. Generally, in Brazil, the captain must report the position (latitude and longitude) in real time and record the status of the fishing vessel (fishing or not fishing). The entire dataset was created using data from four vessels that perform the same types of fishing activities on Brazil’s northeast coast. Figure 4 (left) shows the trajectories of the four fishing vessels.
The second dataset is a subset of the Geolife dataset containing 12,955 trajectory points and 181 segments. The Geolife dataset has a mix of behaviors, such as car, bus, train, and walk. Figure 4 (right) shows part of the Geolife trajectories.
From these trajectories, we extracted the collected time, longitude, latitude, fishing status, speed, and angle. We also computed additional trajectory features for all the points in the datasets, including mean speed and acceleration. The data description is shown in Table 2.
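Derived motion features of this kind can be computed from timestamps and coordinates, for example as in the sketch below. It is an illustration under stated assumptions: timestamps are in seconds, positions are latitude and longitude in degrees, distances use the haversine formula on a spherical Earth, and speed and acceleration are simple backward differences. The exact feature definitions used for the datasets are not given in the text, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical model (assumption)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def motion_features(points):
    """Speed (m/s) and acceleration (m/s^2) per point by backward differences.

    points: list of (t_seconds, lat_deg, lon_deg) sorted by time.
    The first point has no predecessor, so its features are set to 0.
    """
    feats = []
    prev_speed = 0.0
    for k, (t, lat, lon) in enumerate(points):
        if k == 0:
            speed = accel = 0.0
        else:
            t0, lat0, lon0 = points[k - 1]
            dt = t - t0
            if dt <= 0:
                raise ValueError("timestamps must be strictly increasing")
            speed = haversine_m(lat0, lon0, lat, lon) / dt
            accel = (speed - prev_speed) / dt
        feats.append({"t": t, "speed": speed, "acceleration": accel})
        prev_speed = speed
    return feats
```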
4.2. Evaluation Metrics
In this work, the harmonic mean of average purity and average coverage is used to evaluate the proposed algorithm. The concepts of coverage and purity were first proposed in [23], and the harmonic mean was used to evaluate trajectory segmentation algorithms in [19].
The purity of a segment is the ratio of the number of trajectory points carrying the most frequent label in the segment to the total number of trajectory points in the segment. For example, if a segmented subtrajectory has $n$ points and $m$ of them carry the most frequent label, the segment purity is $m/n$. The average of the purity values over all segments is called the average purity. Coverage evaluates the completeness of the segmentation. For example, if a segment of the raw trajectory is divided into several pieces by the algorithm, the coverage is defined as the ratio of the length of the largest piece to the length of the whole segment. The average of the coverage values over all segments is called the average coverage. The two metrics are designed to be orthogonal, i.e., when one increases, the other tends to decrease. Therefore, the harmonic mean of purity and coverage is used to evaluate the performance of TSMF. Equation (6) gives the formulation of the harmonic mean [19]. When the harmonic mean is highest, the purity and coverage of the segmented trajectory reach a good compromise, and the segmentation into subtrajectories is the best.
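The three metrics can be computed from per-point annotations, as in the following sketch. It assumes each trajectory point carries a ground-truth label, a ground-truth segment id, and a predicted segment id, that segments are contiguous runs of equal ids, and that the harmonic mean takes the standard form $2PC/(P + C)$. The function names and the input layout are hypothetical.

```python
from collections import Counter
from itertools import groupby

def _runs(ids):
    """Split indices 0..n-1 into maximal runs of equal consecutive ids."""
    runs, start = [], 0
    for _, group in groupby(ids):
        size = len(list(group))
        runs.append(range(start, start + size))
        start += size
    return runs

def average_purity(labels, pred_ids):
    """Mean share of the most frequent true label inside each predicted segment."""
    scores = []
    for run in _runs(pred_ids):
        counts = Counter(labels[i] for i in run)
        scores.append(counts.most_common(1)[0][1] / len(run))
    return sum(scores) / len(scores)

def average_coverage(true_ids, pred_ids):
    """Mean share of each true segment covered by its largest predicted piece."""
    scores = []
    for run in _runs(true_ids):
        counts = Counter(pred_ids[i] for i in run)
        scores.append(counts.most_common(1)[0][1] / len(run))
    return sum(scores) / len(scores)

def harmonic_mean(purity, coverage):
    """Standard harmonic mean of two non-negative scores."""
    total = purity + coverage
    return 0.0 if total == 0 else 2 * purity * coverage / total
```

For instance, with labels `AAAABBBB` and a predicted boundary placed one point too early, the second predicted segment contains one `A` and four `B` points (purity 0.8), and the first true segment is split into pieces of 3 and 1 points (coverage 0.75).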
4.3. Parameter Settings
In the segmentation process, the threshold of the Pearson coefficient is employed to find the segmentation points. In general, depending on its value, the Pearson coefficient indicates that the two feature variables are highly positively correlated, moderately positively correlated, weakly correlated, or uncorrelated, while a negative value suggests that they are negatively correlated. TSMF therefore selects its Pearson coefficient threshold according to these ranges to extract segmentation points. In addition, the segmentation process uses the time threshold to remove outlier points. Since it is difficult to know the specific duration of each state of the moving object, and the purpose of the time threshold is only to remove some of the outlier points, it can be set to the minimum duration of the movement states. The duration of fishing activities of fishing vessels on the coast of Brazil is 6 hours, as mentioned in [33], and the shortest duration of a walk is generally 30 min. The time thresholds for the vessels performing fishing activities on the coast of Brazil and for the Geolife dataset are set accordingly.
4.4. Experiment Result and Analysis
The experimental results and their analysis are detailed in this section. In this experiment, the segmentation method and TSMF are evaluated on the fishing dataset and the Geolife dataset. In addition, to observe the impact of the Pearson coefficient threshold and demonstrate the feasibility of the chosen setting, the algorithm is tested under different threshold values.
The results are shown in Figures 5–8, which display the number of subtrajectories, the average purity, the average coverage, and the harmonic mean under different Pearson coefficient thresholds on the two datasets.
The results of the segmentation method are shown in Figures 5–6. The results display that with the increase of , the sum of subtrajectories increases, the increases, the decreases, and the increases.
TSMF extends the segmentation method with an additional merging method. The mergence part merges the local subtrajectories produced by the segmentation method. The results of TSMF are shown in Figures 7–8. Compared with the results of the segmentation method alone, the number of subtrajectories is lower and the results are better. We can also observe that, in the mergence part, more subtrajectories are merged on the fishing dataset than on the Geolife dataset. The reason is that when fishing vessels engage in fishing, the speed is generally 4 miles per hour and the heading angle changes constantly. This leads to lower Pearson coefficient values, so the segmentation method may add many outlier points to the segmentation points. Geolife collected trajectory data of 182 users, which include various motion states. The feature differences between different motion states are large, while those within the same motion state are small. Therefore, the segmentation method can accurately discover the segmentation points, that is, there are fewer outlier points among the segmentation points.
Overall, the results of TSMF are better and the greater of , the segmentation method can extract more segmentation points and leads to the and becomes lower. But it does not mean that the lowest is the best selection. As shown in Figures 7–8, when the , the sum of subtrajectory is very low, that is, many segmentation points are lost of TSMF. The results also indicate that it is the feasibility of .
4.5. Comparing TSMF with Other Baseline Algorithms
In this section, the experiment is repeated in the same environment, and TSMF is compared with four other trajectory segmentation algorithms (CB-SMoT, GRASP-UTS, SPD, and SWS) on the fishing dataset and the Geolife dataset. The results are reported in Figure 9. As shown in Figure 9, the harmonic mean reaches 90.1% and 94.28% on the two datasets, and TSMF achieves the highest harmonic mean of purity and coverage. The results also demonstrate the feasibility of TSMF.
5. Conclusions
It is envisioned that future wireless communications will be more data-driven. Mobile edge cloud, beamforming, and artificial intelligence techniques make it possible to obtain high-accuracy, long-term trajectories of moving objects. However, long-term location data require huge computing resources to process, and much information is lost in the process. A segmentation algorithm designed for location data is a basic step in developing location-based applications. This study proposes an unsupervised trajectory segmentation algorithm, named TSMF, which employs the Pearson coefficient to find the segmentation points and a minimum cost function to merge oversegmented subtrajectories. We compared the proposed segmentation algorithm against GRASP-UTS, SPD, CB-SMoT, and SWS; the results show that it reaches the best harmonic mean of purity and coverage on the fishing dataset and the Geolife dataset. Furthermore, the TSMF algorithm requires no labeled data and its time complexity is , which means it is computationally efficient and thus well suited to the segmentation of large trajectory datasets.
However, TSMF has one limitation: when the features are similar in different movement states, the proposed segmentation algorithm may not find suitable segmentation points for the raw trajectory.
As future work, we plan to extend this work in two directions. First, we would like to analyze trajectory motion patterns and predict the state of each subtrajectory in order to semantically enrich the raw trajectory. Second, we would like to apply the segmentation algorithm (TSMF) to more wireless positioning data, so that more artificial intelligence techniques can be used to mine valuable information.
Data Availability
The data and codes that support the findings of this study are available with the identifier(s) at the private link https://figshare.com/s/6e6fb483b076b2a34cbe.
Conflicts of Interest
The authors declare no conflict of interest.
Authors’ Contributions
Wenjin Xu and Shaokang Dong conceived and designed the experiments; Shaokang Dong performed the experiments and analyzed the data; Wenjin Xu and Shaokang Dong wrote the paper. Wenjin Xu and Shaokang Dong contributed equally to this work.