Abstract

Exploring urban travel patterns reveals the mobility regularities of residents and provides guidance for urban traffic planning and emergency decision-making. Clustering methods have been widely applied to extract hidden information on travel patterns from large-scale trajectory data. However, how to implement soft constraints in a clustering method and evaluate its effectiveness quantitatively remains a challenge. In this study, we propose an improved trajectory clustering method based on fuzzy density-based spatial clustering of applications with noise (TC-FDBSCAN) to classify trajectory data. Firstly, we define a trajectory distance that accounts for the influence of different attributes and determine the corresponding weight coefficients to measure the similarity among trajectories. Secondly, membership degrees and membership functions are designed in the fuzzy clustering method as an extension of the classical DBSCAN method. Finally, trajectory data collected in Shenzhen city, China, are divided into two types (workdays and weekends) and used in experiments to explore different travel patterns. Moreover, three indices, namely, the Silhouette Coefficient, the Davies–Bouldin index, and the Calinski–Harabasz index, are used to compare the effectiveness of the proposed method with that of other traditional clustering methods. The results demonstrate the advantage of the proposed method.

1. Introduction

Travel patterns can be explored by analyzing the travel characteristics of moving objects (vehicles and humans), which reflect people's travel regularity, traffic congestion regularity, and social activity patterns. Travel patterns have been applied in many areas, for instance, providing decision information for urban planning and emergency response [1–4], analyzing and optimizing paths to provide personalized travel recommendations for residents [5–8], vehicle dispatching [9, 10], and station optimization and selection [11]. These applications can provide insights for urban construction and development.

Nowadays, a growing number of researchers explore urban travel patterns using large-scale trajectory data, which contain a large amount of hidden information about travel features and regularities. Several approaches have been used for this purpose. (1) Clustering methods: trajectory clustering has been gaining increasing interest in recent years, and it generally requires two components, a similarity measurement and a clustering algorithm [12–15]. (2) Spatial statistics: Ni et al. [16] employed a spatial econometric model for travel flow analysis to explore factors that influence travel demand; Zhang et al. [17] proposed a Bayesian hierarchical approach for modeling destination choice behavior considering unavailable factors and spatio-temporal correlations; Kamruzzaman et al. [18] estimated the effects of urban form and spatial biases on residential mobility. (3) Deep learning: nonparametric deep learning methods in particular have been shown to give more accurate predictions in urban traffic forecasting [19, 20]. (4) Classical traffic theory models: some classical models have been used in travel pattern analysis, for instance, a combined Markov chain and multinomial logit model [21] to improve the accuracy of travel destination prediction and traffic assignment models [22] for O-D matrix estimation.

Among the abovementioned methods, clustering is an unsupervised learning process that groups data on the basis of similarity. By defining a reasonable similarity criterion, a clustering method can effectively extract hidden information from massive trajectory data and thus reveal travel patterns. Some researchers adopt traditional clustering methods for travel pattern analysis, such as C-means [23], shared nearest neighbor clustering [24], DBSCAN [25, 26], hierarchical clustering [27, 28], and the OPTICS algorithm [29, 30]. In addition, some researchers have proposed modified clustering methods to obtain better clustering results: a modified bee colony optimization (MBCO) [31], which introduces an approach based on probability selection; a hybrid model that fuses k-means and fuzzy c-means clustering with a modified cluster centroid (FKMFCM-MCC) [32]; and a modified find density peaks (MFDP) algorithm presented by Wang et al. [33], which transforms high-dimensional points into two dimensions and shows good potential for application.

The studies mentioned above have provided practical applications of travel pattern analysis, and several classical or modified clustering methods have been shown to achieve good clustering results when exploring residents' travel patterns. However, trajectory clustering for travel pattern analysis still has some limitations. Firstly, as for similarity measurement, many studies only consider the spatial attribute of trajectories, and fewer consider multiple attributes (such as temporal, directional, and other characteristics); ignoring the other attributes leads to inaccurate results and unreasonable travel patterns. Besides, many researchers who consider multiple attributes do not assign weights to them. Configuring and discussing the weights is necessary because different attributes have different influences on trajectory clustering. Secondly, trajectory data have faint and overlapping borders; thus, an efficient algorithm with soft constraints is required to handle such datasets, whereas many researchers adopt hard-partition clustering. Finally, many works focus on improving the clustering algorithm to reduce computational complexity and obtain better results, but few use multiple indices to evaluate the results quantitatively.

Fuzzy clustering methods are able to identify clusters with variable density distributions and partially overlapping borders [34]. In fuzzy clustering methods, the framework of fuzzy set theory is used to make up for the shortcomings of crisp clustering algorithms. Generally, these methods realize soft constraints by introducing a weight entropy, a membership function, or a fuzzy distance function. Currently, fuzzy clustering methods have been applied in many areas, such as predictive models: Seresht et al. [35] proposed a fuzzy clustering algorithm and assigned weights to the fuzzy inference systems to improve the accuracy of predictive models; virus research: Mahmoudi et al. [36] used a fuzzy clustering technique to compare the spread of COVID-19 in many countries; and image segmentation: Wu and Chen [37] proposed a novel fuzzy clustering method to improve the robustness of the existing picture clustering. Moreover, because trajectory data are unlabeled, some internal evaluation indices are adopted to analyze clustering results: the silhouette coefficient [38, 39], the Davies–Bouldin index [40], the Calinski–Harabasz index [41], the Q-Measure [42], and Dunn's index [39, 43].

The main contributions of this paper are as follows. Firstly, an improved fuzzy clustering method based on DBSCAN is proposed to cluster taxi trajectory data, in which fuzzy theory is introduced to define membership functions. It differs from the traditional DBSCAN algorithm and can effectively deal with taxi trajectory data, which have faint and overlapping borders. The greatest advantage of the proposed method is the introduction of soft constraints, which allows trajectories to be divided into reasonable clusters. In contrast, traditional DBSCAN uses a hard constraint: when a trajectory may belong to either cluster 1 or cluster 2, the algorithm simply assigns it to the first cluster, which is unreasonable. Secondly, we define a trajectory distance that combines spatial, temporal, and directional attributes to measure the similarity between trajectories. Besides, the weights of the different attributes in the trajectory distance function are optimized to obtain better results. Finally, several internal evaluation indices are used in model comparison, and the results show the effectiveness and advantage of the proposed method compared with other classical clustering methods.

The remainder of this paper is organized as follows. Section 2 introduces the framework and theory of the proposed method. Section 3 describes the data and presents the case study results for Shenzhen city. In Section 4, the compared approaches are evaluated, and the results are discussed. Section 5 concludes the paper.

2. Methodology

2.1. Framework

The framework of the proposed method for taxi trajectory clustering is shown in Figure 1. The framework mainly contains three parts: (1) after the initial trajectory data are preprocessed, the trajectory distance is calculated by combining the spatial, temporal, and directional distances using the weight coefficients; (2) TC-FDBSCAN is adopted to cluster the trajectories, which requires determining the weight coefficients and other algorithm parameters; (3) three indices are used to evaluate the clustering results.

2.2. Multicharacteristics Similarity Measurement

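A trajectory is denoted by $TR$ in this paper: $TR = \{p_1, p_2, \ldots, p_n\}$ with $p_k = (x_k, y_k, t_k)$, where $(x_k, y_k)$ denotes the location (abscissa $x_k$ and ordinate $y_k$; the longitude and latitude are converted to plane coordinates) of track point $p_k$, and $t_k$ denotes the recording time of track point $p_k$. Next, we introduce measurements of multicharacteristic similarity between two trajectories $TR_i$ and $TR_j$, where $TR_i = \{p_1^i, p_2^i, \ldots, p_{N_i}^i\}$ and $TR_j = \{p_1^j, p_2^j, \ldots, p_{N_j}^j\}$.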

2.2.1. Spatial Distance between Trajectories

Determining a rule for calculating the spatial distance between the track points of two trajectories is the key to measuring their spatial similarity. The Hausdorff distance is a common distance measure between two point sets and can be applied to trajectory datasets. Figure 2 shows the Hausdorff distance between two trajectories.

The traditional Hausdorff distance calculates the Max-Min distance, which can only measure the dispersion of the distance between trajectories and is susceptible to the local shapes of trajectories. In order to improve its robustness to local effects, a modified Hausdorff distance is defined as follows:

$$h(TR_i, TR_j) = \frac{1}{N_i} \sum_{a \in TR_i} \min_{b \in TR_j} d(a, b),$$

$$d_S(TR_i, TR_j) = \max\{h(TR_i, TR_j),\, h(TR_j, TR_i)\},$$

where $N_i$ denotes the number of track points in $TR_i$, $N_j$ denotes the number of points in $TR_j$, $d(a, b)$ is the Euclidean distance between points $a$ and $b$, and $d_S(TR_i, TR_j)$ is an absolute Hausdorff distance, which denotes the spatial distance between two trajectories.
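As an illustration of the averaging idea behind the modified Hausdorff distance, the following minimal sketch replaces the outer maximum of the classical Max-Min form by a mean over the track points and then symmetrizes the result. It assumes that each trajectory is given as a list of planar `(x, y)` tuples; the function names and the sample coordinates are hypothetical and are not taken from the study.

```python
import math


def directed_mean_min_distance(a, b):
    """Mean, over the points of `a`, of the distance to the nearest point of `b`."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def modified_hausdorff(a, b):
    """Symmetric modified Hausdorff distance between two point sequences."""
    return max(directed_mean_min_distance(a, b),
               directed_mean_min_distance(b, a))


# Hypothetical trajectories in plane coordinates (assumption, not study data).
tr_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
tr_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
print(modified_hausdorff(tr_a, tr_b))
```

Because every track point contributes to the mean, a single outlying point shifts the result far less than it would shift the classical Max-Min value.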

2.2.2. Temporal Distance between Trajectories

The trajectory is represented as an interval, which is related to time of starting points and time of ending points. Considering the influence of trajectory duration, the temporal distance between two trajectories is defined as follows:where and .

The temporal distance is 0 only if the two trajectories have the same starting time and the same ending time, which means that they are completely similar in the temporal characteristic. Otherwise, the two trajectories are not completely similar and have a temporal distance between 0 and 1.
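The properties stated above (a distance of 0 only for identical time intervals, and values bounded by 0 and 1 otherwise) are shared by several interval distances. The sketch below shows one such candidate, one minus the ratio of the overlap to the total span of the two intervals. This particular form is an assumption chosen for illustration and may differ from the definition used in the study; times are assumed to be plain numbers, for example seconds since midnight.

```python
def temporal_distance(start_a, end_a, start_b, end_b):
    """One minus overlap/span of two time intervals (assumed form, for illustration)."""
    overlap = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    span = max(end_a, end_b) - min(start_a, start_b)
    if span == 0:
        # Both intervals collapse to the same instant.
        return 0.0
    return 1.0 - overlap / span


print(temporal_distance(0, 10, 0, 10))   # identical intervals
print(temporal_distance(0, 10, 5, 15))   # partial overlap
print(temporal_distance(0, 10, 20, 30))  # disjoint intervals
```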

2.2.3. Directional Distance between Trajectories

Linear directional mean is commonly used to describe the trend or average direction of a set of lines. On this basis, for trajectory $TR_i$, each trajectory segment (which consists of two consecutive points in $TR_i$) is treated as a line, and the average direction of the trajectory is defined as follows:

$$\bar{\theta}_i = \arctan \frac{\sum_{k=1}^{N_i - 1} \sin \theta_k}{\sum_{k=1}^{N_i - 1} \cos \theta_k},$$

where $\theta_k$ is the real direction of the $k$th trajectory segment, which represents the angle measured counterclockwise from due east, $\theta_k \in [0^\circ, 360^\circ)$.

The angle of linear directional mean between two trajectories is defined as follows:
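As a rough illustration of the two directional quantities, the sketch below computes segment directions counterclockwise from due east, averages them with the usual sine and cosine sums, and compares two mean directions. The use of `atan2` to resolve the quadrant and the wrap-around rule in `angular_difference` (the smaller of the two arcs, so the result lies in [0°, 180°]) are assumptions made for this example; all function names are hypothetical.

```python
import math


def segment_directions(points):
    """Direction of each segment in degrees, counterclockwise from due east, in [0, 360)."""
    return [math.degrees(math.atan2(y2 - y1, x2 - x1)) % 360.0
            for (x1, y1), (x2, y2) in zip(points, points[1:])]


def linear_directional_mean(points):
    """Average direction of a trajectory from the sine and cosine sums of its segments."""
    angles = [math.radians(a) for a in segment_directions(points)]
    sin_sum = sum(math.sin(a) for a in angles)
    cos_sum = sum(math.cos(a) for a in angles)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0


def angular_difference(theta_a, theta_b):
    """Smaller arc between two directions in degrees (assumed wrap-around rule)."""
    diff = abs(theta_a - theta_b) % 360.0
    return min(diff, 360.0 - diff)


print(linear_directional_mean([(0, 0), (1, 1), (2, 2)]))
print(angular_difference(138.26, 319.35))
```

With this rule, two main directions of 138.26° and 319.35° are separated by roughly 179°, that is, they are almost exactly opposite.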

2.2.4. Trajectory Distance

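Firstly, the spatial and directional distances need to be Min-Max normalized and converted to dimensionless values, while the temporal distance does not need to be normalized because its value is between 0 and 1; the normalized distances are then defined as follows:

$$d_S' = \frac{d_S - \min(D_S)}{\max(D_S) - \min(D_S)}, \qquad d_D' = \frac{d_D - \min(D_D)}{\max(D_D) - \min(D_D)},$$

where $d_S$ and $d_D$ are the spatial and directional distances between a pair of trajectories, and $D_S$ and $D_D$ are the sets of spatial and directional distances between all pairs of trajectories, respectively.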

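Combining the spatial, temporal, and directional distances, the trajectory distance is defined considering the influence of weights:

$$D = w_S\, d_S' + w_T\, d_T + w_D\, d_D',$$

where $d_S'$ and $d_D'$ are the normalized spatial and directional distances, $d_T$ is the temporal distance, and $w_S$, $w_T$, and $w_D$ are weight coefficients.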
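The normalization and weighting steps can be sketched as follows. The example assumes that the distances for all trajectory pairs have already been collected into flat lists, that the three weights sum to 1, and that the sample values and the weight triple `(0.4, 0.3, 0.3)` are purely hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of distances linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def trajectory_distance(d_spatial, d_temporal, d_directional, weights):
    """Weighted sum of the three dimensionless distance components."""
    w_s, w_t, w_d = weights
    return w_s * d_spatial + w_t * d_temporal + w_d * d_directional


# Hypothetical distances for three trajectory pairs (assumption, not study data).
spatial = min_max_normalize([120.0, 480.0, 900.0])    # metres before scaling
directional = min_max_normalize([10.0, 95.0, 180.0])  # degrees before scaling
temporal = [0.2, 0.5, 0.9]                            # already within [0, 1]

combined = [trajectory_distance(s, t, d, (0.4, 0.3, 0.3))
            for s, t, d in zip(spatial, temporal, directional)]
print(combined)
```

When the weights sum to 1 and every component lies in [0, 1], the combined distance also stays in [0, 1], which keeps radius parameters comparable across weight combinations.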

2.3. Trajectory Clustering Method TC-FDBSCAN

In this section, based on the classical DBSCAN method, an improved TC-FDBSCAN (trajectory clustering method based on fuzzy density-based spatial clustering of applications with noise) method is proposed in trajectory data clustering. In this method, membership functions are introduced to achieve fuzziness and parameters (minimum trajectories in neighborhood) and (radius of neighborhood) in classical DBSCAN are replaced by (minimum value of minimum trajectories in neighborhood), (maximum value of minimum trajectories in neighborhood), (minimum radius), and (maximum radius).

Several extended definitions of the modified method are described in detail as follows.

Definition 1. Neighborhood: the region within a definite radius of a trajectory.

Definition 2. Core trajectory: the number of trajectories in the neighborhood of a trajectory is greater than a definite value.

Definition 3. The local density of a trajectory:where and denotes the number of trajectories in neighborhood with membership degree .

Definition 4. membership function:This membership function considers the fuzzies of neighborhood, which causes a trajectory having a specified membership degree in the fuzzy neighborhood of another trajectory.

Definition 5. membership function:This membership function considers the fuzzies of core trajectory, which causes the number of trajectories in the fuzzy neighborhood having a specified membership degree.

Definition 6. Core membership degree: if , then trajectory is a fuzzy core trajectory, and it belongs to a cluster with core membership degree .

Definition 7. Border membership degree: if , then trajectory should be a border or noise trajectory. Then, can be a fuzzy border trajectory which belongs to a cluster with border membership degree:where .
The trajectory will be a noise trajectory if it is not a fuzzy core trajectory or fuzzy border trajectory.

Definition 8. Directly density-reachable: if trajectory is in the fuzzy neighborhood of trajectory and is a fuzzy core trajectory, then is directly density-reachable to .

Definition 9. Density-reachable: if each is directly density-reachable to under the condition of and are in the , then is density-reachable to .

Definition 10. Density-connected: if is density-reachable to both and , then and are density-connected.
The aim of the TC-FDBSCAN is to divide the regions with high density into clusters, which are the largest sets of density-connected trajectories. The major difference from classical DBSCAN is that a trajectory is determined to be a core trajectory or border trajectory considering membership degree, which means the possibility of this trajectory to belong to clusters.
The process, which is similar to that of classical DBSCAN, is as follows:
(1) Randomly select a trajectory to visit. If it is a fuzzy core trajectory, then add it into a new cluster with membership degree computed by equation (11) and add other trajectories in its fuzzy neighborhood into an alternate set.
(2) Visit other trajectories in the alternate set; if the trajectories satisfy the fuzzy core trajectory condition, then add them into the original cluster with membership degree computed by equation (11) and add the trajectories in their fuzzy neighborhood; if they do not satisfy the fuzzy core trajectory condition, then add them into original cluster as fuzzy border trajectories with membership degree computed by equation (12).
(3) Repeat step (2) until all trajectories in the alternate set have been visited.
(4) Repeat step (1) to step (3) until all trajectories in the dataset have been visited.
There is a simple approach to allow users to determine two percentages, and . And, then, use and to determine the values of and , where denotes the maximum distance between all pairs of trajectories. For and , a curve can be drawn, in which the x-coordinate is and y-coordinate is the number of trajectories in the set where the distance between each other is equal to . This curve is not monotonically decreasing; then, the corresponding y-coordinate values in the first two bends can be selected as and .
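The expansion procedure can be sketched in a few lines under explicit assumptions. The membership functions below are linear ramps (a common choice in fuzzy DBSCAN variants), the local density is taken as the sum of neighborhood memberships, and a border trajectory receives the largest value of min(core membership, neighborhood membership) over the fuzzy cores that reach it. None of these forms is confirmed by the definitions above; the parameter names `eps_min`, `eps_max`, `min_trs_min`, and `min_trs_max` are hypothetical, and trajectories are visited in index order rather than randomly so that the result is reproducible.

```python
def neighborhood_membership(d, eps_min, eps_max):
    """1 within eps_min, 0 beyond eps_max, linear in between (assumed shape)."""
    if d <= eps_min:
        return 1.0
    if d >= eps_max:
        return 0.0
    return (eps_max - d) / (eps_max - eps_min)


def core_membership(density, min_trs_min, min_trs_max):
    """1 at or above min_trs_max, 0 at or below min_trs_min, linear in between."""
    if density >= min_trs_max:
        return 1.0
    if density <= min_trs_min:
        return 0.0
    return (density - min_trs_min) / (min_trs_max - min_trs_min)


def fuzzy_dbscan(dist, eps_min, eps_max, min_trs_min, min_trs_max):
    """Return (labels, degrees) from a symmetric distance matrix; label -1 is noise."""
    n = len(dist)
    neigh = []  # fuzzy neighborhood of each trajectory: {index: membership > 0}
    for i in range(n):
        row = {}
        for j in range(n):
            mu = neighborhood_membership(dist[i][j], eps_min, eps_max)
            if mu > 0.0:
                row[j] = mu
        neigh.append(row)
    core = [core_membership(sum(row.values()), min_trs_min, min_trs_max)
            for row in neigh]
    labels = [None] * n
    degree = [0.0] * n
    cluster = 0
    for seed in range(n):
        if labels[seed] is not None or core[seed] == 0.0:
            continue
        labels[seed], degree[seed] = cluster, core[seed]
        alternate = [seed]  # fuzzy cores whose neighborhoods still need a visit
        while alternate:
            i = alternate.pop()
            for j, mu in neigh[i].items():
                if core[j] > 0.0:
                    if labels[j] is None:
                        labels[j], degree[j] = cluster, core[j]
                        alternate.append(j)
                elif labels[j] is None or labels[j] == cluster:
                    # Border trajectory: keep the strongest support seen so far.
                    labels[j] = cluster
                    degree[j] = max(degree[j], min(core[i], mu))
        cluster += 1
    return [-1 if lab is None else lab for lab in labels], degree


# Hypothetical one-dimensional positions standing in for trajectories.
xs = [0.0, 1.0, 2.0, 3.5, 10.0, 11.0, 12.0, 50.0]
dist = [[abs(a - b) for b in xs] for a in xs]
print(fuzzy_dbscan(dist, 1.0, 2.0, 2, 3))
```

In this sketch, setting `eps_min == eps_max` and `min_trs_min == min_trs_max` makes both ramps collapse to steps, so the procedure behaves like crisp DBSCAN with every membership degree equal to 0 or 1.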

2.4. Cluster Evaluation Indices

In many studies of cluster evaluation, the experimental data are labeled, so the cluster to which each sample belongs is known in advance, and accuracy-based measures such as Purity, Rand Index, and Accuracy can be used. Trajectory data, however, are unlabeled, and there is no external information that can be used to verify the clustering results. Therefore, internal evaluation indices, which consider the geometric structure of the data, are required to evaluate the quality of the clustering results.

Silhouette coefficient (SC) [36]:where is the number of clusters, denotes the th cluster, denotes the number of trajectories in , and denotes the center of . The larger value of indicates better clustering results.

Davies–Bouldin index (DB) [44]:where denotes the th cluster, denotes the number of trajectories in , and denotes the center of . The smaller value of indicates better clustering results.

Calinski–Harabasz index (CH) [45]:where denotes the number of all trajectories in dataset and denotes the center of dataset. The larger value of indicates better clustering results.
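For unlabeled trajectories, an internal index can be computed directly from the pairwise trajectory distances. The expressions in this section refer to cluster centers; the sketch below instead uses the standard pairwise definition of the silhouette coefficient, which is an assumption made so that only a distance matrix is needed, and its values may therefore differ from those of a center-based variant. Treating label -1 as noise and skipping it is also an assumption of this example.

```python
def silhouette_coefficient(dist, labels):
    """Mean silhouette over clustered samples, computed from a distance matrix."""
    clusters = {}
    for idx, lab in enumerate(labels):
        if lab != -1:  # noise samples are skipped
            clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for lab, members in clusters.items():
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(dist[i][j] for j in members if j != i) / (len(members) - 1)
            # b: smallest mean distance to the members of any other cluster
            b = min(sum(dist[i][j] for j in other) / len(other)
                    for other_lab, other in clusters.items() if other_lab != lab)
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical one-dimensional positions standing in for trajectories.
pts = [0.0, 1.0, 10.0, 11.0]
dmat = [[abs(p - q) for q in pts] for p in pts]
print(silhouette_coefficient(dmat, [0, 0, 1, 1]))
```

A labeling that separates the two groups scores close to 1, whereas a labeling that mixes them gives a negative value, which matches the rule that a larger value indicates a better clustering result.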

3. Case Study

3.1. Data Description

The research area of the experiments is Shenzhen city, China. Shenzhen consists of nine administrative districts (Futian, Luohu, Nanshan, Yantian, Baoan, Longgang, Longhua, Pingshan, and Guangming) and one functional district (Dapeng), which are located between 113°46′ E and 114°37′ E and between 22°27′ N and 22°52′ N. In this study, taxi trajectory data are used in the experiments; they are composed of a series of sample points collected by vehicular GPS equipment.

Each sample point includes the license plate, location (latitude and longitude), recording time, instantaneous speed, and state (0 represents vacant and 1 represents occupied). We use the data collected by 1000 taxis during one week in May 2019 and divide them into two groups, workdays (from May 13th to May 17th) and weekends (from May 18th to May 19th), on which the experiments are conducted separately. In order to reduce the influence of abnormal points on the clustering results, the original data are first preprocessed, which mainly includes eliminating outliers (points whose latitude or longitude is 0 or out of the valid range, or whose state is neither 0 nor 1) and interpolating missing points (discontinuities in the recording time interval). Finally, 2,499,880 original points during the workdays generate 82,144 passenger-occupied trajectories (49,456 trajectories are valid), and 107,186 original points during the weekends generate 34,384 passenger-occupied trajectories (22,774 trajectories are valid).
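The outlier filter and the extraction of passenger-occupied trajectories can be sketched as below. The record layout (dictionaries with `plate`, `lon`, `lat`, `time`, and `state` keys) is hypothetical, the valid range is taken to be the bounding box of the study area given above, and the interpolation of missing points is left out of this example.

```python
from collections import defaultdict

# Bounding box of the study area in decimal degrees (113°46′–114°37′ E, 22°27′–22°52′ N).
LON_RANGE = (113.0 + 46.0 / 60.0, 114.0 + 37.0 / 60.0)
LAT_RANGE = (22.0 + 27.0 / 60.0, 22.0 + 52.0 / 60.0)


def is_valid(point):
    """Keep points inside the bounding box whose state is 0 (vacant) or 1 (occupied)."""
    return (LON_RANGE[0] <= point["lon"] <= LON_RANGE[1]
            and LAT_RANGE[0] <= point["lat"] <= LAT_RANGE[1]
            and point["state"] in (0, 1))


def occupied_trajectories(points):
    """Split each taxi's valid, time-ordered points into runs of occupied points."""
    by_plate = defaultdict(list)
    for p in points:
        if is_valid(p):
            by_plate[p["plate"]].append(p)
    trajectories = []
    for plate_points in by_plate.values():
        plate_points.sort(key=lambda p: p["time"])
        current = []
        for p in plate_points:
            if p["state"] == 1:
                current.append(p)
            elif current:
                # A vacant point closes the current occupied run.
                trajectories.append(current)
                current = []
        if current:
            trajectories.append(current)
    return trajectories
```

A coordinate of 0 falls outside the bounding box, so the range test already removes such points without a separate check.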

The travel time and travel distance of trajectories on workdays and weekends are calculated separately, as shown in Figure 3. The percentage represents the ratio of the number of trajectories in an interval to the number of all trajectories. It can be concluded that, on both workdays and weekends, most passengers take a taxi for a short (0 to 5 km) or medium (5 to 15 km) travel distance, and most trips take no more than 20 minutes. Few passengers travel by taxi for more than 1 hour.

3.2. Parameter Configuration

In the proposed method, the parameters including , , , , , , and are needed to be determined to obtain reasonable clustering results. In parameter configuration and comparison, we adjust parameters and select the reasonable parameter combination according to the Silhouette Coefficient (SC) of the clustering results. Firstly, set up several combinations of weight coefficients to compare the clustering results, and the trajectory distance can be calculated meanwhile. Secondly, determine two percentages and according to the approach mentioned in Section 2.3; then, and can be determined according to the value of . In this part, is equal to 1 in these combinations, so we select and . Thirdly, the statistical curve is drawn to select and as mentioned in Section 2.3. For workday data, and , which are shown in Figure 4(b). While, for weekend data, and , which are shown in Figure 5(b). Finally, the optimal parameter combination is determined in the condition that the value of SC is largest in all combinations.

Next, the process of parameter configuration is demonstrated for the workday data and the weekend data, respectively. In total, we compare twenty combinations of weight coefficients, and the detailed results, including the SC, the number of clusters, and the number of noise trajectories of each clustering result, are shown in Tables 1 and 2. The points with the largest value of SC can be observed in Figures 4(a) and 5(a). These points indicate that the corresponding weight combination gives the best clustering result for the case study. The best results are highlighted in bold in Tables 1 and 2.

Finally, for workday data, we select , , , , , , and , while, for weekend data, we select , , , , , , and .
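The selection rule used in this section (cluster once per weight combination and keep the combination with the largest SC) can be sketched as a small search. The grid below enumerates weight triples in steps of 0.1 that sum to 1, which is an assumption made for illustration and yields 66 candidates rather than the twenty combinations compared in the study; `score` is a hypothetical callback that would run the clustering for one weight triple and return its SC.

```python
def weight_grid(step_count=10):
    """All (w_s, w_t, w_d) that are multiples of 1/step_count and sum to 1."""
    return [(i / step_count, j / step_count, (step_count - i - j) / step_count)
            for i in range(step_count + 1)
            for j in range(step_count + 1 - i)]


def select_weights(candidates, score):
    """Return the candidate weight triple with the largest score."""
    return max(candidates, key=score)


def toy_score(weights):
    """Stand-in for a clustering run that returns SC; it peaks at (0.5, 0.3, 0.2)."""
    return -abs(weights[0] - 0.5) - abs(weights[1] - 0.3)


print(select_weights(weight_grid(), toy_score))
```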

3.3. Clustering Results

Based on the selected parameters, we then adopt the proposed method to cluster the taxi trajectory data in Shenzhen city. Figure 6 shows the travel patterns of the taxi trajectory data collected on workdays and weekends. Tables 3 and 4 show the detailed temporal and directional information of each cluster. The value of the main direction represents the angle measured counterclockwise from due east.

Several findings can be drawn by comparing the final results for the workday and weekend data. The following common phenomena are observed. (1) Most of the trajectories in the final clusters are located on the arterial roads of the city and are concentrated in the downtown area. (2) Some main commuter roads are detected in the clusters of both datasets (for example, the Beijing-Hong Kong-Macau expressway and Binhe avenue are two main roads on which most trajectories of the clusters are located, as can be observed in Figure 6(a) Cluster 5 and Figure 6(b) Cluster 1). (3) Some clusters do not show obvious characteristics in the temporal attribute, while they are prominent in the spatial and directional attributes (for example, the main travel time periods of Cluster 4 and Cluster 5 in Figure 6(a) do not show a concentrated pattern, while Cluster 4 is concentrated in Guangming district with a travel trend towards the southeast, and Cluster 5 is concentrated in Futian district with a travel trend towards the east). (4) Clusters with opposite directions are found in both the workday and weekend results (for example, in Figure 6(a) and Table 3, Cluster 3 and Cluster 4 have opposite directions, since the main direction of Cluster 3 is 138.26° and the main direction of Cluster 4 is 319.35°). This shows that the proposed method can effectively separate trajectories with different directional trends.

There are also obvious differences between workdays and weekends. Firstly, the clustering result for the weekend data is better than that for the workday data according to the value of SC (workday: 0.6382 and weekend: 0.6418). Secondly, on weekends, the clusters are more widespread, and residents tend to travel to the suburbs (for example, in Figure 6(b), Cluster 2 and Cluster 3 reflect a larger travel range, and trajectories on roads including Pingshan avenue and Pingkui road are added to the final clusters). Thirdly, in contrast to workdays, some clusters reflect night travel on weekends (for example, in Figure 6(b) and Table 4, the trajectories in Cluster 4 are concentrated in the night period from 22:00 to 0:30). Finally, the number of clusters on weekends is larger than that on workdays, and the directions of the clusters are more varied on weekends, as shown in Figures 6(a) and 6(b).

4. Discussion

In order to verify the effectiveness of the proposed method, we compare it with other commonly used clustering methods in this section. The case data are 10,000 trajectories randomly selected from the original data (from May 13th to May 19th). The three internal evaluation indices described in Section 2.4, namely, the Silhouette Coefficient, the Davies–Bouldin index, and the Calinski–Harabasz index, are adopted to evaluate the clustering results.

4.1. Comparative Approaches
4.1.1. Hard C-Means (HCM)

HCM, or the K-means clustering algorithm, is an iterative cluster analysis algorithm. In this method, the data are divided into C groups, and C objects are randomly selected as the initial clustering centers. Then, the distance between each object and every clustering center is calculated, and each object is assigned to its nearest clustering center. A clustering center and the objects assigned to it represent a cluster. Once all objects have been assigned, the center of each cluster is recalculated according to the objects currently in the cluster. This process is repeated until a termination condition is met. The termination condition can be that no (or a minimum number of) objects are reassigned to different clusters, that no (or a minimum number of) cluster centers change again, or that the error sum of squares reaches a local minimum.
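The assign-and-update loop described above can be sketched directly on a pairwise distance matrix. Because a mean trajectory is not defined by pairwise distances alone, this sketch swaps in medoids (the member with the smallest total distance to the other members) as cluster centers, which is an assumption of the example and not necessarily how the comparison was implemented; the function name and the `seed` argument are hypothetical.

```python
import random


def hard_c_medoids(dist, c, max_iter=100, seed=0):
    """Hard partition into c clusters using medoid centers on a distance matrix."""
    n = len(dist)
    rng = random.Random(seed)
    centers = rng.sample(range(n), c)  # random initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest center.
        new_labels = [min(range(c), key=lambda k: dist[i][centers[k]])
                      for i in range(n)]
        if new_labels == labels:
            break  # no object was reassigned
        labels = new_labels
        # Update step: the new center minimizes the total distance to its members.
        for k in range(c):
            members = [i for i in range(n) if labels[i] == k]
            if members:
                centers[k] = min(members,
                                 key=lambda m: sum(dist[m][j] for j in members))
    return labels, centers


# Hypothetical one-dimensional positions standing in for trajectories.
hcm_xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
hcm_dist = [[abs(a - b) for b in hcm_xs] for a in hcm_xs]
print(hard_c_medoids(hcm_dist, 2))
```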

4.1.2. Fuzzy C-Means (FCM)

Fuzzy C-means clustering is a clustering algorithm that uses membership degrees to express the degree to which each object belongs to a certain cluster. It is an improvement of the earlier HCM clustering method. FCM divides the original data into C fuzzy groups and finds the clustering center of each group so as to minimize the value function of the nonsimilarity index. The main difference between FCM and HCM is that FCM applies a fuzzy partition, so that the assignment of each trajectory to a cluster is described by a membership degree between 0 and 1. In accordance with the fuzzy partition, the membership matrix is normalized so that the membership degrees of each object over all clusters always sum to 1.
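The normalization property can be seen in the standard FCM membership update, sketched below for a matrix of distances from each object to each cluster center. The fuzzifier value `m = 2` and the handling of an object that coincides with a center are assumptions of this example; the function name is hypothetical.

```python
def fcm_memberships(dist_to_centers, m=2.0):
    """Standard FCM membership update; each returned row sums to 1."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for row in dist_to_centers:
        zeros = [k for k, d in enumerate(row) if d == 0.0]
        if zeros:
            # The object coincides with one or more centers: split membership among them.
            memberships.append([1.0 / len(zeros) if k in zeros else 0.0
                                for k in range(len(row))])
            continue
        memberships.append([1.0 / sum((d_k / d_j) ** exponent for d_j in row)
                            for d_k in row])
    return memberships


# One object at distance 1 from the first center and 3 from the second (hypothetical).
print(fcm_memberships([[1.0, 3.0]]))
```

An object three times closer to one center than to the other receives membership degrees of about 0.9 and 0.1, rather than the hard 1 and 0 that HCM would assign.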

4.1.3. Agglomerative Nesting Algorithm (AGNES)

The agglomerative nesting algorithm adopts a bottom-up strategy. Each object is initially treated as a cluster, and these clusters are then merged step by step according to some criterion. The distance between two clusters can be determined by the similarity of the closest pair of objects in the two clusters. The merging process is repeated until the predefined number of clusters is reached.
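A minimal sketch of this bottom-up merging, using the closest-pair (single-linkage) cluster distance mentioned above and stopping at a requested number of clusters, is given below. It works on a pairwise distance matrix and favors readability over speed; the function name is hypothetical.

```python
def agnes_single_linkage(dist, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]  # every object starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance of the closest pair across the two clusters.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    labels = [0] * len(dist)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels


# Hypothetical one-dimensional positions standing in for trajectories.
ag_xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
ag_dist = [[abs(a - b) for b in ag_xs] for a in ag_xs]
print(agnes_single_linkage(ag_dist, 2))
```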

4.1.4. DBSCAN

DBSCAN is a density-based clustering algorithm, which generally assumes that clusters can be determined by the density of the sample distribution. Samples of the same cluster are closely related to each other, and there must be other samples of the same cluster not far away from any sample of the cluster. A cluster is obtained by grouping closely related samples together. The final result of all clusters is obtained by dividing all groups of closely related samples into different clusters. The definitions of DBSCAN algorithm are similar to the descriptions in Section 2.3 without membership constraint.

4.1.5. Shared Nearest Neighbor Clustering (SNNC)

The shared nearest neighbor clustering algorithm was proposed by Jarvis and Patrick, where a link is created between a pair of points if and only if each point is in the k-nearest neighbor list of the other [46]. This algorithm is an extension of DBSCAN. The basic idea of SNNC is to determine the core points around which clusters of various sizes and shapes are built, without the need to specify the number of clusters [47]. The similarity between two points is determined by counting the number of points that their k-nearest neighbor lists, built with the distance metric, have in common. The greater the number of shared points, the higher the similarity between the two points.
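The shared-neighbor count can be sketched as follows on a pairwise distance matrix. The similarity is set to zero unless the two points appear in each other's k-nearest neighbor lists, following the linking rule above; the function names and the tie-breaking by index order are assumptions of this example.

```python
def k_nearest_neighbors(dist, k):
    """Set of the k nearest other points for every point (ties broken by index)."""
    n = len(dist)
    return [set(sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])[:k])
            for i in range(n)]


def snn_similarity(dist, k):
    """Number of shared k-nearest neighbors for mutually linked pairs, else 0."""
    knn = k_nearest_neighbors(dist, k)
    n = len(dist)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if i in knn[j] and j in knn[i]:
                sim[i][j] = sim[j][i] = len(knn[i] & knn[j])
    return sim


# Hypothetical one-dimensional positions standing in for trajectories.
snn_xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
snn_dist = [[abs(a - b) for b in snn_xs] for a in snn_xs]
print(snn_similarity(snn_dist, 2))
```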

4.2. Results’ Discussion

In order to apply these approaches to the case study in comparison analysis, the distance in HCM, FCM, AGNES, DBSCAN, and SNNC is calculated by the trajectory distance as mentioned in our method. The weight coefficients are determined, so the trajectory distance is described by . The detailed results are shown in Table 5.

Table 5 and Figure 7 show the effectiveness evaluation of the different clustering approaches, from which we can observe their ability to analyze unlabeled data samples. The values of SC show that the proposed TC-FDBSCAN model produces better clustering results than the other methods, while AGNES produces worse clustering results than the others. In Figure 7, the DB index and the CH index show a similar picture: (1) TC-FDBSCAN is clearly superior to the other methods; (2) AGNES is clearly inferior to the others; (3) HCM and FCM give similar results, with FCM slightly better than HCM; (4) DBSCAN and SNNC perform similarly, with SNNC slightly better than DBSCAN. Overall, the proposed method can find a better clustering partition and provide high clustering accuracy for large-scale trajectory data in travel pattern analysis.

5. Conclusion

Clustering taxi trajectories based on similarity measurement is a widely applied way to explore urban travel patterns. This study proposes an improved TC-FDBSCAN method to uncover urban travel patterns. The taxi trajectory data collected in Shenzhen city are used to evaluate the clustering results in the case study. The dataset is divided into two parts, workdays and weekends, which are used in clustering analysis and model comparison. The main findings are as follows. (1) On both workdays and weekends, the trajectories in clusters are mainly distributed on the arterial roads. However, the clustering results show that the range of residents' travel is wider on weekends than on workdays. (2) Introducing fuzzy theory into the traditional DBSCAN algorithm can improve clustering performance according to the three evaluation indices. (3) Different attributes of trajectories have different influences on the clustering results according to the values of the weight coefficients.

There are still some limitations that need to be addressed in future studies. On the one hand, other fuzzy clustering methods need to be studied to reduce the computational complexity of the algorithm. On the other hand, other fuzzy techniques, such as weight entropy, should be incorporated into the trajectory clustering method. Moreover, the proposed method should also be applied to different cities to verify its generality.

Data Availability

The trajectory data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was funded in part by the Innovation-Driven Project of Central South University (no. 2020CX041), Natural Science Foundation of Hunan Province (no. 2020JJ4752), National Key R&D Program of China (no. 2020YFB1600400), and Foundation of Central South University (no. 502045002).