Abstract

GPS trajectory data at intersections are series data with different lengths. Dynamic time warping (DTW) is well suited to measuring the similarity between series of different lengths; however, traditional DTW cannot handle the inclusion relationship between series well. We propose a unified generalized DTW algorithm (GDTW) by extending the boundary and continuity constraints of DTW and using a weighted local distance to normalize the cumulative distance. Based on the density peak clustering algorithm (DPCA) and using asymmetric GDTW to measure the similarity between two trajectories, we propose an improved clustering algorithm (ADPC) that accommodates this asymmetric similarity measurement. In experiments, the proposed method reduces the number of clusters.

1. Introduction

With the development of intelligent and connected vehicle technology, a tremendous amount of vehicle trajectory data is becoming available in many applications, such as urban traffic planning and management, traffic analysis, etc. Mining and analyzing vehicle trajectory data are of great significance for intelligent transportation, point-of-interest recommendation, and location-based services [1]. Trajectory clustering can explore different spatiotemporal features of vehicle trajectory data and has been a research hotspot in recent years [2, 3].

Many studies have addressed vehicle trajectory clustering. How to define and measure the distance between trajectories appropriately is quite challenging. There are three main distance metrics used in the existing literature, i.e., Euclidean distance, shape-based distance, and warping-based distance. Traditional clustering methods using the Euclidean distance often lead to inaccurate results because of the different lengths of trajectories. To overcome this limitation, warping distances are used in several methods. Besides, shape-based distances, such as the Hausdorff and Fréchet distances, can also be applied to trajectories. On the other hand, deep learning-based methods have shown an impressive ability to process multidimensional sequence data [4–6].

Our work is similar to that presented in [7]; however, we focus on improving the DTW algorithm and the clustering algorithm. We propose a unified generalized DTW algorithm called GDTW. Moreover, we propose a clustering algorithm called ADPC by improving DPCA to accommodate the asymmetric similarity measurement.

In our application scenario, namely a road intersection, some short trajectories are contained in longer trajectories; that is, a short trajectory is a subsequence of a long trajectory. We assume that the short trajectory and the long trajectory have the same pattern, which means they belong to the same cluster, and that the long trajectory can represent the short trajectory. This inclusion relationship should be considered in the similarity measurement. Therefore, we generalize DTW to match this pattern and propose a clustering algorithm to accommodate the generalized DTW.

Dynamic time warping is a classic dynamic programming algorithm for measuring the similarity between series of different lengths. It was first proposed in the last century, when research focused on reducing computational complexity and constraining the search area [8]. At the beginning of the 21st century, research on lower bound functions and other techniques to accelerate DTW in applications became popular [9–11]. Throughout, improving calculation efficiency in applications has always been the focus of DTW-related research. On the other hand, there is also research on DTW variants that generalize its formulation [12, 13], and our work on DTW belongs to this area.

Clustering is also a widely studied area. There are many classic algorithms, from k-means and FCM to GMM, from graph-based hierarchical algorithms to spectral clustering, and various density-based algorithms. Density-based clustering methods can automatically determine the number of clusters and form clusters of arbitrary geometric shapes. There are several classic density-based algorithms, such as mean-shift [14], DBSCAN [15], and DPCA [16]. Based on DPCA, we introduce an asymmetric distance clustering algorithm to find the super-inclusive trajectory of each cluster.

In this paper, we propose a method for unsupervised clustering of vehicle trajectories in intersection areas. The contributions of this work are as follows:
(1) We propose a generalized DTW algorithm to calculate the similarity between two trajectories that can handle trajectories with inclusion relationships. First, we extend the definition of DTW through the concept of the matched scheme. Then, we generalize the boundary and continuity conditions of DTW and summarize the scale of the problem for each generalized condition.
(2) We prove the validity of generalized DTW, propose a unified algorithm to calculate all kinds of generalized DTW, and prove that this unified algorithm's complexity is the same as that of traditional DTW's dynamic programming algorithm. We then use asymmetric GDTW as the similarity measurement to match the inclusion relationship between trajectories.
(3) We discuss different normalization methods for the cumulative DTW distance and propose a weighted normalization method. To accommodate the proposed generalized asymmetric DTW as the distance measurement, we also propose a clustering algorithm, ADPC, based on DPCA. This clustering algorithm can extract the locally maximal trajectories as cluster representatives.

2. Methodology

In this section, we first give traditional dynamic time warping's definition in a matched scheme formulation and describe traditional DTW's conditions in this formulation. Second, we define the generalized DTW conditions. Third, we evaluate the scale of the problem for generalized DTW. Fourth, we give the definitions and prove the recursive formulations of the various generalized DTW variants, and we propose a unified dynamic programming algorithm to calculate them. Fifth, we give a unified normalization for generalized DTW. Sixth, based on DPCA, we propose an asymmetric distance density peak clustering algorithm to accommodate the generalized DTW.

2.1. Matched Scheme for Traditional DTW

In this part, we recall traditional DTW and propose a concept named the matched scheme. We use the matched scheme to define traditional DTW and its three conditions.

Dynamic time warping (DTW) is used to measure the similarity of two series with different lengths. Suppose there are two series $a = (a_1, a_2, \dots, a_M)$ and $b = (b_1, b_2, \dots, b_N)$, where $a_i$ and $b_j$ are the elements of series a and series b, respectively. The DTW distance between series a and series b is defined as
$$\mathrm{DTW}(a, b) = \min_{P} \sum_{(i_k, j_k) \in P} d(a_{i_k}, b_{j_k}), \qquad (1)$$
where $d(a_i, b_j)$ is the direct local distance between the ith element of series a and the jth element of series b. The local distance can be measured by any distance metric, e.g., the Euclidean distance $d(a_i, b_j) = \lVert a_i - b_j \rVert$. In this formulation, it is obvious that $d(a_i, b_j) = d(b_j, a_i)$.

In (1), $P = ((i_1, j_1), (i_2, j_2), \dots, (i_K, j_K))$ is the matched index pair sequence. For each pair $(i_k, j_k)$, the two index values refer to elements of series a and series b, respectively, and no index pair appears twice, namely $(i_k, j_k) \neq (i_l, j_l)$ for $k \neq l$. The matched index pair sequence denotes the matched pairs of elements between the two series. Therefore, the matched index pair sequence P can be regarded as a matched scheme. DTW finds the matched scheme P that minimizes the cumulative sum in (1) among all valid matched schemes.
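To make the matched scheme concrete, the following minimal sketch evaluates the cumulative cost of a given matched scheme P under (1); the function name, the Euclidean local distance, and the example data are illustrative assumptions, not part of the original method.

import numpy as np

def scheme_cost(a, b, P):
    # Cumulative cost of a matched scheme P under (1): the sum of local distances
    # d(a_i, b_j) over its 1-based index pairs (Euclidean local distance assumed).
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(sum(np.linalg.norm(a[i - 1] - b[j - 1]) for i, j in P))

a = [(0, 0), (1, 0), (2, 1)]
b = [(0, 0), (2, 1)]
P = [(1, 1), (2, 1), (3, 2)]      # one valid DTW-matched scheme
print(scheme_cost(a, b, P))       # DTW(a, b) is the minimum of this cost over all valid P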

For the traditional DTW, the matched scheme must satisfy the three conditions mentioned below.

2.1.1. Monotonic Condition

Let $2 \le k \le K$. Each matched index pair $(i_k, j_k)$ of a matched scheme P must satisfy $i_{k-1} \le i_k$ and $j_{k-1} \le j_k$. The monotonic condition means that both index values in the matched index pairs are monotonically nondecreasing and there is no inverse matching.

2.1.2. Continuity Condition

Let $2 \le k \le K$. Each matched index pair $(i_k, j_k)$ of a matched scheme P must satisfy $i_k - i_{k-1} \le 1$ and $j_k - j_{k-1} \le 1$. The continuity condition limits the growth of both index values from one matched index pair to the next.

2.1.3. Boundary Condition

The matched scheme must satisfy $(i_1, j_1) = (1, 1)$ and $(i_K, j_K) = (M, N)$. The condition $(i_1, j_1) = (1, 1)$ means that the first elements of series a and series b must be matched to each other; we call it the front boundary condition. The condition $(i_K, j_K) = (M, N)$ means that the last elements of series a and series b must be matched to each other; we call it the back boundary condition. The boundary condition of traditional DTW meets both the front and back boundary conditions.

Therefore, (1) can be interpreted as follows: traditional DTW finds the lowest cumulative local distance among all matched schemes that meet the monotonic, continuity, and boundary conditions. We call matched schemes that meet all three conditions DTW-matched schemes. The number of DTW-matched schemes grows exponentially with the length of the series, so it is not realistic to enumerate all matched schemes to find the solution. In practice, DTW is calculated by dynamic programming.

Definition 1. The DTW distance can be calculated recursively by equation (2):
$$D(i, j) = d(a_i, b_j) + \min\{D(i-1, j),\, D(i, j-1),\, D(i-1, j-1)\}, \qquad (2)$$
where $d(a_i, b_j)$ is the same as in (1) and denotes the direct local distance between the ith element of series a and the jth element of series b. $D(i, j)$ denotes the DTW distance between a's subseries $a_{1 \dots i}$ and b's subseries $b_{1 \dots j}$, and $\mathrm{DTW}(a, b) = D(M, N)$. When $i = 0$ or $j = 0$, the index pair does not exist; therefore, the nonexistent terms are set to $+\infty$, except $D(0, 0) = 0$.
For example, there are two-dimensional trajectory series a and b.
As shown in Figure 1, trajectory series a has 6 sampling points and trajectory series b has 14 sampling points. The matched pairs under DTW are linked by dotted lines. In this paper, we call a figure like Figure 1 a matched pairs diagram.
In Figure 2, the y axis is the element index of series a and the x axis is the element index of series b. In this paper, we call a figure like Figure 2 a cumulative distance diagram. Each grid cell (x, y) represents the partial cumulative DTW distance by its gray level. The path from the bottom left to the upper right, linked by the matched index pairs, is the search path. The start and end points of the search path represent the boundary condition of DTW. The step of the search path represents the continuity condition of DTW, i.e., it is not greater than one along both the y and x axes. The monotonic condition of DTW is intuitively shown by the direction of the search path. In fact, computing DTW amounts to finding a search path from the bottom left to the upper right of the cumulative distance diagram.
It must be noted that, although DTW can measure the similarity between two series, DTW is not exactly a distance. DTW is symmetric, namely $\mathrm{DTW}(a, b) = \mathrm{DTW}(b, a)$. However, DTW does not satisfy the identity of indiscernibles, because in most cases we cannot conclude that series a equals series b when $\mathrm{DTW}(a, b) = 0$. DTW also does not satisfy the triangle inequality; there may exist three series a, b, and c with $\mathrm{DTW}(a, c) > \mathrm{DTW}(a, b) + \mathrm{DTW}(b, c)$. Therefore, DTW is not suitable for computing a transitive similarity among many series by accumulation.
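As an illustration of Definition 1, the following is a minimal runnable sketch of the recursion in (2); the Euclidean local distance, the use of numpy, and the example series are illustrative assumptions.

import numpy as np

def traditional_dtw(a, b):
    # Dynamic-programming recursion of equation (2):
    # D(i, j) = d(a_i, b_j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)),
    # with D(0, 0) = 0 and the other border terms set to +inf.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    M, N = len(a), len(b)
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # Euclidean local distance (assumption)
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[M, N])

# Traditional DTW is symmetric: both calls print the same value.
a = [(0, 0), (1, 1), (2, 2)]
b = [(0, 0), (2, 2)]
print(traditional_dtw(a, b), traditional_dtw(b, a))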

2.2. Generalized DTW Conditions

Traditional DTW needs to meet three conditions: the monotonic, continuity, and boundary conditions. In this paper, we keep the monotonic condition and generalize DTW by removing the boundary condition and extending the continuity condition.

First of all, we introduce some new generalized conditions. Let $P = ((i_1, j_1), (i_2, j_2), \dots, (i_K, j_K))$ be a matched scheme.

We divide the boundary condition into four parts:
(1) The left-front-boundary condition, which must meet $i_1 = 1$. It means the first element of series a (left series) must be in the first matched pair, which is the start point of the search path in the cumulative distance diagram.
(2) The right-front-boundary condition, which must meet $j_1 = 1$. It means the first element of series b (right series) must be in the first matched pair, which is the start point of the search path in the cumulative distance diagram.
(3) The left-back-boundary condition, which must meet $i_K = M$. It means the last element of series a (left series) must be in the last matched pair, which is the end point of the search path in the cumulative distance diagram.
(4) The right-back-boundary condition, which must meet $j_K = N$. It means the last element of series b (right series) must be in the last matched pair, which is the end point of the search path in the cumulative distance diagram.

When both the left-front and right-front boundary conditions are met, we say that the left-boundary condition is met. When both the left-back and right-back boundary conditions are met, we say that the right-boundary condition is met. When both the left- and right-boundary conditions are met, the original boundary condition is met. When only one of the left- or right-boundary conditions is met, we say that the half-boundary condition is met.

The relationship between these boundary conditions is shown in Table 1. The boundary condition of traditional DTW requires all four generalized boundary conditions to be met; each of the four new generalized boundary conditions is one part of the original boundary condition. In applications, splitting the boundary condition in this way enables partial matching between series.

We extend the continuity condition and divide it into two parts. Let $(i_{k-1}, j_{k-1})$ and $(i_k, j_k)$ be any two consecutive matched index pairs of a matched scheme P, excluding the first matched pair. For any number $n \ge 1$:
(1) The left-n-step continuity condition, which must meet $i_k - i_{k-1} \le n$ and $j_k - j_{k-1} \le 1$. Compared with the continuity condition of DTW, the left-n-step continuity condition extends the allowed growth of the index value of series a (left series) from 1 to n for each matched index pair. This may leave some elements of series a out of the matched scheme, which means those elements are neglected. In applications, we can use this feature to ignore some outliers.
(2) The right-n-step continuity condition, which must meet $i_k - i_{k-1} \le 1$ and $j_k - j_{k-1} \le n$. The formulation of the right-n-step continuity condition is similar to that of the left-n-step continuity condition; it just neglects some elements of series b instead.

We also give a full-match condition and divide it into two parts. Let $|\{i_k\}|$ and $|\{j_k\}|$ be the numbers of distinct index values of series a and series b appearing in the matched scheme P, respectively.
(1) The left-full-match condition, which must meet $|\{i_k\}| = M$, means each element of series a (left series) must be in the matched scheme.
(2) The right-full-match condition, which must meet $|\{j_k\}| = N$, means each element of series b (right series) must be in the matched scheme.

When both the left and right full-match conditions are met, we call it the dual-full-match condition. When only one of them is met, we call it the half-full-match condition.

The relationships among the various generalized conditions are shown in Table 2, where the n-step is taken to be greater than one. In applications, the half-full-match condition should be met for the DTW value to make sense in most cases. When only one of the left or right half conditions is met, the generalized DTW is not symmetric.
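To illustrate how the generalized conditions constrain a matched scheme, the following sketch checks a 1-based index pair list against the monotonic, n-step continuity, partial boundary, and full-match conditions defined above; the function names and the example scheme are our own illustrative assumptions.

def monotonic(P):
    # Monotonic condition: index values never decrease along the scheme.
    return all(i2 >= i1 and j2 >= j1 for (i1, j1), (i2, j2) in zip(P, P[1:]))

def n_step_continuity(P, left_n=1, right_n=1):
    # Traditional continuity is left_n = right_n = 1; larger values give the
    # left-/right-n-step continuity conditions.
    return all(i2 - i1 <= left_n and j2 - j1 <= right_n
               for (i1, j1), (i2, j2) in zip(P, P[1:]))

def boundary(P, M, N, left_front=True, right_front=True, left_back=True, right_back=True):
    # Each flag switches one of the four generalized boundary conditions on or off.
    (i1, j1), (iK, jK) = P[0], P[-1]
    return ((not left_front or i1 == 1) and (not right_front or j1 == 1) and
            (not left_back or iK == M) and (not right_back or jK == N))

def full_match(P, M, N, left=True, right=True):
    # Left-/right-full-match: every element of the corresponding series appears in P.
    return ((not left or len({i for i, _ in P}) == M) and
            (not right or len({j for _, j in P}) == N))

# Example: series b (length 2) fully matched inside series a (length 4), with a's
# begin and end open and 5-step continuity on a (the OBEA-style pattern used later).
P = [(2, 1), (3, 2)]
print(monotonic(P), n_step_continuity(P, left_n=5),
      boundary(P, 4, 2, left_front=False, left_back=False),
      full_match(P, 4, 2, left=False))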

2.3. The Scale of Problem for GDTW

In this paper, we generalize dynamic time warping to extend its application range by eliminating some conditions. However, along with the elimination of conditions, the number of valid matched schemes increases, which increases the scale of the problem. The number of traditional DTW-matched schemes grows exponentially when M = N. In the extreme case where all conditions are removed, the structure can be viewed as a bipartite graph between series a's M elements and series b's N elements, and the number of matched schemes becomes much larger still.

Consider the set of valid matched schemes of traditional DTW for series a with M elements and series b with N elements. By eliminating different conditions, the generalized DTW variants have different numbers of matched schemes. We explain the meaning of each case and give the corresponding counting expression, respectively.

The count for traditional DTW is symmetric in the two series.

The opened begin boundary condition for series a means the matched scheme may ignore some elements at the front of series a, and the opened end boundary condition for series a means the matched scheme may ignore some elements at the tail of series a. The corresponding conditions for series b are defined similarly. These counts are asymmetric between the two series, and the relationship between them follows from (3) and (4).

The counts obtained by removing only the front boundary conditions or only the back boundary conditions (on both series) are symmetric, respectively. This symmetry can be deduced from (5).

In another case, the left-boundary condition is removed but the right-full-match condition is met, namely every element of series b must be in the matched scheme. In applications, when series b is a subsequence of series a, this form of generalized DTW can be zero. Its counterpart is the same except that it meets the left-full-match condition instead of the right-full-match condition.

From (5) and (7), the two counts are not equal in general, which shows that these forms are asymmetric, whereas the count of the corresponding symmetric form is symmetric.

There are also GDTW variants with the opened begin boundary condition combined with removal of the back-boundary condition.

The last case removes all boundary conditions, and its count can be obtained in the same way.

The number of matched schemes can be viewed as the number of search paths in the cumulative distance diagram. We compare these numbers for the different generalized DTW variants with various sizes of series a and b.

As shown in Figure 3, the growth ratio is stable across various size ratios of the series; the scale of the problem grows only linearly when moving from one generalized DTW variant to another.

2.4. Generalized DTW

In this part, we introduce the recursive formulation of generalized DTW. Then, we give the recursive equation for each type of generalized DTW. Finally, we propose a unified algorithm to calculate all these generalized DTW variants.

Let $a_{u \dots i}$ denote the subsequence of series a from the uth element to the ith element, and let $b_{v \dots j}$ denote the subsequence of series b from the vth element to the jth element. Let $\mathrm{DTW}(a_{u \dots i}, b_{v \dots j})$ denote the traditional DTW between the subsequences $a_{u \dots i}$ and $b_{v \dots j}$. The traditional DTW between series a and b can then be denoted as $\mathrm{DTW}(a_{1 \dots M}, b_{1 \dots N})$, and the recursion (2) can be rewritten accordingly.

Definition 2. The generalized formulation of the recursive calculation of the DTW distance is given in (11). In (11), the subsequence indices must meet the monotonic condition, i.e., $u \le i$ and $v \le j$. When the monotonic condition is not met, the corresponding search path in the cumulative distance diagram does not exist; thus, the nonexistent DTW value is set to $+\infty$.
The Open-End DTW for series a and for series b are denoted by OEA and OEB, respectively, and their definitions are given in (12). In (12), the Open-End DTW for series a simply finds the minimum value along the right edge of the cumulative distance diagram, whereas the Open-End DTW for series b finds the minimum value along the top edge. The computational complexity is the same as that of traditional DTW. OEA and OEB are asymmetric, but each can be obtained from the other by exchanging the roles of the two series.
The Open-Begin DTW for series a and for series b are denoted by OBA and OBB, respectively, and their definitions are given in (13). Equation (13) only gives the relationship between the Open-Begin DTW and the subsequence DTW. Furthermore, from (11) and (13), the Open-Begin DTW for series a can be expressed as in (14). The recursive end margin in (14) is the same as that of traditional DTW and uses the cumulative distance; on the other hand, because the start margin uses the local distance, the amount of computation is less than that of traditional DTW.
The Open-Begin DTW for series b has the same recursive formulation, and therefore the recursive formulation of Open-Begin DTW is given by (15). Equation (15) has the same recursive form as traditional DTW; the only difference lies in the recursion margins. The computational complexity is also the same as that of traditional DTW, and in fact the amount of calculation is a little less, because the start margin is not a cumulative distance. OBA and OBB are asymmetric, but each can be obtained from the other by exchanging the roles of the two series.
There is also another formulation. Let $\bar{a}$ denote series a's reversed sequence $(a_M, a_{M-1}, \dots, a_1)$, and let $\bar{b}$ denote series b's reversed sequence $(b_N, b_{N-1}, \dots, b_1)$. Then, the Open-Begin DTW of a and b equals the corresponding Open-End DTW of $\bar{a}$ and $\bar{b}$.
The Open-Begin-End DTW for series a and for series b are denoted by OBEA and OBEB, respectively, and their definitions are given in (16). In (16), the Open-Begin-End DTW can be expressed as in (17). By the definition in (11), when the monotonic condition is not met, the corresponding term is $+\infty$, which yields (18) and, furthermore, (19); therefore, it has the same recursive formulation as the Open-Begin DTW. The Open-Begin-End DTW for series a and for series b can thus be expressed as in (20). Equation (20) has the same form as (12) for OEA and OEB, except that the corresponding Open-Begin cumulative distances are used instead. Therefore, the computational complexity is the same as that of traditional DTW. OBEA and OBEB are asymmetric, but each can be obtained from the other by exchanging the roles of the two series.
The symmetric Open-End DTW for series a and b is denoted by OE and is defined in (21). Equation (21) shows the symmetric property $\mathrm{OE}(a, b) = \mathrm{OE}(b, a)$. Compared with (12), the symmetric Open-End DTW simply finds the minimum value along both the right edge and the top edge of the cumulative distance diagram. The computational complexity is the same as that of traditional DTW.
The symmetric Open-Begin DTW for series a and b is denoted by OB and is defined in (22). From (15) and (22), we obtain (23) and therefore the recursive formulation (24). Equation (24) is symmetric, i.e., $\mathrm{OB}(a, b) = \mathrm{OB}(b, a)$. Compared with (15), the symmetric Open-Begin DTW has the same form; the difference lies in the recursion margins, namely that the start margin is not a cumulative distance. Therefore, the computational complexity is the same as that of traditional DTW, and the amount of calculation is a little less than that of the asymmetric Open-Begin DTW, which is in turn a little less than that of traditional DTW.
The symmetric Open-Begin-End-half-Full DTW for series a and b is denoted by OBEF and is defined in (25). Because of its definition, there is no recursive formulation of the same form as traditional DTW. Hence, both asymmetric Open-Begin-End DTWs, for series a and for series b, must be computed. Therefore, the computational complexity is twice that of traditional DTW.
The asymmetric DTW with an open begin for series a and an open end for both series, and its counterpart with an open begin for series b and an open end for both series, are defined in (26). From equations (17)–(19), we obtain (27). Intuitively, these two variants search for the minimum value along both the right and top margins of the cumulative distance diagram, with the elements calculated by the corresponding Open-Begin recursions. Their symmetric Open-Begin-End formulation is defined in (28); from (27) we obtain (29), and from (22) we obtain (30). These variants do not meet the half-full-match condition and are hard to interpret in applications; we mention them only for completeness. They have the same form as the previous variants, except with different cumulative distances. Therefore, the computational complexity is the same as that of traditional DTW.
The left-n-step continuity DTW for series a and b, in which the growth step for series a is expanded from 1 to n, has the recursive formulation given in (31). On the other hand, the right-n-step continuity DTW for series a and b, in which the growth step for series b is expanded from 1 to n, has the analogous recursive formulation given in (32). In (31) and (32), when the indices fall outside the valid range, the matched scheme does not exist and the corresponding terms are set to $+\infty$. The two variants are obviously asymmetric; on the other hand, as with the other asymmetric DTWs, each can be obtained from the other by exchanging the roles of the two series. The recursive formulation is the same as that of traditional DTW, except that more elements need to be searched at each step, and the computational complexity remains of the same order as that of traditional DTW.
We summarize all these generalized DTW variants by their features, such as whether they are symmetric, whether they meet the half-full-match condition, whether the start margin uses the local distance, and whether the whole end margin is searched.
In Table 3, the star marker (*) means that the feature cannot be guaranteed because of the symmetry. The n-step continuity DTW variants can be combined with the other boundary-free DTW variants. The symmetric front-back-boundary-free DTW needs to calculate both of its asymmetric counterparts.
Therefore, we propose a generalized asymmetric algorithm to calculate all these generalized DTW variants. The structure of this proposed algorithm is the same as that of traditional DTW's dynamic programming algorithm.
Algorithm 1 is the asymmetric generalized DTW search (AGDTW_Search). It is used to search the generalized DTW for two series A and B. The parameters oba, obb, oea, and oeb are Boolean switches that disable the corresponding boundary conditions. The parameters istep and jstep control the moving step to generalize the continuity condition; the default value of 1 corresponds to traditional DTW's continuity condition. In traditional DTW's dynamic programming implementation, (1) the cumulative distance matrix and the traceback matrix are initialized and (2) a forward pass sets each element of the cumulative distance matrix and the traceback matrix. After these two processes, the cumulative distance matrix is generated; traditional DTW then simply chooses the last element of both matrices as the end position of the matched scheme.
The differences between AGDTW and traditional DTW lie in the following three aspects: (1) in the initialization of the cumulative distance matrix, depending on whether the parameters oba and obb are set, AGDTW initializes the first column or first row of the cumulative distance matrix with local distances instead of cumulative distances; (2) in the forward pass, the back-search range depends on the parameters istep and jstep, which control the continuity condition; (3) after the forward pass is over and the whole cumulative distance matrix has been generated, depending on whether the parameters oea and oeb are set, the end position of the matched scheme is searched for in the last column or last row of the cumulative distance matrix.
In Algorithm 1 AGDTW_Search, the 1st row is the initialization of the cumulative distance matrix and the traceback matrix. The 2nd to 6th rows are the forward pass, which sets all elements of the cumulative distance matrix D and the traceback matrix K. The 7th to 12th rows search for the end position of the matched scheme.
Algorithm 2 is Init_cumulative_and_trace_matrix. It is used to initialize the cumulative distance matrix D and the traceback matrix K. The 1st and 2nd rows allocate the two matrices D and K. The 3rd to 7th rows initialize the two matrices as if the front-boundary condition were removed: the local distance is used directly to set the start edges of the cumulative distance matrix D, and the corresponding elements of the traceback matrix are set to (0, 0), which marks the traceback end. The 8th to 13th rows adjust the first column and first row of D from local distances to cumulative distances, depending on whether the parameters oba and obb are set; in the meantime, the traceback matrix K is adjusted accordingly.
When the execution of Algorithm 1 AGDTW_Search is completed, we obtain only the cumulative DTW value. In practice, we also need to trace back to obtain the index pairs of the matched scheme.
Algorithm 3 is Trace_back_index_pair_list. It is the same as traditional DTW's path traceback process. We trace back through K starting from the position index pair (iend, jend) returned by Algorithm 1 AGDTW_Search. The 5th row gives the traceback end condition, the index pair (0, 0). The 8th row reverses idx_pair_list because the traceback proceeds from the end position to the start position. Finally, we obtain the index pair list of the matched scheme, from which we can also get the local distances of all index pairs.

AGDTW_Search
Input: A, B: two series
   istep, jstep: n-step for series A and B, default 1
   oba, obb, oea, oeb: boundary-free switcher
Return: dtw: cumulative DTW value
   iend, jend: end index pair of matched scheme
   D, K: cumulative distance & traceback index matrix
(1) D, K = Init_cumulative_and_trace_mat (A, B, oba, obb)
(2) for i in 1 … A.length-1
(3)  for j in 1… B.length-1
(4)   imin, jmin = index of min in D[i-istep … i, j-jstep … j] (excluding (i, j) itself)
(5)   D[i, j] = D[imin, jmin] + local_distance(A[i],B[j])
(6)   K[i, j] = (imin, jmin)
(7) iend, jend = A.length-1, B.length-1
(8) if oea
(9)   iend, jend = index of minimal in D's last column, B.length-1
(10) if oeb
(11)   iend, jend = A.length-1, index of minimal in D's last row
(12) dtw = D[iend, jend]
(13) return dtw, iend, jend, D, K
Init_cumulative_and_trace_matrix
Input: series A and B, oba, obb
Return: D: cumulative distance matrix
   K: traceback index matrix
(1) allocate cumulative distance matrix D[A.length, B.length]
(2) allocate traceback matrix K with same size to D
(3) D[0, 0] = local_distance(A[0], B[0])
(4) for i in 1 … A.length-1
(5)   K[i,0], D[i, 0] =  (0,0), local_distance(A[i], B[0])
(6) for j in 1 … B.length-1
(7)   K[0,j], D[0, j] = (0,0), local_distance(A[0], B[j])
(8) if not oba: # use cumulative distance for series A
(9)   for i in 1 … A.length-1
(10)    K[i, 0], D[i, 0] = (i − 1, 0), D[i − 1, 0] +D[i, 0]
(11) if not obb: # use cumulative distance for series B
(12)   for j in 1 … B.length-1
(13)    K[0, j], D[0, j]  =  (0, j − 1), D[0, j − 1] +D[0, j]
(14) return D, K
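For reference, the following is a minimal runnable Python sketch of Algorithms 1–3. The structure follows the pseudocode above, but the Euclidean local distance, the use of numpy, the (-1, -1) traceback sentinel, and the handling of the case where both oea and oeb are set (taking the overall minimum over both open edges) are implementation assumptions rather than details fixed by the paper.

import numpy as np

def local_distance(p, q):
    # Euclidean local distance between two sample points (an assumption for illustration).
    return float(np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)))

def agdtw_search(A, B, istep=1, jstep=1, oba=False, obb=False, oea=False, oeb=False):
    # Generalized DTW search with boundary-free switches (oba/obb/oea/oeb) and
    # n-step continuity (istep/jstep), following Algorithms 1 and 2.
    M, N = len(A), len(B)
    D = np.full((M, N), np.inf)              # cumulative distance matrix
    K = np.full((M, N, 2), -1, dtype=int)    # traceback matrix; (-1, -1) marks the traceback end

    # Initialization (Algorithm 2): start edges hold local distances when the
    # corresponding front boundary is open, cumulative distances otherwise.
    D[0, 0] = local_distance(A[0], B[0])
    for i in range(1, M):
        D[i, 0] = local_distance(A[i], B[0])
    for j in range(1, N):
        D[0, j] = local_distance(A[0], B[j])
    if not oba:
        for i in range(1, M):
            D[i, 0] += D[i - 1, 0]
            K[i, 0] = (i - 1, 0)
    if not obb:
        for j in range(1, N):
            D[0, j] += D[0, j - 1]
            K[0, j] = (0, j - 1)

    # Forward pass (Algorithm 1, rows 2-6): each cell searches a window of
    # predecessors of size (istep+1) x (jstep+1), excluding the cell itself.
    for i in range(1, M):
        for j in range(1, N):
            best, best_prev = np.inf, (-1, -1)
            for pi in range(max(0, i - istep), i + 1):
                for pj in range(max(0, j - jstep), j + 1):
                    if (pi, pj) != (i, j) and D[pi, pj] < best:
                        best, best_prev = D[pi, pj], (pi, pj)
            D[i, j] = best + local_distance(A[i], B[j])
            K[i, j] = best_prev

    # End position (Algorithm 1, rows 7-11): search the last column / last row when
    # the corresponding back boundary is open; take the overall minimum if both are.
    candidates = [(M - 1, N - 1)]
    if oea:
        candidates += [(i, N - 1) for i in range(M)]
    if oeb:
        candidates += [(M - 1, j) for j in range(N)]
    iend, jend = min(candidates, key=lambda ij: D[ij])
    return float(D[iend, jend]), (iend, jend), D, K

def trace_back_index_pairs(iend, jend, K):
    # Recover the matched scheme from the traceback matrix (Algorithm 3);
    # index pairs here are 0-based.
    pairs, (i, j) = [], (iend, jend)
    while i >= 0:
        pairs.append((int(i), int(j)))
        i, j = K[i, j]
    return pairs[::-1]

if __name__ == "__main__":
    A = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2)]   # toy "long" trajectory
    B = [(1, 0), (2, 1)]                           # toy "short" trajectory contained in A
    # OBEA with 5-step continuity on A: B must be fully matched inside A.
    dtw, end, D, K = agdtw_search(A, B, istep=5, oba=True, oea=True)
    print(dtw, trace_back_index_pairs(*end, K))

With istep = jstep = 1 and all four boundary switches off, this sketch reduces to the traditional DTW recursion of Definition 1.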
2.5. Normalization of Cumulative Distance

The generalized DTW search process finds a matched scheme P that minimizes the cumulative DTW value under the specified conditions. The cumulative DTW value is the sum of all local distances of this matched scheme P. Let the matched scheme be $P = ((i_1, j_1), (i_2, j_2), \dots, (i_K, j_K))$. We define $l_k = d(a_{i_k}, b_{j_k})$ as the local distance of P's kth index pair $(i_k, j_k)$, i.e., the local distance between the $i_k$th element of series A and the $j_k$th element of series B. Therefore, we define the local distance sequence of the matched scheme P as $L = (l_1, l_2, \dots, l_K)$.

The cumulative DTW value can be expressed as the sum of this sequence, $\sum_{k=1}^{K} l_k$.

In applications, it is hard to use the cumulative DTW value directly because of the varying sampling frequencies of different trajectories. A larger number of samples leads to more matched index pairs and a greater cumulative distance, even if each local distance is small.

Therefore, the cumulative distance needs to be normalized to solve this problem. The basic normalization is the average, expressed as $N_{\text{average}} = \frac{1}{C} \sum_{k=1}^{K} l_k$, where C is the normalization parameter.

The normalization parameter C of the average-normalized distance can take different values, such as series A's length M, series B's length N, the maximum possible length of the matched scheme M + N − 1, the actual length K of the matched scheme, or max(M, N). Each choice of C has only a minor effect on the final normalized result. In general, it is reasonable to set C to the length K of the matched scheme.

Overall, average normalization is biased toward lower results. Assume a matched scheme with K pairs in which only one local distance is nonzero and very large. When K is big enough, the average-normalized value is close to zero, which suggests the two series are very similar; however, there is in fact an extremely different element between these two series, and the average normalization ignores this difference. Therefore, average normalization tends to produce smaller results and is not good at reflecting an extremely different element between two series.

At the other extreme, there is max normalization, expressed as $N_{\max} = \max_{1 \le k \le K} l_k$.

Max normalization chooses the maximum local distance among all matched pairs. It is sensitive to the position of the extreme local difference between the two series. Combined with the DTW search process, it first finds the matched scheme with the minimal cumulative distance among all possible matched schemes and then takes the maximal local distance among all index pairs of that scheme.

In contrast to average normalization, max normalization is sensitive to a single extreme position. Therefore, we propose weighted normalization to balance the average and max normalizations. The weighted normalization is expressed as $N_w = \sum_{k=1}^{K} w_k l_k$, where $w_k \ge 0$ and $\sum_{k=1}^{K} w_k = 1$. For weighted normalization, the greater the local distance, the larger its weight. Therefore, weighted normalization can reflect the larger differences among all matched pairs without being overly sensitive. Weighted normalization with norm-1 weights takes $w_k = l_k / \sum_{m=1}^{K} l_m$.

For completeness, we also propose norm-p weighted normalization, in which the weights are proportional to the pth power of the local distances.

From the definition, the norm-1 weighted normalization lies between the average normalization (with C = K) and the max normalization. In this paper, we use norm-1 weighted normalization.
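The following sketch gathers the normalizations above, applied to the local-distance sequence L of a matched scheme; the exponent convention used for the norm-p weighted form (weights proportional to the pth power of the local distances) is our reading and should be treated as an assumption.

import numpy as np

def average_norm(L, C=None):
    # Average normalization; C defaults to the matched-scheme length K.
    L = np.asarray(L, dtype=float)
    return float(L.sum() / (len(L) if C is None else C))

def max_norm(L):
    # Max normalization: the largest local distance among the matched pairs.
    return float(np.max(L))

def weighted_norm(L, p=1):
    # Weighted normalization with weights proportional to l_k**p; p = 1 is the
    # norm-1 form used in the paper, and larger p approaches max_norm.
    L = np.asarray(L, dtype=float)
    if L.sum() == 0.0:
        return 0.0
    w = L**p / np.sum(L**p)
    return float(np.sum(w * L))

# Ninety-nine small local distances and one large one: the average hides the
# outlier, the max sees only it, and the weighted norm lies in between.
L = [1.0] * 99 + [10.0]
print(average_norm(L), max_norm(L), weighted_norm(L, p=1))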

2.6. Asymmetric Distance DPCA Clustering

In this paper, the similarity measurement for clustering is the asymmetric generalized DTW. We apply the open-begin-end boundary conditions and an n-step continuity extension for series A, and we use norm-1 weighted normalization for the cumulative distances. In the matched scheme, series B must be fully matched. This measurement matches the pattern in which series A contains series B. Therefore, a long trajectory that contains many short trajectories has a greater density.

We adjust DPCA to find these longest trajectories using the asymmetric generalized DTW and name the result asymmetric density peak clustering (ADPC). As in the original DPCA, most of ADPC's work is to calculate, for each sample, the local density and the minimal distance to samples with greater density. The main difference lies in how the minimal distance is calculated.

Let D be the distance matrix among all trajectories, where Dij is the norm-1 normalized asymmetric generalized DTW from the ith trajectory to the jth trajectory. As we use asymmetric DTW, the distance matrix D is asymmetric, and Dij is in general not equal to Dji. When Dij is small, it means that the ith trajectory contains the jth trajectory.

The calculation of the local density of each trajectory is the same as in DPCA, with a cutoff distance hyperparameter and a Gaussian kernel bandwidth. The calculation of the minimal distance to a sample of greater density is slightly different from DPCA. To calculate the ith trajectory's minimal distance, we search over all Dji, namely the ith column of matrix D. This minimal distance is the minimal distance from its supertrajectories to itself, where a supertrajectory is a trajectory whose density is greater than its own.

Algorithm 4 is ADPC_make_decision_graph. It calculates the local density list rho and the minimal distance list delta. After rho and delta are calculated, as in DPCA, samples with both large rho and large delta are taken as cluster centers.

Trace_back_index_pair_list
Input: iend, jend: end index pair of matched scheme
   K: traceback index matrix
Return: idx_pair_list: index pair list of matched scheme
(1)  idx_pair_list = []
(2)  i, j = iend, jend
(3)  while True:
(4)   idx_pair_list.append([i, j])
(5)   if i == 0 and j == 0:
(6)    break
(7)   i, j = K[i, j]
(8)  idx_pair_list.reverse()
(9)  return idx_pair_list

As we use an asymmetric distance and DTW does not satisfy the triangle inequality, the process of allocating samples to clusters is a single pass and needs to consider both of the two asymmetric distances.

Algorithm 5 is ADPC_allocate_sample. It needs two thresholds to decide whether a sample belongs to a cluster: the distance from a cluster center to each of its members must be less than the threshold thre_c2m, and the distance from each member to its cluster center must be less than the threshold thre_m2c. In the 13th row, when a sample meets the threshold conditions for no center, it is marked as an outlier. In the 15th row, when allocating a sample to a cluster, we consider both asymmetric distances by adding them and then assign the sample to the cluster for which this distance sum is smallest.

ADPC_make_decision_graph
Input: D: normalized asymmetric generalized DTW
   r: cutoff-distance
Return: rho: local density list
   delta: minimal distance list.
(1)  rho, delta = list[len(D)], list[len(D)]
(2)  for i = 0…len(D)-1
(3)   neighbors = D[i][D[i] <= r]
(4)   rho[i] = gaussian_kernel(neighbors)
(5)  sortedidx = argsort(rho)
(6)  for i = 0…len(D)-1
(7)   ri = sortedidx[i]
(8)   if i == len(D)-1:
(9)    delta[ri] = D[:, ri].max()
(10)   else:
(11)    idx = sortedidx[i + 1:end]
(12)    delta[ri] = D[:, ri][idx].min()
(13)  return rho, delta
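A runnable numpy version of Algorithm 4 is sketched below; the concrete Gaussian kernel (a sum of exp(-(d/bandwidth)^2) over the neighbors within the cutoff distance) is an assumption about the gaussian_kernel call in the pseudocode, and the toy distance matrix is purely illustrative.

import numpy as np

def adpc_make_decision_graph(D, r, bandwidth=None):
    # D: normalized asymmetric GDTW matrix, D[i, j] = distance from trajectory i to j.
    # r: cutoff distance; bandwidth: Gaussian kernel bandwidth (defaults to r).
    D = np.asarray(D, dtype=float)
    n = len(D)
    bandwidth = r if bandwidth is None else bandwidth
    rho = np.empty(n)
    for i in range(n):
        neighbors = D[i][D[i] <= r]                          # distances from i within the cutoff
        rho[i] = np.sum(np.exp(-(neighbors / bandwidth) ** 2))
    delta = np.empty(n)
    order = np.argsort(rho)                                  # ascending local density
    for k, ri in enumerate(order):
        if k == n - 1:
            delta[ri] = D[:, ri].max()                       # densest sample: take the maximum
        else:
            higher = order[k + 1:]                           # samples with greater density
            delta[ri] = D[higher, ri].min()                  # min distance from supertrajectories to ri
    return rho, delta

# Toy 3 x 3 asymmetric distance matrix (meters).
D = [[0.0, 5.0, 40.0],
     [2.0, 0.0, 35.0],
     [50.0, 45.0, 0.0]]
rho, delta = adpc_make_decision_graph(D, r=15.0)
print(rho, delta)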

3. Experiment Results

In this section, we apply the proposed generalized DTW algorithm to a real GPS trajectory dataset to illustrate its ability to recognize the inclusive relationship between trajectories. We compare different DTW normalization methods and present the clustering results of ADPC with the asymmetric generalized DTW.

The experiment dataset is an intersection area GPS trajectory dataset with 18,813 sampling points and 1654 trajectories, as shown in Figure 4.

The trajectory data consist of samples located by longitude and latitude. When calculating the direct local distance between two samples, we use the arc distance on the Earth. Let p1 = (lon1, lat1) and p2 = (lon2, lat2), with lon1, lat1, lon2, and lat2 measured in radians. The local distance is the arc distance between p1 and p2 on a sphere whose radius R is the average Earth radius between lat1 and lat2, computed from RE and RP, where RE is the Earth radius at the equator, approximately 6,378,137 meters, and RP is the Earth radius at the poles, approximately 6,356,725 meters. We use this arc distance to reduce the variation of the local distance with latitude, and because it is expressed in meters, it allows thresholds to be set intuitively later.
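Since the arc-distance and average-radius equations are only referenced above, the following sketch shows one plausible realization: a haversine arc distance on a sphere whose radius is linearly interpolated between RE and RP according to latitude. Both the haversine form and the interpolation rule are assumptions standing in for the omitted equations.

import math

RE = 6378137.0   # Earth radius at the equator (m)
RP = 6356725.0   # Earth radius at the poles (m)

def mean_radius(lat1, lat2):
    # Radius interpolated linearly between RE and RP by the mean absolute latitude
    # (our assumption for the omitted average-radius equation).
    lat = (abs(lat1) + abs(lat2)) / 2.0
    return RE - (RE - RP) * (2.0 * lat / math.pi)

def arc_distance(p1, p2):
    # Arc distance in meters between p1 = (lon1, lat1) and p2 = (lon2, lat2), both in
    # radians, using the haversine formula on a sphere of radius mean_radius(lat1, lat2).
    (lon1, lat1), (lon2, lat2) = p1, p2
    R = mean_radius(lat1, lat2)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2.0) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2.0) ** 2
    return 2.0 * R * math.asin(math.sqrt(h))

# Two GPS samples about 14 m apart (radians).
p1 = (math.radians(116.30000), math.radians(39.90000))
p2 = (math.radians(116.30000), math.radians(39.90013))
print(arc_distance(p1, p2))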

3.1. GDTW Comparison

For the various generalized DTW variants, we compare the results on the same trajectory pair. For the boundary condition extension, we compare OBA, OBB, OEA, OEB, OBEA, OBEB, OE, OB, and OBEF. We also compare different step gaps for the continuity condition extension.

The traditional DTW’s matching result is shown in Figure 5. In traditional DTW, the start and end points for both trajectory A and B are matched.

Next, we demonstrate the asymmetric generalized DTW variants with extended boundary conditions: Open Begin for series A (OBA), Open Begin for series B (OBB), Open End for series A (OEA), Open End for series B (OEB), Open Begin and End for series A (OBEA), and Open Begin and End for series B (OBEB).

In Figure 6, DTW results with different generalized boundary conditions are demonstrated. The start or end points of trajectories A and B are not necessarily matched. Therefore, the inclusive relationship between trajectories can be recognized.

For completeness, we also demonstrate the symmetric forms of the boundary condition generalization: Open End (OE), Open Begin (OB), and Open Begin End with full match on one series (OBEF).

In Figure 7, DTW results of the symmetric forms with different generalized boundary conditions are demonstrated. The start or end points of trajectories A and B are not necessarily matched.

Next, we compared DTW results for extending continuity conditions with different step gaps for series B. The compared step gaps are 3-step, 5-step, and 7-step.

In Figure 8, we compare the results of different steps. The bigger the step gap, the lower the cumulative distance. When the step grows beyond a critical value, the cumulative distance is not reduced further because of the single full-match condition. Finally, we use OBEA5 as the asymmetric similarity measurement, i.e., open begin and end boundaries and 5-step continuity for trajectory A, which treats trajectory A as the supertrajectory.

3.2. Normalization Comparison

We compare the effect of various normalization methods under different sampling amounts for the same trajectories using traditional DTW.

In Figure 9, trajectory A is resampled with different numbers of points. The larger the sampling amount, the bigger the cumulative distance. The comparison of the different normalization results is shown in Table 4.

ADPC_allocate_sample
Input: D: normalized asymmetric generalized DTW
   centeridxs: candidate cluster center index list
   thre_c2m: threshold from center to member
   thre_m2c: threshold from member to center
Return: centeridx_to_memberidxs: center index to member index list
   outlieridxs: outlier index list
(1)  cm_mat = D[centeridxs,:]
(2)  mc_mat = D[:, centeridxs]
(3)  outlieridxs = []
(4)  centeridx_to_memberidxs = dict((i, []) for i in centeridxs)
(5)  for i = 0…len(D)-1:
(6)   if i in centeridx_to_memberidxs:
(7)    centeridx_to_memberidxs[i].append(i)
(8)    continue
(9)   br_c2m = cm_mat[:, i] < thre_c2m
(10)   br_m2c = mc_mat[i] < thre_m2c
(11)   ii = true_idxs(br_c2m & br_m2c)
(12)   if len(ii) == 0:
(13)    outlieridxs.append(i)
(14)    continue
(15)   cj = argmin(cm_mat[j, i] + mc_mat[i, j] for j in ii)
(16)   ci = centeridxs[ ii[cj] ]
(17)   centeridx_to_memberidxs[ci].append(i)
(18)  return centeridx_to_memberidxs, outlieridxs

M is the sampling amount of trajectory A. K is the length of the matched scheme, which represents the number of matched pairs. Sum is the cumulative distance over all matched pairs; therefore, it increases greatly with the sampling amount. Nmax is the maximum local distance among the matched pairs; it reflects the largest difference between the two trajectories but loses most of the information and is seriously affected by noise. Naverage is calculated by dividing Sum by K; it is a simple and good normalization, but it flattens the local differences. Nmedian flattens them even more and is a little more complex to calculate; it is listed only for comparison. Nnorm-1 is as good as Naverage and can additionally reflect the effect of larger local distances. Therefore, we use norm-1 normalization in clustering.

3.3. ADPC Clustering

Before ADPC clustering, the similarity matrix for the 1654 trajectories is calculated by GDTW (open begin and end boundaries and 5-step continuity for trajectory A, with Nnorm-1 normalization). However, the calculation efficiency is not satisfactory: in our Python implementation, it takes 2 hours to calculate the similarity matrix.

In ADPC clustering, the hyperparameter cutoff distance is set to 15 meters, and the Gaussian kernel bandwidth is also set to 15.
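To show how the pieces fit together, the following hedged sketch wires the helper functions sketched earlier (local_distance, agdtw_search, trace_back_index_pairs, weighted_norm, and adpc_make_decision_graph) into the clustering pipeline with the hyperparameters reported here; those function names come from our earlier sketches rather than from the paper's implementation, and the trajectories input and threshold values are placeholders.

import numpy as np

def gdtw_distance(traj_a, traj_b):
    # OBEA with 5-step continuity on trajectory A and norm-1 weighted normalization:
    # small values mean traj_a contains traj_b.
    dtw, end, D, K = agdtw_search(traj_a, traj_b, istep=5, oba=True, oea=True)
    pairs = trace_back_index_pairs(*end, K)
    local = [local_distance(traj_a[i], traj_b[j]) for i, j in pairs]
    return weighted_norm(local, p=1)

def adpc_cluster_centers(trajectories, r=15.0, rho_thr=10.0, delta_thr=30.0):
    n = len(trajectories)
    # Asymmetric similarity matrix: D[i, j] = normalized GDTW from trajectory i to j.
    D = np.array([[gdtw_distance(trajectories[i], trajectories[j]) for j in range(n)]
                  for i in range(n)])
    rho, delta = adpc_make_decision_graph(D, r)
    # Candidate centers: large local density and large minimal distance (decision graph).
    centers = [i for i in range(n) if rho[i] > rho_thr and delta[i] > delta_thr]
    return D, rho, delta, centers   # D and centers then feed ADPC_allocate_sample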

In Figure 10(a), there are 17 cluster centers in the upper right area delimited by two dotted lines. The horizontal dotted line is the threshold for the minimal distance delta, which is 30. The vertical dotted line is the threshold for the local density rho, which is 10. The corresponding trajectories of the cluster centers are shown in Figure 10(b).

ADPC is designed to find the maximal inclusive trajectory as the cluster center; however, in reality, some long singular trajectories would also be selected as centers. Therefore, the sample assignment procedure of ADPC works to filter out those singular cluster centers.

The clustering result after sample assignment is shown in Figure 11. A subtitle such as "p1 247 26" means cluster center 1, where 247 is the serial number of the trajectory sample and 26 is the number of trajectories belonging to this cluster. Long singular trajectories, such as p2 399 and p4 578, are regarded as singular and are not assigned any other trajectories.

3.4. Discussion

We use the arc distance on the Earth to measure the local distance between two GPS sampling points. It reflects the true distance better than the Euclidean distance on raw longitude and latitude, although the improvement is minor.

We compared the various GDTW variants proposed in this paper on the same pair of trajectories. This verified the validity of GDTW and illustrated that the asymmetric boundary-free DTW with 5-step continuity is good at recognizing the inclusive relationship between two trajectories.

By comparing various normalization methods, we showed that norm-1 weighted normalization preserves more information about large differences while still reflecting the overall similarity as the K-average normalization does.

In ADPC clustering, we use the asymmetric DTW to find the maximal inclusive trajectory as the cluster center and take it as the representative of the cluster. The results show that our sample allocation algorithm distinguishes noise well.

4. Conclusions

The conclusions can be summarized as follows:
(1) A unified generalized DTW algorithm (GDTW) is proposed to calculate the similarity between two trajectories; it can recognize inclusion relationships.
(2) To overcome the influence of unbalanced sampling quantities of the trajectories, a weighted normalization method is proposed.
(3) We use the generalized asymmetric DTW as the similarity measurement and propose a clustering algorithm (ADPC). This clustering algorithm can extract the locally maximal trajectories as cluster representatives to reduce the number of clusters caused by short trajectories.

The limitation of this method is its calculation efficiency. In our implementation of GDTW, it takes 2 hours to calculate the similarity matrix for 1654 trajectories. Improving this limitation will be addressed in future work.

Data Availability

The data used to support the findings of the study can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by Beijing Social Science Foundation 21XCC013.