Abstract
Vessel big data play a significant role in understanding vessel behaviors and thus in facilitating the prosperity of waterway transportation. However, research on vessel trajectory recognition across a broad range of narrow channels is still lacking, especially research using Vessel Identification and Trajectory Sensor (VITS) data. The major objective of this paper is to conduct vessel trajectory analysis based on the novel VITS data and examine its availability in inland waterway vessel transportation. A secondary aim is to develop a more comprehensive framework for extracting vessel trajectories in multiple narrow waterways. This paper used vessel trajectory information captured by VITS in multiple narrow channels of the Yangtze River. Four compression algorithms were applied, the performance of three clustering approaches was evaluated, and a speed distribution analysis was carried out. The results indicated that the sliding window (SW) algorithm outperforms the other three compression algorithms. Relative to DBSCAN, K-means and hierarchical clustering analysis (HCA) tend to produce more balanced classifications. This paper is the first to use VITS data in vessel trajectory feature extraction and can potentially provide useful insight for vessel trajectory extraction in multiple narrow channels.
1. Introduction
Waterway transportation provides the most energy-efficient way of transporting large quantities of goods over long distances [1] and constitutes the foundation of worldwide trade, accounting for about 90% of global trade by volume [2]. It is estimated that the volume of worldwide waterway trade will grow by 3.2% from 2019 to 2022 [3]. For hinterland regions with developed water channels and networks, inland shipping plays a significant role in the comprehensive transportation of waterway containers, and it can effectively promote regional social development. Generally speaking, well-developed infrastructure and navigable canals make it possible for harbors and dry ports to reach deep into the hinterland, which in turn makes economies of scale easier to achieve [4]. In Jiangsu Province of China, which has superior water transport conditions, authorities are striving to develop inland shipping. The coverage rate of the inland waterway network is about 85% within the province, and the inland waterway mileage makes up approximately 20% of the total mileage in China. In 2018, the volume and rotation volume of waterway freight transport accounted for 35.5% and 63.2% of the province’s total, respectively [5]. On the other hand, rapid growth in the economy and in trade leads to increasing demand for more ships and higher traveling speeds, which raises concerns about waterway transportation safety and security. Besides, it is necessary to understand current vessel transportation patterns in order to improve operating efficiency and provide empirical support for various applications.
To achieve the aforementioned goals, Jiangsu Maritime Safety Administration has developed a ship identification system called the Vessel Identification and Trajectory Sensor (VITS), which utilizes both BeiDou and GPS to obtain satellite positioning data and applies 3G to send the data to the base station service receiver. VITS mainly stores and sends vessel traveling records throughout the whole traveling process in the Yangtze River. However, unlike the automatic identification system (AIS), which has been widely employed in vessel transportation analysis, VITS data have not yet been applied in quantitative research, and their availability for vessel transportation analysis deserves further exploration. Additionally, to the authors’ best knowledge, although numerous studies have analyzed vessel trajectories in order to extract trajectory features and provide knowledge that enhances the safety and efficiency of vessel transportation, they have tended to focus on marine transportation or on a relatively small range of inland narrow waterways. They have not expanded the research scope to a larger range of inland narrow waterways from an overall perspective, which is required for facilitating the management of inland waterway transportation.
Hence, to address the aforementioned gap, the major objectives of this paper are (1) to conduct vessel trajectory analysis based on the novel VITS data and examine its availability in inland waterway vessel transportation in the Yangtze River and (2) to develop a more comprehensive framework to extract the vessel trajectory of multiple narrow waterways belonging to Yangtze River from an overall perspective.
To the authors’ best knowledge, the current study is original since it is the first to apply the VITS data at a quantitative analysis level. The research framework is exhibited in Figure 1. To be precise, the innovation can be two-fold: (1) the current study is the first to utilize VITS data to conduct vessel trajectory feature extraction; (2) this paper made an attempt to explore the trajectory distribution in a larger range of inland narrow channels, while conventional studies mainly focused on the trajectory analysis at the ocean level or in a relatively small range of narrow channels. By determining the optimal threshold, the paper utilized four compression algorithms to conduct the preliminary trajectory simplification based on three indicators including compression time, redundant information rejection rate, and compression error. Clustering analysis was further undertaken to extract the vessel trajectory pattern considering location and speed information. Finally, the speed distribution of different clusters was investigated to reflect the effect of different clustering methods, which can also be applied for providing the reference for inland area risk determination. The remainder of this paper is structured as follows. In Section 2, related work is reviewed. In Section 3, methodology is introduced. In Section 4, the results of the experiment are provided. Section 4 also discusses the results of the trajectory extraction, and the paper is concluded in Section 5.

2. Related Work
Nowadays, data science and technologies, represented by big data mining and artificial intelligence such as machine learning and deep learning, have attracted extensive attention for profoundly changing the traditional way of scientific exploration [6]. Under such circumstances, the widespread use of AIS has made ship trajectory data increasingly available, providing more opportunities than ever before to model vessel behavior, predict vessel trajectories, and analyze the risk of vessel collisions [7]. To be concrete, AIS is an automatic tracking and self-reporting system for identifying and locating vessels by electronically exchanging data with nearby ships, AIS base stations, and satellites [8]. Including both static attribute information of vessels and dynamic motion features, AIS data normally reflect the consecutive maneuvering behaviors of vessels and are updated automatically and continuously. Note that although AIS data have been widely utilized in waterway transportation, they are not free of limitations: because the data are supplied by a third party, specific AIS data are less accessible to most researchers [9].
Currently, vessel behavior pattern recognition is one of the most important objectives existing in the research concerning AIS data. The behavior patterns of vessels demonstrate not only the kinematic characteristics but also the spatiotemporal features that vary among different individuals due to the different sailing situations over the area [10]. Consequently, it is essential to understand the vessel behaviors in an area for efficient traffic management, predictive port design, and maritime supervision.
Many researchers have employed AIS data for vessel traffic behavior recognition in recent years. Mascaro and Korb [11] and Silveira et al. [12] made a classification based on ships’ type without considering the size of them, contributing to the lack of a detailed description of behavior distinctions in the scope of the same ship type. Shu et al. [13] and Xiao et al. [14] classified vessels according to gross tonnage (GT), based on frequencies of ships with a different GT in the dataset, without a fairly clear recognition of the actual behavior patterns. Besides, Shelmerdine [15] examined the spatial and temporal variations in vessel trajectories through drawing ship trajectories on a map.
With regard to vessel trajectory analysis, due to the high sampling frequency of AIS messages, usually at an interval of a few seconds to a few minutes, a large number of ship trajectory points are generated during a voyage, which greatly increases the workload of data storage and analysis. Therefore, it is necessary to adopt effective data simplification methods that can compress massive AIS trajectory data without sacrificing the quality of the original trajectory, in order to meet the demands of maritime motion pattern research and applications such as path planning and anomaly detection [8].
The global compression algorithm, represented by the Douglas–Peucker (DP) algorithm [16], has been employed to compress trajectory data [17] and performs effectively in ship trajectory compression [18]; owing to the advantage of translation and rotation invariance, many studies have evaluated and proved that the DP algorithm performed accurately in line simplification [19, 20]. However, the determination of the DP algorithm parameters might vary among different cases. Besides, some scholars believed that the DP algorithm has high time complexity and it is time-consuming to run [21].
Although the global compression algorithm is suitable for static or historical trajectory data compression, it might be incapable of dealing with real-time online trajectory compression. Thus, the compression method based on local features is more appropriate [5]. The simplest local compression algorithm only needs to consider the characteristic relationship between the current point and two adjacent nodes, such as the vertical distance method and included angle method. However, due to too few objects compared, this compression algorithm cannot distinguish the space-time bending with small local change and a large cumulative change [22]. Therefore, the window-based compression algorithm can better coordinate the local features and the global features. Typical window-based compression algorithms include slide window (SW) algorithm and open window (OPW) algorithm [23]. Due to different strategies of selecting the ending point, OPW is divided into normal open window algorithm (NOPW) and before opening window algorithm (BOPW).
Machine learning is also a prevailing and promising technique employed in the field of vessel trajectory analysis, and its methods can be classified into four major categories, i.e., classification, regression, dimensionality reduction, and clustering. Among them, some classical classification and regression algorithms such as random forests (RF) and the multilayer perceptron (MLP) tend to be employed to predict the vessel trajectory, destination, and arrival time (e.g., [3, 24]), while dimensionality reduction methods including principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are targeted at the visualization of high-dimensional data (e.g., [25]). Recently, deep learning methods have attracted increasing attention in vessel trajectory analysis; e.g., Liang et al. [26] introduced the convolutional auto-encoder (CAE) method to learn low-dimensional representations of informative trajectory images and found that it outperforms traditional trajectory similarity computation methods in terms of efficiency and effectiveness.
Clustering analysis, another powerful feature extraction instrument and statistical method in the field of data mining, can be applied to identify the similar ship behavior patterns [27]. As a typical unsupervised machine learning method, the clustering algorithm can automatically classify data through similarity measurements without inputting ground labels. Therefore, a number of researchers have used clustering analysis to handle ship trajectory data [28], which can be generally classified into three categories [29]:
(1) Distance-based clustering methods: Edit distance (ED), Hausdorff distance (HD), longest common subsequence (LCSS), and dynamic time warping (DTW) are most widely used. Recently, K-means algorithm has also been adopted in vessel trajectory clustering considering its high calculating efficiency, generality, and transferability, which is also commonly identified as the advantages of the distance-based clustering methods. For example, Zhen et al. [30] put forward a k-medoid algorithm with a proposed improved HD to investigate the spatial and directional characteristics when clustering a ship trajectory. Li et al. [31] proposed a dimensionality reduction-based clustering method, which used dynamic time warping (DTW) to measure the distance between different trajectories, and the k-medoid algorithm was utilized to recognize vessel traffic behavior patterns. However, it should be noted that the distance-based clustering algorithms have the potential to lose the local information of the trajectory and might be incapable of dealing with imbalanced trajectory distribution.
(2) Density-based clustering algorithms: It can be found that the density-based spatial clustering of applications with noise (DBSCAN) algorithm and its variants are of great representativeness owing to its superiority in clustering trajectory with arbitrary shapes. For example, an improved DBSCAN algorithm was applied by Pallotta et al. [32] to extract vessel movement patterns based on turning points of AIS trajectories and identify abnormal trajectories that were far from the fixed patterns. Mazzarella et al. [33] presented a way to discover fishing areas based on the historical AIS data and DBSCAN algorithm. The TRACLUS algorithm [34] was used by Xiao et al. [14] for vessel trajectory clustering. In their work, the minimum description length was applied for dividing the trajectory and the DBSCAN algorithm was conducted to group the sub-trajectories. Zhao et al. [10] utilized maximum likelihood estimation to determine the fixed minimum core objective quantity (MinPts) and the radius value (eps) adaptively in the DBSCAN algorithm. Sub-trajectory clustering based on the multipath spectral clustering algorithm was proposed by [25] for behavior pattern recognition, which tremendously improved the machine learning efficiency. Varlamis et al. [35] presented a methodology for extracting information about the navigation network for an area based on the DBSCAN method. Liu et al. [36, 37] applied DBSCAN to improve the trajectory quality. Although density-based clustering methods have gained great popularity, it should not be denied that there are still some limitations; e.g., the density-based clustering algorithms might perform poorly especially when the vessel trajectories are of inhomogeneous distribution.
(3) Statistical clustering algorithms: Kernel density estimation (KDE) and Gaussian mixture model (GMM) methods are frequently employed. The former can be characterized as possessing flexible kernel function and window width to analyze the vessel manoeuver from different dimensions, while the time and space efficiency still make it difficult to apply it to large-scale vessel trajectory clustering. With respect to GMM, it is suitable for analyzing the characteristics of vessel traffic flow in waterways with large traffic flow.
The clustering result can not only reveal the different ship behavior patterns and spatiotemporal statistical characteristics between some clusters, but also recognize the integral ship behavior patterns over the whole research area. As a result, the port authority can benefit from the identified clusters, which can provide support for port operation and shipping safety supervision. Thus, it is necessary to carry out the related research.
On the other hand, most research on vessel trajectory feature extraction focuses on marine transportation and pays little attention to inland narrow waterway transportation. Some scholars have conducted relevant studies; e.g., Wu and Zaloom [38] proposed an AIS data-based method of studying the travel behavior of vessels passing through hotspots in a narrow waterway and found that more than 10% of total vessel conflicts occurred within the three hotspots they examined. Wu et al. [7] also estimated the travel time of ships in a narrow channel based on AIS data. However, these studies concentrate on a local analysis of a specific narrow channel, and an overall understanding of trajectory features across multiple narrow channels is still uncommon.
3. Methodology
3.1. Data Description and Preprocessing
The “Vessel Identification and Trajectory Sensor” (VITS) is a ship identification system which was independently developed by Jiangsu Provincial Maritime Safety Administration, inspired by the AIS [39]. It is used to store and send ship traveling records throughout the whole traveling process. The installation scope includes normally operated dangerous chemicals ships, powered passenger ships carrying more than 12 passengers, and ordinary cargo ships with a gross tonnage of more than 100 tons belonging to Jiangsu Province. A total of 1,617,480 points from 617 ships recorded during October 2017 were used in the current paper, with latitude and longitude ranging over (28.32, 35.25) and (112.74, 121.73), respectively. Each record represents a vessel’s status: vessel ID, location (longitude and latitude), speed over ground (SOG), heading, and timestamp (accurate to the second).
The VITS system uses BeiDou and GPS to obtain satellite positioning data and uses 4G to send the data to the base station service receiver for data analysis. Owing to the use of the operator network for data transmission, VITS has basically realized the whole object perception, the whole water coverage, and the whole process tracking. As mentioned by Zhang [40], VITS has four merits compared to original AIS data:
(1) Automatic identification of ship information: One of the highlights of VITS is that the equipment cannot be shut down at will, which avoids the situation in which a ship’s identity cannot be determined because the equipment was maliciously switched off. When the power supply of the VITS equipment fails, it can still rely on its internal lithium battery, which provides up to 100 days of independent continuous power. VITS thus realizes genuinely automatic identification of ship information, makes up for the deficiency of manual identification and the limitation of AIS identification, enables maritime operators to fully grasp the regional ship traffic situation at any time, and improves the work efficiency of traffic control.
(2) Safe and reliable information source: At present, there are some problems with AIS data in the domain of water safety supervision; for example, ships fitted with AIS equipment often enter ship-related information arbitrarily or never update it, switch the equipment on and off at will, and leave faults unrepaired, which leads to inaccurate data sources and untimely data updates. VITS equipment not only uses a BeiDou and GPS dual-mode geographical position sensing module but also has an equipment abnormality alarm system, which greatly improves the target tracking performance.
(3) Expanded network transmission coverage: The transmission of traditional sensing devices mainly depends on the base station and is often affected by it. Where base station coverage is limited, the data cannot be transmitted. VITS does not depend on the base station. The data are sent to the base station service receiver through 3G network, which has wider coverage.
(4) Active alarm function: The VITS device gives an active alarm when the VITS box or antenna box is at fault, the external power supply is disconnected, the internal battery is exhausted, or the device enters sleep mode.
The raw data were exported and processed with Oracle 11g to detect and eliminate some inevitable errors during the collection, transmission, and receiving processes caused by software or hardware defects, as mentioned in previous literature [41, 42]. Consequently, the quality of VITS data was improved through discarding out-of-range latitude and longitude coordinates, abnormal speed, and duplicate data, which is vital for vessel traffic behavior recognition.
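The three cleaning rules described above (out-of-range coordinates, abnormal speed, and duplicate records) can be sketched in a few lines. The study itself performed this step in Oracle 11g, so the following Python version is only an illustration. The `Record` fields mirror the record description in the text, the bounding box reuses the coordinate ranges quoted earlier, and the speed cut-off `MAX_SOG` and the duplicate key (vessel ID plus timestamp) are assumptions that are not given in the paper.

```python
from typing import NamedTuple

class Record(NamedTuple):
    vessel_id: str
    lon: float
    lat: float
    sog: float       # speed over ground (unit assumed to be knots)
    heading: float
    timestamp: int   # seconds

# Bounding box taken from the coordinate ranges quoted in the text.
LAT_RANGE = (28.32, 35.25)
LON_RANGE = (112.74, 121.73)
MAX_SOG = 30.0  # hypothetical cut-off for "abnormal speed"; not from the paper

def clean(records):
    """Drop out-of-range positions, abnormal speeds, and duplicate records."""
    seen = set()
    kept = []
    for r in records:
        if not (LAT_RANGE[0] <= r.lat <= LAT_RANGE[1]):
            continue
        if not (LON_RANGE[0] <= r.lon <= LON_RANGE[1]):
            continue
        if not (0.0 <= r.sog <= MAX_SOG):
            continue
        key = (r.vessel_id, r.timestamp)  # assumed definition of a duplicate
        if key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept
```

In practice the speed threshold would be chosen from the observed speed distribution of the fleet rather than fixed in advance.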
The following part provides a detailed introduction to the methodology of the current study. In summary, this paper first conducted trajectory compression to enhance the value of individual trajectories and the computing efficiency. Second, on the basis of the compressed trajectories, three clustering methods, namely DBSCAN, K-means, and hierarchical clustering analysis (HCA), were applied to extract vessel trajectory patterns. To search for the best hyper-parameters of these clustering methods and identify the better-performing one, two indicators, the Calinski-Harabasz index and the inertia-evaluation index, were used to evaluate performance. Besides, the speed distribution of the clusters produced by each clustering algorithm was analyzed to reflect the speed variance among clusters and to identify anomalous cluster centers with higher speed values.
3.2. Trajectory Compression and Characteristic Point Extraction
The high sampling rate of VITS data can improve the accuracy of the trajectory, but it also produces a massive amount of data, creating overhead in storage, transmission, and processing. Since a ship trajectory is approximated by a list of chronologically ordered coordinate points, it can be regarded as a special kind of line. A previous study showed that the loss in accuracy incurred by compressing a large number of raw coordinate position updates was negligible, as the discarded updates contributed almost no additional knowledge about maritime motion patterns [43]. Thus, a line simplification method that does not sacrifice the quality and characteristics of the original trajectory is required for ship traffic behavior recognition.
To preserve the main geometrical structure of the original trajectory and reduce the redundant trajectory points, it is necessary to extract the characteristic points from the original trajectory. In this research, the DP, NOPW, BOPW, and SW algorithms for trajectory compression were all applied in order to compare their performance.
3.2.1. Douglas–Peucker (DP) Algorithm
In the field of ship trajectory analysis, the Douglas–Peucker (DP) algorithm [16] is typically used for compressing trajectory data, and its essence lies in using line segments to approximate the original trajectory. The method makes use of spatial information of trajectory data to detect ship directional changes at the trajectory level, thus reducing the number of points in a curve that is represented by a series of points [17].
The principle of the DP algorithm is illustrated in Figure 2. The first point and the last point are regarded as the initial anchor point and the initial floating point, respectively, and the straight line connecting them is taken as the baseline. The perpendicular Euclidean distance from each intermediate point to the baseline is calculated in turn by (1) and compared against a predefined threshold:

$$d_i = \frac{\left|(y_b - y_a)\,x_i - (x_b - x_a)\,y_i + x_b y_a - y_b x_a\right|}{\sqrt{(y_b - y_a)^2 + (x_b - x_a)^2}}, \tag{1}$$

where $d_i$ is the distance from point $p_i = (x_i, y_i)$ to the line defined by the two points $p_a = (x_a, y_a)$ and $p_b = (x_b, y_b)$.

As Figure 2 shows, the solid lines represent the original trajectory, and the dashed lines represent the DP-simplified trajectory. An intermediate point whose perpendicular distance to the current baseline exceeds the threshold is retained and splits the trajectory into two parts, each of which is then processed in the same way with its own baseline. In the example, two intermediate points are retained for this reason, while the four remaining intermediate points lie closer to their respective baselines than the preset threshold and are discarded. The original trajectory in the figure is therefore simplified to four points: the two endpoints and the two retained intermediate points.
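The recursive splitting described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes points are planar `(x, y)` tuples (raw longitude and latitude would first need to be projected), and the function names `perp_dist` and `douglas_peucker` are chosen here for illustration.

```python
import math

def perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (x, y), (xa, ya), (xb, yb) = p, a, b
    dx, dy = xb - xa, yb - ya
    norm = math.hypot(dx, dy)
    if norm == 0.0:                      # a and b coincide
        return math.hypot(x - xa, y - ya)
    return abs(dy * x - dx * y + xb * ya - yb * xa) / norm

def douglas_peucker(points, eps):
    """Return the characteristic points kept by DP with threshold eps."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):  # farthest intermediate point
        d = perp_dist(points[i], a, b)
        if d > dmax:
            idx, dmax = i, d
    if dmax > eps:
        # Keep the farthest point and simplify both halves recursively.
        left = douglas_peucker(points[: idx + 1], eps)
        right = douglas_peucker(points[idx:], eps)
        return left[:-1] + right         # avoid duplicating the split point
    return [a, b]                        # all intermediate points are redundant
```

Because the whole trajectory must be available before the first baseline can be drawn, this form is suited to historical data, which is consistent with the distinction between global and window-based methods drawn in Section 2.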
3.2.2. Open Window Algorithm
As shown in Figure 3, the open window (OPW) algorithm selects the first point in the data series as the starting point and the third point as the floating point. The compression principle is as follows: (1) calculate the perpendicular distance from every data point between the starting point and the floating point to the line connecting the starting point and the floating point; (2) if all distances are less than the given threshold, the floating point advances by one point along the sequence and the procedure returns to step (1); (3) if there is a point whose offset distance is greater than the threshold, a point is saved and set as the new starting point, the second point after it becomes the new floating point, and the procedure returns to step (1). After the cycle ends, the track composed of the stored points is the compressed track. In the current paper, two OPW algorithms were used, NOPW and BOPW. Their difference lies in which point is saved in step (3): NOPW selects the sampling point with the maximum distance in the current window as the end point of the current window, while BOPW takes the sampling point immediately before the most recently added point as the end point of the current window.
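The three steps and the difference between the two variants can be sketched in one function. This is an illustrative reading of the description above, assuming planar `(x, y)` points; the function name `opening_window` and the `mode` parameter are introduced here and do not come from the paper.

```python
import math

def perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (x, y), (xa, ya), (xb, yb) = p, a, b
    dx, dy = xb - xa, yb - ya
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return math.hypot(x - xa, y - ya)
    return abs(dy * x - dx * y + xb * ya - yb * xa) / norm

def opening_window(points, eps, mode="nopw"):
    """Opening-window compression; mode is 'nopw' or 'bopw'."""
    n = len(points)
    if n < 3:
        return list(points)
    kept = [points[0]]
    anchor, flt = 0, 2                   # starting point and floating point
    while flt < n:
        worst, dmax = anchor + 1, 0.0
        for i in range(anchor + 1, flt):
            d = perp_dist(points[i], points[anchor], points[flt])
            if d > dmax:
                worst, dmax = i, d
        if dmax > eps:
            # NOPW closes the window at the farthest point,
            # BOPW at the point just before the floating point.
            cut = worst if mode == "nopw" else flt - 1
            kept.append(points[cut])
            anchor, flt = cut, cut + 2
        else:
            flt += 1                     # window still fits: extend it
    kept.append(points[-1])              # the final point is always kept
    return kept
```

For example, on the track `[(0, 0), (1, 0.4), (2, 0.1), (3, -0.6)]` with a threshold of 0.5, the window first violates the threshold when the fourth point is added; NOPW then keeps the second point (the farthest one), whereas BOPW keeps the third.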

3.2.3. Sliding Window Algorithm
The core idea of sliding window (SW) trajectory compression algorithm is to initialize a sliding window with a size of 3 points including the starting point, the end point, and the curve point [44].
As shown in Figure 4, the first trajectory point is the starting point, and the initial sliding window consists of the first three points. Using the line segment that joins the first and third points as the approximate trajectory of this part, the perpendicular Euclidean distance between the second point and the line segment is calculated. If the distance is less than the threshold, the window continues to expand and a new trajectory point is added. The sliding window then contains the first four points, and the line segment joining the first and fourth points is used as the new approximate trajectory for calculating the perpendicular Euclidean distance of the other points in the window. If a calculated distance exceeds the preset distance threshold, a point is regarded as a key feature point of the approximate trajectory and marked as the starting point of a new sliding window. These steps are repeated until the end of the track is reached. In the example of Figure 4, the original trajectory is finally simplified to five points.
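Since the window only ever looks at points already received, the procedure lends itself to online use. The sketch below processes one point at a time and assumes planar `(x, y)` points. Which point becomes the key feature point when the threshold is exceeded depends on Figure 4, which is not reproduced here; this sketch assumes it is the last point for which the window was still within tolerance, and that choice should be read as an assumption rather than as the rule used in the study. The class name `SlidingWindowCompressor` is likewise hypothetical.

```python
import math

def perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (x, y), (xa, ya), (xb, yb) = p, a, b
    dx, dy = xb - xa, yb - ya
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return math.hypot(x - xa, y - ya)
    return abs(dy * x - dx * y + xb * ya - yb * xa) / norm

class SlidingWindowCompressor:
    """Streaming compressor: feed points with push(), finish with flush()."""

    def __init__(self, eps):
        self.eps = eps
        self.window = []     # points from the current start to the newest one

    def push(self, point):
        """Add one point; return the key points confirmed by this step."""
        self.window.append(point)
        if len(self.window) == 1:
            return [point]               # the first point is always kept
        if len(self.window) < 3:
            return []
        start, end = self.window[0], self.window[-1]
        if all(perp_dist(p, start, end) <= self.eps
               for p in self.window[1:-1]):
            return []                    # window still fits: keep expanding
        key = self.window[-2]            # assumed choice of key feature point
        self.window = [key, point]       # restart the window from the key point
        return [key]

    def flush(self):
        """Call at the end of the track: the final point is always kept."""
        return [self.window[-1]] if len(self.window) > 1 else []
```

A caller would collect the lists returned by `push` for each incoming position report and append the result of `flush` once the track ends.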

3.3. Clustering Algorithm
3.3.1. DBSCAN Clustering Algorithm
The DBSCAN is a density-based data clustering algorithm [45]. Without specifying the number of clusters in advance, the high-density points in target dataset can be grouped together and the low-density points can also be identified as outliers.
This algorithm requires two parameters for clustering data [46]:
(i) Eps: the maximum distance between two points for them to be considered members of the same neighborhood.
(ii) MinPts: the number of neighboring points that a data point must have in order to be considered a core point.
An illustration of the DBSCAN algorithm with MinPts = 2 can be seen in Figure 5. At the beginning of the DBSCAN clustering algorithm, each point in the dataset is marked as unvisited. The algorithm starts with an arbitrary point, retrieves its Eps-neighborhood, and labels the point as visited. If the neighborhood contains at least MinPts points, a cluster is formed. In Figure 5, the core points (in green) are those that each have at least 2 points in their neighborhood. Then, for each unvisited point in this cluster, the same retrieval process is repeated until all the density-connected points are found; the point shown in blue is a border point, and the point shown in red is identified as an outlier because it is not reachable from any other point. In the next round, a new unvisited point is retrieved to discover another cluster. Based on the two parameters, the points that are density-connected with one another are grouped into a cluster, while all points that are not density-connected with any other points are outliers.
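The visiting and expansion process can be sketched with a small brute-force implementation. It is intended only to make the roles of Eps and MinPts concrete: the neighborhood search is quadratic, distances are plain Euclidean, and, as an assumption of this sketch, a point counts as a member of its own neighborhood. The function name `dbscan` and the `NOISE` label are illustrative.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbours(i):
        # Brute-force Eps-neighborhood; includes point i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)        # None means "unvisited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1                     # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins, never expands
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:       # j is also a core point: expand
                queue.extend(nb)
    return labels
```

With `eps=1.5` and `min_pts=2`, the points `[(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]` fall into two clusters, and the isolated last point is labeled as noise.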

3.3.2. K-Means Clustering
K-means is one of the most widely used unsupervised clustering methods owing to its simplicity, ease of implementation, and efficiency [47]. The aim of the K-means algorithm is to divide m objects in n dimensions into k (where k ≤ m) partitions (or clusters) such that the within-cluster sum of squares is minimized [48].
Given a set of objects, the K-means clustering algorithm attempts to optimize the following objective function, that is, to minimize the squared distance of each point from the center of the cluster to which the point belongs [49]:

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2, \tag{2}$$

where $J$ is the criterion function, $x_i$ is the $i$th observation, $c_j$ is the $j$th cluster center, $C_j$ is the object set of the $j$th cluster, and $k$ represents the number of clusters.
Although the K-means algorithm is sensitive to noise points and is prone to converging to a local optimum, it does reach a local minimum for any given initial choice of centroids. Moreover, its results are easy to interpret and it converges quickly. In this study, K-means is run several times to check for variation in the clustering caused by different initial centroids. Furthermore, the K-means algorithm belongs to the family of clustering algorithms that require a priori specification of the desired number of clusters [50].
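The alternation between assignment and center update, together with the repeated runs from different initial centroids, can be sketched as below. This is a plain Lloyd-style iteration written for illustration, not the implementation used in the study; the helper names `assign` and `update`, the random initialization from data points, and the defaults `n_init=10` and `max_iter=100` are assumptions.

```python
import math
import random

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Mean of each cluster; an empty cluster keeps its previous center."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def kmeans(points, k, n_init=10, max_iter=100, seed=0):
    """Run K-means n_init times; return (inertia, labels, centers) of the best run."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_init):
        centers = rng.sample(points, k)  # initial centroids drawn from the data
        for _ in range(max_iter):
            new_centers = update(points, assign(points, centers), centers)
            if new_centers == centers:   # converged to a local minimum
                break
            centers = new_centers
        labels = assign(points, centers)
        inertia = sum(math.dist(p, centers[lab]) ** 2
                      for p, lab in zip(points, labels))
        if best is None or inertia < best[0]:
            best = (inertia, labels, centers)
    return best
```

Keeping the run with the smallest within-cluster sum of squares is one simple way to reduce the dependence on the initial centroids noted above.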
3.3.3. Hierarchical Clustering
Hierarchical clustering analysis (HCA) is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. HCA divides the data at different levels and generates a tree of nested clusters from which the required clusters can be obtained. HCA methods are generally grouped into two categories, agglomerative and divisive clustering. The current study used agglomerative clustering, in which each point starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy [51].
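The bottom-up merging can be sketched with a naive implementation. The text does not state which linkage criterion was used, so average linkage is assumed here purely for illustration, and the function name `agglomerative` is hypothetical. The cubic running time makes this suitable only for small examples.

```python
import math
from itertools import combinations

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain; return labels."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]   # every point starts alone

    def avg_link(a, b):
        # Average pairwise distance between two clusters (assumed linkage).
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: avg_link(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]    # a < b, so deleting b is safe
        del clusters[b]

    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels
```

Stopping the merging at different values of `n_clusters` corresponds to cutting the nested tree at different levels.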
3.4. Clustering Performance Evaluation Criteria
3.4.1. Calinski-Harabasz Index for DBSCAN
The Calinski-Harabasz Index (CHI), also known as the variance ratio criterion, can be used to evaluate the clustering performance of the DBSCAN algorithm. The index is the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters (where dispersion is defined as the sum of squared distances); a higher Calinski-Harabasz score indicates a model with better-defined clusters [52].
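For a set of data $E$ of size $n_E$ which has been clustered into $k$ clusters, the Calinski-Harabasz score $s$ is defined as follows:

$$s = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \times \frac{n_E - k}{k - 1},$$

where $\operatorname{tr}(B_k)$ is the trace of the between-group dispersion matrix and $\operatorname{tr}(W_k)$ is the trace of the within-cluster dispersion matrix, defined by

$$W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^{T}, \qquad B_k = \sum_{q=1}^{k} n_q\,(c_q - c_E)(c_q - c_E)^{T},$$

where $C_q$ is the set of points in cluster $q$, $c_q$ is the center of cluster $q$, $c_E$ is the center of $E$, and $n_q$ is the number of points in cluster $q$.
The score is high, with a greater $\operatorname{tr}(B_k)$ and a smaller $\operatorname{tr}(W_k)$, when clusters are dense and well separated. Besides, the score can be computed quickly.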
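The variance ratio can be computed directly from the points and their cluster labels, as the following sketch shows. The traces of the two dispersion matrices reduce to sums of squared distances, which is what the code accumulates. How DBSCAN noise points are treated is not stated in the text; this sketch assumes the caller removes them beforehand, and the function name `calinski_harabasz` is illustrative.

```python
def calinski_harabasz(points, labels):
    """Variance ratio criterion for a labeled set of points."""
    n = len(points)
    ids = sorted(set(labels))
    k = len(ids)
    if k < 2 or k >= n:
        raise ValueError("need between 2 and n - 1 clusters")

    def mean(pts):
        return tuple(sum(c) / len(pts) for c in zip(*pts))

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    center_all = mean(points)
    between = within = 0.0
    for q in ids:
        members = [p for p, lab in zip(points, labels) if lab == q]
        center_q = mean(members)
        # Trace of the between-group dispersion: size-weighted squared offsets.
        between += len(members) * sq_dist(center_q, center_all)
        # Trace of the within-cluster dispersion: squared distances to center.
        within += sum(sq_dist(p, center_q) for p in members)
    return (between / within) * (n - k) / (k - 1)
```

For the four points `(0, 0), (0, 2), (10, 0), (10, 2)` split into a left and a right pair, the between-group term is 100 and the within-cluster term is 4, so the score is 50. The function raises a division error if every cluster consists of identical points, since the within-cluster term is then zero.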
3.4.2. Inertia-Evaluation Index for K-Means
Inertia is an internal evaluation measure of clustering performance, which represents the sum of squared Euclidean distances from all sample points to the centroid of their cluster. The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion, given in (2).
However, it should be noted that inertia is based on the assumption that clusters are convex and isotropic, which is not always the case. It may respond poorly to elongated clusters or manifolds with irregular shapes. Besides, inertia is not a normalized metric, and Euclidean distances tend to become inflated in very high-dimensional spaces. Although it may bring several drawbacks, inertia can still be recognized as a measure of how internally coherent clusters are. A smaller inertia means that the samples are more similar in each cluster and the clustering results are more satisfactory.
4. Result and Discussion
4.1. Trajectory Compression
To undertake data-driven vessel trajectory analysis, a large body of methods for handling massive trajectory data has been proposed, which can be roughly classified into two major categories: trajectory pattern extraction and trajectory prediction [3]. For trajectory pattern extraction, the most frequently utilized methods are stay-point extraction (e.g., [53]), trajectory segmentation (e.g., [54]), and trajectory clustering (e.g., [10]). However, due to the massive number of original vessel trajectory points recorded by sensors, storing, transmitting, and processing the data can be extremely challenging in terms of time and memory consumption, which creates the need for trajectory compression techniques.
Although DP is one of the most prevalent trajectory compression algorithms and has been applied by many scholars (e.g., [8, 55]), it should be noted that its computational cost is extremely high owing to its high time complexity and its tendency to cause memory overflow [21]. To meet the demand for real-time trajectory compression, the NOPW, BOPW, and SW algorithms were proposed to improve the efficiency of data compression. In this article, to investigate the performance of different compression algorithms and find the best one for later analysis, all four were applied. Compression time, redundant information rejection rate [21], and compression error were chosen as the three main indicators for performance evaluation, and the results are shown in Figures 6–8. As illustrated, the SW algorithm presented better performance in compression time than its three counterparts. Overall, compression time decreases as the threshold increases for all the employed algorithms. The average compression time of the DP, NOPW, BOPW, and SW algorithms was 472.556 s, 371.556 s, 334.400 s, and 252.801 s, respectively. This suggests that, compared to the DP and OPW algorithms, the SW algorithm might be more efficient and thus provide a more time-saving strategy for real-time trajectory compression.
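To make the cost of DP concrete, a minimal recursive sketch of the Douglas-Peucker procedure is given below. It is an illustration under stated assumptions, not the implementation evaluated in this paper: points are assumed to be planar `(x, y)` tuples already projected to metres, the distance is measured to the anchor-to-end segment (a common variant of the perpendicular distance), and the function names are hypothetical.

```python
from math import hypot


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b in planar coordinates."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, threshold):
    """Keep the end points; recurse on the farthest point if it exceeds the threshold."""
    if len(points) < 3:
        return list(points)
    idx, dmax = max(
        ((i, point_segment_distance(points[i], points[0], points[-1]))
         for i in range(1, len(points) - 1)),
        key=lambda pair: pair[1],
    )
    if dmax <= threshold:
        return [points[0], points[-1]]  # all intermediate points are redundant
    left = douglas_peucker(points[: idx + 1], threshold)
    right = douglas_peucker(points[idx:], threshold)
    return left[:-1] + right  # the split point appears in both halves; keep it once


track = [(0, 0), (1, 0.1), (2, 0), (3, 5), (4, 0)]
print(douglas_peucker(track, 1.0))
```

The sketch needs the whole trajectory in memory before it can start and rescans every sub-segment on each split, which gives quadratic time in the worst case. This is the behaviour that window-based algorithms such as NOPW, BOPW, and SW avoid by processing points as they arrive.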



With regard to the redundant information rejection rate, which can also be regarded as the compression rate, the SW algorithm obtained better results, indicating higher compression efficiency. In contrast to compression time, the redundant information rejection rate is positively related to the threshold. When the compression threshold was set to 20, the compression rates of the DP, NOPW, BOPW, and SW algorithms were 95.433%, 96.689%, 95.733%, and 97.189%, respectively, demonstrating high compression efficiency. Likewise, it is worth noting that the redundant information rejection rate tends to stabilize once the threshold reaches a certain value, consistent with the relationship between compression time and threshold at higher values. This result also indicates that the SW algorithm is more efficient than DP, NOPW, and BOPW.
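A short sketch of how such a rate can be computed is given below. The exact definition follows [21] and is not restated in this section, so the formula used here (the share of original points discarded by compression, in percent) and the function name `rejection_rate` are assumptions.

```python
def rejection_rate(n_original, n_compressed):
    """Percentage of original trajectory points removed by compression (assumed definition)."""
    if n_original <= 0:
        raise ValueError("n_original must be positive")
    return 100.0 * (n_original - n_compressed) / n_original


# Hypothetical counts: 1000 raw points reduced to 50 retained points.
print(rejection_rate(1000, 50))
```

Under this definition the rate can only grow or stay constant as the threshold rises, because a larger threshold never forces additional points to be kept.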
From Figure 8, it can be observed that the compression error of these algorithms has a roughly linear relationship with the compression threshold: as the threshold increases, the compression error also increases. At small thresholds, the SW algorithm exhibited relatively large compression errors, but its error grew at a lower rate. The BOPW algorithm consistently maintained a high compression error. The DP and NOPW algorithms obtained lower compression errors when the threshold was small, but their error growth rates were higher than that of the SW algorithm, indicating that as the threshold continues to increase, the SW algorithm may have the lowest compression error.
Comparing these four algorithms in terms of compression time, redundant information rejection rate, and compression error, it can be found that SW outperforms the other compression algorithms on the VITS dataset utilized in the current study, while DP displayed inferior performance. Figure 9 presents the trajectory point distance distribution of the original trajectories, the DP-compressed trajectories, and the SW-compressed trajectories. As presented in Figure 9(a), the original trajectories contain a series of points with distances mainly distributed in the range of 20–40 m. After compression by the DP and SW algorithms, the distances between points were enlarged and demonstrated a similar trend. Figures 9(b) and 9(c) present the distance distribution of the compressed trajectories with the threshold set to 20 m, and it can be observed that most distances fall in the range of 20–5000 m.

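Figure 9: Trajectory point distance distribution. (a) Original trajectories; (b) and (c) DP- and SW-compressed trajectories (threshold of 20 m).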
4.2. Trajectory Clustering
Clustering algorithms, including DBSCAN, K-means, HCA, OPTICS, spectral clustering, and affinity propagation (AP), have been widely employed in the field of vessel trajectory feature extraction; nevertheless, few studies have compared their clustering performance. Some work has proposed a joint method that integrates a dimensionality reduction algorithm (t-SNE) with a clustering algorithm (spectral clustering) to deal with high-dimensional trajectory data (e.g., [25]); however, after numerical experiments, it was found that spectral clustering, OPTICS, and AP were incapable of handling the original trajectory information of the dataset used in this article, which consists of longitude, latitude, speed, heading, and timestamp. DBSCAN, K-means, and HCA proved more suitable, which is why they were selected in this paper to conduct trajectory feature extraction.
Regarding the DBSCAN clustering algorithm, its calculation requires two critical parameters, eps and MinPts, so a parameter tuning test is necessary to select the most appropriate values. In this paper, the Calinski-Harabasz Index (CHI) was applied to test the algorithm performance and obtain the most appropriate values of eps and MinPts. CHI has the advantage of easy and fast calculation, which saves time and matters for practical application, and a higher Calinski-Harabasz score relates to a model with better-defined clusters. Besides, to comprehensively measure the clustering effect, clustering time and the number of clusters were also considered as two important indicators. After a pretesting process, four groups of MinPts (i.e., 50, 100, 150, and 200) were set for comparison, with eps taking values between 0.0002 and 0.011. To visualize the relationship between parameters and indicators, the parameters were taken as the independent variables and the indicators as the dependent variables to plot the relationship curves illustrated in Figure 10.
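The parameter test described above amounts to a grid search over eps and MinPts. The sketch below pairs a minimal brute-force DBSCAN with such a loop and records the number of clusters and noise points for each combination. It is an illustration only: the names `dbscan` and `grid_search`, the planar Euclidean distance, and the toy coordinates are assumptions, and the CHI score and running time used in this paper would be recorded at the same place in the loop.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point (-1 marks noise)."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force radius query; the point itself is included.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nbrs)
    return labels


def grid_search(points, eps_values, min_pts_values):
    """Run DBSCAN for every (eps, MinPts) pair and summarise the outcome."""
    results = []
    for min_pts in min_pts_values:
        for eps in eps_values:
            labels = dbscan(points, eps, min_pts)
            n_clusters = len(set(labels) - {NOISE})
            results.append((eps, min_pts, n_clusters, labels.count(NOISE)))
    return results


toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
for row in grid_search(toy, [0.5, 1.5], [3]):
    print(row)
```

Each radius query scans all points, so a larger eps returns more neighbours to expand, which matches the intuition that the computational burden grows with the scan radius.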

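Figure 10: Relationship between the DBSCAN parameters (eps and MinPts) and the evaluation indicators; panel (b) shows clustering time.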
Relative to groups Min150 and Min200, the CHI scores are higher in groups Min50 and Min100, indicating that lower MinPts tended to obtain higher CHI scores and might have a better clustering effect. Except for group Min50, the CHI scores fluctuate before they become steady. With regard to clustering time, as shown in Figure 10(b), all groups show an increase as eps increases, which might be caused by the growing computational complexity, since eps represents the scan radius and its expansion enlarges the calculation burden. In addition, it is somewhat unexpected that there is no obvious distinction in processing time among different MinPts groups, indicating that processing time is insensitive to MinPts. An alternative explanation is that at such MinPts scales (i.e., with 50 as the interval), the differences in processing time are not apparent, whereas larger intervals might yield different findings; more tests could be performed for further investigation.
K-means is also an unsupervised machine learning method for sample clustering, and to achieve a satisfactory clustering result, the number of clusters can be determined with different evaluation indexes such as CHI, silhouette coefficient (SC), and inertia. After a pretest investigating the relationship between the number of clusters and the evaluation indexes mentioned above, it was found that, for the vessel trajectory data of this paper, CHI did not converge to an upper bound, so the best number of clusters could not be determined. Similarly, SC was not suitable for the dataset of this paper, while inertia presented acceptable results, especially when combined with the clustering time indicator, as presented in Figure 11. It is noteworthy that a lower inertia value represents a better clustering effect, especially for low-dimensional data. As shown, K-means clustering time is positively related to the number of clusters, while the inertia index shows the opposite trend and declines sharply when the number of clusters lies in the range of 0–10. To balance clustering time and the inertia index, 14 (the number of clusters at the intersection point of the two curves) was chosen as the input parameter of K-means for later analysis. Likewise, HCA was also conducted with 14 clusters.
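The relationship between the number of clusters and inertia can be reproduced in outline with the sketch below, which runs a plain Lloyd-style K-means for several values of k and collects the resulting inertia. It is a simplified illustration, not the configuration used in this paper: the names `kmeans` and `inertia_curve`, the random initialization with a fixed seed, and the toy points are assumptions.

```python
import random
from math import dist


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids (an assumption)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group;
        # an empty group keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    # Inertia: sum of squared distances from each point to its nearest centroid.
    inertia = sum(min(dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, inertia


def inertia_curve(points, k_values):
    """Inertia for each candidate number of clusters."""
    return {k: kmeans(points, k)[1] for k in k_values}


toy = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(inertia_curve(toy, [1, 2, 3, 4]))
```

With random initialization the result for a given k can be a local optimum, so practical implementations usually repeat the run with several initializations and keep the lowest inertia. Recording the wall-clock time of each run alongside the inertia would give the second curve needed to locate an intersection point such as the one used to select 14 clusters.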

Comprehensively considering the results of parameter selection for DBSCAN, K-means, and HCA, the final eps and MinPts of DBSCAN were set as two groups: (1) eps = 0.003, MinPts = 150; and (2) eps = 0.002, MinPts = 100. Running the DBSCAN algorithm with these two groups of parameters yielded 8 and 15 clusters, as demonstrated in Figures 12(a) and 12(b). The trajectory sample distribution of each cluster for the different algorithms and parameter groups can be seen in Table 1. It can be clearly observed that the trajectory distribution is not balanced among the clusters of the DBSCAN algorithm, with most samples (nearly 90% in both DBSCAN groups) concentrated in very few clusters. Likewise, Figure 12 reveals that, compared with the clustering result of the DBSCAN algorithm, K-means and HCA are more capable of dividing clusters among highly concentrated data with little distance divergence between clusters. To further compare the clustering performance of DBSCAN, K-means, and HCA on the dataset of this paper from a quantitative perspective, CHI calculation was undertaken, with the final scores of DBSCAN (eps = 0.003, MinPts = 150), DBSCAN (eps = 0.002, MinPts = 100), K-means, and HCA being 4870.043, 2439.210, 59887.906, and 54006.362, respectively.

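Figure 12: Clustering results. (a) and (b) DBSCAN with the two parameter groups; remaining panels: K-means and HCA.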
4.3. Speed Distribution Analysis Result
To reflect the trajectory point features and investigate the differences among clusters from a more specific perspective, vessel speed distribution analyses of the DBSCAN, K-means, and HCA clusters were carried out, and the results are presented in Figure 13. As illustrated, the vessel speed distributions obtained by K-means and HCA did not vary noticeably among clusters, whereas the clusters generated by the DBSCAN algorithm presented larger distinctions in speed distribution, which might be caused by the smaller sample size of most clusters. With regard to the DBSCAN cluster speed distribution, samples belonging to Cluster 1 were found to have a larger average speed (6.874 knots) and speed standard deviation (2.542) than the other clusters, with the cluster center located at the coordinate (119.718545, 34.302416). This indicates that vessels sailing into this area should pay more attention to avoiding conflicts, since the surrounding vessels may maintain a higher speed, which increases the risk of accidents. Although the result showed that Cluster 10 of DBSCAN has the lowest average vessel speed (3.59 knots), this result might not be completely credible, since Cluster 10 has only 80 samples and therefore lacks validity and reliability.

Figure 13: Vessel speed distribution of the clusters obtained by DBSCAN, K-means, and HCA (panels (a)–(c)).
As for the speed distribution analysis of the K-means clusters, both the average speed (5.828 knots) and the speed standard deviation (2.446) were the highest in Cluster 6, with the cluster center lying at (113.29926, 29.633016). For the HCA clusters, the highest average speed was found in Cluster 4, with an average speed of 5.525 knots and a speed standard deviation of 1.723.
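The per-cluster statistics reported in this subsection can be obtained with a simple grouping step, sketched below. The function name `speed_stats_by_cluster` and the sample values are hypothetical, and the use of the population standard deviation is an assumption, since the type of standard deviation is not specified here.

```python
from collections import defaultdict
from statistics import mean, pstdev


def speed_stats_by_cluster(speeds, labels):
    """Sample count, mean speed, and speed standard deviation for each cluster label."""
    grouped = defaultdict(list)
    for speed, label in zip(speeds, labels):
        grouped[label].append(speed)
    return {
        label: {"n": len(values), "mean": mean(values), "std": pstdev(values)}
        for label, values in grouped.items()
    }


# Hypothetical speeds in knots with their cluster labels.
stats = speed_stats_by_cluster([4.0, 6.0, 5.0, 5.0], [0, 0, 1, 1])
for label, row in sorted(stats.items()):
    print(label, row)
```

Keeping the sample count `n` next to each mean makes it easy to spot clusters whose statistics rest on very few points, such as the 80-sample DBSCAN cluster discussed above.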
4.4. Methods of Trajectory Extraction
Although previous studies have used trajectory compression to some extent, few of them compared the performance of different compression algorithms. Additionally, most studies preferred to employ the DP algorithm (e.g., [3, 55]) for trajectory compression, while the other three algorithms (i.e., NOPW, BOPW, and SW) utilized in the current paper seem to present better performance in compression time and efficiency.
As Figure 14 exhibits, although some trajectory points were deleted through compression, the compressed trajectory did not deviate substantially from the original one. Besides, because the trajectories in this study were collected from narrow waterway channels, the advantages of applying compression algorithms, namely reducing time and memory consumption while preserving the original trajectory features, can be observed more clearly.

The performance comparison in this article suggests that the SW algorithm outperformed the other compression algorithms in terms of the demand for real-time online compression, given its lowest compression time regardless of the threshold (Figure 15). The compression time gap between SW and DP was the largest, while that between SW and BOPW was the smallest, suggesting that DP might exhibit satisfactory performance in global compression tasks [41] but, compared to the other algorithms, could not handle local compression efficiently. However, it should be noted that when the preset distance threshold increases, the compression time gap might narrow. Besides, it can be assumed that an increase in the amount of data will also widen the compression time gap, and more research could be carried out on this point. As for practical application prospects, it is believed that as trajectory compression technology matures, online and real-time vessel management and decision-making will become increasingly available and thus facilitate the development of vessel transportation.

Clustering analyses, another important branch of vessel trajectory extraction, were also carried out in this paper. Except for spectral clustering, OPTICS, and AP, which failed to run on the dataset in this research, this paper tested the effect of DBSCAN, K-means, and HCA in narrow waterway channel trajectory extraction. According to the CHI score, the performance of K-means is more desirable than that of the other two clustering methods, although DBSCAN seems to be more frequently used in vessel trajectory detection (e.g., [3, 8, 55]); this discrepancy might be due to the large geographical space considered in this article. The points in different clusters come from different geographical locations. It can also be observed that, for the DBSCAN clustering result, most trajectory points were concentrated in the middle and lower reaches of the Yangtze River, indicating relatively dense traffic owing to the prosperous economy of the Yangtze River Delta. Similar to this paper, some scholars also believed that DBSCAN was not the optimal clustering method [25]; nevertheless, their work obtained desirable results with spectral clustering, which differs from the current study. These findings offer a perspective different from the conventional one, indicating that not all trajectory data are suitable for DBSCAN. An alternative explanation is that our trajectory data are based on narrow waterway channels, while most studies used maritime vessel trajectories.
4.5. Trajectory Feature Analysis
Through the further trajectory point feature analysis based on the variance of the cluster speed distributions, it can be found that, with a roughly similar number of clusters, vessel speeds among clusters determined by DBSCAN presented larger variance than those of its K-means and HCA counterparts. Past work on trajectory extraction has conducted speed analysis from other perspectives, such as speed distribution in a narrower waterway channel (e.g., [10, 56]) and trajectory segment speed feature analysis (e.g., [55]), while this paper extended the analysis to many narrow channels belonging to the Yangtze River, and the findings demonstrated that vessel speed did vary slightly among different channels. However, it should be noted that the sample of this paper is based on 617 ships, so the results might involve a certain degree of chance, and more samples could be used to acquire more convincing results in future research.
5. Conclusion
This paper compared the performance of four trajectory compression algorithms and three clustering algorithms on vessel trajectory information obtained by VITS. The main contributions and findings can be summarized as follows.
(1) In terms of trajectory point feature extraction, the DP, NOPW, BOPW, and SW algorithms were employed to compress vessel historical trajectory data, which can subsequently reduce the amount of data processing and significantly improve the computational efficiency. The experimental results based on a real dataset collected from the Yangtze River and its connected inland waterways demonstrated that the SW algorithm presented better performance in compression time than the other three algorithms with appropriate thresholds. As for the redundant information rejection rate, all four algorithms obtained high compression efficiency, while the SW algorithm, with a compression rate of 97.189%, performed better than the other algorithms.
(2) To recognize vessel behavior patterns, DBSCAN, K-means, and HCA were utilized for clustering vessel trajectory points, where CHI and inertia were used to select the most appropriate parameters to improve the performance of the corresponding algorithms. The results indicated that lower MinPts tended to obtain higher CHI scores and might have a better clustering effect for the DBSCAN algorithm, while, unexpectedly, there was no significant difference in processing time between different MinPts groups. Compared with the clustering results of the DBSCAN algorithm, it was also found that K-means and HCA are more capable of dividing clusters among highly concentrated data.
(3) Besides, the parameters balancing the clustering time and inertia index were selected to compare the distribution of the trajectories. The trajectory distribution of DBSCAN was clearly more divergent among different clusters, as indicated by the larger distribution variance of vessel velocity compared to its K-means and HCA counterparts under similar cluster numbers. Through the statistical analysis of clustering results, water areas with higher speed or variance can be discovered and marked as risky areas, which will help maritime administrations understand ship behavior patterns and their changes in a specific area and develop corresponding vessel speed management and operation measures. This will help guarantee the safety of vessels and reduce property losses due to unreasonable traveling speeds.
However, it should be noted that this paper is not free of limitations: (1) the results may involve a certain degree of chance because the sample was collected in a single day, so more samples are required in future research; additionally, this paper mainly utilized dynamic vessel data, and in the future, more static data such as vessel type information could be obtained to investigate whether they help enhance the clustering performance; (2) clustering individual point data discards the shape characteristics of the trajectory and thus cannot properly model the global trajectory features and motion trends; (3) considering the narrow geographical shape of the Yangtze River channel, the clustering results may be influenced to a certain extent, and it is worth investigating techniques for extracting features and obtaining better performance in narrow waterways in future work; (4) the current paper only explored the feasibility of the VITS data, while a comparison of VITS and AIS could be conducted once more AIS data are obtained, for a better understanding of these two data sources; (5) clustering methods for trajectories could be further improved for trajectory extraction in future research.
Data Availability
The data used to support the findings of this study have not been made available.
Conflicts of Interest
The authors declare that they have no conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work has been partially supported by the National Natural Science Foundation of China (Grant no. 52072069) and Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX21_0130). Their assistance is gratefully acknowledged.