Abstract

Road link speed is one of the important indicators of traffic states. In order to incorporate the spatiotemporal dynamics and correlation characteristics of road links into speed prediction, this paper proposes a method based on LDA and GCN. First, we construct a trajectory dataset from map-matched GPS location data of taxis. Then, we use the LDA algorithm to extract the semantic function vectors of urban zones and quantify the spatial dynamic characteristics of road links based on taxi trajectories. Finally, we add semantic function vectors to the dataset and train a graph convolutional network to learn the spatial and temporal dependencies of road links. The learned model is used to predict the future speed of road links. The proposed method is compared with six baseline models on the same dataset generated from GPS devices installed on taxis in Shenzhen, China, and the results show that our method has better prediction performance when semantic zoning information is added. Single-valued and composite semantic zoning information improve the performance of graph convolutional networks by 6.46% and 8.35%, respectively, while the baseline machine learning models benefit only from single-valued semantic zoning information on the experimental dataset.

1. Introduction

With the increasing number of vehicles, traffic congestion in cities is becoming more and more serious. Obtaining real-time and future road states is essential for optimizing driver routes, reducing road congestion, and developing sustainable urban transport policies [1, 2]. Road states are usually measured by traffic indexes such as volume, speed, and occupancy [2, 3]. With the support of communication and computing technologies, these indexes can be calculated from monitoring data obtained from sensors placed in the road network. In particular, taxis with location-positioning capabilities are considered to be flexible probes that can obtain real-time, continuous information on vehicle movements, trip origins and destinations, routes, and passenger status. Studies have also shown that the analysis and prediction of urban road states using location data and artificial intelligence methods can be effective in relieving road traffic stress [4–6].

The layout of urban functional zones is the root cause for the generation of traffic demand, the uneven distribution of traffic flow, and the dynamic characteristics of road network. Traditional methods of urban functional zoning use land-uses, satellite images, and questionnaire surveys to statically delineate urban functional areas by clustering or the establishment of indicator systems. However, the static functional delineation cannot reflect the travel patterns exhibited by human activities [7, 8] and their impact on the formation of regional functions. In recent years, trajectory data such as taxi and bus location data have been gradually applied to the classification of land use and the identification of functional zones in road networks to help city managers better understand the relationship between urban functional zones and travelers’ activities. On the other hand, the traffic flow between road segments is spatially correlated [9]. Not only do the traffic flows between upstream and downstream road segments influence each other, but there is also a traffic transfer relationship between multiple road intersections. Studies have shown that the traffic state at one intersection is directly related to the other 100 intersections [10]. However, the use of trajectory data to identify urban functional zones and predict road link speeds is independent of each other. The information on the functional structure of road network is for planning purposes only and is not integrated with speed predictions for road links. In addition, correlations between road segments and intersections were tested only at small spatial scales. Therefore, it would be valuable to include these two factors in road speed prediction.

As socioeconomic activities develop and change, an urban area usually contains multiple functions simultaneously. This in turn affects the temporal and spatial characteristics of each road. However, in the traditional method of road network subdivision, each road belongs to only one functional zone [11, 12]. The results of such singular delineation cannot reflect the dynamic nature of the network zones and have poor relevance to human activities. Recently, some studies [13–16] have proposed multifunctional quantification methods based on textual data mining, such as the Latent Dirichlet Allocation (LDA) model [17]. The results of the multifunctional quantitative calculations are expressed as vectors, which are then clustered to form the final functional zoning of the road network [18, 19]. Furthermore, urban transportation networks are typically complex networks characterized by small-world and community structures. Machine learning algorithms for complex networks have become a research hotspot. Several graph representation methods [20, 21] and graph neural networks (GNNs) [22] have been introduced in complex network modeling. In past years, data-driven machine learning methods, such as support vector machines [23] and neural networks [24], were commonly used to predict the state of the road network. Recently, GNN-based methods have been published in the field of road state prediction, such as Multirange Attentive Bicomponent GCN (MRA-BGCN) [25], Multiweighted Traffic Graph Convolutional Network (MW-TGC) [26], DDP-GCN [27], and T-GCN [28]. In these methods, only primitive information in taxi trajectories and structural information in the road network were used, such as speed, travel time, traveled distance, speed limit, and flow direction. It was expected that, through the analysis of a large amount of data, efficient models could be learned autonomously. However, less research has been devoted to integrating spatial semantic zoning information with predictive models and examining the validity of zoning information for the intended models.

For the reasons mentioned above, this paper aims to integrate the semantic zoning information with graph convolutional network for road link speed prediction. Firstly, LDA algorithm was used to obtain the stable semantic zoning information of taxi travel network over a certain period of time, and then the extracted functional vectors were added to the training process of spatial-temporal graph convolutional network and baseline machine learning models. Finally, we compared and analyzed the performance of functional vectors in each model. The main contributions of this paper are as follows: (1) due to the difficulty in reflecting the temporal and spatial dynamic dependencies of each road link by assigning it to a single zone, we obtained the composite functional vectors of each road link using semantic zoning based on the text of the taxi trajectories. (2) We proposed the use of spatiotemporal graph convolutional network to fuse information on semantic zoning, historical speed, and network structure. (3) With the proposed method, the large-scale spatial correlation of road links was integrated into the predictive model using LDA-generated semantic vectors; the local spatial correlation and the temporal dependencies of road links were learned by spatial–temporal graph convolutional network.

This paper is organized as follows. We briefly describe the works related to road speed prediction in Section 2. In Section 3, the detailed steps of the proposed method are explained, and theoretical basis of each algorithm used is introduced. The results of comparison experiments with six baseline algorithms are presented in Section 4 and discussed in Section 5. The conclusion and future work are reported in Section 6.

2. Literature Review

The functional layout and structure of the city is the root cause of the generation of traffic demand and the imbalanced distribution of traffic flows in the taxi trip network. This unevenness in traffic flow is often reflected in the spatial and temporal differences between locations. For example, traffic demand in commercial areas is high and road congestion is frequent. For electric vehicles, longer charging times tend to cause congestion near charging stations [29]. In contrast, the road network in cultural districts tends to experience traffic peaks during commuting and school hours. Identifying zoning and cluster characteristics of taxi travel networks has been the focus of research in urban planning, transportation network planning, and spatiotemporal trajectory mining for taxi operations and management. There are three types of methods to characterize the zoning or clustering of taxi travel networks. The first method is to detect hotspots and identify clustering patterns using a clustering algorithm based on taxi location points. The second approach is to divide the urban space into regular grids of a certain size [8, 30] or traffic zones [31] and then perform density analysis and clustering pattern discovery in the grids or zones. The last one is the semantic analysis method, which extracts trajectories from taxi location points according to spatial and temporal order and then combines them with textual information such as points of interest (POIs) data [32] and street names to identify the semantic functional areas of the travel network [13–16]. Compared to the previous two approaches, the semantic-based approach makes the functional zoning of the travel network more interpretable. In semantic analysis methods, LDA [17] is a widely used method that first appeared in the field of natural language processing (NLP) for semantic topic recognition. Current research has extended it to the field of trajectory data mining with good results [14, 16]. When analyzing taxi trajectory data using the LDA algorithm, the “word-document-topic” relationship in text mining is used as a reference to extract the road network semantic zones based on “road-trajectory-topic zone.”

Traffic state estimation refers to the analysis of typical quantities, such as traffic speed, travel time, flow volume, and density. Traditional traffic prediction algorithms include support vector machine (SVM) [23], support vector regression (SVR) [2], ARIMA, and neural networks [24]. Early speed collection approaches primarily adopted loop sensors, radar, cameras, and other sensors, which are mostly used in road traffic monitoring and autonomous vehicle state detection [33]. Compared to the above sensors, the GPS equipped in taxis has wider coverage and is more useful for recording the driving speed of each road segment. For example, Shan [3] proposed a multivariate linear regression model based on taxi location data to calculate the travel speed of each road segment by fusing information from the previous time interval and adjacent road segments. Oshyani [34] used an estimator based on indirect inference to predict traffic speed. Shan [35] tested three widely used GPS-based traffic speed estimation methods. Deng [36] introduced a path inference process for congested link speeds from low sampling frequency taxi GPS data. Satrinia [2] predicted the traffic speed using support vector regression. Yao [23] proposed a support vector machine model with spatial–temporal parameters for short-term traffic speed prediction, including multitime-step traffic prediction of several road links, and compared the proposed model with ANN, k-NN, a historical data-based model, and a moving average data-based model. When applied to big data, the abovementioned methods generally lack longevity and scalability due to insufficient robustness of the underlying theory. Despite the good performance of SVM and ANN, they can only provide deterministic point predictions and fail to provide the corresponding uncertainty quantification. In recent years, some state-of-the-art prediction methods that can measure uncertainty have appeared in the transportation field [37–39], but these methods have not yet been used for road state prediction.

In recent years, the research trend in the field of traffic prediction has been towards deep learning and combinatorial models. For example, Ma [40] proposed a convolutional neural network (CNN) to learn traffic from images and predict large-scale, network-wide traffic speed. Liu [41] introduced an attention CNN to predict traffic speed. Kim [42] employed the capsule network on loop sensor data for traffic speed prediction. As an emerging framework, GNNs have been widely promoted and extended in traffic prediction. Zhao [28] presented a temporal graph convolutional network (T-GCN) that combines GCN with a gated recurrent unit (GRU) for traffic prediction. Guo [43] proposed an attention-based spatial-temporal GCN for traffic flow forecasting. Lu [44] designed a graph Long Short-Term Memory (LSTM) framework to capture spatial-temporal representations in road speed prediction. GCNs can be seen as a special case of GNNs [22], whose spectral domain approach aims to introduce the convolutional theory of signal analysis into irregular graphs to extract spatial features similar to those of CNNs on images. GNNs can also implement graph feature extraction through message passing. However, they are prone to over-smoothing problems [22]. Combining the spatial zoning characteristics of taxi travel networks at different scales with road speed prediction is beneficial to the optimization and integration of model design. Following this idea, Huang [45] used spectral clustering to classify the traffic conditions into several clusters with less variability of traffic conditions within each cluster and implemented predictions for the clusters. In order to add semantic information to the predictive model and to learn the spatiotemporal dependence between road links, we integrated two state-of-the-art algorithms. We used the LDA algorithm for semantic zone detection for road links at large scale and adopted a GCN algorithm for speed prediction that considers spatial–temporal dependencies at local scale.

3. Data and Methods

3.1. Data

The data used in this paper are taxi trajectories and road network. Taxi data was collected in Shenzhen, China, from May 1 to May 15, 2015. The raw taxi data was sampled at intervals of about 30 seconds. The road network data was downloaded from OpenStreetMap [46] and manually checked and edited. A sample of taxi data and the road network are shown in Figure 1. Since the taxi data contains some useless information and some errors, we first removed the useless fields from the daily data and saved the following fields: taxi ID, latitude and longitude, timestamp, speed, and operator status. After removing the outliers, the final dataset contains a total of 16,828 taxi trajectories.

3.2. Methods

From the existing literature [14], the temporal and spatial dynamic characteristics of road links can be quantified by trajectory semantic mining methods and represented in the form of functional vectors. Additionally, the characteristics of road links are correlated to their historical states and are influenced by the surrounding road links. In order to incorporate the spatiotemporal dynamic and correlation characteristics of road links into speed prediction, this paper proposes a speed prediction approach for road links based on the integration of LDA and GCN and validates the feasibility of this approach and the effectiveness of the functional vectors of road links through comparative experiments. The detailed flowchart is illustrated in Figure 2.

The proposed approach has five key steps (shown in Figure 2): trajectory extraction, map matching, semantic zone detection, semantic zone merging, and road link speed prediction, followed by comparison experiments. The first three steps are used to prepare the experimental data and to discover the semantic zones in the taxi travel network. The extracted results are feature vectors of the semantic zones. In order to verify the effectiveness of zoning information in road link speed prediction, we merged the composite semantic zones using modularity [47] in the fourth step and generated two types of features: single-valued zoning features and composite zoning features. In the fifth step, six baseline models were trained using the two types of features and the historical average road link speed, respectively. Finally, both types of features were added to the prediction process and their effectiveness was compared across the six baseline models.

Step 1. Extracting trajectories from taxi location data.
The trajectories in taxi raw data are composed of a collection of discrete points. We extracted the trajectories based on carrier states at the time of taxi positioning. The location where the carrier state changes from 0 to 1 was defined as the origin, and the location where the carrier state changes from 1 to 0 was defined as the destination (see Figure 3). Continuous location points between origin and destination were used as a trajectory. To avoid the impact of searching for passengers on road link speed calculations, we ignored the locations of taxis without passengers.
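The segmentation rule of Step 1 can be illustrated with a minimal Python sketch. It assumes that each GPS record is a dictionary with a hypothetical `state` key (1 = carrying a passenger, 0 = vacant) and that the records of one taxi are already sorted by time. Treating the first vacant point after an occupied run as the destination is an assumption of this sketch, not a detail stated above.

```python
from typing import Dict, List


def extract_trajectories(points: List[Dict]) -> List[List[Dict]]:
    """Split one taxi's time-ordered GPS points into occupied trips.

    Assumption: each point has a hypothetical "state" key,
    1 = carrying a passenger, 0 = vacant.
    """
    trips: List[List[Dict]] = []
    current: List[Dict] = []
    prev_state = 0
    for p in points:
        state = p["state"]
        if state == 1:
            # A 0 -> 1 change opens a trip (origin); later occupied
            # points extend the open trip.
            current.append(p)
        elif prev_state == 1 and current:
            # A 1 -> 0 change closes the trip (destination).
            current.append(p)
            trips.append(current)
            current = []
        # Vacant points outside a trip are ignored.
        prev_state = state
    # A trip still open at the end of the data is discarded.
    return trips


# Hypothetical sample: two occupied trips separated by vacant cruising.
states = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
sample = [{"idx": i, "state": s} for i, s in enumerate(states)]
trips = extract_trajectories(sample)
```

In this sample the function returns two trips covering the points with indices 2 to 5 and 7 to 9, and the vacant points between them are dropped.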

Step 2. Map matching.
In this paper, the ST-Matching algorithm [48] was used to match all trajectories to the road network. This algorithm takes into account the spatial geometry and topology of the road network as well as the time/velocity constraints of the trajectories. It can handle low-sample-rate localization data within 3 to 5 minutes with excellent matching accuracy and is suitable for the low-frequency data in this paper. The GPS geographic coordinates are converted to planar coordinates, and OpenStreetMap and PostGIS [49] were used to extract the source and target nodes of the road network during the matching process to search for the shortest path. Figure 4(a) shows the prematching trajectory points and Figure 4(b) shows the postmatching trajectory points. It can be seen that all the trajectory points have been correctly aligned to the road network.

Step 3. Semantic zoning based on LDA algorithm.
In order to obtain composite features for semantic zones, LDA algorithm was adopted in this paper. LDA is a semantic topic model proposed by Blei [17] and enables modeling of intertextual semantic topics based on text corpus. In the results obtained by LDA, a topic contains the probability distribution of each word. For each document, it can have multiple topics. When describing the distribution of document topics, a document can be represented as a composite vector of topics, or a topic with the highest probability is used as the topic of a document. For taxi trajectory topic discovery, the former can be used for subsequent machine learning tasks, and the latter can be used to visualize zoning results. The LDA algorithm is defined as follows.
LDA (as shown in Figure 5) assumes that the prior distributions of the document-topic and topic-word distributions are Dirichlet distributions. For any document $d$ and any topic $k$, LDA has the following definitions: $\theta_d$ is the probability distribution of the latent topics in the $d$th document. $\alpha$ is the hyperparameter of the distribution of $\theta_d$ and is a $K$-dimensional vector, where $K$ is the number of topics. $\beta_k$ is the probability distribution of the feature words of the $k$th topic. $\eta$ is the hyperparameter of the distribution of $\beta_k$ and is a $V$-dimensional vector, where $V$ represents the number of all words in the lexicon. $z_{d,n}$ is the topic of the $n$th word in document $d$, and $w_{d,n}$ is the $n$th word in the $d$th document.
The LDA model is generated as follows:
(1) For each candidate number of topics $K$, calculate the perplexity, the JS Divergence, and the joint index of perplexity and JS Divergence, and choose the best number of topics.
(2) For each topic $k$, draw $\beta_k \sim \mathrm{Dirichlet}(\eta)$.
(3) For each document $d$, draw $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.
(4) Then, for each word $n$ in document $d$, draw the topic of the $n$th word, $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$, and the $n$th word, $w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})$.
When training the LDA model, a common evaluation metric is perplexity (as shown in equation (2)) [17]. Smaller perplexity means that the model is a better predictor for new text. Also in this paper, the Jensen–Shannon Divergence (JS Divergence) [50], a method for calculating the similarity of topics, is used together with perplexity to determine the optimal $K$.
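Perplexity is defined below:
$$\mathrm{Perplexity}(D_{\mathrm{test}}) = \exp\left\{-\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right\}, \tag{2}$$
where $\sum_{d=1}^{M} N_d$ is the sum of all words in the test dataset, $D_{\mathrm{test}}$ is the test dataset with $M$ documents, $N_d$ represents the number of words in document $d$, and $\mathbf{w}_d$ is the words in document $d$.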
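Jensen–Shannon Divergence is defined as follows:
$$\sigma = \frac{1}{K}\sum_{i=1}^{K}\left[\mathrm{JS}\left(T_i \,\|\, \bar{T}\right)\right]^{2}, \tag{3}$$
where $K$ is the number of topics, $\mathrm{JS}$ is the JS Divergence of topics, $\sigma$ is the variance of topics, $T_i$ is the $i$th topic, and $\bar{T}$ is the mean of the topic-word probability distributions.
The final optimal number of topics is determined by the joint index $\mathrm{Perplexity\text{-}JS}$, which is calculated as follows:
$$\mathrm{Perplexity\text{-}JS}(D_{\mathrm{test}}) = \frac{\mathrm{Perplexity}(D_{\mathrm{test}})}{\sigma(D_{\mathrm{test}})}, \tag{4}$$
where $D_{\mathrm{test}}$ is the test dataset and $\sigma$ is the variance of topics from equation (3).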
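The two ingredients of the topic-number selection can be sketched in plain Python. The sketch uses the textbook forms of both quantities: perplexity as $\exp(-\sum_d \log p(\mathbf{w}_d) / \sum_d N_d)$ and the Jensen–Shannon Divergence as the average of two Kullback–Leibler divergences to the midpoint distribution. The function names and the helper that compares each topic with the mean topic-word distribution are illustrative assumptions, not the authors' code.

```python
import math
from typing import List, Sequence


def perplexity(doc_log_likelihoods: Sequence[float],
               doc_lengths: Sequence[int]) -> float:
    """Held-out perplexity: exp(-sum_d log p(w_d) / sum_d N_d)."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) with natural logarithm; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JS(p || q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_to_mean(topics: List[Sequence[float]]) -> List[float]:
    """JS Divergence of every topic-word distribution to their mean."""
    k = len(topics)
    mean = [sum(col) / k for col in zip(*topics)]
    return [js_divergence(t, mean) for t in topics]


# A 4-word document whose words each have probability 1/4 has perplexity 4.
ppl = perplexity([4 * math.log(0.25)], [4])
# Two topics with disjoint support are maximally different (ln 2).
js_max = js_divergence([1.0, 0.0], [0.0, 1.0])
spread = js_to_mean([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
```

Lower perplexity and larger divergence between topics both indicate a better topic number, which is why the two are combined into a single index.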
On the basis of the above definition, we created a topic model for taxi trajectories based on LDA algorithm. When building the model, we treated a trajectory as a “document” and each road link number in a trajectory as a “word.” All trajectories constitute a word corpus. Then, we used equation (4) to select the optimal number of topics and used the LDA model to extract the topics in trajectories. The obtained topics consist of several road links. Also, each link belongs to multiple traffic topics, which is similar to a document that can contain multiple topics.
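The “road-trajectory-topic zone” analogy can be made concrete with a small sketch that turns map-matched trajectories into a bag-of-words corpus. The link identifiers and the helper name `build_corpus` are hypothetical; the `(word id, count)` representation is the usual input format of topic-modeling libraries, and it is assumed here only for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_corpus(
    trajectories: List[List[str]],
) -> Tuple[Dict[str, int], List[List[Tuple[int, int]]]]:
    """Treat each trajectory as a document and each road link ID as a word.

    Returns the vocabulary (link ID -> integer id) and, per trajectory,
    a sorted list of (word id, count) pairs.
    """
    vocab: Dict[str, int] = {}
    corpus: List[List[Tuple[int, int]]] = []
    for links in trajectories:
        for link in links:
            # Assign ids in order of first appearance.
            vocab.setdefault(link, len(vocab))
        counts = Counter(vocab[link] for link in links)
        corpus.append(sorted(counts.items()))
    return vocab, corpus


# Hypothetical map-matched trajectories expressed as road link IDs.
trajectories = [["L1", "L2", "L2"], ["L2", "L3"]]
vocab, corpus = build_corpus(trajectories)
```

A link that is traversed by trajectories of several topics receives non-zero probability in each of them, which is how a road link ends up belonging to multiple traffic topics.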

Step 4. Merging semantic zones.
After specifying the number of topics, the probability distribution of each road link over the semantic zones is generated by the LDA algorithm. The topic for each road link is a vector of probabilities. This means that each road link will belong to multiple semantic zones, i.e., composite zones. Visualization methods in existing studies select the topic with the highest probability as the final semantic zone to which a road link belongs. In other words, composite zones are converted to single-valued zones. However, the resulting zones will be very fragmented. A semantic zone may consist of many small fragments that are scattered across the taxi travel network. Therefore, in order to reduce the dispersion at a specific number of zones and to perform comparative experiments between feature vectors of composite and single-valued semantic zones, we used modularity [47] to merge the small fragments and maximize the modularity of the road network subdivision. An idealized community division has the highest similarity between nodes within a community and the lowest similarity between nodes of different communities. Modularity is commonly used to measure the quality of community subdivision results of complex networks. The higher the quality of the community division, the greater the modularity $Q$.
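The modularity $Q$ is calculated as follows:
$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j), \tag{5}$$
where $A_{ij}$ is the element of the adjacency matrix for nodes $i$ and $j$, $k_i$ is the degree of node $i$, $m$ is the total number of edges, and $\delta(c_i, c_j)$ equals 1 when nodes $i$ and $j$ belong to the same community and 0 otherwise.
By using equation (5), we can convert the composite zones obtained by LDA into single-valued zones with larger modularity. In subsequent comparative experiments, we can simultaneously evaluate the effectiveness of composite and single-valued zones in the prediction of road link speed.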
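As an illustration of how modularity scores a zoning, the following sketch evaluates a common form of modularity for an undirected, unweighted graph, $Q = \sum_c \left[L_c/m - (d_c/2m)^2\right]$, where $L_c$ is the number of edges inside community $c$, $d_c$ is the total degree of its nodes, and $m$ is the number of edges. It is assumed here that this standard definition matches the one used in [47]; the toy graph and the names are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


def modularity(edges: List[Tuple[Hashable, Hashable]],
               community: Dict[Hashable, Hashable]) -> float:
    """Modularity of a partition of an undirected, unweighted graph.

    Each edge is listed once and self-loops are not expected.
    """
    m = len(edges)
    intra: Dict[Hashable, int] = {}   # edges inside each community
    degree: Dict[Hashable, int] = {}  # summed node degree per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_zones = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_zone = {n: "a" for n in range(6)}
q_two = modularity(edges, two_zones)
q_one = modularity(edges, one_zone)
```

Splitting the toy graph into its two triangles gives $Q = 5/14 \approx 0.357$, whereas putting every node into one zone gives $Q = 0$, so a merge is worthwhile only if it raises this score.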

Step 5. Building predicting model for road links.
The state of each road link is influenced by the upstream and downstream links. Therefore, incorporating the complex structure and historical state of the road network into the model helps improve the accuracy of speed prediction for road links and also makes it possible to predict multiple road links at once. GNNs are a class of methods for learning on non-Euclidean structures. GNNs extend the convolution theory from Euclidean data to non-Euclidean data to capture spatial dependencies. In order to address both spatial and temporal dependencies, Zhao [28] proposed the temporal graph convolutional network (T-GCN). The temporal dependencies are obtained by adding the GRU structure to the GCN model. In order to compare and analyze the effectiveness of single-valued zoning and composite zoning while maintaining the network topology, we chose to incorporate the results of LDA into T-GCN in the proposed approach. T-GCN is defined as follows:
$$u_t = \sigma\left(W_u\left[f(A, X_t), h_{t-1}\right] + b_u\right),$$
$$r_t = \sigma\left(W_r\left[f(A, X_t), h_{t-1}\right] + b_r\right),$$
$$c_t = \tanh\left(W_c\left[f(A, X_t), \left(r_t * h_{t-1}\right)\right] + b_c\right),$$
$$h_t = u_t * h_{t-1} + \left(1 - u_t\right) * c_t,$$
where $u_t$ and $r_t$ are the update gate and reset gate at time $t$. They are used to control how much state information from the previous period is forgotten. $c_t$ is the memory content stored at time $t$. $h_t$ is the output state at time $t$, and $h_{t-1}$ indicates the output at time $t-1$. $A$ represents the adjacency matrix of the road network. $X_t$ is the feature matrix of the road links at time $t$. $f(A, X_t)$ denotes the graph convolution, and $W$ and $b$ are the weights and biases learned during training.
The training process of the T-GCN model is as follows (as shown in Figure 6). Firstly, we calculate the ground truth average speed of each road link for a certain period based on map-matched trajectories. Then, we build the training and test datasets based on the historical data of each road link and extract its semantic zoning information from the taxi travel network. Next, we combine these features with the adjacency matrix of the road network for T-GCN training. Finally, the speed prediction for each road link is output and compared to the true value to optimize the model.
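How the adjacency matrix enters the model can be sketched with a single graph convolution in plain Python. The sketch assumes the widely used symmetric normalization $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ and one layer $\mathrm{ReLU}(\hat{A} X W)$; the gated recurrent part, the layer sizes, and the exact feature layout of the experiments are not reproduced, and all names and numbers are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """A_hat = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def graph_conv(adj: Matrix, x: Matrix, w: Matrix) -> Matrix:
    """One graph convolution layer: ReLU(A_hat X W)."""
    h = matmul(matmul(normalized_adjacency(adj), x), w)
    return [[max(0.0, v) for v in row] for row in h]


# Two connected road links with one feature each (hypothetical speeds).
adj = [[0.0, 1.0], [1.0, 0.0]]
x = [[2.0], [4.0]]
out = graph_conv(adj, x, [[1.0]])
```

Each link's output mixes its own feature with those of its neighbors, so both links in this toy graph receive the average value 3.0. In the full model, each row of the feature matrix could hold a link's historical speeds together with its semantic zoning vector.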

4. Experiments and Results

4.1. The Results of Semantic Zoning for Taxi Travel Network
4.1.1. Optimal Parameter Selection

The LDA model has the following parameters that should be set first: (1) the Dirichlet distribution parameter $\alpha$ for trajectory-traffic topic; (2) the Dirichlet distribution parameter $\eta$ for traffic topic-road link; and (3) the number of traffic topics $K$.

The $\alpha$ value affects the topic distribution for each trajectory, and the $\eta$ value affects the road link distribution for each traffic topic. The greater the two values, the more concentrated the distribution. In order to obtain optimal parameters, different values of $\alpha$ and $\eta$ were compared, and the combination that best differentiated the traffic topic model was selected.

Another important parameter is the number of topics $K$. If $K$ is too large, the topic division is very detailed, and the likelihood of similarity between topics will increase, while a lower value of $K$ may not be able to distinguish topics well. Therefore, tests are needed to determine the number of topics $K$. With $\alpha$ and $\eta$ fixed at the values determined above, we conducted tests by taking the number of topics from 2 to 100 at an interval of 1. Perplexity (equation (2)), Jensen–Shannon Divergence (equation (3)), and the joint index of perplexity and Jensen–Shannon Divergence (equation (4)) were calculated. The results are shown in Figures 7(a)–7(c).

In Figure 7(a), the perplexity value of LDA decreases as the topic number $K$ increases, and beyond a certain $K$ the decrease becomes slow. In Figure 7(b), as $K$ increases, the value of Jensen–Shannon Divergence starts to increase slowly between 15 and 20 and gradually flattens out. In Figure 7(c), the trend is consistent with Figure 7(a): the joint index also declines slowly beyond a certain $K$. In order to effectively characterize the traffic flow clustering pattern of travel network topics in the study area, $K = 18$ was chosen as the number of topics for the LDA algorithm in the experiment.

4.1.2. The Semantic Zones of Taxi Travel Network

After semantic zoning and semantic zone merging based on modularity, stable semantic zones of the travel network over the 15 days were generated using the constructed trajectory dataset, where the LDA was modeled using the Gensim [51] library. The resulting semantic zones were visualized using ArcGIS [52] software. We classified zoning information into two categories, namely, composite semantic zones generated by the LDA algorithm and single-valued semantic zones merged from LDA results based on modularity. In the former, each road link carries the probabilities of belonging to each of the 18 semantic zones. In the latter, each road link belongs only to the semantic zone with maximum probability.

Figure 8 shows the map of composite semantic zones. We used different colors to represent zones. Line widths were set by the probability of a road link belonging to one of the 18 zones. As can be seen from the figure, the distribution of semantic zones is clear across map, but the number of semantic zones varies from district to district (as shown in Table 1). Nanshan, Futian, and Luohu districts have the highest number of topics, and only Longhua district matches the semantic zone nicely. Additionally, we found that traffic topic zones are correlated with land use and are prone to form semantic zones near train stations, airports, residential areas, and commercial areas. On arterial roads, it is also easy to form semantic zones, such as Beihuan Road and Binhai Road. Since the probability that a road belongs to each of the 18 zones is difficult to visualize, some zones have nested road links that belong to other zones (as shown in the upper right corner of Figure 8).

Figure 9 shows the map of single-valued semantic zones. We classified each road link to one semantic zone with maximum probability and used the same color style as Figure 8. As the map depicts, all semantic zones are rendered clearly and overlap disappears. The nested road links between different semantic zones are reduced.

4.2. The Comparison of Prediction Models
4.2.1. Data Preparation and Experimental Setup

(1) Data Preparation. There are 44,609 links in the Shenzhen road network, which has a complex network structure. The data were collected from May 1 to May 15, 2015. Due to the short period of data collection, some of the road links lack taxi trajectory data. As shown in Figure 9, the small amount of data leaves many road links that are not explicitly assigned to a topic. Therefore, in order to compare the effectiveness of single-valued semantic features and composite semantic features in the speed prediction of road links and to examine the learning ability of T-GCN on the spatiotemporal dependencies of road links, an experimental area was selected as an example in this paper. The experimental area contains multiple semantic zones, with both composite zoning and single-valued zoning information, which suits the research objectives of this paper. Also, the amount of data in the experimental area is sufficient for the prediction model to fit well. After data processing, the experimental area was selected from the composite zoning map and single-valued zoning map extracted by LDA (as shown in the upper right of Figure 10). There were 766 road links in the experimental area. Road link speeds between 7:00 and 23:00 were selected. We adopted 15 minutes as the time interval; the speeds of the previous four periods were used as historical speed features, and the speed of the next period was used as the prediction target. In the comparison experiments, semantic zoning information was also added to the features. When splitting the dataset, 80% of the data were used as the training set and 20% as the test set.
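The sample construction described above can be sketched as a sliding window over one link's 15-minute speed series. The sketch assumes that the 80/20 split is taken in time order, which the text does not specify, and the function names and placeholder values are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], List[float]]


def make_windows(series: Sequence[float],
                 n_hist: int = 4, n_pred: int = 1) -> List[Sample]:
    """Pair each run of n_hist past speeds with the next n_pred speeds."""
    samples: List[Sample] = []
    for t in range(len(series) - n_hist - n_pred + 1):
        history = list(series[t:t + n_hist])
        target = list(series[t + n_hist:t + n_hist + n_pred])
        samples.append((history, target))
    return samples


def add_zone_features(history: List[float],
                      zone_vector: Sequence[float]) -> List[float]:
    """Append a link's semantic zoning vector to its speed history."""
    return history + list(zone_vector)


def split_in_time_order(samples: List[Sample], train_ratio: float = 0.8):
    """Assumption: the first 80% of samples train, the rest test."""
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]


# 7:00-23:00 at 15-minute intervals gives 64 periods per day.
one_day = [float(v) for v in range(64)]  # placeholder speeds
samples = make_windows(one_day)
train, test = split_in_time_order(samples)
features = add_zone_features(samples[0][0], [0.7, 0.2, 0.1])
```

One day of data for one link therefore yields 60 samples. A single-valued zone would contribute one extra feature, whereas a composite zone contributes the full probability vector.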

(2) Algorithm Selection and Parameter Setting. In this study, Support Vector Regression (SVR), Random Forest Regression (RFR), Gradient Boosting Regression (GBDT), XGBoost, Decision Tree Regression (DTR), and T-GCN were selected for comparative analysis of the impact of semantic zones on road link speed prediction. The first five baseline algorithms are from the Scikit-learn package, which is a machine learning library for Python. We used GridSearchCV in Scikit-learn to automatically find the optimal parameters for these five algorithms. The parameters of these five machine learning algorithms are shown in Table 2. The hyperparameters of the T-GCN model mainly include batch size, learning rate, training epochs, and number of hidden units. Based on the experience of Zhao [28] and after repeated tests, the learning rate was set to 0.01, the batch size was set to 32, and the number of hidden units was chosen from [8, 16, 32, 64, 100]. We found that the T-GCN model has the highest prediction accuracy when the number of hidden units is 64. In this experiment, the Adam optimizer was used to minimize the loss, and the training loss remained stable when the number of training epochs reached 100.

(3) Evaluation Metrics. In order to accurately evaluate the performance of the prediction models for road link speeds, the following metrics were used in this study: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Coefficient of Determination ($R^2$), and Accuracy (Acc). RMSE and MAE were used to calculate the error of the model; the smaller the value, the better the model. $R^2$ was used to test the model’s predictive ability on the test dataset; the larger the value, the better the model.
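The four metrics can be written compactly as follows. MAE, RMSE, and $R^2$ follow their textbook definitions. The accuracy formula, $1 - \lVert Y - \hat{Y} \rVert / \lVert Y \rVert$, is an assumption borrowed from common practice in the T-GCN literature, since its definition is not given above.

```python
import math
from typing import Sequence


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean Absolute Error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Root Mean Square Error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def r2(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def accuracy(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Assumed definition: 1 - ||y - y_hat|| / ||y|| (Euclidean norms)."""
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)))
    return 1 - err / math.sqrt(sum(a ** 2 for a in y))


# Hypothetical observed and predicted link speeds (km/h).
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
scores = {
    "MAE": mae(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "R2": r2(y_true, y_pred),
    "Acc": accuracy(y_true, y_pred),
}
```

MAE and RMSE are in the same unit as the speeds, while $R^2$ and Acc are unitless and approach 1 for a perfect prediction.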

(4) Results. The results were compared and analyzed in three scenarios. In the first, the models were trained and evaluated on raw data only; in the second, single-valued zoning information was added; and in the third, composite zoning information was added. The accuracy of each model is shown in Table 3.

From Table 3, we can see that the RMSE of T-GCN decreases gradually as semantic information is added (from single-valued zoning to composite zoning), while the RMSE of the other five machine learning models increases when composite zoning is used instead. For example, with composite zoning the RMSE of the T-GCN model is about 52.03% smaller than that of the SVR model. The trend for R2 is the same as for RMSE: the ability of the T-GCN predictions to represent the actual data increases gradually as semantic zoning information is added, while the other five models perform poorly under composite zoning. For example, with composite zoning information the R2 of T-GCN is about 8.94% higher than that of SVR. On the raw data, the machine learning algorithms achieved an average accuracy of 83.68%, while the accuracy of T-GCN was only 73.63%. This may be due to the small amount of available data; T-GCN needs more data to improve its ability to learn spatial dependencies. After semantic zoning information was added, the accuracy of the machine learning algorithms improved by an average of 8.76%. However, their performance differed between composite and single-valued zones, and the accuracy of each machine learning model deteriorated under composite zones. T-GCN improved with both single-valued and composite zoning information, with a 6.46% improvement under single-valued zones over raw data only and a further 1.88% improvement under composite zones over single-valued (nonoverlapping) zoning information.

5. Discussion

(1) LDA gives, for each road link, a probability distribution over semantic zones. A common practice in visualization is to select the topic with the highest probability as the final semantic zone of a road link. In this experiment, we extracted single-valued semantic zoning information in this way and used modularity to merge the scattered, nested road links that belong to other zones. The resulting map is tidier and clearer (shown in Figure 9). However, the prediction results show that T-GCN is more accurate with composite zoning information than with single-valued zones. In addition, LDA is a community discovery algorithm that uses semantic information, which differs from traditional division methods that rely on topological relationships to obtain single-valued communities, such as GN [53], FN [54], and FUA [55]. The traffic flow of a road link is influenced by dynamic changes at previous times and on surrounding road links, and a road link may fall in different semantic zones at different times. The probability that a road link belongs to more than one semantic zone in LDA depicts this spatial–temporal ambiguity. Therefore, composite zones are closer to the actual spatial and temporal characteristics of the transportation network. Ding [56] found that communities identified using topic analysis are more interrelated than communities detected by topological methods, so composite zones, which better characterize the structure of the taxi travel network, are useful for predicting the average speed of road links.

(2) Data augmentation in traffic prediction can be classified into two types. One is to bring in external information, for example, adding weather and holidays to prediction models. The other is to build features from road network topology and historical time series data, or to augment data using neural networks [57]. This paper adopted the second approach. We used LDA to extract spatial–temporal semantic information from trajectory data and concatenated it with historical average road link speeds to address the problem of a single data source in prediction models. Experiments on predicting average road link speeds were performed with the proposed approach and six baseline models (RFR, DTR, SVR, GBDT, XGBoost, and T-GCN). The experimental results showed that the performance improvement from data augmentation varied considerably across models. In the experiment, the improvement in prediction accuracy after adding semantic zoning information is clearly better for the T-GCN model than for the traditional machine learning methods. However, single-valued and composite semantic zoning information have different effects when added to machine learning algorithms: adding composite zoning information makes them perform worse. A likely reason is that a road link does not belong to every semantic zone, so its probability for some zones is zero, and the large number of zero values affects model fitting. In addition, although the accuracy of the machine learning algorithms is higher than that of T-GCN, traditional machine learning algorithms such as SVR can predict only one road link at a time, which is inefficient in practical applications, whereas T-GCN predicts the average speed of all road links at once. Moreover, as research on GCNs deepens, combining semantic information with the spatial and temporal relationships learned automatically by GCNs may further improve prediction performance. It should be noted that the objective of this paper is to verify the validity of semantic zoning features in predicting the average speed of road links; data such as weather is difficult to obtain and was therefore not considered.
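The difference between the two kinds of zoning feature can be sketched as follows. This is an illustration under stated assumptions, not the authors' feature pipeline: the function names are hypothetical, the topic probabilities and speeds are made up, the single-valued zone is encoded as a plain label taken from the most probable topic, and the modularity-based merging step is omitted.

```python
def single_valued_zone(topic_probs):
    """Index of the most probable topic (single-valued zoning, no merging)."""
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])


def build_sample(hist_speeds, topic_probs, mode):
    """Concatenate historical speeds with the chosen zoning feature."""
    if mode == "composite":
        zone = list(topic_probs)  # keep the full topic distribution
    elif mode == "single":
        zone = [float(single_valued_zone(topic_probs))]
    else:
        raise ValueError("mode must be 'composite' or 'single'")
    return list(hist_speeds) + zone


def zero_fraction(rows):
    """Share of exactly-zero entries across a list of feature rows."""
    values = [v for row in rows for v in row]
    return sum(1 for v in values if v == 0.0) / len(values)


# Made-up example: one road link, five topics, four historical speeds (km/h).
probs = [0.0, 0.7, 0.0, 0.3, 0.0]
hist = [35.0, 36.5, 34.0, 33.5]

print(build_sample(hist, probs, "single"))
print(build_sample(hist, probs, "composite"))
print(zero_fraction([probs]))
```

In this toy case three of the five composite entries are zero, which illustrates the sparsity that may make fitting harder for the per-link machine learning baselines.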

6. Conclusion and Future Work

This paper proposed a method for predicting average road link speed that integrates the semantic zones of the taxi travel network extracted by LDA with the spatial–temporal dependencies learned by T-GCN. First, the taxi location data was preprocessed, and the datasets for subsequent tasks were built after anomaly filtering, trajectory segmentation, and map matching. Next, we converted the trajectories to sequences of road numbers and extracted the semantic zones using the LDA algorithm. To test the validity of the semantic zoning features, we merged the composite semantic zones obtained from LDA into single-valued zones using modularity, a measure from social network community detection. Finally, we compared the proposed approach with six baseline models. The main findings of this study are summarized below:

(1) Semantic zones of the taxi travel network do exist within a certain period of time. These zones can describe the spatiotemporal dynamic characteristics of the road network.

(2) LDA can be used to quantify the dynamic characteristics of the road network, and the result can be integrated with the historical state of the network, which helps to improve the accuracy of speed prediction for road links.

(3) Compared with traditional machine learning models, the T-GCN model makes better use of semantic zoning information, because it learns the spatiotemporal dependencies of the travel network and integrates the semantic zones at the same time.

In future work, we plan to study end-to-end algorithms that draw on techniques such as network representation learning and GCNs to reduce the complexity of predicting average road link speed.

Data Availability

As the data forms part of an ongoing study, the raw data needed to reproduce these findings cannot be shared at this time.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors gratefully acknowledge the support of this research from the National Natural Science Foundation of China (41701167) and the Basic Research Project of Shenzhen City (JCYJ20190812171419161).