A Higher-Order Motif-Based Spatiotemporal Graph Imputation Approach for Transportation Networks
Because traffic data collectors cover road networks only partially and fail from time to time, traffic data usually contains missing values. Accurate imputation is critical to the operation of transportation networks. Existing approaches usually focus on temporal variation and on the spatial representation of adjacent segments, whereas higher-order spatial correlations and continuous data missing are attracting growing attention from academia and industry. In this paper, by leveraging motif-based graph aggregation, we propose a spatiotemporal imputation approach to address the problem of missing traffic data. First, through motif discovery, a higher-order graph aggregation model is built for traffic networks; it uses a graph convolution network (GCN) to aggregate the attributes of the segments correlated with the segments that have missing data. Then, a multitime dimension imputation model based on bidirectional long short-term memory (Bi-LSTM) incorporates the recent, daily-periodic, and weekly-periodic dependencies of the historical data. Finally, the spatial aggregated values and the temporal fusion values are integrated to obtain the results. We conduct comprehensive experiments on a real-world dataset, covering random and continuous data missing at different time intervals, and the results show that the proposed approach is feasible and accurate.
With the rapid growth of urbanization, intelligent transportation systems (ITS) are widely adopted for urban management and traffic control [1, 2]. ITS rely on the availability of traffic data to evaluate traffic status and system performance. When traffic information is collected comprehensively, urban commuters can adapt to the traffic conditions of urban roads and grasp the patterns of traffic flow changes, which in turn promotes the development of urban transportation.
In recent years, emerging information technologies, such as fifth-generation networks and edge computing, have made traffic data collection more convenient, and the collected data is usually mobile, multisource, and real time. Unfortunately, because various failures (e.g., power malfunction, device maintenance, and network issues) occur frequently, the collected data is often incomplete. Moreover, because construction and maintenance are costly, the equipment can hardly cover the entire traffic network. The loss of traffic data during collection is therefore inevitable.
Missing information significantly weakens data quality, limits the study of transportation networks (e.g., traffic management, urban planning, and route choice), and, worse, may lead to wrong decisions [6–8]. Handling missing values is thus a prerequisite for traffic data mining and analysis, and accurate imputation has become an important research topic in ITS.
The key to data imputation is to discover the hidden spatiotemporal information in the neighbouring data. For instance, as shown in Figure 1, using the spatiotemporal neighbouring values of the missing data improves the accuracy of data imputation. Bae et al. proposed two cokriging methods that exploit the spatiotemporal dependency in traffic data to impute high-resolution traffic speed under different random data missing scenarios. Li et al. developed a combined deep neural model that extracts spatiotemporal features to estimate missing values. To characterize the hidden patterns in spatiotemporal traffic data, Chen et al. incorporated the truncated nuclear norm (TNN) into a low-rank tensor completion (LRTC) framework and obtained a better solution for data imputation. For the case of continuous data missing, Zhang et al. utilized the temporal neighbouring values of a given period and employed a long short-term memory network (LSTM) to recover missing data.
Although many methods achieve promising imputation accuracy, some limitations remain. The first is how to capture spatial dependencies effectively. Existing methods usually consider the directly adjacent road segments (i.e., upstream and downstream) but ignore global information, so it is worthwhile to introduce imputation models that sense both the global and the local variations of spatial information. The second is continuous data missing, in which the values at several consecutive timestamps of a road segment are missing. In this circumstance, no data is available for a period of time, so the model cannot be given stable inputs.
To address these issues, we propose a spatiotemporal imputation approach for traffic data via motif-based graph aggregation (named MGIA), which combines motif-based spatial aggregation with multitime dimension fusion by bidirectional LSTM (Bi-LSTM). To the best of our knowledge, our work is the first attempt to apply a motif-based spatial method to traffic data imputation. The contributions of this paper are summarized as follows:
(1) We propose a higher-order graph aggregation model based on motifs. It aggregates the attributes of the segments correlated with the missing-data segments to capture the higher-order spatial correlations in a road network, using motif-based search followed by aggregation based on a graph convolution network (GCN).
(2) We develop a Bi-LSTM approach based on multitime dimension fusion to improve the accuracy in the case of continuous data missing. It incorporates the recent, daily-periodic, and weekly-periodic dependencies to ensure that there are enough historical data to complete the temporal imputation.
(3) We perform extensive experiments on a real-world dataset to evaluate the performance of our approach. The experimental results confirm the advantages of the proposed approach over state-of-the-art imputation approaches under various missing patterns.
The remainder of the paper is organised as follows: Section 2 introduces the related work in traffic data imputation. Section 3 describes the proposed approach. Section 4 shows the experimental results and analysis. Section 5 concludes this paper.
2. Related Work
In this section, we introduce the related work regarding the approaches for conventional and deep learning-based imputation.
2.1. Conventional Imputation
Over the past decades, traffic data imputation has attracted widespread attention. The imputation methods mainly include prediction, interpolation, and statistical methods.
The predictive methods use historical data to predict the missing values. Typical methods include Bayesian networks and support vector regression. Ahn et al. then developed a traffic flow prediction method based on a Bayesian classifier and support vector regression, which further improved the imputation accuracy of missing values. The hybrid approach based on fuzzy c-means (FCM) is another example of such predictive methods; it integrates optimized FCM parameters and genetic algorithms to build prediction models [19, 20]. These methods focus on historical traffic data for filling missing values and fail to consider the imputation of continuously missing data. Moreover, they ignore the spatial relationships, which also provide crucial information for imputation.
Interpolation methods use the average value of the neighbouring data or historical data to impute the missing values. Using traffic data from the same sensor during the same period on neighbouring days, Yin et al. took the average of the known data to impute the missing values. Chang et al. utilized k-nearest neighbours (KNN) and local least squares to exploit the relationship between similar traffic flow patterns and enhanced the interpolation effect. Kriging interpolation [12, 23] focuses on determining the weights of historical values and considers spatiotemporal information to capture the characteristics of traffic data. Although these methods can achieve promising imputation results in a short time, they rely on average calculation and do not consider complicated changes caused by other factors such as global data attributes and random events.
The statistical methods aim to find the data distribution that best fits the data to be imputed. To reflect the uncertainty in the imputation parameters, Audigier et al. designed a multiple imputation method based on Bayesian principal component analysis (BPCA) to cope with incomplete continuous data. To exploit the spatiotemporal correlation of traffic network data, Wang et al. developed a low-rank matrix factorization-based approach to reconstruct the missing traffic data. Moreover, some researchers have extended the two-dimensional matrix to high-dimensional tensors, as in the tensor decomposition models [26, 27]. However, the accuracy of these approaches relies mainly on a prior assumption about the data distribution, and a mismatch with the unknown actual distribution may cause errors.
2.2. Deep Learning-Based Imputation
Recently, the boom in deep learning [28, 29] has inspired new ideas for data imputation. Autoencoder models were developed to train hierarchically on the full set of traffic data and to extract the spatiotemporal features of the hidden layers, demonstrating their effectiveness for data imputation. Pathak et al. converted spatiotemporal trajectory data into images, used the powerful feature extraction of the convolutional neural network (CNN), and combined it with the autoencoder model to impute the missing spatiotemporal trajectory data. The generative adversarial network (GAN) provides a class of generative models trained adversarially; it uses actual data/parallel data to learn the true data distribution, so that the imputation quality is improved [33, 34]. By incorporating the reversibility of the generative imputer into GAN, Kazemi and Meidani proposed an iterative GAN architecture to evaluate the imputation of missing traffic data. These state-of-the-art models work well when dealing with the data correlation across different road segments, because they adopt the powerful feature extraction capabilities of deep neural networks to impute spatiotemporal data. However, most of them ignore the spatial dependencies in the traffic network, and their imputation quality depends on a massive amount of training data.
Because graph-based methods consider the strong relations within data structures [36, 37], they can capture global information to improve imputation performance. Chen and He proposed a heterogeneous graph embedding framework, which constructed a travel heterogeneous information network to find the best-matched vehicles for the missing records. By incorporating the spectral graph convolution operation, Cui et al. developed the graph Markov network to handle missing values for short-term traffic forecasting. Graph representation learning is one crucial category of deep learning that has been widely used for traffic data imputation. Such methods view the observations and features as data nodes in a bipartite graph, or construct a sample self-representation strategy that further requires the neighbouring missing samples. Motifs are small connected components in a graph and help in understanding higher-order relations and global spatial graph principles [41, 42]. Some imputation approaches based on motif discovery have been introduced [43, 44], but they are rarely adopted in transportation. The graph-based methods solve imputation problems by capturing global spatial information from historical and neighbouring data, but they ignore the time series results when reflecting the spatial dependencies.
Inspired by the above viewpoints, we propose a spatiotemporal imputation approach based on motif-based graph aggregation. By adopting motifs, the proposed approach captures the higher-order spatial correlations in traffic networks. In addition, it focuses on imputation in the case of continuous data missing. Finally, we list the limitations of the existing methods in Table 1.
According to the analysis of related work, existing methods have the following limitations: (1) they ignore the spatial dependencies over large-scale areas and (2) they rarely consider imputation in the case of continuous data missing. MGIA aims to improve the imputation accuracy with respect to continuous data missing and the higher-order spatial correlations in traffic networks. The framework of MGIA is shown in Figure 2. It works in the following steps: (1) We adopt motifs to define the graph-based structure of traffic networks and search for all associated road segments that meet the motif gain condition of the missing-data segment. On this basis, GCN gradually aggregates the nonmissing features of each associated segment to determine the spatial aggregated value of the missing-data segment. (2) The multitime dimension imputation based on Bi-LSTM deals with continuous data missing by incorporating the recent, daily-periodic, and weekly-periodic dependencies of the historical data. (3) The spatial aggregated values and the multitime dimension fusion values are integrated to obtain the imputation results.
3.1. The Spatial Imputation of Motif-Based Graph Aggregation
In order to capture the data correlation and global spatial characteristics between the road segments in traffic networks, the motif-based graph aggregation is employed to impute the missing data of road segments.
3.1.1. The Associated Node Search Based on Motif Discovery
Motifs are nonisomorphic connected graph structures that occur frequently in a network and contain three or more nodes; triangle and quadrangle motifs are shown in Figure 3. Because motifs capture higher-order correlations of data structures, they make it feasible to capture global information and thereby improve the performance of node feature aggregation. Triangles are traffic network motifs that play important roles in higher-order connectivity. Thus, in combination with traffic theory, we select the triangle motif M32 as the research object, in which nodes denote road segments and edges denote the connection between two adjacent segments.
We design the method of motif-based search, which uses the motif gain to adjust the fitness function. If the motif gain no longer increases or no neighbouring node exists, then the search phase is stopped.
First, the set of road segments in the traffic network is defined. The road segment with missing data is taken as the missing-data node, and a target node set is determined for it. Second, the motif-based local optimization algorithm searches all adjacent nodes of the missing-data node to form the adjacent node set and calculates the motif gain obtained by adding a node from the adjacent node set to the target node set. If the motif gain is greater than zero (i.e., there is a spatial correlation between the node and its adjacent nodes), the node is added to the target node set, and the set is updated. Third, the remaining adjacent nodes are added one at a time, the motif gain is calculated in turn, the nodes that meet the condition join the target node set, and the set is updated again. By analogy, eligible nodes continue to join the target node set. When the motif gain has been calculated for all nodes in the adjacent node set, the associated node search for the missing-data node ends. The search is shown in Figure 4.
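The search procedure can be sketched as a greedy expansion from the missing-data node. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the road network is an undirected adjacency dictionary, and `triangle_gain` is a stand-in that counts triangle closures rather than the paper's motif gain of Equation (3).

```python
def triangle_gain(adj, current, candidate):
    """Stand-in gain (assumption, not Equation (3)): number of triangle
    closures between `candidate` and nodes already in `current`."""
    count = 0
    for u in adj[candidate] & current:
        # every common neighbour of candidate and u closes a triangle
        count += len(adj[candidate] & adj[u])
    return count


def associated_nodes(adj, missing_node, gain=triangle_gain):
    """Grow the target node set from `missing_node`; a neighbour joins
    only if its gain is greater than zero. Stops when no neighbour of
    the current set has positive gain or no neighbour exists."""
    target = {missing_node}
    changed = True
    while changed:
        changed = False
        # adjacent node set: neighbours of the target set not yet inside it
        frontier = set().union(*(adj[n] for n in target)) - target
        for cand in sorted(frontier):
            if gain(adj, target, cand) > 0:
                target.add(cand)
                changed = True
    return target


# Toy network: segments 0, 1, 2 form a triangle; 3 and 4 hang off as a path.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(associated_nodes(adj, 0))
```

Passing the gain as a parameter keeps the loop independent of the exact gain definition, so the paper's motif gain could be substituted without changing the search itself.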
The local motif rate is used to avoid counting motifs repeatedly. When an adjacent node is joined, the local motif rate is calculated from the following quantities: the current node set, the joined adjacent node, the number of local motifs, the number of motifs between the current node set and the external node set, the number of motifs between the current node set and the new nodes, and the number of motifs between the external node set and the new nodes.
The motif gain of a node, given by Equation (3), is the gain of the local motif rate when the node is joined.
The gain indicates that the joined adjacent node is spatially related to the missing data node . As shown in Figure 5, we provide two examples of calculating the motif gain . and are equal to 4 and 3, respectively, and is . In Figure 5(a), and are equal to 1 and 2, respectively, and is from Equation (3), the default control parameter is 1,and ; thus, the node cannot be joined to the current node set . In Figure 5(b), and are equal to 1, and is , ; this indicates that there is a spatial correlation between the missing data node and the node.
3.1.2. The Spatial Aggregated Method Based on GCN
Once all nodes associated with the missing-data node have been determined by Equation (3), the data of the associated nodes are aggregated by GCN to estimate the values of the missing-data node. A road network is described by a graph $G = (V, E, A)$ with $N$ nodes, where the nodes $V$ denote road segments, the edges $E$ denote the directed connections from one node to another, and $A$ denotes the weighted adjacency matrix. The graph is represented by its corresponding Laplacian matrix, and the properties of the graph structure can be obtained by analyzing the Laplacian matrix and its eigenvalues:
$$L = I_N - D^{-1/2} A D^{-1/2}, \tag{4}$$
$$g_\theta \star x = U g_\theta U^{T} x, \tag{5}$$
where $L$ is the normalized form of the Laplacian matrix; $D$, $A$, and $I_N$ are the degree matrix, adjacency matrix, and unit matrix, respectively; $g_\theta$ and $U$ are the filter function and the eigenvector matrix of $L$, respectively; and $x$ is the feature value of the input node.
According to Equations (4) and (5), the eigenvalue decomposition of $L$ is represented as follows:
$$L = U \Lambda U^{T}, \tag{6}$$
where $\Lambda$ is the diagonal matrix composed of the eigenvalues of $L$.
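To reduce the time complexity when the graph is large, Chebyshev polynomials are adopted to approximate the solution:
$$g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{L})\, x, \tag{7}$$
$$T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x), \tag{8}$$
where $T_k$ and $\theta'_k$ are the Chebyshev polynomials and coefficients, respectively; $\tilde{L} = 2L/\lambda_{\max} - I_N$; $\lambda_{\max}$ represents the maximum eigenvalue of $L$; $T_0(x) = 1$; and $T_1(x) = x$.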
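The Chebyshev approximation can be evaluated with the three-term recurrence, without any eigendecomposition of the filter itself. The sketch below is illustrative only: the function names are hypothetical, and it assumes a symmetric (undirected) adjacency matrix with no isolated nodes so that the normalized Laplacian is symmetric.

```python
import numpy as np


def normalized_laplacian(A):
    """I - D^{-1/2} A D^{-1/2}; assumes every node has degree > 0."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - (d_inv_sqrt[:, None] * A) * d_inv_sqrt[None, :]


def chebyshev_filter(L, x, theta):
    """Return sum_k theta[k] * T_k(L_tilde) @ x, where
    L_tilde = 2 L / lambda_max - I and T_k follows the recurrence
    T_k = 2 L_tilde T_{k-1} - T_{k-2}."""
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(L.shape[0])
    t_prev, t_curr = x, L_tilde @ x            # T_0 x and T_1 x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + theta[k] * t_curr
    return out


# Three road segments in a row: 0 - 1 - 2.
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
L = normalized_laplacian(A)
x = np.array([1.0, 2.0, 3.0])
print(chebyshev_filter(L, x, [0.5, 0.5]))
```

Each additional polynomial order costs one more sparse matrix-vector product, which is why truncating the order keeps the aggregation cheap on large graphs.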
As Equations (7) and (8) show, the approximation with Chebyshev polynomials is equivalent to using a convolution kernel centred on each node of the graph to extract the feature values of its neighbouring nodes. To simplify the calculation, $K$ is limited to 1 and the eigenvalues of $L$ are scaled so that $\lambda_{\max} \approx 2$; Equation (7) is then expressed as follows:
$$g_{\theta'} \star x \approx \theta'_0 x + \theta'_1 (L - I_N)\, x. \tag{9}$$
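According to Equation (4), let $\theta = \theta'_0 = -\theta'_1$ at the same time, and Equation (9) is transformed as follows:
$$g_{\theta} \star x \approx \theta \left(I_N + D^{-1/2} A D^{-1/2}\right) x. \tag{10}$$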
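To avoid numerical and gradient instability problems, let $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and Equation (10) is transformed as follows:
$$g_{\theta} \star x \approx \theta\, \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} x. \tag{11}$$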
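A first-order aggregation step of this kind can be sketched in NumPy. The sketch assumes the standard renormalized form (self-loops added to the adjacency matrix, then symmetric normalization by the resulting degrees); the function name, the weight matrix `W`, and the choice of `tanh` as activation are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def gcn_aggregate(A, X, W, activation=np.tanh):
    """One aggregation step: activation(D~^{-1/2} (A + I) D~^{-1/2} X W).

    A: (N, N) adjacency of the associated nodes
    X: (N, F) node features (e.g., traffic index values)
    W: (F, F_out) trainable parameters
    """
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = (d_inv_sqrt[:, None] * A_tilde) * d_inv_sqrt[None, :]
    return activation(A_hat @ X @ W)


# Three mutually connected segments with one feature each.
A = np.ones((3, 3)) - np.eye(3)
X = np.array([[1.0], [2.0], [3.0]])
W = np.array([[1.0]])
print(gcn_aggregate(A, X, W, activation=lambda z: z))
```

With an identity activation and unit weight, every node of a fully connected triangle receives the mean of the three features, which makes the smoothing effect of the normalized adjacency easy to see.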
All associated nodes that meet the motif gain condition are extracted, and the final aggregated value of the missing-data node, i.e., the spatial aggregated value in Figure 2, is obtained by applying Equation (11) over these nodes. The aggregation depends on the number of associated nodes, the parameters to be trained, and an activation function. The initial value is the feature value of the first associated node that meets the motif gain condition.
3.1.3. The Computational Complexity
We now discuss the time complexity of the proposed approach. In the motif-based node search, the missing-data node must search the other nodes, and each node in the target node set must in turn search every node that is not yet in the set; these searches determine the cost of this step. In the GCN-based aggregation, the Chebyshev polynomials reduce the time complexity, and limiting the polynomial order to 1 simplifies it further. Because the two processes run sequentially, the overall computational complexity is the sum of their costs.
The process of motif-based graph aggregation is depicted in Algorithm 1.
3.2. The Multiple Dimension Imputation Based on Bi-LSTM
Existing temporal imputation methods obtain good estimates of missing data in the case of random data missing. When many values are missing at consecutive timestamps, however, the imputation performance degrades. Recovering continuous missing data is therefore a more challenging task.
Processing long-term time-series data is essential because a large amount of data is continuously missing. Deep learning, which learns complex feature representations directly from the input data, can perform well in such dynamic and challenging contexts. LSTM and the gated recurrent unit (GRU), two key deep learning methods, have been employed for time-series applications with temporal dependencies. LSTM has three gates: the input gate and forget gate control the update of the memory cell, and the output gate passes on the output information. LSTM has recently been employed for missing data imputation, as in Refs. [10, 13, 15]. GRU is a variant of LSTM that comprises only an update gate and a reset gate and uses the reset gate to control the information from the previous point. Because GRU relies on the previous point in training, its improvement may be limited in the case of continuous data missing. Besides, it has been verified that LSTM performs better than GRU on large datasets. Therefore, to capture the temporal dependencies of traffic time series data efficiently, LSTM is employed in our proposed approach.
Missing data points are closely related to their adjacent points in both temporal directions, and Bi-LSTM is capable of training in both the forward and the backward direction. Meanwhile, widening the window across different time granularities can improve prediction performance by exploiting the temporal dependencies in the time series data. Thus, we adopt time-series imputation based on Bi-LSTM, which incorporates the recent, daily-periodic, and weekly-periodic dependencies of the historical data. In other words, among the adjacent time-series data of the current missing values in the three dimensions, normal data exist in at least one dimension, which ensures that there are enough historical data to complete the temporal imputation.
Assume that a road segment has missing data in a given time period. The adjacent data of that period are defined separately for the recent, daily, and weekly periods. For instance, when the missing data covers 7 a.m. to 7:10 a.m. on 19 September 2018, some adjacent time series data in the three dimensions are shown in Figure 6.
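One way to index the three time dimensions is shown below. This is a sketch under assumptions, not the paper's exact definition: slots are numbered consecutively within the retained daily hours, the window looks both backward and forward (to suit a bidirectional model), and the value of 90 slots per day assumes 10 min slots starting from 6:00 up to and including 20:50. The function names are hypothetical.

```python
def adjacent_slots(t, slots_per_day, window=2):
    """Indices of the recent, daily, and weekly neighbours of slot t.

    recent: the `window` slots before and after t
    daily:  the same slot on the `window` previous and following days
    weekly: the same slot in the `window` previous and following weeks
    """
    day, week = slots_per_day, 7 * slots_per_day
    offsets = [k for k in range(-window, window + 1) if k != 0]
    return {
        "recent": [t + k for k in offsets],
        "daily": [t + k * day for k in offsets],
        "weekly": [t + k * week for k in offsets],
    }


def usable_dimensions(slots, observed):
    """Names of the time dimensions with at least one observed neighbour."""
    return [name for name, idx in slots.items()
            if any(i in observed for i in idx)]


slots = adjacent_slots(1000, slots_per_day=90, window=1)
print(slots)
# If the recent neighbours are also missing, daily and weekly data remain.
print(usable_dimensions(slots, observed={910, 370}))
```

The second function illustrates why three dimensions help with continuous missing data: when a whole run of recent slots is lost, the daily or weekly neighbours can still supply inputs.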
3.2.1. Bi-LSTM Encoder
The Bi-LSTM extends the generic LSTM. For each of the recent, daily, and weekly periods, it uses the previous and future points by processing the missing data in both the forward and the backward direction with two separate LSTMs. Each one-way LSTM for the recent, daily, and weekly periods includes forget, input, and output gates.
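In each time dimension, the hidden vector $h_t$ in time period $t$ is updated as follows:
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \tag{13}$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \tag{14}$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \tag{15}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{16}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \tag{17}$$
$$h_t = o_t \odot \tanh(c_t), \tag{18}$$
where $f_t$ and $o_t$ refer to the forget and output gates, respectively; $i_t$ and $\tilde{c}_t$ are the input gates; $x_t$ is the adjacent historical data in time period $t$; $c_t$ is the cell vector; $W$, $U$, and $b$ denote the weight matrices and bias vectors; and $\sigma$, $\tanh$, and $\odot$ represent the sigmoid function, activation function, and element-wise multiplication, respectively.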
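The gate updates of a standard LSTM cell, and the way two one-way LSTMs are combined into a bidirectional encoder, can be sketched in NumPy. This is an illustration with hypothetical names and a stacked parameter layout chosen for brevity; it concatenates the forward and backward hidden vectors, whereas the paper combines them with an impact weight.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,);
    rows are stacked in the order forget, input, candidate, output."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:H])                 # forget gate
    i = sigmoid(z[H:2 * H])             # input gate
    g = np.tanh(z[2 * H:3 * H])         # candidate cell value
    o = sigmoid(z[3 * H:4 * H])         # output gate
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t


def run_lstm(xs, params, H):
    """Run a one-way LSTM over the sequence xs of shape (T, D)."""
    h, c = np.zeros(H), np.zeros(H)
    out = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, *params)
        out.append(h)
    return np.stack(out)


def bilstm_encode(xs, params_fwd, params_bwd, H):
    """Forward and backward passes, aligned in time and concatenated."""
    fwd = run_lstm(xs, params_fwd, H)
    bwd = run_lstm(xs[::-1], params_bwd, H)[::-1]
    return np.concatenate([fwd, bwd], axis=1)
```

In practice the same structure is available as a bidirectional LSTM layer in deep learning frameworks; the explicit loop here only makes the gate arithmetic visible.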
The previous (forward) and future (backward) hidden vectors generated by Equations (13)–(18) are then merged in each time period into an updated hidden vector, using an impact weight.
3.2.2. LSTM Decoder with Attention
In this part, we use an LSTM decoder with an attention mechanism. The encoded hidden vectors of the adjacent periods are fused into an attention context vector for each time period. The computation involves the updated hidden vector from the encoding stage, the decoded hidden vector, the forward LSTM trained by the decoder components, a weight coefficient, and an activation function.
A dense layer with a linear activation function is added on top of the decoder layer to generate the predictions in Figure 2, and it outputs the decoded values with the backward LSTM. The output value decoded by the LSTM is computed from the concatenation of the decoder hidden vector and the attention context vector, using a linear activation function and linear parameters mapped to the decoder hidden states.
3.2.3. Multicomponent Time Dimension Fusion
The input vectors are trained by the above encoder-decoder processes. The estimates obtained from the adjacent data of the missing data in the three time dimensions (recent, daily, and weekly periods) are then merged to obtain the fusion value.
3.3. The Spatiotemporal Fusion of Missing Data
The spatial aggregated value and the multitime dimension fusion value in each time period are combined by a weight to give the final result of spatiotemporal imputation in that time period; the weight starts from an initial value.
To prevent the poor results caused by excessive weight fluctuation, an inertia factor is introduced. The size of the inertia factor reflects the size of the weight inertia and is inversely proportional to the volatility of the weight.
The weight is updated in each time period according to the error of the spatial imputation and the error of the multitime dimension imputation in that period.
That is to say, when the imputation error of one component is relatively large, its influence on the imputation is reduced in the next time period, and vice versa. A suitable value of the inertia factor can be obtained through simulation experiments.
4. Results and Discussion
In this section, we introduce the dataset, baselines, and evaluation metrics, and then verify the superiority of the MGIA in the case of random and continuous data missing.
In the random data missing scenario, missing data appears randomly in the traffic dataset owing to temporary failures (e.g., network or power outages). Continuous data missing means that data is missing continuously because of the long-term failure of traffic data collectors, so that values are lost at several consecutive timestamps or at multiple intersections in a road network.
4.1. Data Preparation
We validate our approach with the traffic index dataset provided by the Chengdu branch of Didi Chuxing, China. As shown in Figure 7, we select the geographic area 30.66° N ∼30.73° N, 104.02° E ∼104.10° E, in which 74 road segments make up the road network used in this work. The dataset spans September to October 2018. All programs are developed with Python 3.7 and TensorFlow 1.13.1.
The dataset was filtered to the normal traffic flow hours (6 am–8:50 pm), and the time intervals are set to 10 min and 30 min, respectively. The traffic index is a standardized quantitative indicator of traffic congestion. For each road segment and timeslot, it is computed from the average speed in that timeslot and the maximum and minimum speeds of that road segment in the historical data.
In the tests, we manually remove a certain amount of the processed data and then impute the removed data with the proposed approach and the state-of-the-art methods. We consider both random and continuous data missing patterns and measure the accuracy of each method by comparing the imputed results with the ground truth data. The data missing rate in both patterns ranges from 10% to 60% in steps of 10%.
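The two missing patterns can be simulated with boolean masks over a segments-by-timeslots matrix. The sketch below is one plausible way to do this and is not taken from the paper: the function names, the run length of 6 slots, and the 90 slots per day are assumptions, while the 74 segments match the dataset description.

```python
import numpy as np


def random_mask(shape, rate, rng):
    """True = removed; entries are dropped independently at random."""
    total = int(np.prod(shape))
    flat = np.zeros(total, dtype=bool)
    n_missing = int(round(rate * total))
    flat[rng.choice(total, size=n_missing, replace=False)] = True
    return flat.reshape(shape)


def continuous_mask(shape, rate, rng, run=6):
    """True = removed; drops runs of `run` consecutive timeslots on
    randomly chosen segments until the target rate is reached.
    The result may overshoot the target by fewer than `run` entries."""
    n_seg, n_t = shape
    mask = np.zeros(shape, dtype=bool)
    target = int(round(rate * n_seg * n_t))
    while mask.sum() < target:
        seg = rng.integers(n_seg)
        start = rng.integers(0, n_t - run + 1)
        mask[seg, start:start + run] = True
    return mask


rng = np.random.default_rng(0)
shape = (74, 90)                     # 74 segments, one day of 10 min slots
m_random = random_mask(shape, 0.3, rng)
m_cont = continuous_mask(shape, 0.3, rng, run=6)
print(m_random.mean(), m_cont.mean())
```

The masked entries are hidden from the imputation method and kept aside as ground truth, so the same data supports both patterns at every missing rate.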
4.2. Experimental Settings
The MGIA is compared with the following state-of-the-art methods for traffic data imputation:
(i) LRTC-TNN is a low-rank tensor completion framework with truncated nuclear norm, which extracts spatiotemporal features from traffic data.
(ii) SSIM is a sequence-to-sequence imputation model, which imputes missing data by using LSTM over both past and future time indexes.
(iii) MGIA is our proposed approach, which combines the motif-based graph aggregation method with the multitime dimension fusion method based on Bi-LSTM to impute missing data.
(iv) GIA is a comparison variant of MGIA, which combines the graph aggregation method with the multitime dimension fusion method based on Bi-LSTM but does not include the motif-based search.
4.2.2. Evaluation Metrics
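To evaluate the performance of the MGIA, we employ three widely used performance metrics: root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|,$$
where $n$ is the total number of missing data, $\hat{y}_i$ is the imputed data, and $y_i$ is the corresponding actual data of $\hat{y}_i$.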
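The three metrics are computed only over the entries that were removed. A minimal NumPy sketch, with a hypothetical function name and assuming the actual values are nonzero so that MAPE is defined:

```python
import numpy as np


def imputation_errors(y_true, y_imputed):
    """Return (RMSE, MAE, MAPE in percent) over the missing entries.
    Assumes y_true contains no zeros, otherwise MAPE is undefined."""
    y_true = np.asarray(y_true, dtype=float)
    y_imputed = np.asarray(y_imputed, dtype=float)
    err = y_imputed - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0
    return rmse, mae, mape


rmse, mae, mape = imputation_errors([2.0, 4.0], [1.0, 6.0])
print(rmse, mae, mape)
```

RMSE penalizes large individual errors more strongly than MAE, while MAPE expresses the error relative to the size of the actual value, so the three are usually reported together.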
4.3. Comparison and Result Analysis
Figures 8 and 9 present the average errors at different missing rates in the cases of random and continuous data missing, respectively. MGIA shows superior performance over the baselines w.r.t. all three metrics.
All metrics of MGIA are lower than those of the other three approaches. LRTC-TNN and SSIM are recent approaches, and the gap between MGIA and these two algorithms is large on all three metrics, which means that MGIA achieves good performance. Meanwhile, MGIA significantly outperforms the comparison approach GIA, which shows that applying motifs in spatial imputation is feasible and effective.
As shown in Figures 8 and 9, the overall trends of the four approaches are similar, i.e., all metrics increase as the missing rate increases. In Figure 8, when the missing rate is less than 40%, the error growth of these approaches is steady except for SSIM. Once the missing rate exceeds 40%, the errors grow faster than they do at missing rates below 40%. In Figure 9, the error growth of these approaches remains stable across the missing rates from 10% to 60%. A possible reason is that, under random data missing, runs of continuous missing data appear once the missing rate exceeds 40%; the data then no longer follow the random missing pattern being imputed, and the error growth increases significantly.
LRTC-TNN achieves better performance than SSIM. This is because LRTC-TNN uses low-rank tensor decomposition and is not affected by consecutive missing data, whereas SSIM is susceptible to it: its error differs greatly depending on whether the data in the preceding or the following period is missing. GIA extends SSIM from one time dimension to three and employs graph aggregation for the spatiotemporal fusion of imputed data, so it performs better than SSIM. MGIA adds motif-based imputation to GIA and balances higher-order spatial correlations against periodicity when merging the imputed values.
When the time interval widens from 10 min to 30 min, the imputation performance becomes worse in Figures 8 and 9, but MGIA still performs better than the other three approaches on all three metrics.
To show the advantage of MGIA under continuous data missing, we compare its metrics with those under random data missing in Tables 2 and 3. For both the 10 min and the 30 min interval, MGIA performs better under continuous data missing than under random data missing. This is because MGIA adopts the time series method based on Bi-LSTM, which widens the horizon from one time dimension to three and pays more attention to time continuity.
Overall, MGIA outperforms the baselines because it exploits the spatiotemporal characteristics and higher-order spatial correlations. Moreover, its imputation performance under continuous data missing is better than under random data missing.
In this paper, a novel spatiotemporal imputation approach for traffic data (MGIA), which utilizes motif-based graph aggregation, is proposed. To sum up, this approach addresses the issues of (1) traffic spatial imputation for large-scale areas and (2) poor imputation performance especially when in the case of continuous data missing. Based on MGIA, we capture the higher-order spatial correlations in traffic networks and solve the problem of spatiotemporal data imputation. Experiments are performed on a real-world traffic dataset in Chengdu, China. The experimental results showed that the MGIA outperformed all other methods in the case of random and continuous data missing and achieved strong stability crossing various missing rates range from 10% to 60%.
In the future, we will further evaluate MGIA with regard to other factors, such as weather and events. We also plan to incorporate adversarial learning into the proposed approach to improve the imputation accuracy in the case of continuous data missing over longer time intervals. On this basis, we aim to improve the accuracy at missing rates ranging from 60% to 80%. Additionally, we plan to apply the proposed approach to other ITS tasks, such as traffic prediction and causal discovery of congestion propagation patterns.
All datasets in this study can be downloaded from https://outreach.didichuxing.com.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
This research was supported by the National Natural Science Foundation of China (Nos. 62073295 and 62072409), Zhejiang Provincial Natural Science Foundation (LR21F020003), and Fundamental Research Funds for the Provincial Universities of Zhejiang (RF-B2020001).