Owing to the incomplete coverage and occasional failure of traffic data collectors, traffic data usually suffers from missing information. Accurate imputation is therefore critical to the operation of transportation networks. Existing approaches usually focus on analysing temporal variation and the representation of adjacent spatial segments, whereas higher-order spatial correlations and continuous data missing are attracting increasing attention from academia and industry. In this paper, by leveraging motif-based graph aggregation, we propose a spatiotemporal imputation approach to address the issue of missing traffic data. First, through motif discovery, a higher-order graph aggregation model is presented for traffic networks. It utilizes a graph convolution network (GCN) to aggregate the correlated segment attributes of the missing data segments. Then, a multitime dimension imputation model based on bidirectional long short-term memory (Bi-LSTM) incorporates the recent, daily-periodic, and weekly-periodic dependencies of the historical data. Finally, the spatial aggregated values and the temporal fusion values are integrated to obtain the results. We conducted comprehensive experiments on a real-world dataset and examined random and continuous data missing at different time intervals; the results show that the proposed approach is feasible and accurate.

1. Introduction

With the rapid growth of urbanization, intelligent transportation systems (ITS) are widely adopted for urban management and traffic control [1, 2]. ITS rely on the availability of traffic data to evaluate traffic status and system performance. If traffic information is collected comprehensively, urban commuters can adapt to the traffic conditions of urban roads and grasp the patterns of traffic flow changes, which subsequently promotes the development of urban transportation.

In recent years, emerging information technologies, such as fifth-generation networks [3] and edge computing [4], have made traffic data collection more convenient, and the collected data is usually mobile, multisource, and real time. Unfortunately, due to the frequent occurrence of various types of failures (e.g., power malfunction, device maintenance, and network issues), the collected data is often incomplete [5]. Moreover, due to the high cost of construction and maintenance, it is difficult for the equipment to cover the entire traffic network [6]. The loss of traffic data during data collection is therefore inevitable.

The problem of missing information significantly weakens data quality, limits the study of transportation networks (e.g., traffic management, urban planning, and route choice), and, even worse, may result in false decisions [6–8]. Thus, handling missing values is a prerequisite for traffic data mining and analysis [9], and accurate imputation has become an important research topic in ITS [10].

The key to data imputation is to discover the hidden spatiotemporal information in the neighbouring data [11]. For instance, as shown in Figure 1, imputation can make use of the spatiotemporal neighbouring values of the missing data, thereby improving its accuracy. Bae et al. proposed two cokriging methods that exploited the existence of spatiotemporal dependency in traffic data to impute high-resolution traffic speed under different random data missing scenarios [12]. Li et al. developed a combined deep neural model, which extracted spatiotemporal features to estimate missing values [13]. To characterize the hidden patterns in spatiotemporal traffic data, Chen et al. incorporated a low-rank tensor completion (LRTC) framework with the truncated nuclear norm (TNN) and obtained a better solution for data imputation [14]. Considering the case of continuous data missing, Zhang et al. utilized the temporal neighbouring values of a given period and employed the long short-term memory network (LSTM) to recover missing data [15].

Although many methods have achieved promising imputation accuracy, some limitations remain. The first question to be addressed is how to effectively capture the spatial dependencies. Existing methods usually consider the directly adjacent road segments (i.e., upstream and downstream) but ignore the global information. It is therefore worthwhile to introduce imputation models that sense both the global and local variations of spatial information [10]. The second limitation concerns continuous data missing, in which values are missing at several consecutive timestamps on a road segment. In this circumstance, no data is available for a period of time, so stable inputs cannot be provided to a model.

To address these issues, in this paper, we propose a spatiotemporal imputation approach for traffic data via motif-based graph aggregation (named MGIA), which incorporates motif-based spatial aggregation with multitime dimension fusion by bidirectional LSTM (Bi-LSTM). To the best of our knowledge, our work is the first attempt to apply a motif-based spatial method to traffic data imputation. The contributions of this paper are summarized as follows:

(1) We propose a higher-order graph aggregation model based on motifs. It aggregates the correlated segment attributes of the missing data segments to capture the higher-order spatial correlations in a road network, using motif-based search followed by aggregation based on a graph convolution network (GCN).

(2) We develop a Bi-LSTM approach based on multitime dimension fusion to improve the accuracy in the case of continuous data missing. It incorporates the recent, daily-periodic, and weekly-periodic dependencies to ensure that there are enough historical data to complete the temporal imputation.

(3) We perform extensive experiments on a real-world dataset to evaluate the performance of our approach. The experimental results confirm the advantages of the proposed approach over state-of-the-art imputation approaches under various missing patterns.

The remainder of the paper is organised as follows: Section 2 introduces the related work in traffic data imputation. Section 3 describes the proposed approach. Section 4 shows the experimental results and analysis. Section 5 concludes this paper.

2. Related Work

In this section, we introduce related work on conventional and deep learning-based imputation approaches.

2.1. Conventional Imputation

In the past decades, traffic data imputation has attracted widespread attention. The imputation methods mainly include predictive, interpolation, and statistical methods.

The predictive methods utilize historical data to predict the missing values. Typical methods include Bayesian networks [16] and support vector regression [17]. Ahn et al. developed a traffic flow prediction method based on a Bayesian classifier and support vector regression [18], which further improved the imputation accuracy of missing values. The hybrid approach based on fuzzy c-means (FCM) is another example of such predictive methods; it integrates optimized FCM parameters and genetic algorithms to build prediction models [19, 20]. These methods focus on the historical traffic data for filling in missing data and fail to consider the imputation of continuous missing data. Moreover, they ignore the spatial relationships, which also provide crucial information for imputation.

Interpolation methods use the average value of the neighbouring data or historical data to impute the missing values. Using traffic data from the same sensor during the same period on neighbouring days, Yin et al. took the average value of these known data to impute the missing values [21]. Chang et al. utilized k-nearest neighbours (KNN) and local least squares to consider the relationship between similar traffic flow patterns and enhanced the interpolation effects [22]. Kriging interpolation [12, 23] focuses on determining the weights of historical values and considers spatiotemporal information to capture the characteristics of traffic data. Although these methods can achieve promising imputation results in a short time, they rely on averaging and do not consider complicated changes caused by other factors such as global data attributes and random events.

The statistical methods aim to develop a data distribution that best fits the imputed missing data. To reflect the uncertainty between the imputation parameters, Audigier et al. designed a multiple imputation method based on Bayesian principal component analysis (BPCA) to cope with incomplete continuous data [24]. To exploit the spatiotemporal correlation of traffic network data, Wang et al. developed a low-rank matrix factorization-based approach to reconstruct the missing traffic data [25]. Moreover, some researchers have expanded the two-dimensional matrix into high-dimensional tensors, such as the tensor decomposition models [26, 27]. However, the accuracy of these approaches mainly relies on a prior assumption about the data distribution, and errors may arise when the actual, unknown distribution differs from the assumed one.

2.2. Deep Learning-Based Imputation

Recently, the boom in deep learning [28, 29] has inspired new ideas for data imputation. Autoencoder models [30] were developed to hierarchically train on the full set of traffic data and extract the spatiotemporal features of the hidden layers, demonstrating their effectiveness for data imputation [31]. Pathak et al. converted spatiotemporal trajectory data into images, used the powerful feature extraction of a convolutional neural network (CNN), and combined it with the autoencoder model to impute the missing spatiotemporal trajectory data [32]. Generative adversarial networks (GANs) provide a class of generative models trained adversarially; they use actual data/parallel data to generate the true data distribution, so that the imputation quality is improved [33, 34]. By incorporating the reversibility of the generative imputer into GAN, Kazemi and Meidani proposed an iterative GAN architecture to evaluate the imputation of missing traffic data [35]. These state-of-the-art models work well when dealing with the data correlation across different road segments, as they adopt the powerful feature extraction capabilities of deep neural networks to impute spatiotemporal data. However, most of them ignore the spatial dependencies in the traffic network, and their imputation effect depends on a massive amount of training data.

As some graph-based methods consider strong relations of data structures [36, 37], they can capture global information to improve imputation performance. Chen and He proposed a heterogeneous graph embedding framework, which constructed a travel heterogeneous information network to find the best matched vehicles for the missing records [38]. By incorporating the spectral graph convolution operation, Cui et al. developed the graph Markov network to handle missing values for short-term traffic forecasting [39]. Graph representation learning is one crucial category of deep learning that has been widely used for traffic data imputation [40]. Such methods view the observations and features as data nodes in a bipartite graph, or construct a sample self-representation strategy, and further rely on the samples neighbouring the missing ones. Motifs are small connected components in a graph and help in understanding higher-order relations and global spatial graph principles [41, 42]. Some imputation approaches based on motif discovery have been introduced [43, 44], but they are rarely adopted in transportation. The graph-based methods solve the imputation problem by capturing global spatial information from historical and neighbouring data, but they ignore the time series information that complements the spatial dependencies.

Inspired by the above viewpoints, we propose a spatiotemporal imputation approach based on motif-based graph aggregation. By adopting motifs, the proposed approach captures the higher-order spatial correlations in traffic networks. In addition, it focuses on imputation in the case of continuous data missing. Finally, we list the limitations of the existing methods in Table 1.

3. Methodology

According to the analysis of related work, the existing methods have the following limitations: (1) they ignore the spatial dependencies in large-scale areas and (2) they rarely consider imputation in the case of continuous data missing. The MGIA aims to improve the imputation accuracy from the perspectives of continuous data missing and the higher-order spatial correlations in traffic networks. The framework of the MGIA is shown in Figure 2. It works in the following steps: (1) We adopt motifs to define the graph-based structure present in traffic networks and search for all associated road segments that meet the motif gain condition of the missing data segment. On this basis, GCN is utilized to gradually aggregate the nonmissing features of each associated segment, and the spatial aggregated value of the missing data segment is determined. (2) The multitime dimension imputation based on Bi-LSTM deals with the problem of continuous data missing by incorporating the recent, daily-periodic, and weekly-periodic dependencies of the historical data. (3) The spatial aggregated values and the multitime dimension fusion values are integrated to obtain the imputation results.

3.1. The Spatial Imputation of Motif-Based Graph Aggregation

In order to capture the data correlation and global spatial characteristics between the road segments in traffic networks, the motif-based graph aggregation is employed to impute the missing data of road segments.

3.1.1. The Associated Node Search Based on Motif Discovery

Motifs are nonisomorphic connected graph structures with three or more nodes that occur frequently in a network; triangle and quadrangle motifs are shown in Figure 3. As motifs consider higher-order correlations of data structures, they can capture global information to improve the performance of node feature aggregation [45]. Triangles are traffic network motifs that play important roles in higher-order connectivity [46]. Thus, in combination with traffic theory, we select the triangle motif M32 as the research object, in which nodes denote road segments and edges denote the connection between two adjacent segments.

We design the method of motif-based search, which uses the motif gain to adjust the fitness function. If the motif gain no longer increases or no neighbouring node exists, then the search phase is stopped.

First, a road segment set is defined for the traffic network. The road segment containing the missing data is treated as a node (the missing data node), and a target node set is determined. Second, the motif-based local optimization algorithm searches all adjacent nodes of the missing data node to form an adjacent node set and calculates the motif gain obtained by adding a node of the adjacent node set to the target node set. If the motif gain is greater than zero (i.e., there is a spatial correlation between the node and its adjacent nodes), the node is added to the target node set, and the target node set is updated. Third, the remaining adjacent nodes are considered in turn: the motif gain is calculated for each of them, the nodes that meet the condition join the target node set, and the set is updated. Eligible nodes continue to join the target node set in the same way. When the motif gain has been calculated for all nodes in the adjacent node set, the associated node search for the missing data node ends. The associated node search is shown in Figure 4.
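As an illustration of the search procedure described above, the following minimal Python sketch grows a target node set greedily from the missing data node. The gain used here (the number of triangles that a candidate shares with the current set) is only a stand-in assumed for illustration; it is not the motif gain of Equation (3), and the names `shared_triangles`, `motif_search`, and `roads` are hypothetical.

```python
def shared_triangles(adj, members, candidate):
    """Stand-in gain: triangles containing `candidate` and at least one member."""
    found = set()
    for m in adj[candidate] & members:
        # A common neighbour w of candidate and m closes the triangle (candidate, m, w).
        for w in adj[candidate] & adj[m]:
            found.add(frozenset((candidate, m, w)))
    return len(found)


def motif_search(adj, target, gain=shared_triangles):
    """Grow the target node set from `target` while some neighbour has positive gain."""
    members = {target}
    changed = True
    while changed:
        changed = False
        # Adjacent nodes of the current set that have not joined yet.
        frontier = set().union(*(adj[n] for n in members)) - members
        for node in sorted(frontier):
            if gain(adj, members, node) > 0:
                members.add(node)
                changed = True
    return members


# Toy undirected road graph: a, b, c form a triangle; d and e form a chain off b.
roads = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}
associated = motif_search(roads, "a")
```

In this toy graph the search stops at the triangle around the missing data node, because the chain segments share no triangle with it. Any other gain function with the same signature can be passed through the `gain` argument.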

The local motif rate is used to avoid counting motifs repeatedly [47]. When an adjacent node is joined, the local motif rate is calculated as follows:

where is the current node set, is the joined adjacent node, is the local motif rate, is the number of local motif, is the number of motifs between the current node set and the external node set, is the number of motifs between the current node set and the new nodes, and is the number of motifs between the external node set and the new nodes.

The motif gain of the node is calculated as follows: where is the gain of the local motif rate when the node is joined.

The gain indicates that the joined adjacent node is spatially related to the missing data node . As shown in Figure 5, we provide two examples of calculating the motif gain . and are equal to 4 and 3, respectively, and is . In Figure 5(a), and are equal to 1 and 2, respectively, and is from Equation (3), the default control parameter is 1,and ; thus, the node cannot be joined to the current node set . In Figure 5(b), and are equal to 1, and is , ; this indicates that there is a spatial correlation between the missing data node and the node.

3.1.2. The Spatial Aggregated Method Based on GCN

All nodes associated with the missing data node by Equation (3) are determined, and the associated node data is aggregated to obtain the estimate values of the missing data node by GCN. A graph , with nodes to describe a road network, where nodes denote road segments, edges denote the directed connection from node to node and denotes the weighted adjacency matrix. The graph is represented by its corresponding Laplacian matrix. The properties of the graph structure can be obtained by analyzing Laplacian matrix and its eigenvalues. where is the normalized form of Laplacian matrix and , and are the degree matrix, adjacent matrix, and unit matrix, respectively. and are the eigenvector function and matrix of , respectively, and is the eigenvalue of input node.

According to Equations (4) and (5), the eigenvalue decomposition of is represented as follows: where is the diagonal matrix composed by eigenvalues of and .

In order to reduce the time complexity when the scale of the graph is large, the Chebyshev polynomials are adopted to approximate the solution: where and are Chebyshev polynomials and coefficients, respectively; ; represents the maximum eigenvalue of ; ; and .

It can be seen from Equations (7) and (8), the approximate solution with Chebyshev polynomials is equivalent to using a convolution kernel to extract the eigenvalues of neighboring nodes with the node as the center of each node in the graph. In order to simplify the calculation, limit to 1, scale the eigenvalue of to make , and Equation (7) is expressed as follows:

According to Equation (4), let at the same time, and Equation (9) is transformed as follows:

In order to avoid numerical and gradient instability problems, let , Equation (10) is transformed as follows:
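For orientation, the chain of simplifications described in this subsection matches the first-order GCN approximation that is standard in the literature. The block below writes it out in commonly used notation, which is assumed here and may differ from the symbols of Equations (4)–(11): $A$ is the adjacency matrix, $D$ the degree matrix, $I_N$ the identity matrix, $U$ and $\Lambda$ the eigenvectors and eigenvalues of the normalized Laplacian $L$, $\theta_k$ the Chebyshev coefficients, and $x$ the node signal.

```latex
\begin{aligned}
L &= I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^{\top}, \\
g_\theta \star x &\approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x,
  \qquad \tilde{L} = \tfrac{2}{\lambda_{\max}} L - I_N, \\
T_k(z) &= 2 z\, T_{k-1}(z) - T_{k-2}(z), \qquad T_0(z) = 1,\; T_1(z) = z, \\
g_\theta \star x &\approx \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x
  \qquad (K = 1,\ \lambda_{\max} = 2), \\
g_\theta \star x &\approx \theta \left( I_N + D^{-1/2} A D^{-1/2} \right) x
  \qquad (\theta = \theta_0 = -\theta_1), \\
I_N + D^{-1/2} A D^{-1/2} &\;\rightarrow\; \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2},
  \qquad \tilde{A} = A + I_N,\ \tilde{D}_{ii} = \textstyle\sum_j \tilde{A}_{ij}.
\end{aligned}
```

The last line is the renormalization step that avoids numerical and gradient instability when the operator is applied repeatedly.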

All associated nodes that meet the motif gain conditions are extracted according to Equation (11), and the final aggregated value of the missing data node is expressed as follows: where is the final aggregated value of the missing data node , i.e., the spatial aggregated value in Figure 2. , and are the number of associated nodes, the parameters to be trained, and activation function, respectively. The initial value is the eigenvalue of the first associated node that meets the motif gain condition.
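A minimal NumPy sketch of one such aggregation step is given below. It assumes the standard renormalized propagation rule (self-loops added to the adjacency matrix, symmetric degree normalization, ReLU activation); the function names, the toy graph, and the weight values are illustrative assumptions rather than the trained model of this paper.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a non-negative adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_tilde.sum(axis=1)                 # degrees of the self-looped graph (>= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    return np.maximum(normalized_adjacency(adj) @ features @ weights, 0.0)


# Three mutually adjacent segments (a triangle motif) with one feature each.
adj = np.array([[0.0, 1.0, 1.0],
                [1.0, 0.0, 1.0],
                [1.0, 1.0, 0.0]])
features = np.array([[0.2], [0.4], [0.6]])
weights = np.array([[1.0]])                   # hypothetical, untrained parameter
aggregated = gcn_layer(adj, features, weights)
```

With an identity weight, every node of the triangle receives the mean of the three features, which shows how the operator pools the attributes of the associated segments.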

3.1.3. The Computational Complexity

We discuss the time complexity of the proposed approach. For the process of motif-based node search, a node where the missing data needs to search other nodes, and the nodes in the target node set also search the each node that do not exist in , so the computational complexity is . For the process of the GCN-based aggregation, optimized by the Chebyshev polynomials, the time complexity is reduced to . In order to simplify the calculation, we limit to 1, so the time complexity of the GCN-based aggregation approximates to . It is a sequential to these two processes, so the total value of the two processes is , i.e., the overall computational complexities .

The process of motif-based graph aggregation is depicted in Algorithm 1.

Input: the node set, target node , the eigenvalue of the first associated node.
Output: imputed value .
Part I: Motif-based search.
1: Initialize: current node set , motif gain
3: while do
4: update
5: set according to Equation (2)
6: obtain according to Equation (1)
7: obtain according to Equation (3)
8: i ← i +1
9:end while
10:until no neighbor node exists
11: update target node set
Part II: Aggregation
12: Initialize:
13: for do
14: update according to Equations (4)–(12)
15: end for
16: return

3.2. The Multiple Dimension Imputation Based on Bi-LSTM

Existing temporal imputation methods obtain good estimates of missing data in the case of random data missing. When a large number of values are missing at consecutive timestamps, however, the imputation performance degrades [15]. Recovering continuous missing data is therefore a more challenging task.

Processing long-term time-series data is an essential task when there is a large amount of continuous missing data. Deep learning, which trains classifiers directly from input data through complex feature representations, may generate high-performing results in such a dynamic and challenging context [48]. LSTM and the gated recurrent unit (GRU) [49], which are key methods of deep learning, have been employed for time-series applications with temporal dependencies. LSTM has three gates: the input gate and forget gate control the update of the memory cell, and the output gate passes on the output information. LSTM has recently been employed for missing data imputation, such as in Refs. [10, 13, 15]. GRU is a variant of LSTM that comprises only an update gate and a reset gate and utilizes the reset gate to control the information from the previous point. Because GRU relies on the previous point in training, its improvement may be restricted in the case of continuous data missing. Besides, on large datasets, the performance of LSTM is better than that of GRU, as verified in [50]. Therefore, in order to efficiently capture the temporal dependencies of traffic time series data, LSTM is employed in our proposed approach.

It is known that missing data points are closely related to their adjacent points in both temporal directions. Bi-LSTM is capable of training in both forward and backward directions [51]. Meanwhile, enlarging the window based on different time granularities can benefit prediction performance by exploiting the temporal dependencies in the time series data [29]. Thus, we adopt time-series imputation based on Bi-LSTM, which incorporates the recent, daily-periodic, and weekly-periodic dependencies of the historical data. That is to say, among the adjacent time-series data of the current missing values in the three dimensions, normal data exists in at least one dimension, which ensures that there are enough historical data to complete the temporal imputation.

Assuming that the road segment has missing data in time period , the adjacent data of recent, daily, and weekly periods are defined as , , and , respectively. For instance, when represents the data from 7 a.m. to 7:10 a.m. on 19 September 2018, some adjacent time series data in three dimensions are shown in Figure 6.
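The three kinds of adjacent data can be located with simple index arithmetic. The sketch below is a hedged illustration: it assumes that the slots of each day are stored contiguously, that neighbours are taken symmetrically before and after the missing slot, and that a day contains 90 ten-minute slots; the function name and all parameter values are hypothetical.

```python
def neighbour_slots(t, window, slots_per_day, days_per_week=7):
    """Indices of the recent, daily-periodic and weekly-periodic neighbours of slot t."""
    offsets = [k for k in range(-window, window + 1) if k != 0]
    recent = [t + k for k in offsets]                                  # adjacent slots
    daily = [t + k * slots_per_day for k in offsets]                   # same slot, other days
    weekly = [t + k * slots_per_day * days_per_week for k in offsets]  # same slot, other weeks
    return recent, daily, weekly


# One neighbour on each side of slot 1000, assuming 90 slots per day.
recent, daily, weekly = neighbour_slots(1000, window=1, slots_per_day=90)
```

Indices that fall outside the recorded time span, or that point at slots which are themselves missing, would have to be dropped by the caller before the data is fed to the encoder.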

3.2.1. Bi-LSTM Encoder

The Bi-LSTM extends the generic LSTM. For each of the recent, daily, and weekly periods, it utilizes the previous and future points by processing the missing data in both forward and backward directions with two separate LSTMs. Each one-way LSTM of the recent, daily, and weekly periods includes forget, input, and output gates.

In each time dimension, the hidden vector in time period is updated as follows: where and refer to forget and output gates, respectively, and are input gates, is the adjacent historical data in time period , and is the cell vector. , , and represent the sigmoid function, activation function, and element-wise multiplication, respectively.
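The gate structure described above corresponds to the standard LSTM update, which can be sketched in NumPy as follows. The parameter layout (a dictionary mapping gate names to weight and bias pairs) and the use of tanh as the activation function are assumptions made for illustration, not the configuration used in the experiments.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One standard LSTM update; `params` maps gate names to (W, b) pairs."""
    z = np.concatenate([h_prev, x_t])                      # previous hidden vector and input
    f_t = sigmoid(params["f"][0] @ z + params["f"][1])     # forget gate
    i_t = sigmoid(params["i"][0] @ z + params["i"][1])     # input gate
    c_hat = np.tanh(params["c"][0] @ z + params["c"][1])   # candidate cell vector
    o_t = sigmoid(params["o"][0] @ z + params["o"][1])     # output gate
    c_t = f_t * c_prev + i_t * c_hat                       # element-wise cell update
    h_t = o_t * np.tanh(c_t)                               # new hidden vector
    return h_t, c_t


# Hidden size 2, input size 1, all-zero (hypothetical) parameters.
zero = {name: (np.zeros((2, 3)), np.zeros(2)) for name in "fico"}
h_t, c_t = lstm_step(np.array([0.7]), np.zeros(2), np.array([1.0, -1.0]), zero)
```

A bidirectional encoder runs this step once over the sequence in forward order and once in reverse order with separate parameters, and then combines the two hidden vectors at each time period.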

The previous and future hidden vectors generated by Equations (13)–(18) are updated in the time period : where and represent the previous and future hidden vectors, respectively, and is the impact weight.

3.2.2. LSTM Decoder with Attention

In this part, we select the LSTM decoder with the attention mechanism. The encoded hidden vectors of adjacent periods are fused into an attention vector: where is the attention context vector in time period t, is the updated hidden vector in encoding stage, is the decoded hidden vector, is the forward LSTM trained by decoder components, and and are weight coefficient and activation function, respectively.
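The fusion of encoded hidden vectors into a context vector can be illustrated with a small attention routine. The sketch below uses plain dot-product scoring followed by a softmax, which is a simplification swapped in for illustration; the scoring function of this paper involves a weight coefficient and an activation function that are not reproduced here, and the function name is hypothetical.

```python
import numpy as np


def attention_context(encoder_states, decoder_state):
    """Dot-product attention: softmax-weighted sum of encoder hidden vectors."""
    scores = encoder_states @ decoder_state          # one score per encoded period
    scores = scores - scores.max()                   # stabilise the exponentials
    weights = np.exp(scores) / np.exp(scores).sum()  # attention weights, summing to 1
    return weights @ encoder_states, weights


# Two encoded hidden vectors and a decoder state that favours neither of them.
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0]])
context, weights = attention_context(encoder_states, np.zeros(2))
```

When the decoder state does not favour either encoded vector, the weights are uniform and the context vector is their average; a decoder state closer to one of them shifts the weight toward it.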

The dense layer with a linear activation function is added on top of the decoder layer to generate predictions in Figure 2 and outputs the decoded values with the backward LSTM: where is the output value decoded by LSTM, is a concatenation of the decoder hidden vector and the attention context vector , is linear activation function, and and are linear parameters mapped to decoder hidden states.

3.2.3. Multicomponent Time Dimension Fusion

The input vectors are trained by the above encoder-decoder processes, and the estimated value merged from the adjacent data of the missing data in the three time dimensions (recent, daily, and weekly periods), i.e., the fusion value, is obtained.

3.3. The Spatiotemporal Fusion of Missing Data

The spatial aggregated value and the multitime dimension fusion-value in time period are combined as follows: where is the final result of spatiotemporal imputation in time period , and the initial value of weight .

In order to prevent the poor effect caused by excessive weight fluctuation, an inertia factor is introduced. The size of reflects the size of the weight inertia. The is inversely proportional to the volatility of weight .

Supposing that the error of the spatial imputation in time period is , and the error of the multitime dimension imputation in time period is , is updated as follows:

That is to say, when the imputation error of one component is relatively large, its influence on the imputation is reduced in the next time period, and vice versa. A suitable value of can be obtained through simulation experiments.
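The behaviour described in this subsection can be sketched as follows. The exact combination and update formulas are not reproduced here; the code assumes a convex combination of the two estimates and an exponentially smoothed weight whose target is the share of the total error made by the temporal branch. Both choices, as well as the function names and the numbers in the example, are illustrative assumptions.

```python
def fuse(spatial_value, temporal_value, weight):
    """Convex combination of the spatial and temporal estimates (assumed form)."""
    return weight * spatial_value + (1.0 - weight) * temporal_value


def update_weight(weight, spatial_error, temporal_error, inertia):
    """Shift the spatial weight toward the error share of the temporal branch."""
    total = spatial_error + temporal_error
    if total == 0:
        return weight                      # both branches were exact; keep the weight
    target = temporal_error / total        # larger spatial error -> smaller target
    # A larger inertia keeps the weight closer to its previous value.
    return inertia * weight + (1.0 - inertia) * target


# The spatial branch erred more (0.3 vs 0.1), so its weight drops below 0.5.
new_weight = update_weight(0.5, spatial_error=0.3, temporal_error=0.1, inertia=0.8)
imputed = fuse(0.4, 0.6, 0.5)
```

Under these assumptions, an inertia factor close to 1 yields a slowly varying weight, while a factor close to 0 lets the weight follow the most recent errors almost directly.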

4. Results and Discussion

In this section, we introduce the dataset, baselines, and evaluation metrics, and then verify the superiority of the MGIA in the case of random and continuous data missing.

In the scenario of random data missing, missing data appears randomly in the traffic dataset owing to temporary failures (e.g., network or power outage issues). Continuous data missing means that data is missing continuously due to the long-term failure of traffic data collectors, so that values are missing at consecutive timestamps or at multiple intersections in a road network.

4.1. Data Preparation

We validate our approach with the traffic index dataset provided by the Chengdu branch of Didi Chuxing, China. As shown in Figure 7, we select 30.66° N ∼30.73° N, 104.02° E ∼104.10° E as the geographic area, and 74 road segments make up the road network in this work. The time span of the dataset is from September to October 2018. All the programs are developed based on Python 3.7 and TensorFlow 1.13.1.

The dataset was filtered under the normal traffic flow hours (6 am–8:50 pm), and the time intervals are set to 10 min and 30 min, respectively. The traffic index is a standardized quantitative indicator to measure traffic congestion. It is expressed as follows: where is the traffic index of road segment during timeslot , is the average speed, and and are the maximum speed and minimum speed corresponding to road segment in the historical data, respectively.

In the tests, we manually remove a certain amount of the processed dataset and then impute these data with the proposed approach and the state-of-the-art methods. We consider the patterns of random and continuous data missing and obtain the accuracy of these methods by comparing the imputed results with the ground truth data. The data missing rate in these two patterns ranges from 10% to 60% in steps of 10%.
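The two missing patterns can be simulated with boolean masks over a segment-by-time matrix. The following sketch is one possible construction, assumed for illustration: random missing removes individual entries anywhere in the matrix, and continuous missing removes one consecutive block of timestamps per road segment. The function names, the seed, and the matrix shape (74 segments by 90 slots) are hypothetical choices.

```python
import numpy as np


def random_mask(shape, rate, rng):
    """Mark individual entries as missing at independently chosen positions."""
    n_total = int(np.prod(shape))
    n_missing = int(round(rate * n_total))
    mask = np.zeros(n_total, dtype=bool)
    mask[rng.choice(n_total, size=n_missing, replace=False)] = True
    return mask.reshape(shape)


def continuous_mask(shape, rate, rng):
    """Mark one consecutive block of timestamps as missing on each road segment."""
    n_segments, n_steps = shape
    block = int(round(rate * n_steps))
    mask = np.zeros(shape, dtype=bool)
    for s in range(n_segments):
        start = rng.integers(0, n_steps - block + 1)   # upper bound is exclusive
        mask[s, start:start + block] = True
    return mask


rng = np.random.default_rng(0)
rand = random_mask((74, 90), 0.3, rng)
cont = continuous_mask((74, 90), 0.3, rng)
```

The masked entries are hidden from the imputation method and later compared with the held-out ground truth values.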

4.2. Experimental Settings
4.2.1. Baselines

The MGIA is compared with the following state-of-the-art methods for traffic data imputation:

(i) LRTC-TNN [14] is a low-rank tensor completion framework with truncated nuclear norm, which extracts spatiotemporal features from traffic data.

(ii) SSIM [15] is a sequence-to-sequence imputation model, which is designed to impute missing data by utilizing the LSTM from both the past and future time indexes.

(iii) MGIA is our proposed approach, which incorporates the motif-based graph aggregation method with the multitime dimension fusion method based on Bi-LSTM to impute missing data.

(iv) GIA is a comparison variant of MGIA, which incorporates the graph aggregation method with the multitime dimension fusion method based on Bi-LSTM but does not include the motif-based search.

4.2.2. Evaluation Metrics

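To evaluate the performance of the MGIA, we employ three widely used performance metrics: root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|,$$

where $n$ is the total number of the missing data, $\hat{y}_i$ is the imputed data, and $y_i$ is the corresponding actual data of $\hat{y}_i$.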
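For reference, the three metrics can be computed over the removed entries as in the following NumPy sketch. The function name is hypothetical, MAPE is reported in percent, and the actual values are assumed to be nonzero.

```python
import numpy as np


def imputation_errors(imputed, actual):
    """Return (RMSE, MAE, MAPE in percent) over the imputed entries."""
    imputed = np.asarray(imputed, dtype=float)
    actual = np.asarray(actual, dtype=float)
    diff = imputed - actual
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    mape = float(np.mean(np.abs(diff / actual)) * 100.0)   # requires actual != 0
    return rmse, mae, mape


# Two removed entries with true values 1 and 5, imputed as 2 and 4.
rmse, mae, mape = imputation_errors([2.0, 4.0], [1.0, 5.0])
```

Only the positions that were removed from the dataset should be passed to the function, so that observed entries do not dilute the errors.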

4.3. Comparison and Result Analysis

Figures 8 and 9 present the average errors at different missing rates in the case of random and continuous data missing, respectively. The MGIA shows superior performance over the baselines w.r.t. all three metrics.

All metrics of MGIA are lower than those of the other three approaches. Although LRTC-TNN and SSIM are recent approaches, MGIA maintains a clear margin over both of them on all three metrics, which means that MGIA achieved good performance. Meanwhile, MGIA significantly outperforms the comparison approach GIA, which indicates that the application of motifs in spatial imputation is feasible and effective.

As shown in Figures 8 and 9, the overall trends of the four approaches are similar, i.e., all metrics increase as the missing rate increases. In Figure 8, when the missing rate is less than 40%, the error growth of these approaches is steady except for SSIM. Once the missing rate exceeds 40%, the errors grow faster than they do at missing rates below 40%. In Figure 9, the error growth of these approaches remains stable across missing rates from 10% to 60%. A possible reason is that, under random data missing, runs of consecutive missing values start to appear when the missing rate exceeds 40%; the data then no longer follows the purely random pattern, and the error growth increases significantly.

LRTC-TNN achieved a better performance than SSIM. This is because LRTC-TNN makes use of low-rank tensor decomposition and is not affected by consecutive missing data. In contrast, SSIM is susceptible to consecutive missing data: its error varies considerably depending on whether the data is missing in the preceding period or the following period. GIA extends SSIM from one time dimension to three and employs graph aggregation to perform spatiotemporal fusion, so it performs better than SSIM. MGIA adds motif-based imputation on the basis of GIA and balances the contributions of higher-order spatial correlations and periodicity when merging the imputed values.

When the time interval widens from 10 min to 30 min, the imputation performance becomes worse in Figures 8 and 9, but MGIA still performs better than the other three approaches in all three metrics.

To show the advantage of MGIA under continuous data missing, we compare its metrics with those under random data missing in Tables 2 and 3. For both the 10 min and the 30 min interval, the pattern of continuous data missing yields better performance than the pattern of random data missing. This is because MGIA adopts the time series method based on Bi-LSTM, which widens the horizon from one dimension to three dimensions and pays more attention to time continuity.

Overall, MGIA outperforms the other baselines because it exploits the spatiotemporal characteristics and higher-order spatial correlations. Moreover, its imputation performance under continuous data missing is better than under random data missing.

5. Conclusion

In this paper, a novel spatiotemporal imputation approach for traffic data (MGIA), which utilizes motif-based graph aggregation, is proposed. The approach addresses the issues of (1) traffic spatial imputation for large-scale areas and (2) poor imputation performance in the case of continuous data missing. Based on MGIA, we capture the higher-order spatial correlations in traffic networks and solve the problem of spatiotemporal data imputation. Experiments were performed on a real-world traffic dataset from Chengdu, China. The experimental results showed that the MGIA outperformed all other methods in the case of random and continuous data missing and remained stable across missing rates ranging from 10% to 60%.

In the future, we will further evaluate the MGIA with regard to other factors (such as weather and events). Besides, we plan to incorporate adversarial learning into the proposed approach to improve the imputation accuracy in the case of continuous data missing with longer time intervals. On this basis, we aim to improve the accuracy at missing rates ranging from 60% to 80%. Additionally, we plan to apply the proposed approach to other tasks in ITS, such as traffic prediction and causal discovery of the congestion propagation patterns.

Data Availability

All datasets in this study can be downloaded from https://outreach.didichuxing.com.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.


Acknowledgments

This research was supported by the National Natural Science Foundation of China (Nos. 62073295 and 62072409), Zhejiang Provincial Natural Science Foundation (LR21F020003), and Fundamental Research Funds for the Provincial Universities of Zhejiang (RF-B2020001).