Abstract

Traffic data plays a very important role in Intelligent Transportation Systems (ITS), which require complete traffic data for transportation control, management, guidance, and evaluation. However, the traffic data collected from many different types of sensors often contain missing values due to sensor damage or data transmission errors, which affects the effectiveness and reliability of ITS. To ensure the quality and integrity of traffic flow data, an effective data imputation method is needed. Most existing imputation methods, however, do not fully account for the influence of the sensor that contains the missing data or for the spatiotemporal correlation of traffic flow. In this paper, a traffic data imputation method based on improved low-rank matrix decomposition (ILRMD) is proposed, which considers the influence of the missing data and exploits the spatiotemporal correlation among traffic data. The proposed method uses not only the traffic data from the sensors surrounding the sensor with missing data, but also the data of that sensor itself. The information of the missing data is reflected in the coefficient matrix, and the spatiotemporal correlation characteristics are applied to obtain more accurate imputation results. Real traffic data collected from the Caltrans Performance Measurement System (PeMS) are used to evaluate the imputation performance of the proposed method. Experimental results show that the proposed method improves imputation accuracy by 87.07% on average compared with the SVR, ARIMA, KNN, DBN-SVR, WNN, and traditional MC methods, indicating that it is an effective method for data imputation.

1. Introduction

With the rapid development of the social economy, a large amount of road infrastructure has been built [1–4], but traffic congestion still exists on highways. Therefore, it is necessary to collect highway information to serve people's travel demand. The development of information technology has made such collection possible, and the collection equipment used on highways includes Bluetooth sensors, remote traffic microwave sensors, video sensors, and loop detectors. However, traffic flow data are lost to varying degrees due to sensor damage, malfunction, or transmission errors. Missing data make it difficult to extract valid information from traffic data. They are also an obstacle in traffic and travel time prediction [5–8], and the integrity of traffic flow data is a prerequisite for data analysis in ITS. Therefore, an effective traffic data imputation method is very important. Various methods have emerged in the field of traffic flow data imputation, and they can be roughly divided into three categories: prediction methods, interpolation methods, and statistical learning methods [9].

Traffic flow prediction models [10–12] are critical for road traffic management in complex road networks. Prediction methods usually build predictive models from historical data and treat missing data as values to be predicted. There are many ways to build traffic flow prediction models, from simple null value imputation to complex spatiotemporal imputation models [13]. Representative prediction methods include the Autoregressive Integrated Moving Average (ARIMA) model [14–16], Bayesian networks (BNs) [17–19], and support vector regression (SVR) [20, 21]. Elshenawy et al. [22] proposed an intelligent data imputation method based on the ARIMA model and presented a mechanism based on the Hyndman-Khandakar algorithm to determine the ARIMA parameters. Sun et al. [23] partitioned a day into different time sections and used SVR to forecast traffic flow data. Chen et al. [24] proposed an Autoregressive Integrated Moving Average with Generalized Autoregressive Conditional Heteroscedasticity (ARIMA-GARCH) model for traffic flow prediction. However, these prediction methods do not utilize the information of the sensor with missing data, which affects imputation accuracy.

Interpolation methods are divided into temporal-neighboring and pattern-neighboring methods [25]. Temporal-neighboring methods fill in the missing data with known data from the same sensor at the same time of day on neighboring days [20, 26]. Pattern-neighboring methods exploit the similarity of daily traffic flow data [27] and estimate missing data using historical data collected from the same sensors on different days [17, 20]. Typical pattern-neighboring methods include the K-nearest neighbors (KNN) model [28, 29] and the Local Least Squares (LLS) model [30, 31]; the key difficulty of these methods is determining the neighbors with an appropriate distance metric [32, 33]. Nguyen et al. [34] used the mean value of the historical data to estimate missing data. Smith et al. [35] used historical data or data from surrounding periods and locations to impute the missing data. Interpolation models assume that daily traffic flow data are similar, but actual traffic flow fluctuates and changes over time. Therefore, it is difficult for them to achieve satisfactory imputation performance.
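As a rough illustration of the temporal-neighboring idea, the following Python sketch fills a missing reading with the mean of the readings observed at the same time slot on other days. The function name `fill_from_neighboring_days`, the use of `None` as the missing-value marker, and the toy numbers are illustrative assumptions and are not taken from any of the cited methods.

```python
def fill_from_neighboring_days(days):
    """Fill None entries using the mean of the same time slot on other days.

    `days` is a list of equal-length lists, one list of readings per day.
    Entries that cannot be filled (no observed neighbor) stay None.
    """
    filled = [list(day) for day in days]
    for d, day in enumerate(days):
        for t, value in enumerate(day):
            if value is not None:
                continue
            # Collect the observed readings at time slot t on the other days.
            neighbors = [other[t] for k, other in enumerate(days)
                         if k != d and other[t] is not None]
            if neighbors:
                filled[d][t] = sum(neighbors) / len(neighbors)
    return filled


# Toy history: three days, three time slots, one missing reading.
history = [
    [10.0, 20.0, 30.0],
    [12.0, None, 28.0],
    [14.0, 24.0, 32.0],
]
completed = fill_from_neighboring_days(history)
```

The sketch makes the limitation discussed above visible: the filled value depends only on other days, so any fluctuation specific to the day with the gap is lost.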

Methods based on statistical learning have been developed in recent years. These methods assume a probability distribution model for the traffic data and use iterative methods to estimate the parameters of the distribution; the observed data are then used to impute the missing data. Statistical learning methods include Probabilistic Principal Component Analysis (PPCA) [6, 9], Bayesian Principal Component Analysis (BPCA) [26], neural network methods [36], and Markov Chain Monte Carlo (MCMC) [37]. MCMC is a typical imputation method based on statistical learning: it regards the missing data as the target parameter and estimates that parameter from sampled values. Higashijima et al. [38] proposed a regression tree imputation method and used a preprocessing step to improve imputation accuracy. Wei et al. [39] proposed a data-driven imputation method that uses k-means clustering to group the most correlated road segments; the trained model is able to estimate the missing data at multiple locations under a unified framework. Although the methods based on statistical learning make strong assumptions about traffic data, their performance is superior to that of traditional imputation methods [40] because the assumed probability distribution captures the essentials of traffic flow.

The methods based on prediction and interpolation impute the data using only the temporal or spatial correlation and consider only the information in historical data. Historical imputation methods fill the missing data with known data points collected from the same sensors at the same time of day on different days. These methods require the historical data to be stable, but in practice traffic flow data are usually unstable and fluctuate to some extent. The traditional imputation method sets all the missing data to zero and uses the zero-padded data matrix in the imputation, so the sensor with missing data cannot influence the imputation result. Generally, the sensor that contains the missing data has the highest correlation with the final imputation result, so setting its data to zero ignores its effect and reduces the accuracy of the imputation. To address these problems, a traffic data imputation method based on improved low-rank matrix decomposition (ILRMD) is proposed. Compared with the traditional imputation method, the ILRMD method does not discard the information of the missing data; instead, the effect of the missing data is reflected in the coefficient matrix. The reconstructed data matrix multiplied by the coefficient matrix, which contains the missing data information, gives the imputation result. The ILRMD method thus uses not only the traffic data from the sensors surrounding the sensor with missing data, but also the data of that sensor itself, and it exploits the spatiotemporal correlation characteristics of the traffic flow. Tests with traffic data collected from the Caltrans Performance Measurement System (PeMS) show that the proposed algorithm has superior imputation accuracy.

The rest of this paper is organized as follows. Section 2 reviews the related work in traffic data imputation and gives a brief introduction to low-rank matrix decomposition. Section 3 introduces the traditional imputation approach. Section 4 describes the ILRMD method proposed in this paper. Section 5 presents the result analysis and method comparison. Section 6 concludes the paper and gives some recommendations.

2. Related Work

With the rapid development of machine learning, pattern recognition, computer vision, and data mining, the processing of big data is becoming more and more important. The scale and growth rate of big data are continuously increasing, but large-scale high-dimensional data are often correlated and redundant. Therefore, it is necessary to compress large-scale data in a reasonable way. To reduce data redundancy, Candes [41] proposed the concept of low-rank sparse matrix decomposition in 2009, which is also called Low-Rank Matrix Recovery (LRMR), Low-Rank Matrix Decomposition (LRMD), or Robust Principal Component Analysis (RPCA).

2.1. Low-Rank Matrix Decomposition

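A given data matrix $M$ that lies approximately in a low-dimensional linear subspace can be decomposed into a low-rank matrix $L$ and a sparse matrix $S$ [42]:

$$\min_{L,S}\ \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.}\quad M = L + S, \tag{1}$$

where $\|\cdot\|_0$ represents the $\ell_0$ norm of the matrix and $\lambda$ represents the compromise factor between matrices $L$ and $S$.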

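Since the optimization problem (1) is NP-hard, it can be relaxed to a convex optimization problem [41–43], which is written as follows:

$$\min_{L,S}\ \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.}\quad M = L + S, \tag{2}$$

where $\|L\|_*$ represents the nuclear norm of the low-rank matrix $L$ and $\|S\|_1$ is the $\ell_1$ norm of the sparse matrix $S$.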
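Solvers for the relaxed convex problem, including augmented Lagrange multiplier schemes, typically rely on an elementwise soft-thresholding (shrinkage) step for the sparse component, which is the proximal operator of the $\ell_1$ norm. The sketch below shows only that operator in plain Python; the function names and the threshold `tau` are illustrative, and this is not the authors' implementation.

```python
def soft_threshold(value, tau):
    """Shrink a scalar toward zero by tau (proximal operator of tau * |x|)."""
    if value > tau:
        return value - tau
    if value < -tau:
        return value + tau
    return 0.0


def soft_threshold_matrix(matrix, tau):
    """Apply soft-thresholding to every entry of a list-of-lists matrix."""
    return [[soft_threshold(v, tau) for v in row] for row in matrix]


# Toy residual: small entries are zeroed, large entries shrink by tau.
residual = [[3.0, -0.2], [0.5, -4.0]]
sparse_part = soft_threshold_matrix(residual, 1.0)
```

Entries whose magnitude is below the threshold become exactly zero, which is what makes the estimated component sparse.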

The low-rank characteristic of the recovered matrix determines the matrix imputation performance, so choosing a suitable LRMD solver is crucial. The main algorithms for solving the LRMD problem include the Iterative Threshold method [44, 45], the Dual Approach [46], the Accelerated Proximal Gradient algorithm [47], and the Augmented Lagrange Multiplier method [48]. In this paper, the Augmented Lagrange Multiplier method is used.

2.2. Matrix Imputation Based on Low-Rank Matrix Decomposition

Generally, all the data cannot be recovered from partial samples. However, Candes [42] proved that the missing data can be recovered more accurately when the data matrix is low rank or approximately low rank. As described in Section 2.1, a low-rank matrix is obtained by LRMD, and it can be used to impute the missing data.

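The model of matrix imputation can be written as follows:

$$\min_{L}\ \operatorname{rank}(L) \quad \text{s.t.}\quad P_\Omega(L) = P_\Omega(M), \tag{3}$$

where $M$ is the partially observed data matrix, $L$ is the low-rank matrix to be recovered, $\Omega$ is the set of known element subscripts, and $P_\Omega$ is a linear projection operator, which can be defined as follows:

$$[P_\Omega(L)]_{ij} = \begin{cases} L_{ij}, & (i,j) \in \Omega, \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$

The optimization problem (3) is also NP-hard, so it needs to be relaxed into a convex optimization problem:

$$\min_{L}\ \|L\|_* \quad \text{s.t.}\quad P_\Omega(L) = P_\Omega(M). \tag{5}$$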
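In the standard matrix completion formulation, the projection operator keeps the entries at known positions and sets all others to zero. A minimal Python sketch of that operator follows; the function name `project_onto_observed` and the representation of the known-index set as a Python `set` of `(row, column)` tuples are assumptions made for illustration.

```python
def project_onto_observed(matrix, observed):
    """Keep entries whose (row, col) index is in `observed`; zero the rest."""
    return [[value if (i, j) in observed else 0.0
             for j, value in enumerate(row)]
            for i, row in enumerate(matrix)]


M = [[1.0, 2.0], [3.0, 4.0]]
omega = {(0, 0), (1, 1)}  # indices of the known entries
projected = project_onto_observed(M, omega)
```

The constraint in the completion problem then simply requires the recovered matrix and the data matrix to agree after this projection.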

2.3. Matrix Imputation Based on Low-Rank Matrix Representation

The low-rank matrix imputation method mentioned above directly minimizes the rank of the imputed data. To improve imputation efficiency, a self-expression is applied to LRMD, which is called the low-rank matrix representation [49, 50]. The data matrix $X$ is represented as a linear combination of the columns of a dictionary matrix $A$, that is, $X = AZ$. The matrix $Z$ is the coefficient matrix, and it is expected to be low rank. $Z$ can be obtained by solving the following optimization problem:

$$\min_{Z}\ \operatorname{rank}(Z) \quad \text{s.t.}\quad X = AZ. \tag{6}$$

Equation (6) can be convexly relaxed to obtain the following:

$$\min_{Z}\ \|Z\|_* \quad \text{s.t.}\quad X = AZ. \tag{7}$$

If the data matrix $X$ itself is selected as the dictionary matrix, (7) can be written as follows:

$$\min_{Z}\ \|Z\|_* \quad \text{s.t.}\quad X = XZ. \tag{8}$$

In practical applications, the data matrix may be disturbed by noise. In order to enhance the robustness, (8) can be revised as follows:

A data matrix $X$ is represented by a data dictionary $A$, and the coefficient matrix $Z$ is sparser when $A$ has higher similarity with $X$. However, stochastic noise is usually present in the data matrix $X$, which weakens the correlation within the data matrix; when the stochastic noise is removed, this correlation is enhanced. Selecting $X$ itself as the dictionary essentially reveals the correlation within the data matrix. When the coefficient matrix $Z$ is sparse, each column of the data matrix $X$ is represented by the other columns with as few coefficients as possible. Traffic flow data have high spatiotemporal correlation, but they are affected by weather, holidays, and other factors, which gives them stochastic volatility. Therefore, if the influence of this stochastic volatility on the traffic data is removed, the correlation between the traffic data is enhanced. After the influence of stochastic noise is removed, the correlation within the data is further explored, and the similarity between the data is expressed with as little information as possible. The internal correlation of traffic flow data is then used to impute the data.

2.4. The Solution of the Coefficient Matrix

In order to obtain the solution of (9), a variable is introduced and let to separate the variable . The coefficient matrix can be calculated with the Augmented Lagrange Multiplier method, and the optimization model becomes the following:

Construct an augmented Lagrange function as (11), where is a Lagrange multiplier, is the Frobenius norm, that is, the square root of the sum of the squared absolute values of the elements, and is a weight to tune the error term .

The Exact Augmented Lagrange Multiplier (EALM) method is used to solve the matrices and according to the following:

The updating of the coefficient matrix is as follows. Firstly, a projection matrix is used to express the unmissing position of the matrix , and . For convenience, set and (13) can be expressed as follows:

In order to get a derivative about in (14), the cross product should be changed to inner product. The matrices of (14) are spread in column as follows:

where , , and are, respectively, the column of matrices , , and .

Change vector to a diagonal matrix, i.e., and . Therefore, (15) can be expressed as follows:

For simplifying (16), is denoted as , and is denoted as . Then (16) can be simplified as follows:

For (17), can be updated by the following:

Then repeat the above process until the objective function converges. The coefficient matrix is obtained when the termination condition is met, and it is expressed as follows:

3. Traditional Imputation Method with LRMD

The traditional method imputes the missing data by a zero-padding operation. For an original matrix , suppose that is missing, where represents the column in . The missing column of the matrix is filled with zeros, which can be represented as a matrix :

where is the specific elements in the matrix .

Multiplying by the column of coefficient matrix , can be recovered by the following:

The traditional matrix imputation method uses the zero-padding operation to fill the missing column. The reconstructed matrix is then multiplied by the corresponding column of the coefficient matrix to obtain the imputed data of the missing column. This method uses only the data around the missing column to impute the missing data; that is, the missing column does not contribute to the imputation result. Generally, the sensor that contains the missing data has the highest correlation with the final imputation result. Setting the missing data directly to zero therefore ignores the effect of that sensor on the imputation result and reduces the imputation accuracy.
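The zero-padding procedure described in this section can be sketched in a few lines of Python. The helper names, the toy data, and the coefficient values are hypothetical; in particular, the coefficient matrix here is made up for illustration rather than solved from the optimization problem.

```python
def zero_pad_column(matrix, col):
    """Return a copy of `matrix` with column `col` replaced by zeros."""
    return [[0.0 if j == col else v for j, v in enumerate(row)]
            for row in matrix]


def matvec(matrix, vector):
    """Multiply a list-of-lists matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def impute_zero_padded(matrix, coeffs, col):
    """Recover column `col` as (zero-padded matrix) x (column `col` of coeffs)."""
    padded = zero_pad_column(matrix, col)
    coeff_col = [row[col] for row in coeffs]
    return matvec(padded, coeff_col)


# Toy data: rows are time steps, columns are sensors; column 0 is missing.
X = [[0.0, 10.0, 20.0],
     [0.0, 12.0, 24.0]]
Z = [[0.5, 0.0, 0.0],   # hypothetical coefficient matrix
     [0.25, 1.0, 0.0],
     [0.5, 0.0, 1.0]]
recovered = impute_zero_padded(X, Z, 0)
```

Because the missing column is zero, the diagonal weight `Z[0][0] = 0.5` is multiplied by zero and never influences the result, which is the shortcoming the section points out.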

4. Traffic Data Imputation with ILRMD

Missing data can generally be divided into three types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing at Determinate (MAD). This paper mainly deals with determinate missing data. In road networks, traffic data are collected by various types of sensors and usually exhibit high temporal-spatial correlation; that is, traffic data have a low-rank characteristic.

In a road network, suppose that there are sensors and each sensor has data samples, which can be denoted as a data matrix . This paper assumed that the data in the sensor is missing in . The traditional imputation method based on LRMD failed to consider the impact of missing data columns on imputation results. In order to address this shortcoming and combine the temporal-spatial correlation characteristics of traffic flow, this paper proposes a data imputation method based on ILRMD.

4.1. The Proposed ILRMD Model

In (9), it is assumed that , are the elements of the () observed sensor at the () time, respectively, existing in the observed matrix and the noise matrix . is the element of the coefficient matrix , and the coefficient matrix . According to the multiplication rule, the following is obtained:

Then, (22) can be transformed into the following:

The coefficient matrix of the observed sensor can be expressed as follows:

The final coefficient matrix of all observed sensors is described as follows:

Assuming that represents the matrix that removes the column. According to the matrix multiplication rule, the matrix is multiplied by the column of the coefficient matrix . The value is obtained and can be noted as follows:

The ILRMD method proposed in this paper assumes that a certain column of data in the matrix is lost and then multiplies the matrix by the coefficient matrix to recover the missing data. The influence of all observed sensors is considered including the sensor with missing data. In (24), if the value is zero, the data of the surrounding sensors is used for imputation. If the value is not zero, both the data of the surrounding sensors and the sensor including missing data are used.
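Equations (22)-(24) are not reproduced in this text, so the following Python sketch shows only one plausible reading of the transformation, stated here as an assumption: if each column is represented by all columns (itself included) and the noise term is ignored, the self weight of the missing sensor can be moved to the left-hand side, and the remaining weights are rescaled by one minus that self weight. The function name, the toy matrices, and the rescaling formula itself are illustrative assumptions rather than the authors' implementation.

```python
def impute_with_self_weight(matrix, coeffs, col):
    """Estimate column `col` from the other columns, rescaled by 1 - z_jj.

    Assumes the self-representation x_j = sum_k x_k * z_kj holds (noise
    ignored), so x_j = sum_{k != j} x_k * z_kj / (1 - z_jj).
    """
    z_jj = coeffs[col][col]
    if abs(1.0 - z_jj) < 1e-12:
        raise ValueError("a self weight of 1 leaves the column undetermined")
    estimate = []
    for row in matrix:
        # Only the other sensors' readings are used; the missing entry is skipped.
        total = sum(v * coeffs[k][col]
                    for k, v in enumerate(row) if k != col)
        estimate.append(total / (1.0 - z_jj))
    return estimate


# Toy data: column 0 is missing (None); Z is a hypothetical coefficient matrix.
X = [[None, 10.0, 20.0],
     [None, 12.0, 24.0]]
Z = [[0.5, 0.0, 0.0],
     [0.25, 1.0, 0.0],
     [0.5, 0.0, 1.0]]
estimate = impute_with_self_weight(X, Z, 0)
```

In this toy case the column `[25.0, 30.0]` satisfies the self-representation exactly, and the rescaled estimate recovers it, whereas multiplying the zero-padded matrix by the same coefficient column would give `[12.5, 15.0]`. When the self weight is zero, the rescaling has no effect and only the surrounding sensors contribute.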

The differences between the ILRMD method and the traditional imputation method are as follows. The traditional imputation method performs the zero-padding operation on the missing column and then directly multiplies the padded matrix by the corresponding column of the coefficient matrix . It utilizes the data collected from the surrounding sensors to recover the matrix and ignores the effect of the sensor with missing data. The ILRMD method assumes that the column of the data is completely missing and the matrix represents the matrix after removing the column data. After the conversion, the weight that is most relevant to each sensor itself is expressed in another form, in order to reduce the effect of the most relevant weight on the imputation result. From (22)-(24), a coefficient matrix is obtained. The coefficient matrix considers not only the surrounding sensors, but also the influence of the sensor with missing data. Finally, the matrix is multiplied by the coefficient matrix to obtain the imputation result.

The main steps of the proposed imputation method are as follows.

Step 1. The traffic flow data is preprocessed by smoothing and filtering, and the complete traffic flow data of one day is randomly selected to construct the training matrix .

Step 2. The preprocessed matrix is decomposed into the low-rank matrix and the sparse matrix according to (1).

Step 3. According to (9), matrix is decomposed into and , and, from (10) to (20), the coefficient matrix is solved.

Step 4. Construct test matrix and set matrix as the dictionary matrix. represents the matrix that removes the column.

Step 5. The coefficient matrix is obtained according to (25) and the missing data which need to be imputed is obtained by (26).

4.2. Performance Evaluation Criteria

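The evaluation criteria used to measure the error of imputed data include root mean square error (RMSE), mean absolute error (MAE), mean squared percentage error (MSPE), and mean absolute percentage error (MAPE). The RMSE and MAPE are selected in this paper. The formulas are as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}, \tag{27}$$

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%, \tag{28}$$

where $N$ is the total number of missing data points, $y_i$ is the actual value of the $i$-th missing data point, and $\hat{y}_i$ is the corresponding estimated value.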
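The two selected criteria can be computed as in the following Python sketch, which uses the standard definitions of RMSE and MAPE. Expressing MAPE in percent and requiring nonzero actual values are assumptions of this sketch, and the sample numbers are made up.

```python
import math


def rmse(actual, estimated):
    """Root mean square error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, estimated)) / n)


def mape(actual, estimated):
    """Mean absolute percentage error in percent; actual values must be nonzero."""
    n = len(actual)
    return 100.0 * sum(abs(a - e) / abs(a)
                       for a, e in zip(actual, estimated)) / n


# Made-up flows for three missing data points.
actual = [100.0, 200.0, 400.0]
estimated = [110.0, 190.0, 400.0]
rmse_value = rmse(actual, estimated)
mape_value = mape(actual, estimated)
```

RMSE is in the units of the traffic flow and penalizes large deviations more heavily, while MAPE is scale-free, which is why the two are often reported together.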

5. Experiment Results

5.1. Data Description

The data used to evaluate the performance of the proposed model were collected from mainline detectors in the PeMS database, which includes more than 39,000 individual sensors that span the highway system in all major metropolitan areas of California. In this paper, 46 mainline sensors numbered from 1108512 to 1221232 are selected for the data imputation test, covering April 1st, 2018, to April 30th, 2018. The traffic flow data are aggregated at 5-minute intervals, which gives 288 data points per day. Training matrices were constructed from 1 day, 7 days, and 14 days of data; the experimental results show that the imputation accuracy does not improve noticeably as the training set grows. Therefore, a single day is used for training. According to the analysis of the spatial-temporal correlation characteristics of traffic flow, the data on the same weekday in consecutive weeks have high regularity and relevancy, so this paper selects the same weekday in two consecutive weeks (two Mondays) for the experiment. The traffic flow data of the 46 observed sensors on April 23rd, 2018, are used as the training matrix, and the data on April 30th, 2018, are used as test data, where the data of sensor 1108512 are assumed to be missing and need to be imputed.

Due to the influence of travel demand, weather, and other factors, traffic flow data present a certain degree of stochastic fluctuation and abrupt change. To reduce the impact of this stochastic fluctuation on the imputation results, a five-point smoothing filter was used to preprocess the data. The original and filtered data of sensor 1108512 on April 8, 2018, are shown in Figure 1.

From Figure 1, it can be seen that the filtered data intuitively reflects the regularity of the traffic data, and the abrupt points are effectively filtered out in the original traffic flow data.
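The exact form of the five-point smoothing filter is not specified in the text. As a minimal sketch, the Python code below assumes a centered moving average over five points, with a shorter window at the two ends of the series; both the function name and this edge handling are assumptions.

```python
def five_point_smooth(series):
    """Centered moving average over up to five points (shorter at the edges)."""
    n = len(series)
    smoothed = []
    for i in range(n):
        lo = max(0, i - 2)       # two points before, if available
        hi = min(n, i + 3)       # two points after, if available
        window = series[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Toy series with one abrupt point at index 2.
flow = [10.0, 10.0, 60.0, 10.0, 10.0, 10.0]
smoothed = five_point_smooth(flow)
```

In the toy series the abrupt value of 60 is reduced to 20 after smoothing, which mirrors the suppression of abrupt points described for Figure 1.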

In this paper, both the training data and the test data are first preprocessed with the smoothing filter, which removes the abnormal points in the sensor data. Then the data of a randomly chosen sensor are assumed to be missing and are imputed with the proposed model.

5.2. Results and Performances Analysis
5.2.1. Influence of Parameter

The compromise factor is an important parameter of low-rank matrix decomposition, and its value has an important impact on the performance of data imputation. In order to verify the effectiveness of the ILRMD method, the influence of this parameter is analyzed. The changes of the RMSE and MAPE of the imputation results with the compromise factor are shown in Figures 2(a) and 2(b), respectively.

From Figure 2, we can see that, for the traditional MC method, both RMSE and MAPE gradually decrease as the compromise factor $\lambda$ increases, reach their minimum at $\lambda = 0.08$, and then increase again. For the ILRMD method, RMSE and MAPE also decrease at first as $\lambda$ changes; they reach their minimum at $\lambda = 0.15$ and then increase slowly. In all cases, the traditional MC method is far less effective than the ILRMD method. Therefore, in order to compare the two methods in their best states, $\lambda$ is set to 0.08 for the traditional MC method and 0.15 for the ILRMD method in this paper.
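A parameter study of this kind amounts to evaluating the imputation error on a grid of compromise factors and keeping the value with the smallest error. The Python sketch below illustrates that selection step only. The grid values and the function `toy_error` are hypothetical stand-ins for a real train-and-evaluate routine and do not reproduce the measured curves in Figure 2.

```python
def select_best_lambda(candidates, error_for):
    """Return (best_lambda, best_error) minimizing error_for over candidates."""
    best = min(candidates, key=error_for)
    return best, error_for(best)


def toy_error(lam):
    """Hypothetical error curve with its minimum at 0.15 (not measured data)."""
    return (lam - 0.15) ** 2 + 1.0


grid = [0.01, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30]
best_lambda, best_error = select_best_lambda(grid, toy_error)
```

In practice `error_for` would solve for the coefficient matrix with the given compromise factor, impute the held-out sensor, and return the RMSE or MAPE.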

5.2.2. The Selection of the Training Data

Because traffic flow has high spatial-temporal correlation, it is necessary to analyze the effect of different training data on the imputation results. In practice, the selection of training data has little influence on the performance of the proposed ILRMD method. To show that the performance of the proposed method is not sensitive to the training day, the traffic flow data of four days (April 21st, 22nd, 23rd, and 24th, 2018) are randomly selected as training data to impute the data of April 30th, 2018. The experimental results are shown in Figures 3(a), 3(b), 3(c), and 3(d).

It can be seen from Figure 3 that the proposed ILRMD method consistently performs well and is not sensitive to the selection of training data. The imputation performance for the different training data is shown in Table 1.

It can be seen from Table 1 that the proposed method consistently performs well although different training data are used. The results indicate that the choice of training day has little influence on the proposed ILRMD method. Therefore, only the traffic flow data of one day (April 23rd, 2018) are used to verify the proposed model in this paper.

5.2.3. Comparison of Imputation Results

To verify the performance of the ILRMD method, the proposed method is compared with the traditional method. The imputation results of the ILRMD method under its best condition ($\lambda = 0.15$) and of the traditional method under its best condition ($\lambda = 0.08$) are shown in Figures 4(a) and 4(b).

From Figure 4, it can be seen that the imputation results obtained with the ILRMD method are more accurate than those of the traditional MC method. Even when the traditional MC method uses its optimal compromise factor, there is a large deviation between its imputation result and the real data, whereas the ILRMD method recovers the missing traffic data more accurately. When the compromise factor is set to the optimal value for the ILRMD method, the imputation result is almost identical to the real value. The imputation results of the proposed ILRMD method follow traffic patterns similar to those of the real traffic flow, especially in the morning and evening peak hours.

5.2.4. The Comparison of ILRMD and Other Imputation Methods

To evaluate the advantages of the proposed approach, the ARIMA, SVR, DBN-SVR, WNN, KNN, and traditional MC imputation methods are tested on the same experimental data. In the ARIMA model, the autoregressive order $p$, the moving average order $q$, and the differencing order $d$ are set to 5, 5, and 1, respectively. In the SVR model, the kernel function is configured as “”, the number of iterations is 10,000, and the penalty factor is 0.01. In the WNN model, the number of iterations is 1000 and the number of hidden layer nodes is 3. In the DBN-SVR model, the number of network layers in the DBN is set to 3 and the number of iterations is 200. The ILRMD model proposed in this paper is compared with these imputation methods; the imputation results of the different models and the real traffic flow over one day are shown in Figure 5.

As can be seen from Figure 5, the imputed traffic flow has traffic patterns similar to those of the real traffic flow. The DBN-SVR model has the worst imputation performance; the ARIMA, SVR, KNN, and WNN models are better than DBN-SVR but weaker than the ILRMD method. The imputed values of the proposed ILRMD model almost coincide with the measured data, which shows that the proposed ILRMD model has better imputation performance.

The error analysis is conducted using the two error evaluation criteria, and the results are reported in Table 2. To verify the performance of the proposed model more thoroughly, another sensor, numbered 1119921, is randomly selected for testing. In Table 2, the data of sensors 1108512 and 1119921 are each assumed to be missing and are imputed. As can be seen from Table 2, the proposed ILRMD model has the best performance among all approaches for both sensors. These experiments verify that the ILRMD model proposed in this paper is an effective method for data imputation.

For the first condition in Table 2 (sensor 1108512), the imputation accuracy of the ILRMD model improves by 93.01%, 74.61%, 95.96%, 80.57%, 96.30%, and 81.97% compared with the traditional MC, SVR, ARIMA, KNN, DBN-SVR, and WNN methods, respectively. On average, the imputation accuracy is 87.07% higher than that of the other imputation methods. These results demonstrate that the proposed ILRMD model has the best performance among the compared approaches and is an effective method for data imputation.
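The reported average can be checked directly from the six per-method improvements quoted in the text. The short Python sketch below does that arithmetic. The helper `relative_improvement` shows one common way such percentages are derived from two error values; the text does not state which formula or error metric was used, so that helper is an assumption.

```python
def relative_improvement(baseline_error, proposed_error):
    """Percentage reduction in error relative to the baseline (assumed formula)."""
    return 100.0 * (baseline_error - proposed_error) / baseline_error


# Improvements reported in the text for sensor 1108512, in percent.
reported = {
    "Traditional MC": 93.01,
    "SVR": 74.61,
    "ARIMA": 95.96,
    "KNN": 80.57,
    "DBN-SVR": 96.30,
    "WNN": 81.97,
}
average = sum(reported.values()) / len(reported)
```

Rounded to two decimals, the mean of the six values is 87.07, which agrees with the average stated in the text.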

6. Conclusions and Recommendations

In this paper, a data imputation method is proposed to impute missing traffic flow data. Unlike most known traffic flow data imputation methods, the ILRMD model makes effective use of the information of the sensor with missing data and takes full advantage of the high spatiotemporal correlation of traffic flow data. The experimental results show that the proposed imputation method is superior to the other methods. However, this paper focuses on missing traffic data at a single sensor; only one observed sensor with missing data is considered. In practice, missing traffic data are often distributed over multiple sensors.

In our future research, missing data on multiple sensors will be studied. The concept of missing rate can be introduced, and more effective data imputation methods can be proposed for different degrees of missing data in order to improve the imputation accuracy.

Data Availability

The data used in this paper were collected from the Caltrans Performance Measurement System (PeMS) for 46 sensors numbered from 1108512 to 1221232 over 04/01/2018~04/27/2018. Researchers who wish to obtain these data can log into the PeMS website: http://pems.dot.ca.gov/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was partly supported by the National Key R&D Program of China (2018YFC0808706) and the National Natural Science Foundation of China (Grant no. 5157081053). The authors are also grateful to the PeMS for providing the data.