Abstract

Concern about missing traffic data has grown in recent years. In this paper, a robust missing traffic flow data imputation approach based on matrix completion is proposed. The method exploits the day-to-day similarity of traffic flow by assuming that the constructed traffic flow matrix is low-rank. The physical limits of nonnegativity and road capacity are also enforced in the optimization process, which prevents the method from producing negative or overcapacity values. Moreover, the proposed algorithm can impute missing data and recover outliers in a unified framework. The experimental results show that the proposed method is more accurate, stable, and reasonable than the compared methods.

1. Introduction

Traffic information collected by various kinds of sensors is a vital component of intelligent transportation systems (ITS), which aim to influence travel behavior, reduce traffic congestion, improve mobility, and enhance air quality [1]. For example, real-time traffic information can be provided to drivers before and during their trips to support route choice decisions [2], and it is also an important input for modern traffic control systems when adjusting signal timing [3]. Moreover, after proper preprocessing, real-time traffic data can be used to estimate the real-time traffic state of transportation networks [4]. In addition, several data mining techniques have been applied to mine time-related association rules from historical traffic databases, and the results have been used for traffic prediction, as in the works of Qiao et al. [5] and Zargari et al. [6].

However, missing traffic data remain inevitable in many places because of detector faults or transmission distortion. About 10% of daily traffic flow data is usually missing in Beijing, China [7]. Turner et al. [8] reported that almost a quarter of the data from San Antonio, Texas, is missing, and more than 5% of the data are lost within the PeMS traffic flow database [9]. Missing data adversely affect ITS applications; for example, a traffic control system requires sufficient traffic flow data (i.e., traffic volumes, occupancy rates, and flow speeds) to generate appropriate traffic management strategies [10, 11]. In traffic forecasting, prediction performance degrades sharply when data are missing [12, 13]. Clearly, the missing data problem is a major obstacle for any function that relies on ITS data.

In the past decades, numerous imputation methods have been proposed to handle the missing traffic data problem. These imputation methods can be roughly divided into two groups: interpolation-based methods and inductive-learning-based methods.

Interpolation-based methods fill the missing data with a weighted average calculated from part of the known data. Yin et al. [14] use historical averages from the same detector at the same time period on neighboring days to replace missing data. Zhong et al. [15] interpolate erroneous traffic data using traffic data from similar daily flow variation patterns, considering the error type and traffic condition. Such approaches use only part of the traffic information and often fail to estimate missing values accurately at high missing ratios.

On the other hand, inductive learning methods try to build imputation models from a priori characteristics of traffic data. Most of these methods are based on some assumptions about traffic flow data. The Autoregressive Integrated Moving Average (ARIMA) model is based on the assumption that the historical and future values of traffic flow provide an indication of the missing value [16]. The probabilistic principal component analysis (PPCA) based methods assume that the basic characteristics of traffic flow variations can be captured by the probability distribution of PPCA [7, 17]. The recent tensor-based methods assume that traffic data are highly correlated across multiple modes (day, week, link) and arrange traffic flow data into a multiway array (tensor) to capture these correlations. By exploiting these essential characteristics of traffic flow, such imputation methods often outperform traditional interpolation methods [7]. Most inductive missing traffic data imputation methods make the following assumptions.

Assumption 1. Traffic flow is highly similar from day to day, from week to week, and from a link to its neighboring links, and this similarity can be utilized to impute missing traffic flow.

Assumption 2. The traffic flow data have not been spoiled by outliers, although outliers frequently occur in real-world traffic information systems.

While the inductive imputation methods achieve some success under these assumptions, they still have shortcomings. First, according to traffic flow theory, the traffic volume takes a value between zero and the road capacity (the maximum traffic flow obtainable on a given roadway using all available lanes), but most data-driven imputation methods ignore this limit. Second, traditional methods such as the PCA-based methods cannot work well with large outliers or errors unless the corrupted traffic data are preprocessed [7]. We have previously proposed several tensor completion methods [17–20], which make full use of the multimode correlations [21] of traffic flow data, to impute missing traffic data. However, in our former works, the nonnegativity (lower bound) and capacity (upper bound) of traffic flow are still ignored. As a result, these methods may produce unreliable results.

To tackle these shortcomings, this paper proposes a traffic flow data imputation method based on matrix completion. In the proposed method, the traffic data are arranged into a matrix. The day-mode similarity is captured by the assumption that the constructed matrix is low-rank. By adding a constraint to the optimization problem, the proposed method restricts the reconstructed traffic flow data to lie between zero and the road capacity. Moreover, for traffic flow data corrupted by outliers, the proposed method can simultaneously impute the missing data and recover the outliers through a sparsity assumption. It should be noted that the method proposed here can be considered an idealized version of robust PCA (RPCA), but it differs from the natural approaches to robust PCA [22]. The proposed approach does not need to preprocess the data and isolate the outliers before imputation. This advantage allows the proposed matrix completion (MC) method to outperform traditional imputation methods, especially for traffic flow corrupted by outliers.

The rest of this paper is organized as follows. The methodology and algorithm of the proposed method are presented in Section 2. Section 3 presents the imputation test results, including a comparison with other methods. The conclusion and future work are given in Section 4.

2. Methodology

In this section, a brief description of matrix completion is presented. Then, we describe the proposed missing traffic flow data imputation method in detail.

2.1. Review of Matrix Completion Methods

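Let $M$ be an $n_1 \times n_2$ matrix of rank $r$ ($r < \min(n_1, n_2)$); the low-rank matrix has some available sampled entries $\{M_{ij} : (i,j) \in \Omega\}$, where $\Omega$ is a subset of cardinality $m$. Then [17] proves that most matrices of rank $r$ can be perfectly reconstructed by solving the optimization problem:

$$\min_{X} \ \|X\|_* \quad \text{s.t.} \quad X_{ij} = M_{ij}, \ (i,j) \in \Omega. \tag{1}$$

In (1), the functional $\|X\|_*$ is the nuclear norm of the matrix $X$, which is the sum of its singular values.

In [23], the singular value thresholding (SVT) algorithm is used to solve an approximate optimization problem of (1):

$$\min_{X} \ \tau\|X\|_* + \frac{1}{2}\|X\|_F^2 \quad \text{s.t.} \quad P_\Omega(X) = P_\Omega(M), \tag{2}$$

where $P_\Omega$ is the orthogonal projector onto the span of matrices vanishing outside of $\Omega$, so that the $(i,j)$th component of $P_\Omega(X)$ is equal to $X_{ij}$ if $(i,j) \in \Omega$ and zero otherwise.

The Lagrangian of (2) is

$$L(X, Y) = \tau\|X\|_* + \frac{1}{2}\|X\|_F^2 + \langle Y, P_\Omega(M - X) \rangle,$$

where $Y \in \mathbb{R}^{n_1 \times n_2}$ is the multiplier, with optimization variable $X$. Fix $\tau > 0$ and a sequence $\{\delta_k\}_{k \ge 1}$ of scalar step sizes. Then, starting with $Y^0 = 0$, the algorithm inductively defines

$$X^k = D_\tau(Y^{k-1}), \qquad Y^k = Y^{k-1} + \delta_k P_\Omega(M - X^k),$$

where

$$D_\tau(Y) = U D_\tau(\Sigma) V^T, \qquad Y = U \Sigma V^T.$$

$D_\tau(\Sigma)$ is the constriction factor of $\Sigma$, where

$$D_\tau(\Sigma) = \mathrm{diag}\big(\{(\sigma_i - \tau)_+\}\big).$$

More details for SVT can be found in [24].

In [25], an augmented Lagrange multiplier method is developed to solve MC. In this method, the MC problem is formulated as

$$\min_{A} \ \|A\|_* \quad \text{s.t.} \quad A + E = M, \ P_\Omega(E) = 0,$$

as $E$ will compensate for the unknown entries of $M$, and the unknown entries of $M$ are simply set to zero. Then the partial augmented Lagrangian function is

$$L(A, E, Y, \mu) = \|A\|_* + \langle Y, M - A - E \rangle + \frac{\mu}{2}\|M - A - E\|_F^2. \tag{8}$$

Then, $A$ and $E$ are updated according to the subproblems of (8), where $A$ is updated by

$$A_{k+1} = U S_{\mu_k^{-1}}[\Sigma] V^T, \qquad (U, \Sigma, V) = \mathrm{svd}\big(M - E_k + \mu_k^{-1} Y_k\big),$$

and $E$ is updated by

$$E_{k+1} = P_{\bar{\Omega}}\big(M - A_{k+1} + \mu_k^{-1} Y_k\big).$$

Here $S_\varepsilon[x] = \mathrm{sgn}(x)\max(|x| - \varepsilon, 0)$ is applied elementwise and $\bar{\Omega}$ is the complement of $\Omega$.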

More details of the augmented Lagrange multiplier (ALM) method can be found in [25].
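The singular value shrinkage step shared by SVT and the ALM update can be sketched in a few lines of NumPy. This is a minimal illustration only, not the implementation of [23] or [25]; the function names `soft_threshold` and `singular_value_threshold`, the threshold value, and the random test matrix are assumptions introduced here.

```python
import numpy as np


def soft_threshold(x, eps):
    """Elementwise shrinkage: sgn(x) * max(|x| - eps, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)


def singular_value_threshold(Y, tau):
    """Shrink every singular value of Y by tau and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # singular values are nonnegative
    return (U * s_shrunk) @ Vt            # same as U @ diag(s_shrunk) @ Vt


# Hypothetical example: shrink a small random matrix with threshold 1.0.
rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 4))
X = singular_value_threshold(Y, tau=1.0)
```

Singular values below the threshold are set to zero, so the output generally has lower rank than the input, which is what drives the iterates toward a low-rank solution.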

2.2. The Proposed Algorithm

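The goal of the proposed method is to impute the missing traffic data while accounting for both the physical limits of traffic flow data and possible corruption by outliers. First, the traffic flow data at a given location are arranged into a matrix as follows:

$$M = \begin{bmatrix} m_{11} & m_{12} & \cdots & m_{1 n_2} \\ m_{21} & m_{22} & \cdots & m_{2 n_2} \\ \vdots & \vdots & \ddots & \vdots \\ m_{n_1 1} & m_{n_1 2} & \cdots & m_{n_1 n_2} \end{bmatrix}.$$

In this matrix, $m_{ij}$ represents the discretized volume on day $i$ at time interval $j$ within the given day. Owing to the physical limitation, $m_{ij}$ varies in the range from zero to the road capacity $C$. The total number of days is $n_1$ and each day is divided into $n_2$ time intervals. Suppose that the set of observed traffic volume data is $\{m_{ij} : (i,j) \in \Omega\}$ and that the traffic volume is corrupted by sparse outliers. The missing traffic data imputation problem is then translated into a corrupted matrix completion problem [26–28], and the optimization problem can be described as

$$\min_{A, E} \ \mathrm{rank}(A) + \lambda\|E\|_0 \quad \text{s.t.} \quad P_\Omega(A + E) = P_\Omega(M), \ 0 \le A_{ij} \le C, \tag{12}$$

where $A$ is the low-rank traffic volume matrix to be recovered, $E$ is the sparse outlier matrix, $\lambda$ is a weight parameter, $\|E\|_0$ represents the number of nonzero entries of $E$, and $P_\Omega$ is the orthogonal projector onto the span of matrices vanishing outside of $\Omega$.

Minimizing the $\ell_0$ norm and minimizing the rank of a matrix are NP-hard problems [29]. To convert the objective function (12) into a convex optimization problem, the rank of $A$ is approximated by the nuclear norm $\|A\|_*$ (the sum of the singular values) of the matrix, and the $\ell_0$ norm of $E$ is approximated by the $\ell_1$ norm $\|E\|_1$ of the matrix (the sum of the absolute values of its entries) [29] as follows:

$$\min_{A, E} \ \|A\|_* + \lambda\|E\|_1 \quad \text{s.t.} \quad P_\Omega(A + E) = P_\Omega(M), \ 0 \le A_{ij} \le C. \tag{13}$$

Because the constraint $A + E = M$ is easier to handle than $P_\Omega(A + E) = P_\Omega(M)$, the problem is converted into the following form:

$$\min_{A, E} \ \|A\|_* + \lambda\|P_\Omega(E)\|_1 \quad \text{s.t.} \quad A + E = M, \ 0 \le A_{ij} \le C. \tag{14}$$

Considering its faster computation speed and higher accuracy, the augmented Lagrangian method (ALM) [25] is employed to solve the problem.

By introducing a Lagrange multiplier $Y$ to remove the equality constraint, one has the augmented Lagrangian function of (14):

$$L(A, E, Y, \mu) = \|A\|_* + \lambda\|P_\Omega(E)\|_1 + \langle Y, M - A - E \rangle + \frac{\mu}{2}\|M - A - E\|_F^2. \tag{15}$$

Lin et al. [25] proved that updating $A$ and $E$ once when solving this subproblem is sufficient for $A$ and $E$ to converge to the optimal solution of (15). $A$ is updated by

$$A_{k+1} = P_{[0,C]}\big(U S_{\mu_k^{-1}}[\Sigma] V^T\big), \qquad (U, \Sigma, V) = \mathrm{svd}\big(M - E_k + \mu_k^{-1} Y_k\big),$$

where $S_\varepsilon$ is the constriction factor, where

$$S_\varepsilon[x] = \mathrm{sgn}(x)\max(|x| - \varepsilon, 0).$$

$E$ is updated by

$$E_{k+1} = P_\Omega\big(S_{\lambda\mu_k^{-1}}[M - A_{k+1} + \mu_k^{-1} Y_k]\big) + P_{\bar{\Omega}}\big(M - A_{k+1} + \mu_k^{-1} Y_k\big),$$

where $\bar{\Omega}$ is the complement of $\Omega$. $P_{[0,C]}$ is the projector onto the set of matrices whose entries range from $0$ to $C$. This leads to a matrix completion based traffic data imputation method (MCI) described in Algorithm 1.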

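Algorithm 1: Matrix completion based traffic data imputation (MCI).
Input: Observation samples $m_{ij}$, $(i,j) \in \Omega$, of matrix $M$
(1) $Y_0 = 0$; $E_0 = 0$; $\mu_0 > 0$; $k = 0$
(2) while not converged do
(3) $(U, \Sigma, V) = \mathrm{svd}(M - E_k + \mu_k^{-1} Y_k)$,
(4) $A_{k+1} = P_{[0,C]}(U S_{\mu_k^{-1}}[\Sigma] V^T)$,
(5) $E_{k+1} = P_\Omega(S_{\lambda\mu_k^{-1}}[M - A_{k+1} + \mu_k^{-1} Y_k]) + P_{\bar{\Omega}}(M - A_{k+1} + \mu_k^{-1} Y_k)$
(6) $Y_{k+1} = Y_k + \mu_k (M - A_{k+1} - E_{k+1})$
(7) updating $\mu_k$ to $\mu_{k+1}$
(8) $k \leftarrow k + 1$
(9) end while
(10) output $(A_k, E_k)$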
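The loop of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading under stated assumptions, not the authors' code: the function name `mci_sketch`, the initial penalty, its growth factor `rho` and cap, the default `lam`, and the synthetic rank-one data are all choices made here for the example. A full SVD is used for simplicity.

```python
import numpy as np


def soft_threshold(x, eps):
    """Elementwise shrinkage: sgn(x) * max(|x| - eps, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)


def mci_sketch(M, observed, capacity, lam=100.0, rho=1.5, tol=1e-2, max_iter=150):
    """ALM-style loop with a [0, capacity] projection (illustrative only).

    M        : days x intervals volume matrix (unobserved values are ignored)
    observed : boolean mask, True where the volume was recorded
    Returns (A, E): bounded low-rank estimate and outlier/compensation term.
    """
    M = np.where(observed, M, 0.0)        # unknown entries are set to zero
    norm_M = np.linalg.norm(M)            # Frobenius norm for the stopping rule
    mu = 1.25 / np.linalg.norm(M, 2)      # assumed initial penalty
    mu_max = 1e7 * mu                     # assumed cap on the penalty
    A = np.zeros_like(M)
    E = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(max_iter):
        # A-step: singular value shrinkage, then projection onto [0, capacity].
        U, s, Vt = np.linalg.svd(M - E + Y / mu, full_matrices=False)
        A = np.clip((U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt, 0.0, capacity)
        # E-step: sparse shrinkage on observed entries, free elsewhere.
        R = M - A + Y / mu
        E = np.where(observed, soft_threshold(R, lam / mu), R)
        # Multiplier and penalty updates.
        residual = M - A - E
        Y = Y + mu * residual
        mu = min(rho * mu, mu_max)
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return A, E


# Synthetic example: 20 days x 48 intervals, rank one, about 30% missing.
rng = np.random.default_rng(0)
profile = 200.0 + 150.0 * np.sin(np.linspace(0.0, np.pi, 48))
truth = np.outer(rng.uniform(0.8, 1.2, size=20), profile)
observed = rng.random(truth.shape) > 0.3
A_hat, E_hat = mci_sketch(truth, observed, capacity=500.0)
```

Because the projection is applied inside every iteration, the returned estimate never leaves the interval from zero to the capacity, whatever the missing ratio.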

3. Experiments

The evaluation of an imputation method’s performance is a multiobjective problem. In this section, four key performance indicators of the proposed method are discussed: accuracy, stability, robustness, and computational complexity.

3.1. The Test Data

To evaluate the proposed method, a traffic flow dataset from the open PeMS database [9] is used. The dataset was collected from Detector 400141, which is located on northbound freeway I880. The freeway has four lanes under surveillance. The sampling period is from July 11, 2013 to July 30, 2013. The data are almost completely observed, with an observed ratio of 99.9%.

The underlying assumption is an important premise for missing traffic data imputation methods based on inductive learning. For the proposed MCI, the traffic volume matrix is assumed to be low-rank. The validity of this assumption is checked through the low-rank approximation of the original data. The low-rank approximation is computed by singular value decomposition according to the Eckart-Young theorem [30]. If we consider the singular value decomposition (SVD) of the constructed traffic volume matrix $M$, we get

$$M = U \Sigma V^T,$$

where the columns of $U$ and $V$ are the left-singular vectors and right-singular vectors of $M$, respectively. The diagonal entries of $\Sigma$ are equal to the singular values of $M$.

The full-rank matrix $M$ can be approximated by a low-rank matrix $\tilde{M}$ using the SVD of $M$, namely,

$$\tilde{M} = U \tilde{\Sigma} V^T,$$

where $\tilde{\Sigma}$ is the same matrix as $\Sigma$ except that it contains only the $r$ largest singular values (the other singular values are replaced by zero). The low-rank approximation of the selected traffic data is given in Figure 1.
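The low-rank check described above can be reproduced with a truncated SVD. The following sketch uses synthetic data in place of the PeMS volumes; the helper name `low_rank_approx`, the day and interval counts, the noise level, and the chosen rank are assumptions made for illustration.

```python
import numpy as np


def low_rank_approx(M, r):
    """Best rank-r approximation in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_trunc = s.copy()
    s_trunc[r:] = 0.0                     # keep only the r largest singular values
    return (U * s_trunc) @ Vt


# Synthetic stand-in for a days x intervals volume matrix:
# a shared daily profile scaled per day, plus small noise.
rng = np.random.default_rng(1)
profile = 100.0 + 80.0 * np.sin(np.linspace(0.0, np.pi, 288))
day_scale = rng.uniform(0.8, 1.2, size=20)
M = np.outer(day_scale, profile) + rng.normal(0.0, 2.0, size=(20, 288))

M_r = low_rank_approx(M, r=2)
rel_err = np.linalg.norm(M - M_r) / np.linalg.norm(M)
```

A small relative error at a small rank is what supports the low-rank hypothesis; on real data the same quantity can be examined for several values of the rank.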

As we can see from Figure 1, the low-rank approximation largely preserves the characteristics of the original data. From these results, it can be concluded that the low-rank hypothesis for the traffic volume matrix is reasonable.

3.2. Quantitative Measures

The set of measures, including MAE, MAPE, and SDE, allows one to directly evaluate the performance of multiple imputation techniques.

3.2.1. Accuracy

In this paper, the mean absolute percentage error (MAPE) is used to evaluate the performance of missing traffic data imputation. However, the MAPE tends to be lower when the traffic volumes are higher [31]. In view of this, this paper also applies the mean absolute error (MAE) as a complementary measure to the MAPE.

The mean absolute error (MAE) is defined as

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |x_i - \hat{x}_i|.$$

The mean absolute percentage error (MAPE) is defined as

$$\mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N} \frac{|x_i - \hat{x}_i|}{x_i},$$

where $N$ is the total number of missing data, $x_i$ is the observed value, and $\hat{x}_i$ is the reconstructed value.

3.2.2. Stability

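The standard deviation of errors (SDE) of the tested methods is evaluated. A smaller SDE means that the errors are tightly clustered around their mean value [29]:

$$\mathrm{SDE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (e_i - \bar{e})^2}, \qquad e_i = x_i - \hat{x}_i, \qquad \bar{e} = \frac{1}{N}\sum_{i=1}^{N} e_i,$$

where $x_i$ and $\hat{x}_i$ are the observed and reconstructed values and $N$ is the total number of missing data.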
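The three measures can be computed together as in the sketch below. It assumes that the SDE is the population standard deviation of the signed errors and that the MAPE is taken only over entries with a positive observed volume (to avoid division by zero); both are assumptions of this example, as are the function name `imputation_errors` and the toy values.

```python
import numpy as np


def imputation_errors(observed, reconstructed):
    """Return (MAE, MAPE in percent, SDE) over the imputed entries."""
    x = np.asarray(observed, dtype=float)
    x_hat = np.asarray(reconstructed, dtype=float)
    err = x - x_hat
    mae = np.mean(np.abs(err))
    positive = x > 0                      # assumption: skip zero-volume entries
    mape = 100.0 * np.mean(np.abs(err[positive]) / x[positive])
    sde = np.std(err)                     # population standard deviation
    return mae, mape, sde


# Toy example with three held-out entries.
mae, mape, sde = imputation_errors([100.0, 200.0, 50.0], [110.0, 190.0, 55.0])
```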

3.2.3. Robustness

Robustness is evaluated by the imputation accuracy on the dataset with added outliers under different missing ratios.

3.3. The Results without Outlier

In this part, we evaluate the performance of the MCI algorithm and compare it with other state-of-the-art algorithms, including the PCA-based PPCA [7], SVT [22], and IALM [23], in the random missing case without outliers. For MCI, the stopping tolerance (the norm of the residual divided by the norm of the data matrix) is set to 0.01, and the maximum number of iterations is set to 150. For SVT and IALM, the tolerance is likewise set to 0.01, and the maximum number of iterations is also set to 150. For PPCA, similar to [7], the tolerance is set to 0.01, and the dimension of the latent space is set to 15.

To examine how the imputation performance changes, the total missing ratio (the number of missing data points divided by the total number of data points) is varied from 5% to 70%. The MAE and MAPE curves are shown in Figure 2.
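A random missing pattern at a prescribed ratio can be generated as in the following sketch. The matrix shape (20 days by 288 five-minute intervals), the seed, and the helper name `random_missing_mask` are assumptions made for illustration.

```python
import numpy as np


def random_missing_mask(shape, missing_ratio, rng):
    """Boolean mask, True = observed, with an exact number of missing entries."""
    n_total = int(np.prod(shape))
    n_missing = int(round(missing_ratio * n_total))
    flat = np.ones(n_total, dtype=bool)
    # Choose the missing positions uniformly at random without replacement.
    flat[rng.choice(n_total, size=n_missing, replace=False)] = False
    return flat.reshape(shape)


rng = np.random.default_rng(42)
shape = (20, 288)
masks = {ratio: random_missing_mask(shape, ratio, rng) for ratio in (0.05, 0.30, 0.70)}
```

Held-out entries (where the mask is False) are the ones on which the error measures are then evaluated.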

In Figure 2, all the methods achieve comparable results when the missing ratio is lower than 30%. However, the performance of all methods except MCI degrades sharply when the missing ratio is higher than 50%. A possible reason is that MCI exploits the physical limits of traffic flow in the imputation process, whereas the other methods ignore them.

Traffic volume must be nonnegative and must not exceed the road capacity. The PPCA imputation strategy is a statistical method that imputes data through a priori statistical characteristics of the data. As shown in Figure 3, the PPCA method may give a negative volume during low-flow intervals. The same phenomenon can be found in the other two matrix completion strategies, whose objective functions lack the nonnegativity and road capacity constraints. The proposed MCI tackles this shortcoming by adding these limits to the algorithm, and no negative or overcapacity value was observed in the MCI experiments. The frequencies with which the four methods produced unreliable (negative or overcapacity) results in our experiments are given in Table 1.
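One simple way to quantify unreliable output is the fraction of imputed entries that fall outside the physical range. The sketch below is one possible reading of such a measure and may differ from how Table 1 was compiled; the helper name `unreliable_fraction`, the capacity of 1000, and the sample values are hypothetical.

```python
import numpy as np


def unreliable_fraction(reconstructed, capacity):
    """Fraction of entries that are negative or above the capacity."""
    values = np.asarray(reconstructed, dtype=float)
    return float(np.mean((values < 0.0) | (values > capacity)))


# Hypothetical imputed volumes from an unconstrained method.
imputed = np.array([-3.0, 10.0, 500.0, 1200.0])
frac_before = unreliable_fraction(imputed, capacity=1000.0)

# Projecting onto [0, capacity] removes every violation by construction.
frac_after = unreliable_fraction(np.clip(imputed, 0.0, 1000.0), capacity=1000.0)
```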

The above experiments tested the accuracy of MCI. Next, we test the stability of the imputation methods using the SDE under different missing ratios. As shown in Figure 4, the SDEs of MCI and IALM, which use the augmented Lagrangian function, are lower than those of PPCA and SVT. This suggests that, by employing the augmented Lagrangian function, MCI imputes missing data not only more accurately but also more stably.

3.4. Missing Data Imputation with Outlier

The above experiments assume that the data have not been spoiled by outliers. However, traffic flow series are often corrupted by outliers, which arise for numerous reasons [32]. Unfortunately, these outliers are usually not easy to isolate with traditional missing traffic data imputation approaches. Thus, the recovery of outliers and the imputation of missing data are often carried out separately in different frameworks [7, 29].

To address this problem, the MCI algorithm imputes missing data and recovers outliers in a unified framework by adding the sparse matrix $E$.

There are various kinds of outliers in traffic data. Here, we consider only two common outlier scenarios:
(a) volume out of range (VOR): percentage of the detector records with volumes larger than 1000 v/5 min;
(b) volume repeating zero (VRZ): percentage of the detector records with repeating zero volumes for 30 min.
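The two outlier scenarios can be simulated as in the sketch below, which splits the requested outlier budget evenly between VOR and VRZ. The injected VOR value of 1200, the run length of six 5-minute intervals for 30 min, the helper name `inject_outliers`, and the constant test matrix are assumptions of this example; VRZ runs may overlap, so the realized VRZ count can be slightly below its share.

```python
import numpy as np


def inject_outliers(M, outlier_ratio, rng, vor_value=1200.0, vrz_len=6):
    """Return (corrupted copy of M, boolean outlier mask), VOR : VRZ about 1 : 1."""
    M_out = np.array(M, dtype=float)          # np.array copies by default
    is_outlier = np.zeros(M_out.shape, dtype=bool)
    n_days, n_intervals = M_out.shape
    n_target = int(round(outlier_ratio * M_out.size))

    # VRZ: runs of vrz_len consecutive zero volumes within one day.
    for _ in range((n_target // 2) // vrz_len):
        day = rng.integers(n_days)
        start = rng.integers(n_intervals - vrz_len + 1)
        M_out[day, start:start + vrz_len] = 0.0
        is_outlier[day, start:start + vrz_len] = True

    # VOR: isolated out-of-range volumes placed on untouched entries.
    free = np.flatnonzero(~is_outlier)
    n_vor = min(n_target - n_target // 2, free.size)
    idx = rng.choice(free, size=n_vor, replace=False)
    M_out.flat[idx] = vor_value
    is_outlier.flat[idx] = True
    return M_out, is_outlier


rng = np.random.default_rng(7)
clean = np.full((20, 288), 100.0)
corrupted, outlier_mask = inject_outliers(clean, outlier_ratio=0.10, rng=rng)
```

The returned mask makes it possible to score outlier recovery separately from missing data imputation.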

It is hard to enumerate all situations with different mixing ratios of the two outlier scenarios. In the experiments, the methods are tested on a typical situation in which the mixed VOR and VRZ data have a ratio of 1 : 1. The ratio of outlier data is varied from 5% to 15%, and the outlier data are produced randomly. The missing data ratio is set to 30%. All results are averaged over 10 instances. The MAE and MAPE for missing data imputation and outlier recovery are both given in Table 2. Figure 5 presents a portion of the traffic volume data and the reconstructed volume data. The results show that MCI can impute the missing data and recover the outlier traffic volume data with reliable performance.

3.5. Computation Complexity of MCI Approach

As in IALM [25], it is not necessary to compute the full SVD in MCI. By using Lansvd [21], a fast SVD method that computes only the singular values larger than a particular threshold and their corresponding singular vectors, the cost of the singular value decomposition is not a problem for MCI. Moreover, by utilizing the augmented Lagrangian function, MCI computes faster than traditional matrix completion methods such as SVT [25].

It is not easy to choose the parameter $\lambda$, which weights the rank of the matrix against the number of sparse outliers. For traffic data without outliers, setting $\lambda$ larger than 100 gives good performance. For data corrupted by outliers, however, a suitably lower value of $\lambda$ achieves better results. In real applications, we therefore suggest a large $\lambda$ for data without outlier corruption and a smaller $\lambda$ for corrupted data.

4. Conclusion and Future Works

In this paper, a matrix completion method that fully utilizes the physical limits of traffic volume and the day-mode similarity has been proposed to deal with the missing traffic flow problem. The experiments show that the proposed method is more reasonable, accurate, and stable than the state-of-the-art methods for traffic flow data. Moreover, the proposed MCI can impute missing data and recover outliers in a unified framework with reliable performance.

Future research should look into missing traffic data imputation methods that incorporate spatial and temporal correlations among adjacent detectors to improve imputation accuracy. In addition, future studies may evaluate the performance of MCI on other parameters such as speed and occupancy. More research is also needed on the appropriate choice of the parameter $\lambda$ for MCI.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research was supported by NSFC (Grant nos. 61271376, 51308115, and 91120010), National Basic Research Program of China (973 Program no. 2012CB725405), and Beijing Natural Science Foundation (4122067).