Journal of Sensors

Volume 2019, Article ID 7092713, 11 pages

https://doi.org/10.1155/2019/7092713

## Traffic Data Imputation Algorithm Based on Improved Low-Rank Matrix Decomposition

^{1}School of Information Engineering, Chang’an University, Xi’an 710064, China

^{2}School of Highway, Chang’an University, Xi’an 710064, China

Correspondence should be addressed to Xue Meng; 2017124037@chd.edu.cn

Received 16 January 2019; Revised 8 May 2019; Accepted 11 June 2019; Published 1 July 2019

Academic Editor: Eduard Llobet

Copyright © 2019 Xianglong Luo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Traffic data plays a very important role in Intelligent Transportation Systems (ITS), which require complete traffic data for transportation control, management, guidance, and evaluation. However, the traffic data collected from many different types of sensors often includes missing values due to sensor damage or data transmission errors, which affects the effectiveness and reliability of ITS. To ensure the quality and integrity of traffic flow data, an effective data imputation method is needed. Most existing imputation methods cannot fully consider the impact of the sensor with missing data or the spatiotemporal correlation characteristics of traffic flow on the imputation results. In this paper, a traffic data imputation method based on improved low-rank matrix decomposition (ILRMD) is proposed, which fully considers the influence of missing data and effectively utilizes the spatiotemporal correlation among traffic data. The proposed method uses not only the traffic data from the sensors surrounding the sensor with missing data, but also the data of that sensor itself. The information of the missing data is reflected in the coefficient matrix, and the spatiotemporal correlation characteristics are applied to obtain more accurate imputation results. Real traffic data collected from the Caltrans Performance Measurement System (PeMS) are used to evaluate the imputation performance of the proposed method. Experimental results show that the average imputation accuracy of the proposed method is improved by 87.07% compared with the SVR, ARIMA, KNN, DBN-SVR, WNN, and traditional MC methods, and that it is an effective method for data imputation.

#### 1. Introduction

With the rapid development of the social economy, a large amount of road infrastructure has been built [1–4], but traffic congestion still exists on highways. Therefore, it is necessary to collect highway traffic information to serve people's travel demand. The development of information technology has made this collection possible, and the equipment used on highways includes Bluetooth sensors, remote traffic microwave sensors, video sensors, and loop detectors. However, traffic flow data are lost to different degrees due to sensor damage, malfunction, or transmission errors. Missing data make it difficult to extract valid information from traffic data. They are also an obstacle in traffic and travel time prediction [5–8], and the integrity of traffic flow data is the premise of data analysis in ITS. Therefore, it is very important to put forward an effective traffic data imputation method. At present, various methods have emerged in the field of traffic flow data imputation. They can be roughly divided into three categories: prediction methods, interpolation methods, and statistical learning methods [9].

Traffic flow prediction models [10–12] are critical for road traffic management in complex road networks. Prediction methods usually build predictive models with historical data and treat missing data as values to be predicted. There are many ways to build traffic flow prediction models, from simple null value imputation to complex spatiotemporal imputation models [13]. Representative prediction methods include the Autoregressive Integrated Moving Average (ARIMA) model [14–16], Bayesian networks (BNs) [17–19], and support vector regression (SVR) [20, 21]. Elshenawy et al. [22] proposed an intelligent data imputation method with an ARIMA model and presented a mechanism based on the Hyndman-Khandakar algorithm to determine the ARIMA parameters. Sun et al. [23] partitioned a day into different time sections and used SVR to forecast traffic flow data. Chen et al. [24] proposed an Autoregressive Integrated Moving Average with Generalized Autoregressive Conditional Heteroscedasticity (ARIMA-GARCH) model for traffic flow prediction. However, these prediction methods fail to utilize the information of the sensor with missing data, which affects data imputation accuracy.

Interpolation methods are divided into temporal-neighboring and pattern-neighboring methods [25]. Temporal-neighboring methods fill in the missing data with known data from the same sensors at the same daily time on neighboring days [20, 26]. Pattern-neighboring methods use the similarity characteristics of the daily traffic flow data [27] and estimate missing data using historical data collected from the same sensors on different days [17, 20]. Typical pattern-neighboring methods include the K-nearest neighbors (KNN) model [28, 29] and the Local Least Squares (LLS) model [30, 31], and the key difficulty of these methods is to determine the neighbors with an appropriate distance metric [32, 33]. Nguyen et al. [34] used the mean value of the historical data to estimate missing data. Smith et al. [35] used historical data or the data from surrounding periods and locations to impute the missing data. The interpolation models assume that the daily traffic flow data are similar, but actual traffic flow data fluctuate and change with time. Therefore, these methods cannot always obtain satisfactory imputation performance.

Methods based on statistical learning have been developed in recent years. These methods first assume a probability distribution model of the traffic data and use iterative methods to estimate the parameters of the distribution. The observed data are then used to impute the missing data. The statistical learning methods include Probabilistic Principal Component Analysis (PPCA) [6, 9], Bayesian Principal Component Analysis (BPCA) [26], the neural network method [36], and Markov Chain Monte Carlo (MCMC) [37]. The MCMC is a typical imputation method based on statistical learning; its basic idea is to regard the missing data as the target parameter and to estimate this parameter from sampled values of the parameter. Higashijima et al. [38] proposed a regression tree imputation method and used a preprocessing method to improve imputation accuracy. Wei et al. [39] proposed a data-driven imputation method and used k-means clustering to group the most correlated road segments; the trained model is able to estimate the missing data at multiple locations under a unified framework. Although the methods based on statistical learning make strong assumptions about traffic data, their performance is superior to that of traditional imputation methods [40] because the assumed probability distribution captures the essentials of traffic flow.

The methods based on prediction and interpolation simply impute the data with the temporal or spatial correlation characteristic and only consider the information of historical data. The historical imputation methods fill in the missing data with known data points collected by the same sensors at the same daily time on different days. These methods require stable historical data, but traffic flow data are usually unstable and fluctuate to some extent in practical applications. The traditional imputation method sets all the missing data to zero and uses the zero-padded data matrix in the imputation, so it cannot reflect the impact of the sensor with missing data in the imputation result. Generally, the sensor with missing data has the highest correlation with the final imputation result. Setting the missing data directly to zero ignores this effect and reduces the accuracy of the imputation results. In order to address the above problems, a traffic data imputation method based on improved low-rank matrix decomposition (ILRMD) is proposed. Compared with the traditional imputation method, the ILRMD method fully considers the impact of missing data on the imputation results. In the process of data imputation, the ILRMD method does not directly discard the information of the missing data; instead, the effect of the missing data is reflected in the coefficient matrix. The reconstructed data matrix multiplied by the coefficient matrix, which contains the missing data information, is the imputation result. The ILRMD method uses not only the traffic data from the sensors surrounding the sensor with missing data, but also the data of that sensor itself. The information contained in the missing data is fully considered, and the spatiotemporal correlation characteristics of the traffic flow are adequately utilized. Test results with traffic data collected from the Caltrans Performance Measurement System (PeMS) show that the proposed algorithm has superior imputation accuracy.

The rest of this paper is organized as follows. Section 2 reviews the related work in traffic data imputation and gives a brief introduction. The traditional imputation approach is introduced in Section 3. Section 4 describes the ILRMD method proposed in this paper. Section 5 discusses the result analysis and method comparison. Section 6 concludes the paper and gives some recommendations.

#### 2. Related Work

With the rapid development of machine learning, pattern recognition, computer vision, and data mining, the processing of big data is becoming more and more important. The scale and growth rate of big data are continuously increasing, but large-scale high-dimensional data are often correlated and redundant. Therefore, it is necessary to compress large-scale data in a reasonable way. In order to reduce data redundancy, Candes [41] proposed the concept of low-rank sparse matrix decomposition in 2009, which is also called Low-Rank Matrix Recovery (LRMR), Low-Rank Matrix Decomposition (LRMD), or Robust Principal Component Analysis (RPCA).

##### 2.1. Low-Rank Matrix Decomposition

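For a given data matrix $X$ distributed in a linear subspace of approximately low dimension, it can be decomposed into a low-rank matrix $A$ and a sparse matrix $E$ [42]:

$$\min_{A,E}\ \operatorname{rank}(A) + \lambda \|E\|_{0} \quad \text{s.t.} \quad X = A + E \tag{1}$$

where $\|\cdot\|_{0}$ represents the $\ell_0$ norm of the matrix and $\lambda$ represents the compromise factor of matrices $A$ and $E$.

Since the optimization problem of (1) is NP-hard, it can be relaxed to the following convex optimization problem [41–43]:

$$\min_{A,E}\ \|A\|_{*} + \lambda \|E\|_{1} \quad \text{s.t.} \quad X = A + E \tag{2}$$

where $\|A\|_{*}$ represents the nuclear norm of matrix $A$ and $\|E\|_{1}$ is the $\ell_1$ norm of the matrix $E$.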

The low-rank characteristic of recovered matrix determines the matrix imputation performance. Therefore, choosing the suitable LRMD solution method is crucial. The main algorithms for solving LRMD problem include Iterative Threshold method [44, 45], the Dual Approach [46], Accelerated Proximal Gradient Algorithm [47], and Augmented Lagrange Multiplier method [48]. In this paper, Augmented Lagrange Multiplier method is used.
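As an illustration of how the convex relaxation of (1) is typically solved with an Augmented Lagrange Multiplier scheme, the following is a minimal NumPy sketch of the generic inexact ALM iteration, which alternates singular value thresholding for the low-rank part and entrywise soft thresholding for the sparse part. It is not the authors' implementation: the function name `rpca_ialm`, the default weight `lam = 1/sqrt(max(m, n))`, and the step parameters `rho`, `tol`, and `max_iter` are assumptions taken from common practice.

```python
import numpy as np

def rpca_ialm(X, lam=None, rho=1.5, tol=1e-7, max_iter=500):
    """Split X into a low-rank part A and a sparse part E (generic inexact ALM sketch)."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))           # common default, an assumption here
    spec = np.linalg.norm(X, 2)                  # largest singular value of X
    Y = X / max(spec, np.abs(X).max() / lam)     # multiplier initialisation
    mu, mu_max = 1.25 / spec, 1.25e7 / spec      # penalty parameter and its cap
    A = np.zeros_like(X)
    E = np.zeros_like(X)
    x_norm = np.linalg.norm(X, "fro")
    for _ in range(max_iter):
        # A-step: singular value thresholding of X - E + Y/mu
        U, s, Vt = np.linalg.svd(X - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # E-step: entrywise soft thresholding of X - A + Y/mu
        T = X - A + Y / mu
        E = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        # Multiplier and penalty updates
        R = X - A - E
        Y = Y + mu * R
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(R, "fro") / x_norm < tol:
            break
    return A, E
```

On a synthetic matrix built as a rank-2 product plus a few large sparse corruptions, `rpca_ialm` would be expected to return `A` close to the low-rank factor product and `E` close to the corruptions; on real traffic data the split depends on the choice of `lam`.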

##### 2.2. Matrix Imputation Based on Low-Rank Matrix Decomposition

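Generally, all the data cannot be recovered from partial samples. However, Candes [42] proved that the missing data can be recovered accurately when the data matrix is low rank or nearly low rank. As described in Section 2.1, the low-rank matrix is acquired based on LRMD, which can be used to impute the missing data.

The model of matrix imputation can be noted as follows:

$$\min_{A}\ \operatorname{rank}(A) \quad \text{s.t.} \quad P_{\Omega}(A) = P_{\Omega}(X) \tag{3}$$

where $\Omega$ is the set of known element subscripts and $P_{\Omega}$ is a linear projection operator, which can be defined as follows:

$$[P_{\Omega}(X)]_{ij} = \begin{cases} x_{ij}, & (i,j) \in \Omega \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

The optimization problem of (3) is also NP-hard, so it needs to be relaxed into a convex optimization problem:

$$\min_{A}\ \|A\|_{*} \quad \text{s.t.} \quad P_{\Omega}(A) = P_{\Omega}(X) \tag{5}$$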
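To make the projection operator and the relaxed completion problem concrete, the sketch below fills missing entries by iterating a singular value soft-thresholding step (a Soft-Impute style iteration). This is a swapped-in illustrative technique that solves a penalized form of the nuclear-norm problem, not the equality-constrained problem itself and not the solver used in this paper. The names `project` and `soft_impute` and the parameters `tau` and `n_iter` are hypothetical.

```python
import numpy as np

def project(X, mask):
    """Projection onto the observed set: keep entries where mask is True, zero elsewhere."""
    return np.where(mask, X, 0.0)

def soft_impute(X, mask, tau=0.5, n_iter=500):
    """Low-rank completion by repeated SVD soft thresholding (illustrative sketch)."""
    Z = project(X, mask)
    for _ in range(n_iter):
        # Observed entries are taken from X, unobserved ones from the current estimate
        filled = np.where(mask, X, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Shrinking the singular values by tau encourages a low-rank estimate
        Z = (U * np.maximum(s - tau, 0.0)) @ Vt
    return Z
```

A smaller `tau` follows the observed entries more closely, while a larger `tau` pushes the estimate toward lower rank; the value used here is only a placeholder.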

##### 2.3. Matrix Imputation Based on Low-Rank Matrix Representation

The low-rank matrix imputation method mentioned above directly minimizes the rank of the imputed data. In order to improve imputation efficiency, a self-expression is applied to LRMD, which is called the low-rank matrix representation [49, 50]. The data matrix $X$ is represented as a linear combination with a dictionary matrix $D$, that is, $X = DZ$. The matrix $Z$ is the coefficient matrix, and it is expected to be low rank. $Z$ can be obtained by solving the following optimization problem:

$$\min_{Z}\ \operatorname{rank}(Z) \quad \text{s.t.} \quad X = DZ \tag{6}$$

Equation (6) can be convexly relaxed to obtain the following:

$$\min_{Z}\ \|Z\|_{*} \quad \text{s.t.} \quad X = DZ \tag{7}$$

If the data matrix $X$ is selected as the dictionary matrix, (7) can be noted as follows:

$$\min_{Z}\ \|Z\|_{*} \quad \text{s.t.} \quad X = XZ \tag{8}$$

In practical applications, the data matrix may be disturbed by noise. In order to enhance the robustness, (8) is revised into (9) by introducing a noise matrix $E$, so that the constraint becomes $X = XZ + E$ and a weighted penalty on $E$ is added to the objective.

A data matrix $X$ is represented by a data dictionary $D$, and the coefficient matrix $Z$ is sparser when $D$ has higher similarity with $X$. However, stochastic noise is usually present in the data matrix $X$, which influences the correlation within the data matrix. When the stochastic noise is removed, the correlation of the data matrix can be enhanced. Selecting $X$ itself as the dictionary is, in essence, a way to reveal the correlation within the data matrix. When the coefficient matrix $Z$ is sparse, the columns of $X$ are represented by one another with as few coefficients as possible. Traffic flow data have high spatiotemporal correlation, but they are affected by the weather, holidays, and other factors, which gives them stochastic volatility. Therefore, if the influence of this stochastic volatility on the traffic data is removed, the correlation between the traffic data will be enhanced. After removing the influence of stochastic noise, the correlation within the data is further explored, and the similarity between the data is expressed with as little information as possible. The internal correlation of the traffic flow data is then used to impute the data.

##### 2.4. The Solution of the Coefficient Matrix

In order to obtain the solution of (9), an auxiliary variable is introduced and set equal to the coefficient matrix so that the coefficient matrix can be separated. The coefficient matrix can then be calculated with the Augmented Lagrange Multiplier method, and the optimization model becomes the following:

Construct an augmented Lagrange function as (11), which involves a Lagrange multiplier, the Frobenius norm $\|\cdot\|_{F}$ (the square root of the sum of the absolute squares of the elements), and a weight that tunes the error term.

The Exact Augmented Lagrange Multiplier (EALM) method is used to solve the matrices and according to the following:

The updating of the coefficient matrix is as follows. Firstly, a projection matrix is used to express the unmissing position of the matrix , and . For convenience, set and (13) can be expressed as follows:

In order to get a derivative about in (14), the cross product should be changed to inner product. The matrices of (14) are spread in column as follows:

where , , and are, respectively, the column of matrices , , and .

Change vector to a diagonal matrix, i.e., and . Therefore, (15) can be expressed as follows:

For simplifying (16), is denoted as , and is denoted as . Then (16) can be simplified as follows:

For (17), can be updated by the following:

Then repeat the above process until the objective function converges. The coefficient matrix is obtained when the termination condition is met, and it is expressed as follows:

#### 3. Traditional Imputation Method with LRMD

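The traditional method imputes the missing data by a zero-padding operation. For an original matrix $X$, suppose that $x_k$ is missing, where $x_k$ represents the $k$-th column of $X$. The missing column of the matrix is filled with 0, which can be represented as a matrix $\tilde{X}$:

$$\tilde{X} = \begin{bmatrix} x_{11} & \cdots & x_{1,k-1} & 0 & x_{1,k+1} & \cdots & x_{1n} \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ x_{m1} & \cdots & x_{m,k-1} & 0 & x_{m,k+1} & \cdots & x_{mn} \end{bmatrix} \tag{20}$$

where $x_{ij}$ are the specific elements in the matrix $X$.

Multiplying $\tilde{X}$ by the $k$-th column $z_k$ of the coefficient matrix $Z$, $x_k$ can be recovered by the following:

$$\hat{x}_k = \tilde{X} z_k \tag{21}$$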

The traditional matrix imputation method uses the zero-padding operation to fill the missing column. The reconstructed matrix is then multiplied by the corresponding column of the coefficient matrix to obtain the imputed data of the missing column. This method only uses the data around the missing column to impute the missing data; that is to say, the missing column does not contribute to the imputation result. Generally, the sensor with missing data has the highest correlation with the final imputation result. Setting the missing data directly to zero ignores this effect and reduces the accuracy of the imputation results.
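The zero-padding procedure described above can be sketched in a few lines of NumPy. The function name `zero_pad_impute`, the column index `k`, and the assumption that sensors are stored as columns of the data matrix are illustrative choices, and the coefficient matrix `Z` is taken as given (for example, from the solver of Section 2.4).

```python
import numpy as np

def zero_pad_impute(X, Z, k):
    """Traditional imputation sketch: zero column k, then multiply by column k of Z."""
    X_pad = np.array(X, dtype=float)   # copy so the caller's matrix is not modified
    X_pad[:, k] = 0.0                  # zero-padding of the missing sensor column
    return X_pad @ Z[:, k]             # weighted sum of the remaining columns only
```

Because column `k` is zeroed before the product, the entry `Z[k, k]` never influences the estimate, which is exactly the limitation of the traditional method discussed in this section.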

#### 4. Traffic Data Imputation with ILRMD

Missing data can generally be divided into three types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing at Determinate (MAD). This paper mainly deals with the problem of determinate missing. In road networks, traffic data are collected by various types of sensors and usually demonstrate high temporal-spatial correlation; that is, traffic data have a low-rank characteristic.

In a road network, suppose that there are $n$ sensors and each sensor has $m$ data samples, which can be denoted as a data matrix $X \in \mathbb{R}^{m \times n}$. This paper assumes that the data of the $k$-th sensor are missing in $X$. The traditional imputation method based on LRMD fails to consider the impact of the missing data column on the imputation results. In order to address this shortcoming and to exploit the temporal-spatial correlation characteristics of traffic flow, this paper proposes a data imputation method based on ILRMD.

##### 4.1. The Proposed ILRMD Model

In (9), it is assumed that , are the elements of the () observed sensor at the () time, respectively, existing in the observed matrix and the noise matrix . is the element of the coefficient matrix , and the coefficient matrix . According to the multiplication rule, the following is obtained:

Then, (22) can be transformed into the following:

The coefficient matrix of the observed sensor can be expressed as follows:

The final coefficient matrix of all observed sensors is described as follows:

Assuming that represents the matrix that removes the column. According to the matrix multiplication rule, the matrix is multiplied by the column of the coefficient matrix . The value is obtained and can be noted as follows:

The ILRMD method proposed in this paper assumes that a certain column of data in the matrix is lost and then multiplies the matrix by the coefficient matrix to recover the missing data. The influence of all observed sensors is considered including the sensor with missing data. In (24), if the value is zero, the data of the surrounding sensors is used for imputation. If the value is not zero, both the data of the surrounding sensors and the sensor including missing data are used.

The differences between the ILRMD method and the traditional imputation method are as follows. The traditional imputation method performs the zero-padding operation on the missing column and then multiplies the zero-padded matrix directly by the corresponding column of the coefficient matrix. It therefore utilizes only the data collected from the surrounding sensors to recover the matrix and ignores the effect of the sensor with missing data. The ILRMD method assumes that one column of the data is completely missing and works with the matrix obtained by removing that column. After the conversion, the weight that is most relevant to each sensor itself is expressed in another form, in order to reduce the effect of the most relevant weight on the imputation result. From (22)-(24), a coefficient matrix is obtained that considers not only the surrounding sensors but also the influence of the sensor with missing data. Finally, the reduced matrix is multiplied by this coefficient matrix to obtain the imputation result.
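The conversion of the self-weight can be illustrated with a short worked derivation. This is one reading that is consistent with the description above, not a reproduction of (22)-(26): the symbols $x_{ij}$, $z_{kj}$, and $e_{ij}$, the convention that $i$ indexes time and $j$ indexes sensors, and the requirement $z_{jj} \neq 1$ are all assumptions.

$$x_{ij} = \sum_{k=1}^{n} x_{ik} z_{kj} + e_{ij}$$

Moving the term that contains the sensor's own data to the left-hand side gives

$$(1 - z_{jj})\, x_{ij} = \sum_{k \neq j} x_{ik} z_{kj} + e_{ij}$$

and, for $z_{jj} \neq 1$,

$$x_{ij} = \sum_{k \neq j} x_{ik}\, \frac{z_{kj}}{1 - z_{jj}} + \frac{e_{ij}}{1 - z_{jj}}$$

Under this reading, the rescaled weights $z_{kj} / (1 - z_{jj})$ carry the influence of the sensor with missing data through $z_{jj}$. When $z_{jj} = 0$ they reduce to the plain weights of the surrounding sensors, which matches the two cases distinguished above.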

The main steps of the proposed imputation method are as follows.

*Step 1.* The traffic flow data is preprocessed by smoothing and filtering, and the complete traffic flow data of one day is randomly selected to construct the training matrix.

*Step 2.* The preprocessed matrix is decomposed into the low-rank matrix and the sparse matrix according to (1).

*Step 3.* According to (9), the matrix is decomposed, and the coefficient matrix is solved from (10) to (20).

*Step 4.* Construct the test matrix and set the dictionary matrix. The reduced matrix is the matrix with the missing column removed.

*Step 5.* The coefficient matrix is obtained according to (25), and the missing data that need to be imputed are obtained by (26).
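The final imputation step can be sketched as follows under a stated assumption: the rescaled coefficients are taken to be the off-diagonal weights of the missing sensor's column divided by one minus its self-weight, which is one reading of the conversion described in Section 4.1 and may differ from the exact form of (25) and (26). The function name `ilrmd_impute` and the column-per-sensor layout are hypothetical.

```python
import numpy as np

def ilrmd_impute(X_test, Z, k):
    """Estimate column k from the other columns, rescaled by the self-weight Z[k, k]."""
    z_k = Z[:, k]
    others = np.arange(X_test.shape[1]) != k     # every sensor except the missing one
    # Assumed form of the converted coefficients; requires Z[k, k] != 1
    weights = z_k[others] / (1.0 - z_k[k])
    return X_test[:, others] @ weights
```

When the data satisfy the self-representation exactly (no noise term), this rescaled estimate reproduces the missing column, whereas the zero-padded product of Section 3 returns that column shrunk by the factor one minus the self-weight.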

##### 4.2. Performance Evaluation Criteria

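The evaluation criteria commonly used to measure the error of imputed data include the root mean square error (RMSE), mean absolute error (MAE), mean squared percentage error (MSPE), and mean absolute percentage error (MAPE). The RMSE and MAPE are selected in this paper. The formulas are as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2} \tag{27}$$

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{x_i - \hat{x}_i}{x_i} \right| \times 100\% \tag{28}$$

where $N$ is the total number of missing data points, $x_i$ is the actual value of the $i$-th missing data point, and $\hat{x}_i$ is the corresponding estimated value.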
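The two selected criteria can be computed as in the sketch below, which assumes the standard definitions of RMSE and MAPE with MAPE expressed in percent. The function names `rmse` and `mape` are illustrative, and the actual values are assumed to be non-zero so that the percentage error is defined.

```python
import numpy as np

def rmse(actual, estimated):
    """Root mean square error between actual and estimated values."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.sqrt(np.mean((actual - estimated) ** 2)))

def mape(actual, estimated):
    """Mean absolute percentage error in percent; actual values must be non-zero."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.mean(np.abs((actual - estimated) / actual)) * 100.0)
```

For traffic flow, intervals with zero observed flow (for example, at night) would need to be excluded or handled separately before computing MAPE.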

#### 5. Experiment Results

##### 5.1. Data Description

The data used to evaluate the performance of the proposed model were collected by mainline detectors and provided by the PeMS database, which includes more than 39,000 individual sensors that span the highway system in all major metropolitan areas of California. In this paper, 46 mainline sensors numbered from 1108512 to 1221232 are selected to perform the data imputation test from April 1st, 2018, to April 30th, 2018. The traffic flow data are aggregated at 5-minute intervals, which generates 288 data points per sensor per day. Training matrices were constructed from 1 day, 7 days, and 14 days of data; however, the experimental results show that the imputation accuracy does not improve noticeably as the number of training samples grows. According to the analysis of the spatial-temporal correlation characteristics of traffic flow, the traffic flow data on the same weekday in consecutive weeks have high regularity and relevancy. Therefore, this paper selects traffic flow data from the same weekday in two consecutive weeks (two Mondays) to perform the experiment. The traffic flow data of the 46 observed sensors on April 23rd, 2018, are used as the training matrix, and the data of the sensor numbered 1108512 on April 30th, 2018, are assumed to be missing and need to be imputed.

Due to the influence of travel demand, weather, and other factors, traffic flow data present certain stochastic fluctuations and abrupt changes. In order to reduce the impact of these stochastic fluctuations on the imputation results, a five-point smoothing filter was used to preprocess the data. The original and filtered data of the sensor numbered 1108512 on April 8, 2018, are shown in Figure 1.
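A five-point smoothing step of this kind can be sketched as a centred moving average. The paper does not state the exact filter, so the equal weights, the edge handling (averaging only the neighbours that exist), and the function name `five_point_smooth` are assumptions; other five-point schemes, such as polynomial smoothing, would give slightly different results.

```python
import numpy as np

def five_point_smooth(x):
    """Centred 5-point moving average; edge samples average the available neighbours."""
    x = np.asarray(x, dtype=float)
    if x.size < 5:
        raise ValueError("at least five samples are required")
    kernel = np.ones(5)
    window_sum = np.convolve(x, kernel, mode="same")                  # sum over each window
    window_len = np.convolve(np.ones_like(x), kernel, mode="same")    # samples per window
    return window_sum / window_len
```

Applied to one day of 5-minute flow values (288 samples), the function returns a series of the same length in which isolated spikes are damped while the daily profile is preserved.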