Abstract

Traffic volume data are an important part of research on and applications of intelligent transportation systems (ITS). However, data loss often happens in the real world due to various factors, which may cause large deviations in prediction or poor optimization accuracy. Imputation is a valid way to handle missing values. A multilayer perceptron-multivariate imputation by chained equations (MLP-MICE) regression imputation method optimized by the limited-memory BFGS (L-BFGS) algorithm is proposed, considering the temporal and spatial characteristics of traffic volume. In addition, 32 groups of simulated imputation experiments based on detected traffic volumes of road sections in the Guangdong freeway system are conducted, taking both continuous missing and jumped missing scenarios into account. The experimental results show that MLP-MICE can improve imputation performance for missing traffic volume values, with the MAPE of imputation results ranging from 6.38% to 30%. Meanwhile, the proposed model has higher imputation accuracy for traffic volume data with a lower degree of mutation. Lastly, the performance of the proposed imputation model in short-term traffic volume prediction is examined using a support vector machine, and the prediction MAPE under the proposed model is much lower than under all-zero imputation. Therefore, the proposed model is beneficial to improving the accuracy of traffic volume prediction and intelligent traffic control and management.

1. Introduction

Traffic volume data are an important part of Intelligent Transportation Systems (ITSs); however, missing data are a widespread and inevitable problem caused by detector failures and information processing errors, and they certainly have negative effects on ITS applications because of the temporal and spatial characteristics of traffic flow [1, 2]. Moreover, failed data are often simply culled in practical studies because the overall dataset is large [3], which reduces the accuracy of prediction and optimization in road management systems. Therefore, effective imputation methods for missing traffic volume data are necessary.

Imputation of missing data means replacing the missing values with estimated values [4]. Imputation methods can be divided into single imputation and multiple imputation according to the number of imputations performed [5]. Single imputation includes statistical and machine learning-based methods [6]. Mean imputation and regression imputation are the main statistical methods. Mean imputation is easy to operate, but its performance is limited because it underestimates imputation results and neglects relations with other variables [7], while regression imputation obtains alternative values by regression, such as spatial autoregression models [8] and logistic regression [9]. Machine learning-based imputation methods have been proposed with the development of data and information technology in recent years [10], such as support vector regression [11], residual learning networks [12], and semisupervised regression [13]. Hot deck imputation [14, 15] and cluster imputation [16] are also widely used for imputing missing data. Previous studies show that combining single imputation methods can improve imputation performance; for example, the general regression auto associative neural network (GRAANN) [17] is a hybrid of mean imputation and machine learning, and the fuzzy c-means support vector regression genetic algorithm (Fuzzy C-means SVRGA) [18] combines cluster imputation and machine learning.

Generally, multiple imputation (MI) requires at least two imputation passes [19]. The basic idea of MI is to create multiple copies of the data containing missing values, perform the imputation on each copy independently, and then select the imputation results according to certain evaluation criteria [4, 20]. MI is effective in improving the precision of imputation [21–24], and Multivariate Imputation by Chained Equations (MICE) [25] is a commonly used MI method. It has been verified that good performance can be obtained from a hybrid of MICE and regression imputation [26].

Imputation methods are widely used in fields such as economics [8], computing [9, 13], biology [11], environment [12, 27], and medicine [21, 28]. Most of them are used to resolve nonresponse in questionnaire surveys, which have weak temporal and spatial correlation. Traffic volume, by contrast, is data with strong temporal continuity and spatial correlation, which should be considered in the imputation process to improve accuracy [29]. Previous studies show that traffic volume imputation methods are also mainly based on statistical learning and machine learning [30]. Statistical learning methods take full advantage of the statistical features of traffic volume [31]; they mainly include improved principal components analysis (PCA) methods such as PPCA [32] and KPPCA [31], fuzzy theory [33, 34], and tensor completion [35–38]. Machine learning methods estimate the missing values with machine learning [30] or deep learning algorithms [39], for example, SVR [40, 41], DSAE [42], KNN [43], CNN [44, 45], LSTM [46, 47], GAN [48], and DEB [49]. Compared with statistical learning methods, machine learning methods make use of more characteristics of traffic volume, especially temporal features [30], and spatiotemporal (ST) features are considered in methods such as ST-BiRT [38] and ST-PTD [50]. The details of the imputation methods used for traffic volume data are shown in Table 1.

Most imputation methods for traffic volume are single imputation methods, whose results are generally underestimated [18]. They are better suited to complex, large traffic systems with large-scale traffic volume data, where combining multiple kinds of methods is not recommended because of the added computation. For smaller systems with small-scale data, however, imputation accuracy can be improved by combining multiple methods, and MI is an effective choice. In addition, such a combination can effectively reduce the influence on the final output of abnormal fluctuations in traffic data caused by external random factors.

Neural network models are promising for obtaining accurate imputation results [39]. The multilayer perceptron (MLP), a deep learning model, has been applied to represent nonlinear features in traffic prediction and missing value imputation [41] and is chosen in this study. Therefore, to study a simple traffic system and to make full use of MLP combined with MI to improve imputation precision, a hybrid model of deep learning and multiple imputation called MLP-MICE is proposed to impute missing detected traffic volume data. Compared with the unprocessed data, feeding the data repaired by the proposed MLP-MICE into the prediction model effectively improves accuracy, which benefits intelligent traffic control and management. The rest of this study is organized as follows. Section 2 introduces the methodology, including the framework of MLP-MICE and each of its parts. Section 3 verifies the proposed model using freeway detector data and analyses its performance. Section 4 concludes this study with some remarks.

2. Methodology

2.1. Model Framework

The structure of the model is shown in Figure 1. Considering the temporal continuity and spatial correlation of freeway traffic volume, MLP was used to impute the missing traffic volume data, and the L-BFGS algorithm was used to optimize the parameters of MLP. The model is divided into three parts, which are data preparation, MLP imputation, and MICE process.

2.2. Data Preparation

The input of the model is defined as an $M \times q$ matrix denoted as $D$:

$$D = \begin{bmatrix} d_{t+1,1} & d_{t+1,2} & \cdots & d_{t+1,q} \\ d_{t+2,1} & d_{t+2,2} & \cdots & d_{t+2,q} \\ \vdots & \vdots & \ddots & \vdots \\ d_{t+M,1} & d_{t+M,2} & \cdots & d_{t+M,q} \end{bmatrix}, \tag{1}$$

where $d_{t+i,j}$ is the traffic volume of time interval $i$ at location $j$, $M$ represents the number of time intervals for the imputation model, $q$ is the number of detecting locations, and $t$ is the starting time of the detectable data used for imputation. Suppose $N$ elements are missing starting from $d_{t+m,n}$, where $m$ represents the starting time interval of the missing data, $n$ represents the location number, and $M = m + N$, which means that the data of all locations during and before the missing period are input into the model. $M$ needs to be determined from the data distribution; equation (2) can be used once the model has been tested to have good imputation performance at the given missing rate, where $M_{\min}$ is the minimum sample size needed to obtain relatively accurate results.

Finally, the nonmissing elements of the matrix are normalized to speed up the convergence of the model.
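As a concrete sketch, the normalization step might be implemented as follows. Min-max scaling to [0, 1] and the use of NaN to mark missing entries are assumptions here, since the exact normalization formula is not reproduced in the text:

```python
import numpy as np

def normalize_nonmissing(D):
    """Min-max normalize the non-missing entries of D to [0, 1].

    Min-max scaling is an assumption (the text does not spell out the
    formula); NaN marks a missing value and is left untouched.
    """
    d_min = np.nanmin(D)
    d_max = np.nanmax(D)
    return (D - d_min) / (d_max - d_min)

# Example: a 3 x 2 slice of the input matrix with one missing value
D = np.array([[10.0, 20.0],
              [np.nan, 40.0],
              [30.0, 50.0]])
D_norm = normalize_nonmissing(D)
```

The missing entry stays NaN, so the downstream model can distinguish it from observed values.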

2.3. Multilayer Perceptron Imputation

The multilayer perceptron (MLP) neural network structure is interconnected by many nodes and contains four layers.

The nodes belonging to different layers process the input data through the activation function, and the results are transmitted from layer to layer [51]. The MLP regression imputation model we use is illustrated in Figure 2. Define the input data, the weight matrix $W$, and the bias term matrix $B$, where $a^{(l-1)}$ is the data input to layer $l$, $w_{ij}^{(l)}$ is the element of $W$ denoting the weight from the $i$th node of layer $l-1$ to the $j$th node of layer $l$, and $b_j^{(l)}$ is the element of $B$ representing the numerical deviation added to the input of the $j$th node of layer $l$. $z^{(l)}$ represents the output dataset of each node in layer $l$:

$$z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}.$$

Choosing tanh as the activation function, the following expression can be obtained:

$$a_j^{(l)} = \tanh\left(z_j^{(l)}\right) = \frac{e^{z_j^{(l)}} - e^{-z_j^{(l)}}}{e^{z_j^{(l)}} + e^{-z_j^{(l)}}}.$$

When l is the output layer, it outputs the normalizing estimation of the missing values. The activation method in the output layer is softmax regression.
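A minimal sketch of such a forward pass follows. The hidden sizes (4 and 2 nodes) match those used later in Section 3; a linear output layer is used here for simplicity in place of the softmax regression described above, and the random weights are purely illustrative:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected MLP with tanh hidden layers.

    weights[l] has shape (n_in, n_out), so z^(l) = a^(l-1) @ W^(l) + b^(l).
    The output layer is linear here (an assumption for clarity); the paper
    applies a softmax-style activation at the output instead.
    """
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ W + b)               # hidden layers: tanh activation
    return a @ weights[-1] + biases[-1]      # output layer

# Toy network: 5 inputs (one per VDS) -> 4 -> 2 -> 1 output
rng = np.random.default_rng(0)
sizes = [5, 4, 2, 1]
weights = [rng.normal(size=(m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
y_hat = mlp_forward(np.ones(5), weights, biases)
```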

2.4. Parameters Optimization Based on L-BFGS

The limited-memory BFGS (L-BFGS) algorithm is used to optimize the parameters of the MLP. L-BFGS is an approximate quasi-Newton method commonly used to solve unconstrained nonlinear programming problems, with the advantages of fast convergence and low memory overhead (Algorithm 1).

Definition:
θ0: the initial parameters of MLP
μ: the tolerable maximum positive number (convergence tolerance)
iter: iteration of L-BFGS, iter = k (k = 0, 1, 2, …)
m: the latest m groups of iteration results are used for calculation
ɛ: optimized step length
Procedure L-BFGS
 Calculate B0 and ∇f(θ0) according to θ0 and the value of the loss function
 while ‖∇f(θk)‖ > μ:
  get ɛ such that f(θk − ɛr) = min(f(θk − ɛir)) from:
   Let q = ∇f(θk)
   for i = k − 1, k − 2, …, k − m:
    αi = ρi si^T q
    q = q − αi yi
    end for
   r = B0^−1 · q
   for i = k − m, k − m + 1, …, k − 1:
    β = ρi yi^T r
    r = r + (αi − β) si
    end for
  θk+1 = θk − ɛr
  sk = θk+1 − θk, yk = ∇f(θk+1) − ∇f(θk)
  k = k + 1
 end while
end procedure

Define $x^{(i)}$ as the $i$th element of set $X$, $R_{WB}(x^{(i)})$ as the imputation result for the value of the $i$th time interval with weights $W$ and numerical deviations $B$ for a certain location, $y^{(i)}$ as the observed value of the $i$th time interval for this location, and $w_{ij}^{(l)}$ as the elements of $W$. Then, the loss function of the set $X$, denoted as $f_{loss}(X)$, can be described as follows:

$$f_{loss}(X) = \frac{1}{|X|}\sum_{i}\left(R_{WB}(x^{(i)}) - y^{(i)}\right)^{2} + \delta \sum_{l}\sum_{i,j}\left(w_{ij}^{(l)}\right)^{2}. \tag{6}$$

An L2 regularization term is added in equation (6) to avoid overfitting and local optima, where δ is the coefficient of the L2 term. The value of the loss function depends on the parameters W and B of the MLP. Let θ denote the set of W and B, and define $h(\theta) = f_{loss}(X)$ as a function of θ.
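The loss can be sketched as follows; the mean-squared data term and the value of δ are illustrative assumptions, since the text only states that a squared error over X and an L2 penalty with coefficient δ are combined:

```python
import numpy as np

def loss_with_l2(y_hat, y, weights, delta=1e-3):
    """Squared-error loss over a set X plus an L2 penalty on the weights.

    A sketch of f_loss: the exact error form (mean vs. sum) and the
    default delta are assumptions, not the paper's values.
    """
    data_term = np.mean((y_hat - y) ** 2)
    l2_term = delta * sum(np.sum(W ** 2) for W in weights)
    return data_term + l2_term

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 3.0])        # perfect imputation: data term is 0
weights = [np.ones((2, 2))]              # sum of squared weights = 4
loss = loss_with_l2(y_hat, y, weights, delta=0.01)  # only the L2 term remains
```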

$\theta$ needs to be generated by iteration to approach the minimum of $h(\theta)$ [52]. Let $k$ denote the number of iterations. Expanding $h(\theta)$ by a second-order Taylor expansion at $\theta_k$ after $k$ iterations, equation (8) can be obtained:

$$h(\theta) \approx h(\theta_k) + \nabla h(\theta_k)^{T}(\theta - \theta_k) + \frac{1}{2}(\theta - \theta_k)^{T} H_k (\theta - \theta_k), \tag{8}$$

where $\nabla h(\theta_k)$ is the gradient vector of $h(\theta)$ and $H_k$ is the Hessian matrix of $h(\theta)$ at $\theta_k$. Minimizing the right-hand side gives the Newton update of equation (9):

$$\theta_{k+1} = \theta_k - H_k^{-1} \nabla h(\theta_k). \tag{9}$$

To simplify the calculation, an approximation is made to equation (9). Differentiating the expansion of $h(\theta)$ at $\theta_{k+1}$ and expressing it in the form of a gradient operator gives the following equation:

$$\nabla h(\theta_{k+1}) \approx \nabla h(\theta_k) + H_k(\theta_{k+1} - \theta_k). \tag{10}$$

Record $B_{k+1}$ as the first approximation of the Hessian, and let $s_k = \theta_{k+1} - \theta_k$ and $y_k = \nabla h(\theta_{k+1}) - \nabla h(\theta_k)$, so that

$$\nabla h(\theta_{k+1}) - \nabla h(\theta_k) = B_{k+1}(\theta_{k+1} - \theta_k). \tag{11}$$

Therefore, the secant condition holds:

$$y_k = B_{k+1} s_k. \tag{12}$$

Then, the BFGS update of the inverse matrix $B_{k+1}^{-1}$ can be formulated as follows, where $I$ is a unit matrix:

$$B_{k+1}^{-1} = \left(I - \rho_k s_k y_k^{T}\right) B_k^{-1} \left(I - \rho_k y_k s_k^{T}\right) + \rho_k s_k s_k^{T}, \tag{13}$$

while

$$\rho_k = \frac{1}{y_k^{T} s_k}. \tag{14}$$

Therefore, with $V_k = I - \rho_k y_k s_k^{T}$,

$$B_{k+1}^{-1} = V_k^{T} B_k^{-1} V_k + \rho_k s_k s_k^{T}. \tag{15}$$

In order to simplify the calculation and reduce the required memory, only the latest $m$ groups of $(s_k, y_k)$ are used in the calculation, and $B_{k+1}$ is approximated a second time. Expanding the recursion of equation (15) over the last $m$ pairs, the following equation can be expressed:

$$B_{k+1}^{-1} = \left(V_k^{T}\cdots V_{k-m+1}^{T}\right) B^{0,-1} \left(V_{k-m+1}\cdots V_k\right) + \sum_{i=k-m+1}^{k} \left(V_k^{T}\cdots V_{i+1}^{T}\right)\rho_i s_i s_i^{T}\left(V_{i+1}\cdots V_k\right), \tag{16}$$

where $B^{0,-1}$ is the initial inverse-Hessian approximation. Equation (16) has two purposes: one is to find the feasible direction of iterations, and the other is to determine the specific calculation method of iterative optimization. The expression of the direction $r$ is

$$r = B_{k}^{-1} \nabla h(\theta_k). \tag{17}$$

A two-loop recursion algorithm can be used to obtain the specific direction of parameter optimization and the optimized step length ɛ according to equation (17).
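The two-loop recursion itself can be sketched in a few lines; the initial scaling γ = s·y / y·y for the starting inverse-Hessian guess is a common choice and an assumption here:

```python
import numpy as np

def two_loop_direction(grad, s_list, y_list):
    """L-BFGS two-loop recursion: approximate r = B_k^{-1} grad using the
    latest m curvature pairs (s_i, y_i). The search direction is then -r."""
    q = grad.copy()
    alphas = []
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    # First loop: newest pair to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # Initial inverse-Hessian guess: gamma * I (a common heuristic)
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to newest
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r

# For f(theta) = 0.5 * ||theta||^2 the Hessian is I, so with a consistent
# curvature pair the recursion should return the gradient itself.
g = np.array([3.0, -4.0])
s = np.array([1.0, 0.5])
r = two_loop_direction(g, [s], [s])   # y = H s = s for this f
```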

The complete L-BFGS procedure is summarized in Algorithm 1.

2.5. Multivariate Imputation by Chain Equation

The parameters of the MLP are updated when the algorithm ends. Multivariate imputation by chained equations (MICE) is a multiple imputation method that can realize flexible imputation of missing values in the following four steps [53]:

Step 1: construct a data frame with a given capacity.
Step 2: fill the data frame with the imputation results from the MLP and the evaluation of each result. Different types of missing should have different filling functions.
Step 3: repeat Step 2; when the data frame is filled, go to Step 4.
Step 4: select the final imputation result of the missing values from the data frame according to the evaluation.
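For illustration, a minimal chained-equations loop with ordinary least squares in place of the MLP regressor might look like this (a sketch of the MICE idea, not the proposed MLP-MICE):

```python
import numpy as np

def mice_linear(D, n_iter=10):
    """A minimal chained-equations imputation sketch: each column with
    NaNs is repeatedly regressed on the other columns with ordinary least
    squares, and its missing entries are refilled, cycling until the
    values stabilise. OLS stands in for the paper's MLP regressor."""
    D = D.astype(float).copy()
    mask = np.isnan(D)
    col_means = np.nanmean(D, axis=0)
    # Initialise the frame with a simple fill (column means)
    D[mask] = np.take(col_means, np.where(mask)[1])
    for _ in range(n_iter):                      # repeat the chained pass
        for j in range(D.shape[1]):
            if not mask[:, j].any():
                continue
            others = np.delete(D, j, axis=1)
            X = np.column_stack([others, np.ones(len(D))])  # add intercept
            obs = ~mask[:, j]
            beta, *_ = np.linalg.lstsq(X[obs], D[obs, j], rcond=None)
            D[mask[:, j], j] = X[mask[:, j]] @ beta
    return D

# Column 1 is exactly 2 * column 0, so the missing entry is recoverable
D = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
D_imp = mice_linear(D)
```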

Through the above process, the MLP-MICE regression imputation method optimized by the L-BFGS algorithm is proposed. The complete process of the imputation model proposed in this study is shown in Figure 3.

3. Results and Discussion

3.1. Empirical Analysis

This study takes the traffic volume data detected by VDSs (Video Detection Systems) at 5 locations around an interchange of the freeways in Guangdong Province, shown in Figure 4, as an example for an empirical analysis of the proposed model. Traffic volume is extracted by image recognition at each VDS. The data used in the study were collected from 0:00 on May 1, 2020, to 24:00 on May 7, 2020, including two workdays and five holidays, which cover various scenarios. The VDS at each location is named VDS-1 through VDS-5. The time interval of data collection is 15 minutes, and 672 records were collected in total. The statistical properties of the 15-minute traffic volume of each VDS are shown in Table 2.

The probability distribution of the collected data, estimated by kernel density estimation (KDE), is shown in Figure 5. The higher the peak of the probability curve, the more concentrated the traffic volume; the further left the area enclosed by the curve and the coordinate axes lies, the lower the overall traffic volume.

In Table 3, lag represents the order of autocorrelation. A correlation is considered high when the correlation coefficient is greater than 0.5. Table 3 illustrates that adjacent VDSs are significantly correlated, which means the detected traffic volume has not only temporal relations within each VDS but also spatial relations among VDSs near each other.
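A lagged cross-correlation of the kind reported in Table 3 can be computed as follows; the synthetic series here are illustrative only:

```python
import numpy as np

def lagged_corr(a, b, lag=1):
    """Pearson correlation between series a and series b shifted back by
    `lag` intervals (lag 0 gives the plain cross-correlation)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if lag > 0:
        a, b = a[lag:], b[:-lag]
    return np.corrcoef(a, b)[0, 1]

# A downstream detector seeing the upstream pattern one interval later
upstream = np.array([10.0, 20, 30, 40, 50, 40, 30, 20])
downstream = np.roll(upstream, 1)
corr = lagged_corr(downstream, upstream, lag=1)
```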

In the process of data collection, data loss is inevitable due to the breakdown of electronic equipment or other environmental factors. The data are sorted and labeled in the form of a VDS-days-time series. For example, 010236 represents the 36th data collected on the second day of VDS-1. The specific missing data from real-world VDS are shown in Table 4.

The case study data show that missing data do occur. Although the missing rate is not very high in this case, it still affects data applications. On the other hand, for the imputation model in this study, the small missing rate ensures that there are sufficient observed values for model testing and performance analysis.

Imputation experiments for different scenarios are carried out, and comparisons are made with other methods. The missing rate denotes the degree of missingness in the dataset, which affects the output of the imputation model. There are two patterns of missing values: continuous missing and jumped missing. Continuous missing happens when a VDS cannot work for a long time for some reason, while jumped missing happens when a VDS breaks down temporarily. To simulate the different missing scenarios, the missing rate is set from 10% to 80% in 10% increments for both continuous missing and jumped missing. The MLP used in this case has two hidden layers consisting of 4 nodes and 2 nodes, respectively, and M is set to 60 according to the analysis of the detected data.

Data from the 1st–120th time intervals of all the locations are selected as the dataset of the simulated imputation experiment. The 1st–60th elements are defined as subset I and the 61st–120th as subset II. During the experiment, a piece of continuous data with a length of ten was randomly removed to simulate the continuous missing scenario, and ten discontinuous data points were randomly removed to simulate the jumped missing scenario. Meanwhile, to verify the superiority of the imputation model, MLP, random forest, and decision tree were selected as the control groups of the experiment.
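The two missing patterns can be simulated as follows; the random seed and run length are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def continuous_missing(series, length=10):
    """Remove one contiguous run of readings (detector down for a long period)."""
    s = series.astype(float).copy()
    start = rng.integers(0, len(s) - length + 1)
    s[start:start + length] = np.nan
    return s

def jumped_missing(series, count=10):
    """Remove scattered single readings (temporary detector faults)."""
    s = series.astype(float).copy()
    idx = rng.choice(len(s), size=count, replace=False)
    s[idx] = np.nan
    return s

volumes = np.arange(60, dtype=float)   # one 60-interval subset
cont = continuous_missing(volumes)
jump = jumped_missing(volumes)
```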

The experiment uses the mean absolute percentage error (MAPE) to evaluate the imputation performance:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%. \tag{18}$$

In equation (18), $y_i$ represents the $i$th observed value and $\hat{y}_i$ represents the $i$th imputation result. The smaller the MAPE, the better the imputation performance.
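A direct implementation of the MAPE metric of equation (18):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

m = mape([100, 200], [90, 220])   # (10% + 10%) / 2 = 10%
```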

3.2. Results’ Analysis

The results of experiments of MLP-MICE are shown in Table 5.

Table 5 and Figure 6(a) show that the model has the best imputation performance for continuous missing-subset I when the missing rate is 30%. The imputation accuracy of the model changes abruptly when the missing rate is 60%. And the model has the best imputation performance for continuous missing-subset II when the missing rate is 10%. The imputation accuracy of the model changes abruptly when the missing rate is 40%.

Table 5 and Figure 6(b) show that the model has the best imputation performance for jumped missing-subset I when the missing rate is 10%. The imputation accuracy of the model changes abruptly when the missing rate is 60%.

When the missing rate is between 10% and 30%, the MAPE of most imputation results is between 6.38% and 30%, which illustrates that the proposed model has good imputation performance [31–50].

In this case, the adjacent difference is defined as the absolute value of the difference between adjacent traffic volumes at the same VDS during a certain period. The total of the adjacent differences within a data subset is called the subset difference, which measures the degree of mutation of the data subset. The less the traffic volume fluctuates in a certain period, the smaller the subset difference.
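These two definitions translate directly into code (summing the adjacent differences as the "total difference"; the example values are illustrative):

```python
import numpy as np

def adjacent_differences(volumes):
    """Absolute differences between consecutive readings of one VDS."""
    v = np.asarray(volumes, dtype=float)
    return np.abs(np.diff(v))

def subset_difference(volumes):
    """Sum of adjacent differences: the subset's degree of mutation."""
    return adjacent_differences(volumes).sum()

stable = [100, 102, 101, 103]     # small fluctuations -> small difference
volatile = [100, 150, 90, 160]    # large fluctuations -> large difference
```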

Table 6 and Figure 7 show that the adjacent difference probability distributions of VDS-1, VDS-2, and VDS-4 are relatively uniform, with larger variances, while VDS-3 and VDS-5 have relatively concentrated adjacent differences. Compared with jumped missing, the imputation performance for continuous missing is better at the former three detectors, whereas VDS-3 and VDS-5 achieve better imputation results for jumped missing. Because the adjacent difference probability distribution describes the degree of mutation of the traffic volume, a curve concentrated to the left indicates stable data. Therefore, when the traffic volume changes significantly, the imputation performance for continuous missing is better; when the traffic volume changes stably, the imputation performance for jumped missing is better.

The main difference between subset I and subset II is the value of subset differences, as shown in Figure 8. The results of Table 5 and Figure 8(a) show that the imputation model has better performance for subset I, which means that MLP-MICE has better imputation performance for data subsets with smaller “subset differences.”

From Figure 8(b) and Table 5, we can see that, for the two subsets, the higher the subset difference, the lower the imputation performance of MLP-MICE. For continuous missing, the model performs better on data subsets with small subset differences, while for jumped missing the imputation performance is not strictly related to the subset difference. In general, however, imputation performance is better when the subset difference is smaller. We can therefore conclude that MLP-MICE achieves high imputation performance on data subsets with a low degree of mutation.

To verify the performance of the proposed model, regression imputation methods such as MLP, KNN, decision tree, SVR, and random forests are chosen for comparison experiments. Input subsets I and II into separate models and set the missing rate to 30% and 10%, respectively. The results are shown in Table 7.

Table 7 shows that the MAPE of the proposed model is lower than that of the other methods in each test. To examine the existing imputation models further, MLP, decision tree, and random forest, selected by their average imputation MAPE on both data subsets, are tested again on subsets I and II for continuous missing and jumped missing. The missing rate is set to 10%–80% with a gradient of 10%, and the results are compared with those of MLP-MICE. The results in Figure 9 show that MLP-MICE has good imputation performance, which further proves the superiority of the proposed imputation model.

Finally, the proposed model is tested in the short-term prediction of traffic volume. The MLP-MICE with a missing rate of 20% is constructed to perform regression imputation on jumped missing and continuous missing of the collected dataset in Section 3. Meanwhile, all-zero imputation was taken as the comparison. Short-term prediction of traffic volume is carried out with the support vector machine model.

The datasets whose missing values were completed by the different imputation methods were taken as the input data, and short-term traffic volume prediction over the last six hours (24 data points) was taken as an example. Compared with the dataset whose missing values are all replaced by 0, the MAPE calculated from the prediction on the MLP-MICE-repaired dataset is significantly better. The results are shown in Table 8 and Figure 10.
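A sketch of such an SVR-based short-term prediction on lagged volumes, using scikit-learn's SVR on synthetic data; the kernel, C, lag count, and synthetic series are assumptions for illustration, not the paper's settings:

```python
import numpy as np
from sklearn.svm import SVR

def make_lagged(series, n_lags=4):
    """Build (lag-feature, target) pairs for one-step-ahead prediction."""
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

# Synthetic 15-minute volumes with a daily-like cycle (stand-in for VDS data)
t = np.arange(200)
volumes = (100 + 30 * np.sin(2 * np.pi * t / 96)
           + np.random.default_rng(1).normal(0, 2, 200))

X, y = make_lagged(volumes)
split = len(y) - 24                       # hold out the last six hours (24 intervals)
model = SVR(kernel="rbf", C=100.0).fit(X[:split], y[:split])
y_hat = model.predict(X[split:])
pred_mape = 100.0 * np.mean(np.abs((y[split:] - y_hat) / y[split:]))
```

In the paper's experiment, the held-out series would instead come from the MLP-MICE-repaired (or all-zero-filled) detector data.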

Figure 10 shows that the prediction accuracy for VDS-1, VDS-3, VDS-4, and VDS-5 increases compared with all-zero imputation, while that for VDS-2 decreases because the continuous zeros are recognized as outliers by the SVR, which happens to make the SVR prediction on the all-zero dataset better there. Despite this defect, the average prediction MAPE with MLP-MICE imputation is 26.63%, while that with all-zero imputation is 45.21%. Therefore, MLP-MICE can effectively improve the accuracy of short-term traffic volume prediction.

4. Conclusions

This study proposes the MLP-MICE imputation model optimized by the L-BFGS algorithm, in which the temporal and spatial characteristics of freeway traffic volume are considered. According to the experiments and application analysis on real-world data, the following conclusions can be drawn. (i) The proposed MLP-MICE has better imputation performance than, and a strong superiority over, the other models compared. (ii) The imputation performance of the proposed model is better for continuous missing than for jumped missing. In imputing missing traffic volume values, the more smoothly the data change, the better the imputation performance of MLP-MICE for jumped missing; when the traffic volume changes significantly, the imputation performance of MLP-MICE for continuous missing improves. (iii) For both continuous and jumped missing, there is always a gap in imputation performance among data subsets that come from the same dataset but have different degrees of mutation: the smaller the degree, the better the imputation of the missing values. This gap widens as the datasets become more concentrated and narrows as they become more divergent. (iv) For freeway traffic volume data, applying the proposed model before short-term traffic prediction yields more accurate results than simply filling the missing data with zeros. However, this study mainly considers the spatial and temporal characteristics of traffic flow in the imputation model; features such as weather, road conditions, and travel demand may also influence imputation performance and can be considered in further study.

Data Availability

The data used to support the findings of this study were supplied under license and so cannot be made freely available. The data can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The research and publication of this work were funded by the National Natural Science Foundation of China (project no. 52072129). The authors would like to express their appreciation to Tianjiao Wang, Lingbin Kong, and Yuxin Guo for their help with this paper.