Special Issue: Safety Technologies and Fault Tolerant Methods for Engineering
Traffic Flow Anomaly Detection Based on Robust Ridge Regression with Particle Swarm Optimization Algorithm
Traffic flow anomaly detection helps improve the efficiency and reliability of detecting faulty behavior and the overall effectiveness of traffic operation. The data collected by traffic flow sensors contain a lot of noise due to equipment failure, environmental interference, and other factors. For traffic flow data with heavy noise, a traffic flow anomaly detection method based on robust ridge regression with the particle swarm optimization (PSO) algorithm is proposed. Feature sets containing historical characteristics with a strong linear correlation and statistical characteristics extracted with the optimal sliding window are constructed. The feature sets are then fed to the PSO-Huber-Ridge model, which outputs the forecasted traffic flow. The Huber loss function is used to reduce the interference of noise in the traffic flow. The regular term of the ridge regression is employed to reduce overfitting during model training. A fitness function is constructed that uses a control parameter to balance the relative size of the k-fold cross-validation root mean square error and the k-fold cross-validation average absolute error, which improves the efficiency of the optimization algorithm and the generalization ability of the proposed model. The hyperparameters of the robust ridge regression forecast model are optimized by the PSO algorithm. The traffic flow data set is used to train and validate the proposed model. Compared with models tuned by other optimization methods, the proposed model has the lowest RMSE, MAE, and MAPE. Finally, the traffic flow forecasted by the proposed model is used to perform anomaly detection. Abnormal errors between the forecasted value and the actual value are detected with an abnormal traffic flow threshold based on the sliding window. The experimental results verify the validity of the proposed anomaly detection model.
1. Introduction
Traffic flow anomaly detection plays an essential role in the traffic field. Traffic jams have become a common thing in big cities and have received considerable critical attention. The traffic flow anomaly detection model can detect the abnormal traffic flow and can be achieved by constructing a traffic flow forecast model, which is helpful to avoid traffic congestion in time. The accurate forecast of traffic flow can not only provide a basis for real-time traffic control but also provide support for the alleviation of traffic jams and the effective use of traffic networks, and the forecast result of traffic flow can directly affect the accuracy of traffic anomaly detection. Useful information can be extracted from massive traffic flow data through the traffic flow forecast model so as to quickly forecast the short-term traffic flow in the future and detect the traffic flow abnormalities in time, thus improving the traffic operation efficiency.
In recent years, many experts and scholars have studied traffic flow forecasting. The ARIMA model is a classic time series model that is often used in traffic flow forecasts. Kumar and Vanajakshi proposed a SARIMA-based traffic flow forecast scheme, which effectively solved the problem of massive data required for model training. Shahriari et al. combined bootstrap with the ARIMA model, which overcame the lack of theoretical support in nonparametric methods and improved the forecast accuracy of the model. Luo et al. combined the improved SARIMA model with the genetic algorithm and tested the model on real traffic flow data, with good forecast results. The ARIMA model forecasts the traffic flow based on historical values. If the model training data contain noise, the model's performance will be greatly reduced.
The neural network model can fit complex data relationships and can learn the nonlinear relationships implicit in traffic flow. Qu et al. proposed a batch learning method to solve the time-consuming problem of the traffic flow neural network prediction model, which effectively reduced the training time of the neural network. Zhang et al. used the spatiotemporal feature extraction algorithm to extract the temporal and spatial features in traffic flow. The features were input into the recurrent neural network for modeling and forecasting, which effectively improved the forecast performance of the model. Zhang et al. proposed a multitask deep learning model to forecast the traffic network flow. The nonlinear Granger causality analysis was used to select features for the model. The Bayesian optimization algorithm was used to optimize the model parameters. The forecast performance was better than that of the single deep learning model. Do et al. used temporal and spatial attention mechanisms to help neural network models fully explore the temporal and spatial characteristics of the traffic flow, which not only effectively improved the prediction performance of the model but also enhanced its interpretability. Neural network models overfit easily, and their calculation cost is much higher than that of traditional traffic flow forecast models. Because neural networks can fit nonlinear relationships in data, they can easily mistake noise for an implicit nonlinear relationship, which reduces the generalization ability of the model.
The support vector regression machine fits data based on the strategy of structural risk minimization and is a common model in the field of traffic flow forecasts. Wang et al. proposed an adaptive traffic flow forecast framework, which used the Bayesian optimization algorithm to optimize the parameters of the support vector machine model. The forecast performance was better than that of the SARIMA model. Luo et al. used the discrete Fourier transform to extract the trend information in traffic flow and used support vector machines for error compensation, which improved the forecast accuracy of the model. The support vector regression machine solves the optimization problem based on quadratic programming. When the sample size is large, the model training time increases greatly. The support vector regression machine is also very sensitive to noise in the data. When it selects noisy samples as support vectors, the forecast performance of the model will be poor.
Traffic flow anomaly detection plays an important role in the field of urban traffic control, and many studies have addressed it. Djenouri et al. proposed a framework for detecting temporal and spatial traffic anomalies. The KNN algorithm was applied to the space-time traffic database, and experiments were run on the traffic flows at ten different locations. Experimental results showed that the performance of the proposed framework is better than that of the baseline model. Yujun et al. proposed a hybrid model that contained the Poisson mixture model and the coupled hidden Markov model. The proposed model considered the spatial correlation of traffic flow and the degree of traffic congestion. Semisynthetic and real traffic anomaly data were used to verify the validity of the model. Zhang et al. employed the dictionary-based compression theory to identify the spatial and temporal characteristics of traffic flow and used an anomaly index to quantify the degree of traffic anomalies. The proposed method can clearly detect the location of traffic flow spatial anomalies. Noise in traffic data may lead traffic anomaly detection models to false detection results, which may affect the normal operation of traffic networks.
Influenced by factors such as mechanical damage, line aging, signal loss, and environmental interference, the data detected by the traffic flow sensor contain a lot of noise. The Huber loss function is a mixture of the $L_1$ and $L_2$ loss functions and is insensitive to noise, and the regular term of the ridge regression can effectively avoid overfitting caused by model training. To improve the generalization performance of the model, the sum of RMSE and MAE on the training set based on k-fold cross-validation is constructed as the fitness function, and the PSO algorithm is used to optimize the model hyperparameters. The PSO algorithm originated from research on the foraging process of birds. It has a simple structure. Each particle in the particle swarm has three main attributes: position, velocity, and fitness. In recent years, many studies have achieved good results using the particle swarm optimization algorithm [16–20].
To solve the problem of noise in traffic flow data, a Huber-Ridge traffic flow anomaly detection model with the particle swarm optimization (PSO) algorithm is proposed. The Huber-Ridge model is used to reduce the negative impact of noise in the data. Huber-Ridge model performance depends on model hyperparameters. Therefore, it is very important to determine the optimal model hyperparameters. A PSO algorithm based on the proposed fitness function is used to search for the optimal hyperparameters of the model so that the model has the best performance.
The remaining part of the paper proceeds as follows: Section 2 introduces the theoretical information of the Huber-Ridge algorithm; Section 3 proposes the data preprocessing steps and the steps using PSO algorithm to optimize the Huber-Ridge model parameters; Section 4 illustrates the model evaluation indexes; Section 5 presents the experimental content which contains the comparison of the forecast models and the results of traffic flow anomaly detections; Section 6 is conclusions.
2. Huber-Ridge Algorithm
2.1. Huber Function
The Huber function combines the $L_1$ loss function and the $L_2$ loss function and can effectively avoid the interference of noise in the data during data fitting. Its robustness is better than that of the $L_1$ and $L_2$ loss functions. The definition of the Huber loss function is

$$L_\delta(e) = \begin{cases} \frac{1}{2}e^2, & |e| \le \delta, \\ \delta|e| - \frac{1}{2}\delta^2, & |e| > \delta. \end{cases} \tag{1}$$

The definitions of the $L_1$ loss function and the $L_2$ loss function are shown in equations (2) and (3):

$$L_1(e) = |e|, \tag{2}$$

$$L_2(e) = e^2, \tag{3}$$

where $e$ is the error between the actual value and the estimated value, and $\delta$ is the threshold. When the threshold is 1, the comparison of the Huber loss function, the $L_1$ loss function, and the $L_2$ loss function is shown in Figure 1. When $|e|$ is smaller than the threshold $\delta$, the Huber loss function is quadratic like the $L_2$ loss function, so, unlike the $L_1$ loss function, it is smooth around zero. When $|e|$ is greater than the threshold $\delta$, the Huber loss function grows linearly like the $L_1$ loss function, so it penalizes large errors, which are often caused by noise, less heavily than the $L_2$ loss function does. Therefore, the Huber loss function is quadratic for smaller errors and is linear for larger errors.
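The behavior of the three loss functions can be checked numerically. The following is a minimal sketch that assumes the standard textbook form of the Huber loss (quadratic with a factor of 1/2 inside the threshold, linear outside); the function names `l1_loss`, `l2_loss`, and `huber_loss` are illustrative and do not come from the paper.

```python
def l1_loss(e):
    """Absolute-error loss |e|."""
    return abs(e)


def l2_loss(e):
    """Squared-error loss e^2."""
    return e * e


def huber_loss(e, delta=1.0):
    """Huber loss (assumed standard form): quadratic for |e| <= delta, linear beyond."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * a - 0.5 * delta * delta


if __name__ == "__main__":
    # Compare the three losses for a small, a borderline, and two large errors.
    for e in (0.5, 1.0, 3.0, 10.0):
        print(e, l1_loss(e), l2_loss(e), huber_loss(e, delta=1.0))
```

For large errors the Huber value grows at the same rate as the absolute error, which is why a single noisy sample influences the fit far less than it would under the squared loss.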
2.2. Ridge Regression Model
The ridge regression model was first proposed by Hoerl and Kennard. The ridge regression objective function adds a regular term to the least squares objective function. Its definition is as follows:

$$J(\beta) = \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2, \tag{4}$$

where $\|\beta\|_2^2$ is the regular term and $\lambda$ is the ridge parameter, which is the weight of the regular term.
For the linear regression model $y = X\beta + \varepsilon$, the least squares estimate of the regression coefficient is defined as follows:

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y, \tag{5}$$

where $X$ is the independent variable matrix and $y$ is the dependent variable vector.
The mean square error of the least squares estimate is defined as follows:

$$\mathrm{MSE}(\hat{\beta}) = \sigma^2\sum_{j=1}^{p}\frac{1}{\mu_j}, \tag{6}$$

where $\sigma^2$ is the variance of the error term $\varepsilon$, $\mu_j$ are the characteristic roots of $X^{T}X$, and $p$ is the number of independent variables.
If there is a linear correlation between independent variables, the matrix $X^{T}X$ is singular or nearly singular. Some characteristic roots of such a matrix are close to zero, resulting in a large $\mathrm{MSE}(\hat{\beta})$. This indicates that there is a large error between the least squares estimate and the actual value. Adding the disturbance term $\lambda I$ to the matrix $X^{T}X$ weakens the singularity, thereby reducing $\mathrm{MSE}(\hat{\beta})$. The least squares estimate with the disturbance term added is the ridge estimate, which is defined as follows:

$$\hat{\beta}(\lambda) = (X^{T}X + \lambda I)^{-1}X^{T}y, \tag{7}$$

where $\lambda$ is the ridge parameter and $I$ is the identity matrix. $\hat{\beta}(\lambda)$ indicates the ridge estimate of the regression parameter when the ridge parameter is $\lambda$. When $\lambda = 0$, the ridge estimate is the least squares estimate. When the independent variables are linearly correlated, the ridge estimator is biased but has lower variance than the least squares estimator, which improves the efficiency of parameter estimation.
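The effect of the ridge parameter can be illustrated on a tiny example. The sketch below assumes the usual closed form in which the ridge parameter is added to the diagonal of the Gram matrix before solving; `solve` and `ridge_estimate` are hypothetical helper names, and the linear system is solved with plain Gaussian elimination so that only the standard library is needed.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_estimate(X, y, lam):
    """Return the solution of (X^T X + lam * I) beta = X^T y."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(A, b)


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [2.0, 3.0, 5.0]                 # generated from beta = (2, 3) without noise
    print(ridge_estimate(X, y, 0.0))    # lam = 0 reduces to least squares
    print(ridge_estimate(X, y, 1.0))    # lam > 0 shrinks the coefficients
```

With a ridge parameter of zero the least squares coefficients are recovered, and a positive value pulls both coefficients toward zero, which is the bias accepted in exchange for lower variance.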
2.3. Huber-Ridge Regression
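Owen used the Huber loss function to replace the least-squares loss function and converted the ridge regression to the Huber-Ridge regression. The definition of the Huber-Ridge model is as follows:

$$\hat{w} = \arg\min_{w}\left\{\sum_{i=1}^{n} L_\delta\left(y_i - x_i^{T}w\right) + \lambda\sum_{j=1}^{p} w_j^2\right\}, \tag{8}$$

where $L_\delta$ is the Huber loss function with threshold $\delta$, $\hat{w}$ is the weight vector of the regression when the objective function is the smallest, $w_j$ represents the estimate for each regression coefficient, $\sum_{j} w_j^2$ is the regular term, and $\lambda$ is the weight of the regular term, which is used to balance the relationship between the Huber loss function and the regular term. The Huber loss function helps the model avoid the influence of data noise. The regular term shrinks the regression coefficients and helps avoid overfitting of the model. The Huber-Ridge regression combines the robustness of the Huber regression to noise with the regularization of the ridge regression, which not only ensures the robustness of the model but also makes the regression model more stable.
$\sum_{j} w_j^2$ can be considered as $w^{T}w$, which is the squared $L_2$ norm of the weight vector $w$. The objective function is defined as follows:

$$J(w) = \sum_{i=1}^{n} L_\delta(e_i) + \lambda w^{T}w, \tag{9}$$

where $e_i$ is the error. Taking the partial derivative of the objective function with respect to the weight vector $w$ and setting it to zero gives the expression of the weight vector at the minimum of the objective function. The solution process of equation (9) is as follows:

$$\frac{\partial J(w)}{\partial w} = \sum_{i=1}^{n}\frac{\partial L_\delta(e_i)}{\partial w} + \lambda\frac{\partial (w^{T}w)}{\partial w} = 0, \tag{10}$$

where $e_i = y_i - \hat{y}_i$, $\hat{y}_i = x_i^{T}w$ is the estimated value, and $y_i$ is the actual value. The first term of equation (10) can be simplified as

$$\sum_{i=1}^{n}\frac{\partial L_\delta(e_i)}{\partial w} = -\sum_{i=1}^{n}\psi(e_i)\,x_i, \qquad \psi(e_i) = \begin{cases} e_i, & |e_i| \le \delta, \\ \delta\,\mathrm{sign}(e_i), & |e_i| > \delta. \end{cases} \tag{11}$$

Let

$$\varphi_i = \begin{cases} 1, & |e_i| \le \delta, \\ \delta/|e_i|, & |e_i| > \delta, \end{cases} \tag{12}$$

so that $\psi(e_i) = \varphi_i e_i$; equation (11) can be simplified as

$$-\sum_{i=1}^{n}\varphi_i e_i x_i = -X^{T}\Phi\,(y - Xw), \tag{13}$$

where $\Phi = \mathrm{diag}(\varphi_1, \ldots, \varphi_n)$.
The second term of equation (10) can be simplified as

$$\lambda\frac{\partial (w^{T}w)}{\partial w} = 2\lambda w. \tag{14}$$

In summary, the solution process of equation (10) is as follows:

$$X^{T}\Phi X w + 2\lambda w = X^{T}\Phi\,y, \tag{15}$$

$$w = \left(X^{T}\Phi X + 2\lambda I\right)^{-1}X^{T}\Phi\,y, \tag{16}$$

where $I$ is the identity matrix. Because $\Phi$ depends on the errors and therefore on $w$, equation (16) is in general evaluated iteratively until $w$ stops changing. The optimal threshold $\delta$ and the ridge parameter $\lambda$ can be found in a fixed interval through the optimization algorithm. The weight vector $w$ can be obtained by substituting the threshold value $\delta$, the ridge parameter $\lambda$, and the sample data into equation (16).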
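One way to evaluate a weighted closed-form solution of this kind is iteratively reweighted least squares. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes the objective is the sum of Huber losses plus `lam` times the squared norm of the weights (which produces the factor `2 * lam` on the diagonal), gives each sample the weight 1 inside the threshold and `delta / |e|` outside, and uses the hypothetical names `solve` and `huber_ridge_irls`.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def huber_ridge_irls(X, y, delta, lam, n_iter=50):
    """Iterate w = (X^T Phi X + 2*lam*I)^-1 X^T Phi y with Huber weights Phi."""
    p = len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        # Residuals under the current weights.
        resid = [yi - sum(xj * wj for xj, wj in zip(row, w))
                 for row, yi in zip(X, y)]
        # Huber weights: 1 for small residuals, delta/|e| for large ones.
        phi = [1.0 if abs(e) <= delta else delta / abs(e) for e in resid]
        A = [[sum(f * row[i] * row[j] for f, row in zip(phi, X))
              + (2.0 * lam if i == j else 0.0) for j in range(p)]
             for i in range(p)]
        b = [sum(f * row[i] * yi for f, row, yi in zip(phi, X, y))
             for i in range(p)]
        w = solve(A, b)
    return w


if __name__ == "__main__":
    X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
    y = [2.0, 4.0, 6.0, 8.0, 40.0]   # y = 2x except for one gross outlier
    print(huber_ridge_irls(X, y, delta=1e9, lam=0.0))  # behaves like least squares
    print(huber_ridge_irls(X, y, delta=1.0, lam=0.0))  # outlier is down-weighted
    print(huber_ridge_irls(X, y, delta=1.0, lam=1.0))  # plus ridge shrinkage
```

In this toy example the very large threshold reproduces the least squares slope, which the outlier drags far above 2, while the threshold of 1 keeps the slope close to 2 because the outlier receives a small weight.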
3. PSO-Huber-Ridge Model
3.1. PSO Algorithm
The core idea of the PSO algorithm comes from the foraging process of birds. For the PSO algorithm, the candidate solution of the optimization problem is a particle in the hyperparameter space. Each particle has its corresponding fitness value, speed, and position. The speed of the particle determines the direction and the displacement of the particle to look for the candidate solution. The PSO algorithm can find the optimal solution by iterating a group of initialized random particles.
For the PSO algorithm, there are $m$ particles in the $D$-dimensional space. The speed of each particle can be expressed as $V_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$, and the position of each particle can be expressed as $X_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$, where $i = 1, 2, \ldots, m$. In the loop iteration, each particle represents a candidate solution. The corresponding fitness value can be obtained through the fitness function. The individual optimal particle and the global optimal particle can be selected based on the fitness value. The personal optimal particle (pbest) is expressed as $P_i = (p_{i1}, p_{i2}, \ldots, p_{iD})$, and the global optimal particle (gbest) is expressed as $P_g = (p_{g1}, p_{g2}, \ldots, p_{gD})$. Before the next iteration, each particle will update its speed and position through equations (17)–(19):

$$v_{id}^{t+1} = \omega v_{id}^{t} + c_1 r_1\left(p_{id} - x_{id}^{t}\right) + c_2 r_2\left(p_{gd} - x_{id}^{t}\right), \tag{17}$$

$$v_{id}^{t+1} = \begin{cases} v_{\max}, & v_{id}^{t+1} > v_{\max}, \\ -v_{\max}, & v_{id}^{t+1} < -v_{\max}, \end{cases} \tag{18}$$

$$x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}, \tag{19}$$

where $\omega$ is the inertia factor, $c_1$ is the local learning factor, and $c_2$ is the global learning factor. $r_1$ and $r_2$ are random numbers uniformly distributed between $[0, 1]$. $t$ and $t+1$ represent the number of iterations. $v_{\max}$ represents the maximum speed of the particle.
In equation (17), $\omega v_{id}^{t}$ is called the memory term, which reflects the influence of the previous speed on the particle when it is updated; $c_1 r_1(p_{id} - x_{id}^{t})$ is called the self-cognition term, which means that when the particle is updated, it is biased toward the individual optimal particle; $c_2 r_2(p_{gd} - x_{id}^{t})$ is called the group-cognition term, which means that when the particles are updated, they are biased toward the group optimal particle. It represents the result of collaboration among multiple particles.
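The update rule described above can be sketched in a few lines. This is a minimal illustration rather than the configuration used in the experiments: the function name `pso_minimize`, the parameter values (inertia 0.73, both learning factors 1.5, a speed limit of 20% of each search range), and the choice to update the global best as soon as a particle improves are all assumptions made for the example.

```python
import random


def pso_minimize(fitness, bounds, n_particles=20, n_iter=200,
                 omega=0.73, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over the box `bounds` with a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    v_max = [0.2 * (hi - lo) for lo, hi in bounds]   # assumed speed limit
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[rng.uniform(-vm, vm) for vm in v_max] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Memory term + self-cognition term + group-cognition term.
                v = (omega * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                v = max(-v_max[d], min(v_max[d], v))      # clamp the speed
                vel[i][d] = v
                lo, hi = bounds[d]
                pos[i][d] = max(lo, min(hi, pos[i][d] + v))  # stay inside the box
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


if __name__ == "__main__":
    # Toy fitness with its minimum at (1, -2).
    best, val = pso_minimize(lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2,
                             [(-5.0, 5.0), (-5.0, 5.0)])
    print(best, val)
```

In the setting of this paper each particle would hold two coordinates, the Huber threshold and the ridge parameter, and `fitness` would be the cross-validated error of the model trained with those values.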
3.2. Fitness Function
The PSO algorithm can find the optimal hyperparameters for the model based on the fitness function. The smaller the particle fitness value, the lower the forecast error of the corresponding hyperparameters. To improve the generalization ability of the model, k-fold cross-validation is added to the fitness function. The fitness function is defined as the weighted sum of RMSE and MAE of k-fold cross-validation on the model training set. The expression for the fitness function is as follows:

$$F = \mathrm{RMSE}_{cv} + \theta\,\mathrm{MAE}_{cv}.$$

$\mathrm{RMSE}_{cv}$ is the root mean square error based on k-fold cross-validation, and its expression is as follows:

$$\mathrm{RMSE}_{cv} = \frac{1}{k}\sum_{j=1}^{k}\sqrt{\frac{k}{n}\sum_{i=1}^{n/k}\left(\hat{y}_{ji} - y_{ji}\right)^2}.$$

$\mathrm{MAE}_{cv}$ is the average absolute error based on k-fold cross-validation, and its expression is as follows:

$$\mathrm{MAE}_{cv} = \frac{1}{k}\sum_{j=1}^{k}\frac{k}{n}\sum_{i=1}^{n/k}\left|\hat{y}_{ji} - y_{ji}\right|,$$

where $n$ is the number of training samples and $k$ is the number of cross-validated subsets. $\hat{y}_{ji}$ and $y_{ji}$ are the model estimated value and the true value of the $i$-th sample in the $j$-th validation subset, respectively. The smaller the fitness function value, the better the corresponding particle.
The weight of $\mathrm{MAE}_{cv}$ is $\theta$, which is also the control parameter used to balance the size of $\mathrm{RMSE}_{cv}$ and $\mathrm{MAE}_{cv}$. When $\theta < 1$, $\mathrm{MAE}_{cv}$ has less weight than $\mathrm{RMSE}_{cv}$; when $\theta > 1$, $\mathrm{MAE}_{cv}$ has more weight than $\mathrm{RMSE}_{cv}$; when $\theta = 1$, $\mathrm{MAE}_{cv}$ has the same weight as $\mathrm{RMSE}_{cv}$. $\mathrm{RMSE}_{cv}$ has a small penalty for small errors. The degree of penalty of $\mathrm{MAE}_{cv}$ stays proportional to the error, so it does not punish large errors as much as $\mathrm{RMSE}_{cv}$. The fitness function controls the degree to which it penalizes errors by adjusting the size of the control parameter $\theta$. As the control parameter increases, the degree of penalty for small errors by the fitness function increases. Combining $\mathrm{RMSE}_{cv}$ and $\mathrm{MAE}_{cv}$ mitigates the insufficient penalty of $\mathrm{RMSE}_{cv}$ for small errors and the insufficient penalty of $\mathrm{MAE}_{cv}$ for large errors, which not only increases the penalty for model prediction errors but also improves the generalization ability of the model.
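A cross-validated fitness of this kind can be sketched as follows. The example assumes that the control parameter multiplies the MAE term, that folds are contiguous blocks of the training set, and that the model is supplied through two hypothetical callables `fit` and `predict`; the mean predictor is only a stand-in so that the sketch runs without the regression model itself.

```python
import math


def kfold_splits(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds over range(n); needs n >= k."""
    sizes = [n // k + (1 if j < n % k else 0) for j in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


def cv_fitness(X, y, fit, predict, k=10, theta=1.0):
    """Return mean fold RMSE plus theta times mean fold MAE."""
    rmse_sum = 0.0
    mae_sum = 0.0
    for train, val in kfold_splits(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors = [predict(model, X[i]) - y[i] for i in val]
        rmse_sum += math.sqrt(sum(e * e for e in errors) / len(errors))
        mae_sum += sum(abs(e) for e in errors) / len(errors)
    return rmse_sum / k + theta * mae_sum / k


def fit_mean(X_train, y_train):
    """Stand-in model: remember the mean of the training targets."""
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    """Stand-in prediction: always return the stored mean."""
    return model


if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0], [3.0]]
    y = [0.0, 2.0, 0.0, 2.0]
    print(cv_fitness(X, y, fit_mean, predict_mean, k=2, theta=1.0))
```

Passing a function of this shape to a particle swarm, with `fit` closed over a candidate threshold and ridge parameter, gives one fitness value per particle.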
3.3. Data Preprocessing
Good data quality can improve the performance of the model. Missing values and dimensional differences in the data reduce the forecast performance of the model. Therefore, it is important to preprocess the data. The data preprocessing can be divided into the following steps:
(1) Data cleaning. Each missing value is filled with the value that precedes it.
(2) Construction of model feature sets and output samples. For the traffic flow data set, the historical characteristics based on the linear correlation and the statistical characteristics based on the sliding window should be constructed. The model output sample is the traffic flow at the next time point after the sliding window.
(3) Data standardization. There are dimensional differences between different features. To prevent dimensional differences from reducing the model performance, each feature is transformed to have a mean of 0 and a variance of 1 through the standardized equation. The standardized equation is as follows:

$$x'_{ki} = \frac{x_{ki} - \mu_i}{\sigma_i}, \qquad \mu_i = \frac{1}{n}\sum_{k=1}^{n}x_{ki}, \qquad \sigma_i = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(x_{ki} - \mu_i\right)^2}.$$

For the feature matrix, $x'_{ki}$ is the standardized data of the k-th row and the i-th column, $\mu_i$ is the mean value of the i-th column, $\sigma_i$ is the standard deviation of the i-th column, and $n$ is the number of samples.
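The cleaning and standardization steps above can be sketched with the standard library alone. The names `forward_fill` and `standardize` are illustrative, missing values are assumed to be marked with `None`, and replacing a zero standard deviation by 1 is an added safeguard that the paper does not mention.

```python
import math


def forward_fill(series):
    """Replace each None with the most recent preceding value."""
    filled = []
    last = None
    for v in series:
        if v is None:
            v = last          # a leading None stays None: nothing precedes it
        filled.append(v)
        last = v
    return filled


def standardize(rows):
    """Scale every column to mean 0 and variance 1 (population statistics)."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(p)]
    stds = []
    for i in range(p):
        var = sum((r[i] - means[i]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)   # guard against constant columns
    scaled = [[(r[i] - means[i]) / stds[i] for i in range(p)] for r in rows]
    return scaled, means, stds


if __name__ == "__main__":
    print(forward_fill([5, None, None, 7, None]))
    print(standardize([[1.0, 10.0], [3.0, 30.0]])[0])
```

Returning the column means and standard deviations makes it possible to scale the test set with the statistics of the training set, which avoids leaking test information into training.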
3.4. PSO-Huber-Ridge Model Optimization Process
The optimization steps of the PSO-Huber-Ridge model are as follows:
Step 1. Start the optimization.
Step 2. Determine the model inputs and outputs. The feature set is used as the model input, and the model outputs the traffic flow.
Step 3. PSO-Huber-Ridge model parameter settings. The number of particles $m$, the inertia factor $\omega$, the local learning factor $c_1$, and the global learning factor $c_2$ are input into the PSO algorithm. Initialize the speed and the position of each particle. Set the maximum number of iterations of the PSO algorithm and the value range of the model hyperparameters.
Step 4. Increment the iteration counter, $t = t + 1$.
Step 5. Particles update. Use equations (17)–(19) to update the speed and position of each particle.
Step 6. Fitness evaluation. Use equation (21) to calculate the fitness value of the particle based on the threshold value and the ridge parameter of each particle.
Step 7. Optimal particle selection. Select the individual optimal particle and the global optimal particle according to the fitness value of the particles.
Step 8. Termination check. If the number of iterations does not meet the termination condition (the maximum number of iterations has not been reached), return to Step 4. Otherwise, continue to the next step.
Step 9. Output optimization results. Output the threshold and the ridge parameter in the global optimal particle.
Step 10. End the optimization.
4. Evaluation Indexes
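The average absolute error (MAE), root mean square error (RMSE), and average absolute percentage error (MAPE) were used to evaluate the forecast performance of the model. The definitions of MAE, RMSE, and MAPE are as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|,$$

where $n$ is the number of samples in the test set, $\hat{y}_i$ is the model forecast value, and $y_i$ is the true value. MAE and RMSE reflect the forecast error of the model in the units of the traffic flow. The value range of MAPE is $[0, +\infty)$. The closer its value is to 0, the better the model performance.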
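The three evaluation indexes can be computed directly from paired lists of true and forecast values. This is a small sketch using the usual definitions; the function names are illustrative, and MAPE is returned in percent under the assumption that no true value is zero.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; y_true must not contain zeros."""
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    y_true = [100.0, 200.0]
    y_pred = [110.0, 180.0]
    print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same data, and the gap between the two widens when a few forecasts are far off.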
5. Experimental Results and Analysis
5.1. Data Description
The traffic flow data set used in the experiment came from a highway intersection in Changsha City and was collected by a single detector with a data interval of 5 minutes. There were a small number of missing values in the traffic flow data set and the previous value of the missing value was used to fill in the missing points. The data sets containing 5 days of traffic flow were divided into the training set and the test set. The traffic flow from Saturday to Tuesday was used as the training set for the training model. The traffic flow on Wednesday was used as the test set to verify the performance of the trained model.
5.2. Feature Extraction and Selection
Historical characteristics based on the linear correlation from the traffic flow data were selected. The statistical characteristics based on the optimal sliding window were extracted.
The historical characteristics were selected first. The Pearson correlation coefficient was used to judge the strength of the linear correlation between the data. The range of the correlation coefficient is $[-1, 1]$. The closer it is to 1, the stronger the positive correlation between the data; the closer to $-1$, the stronger the negative correlation; the closer to 0, the weaker the linear correlation. Historical values with a correlation coefficient greater than 0.9 were selected as historical characteristics. See Table 1 for the correlation coefficients of traffic flow with delays of 1–9.
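The lag selection described here can be sketched as follows, assuming that each delay is scored by the Pearson correlation between the series and a copy of itself shifted by that delay. The names `pearson`, `lag_correlation`, and `select_lags` are illustrative, and the ramp used in the example is synthetic data, not the traffic flow data set.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def lag_correlation(series, lag):
    """Correlation between the series and itself delayed by `lag` steps."""
    return pearson(series[lag:], series[:-lag])


def select_lags(series, max_lag=9, threshold=0.9):
    """Keep the delays whose correlation exceeds the threshold."""
    return [lag for lag in range(1, max_lag + 1)
            if lag_correlation(series, lag) > threshold]


if __name__ == "__main__":
    ramp = [float(t) for t in range(30)]      # synthetic, perfectly linear series
    print(select_lags(ramp, max_lag=3))
```

On real traffic flow the correlation usually decays as the delay grows, so the threshold cuts off the delays that no longer carry a strong linear relationship.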
According to Table 1, the historical values with delays of 1–6 were selected as historical characteristics. To fully consider the periodicity of the traffic flow, the historical values at the same time point last week were also selected. The set of historical characteristics therefore included the historical values with delays of 1–6 and the historical values at the same time point last week.
The statistical characteristics of the optimal sliding window were extracted. The maximum, minimum, median, mean, standard deviation, skewness, and kurtosis of the data within the sliding window were taken as the statistical characteristics. The sliding window length was varied over a fixed range of values. The Huber-Ridge model with default hyperparameters was evaluated exhaustively for every window length on the traffic flow training set. The optimal window length was selected with the MAPE evaluation index as the standard. It can be seen from Figure 2 that the MAPE value was smallest at a sliding window length of 34, which was taken as the optimal sliding window length.
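The construction of lagged and sliding-window features can be sketched as below. Several details are assumptions made for the illustration: the window ends just before the time point being forecast, the standard deviation uses the population formula, kurtosis is reported as excess kurtosis, and a constant window is given zero skewness and kurtosis. The names `window_stats` and `build_samples` are hypothetical.

```python
import math
import statistics


def window_stats(values):
    """Return max, min, median, mean, std, skewness, excess kurtosis of a window."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurt = m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0
    return [max(values), min(values), statistics.median(values),
            mean, math.sqrt(m2), skew, kurt]


def build_samples(series, lags, window):
    """Build one feature row and one target per time point with enough history."""
    start = max(max(lags), window)
    X, y = [], []
    for t in range(start, len(series)):
        lag_features = [series[t - lag] for lag in lags]
        X.append(lag_features + window_stats(series[t - window:t]))
        y.append(series[t])            # target: flow at the next time point
    return X, y


if __name__ == "__main__":
    X, y = build_samples([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], lags=[1, 2], window=3)
    print(X[0], y[0])
```

Searching for the best window length then amounts to calling `build_samples` once per candidate length and comparing the resulting validation errors.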
5.3. Experimental Results
The state transition algorithm (STA), grey wolf optimizer (GWO), genetic algorithm (GA), and PSO algorithm were used to optimize the hyperparameters of the Huber-Ridge model. The range of model parameters is shown in Table 2:
The parameter values of the optimization algorithm are shown in Table 3:
The model training was performed using the standardized traffic flow training set. The fitness function based on 10-fold cross-validation was used. The control parameter of the fitness function was taken as 1. The performances of the STA-Huber-Ridge model, the GWO-Huber-Ridge model, the GA-Huber-Ridge model, and the PSO-Huber-Ridge model were compared and analyzed using RMSE, MAE, and MAPE evaluation functions. The optimization results of the four model parameters are shown in Table 4. The iterative comparison of their fitness values is shown in Figure 3.
It can be seen from Table 4 and Figure 3 that the fitness value of the STA algorithm dropped rapidly in the early stage of the iteration and then became trapped near a local optimum; after that, it dropped slowly in the later stage. The state transition algorithm used four transform operators to search. The search range was large and the early convergence was fast. However, transform operators with fixed values limited the global search capability of the state transition algorithm. The fitness value of the GWO algorithm decreased slowly in the iterative process, so its global optimization efficiency was not high. The GWO algorithm may easily fall into a local optimum and fail to find the global best. The control parameters of the GWO algorithm decreased linearly over the iterations, which cannot satisfy the complex search process. The fitness value of GA stagnated in the early stage of the iteration and became trapped near a local optimum. This is because the genetic algorithm suffers from premature convergence, making it difficult to jump out of the local optimum. Compared with the GWO, GA, and STA optimization algorithms, the PSO algorithm has a better iterative update strategy. It updates the particle position based on the individual experience of particles and the global experience of the particle swarm, so it does not fall into a local optimum easily.
It can be seen from Table 5 that the PSO-Huber-Ridge model had the lowest MAE, RMSE, and MAPE; that is, the forecast performance of the PSO-Huber-Ridge model was the best. It can be seen from Figure 4 that the PSO-Huber-Ridge model forecasts the trend of the traffic flow well at most time points.
Based on the error between the predicted value of the PSO-Huber-Ridge model and the actual value, anomaly detection was performed on the traffic flow using a threshold computed from the mean value and variance of the error data in a sliding window with a length of 10. If the forecast error at the next time point after the sliding window was greater than the anomaly detection threshold, the traffic flow at this time point was defined as an abnormal flow. The abnormal warnings would be reported to relevant traffic departments to avoid possible traffic jams. The label for abnormal traffic flow was defined as 1 and the label for normal traffic flow was defined as 0. The traffic flow anomaly detection based on the PSO-Huber-Ridge model is shown in Figure 5. It can be seen from Figure 5 that the proposed model detects the abnormal traffic flow well in each time period.
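The sliding-window detection rule can be sketched as follows. The exact threshold formula is not recoverable from the text, so this example assumes a common choice, the window mean plus `c` standard deviations with `c = 3`; the function name `detect_anomalies`, the parameter `c`, and the use of absolute forecast errors as input are all assumptions.

```python
import math


def detect_anomalies(errors, window=10, c=3.0):
    """Label each error 1 if it exceeds mean + c * std of the preceding window."""
    labels = [0] * len(errors)          # the first `window` points have no history
    for t in range(window, len(errors)):
        hist = errors[t - window:t]
        mean = sum(hist) / window
        var = sum((e - mean) ** 2 for e in hist) / window
        threshold = mean + c * math.sqrt(var)   # assumed form of the threshold
        if errors[t] > threshold:
            labels[t] = 1
    return labels


if __name__ == "__main__":
    # Absolute forecast errors: ten ordinary values, one ordinary, one very large.
    errors = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.5, 20.0]
    print(detect_anomalies(errors))
```

Because the threshold is recomputed from the most recent errors, it adapts to periods in which the forecast is systematically less accurate, such as rush hours.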
6. Conclusions
To solve the problem of heavy noise in traffic flow data, a traffic flow anomaly detection method based on the PSO-Huber-Ridge model is proposed. The strong robustness of the Huber function enables it to effectively reduce the influence of noise in traffic flow data on model training. The addition of the regular term of the ridge regression in the objective function can reduce the risk of model overfitting. The sum of RMSE and MAE based on 10-fold cross-validation is constructed as the fitness function to improve the generalization ability of the model. The optimal model parameters can be obtained through the particle swarm optimization algorithm so as to improve the model performance. The experimental results show that the PSO-Huber-Ridge model has better forecast performance than the STA-Huber-Ridge, GA-Huber-Ridge, and GWO-Huber-Ridge models. The traffic flow anomaly detection is performed using the traffic flow forecasted by the PSO-Huber-Ridge model. A large error between the forecasted and actual traffic flow at a certain time point indicates that the pattern of traffic flow at that time point differs from the historical pattern and may cause traffic congestion. The anomaly detection is performed on the traffic flow using a sliding-window threshold on the forecast error. The experimental results verify the validity of the proposed traffic flow anomaly detection model.
The information contained in the traffic flow is complex. The PSO-Huber-Ridge model is limited to exploring the linear information in the traffic flow. The nonlinear information needs further analysis and exploration. When extracting statistical features in feature engineering, the optimal sliding window is determined by exhaustive search. Its disadvantage is that it takes a long time and is not easy to apply. Using an adaptive method to extract features would greatly reduce the time of feature engineering. The Huber loss function reduces the negative impact of data noise on model training by reducing the penalty for large errors. Combining the Huber function with an outlier detection method in data preprocessing could further improve the robustness of the model. Using adaptive feature extraction to mine linear and nonlinear information while improving model robustness is the next step.
Data Availability
The data used to support the findings of this study are currently under embargo, while the research findings are commercialized. Requests for data, 6/12 months after publication of this article, will be considered by the corresponding author.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Authors' Contributions
All authors contributed equally to this work.
Acknowledgments
This research was funded by the National Natural Science Foundation of China (Grant no. 61403046), the Natural Science Foundation of Hunan Province, China (Grant no. 2019JJ40304), Changsha University of Science and Technology “The Double First Class University Plan” International Cooperation and Development Project in Scientific Research in 2018 (Grant no. 2018IC14), the Research Foundation of Education Bureau of Hunan Province (Grant no. 19K007), Hunan Provincial Department of Transportation 2018 Science and Technology Progress and Innovation Plan Project (Grant no. 201843), the Key Laboratory of Renewable Energy Electric-Technology of Hunan Province, the Key Laboratory of Efficient and Clean Energy Utilization of Hunan Province, Innovative Team of Key Technologies of Energy Conservation, Emission Reduction and Intelligent Control for Power-Generating Equipment and System, CSUST, Hubei Superior and Distinctive Discipline Group of Mechatronics and Automobiles (Grant no. XKQ2020009), National Training Program of Innovation and Entrepreneurship for Undergraduates (Grant no. 202010536016), Major Fund Project of Technical Innovation in Hubei (Grant no. 2017AAA133), and Hubei Natural Science Foundation Youth Project (Grant no. 2020CFB320).
Z. Zhang, Q. He, H. Tong, J. Gou, and X. Li, “Spatial-temporal traffic flow pattern identification and anomaly detection with dictionary-based compression theory in a large-scale urban network,” Transportation Research Part C: Emerging Technologies, vol. 71, pp. 284–302, 2016.
L. Lin, J. C. Handley, Y. Gu, L. Zhu, X. Wen, and A. W. Sadek, “Quantifying uncertainty in short-term traffic prediction and its application to optimal staffing plan development,” Transportation Research Part C: Emerging Technologies, vol. 92, pp. 323–348, 2018.
P. J. Huber, “Robust estimation of a location parameter,” in Breakthroughs in Statistics: Methodology and Distribution, S. Kotz and N. L. Johnson, Eds., pp. 492–518, Springer New York, New York, NY, USA, 1992.
A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, pp. 59–72, 2006.