Abstract

Vehicle exhaust is one of the main sources of carbon emissions. Short-term traffic flow prediction plays an important role in alleviating traffic congestion, optimizing the travel structure, and reducing traffic carbon emissions. Current advanced models for short-term traffic flow prediction are evaluated in this work, with emphasis on their inadequacies. To improve the prediction accuracy and support fine-grained traffic management, an effective self-attention-based hybrid model is proposed to predict the short-term traffic flow. The proposed model includes an encoder-decoder neural network module and a self-attention mechanism module. The self-attention mechanism module is applied as a feature extraction unit in this hybrid model to enhance the capture of key information and to address the loss of key information caused by increasing sequence length in traditional models. The dataset of a Guangdong freeway toll station is used for the experimental testing. Compared with several baseline models, the proposed model is more suitable for real-time prediction and provides highly accurate results; it also offers better interpretability. The experimental results show that the MAE, RMSE, and MAPE of the proposed model are 3.01, 4.38, and 12.99%, respectively. The new hybrid model gives higher accuracy than the support vector regression (SVR) model, the LSTM neural network with attention (LSTM-attention) model, and the temporal convolutional network (TCN) model. These results show that the proposed model is well suited to short-term traffic flow prediction.

1. Introduction

With the acceleration of urbanization, the number of motor vehicles is increasing rapidly, and traffic congestion is becoming more and more serious. The most significant consequences are the increased travel time caused by traffic congestion and the environmental pollution from vehicle exhaust. Reducing traffic pollution is fully consistent with the national strategy of carbon peaking and carbon neutrality.

In order to optimize traffic distribution and improve traffic efficiency, it is necessary to achieve highly accurate traffic prediction. Current research mainly focuses on the accurate prediction of the traffic flow state and on flexible, efficient adjustment of control strategies. Accurate short-term traffic flow prediction can help decision-makers better understand the traffic flow state and thus formulate more reasonable control strategies. Meanwhile, it can also help drivers better arrange their travel plans and reduce the carbon emissions caused by traffic congestion and frequent braking.

The spatiotemporal features of the traffic flow can be analyzed and predicted by deep neural networks. Accurate traffic flow prediction plays an important role in alleviating the traffic problems brought about by development, improving the efficiency of the traffic system, and strengthening the capacity for proactive prevention and control.

The short-term traffic flow prediction is the prediction of the traffic flow with a time step of fewer than 15 minutes [1]. The short-term traffic flow is characterized as a time series. Traditional time series prediction is a linear fitting method represented by the autoregressive integrated moving average (ARIMA) model [2, 3], which fits nonlinear data poorly [4]. In addition, there are many studies on shallow machine learning methods represented by support vector regression (SVR) [5, 6]. However, SVR is slow in processing high-dimensional data, and the choice of kernel function has a great impact on the results. With the rapid development of deep learning algorithms, the recurrent neural network (RNN) [7], its variant long short-term memory (LSTM) [8–11], and the convolutional neural network (CNN) have also been applied to time series prediction with good results.

However, these models still have some drawbacks: (1) gradient explosion and gradient vanishing for long time series; (2) as the sequence length increases, information is easily lost because of the weak ability to extract key information.

Thus, many scholars have put forward hybrid models that add an attention mechanism to CNN- or RNN-class models [12–16], which can effectively mitigate the above problems through their ability to extract critical information [17]. Therefore, a self-attention-based hybrid model is proposed in this paper. The model adopts a self-attention mechanism to suppress the loss of long-term time series information and effectively improves the prediction accuracy. The results show that the model performs well on the dataset of a freeway toll station in Guangdong.

The major contributions of this paper are summarized as follows:
(1) An effective self-attention-based hybrid model for the short-term traffic flow prediction is proposed. The proposed model gives highly accurate prediction results using modest training data from the freeway toll station.
(2) The proposed hybrid prediction method is capable of extracting and learning the dominant spatiotemporal features and short-term variations of the traffic flow through the encoder-decoder framework.
(3) The effectiveness and efficiency of the proposed model are demonstrated on one real-world short-term traffic prediction case study.

The rest of this paper is organized as follows: Section 2 reviews related work, introducing time series prediction models in the traffic field and their advantages and disadvantages. Section 3 presents the basic principles of the model, i.e., the multihead self-attention mechanism and the structure of the proposed model. Section 4 describes the data sources and the evaluation criteria for the prediction results. Section 5 details the experiment, mainly the hardware equipment and the experimental settings. Section 6 presents the experimental results and a discussion in comparison with other models. Section 7 concludes the paper.

2. Literature Review

In recent years, scholars have carried out extensive application and research on time series models in the field of transportation, which can be roughly divided into three categories.

The first category is the traditional linear time series prediction model, including ARIMA and its improved variants. Most of these improved models focus on enhancing the ability of ARIMA to handle complex data. Shahriari et al. proposed a random dataset generation algorithm that accounts for randomness and combined it with the ARIMA model to effectively improve accuracy [18]. Van Der Voort et al. integrated a Kohonen self-organizing map with the ARIMA model [19]; after the Kohonen map's initial classification, the nonlinear data processing ability of the model was significantly enhanced and the prediction accuracy rose remarkably. In consideration of the dynamic and nonlinear characteristics of traffic flow changes, Shen et al. combined empirical mode decomposition with the ARIMA model and enhanced its ability to process fluctuating data [20]. Chao et al. proposed a time series analysis method based on the ARIMA model structure, which could meet the requirements of dynamic traffic flow prediction to a certain extent and improved the prediction accuracy by reducing the value of the forgetting factor to mitigate the problem of excessive training data [21]. In addition, some scholars improved the ARIMA model by considering the seasonal [22] and spatial [23] characteristics of the traffic flow and the heteroscedasticity of traffic flow time series [24].

The second category is the shallow machine learning model. Cheng et al. presented a fused multisource SVR model, which used the maximum Lyapunov exponent and Bayesian theory to fuse various traffic flow features and then applied SVR for regression prediction, obtaining good results [25]. Hong et al. combined a genetic algorithm-simulated annealing algorithm with SVR to alleviate the tendency to fall into local optima during training [6]. Luo et al. suggested an SVR model modified with a particle swarm optimization algorithm and a genetic algorithm and calculated parameters through the least squares method to optimize the training speed and model accuracy [26]. Feng et al. proposed an adaptive multikernel SVR short-term traffic flow prediction model that considers spatiotemporal correlation and proved that the model could better handle complicated traffic flow data [27]. Wei and Liu studied an adaptive SVR model integrated with a heuristic algorithm, which greatly improved computational efficiency [28].

The third category is the deep learning model, including RNN-based (GRU, LSTM) and CNN-based models. Deep learning models have shown promising performance in many research areas due to their ability to model complex nonlinear relationships [29]. Different types of models for short-term traffic prediction have been developed based on their ability to capture the spatiotemporal correlations of traffic flow.

One of the commonly used deep learning models is the CNN-based model. The CNN model is often applied in traffic prediction because it can capture the local dependencies of traffic data and is less sensitive to noise. Chen et al. set up a causal convolutional neural network model based on dilated convolution [30]; experimental results showed that the model could effectively learn seasonal and holiday effects in time series. Zhao et al. built a time series classification model based on CNN, exploiting its automatic feature generation and extraction, and solved the problem of feature selection dependence in traditional classification methods [31]. However, CNNs cannot capture the interdependence between road connections in a specific complex road network structure [8].

Another commonly used deep learning model is the RNN-based model. RNNs have been widely used in time series tasks. The LSTM model enlarges the memory of the RNN and is suitable for learning from time series data. Cui et al. proposed a deep SBU-LSTM model to predict network-wide traffic speed, which considers forward and backward dependencies in time series data [32]. Cai et al. designed an LSTM prediction model with an improved loss function for traffic flow data under the influence of non-Gaussian noise, which enhanced the robustness of the model [33]. Ma et al. [34] proposed an improved LSTM model based on LSTM and bi-LSTM; this model integrates bi-LSTM into the prediction model, and the results show that the LSTM_BILSTM hybrid prediction model has good accuracy and stability in multistep prediction. Ma et al. [8] proposed a hybrid CapsNet + NLSTM model to overcome the inability of traditional models to handle complex spatial relationships when predicting traffic conditions: CapsNet is used to extract the comprehensive spatial characteristics of the road network, and the NLSTM model is used to capture the hierarchical time dependence of traffic data. The experimental results show that the model has good prediction ability on complex road networks. Wei et al. [35] set up an autoencoder LSTM (AE-LSTM) model that considers the spatial relationships of nodes and applied the autoencoder to extract the characteristics of upstream and downstream traffic, thus effectively improving the accuracy of the model. Yang et al. added an attention mechanism module to the LSTM model to deal with the problem of gradient disappearance in long-term traffic flow series [36]. Lea et al. proposed TCN, a residual-structure neural network model based on full convolution, to solve the problems of gradient stability and the inability of RNNs to process information in parallel [37].

Although CNN-based and LSTM-based models are powerful in capturing the spatiotemporal relations of traffic flow, their limitations become obvious when there are many study objects. In such cases, deep graph neural networks (GNNs) are superior at extracting features from transportation networks. Research mainly focuses on graph convolution networks [16, 38–40] and spectral convolution networks [41]. The graph convolutional neural network builds the traffic graph from the physical network topology and defines a graph convolution to capture spatial features; an LSTM regression neural network is then established on the traffic graph convolution (TGC) to predict the spatiotemporal traffic speed. Alternatively, spectral graph convolution and localized spectral graph convolution methods can capture the spatiotemporal characteristics of the network and usually have high prediction accuracy.

Based on this application analysis, deep learning models often obtain highly accurate predictions and are now the most advanced and most widely used methods in the field of traffic flow prediction. However, these models also have drawbacks that make them less preferable for certain short-term traffic prediction applications. Firstly, a large amount of data is usually required. Secondly, training is time-consuming, and a model may take several days or weeks to train. Finally, the results lack interpretability.

In order to obtain accurate and efficient short-term traffic predictions, this study proposes an effective self-attention-based hybrid model framework, in which a multihead attention mechanism is extended to handle long-distance temporal dependencies in traffic data and to extract dominant information from collinear and correlated data. One real-world traffic state prediction case study at the Humen Freeway Toll Station is developed to demonstrate the effectiveness and efficiency of the proposed hybrid model for short-term traffic prediction applications. The proposed model is well suited to real-time prediction applications and produces accurate results compared with the baseline models. Moreover, the proposed model has better interpretability.

3. Methodology

3.1. Multihead Self-Attention Mechanism Module

The attention mechanism was originally used to solve the problem of long-term sequence dependence in machine translation, where the performance of machine translation decreased significantly as the sentence length increased [16]. It has since been widely used in processing various kinds of time series data. In this work, the attention mechanism is used to extract key content from a large amount of information, focusing on the crucial parts and ignoring unimportant ones.

The self-attention mechanism is a variant of the attention mechanism that reduces the dependence on external information and is better at capturing the internal relevance of data or features. First, the weight coefficients are calculated from the query (Q) and key (K), and the value (V) is then weighted and summed according to these coefficients, as shown in equation (1).

The attention model can simultaneously compute the attention function on a set of queries packed into a matrix Q; the keys and values are likewise packed into the matrices K and V. Attention(Q, K, V) denotes the matrix form of the output:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad (1)$$

where $n$ is the length of the input sequence. Three sets of trainable parameter matrices $W^{Q}$, $W^{K}$, and $W^{V}$ weight the input data sequence $X$, and Q, K, and V are obtained by the multiplications $Q = XW^{Q}$, $K = XW^{K}$, and $V = XW^{V}$. $d_k$ is the dimension of $K$, and $QK^{\top}$ denotes the matrix multiplication of the queries and the keys.

Considering that a single self-attention mechanism cannot comprehensively capture the importance of the input sequence information, the self-attention mechanism usually adopts a multihead design to identify information from multiple subspaces at different positions simultaneously. The multihead attention mechanism connects several self-attention modules in parallel; the output of each single self-attention module is computed and then spliced, as shown in the following:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_m)W^{O},$$
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}),$$

where $W^{O}$ is the multihead attention weight matrix, $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the weight matrices of the $i$th attention head for Q, K, and V, $m$ is the number of attention heads, and the Concat function joins the output values calculated by each attention head.
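For concreteness, the following is a minimal TensorFlow 2.x sketch of the scaled dot-product attention in equation (1) and the multihead splicing described above; the class and variable names are illustrative rather than taken from the authors' implementation.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as in equation (1)."""
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)  # QK^T / sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)                   # weight coefficients
    return tf.matmul(weights, v)                               # weighted sum of V

class MultiHeadSelfAttention(tf.keras.layers.Layer):
    """Parallel self-attention heads whose outputs are spliced by Concat and W^O."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.depth = d_model // num_heads
        # trainable projections W^Q, W^K, W^V and the output matrix W^O
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.wo = tf.keras.layers.Dense(d_model)

    def _split_heads(self, x, batch):
        x = tf.reshape(x, (batch, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])  # (batch, heads, seq, depth)

    def call(self, x):
        batch = tf.shape(x)[0]
        q = self._split_heads(self.wq(x), batch)   # Q = X W^Q, split into m heads
        k = self._split_heads(self.wk(x), batch)   # K = X W^K
        v = self._split_heads(self.wv(x), batch)   # V = X W^V
        heads = scaled_dot_product_attention(q, k, v)
        heads = tf.transpose(heads, perm=[0, 2, 1, 3])
        concat = tf.reshape(heads, (batch, -1, self.num_heads * self.depth))
        return self.wo(concat)                     # Concat(head_1..head_m) W^O
```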

3.2. The Model Structure

The encoder-decoder framework was first applied to natural language tasks [42]. The basic idea is to use one recurrent neural network to read the input sentence and compress the text information into a fixed-dimension code, and another recurrent neural network to read the code and convert it into a sentence in the target language. These two recurrent neural networks are called the encoder and the decoder, respectively.

The structure of the proposed hybrid model is illustrated in Figure 1. The model first applies positional encoding to the input data and the label data and feeds the encoded data into the encoder and decoder modules, respectively. The decoder module outputs the final results after flattening by a fully connected layer.

The encoding module defines the rules for encoding the sequence positions: even-numbered dimensions are sine-coded, and odd-numbered dimensions are cosine-coded. The coding formulas are as follows:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),$$

where $pos$ is the position index of each data point in the sequence, $i$ indexes the dimension corresponding to this position, and $d_{\mathrm{model}}$ is the dimension of the hidden layer. This encoding format makes it convenient for the model to capture the relative position relationships between tokens, and it also works well in time series prediction.
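As a sketch, this sine/cosine coding can be generated as follows (a minimal NumPy version; the function name is illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sine coding for even dimensions, cosine coding for odd dimensions."""
    pos = np.arange(seq_len)[:, None]          # position index `pos`
    i = np.arange(d_model)[None, :]            # dimension index
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])       # PE(pos, 2i)   = sin(...)
    pe[:, 1::2] = np.cos(angle[:, 1::2])       # PE(pos, 2i+1) = cos(...)
    return pe
```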

The encoder and decoder of the model are each composed of six identical layers. Each encoder layer contains two sublayers: a multihead self-attention mechanism and a feedforward neural network. Each decoder layer consists of three sublayers: a masked multihead self-attention mechanism, an encoder-decoder multihead attention mechanism, and a feedforward neural network.

The encoder is responsible for encoding the input traffic flow feature sequence and mapping it to an intermediate vector containing the input feature information. The decoder is responsible for decoding the intermediate vector output by the encoder into the output sequence, which involves the masked multihead self-attention.

The label data are the input to the masked multihead self-attention module, and the dependencies among the label data can be attained through the multihead self-attention mechanism [43]. These dependencies are input to the encoder-decoder multihead attention module so that the prediction model can comprehensively learn the dependencies among the input feature vectors, as well as between the label data and the features.

The feedforward neural network is composed of two layers of neurons. The activation function of the first layer is ReLU, and that of the second layer is the identity. ReLU returns its input when the input is greater than 0 and returns 0 when the input is 0 or less; the identity activation passes the node input through unchanged as the node output. The expressions of the two activation functions are as follows:

$$\mathrm{ReLU}(x) = \max(0, x),$$
$$f(x) = x.$$
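A minimal TensorFlow 2.x sketch of this two-layer feedforward sublayer might look as follows; the layer widths are assumptions, since Table 1 is not reproduced here:

```python
import tensorflow as tf

def feed_forward(d_model, d_ff):
    """Two-layer position-wise network: ReLU(x) = max(0, x), then identity f(x) = x."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(d_ff, activation="relu"),   # first layer: ReLU
        tf.keras.layers.Dense(d_model, activation=None),  # second layer: identity
    ])
```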

4. Data Sources and Evaluation Criteria

4.1. Data Source

In this paper, the flow data are taken from a freeway toll station in Guangdong Province, and the statistical period is from 8:00 am on August 1, 2018, to 8:00 am on September 1, 2018. The data were counted at 15-minute intervals. A total of 2976 records were collected, as shown in Figure 2. Intuitively, the data have a certain periodicity, and a deep learning algorithm can learn the feature changes in such sequential data well [44].

In terms of data processing, the series was first sliced with a window of four steps to obtain a 2972 × 4 tensor as input data, with each step covering one 15-minute interval. The input data are divided into a training set of 2600 samples and a test set of 372 samples.
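The slicing step can be sketched as follows, assuming `flow` is a one-dimensional NumPy array holding the 2976-point 15-minute series; the function and variable names are hypothetical:

```python
import numpy as np

def make_windows(flow, window=4):
    """Slice the series with a 4-step window: 2976 points -> a 2972 x 4 input tensor."""
    X = np.stack([flow[i:i + window] for i in range(len(flow) - window)])
    y = flow[window:]          # flow of the next 15-minute interval as the label
    return X, y

# split matching the paper: 2600 training samples, 372 test samples
X, y = make_windows(flow)                      # `flow`: the toll-station series
X_train, y_train = X[:2600], y[:2600]
X_test, y_test = X[2600:], y[2600:]
```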

4.2. Evaluation

The evaluation criteria in this paper are mean absolute error (MAE), root-mean-square error (RMSE), and mean absolute percentage error (MAPE).

The MAE is calculated as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|.$$

The RMSE is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}.$$

The MAPE is calculated according to the following equation:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|,$$

where $n$ is the total number of samples, $\hat{y}_i$ is the predicted result, and $y_i$ is the true value.
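These three criteria translate directly into code, for example, with NumPy:

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true)) * 100.0  # in percent
```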

5. Experiment

To ensure feasibility and repeatability, the equipment and parameters of the experiment are described here. The TensorFlow 2.2 deep learning framework in an Anaconda environment on Windows 10 is used in this work. The CPU is an Intel(R) i5-8300H 4-core processor at 2.30 GHz with 8 GB of memory; the GPU is an NVIDIA GTX 1050 Ti with 4 GB of memory; the CUDA version is 10.1 and the cuDNN version is 7.6.

5.1. Batch Size Setting

Batch size is the number of samples used in a single training step, and it affects both the optimization quality and speed of the model as well as memory usage. Since an unsuitable batch size can lead to overfitting, batch sizes of 64, 128, 256, and 512 were tested while training the model. Preliminary tests showed that a batch size of 256 guarantees training speed while achieving relatively low training and testing errors.

5.2. Optimizer

The optimizer updates the network parameters that affect model training and output so that they approach or reach optimal values, thereby minimizing (or maximizing) the loss function. Comparing the SGD and Adam optimizers, the training loss curves obtained are shown in Figure 3. It can be seen from the figure that the SGD optimizer converges faster than the Adam optimizer, while the final losses are similar.

The final experimental hyperparameter settings are shown in Table 1.

5.2.1. Description of Hyperparameters

The learning rate is the step size of gradient descent, and its function is to scale the gradient. The learning rate decay scales the learning rate over time; its function is to prevent overfitting and bring the prediction result close to the optimal solution. The number of iterations is the number of times the network updates its parameters.

5.3. Loss Function

In this paper, the MSE loss function is adopted; its gradient changes with the error, which makes it easier for the model to approach the optimal solution.
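As an illustrative sketch, compiling and training under the choices of Sections 5.1–5.3 (batch size 256, SGD optimizer, MSE loss) could look as follows; `model`, the learning rate, and the epoch count are placeholders:

```python
import tensorflow as tf

# `model` stands for the encoder-decoder network of Section 3; the learning
# rate and epoch count below are placeholders, not the values of Table 1.
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),  # SGD, per Section 5.2
    loss="mse",                                             # MSE loss, Section 5.3
)
history = model.fit(X_train, y_train,
                    batch_size=256,                         # chosen in Section 5.1
                    epochs=100,
                    validation_data=(X_test, y_test))
```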

6. Results and Discussion

To verify the effectiveness of the proposed model in predicting traffic flow, this paper selects three baseline models for comparison. The performance of the proposed model (STGGA) is compared with the SVR, LSTM-attention, and TCN models.

SVR: SVR is a popular machine learning approach for the short-term traffic flow prediction [27, 45]. Consequently, a basic SVR model with a linear kernel is selected as one of our baselines.

LSTM-attention: LSTM-attention is an LSTM model with an attention module [36, 46]. The attention module is added to the LSTM model to deal with gradient disappearance in long-term traffic flow series; the LSTM-attention baseline contains two LSTM layers connected to the attention mechanism. This model is selected as one of our baselines.

TCN: TCN is a residual-structure neural network model based on full convolution that solves the problems of gradient stability and the inability of RNNs to process information in parallel [37, 47]; the TCN baseline contains two one-dimensional convolution layers and one one-dimensional pooling layer. This model is selected as one of our baselines.

The evaluation indexes of the prediction results for each model are listed in Table 2. As listed in Table 2, the MAE, RMSE, and MAPE errors of the proposed model are all smaller than those of the comparison models. Since traffic flow observations vary from a few hundred vehicles per hour off peak to several thousand vehicles during peak periods, a MAPE in the range of 10–20% is generally considered acceptable in most studies on flow prediction [10, 48, 49].

First, the performance of the proposed model is compared with the SVR model. In the literature [50], the MAPE of SVR prediction was 15.71%; the SVR result in this paper is 21.17%. Different from the dataset used in our work, the dataset in that study was the traffic data of an intersection in Jinan, whereas the data used in this paper come from a freeway toll station and are more volatile, so the MAPE value in this paper differs from the value reported in the literature. Compared with the SVR model, the RMSE and MAE errors of the proposed model are reduced by 0.9 and 0.89, respectively, and the MAPE error is reduced by 8.18%. Because SVR uses a kernel function to map a large number of uncertain traffic flow data to a high-dimensional space, it cannot make full use of the spatial and periodic characteristics for prediction, which limits its predictive performance.

Next, the performance of the proposed model is compared with the TCN method. In the literature [51], the TCN-predicted MAPE results varied: 17.72%, 21.87%, 15.4%, and 19.45%. The TCN result of this paper falls in this normal range. Compared with the TCN model, the RMSE and MAE errors of the proposed model are reduced by 0.84 and 0.08, respectively, and the MAPE error of the proposed model is reduced by 4.95%, since the limited receptive field size of convolutional neural networks hampers long-term sequence problems.

Furthermore, the proposed model is compared with the LSTM-attention model to verify the effect of the self-attention mechanism. In previous literature [52], the MAPE predicted by LSTM-attention was 15.79%, while it is 16.08% in our work. Compared with the LSTM-attention model, the RMSE and MAE errors of the proposed model are reduced by 0.42 and 0.36, respectively, and the MAPE error of the proposed model is reduced by 3.09%. The results demonstrate that the proposed model is more accurate than LSTM-attention.

The box plot of the model errors is shown in Figure 4. The median error of the proposed model is 3.17, while the median errors of the baseline models are 6.78 for SVR, 5.59 for LSTM-attention, and 6.59 for TCN. From the upper and lower ranges of the prediction error, the error range of the proposed model is smaller and its predictions are more stable. For outliers, 1.5 times the interquartile range (IQR) is used as the judgment standard. The results show that among the 372 test set data points, the prediction outliers of all four models are below 5%, and there is no large deviation in the prediction results.
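For reference, the 1.5 × IQR outlier rule can be sketched as follows:

```python
import numpy as np

def iqr_outliers(errors):
    """Flag errors outside q1 - 1.5*IQR or q3 + 1.5*IQR (the box-plot rule)."""
    q1, q3 = np.percentile(errors, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return errors[(errors < lo) | (errors > hi)]
```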

To illustrate the fitting results clearly, the fitted curves of these models are shown in Figure 5. According to the fitted curve in Figure 5(a), the prediction results of the proposed model are much closer to the real values. The prediction performance of the SVR model is the worst of all: it can be seen from Figure 5(b) that its predictions lag the real values to a certain extent. Compared with the SVR model, the LSTM-attention and TCN models have relatively higher accuracy but still show a certain lag; as shown in Figure 5(c), this lag may be caused by data autocorrelation. As illustrated in Figure 5(d), the proposed model deals with this lag more effectively. In contrast to the baselines, the model proposed in our work can screen and extract critical information over a long period, so better predictive effects are achieved.

The main contributions of this work are as follows. The prediction of the proposed model is better for the peak and low periods of traffic volume than that of the baseline models, indicating that its prediction performance under extreme conditions is quite good. Moreover, the model adapts well to traffic flow fluctuations caused by spatial relationships, weather, vacations, and other factors [53, 54]; in the future, these factors can be further refined to improve the prediction accuracy of the model. Adding attention mechanisms is very effective for improving hybrid model performance [36, 55, 56], but their interpretability is still questioned. The interpretability of the attention mechanism plays an important role in the proposed model, as discussed in the literature [57].

7. Conclusions

Reducing traffic carbon emissions is very important for the living environment, and vehicle exhaust is one of the main sources of carbon emissions. The proposed model for short-term traffic flow prediction is very effective at improving the prediction accuracy.

In this work, aiming at gradient disappearance or explosion and the dissipation of information in long-term series in short-term traffic flow prediction, a novel self-attention-based hybrid model is proposed to enhance the extraction of key information. The experimental results show that the proposed model achieves a more accurate prediction in the presence of multiple traffic flow fluctuations. Compared with several conventional time series prediction models, the proposed model has lower error and is much closer to the real values.

Certainly, the model proposed in this paper has some limitations, and further research should pay close attention to the following: (1) influencing factors such as weather and spatial characteristics should be considered to increase the generalization ability and further improve the prediction accuracy of the model; (2) the interpretability of deep learning models is insufficient and needs further strengthening; (3) the performance of the model on complex data with large sample sizes remains to be verified.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Additional Points

Highlights. (i) A novel self-attention-based hybrid model for short-term traffic flow prediction is proposed. (ii) The encoder-decoder framework is developed to extract dominant spatiotemporal features and short-term variations. (iii) Three well-known benchmarks are considered to evaluate the models. (iv) The proposed model showed better results than SVR, LSTM-attention, and TCN in the real case.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Zhihong Li and Xiaoyu Wang conceptualized and designed the study. Xiaoyu Wang, Yang Dong, and Kairan Yang collected data. Zhihong Li and Xiaoyu Wang analyzed and interpreted data. Zhihong Li prepared draft manuscript. All authors reviewed the results and approved the final version of the manuscript.

Acknowledgments

This study was supported by the National Social Science Foundation (21FGLB014).