#### Abstract

Short-term traffic flow prediction is an important component of intelligent transportation systems and supports both trip planning and traffic management. Although existing methods have been applied to traffic flow prediction, they cannot capture the complex multiple features of traffic flow, which leads to unsatisfactory short-term prediction results. In this paper, a multifeature fusion model based on deep learning methods is proposed. It consists of three modules: a CNN-Bidirectional GRU module with an attention mechanism (CNN-BiGRU-attention) and two Bidirectional GRU modules with an attention mechanism (BiGRU-attention). The CNN-BiGRU-attention module extracts local trend features and long-term dependent features of the traffic flow, and the two BiGRU-attention modules extract its daily and weekly periodic features. A feature fusion layer then fuses the features extracted by each module. The number of neurons in the model, the loss function, and other settings such as the optimization algorithm are discussed and set through simulation experiments. Finally, the multifeature fusion model is trained and tested on training and test sets built from data collected in the field. The results indicate that the proposed model predicts traffic flow well and has good robustness. Furthermore, the model is compared against baseline models on the same dataset, and the experimental results show that it has superior predictive performance.

#### 1. Introduction

With the development of urbanization, the urban population and the number of motor vehicles in cities are increasing. Travel demand, especially in the morning and evening rush hours, often saturates road utilization, resulting in urban “traffic diseases.” To address these problems, intelligent transportation systems (ITS) have been developed [1–4]. With the development of big data technology, ITS has started to evolve into data-driven ITS [5]. Short-term traffic flow prediction is one of the core components of ITS; it provides the basis for traffic management, traffic control, and traffic guidance and supports travelers’ travel decisions. However, short-term traffic flow has complex stochastic and nonlinear characteristics, which brings great challenges to traffic flow prediction. How to accurately predict short-term traffic flow has therefore been a hot topic for scholars in the field of traffic engineering.

Early studies on short-term traffic flow forecasting mainly proposed three kinds of methods: parametric methods, nonparametric methods, and combined methods, which include both parametric and nonparametric methods. Parametric methods include the autoregressive integrated moving average (ARIMA) model and its variants [6, 7]. Nonparametric methods include K-nearest neighbor (KNN) nonparametric regression [8], Kalman filters (KF) [9], support vector regression (SVR) [10], and artificial neural networks (ANN) [11]. Combined methods are a combination of two or more methods [12–14].

The development of data-driven ITS, especially the widespread use of traffic information collection technologies such as induction detectors, geomagnetic detectors, radio frequency identification technology, radar detection, video detection, and floating vehicle detection [15–19], provides a large amount of data for traffic flow prediction. Parametric and nonparametric methods have difficulty dealing with such big traffic data. Therefore, deep learning methods [20–22], which have powerful data feature mining and nonlinear data fitting capabilities, have been applied to traffic flow prediction and have achieved some results [23–26]. However, existing deep learning-based methods for traffic flow prediction mainly consider the spatial and temporal correlation of traffic flow, without fully considering complex characteristics such as daily and weekly periodicity. In addition, although some combined deep learning methods use several different single models to extract multiple features of traffic flow, such as spatiotemporal correlation and periodicity, the spatiotemporal correlation and periodicity of traffic flow in fact form a whole and should be considered jointly in the prediction model. Based on this, this paper designs a multifeature fusion model based on deep learning methods that considers the periodic features of traffic flow for traffic flow prediction. The main contributions are summarized as follows:

(1) A fusion feature model considering the periodic features of traffic flow is proposed, namely, the multifeature fusion model. In the model, the CNN-BiGRU module is designed, which treats the spatiotemporal features of traffic flow as a whole, where 1DCNN and BiGRU are used to extract the local trend features and the long-term temporal dependency features of traffic flow, respectively.

(2) In the multifeature fusion model, two two-layer BiGRU modules are designed to extract the daily and weekly periodicity features of traffic flow, respectively.

(3) In order to improve the prediction performance of the multifeature fusion model, an attention mechanism is designed for the CNN-BiGRU and the two-layer BiGRU modules so that each module adaptively attends to the importance of the temporal and periodic features at different times.

(4) The multifeature fusion model is validated by simulation on traffic flow data collected in the field, and the experimental results show that the prediction performance of the multifeature fusion model is better than that of the baseline models.

#### 2. Literature Review

In general, existing traffic flow prediction methods can be classified into parametric methods, nonparametric methods, deep learning methods, and combined methods.

##### 2.1. Parametric Methods

The parametric method is a modelling approach where the structure of the model is predetermined based on theory, and the parameters of the model are calibrated with real traffic flow data. Levin and Tsao [27] applied a time series analysis method to predict the morning peak period traffic on a motorway and found that the ARIMA (0,1,1) model was statistically significant. Zhang et al. [28] developed a hybrid model, where spectral analysis techniques are used to extract the daily and weekly periodicity of traffic flows, and the ARIMA model is used to extract the general time trend characteristics of traffic flows. Subsequently, a number of ARIMA variants were applied in traffic flow prediction. For instance, Kohonen self-organizing ARIMA, an autoregressive moving average model with seasonality, and a spatiotemporal autoregressive moving average model were used for traffic flow forecasting and achieved good results [29–31].

##### 2.2. Nonparametric Methods

Due to the strong randomness and nonlinearity of the state changes in traffic flow, the traffic flow prediction results using parametric methods have a certain degree of deviation from the actual traffic flow. Therefore, some nonparametric methods have gradually replaced parametric methods in traffic flow prediction. Specifically, Ryu et al. [32] proposed a traffic flow prediction model that considers the spatiotemporal information associated with the predicted road section. The spatiotemporal information with the highest correlation to the predicted road section is first selected using a greedy algorithm, and then the traffic flow is predicted using KNN. Yan and Lv [33] proposed a hybrid classification and regression tree k-nearest neighbor model to predict short-term taxi demand. Okutani and Stephanedes [34] proposed two prediction models based on Kalman filter theory to predict traffic flow on streets within Nagoya. Guo et al. [35] proposed a hierarchical Kalman filter-based autoregressive moving average and generalized autoregressive conditional heteroskedasticity model for traffic flow speed prediction. Hu et al. [36] proposed a hybrid model to forecast the short-term traffic flow based on particle swarm optimization (PSO) and support vector regression (SVR), in which PSO is used to find the optimal parameters of the SVR model. Lu and Zhou [37] proposed a Kalman filter traffic flow prediction model that takes into account structural deviations, where a polynomial is used to describe the evolutionary trend of structural deviations in traffic flow, and a Kalman filter model is used to describe the historical trend of traffic flow. Jiang et al. [38] proposed a support vector machine model with radial basis functions as kernel functions to predict traffic flow speed, and the experimental results showed that the prediction accuracy of the model was better than that of the traditional model. 
Wang and Shi [39] proposed a chaotic wavelet analysis-support vector machine model (C-WSVM), and the results showed that the C-WSVM model has better prediction performance and practicality. Feng et al. [40] proposed a new short-term traffic flow prediction model based on adaptive multicore support vector machine with spatiotemporal correlation. Wang et al. [41] proposed a combined support vector machine model to forecast short-term metro ridership, which includes a vector machine overall online model (SVMOOL) and a vector machine partial online model (SVMPOL). The SVMOOL model obtains the periodic characteristics of passenger flow, and SVMPOL obtains the nonlinear characteristics of traffic flow.

ANN [42] has been regarded as another popular method for traffic flow prediction due to its ability to handle large amounts of multidimensional data, the flexibility of its model structure, and its learning and generalization capabilities. ANN combined with the error backpropagation algorithm, i.e., the backpropagation neural network (BPNN) [43], was gradually applied to traffic flow prediction. Subsequently, a model incorporating wavelet analysis and a BP neural network [44] was applied to short-term traffic flow prediction, followed by a BPNN optimized by an adaptive differential evolution algorithm [45]. All these methods have achieved good results.

##### 2.3. Deep Learning Methods

With the development of data collection and processing technology, traffic big data has emerged. However, the traditional nonparametric methods have difficulties in processing multisource data [46], and short-term traffic flow prediction methods have started to shift from nonparametric methods to deep learning methods [24, 26, 47, 48]. For instance, Huang et al. [49] designed a combined prediction model consisting of a deep belief network (DBN) with unsupervised learning at the bottom and a multitask learning (MTL) layer for supervised prediction at the top, in which the multitask learning layer can leverage the weight sharing in the DBN to provide better prediction results. Lv et al. [50] proposed a stacked autoencoder model that is trained in a greedy layer-wise manner to learn traffic flow features.

One of the difficulties in short-term traffic flow prediction is to capture the spatiotemporal correlation between traffic flow data. In terms of temporal characteristics, recurrent neural networks (RNNs) are a deep learning structure mainly applied to process time series data. RNNs have a temporal memory function and can be applied to the prediction of correlated time series data [51]. However, traditional RNNs cannot capture the long-term dependence among traffic flow data due to the vanishing and exploding gradient problems, so Ma et al. [52] applied long short-term memory (LSTM) to traffic flow prediction. Subsequently, Zhao et al. [53] proposed a two-dimensional LSTM network consisting of many memory units that considers spatiotemporal correlations, and the experimental results showed that the proposed network had better prediction performance than traditional prediction methods. Wang et al. [54] proposed a deep learning framework based on paths. In the framework, the road network is divided into critical paths, and then a bidirectional long short-term memory network is used to model the traffic flow of each critical path. Cui et al. [55] proposed a stacked bidirectional and unidirectional LSTM network structure for predicting road network traffic with missing values. Zheng and Huang [56] proposed a traffic flow prediction model based on LSTM, and experimental results showed that the proposed model outperformed the classical model. GRU, which is a well-known variant of the LSTM, has also been applied to traffic flow prediction [57].

In terms of spatial properties, CNN is also a typical structure in deep learning. It is a feedforward neural network for solving problems with grid-like structured data, which not only can reduce the complexity of the model while accurately extracting data features, but also can better extract spatial correlations between traffic flow data [58]. Zhang et al. [59] proposed a CNN model for short-term traffic flow prediction, where the optimal input to the model is a spatial-temporal feature selection algorithm, and experimental results showed that the model outperformed the baseline model. An et al. [60] used a fuzzy convolutional neural network based traffic flow prediction method, which for the first time applied CNN to uncertain traffic incident information and used a fuzzy approach to generalize traffic incident characteristics. Tian et al. [61] proposed a hybrid lane occupancy prediction model called 2LayersCapsNet, which combines an improved capsule network and CNN.

##### 2.4. Combined Methods

Combined models are useful when a single specified model fails to exhibit good predictive performance, which is a common situation in complex data forecasting [46]. It is difficult for a single forecasting model to capture both the strong complexity and the strong variability of traffic flow, so combined prediction models are necessary. Specifically, to exploit the good linear fitting capability of ARIMA models and the powerful nonlinear mapping capability of artificial neural network models, Li et al. [62] proposed a combined ARIMA and radial basis function artificial neural network model to predict short-term traffic flows. Yao et al. [63] proposed a linear hybrid method and a nonlinear hybrid method to predict short-term traffic flows and classified the traffic flow data into similar, unstable, and irregular components. Among them, autoregressive integrated moving average and generalized autoregressive conditional heteroskedasticity models were used to predict the similar and fluctuating components, and Markov models with state membership and wavelet neural networks were used to predict the irregular component. Li et al. [64] analyzed the correlation between the predicted and historical time windows based on the grey correlation coefficient method and used the rank index method to establish a combined prediction model based on ARIMA, BPNN, and SVR. A neural network training algorithm combining exponential smoothing and the Levenberg-Marquardt algorithm was proposed to improve the generalization of the neural networks previously used for short-term traffic prediction [65]. Liu et al. [13] proposed a hybrid forecasting model based on a combination of neural network and KNN methods for short-term traffic prediction. Gu et al. [66] proposed a model incorporating deep learning to predict lane-level speeds. 
In the model, entropy-based grey correlation analysis is first used to select the lanes with the highest correlation with the predicted lanes in order to extract spatial features, and then LSTM and GRU are combined to build a two-layer deep learning framework that extracts temporal features of traffic flow. The experimental results showed that the model outperformed the baseline model in prediction. Ma et al. [67] proposed a novel deep learning-based approach to daily traffic flow prediction incorporating contextual factors. First, a specific CNN is used to extract daytime and intraday traffic flow features; second, the extracted features are used as input to an LSTM to learn the temporal features of the traffic flow; finally, the traffic flow is predicted by combining the contextual information of historical days. Experimental results showed that the robustness and prediction performance of the model outperformed the benchmark model.

With the development of deep learning, especially the proposed and successful application of attention mechanism [68], it has received attention from scholars in the field of traffic, and some results of applying it in combination with CNN or variant RNN (LSTM and GRU) for short-term traffic flow prediction have emerged. For example, Liu et al. [69] proposed a CNN model based on an attention mechanism to predict traffic flow speed, where the input to the model is a three-dimensional data matrix consisting of traffic flow speed, flow rate, and time occupation, and the extraction of spatiotemporal features is done by convolutional units, and the proposed model has better prediction performance when compared with existing models for simulation experiments. Wu et al. [70] proposed a traffic flow prediction model including a data preprocessing module and a traffic flow prediction module, where the data preprocessing module is to repair missing values in the dataset, and the traffic flow prediction module is a model of a combined LSTM deep learning method based on an attention mechanism, and experimental results show that the prediction performance of the model outperforms other deep learning methods (RNN and CNN). Ma et al. [71] proposed a fuzzy logic-based hybrid model based on the complementary advantages of nonparametric and deep learning methods. Firstly, the model uses two submodels, KNN and LSTM, to extract features on the spatiotemporal correlation of traffic flow and the influence of specific contextual factors on traffic flow, and secondly, dynamic weights based on the fusion mechanism are used to optimize the hybrid model, and simulation experiments show that the model has better prediction and robustness than other state-of-the-art models. Ren et al. [72] proposed a combined deep learning prediction (CDLP) model, which consists of two parallel single deep learning models, that is, a CNN-LSTM-attention model and a CNN-GRU-attention model. 
In addition, a dynamic optimal weighting combination algorithm was proposed to combine the outputs of the two single models, and experimental results showed that this model has better prediction performance and robustness than the state-of-the-art prediction models.

In summary, as research on short-term traffic flow prediction continues to grow, combined prediction models have received more and more attention, and in particular, the application of combined deep learning models has achieved greater success. However, most studies either fuse multiple single or combined methods or obtain a fusion model of only simple spatiotemporal characteristics of traffic flow, which cannot reflect the unified whole formed by the spatiotemporal correlation and periodicity of traffic flow. In this paper, we analyze the complex characteristics of traffic flow, including the relationship between spatiotemporal and periodic features, and apply CNN, Bidirectional GRU, and the attention mechanism to build a multifeature fusion model for short-term traffic flow prediction.

#### 3. Method

##### 3.1. CNN

CNN is a deep feedforward neural network, which mainly consists of a convolutional layer, a pooling layer, and a fully connected layer [73]. The convolutional layer is the most important part of the CNN: local features of the input data are obtained by sliding filters over it, and the number of convolutional kernels in the layer equals the number of output features of the layer. Typically, CNN models contain multiple convolutional layers, and the network can generate an excessive number of parameters. To reduce the number of parameters, the pooling layer usually performs a downsampling operation on the output features of the convolutional layer while keeping the overall features unchanged, in order to extract important features and prevent overfitting of the model. The fully connected layer is usually at the end of the CNN, and its main role is to flatten the features obtained by convolution and pooling into a feature vector for classification or regression.
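To make the roles of the convolutional and pooling layers concrete, the following is a minimal NumPy sketch, not the authors' implementation. The two kernels (a first difference and a local average), the input values, and the function names `conv1d` and `max_pool1d` are illustrative assumptions.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1D convolution (cross-correlation, as in most deep learning libraries).

    x: (steps,) input sequence; kernels: (n_filters, kernel_size); bias: (n_filters,).
    Returns an array of shape (steps - kernel_size + 1, n_filters),
    i.e., one output feature per kernel at every window position.
    """
    n_filters, k = kernels.shape
    windows = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])  # (out_steps, k)
    return windows @ kernels.T + bias

def max_pool1d(features, pool_size):
    """Non-overlapping max pooling along the time axis (downsampling)."""
    steps = (features.shape[0] // pool_size) * pool_size  # drop the ragged tail
    return features[:steps].reshape(-1, pool_size, features.shape[1]).max(axis=1)

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])   # made-up sequence
kernels = np.array([[-1.0, 1.0],                 # first difference: local trend
                    [0.5, 0.5]])                 # local average
features = conv1d(x, kernels, np.zeros(2))       # shape (5, 2)
pooled = max_pool1d(features, pool_size=2)       # shape (2, 2)
```

The first kernel responds to the local slope of the sequence, which is the sense in which a 1D convolution extracts local trend features.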

##### 3.2. Bidirectional GRU

In order to address the shortcomings of traditional RNNs, which ignore the long-term dependence of time series, LSTM and GRU have been proposed one after another. GRU and LSTM networks have not only the function of short-term memory, but also the function of long-term memory. In particular, the GRU is a further simplification of the LSTM [74], from the three gating units of the LSTM to two gate structures (update gate and reset gate), which further improves the operational efficiency of the network due to the simplified number of gates. The structure of the GRU unit is shown in Figure 1, where the purple line indicates the update gate, and the red line indicates the reset gate, defined as *z*_{t} and *r*_{t} respectively.

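The role of the update gate in the GRU is to determine the extent to which the hidden layer state *h*_{t-1} is updated to the new hidden layer state *h*_{t}, and the role of the reset gate is to control the extent to which the hidden layer state *h*_{t-1} at moment *t-1* is discarded. Equation (1) represents the computation of each state within each time step in the GRU:

$$
\begin{aligned}
z_t &= \sigma\left(W_z X_t + U_z h_{t-1} + b_z\right),\\
r_t &= \sigma\left(W_r X_t + U_r h_{t-1} + b_r\right),\\
\tilde{h}_t &= \tanh\left(W_h X_t + U_h \left(r_t \circ h_{t-1}\right) + b_h\right),\\
h_t &= \left(1 - z_t\right) \circ h_{t-1} + z_t \circ \tilde{h}_t,
\end{aligned} \tag{1}
$$

where ○ represents the Hadamard product, *σ* is the sigmoid function, $\tilde{h}_t$ is the candidate hidden state, *X*_{t} represents the input at moment *t*, *W*_{z}, *W*_{r}, and *W*_{h} represent the weight matrices associated with the input, *U*_{z}, *U*_{r}, and *U*_{h} represent the weight matrices associated with *h*_{t - 1}, and *b*_{z}, *b*_{r}, and *b*_{h} represent the biases.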
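The gate computations can be sketched in NumPy as a single GRU time step. This is a hedged illustration rather than the paper's code: it assumes the common formulation in which the update gate weights the candidate state and its complement weights the previous state (some libraries use the opposite convention), and the parameter dictionary, the helper `init_params`, and the input values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step; p holds the weight matrices and biases."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    # Candidate state: the reset gate scales how much of h_prev is used.
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # Assumed convention: z close to 1 replaces h_prev with the candidate.
    return (1.0 - z) * h_prev + z * h_cand

def init_params(input_dim, hidden_dim, rng):
    """Small random parameters, for illustration only."""
    p = {}
    for g in "zrh":
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p[f"U_{g}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p[f"b_{g}"] = np.zeros(hidden_dim)
    return p

rng = np.random.default_rng(0)
params = init_params(input_dim=1, hidden_dim=4, rng=rng)
h = np.zeros(4)
for flow in [0.2, 0.5, 0.9]:          # made-up normalized traffic flow values
    h = gru_step(np.array([flow]), h, params)
```

Because the new state is a gated blend of the previous state and a tanh output, every component of `h` stays strictly inside (−1, 1).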

Based on the GRU network, the bidirectional GRU (BiGRU) network has been further developed [75]. A bidirectional GRU network is made up of two GRU layers that process the sequence in opposite directions, as shown in Figure 2. In the figure, *x*_{t} is the input to the GRU, *h*_{f} is the output of the forward GRU layer, and *h*_{b} is the output of the reverse GRU layer. The BiGRU network thus uses information from both the past and the future: at each moment, the input time series is fed into the two opposite GRU layers, and the outputs [*h*_{1}, *h*_{2}, *h*_{3}, *h*_{4}] are jointly determined by these two layers.

At each time node *x*_{t}, this network has two hidden layers that process the sequence in opposite orders: one from left to right and the other from right to left. Because both hidden layers exist at every moment *t*, the network uses twice the storage for parameters such as weights and biases. There is no information interaction between the two opposite hidden layers, and they are computed independently, which ensures that the unfolded graph is acyclic; their state output vectors are combined only at the end to produce the final output of the network.
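The bidirectional arrangement can be sketched as two independent passes whose states are combined per time step. In this illustrative sketch, a simple tanh recurrent cell stands in for the GRU cell to keep the example short, and concatenation is assumed as the way of combining the two directions; the text above does not specify the combination rule, and all names are hypothetical.

```python
import numpy as np

def run_direction(step, xs, hidden_dim, reverse=False):
    """Run a recurrent cell over xs and return the hidden state at every step."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    h = np.zeros(hidden_dim)
    states = [None] * len(xs)
    for t in order:
        h = step(xs[t], h)
        states[t] = h                  # store by original time index
    return np.stack(states)

def bidirectional(step_fwd, step_bwd, xs, hidden_dim):
    """Forward and backward passes share no state; outputs are concatenated."""
    h_f = run_direction(step_fwd, xs, hidden_dim)
    h_b = run_direction(step_bwd, xs, hidden_dim, reverse=True)
    return np.concatenate([h_f, h_b], axis=1)

def make_cell(rng, input_dim, hidden_dim):
    """Toy tanh cell used here as a stand-in for a GRU cell."""
    W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    return lambda x, h: np.tanh(W @ x + U @ h)

rng = np.random.default_rng(0)
xs = rng.random((4, 1))                          # four time steps, one feature
fwd, bwd = make_cell(rng, 1, 3), make_cell(rng, 1, 3)
outputs = bidirectional(fwd, bwd, xs, hidden_dim=3)   # shape (4, 6)
```

Each direction has its own parameters, which is why the bidirectional network stores roughly twice as many weights and biases as a single layer.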

##### 3.3. Attention Mechanism

The attention mechanism uses a method of assigning different weights to the input features of a model in order to highlight the important factors that influence the model. The function of the attention mechanism can be understood as the process of filtering important information from multiple pieces of information, focusing on the important information and ignoring the unimportant information. The process of focusing on the important information is also the process of calculating the weight coefficients, and the more important the information, the larger the weight coefficient assigned. The process of calculating the context vectors and weights for the application of the attention mechanism to a deep learning model is as follows:

Assuming that the output states of the hidden layer of the deep learning model are *h*_{1}, *h*_{2}, …, *h*_{i}, …, *h*_{t}, the context vector *C*_{t} can be calculated as

$$C_t = \sum_{i=1}^{t} \alpha_{t,i} h_i, \tag{2}$$

where *α*_{t,i} denotes the attention parameter, i.e., the weight corresponding to *h*_{i}, and the weights sum to 1. The attention parameter can be calculated as

$$\alpha_{t,i} = \frac{\exp\left(e_{t,i}\right)}{\sum_{k=1}^{t} \exp\left(e_{t,k}\right)}, \tag{3}$$

where *e*_{t,i} is the alignment model, which scores the match between the input at moment *i* and the output at moment *t*. It is calculated as follows:

$$e_{t,i} = \tanh\left(W_a s_{t-1} + U_a h_i + b_a\right), \tag{4}$$

where *W*_{a}, *U*_{a}, and *b*_{a} are the parameters of the feedforward neural network, and *s*_{t − 1} is the state computed by the deep learning network at moment *t* − 1.

The output of the attention mechanism is then obtained by applying the softmax activation function.
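The weighting process described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the alignment score is taken to be a tanh feedforward unit, the parameters `W_a` and `U_a` are assumed to be vectors so that each score is a scalar, and the shapes and random values are made up.

```python
import numpy as np

def attention(hidden_states, s_prev, W_a, U_a, b_a):
    """Return the context vector and the attention weights.

    hidden_states: (T, d_h) hidden-layer outputs h_1..h_T
    s_prev: (d_s,) previous state; W_a: (d_s,), U_a: (d_h,), b_a: scalar.
    """
    # Alignment scores e_{t,i}, one scalar per hidden state (assumed tanh unit).
    scores = np.tanh(s_prev @ W_a + hidden_states @ U_a + b_a)
    # Softmax normalization: weights are positive and sum to 1.
    exp_scores = np.exp(scores - scores.max())   # shift for numerical stability
    weights = exp_scores / exp_scores.sum()
    # Context vector: weighted sum of the hidden states.
    context = weights @ hidden_states
    return context, weights

rng = np.random.default_rng(0)
T, d_h, d_s = 5, 4, 3
hidden_states = rng.normal(size=(T, d_h))
context, weights = attention(hidden_states, rng.normal(size=d_s),
                             rng.normal(size=d_s), rng.normal(size=d_h), 0.0)
```

Hidden states with larger alignment scores receive larger weights and therefore contribute more to the context vector.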

#### 4. Model

Realistic short-term traffic flow often exhibits complexity and randomness, which requires prediction models that can capture multiple features of traffic flow. CNNs can extract local trend features of traffic flows, while bidirectional GRU networks can obtain long-term dependent features from both the past and the future and achieve temporal feature extraction by fusing the two. At the same time, the attention mechanism enables the model to focus on important features. Based on this, this paper proposes a short-term traffic flow prediction model based on a deep learning method of multifeature fusion, which consists of a CNN-BiGRU-attention module and two BiGRU-attention modules; the model structure is shown in Figure 3. The CNN-BiGRU-attention module is composed of a CNN, a BiGRU network, and an attention layer connected sequentially, where the CNN is composed of one convolutional layer. The CNN-BiGRU-attention module extracts traffic flow features by considering the local trend features extracted by the CNN and the time-dependent features extracted by the BiGRU as a whole. The two BiGRU-attention modules are used to obtain the weekly and daily features of the traffic flow data, respectively.

In addition, from a layer perspective, the model consists of an input layer, a hidden layer, a feature fusion layer, and an output layer. The input layer is a parallel composition of the historical time series and the daily and weekly series, where the historical time series *X*_{T} is a sequence of traffic flows from time *t* − *n* to *t* and can be represented as

$$X_T = \left[x_{t-n}, x_{t-n+1}, \ldots, x_{t-1}, x_t\right],$$

where *x*_{t} is the traffic flow at time *t*.

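The daily periodic traffic flow sequence can be expressed as

$$X_D = \left[x^{d}_{t-n}, x^{d}_{t-n+1}, \ldots, x^{d}_{t-1}, x^{d}_{t}\right],$$

where $x^{d}_{t}$ indicates the traffic flow corresponding to *x*_{t} on the previous day.

The weekly periodic traffic flow sequence can be expressed as

$$X_W = \left[x^{w}_{t-n}, x^{w}_{t-n+1}, \ldots, x^{w}_{t-1}, x^{w}_{t}\right],$$

where $x^{w}_{t}$ indicates the traffic flow corresponding to *x*_{t} in the previous week.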
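As an illustration of how the three parallel inputs could be assembled, the sketch below slices a recent window and the matching windows one day and one week earlier from a single series. It assumes the 5-minute sampling interval reported in Section 5.1, so that "the previous day" is an offset of 288 steps and "the previous week" an offset of 2016 steps; this offset interpretation, the function name `build_inputs`, and the synthetic series are assumptions, not details given in the paper.

```python
import numpy as np

STEPS_PER_DAY = 24 * 60 // 5           # 288 five-minute intervals per day
STEPS_PER_WEEK = 7 * STEPS_PER_DAY     # 2016 intervals per week

def build_inputs(series, t, n):
    """Return (recent, daily, weekly) windows ending at index t.

    Each window covers indices t-n..t (n + 1 values); the daily and weekly
    windows are the same positions shifted back by one day and one week.
    """
    if t - n - STEPS_PER_WEEK < 0:
        raise ValueError("not enough history for the weekly window")
    idx = np.arange(t - n, t + 1)
    return series[idx], series[idx - STEPS_PER_DAY], series[idx - STEPS_PER_WEEK]

# Synthetic stand-in series whose value equals its index, to make offsets visible.
series = np.arange(3 * STEPS_PER_WEEK, dtype=float)
recent, daily, weekly = build_inputs(series, t=5000, n=11)
```

With `n = 11` each window holds 12 values; this length is only an example chosen for the sketch.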

The hidden layer contains three parallel modules: one CNN-BiGRU-attention module and two BiGRU-attention modules. The 1DCNN is chosen as the convolutional layer of the model due to the one-dimensional and periodic nature of the traffic flow sequence. The dropout layer is followed by the feature fusion layer, where the features of the traffic flow are fused and passed to the output layer for prediction.
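The path from the module outputs through dropout and feature fusion to the output layer can be sketched as below. The paper does not state how the fusion layer combines features, so concatenation followed by a single linear output neuron is an assumption of this sketch, as are the inverted-dropout scaling, the feature sizes, and all names.

```python
import numpy as np

def dropout(features, rate, rng, training):
    """Inverted dropout: zero a fraction `rate` of units during training only."""
    if not training:
        return features
    mask = rng.random(features.shape) >= rate
    return features * mask / (1.0 - rate)

def fuse_and_predict(module_outputs, W_out, b_out):
    """Concatenate the module features (assumed fusion rule) and apply one output neuron."""
    fused = np.concatenate(module_outputs)
    return float(W_out @ fused + b_out)

rng = np.random.default_rng(0)
# Made-up feature vectors standing in for the outputs of the three modules.
recent_f, daily_f, weekly_f = rng.random(8), rng.random(8), rng.random(8)
kept = [dropout(f, 0.2, rng, training=False) for f in (recent_f, daily_f, weekly_f)]
W_out = rng.normal(scale=0.1, size=24)
prediction = fuse_and_predict(kept, W_out, b_out=0.0)
```

At inference time dropout is the identity, so the prediction depends only on the fused features and the output weights.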

#### 5. Experiment

##### 5.1. Data Processing and Dataset

The cross-sectional traffic flow collected at the intersection of Shandong Road and Minjiang Road in Qingdao, China, is used as the dataset. It contains 101 consecutive days of traffic flow data from February 1 to May 12, 2019, sampled at 5-minute intervals, for a total of 29,088 raw records (288 per day). The Lagrangian interpolation method is then used to repair the missing and abnormal data, and the data are normalized using the maximum-minimum normalization method to obtain the dataset for the model. The 87 days of data from February 1 to April 28 are used as the training set, and the 14 days of data from April 29 to May 12 are used as the test set.
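The preprocessing steps can be sketched as follows. The series here is synthetic (a sinusoidal daily pattern), and the neighbourhood size `k` used for the Lagrange interpolation is a hypothetical choice; the paper does not state which neighbouring points were used or how abnormal data were detected.

```python
import numpy as np

STEPS_PER_DAY = 24 * 60 // 5          # 288 records per day at 5-minute intervals

def lagrange_fill(values, index, k=3):
    """Estimate values[index] by Lagrange interpolation over up to k valid
    neighbours on each side (k is an assumed setting)."""
    lo, hi = max(0, index - k), min(len(values), index + k + 1)
    xs = [i for i in range(lo, hi) if i != index and not np.isnan(values[i])]
    estimate = 0.0
    for xj in xs:
        basis = 1.0
        for xm in xs:
            if xm != xj:
                basis *= (index - xm) / (xj - xm)
        estimate += values[xj] * basis
    return estimate

def min_max_normalize(values):
    """Maximum-minimum normalization to the range [0, 1]."""
    v_min, v_max = values.min(), values.max()
    return (values - v_min) / (v_max - v_min), v_min, v_max

# Synthetic stand-in for 101 days of 5-minute counts (not the Qingdao data).
steps = np.arange(101 * STEPS_PER_DAY)
true_flow = 200.0 + 150.0 * np.sin(2 * np.pi * steps / STEPS_PER_DAY)
flow = true_flow.copy()
flow[1000] = np.nan                   # simulate one missing record
flow[1000] = lagrange_fill(flow, 1000)

scaled, v_min, v_max = min_max_normalize(flow)
train = scaled[:87 * STEPS_PER_DAY]   # first 87 days
test = scaled[87 * STEPS_PER_DAY:]    # last 14 days
```

Keeping `v_min` and `v_max` allows predictions to be mapped back to vehicle counts before the error metrics are computed.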

##### 5.2. Experimental Environment and Model Evaluation Index Selection

The software and hardware conditions of the experimental environment in this paper are shown in Table 1.

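In order to evaluate the performance of the fused feature model, three evaluation metrics were chosen, namely, the mean absolute percentage error (MAPE), the mean absolute error (MAE), and the root mean square error (RMSE), which are calculated as follows:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2},$$

where *n* is the total number of samples in the test set, *y*_{i} is the actual value of the *i*th sample, and $\hat{y}_i$ is the predicted value of the *i*th sample.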
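The three metrics can be computed directly from their standard definitions. The sketch below uses made-up actual and predicted values purely to show the arithmetic; they are not results from the experiments.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up values for illustration only.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

print(mape(y_true, y_pred))   # (10% + 5% + 0%) / 3
print(mae(y_true, y_pred))    # (10 + 10 + 0) / 3
print(rmse(y_true, y_pred))   # sqrt((100 + 100 + 0) / 3)
```

MAPE is scale-free, which makes it convenient as a main metric, but it is undefined when an actual value is zero, so low-flow intervals need care.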

##### 5.3. Model Parameter Settings

###### 5.3.1. Loss Function Setting

The loss function quantifies how close a given neural network is to the ideal situation it is trained for. The mean absolute error function and the mean square error function are generally used. Because it is convenient to compute, the mean square error function is chosen as the loss function in the fusion feature model, and the calculation formula is as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,$$

where *y*_{i} is the actual value of the *i*th sample, $\hat{y}_i$ is the predicted value of the *i*th sample, and *n* is the number of samples.

###### 5.3.2. Setting the Number of Neurons in the Model

Before the model is trained, the number of neurons in the input and hidden layers of the model should be set (the model in the paper is based on a sequence of historical traffic flows to predict the traffic flow value at the next moment, so the number of neurons in the output layer of the model is set to 1; refer to Section 4 for details). The following is the process of setting the number of neurons in the input and hidden layers.

To obtain the appropriate number of neurons for the input layer, we select 6, 12, 18, and 24 as candidate numbers of input-layer neurons, train the model with each, and choose the optimal number by error analysis on the test set. Similarly, for the BiGRU layer, four candidate numbers of neurons, 16, 32, 64, and 128, are used to train the model, and the optimal number is again determined by error analysis on the test set. Meanwhile, in the 1DCNN layer, features are extracted through convolutional kernels; 64 kernels of size 2 are used, i.e., filters = 64 and kernel_size = 2. The ReLU function is chosen as the activation function for the convolutional layer. It is calculated as follows:

$$f(x) = \max(0, x),$$

where *x* is the input to the activation function.

In the Dropout layer, the dropout rate is set to 20%. In addition, the number of epochs is set to 300, and the batch size is set to 256.

For the error analysis of the test set, MAPE is selected as the main evaluation metric, and MAE and RMSE are selected as auxiliary evaluation metrics. The results of the evaluation metrics for the test set with different numbers of neurons in the model input layer and the bidirectional GRU network, including MAPE, MAE, and RMSE, were obtained, as shown in Table 2.

From Table 2, it can be found that the model has the strongest generalization ability when the number of neurons in the input layer is 12, and the number of neurons in the BiGRU network is 128, so we choose 12 and 128 as the numbers of neurons in the input layer of the model and the BiGRU network.
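The selection procedure amounts to a small grid search over the candidate sizes. The sketch below shows one way to organize it, with MAPE as the main criterion and MAE and RMSE as tie-breakers. The `evaluate` callable is hypothetical: in practice it would train the model with the given sizes and return the test-set metrics, and the placeholder used here returns invented scores that are not results from the paper.

```python
from itertools import product

def select_sizes(evaluate, input_sizes=(6, 12, 18, 24), gru_units=(16, 32, 64, 128)):
    """Evaluate every (input size, BiGRU units) pair and return the best one.

    evaluate(n_input, n_units) must return a dict with 'MAPE', 'MAE', 'RMSE'.
    Pairs are ranked by MAPE first, then MAE, then RMSE.
    """
    results = {(n, u): evaluate(n, u) for n, u in product(input_sizes, gru_units)}
    best = min(results, key=lambda k: (results[k]["MAPE"],
                                       results[k]["MAE"],
                                       results[k]["RMSE"]))
    return best, results

def placeholder_evaluate(n_input, n_units):
    # Placeholder scores for demonstration only; NOT results from the paper.
    score = abs(n_input - 18) + abs(n_units - 32) / 16
    return {"MAPE": score, "MAE": score, "RMSE": score}

best, results = select_sizes(placeholder_evaluate)
```

Selecting hyperparameters on the test set, as described above, follows the paper; a separate validation split would be the more conservative choice if one were reproducing the experiment.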

###### 5.3.3. Optimization Algorithm Setup

In the training process of deep learning models, optimization algorithms are used to iteratively update the model parameters in order to reduce the value of the loss function, so that the training process becomes stable as the number of iterations increases. The mainstream optimization algorithms include RMSProp and Adam. Both are used to train the fusion feature model, and the optimization algorithm is selected according to the generalization capability of the resulting model. The results of the three evaluation metrics obtained with each algorithm are shown in Table 3.
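For reference, the update rules of the two optimizers can be written in a few lines. This sketch applies them to a one-dimensional quadratic loss rather than to the fusion model; the hyperparameter values are commonly used defaults assumed for illustration, not settings reported in the paper.

```python
import numpy as np

def rmsprop_step(w, grad, state, lr=0.001, rho=0.9, eps=1e-7):
    """RMSProp: scale the step by a running average of squared gradients."""
    state["v"] = rho * state["v"] + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(state["v"]) + eps)

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Adam: RMSProp-style scaling plus momentum, with bias correction."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

def minimize(step, state, w=0.0, n_steps=200, lr=0.1):
    """Minimize the toy loss (w - 3)^2, whose gradient is 2 (w - 3)."""
    for _ in range(n_steps):
        grad = 2.0 * (w - 3.0)
        w = step(w, grad, state, lr=lr)
    return w

w_rmsprop = minimize(rmsprop_step, {"v": 0.0})
w_adam = minimize(adam_step, {"m": 0.0, "v": 0.0, "t": 0})
```

Both rules normalize the gradient by its recent magnitude; Adam additionally smooths the gradient direction, which is one reason the two can behave differently on the same model.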

As can be seen in Table 3, the MAPE, MAE, and RMSE obtained when the model is trained with the Adam algorithm are all smaller than those obtained with the RMSProp algorithm. The results indicate that the Adam algorithm is more effective than the RMSProp algorithm here, so Adam is selected as the optimization algorithm for the multifeature fusion model.
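To make the difference between the two optimizers concrete, the following sketch applies the commonly used RMSProp and Adam update rules to a one-dimensional toy loss, (w − 3)². The toy loss, the function names, and the hyperparameter values (learning rate 0.01, decay rates 0.9 and 0.999) are assumptions for illustration. They are not the settings used in this paper, and the sketch says nothing about which optimizer performs better on the traffic dataset.

```python
import math


def grad(w: float) -> float:
    """Gradient of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)


def run_rmsprop(w: float = 0.0, lr: float = 0.01, rho: float = 0.9,
                eps: float = 1e-8, steps: int = 2000) -> float:
    """RMSProp: scale each step by a running average of squared gradients."""
    s = 0.0
    for _ in range(steps):
        g = grad(w)
        s = rho * s + (1.0 - rho) * g * g
        w -= lr * g / (math.sqrt(s) + eps)
    return w


def run_adam(w: float = 0.0, lr: float = 0.01, beta1: float = 0.9,
             beta2: float = 0.999, eps: float = 1e-8, steps: int = 2000) -> float:
    """Adam: RMSProp-style scaling plus a momentum term, both bias-corrected."""
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment (momentum) estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_rmsprop = run_rmsprop()
w_adam = run_adam()
```

Both runs start at w = 0 and move toward the minimizer w = 3. Adam differs from RMSProp by adding the momentum estimate `m` and the bias-correction terms.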

##### 5.4. Results and Analysis

After determining the parameters of the model, the designed training and test sets are used to validate the predictive performance of the multifeature fusion model. The loss function curves generated by the model during the training process are shown in Figure 4. From Figure 4, it can be found that as the number of epochs increases, the loss curves of the training and test sets decrease rapidly and steadily and finally converge to a stable value close to 0, indicating that the design of the multifeature fusion model is reasonable.

Figure 5 shows the prediction results of the multifeature fusion model on the test set. It can be found that the multifeature fusion model fits the actual traffic flow in the test set very well; specifically, the error curve shows that the prediction error of the model at each moment lies within [−60, 60].

In addition, to further verify the robustness of the multifeature fusion model, Figure 6 shows the MAPE plot of the model on the test set. As can be seen from the graph, the MAPE curve gradually decreases from its maximum value to an inflection point, then slowly increases and gradually converges to 5.52%. This indicates that the multifeature fusion model has good robustness and low error, and further indicates that it can better achieve traffic flow prediction.

To further validate the feasibility of the multifeature fusion model, four groups of comparison models are considered. First, the ability of the multifeature fusion model to extract long-term dependent features and local features of the traffic flow is examined. A model that uses a Conv-BiGRU module, together with the other modules, is selected as the comparison model. The Conv-BiGRU module consists of a convolutional layer and a BiGRU network arranged in parallel, and it extracts the local trend features and long-term dependent features of the traffic flow separately. This comparison model then fuses the long-term dependent features, local trend features, and periodic features (including daily and weekly periodicity) of the traffic flow through the feature fusion layer and makes the prediction. Second, the impact of periodic features on the multifeature fusion model is verified. Short-term traffic flows usually exhibit strong periodicity, and an advantage of the model is that it accounts for this periodicity by using two BiGRU-attention modules to extract the daily and weekly periodicity of traffic flows, respectively. A model containing only the CNN-BiGRU-attention module is used as a comparison model for validation. Third, since periodicity usually includes both daily and weekly periodicity, a model considering only daily periodicity and a model considering only weekly periodicity are used as comparison models for validation. Fourth, a model without attention mechanisms in any module is used as a comparison model for validation. The MAPE results of these comparison models and of the multifeature fusion model were obtained by training and testing, as shown in Figure 7.

From Figure 7, it can be found that the maximum, minimum, and median MAPE values of the multifeature fusion model containing the CNN-BiGRU module are smaller than those of the model containing the Conv-BiGRU module, indicating that the feature extraction capability of the CNN-BiGRU module is better than that of the Conv-BiGRU module. This is because the local trend features and long-term dependent features of the traffic flow are intertwined and interact with each other. Furthermore, the maximum, minimum, and median MAPE values of the multifeature fusion model are smaller than those of the model with only the CNN-BiGRU-attention module, because periodic features play an important role in short-term traffic flow prediction. In addition, Figure 7 shows that the MAPE of the multifeature fusion model is smaller than that of the feature fusion model without the attention mechanism. This indicates that the attention mechanism in the multifeature fusion model improves the prediction accuracy by focusing on the important features extracted by each module.

Finally, the proposed multifeature fusion model is compared with existing baseline models. The baseline models include the LSTM model, GRU model, CNN-LSTM-attention model, CNN-GRU-attention model, and CDLP model [72]. The LSTM model and the GRU model are each composed of one input layer, two hidden layers (LSTM layers and GRU layers, respectively), and one output layer. The CNN-LSTM-attention model is composed of an input layer, a hidden layer, and an output layer, where the hidden layer consists of a convolutional layer, two LSTM layers, and an attention mechanism layer connected sequentially. The structure of the CNN-GRU-attention model is the same as that of the CNN-LSTM-attention model, with GRU layers in place of the LSTM layers. The parameters of the five baseline models are set as in the multifeature fusion model.

The prediction errors of the different models, in terms of the performance metrics, are shown in Table 4, from which it can be found that the multifeature fusion model has the lowest prediction error. The LSTM and GRU models mainly consider the temporal characteristics of traffic flow, that is, the long- and short-term dependence. The CNN-GRU-attention model and the CNN-LSTM-attention model consider both the spatial and temporal characteristics of traffic flow, so their prediction errors are lower than those of the LSTM and GRU models. The CDLP model is a combined prediction model based on the CNN-LSTM-attention model and the CNN-GRU-attention model, which also considers only the spatiotemporal characteristics of the traffic flow. The multifeature fusion model extracts the spatiotemporal, weekly, and daily characteristics of the traffic flow by using three different modules built from combined deep learning methods, so its prediction performance is better than that of the baseline models.

In addition, the training times of the different models are shown in Table 5. Among the combined models with higher prediction accuracy, the training time of the multifeature fusion model is the same as that of the CNN-GRU-attention model, while its MAPE, RMSE, and MAE are lower than those of the CNN-GRU-attention model by 0.19%, 0.71, and 0.35, respectively. Furthermore, the training time of the multifeature fusion model is shorter than that of the CNN-LSTM-attention model and the CDLP model, while the prediction accuracy is improved in both cases, as reflected in Table 4. This is because the model uses the CNN-BiGRU-attention module, and GRU is a simplification of LSTM, so the multifeature fusion model takes less time to train than the CNN-LSTM-attention model and the CDLP model (which uses the CNN-LSTM-attention module). Therefore, the multifeature fusion model has superior prediction performance.

#### 6. Conclusion and Future Work

Short-term traffic flow prediction is one of the core components of intelligent transportation systems. To address the failure of existing methods to extract the multiple features of traffic flow, this paper proposes a multifeature fusion model consisting of a CNN-BiGRU module with an attention mechanism and two BiGRU modules with an attention mechanism. Moreover, the parameters of the multifeature fusion model, including the number of neurons, the optimization algorithm, and other parameters, are obtained by experimental calibration.

Through experiments, it is found that the CNN-BiGRU-attention module can effectively capture the local trend features and long-term dependent features of the traffic flow, and the two BiGRU-attention modules can effectively capture the daily and weekly cycle features of the traffic flow. At the same time, the attention mechanism improves the prediction accuracy of the model by focusing on the importance of the features acquired in each module, and the feature fusion layer of the model allows the features extracted from each module to be fused to predict future traffic flow trends.

Finally, extensive experimental results have shown that the predictive performance of the multifeature fusion model is superior to that of the baseline models for the same dataset.

In this work, we investigate traffic flow prediction using only cross-sectional traffic flows as the object of study. However, in real life, road network traffic flows usually exhibit extremely complex characteristics, and it is difficult for traditional CNN and BiGRU networks to capture short-term traffic flow features in complex road networks. Graph neural network approaches, such as spatiotemporal synchronous graph convolutional neural networks [76], offer an alternative solution to short-term traffic flow prediction in large and complex road networks, a problem that is difficult to solve with traditional combined CNN-GRU models. Investigating such approaches is reserved for our future work. In addition, short-term traffic flows are often influenced by weather, traffic accidents, and major events, so short-term traffic flow prediction that considers special events is left as another direction for our future research.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.