With the exponential growth of traffic data and the increasing complexity of traffic conditions, data must be stored and analysed effectively to feed back valid information. This paper proposes an urban road traffic status prediction model based on an optimized deep recurrent Q-Learning method. The model uses an optimized Long Short-Term Memory (LSTM) algorithm to handle the explosive growth of Q-table data, which not only avoids gradient explosion and vanishing but also enables efficient storage and analysis. Continuous training and memory storage of the training sets improve the system sensitivity, and the test sets are then predicted from the accumulated experience pool to obtain high-precision prediction results. Traffic flow data from Wanjiali Road to Shuangtang Road in Changsha City are used as a test case. The results show that the predicted traffic delay index lies within a reasonable interval and that the model performs significantly better than traditional prediction methods such as the LSTM, K-Nearest Neighbor (KNN), Support Vector Machines (SVM), exponential smoothing, and the Back Propagation (BP) neural network, which indicates that the proposed model is feasible in application.

1. Introduction

With the development of urbanization, the contradiction between transportation infrastructure and the vehicle population has become prominent, and traffic congestion has become more serious, which inevitably leads to longer travel times, intensified environmental pollution, and economic loss [1]. Prevention is the first way to control traffic congestion. Based on the existing traffic states, the short-term trend is predicted, and the information platform then issues an early warning to divert traffic and avoid or ease congestion [2–4]. Therefore, how to establish a long-term model for timely warning of traffic congestion is a research focus of urban intelligent transportation system optimization [5–7].

A variety of methods, including time series, machine learning, and artificial neural networks, have been proposed for traffic congestion prediction. After the time-series characteristics of traffic flow data were discovered [8], some scholars used autoregressive integrated moving average models [9] to predict the traffic flow on expressways [10, 11]. Because the temporal distribution of traffic flow data is interrelated, some scholars used nonparametric regression methods to build macroscopic traffic models and found that the prediction results are better than those of time-series algorithms [12–14]. However, these methods based on statistics and traffic models require a large amount of historical data and rely on many assumptions, so they are difficult to apply to nonlinear traffic flow [15–18].

In recent years, machine learning algorithms, such as the back propagation neural network [19, 20], have gradually been used in traffic prediction because they can handle nonlinear problems. Because the back propagation neural network trains slowly and tends to fall into local optima, some scholars also used the Support Vector Machine (SVM) [21–23] and K-Nearest Neighbor (KNN) [24–26] to predict the traffic status. Moreover, some scholars found that the time series of short-term traffic flow has chaotic characteristics. To deal with these issues, many methods, such as combined vector machine-based [27] and phase space reconstruction-based [28] methods, have been proposed to achieve better results. However, most of these machine learning-based methods are not robust enough to handle huge data volumes, so the models generally lack long-term effectiveness and scalability [29–31].

Faced with large volumes of traffic flow data, scholars have gradually turned to deep learning, a class of learning algorithms that simulate the multilayered perceptual structure of the human brain to recognize data patterns. Breakthroughs have been made in many fields such as computer vision, speech recognition, and natural language processing. Deep learning has gradually been adopted by Stanford University, Google, Baidu Research Institute, and other authoritative organizations as a strategic direction for the development of data mining and artificial intelligence [32, 33]. Kuremoto et al. [34] combined the restricted Boltzmann machine with time-series laws to obtain a prediction model, which fits the sample data with the minimum model energy. Lv et al. [35] proposed a deep learning model to predict traffic flow based on an autoencoder network that applies compression coding to the input data. Zhao et al. [36] proposed a traffic congestion prediction model based on the improved SVM, which can learn the characteristics of traffic flow parameters through the deep structure by digitizing different environmental and human factors. The abovementioned methods speed up data processing by applying deep learning models but do not take into account the curse of dimensionality caused by the high-dimensional states of traffic flow parameters. To address this problem, some scholars used data compression techniques based on the LSTM, Principal Component Analysis (PCA) [37], the CUR matrix decomposition algorithm [38], and the Discrete Cosine Transform (DCT) [39] to reduce the data dimension.

Q-Learning can efficiently store and extract data to support traffic prediction. The LSTM network reduces the frequency of gradient explosion and vanishing, so it is suitable for capturing the spatiotemporal evolution of traffic state parameters [40–43]. In this paper, considering the time sequence of traffic flow parameters and the continuity of traffic congestion effects, the recurrent neural network model is used to train the extracted features and obtain low-dimensional vectors of historical information; the resulting vectors are then concatenated for classification training. Finally, an urban road traffic state prediction model based on the optimized deep recurrent Q-Learning method is established. The model proposed in this paper makes the following contributions:
(1) The model effectively solves the problem of gradient explosion and gradient vanishing in the prediction process of the LSTM
(2) The model effectively extracts the associated features of the traffic data, so it has better prediction efficiency and accuracy
(3) The model provides a feasible prediction method for the construction of an intelligent transportation system due to its efficiency and feasibility

The rest of this paper is organized as follows. Section 2 points out the problems to be solved and the corresponding methods. Sections 3 and 4 introduce the principles and steps of Q-Learning and the LSTM. The deep recurrent Q-Learning network model is constructed in Section 5. The case analysis in Section 6 demonstrates the stability and feasibility of the method. Finally, Section 7 concludes the paper.

2. Specific Problems and Solutions

2.1. Specific Problems

The problems with urban traffic data are high repeatability, high loss rate, and poor correlation. The existing prediction methods mainly discuss the results of independent analysis and whether they meet the needs of further verification. Therefore, the following problems exist in data preprocessing and optimization prediction.

Regarding data relevance: the states at the previous moment and the next moment lack an effective connection. Therefore, the information at different states is disconnected, and the timeliness of the data cannot be fully exploited. As a result, the prediction results are not sufficiently correlated with the data at the previous moment and lack persuasiveness.

Regarding data storage: with the existing analysis methods, the storage capacity of the database quickly reaches its threshold, which is not conducive to long-term, sustained prediction. Besides, repeated analysis steps increase the feedback delay and cannot meet the requirements of low-latency traffic prediction.

Regarding comprehensive data analysis: the existing analysis focuses on fixed types of data, whereas the traffic environment is an integrated system. Therefore, even if the prediction results are accurate, they cannot reflect the objective situation.

2.2. Solutions

For the abovementioned three research problems, this paper will propose the corresponding solutions:

For the problem of data relevance: based on the optimized LSTM model, the effective correlation and information accumulation of different data types are strengthened, and the correlation degree of data at different moments is strengthened.

For the problem of data storage: Q-Learning represents the data information as functions, so that each data cell can be expressed by a function. This not only reduces the pressure of data storage but also improves analysis efficiency and accuracy.

For the problem of comprehensive data analysis: traffic conditions are affected by multiple factors. Therefore, when selecting the characteristic data types, climate and temperature are considered in addition to the basic parameters of traffic flow. In this way, a multidimensional data analysis system is established, making the prediction results more accurate and objective.

3. Q-Learning Principle and Application Steps

The steps of the Q-Learning are listed as follows: the state of the agent in the environment E is S, and the actions taken by the agent constitute the action space A. It takes different actions to transfer between states, and the reward function obtained is R. To achieve the optimal strategy, the Q-Learning estimates the value of each action choice in each state. The Q-Learning uses Q (S, A) to represent the value function of state-to-action and continuously updates the value of Q (S, A) according to the state transition. Finally, the Q-Learning obtains the optimal strategy based on Q (S, A).

The value function Q (S, A) of the traffic state is updated as follows: assuming the state of the agent at time t is $s$, the action is $a$. Then, the state transitions to time t + 1, the state is $s'$, and the reward is $r$. Finally, the agent updates the value of Q (s, a) according to all records to find the optimal strategy. The corresponding update function is shown in the following equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \tag{1}$$

where $Q(s, a)$ is the current Q-table, $\alpha$ is the learning rate, $r$ is the benefit at the next moment, $\gamma$ is the attenuation (reward decay) coefficient, and $\max_{a'} Q(s', a')$ is the best benefit in memory.
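To make the update rule concrete, the following is a minimal sketch of one tabular Q-Learning step, assuming the standard form of the update (current value plus the learning rate times the difference between the discounted target and the current value). The function name `q_update`, the state labels, and the parameter values are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def q_update(q_table, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    """Apply one tabular Q-Learning update and return the new Q(s, a)."""
    # Best value currently stored for the next state (0.0 if never visited).
    best_next = max(q_table[(s_next, a_next)] for a_next in actions)
    # Target = immediate reward + discounted best future value.
    td_target = r + gamma * best_next
    # Move Q(s, a) toward the target by a fraction alpha.
    q_table[(s, a)] += alpha * (td_target - q_table[(s, a)])
    return q_table[(s, a)]

# Hypothetical two-action example with made-up state labels.
q_table = defaultdict(float)
actions = [0, 1]
first = q_update(q_table, "free", 1, 1.0, "slow", actions, alpha=0.5, gamma=0.9)
second = q_update(q_table, "slow", 0, 2.0, "free", actions, alpha=0.5, gamma=0.9)
```

The second call already benefits from the first: the value stored for ("free", 1) enters the target for ("slow", 0), which is how experience accumulates in the table.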

The deep Q-Learning network combines deep learning and Q-Learning. The network uses the perceptive ability of the deep learning to transform the state to high dimensions and uses the decision-making ability of Q-Learning to map the high-dimensional state representation to the low-dimensional action space [44, 45]. In the Q-Learning algorithm, the table is used to store the value of Q (s, a). In the deep Q-Learning, the state dimension of the agent is high, and the table obviously cannot meet the demand. This problem is solved by using f (s, a) to approximate Q (s, a) [46, 47]. Therefore, based on the corresponding value function neural network model, approximate values can be obtained, thereby reducing the storage pressure of the Q-table and providing ideas and methods for Q-Learning to be applied to traffic state prediction. Finally, the network obtains the action value of congestion and dissipation according to the accumulated experience pool. Figure 1 shows a schematic diagram of the principle of approximating the value of “state-action” through the neural network.
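As an illustration of replacing the Q-table with an approximating function f (s, a), the sketch below uses a plain linear model with one row of weights per action. This is a simplifying assumption for readability: the paper uses a neural network, and the weights, biases, and state values here are hypothetical.

```python
def approx_q_values(weights, biases, state):
    """Return one approximate Q-value per action for a feature-vector state."""
    return [sum(w * x for w, x in zip(row, state)) + b
            for row, b in zip(weights, biases)]

# Hypothetical 3-feature state and 2 actions: 2 x 3 weights plus 2 biases
# are stored instead of one table entry per (state, action) pair.
weights = [[0.1, -0.2, 0.0], [0.0, 0.3, 0.5]]
biases = [0.0, 0.1]
state = [1.0, 2.0, 0.5]
q_values = approx_q_values(weights, biases, state)
best_action = max(range(len(q_values)), key=q_values.__getitem__)
```

The storage cost depends on the number of features and actions, not on how many distinct states are observed, which is the point of the approximation.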

The network helps solve the problem of processing huge data volumes. Because traffic data have strong time-series characteristics, applying this network makes the analysis results more reliable. Further demonstrations and experiments are discussed in the following sections.

4. Recurrent Neural Network LSTM Algorithm

4.1. Overview of the Recurrent Neural Network

The recurrent neural network is one of the optimized variants of deep neural networks. Its characteristic is that the output of the neurons at a certain moment becomes part of the input at the next moment, so the network memorizes the information of the previous moment and the information persists. As shown in Figure 2, the neural network reads the input $x_t$ of the current time t and obtains the output $h_t$. At the same time, the information status $s_t$ is returned to the neural network as one of the inputs at the next time point. In order to show the execution action more intuitively, we express it by

$$s_t = f(U x_t + W s_{t-1}), \qquad h_t = g(V s_t) \tag{2}$$

where $f$ and $g$ denote the activation functions of the hidden layer and the output layer.

The output at each moment is related to the input at the previous moment. The recurrent neural network is the most natural structure for processing sequence data, which is exactly what we need to handle historical data and real-time data in this paper. The symbols are defined as follows:
$x_t$: the input at time t; $X$ is the input sequence.
$s_t$: the state of the hidden layer at time t, also known as the memory unit of the recurrent neural network.
$h_t$: the output at time t; $H$ is the output sequence.
U: the weight parameter matrix of input sequence information X to hidden layer state S.
W: the weight parameter matrix between the hidden layer states S.
V: the weight parameter matrix of hidden layer state S to output sequence information H.
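The recurrence can be sketched in a few lines of plain Python. The sketch assumes the common formulation in which the hidden state is a tanh of the weighted input plus the weighted previous state, and the output is a linear read-out of the hidden state; the matrices U, W, and V follow the definitions above, while their values and the helper names `matvec` and `rnn_step` are illustrative.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rnn_step(x_t, s_prev, U, W, V):
    """One recurrent step: return the new hidden state s_t and the output h_t."""
    # The hidden state mixes the current input with the previous hidden state.
    pre = [u + w for u, w in zip(matvec(U, x_t), matvec(W, s_prev))]
    s_t = [math.tanh(p) for p in pre]
    # The output is read from the hidden state (identity output activation here).
    h_t = matvec(V, s_t)
    return s_t, h_t

# Hypothetical weights: 2 inputs, 2 hidden units, 1 output.
U = [[0.5, 0.0], [0.0, 0.5]]
W = [[0.1, 0.0], [0.0, 0.1]]
V = [[1.0, 1.0]]
s = [0.0, 0.0]
outputs = []
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    s, h = rnn_step(x, s, U, W, V)   # s carries memory into the next step
    outputs.append(h[0])
```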

4.2. Recurrent Neural Network LSTM

If the dependency interval between sequences is long, gradient vanishing occurs in an ordinary RNN, which makes it difficult to retain information from earlier times. The LSTM network remembers long-term historical information through its network structure, in which the output of the network at time t is fed back to itself at time t + 1, to avoid gradient vanishing. The network unrolls along the time axis. The schematic diagram and the detailed diagram of the three-layer gate are shown in Figures 3 and 4.

It can be seen from Figure 3 that the LSTM defines the key concept of the cell state, drawn as the horizontal line. Little information interaction takes place along the cell state, so long-term information is memorized as it is passed from cell to cell. Figure 4 shows the three gate layers. The first one is the forget gate. This gate takes the input of the current moment and the output of the previous moment and passes them through a Sigmoid layer to obtain the result. It determines how much of the cell state from the previous moment is retained at the current moment. The expression of the forget gate is shown in the following equation:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{3}$$

where $f_t$ represents the output of the forget gate and $\sigma$ represents the Sigmoid function. $W_f$ and $b_f$ represent the weight matrix and the bias term, respectively. $[h_{t-1}, x_t]$ represents the connection of two vectors into a longer vector.

The second one is the input gate, which determines how much of the network input is saved to the cell state at the current moment. The expressions of the input gate are shown in the following equations:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{4}$$

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{5}$$

where $i_t$ is the output of the input gate and $\tilde{C}_t$ is the candidate cell state. The new cell state $C_t$ is calculated by multiplying the last cell state $C_{t-1}$ element-wise by the forget gate $f_t$, then multiplying the candidate cell state $\tilde{C}_t$ element-wise by the input gate $i_t$, and finally adding the two products.

The cell information can be updated based on the results of the forget gate and the input gate. It is listed as follows:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{6}$$

where $\odot$ denotes element-wise multiplication.

The last one is the output gate, which controls how much of the cell state is output to the current output value of the LSTM. As shown in Figure 4, the output gate is composed of two parts: one is the cell state processed by tanh, and the other is the input information processed by Sigmoid. The functions of the output gate are listed in the following equations:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{7}$$

$$h_t = o_t \odot \tanh(C_t) \tag{8}$$

where $o_t$ represents the output of the output gate. $W_o$ and $b_o$ represent the weights and offsets, respectively.
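The three gates can be traced through a single time step with the sketch below. It assumes the standard LSTM formulation (Sigmoid forget, input, and output gates, a tanh candidate state, and element-wise products), and it uses all-zero weights purely so that every gate equals 0.5 and the arithmetic can be checked by hand. The parameter names in `params` and the helper functions are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows) and vectors b, v."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: return the new output h_t and the new cell state c_t."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(v) for v in affine(params["W_f"], params["b_f"], z)]      # forget gate
    i_t = [sigmoid(v) for v in affine(params["W_i"], params["b_i"], z)]      # input gate
    c_hat = [math.tanh(v) for v in affine(params["W_c"], params["b_c"], z)]  # candidate state
    o_t = [sigmoid(v) for v in affine(params["W_o"], params["b_o"], z)]      # output gate
    # Keep part of the old cell state and add part of the candidate state.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

hidden, n_in = 2, 3
zero_matrix = [[0.0] * (hidden + n_in) for _ in range(hidden)]
params = {name: zero_matrix for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], params)
```

With zero weights the forget gate halves the previous cell state and the candidate contributes nothing, so `c_t` is half of `c_prev`; trained weights would instead decide per element how much history to keep.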

5. Deep Recurrent Q-Learning Network

5.1. State Space

If the amount of acquired data is not large, Q-Learning can store and process the data efficiently. If the data volume is large, Q-Learning cannot traverse all states, and memory cannot hold such a large Q-value table. Therefore, this paper uses the LSTM model to generalize the states and uses the recurrent neural network to learn the state function. Through continuous deep reinforcement learning, the model obtains features that describe the current state while accumulating the experience pool. The state space is constructed in two steps: state discretization and value evaluation.

Step 1: the neural network outputs the training value under each state, which measures the pros and cons of developing from this state. The characteristics of the current state are speed $v$, delay time d, travel time m, temperature t, and precipitation probability p. If the characteristics of the next state are speed $v'$, delay time d′, travel time m′, temperature t′, and precipitation probability p′, then the selection behavior for reward accumulation is [speed $v$, delay time d, travel time m, temperature t, precipitation probability p] minus [speed $v'$, delay time d′, travel time m′, temperature t′, precipitation probability p′]. Each position of the result is set to 1 if it is positive and to 0 if it is negative to discretize the behavior (the range is $[0, 2^5 - 1]$, and the selected behavior discretization vector [0, 1, 0, …] is transformed into an integer that indexes the dimension of the output vector). Based on these rewards, the traffic information with better benefits is accumulated to form an experience pool with high benefit values, which makes the prediction results more accurate.

Step 2: because the traffic states before and after training influence each other, it is necessary to determine whether an action can obtain good feedback before it is executed. The action is performed according to the strategy p, and the cumulative return is calculated after the strategy is executed. The state value function is expressed as follows:

$$V^{p}(s) = \sum_{s'} p(s, s') \left[ R(s, s') + \gamma V^{p}(s') \right] \tag{9}$$

where $V^{p}(s)$ represents the degree of return according to the strategy p under state s, p (s, s′) represents the probability of state transition, R (s, s′) represents the reward obtained from $s \to s'$, and $\gamma$ is a function coefficient.
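The discretization in Step 1 can be sketched as follows. The sketch makes two assumptions that the text does not settle: a zero difference is mapped to 0, and the binary vector is read as a binary number with the first feature as the highest bit. The function names and the sample values are hypothetical.

```python
def discretize_behavior(current, nxt):
    """Map the element-wise difference current - next to a 0/1 vector."""
    # Assumption: a zero difference counts as 0, like a negative one.
    return [1 if c - n > 0 else 0 for c, n in zip(current, nxt)]

def behavior_to_int(bits):
    """Read the 0/1 vector as a binary number (first feature = highest bit)."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value

# Feature order: speed, delay time, travel time, temperature, precipitation probability.
current = [42.0, 35.0, 210.0, 24.0, 0.10]
nxt = [45.0, 30.0, 200.0, 24.0, 0.05]
bits = discretize_behavior(current, nxt)
action = behavior_to_int(bits)
```

With five features there are 2^5 = 32 possible binary vectors, so the resulting integer always lies between 0 and 31.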

5.2. Reward Actions

The reward after training in the previous state s is represented by the difference in delay time. The neural network function takes s as input. Q (a, n_features) is the storage table, where n_features represents the number of input neurons. Therefore, the output vector dimension is $2^{n\_features}$. The structure of each record in the memory storage pool after the reward is [n_features, a, r, n_features].
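A memory pool with the record layout [n_features, a, r, n_features] can be sketched as a fixed-size list of flat rows. The class name `MemoryPool`, the overwrite-oldest policy, and the sample values are assumptions for illustration; the paper specifies only the record structure.

```python
import random

class MemoryPool:
    """Fixed-size experience pool storing rows [state, a, r, next_state]."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = []
        self.counter = 0

    def store(self, state, action, reward, next_state):
        row = list(state) + [action, reward] + list(next_state)
        if len(self.rows) < self.capacity:
            self.rows.append(row)
        else:
            # Overwrite the oldest entry once the pool is full.
            self.rows[self.counter % self.capacity] = row
        self.counter += 1

    def sample(self, batch_size, rng=random):
        """Draw a random mini-batch without replacement."""
        return rng.sample(self.rows, min(batch_size, len(self.rows)))

pool = MemoryPool(capacity=3)
for step in range(5):
    pool.store([step] * 5, action=step, reward=-float(step), next_state=[step + 1] * 5)
batch = pool.sample(2, rng=random.Random(0))
```

With five state features each row has 5 + 1 + 1 + 5 = 12 entries, and after five stores into a pool of capacity 3 only the three most recent experiences remain.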

When predicting the future situation, the current state is input, the Q-values of all candidate behaviors are output, and the largest one is selected. As rewards accumulate, the target results gradually approach the actual situation; the Q-value here refers to the traffic delay index.

5.3. Training Method

Based on the construction of the abovementioned state space and reward actions, we train on the datasets from Wanjiali Road to Shuangtang Road on the elevated Wanjiali Road in Changsha. The main steps of the training method are listed as follows:
Step 1: preprocess the traffic data and weather data (culling abnormal data, Lagrange interpolation, and normalization).
Step 2: select the training sets and test sets (the time interval of the training sets is from 0:00 on May 17, 2019, to 24:00 on May 24, 2019; the time interval of the test sets is from 0:00 on May 25, 2019, to 12:00 on May 25, 2019).
Step 3: determine the input and output variables and the number of network layers (the input variables are speed, delay time, travel time, temperature, and precipitation probability; the output variable is the delay index; the number of hidden neurons lies in the interval [4, 13]; and the network has 3 layers).
Step 4: determine the initial weights, thresholds, learning rate, activation function, and training function (the initial weights and thresholds lie in the interval [0, 1], the learning rate is 0.01, the activation function is the Sigmoid function, and the training function is Adam).
Step 5: train the neural network model and stop the training when the feedback reaches the optimal state of the Q-value table. If this condition is not satisfied, modify and adjust the parameter values (learning rate and training function).
Step 6: adjust the parameters to achieve the best prediction results, which are obtained by feeding the test set data to the model.
Step 7: analyze the prediction results to get the final experimental results.

In this paper, the activation functions of the LSTM forget, input, and output gates are all Sigmoid functions. Their output interval [0, 1] is consistent with human thinking. The pseudocode to build a deep recurrent Q-Learning network is shown in Algorithm 1.

(1) Initialize the network structure with parameters q. Initialize the target network with parameters q′ = q.
(2) Initialize the greedy parameter epsilon, learning rate, reward, attenuation coefficient gamma, number of iterations episodes, number of iteration rounds T per episode, training batch size, and neural network parameter rotation cycle transfer_cycle.
(3) for episode in episodes do
(4)  Initialize the traffic state s
(5)  for t from 0 to T do
(6)   Select the behavior a (output an integer in the range 0 to $2^{n\_features} - 1$): select the behavior with the largest Q-value under network q with a probability of 1 − epsilon, and randomly select a behavior with a probability of epsilon.
(7)   After the behavior is determined, find all states in the data table that match this behavior, and then randomly select one of them as s′ (if no match is found in the data table, the behavior is redetermined).
(8)   Put the experience (s, a, r, s′) into the memory pool.
(9)   Randomly take out batch size records and calculate q_eval and q_next, respectively.
(10)   Construct q_target = r + gamma · max q_next.
(11)   According to q_eval and q_target, back-propagate to improve the network q.
(12)   If the number of iterations is an integer multiple of transfer_cycle, update q′ = q.
(13)   Current state s = s′.
(14)   When the maximum iteration number T of a single round is reached, stop the training of this round and return the traffic state to the initial state.
(15)  end for
(16) end for
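The control flow of Algorithm 1 can be sketched in runnable form as below. This is a deliberately simplified illustration, not the paper's implementation: a linear Q-function stands in for the LSTM network, one gradient step is taken per sampled record, and when no recorded transition matches the chosen behavior the sketch falls back to any transition instead of redetermining the behavior. The function names, the transition format `(state, behavior, reward, next_state)`, and all parameter values are assumptions.

```python
import random

def q_values(theta, state):
    """Linear Q-function: one row of weights per behavior."""
    return [sum(w * x for w, x in zip(row, state)) for row in theta]

def train(transitions, n_actions, episodes=20, T=10, epsilon=0.1, lr=0.01,
          gamma=0.9, batch_size=4, transfer_cycle=5, seed=0):
    rng = random.Random(seed)
    n_features = len(transitions[0][0])
    theta = [[0.0] * n_features for _ in range(n_actions)]  # evaluation network q
    theta_target = [row[:] for row in theta]                # target network q'
    memory, step = [], 0
    for _ in range(episodes):
        state = transitions[0][0]                           # initial traffic state
        for _ in range(T):
            # Epsilon-greedy behavior selection.
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                qs = q_values(theta, state)
                action = max(range(n_actions), key=qs.__getitem__)
            # Pick a recorded transition matching the behavior (fallback: any).
            matches = [tr for tr in transitions if tr[1] == action] or transitions
            _, _, reward, next_state = rng.choice(matches)
            memory.append((state, action, reward, next_state))
            # Replay a random mini-batch against the frozen target network.
            for s, a, r, s2 in rng.sample(memory, min(batch_size, len(memory))):
                q_target = r + gamma * max(q_values(theta_target, s2))
                error = q_target - q_values(theta, s)[a]
                theta[a] = [w + lr * error * x for w, x in zip(theta[a], s)]
            step += 1
            if step % transfer_cycle == 0:
                theta_target = [row[:] for row in theta]    # q' = q
            state = next_state
    return theta

# Hypothetical two-state, two-behavior data table.
transitions = [
    ([1.0, 0.0], 0, 1.0, [0.0, 1.0]),
    ([0.0, 1.0], 1, -1.0, [1.0, 0.0]),
]
theta = train(transitions, n_actions=2)
```

Keeping a separate target network that is only refreshed every `transfer_cycle` steps keeps the regression target fixed between refreshes, which is the purpose of step (12).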

6. Case Analysis

6.1. Data Description

This paper selected a part of the arterial road in Changsha, starting from Wanjiali Road to Shuangtang Road from north to south, as the research case. A crawler script written in Python 3.7 was used to capture the real-time traffic information from the big data platform of Gaode Map. The data were collected from 0:00 on May 17, 2019, to 12:00 on May 25, 2019, with a 5-min sampling interval. The collected data types include actual time, speed, delay time, travel time, temperature, probability of precipitation, and delay index. The data set sample is shown in Table 1.

The data of this case are divided into training sets and test sets after preprocessing. The time interval of the training sets is from 0:00 on May 17, 2019, to 24:00 on May 24, 2019, and the time interval of the test sets is from 0:00 to 12:00 on May 25, 2019.
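As a rough check on the size of the two sets, the sketch below generates the 5-min sampling grid for the collection window (0:00 on May 17, 2019, to 12:00 on May 25, 2019) and splits it chronologically at 0:00 on May 25. It assumes an ideal grid with no missing samples and assigns the boundary sample at 0:00 on May 25 to the test set; both are assumptions, since the collected data contain gaps that are filled during preprocessing.

```python
from datetime import datetime, timedelta

start = datetime(2019, 5, 17, 0, 0)
split = datetime(2019, 5, 25, 0, 0)   # 24:00 on May 24 equals 0:00 on May 25
end = datetime(2019, 5, 25, 12, 0)
step = timedelta(minutes=5)

# Build the ideal 5-min sampling grid over the collection window.
timestamps = []
t = start
while t <= end:
    timestamps.append(t)
    t += step

# Chronological split: everything before the boundary is training data.
train_times = [ts for ts in timestamps if ts < split]
test_times = [ts for ts in timestamps if ts >= split]
```

Under these assumptions the training window holds 8 days of 288 samples each, and the test window holds 12 hours of 12 samples plus the closing sample at 12:00.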

6.2. Data Preprocessing

Data preprocessing includes three steps: culling abnormal data, Lagrange interpolation, and normalization. The detailed information is shown in Figure 5.

The first step is to cull abnormal data. The abnormal data mentioned in this step refers to the data that deviates significantly from the normal interval. By deleting such kind of data, the experimental data are more realistic and the analysis results are more reasonable. Some samples of abnormal data are shown in Table 2.
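The culling step can be illustrated with a simple interval filter. The paper does not state the intervals it treats as normal, so the bounds, field names, and sample rows below are purely hypothetical placeholders.

```python
def cull_abnormal(rows, bounds):
    """Keep rows whose every bounded field lies inside its [low, high] interval."""
    return [row for row in rows
            if all(low <= row[key] <= high for key, (low, high) in bounds.items())]

# Assumed plausibility intervals; the real thresholds are not given in the text.
bounds = {"speed": (0.0, 120.0), "delay_index": (0.0, 10.0)}
rows = [
    {"speed": 45.2, "delay_index": 1.8},
    {"speed": -3.0, "delay_index": 1.5},   # negative speed: abnormal
    {"speed": 50.0, "delay_index": 55.0},  # delay index far outside the interval
]
clean = cull_abnormal(rows, bounds)
```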

The second step is the Lagrange interpolation. Lagrange interpolation is used to fill in some missing data based on the neighboring traffic datasets to improve the value of the data. This step is used to achieve data integrity and rationality.

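The data filling function for this step is listed as follows:

$$L(x) = \sum_{i=1}^{n} y_i \, l_i(x), \qquad l_i(x) = \prod_{j \ne i} \frac{x - x_j}{x_i - x_j} \tag{10}$$

where $l_i(x)$ is the Lagrange basis polynomial, which has degree n − 1 for n known points, and $y_i$ is the parameter corresponding to the point i.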
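A missing reading can be filled with the textbook Lagrange formula as sketched below. The sketch assumes the classical form, in which each known value is weighted by its basis polynomial; the function name and the sample speed series are hypothetical.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial: equals 1 at xi and 0 at every other known point.
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total

# Hypothetical speed series with a missing reading at time index 2.
xs = [0, 1, 3, 4]
ys = [40.0, 42.0, 46.0, 48.0]
filled = lagrange_interpolate(xs, ys, 2)
```

In practice only a few neighboring points are usually used per gap, because high-degree polynomials through many points can oscillate strongly.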

The third step is data normalization. The purpose of this step is to keep the magnitude of the data within a small fluctuation range, reduce the influence of magnitude differences between data types, and improve prediction accuracy. The function is listed as follows:

$$x^{*} = \frac{x - \min}{\max - \min} \tag{11}$$

where max is the maximum value of the sample data and min is the minimum value of the sample data.
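Min-max normalization of one data column can be sketched as follows. The guard for a constant column is an added assumption, since the formula is undefined when the maximum equals the minimum.

```python
def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

normalized = min_max_normalize([30.0, 45.0, 60.0])
```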

The preprocessed data are transformed into a list to form a matrix and, finally, into a three-dimensional array. The three-dimensional array serves as the input of the LSTM unit to form the basic unit of the hidden layer. Every 15 rows of valid data are used as a training set and are trained repeatedly for 100 iterations. The test sets are predicted based on the training memory to obtain the prediction results.
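The reshaping into a three-dimensional input can be sketched as below, assuming non-overlapping windows of 15 rows and dropping any incomplete trailing window. Whether the windows overlap is not stated in the text, so this choice and the function name are assumptions.

```python
def make_windows(rows, window=15):
    """Split a time x features list into non-overlapping windows of `window` rows."""
    return [rows[i:i + window] for i in range(0, len(rows) - window + 1, window)]

# Hypothetical data: 47 time steps with 5 features each.
rows = [[float(t)] * 5 for t in range(47)]
batches = make_windows(rows)
shape = (len(batches), len(batches[0]), len(batches[0][0]))
```

The resulting layout (windows, time steps, features) is the order commonly expected by recurrent layers.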

6.3. Prediction Module Construction

According to the data analysis, visualization, and platform requirements, this paper uses NumPy, Pandas, and Matplotlib as analysis tools. TensorFlow, an open-source deep learning library, is used as the base library. According to the needs of the model, several modules are constructed, including the traffic environment module, deep reinforcement learning module, memory module, behaviour selection module, neural network module, training main program module, loss curve module, and visualization module.

The first step is to initialize the traffic data, network environment, and training parameters to build a neural network for prediction. The second step is to input training sets and test sets to the input layer. Multidimensional data introduction is performed in the hidden layer, and data prediction is performed based on the experience pool in the output layer. Meanwhile, the structural dimensions of the input and output are displayed at each stage. The detailed flowchart is shown in Figure 6.

6.4. Parameter Impact Analysis

For the traffic prediction, the most critical indicators are prediction efficiency and accuracy. Therefore, the parameter impact analysis, the optimization index analysis, and the accuracy analysis are performed in the following section. The parameter impact analysis and optimization index analysis are used for the evaluation of prediction efficiency, and the accuracy analysis is utilized for the evaluation of prediction accuracy.

In the neural network for traffic prediction, the key parameters affecting the efficiency of the experiment are studied, including learning rate, reward decay, greedy, memory size, replacement interval, and batch size. The batch size is a fixed parameter, and the remaining items are variable parameters. Group 6 is the initial parameter group, against which the other groups are compared. The parameter groups are therefore divided into weakened state parameter groups (groups 1–5) and strengthened state parameter groups (groups 7–11).

Each group weakens or strengthens only one parameter relative to group 6. To make the experimental results easier to distinguish, the parameter values are chosen with an obvious gradient; the specific values are shown in Table 3.

In order to reflect the differences in the experimental results of each group, this paper selected the indicators with obvious discrimination in the experimental results for analysis. They are the highest loss index, the lowest loss index, the maximum volatility, training time, prediction time, and total time. The detailed index distribution of each group is shown in Table 4.

The highest loss index and maximum volatility of group 4 have serious deviations from other groups and have exceeded the normal fluctuation range. It shows that the experimental efficiency analysis in group 4 has no research value. Therefore, group 4 is eliminated before performing a comparative analysis.

The qualitative analysis is performed first. Since the highest loss index, the lowest loss index, the largest volatility rate, and the total time are important parameters of experimental efficiency, visualization is performed, as shown in Figure 7.

From this, the following conclusions are reached:
(1) The indicators of group 6 are all at the highest level, so gradient weakening or gradient strengthening of the parameters can optimize the experimental results. However, extreme parameter combinations should be avoided, because the prediction results then have no practical value, as in group 4.
(2) With the gradient adjustment of parameters, all indicators fluctuate within a relatively small range without a sharp rise or a sharp decline. Therefore, the stability of the model proposed in this paper is confirmed.
(3) The weakened and strengthened states of memory size and replacement interval show only slight fluctuations compared to the initial state, indicating that these two parameters have little effect on the experimental efficiency.

The quantitative analysis is performed next. According to the optimization degree of each parameter, the effect of improving experiment efficiency is determined. The parameter optimization degree function is shown in the following equation:where represents the optimization degree of each group, represents the parameter value of the initial group, represents the parameter value of the variable group, and represents the optimization weight of the corresponding parameter.

From the perspective of forecasting efficiency, the maximum volatility and total time are representative, followed by the highest loss index and the lowest loss index. Therefore, the initial weight distribution of each parameter is shown in Table 5.

Based on equation (12) and weight distribution, this paper performed optimization calculations for the weakened group compared to the initial group and the strengthened group compared to the initial group.

The following conclusions are drawn based on the quantitative results in Table 6.
(1) The optimization effect of the groups adjusted by memory size and replacement interval is weaker than that of the other groups, which further confirms conclusion (3) of the qualitative analysis.
(2) The optimization effects of the parameters on the experimental efficiency differ considerably, which indicates that some parameters matter more than others.
(3) Both the weakened and the strengthened groups improve the experimental efficiency, indicating that group 6 is already at or near the worst parameter combination. From this, the lower limit of the parameter combination can be determined.

This section analyses the effect of five parameters on the experimental efficiency. The results show that memory size and replacement interval have a small effect on experimental efficiency, while learning rate, reward decay, and greedy have a significant effect. Therefore, these three parameters are analysed for efficiency in the next section.

6.5. Optimization Index Analysis

In this section, we performed optimization index analysis on learning rate, reward decay, and greedy. Based on the abovementioned analysis, the replacement interval is set to 300 and the memory size is set to 500.

This part used orthogonal experiments with three factors and three levels for evaluation. The three factors are learning rate, reward decay, and greedy, recorded as A, B, and C. The corresponding levels are $A_1$/$A_2$/$A_3$, $B_1$/$B_2$/$B_3$, and $C_1$/$C_2$/$C_3$, respectively. Fix A and B at the levels of $A_1$ and $B_1$, and match the three levels of C with $A_1B_1C_1$, $A_1B_1C_2$, and $A_1B_1C_3$. If $A_1B_1C_3$ is optimal, fix the $C_3$ level. Then, let $A_1$ and $C_3$ be fixed, and match the two remaining levels of B with $A_1B_2C_3$ and $A_1B_3C_3$, respectively. After the tests, if $A_1B_2C_3$ is optimal, fix the two levels $B_2$ and $C_3$, and try the two tests $A_2B_2C_3$ and $A_3B_2C_3$. If $A_3B_2C_3$ is optimal, it is the best level combination.
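The one-factor-at-a-time procedure can be sketched as a small search routine. The level values and the loss function below are hypothetical stand-ins for the real training runs; the routine caches evaluated combinations so that three factors with three levels each need 3 + 2 + 2 = 7 tests, as in the procedure above.

```python
def sequential_level_search(levels, evaluate, order=("C", "B", "A")):
    """Tune one factor at a time, fixing its best level before moving on."""
    best = {name: values[0] for name, values in levels.items()}
    cache = {}
    for name in order:
        def score(value):
            key = tuple(sorted(dict(best, **{name: value}).items()))
            if key not in cache:          # each combination is tested only once
                cache[key] = evaluate(dict(key))
            return cache[key]
        best[name] = min(levels[name], key=score)
    return best, len(cache)

# Hypothetical levels: A = learning rate, B = reward decay, C = greedy.
levels = {"A": [0.001, 0.01, 0.1], "B": [0.7, 0.8, 0.9], "C": [0.7, 0.8, 0.9]}

def toy_loss(combo):
    """Stand-in for a training run; lower is better."""
    return abs(combo["A"] - 0.01) + abs(combo["B"] - 0.8) + abs(combo["C"] - 0.9)

best, n_tests = sequential_level_search(levels, toy_loss)
```

This search finds the best combination only when the factors do not interact strongly, which is why the toy loss is chosen to be separable.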

The more stable the loss curve and the lower the optimal loss coefficient, the better the corresponding training. The stability of the loss curve is reflected by the amplitude of the curve fluctuation, which can be seen visually in the curve formed by the ratio of the loss difference to the time difference. The optimal loss coefficient is obtained directly from the experimental results.
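The two indicators can be computed from a recorded loss curve as sketched below. Measuring the fluctuation amplitude as the spread between the largest and smallest slope is an assumption made for illustration, and the time and loss values are hypothetical.

```python
def loss_slopes(times, losses):
    """Ratio of loss difference to time difference between consecutive points."""
    points = list(zip(times, losses))
    return [(l1 - l0) / (t1 - t0)
            for (t0, l0), (t1, l1) in zip(points, points[1:])]

# Hypothetical loss curve sampled at uneven times.
times = [0.0, 1.0, 2.0, 4.0]
losses = [0.50, 0.30, 0.35, 0.25]
slopes = loss_slopes(times, losses)
fluctuation = max(slopes) - min(slopes)   # assumed stability measure
optimal_loss = min(losses)                # lowest loss reached
```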

The parameter combinations and corresponding results for the first test are shown in Table 7 and Figure 8.

The first test adjusts the value of greedy. Weakening greedy significantly improves both the stability of the prediction and the optimal loss coefficient. Therefore, a reasonable adjustment of greedy helps to damp the fluctuations of the network and keeps the training within a reasonable range. Based on these experimental results, the optimal combination in the first test is group c.

The parameter combinations and corresponding results for the second test are shown in Table 8 and Figure 9.

The second test evaluates changes in the reward decay. The effect of the reward decay on the stability of the system is more significant than that of greedy, indicating that the system is more sensitive to the reward decay. Therefore, tuning this parameter together with greedy yields the best system stability. The optimal combination in the second test is group c.

Parameter combinations and corresponding results for the third test are shown in Table 9 and Figure 10.

The third test tunes the learning rate, which not only places higher demands on system stability but also improves the optimal loss index. Therefore, the learning rate is among the factors with the greatest impact on the system. Meanwhile, the optimal loss index does not keep decreasing as the learning rate increases, which indicates that the optimization has a threshold and that the two are not simply negatively correlated. The optimal combination for the third test is group f.

Therefore, the best combination obtained in all the tests is group f. The following conclusions are drawn based on the analysis of the experimental results:
(1) The learning rate, greedy, and reward decay all affect the stability of the system, among which the reward decay has the greatest impact. The learning rate is the only parameter that effectively improves the optimal loss index.
(2) Each of the three parameters has a valid interval. When the interval is exceeded, the prediction process fluctuates sharply and the experimental efficiency suffers.
(3) No extreme fluctuation occurs during the training and prediction process. Even when fluctuations occur, they remain within a reasonable range.

6.6. Accuracy Analysis

The accuracy analysis is divided into two stages: the comparison of the predicted delay index and the actual delay index and the accuracy analysis of the traditional methods and the method proposed in this paper.

The first stage: the comparison between the predicted delay index and the actual delay index. Taking group f as the standard group, first draw a comparison chart based on the prediction and actual delay index, and second, calculate the prediction accuracy of groups 1–11 and A–G. This enables a preliminary accuracy evaluation.

Figure 11 shows that the prediction of group f agrees closely with the actual delay index and that the prediction is better in the first half than in the second half. Tables 10 and 11 show that group f still has the highest prediction accuracy. Therefore, the following conclusions are obtained in the first stage:
(1) The neural network under group f is the best choice in terms of experimental efficiency and accuracy, which provides a strong guarantee for short-term traffic prediction.
(2) The prediction quality weakens as the prediction proceeds. If the neural network is to be used for long-term prediction, its design needs to be further optimized.
(3) There is no absolute correlation between experimental efficiency and accuracy, so the two must be analysed separately. For example, the experimental efficiency of group 4 deviates seriously, yet its accuracy remains within a reasonable range.

The second stage: the accuracy analysis of the traditional method and the method mentioned in this paper. In order to further verify the superiority of the proposed method, the accuracy of the proposed method is compared with the LSTM, KNN, SVM, exponential smoothing, and BP neural network. All prediction processes are based on the data used in this paper. Finally, the two representative indicators of prediction accuracy and MSE are used to measure the effectiveness of the forecast.
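As a reference for the two indicators, the sketch below computes the MSE and a prediction accuracy from paired actual and predicted delay indices. The MSE formula is standard; defining accuracy as one minus the mean absolute percentage error is an assumption made here, since the exact accuracy formula is not stated in this section. The function names are hypothetical.

```python
from typing import Sequence


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def prediction_accuracy(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Assumed definition: 1 minus the mean absolute percentage error."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero for a relative error")
    mape = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)
    return 1.0 - mape


# Illustrative values only, not the paper's data.
actual = [2.0, 4.0]
predicted = [1.0, 5.0]
print(mean_squared_error(actual, predicted), prediction_accuracy(actual, predicted))
```

Applying the same two functions to every method's output on the same test data keeps the comparison across methods consistent.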

In the LSTM, the data from 0:00 on May 17th to 24:00 on May 24th are used as the training sets, and the data from 0:00 to 12:00 on May 25th are used as the test sets. The prediction results are shown in Figure 12.

The prediction accuracy of KNN and SVM on the same data is shown in Table 12.

In the exponential smoothing forecast, the data from 0:00 to 22:00 on May 20th are used to predict the traffic state from 23:00 to 24:00. Double (quadratic) and triple (cubic) exponential smoothing are performed and, finally, compared with the actual delay index. The experimental results show that the fitted curve under double exponential smoothing is better than the predicted result, as shown in Figure 13.
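For the second-order case, a minimal sketch using Brown's double exponential smoothing is given below. The choice of Brown's formulation, the initialization with the first observation, the smoothing factor, and the sample series are assumptions for illustration; the paper does not specify these details.

```python
from typing import List, Sequence


def double_exponential_forecast(
    series: Sequence[float], alpha: float, horizon: int
) -> List[float]:
    """Brown's double exponential smoothing forecast for `horizon` steps ahead."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not series:
        raise ValueError("series must not be empty")

    # Assumption: both smoothed series start at the first observation.
    s1 = s2 = series[0]
    for x in series[1:]:
        s1 = alpha * x + (1 - alpha) * s1   # first smoothing of the data
        s2 = alpha * s1 + (1 - alpha) * s2  # second smoothing of s1

    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return [level + trend * m for m in range(1, horizon + 1)]


# Illustrative series only, not the May 20th delay index data.
print(double_exponential_forecast([1.0, 2.0, 3.0, 4.0], alpha=0.5, horizon=2))
```

Third-order smoothing adds one more smoothing pass and a quadratic term to the forecast; it is omitted here to keep the sketch short.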

In BP neural network prediction, speed, delay time, travel time, air temperature, and precipitation probability are used as the input matrix, and the delay index is used as the output matrix. Of the data, 90% are used as the training set, 5% as the validation set, and 5% as the prediction set, and 10 hidden layers are used to construct the BP neural network. The network is trained with the Levenberg–Marquardt algorithm until the best effect is achieved. Finally, the error distribution and prediction results are obtained, as shown in Figures 14 and 15.

Table 13 shows that the method proposed in this paper is more accurate than the LSTM alone and that it also predicts better than the other representative methods. This further confirms the superiority of the proposed method, which can meet the demand for high efficiency and precision in traffic prediction and is feasible for practical application.

7. Conclusions

This paper proposed a short-term traffic flow prediction model for urban roads based on the LSTM and Q-Learning to address the low temporal correlation of traffic data, the large storage volume, poor comprehensive analysis, and slow feedback of prediction results. The analysis results showed that the model has excellent stability and prediction accuracy. Therefore, this model is feasible to apply to actual traffic scenarios and can provide accurate information guidance to reduce traffic congestion and accident rates. Moreover, this model could provide substantial methodological support for the development of active safety.

At the same time, a limitation of this model is that the volume and dimensionality of the data used for training and prediction are not large enough. Sufficient data volume and dimensions would bring more mature training effects and prediction results. Therefore, the next research goal is to develop more multidimensional research directions based on deep mining of effective traffic data.

In the future, we will focus on exploring more efficient prediction methods based on the research results of this paper. Also, a series of traffic conditions such as future traffic flow, accident trends, and driving behavior trends will be predicted by introducing more relevant data.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grant 2019YFB1600200.