Journal of Advanced Transportation

Volume 2018, Article ID 3631489, 9 pages

https://doi.org/10.1155/2018/3631489

## Evaluation and Application of Urban Traffic Signal Optimizing Control Strategy Based on Reinforcement Learning

^{1}Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Tongji University, 4800 Cao’an Road, Shanghai 201804, China

^{2}Intelligent Transportation System Research Center of Tongji University, 4801 Cao’an Road, Shanghai 201804, China

^{3}Hangzhou Hikvision Digital Technology Co., Ltd., No. 555 Qianmo Road, Binjiang District, Hangzhou 310052, China

Correspondence should be addressed to Xiaoguang Yang; yangxg@tongji.edu.cn

Received 6 August 2018; Revised 3 November 2018; Accepted 9 December 2018; Published 26 December 2018

Guest Editor: Hamzeh Khazaei

Copyright © 2018 Yizhe Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Reinforcement learning has a self-learning ability in complex multidimensional spaces because it does not need an accurate mathematical model and requires little prior knowledge of the environment. This paper takes single intersections, arterials, and regional road networks of multiple intersections as its research objects. Based on the three key parameters of cycle length, arterial coordination offset, and green split, a set of hierarchical control algorithms based on reinforcement learning is constructed to optimize and improve the current signal timing scheme. The traffic signal optimization strategy based on reinforcement learning is suitable for complex traffic environments (high flows and multiple intersections), and it performs better than current optimization methods under high-flow conditions at single intersections, on arterials, and in regional multi-intersection networks. In summary, the paper studies the problem of insufficient traffic signal control capability and applies a hierarchical control algorithm based on reinforcement learning to traffic signal control, so as to provide new ideas and methods for traffic signal control theory.

#### 1. Introduction

Traffic congestion has become a problem of worldwide concern. With the increasing number of vehicles, congestion deeply affects people’s daily lives and social and economic development. Traffic control is one of the most important technological means of regulating traffic flow, relieving congestion, and improving safety, and it also contributes to energy conservation and emission reduction. At present, traffic signal control faces prolonged congestion at peak times, and its ability to relieve that congestion is clearly limited. To ease traffic pressure, rational analysis and control are considered important tools. Their development has kept pace with advances in information technology, computer technology, and systems science.

According to a system’s ability to adapt to the environment and its level of intelligent decision-making, Gartner proposed in 1996 a classification of the development levels of urban traffic control systems [1]. The first-generation self-adaptive control system adopts multi-period timing control with a fine division of time periods, or completely isolated self-adaptive control, to realize simple regulation of traffic flow. The second-generation traffic signal control system dynamically adjusts the parameters of the signal timing scheme (cycle length, split, offset). Typical second-generation control systems include SCATS [2] and SCOOT [3]. The UK Transport Research Laboratory had a worldwide reputation for contributions to the field of traffic signal control, especially as the originator of the TRANSYT and SCOOT signal coordination methods [4]. The third-generation control system uses an idea similar to that of the second generation, dynamically adjusting the signal timing parameters in response to fluctuations of the time-varying traffic flow at the intersection. HK Lo and HF Chow investigated the relationship between finer resolutions and larger errors in adaptive traffic control systems through extensive simulation of scenarios in Hong Kong with a recently developed dynamic traffic control model, DISCO [5]. Aboudolas K and Papageorgiou M carried out a preliminary simulation-based investigation of the signal control problem for a large-scale urban road network using store-and-forward modeling, demonstrating the comparative efficiency and real-time feasibility of the developed signal control methods [6]. The fourth-generation traffic signal control system is an integrated traffic management and control system. Meneguzzer C presented two alternative deterministic, discrete-time DP models of the interaction between signal control and route choice and compared them with the conventional iterative optimization and assignment (IOA) method for network traffic signal setting [7]. The fifth-generation traffic signal control system is based on artificial intelligence and self-learning abilities.

#### 2. Literature Review

##### 2.1. Traffic Big Data Environment

Digital and information infrastructure for urban road traffic and the related systems have developed rapidly in the past ten years, and urban traffic control is moving from the “data poverty” era to the “data rich” era. Meanwhile, intelligent connected vehicles (ICVs) and autonomous vehicles will jointly shape the future traffic environment; they differ significantly from conventional manually driven vehicles in individual information acquisition, perception ability, reaction time, interactive behavior, etc. These changes place high-level demands on the next generation of traffic control [8]. Research on next-generation traffic signal control for regional transportation under the “data rich” environment is therefore on the agenda. Ma D et al. proposed lane-based saturation degree estimation for signalized intersections and maximum queue length estimation for traffic lane groups, which enriched the ways of obtaining traffic parameters and increased the precision of the estimation methods. For example, the results show that the new method of maximum queue length estimation has a higher precision than the existing method based on a similar concept, with maximum and average deviations of 39.36% and 12.25%, respectively, over twenty cycles [9, 10].

Under conditions of limited cross-section traffic flow data, many existing adaptive traffic control systems use traffic models to predict the evolution of network traffic flows and then use aggregate indicators to optimize and solve for the timing parameters. By contrast, real-time detection of spatiotemporal data on the traffic state of the urban road network can provide rich, high-quality basic data and a fine-grained assessment of control effects. Given the main defects of existing self-adaptive traffic control systems, a closed-loop feedback self-adaptive control system with better response to uncertainty and a higher level of intelligent decision-making is an inevitable outcome of development needs and application technologies [11]. Ma D proposed a calculation method for the occupancy per cycle under different traffic conditions, based on the relationship between the three basic traffic flow parameters: speed, traffic flow, and density [12]. The results show that the precision of this method was not significantly affected by the detector location or the bus ratio [13].

##### 2.2. Reinforcement Learning Traffic Control

Using real-time collection of states, rewards, and punishments, reinforcement learning signal control at a single intersection can, through interaction with the environment, find a traffic signal control strategy suited to the traffic flow characteristics. In recent years, more and more domestic scholars have studied the principles of reinforcement learning and discussed the application of reinforcement learning algorithms to traffic control. Reinforcement learning has developed rapidly in optimizing control [14, 15].

Scholars have done a great deal of research on reinforcement learning theory, algorithms, and applications and have obtained many notable results. Ma D proposed a new control method that brings significant positive effects to the bottleneck link itself and to the entire test area [16]. Yang W concluded that the critical issues in developing agent-based traffic control systems for integrated networks are interoperability, adaptability, and extendibility [17]. Zhang L concluded from extensive simulation of the designed Shanghai scenarios that most of the observed counts match the simulated traffic volumes quite well, which demonstrates the potential of MATSIM for large-scale dynamic transport simulation [18]. Aslani M developed adaptive traffic signal controllers based on continuous residual reinforcement learning (CRL-TSC) that were more stable; the best setup of the CRL-TSC reduced average travel time by 15% in comparison with an optimized fixed-time controller [19].

Reinforcement learning control offers real-time, online feedback control, which accords especially well with the idea of adaptive signal control at urban intersections. However, it remains an open question whether the traffic signal optimization strategy based on reinforcement learning is applicable to all traffic environments.

#### 3. Traffic Signal Control Strategy Based on Reinforcement Learning

Reinforcement learning is a typical data-driven control method. This paper proposes a method for improving the signal control scheme. The road network is first divided into subregions according to their traffic flow characteristics. Then, based on the three key parameters of cycle length, arterial coordination signal offset, and green split, a set of hierarchical control algorithms based on reinforcement learning is constructed to optimize and improve the current signal timing scheme.

##### 3.1. Control Subregions Division and Cycle Optimization

For regional coordination control, the first task is the division of the coordination subregions. In a signal-controlled road network, each intersection has a range of influence, and the intersections and road sections within this range are strongly affected by it. To quantify this impact and define the scope of influence, the literature defines a direct relevance that describes the relationship between adjacent intersections; the relevance is strong when the traffic flowing from the upstream node into the downstream node is close to or greater than the downstream node's entry capacity. The path correlation is found to be mainly affected by the traffic network topology and the OD distribution between the two intersections. The more OD paths pass through both nodes, the stronger the correlation between the nodes. The higher the flow rate on the OD paths passing through both nodes, the stronger the correlation. The more unique the OD paths that pass through both nodes, the stronger the correlation.

Optimization is carried out at the level of the regional road network. The control subregions are divided using characteristic parameters such as the average travel time and the vehicle OD volume between intersections, and the traffic coordination control subregions are thereby determined.

The signal cycle is the time required for the signal indications to complete one full sequence in the set phase order, that is, the sum of the durations of all control steps in one cycle. The signal cycle is a key control parameter that determines the effectiveness of traffic signal control. If the signal cycle is too short, it is difficult to ensure that vehicles in all directions can pass through the intersection smoothly, which results in frequent stops and a decline in the utilization rate of the intersection. If the signal cycle is too long, drivers wait too long, which greatly increases vehicle delay. In green wave control, the common cycle is taken as the maximum signal cycle among the key intersections of the arterial, and at the remaining intersections the cycle time is reallocated to each phase according to the traffic flow ratio.
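The common-cycle rule described above can be illustrated with a minimal sketch. The function names, the example cycle lengths, the flow ratios, and in particular the `lost_time` parameter are assumptions made for illustration; the paper does not state how lost time is treated.

```python
def common_cycle(cycles):
    """Common cycle of an arterial: the largest intersection cycle (seconds)."""
    return max(cycles)


def allocate_green(cycle, lost_time, flow_ratios):
    """Share the effective green among phases in proportion to flow ratios.

    lost_time is an assumed total lost time per cycle (seconds); it is not
    specified in the source text.
    """
    effective_green = cycle - lost_time
    if effective_green <= 0:
        raise ValueError("cycle must exceed the total lost time")
    total = sum(flow_ratios)
    return [effective_green * y / total for y in flow_ratios]


# Hypothetical cycles of three intersections on one arterial (seconds).
cycles = [90, 120, 100]
cycle = common_cycle(cycles)

# Hypothetical flow ratios of three phases at a non-key intersection.
greens = allocate_green(cycle, lost_time=12, flow_ratios=[0.30, 0.20, 0.10])
print(cycle, [round(g, 1) for g in greens])
```

With these illustrative numbers the phase with the largest flow ratio receives the largest share of the 108 s of effective green.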

According to different evaluation indexes, the optimal cycle is obtained by using a model-based algorithm. Commonly used indicators of traffic efficiency at intersections, at home and abroad, include traffic capacity, saturation, service level, travel time, number of stops, and queue length. Delay is mainly the travel time lost to traffic friction and traffic control. It is closely related to the cycle length, green split, and saturation. It is an important indicator for evaluating the traffic service level and operational efficiency of signalized intersections and includes queue delay, stopped delay, control delay, and lane approach delay.

##### 3.2. Offset Optimization Based on Bayesian Optimization Algorithm

The phase offset is also called the time offset or the green time offset. It includes the absolute phase offset and the relative phase offset. The absolute phase offset is the offset between the start or end of the green (or red) signal in the coordinated direction of the arterial at each intersection and the start or end of the green (or red) signal in the coordinated direction at a reference intersection (generally a key intersection). The relative phase offset is the time offset between the starts or ends of the green (or red) signal in the coordinated direction at adjacent intersections. The relative phase offset equals the difference between the absolute phase offsets of the two intersections and is determined by the actual vehicle speed.
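The relationship between absolute and relative offsets can be written as a short sketch. The wrap-around to the common cycle and the travel-time rule (link length divided by speed) are standard conventions assumed here, not formulas given in the paper, and all numbers are hypothetical.

```python
def relative_offset(abs_offset_upstream, abs_offset_downstream, cycle):
    """Relative offset between adjacent intersections, wrapped to [0, cycle)."""
    return (abs_offset_downstream - abs_offset_upstream) % cycle


def travel_time_offset(link_length_m, speed_mps, cycle):
    """Assumed ideal relative offset: link travel time, wrapped to the cycle."""
    return (link_length_m / speed_mps) % cycle


cycle = 120  # hypothetical common cycle (seconds)

# Absolute offsets of two adjacent intersections, measured from a key intersection.
rel = relative_offset(30, 75, cycle)
wrapped = relative_offset(100, 10, cycle)

# A 500 m link driven at 12.5 m/s (45 km/h) takes 40 s.
ideal = travel_time_offset(500, 12.5, cycle)
print(rel, wrapped, ideal)
```

The second call shows why the modulo is needed: a downstream offset that is numerically smaller than the upstream one still yields a non-negative relative offset within the cycle.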

According to the coordination effect between intersections, the road network is divided into several control subregions, and coordinated control is implemented within each subregion according to its traffic characteristics. The basic principles of control subregion division are as follows:

(1) The distance between adjacent intersections is less than 600 meters, and a control subregion contains no more than 10 intersections.

(2) The optimal cycle lengths of the intersections are in an integer-multiple relationship.

(3) The following links, which have inconsistent coordination effects, should not be included in a coordinated subregion:

(a) An excessively long link along which the traffic flow is highly dispersed.

(b) A link with traffic generation or attraction sources (such as large parking lots and shopping malls) or very frequent pedestrian activity along both sides, which seriously interferes with traffic flow.
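The first two division principles above lend themselves to a simple feasibility check, sketched below. The function names are hypothetical, and reading the integer-multiple principle as a pairwise condition on cycle lengths is an interpretation, not something the paper spells out. The qualitative exclusions (long dispersed links, strong roadside interference) are not modeled.

```python
def is_integer_multiple(cycle_a, cycle_b, tol=1e-6):
    """True if the longer cycle is an integer multiple of the shorter one."""
    big, small = max(cycle_a, cycle_b), min(cycle_a, cycle_b)
    ratio = big / small
    return abs(ratio - round(ratio)) < tol


def can_join_subregion(subregion_cycles, link_length_m, new_cycle,
                       max_size=10, max_link_m=600):
    """Check whether an adjacent intersection may be added to a subregion.

    subregion_cycles: optimal cycles (s) of intersections already in the subregion.
    link_length_m: distance to the nearest intersection of the subregion.
    """
    if len(subregion_cycles) >= max_size:   # at most 10 intersections
        return False
    if link_length_m >= max_link_m:         # spacing must be under 600 m
        return False
    return all(is_integer_multiple(new_cycle, c) for c in subregion_cycles)


print(can_join_subregion([120, 60], 450, 120))  # compatible cycles, short link
print(can_join_subregion([120], 650, 120))      # link too long
print(can_join_subregion([120], 300, 90))       # 120 / 90 is not an integer
```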

The Bayesian optimization algorithm belongs to the family of sequential model-based optimization (SMBO) algorithms. It determines the value of the next (optimal) sample set by analyzing historical observations of a loss function. Since around 2010, the Bayesian optimization algorithm has been widely used to optimize the hyperparameters of machine learning models. A hyperparameter is a model parameter that must be set manually. In the present application, a large number of timing parameters need to be optimized, including the green splits and phase offsets of multiple intersections, so the dimension of the solution space is relatively high and the optimization is quite difficult. The overall idea of the Bayesian optimization algorithm is as follows:

(1) Calculate the posterior expectation of the loss function $f$ using the observed sample set $D$.

(2) Generate a new sample set $X_{\text{new}}$ at which to sample the loss function $f$, chosen to maximize the expectation of $f$ over the value range of the independent variables.

(3) Repeat the above steps until the preset convergence condition is reached, and then end the optimization process.

The algorithm will be described in detail below and the process will be summarized.

To calculate the posterior expectation of the loss function $f$, the likelihood model of the sample set $D$ and the prior probability model of $f$ should be obtained in advance. In the Bayesian optimization process, we can assume that the samples follow a multivariate Gaussian distribution, which gives a Gaussian likelihood function $p(D \mid f)$.

For the prior distribution, we assume that the loss function $f$ can be described by a Gaussian process (GP). In essence, the Gaussian process generalizes the multivariate Gaussian distribution to a distribution over functions. Therefore, just as the Gaussian distribution is determined by its expectation and variance, the Gaussian process is completely determined by its expectation function $m(X)$ and its covariance function $k(X, X')$. Among probabilistic models, the Gaussian process is widely used because its description of the posterior distribution of the loss function is easy to analyze and calculate.

One of the most widely used acquisition functions is the expected improvement (EI) function. The EI function is defined as

$$\mathrm{EI}(X) = \mathbb{E}\left[\max\left(0,\; f(X) - f(X^{+})\right)\right],$$

where $X^{+}$ is the current optimal sample set. Maximizing this function gives the new sample set that can best improve the expectation of the loss function. Moreover, for $\sigma(X) > 0$ the expected improvement function can be calculated in closed form from the Gaussian process model, namely,

$$\mathrm{EI}(X) = \left(\mu(X) - f(X^{+})\right)\Phi(Z) + \sigma(X)\,\phi(Z), \qquad Z = \frac{\mu(X) - f(X^{+})}{\sigma(X)},$$

where $\Phi$ and $\phi$ are the cumulative distribution function and the probability density function of the standard Gaussian distribution, respectively. When the posterior expectation $\mu(X)$ is higher than the current optimal value of the loss function, $f(X^{+})$, EI takes a larger value. When the uncertainty $\sigma(X)$ at $X$ is high, EI also takes a larger value.
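As a numerical illustration of the two properties just stated, the sketch below evaluates EI assuming the standard closed form EI = (μ − f⁺)Φ(z) + σφ(z) with z = (μ − f⁺)/σ, where f⁺ is the current best value and larger values are better. The function names and the test values are illustrative assumptions.

```python
import math


def normal_pdf(z):
    """Probability density function of the standard Gaussian distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard Gaussian distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2), maximization."""
    if sigma <= 0.0:
        # Limit of the closed form as sigma tends to zero.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)


base = expected_improvement(mu=1.0, sigma=1.0, best=1.0)
higher_mean = expected_improvement(mu=2.0, sigma=1.0, best=1.0)
higher_sigma = expected_improvement(mu=1.0, sigma=2.0, best=1.0)
print(base, higher_mean, higher_sigma)
```

Raising either the posterior mean or the posterior uncertainty increases EI, which is how the acquisition function trades off exploiting promising timings against exploring untested ones.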

After the above analysis and introduction, the whole principle and process of Bayesian optimization can be summarized to form a Bayesian optimization algorithm:

(1) Given the observed values of the loss function, update the posterior expectation of the loss function based on the Gaussian process model.

(2) Maximize the expected improvement function (EI function) to find the new best sample set $X_{\text{new}}$.

(3) Calculate the value of the loss function at $X_{\text{new}}$.

(4) Repeat the above steps until the preset number of repetitions (i.e., the number of iterations) is reached or the convergence condition is met.

In step (2), a gradient-based method can be used to optimize the EI function and obtain $X_{\text{new}}$.

Once parameters such as the optimal cycle length have been determined, the phase offsets of the intersections, after deduplication, can be regarded as the input sample set of the loss function. The loss function values returned by online feedback can then be used to iterate the Bayesian optimization algorithm multiple times.
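The loop of fitting a Gaussian process, maximizing EI, evaluating, and repeating can be sketched for a single offset as follows. This is a toy under stated assumptions, not the authors' implementation: the objective `toy_score` stands in for the online feedback and is treated as a score to maximize, the squared-exponential kernel, its length scale, and the noise level are arbitrary choices, and EI is maximized by exhaustive search over whole-second offsets rather than by the gradient-based method mentioned in the text.

```python
import math


def rbf(a, b, length=15.0):
    """Squared-exponential covariance between two offsets (seconds)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-3):
    """Posterior mean and standard deviation of the GP at x_new."""
    mean_y = sum(ys) / len(ys)  # constant prior mean (an assumption)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, [y - mean_y for y in ys])
    v = solve(K, k_star)
    mu = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    var = max(1.0 - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mu, math.sqrt(var)


def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf


def optimize_offset(score, cycle, initial_offsets, n_iter=10):
    """Bayesian optimization of one offset over the integers 0..cycle-1."""
    xs = list(initial_offsets)
    ys = [score(x) for x in xs]
    for _ in range(n_iter):
        best = max(ys)
        candidates = [x for x in range(cycle) if x not in xs]
        x_next = max(candidates, key=lambda x: expected_improvement(
            *gp_posterior(xs, ys, x), best))
        xs.append(x_next)
        ys.append(score(x_next))
    i = max(range(len(ys)), key=ys.__getitem__)
    return xs[i], ys[i]


def toy_score(offset):
    """Hypothetical feedback: best at a 42 s offset, worse further away."""
    return -((offset - 42.0) / 60.0) ** 2


best_offset, best_score = optimize_offset(
    toy_score, cycle=120, initial_offsets=[0, 60, 110])
print(best_offset, best_score)
```

In a real deployment the score would come from measured or simulated travel times, and the search space would cover the offsets of all intersections in the subregion rather than a single value.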

##### 3.3. Split Optimization Based on Q-Learning Algorithm

In the urban transportation system, traffic flow, vehicle speed, and traffic density are the most intuitive reflections of traffic conditions. They are the three characteristic parameters of traffic flow and the research focus and foundation of traffic flow theory. Traffic flow is the number of vehicles passing a point per unit time, vehicle speed is the distance a vehicle travels per unit time, and traffic density is the number of vehicles per unit length of road section. Traffic flow theory is the basis for establishing an urban traffic signal control system.

The traffic model uses a discrete-time difference equation or a continuous-time differential equation to describe the dynamic relationship between traffic volume Q, vehicle speed V, and traffic density K. These quantities summarize the physical state of the traffic network and describe the collective average behavior of a large number of vehicles. In free flow, the interaction between vehicles can be neglected, and the traffic flow increases linearly with the vehicle density. Wide moving jam flow is usually characterized by stop-and-go traffic, that is, a series of jams; the density of vehicles in the jammed region is high, and the average speed and flow of vehicles are low. The average speed of synchronized flow is significantly lower than that of free flow.
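The contrast between free flow and jammed flow can be made concrete with the fundamental relation Q = K × V, a standard identity of traffic flow theory that the text above implies but does not write out. The densities and speeds below are illustrative values, not data from the paper.

```python
def flow_veh_per_h(density_veh_per_km, speed_km_per_h):
    """Fundamental relation of traffic flow: Q = K * V."""
    return density_veh_per_km * speed_km_per_h


# Illustrative free-flow state: low density, high speed.
free_flow = flow_veh_per_h(density_veh_per_km=20, speed_km_per_h=60)

# Illustrative jammed state: high density, very low speed.
jammed = flow_veh_per_h(density_veh_per_km=100, speed_km_per_h=5)

print(free_flow, jammed)
```

Although the jammed section holds five times as many vehicles per kilometer in this example, it carries less than half the flow, which is the behavior described for wide moving jams.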

At present, the Q-learning algorithm, proposed by Watkins in 1989 [20], is one of the most frequently used methods in reinforcement learning. Owing to the particular way in which it updates its value function, the Q-learning algorithm is widely used in the field of control.

In Q-learning, the standard update formula of the value function is as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right].$$

According to the formula, at time $t$ the state of Q-learning is $s_t$. If the action taken is $a_t$, the corresponding value function is $Q(s_t, a_t)$. The update of the value function is determined by three factors. The first is the current value of the state-action value function, $Q(s_t, a_t)$, that needs to be updated. The second is the maximum Q-value over all actions in the post-execution state $s_{t+1}$, and the third is the immediate return, $r_{t+1}$, received after the action. In addition, there are two model parameters, the learning rate $\alpha \in [0, 1]$ and the discount factor $\gamma \in (0, 1]$. The former is used to balance learning against the use of existing knowledge. When $\alpha \to 1$, the controller tends to explore new knowledge; otherwise it uses the existing knowledge. The latter is used to balance the present against the future. When $\gamma \to 1$, the controller tends to consider the future return, and when $\gamma \to 0$, the controller mainly considers the immediate return [21].
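A single update step can be sketched in tabular form, assuming the standard Watkins rule Q(s, a) ← Q(s, a) + α[r + γ·max Q(s′, ·) − Q(s, a)]. The state labels, the action set (changes in green time), the reward values, and the ε-greedy action selection are hypothetical choices for illustration; the paper does not specify them here.

```python
import random
from collections import defaultdict

# Hypothetical actions: change of the main-phase green time in seconds.
ACTIONS = (-5, 0, 5)


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy selection (an assumed exploration scheme)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


q = defaultdict(float)  # all Q-values start at zero

# Extending green in a dense state made things worse (negative reward).
first = q_update(q, "high_density", 5, reward=-2.0,
                 next_state="medium_density")

# Holding the split in a medium state was rewarded.
second = q_update(q, "medium_density", 0, reward=1.0,
                  next_state="high_density")

greedy = choose_action(q, "high_density", epsilon=0.0, rng=random.Random(0))
print(first, second, greedy)
```

After these two updates the greedy policy in the dense state avoids the action that was penalized, which is the mechanism by which the controller gradually learns a split adjustment rule from feedback.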

Whether in theoretical research or in engineering practice, road traffic density is an effective indicator of the degree of traffic congestion. Traffic operation on a road section is affected by the signal control of the upstream and downstream intersections. The release signal at the upstream intersection directly changes the density of the section, which in turn affects the traffic capacity and saturation at the stop-line section and the density of the queuing section. This mutual influence is especially noticeable in the oversaturated state. Since the penetration rate of connected vehicles on different sections is unknown, the actual flow on the road cannot be read directly from the number of individual connected vehicles. Even with sample expansion, accuracy is difficult to guarantee, although the connected-vehicle data can clearly reflect the speed of the overall traffic flow. Therefore, this paper uses traffic density as the core parameter to provide a basis for green split optimization.

##### 3.4. The Flow of the Control Algorithm

First, according to different evaluation indexes, the optimal cycle is obtained by using a model-based algorithm. Next, the arterial coordination control is set by combining the average travel time of vehicles with the Bayesian optimization method based on the Gaussian process, which is commonly used in the optimization of machine learning algorithms. The phase offset is optimized using the two-way flow ratio of the upstream and downstream roads and a reasonable setting of the pedestrian crossing phase. Different green wave bandwidths are then set to match the upstream and downstream traffic of the morning rush hour and the tidal phenomenon with uneven travel speeds. Finally, an intelligent algorithm such as Q-learning is used to optimize the green split of each intersection based on the key traffic flow parameters at that intersection.

In summary, the flow of the traffic signal control strategy based on reinforcement learning is shown in Figure 1.