Mathematical Problems in Engineering

Volume 2015 (2015), Article ID 896943, 16 pages

http://dx.doi.org/10.1155/2015/896943

## Intelligent Ramp Control for Incident Response Using Dyna-Q Architecture

^{1}School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China

^{2}Institute for Transport Studies, University of Leeds, Leeds LS2 9JT, UK

Received 18 June 2015; Revised 22 September 2015; Accepted 28 September 2015

Academic Editor: Dongsuk Kum

Copyright © 2015 Chao Lu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Reinforcement learning (RL) has shown great potential for motorway ramp control, especially under congestion caused by incidents. However, existing applications are limited to single-agent tasks and, being based on Q-learning, have inherent drawbacks for dealing with coordinated ramp control problems. To solve these problems, a Dyna-Q based multiagent reinforcement learning (MARL) system named Dyna-MARL has been developed in this paper. Dyna-Q is an extension of Q-learning, which combines model-free and model-based methods to obtain benefits from both sides. The performance of Dyna-MARL is tested in a simulated motorway segment in the UK with real traffic data collected from AM peak hours. The test results, compared with Isolated RL and noncontrolled situations, show that Dyna-MARL can achieve superior performance in improving traffic operation with respect to increasing total throughput, reducing total travel time, and reducing CO_{2} emissions. Moreover, with a suitable coordination strategy, Dyna-MARL can maintain a highly equitable motorway system by balancing the travel times of road users from different on-ramps.

#### 1. Introduction

Traffic congestion occurs when the traffic demand for a road network approaches or exceeds its available road capacity. Even slight losses of the balance between demand and capacity on motorways can lead to long travel delays, high energy consumption, and severe environmental problems. Therefore, how to alleviate traffic congestion and maintain the demand-capacity balance has become one of the main concerns of the transport community. To this end, a number of traffic control devices, such as variable speed limit (VSL), variable message sign (VMS), and ramp control systems, have been developed under the umbrella of intelligent transportation systems (ITS). Among these advanced systems, ramp control (also known as ramp metering) has been widely used and proved to be an effective control method for different kinds of congestion on motorways [1].

Generally, traffic congestion can be classified into two categories: recurrent congestion and nonrecurrent congestion. Recurrent congestion is caused by the daily traffic operation with temporarily increased traffic demand in peak hours [2]. Considering the daily peak traffic on motorways, recurrent congestion is the main concern of many existing ramp control systems. For instance, fixed-time systems (also known as pretimed systems) use historical data collected from daily peak hours to generate control strategies offline and trigger these strategies at fixed times (e.g., morning or evening peak hours) of each day [1]. Local traffic-responsive systems such as demand-capacity method, ALINEA [3], and its variations [4] can respond to the real-time traffic and keep the outflow or road density of the motorway mainline close to some target value (e.g., road capacity or critical density). Usually, these target values should be defined in advance according to the so-called fundamental diagram which is derived from the daily traffic data. To deal with network-wide problems, traffic-responsive systems have been extended to coordinated ramp control systems, such as Flow [5], System Wide Adaptive Ramp Metering (SWARM) [6], and Zone algorithms [7]. Similar to local traffic-responsive systems, these coordinated systems also attempt to make the outflow of motorway mainline approach a predetermined target value which is usually the road capacity. Another group of systems focuses on formulating different control scenarios as optimisation problems and using optimal control techniques (e.g., model predictive control) to solve them. The purpose of these systems is to maximise or minimise an objective function, not to achieve some predefined target value. Examples of these systems can be found in [8–12], where macroscopic traffic flow models were combined with control systems to formulate optimal control problems.

Although the aforementioned systems have shown their effectiveness in different scenarios, recurrent congestion is still the main focus of these systems, and a component that can deal with nonrecurrent congestion is not included in them. Unlike recurrent congestion caused by increased traffic demand in peak hours, nonrecurrent congestion is mainly induced by incidents, and thus, it is usually referred to as incident-induced congestion [2, 13]. Traffic incidents are nonrecurrent events such as road accidents, vehicle breakdowns, and unexpected obstacles that may block one or more lanes of the motorway mainline. The temporary lane blockage interrupts the normal operation of traffic flow and leads to a rapid reduction of road capacity [14]. In this case, fixed-time and simple traffic-responsive systems, which depend on information collected from daily traffic operation or on a predefined target value, are not applicable. Therefore, more sophisticated systems that can respond to incidents are required. During the last few decades, a series of such ramp control systems have been designed, most of which are based on optimisation techniques. For example, an optimal control structure using a simple macroscopic traffic flow model was proposed in [15] to deal with incident-induced congestion. A more complex system with consideration of dynamic incident duration was developed in [16], which can be solved by the linear programming technique. In the research presented in [17, 18], both lane-changing and queuing behaviour during the incident were incorporated into a modelling structure and solved by a stochastic optimal control system. Although these systems are based on different technologies, they all need a model to predict traffic conditions and use these predictions to accomplish the control process.

Model-based methods usually have poor adaptability when a mismatch between simulation models and the real controlled environment emerges [19–21]. To overcome this limitation, another optimisation-based method, reinforcement learning (RL), was introduced to the ramp control area. This method is based on the Markov decision process (MDP) and dynamic programming (DP), and it can approximately solve the optimisation problem through continuous learning without any models. The first ramp control system using RL to solve incident-induced problems was developed in [19, 22]. The basic RL algorithm named Q-learning was adopted by this system to alleviate traffic congestion caused by incidents. After this work, several Q-learning systems considering both local (e.g., [23, 24]) and coordinated (e.g., [25, 26]) control problems were proposed. However, Q-learning can only learn from real interactions with the traffic operation and cannot make full use of historical data (or models). Because of this limitation, Q-learning usually has a low learning speed and needs a great number of trials to obtain the best control strategy in some complex scenarios, such as incident-induced congestion [27]. This problem is even worse in coordinated ramp control problems with exponentially increased state and action spaces, which leads to the so-called "curse of dimensionality" [28]. One solution to speed up the learning process and deal with incidents efficiently has been proposed in our previous work [27, 29]. This system used the Dyna-Q architecture to combine model-free Q-learning with a model-based method and can be used to accomplish single-agent tasks.

In this paper, the previous single-agent system is extended to a multiagent case that can deal with a network-wide problem with multiple ramp controllers. We refer to this system as Dyna-MARL; it adopts a multiagent RL (MARL) strategy based on the Dyna-Q architecture. The rest of this paper is organised as follows. Section 2 briefly introduces the basic knowledge of RL, including the single-agent and multiagent cases. The architecture of Dyna-MARL is described in Section 3. After that, Sections 4 and 5 give a detailed description of the models, elements, and related algorithm of Dyna-MARL. The simulation experiments and relevant results are discussed in Section 6. Section 7 finally gives some conclusions and introduces the future work.

#### 2. Reinforcement Learning

RL is a subclass of machine learning. In the following subsections, two kinds of RL problems, namely, single-agent and multiagent RL, will be briefly introduced.

##### 2.1. Single-Agent RL

The problem of single-agent RL is usually defined as an MDP that can be represented by a tuple $(S, A, P, R)$ [30]. $S$ is the state space used to describe the external environment. $A$ is the control action set containing the executable actions of the agent. $P$ is the state transition probability: for a state pair $(s, s')$, $P(s, a, s')$ represents the probability of reaching state $s'$ after executing action $a$ at state $s$. $R$ is the reward function, where $R(s, a)$ denotes the immediate reward after taking action $a$ at state $s$. Based on these definitions, a Q value is defined for each state-action pair $(s, a)$ as shown below:

$$Q^{\pi}(s, a) = E_{\pi}\left[\sum_{k=0}^{K} \gamma^{k} r_{t+k+1} \,\middle|\, s_{t} = s,\ a_{t} = a\right], \tag{1}$$

where $t$ is the time index and $K$ is the number of time steps. $s_{t}$ and $a_{t}$ are the environment state and executed control action at time step $t$, respectively. $\gamma \in [0, 1]$ is the discount factor, which indicates the importance of the following predicted rewards; in $\gamma^{k}$, $k$ is the power. $\pi$ is the policy corresponding to a sequence of actions. The optimal policy $\pi^{*}$ can be obtained by maximising the Q value.

The most widely used algorithm in the literature for estimating the maximum Q value is Q-learning [31]. By using the updating equation given below, Q-learning can maximise the Q value for each state-action pair:

$$Q_{k+1}(s, a) = Q_{k}(s, a) + \alpha\left[r + \gamma \max_{a'} Q_{k}(s', a') - Q_{k}(s, a)\right], \tag{2}$$

where $Q_{k}(s, a)$ and $Q_{k+1}(s, a)$ are the Q values for the state-action pair $(s, a)$ at the $k$th step and $(k+1)$th step, respectively, and $Q_{k}(s', a')$ is the Q value for the next state-action pair $(s', a')$ at the $k$th step. $\alpha$ is the learning rate. $\alpha$ and $\gamma$ can be regulated according to different problems.
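For concreteness, the tabular Q-learning update can be sketched in a few lines of Python. This is a generic illustration, not code from the paper; the toy two-state problem and the values $\alpha = 0.1$, $\gamma = 0.9$ are illustrative assumptions.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward the
    bootstrapped target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# Toy usage on a two-state, two-action problem (all Q values start at 0).
Q = defaultdict(float)
actions = [0, 1]
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=actions)
```

With all Q values initialised to zero, this first update moves Q(0, 1) to $\alpha \cdot r = 0.1$, and repeated visits converge toward the discounted return.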

##### 2.2. Multiagent Scenarios

In multiagent scenarios, an MDP for the single-agent case can be extended to a stochastic game (SG), or Markov game, in which a group of agents try to obtain some equilibrium solution through coordination or competition [28].

In the absence of competition, all agents involved in a game have a common goal of maximising the global Q value, which forms a coordinated MARL problem. In this case, the policy optimisation is determined by the actions executed by all agents.

For solving a coordinated MARL problem, the update equation (2) for Q-learning can be easily extended to represent the global Q value update [28]:

$$Q_{k+1}(s, a_{1}, \ldots, a_{n}) = Q_{k}(s, a_{1}, \ldots, a_{n}) + \alpha\left[r + \gamma \max_{a_{1}', \ldots, a_{n}'} Q_{k}(s', a_{1}', \ldots, a_{n}') - Q_{k}(s, a_{1}, \ldots, a_{n})\right]. \tag{3}$$

The only difference from (2) is that the Q values in (3) relate to the joint action $(a_{1}, \ldots, a_{n})$ executed by all $n$ agents rather than to a single action $a$.
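The joint-action extension can be sketched by indexing the Q table with a tuple of actions and maximising over the full joint-action space. Again, this is an illustrative sketch (toy states, $\alpha = 0.1$, $\gamma = 0.9$ assumed), and it makes the exponential growth of the joint space explicit.

```python
from collections import defaultdict
from itertools import product

def joint_q_update(Q, s, joint_a, r, s_next, agent_actions, alpha=0.1, gamma=0.9):
    """Global Q update for n cooperating agents: identical to single-agent
    Q-learning except that the action index is a joint action (a_1, ..., a_n)
    and the max runs over all joint actions."""
    joint_space = list(product(*agent_actions))  # size grows exponentially with n
    best_next = max(Q[(s_next, ja)] for ja in joint_space)
    key = (s, tuple(joint_a))
    Q[key] += alpha * (r + gamma * best_next - Q[key])
    return Q[key]

# Two agents with two actions each: a joint-action space of 4 entries per state.
Q = defaultdict(float)
agent_actions = [[0, 1], [0, 1]]
joint_q_update(Q, s=0, joint_a=(1, 0), r=1.0, s_next=1, agent_actions=agent_actions)
```

With ten agents of two actions each, `joint_space` already holds 1024 entries per state, which is the dimensionality problem discussed next.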

##### 2.3. Solutions for Coordinated MARL

It can be seen from (3) that as the number of agents grows, the number of joint-action combinations and the resultant computational complexity increase exponentially, which may make the problem unsolvable within a required time limit [28]. Therefore, a commonly used method is to decompose the global Q value into several local Q values, each of which can be maximised by a few relevant agents rather than by all agents [32]. Based on this distributed method, several strategies have been proposed. In [28], these strategies fall into three categories: coordination-based, coordination-free, and indirect coordination strategies.

Coordination-based strategies need the local Q values to be updated according to the actions executed by all relevant agents (named joint actions) at each time step [28]. The decision-making process of each agent is based on information received from all other related agents, which requires sufficient communication and complicates the problem. On the other hand, coordination-free (or independent) strategies, such as the distributed Q-learning algorithm, make each agent update its corresponding local Q values based on its own actions [33]. Therefore, each agent makes its decisions independently without increasing computational complexity. However, this computational efficiency comes at the expense of nonguaranteed convergence [32]. Indirect coordination strategies try to find a balance between the above two methods. By applying indirect strategies, each agent can maintain models of its cooperative partners and update local Q values without knowing all the information of other agents at each step [28]. Based on high-quality models, this method can reduce the problem complexity and guarantee convergence with limited coordination.

#### 3. Dyna-Q Based Indirect Coordination Strategy

Because of the benefits introduced in the above section, the indirect coordination strategy has been applied in [34] for solving urban traffic control problems. In that work, each agent maintains a model for estimating the action selection probability of its neighbours and uses this information to optimise its control strategy. In this paper, we extend this method to motorway systems by applying the Dyna-Q architecture.

Under the Dyna-Q architecture, a modified macroscopic traffic flow model named the asymmetric cell transmission model (ACTM) and the Q-learning algorithm are combined to deal with coordinated MARL problems. In this section, the application of Dyna-Q will be introduced.

##### 3.1. Dyna-Q Architecture

The Dyna-Q architecture is an extension of standard Q-learning that integrates planning, acting, and learning together [30]. Unlike Q-learning, which learns from real experience without a model, Dyna-Q also learns a model and uses this model to guide the agent [35]. After capturing the real experience, two loops run to learn optimal policies that can obtain the maximum Q value in the Dyna-Q architecture (see Figure 1).
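The two loops can be sketched as a single tabular Dyna-Q iteration: a direct Q-learning update from the real transition, followed by several planning updates replayed from a learned (here deterministic) model. This is a minimal textbook-style sketch, not the paper's ACTM-based implementation; the deterministic model, $\alpha = 0.1$, $\gamma = 0.9$, and five planning steps are illustrative assumptions.

```python
import random
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s_next, actions,
                alpha=0.1, gamma=0.9, n_planning=5, rng=None):
    """One Dyna-Q iteration: (1) direct RL update from real experience,
    (2) store the transition in a deterministic model, (3) replay
    n_planning simulated transitions from the model (the planning loop)."""
    rng = rng or random.Random(0)

    def td(s0, a0, r0, s1):
        best = max(Q[(s1, b)] for b in actions)
        Q[(s0, a0)] += alpha * (r0 + gamma * best - Q[(s0, a0)])

    td(s, a, r, s_next)          # learning loop: real experience
    model[(s, a)] = (r, s_next)  # model update
    for _ in range(n_planning):  # planning loop: simulated experience
        ps, pa = rng.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        td(ps, pa, pr, ps2)

Q, model = defaultdict(float), {}
dyna_q_step(Q, model, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

After one real update plus five planning replays of the same transition, Q(0, 1) has moved through six updates toward the reward, illustrating why Dyna-Q typically needs far fewer real interactions than plain Q-learning.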