Modeling and Control Problems in Sustainable Transportation and Power Systems
Intelligent Ramp Control for Incident Response Using Dyna Architecture
Abstract
Reinforcement learning (RL) has shown great potential for motorway ramp control, especially under congestion caused by incidents. However, existing applications are limited to single-agent tasks, and the Q-learning on which they are based has inherent drawbacks for dealing with coordinated ramp control problems. To address these problems, a Dyna-Q based multiagent reinforcement learning (MARL) system named Dyna-MARL has been developed in this paper. Dyna-Q is an extension of Q-learning, which combines model-free and model-based methods to obtain benefits from both sides. The performance of Dyna-MARL is tested on a simulated motorway segment in the UK with real traffic data collected during AM peak hours. Test results compared with Isolated RL and non-controlled situations show that Dyna-MARL can achieve superior performance in improving traffic operation with respect to increasing total throughput and reducing total travel time and CO_{2} emission. Moreover, with a suitable coordination strategy, Dyna-MARL can maintain a highly equitable motorway system by balancing the travel time of road users from different on-ramps.
1. Introduction
Traffic congestion occurs when the traffic demand for a road network approaches or exceeds its available road capacity. Even slight losses of the balance between demand and capacity on motorways can lead to long travel delays, high energy consumption, and severe environmental problems. Therefore, how to alleviate traffic congestion and maintain the demand-capacity balance has become one of the main concerns of the transport community. To this end, a number of traffic control devices, such as variable speed limits (VSL), variable message signs (VMS), and ramp control systems, have been developed under the umbrella of intelligent transportation systems (ITS). Among these advanced systems, ramp control (also known as ramp metering) has been widely used and proved to be an effective control method for different kinds of congestion on motorways [1].
Generally, traffic congestion can be classified into two categories: recurrent congestion and nonrecurrent congestion. Recurrent congestion is caused by daily traffic operation with temporarily increased traffic demand in peak hours [2]. Considering the daily peak traffic on motorways, recurrent congestion is the main concern of many existing ramp control systems. For instance, fixed-time systems (also known as pretimed systems) use historical data collected from daily peak hours to generate control strategies offline and trigger these strategies at fixed times (e.g., morning or evening peak hours) each day [1]. Local traffic-responsive systems, such as the demand-capacity method, ALINEA [3], and its variations [4], can respond to real-time traffic and keep the outflow or road density of the motorway mainline close to some target value (e.g., road capacity or critical density). Usually, these target values are defined in advance according to the so-called fundamental diagram, which is derived from daily traffic data. To deal with network-wide problems, traffic-responsive systems have been extended to coordinated ramp control systems, such as Flow [5], System Wide Adaptive Ramp Metering (SWARM) [6], and Zone algorithms [7]. Similar to local traffic-responsive systems, these coordinated systems also attempt to make the outflow of the motorway mainline approach a predetermined target value, which is usually the road capacity. Another group of systems focuses on formulating different control scenarios as optimisation problems and using optimal control techniques (e.g., model predictive control) to solve them. The purpose of these systems is to maximise or minimise an objective function, not to achieve some predefined target value. Examples of these systems can be found in [8–12], where macroscopic traffic flow models were combined with control systems to formulate optimal control problems.
Although the aforementioned systems have shown their effectiveness in different scenarios, recurrent congestion is still their main focus, and a component that can deal with nonrecurrent congestion is not included in these systems. Unlike recurrent congestion caused by increased traffic demand in peak hours, nonrecurrent congestion is mainly induced by incidents and is thus usually referred to as incident-induced congestion [2, 13]. Traffic incidents are nonrecurrent events, such as road accidents, vehicle breakdowns, and unexpected obstacles, that may block one or more lanes of the motorway mainline. The temporary lane blockage interrupts the normal operation of traffic flow and leads to a rapid reduction of road capacity [14]. In this case, fixed-time and simple traffic-responsive systems, which depend on information collected from daily traffic operation or on a predefined target value, are not applicable. Therefore, more sophisticated systems that can respond to incidents are required. Over the last few decades, a series of such ramp control systems has been designed, most of which are based on optimisation techniques. For example, an optimal control structure using a simple macroscopic traffic flow model was proposed in [15] to deal with incident-induced congestion. A more complex system considering dynamic incident duration was developed in [16], which can be solved by linear programming. In the research presented in [17, 18], both lane-changing and queuing behaviour during the incident were incorporated into a modelling structure and solved by a stochastic optimal control system. Although these systems are based on different technologies, they all need a model to predict traffic conditions and use these predictions to accomplish the control process.
Model-based methods usually have poor adaptability when a mismatch between the simulation model and the real controlled environment emerges [19–21]. To overcome this limitation, another optimisation-based method, reinforcement learning (RL), was introduced to the ramp control area. This method is based on the Markov decision process (MDP) and dynamic programming (DP) and can approximately solve the optimisation problem through continuous learning without any models. The first ramp control system using RL to solve incident-induced problems was developed in [19, 22]. The basic RL algorithm named Q-learning was adopted by this system to alleviate traffic congestion caused by incidents. After this work, several Q-learning systems considering both local (e.g., [23, 24]) and coordinated (e.g., [25, 26]) control problems were proposed. However, Q-learning can only learn from real interactions with the traffic operation and cannot make full use of historical data (or models). Because of this limitation, Q-learning usually has a low learning speed and needs a great number of trials to obtain the best control strategy in complex scenarios such as incident-induced congestion [27]. This problem is even worse in coordinated ramp control problems with exponentially increased state and action spaces, which leads to the so-called "curse of dimensionality" [28]. One solution to speed up the learning process and deal with incidents efficiently was proposed in our previous work [27, 29]. That system used the Dyna architecture to combine model-free Q-learning with a model-based method and can be used to accomplish single-agent tasks.
In this paper, the previous single-agent system is extended to a multiagent case that can deal with a network-wide problem with multiple ramp controllers. We refer to this system as Dyna-MARL; it adopts a multiagent RL (MARL) strategy based on the Dyna architecture. The rest of this paper is organised as follows. Section 2 briefly introduces the basics of RL, including the single-agent and multiagent cases. The architecture of Dyna-MARL is described in Section 3. After that, Sections 4 and 5 give a detailed description of the models, elements, and related algorithm of Dyna-MARL. The simulation experiments and relevant results are discussed in Section 6. Section 7 concludes the paper and introduces future work.
2. Reinforcement Learning
RL is a subclass of machine learning. In the following subsections, two kinds of RL problems, namely, single-agent and multiagent RL, will be briefly introduced.
2.1. Single-Agent RL
The problem of single-agent RL is usually defined as an MDP that can be represented by a tuple $(S, A, P, R)$ [30]. $S$ is the state space used to describe the external environment. $A$ is the control action set containing the executable actions of the agent. $P$ is the state transition probability: for a state pair $(s, s')$, $P(s' \mid s, a)$ represents the probability of reaching state $s'$ after executing action $a$ at state $s$. $R$ is the reward function, and $R(s, a)$ denotes the immediate reward after taking action $a$ at state $s$. Based on these definitions, a $Q$-value is defined for each state-action pair as

$$Q^{\pi}(s, a) = E_{\pi}\left[\sum_{k=0}^{K} \gamma^{k} r_{t+k+1} \,\Big|\, s_{t} = s,\ a_{t} = a\right], \quad (1)$$

where $t$ is the time index and $K$ is the number of time steps. $s_{t}$ and $a_{t}$ are the environment state and executed control action at time step $t$, respectively. $\gamma \in [0, 1]$ is the discount factor, which indicates the importance of the following predicted rewards, and $k$ is its power. $\pi$ is the policy corresponding to a sequence of actions. The optimal policy can be obtained by maximising the $Q$-value.
The most widely used algorithm in the literature for estimating the maximum $Q$-value is Q-learning [31]. By using the updating equation given below, Q-learning can maximise the $Q$-value for each state-action pair:

$$Q_{k+1}(s_{t}, a_{t}) = Q_{k}(s_{t}, a_{t}) + \alpha\left[r_{t+1} + \gamma \max_{a} Q_{k}(s_{t+1}, a) - Q_{k}(s_{t}, a_{t})\right], \quad (2)$$

where $Q_{k}(s_{t}, a_{t})$ and $Q_{k+1}(s_{t}, a_{t})$ are the $Q$-values for state-action pair $(s_{t}, a_{t})$ at the $k$th and $(k+1)$th steps, respectively, and $Q_{k}(s_{t+1}, a)$ is the $Q$-value for the state-action pair at the next time step. $\alpha \in (0, 1]$ is the learning rate. $\alpha$ and $\gamma$ can be regulated according to different problems.
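As a concrete illustration, the Q-learning update can be sketched in a few lines of Python. This is a minimal tabular sketch, not the paper's implementation; the integer states/actions and the defaults alpha = 0.2, gamma = 0.8 are illustrative assumptions:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.2, gamma=0.8):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_b Q(s', b)."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)            # unvisited state-action pairs start at 0
actions = [0, 1, 2]               # hypothetical action indices
new_q = q_update(Q, s=0, a=1, r=-1.0, s_next=1, actions=actions)
```

With all Q-values initialised to zero, the first update simply moves Q(0, 1) a fraction alpha of the way toward the observed reward.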
2.2. Multiagent Scenarios
In multiagent scenarios, an MDP for the single-agent case can be extended to a stochastic game (SG), or Markov game, in which a group of agents tries to obtain some equilibrium solution through coordination or competition [28].
In the absence of competition, all agents involved in a game share a common goal of maximising the global $Q$-value, which forms a coordinated MARL problem. In this case, the policy optimisation is determined by the actions executed by all agents.
For solving a coordinated MARL problem, the update equation (2) for Q-learning can be easily extended to represent the global $Q$-value update [28]:

$$Q_{k+1}(s_{t}, \vec{a}_{t}) = Q_{k}(s_{t}, \vec{a}_{t}) + \alpha\left[r_{t+1} + \gamma \max_{\vec{a}} Q_{k}(s_{t+1}, \vec{a}) - Q_{k}(s_{t}, \vec{a}_{t})\right]. \quad (3)$$
The only difference with (2) is that $\vec{a}_{t}$ and $\vec{a}$ in (3) relate to the joint actions executed by all agents rather than to a single action $a$.
2.3. Solutions for Coordinated MARL
It can be seen from (3) that, as the number of agents grows, the combinations of actions and the resultant computational complexity increase exponentially, which may make the problem unsolvable within a required time limit [28]. Therefore, a commonly used method is to decompose the global $Q$-value into several local $Q$-values, each of which can be maximised by a few relevant agents rather than by all agents [32]. Based on this distributed method, several strategies have been proposed. In [28], these strategies fall into three categories: coordination-based, coordination-free, and indirect coordination strategies.
Coordination-based strategies need local $Q$-values to be updated according to the actions executed by all relevant agents (named joint actions) at each time step [28]. The decision-making process of each agent is based on the information received from all other related agents, requiring sufficient communication, which complicates the problem. On the other hand, coordination-free (or independent) strategies, such as the distributed Q-learning algorithm, make each agent update the corresponding local $Q$-values based on its own actions [33]. Therefore, each agent makes its decisions independently without increasing computational complexity. However, this computational efficiency comes at the expense of non-guaranteed convergence [32]. Indirect coordination strategies try to find a balance between the above two methods. By applying indirect strategies, each agent can maintain models of its cooperative partners and update local $Q$-values without knowing all the information of other agents at each step [28]. Based on high-quality models, this method can reduce the problem complexity and guarantee convergence with limited coordination.
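To make the indirect-coordination idea concrete, here is a hedged sketch (hypothetical class and method names, not the system defined later in the paper) of the kind of empirical model an agent might keep of a partner's action choices, counting visits per state-action pair and reading off selection probabilities:

```python
from collections import defaultdict

class NeighbourModel:
    """Empirical model of how often a partner agent picks each action per state."""

    def __init__(self):
        self.counts = defaultdict(int)        # (state, action) -> visit count
        self.state_counts = defaultdict(int)  # state -> total observations

    def observe(self, state, action):
        self.counts[(state, action)] += 1
        self.state_counts[state] += 1

    def prob(self, state, action):
        """Estimated probability the partner selects `action` in `state`."""
        if self.state_counts[state] == 0:
            return 0.0
        return self.counts[(state, action)] / self.state_counts[state]

m = NeighbourModel()
for a in [0, 0, 1]:   # partner observed choosing action 0 twice, action 1 once
    m.observe("s0", a)
```

Each agent can then weight its local Q-values by these probabilities instead of waiting for the partner's actual joint action at every step.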
3. Dyna-Q Based Indirect Coordination Strategy
Because of the benefits introduced in the above section, the indirect coordination strategy has been applied in [34] for solving urban traffic control problems. In that work, each agent maintains a model for estimating the action selection probability of its neighbours and uses this information to optimise control strategies. In this paper, we extend this method to motorway systems by applying the Dyna architecture.
Under the Dyna architecture, a modified macroscopic flow model named the asymmetric cell transmission model (ACTM) and the Q-learning algorithm are combined to deal with coordinated MARL problems. In this section, the application of Dyna will be introduced.
3.1. Dyna Architecture
The Dyna architecture is an extension of standard Q-learning that integrates planning, acting, and learning [30]. Unlike Q-learning, which learns from real experience without a model, Dyna learns a model and uses this model to guide the agent [35]. After capturing real experience, two loops run to learn optimal policies that can obtain the maximum $Q$-value in the Dyna architecture (see Figure 1).
In loop I, direct RL is the standard Q-learning process used to interact with the real external environment. Loop II contains two main tasks: (1) model learning, which improves model accuracy by obtaining new knowledge from real experience, and (2) planning, which is the same process as direct RL except that it uses the experience generated by a model. Acting is the action execution process.
By applying a model, the agent can predict the reactions of its external environment and other agents before executing a specific action, which provides an opportunity for the agent to update $Q$-values before receiving real feedback. Simultaneously, direct RL runs to update the $Q$-value through real interaction. Therefore, the optimal policy is learned through both real experience and predictions. By using this strategy, Dyna can learn faster than Q-learning in many situations [30].
Although a model is maintained in the Dyna architecture, the whole system is different from model-based control methods such as model predictive control (MPC). The model in the Dyna architecture is a complementary component, used to speed up the learning process and simplify the coordination of agents; the optimal control actions are learnt from both real and simulated experience. Without models, the Dyna architecture is equivalent to the Q-learning technique and can still work as a model-free system. MPC, on the other hand, is dependent on its model and cannot work without it. Therefore, Dyna can be considered a combination of model-free and model-based methods [27].
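The interplay of direct RL, model learning, and planning can be sketched as a classic tabular Dyna-Q step. This minimal version uses a deterministic one-step memory as its model; the system described in this paper replaces such a memory with the ACTM and a neighbour model, so treat this purely as an architectural sketch with illustrative defaults:

```python
import random
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s_next, actions,
                n_planning=10, alpha=0.2, gamma=0.8):
    """One Dyna step: direct RL update, model learning, then planning updates."""
    def update(s, a, r, s_next):
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

    update(s, a, r, s_next)           # loop I: direct RL on real experience
    model[(s, a)] = (r, s_next)       # model learning: remember the outcome
    for _ in range(n_planning):       # loop II: planning on simulated experience
        ps, pa = random.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        update(ps, pa, pr, ps_next)
```

Each real interaction therefore triggers one direct update plus n_planning simulated updates, which is where the speed-up over plain Q-learning comes from.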
3.2. System Architecture
Each agent in the motorway control system is designed on the basis of the Dyna architecture and controls one prespecified motorway section.
A simplified motorway segment is shown in Figure 2 for analysis. This segment contains three motorway sections with detectors located at their boundaries. Each motorway section is divided into a number of cells according to its layout and geometric features. Generally, three kinds of cells exist on the motorway: on-ramp cells linked with on-ramps, off-ramp cells linked with off-ramps, and normal cells. In this paper, we define that each motorway section can have at most one on-ramp cell.
The typical Dyna architecture presented in Figure 2 is detailed for each agent here. Take one agent as an example: its experience consists of the traffic arrival and departure rates observed from the detectors of its motorway section, as well as the information received from its neighbouring agent, and this experience is applied to improve the models. In the model component, two models are maintained: an asymmetric cell transmission model (ACTM) with estimated traffic arrival and departure rates, used to simulate the traffic flow dynamics in the relevant motorway sections, and a probability model of the action selection of the neighbouring agent at the current state, updated for the further planning process.
To reduce the complexity of MARL, as in many real applications, some conventions are used to restrict the action selection of an agent [28]. Specifically, in our design, each agent only communicates with its spatial neighbours: it receives the control action and traffic information from one neighbour and sends its own information to the other. For the case shown in Figure 2, we assume that one motorway section is the critical section where an incident occurs. In this situation, the agent controlling the critical section plays a more important role than the other agents in dealing with the incident. This agent can be considered the chief controller that makes decisions according to its own knowledge about the traffic and incident situations, while the other agents should regulate their control policies based on its reaction.
Therefore, two $Q$-values are defined for the two kinds of agents. If a motorway section is the critical section, the $Q$-value of its agent is only related to its own state and action space and can be updated by the same equation denoted by (2).
If motorway section $i$ is a normal section without incidents, the $Q$-value of agent $i$ can be calculated by

$$Q_{k+1}^{i}(s_{t}, a_{t}^{i}, a_{t}^{j}) = Q_{k}^{i}(s_{t}, a_{t}^{i}, a_{t}^{j}) + \alpha\left[r_{t+1} + \gamma \max_{a^{i}} \sum_{a^{j} \in A^{j}} \frac{C(s_{t+1}, a^{j})}{\sum_{b \in A^{j}} C(s_{t+1}, b)}\, Q_{k}^{i}(s_{t+1}, a^{i}, a^{j}) - Q_{k}^{i}(s_{t}, a_{t}^{i}, a_{t}^{j})\right], \quad (4)$$

where $r_{t+1}$ is the immediate reward obtained by agent $i$ at time step $t+1$ when actions $a_{t}^{i}$ and $a_{t}^{j}$ are executed by agent $i$ and its neighbour $j$. Similarly, $Q_{k}^{i}$ and $Q_{k+1}^{i}$ are the $Q$-values for agent $i$ at the $k$th and $(k+1)$th steps, respectively. $A^{j}$ is the action set of agent $j$. $C(s, a^{j})$ returns the number of visits for state-action pair $(s, a^{j})$; thus, $C(s, a^{j}) / \sum_{b \in A^{j}} C(s, b)$ is the probability of agent $j$ selecting action $a^{j}$ at state $s$. The models and related symbols shown in Figure 2 will be specified in the following section.
4. Modified Asymmetric Cell Transmission Model
A first-order macroscopic traffic flow model named the asymmetric cell transmission model (ACTM) is applied as one of the models in the Dyna architecture. This model is derived from the widely used cell transmission model (CTM) [36] and has been used for ramp control problems [11, 37]. In this paper, we modify the ACTM to incorporate the traffic dynamics under incident conditions.
4.1. Traffic Dynamics during the Incident
As shown in Figure 3(a), when an incident happens in the critical section, one or more lanes of the motorway will be blocked according to the incident extent. Because of the lane blockage, an incident may reduce the normal road capacity and spatial storage space, which produces a new relationship between traffic flow and road density, that is, the fundamental diagram presented in Figure 3(b). As suggested by [38], additional parameters can be used to regulate the fundamental diagram for incident situations. We introduce three parameters $(\theta_{v}, \theta_{w}, \theta_{f})$ to reflect this new dynamics, defined as $\theta_{v} = \hat{v}_{i}/v_{i}$, $\theta_{w} = \hat{w}_{i}/w_{i}$, and $\theta_{f} = \hat{f}_{i}/f_{i}$. $v_{i}$ and $w_{i}$ are the free-flow speed and congestion wave speed of cell $i$, and $f_{i}$ is the maximum departure flow of cell $i$; $\hat{v}_{i}$, $\hat{w}_{i}$, and $\hat{f}_{i}$ are these three variables during the incident. $\rho_{i}^{c}$ and $\hat{\rho}_{i}^{c}$ are the critical densities for normal and incident situations, and $\rho_{i}^{J}$ and $\hat{\rho}_{i}^{J}$ are the jam densities for normal and incident situations.
4.2. Modified ACTM
Given the three incident-related parameters, the traffic dynamics in each cell can be derived from the fundamental diagram illustrated in Figure 3(b) and represented by the following equations.
Departure rates of the mainline and on-ramp:
Conservation of the mainline and on-ramp:

$$n_{i}(k+1) = n_{i}(k) + T\left[q_{i}^{a}(k) + u_{i}^{d}(k) - q_{i}^{d}(k) - e_{i}^{d}(k)\right],$$

$$l_{i}(k+1) = l_{i}(k) + T\left[u_{i}^{a}(k) - u_{i}^{d}(k)\right],$$

where $q_{i}^{a}(k)$ and $q_{i}^{d}(k)$ are the mainline arrival and departure rates for cell $i$ at step $k$. $u_{i}^{a}(k)$ and $u_{i}^{d}(k)$ are the on-ramp arrival and departure rates in cell $i$ at step $k$. $e_{i}^{d}(k)$ is the off-ramp departure rate for cell $i$ at step $k$ (if cell $i$ is not an off-ramp cell, $e_{i}^{d}(k) = 0$). $n_{i}(k)$ represents the number of vehicles on the mainline of cell $i$ at step $k$, and $n_{i}^{\max}$ is the maximum of this value limited by the mainline space of cell $i$. Similarly, $l_{i}(k)$ and $l_{i}^{\max}$ denote the current (at step $k$) and maximum numbers of vehicles in the on-ramp of cell $i$, respectively. $T$ (min) is the time duration between two successive time steps. $r_{i}(k)$ is the metering rate for the on-ramp cell of the $i$th motorway section at step $k$. $\gamma_{i}$ is the flow allocation parameter of cell $i$, and $\xi_{i}$ is the flow blending parameter of the traffic flow from the on-ramp to the mainline of cell $i$. The unit of all arrival and departure rates is veh/min in this study.
For motorway section $i$ with $c_{i}$ cells, the number of vehicles on the mainline can be calculated by $N_{i}(k) = \sum_{j=1}^{c_{i}} n_{j}(k)$, while the number of vehicles in the on-ramp of motorway section $i$ is given by the queue of its single on-ramp cell, $L_{i}(k)$. In the same way, the maximum numbers of vehicles in the mainline and on-ramp of motorway section $i$ are $N_{i}^{\max} = \sum_{j=1}^{c_{i}} n_{j}^{\max}$ and $L_{i}^{\max}$.
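The conservation logic of the cell model can be sketched as follows; the function and parameter names are hypothetical, and the full modified ACTM additionally involves the allocation, blending, and incident parameters described above:

```python
def cell_conservation(n, arrival_rate, departure_rate, dt=0.5, n_max=100.0):
    """Vehicle conservation for one cell and one step:
    n(k+1) = n(k) + dt * (inflow - outflow), clipped to the storage [0, n_max]."""
    n_next = n + dt * (arrival_rate - departure_rate)
    return max(0.0, min(n_next, n_max))
```

For example, a cell holding 10 vehicles with 20 veh/min arriving and 10 veh/min departing over a half-minute step ends with 15 vehicles.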
4.3. Estimation of Arrival and Departure Rates
Arrival rates of the boundary cells in each motorway section and of all the on-ramps, as well as the departure rates of the off-ramps, are inputs of the ACTM for each planning step between two real control steps. Considering the short duration of the planning process (10 steps), we assume these rates remain stable during planning and estimate them directly from the recent flow data collected from the detectors. The method described by Wang [16] is used here to do the estimation, which simply averages the most recently observed data to obtain the predicted flow rates. In our model, we use the flow data collected from the last $n$ time steps ($n = 5$). Therefore, these three rates can be calculated by

$$\hat{q}_{i}^{a} = \frac{1}{n}\sum_{j=1}^{n} q_{i}^{a}(k-j), \qquad \hat{u}_{i}^{a} = \frac{1}{n}\sum_{j=1}^{n} u_{i}^{a}(k-j), \qquad \hat{e}_{i}^{d} = \frac{1}{n}\sum_{j=1}^{n} e_{i}^{d}(k-j),$$

where $\hat{q}_{i}^{a}$ and $\hat{u}_{i}^{a}$ are the estimated arrival rates of the mainline and on-ramp of cell $i$ for the planning steps between real steps $k$ and $k+1$, and $\hat{e}_{i}^{d}$ is the estimated off-ramp departure rate of cell $i$. If cell $i$ is a boundary cell of its motorway section, the arrival or departure rate of this cell is also the arrival or departure rate of the motorway section.
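The moving-average estimator is straightforward to express in code (illustrative function name; flows in veh/min):

```python
def estimate_rate(recent_flows, n=5):
    """Estimate the next planning-step flow as the mean of the last n observations."""
    window = recent_flows[-n:]      # slicing tolerates fewer than n observations
    return sum(window) / len(window)
```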
5. Definition of RL Elements
In addition to the architecture and models defined in Section 3, three basic elements, namely, the environment state, control action, and reward function, should be specified to form an RL problem. This section details these three elements and the relevant algorithm.
5.1. Environment State
Environment states of a motorway section are composed of mainline states and on-ramp states. The same method mentioned in [27, 29] is used here to obtain the state space. Generally, for the mainline of motorway section $i$, the number of vehicles ranges from 0 to the maximum number $N_{i}^{\max}$, which is uniformly divided into $m$ intervals. Each interval represents a state of the mainline, so each mainline section can be represented by a state set $S_{i}^{M}$ with $m$ states. Similarly, on-ramp traffic is represented by a state set $S_{i}^{O}$ with $h$ states according to the maximum number of vehicles $L_{i}^{\max}$. $m$ and $h$ should be adjusted for different motorway sections according to the section length. In this way, if motorway section $i$ is the critical section, the external traffic environment is represented by

$$S_{i} = S_{i}^{M} \times S_{i}^{O},$$

which contains $m \times h$ states. At each time step, a state will be selected from $S_{i}$ as the environment state. If motorway section $i$ is a normal section, the state sets of its neighbour agent $j$ should be incorporated. Thus, the traffic state is represented by

$$S_{i} = S_{i}^{M} \times S_{i}^{O} \times S_{j}^{M} \times S_{j}^{O},$$

which contains $m \times h \times m_{j} \times h_{j}$ states.
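The uniform discretisation of a vehicle count into one of a fixed number of state intervals can be sketched as follows (names are illustrative):

```python
def discretise(n_veh, n_max, n_intervals):
    """Map a vehicle count in [0, n_max] onto one of n_intervals uniform states."""
    idx = int(n_veh / n_max * n_intervals)
    return min(idx, n_intervals - 1)   # a count of exactly n_max lands in the last state
```

A composite state for a normal section is then simply a tuple of such indices for its own mainline and on-ramp plus those of its neighbour.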
5.2. Control Action
In a ramp control problem, the aim of the control action is to regulate the number of vehicles entering the mainline in each control step. Similar to [29], we adopt flow control as the control action, which can be represented by an action set with 9 flow rates between the minimum (4 veh/min) and maximum (20 veh/min) values.
Exploitation and exploration are two basic behaviours of an RL agent. Exploitation means the agent takes the control action that has obtained the most reward in previous experience, while exploration means the agent tries new actions with smaller rewards. To balance these two behaviours, we use the $\varepsilon$-greedy policy to select control actions [30]. Specifically, this policy takes a random action with probability $\varepsilon$ and chooses the greedy action (with the maximum $Q$-value) with probability $1-\varepsilon$ at each control step. The action selection probability can be formally expressed as

$$P(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A|}, & \text{if } a = \arg\max_{a'} Q(s, a'), \\[4pt] \dfrac{\varepsilon}{|A|}, & \text{otherwise.} \end{cases}$$
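An ε-greedy selector over a dictionary-based Q-table might look like this sketch (the (state, action) keying scheme is an assumption):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy (max-Q) one."""
    if random.random() < epsilon:
        return random.choice(actions)          # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploitation
```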
5.3. Reward Function
The reward function is used to calculate the immediate reward after executing a specific action at each time step, which guides the agent towards its objective. Considering a common objective of traffic control systems (i.e., minimising total travel time), we define our reward to guide the agent to minimise the total time spent (TTS) through the learning process.
TTS is defined as the total time spent by vehicles in the network during a period of time. For our case, TTS can be obtained from the following equation:

$$\mathrm{TTS} = T \sum_{k=1}^{K} N(k),$$

where $N(k)$ is the total number of vehicles on the network at step $k$.
In the above equation, $T$ is a fixed value; therefore, minimising TTS is equivalent to minimising the number of vehicles on the network, $N(k)$. To minimise this value, the reward function defined here is composed of two negative rewards used to indicate penalties for vehicles on the mainline and on-ramp. The formal reward function at step $k$ is defined according to two situations.
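With vehicle counts sampled at a fixed step length, the TTS objective reduces to a scaled sum, as in this illustrative helper (the half-minute step length is an assumption):

```python
def total_time_spent(vehicle_counts, dt=0.5):
    """TTS in vehicle-minutes: step length times the vehicle count at each step."""
    return dt * sum(vehicle_counts)
```

Because dt is fixed, ranking policies by TTS is the same as ranking them by the summed vehicle counts, which is why the reward only needs to penalise vehicle numbers.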
(1) Motorway Section $i$ Is the Critical Section. Consider

$$r_{i}(k) = -\frac{N_{i}(k)}{N_{i}^{\max}} - \frac{L_{i}(k)}{L_{i}^{\max}}, \quad (12)$$

where $r_{i}(k)$ is the immediate reward for agent $i$ when executing its action at control step $k$. $N_{i}^{\max}$ and $L_{i}^{\max}$ are used to normalise the numbers of vehicles on the mainline and on-ramp, which guarantees that $r_{i}(k) \in [-2, 0]$.
(2) Motorway Section $i$ Is Not the Critical Section. Here, a new negative reward is introduced to maintain system equity, that is, to make sure that the on-ramp queues and related travel times at different on-ramps stay close to each other:

$$r_{i}(k) = -\frac{N_{i}(k)}{N_{i}^{\max}} - \frac{L_{i}(k)}{L_{i}^{\max}} - \frac{\left|L_{i}(k) - L_{j}(k)\right|}{\max\left(L_{i}(k), L_{j}(k)\right)}. \quad (13)$$
Compared to (12), a new term is added in (13), which is a penalty for the on-ramp queue difference between motorway sections $i$ and $j$. As two adjacent agents cooperate in this situation, the reward is related to the two control actions executed by agents $i$ and $j$. $\max(\cdot, \cdot)$ returns the maximum of its two arguments and is used for normalisation.
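The two reward variants can be sketched as follows; the normalisation and the exact form of the equity penalty are assumptions based on the description above, and the max(..., 1) guard against empty queues is added only for the sketch:

```python
def reward_critical(n_main, n_main_max, n_ramp, n_ramp_max):
    """Penalise normalised mainline and on-ramp occupancy; value lies in [-2, 0]."""
    return -(n_main / n_main_max) - (n_ramp / n_ramp_max)

def reward_normal(n_main, n_main_max, n_ramp, n_ramp_max,
                  queue_own, queue_neighbour):
    """Critical-section reward plus a penalty on the on-ramp queue difference."""
    equity = abs(queue_own - queue_neighbour) / max(queue_own, queue_neighbour, 1)
    return reward_critical(n_main, n_main_max, n_ramp, n_ramp_max) - equity
```

Equal neighbouring queues make the equity term vanish, so a non-critical agent is only penalised for its own occupancy in that case.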
5.4. Description of the Algorithm
Based on the Dyna architecture and the RL elements defined in the previous subsections, the Dyna-MARL algorithm is developed and described in this subsection. The two main loops corresponding to direct RL and planning shown in Figure 1 are detailed in Dyna-MARL: between two real control steps in loop I, 10 planning steps are run in loop II. The pseudocode of Dyna-MARL is given in Algorithm 1.

An episode in Dyna-MARL represents a control cycle which starts at incident occurrence and terminates when the traffic state returns to its initial state, that is, the traffic state before the incident occurred. The incident duration is assumed to be known in advance.
6. Case Study and Results
One of the metered motorway segments (southbound direction) of the M6 in the UK is chosen for the case study. This segment lies between junction 21A (J21A) and junction 25 (J25) and has an approximate length of 12.4 km (see Figure 4). Taking the non-controlled (NC) situation as the baseline, we designed a series of experiments to compare the proposed Dyna-MARL algorithm with Isolated RL (Q-learning without coordination). The experiments and relevant results are described as follows.
6.1. Partitions of the Test Segment
The test motorway segment, with a three-lane mainline, three metered on-ramps, and five off-ramps, is simulated by AIMSUN [39], a microscopic traffic simulation package. According to the detector locations and road layout, the whole segment is divided into three sections, each containing a metered on-ramp. Motorway section 3 is divided into 4 cells, and motorway sections 1 and 2 are both divided into 3 cells. The partitions of each section can be seen in Figure 5. According to the section length, a maximum number of vehicles is set for each mainline section and each on-ramp.
6.2. Real Data Source
Real detector data collected from 17 loop detectors located along the motorway segment (covering both the mainline and on/off-ramps) are used for the case study; these data can be extracted from the Highways Agency Traffic Information System (HATRIS) [40]. These traffic count data are averaged over April 2012 to March 2013 at 15-minute intervals. Only working-day data (Monday to Friday) are used due to the dramatic reduction of traffic load at weekends. Some of the detector data collected from the mainline and three on-ramps are presented in Figure 6, from which we can see that two peak periods exist during daily traffic operation: an AM peak (around 07:00:00–09:00:00) and a PM peak (around 16:00:00–18:00:00).
At the test site, ramp metering only operates at peak hours. Meanwhile, it is valuable to test the performance of the proposed algorithm under high demand: if it works under high traffic load, it should also be useful in common situations. Therefore, the AM peak period with heavy traffic load is considered for the case study. Specifically, we use the averaged traffic data during the AM peak period collected from TRADS to estimate the O/D (origins and destinations) matrix for the simulation. A model proposed in [41] is adopted by AIMSUN to do the estimation, where the number of iterations is set to 1000 to reach convergence. Table 1 shows the O/D matrix estimated from the real traffic data.

6.3. Incident Scenarios
Considering the difficulty of capturing real incident data, we simulate several incident scenarios in AIMSUN. To make each ramp meter work during the incident, the incident is located near the most downstream motorway section, that is, motorway section 1. Three incident scenarios, A, B, and C, are designed, corresponding to the three incident locations a, b, and c (as illustrated in Figure 5), respectively.
The simulation experiment lasts one and a half hours, from 07:00:00 to 08:30:00, during the AM peak period. After 30 minutes of normal operation (for warm-up), the incident is triggered at 07:30:00 and lasts for 30 minutes. In the preliminary experiments designed in this paper, an incident with one blocked lane is considered; the parameters introduced here can also be regulated for multiple-lane-blockage situations. The incident extent is 50 metres, which is assumed to be constant during the incident.
Learning-related parameters are set to typical values [30]: the learning rate $\alpha$ is 0.2, the discount factor $\gamma$ is 0.8, and the exploration rate $\varepsilon$ is 0.1. Other parameters related to the ACTM are calibrated and summarised in Table 2. All the cells share the same flow allocation and blending parameters.

6.4. Results
The comparison of Dyna-MARL, Isolated RL, and NC is conducted from three aspects: density evolution, some general indicators, and system equity. The experimental results are described as follows.
(1) Density Evolution. We can see from Figure 7 that four dense areas exist during the traffic operation. Three of them, near the on-ramp entrances (at around 0.5 km, 5 km, and 10 km along the motorway), are caused by heavy traffic loads from the on-ramps. The dense area close to the segment end forms due to the incident.
In scenario A, the incident location is close to on-ramp 1. Without control, this incident leads to severe congestion which blocks on-ramp 1 and propagates to motorway section 2 (around 9 km in Figure 7(a)). Under this scenario, Isolated RL cannot alleviate the incident-induced congestion effectively (see Figure 7(b)): at the beginning of congestion formation, without coordination, only the nearest ramp controller reacts to the congestion, and because of the limited space of the on-ramp, one ramp controller is insufficient to dissolve the congestion, which still propagates to motorway section 2. Dyna-MARL, on the other hand, coordinates all three ramp controllers and makes full use of the storage space of the three on-ramps to deal with the incident-induced congestion. In this way, mainline congestion is restricted to a smaller area and does not propagate to motorway section 2 (see Figure 7(c)).
For scenarios B and C, the incidents are near the motorway end and far from on-ramp 1. Without blocking on-ramp 1, these incidents do not lead to severe congestion. Under such circumstances, both Isolated RL and Dyna-MARL work well in easing congestion on the mainline. As shown in Figures 7(e)–7(i), compared with the NC situation, both Isolated RL and Dyna-MARL can restrict the congestion to a small range near the on-ramp entrances.
(2) General Indicators. In this comparison, some general indicators, including total travel time (to be reduced), total throughput (to be improved), and total CO_{2} emission (to be reduced), are used to show how the proposed system can benefit road users. These indicators are widely used in the transport community to test the performance of newly developed traffic control systems.
As shown in Figure 8(a), compared with the NC situation, both Isolated RL and Dyna-MARL reduce the total travel time of road users in all three scenarios. Specifically, Isolated RL decreases total travel time by up to 6.2%, while Dyna-MARL achieves a maximum reduction of 12.2% (see Figure 8(d)). The comparison of total throughput is presented in Figure 8(b). Dyna-MARL improves the total throughput by up to 2.3% (see Figure 8(d)) and outperforms Isolated RL in all three scenarios; in scenario B, Isolated RL even fails to improve the total throughput. For the comparison of total CO_{2} emission (shown in Figure 8(c)), both Isolated RL and Dyna-MARL achieve their best performance in scenario B, with reductions of 4.7% and 4.6%, respectively. In scenarios A and C, Dyna-MARL performs much better than Isolated RL.
Through the above comparison, we can see that Dyna-MARL outperforms Isolated RL in almost all scenarios and for all indicators.
(3) System Equity. Although the general indicators presented in comparison (2) are effective for testing the performance of different systems, they cannot measure system equity, which is also an important aspect of system performance. In this paper, we only consider spatial equity, defined as a measure of the equity of user delays on different on-ramps [42]. In this study, we assume that road users from all three on-ramps have the same importance. If users from different on-ramps experience similar travel times, the control system is defined as equitable; conversely, a large difference between on-ramp queues indicates a highly inequitable system. In [43], the variance of travel time on different on-ramps is used as an indicator of system equity. Similar to [43], for the sake of comparison, the standard deviation is used in our case. This indicator is defined as

\[ \sigma(k) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(TT_{i}(k) - \overline{TT}(k)\right)^{2}}, \]

where \sigma(k) is the standard deviation of the travel times of the N on-ramps at time step k, TT_{i}(k) is the estimated total travel time of on-ramp i at step k, and \overline{TT}(k) is the average total travel time over the N on-ramps at step k.
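The standard-deviation equity indicator is a population standard deviation over the per-ramp travel-time estimates at one control step. A minimal sketch, assuming the input list holds the estimated total travel time of each on-ramp at step k:

```python
import math


def equity_std(travel_times):
    """Population standard deviation of the estimated total travel times
    across the on-ramps at one control step (sigma(k) in the text).
    A value of 0 means perfectly equitable; larger values mean users on
    different on-ramps experience increasingly unequal delays."""
    n = len(travel_times)
    mean = sum(travel_times) / n
    return math.sqrt(sum((t - mean) ** 2 for t in travel_times) / n)
```

For example, `equity_std([120.0, 120.0, 120.0])` is `0.0` (perfect equity), while a long queue on one ramp, as in `equity_std([120.0, 120.0, 300.0])`, produces a large value.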
Results of the comparison of system equity are shown in Figure 9. In the NC situation, good equity is maintained in scenarios B and C because entering vehicles are not restricted (as shown in Figures 9(b) and 9(c)). However, when one of the on-ramp entrances is blocked by congestion in scenario A, a long queue forms and leads to imbalance, and hence inequity, among users on different on-ramps (see Figure 9(a)). For the controlled cases, Isolated RL performs poorly in all scenarios. This is because the ramp controller near the congestion imposes much more restrictive measures on the controlled traffic than the other controllers do. Owing to its coordination strategy, Dyna-MARL outperforms Isolated RL in maintaining system equity in all scenarios, especially during the incident (from 07:30:00 to 08:00:00).
7. Conclusions and Future Work
A Dyna-based multiagent reinforcement learning method, referred to as Dyna-MARL, has been developed for motorway ramp control in this paper. Dyna-MARL is compared with Isolated RL (Q-learning without coordination) and the non-controlled situation in a simulation environment. Real traffic data collected from a metered motorway segment in the UK are used to build the simulation.
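The Dyna architecture underlying Dyna-MARL combines model-free and model-based learning: each real transition drives a direct Q-learning update, is stored in a learned model, and is then replayed in simulated planning updates. The sketch below is a generic tabular Dyna-Q step, not the paper's exact multiagent controller; the state/action encodings and hyperparameter values are illustrative assumptions:

```python
import random
from collections import defaultdict


def dyna_q_update(Q, model, s, a, r, s2, actions,
                  alpha=0.1, gamma=0.95, planning_steps=5):
    """One Dyna-Q step (generic sketch of the Dyna architecture).

    Q      : defaultdict(float) mapping (state, action) to a value
    model  : dict mapping (state, action) to the observed (reward, next state)
    """
    # Direct RL: model-free Q-learning update from the real transition
    Q[s, a] += alpha * (r + gamma * max(Q[s2, b] for b in actions) - Q[s, a])
    # Model learning: remember the observed outcome of taking a in s
    model[s, a] = (r, s2)
    # Planning: replay randomly chosen remembered transitions (model-based)
    for _ in range(planning_steps):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * max(Q[ps2, b] for b in actions)
                              - Q[ps, pa])
```

The planning loop is what lets a Dyna agent extract more value from each real observation than plain Q-learning, which matters when real traffic transitions are scarce.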
Through a series of simulation-based experiments, we can conclude the following: (1) Isolated RL can improve motorway performance in terms of increasing total throughput and reducing total travel time and CO_{2} emission, but this improvement comes at the expense of poor system equity across the on-ramps; (2) with a suitable coordination strategy, much higher system equity can be achieved by Dyna-MARL; (3) in addition to system equity, Dyna-MARL outperforms Isolated RL in almost all scenarios and for all indicators, which means Dyna-MARL can deal with network-wide problems effectively.
Although the simulation tests have shown some positive results regarding the performance of Dyna-MARL, only a simplified incident scenario with a fixed duration is considered in the current work. In practice, incident duration is highly variable and affected by a number of factors, such as weather conditions, road conditions, and the arrival time of the incident management team. Therefore, incident duration should be treated as an uncertainty, which will be investigated in our future work.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This paper is supported by the China Scholarship Council and the University of Leeds (CSC-University of Leeds scholarship) and partially supported by the National Natural Science Foundation of China (Grant nos. 91420203 and 61271376). The authors would like to thank the institutions that support this study.
References
[1] M. Papageorgiou and A. Kotsialos, “Freeway ramp metering: an overview,” IEEE Transactions on Intelligent Transportation Systems, vol. 3, no. 4, pp. 271–281, 2002.
[2] A. Skabardonis, P. Varaiya, and K. F. Petty, “Measuring recurrent and nonrecurrent traffic congestion,” Transportation Research Record, vol. 1856, pp. 118–124, 2003.
[3] M. Papageorgiou, H. Hadj-Salem, and J.-M. Blosseville, “ALINEA: a local feedback control law for on-ramp metering,” Transportation Research Record, vol. 1320, pp. 58–64, 1991.
[4] E. Smaragdis and M. Papageorgiou, “Series of new local ramp metering strategies,” Transportation Research Record, vol. 1856, pp. 74–86, 2003.
[5] L. N. Jacobson, K. C. Henry, and O. Mehyar, “Real-time metering algorithm for centralized control,” Transportation Research Record, vol. 1232, pp. 17–26, 1989.
[6] G. Paesani, J. Kerr, P. Perovich, and F. Khosravi, “System wide adaptive ramp metering (SWARM),” in Proceedings of the 7th ITS America Annual Meeting and Exposition: Merging the Transportation and Communications Revolutions, Washington, DC, USA, June 1997.
[7] R. Lau, Ramp Metering by Zone—The Minnesota Algorithm, Minnesota Department of Transportation, 1997.
[8] H. M. Zhang and W. W. Recker, “On optimal freeway ramp control policies for congested traffic corridors,” Transportation Research Part B: Methodological, vol. 33, no. 6, pp. 417–436, 1999.
[9] A. Kotsialos, M. Papageorgiou, and F. Middelham, “Optimal coordinated ramp metering with advanced motorway optimal control,” Transportation Research Record, no. 1748, pp. 55–65, 2001.
[10] A. Hegyi, B. De Schutter, and H. Hellendoorn, “Model predictive control for optimal coordination of ramp metering and variable speed limits,” Transportation Research Part C: Emerging Technologies, vol. 13, no. 3, pp. 185–209, 2005.
[11] G. Gomes and R. Horowitz, “Optimal freeway ramp metering using the asymmetric cell transmission model,” Transportation Research Part C: Emerging Technologies, vol. 14, no. 4, pp. 244–262, 2006.
[12] A. H. F. Chow and Y. Li, “Robust optimization of dynamic motorway traffic via ramp metering,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 3, pp. 1374–1380, 2014.
[13] R. W. Hall, “Nonrecurrent congestion: how big is the problem? Are traveler information systems the solution?” Transportation Research Part C: Emerging Technologies, vol. 1, no. 1, pp. 89–103, 1993.
[14] P. Prevedouros, B. Halkias, K. Papandreou, and P. Kopelias, “Freeway incidents in the United States, United Kingdom, and Attica Tollway, Greece: characteristics, available capacity, and models,” Transportation Research Record, vol. 2047, pp. 57–65, 2008.
[15] T. L. Greenlee and H. J. Payne, “Freeway ramp metering strategies for responding to incidents,” in Proceedings of the IEEE Conference on Decision and Control including the 16th Symposium on Adaptive Processes and a Special Symposium on Fuzzy Set Theory and Applications, pp. 987–992, New Orleans, LA, USA, December 1977.
[16] M. H. Wang, Optimal ramp metering policies for nonrecurring congestion with uncertain incident duration [Ph.D. thesis], Purdue University, West Lafayette, Ind, USA, 1994.
[17] J.-B. Sheu, “Stochastic modeling of the dynamics of incident-induced lane traffic states for incident-responsive local ramp control,” Physica A: Statistical Mechanics and its Applications, vol. 386, no. 1, pp. 365–380, 2007.
[18] J.-B. Sheu and M.-S. Chang, “Stochastic optimal-control approach to automatic incident-responsive coordinated ramp control,” IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 2, pp. 359–367, 2007.
[19] C. Jacob and B. Abdulhai, “Machine learning for multijurisdictional optimal traffic corridor control,” Transportation Research Part A: Policy and Practice, vol. 44, no. 2, pp. 53–64, 2010.
[20] M. Davarynejad, A. Hegyi, J. Vrancken, and J. van den Berg, “Motorway ramp-metering control with queuing consideration using Q-learning,” in Proceedings of the 14th International IEEE Conference on Intelligent Transportation Systems (ITSC '11), pp. 1652–1658, Washington, DC, USA, October 2011.
[21] K. Rezaee, B. Abdulhai, and H. Abdelgawad, “Application of reinforcement learning with continuous state space to ramp metering in real-world conditions,” in Proceedings of the 15th International IEEE Conference on Intelligent Transportation Systems (ITSC '12), pp. 1590–1595, Anchorage, Alaska, USA, September 2012.
[22] C. Jacob and B. Abdulhai, “Automated adaptive traffic corridor control using reinforcement learning: approach and case studies,” Transportation Research Record, vol. 1959, pp. 1–8, 2006.
[23] K. Rezaee, B. Abdulhai, and H. Abdelgawad, “Self-learning adaptive ramp metering: analysis of design parameters on a test case in Toronto, Canada,” Transportation Research Record, vol. 2396, pp. 10–18, 2013.
[24] X.-J. Wang, X.-M. Xi, and G.-F. Gao, “Reinforcement learning ramp metering without complete information,” Journal of Control Science and Engineering, vol. 2012, Article ID 208456, 8 pages, 2012.
[25] A. Fares and W. Gomaa, “Multi-agent reinforcement learning control for ramp metering,” in Progress in Systems Engineering, vol. 330 of Advances in Intelligent Systems and Computing, pp. 167–173, Springer, Basel, Switzerland, 2015.
[26] K. Veljanovska, K. M. Bombol, and T. Maher, “Reinforcement learning technique in multiple motorway access control strategy design,” PROMET—Traffic & Transportation, vol. 22, no. 2, pp. 117–123, 2010.
[27] C. Lu, H. Chen, and S. Grant-Muller, “An indirect reinforcement learning approach for ramp control under incident-induced congestion,” in Proceedings of the 16th International IEEE Conference on Intelligent Transportation Systems (ITSC '13), pp. 979–984, The Hague, The Netherlands, October 2013.
[28] L. Buşoniu, R. Babuška, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, no. 2, pp. 156–172, 2008.
[29] C. Lu, H. Chen, and S. Grant-Muller, “Indirect reinforcement learning for incident-responsive ramp control,” Procedia—Social and Behavioral Sciences, vol. 111, pp. 1112–1122, 2014.
[30] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[31] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[32] J. R. Kok and N. Vlassis, “Collaborative multiagent reinforcement learning by payoff propagation,” Journal of Machine Learning Research, vol. 7, pp. 1789–1828, 2006.
[33] C. Guestrin, M. G. Lagoudakis, and R. Parr, “Coordinated reinforcement learning,” in Proceedings of the 19th International Conference on Machine Learning, pp. 227–234, Sydney, Australia, July 2002.
[34] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140–1150, 2013.
[35] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: a survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[36] C. F. Daganzo, “The cell transmission model: a dynamic representation of highway traffic consistent with the hydrodynamic theory,” Transportation Research Part B: Methodological, vol. 28, no. 4, pp. 269–287, 1994.
[37] J. Haddad, M. Ramezani, and N. Geroliminis, “Cooperative traffic control of a mixed network with two urban regions and a freeway,” Transportation Research Part B: Methodological, vol. 54, pp. 17–36, 2013.
[38] H. Mongeot and J.-B. Lesort, “Analytical expressions of incident-induced flow dynamics perturbations: using macroscopic theory and extension of Lighthill-Whitham theory,” Transportation Research Record, vol. 1710, pp. 58–68, 2000.
[39] Transport Simulation Systems, Aimsun User's Manual 6.1, TSS, Barcelona, Spain, 2010.
[40] Highways England, “HATRIS homepage,” 2013, https://www.hatris.co.uk/.
[41] E. Cascetta, “Estimation of trip matrices from traffic counts and survey data: a generalized least squares estimator,” Transportation Research Part B: Methodological, vol. 18, no. 4-5, pp. 289–299, 1984.
[42] L. Zhang and D. Levinson, “Balancing efficiency and equity of ramp meters,” Journal of Transportation Engineering, vol. 131, no. 6, pp. 477–481, 2005.
[43] A. Kotsialos and M. Papageorgiou, “Efficiency and equity properties of freeway network-wide ramp metering with AMOC,” Transportation Research Part C: Emerging Technologies, vol. 12, no. 6, pp. 401–420, 2004.
Copyright
Copyright © 2015 Chao Lu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.