Journal of Control Science and Engineering
Volume 2012 (2012), Article ID 208456, 8 pages
http://dx.doi.org/10.1155/2012/208456
Research Article

Reinforcement Learning Ramp Metering without Complete Information

1School of Traffic and Transportation, Shijiazhuang Tiedao University, Shijiazhuang, Hebei 050043, China
2Traffic Safety Engineering and Emergency Management Workgroup, Traffic Safety and Control Laboratory of Hebei Province, Shijiazhuang, Hebei 050043, China

Received 31 August 2011; Revised 5 December 2011; Accepted 11 December 2011

Academic Editor: Onur Toker

Copyright © 2012 Xing-Ju Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This paper develops a model of reinforcement learning ramp metering (RLRM) without complete information, which is applied to alleviate traffic congestion on ramps. RLRM consists of a prediction tool based on traffic flow simulation and an optimal choice model based on reinforcement learning theory. It is a dynamic process with the abilities of automaticity, memory, and performance feedback. Numerical cases are given to demonstrate RLRM, with outflow rate, density, average speed, and travel time calculated and compared to no control and fixed-time control. The results indicate that the greater the inflow, the greater the effect. In addition, RLRM is more stable than fixed-time control.

1. Introduction

Increasing dependence on car-based travel has led to the daily occurrence of recurrent and nonrecurrent freeway congestion, not only in China but also around the world. Congestion on highways forms when demand exceeds capacity. Recurrent congestion substantially reduces the available infrastructure capacity during rush hour, that is, at the time when this capacity is most urgently needed. Moreover, congestion causes delays, increases environmental pollution, and reduces traffic safety.

Ramp metering is essential to the efficient operation of highways, particularly when volumes are high. According to Papageorgiou and others, ramp metering strategies can be divided roughly into reactive and proactive types [1]. DC (demand-capacity), OCC (occupancy), and ALINEA [2] are among the best-known local responsive ramp metering strategies [3]. In DC, the actual upstream volume is measured at regular short intervals and compared to the downstream capacity, which may be estimated from downstream traffic conditions. OCC uses a predetermined relationship between occupancy rate and lane volume, developed from data previously collected at the highway section adjacent to the ramp being considered. ALINEA sets the admission rate of an on-ramp based on measured main-line traffic; it has been applied in several European countries and has been validated extensively in comparison with DC and OCC. Iwata, Tsubota, and Kawashima proposed a ramp metering technique using values predicted by a traffic simulator [4]. Reinforcement learning ramp metering based on a traffic simulation model with desired speed was proposed by Wang et al. [5]. The aim of this study is to propose reinforcement learning ramp metering without complete information.

2. Methods

2.1. Traffic Flow Simulation Model

Figure 1 illustrates car-following behavior. In a microscopic simulation model, the fundamental behavior modeled is "car following", which is governed by the driver's characteristics: the distance between two adjacent cars, their relative speed, and so forth.

Figure 1: Car-following behavior.

In 1953, Pipes proposed the following basic differential equation model for car-following behavior:

$\ddot{x}_{n+1}(t) = a\left[\dot{x}_{n}(t) - \dot{x}_{n+1}(t)\right]$, (1)

where $\ddot{x}$, $\dot{x}$, and $x$ denote the acceleration, speed, and distance from the reference point of vehicle $n$, respectively, and $a$ is a constant. In this model, the acceleration of the vehicle following a leading vehicle is proportional to the speed difference between the two vehicles. It is assumed that the delay with which the vehicle responds to the speed difference is small enough to be neglected. To remove this drawback, Chandler introduced a reaction delay time $T$. Based on the rationale that the acceleration of the following car is also influenced by its own speed and by the distance between the vehicles, Gazis, Herman, and Rothery proposed the general form of the car-following model:

$\ddot{x}_{n+1}(t+T) = a\,\dfrac{\left[\dot{x}_{n+1}(t+T)\right]^{m}\left[\dot{x}_{n}(t) - \dot{x}_{n+1}(t)\right]}{\left[x_{n}(t) - x_{n+1}(t)\right]^{l}}$. (2)

Newell proposed, based on real data, the following model in which the acceleration is proportional to an exponential function of the distance between the vehicles:

$\ddot{x}_{n+1}(t+T) = a_{1}\left[\dot{x}_{n}(t) - \dot{x}_{n+1}(t)\right]\exp\left(-a_{2}/\left[x_{n}(t) - x_{n+1}(t) - a_{3}\right]\right)$. (3)

Although the above modifications improved the realism of car-following models, they still have two drawbacks. When no preceding vehicle exists, the model implies that a car simply maintains its initial speed. On the other hand, when the speed difference is 0, the acceleration is 0; this implies the unrealistic phenomenon that the following car neither brakes when the distance to the preceding car approaches 0 nor accelerates when the distance is very long. To solve these problems, Treiber and Helbing introduced the intelligent driver model (IDM) [6], which incorporates a desired speed and a minimum gap between cars. The IDM is given as

$\dot{v}_{n} = a\left[1 - \left(\dfrac{v_{n}}{v_{0}}\right)^{\delta} - \left(\dfrac{s^{*}(v_{n}, \Delta v_{n})}{s_{n}}\right)^{2}\right]$, (4)

$s^{*}(v, \Delta v) = s_{0} + \max\left(Tv + \dfrac{v\,\Delta v}{2\sqrt{ab}},\, 0\right)$, (5)

$s_{n}(t) = x_{n-1} - x_{n} - l$, (6)

$\Delta v_{n}(t) = v_{n}(t) - v_{n-1}(t)$, (7)

where $x$ is the position (distance from the reference point); $n$ indexes the $n$th car; $v$ is the speed; $l$ is the vehicle length; $s_{0}$ is the desired minimum gap; $a$ is the maximum acceleration; $s^{*}$ is the desired (effective) gap; $b$ is the comfortable deceleration; $\delta$ is the acceleration exponent; $T$ is the time gap; and $v_{0}$ is the desired speed.
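As an illustration, the IDM acceleration of (4)-(7) can be computed as in the following minimal Python sketch; the parameter values are illustrative assumptions, not those used in the paper.

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=30.0,    # desired speed (m/s), assumed value
                     T=1.5,      # desired time gap (s), assumed value
                     a=1.0,      # maximum acceleration (m/s^2), assumed value
                     b=2.0,      # comfortable deceleration (m/s^2), assumed value
                     s0=2.0,     # desired minimum gap (m), assumed value
                     delta=4.0): # acceleration exponent, assumed value
    """Intelligent Driver Model acceleration, Eqs. (4)-(7).

    v      : speed of the following vehicle
    v_lead : speed of the leading vehicle
    gap    : net distance to the leading vehicle (bumper to bumper)
    """
    dv = v - v_lead                                                      # Eq. (7)
    s_star = s0 + max(T * v + v * dv / (2.0 * math.sqrt(a * b)), 0.0)    # Eq. (5)
    return a * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)           # Eq. (4)

# Example: following a slower leader 20 m ahead
print(idm_acceleration(v=25.0, v_lead=20.0, gap=20.0))
```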

Figure 2 presents lane change behavior. A lane change model is needed to simulate driver behavior in freeway merging sections, weaving sections, and so forth [7]. We propose a new lane change model that describes driver behavior through judgment functions [8, 9]. We focus on a vehicle approaching a merge point and describe its behavior with several variables: the relative speed between the vehicle and nearby vehicles, the locations of both the main-line vehicles and the on-ramp vehicles, the driver's judgment functions for changing lanes, and the driver's desired speed. The driver's judgment function for free merging differs from that for forced merging. Free merging means that a car on the ramp can merge into the main line without being influenced and without interfering with cars on the main line. When the psychological and physical conditions of the forced merging model are both satisfied, the driver performs a lane change; otherwise, the driver continues car following without changing lanes.

Figure 2: Lane change behavior.

The physical condition represents the feasibility of a lane change. The lane change model with the driver's judgment functions is expressed as follows:

$h = x_{f} - x_{c} - L + \left(v_{f} - v_{c}\right)t + \dfrac{(A+B)\,t^{2}}{2} + \delta\,\dfrac{v_{0f} - v_{f}}{v_{0f}}\,S + \zeta\,\dfrac{v_{0c} - v_{c}}{v_{0c}}\,S \geq S$, (8)

$g = x_{c} - x_{b} - L + \left(v_{c} - v_{b}\right)t + \dfrac{(A-B)\,t^{2}}{2} - \theta\,\dfrac{v_{0b} - v_{b}}{v_{0b}}\,S - \xi\,\dfrac{v_{0c} - v_{c}}{v_{0c}}\,S \geq S$, (9)

$0 \leq A \leq e$, (10)

$0 \leq B \leq d$, (11)

where $h$ and $g$ are the judgment functions; $x$ is the distance from the reference point; $v$ is the speed; $L$ is the length of a vehicle; $t$ is the judgment time; $v_{0}$ is the desired speed, which follows a normal distribution; $\delta, \zeta, \theta, \xi \in [0,1]$ are adjustment coefficients; $A$ is the rapid acceleration with upper bound $e$; and $B$ is the rapid deceleration with upper bound $d$. Parameters $A$ and $B$ are associated with vehicle $c$'s judgment functions for lane change and determine whether the merge is free or forced. Since vehicle $c$ decides either to accelerate or to decelerate in order to merge into the main line, the two events are mutually exclusive.

The function $h$ judges whether vehicle $c$ can accelerate or decelerate to merge, given the spacing and speed conditions between vehicles $f$ and $c$. Similarly, the function $g$ judges the relationship between vehicles $c$ and $b$. If both $A$ and $B$ equal 0, the distance between vehicles $f$ and $b$ is large enough for vehicle $c$ to enter the main line, and free merging occurs (no acceleration or deceleration is required of vehicle $c$). Conversely, in the case of forced merging, we must examine whether a solution of inequalities (8) to (11) exists. Since $A$ and $B$ are mutually exclusive, the following two conditions are obtained. (1) When a rapid braking event $B$ does not occur, then $B = 0$, and only an acceleration event $A$ can happen. (2) When a rapid acceleration event $A$ does not occur, then $A = 0$, and only a braking event $B$ is allowed.

A lane change of vehicle $c$ can occur when a solution under condition (1) or (2) exists.

Psychological constraints describe the driver's motivation for a lane change: the driver is motivated to change lanes if the vehicle has not yet reached its desired speed and if the predicted speed after a lane change is greater than that without one, that is, if a speed advantage is gained. Let $a_{1}$ and $a_{2}$ denote the predicted accelerations with and without a lane change, respectively, both obtained from the IDM. The psychological constraint can then be written as

$a_{1} > a_{2}$. (12)

If (12) is satisfied, the driver is motivated to move from the current lane to the target lane; otherwise, the driver does not perform the lane change maneuver.

Lane change behavior can be characterized as a sequence of three stages: the ability to change lanes (physical condition), the motivation to change lanes (psychological constraint), and the execution of the lane change. When both the physical condition and the psychological constraint are satisfied, the driver carries out these three stages; otherwise, the driver continues car following without changing lanes.
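The three-stage decision can be sketched as follows. The gap-acceptance test below stands in for the judgment functions (8)-(11), with assumed minimum-gap thresholds, while the motivation test uses the predicted IDM accelerations as in (12); all names and threshold values are ours.

```python
def physical_condition(gap_lead, gap_lag, min_gap_lead, min_gap_lag):
    """Ability to change lanes: are the gaps to the prospective leader (f)
    and follower (b) in the target lane acceptable? These thresholds stand
    in for the judgment functions (8)-(11)."""
    return gap_lead >= min_gap_lead and gap_lag >= min_gap_lag

def psychological_constraint(a_change, a_stay):
    """Motivation, Eq. (12): change only if the predicted IDM acceleration
    after the lane change exceeds that of staying in the current lane."""
    return a_change > a_stay

def lane_change_decision(gap_lead, gap_lag, a_change, a_stay,
                         min_gap_lead=10.0, min_gap_lag=15.0):  # assumed thresholds (m)
    """Execute the lane change only when both conditions hold;
    otherwise the vehicle keeps car following in its current lane."""
    return (physical_condition(gap_lead, gap_lag, min_gap_lead, min_gap_lag)
            and psychological_constraint(a_change, a_stay))
```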

We develop a traffic flow simulation model consisting of the car-following model and the lane change model [10-12]. The basic concept of car-following theory is the relationship between stimulus and response. In the classic car-following theory, the stimulus is the relative speed of the following and leading vehicles, and the response is the acceleration (or deceleration) of the following vehicle. The car-following model describes how drivers follow one another in a traffic stream on a single lane. To reproduce traffic flow on two or more lanes, a lane change model that captures lane change behavior is needed. By combining the car-following and lane change models, we can express dynamic and complex traffic behavior on two or more lanes. Moreover, traffic flow simulation models of this kind have been applied to reproduce the traffic congestion phenomena described by Helbing and Kerner [13-16].

2.2. The Reinforcement Learning Ramp Metering

Reinforcement learning is a branch of machine learning that addresses the problem of an agent in an environment deciding which action to take based on the observed current state. The agent receives a reward from the environment for each chosen action. Reinforcement learning seeks a policy that maximizes the reward obtained over a sequence of actions [17]. It comprises a broad class of optimal control methods that estimate value functions from experience or simulation [18-21].

The model of reinforcement learning ramp metering (RLRM) is shown in Figure 3, where $q_{\text{in}}$ is the upstream inflow of the main line; $r$ is the metering rate; $q_{\text{out}}$ is the downstream outflow of the main line; $d_{m}$ is the density of the main line in the merging section; $d_{r}$ is the density of the on-ramp; $v_{m}$ is the average speed of the main line; and $v_{r}$ is the average speed of the on-ramp. The volume in the merging section is

$q = q_{\text{in}} + r - q_{\text{out}}$. (13)

Figure 3: Reinforcement learning ramp metering model.

According to the volume $q$ in the merging section, the upstream traffic $q_{\text{in}}$ is updated by

$q_{\text{in}}^{t+1} \leftarrow q_{\text{in}}^{t+1} + q$, (14)

where $q_{\text{in}}$, called the state variable, can be collected by detectors. The metering rate $r$ is the action variable, and $q_{\text{out}}$ is the reward resulting from the chosen action. $\rho_{L}$ is the traffic density in the merging section of length $L$ and is obtained by

$\rho_{L} = \dfrac{q_{\text{in}}^{t+1}}{L}$. (15)
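As an illustration, the bookkeeping in (13)-(15) can be written as a short function; the variable names and the treatment of the next-cycle measurement are ours.

```python
def update_state(q_in_t, r, q_out, q_in_next_measured, L):
    """One control-cycle bookkeeping step for the merging section.

    q_in_t             : upstream inflow during the current cycle (pcu/h)
    r                  : metering rate, i.e., admitted ramp inflow (pcu/h)
    q_out              : downstream outflow (pcu/h)
    q_in_next_measured : inflow measured by detectors for the next cycle (pcu/h)
    L                  : length of the merging section (km)
    """
    q = q_in_t + r - q_out              # residual volume in the section, Eq. (13)
    q_in_next = q_in_next_measured + q  # next-cycle state variable, Eq. (14)
    rho_L = q_in_next / L               # density in the merging section, Eq. (15)
    return q_in_next, rho_L
```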

The framework of RLRM is explained briefly with reference to Figure 4. RLRM consists of a metering rate choice model, an outflow function, a value function, and an environmental model. The metering rate choice model is the rule for choosing the optimal metering rate. The outflow function describes the downstream traffic data collected and processed by detectors. The value function represents the total downstream outflow. The environmental model predicts the inflow and outflow of the next time period from the optimal metering rate and the current inflow.

Figure 4: Block diagram for reinforcement learning ramp metering.
2.3. RLRM with Complete Information

The RLRM with complete information constitutes a Markov decision process (MDP), in which the set $S$ of inflow states and the set $A(q_{\text{in}})$ of metering rates for each $q_{\text{in}}^{t} \in S$ are finite. We typically use the set of matrices

$P^{r}_{q_{\text{in}} q'_{\text{in}}} = \Pr\left\{ q_{\text{in}}^{t+1} = q'_{\text{in}} \mid q_{\text{in}}^{t} = q_{\text{in}},\ r^{t} = r \right\}$ (16)

to describe the transition structure. The expected traffic outflow at time $t$ is obtained by

$R^{r}_{q_{\text{in}} q'_{\text{in}}} = E\left\{ q_{\text{out}}^{t+1} \mid q_{\text{in}}^{t} = q_{\text{in}},\ r^{t} = r,\ q_{\text{in}}^{t+1} = q'_{\text{in}} \right\}$, (17)

for all $q_{\text{in}} \in S$, $r \in A(q_{\text{in}})$, and $q'_{\text{in}} \in S^{+}$.

The maximum-outflow value functions $V$ and $Q$ are given by the Bellman equations

$V^{\pi}(q_{\text{in}}) = \max_{r} E\left\{ q_{\text{out}}^{t+1} + \lambda V\left(q_{\text{in}}^{t+1}\right) \mid q_{\text{in}}^{t} = q_{\text{in}},\ r^{t} = r \right\} = \max_{r} \sum_{q'_{\text{in}}} P^{r}_{q_{\text{in}} q'_{\text{in}}} \left[ R^{r}_{q_{\text{in}} q'_{\text{in}}} + \lambda V^{\pi}\left(q'_{\text{in}}\right) \right]$, (18)

or

$Q(q_{\text{in}}, r) = E\left\{ q_{\text{out}}^{t+1} + \lambda \max_{r'} Q\left(q_{\text{in}}^{t+1}, r'\right) \mid q_{\text{in}}^{t} = q_{\text{in}},\ r^{t} = r \right\} = \sum_{q'_{\text{in}}} P^{r}_{q_{\text{in}} q'_{\text{in}}} \left[ R^{r}_{q_{\text{in}} q'_{\text{in}}} + \lambda \max_{r'} Q\left(q'_{\text{in}}, r'\right) \right]$. (19)

With the MDP's complete information, the transition probabilities $P^{r}_{q_{\text{in}} q'_{\text{in}}}$ and the next-step values $V^{\pi}(q'_{\text{in}})$ can be obtained. Assuming that the traffic outflow is finite, the value functions, and hence the optimal outflow, can be computed directly.
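For completeness, the complete-information case (18)-(19) can be solved by standard value iteration; the following sketch assumes discrete inflow states and metering rates and nested dictionaries P and R holding the transition probabilities and expected outflows (all names are ours).

```python
def value_iteration(states, actions, P, R, lam=0.9, tol=1e-6):
    """Solve the Bellman equation (18) when the transition probabilities
    P[s][a][s2] and expected outflows R[s][a][s2] are fully known.

    states  : discrete inflow levels q_in
    actions : discrete metering rates r
    lam     : discount factor (lambda in the paper)
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Right-hand side of Eq. (18): best expected outflow plus
            # discounted value of the successor state.
            v_new = max(
                sum(P[s][a][s2] * (R[s][a][s2] + lam * V[s2]) for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```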

2.4. RLRM without Complete Information

Section 2.3 assumes a Markov decision process with complete information; in practice, this assumption is untenable. Without complete information, the ramp metering rate can instead be determined by evaluating experience. Since the transition probabilities are no longer needed, (18) can be rewritten as

$V^{\pi}\left(q_{\text{in}}^{t}\right) \leftarrow V^{\pi}\left(q_{\text{in}}^{t}\right) + a_{t}\left[ q_{\text{out}}^{t} - V^{\pi}\left(q_{\text{in}}^{t}\right) \right]$, (20)

where $q_{\text{out}}^{t}$ is the real-time outflow at time $t$ and $a_{t}$ is a step-size parameter depending on $t$. Equation (19) can be replaced by

$Q\left(q_{\text{in}}^{t}, r^{t}\right) \leftarrow Q\left(q_{\text{in}}^{t}, r^{t}\right) + a_{t}\left[ q_{\text{out}} + \lambda E\left\{ Q\left(q_{\text{in}}^{t+1}, r^{t+1}\right) \mid s^{t} \right\} - Q\left(q_{\text{in}}^{t}, r^{t}\right) \right]$. (21)

If the expectation over the next metering rate is not available, we can also replace

$q_{\text{out}} + \lambda E\left\{ Q\left(q_{\text{in}}^{t+1}, r^{t+1}\right) \mid s^{t} \right\} - Q\left(q_{\text{in}}^{t}, r^{t}\right)$ (22)

by

$q_{\text{out}} + \lambda\, a^{\pi}\left(q_{\text{in}}^{t}, r^{t}\right) Q\left(q_{\text{in}}^{t+1}, r^{t}\right) - Q\left(q_{\text{in}}^{t}, r^{t}\right)$. (23)

We get

$Q\left(q_{\text{in}}^{t}, r^{t}\right) \leftarrow Q\left(q_{\text{in}}^{t}, r^{t}\right) + a_{t}\left[ q_{\text{out}} + \lambda\, a^{\pi}\left(q_{\text{in}}^{t}, r^{t}\right) Q\left(q_{\text{in}}^{t+1}, r^{t}\right) - Q\left(q_{\text{in}}^{t}, r^{t}\right) \right]$. (24)

In (24), we suppose that the probability of the on-ramp control policy $\pi$ can be obtained; in practice, however, this initial condition is difficult to satisfy. The values $a^{\pi}(q_{\text{in}}^{t}, r^{t})\, Q(q_{\text{in}}^{t+1}, r^{t})$ associated with an optimal on-ramp control policy are called the optimal ramp inflow and are often written as $\max_{r} Q(q_{\text{in}}^{t+1}, r)$. We get

$Q\left(q_{\text{in}}^{t}, r^{t}\right) \leftarrow Q\left(q_{\text{in}}^{t}, r^{t}\right) + a_{t}\left[ q_{\text{out}} + \lambda \max_{r} Q\left(q_{\text{in}}^{t+1}, r\right) - Q\left(q_{\text{in}}^{t}, r^{t}\right) \right]$, (25)

where

$\sum_{t=1}^{\infty} a_{t} = \infty$, (26)

$\sum_{t=1}^{\infty} a_{t}^{2} < \infty$. (27)

In (25), the learned action-value function $Q$ directly approximates $Q^{*}$, the optimal action-value function, while the state variable is updated according to the current policy.
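As an illustration, a minimal sketch of the update (25) with a step size satisfying (26) and (27) (here $a_t = 1/t$, an assumed choice) might look as follows; the function and variable names are ours, not the paper's.

```python
def q_update(Q, q_in_t, r_t, q_out, q_in_next, rates, t, lam=0.9):
    """One model-free update of Eq. (25).

    Q         : dict mapping (inflow state, metering rate) -> estimated value
    q_out     : observed downstream outflow (the reward)
    q_in_next : inflow state observed for the next cycle
    rates     : the discrete set of admissible metering rates
    t         : cycle index, used for the step size a_t = 1/t, which
                satisfies conditions (26) and (27)
    lam       : discount factor (lambda in the paper)
    """
    a_t = 1.0 / t
    best_next = max(Q.get((q_in_next, r), 0.0) for r in rates)  # max_r Q(q_in^{t+1}, r)
    old = Q.get((q_in_t, r_t), 0.0)
    Q[(q_in_t, r_t)] = old + a_t * (q_out + lam * best_next - old)
    return Q
```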

When traffic reaches the jam density, the ramp may remain closed for a long period of time, which must be taken into account. A maximum waiting time $T_{\max}$ and its associated metering rate $r_{T}$ are given. When $\sum_{n=1}^{m} T_{S_{n}} > T_{\max}$, the control $(q_{\text{in}}^{t}, r_{T})$ is selected. To avoid the curse of dimensionality, the continuous metering rate $r^{t}$ is discretized: the interval from 0 to $r_{\max}$ is divided into $N_{r}$ equal steps of size $r_{n}$, given by

$r_{n} = \operatorname{ceil}\left(\dfrac{r_{\max}}{N_{r}}\right)$, (28)

where $N_{r}$ is the number of metering-rate levels and $\operatorname{ceil}(\cdot)$ denotes rounding up to an integer. The metering rate is then $\min\left(k\, r_{n}, r_{\max}\right)$ for $k \in \mathbb{N}$.
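A short sketch of this discretization, assuming ceiling rounding in (28) and the rate matrix of Section 3 as an example:

```python
import math

def metering_rate_levels(r_max, N_r):
    """Discretize the continuous metering rate.

    Eq. (28): step size r_n = ceil(r_max / N_r); the admissible rates are
    min(k * r_n, r_max) for k = 0, 1, ..., N_r.
    """
    r_n = math.ceil(r_max / N_r)
    return [min(k * r_n, r_max) for k in range(N_r + 1)]

# Example reproducing the metering rate matrix of Section 3 (r_max = 1100, N_r = 11):
print(metering_rate_levels(1100, 11))  # [0, 100, 200, ..., 1000, 1100]
```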

The algorithm of reinforcement learning on-ramp metering is shown in Figure 5.
(1) Initialize $Q$, $q_{\text{out}}$, $q_{\text{in}}$, and $k$.
(2) Determine the cycle time $t$ of the traffic signal.
(3) Update $q_{\text{in}}^{t}$.
(4) Set the metering rate $r^{t} = k \times r_{n}$.
(5) Determine the traffic state $(q_{\text{in}}^{t}, r^{t})$.
(6) Generate the density $\rho_{L}$ by traffic simulation and evaluate the metering rate.
(7) If $r^{t} < r_{\max}$, update $k = k + 1$ and go to step (4); otherwise generate the optimal control $(q_{\text{in}}^{t}, r)$.
(8) If the ramp is closed, update the waiting time by $T = T + t$; otherwise reset it to $T = 0$. If $T > T_{\max}$, set the metering rate to $r = r_{T}$.
(9) Apply the optimal control $(q_{\text{in}}^{t}, r)$ and update $Q$.

Figure 5: Algorithm of reinforcement learning ramp metering.

When the cycle time $t$ has elapsed, decide whether to continue the ramp metering. If so, collect the inflow data $q_{\text{in}}^{t+1}$, go to step (3), and update $q_{\text{in}}^{t}$, that is, $q_{\text{in}}^{t+1} \to q_{\text{in}}^{t}$; otherwise, terminate the ramp metering.
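The loop of Figure 5 can be sketched as follows. This Python version follows steps (1)-(9) but simplifies the metering-rate search to a greedy choice over the learned Q-values, reuses metering_rate_levels and q_update from the sketches above, and treats simulate_cycle, measure_inflow, t_cycle, and T_max as assumed stand-ins for the traffic simulator of Section 2.1, the detectors, and the paper's parameters.

```python
def rlrm_control(simulate_cycle, measure_inflow, n_cycles,
                 N_r=11, r_max=1100, t_cycle=20.0, T_max=120.0, lam=0.9):
    """Reinforcement learning ramp metering loop following steps (1)-(9).

    simulate_cycle(q_in, r) -> (q_out, rho_L): one control cycle of the
        traffic flow simulation (stand-in; not specified here)
    measure_inflow() -> upstream inflow detected for the next cycle
    """
    rates = metering_rate_levels(r_max, N_r)   # discrete metering rates, Eq. (28)
    Q = {}                                     # (1) initialize action values
    q_in = measure_inflow()                    # (3) initial state variable
    T_wait = 0.0                               # accumulated ramp-closure time
    for t in range(1, n_cycles + 1):           # (2) one iteration per signal cycle
        # (4)-(7): evaluate the candidate rates and take the one with the
        # highest estimated value (a greedy simplification of the search).
        r = max(rates, key=lambda a: Q.get((q_in, a), 0.0))
        # (8): do not keep the ramp closed longer than T_max.
        T_wait = T_wait + t_cycle if r == 0 else 0.0
        if T_wait > T_max:
            r = rates[1]                       # fallback rate r_T (assumed to be
                                               # the smallest nonzero level)
        # (9): apply the control, observe the outflow, and update Q via Eq. (25).
        q_out, rho_L = simulate_cycle(q_in, r)
        q_in_next = measure_inflow()
        Q = q_update(Q, q_in, r, q_out, q_in_next, rates, t, lam)
        q_in = q_in_next                       # q_in^{t+1} -> q_in^t
    return Q
```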

3. Data Combination and Reduction

Our aim is to design a reinforcement learning control law for the ramp metering controller without complete information. We need to control the inflow from the ramp into the main line, with the metering rate determined by the traffic state. Traffic flow simulation is conducted to demonstrate this ramp metering control. In the simulation, the main-line length is set to 1000 m, the ramp length to 200 m, and the length of the merging section shared by the main line and the ramp to 100 m. The parameters of RLRM are shown in Table 1, and the metering rate matrix is {0, 100, 200, 300, ..., 900, 1000, 1100}.

Table 1: RLRM parameters.

Table 2 shows the inflows of cases A, B, C, D, E, and F. The main-line inflow rate increases from 1200 pcu/hour in case A to 2500 pcu/hour in case F, and the ramp inflow rate rises from 300 pcu/hour in case A to 900 pcu/hour in case F. The cycle length of the fixed-time control is 20 s, consisting of 15 s green time and 5 s red time.

Table 2: Traffic inflow.

4. Result and Discussion

The results of no control, fixed-time control, and RLRM are shown in Figures 6-9. Total inflow increases from 1500 pcu/h in case A to 3400 pcu/h in case F. Figure 6 presents the average speed and its rate of change compared to no control. In case A, the average speed under no control, about 108 km/h, is higher than under fixed-time control and RLRM; similar results are seen in case B. In case C, the average speed under no control, about 79 km/h, is higher than under fixed-time control but lower than under RLRM. In case F, the average speed under no control, about 51 km/h, is lower than under both fixed-time control and RLRM. In terms of average speed, the congestion relief rates of fixed-time control from case A to case F are -7.80%, -6.65%, -3.77%, 0.26%, 2.70%, and 8.26%, respectively, and those of RLRM are -6.31%, -6.49%, 5.69%, 13.55%, 20.50%, and 18.18%, respectively.

Figure 6: Average speed and its rate compared to no control.
Figure 7: Density and its rate compared to no control.
Figure 8: Outflow and its rate compared to no control.
Figure 9: Travel time and its rate compared to no control.

Figure 7 shows the density and its rate of change compared to no control. In case A, the densities under fixed-time control and RLRM are about 38 pcu/km, an increase of about 60%. In case C, the densities under fixed-time control and RLRM are about 52 pcu/km and 45 pcu/km, decreases of about 11.46% and 22.60%, respectively. In case F, the densities under fixed-time control, no control, and RLRM are all about 120 pcu/km. In terms of density, the congestion relief rates of fixed-time control from case A to case F are -57.55%, -19.92%, -11.46%, -21.35%, 7.6%, and 0.39%, respectively, and those of RLRM are -59.59%, -22.05%, 22.60%, 8.18%, 9.65%, and 3.42%, respectively.

Figure 8 shows the outflow and its rate of change compared to no control. In case A, the outflow rises from 1700 pcu/h without control to 2308 pcu/h with fixed-time control and 1800 pcu/h with RLRM. In case C, increases of 3.82% and 7.85% in outflow are observed, and in case F, increases of 18.97% and 30.65%. The congestion relief rates of fixed-time control from case A to case F are 35.76%, -14.25%, 7.82%, 7.93%, 12.51%, and 18.97%, respectively, while those of RLRM are 7.06%, 0.58%, 3.85%, 10.47%, 54.63%, and 30.65%, respectively.

Figure 9 shows the travel time and its rate of change compared to no control. In case A, increases of 6.25% and 9.38% in travel time are observed. In case C, travel time rises from 342 s without control to 370 s with fixed-time control and falls to 330 s with RLRM. In case F, travel time falls from 617 s without control to 469 s with fixed-time control and 343 s with RLRM. The congestion relief rates of fixed-time control from case A to case F are -6.25%, -25.26%, -8.19%, 7.36%, 27.06%, and 23.99%, respectively, while those of RLRM are -9.38%, -5.26%, 3.51%, 38.17%, 40.32%, and 44.41%, respectively.

According to Figures 6-9, when traffic inflow is low, the controls are not efficient; they become more efficient as traffic inflow increases. When traffic inflow is high, the controls are very efficient, and RLRM performs best. Moreover, the curves in Figures 7-9 show that the assessment indicators of fixed-time control fluctuate around those of no control, so fixed-time control is less stable than RLRM. The automaticity, memory, and performance feedback abilities of RLRM are also demonstrated.

5. Conclusion

On-ramp metering keeps traffic moving at a speed approximately equal to the optimum speed, which results in maximum flow rates and minimum travel times. This study develops an RLRM model without complete information, consisting of a prediction tool based on traffic flow simulation and an optimal choice model based on reinforcement learning theory. Numerical cases are given to compare RLRM with no control and fixed-time control, with densities, outflow rates, average speeds, and travel times calculated. Across cases A to F, fixed-time control and RLRM are assessed in terms of average speed, density, outflow rate, and travel time. When traffic inflow is low, the controls are not efficient, and there is little difference among no control, fixed-time control, and RLRM. When traffic inflow is high, the controls are very efficient, and RLRM performs best. Moreover, the greater the inflow, the greater the effect, and RLRM is more stable than fixed-time control.

Acknowledgments

This research is funded by the National Natural Science Foundation of China (Grant no. 51008201). It is also sponsored by the Scientific Research Foundation for Returned Overseas Chinese Scholars, State Education Ministry of China, and supported as a key project by the Scientific Research Foundation of the Education Department of Hebei Province of China (Grant no. GD2010235) and the Society Science Development Program of Hebei Province of China (Grant no. 201004068).

References

  1. M. Papageorgiou and A. Kotsialos, "Freeway ramp metering: an overview," IEEE Transactions on Intelligent Transportation Systems, vol. 3, no. 4, pp. 271–281, 2002.
  2. M. Papageorgiou, H. Hadj-Salem, and J. M. Blosseville, "A local feedback control law for on-ramp metering," Transportation Research Record, vol. 1320, pp. 58–64, 1991.
  3. M. Papageorgiou, H. Hadj-Salem, and F. Middelham, "ALINEA local ramp metering: summary of field results," Transportation Research Record, no. 1603, pp. 90–98, 1997.
  4. M. Iwata, "The comparative study about the ramp metering in highway," in Proceedings of the Japan Society of Civil Engineers (JSTE '06), vol. 26, pp. 73–76, 2006.
  5. X. J. Wang, B. H. Liu, X. Q. Niu, and T. Miyagi, "Reinforcement learning control for on-ramp metering based on traffic simulation," in Proceedings of the 9th International Conference of Chinese Transportation Professionals (ICCTP '09), vol. 358, pp. 2701–2707, Harbin, China, 2009.
  6. M. Treiber, A. Hennecke, and D. Helbing, "Congested traffic states in empirical observations and microscopic simulations," Physical Review E, vol. 62, no. 2, pp. 1805–1824, 2000.
  7. P. Hidas, "Modelling vehicle interactions in microscopic simulation of merging and weaving," Transportation Research Part C, vol. 13, no. 1, pp. 37–62, 2005.
  8. X. J. Wang, T. Miyagi, A. Takagi, and J. Q. Ying, "Analysis of the effects of acceleration lane length at merging sections by using micro-simulations," in Proceedings of the 7th International Conference of Eastern Asia Society for Transportation Studies, Dalian, China, 2007.
  9. X. J. Wang, T. Miyagi, and J. Q. Ying, "A simulation model for traffic behavior at merging sections in highways," in Proceedings of the 2nd International Conference on Innovative Computing, Information and Control (ICICIC '07), Kumamoto, Japan, September 2007.
  10. X. J. Wang, G. F. Gao, J. J. Chen, and T. Miyagi, "Traffic flow simulation model based on desired speed," Journal of Chang'an University, vol. 30, no. 5, pp. 79–84, 2010.
  11. X. J. Wang, G. F. Gao, J. J. Chen, and T. Miyagi, "Traffic flow simulation model based on adaptive acceleration at merging sections in highways," Highway, vol. 10, pp. 128–132, 2010.
  12. X. J. Wang and T. Miyagi, "Reinforcement learning ramp metering," Journal of Shijiazhuang Tiedao University, vol. 2, pp. 104–108, 2010.
  13. D. Helbing, "High-fidelity macroscopic traffic equations," Physica A, vol. 219, no. 3-4, pp. 391–407, 1995.
  14. M. Treiber, A. Hennecke, and D. Helbing, "Derivation, properties, and simulation of a gas-kinetic-based, nonlocal traffic model," Physical Review E, vol. 59, no. 1, pp. 239–253, 1999.
  15. D. Helbing, A. Hennecke, V. Shvetsov, and M. Treiber, "Micro- and macro-simulation of freeway traffic," Mathematical and Computer Modelling, vol. 35, no. 5-6, pp. 517–547, 2002.
  16. B. S. Kerner, "Synchronized flow as a new traffic phase and related problems for traffic flow modelling," Mathematical and Computer Modelling, vol. 35, no. 5-6, pp. 481–508, 2002.
  17. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, USA, 1998.
  18. H. Boubertakh, M. Tadjine, P. Y. Glorennec, and S. Labiod, "Tuning fuzzy PD and PI controllers using reinforcement learning," ISA Transactions, vol. 49, no. 4, pp. 543–551, 2010.
  19. R. Razavi, S. Klein, and H. Claussen, "A fuzzy reinforcement learning approach for self-optimization of coverage in LTE networks," Bell Labs Technical Journal, vol. 15, no. 3, pp. 153–176, 2010.
  20. D. Vengerov, "A reinforcement learning framework for utility-based scheduling in resource-constrained systems," Future Generation Computer Systems, vol. 25, no. 7, pp. 728–736, 2009.
  21. P. Venkatraman, B. Hamdaoui, and M. Guizani, "Opportunistic bandwidth sharing through reinforcement learning," IEEE Transactions on Vehicular Technology, vol. 59, no. 6, pp. 3148–3153, 2010.