Abstract

Day-to-day traffic dynamics are generated by individual traveler’s route choice and route adjustment behaviors, which are appropriate to be researched by using agent-based model and learning theory. In this paper, we propose a day-to-day route choice model based on reinforcement learning and multiagent simulation. Travelers’ memory, learning rate, and experience cognition are taken into account. Then the model is verified and analyzed. Results show that the network flow can converge to user equilibrium (UE) if travelers can remember all the travel time they have experienced, but which is not necessarily the case under limited memory; learning rate can strengthen the flow fluctuation, but memory leads to the contrary side; moreover, high learning rate results in the cyclical oscillation during the process of flow evolution. Finally, both the scenarios of link capacity degradation and random link capacity are used to illustrate the model’s applications. Analyses and applications of our model demonstrate the model is reasonable and useful for studying the day-to-day traffic dynamics.

1. Introduction

In the past decades, day-to-day traffic dynamics (DDTD) have been developed substantially in the field of transportation, which are mainly used to study the traffic fluctuations and the evolution process, rather than the final or static equilibrium state [1]. Approaches used to study DDTD are very flexible because they allow a wide range of behavior rules, levels of aggregation, and types of models to be integrated in the same modeling framework [2]. Some researchers use deterministic models or stochastic models to study the day-to-day flow evolution. Smith and Friesz et al. proposed three dynamic systems by the deterministic traffic assignment processes [3, 4]. The flow evolution trajectory can be provided explicitly in deterministic models. Daganzo and Sheffi put forward the stochastic user equilibrium to model the evolution of the traffic pattern [5]. Stochastic models mainly focus on the probability distribution of flow states [6, 7]. In some literatures, traffic dynamics are modeled as continuous time processes [3, 4]. However, it is not reasonable to adjust traffic flow continuously due to travelers’ activities constraints. Therefore, more appropriate methods appear by using discrete-time systems, in which travelers repeat route choice daily or weekly [2, 8]. The flow convergences of continuous time and discrete time are very different [9].

Another important branch of DDTD is the research about travelers’ behaviors. The uncertainty inherent of travelers’ behaviors increases the complexity of models about DDTD, which makes traditional models unable to represent complex DDTD well [10]. This motivates much research and efforts to solve the problem by using new methods. Among these new methods, agent-based techniques are very appropriate approaches. An agent knows its internal state, the evolution of system, the likely outcomes of its actions, and so forth, and it has some abilities, such as independent decision-making, beliefs, desires, and intentions. These characteristics ensure the scalability and robustness of agent-based models, which are important to portray the complexity of DDTD. Some scholars applied agent-based models to analyze travelers’ behaviors. Rossetti modeled commuters’ behaviors in the structure of beliefs, desires, and intentions (BDI) which is a classical structure in multiagent simulation. In their model, travelers could make decisions about route choice and departure time rationally [11]. Nakayama et al. proposed some models about travelers’ route choice behaviors. They took into account the limitations of travelers’ cognitive capabilities and used microsimulation to examine the dynamic nature of travelers’ route choice. They found that the network flow does not necessarily converge to the user equilibrium (UE) and travelers’ cognitions of each route do not become homogeneous by learning [1214]. Jha et al. assumed that each traveler used a disutility function to perceive travel time and schedule delay for evaluating the alternative travel choices, and then chose an alternative route according to the utility maximization principle [15]. Chen and Mahmassani modeled travelers’ leaning process by using Bayesian theory and an agent-based simulation framework to study networks’ dynamic properties [16]. Klügl and Bazzan used a simple heuristic model to stimulate the process of travelers’ decision-making and studied the effect of traffic forecast [17]. They set up an agent-based model to stimulate travelers’ route choice and found that travelers could avoid the Braess Paradox by learning [18]. Dia applied the agent-based approach to research drivers’ behaviors under the influence of real-time information [19]. Liu and Huang used multiagent stimulation to investigate the differences of route update rules in two kinds of scenarios, which are traffic information unrelease and release, respectively [20]. Illenberger et al. proposed a model to study travelers’ risk-sensitive route-choice behaviors [21]. Gao and Wang explored travelers’ route choice under guidance information by using microsimulation [22].

Travelers’ decision-marking is an important aspect in agent-based models. DDTD is not a process of one-short choice but a process of repeated decisions. Travelers can perceive travel time and the network characteristics from their travel experience. It is hard for them to make the best choice every time due to their bounded rationality and the uncertainty of environment. They obey a stochastic route choice rule. If the travel time of a path was short in the past days, its probability to be chosen is big in the current day. On the contrary, the probability is small. These characteristics are similar with the features of the reinforcement learning (RL) theory. Under RL, if an action yielded a high payoff in the past days, the probability assigned to it increases in the current round, or the behavior associated with the action gets reinforced [23]. RL asserts that decision makers’ behaviors are moved towards the direction of random choice due to the environmental uncertainty [24]. These similarities make RL very suitable to analyze travelers’ behaviors. However, it has been seldom used in this field. Ben-Elia proposed a learning-based model of route choice and found that information could improve travelers’ learning rate and made them more prone to risk-seeking by a laboratory experiment [25]. Wahba and Shalaby proposed a departure time and route choice model based on the Markovian decision process and RL [2628]. Zolfpour-Arokhlo et al. formulated a route planning system by combining value-based dynamic programming with RL. They created a priority route plan for vehicles by studying the weights of some components in road network environments such as road safety, traffic, and weather [29]. The applications of RL in route choice still need to be further researched.

In this paper, we propose a day-to-day route choice model based on RL. Travelers’ memory level, learning rate, and cognition based on their travel experience are taken into account in the model. The decision-making and learning process of travelers follow the principles of the classical Bush-Mosteller (BM) model in RL.

The rest of the paper is organized as follows. Section 2 introduces some basic knowledge of BM model and then our day-to-day route choice model based on RL is formulated. Section 3 verifies the rationality of our model and analyzes its properties. Section 4 demonstrates the applications of our model in two traffic scenarios.

2. The Model

2.1. The Process of Travelers’ Route Choice

Day-to-day route choice behavior is a repetitive decision process. Each traveler is an agent with the ability of learning and decision-making. They follow a stochastic route choice rule because of their bounded rationality and the environmental uncertainty. Travelers’ route choice and travel time are not known by each other when there is no external information provided. Expected travel time (ET) and perceptive travel time (PT) are generated according to their experience. Travelers judge their choice by comparing their ET and PT after each travel and then update the probabilities that the paths will be chosen repeatedly. The characteristics of the process are similar with the features of the Bush-Mosteller (BM) model, which is a classical model in reinforcement learning (RL) and will be introduced in the next section. The process of travelers’ route choice can be shown in Figure 1.

2.2. Model Formulation
2.2.1. The Bush-Mosteller Model

The Bush-Mosteller (BM) model is a classical learning model proposed by Bush and Mosteller in 1973. It consists of a learning algorithm and a stochastic decision rule. The consequences of a decision can create positive or negative stimulus, which updates the probability that the decision will be repeated [30]. Actions leading to satisfactory outcomes (i.e., outcomes that meet or exceed aspirations) will tend to be repeated in the future, whereas choices resulting in unsatisfactory experiences will be avoided. In BM model, each player focuses on his past choices and payoff but ignores all the choices and payoff of other players [31].

In the BM model, action updating takes place in two steps. Firstly, at each time , the player chooses an action and calculates his stimulus by using the following formula:

is the payoff of the action and is the aspiration level of the player at time . denotes any possible payoff if he chooses other strategies (assuming that players know all the potential payoff). The denominator in (1) represents the upper value of the set of possible differences between payoff and aspiration. Secondly, the player updates the selected probability of the action as follows:

is the selected probability of the action at time , is the learning rate (), and the changes of the probabilities for the actions not selected are derived from the constraint that probabilities always add up to unity.

Similarities make the BM model very suitable to research the day-to-day route choice behaviors. In the next section, a day-to-day route choice model will be proposed based on the BM model.

2.2.2. A Day-to-Day Route Choice Model

We consider a transportation network which is a fully-connected directed graph denoted as , consisting of a set of nodes and a set of links . Let be the set of OD pairs, let be the fixed travel demand between the OD pair , let be the set of paths connecting the OD pair , let be the flow on the path on day , and let be the flow on the link on day ; , , and denote demand, path flow, and link flow vectors, respectively; the link-path incidence matrix is denoted as , then ; let be the OD-path incidence matrix, then . The link travel time vector and the path travel time vector are denoted as and , respectively, then [1].

Due to bounded rationality and environmental uncertainty, travelers obey a stochastic route choice rule: the bigger the chosen probability of a path, the more likely the path is selected by the traveler. Supposing the number of optional paths for the traveler in the OD pair is . Let be the set of the chosen probabilities of paths for him on day , . A random number can be generated by computer, if then the traveler chooses the path and the flow of the path adds one.

Each traveler only knows the travel time of the path he chooses on day . However, they can form perceptive travel time (PT) of all the paths based on their travel experience. So a traveler’s PT of a path can be used as its cost during his route choice process. Due to the particularity of route choice process, we make some modifications to (1) as follows:

is the stimulus of the traveler when he chooses the path on day . is his expected travel time (ET) and is his perceptive travel time (PT) of the path . is his any possible PT if he chooses other path. The two denominators in (4) donate the biggest benefit and the biggest loss that the traveler may get on day . Piecewise function can reflect the degree of the traveler’s satisfaction or dissatisfaction to his route choice better.

ET and PT are formed based on travelers’ travel experience. Due to the limit of human’s memory, travelers remember recent travel time more clearly than older ones. So the travel time on different days has different weight on their ET and PT. The weighted average of all the travel time that a traveler has experienced can be used to calculate his ET and PT. Let be the memory level of travelers (), then ET and PT of the traveler can be formulated as follows:

is the ET of the traveler on day . is the travel time that he has experienced on day . is formed before the travel on day . Consider

is the PT of the traveler to the path after the travel on day . is a 0-1 variable. If he chooses the path on day , then ; else . Because only one path can be chosen by him on day , then

is the number of paths that he can choose. So the PT of the path which is chosen by him on day can be formulated as

Because the traveler does not know the actual travel time of the paths that he does not choose on day , it is reasonable to use PT of these paths as their cost. Because PT is the weighted average of travel time that travelers have experienced, it can reflect the stability of a path’s travel time in a long time. Considering the contingency of once travel time, it is not appropriate to use the actual travel time of the path chosen on day as its evaluation criterion, therefore we also use its PT as its cost.

Stimulus on day can be calculated by (4) to (8). Then the probability of the path chosen by the travel on day can be updated by (2). In order to ensure all probabilities always add up to unity, the probabilities of those paths not chosen by the travel on day are updated as follows:

is the stimulus of the traveler when he chooses the path on day . is the chosen probability of the path and is the chosen probability of the path .

3. Model Validation and Model Properties

In the previous section, we proposed a day-to-day route choice model. In this section, some numerical stimulations are made to validate the model by using three different types of test networks. The properties of the model are also analyzed.

3.1. The Simulation Program and the Three Test Networks
3.1.1. The Simulation Program

The simulation program is briefly described as follows:

Step 0. (Parameters Initialization). Generate initial chosen probability of each path for every traveler and specify the value of and .

Step 1. Generate a random number of 0 to 1 for every traveler. Travelers choose routes according to the rule mentioned in (3).

Step 2. Count the flow of each route. Compute the travel time of each route.

Step 3. Calculate every traveler’s expected travel time and perceptive travel time with (5) to (8).

Step 4. Compute the stimulus of every traveler with (4).

Step 5. Update the chosen probability of each path for every traveler with (2) and (9).

Step 6. Cycle to perform Step 1 to Step 5 within the prescribed number of days.

3.1.2. The Three Test Networks

Three test networks, which are used commonly in transportation literatures, are chosen as our numerical examples. The first test network is composed of three links as shown in Figure 2. The total OD demand is 300. Links’ free flow travel time and capacities are , , and ; , , and . The average link travel time function is assumed to be the BPR function. Consider is the free flow travel time of the link , is the link capacity, and is the link flow.

The second test network is a grid network, which is constituted by 9 nodes, 12 links, and 6 routes [1]. Node and link numbers are shown in Figure 3. The total OD demand is 9000. The free flow travel time and capacities of the 12 links are given in Table 1.

The third test network named Nguyen and Dupuis’ network has 13 nodes and 19 links as shown in Figure 4. There are four O-D pairs in the network, 1-2, 1-3, 4-2, and 4-3; the numbers of paths between those O-D pairs are 8, 6, 5, and 6, respectively, as shown in Table 2. The O-D demands are deterministic and given as , , , and [32]. The link characteristics of the network are shown in Table 3.

3.2. Model Validation

As mentioned in our model, the memory level of travelers is (). The value of can affect the weight of the past travel time on travelers’ current expected travel time (ET) and perceptive travel time (PT). Travelers weight recent travel time more than older ones. The smaller the -value, the less the travel time of the past days influence travelers’ current ET and PT. With the increase of -value, more travel time of the past days has influence on their ET and PT. When , it means that travelers can remember all travel time they have experienced and the travel time of each day has the same effect on ET and PT.

It is known that user equilibrium (UE) assumes that travelers are perfectly rational and know all information. So the conditions of reaching to UE are ideal. Obviously, is also an ideal state in our model. If the flow distribution in our model can converge to UE under , then our model can be verified to be reasonable. The following are the stimulation results of the three test networks under the condition of , ( is the learning rate).

In the first network, as shown in Figure 5, the flow fluctuations of the three paths become very weak on the 60th day and the network flow converges to equilibrium on the 210th day. The travel time of the three paths is approximately equal under the equilibrium. So the network flow converges to UE. Similarly, the flow of the second network converges to UE in Figure 6. The flow in the four OD pairs of the third network also converges to UE as shown in Figures 7, 8, 9, and 10, and there are almost no flow on the paths 6, 7, 8, 13, and 14 under UE.

As shown above, the three test networks can all converge to UE when and . In order to study whether the learning rate can affect networks to converge to UE under the ideal state of , we set three different values of : , , and . The second test network is used as the stimulation example. The corresponding day-to-day dynamics are shown in Figures 11 and 12. It can be seen that the network flows all converge to UE under different when . And the convergence rate becomes faster with the increase of . So the learning rate can impact the convergence rate but cannot influence networks to converge to UE when .

The results above not only illustrate that the network flow evolution based on our model can converge to UE when travelers’ memory level is ideal , but also show that our model can demonstrate the process of flow and travel time evolution. And the network flow does not converge to UE smoothly but reach to UE after an oscillation period (as shown in Figure 6). These results validate the rationality of our model.

3.3. Model Properties

The rationality of our model has been validated in Section 3.2. In this part, its properties will be illustrated by analyzing the two parameters in the model ( and ). The second test network is used as the simulation example in this section.

3.3.1. Impact of Memory Level on Day-to-Day Traffic Dynamics

In order to see the impact of on day-to-day traffic dynamics, we fix and set three different values of : , , and . The corresponding flow evolution and travel time evolution are shown in Figures 13 and 14. It can be seen clearly that the network flow does not converge to UE when and , it converges to UE approximately as . And the flow fluctuations of and are strong, but that is weak relatively when .

The flow standard deviations (FSD) in the last 100 days can be used as the statistic to measure the flow fluctuations. We fix the value of and increase the value of from 0 to 1. Due to the length limitation of the paper, only four deterministic values of are listed (, , , and ). The corresponding changes of FSD are shown in Figure 15. After summarizing all simulation results, we find that when , the flow fluctuations weaken obviously with the increase of , but when , only higher memory level () can reduce the fluctuations obviously.

These phenomena above are consistent with the physical meaning of : travelers can cognize travel time better with the increase of . It is helpful to reduce the randomness of their route choice and make them do better choice. So a bigger can weaken the flow fluctuations and promote the convergence of the network flow.

3.3.2. Impact of Learning Rate on Day-to-Day Traffic Dynamics

To see the impact of on day-to-day traffic dynamics, we fix and set three different values of : , , and . As shown in Figures 16 and 17, we can see that the network flow converges to UE approximately only when and . The flow fluctuations are strong under and . When and , the flow distribution deviates from the UE for several days after the network flow converges to UE, then comes back to UE again. And the process is repeated. This phenomenon is called “cyclical oscillation” by us. The cyclical oscillation found in the flow evolution is not an accidental phenomenon, which appears in a lot of stimulations with a high learning rate (such as , ; , ; , and ). The corresponding flow evolutions can be seen in Figure 18.

We fix the value of and increase the value of from 0 to 0.99 for studying the impact of on flow fluctuations. From Figure 19, we can see that when , the flow fluctuations strengthen with the increase of . But when , there are two inflection points (IP) in the process of FSD changes. With the increase of , the flow fluctuations strengthen before the first IP, and then weaken until the second IP and strengthen again after the second IP. For example, when , the two IP are and , respectively. when and , the flow fluctuations strengthen with the increase of . But when , the flow fluctuations weaken with the increase of .

The impact of on day-to-day traffic dynamics can be explained: the larger the , the more the paths’ chosen probabilities are adjusted, which strengthens the flow fluctuations naturally. Memory level is helpful for weakening the flow fluctuations as mentioned in Section 3.3.1. So the effect of memory level and learning rate on day-to-day dynamics is opposite. Their opposite effect leads to the first IP in the process of FSD changes in Figure 19; the second IP is generated by the cyclical oscillation of flow evolution which appears under high learning rate.

4. Traffic Scenarios Analyses

In this section, in order to illustrate the applications of our model, two particular traffic scenarios are analyzed by using the third test network and some discussions are provided based on the numerical results. The two parameters of our model are fixed as , in this section.

4.1. Link Capacity Degradation Scenario

Assuming that the capacity of the first link in the third test network reduces 50% on the 1001th day because of road maintenance, and recovers on the 2001th day [33]. The day-to-day traffic dynamics are shown in Figures 20 and 21 (different color lines represent different paths, which can be seen in Figures 7 and 8). We can see that the network flow distributions in the four OD pairs all converge to UE approximately before the link capacity reduction. After the capacity reduction, the flow distributions in all OD pairs are changed. The changes in the OD pairs 1-2 and 1-3 are bigger than those in the OD pairs 4-2 and 4-3. These are consistent with our intuition: OD pairs 1-2 and 1-3 both contain the first link and, thus, should be more severely affected by the link capacity reduction. The first link does not belong to OD pairs 4-2 and 4-3, so the two OD pairs should be influenced gently. When the first link’s capacity recovers, initial network equilibrium is restored. This is because when the capacity recovers, travelers can form similar cognition on the network with that formed before the capacity reduction. So the initial network equilibrium can be restored by revoking the link capacity degradation. Other point needs to be noticed: the first link’s capacity reduction strengthens the flow fluctuations in the OD pairs 1-2 and 1-3. These suggest that the capacity reduction can reduce the stability of the network.

4.2. Links’ Capacities as Random Variables Scenario

Links’ actual capacities may be influenced by many factors, such as bad weather, traffic accidents, pedestrians’ crossing street behaviors, traffic violations, and road maintenance. All these factors can reduce links’ capacities. So links’ actual capacities are no longer deterministic values but random variables [34].

Let be the actual capacity of the link , let be the minimum capacity, and let be the maximum capacity. Assume that every link’s actual capacity follows a uniform distribution, which means .

In order to study the impact of links’ actual capacities’ random variations on the day-to-day traffic dynamics, the dynamics of deterministic capacities are used as reference. Let deterministic capacities be equal to the expected values of the actual capacities, let denote the deterministic capacity of the link , and let denote the expected value of the actual capacity of the link . Then

The mean and standard deviation of paths’ flow and travel time in the last 100 days are used as comparison statistics. The third test network is used as the numerical example under the condition of , . Deterministic capacities use links’ capacities in Table 3. Take the statistical results of the OD pairs 1-3 and 4-2 as examples to show the differences between deterministic capacities and random capacities on day-to-day dynamics. The results of deterministic capacities and random capacities are shown in Tables 4 and 5, respectively.

Because paths’ mean travel time equal approximately and their standard deviation are not big in Tables 4 and 5, the network flow under the two kinds of links’ capacities both converge to UE approximately. But the UE of the two cases are different, because the mean path flow distributions are not the same by comparing Tables 4 and 5. So the random variations of links’ capacities change the network’s paths flow equilibrium. According to the mean flow of paths in Tables 4 and 5, links’ mean flow under deterministic link capacities and random link capacities can be calculated as the following formula:

and represent the mean flow of the link and the path in the last 100 days. If the link belongs to the path , then , else . There are 19 links in the third test network. The calculation values of links’ mean flow are shown in Figure 22. It can be seen clearly that the mean flow of links under the two kinds of links’ capacities is almost equal. So the random variations of links’ capacities almost do not change the equilibrium of links flow, but it can change the equilibrium of paths flow as mentioned above.

In addition, the random variations of links’ capacities make the mean travel time of paths become longer, which can be explained as follows.

We assume above, so the travel time of the link is also a random variable. Let be the probability density function of and let be the travel time of the link . denotes its expected travel time. Then

Because , and , then

The travel time of the path can be represented as

So the expected travel time of the path is

Formula (14) means that when the deterministic capacities are equal to the expected values of the actual capacities, the expected travel time of links under actual capacities are longer than those under deterministic capacities. So the expected travel time of paths under actual capacities is also longer than those under deterministic capacities according to Formula (16). This is why the mean travel time of paths in Table 5 is longer than those in Table 4.

Another phenomenon is that when links’ capacities change from deterministic values to random values, the flow of the paths which have the smallest mean travel time decreases by comparing Table 5 with Table 4. It is because the travel time standard deviations of these paths are bigger than most of other paths. Then the fluctuations of the paths’ travel time are stronger, respectively. In other word, the stabilities of these paths’ travel time are not better. So the flow on them decreases. It suggests that travelers not only consider the mean travel time of a path, but also consider the stability of the path travel time.

5. Conclusions

In this paper, we propose a day-to-day route choice model based on reinforcement learning. Travelers’ memory level and learning rate are incorporated into the model. Travelers obey a stochastic route choice rule and use their experience-based cognition to evaluate their choice and then update the chosen probabilities of paths.

The rationality of the model is verified firstly. We assume that travelers can remember all the travel time they have experienced. Under the ideal hypothesis, the flow evolution based on our model in three different types of networks can all converge to UE. It is known that UE is equilibrium under ideal condition. So our model is verified to be reasonable. Secondly, the properties of the model are analyzed. And we find that the network flow does not necessarily converge to UE under limited memory level. High learning rate can lead to cyclical oscillation in the process of flow evolution (as shown in Figure 18). Memory level can weaken the flow fluctuations, but learning rate can strengthen the fluctuations. When memory level is larger than 0.6, there are two inflection points in the process of flow fluctuations with the increase of learning rate (as shown in Figure 19). The first inflection point is caused by the opposite effect of memory level and learning rate on flow fluctuations. The second inflection point results from the cyclical oscillation which appears in flow evolution with high learning rate.

Two particular traffic scenarios are analyzed to illustrate the applications of our model. The first scenario is link capacity degradation. We find that the initial network equilibrium can be restored by revoking the link capacity degradation in our model. It is because when the capacity recovers, travelers can form similar cognition on the network with that formed before the capacity reduction. The second scenario is that links’ capacities vary randomly because of bad weather, accident, and so forth. The results in this scenario indicate the following: random variations of links’ capacities can change the equilibrium of paths flow but cannot change the equilibrium of links flow, and these also increase the mean travel time of paths; when travelers choose route, they think about not only the mean travel time of paths, but also the stability of paths’ travel time. Analyses and applications of our model demonstrate the model is reasonable and useful for studying the day-to-day traffic dynamics.

Some phenomena in the paper need to be further researched, such as the cyclical oscillation in the flow evolution as well as the influence of links’ capacities’ random variations on the network equilibrium. And in addition to experience, memory level, and learning rate, there are many other factors that can influence travelers’ route choice, such as travel habits, travel purpose, and traffic information. The comprehensive influence of these factors on travelers’ behaviors is also our future research. At the same time, empirical studies are necessary to investigate travelers’ route choice behaviors.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The Project is supported by National Natural Science Foundation of China (Grant nos. 70971094, 50908155, 71101102, and 71271150).