Abstract

The present study proposes a framework for learning the car-following behavior of drivers based on maximum entropy deep inverse reinforcement learning. The proposed framework enables learning the reward function, represented by a fully connected neural network, from driving data that include the speed of the driver’s vehicle, the distance to the leading vehicle, and the relative speed. Data from two field tests with 42 drivers were used. After the participants were clustered into aggressive and conservative groups, their car-following data were used to train the proposed model, a fully connected neural network model, and a recurrent neural network model. Under fivefold cross-validation, the proposed model achieved the lowest root mean squared percentage error and modified Hausdorff distance among the compared models, exhibiting a superior ability to reproduce drivers’ car-following behavior. Moreover, the proposed model captured the characteristics of different driving styles during car-following. The learned rewards and strategies were consistent with the demonstrations of the two groups. Inverse reinforcement learning can thus serve as a new tool to explain and model driving behavior, providing a reference for the development of human-like autonomous driving models.

1. Introduction

Recent studies have suggested that the development of autonomous driving may benefit from imitating human drivers [1–3]. There are two reasons: First, the comfort of autonomous vehicles (AVs) may be improved if their driving styles match the preferences of the passengers. Second, the transition period during which AVs will share the road with human-driven cars is expected to last for decades. Road safety may be enhanced if AVs are designed to understand how human drivers will react in different situations.

Car-following is one of the most common situations encountered by drivers. The modeling of car-following behavior has been a common research focus in the fields of traffic simulation [4], advanced driver-assistance system (ADAS) design [5], and connected and autonomous driving [6–9]. Various car-following models have been proposed since 1953 [10]. In general, there are two major approaches. Classical methods use several parameters to characterize the car-following behavior of drivers [11, 12]. With the rapid development of data science, data-driven methods that focus on learning the behavior of drivers from field data [13, 14] have emerged. Of the two approaches, data-driven car-following models have been found to provide higher accuracy and better generalization ability in replicating drivers’ trajectories.

Among data-driven methods, supervised learning with expressive models, such as neural networks (NNs), has been commonly used to learn the relationships between states and drivers’ controls [15–17]. These modeling techniques are often referred to as behavior cloning (BC). Even though BC approaches have been applied successfully, they are prone to cascading errors [18], a well-known problem in the sequential decision-making literature. The reason is that model predictions become inaccurate when the training data are insufficient. Small inaccuracies accumulate during simulation, eventually driving the model to states not covered by the training data and producing even poorer predictions.

Inverse reinforcement learning (IRL) was introduced to overcome these drawbacks. IRL, which was proposed by Ng and Russell [19], is the inverse problem of reinforcement learning (RL). Although RL has been applied with great success in recent years, such as in the well-known program AlphaGo [20], the use of RL in other domains remains limited because it is challenging to determine the reward, which is the core component of RL. Manual tweaking of reward functions can be tedious, and inappropriate reward assignments may lead to unexpected behaviors [21]. IRL, however, provides a framework for learning the rewards automatically. The advantages of IRL are twofold: the learned rewards can be used to improve the interpretability of the models, and the goals of the tasks can be understood, which may prevent cascading errors [22]. Therefore, the present study proposes a car-following model based on IRL. In contrast to a recent work that applied IRL to car-following modeling with a linear reward representation [23], this study uses a nonlinear function, that is, an NN, to approximate the reward function, as the preferences of human drivers may be highly nonlinear. The proposed model is trained and tested on data collected under actual driving conditions, and its performance is compared with that of other car-following models.

The rest of the paper is organized as follows: Section 2 briefly reviews the literature on car-following modeling, RL, and IRL. Section 3 presents the input feature vectors of the reward network in the IRL framework and the proposed algorithm. Section 4 describes the experiments and data used in this study. Section 5 elaborates on the training process of the proposed model and presents the investigated car-following models. Section 6 compares the performance of the different methods and examines the characteristics of the models trained on data from drivers with different driving styles. The final section presents the discussion and conclusion.

2. Background

The car-following process is essentially a sequential decision-making problem where drivers continually adjust the longitudinal control based on the states they encounter, which include the speed of the driver’s car, the spacing between the driver’s car and the leading car, and the relative speed between the two vehicles. Car-following models are designed to model the policy of drivers.

2.1. Classical Car-following Models

The early General Motors models proposed by Chandler [24] modeled the drivers’ longitudinal controls to minimize the relative speed because this is one of the primary objectives of car-following. These models exhibited poor performance in predicting the distance between cars. Later models addressed this problem by considering another objective of car-following, that is, maintaining the desired distance; these models included the Gipps model [25] and the intelligent driver model (IDM) [12].

2.2. Behavior Cloning Car-following Models

As access to high-fidelity driving data has become increasingly available, data-driven models such as NNs have been used to model car-following behavior. NNs have been demonstrated to exhibit excellent performance in estimating nonlinear and complex relationships. In 2003, Jia et al. [16] proposed an NN-based car-following model with two hidden layers that took speed, relative speed, spacing, and desired speed as inputs. Chong et al. [15] simplified the architecture proposed by Jia to one hidden layer and obtained similar results. Instead of using only a single time step of relevant information as input, as in conventional NN-based models, Zhou et al. [17] proposed a recurrent neural network- (RNN-) based model that used a sequence of past driving information as input. The RNN approach adapted better to changes in traffic conditions than the NN approaches. The present study also uses an RNN-based model to compare its performance with that of the proposed method.

2.3. Reinforcement Learning

In RL, a sequential decision-making problem is modeled as a Markov decision process (MDP), which is defined as a tuple (S, A, T, R, γ). S and A denote the state and action spaces, respectively, and T denotes the transition matrix, which is defined in equation (1). R and γ denote the reward function and the discount factor, respectively. In equation (1), v_t, Δv_t, and d_t denote the speed of the ego vehicle, the relative speed with respect to the lead vehicle, and the spacing between the ego vehicle and the leader at time step t, respectively; Δt is the simulation time interval, which is 0.1 s in this study; and v_t^lead denotes the speed of the lead vehicle, which was obtained from the collected data.
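
For illustration, the following is a minimal sketch of one plausible discretization of the transition in equation (1), assuming a simple Euler update with the 0.1 s step stated above; the exact update used in the paper may differ.

```python
DT = 0.1  # simulation time interval (s), as stated in the text

def step(state, accel, v_lead_next):
    """Advance the car-following state by one time step (sketch).

    state:        (v, dv, d) = ego speed, relative speed (lead minus ego), spacing.
    accel:        ego acceleration chosen by the policy (m/s^2).
    v_lead_next:  lead-vehicle speed at the next step, taken from the recorded data.
    """
    v, dv, d = state
    v_next = v + accel * DT          # ego speed update
    d_next = d + dv * DT             # simple Euler update of the spacing
    dv_next = v_lead_next - v_next   # relative speed at the next step
    return (v_next, dv_next, d_next)
```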

RL assumes that drivers follow a policy that maximizes long-term rewards. Once the rewards are known, the policy can be determined using algorithms such as Q-learning [26]. In recent years, RL has been applied by researchers to solve real-world problems such as the balance control of a robot and the energy management of hybrid electric vehicles [27–29].

2.4. Inverse Reinforcement Learning

In IRL, the reward of a state can be represented by a linear combination of the relevant features, R(s) = θᵀf(s) (equation (2)). The goal of IRL is to determine the weights θ from expert demonstrations.

Abbeel and Ng [30] proposed a feature matching strategy to solve the problem (equation (3)): as long as the feature expectation of the simulated trajectories equals the feature expectation calculated from the expert data, the learned behavior performs as well as the demonstrator. However, many different policies can satisfy the feature matching condition, so the ambiguity about the correct reward and policy remained unresolved.

The maximum entropy IRL (Max-Ent IRL) proposed by Ziebart [31] addressed the ambiguity problem by incorporating the principle of maximum entropy into IRL. In the Max-Ent IRL framework, the probability of a trajectory is proportional to the exponential of the total reward accumulated along the trajectory, P(τ) ∝ exp(Σ_t R(s_t)) (equation (4)). This form of distribution guarantees no additional preferences beyond the feature matching requirement. Given this trajectory distribution, the weights of the reward can be determined by maximizing the log-likelihood of the expert data, L(θ) = Σ_{τ∈D} log P(τ | θ) (equation (5)).

2.5. Maximum Entropy Deep Inverse Reinforcement Learning

Since the linear representation of the rewards might limit the accuracy of reward approximation, Wulfmeier [32] extended the method to nonlinear models using deep NNs. Deep architectures have been shown to capture the nonlinear reward structure in several benchmark tasks with high accuracy. The present study uses the approach of deep architectures to represent the rewards of drivers in car-following. The fully connected NNs used in this study map the input features in the car-following model to estimate the rewards, as shown in Figure 1.

It can be derived that the gradient of the Max-Ent deep IRL (DIRL) objective is ∂L/∂θ = (μ_D − E[μ]) · ∂r/∂θ, where μ_D and E[μ] refer to the state visitation frequencies calculated from the expert demonstrations and the expected state visitation frequencies obtained from the learned policy, respectively, and θ denotes the parameters of the reward network. Once the gradient is calculated, the parameters of the NN are updated using backpropagation [33].
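
As a sketch of how this gradient can be realized with automatic differentiation (PyTorch is assumed here, not the authors' implementation), a surrogate loss can be built whose gradient with respect to the network parameters equals (μ_D − E[μ]) backpropagated through the reward network.

```python
import torch

def dirl_update(reward_net, optimizer, state_features, mu_expert, mu_policy):
    """One Max-Ent DIRL parameter update (sketch).

    state_features: tensor (n_states, n_features), one row per discrete state.
    mu_expert:      expert state visitation frequencies, tensor (n_states,).
    mu_policy:      expected visitation frequencies under the current policy, tensor (n_states,).
    """
    rewards = reward_net(state_features).squeeze(-1)  # r(s) for every state
    # dL/dr = mu_expert - mu_policy, so minimizing the surrogate below with
    # gradient descent reproduces the Max-Ent DIRL gradient w.r.t. the parameters.
    loss = -torch.sum((mu_expert - mu_policy).detach() * rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```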

3. The Proposed Car-following Model

In this section, the details of the proposed model (DIRL) are explained, including the design of the input features for the reward network and the full algorithm. The DIRL model takes as input the drivers’ car-following trajectories, consisting of the speed during car-following, the spacing to the leading car, and the relative speed. After training, the DIRL model outputs the policy and the rewards of drivers. A discrete state and action space were defined in the present study. According to the rules for determining car-following events described in Section 4.2 and the distribution of the empirical data used in this study, the spacing is limited to the range from 0 to 120 m with an interval of 0.5 m, the speed is limited to the range from 0 to 33 m/s with an interval of 0.5 m/s, and the relative speed is limited to the range from −5 to 5 m/s with an interval of 0.5 m/s. The action (acceleration) is limited to the range from −3 to 2 m/s² with an interval of 0.2 m/s².
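
For concreteness, the discrete grids described above could be constructed as follows; this is only a sketch that enumerates the stated ranges and intervals.

```python
import numpy as np

# Discrete state and action grids as described above.
spacing_grid   = np.linspace(0.0, 120.0, 241)   # spacing: 0-120 m, 0.5 m interval
speed_grid     = np.linspace(0.0, 33.0, 67)     # ego speed: 0-33 m/s, 0.5 m/s interval
rel_speed_grid = np.linspace(-5.0, 5.0, 21)     # relative speed: -5 to 5 m/s, 0.5 m/s interval
action_grid    = np.linspace(-3.0, 2.0, 26)     # acceleration: -3 to 2 m/s^2, 0.2 m/s^2 interval
```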

3.1. Feature Selection for the Rewards in Car-following

As introduced in the last section, the input features of the network are determined first to create an NN and obtain the rewards in car-following. The rewards in RL encode the objectives or the purpose of the agent [26]. Therefore, the selected features should represent the objectives of drivers in the car-following task.

In the study of Gao [23], speed and spacing were chosen as features for representing the rewards. In [34], the reward function represented the speed discrepancies between the simulated trajectories and the test data. In contrast to these studies, we base the reward function on the following features.

3.1.1. Time-Headway

Time-headway (TH) has been widely used as an indicator for drivers to evaluate risk during car-following [35]; TH is defined as the time between two vehicles passing the same point on the road. It has been suggested that a driver’s safety margin in car-following can be characterized by the TH, which plays a role in the driver’s decision-making [36]. Drivers may have different desired safety margins for the TH. For example, aggressive drivers may prefer a shorter TH than conservative drivers because they like to track vehicles at a closer distance. It has been suggested that one of drivers’ objectives in car-following is to control TH to their expectations [37]. Therefore, TH is selected as an input of the reward network in this study.

3.1.2. Relative Speed

Research has shown that the drivers’ speed control in car-following is proportional to the relative speed [38]. As mentioned earlier, an objective in car-following is to keep the relative speed close to zero [37]. In this study, we relax this objective so that drivers will keep the relative speed within an appropriate range because people’s driving behavior is imperfect and is not always optimal.

Following the method presented in [23], these two features were mapped into a high-dimensional space using a Gaussian radial kernel, where the kernel centers represent conjectural values of the preferred TH and relative speed, and σ is a parameter that controls the width of the kernel function. Specifically, the TH centers range from 0.5 s to 3 s with an interval of 0.5 s, and the relative speed centers range from −4 m/s to 4 m/s with an interval of 0.5 m/s in this study.
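
The following sketch illustrates this kernel mapping, assuming the standard Gaussian radial-basis form exp(−(x − c)²/(2σ²)); the kernel widths used here are assumptions, not values from the paper.

```python
import numpy as np

TH_CENTERS = np.linspace(0.5, 3.0, 6)     # conjectural preferred time-headways (s)
DV_CENTERS = np.linspace(-4.0, 4.0, 17)   # conjectural preferred relative speeds (m/s)

def rbf_features(th, dv, sigma_th=0.5, sigma_dv=0.5):
    """Gaussian radial-basis features of time-headway and relative speed (sketch).

    sigma_th and sigma_dv are kernel widths; their values here are assumptions.
    """
    f_th = np.exp(-(th - TH_CENTERS) ** 2 / (2.0 * sigma_th ** 2))
    f_dv = np.exp(-(dv - DV_CENTERS) ** 2 / (2.0 * sigma_dv ** 2))
    return np.concatenate([f_th, f_dv])
```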

3.1.3. Maximum Speed

The maximum desired speed is commonly used in many classical car-following models [12, 16]. Drivers may have a preferred maximum speed, and they may not continue to follow the leader if their speed is already above this value. It is therefore assumed that one objective of the driver is to keep the speed below a maximum speed v_max, where v_max denotes the conjectural acceptable maximum speed; v_max is in the range of 90 km/h to 120 km/h, with an interval of 5 km/h. The reward function is then represented by an NN, parameterized by θ, that maps these features to the reward.
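
A minimal sketch of such a reward network is shown below (PyTorch is assumed); the hidden-layer sizes and activation are illustrative assumptions, with the actual hyperparameters given in Table 2.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Fully connected reward network: car-following features -> scalar reward.

    n_features equals the number of TH kernels, relative-speed kernels, and
    maximum-speed features; hidden sizes are illustrative assumptions.
    """
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)
```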

3.2. The Full Algorithm

The proposed DIRL algorithm consists of three parts, which are marked in bold in Algorithm 1. In the first part, the reward determined by the parameters θ of the NN is used to calculate the policy π. Value iteration with a softmax function is used to solve for the policy based on the reward. The result of the softmax version of value iteration is a stochastic policy in which the probabilities of every predefined action are listed in tabular form. V and Q in this part denote the expected long-term return of states and state-action pairs, respectively.

Input: expert car-following trajectories, features of the reward network, learning rate, number of iterations N, number of sampling runs M
Randomly initialize the parameters of the neural network as θ
For i = 1 to N do
 Determine the reward r(s) for every state by applying forward propagation in the neural network
 Use the softmax version of value iteration to obtain the policy π
  Initialize V(s) = 0
  Repeat until V(s) converges
   Q(s, a) = r(s) + γ Σ_{s′} T(s, a, s′) V(s′)
   V(s) = softmax_a Q(s, a)
  π(a | s) = exp(Q(s, a) − V(s))
 Estimate the expected state visitation frequencies E[μ] using the policy π
  For j = 1 to M do
   Start from the initial state of every trajectory and run the policy π
   For every time step, sample one action from the distribution π(a | s) according to the probability of every action
   Propagate the state using the transition in equation (1) and the recorded lead-vehicle speed
   Record the visited states and accumulate the state visitation counts μ_j
  end for
  E[μ] = (1 / M) Σ_j μ_j
 Calculate the gradients of DIRL and the network and use backpropagation to update the parameters of the network
  ∂L/∂r = μ_D − E[μ]
  ∂L/∂θ = (∂L/∂r) · (∂r/∂θ)
  Update θ with the gradients
end for
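
The first part of Algorithm 1 (softmax value iteration) can be sketched for a tabular MDP as follows; the discount factor and convergence tolerance below are assumptions, not values reported in the paper.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(reward, transition, gamma=0.95, tol=1e-4):
    """Softmax value iteration on a tabular MDP (sketch of part one of Algorithm 1).

    reward:     (n_states,) rewards from the reward network.
    transition: (n_states, n_actions, n_states) probabilities T[s, a, s'].
    Returns a stochastic policy of shape (n_states, n_actions).
    """
    n_states, n_actions, _ = transition.shape
    v = np.zeros(n_states)
    while True:
        # Q(s, a) = r(s) + gamma * sum_{s'} T(s, a, s') V(s')
        q = reward[:, None] + gamma * transition.dot(v)
        v_new = logsumexp(q, axis=1)        # V(s) = softmax_a Q(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    policy = np.exp(q - v[:, None])         # pi(a|s) proportional to exp(Q - V)
    return policy / policy.sum(axis=1, keepdims=True)
```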

In the second part, the policy π is applied to estimate the expected state visitation frequencies E[μ]. The original version for estimating E[μ], as reported in [31], is not suitable for car-following tasks because the speed of the lead vehicle is always changing, and simply applying policy propagation [32] to every trajectory in the data can be time-consuming. Therefore, in this study, E[μ] is approximated by sampling, that is, by running the policy M times in simulations of the drivers’ car-following trajectories. During the simulation, the action at every time step was randomly sampled from the policy based on the probability of every action.
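
A sketch of this sampling procedure is given below; the trajectory data structure (keys `initial_state` and `lead_speed`), the helper `state_index`, and the default number of runs are hypothetical names introduced for illustration.

```python
import numpy as np

def estimate_svf(policy, action_values, trajectories, step_fn, state_index,
                 n_states, n_runs=10):
    """Monte Carlo estimate of the expected state visitation frequencies (sketch).

    policy:        stochastic policy, shape (n_states, n_actions).
    action_values: discrete accelerations corresponding to the policy columns.
    trajectories:  recorded car-following events, each assumed to provide an
                   initial (v, dv, d) state and the lead-vehicle speed profile.
    step_fn:       state transition, e.g. the `step` sketch shown earlier.
    state_index:   maps a continuous state to its discrete state id.
    """
    mu = np.zeros(n_states)
    n_actions = policy.shape[1]
    for _ in range(n_runs):
        for traj in trajectories:
            state = traj["initial_state"]
            for v_lead_next in traj["lead_speed"][1:]:
                s = state_index(state)
                mu[s] += 1.0
                a = np.random.choice(n_actions, p=policy[s])      # sample an action
                state = step_fn(state, action_values[a], v_lead_next)
    return mu / (n_runs * len(trajectories))
```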

In the third part, the gradients are calculated by subtracting the estimated E[μ] from the state visitation frequencies μ_D obtained from the data. Subsequently, the parameters of the NN are updated by backpropagation. These steps are repeated until convergence; training can be stopped when the rewards accumulated along the trajectories stop increasing.

4. Experiments

4.1. Data Description

Data from two field tests that were conducted in Huzhou city in Zhejiang province and Xi’an city in Shaanxi province were used in this study. Forty-two drivers participated in the test. Their driving experience ranged from 2 to 30 years with the average being 15.2 years. During the test, the participants were only informed of the starting location and destination, and they were asked to follow their normal driving styles. The test data were collected by a Volkswagen Touran equipped with instruments and sensors, as illustrated in Figure 2. The test route consisted of diverse driving scenarios such as urban roads and highways, as shown in Figure 3. The other details of the field tests are described in [39, 40].

4.2. Extraction of Car-following Events and Data Filtering

We applied the rules described in [41] to extract the car-following events from the obtained data: (1) the test vehicle was following the same lead car throughout the event; (2) the distance to the lead car was less than 120 m, to exclude free-flow traffic conditions; (3) the follower and the leader were in the same lane; and (4) the duration of the car-following event was longer than 15 s.

The extracted events were then manually reviewed by checking the videos recorded by the front camera of the instrumented vehicle to guarantee good data quality. Eventually, nearly one thousand car-following events were extracted. A moving average filter with a 1 s window was applied to remove noise from the extracted car-following data.

4.3. Driving Style Clustering

The participants displayed diverse driving styles, which were evident in the driving data. The k-means algorithm was used to cluster the drivers into different driving styles. Previous studies have adopted kinematic features such as spacing, speed, and relative speed, or time-based features such as TH and time-to-collision (TTC), for driving style clustering [34, 39]. In this study, multiple combinations of these features were tested as inputs for the k-means algorithm, and the quality of the clustering results was evaluated by the silhouette coefficient, where a larger silhouette coefficient indicates a better result. Finally, the mean TH and the mean TH when braking were chosen because this combination achieved the highest silhouette coefficient [42]. The number of clusters was likewise set to two based on the silhouette coefficient. Figures 4 and 5 present the boxplots of the mean TH and the mean TH when braking for the conservative group (16 drivers) and the aggressive group (26 drivers), respectively. The aggressive group had a significantly lower mean TH (t = 6.748) and mean TH when braking (t = 7.655) than the conservative group.
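
A minimal sketch of this clustering step with scikit-learn is shown below, assuming `features` is an array of each driver's mean TH and mean TH when braking; the candidate numbers of clusters and the random seed are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_driving_styles(features, k_candidates=(2, 3, 4)):
    """Cluster drivers by k-means and pick k via the silhouette coefficient (sketch).

    features: array of shape (n_drivers, 2) with mean TH and mean TH when braking.
    Returns (best silhouette score, chosen number of clusters, cluster labels).
    """
    best = None
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        score = silhouette_score(features, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best
```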

The descriptive statistics (Table 1) of the two groups confirmed the clustering results. The aggressive drivers had shorter mean spacing and higher mean speed and mean acceleration than the conservative drivers.

5. Model Training and Evaluation

5.1. Evaluation Metrics

Two metrics, the root mean square percentage error (RMSPE) (equation (10)) and the modified Hausdorff distance (MHD), were used to evaluate the accuracy of the car-following models in reproducing drivers’ car-following trajectories. As suggested by Punzo and Montanino [43], the cumulative sum of the errors is an appropriate option for evaluating the performance of car-following models. In equation (10), RMSPE_v denotes the RMSPE of speed, RMSPE_d denotes the RMSPE of spacing, v_n(t) and d_n(t) are the observed speed and spacing at time t in the nth trajectory, and v̂_n(t) and d̂_n(t) are the simulated speed and spacing at time t for the nth trajectory.
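
As a reference, a minimal per-series RMSPE can be computed as below; the exact aggregation over trajectories in equation (10) may differ from this sketch.

```python
import numpy as np

def rmspe(observed, simulated):
    """Root mean square percentage error between observed and simulated series (sketch)."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return np.sqrt(np.mean(((simulated - observed) / observed) ** 2))
```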

The MHD is an extension of the Hausdorff distance, which represents the distance between two sets of points (equation (11)). The median of the MHD (MHD50) has been used to evaluate the similarity of simulated and actual trajectories in modeling defensive driving strategies [44] and urban route planning [45].
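
The sketch below uses the common Dubuisson–Jain definition of the modified Hausdorff distance (maximum of the two mean directed distances); equation (11) in the paper is assumed to follow this form.

```python
import numpy as np

def modified_hausdorff(a, b):
    """Modified Hausdorff distance between two point sets (sketch).

    a, b: arrays of shape (n_points, dim), e.g. simulated and observed
    (speed, spacing) trajectories.
    """
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    d_ab = dists.min(axis=1).mean()   # mean distance from each point of a to set b
    d_ba = dists.min(axis=0).mean()   # mean distance from each point of b to set a
    return max(d_ab, d_ba)
```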

Since the proposed DIRL model outputs a stochastic policy, the two metrics were calculated by averaging the results of 10 simulations for every trajectory in the data.

5.2. Model Training

The k-fold cross-validation method was applied to evaluate the performance of the car-following models. Specifically, the car-following datasets of the two groups of drivers were randomly divided into five groups with an equal number of trajectories. One group was taken as the test set, and the remaining four groups were taken as the training set. The training and test experiments were repeated five times so that every group served as the test set once. Finally, the performance of the car-following models was evaluated by the average values of the two metrics.
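
This split could be implemented, for instance, with scikit-learn as in the sketch below, assuming `trajectories` is the list of extracted car-following events.

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
indices = np.arange(len(trajectories))
for train_idx, test_idx in kf.split(indices):
    train_set = [trajectories[i] for i in train_idx]
    test_set = [trajectories[i] for i in test_idx]
    # train each model on train_set, then compute RMSPE and MHD50 on test_set
```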

The Adam optimizer [46] with learning rate decay was applied to train the DIRL model. The hyperparameters used for training are listed in Table 2. L2 regularization was used to prevent overfitting of the reward network.

Figures 6 and 7 present the change in the RMSPE of spacing and the change in the cumulative normalized rewards per trajectory in one of the cross-validation experiments, respectively. After about five iterations, the RMSPE of spacing for the training and test sets started to converge, and the rewards collected along the trajectories remained stable after about the same number of iterations.

5.3. The Investigated Models

The accuracy and generalization ability of the proposed model were compared with those of two other data-driven car-following models, that is, the NN-based and RNN-based models.

5.3.1. NN-Based Car-following Model

A fully connected neural network with one hidden layer was built following the study by Chong et al. [15]. The hidden layer consisted of 60 neurons in this study. The NN-based model takes speed, spacing, and relative speed as inputs and outputs the acceleration for the current time step. The model was trained with the objective of minimizing the error between the empirical accelerations and the model’s predictions (equation (12)), where w denotes the weights and biases of the NN-based model, â_n(t) denotes the predicted acceleration at time step t for the nth trajectory, and a_n(t) denotes the empirical acceleration at time step t for the nth trajectory.
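
A sketch of this baseline is given below (PyTorch is assumed); the activation function and the use of a squared-error loss for equation (12) are assumptions consistent with, but not quoted from, the description above.

```python
import torch
import torch.nn as nn

class NNCarFollowing(nn.Module):
    """Behavior-cloning baseline: (speed, spacing, relative speed) -> acceleration.

    One hidden layer with 60 neurons, as stated in the text; the activation
    is an assumption.
    """
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 60), nn.Tanh(), nn.Linear(60, 1))

    def forward(self, x):
        return self.net(x)

# Training objective sketch for equation (12): squared error between
# predicted and empirical accelerations.
loss_fn = nn.MSELoss()
```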

5.3.2. RNN-Based Car-following Model

The architecture of the RNN-based model built in this study is in line with the study by Zhou et al. [17]. The number of hidden neurons in the RNN model was set to 60. The RNN model takes as input a sequence of historical information spanning 1 s and outputs the acceleration for the current time step. The speed and spacing for the next time step were then estimated based on the state transition described in equation (1). The training of the RNN model adopted the loss function shown in equation (13), which minimizes the RMSPE of speed and spacing, where w denotes the weights and biases of the RNN model, v_n(t) and d_n(t) are the observed speed and spacing at time t in the nth trajectory, and v̂_n(t) and d̂_n(t) are the simulated speed and spacing at time t for the nth trajectory.
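
A minimal sketch of such an RNN baseline follows (PyTorch is assumed); the recurrent cell type and the use of the last hidden state are assumptions, while the 60 hidden units and the 1 s input history match the description above.

```python
import torch
import torch.nn as nn

class RNNCarFollowing(nn.Module):
    """RNN baseline: 1 s history of (speed, spacing, relative speed) -> acceleration."""
    def __init__(self, input_size=3, hidden_size=60):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, 10 steps at 0.1 s, 3 features)
        out, _ = self.rnn(x)
        return self.head(out[:, -1, :])   # acceleration for the current time step
```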

6. Results

6.1. Performance Comparison

The average performances of the three models in the fivefold cross-validation tests using the data from the aggressive and conservative groups are compared in this section. Tables 3 and 4 present the results on the training sets and the test sets, respectively. The DIRL model had the lowest RMSPE of spacing and the lowest MHD50 on both the training and test sets. Although the NN and RNN models had a lower RMSPE of speed on the test sets, the overall error of the DIRL model in reproducing drivers’ trajectories was lower than that of the other two models. Of the two BC models, the RNN outperformed the NN model, achieving lower RMSPE and MHD50.

Figure 8 presents the simulation results of speed and spacing for two car-following periods randomly selected from the datasets. As can be seen, the DIRL model tracks the empirical speed and spacing more closely than the other two models. The simulation results of speed for the NN and RNN model are smoother than those of the DIRL model because the former models output a continuous action, while the latter model outputs a discrete action.

6.2. The Learned Characteristics of the Model

Since the proposed model was trained with data from two groups of drivers with different driving styles, we expected the learned models to exhibit the features of both groups. Therefore, the learned value functions of the two driving styles, which represent the expected long-term return, are compared in this section. As depicted in Figure 9, states with a higher value represent the preferable states that drivers try to reach during car-following. For the same distance to the lead vehicle, the aggressive drivers preferred a higher speed than the conservative drivers. The high-value area (shown in red) for the aggressive drivers has a steeper slope, as indicated by the angle between the black-dashed line and the x-axis. Since the cotangent of this angle is proportional to the TH, a larger angle means a shorter TH. Hence, the comparison of the angles in the two figures shows that the aggressive drivers favor a shorter TH. In addition, the high-value area for the aggressive group is wider than that for the conservative group, indicating that the aggressive drivers’ preferred TH has a larger variance than that of the conservative drivers. This result agrees well with the boxplot of TH for the two groups of drivers in Figure 4.

It is also found that, in both figures, the high-value region of the speed becomes wider as the spacing to the lead vehicle increases. The interpretation is that when the spacing is small, drivers must control their speed more precisely to avoid a collision, whereas, as the distance increases, drivers have more flexibility in speed control.

The learned policies of the two groups were compared by assuming that both groups were following the same leader. The initial states of this car-following event and the speed of the leader were taken from the collected data. The learned stochastic policy was run 20 times for both groups. As shown in Figure 10, the aggressive group (in blue) maintained a smaller distance to the leader than the conservative group (in red) during the simulation. Both the aggressive and conservative drivers accelerated to follow the leader. However, the aggressive drivers increased their speed more quickly in the first 4 s, resulting in a smaller distance to the leader than the conservative drivers.

7. Discussion and Conclusion

In this study, we propose a car-following model based on Max-Ent DIRL. The proposed model learns the rewards of drivers during car-following, which are approximated by an NN. The policy of drivers was solved by an RL algorithm, the softmax version of value iteration. Tested on actual driving data, the proposed model outperformed the BC models (NN and RNN), providing the lowest RMSPE and MHD50 in replicating drivers’ car-following trajectories. The better performance of the proposed model can be explained by its more general objective compared with the BC models. The DIRL model reproduces drivers’ policy by first learning drivers’ decision-making mechanisms (i.e., the rewards), whereas the BC approaches only learn the state-action relationships. Since the policy was solved by an RL algorithm based on the assumption of maximizing long-term rewards, the obtained policy has the ability of long-term planning. In contrast, the BC methods do not include long-term planning in their training objectives. The simulation results for the two car-following trajectories confirmed the superior long-term planning ability of the DIRL model: the deviation between the simulated spacing and the empirical data for the BC models becomes larger as the simulation continues, whereas the simulation error does not accumulate for the DIRL model. Moreover, the better performance of the RNN model found in this study is in line with previous studies [17, 34]. Compared with the NN model, which relies only on information from the current time step for prediction, the use of historical information makes the RNN model more suitable for time series prediction.

The present study also demonstrates that the proposed model could capture the characteristics of different driving styles of human drivers. The learned value and policy matched those of the drivers with distinct driving styles. The fully connected NN applied in this study was trained to capture the relevant features that represented the drivers’ preferences or objectives in car-following scenarios.

The IRL method used in this study provides a new perspective to explain driver behavior and to model different driving strategies. However, solving the IRL problem is computationally expensive, which makes it challenging to apply to high-dimensional systems. Recent studies that have applied adversarial learning to IRL have shown an ability to scale the method to solve complex problems [22, 47]. Future studies should consider these new approaches.

The present study had some important limitations. First, the participants were all male, so a broader sample is needed in future research. Second, the proposed model does not consider drivers’ reaction delay and memory effect in speed control during car-following. Future studies should take these factors into account.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was jointly supported by the National Key R&D Program of China under Grant 2019YFB1600500, the Changjiang Scholars and Innovative Research Team in University under Grant IRT_17R95, the National Natural Science Foundation of China (51775053 and 51908054), and the Fundamental Research Funds for the Central Universities (300102228506).