A finite mixture of logistic regression (FMLR) model was applied to analyze heterogeneity within the merging driver population. The model automatically reveals hidden information about the characteristics of the driver population. The EM and Newton-Raphson algorithms were used to estimate the parameters. The FMLR model was applied to a trajectory dataset extracted from the NGSIM dataset, and a 2-component FMLR model was identified. The important findings can be summarized as follows. The studied drivers can be classified into two components. The first, called Risk-Rejecting Drivers, behaves consistently with previous studies: these drivers merge as soon as possible and have a distinct preference for large gaps. The second, called Risk-Taking Drivers, is much less sensitive to gap size and pays more attention to surrounding traffic conditions, such as the speed of the front vehicle in the auxiliary lane and the lead space gap between the merging vehicle and its leading vehicle in the auxiliary lane. Risk-Taking Drivers use the auxiliary lane to reach a further downstream or less congested area of the main lane. The proposed model also achieves higher prediction accuracy than the logistic regression model.

1. Introduction

Congestion has become one of the most serious economic and social problems and has drawn great attention from the public, transportation research scientists, transportation managers, and so on. Understanding the causes and mechanism of traffic congestion can help traffic managers formulate targeted policies to make better use of the existing transportation infrastructures.

Merging areas are freeway bottlenecks. Merging is a typical mandatory lane change in which vehicles have to move from an on-ramp to the main road. Some studies have claimed that merging behavior at merging areas affects traffic operations and may trigger traffic congestion and breakdowns [1, 2]. Thus it is important to analyze merging behavior, both to help understand the mechanism of traffic jams from a microscopic viewpoint and to build more accurate traffic simulation models.

Recently, driver heterogeneity has drawn great attention in microscopic traffic flow studies. Several studies investigated driver heterogeneity during the car-following process [3–6]. Accommodating heterogeneity within the driver population is important in building microscopic traffic models. To investigate the heterogeneity in merging behaviors, a finite mixture of logistic regression (FMLR) model is proposed in this paper. This model can incorporate the unobserved heterogeneity and automatically segments the merging drivers into different homogeneous populations. More specifically, this paper aims to achieve the following objectives:

(i) Prove the existence of heterogeneity among merging drivers.

(ii) Identify different driving styles and attitudes during the merging process.

(iii) Model the merging behavior more accurately.

The present study is organized as follows. The next section will provide a critical review on the existing relevant literature followed by Section 3, which describes the NGSIM data used in this paper. Section 4 gives the methodology to build FMLR model. Results and discussions are presented in Section 5. Finally, the conclusions and future work are presented in Section 6.

2. Literature Review

Several methods have been adopted to model merging behavior, among which gap acceptance theory was the most widely used method [8–13]. The most important assumption in gap acceptance theory was that a driver makes a lane change when both the lead and lag gaps in the target lane are larger than the so-called critical gap. The critical gap is determined by the characteristics of the drivers, traffic conditions, and so on [14]. Gap acceptance models were initially built to estimate the capacity of unsignalized intersections. Different distributions of critical gaps were assumed in various studies [15–17]. Gipps [18] first used the gap acceptance theory to propose a comprehensive framework of lane-changing model. Gipps’s framework has been widely used in several merge models [19, 20] and microscopic traffic simulation software [21, 22]. Different definitions of critical gap were used in these models and software.

Gap acceptance theory has often been criticized because its basic assumption is inconsistent with real-world observations: some lane changes occur when only the lead gap or the lag gap, or even neither of them, is larger than the critical gap [14, 23, 24]. To overcome this deficiency, discrete choice models such as the binary logit model were used by some researchers [14, 25–27]. Built upon a series of studies [9, 10, 28], a framework for merging behavior with latent plans was introduced by Choudhury et al. [29]. Normal merge, merge with courtesy, and forced merge were considered in this framework. However, Marczak et al. [14] pointed out that in this framework only accepted gaps were considered and rejected gaps were ignored, and that some of the estimated coefficients in the model were not significant.

Traffic behaviors are always uncertain and variable, and heterogeneity cannot be ignored in traffic studies. Some studies investigated heterogeneity in macroscopic traffic flow [30, 31]. Others studied heterogeneity in car-following behaviors from a microscopic viewpoint by deriving the joint distribution of model coefficients on an empirical basis [4, 5, 32–34]. However, only a few studies were found to investigate driver heterogeneity in lane-changing models. A two-step clustering analysis was proposed by Li and Sun [35] to analyze heterogeneity of the merging maneuvers. However, this study ignored the heterogeneity during the gap selection and decision process. An empirical analysis conducted by Daamen et al. [23] showed that different merging strategies might be adopted by different drivers under different traffic conditions. Keyvan-Ekbatani et al. [36] pointed out that different strategies might be used during the gap selection process; however, the sample size was too small to perform statistically relevant tests and build a merging model.

Thus, a FMLR model is introduced in this paper to model gap selection behavior during the merging process and to investigate the heterogeneity among merging drivers. The FMLR model takes advantage of two techniques: clustering and regression analysis. The model naturally incorporates the unobserved heterogeneity into the logistic regression model and automatically segments the drivers into different homogeneous populations. The proposed FMLR model can explain the different strategies in merging behaviors.

3. Data Preparation

The NGSIM dataset has been widely used for traffic flow and traffic simulation studies and has proved to have high accuracy. Thus, in this paper, the vehicle trajectory data in the NGSIM dataset collected on a segment of southbound U.S. Highway 101 (Hollywood Freeway) in Los Angeles, CA, are chosen [37]. Figure 1 shows the site for U.S. Highway 101. This US-101 section is 640 meters long and has five main lanes and one auxiliary lane. The vehicle trajectories were collected from 7:50 a.m. to 8:35 a.m. on June 15, 2005. The road section was covered by eight cameras and the dataset was recorded at a resolution of 10 frames per second [7]. The dataset has three data subsets, each covering 15 minutes.

In this study, we focus on the behavior of merging vehicles, and only trajectory data in the weaving section were used. However, it has been pointed out that the original trajectory data contain noise and errors caused by system errors and tracking errors [38–41]. Several methods have been proposed to filter the data [38–40] or re-extract the trajectory data [41]. Re-extraction can produce the most accurate data, especially acceleration data, but requires considerable effort. In this paper, a smoothing method called sEMA, developed by Thiemann et al. [38], is applied to reduce the noise and errors. The sEMA method has also been adopted in other studies of merging behavior and has been shown to provide sufficient precision for lane change studies [42–44]. This data smoothing technique was applied as follows:

(1) The velocities and accelerations of vehicles are directly estimated from the longitudinal positions.

(2) The locations (both local lateral and longitudinal coordinates), velocities, and accelerations of vehicles are smoothed by the symmetric exponential moving average filter (sEMA) proposed by Thiemann et al. [38] to decrease measurement errors in the data. The smoothing times of sEMA method are set as the suggested values for the U.S. Highway 101 dataset in Thiemann et al. [38].
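The smoothing step can be illustrated with a minimal sketch of a symmetric exponential moving average. It assumes the usual form of the filter, in which each sample is replaced by an exponentially weighted mean over a window that shrinks near the ends of the trajectory. The function name `sema` and the parameter `smoothing_time` are illustrative, the 0.1 s time step follows from the 10 frames per second stated above, and the smoothing times actually used are those suggested in [38], which are not reproduced here.

```python
import numpy as np

def sema(x, smoothing_time, dt=0.1):
    """Symmetric exponential moving average of a 1-D series (illustrative sketch).

    smoothing_time: kernel width in seconds (assumed parameter name).
    dt: sampling interval in seconds (0.1 s for 10 frames per second).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    delta = smoothing_time / dt              # kernel width in samples
    out = np.empty(n)
    for i in range(n):
        # Half-width of the window: at most 3 kernel widths,
        # reduced near the start and end of the trajectory.
        d = int(min(round(3 * delta), i, n - 1 - i))
        k = np.arange(i - d, i + d + 1)
        w = np.exp(-np.abs(k - i) / delta)   # symmetric exponential weights
        out[i] = np.sum(w * x[k]) / np.sum(w)
    return out
```

Because the window and weights are symmetric, the filter leaves constant and linear segments unchanged and only damps fluctuations around them.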

Although the random errors can be reduced by the smoothing process, there are still some errors in the data. Thus, the following heuristic rules are applied to filter the datasets:

(1) Filter out the trajectories when there are no putative leading vehicles or putative following vehicles on the adjacent main lane. Such trajectories are recorded at the beginning or end of the video recording and cannot provide the interactions of merging vehicles with their surrounding vehicles.

(2) Filter out the trajectories when the putative leading or putative following vehicle of a merging vehicle runs along the lane boundary (it keeps touching the lane boundary before the lane change or turns back to the original lane within about 1 second). These trajectories are always caused by tracking errors.

After filtering, a searching process was conducted to check the consistency of the local coordinates and global coordinates. Linear regression was performed between local coordinates and global coordinates for each subdataset. Three linear relationships were obtained for each subset:

The $R^2$ values of the three linear relationships are 0.9996, 0.9997, and 0.9997, respectively. This means that the local y coordinates of the three subsets in the US-101 dataset are inconsistent with each other. We cannot find a simple linear relationship between local x and global x in the US-101 dataset. This could be caused by the specific coordinate system used and the special geometric shape of the road sections. It could also be caused by measurement errors.

To further verify the inconsistency of the US-101 dataset, several data points that have the same global coordinates among the three subsets were searched and obtained. By checking the local coordinates (local x and local y), it was found that the three subsets of US-101 dataset are consistent in local x, but inconsistent in local y. Tables 1 and 2 show the examples having the same global coordinates in the first and second subsets and in the first and third subsets.

One can find that, for points with the same global coordinates, the three subsets have the same local x but different local y. In the local longitudinal coordinate, the upstream edge (0 m) in dataset 1 is at 12.275 m in dataset 2 and at 10.598 m in dataset 3. Thus, the three datasets must be unified by using the local coordinates of one of the three subsets.
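As a minimal sketch of this unification, the local longitudinal coordinates of subsets 2 and 3 can be shifted into the frame of subset 1 using the two offsets reported above. The choice of subset 1 as the reference, the constant name `OFFSETS_M`, and the function name are assumptions made for illustration; the sketch also assumes that a pure shift (no rescaling) is sufficient.

```python
# Position of subset 1's upstream edge (0 m) expressed in each subset's
# own local longitudinal coordinate, in meters (values from the text).
OFFSETS_M = {1: 0.0, 2: 12.275, 3: 10.598}

def to_subset1_local_y(local_y_m, subset):
    """Shift a local longitudinal coordinate into the frame of subset 1."""
    return local_y_m - OFFSETS_M[subset]

# The upstream edge of subset 1 maps back to 0 m from either other subset.
edge_from_subset2 = to_subset1_local_y(12.275, 2)
edge_from_subset3 = to_subset1_local_y(10.598, 3)
```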

At every instant when offered a new gap, a merging vehicle driver assesses traffic conditions to decide whether to accept the offered gap or not. One merging vehicle could only accept one gap but could reject several gaps. After data processing, trajectories of 374 merging vehicles consisting of 925 observations were extracted from the dataset. The explanatory variables that may affect a driver’s merging decision used as candidates for analyzing the merging behavior model are shown in Table 3.

4. Methodology

4.1. Finite Mixture of Logistic Regression

The FMLR model is based on the idea that the observed data come from a population with several subpopulations or components [45, 46]. The overall population is modeled as a mixture of the groups using finite mixture models.

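Let $y$ and $x$ denote random vectors with $N$ samples, where sample $i$ has $T_i$ observations ($i = 1, \ldots, N$). Here, the response vector $y$ has values in $\{0, 1\}$ and the explanatory vector $x$ has values in $\mathbb{R}^d$. Then, a FMLR with $K$ components has the form

$$h(y \mid x, \psi) = \sum_{k=1}^{K} \pi_k f(y \mid x, \theta_k), \tag{4}$$

$$\pi_k \geq 0, \qquad \sum_{k=1}^{K} \pi_k = 1, \tag{5}$$

where $h$ is the conditional density of $y$ given $x$ and $\psi$, $\pi_k$ is the mixing proportion, $\theta_k$ is the component-specific parameter vector for the density function $f$, and $\psi = (\pi_1, \ldots, \pi_K, \theta_1', \ldots, \theta_K')'$ is the vector of all parameters.

Several finite mixture models can be derived from (4) and (5). For multivariate normal $f$ and $x \equiv 1$ we get a finite mixture of Gaussians without a regression part, also known as model-based clustering. If $f$ is a univariate normal density with component-specific mean $\beta_k' x$ and variance $\sigma_k^2$, we have $\theta_k = (\beta_k', \sigma_k^2)'$, and (4) describes a finite mixture of linear regression, also called latent class linear regression model or cluster-wise regression [47]. If $f$ is a member of the exponential family, we get a finite mixture of generalized linear models, which includes the FMLR model [48, 49].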

The analyst does not observe directly which component, $k = 1, \ldots, K$, generated observation $i$. The model assumes that individuals are distributed heterogeneously with a discrete distribution within the population. In order to impose the constraints in (5), the mixing proportions are parameterized with a multinomial logit form [50, 51]:

$$\pi_k = \frac{\exp(\gamma_k)}{\sum_{j=1}^{K} \exp(\gamma_j)}, \quad k = 1, \ldots, K, \qquad \gamma_K = 0. \tag{6}$$

The constraint on $\gamma_K$ is imposed because only $K - 1$ parameters are needed to specify the mixing proportions. The last proportion is one minus the sum of the first $K - 1$.

If individual-specific characteristics are provided, the mixing proportions are extended as [50, 51]

$$\pi_{ik} = \frac{\exp(\gamma_k' w_i)}{\sum_{j=1}^{K} \exp(\gamma_j' w_i)}, \qquad \gamma_K = 0, \tag{7}$$

where $\gamma_k$ is the vector of component-specific parameters and $w_i$ is an optional set of individual-specific characteristics for observation $i$.
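The multinomial logit parameterization without individual-specific characteristics can be sketched in a few lines. The sketch assumes that the last component is the reference with its logit fixed at zero, so that only the remaining logits are free; the name `gamma_free` is illustrative. The example value is back-calculated from the mixing proportions of 67.2% and 32.8% reported later in this paper and is not an estimate taken from the model output.

```python
import numpy as np

def mixing_proportions(gamma_free):
    """Map K-1 free logits to K mixing proportions (last logit fixed at 0)."""
    gamma = np.append(np.asarray(gamma_free, dtype=float), 0.0)
    e = np.exp(gamma - gamma.max())   # subtract the max for numerical stability
    return e / e.sum()

# One free logit is enough for two components.
gamma_1 = np.log(0.672 / 0.328)
pis = mixing_proportions([gamma_1])
```

By construction the proportions are positive and sum to one, so the constraints on the mixing proportions never have to be enforced explicitly during estimation.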

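For the observed random sample, $(y_i, x_i)$ ($i = 1, \ldots, N$), the log likelihood function for $\psi$ is given by

$$\log L(\psi) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k f(y_i \mid x_i, \theta_k) \right). \tag{8}$$

The maximum likelihood (ML) estimate of $\psi$ is given by an appropriate root of the likelihood equation,

$$\frac{\partial \log L(\psi)}{\partial \psi} = 0. \tag{9}$$

The conditional probability that observation $i$ belongs to component $k$ is given by

$$\tau_{ik} = P(k \mid x_i, y_i, \psi) = \frac{\pi_k f(y_i \mid x_i, \theta_k)}{\sum_{j=1}^{K} \pi_j f(y_i \mid x_i, \theta_j)}. \tag{10}$$

The conditional probabilities can be used to segment data by assigning each observation to the component with maximum conditional probability [50, 51]. A probabilistic segmentation of the data into $K$ components can be obtained in terms of the fitted conditional probabilities. In the FMLR model we consider the latent component-indicator variables $z_{ik}$ to classify each single observation:

$$z_{ik} = \begin{cases} 1, & \text{if observation } i \text{ comes from component } k, \\ 0, & \text{otherwise.} \end{cases} \tag{11}$$

The estimator of $z_{ik}$ is

$$\hat{z}_{ik} = \begin{cases} 1, & \text{if } k = \arg\max_{j} \hat{\tau}_{ij}, \\ 0, & \text{otherwise.} \end{cases} \tag{12}$$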
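The classification rule can be sketched as follows for a mixture of logistic regressions. This is an illustration under simplifying assumptions, not the procedure used to produce the results in this paper: each row is treated as an independent binary observation, whereas in this study several observations belong to the same driver, and all names (`posterior`, `classify`, `betas`, `pis`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posterior(X, y, betas, pis):
    """Conditional component probabilities for each observation.

    X: (n, d) explanatory variables, y: (n,) binary responses,
    betas: (K, d) component-specific coefficients, pis: (K,) mixing proportions.
    """
    p = sigmoid(X @ betas.T)                   # P(y = 1 | x, component k), shape (n, K)
    lik = np.where(y[:, None] == 1, p, 1 - p)  # Bernoulli density of the observed y
    joint = pis * lik                          # pi_k * f(y | x, theta_k)
    return joint / joint.sum(axis=1, keepdims=True)

def classify(tau):
    """Assign each observation to the component with maximum probability."""
    return tau.argmax(axis=1)

# Two components with opposite slopes and equal mixing proportions.
X = np.array([[1.0, 2.0], [1.0, -2.0]])
y = np.array([1, 1])
betas = np.array([[0.0, 3.0], [0.0, -3.0]])
pis = np.array([0.5, 0.5])
labels = classify(posterior(X, y, betas, pis))
```

In this toy example the first observation is assigned to the first component and the second observation to the second, because each accepted outcome is far more likely under the component whose slope matches the sign of its covariate.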

4.2. Model Parameter Estimation

Parameters of FMLR models can be efficiently estimated through the EM algorithm [52].

(1) Initialization Step. Start with an initial seed (guess) for the parameter using the K-means clustering algorithm [53].

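(2) E-Step. Estimate the conditional component probabilities, $\hat{\tau}_{ik}$, for each observation using (10) and derive the mixing proportions as

$$\hat{\pi}_k = \frac{1}{N} \sum_{i=1}^{N} \hat{\tau}_{ik}. \tag{13}$$

(3) M-Step. Maximize the log-likelihood for each component separately using the conditional probabilities as weights:

$$\max_{\theta_k} \sum_{i=1}^{N} \hat{\tau}_{ik} \log f(y_i \mid x_i, \theta_k). \tag{14}$$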

The EM algorithm alternates between the expectation and the maximization steps until the likelihood improvement falls under a prespecified threshold or a maximum number of iterations are reached.
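The alternation described above can be sketched in NumPy for a mixture of logistic regressions. This is an illustrative sketch and not the Latent GOLD procedure: it starts from random soft assignments instead of K-means, treats every row as an independent observation, and performs a few Newton-Raphson steps per M-step with a small ridge term added only to keep the Hessian invertible. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Clip the linear predictor so probabilities stay strictly inside (0, 1).
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def weighted_logit_fit(X, y, w, beta, n_newton=5, ridge=1e-6):
    """A few Newton-Raphson steps on the weighted logistic log-likelihood."""
    for _ in range(n_newton):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def fit_fmlr(X, y, K=2, max_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    tau = rng.dirichlet(np.ones(K), size=n)    # random soft start (assumption)
    betas = np.zeros((K, d))
    prev = -np.inf
    for _ in range(max_iter):
        # M-step: mixing proportions, then one weighted logistic fit per component.
        pis = tau.mean(axis=0)
        for k in range(K):
            betas[k] = weighted_logit_fit(X, y, tau[:, k], betas[k])
        # E-step: conditional component probabilities under the new parameters.
        p = sigmoid(X @ betas.T)
        lik = np.where(y[:, None] == 1, p, 1 - p)
        joint = pis * lik
        mix = joint.sum(axis=1)
        tau = joint / mix[:, None]
        loglik = np.log(mix).sum()
        if loglik - prev < tol:                # stop when the improvement is small
            break
        prev = loglik
    return pis, betas, tau, loglik
```

With a single binary observation per unit the components of such a mixture can be only weakly identified, so the sketch is meant to show the mechanics of the two steps rather than to reproduce the estimates reported in this paper, where several observations are available for each driver.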

However, the EM algorithm can converge slowly and require long computation times. Thus, in this paper, Latent GOLD 5.0 is used to estimate the parameters. Latent GOLD 5.0 takes advantage of both the EM and Newton-Raphson algorithms: it first uses the EM algorithm to get close to the final solution and then switches to Newton-Raphson to finish the estimation [54].

The most important and difficult step in building the FMLR model is to determine $K$, the number of components. Since $K$ is not a parameter, hypotheses on $K$ cannot be tested directly. BIC or AIC [50, 51, 55, 56] are generally used as criteria to determine $K$. In this study, we determined $K$ based on BIC:

$$\mathrm{BIC} = -2 \log L + p \log n, \tag{15}$$

where $\log L$ is the log-likelihood value, $p$ is the number of free parameters to be estimated, and $n$ is the number of observations in the data. A lower BIC value indicates a better model.
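The criterion is simple enough to compute directly. The sketch below assumes the standard definition of BIC (minus twice the log-likelihood plus the number of free parameters times the natural logarithm of the number of observations); the function name, the helper `select_k`, and the structure of its input are illustrative, and no values from Table 4 are reproduced.

```python
import math

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion; a lower value indicates a better model."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def select_k(fits, n_obs):
    """Pick the number of components with the lowest BIC.

    fits: dict mapping K to a (log-likelihood, number of free parameters) pair.
    """
    return min(fits, key=lambda k: bic(fits[k][0], fits[k][1], n_obs))
```

Because the penalty grows with the number of free parameters, a model with more components is preferred only if its log-likelihood improves by more than the added penalty.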

5. Results and Discussion

5.1. Results

To select an optimal model, we fit FMLR models with the number of components increasing from 1 to 4 and use the Bayesian Information Criterion (BIC) to select the most appropriate number of components. Table 4 shows the BIC values of the models for different numbers of components. It can be observed from Table 4 that the lowest BIC value occurs at $K = 2$. Hence, it is plausible to select $K = 2$ as a proper number of components.

To select the model variables, the forward-selection method is adopted in this paper. It starts with no variables in the model, tests the addition of each variable using Wald statistics, and adds the variable that gives the most statistically significant improvement of the fit. In this paper, variables are added one by one until none produces a significant Wald statistic in any component.

Table 5 shows the estimation results. For comparison, the result of logistic regression is also provided. In this paper, the component mixing proportions are a set of fixed constants (see (6)), as no sociodemographic characteristics of drivers are available in this dataset. The proportions of merging vehicle drivers in the two components, as indicated by the mixing proportions in Table 5, are 67.2% and 32.8%, respectively.

By using (10)-(12), 374 drivers are classified into two components. One is the larger component, comprising 298 drivers and 612 observations, and the other is the smaller component, containing 75 drivers and 314 observations. To better understand the classification results, the mean values and standard deviations of related attribute variables are shown in Table 6.

5.2. Discussion

As seen from the significance levels of the parameters of Component 1 in Table 5, the speed of the front vehicle in the auxiliary lane and the lead space gap in the auxiliary lane fail to be significant at the 99% level. This suggests that front vehicles in the auxiliary lane do not alter drivers’ merge decisions in this component. Another notable characteristic of this component is that the drivers have a distinct preference for larger gaps. The negative sign of indicates that drivers in this component tend to decrease their speeds during the merging process. Consistent with previous studies, a decrease in the speed difference between the merging vehicle and the putative leading vehicle, and a gap located further towards the end of the auxiliary lane, also increase the probability of accepting the current gap.

It is interesting to find that the parameter of the gap size in Component 2 is much smaller than that in Component 1, which means that drivers in Component 2 do not pursue larger gaps as drivers in Component 1 do. In addition, the speed difference between the merging vehicle and the putative leading vehicle is still important during the merging process. Different from Component 1, the speed of the front vehicle in the auxiliary lane and the lead space gap in the auxiliary lane are considered by drivers in Component 2. The sign of the parameter for the lead space gap is positive, suggesting that space in the auxiliary lane also affects the merging behavior of drivers in Component 2 and that the merging vehicle has a high probability of accepting a gap when there is adequate space in front of it. One interesting finding from Table 5 is that the sign of the parameter for the speed of the front vehicle is negative, suggesting that drivers in Component 2 are more likely to delay merging when the leading vehicle moves fast. One possible reason for this result might be that when the leading vehicles move faster in the auxiliary lane, the drivers are provided more space in the auxiliary lane and use the auxiliary lane to reach further downstream in the main lane.

As illustrated in Table 6, the related variables differ clearly across the two components. The average numbers of rejected gaps in the two components are 1.05 and 3.19, and the average merge locations are 41.66 m and 108.58 m, which indicates that drivers in Component 2 tend to choose gaps further downstream and reject more gaps than drivers in Component 1. The average rejected gap of Component 2 (17.468 m) is much larger than that of Component 1 (10.068 m), while the average accepted gap of Component 2 (27.09 m) is much smaller than that of Component 1 (33.14 m), which is inconsistent with gap acceptance theory. Drivers in Component 2 also increase their speeds during the merging process, from 13.505 m/s to 14.272 m/s, while drivers in Component 1 decrease their speeds from 15.050 m/s to 13.418 m/s. In Component 2, the speed difference between the putative leading vehicle and the merging vehicle for accepted gaps is 3.187 m/s, which is much larger than in Component 1. Both observations indicate that drivers in Component 2 are more aggressive than those in Component 1. It is interesting to find that the standard deviations of the speeds for rejected gaps and accepted gaps in Component 1 are similar, which is not the case in Component 2. The standard deviation of rejected gaps for Component 2 is also much larger than that for Component 1. These findings indicate that the merging process of drivers in Component 2 is much more complicated than that of drivers in Component 1.

Figure 2 shows the relation between the gap size and location for the rejected and accepted gaps in the two components. One can find that the accepted gaps of drivers in Component 1 are almost all located at the beginning of the auxiliary lane, while the accepted gaps of drivers in Component 2 are scattered along the lane. The rejected gaps of drivers in Component 2 are clearly much larger than those in Component 1 and overlap with the accepted gaps, while the overlapping area in Component 1 is much smaller.

Figure 3 shows the box plot of the reverse succession of offered gaps. The x-axis in Figure 3 is the reverse number of offered gaps before merging, in which 0 means the finally accepted gap and 1 means the last rejected gap before merging. One can find that drivers in Component 2 might have several choices before merging, which indicates that drivers in Component 2 prefer to use the auxiliary lane to get further downstream.

Comparing the two components, drivers in Component 1 prefer larger gaps and lower speed difference, while drivers in Component 2 pay more attention to better surrounding traffic conditions and may sacrifice larger gaps to save travel time and get better traffic conditions. Thus, in this paper, Component 1 is named as Risk-Rejecting Drivers and Component 2 is named as Risk-Taking Drivers.

5.3. Accuracy of Developed Models

Tables 7 and 8 compare the estimated and observed values for the logistic regression model and the 2-component mixture of logistic regression (FMLR-2) model. From these tables, the proposed model improves the prediction accuracy from 82.70% to 91.24%. It can be concluded that the proposed model has better predictive power than the logistic regression model.

6. Conclusions

To incorporate the unobserved heterogeneity into merge model, the present study builds a FMLR model which uses BIC to determine the proper number of mixing components and performs parameter estimation by using Latent GOLD 5.0.

Given the U.S. Highway 101 data, the identified optimal model is a 2-component mixture of logistic regression model, which means that the drivers can be divided into two components characterized by heterogeneous driving behavior. The first component is the Risk-Rejecting Drivers, whose behavior is consistent with previous studies and who merge as soon as possible. Drivers in this component have a distinct preference for larger gaps. A decrease in the speed difference between the merging vehicle and the putative leading vehicle, and a gap located further towards the end of the auxiliary lane, also increase the probability of accepting the current gap. In contrast to Component 1, Component 2 consists of drivers who are much less sensitive to the gap size and place more emphasis on surrounding traffic conditions, such as the speed of the front vehicle in the auxiliary lane and the space gap between the merging vehicle and its leading vehicle in the auxiliary lane. These drivers use the auxiliary lane to reach a further downstream or less congested area of the main lane. Thus they are called Risk-Taking Drivers.

In addition, the proposed model achieves higher prediction accuracy than the logistic regression model.

However, more empirical studies are needed to apply this method to datasets in other sites with different demographics, climate, and geometric parameters in order to fully assess the effect of the factors affecting merging behaviors as well as fully understand the strengths and weaknesses of the proposed model.

Data Availability

The NGSIM data used to support the findings of this study have been deposited at the website: https://catalog.data.gov/dataset/next-generation-simulation-ngsim-vehicle-trajectories.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.