Mathematical Problems in Engineering

Volume 2015, Article ID 958206, 10 pages

http://dx.doi.org/10.1155/2015/958206

## Application of the Empirical Bayes Method with the Finite Mixture Model for Identifying Accident-Prone Spots

^{1}Key Laboratory of Road and Traffic Engineering of Ministry of Education, Tongji University, Shanghai 201804, China^{2}Department of Civil and Environmental Engineering, University of Washington, Box 352700, Seattle, WA 98195-2700, USA^{3}Zachry Department of Civil Engineering, Texas A&M University, 3136 TAMU, College Station, TX 77843-3136, USA^{4}Institute of Oceanology, Shanghai Jiao Tong University, Shanghai 200240, China

Received 18 January 2015; Accepted 2 April 2015

Academic Editor: Paolo Maria Mariano

Copyright © 2015 Yajie Zou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Hotspot identification (HSID) is an important component of the highway safety management process. A number of methods have been proposed to identify hotspots. Among these methods, previous studies have indicated that the empirical Bayes (EB) method can outperform other methods for identifying hotspots, since the EB method combines the historical crash records of the site and expected number of crashes obtained from a safety performance function (SPF) for similar sites. However, the SPFs are usually developed based on a large number of sites, which may contain heterogeneity in traffic characteristic. As a result, the hotspot identification accuracy of EB methods can possibly be affected by SPFs, when heterogeneity is present in crash data. Thus, it is necessary to consider the heterogeneity and homogeneity of roadway segments when using EB methods. To address this problem, this paper proposed three different classification-based EB methods to identify hotspots. Rural highway crash data collected in Texas were analyzed and classified into different groups using the proposed methods. Based on the modeling results for Texas crash dataset, it is found that one proposed classification-based EB method performs better than the standard EB method as well as other HSID methods.

#### 1. Introduction

For the purpose of prioritizing safety improvements on roadway network, identifying sites with consistently elevated accident risk, often referred to as hotspots or black spots, is of critical importance. To address this need, a number of analytical methods for hotspot identification (HSID) have been developed over the last several decades, with the overarching objective of optimizing the allocation of limited funding. An inaccurate HSID method will result in inefficient allocation of safety treatment resources, with potentially serious costs in terms of overall safety performance of the network. The need for accurate methods to identify and prioritize accident-prone locations is underscored by the U.S. 2012 Federal Moving Ahead for Progress in the 21st Century Act (MAP-21), which emphasizes data-driven crash risk analysis and safety treatment prioritization. Further, the performance reporting requirements outlined in MAP-21 provide an additional layer of incentive for public agencies to maximize the impact of safety spending by selecting and treating sites with high improvement potential.

A number of papers in past years have focused on the accident frequency (AF) or accident rate (AR) based HSID methods, which rely on observed accident counts as the primary measure of accident risk. Because sites are ranked and identified based on observed accident data only, there is no mechanism for identifying sites with elevated risk (due to some combination of geometric and traffic characteristics) but few accidents. Further, these methods cannot distinguish between actual high risk locations and those with higher occurrence of accidents due to random fluctuations. The empirical Bayes (EB) HSID method addresses these issues by combining two clues, the historical crash record of the entity and the expected number of crashes obtained from a safety performance function (SPF) for similar entities. This approach is less sensitive to random fluctuations in accident frequency and in theory can identify truly high risk locations with greater accuracy. Building on the EB approach, additional methods have been developed based on estimated accident reduction potential (ARP). Such methods attempt to quantify the difference between the actual accident count at the location of interest (as estimated using EB method) and the expected accident count for similar locations, under the supposition that this difference represents the potential for improvement.

Unfortunately, the definition of “similar” sites is a somewhat open question, and the accuracy of the EB method depends largely on the selection of the reference population. When estimating the SPF, the studied crash data are often collected from different geographic locations to ensure the adequacy of sample size for valid statistical estimation. Since crash data observed from different geographic locations may exhibit different characteristics, the aggregation of these crash data may result in heterogeneity. For the aggregated crash data, it is reasonable to assume that the sites with different combinations of characteristics (i.e., geometric design features) can constitute distinct subpopulations [1]. Under this assumption, applying the EB method to the entire crash data may become inappropriate because the second clue of the EB method requires the homogeneity of the crash data. Therefore, as proposed by some previous studies [2–8], it is reasonable to assume that the crashes on highway entities (i.e., road segments or intersections) are generated from a certain number of hidden subpopulations. Since entities are heterogeneous across but homogeneous within the subpopulations, the EB method can be applied to the crash data in each subpopulation.

The primary objectives of this research are to propose three different classification-based EB methods to identify accident-prone sites and to compare the proposed methods with the commonly used HSID approaches (i.e., AF, AR, EB, and ARP methods). To accomplish the objective of this study, the finite mixture of NB regression models with fixed or varying weight parameter is considered to classify the crash data into different subgroups. The effectiveness of the proposed HSID approaches is examined using the crash data collected at 4-lane undivided rural highways in Texas and performance is evaluated using criteria proposed by Cheng and Washington [9].

#### 2. Background

##### 2.1. Hotspot Identification Methods

AF based HSID methods have been in use for many years. Such approaches typically rank locations or segments along a highway by observed accident count over a specified time interval and define hotspots as those exceeding some critical value [10]. Road segments or intersections are ranked by accident count among similar locations (such as along a relatively homogeneous section of highway), to insure that the identified hotspots represent specific opportunities for remediation instead of some inherent characteristic of a particular roadway class or driver population. One criticism often raised with regard to the AF method is that this approach lacks the ability to differentiate between actual hotspots and locations with increased accident frequency attributable to the randomness of traffic accidents [9, 10].

It is readily apparent that, all else being equal, a segment with higher traffic volume can be expected to have a higher accident count, and so hotspots identified using AF methods tend to overrepresent high volume locations that may or may not be amenable to remediation efforts [11]. In response, AR methods have been developed which rely on accident count per unit traffic volume for HSID, typically in units of accidents per million vehicle miles traveled. Similar to the AF methodology, sites are ranked by accident rate and those exceeding a critical value are identified as hotspots. Implicit in this approach is the assumption that accident count and exposure are linearly related, which is often not the case. In addition, by normalizing accident count by entering traffic volume, locations with very low traffic volume are sometimes over represented [11, 12].

The EB method for traffic accident HSID was introduced by Abbess et al. [13] to address issues with existing methodologies, most notably regression-to-the-mean (RTM) bias and low precision due to limited accident history. It has since been refined and widely used in a range of safety performance modeling applications [14–16]. In the EB crash modeling procedure, the expected number of crashes at a location is estimated by combining two pieces of information: the accident count at the location of interest and the expected accident count at locations determined to be similar based on traffic and roadway characteristics [17]. It is assumed that the actual accident count for the location of interest is available, and the expected accident count for similar locations is generally estimated from the SPF. The SPF describing accident counts as a function of traffic volume, lane width, and so forth is typically fitted using the negative binomial (NB) regression model. The observed accident count for a given roadway segment is combined with the expected value estimate as shown in where is the EB estimate of the expected number of crashes per year for site ; is the estimated number of crashes per year by the SPF for given site (estimated using a NB model); is the weight factor estimated as a function of and dispersion parameter ; and is the observed number of crashes per year at site .

Another measure often used in HSID is ARP. It was originally suggested that ARP be estimated as the difference between the observed accident count at the site of interest and the expected count as estimated from a set of reference sites. More recently, it has been proposed that the observed accident count at the site of interest be replaced by the EB-estimated accident count. This approach can account for random fluctuations in accident frequency and so gives a better estimate of the true safety of the location of interest. Using the EB-estimated accident count, the ARP is calculated as shown in where for site .

Persaud et al. [12] suggested that a better estimate of the true ARP can be derived using a full predictor set in the EB-estimated accident count and a subset of available regressors (i.e., those not describing a correctable site-specific geometric feature) in the expected accident count model. This way, the estimated ARP is a measure of the difference between the EB-estimated “true” safety and the expected safety of what could be considered a base scenario.

##### 2.2. Negative Binomial Model

In highway safety, the dispersion parameter of NB models refines the estimates of the predicted mean when the EB method is used. So far, the NB distribution is the most frequently used model by transportation safety analysts to generate SPFs [18]. The NB model has the following structure: the number of crashes during some time period is assumed to be Poisson distributed, which is defined bywhere is the mean response of the observation.

The NB distribution can be viewed as a mixture of Poisson distributions where the Poisson rate is gamma distributed. For the complete derivation of the NB model, the reader is referred to Hilbe [19]. The probability density function (PDF) of the NB is defined as follows:where is the mean response of the observation and is the dispersion parameter.

Compared to the Poisson distribution, the NB distribution can allow for overdispersion.

##### 2.3. Finite Mixture of NB Regression Models

This study adopts a -component finite mixture of NB regression models (termed as the FMNB- model) to classify heterogeneous crash data. For the FMNB- model, it is assumed that the marginal distribution of follows a mixture of NB distributions, as shown in the following:where is the weight of component (weight parameter), with , and ; is the number of components; , the mean rate of component ; is a vector of covariates; is a vector of the regression coefficients for component ; for ; and are the vectors of parameters for the component .

The term is assumed to be fixed for the FMNB- models. However, instead of estimating a fixed weight parameter, a generalized FMNB- model (GFMNB- model) can be derived by modeling the varying weight as a function of covariates, shown in (8). The GFMNB- model allows each entity to have a different weight that is dependent on the sites’ attributes (i.e., covariates). Previous studies [1] have shown that the GFMNB- model can provide more reasonable classification results than the FMNB- model. Note that the GFMNB- model has the same PDF shown in (5), but the weight factors are calculated using where is the estimated weight of component at segment ; are the estimated coefficients for component , being the number of coefficients; and is a vector of covariates.

##### 2.4. Classification-Based EB Methods

Since the aggregated crash data may contain heterogeneity, the FMNB- and GFMNB- models are proposed to classify the crash data into different subpopulations. Besides the mixture model, a simple mean-based grouping method is also considered and compared with the FMNB- and GFMNB- models. After separating the aggregated crash data into different subgroups, the EB estimates are calculated based on the crash data in each individual subpopulation. The three grouping methods and the procedure for prioritizing the hotspots are described as follows.

The first classification method assumes the crash data are generated from two subpopulations (i.e., one subpopulation contains accident-prone sites and the other consists of low-risk sites). To separate the sites into two groups, the mean of the number of crashes across the entire dataset is calculated. The sites with the observed number of crashes greater than the mean are labeled as the accident-prone group, and the sites with the observed number of crashes smaller than the mean are labeled as low-risk group. The mean-based classification method for hotspot identification consists of three steps. First, separate the entire crash data into two subgroups using the crash mean as the threshold value. Second, the NB regression model is estimated using the crash data in accident-prone group and the corresponding EB estimates are calculated; likewise, the NB regression model is estimated using the crash data in the low-risk group and the EB estimates are obtained. Third, the two sets of EB estimates obtained from step two are aggregated and ranked; then the hotspots are identified based on the aggregated EB estimates. Hereinafter, we denote this HSID approach as the mean-based EB method.

The second and third classification methods assume that the aggregated crash data contain heterogeneity. Heterogeneity implies that crash data are generated from different subpopulations (i.e., crash data in the same subpopulation share common characteristics, while crash data across the subpopulations may exhibit different characteristics). Given the advantages of finite mixture models in describing the heterogeneity in crash data, FMNB- and GFMNB- models are used to classify the crash data into components based on the site characteristics. The FMNB-based or GFMNB-based classification method for hotspot identification consists of four steps. First, fit the FMNB- or GFMNB- model to entire crash data, and select the number of components using the Bayesian information criterion (BIC). Second, after determining the number of components, separate the entire crash data into subgroups using the FMNB- or GFMNB- model. Third, the NB regression model is estimated using the crash data in each subgroup and the corresponding EB estimates are obtained. Fourth, the sets of EB estimates obtained from step three are aggregated and ranked; then the hotspots are identified based on the aggregated EB estimates. Hereinafter, the second and third HSID approaches are denoted as the FMNB-based EB method and the GFMNB-based EB method, respectively.

#### 3. Hotspot Identification Method Evaluation Criteria

With the objective of prioritizing safety treatments, the performance of a HSID method must be described in terms of both its ability to identify truly high risk sites and its ability to accurately rank sites by accident risk. In combination, the three tests proposed by Cheng and Washington [9] address both of these performance considerations.

##### 3.1. Site Consistency Test

Cheng and Washington [9] introduced the Site Consistency Test (SCT) as a measure of a HSID method’s consistency over subsequent time periods. It is based on the idea that actual high risk sites will experience consistently elevated accident frequencies, which means that a desirable HSID method should identify as hotspots sites which can be expected to have poor safety performance in subsequent time periods assuming no safety treatments or other changes have been applied. The SCT considers the sites identified as hotspots by method during time period and compares the methods based on the sum of accidents at high risk sites during future time period . The optimal HSID method as determined by the SCT is the one with the greatest number of accidents occurring on the sites identified as high risk by that method during time period . Out of sites, a threshold is defined for each method such that are designated as high risk by each method. For example, with total sites and , 10 sites will be selected by each method under comparison. With sites ranked in order of increasing accident risk, the test statistic for each method is defined by Cheng and Washington [9] as shown below in (9). The best method according to the SCT, then, is that which produces the highest :where is the total number of sites under analysis; is the accident count for site ; is the threshold for high risk sites, defined as the fraction of all sites that are designated high risk; is the HSID method which is being compared; and is the time period of observation.

##### 3.2. Method Consistency Test

Similar to the SCT, the Method Consistency Test (MCT) is based on the notion that actual high risk sites will consistently experience poor safety performance, assuming that operating conditions are similar and that no safety treatments have been applied. However, the MCT estimates the performance of a HSID method by the extent to which the same sites are identified as hotspots over two consecutive time periods. Given hotspots identified by each method , the best performing method is that which identifies the greatest number of hotspots that are consistent between time period and time period . From Cheng and Washington [9], the performance of method is computed as shown in

##### 3.3. Total Rank Differences Test

Similar to the MCT, the Total rank Differences Test (TRDT) is a measure of a HSID method’s consistency across multiple time periods, again assuming that none of the sites have undergone safety treatments. However, the TRDT explicitly takes into account the safety performance ranking assigned by each method and estimates performance based on the ranking consistency between subsequent time periods. While the MCT would assign a high score to a method which consistently identifies the same sites in two time periods as being in the top in terms of accident risk, the TRDT considers the relative rankings of all sites identified in time period The performance of each method is calculated as the sum of differences between the rank assigned to all high risk sites in time period and the rank assigned to the same sites in time period . Note that the sites identified in time period are compared by rank in the two time periods whether or not they are identified as high risk in time period . The performance measure for the TRDT is computed as described by Cheng and Washington 2008 [9], shown in where is the rank of site from method for time period .

#### 4. Data Description

The crash dataset used for comparing different HSID methods was collected at 4-lane undivided rural segments in Texas as a part of NCHRP 17–29 research project. This dataset records crash data collected on 1,499 undivided rural segments over a five-year period from 1997 to 2001. In order to compare different HSID methods using the three criteria, the five-year crash data were divided into two time periods, Period 1 (years 1997 and 1998) and Period 2 (years 1999, 2000, and 2001). Table 1 provides the summary statistics for individual road segments in the Texas data for time periods 1 and 2.