Journal of Advanced Transportation

Volume 2017, Article ID 5230248, 9 pages

https://doi.org/10.1155/2017/5230248

## Developing a Clustering-Based Empirical Bayes Analysis Method for Hotspot Identification

^{1}The Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University, Shanghai 201804, China
^{2}University of Washington, P.O. Box 352700, Seattle, WA 98195-2700, USA
^{3}Uncertainty Decision-Making Laboratory, Sichuan University, Chengdu 610064, China
^{4}Department of Civil and Environmental Engineering, University of Washington, Seattle, WA 98195, USA
^{5}College of Urban Railway Transportation, Shanghai University of Engineering Science, 333 Longteng Road, Shanghai 201620, China

Correspondence should be addressed to Yanxi Hao; haoxy0316@163.com and Yichuan Peng; yichuanpeng1982@hotmail.com

Received 15 June 2017; Revised 10 October 2017; Accepted 15 October 2017; Published 22 November 2017

Academic Editor: Chunjiao Dong

Copyright © 2017 Yajie Zou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Hotspot identification (HSID) is a critical part of network-wide safety evaluations. Typical methods for ranking sites are often rooted in using the Empirical Bayes (EB) method to estimate safety from both observed crash records and predicted crash frequency based on similar sites. The performance of the EB method is highly related to the selection of a reference group of sites (i.e., roadway segments or intersections) similar to the target site from which safety performance functions (SPF) used to predict crash frequency will be developed. As crash data often contain underlying heterogeneity that, in essence, can make them appear to be generated from distinct subpopulations, methods are needed to select similar sites in a principled manner. To overcome this possible heterogeneity problem, EB-based HSID methods that use common clustering methodologies (e.g., mixture models, K-means, and hierarchical clustering) to select “similar” sites for building SPFs are developed. Performance of the clustering-based EB methods is then compared using real crash data. Here, HSID results, when computed on Texas undivided rural highway crash data, suggest that all three clustering-based EB analysis methods are preferred over the conventional statistical methods. Thus, properly classifying the road segments for heterogeneous crash data can further improve HSID accuracy.

#### 1. Introduction

Network screening to identify sites (i.e., roadway segments or intersections) with promise for safety treatments is an important task in road safety management [1–7]. The identification of sites with promise, also known as crash hotspots or hazardous locations, is the first task in the overall safety management process [8]. One widely applied approach to this task is the popular Empirical Bayes (EB) method. The EB method is described and recommended in the *Highway Safety Manual* [9] for roadway safety management. This method is relatively insensitive to random fluctuations in the frequency of accidents because it combines two clues: the observed crash frequency of the site and the expected number of crashes calculated from a safety performance function (SPF) for homogeneous sites (or the reference group) [10, 11]. The EB method can correct for regression-to-the-mean bias and refine the predicted mean of an entity [12]. Further, it is relatively simple to implement compared to the fully Bayesian approach.

Although the EB method has several advantages, there are a few issues associated with the methodology which may limit its widespread application. First, the selection of the reference population (i.e., similar sites) influences the accuracy of the EB method. When estimating the safety performance function, the crash data are usually obtained from distinct geographic sites to ensure a sufficient sample size for valid statistical estimation [10]. As a result, the aggregated crash data often contain heterogeneity. When conducting an EB analysis, the reference group must be similar to the target group in terms of geometric design, traffic volume, and so on. Manually identifying such a reference group is a rather time-consuming task for transportation safety analysts whose time could be better spent elsewhere. Second, the EB procedure is relatively complicated and requires a transportation safety analyst with considerable training and experience to implement it for a safety evaluation. Thus, the training investment required to prepare analysts to undertake EB evaluations can be a barrier. As a result, some quick-and-dirty conventional evaluation methods may be applied as a compromise of convenience, which may produce questionable results.

Given that the specification of correct reference groups is critical for the accuracy of the EB methodology, the primary objective of this research is to examine different clustering algorithms (e.g., centroid-based clustering, connectivity-based clustering, and distribution-based clustering) and to develop a procedure for identifying appropriate reference groups for the EB analysis.

#### 2. Hotspot Identification Methods Used in Comparison

##### 2.1. Conventional Hotspot Identification Methods

One common HSID method is the accident frequency (AF) method. Sites are ranked based on AF and hotspots are defined as sites whose accident frequency exceeds some threshold value [13]. The problem of accounting for exposure in the AF method can be accommodated by considering accident rate (AR) instead of accident frequency. As such, AR methods have been used by some analysts and normalize accident frequency by traffic count. While AF and AR methods are easy to implement, they have difficulty accounting for randomness in crash data. As such, another popular HSID method was developed, that being the Empirical Bayes method presented by Abbess et al. [14]. Since its introduction decades ago, the EB method has been used numerous times in many safety studies [15–23]. One of the key advantages of using the EB method is that it accounts for the regression-to-the-mean (RTM) bias. The EB method can also help improve precision in cases where limited amounts of historical accident data are available for analysis at a given site. At its core, the EB method forecasts the expected crash count at a particular site as a weighted combination of the accident count at the site based on historical data and the estimated number of accidents at similar locations as determined from a regression model [24]. The regression model is generally referred to as a SPF and typically takes into account roadway and traffic characteristics (e.g., average daily traffic) at similar sites. To date, the most popular choice for the SPF has been a Negative Binomial regression model [25–27]. In terms of HSID via the EB method, EB estimates are computed for each site and then sites are ranked according to such estimates. Sites exceeding some thresholds are then considered as hotspots. Besides the EB method, another relatively common HSID method is rooted in so-called “accident reduction potential” (ARP). 
The ARP metric used for ranking sites was computed by subtracting the estimated accident count from the observed accident count at a given site, where the estimated accident count comes from a regression model developed from data at similar sites to the target. Among different HSID methods, the EB method is probably the most widely applied approach for screening sites with potential for safety treatment.
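The weighted combination at the heart of the EB method is easy to illustrate in code. The function below is a simplified sketch, not the implementation used in any of the cited studies; it assumes the common NB parameterization in which the weight on the SPF prediction is $w = \phi/(\phi + \mu)$, with $\phi$ the inverse-dispersion parameter of the SPF.

```python
def eb_estimate(observed, mu_spf, phi):
    """Empirical Bayes estimate of the expected crash count at one site.

    observed -- crash count observed at the site over the study period
    mu_spf   -- SPF-predicted crash count for similar sites, same period
    phi      -- inverse-dispersion parameter of the NB-based SPF
    """
    w = phi / (phi + mu_spf)                # weight placed on the SPF prediction
    return w * mu_spf + (1.0 - w) * observed
```

For example, with $\phi = 2$ and an SPF prediction of 4 crashes, a site that recorded 10 crashes receives an EB estimate of $(1/3)(4) + (2/3)(10) = 8$, pulling the noisy observation toward the mean of its reference group and thereby mitigating regression-to-the-mean bias.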

##### 2.2. Clustering for Selection of Similar Sites

In the following section, we present three methods that can be used to group data into different clusters. As aforementioned, crash data often exhibit heterogeneity that can affect model estimates if not properly accounted for. The idea here is to cluster crash data into different groups that hopefully align to some degree with the underlying subpopulations from which the crash data are generated. Then, separate Negative Binomial (NB) regression models (i.e., SPFs) can be developed based on each cluster and EB estimates can then be computed using an SPF that hopefully considers sites that truly are “similar” to the site in question.

###### 2.2.1. Generalized Finite Mixture of NB Regression Models

The generalized finite mixture of NB regression models with $g$ components (GFMNB-$g$) assumes that the crash count $y_i$ follows a mixture of NB distributions, as shown as follows [28]:

$$f(y_i \mid \mathbf{x}_i) = \sum_{j=1}^{g} w_j \, \mathrm{NB}(y_i;\, \mu_{ij}, \phi_j),$$

where $w_j$ is the weight of component $j$ (weight parameter), with $0 < w_j < 1$ and $\sum_{j=1}^{g} w_j = 1$; $g$ is the number of components; $\mu_{ij} = \exp(\mathbf{x}_i' \boldsymbol{\beta}_j)$ is the mean rate of component $j$; $\mathbf{x}_i$ is a vector of covariates; $\boldsymbol{\beta}_j$ is a vector of the regression coefficients for component $j$, for $j = 1, \ldots, g$; and $\phi_j$ is the dispersion parameter of component $j$.

For GFMNB-$g$ models, the equation for developing the weight parameter is shown in (4). By using a function of the covariates, the GFMNB-$g$ model makes it possible for each site to have different weights for each component that depend on the site-specific values of the covariates. Zou et al. [28] demonstrated how this additional flexibility can lead to better classification results:

$$w_{ij} = \frac{\exp(\mathbf{x}_i' \boldsymbol{\gamma}_j)}{\sum_{l=1}^{g} \exp(\mathbf{x}_i' \boldsymbol{\gamma}_l)}, \tag{4}$$

where $w_{ij}$ is the estimated weight for component $j$ at segment $i$; $\boldsymbol{\gamma}_j$ are the estimated coefficients of component $j$, with $m$ the number of unknown coefficients; and $\mathbf{x}_i$ is a vector of covariates.
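Because the weight parameter in (4) has a multinomial-logit (softmax) form, the site-specific weights are straightforward to compute once the coefficients are estimated. The sketch below is illustrative only; the function name and the convention of fixing one component's coefficients to zero for identifiability are assumptions, not details taken from the paper's estimation code.

```python
import numpy as np

def component_weights(x_i, gammas):
    """Site-specific mixture weights w_ij from covariates, in the form of (4).

    x_i    -- covariate vector for segment i, including an intercept, shape (m,)
    gammas -- coefficient matrix, one row per component, shape (g, m);
              fixing one row to zeros is a common identifiability convention
    """
    scores = gammas @ x_i               # linear predictor for each component
    scores = scores - scores.max()      # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()                  # weights are positive and sum to one
```

Each road segment thus receives its own vector of component weights, which is what allows the GFMNB-$g$ model to assign segments to components based on their covariates rather than using one fixed weight for all sites.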

###### 2.2.2. -Means Clustering

The K-means clustering algorithm is often attributed to Lloyd [29] and Anderson [30], and it is one of the most popular clustering algorithms in use today. Inputs to the algorithm are the data points; here, each data point can be viewed as one of the road segments in the crash data set and its corresponding descriptive variables (e.g., lane width and average daily traffic (ADT)). With the data in hand, $K$ cluster centers are initialized. Cluster centers can be chosen as random points in the feature space (i.e., points that do not exist in the data set could be selected), as random data points from the dataset (i.e., only points in the dataset can be selected), or through a variety of other methods. For this project, initialization using random data points from the dataset was used. The algorithm then proceeds iteratively until it converges, where convergence is defined as the point at which the cluster assignments no longer change. The first step of each iteration assigns every data point to the cluster whose center is nearest; the distance metric used for this work is the Euclidean distance defined in (5). The second step then recalculates the center of each cluster. Pseudocode for the algorithm is shown in the following:

$$d(\mathbf{x}_i, \mathbf{x}_j) = \left\lVert \mathbf{x}_i - \mathbf{x}_j \right\rVert_2 = \sqrt{\sum_{m=1}^{p} \left( x_{im} - x_{jm} \right)^2}, \tag{5}$$

where $d(\mathbf{x}_i, \mathbf{x}_j)$ is the Euclidean distance between two points; $i$ and $j$ are data point indices, ranging from $1$ to $N$; $m$ is the variable index, ranging from $1$ to $p$ variables; and $\lVert \cdot \rVert_2$ is the two-norm of the difference between two data points.

*K-Means Algorithm*

*Cluster-Assignment Step*

$$c_i = \arg\min_{k} \left\lVert \mathbf{x}_i - \mathbf{m}_k \right\rVert_2^2,$$

where $c_i$ is the cluster assignment for data point $i$; $\mathbf{m}_k$ is the center of cluster $k$; and all other variables are defined as previously.

*Center-Update Step*

$$\mathbf{m}_k = \frac{1}{\lvert C_k \rvert} \sum_{i \in C_k} \mathbf{x}_i,$$

where $\lvert C_k \rvert$ is the cardinality (number of data points) of cluster $k$ and all other variables are defined as previously.
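The assignment and update steps above can be sketched directly in Python with NumPy. This is a minimal sketch of Lloyd's algorithm, not the implementation used in the study; following the text, it initializes centers from randomly chosen data points.

```python
import numpy as np

def k_means(X, k, seed=0, max_iter=100):
    """Lloyd's K-means on an (n, p) array of segment descriptors."""
    rng = np.random.default_rng(seed)
    # initialize centers as k distinct, randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = None
    for _ in range(max_iter):
        # cluster-assignment step: nearest center by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                       # converged: assignments unchanged
        assign = new_assign
        # center-update step: each center becomes the mean of its cluster
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:        # leave empty clusters untouched
                centers[j] = members.mean(axis=0)
    return assign, centers
```

Note that the result depends on both $K$ and the random initialization, which is one practical difference from the hierarchical approach described next.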

###### 2.2.3. Hierarchical Clustering

Hierarchical clustering methods differ from K-means clustering in that they require no initialization, and the results are deterministic (i.e., the results will always be the same for a given number of clusters). Rather, they are rooted in a dissimilarity measure between clusters that is defined over all possible pairwise combinations of data points drawn from two given clusters. In this research, agglomerative (i.e., bottom-up) hierarchical clustering in the form of complete linkage clustering was considered. Agglomerative clustering methods (e.g., complete linkage, single linkage, and average linkage) take the data points (i.e., road segments and their corresponding descriptors) as inputs and begin with each data point as its own cluster; a lone data point forming its own cluster is also known as a singleton. For complete linkage clustering, the algorithm proceeds in a total of $N-1$ steps (i.e., one step fewer than the total number of data points in the dataset), and at each step the two clusters with the smallest intergroup dissimilarity are joined to form a new cluster. Thus, each successive step reduces the number of clusters by one. For complete linkage clustering the intergroup dissimilarity is defined as follows [31]:

$$d_{CL}(A, B) = \max_{\mathbf{x}_i \in A,\; \mathbf{x}_j \in B} d(\mathbf{x}_i, \mathbf{x}_j),$$

where $A$ and $B$ are two arbitrary clusters and $d(\mathbf{x}_i, \mathbf{x}_j)$ is the Euclidean distance defined in (5).

Thus, for each step of the complete linkage clustering algorithm, the two clusters with the smallest value of the maximum between-cluster distance are joined.
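A naive version of this merging procedure is easy to express in code. The sketch below is a brute-force illustration of complete-linkage agglomeration (a quadratic linkage search, not the optimized routines a statistics package would use); the function name and the stopping rule based on a target number of clusters are assumptions for illustration.

```python
import numpy as np

def complete_linkage(X, n_clusters):
    """Agglomerative clustering with complete linkage on an (n, p) array.

    Starts from singletons and repeatedly merges the pair of clusters whose
    maximum pairwise point distance is smallest, until n_clusters remain.
    """
    clusters = [[i] for i in range(len(X))]   # every point starts as a singleton
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete-linkage dissimilarity: largest pairwise distance
                d = max(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # join the closest pair
        del clusters[b]
    return clusters
```

Running the loop all the way to a single cluster would trace out the full $N-1$ merge steps described above; stopping earlier yields the desired number of groups.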

##### 2.3. Classification-Based EB Methods

At this point it is important to clarify the main contribution of this work. It is well-known that aggregated crash data likely have some degree of heterogeneity, as if they were generated from multiple distinct subpopulations. As such, if one were able to capture this heterogeneity and group the data into different units, ideally based upon the subpopulations from which they were generated, better estimates of safety and HSID rankings could likely be obtained. Thus, three types of clustering algorithms (GFMNB-$g$ model based, K-means clustering, and hierarchical clustering with complete linkage) are proposed to cluster the data into distinct subgroups that hopefully correspond to the subpopulations from which the data were generated. The main idea of clustering is to define groups (i.e., clusters) of data points such that all points assigned to a given cluster are more similar to the points in that cluster than to the points in any other cluster [31].

Clustering methods present an ideal means to represent heterogeneity within crash data. As such, we apply clustering-based EB methods in this study as a new means of hotspot identification. For these methods, the three types of clustering aforementioned are considered, and the classification method for HSID purposes has four main steps, as follows. First, the full set of input crash data is clustered into $g$ clusters via the GFMNB-$g$ model, the K-means clustering algorithm, or the hierarchical clustering algorithm. In this study, the number of clusters considered is set equal to the number of components selected for the GFMNB-$g$ model, which was itself selected on the basis of the Bayesian information criterion (BIC). Ultimately, however, the choice of both the number of clusters and the number of components in the GFMNB-$g$ model is up to the analyst. The second step splits the data into groups based on the results of the applied clustering algorithm. The third step calls for estimating an NB regression model (i.e., SPF) for each of the subgroups/clusters and using these SPFs to generate EB estimates for each site. For example, if $g = 2$, two SPFs will be estimated and the data in each of the two groups will have EB estimates calculated through application of the corresponding SPF. Fourth and finally, the EB estimates for all sites across all subgroups are aggregated and ranked, after which hotspot identification is based on threshold values or other methods. From this point forward, the classification-based HSID methods aforementioned will be referred to as the GFMNB-based EB method, the K-means-based EB method, and the hierarchical-based EB method, respectively. A summary of the classification-based EB method for HSID is shown in Table 1.
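Assuming the cluster labels and the per-cluster SPF outputs are already in hand, the later steps of this procedure reduce to a short computation. The sketch below is illustrative; the argument names and the NB weight form $w = \phi/(\phi + \mu)$ are assumptions for exposition rather than the paper's exact implementation.

```python
import numpy as np

def clustering_based_eb_ranking(observed, labels, mu, phi_by_cluster):
    """Rank sites by EB estimates computed from cluster-specific SPFs.

    observed       -- (n,) observed crash counts per site
    labels         -- (n,) cluster label of each site from any clustering step
    mu             -- (n,) SPF-predicted means, each from its cluster's SPF
    phi_by_cluster -- dict: cluster label -> NB inverse-dispersion of its SPF
    """
    observed = np.asarray(observed, dtype=float)
    mu = np.asarray(mu, dtype=float)
    phi = np.array([phi_by_cluster[c] for c in labels], dtype=float)
    w = phi / (phi + mu)                    # EB weight from each cluster's SPF
    eb = w * mu + (1.0 - w) * observed      # per-site EB estimate
    return np.argsort(-eb)                  # site indices, highest EB first
```

Sites at the top of the returned ranking would then be flagged as hotspots once a threshold (e.g., the top percentile of sites) is chosen.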