Abstract

This study is intended to focus on the major factors affecting traffic crash rates and severity levels, in addition to identifying crash-prone locations (i.e., black spots) based on the two indicators. The available crash data for different road segments used for the analysis were obtained from the Washington state database provided by the Highway Safety Information System (HSIS) for the years 2006 to 2011. A Random Forest (RF) classifier was used to predict the outcome level of crash severity, while crash rates were predicted by applying RF regressor. Certain features were selected for each model besides the abstraction of new features to check if there are unobserved correlations affecting the independent variables, such as accounting for the number and weight of crashes within 1 km2 area by implementing the Getis-Ord Gi∗ index. Moreover, to calculate the collective risk (CR) score, crash rates were adjusted to incorporate crash severity weights (cost per severity type) and regression-to-the-mean (RTM) bias via Empirical Bayes (EB) method. Finally, segments were ranked according to their CR score.

1. Introduction

In recent years, crash rate and crash severity were the two major indicators used to assess roadway’s traffic safety, as well as identifying crash-prone locations. Reducing crash rates and severity requires different strategic approaches and policies that aim at reducing the exposure for traffic accidents. Both reliable historical crash data and well-developed prediction models are essential in the process of estimating the impact of human and environmental-based features on the frequency and outcome of a crash incident. A weighted risk index or a combination of crash frequency and severity should prove an advantage over having only crash rate as an index of traffic hazards, without any regard to severity level or vice versa (e.g., having an urban road segment with high crash rate but lower levels of crash severity, and the opposite on a rural highway segment). Predictions should also account for fluctuations and trends in crash rates. Furthermore, unobserved features and correlations (e.g., spatial correlations between adjacent segments that can amplify a hazardous impact) should be considered. Observed or predicted indicators can be utilized to rank road facilities according to their overall risk score, allowing traffic safety agencies to distinguish which facilities have the priority in future crash countermeasure policies.

This article hereby introduces a collective risk (CR) score that combines both crash rate and severity into a single index, and in the same time, gives an account of spatial correlations between adjacent road segments using the Getis-Ord Gi∗ statistic and provides control over bias in predictions of hazardous segments using the Empirical Bayes (EB) method. So as to illustrate the process mentioned above, Section 2 provides a brief review of related literature, while Section 3 presents the implemented methodology to process the given data, Section 4 lists the results with discussion, and finally, Section 5 concludes the findings of this article.

2. Review of Literature

With regard to crash data, many researchers focused on different factors affecting the estimation of crash rates and crash severity. Aside from the observed features, there are other points that should be taken into consideration when analysing crash data, these factors include (a) underreporting and source of data, as police reports count about 50% of actual property-damage-only (PDO) crashes; (b) ordinal nature of severity data and the correlation among neighbouring severity levels; (c) fixed parameters that restrict the effects of explanatory variables to be the same across individual observations, ignoring unobserved heterogeneity such as in risk-taking behavior; (d) fixed severity thresholds that lead to biased estimations of features (e.g., airbags implementation), affecting the likelihood of certain severity levels due to a constrained shift in thresholds; (e) endogeneity bias that occurs when explanatory variables (e.g., the amount and frequency of warning signs) are affected by the dependent variable (e.g., the crash rate); (f) within-crash and crash type correlations (e.g., number of crash-involved personnel, collision angle, cause, and responsibility, etc.) [13].

In addition to the combined effect of crash frequency and severity, spatial and temporal correlations between crash observations were also a major focus. Ma and Kockelman [4] performed a study on speed limit’s effect on traffic safety by clustering homogenous highway segments from the state of Washington into crash counts by severity. Random-effects linear model and ordered probit regression were used for estimating crash rates and severity, respectively. On the other hand, Soltani and Askari [5] used spatial autocorrelation methods (Moran’s I and Getis-Ord Gi∗ index) for clustering hotspots in Iran. Clusters were compared in three attributes; time of the crash, severity index, and location. In comparison, Bao et al. [6] utilized the RF technique to identify factors affecting crash severity levels with the purpose of studying unobserved travel patterns and spatial correlations related to crash rate distribution for large-scale taxi GPS data from the city of New York.

Efforts were also made to introduce a combined index used for ranking potential hotspots, providing a reference to impose safety countermeasures. Da Costa et al. [7] incorporated crash severity in a collective risk ranking of road segments in Australia. The severity-weighted frequencies per crash type (based on crash severity costs in Australia) were adjusted for a potential RTM bias via the EB method. Similarly, Afghari et al. [8] employed the EB method to predict hotspots in Australia besides a joint risk model of both crash rate and severity, stating that the latter has an advantage over traditional count models.

Table 1 lists additional articles related to crash risk analysis.

3. Dataset and Methodology

The dataset from the HSIS database (years 2006 to 2011), which covers 38 counties of Washington state, contains accident data for each year, including crash scene characteristics and conditions. Different data logs were merged using a Visual Basic for Applications (VBA) automated MS Excel worksheet. The data were then split into two groups, the first group for building the models (years 2006 to 2009) and the other was used as a testing sample (years 2010 and 2011). Crash data (including time, weather, and severity level) referenced a single point on the roadway and only account for the most severe observation, while crash scene characteristics were collected from the homogenous segment of the roadway where the crash occurred [18]. Data went through a preprocessing stage which involved new categorization based on ordinal ranking for each feature, in addition to the elimination of redundant features and observations, and finally adding new features (such as hotspots).

3.1. Hotspots

To verify whether a crash scene or a road segment is a part of a larger cluster of hazardous locations, the Getis-Ord statistic Gi∗ was calculated. Additional data for milepost coordinates were obtained from the Washington State Department of Transportation (WSDOT) geospatial database [19]. Using crash scene milepost and the related coordinates of that milepost, and by implementing a VBA automated Excel worksheet, all crash observations from the years 2006 to 2009 were incorporated in the Gi∗ statistic calculation process. The collected crash data categorized severity into 8 classes. However, to conform with the crash cost data for the state of Washington [20], crash severity categories were transformed into 4 severity classes based on the KABCO ranking as shown in Table 2.

In order to incorporate crash counts into the calculation process, all crash observations were weighted according to their cost value (C = Cost/CostPDO). The study area (about 580 km × 380 km) was divided into 1 km2 cells and each crash observation was assigned to its cell centroid. The G family of statistics can be used as measures of spatial correlations in many aspects, yet Gi and Gi∗ (i.e., local statistics) can detect pockets of spatial links within the global study area. In this article, the standardized local statistic Gi∗ was chosen over the other global and local statistics for the sake of both simplicity and efficiency in finding local clusters since it can detect correlations within the same local area (e.g., in the case of adjacent segments or an intersection that are located within the same grid unit i).where is the distance weight function between the centroid of interest i and the adjacent centroid j, and xj is the weighted value of crashes in j (i.e., ∑Cj), while n is the total number of cells (or centroids) including centroid i [21]. The distance was assigned to be a 1.5 km radius around centroid i to reach centroids in corner cells (Queen neighborhoods), while the distance function  = INT[1/(INT(0.67 × dij) + 1)] produces only two outputs, 1 for d < 1.5 (including dii = 0) and 0 for d ≥ 1.5. The value is already a z-statistic and can be directly utilized to identify clusters within the grid according to the -value of the cell. Figure 1 shows the clusters within the study area as cells with were considered significant crash-prone areas [22]. A hotspot class was given for each crash observation based on their clusters. Hotspot classes were incorporated in the modeling process as an explanatory variable.

3.2. Crash Severity

Out of 183,400 training data points (years 2006 to 2009) and 85,160 testing observations (years 2010 and 2011), only 157,321 and 43,623 data points were selected, respectively. Preprocessing included removing data points with missing data, as well as the elimination of features with low variance, Spearman Correlation ranking, and recursive feature elimination and cross-validated selection (RFECV). Figure 2 provides a description of the top-ranked features with regard to their importance to crash severity prediction.

Two of the most popular machine learning algorithms supported by Python were used to fit the observed crash severity levels and the related crash features.

3.2.1. Random Forests

Random forest is an ensemble of K tree-structured estimators (hk) that split according to the values of a random vector Θk sampled independently and with the same distribution for all trees in the forest. Classification accuracy results from growing the trees and collecting their votes for the most popular class y at input x. A margin function mg(xy) measures the extent to which the average number of votes at x, y for the right class exceeds the average vote for any other class. The larger the margin, the more confidence in the classification.where I(·) is the indicator function and PE is the generalization error [23]. Nonetheless, given the convenience of the Scikit-learn ensemble package [24], a random forest of 1,000 estimators was used to classify the severity levels.

3.2.2. Ordered Logit

Discrete choice models, such as the multinomial logit and the mixed logit, were used extensively in the transportation domain (e.g., route choice and travel behavior) [25, 26]. However, as mentioned in Section 2, the nature of severity data requires an ordinal classifier. Based on the application of the generalized linear models (GLM), the ordered logit model (OLM) scales the response values into J ordered classes, and each response zj is confined within θj thresholds (θj-1 < zj < θj). To estimate the optimal slope vector parameters and thresholds θj, both prediction errors and the sum of all penalties (for crossing each threshold), given by the all-thresholds loss function, should be minimized in all classes.where f(·) is a margin penalty function, y is the predicted range, and s(j, y) is positive if j ≥ y and negative otherwise. The data can be fitted using the posterior probability P(y ≤ j |x), and the cumulative probability of the ordinal class follows a logistic function [27].

This procedure is implemented in the MORD library under the all-threshold variant classifier [28].

3.3. Crash Rate

The 157,321 observations from the years 2006 to 2009 and the 43,623 observations from the years 2010 and 2011 (i.e., data used in Section 3.2.) went through further preprocessing to be implemented in the crash rate prediction process. Crash observations were clustered for each homogenous road segment, with a length of no less than 0.16 km for convenience [29]. Since the HSIS dataset only provides the most severe case for a single crash observation, grouping data into segments for each year makes temporal variables (e.g., weather conditions) redundant. 19,490 segments from the years 2006 to 2009 and 7,321 from 2010 to 2011 were selected for training and testing the model. The features were filtered by standardized coefficients and Spearman Correlation rankings, as well as the embedded RFECV method (See Figure 3).

3.3.1. Random Forest Regressor

Similar to RF classifiers, RF regressors are formed by growing trees depending on the random vector Θ. The output of the tree predictors hk is numerical and the training set is independently drawn from the distribution of the random vector y, x. An accurate regression forest has a low correlation between residuals and low error trees. The mean squared generalization error for the forest Ex,y(y − avgkh(x, Θk))2 is formed by taking the average random predictor over k trees [23]. Scikit-learn’s regressor was used with 1,000 estimators.

3.3.2. Bayesian Ridge Regression

Bayesian inference techniques are considered popular machine learning tools in the transportation field (e.g., intelligent transportation systems) [30]. Ridge regression is a form of linear regression that penalizes the coefficients with the ridge coefficient α ≥ 0, hence minimizing the residual sum of squares.

The larger the value of α, the greater the amount of shrinkage. Similarly, the Bayesian Ridge (BR) includes regularization parameters during the estimation procedure, where has a prior given by a spherical Gaussian N(, λ−1Ip). The numerical output y of the probabilistic model p(y|x, , α) =  is assumed to follow a Gaussian distribution and both priors for α and λ are randomly chosen from gamma distributions. , α, and λ are estimated during the fit of the model by maximizing the log marginal likelihood [24]. Scikit-learn’s BayesianRidge was used with default initial parameters.

3.4. Empirical Bayes

Traffic crashes are random events, thus predicting the rates have some limitations due to the inherent characteristics of the data itself, regardless of the methods used to collect it. These limitations can introduce bias and affect crash data reliability, especially in the long term. Statistically, a low crash rate period will likely be followed by a high rate period, and vice versa. This tendency is known as regression-to-the-mean (RTM). Applying short-term predictions for longer periods can cause a bias called RTM bias (See Figure 4).

The EB method is used to estimate the expected average crash rate NE of an individual site (road segment or intersection) for a given time period T (in years) on the condition of unchanged site features (e.g., geometric design). The estimate relies on predictive models NP (modified for similar characteristics or control function) combined with observed crash rates NO.

An overdispersion parameter (i.e., a variance of classification) is needed to provide an indication of the statistical reliability of NP. The lower is the overdispersion parameter, the more statistically reliable is the model. This parameter is used in the EB method to provide the weights and to NP and NO, respectively [29]. As NP increases, NE becomes more dependable on the NO value.

Since the crash rate model already incorporated various variables (e.g., road features) in the fitting stage, in addition for providing more simplicity, NP values for the study period (i.e., years 2006 to 2009) are taken directly from the model output without further modifications regarding these features. Aiming to give more importance to segments with low rates but high severity levels (see Figure 5), both NP and NO should be converted to severity-weighted rates SP and SO, respectively.where Pitj is the probability of crash severity j occurring on segment i in year t, Cj is the cost for severity level j, and Mit is the total number of crash predictions or observations in segment i and year t. When SP is calculated, P is taken directly from the classifier probability predictions, while for SO, P is simply taken as the ratio mj/NO.

Finally, NE is modified to estimate the collective risk (CR) by predicting the weighted crash rate SF for future year t (i.e., the year 2010 or 2011) [7].

4. Results and Discussion

Crash severity was found to be correlated to some unobserved random features such as weather conditions and collision points. These features were needed to be drawn from a suitable distribution to be used in the model’s testing stage. Both weather and road surface conditions follow a Gaussian distribution, while the collision point follows a Beta distribution (with α = 0.2 and β = 0.4). AADT was taken as an average from the years 2006 to 2009. Crash severity results are summarized in Table 3.

It can be seen that both methods could not produce accurate results. Furthermore, predictions were biased towards no Injury/PDO severity class (largest number of observations). However, the RF method generated more correct predictions in the higher severity classes; therefore, class probabilities shall be taken from the RF output. Incorporating the random features did not affect the overall performance of the model; hence, only AADT and median width can be used to predict severity class. Figure 6 illustrates the results of crash rate models.

RF regressor gave more accurate predictions for crash rates, while the BR model had underdispersed results. Thus, RF predictions are selected to be used as NP and NF values, respectively. EB weights for SP indicate that predicted values have less importance to the estimated crash rate (see Figure 7).

Since not all segments had crash observations during both periods (i.e., years 2006–2009 and 2010–2011), only 3,649 and 2,647 segments from the years 2010 and 2011, respectively, were ranked according to their CR values. The top 5 ranked segments are listed in Table 4.

CR ranking indicates a local cluster on route 5 in the city of Seattle. Nevertheless, only 9 segments were included in the top 100 ranks of both actual and predicted rankings. The predictions of the year 2011 showed a slight improvement in accuracy for the top 1000 segments (see Figure 8). This can be understandable since the EB method is more effective in long-term estimates rather than the short-term. Severity weights also have a role in some bias towards crash clusters.

5. Conclusion

Machine learning algorithms introduce simple alternatives to the conventional regression models by providing both reasonable estimations and a user-friendly interface. In other words, it does not require comprehensive statistical knowledge. Random Forest accuracy for regression and classification outperformed traditional models (i.e., linear and logistic estimators). Yet, a multistage algorithm that combines RF and other models might produce better results and should be further investigated. Annual average daily traffic was a major factor in predicting both crash severity and frequency among some road features. On the other hand, random and temporal variables had a very small impact. The Empirical Bayes method demonstrated significant control over predicted crash rates, keeping it in a reasonable range. Although CR rank predictions were relatively poor in terms of reliability, having a collective risk estimate exhibits more potential to identify hotspots or localized clusters of safety hazards, thanks to the combination of frequency and severity of crashes. Attention should be paid to severity weights (i.e., considering an alternative to the cost index C) and the screening method (e.g., using road segments instead of grid cells) to avoid allocating large weights on outliers, especially when reported data points are related to the most severe case and there is only one observation per segment.

Data Availability

The traffic crash datasets used to support the findings of this study were supplied by HSIS. Requests for access to these data for research and educational purposes should be made to HSIS staff at https://www.hsisinfo.org/datarequest.cfm. Other types of data are cited at relevant places within the text as references.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the National Key Research and Development Program of China (No. 2018YFB1600900), the National Nature Science Foundation of China (Nos. 71971056 and 51608115), the Six Talent Peaks Project in Jiangsu Province XNYQC-003, and the open project of the Key Laboratory of Advanced Urban Public Transportation Science, Ministry of Transport, PRC. This research was also jointly funded by research grants from the Research Grants Council of the Hong Kong Special Administrative Region (Project No. PolyU 15212217) and the Hong Kong Scholars Program (Project No. G-YZ1R).