Mathematical Problems in Engineering

Volume 2017 (2017), Article ID 6263726, 10 pages

https://doi.org/10.1155/2017/6263726

## Predicting Real-Time Crash Risk for Urban Expressways in China

^{1}Research Institute of Highway, Ministry of Transport, 8 Xitucheng Road, Haidian District, Beijing 100088, China

^{2}School of Transportation Science and Engineering, Beihang University, 37 Xueyuan Road, Haidian District, Beijing 100191, China

Correspondence should be addressed to Miaomiao Liu

Received 24 August 2016; Revised 18 November 2016; Accepted 30 November 2016; Published 30 January 2017

Academic Editor: Gennaro N. Bifulco

Copyright © 2017 Miaomiao Liu and Yongsheng Chen. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We developed a real-time crash risk prediction model for urban expressways in China. About two years of crash data and the matching traffic sensor data from the Beijing section of the Jingha Expressway were used. Traffic data were extracted in six 5-minute intervals between 0 and 30 minutes prior to crash occurrence. To determine an appropriate data training period, the data (in each 5-minute interval) from six different periods were used as training data in turn, and the crash risk value under each data condition was defined. We then proposed a new real-time crash risk prediction model that combines the decision tree method with an adaptive neural network fuzzy inference system (ANFIS). A comparison of several real-time crash risk prediction methods showed that the proposed method had higher precision than the others. The training error and testing error were lowest (0.280 and 0.291, respectively) when data from 0 to 30 minutes prior to crash occurrence were used and the decision tree-ANFIS method was applied to train and establish the model. The prediction accuracy for crash occurrence reached 65% when 0.60 was taken as the crash prediction threshold.

#### 1. Introduction

Because of the rapid increase in traffic flow and the frequent occurrence of crashes, traffic safety has become a severe problem for rural roads and urban expressways in China [1]. Real-time crash risk prediction is an important approach to identifying the traffic conditions that lead to crashes, and it can be used in active traffic management to reduce traffic accidents and improve traffic safety. In China, rural roadways lack traffic flow detection devices, so it is difficult to collect real-time traffic flow data and predict real-time crash risk for these roads. On most urban expressways in China, by contrast, traffic detection devices such as loop detectors, microwave sensors, and video detection systems are widely installed, which makes it easier to detect and extract traffic flow data. Thus, in this study, we focused on urban expressways in China and established a real-time crash risk prediction model for these roads.

Recently, many researchers have analyzed the interrelationship between crash and traffic flow variables using loop detector data or microwave sensor data, and almost all of them emphasized that certain traffic conditions could be associated with high crash likelihood.

In 2001, Oh et al. [2] established the first real-time crash prediction model, in which they divided traffic dynamics into two conditions: normal and disruptive. They then applied a Bayesian model to assess the likelihood of future traffic flow data falling into these two conditions. In 2005, Oh et al. [3] analyzed 52 crash data variables and corresponding traffic data from loop detectors and identified the real-time crash likelihood by using a nonparametric Bayesian approach. These two studies identified the standard deviation of speed as the most significant variable. In a later study [4], Oh et al. applied a Probabilistic Neural Network (PNN) and employed the t-test on the mean and deviation of three variables (occupancy, flow, and speed) to identify the crash indicator. The results showed that the standard deviation of speed as well as the average occupancy could be considered as predictors. A new real-time crash prediction model was then established by randomly selecting 30 crash data variables from their sample, testing the outcome, and repeating the process 30 times. The threshold value and the accurate prediction rate were 38.2% and 44.9%, respectively.

In 2002, Lee et al. [5] pointed out the potential of real-time crash prediction as part of a proactive road safety management system and used a log-linear model to estimate crash risks based on real-time traffic flow data collected from freeway loop detector stations. They introduced a new concept called crash precursors, defined as the traffic conditions that exist before the occurrence of a crash. Lee et al. [6] then reduced the number of assumptions made in the first study to make the approach more acceptable. They concluded that the coefficient of variation in speed, traffic density, and the speed difference between upstream and downstream loop detector stations were significantly correlated with the crash risk. In the later study, Lee et al. selected speed variations along a lane, traffic queue, and traffic density at given road geometry, weather condition, and time of day as predictors and applied an aggregated first-order log-linear model to predict crashes. The developed model was not validated with another dataset, and the prediction success was represented by the overall model fit, the statistical significance of the coefficients, and the consistency of the coefficients with the order of levels of crash precursors.

In 2004, Abdel-Aty and Pande [7] used a sample of 148 crashes, of which 100 were used to generate the model and the remaining 48 were used for validation. They used logistic regression and the odds ratio to develop a new index called the hazard ratio, which represents the factor by which the risk of observing a crash in the vicinity of the crash station increases with a unit increase in the corresponding risk factor (here, the predictors of crash). Lastly, they used a Probabilistic Neural Network (PNN) to distinguish between crash and noncrash situations. They found the coefficient of variation in speed, obtained from the station near the crash and from the two stations immediately upstream prior to the crash, to be the most suitable predictor. Although their study produced by far the best results for predicting crashes, the overall classification accuracy, that is, for crash and noncrash situations together, was poor (62%). In a later study, Abdel-Aty and Abdalla [8] used the Generalized Estimating Equation method and also included road geometry variables. The study found that high variability in speed over a period of 15 minutes at a specific location increases the likelihood of a crash and that low variability in volume over 15 minutes at a given location increases the crash likelihood downstream. In addition, Abdel-Aty et al. [9] used matched case-control logistic regression to analyze the relationship between crash likelihood and real-time traffic flow characteristics. The results showed that the most significant factors influencing the likelihood of crash occurrence were the average occupancy observed at the upstream station and the coefficient of variation in speed at the downstream station. In 2005, Abdel-Aty and Pande [10] collected multiple speed derivatives, including the logarithms of the coefficient of variation in speed, for both crash and noncrash conditions.
They then applied a Bayesian classifier based methodology, the Probabilistic Neural Network (PNN) model, to predict crash occurrences on freeways and classify the collected data as crash or noncrash cases. Pande and Abdel-Aty [11] collected traffic surveillance data from a pair of dual loop detectors and developed a crash risk prediction model by using a classification tree and a neural network. They found that this model could identify the hazardous traffic conditions prone to lane-change related collisions.

Recently, Hossain and Muromachi [12] divided expressways into several segments (basic freeway, upstream and downstream of exits, and entrance ramps) and developed separate crash risk prediction models for different segments based on advanced ensemble learning methods such as random forest and classification and regression trees. The results showed that the contributing factors to crash risk were quite different for different road segments. In 2012, Xu et al. [13] conducted a K-means clustering analysis to classify traffic flow into five different states. Then they developed conditional logistic regression models to analyze the relationship between crash risks and traffic states on freeways. The results demonstrated that each traffic state could be assigned with a certain safety level and the effects of traffic flow characteristics on crash risks were different for different traffic states.

The primary objective of Xu et al. [13] was to divide freeway traffic flow into different states and to evaluate the safety performance associated with each state. Using traffic flow data and crash data collected from a northbound segment of the I-880 freeway in the state of California, United States, they conducted K-means clustering analysis to classify traffic flow into five different states. Conditional logistic regression models using case-controlled data were then developed to study the relationship between crash risks and traffic states. Traffic flow characteristics in each traffic state were compared to identify the underlying phenomena that made certain traffic states more hazardous than others. Crash risk models were also developed for different traffic states to identify how traffic flow characteristics such as speed and speed variance affected crash risks in each state. Their findings demonstrated that the operations of freeway traffic can be divided into different states using traffic occupancy measured from nearby loop detector stations, that each traffic state can be assigned a certain safety level, and that the impacts of traffic flow parameters on crash risks differ across traffic flow states. A method based on discriminant analysis was further developed to identify traffic states from real-time freeway traffic flow data, and validation results showed that it identified freeway traffic states with reasonably high accuracy.

In 2013, Hosseinpour et al. [14] used adaptive neuro-fuzzy inference system (ANFIS) for modeling traffic accidents as a function of road and roadside characteristics. Then the ANFIS model was compared with the Poisson, negative binomial, and nonlinear exponential regression models. The results showed that road width, shoulder width, land use, and access points significantly affected accident frequencies and the proposed ANFIS model had higher prediction performance than the other three traditional models. Then, Xu et al. [15] applied random parameters logistic regression to develop a real-time crash risk model and Bayesian inference based on Markov chain Monte Carlo simulations was used for model estimation. The parameters of traffic flow variables in the model were allowed to vary across different traffic states. Compared with the standard logistic regression model, the proposed model significantly improved the goodness-of-fit and predictive performance. In addition, Xu and Qu [16] also showed and analyzed some basic descriptive statistics of TTC (time to collision) samples, and used a statistical test to analyze the effect of road environments, traffic conditions, and vehicle types on TTC. In 2015, Wang et al. [17] presented a multilevel Bayesian logistic regression model for crashes at expressway weaving segments using crash, geometric, Microwave Vehicle Detection System (MVDS), and weather data. The results showed that the mainline speed at the beginning of the weaving segments, the speed difference between the beginning and the end of weaving segment, and logarithm of volume had significant impacts on the crash risk of the following 5–10 minutes for weaving segments. Sun et al. [18] also utilized Bayesian belief net to build the real-time crash prediction model for the basic freeway segments, and predicted the formation probability of a hazardous traffic condition in 4–9 minutes in a 250-meter-long freeway road section. 
The analysis results indicated that the proposed method could be used for the urban freeway management departments to understand the risk factors and take immediate actions in advance to avoid traffic accidents on the freeway. In 2016, Shi et al. [19] developed a multilevel Bayesian framework to identify the crash contributing factors on an urban expressway in the Central Florida area. Multilevel and random parameters models were constructed and compared with the negative binomial model under the Bayesian inference framework. The results showed that the models with random parameters could achieve the best model fitting, and lower speed and higher speed variation could significantly increase the crash likelihood on the urban expressway.

As described above, previous studies have comprehensively analyzed traffic flow characteristics and crash data, established various real-time crash risk prediction models using different methods, and made considerable achievements. However, few studies have analyzed real-time traffic flow data for urban expressways in China or established a real-time crash prediction model applicable to them. Thus, in this study, we attempted to address these issues by developing a real-time crash risk prediction model with readily available variables to realize real-time risk assessment for urban expressways in China.

Based on the decision tree method and the adaptive neural network fuzzy inference system (ANFIS), we proposed a new real-time crash risk prediction model. We then compared it with several other real-time crash risk prediction methods, such as logistic regression, decision tree, and support vector machine (SVM). The manuscript is organized into five sections. The introductory section has laid out the background and stated the purpose and objective of the study. Section 2 describes the data extraction and processing. Section 3 defines real-time crash risk, presents a self-contained introduction to the modeling method, and evaluates the established model by comparing it with the results of several other real-time crash risk prediction methods. Section 4 discusses the model building and evaluation process and summarizes the salient contributions and findings of the study, along with its limitations and the scope for future work.

#### 2. Data Collection and Preparation

To accomplish the research objective, data were obtained from a 39.7-kilometer segment of the Jingha Expressway in Beijing, China. There are 20 microwave detector and 16 video detector stations in the upstream and downstream directions along the selected expressway section, with an average spacing of 1.10 kilometers. The traffic flow and crash data were recorded from January 2013 to October 2014. A total of 123 crashes were identified and used in the study.

The traffic data were obtained from Huabei Expressway Corporation, LTD. Average speed, volume, and occupancy were collected for each lane in 30-second aggregation intervals. The 30-second raw detector readings from the upstream station were aggregated into 5-minute intervals and converted into the 9 traffic flow variables presented in Notations, so each variable in Notations is a 5-minute observation. To identify hazardous traffic conditions and make preemptive measures possible [10, 12], we extracted traffic data from the upstream station in six 5-minute intervals between 0 and 30 minutes prior to crash occurrence. For example, if a crash occurred at 8:00 am, the traffic data were extracted from 7:30 to 8:00 am, and the six 5-minute intervals were 7:30–7:35 am, 7:35–7:40 am, 7:40–7:45 am, 7:45–7:50 am, 7:50–7:55 am, and 7:55–8:00 am, respectively. For each crash in the dataset, we selected two 30-minute periods of traffic data (six 5-minute intervals each) without crashes from crash-free days during the same period. The 9 traffic flow variables were computed for these intervals to form the crash-free observations.
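The interval extraction described above can be illustrated with a minimal Python sketch. It is not the processing code used in the study: the function names `precrash_windows` and `aggregate_speed` are hypothetical, and the two statistics computed (average and standard deviation of speed) are assumed stand-ins, since the actual 9 variables are defined in Notations.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def precrash_windows(crash_time, n_intervals=6, minutes=5):
    """Return (start, end) pairs for the intervals before crash_time, earliest first."""
    step = timedelta(minutes=minutes)
    start = crash_time - n_intervals * step
    return [(start + i * step, start + (i + 1) * step) for i in range(n_intervals)]


def aggregate_speed(readings, window):
    """Aggregate 30-second (timestamp, speed) readings falling in [start, end).

    The two statistics are illustrative assumptions, not the paper's variables.
    """
    start, end = window
    speeds = [speed for t, speed in readings if start <= t < end]
    if not speeds:
        return None  # no detector data in this interval
    return {"avg_speed": mean(speeds), "std_speed": pstdev(speeds)}


# Example from the text: a crash at 8:00 am gives windows from 7:30 to 8:00 am.
crash = datetime(2013, 5, 1, 8, 0)
windows = precrash_windows(crash)

# Synthetic 30-second readings (made-up speeds alternating 60 and 61 km/h).
readings = [
    (crash - timedelta(minutes=30) + timedelta(seconds=30 * i), 60.0 + (i % 2))
    for i in range(60)
]
first_stats = aggregate_speed(readings, windows[0])
```

The same two functions could be applied to the matched crash-free periods by passing the end of each crash-free 30-minute period in place of the crash time.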

#### 3. Methodology

##### 3.1. Defining Real-Time Crash Risk

To determine the appropriate data training period, the data (in each 5-minute interval) from six different periods (0 to 5 minutes, 0 to 10 minutes, 0 to 15 minutes, 0 to 20 minutes, 0 to 25 minutes, and 0 to 30 minutes prior to crash occurrence) were used as training data in turn, and the crash risk value under each data condition was defined. In this study, we assumed that the closer to the crash occurrence, the higher the crash risk, and that the crash risk value decreased linearly from the first 5-minute interval prior to crash occurrence to the last interval. In addition, we set the crash risk value in the first and the last 5-minute interval prior to crash occurrence to 1 and 0, respectively.

That is to say, if we extracted traffic data (in each 5-minute interval) during 0 to 5 minutes prior to crash occurrence (i.e., the first 5-minute interval prior to crash occurrence) as training data, the crash risk for a crash case in this period could be considered as 1 and the crash risk for a noncrash case could be considered as 0.

If we extracted traffic data during 0 to 10 minutes prior to crash occurrence (i.e., the first and the second 5-minute intervals prior to crash occurrence) as training data, the crash risk for a crash case in this period could be considered as 1 and 0 during the first and the second 5-minute intervals prior to crash occurrence, respectively, and the crash risk for a noncrash case could be considered as 0 for the two 5-minute intervals.

If we extracted traffic data during 0 to 15 minutes prior to crash occurrence (i.e., the first to the third 5-minute intervals prior to crash occurrence) as training data, the crash risk for a crash case in this period could be considered as 1, 1/2, and 0 during the first to the third 5-minute intervals, respectively, and the crash risk for a noncrash case could be considered as 0 for the three 5-minute intervals.

If we extracted traffic data during 0 to 20 minutes prior to crash occurrence (i.e., the first to the fourth 5-minute intervals prior to crash occurrence) as training data, the crash risk for a crash case in this period could be considered as 1, 2/3, 1/3, and 0 during the first to the fourth 5-minute intervals, respectively, and the crash risk for a noncrash case could be considered as 0 for all the four 5-minute intervals.

If we extracted traffic data during 0 to 25 minutes prior to crash occurrence (i.e., the first to the fifth 5-minute intervals prior to crash occurrence) as training data, the crash risk for a crash case in this period could be considered as 1, 3/4, 1/2, 1/4, and 0 during the first to the fifth 5-minute intervals, respectively, and the crash risk for a noncrash case could be considered as 0 for all the five 5-minute intervals.

If we extracted traffic data during 0 to 30 minutes prior to crash occurrence (i.e., the first to the sixth 5-minute intervals prior to crash occurrence) as training data, the crash risk for a crash case in this period could be considered as 1, 4/5, 3/5, 2/5, 1/5, and 0 during the first to the sixth 5-minute intervals, respectively, and the crash risk for a noncrash case could be considered as 0 for all the six 5-minute intervals.
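The six cases above follow one rule: with n intervals in the training period, the k-th interval before the crash (k = 1 being the closest) receives risk (n − k)/(n − 1), and the single-interval case receives 1. A minimal Python sketch of this labeling is given below; the function name `crash_risk_labels` is hypothetical, and exact fractions are used only to match the values listed in the text.

```python
from fractions import Fraction


def crash_risk_labels(n_intervals, crash_case=True):
    """Risk labels for the 1st..n-th 5-minute interval prior to the crash time.

    Crash cases decrease linearly from 1 (closest interval) to 0 (farthest);
    noncrash cases are 0 in every interval.
    """
    if n_intervals < 1:
        raise ValueError("n_intervals must be at least 1")
    if not crash_case:
        return [Fraction(0)] * n_intervals
    if n_intervals == 1:
        return [Fraction(1)]  # 0 to 5 minutes: only one interval, risk 1
    return [
        Fraction(n_intervals - 1 - k, n_intervals - 1) for k in range(n_intervals)
    ]


labels_30min = crash_risk_labels(6)  # 0 to 30 minutes prior to the crash
labels_20min = crash_risk_labels(4)  # 0 to 20 minutes prior to the crash
```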

##### 3.2. Modeling Method

###### 3.2.1. Identifying Main Factors Influencing the Crash Risk Based on Decision Tree Method

To identify the most important variables influencing real-time crash risk, decision tree method was used to analyze the relationship between traffic variables and real-time crash risk. Decision trees or classification trees are among the popular statistical tools that emerged from the field of machine learning and data mining. Classification trees classify observations by recursively partitioning the predictor space. The resultant model can be expressed as a hierarchical tree structure. Especially since the introduction of the classification and regression trees (CART) [20], decision trees have received wide use in a variety of fields because of their nonparametric nature and easy interpretation [21].

In the traffic field, the application of decision trees is also extensive. For instance, De Oña et al. [22] employed decision tree method to identify the key factors that affected bus transit quality of service and to compare the key attributes identified before and after passengers reflect on the main aspects of the system. Using 2005 to 2006 truck-involved accident data from national freeways in Taiwan, Chang and Chien [23] developed a nonparametric decision tree model to establish the empirical relationship between injury severity outcomes and driver/vehicle characteristics, highway geometric variables, environmental characteristics, and accident variables.

In this study, we chose the decision tree method to analyze the main factors that affected real-time crash risk. The SPSS software package (version 13.0; SPSS Inc., Chicago, IL, USA) was used to conduct the decision tree analyses. We took all of the variables in Notations as input parameters and the crash risk value (as defined in Section 3.1) as the output parameter. Because the CART method can avoid overfitting the model by “pruning the tree,” all decision trees in this study were developed based on the CART approach. The Gini criterion was used as the splitting criterion. All trees were pruned automatically to the smallest subtree, with one standard error as the specified maximum difference in risk. Since the dataset is not very large, the minimum number of cases was set to 10 for parent nodes and to 3 for child nodes. Using SPSS, we obtained hierarchical tree structures, as shown in Figure 1, and identified the main factors influencing the crash risk. Table 1 shows the main factors influencing the crash risk under different data training conditions. For the detailed structure of a decision tree, see De Oña et al. [22].
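To make the splitting rule concrete, the following Python sketch evaluates candidate thresholds on a single variable by weighted Gini impurity, subject to a minimum child-node size of 3 as stated above. It is an illustration only and does not reproduce the SPSS implementation: the function names, the binary labels, and the occupancy values are all made up for the example, and treating the crash risk values as class labels is an assumption.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels, min_child=3):
    """Return (weighted_gini, threshold) of the best split on one variable.

    Each child node must contain at least min_child cases; returns None
    if no admissible split exists.
    """
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    best = None
    for i in range(min_child, n - min_child + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[0]:
            best = (score, threshold)
    return best


# Made-up occupancy values (%) with hypothetical labels (1 = crash, 0 = noncrash).
occupancy = [5, 8, 10, 12, 30, 35, 40, 45]
is_crash = [0, 0, 0, 0, 1, 1, 1, 1]
split = best_split(occupancy, is_crash)
```

In a full CART procedure this search would be repeated over all input variables at every node, followed by pruning; the sketch covers only the split evaluation at a single node.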