Abstract

This paper compares two modeling techniques, the Bayesian network and Regression models, by applying them to accident severity analysis. Three severity indicators, namely, number of fatalities, number of injuries, and property damage, are investigated with both methods, and the major contributing factors and their effects are identified. The results indicate that the goodness of fit of the Bayesian network is higher than that of the Regression models in accident severity modeling. This finding can help improve the accuracy of accident severity prediction, which is one of the essential steps in the accident management process. By identifying the key influences, this research also offers suggestions for government to take effective measures to reduce accident impacts and improve traffic safety.

1. Introduction

As a significant cause of deaths, injuries, and property loss, traffic accidents are a major concern for public health and traffic safety. According to statistics from the Ministry of Public Security of China between 2009 and 2011, traffic crashes resulted in an average of 65,123 deaths and 255,540 injuries annually in China (China Statistical Yearbook of Road Traffic Accidents, 2009–2011). In the United States, the cost of medical care and productivity losses associated with motor vehicle crash injuries was reported to be over $99 billion, or nearly $500 for each licensed driver (Centers for Disease Control and Prevention, 2010). As one of the major steps of accident management, accident severity prediction can provide crucial information for emergency responders to evaluate the severity level of accidents, estimate the potential impacts, and implement efficient accident management procedures.

In recent years, increased attention has been directed at accident severity prediction, for which the Bayesian network and the Regression model are two widely used modeling techniques. However, to the authors’ knowledge, no study has presented a quantitative comparison of the two methods. Therefore, the present work conducts accident severity modeling with both the Bayesian network and the Regression model. The accuracies of the two methods are then compared, and the better one is selected for accident severity prediction. Through the accident severity analysis, the risk factors and their effects are also identified.

The remainder of this paper is organized as follows. In Section 2, the literature on severity prediction is reviewed. The data are described in Section 3. Section 4 presents accident severity modeling with the Bayesian network and the Regression models (Sections 4.1 and 4.2, respectively). Section 5 discusses the differences between the two methods, and Section 6 concludes the paper with a summary and directions for future research.

2. Literature Review

Regression analysis has been widely used for accident severity prediction and for determining contributing factors. The most commonly used Regression models are the Logistic Regression model and the Ordered Probit model [1–6]. However, some researchers [7, 8] pointed out that most Regression models have their own assumptions and predefined underlying relationships between dependent and independent variables (e.g., linear relations between the variables). If these assumptions are violated, the model could lead to erroneous estimations of the likelihood of severe injury.

Some researchers have carried out traffic accident analysis by employing the Bayesian network. For instance, de Oña et al. [8] applied a Bayesian network to identify the factors affecting injury severity, which was classified into slightly injured and killed/severely injured. Based on the Bayesian network, the factors associated with a killed/severely injured accident were identified as accident type, driver age, lighting, and number of injuries. The results indicate that the Bayesian network is capable of making predictions without the need for prior assumptions and can provide graphic representations of complex systems with interrelated components. Simoncic [9] constructed a Bayesian network for the analysis of injury severity. The results showed that the Bayesian network can be applied in road accident modeling, and the study also presented some of its advantages, such as the ability to involve more variables and larger data sets than a Regression model. Ozbay and Noyan [10] constructed a Bayesian network and applied it to predicting incident duration and understanding the factors associated with incident clearance time. The results indicated that the Bayesian network can represent the stochastic nature of incidents. Gregoriades [11] highlighted the interest of using Bayesian networks to model traffic accidents and discussed the need to treat traffic accidents not as a deterministic assessment problem but as one that requires modeling the impacts of the factors that could lead to them.

Although previous works presented the advantages of adopting the Bayesian network in accident severity modeling, none of them conducted a quantitative comparison of the Bayesian network and the Regression model. Therefore, both are applied to accident severity modeling in this work, and the accuracy of the two models is compared.

3. Data

The data set for this work contains police-reported traffic accident records for Jilin province, China, in 2010. After records containing missing values were eliminated, the final data set consists of 2,246 cases, all of which are accidents involving motor vehicles. In addition to severity information, the data contain information regarding accident characteristics (accident occurrence time and accident location), vehicle characteristics (vehicle type involved and vehicle condition), environmental factors (weather condition and visibility distance), and road conditions (pavement condition, road geometrics, roadway surface condition, etc.).

Previous studies [7, 12] indicated that the factors associated with accident severity mainly include road characteristics, accident characteristics, vehicle characteristics, driver characteristics, and environmental factors. However, variables related to driver characteristics are not included, since suitable records for driver characteristics are not available in the data set. Based on a preliminary correlation test, 3 dependent variables and 14 candidate independent variables were selected from the data set, as shown in Table 1. In the process of accident severity modeling, 50% of the total records are selected as training data to calibrate the prediction models, and the remaining 50% are set aside as testing data.
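The 50/50 division into training and testing data can be sketched as below. This is only an illustration: the source does not state how the records were assigned to each half, so the random shuffle, the `seed` argument, and the function name `split_half` are assumptions.

```python
import random


def split_half(records, seed=0):
    """Shuffle the records and split them into two halves (training, testing).

    Assumption: random assignment with a fixed seed; the paper does not
    describe its selection procedure.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]


# Record indices stand in for the 2,246 accident cases.
train, test = split_half(range(2246))
```

With 2,246 cases, each half contains 1,123 records, and no record appears in both halves.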

4. Accident Severity Modeling

4.1. Accident Severity Modeling with Bayesian Network
4.1.1. A Brief Overview of the Bayesian Network

Over the last decade, Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems. It has been applied to many fields, such as medicine, document classification, information retrieval, image processing, data fusion, and decision support systems [13].

A Bayesian network is a graphical model representing random variables and their conditional dependencies. Figure 1 shows a simple Bayesian network, in which the random variables are represented by nodes. A node from which a directed edge originates is called a parent of the node the edge points to, and the latter is called the child of the former. The directed edge from a parent to its child indicates the dependence of the child on its parent node.

4.1.2. Structure Learning

In most cases, the graphical structure of a Bayesian network needs to be learned automatically from the data. This learning process can be described as follows. Let a random variable $S$ be the structure of a Bayesian network and let $P(S)$ be its prior probability distribution. Given data set $D$, which consists of all the variables represented by nodes in the Bayesian network (e.g., the nodes in Figure 1), based on the Bayesian theorem, the posterior probability of $S$ can be calculated as

$$P(S \mid D) = \frac{P(D \mid S)\, P(S)}{P(D)}, \qquad (1)$$

where $P(S \mid D)$ is the posterior probability of structure $S$, $P(S)$ is the prior probability distribution of $S$, $P(D \mid S)$ is the likelihood of the data given the structure, and $P(D)$ is the probability distribution of data set $D$.

The posterior probability is also called Bayesian score, and (1) is called Bayesian score function. The structure that maximizes the Bayesian score will be chosen as the final structure of the Bayesian network.

4.1.3. Parameter Learning

In order to fully specify a Bayesian network, it is necessary to specify the conditional probability of each node upon its parent nodes in the network, given the structure $S$ and the data set $D$. Parameter learning refers to the process of identifying the parameters in the conditional distribution of each child node given the joint distribution of its parent nodes. Bayesian estimation is one of the methods for parameter learning; it assumes that the parameter $\theta$ is a random variable with prior distribution $P(\theta)$. According to the Bayesian theorem, the posterior probability of the parameter, $P(\theta \mid D)$, given data set $D$ is computed as

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)},$$

where $P(\theta)$ is the prior probability distribution of parameter $\theta$, $P(D \mid \theta)$ is the likelihood of the data given the parameter, and $P(D)$ is the probability distribution of data set $D$.

4.1.4. Learning Results

Since the number of possible structures grows exponentially with the number of variables, it is computationally infeasible to find the most probable network structure, given the data, by exhaustively enumerating all possible network structures. Cooper and Herskovits [14] and Herskovits [15] proposed a greedy algorithm, called the K2 algorithm, which has become one of the most popular methods for structure learning of Bayesian networks. Besides the basic Bayesian theories, the K2 algorithm uses two assumptions, namely, that there is an ordering available on the variables and that, a priori, all structures are equally likely. The K2 algorithm considers each node in the order given to it as input and determines its parents as follows. Initially, each node is assumed to have no parents. Then, a parent is added to a node whenever doing so increases the score (computed by using (1)) of the resulting structure. Parents continue to be added to each node until no additional parent can increase the score. The structure with the highest score is the final structure [14].
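The greedy search described above can be sketched in a few lines of Python. This is a minimal illustration, not the Full-BNT implementation used in the study. It assumes discrete variables coded as integers, uses the Cooper and Herskovits closed-form score in log form, and adds a `max_parents` bound, which is an assumption introduced here to keep the search small. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import lgamma


def log_k2_score(data, child, parents, arity):
    """Log of the Cooper-Herskovits (K2) score of one node given its parents.

    data: list of dicts mapping variable name -> integer state.
    arity: dict mapping variable name -> number of states.
    """
    r = arity[child]
    n_ijk = Counter()  # counts per (parent configuration, child state)
    n_ij = Counter()   # counts per parent configuration
    for row in data:
        cfg = tuple(row[p] for p in parents)
        n_ijk[(cfg, row[child])] += 1
        n_ij[cfg] += 1
    score = 0.0
    for n in n_ij.values():
        score += lgamma(r) - lgamma(n + r)  # log[(r-1)! / (N_ij + r - 1)!]
    for count in n_ijk.values():
        score += lgamma(count + 1)          # log(N_ijk!)
    return score


def k2(data, order, arity, max_parents=2):
    """Greedy K2 search: for each node, add the best-scoring predecessor
    as a parent until no addition improves the score."""
    parents = {v: [] for v in order}
    for idx, node in enumerate(order):
        best = log_k2_score(data, node, parents[node], arity)
        candidates = set(order[:idx])  # only predecessors in the ordering
        while candidates and len(parents[node]) < max_parents:
            top_score, top = max(
                (log_k2_score(data, node, parents[node] + [c], arity), c)
                for c in candidates
            )
            if top_score <= best:
                break
            parents[node].append(top)
            candidates.remove(top)
            best = top_score
    return parents


# Toy data (made up): B copies A, while C is independent of both.
toy = [{"A": a, "B": a, "C": c}
       for a in (0, 1) for c in (0, 1) for _ in range(10)]
learned = k2(toy, order=["A", "C", "B"], arity={"A": 2, "B": 2, "C": 2})
```

On this toy data the search keeps A as the only parent of B and assigns no parents to A or C, because adding an unrelated parent lowers the score.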

The structure of the severity prediction Bayesian network is learned by employing the K2 algorithm and the Full-BNT toolbox, which is an open-source Matlab package for directed graphical models [16]. The final network structure is shown in Figure 2. The resulting Bayesian network for accident severity analysis consists of 13 nodes and the edges connecting them, which represent the relationships between the nodes.

Based on the developed structure, the parameters are learned by Bayesian estimation. The prior distributions of all the variables are assumed to be Dirichlet distributions, a conjugate family that yields a closed-form posterior distribution of the parameters and a closed-form solution for prediction. The Full-BNT toolbox of Matlab is employed to carry out the Bayesian estimation.
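With a Dirichlet prior, the posterior mean of each conditional probability has a simple closed form: the observed count plus the prior pseudo-count, divided by the total count plus the sum of the pseudo-counts. The sketch below illustrates this for one node. It assumes a symmetric prior with the same pseudo-count `alpha` for every state, which the source does not specify, and the variable names and counts are made up for illustration.

```python
from collections import Counter


def dirichlet_posterior_cpt(data, child, parents, arity, alpha=1.0):
    """Posterior-mean conditional probability table under a symmetric
    Dirichlet(alpha) prior (assumption: one pseudo-count per child state)."""
    counts = Counter()
    totals = Counter()
    for row in data:
        cfg = tuple(row[p] for p in parents)
        counts[(cfg, row[child])] += 1
        totals[cfg] += 1
    r = arity[child]
    cpt = {}
    for cfg, total in totals.items():
        denom = total + alpha * r
        cpt[cfg] = [(counts[(cfg, k)] + alpha) / denom for k in range(r)]
    return cpt


# Toy counts (made up): X is the parent, Y is the child.
toy_rows = ([{"X": 0, "Y": 0}] * 8
            + [{"X": 1, "Y": 0}] * 3
            + [{"X": 1, "Y": 1}] * 1)
cpt = dirichlet_posterior_cpt(toy_rows, "Y", ["X"], {"X": 2, "Y": 2})
```

Here `cpt[(0,)]` is `[0.9, 0.1]`: the state of Y never observed with X = 0 still receives nonzero probability because of the prior pseudo-count.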

As the parent nodes gather the impacts of the indirect nodes and deliver them to the child nodes, the analysis focuses on the influence of the parent nodes, that is, the factors which have a direct edge to the severity indicators in this structure. The parameter learning results for number of fatalities (Nof), number of injuries (Noi), and property damage (Pd) are shown in Tables 2, 3, and 4, respectively. The Mean Absolute Percentage Error (MAPE) and the Hit ratio are adopted to examine the accuracy of the models.

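MAPE, which measures the average percentage difference between predicted values and observed ones, is calculated as

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{A_i - P_i}{A_i} \right|,$$

where $A_i$ is the observed value and $P_i$ is the predicted value for observation $i$, and $n$ is the number of observations.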
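Both accuracy indicators are straightforward to compute. The sketch below implements MAPE as the mean absolute relative error, expressed as a fraction rather than a percentage to match the magnitudes reported in this section. The Hit ratio is not defined in the text, so the version here, the share of cases whose predicted category equals the observed one, is an assumption. The input values are made up.

```python
def mape(observed, predicted):
    """Mean absolute percentage error as a fraction.

    Undefined when an observed value is zero (division by zero).
    """
    pairs = list(zip(observed, predicted))
    return sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def hit_ratio(observed_classes, predicted_classes):
    """Share of cases predicted in the correct category (assumed definition)."""
    pairs = list(zip(observed_classes, predicted_classes))
    return sum(a == p for a, p in pairs) / len(pairs)


# Made-up values for illustration only.
example_mape = mape([2.0, 4.0, 5.0], [2.0, 3.0, 5.5])
example_hit = hit_ratio([0, 1, 1, 2], [0, 1, 2, 2])
```

For these inputs the absolute relative errors are 0, 0.25, and 0.1, and three of the four categories are predicted correctly.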

The MAPE value of the fatality forecasting model is 0.0226, and the Hit ratio is 100%, which suggests that this model has high accuracy [17, 18].

According to the developed structure, the parent nodes of Nof are L-Rrs, Vc, and Noi. The estimation results indicate that the probability of a fatal accident increases as the condition of the involved vehicle gets worse. Moreover, a higher number of deaths is associated with a higher number of injuries. Accidents that occur at a normal section of road, rather than at an abnormal section or an intersection, tend to cause more fatalities. The reason may be that vehicles usually slow down when going through an intersection or an abnormal section of road.

The MAPE value of the injury forecasting model is 0.0013, and the Hit ratio is 100%, which indicates high accuracy of the model.

Two parent nodes, namely, Bti and Vc, have direct impacts on the number of injuries in an accident. The estimation results show that accidents involving a bus or truck tend to cause more injuries. In addition, the worse the vehicle condition is, the more injuries occur in the accident.

The MAPE value of the property damage forecasting model is 0.0019, and the Hit ratio is 100%, which indicates high accuracy of the model.

Two parent nodes, that is, L-Rrs and Vc, have direct impacts on property damage. The results indicate that, as with the influences of Vc on Nof and Noi, poor vehicle condition is associated with a large amount of property damage and vice versa. In addition, accidents that occur at an irregular section of road or an intersection tend to cause a large amount of property damage. Combining the effects of L-Rrs on Pd and Nof, it can be deduced that accidents that occur at a regular section of road tend to result in a high number of deaths but a small amount of property damage.

4.2. Accident Severity Modeling with Regression Models

The most commonly used Regression models in traffic injury analysis are the Logistic Regression model and the Ordered Probit model [1–6]. Since the alternatives of Noi and Pd are all ordered, and the Logistic Regression model would fail to account for the ordinal nature of the dependent variable and would suffer from the problem of independence from irrelevant alternatives (IIA) [19], the Ordered Probit model is employed in forecasting Noi and Pd. In addition, the Binary Logit model, one of the Logistic Regression models, is adopted in the prediction of Nof, which has two discrete alternatives.

4.2.1. Binary Logit Model

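As one of the Binomial choice models, the Binary Logit model is commonly used in discrete choice modeling. According to the random utility theory [20], the utility $U_{in}$ of alternative $i$ ($i = 1$ or $2$ for the two alternatives of Nof, resp.) for accident $n$ can be specified as

$$U_{in} = V_{in} + \varepsilon_{in},$$

where $V_{in}$ denotes the deterministic component of $U_{in}$, and $\varepsilon_{in}$ is the random component of $U_{in}$.

Here $V_{in}$ can be written as

$$V_{in} = \sum_{k=1}^{K} \beta_k x_{ink},$$

where $x_{ink}$ is attribute $k$ ($k = 1, \ldots, K$) for accident $n$ and alternative $i$, and $\beta_k$ is the estimable coefficient, which can be estimated by the Maximum Likelihood method.

Assuming $\varepsilon_{in}$ follows the Gumbel distribution, the choice probability of alternative 1 ($i = 1$) for accident $n$ is then

$$P_{1n} = \frac{\exp(V_{1n})}{\exp(V_{1n}) + \exp(V_{2n})} = \frac{1}{1 + \exp(V_{2n} - V_{1n})},$$

and the choice probability of alternative 2 ($i = 2$) for accident $n$ is

$$P_{2n} = 1 - P_{1n}.$$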
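As a small numerical illustration of the standard binary logit form, the sketch below computes the two choice probabilities from linear utilities. The coefficient and attribute values are made up, and the function name `binary_logit_probs` is hypothetical; in the study the coefficients are estimated by Maximum Likelihood.

```python
from math import exp


def binary_logit_probs(x1, x2, beta):
    """Choice probabilities of alternatives 1 and 2 under a binary logit model.

    x1, x2: attribute vectors of the two alternatives for one accident.
    beta: coefficient vector shared by both alternatives.
    """
    v1 = sum(b * x for b, x in zip(beta, x1))  # deterministic utility V_1n
    v2 = sum(b * x for b, x in zip(beta, x2))  # deterministic utility V_2n
    p1 = 1.0 / (1.0 + exp(v2 - v1))
    return p1, 1.0 - p1


# Made-up example: one attribute, coefficient 2.0.
p1, p2 = binary_logit_probs(x1=[1.0], x2=[0.0], beta=[2.0])
```

Only the utility difference matters: when the two deterministic utilities are equal, both probabilities are 0.5.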

4.2.2. Ordered Probit Model

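The Ordered multiple choice model assumes the relationship

$$\sum_{j'=1}^{j} P_n(j') = F(\alpha_j + \beta' X_n, \lambda), \quad j = 1, \ldots, J - 1,$$

where $P_n(j)$ is the probability that alternative $j$ happens in accident $n$ ($j = 1, \ldots, J$), $\alpha_j$ is an alternative specific constant, $X_n$ is a vector of the attributes of accident $n$, $\beta$ is a vector of the estimated coefficients, and $\lambda$ is a parameter that controls the shape of probability distribution $F$. Therefore, $F$ can take various shapes depending on the value of $\lambda$.

The Ordered Probit model, which assumes a standard normal distribution for $F$, is the most commonly used ordered multiple choice model [21]. The Ordered Probit model has the following form:

$$P_n(1) = \Phi(\alpha_1 + \beta' X_n),$$
$$P_n(j) = \Phi(\alpha_j + \beta' X_n) - \Phi(\alpha_{j-1} + \beta' X_n), \quad j = 2, \ldots, J - 1,$$
$$P_n(J) = 1 - \Phi(\alpha_{J-1} + \beta' X_n),$$

where $\Phi$ is the cumulative standard normal distribution function. For all the probabilities to be positive, the constants must satisfy $\alpha_1 < \alpha_2 < \cdots < \alpha_{J-1}$.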
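The category probabilities of an ordered probit model can be computed as differences of the cumulative standard normal function evaluated at successive constants. The sketch below assumes the additive convention in which each constant is added to the linear predictor; some texts instead subtract the linear predictor from a threshold, which flips the sign of the coefficients. All names and numerical values are illustrative.

```python
from math import erf, sqrt


def std_normal_cdf(z):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def ordered_probit_probs(x, beta, constants):
    """Probabilities of J ordered categories given J-1 increasing constants.

    Assumed convention: cumulative probability = Phi(constant + beta . x).
    """
    if any(a >= b for a, b in zip(constants, constants[1:])):
        raise ValueError("constants must be strictly increasing")
    xb = sum(b * v for b, v in zip(beta, x))
    cumulative = [std_normal_cdf(a + xb) for a in constants] + [1.0]
    return [cumulative[0]] + [
        cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))
    ]


# Made-up example: three categories, zero linear predictor.
probs = ordered_probit_probs(x=[0.0], beta=[0.0], constants=[-1.0, 1.0])
```

The check on the constants mirrors the ordering condition: if they were not strictly increasing, at least one category would receive a probability that is zero or negative.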

4.2.3. Estimation Results

The Binary Logit model and the Ordered Probit models are estimated with the LOGISTIC and PROBIT procedures in SAS [22], and the results are shown in Table 5.

5. Discussions

By comparing the test results of MAPE and Hit ratio with respect to the predictions of the three severity indicators, it can be concluded that the goodness of fit of Bayesian network is higher than that of Regression models. This suggests that Bayesian network is more suitable in accident severity prediction than Regression models regarding modeling accuracy.

Besides goodness of fit, the Bayesian network and the Regression models also differ in how the variables in the model interact. In the Bayesian network, indirect nodes (or variables), which are related to the dependent variable, affect their own child nodes first, and then the impacts are delivered along the related edges and nodes until they reach the dependent variable [23, 24]. As shown in Figure 3(1), a change in any indirect node causes a change in the dependent variable. The impact of an indirect node on the dependent variable can be obtained by inference based on the constructed Bayesian network. For instance, the impacts of L-C on the three accident severity variables are inferred according to the Bayesian network for accident severity analysis, and the results are shown in Table 6. In contrast, in the Regression models all the independent variables, whether they are indirect nodes, child nodes, or direct nodes in the Bayesian network, affect the dependent variable directly [25, 26]. The interactions between variables in the Regression models are shown in Figure 3(2).
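The way an indirect node influences the dependent variable through an intermediate node can be illustrated with a three-node chain A → B → Y, where the effect of A on Y is obtained by summing over the states of B. The sketch below is a generic illustration with made-up probabilities. It does not reproduce the network in Figure 2 or the values in Table 6.

```python
def propagate(p_b_given_a, p_y_given_b):
    """P(Y | A=a) = sum over b of P(B=b | A=a) * P(Y | B=b), for A -> B -> Y.

    p_b_given_a: distribution of B for one fixed state of A.
    p_y_given_b: list of distributions of Y, one per state of B.
    """
    n_y = len(p_y_given_b[0])
    return [
        sum(pb * p_y_given_b[b][y] for b, pb in enumerate(p_b_given_a))
        for y in range(n_y)
    ]


# Made-up conditional probabilities.
p_y_given_b = [[0.9, 0.1], [0.6, 0.4]]
p_y_when_a0 = propagate([0.8, 0.2], p_y_given_b)
p_y_when_a1 = propagate([0.3, 0.7], p_y_given_b)
```

Changing the state of A shifts the distribution of B, which in turn shifts the distribution of Y, even though no edge connects A and Y directly.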

Moreover, for Regression models, two independent variables cannot exist in one model if they are related to each other, so some influences between variables are lost. Also, Regression models fail to capture the impact of one dependent variable on another, as well as the interaction between independent variables, such as the impact of Noi on Nof and the effect of Rsc on Bti in this study, respectively, both of which are represented by the Bayesian network shown in Figure 2.

Furthermore, as mentioned above, most Regression models have their own assumptions and predefined underlying relationships between dependent and independent variables (e.g., linear relations between the variables or independence between variables) [7, 8]. The differences between Regression models and the Bayesian network are also reflected in the methods of probabilistic reasoning [27–29]. That is, the Bayesian network can reason under uncertainty, but Regression models cannot. In addition, for parameter estimation, Regression models need complete (without missing values) and quantitative data, while a Bayesian network can be constructed with incomplete data or qualitative information [30, 31].

The above characteristics indicate that, compared with Regression models, the Bayesian network is more suitable for accident severity analysis.

6. Conclusions

In this paper, two modeling techniques, that is, the Bayesian network and Regression models, are investigated in accident severity modeling. The goodness of fit of the two methods is compared according to the test results, and the differences between the two methods are analyzed. The results suggest that, compared with Regression models, the Bayesian network is more suitable for accident severity prediction.

The study results can be applied to predicting traffic accident severity and identifying the key effects of contributing factors on accident severity. By comparing the Bayesian network and Regression models, the study also makes a methodological contribution to enhancing the prediction accuracy of severity estimation.

It should be pointed out that both the structure and the parameters of the proposed Bayesian network will change once a certain number of newly reported cases is added to the data set. According to the study by Zhang [32], the structure and the parameters of the Bayesian network will change when the number of new records reaches 10% and 5% of the number of original cases, respectively.

One limitation of the current work is that some factors with potential effects on accident severity, such as driver characteristics and traffic condition, are not considered because of the lack of suitable data. Further study should be conducted to examine the impacts of these factors on accident severity.

Acknowledgments

The research is funded by the National Natural Science Foundation of China (50908099 and 51078167) and the Doctoral Program of Higher Education of China (201104493).