#### Abstract

Statistical models for estimating the safety status of transportation facilities have received great attention in the last two decades. These models play an important role in transportation safety planning and in diagnosing locations with high accident risks. However, current methods largely rely on regression analyses and may therefore ignore the multicollinearity among factors, which could provide additional information for enhancing the performance of forecasting models. This study seeks to develop more precise models for forecasting safety status while addressing multicollinearity in the dataset. The proposed mathematical approach is a discriminant analysis whose goal is to minimize the Bayes risk given the multivariate distribution of the factors. Based on this model, numerical analyses are performed on a simulated dataset and on an empirically observed dataset of traffic accidents on road segments. These examples illustrate how Bayes risk minimization predicts the safety status of road segments with the objective of the smallest misclassification rate. The paper concludes with a discussion of the methodology and several important avenues for future studies.

#### 1. Introduction

Forecasting the safety status of transportation facilities is crucial in safety planning and in black spot diagnosis [1]. Past safety forecasting models have received great attention, and they largely rely on regression analyses of the relationship between accident counts and several important factors. For example, earlier approaches used Poisson or negative binomial regression models [2, 3] to treat accident frequencies as count data. More advanced models were later developed to account for additional statistical features such as zero-inflated frameworks [3], nonparametric specifications [4, 5], and multivariate distributions of responses [6, 7]. These models attempt to replicate accurately the distributional properties of aggregated crash data. However, they depend heavily on regression analyses, which assume that the forecasting factors are fixed, so the multicollinearity among factors is difficult to address. Because factors are random in nature and their stochastic relationships could provide additional information for enhancing the performance of forecasting models, it is necessary to incorporate the multivariate distributional characteristics of factors alongside the conventional methods. To this end, this study proposes a mathematical approach based on the concept of Bayes risk minimization [8–10] to predict the safety status of transportation facilities while taking the multicollinearity of factors into account.

The general idea of the proposed approach is to minimize the misclassification rate instead of maximizing the likelihood function of a regression model. In any forecasting model of road safety status, there is some probability that a road segment with high accident risk is classified as safe, and some probability that a safe road segment is identified as having high accident risk. These two misclassifications correspond to type I and type II errors in statistical theory. The mathematical objective is therefore to find the smallest value of a weighted combination of the two types of errors. This is a statistical decision problem with a discrete action space, in which the action is the categorization of safety status, and minimizing the risk of the decision problem yields an optimum partition scheme for prediction. Further, if a prior distribution describing the market share of the different categories of safety status is considered, the problem becomes one of Bayes risk minimization.

The aforementioned method is a type of risk analysis [11] and is also the foundation of discriminant analysis [12]. It has been applied in various areas of engineering, including face recognition [13, 14], chemometrics [15], and even fluid mechanics [16]. Yet this approach has not been applied to the analysis of transportation safety status, and hence it is valuable to introduce this mathematical framework to research in transportation safety engineering. In addition, the approach has advantages in terms of asymptotic relative efficiency compared with logistic regression models [17] if the distributional assumptions on the factors hold.

Accordingly, this study introduces a discriminant analysis framework based on Bayes risk minimization for forecasting the safety status of transportation facilities. First, the mathematical formulation of the proposed method is presented in the following section. Second, based on the model, numerical examples are given for a simulated dataset and an empirically observed dataset. Finally, the study is summarized and several directions for future research are provided.

#### 2. The Method of Bayes Risk Minimization

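Consider a general forecasting problem in which an individual or an outcome is classified into one of $k$ possible categories (populations), denoted by $C_1, \ldots, C_k$, and the set of factors is denoted by a vector $\mathbf{x}$. The classification regions for the categories are denoted by $R_1, \ldots, R_k$, respectively, and the collection of all classification regions is assumed to form a partition of the sample space of $\mathbf{x}$. A natural classification rule can therefore be expressed as follows: if $\mathbf{x} \in R_i$, then the corresponding observation is assigned to population $C_i$; otherwise it is assigned to one of the other categories. In the following analysis, $L(j \mid i)$ denotes the loss for classifying an observation that belongs to category $C_i$ into category $C_j$. The associated probability of such a classification is $P(j \mid i)$, defined in

$$P(j \mid i) = \int_{R_j} f_i(\mathbf{x})\, d\mathbf{x}, \tag{1}$$

where $f_i$ is the probability density function for population $C_i$ and the integral is the probability that an observation from population $C_i$ is classified into population $C_j$. Each classification region $R_j$ is one cell of the partition of the sample space and corresponds to one category. The optimum regions are those that partition the sample space such that the corresponding Bayes risk is minimized.

The definition of the Bayes risk for this problem is presented in

$$r = \sum_{i=1}^{k} p_i \sum_{j=1}^{k} L(j \mid i)\, P(j \mid i). \tag{2}$$

In the above definition, the Bayes risk measures the expected loss resulting from misclassifications, and $p_i$ is the market share (prior probability) of category $C_i$. The idea of adopting such prior information is to weight the expected losses from incoming observations by the overall proportion of each category. The partition criterion will therefore try to avoid misclassifying populations that have a large market share. Because minimizing the Bayes risk is equivalent to minimizing the posterior risk at every point $\mathbf{x}$, the optimum partition boundaries must separate the sample space such that the error rate reaches its minimum at each sample point. According to Bayes' theorem, the posterior probability that an observation $\mathbf{x}$ comes from $C_i$ is given in

$$P(C_i \mid \mathbf{x}) = \frac{p_i f_i(\mathbf{x})}{\sum_{l=1}^{k} p_l f_l(\mathbf{x})}. \tag{3}$$

Accordingly, the posterior risk is defined as the expected conditional loss and is expressed in (4). Note that the posterior risk depends on the specific value of $j$, the category into which the observation $\mathbf{x}$ is classified according to the partition rule:

$$\rho(j \mid \mathbf{x}) = \sum_{i=1}^{k} L(j \mid i)\, P(C_i \mid \mathbf{x}). \tag{4}$$

Therefore, for each observation $\mathbf{x}$, the decision rule selects the $j$ for which the above posterior risk is smallest. If $L(j \mid i)$ is defined as the regular 0-1 loss in

$$L(j \mid i) = \begin{cases} 0, & i = j, \\ 1, & i \neq j, \end{cases} \tag{5}$$

then the posterior risk simplifies to the form in

$$\rho(j \mid \mathbf{x}) = \sum_{i \neq j} P(C_i \mid \mathbf{x}) = 1 - P(C_j \mid \mathbf{x}). \tag{6}$$

Hence, the problem is equivalent to selecting $j$ such that $P(C_j \mid \mathbf{x})$ is maximized. In other words, the best classification rule selects the population that has the largest conditional probability given the observed value of $\mathbf{x}$. With this general formulation, the problem can be analyzed once the distribution of the factor vector is given, or once a distribution family is specified with its parameters left to be estimated. In this section, the measurement vector is assumed to follow a multivariate normal distribution with a homogeneous variance-covariance matrix across categories. As illustrated shortly, the resulting partition boundary is linear in $\mathbf{x}$, and this scenario is called linear discriminant analysis [18]. For a problem with $k$ categories, suppose that the $d$-dimensional vector $\mathbf{x}$ follows the multivariate normal distribution $N_d(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$ in category $C_i$ for all $i = 1, \ldots, k$. The probability density function of $\mathbf{x}$ in category $C_i$ is given by

$$f_i(\mathbf{x}) = (2\pi)^{-d/2}\, |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right\}. \tag{7}$$

Based on Bayes' theorem, the posterior is given by

$$P(C_i \mid \mathbf{x}) = \frac{p_i \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right\}}{\sum_{l=1}^{k} p_l \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_l)^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_l) \right\}}. \tag{8}$$

For the 0-1 loss function, the decision is made by choosing the largest $P(C_i \mid \mathbf{x})$ at each $\mathbf{x}$, and the partition of the sample space can be formulated through all the pairwise classification boundaries. Specifically, the classification boundary between populations $C_i$ and $C_j$ can be described by balancing their odds in

$$\frac{P(C_i \mid \mathbf{x})}{P(C_j \mid \mathbf{x})} = \frac{p_i f_i(\mathbf{x})}{p_j f_j(\mathbf{x})} = 1, \tag{9}$$

or, for convenience, through the log odds in (10):

$$\log \frac{P(C_i \mid \mathbf{x})}{P(C_j \mid \mathbf{x})} = \left[ (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{x} - \tfrac{1}{2} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^{\top} \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) \right] + \log \frac{p_i}{p_j}. \tag{10}$$

The first (bracketed) term on the right-hand side of (10) is called the discriminant function [19]. In order to minimize the Bayes risk, the classification boundary between $C_i$ and $C_j$ is given by

$$(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{x} - \tfrac{1}{2} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^{\top} \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) + \log \frac{p_i}{p_j} = 0. \tag{11}$$

Since the discriminant function [20] is linear in $\mathbf{x}$, the pairwise decision boundary is a hyperplane in the sample space of $\mathbf{x}$. Even though there are $k(k-1)/2$ pairwise classifications corresponding to the combinations of $i$ and $j$, only $k-1$ of them are nonredundant. In practice, applications can use the maximum likelihood estimates [21] of the parameters $p_i$, $\boldsymbol{\mu}_i$, and $\boldsymbol{\Sigma}$. With $n_i$ observations $\mathbf{x}_{i1}, \ldots, \mathbf{x}_{in_i}$ from category $C_i$ and $n = \sum_{i} n_i$, the prior $p_i$ is estimated by its group proportion and

$$\hat{p}_i = \frac{n_i}{n}, \qquad \hat{\boldsymbol{\mu}}_i = \frac{1}{n_i} \sum_{m=1}^{n_i} \mathbf{x}_{im}, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{i=1}^{k} \sum_{m=1}^{n_i} (\mathbf{x}_{im} - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_{im} - \hat{\boldsymbol{\mu}}_i)^{\top}. \tag{12}$$

Substituting these estimates, the estimated classification boundary based on the log odds is

$$(\hat{\boldsymbol{\mu}}_i - \hat{\boldsymbol{\mu}}_j)^{\top} \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{x} - \tfrac{1}{2} (\hat{\boldsymbol{\mu}}_i - \hat{\boldsymbol{\mu}}_j)^{\top} \hat{\boldsymbol{\Sigma}}^{-1} (\hat{\boldsymbol{\mu}}_i + \hat{\boldsymbol{\mu}}_j) + \log \frac{\hat{p}_i}{\hat{p}_j} = 0. \tag{13}$$

If only two categories are considered, the above analysis reduces to the two-population problem and the decision boundary is a single hyperplane. In the two-population case, the estimated discriminant function is also proportional to the well-known Fisher discriminant function [19]. Note that the Bayes risk has a different meaning from accident risk: it measures the risk of making an incorrect estimate, here a misclassification, rather than the chance or severity of an accident.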
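The estimation and classification steps described in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `fit_lda` and `predict_lda` are hypothetical, and the sketch assumes the 0-1 loss and a covariance matrix shared by all categories, so that choosing the largest posterior is the same as choosing the largest linear score.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, class means, and the pooled ML covariance matrix."""
    classes = np.unique(y)
    n, d = X.shape
    priors = np.array([np.mean(y == c) for c in classes])   # group proportions
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    pooled = np.zeros((d, d))
    for c, mu in zip(classes, means):
        centered = X[y == c] - mu
        pooled += centered.T @ centered
    pooled /= n  # maximum likelihood estimate: divide by n, not n - k
    return classes, priors, means, pooled

def predict_lda(X, classes, priors, means, pooled):
    """Assign each row of X to the class with the largest posterior."""
    # Column i of A is pooled^{-1} mu_i; solve() avoids forming the inverse.
    A = np.linalg.solve(pooled, means.T)
    # Linear score: x' S^-1 mu_i - 0.5 mu_i' S^-1 mu_i + log p_i
    scores = X @ A - 0.5 * np.sum(means.T * A, axis=0) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]
```

The difference between the scores of two categories equals their log odds, so the set of points where two scores tie is the pairwise hyperplane boundary.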

#### 3. Numerical Examples

With the mathematical formulation of the forecasting problem under minimized Bayes risk specified above, this section first presents a numerical example on simulated samples and then one on an empirically observed dataset. In order to illustrate the estimated partition of the sample space, a bivariate sample space is adopted, and a bivariate normal distribution is assumed for each of the two populations. Equation (14) gives the predefined parameters for the simulation, namely, the mean vectors and the variance-covariance matrix of the two populations.

Note that the two populations are given different mean vectors, whereas a homogeneous variance-covariance matrix is used for both. Under this assumption, the partition boundary is a straight line [22] that separates the sample space and also provides the prediction criterion for future cases. However, if the equal covariance assumption fails [22], the pairwise log odds become a quadratic function of the measurement vector. The decision boundary is then quadratic in the measurement vector, and the corresponding analysis is called quadratic discriminant analysis. This study is a preliminary step in this field, and methods under looser assumptions on the covariance matrix are left as a direction for future research.
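To see where the quadratic term comes from, the pairwise log odds can be written out when each category has its own covariance matrix. The notation below is an assumption made for this illustration: $C_i$ denotes category $i$, with prior $p_i$, mean vector $\boldsymbol{\mu}_i$, and covariance matrix $\boldsymbol{\Sigma}_i$, and $\mathbf{x}$ is the measurement vector.

```latex
\log \frac{P(C_i \mid \mathbf{x})}{P(C_j \mid \mathbf{x})}
  = -\tfrac{1}{2}\, \mathbf{x}^{\top}
      \left( \boldsymbol{\Sigma}_i^{-1} - \boldsymbol{\Sigma}_j^{-1} \right) \mathbf{x}
    + \left( \boldsymbol{\mu}_i^{\top} \boldsymbol{\Sigma}_i^{-1}
           - \boldsymbol{\mu}_j^{\top} \boldsymbol{\Sigma}_j^{-1} \right) \mathbf{x}
    - \tfrac{1}{2} \left( \boldsymbol{\mu}_i^{\top} \boldsymbol{\Sigma}_i^{-1} \boldsymbol{\mu}_i
                        - \boldsymbol{\mu}_j^{\top} \boldsymbol{\Sigma}_j^{-1} \boldsymbol{\mu}_j \right)
    - \tfrac{1}{2} \log \frac{|\boldsymbol{\Sigma}_i|}{|\boldsymbol{\Sigma}_j|}
    + \log \frac{p_i}{p_j}
```

When $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}_j$, the quadratic term and the log-determinant term vanish and the expression reduces to the linear case.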

With these parameters, 200 samples are simulated for each population. Figure 1 shows the scatter plot of the two samples, which are represented by different colors; the dashed line is the estimated discriminant boundary, which partitions the sample space into two parts under the minimized Bayes risk.
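A simulation of this kind can be sketched as follows. The mean vectors and covariance matrix in the code are hypothetical values chosen only for illustration, not the parameters of Equation (14), and the function name `simulate_and_classify` is likewise an assumption. The sketch draws 200 points per population, estimates the boundary by maximum likelihood, and reports the in-sample misclassification rate.

```python
import numpy as np

def simulate_and_classify(seed=0, n_per_group=200):
    """Simulate two bivariate normal groups and fit a linear boundary."""
    rng = np.random.default_rng(seed)
    # Hypothetical parameters (illustration only): different means,
    # one shared variance-covariance matrix.
    mu1 = np.array([0.0, 0.0])
    mu2 = np.array([2.0, 2.0])
    cov = np.array([[1.0, 0.5], [0.5, 1.0]])
    X1 = rng.multivariate_normal(mu1, cov, size=n_per_group)
    X2 = rng.multivariate_normal(mu2, cov, size=n_per_group)

    # Maximum likelihood estimates: group means and pooled covariance.
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    centered = np.vstack([X1 - m1, X2 - m2])
    pooled = centered.T @ centered / len(centered)

    # Boundary w @ x + b = 0; group sizes are equal, so the log prior
    # ratio is zero. Positive scores are assigned to group 1.
    w = np.linalg.solve(pooled, m1 - m2)
    b = -0.5 * w @ (m1 + m2)

    errors = np.sum(X1 @ w + b <= 0) + np.sum(X2 @ w + b > 0)
    return w, b, errors / (2 * n_per_group)
```

Calling `simulate_and_classify()` returns the boundary coefficients and the error rate; changing `mu2` in the sketch shows how the error responds to the separation of the two means.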

Because the two populations are random with overlapping supports, their samples also overlap with some probability. It is impossible to find a partition that perfectly separates the two populations, and hence misclassification always occurs because of the overlap. Because the examples use two populations with different mean vectors, the samples concentrate in different regions around their mean points, which provides effective information for future prediction. If the mean points are far from each other, the classification error is small, and vice versa. Thus, for two populations with nearby mean points, no partition can reduce the error rate substantially, because the populations lack distinct features for identification. Consequently, in empirical analyses it is important to choose features that provide effective information for distinguishing the populations.
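The link between mean separation and error can be made concrete. For two normal populations with a common covariance matrix and equal priors, a standard result is that the minimum misclassification rate equals the standard normal distribution function evaluated at minus half the Mahalanobis distance between the means. The sketch below computes this quantity; the function names are hypothetical and the equal-prior assumption is specific to this illustration.

```python
import math
import numpy as np

def mahalanobis_distance(mu1, mu2, cov):
    """Distance between two mean vectors measured in units of cov."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def bayes_error_equal_priors(mu1, mu2, cov):
    """Minimum error rate for two normal groups, shared cov, equal priors."""
    delta = mahalanobis_distance(mu1, mu2, cov)
    # Standard normal CDF at -delta/2, via the complementary error function.
    return 0.5 * math.erfc(delta / (2.0 * math.sqrt(2.0)))
```

With coinciding means the distance is zero and the error is one half, which is no better than guessing; the error then decreases monotonically as the means move apart.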

The remainder of this section presents an empirical analysis based on an observed crash dataset for road segments. The data were collected in the Pikes Peak area, Colorado, USA, from July 2006 to December 2010. In this dataset, the forecasting subject is the safety status of road segments, which is categorized simply as safe or unsafe to reflect the level of accident risk. In general, accident risk measures the chance of being involved in an accident or the magnitude of harm resulting from accidents. For example, crash counts, crash rates, and equivalent property damage only crashes are all representations of risk. In this study, the occurrence of at least one accident is adopted as the measure of accident risk that defines the safety status of a road segment. Past experience has shown that two variables, annual average daily traffic (AADT) and road segment length, may have a significant impact on the occurrence of accidents. This study therefore adopts these factors to predict the safety status of road segments.

Equation (15) gives the parameters estimated from the dataset by maximum likelihood estimation. The two estimated mean vectors correspond to the safe group and the unsafe group, respectively. In each mean vector, the first element corresponds to AADT and the second element corresponds to segment length. Since the true parameters are unknown, the following analysis is based on the parameters estimated from the samples of safe and unsafe road segments. As before, the estimated mean points are allowed to differ while the variance-covariance matrix is assumed to be the same. Moreover, each mean point is estimated from the sample of its own population, whereas all samples are used in estimating the variance-covariance matrix. Equation (12) provides the mathematical formulation for estimating these parameters.

For numerical computation, the only concern is to make sure that the estimated variance-covariance matrix is invertible, which is usually the case in practice. Figure 2 shows the partition together with the locations of all samples of safe and unsafe roads. Even though the partition line provides a way to classify a road segment into one category of safety status, it may not be optimal, because the assumption of bivariate normal distributions for the entire population made before the analysis may not hold. Since the analysis is based on empirically observed data, the multivariate distribution of the factors may not closely match the initial assumptions. It is therefore necessary to improve the current partition scheme by considering more realistic distributional assumptions. Inspection of the sample points shows that roads with shorter segment lengths are distributed differently from roads with longer segment lengths: the former exhibit a different distributional shape and are concentrated around a different mean point.

It is therefore necessary to account for this phenomenon. This analysis accordingly proposes a strategy in which samples with different behaviors are treated separately for partitioning purposes. Two independent partition analyses are conducted: one on the samples with road length greater than two miles and one on the remaining samples with road length less than two miles. Figure 3 shows the improved partition line over the entire sample space based on the results of the two individual classification problems. The resulting broken line serves as the boundary for classifying a future observation. Other solutions to this issue may also exist. For example, a better method could optimize the separation of populations while accounting for the different behaviors of the distributions; methods of this type are left as a direction for future research.
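The two-part strategy can be sketched as two independent two-class fits, one on each side of the length threshold. This is only an illustrative sketch under stated assumptions: the function names and the column layout (segment length in column 1) are hypothetical, labels are coded 0 for safe and 1 for unsafe, and each subset is assumed to contain both classes with an invertible pooled covariance matrix.

```python
import numpy as np

def fit_two_class_lda(X, y):
    """Return (w, b) such that w @ x + b > 0 predicts class 1 (unsafe)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    centered = np.vstack([X0 - mu0, X1 - mu1])
    pooled = centered.T @ centered / len(X)        # pooled ML covariance
    w = np.linalg.solve(pooled, mu1 - mu0)
    b = -0.5 * w @ (mu1 + mu0) + np.log(len(X1) / len(X0))
    return w, b

def fit_piecewise(X, y, length_col=1, threshold=2.0):
    """Fit separate boundaries for long and short road segments."""
    long_mask = X[:, length_col] > threshold
    return {
        "length_col": length_col,
        "threshold": threshold,
        "long": fit_two_class_lda(X[long_mask], y[long_mask]),
        "short": fit_two_class_lda(X[~long_mask], y[~long_mask]),
    }

def predict_piecewise(model, X):
    """Classify each row with the boundary fitted for its length range."""
    long_mask = X[:, model["length_col"]] > model["threshold"]
    pred = np.empty(len(X), dtype=int)
    for mask, key in ((long_mask, "long"), (~long_mask, "short")):
        w, b = model[key]
        pred[mask] = (X[mask] @ w + b > 0).astype(int)
    return pred
```

Joining the two fitted lines at the threshold gives a broken-line boundary of the kind described above.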

#### 4. Discussion and Conclusions

Mathematical approaches based on the minimization of Bayes risk are effective for classifying an unknown outcome into one of several possible categories with the smallest misclassification rate. This is a widely used technique that has been applied in various areas. However, it is rarely mentioned in past studies on forecasting the safety status of transportation facilities, which largely rely on the regression framework. Because regression treats the explanatory variables as given without considering their stochastic behavior, the multicollinearity among factors receives little consideration. The occurrence of an accident is the result of complicated processes controlled by many factors, including road geometric design, traffic flow characteristics, and driver characteristics. These factors are highly correlated in nature, so it is reasonable and safe to assume that multicollinearity among factors exists in such analyses. Under this circumstance, this study introduces a discriminant analysis framework as a tool for classifying transportation facilities into categories defined by different levels of safety. Instead of maximizing the conditional likelihood of the responses, the method minimizes the Bayes risk of the classification problem. Under this specification, the study presents numerical examples on a simulated dataset and on an empirically observed dataset. The results show that the approach provides a flexible framework for predicting traffic safety status in transportation engineering. As this is preliminary research in this field, the following paragraph suggests several directions for future studies.

First, it is necessary to analyze loss functions that assign nonuniform weights to the different types of errors, reflecting the tradeoff between the additional accident loss incurred by misclassifying an unsafe road as safe and the cost of improving a safe road that was misclassified as unsafe. Second, the variance-covariance matrix may differ among populations with different safety status, and future studies may take this into account; with more flexible parameters and looser assumptions, the partition may provide more realistic and accurate classification boundaries. Third, the actual distribution of the factors might be better reflected by other types of multivariate formulation, so other multivariate distributions, and even nonparametric specifications, should be examined in future studies. Finally, it would be interesting to compare the prediction results with those of traditional regression analysis on simulated or empirically observed datasets to further understand the statistical characteristics of transportation safety datasets.
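As a small illustration of the first direction, with two categories and unequal losses the minimum posterior risk rule only shifts the log-odds threshold away from zero. The sketch below assumes the log odds of unsafe against safe have already been computed for a segment; the function name and the two cost arguments are hypothetical.

```python
import math

def classify_with_costs(log_odds_unsafe, cost_miss_unsafe, cost_false_alarm):
    """Pick the label with the smaller posterior expected loss.

    log_odds_unsafe:  log of P(unsafe | x) / P(safe | x)
    cost_miss_unsafe: loss for labelling a truly unsafe segment as safe
    cost_false_alarm: loss for labelling a truly safe segment as unsafe
    """
    # Calling the segment unsafe is cheaper when
    # cost_false_alarm * P(safe | x) < cost_miss_unsafe * P(unsafe | x).
    threshold = math.log(cost_false_alarm / cost_miss_unsafe)
    return "unsafe" if log_odds_unsafe > threshold else "safe"
```

With equal costs the threshold is zero and the rule coincides with the 0-1 loss case; a larger cost for missing an unsafe segment lowers the threshold, so more segments are flagged as unsafe.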

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

This research was supported by the National Natural Science Foundation of China (nos. 51208032 and 71210001) and Fundamental Research Funds for the Central Universities (no. 2013JBM008).