Abstract

Predicting traffic incident duration is important for effective, real-time traffic incident management (TIM), which helps to minimize traffic congestion, environmental pollution, and secondary incidents. Traffic incident duration prediction methods often use more input variables to obtain better prediction results. However, two problems are ignored to some extent: the variables available at the beginning of an incident are limited, and the significant variables must be selected. In this paper, a novel prediction method named NCA-BOA-RF is proposed, which combines Neighborhood Components Analysis (NCA) with a Random Forest (RF) model optimized by the Bayesian Optimization Algorithm (BOA). Firstly, NCA is applied to select feature variables for traffic incident duration. Then, the RF model is trained on the training set constructed from the feature variables, and the BOA is employed to optimize the RF parameters. Finally, the confusion matrix is introduced to measure the performance of the optimized RF model and to compare it with other methods. In addition, the performance is tested in the absence of some feature variables. The results demonstrate that the proposed method not only has high accuracy but also exhibits excellent reliability and robustness.

1. Introduction

Traffic incidents such as vehicle crashes, fires, road maintenance, debris, and police activities are still very common, random, and dangerous. Traffic incidents can reduce road capacity because of lane closures, which results in traffic congestion and delays [1]. The National Traffic Accident Management Association estimates that 25% of the congestion on American roads is caused by traffic incidents [2]. Traffic incidents are the main cause of nonrecurrent congestion on urban expressways and urban arterial roads [3]. In addition, traffic congestion and travel delay can increase the likelihood of secondary incidents [4, 5]. For more than two decades, many cities around the world have established traffic management centers and deployed various traffic incident management systems to decrease traffic incidents and alleviate the related congestion. Managing traffic flow and providing travelers with timely and accurate information during the incident clearance period are the two main aspects of efficient traffic incident response [6]. The response strategy during the clearance period depends to a large extent on the duration of the incident. Therefore, accurate prediction of traffic incident duration has attracted the attention of researchers.

For decades, researchers have put forward many effective methods for predicting traffic incident duration. The data, variables, and algorithms used in these methods usually differ. From the algorithm point of view, early incident duration prediction methods include probability distribution models [7], regression prediction methods [8], and time series methods [9]. They have the advantage of being easily understood, well-established methodologies with a long history of application, available software, and deep-rooted acceptance. However, these methods often depend on certain assumptions, which limits their generalization.

In recent years, the application of hazard/risk-based methods and machine learning methods for traffic incident duration prediction has become increasingly widespread. The hazard/risk-based method uses a hazard function to predict traffic incident duration. The risk function is essentially a conditional probability function, which can be used to analyze the probability that a traffic incident that has lasted t minutes ends in the kth minute. In 2013, parametric accelerated failure time (AFT) survival models of traffic incident duration, including log-logistic, lognormal, and Weibull models with both fixed and random parameters, were developed by Hojati et al. [10] in order to better apply to different types of traffic incidents. In 2015, Li et al. [3] developed a competing risks mixture model to investigate the influence of clearance methods and various factors on traffic incident duration and to predict traffic incident duration. Three candidate distributions (generalized gamma, Weibull, and log-logistic) were tested to determine the most appropriate probability density function of the parametric survival analysis model. Subsequently, Li et al. [11] proposed a sequential prediction method for traffic incident duration based on a competing risk mixture model and text analysis. Text analysis was used to process the textual features of the traffic incident and extract time-dependent topics. The empirical results demonstrate that the developed mixture model outperforms the non-mixture model. In 2017, AFT hazard-based models were developed with different underlying probability distributions of the hazard function to predict traffic incident duration [12]. That study indicates that the hazard-based model with a gamma distribution and a time variable is the best model for the four duration stages (preparation, travel, clearance, and the total duration of the incident), and that different parameters and variables are appropriate for modeling the different duration stages of traffic incidents.

As a typical machine learning method, decision trees have drawn extensive attention due to their excellent performance. Many methods based on decision trees have been proposed for traffic incident duration prediction, such as Bayesian decision trees [13], classification trees [14], the M5P tree algorithm [15], and gradient boosting decision trees [16].

Besides decision tree-based methods, other machine learning methods for traffic incident duration have also made a number of achievements. In 2014, a model for traffic incident duration prediction was developed using an adaptive Bayesian network, which is more adaptable to the future environment than the traditional Bayesian model [17]. In 2015, Wang et al. [18] proposed a prediction method based on a nonparametric regression model whose core algorithm is K Nearest Neighbor (KNN). In the process of modeling, the distribution of incidents duration is taken into account. It is pointed out that the logarithmic transformation of duration will achieve better prediction results. In 2016, Yu et al. [19] compared the performance of artificial neural network (ANN) and support vector machine (SVM) for predicting traffic incident duration. The results show that both ANN and SVM are able to predict the incident duration within an acceptable range. When predicting the longer duration, the ANN model has better performance. On the whole, the overall performance of the SVM model is better than that of the traditional ANN model. In the same year, Park et al. [20] proposed a continuous time prediction method based on Bayesian neural network and quantified the importance of continuous time input variables by connecting weights to improve the interpretability of the algorithm. In 2017, two nonparametric machine learning methods, including the k-nearest neighbor method and artificial neural network method, were used to develop incident duration prediction models [21]. Based on the performance comparison results, an artificial neural network model can provide good and reasonable prediction for traffic incident duration prediction with mean absolute percentage error values less than 30%, which are better than the prediction results of a k-nearest neighbor model. 
For more information, see the review of traffic incident duration prediction in [22], which mainly covers the different phases of incident duration, data resources, and the various methods applied in influence factor analysis and duration prediction.

In summary, both hazard/risk-based methods and machine learning methods have advantages and disadvantages. The form of the results obtained by hazard/risk-based methods often better meets the needs of traffic managers, but it is necessary to assume that traffic incident duration obeys one or more probability distributions. Machine learning methods are not limited by such assumptions and can directly extract rules from the data set, which gives them better applicability. However, the interpretability of such methods is usually poor. With the expansion of data scale and the improvement of computer performance, machine learning methods have broader prospects for traffic incident duration prediction. At present, most methods use as many relevant variables as possible in order to achieve better prediction results. However, an important fact is ignored: relevant information is often lacking at the beginning of a traffic incident. Therefore, it is necessary to select relevant variables and to explore more robust methods for traffic incident duration prediction using incomplete variables.

To tackle the shortcoming as mentioned above, a novel method named NCA-BOA-RF is proposed for traffic incident duration prediction. As a well-known machine learning algorithm, Random Forest (RF) is chosen as the basic algorithm because of its excellent performance of regression and classification. Firstly, based on the analysis of the influencing factors of traffic incident duration and considering the characteristics of the data sets used, 18 influencing factors were selected as the relevant variables of traffic incident duration. Then, feature weights of the relevant variables are calculated by Neighborhood Components Analysis (NCA), and feature variables of traffic incident duration are determined by the feature weights. Finally, the training set is constructed using the feature variables for training RF, and the Bayesian Optimization Algorithm (BOA) is used to optimize the parameters of RF.

The main contributions of this paper are as follows. (a) NCA is used to calculate the feature weights of the relevant variables in order to determine the feature variables. (b) The cross-validation method is used to optimize the regularization parameter of NCA to ensure its best performance. (c) BOA is used to optimize two parameters of RF at the same time, rather than a single parameter. (d) The case in which some feature variables are unavailable is considered when testing the performance of the proposed method.

The remainder of this paper is organized as follows. Section 2 elaborates the methodology of the proposed model. Section 3 presents the experimental result and discussion. Finally, the conclusions and future study are summarized in Section 4.

2. Methodology

In this section, the basic methods relevant to the proposed model named NCA-BOA-RF are briefly introduced, including Neighborhood Components Analysis (NCA), the Bayesian Optimization Algorithm (BOA), and Random Forest (RF). Then, the main steps of the NCA-BOA-RF model are given. In the NCA-BOA-RF model, RF is the basic method for traffic incident duration prediction, feature variables are extracted from the relevant variables of traffic incident duration using NCA, and BOA is utilized to optimize two parameters of RF at the same time.

2.1. Neighborhood Components Analysis

Feature selection is very important in data mining and machine learning, especially for high dimensional data. Proper feature selection not only reduces the dimension of the features but also improves the generalization performance and execution speed of algorithms [23, 24]. Many factors are considered to be related to traffic incident duration. Usually, these factors are transformed into relevant variables that serve as the prediction model input, such as incident type, number of casualties, number of closed lanes, and weather conditions. However, using more variables as model input does not necessarily yield better prediction results. Moreover, in the initial stage of an incident, the available data are often very limited (one or more variables may be missing), which reduces the performance of the prediction model. Therefore, it is important and necessary to select feature variables for traffic incident duration prediction. In this study, Neighborhood Components Analysis (NCA) is used as a feature selection method to select the feature variables of traffic incident duration. NCA is an algorithm that learns a Mahalanobis distance metric for the supervised k-nearest neighbor (KNN) algorithm by minimizing the leave-one-out (LOO) classification error on the training data [25]. In 2012, NCA-based feature selection was proposed, which learns a feature weighting vector by maximizing the expected LOO classification accuracy with a regularization term [26]. Principal Component Analysis (PCA) and Sequential Feature Selection (SFS) are classic and popular methods for feature selection. However, PCA may lose information when the data are mapped to lower dimensions, and SFS may not be able to remove features that become useless after other features are added [27]. In contrast, NCA is not subject to data conditions (e.g., dimension and distribution) and has the advantage that no information is lost during the dimension reduction process.

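Let $S = \{(x_i, y_i),\ i = 1, 2, \ldots, N\}$ be a labelled training data set, where $x_i \in \mathbb{R}^{p}$ are the feature vectors, $y_i$ are the corresponding labels, and $N$ is the number of observations.

A Mahalanobis distance between two observations $x_i$ and $x_j$, expressed in terms of the weighting vector $w$, is defined as follows [26]:

$$d_w(x_i, x_j) = \sum_{l=1}^{p} w_l^{2} \left| x_{il} - x_{jl} \right|,$$

where $w_l$ is the $l$th feature weight. LOO is considered to maximize classification accuracy on the training data set $S$. A reference point is randomly selected from $S$, and the observation is labelled accordingly. Given a data point $x_i$, the probability of drawing $x_j$ as the reference point of $x_i$ is defined by [26]

$$p_{ij} = \begin{cases} \dfrac{\kappa\left(d_w(x_i, x_j)\right)}{\sum_{k \neq i} \kappa\left(d_w(x_i, x_k)\right)}, & i \neq j, \\ 0, & i = j, \end{cases}$$

where $\kappa(z) = \exp(-z/\sigma)$ is a kernel function and $\sigma$ is the kernel width that influences the probability of each point being selected as the reference point. The average probability of LOO correct classification is the probability $p_i$ that the randomized classifier correctly classifies observation $i$, which can be expressed as

$$p_i = \sum_{j} y_{ij}\, p_{ij},$$

where

$$y_{ij} = \begin{cases} 1, & y_i = y_j, \\ 0, & \text{otherwise}. \end{cases}$$

Then, the approximate LOO classification accuracy can be calculated as follows [26]:

$$\xi(w) = \frac{1}{N} \sum_{i=1}^{N} p_i = \frac{1}{N} \sum_{i=1}^{N} \sum_{j} y_{ij}\, p_{ij}.$$

The goal of NCA is to maximize the objective function $F(w)$ associated with $w$ using the regularized term $\lambda$:

$$F(w) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j} y_{ij}\, p_{ij} - \lambda \sum_{l=1}^{p} w_l^{2},$$

where $\lambda$ is a regularized term that can be optimized using cross-validation, which balances the first term of maximization of NCA probability and the second term of minimization of the Frobenius norm.

Because the objective function $F(w)$ is differentiable, its derivative with respect to $w_l$ can be expressed as

$$\frac{\partial F(w)}{\partial w_l} = 2 w_l \left[ \frac{1}{N \sigma} \sum_{i=1}^{N} \left( p_i \sum_{j \neq i} p_{ij} \left| x_{il} - x_{jl} \right| - \sum_{j} y_{ij}\, p_{ij} \left| x_{il} - x_{jl} \right| \right) - \lambda \right].$$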

According to the above, the corresponding gradient ascent update equation can be obtained, and the problem of maximizing the objective function can be solved via the gradient ascent method. More details about NCA for feature selection can be found in [26].
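As an illustration of NCA-based feature weighting, the following minimal sketch implements a regularized mean LOO probability and its gradient in plain Python and maximizes it by fixed-step gradient ascent. It is a sketch under stated assumptions rather than the implementation used in this study: the exponential kernel, the 1/N normalization, the step size, the iteration count, the regularization value, and the toy data are all illustrative choices.

```python
import math
import random


def nca_objective_and_gradient(X, y, w, lam, sigma=1.0):
    """Regularized NCA objective (mean LOO probability minus penalty) and its gradient."""
    n, p = len(X), len(w)
    mean_prob = 0.0
    grad_sum = [0.0] * p
    for i in range(n):
        # Kernel value exp(-d_w(x_i, x_j) / sigma) for every candidate reference j != i.
        kernel = {}
        for j in range(n):
            if j != i:
                dist = sum(w[l] ** 2 * abs(X[i][l] - X[j][l]) for l in range(p))
                kernel[j] = math.exp(-dist / sigma)
        norm = sum(kernel.values())
        p_ij = {j: k / norm for j, k in kernel.items()}
        # Probability that observation i is classified correctly.
        p_i = sum(prob for j, prob in p_ij.items() if y[j] == y[i])
        mean_prob += p_i / n
        for l in range(p):
            all_term = sum(prob * abs(X[i][l] - X[j][l]) for j, prob in p_ij.items())
            same_term = sum(prob * abs(X[i][l] - X[j][l])
                            for j, prob in p_ij.items() if y[j] == y[i])
            grad_sum[l] += p_i * all_term - same_term
    objective = mean_prob - lam * sum(v * v for v in w)
    gradient = [2.0 * w[l] * (grad_sum[l] / (n * sigma) - lam) for l in range(p)]
    return objective, gradient


def nca_feature_weights(X, y, lam, sigma=1.0, step=0.5, iters=100):
    """Fixed-step gradient ascent on the objective (step and iters are assumptions)."""
    w = [1.0] * len(X[0])
    for _ in range(iters):
        _, grad = nca_objective_and_gradient(X, y, w, lam, sigma)
        w = [wl + step * gl for wl, gl in zip(w, grad)]
    return w


# Toy data (hypothetical): feature 0 separates the two classes, feature 1 is pure noise.
rng = random.Random(0)
y_toy = [i % 2 for i in range(30)]
X_toy = [[label + rng.gauss(0.0, 0.1), rng.gauss(0.0, 1.0)] for label in y_toy]
weights = nca_feature_weights(X_toy, y_toy, lam=0.05)
```

On such toy data the weight of the informative feature should stay well above that of the noise feature, which is the behaviour that feature selection by weight thresholding relies on.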

2.2. Bayesian Optimization Algorithm

The Bayesian Optimization Algorithm (BOA) is one of the most well-known estimation of distribution algorithms, which combine Bayesian networks with evolutionary algorithms. In the BOA, global statistical information is extracted from the best solutions found so far and modeled using Bayesian networks. Therefore, BOA can overcome the disruption of building blocks in genetic algorithms. The BOA has advantages in the optimization of machine learning hyperparameters because of its faster search speed and fewer iterations compared with traditional search algorithms [28–30]. In this study, the BOA is employed to optimize the parameters of Random Forest (which is the basic model; see Section 2.3 for details) for traffic incident duration prediction, in order to achieve better prediction results.

The algorithm parameters to be optimized are denoted as $\theta$. $D_{\text{train}}$ is the training set, $D_{\text{valid}}$ is the validation set, and $f(\theta)$ is the validation accuracy. The goal of optimization is to find a set of parameter values that maximizes $f(\theta)$.

Firstly, the parameters of the machine learning algorithm are initialized as $\theta_0$. Secondly, the accuracy of the machine learning algorithm with the initial parameters is evaluated using the validation set, and the accuracy is recorded. Thirdly, the Gaussian Process (GP) model is introduced to fit the recorded accuracy iteratively. Then, the machine learning algorithm parameters are updated according to the recommendations of the GP model. In this process, the next operating point is selected by maximizing the acquisition function. The acquisition function guides the optimization by determining the next point to evaluate. Several acquisition functions have been proposed, such as probability of improvement, expected improvement, and information gain. Here, expected improvement is used as the acquisition function [30], and the best validation accuracy so far is $f^{*}$:

$$\mathrm{EI}(\theta) = \int_{-\infty}^{\infty} \max\left(f - f^{*},\, 0\right)\, p_{\mathcal{M}}(f \mid \theta)\, \mathrm{d}f,$$

where $p_{\mathcal{M}}(f \mid \theta)$ is the probability of $f$ with $\theta$, which is encoded by the GP model $\mathcal{M}$. For more details about acquisition functions, see [31]. More details about BOA for machine learning hyperparameters can be found in [28].
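When the GP prediction at a candidate point is Gaussian, the expected improvement over the best accuracy so far has a well-known closed form. The sketch below evaluates it for three candidate parameter settings and picks the one with the largest value. The candidate names, the predicted means and standard deviations, and the best-so-far accuracy are hypothetical numbers chosen only for illustration.

```python
import math


def expected_improvement(mean, std, best):
    """Closed-form EI of a Gaussian prediction N(mean, std^2) when maximizing."""
    if std <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


# Hypothetical GP predictions (mean accuracy, standard deviation) per candidate.
candidates = {"A": (0.84, 0.01), "B": (0.82, 0.08), "C": (0.80, 0.02)}
best_so_far = 0.85
scores = {name: expected_improvement(m, s, best_so_far)
          for name, (m, s) in candidates.items()}
next_point = max(scores, key=scores.get)
```

Candidate B is selected even though its predicted mean is lower than that of A, because its larger uncertainty leaves more room for improvement. This is the exploration and exploitation trade-off that the acquisition function encodes.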

2.3. Random Forest

Random Forest (RF) [32] is a machine learning algorithm that integrates multiple decision trees through ensemble learning and can be used for both classification and regression. Ensemble learning differs from traditional statistical reasoning and machine learning, which try to build a single accurate model. The goal of ensemble learning is to aggregate the results of multiple trained “weak learners” in order to obtain a “strong learner”. In general, the “weak learners” are simple and fast models with poor individual performance. In the case of RF, the “weak learners” are decision trees. Because the results of many trees are aggregated, RF has good resistance to noise and is not prone to overfitting. Moreover, the resulting RF model does not rely on any assumptions. As a result, the RF model is expected to have wider applicability and better robustness than traditional statistical reasoning and machine learning techniques [33, 34]. Therefore, RF is used as the basic model for traffic incident duration prediction. The feature variables selected by NCA are used as the input of the model, and traffic incident duration is used as the output of the model. For more information on model training, see Section 3.

The detailed steps of RF are shown as follows:

(1) Bootstrap samples are randomly formed from the original data with replacement, which will be the training set for growing the trees. The number of the created samples is equal to the number of the trees. Around one-third of the original data are left out which are called “out-of-bag” (OOB) data, while the remaining data are called in-bag data. RF performs a cross-validation in parallel with the training step by using the OOB samples to measure the prediction error [35].

(2) At each node of a decision tree, a random subset of the input variables (features), smaller than the full feature set, is selected, and the best split is calculated only within this subset rather than over all input variables. It is worth noting that the size of this subset may affect the stability of the RF model. The sensitivity of other parameters, such as the number of trees in RF and the size of each tree (i.e., the number of splits in each tree, called the number of leaves per tree), has also been studied [36, 37].

(3) Each tree splits to its maximum size without pruning throughout the growth of the forest [32].

(4) Each decision tree gives a prediction result. For regression problems, the average results of all decision trees are calculated to find the final prediction value; and for classification problems, the majority voting result is taken as the final output prediction value.
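Steps (1) and (4) can be illustrated without growing any trees. The sketch below draws one bootstrap sample, lists the out-of-bag observations, and aggregates hypothetical per-tree outputs by majority voting and by averaging. The sample size and the example tree outputs are assumptions made for illustration only.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n indices with replacement; unseen indices form the OOB set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(tree_predictions):
    """Classification: the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average(tree_predictions):
    """Regression: the mean of the tree outputs."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(1000, rng)
# For large n the expected OOB share approaches 1/e (about 0.368).
oob_fraction = len(out_of_bag) / 1000

predicted_class = majority_vote([2, 3, 2, 1, 2])   # hypothetical outputs of five trees
predicted_value = average([30.0, 40.0, 50.0])      # hypothetical regression outputs
```

The OOB share of roughly 37% is consistent with the "around one-third" figure quoted in step (1).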

2.4. NCA-BOA-RF Method

The flowchart of NCA-BOA-RF method for traffic incident duration prediction is shown in Figure 1. As can be seen from Figure 1, the application of the NCA-BOA-RF method includes two stages, and the main steps of the NCA-BOA-RF method are as follows.

Step 1. Determine relevant variables of traffic incident duration according to the available traffic incident dataset.

Step 2. The NCA method is used to select feature variables from the relevant variables for traffic incident duration prediction.

Step 3. Construct a training set using the feature variables.

Step 4. The RF model is used to learn the training set and the RF parameters are optimized by BOA.

Step 5. Input real-time collected feature variables data into the well-trained RF model and then the model will output prediction result of the traffic incident duration.

It is worth noting that more useful information may become available while a traffic incident is in progress. Therefore, if new useful information is collected and the traffic incident has not ended, the NCA-BOA-RF method can be used to predict the traffic incident duration again, but the time that has already elapsed since the start of the incident must be taken into account. In addition, as more traffic incident data accumulate, the offline database should be continuously updated and used to retrain and optimize the RF model.

3. Empirical Analysis

3.1. Data Description

The data used in this work were obtained from the traffic incident dataset on Interstate 880 (well known as the I-880 dataset), which was collected in the Freeway Service Patrol (FSP) project. In the two periods of the FSP project, the number of recorded traffic incidents was 1210 and 971, respectively. Incidents whose start or end time was not within the observation period cannot be used for modeling and testing because their duration is unknown. In addition, some traffic incidents were planned and predictable, such as road maintenance and traffic restrictions, and these were excluded from this study.

The I-880 dataset not only records the time, type, location, and duration of the incident, but also the relative location and distance between the incident location and the road exit, the number of vehicles involved, the type and color of involved vehicles, the location and number of lanes affected, the weather during the incident, and casualties. In addition, whether or not rescue vehicles such as trailers, ambulances, and fire engines are required, as well as the arrival and departure times of rescue vehicles are also recorded. In summary, the basic information of all available traffic incidents is given in Table 1.

A total of 440 traffic incidents (235 in the before period and 205 in the after period) were used in this study. The data of 308 traffic incidents (70%) were randomly selected as the training set, and the data of the remaining 132 traffic incidents were used as the test set. According to the actual content of the I-880 dataset, the relevant variables of traffic incident duration selected in this study are given in Table 2. The first 18 variables are the input, and the 19th variable is the output.
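A random 70/30 split of 440 incidents can be sketched as follows. The function name and the seed are illustrative assumptions, and the indices stand in for the actual incident records.

```python
import random


def split_train_test(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and split them into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_train_test(440, 0.70)
```

With 440 incidents and a training fraction of 0.70, this yields 308 training and 132 test indices.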

3.2. Feature Selection Using NCA

The NCA algorithm is used to select feature variables for traffic incident duration. The regularization parameter is an important parameter of NCA. First of all, it is necessary to determine whether the value of the regularization parameter is reasonable. In general, its value is set as a function of the number of input variables of the training set. In the process of selecting feature variables using NCA, a set of irrelevant variables is added as a control to highlight the importance of the feature variables. In this study, 100 irrelevant variables are randomly generated from a normal distribution with a mean of 0 and a variance of 20 and are added to the 18 relevant variables (the variables considered relevant to traffic incident duration). The feature weights of all variables are then calculated by NCA. If the value of the regularization parameter is appropriate, the feature weights of the relevant variables are large, and those of the irrelevant variables are small and close to 0. If the value is too large, the feature weights of all variables are close to 0; if it is too small, the irrelevant variables also have large feature weights.
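The control variables can be generated as in the following sketch, which appends 100 normally distributed noise columns to an 18-column input matrix. Note that a variance of 20 corresponds to a standard deviation of sqrt(20). The placeholder matrix of zeros, the function name, and the seed are assumptions for illustration.

```python
import math
import random


def add_irrelevant_variables(X, n_noise=100, variance=20.0, seed=0):
    """Append n_noise columns drawn from N(0, variance) to every row of X."""
    rng = random.Random(seed)
    std = math.sqrt(variance)  # random.gauss expects a standard deviation
    return [list(row) + [rng.gauss(0.0, std) for _ in range(n_noise)] for row in X]


# Placeholder input: 308 training incidents with 18 relevant variables each.
X_relevant = [[0.0] * 18 for _ in range(308)]
X_augmented = add_irrelevant_variables(X_relevant)
```

Each row of the augmented matrix then holds 118 values, and a well-chosen regularization parameter should drive the weights of the last 100 columns toward 0.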

According to the recommendation in [28, 37], the cross-validation method is used to optimize the regularization parameter. In this study, 5-fold cross-validation is employed. That is, the training set is randomly divided into 5 subsets; 4 subsets form a new training set and the remaining subset is used as the test set. The process is repeated five times until each subset has been used as the test set, and the average of the 5 test results is taken as the final result. The best value of the regularization parameter is the one that produces the minimum classification loss. Figure 2 shows the average loss versus the value of the regularization parameter; the best value corresponds to the minimum average loss of 0.1250.
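The selection procedure can be sketched as a generic 5-fold loop over candidate values. The fold construction follows the description above, while the candidate grid and the placeholder loss function are hypothetical. In practice the loss would be the classification loss of NCA fitted on the four training folds and evaluated on the held-out fold.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, held_out) index lists so that each fold is held out once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def select_by_cross_validation(candidates, n, loss_for, k=5):
    """Return the candidate with the smallest average held-out loss."""
    average_loss = {}
    for value in candidates:
        losses = [loss_for(value, train, held_out)
                  for train, held_out in k_fold_indices(n, k)]
        average_loss[value] = sum(losses) / k
    return min(average_loss, key=average_loss.get), average_loss


def placeholder_loss(value, train, held_out):
    """Hypothetical stand-in for the held-out classification loss."""
    return abs(math.log10(value) + 2.0)


grid = [0.0001, 0.001, 0.01, 0.1, 1.0]  # hypothetical candidate values
best_value, losses_by_value = select_by_cross_validation(grid, 308, placeholder_loss)
```

With 308 training incidents the five folds contain 62, 62, 62, 61, and 61 observations.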

The feature weights of all variables are calculated using NCA with the best value of the regularization parameter, and the results are shown in Figure 3. As can be seen from Figure 3(a), the feature weights of the irrelevant variables are all close to 0. As can be seen from Figure 3(b), the variables with a large feature weight are no. 1, no. 5, no. 11, no. 12, no. 14, and no. 16. Their corresponding variable names are “incident type”, “number of lanes closed”, “need truck wrecker”, “need firefighters”, “automobile count”, and “heavy truck count”.

Generally, the more vehicles involved, the larger the vehicles, and the more departments involved in the rescue, the longer the duration of the incident. The calculation results are therefore in agreement with the actual situation.

3.3. RF Parameters Optimized Using BOA

The training set is constructed using the 6 feature variables selected by NCA. Before RF is trained, the RF parameters need to be determined, including the number of trees, the number of leaves per tree, and the number of random variables used for each node split. Increasing the number of decision trees can improve the classification accuracy to a certain extent, but it reduces the efficiency of the algorithm. If the minimum classification loss is taken as the goal, the number of decision trees will increase dramatically. Therefore, the number of trees is not optimized in this study, and only the other two parameters are optimized to improve the classification accuracy. If value is too large, it will easily lead to overfitting; if value is too small, it will easily lead to underfitting. The RF parameters and are tuned using BOA. The RF parameters are set as follows: , , and . The objective function of BOA is the classification loss of the OOB data. Figure 4 shows the objective function model. Figure 5 shows the relationship between the number of function evaluations and the minimum objective. The optimized RF parameters are calculated as and , and the observed minimum of the objective function is 0.1409.

3.4. Results and Discussion

In this section, not only is the test set used to analyze NCA-BOA-RF performance, but also classification and regression tree (CART) [15] and support vector machine (SVM) [20] are introduced for comparison. The parameters of SVM that need to be determined include kernel function parameter and penalty coefficients. The parameters of CART that need to be determined include learning rate and the maximum number of node splits. In order to ensure comparison methods have good performance, BOA is also used to optimize the parameters of SVM and CART.

Figure 6 shows the confusion matrices for the results of these three methods. The confusion matrix can provide detailed classification results of the algorithm [38]. The rows represent the output class (predicted class) and the columns represent the target class (true class). The diagonal cells (green) represent observations that are classified correctly. The off-diagonal cells (red) represent observations which are classified incorrectly. The column on the far right of the plot shows the percentages of all the test samples predicted to belong to each class that are correctly and incorrectly classified. These metrics are often called the precision (or positive predictive value) and false discovery rate, respectively. The row at the bottom of the plot shows the percentages of all the test samples belonging to each class that are correctly and incorrectly classified. These metrics are often called the recall (or true positive rate) and false negative rate, respectively. The cell (blue) in the bottom right of the plot shows the overall classification accuracy.
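The quantities read off the confusion matrix can be computed as in the following sketch, which uses the same layout as described above (rows are predicted classes, columns are true classes). The small label lists are hypothetical and are not taken from the experiments.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count matrix with rows = predicted class and columns = true class."""
    position = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, pred in zip(y_true, y_pred):
        matrix[position[pred]][position[true]] += 1
    return matrix


def summarize(matrix):
    """Per-class precision (row-wise), recall (column-wise), and overall accuracy."""
    k = len(matrix)
    row_sums = [sum(matrix[i]) for i in range(k)]
    col_sums = [sum(matrix[r][i] for r in range(k)) for i in range(k)]
    precision = [matrix[i][i] / row_sums[i] if row_sums[i] else 0.0 for i in range(k)]
    recall = [matrix[i][i] / col_sums[i] if col_sums[i] else 0.0 for i in range(k)]
    accuracy = sum(matrix[i][i] for i in range(k)) / sum(row_sums)
    return precision, recall, accuracy


# Hypothetical labels for eight test incidents in three duration classes.
y_true = [1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 3, 1]
cm = confusion_matrix(y_true, y_pred, classes=[1, 2, 3])
precision, recall, accuracy = summarize(cm)
```

For these labels the matrix is [[2, 0, 1], [1, 2, 0], [0, 0, 2]] and the overall accuracy is 0.75. The false discovery rate and false negative rate are one minus the precision and one minus the recall, respectively.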

Therefore, according to confusion matrices obtained as a result of the three methods, the overall classification accuracy of RF is the highest one. For the “5” class samples, the classification accuracy of the three methods is equal (=33.3%), indicating that the three methods have a poor classification performance on the “5” class samples. This is due to the fact that the number of the “5” class samples is too small. For the “3” class samples, the classification accuracy of RF and SVM is equal (=84.0%). In addition, the classification accuracy of RF is higher than that of SVM and CART on the samples of the remaining classes.

The degree of misclassification can also be obtained from the confusion matrix. For example, if a “1” class sample is classified into the “2” class, the degree of misclassification is 1; if a “1” class sample is classified into the “3” class, the degree of misclassification is 2. Figure 7 shows the frequency distribution of the misclassification degree for the three methods. It can be seen from Figure 7 that the misclassification degree of RF is mainly 1, with a small number of degree 2 and none of degree 3 or 4, while the other two methods produce misclassifications of degree 3 and significantly more misclassifications of degree 2 than RF.
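The misclassification degree is the absolute difference between the predicted and the true class index, counted over the misclassified samples only. A minimal sketch with hypothetical labels:

```python
from collections import Counter


def misclassification_degrees(y_true, y_pred):
    """Frequency of |predicted - true| over the misclassified samples."""
    return Counter(abs(pred - true)
                   for true, pred in zip(y_true, y_pred) if pred != true)


# Hypothetical ordinal class labels for six test incidents.
true_classes = [1, 1, 2, 3, 5, 4]
predicted_classes = [2, 3, 2, 3, 4, 1]
degree_counts = misclassification_degrees(true_classes, predicted_classes)
```

Here two samples are off by one class, one by two classes, and one by three classes. This measure is meaningful only because the duration classes are ordered.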

At the beginning of the traffic incident, the available information for incident duration prediction is often limited. Therefore, we need to analyze the performance of the method in the absence of some variables. The six feature variables are 1 “incident type”, 2 “number of lanes closed”, 3 “need truck wrecker”, 4 “need firefighters”, 5 “automobile count”, and 6 “heavy truck count”. When an incident is first identified, 1 “incident type”, 4 “need firefighters”, and 5 “automobile count” are often the first to be known, so the absence of these three variables is not considered.

In the absence of variables, the classification accuracy of the three methods is shown in Figure 8. As can be seen from Figure 8, when a single variable is missing, the classification accuracy of the three methods reduces to a certain extent; when two variables are missing, the classification accuracy of the three methods continues to decrease; when three variables are missing, the classification accuracy of the three methods is greatly reduced to about 50%. However, from the comparison of the three methods without some variables, the classification accuracy of RF is still higher than that of SVM and CART.

4. Conclusions

In this study, a hybrid method named NCA-BOA-RF is proposed, which integrates NCA and the BOA-optimized RF model for traffic incident duration prediction. Firstly, 18 influencing factors are selected as the relevant variables of traffic incident duration, considering the influencing factors of traffic incident duration and the data set used. Secondly, the feature weights of the relevant variables are calculated by NCA, and the most powerful features (called feature variables) for traffic incident duration prediction are determined by the feature weights. In NCA, the regularization parameter is optimized using cross-validation to ensure better classification accuracy (less classification loss). Then, the training set is constructed using the feature variables to train the RF model, and the BOA is used to optimize the RF parameters. Finally, experiments are conducted to test the performance of the proposed method and two comparison methods. The confusion matrix is introduced to better illustrate the experimental results. The proposed method not only has the highest classification accuracy on the whole data, but its classification accuracy on each class of the data is also greater than or equal to that of the other two methods. In addition, the performance of the proposed method is tested with some feature variables unavailable. The results demonstrate that its performance is still better than that of the comparison methods, although the performance of all methods is reduced. Based on the comparison and analysis of the experimental results, it can be concluded that NCA-BOA-RF is a better method for traffic incident duration prediction because of its high accuracy and good generalization ability.

For future research, more comprehensive data sets should be used to further test the performance of the method and to draw more general conclusions. Between the detection of an incident and its clearance, more useful information is expected to become available, so multistage updates of information should be considered in future studies. In addition, predicting the different stages of incident duration (such as response time) separately is also a direction for future research.

Data Availability

The traffic incident data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research is supported by the National Natural Science Foundation of Shandong Province (Grant Nos. ZR2018BF024 and ZR2016EL19), MOE (Ministry of Education in China) Project of Humanities and Social Sciences (Grant No. 18YJC190003), the National Natural Science Foundation of China (Grant No. 61573009), and the Dr. Scientific Research Start Funding Projects of Shandong University of Technology (Grant Nos. 4041/417006, 4033/718003).