Abstract

Safety has become a critical obstacle to the commercialization of autonomous vehicles (AVs). The objective of this study is to explore the mechanism of AV-involved crashes and analyze the impact of each feature on crash severity. We use the Apriori algorithm to mine the relationships among multiple factors and thereby explore the mechanism of crashes. We use several machine learning models, including support vector machine (SVM), classification and regression tree (CART), and eXtreme Gradient Boosting (XGBoost), to analyze crash severity. In addition, we apply Shapley Additive Explanations (SHAP) to interpret the importance of each factor. The results indicate that XGBoost obtains the best result (recall = 75%; G-mean = 67.82%). Both XGBoost and the Apriori algorithm provided meaningful insights into AV-involved crash characteristics and their relationships. Among all features, vehicle damage, weather conditions, accident location, and driving mode are the most critical. We found that most rear-end crashes involve conventional vehicles bumping into the rear of AVs. Drivers should be extremely cautious when driving in fog, snow, and insufficient light. Drivers should also be careful when driving near intersections, especially in the autonomous driving mode.

1. Introduction

In recent years, academia and industry have invested enormous human and material resources in the research and development of AVs. One of the original intentions is to reduce the rate of traffic accidents. Every year, the economic loss caused by traffic accidents is 277 billion dollars, which is twice the loss caused by traffic congestion [1]. Drivers’ errors cause 90% of traffic accidents. More than 40% of fatal crashes are related to alcohol, distraction, drug use, and fatigued driving [2]. Even when nonhuman factors are the primary cause of a crash, human factors such as distraction or inexperience are usually involved. With improved AV technology, fatal accident rates are likely to fall by at least 40%, and human factors may disappear. Many studies have explored the features that influence the severity of conventional vehicle crashes [3, 4]. However, the literature review shows that relatively few studies have focused on factors influencing the severity of AV-involved crashes. The driving system of AVs is different from that of conventional vehicles. Therefore, clarifying the mechanism of AV-related crashes is of great importance for improving the safety of AVs.

In the previous literature on the contributing factors of AV-related crashes [5–8], less attention was paid to the correlation between different factors. These studies have proposed various methods (such as classification and regression trees and neural networks) to explore the features and mechanism of AV-involved crashes. However, these methods are based on the assumption that the factors are independent of each other, so they can easily overlook the relationships between factors. Association rule mining is a crucial data mining technology, especially for analyzing the causes of traffic crashes, because it does not rely on such assumptions and can discover meaningful relationships hidden in large datasets [9]. Therefore, to interpret the interrelationships between factors and further explore the mechanism of AV-involved crashes, association rule mining algorithms need to be adopted.

The interpretability of model results is another valuable issue. There are many machine learning methods (such as random forest [10], classification and regression tree [11], and gradient boosting model [12]) that have been utilized to study the severity of traffic crashes. These models are more complex and data-driven and have higher accuracy than traditional calculation models [13]. However, these models are usually regarded as a “black box” because the complicated and nonlinear effects of features on the prediction results cannot be explained [14]. The current study uses SHAP (Shapley Additive Explanations) to interpret how a variable affects model prediction results. SHAP, proposed by Lundberg and Lee [15], originated from cooperative game theory. The prediction results are explained by calculating the contribution of individual variables to the results.

The objective of this study is to explore the mechanism of AV-involved crashes and investigate the impact of each feature on accident severity. We used 131 AV-involved crash reports received in California from 2019 to October 2020. We used the synthetic minority oversampling technique (SMOTE) to balance the dataset. The Apriori algorithm and classification models are used for accident mechanism analysis. We apply SHAP to interpret how a variable affects the model classification results.

2. Literature Review

Advanced driving assistance systems (such as the antilock brake system (ABS), electronic stability program (ESP), autonomous emergency braking (AEB), and lane-keeping assist system (LKA)) have been used in the automotive industry for many years to improve traffic safety. Statistics show that these systems have effectively improved vehicle safety and reduced the rate of traffic accidents [2]. As the level of AV technology increases, driving tasks are gradually transferred from the driver to the autonomous driving system. Since AVs have strong environmental perception (such as vehicle-to-everything (V2X)), data processing, and rapid response capabilities, they can make up for the driver’s inherent shortcomings to a certain extent [13]. It is foreseeable that, with the development of AV technology, traffic safety problems will be further alleviated. In the 1970s, Haddon proposed a theory from the perspective of human-vehicle-environment, which divided the crash into three stages: precrash, crash, and postcrash [16]. The three factors at each stage of the crash process are arranged and combined to form the famous Haddon matrix, shown in Table 1.

With the development of AV technology, the causes of traffic accidents have changed significantly. As shown in Figure 1, in conventional vehicles, drivers make driving decisions based on complex environmental information and vehicle status. Therefore, if the vehicle is in danger of collision, the driver needs to make a quick decision based on the current driving scene and his or her own experience. However, the situation for AVs is different. As shown in Figure 1, AV technology eliminates the driver as an unstable factor. Within the preset operational design domain (ODD), AVs rely on their sensors to perceive all environmental information (including traffic information, environmental conditions, and road conditions) and make driving decisions [16]. Simultaneously, through human-computer interaction techniques, V2X, and other technologies, AVs share some of their driving status with other traffic participants. If an accident occurs within the ODD, the primary responsibility lies with the AV. Therefore, to protect passengers’ safety and avoid accident liability, the AV manufacturer will carry out comprehensive testing and verification before the AV is launched. Thus, based on the perception of the environment and the infrastructure’s support, the traffic accidents caused by human drivers may disappear under ideal circumstances. The Haddon matrix will change accordingly. However, limited by current AV technology and transportation facilities, this goal cannot be achieved in the short term.

Previous studies on the safety of AV technology mainly analyzed driver behavior in driving simulators and tested the stability of autonomous driving systems in closed environments. To avoid potential collisions and improve traffic efficiency, some studies concentrate on trajectory optimization of AVs. Omidvar et al. [17] designed an AV trajectory optimization algorithm for closed road signalized intersections. The algorithm can optimize the signal control scheme before the AVs arrive at the intersection, control the vehicle’s speed, and ensure that the vehicles can quickly pass the intersection. Li et al. [18] developed an integrated local trajectory optimization scheme and tracking control framework to avoid obstacles in time. With safety and comfort as the objective function, the best trajectory plan is selected. Zhu et al. [19] proposed a speed control method for AVs based on vehicle speed prediction, which improves vehicle operating efficiency and comfort. Many studies use driving simulators to analyze the characteristics of drivers. In an automated driving environment, attention should be paid to drivers’ physiological and psychological reactions. Winter et al. [20] found that drivers do not need to monitor the vehicle’s automation process for a long time when driving a highly automated vehicle. They can shift their attention to nondriving-related tasks without affecting the safety of the vehicle. However, insufficient sample size is one of the limitations of AV safety testing. Both field and driving simulator research attempt to analyze the safety problems of AVs from the perspective of vehicle control and human factors. However, we also need to fully explore the mechanism of AV-involved crashes and analyze the impact of each feature on crash severity. Table 2 lists the studies conducted on AV-involved crashes. This table presents the variables used in the studies, the analysis methods, and the significant factors obtained from each study.

Determining the cause of a traffic crash is the most critical step in taking preventive measures to reduce the number and severity of traffic crashes. The Apriori algorithm proposed by Agrawal et al. [21] is the most commonly used association rule mining algorithm. It has been widely used in traffic safety analysis [22–24]. Xu et al. [9] used the Apriori algorithm to explore the causes of traffic crashes with heavy casualties and their interdependent relationship in China. The results indicated that serious casualty crashes resulted from complex interactions between traffic participants, vehicle, road, and environmental conditions. Montella et al. [25] applied the Apriori algorithm to analyze powered two-wheeler (PTW) collisions in Italy to find the interdependencies and differences between the collision features. Yu et al. [23] adopted the Apriori algorithm to recognize risk factors that are prominently linked to crash severity. Many studies have begun to use SHAP for model interpretation. In terms of traffic safety, Mihaita et al. [26] applied SHAP to study the influence of various characteristics on crash duration. Parsa et al. [1] adopted SHAP to explain the importance of individual features in accident detection. Zhou et al. [14] applied SHAP to interpret the factors influencing the severity of car and truck driver injury in car-truck collisions.

By combining the Apriori algorithm and classification models to explore the mechanism of AV-involved crashes, this study provides some useful information for taking preventive measures to promote the safety of AVs.

3. Data Analysis and Feature Extraction

3.1. Data Sources

In California, manufacturers testing AVs are required to submit AV-related crash reports to the Department of Motor Vehicles (DMV) within ten business days of an accident. This study used 131 crash reports received in California from 2019 to October 2020. We extracted AV-related information from the crash reports, such as type of collision, crash severity, vehicle information, and weather. The heat map of accident locations (Figure 2) shows that crashes mainly occurred in northeastern San Francisco, which is the primary test site for AVs.

3.2. Imbalanced Data Treatment

By nature, accident data are unbalanced because most crashes cause only property damage, and only a few lead to injuries. The techniques applied to cope with imbalanced data can be divided into two groups: oversampling and undersampling. Undersampling often discards a large amount of data, which decreases model accuracy. Therefore, oversampling is usually preferred. In this study, we adopted SMOTE to cope with imbalanced data.
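The idea behind SMOTE can be illustrated with a minimal pure-Python sketch: each synthetic minority sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, the toy points, and the parameter `k` are hypothetical; the study does not state which implementation or encoding of the categorical variables was used, so this is only an illustration on numeric features.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class samples (two numeric features each)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class.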

3.3. Variable Collinearity Analysis

Multicollinearity refers to the distortion or difficulty of estimation due to the high correlation between explanatory variables in a linear regression model. The variance inflation factor (VIF) is a common indicator of multicollinearity among variables [27–29]. We used SPSS 26.0 to calculate the VIF value of each variable. A VIF value greater than 10 indicates serious collinearity among the variables, and it is recommended to eliminate one of the variables involved [30]. Finally, we selected nine categorical variables. The variables’ description and distribution are shown in Table 3.

4. Methodology

4.1. Association Rule Mining

The Apriori algorithm identifies sets of items (i.e., the crash patterns in our study) that occur together in a given event (i.e., a crash in our research). It is the most basic and widely used algorithm in association rule mining. It uses a layer-by-layer iterative search to find frequent itemsets in the database and determine strong association rules. An association rule can be expressed as X ⟶ Y (where X is the antecedent and Y is the consequent). That is to say, when X appears in the dataset, Y may also occur. In this study, the Apriori algorithm is used to interpret the interrelationships between factors and further explore the mechanism of AV-involved crashes.
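The layer-by-layer search can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function name `frequent_itemsets` and the five toy crash records are hypothetical, and only the item codes (such as "TOC=2" or "PMCV=1") follow the coding style used in this paper.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori) search for itemsets with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({item for t in transactions for item in t})
    level = [s for s in (frozenset([i]) for i in items) if support(s) >= min_support]
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support(s)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        prev = set(level)
        candidates = [c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))]
        level = [c for c in candidates if support(c) >= min_support]
    return result

# Hypothetical crash records, each a set of "variable=category" items
transactions = [
    frozenset({"TOC=2", "PMCV=1", "PMAV=0"}),
    frozenset({"TOC=2", "PMCV=1", "W=1"}),
    frozenset({"TOC=2", "PMCV=1", "PMAV=0", "W=1"}),
    frozenset({"W=1", "PMAV=0"}),
    frozenset({"TOC=2", "W=1"}),
]
result = frequent_itemsets(transactions, min_support=0.6)
```

The prune step is what makes the search tractable: a candidate is discarded without counting as soon as one of its subsets is known to be infrequent.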

In the Apriori algorithm, lift (L), support (S), and confidence (C) are three essential indicators for discovering association rules. S is the proportion of samples in the complete dataset in which the related items occur together, which can be calculated as

$$S(X \longrightarrow Y) = \frac{\mathrm{number}(X \cap Y)}{N}, \tag{1}$$

where $\mathrm{number}(X \cap Y)$ is the frequency of X and Y appearing in the dataset at the same time and N is the total number of samples.

C can be interpreted as the conditional probability P(Y|X) (the probability of finding itemset Y in a crash given that the crash already contains X). It can be calculated as

$$C(X \longrightarrow Y) = \frac{\mathrm{number}(X \cap Y)}{\mathrm{number}(X)}. \tag{2}$$

We used S and C to exclude meaningless rules. L measures the effect of the antecedent on the probability of the consequent and is used to determine whether the rule has actual value. It can be calculated as

$$L(X \longrightarrow Y) = \frac{C(X \longrightarrow Y)}{S(Y)}, \tag{3}$$

where $S(Y) = \mathrm{number}(Y)/N$ is the proportion of samples that contain Y.

When the value of L is greater than 1, it indicates that X’s occurrence increases the probability of Y’s occurrence; otherwise, it is an invalid rule.
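As a worked illustration of the three indicators, the sketch below computes support, confidence, and lift for one rule by direct counting. The function name `rule_metrics` and the toy records are hypothetical and are not drawn from the crash dataset; the sketch assumes the antecedent and the consequent each occur at least once.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    n = len(transactions)
    n_x = sum(antecedent <= t for t in transactions)
    n_y = sum(consequent <= t for t in transactions)
    n_xy = sum((antecedent | consequent) <= t for t in transactions)
    support = n_xy / n            # share of records containing both X and Y
    confidence = n_xy / n_x       # P(Y | X)
    lift = confidence / (n_y / n) # P(Y | X) / P(Y)
    return support, confidence, lift

# Hypothetical crash records, each a set of "variable=category" items
transactions = [
    frozenset({"TOC=2", "PMCV=1", "PMAV=0"}),
    frozenset({"TOC=2", "PMCV=1", "W=1"}),
    frozenset({"TOC=2", "PMCV=1", "PMAV=0", "W=1"}),
    frozenset({"W=1", "PMAV=0"}),
    frozenset({"TOC=2", "W=1"}),
]
support, confidence, lift = rule_metrics(
    transactions,
    antecedent=frozenset({"PMAV=0", "TOC=2"}),
    consequent=frozenset({"PMCV=1"}),
)
```

In this toy data the antecedent occurs in two of five records and the consequent accompanies it both times, so the confidence is 1 while the lift exceeds 1 because the consequent occurs in only three of the five records overall.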

In general, the rationality of a rule is judged by its support, confidence, and lift. In the crash dataset, however, a given support value does not mean the same thing for categories of different variables, because the category frequencies differ from variable to variable. Therefore, when mining accident association rules, the reference value of the support indicator is small. Confidence is essentially a conditional probability, representing the probability that the consequent itemset appears when the antecedent itemset occurs. Thus, the confidence indicator is used as the primary indicator for later analysis and screening. We used Python 3.6 to code the Apriori algorithm.

4.2. Classification Models

Mining association rules for traffic accidents reveals the relationships between multiple factors and thereby helps explain the mechanism of AV-involved crashes. In addition, we use classification models to investigate the primary factors affecting the severity of the accident. This section briefly introduces each classification model and the classification performance evaluation indicators. We used the Scikit-learn (sklearn) library in Python 3.6 to code the following models.

4.2.1. XGBoost Model

The XGBoost model is a boosting algorithm that generates multiple weak classifiers by fitting residuals and then accumulates them to obtain a strong classifier [31]. Chen and Guestrin made several improvements to gradient boosting [32] and presented XGBoost in 2016. XGBoost applies a second-order Taylor expansion to the loss function during optimization, which introduces second-order derivative information and makes the model converge faster in training. In addition, XGBoost adds a regularization term to the loss function to limit model complexity and prevent overfitting. The regularized objective at iteration $k$ can be expressed in the following equation:

$$\mathrm{Obj}^{(k)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(k)}\right) + \sum_{j=1}^{k} \Omega(f_j), \tag{4}$$

where $n$ is the sample number, $y_i$ is the observed label of sample $i$, $\hat{y}_i^{(k)}$ is the prediction value of sample $i$ at iteration $k$, $l$ is the original loss function, and $f_j$ is the tree added at iteration $j$. $\Omega$ represents the regularization term, as shown in the following equation:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2}. \tag{5}$$

Here, $T$ is the number of leaf nodes, $w$ is the vector of leaf weights, and $\gamma$ and $\lambda$ are two constants employed to constrain the degree of regularization.

Another development of XGBoost is the application of an additive learning approach that adds the best-fitting tree model to the current classification model to provide the prediction result at each iteration. Therefore, the objective can be expressed further as follows:

$$\mathrm{Obj}^{(k)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(k-1)} + f_k(x_i)\right) + \Omega(f_k) + C. \tag{6}$$

Additionally, XGBoost applies the second-order Taylor expansion to the objective function, and equation (6) can be expressed further as

$$\mathrm{Obj}^{(k)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(k-1)}\right) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^{2}(x_i) \right] + \Omega(f_k) + C. \tag{7}$$

Here, $g_i$ and $h_i$ are the first and second derivatives of the loss function with respect to $\hat{y}_i^{(k-1)}$, respectively, and $C$ represents the constant (the regularization terms of the trees fixed in earlier iterations).
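The second-order approximation can be checked numerically for a single sample. The sketch below assumes a logistic loss on a raw score, which is a common choice for binary classification but is not stated in this paper; the names `logloss` and `grad_hess` and the numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logloss(y, score):
    """Logistic loss for label y in {0, 1} and a raw (pre-sigmoid) score."""
    p = sigmoid(score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_hess(y, score):
    """First and second derivatives of the logistic loss w.r.t. the score."""
    p = sigmoid(score)
    return p - y, p * (1 - p)

# One sample: previous prediction and the output of the newly added tree
y, prev_score, step = 1, 0.3, 0.1
g, h = grad_hess(y, prev_score)

exact = logloss(y, prev_score + step)
approx = logloss(y, prev_score) + g * step + 0.5 * h * step ** 2
```

For a small tree output the two values agree closely, which is why the gradient and Hessian alone are enough to score candidate trees at each iteration.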

4.2.2. CART Model

The CART model is a decision tree learning algorithm that uses the Gini index (as shown in equations (8) and (9)) to select and classify attributes [4]. The smaller the Gini value, the higher the purity of the dataset and the better the classification effect.

$$\mathrm{Gini}(D) = 1 - \sum_{k} p_k^{2}, \tag{8}$$

$$\mathrm{Gini}(D, A) = \sum_{a} \frac{|D^{a}|}{|D|}\,\mathrm{Gini}(D^{a}). \tag{9}$$

Here, $D$ is the dataset, $p_k$ is the probability that category $k$ appears in $D$, $A$ is the attribute to be divided, $a$ is a possible value of attribute $A$, and $D^{a}$ is the set of samples with the value $a$ in the dataset. The decision tree generated by the CART algorithm is a binary tree, and the segmentation steps are as follows:
(1) Select the attribute with the minimum Gini coefficient as the segmentation point for each binary tree node
(2) Select the optimal segmentation point for this node from all the optimal attribute values as this node’s segmentation rule
(3) Repeat the above steps to continue the segmentation of the left and right nodes until all samples in a node belong to the same category, at which point the segmentation stops
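The Gini calculations can be illustrated with a short sketch. The helper names `gini` and `gini_split` and the six toy samples (a weather attribute and an injury label) are hypothetical and serve only to show how the impurity of a dataset and of an attribute-based partition are computed.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(values, labels):
    """Weighted Gini impurity after partitioning samples by an attribute's value."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(group) / n * gini(group) for group in groups.values())

# Hypothetical samples: weather attribute and injury label (1 = injury)
weather = ["clear", "clear", "clear", "fog", "fog", "fog"]
injury = [0, 0, 0, 1, 1, 0]

before = gini(injury)               # impurity of the whole dataset
after = gini_split(weather, injury) # impurity after splitting on weather
```

The attribute whose partition gives the lowest weighted impurity is preferred; here splitting on the weather attribute halves the impurity of the toy dataset.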

4.2.3. SVM Model

Among various machine learning algorithms, SVM has been widely used in classification and regression research. The core idea of the algorithm is to construct the optimal hyperplane to improve classification precision. For a linear classification problem, two parallel hyperplanes are chosen to maximize the distance between different classes. For nonlinear classification problems, the samples are mapped into a higher-dimensional space by a kernel function; the three basic kernel functions are linear, polynomial, and Gaussian. Taking binary classification as an example, the training set is as follows:

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \quad x_i \in \mathbb{R}^{n}, \; y_i \in \{-1, +1\}, \tag{10}$$

where $y_i$ is the category label to which sample $x_i$ belongs, $n$ is the dimension of the sample, and $N$ is the number of training samples. An optimal separating hyperplane is calculated by the following formula:

$$w \cdot x + b = 0, \tag{11}$$

where $w$ is an n-dimensional vector and $b$ is the offset. The label of a sample can be represented by the following equation:

$$y = \operatorname{sign}(w \cdot x + b). \tag{12}$$

4.2.4. Evaluation Indicator

The confusion matrix, also called the error matrix, is a standard tool for evaluating classification accuracy. It compares the classification results with the actual values and displays the numbers of correct and incorrect classifications for each class (see Table 4) [33].

The overall accuracy is measured as follows:

$$\text{Overall accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{13}$$

where $TP$ and $TN$ are the numbers of correctly classified injury (positive) and noninjury (negative) crashes, and $FP$ and $FN$ are the numbers of noninjury crashes classified as injury and injury crashes classified as noninjury, respectively.

However, this indicator may not be suitable for unbalanced data. Because the number of injuries in the current study is significantly lower than the number of noninjuries, overall accuracy can be high even if all minority instances are misclassified. We used the G-mean to evaluate the classification accuracy of unbalanced data. It is considered a reasonable indicator for assessing unbalanced data because it balances the classification accuracy of minority and majority instances [34]. The G-mean is calculated as follows:

$$\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}. \tag{14}$$

The recall rate indicates the classification accuracy of minority instances, as shown in the following equation:

$$\text{Recall} = \frac{TP}{TP + FN}. \tag{15}$$

Finally, we choose the G-mean and recall rate as indicators to measure the model performance because we need to identify as many injury accidents as possible.
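The difference between these indicators on unbalanced data can be seen in a small sketch. The function name `evaluate` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not the counts behind Table 7.

```python
import math

def evaluate(tp, fn, tn, fp):
    """Return (overall accuracy, recall, G-mean) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    recall = tp / (tp + fn)          # accuracy on the minority (positive) class
    specificity = tn / (tn + fp)     # accuracy on the majority (negative) class
    g_mean = math.sqrt(recall * specificity)
    return accuracy, recall, g_mean

# A classifier that treats both classes reasonably well
balanced = evaluate(tp=15, fn=5, tn=60, fp=20)

# A classifier that labels every sample as the majority class
all_majority = evaluate(tp=0, fn=20, tn=80, fp=0)
```

With 20 positives and 80 negatives, the classifier that never predicts the minority class still reaches an overall accuracy of 0.8, while its recall and G-mean are both 0.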

4.3. Model Results’ Interpretation

The objective of establishing the crash severity classification model is to reveal the relationship between each factor and the crash severity. Subsequently, corresponding countermeasures can be implemented to reduce the crash severity. Therefore, the interpretability of the model output is as important as its accuracy.

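We apply SHAP to interpret how a specific variable influences the model classification results. It was proposed by Lundberg and Lee and originated from cooperative game theory [15]. The model produces a predicted value for each sample, and a SHAP value is allocated to each feature in the sample. The importance of each feature on the model output ($\phi_i$ is the importance of feature $i$) is assigned based on its marginal contributions. The following equation calculates the Shapley values:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[ f(S \cup \{i\}) - f(S) \right], \tag{16}$$

where $F$ is the set of all features, $S$ is a subset of features that does not contain feature $i$, and $f(S)$ is the model output when only the features in $S$ are used.

The linear function of binary features is determined according to the following formula:

$$g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i. \tag{17}$$

Here, $z' \in \{0, 1\}^{M}$; $z'_i$ is equal to 1 when feature $i$ is detected and 0 otherwise, $\phi_0$ is the base value of the model output, and $M$ is the number of input features.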
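For a handful of features, Shapley values can be computed exactly by enumerating all feature subsets, as in the sketch below. The function name `shapley_values` and the toy value function are hypothetical; practical SHAP implementations rely on faster model-specific algorithms rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values for a set function `value` over feature indices."""
    features = range(n_features)
    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight of a coalition of this size that excludes feature i
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_value(subset):
    """Hypothetical model output given the set of present features."""
    score = 0.0
    if 0 in subset:
        score += 0.30
    if 1 in subset:
        score += 0.10
    if 0 in subset and 2 in subset:
        score += 0.20  # interaction between features 0 and 2
    return score

phi = shapley_values(toy_value, 3)
```

The interaction term is shared equally between features 0 and 2, and the three values sum to the difference between the output with all features and the output with none, which is the additivity property that the linear explanation model relies on.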

5. Results and Discussion

5.1. Association Rule Analysis

In previous studies, the threshold of L was usually set at 1, and the thresholds of S and C were usually set at 10–20% [24]. In the current study, association rules were generated by setting the min-support equal to 0.2, min-confidence equal to 0.2, and min-lift equal to 1. A total of 90 association rules were generated. Figure 3 shows a bubble chart of all association rules. The rules for uninjured and injury crashes are analyzed separately below.

5.1.1. Association Rules for Uninjured Crashes

There were 106 uninjured crashes in the dataset, accounting for 80.65%. We sorted the obtained rules according to the confidence index and deleted the rules with no obvious value. Finally, we obtained the top 10 strong association rules, as shown in Table 5.

The above results show that the ten strong association rules for uninjured accidents share many of the same factors, and these factors are the most common categories of their respective variables. This pattern is consistent with expectations and indirectly supports the validity of the mined rules. Among the factors involved in the ten strong association rules for uninjured crashes, “TOC = 2,” “PMCV = 1,” “L = 1,” and “W = 1” are the most frequent itemsets. This shows that most uninjured crashes occurred in ordinary situations. According to rules 5 and 7, the AV had stopped before the crash (“PMAV = 0”), the conventional vehicle was still moving forward (“PMCV = 1”), and the type of collision was rear-end (“TOC = 2”). This shows that most rear-end crashes involve conventional vehicles bumping into the rear of AVs, consistent with previous research results [35].

5.1.2. Association Rules for Injury Crashes

There were 25 injury crashes in the dataset, accounting for 19.35%. As in the previous section, the top 10 strong association rules are shown in Table 6.

The above results show that the top ten strong association rules for injury crashes involve more factors than those for uninjured crashes. There are some unconventional factors (such as “W = 2,” “VD = 2,” and “DM = 1”), indicating that unfavorable conditions often accompany the occurrence of injury accidents. This pattern is consistent with expectations and indirectly supports the validity of the mined rules. Compared with Table 5, injury crashes are more likely to occur on cloudy days (“W = 2”). According to rules 1 and 6, the AVs were still moving before the collision on a cloudy day, indicating that the autonomous driving system failed to apply emergency measures. This could be caused by the detector failing to find anomalies in time when there is insufficient light. In addition, the driver may divert attention to secondary tasks when the driving mode is automatic, so an injury is more likely in an accident (rules 7 and 9).

5.2. Classification Model Results

We divided the data into a training set and a test set at a ratio of 7 : 3. We used grid search to determine the best combination of parameters and to prevent overfitting of the model. The performance of each classification model is reported in Table 7.

Table 7 shows that the XGBoost model performed better than the other models because it has the highest G-mean and recall (G-mean = 67.82%; recall = 75.00%). Although the overall accuracy is low (overall accuracy = 64.10%), the G-mean and recall are the main indicators for unbalanced data because we want to recognize more injury crashes. In summary, the XGBoost model can more accurately identify injury accidents. Then, we used SHAP to explain the results of the XGBoost model. Figure 4 shows the impact of features on AV-involved crashes. The ordinate lists the features sorted by importance, and the abscissa is the Shapley value. Each point in the figure represents a sample. Color represents the magnitude of the feature value.

As expected, the greater the degree of vehicle damage is, the more likely the accident is to be severe. The interpretation of the model results suggests that weather is another key feature. In particular, injury accidents are more likely to occur in low-visibility conditions, such as fog and snow [36]. This is probably due to the sensors’ worse perception performance in extreme weather [37]. According to Hasirlioglu et al. [38], the reflector can only detect a short distance in foggy weather, and crashes are more likely to occur in this situation.

The next most important features are accident location and driving mode. The possibility of a crash is higher at intersections [39]. This is probably due to the complex and changeable traffic environment at intersections, where vehicles, nonmotor vehicles, and pedestrians are highly mixed [40, 41]. According to crash reports, AVs usually switch to the conventional driving mode when they arrive at intersections because intelligent transportation facilities are not yet mature enough. Such infrastructure can increase vehicle stability during driving and improve the safety of all traffic participants [42]. In terms of driving mode, Figure 4 illustrates that, surprisingly, the automatic driving mode increases the risk of injury. This is probably because the driver diverts attention to secondary tasks (such as using a mobile phone) in the automatic driving mode, so an injury is more likely in an accident.

6. Conclusion

The objective of this study is to explore the mechanism of AV-involved crashes and analyze the impact of each feature on crash severity. We employ 131 accident reports involving AVs received in California from 2019 to October 2020. We use the Apriori algorithm to mine the relationships among multiple factors and thereby explore the mechanism of crashes. Given the imbalanced crash severity distribution, we apply SMOTE to balance the dataset. Three classification models are compared: XGBoost, CART, and SVM. The results show that the XGBoost model can better recognize injury crashes involving AVs. We apply SHAP (Shapley Additive Explanations) to interpret how a specific variable influences the model classification results. Both XGBoost and the Apriori algorithm provided meaningful insights into AV-involved crash characteristics and their relationships.

For the analysis of crash mechanisms involving AVs, we use the Apriori algorithm to mine association rules for uninjured and injury crashes, respectively. Among the top ten strong association rules for uninjured crashes, we find that most rear-end collisions involve conventional vehicles bumping into the rear of AVs. This is probably because the AVs have stopped before the collision, while the conventional vehicles are still moving forward (“PMAV = 0 + TOC = 2”⟶ “PMCV = 1”). Among the top ten strong association rules for injury crashes, we find that the AVs were still moving before the collision on a cloudy day. This could be caused by the detector failing to find anomalies in time when there is insufficient light (“L = 2 + PMAV = 1 + W = 2”⟶ “CS = 1”). In addition, the driver may divert attention to secondary tasks in the automatic driving mode, so an injury is more likely in an accident (“DM = 1 + TOC = 2”⟶ “CS = 1”). For the crash severity analysis, XGBoost generates the best result (overall accuracy = 64.10%, G-mean = 67.82%, and recall = 75%). To make the results of the XGBoost model more informative, we apply SHAP to analyze the impact of each feature on crash severity. Among all these features, vehicle damage, weather conditions, accident location, and driving mode are the most critical. The greater the degree of vehicle damage is, the more likely the accident is to be severe. Injury accidents are more likely to occur in low-visibility conditions (such as fog and snow). Intersections are more prone to injury accidents, especially in the automatic driving mode. This study may provide some help in reducing the severity of AV-involved crashes. For example, autonomous vehicle drivers should be extremely cautious when driving in low-visibility conditions (e.g., fog and snow). They should be more careful when driving near intersections, especially in the autonomous driving mode. It is recommended to use vehicle sensors with strong stability and high sensitivity.

However, the current study has certain limitations. Firstly, this study’s sample size and variables could be extended to increase the model result’s reliability. In the future, we will collect driver characteristics, traffic flow information before the crash, and vehicle speed to understand the mechanism of AV-involved crashes. Secondly, this study uses only crash reports received in California for modeling. Future research should continue to collect accident data from other countries and regions because driving habits and traffic laws in different countries and regions may be completely different.

Data Availability

The data used to support the findings of this study are publicly available at https://www.dmv.ca.gov/portal/dmv/detail/vr/autonomous/autonomousveh_ol316.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the 111 Project of Sustainable Transportation for Urban Agglomeration in Western China (no. B20035).