#### Abstract

With the development of freeway system informatization, freeway traffic flow data have become easier to obtain and are widely used to study the relationship between traffic flow state and traffic safety. However, because freeway systems are developed to different degrees in different regions, the traffic data collected in some regions are insufficient in sample size and relatively low in precision. To study the influence of limited data on real-time freeway traffic crash risk modeling, three data sets were considered: high precision data, small sample data, and low precision data. Firstly, Bayesian Logistic regression was used to identify risk factors and predict crash risk on the three data sets. Secondly, based on the Bayesian updating method, model transfer tests between the high and low precision data sets were carried out. Finally, the applicability of machine learning and statistical methods to the low precision data set was compared. The results show that the prediction performance of Bayesian Logistic regression improves as the sample size increases. Bayesian Logistic regression can identify significant risk factors in data sets of different precision. By comparison, the support vector machine predicts better than Bayesian Logistic regression. In addition, the Bayesian updating method can improve the prediction performance of the transplanted model.

#### 1. Introduction

In recent years, the potential safety hazards of new energy vehicles have gradually attracted attention, especially accidents involving pure electric vehicles [1, 2]. As new energy vehicles and shared vehicles enter the freeway, they also face the risk of traffic collisions. As one of the important subsystems of the road system, the freeway greatly facilitates travel and improves the efficiency of goods transportation. At the same time, because of the large traffic volumes and high vehicle speeds on freeways, relatively serious traffic crashes are prone to occur, causing great harm to life and property. Freeway traffic crashes have become a problem that cannot be ignored [3].

Many scholars have carried out extensive research on traffic safety. Some analyzed the internal relationship between crash-causing factors and the distribution of crashes based on historical traffic accident data and then put forward corresponding countermeasures. For example, considering differences in time, Yuan et al. adopted an improved association rule mining algorithm to analyze the associations among the influencing factors of freeway traffic crashes, which uncovered hidden association rules and improved the accuracy of the algorithm [4]. Tian et al. analyzed the temporal and spatial distribution characteristics of freeway crashes in mountainous areas based on historical crash data, identified the significant influencing factors, and proposed corresponding improvement strategies [5]. Other scholars have analyzed the main influencing factors of accidents in specific accident scenarios. Mergia et al. analyzed crashes at junctions and pointed out that drunk driving and speeding increased the severity of crashes in diverging areas, bad weather increased the severity of crashes in merging areas, and adverse alignment conditions would increase the severity of crashes in diverging areas [6]. Xin et al. studied unobserved factors and showed that the effects of driving behaviors, environmental characteristics, and other factors on the crash rate differ significantly across severity levels [7]. Haghighi et al. studied the impact of road design features on crashes and found that 10-foot-wide lanes and narrower shoulders were significantly associated with crash severity, whereas increased vehicle density and guardrail length could reduce crash severity [8]. To study the influence of drivers’ ages on crash severity, Osman et al. constructed a generalized ordered response probit model to reduce the interference of heterogeneity and found that each variable had a different influence on crashes in different age groups [9]. Xu et al. introduced the Bayesian spatial random coefficient model to consider the heterogeneity of spatial structure and unstructured data when studying the spatial variation of the crash rate and its causal factors, which improved the model fit and verified that spatial structure heterogeneity biases parameter estimation [10]. Wang et al. explored the influence of risk factors on urban traffic crash frequency while considering both the spatial and temporal correlation/heterogeneity of traffic crashes. A linear regression model, a spatial lag model (SLM), a spatial error model (SEM), and a time-fixed effects error model (T-FEEM) were established and compared [11]. To identify the factors related to crash risk in different regional types and their inner relations, Yang et al. took three sections of highway (in downtown, suburban, and mountainous areas of Washington State, USA) as the research object and, using an AHP-improved Apriori association rule mining algorithm, identified the factors influencing crash risk and their complex association rules [12]. Li et al. investigated the possibility of using support vector machine (SVM) models for crash injury severity analysis and compared the performance of the SVM model with that of the ordered probit (OP) model. The SVM model was found to predict crash injury severity better than the OP model [13]. In addition, Logit and Tobit models are also widely used in traffic crash analysis [14–22].

With the development of freeway informatization and the improvement of dynamic traffic management, real-time crash risk models have been widely studied [23–26]. Based on loop detector data and crash data collected by the Shanghai expressway system, Sun et al. established a Bayesian network (BN) model to analyze real-time traffic flow parameters and expressway crash risk [27]. You et al. established a support vector machine model to analyze highway traffic flow data for rear-end crashes. The results showed that the SVM classifier offers high practical value and reliability for real-time crash prediction based on traffic flow data from a single volume detector [28]. Using American freeway data, Xu et al. established a Logistic crash risk prediction model based on traffic flow data and meteorological data. The results showed that weather conditions have a significant impact on crash risk [29]. Ma et al. established a crash risk assessment and analysis model using highway crash data and real-time traffic flow data. The significant variables were selected by a random forest algorithm, a support vector machine model was established, and the evaluation ability of models with different kernel functions was compared. The results showed that the model could effectively evaluate road crash risk based on real-time traffic flow [30, 31].

The traditional “postevent” traffic safety analysis can identify the main factors influencing crash occurrence, but it struggles to reflect the influence of dynamic traffic flow characteristics on crash risk. At present, most research that uses traffic flow data to establish real-time crash risk models relies on existing data for modeling and analysis. However, regions differ in their level of development, so the collection of traffic flow and traffic crash data differs as well. Under such limited data conditions, do traffic flow variables still have a significant impact on the occurrence of traffic crashes? If the impact is small, can certain technical means be used to improve the accuracy of the corresponding model?

In order to study the above questions, based on the basic data necessary for the real-time crash risk model, this paper constructs three types of data sets: (1) high precision data set; (2) small sample data set; (3) low precision data set. On the basis of the above three types of data sets, statistical and machine learning methods are used to study the classification and prediction performance of the model under different data sets, and the applicability of the two methods is further analyzed.

From the perspective of data, this paper uses statistical Logistic regression, Bayesian theory, and support vector machine to simulate the impact of different types of data on real-time crash risk modeling. Furthermore, the applicability of different methods under different data types is compared. The conclusions of this paper can be used as a reference for the subsequent practice and research of highway traffic safety.

#### 2. Data Description

##### 2.1. The Data Source

This paper selects the traffic flow and traffic crash data of milepost 100–132 of I-5 in Washington State in 2016. Figure 1 describes the main freeway section in the study area.

In 2016, a total of 332 traffic crashes occurred in this freeway section. In the selected area, 152 groups of loop detectors are arranged bidirectional, and the average distance between adjacent loops is about 0.7 km. Each loop detector collects average speed, occupancy, and traffic volume in each lane over a 20-second period.

##### 2.2. Variable

Existing studies have analyzed traffic flow data aggregated into 5-minute intervals and obtained good results [8–13]. Therefore, this paper also aggregates traffic flow data into 5-minute intervals, mainly covering the volume, speed, and occupancy of each lane. The period 5–10 minutes before the crash is selected, and two groups of loop detectors, upstream and downstream, are taken into account, as shown in Figure 2.
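The aggregation step can be sketched as follows. This is a minimal illustration, not the authors' processing code: the record layout `(t_seconds, volume, speed, occupancy)`, the function names, and the choice to sum volume while averaging speed and occupancy are all assumptions made for the example.

```python
from collections import defaultdict


def aggregate_5min(records, window_s=300):
    """Group 20-second loop readings of one lane into 5-minute windows.

    records: iterable of (t_seconds, volume, speed, occupancy).
    Assumption: volume is summed; speed and occupancy are plain averages.
    """
    buckets = defaultdict(list)
    for t, volume, speed, occupancy in records:
        buckets[(t // window_s) * window_s].append((volume, speed, occupancy))
    aggregated = {}
    for start, rows in sorted(buckets.items()):
        n = len(rows)
        aggregated[start] = {
            "volume": sum(r[0] for r in rows),
            "speed": sum(r[1] for r in rows) / n,
            "occupancy": sum(r[2] for r in rows) / n,
        }
    return aggregated


def precrash_window(crash_t, lead_s=300, length_s=300):
    """Return (start, end) of the interval 5-10 minutes before a crash time."""
    return crash_t - lead_s - length_s, crash_t - lead_s


# Synthetic example: 30 readings, 20 s apart, covering two 5-minute windows.
records = [(20 * k, 2, 60.0 + (k % 2), 0.10) for k in range(30)]
agg = aggregate_5min(records)
window = precrash_window(1200)
```

For a crash recorded at t = 1200 s, `precrash_window` selects the interval from 600 s to 900 s, which corresponds to the 5–10 minute lead described in the text.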

In addition to the basic data detected by the loop detectors, such as the volume, speed, and occupancy of the upstream and downstream loops, this paper constructs combined traffic flow variables as follows [2]. Because differences in volume, speed, and occupancy between the upstream and downstream loops may lead to vehicle crashes, the absolute values of these upstream-downstream differences are constructed. At the same time, lateral crashes between lanes are also one of the main forms of crash, so the average differences in volume, speed, and occupancy between adjacent lanes are constructed to describe lateral collisions. The specific meanings are shown in Table 1.
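A small sketch of how such combined variables might be computed is given below. The variable names follow the pattern of those that appear later in the text (for example *up_s*, *abs_dif_o*, *up_dif_s*), but Table 1 is not reproduced here, so the exact definitions are assumptions: lane values are averaged per loop, and the adjacent-lane difference is taken as a mean absolute difference.

```python
def mean_adjacent_diff(lane_values):
    """Mean absolute difference between neighbouring lanes (assumed definition)."""
    pairs = list(zip(lane_values, lane_values[1:]))
    if not pairs:
        return 0.0
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def build_features(up, down):
    """up/down: dicts with keys 'v' (volume), 's' (speed), 'o' (occupancy),
    each holding a list of per-lane 5-minute values for one loop group."""
    def mean(xs):
        return sum(xs) / len(xs)

    feats = {}
    for key in ("v", "s", "o"):
        feats[f"up_{key}"] = mean(up[key])
        feats[f"down_{key}"] = mean(down[key])
        # Longitudinal contrast between upstream and downstream loops.
        feats[f"abs_dif_{key}"] = abs(feats[f"up_{key}"] - feats[f"down_{key}"])
        # Lateral contrast between adjacent lanes at each loop group.
        feats[f"up_dif_{key}"] = mean_adjacent_diff(up[key])
        feats[f"down_dif_{key}"] = mean_adjacent_diff(down[key])
    return feats


up = {"v": [30, 34, 28], "s": [62.0, 58.0, 55.0], "o": [0.10, 0.12, 0.15]}
down = {"v": [29, 30, 31], "s": [50.0, 52.0, 51.0], "o": [0.20, 0.18, 0.22]}
feats = build_features(up, down)
```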

##### 2.3. Sample Structure Design

###### 2.3.1. High Precision Data Set Sample

The high precision data set refers to the traffic flow data and traffic crash data collected by the American freeway system, which serve as the standard. Freeway loop detectors in the United States are laid at a high density, and traffic crash information is collected completely.

In this paper, the paired sampling method is adopted to match control samples: for each crash record, noncrash data under the same conditions are extracted at a crash-to-noncrash ratio of 1 : 4 [2]. After data preprocessing, 191 traffic crash records and 764 noncrash records are obtained. The high precision data set sample is shown in Figure 3.
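The 1 : 4 matching can be illustrated with the sketch below. The text does not spell out what "the same conditions" means, so the matching keys used here (same loop, same time-of-day slot, different day) and the field names `loop`, `slot`, and `day` are hypothetical.

```python
import random


def matched_controls(crash, candidates, ratio=4, seed=0):
    """Draw `ratio` noncrash observations matched to one crash record.

    Assumed matching rule: same loop and time-of-day slot, different day.
    """
    pool = [
        c for c in candidates
        if c["loop"] == crash["loop"]
        and c["slot"] == crash["slot"]
        and c["day"] != crash["day"]
    ]
    if len(pool) < ratio:
        raise ValueError("not enough matching noncrash observations")
    return random.Random(seed).sample(pool, ratio)


crash = {"loop": "L17", "slot": 210, "day": 42}
candidates = [{"loop": "L17", "slot": 210, "day": d} for d in range(30, 60)]
candidates += [{"loop": "L18", "slot": 210, "day": d} for d in range(30, 60)]
controls = matched_controls(crash, candidates)
```

With four controls per crash, 191 crash records give 191 × 4 = 764 noncrash records, which agrees with the counts reported above.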

###### 2.3.2. Small Sample Data Set

In order to study the influence of data sample size on the real-time crash risk model constructed, it is necessary to obtain small sample data with different sample sizes. The main ideas of constructing small sample data set in this paper are as follows: obtain and match the high precision data set, and extract data from the high precision data set in a proportion of 5%, 10%, 20%, 30%, and 50%, so as to construct small sample data set of different proportions. The small sample data set is shown in Figure 4.
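One way to draw such subsets is sketched below. It assumes that sampling is done on crash records and that each sampled crash keeps its four matched controls, so the 1 : 4 ratio is preserved; the text does not state this, and the identifiers are hypothetical.

```python
import random


def draw_subsets(crash_ids, controls_by_crash,
                 proportions=(0.05, 0.10, 0.20, 0.30, 0.50), seed=0):
    """Sample crash records at each proportion, keeping their matched controls."""
    rng = random.Random(seed)
    subsets = {}
    for p in proportions:
        k = max(1, round(len(crash_ids) * p))
        chosen = rng.sample(crash_ids, k)
        subsets[p] = {cid: controls_by_crash[cid] for cid in chosen}
    return subsets


# Synthetic stand-in for the 191 crashes with 4 controls each.
crash_ids = list(range(191))
controls_by_crash = {cid: [f"c{cid}_{j}" for j in range(4)] for cid in crash_ids}
subsets = draw_subsets(crash_ids, controls_by_crash)
sizes = {p: len(s) for p, s in subsets.items()}
```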

###### 2.3.3. Low Precision Data Set Sample

Low precision data set refers to the data set constructed from the data collected by the detectors with lower density compared to the US freeway system. Considering that the data is difficult to obtain, this paper constructs a low precision data set through certain manual processing methods. Compared with the freeway system in the United States, many freeways in China do not have complete loop detection devices, and the distance between the detectors is relatively long. In this paper, the average distance between detectors of a certain section of freeway in China is taken as a reference, and the freeway data of the US is used to construct a low precision data set. The low precision data set sample is shown in Figure 5.

The main processing ideas are shown in Figure 6. Some loops are manually deleted from the loop file so that the average distance between the remaining loops is approximately equal to the reference value. The processed loop file is then matched against the data set to obtain the low precision data set. After screening, a total of 32 bidirectional loops remain. The low precision data set is screened by the paired sampling method, yielding 161 crash records and 644 noncrash records.
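The thinning of detectors was done manually in the paper. As a rough illustration only, a greedy rule that keeps a loop when it lies at least a target distance beyond the last kept loop produces a similar effect. The target spacing of 2.0 km below is a made-up value, since the reference spacing is not stated in the text.

```python
def thin_detectors(positions_km, target_spacing_km):
    """Greedy thinning: keep a loop only if it is far enough from the last kept one."""
    ordered = sorted(positions_km)
    kept = [ordered[0]]
    for pos in ordered[1:]:
        if pos - kept[-1] >= target_spacing_km:
            kept.append(pos)
    return kept


def mean_spacing(positions_km):
    """Average gap between consecutive loops."""
    ordered = sorted(positions_km)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Synthetic loops every 0.7 km, thinned toward a hypothetical 2.0 km target.
positions = [0.7 * k for k in range(20)]
kept = thin_detectors(positions, target_spacing_km=2.0)
```

In this synthetic case every third loop is retained, so the mean spacing rises from 0.7 km to about 2.1 km.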

#### 3. Real-Time Crash Risk Prediction Model

##### 3.1. Bayesian Logistic Regression

Logistic regression is a generalized linear model commonly used in statistical analysis. Based on the binomial Logistic regression model, this paper establishes a crash risk model relating freeway crashes to real-time traffic flow [23, 28]. The crash probability corresponding to a given observation in the research data set is

$$p_i = P(y_i = 1 \mid x_i) = \frac{e^{g(x_i)}}{1 + e^{g(x_i)}}, \tag{1}$$

where $i$ denotes the $i$th observation; $p_i$ represents the probability of crash occurrence; and $g(x_i)$ represents a linear combination of the explanatory variables and their coefficients.
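The logistic link described above can be written as a short function. The coefficient values in the example are arbitrary and serve only to show the direction of the effect; they are not estimates from the paper.

```python
import math


def crash_probability(x, beta, intercept):
    """Logistic crash probability for one observation.

    x: explanatory variables, beta: their coefficients.
    The two branches avoid overflow in exp() for large |g|.
    """
    g = intercept + sum(b * xi for b, xi in zip(beta, x))
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)


# With a negative speed coefficient, lower upstream speed gives higher risk.
p_slow = crash_probability([40.0], [-0.05], 1.0)
p_fast = crash_probability([60.0], [-0.05], 1.0)
```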

The Bayesian method is used to estimate the coefficients of the Logistic regression model. It treats all unknown parameters in the model as random variables. Before establishing the Bayesian model $M$, a prior probability distribution must be set for every parameter $\theta$ in the model, representing what is known about the parameter before the training data $Y$ are obtained. After obtaining the training data $Y$, the Bayesian statistical model makes statistical inference on $\theta$ through the posterior probability distribution. According to Bayes' theorem, the posterior probability distribution of parameter $\theta$ in model $M$ can be expressed as follows:

$$P(\theta \mid Y, M) = \frac{P(Y, \theta \mid M)}{P(Y \mid M)} = \frac{P(Y \mid \theta, M)\, P(\theta \mid M)}{P(Y \mid M)}. \tag{2}$$

Formula (2) shows that the posterior probability distribution of parameter $\theta$ takes into account both the information contained in the training data *Y* and the known information about $\theta$. $P(\theta \mid Y, M)$ is the posterior distribution of parameter $\theta$ in model *M* given the training data *Y*. $P(Y, \theta \mid M)$ is the joint probability distribution of *Y* and $\theta$ in model *M*. $P(Y \mid M)$ represents the marginal probability distribution, that is, the probability distribution of the training data *Y* under model *M*. $P(\theta \mid M)$ represents the prior probability distribution of parameter $\theta$ in model *M* before obtaining the training data *Y*. $P(Y \mid \theta, M)$ is the likelihood function of model *M*.
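To make the estimation step concrete, the sketch below samples the posterior of a two-coefficient Logistic model with a random-walk Metropolis algorithm. The paper does not state which sampler or software was used, so the sampler, the normal priors, the step size, and the synthetic data are all assumptions for illustration.

```python
import math
import random


def log_posterior(beta, X, y, prior_mean, prior_sd):
    """Unnormalized log posterior: independent normal priors plus Bernoulli likelihood."""
    lp = sum(-0.5 * ((b - m) / s) ** 2 for b, m, s in zip(beta, prior_mean, prior_sd))
    for xi, yi in zip(X, y):
        g = sum(b * v for b, v in zip(beta, xi))
        # Log-likelihood of one observation, y*g - log(1 + e^g), computed stably.
        lp += yi * g - (max(g, 0.0) + math.log1p(math.exp(-abs(g))))
    return lp


def metropolis(X, y, prior_mean, prior_sd, n_iter=4000, step=0.3, seed=0):
    """Random-walk Metropolis sampler; the first half of the chain is discarded."""
    rng = random.Random(seed)
    beta = list(prior_mean)
    current = log_posterior(beta, X, y, prior_mean, prior_sd)
    draws = []
    for _ in range(n_iter):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, X, y, prior_mean, prior_sd)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
        draws.append(list(beta))
    return draws[n_iter // 2:]


def summarize(draws, k):
    """Posterior mean and central 95% interval of coefficient k."""
    values = sorted(d[k] for d in draws)
    n = len(values)
    return sum(values) / n, values[int(0.025 * n)], values[int(0.975 * n)]


# Synthetic data: intercept plus one standardized speed-like variable
# whose true coefficient is negative.
data_rng = random.Random(1)
X, y = [], []
for _ in range(200):
    speed = data_rng.uniform(-2.0, 2.0)
    p = 1.0 / (1.0 + math.exp(-(-1.0 - 1.5 * speed)))
    X.append((1.0, speed))
    y.append(1 if data_rng.random() < p else 0)

draws = metropolis(X, y, prior_mean=[0.0, 0.0], prior_sd=[10.0, 10.0])
mean_slope, lower, upper = summarize(draws, 1)
```

The interval returned by `summarize` plays the role of the 95% interval used to judge whether a coefficient is significant: a coefficient is treated as significant when the interval excludes 0.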

##### 3.2. Bayesian Updating Method

Based on the Bayesian updating method, the Bayesian Logistic regression model is established to transfer the real-time freeway crash risk model between data sets [32]. Because the Bayesian method yields the posterior probability distribution of each parameter in the model, the prior probability distribution can be reset during model transplantation.

That is, when the low precision data set is used to establish a real-time crash risk model, the Bayesian method can be used to obtain the posterior probability distribution of each risk factor. When the Logistic regression model is then established on the high precision data set and transplanted to the low precision data set, the posterior probability distribution of the risk factors in the previous model can be used as the prior probability distribution of the risk factors in the new model, as shown in the following formula (conditioning on the model $M$ is omitted for brevity):

$$P(\theta \mid Y_1, Y_2) \propto P(Y_2 \mid \theta)\, P(\theta \mid Y_1) \propto P(Y_2 \mid \theta)\, P(Y_1 \mid \theta)\, P(\theta). \tag{3}$$

A schematic diagram of the Bayesian updating method is shown in Figure 7.

$P(\theta \mid Y_1, Y_2)$ is the posterior distribution of parameter $\theta$ given data sets $Y_1$ and $Y_2$; $P(\theta)$ is the prior probability distribution of parameter $\theta$; $P(Y_1 \mid \theta)$ and $P(Y_2 \mid \theta)$ are the likelihood functions of data sets $Y_1$ and $Y_2$ given parameter $\theta$; and $P(\theta \mid Y_1)$ is the posterior distribution of parameter $\theta$ given data set $Y_1$.
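The updating idea can be checked numerically on a toy problem. The sketch below uses a one-coefficient Logistic model on a parameter grid and shows that updating on a first data set and then on a second gives the same posterior as using both data sets at once, provided the two data sets are conditionally independent given the parameter. The grid, the prior, and the two tiny data sets are invented for the example.

```python
import math


def log_lik(theta, xs, ys):
    """Bernoulli log-likelihood of a one-coefficient Logistic model (no intercept)."""
    total = 0.0
    for x, y in zip(xs, ys):
        g = theta * x
        total += y * g - (max(g, 0.0) + math.log1p(math.exp(-abs(g))))
    return total


def normalize(weights):
    s = sum(weights)
    return [w / s for w in weights]


grid = [-4.0 + 0.02 * k for k in range(401)]
prior = normalize([math.exp(-0.5 * (t / 2.0) ** 2) for t in grid])

Y1 = ([-1.0, -0.5, 0.5, 1.0, 1.5], [1, 1, 0, 0, 0])  # first data set
Y2 = ([-1.5, -0.2, 0.3, 0.8], [1, 0, 1, 0])          # second data set

# Step 1: posterior from the first data set.
post1 = normalize([p * math.exp(log_lik(t, *Y1)) for p, t in zip(prior, grid)])
# Step 2: reuse that posterior as the prior for the second data set.
sequential = normalize([p * math.exp(log_lik(t, *Y2)) for p, t in zip(post1, grid)])
# Reference: both data sets in a single update.
batch = normalize(
    [p * math.exp(log_lik(t, *Y1) + log_lik(t, *Y2)) for p, t in zip(prior, grid)]
)
max_gap = max(abs(a - b) for a, b in zip(sequential, batch))
```

In practice a full posterior is rarely carried over on a grid; a common shortcut, which may or may not match what the authors did, is to summarize the first posterior by its mean and standard deviation and use those as the parameters of a normal prior in the second model.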

##### 3.3. Support Vector Machine

Support vector machine (SVM) is a machine learning classification algorithm based on statistical learning theory [26, 27]. It can obtain the optimal solution from the available information and handles small or limited samples well. In the sample space, the linear SVM separates the labeled data set with the hyperplane $w^{T}x + b = 0$, where $w$ is the normal vector and $b$ is the displacement term.

The distance between any point *x* in the sample space and the hyperplane can be written as

$$r = \frac{\lvert w^{T}x + b \rvert}{\lVert w \rVert}. \tag{4}$$

For a labeled sample data set $\{(x_i, y_i)\}$, $y_i = +1$ denotes accident data and $y_i = -1$ denotes nonaccident data. If the hyperplane classifies the samples correctly, then

$$\begin{cases} w^{T}x_i + b \ge +1, & y_i = +1, \\ w^{T}x_i + b \le -1, & y_i = -1. \end{cases} \tag{5}$$
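The point-to-hyperplane distance and the correct-classification condition can be sketched in a few lines. The weight vector, offset, and sample points below are arbitrary illustrative values.

```python
import math


def score(w, b, x):
    """Value of w^T x + b for one point."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Distance from x to the hyperplane w^T x + b = 0."""
    return abs(score(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))


def satisfies_margin(w, b, samples):
    """True if every (x, y) with y in {+1, -1} meets y * (w^T x + b) >= 1."""
    return all(y * score(w, b, x) >= 1 for x, y in samples)


w, b = [3.0, 4.0], -5.0
samples = [([3.0, 4.0], +1), ([0.0, 0.0], -1)]
d = distance_to_hyperplane(w, b, [3.0, 4.0])
ok = satisfies_margin(w, b, samples)
```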

The sample data may not be linearly separable in the original space. In this case, the SVM raises the dimension of the sample data, converting the linearly inseparable data in the low-dimensional space into linearly separable data in a high-dimensional space, and then uses the linear SVM to find the optimal classification surface in that high-dimensional space.
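A standard toy case shows why raising the dimension helps. The four XOR-style points below cannot be separated by a line in two dimensions, but after adding the product of the two coordinates as a third feature, a single hyperplane separates them. The feature map `lift` is chosen by hand for this example; an SVM with a kernel function achieves a comparable mapping implicitly.

```python
def lift(x):
    """Map a 2-D point to 3-D by appending the product of its coordinates."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# XOR-style labels: +1 when both coordinates share a sign, -1 otherwise.
samples = [((1, 1), +1), ((-1, -1), +1), ((1, -1), -1), ((-1, 1), -1)]

# In the lifted space the hyperplane z3 = 0 separates the two classes.
w, b = (0.0, 0.0, 1.0), 0.0
separated = all(y * score(w, b, lift(x)) >= 1 for x, y in samples)
```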

##### 3.4. The Evaluation Index

In a data classification model, accuracy gives an intuitive view of the overall classification performance of the model. It is the proportion of correctly classified samples among all samples, as shown in the following formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{6}$$

where $TP$ represents the number of positive samples predicted as positive; $TN$ represents the number of negative samples predicted as negative; $FP$ represents the number of negative samples predicted as positive; and $FN$ represents the number of positive samples predicted as negative.

The confusion matrix shown in Table 2 directly displays the classification results of the model and is used to calculate the corresponding true positive rate (TPR) and false-positive rate (FPR).

The receiver operating characteristic (ROC) curve is drawn from TPR and FPR. It traces the TPR against the FPR of the model as the probability threshold varies. The area under the ROC curve (AUC) is calculated to measure the quality of the model: the closer the AUC value is to 1, the better the performance of the model.
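These evaluation indexes can be computed directly from labels and predicted probabilities. The sketch below uses the usual definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN), and computes AUC as the fraction of crash/noncrash pairs in which the crash case receives the higher score (ties count one half). The labels and scores are made-up values.

```python
def confusion(y_true, y_score, threshold=0.5):
    """Return (TP, FP, TN, FN) at a given probability threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (accuracy, TPR, FPR)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return accuracy, tp / (tp + fn), fp / (fp + tn)


def auc(y_true, y_score):
    """AUC as the probability that a positive case outranks a negative case."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.35, 0.2, 0.6, 0.1, 0.7, 0.3]
tp, fp, tn, fn = confusion(y_true, y_score)
accuracy, tpr, fpr = rates(tp, fp, tn, fn)
area = auc(y_true, y_score)
```

In this example the accuracy is 0.75 while the AUC is 14/15, which illustrates that the two indexes measure different things: accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds.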

#### 4. Results and Discussion

##### 4.1. Analysis of the Results from Small Sample Data Sets

In order to study the influence of different sample size data sets on the established crash risk model, Bayesian Logistic regression is used to establish models for the extracted 5%, 10%, 20%, 30%, and 50% high precision data sets, respectively. Table 3 shows the significant risk factors screened by Logistic stepwise regression for each data set.

By comparing the significant risk factors of the models built on data sets of different sample sizes, it is found that the sample size does affect the real-time crash risk model. The models not only share common features (i.e., the same impact factors) but also have their own characteristics. This provides a basis for the subsequent analysis.

As can be seen from Table 3, the speed of the upstream loop (*up_s*) is a significant variable in every data set. This shows that, across the data sets, upstream speed is a significant factor affecting the occurrence of crashes and plays an important role in explaining their causes. At the same time, the other risk factors differ among the data sets: some are significant in one small sample but not in others. This shows that each small sample data set has its own characteristics, which lead to differences in the resulting real-time crash risk model.

Figure 8 is the ROC curve and AUC value diagram of the real-time crash risk model established by the Bayesian Logistic regression method with different small sample data sets.

As can be seen from Figure 8, the AUC value of the real-time crash risk model established by Bayesian Logistic regression does not increase with the sample size of the data set but fluctuates, with an overall decreasing trend.

The traffic flow state differs from one collision to another, so the structure of a data set changes with its sample size. The significant risk factors screened by the Bayesian Logistic model for each data set are the optimal combination for that data set and mainly account for its predictive classification performance. Different data sets therefore yield different collision precursors. As the sample size increases, the data set structure becomes more complex, and under the given combination of significant risk factors the relationship with crash probability becomes harder to capture, so the AUC of the model decreases.

As the sample size increases, so does the number of traffic crashes and, with it, the diversity of traffic flow states associated with crashes. The screening results show that the combination of risk factors changes with sample size, comprising both common significant factors (such as *up_s*) and factors unique to each data set. The diverse traffic flow states make the data structure more varied, which may reduce the accuracy of models built on different data sets. Although the AUC value of the overall model decreases, it remains around 0.7 or slightly higher, while the larger data volume means that the number of correctly classified traffic crashes also increases. This indicates that, even though the sample structure becomes more diverse, the underlying regularities in the data can still be extracted as the sample size grows. In this sense, the classification performance of the model improves, and increasing the sample size can improve the prediction performance of the real-time crash risk model.

##### 4.2. Reliability Verification of Model Transferability

###### 4.2.1. Low Precision Data Set and High Precision Data Set Risk Model Comparison

Stepwise Logistic regression is used to screen the factors that had a significant impact on the risk model, and the Bayesian method is used to estimate the coefficients of the model. Table 4 shows the model comparison after the coefficient estimation of significant risk factors.

As can be seen from Table 4, the low precision and high precision data sets share some significant explanatory variables, such as *abs_dif_o* and *up_s*. These two variables indicate that, in both data sets, the speed of the upstream loop and the absolute value of the occupancy difference between the upstream and downstream loops can effectively explain the causes of crashes. The two data sets differ in that *up_dif_o* has a significant explanatory effect in the low precision model, while *down_dif_v* and *up_dif_s* have a stronger explanatory effect in the high precision model. For every coefficient, the estimated 95% confidence interval does not contain 0, indicating that the coefficient estimates are significant. The coefficient of the average upstream speed (*up_s*) is negative in both models, indicating that, within the permitted speed range of expressways, a one-unit decrease in average upstream speed leads to an increase in crash risk.

Each piece of data is classified, and the accuracy of the model established by the two sets of data sets is shown in Table 5.

At the same time, the confusion matrix of two data sets for model classification and prediction is shown in Tables 6 and 7.

The ROC curve and AUC values of the model established on the basis of the two data sets are shown in Figure 9.

The comparison of the above indicators shows that the Bayesian Logistic regression model built on the data set with relatively sparse loop density has a classification accuracy of 70.68%, slightly lower than the 73.30% obtained on the high precision data set with dense loops. However, the AUC value of the low precision model is 0.656, much smaller than that of the high precision model. The reasons are as follows: the low precision data set contains fewer explanatory factors and loses data information, which affects model accuracy to some extent. In contrast, when the loop density is higher, more traffic flow information is collected and the traffic variables that significantly affect crashes can be screened out, making the model more accurate.

###### 4.2.2. Application of the High Precision Data Set-Based Model to the Low Precision Data Set

Applying the model of high precision data set to low precision data set, the classification accuracy is 69.3%, and the confusion matrix is shown in Table 8.

The ROC curve and AUC value obtained are shown in Figure 10.

When the model established on the high precision data set is applied directly to the low precision data set, the classification results are worse than those of the model built on the low precision data set itself. When Logistic regression is used to screen variables, the optimal variable combinations of the two data sets differ, and the parameters estimated for each model are the best fit for the best variables of its own data set. The model may therefore be unsuitable for other data sets. Direct transplantation of the model thus cannot achieve good classification and prediction performance.

###### 4.2.3. Bayesian Updating towards High Precision Data Set Model and Low Precision Data Set Model

(a) The Bayesian updating method was used to update and transplant the model established on the original high precision data set. The posterior distribution of the parameter estimates of the low precision data set model was taken as the prior distribution for the parameter estimates of the high precision data set model, which was then updated. The results are shown in Table 9.

The classification accuracy of the updated high precision data model is 68.94%, and the confusion matrix is shown in Table 10. ROC curve and AUC value obtained are shown in Figure 11.

(b) The Bayesian updating method was used to update and transplant the model established on the original low precision data set. The posterior distribution of the parameter estimates of the high precision data set model was taken as the prior distribution for the parameter estimates of the low precision data set model, which was then updated. The results are shown in Table 11.

The classification accuracy of the updated low precision data set model is 70.83%, and the confusion matrix is shown in Table 12. ROC curve and AUC value obtained are shown in Figure 12.

Updating the model established on the high precision data set does not effectively improve its prediction accuracy on the low precision data set. The prediction accuracy is 69.3% before the update and 68.94% after it, a decrease of 0.36 percentage points, and the AUC value decreases by 0.002.

For the model established on the low precision data set, the prediction accuracy on the low precision data set is 70.68% before the update and 70.83% after it, an improvement of 0.15 percentage points. In addition, the AUC value increases from 0.656 to 0.657.

The model built directly on the low precision data set has a classification accuracy of 70.68%, which serves as the baseline for judging the transplantation results. The directly transplanted high precision model (69.3%) falls short of this baseline, whereas the updated low precision model (70.83%) exceeds it. The improvement is small, only 0.15 percentage points. However, in the field of traffic safety, even a modest gain in accuracy has practical significance. In follow-up research, better models or methods can be proposed to improve the results of model transplantation. The above shows that the Bayesian updating method can improve the model transplantation effect to a certain extent, indicating that the method can indeed be used for model transplantation, although the overall effect is limited. The reason for the limited improvement may be that the most significant explanatory factors have already been screened out during the stepwise regression, and the difference between the parameters estimated by the Bayesian method and those estimated by maximum likelihood is small. At this point, the parameters of the model already form a well-fitting combination of values. In the process of Bayesian updating, the prior information therefore has little influence, so the overall improvement of the model is small.

In addition, it is necessary to determine the updated model object. Through the above study and comparison, it can be concluded that updating models with low precision data sets will obtain models with higher prediction performance.

##### 4.3. Classification Prediction Model Based on SVM

In order to be consistent with previous research methods, data sets were not divided into training data sets and test data sets. Classification prediction models for high precision and low precision data sets are established based on SVM, and the classification accuracy and confusion matrix are obtained as in Tables 13–15.

ROC curve and AUC values are shown in Figure 13.

Analysis of the two data sets with the SVM model shows that sparse loop density does affect model precision: the SVM model built on the low precision data set has an accuracy of 76.9% and an AUC value of 0.8, while the SVM model built on the high precision data set has an accuracy of 78.7% and an AUC value of 0.82. The comparison shows that the high precision data set is better for modeling. For comparison with other machine learning models, Shen considered weather variables when establishing a random forest real-time accident risk model [33], and the accuracy of that model reached 82.1%. In that study, the authors screened the features of the data and took the weather into account, and the accuracy was 3.4 percentage points higher than that of the support vector machine model here. In further research, the data could be processed accordingly and the parameters of the support vector machine tuned to achieve higher prediction performance.

Compared with the Bayesian Logistic regression model on the same low precision data set, the SVM model has better overall prediction performance: the classification accuracy is 6.22 percentage points higher and the AUC value is 0.144 higher. When loop density is low and the loop data carry limited information, a real-time crash risk model based on Bayesian Logistic regression can still be established; it effectively screens out the significant crash risk factors, and these factors can be explained in detail and quantified. However, its strict mathematical form limits the overall prediction performance of the model. SVM is a black box machine learning algorithm that can effectively learn the effects of features on the outcome and reflect them in the prediction results.

Therefore, when the data set is not accurate enough, it is recommended to use the machine learning algorithm to establish a model to classify and predict the crash risk. When the data accuracy is good, the statistical Logistic regression method can be used to screen out significant risk variables to explain the model and classify the crash risk prediction.

#### 5. Conclusions

Considering the influence of limited data conditions on the real-time freeway traffic crash risk model, this paper constructed a high precision data set, a low precision data set, and a small sample data set. These data sets were modeled and analyzed based on Bayesian Logistic regression, and the reliability of real-time crash risk model transplantation based on Bayesian updating was verified. Finally, the advantages and disadvantages of the models established by Bayesian Logistic regression and SVM were compared. The main conclusions of this paper are as follows:

(1) The significant risk factors of Bayesian Logistic regression established under various sample sizes are different. With the increasing of sample size, the evaluation index of the model decreases. However, the overall performance of the model improves. The increase of sample size can effectively improve the classification and prediction performance of the model.

(2) When the loop detector density of the collected data is small, the prediction performance of the Bayesian Logistic regression model based on the low precision data set is weaker than that of the model based on the high precision data set. In addition, the significant risk factors differ considerably between the two models, indicating that Bayesian Logistic regression is not suitable for the low precision data set.

(3) Based on the Bayesian updating method, the validity of model migration is verified. Applying the posterior distribution of the significant variable parameters of the Bayesian Logistic model based on the high precision data set to the low precision data set can improve the prediction performance of the Bayesian Logistic model using the low precision data set.

(4) Compared with Bayesian Logistic regression, the crash risk model based on SVM has higher prediction performance. Even on the low precision data set, its prediction performance is significantly better than that of Bayesian Logistic regression, indicating that SVM is a better choice when data precision is insufficient. However, SVM cannot effectively interpret the cause of crash risk. When the data quality is high, Bayesian Logistic regression can be used for modeling and prediction, and the crash risk can be well explained.

In this paper, Bayesian Logistic regression and the support vector machine are applied to analyze the impact of various data sets on the traffic crash risk model. In future work, other machine learning methods and the benefit of feature engineering for building the crash risk model can be studied. New methods of crash risk model transplantation should also be studied.

#### Data Availability

The data used to support the findings in this study are available from the corresponding authors upon request.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this study.

#### Acknowledgments

This research was supported by the China Postdoctoral Science Foundation (2021M700333).