Abstract

Rear-end crashes or crash risk is widely recognized as safety-critical state of vehicles under comprehensive conditions. This study investigated the association between traffic flow uncertainty, drivers’ visual perception, car-following behavior, roadway and vehicular characteristics, and rear-end crash risk variation and compared the crash risk variation prediction with and without specific flow-level data. Two datasets comprising 5055 individual vehicles in car-following state were collected through on-road experiments on two freeways in China. A hierarchical hybrid BN model approach was proposed to capture the association between drivers’ visual perception, traffic flow uncertainty, and rear-end crash risk variation. Results show that (1) the BN model with flow-level data outperformed the BN model without flow-level data and could predict 85.3% of the cases of crash risk decrease, with a false alarm rate of 21.4%; (2) the hierarchical hybrid BN models showed plausible spatial transferability in predicting crash risk variation; and (3) the incorporation of specific flow-level variables and data greatly benefited the successful identification of rear-end crash risk variations. The findings of this study suggest that rear-end crash risk is inherently associated with both individual driving behaviors and traffic flow uncertainty, and appropriate visual perceptual information could compensate for crash risk and improve safety.

1. Introduction

Rear-end crashes are one of the most killing accident types on highways. According to a recent accident statistic by NHTSA [1], rear-end crashes accounted for the largest proportion (32.3%) of the total number of crashes and accounted for 45.9% of the crashes with motor vehicle in 2018 in the USA. In China, rear-end crashes were reported to account for 36.5% of the total crashes and contribute 32.8% of all the fatalities on freeways in 2017 [2]. Accordingly, an abundance of efforts has been dedicated to analyzing the causation factors [35], identifying safety-critical events [6, 7], predicting crash risk propensity [8, 9], and evaluating traffic safety [1012]. In these efforts, individual driving behaviors are without a doubt the unparalleled contributive factors to be investigated, because an inspection of the behavioral nature of moving vehicles and drivers is always worth being the first choice. In addition, speeding [13], insufficient headway [14], biased visual perception [14], and other relevant human factors [1517] were found to be the predominant behavioral factors. A variety of nonbehavioral factors, i.e., external conditions, were also considered to approximate crash or crash risk, such as roadway feature, weather (visibility) condition, traffic volume, driver individual characteristics, and vehicle feature (see Theofilatos and Yannis [5]; Mannering et al. [3]; and Papadimitriou et al. [4] for systematic reviews). In a technical sense, the crashes or crash risk reasoning and prediction could be deemed as a task to obtain the conditional probability of the occurrence of crashes or crash risk under the interactive and complex impacts of the aforementioned conditions and factors. Therefore, considerable efforts have resorted to a large number of multisource contributive factors and a high resolution of available datasets. However, most of the previous studies seemly paid excessive attention to those external factors, and the flow-level information’s association with and contribution to crash risk were greatly underestimated, which exposed a limitation in that it may be hard to tell if the crashes or crash risk was merely and unbiasedly due to the drivers’ behavior under certain surrounding environment, as suggested by Zhu et al. [18]. In addition, compared with those external conditional factors, little is actually known about how the traffic flow uncertainty per se would affect the occurrence and/or the likelihood of crashes interactively with individual vehicles. As a matter of fact, the traffic flow uncertainty is supposed to be essentially contributive to crashes or crash risk, since a crash can be triggered by a short-time disturbance of traffic flow before the occurrence of the crash ([5, 19] and [20]). Traffic flow might as well be regarded as a fundamental background condition from where any crashes or crash risk of individual vehicles could then possibly be originally and unbiasedly approximated and understood. Further, previous studies did not particularly differentiate the traffic flow states in analyzing rear-end crashes or crash risk. However, rear-end crashes extensively proved to be closely related to car-following behavior [14, 2126], and car-following is always supposed to be a driving state right before the occurrence of a rear-end crash. Therefore, in this regard, the rear-end crashes or crash risk might be misunderstood in the absence of specific car-following behavior and car-following vehicle flow (car-following platoon).

Following the above rationale, in this study, we attempt to reason rear-end crash risk and define car-following behavior by incorporating some specific traffic flow variables (flow-level data) and to evaluate the performance with the specific flow-level information. Generally, this was achieved by extending a prior study, which tentatively specified crash risk variation with visual perceptual, vehicular, and roadway factors under the presence of a kind of perceptual markings [14]. In the previous study, the rear-end crash risk variation was associated with the above factors, with drivers’ distance risk perception (DRP) and speed risk perception (SRP) being a bridge, and was eventually specified by two extended surrogate safety indicators based on modified time-to-collision (mTTC) and deceleration rate to avoid collision (DRAC). Continuing along this framework, in this study, a hierarchical hybrid Bayesian Network (BN) approach was taken advantage of to compare the rear-end crash risk analysis with and without specific flow-level data as input. The BN, as a probabilistic method, is especially appropriate and advantageous in this regard, as imbedded information-aggregation mechanism and regression models empower it to make reasonable statistical inference with multisource data or information to describe conditional probability distributions for various variables in a system and to explore inherent causal and complex dependencies among those variables [27]. With the above framework and hierarchical hybrid BN approach, two BN models were developed to quantitatively evaluate the performance with and without specific flow-level data. Specifically, the flow-level data used in this study included the flow rate () and heavy vehicle proportion () defined in specific 10 min, platoon length () measured with the number of vehicles involved, and a flow-level risk indicator of platoon crash risk entropy (), as suggested in Ding et al. [22]. Besides, the data collected on another freeway was utilized to examine the spatial transferability of the BN models trained with data sampled from the freeway presented in Ding et al. [14]. The framework of this study is structured as in Figure 1.

The remainder of the paper is organized as follows. Following the Introduction, the literature of relevant knowledge is reviewed in Section 2. Section 3 presents the experiments, data collection, and variables. Section 4 introduces the hierarchical hybrid BN modeling, followed by the results and discussion of the BN models in Section 5. Section 6 concludes the research findings.

2. Literature Review

2.1. Crash Risk Factors

According to the Highway Safety Manual [28] and to the nature of a surface transportation system, roadway crashes can be attributed to a combination of factors: the drivers (human factors), vehicles (vehicular factors), road infrastructure (roadway factors), and surrounding environment. Specifically, vehicle type and car-following mode (large/small vehicles following or being followed) are two main vehicular factors that have been widely verified to be influential in crashes and crash risk [14, 22, 2933]. Similarly, the roadway conditions, such as road type/grade, road geometric alignment, and road cross-section features, were extensively considered to account for crashes and crash risk (see Theofilatos and Yannis [5]; Mannering et al. [3]; and Papadimitriou et al. [4] for systematic reviews). Among these factors, the horizontal curves (radius) and their relevant stopping sight distance (SSD) were found to be two of the key alignment factors related to crashes [3436]. This is because the curves, as a specific component of the roadway, are more likely to be a natural hazard to drivers due to the changes in driving expectancy and vehicle handling, which was also actually evidenced by the persistent large percentage of accidents on curves [1, 2]. Besides, several environmental factors are also considered in crash risk analysis, e.g., the weather (visibility) condition [5, 3739].

Unlike those external conditions, the human factors are believed to be the most influential yet complex factors in the occurrence of crashes. Usually, the sociodemographic characteristics of drivers, such as age, gender, driving experience, and driving style, are investigated to reveal their association with the occurrence and/or likelihood of crashes in a macroscopic manner [40, 41]. Besides, there is also wide agreement that the microscopic behavioral aspects of drivers including cognitive psychological behavior [14, 21, 24, 30, 4244] and driving behaviors [26] are critical influential human factors in crashes. Among them, drivers’ cognitive behavior (especially visual perception) is the fundamental one because the other microscopic behaviors without exception originate from the primitive cognition and perception of the environment [44]. In particular, inadequate and/or biased cognitive perception and judgement were believed to be the critical human factors leading to road crashes [17], as the vast majority of the rear-end crashes can occur if the following driver errs in judging closing speed and headway to the leading vehicle. Particularly, since more than 90% of the information that drivers obtain and use is visual [45], drivers’ visual perception matters greatly to drivers’ choice of speed and distance in car-following and eventually impacts the risk of rear-end crashes. Accordingly, various forms of perceptual markings were developed, such as longitudinal line markings [4649], transverse line markings [5059], and converging chevron markings [60, 61], and all showed effectiveness in terms of speed reduction and/or usefulness as a method of crash prevention.

Actually, the influence of visual perception on driving behaviors and collision avoidance was originally noticed by some cognitive psychologists, who verified in virtual-reality scenes that the existence of certain visual information (which was termed as “visual cues”) on ground surface could lead to drivers’ overestimation of speed [42, 44, 48, 62, 63], underestimation of distance [44, 6470], and untimely and inadequate brake [42, 43]. Inspired by those cognitive psychologists, Ding et al. [14, 21, 24, 71, 72] conducted a series of studies focusing on associating, evaluating, and explaining the effects and mechanism of drivers’ speed perception and distance perception on car-following behaviors and crash risk, by introducing and testing several specially designed perceptual markings with on-road experiments and observations. In addition, the inherent connection between speed/distance perception, driving behaviors, and crash risk was comprehensively accounted for by two latent variables, i.e., “speed risk perception (SRP)” and “distance risk perception (DRP),” which differentiated the risk perception originating from variations in speed and distance [14].

2.2. Traffic Flow and Crash Risk

Crash risk or a crash is not inevitably merely a microscopic behavioral outcome; it is also subject to traffic flow uncertainty and certain flow-level factors, wherein traffic states were widely believed to impact the occurrence of crashes and were extensively used to translate traffic flow into safety performance. Golob et al. [73] investigated the level of safety during breakdown from free flow to congested operations or recovery back to free flow, and differences in traffic conditions across lanes. They suggested that the propensity for accidents was explicitly related to traffic conditions. Xu et al. [74, 75] identified different traffic states using occupancy and revealed that the significant traffic variables contributing to crash risk in various traffic states were quite different, in particular the traffic state in which free flow in the upstream and congested flow in the downstream had the greatest impact on crash occurrences. Similarly, Sun and Sun [8] and Zhao et al. [76] indicated that the crash involvement rates and crash risk ratios would greatly vary with different combinations of upstream and downstream traffic states. Jetto et al. [77] emphasized that rear-end crashes were more likely to occur in the congested traffic state. Further, traffic states were largely characterized by volume, density/occupancy, flow rate, and speed, for the purpose of more precisely capturing their association with crash risk (see Theofilatos and Yannis [5] for a comprehensive review). In particular, traffic volume and traffic composition were found to have a positive influence on crash risk, specially in low-speed congestion segments, and to have a significantly negative impact in high-speed traffic segments [12, 78, 79]. Traffic density/occupancy and traffic speed conditions were usually incorporated in volume as significant individual variables and covariates to account for the odds of crash occurrence and crash risk [8, 8082]. Besides, the differences of volume, occupancy, and speed across lanes were also specifically considered to account for crashes and for developing various crash prediction models [75]. In addition, the flow rate and the standard deviation of flow rate were used alternatively as candidate input variables in crash risk modeling [9, 22, 74, 75].

The above efforts have successfully associated crashes and crash risk with macroscopic and statistical traffic flow characteristics aggregated in a relatively large range of time and/or distance. However, this kind of macroscopic traffic flow characteristics could still be limited to precisely portraying the specific flow states and effectively capturing the real-time rear-end crashes or crash risk, because a crash or a high-risk state of a vehicle always occurs under certain specific conditions that are supposed to emerge momentarily. Additionally, as explained previously, rear-end crashes are greatly associated with the car-following states of vehicles, suggesting that a specific focus on car-following flows might be more effective in predicting rear-end crash risk. Therefore, a follow-up critical issue is to identify specific relatively microscopic flow-level variables in car-following that are supposed to be more informative as crash precursors to account for rear-end crashes and crash risk. Actually, this issue might be well understood and addressed by the fundamental relationship between traffic flow uncertainty and car-following behavior. As suggested by the well-known stability theory of car-following models, unstable traffic flow can propagate through a platoon and subsequently expose vehicles to higher crash risks [83]. It was also widely believed that a crash can be triggered by a short-time disturbance of traffic flow before the crash occurs [5, 19, 20]. In particular, the traffic flow of 5–10 min prior to crashes was widely recognized in the past decades to have the most significant association with crashes [8, 9, 12, 37, 74, 80, 82, 84, 85]. Hence, specific crash precursors of traffic flow uncertainty were encouraged to be specified in 5–10 min. Given this, we accordingly proposed the 10 min flow rate () and heavy vehicle rate (heavy vehicle percentage, ) to attempt to specify the traffic flow uncertainty for accounting for rear-end crash risk variations [22].

Moreover, some particular characteristics of the car-following states of vehicles are also better to be specified to be associated with rear-end crash risk. In this regard, the length (vehicles involved) of the car-following platoon could be one of the most representative characteristics and influential variables in rear-end crash risk. Seraj et al. [86] discovered that a smaller platoon length of vehicles would lead to a greater safety improvement in car-following and that the range of headway increase was subjective to the maximum platoon length. In addition, a recent study by Hyun et al. [78] reported that more vehicles involved in a platoon would increase crash risk and crash severity and that the risk could be even higher with the presence of trucks. Similar effect of platoon length on crash risk was revealed by Ding et al. [22], who also developed a novel flow-level indicator to characterize the platoon risk state based on the time headway distribution of the platooning vehicles, i.e., platoon crash risk entropy ().

3. Experiment and Data Collection

Generally, most of the experimental sites, designs, and processes of data collection and data treatment had already been demonstrated in previous studies [14, 71, 72], which will be briefly mentioned here. Therefore, the variable selection and description are of primary focus to be explained in this study.

3.1. Experimental Site and Design

In this study, for the purpose of validating the BN models and testing the model transferability based on data collected from another freeway, similar one straight and three curved segments of Hangzhou–Ruili Freeway (G56) at Xianning, Hubei, China, were selected as experimental sites. An overview of the straight and curved segments of G50 and G56 is profiled in Table 1 and illustrated in Figure 2. In the middle part of these segments, a 300 m slow lane was installed with the perceptual markings. It is important to mention here that there was no tunnel, entrance/exit, or any signs within 300 m area that could possibly impact driver behaviors, nor any exposed surveillance device for violation capture. The design of the perceptual markings, as introduced in Ding et al. [14, 71, 72], was also followed in the present study. That is, three forms of perceptual markings were adopted, which were specified by the units of the perceptual markings, i.e., , , and (see Figure 2).

3.2. Data Collection

The raw speed, distance headway, time headway, vehicle type, etc. of individual vehicles were collected on two ways (opposite directions) of each curved segment at the same time, to minimize the possible impact of turning direction of the curve. For one direction of a curved segment, three NC200 traffic analyzers were sequentially set along the flow direction in the center of the slow lane to obtain sectional driving behaviors data as mentioned above (see Figure 3). For the straight segments, only one way was installed with the perceptual markings for observations as appropriate. To obtain the driving behaviors data as correct and accurate as possible, six speed guns were used for sampling vehicle speeds to verify the data obtained by the traffic analyzers and for possible reobservations as needed (see Figure 3). Note that the cameras were mounted outside the crash barrier on the hard shoulder and were sheltered with local shrubs being invisible to the drivers. All the data were collected during 8:30 a.m.–11:30 a.m. and/or 14:00 p.m.–17:00 p.m. in no precipitation days, and at least a one-day observation was conducted for every single test to sample adequate vehicles.

3.3. Data Treatment
3.3.1. Data Filtering Process

In this paper, we used the same methods as those in Ding et al. [14] to filter out free-flow and lane-change vehicles. The free-flow vehicles were identified at each observation section by comparing its stopping time and time headway . That is, if occurred at any observation sections, the vehicle was regarded as traveling freely and to be excluded. In this case, the stopping time of a vehicle can be calculated by , where is the instantaneous speed of a vehicle, m/s, and is the deceleration ( m/s2 was suggested by AASHTO [87]). Besides, the vehicles that negotiated a lane change observed between any two observation sections were also filtered out from the final data sample, by reviewing the video clips of the six cameras frame by frame. After the above data filtering process, the effective data was obtained as descriptively summarized in Table 2.

3.3.2. Variables Description

(1) Traffic Flow. As explained previously, the specific 10 min flow rate () and heavy vehicle proportion (), platoon length (), and platoon crash risk entropy (), introduced in a recent study [22], were adopted as crash precursors to specify the traffic flow uncertainty and car-following states. The specific 10 min flow rate was delivered as follows:where is peak 10 min flow rate of the slow lane, pcu/h/ln, and is the maximum volume of a continuous time interval of 10 min of the slow lane during an observation, pcu. According to the traffic flow data collected at the eight segments of the two freeways, generally ranged from 500 pcu/h/ln to 1800 pcu/h/ln, and it was discretized into three intervals: (500 pcu/h/ln, 1100 pcu/h/ln], (1100 pcu/h/ln, 1500 pcu/h/ln], and (1500 pcu/h/ln, 1800 pcu/h/ln), coded “S (small),” “M (medium),” and “L (large),” respectively. The code initials are used consistently hereinafter for the rest of the variables.

Heavy vehicle rate was calculated as follows:where is the heavy vehicle rate of the slow lane and is the number of heavy vehicles in the specific 10 min, veh. According to the observations, ranged from 35.3% to 65.6%, and it was discretized into (35.3%, 45%], (45%, 60%], and (60%, 65.6%) and coded “S,” “M,” and “L,” respectively.

The platoon length () was represented by the number of vehicles involved in the platoon and was categorized into two vehicles (“TWO”), three vehicles (“THREE”), and four or more vehicles (“MORE”).

Additionally, the potential crash risk status of the car-following platoon was specified by the indicators of platoon crash risk entropy () as introduced by Ding et al. [22]. The was derived from the distribution of time headways and was given as follows.where is the probability of the appearance of time headway () of vehicle among the entire values of time headways of all vehicles at section ; is the number of appearances of the value of time headway in the total data sample of time headways; is the total data sample of time headways in all conditions at an observation section, and (G50) or (G56); is the platoon crash risk entropy of platoon at section ; and is the number of vehicles in platoon , and . In practice, the probability of the time headway () was approximated by the frequency of each data point of time headway () of vehicle as it appeared in the total data sample. According to the distributions of time headways, the largely ranged from 0.04 to 0.46 in the two-vehicle platoon situations, from 0.07 to 0.68 in the three-vehicle platoon situations, and from 0.13 to 0.90 when there were four or more vehicles. Accordingly, the was discretized into (0.04, 0.46], (0.46, 0.13], and (0.13, 0.90] and labeled as “S,” “M,” and “L,” respectively.

(2) Road Condition. As demonstrated in Table 2, there were some differences in the radius of the curves on Freeways G50 and G56. Nevertheless, we believe that these slight differences could be negligible here, so the radii of Curve 2 and Curve 3 of Freeway G56 were treated as 1800 m and 1200 m anyway. Then, the road conditions could be categorized as and and coded “S,” “M,” and “L,” respectively. In addition, the straight segment was labeled as “STR” for short.

(3) Vehicle. The types of the following (FVT) and leading (LVT) vehicles were particularly considered, and they can be differentiated into small, medium, and large ones by NC200 traffic analyzers by detecting vehicle length and also can be additionally identified by the number of their axles. For the sake of simplification and to clearly distinguish the type (size) of vehicles to expect an easier observation of the effect of vehicle type, only the small and large vehicles were considered. To be specific, the small vehicles were two-axle vehicles including passenger cars, small trucks, and vans; the large vehicles were the ones with three axles and above, including large trucks and buses. Similarly, the small vehicles and large vehicles were assigned the labels “S” and “L,” respectively.

(4) Individual Driver Behavior. Variables concerning individual driving behaviors include speed, distance, and headway, by which the motion state of a vehicle could be adequately characterized. According to the data collection methods demonstrated above, these individual driving behaviors were all instantaneously observed at the three observation sections. That is, as vehicle passed through observation section # at a speed of , it would be observed with a distance headway and time headway with regard to its leading vehicle, . Based on the individual vehicle data, we categorized vehicle speeds into (−∞, 22.22 m/s], (22.22 m/s, 27.78 m/s], and (27.78 m/s, +∞); distance headways into (−∞, 80 m], (80 m, 100 m], and (100 m, +∞); and time headways into (−∞, 3.0 s], (3.0 s, 4.5 s], and (4.5 s, +∞); and they were coded “S,” “M,” and “L,” respectively.

In addition to the common Newtonian motion parameters of vehicles, the stopping sight distance (SSD) was also specifically taken into consideration to specify driving behaviors on curves. According to AASHTO [87], SSD can be calculated as follows:where is the instantaneous stopping sight distance of vehicle when it passes through observation section #, m;  m/s2; is the deceleration rate, m/s2; is the average grade; is drivers’ reaction time, s; and others mean the same as above. Here, based on the statistics, the SSD was categorized into (−∞, 80 m], (80 m, 110 m], and (110 m, +∞) for “S,” “M,” and “L,” respectively.

(5) Visual Perception. As introduced before, the specific variables with regard to visual perception that we intentionally manipulated were the temporal frequency and spatial frequency derived from the line markings. The of vehicle equals its instantaneous speed divided by . The equals the reciprocal of . As three forms of line markings were adopted for the on-road experiments, that is, , , and , could be objectively grouped into (3 Hz, 5 Hz], (5 Hz, 8 Hz], and (8 Hz, 14 Hz]; equals 0.5, 0.25, and 0.125; and they were correspondingly labeled as “S,” “M,” and “L,” respectively. Besides, “” and “” were assigned to the condition without line markings.

(6) Crash Risk Variation. To measure crash risk variations of individual vehicles, the rate of change of DRAC () we proposed previously [14] was employed. It can be calculated as follows:where , , and are the DRAC, position, and velocity of the following vehicle () at timestamp ; is the length of the leading vehicle (); and is the rate of change of DRAC within a time of . Therefore, the crash risk variation (CRV) was categorized into crash risk decrease () and crash risk increase (), and they were labeled as “DECR” and “INCR,” respectively.

(7) Risk Perception. In addition to the observed variables explained above, two latent variables were incorporated to associate the observed variables and crash risk variations, that is, speed risk perception (SRP) and distance risk perception (DRP) that we put forward previously [14]. Specifically, SRP and DRP were categorized into low and high and labeled as “LOW” and “HIGH,” respectively. A detailed profile of the above variables is summarized in Table 3.

4. Methodology

4.1. Hierarchical Hybrid BN Structure

To investigate the effects of multisource and multidimensional factors on rear-end crash risk in car-following under the influence of perceptual line markings, a hierarchical hybrid BN structure was developed (see Figure 4). Here, the hierarchy of the structure was characterized by two aspects. On the one hand, the nodes (variables) incorporated were naturally categorized into roadway, traffic flow, driving behaviors, and visual perception according to the variable’s attributes as shown in Table 3. On the other hand, the variables were structured with four hierarchies according to their relationship with the final crash risk as follows: (1) external conditions, including the variables characterizing traffic flow states and other static (temporally invariant) observed variables; (2) behavioral factors, including all the driving behaviors related to observed variables; (3) risk perception, i.e., the latent variables of SRP and DRP; and (4) crash risk variation, that is, the eventual outcome of variations in crash risk measured by the surrogate indicator of . Then, based on existing knowledge, the car-following crash risk directly originated from drivers’ speed risk perception (SRP), distance risk perception (DRP), some relevant driving behavior indicators (, , and ), and temporal frequency (), and the SRP and DRP were essentially associated with the driving behaviors (, , , , and ); then, the driving behaviors, imbedded in the second layer, were specified by the external conditions of , , , , , , and . Thus, eventually, the external conditions and the behavioral factors were set as the inputs, and the crash risk variation was set as the output. According to the above analysis, a hierarchical hybrid BN was theoretically structured as in Figure 4.

4.2. Model Training and Evaluation

For classification models, a training dataset and testing dataset are supposed to be assigned randomly from a sample dataset for the training and evaluation (testing) of the models. Correspondingly, in general, the data collected on Freeway G50 was assigned to the training of the structure and parameters of the BN models and the test of their performance as well, and the data collected on Freeway G56 was used for the evaluation and application of the BN models. To be specific, the evaluation process of the BN models was actually twofold: (1) the BN models were first trained (learned) and evaluated (tested) by the data collected on Freeway G50, and (2) the BN models were then evaluated (tested) using the data collected on Freeway G56 to verify their applicability to a dataset different from the one they were trained by. For the first evaluation, to avoid local optimal estimation, we chose a method that was similar to the classic and powerful “K-fold cross-validation” to train and evaluate the BN models. That is, the original sample dataset from Freeway G50 was first divided into ten equal-sized subdatasets with roughly similar cases of crash risk decrease and increase; then, the BN models were trained based on the first nine subdatasets and tested on the tenth subdataset. Note that the sizes of the first nine subdatasets were set as 344, and the remaining 336 records were assigned to the tenth subdataset. In addition, the BN models’ performance was evaluated within both the first nine subdatasets and the tenth subdataset using the method of “leave one out.” Eventually, the mean value of the evaluation metrics based on the first nine subdatasets was used to represent the model training performance. The second evaluation based on observations on Freeway G56 could be also regarded as an evaluation of the spatial transferability of the BN models when they were applied to a heterogeneous source of holdout dataset collected at another location.

To evaluate the classification tests, true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are commonly used for basic statistical measurements. TP measures the number of actual positives correctly identified as positives, e.g., the number of cases with crash risk decrease () correctly identified as “DECR.” TN measures the number of negatives correctly identified as negatives, e.g., the number of cases with crash risk increase () correctly identified as “INCR.” FP defines the number of the estimated cases that are actually negative but incorrectly identified as positive, e.g., the number of cases with crash risk increase () incorrectly identified as “DECR.” FN defines the number of the estimated cases that are positive in fact but incorrectly identified as negative, e.g., the number of cases with crash risk decrease () incorrectly identified as “INCR.” Based on the above basic measurements and the classification confusion matrix, several quantitative metrics could be developed to comprehensively and better evaluate the classification performance, i.e., precision, recall, sensitivity, specificity, accuracy, F-measure, and -means. The precision is the fraction of correct classification of corresponding crash risk out of all events predicted to be positive (6), and the recall, also named as true positive rate, is the fraction of correct classification of corresponding crash risk out of all true events (7); they are often adopted to evaluate the effective detection ability for one particular class. The sensitivity is the crash risk classification accuracy, and the specificity is the non-crash risk classification accuracy, as written in (8) and (9), respectively. The FP rate, also known as false alarm rate, is the proportion of cases of INCR incorrectly classified as DECR in all actual INCR cases, as given in (10). Accuracy is the fraction of correct classified crash risk out of all predicted events, as formulated in (11). The -means is the geometric mean of the sensitivity and specificity, and F-measure is the harmonic mean of precision and recall, which represents the ability to detect crash risk of the model, as shown in (12) and (13), respectively.

As a harmonic mean of precision and recall, F-measure averages the proportion of cases correctly identified as positive in actual positive cases and that in predicted positive cases. It can serve as an effective performance measure for the BN models and is particularly informative in finding the conditions and situations accounting for the crash risk variation in car-following. Besides, the mean value of the above metrics based on the first nine parts of the dataset collected on Freeway G50 was used to represent the model performance.

5. Results and Discussion

5.1. BN Model Estimation and Validation

In this study, the open software GeNIe Academic (version 3.0.5905.0) was employed to build the BN structure and estimate the parameters. The parameters of the BN were learned with the EM method as explained previously, with a random seed of zero to especially facilitate the parameter learning with the presence of latent variables (i.e., SRP, DRP). Figure 5 presents the BN model with flow-level factors and with learned parameters. Based on the 3432 vehicles in the dataset collected on the Freeway G50, which was divided into ten parts for training and testing as explained above, two BN models were independently developed and evaluated with and without the specific flow-level variables and data. Table 4 presents the BN classification confusion matrix with aggregated results of the ten subdatasets. The results of the BN model estimation and validation based on the confusion matrix were shown in Table 5, wherein the results of the testing subdataset were the focus of the analysis.

As can be seen in Table 5, the two BN models built with and without the specific flow-level data were both reasonably acceptable with a classification accuracy of around 80% in terms of predicting crash risk variations. Still, the estimation results varied in the BN models built with and without the specific flow-level data. For the model without flow-level data, the estimation accuracy was 80.8% on average of the nine training subdatasets, and the accuracy decreased to 79.8% with a 0.4% increase in false alarm rate (FP rate) with the testing subdataset. The results also indicated that the BN model without flow-level data was capable of identifying 82.2% and 81.0% of the cases of “DECR” () for the training and testing subdatasets, respectively. For the BN model with flow-level data, the estimation accuracies were 83.2% and 83.9% for the training and testing subdatasets, respectively. It can be discovered that the accuracy increased 0.7% and the false alarm rate decreased 1.9% for the testing subdataset. Similarly, the BN model with flow-level data showed its capability of classifying 84.8% and 85.3% of the cases of “DECR” for the training and testing subdatasets, respectively. More importantly, the results also demonstrated that the BN model with specific flow-level data showed better performance in predicting crash risk variations than the one without flow-level data on both the accuracy and false alarm rate. The BN model with flow-level data can predict 2.6% more cases of “DECR” than the one without flow-level data, with a 1.3% reduction in false alarm rate for the training subdatasets. In addition, even greater increase (4.3%) in “DECR” prediction and reduction (3.6%) in false alarm rate were found for the testing subdataset. These significant differences were also evidenced by F-measure, which was maximal (0.894) for the BN model with flow-level data of the testing subdataset, indicating a very acceptable and better model prediction performance as compared with its counterpart without flow-level data.

The proposed BN models, especially the one with flow-level data, also outperformed other similar classification studies using BN concerning crash and/or crash severity. For example, De Oña et al. [88] showed a similar range of 61% to 62%, and Chen et al. [89] presented estimation accuracies of 66.84% and 65.76%. Note that these studies did not include traffic flow variables in their BN models. Comparatively, a relatively higher estimation accuracy of 76.4% was achieved by considering the relative macroscopic traffic flow states and with a reasonable false alarm rate of 23.7% as revealed by Sun and Sun [8]. Therefore, the present study could be an even better illustration of the advantage of incorporating more specific flow-level variables and data to account for crash risk variations. The impressive performance of the BN model with flow-level data could be mainly owing to three reasons as follows: (1) in this study, we focused on a strict car-following situation, in which the drivers’ behaviors would be much more steady than in other situations where they might speed up, change lane, or brake abruptly frequently; (2) we focused on the crash risk and its variations instead of crashes per se as the aforementioned studies did, and this could better facilitate an identification of the real-time safety-critical events [14, 23] and the investigation of the relationship between crash risk (variation) and those multisource contributive factors; and (3) the particular incorporation of the specific 10 min flow rate (), platoon length (), heavy vehicle rate (), and platoon crash risk entropy () was supposed to better and directly specify the short-term traffic flow uncertainty and car-following situations that might lead to rear-end crash risk fluctuations [22]. Furthermore, the BN model proposed in this study was able to hierarchically associate various contributing factors and to capture the inherent connections between variables in different layers, which accordingly helped to achieve a better crash risk prediction.

5.2. Model Transferability

The transferability is always a fundamental aspect of the model performance showing its suitability and application with different datasets. Actually, as presented in Table 5, the variances between the sensitivity, accuracy, and F-measure for the training and testing subdatasets were all around 1%, indicating a well acceptable transferability of the BN in explaining and modeling the testing subdataset of G50. Besides, the spatial transferability of the BN models was also evaluated, to examine the prediction ability of the built BN model with a heterogeneous source of dataset collected on another road. To be specific, as mentioned previously, the data collected on Freeway G56 were used to evaluate the spatial transferability of the BN models built based on the data from G50. With 1623 vehicle records as a testing dataset, the BN models with and without flow-level data were both examined and compared for spatial transferability in terms of crash risk variation prediction performance. As demonstrated in Table 6, the BN models were largely able to correctly predict the crash risk variations with equivalent false alarm rates. The best performance was also seen in the BN model with flow-level data, which predicted as high as 82.6% of the cases of “DECR” with an even lower false alarm rate of 21.3% and with a yet sustained high level of F-measure of 0.868. Similarly, the spatial transferability in the present study was also found to outperform the previous ones without considering such specific flow-level variables, which mainly reported around 60% crash prediction accuracy for their model’s spatial transferability [9, 85], and also outperform the one by Sun and Sun [8], who reported 67% crash prediction accuracy and 20.8% false alarm rate of spatial transferability with consideration of traffic flow states. Moreover, it is worth noting that the estimation accuracies of the two BN models both reduced as compared with their counterparts based on the testing subdataset of G50. The sensitivity, accuracy, and F-measure decreased by 1.8%, 1.6%, and 0.024 for the BN model without flow data and reduced by 2.7%, 2.3%, and 0.026 for the BN model with flow data. Nevertheless, these decreases were all relatively marginal to the decreases of 9.4% and 0.017 for the best-fitted model of Sun and Sun’s [8] in predicting crashes.

Moreover, the good transferability of the BN model with flow-level data might as well again underscore the advantage and importance of introducing specific flow-level variables and data in crash risk modeling. The reason, in addition to the previously mentioned ones, could be that the car-following behaviors and rear-end crash risk variations within the car-following platoon would be more likely to be equivalent across roads, because the crash risk variations in car-following states would be overwhelmingly directly specified by the vehicles’ movement and the platoon they are involved in. To a certain extent, the rear-end crash risk variations derived from vehicles’ movement and platoon states are supposed to be relatively independent of many external conditions, such as weather condition, light condition, and day of crash, which were prevalent factors investigated previously [35]. Besides, compared to these external conditions, the specific flow-level variables could serve as a more direct and natural rear-end crash risk precursor, as they inherently specify the crash risk dynamics along with the oscillation of vehicle movement and the fluctuations of platoon states.

5.3. Probability Inference

As validated for its estimation performance, the quantitative interdependency among variables and their contributions to crash risk variation could be obtained by the probability inference powered by the BN models. Based on the BN structure, the particular impact of a variable on the crash risk variations could be independently and quantitatively derived by the probabilities of “DECR” while setting a certain class of the variable as a 100% evidence. Figure 6 shows the abovementioned independent probability inference results for the observed variables leading to crash risk decrease (“DECR”, i.e., ) in the structure of BN models with and without flow-level data within the testing dataset of G56. In the figure, the blue dashed line with square markers stands for the probability inference results of the BN model without traffic flow data; the orange solid line with circle markers indicates the probability inference results of the BN model with traffic flow data; and the dash-dotted line in black represents the original frequency of cases of crash risk decrease (“DECR”) extracted directly from the testing dataset of G56 (apparently the original frequency is irrelevant to the variables appearing in the figure). Besides, the suffixes “S” and “L” indicate the small (“S”) and large (“L”) values of the variables, which are in accordance with the variable settings shown in Table 3. For instance, “” means the variable was set as small (“S”) for 100% evidence in inferring the probability of “DECR.” The detailed analyses and interpretation of the probability inferences of various variables were hierarchically demonstrated in terms of external conditions and behavioral factors, respectively, as follows.

5.3.1. External Conditions

The external conditions actually correspond to the base layer in the BN structure including those flow-level variables (, , , ) and temporally invariant observed variables (, , , ). As illustrated in Figure 6, when the specific flow-level variables were incorporated (i.e., BN model with traffic flow data), it can be discovered that (1) the probabilities of “DECR” given those external conditions were significantly greater in the BN model with traffic flow data than in its counterpart without flow-level data and (2) in certain conditions, the probabilities of “DECR” increased as compared with the original frequency of cases of “DECR,” which was not the case for the BN model without flow-level variables. Specifically, when “” was assigned a 1.0 probability as evidence, the probability of “DECR” increased from 0.731 (original frequency) to 0.740, and when “” was set as an evidence, the probability of “DECR” reached 0.744. This implied that the effects of the perceptual markings would be less effective in decreasing crash risk on a curved segment with small radius compared with a straight one. This was consistent with the previous findings attributing the less effectiveness to the natural deficiency of drivers in detecting crash risk on curves as compared with the task on a straight segment and explaining that drivers need more necessary and effective information on curve segments than on straight ones to successfully achieve crash avoidance [14, 21, 90]. The vehicle type (size) also significantly affects the probability of crash risk variations, as verified by 0.7% (0.8%) increase of the probability of “DECR” when “ ()” was set as evidence. This was in line with the findings of Lee [30], Andersen et al. [29], and Yoo and Green [33] indicating that the large leading vehicles would lead to a greater headway of the following vehicles and thus possibly result in less variations in crash risk as suggested by Ding et al. [14, 22]. The spatial frequency actually specified the effects of the perceptual markings on crash risk variations, and if it was set as “,” the probability of “DECR” increased by 1.3% as compared with the original frequency. The impact of is related to drivers’ visual perception and will be explained together with temporal frequency in the next section.

5.3.2. Behavioral Factors

Similar performances of the BN models with and without flow-level data in terms of “DECR” estimation were observed in those driving behaviors related factors, i.e., , , , , and . In particular, the probabilities of “DECR” significantly increased from 0.731 (original frequency) to 0.763, 0.757, 0.767, 0.768, and 0.734 with “,” “,” “,” “,” and “,” respectively. In addition, according to Figure 6, it seems that the incorporation of the specific flow-level data could further significantly differentiate the impacts of varying states of those behavioral factors in reasoning crash risk variations. That is, the differences of the probabilities of “DECR” between given conditions of “” and “,” “” and “,” “” and “,” and “” and “” were 0.029, 0.022, 0.003, and 0.003 in BN model without flow-level data and increased to 0.032, 0.093, 0.066, and 0.061 in BN model with flow-level data. Accordingly, it might be deduced that the specific flow-level variables and data improved the crash risk variation prediction accuracy and could be advantageous in capturing the detailed and actual impacts of the behavioral variables in varying states in accounting for crash risk variations. The impacts of behavioral variables (, , , and ) on crash risk variations apparently suggested that the rear-end crash risk in car-following was greatly associated with and specified by drivers’ choices of speed and headways, control of acceleration, and maintaining of stopping sight distance. Additionally, it also suggested that there was a good chance of the perceptual markings to successfully “help” the drivers to reduce their speeds and increase their headways to eventually mitigate crash risk in car-following. These benefits would be attributed to the temporal frequency () and the “discontinuity effect [64]” (related to spatial frequency ()) concerning drivers’ speed perception and distance perception. The former would lead to speed overestimation [42, 44, 48, 62, 63] and the latter would result in distance underestimation [44, 6470], which were combinedly responsible for the reduced speed, increased headways, and mitigated crash risk as revealed by Ding et al. [14, 22].

The above estimation performances of the external and behavioral conditions and factors in crash risk variations driven by the hierarchical hybrid BN models with and without flow-level data suggest that (1) the incorporation of specific flow-level variables and data significantly promoted the successful identification of “DECR” (see Tables 5 and 6) and (2) the microscopic driver behaviors were somewhat outcomes partially specified by the traffic flow they were involved in. As a matter of fact, the benefits of the specific flow-level variables might be attributed to the inherent interactive nature of the relationship between individual driver behaviors and traffic flow uncertainty. First and foremost, it should be recognized that car-following of an individual vehicle is essentially a state in which the rear-end crash risk continuously exists and evolves spatiotemporally, being subjective to dynamics of the specific traffic flow or the car-following platoon particularly. In particular, crashes were widely believed to be the outcome of a short-term disturbance in the traffic flow a few minutes ahead, and this short-term disturbance can be directly owing to the risky driving behaviors of drivers of certain vehicles in the flow (especially the platoon), which are typically characterized by speed fluctuation and inadequate following headway with respect to their leading vehicles [5, 19, 20]. Conversely, individual drivers’ instable behaviors at a time would also give rise to an aggravated uncertainty of the entire flow (especially the platoon) during a period as suggested by the traffic flow theory [83] and car-following stability analysis [91, 92]. Therefore, it might be safe to say that the individual car-following behaviors are generally regulated by and are subjective to traffic flow uncertainty in specific time and space, and the crash risk of individual vehicles varies along with the risk dynamics of specific traffic flow state. In other words, those microscopic driving behaviors characteristics might be more precisely captured in associating rear-end crash risk with the introduction of specific flow-level variables. Consequently, the BN model without flow-level data showed a relatively lower estimation accuracy.

5.3.3. Sensitivity Analysis

To investigate the impact sensitivity of the hierarchical influential factors, we performed sensitivity analysis on drivers’ speed risk perception (SRP), distance risk perception (DRP), and crash risk variation (CRV) as target nodes. Here, the BN model with flow-level data based on the testing dataset of G56 was what we focused on and was the base BN structure for the sensitivity analysis. The sensitivity analyses of the SRP, DRP, and CRV are illustrated as tornado graphs in Figures 79 in the full parameter range of [0, 1]. The length of the bar corresponding to each input variable in the tornado graphs represents a measure of the impact of the particular variable on SRP, DRP, or CRV. The color of the bars shows the influence direction of the input variables on the target node; green expresses positive and red negative contributions, respectively. Figure 7 illustrates the impact of a set of the most influential model parameters on DRP when DRP was targeted as “HIGH.” It can be seen from Figure 7 that “” had the greatest impact on DRP ([0.531, 0.567]), which was closely followed by “” ([0.521, 0.555]), and “” ([0.531, 0.553]) had the 10th impact on DRP. These implied that the there was a better chance for the drivers to experience a high distance risk perception (“”) when the specific 10 min flow rate () and the platoon length () were large. It suggested that the traffic flow conditions substantially impacted the car-following behaviors and contributed to the rear-end crash risk variations. Likewise, Figure 8 presents the top ten sets of variables concerning the impact on SRP when SRP was targeted as “HIGH”. As shown in Figure 8, “” ([0.514, 0.559]) had the greatest impact and “” ([0.513, 0.541]) had the least impact on SRP among the top ten conditional states. Figure 9 demonstrates the sensitivity analysis of the target of “” as our primary focus. According to Figure 9, it can be discovered that CRV was greatly impacted by and sensitive to DRP and SRP interactively, as evidenced by the greatest range of the conditional probabilities of “” given “ ([0.431, 0.737])”, “ ([0.519, 0.761])”, and “ ([0.720, 0.948])”. These implied that the crash risk variation of individual vehicles was possibly driven by drivers’ intuitive risk perception from changes in distance and/or speed, which was well consistent with similar findings based on the structural equations modeling [14]. Besides the latent variables of DRP and SRP, it can be seen that the conditional states of “,” “,” “,” “,” “,” and “” were all found to be influential in the probability of “DECR.” These might also indicate that the specific flow-level variables (, , , and ) empowered the probability inference of crash risk variation interactively with the visual perceptual variables, which was in accordance with the results presented and discussed above.

6. Conclusions

In this study, we attempt to investigate the association between traffic flow uncertainty, roadway and vehicular characteristics, car-following behavior, visual perception, and rear-end crash risk variation and to compare the crash risk variation prediction performance with and without flow-level variables (, , , and ). Two datasets comprising 5055 individual vehicles in car-following state coupled with the abovementioned data were collected on Freeways G50 and G56 in Hubei, China. A hierarchical hybrid BN model approach was proposed to capture the association between drivers’ visual perception, traffic flow uncertainty, and rear-end crash risk variation. The BN models were trained and tested based on data collected on Freeway G50, and the spatial transferability of the models was examined using data collected on Freeway G56. The main findings of this study are as follows:(i)The BN model with flow-level data can predict 85.3% of the cases of crash risk decrease (), which is 4.3% greater than the one without flow-level data, with a false alarm rate of 21.4%, which is 3.6% lower than the one without flow-level data, indicating a very acceptable and better model prediction performance of BN model with flow-level data as compared with its counterpart without flow-level data.(ii)The hierarchical hybrid BN models show plausible transferability in predicting crash risk variation with acceptable false alarm rates. In addition, the BN model with flow-level data outperforms its counterpart without flow-level data in terms of spatial transferability, which predicts as high as 82.6% of the cases of crash risk decrease (DECR) with a low false alarm rate of 21.3%.(iii)The probability inferences driven by the hierarchical hybrid BN models revealed that the incorporation of specific flow-level variables and data significantly benefits the successful identification of rear-end crash risk variation, which might be attributed to the inherent interactive nature of the relationship between individual driver behavior and traffic flow uncertainty.(iv)The perceptual markings could provide appropriate visual perceptual information for drivers to compensate for rear-end crash risk and therefore could be used as an effective crash prevention measure.

Reasoning or predicting crash risk is inherently challenging as it is associated with individual driving behaviors, traffic flow uncertainty, and many other factors, so the fundamental interactive nature between individual driving behaviors and traffic flow uncertainty should be well considered and inspected in the first place to originally facilitate this work. Besides, from practical viewpoints, this study may cause the governors and road and traffic engineering practitioners to pay more attention to the heavy vehicle control and management and the potential safety issue on curves and, along with efforts on traffic flow monitoring, to develop a safer road transportation environment. Moreover, the perceptual markings, as verified and recommended in this study, may be a plausible measure to mitigate crash risk and improve road safety. Furthermore, this kind of perceptual treatments on roadway is supposed to provide drivers with more useful and effective information to facilitate driving and to avoid a crash. Therefore, future research could be oriented to the development of more contributive and targeted flow-level factors to better account for rear-end crash risk. Besides, more appropriate BN structures learned from different algorithms should be investigated and evaluated. Finally, actual crash data are encouraged to be incorporated for a better understanding of the relationship between traffic flow uncertainty, individual driving behaviors, external conditions, and rear-end crash risk.

Appendix

A. Abbreviations and Explanations

The list of abbreviations used in this article is provided in Table 7.

B. Bayesian Network (BN) Introduction

B.1. Definition

The BN is a specific type of quantitative causal model structured based on Bayes’ theorem and composed of a directed acyclic graph and a set of probability statements. The BN makes a rational statistical inference by updating the prior information or hypothesis of an elementary event. Prior information or assumptions are set based on subjective judgement (e.g., expert knowledge/historical data) or using observed data. BN structure consists of sets of nodes (variables) and edges (arcs) where nodes represent the variables (observed data or hidden features) and edges delineate the causal relationship between the variables and attributes. Figure 10 shows an example of the BN structure of crash risk inference.

BN is derived from a simple formula known as Bayes’ rule that could be expressed aswhere is the probability of given , also known as posterior probability; is the conditional probability of at the occurrence of , representing the likelihood of the occurrence of ; is called prior probability of ; and is the prior probability of evidence . The joint probability distribution of and could be written as follows:

Suppose a BN consists of variables , , , . According to the “Chain Rules,” the corresponding joint probability distribution of the variables could be written as

For any , if there exists to make the variable conditionally independent of the other variables of , which could be expressed as

then the above formula could be further simplified as follows:

As for a BN, stands for the parent node (variable) of its child node , and those nodes without parents would be root nodes.

Generally, the BN modeling mainly consists three steps: (1) variable defining, i.e., to select variables suitable for objective issues in the domain, which represent the nodes in a BN; (2) BN learning, i.e., to construct an appropriate topology of a BN and to determine the corresponding parameters of the BN structure; and (3) BN inferring, i.e., to obtain the posterior probability of the nodes in the BN structure under given evidence.

B.2. BN Structure Learning

Structure learning is a process that aims at finding a directed acyclic graph structure that could best characterize the casual relationships between the variables. The BN structure learning process could generally be driven by (1) prior knowledge from relevant research and experts, (2) a data sample, or (3) a combination of them. The combination process is as follows: a BN structure is first formed using machine learning algorithm, given the presence of certain logical relationship determined by prior knowledge, and is then modified and optimized based on experts’ knowledge. Admittedly, the combination method generally surpasses the others by overcoming deficiencies due to overmuch subjectiveness of experts’ knowledge or limited efficiency of machine learning with a large number of nodes and complex network relationship between them, which was adopted in the present study.

With available sample data, there are generally two kinds of methods for BN structure learning: (1) the score-based method, that is, to search the optimal BN structure by using a certain scoring function to assess the matching degree of the BN structure to the sample data, and (2) the constraint-based method, that is, to learn the BN structure by determining the dependency (arc) between nodes based on conditional independency tests. The score-based method is relatively advantageous, that is, featured with uncomplicated learning process, large searching range, and efficient learning, and one of the commonly used score-based methods was adopted, i.e., the Greedy Thick Thinning (GTT) algorithm [93].

The GTT algorithm measures the fitness of a learned BN structure with the given dataset by using the scoring function and obtains the optimal structure after limited iterations. The detailed process of the GTT algorithm is as follows:(1)Add arcs to . Search and add directed arcs that promote the score of the structure until the score stops increasing.(2)Trim arcs of . Search the negative arcs, and trim them off the structure until the score stops increasing. During the above processes, the BN structure is assessed by the K2 scoring function proposed by Cooper and Herskovits [94], which takes the posterior probability as the scoring criteria:For any two temporary BN structures and , there existsAccordingly, the comparison between posterior probabilities of structures () and () could be substituted with the comparison of the corresponding joint probabilities and . Suppose , , , ; then, the joint probability of and could be calculated aswhere is the prior probability of the structure; is the number of the node ; is the value set of the parent node of ; and is the number of cases as the value of the parent node is in the value set when , and .

B.3. BN Parameter Learning

The BN parameter learning process is to determine the conditional probability distribution (CPD) of nodes by learning from the sample data, which is also known as the conditional probability table (CPT). According to the completeness of a sample dataset, the methods for BN parameter learning could normally be categorized into learning methods based on incomplete data or on complete data. Nevertheless, as mentioned above, there were unobserved latent variables, i.e., distance risk perception (DRP) and speed risk perception (SRP), to be incorporated in our BN structure, so the Expectation Maximization (EM) algorithm [95, 96] was especially adopted in this study. EM is an iterative method which alternates between performing an expectation step (E-step) and a maximization step (M-step), which could be detailed as follows.

Suppose is a sample dataset containing a subdataset of unobserved latent variables and a subdataset of observed variables; i.e., . Suppose is the present estimation of the BN structure parameter , and then the posterior probability based on the sample dataset with unobserved variables (missing data) could be calculated to approximate a complete dataset ; then, the estimation of in the right next step could be expected by a log likelihood function as follows:

E-step:

Usually, is denoted as as is completely determined by the constant sample dataset and the present estimation . The next step (M-step) is to seek a that would maximize its expectation , as follows:

M-step:

Data Availability

The data are not publicly available because they are approved and regulated by local government.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

D. N. was responsible for conceptualization, methodology, investigation, funding acquisition, formal analysis, software, original draft preparation, and manuscript review and editing. L. L. was involved in validation, data curation, and visualization. J. N. was involved in validation, data curation, visualization, and manuscript review and editing. Both authors have read and agreed to the published version of the manuscript.

Acknowledgments

This work was funded by the National Natural Science Foundation of China (Grant no. 71901166), the China Scholarship Council (Grant no. 201908420333), and the Plan of Outstanding Young and Middle-Aged Scientific and Technological Innovation Team in Universities of Hubei Province (Grant no. T2020010).