Abstract

This study explores the associations between crash/near-crash (C/NC) events and roadway, driver-related, and environmental factors in naturalistic driving studies (NDS). We used the Naturalistic Engagement in Secondary Tasks (NEST) dataset, which is massive and detailed and contains 50 million miles of naturalistic driving data resulting from the Strategic Highway Research Program 2 (SHRP2). Association rule mining (ARM) is applied to extract the rules for frequently occurring events. The generated association rules are filtered by four metrics (support, confidence, lift, and conviction) and validated by the lift increase criterion. A three-step analysis is performed to obtain a comprehensive understanding of the rules of C/NC events. The 20 most frequent items are first selected to investigate their relationship with the C/NC events. Subsequently, the association rules are used to identify the factors contributing to C/NC events. Finally, correlations between contributing factors and different severities of crashes (I—most severe, II—police-reportable, III—minor crash, and IV—low-risk tire strike) are analyzed by ARM. The results demonstrate that C/NC events occur most frequently on straight and level road segments with no controlled intersections or traffic control devices when drivers are performing secondary tasks. Thus, the reasons for these crashes are carelessness and overconfidence. In addition, a median strip or barrier and a wider road can significantly reduce the frequency and severity of crash events. Moreover, gender, age, average annual mileage, and secondary tasks are highly correlated with the frequency and severity of C/NC events. Drivers with visual-spatial disabilities or crash records are more likely to be involved in the most severe crash events. Near-crash events occur more frequently at higher traffic density and on roads with traffic control devices and controlled intersections. These conditions may keep drivers alert, preventing crashes.

1. Introduction

The National Highway Traffic Safety Administration (NHTSA) data [1] show that approximately 38,680 people died in traffic crashes in the United States in 2020, representing an increase of almost 7.2% compared to the 36,096 fatalities reported in 2019 and the largest number of fatalities since 2007. The increase in traffic crashes has harmed many families, although most of the injuries and deaths could have been averted. Thus, it is essential to determine the correlations between the contributing factors and crash/near-crash (C/NC) events to minimize their occurrence. However, many factors contribute to C/NC events, with latent correlations hidden in the C/NC data. Thus, it is challenging to extract the correlations between the contributing factors and the causes of C/NC events to prevent them. Consequently, traffic safety has become an urgent and crucial topic in transportation research.

Data acquisition is a critical prerequisite for traffic safety studies. Many safety studies [2-4] have focused on extracting associations between C/NC events and roadway features using police report data because these data are easily accessible. However, the lack of available factors, such as driving behavior and driver characteristics, has limited the comprehensiveness of these studies. Therefore, several experimental studies [5-7] have analyzed the impacts of different driving behaviors on C/NC events in a simulated environment. In these studies, dozens of drivers are recruited. For example, in secondary task engagement experiments, participants are asked to perform certain secondary tasks under specific C/NC conditions. Eye movement, heart rate, and vehicle kinetic data are simultaneously recorded during the experiments [8]. Although experimental studies can extract valuable information because of their ability to simulate C/NC conditions, they may not be able to mine the latent rules of C/NC events for two main reasons [9-12]: (1) The participants are equipped with eye-tracking glasses, galvanic skin resistance (GSR) electrodes, wearable sensors, optical probes, and photoplethysmography (PPG) sensors to obtain data from multiple sources. The participants may not feel comfortable in the simulated driving environment due to the equipment. Therefore, the applicability of the experiment’s results is questionable. (2) Obtaining instructions from a computer screen rather than responding to traffic conditions is common in driving simulations. This situation does not accurately represent the real-world driving experience.

Many studies have used observational data to ensure the transferability of the results to real-life conditions [13-16]. Observational studies or naturalistic driving studies (NDS) [17] provide realistic conditions to gather C/NC data for accident analysis and prevention. Multichannel video, sensor, kinematic, and vehicle network data can be obtained from vehicles equipped with a data acquisition system (DAS) in a naturalistic driving setting. The highly detailed and comprehensive dataset is suitable for traffic safety studies and many other research fields.

Detailed and comprehensive datasets have been obtained, representing a solid foundation for traffic safety analysis. Researchers used these datasets and different methods to analyze different aspects of traffic safety. Some researchers used statistical models to reveal the correlations between variables and the occurrence of C/NC events using NDS. For instance, Papazikou et al. [18] investigated vehicle kinematics during crashes to obtain reliable indicators of the time to collision (TTC). Kreusslein et al. [19] focused on the characteristics of mobile phone calls, including the call duration, glance behavior, call type, and mobile phone location, to determine the influence of making mobile phone calls. Schlick et al. [20] used hierarchical regression models to determine the associations between motor vehicle crashes and different contributing factors.

Driving behavior analysis and machine learning methods have been used to identify the cause of C/NC events. Zou et al. [21] predicted vehicle acceleration using behavioral semantic analysis to prevent accidents caused by rapid acceleration. Guo et al. [22] utilized SHapley Additive exPlanation (SHAP) to analyze the importance of features related to crash events; sharp deceleration was the most important feature.

Association rule mining (ARM) has been proposed for crash analysis [23, 24]. ARM is widely used in the traffic safety field because it can reveal the intrinsic relationships between the contributing factors and the accidents without assumptions and significantly outperforms traditional modelling techniques. A summary of the applications of ARM for crash analysis is presented in Table 1.

Several studies [33-37] used ARM for crash analysis under different conditions, such as truck crashes or near crashes. Unlike these studies, we propose a three-step method using the frequent pattern (FP) growth algorithm [38] to mine the correlations between different categorical variables and C/NC events using the Naturalistic Engagement in Secondary Tasks (NEST) dataset [39]. The 20 most frequent items are first selected to determine which features are associated with C/NC events. The association rules describing the factors contributing to C/NC events are then identified. Finally, association rules are used to analyze crash events of different severities. Suggestions for practical applications are provided. The flowchart of the proposed approach is illustrated in Figure 1.

The remainder of this paper is organized as follows. Section 2.1 presents the dataset and preprocessing steps. The methodology is described in Section 2.2, focusing on the principles of the FP growth algorithm and the formulations of four metrics: support, confidence, lift, and conviction. The results are presented and discussed in Section 3, the key findings are summarized and compared with previous studies in Section 4, and conclusions are summarized in Section 5.

2. Materials and Methods

2.1. Data Description
2.1.1. Dataset Overview

We used C/NC data from the NEST dataset [39], which is a subset of the Strategic Highway Research Program 2 (SHRP2) database produced under the collaboration between the Virginia Tech Transportation Institute (VTTI) and the Toyota Collaborative Safety Research Center (Toyota CSRC). This dataset contains high-level data and detailed time-series data on secondary task engagement and distraction-related safety-critical events (SCEs) during real-world driving. The summary data provide information at the event level, and the time-series data provide frame-by-frame detailed information at the millisecond level. We only used the summary data in this study.

The summary data contain information on the event severity of baseline, crash, and near-crash events, with a total of 1080 samples. We did not consider the baseline data because they contain no C/NC events. The duration of the C/NC events was 30 s, including 20 s prior to the event and 10 s following it. The summary data comprised 36 items. The subtasks and environmental conditions were split into three fractions for each 10 s duration, while the driver information and other information were not. After deleting samples with too many missing values, we obtained 699 C/NC event samples.

2.1.2. Variables

The raw summary data of the C/NC events contains 36 categorical variables. Twenty of them were chosen to analyze the patterns of the C/NC events. The remaining 16 variables were not chosen for the following three reasons: (1) a large percentage of missing values, (2) heavily skewed distribution, and (3) overlap in meaning. For example, the stop sign, merge sign, yield sign, slow or other warning signs, and railroad crossing sign variables are included in the raw summary data. However, most of the values are blank because these signs do not occur frequently; thus, the distribution is skewed. In addition, the traffic control variable represents these signs at a higher level. Therefore, these variables were deleted, and only the traffic control variable was used. Note that crucial variables were retained even if they had a skewed distribution or an overlap in meaning.

Some of the chosen variables required aggregation because they contained many attributes, skewing the distribution. Therefore, the attributes of these variables were categorized into a higher level, such as secondary task, traffic density, locality, age group, and annual miles. For example, different secondary tasks (including no secondary task) were aggregated into secondary tasks (yes) and no secondary tasks (no). This approach was different from a previous study [40] because all C/NC events were analyzed comprehensively in this paper rather than focusing on one aspect. More details on the variables are presented in Table 2.
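The aggregation step can be sketched as a mapping from raw attribute values to higher-level items. This is only an illustration: the raw label `"No Secondary Tasks"`, the field names, and the mileage cut points (taken from the 10,000 and 15,000 mile thresholds mentioned in the results) are assumptions, not the NEST coding scheme.

```python
# Hypothetical raw label; the actual NEST attribute names may differ.
NO_TASK_LABEL = "No Secondary Tasks"


def aggregate_secondary_task(raw_value):
    """Collapse every specific secondary task into a binary yes/no item."""
    if raw_value == NO_TASK_LABEL:
        return "secondary_task=no"
    return "secondary_task=yes"


def aggregate_annual_miles(miles):
    """Bin annual mileage into three groups (cut points are assumed)."""
    if miles < 10000:
        return "annual_miles<10000"
    if miles <= 15000:
        return "annual_miles=10000-15000"
    return "annual_miles>15000"


# Made-up events, used only to show the shape of the output.
events = [
    {"secondary_task": "Cell phone, texting", "annual_miles": 8000},
    {"secondary_task": "No Secondary Tasks", "annual_miles": 22000},
]

# Each event becomes one transaction: a set of aggregated items.
transactions = [
    {
        aggregate_secondary_task(e["secondary_task"]),
        aggregate_annual_miles(e["annual_miles"]),
    }
    for e in events
]
```

Aggregating before mining keeps rare attribute values from falling below the support threshold individually.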

2.1.3. Distribution of Attributes

The distribution of attributes is important for hyperparameter selection, such as the support value, and influences the association rules generated by ARM. For example, an attribute that occurs infrequently may fall below a high support threshold; it is then filtered out by ARM and excluded from the association rules, resulting in errors in evaluating the attribute’s contribution to C/NC events.
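A quick way to check how a candidate support threshold interacts with the attribute distribution is to compute the relative frequency of every item and list those that would be dropped. The sketch below uses made-up transactions and hypothetical item names; only the 0.05 default mirrors the value used in this study.

```python
from collections import Counter


def item_frequencies(transactions):
    """Relative frequency of each single item across all transactions."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: count / n for item, count in counts.items()}


def items_below_support(transactions, min_support=0.05):
    """Items that a given minimum support would exclude from all rules."""
    freqs = item_frequencies(transactions)
    return sorted(item for item, f in freqs.items() if f < min_support)


# Toy data: 39 common events and one rare event (all values invented).
toy = [{"straight", "free_flow"}] * 39 + [{"curve", "free_flow", "snow"}]

freqs = item_frequencies(toy)
dropped = items_below_support(toy, min_support=0.05)
```

Here `curve` and `snow` each occur in 1 of 40 transactions (2.5%), so neither can appear in any rule mined with a minimum support of 0.05.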

Figure 2 describes the distribution of attributes for the crash and near-crash events. There were 447 crash events and 252 near-crash events.

Figure 2 shows that (1) most percentages are greater than 0.05, indicating that 0.05 might be a suitable initial support value; (2) some attributes are associated with a higher proportion of crash events than near-crash events, such as no lanes, improper driver behavior, and teenage driving. This implies a correlation between the severity of events and these attributes.

2.2. Methodology

Recent studies used various techniques to conduct pattern mining using large amounts of crash data, such as ARM [36], Bayesian networks [41], neural networks [42], linear regression networks [43], cluster analysis [44], random forests [45], and support vector machine [46]. ARM has the advantage of finding meaningful associations and providing valuable insights into the interdependence between roadway, environmental, and driver-related factors and the frequency and severity of crashes [29]. Besides, ARM is more suitable for discovering patterns in large data volumes than confirming hypotheses [36] and is not influenced by missing values. Thus, it is preferable to machine learning and linear regression methods. Therefore, ARM was chosen to analyze C/NC data.

The Apriori algorithm [23] is considered the most popular and efficient ARM method compared to the weighted classification based on association rule (WCBA) method [47], fast classification based on association rule (FCBA) method [48], and the maximal frequent itemset algorithm (MAFIA) [49]. However, it scans the entire dataset for frequent items, resulting in high computational complexity, especially for a large dataset. The FP growth algorithm [50] is an improvement of the Apriori algorithm that requires only two scans of the database to develop the FP tree. Thus, it can identify frequent items in a large database with a low execution time. Due to the advantages of the FP growth algorithm, it is used here to extract frequent items.
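The two database scans mentioned above can be illustrated with a minimal FP tree construction. This is a simplified sketch, not the implementation used in the study: it builds only the prefix tree and omits the header table, node links, and the recursive mining of conditional trees. The class and function names are hypothetical.

```python
from collections import Counter


class FPNode:
    """One node of the FP tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support=0.05):
    n = len(transactions)

    # Scan 1: count single items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c / n >= min_support}

    # Scan 2: insert each transaction, with its frequent items ordered by
    # descending count (ties broken by name), so common prefixes are shared.
    root = FPNode(None)
    for t in transactions:
        ordered = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


# Toy transactions with abstract item names.
toy = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
root, frequent = build_fp_tree(toy)
```

Because transactions that share frequent items share a path in the tree, later mining works on this compact structure instead of rescanning the raw data.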

In this study, the association rules are mined in two steps: (1) the FP growth algorithm is used to detect frequent item sets and (2) association rules are mined from the frequent item sets.

It is assumed that $I = \{i_1, i_2, \ldots, i_m\}$ is a collection of categorical variables (item sets), and $T = \{t_1, t_2, \ldots, t_n\}$ is a collection of C/NC events (transactions), where $m$ is the number of item sets, which is much greater than $n$, the number of transactions. All association rules are generated based on $I$ and $T$. However, not all the association rules are needed. For example, a rule relating a road with no lanes to the traffic density may have a high support value, but it may not provide any new or meaningful information because a road with no lanes implies a low-grade road unsuitable for high traffic density. Thus, these types of rules should be discarded. In a rule $A \rightarrow B$, $A$ is defined as the antecedent, and $B$ is defined as the consequent. The antecedent and consequent are used to discard meaningless association rules. However, this does not indicate that $A$ is the cause of $B$, that $B$ is the result of $A$, or that $A$ and $B$ have a causal relationship. Four performance metrics are typically used to test the model performance and validity: support, confidence, lift, and conviction. The support indicates how frequently the item set appears in the dataset; it is the ratio of the number of transactions containing the item set to the total number of transactions. The confidence is the percentage of all transactions satisfying $A$ that also satisfy $B$. It is the ratio of the number of transactions including items $A$ and $B$ to the number of transactions including item $A$. The lift of a rule refers to the frequency of items $A$ and $B$ occurring together in a transaction, taking into account the individual frequencies of item $A$ and item $B$. The lift value reflects the correlation between $A$ and $B$ in the association rules. When the lift value is greater than 1, the higher the value, the stronger the positive correlation between $A$ and $B$ is. When the lift value is less than 1, the lower the value, the stronger the negative correlation between $A$ and $B$ is. When the lift value is equal to 1, there is no correlation between $A$ and $B$. A rule with a single antecedent and a single consequent is referred to as a 2-item rule. Similarly, a rule with $k-1$ antecedents and a single consequent is denoted as a $k$-item rule, where $k$ is the sum of the number of antecedents and the number of consequents. The support, confidence, lift, and conviction are computed as follows:

$$\mathrm{Support}(A \rightarrow B) = P(A \cap B),$$

$$\mathrm{Confidence}(A \rightarrow B) = \frac{P(A \cap B)}{P(A)},$$

$$\mathrm{Lift}(A \rightarrow B) = \frac{P(A \cap B)}{P(A)\,P(B)},$$

$$\mathrm{Conviction}(A \rightarrow B) = \frac{1 - P(B)}{1 - \mathrm{Confidence}(A \rightarrow B)},$$

where $A$ is the antecedent, $B$ is the consequent, $P(X)$ is the percentage or probability of a transaction containing item $X$, $\mathrm{Support}(A \rightarrow B)$ is the support value of the association rule $A \rightarrow B$, $\mathrm{Confidence}(A \rightarrow B)$ is the confidence value of the association rule $A \rightarrow B$, $\mathrm{Lift}(A \rightarrow B)$ is the lift value of the association rule $A \rightarrow B$, and $\mathrm{Conviction}(A \rightarrow B)$ is the conviction value of the association rule $A \rightarrow B$.
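The four metrics can be computed directly from transaction counts. The sketch below uses the standard definitions (support as the share of transactions containing both sides of the rule, confidence as support divided by the antecedent frequency, lift as confidence divided by the consequent frequency, and conviction as the ratio of one minus the consequent frequency to one minus the confidence). The function name, item names, and toy data are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, lift, and conviction of antecedent -> consequent."""
    a = frozenset(antecedent)
    b = frozenset(consequent)
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    p_a = sum(a <= t for t in sets) / n          # P(A)
    p_b = sum(b <= t for t in sets) / n          # P(B)
    p_ab = sum((a | b) <= t for t in sets) / n   # P(A and B)
    if p_a == 0:
        raise ValueError("antecedent never occurs in the transactions")

    confidence = p_ab / p_a
    lift = confidence / p_b if p_b > 0 else float("inf")
    # Conviction is unbounded when the rule never fails (confidence = 1).
    if confidence >= 1.0:
        conviction = float("inf")
    else:
        conviction = (1.0 - p_b) / (1.0 - confidence)

    return {
        "support": p_ab,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


# Toy data: 10 invented events.
toy = (
    [{"undivided", "crash"}] * 4
    + [{"undivided"}]
    + [{"crash"}] * 2
    + [{"divided"}] * 3
)
metrics = rule_metrics(toy, {"undivided"}, {"crash"})
```

In this toy example, P(A) = 0.5, P(B) = 0.6, and P(A and B) = 0.4, which gives a confidence of 0.8, a lift of about 1.33, and a conviction of 2.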

The “mlxtend” package in Python 3.7 is used to implement the FP growth algorithm for frequent items and mine the association rules with a minimum support value of 0.05 and a minimum confidence value of 0.05 as hyperparameters.
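The two-step procedure (frequent item sets first, rules second) with the thresholds stated above can be sketched in plain Python. This sketch does not call mlxtend and does not implement FP growth: it enumerates item sets by brute force, which yields the same frequent item sets on small data but does not scale. The function name, the `max_len` cap, and the toy data are assumptions made for illustration.

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.05, min_confidence=0.05, max_len=3):
    """Brute-force stand-in for frequent item set mining plus rule generation."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)
    items = sorted({item for t in sets for item in t})

    # Step 1: keep every item set whose support reaches the threshold.
    frequent = {}
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            support = sum(itemset <= t for t in sets) / n
            if support >= min_support:
                frequent[itemset] = support

    # Step 2: build rules with k-1 antecedents and a single consequent.
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in sorted(itemset):
            antecedent = itemset - {consequent}
            confidence = support / frequent[antecedent]
            if confidence < min_confidence:
                continue
            p_consequent = frequent[frozenset([consequent])]
            lift = confidence / p_consequent
            if confidence >= 1.0:
                conviction = float("inf")
            else:
                conviction = (1.0 - p_consequent) / (1.0 - confidence)
            rules.append({
                "antecedent": antecedent,
                "consequent": consequent,
                "support": support,
                "confidence": confidence,
                "lift": lift,
                "conviction": conviction,
            })

    # Rank rules by lift, from high to low.
    rules.sort(key=lambda r: r["lift"], reverse=True)
    return rules


# Toy data: 10 invented events.
toy = (
    [{"undivided", "crash"}] * 4
    + [{"undivided"}]
    + [{"crash"}] * 2
    + [{"divided"}] * 3
)
rules = mine_rules(toy, min_support=0.05, min_confidence=0.05)
```

Every subset of a frequent item set is itself frequent, so the lookups `frequent[antecedent]` and `frequent[frozenset([consequent])]` always succeed. Sorting by lift mirrors the way the top rules are selected in the result tables.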

3. Results

3.1. Frequency Analysis

The 20 most frequent items were selected to determine which features the C/NC events are associated with. As shown in Figure 3, the most frequent item is no driver impairment, and the second most frequent item is secondary tasks, indicating that most drivers are driving normally, and secondary tasks are highly associated with crash events. In addition, the most frequent items related to the road are a straight road, level road, and no controlled intersections. It can also be deduced from Figure 3 that the C/NC events are highly associated with driving normally and are associated with performing secondary tasks on straight and level road segments with no controlled intersections. These conditions are common in real life and have the highest probability of crashes.

Figures 4(a) and 4(b) show the frequency plots for crash events and near-crash events, respectively. Several differences are observed in these two plots: (1) the secondary task is the most frequent item contributing to crash events with a frequency of 94.85%, whereas this item ranks fourth for near-crash events with a frequency of 90.47%, indicating that secondary tasks are frequently associated with crash events. (2) The number of travel lanes less than or equal to 2 ranks eighth for crash events (frequency of 62.64%), and the number of travel lanes between 2 and 7 ranks seventh for near crashes, with a frequency of 71.43%, indicating that the probability of a crash is higher for fewer lanes. (3) Free flow ranks 12th for crash events, with a frequency of 66.67%. This result suggests that a free traffic flow may keep the drivers over-confident, causing crashes. (4) Improper behavior ranks 13th for crash events and is not correlated with near-crash events. Thus, improper behavior occurs more frequently in crash events. (5) An annual mileage of less than 10000 miles is associated with crash events, and an annual mileage greater than 15000 miles is more frequently associated with near-crash events, indicating that drivers with more driving experience are less likely to be involved in crashes.

3.2. Model Performance and Descriptive Statistics of the Parameters

We created two-key plots [30] to visualize the patterns extracted from the association rules of the C/NC events. The FP growth algorithm generated 142,794 rules for crash events and 18,759 rules for near-crash events, with a minimum support value of 0.05 and a minimum confidence value of 0.05. Because there are numerous association rules, we randomly selected some to show the pattern. We merged the 3-item rules and 4-item rules as well as the 5-item rules and 6-item rules. In Figure 5, the range of support values for the 2-item rules is 0.05 to 0.6, and the confidence values of these rules exceed 0.4. For the 3-4-item rules, the range of support values is 0.05 to 0.5, and the confidence values also exceed 0.4. The 5-6-item rules have a similar trend, but the maximum support value is less than 0.25.

Figure 6 shows the two-key plots for the rules of the near-crash events. The range of the support values is 20% smaller, and the confidence value range for the majority of rules of the near-crash events is 80% lower than in Figure 5.

3.3. Obtaining the Patterns from the Association Rules of the C/NC Events
3.3.1. Crash Event Patterns

Table 3 presents the 25 top rules selected from 142,794 rules according to the lift value (from high to low) for crash events. The 6-item rule is used as an example. A male driver on an undivided road with fewer than 2 lanes is more likely to be involved in a crash when engaging in improper behavior, such as aggressive driving, even if he has no violations. The corresponding support, confidence, lift, and conviction values can be interpreted as follows: the support value indicates that only 5.3% of crash events contain these five items. The confidence value indicates that if an event contains the five items, it is a crash event. The lift value shows that the percentage of crash events with these five items is 1.564 times higher than that of other crash events in the dataset. The conviction indicates the relationship between antecedents and consequents; the higher the conviction, the stronger the relationship is.

The rules for crash events are summarized from three aspects: (1) road: roadways with no lanes or undivided roads (rules 1, 6, 7, 8, 9, 10, and 21), roads with less than two lanes (rules 6, 16, 17, 20, 21, 22, 23, 24, and 25), and level roads (rules 7 and 24) are more likely to be associated with crash events. (2) Driver: young (rules 3 and 15) female (rule 21) participants with minor visual-spatial disabilities (rule 18) and an estimated average annual mileage over five years of less than 10,000 miles (rules 11, 13, 14, and 15) are more likely to be associated with crash events when secondary tasks (rule 23), improper behavior (rules 12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, and 25), or impairments (rule 8) are observed. Note that the number of traffic violations and previous crash involvement are not significantly correlated with crash events (rules 16, 17, 18, 21, 22, 23, 24, and 25). (3) Environment: crash events occur more frequently when the traffic density is free flow (rules 4, 11, 13, 19, and 20), there is no traffic control (rules 9 and 19) or controlled intersection (rules 10 and 25), and the area is residential (rules 5, 12, 13, 14, and 15) or business/industrial (rules 16, 22, 23, 24, and 25). Note that sudden unexpected events, such as braking of a lead vehicle, animals or pedestrians entering the roadway at a nonmarked location, or a vehicle swerving in front of the driver, do not contribute significantly to crash events (rules 12, 17, 18, 19, and 20).

The likely reasons for these results are as follows. Undivided roads or roads with fewer than two lanes are typically low-grade roads. Young drivers have less driving experience and are more likely to underestimate the danger of driving on these road segments, especially when there are no vehicles, traffic control, or intersections to interrupt driving. Under these conditions, drivers can be involved in crashes when they suffer from fatigue or perform secondary tasks or improper behavior.

3.3.2. Near-Crash Event Patterns

Table 4 presents the 25 top rules selected from 18,759 rules according to the lift value (from high to low) for near-crash events. The first 6-item rule is used as an example. When a driver is affected by the interactions with others in traffic, the driver’s speed is influenced. In addition, maneuvering in stable flow requires substantial vigilance by the driver, and the general comfort level declines. A young man driving on a wide road in a business/industrial area is more likely to be involved in a near-crash event when he is performing secondary tasks. The corresponding metrics are a support of 5%, a confidence of 94.6%, a lift of 2.264, and a conviction of 11.83. This can be interpreted as follows: the support value indicates that only 5% of near-crash events contain these five items. The confidence value shows that an event containing the five items has a 94.6% probability of being a near-crash event. The lift value demonstrates that the percentage of near-crash events with these five items is 2.264 times higher than that of other near-crash events in the dataset. The consequent depends significantly on the antecedent because the conviction value (11.83) is higher than those of the other rules.

The rules for near-crash events are summarized from three aspects: (1) road: level roads (rules 9, 19, 22, 23, and 24), divided roads (median strip or barrier) (rule 3), roads with 2 to 7 lanes (rules 4, 11, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, and 25), and straight roads (rules 10, 20) are more likely to be associated with near-crash events. (2) Driver: middle-aged and older (rules 2, 5, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, and 25) male (rule 13) participants with an estimated average annual mileage over five years of more than 15,000 miles (rules 6, 7, 8, 9, 10, 12, and 15) are more likely to be associated with near-crash events when they are performing secondary tasks (rules 8, 18, 21, and 25). Note that driver impairments (rules 7, 17, 23, 24, and 25), driver behavior (rules 12, 23), or unexpected events (rule 6) are not correlated with near-crash events. (3) Environment: near-crash events occur more frequently when the traffic flow is stable (rules 13, 14, 16, 21, and 22) or unstable/forced (rule 1), and the area is business/industrial (rules 11, 16, 17, 18, 19, 20, 21, 22, 24, and 25).

The likely reasons for these results are as follows. Divided roads and more lanes have fewer crashes. However, the high traffic density limits the drivers’ freedom to maneuver, making them irritable in a stable or unstable/forced traffic flow. The drivers are inclined to overtake and accelerate frequently under these conditions and underestimate the danger, especially older drivers with higher confidence in their driving experience. If they perform secondary tasks and their attention is distracted, near-crash events are likely to occur.

3.3.3. Comparison of the C/NC Patterns

A comparison of the C/NC patterns is performed from three aspects: (1) road: divided roads, roads with no lanes, and the number of lanes are the main differences between the C/NC patterns. Crash events are less likely to occur on divided roads with more than 2 lanes. (2) Driver: the age group and annual miles are two significant factors in C/NC events. Drivers associated with crash events are predominantly young (16-24 years old) with relatively little driving experience, whereas drivers involved in near-crash events are more likely to be older (20-64 years old) with more driving experience. In addition, drivers are more likely to be associated with crash events when engaging in improper behaviors, such as aggressive driving and drunk driving, whereas secondary tasks are more influential in near-crash events. (3) Environment: crash events are more likely to occur in free flow, when the comfort level of drivers is high, in areas without traffic control or controlled intersections, and in residential or business/industrial areas. Near-crash events are more common in stable traffic flow or unstable/forced flow in business/industrial areas. The likely reason is that high traffic density keeps drivers alert, preventing crashes.

Near-crash events occur due to a combination of factors (i.e., traffic density levels, secondary tasks, and improper driving behavior). Although near-crash events do not result in economic loss or casualties, some risk factors can turn near-crash events into crash events. Thus, it is necessary to discuss the relationship between crash and near-crash events and determine which conditions change near-crash events to crash events: (1) road: crash events are more likely to occur on narrow roads, whereas near-crash events are more likely to occur on wide roads. Thus, we assume near-crash events may change into crash events because of changes in the road features from urban to rural area roads or from main roads to bypasses. (2) Driver: older drivers are more likely to be involved in near-crash events rather than crash events; however, if they perform improper driving behavior, a near-crash event may become a crash event. (3) Environment: Bernat et al. [51] found that night-time single vehicle crashes (SVCs) were strongly related to drunk driving, and improper driving behavior was more likely when there were no vehicles nearby. Thus, improper driving behavior might increase the probability of turning near-crash events into crash events in free flow.

3.3.4. Patterns of Four Types of Crash Events

The association rules between different categorical variables and the severity of crash events are analyzed, and crash events are categorized into severity levels: I—most severe, II—police-reportable, III—minor crash, and IV—low-risk tire strike. Note that the definition of the four severity levels of crash events is derived from the NEST [39] dataset. Forty association rules are considered according to the lift value (Table 5).

Undivided roadways (rules 15, 23, 24, 31, 32, 39, and 40) are strongly associated with level IV (low-risk tire strike) events. However, this does not indicate that a low-risk tire strike causes severe crash events. Straight roads (rules 14, 21, 29, 30, 31, 33, 35, 36, 37, 38, and 40) rarely appear in 2-item, 3-item, or 4-item rules but appear more commonly in 5-item and 6-item rules. It is assumed that crashes rarely occur on straight road segments alone. However, crash events are more likely when a straight road is combined with other antecedents. Similar to the straight road segment, level road segments (rules 17, 21, 22, 23, 24, 25, 26, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, and 40) combined with other factors have an increased likelihood of crash events. Level II (police-reportable) events are more likely on roads with less than two lanes (rules 12, 19, 27, and 28). Level III (minor crash) events are more likely on roads with more than two lanes (rule 13), indicating that widening the roadway can reduce the frequency and severity of crash events.

Male drivers (rules 2, 10, 11, and 20) are more likely to be associated with level I (most severe) and level II (police-reportable) events. Drivers with one crash record during the past five years (rules 1 and 9) are more likely to be associated with level I (most severe) events. The age group (rule 4) does not show a strong correlation with the crash severity. Drivers with annual miles greater than 15,000 miles (rules 5, 14, 21, 22, 29, 30, 37, and 38) have a low correlation with severe crash events, indicating that drivers with more driving experience drive more safely. Minor visual-spatial disabilities do not show a strong correlation with crash events. However, they are strongly associated with level I (most severe) events. We speculate that minor visual-spatial disabilities do not affect driving significantly. However, when a crash is about to occur, drivers with visual-spatial disabilities (rules 10, 17, 18, 25, 26, 33, and 34) may have more difficulty handling the situation. Thus, the crash events are typically more severe. Driver impairments (rules 9, 18, 20, 25, 26, 27, 32, 33, 34, 35, 36, 38, 39, and 40) and improper behavior (rules 26 and 33) are not strongly correlated with the severity of crash events, whereas performing secondary tasks (rules 28, 30, 35, 37, and 38) is associated with more frequent level II (police-reportable) and level III (minor crash) events.

Driving in residential areas and other areas (rules 3 and 6) is more likely to be associated with level II or III crash events. However, driving in business/industrial areas (rules 15, 23, 31, and 40) is more likely to be associated with level IV (low-risk tire strike) events. Level I (most severe) events (rules 17, 18, and 25) and level IV (low-risk tire strike) events (rules 8 and 16) are more likely to occur when the traffic flow is stable. Level II (police-reportable) events are more likely to occur in free flow (rules 12, 19, 27, and 28). Interruptions due to traffic control (rules 13, 24, 32, 34, 36, and 39) or controlled intersections (rules 19, 22, 27, 28, 29, 34, 36, 37, and 39) do not affect the severity of crash events.

4. Findings and Discussion

The key findings are summarized as follows:

(1) Road
(a) Undivided roadways are more likely to be associated with crash events, especially level IV (low-risk tire strike) events. In contrast, divided roadways are more likely to be associated with near-crash events. It is assumed that a median strip or barrier could prevent crashes.
(b) Roads with less than 2 lanes are highly correlated with crash events, especially level II (police-reportable) events. Roads with 2-7 lanes are highly correlated with near-crash events or lower-severity crash events. Wider roadways are recommended to reduce the frequency and severity of crash events.
(c) Crash events mainly occur on level roads, whereas near-crash events mainly occur on straight roads. However, this factor is only related to C/NC events in combination with other factors.

(2) Driver
(a) Female drivers have a low correlation with low-severity crash events, whereas male drivers have a high correlation with severe crash and near-crash events.
(b) Young drivers have a higher likelihood of being involved in crash events, whereas middle-aged and older drivers show a stronger association with near-crash events. However, the driver’s age is not highly correlated with the severity of crash events.
(c) Crash events are more likely to occur when the drivers’ estimated average annual mileage during the past five years is less than 10,000 miles. Near-crash events are more likely to occur when the drivers’ average annual mileage during the past five years is greater than 15,000 miles. It is assumed that drivers with more driving experience have a safer driving style.
(d) Performing secondary tasks is highly correlated with crash events (especially level II (police-reportable) and level III (minor crash) events) and near-crash events.
(e) Improper behavior is linked to crash events, whereas driver impairments are not. Neither factor is strongly correlated with the severity of crash events.
(f) The number of traffic violations or crash records is not strongly correlated with the frequency of C/NC events. However, drivers with one crash record during the past five years are more likely to be associated with level I (most severe) events.
(g) Minor visual-spatial disabilities are not strongly correlated with crash events but are strongly correlated with level I (most severe) events. It is assumed that minor visual-spatial disabilities do not affect driving significantly. However, during a crash event, drivers with visual-spatial disabilities may have problems handling the situation; thus, the crash event is typically more severe.

(3) Environment
(a) Crash events are more likely to occur in free flow traffic, and near-crash events are more likely in stable or unstable/forced flow. The results suggest that a higher traffic density keeps drivers alert, preventing crashes.
(b) Crash events are more likely in sections with no traffic control or controlled intersections. However, these factors do not affect the severity of crash events.
(c) Residential or business/industrial areas have a higher correlation with C/NC events than other areas. More traffic safety precautions should be considered in these areas.

A comparison of our key findings with those of three similar studies is summarized in Table 6.

We analyzed the associations of various factors with C/NC events and with crash severity. The following was observed: (1) Road: Kong et al. [30] found associations between near-crash events and roads with median strips. Yu et al. [25] observed that most crashes occurred in urban areas on undivided roads. We also found that a median strip reduced the frequency and severity of crash events. Yu et al. [25] reported that crashes were more likely on straight road sections, similar to our study. However, we found that crashes were associated with straight road sections only in combination with other factors. (2) Driver: similar to most other studies, we found that gender, age, improper driving behavior, and secondary tasks were correlated with C/NC events. In contrast to other studies, we observed that only severe crashes were correlated with minor visual-spatial disabilities. Thus, we speculate that minor visual-spatial disabilities do not significantly affect driving. However, in a serious crash, drivers with visual-spatial disabilities may be more likely to lose control. (3) Environment: Kong et al. [30] found that drivers had shorter reaction times in inclement weather, and clear weather was associated with killed or seriously injured (KSI) crashes. Similarly, we observed that crash events were more likely on road sections without traffic control or controlled intersections in residential or business/industrial areas, suggesting that accidents often occur under the most common road conditions.

5. Conclusions

This study investigated the correlations between C/NC events and driver, road, and environment-related categorical variables, such as secondary tasks, road conditions, and traffic density. We used the FP-growth ARM algorithm to obtain new insights into C/NC events. The patterns of C/NC events were analyzed to determine which variables were associated with them. This paper provides two major contributions. First, we used a large dataset containing categorical variables collected from naturalistic driving studies, including driver, vehicle, and environment-related data. Therefore, we believe that our results are robust and unbiased. Second, a framework was developed to mine the association rules of C/NC events and of crash events with different severities. In many cases, multiple variables were associated with C/NC events. We used the support, confidence, lift, and conviction metrics to measure the strength of association between the antecedent variables of each rule and the outcome.
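The four metrics named above can be illustrated with a minimal Python sketch. It uses the standard textbook definitions (support is the share of events containing all items of the rule, confidence is the share of antecedent events that also contain the consequent, lift is confidence divided by the consequent's support, and conviction is (1 - support of the consequent) / (1 - confidence)). The function name `rule_metrics`, the item labels, and the five toy events are hypothetical and are not taken from the NEST data or from the authors' implementation.

```python
from typing import Dict, Iterable, List, Set


def rule_metrics(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> Dict[str, float]:
    """Compute support, confidence, lift, and conviction for one rule A -> C."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(transactions)

    # Count the events that contain the antecedent, the consequent, and both.
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)

    support = count_ac / n
    confidence = count_ac / count_a if count_a else 0.0
    support_c = count_c / n
    lift = confidence / support_c if support_c else float("inf")
    # Conviction is unbounded when the rule never fails (confidence == 1).
    conviction = (
        (1.0 - support_c) / (1.0 - confidence) if confidence < 1.0 else float("inf")
    )
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


# Hypothetical toy events: each set lists the categorical items of one event.
events = [
    {"undivided", "secondary_task", "crash"},
    {"undivided", "secondary_task", "crash"},
    {"undivided", "near_crash"},
    {"divided", "secondary_task", "near_crash"},
    {"divided", "near_crash"},
]

if __name__ == "__main__":
    print(rule_metrics(events, {"undivided"}, {"crash"}))
```

A lift above 1 indicates that the antecedent and the outcome co-occur more often than expected under independence, which is why lift is a natural filter when many candidate rules are generated.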

Interesting correlations were observed between the categorical variables and C/NC events, and differences were revealed between crash and near-crash events. The top 5-item rules for crash events and near-crash events serve as examples. In these two association rules, the number of travel lanes and the locality were significantly correlated with the occurrence of C/NC events. However, the correlation strength differed between categorical variables. Drivers with an aggressive driving style were more likely to be involved in a crash when driving on roads with fewer than two lanes in a business/industrial area. Drivers in a business/industrial area on roads with more than two lanes in stable traffic were more likely to be involved in near-crash events.
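As a rough illustration of how a top rule of a given size could be picked for each outcome, the sketch below filters a rule list by consequent and item count and keeps the rule with the highest lift. Ranking by lift, counting the consequent as one of the five items, the `Rule` class, the `top_rule` helper, the item labels, and all lift values are assumptions made for this example; the text does not state the ranking criterion, and the numbers are not results from the study.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Rule:
    antecedent: FrozenSet[str]  # contributing factors
    consequent: str             # outcome, e.g. "crash" or "near_crash"
    lift: float                 # hypothetical value, for illustration only


def top_rule(rules: List[Rule], outcome: str, n_items: int) -> Optional[Rule]:
    """Return the highest-lift rule for `outcome` with `n_items` items in total.

    Assumption: the item count includes the single consequent item.
    """
    candidates = [
        r for r in rules
        if r.consequent == outcome and len(r.antecedent) + 1 == n_items
    ]
    return max(candidates, key=lambda r: r.lift, default=None)


# Hypothetical rules; labels and lift values are invented for this sketch.
rules = [
    Rule(frozenset({"aggressive_driving", "lanes_lt_2",
                    "business_industrial", "no_traffic_control"}), "crash", 2.0),
    Rule(frozenset({"lanes_lt_2", "business_industrial"}), "crash", 3.0),
    Rule(frozenset({"lanes_gt_2", "business_industrial",
                    "stable_flow", "divided"}), "near_crash", 1.2),
    Rule(frozenset({"lanes_gt_2", "residential",
                    "stable_flow", "divided"}), "near_crash", 1.1),
]

if __name__ == "__main__":
    for outcome in ("crash", "near_crash"):
        print(outcome, top_rule(rules, outcome, n_items=5))
```

Restricting the comparison to rules of the same size keeps shorter, more general rules (such as the 3-item crash rule above) from crowding out the longer rules being compared.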

This study is expected to provide useful information for future research on C/NC events using ARM methods, as well as suggestions for traffic engineers to improve road safety and prevent accidents. However, this study has three limitations. First, we did not include all rules in the analysis because of the large number of generated rules. Second, although we included a wide range of categorical variables and extracted the association rules between these variables and C/NC events, we did not evaluate the correlations between the categorical variables or distinguish between subcategories within a variable. For example, many researchers have found that performing specific secondary tasks, such as using a phone or talking to passengers while driving, significantly increases driving risk, whereas we aggregated all secondary tasks into one category. Third, some important categorical variables were discarded for the reasons described in Section 2.2, although they may have influenced the C/NC events. These limitations will be addressed in future studies.

Data Availability

The Naturalistic Engagement in Secondary Tasks (NEST) data used to support the findings of this study have been deposited in the SHRP2 Naturalistic Driving Study repository (doi:10.15787/VTT1/OZQ6BL).

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

Thanks are due to the SHRP2 Naturalistic Driving Study for collecting and providing the detailed dataset.