Analysis of Rear-End Crash on Thai Highway: Decision Tree Approach
Objective. Among crash types on Thai highways, rear-end crashes cause the largest number of fatalities. This study aims to identify ways to reduce both rear-end crashes and fatal rear-end crashes. Methods. Classification and regression tree (CART) analysis was used to examine the complex relationships among variables in a large dataset. Two models were created: (1) a model that identifies the causes of rear-end crashes by applying Quasi-Induced Exposure to at-fault driver characteristics; (2) a model that identifies the factors associated with fatal crashes. Results. In the model of at-fault and not-at-fault drivers, driver age was the most important predictor variable, followed by number of lanes and median opening area. In the fatality model, the use of safety equipment was the most important predictor. Conclusion. The model results can be used to develop guidelines for public awareness programs for motorists and to propose policy changes to the Department of Highway in order to reduce the severity of rear-end crashes. Moreover, this paper discusses the variables that affect both the number of rear-end crashes and the fatality rate of rear-end crashes, as directions for future research.
Crashes on Thailand's highways are continuously increasing [1–3]. Crash type statistics reveal that rear-end collision is the second most common type of collision, yet it accounts for the highest number of fatalities (Figure 1). Therefore, strategies to decrease the number and severity of rear-end crashes are urgently needed.
There are two important issues in the study of rear-end crashes at intersections. First, a study of the causes of rear-end crashes, focusing on at-fault and not-at-fault drivers, has found that most crashes are caused by drivers not leaving enough space between their own car and the car in front. The following car is therefore regarded as the cause of a rear-end collision. This study focuses on the characteristics of the at-fault driver, that is, the driver of the following car that crashes into the car in front, by applying Quasi-Induced Exposure Methods. These methods have been widely used in traffic accident research. Their principle is to identify the at-fault driver from the accident report [7, 8] and to assume that the distribution of not-at-fault drivers closely represents the distribution of exposure to accident hazards [9, 10]. Second, this research explores the high fatality rate of rear-end crashes. Fatal crashes must be examined through the characteristics of rear-end crashes, with a focus on ways to reduce fatalities. Sullivan and Flannagan studied fatal crash risks and found that darkness is the risk factor causing the greatest number of fatalities, because vehicles parked along the roadside cannot be seen. Wiacek et al. found that the greater the difference in velocity between the struck car and the striking car, the higher the number of rear-end crash fatalities. If a truck is involved in a rear-end crash, the chances of fatality increase further.
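The quasi-induced exposure idea described above can be illustrated with a short calculation. The sketch below is not taken from the study: it assumes a hypothetical table of driver counts by age group (all numbers are invented) and computes the relative accident involvement ratio commonly used with this method, that is, a group's share of at-fault drivers divided by its share of not-at-fault drivers. The function name and the age bands are assumptions made for illustration.

```python
def relative_involvement_ratio(at_fault, not_at_fault):
    """Return {group: share of at-fault drivers / share of not-at-fault drivers}.

    Under quasi-induced exposure, the not-at-fault share stands in for the
    group's exposure, so a ratio above 1 suggests over-involvement.
    """
    total_at_fault = sum(at_fault.values())
    total_not_at_fault = sum(not_at_fault.values())
    ratios = {}
    for group in at_fault:
        at_fault_share = at_fault[group] / total_at_fault
        exposure_share = not_at_fault[group] / total_not_at_fault
        ratios[group] = at_fault_share / exposure_share
    return ratios


# Hypothetical counts, invented for illustration only (not DOH data).
at_fault_counts = {"under_21": 120, "21_to_60": 800, "over_60": 80}
not_at_fault_counts = {"under_21": 80, "21_to_60": 840, "over_60": 80}

ratios = relative_involvement_ratio(at_fault_counts, not_at_fault_counts)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```

In this invented example the youngest group has a ratio above 1, which is how over-involvement as the at-fault party would show up under the method's exposure assumption.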
Numerous factors contribute to rear-end crashes. These include driver characteristics that affect driving decisions (gender, age, alcohol use) [13–15], environment (time, weather conditions), roads (surface condition, physical characteristics) [16, 17], vehicle type, and number of traffic lanes [14, 19, 20].
Previous research has found that factors causing death in rear-end crashes include driver characteristics affecting braking, such as gender, age, and alcohol or substance abuse. Use of a seatbelt has been found to be another important contributing factor to rear-end fatalities. Vehicle type is an important factor in all accident types, but especially in rear-end crashes. If the types of vehicle involved in a crash are very different, the chances of a severe outcome are higher. Speed limit factors also affect the severity of the crash. Other important characteristics of fatal crashes are physical road characteristics and visibility.
A statistical analysis of rear-end crashes involves independent variables, such as weather conditions, vehicle type, and seat belt use, and dependent variables, such as at-fault versus not-at-fault driver and fatal versus nonfatal rear-end crash. Distribution analysis has been widely used to establish whether an estimated parameter exists. If there is an estimated parameter, the relationship between independent and dependent variables is considered. If there is no estimated parameter, data are investigated proportionally. Yan and Radwan have stated that parametric analysis (binary logistic regression) is limited by the difficulty of using it to investigate the relationship between two variables. An appropriate alternative is therefore a nonparametric method such as the decision tree or classification tree (DT). This is an algorithm that partitions the data into proportions according to specified dependent variables (also known as data mining). DT is therefore suitable for analyzing complex independent data [9, 26]. A decision tree is a structure that includes a root node, branches, and leaf nodes. Yan and Radwan used DT to study rear-end crash data in Florida, USA, by analyzing two models. The first was an analysis of which accidents involved rear-end crashes, and the second was an analysis of the driver characteristics of individuals who could potentially become at-fault drivers.
In choosing a model for this study, other models that can analyze the relationship between independent variables and target or categorical variables were considered. A traditional and widely used model is multiple logistic regression. Another common model is the multinomial logit model, together with the related nested logit model (NLM), which can examine hierarchical dependent variables. The odds ratio is used to interpret probability. The advantage of these methods is the ability to compare the effects of explanatory variables on dependent variables, especially when independent variables are statistically significant. Their limitation, however, is their inability to find relationships between explanatory variables. The Decision Tree Model (DT) potentially solves this problem. As mentioned earlier, rear-end crashes cause a high number of fatalities. The DT model simultaneously identifies relationships between independent variables, which may allow the findings to be applied to policy development. For example, examining whether drivers of different ages in different numbers of traffic lanes differ in their role in a crash (at-fault/not-at-fault) can inform the development of effective policy. Research by Khan et al., which compared DT with an ordinal discrete choice model, confirmed that DT can help to address issues of multicollinearity and variable redundancy.
Among studies that have analyzed rear-end crashes (Table 1), most have analyzed crash frequency, followed by crash severity (fatal/nonfatal). One study has analyzed both crash frequency and severity outcome. However, the crash data used in that study came from a country whose roads, conditions, and driver behaviors differ from those of Thailand, leading to a very different model. No focused study of highway crashes in Thailand has applied the DT model to reducing both the number of rear-end crashes and the number of fatal rear-end crashes. This research discusses the consistency of the two models with respect to crash number and fatalities, and compares them with previous studies as a guideline for future research.
2. Highway Crash Reporting
This study used Department of Highway (DOH) road accident data from 2011 to 2015. These data included dates, road segments, physical characteristics of accident scenes (e.g., straight road, curved road, work zone, median, intersection), environmental conditions (e.g., rain, lighting conditions, time of accident), cause-and-effect data (e.g., driving over the speed limit) and injury data (including fatalities, serious injuries and minor injuries). The information provided by the DOH may not cover all accidents. In cases of minor collisions, where victims came to an agreement, accidents were not recorded.
Rear-end collisions were selected from these data and divided into three main types according to the movement of the front car prior to the collision. The three types are (1) going straight, with the front car traveling at normal speed, (2) decelerating, with the front car slowing down, such as when turning or executing a u-turn, and (3) stopping, with the front car parked on the roadside or on the hard shoulder, or stopped at traffic lights. After screening, there were 2,096 cases of rear-end collision. As vehicle data had to be considered in this analysis, driver and vehicle factors were added to the model. The dataset comprised 5,445 vehicles involved in accidents.
Descriptive statistics, shown in Table 2, define the dependent variables: (1) the at-fault driver is the driver of the striking car, while the not-at-fault driver is the driver of the struck vehicle; (2) a fatal rear-end crash is a collision with at least one fatality, either at the accident scene or at the hospital, while a nonfatal rear-end crash is a rear-end crash without a fatality. All 22 independent variables of the two models are tabulated against the values of the two dependent variables. The data description helps to illustrate the overall picture created by the data [31, 32]. After cleaning the data for driver exposure, there were 2,458 at-fault drivers and 2,096 not-at-fault drivers. With regard to crash fatalities, 1,156 vehicles were involved in fatal rear-end crashes, and 3,396 vehicles were involved in nonfatal rear-end crashes. According to vehicle type (Veh_type), medium vehicles, such as private cars and pickup trucks, had a 28.0% chance of being at-fault. With regard to fatal rear-end collisions, large vehicles accounted for 11.7% of collisions, and small vehicles, such as motorcycles, for 8.5% (Figure 2(a)). Light condition (env_light) was the dominant environmental factor affecting fatalities, with 42.2% of rear-end collision fatalities occurring at night in the absence of light, 27.9% occurring at night with light, and 22% occurring in the daytime (Figure 2(b)).
The distribution of continuous variables is shown in Table 3. With regard to driver age, there was little difference between at-fault drivers and not-at-fault drivers: the average age of at-fault drivers was 38.04 years and that of not-at-fault drivers was 38.58 years. The mean proportion of trucks was 18.2% for fatal rear-end crashes and 16.2% for nonfatal rear-end crashes. The DT model was then used for further analysis. Its predictions can be presented as logical if-then conditions at the terminal nodes. Thus, the data did not need to be normally distributed, and the relationships between independent and dependent variables did not need to be linear.
Relationships between the independent variables are shown in the pairwise correlation coefficient matrix (Table 4). Two highly correlated pairs were found: (1) the road surface factor (env surface) correlated with weather condition. This was particularly evident in unusual conditions, such as rain resulting in a wet road surface; (2) the number of traffic lanes correlated with median type. This relationship is reasonable, as roads in Thailand typically have four or more traffic lanes, and their median types usually include a depressed median and a barrier. Some pairs exhibited no relationship, such as driver age and road slope, or driver gender and road surface type.
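As an illustration of how one entry of such a pairwise matrix can be obtained, the sketch below computes a Pearson correlation coefficient between two indicator variables. The choice of Pearson's coefficient is an assumption (the text does not state which coefficient Table 4 reports), and the `rain` and `wet_surface` values are invented, not DOH records.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


# Hypothetical indicators for eight crashes (1 = yes, 0 = no).
rain = [0, 0, 1, 1, 0, 1, 0, 1]
wet_surface = [0, 0, 1, 1, 0, 1, 1, 1]

r = pearson_r(rain, wet_surface)
print(f"r = {r:.3f}")
```

A value close to 1 for such a pair would reflect the kind of redundancy between weather and surface condition that the paragraph describes.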
3.1. Variable Setting
The dependent variables were coded as categorical values, such as fatal = 1, nonfatal = 0. The independent variables were of two types: (1) categorical variables, whose values were coded numerically according to the characteristics of the variable, for example, gender (0 = male, 1 = female), vehicle type (1 = small vehicle, i.e., motorcycle; 2 = medium vehicle, i.e., car or pickup truck; 3 = large vehicle, i.e., six-wheel truck), and crash type (1 = going straight, 2 = decelerating, 3 = stopped), and (2) continuous variables, such as number of lanes (2, 3, 4, ...). “PerCTruck” was the proportion of trucks in the traffic volume.
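The coding scheme above can be sketched as a set of lookup tables. The numeric codes are those given in the text; the dictionary names, the record field names, and the example record are hypothetical and chosen only for illustration.

```python
# Numeric codes as described in the text.
GENDER = {"male": 0, "female": 1}
VEHICLE_TYPE = {"small": 1, "medium": 2, "large": 3}
CRASH_TYPE = {"going_straight": 1, "decelerating": 2, "stopped": 3}
SEVERITY = {"nonfatal": 0, "fatal": 1}


def encode_record(record):
    """Convert one raw crash record (hypothetical field names) to numeric codes."""
    return {
        "gender": GENDER[record["gender"]],
        "veh_type": VEHICLE_TYPE[record["veh_type"]],
        "crash_type": CRASH_TYPE[record["crash_type"]],
        "num_lanes": int(record["num_lanes"]),  # continuous variable, kept as a number
        "fatal": SEVERITY[record["severity"]],  # dependent variable
    }


# Invented example record.
encoded = encode_record(
    {
        "gender": "female",
        "veh_type": "medium",
        "crash_type": "stopped",
        "num_lanes": "4",
        "severity": "nonfatal",
    }
)
print(encoded)
```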
3.2. Classification Tree and Building Model
This study used a decision tree or classification tree (DT) model for rear-end crash data analysis, which started by determining the target variables (dependent variables). Two models were constructed. Model#1 analyzed at-fault/not-at-fault drivers. For this variable, driver factors were taken only from the first and second vehicles, as the first vehicle could be clearly identified as accident-prone. Therefore, 4,192 vehicles (2,096 rear-end crashes) were analyzed in the model. Model#2 analyzed the factors resulting in fatal and nonfatal rear-end collisions. Its data therefore included all vehicles (two or more) involved in each rear-end crash: a total of 4,554 vehicles involved in the 2,096 crashes.
The DT model consists of three components: decision nodes, branches, and leaf nodes. Within the DT structure, each decision node displays a variable, each branch displays one value of that variable based on the decision rules, and the leaf nodes exhibit the expected values of the target variable.
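The three components can be made concrete with a small data structure. The sketch below is a generic binary tree, not the fitted model: the `Node` class, the `predict` function, the variable names, the thresholds, and the labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A decision node (feature and threshold set) or a leaf node (label set)."""
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None   # branch taken when value < threshold
    right: Optional["Node"] = None  # branch taken when value >= threshold
    label: Optional[str] = None     # set only on leaf nodes


def predict(node, record):
    """Follow branches from the root until a leaf is reached, then return its label."""
    while node.label is None:
        if record[node.feature] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Hypothetical tree: the root splits on age, the right child splits on lanes.
tree = Node(
    feature="age",
    threshold=30,
    left=Node(label="class_a"),
    right=Node(
        feature="lanes",
        threshold=4,
        left=Node(label="class_b"),
        right=Node(label="class_a"),
    ),
)

print(predict(tree, {"age": 45, "lanes": 2}))
```

Each root-to-leaf path reads as one if-then rule, for example "if age >= 30 and lanes < 4 then class_b" in this invented tree.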
SPSS was used to conduct the analysis. To create the DT, the full dataset at the root node, described by the proportions of the values of the target variable, was split into a number of smaller subsets. Several growing methods available in SPSS can be used to carry out splitting and growing, including CHAID, CRT (the SPSS name for CART), and QUEST. Each has advantages and disadvantages. This study chose CRT for two reasons. First, CRT performs binary node splitting, which suits the interpretation of accident data analysis results. Second, CRT can rank the influence of variables. This research sought to express the relationship between the target variable and the other variables as a ranking of each independent (predictor) variable according to its importance to the model. A great deal of previous research has used CRT to analyze accident data [36–38], as CRT focuses on maximizing within-node homogeneity. The extent to which a node does not represent a homogeneous subset of cases is an indication of impurity.
Choosing the correct splitting criterion is also important. SPSS CRT offers two types of splitting, Gini and Twoing. Gini, which is widely used, finds splits that maximize the homogeneity of child nodes with respect to the values of the dependent variable. Gini is based on the squared probabilities of membership for each category of the dependent variable [35, 36, 39]. Splitting used unit misclassification costs, so that the model risk is the proportion of cases whose predicted category differs from the observed category.
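The Gini measure can be sketched in a few lines. The code below uses the standard definition, one minus the sum of squared class proportions, and applies it to the fatal/nonfatal vehicle counts reported in the descriptive statistics (1,156 and 3,396). The candidate split is hypothetical and serves only to show how the impurity reduction of a split is evaluated; function names are assumptions.

```python
def gini(counts):
    """Gini impurity of a node given the case count in each category."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(parent, left, right):
    """Impurity reduction achieved by splitting parent into left and right."""
    n = sum(parent)
    weighted_children = (
        sum(left) / n * gini(left) + sum(right) / n * gini(right)
    )
    return gini(parent) - weighted_children


# Root node: [fatal, nonfatal] vehicle counts reported in the text.
root = [1156, 3396]
root_impurity = gini(root)

# Hypothetical binary split of the root (counts invented, sums match the root).
left_child = [600, 900]
right_child = [556, 2496]
gain = gini_gain(root, left_child, right_child)

print(f"root impurity = {root_impurity:.4f}, gain = {gain:.4f}")
```

A pure node has impurity 0 and an evenly mixed two-class node has impurity 0.5; the growing procedure prefers the split with the largest gain.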
To determine the optimal tree model, ten-fold cross-validation, one of several cross-validation techniques, was used to select an appropriate tree size. To avoid over-fitting the model, the maximum tree depth was set to five levels, the minimum number of cases in a parent node to 150, and the minimum number of cases in a child node to 75.
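These growth limits and the fold construction can be sketched as follows. The three limits are the values stated above; the function names, the depth convention (root at depth 0), and the contiguous fold assignment are assumptions, and SPSS's internal fold assignment may differ.

```python
MAX_DEPTH = 5     # maximum tree depth (levels below the root)
MIN_PARENT = 150  # minimum cases in a node for it to be split
MIN_CHILD = 75    # minimum cases in each resulting child node


def split_allowed(depth, n_parent, n_left, n_right):
    """Check the pre-pruning limits for splitting a node at the given depth."""
    return (
        depth < MAX_DEPTH
        and n_parent >= MIN_PARENT
        and n_left >= MIN_CHILD
        and n_right >= MIN_CHILD
    )


def k_fold_indices(n_cases, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_cases // k + (1 if i < n_cases % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_cases))
        yield train, test
        start += size


# Ten folds over the 4,192 vehicles analyzed in Model#1.
folds = list(k_fold_indices(4192, k=10))
print(len(folds), len(folds[0][1]), len(folds[-1][1]))
print(split_allowed(depth=2, n_parent=160, n_left=80, n_right=80))
```

In practice the cases would usually be shuffled before being assigned to folds; the contiguous assignment here keeps the sketch short.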
4. Results and Discussion
In the CART results for the two models, predictive accuracy based on misclassification costs (Table 5) showed overall correctness of 52.9% for Model#1 and 65.1% for Model#2. Although these values are low, Khan et al. and Kashani and Mohaymany confirm that such values can be accepted and interpreted.
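Overall correctness is the share of cases on the diagonal of the classification table. The sketch below shows the calculation on an invented two-class table; the counts are not those of Table 5, and the function name is an assumption.

```python
def overall_correctness(confusion):
    """Share of cases whose predicted category equals the observed category.

    confusion[observed][predicted] holds the number of cases.
    """
    correct = sum(confusion[category][category] for category in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


# Hypothetical classification table (observed -> predicted -> count).
confusion = {
    "fatal": {"fatal": 30, "nonfatal": 70},
    "nonfatal": {"fatal": 35, "nonfatal": 165},
}

accuracy = overall_correctness(confusion)
risk = 1.0 - accuracy  # with unit misclassification costs
print(f"overall correctness = {accuracy:.1%}, risk = {risk:.1%}")
```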
Model#1 (Figure 3) found six major variables related to the target variable. The most significant variable was driver's age (person_age). Drivers aged less than 21 years were at-fault drivers in 57.3% of accidents. This may be because younger drivers are less careful. Ma and Yan and Chandraratna and Stamatiadis found that young drivers are more likely to be at fault than middle-aged drivers. Those aged over 21 years were at fault in only 48.9% of accidents. The next significant variable was the number of lanes: if a driver aged more than 21 years drives on a road with 10 or more traffic lanes (reported as 10 lanes because there were no observations with seven lanes), the chance of being the at-fault driver is 61.7%. This is consistent with research by Pande et al. It may be because roads with many lanes provide greater opportunities for speeding and vehicles are often parked on the roadside. Some less observant older drivers may be at fault for rear-end collisions. For accidents occurring at a median opening (median_opening) on a road with fewer than 10 traffic lanes, drivers older than 21 years were more likely to be at-fault. At median openings, front car drivers are more likely to reduce speed in order to turn or execute a u-turn. If the car behind is too close, the chances of a rear-end collision are high. The division of drivers into those younger and those older than 25 years has not previously been investigated. This research found that drivers in both of these age ranges tend to be not-at-fault drivers. When drivers aged over 25 years are considered together with median type, they are more often at fault when driving on unoccupied streets with a raised or flush median than on roads with barriers or depressed medians. The median categories behind this result were raised median, no median, and painted median.
In Thailand, most of these median types are used on roads with low traffic flow, such as in residential areas or on urban streets. Therefore, when drivers follow too closely, there is a chance of rear-end collision. This is consistent with research conducted by Joon-Ki et al. and Baldock et al., who concluded that spacing on low-speed roads is a major cause of rear-end collisions. However, a study by Das and Abdel-Aty indicated that median type had no effect on the frequency of rear-end collisions.
Overall policy and public relations, therefore, should promote the reduction of rear-end collisions in the following ways. Driver training should place special emphasis on drivers under 21 years of age, focusing on driving at the legal speed limit and maintaining an appropriate distance from the vehicle in front. Drivers aged 21 years and older should pay special attention on roads with 10 or more lanes, and take greater care at median openings on roads with fewer lanes. In other words, drivers should observe whether the car in front is executing a u-turn. Drivers aged 25 years or older should take special care on roads with no median, a raised median, or a painted (flush) median, and they should maintain a greater distance from the car in front.
The results of Model#2 (Figure 4) reveal 14 variables essential to fatal/nonfatal crashes. The most significant variable was safety equipment (SafertEqui), such as seatbelts or helmets. Those who did not use safety equipment had a 29.5% risk of dying in a rear-end collision. This is consistent with other research which has found that the use of safety equipment can reduce accident severity [21, 41]. The next most significant variable was visibility, with a rear-end crash at night with no light having a 49% risk of fatality. This result supports the findings of Sullivan and Flannagan and Chen et al. Driving in low light leads to rear-end crashes into vehicles parked along roadsides. In addition, lighter night-time traffic leads drivers to travel at higher speeds, which in turn causes more fatalities because of the high impact velocity. With sufficient light (in the daytime, or at night with lighting), the chances of rear-end crashes are high on roads with 2–8 traffic lanes on which a large number of trucks are parked. Moreover, the second variable, vehicle type (Veh_type), shows that large cars and trucks with six wheels or more result in 39.7% of deaths. This is consistent with the findings of Chen et al. and Chang and Chien, who found that the chances of fatality while decelerating and going straight were 53.1% (60/113 crashes). Large vehicles which hit small vehicles on 2–4 lane roads have a high chance of fatality due to the vehicle body size factor. With regard to other crash types, the stopped crash type has a 33% chance of fatality (80/240 crashes). In other words, rear-end crashes that occur when the front car is stopping have a high fatality rate. With regard to medium and small vehicles, the chances of fatal crashes are high when the driver is aged more than 36 years (31.4%).
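The two node percentages that are quoted together with their counts can be checked directly. The short sketch below recomputes them from the counts given in the text (60 of 113 and 80 of 240); the function and variable names are illustrative.

```python
def node_share(target_cases, node_cases):
    """Percentage of cases in a node that belong to the target category."""
    return 100.0 * target_cases / node_cases


# Counts as quoted in the text.
decelerating_or_straight = round(node_share(60, 113), 1)
stopped = round(node_share(80, 240), 1)

print(decelerating_or_straight, stopped)
```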
For drivers who use seatbelts, the second variable showed that raised and flush medians led to a higher chance of fatality than other median types, as these two types exist in areas of low-speed driving. If drivers violate the rules, the chances of rear-end collisions are very high. For example, roads with a flush median usually have no auxiliary lane to separate turning cars. Therefore, if a speeding car comes from behind, the resulting rear-end crash will be severe. This is consistent with the second variable, median opening, where there is a 48.8% probability of death. For other median types, roads with two to four traffic lanes had 16.2% fatal rear-end crashes. At the leaf node, the result for envi_light accords with Chen et al., who found that collisions occurring at night, both with light and with no light, have a greater chance of fatality than collisions occurring during the daytime.
Policy recommendations to reduce fatalities from rear-end collisions are as follows: promoting awareness of seatbelt use through the driving license test, and enforcing the law more strictly. For light conditions affecting visibility, drivers must be made aware of the danger of driving on roads with no lights, especially at night. Relevant authorities should consider installing more lighting on roads where the risk of rear-end collision is high. With regard to vehicle type, truckers must be made more aware of the danger of parking their vehicles on roads with a high risk of rear-end collision, such as where there are no parking lanes and no light. In addition, the relevant departments, such as the DOH, should consider setting up illuminated roadside rest stops for trucks.
4.3. Discussion of the Two Models
Considering the overall picture of the two models, similar variables affect both the frequency of rear-end collisions and fatalities. The first variable is a small number of lanes (2–4 traffic lanes), which is common in Thailand. Here the results of the models differed: Model#1 found fewer at-fault drivers on roads with a small number of traffic lanes, while Model#2 found a high chance of fatalities. Future research should analyze how different numbers of traffic lanes affect the frequency and severity of rear-end collisions. Another variable that was significant in both models was median type. Barrier and depressed median types result in a small number of rear-end collisions and a low fatal crash rate. Therefore, when units under the DOH make road improvements, these two median types should be considered. With regard to median openings, both models found that rear-end collisions occurring at a median opening had a high incidence of at-fault drivers and caused a high proportion of fatal crashes, as the front vehicle decelerated or executed a u-turn. In these conditions, there is a high probability of a rear-end collision. In the case of fatal crashes, if the following vehicle has not seen the turn signal, a serious rear-end collision will occur.
This research sought to explore two issues related to rear-end crashes. The first was to find the factors that increase the number of rear-end collisions, which was achieved by focusing on the driver and environmental characteristics that cause rear-end collisions. The second was to find the factors causing fatal rear-end collisions. Using highway rear-end collision data from 2011 to 2015, nonparametric analysis was conducted on the significance of the variables that affect the target variables, using an overview of factors including drivers, the driving environment, and physical road characteristics. The model results were found to predict rear-end collisions and fatalities with acceptable accuracy. The identified factors can contribute to a reduction in the number of at-fault drivers and in the fatality rate of rear-end collisions.
The factors acquired from this analysis can be used to develop transportation office and rural road office policy and public relations practices, in order to reduce the number and severity of rear-end collisions.
It is recommended that future research use both parametric and nonparametric analysis to compare these factors, in order to better understand the factors affecting crashes. In addition, further investigation is called for into how lane numbers, median type, and median opening affect the number of rear-end collisions and fatal crashes, as these three variables were important in both models.
The rear-end crash data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported by Thailand Science Research and Innovation (TSRI) under Grant number RSA6280061. We would like to thank Enago (https://www.enago.com) for the English language review.
Department of Highway, “Thailand Traffic Accident on National Highways in 2015,” 2016, http://bhs.doh.go.th/files/accident/58/report_accident2558.pdf, January 21, 2019.
Department of Highway, “Thailand Traffic Accident on National Highways in 2016,” 2017, http://bhs.doh.go.th/files/accident/59/report_accident2559.pdf, January 22, 2019.
Department of Highway, “Thailand Traffic Accident on National Highways in 2017,” 2018, http://bhs.doh.go.th/files/accident/60/report_accident2560.pdf, January 20, 2019.
B. R. A. Carr, “Statistical analysis of rural ontario traffic accidents using induced exposure data,” presented at the Proc. Symposium in the Use of Statistical Methods in the Analysis of Road Accidents, ODCE, Paris, France, 1970.
H. R. Taha and D. Vinayak, “Modelling likelihood of at-fault and not-at-fault carshare users,” presented at the International Conference Road Safety and Simulation, Rome, Italy, October 2013.
J. M. Sullivan and M. J. Flannagan, “Risk of fatal rear-end collisions: is there more to it than attention?” presented at the Proceeding of the Second International Driving Symposium on Human Factors in Driver Assessment, Training and Vehicle Design, 2003, http://drivingassessment.uiowa.edu/DA2003/pdf/55_Sullivanformat.pdf.
C. Wiacek, J. Bean, and D. Sharma, “Real-world analysis of fatal rear-end crashes,” 2015, https://www-esv.nhtsa.dot.gov/proceedings/24/files/24ESV-000270.PDF, February 15, 2019.
M. B. Anvari, A. Tavakoli Kashani, and R. Rabieyan, “Identifying the most important factors in the at-fault probability of motorcyclists by data mining, based on classification tree models,” International Journal of Civil Engineering, vol. 15, no. 4, pp. 653–662, 2017.
X. Li, X. Yan, J. Wu, E. Radwan, and Y. Zhang, “A rear-end collision risk assessment model based on drivers’ collision avoidance process under influences of cell phone use and gender-a driving simulator based study,” Accident Analysis & Prevention, vol. 97, pp. 1–18, 2016.
Z. Li, W. Wang, R. Chen, and P. Liu, “Conditional inference tree-based analysis of hazardous traffic conditions for rear-end and sideswipe collisions with implications for control strategies on freeways,” IET Intelligent Transport Systems, vol. 8, no. 6, pp. 509–518, 2014.
S. Nikiforos, Quasi-Induced Exposure: Issue and Validation (Transportation Accident Analysis and Prevention), Nova Science Publishers Inc, Hauppauge, NY, USA, 2008.
X. Yan and E. Radwan, “Analysis of truck-involved rear-end crashes using multinomial logistic regression,” 2009, http://www.atsinternationaljournal.com/index.php/2009-issues/xvii-april-2009/433-analysis-of-truck-involved-rear-end-crashes-using-multinomial-logistic-regression, January 20, 2019.
L. J. Muhammad, S. Salisu, A. Yakubu et al., “Using decision tree data mining algorithm to predict causes of road traffic accidents, its prone locations and time along kano–wudil highway,” International Journal of Database Theory and Application, vol. 10, no. 1, pp. 197–206, 2017.
O. A. Akanbi, I. S. Amiri, and E. Fazeldehkordi, “Chapter 3 - research methodology,” in A Machine-Learning Approach to Phishing Detection and Defense, O. A. Akanbi, I. S. Amiri, and E. Fazeldehkordi, Eds., pp. 35–43, Syngress, Boston, 2015.
IBM, “IBM SPSS Decision Trees 21,” 2012, https://www.sussex.ac.uk/its/pdfs/SPSS_Decision_Trees_21.pdf, January 10, 2019.
K. Joon-Ki, W. Yinhai, and F. U. Gudmundur, “Modeling the probability of freeway rear-end crash occurrence,” Journal of Transport Engineering, vol. 113, no. 1, pp. 11–19, 2007.